SkyPanther: /* 3. Train the Q-Network */

2025-02-08T22:52:57Z

3. Train the Q-Network

← Older revision		Revision as of 22:52, 8 February 2025
Line 5,237:		Line 5,237:
	* Use '''target network''':		* Use '''target network''':
	** Maintain a separate Q-network for stable training.		** Maintain a separate Q-network for stable training.
	** Update the target network periodically.		** Update the target network periodically (less frequently than the primary Q-network).
	* Use the '''Mean Squared Error (MSE)''' loss function with '''Adam optimizer'''.		* Use the '''Mean Squared Error (MSE)''' loss function with '''Adam optimizer'''.
	----episodes = 100		----episodes = 100

SkyPanther: /* 5. Evaluate and Optimize */

2025-02-08T22:32:34Z

5. Evaluate and Optimize


SkyPanther: /* Q-Learning with Keras */

2025-02-08T22:06:31Z

Q-Learning with Keras


SkyPanther at 20:22, 8 February 2025

2025-02-08T20:22:23Z

← Older revision		Revision as of 20:22, 8 February 2025
Line 5,119:		Line 5,119:

	✅ '''Trains the Student Model using the Teacher Model’s logits'''		✅ '''Trains the Student Model using the Teacher Model’s logits'''
			----

			== Q-Learning with Keras ==
			'''Q-Learning''' is a '''value-based reinforcement learning''' algorithm that enables an agent to learn an '''optimal policy''' through interaction with its environment. The objective is to '''maximize cumulative reward''' over time by updating a '''Q-value function''', which estimates the expected cumulative discounted reward for taking a given action in a particular state and acting optimally afterwards.

			It maintains and updates a '''Q-table''' using the '''Bellman equation''':
			[[File:Bellman Equation.png|thumb]]


			where:

			* Q(s,a) is the current Q-value for state s and action a,
			* α (learning rate) controls how much new information overrides the old,
			* r is the immediate reward,
			* s′ (next state) results from taking action a,
			* a′ (next action) is chosen in s′,
			* γ (discount factor) determines the importance of future rewards,
			* max<sub>a′</sub> Q(s′,a′) is the highest estimated Q-value over all actions a′ in the next state.
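With the symbols listed above, the standard tabular Q-learning update can be written out as follows. This is assumed to be what the image shows; the image itself is not reproduced here.

<math>Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]</math>

The bracketed term is the temporal-difference error: the gap between the bootstrapped target r + γ max<sub>a′</sub> Q(s′,a′) and the current estimate Q(s,a).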

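			Through repeated interactions, tabular Q-learning '''converges to an optimal policy''', provided every state-action pair keeps being visited and the learning rate is decayed appropriately. The agent can then select the actions that maximize long-term reward.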
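A minimal sketch of one tabular update may make the rule concrete. The function name <code>q_update</code>, the list-of-lists table, and the toy sizes (3 states, 2 actions) are illustrative assumptions, not part of the original notes.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q_table[next_state])      # max over a' of Q(s', a')
    td_target = reward + gamma * best_next    # r + gamma * max_a' Q(s', a')
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Hypothetical toy problem: 3 states, 2 actions, all Q-values start at zero.
q_table = [[0.0, 0.0] for _ in range(3)]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=2, alpha=0.5, gamma=0.9)
# new_value is 0.5: the old value 0 moved halfway towards the target 1.0
```

With a neural network in place of the table, the same target is used as the regression label instead of being written into a table cell.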
			----

			=== Steps to implement Q-learning with Keras ===

			==== '''1. Initialize the Environment and Parameters''' ====

			* Use a platform like '''OpenAI Gym (CartPole)''' to define the environment.
			* Initialize the '''Q-network''' (neural network) instead of a traditional Q-table.
			* Set key '''hyperparameters''':
			** '''Learning rate (α):''' Determines how much new information overrides old values.
			** '''Discount factor (γ):''' Balances immediate vs. future rewards.
			** '''Exploration rate (ε):''' Controls the trade-off between exploration (random actions) and exploitation (choosing the best-known action).
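One way to keep these settings together is a small configuration object. The sketch below is only an illustration: the class name <code>DQNConfig</code>, the numeric defaults, and the extra fields <code>epsilon_min</code> and <code>epsilon_decay</code> are assumptions rather than values from these notes. The state and action sizes match CartPole, which has a 4-component observation and 2 discrete actions.

```python
from dataclasses import dataclass


@dataclass
class DQNConfig:
    """Illustrative hyperparameters; every default here is an assumed value."""
    learning_rate: float = 0.001   # alpha: step size used by the optimizer
    gamma: float = 0.99            # discount factor for future rewards
    epsilon: float = 1.0           # initial exploration rate (fully random)
    epsilon_min: float = 0.01      # assumed floor for exploration
    epsilon_decay: float = 0.995   # assumed multiplicative decay per episode
    state_size: int = 4            # CartPole observation has 4 components
    action_size: int = 2           # CartPole has 2 discrete actions


config = DQNConfig()
# With Gymnasium installed, the environment itself would typically be
# created separately, e.g. with gym.make("CartPole-v1").
```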

			==== '''2. Build the Q-Network with Keras''' ====

			* Create a '''deep neural network''' to approximate Q-values.
			* Input: State representation.
			* Output: Q-values for all possible actions.
			* Use '''Dense layers''' with '''ReLU activation''', and an output layer with '''linear activation'''.
			** Input layer size = state size
			** Output layer size = action size
			** 2 to 3 hidden layers with ReLU activation
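The architecture described above could be sketched in Keras roughly as follows. The helper name <code>build_q_network</code>, the hidden width of 24 units, and the learning rate are assumptions chosen for illustration. The MSE loss and Adam optimizer follow the training step of these notes.

```python
from tensorflow import keras


def build_q_network(state_size, action_size, learning_rate=0.001):
    """Return a compiled MLP mapping a state vector to one Q-value per action."""
    model = keras.Sequential([
        keras.Input(shape=(state_size,)),                      # input size = state size
        keras.layers.Dense(24, activation="relu"),             # hidden layer 1 (width assumed)
        keras.layers.Dense(24, activation="relu"),             # hidden layer 2 (width assumed)
        keras.layers.Dense(action_size, activation="linear"),  # one Q-value per action
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
    )
    return model


# CartPole sizes: 4 state components, 2 actions.
q_network = build_q_network(state_size=4, action_size=2)
```

The output layer is linear because Q-values are unbounded real numbers, so a squashing activation would distort them.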

			==== '''3. Train the Q-Network''' ====

			* Get/initialize the state.
			* Select an action (ε-greedy):
			** With probability ε, select a random action (exploration).
			** Otherwise, select the action with the highest predicted Q-value (exploitation).
			* Take the action:
			** Execute the chosen action in the environment and observe the reward and next state.
			* Update the Q-values:
			** Use the Bellman equation to compute the target Q-value.
			** Train the Q-network to minimize the difference between the predicted and target Q-values.
			* Repeat:
			** Reduce the exploration rate (ε) over time to shift from exploration to exploitation.
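The action-selection and decay steps can be sketched without any framework. The function names, the multiplicative decay schedule, and its constants are assumptions. In practice <code>q_values</code> would be the Q-network's prediction for the current state.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Multiplicatively shrink epsilon, never going below epsilon_min."""
    return max(epsilon_min, epsilon * decay)


# With epsilon = 0 the choice is always greedy: index 1 has the largest value.
greedy_action = select_action([0.1, 0.7, 0.3], epsilon=0.0)
```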

			* Implement '''experience replay''':
			** Store agent experiences (s,a,r,s′) in a replay memory.
			** Train the model by sampling mini-batches from memory.
			* Use '''target network''':
			** Maintain a separate Q-network for stable training.
			** Update the target network periodically.
			* Use the '''Mean Squared Error (MSE)''' loss function with '''Adam optimizer'''.
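A replay memory along these lines can be sketched with a bounded deque. The class name <code>ReplayBuffer</code>, the capacity, and the extra <code>done</code> flag (marking terminal transitions) are assumptions beyond the (s,a,r,s′) tuple mentioned above.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity=2000):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # `done` is an assumed extra field marking terminal transitions.
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2)  # only the 3 most recent transitions remain
```

For the target network, a common Keras pattern is to copy weights at a fixed interval with <code>target_network.set_weights(q_network.get_weights())</code>. The interval is a tuning choice.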

			==== '''4. Implement the Q-Learning Algorithm''' ====

			# '''Initialize''' the environment.
			# '''For each episode:'''
			#* Reset the environment.
			#* For each step:
			#** Choose an action using '''ε-greedy policy'''.
			#** Execute the action and observe the '''reward''' and '''next state'''.
			#** Store the experience in '''memory'''.
			#** Sample a '''mini-batch''' from memory.
			#** Compute the '''target Q-value''' using the Bellman equation: <math>Q_{\text{target}}(s,a) = r + \gamma \max_{a'} Q(s',a')</math>
			#** Update the Q-network.
			#** Reduce '''ε (exploration rate)''' over time.
			# '''Periodically update the target network'''.
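The target computation for a sampled mini-batch can be sketched with NumPy. The function name <code>compute_targets</code> and the <code>dones</code> mask are assumptions: the mask drops the bootstrap term for terminal transitions, a common refinement that the steps above do not spell out. In a full agent, <code>next_q_values</code> would come from the target network's predictions for the next states.

```python
import numpy as np


def compute_targets(rewards, next_q_values, dones, gamma=0.99):
    """Return r + gamma * max_a' Q(s', a') per transition, with no bootstrap
    term where the episode ended (done == True)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    best_next = np.max(np.asarray(next_q_values, dtype=np.float64), axis=1)
    return rewards + gamma * best_next * (1.0 - dones)


targets = compute_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],  # stand-in for target-network output
    dones=[False, True],
    gamma=0.9,
)
# targets is approximately [2.8, 1.0]
```

Only the entry for the action actually taken is replaced by its target before fitting, so the MSE loss leaves the other actions' predictions unchanged.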

			==== '''5. Evaluate and Optimize''' ====

			* Run test episodes to measure performance.
			* Tune hyperparameters for better convergence.
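Evaluation can be sketched as averaging the return over a few test episodes with exploration switched off. Everything below is hypothetical scaffolding: <code>CountdownEnv</code> is a stand-in toy environment with a simplified interface (<code>reset()</code> returns a state, <code>step()</code> returns state, reward, done), not the Gym API. For a trained agent, <code>policy</code> would be the greedy action under the Q-network.

```python
def evaluate(env, policy, episodes=10, max_steps=500):
    """Return the average undiscounted return of `policy` over test episodes."""
    total = 0.0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            state, reward, done = env.step(policy(state))
            total += reward
            if done:
                break
    return total / episodes


class CountdownEnv:
    """Hypothetical stand-in environment: an episode lasts 5 steps, and
    action 1 earns a reward of 1.0 while action 0 earns nothing."""

    def reset(self):
        self.remaining = 5
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        return self.remaining, float(action == 1), self.remaining == 0


average_return = evaluate(CountdownEnv(), policy=lambda state: 1, episodes=3)
# average_return is 5.0: five steps per episode, reward 1.0 each
```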

			----