Introduction
As promised in my previous article, this time, I will implement Deep Q-learning (DQN) and Deep SARSA to train an agent to play the Mountain Car game and compare their performance to that of vanilla Q-learning and SARSA.
Deep Q-learning (DQN)
The DQN algorithm is very similar to Q-learning. The main difference is that instead of storing the Q-value of every state-action pair in a table, we use a neural network to approximate the Q-values.
Let’s compare the input and output of vanilla Q-learning vs. DQN:
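In code terms, the difference looks roughly like this. This is a minimal sketch, not the code from either article: it assumes the tabular agent discretizes the (position, velocity) state into bins, and it uses a single hand-written linear layer as a stand-in for the neural network so that it runs without any dependencies. The names `discretize`, `q_table`, `q_lookup` and `q_network` are hypothetical.

```python
import random

N_ACTIONS = 3  # Mountain Car: push left, no push, push right

# Tabular Q-learning: the input is a (discrete state, action) pair,
# the output is a single Q-value.
def discretize(state, bins=10):
    """Map a continuous (position, velocity) state to bin indices (assumed scheme)."""
    position, velocity = state
    pos_bin = min(bins - 1, int((position + 1.2) / 1.8 * bins))
    vel_bin = min(bins - 1, int((velocity + 0.07) / 0.14 * bins))
    return pos_bin, vel_bin

q_table = {}  # {(state_bins, action): q_value}, filled in as the agent learns

def q_lookup(state, action):
    """Return the stored Q-value, or 0.0 for a pair that was never visited."""
    return q_table.get((discretize(state), action), 0.0)

# DQN: the input is the raw continuous state,
# the output is one Q-value per action.
random.seed(0)
weights = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(N_ACTIONS)]

def q_network(state):
    """Stand-in for the policy network: one linear layer, state -> Q-values."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]

state = (-0.5, 0.0)
single_q = q_lookup(state, action=2)  # one number for one (state, action) pair
all_q = q_network(state)              # one number per action
best_action = max(range(N_ACTIONS), key=lambda a: all_q[a])
```

The table has to be indexed once per action to find the best move, while the network returns the Q-values of all actions in a single forward pass.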
The DQN algorithm is as follows:
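In the standard formulation, the core of each training step can be summarized by the target value and the loss below. The notation is an assumption made for this summary: $Q_\theta$ is the policy network, $Q_{\theta^-}$ is the target network, $\gamma$ is the discount factor, and $(s, a, r, s')$ is one transition sampled from the replay memory.

```latex
y =
\begin{cases}
r, & \text{if } s' \text{ is terminal} \\
r + \gamma \max_{a'} Q_{\theta^-}(s', a'), & \text{otherwise}
\end{cases}
\qquad
L(\theta) = \bigl( y - Q_\theta(s, a) \bigr)^2
```

The weights $\theta^-$ are held fixed between periodic copies from $\theta$, so the target $y$ does not shift on every gradient step.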
Note that we store each experience, a (state, action, reward, next state, terminal) tuple, in a ‘replay memory’, but train the policy network on only a random sample of these experiences at each step. Sampling at random reduces the correlation between consecutive training samples.
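The replay memory can be sketched with the standard library alone. The sizes below are hypothetical and chosen only to keep the example small, and `sample_batch` is an illustrative helper rather than a function from the implementation that follows.

```python
import random
from collections import deque

# Hypothetical sizes, chosen only for illustration
MAX_MEMORY_LEN = 1000
BATCH_SIZE = 4

# A bounded deque drops the oldest experience once it is full
memory = deque(maxlen=MAX_MEMORY_LEN)

def add_memory(state, action, reward, next_state, terminal):
    """Store one experience tuple."""
    memory.append((state, action, reward, next_state, terminal))

def sample_batch(batch_size):
    """Return a random batch, or None until enough experiences are stored."""
    if len(memory) < batch_size:
        return None
    return random.sample(memory, batch_size)

# Fill the memory with dummy consecutive transitions: state t -> state t + 1
for t in range(10):
    add_memory(state=t, action=0, reward=-1, next_state=t + 1, terminal=False)

batch = sample_batch(BATCH_SIZE)
```

Because `random.sample` draws without replacement from the whole memory, a batch mixes experiences from different points in time instead of containing a run of consecutive steps.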
Python implementation
To implement DQN in Python, I used the following dependencies:
numpy==1.20.1
torch==1.7.1+cpu
gym==0.18.0
collections and random (both part of the Python standard library)
In this implementation, I use a very simple network with the following structure:
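import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

def NN(space_dim, n_actions, out_feature=24):
    '''Two hidden layers; outputs one Q-value per action'''
    return nn.Sequential(
        nn.Linear(space_dim, out_feature),
        nn.ReLU(),
        nn.Linear(out_feature, out_feature),
        nn.ReLU(),
        nn.Linear(out_feature, n_actions))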
My assumption is that since the vanilla Q-learning agent in the previous article already did a good job playing Mountain Car, a simple neural network is enough to boost its performance. My DQN agent is as follows:
class DQN:
    def __init__(self, env):
        # define parameters
        self.env = env
        self.epsilon = max_epsilon
        self.space_dim = self.env.observation_space.shape[0]
        self.n_actions = self.env.action_space.n
        # initialize replay memory with a fixed capacity
        self.memory = deque(maxlen=max_memory_len)
        # define policy & target networks
        self.policy = NN(self.space_dim, self.n_actions)
        self.target = NN(self.space_dim, self.n_actions)
        self.target.load_state_dict(self.policy.state_dict())
        self.target.eval()
        # loss function & optimizer for the policy network
        self.metric = nn.MSELoss()
        self.optimizer = optim.Adam(self.policy.parameters())

    def state_to_float(self, state):
        '''Convert a state to a float tensor of shape (1, space_dim)'''
        return torch.from_numpy(np.reshape(state, [1, self.env.observation_space.shape[0]])).float()

    def add_memory(self, state, action, reward, next_state, terminal):
        '''Add new experience to memory'''
        self.memory.append((state, action, reward, next_state, terminal))

    def choose_action(self, state):
        '''Choose an action based on epsilon-greedy strategy'''
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.target(state).detach().numpy()[0])
        return action

    def epsilon_update(self):
        '''Decrease epsilon iteratively'''
        if self.epsilon > min_epsilon:
            self.epsilon *= epsilon_decay
        else:
            self.epsilon = min_epsilon
        return self.epsilon

    def target_update(self, cur_ep):
        '''Clone policy network weights to target network after a few time steps'''
        if cur_ep % update_frequency == 0:
            self.target.load_state_dict(self.policy.state_dict())

    def replay_memory(self):
        # randomly select n experiences from memory to train policy network
        # (skip training until memory holds at least one full batch)
        if len(self.memory) < capacity:
            return
        batch = random.sample(self.memory, capacity)
        # update q value
        for state, action, reward, next_state, terminal in batch:
            if terminal and next_state[0][0].item() >= self.env.goal_position:
                q = reward
            else:
                q = reward + gamma * self.target(next_state).max(axis=1)[0]
            # calculate new q-value
            q_val = self.target(state)
            q_val[0][action] = q
            # compute loss between output Q-value and target Q-value
            loss = self.metric(self.policy(state), q_val)
            # gradient descent
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
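The class reads several module-level hyperparameters (`max_epsilon`, `min_epsilon`, `epsilon_decay`, `gamma`, `capacity`, `max_memory_len` and `update_frequency`) whose values are not listed in this article. The sketch below fills them in with placeholder values, which are assumptions and not necessarily the settings behind the results reported later, and replays the `epsilon_update` rule on its own to show how quickly exploration fades under those values.

```python
# Placeholder values: assumptions for illustration, not the article's settings
max_epsilon = 1.0        # initial exploration rate
min_epsilon = 0.01       # floor for the exploration rate
epsilon_decay = 0.995    # multiplicative decay applied once per game
gamma = 0.99             # discount factor
capacity = 32            # number of experiences sampled per training step
max_memory_len = 10000   # maximum size of the replay memory
update_frequency = 10    # steps between target-network syncs

def epsilon_update(epsilon):
    """Same rule as DQN.epsilon_update, written as a pure function."""
    if epsilon > min_epsilon:
        return epsilon * epsilon_decay
    return min_epsilon

# Count how many games it takes for epsilon to decay down to the floor
epsilon = max_epsilon
games_until_floor = 0
while epsilon > min_epsilon:
    epsilon = epsilon_update(epsilon)
    games_until_floor += 1

print(f'epsilon reaches the floor after {games_until_floor} games')
```

With a slower decay the agent keeps exploring for more games, and with a faster one it commits to the network's greedy choices earlier, so this schedule is worth checking against the number of games being played.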
Finally, I create a function to train the agent:
def game_play(env, n_trials):
    '''Let the DQN agent play Mountain Car'''
    # list to store the total reward of each game
    reward_list = []
    # create new DQN object
    dqn = DQN(env)
    for i in range(n_trials):
        state = env.reset()
        state = dqn.state_to_float(state)
        terminal = False
        cur_reward = 0
        while not terminal:
            # render the last 5 episodes
            if i >= (n_trials - 5):
                env.render()
            # choose action
            action = dqn.choose_action(state)
            # find next state & reward
            next_state, reward, terminal, _ = env.step(action)
            next_state = dqn.state_to_float(next_state)
            # append to memory
            dqn.add_memory(state, action, reward, next_state, terminal)
            state = next_state
            if terminal:
                break
            cur_reward += reward
            dqn.replay_memory()
            # update target network
            dqn.target_update(abs(cur_reward))
        reward_list.append(cur_reward)
        # update epsilon
        dqn.epsilon = dqn.epsilon_update()
        if i % 100 == 0:
            print(f'Game {i}; reward {cur_reward}; epsilon {dqn.epsilon}')
    env.close()
    return reward_list
Result
Contrary to my prediction, the DQN agent does not do a good job of playing Mountain Car even after thousands of games. As shown in the picture below, the reward was stuck at -199 the entire time.
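The value -199 can be traced to the training loop itself. Assuming the environment is gym's `MountainCar-v0`, which gives a reward of -1 per step and cuts an episode off after 200 steps, an agent that never reaches the flag plays all 200 steps, and because `game_play` breaks out of the loop before adding the reward of the final step, it records -199. The sketch below reproduces that bookkeeping with a hypothetical stub environment, `NeverSolvedEnv`, in place of gym.

```python
class NeverSolvedEnv:
    """Hypothetical stand-in for MountainCar-v0 when the car never reaches the flag."""
    MAX_STEPS = 200  # episode step limit of MountainCar-v0

    def reset(self):
        self.steps = 0
        return (-0.5, 0.0)

    def step(self, action):
        self.steps += 1
        # the episode ends only because the time limit is hit
        terminal = self.steps >= self.MAX_STEPS
        return (-0.5, 0.0), -1.0, terminal, {}

def play_one_game(env):
    """Same reward bookkeeping as game_play, without the agent."""
    env.reset()
    terminal = False
    cur_reward = 0
    while not terminal:
        _, reward, terminal, _ = env.step(0)
        if terminal:
            break  # the reward of the final step is never added
        cur_reward += reward
    return cur_reward

total = play_one_game(NeverSolvedEnv())
```

In other words, a reward of -199 in every game means the car timed out in every game.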
I have come to the conclusion that either DQN is not suitable for the Mountain Car game, or there is a bug in my current model that prevents it from learning. Thus, my future work includes two tasks: reviewing the current code to identify bugs, and implementing DQN with Keras to see if I still get the same result.
Sources:
Choudhary, A. (2019). A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python. Analytics Vidhya. Retrieved from https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
Deep Lizard. (n.d.). Build Deep Q-Network — Reinforcement Learning Code Project. Retrieved from https://deeplizard.com/learn/video/PyQNfsGUnQA
Mathew, J. (2020). PyTorch Artificial Intelligence Fundamentals. Packt Publishing. ISBN: 9781838557041. Retrieved from https://learning.oreilly.com/library/view/pytorch-artificial-intelligence/9781838557041/
Paszke, A. (n.d.). Reinforcement Learning (DQN) Tutorial. PyTorch. Retrieved from https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html