Playing Mountain Car with Deep Q-Learning

Ha Nguyen
Mar 13, 2021 · 4 min read


Introduction

As promised in my previous article, this time, I will implement Deep Q-learning (DQN) and Deep SARSA to train an agent to play the Mountain Car game and compare their performance to that of vanilla Q-learning and SARSA.

Deep Q-learning (DQN)

The DQN algorithm is largely the same as Q-learning. The key difference is that instead of maintaining a table that maps every state-action pair to its Q-value, we use a neural network to approximate the Q-function: the network takes a state as input and outputs one Q-value per action.

Let’s compare the input and output of vanilla Q-learning vs. DQN:

Q-learning vs. DQN architecture (Source: Choudhary, 2019)
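To make the contrast concrete, here is a minimal sketch (the layer sizes, state discretization, and example state are illustrative, not the exact setup used later in this post): tabular Q-learning indexes into a table, while DQN runs a forward pass through a network.

import numpy as np
import torch
import torch.nn as nn

# Tabular Q-learning: Q-values live in a table indexed by a discretized
# state and an action (Mountain Car has 3 actions).
n_states, n_actions = 40 * 40, 3
q_table = np.zeros((n_states, n_actions))
q_values_tabular = q_table[0]                # Q-values for one discrete state

# DQN: a network maps the raw continuous state directly to one Q-value
# per action, so no discretization is needed.
q_net = nn.Sequential(nn.Linear(2, 24), nn.ReLU(), nn.Linear(24, 3))
state = torch.tensor([[-0.5, 0.0]])          # (position, velocity)
q_values_dqn = q_net(state)                  # shape: [1, 3]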

The DQN algorithm is as follows:

Deep Q-Learning algorithm (Source: Deep Lizard, n.d.)

Note that we store full (state, action, reward, next state, terminal) transitions in a ‘replay memory’, but only sample a random mini-batch of them to train our policy network at each step. This random sampling breaks up the strong correlation between consecutive transitions.
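As a minimal illustration of that mechanism (the capacity and batch size here are placeholders, not the values used later):

from collections import deque
import random

# Replay memory: keep the most recent transitions, then train on a random
# mini-batch so consecutive, highly correlated steps are not used together.
memory = deque(maxlen=10_000)                    # placeholder capacity

# Dummy transitions (state, action, reward, next_state, terminal) for illustration.
for t in range(100):
    memory.append((t, 0, -1.0, t + 1, False))

batch = random.sample(memory, k=32)              # random, weakly correlated mini-batch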

Python implementation

To implement DQN in Python, I used the following dependencies:

numpy==1.20.1
torch==1.7.1+cpu
gym==0.18.0
collections (standard library)
random (standard library)
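The snippets that follow also reference a handful of module-level imports and hyperparameters (max_epsilon, epsilon_decay, gamma, and so on) that are not shown in this post. A plausible setup looks like the following; the specific values are placeholders of my own, not necessarily the ones used to produce the results below:

import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Hyperparameters referenced by the classes and functions below
# (placeholder values; tune as needed).
max_epsilon = 1.0        # initial exploration rate
min_epsilon = 0.01       # exploration floor
epsilon_decay = 0.995    # multiplicative decay applied after each game
max_memory_len = 10_000  # replay memory capacity
capacity = 32            # mini-batch size sampled from memory
update_frequency = 10    # how often the target network is synced
gamma = 0.99             # discount factor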

In this implementation, I use a very simple network with the following structure:

def NN(space_dim, n_actions, out_feature=24):
    return nn.Sequential(
        nn.Linear(space_dim, out_feature),
        nn.ReLU(),
        nn.Linear(out_feature, out_feature),
        nn.ReLU(),
        nn.Linear(out_feature, n_actions))
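As a quick sanity check (Mountain Car’s observation is a 2-dimensional (position, velocity) vector and there are 3 actions), the network can be exercised like this, assuming the imports above:

# Forward pass on a single made-up state; output is one Q-value per action.
net = NN(space_dim=2, n_actions=3)
q_values = net(torch.tensor([[-0.5, 0.0]]))
print(q_values.shape)    # torch.Size([1, 3])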

My assumption is that since the vanilla Q-learning agent in the previous article already played Mountain Car well, a simple neural network should be enough to match or improve on its performance. My DQN agent is as follows:

class DQN:

    def __init__(self, env):
        # define parameters
        self.env = env
        self.epsilon = max_epsilon
        self.space_dim = self.env.observation_space.shape[0]
        self.n_actions = self.env.action_space.n
        # initialize replay memory
        self.memory = deque(maxlen=max_memory_len)

        # define policy & target networks
        self.policy = NN(self.space_dim, self.n_actions)
        self.target = NN(self.space_dim, self.n_actions)
        self.target.load_state_dict(self.policy.state_dict())
        self.target.eval()
        # loss function & optimizer for the policy network
        self.metric = nn.MSELoss()
        self.optimizer = optim.Adam(self.policy.parameters())

    def state_to_float(self, state):
        '''Convert a state to a float tensor of shape [1, space_dim]'''
        return torch.from_numpy(np.reshape(state, [1, self.env.observation_space.shape[0]])).float()

    def add_memory(self, state, action, reward, next_state, terminal):
        '''Add a new experience to memory'''
        self.memory.append((state, action, reward, next_state, terminal))

    def choose_action(self, state):
        '''Choose an action based on the epsilon-greedy strategy'''
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.target(state).detach().numpy()[0])

        return action

    def epsilon_update(self):
        '''Decrease epsilon iteratively'''
        if self.epsilon > min_epsilon:
            self.epsilon *= epsilon_decay
        else:
            self.epsilon = min_epsilon

        return self.epsilon

    def target_update(self, cur_ep):
        '''Clone policy network weights to the target network every few updates'''
        if cur_ep % update_frequency == 0:
            self.target.load_state_dict(self.policy.state_dict())

    def replay_memory(self):
        # randomly select a mini-batch of experiences from memory to train the policy network
        if len(self.memory) < capacity:
            return

        batch = random.sample(self.memory, capacity)

        # update Q-values
        for state, action, reward, next_state, terminal in batch:
            if terminal and next_state[0][0].item() >= self.env.goal_position:
                q = reward
            else:
                q = reward + gamma * self.target(next_state).max(axis=1)[0]

            # build the target Q-values: only the taken action's value changes
            q_val = self.target(state)
            q_val[0][action] = q

            # compute loss between the policy network's output and the target Q-values
            loss = self.metric(self.policy(state), q_val)
            # gradient descent
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

Finally, I create a function to train the agent:

def game_play(env, n_trials):
    '''Let the DQN agent play Mountain Car'''

    # list to store the total reward collected in each game
    reward_list = []

    # create new DQN object
    dqn = DQN(env)
    for i in range(n_trials):
        state = env.reset()
        state = dqn.state_to_float(state)
        terminal = False
        cur_reward = 0

        while not terminal:
            # render the last 5 episodes
            if i >= (n_trials - 5):
                env.render()
            # choose action
            action = dqn.choose_action(state)

            # find next state & reward
            next_state, reward, terminal, _ = env.step(action)
            next_state = dqn.state_to_float(next_state)
            # append to memory
            dqn.add_memory(state, action, reward, next_state, terminal)
            state = next_state

            if terminal:
                break

            cur_reward += reward
            dqn.replay_memory()

        # update target network
        dqn.target_update(abs(cur_reward))

        reward_list.append(cur_reward)
        # update epsilon
        dqn.epsilon = dqn.epsilon_update()

        if i % 100 == 0:
            print(f'Game {i}; reward {cur_reward}; epsilon {dqn.epsilon}')
    env.close()

    return reward_list
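The training loop can then be kicked off along these lines (the trial count here is a placeholder; the results below come from running it for several thousand games):

if __name__ == '__main__':
    env = gym.make('MountainCar-v0')
    rewards = game_play(env, n_trials=2000)    # placeholder trial count
    print(f'Mean reward over the last 100 games: {np.mean(rewards[-100:]):.1f}')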

Result

Contrary to my prediction, the DQN agent fails to learn to play Mountain Car even after thousands of games. As shown in the picture below, the reward was stuck at -199 the entire time, meaning the car never reached the flag before the episode hit its step limit.

Mountain Car gameplay result of the DQN agent

I have come to the conclusion that either DQN is not well suited to the Mountain Car game, or there is a bug in my current model that prevents it from learning. Thus, my future work includes two tasks: reviewing the current code to identify bugs, and implementing DQN with Keras to see whether I get the same result.

Sources:

Choudhary, A. (2019). A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python. Analytics Vidhya. Retrieved from https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

Deep Lizard. (n.d.). Build Deep Q-Network — Reinforcement Learning Code Project. Retrieved from https://deeplizard.com/learn/video/PyQNfsGUnQA

Mathew, J. (2020). PyTorch Artificial Intelligence Fundamentals. Packt Publishing. ISBN: 9781838557041. Retrieved from https://learning.oreilly.com/library/view/pytorch-artificial-intelligence/9781838557041/

Paszke, A. (n.d.). Reinforcement Learning (DQN) Tutorial. PyTorch. Retrieved from https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

