Playing Mountain Car with Deep Q-Learning

Ha Nguyen
Mar 13, 2021 · 4 min read


Introduction

As promised in my previous article, this time, I will implement Deep Q-learning (DQN) and Deep SARSA to train an agent to play the Mountain Car game and compare their performance to that of vanilla Q-learning and SARSA.

Deep Q-learning (DQN)

The DQN algorithm is largely the same as Q-learning. The key difference is that instead of maintaining a table that maps every state-action pair to its Q-value, we use a neural network to approximate the Q-function: the network takes a state as input and outputs one Q-value per action.

Let’s compare the input and output of vanilla Q-learning vs. DQN:

Q-learning vs. DQN architecture (Source: Choudhary, 2019)
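To make the contrast concrete, here is a minimal sketch (the layer sizes, state discretization, and example state are illustrative, not the exact setup used later in this post): tabular Q-learning indexes into a table, while DQN runs a forward pass through a network.

import numpy as np
import torch
import torch.nn as nn

# Tabular Q-learning: Q-values live in a table indexed by a discretized
# state and an action (Mountain Car has 3 actions).
n_states, n_actions = 40 * 40, 3
q_table = np.zeros((n_states, n_actions))
q_values_tabular = q_table[0]                # Q-values for one discrete state

# DQN: a network maps the raw continuous state directly to one Q-value
# per action, so no discretization is needed.
q_net = nn.Sequential(nn.Linear(2, 24), nn.ReLU(), nn.Linear(24, 3))
state = torch.tensor([[-0.5, 0.0]])          # (position, velocity)
q_values_dqn = q_net(state)                  # shape: [1, 3]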

The DQN algorithm is as follows:

Deep Q-Learning algorithm (Source: Deep Lizard, n.d.)

Note that we store full (state, action, reward, next state, terminal) transitions in a ‘replay memory’, but only sample a random mini-batch of them to train our policy network at each step. This random sampling breaks up the strong correlation between consecutive transitions.
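As a minimal illustration of that mechanism (the capacity and batch size here are placeholders, not the values used later):

from collections import deque
import random

# Replay memory: keep the most recent transitions, then train on a random
# mini-batch so consecutive, highly correlated steps are not used together.
memory = deque(maxlen=10_000)                    # placeholder capacity

# Dummy transitions (state, action, reward, next_state, terminal) for illustration.
for t in range(100):
    memory.append((t, 0, -1.0, t + 1, False))

batch = random.sample(memory, k=32)              # random, weakly correlated mini-batch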

Python implementation

To implement DQN in Python, I used the following dependencies:

numpy==1.20.1
torch==1.7.1+cpu
gym==0.18.0
collections (standard library)
random (standard library)
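The snippets that follow also reference a handful of module-level imports and hyperparameters (max_epsilon, epsilon_decay, gamma, and so on) that are not shown in this post. A plausible setup looks like the following; the specific values are placeholders of my own, not necessarily the ones used to produce the results below:

import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Hyperparameters referenced by the classes and functions below
# (placeholder values; tune as needed).
max_epsilon = 1.0        # initial exploration rate
min_epsilon = 0.01       # exploration floor
epsilon_decay = 0.995    # multiplicative decay applied after each game
max_memory_len = 10_000  # replay memory capacity
capacity = 32            # mini-batch size sampled from memory
update_frequency = 10    # how often the target network is synced
gamma = 0.99             # discount factor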

In this implementation, I use a very simple network with the following structure:

def NN(space_dim, n_actions, out_feature=24):
    return nn.Sequential(
        nn.Linear(space_dim, out_feature),
        nn.ReLU(),
        nn.Linear(out_feature, out_feature),
        nn.ReLU(),
        nn.Linear(out_feature, n_actions))
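As a quick sanity check (Mountain Car’s observation is a 2-dimensional (position, velocity) vector and there are 3 actions), the network can be exercised like this, assuming the imports above:

# Forward pass on a single made-up state; output is one Q-value per action.
net = NN(space_dim=2, n_actions=3)
q_values = net(torch.tensor([[-0.5, 0.0]]))
print(q_values.shape)    # torch.Size([1, 3])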

My assumption is that since the vanilla Q-learning agent in the previous article already played Mountain Car well, a simple neural network should be enough to match or improve on its performance. My DQN agent is as follows:

class DQN:

    def __init__(self, env):
        # define parameters
        self.env = env
        self.epsilon = max_epsilon
        self.space_dim = self.env.observation_space.shape[0]
        self.n_actions = self.env.action_space.n
        # initialize replay memory
        self.memory = deque(maxlen=max_memory_len)

        # define policy & target networks
        self.policy = NN(self.space_dim, self.n_actions)
        self.target = NN(self.space_dim, self.n_actions)
        self.target.load_state_dict(self.policy.state_dict())
        self.target.eval()
        # loss function & optimizer for the policy network
        self.metric = nn.MSELoss()
        self.optimizer = optim.Adam(self.policy.parameters())

    def state_to_float(self, state):
        '''Convert a state to a float tensor of shape [1, space_dim]'''
        return torch.from_numpy(np.reshape(state, [1, self.env.observation_space.shape[0]])).float()

    def add_memory(self, state, action, reward, next_state, terminal):
        '''Add a new experience to memory'''
        self.memory.append((state, action, reward, next_state, terminal))

    def choose_action(self, state):
        '''Choose an action based on the epsilon-greedy strategy'''
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()
        else:
            action = np.argmax(self.target(state).detach().numpy()[0])

        return action

    def epsilon_update(self):
        '''Decrease epsilon iteratively'''
        if self.epsilon > min_epsilon:
            self.epsilon *= epsilon_decay
        else:
            self.epsilon = min_epsilon

        return self.epsilon

    def target_update(self, cur_ep):
        '''Clone policy network weights to the target network every few updates'''
        if cur_ep % update_frequency == 0:
            self.target.load_state_dict(self.policy.state_dict())

    def replay_memory(self):
        # randomly select a mini-batch of experiences from memory to train the policy network
        if len(self.memory) < capacity:
            return

        batch = random.sample(self.memory, capacity)

        # update Q-values
        for state, action, reward, next_state, terminal in batch:
            if terminal and next_state[0][0].item() >= self.env.goal_position:
                q = reward
            else:
                q = reward + gamma * self.target(next_state).max(axis=1)[0]

            # build the target Q-values: only the taken action's value changes
            q_val = self.target(state)
            q_val[0][action] = q

            # compute loss between the policy network's output and the target Q-values
            loss = self.metric(self.policy(state), q_val)
            # gradient descent
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

Finally, I create a function to train the agent:

def game_play(env, n_trials):
    '''Let the DQN agent play Mountain Car'''

    # list to store the total reward collected in each game
    reward_list = []

    # create new DQN object
    dqn = DQN(env)
    for i in range(n_trials):
        state = env.reset()
        state = dqn.state_to_float(state)
        terminal = False
        cur_reward = 0

        while not terminal:
            # render the last 5 episodes
            if i >= (n_trials - 5):
                env.render()
            # choose action
            action = dqn.choose_action(state)

            # find next state & reward
            next_state, reward, terminal, _ = env.step(action)
            next_state = dqn.state_to_float(next_state)
            # append to memory
            dqn.add_memory(state, action, reward, next_state, terminal)
            state = next_state

            if terminal:
                break

            cur_reward += reward
            dqn.replay_memory()

        # update target network
        dqn.target_update(abs(cur_reward))

        reward_list.append(cur_reward)
        # update epsilon
        dqn.epsilon = dqn.epsilon_update()

        if i % 100 == 0:
            print(f'Game {i}; reward {cur_reward}; epsilon {dqn.epsilon}')
    env.close()

    return reward_list
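The training loop can then be kicked off along these lines (the trial count here is a placeholder; the results below come from running it for several thousand games):

if __name__ == '__main__':
    env = gym.make('MountainCar-v0')
    rewards = game_play(env, n_trials=2000)    # placeholder trial count
    print(f'Mean reward over the last 100 games: {np.mean(rewards[-100:]):.1f}')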

Result

Contrary to my prediction, the DQN agent fails to learn to play Mountain Car even after thousands of games. As shown in the picture below, the reward was stuck at -199 the entire time, meaning the car never reached the flag before the episode hit its step limit.

Mountain Car gameplay result of the DQN agent

I have come to the conclusion that either DQN is not well suited to the Mountain Car game, or there is a bug in my current model that prevents it from learning. Thus, my future work includes two tasks: reviewing the current code to identify bugs, and implementing DQN with Keras to see whether I get the same result.

Sources:

Choudhary, A. (2019). A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python. Analytics Vidhya. Retrieved from https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

Deep Lizard. (n.d.). Build Deep Q-Network — Reinforcement Learning Code Project. Retrieved from https://deeplizard.com/learn/video/PyQNfsGUnQA

Mathew, J. (2020). PyTorch Artificial Intelligence Fundamentals. Packt Publishing. ISBN: 9781838557041. Retrieved from https://learning.oreilly.com/library/view/pytorch-artificial-intelligence/9781838557041/

Paszke, A. (n.d.). Reinforcement Learning (DQN) Tutorial. PyTorch. Retrieved from https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

