Task: Training an agent to land a lunar lander safely on a landing pad on the surface of the moon.
Lunar Lander is one of the environments in OpenAI's Gym library. Simply put, an environment represents a problem or task to be solved. In this case, we will try to solve the environment using the Deep Q-Learning algorithm with Experience Replay.
The landing pad is designated by two flag poles and is always at coordinates $(0, 0)$.
Lunar Lander Environment.
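As a quick orientation, the environment is created through Gym's standard API. Below is a minimal sketch; the environment ID `LunarLander-v2` and the exact return values of `reset()`/`step()` depend on the installed Gym version:

```python
import gym

# Create the Lunar Lander environment. The ID may differ between Gym
# versions; "LunarLander-v2" is the classic one.
env = gym.make("LunarLander-v2")

print(env.action_space)       # Discrete(4): the four actions described below
print(env.observation_space)  # Box with shape (8,): the 8-variable state vector
```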
The agent has four discrete actions:
- Do nothing (integer $0$).
- Fire left engine (integer $1$).
- Fire main engine (integer $2$).
- Fire right engine (integer $3$).
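For readability, a small lookup table makes the integer encoding explicit. This is purely illustrative and not part of the environment's API, which simply expects the integer:

```python
# Purely illustrative lookup -- the environment itself just expects the integer.
ACTION_MEANINGS = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}
```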
A state vector of the agent has $8$ variables:
- The first $2$ variables are the $(x, y)$ coordinates of the lander. The landing pad is always at coordinates $(0, 0)$.
- The lander's linear velocities $(\dot x, \dot y)$.
- Its angle $\theta$.
- Its angular velocity $\dot \theta$.
- Two booleans, $l$ and $r$, that represent whether each leg is in contact with the ground or not.
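A quick way to see these variables is to reset the environment and unpack the returned state. The variable names below are illustrative; the ordering follows the list above:

```python
import gym

env = gym.make("LunarLander-v2")
state = env.reset()  # note: newer Gym versions return (state, info) instead

# Unpack the 8 state variables in the order listed above.
x, y, x_dot, y_dot, theta, theta_dot, left_leg, right_leg = state
print(f"position=({x:.2f}, {y:.2f}), angle={theta:.2f}, legs=({left_leg}, {right_leg})")
```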
The agent receives a reward after each time step:
- Landing on the landing pad and coming to rest is about $100-140$ points.
- If the lander moves away from the landing pad, it loses reward.
- If the lander crashes, it receives $-100$ points.
- If the lander comes to rest, it receives $+100$ points.
- Each leg with ground contact is $+10$ points.
- Firing the main engine is $-0.3$ points each frame.
- Firing the side engine is $-0.03$ points each frame.
An episode ends (i.e. the environment is in a terminal state) if:
- The lunar lander crashes (i.e., the body of the lunar lander comes in contact with the surface of the Moon).
- The absolute value of the lander's $x$-coordinate is greater than $1$ (i.e., it goes beyond the left or right border).
Gym implements the classic "agent-environment loop": an agent interacts with the environment in discrete time steps $t = 0, 1, 2, \ldots$. At each time step $t$, the agent observes the environment's state $S_t$, uses a policy to select an action $A_t$, receives a reward $R_t$, and moves to the next state $S_{t+1}$.
Agent-Environment Loop Formalism.
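The loop itself is only a few lines of code. Below is a sketch with a random policy standing in for the agent; the `done` handling assumes the classic Gym step API (newer versions split it into `terminated` and `truncated`):

```python
import gym

env = gym.make("LunarLander-v2")
state = env.reset()
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()             # placeholder policy: random actions
    state, reward, done, info = env.step(action)   # receive R_t and move to S_{t+1}
    total_reward += reward

print(f"Episode finished with total reward {total_reward:.1f}")
env.close()
```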
We denote the action-value function by $Q(s, a)$: the expected return obtained by starting in state $s$, taking action $a$, and then following the policy thereafter. The agent's goal is to find the optimal action-value function $Q^*(s, a)$. When both the state and action spaces are discrete, it can be estimated iteratively using the Bellman equation:

$$Q_{i+1}(s, a) = \mathbb{E}\left[R + \gamma \max_{a'} Q_i(s', a')\right],$$

where $\gamma$ is the discount factor, $R$ is the reward, and $s'$ is the next state. This iterative method converges to $Q^*(s, a)$ as $i \to \infty$.
In this case, the state space is continuous, so it is practically impossible to explore the entire state-action space. Consequently, this also makes it practically impossible to gradually estimate $Q(s, a)$ until it converges to $Q^*(s, a)$.
In the Deep Q-Learning algorithm, we solve this problem by using a neural network, called the Q-Network, to approximate the action-value function: $Q(s, a; w) \approx Q^*(s, a)$, where $w$ are the network's weights.
The Q-Network Architecture.
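As a rough illustration, such a Q-Network can be defined in a few lines of TensorFlow/Keras. The hidden-layer sizes below are assumptions made for the sketch; the actual architecture is the one shown in the figure and defined in the repository:

```python
import tensorflow as tf

def build_q_network(state_size=8, num_actions=4):
    """Map a state vector to one Q-value per action."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_size,)),
        tf.keras.layers.Dense(64, activation="relu"),   # hidden sizes are assumptions
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_actions, activation="linear"),
    ])

q_network = build_q_network()
q_network.summary()
```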
We can train the Q-Network by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation, where the target values are

$$y = R + \gamma \max_{a'} Q(s', a'; w),$$

where $w$ are the weights of the Q-Network.
Notice that this forms a problem because the target $y$ changes on every iteration: it is computed with the same weights $w$ that we are adjusting. A constantly moving target can lead to oscillations and instabilities during training. To avoid this, we create a separate neural network for generating the targets, called the target Q-Network. It has the same architecture as the original Q-Network, and the targets become

$$y = R + \gamma \max_{a'} \hat{Q}(s', a'; w^-),$$

where $w^-$ and $w$ are the weights of the target Q-Network and the Q-Network, respectively.
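Below is a sketch of how the targets and the mean-squared error could be computed for a mini-batch of experiences, assuming `q_network` and `target_q_network` are two Keras models like the one above. The function name, the tuple layout of `experiences`, and the value of `GAMMA` are illustrative assumptions:

```python
import tensorflow as tf

GAMMA = 0.995  # discount factor (illustrative value)

def compute_loss(experiences, q_network, target_q_network, gamma=GAMMA):
    """Mean-squared error between the y targets and the current Q-value estimates."""
    states, actions, rewards, next_states, dones = experiences

    # y = R for terminal transitions, R + gamma * max_a' Q_hat(s', a'; w-) otherwise.
    max_next_q = tf.reduce_max(target_q_network(next_states), axis=-1)
    y_targets = rewards + gamma * max_next_q * (1.0 - dones)

    # Q(s, a; w) for the actions that were actually taken.
    q_values = q_network(states)
    indices = tf.stack(
        [tf.range(tf.shape(q_values)[0]), tf.cast(actions, tf.int32)], axis=1
    )
    q_taken = tf.gather_nd(q_values, indices)

    return tf.reduce_mean(tf.square(y_targets - q_taken))
```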
In practice, we will use the following algorithm: every $C$ time steps, we use the target Q-Network to generate the targets and update its weights $w^-$ with a soft update,

$$w^- \leftarrow \tau w + (1 - \tau) w^-,$$

where $\tau \ll 1$. Using the soft update ensures that the target values change slowly, which greatly improves the stability of the learning algorithm.
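The soft update itself is just a weighted average of the two networks' weights. A minimal sketch, assuming the two Keras models from the earlier snippets and an illustrative value of $\tau$:

```python
TAU = 1e-3  # soft-update rate (illustrative value); note tau << 1

def soft_update(q_network, target_q_network, tau=TAU):
    """Apply w_minus <- tau * w + (1 - tau) * w_minus, weight tensor by weight tensor."""
    new_weights = [
        tau * w + (1.0 - tau) * w_minus
        for w, w_minus in zip(q_network.get_weights(), target_q_network.get_weights())
    ]
    target_q_network.set_weights(new_weights)
```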
When an agent interacts with the environment, the states, actions, and rewards the agent experiences are sequential by nature. If the agent tries to learn from these consecutive experiences it can run into problems due to the strong correlations between them. To avoid this, we employ a technique known as Experience Replay to generate uncorrelated experiences for training our agent. Experience replay consists of:
- Storing the agent's experiences (i.e., the states, actions, and rewards the agent receives) in a memory buffer.
- Sampling a random mini-batch of experiences from the buffer to do the learning.
The experience tuples $(S_t, A_t, R_t, S_{t+1})$ are added to the memory buffer at every time step as the agent interacts with the environment, and mini-batches are later drawn from it uniformly at random.
Deep Q-Learning with Experience Replay.
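A memory buffer with uniform random sampling can be sketched with Python's `deque` and `namedtuple`. The capacity and mini-batch size below are illustrative assumptions rather than the values used in this repository:

```python
import random
from collections import deque, namedtuple

import numpy as np

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

MEMORY_SIZE = 100_000   # illustrative buffer capacity
MINI_BATCH_SIZE = 64    # illustrative mini-batch size

memory_buffer = deque(maxlen=MEMORY_SIZE)  # oldest experiences are dropped automatically

def store_experience(state, action, reward, next_state, done):
    memory_buffer.append(Experience(state, action, reward, next_state, done))

def sample_mini_batch():
    """Draw a random, uncorrelated mini-batch of experiences for a learning step."""
    batch = random.sample(memory_buffer, k=MINI_BATCH_SIZE)
    states = np.array([e.state for e in batch], dtype=np.float32)
    actions = np.array([e.action for e in batch], dtype=np.int32)
    rewards = np.array([e.reward for e in batch], dtype=np.float32)
    next_states = np.array([e.next_state for e in batch], dtype=np.float32)
    dones = np.array([float(e.done) for e in batch], dtype=np.float32)
    return states, actions, rewards, next_states, dones
```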
Run the command `conda create --name <your_env> --file requirements.txt` to create the Anaconda environment.
- macOS Ventura 13.0, Anaconda 22.9.0
- Main packages: `gym`, `TensorFlow`
To train the agent, activate the Anaconda environment and run the command `python main.py` in the terminal. The result will look similar to this:
Total Training Time.
You should ignore the warnings in the image above and focus on the main information: the number of episodes it takes to solve the environment and the total training time.
After the training is finished, the model is saved in `./models/`, a video of the results in `./videos/lunar_lander.mp4`, and the moving average of total points through episodes in `./images/moving_average.png`.
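To inspect a trained agent later, the saved model can be reloaded and run greedily. A sketch, assuming the model was saved in a Keras-loadable format; the file name below is hypothetical, so check `./models/` for the actual one:

```python
import gym
import numpy as np
import tensorflow as tf

# Hypothetical file name -- use the actual file found in ./models/.
q_network = tf.keras.models.load_model("./models/q_network.h5")

env = gym.make("LunarLander-v2")
state = env.reset()
total_reward, done = 0.0, False

while not done:
    q_values = q_network(np.expand_dims(state, axis=0))
    action = int(np.argmax(q_values))              # act greedily w.r.t. the Q-values
    state, reward, done, _ = env.step(action)
    total_reward += reward

print(f"Total reward: {total_reward:.1f}")
```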
- The moving average plot of total points through episodes:
The Moving Average of Total Points through Episodes.
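For reference, a moving average like the one plotted above can be computed with a simple convolution. Here, `episode_points` and the window size are illustrative placeholders, not the repository's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

WINDOW = 100  # illustrative window size

def moving_average(points, window=WINDOW):
    """Average of each consecutive `window`-sized slice of episode totals."""
    return np.convolve(points, np.ones(window) / window, mode="valid")

# `episode_points` stands in for the real list of total points per episode.
episode_points = np.random.randint(-200, 300, size=500)
plt.plot(moving_average(episode_points))
plt.xlabel("Episode")
plt.ylabel("Moving average of total points")
plt.show()
```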
- An example of failing to land on the surface of the Moon: `lunar_lander_failed_example.mp4`
- An example of successfully landing on the surface of the Moon.