Spaceship Complex RL

Self-Landing Spaceship - Complex version

Oct 24, 2016


This is a continuation of the Self-Landing Spaceship - Simple version. This time, the spaceship starts in the upper-left corner, mimicking a parabolic descent trajectory. The landing scenario is designed to be more complex: the spaceship is initialized with random horizontal and vertical velocities, as well as a random body angle.

The goal remains the same: to land on the ground safe and sound by acting correctly at each time step. Specific landing criteria are as follows:

  • it touches the ground softly and gently, meaning both the X and Y velocities must be below 30 px/s
  • the spaceship body is not too tilted, meaning the body angle must be within +/- 30 degrees of upright
  • it doesn't fly beyond the upper boundary, back into space
  • it doesn't run out of fuel, which lasts only 10 seconds
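The touchdown criteria above can be sketched as a simple check. The function and parameter names here are assumptions for illustration; the actual project code may organize this differently (and the upper-boundary condition is checked during flight rather than at touchdown).

```python
def is_safe_landing(vx, vy, angle_deg, fuel_remaining):
    """Return True if touchdown with this state counts as a safe landing.

    vx, vy         -- horizontal / vertical speed in px/s (hypothetical names)
    angle_deg      -- body angle in degrees, 0 = upright
    fuel_remaining -- seconds of fuel left (fuel lasts only 10 s total)
    """
    soft_touch = abs(vx) < 30 and abs(vy) < 30   # both speeds under 30 px/s
    upright = abs(angle_deg) <= 30               # within +/- 30 degrees of upright
    has_fuel = fuel_remaining > 0                # tank not yet empty
    return soft_touch and upright and has_fuel
```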

As you may have noticed from the GIF below, the spaceship now follows much more diverse movement paths (some of them pretty miserable) because the action space is broader. At each frame, it needs to choose one of six actions instead of the two options from the simple version:

  • thrusts and turns left
  • thrusts and turns right
  • thrusts and does not turn
  • does not thrust and turns left
  • does not thrust and turns right
  • does not thrust and does not turn
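One way to see the six options is as every combination of thrust on/off with three turn choices. The encoding below is a hypothetical sketch, not the project's actual representation:

```python
from itertools import product

# Hypothetical action encoding: each action is a (thrust, turn) pair,
# covering all combinations of thrust on/off and turn left/none/right.
THRUST = (True, False)
TURN = ("left", "right", "none")
ACTIONS = list(product(THRUST, TURN))  # 2 x 3 = 6 discrete actions
```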

At the same time, the state of the spaceship is defined by four sensor inputs:

  • horizontal velocity
  • vertical velocity
  • spaceship angle
  • distance to the ground


Can a naive genetic algorithm or beam search handle this level of complexity? It turns out not quite. These two approaches are purely optimization methods that search through the parameter space of the multilayer perceptron; they usually end up finding an "acceptable" solution rather than the best one. In this case, even an "acceptable" solution is not really acceptable, leading to severe landing accidents.

Since the problem is by nature one of sequential decision making, reinforcement learning becomes the obvious way to go. Specifically, I used the SARSA(λ) algorithm to evaluate the action values, that is, to learn which action is the most beneficial to take in the current state. For example, when the spaceship is close to the ground and descending very fast with its body upright, the optimal action is to thrust without turning, producing a deceleration (which yields a higher reward than simply crashing into the ground with no slowdown). On the other hand, since the state dimensions are continuous, enumerating all possible state-action pairs is hopeless. So I used linear function approximation with tile coding, hoping to generalize the action-value estimates from seen states to unseen ones.
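The core update can be sketched as follows. This is a minimal illustration of SARSA(λ) with binary tile features and linear function approximation; the weight-vector size, the hyperparameter values, and the assumption that a tile coder supplies the indices of active tiles are all stand-ins, not the project's exact setup.

```python
import numpy as np

N_WEIGHTS = 4096                     # illustrative; real tile coders use far more
alpha, gamma, lam = 0.1, 0.99, 0.9   # step size, discount, trace decay (assumed values)

w = np.zeros(N_WEIGHTS)              # one weight per tile
z = np.zeros(N_WEIGHTS)              # eligibility traces

def q(tiles):
    # With binary tile features, the action value is just the sum of
    # the weights of the active tiles.
    return w[tiles].sum()

def sarsa_lambda_step(tiles, reward, next_tiles, done):
    """One on-policy update after observing (s, a) -> r -> (s', a').

    `tiles` / `next_tiles` are index arrays of active tiles for the
    current and next state-action pair, as produced by a tile coder.
    """
    global w, z
    delta = reward - q(tiles)        # TD error
    if not done:
        delta += gamma * q(next_tiles)
    z *= gamma * lam                 # decay all traces
    z[tiles] = 1.0                   # replacing traces for active tiles
    w += alpha * delta * z           # credit recently active tiles
```

With replacing traces, each active tile's trace is reset to 1 on every visit, which tends to behave more stably with binary features than accumulating traces.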

I kept it running on my laptop for a day and a half, about 28,000 training epochs. As the screenshot above illustrates, it learned relatively fast at the beginning and then slowly converged to a reward around 70. Sadly, a successful landing requires a reward of about 140. Even though it still crashes, you can definitely tell the difference between the untrained and the trained agent. To demonstrate the learning results, I defined four distinct initial states and compared how the agent handles each situation:

Initial state 1: angle = 0°, horizontal velocity = 100 px/s, vertical velocity = 130 px/s

Not Trained


Personal Favorite No. 1: it first hovers above the ground to counteract the high descent speed, then propels itself to the right to mitigate horizontal drift.

Initial state 2: angle = 90°, horizontal velocity = 150 px/s, vertical velocity = 100 px/s

Not Trained


Initial state 3: angle = 180°, horizontal velocity = 170 px/s, vertical velocity = 80 px/s

Not Trained


Initial state 4: angle = 270°, horizontal velocity = 200 px/s, vertical velocity = 30 px/s

Not Trained


Personal Favorite No. 2: it turns a somersault from upside down to upright, thrusting constantly to slow down both the horizontal and vertical velocities.

Admittedly, there are intelligent behaviors, such as decelerating while approaching the ground, or turning upright while the craft is upside down. But why wasn't the SARSA(λ) algorithm able to produce successful landings? After some trial and error, I found that training performance is positively correlated with the number of tiles used to encode the state features. When each feature is coded with fewer tiles, meaning a smaller number of parameters, the agent performs poorly, with the reward converging around -60. In my final experiment, I reached the limit of my laptop by updating 3 million parameters (50k tiles for each action), resulting in a reward around 100. I believe the performance would go beyond that if I migrated to a more powerful machine and used more tiles. Another possible reason is a lack of hyperparameter tuning: things like the learning rate, exploration factor, discount factor, and eligibility-trace decay factor may have a profound influence on learning performance. Tuning them simply takes time and computational resources.
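As a back-of-the-envelope illustration of why finer tilings inflate the parameter count: with one weight per tile, per tiling, per action, memory grows exponentially in the number of state features. The numbers below are made up for illustration and are not the exact setup from my experiments.

```python
def n_parameters(n_tilings, tiles_per_dim, n_features, n_actions):
    # One weight per tile in every tiling, replicated for each action.
    return n_tilings * (tiles_per_dim ** n_features) * n_actions

# e.g. 8 tilings over the 4 state features at 10 tiles each, 6 actions:
print(n_parameters(8, 10, 4, 6))  # 480000
```

Doubling the resolution per feature (10 to 20 tiles) would multiply the count by 2^4 = 16, which is why the tile budget hits a laptop's memory limit quickly.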

Future Work

Coincidentally, I found OpenAI Gym right after doing this project. It seems to provide a convenient API for testing algorithms on various reinforcement learning tasks; I will give it a try in the future. In addition, I will take an AI class in the upcoming Winter Quarter 2017, which might involve solving games like this one. I look forward to exploring and implementing Q-learning or even Deep Q-Network algorithms to further my journey in reinforcement learning.

If you happen to be a rocket scientist, a machine learning expert, or someone simply interested in the topic, I would be more than happy to hear your opinion and discuss better ways to carry this project forward. After all, our conquest is the sea of stars; this is just the beginning.

At last, salute to SpaceX's incredible achievement.


Code of this project will be published soon.