Rocket Lander using DRL

Objective

This project focuses on training an AI agent to land a rocket safely in the ROCKET LANDER environment, similar to how SpaceX lands their reusable rockets. We’ll use Deep Reinforcement Learning (DRL) techniques to teach our rocket to make landing decisions on its own. Our goal is to create an AI that can control a simulated rocket and guide it to a successful landing. The AI will:

Learn through trial and error
Make decisions about engine thrust and orientation
Adapt to changing conditions during descent

Overview

What is reinforcement learning?

Reinforcement Learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives rewards or penalties, and learns a policy (a strategy) to maximize its total reward over time.
Imagine training a dog — you give it a treat when it does something right, and nothing when it doesn’t. Over time, it learns what actions get rewards.
In RL, the agent is that “dog,” the environment is the world it interacts with, and rewards are the treats (or penalties)
RL lets AI learn from trial and error instead of being told what to do — it becomes more autonomous and adaptive.

Approach

🌌 Environment Description

The custom environment simulates a 2D rocket controlled via thrust and torque.
It provides continuous feedback about position, velocity, and contact status.

Termination Conditions:

✅ Landed successfully
💥 Crashed or out of bounds
⛽ Fuel exhausted

🎮 Action & State Spaces

The state space is continuous (8-dimensional), allowing the policy to observe smooth dynamics of flight.

🧮 Reward Function

The reward function encourages safe, stable, and fuel-efficient landings.

[ R_t = \frac{(S_t – S_{t-1})}{10} – 0.3 \times (\text{main_power} + \text{side_power}) ]

Terminal Rewards:

+10 → Landed safely
−10 → Crashed / Out of bounds

Penalties:

Upward velocity → −1
Fuel usage → proportional penalty

This formulation uses a potential-based shaping function to stabilize learning and ensure smooth policy convergence.

⚙️ Training & Hyperparameters

Graphs

DRL

Why integrate deep learning?

Traditional RL algorithms like monte carlo or TD learning require storing a Q-table.
The rocket lander environment has a continuous action and state space.
Neural networks make computation faster and easier.

Algorithm

A Brief Introduction to PPO

It uses two neurak networks – 1) The actor network and 2) The critic network
The actor or the policy network is used to output the probabilities of various actions from a state.
The critic or value network estimates the total expected return from the state.
The above estimated value is backpropagated through the critic network.
The policy network is updated after clipping these estimated values.

How does PPO work in our case?

Actor Network: takes state → outputs action probabilities (Throttle, Side Thrust, Angle).
Critic Network: takes same state → outputs single scalar value (predicted total reward).
Each of these networks has 8 neurons (state vector) in their input layer
The actor network has 3 output neurons (continuous action space outputs)
The critic netwrok has one neuron i.e., the vakue estimate
Both network shares similar architecture but separate weights
Actor selects optimal thrust commands while the critic evaluates them

Advantages of implementing using PPO:

A dense and high variance reward function is designed to take into consideration the complex physics of the environment. The critic network makes sure that these noises do not enter the policy network.
A single nad batch of data may revert all previous learning. Clipping limits ensure that this doesn’t happen and guarantee a smoother and more reliable convergence.
Both the action space and observation space are in terms of 3 dimensional floating point box. Discretization may lead to loss of accuracy. However, neural networks solve this problem by allowing us to incorporate the floating values into out training.

Results

GitHub

Members

Tvisha Mehta
Ishan Agrawal
Raj Patil

Mentors

Aarsush Sinha
Akash Tiwari
Atharv Kulkarni
Anuj Sidam