When DeepMind's AlphaGo elevated Deep Reinforcement Learning (RL) to the pinnacle of AI research, I became obsessed with mastering it. I created a simple 2D game set in a randomised environment and gave myself a tight deadline to train a Deep RL agent to navigate it skillfully. Unfortunately, I soon hit a wall and the clock ran out. I saved it as a repository on GitHub, and a few years later, Large Language Models (LLMs) became mainstream and helped me crack it. Here's what happened.
The RL Environment
To understand the challenge precisely, let me first describe the game in RL terms, i.e. Environment, Observations, Actions and Rewards.

The environment is a square area where 10 balls (6 Green and 4 Red) move at constant speed and bounce off each of the 4 sides without any energy loss. The initial coordinates and velocity vector of each ball are randomly set at the start of each game. Think of it like a game of pool where the cue ball (your agent) must hit the Green balls while avoiding the Red ones. The agent is a Blue ball initialised at fixed coordinates that can be moved at constant speed using the 9 typical joystick actions (No move, Up, Down, Left, Right, and the 4 diagonals). The observation space is the position vector of the hero ball, plus the position and velocity vectors of each target ball along with its Colour {Green, Red} and Status {Live, Hit}. The reward structure grants +0.25 for each Green ball hit, plus terminal rewards when the simulation ends, as summarised by this table:
| Event | Reward | Terminal? |
|---|---|---|
| Hit Green Ball | +0.25 | No |
| All Greens Hit | +1.0 | Yes |
| Hit Red Ball | -1.0 | Yes |
| 1,000 Steps Elapsed | 0 | Yes |
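For the more code-minded readers, here is a minimal sketch of how these observation and action spaces could be declared, assuming a Gymnasium-style interface; the names, bounds and layout are illustrative, not the exact ones in my repository:

```python
import numpy as np
from gymnasium import spaces

NUM_BALLS = 10  # 6 Green targets + 4 Red hazards

# Positions and velocities as float vectors, colours and statuses as integer codes.
observation_space = spaces.Dict({
    "hero_pos":    spaces.Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32),
    "ball_pos":    spaces.Box(low=0.0, high=1.0, shape=(NUM_BALLS, 2), dtype=np.float32),
    "ball_vel":    spaces.Box(low=-1.0, high=1.0, shape=(NUM_BALLS, 2), dtype=np.float32),
    "ball_colour": spaces.MultiDiscrete([2] * NUM_BALLS),  # 0 = Green, 1 = Red
    "ball_status": spaces.MultiDiscrete([2] * NUM_BALLS),  # 0 = Live,  1 = Hit
})

# 9 joystick actions: No move, 4 cardinals, 4 diagonals.
action_space = spaces.Discrete(9)
```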
In summary, we could say that this environment has:
- A small and discrete action space
- A high-dimensional multi-type observation space
- A sparse reward structure
This toy problem is particularly interesting because it mirrors real-world sparse-reward scenarios encountered in trading, robotics, or drug discovery. Keep reading to find out the tips and tricks that dramatically improved training speed and convergence.
RL Algorithm and Feature Extractor
Picking an RL algorithm is a crucial decision. I did not take any chances and went for the one that has been OpenAI's gold standard since they shared it with the world in 2017: Proximal Policy Optimization (PPO). Given my tight new deadline, I was clearly not going to implement it from scratch, so I used its implementation in the excellent StableBaselines3 (SB3) library.
As I pointed out earlier, our observation space mixes disparate data types like float vectors, integers and booleans. We need to aggregate them all into a single floating-point feature vector which will be passed to our PPO model. In SB3, this is done by building a PyTorch module called the feature extractor, which is passed to the PPO instance. I identified two main tricks that were paramount for making it run efficiently in this setup:
1. Embeddings
An embedding layer transforms an integer into a fixed-size float vector. You can think of it as a learnable correspondence table between an integer value and a vector of floating-point numbers. This allows a very flexible multi-dimensional numerical representation of a categorical state such as Ball Colour or Ball Status in our setting.
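As a minimal PyTorch illustration (the embedding dimension of 4 is an arbitrary choice for this sketch):

```python
import torch
import torch.nn as nn

# One learnable lookup table per categorical attribute.
colour_emb = nn.Embedding(num_embeddings=2, embedding_dim=4)  # {Green, Red}
status_emb = nn.Embedding(num_embeddings=2, embedding_dim=4)  # {Live, Hit}

colours = torch.tensor([0, 0, 1])       # e.g. Green, Green, Red
colour_features = colour_emb(colours)   # float tensor of shape (3, 4)
```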
2. Residual Blocks
Training multi-layered neural network blocks can hit a wall from time to time due to the vanishing gradient problem. A simple yet extremely powerful way of getting around this potential bottleneck is to add a skip connection to the non-linear layer:
$$ y = x + f_{\theta}(x) $$
This allows the block to learn an identity transform (by driving the non-linear term to zero), in effect bypassing the layer altogether. You could think of residual blocks as fail-safe non-linear layers.
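In PyTorch, such a block might look like the sketch below (the two-layer form of $f_{\theta}$ is just one possible choice). Blocks like this can then be stacked inside the custom SB3 feature extractor alongside the embedding layers.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f_theta(x): a non-linear layer with a skip connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity path lets gradients bypass the non-linear layers.
        return x + self.f(x)
```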
Training Strategy
Training in high-dimensional RL environments is challenging because most RL models are sample-inefficient, i.e. they require vast amounts of trial-and-error data to learn. This is a feature rather than a bug, as Policy network updates need to be very gradual to remain stable. The sparse reward structure makes matters even worse because some rewards are unlikely to be reached by acting randomly. This is very much the case here for the reward generated when all Green balls have been hit.
To jumpstart training, I first created a rule-based expert agent to generate good-quality trajectories. These were stored in a static dataset for offline Imitation Learning, or more specifically in our case, Behavioural Cloning (BC). This gave our model a strong initial Policy, saving hours of training time.
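For intuition, here is a rough sketch of what a plain supervised BC loop can look like, assuming the expert trajectories have already been flattened into (feature vector, action) pairs; the tensors, sizes and network below are placeholders rather than my actual dataset and model:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder expert data: pre-extracted feature vectors and discrete actions.
expert_features = torch.randn(50_000, 64)
expert_actions = torch.randint(0, 9, (50_000,))
loader = DataLoader(TensorDataset(expert_features, expert_actions),
                    batch_size=256, shuffle=True)

# A stand-in policy head predicting one of the 9 joystick actions.
policy_head = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 9))
optimiser = torch.optim.Adam(policy_head.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for features, actions in loader:
    logits = policy_head(features)
    loss = loss_fn(logits, actions)      # match the expert's action choices
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```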
However, looking into the distribution of all actions taken by the agent during the simulations, I noted that nearly 40% of the decisions were "No move". This imbalance risked biasing the learnt policy into passivity at the BC stage. To counter this, I weighted the loss function by the inverse of the recorded action frequencies. This trick completely stabilised the BC training and ensured each action was equally represented in the loss. The BC stage then completed fairly easily with a balanced accuracy exceeding 90%.
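Concretely, this weighting can be done by passing per-class weights to the cross-entropy loss. The frequencies below are illustrative, with "No move" at roughly 40%:

```python
import torch
import torch.nn as nn

# Illustrative empirical action frequencies from the expert trajectories.
action_freq = torch.tensor([0.40, 0.08, 0.08, 0.08, 0.08, 0.07, 0.07, 0.07, 0.07])

# Weight each action by the inverse of its frequency, then normalise.
weights = 1.0 / action_freq
weights = weights / weights.sum()

# Drop-in replacement for the unweighted loss in the BC loop.
loss_fn = nn.CrossEntropyLoss(weight=weights)
```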
Since a PPO model is made of both a Policy and a Value network, and since BC focuses solely on the Policy, the 2 networks are completely misaligned at the end of this preliminary training. This creates an instability when launching the full PPO training: PPO uses the Value network to compute the Advantage Estimate ($A_t$), the core metric guiding policy updates, so the untrained Value network produced wildly inaccurate estimates that immediately degraded the newly learned BC Policy. The solution was Value Network Pre-training, which consists of launching a full PPO training run while freezing the Policy network's weights and biases. This allowed the Value network to learn a reliable estimate of the BC policy's performance from the expert trajectories. After 20k timesteps, the Value network's estimates were stable enough to start the unbridled PPO training safely.
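One way to express this idea in SB3, assuming the default actor-critic layout with separate `policy_net` and `value_net` heads, is sketched below; `MyBallGameEnv` is a hypothetical stand-in for the game environment, and this is an illustration of the technique rather than my exact code:

```python
from stable_baselines3 import PPO

env = MyBallGameEnv()  # hypothetical stand-in for the game environment

# The policy weights are assumed to have been initialised from the BC step.
model = PPO("MultiInputPolicy", env, verbose=1)

# Freeze the actor side so only the value function gets updated.
for param in model.policy.mlp_extractor.policy_net.parameters():
    param.requires_grad = False
for param in model.policy.action_net.parameters():
    param.requires_grad = False

# Short warm-up run: the frozen policy keeps behaving like the BC expert
# while the Value network learns to evaluate it.
model.learn(total_timesteps=20_000)

# Unfreeze everything before the full PPO training.
for param in model.policy.parameters():
    param.requires_grad = True
```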
The SB3 PPO model comes with a large number of adjustable parameters. Two in particular have a large impact on convergence, as shown in the sketch after this list:
- clip_range: Keeping it low prevents aggressive Policy updates hence adding training stability
- ent_coef: A low value limits exploration which is advisable here since the initial Policy is already quite capable
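Here is how these could be passed to the SB3 PPO constructor; the values shown are illustrative rather than the exact ones I settled on, and `MyBallGameEnv` is again a hypothetical stand-in for the game environment:

```python
from stable_baselines3 import PPO

env = MyBallGameEnv()  # hypothetical stand-in for the game environment

model = PPO(
    "MultiInputPolicy",
    env,
    clip_range=0.1,   # small clip range: conservative, stable Policy updates
    ent_coef=0.001,   # low entropy bonus: limited exploration, the BC policy is already capable
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```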
These technical choices didn't just work; they revealed broader patterns for tackling sparse-reward problems. Here's what I'll remember.
Lessons Learnt
In retrospect, this project wasn't just about getting the satisfaction of beating a decade-old challenge. It was really about understanding what it takes to efficiently solve RL problems. Here are three lessons I will keep from this experience:
Heuristics Are Invaluable: The expert agent’s simple logic (“prioritize green, avoid red”) generated meaningful trajectories in minutes, while pure PPO exploration took hours to stumble upon the same behaviors. Even naive heuristics create better starting points than random exploration.
LLMs as Debugging Partners: LLMs are powerful for generating reliable code and ideas. Using them efficiently means using them as a springboard to the next step towards the solution. Stay in the driving seat for the high-level thinking and use them as tactical problem solvers.
The Value Network Post-BC Adjustment: After behavioural cloning, the untrained Value network's poor estimates degraded the Policy during early PPO training. Freezing the Policy and pre-training the Value network on the expert trajectories for 20k timesteps proved a crucial step for stabilising the final training process.
Final Results & Conclusion
My second attempt at solving this challenge was a total success. Through repeated online training, the PPO model ended up completely outperforming the expert agent (which had a pretty good score in the first place):
| Model | Reward Mean | Reward StdDev |
|---|---|---|
| PPO | 1.975 | 1.076 |
| Expert | 1.462 | 1.304 |
I would attribute this fortunate turn of events to 3 main factors in no particular order:
- Access to a very stable implementation of the efficient PPO model
- A more in-depth knowledge of neural network architectures
- Access to LLMs acting as powerful debuggers and research assistants
To date, LLMs have improved my codebase tremendously, helping with optimisation, debugging, and architecture modifications. Their added value is unquestionable, but I believe my positive experience is due to me staying in control of the project at all times: I use them tactically, not strategically.
If you liked this article, follow me on X to get notified about my next posts.