Why build it from scratch?
Modern ML frameworks are powerful precisely because they hide the details — automatic differentiation, GPU kernels, optimiser state — all abstracted away behind a few lines of API. That convenience is great in production, but it makes it easy to treat a neural network as a black box. I wanted to understand every multiply, every gradient, every weight update. So I wrote the whole stack myself.
The result is a two-part system: a modular neural network library built on NumPy, and a Pygame-based 2D physics engine that acts as the training environment. The capstone is a Deep Q-Network (DQN) agent that learns to land a simulated spacecraft — trained entirely inside the custom engine.
The neural network engine
The library is built around two abstract primitives: Layer and
NeuralNetwork. Every layer exposes a forward_propagate() and a
backward_propagate() method, which makes adding new layer types
straightforward. The two concrete implementations are FullyConnectedLayer
and ActivationLayer.
Weight initialisation
Weights are initialised with a Xavier/Glorot-style uniform scheme: each weight is drawn
uniformly from ±1 / sqrt(input_size + output_size). (The canonical Glorot uniform bound
is sqrt(6 / (fan_in + fan_out)); the simpler 1/sqrt bound differs only by a constant
factor and scales the same way.) This keeps the variance of activations roughly constant
across layers at the start of training, which matters especially for deep networks with
saturating activations like tanh.
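As a sketch, the draw looks like this (the helper name xavier_uniform is mine, not the library's):

```python
import numpy as np

def xavier_uniform(input_size, output_size, rng=None):
    # Uniform draw in [-b, b] with b = 1 / sqrt(fan_in + fan_out),
    # the bound described above.
    rng = rng or np.random.default_rng(0)
    bound = 1.0 / np.sqrt(input_size + output_size)
    return rng.uniform(-bound, bound, size=(input_size, output_size))
```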
Forward and backward pass
The forward pass through a fully connected layer is a single matrix multiply plus a bias:
def forward_propagate(self, input_data):
    self.input = input_data
    self.output = (input_data @ self.weights) + self.bias
    return self.output
The backward pass computes three quantities using the chain rule: the error to propagate upstream, the weight gradient, and the bias gradient. All three are derived from the output error passed in from the layer above:
def backward_propagate(self, output_error, learning_rate):
    input_error = output_error @ self.weights.T
    weights_error = self.input.T @ output_error
    self.weights -= learning_rate * weights_error
    self.bias -= learning_rate * np.mean(output_error, axis=0)
    return input_error
The activation layer wraps a function and its analytic derivative. Tanh is the only
activation used throughout the project — its derivative, 1 - tanh(x)², is
computed inline during backprop.
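The activation-layer pattern can be sketched as follows (the real class is ActivationLayer and receives the function pair at construction; this stand-in hard-codes tanh):

```python
import numpy as np

class TanhActivation:
    # Sketch of an activation layer wrapping tanh and its analytic derivative
    def forward_propagate(self, input_data):
        self.input = input_data
        return np.tanh(input_data)

    def backward_propagate(self, output_error, learning_rate=None):
        # Chain rule: multiply the upstream error by 1 - tanh(x)^2
        return (1.0 - np.tanh(self.input) ** 2) * output_error
```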
Training loop
NeuralNetwork.train() supports both sequential and random mini-batch
sampling. For each batch, it performs a full forward pass over all layers, computes the
MSE loss and its derivative, then walks the layer list in reverse to propagate gradients.
The cost function and its derivative are injected at construction time, keeping the
network decoupled from any specific loss.
# Building a network with a fluent interface
net = (NeuralNetwork(*CostFunctions.use_mse())
       .add_layer(FullyConnectedLayer(784, 100))
       .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
       .add_layer(FullyConnectedLayer(100, 50))
       .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
       .add_layer(FullyConnectedLayer(50, 10))
       .add_layer(ActivationLayer(*ActivationFunctions.use_tanh())))
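The per-batch mechanics can be sketched end to end with illustrative stand-ins for the library's classes (FCLayer here is a minimal substitute for FullyConnectedLayer):

```python
import numpy as np

class FCLayer:
    # Minimal stand-in for FullyConnectedLayer
    def __init__(self, n_in, n_out, rng=None):
        rng = rng or np.random.default_rng(0)
        bound = 1.0 / np.sqrt(n_in + n_out)
        self.weights = rng.uniform(-bound, bound, (n_in, n_out))
        self.bias = np.zeros((1, n_out))

    def forward_propagate(self, x):
        self.input = x
        return x @ self.weights + self.bias

    def backward_propagate(self, output_error, learning_rate):
        input_error = output_error @ self.weights.T
        self.weights -= learning_rate * (self.input.T @ output_error)
        self.bias -= learning_rate * np.mean(output_error, axis=0)
        return input_error

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size

def train_batch(layers, x, y, learning_rate):
    out = x
    for layer in layers:                    # full forward pass over all layers
        out = layer.forward_propagate(out)
    loss = mse(y, out)
    error = mse_prime(y, out)
    for layer in reversed(layers):          # walk the layer list in reverse
        error = layer.backward_propagate(error, learning_rate)
    return loss
```

Injecting mse and mse_prime rather than hard-coding them is what keeps the network decoupled from any specific loss.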
Serialisation
Trained networks are saved to disk as a folder of .npy weight files plus a
config.json that records layer types and activation IDs. Loading
reconstructs the full network from those files, making the format human-readable and
independent of any pickle-based serialisation.
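A minimal sketch of that scheme, assuming one .npy file per layer and a simple naming convention (the actual file names and config keys in the project may differ):

```python
import json
import numpy as np
from pathlib import Path

def save_network(folder, weight_arrays, config):
    # Each weight matrix goes to its own .npy file; the layer layout
    # (types, activation IDs) goes to a human-readable config.json
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for i, w in enumerate(weight_arrays):
        np.save(folder / f"layer_{i}.npy", w)
    (folder / "config.json").write_text(json.dumps(config))

def load_network(folder):
    folder = Path(folder)
    config = json.loads((folder / "config.json").read_text())
    weights = [np.load(folder / f"layer_{i}.npy")
               for i in range(len(config["layers"]))]
    return weights, config
```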
Supervised baseline: MNIST digit classification
Before moving to reinforcement learning I validated the engine on MNIST. A 784→100→50→10 network (all tanh, MSE loss) was trained for one epoch with a batch size of 500 and a learning rate of 0.2. The network reaches solid accuracy on the training set, confirming that the forward pass, backward pass, and weight update are all correct.
One-hot encoding is used for the labels: the output layer produces a 10-dimensional vector and the predicted digit is the argmax. The trained weights are saved and reloaded via the serialisation system as an end-to-end integration test.
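The encoding and decoding steps are small enough to show in full (helper names are illustrative):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Integer labels -> (N, num_classes) one-hot matrix
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def predict_digit(output_vector):
    # The predicted digit is the index of the largest output
    return int(np.argmax(output_vector))
```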
The physics engine
The reinforcement learning environments needed realistic physics, so I built a 2D rigid-body simulator on top of Pygame. The engine handles sprite rendering, input, collision detection, and collision response — all in a fixed 60 Hz update loop.
Integration
Each Sprite stores linear velocity, angular velocity, mass, and inertia,
plus separate drag coefficients for linear and quadratic damping (both translational and
rotational). Motion is integrated with semi-implicit Euler: velocity is updated first,
then the new velocity is used to update position. The scheme is symplectic, so it handles
oscillatory and game-scale dynamics far more gracefully than explicit Euler at the same
cost per step.
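One translational step of that scheme looks like this (a sketch; the drag model here is the linear-plus-quadratic form described above, with assumed coefficient semantics):

```python
import numpy as np

def integrate(pos, vel, force, mass, dt, lin_drag=0.0, quad_drag=0.0):
    # Drag opposes motion: a linear term plus a quadratic, speed-scaled term
    speed = np.linalg.norm(vel)
    drag = -(lin_drag + quad_drag * speed) * vel
    vel = vel + ((force + drag) / mass) * dt   # velocity first...
    pos = pos + vel * dt                       # ...then position uses the NEW velocity
    return pos, vel
```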
Collision detection — Separating Axis Theorem
Polygon–polygon collision detection uses the Separating Axis Theorem (SAT). For each edge normal on both shapes, the algorithm projects all vertices onto that axis and checks for overlap. If any axis separates the projections, the shapes are not colliding. When a collision is detected, SAT also returns the contact manifold: the world-space contact point, the collision normal, and the penetration depth.
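The boolean half of SAT (detection without the contact manifold) fits in a few functions; vertices are counter-clockwise (N, 2) arrays:

```python
import numpy as np

def project(vertices, axis):
    # Project every vertex onto the axis; return the [min, max] interval
    dots = vertices @ axis
    return dots.min(), dots.max()

def separated_on_axis(verts_a, verts_b, axis):
    min_a, max_a = project(verts_a, axis)
    min_b, max_b = project(verts_b, axis)
    return max_a < min_b or max_b < min_a   # True -> this axis separates them

def edge_normals(vertices):
    # Perpendicular of each edge; these are the candidate separating axes
    edges = np.roll(vertices, -1, axis=0) - vertices
    return np.stack([-edges[:, 1], edges[:, 0]], axis=1)

def colliding(verts_a, verts_b):
    # No separating axis among all edge normals of both shapes -> collision
    for axis in np.vstack([edge_normals(verts_a), edge_normals(verts_b)]):
        if separated_on_axis(verts_a, verts_b, axis):
            return False
    return True
```

The manifold extraction (contact point, normal, penetration depth) builds on the same projections by tracking the axis of minimum overlap.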
Collision response — impulse method
Collision response is handled with impulse-based resolution. Given the contact normal, relative velocity at the contact point, and both bodies' mass and inertia, the engine computes the scalar impulse magnitude j from the restitution coefficient, then applies equal and opposite impulses along the normal. A separate friction impulse is applied tangentially, clamped to the Coulomb friction cone.
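For two point masses (dropping the rotational terms for clarity), the scalar impulse computation reduces to:

```python
import numpy as np

def normal_impulse(n, rel_vel, inv_mass_a, inv_mass_b, restitution):
    # Scalar impulse j along the contact normal for two point masses.
    # A sketch: the full engine also includes the (r x n)^2 / I inertia
    # terms so that off-centre contacts produce spin.
    vel_along_normal = rel_vel @ n
    if vel_along_normal > 0:        # bodies already separating, no impulse
        return 0.0
    return -(1 + restitution) * vel_along_normal / (inv_mass_a + inv_mass_b)
```

Each body then receives ±j·n scaled by its inverse mass, and the friction impulse reuses the same structure along the tangent.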
Deep Q-Network agent
With both the neural network library and the physics engine in place, the DQN
implementation is relatively concise. The DeepQAgent class owns two
networks — a Q-network and a target network — and a replay buffer of
MemoryCell dataclass instances.
Architecture
The Q-network maps a 9-dimensional state vector to Q-values for 6 discrete actions:
# Input: [rel_x, rel_y, vx, vy, sin θ, cos θ, ω, left_contact, right_contact]
q_net = (NeuralNetwork(*CostFunctions.use_mse())
         .add_layer(FullyConnectedLayer(9, 128))
         .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
         .add_layer(FullyConnectedLayer(128, 128))
         .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
         .add_layer(FullyConnectedLayer(128, 6)))
Experience replay
At each step the agent stores a (state, next_state, action, reward,
terminal) tuple in the replay buffer. The buffer is capped at 1500 entries;
once full, the oldest 1000 entries are pruned. Training samples a random mini-batch
from this buffer, which breaks the temporal correlation between consecutive transitions
and stabilises learning.
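The buffer logic described above can be sketched directly (the MemoryCell fields follow the tuple described; the ReplayBuffer class name is mine):

```python
import random
from dataclasses import dataclass

@dataclass
class MemoryCell:
    state: tuple
    next_state: tuple
    action: int
    reward: float
    terminal: bool

class ReplayBuffer:
    # Capped at 1500 entries; once full, the oldest 1000 are pruned
    def __init__(self, capacity=1500, prune=1000):
        self.capacity, self.prune = capacity, prune
        self.cells = []

    def store(self, cell):
        self.cells.append(cell)
        if len(self.cells) > self.capacity:
            self.cells = self.cells[self.prune:]

    def sample(self, batch_size):
        # Uniform random mini-batch breaks temporal correlation
        return random.sample(self.cells, min(batch_size, len(self.cells)))
```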
Target network
A copy of the Q-network — the target network — is used to compute TD targets during training. Its weights are frozen and only synchronised with the live network every 50 training calls. This prevents the moving-target problem where the network is chasing Q-values that shift on every update:
# Bellman target
target = reward + gamma * np.max(target_net.predict(next_state))

# Periodically sync the frozen copy with the live network
if self.train_count % 50 == 0:
    self.target_net = self.neural_network.copy()
Two-tier training
A standard replay buffer skews heavily towards failed episodes early in training — the agent rarely succeeds, so successes are underrepresented. To counteract this, the agent maintains a separate list of successful episodes and runs a second training pass over that list after the main pass, using a smaller batch size. This gives the agent a stronger gradient signal from the rare trajectories it should be reinforcing.
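The two passes can be sketched as a single function (names and batch sizes here are illustrative; fit_batch stands in for whatever performs one gradient step on a batch):

```python
import random

def two_tier_train(fit_batch, replay_memory, success_memory,
                   main_batch=64, success_batch=16):
    # Pass 1: ordinary mini-batch from the full replay memory
    fit_batch(random.sample(replay_memory,
                            min(main_batch, len(replay_memory))))
    # Pass 2: smaller batch drawn only from successful episodes,
    # amplifying the signal from rare good trajectories
    if success_memory:
        fit_batch(random.sample(success_memory,
                                min(success_batch, len(success_memory))))
```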
Epsilon-greedy exploration with decay
Action selection is epsilon-greedy: with probability ε the agent picks
a random action; otherwise it picks the action with the highest predicted Q-value.
ε starts at 1.0 and decays multiplicatively by 0.999 after each episode,
flooring at 0.05 to preserve a small amount of exploration indefinitely.
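Both pieces are one-liners in practice (helper names are mine):

```python
import random

def decayed_epsilon(episode, start=1.0, decay=0.999, floor=0.05):
    # Multiplicative decay per episode, floored to keep exploring forever
    return max(floor, start * decay ** episode)

def select_action(q_values, epsilon, rng=random):
    # Epsilon-greedy: explore with probability epsilon, else exploit
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```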
The Lunar Lander environment
The flagship environment tasks the agent with piloting a sprite-based lander onto a platform under gravity. The lander has realistic rotational dynamics — mass, inertia, angular drag — and can collide with the ground platform using the SAT/impulse system described above.
State representation
The 9-dimensional state vector encodes everything the agent needs to reason about its situation: normalised relative position to the platform, velocity components, sine and cosine of the current angle (avoiding angle-wrapping discontinuities), angular velocity, and binary contact signals for the left and right landing feet.
Reward shaping
A sparse terminal reward (+120 success / −120 crash) alone produces extremely slow learning. The reward function is shaped with several dense components computed every step:
- Distance reward — 0.5 per unit closer to the platform
- Speed reward — 0.05 per unit of speed reduction
- Tilt penalty — −0.2 per step of excessive angular deviation
- Alignment reward — bonus for hovering directly above the pad
- Leg contact bonus — +1.0 per step with both feet on the ground
- Thrust penalty — −0.02 per step the engine fires, encouraging efficiency
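Combining the components above into a per-step reward might look like this; the alignment bonus of 0.3 and the exact trigger conditions are assumptions for illustration, not the project's values:

```python
def shaped_reward(dist_delta, speed_delta, tilt_exceeded, aligned,
                  both_feet_down, thrusting):
    # Illustrative combination of the dense components listed above
    r = 0.0
    r += 0.5 * dist_delta              # units moved closer to the platform
    r += 0.05 * speed_delta            # units of speed shed this step
    r -= 0.2 if tilt_exceeded else 0.0
    r += 0.3 if aligned else 0.0       # alignment bonus (value assumed)
    r += 1.0 if both_feet_down else 0.0
    r -= 0.02 if thrusting else 0.0
    return r
```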
Curriculum learning
Rather than immediately presenting the full task, difficulty increases in stages as the agent accumulates successful landings:
- 0–100 successes: fixed spawn position, fixed platform
- 100–200: spawn height increased
- 200–300: spawn height increased further
- 300+: random spawn position each episode
This staged curriculum means the agent builds competence on an easier version of the task before facing its full difficulty — a significant help given the sparse reward structure.
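The stage lookup is a simple threshold function; the "height" values below are illustrative labels for the increasing-spawn-height stages, not the project's actual coordinates:

```python
def curriculum_stage(successes):
    # Map accumulated successful landings to the difficulty stages above
    if successes < 100:
        return {"spawn": "fixed", "height": 1}
    if successes < 200:
        return {"spawn": "fixed", "height": 2}
    if successes < 300:
        return {"spawn": "fixed", "height": 3}
    return {"spawn": "random", "height": 3}
```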
Takeaways
Writing a neural network without autodiff forces you to understand the chain rule concretely — you have to know exactly which tensor is multiplied by which, and why. Writing a physics engine forces you to understand collision geometry and impulse mechanics at the same mathematical level. Combining both into a working DQN agent is a good end-to-end test that every piece is actually correct.
If I were to extend this further, the most impactful additions would be: Adam optimisation (the fixed learning rate is the main bottleneck), ReLU/leaky-ReLU activations (tanh saturates and slows learning in deep layers), and prioritised experience replay to sample high-TD-error transitions more frequently.