2026-04-04 Machine Learning

Building a Deep Q-Network From Scratch — No ML Libraries

A full neural network engine, a physics simulator, and a reinforcement learning agent — written in pure Python and NumPy, without touching PyTorch, TensorFlow, or Keras.

Why build it from scratch?

Modern ML frameworks are powerful precisely because they hide the details — automatic differentiation, GPU kernels, optimiser state — all abstracted away behind a few lines of API. That convenience is great in production, but it makes it easy to treat a neural network as a black box. I wanted to understand every multiply, every gradient, every weight update. So I wrote the whole stack myself.

The result is a two-part system: a modular neural network library built on NumPy, and a Pygame-based 2D physics engine that acts as the training environment. The capstone is a Deep Q-Network (DQN) agent that learns to land a simulated spacecraft — trained entirely inside the custom engine.

The neural network engine

The library is built around two abstract primitives: Layer and NeuralNetwork. Every layer exposes a forward_propagate() and a backward_propagate() method, which makes adding new layer types straightforward. The two concrete implementations are FullyConnectedLayer and ActivationLayer.

Weight initialisation

Weights are initialised with a Xavier/Glorot-style uniform scheme: values are drawn uniformly with bound 1 / sqrt(input_size + output_size). (The textbook Glorot uniform bound is sqrt(6 / (fan_in + fan_out)); this simpler bound differs only by a constant factor.) Scaling the bound with layer width keeps the variance of activations roughly constant across layers at the start of training, which matters especially for deep networks with saturating activations like tanh.
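As a minimal sketch of that initialisation (function name and the use of NumPy's Generator API are illustrative, not taken from the project):

```python
import numpy as np

def init_weights(input_size, output_size, rng=np.random.default_rng(0)):
    # Uniform bound as described above: 1 / sqrt(fan_in + fan_out)
    bound = 1.0 / np.sqrt(input_size + output_size)
    weights = rng.uniform(-bound, bound, size=(input_size, output_size))
    bias = np.zeros((1, output_size))
    return weights, bias
```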

Forward and backward pass

The forward pass through a fully connected layer is a single matrix multiply plus a bias:

def forward_propagate(self, input_data):
    self.input = input_data
    self.output = (input_data @ self.weights) + self.bias
    return self.output

The backward pass computes three quantities using the chain rule: the error to propagate upstream, the weight gradient, and the bias gradient. All three are derived from the output error passed in from the layer above:

def backward_propagate(self, output_error, learning_rate):
    # Gradient to pass to the previous layer
    input_error = output_error @ self.weights.T
    # Weight gradient: outer product of the stored input and the output error
    weights_error = self.input.T @ output_error

    self.weights -= learning_rate * weights_error
    self.bias    -= learning_rate * np.mean(output_error, axis=0)
    return input_error

The activation layer wraps a function and its analytic derivative. Tanh is the only activation used throughout the project — its derivative, 1 − tanh²(x), is computed inline during backprop.

Training loop

NeuralNetwork.train() supports both sequential and random mini-batch sampling. For each batch, it performs a full forward pass over all layers, computes the MSE loss and its derivative, then walks the layer list in reverse to propagate gradients. The cost function and its derivative are injected at construction time, keeping the network decoupled from any specific loss.
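A minimal sketch of that inner loop, assuming the layer API shown earlier; mse and mse_prime stand in for the cost function pair injected at construction time, and train_batch is an illustrative name rather than the project's actual method:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size

def train_batch(layers, x_batch, y_batch, learning_rate):
    # Full forward pass over all layers
    output = x_batch
    for layer in layers:
        output = layer.forward_propagate(output)
    loss = mse(y_batch, output)
    # Walk the layer list in reverse, propagating the loss gradient
    error = mse_prime(y_batch, output)
    for layer in reversed(layers):
        error = layer.backward_propagate(error, learning_rate)
    return loss
```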

# Building a network with a fluent interface
net = (NeuralNetwork(*CostFunctions.use_mse())
    .add_layer(FullyConnectedLayer(784, 100))
    .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
    .add_layer(FullyConnectedLayer(100, 50))
    .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
    .add_layer(FullyConnectedLayer(50, 10))
    .add_layer(ActivationLayer(*ActivationFunctions.use_tanh())))

Serialisation

Trained networks are saved to disk as a folder of .npy weight files plus a config.json that records layer types and activation IDs. Loading reconstructs the full network from those files, making the format human-readable and independent of any pickle-based serialisation.
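The format can be sketched roughly as below — the per-layer file names (layer_0.npy, …) and the config keys are illustrative assumptions, not the project's exact layout:

```python
import json
from pathlib import Path
import numpy as np

def save_network(folder, weights_list, config):
    # One .npy file per layer's weights, plus a human-readable config
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for i, w in enumerate(weights_list):
        np.save(folder / f"layer_{i}.npy", w)
    (folder / "config.json").write_text(json.dumps(config))

def load_network(folder):
    folder = Path(folder)
    config = json.loads((folder / "config.json").read_text())
    n = len(list(folder.glob("layer_*.npy")))
    weights = [np.load(folder / f"layer_{i}.npy") for i in range(n)]
    return weights, config
```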

Supervised baseline: MNIST digit classification

Before moving to reinforcement learning I validated the engine on MNIST. A 784→100→50→10 network (all tanh, MSE loss) was trained for one epoch with a batch size of 500 and a learning rate of 0.2. The network reaches solid accuracy on the training set, confirming that the forward pass, backward pass, and weight update are all correct.

One-hot encoding is used for the labels: the output layer produces a 10-dimensional vector and the predicted digit is the argmax. The trained weights are saved and reloaded via the serialisation system as an end-to-end integration test.

The physics engine

The reinforcement learning environments needed realistic physics, so I built a 2D rigid-body simulator on top of Pygame. The engine handles sprite rendering, input, collision detection, and collision response — all in a fixed 60 Hz update loop.

Integration

Each Sprite stores linear velocity, angular velocity, mass, and inertia, plus separate drag coefficients for linear and quadratic damping (both translational and rotational). Motion is integrated with semi-implicit Euler: velocity is updated first, then the new velocity is used to update position — a symplectic scheme that is markedly more stable than explicit Euler at game-scale timesteps.
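The update order is the whole trick — a sketch of one integration step (standalone function for illustration; the project folds this into Sprite's update):

```python
import numpy as np

def integrate(pos, vel, acc, angle, ang_vel, ang_acc, dt=1.0 / 60):
    # Semi-implicit Euler: update velocity first, then move with the NEW velocity
    vel = vel + acc * dt
    pos = pos + vel * dt
    ang_vel = ang_vel + ang_acc * dt
    angle = angle + ang_vel * dt
    return pos, vel, angle, ang_vel
```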

Collision detection — Separating Axis Theorem

Polygon–polygon collision detection uses the Separating Axis Theorem (SAT). For each edge normal on both shapes, the algorithm projects all vertices onto that axis and checks for overlap. If any axis separates the projections, the shapes are not colliding. When a collision is detected, SAT also returns the contact manifold: the world-space contact point, the collision normal, and the penetration depth.
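The core overlap test can be sketched as follows. This version only returns a boolean — the project's implementation additionally extracts the contact manifold — and the function names are illustrative:

```python
import numpy as np

def project(vertices, axis):
    # Scalar projections of every vertex onto the axis
    dots = vertices @ axis
    return dots.min(), dots.max()

def edge_normals(vertices):
    # Perpendicular to each edge of a convex polygon (CCW vertex order assumed)
    edges = np.roll(vertices, -1, axis=0) - vertices
    return np.stack([-edges[:, 1], edges[:, 0]], axis=1)

def sat_overlap(poly_a, poly_b):
    # If any edge normal of either shape separates the projections, no collision
    for axis in np.vstack([edge_normals(poly_a), edge_normals(poly_b)]):
        min_a, max_a = project(poly_a, axis)
        min_b, max_b = project(poly_b, axis)
        if max_a < min_b or max_b < min_a:
            return False
    return True
```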

Collision response — impulse method

Collision response is handled with impulse-based resolution. Given the contact normal, relative velocity at the contact point, and both bodies' mass and inertia, the engine computes the scalar impulse magnitude j from the restitution coefficient, then applies equal and opposite impulses along the normal. A separate friction impulse is applied tangentially, clamped to the Coulomb friction cone.

Penetration is resolved with a positional correction step: the two bodies are nudged apart along the collision normal in proportion to their inverse masses, avoiding the sinking artefact that pure impulse resolution produces.
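The normal-impulse computation at the heart of this can be sketched as below — rotational terms and the friction impulse are omitted for brevity, and the parameter names are illustrative:

```python
import numpy as np

def normal_impulse(vel_a, vel_b, normal, mass_a, mass_b, restitution):
    # Relative velocity of b with respect to a, projected onto the contact normal
    rel_vel = np.asarray(vel_b) - np.asarray(vel_a)
    vel_along_normal = rel_vel @ np.asarray(normal)
    if vel_along_normal > 0:
        return 0.0  # bodies already separating: no impulse needed
    # Scalar impulse magnitude j from the restitution coefficient and inverse masses
    return -(1 + restitution) * vel_along_normal / (1 / mass_a + 1 / mass_b)
```

Equal and opposite impulses of magnitude j are then applied along the normal, scaled by each body's inverse mass.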

Deep Q-Network agent

With both the neural network library and the physics engine in place, the DQN implementation is relatively concise. The DeepQAgent class owns two networks — a Q-network and a target network — and a replay buffer of MemoryCell dataclass instances.

Architecture

The Q-network maps a 9-dimensional state vector to Q-values for 6 discrete actions:

# Input: [rel_x, rel_y, vx, vy, sin θ, cos θ, ω, left_contact, right_contact]
q_net = (NeuralNetwork(*CostFunctions.use_mse())
    .add_layer(FullyConnectedLayer(9,   128))
    .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
    .add_layer(FullyConnectedLayer(128, 128))
    .add_layer(ActivationLayer(*ActivationFunctions.use_tanh()))
    .add_layer(FullyConnectedLayer(128, 6)))

Experience replay

At each step the agent stores a (state, next_state, action, reward, terminal) tuple in the replay buffer. The buffer is capped at 1500 entries; once full, the oldest 1000 entries are pruned. Training samples a random mini-batch from this buffer, which breaks the temporal correlation between consecutive transitions and stabilises learning.
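A sketch of that buffer, using the cap and prune counts from the text (the ReplayBuffer class name and method names are illustrative; only MemoryCell is from the project):

```python
import random
from dataclasses import dataclass

@dataclass
class MemoryCell:
    state: object
    next_state: object
    action: int
    reward: float
    terminal: bool

class ReplayBuffer:
    # Cap at 1500 entries; once full, discard the oldest 1000
    def __init__(self, capacity=1500, prune=1000):
        self.capacity, self.prune = capacity, prune
        self.cells = []

    def store(self, cell):
        if len(self.cells) >= self.capacity:
            self.cells = self.cells[self.prune:]
        self.cells.append(cell)

    def sample(self, batch_size):
        # Random mini-batch breaks temporal correlation between transitions
        return random.sample(self.cells, min(batch_size, len(self.cells)))
```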

Target network

A copy of the Q-network — the target network — is used to compute TD targets during training. Its weights are frozen and only synchronised with the live network every 50 training calls. This prevents the moving-target problem where the network is chasing Q-values that shift on every update:

# Bellman target (no bootstrap on terminal states)
target = reward
if not terminal:
    target += gamma * max(target_net.predict(next_state))

# Periodically sync
if self.train_count % 50 == 0:
    self.target_net = self.neural_network.copy()

Two-tier training

A standard replay buffer skews heavily towards failed episodes early in training — the agent rarely succeeds, so successes are underrepresented. To counteract this, the agent maintains a separate list of successful episodes and runs a second training pass over that list after the main pass, using a smaller batch size. This gives the agent a stronger gradient signal from the rare trajectories it should be reinforcing.

Epsilon-greedy exploration with decay

Action selection is epsilon-greedy: with probability ε the agent picks a random action; otherwise it picks the action with the highest predicted Q-value. ε starts at 1.0 and decays multiplicatively by 0.999 after each episode, flooring at 0.05 to preserve a small amount of exploration indefinitely.
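The decay schedule in one line (function name illustrative):

```python
def decayed_epsilon(episode, start=1.0, decay=0.999, floor=0.05):
    # Multiplicative decay per episode, floored to keep some exploration forever
    return max(floor, start * decay ** episode)
```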

The Lunar Lander environment

The flagship environment tasks the agent with piloting a sprite-based lander onto a platform under gravity. The lander has realistic rotational dynamics — mass, inertia, angular drag — and can collide with the ground platform using the SAT/impulse system described above.

State representation

The 9-dimensional state vector encodes everything the agent needs to reason about its situation: normalised relative position to the platform, velocity components, sine and cosine of the current angle (avoiding angle-wrapping discontinuities), angular velocity, and binary contact signals for the left and right landing feet.
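Assembling that vector might look like the sketch below — the function name and the normalisation scale of 100 pixels are assumptions for illustration:

```python
import math

def build_state(lander_pos, platform_pos, vel, angle, ang_vel,
                left_contact, right_contact, scale=100.0):
    # 9-D state: normalised relative position, velocity, sin/cos of the angle
    # (avoids angle-wrapping discontinuities), angular velocity, foot contacts
    rel_x = (platform_pos[0] - lander_pos[0]) / scale
    rel_y = (platform_pos[1] - lander_pos[1]) / scale
    return [rel_x, rel_y, vel[0], vel[1],
            math.sin(angle), math.cos(angle), ang_vel,
            float(left_contact), float(right_contact)]
```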

Reward shaping

A sparse terminal reward (+120 success / −120 crash) alone produces extremely slow learning. The reward function is shaped with several dense components computed every step:

  • Distance reward — 0.5 per unit closer to the platform
  • Speed reward — 0.05 per unit of speed reduction
  • Tilt penalty — −0.2 for excessive angular deviation
  • Alignment reward — bonus for hovering directly above the pad
  • Leg contact bonus — +1.0 per step with both feet on the ground
  • Thrust penalty — −0.02 per step the engine fires, encouraging efficiency

Curriculum learning

Rather than immediately presenting the full task, difficulty increases in stages as the agent accumulates successful landings:

  • 0–100 successes: fixed spawn position, fixed platform
  • 100–200: spawn height increased
  • 200–300: spawn height increased further
  • 300+: random spawn position each episode

This staged curriculum means the agent builds competence on an easier version of the task before facing its full difficulty — a significant help given the sparse reward structure.

Takeaways

Writing a neural network without autodiff forces you to understand the chain rule concretely — you have to know exactly which tensor is multiplied by which, and why. Writing a physics engine forces you to understand collision geometry and impulse mechanics at the same mathematical level. Combining both into a working DQN agent is a good end-to-end test that every piece is actually correct.

If I were to extend this further, the most impactful additions would be: Adam optimisation (the fixed learning rate is the main bottleneck), ReLU/leaky-ReLU activations (tanh saturates and slows learning in deep layers), and prioritised experience replay to sample high-TD-error transitions more frequently.
