An Engineer's Guide to Deep Learning Optimizers
Part 1: The Impossible Dream and the Noisy Workaround
Introduction: The Mountain in the Fog
If you’re an engineer, you’ve been trained to find the “best” solution. So when you start with deep learning, you hit a simple problem: we have a loss function, and we just want to find the lowest point. This is a solved problem, right?
Well, no. The “loss landscape” of a neural network isn’t a smooth, convex bowl. It’s a nightmarish, high-dimensional mountain range with a million peaks, valleys, and saddle points, and it’s all shrouded in a dense fog.
- High-Dimensional: You don’t have two parameters (x, y) to tune. You have 175 billion.
- Non-Convex: There are countless “local minima” (valleys) that look like the lowest point but aren’t.
- Stochastic: You’re in a fog. You can’t even see the whole mountain range. You can only get a noisy guess of the slope from your immediate surroundings.
This is the real job of an optimizer: not to find the “perfect” global minimum, but to find any wide, flat, “good enough” valley, as quickly as possible, without falling off a cliff.
This is the story of how we built a better “hiker” for this insane terrain. It starts with the textbook idea and its first, critical, systems-level failure.
The Baseline: Gradient Descent (GD)
Gradient Descent (also called “Batch” Gradient Descent) is the simple, textbook algorithm you learn in your first machine learning class.
The Idea: “Before I take a single step, I will poll every single person (data point) in the entire dataset. I’ll average all their opinions of which way is ‘downhill’ to get a perfect, noise-free gradient, and then I’ll take one, confident step in that exact direction.”
The update rule is just: \(W_{\text{new}} = W_{\text{old}} - \eta \cdot \nabla L(W)\)
Where $\nabla L(W)$ is the gradient (derivative) of the loss, calculated over all training examples.
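To make the update rule concrete, here is a minimal sketch of full-batch GD on a toy least-squares problem (the `mse_grad` helper and the synthetic data are illustrative, not part of any library):

```python
import numpy as np

def mse_grad(W, X, y):
    """Gradient of the mean-squared-error loss, averaged over ALL examples."""
    return 2 * X.T @ (X @ W - y) / len(y)

def gd_step(W, X, y, lr=0.1):
    """One batch-GD step: a full pass over the data buys exactly one update."""
    return W - lr * mse_grad(W, X, y)

# Toy data; note that X here stands in for the *entire* dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.ones(5)
W = np.zeros(5)
for _ in range(100):
    W = gd_step(W, X, y)
print(W.round(3))  # converges toward the true weights [1, 1, 1, 1, 1]
```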
The “Why Not”: The Systems-Level Dealbreaker
This sounds great, but it’s a systems-level catastrophe.
- Problem: You have a 10TB dataset of images.
- The GD approach: You must run a full forward and backward pass over all 10 terabytes of data just to compute a single gradient.
- The Result: You will update your weights once every few hours. Your model will be “training” but going nowhere.
It’s computationally pure, but practically impossible. The bottleneck isn’t the math; it’s the I/O and compute time of using the entire dataset for every single step.
The Workaround: Stochastic Gradient Descent (SGD)
This is the first brilliant, practical “patch” that makes deep learning possible.
The Idea: “Polling the entire dataset is a waste of time. I’ll just grab a small, random handful (a “mini-batch”) of 64 data points, ask them which way is downhill, and take a quick, messy, ‘good enough’ step in that direction.”
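In code, the only change from batch GD is which rows feed the gradient. A sketch (reusing the illustrative `mse_grad` from the Part 1 example; the batch size of 64 matches the quote above):

```python
import numpy as np

def mse_grad(W, X, y):
    """Same toy loss gradient as before, but X, y are now a mini-batch."""
    return 2 * X.T @ (X @ W - y) / len(y)

def sgd_step(W, X, y, rng, lr=0.1, batch_size=64):
    """One SGD step: a noisy gradient from 64 random rows, not the whole dataset."""
    idx = rng.integers(0, len(y), size=batch_size)  # grab a random handful
    return W - lr * mse_grad(W, X[idx], y[idx])
```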
The “Why”: The Two Big Wins
SGD solves the first problem but introduces a new one (which we’ll fix later).
1. It’s Fast. Mind-Bogglingly Fast. This is the main point. In the time it takes GD to compute one perfect step, SGD has already taken 10,000 “noisy” steps and is 90% of the way to the valley. It’s the difference between planning a perfect cross-country road trip and just getting in the car and driving west.
2. The “Bug” is a Feature: Noise is a Regularizer. This is the non-obvious magic of SGD. The gradient from a mini-batch is noisy. It’s not the “true” gradient. This means the optimizer’s path is a chaotic, zig-zagging mess.
It turns out this is exactly what we want.
In a non-convex landscape, the “perfect” minimum is often a very sharp, narrow valley. This is called overfitting—the model has perfectly memorized the training data but will fail on new data.
The noise in SGD acts like a drunk hiker. It’s too shaky and chaotic to fall into those narrow, sharp valleys. It prefers to “bounce around” until it settles into a wide, flat valley. A flat valley is a generalizable solution—it means that even if the new data is slightly different, the loss is still low.
The New Problem
But that “zig-zag” is still a problem.
- If the valley is a steep, narrow ravine, SGD will spend all its time bouncing off the walls, making very slow progress down the actual slope.
- If the valley is a very flat plateau, the gradients are tiny, and the “noisy” steps are almost zero. SGD just sits there, barely moving.
This is our next bug. We need a way to dampen the zig-zagging and build up speed (momentum) in consistent directions.
…and that’s exactly what our first “patch,” SGD with Momentum, is designed to fix.
Part 2: The Snowball Patch (SGD with Momentum)
The Baseline: The “Drunk Hiker” (SGD)
In Part 1, we established our baseline: Stochastic Gradient Descent (SGD). It’s the “drunk hiker” in the foggy, non-convex mountain range of our loss landscape.
- The Good: It’s fast (uses mini-batches) and its “noisy” path is a feature, helping it bounce out of sharp, “overfitting” minima and find the wide, generalizable valleys.
- The Bad: That same noise is a critical bug. It creates two huge problems:
- Oscillation: In a steep, narrow ravine, the hiker just bounces from wall to wall, making almost no downhill progress.
- Stalling: On a long, flat plateau, the gradient (slope) is tiny. The hiker takes tiny, hesitant steps and barely moves at all, stalling the training.
We need a fix. We need to give our hiker a way to “average out” the noise from the zig-zagging and “build up speed” on the flats. This fix is called Momentum.
The First Fix: SGD with Momentum (The Snowball)
The Idea: Instead of taking a step based only on the current (noisy) gradient, we take a step based on a moving average of all past gradients.
Think of it as replacing our lightweight hiker with a heavy, unstoppable snowball.
At each step, we do two things:
- We apply friction to the “snowball” (the old momentum), reducing its speed by a fraction (e.g., 10%).
- We add the new gradient (the “push” from the new mini-batch) to the snowball.
In code, this “velocity” vector $v$ is updated like this (where $\beta$ is the momentum term, usually 0.9):
\(v_t = (\beta \cdot v_{t-1}) + g_t\)

\(W_{\text{new}} = W_{\text{old}} - \eta \cdot v_t\)
(Note: You’ll see different forms of this equation, but they all share this core idea: the current step is a combination of the previous step and the new gradient.)
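In code, the two-step update above looks like this (a minimal sketch; carrying `v` between calls is the caller's job):

```python
import numpy as np

def momentum_step(W, v, grad, lr=0.01, beta=0.9):
    """SGD with Momentum: friction on the old velocity, plus the new push."""
    v = beta * v + grad   # shrink the snowball by 10%, then add the new gradient
    return W - lr * v, v  # step along the smoothed direction

# Usage: the snowball starts at rest (this is the "cold start" we meet later).
W, v = np.zeros(2), np.zeros(2)
```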
This one simple change brilliantly solves both of our problems.
1. The Fix for Oscillation: The “Smoother”
How does this stop the “zig-zag”?
Imagine our optimizer is in that narrow ravine. The “bounces” are just noisy gradients pointing in opposite directions.
- Step 1 Gradient ($g_1$): `[+10, -0.1]` (bounces right, moves slightly down)
- Step 2 Gradient ($g_2$): `[-10, -0.1]` (bounces left, moves slightly down)

Vanilla SGD would just move right, then left, and barely go anywhere.

Momentum averages them:

- Step 1 Velocity ($v_1$): `[+10, -0.1]`
- Step 2 Velocity ($v_2$): `v_2 = (0.9 * [+10, -0.1]) + [-10, -0.1] = [+9, -0.09] + [-10, -0.1] = [-1, -0.19]`
Look at that! The horizontal part (the +10 and -10) has averaged out and cancelled. The vertical part (the -0.1 and -0.1) has accumulated.
The snowball’s momentum in the zig-zag directions dies, while its momentum in the consistent “downhill” direction builds. The path becomes dramatically smoother.
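You can check the arithmetic above in a few lines (same toy gradients as in the example):

```python
import numpy as np

g1 = np.array([+10.0, -0.1])  # bounce right, slightly down
g2 = np.array([-10.0, -0.1])  # bounce left, slightly down

v1 = g1
v2 = 0.9 * v1 + g2
print(v2)  # [-1.   -0.19] -- the zig-zag cancels, the downhill part survives
```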
2. The Fix for Stalling: The “Accelerator”
This is the “snowball” effect.
Now, imagine our optimizer is on that long, flat plateau. The gradient is tiny but consistent.
- Gradient at all steps: `[-0.01]` (a tiny, but consistent push)
Vanilla SGD would just take tiny steps: -0.01, -0.01, -0.01… It would take 100 steps just to move a total of 1.0.
Momentum builds up speed:
- $v_1$: $-0.01$
- $v_2$: $(0.9 \cdot -0.01) + (-0.01) = -0.019$
- $v_3$: $(0.9 \cdot -0.019) + (-0.01) = -0.0271$
- $v_4$: $(0.9 \cdot -0.0271) + (-0.01) = -0.03439$
The velocity is accelerating. The snowball is picking up speed, allowing the optimizer to “shoot” across the flat plateau and converge much faster.
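A quick loop shows the same build-up, plus its natural limit: with this form of the update, a constant gradient $g$ drives the velocity toward $g / (1 - \beta)$, ten times the raw step for $\beta = 0.9$ (a sketch using the toy numbers above):

```python
v, g, beta = 0.0, -0.01, 0.9
for _ in range(50):
    v = beta * v + g
print(v)  # ~ -0.0995, approaching the terminal velocity g / (1 - beta) = -0.1
```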
The New Problem
So, we’re done, right? SGD with Momentum is great. It’s fast, and it’s smooth.
We’ve solved the big problems, but we’ve exposed two new, more subtle ones.
- The “Cold-Start” Problem: What is the velocity $v_0$ at the very beginning of training? It’s `0`. Because our “snowball” starts with zero momentum, it takes several steps to “warm up” and get to a reasonable speed. This means the first few iterations are artificially slow and hesitant. This is the initialization bias.
- The “One-Size-Fits-All” Problem: We are still using one learning rate $\eta$ for all 175 billion parameters. This is a huge issue. What if the loss landscape for your `Weight_1` is a gentle, flat plain, but the landscape for `Weight_1000` is a treacherous, spiky mountain?
  - `Weight_1` needs a big learning rate to move faster.
  - `Weight_1000` needs a tiny learning rate to avoid exploding.

  Our current optimizer gives them both the same learning rate. It’s like forcing our downhill skier and our cross-country skier to use the same pair of skis.
To fix this, we need a way to give every parameter its own, unique, adaptive learning rate.
And that brings us to the next generation of optimizers… and the clever ideas that led to Adam.
Part 3: The King of the Hill (Adaptive Moment Estimation)
The Story So Far
In Part 1, we established that “Batch” Gradient Descent is impossible, so we use Stochastic Gradient Descent (SGD). But SGD, our “drunk hiker,” is noisy. It zig-zags in steep ravines and stalls on flat plains.
In Part 2, we added the Momentum “patch,” which turns our hiker into a heavy “snowball.” This solves our two problems:
- It smooths out the zig-zagging by averaging gradients.
- It accelerates across flat plains by building up speed.
But we were left with two new, more subtle bugs in our optimizer:
- The “Cold-Start” Problem: Our snowball starts with zero velocity. It takes a bunch of steps to “warm up,” making the start of training artificially slow. This is an initialization bias.
The “One-Size-Fits-All” Problem: We’re still using one learning rate $\eta$ for all 175 billion parameters. This is a huge issue. The terrain for `Weight_1` might be a flat, gentle plain (needs a big step), while the terrain for `Weight_1000` is a spiky, treacherous mountain (needs a tiny step). We’re giving them both the same boot size.
We need an optimizer that can fix both. We need a “snowball” that starts fast and has “adaptive skis” that adjust to the terrain of each parameter.
This brings us to Adam.
The Final Patch: Adam (Adaptive Moment Estimation)
The name “Adam” is a spoiler. It stands for Adaptive Moment Estimation. It doesn’t just estimate the first moment (the “snowball” velocity). It also estimates the second moment (the “volatility” of the gradient) and uses it to adapt the learning rate.
Adam is not a new idea from scratch. It’s the “Avengers Assemble” of optimizers, combining two best-in-class solutions into one algorithm.
The Momentum Fix (The “M” in Adam)
This first part of Adam is just SGD with Momentum, but with a fix for the “cold-start” problem.
- The Biased Snowball ($m_t$): Adam keeps track of the momentum (the first moment) just like before, using the $(1 - \beta_1)$-scaled variant of the update from Part 2: \(m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t\)
- The “Cold-Start” Bug: As we said, $m_0$ is initialized to 0. This makes $m_1$, $m_2$, $m_3$, etc., all artificially small. They are biased toward zero.
- The Unbiasing Trick: The Adam authors provided a simple, brilliant fix. They calculate this bias analytically and just divide it out. \(\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\)
Let’s see this in action (with $\beta_1 = 0.9$):
- At Step 1 ($t=1$): The denominator is $(1 - 0.9^1) = 0.1$. The biased $m_1$ is divided by 0.1 (multiplied by 10), perfectly correcting it.
- At Step 500 ($t=500$): The denominator is $(1 - 0.9^{500})$, which is $\approx 1.0$. The correction automatically fades away and does nothing.
This “bias-corrected” momentum $\hat{m}_t$ is the first half of Adam. It’s a “snowball” that starts at full speed.
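Here is the trick in code form (a sketch; the constant gradient of 1.0 is just to make the bias obvious):

```python
beta1, g, m = 0.9, 1.0, 0.0   # pretend every mini-batch gradient is exactly 1.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)      # the unbiasing trick
    print(t, round(m, 4), round(m_hat, 4))
# t=1: m=0.1   m_hat=1.0  (raw m is 10x too small; the correction restores it)
# t=2: m=0.19  m_hat=1.0
# t=3: m=0.271 m_hat=1.0
```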
The Adaptive Fix (The “A” in Adam)
This is the second, more powerful idea. It solves the “one-size-fits-all” problem.
- The Problem: We need a per-parameter learning rate. We need to know if the terrain for `Weight_1` is flat or spiky.
- The Fix (RMSprop): We can measure the “spikiness” of the terrain by tracking a moving average of the squared gradients. This is the second moment ($v_t$): \(v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2\)
- Why squared? Squaring $g_t$ makes it positive. Now, $v_t$ is a measure of “gradient volatility.”
- If $v_t$ is large, it means this parameter has huge, spiky gradients. The terrain is treacherous.
- If $v_t$ is small, it means this parameter has tiny, consistent gradients. The terrain is flat.
- The “Aha!” Moment: The final Adam update rule is, conceptually: \(W_{\text{new}} = W_{\text{old}} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
Look at that division. We are dividing our “snowball” step ($\hat{m}_t$) by the square root of its “volatility” ($\hat{v}_t$).
This gives every single parameter its own learning rate:
- Spiky Terrain (Large $v_t$): The denominator is big. The step size becomes small. Adam says, “Whoa, be careful on this parameter. Take tiny, safe steps.”
- Flat Terrain (Small $v_t$): The denominator is tiny. The step size is amplified. Adam says, “This parameter is on a boring flat, let’s give it a boost and go faster!”
(And yes, this $v_t$ term is also biased toward zero at the start, so Adam applies the same unbiasing trick to it, giving $\hat{v}_t$).
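Putting both halves together, here is a minimal sketch of one Adam step, straight from the equations above (default hyperparameters follow the common convention; treat this as an illustration, not a replacement for a library optimizer):

```python
import numpy as np

def adam_step(W, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected snowball, scaled by per-parameter volatility."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: direction ("snowball")
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: volatility ("terrain")
    m_hat = m / (1 - beta1 ** t)             # unbias both (the cold-start fix)
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return W, m, v

# Usage: m and v start at zero; t counts steps starting from 1.
W, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
```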
Conclusion: The King is Crowned
Adam is the default king because it combines both fixes.
- It’s a snowball (using momentum, $\hat{m}_t$) that knows which *direction* to go.
- It has adaptive skis (using $\sqrt{\hat{v}_t}$) that automatically adjust for the *terrain* of every single parameter.
- It has a “warm-up” pack (the bias correction) so it can start at full speed from iteration 1.
It solves every major problem we’ve identified. It’s fast, it’s robust, it handles initialization, and it doesn’t need nearly as much learning-rate tuning.
This is, by far, the most common optimizer you’ll see.
…But is it perfect?
If Adam is so great, why do many state-of-the-art research papers still use plain SGD with Momentum?
It turns out, there’s a “Level 3” debate. Some researchers believe that Adam’s adaptive nature, while fast, can sometimes “find” a worse minimum—one that is sharp and doesn’t generalize as well as the wide, flat valleys that the “dumber” SGD + Momentum eventually stumbles into.
But for 99% of engineers, Adam is the robust, reliable, and powerful tool that gets the job done.