Practical Guide to LLM: Optimizers
A brief history of LLM optimizers :)
This is part of a series of short posts that aims to provide intuitions and implementation hints for LLM basics. I will be posting more as I refresh my knowledge. If you like this style of learning, follow for more :)
Note: as we are both learning, please point out if anything seems inaccurate so that we can improve together.
Optimizer: how we apply gradients to weights
Deep learning training = repeatedly nudging the weights in the direction opposite to the gradient of the loss.
Optimizer = the algorithm that converts raw gradients into those weight updates (steps).
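Concretely, every optimizer in this post fits the same loose template (written informally; "state" is whatever history the optimizer keeps, such as momentum buffers):

step = f(gradient, state)
weights = weights - step

The optimizers below differ only in how f is defined.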
We will introduce optimizers in chronological order.
SGD
SGD is the most straightforward approach: gather a batch, compute the gradient of the loss over that batch, and subtract a small fraction of it (the learning rate) from the weights. Here is an over-simplified code snippet:
criterion = nn.MSELoss()
outputs = model(batch_x)                    # forward pass
loss = criterion(outputs, batch_y)
loss.backward()                             # fills param.grad for every parameter
with torch.no_grad():                       # update weights without tracking the op
    for param in model.parameters():
        param -= learning_rate * param.grad
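In practice you would not hand-roll this loop: PyTorch ships the same update as torch.optim.SGD. A minimal equivalent sketch, assuming torch is imported and model, batch_x, batch_y, learning_rate are defined as above:

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
optimizer.zero_grad()                       # clear gradients from the previous step
loss = criterion(model(batch_x), batch_y)   # forward pass + loss
loss.backward()                             # compute gradients
optimizer.step()                            # applies param -= lr * param.grad for every param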
SGD with momentum
The problem with SGD is its fixed learning rate: 1) if the learning rate is too small, training crawls and can get stuck in a local minimum; 2) if the learning rate is too big, the loss oscillates or diverges. Momentum addresses this, as sketched below.
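As a preview of the fix, momentum keeps a running "velocity" of past gradients and steps along that smoothed direction instead of the raw gradient. A minimal sketch of the classical heavy-ball update, assuming model and learning_rate from the SGD snippet; momentum is a new hyperparameter, commonly around 0.9:

momentum = 0.9
velocity = [torch.zeros_like(p) for p in model.parameters()]  # one buffer per parameter

# inside the training loop, after loss.backward():
with torch.no_grad():
    for p, v in zip(model.parameters(), velocity):
        v.mul_(momentum).add_(p.grad)       # v = momentum * v + grad
        p -= learning_rate * v              # step along the smoothed direction

Because the velocity accumulates gradients that keep pointing the same way and cancels the ones that flip sign, the effective step grows along consistent directions and shrinks where plain SGD would oscillate.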