Practical Guide to LLM: Optimizers
A brief history of LLM optimizers :)
This is part of a series of short posts that aims to provide intuitions and implementation hints for LLM basics. I will be posting more as I refresh my knowledge. If you like this style of learning, follow for more :)
Note: as we are both learning, please point out if anything seems inaccurate so that we can improve together.
Optimizer: how we apply gradients to weights
Deep learning training = repeatedly nudging the weights in the direction opposite to the gradient of the loss.
Optimizer = the algorithm that converts raw gradients into those weight updates (steps).
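Concretely, every optimizer in this post fits the same loose template (written informally; "state" is whatever history the optimizer keeps, such as momentum buffers):

step = f(gradient, state)
weights = weights - step

The optimizers below differ only in how f is defined.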
We will introduce optimizers in chronological order.
SGD
SGD is the most straightforward approach: gather a batch, compute the gradient of the loss over that batch, and subtract a small fraction of it (the learning rate) from the weights. Here is an over-simplified code snippet:
criterion = nn.MSELoss()
outputs = model(batch_x)                    # forward pass
loss = criterion(outputs, batch_y)
loss.backward()                             # fills param.grad for every parameter
with torch.no_grad():                       # update weights without tracking the op
    for param in model.parameters():
        param -= learning_rate * param.grad
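In practice you would not hand-roll this loop: PyTorch ships the same update as torch.optim.SGD. A minimal equivalent sketch, assuming torch is imported and model, batch_x, batch_y, learning_rate are defined as above:

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
optimizer.zero_grad()                       # clear gradients from the previous step
loss = criterion(model(batch_x), batch_y)   # forward pass + loss
loss.backward()                             # compute gradients
optimizer.step()                            # applies param -= lr * param.grad for every param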
SGD with momentum
The problem with SGD is its fixed learning rate: 1) if the learning rate is too small, training crawls and can get stuck in a local minimum; 2) if the learning rate is too big, the loss oscillates or diverges. Momentum addresses this, as sketched below.
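As a preview of the fix, momentum keeps a running "velocity" of past gradients and steps along that smoothed direction instead of the raw gradient. A minimal sketch of the classical heavy-ball update, assuming model and learning_rate from the SGD snippet; momentum is a new hyperparameter, commonly around 0.9:

momentum = 0.9
velocity = [torch.zeros_like(p) for p in model.parameters()]  # one buffer per parameter

# inside the training loop, after loss.backward():
with torch.no_grad():
    for p, v in zip(model.parameters(), velocity):
        v.mul_(momentum).add_(p.grad)       # v = momentum * v + grad
        p -= learning_rate * v              # step along the smoothed direction

Because the velocity accumulates gradients that keep pointing the same way and cancels the ones that flip sign, the effective step grows along consistent directions and shrinks where plain SGD would oscillate.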