
Replicating the Scaling Laws!

Goal

"Scaling laws, scaling laws, scaling laws." It's all i ever hear about nowadays in the online AI space, but what are they? And can i replicate them? In this blog, I get my hands dirty to try to deeply understand the scaling laws!

What are the scaling laws

Put simply

Imagine you're a cook, and you're given the recipe for pizza: sauce, dough, and cheese. The scaling laws say that the more cheese you have (as long as there's enough dough and sauce), the more pizza you can make. Or, the more dough you have (as long as there's enough cheese and sauce), the more pizza you can make!

For autoregressive models, the recipe is model size, dataset size, and compute. The scaling laws say that the more compute, data, or parameters you add, as long as you're not bottlenecked by the others, the lower the loss gets, and the more capable the model becomes!

Put exactly, the relationship is a power law, like so:

L ∝ N^(−α_N)
L ∝ D^(−α_D)
L ∝ C^(−α_C)
Kaplan et al. scaling laws: test loss versus compute, dataset size, and parameters.

where each α is an empirical scaling exponent for that factor (Kaplan et al. report roughly α_N ≈ 0.076, α_D ≈ 0.095, and α_C ≈ 0.050).

The key insight of these laws is that other factors (like network width or depth) don't seem to have as big of an impact as compute, dataset size, and model size.

So, this explains why Sam Altman and Dario Amodei are raising billions and billions of dollars for compute, albeit with somewhat diminishing returns.

If all they did was increase their compute budget from $100 billion to $200 billion, one might think the model would be 2x better, but based on the scaling laws we'd actually only see about a 3.4% decrease in model loss!
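To make that concrete, here's a quick back-of-the-envelope check. I'm assuming the Kaplan et al. compute exponent of roughly α_C ≈ 0.050, which is the value the 3.4% figure above works out from.

```python
# Back-of-the-envelope check of the diminishing-returns claim,
# assuming the Kaplan et al. compute exponent alpha_C ~= 0.050.
alpha_C = 0.050
loss_ratio = 2 ** (-alpha_C)  # L(2C) / L(C), since L is proportional to C^(-alpha_C)
print(f"Doubling compute multiplies loss by {loss_ratio:.4f}")
print(f"That's only a {(1 - loss_ratio) * 100:.1f}% decrease in loss")  # ~3.4%
```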

This explains why a model like GPT-4.5, which was trained with much more compute and at a much larger model size than its predecessor GPT-4o, showed modest gains for a significantly higher price!

Anyway, let's get to the experiments.

Setup

A few technical details: in these experiments, I used a small decoder-only autoregressive language model written in PyTorch, with a single attention head. I trained on WikiText-2, WikiText-103, and a subset of OpenWebText, and used the AdamW optimizer.

We denote the number of model parameters as N, the dataset size in tokens as D, and the training compute as C.
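To make the setup concrete, here is a minimal sketch of the kind of model this describes: a tiny decoder-only transformer with a single attention head, in PyTorch. The hyperparameters (vocab size, width, depth, context length) are placeholders I picked for illustration, not the exact values from my runs.

```python
import torch
import torch.nn as nn

class TinyDecoderBlock(nn.Module):
    """One pre-norm transformer decoder block with a single attention head."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: True entries are positions a token is NOT allowed to attend to.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

class TinyLM(nn.Module):
    """Small autoregressive LM: token ids in, next-token logits out."""
    def __init__(self, vocab_size=50257, d_model=128, n_layers=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([TinyDecoderBlock(d_model) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):  # idx: (batch, seq_len) token ids
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # (batch, seq_len, vocab_size) logits

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # AdamW, as in the post
```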

Experiments

Click here to see my Colab notebook.

1. Scaling model size

For our first experiment, we wanted to see whether varying model size, while fixing compute and dataset size, would follow the scaling laws. At first, we tried training models of different sizes on the smaller WikiText-2 dataset.
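As a rough sketch of how a sweep like this can be set up (using the TinyLM class from the setup sketch above; `train_and_eval` is a hypothetical stand-in for the training loop the post doesn't show, and the widths and step count are illustrative, not my exact configs):

```python
# Model-size sweep: vary width, hold the data and step budget fixed,
# and record (parameter count, best validation loss) pairs.
results = []
for d_model in [64, 128, 256, 512]:                 # illustrative widths
    model = TinyLM(d_model=d_model)                  # TinyLM from the setup sketch
    n_params = sum(p.numel() for p in model.parameters())
    val_loss = train_and_eval(model, tokens, max_steps=5_000)  # fixed data + fixed compute
    results.append((n_params, val_loss))
```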

Running the models, we get these results.

Validation loss vs model size (log x-axis) on WikiText-2.

This is nice, but

We noticed that towards the higher end of model size, the loss drop was plateauing. I hypothesized that this had to do with WikiText-2 being somewhat small, so we tried WikiText-2's bigger brother, WikiText-103, which is about 50x bigger than WikiText-2.

Here are the results

WikiText-103 log-log plot: validation loss vs parameters.

Not bad! Looks a lot like the Kaplan et al. graph.

After fitting a line of best fit to the results in log-log space, we get:

Log-log fit for scaling exponent: log L = -0.0431 log N + 2.6107.
log L = −0.0431 log N + 2.6107

It's clear that there is indeed a power-law relationship between loss and model size.
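For anyone curious how a fit like this is computed: fit a straight line to the (log N, log L) pairs, and the slope is the scaling exponent. A minimal sketch with NumPy, assuming `results` holds the (parameter count, validation loss) pairs from the sweep above:

```python
import numpy as np

# Fit a straight line in log-log space; the slope is the scaling exponent.
# (The slope is the same for any log base, as long as both axes use the same one.)
n_params = np.array([n for n, _ in results], dtype=float)
val_loss = np.array([l for _, l in results], dtype=float)
slope, intercept = np.polyfit(np.log(n_params), np.log(val_loss), deg=1)
print(f"log L = {slope:.4f} log N + {intercept:.4f}")
```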

2. Scaling dataset size

For our second experiment, we decided to vary the dataset size and fix compute and model size.
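Sketched in the same style as before (again, `tokens` is the full tokenized training set and `train_and_eval` is the same hypothetical helper; the fractions are illustrative):

```python
# Dataset-size sweep: take prefixes of the tokenized corpus, keep the model
# architecture and step budget fixed, and measure validation loss.
for frac in [0.01, 0.05, 0.25, 1.0]:                 # illustrative fractions of the corpus
    subset = tokens[: int(len(tokens) * frac)]
    model = TinyLM()                                  # fixed model size
    val_loss = train_and_eval(model, subset, max_steps=5_000)  # same fixed compute each run
```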

After running the experiment, we got these results

Validation loss vs dataset size for different model sizes (6.5M, 26M, 51M, 103.7M params).

I noticed that increasing the dataset size dropped validation loss quickly at smaller dataset sizes, but as we added more and more data, the improvement plateaued.

The paper talks about how these laws only apply as long as they're not bottlenecked by the other factors, which here would be compute and model size. So, I increased model size, as you can see in the graph above. This gave better results, but there was still a plateau, so I knew the other bottleneck had to be compute. This led to my last experiment.

3. Scaling compute to remove the bottleneck

For our third experiment, I hypothesized that the reason we were seeing these plateaus in model performance was that the models were being undertrained due to a lack of compute. To address this, we decided to scale up compute as we scaled up dataset size, so the model would have enough compute to actually take advantage of the larger datasets.
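Concretely, the only change from experiment 2 in this sketch is that the step budget grows with the subset instead of staying fixed. The batch size and sequence length below are assumptions, not my exact settings:

```python
# Experiment 3 sketch: let training steps grow in proportion to dataset size,
# so bigger subsets actually get trained on rather than being cut off by a
# fixed step budget. TOKENS_PER_STEP = batch_size * sequence_length (assumed).
TOKENS_PER_STEP = 8 * 256
for frac in [0.01, 0.05, 0.25, 1.0]:
    subset = tokens[: int(len(tokens) * frac)]
    steps = max(1, len(subset) // TOKENS_PER_STEP)    # roughly one pass over the subset
    model = TinyLM()
    val_loss = train_and_eval(model, subset, max_steps=steps)
```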

The orange line shows the results! Much better.

51M model: fixed compute vs compute scaled with D.

As dataset size and compute scale together, the loss drops in a power-law fashion. Nice.

Interpretation

Given these results, it looks like the scaling laws really do hold, at least for our small setup. As you increase model size, dataset size, and compute, the loss decreases in a power-law fashion. But the cool part is that the experiments also showed the bottlenecks the Kaplan et al. paper talks about: small datasets caused early plateaus, bigger models made larger datasets useful for longer, and some of the plateaus disappeared once compute was scaled up. So the scaling laws are not just about "more is better," but about which factor is currently limiting performance.

Takeaways

The scaling laws still hold for models trained on trillions and trillions of tokens: compute, model size, and dataset size still primarily govern the quality of large models. But this also shows a limitation. As time goes on, multiplying the amount of compute you can allocate towards a model becomes harder and harder, and because it is a power law, the returns of scaling diminish. With test-time compute, labs were able to ride this scaling a little longer, but can they keep it up at this pace? Or will it even matter?

Anyway, I had a lot of fun learning about the scaling laws and running these experiments, thanks for reading! Have a great day.

Limitation

It is important to note that this is a toy model with one attention head and a small dataset, but it still shows that the power law scaling relationship is replicable.
