https://www.youtube.com/watch?v=aAfanTeRn84
Lex Fridman recently posted an interview called “DeepSeek’s GPU Optimization tricks”. It is a great behind-the-scenes look at how DeepSeek trained their latest models even though they did not have as many GPUs as their American peers.
Necessity was the mother of invention, and the interview covers a few of the things DeepSeek did as a result.
They also talk about how researchers experiment with new model architectures and data engineering steps. They note that spikes sometimes appear in the loss curve during training, and it's hard to know exactly why. Sometimes a spike resolves on its own as training continues, but sometimes ML engineers have to restart training from an earlier checkpoint.
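For anyone curious what that checkpoint-and-rollback pattern can look like, here is a minimal sketch in PyTorch. This is my own illustration of a generic approach, not DeepSeek's actual training code: the model, data, spike threshold, and checkpoint interval are all assumptions. The idea is simply to save a known-good state every so often and restore it when the loss jumps far above its recent running average.

```python
# Minimal sketch (illustrative, not DeepSeek's recipe): checkpoint periodically
# and roll back when the loss spikes well above its recent running average.
import copy
from collections import deque

import torch
import torch.nn as nn

model = nn.Linear(16, 1)                     # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

recent_losses = deque(maxlen=100)            # running window for spike detection
checkpoint = None                            # last known-good state
SPIKE_FACTOR = 3.0                           # loss > 3x recent mean counts as a spike

for step in range(1_000):
    x = torch.randn(32, 16)                  # toy batch; a real run uses a data loader
    y = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    loss_val = loss.item()
    if recent_losses and checkpoint is not None:
        mean_loss = sum(recent_losses) / len(recent_losses)
        if loss_val > SPIKE_FACTOR * mean_loss:
            # Spike detected: restore model and optimizer from the last checkpoint
            # instead of letting the run diverge.
            model.load_state_dict(checkpoint["model"])
            optimizer.load_state_dict(checkpoint["optimizer"])
            continue
    recent_losses.append(loss_val)

    if step % 100 == 0:                      # save a checkpoint every 100 steps
        checkpoint = {
            "model": copy.deepcopy(model.state_dict()),
            "optimizer": copy.deepcopy(optimizer.state_dict()),
        }
```

In a real frontier-scale run the checkpoints would be written to disk (or object storage) rather than kept in memory, and the spike heuristic would likely be more sophisticated, but the rollback logic is the same.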
They also mention YOLO runs, where researchers dedicate all their available hardware and budget in an attempt to produce a frontier model. They might get a really good model, or they might waste hundreds of millions of dollars in the process.
This interview is a really good in-depth behind-the-scenes look at training frontier LLMs today. I enjoyed it, and I recommend checking it out as well!
submitted by /u/ml_guy1