https://www.youtube.com/watch?v=aAfanTeRn84
Lex Fridman recently posted an interview called “DeepSeek’s GPU Optimization tricks”. It is a great behind-the-scenes look at how DeepSeek trained their latest models even though they did not have as many GPUs as their American peers.
Necessity was the mother of invention, and the interview covers a few of the things DeepSeek did as a result.
They also talk about how researchers experiment with new model architectures and data engineering steps. They note that spikes sometimes appear in the loss curve during training, and it's hard to know exactly why. Sometimes a spike resolves on its own as training continues, but sometimes ML engineers have to restart training from an earlier checkpoint.
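For anyone curious what that checkpoint-and-rollback pattern can look like, here is a minimal sketch in PyTorch. This is my own illustration of a generic approach, not DeepSeek's actual training code: the model, data, spike threshold, and checkpoint interval are all assumptions. The idea is simply to save a known-good state every so often and restore it when the loss jumps far above its recent running average.

```python
# Minimal sketch (illustrative, not DeepSeek's recipe): checkpoint periodically
# and roll back when the loss spikes well above its recent running average.
import copy
from collections import deque

import torch
import torch.nn as nn

model = nn.Linear(16, 1)                     # stand-in for a real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

recent_losses = deque(maxlen=100)            # running window for spike detection
checkpoint = None                            # last known-good state
SPIKE_FACTOR = 3.0                           # loss > 3x recent mean counts as a spike

for step in range(1_000):
    x = torch.randn(32, 16)                  # toy batch; a real run uses a data loader
    y = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    loss_val = loss.item()
    if recent_losses and checkpoint is not None:
        mean_loss = sum(recent_losses) / len(recent_losses)
        if loss_val > SPIKE_FACTOR * mean_loss:
            # Spike detected: restore model and optimizer from the last checkpoint
            # instead of letting the run diverge.
            model.load_state_dict(checkpoint["model"])
            optimizer.load_state_dict(checkpoint["optimizer"])
            continue
    recent_losses.append(loss_val)

    if step % 100 == 0:                      # save a checkpoint every 100 steps
        checkpoint = {
            "model": copy.deepcopy(model.state_dict()),
            "optimizer": copy.deepcopy(optimizer.state_dict()),
        }
```

In a real frontier-scale run the checkpoints would be written to disk (or object storage) rather than kept in memory, and the spike heuristic would likely be more sophisticated, but the rollback logic is the same.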
They also mention YOLO runs, where researchers dedicate all their available hardware and budget in an attempt to produce a frontier model. They might get a really good model, or they might waste hundreds of millions of dollars in the process.
This interview is a really good in-depth behind-the-scenes look at training frontier LLMs today. I enjoyed it, and I recommend checking it out as well!
submitted by /u/ml_guy1