
    A Deep Dive on DeepSeek R1

    Jannik Malte Meissner, Co-Founder and CEO at Neuralfinity

    DeepSeek R1 is the new hyped model on the block - but what makes it special? Let's dive in.

    The headlines are clear: on benchmarks, DeepSeek R1 performs comparably to or better than the current top closed-source model, OpenAI's o1, while being open source and reportedly trained on a much smaller budget.

    To understand what's different, let's look separately at both the Pre-Training and Post-Training phases of R1.

    Pre-Training - A Tale of Cost Savings and Efficiencies

    While much of the discussion focuses on the reasoning abilities and high benchmark scores, which are primarily a result of the Post-Training process, the main reason for the large reported cost savings compared to other frontier models lies in the Pre-Training: R1 and its sibling R1-Zero are both built on the DeepSeek-V3-Base model [1], a 671B-parameter MoE model with 37B active parameters per token.

    There are a few methods that the team behind DeepSeek employed when training DeepSeek-V3-Base that are noteworthy:

    1. Data Quality
    2. FP8 mixed precision training
    3. DualPipe algorithm
    4. Lucky activations? (Potentially)

    The Data

    The first one is obvious: great models require great data. The authors mention using "14.8T high quality and diverse tokens" for the Pre-Training phase. This leads me to believe their dataset is at least comparable to Hugging Face's FineWeb dataset [2], which we also support as part of the more general-purpose Pre-Training mixes on our Neuralfinity platform.
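    For anyone curious what a corpus of that kind looks like in practice, FineWeb is easy to sample with the Hugging Face datasets library. A minimal sketch (my own illustration, not DeepSeek's or our data pipeline; "sample-10BT" is one of the published FineWeb subsets):

        from datasets import load_dataset

        # Stream one of the published FineWeb subsets instead of downloading
        # the full multi-terabyte corpus.
        fineweb = load_dataset(
            "HuggingFaceFW/fineweb",
            name="sample-10BT",
            split="train",
            streaming=True,
        )

        # Peek at a few documents; each record carries the raw web text plus metadata.
        for i, doc in enumerate(fineweb):
            print(doc["text"][:200])
            if i >= 2:
                break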

    FP8 Mixed Precision Training

    With the Hopper/Ada generation, Nvidia also introduced support for FP8 mixed precision training. This reduces compute overhead and cuts memory for weights and activations by almost 50%, lowering the compute and memory requirements significantly and thus allowing for much cheaper Pre-Training. I am sure a lot of the big labs are making use of this, too. At Neuralfinity we certainly do; FP8 mixed precision training is the main reason why we don't support Ampere and older GPU generations.
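    DeepSeek built their own FP8 framework, but on Hopper-class hardware the same idea is commonly accessed through NVIDIA's Transformer Engine. A minimal sketch, assuming Transformer Engine is installed and a Hopper/Ada GPU is available (illustrative only, not DeepSeek's code):

        import torch
        import transformer_engine.pytorch as te
        from transformer_engine.common.recipe import DelayedScaling, Format

        # HYBRID keeps E4M3 for the forward pass and E5M2 for gradients.
        fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

        # FP8-aware drop-in replacement for nn.Linear.
        layer = te.Linear(4096, 4096, bias=True).cuda()
        optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

        x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            out = layer(x)  # the matmul runs in FP8, accumulation in higher precision

        loss = out.float().pow(2).mean()  # dummy loss for illustration
        loss.backward()
        optimizer.step()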

    The DualPipe Algorithm

    This is the true magic sauce of DeepSeek's Pre-Training stage: As a novel implementation of pipeline parallelism, relying heavily on PTX to directly control SMs on the GPU, it addresses the communication overhead challenges in training large models across multiple nodes.

    What makes DualPipe special is its ability to efficiently overlap computation and communication within pairs of forward and backward chunks. It divides each chunk into four components (attention, all-to-all dispatch, MLP, and all-to-all combine) and carefully orchestrates their execution to ensure both all-to-all and pipeline parallelism communications can be fully hidden behind computation.

    This overlapping strategy is particularly important because models like DeepSeek-V3 have an unfavourable computation-to-communication ratio of approximately 1:1 due to cross-node expert parallelism. By employing bidirectional pipeline scheduling, which feeds micro-batches in from both ends of the pipeline simultaneously, DualPipe ensures that fine-grained experts can still be deployed across nodes with near-zero all-to-all communication overhead, as long as a constant computation-to-communication ratio is maintained. This effectively solves one of the key bottlenecks in distributed MoE model training.
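    DualPipe itself is far more involved (bidirectional scheduling, chunk-level interleaving, PTX-level control), but the underlying idea of hiding all-to-all communication behind computation can be sketched with an asynchronous collective in PyTorch. A toy illustration, assuming torch.distributed is initialised with an NCCL backend and all ranks exchange equally sized token buffers:

        import torch
        import torch.distributed as dist

        def overlapped_moe_step(tokens_for_experts, other_work_input, mlp):
            # 1. Kick off the all-to-all dispatch asynchronously.
            recv_buf = torch.empty_like(tokens_for_experts)
            handle = dist.all_to_all_single(recv_buf, tokens_for_experts, async_op=True)

            # 2. Do computation that does not depend on the communication result
            #    (e.g. attention for another micro-batch) while NCCL runs in the background.
            other_out = mlp(other_work_input)

            # 3. Block only when the communicated tokens are actually needed.
            handle.wait()
            expert_out = mlp(recv_buf)
            return other_out, expert_out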

    I have previously used related techniques, though not as extreme, to optimise our own training as much as possible. Since the GPU sanctions on China restrict the GPU-to-GPU communication bandwidth available to the DeepSeek team, it is noteworthy how they worked around this limitation, ultimately gifting the open source community methods that will be just as valuable for anyone who isn't subject to this ban.

    Lucky Activations or Missing Info?

    This one might be a bit controversial, but reading through the DeepSeek V3 paper [1] I noticed this sentence in the introduction:

    In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.

    This surprised me: such stability is a rare occurrence, so rare that I suspect they either stumbled onto a lucky activation by chance, or the paper omits some experiments that were necessary to pull off this training run. Lucky activations refer to neural activations during Pre-Training that happen to encode useful information or capabilities by chance, rather than through consistent learning. Think of it like accidentally stumbling upon a useful shortcut: the model discovers a pattern or connection that works well for certain tasks, even though it wasn't explicitly trained to find that specific pattern.

    For example, a language model might develop the ability to do basic arithmetic not because it was systematically taught mathematics, but because certain neurons randomly configured themselves in a way that happens to process numbers effectively. These lucky activations can contribute to emergent capabilities that weren't explicitly part of the training objectives.

    However, since these activations occur somewhat randomly during training, they can be fragile and inconsistent across different model versions or training runs.

    At the start of a Pre-Training run, the weights are initialised with random values. In some cases, these initialisations already contain some lucky activations, which allow for very stable training and can influence the number and severity of loss spikes.

    The other option is that they tested a lot of initialisations for a few training steps (or even epochs), but simply omitted these experiments from the reported compute requirements.
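    To make the intuition concrete, here is a purely illustrative toy experiment (not from the paper): repeating the same small training run with different random seeds, and therefore different initialisations, can produce very different worst-case losses, even with identical data.

        import torch
        import torch.nn as nn

        def tiny_run(seed, steps=200):
            # Fixed data across runs, so only the initialisation changes.
            g = torch.Generator().manual_seed(0)
            x = torch.randn(256, 32, generator=g)
            y = torch.randn(256, 1, generator=g)

            torch.manual_seed(seed)  # controls the random weight initialisation
            model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
            opt = torch.optim.SGD(model.parameters(), lr=0.5)  # deliberately aggressive lr

            worst_loss = 0.0
            for _ in range(steps):
                loss = nn.functional.mse_loss(model(x), y)
                worst_loss = max(worst_loss, loss.item())  # track spikes
                opt.zero_grad()
                loss.backward()
                opt.step()
            return worst_loss

        for seed in range(5):
            print(f"seed={seed}: worst loss seen {tiny_run(seed):.3f}")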

    The Pre-Training Impact

    Pre-Training is still the costliest part of model training, and that didn't change with DeepSeek R1. Other factors also played a role, such as much more powerful GPU generations: a Hopper-generation card is roughly 5x as fast as an Ampere card when taking full advantage of its new capabilities, so the 2,000 H800 GPUs outlined here would be roughly equivalent to 10,000 A100s, about the same amount of compute OpenAI used for the initial model behind ChatGPT. Still, the optimisations, specifically the four outlined above, made the biggest contribution to the cost savings that are now turning the AI world on its head.
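    The back-of-envelope maths behind that equivalence, for what it's worth (the 5x figure is a rough rule of thumb, not a measured benchmark):

        h800_count = 2_000
        hopper_vs_ampere_speedup = 5  # rough factor when FP8 and the other new features are fully used
        print(h800_count * hopper_vs_ampere_speedup)  # -> 10000 A100-class GPUs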

    Post-Training

    While the Pre-Training phase delivered the savings, the Post-Training is where the benchmark-beating capabilities are born. For comparison's sake, look at the training of Meta's Llama 3 family:

    Meta pre-trained the base models and then followed the Post-Training pipeline outlined in Figure 1.

    As outlined, they start with a set of collected prompts and use them for rejection sampling against their reward model. The resulting SFT data is used to train the SFT model, which is then refined into the DPO model using the annotated preference data; that same preference data also feeds the reward model training.

    Figure 1: Llama 3 Post-Training Process
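    At the heart of Meta's pipeline sits the DPO objective. A minimal sketch of the loss, assuming the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already computed (my own illustration, not Meta's code):

        import torch
        import torch.nn.functional as F

        def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
            # Implicit rewards: how much more the policy prefers each response
            # than the frozen reference model does.
            chosen = beta * (policy_chosen_logp - ref_chosen_logp)
            rejected = beta * (policy_rejected_logp - ref_rejected_logp)
            # Push the margin between chosen and rejected responses apart.
            return -F.logsigmoid(chosen - rejected).mean()

        # Dummy log-probabilities for a batch of 4 preference pairs.
        loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))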

    Comparing the R1 Post-Training [3] to that of Llama 3, we can see that a very different approach was employed. The main goal of Meta's approach was to align the responses as closely as possible with human preferences. The DeepSeek team, on the other hand, was specifically exploring ways of maximising Test-Time Compute, using Chain-of-Thought (CoT) and other aids to improve the model's reasoning performance. They reached their goal with a three-stage Post-Training pipeline, outlined in Figure 2.

    Figure 2: DeepSeek R1 Post-Training Process

    The entire process was preceded by the initial training of another model, R1-Zero, which was used in part to generate some of the data needed in the first stage and provided crucial insights for improving those steps in the R1 Post-Training.

    Stage 1: Reinforcement Learning with Cold Start (R1)

    In the first stage, the model was initially trained on "thousands of examples" collected by the researchers. They outlined the process as follows:

    To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

    In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL.

    The focus of this first stage was to maximise performance on tasks such as coding, mathematics, science, and logical reasoning, which involve well-defined problems with clear solutions.
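    Because these domains are verifiable, the paper describes rule-based rewards (answer accuracy plus output format) rather than a learned reward model for this stage. A toy sketch of what such a reward function could look like (my own illustration, not DeepSeek's implementation; the tag format is an assumption):

        import re

        def rule_based_reward(model_output: str, reference_answer: str) -> float:
            reward = 0.0
            # Format reward: did the model wrap its final answer in the expected tags?
            match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
            if match:
                reward += 0.1
                # Accuracy reward: exact match against the known solution.
                if match.group(1).strip() == reference_answer.strip():
                    reward += 1.0
            return reward

        print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # -> 1.1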

    Once this RL stage converged, around 600,000 reasoning samples were collected from the resulting checkpoint via rejection sampling, and the model progressed to the second stage:

    Stage 2: SFT on a Large Dataset

    This stage is the one that most resembles the Post-Training employed by Meta. The paper mentions that about 200,000 samples from the DeepSeek-V3 SFT dataset were reused, together with parts of the DeepSeek-V3 pipeline, to cover non-reasoning data such as writing, factual QA, self-cognition, and translation.
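    Mechanically, this stage is plain supervised fine-tuning: next-token cross-entropy on curated prompt/response pairs. A minimal sketch with Hugging Face transformers, using a small stand-in model rather than the actual checkpoint (illustrative only, not DeepSeek's pipeline):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Small stand-in model; the real run fine-tunes the RL checkpoint built on DeepSeek-V3-Base.
        model_name = "gpt2"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

        sample = {"prompt": "Translate to German: Hello world.", "response": "Hallo Welt."}
        batch = tokenizer(sample["prompt"] + "\n" + sample["response"], return_tensors="pt")

        # With labels equal to input_ids, the model returns the usual causal LM loss.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()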

    Stage 3: Reinforcement Learning for Alignment

    In the final stage, the model was aligned to maximise helpfulness and harmlessness in a secondary reinforcement learning stage. Specifically, they trained the model using a combination of reward signals and diverse prompt distributions.
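    The paper describes keeping rule-based rewards for reasoning data while using reward models for general data in this stage. A toy sketch of how such signals might be combined (entirely my own illustration of the idea, not DeepSeek's reward code):

        def combined_reward(prompt_type: str, rule_reward: float,
                            helpfulness_score: float, harmlessness_score: float) -> float:
            if prompt_type == "reasoning":
                # Verifiable tasks keep the rule-based signal from the earlier RL stage.
                return rule_reward
            # General prompts are judged by reward models; a harm signal dominates,
            # so unsafe outputs cannot be traded away for helpfulness.
            if harmlessness_score < 0.0:
                return harmlessness_score
            return helpfulness_score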

    Yet Another Reminder of the Importance of Data Quality

    A clear factor in making this successful is the quality of both the cold-start and SFT datasets. As their own paper outlines, the initial R1-Zero model suffered from many issues, including outputs that were hard to read and sometimes even mixed multiple languages. The combined 800,000 data samples used in the Post-Training phase were the decisive factor in the pipeline's success. A not insignificant number of these samples seems to have been created by prompting other frontier models, including OpenAI's GPT-4 and o1.

    Final Thoughts

    All in all, DeepSeek R1 is an impressive model that deserves the praise it garners. The main accomplishments are a significant reduction in Pre-Training costs, coupled with a novel Post-Training setup that will serve as a template for future open source models.

    As all models are also a product of the biases and socio-cultural environment of those who trained them, I am not surprised by the inherent issues R1 has, for example when asked about events from Chinese history that are heavily censored in China. This also serves as a great example of why it often makes sense to avoid these biases entirely when building models for specific applications.

    Another aspect I didn't mention in this blog post is the distillation of the R1 model into smaller models which also set new benchmarks in their specific size classes. We will discuss these in a future blog post in the coming days.

    Sources

    [1]: https://arxiv.org/pdf/2412.19437
    [2]: https://huggingface.co/collections/HuggingFaceFW/fineweb-datasets-662458592d61edba3d2f245d
    [3]: https://github.com/deepseek-ai/DeepSeek-R1