How Large Language Models Are Trained
From Raw Tokens to Aligned Reasoning
From Sequence Prediction to Reasoning Assistants
The headline objective of every modern large language model is unchanged from the original Transformer era: given a sequence of tokens, predict the next one. What separates a 2026-era frontier model from a 2019 GPT-2 checkpoint is not the loss function — it is the data, the compute, and the multi-stage pipeline that wraps that simple objective. A base model trained only on next-token prediction is fluent but unhelpful: it will autocomplete, but it will not answer, refuse harmful requests, or follow instructions reliably.
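Concretely, the objective is plain cross-entropy between the model's output distribution at position t and the actual token at position t+1. A minimal PyTorch sketch, with toy sizes and random logits standing in for a real Transformer:

```python
import torch
import torch.nn.functional as F

# Toy setup: batch of 2 sequences, 8 tokens each, vocabulary of 100.
vocab_size, batch, seq_len = 100, 2, 8
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Stand-in for a Transformer: any module mapping tokens -> logits works here.
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)

# Next-token prediction: position t predicts token t+1, so shift by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..n-2
    tokens[:, 1:].reshape(-1),               # targets are positions 1..n-1
)
loss.backward()
print(f"per-token loss: {loss.item():.3f}")  # roughly ln(100) ~= 4.6 for random logits
```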
The contemporary recipe (Brown et al. 2020 for GPT-3; Ouyang et al. 2022 for InstructGPT; Touvron / Grattafiori et al. 2023–2024 for Llama 2 / Llama 3; Bai et al. 2022 for Anthropic's Constitutional AI) layers four additional steps on top of pretraining: tokenization that compresses raw text into a workable vocabulary, supervised fine-tuning that teaches instruction-following format, preference learning that aligns the model with human (or AI) judgments, and a continuous evaluation loop that gates deployment.
This article walks through each of those stages with real numbers from public training reports — primarily Llama 3 (Grattafiori et al. 2024) and the InstructGPT paper, both of which documented their pipelines in unusual detail — and ends with how the same recipe scales down to domain-specific models for finance, the regime in which most enterprise teams actually operate.
Stage 1 — Pretraining: Compressing the Internet
Pretraining is the dominant cost of training a frontier LLM. For Llama 3 405B, Meta reported approximately 30.84 million GPU-hours on H100s — roughly 3.8 × 10²⁵ FLOPs of compute — to consume 15.6 trillion tokens of curated text and code. GPT-3, by comparison, consumed ~300 billion tokens at 175B parameters in 2020 (Brown et al.), making the modern scale a 50× expansion in data and roughly 100× in compute over a five-year window.
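Those figures hang together under the standard rule of thumb that training compute is about 6 FLOPs per parameter per token. A quick sanity check:

```python
# Training-compute rule of thumb: FLOPs ~= 6 * parameters * tokens.
params = 405e9    # Llama 3 405B
tokens = 15.6e12  # 15.6T pretraining tokens

print(f"{6 * params * tokens:.2e} FLOPs")  # 3.79e+25, matching the reported ~3.8e25
```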
The pretraining corpus is the most under-discussed lever. A frontier corpus is not raw Common Crawl — it is heavily filtered. Llama 3's pipeline included URL-level and line-level deduplication, document quality classifiers (n-gram language models trained to score "Wikipedia-likeness"), perplexity filtering against held-out high-quality text, PII removal, and toxic-content scrubbing. Public open-data efforts like RefinedWeb (Penedo et al. 2023) and FineWeb (Penedo et al. 2024) have shown that quality filtering alone can match or exceed the gains of doubling raw corpus size.
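A sketch of the shape of such a pass, with none of Meta's actual thresholds; `quality_model` and `ppl_model` are hypothetical stand-ins for a Wikipedia-likeness classifier and an n-gram language model:

```python
import hashlib

def filter_corpus(docs, quality_model, ppl_model,
                  min_quality=0.5, max_ppl=1000.0):
    """Toy document filter: exact dedup + quality + perplexity gates.

    Thresholds, `quality_model.score`, and `ppl_model.perplexity` are
    illustrative stand-ins, not any production pipeline's values.
    """
    seen = set()
    for doc in docs:
        # Exact-duplicate removal via content hash (real pipelines also
        # apply fuzzy dedup such as MinHash at URL and line level).
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Quality gate: keep documents the classifier scores as high-quality.
        if quality_model.score(doc) < min_quality:
            continue
        # Perplexity gate: discard text the n-gram LM finds implausible.
        if ppl_model.perplexity(doc) > max_ppl:
            continue
        yield doc
```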
Tokenization compresses text into a vocabulary the model can address. Modern models use Byte-Pair Encoding (BPE) or SentencePiece variants with vocabularies of 32K–128K tokens; Llama 3 uses a 128K-token tokenizer that yields roughly 15% better compression on typical text than its 32K Llama 2 predecessor, which translates directly into a ~15% effective-context-window expansion at no additional compute cost. Vocabularies are tuned to balance coverage of multilingual, code, and math inputs.
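One way to see the compression difference is to tokenize the same text under both vocabularies. A sketch using Hugging Face tokenizers (the model IDs are the public gated repos, so this assumes you have been granted access):

```python
from transformers import AutoTokenizer

text = ("Deferred tax liabilities arise when book depreciation "
        "outpaces tax depreciation in the early years of an asset's life.")

# Llama 2 (32K vocab) vs Llama 3 (128K vocab) on identical input.
for model_id in ("meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"):
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"{model_id}: {len(tok.encode(text))} tokens "
          f"(vocab {tok.vocab_size})")

# Fewer tokens per document means more documents fit in a fixed context window.
```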
The architecture is a decoder-only Transformer in nearly all production frontier models, with two consequential choices: dense vs Mixture-of-Experts (MoE), and rotary positional encoding (RoPE) parameters that determine context-length extrapolation. MoE designs (Mixtral 8×22B, and reportedly some GPT-4-class models) activate only a subset of expert blocks per token, decoupling parameter count from per-token compute and delivering 100B+ total-parameter capacity at the per-token FLOPs cost of a much smaller dense model.
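The routing idea is simple at its core: a learned router picks the top-k experts per token and mixes their outputs. A toy PyTorch sketch with illustrative sizes (production implementations use batched expert dispatch, capacity limits, and load-balancing losses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-2 Mixture-of-Experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # learned gating
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over chosen experts
        out = torch.zeros_like(x)
        # Each token runs through only top_k experts, not all n_experts:
        # parameter count grows with n_experts, per-token FLOPs do not.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = MoELayer()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```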
Scaling laws govern how compute, parameters, and data should be balanced. Kaplan et al. 2020 first quantified power-law improvement; Chinchilla (Hoffmann et al. 2022) corrected the canonical recipe to ~20 tokens per parameter as compute-optimal. Frontier teams now routinely train far past Chinchilla optimal — Llama 3 8B was trained on 15T tokens, ~100× over Chinchilla — because at deployment the cost is dominated by inference, and a slightly under-parameterized but heavily-trained model serves more cheaply.
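The Chinchilla arithmetic for Llama 3 8B, under the ~20 tokens-per-parameter rule:

```python
# Chinchilla rule of thumb: compute-optimal tokens ~= 20 * parameters.
params = 8e9              # Llama 3 8B
optimal = 20 * params     # ~= 160B tokens
actual = 15e12            # tokens actually trained on

print(f"compute-optimal: {optimal / 1e9:.0f}B tokens")
print(f"overtraining factor: {actual / optimal:.0f}x")  # ~94x, i.e. ~100x in round numbers
```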
Stage 2 — Supervised Fine-Tuning (SFT)
The base model that emerges from pretraining is an extremely strong probabilistic completion engine. It is not yet useful as an assistant. Hand it the prompt "What is the capital of Brazil?" and it is just as likely to continue with "is a question I get asked a lot in my geography class…" as to answer "Brasília." Supervised Fine-Tuning closes that gap by teaching format and instruction-following behavior, not new world knowledge.
SFT data is curated examples of the form (instruction, ideal response). InstructGPT (Ouyang et al. 2022) used roughly 13K labeler-written demonstrations covering open-ended generation, classification, summarization, and chat. Modern open recipes draw from larger, mixed corpora: ShareGPT (real ChatGPT conversations), FLAN and its Flan 2022 successor (Wei et al. 2021; Chung et al. 2022; 1,800+ academic NLP tasks reformulated as instructions), OpenAssistant (Köpf et al. 2023, ~160K human-written messages organized into conversation trees), Alpaca (model-generated via self-instruct, useful but lower-fidelity), and Dolly (~15K demonstrations written by Databricks employees).
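Under the hood, each pair is rendered into one token sequence with the loss masked over the prompt, so gradients flow only through the response tokens. A minimal sketch; the prompt template and helper name are illustrative, not any particular recipe's format:

```python
def build_sft_example(tokenizer, instruction, response, ignore_index=-100):
    """Render one (instruction, response) pair into input_ids + labels.

    Labels for prompt tokens are set to ignore_index so cross-entropy
    only trains on the response: teaching format, not new knowledge.
    Works with any Hugging Face tokenizer.
    """
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"  # illustrative template
    prompt_ids = tokenizer.encode(prompt)
    response_ids = (tokenizer.encode(response, add_special_tokens=False)
                    + [tokenizer.eos_token_id])

    input_ids = prompt_ids + response_ids
    labels = [ignore_index] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```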
SFT hyperparameters look nothing like pretraining. Learning rates are 1–2 orders of magnitude lower (typically 1e-5 to 2e-5), batch sizes are smaller, and runs are short — 2 to 3 epochs over the SFT set. The risk being managed is catastrophic forgetting: too many epochs or too high a learning rate and the model degrades on tasks it previously handled cleanly. Rehearsal buffers — interleaving pretraining-style data into SFT batches — are a common defense.
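A representative configuration, expressed as Hugging Face TrainingArguments; the values are illustrative midpoints of the ranges above, not a specific lab's recipe:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft-run",
    learning_rate=2e-5,               # 1-2 orders below pretraining LR
    num_train_epochs=3,               # short: catastrophic-forgetting risk
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch of 32 per device
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
)
```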
A useful intuition: pretraining gives the model the latent ability to do almost anything; SFT gives it the convention of doing what it is asked to do. Capability is mostly already there; the fine-tune is teaching surface format.
Stage 3 — Preference Learning: RLHF, DPO, Constitutional AI
Even after SFT, models still have systematic flaws: they hallucinate confidently, refuse harmless requests, comply with harmful ones, and pick stylistic registers that humans dislike. Preference learning is the stage that bends the model toward what humans actually want, given that "what humans want" is much easier to recognize than to specify in a loss function.
Reinforcement Learning from Human Feedback (RLHF) was the dominant approach from InstructGPT (2022) through Llama 2 (2023). The recipe has three sub-stages. First, sample completions from the SFT model and ask labelers to rank them — Ouyang et al. collected rankings over ~33K prompts, each yielding multiple pairwise comparisons. Second, train a reward model (RM) — typically initialized from the SFT model with the language-modeling head replaced by a scalar regression head — to predict which completion humans preferred. Third, use Proximal Policy Optimization (PPO, Schulman et al. 2017) to fine-tune the SFT model, using the RM as the reward and a KL-divergence penalty against the SFT model to prevent reward hacking.
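The reward model's objective is the standard pairwise Bradley-Terry loss: push the scalar score of the preferred completion above the rejected one. In code:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise RM loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are scalar scores from the reward head for the
    human-preferred and dispreferred completions of the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 4 comparison pairs.
print(reward_model_loss(torch.randn(4), torch.randn(4)).item())
```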
RLHF works but is operationally heavy: three model copies in memory (policy, reference, reward), unstable PPO dynamics, and frequent reward-hacking failures where the policy exploits RM blind spots. Direct Preference Optimization (DPO, Rafailov et al. 2023) collapses the pipeline. By rewriting the RLHF objective in closed form, DPO trains directly on preference pairs with a simple classification-style loss — no separate reward model, no PPO loop. Llama 3 used a DPO-based stage in production, and most open-source post-training recipes (Zephyr, Tülu, OpenHermes) follow suit. DPO is not strictly equivalent to RLHF, but on most public benchmarks the gap is within noise, and the engineering simplification is roughly 40–60%.
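The published DPO loss itself fits in a few lines, which is much of its appeal. A sketch; each log-probability is summed over a full response's tokens under either the trainable policy or the frozen reference (SFT) model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO (Rafailov et al. 2023): preference learning with no RM, no PPO.

    beta controls how far the policy may drift from the reference model,
    playing the role of RLHF's KL penalty.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch: response log-probs for 4 preference pairs.
print(dpo_loss(torch.randn(4), torch.randn(4),
               torch.randn(4), torch.randn(4)).item())
```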
Constitutional AI (Bai et al. 2022) and RLAIF (Lee et al. 2023) go a step further by replacing — or augmenting — human raters with an AI critic guided by an explicit list of principles ("the constitution"). The model self-critiques and self-revises during training, and an AI judge produces preference pairs at near-zero marginal cost. This is how Anthropic scales alignment data far beyond human-labeling budgets. The principles themselves remain human-authored, but the rating scales.
Across all three approaches, the data scale is surprisingly modest by pretraining standards: 50K–100K preference pairs is typical, often topped up with synthetic AI-generated preferences. Most of the alignment "magic" comes from careful labeler selection, principle authorship, and iteration loops — not from raw volume.
Stage 4 — Evaluation, Red-Teaming, Continuous Improvement
A model that scored well during training is not yet ready to ship. The evaluation stage runs a battery of tests across capability, safety, robustness, and regression. Public capability suites include MMLU (57-subject multiple choice), GSM8K (grade-school math word problems), HumanEval (Python code generation), MATH (competition math), AGIEval (admissions-test style reasoning), MT-Bench (multi-turn dialogue judged by GPT-4), and HELM (Stanford's holistic evaluation framework).
Safety evaluation runs in parallel. Red-teaming — both manual (specialist contractors prompted to elicit harmful outputs) and automated (adversarial prompt generation, often by other LLMs) — produces failure cases that feed back into the next round of preference data. Frontier labs report red-team budgets in the thousands of prompt-hours per release, and findings drive both refusal-tuning and content-classifier development.
Hallucination and factuality evaluation receives growing attention. TruthfulQA (Lin et al. 2022) measures susceptibility to common misconceptions; FActScore (Min et al. 2023) decomposes long-form generations into atomic claims and checks each; SimpleQA (OpenAI 2024) tests factual recall on questions with a single verifiable answer. These metrics rarely show the dramatic gains seen on capability benchmarks, which reflects an underlying truth: hallucination is a base-model property that fine-tuning can mitigate but not eliminate.
Once deployed, the loop continues. Production traffic generates new preference data — thumbs-up/down feedback, regenerate clicks, conversation-edit signals — that feed the next training cycle. Most frontier teams operate on 4–12 week release cadences, with periodic full-pipeline re-runs. The training pipeline is not a one-shot build; it is a continuously running process whose output is a series of versioned checkpoints.
Vorcl's Approach: Domain-Specific Fine-Tuning for Finance
Most enterprise AI work does not require pretraining a frontier model — it requires adapting a competent base model to a specific domain at a manageable cost. At Vorcl, the practical question is almost always: do we use Retrieval-Augmented Generation (RAG), parameter-efficient fine-tuning, or both? The answer depends on what the model needs to do.
RAG (retrieve relevant documents at inference time and inject them into the prompt) excels when the task is question-answering over a knowledge base that updates frequently — internal policies, product catalogs, recent regulatory text. It requires no training, scales linearly with document volume, and keeps answers grounded in retrievable sources. It does not, however, change the model's reasoning style, format, or domain-specific judgment. A RAG-only system can quote a tax code section but may not chain three sections together to compute a deduction the way a human accountant would.
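The retrieval core of such a system is small. A sketch with a hypothetical `embed` function standing in for any sentence-embedding model; production systems replace the brute-force similarity scan with an approximate-nearest-neighbor index:

```python
import numpy as np

def retrieve(query, docs, embed, top_k=3):
    """Minimal RAG retrieval: cosine similarity over document embeddings.

    `embed` is a stand-in for any text-embedding model; it must map a
    list of strings to an (n, dim) array.
    """
    doc_vecs = embed(docs)                     # (n_docs, dim)
    q = embed([query])[0]                      # (dim,)
    sims = doc_vecs @ q / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-8
    )
    return [docs[i] for i in np.argsort(-sims)[:top_k]]

def build_prompt(query, context_docs):
    # Injected context grounds the answer in retrievable sources.
    context = "\n\n".join(context_docs)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```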
Fine-tuning is the right tool when the model's reasoning, format, or specialized vocabulary needs to change. We use LoRA and QLoRA (Hu et al. 2021; Dettmers et al. 2023) — low-rank adapters that train roughly 0.5–2% of the parameters of a 7B–70B base model, with QLoRA holding the frozen base weights in 4-bit quantization to cut memory further. A typical Vorcl finance fine-tune uses rank 16–64 LoRA matrices over a Llama-class or Mistral base, with 8K–50K curated finance-specific instruction examples (chart-of-accounts reasoning, IFRS/GAAP classification, anomaly explanation, regulatory citation). The full run completes overnight on 4–8 H100s.
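A representative adapter configuration using the Hugging Face PEFT library; the rank, target modules, and base-model ID are illustrative choices within the ranges above, not an exact Vorcl recipe:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base: any Llama/Mistral-class causal LM works the same way.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=32,                      # adapter rank, within the 16-64 range above
    lora_alpha=64,             # scaling factor; commonly 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 2% of base params
```

For QLoRA, the same adapter config is applied to a base model loaded in 4-bit (via a `BitsAndBytesConfig` with `load_in_4bit=True`), which is what lets 70B-scale fine-tunes fit on a handful of GPUs.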
Evaluation moves with the domain. Generic MMLU tells us little about whether a finance assistant correctly classifies a borderline VAT scenario; we maintain held-out benchmarks of real (anonymized) audit cases, tax-authority correspondence, and reconciliation edge cases. Human-in-the-loop review by partner accountants gates each release, the same way frontier labs gate on red-team findings.
The result is models that operate at a fraction of frontier-model cost — typically 50–100× cheaper per token at parity accuracy on the in-domain task — while remaining auditable, explainable, and aligned to a specific firm's procedures. The training recipe is borrowed from frontier labs; the engineering problem is choosing which steps to apply, at what rank, on what data — and proving the result is safe to put in front of a regulator.
Key Findings
Data Quality Beats Data Scale
Llama 3 vs Llama 2 gains came largely from corpus filtering — quality classifiers, deduplication, perplexity scoring — not from architectural changes. RefinedWeb and FineWeb confirmed the pattern: heavily filtered Common Crawl matches or beats a raw corpus of twice the size.
Alignment Is Multi-Stage, Not Optional
A pretraining-only base model is fluent but unhelpful. SFT teaches format; preference learning aligns intent. Skipping either stage produces a model that either rambles or refuses everything. The full pipeline is what makes the assistant.
DPO Simplifies Without Losing Performance
Direct Preference Optimization (Rafailov et al. 2023) collapses RLHF’s three-model PPO loop into a single classification-style training step. Engineering complexity drops ~40–60% with benchmark performance within noise on most public tests.
Domain Fine-Tuning Wins on Cost
A 7B base + LoRA on curated finance data delivers parity accuracy on in-domain tasks at 50–100× lower per-token inference cost than frontier general models. Pick scale to match the task, not the headline benchmark.
Need a Model Trained for Your Domain?
Vorcl designs and runs domain-specific fine-tuning pipelines — LoRA / QLoRA over open base models, with held-out evaluation, red-teaming, and audit-ready deployment.