
Building Large Language Models: Complete Guide to LLM Training


Published: April 24, 2024 | Read time: 15 minutes

A comprehensive guide covering pre-training, post-training, scaling laws, data collection, evaluation methods, and systems optimization for building state-of-the-art language models.


Introduction: What Makes an LLM?

Large language models (LLMs) like ChatGPT, Claude, and Gemini have fundamentally changed how we interact with AI. But how are these models actually built? What separates a state-of-the-art model from a mediocre one?

The answer is more nuanced than most people realize. While much academic research focuses on novel architectures and loss functions, the industry knows that data quality, rigorous evaluation, and systems optimization are what truly separate leading models from the rest.

Key Insight: Most architectural innovations have minimal real-world impact. Scaling laws mean that better hardware and more compute will eventually dwarf any architectural advantage.


Five Key Components for Training LLMs

Building an LLM requires careful attention to five interconnected components:

  1. Architecture — The neural network design (primarily Transformers)
  2. Training Loss & Algorithm — How the model learns and optimizes
  3. Data — What the model trains on
  4. Evaluation — How we measure progress and quality
  5. Systems — How we run these massive models efficiently on modern hardware

While academia tends to focus on components 1 and 2, industry knows that components 3, 4, and 5 are what actually matter. This disconnect explains why many architectural innovations have surprisingly little real-world impact.


Part 1: Pre-Training

Pre-training is the foundation of every modern LLM. It's where models learn language by predicting the next word across billions of documents. This section covers the essential components of the pre-training process.

The Language Modeling Task

At its core, an LLM learns to model the probability distribution over sequences of words. Given a sentence like "the mouse ate the cheese," the model assigns a probability to that sequence being valid English.

The model learns two types of knowledge:

  • Syntactic knowledge — Grammar and structure ("the the mouse at cheese" is unlikely)
  • Semantic knowledge — Meaning and world knowledge ("cheese ate the mouse" is semantically implausible)

Modern LLMs use autoregressive modeling, which decomposes the probability of a full sequence into a product of conditional probabilities:

P(sentence) = P(word₁) × P(word₂|word₁) × P(word₃|word₁,word₂) × ...

This is simply the chain rule of probability applied to language generation.
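Numerically, the decomposition is just a product of per-token conditionals, summed in log space to avoid underflow. The probabilities below are made-up values for illustration:

```python
import math

def sequence_log_prob(conditional_probs):
    # Chain rule: log P(w1..wn) = sum_i log P(w_i | w_1..w_{i-1})
    return sum(math.log(p) for p in conditional_probs)

# Hypothetical conditionals for "the mouse ate the cheese":
# P(the), P(mouse|the), P(ate|the mouse), ...
probs = [0.2, 0.01, 0.05, 0.4, 0.3]

log_p = sequence_log_prob(probs)
p = math.exp(log_p)  # equals the direct product 0.2*0.01*0.05*0.4*0.3
```

Real models work in log-probabilities throughout: a product of thousands of small conditionals would underflow floating point.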

Tokenization: More Important Than You Think

Before training, text must be broken into tokens. You might think: "Why not just use words?" The answer reveals why tokenization deserves serious attention.

Why Tokens Instead of Words?

  • Typos and misspellings — A typo creates an out-of-vocabulary word. Subword tokens handle this gracefully by breaking unfamiliar words into smaller, known pieces.
  • Language diversity — Languages like Thai don't put spaces between words, so word-level tokenization is ill-defined. Subword tokens find a middle ground between words and characters.
  • Sequence length efficiency — Character-by-character tokenization makes sequences roughly 3-4x longer (tokens average 3-4 characters), and Transformer attention cost scales quadratically with sequence length.

How Tokenization Works (Byte Pair Encoding)

  1. Start with a large text corpus
  2. Assign each character its own token
  3. Find the most common pair of adjacent tokens and merge them into a new token
  4. Repeat until you reach your target vocabulary size

The result: typically 3-4 characters per token, balancing expressiveness with sequence efficiency.
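The merge loop above can be sketched in a few lines. This is a toy illustration: real tokenizers operate on bytes and handle pre-tokenization, both omitted here.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent pair of tokens into a single new token."""
    words = [tuple(word) for word in corpus]  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))       # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most common pair
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)        # apply the merge
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
    return merges, words

merges, words = bpe_train(["low", "lower", "lowest"], num_merges=2)
# merges ('l','o') then ('lo','w'), so "low" becomes a single token
```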

The Tokenization Problem: Numbers, code, and math are still poorly tokenized. This is why models struggle with arithmetic and why GPT-4 improved code performance partly through better tokenization.

Data: The Unsexy Secret to LLM Quality

When companies say they "train on the internet," they're glossing over months of unglamorous data work. Data collection and curation is arguably the most important aspect of LLM development, yet it receives far less attention than architecture.

The Data Pipeline

  1. Crawl — Download ~250 billion web pages (1 petabyte of data) from Common Crawl
  2. Extract — Convert HTML to clean text (surprisingly hard; math formulas are tricky)
  3. Filter undesirable content — Remove NSFW, harmful, and PII using classifiers and blacklists
  4. Deduplicate — Remove duplicate headers, footers, and repeated paragraphs across the web
  5. Heuristic filtering — Remove low-quality documents using rules (unusual token distributions, abnormal word lengths)
  6. Model-based filtering — Train a classifier on Wikipedia-referenced pages to identify high-quality content
  7. Domain weighting — Upweight valuable domains (books, code, Wikipedia) and downweight entertainment
  8. Final polish — In the last training phase, overfit on very high-quality data (Wikipedia, human-written content)

The Scale of Data

2017: ~150 billion tokens | 2020: ~1 trillion tokens | 2024: ~15 trillion tokens

That's a 100x increase in just 6-7 years. For context, Llama 3 was trained on 15.6 trillion tokens across 16,000 H100 GPUs for ~70 days.

Why This Matters: Data quality often matters more than model size. A smaller model trained on curated data can outperform a larger model trained on noisy data.

Evaluation: Measuring Progress During Pre-Training

Perplexity is the standard metric during pre-training. It measures how "perplexed" the model is—essentially, how many tokens it's uncertain between:

  • Perplexity = 1: Perfect predictions
  • Perplexity = vocab_size: Complete uncertainty

Perplexity has improved dramatically: from ~70 (2017) to under 10 (2023). In effect, models went from hesitating between ~70 plausible next tokens to fewer than 10.
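Concretely, perplexity is the exponential of the average negative log-likelihood per token, a rough "effective branching factor":

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model uniformly torn between 10 next tokens at every step
# assigns each observed token probability 1/10...
log_probs = [math.log(1 / 10)] * 20
ppl = perplexity(log_probs)  # ...and has perplexity exactly 10
```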

Limitations of Perplexity

  • Depends on tokenizer choice
  • Depends on the dataset used
  • Not comparable across different models

For this reason, academia now uses broader benchmarks such as HELM and the Hugging Face Open LLM Leaderboard, which test across multiple domains (QA, reasoning, knowledge, etc.).

Scaling Laws: Predicting the Future

One of the most important discoveries in deep learning: performance improves predictably with more compute, data, and parameters.

When you plot compute (x-axis) vs. test loss (y-axis) on a log-log scale, you get a straight line. This means you can predict how much better your model will be if you increase resources.

Why This Matters

Instead of training 30 models for 1 day each to find the best hyperparameters, you can:

  1. Train small models at different scales for 3 days
  2. Fit a scaling law to the results
  3. Extrapolate to predict performance at your target scale
  4. Train one large model for 27 days with the predicted-optimal hyperparameters
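The "fit and extrapolate" step is ordinary linear regression in log-log space. The sketch below fits a power law to synthetic runs that follow loss = 4.0 × C^-0.05 by construction (both constants are made up):

```python
import math

def fit_power_law(compute, loss):
    """Least-squares line in log-log space: log L = log a - b * log C,
    i.e. L = a * C^(-b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Three cheap small-scale runs (synthetic, exactly on the law)
compute = [1e18, 1e19, 1e20]
loss = [4.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)

# Extrapolate to the expensive target scale before paying for it
predicted_loss = a * 1e24 ** -b
```

Real scaling-law fits add an irreducible-loss offset and are noisier, but the workflow is the same: fit cheap runs, extrapolate, then commit the big budget.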

Chinchilla Scaling

Research found that the optimal ratio is ~20 tokens per parameter during training. So if you add 1 billion parameters, you should train on 20 billion more tokens.

Note: For inference-optimized models, the ratio is closer to 150:1, since you care about long-term serving costs, not just training efficiency.
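Combining the 20:1 rule with the standard approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D) gives a closed-form allocation. A sketch:

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20):
    """Given a FLOP budget C and the rule D = r * N, solve
    C = 6 * N * D  =>  N = sqrt(C / (6 * r)),  D = r * N."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# For a 1e23 FLOP budget: roughly 29B parameters, 577B tokens
n, d = chinchilla_allocation(1e23)
```

Swapping `tokens_per_param=150` reproduces the inference-optimized regime mentioned above: a much smaller model trained on far more tokens for the same compute.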


Part 2: Post-Training (Alignment)

Pre-training creates a language model. Post-training turns it into an AI assistant that follows instructions, refuses harmful requests, and maintains consistent behavior.

Why Post-Training Is Necessary

If you ask GPT-3 (pure pre-training) "Explain the moon landing to a six-year-old," it might respond with "Explain the theory of gravity to a six-year-old" because it learned that internet text contains question lists, not question-answer pairs.

Post-training teaches the model to:

  • Answer questions directly
  • Follow instructions precisely
  • Refuse harmful requests
  • Maintain consistent personality
  • Provide helpful and accurate information

Supervised Fine-Tuning (SFT)

The first step is straightforward: collect examples of desired behavior and fine-tune the model on them.

The Process

  1. Collect question-answer pairs from humans
  2. Fine-tune the pre-trained model on this data using the same language modeling loss
  3. Done

A Surprising Finding

You don't need much data. The LIMA paper showed that scaling from 2,000 to 32,000 examples barely helps. Why? The model already learned language during pre-training; SFT just teaches it to format answers the way you want.

"Think of pre-training as learning English, and SFT as learning to be a helpful assistant—the hard part (language) is already done."

The Synthetic Data Opportunity

Since you don't need much data, you can use LLMs to generate it. Alpaca did exactly this: take 175 human-written seed examples, ask an OpenAI model (text-davinci-003) to generate 52,000 similar instruction-response pairs, and fine-tune LLaMA 7B on the synthetic data. It worked remarkably well.

Reinforcement Learning from Human Feedback (RLHF)

SFT has limitations:

  • Behavioral cloning — You can only clone what humans write, not what they prefer
  • Hallucination risk — If a human writes a reference the model hasn't seen, the model learns to make up references that sound plausible
  • Expense — Writing good answers is expensive and time-consuming

RLHF solves this by learning from preferences instead of perfect examples.

The RLHF Pipeline

  1. Generate two outputs for each prompt using an SFT model
  2. Have humans rate which is better
  3. Train a reward model to predict human preferences
  4. Use reinforcement learning (PPO) to optimize the main model to maximize the reward

Why use a reward model? Instead of binary feedback ("better" or "worse"), a reward model gives continuous scores. This provides richer signal for optimization.
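Step 3 is typically trained with a Bradley-Terry pairwise loss: push the reward of the human-preferred output above the reward of the rejected one. A minimal sketch of that loss, with scalar rewards standing in for the reward model's outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the reward
    model already scores the preferred output higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

tied = preference_loss(0.0, 0.0)       # untrained: loss is log 2
confident = preference_loss(5.0, 0.0)  # wide correct margin: near 0
```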

This was the key innovation that made ChatGPT so much better than GPT-3.

Direct Preference Optimization (DPO)

RLHF works but is complex. Reinforcement learning is notoriously finicky—lots of hyperparameters, clipping, rollout management.

DPO simplifies this: Instead of training a reward model and using RL, just directly optimize the model to generate preferred outputs while avoiding non-preferred ones.

The loss is elegant:

maximize log P(preferred output | input) - log P(non-preferred output | input)

Under certain assumptions, DPO has the same global optimum as PPO but is much simpler to implement. It's now the standard in open-source and increasingly in industry.
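In full, DPO's loss also involves a frozen reference model (usually the SFT checkpoint) and a temperature β, which the one-line version above omits. A sketch for a single preference pair, using sequence log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    The reference terms keep the policy from drifting far from the
    SFT model while it learns to favor the preferred output."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to the reference: margin is 0, loss is log 2
start = dpo_loss(-12.0, -9.0, -12.0, -9.0)
# Policy now favors the preferred output more than the reference did
improved = dpo_loss(-10.0, -11.0, -12.0, -9.0)
```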

Evaluating Aligned Models

Once a model is aligned, you can't use perplexity anymore. The aligned model's output probabilities are no longer calibrated to natural text; it is optimized to concentrate mass on one "best" answer, so likelihood-based metrics stop being meaningful.

Evaluation Methods

  1. Human evaluation — Have humans rate model outputs on real-world tasks (slow, expensive, but gold standard)
  2. Chatbot Arena — Let internet users interact with models blindly and vote on quality (free, scalable, but biased toward tech-savvy users)
  3. LLM evaluation — Use GPT-4 to compare outputs (fast, cheap, but can have systematic biases)

The best approach: combine methods. Use LLM evaluation for rapid iteration, validate periodically with human or arena evaluation.


Part 3: Systems—Making It Feasible

Training a 400B parameter model requires solving engineering problems that most people never think about. Systems optimization can mean the difference between a project being feasible or prohibitively expensive.

Why Systems Matter

Compute is the bottleneck. You can't just "buy more GPUs"—they're expensive, scarce, and adding more creates communication overhead.

Key insight: GPUs are optimized for throughput (doing many operations in parallel) but communication is slow. Modern GPUs achieve only ~45-50% of theoretical peak performance because data can't move fast enough.

Low Precision Training

A simple but powerful optimization: use 16-bit floats instead of 32-bit.

Why This Works

  • 16-bit numbers use half the memory and bandwidth
  • Deep learning has enough noise that precision beyond 16-bit doesn't matter
  • Updating weights by 0.01 vs. 0.015 makes no practical difference

Standard Approach

  • Store model weights in 32-bit (for gradient accumulation)
  • Convert to 16-bit for matrix multiplications
  • Update weights in 32-bit

This alone can speed up training 2-3x with minimal accuracy loss.
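Why keep a 32-bit master copy of the weights at all? Half precision has only ~10 bits of mantissa, so small gradient updates round away entirely. The sketch below round-trips numbers through IEEE 754 half precision using `struct`'s `'e'` format:

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

w = 1.0
update = 1e-4                   # a small gradient step

# In fp16 the spacing between values near 1.0 is 2**-10 ≈ 0.00098,
# so adding 1e-4 rounds straight back to 1.0: the update is lost.
fp16_result = to_fp16(w + update)
fp32_result = w + update        # survives in higher precision
```

This is exactly why the standard recipe accumulates updates into 32-bit master weights while doing the matrix multiplications in 16-bit.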

Operator Fusion

Every unfused PyTorch operation launches its own GPU kernel, which reads its inputs from GPU global memory and writes its outputs back. Chaining many small operations therefore spends more time moving data than computing.

Naive Approach

x1 = x.cos()  # kernel 1: read x from GPU memory, write x1 back
x2 = x1.sin() # kernel 2: read x1 from GPU memory, write x2 back

Fused Approach

x2 = torch.compile(lambda x: x.cos().sin())(x)
# one fused kernel: read x once, compute cos then sin, write x2 once

Using torch.compile can make models up to 2x faster by compiling PyTorch code into fused GPU kernels that minimize data movement.

The Cost of Training

Here's a concrete example: Llama 3 (400B parameters)

  • Compute: 3.8 × 10²⁵ FLOPs
  • Hardware: 16,000 H100 GPUs for ~70 days
  • GPU Cost: ~$52 million ($2/hour rental)
  • Salaries: ~$25 million (50 employees)
  • Total: ~$75 million
  • Carbon: ~4,000 tons CO2 (≈ 2,000 transatlantic flights)

And this was just the pre-training phase. Add post-training, evaluation infrastructure, and the total easily exceeds $100 million.
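The GPU line item follows from simple arithmetic on the figures above; rental rates and exact durations vary, so treat this as a back-of-the-envelope check:

```python
gpus = 16_000
days = 70
price_per_gpu_hour = 2   # dollars, the $2/hour rental assumption

gpu_cost = gpus * days * 24 * price_per_gpu_hour
# = $53.76M, in line with the ~$52M figure given the rounded 70-day estimate
```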


Key Takeaways

What Actually Matters

The speaker makes a crucial observation: academia focuses on architecture and loss functions, but industry knows the real value lies in:

  1. Data — Quality and scale
  2. Evaluation — Rigorous benchmarking
  3. Systems — Efficient implementation

Small architectural differences matter far less than commonly assumed. Scaling laws mean that better hardware and more compute will eventually dwarf any architectural advantage.

What's Still Unsolved

Despite rapid progress, major challenges remain:

  • Data scarcity — We may be approaching the limit of publicly available internet text
  • Synthetic data — Can we generate enough high-quality synthetic data without degradation?
  • Multimodality — How do images, video, and audio improve text understanding?
  • Inference cost — Serving billions of requests is expensive; better algorithms are critical
  • Alignment — How do we ensure models behave as intended?
  • Legality — Copyright and data licensing remain unresolved

The Path Forward

Training LLMs is becoming more standardized. The basic recipe is clear:

  1. Pre-train on massive internet-scale data
  2. Apply scaling laws to optimize compute allocation
  3. Fine-tune with SFT on high-quality examples
  4. Align with DPO or similar preference optimization
  5. Evaluate rigorously with multiple methods

But execution matters enormously. Data quality, systems optimization, and thoughtful evaluation separate leading models from mediocre ones.


Learn More

If you want to dive deeper into LLM training, the speaker recommends three Stanford courses:

  • CS224N — NLP foundations and historical context
  • CS324 — Large language models, in-depth
  • CS336 — Build your own LLM from scratch (challenging but comprehensive)

The field is moving fast. What's standard today will be outdated in a year. The best way to stay current is to understand the fundamentals—architecture, data, training algorithms, evaluation—and then follow how practitioners adapt them to new constraints and capabilities.


This post summarizes a comprehensive lecture on building large language models; the full lecture covers these topics in depth with live Q&A.
