How AI Works

From Perceptrons to Reasoning Agents

A long-form, animated walkthrough of how artificial intelligence evolved over seventy years — and how today's large language models actually think, call tools, and run on either a remote datacenter or the laptop on your desk.

70 Years of AI · 4 Paradigm Shifts · 1.8T+ Parameters (Frontier) · Tools Reachable
Chapter 01

Seventy Years in Two Minutes

AI did not begin with ChatGPT. It is a slow accumulation of three ideas — symbolic reasoning, learning from data, and scale — punctuated by a few moments where everything changed at once.

1956

Dartmouth Workshop

McCarthy, Minsky, Shannon and Rochester coin the term "artificial intelligence" and set the field in motion.

1958

Perceptron

Rosenblatt builds the first learning neural network — a single-layer linear classifier on custom hardware.

1980s

Expert Systems

Hand-coded rule engines (XCON, MYCIN) automate niche professional reasoning. Brittle, but commercially real.

1986

Backpropagation

Rumelhart, Hinton and Williams popularize the algorithm that lets multi-layer networks actually learn.

1997

Deep Blue beats Kasparov

IBM's search-based engine wins a 6-game match. Brute force + heuristics, not learning — but a public watershed.

2012

AlexNet

Krizhevsky, Sutskever and Hinton crush ImageNet with a deep CNN on two GPUs. Modern deep learning era opens.

2014

GANs · Seq2Seq

Generative adversarial networks (Goodfellow) and encoder-decoder translation models redefine generation.

2017

Attention Is All You Need

Vaswani et al. publish the Transformer. Self-attention replaces recurrence — every modern LLM descends from this.

2018

BERT · GPT-1

Pretraining on raw text becomes the dominant recipe. Language models stop being task-specific.

2020

GPT-3

175B parameters. Few-shot prompting works. Scaling laws (Kaplan et al.) suggest the ride is far from over.

2022

ChatGPT

RLHF turns GPT-3.5 into a usable assistant. 100M users in 2 months — fastest consumer adoption in history.

2023

GPT-4 · Llama 2

Multimodal frontier closed models and the first competitive open-weights family ship within months of each other.

2024

Tool-Use & Agents

Function calling, MCP, computer-use. LLMs stop being chat boxes and start operating real software.

2025–26

Reasoning Models

o-series, Claude, Gemini reasoning variants spend inference compute on chain-of-thought. Local 70B-class models match 2023 frontier.


Chapter 02

What Is a Neural Network?

A neural network is a graph of weighted multiplications and non-linear squashing functions. That's it. Everything else — vision, language, reasoning — is what emerges when you stack enough of them and feed them enough data.

[diagram: Input (pixels · tokens · features) → Hidden 1 (edges · n-grams) → Hidden 2 (shapes · phrases) → Hidden 3 (concepts · intent) → Output (class · next token)]
forward pass · learned weights · non-linear activation
01

Weights

Each connection carries a number. Training adjusts those numbers — billions of them — so the output gets closer to the right answer.

02

Activation

Each node sums its inputs and squashes the result through a non-linear function (ReLU, GELU). Without that step, the whole network collapses to a line.

03

Backprop

The error at the output flows backward through the graph. Each weight learns how much it contributed and nudges itself in the right direction.
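The three ideas above fit in a few lines of NumPy. A minimal sketch with toy sizes and invented data: a two-layer network learns a single example by nudging its weights down the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer network: 3 inputs -> 8 hidden (ReLU) -> 1 output.
W1 = rng.normal(0, 0.5, (3, 8))
W2 = rng.normal(0, 0.5, (8, 1))

x = np.array([[0.2, -0.1, 0.7]])   # one training example
y = np.array([[1.0]])              # its "right answer"

def forward(x):
    h_pre = x @ W1                 # weighted sums into the hidden layer
    h = np.maximum(h_pre, 0)       # ReLU: the non-linear squashing step
    return h_pre, h, h @ W2        # weighted sum into the output

loss_before = float((forward(x)[2] - y) ** 2)

for _ in range(100):
    h_pre, h, out = forward(x)
    # Backprop: the output error flows backward through the graph,
    # and every weight nudges itself by its share of the blame.
    d_out = 2 * (out - y)          # dLoss/dOut for squared error
    dW2 = h.T @ d_out
    d_h = d_out @ W2.T
    d_h[h_pre <= 0] = 0            # the ReLU gate blocks gradient where it was off
    dW1 = x.T @ d_h
    W2 -= 0.1 * dW2                # nudge every weight
    W1 -= 0.1 * dW1

loss_after = float((forward(x)[2] - y) ** 2)   # much smaller than loss_before
```

Real training differs only in scale: billions of weights, automatic differentiation instead of hand-written gradients, and batches of examples instead of one.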

Chapter 03

How an LLM Actually Thinks

A large language model does exactly one thing: it predicts the next token. Everything else — answering questions, writing code, holding a conversation — is a side effect of doing that very, very well.

01

Tokenization

Text → numbers

Your sentence is sliced into tokens — sometimes whole words, often sub-words — drawn from a vocabulary of roughly 50,000 pieces. Each token becomes an integer ID the model can address.

input: "The cat sat on the mat"
tokens: The · cat · sat · on · the · mat
ids: 464 · 3797 · 3332 · 319 · 262 · 2603
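The mechanics can be sketched in a few lines. Real tokenizers use byte-pair encoding learned from data; the tiny vocabulary and ids below are invented for illustration, using greedy longest-match as a stand-in for the real algorithm.

```python
# A toy tokenizer: greedy longest-match against a tiny vocabulary.
# Real models use byte-pair encoding over ~50K pieces; this is the
# same idea shrunk to a handful of entries with made-up ids.
vocab = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4,
         " mat": 5, " ma": 6, "t": 7, " ": 8}

def tokenize(text):
    ids = []
    while text:
        # take the longest vocabulary piece that prefixes the remaining text
        piece = max((p for p in vocab if text.startswith(p)), key=len)
        ids.append(vocab[piece])
        text = text[len(piece):]
    return ids

tokenize("The cat sat on the mat")   # one integer id per piece
```

Note how " mat" wins over " ma" because it is longer — sub-word fallbacks like " ma" + "t" only fire for words the vocabulary has never seen whole.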
02

Embeddings

Numbers → meaning vectors

Each token ID is looked up in a giant matrix and becomes a vector — typically 4,096 to 16,384 numbers. Positions in that high-dimensional space encode meaning: 'king' and 'queen' end up near each other, and far from 'sandwich'.

[figure: each of the six tokens — The · cat · sat · on · the · mat — becomes a 4,096-dimensional vector]
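The lookup itself is nothing more than matrix row indexing. A sketch with toy sizes (real tables run to ~50K rows by 4,096 to 16,384 columns), reusing the token ids from the previous step:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5_000, 64        # toy sizes; real models use ~50K × 4,096+

# The embedding table is just one big matrix: one learned row per token id.
embedding = rng.normal(size=(vocab_size, dim)).astype(np.float32)

token_ids = [464, 3797, 3332, 319, 262, 2603]   # "The cat sat on the mat"
vectors = embedding[token_ids]                   # a plain row lookup

vectors.shape   # (6, 64): six tokens, each now a vector of numbers
```

Training is what gives those rows their geometry — nearby rows end up meaning similar things.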
03

Self-Attention

Every token reads every other token

The Transformer's core trick. For each token, the model computes how much attention to pay to every previous token. That's how it knows which 'it' refers to which 'cat', or that a function argument relates to a return type 200 lines away.

[attention heatmap over "The cat sat on the mat" · row = current token · column = what it looks at · each row spreads its attention over the tokens at or before it]
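The computation behind that heatmap is scaled dot-product attention with a causal mask. A NumPy sketch, with random matrices standing in for the learned projection weights:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a causal mask: each token may
    only look at itself and the tokens before it."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # token-to-token match scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                    # block attention to the future
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # softmax: each row sums to 1
    return w @ V, w

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))                  # 6 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out, weights = attention(x @ Wq, x @ Wk, x @ Wv)
# `weights` is the heatmap: row i says how much token i attends to each j ≤ i
```

A real Transformer runs dozens of these heads in parallel, per layer, with learned Q/K/V projections — but every one of them is this computation.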

04

Sampling

A probability distribution → one word

The final layer produces a probability for every token in the vocabulary. The model either picks the most likely (greedy), or samples with a temperature setting that controls how 'creative' it is. Then it loops — that one new token becomes input to the next prediction.

mat 42% · floor 21% · chair 13% · roof 4%

→ next token: "mat" · loop and continue
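Greedy and temperature sampling fit in one small function. A sketch — the toy vocabulary and logits below are invented to mirror the distribution above:

```python
import numpy as np

def sample(logits, temperature=1.0, rng=None):
    """Pick the next token id: greedy at temperature 0, random above it."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))          # greedy: always the top token
    scaled = logits / temperature
    p = np.exp(scaled - scaled.max())
    p /= p.sum()                               # softmax over the vocabulary
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(p), p=p))        # draw from the distribution

# toy candidates for the blank in "The cat sat on the ___"
vocab = ["mat", "floor", "chair", "roof"]
logits = [2.0, 1.3, 0.8, -0.4]

vocab[sample(logits, temperature=0)]    # → "mat", every time
vocab[sample(logits, temperature=1.5)]  # sometimes "floor" or "chair"
```

Higher temperature flattens the distribution, so unlikely tokens get picked more often — that is the whole "creativity" knob.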

Chapter 04 · Deep Dive

Words Become Coordinates

Inside a model, every word, sentence, image, and snippet of code is a point in a high-dimensional space. Similar things land near each other. Most of what makes AI feel smart is just clever geometry on those points.

embedding space · 2D projection of ~3,072 dimensions
[scatter: words cluster by meaning: royalty (king · queen · prince · crown), people (man · woman), animals (dog · cat · lion · wolf), tech (code · server · algorithm · compiler), cities (paris · tokyo · berlin · london), food (bread · pizza · sushi · pasta)]
vector arithmetic
king − man + woman ≈ queen

The classic Word2Vec demonstration. The direction from man to king is roughly the same as the direction from woman to queen — the model has learned an axis for "royalty" without ever being told the word.

cosine similarity · −1 to 1
king · queen → 0.86
king · pizza → 0.07
dog · wolf → 0.74
paris · tokyo → 0.62
code · compiler → 0.81
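Cosine similarity itself is one line of arithmetic. A sketch with invented 3-dimensional vectors (real embeddings have thousands of axes; the numbers here are chosen only to mirror the pattern above):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": two royalty-ish vectors and one food-ish vector.
king  = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.1]
pizza = [0.1, 0.0, 0.9]

cosine(king, queen)   # high: close to 1
cosine(king, pizza)   # low: close to 0
```

Because it measures angle rather than length, cosine similarity ignores how "loud" a vector is and keeps only its direction — which is where the meaning lives.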
01

Thousands of dimensions

Real embeddings live in 1,024–3,072 dimensions, not two. Each axis encodes some learned aspect of meaning — gender, formality, animacy, intent. We can only draw two of them, but the model uses all of them at once.

02

Distance = meaning

Cosine similarity between two vectors is how a model judges relatedness. Nearest-neighbor search over millions of embeddings is how RAG, semantic search, recommendation, and de-duplication all work under the hood.

03

Same trick for everything

Embed images, audio, code, even DNA — the same vector space lets you search across modalities. CLIP famously embedded text and pictures into a shared space, which is why you can search photos with a sentence.

Chapter 05

How the Model Became Smart

A frontier LLM is not one model — it is four trained on top of each other. Most of the cost lives in the first stage; most of the personality lives in the last three.

01
Pretraining

A trillion tokens of the internet

Predict the next word — over and over — across books, code, papers, and web pages. After ~10²⁵ FLOPs the model picks up grammar, facts, reasoning patterns, and the structure of dozens of languages without ever being told what any of them are.

~15T tokens
~6 months on 25,000 GPUs
02
Supervised fine-tuning

Show, don’t tell

Hand-written instruction → response pairs teach the base model what a helpful answer looks like. Now it stops auto-completing the prompt and starts addressing it.

~100K – 1M pairs
humans + curated demonstrations
03
RLHF

Humans rank, the model learns the ranking

For each prompt, generate two answers. Ask a human which is better. Train a reward model on those preferences, then use reinforcement learning to push the LLM toward higher-rewarded outputs.

preference data
reward model + PPO / DPO
04
Constitutional / RLAIF

The model critiques itself

Replace most of the human raters with another AI guided by a written constitution — a list of principles the model should respect. Faster, cheaper, and the rules are auditable text instead of a frozen reward model.

+ written principles
Anthropic’s approach
[chart: pretraining loss vs tokens seen, 1T → 15T]

Loss falls fast in the first trillion tokens, then slows to a grind. The last few percent of capability cost more compute than everything before them.

RLHF · one preference round
response A

Sure! Here's a list of three reasons, with citations and a short summary at the end.

▲ chosen
response B

Yeah whatever, here you go.

▼ rejected
reward model
+0.78

Multiply this round by millions of preference comparisons and you have a numerical model of taste — strong enough to steer a 100B+ parameter network.
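The numerical heart of that preference round is the Bradley–Terry loss most reward models are trained with: push the chosen answer's reward above the rejected one's. A minimal sketch (the reward values are illustrative):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss used to train reward models:
    small when the chosen answer already scores well above the
    rejected one, large when the model has the ranking backwards."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))   # -log sigmoid(margin)

preference_loss(0.78, -0.40)   # clear correct preference: small loss
preference_loss(0.10, 0.60)    # ranking backwards: large loss
```

Summed over millions of comparisons, minimising this loss is what turns thumbs-up/thumbs-down data into a differentiable model of taste.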

Scale vs alignment

Bigger pretraining gives a smarter base model — but a smarter model with no alignment is a worse product, not a better one. The frontier is in the post-training, not the parameter count.

Helpfulness vs safety

Push too hard for helpful and the model will help with anything. Push too hard for safe and it refuses to write a poem. Modern training treats this as a multi-objective optimisation, not a single dial.

Why models still hallucinate

Pretraining rewards plausible text, not true text. Post-training reduces hallucinations but never eliminates them — the model still has no built-in notion of uncertainty unless it was trained explicitly to express one.

Chapter 06

Thinking, Out Loud

A reasoning model is a regular LLM that has been trained to generate a long internal monologue before its final answer. It buys correctness with tokens — and sometimes that trade is the only one that works.

vanilla LLM · 0.4s
prompt: Three boxes are each mislabeled. One has apples, one oranges, one mixed. How many fruits must you draw to label them all correctly?
answer

Two — one from each labeled box.

wrong

reasoning model · ~12s · 4,200 thinking tokens
prompt: Three boxes are each mislabeled. One has apples, one oranges, one mixed. How many fruits must you draw to label them all correctly?
thinking trace
  • Let me re-read the puzzle.
  • There are 3 boxes. Each is mislabeled.
  • I pick from the box labeled "mixed".
  • If I draw an apple, that box must be "apples".
  • Then the "apples" label is wrong, so it’s "oranges".
  • And the "oranges" label is wrong, so it’s "mixed".
  • One draw is enough. Verifying… consistent.
answer

One. Drawing from the box labeled "mixed" is enough.

correct

tree of thought · explore · prune · commit

Some training recipes encourage the model to fan out — try several short hypotheses, evaluate each, prune the bad ones, then commit. The visible answer is the survivor of a tournament the user never sees.

legend: explored · kept · committed

Where reasoning helps

Math, formal logic, multi-step coding, debugging, planning. Anything where one wrong sub-step poisons the rest of the answer benefits from being able to backtrack.

The latency tax

A reasoning model can spend 10–60 seconds (and 5–20× the tokens) before its first visible output. Worth it for a hard answer; pure overhead for "what time is it in Paris".

The 2026 lineup

OpenAI o-series, Claude with extended thinking, Gemini Thinking, DeepSeek R1, Qwen QwQ. Each gives you a knob for how long the model is allowed to deliberate.

Chapter 07

Functions, Tools & Agents

A model on its own only knows what was in its training data. To do anything useful in the real world — read a database, send an invoice, search the web today — it needs to call code. That mechanism is called function calling (or tool use), and it's the difference between a chatbot and an agent.

web_search() · fetch live information
calculator() · arithmetic, units, finance
sql_query() · read your database
send_email() · trigger notifications
create_invoice() · business actions
browser_use() · click, type, navigate
⟶ tool-use loop
01 · User · "Email John the Q3 numbers"
02 · LLM · I need data, then to send mail
03 · sql_query · SELECT revenue FROM q3
04 · LLM · Got $4.2M. Compose email.
05 · send_email · to: john@…
06 · Done · Sent. Confirmed to user.

The loop is simple: the model emits a structured request to call a function, your runtime executes the actual code, the result is handed back as a new message, and the model decides what to do next. The cycle continues — sometimes for dozens of steps — until there's nothing left to call. That's an agent.

you define the tool
{
  "name": "get_invoice",
  "description": "Fetch an invoice by id",
  "input_schema": {
    "type": "object",
    "properties": {
      "invoice_id": { "type": "string" }
    },
    "required": ["invoice_id"]
  }
}
the model emits this
{
  "type": "tool_use",
  "name": "get_invoice",
  "input": {
    "invoice_id": "INV-2026-04-118"
  }
}

Your code receives this, calls the real API, and returns the result. The model continues from there.
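The whole loop fits in a page of Python. A sketch — the get_invoice backend and the scripted stand-in for the model are invented for illustration; a real system swaps in a provider API call and real services:

```python
import json

# Hypothetical tool backend: in production this would hit a real API.
def get_invoice(invoice_id):
    return {"invoice_id": invoice_id, "total_usd": 4200}

TOOLS = {"get_invoice": get_invoice}

def run_agent(model_step, user_message, max_steps=10):
    """The agent loop: the model emits tool calls, the runtime executes
    them, results go back in as messages, until nothing is left to call."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model_step(messages)                   # one LLM call
        if reply.get("type") != "tool_use":
            return reply["content"]                    # plain text: done
        result = TOOLS[reply["name"]](**reply["input"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent exceeded max_steps")

def scripted_model(messages):
    """A scripted fake LLM so the loop runs end to end without an API key."""
    if len(messages) == 1:
        return {"type": "tool_use", "name": "get_invoice",
                "input": {"invoice_id": "INV-2026-04-118"}}
    total = json.loads(messages[-1]["content"])["total_usd"]
    return {"type": "text", "content": f"Invoice total is ${total}."}

run_agent(scripted_model, "What's the total on INV-2026-04-118?")
# → "Invoice total is $4200."
```

Everything agent frameworks add — retries, parallel calls, permission checks — is scaffolding around this one loop.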

CHATBOT

No tools

Replies from training data only. Can be brilliant at language, useless at facts that change after the cutoff date.

AGENT

Tools + loop

Reads your CRM, runs SQL, sends Slack messages, schedules a call. Capability scales with the toolbelt you give it.

Chapter 08 · Practice

Prompts Are an Interface

Most of the gap between a useless answer and a useful one is in the prompt, not the model. The good news: prompting follows recognisable patterns, and almost all of them are about being specific in the right places.

anatomy of a working promptstack from outside in
system
You are a senior code reviewer. Be terse. Always cite line numbers.
sets the persona, tone and rules of engagement
context
Repo: payments-api · file: charge.ts · diff: +42/−7
the data the model needs but didn’t train on
instruction
Review for race conditions, missing null checks, and currency rounding bugs.
the actual ask · one verb is usually enough
examples
Bad → "looks fine" Good → "L42: idempotency key not validated, retries can double-charge"
few-shot guides shape and quality at once
output_format
{"issues": [{"severity": "high|med|low", "line": int, "fix": string}]}
lock down the shape so downstream code can parse it
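Assembling that stack programmatically keeps each layer versionable on its own. A provider-agnostic sketch using the example strings above (the XML-style context wrapper is one common convention, not a requirement):

```python
def build_prompt(system, context, instruction, examples, output_format):
    """Assemble the five prompt layers into one message list."""
    user = "\n\n".join([
        f"<context>\n{context}\n</context>",   # data the model didn't train on
        instruction,                            # the actual ask
        "Examples of good findings:\n" + "\n".join(examples),
        f"Respond only with JSON matching: {output_format}",
    ])
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_prompt(
    system="You are a senior code reviewer. Be terse. Always cite line numbers.",
    context="Repo: payments-api · file: charge.ts · diff: +42/−7",
    instruction="Review for race conditions, missing null checks, and currency rounding bugs.",
    examples=['"L42: idempotency key not validated, retries can double-charge"'],
    output_format='{"issues": [{"severity": "high|med|low", "line": int, "fix": string}]}',
)
```

Because each layer is a separate argument, you can diff, test, and roll back any one of them without touching the rest.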

Zero-shot → Few-shot

before
Categorise this support email.
after
Examples:
  "card declined" → billing
  "won’t install" → onboarding
  "still slow"   → performance

Now: "{message}" →
before: 42% · after: 84%

Vague → Role-primed

before
Make this paragraph better.
after
You are an editor at The Economist. Cut filler. Replace abstract nouns with concrete verbs. Keep length within ±10%.

[paragraph]
before: 38% · after: 78%

Freeform → JSON schema

before
Extract the date and amount.
after
Return only valid JSON matching:
{"date": "YYYY-MM-DD", "amount_usd": number}
If either is missing, return null for that field.
before: 55% · after: 96%

Treat prompts like code. Version them, test them on a real eval set, and never edit a production prompt without a diff.

Chapter 09

Beyond Text

Text was just the first modality to fall. Today’s frontier models take any combination of words, pictures, sound, and video — and handle them as different views of the same shared embedding space.

four streams · one shared space
TEXT · words → tokens
IMAGE · 16×16 patches → tokens
AUDIO · spectrogram strips → tokens
VIDEO · frames + time → tokens
→ shared space · unified tokens
frontier models · 2026 · what they accept natively
[table: native input support (text · image · audio · video) for Claude 4.x, GPT · o-series, Gemini 2.x, Llama 4, and Qwen 3]

Document understanding

Drop a 100-page contract PDF in. The model treats every page as an image, every paragraph as text, and answers questions across both — no OCR pipeline required.

Voice mode

Speech-in, speech-out, end to end. The same network plans the answer and shapes the prosody. Latencies have dropped from seconds to ~300 ms.

Video Q&A

Sample a few hundred frames + the audio track, embed them in the same space as text, and the model can answer "at what minute does the speaker contradict herself?".

Image generation

A diffusion or autoregressive head turns the same shared embeddings back into pixels. Edit a photo by describing the change in plain English.

Chapter 10

AI in the Real World

Tool use isn't hypothetical. Right now, AI models are folding proteins, designing entirely new molecules, controlling fusion reactors, and forecasting the weather better than the supercomputers they replaced. The most consequential application — and the most expensive problem AI has ever been pointed at — is the discovery of new medicines.

$2.6B
Cost per approved drug
10–15 yr
Discovery → market
90%
Clinical-trial failure rate
10⁶⁰
Drug-like molecules possible
⟶ traditional pipeline · attrition by stage

The 10,000-to-1 funnel

10,000 candidate compounds → 250 hits in screening → 10 lead molecules → 5 in preclinical → 1 approved drug

Of every 10,000 candidate molecules a chemist synthesises, roughly one survives clinical trials. AI is reshaping every stage of this funnel — narrowing the search space, designing molecules that don't exist yet, and predicting failure before a single test tube is filled.

How scientists actually use AI

Five concrete stages where models are now part of the lab — not as assistants, but as the engine doing the work.

01

Target identification

Which protein, mutation, or pathway should the drug attack?

Models read every paper, patent, and trial registry ever published, plus genomic and proteomic data from millions of patients. They build a knowledge graph of disease causality — and rank the proteins most likely to be druggable. Insilico's PandaOmics and BenevolentAI's graph engine do exactly this.

[knowledge graph: IPF (disease) linked to candidate targets TGF-β · TNIK · CTGF · PDGFR, one ranked as top hit · evidence edges from papers, patents, trials, and omics data]
02

Protein structure prediction

From amino-acid sequence to a 3D shape you can dock molecules into

A protein's function is determined by how it folds — and folding was an unsolved problem for 50 years. AlphaFold 2 (2020) and AlphaFold 3 (2024) collapsed it. Predictions accurate to within an atom, in seconds, for any sequence on Earth. The full structural proteome — over 200 million proteins — is now public.

seq: MKTAYIAKQRQISFVKSHFSRQLEERLG… → 3D structure
03

Generative chemistry

Design new molecules that don't exist yet

Diffusion models and graph VAEs trained on tens of millions of known compounds learn the latent space of valid chemistry. Given a binding pocket, they generate novel molecules optimised for affinity, solubility, and synthesisability — searching a space of 10⁶⁰ possible drugs that no human team could enumerate.

[generated molecule · sample 0341 / 12,800]
Binding affinity: 82% · Solubility: 71% · Synthesisability: 64% · Toxicity: 18%

Each generated candidate is scored on multiple objectives in parallel — the model learns to optimise all of them at once.

04

Automated wet labs · phenomics

Robots run the experiments, vision models read the results

Recursion's labs image millions of human cells per week, perturbed by thousands of compounds and gene knockouts. Self-supervised CNNs convert each image into an embedding — phenotypes that look the same end up in the same neighbourhood, revealing molecules that 'rescue' diseased cells without anyone needing to know the mechanism.

[cell-imaging plate · 48 wells shown · highlighted cluster: rescue phenotype]
legend: confirmed rescue · neighbour in embedding space · baseline / no effect
05

Clinical-trial acceleration

Predict failure before patient enrolment opens

Models trained on decades of historical trials predict which compounds will fail Phase 2 toxicity, suggest patient-stratification cohorts, and even generate digital-twin control arms. Fewer wasted compounds, fewer wasted years, less human risk.

MOL-091 · 18% · pass
MOL-092 · 74% · flag
MOL-093 · 92% · fail
MOL-094 · 32% · pass
MOL-095 · 61% · flag

Predicted Phase-2 failure risk · trained on 30+ years of historical trial outcomes. Flagged compounds are re-engineered before a single patient is recruited.

The toolbelt of a 2026 computational scientist

Six platforms doing the heaviest lifting today. Some are open weights you can run on a workstation; some are commercial pipelines worth multi-billion-dollar deals.

AlphaFold 3

structure
Google DeepMind / Isomorphic

Predicts the 3D structure of proteins, DNA, RNA, and ligand complexes from sequence alone. Solved 200M+ structures publicly.

RFdiffusion

design
Baker Lab, U. of Washington

A diffusion model that designs entirely new proteins from scratch — binders, enzymes, scaffolds. Work from the Baker Lab recognised by the 2024 Nobel Prize in Chemistry.

Boltz-1 / Chai-1

docking
MIT · Chai Discovery

Open-weights successors to AlphaFold for protein–ligand docking. Lab-runnable, no API gatekeeping.

GNoME

materials
Google DeepMind

2.2 million new crystal structures predicted — 380,000 stable. An 800-year leap in materials science in one model.

Pharma.AI

pipeline
Insilico Medicine

End-to-end pipeline: target discovery (PandaOmics) + generative chemistry (Chemistry42). First AI-designed drug now in Phase 2.

Recursion OS

phenomics
Recursion Pharmaceuticals

Robotic labs run millions of cell-imaging experiments per week; CNNs cluster phenotypes to find drug candidates by visual similarity.

Companies already shipping

Not research papers — actual molecules in actual humans, or partnerships where Big Pharma is paying real money for AI-designed candidates.

Isomorphic Labs

Alphabet · DeepMind spin-out
$3B+
partnership value
AlphaFold-powered drug design

Founded 2021. Partnerships with Eli Lilly and Novartis worth $3B+ in milestones. Uses AlphaFold 3 to model how candidate molecules interact with disease-causing proteins — collapsing months of crystallography into minutes.

Insilico Medicine

Hong Kong · NYC
30 mo.
discovery → clinic
First end-to-end AI-designed drug in human trials

INS018_055 — a treatment for idiopathic pulmonary fibrosis (IPF) — was discovered, designed, and brought to Phase 1 in under 30 months for ~$3M. Now in Phase 2 trials, the first drug where both the target and the molecule came from AI.

Recursion

Salt Lake City · NASDAQ: RXRX
~50M
experiments / week
Phenomics + automated wet labs

Robotic systems image millions of human cells under thousands of perturbations every week. Self-supervised vision models cluster phenotypes; matches reveal which molecules rescue diseased cells. 10+ programs in or near the clinic.

BenevolentAI

London · LSE: BAI
1B+
graph relations
Knowledge-graph reasoning over biomedical literature

A graph of 1B+ relationships from papers, patents, and clinical data. In 2020 their model proposed baricitinib for COVID-19 within 48 hours; the FDA later approved it. Now applied to ALS, ulcerative colitis, and chronic kidney disease.

Beyond medicine — six other frontiers

The same recipe — train a large model on a domain's data, then let it generate or predict what experiments would have taken decades to find. It works almost everywhere it's been tried.

Weather
GraphCast · Aurora

A graph neural network that beats the European supercomputer model on 10-day forecasts — and runs in under a minute on a single TPU instead of hours on a cluster.

Fusion
DeepMind × EPFL

Reinforcement learning controls the magnetic coils of a tokamak in real time, holding plasma in shapes humans never managed to stabilise — a step toward commercial fusion.

Mathematics
AlphaProof · AlphaGeometry 2

Solved 4 of 6 problems at the 2024 International Math Olympiad — silver-medal performance. Geometry was solved by combining a language model with a symbolic deduction engine.

Astronomy
LIGO · Vera Rubin pipelines

CNNs scan gravitational-wave streams for black-hole mergers in real time, and triage tens of millions of nightly transient detections from new sky surveys.

Climate
NeuralGCM

Hybrid neural / physics climate model from Google. Atmospheric simulations 100,000× cheaper than the legacy spectral solvers used by national weather services.

Robotics
RT-2 · Optimus · Figure

Vision-language-action models map a camera frame and a sentence (“pick up the red mug”) directly to motor torques — a generalist policy instead of bespoke per-task code.

None of these systems are general intelligence. They are narrow, domain-specific function approximators trained on data nobody could sift through manually. The shift is that the bottleneck in science used to be human imagination over a tiny search space; the bottleneck is now wet-lab validation of a search space the models can canvass in an afternoon.

Chapter 11

How "Smart" Gets Measured

Every model release ships with a battery of benchmark scores. They're a microscope, not a mirror — useful for comparing neighbours, dangerous if you confuse them with the territory. Numbers below are illustrative of where the frontier sits in 2026.

MMLU

/ 100

undergraduate-level general knowledge across 57 subjects

Claude Opus 4.x
89
GPT-5
91
Gemini 2.x Pro
90
Llama 4
84
DeepSeek V3
86

HumanEval

/ 100

164 Python coding problems with hidden unit tests

Claude Opus 4.x
95
GPT-5
94
Gemini 2.x Pro
90
Qwen 3-Coder
92
DeepSeek V3
89

GPQA Diamond

/ 100

PhD-level multiple choice in biology, physics, chemistry

o-series reasoning
78
Claude (extended thinking)
75
Gemini 2.x Thinking
73
DeepSeek R1
71
Vanilla Claude Sonnet
58

SWE-Bench Verified

/ 100

real GitHub issues — patch must pass the existing test suite

Claude Opus 4.x · agent
72
GPT-5 · agent
68
Gemini 2.x · agent
60
Qwen 3-Coder · agent
55
Best 2024 model
42

A benchmark is a microscope, not a mirror.

The only number that matters is how the model performs on your data and your task. Build a 50-example eval set on day one — every model decision after that gets easier.
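An eval set needs nothing fancier than a list of (prompt, check) pairs and a loop. A sketch — the rule-based toy_model is a stand-in for whatever callable wraps your provider's API, and the three cases below echo the support-email example from Chapter 08:

```python
def run_eval(model, cases):
    """Score a model callable against a fixed eval set: fraction passed."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

# A 3-example sketch of the 50-example set the text recommends.
cases = [
    ("Categorise: 'card declined'",   lambda out: "billing" in out.lower()),
    ("Categorise: \"won't install\"", lambda out: "onboarding" in out.lower()),
    ("Categorise: 'still slow'",      lambda out: "performance" in out.lower()),
]

def toy_model(prompt):
    """Rule-based fake model so the harness runs without an API key."""
    table = {"card declined": "billing", "won't install": "onboarding",
             "still slow": "performance"}
    return next((v for k, v in table.items() if k in prompt), "unknown")

run_eval(toy_model, cases)   # → 1.0
```

Run the same cases against two candidate models and the "which model should we use" argument becomes a number instead of an opinion.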

What each one measures

MMLU = knowledge. HumanEval = isolated coding. GPQA = reasoning under uncertainty. SWE-Bench = doing real engineering work end to end. None of them measures whether the model is actually useful for your job.

The saturation problem

Top models now sit within a few points of each other on MMLU and HumanEval. The benchmarks have stopped discriminating — newer ones like FrontierMath, ARC-AGI-2, and SWE-Bench Verified are taking their place.

The gap to real work

A model that scores 95 on HumanEval can still fail at fixing your bug. Synthetic tasks reward narrow skill; production code rewards reading 10 files, running tests, and arguing with the linter.

Chapter 12

The 2026 Model Landscape

Eight families dominate production traffic. Half are closed APIs run by their creators; half are open weights you can download, fine-tune, and host yourself. Pick by task, cost, and where the data is allowed to live.

Anthropic

Claude

closed
Opus 4.x · Sonnet 4.x
context
200K – 1M tokens
hosting
cloud
long-context reasoningcodingtool usesafety

Constitutional AI alignment. Strong at multi-step agent loops.

OpenAI

GPT / o-series

closed
GPT-4.x · o3 · o4
context
128K – 1M tokens
hosting
cloud
general purposereasoningvoiceimages

Reasoning variants spend inference compute on chain-of-thought.

Google

Gemini

closed
Gemini 2.x Pro / Flash
context
up to 2M tokens
hosting
cloud
multimodalhuge contextnative video

Tightly integrated with Google services. Strong on image + video.

Meta

Llama

open weights
Llama 3.x · 4
context
128K tokens
hosting
cloud or local
open weightsfine-tunablestrong base

The de-facto open foundation. Runs on your hardware if you have the RAM.

Mistral

Mistral / Mixtral

open weights
Large 2 · Mixtral 8×22B
context
32K – 128K tokens
hosting
cloud or local
MoE efficiencymultilingualcompact

European, Apache-licensed. Mixture-of-experts gives big-model quality at small-model cost.

xAI

Grok

closed
Grok 3 / 4
context
128K+ tokens
hosting
cloud
real-time datalong reasoning

Tight integration with X. Trained on a very large compute cluster.

DeepSeek

DeepSeek

open weights
V3 · R1
context
128K tokens
hosting
cloud or local
costreasoningopen R1 weights

Open reasoning model that rattled the market in early 2025.

Alibaba

Qwen

open weights
Qwen 3 / 3-Coder
context
128K – 1M tokens
hosting
cloud or local
multilingualcodingsmall + large variants

Extremely strong open family across many sizes. Great at Asian languages.

Benchmarks shift week to week and rarely match real-world performance on your task. The right answer is usually: pick two candidates from different vendors, build the same eval set on your own data, and let the numbers decide.

Chapter 13

The Economics of Inference

A frontier model can cost hundreds of times as much per token as a local one. The hard part of building with AI is no longer "can it do this?" — it's "which tier should I be paying for, and where?".

price ladder · USD per 1M tokens · illustrative · 2026
Frontier · Opus 4.x · GPT-5 · Gemini 2.x Pro · $15 in / $75 out
Mid · Sonnet 4.x · GPT-5 mini · Gemini Flash · $3 in / $15 out
Cheap · Haiku · GPT-5 nano · open weights · $0.25 in / $1.25 out
Local · Llama 4 / Qwen 3 · Mac Studio / RTX · ~$0 · electricity

Output tokens cost 4–5× input tokens because they're generated one at a time, with full GPU memory pressure each step. That's why "answer in JSON, not prose" can quietly halve your bill.

same task · different tier

Summarise 100 PDFs · ~1.2M tokens in / 200K out

Frontier · $33.00
Mid · $6.60
Cheap · $0.55
Local · ~$0.05

At list prices the frontier tier is roughly 650× the local tier on paper. Whether it's worth that gap depends entirely on whether you can tell the difference in the output.
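Recomputing that job from the ladder's list prices takes a few lines (no caching discounts applied; the local tier is omitted because its cost is electricity, not tokens):

```python
# List prices per 1M tokens from the ladder above (illustrative, 2026).
PRICES = {"frontier": (15.00, 75.00),   # (input, output) USD per 1M tokens
          "mid":      (3.00, 15.00),
          "cheap":    (0.25, 1.25)}

def job_cost(tier, tokens_in, tokens_out):
    p_in, p_out = PRICES[tier]
    return tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out

# The 100-PDF job: ~1.2M tokens in, 200K out.
{tier: round(job_cost(tier, 1_200_000, 200_000), 2) for tier in PRICES}
# → {'frontier': 33.0, 'mid': 6.6, 'cheap': 0.55}
```

Note how the 200K output tokens account for nearly half the frontier bill despite being a sixth of the volume — that is the 5× output premium at work.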

scaling laws · the Chinchilla insight · log-log
[chart: loss vs compute (FLOPs), Chinchilla-optimal models (more tokens) against over-parameterised, under-trained ones]

DeepMind's 2022 result: at any given compute budget, the best model is smaller than people thought, but trained on far more tokens. The race for parameter count was partly a misallocation — and that's why a well-trained 70B can beat an under-trained 500B.
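The rule of thumb that falls out of the Chinchilla result is roughly 20 training tokens per parameter. As a sketch:

```python
def chinchilla_tokens(params):
    """Compute-optimal training-set size, using the ~20 tokens-per-parameter
    rule of thumb from DeepMind's Chinchilla result (Hoffmann et al., 2022)."""
    return 20 * params

chinchilla_tokens(70e9)    # 70B model → ~1.4T tokens
chinchilla_tokens(500e9)   # 500B model → ~10T tokens
```

By this rule, a 500B model trained on only 1T tokens is badly under-trained — which is exactly why a well-fed 70B can beat it.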

Pick a tier per task, not per app

Classification, extraction, simple summaries — the cheap tier is enough. Save the frontier for the steps that actually need it: planning, debugging, novel reasoning.

Caching is a 60–90% discount

Provider-side prompt caching reuses the system prompt and long context across calls. For a stable agent loop, the difference is measured in orders of magnitude on the bill.

Local cost ≈ electricity

Once the GPU is bought, running a local 70B model is a few cents per million tokens — but only if you have steady throughput to amortise the hardware.

Chapter 14

Local vs Cloud Models

The single most consequential architectural decision in any AI system: does the model run inside your infrastructure, or do you send every prompt to someone else's GPUs? Both are valid — they trade different things.

[diagram · cloud: Your App (in your VPC) sends prompt + data out of the perimeter to a multi-tenant Provider GPU running a 400B+ parameter model, then back to the user · local: the prompt stays inside, served by Your GPUs running a quantized Llama / Qwen 70B on vLLM]
CLOUD

GPT, Claude, Gemini, Grok

+ advantages
Frontier-class quality (200B+ effective parameters)
No hardware investment — pay per token
Always-on updates, multimodal, voice
Handles spikes elastically
− trade-offs
Data leaves your perimeter
Per-token cost compounds at scale
Latency tied to network + provider
Vendor lock-in & regulatory questions
LOCAL

Llama, Qwen, Mistral, DeepSeek

+ advantages
Data never leaves your infrastructure
Predictable cost — only electricity
Air-gapped deployment possible
Full control over fine-tuning & versioning
− trade-offs
Quality ceiling at 70B–120B class
Significant hardware investment
Inference engineering on you
No automatic upgrades
Local vs Cloud at a glance:
Privacy · local: data stays on premise · cloud: data sent to provider
Quality · local: ~2023 frontier (70B–120B) · cloud: 2026 frontier (closed)
Cost shape · local: capex (GPUs) · cloud: opex (per token)
Latency · local: <50 ms · cloud: 200 ms+ network round trip
Compliance · local: easier (HIPAA, GDPR, on-prem) · cloud: depends on provider DPA
Updates · local: you re-pull weights · cloud: automatic
Multimodal · local: limited (image/audio LLMs growing) · cloud: native voice + video
Scaling · local: add GPUs → linear · cloud: elastic, instant
The hybrid answer

Most production systems we build at Vorcl are hybrid: frontier closed models do the hardest reasoning, a fine-tuned local model handles the high-volume, sensitive, or repetitive work, and a router decides per request. Privacy and cost stay bounded; quality stays at the ceiling where it matters.
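A per-request router can start as a handful of if-statements. A toy sketch — the request fields and thresholds here are invented for illustration; production routers typically score prompts with a small classifier rather than hand-written rules:

```python
def route(request):
    """Toy per-request router: sensitive work stays local, hard
    reasoning goes to a frontier API, bulk work goes to the cheap tier."""
    if request["contains_pii"]:
        return "local"         # data must not leave the perimeter
    if request["difficulty"] > 0.7:
        return "frontier"      # pay for quality only where it matters
    return "cheap"             # high-volume, low-stakes work

route({"contains_pii": True,  "difficulty": 0.9})   # → "local"
route({"contains_pii": False, "difficulty": 0.9})   # → "frontier"
route({"contains_pii": False, "difficulty": 0.2})   # → "cheap"
```

The privacy rule fires before the quality rule on purpose: a request that must stay on premise stays there even when the frontier model would answer it better.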

Reference

Twenty Words That Cover the Field

A short, pin-this-to-the-fridge glossary. If you remember these, you can read almost any AI paper, blog post, or release note without getting lost.

token

01

A chunk of text the model sees as one unit. ~4 chars on average. ~50K of them in the vocabulary.

parameter

02

A learned number inside the network. Frontier models have 100B – 2T of them.

context window

03

How much the model can read at once, measured in tokens. 200K is comfortable; 1M is the new ceiling.

embedding

04

A list of numbers that represents meaning. Words, images, and audio can all become embeddings.

attention

05

The Transformer trick that lets every token decide how much each other token matters.

transformer

06

The neural net architecture, introduced in 2017, that almost every modern LLM is built on.

fine-tuning

07

Continuing training on a smaller, task-specific dataset to nudge a base model toward a behaviour.

RLHF

08

Reinforcement Learning from Human Feedback. Humans rank outputs; the model learns to prefer the winners.

temperature

09

A sampling knob. 0 = always pick the most likely token. Higher = more creative, more chaotic.

top-p

10

Nucleus sampling. Keep only tokens whose cumulative probability sums to p, then pick from those.

hallucination

11

When a model confidently states something untrue. A side effect of being trained to sound right, not be right.

RAG

12

Retrieval-Augmented Generation. Look up relevant docs first, paste them into the prompt, then answer.

agent

13

An LLM in a loop that can call tools, observe their output, and decide what to do next.

tool use

14

The mechanism that lets a model emit a JSON function call instead of free-form text.

MCP

15

Model Context Protocol. A standard for letting any model connect to any tool or data source.

reasoning model

16

A model trained to generate a long internal chain of thought before its final answer.

multimodal

17

Accepts more than one kind of input — text, images, audio, video — and reasons across them.

quantization

18

Compressing model weights from 16-bit floats to 8 or 4 bits. Smaller, faster, slightly dumber.

MoE

19

Mixture of Experts. Only a subset of parameters fire per token, so a 400B model runs like a 40B.

distillation

20

Training a small model to imitate a big one. Cheap inference, most of the quality.

End of lesson

Now Put It to Work.

You've seen how the model works. The hard part is choosing the right one for your data, wiring it into your stack, and making it safe to put in front of customers. That's the job we do.