One-sentence framing: a large language model is not a brain pretending to be a human—it is a very large probabilistic model trained to predict the next token from context. Text is first split into tokens, mapped into vectors, processed by many Transformer layers, and then decoded one token at a time1,2,11. Real products wrap that core loop with context management, retrieval18, tool calls23,24, multimodal input29,30, and safety/evaluation systems14,34.
Course framework
This primer uses a strict three-layer explanation pattern: first build intuition, then explain the mechanism, and finally show how production systems wire these capabilities together. The material is grounded in primary sources—Transformer, GPT-3, InstructGPT, DPO, RAG, Toolformer, ViT, Flamingo, LLaVA, and the developer docs from OpenAI and Anthropic1,2,12,13,18,23,29,31,32,35,38.
Positioning
This is not a name-dropping survey, nor a heavy mathematical paper review. Two perspectives run in parallel throughout:
- User's perspective: why the model can chat, write code, summarize documents—and why it can sound confident while being wrong.
- Engineer's perspective: why tokens cost money, why contexts overflow, and why production systems need RAG, tool calls, caches, evals, and safety controls.
Learning goals
| Goal | What you will understand |
|---|---|
| Use | What tokens, context, temperature, hallucination, RAG, and tool calls actually do |
| Explain | Why the model can hold a conversation—communicated to a non-technical colleague |
| Design | How to read a real LLM product request pipeline end-to-end |
| Question | When models fail, why they fail, and what mitigates the failure |
| Go deeper | Where to look next: API, RAG, fine-tuning, inference optimization |
Interactive demos
Five interactive demos are embedded inline, each in the chapter where it best illustrates the concept:
- Tokenizer (Chapter 2): split arbitrary text and inspect the token count.
- Attention heatmap (Chapter 4): click a token to see what it attends to.
- Next-token generator (Chapter 6): step through autoregressive decoding token by token.
- Temperature slider (Chapter 7): watch the softmax distribution sharpen or flatten in real time.
- RAG simulator (Chapter 12): compare a closed-book answer with an evidence-grounded one.
Part I
How a language model works
Chapter 1 · What an LLM really is
The intuition
Think of an LLM as an extraordinarily powerful autocomplete. You give it some context, and it keeps asking itself a single question: given everything seen so far, what is the most likely next token? Because it has read an enormous amount of text and absorbed strong pattern-induction skills, that “autocomplete” can act like chat, writing, translation, coding, summarization, and even something that looks like reasoning. The friendly mental model is “a champion at probabilistic sentence-completion”; the more precise framing is “an autoregressive language model that models the conditional distribution of the next token given context.”
Under the hood
GPT-style models learn something close to P(next token | preceding tokens). The GPT-3 paper showed that when model size, training data, and compute are pushed far enough, this single objective induces broad capabilities across question answering, translation, completion, code, and few-shot tasks2. The Transformer paper supplies the architecture that makes this modeling tractable at scale1. The honest framing is this: the model has compressed an enormous amount of linguistic, factual, and procedural patterns into its parameters—but that is not the same thing as having human-style awareness, subjective experience, or understanding.
Example
A user asks, “Explain what a black hole is.” The model does not draft a full encyclopedia entry up front. It predicts “A”, then “ black”, then “ hole”, then “ is”, then “ a”, then “ region”, one token at a time, extending the answer until it has covered the topic.
Common misconception
The misconception: an LLM is just a search engine. A more accurate framing: search engines retrieve, databases store and look up exact records, classical programs execute deterministic rules, and LLMs generate probable text. LLMs can be paired with search, databases, and tools—but they are not themselves any of those systems.
Bottom line
The first-principles definition is not “a chatbot.” It is “a large probability model that predicts the next token from context.”
Chapter 2 · What tokens are
The intuition
Models do not read language the way people do. From the model's perspective, text is a sequence of building blocks called tokens. A token might be a single character, a whole word, a morphological fragment, a punctuation mark, or even “a leading space plus half a word.”
Under the hood
Tokenization is the process of converting raw text into a sequence of tokens. Modern models almost always use a subword scheme such as Byte-Pair Encoding (BPE)7, which keeps the vocabulary at a manageable size while still handling rare words, new words, and many languages. OpenAI's developer docs make this concrete: a token can be as short as a single character or as long as a whole word, and whitespace and punctuation also count35. Different languages, models, and tokenizers can split the same string in different ways.
Run any text through the GPT-4 tokenizer (cl100k_base) in real time. Compare character and token counts.
Note: cl100k_base is the tokenizer used by GPT-4 and GPT-3.5-turbo. Different models can tokenize the same string differently. Whitespace and many multi-byte characters (including some CJK characters) often split into several tokens.
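If you prefer to poke at this in code rather than in the demo, the sketch below does the same thing with OpenAI's open-source tiktoken library (assuming tiktoken is installed locally): it encodes a string with cl100k_base and shows the character-to-token ratio.

```python
# Minimal sketch: how cl100k_base splits a string into tokens.
# Assumes the open-source tiktoken package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Unbelievable: tokens are the budget unit."
token_ids = enc.encode(text)                        # the integer IDs the model actually processes
pieces = [enc.decode([tid]) for tid in token_ids]   # the text fragment behind each ID

print(f"{len(text)} characters -> {len(token_ids)} tokens")
print(pieces)  # exact splits depend on the tokenizer and can differ across models
```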
Example
The English word unbelievable is often split into prefix, root, and suffix: [un][believ][able]. The Chinese sentence “我想学习大模型原理” might be one-character-one-token in one tokenizer and two-character pieces like “学习” in another. None of these splits align cleanly with the word boundaries a human would draw.
Common misconception
The misconception: one token equals one character, or one token equals one word. The correct view: a token is whatever unit the model's tokenizer happens to produce, and those units rarely line up with human notions of “character” or “word.”
Bottom line
Whenever you write a little more text, the model is not reading more characters—it is processing more tokens. From an engineering perspective, the token is the budget unit.
Chapter 3 · Embeddings: turning text into computable numbers
The intuition
Computers do not natively process “cat,” “dog,” or “black hole” as symbols—they are very good at numbers. So the model first converts every token into a vector. Imagine giving each token a coordinate on a “map of meaning”: tokens that mean similar things end up close together; tokens that are unrelated end up far apart.
Under the hood
Early work on word vectors established that distributed representations can capture semantic and syntactic regularities6. Transformer-based models go further: they do not stop at static token embeddings. Every layer rewrites those embeddings into contextual representations—the same word in different sentences ends up with a different internal vector. Analyses of BERT and related models show that this contextualization grows stronger in higher layers8,9. The internal representation of “apple” in “I ate an apple” is not the same as in “Apple launched a phone.”
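To make the “map of meaning” idea concrete, here is a minimal numpy sketch of cosine similarity between embedding vectors. The three 4-dimensional vectors are invented for illustration; real models learn vectors with hundreds or thousands of dimensions.

```python
# Minimal sketch of the "map of meaning": cosine similarity between embedding vectors.
# The 4-dimensional vectors are invented; real models learn much higher-dimensional ones.
import numpy as np

embeddings = {
    "cat":          np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":          np.array([0.8, 0.9, 0.2, 0.1]),
    "refrigerator": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cat vs dog:         ", round(cosine(embeddings["cat"], embeddings["dog"]), 3))
print("cat vs refrigerator:", round(cosine(embeddings["cat"], embeddings["refrigerator"]), 3))
```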
Example
If the model repeatedly sees “cats meow, cats run, cats are animals” and “dogs bark, dogs run, dogs are animals,” the internal representations of “cat” and “dog” end up closer to each other than to “refrigerator” or “locomotive,” which appear in completely different contexts.
Common misconception
The misconception: an embedding is just a numeric ID for each token. The correct view: the ID is a discrete lookup key; the embedding is the dense, learned vector the model actually computes with. In modern Transformers, that vector is rewritten layer by layer as context flows through.
Bottom line
Embeddings move language from the world of symbols into a world the model can actually compute over.
Chapter 4 · Transformer and self-attention
The intuition
If tokens are the building blocks and embeddings are their numeric coordinates, the Transformer is the assembly line. It looks at the whole sequence over and over, deciding which earlier positions matter most for the position it is currently processing.
Under the hood
The Transformer's defining capability is self-attention. Unlike a classical RNN that ships information one cell at a time, self-attention lets every position route information directly to every other position. The original Transformer paper introduces multi-head attention as multiple parallel “attention heads” that can capture different kinds of relationships from different representation subspaces1. Later analyses of BERT show that some heads are clearly biased toward syntax or coreference relationships10.
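For readers who want the mechanism rather than the metaphor, here is a minimal single-head scaled dot-product attention sketch in numpy. The Q, K, V matrices are random stand-ins for the learned projections a real Transformer would compute from the token embeddings.

```python
# Minimal single-head scaled dot-product attention, following the Transformer paper.
# Q, K, V are random stand-ins for learned projections of the token embeddings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query position to every key position
    weights = softmax(scores, axis=-1)   # each row is a probability distribution over positions
    return weights @ V, weights          # output: a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                      # 5 tokens, one 8-dimensional attention head
Q, K, V = rng.normal(size=(3, seq_len, d_k))
output, weights = attention(Q, K, V)
print(weights.round(2))                  # a 5x5 attention map like the illustrative table below
```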
Click any token (the query) and the row below shows that token's attention over the rest. Weights are illustrative, not from a real model.
| | Apple | released | a | new | phone | , | its | performance | is | strong |
|---|---|---|---|---|---|---|---|---|---|---|
| Apple | 100 | |||||||||
| released | 55 | 45 | ||||||||
| a | 15 | 55 | 30 | |||||||
| new | 30 | 20 | 10 | 40 | ||||||
| phone | 30 | 30 | 5 | 30 | 5 | |||||
| , | 5 | 5 | 2 | 35 | 40 | 13 | ||||
| its | 10 | 5 | 2 | 20 | 55 | 3 | 5 | |||
| performance | 5 | 5 | 2 | 10 | 20 | 2 | 30 | 26 | ||
| is | 3 | 3 | 2 | 5 | 6 | 2 | 10 | 45 | 24 | |
| strong | 2 | 2 | 5 | 8 | 2 | 8 | 45 | 10 | 17 |
Example
Consider “Apple released a new phone; its performance is strong.” The pronoun “its” is far more likely to refer to “new phone” than to “Apple.” The model uses contextual cues to dynamically allocate attention to the right antecedent.
Common misconception
The misconception: attention is exactly “what the model is looking at.” The correct view: attention is a learned weighting mechanism. It often matches human intuition, but it is not a direct readout of cognition or a complete explanation of the model's behavior.
Bottom line
The Transformer frees models from passing information forward one position at a time, and that is what makes long-range dependencies tractable.
Chapter 5 · How representations are refined layer by layer
The intuition
A useful image is a stack of editors revising the same draft. The first pass notices spelling and surface phrasing. The second pass tightens sentence structure. The third pass focuses on meaning and intent. The later the pass, the closer you get to “what is this text actually trying to do?”
Under the hood
Probing studies of Transformer-style models such as BERT show that lower layers tend to emphasize surface features, middle layers tend to track syntax, and upper layers tend to encode semantics and task-relevant information9. Higher layers are also more strongly contextualized—they depend more on the full sequence9,10. None of this implies the model is “thinking in stages” in a human sense; a cleaner framing is that the same hidden representation gets repeatedly rewritten as it flows up the stack.
Example
The word “bank” in “I went to the bank to deposit cash” and “the bank of the river” means two very different things. Probing studies show that as a sentence flows up the layers, these two uses end up with internal representations that are more clearly distinct than they were at the bottom.
Common misconception
The misconception: layer 1 strictly handles grammar, layer 2 strictly handles meaning, layer 3 strictly handles logic. The correct view: these are tendencies, not strict assignments. Different models, different tasks, and different layers all overlap.
Bottom line
The closest translation of “the model is thinking” into mechanical terms is “the hidden representation keeps getting rewritten across layers.”
Chapter 6 · Why the model can generate answers
The intuition
The model does not draft a complete answer before speaking, the way a person might outline an essay before writing it out. It plays a very long sequence game: pick the next token, append it to the context, decide the next token again, and so on.
Under the hood
At inference time an autoregressive language model decodes one token at a time. At each step it emits a probability distribution over the vocabulary, a decoding strategy chooses a token from that distribution, and the chosen token becomes part of the new context for the next step. Holtzman et al. make the point clearly: the distribution the model produces is one thing; the strategy used to sample from it is something else entirely11.
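The decoding loop itself is short enough to sketch. In the snippet below, toy_model is a made-up stand-in for a real Transformer forward pass; the loop structure (choose a token, append it, repeat) is what matches production systems.

```python
# Minimal sketch of the autoregressive decoding loop. toy_model is a made-up stand-in
# for a Transformer forward pass: it returns a next-token distribution given the context.
import random

def toy_model(context):
    last = context[-1]
    if last == " cat":
        return {" is": 0.6, " sat": 0.3, "<eos>": 0.1}
    if last == " is":
        return {" a": 0.8, " the": 0.1, "<eos>": 0.1}
    if last == " a":
        return {" mammal": 0.5, " pet": 0.3, "<eos>": 0.2}
    return {"<eos>": 1.0}

def decode(model, context, max_new_tokens=10):
    for _ in range(max_new_tokens):
        probs = model(context)                                   # distribution over candidate tokens
        tokens, weights = zip(*probs.items())
        next_token = random.choices(tokens, weights=weights)[0]  # the sampling strategy lives here
        if next_token == "<eos>":
            break
        context = context + [next_token]                         # the choice becomes part of the context
    return context

print("".join(decode(toy_model, ["A", " cat"])))                 # e.g. "A cat is a mammal"
```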
Step through autoregressive decoding. At each step the model emits a top-k probability distribution; pick one to advance. Numbers are illustrative.
Example
If the model picks “mammal” at this step, the context becomes “A cat is a mammal,” and the next distribution shifts toward “,” “.” “that” “belonging”—the things that plausibly follow that prefix.
Common misconception
The misconception: the model secretly writes the entire response first and then streams it out slowly. The correct view: most production generative LLMs unfold the answer one token at a time at inference.
Bottom line
Generation is not “a paragraph in one shot.” It is “a probability distribution plus a decoding loop.”
Chapter 7 · Temperature, top-k, and top-p
The intuition
The same model can feel disciplined one moment and adventurous the next. The difference often is not the model—it is the sampling parameters.
Under the hood
Temperature controls how peaked or flat the distribution becomes. Low temperature exaggerates the top candidate's lead; high temperature lets long-tail candidates compete. Top-k keeps only the k most likely candidates and samples from those. Top-p, also known as nucleus sampling, keeps the smallest set of candidates whose cumulative probability mass reaches p. Holtzman et al. introduced nucleus sampling because purely greedy maximum-likelihood decoding tends to produce repetitive, degenerate text11.
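A small numpy sketch makes the three knobs concrete: the same toy logits, rescaled by temperature, then filtered by top-k and top-p. The logits are invented for illustration.

```python
# Minimal sketch of temperature, top-k, and top-p applied to the same toy logits.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])       # invented scores for 5 candidate tokens

for T in (0.2, 1.0, 2.0):                            # temperature rescales logits before softmax
    print(f"T={T}:", softmax(logits / T).round(3))   # low T sharpens, high T flattens

def top_k(probs, k):
    keep = np.argsort(probs)[-k:]                    # indices of the k most likely candidates
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p(probs, p):
    order = np.argsort(probs)[::-1]                  # candidates sorted by probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]      # smallest set with cumulative mass >= p
    return out / out.sum()

probs = softmax(logits)
print("top-k, k=2:  ", top_k(probs, 2).round(3))
print("top-p, p=0.9:", top_p(probs, 0.9).round(3))
```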
Same logits, different temperatures produce very different distributions. Low T sharpens (greedy); high T flattens (explorative).
Temperature does not change what the model knows (the logits). It only changes how the model chooses from its own distribution. Low T fits code and structured extraction; high T fits creative writing and brainstorming.
Example
Ask the model to finish a line of poetry. At low temperature it tends toward safe, expected phrasing. At high temperature it reaches for novel imagery—but it is also more likely to drift off-topic.
Common misconception
The misconception: higher temperature makes the model smarter. The correct view: higher temperature only makes the model more random. Creativity may go up, and so may error rate.
Bottom line
Sampling parameters do not swap the model. They change how the model picks an answer from its own distribution.
Part II
How the model is trained and scaled
Chapter 8 · How LLMs are trained
The intuition
If inference is “the model answering,” training is “the model studying.” First it reads a vast amount of text and learns linguistic patterns and broad world knowledge. Then it is coached on high-quality question–answer demonstrations to behave more like an assistant. Then it is shaped to match human preferences and safety norms. Finally it is stress-tested.
Under the hood
A canonical pipeline has four stages. Pretraining runs next-token prediction over enormous corpora to absorb language and knowledge patterns. Supervised fine-tuning (SFT) pulls the base model toward following instructions, using curated demonstrations. Preference alignment (RLHF, DPO, and friends) shapes the model toward being more helpful, honest, and harmless. Safety training and evaluation reduce risky outputs through rules, red-teaming, refusal strategies, and eval pipelines. InstructGPT laid out the SFT+RLHF recipe systematically12; DPO offered a simpler, more direct way to do preference optimization13; Constitutional AI showed that a written list of principles plus AI feedback can also drive alignment14.
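At the level of code, the pretraining objective is just cross-entropy between the model's predictions and the next token. The PyTorch sketch below uses random logits as a stand-in for a real model so that the shift-by-one bookkeeping is visible.

```python
# Minimal sketch of the pretraining objective: next-token prediction as cross-entropy.
# The logits are random stand-ins for a real model's output; only the bookkeeping is real.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # one toy training sequence
logits = torch.randn(1, seq_len, vocab_size)              # "model output" at every position

# Position t is trained to predict token t+1, so inputs and targets are shifted by one.
predictions = logits[:, :-1, :].reshape(-1, vocab_size)
targets = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(predictions, targets)
print(loss.item())   # the number that pretraining pushes down over trillions of tokens
```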
Example
Pretraining teaches the model how to produce coherent language. SFT teaches it to act like an assistant. Preference alignment teaches it which answers feel right. Safety training teaches it to refuse or hedge when the request itself is dangerous.
Common misconception
The misconception: training literally copies an encyclopedia into the model. The correct view: the model mostly compresses statistical patterns into its parameters rather than storing a verbatim archive. That said, research shows models can memorize and reproduce snippets of training data17, so the truth is somewhere between “perfect copy” and “no memory at all.”
Bottom line
Training is not a single event. It is a pipeline that moves the model from “fluent language” to “assistant behavior” to “within safety bounds.”
Chapter 9 · Parameters and why 7B, 70B, 175B
The intuition
Picture a vast machine filled with knobs. The parameters are those knobs. Training is the process of patiently adjusting every knob so that, given an input, the machine produces a more sensible output.
Under the hood
Parameters are the learnable numerical weights inside the model. GPT-3 disclosed a 175B-parameter scale2. LLaMA published a family ranging from 7B to 65B3. Chinchilla used 70B parameters but, by training on far more tokens, outperformed several much larger models5. Together, the scaling-law work4 and Chinchilla5 argue something nuanced: bigger models typically have higher ceiling performance, but whether one is actually “better” depends on training tokens, data quality, architecture, and training discipline.
Example
The B in “7B,” “70B,” and “175B” stands for billion. 7B is roughly 7 billion parameters; 70B is roughly 70 billion. More parameters generally mean more capacity—but also more memory, lower inference throughput, and higher deployment cost.
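A quick back-of-envelope sketch shows why parameter count maps directly onto hardware cost. It assumes 2 bytes per parameter (16-bit weights only) and ignores activations, the KV cache, and optimizer states, which all add more on top.

```python
# Back-of-envelope: why "7B" vs "70B" vs "175B" translates directly into memory.
# Assumes 2 bytes per parameter (fp16/bf16 weights only); activations, the KV cache,
# and optimizer states all add more on top.
for name, params in [("7B", 7e9), ("70B", 70e9), ("175B", 175e9)]:
    gib = params * 2 / 1024**3
    print(f"{name}: ~{gib:,.0f} GiB just to hold the weights in 16-bit precision")
```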
Common misconception
The misconception: more parameters always mean a better model. The correct view: parameter count is only part of the capacity and ceiling story. A well-trained 70B can routinely beat a sloppily trained much larger model.
Bottom line
Parameters are knobs. Scale matters—but scale is not the whole story.
Chapter 10 · Context windows
The intuition
A context window is a workbench. How much material you can lay out for the model at once depends on how big that workbench is. A larger bench means more material in view—but piling too much on at once, in the wrong order, makes it harder for the model to pick out what actually matters.
Under the hood
The context window is the number of tokens the model can process in one request. Anthropic's docs are explicit: the window holds the conversation history, the current request, and the space the model needs to write its output38. In real-world agent and coding workflows it also holds the system prompt, file contents, prior model responses, and tool returns. Longer contexts come with real costs: higher prefill compute, KV-cache memory pressure, and higher latency. System-level work such as vLLM and PagedAttention is exactly about making this cheaper at scale33. And the “Lost in the Middle” result is a useful reality check: more context is not automatically better use of context—important evidence buried in the middle of a long prompt can actually hurt performance22.
Engineering note
In real products, long contexts are almost always combined with truncation, summarization, retrieval, and caching. Both OpenAI and Anthropic ship prompt-caching features specifically to make long, stable prefixes cheap to reuse, which improves both latency and cost on long-document workflows37,40.
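A toy budget makes the workbench metaphor concrete. The 128k window and all the counts below are invented, but the arithmetic is what every long-context product has to do before each request.

```python
# Toy context-window budget. The 128k window and all the counts are invented;
# real numbers come from the model's spec sheet and its own tokenizer.
CONTEXT_WINDOW = 128_000
reserved_for_output = 8_000

budget = {
    "system prompt": 1_200,
    "conversation history": 18_000,
    "retrieved document chunks": 90_000,
    "current user request": 800,
}

used = sum(budget.values())
remaining = CONTEXT_WINDOW - used - reserved_for_output
print(f"used {used:,} tokens, {remaining:,} left before truncation or summarization kicks in")
```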
Common misconception
The misconception: a big enough context window means the model “remembers you forever.” The correct view: the context window is only what is currently visible inside this session—it is working memory, not long-term memory.
Bottom line
The context window is the most important—and most underappreciated—piece of desk space in the whole stack.
Part III
Retrieval, tools, and multimodality in real products
Chapter 11 · Why hallucinations happen
The intuition
Hallucination is not lying. A more honest description is closer to “a very fluent person who, when they do not know the answer, still talks as if they do.”
Under the hood
A hallucination is, by working definition, an output that is linguistically smooth but factually wrong. Recent surveys identify multiple converging causes: limited or stale parametric knowledge, biased or incomplete training data, ambiguous prompts, absent retrieval, and a decoding bias toward producing “something that looks like an answer” rather than admitting uncertainty15. OpenAI's own write-up adds a sharper observation: traditional training and evaluation often reward a plausible-sounding guess over an honest “I don't know”16.
Example
Ask the model to summarize a paper that does not exist—say, “Please summarize Smith et al. 2024 on Quantum Bubble Tea Optimization.” A model is likely to invent plausible authors, methods, and findings, because the generation process is biased toward producing something that looks like a real abstract.
How to mitigate
The reliable mitigations are engineering, not magic: require the model to cite sources, wire in retrieval or search, ask it to flag uncertainty explicitly, separate “known facts” from “recommendations,” and add human review on high-stakes domains like medicine, law, and finance. Hallucinations can be reduced significantly; no single trick eliminates them today.
Bottom line
Hallucination is not a character flaw. It is the natural reliability cost of any probabilistic generation system on factually demanding tasks.
Chapter 12 · RAG and why it reduces hallucination
The intuition
A plain LLM is a closed-book exam. Retrieval-augmented generation (RAG) is the open-book version: the model gets to consult relevant material first, then answer.
Under the hood
RAG is shorthand for retrieval-augmented generation. Lewis et al. frame it as “parametric memory plus non-parametric external memory”: retrieve relevant documents first, drop them into the context, then generate18. The retriever can be dense; DPR showed that a dense retriever can clearly beat strong BM25 baselines on open-domain QA19. Vector databases and approximate nearest-neighbor indices handle the embeddings at scale—Faiss is the canonical reference implementation20.
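Stripped to its skeleton, RAG is a few lines of code. In the sketch below, embed() is a deliberately crude stand-in for a real embedding model, and the list of chunks plays the role of a vector database; the point is the retrieve-then-prompt control flow, not the retrieval quality.

```python
# Minimal RAG skeleton: embed chunks, retrieve by similarity, build the prompt.
# embed() is a deliberately crude stand-in for a real embedding model, and the list
# of chunks stands in for a vector database; only the control flow matches production.
import numpy as np

def embed(text: str) -> np.ndarray:
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1.0          # toy "embedding": character counts, not semantics
    return v / (np.linalg.norm(v) + 1e-9)

chunks = [
    "Operating cash flow declined 18% year over year (p. 12).",
    "The company opened three new offices in Europe (p. 3).",
    "Days sales outstanding lengthened from 45 to 61 days (p. 27).",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

question = "What is the biggest risk in this earnings report?"
q = embed(question)
top = sorted(index, key=lambda item: float(q @ item[1]), reverse=True)[:2]

prompt = "Answer using ONLY the evidence below, and cite the page numbers.\n\nEvidence:\n"
prompt += "\n".join(f"- {chunk}" for chunk, _ in top)
prompt += f"\n\nQuestion: {question}"
print(prompt)                            # this assembled prompt is what gets sent to the LLM
```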
Compare a closed-book answer with an evidence-grounded one (RAG). Retrieval snippets and answers are illustrative.
Common risks for a software company at this stage include macro headwinds, competitive pressure, gross-margin compression, FX volatility, and regulatory uncertainty. To pinpoint the largest risk in this specific filing, please share the document.
RAG is not a silver bullet. Retrieval quality, chunking, ranking, and the position of evidence in the prompt all affect the final answer (see Lost-in-the-Middle).
Example
Ask, “What is the biggest risk in this earnings report?” Without RAG, the model gives a generic answer: macro headwinds, competition, operational risk. With RAG, the model can pull specific passages from the filing and answer concretely: weakening cash flow, foreign-revenue concentration, longer DSO—citing the page numbers.
RAG is not a silver bullet
If retrieval fails, or if too much irrelevant material lands in the context, the model still answers wrong. Follow-up work like CRAG explicitly tackles “what to do when retrieval is bad”21. The “Lost in the Middle” result shows that even when the right passage is in the context, the model may underuse it—especially if it sits in the middle of a long prompt22.
Common misconception
The misconception: once you add RAG, hallucinations are gone. The correct view: RAG converts “closed-book” into “may consult materials.” It does not automatically turn the system into a fact engine. Retrieval quality, chunking, ranking, and citation design all matter.
Bottom line
The point of RAG is to swap “what the model vaguely remembers” for “what the model can see in front of it right now.”
Chapter 13 · Tool use and agents
The intuition
LLMs are very good at producing text, but not necessarily at exact arithmetic, real-time lookup, or actually performing actions in the outside world. So we wire them up to tools—hands and feet, in a sense.
Under the hood
The OpenAI tool-use loop is straightforward: the application describes the available tools to the model, the model emits a structured tool call, the application actually runs the tool, the result is fed back into the model, and the model integrates the result into its final answer36. Anthropic's docs follow the same pattern: the model decides whether to call a tool, which one, and what arguments to pass39. Toolformer showed that language models can themselves learn when to call an API, what arguments to pass, and how to use the result23. ReAct interleaves reasoning steps with action steps so the model can plan and act in alternation24.
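The loop is easier to see in code than in prose. The sketch below invents a toy message format and a toy model (call_model and toy_model are illustrative, not any vendor's real API); what it preserves is the control flow: the model requests a tool, the application executes it, and the result goes back into the context.

```python
# Minimal sketch of the tool-use loop. The message format, call_model, and toy_model
# are invented for illustration and are not any vendor's real API; the control flow
# (model requests a tool, application runs it, result goes back in) is the point.
import json

def get_fx_rate(base: str, quote: str) -> float:
    return {"USD/EUR": 0.92}[f"{base}/{quote}"]          # stand-in for a real forex API

TOOLS = {"get_fx_rate": get_fx_rate}

def run(messages, call_model):
    while True:
        reply = call_model(messages)                     # the model answers or asks for a tool
        if reply.get("tool_call") is None:
            return reply["content"]                      # final answer for the user
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])  # the application executes the tool
        messages.append({"role": "tool", "content": json.dumps(result)})

def toy_model(messages):
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool_call": {"name": "get_fx_rate",
                              "arguments": {"base": "USD", "quote": "EUR"}}}
    rate = json.loads(tool_results[-1]["content"])
    return {"tool_call": None, "content": f"500 USD is about {500 * rate:.2f} EUR."}

print(run([{"role": "user", "content": "Convert 500 USD to EUR"}], toy_model))
```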
How to think about agents
A working definition: agent ≈ LLM + goal + tools + memory + multi-step planning. Instead of answering one question at a time, an agent decomposes a goal, chooses tools, inspects intermediate results, and decides what to do next. This is powerful—and brittle: any misread of the goal, any flawed plan, any wrong tool argument, any misinterpretation of a tool result can derail the whole chain.
Example
A user asks, “What is today's USD to EUR rate? And how much is 500 USD in EUR?” A plain LLM is likely to guess. Wired up to a forex lookup tool and a calculator, the model retrieves the rate, runs the conversion, then writes the explanation.
Common misconception
The misconception: agents are already reliable digital employees. The correct view: an agent is a design pattern, not a magical personality. It can dramatically extend capability—and can also amplify any single error across many steps.
Bottom line
Tools give the model external capabilities. Agents give the model a multi-step execution framework.
Chapter 14 · How multimodal models see, hear, and watch
The intuition
Multimodal models did not suddenly grow eyes and ears. They learned to turn images, audio, and video into representations that look, to the model, similar to the way text already looks—and then process everything in one shared space.
Under the hood
Vision Transformer (ViT) demonstrated that an image can be cut into patches and fed through a Transformer like any other sequence29. CLIP showed that images and text can be projected into a shared semantic space30. Flamingo and LLaVA went further by bridging a vision encoder into a language model so the system can handle interleaved image–text input, visual question answering, screenshot understanding, and multimodal dialogue31,32. Video is typically “a sequence of frames plus a timeline.”
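The patch idea from ViT fits in a few lines of numpy: cut the image into 16x16 patches and flatten each one, so the image becomes a sequence of "visual tokens." The random array stands in for a real photo.

```python
# Minimal sketch of the ViT idea: cut an image into 16x16 patches and flatten each one,
# so the image becomes a sequence of "visual tokens". The random array stands in for a photo.
import numpy as np

image = np.random.rand(224, 224, 3)                     # toy 224x224 RGB image
patch = 16
grid = 224 // patch                                     # 14 patches per side

patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)                                    # (196, 768): 196 patch "tokens" of 768 numbers
```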
Example
You hand the model a screenshot of an earnings dashboard and ask, “What looks like the main driver of margin compression?” The model encodes the image into visual features, fuses those features with your text question, and produces an explanation.
Common misconception
The misconception: a multimodal model literally “sees” the way a person does. The correct view: the model sees a numerical representation produced by an encoder—not subjective visual experience.
Bottom line
Multimodality is not “a text model with an image hat.” It is the unification of multiple input modalities into one computable representation space.
Chapter 15 · Why models appear to reason
The intuition
A lot of what looks like reasoning is the model having seen, during training, an enormous number of problem–step–answer patterns—and having learned the language templates that go with them.
Under the hood
Chain-of-Thought prompting shows that explicit intermediate steps in the prompt boost performance on hard reasoning tasks25. Self-Consistency improves accuracy further by sampling multiple reasoning paths and voting on the answer26. Program-of-Thoughts hands the actual arithmetic off to an interpreter27. Tree-of-Thoughts lets the model explore a search tree and backtrack28. The common theme is clear: making the steps explicit, verifying externally, and searching across alternatives all raise the model's ceiling on solving problems.
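Self-consistency in particular is simple enough to sketch. Below, sample_reasoning_path is a hypothetical stand-in for calling the model at non-zero temperature; the voting logic is the technique itself.

```python
# Minimal sketch of self-consistency: sample several reasoning paths and vote on the answer.
# sample_reasoning_path is a hypothetical stand-in for calling the model at non-zero temperature.
from collections import Counter
import random

def sample_reasoning_path(question: str) -> str:
    # Toy behavior: most sampled paths land on the right answer, some do not.
    return random.choices(["11", "11", "11", "13"])[0]

question = "Roger has 5 balls and buys 2 cans with 3 balls per can. How many balls?"
answers = [sample_reasoning_path(question) for _ in range(9)]
answer, votes = Counter(answers).most_common(1)[0]
print(f"sampled answers: {answers} -> majority vote: {answer} ({votes}/9)")
```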
Example
“Roger has 5 balls. He buys 2 more cans, with 3 balls per can. How many balls in total?” A direct one-shot answer is more likely to be wrong. The same model, instructed to compute “2 × 3 = 6” first and then “5 + 6 = 11,” is much more reliable—and even more so if “2 × 3” is actually evaluated by a calculator.
Common misconception
The misconception: as long as the model writes the steps out, the logic is sound and the answer is right. The correct view: writing the steps usually helps, but it does not guarantee that any individual step is correct. “Looks like reasoning” and “reliable mathematical or logical correctness” are not the same thing.
Bottom line
LLMs can display reasoning behavior—more like “has learned a wide repertoire of problem-solving patterns” than “has an infallible logic engine inside.”
Chapter 16 · Why prompts matter
The intuition
A prompt is the brief you hand to an assistant. The clearer you are about who they are, what you want, the constraints, and the format, the more likely you are to get back what you actually need.
Under the hood
OpenAI's docs put it directly: prompting is how you instruct the model, and output quality often follows how well you prompt35. Anthropic's prompt-engineering docs add an important reminder: not every failure should be solved by tweaking the prompt—some problems are better fixed by choosing a different model, redesigning the system, adding retrieval, or improving evaluation38. System-message design also matters because system prompts shape role, tone, format, and safety boundaries.
Anatomy of a good prompt
Role, goal, background, constraints, output format, quality criteria. Six blocks, all useful. Example: “You are a financial analyst. Based on the attached PDF, summarize the risks. Output a four-column table: Risk · Evidence · Impact · Uncertainty. If you cannot find evidence for a risk, write ‘no supporting evidence in the document.’” This prompt's value is not in making the model smarter—it is in narrowing the task, making the output checkable, and removing room for hand-waving.
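One way to keep the six blocks honest is to treat the prompt as a template rather than free-form text. The sketch below is illustrative only; the field names and wording are not a standard, just one way to structure the brief.

```python
# Illustrative only: the six prompt blocks as a template. The field names and wording
# are not a standard, just one way to keep the brief narrow and the output checkable.
PROMPT_TEMPLATE = """\
Role: {role}
Goal: {goal}
Background: {background}
Constraints: {constraints}
Output format: {output_format}
Quality criteria: {quality_criteria}
"""

prompt = PROMPT_TEMPLATE.format(
    role="You are a financial analyst.",
    goal="Summarize the investment risks in the attached earnings report.",
    background="The attached PDF is the company's latest quarterly filing.",
    constraints="If you cannot find evidence for a risk, write 'no supporting evidence in the document'.",
    output_format="A four-column table: Risk, Evidence, Impact, Uncertainty.",
    quality_criteria="Every risk must cite a page number from the document.",
)
print(prompt)
```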
Common misconception
The misconception: with a long enough, clever enough prompt, anything can be solved. The correct view: prompts work within the limits of the model's capabilities, the available context, the wired-up tools, and the surrounding system design. Important, yes. Universal master key, no.
Bottom line
A good prompt is not an incantation. It is a precise, executable, and verifiable task brief.
Chapter 17 · One complete LLM product request
The intuition
In a real product, “user asks, model answers” rides on top of an entire backend: authentication, safety policy, retrieval, tools, output validation, logging, and evaluation.
Under the hood
Take the request “Analyze this earnings report and summarize the investment risks.” In production, a real system typically: accepts the request, runs authentication and safety checks, reads the file, chunks it, retrieves and reranks the relevant passages, assembles the prompt, calls the model, invokes calculation tools when needed, generates a structured output, attaches citations, returns the result, and logs everything for later evaluation. OpenAI's structured-outputs and function-calling docs cover exactly how to constrain output schemas and wire up tools36. The HELM evaluation framework reminds us that production quality is more than accuracy—robustness, calibration, fairness, toxicity, and efficiency all matter34.
Example
If the earnings report includes metrics like year-over-year growth or cash-flow coverage ratios, a well-designed system asks the model to extract the numbers and hands the actual arithmetic to a calculator. That is almost always more reliable than asking the model to do mental math.
Common misconception
The misconception: a strong base model is enough to build a strong product. The correct view: production quality is the joint design of model + data + tools + prompts + safety + evaluation + cost discipline.
Bottom line
When you see a fluent answer in a real product, it is almost never just a model speaking. It is a whole engineering pipeline.
Part IV
Misconceptions and learning roadmap
Chapter 18 · Common misconceptions
#1 · An LLM is just a search engine
Why it's wrong: Search engines retrieve from existing documents. LLMs generate text from context. These are different operations.
Correct view: Many great products combine an LLM with search or RAG—but the two are not the same thing.
#2 · An LLM knows everything
Why it's wrong: The parameters compress a lot of knowledge, but knowledge goes stale, is missing, or was never well-represented in training.
Correct view: Treat it as “a widely-read assistant who cannot always verify what it remembers,” not as a comprehensive database.
#3 · More parameters always mean a better model
Why it's wrong: Training tokens, data quality, and training strategy matter just as much.
Correct view: Bigger models have higher ceilings, but Chinchilla and the LLaMA family show that “smaller but trained well” can outperform “larger but trained poorly.”3,5
#4 · If it sounds fluent, it is correct
Why it's wrong: Linguistic fluency and factual correctness are not the same property.
Correct view: Hallucinations tend to occur precisely when the model sounds most confident.
#5 · RAG eliminates hallucinations
Why it's wrong: Retrieval can fail. Ranking can be wrong. Chunking can be wrong. The model can still misuse the context it received.
Correct view: RAG is a serious mitigation, not a guarantee.
#6 · Prompting can solve any problem
Why it's wrong: Some failures are about model capability, missing tools, or system design—not prompt wording.
Correct view: Prompts are important. They are not a replacement for retrieval, tools, evaluation, or safety design.
#7 · Agents can autonomously handle complex tasks
Why it's wrong: Multi-step planning propagates and amplifies any single error across the entire chain.
Correct view: Agents are a powerful, high-risk design pattern. They still require constraints, monitoring, and evaluation.
#8 · The model has a real human consciousness
Why it's wrong: Papers and official docs talk about language modeling, alignment, and behavior design—not about establishing consciousness.
Correct view: From an engineering perspective, the safest framing is “a powerful statistical generator,” not “a digital mind.”
#9 · The model only copies its training data
Why it's wrong: If it only copied, it could not do few-shot generalization or composition2; and yet research shows models can also memorize and leak fragments of training data17.
Correct view: Both generalization and memorization are real. Acknowledge both, not just one.
#10 · Open-source is automatically worse (or closed-source is automatically better)
Why it's wrong: Capability depends on the model, the training, the data, the surrounding system, and the use case. No tribal verdict holds in every situation.
Correct view: Compare on task performance, cost, controllability, safety, latency, and deployment requirements—not on the “open vs closed” label.
Chapter 19 · One overview diagram
This chapter introduces no new concepts. It puts the training side and the inference side on the same canvas so the whole pipeline is visible at once.
One-sentence takeaway
If you remember only one thing, remember this: at its core, an LLM predicts the next token from context; product capability comes from “the model itself plus the external systems wrapped around it.”
Chapter 20 · Learning roadmap
For non-technical users
Focus on tokens, context, hallucinations, prompting, RAG, and tool calls. The goal is not to train a model—it is to use models correctly, spot the failure modes, and ask better questions.
For PMs, operators, and founders
Focus on application architecture, RAG, agents, cost, latency, risk, and evaluation. The goal is to design a real LLM pipeline that ships to production, not to admire a demo.
For early-career engineers
Focus on API usage, prompt engineering, vector databases, RAG, structured outputs, tool calling, and evaluation. The goal is to build LLM applications that actually behave reliably.
For AI engineers
Go deep on Transformers, training, alignment, inference optimization, KV cache, throughput and latency, model compression, and benchmarking. The goal is to optimize the model and the system—not just call an existing API.
Glossary
Definitions below follow the usage of the papers and official documentation cited throughout this primer.
Representation
- Token
- The smallest unit of text the model actually processes.
- Tokenization
- The process of splitting raw text into tokens.
- Embedding
- The dense numeric vector that each token is mapped into.
- Vector space
- The high-dimensional space those vectors live in.
- Layer
- One block in the network that progressively refines the representation.
- Parameter
- A learnable numeric weight inside the model.
Architecture & runtime
- Transformer
- The modern sequence-modeling architecture built around attention.
- Self-attention
- The mechanism by which each position weights every other position in the context.
- Multi-head attention
- Multiple attention heads operating in parallel, each capturing different relations.
- Inference
- The model running in production—generating outputs from inputs.
- Decoding
- The procedure for picking each output token from the next-token distribution.
- Temperature / top-k / top-p
- Sampling parameters that control randomness and the candidate set.
- Context window
- The maximum number of tokens that can fit in one request.
- System prompt
- The system-level message that sets role, rules, and constraints.
Training & alignment
- Pretraining
- Language-modeling training over a huge corpus to establish broad capability.
- Fine-tuning
- Continued training on more specific data.
- SFT (Supervised Fine-Tuning)
- Fine-tuning on curated instruction–response demonstrations.
- RLHF (Reinforcement Learning from Human Feedback)
- Aligning the model using reinforcement learning on human preferences.
- DPO (Direct Preference Optimization)
- A simpler, more direct alternative to traditional RLHF for preference optimization.
- Alignment
- Shaping model behavior to match human goals and safety norms.
- Safety
- Design and training that reduce harmful or risky outputs.
- Evaluation
- Multi-dimensional measurement of model and system quality.
Systems & products
- Hallucination
- Output that sounds plausible but is factually wrong.
- RAG (Retrieval-Augmented Generation)
- Retrieve external knowledge first, then generate with it in context.
- Vector database
- A system that stores embeddings and runs efficient similarity search.
- Tool calling
- Letting the model emit structured calls that the application then executes.
- Agent
- A multi-step execution pattern: model + goal + tools + memory + planning.
- Multimodal model
- A model that can jointly process text, images, audio, or video.
- Latency
- The wall-clock delay from request to response.
- Cost
- The total cost of inference, storage, bandwidth, and operations.
- Deployment
- Shipping the model and its surrounding system into production.
References
Papers and official documentation cited throughout this primer. Where possible, links point to primary sources (arXiv preprints, official API documentation, or research blog posts).
Architecture & pretraining
- [1]Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- [2]Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020. arXiv:2005.14165
- [3]Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971
- [4]Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361
- [5]Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556
Representations, tokens & interpretability
- [6]Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (word2vec). arXiv:1301.3781
- [7]Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). arXiv:1508.07909
- [8]Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
- [9]Tenney, I., Das, D., & Pavlick, E. (2019). BERT Rediscovers the Classical NLP Pipeline. arXiv:1905.05950
- [10]Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT’s Attention. arXiv:1906.04341
Decoding & sampling
- [11]Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration (nucleus sampling). ICLR 2020. arXiv:1904.09751
Alignment, SFT & safety
- [12]Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). arXiv:2203.02155
- [13]Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO). arXiv:2305.18290
- [14]Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073
Hallucination & uncertainty
- [15]Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation. arXiv:2202.03629
- [16]Kalai, A., Nachum, O., et al. (2025). Why Language Models Hallucinate (OpenAI research). openai.com/research/why-language-models-hallucinate
- [17]Carlini, N., et al. (2021). Extracting Training Data from Large Language Models. arXiv:2012.07805
Retrieval-augmented generation (RAG)
- [18]Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
- [19]Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering (DPR). arXiv:2004.04906
- [20]Johnson, J., Douze, M., & Jégou, H. (2017). Billion-scale similarity search with GPUs (Faiss). arXiv:1702.08734
- [21]Yan, S., et al. (2024). Corrective Retrieval Augmented Generation (CRAG). arXiv:2401.15884
- [22]Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172
Tool use & agents
- [23]Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761
- [24]Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
Reasoning & chain-of-thought
- [25]Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
- [26]Wang, X., et al. (2022). Self-Consistency Improves Chain-of-Thought Reasoning in Language Models. arXiv:2203.11171
- [27]Chen, W., et al. (2022). Program of Thoughts Prompting: Disentangling Computation from Reasoning. arXiv:2211.12588
- [28]Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601
Multimodality
- [29]Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021. arXiv:2010.11929
- [30]Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
- [31]Alayrac, J.-B., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198
- [32]Liu, H., et al. (2023). Visual Instruction Tuning (LLaVA). arXiv:2304.08485
Serving systems & long context
- [33]Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). arXiv:2309.06180
Evaluation
- [34]Liang, P., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110
Official developer documentation
- [35]OpenAI. Tokenization & Tokens · API guide. platform.openai.com/docs/guides/text-generation
- [36]OpenAI. Function Calling & Structured Outputs. platform.openai.com/docs/guides/function-calling · structured-outputs
- [37]OpenAI. Prompt Caching. platform.openai.com/docs/guides/prompt-caching
- [38]Anthropic. Context windows & prompt engineering. docs.claude.com/.../context-windows · prompt-engineering
- [39]Anthropic. Tool use with Claude. docs.claude.com/.../tool-use/overview
- [40]Anthropic. Prompt caching. docs.claude.com/.../prompt-caching
All links point to publicly accessible primary sources. Updated paper versions and broken links welcome—please flag any issues.
Disclaimer
This primer is the author's personal study notes and teaching write-up. Nothing here constitutes investment, legal, medical, or other professional advice. The cited papers, official documentation, and third-party materials are used for educational illustration; the author makes no representation, express or implied, as to their accuracy or completeness.
Views expressed are solely the author's and do not represent any employer or third-party organization. Product, model, and company names mentioned belong to their respective owners.
Any input you enter into the interactive demos runs locally in your browser. The author does not collect or store any user input.

