One-sentence framing: a large language model is not a brain pretending to be a human—it is a very large probabilistic model trained to predict the next token from context. Text is first split into tokens, mapped into vectors, processed by many Transformer layers, and then decoded one token at a time1,2,11. Real products wrap that core loop with context management, retrieval18, tool calls23,24, multimodal input29,30, and safety/evaluation systems14,34.
Course framework
This primer uses a strict three-layer explanation pattern: first build intuition, then explain the mechanism, and finally show how production systems wire these capabilities together. The material is grounded in primary sources—Transformer, GPT-3, InstructGPT, DPO, RAG, Toolformer, ViT, Flamingo, LLaVA, and the developer docs from OpenAI and Anthropic1,2,12,13,18,23,29,31,32,35,38.
Positioning
This is not a name-dropping survey, nor a heavy mathematical paper review. Two perspectives run in parallel throughout:
- User's perspective: why the model can chat, write code, summarize documents—and why it can sound confident while being wrong.
- Engineer's perspective: why tokens cost money, why contexts overflow, and why production systems need RAG, tool calls, caches, evals, and safety controls.
Learning goals
| Goal | What you will understand |
|---|---|
| Use | What tokens, context, temperature, hallucination, RAG, and tool calls actually do |
| Explain | Why the model can hold a conversation—communicated to a non-technical colleague |
| Design | How to read a real LLM product request pipeline end-to-end |
| Question | When models fail, why they fail, and what mitigates the failure |
| Go deeper | Where to look next: API, RAG, fine-tuning, inference optimization |
Interactive demos
Five interactive demos are embedded inline, each in the chapter where it best illustrates the concept:
- Tokenizer (Chapter 2): split arbitrary text and inspect the token count.
- Attention heatmap (Chapter 4): click a token to see what it attends to.
- Next-token generator (Chapter 6): step through autoregressive decoding token by token.
- Temperature slider (Chapter 7): watch the softmax distribution sharpen or flatten in real time.
- RAG simulator (Chapter 12): compare a closed-book answer with an evidence-grounded one.
Part I
How a language model works
Chapter 1 · What an LLM really is
The intuition
Think of an LLM as an extraordinarily powerful autocomplete. You give it some context, and it keeps asking itself a single question: given everything seen so far, what is the most likely next token? Because it has read an enormous amount of text and absorbed strong pattern-induction skills, that “autocomplete” can act like chat, writing, translation, coding, summarization, and even something that looks like reasoning. The friendly mental model is “a champion at probabilistic sentence-completion”; the more precise framing is “an autoregressive language model that models the conditional distribution of the next token given context.”
Under the hood
GPT-style models learn something close to P(next token | preceding tokens). The GPT-3 paper showed that when model size, training data, and compute are pushed far enough, this single objective induces broad capabilities across question answering, translation, completion, code, and few-shot tasks2. The Transformer paper supplies the architecture that makes this modeling tractable at scale1. The honest framing is this: the model has compressed an enormous amount of linguistic, factual, and procedural patterns into its parameters—but that is not the same thing as having human-style awareness, subjective experience, or understanding.
Example
A user asks, “Explain what a black hole is.” The model does not draft a full encyclopedia entry up front. It predicts “A”, then “ black”, then “ hole”, then “ is”, then “ a”, then “ region”, one token at a time, extending the answer until it has covered the topic.
Common misconception
The misconception: an LLM is just a search engine. A more accurate framing: search engines retrieve, databases store and look up exact records, classical programs execute deterministic rules, and LLMs generate probable text. LLMs can be paired with search, databases, and tools—but they are not themselves any of those systems.
Bottom line
The first-principles definition is not “a chatbot.” It is “a large probability model that predicts the next token from context.”
Chapter 2 · What tokens are
The intuition
Models do not read language the way people do. From the model's perspective, text is a sequence of building blocks called tokens. A token might be a single character, a whole word, a morphological fragment, a punctuation mark, or even “a leading space plus half a word.”
Under the hood
Tokenization is the process of converting raw text into a sequence of tokens. Modern models almost always use a subword scheme such as Byte-Pair Encoding (BPE)7, which keeps the vocabulary at a manageable size while still handling rare words, new words, and many languages. OpenAI's developer docs make this concrete: a token can be as short as a single character or as long as a whole word, and whitespace and punctuation also count35. Different languages, models, and tokenizers can split the same string in different ways.
Run any text through the GPT-4 tokenizer (cl100k_base) in real time. Compare character and token counts.
Note: cl100k_base is the tokenizer used by GPT-4 and GPT-3.5-turbo. Different models can tokenize the same string differently. Whitespace and many multi-byte characters (including some CJK characters) often split into several tokens.
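If you prefer to poke at this in code rather than in the demo, the sketch below does the same thing with OpenAI's open-source tiktoken library (assuming tiktoken is installed locally): it encodes a string with cl100k_base and shows the character-to-token ratio.

```python
# Minimal sketch: how cl100k_base splits a string into tokens.
# Assumes the open-source tiktoken package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Unbelievable: tokens are the budget unit."
token_ids = enc.encode(text)                        # the integer IDs the model actually processes
pieces = [enc.decode([tid]) for tid in token_ids]   # the text fragment behind each ID

print(f"{len(text)} characters -> {len(token_ids)} tokens")
print(pieces)  # exact splits depend on the tokenizer and can differ across models
```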
Example
The English word unbelievable is often split into prefix, root, and suffix: [un][believ][able]. The Chinese sentence “我想学习大模型原理” might be one-character-one-token in one tokenizer and two-character pieces like “学习” in another. None of these splits align cleanly with the word boundaries a human would draw.
Common misconception
The misconception: one token equals one character, or one token equals one word. The correct view: a token is whatever unit the model's tokenizer happens to produce, and those units rarely line up with human notions of “character” or “word.”
Bottom line
Whenever you write a little more text, the model is not reading more characters—it is processing more tokens. From an engineering perspective, the token is the budget unit.
Chapter 3 · Embeddings: turning text into computable numbers
The intuition
Computers do not natively process “cat,” “dog,” or “black hole” as symbols—they are very good at numbers. So the model first converts every token into a vector. Imagine giving each token a coordinate on a “map of meaning”: tokens that mean similar things end up close together; tokens that are unrelated end up far apart.
Under the hood
Early work on word vectors established that distributed representations can capture semantic and syntactic regularities6. Transformer-based models go further: they do not stop at static token embeddings. Every layer rewrites those embeddings into contextual representations—the same word in different sentences ends up with a different internal vector. Analyses of BERT and related models show that this contextualization grows stronger in higher layers8,9. The internal representation of “apple” in “I ate an apple” is not the same as in “Apple launched a phone.”
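To make the “map of meaning” idea concrete, here is a minimal numpy sketch of cosine similarity between embedding vectors. The three 4-dimensional vectors are invented for illustration; real models learn vectors with hundreds or thousands of dimensions.

```python
# Minimal sketch of the "map of meaning": cosine similarity between embedding vectors.
# The 4-dimensional vectors are invented; real models learn much higher-dimensional ones.
import numpy as np

embeddings = {
    "cat":          np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":          np.array([0.8, 0.9, 0.2, 0.1]),
    "refrigerator": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cat vs dog:         ", round(cosine(embeddings["cat"], embeddings["dog"]), 3))
print("cat vs refrigerator:", round(cosine(embeddings["cat"], embeddings["refrigerator"]), 3))
```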
Example
If the model repeatedly sees “cats meow, cats run, cats are animals” and “dogs bark, dogs run, dogs are animals,” the internal representations of “cat” and “dog” end up closer to each other than to “refrigerator” or “locomotive,” which appear in completely different contexts.
Common misconception
The misconception: an embedding is just a numeric ID for each token. The correct view: the ID is a discrete lookup key; the embedding is the dense, learned vector the model actually computes with. In modern Transformers, that vector is rewritten layer by layer as context flows through.
Bottom line
Embeddings move language from the world of symbols into a world the model can actually compute over.
Chapter 4 · Transformer and self-attention
The intuition
If tokens are the building blocks and embeddings are their numeric coordinates, the Transformer is the assembly line. It looks at the whole sequence over and over, deciding which earlier positions matter most for the position it is currently processing.
Under the hood
The Transformer's defining capability is self-attention. Unlike a classical RNN that ships information one cell at a time, self-attention lets every position route information directly to every other position. The original Transformer paper introduces multi-head attention as multiple parallel “attention heads” that can capture different kinds of relationships from different representation subspaces1. Later analyses of BERT show that some heads are clearly biased toward syntax or coreference relationships10.
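For readers who want the mechanism rather than the metaphor, here is a minimal single-head scaled dot-product attention sketch in numpy. The Q, K, V matrices are random stand-ins for the learned projections a real Transformer would compute from the token embeddings.

```python
# Minimal single-head scaled dot-product attention, following the Transformer paper.
# Q, K, V are random stand-ins for learned projections of the token embeddings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query position to every key position
    weights = softmax(scores, axis=-1)   # each row is a probability distribution over positions
    return weights @ V, weights          # output: a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                      # 5 tokens, one 8-dimensional attention head
Q, K, V = rng.normal(size=(3, seq_len, d_k))
output, weights = attention(Q, K, V)
print(weights.round(2))                  # a 5x5 attention map like the illustrative table below
```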
Click any token (the query) and the row below shows that token's attention over the rest. Weights are illustrative, not from a real model.
| | Apple | released | a | new | phone | , | its | performance | is | strong |
|---|---|---|---|---|---|---|---|---|---|---|
| Apple | 100 | |||||||||
| released | 55 | 45 | ||||||||
| a | 15 | 55 | 30 | |||||||
| new | 30 | 20 | 10 | 40 | ||||||
| phone | 30 | 30 | 5 | 30 | 5 | |||||
| , | 5 | 5 | 2 | 35 | 40 | 13 | ||||
| its | 10 | 5 | 2 | 20 | 55 | 3 | 5 | |||
| performance | 5 | 5 | 2 | 10 | 20 | 2 | 30 | 26 | ||
| is | 3 | 3 | 2 | 5 | 6 | 2 | 10 | 45 | 24 | |
| strong | 2 | 2 | 5 | 8 | 2 | 8 | 45 | 10 | 17 |
Example
Consider “Apple released a new phone; its performance is strong.” The pronoun “its” is far more likely to refer to “new phone” than to “Apple.” The model uses contextual cues to dynamically allocate attention to the right antecedent.
Common misconception
The misconception: attention is exactly “what the model is looking at.” The correct view: attention is a learned weighting mechanism. It often matches human intuition, but it is not a direct readout of cognition or a complete explanation of the model's behavior.
Bottom line
The Transformer frees models from passing information forward one position at a time, and that is what makes long-range dependencies tractable.
Chapter 5 · How representations are refined layer by layer
The intuition
A useful image is a stack of editors revising the same draft. The first pass notices spelling and surface phrasing. The second pass tightens sentence structure. The third pass focuses on meaning and intent. The later the pass, the closer you get to “what is this text actually trying to do?”
Under the hood
Probing studies of Transformer-style models such as BERT show that lower layers tend to emphasize surface features, middle layers tend to track syntax, and upper layers tend to encode semantics and task-relevant information9. Higher layers are also more strongly contextualized—they depend more on the full sequence9,10. None of this implies the model is “thinking in stages” in a human sense; a cleaner framing is that the same hidden representation gets repeatedly rewritten as it flows up the stack.
Example
The word “bank” in “I went to the bank to deposit cash” and “the bank of the river” means two very different things. Probing studies show that as a sentence flows up the layers, these two uses end up with internal representations that are more clearly distinct than they were at the bottom.
Common misconception
The misconception: layer 1 strictly handles grammar, layer 2 strictly handles meaning, layer 3 strictly handles logic. The correct view: these are tendencies, not strict assignments. Different models, different tasks, and different layers all overlap.
Bottom line
The closest translation of “the model is thinking” into mechanical terms is “the hidden representation keeps getting rewritten across layers.”
Chapter 6 · Why the model can generate answers
The intuition
The model does not draft a complete answer before speaking, the way a person might outline an essay before writing it out. It plays a very long sequence game: pick the next token, append it to the context, decide the next token again, and so on.
Under the hood
At inference time an autoregressive language model decodes one token at a time. At each step it emits a probability distribution over the vocabulary, a decoding strategy chooses a token from that distribution, and the chosen token becomes part of the new context for the next step. Holtzman et al. make the point clearly: the distribution the model produces is one thing; the strategy used to sample from it is something else entirely11.
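The decoding loop itself is short enough to sketch. In the snippet below, toy_model is a made-up stand-in for a real Transformer forward pass; the loop structure (choose a token, append it, repeat) is what matches production systems.

```python
# Minimal sketch of the autoregressive decoding loop. toy_model is a made-up stand-in
# for a Transformer forward pass: it returns a next-token distribution given the context.
import random

def toy_model(context):
    last = context[-1]
    if last == " cat":
        return {" is": 0.6, " sat": 0.3, "<eos>": 0.1}
    if last == " is":
        return {" a": 0.8, " the": 0.1, "<eos>": 0.1}
    if last == " a":
        return {" mammal": 0.5, " pet": 0.3, "<eos>": 0.2}
    return {"<eos>": 1.0}

def decode(model, context, max_new_tokens=10):
    for _ in range(max_new_tokens):
        probs = model(context)                                   # distribution over candidate tokens
        tokens, weights = zip(*probs.items())
        next_token = random.choices(tokens, weights=weights)[0]  # the sampling strategy lives here
        if next_token == "<eos>":
            break
        context = context + [next_token]                         # the choice becomes part of the context
    return context

print("".join(decode(toy_model, ["A", " cat"])))                 # e.g. "A cat is a mammal"
```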
Step through autoregressive decoding. At each step the model emits a top-k probability distribution; pick one to advance. Numbers are illustrative.
Example
If the model picks “mammal” at this step, the context becomes “A cat is a mammal,” and the next distribution shifts toward “,” “.” “that” “belonging”—the things that plausibly follow that prefix.
Common misconception
The misconception: the model secretly writes the entire response first and then streams it out slowly. The correct view: most production generative LLMs unfold the answer one token at a time at inference.
Bottom line
Generation is not “a paragraph in one shot.” It is “a probability distribution plus a decoding loop.”
Chapter 7 · Temperature, top-k, and top-p
The intuition
The same model can feel disciplined one moment and adventurous the next. The difference often is not the model—it is the sampling parameters.
Under the hood
Temperature controls how peaked or flat the distribution becomes. Low temperature exaggerates the top candidate's lead; high temperature lets long-tail candidates compete. Top-k keeps only the k most likely candidates and samples from those. Top-p, also known as nucleus sampling, keeps the smallest set of candidates whose cumulative probability mass reaches p. Holtzman et al. introduced nucleus sampling because purely greedy maximum-likelihood decoding tends to produce repetitive, degenerate text11.
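A small numpy sketch makes the three knobs concrete: the same toy logits, rescaled by temperature, then filtered by top-k and top-p. The logits are invented for illustration.

```python
# Minimal sketch of temperature, top-k, and top-p applied to the same toy logits.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])       # invented scores for 5 candidate tokens

for T in (0.2, 1.0, 2.0):                            # temperature rescales logits before softmax
    print(f"T={T}:", softmax(logits / T).round(3))   # low T sharpens, high T flattens

def top_k(probs, k):
    keep = np.argsort(probs)[-k:]                    # indices of the k most likely candidates
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p(probs, p):
    order = np.argsort(probs)[::-1]                  # candidates sorted by probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]      # smallest set with cumulative mass >= p
    return out / out.sum()

probs = softmax(logits)
print("top-k, k=2:  ", top_k(probs, 2).round(3))
print("top-p, p=0.9:", top_p(probs, 0.9).round(3))
```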
Same logits, different temperatures produce very different distributions. Low T sharpens (greedy); high T flattens (explorative).
Temperature does not change what the model knows (the logits). It only changes how the model chooses from its own distribution. Low T fits code and structured extraction; high T fits creative writing and brainstorming.
Example
Ask the model to finish a line of poetry. At low temperature it tends toward safe, expected phrasing. At high temperature it reaches for novel imagery—but it is also more likely to drift off-topic.
Common misconception
The misconception: higher temperature makes the model smarter. The correct view: higher temperature only makes the model more random. Creativity may go up, and so may error rate.
Bottom line
Sampling parameters do not swap the model. They change how the model picks an answer from its own distribution.
Part II
How the model is trained and scaled
Chapter 8 · How LLMs are trained
The intuition
If inference is “the model answering,” training is “the model studying.” First it reads a vast amount of text and learns linguistic patterns and broad world knowledge. Then it is coached on high-quality question–answer demonstrations to behave more like an assistant. Then it is shaped to match human preferences and safety norms. Finally it is stress-tested.
Under the hood
A canonical pipeline has four stages. Pretraining runs next-token prediction over enormous corpora to absorb language and knowledge patterns. Supervised fine-tuning (SFT) pulls the base model toward following instructions, using curated demonstrations. Preference alignment (RLHF, DPO, and friends) shapes the model toward being more helpful, honest, and harmless. Safety training and evaluation reduce risky outputs through rules, red-teaming, refusal strategies, and eval pipelines. InstructGPT laid out the SFT+RLHF recipe systematically12; DPO offered a simpler, more direct way to do preference optimization13; Constitutional AI showed that a written list of principles plus AI feedback can also drive alignment14.
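At the level of code, the pretraining objective is just cross-entropy between the model's predictions and the next token. The PyTorch sketch below uses random logits as a stand-in for a real model so that the shift-by-one bookkeeping is visible.

```python
# Minimal sketch of the pretraining objective: next-token prediction as cross-entropy.
# The logits are random stand-ins for a real model's output; only the bookkeeping is real.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # one toy training sequence
logits = torch.randn(1, seq_len, vocab_size)              # "model output" at every position

# Position t is trained to predict token t+1, so inputs and targets are shifted by one.
predictions = logits[:, :-1, :].reshape(-1, vocab_size)
targets = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(predictions, targets)
print(loss.item())   # the number that pretraining pushes down over trillions of tokens
```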
Example
Pretraining teaches the model how to produce coherent language. SFT teaches it to act like an assistant. Preference alignment teaches it which answers feel right. Safety training teaches it to refuse or hedge when the request itself is dangerous.
Common misconception
The misconception: training literally copies an encyclopedia into the model. The correct view: the model mostly compresses statistical patterns into its parameters rather than storing a verbatim archive. That said, research shows models can memorize and reproduce snippets of training data17, so the truth is somewhere between “perfect copy” and “no memory at all.”
Bottom line
Training is not a single event. It is a pipeline that moves the model from “fluent language” to “assistant behavior” to “within safety bounds.”
Chapter 9 · Parameters and why 7B, 70B, 175B
The intuition
Picture a vast machine filled with knobs. The parameters are those knobs. Training is the process of patiently adjusting every knob so that, given an input, the machine produces a more sensible output.
Under the hood
Parameters are the learnable numerical weights inside the model. GPT-3 disclosed a 175B-parameter scale2. LLaMA published a family ranging from 7B to 65B3. Chinchilla used 70B parameters but, by training on far more tokens, outperformed several much larger models5. Together, the scaling-law work4 and Chinchilla5 argue something nuanced: bigger models typically have higher ceiling performance, but whether one is actually “better” depends on training tokens, data quality, architecture, and training discipline.
Example
The B in “7B,” “70B,” and “175B” stands for billion. 7B is roughly 7 billion parameters; 70B is roughly 70 billion. More parameters generally mean more capacity—but also more memory, lower inference throughput, and higher deployment cost.
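A quick back-of-envelope sketch shows why parameter count maps directly onto hardware cost. It assumes 2 bytes per parameter (16-bit weights only) and ignores activations, the KV cache, and optimizer states, which all add more on top.

```python
# Back-of-envelope: why "7B" vs "70B" vs "175B" translates directly into memory.
# Assumes 2 bytes per parameter (fp16/bf16 weights only); activations, the KV cache,
# and optimizer states all add more on top.
for name, params in [("7B", 7e9), ("70B", 70e9), ("175B", 175e9)]:
    gib = params * 2 / 1024**3
    print(f"{name}: ~{gib:,.0f} GiB just to hold the weights in 16-bit precision")
```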
Common misconception
The misconception: more parameters always mean a better model. The correct view: parameter count is only part of the capacity and ceiling story. A well-trained 70B can routinely beat a sloppily trained much larger model.
Bottom line
Parameters are knobs. Scale matters—but scale is not the whole story.
Chapter 10 · Context windows
The intuition
A context window is a workbench. How much material you can lay out for the model at once depends on how big that workbench is. A larger bench means more material in view—but piling too much on at once, in the wrong order, makes it harder for the model to pick out what actually matters.
Under the hood
The context window is the number of tokens the model can process in one request. Anthropic's docs are explicit: the window holds the conversation history, the current request, and the space the model needs to write its output38. In real-world agent and coding workflows it also holds the system prompt, file contents, prior model responses, and tool returns. Longer contexts come with real costs: higher prefill compute, KV-cache memory pressure, and higher latency. System-level work such as vLLM and PagedAttention is exactly about making this cheaper at scale33. And the “Lost in the Middle” result is a useful reality check: more context is not automatically better use of context—important evidence buried in the middle of a long prompt can actually hurt performance22.
Engineering note
In real products, long contexts are almost always combined with truncation, summarization, retrieval, and caching. Both OpenAI and Anthropic ship prompt-caching features specifically to make long, stable prefixes cheap to reuse, which improves both latency and cost on long-document workflows37,40.
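A toy budget makes the workbench metaphor concrete. The 128k window and all the counts below are invented, but the arithmetic is what every long-context product has to do before each request.

```python
# Toy context-window budget. The 128k window and all the counts are invented;
# real numbers come from the model's spec sheet and its own tokenizer.
CONTEXT_WINDOW = 128_000
reserved_for_output = 8_000

budget = {
    "system prompt": 1_200,
    "conversation history": 18_000,
    "retrieved document chunks": 90_000,
    "current user request": 800,
}

used = sum(budget.values())
remaining = CONTEXT_WINDOW - used - reserved_for_output
print(f"used {used:,} tokens, {remaining:,} left before truncation or summarization kicks in")
```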
Common misconception
The misconception: a big enough context window means the model “remembers you forever.” The correct view: the context window is only what is currently visible inside this session—it is working memory, not long-term memory.
Bottom line
The context window is the most important—and most underappreciated—piece of desk space in the whole stack.
Part III
Retrieval, tools, and multimodality in real products
Chapter 11 · Why hallucinations happen
The intuition
Hallucination is not lying. A more honest description is closer to “a very fluent person who, when they do not know the answer, still talks as if they do.”
Under the hood
A hallucination is, by working definition, an output that is linguistically smooth but factually wrong. Recent surveys identify multiple converging causes: limited or stale parametric knowledge, biased or incomplete training data, ambiguous prompts, absent retrieval, and a decoding bias toward producing “something that looks like an answer” rather than admitting uncertainty15. OpenAI's own write-up adds a sharper observation: traditional training and evaluation often reward a plausible-sounding guess over an honest “I don't know”16.
Example
Ask the model to summarize a paper that does not exist—say, “Please summarize Smith et al. 2024 on Quantum Bubble Tea Optimization.” A model is likely to invent plausible authors, methods, and findings, because the generation process is biased toward producing something that looks like a real abstract.
How to mitigate
The reliable mitigations are engineering, not magic: require the model to cite sources, wire in retrieval or search, ask it to flag uncertainty explicitly, separate “known facts” from “recommendations,” and add human review on high-stakes domains like medicine, law, and finance. Hallucinations can be reduced significantly; no single trick eliminates them today.
Bottom line
Hallucination is not a character flaw. It is the natural reliability cost of any probabilistic generation system on factually demanding tasks.
Chapter 12 · RAG and why it reduces hallucination
The intuition
A plain LLM is a closed-book exam. Retrieval-augmented generation (RAG) is the open-book version: the model gets to consult relevant material first, then answer.
Under the hood
RAG is shorthand for retrieval-augmented generation. Lewis et al. frame it as “parametric memory plus non-parametric external memory”: retrieve relevant documents first, drop them into the context, then generate18. The retriever can be dense; DPR showed that a dense retriever can clearly beat strong BM25 baselines on open-domain QA19. Vector databases and approximate nearest-neighbor indices handle the embeddings at scale—Faiss is the canonical reference implementation20.
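Stripped to its skeleton, RAG is a few lines of code. In the sketch below, embed() is a deliberately crude stand-in for a real embedding model, and the list of chunks plays the role of a vector database; the point is the retrieve-then-prompt control flow, not the retrieval quality.

```python
# Minimal RAG skeleton: embed chunks, retrieve by similarity, build the prompt.
# embed() is a deliberately crude stand-in for a real embedding model, and the list
# of chunks stands in for a vector database; only the control flow matches production.
import numpy as np

def embed(text: str) -> np.ndarray:
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1.0          # toy "embedding": character counts, not semantics
    return v / (np.linalg.norm(v) + 1e-9)

chunks = [
    "Operating cash flow declined 18% year over year (p. 12).",
    "The company opened three new offices in Europe (p. 3).",
    "Days sales outstanding lengthened from 45 to 61 days (p. 27).",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

question = "What is the biggest risk in this earnings report?"
q = embed(question)
top = sorted(index, key=lambda item: float(q @ item[1]), reverse=True)[:2]

prompt = "Answer using ONLY the evidence below, and cite the page numbers.\n\nEvidence:\n"
prompt += "\n".join(f"- {chunk}" for chunk, _ in top)
prompt += f"\n\nQuestion: {question}"
print(prompt)                            # this assembled prompt is what gets sent to the LLM
```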
Compare a closed-book answer with an evidence-grounded one (RAG). Retrieval snippets and answers are illustrative.
Common risks for a software company at this stage include macro headwinds, competitive pressure, gross-margin compression, FX volatility, and regulatory uncertainty. To pinpoint the largest risk in this specific filing, please share the document.
RAG is not a silver bullet. Retrieval quality, chunking, ranking, and the position of evidence in the prompt all affect the final answer (see Lost-in-the-Middle).
Example
Ask, “What is the biggest risk in this earnings report?” Without RAG, the model gives a generic answer: macro headwinds, competition, operational risk. With RAG, the model can pull specific passages from the filing and answer concretely: weakening cash flow, foreign-revenue concentration, longer DSO—citing the page numbers.
RAG is not a silver bullet
If retrieval fails, or if too much irrelevant material lands in the context, the model still answers wrong. Follow-up work like CRAG explicitly tackles “what to do when retrieval is bad”21. The “Lost in the Middle” result shows that even when the right passage is in the context, the model may underuse it—especially if it sits in the middle of a long prompt22.
Common misconception
The misconception: once you add RAG, hallucinations are gone. The correct view: RAG converts “closed-book” into “may consult materials.” It does not automatically turn the system into a fact engine. Retrieval quality, chunking, ranking, and citation design all matter.
Bottom line
The point of RAG is to swap “what the model vaguely remembers” for “what the model can see in front of it right now.”
Chapter 13 · Tool use and agents
The intuition
LLMs are very good at producing text, but not necessarily at exact arithmetic, real-time lookup, or actually performing actions in the outside world. So we wire them up to tools—hands and feet, in a sense.
Under the hood
The OpenAI tool-use loop is straightforward: the application describes the available tools to the model, the model emits a structured tool call, the application actually runs the tool, the result is fed back into the model, and the model integrates the result into its final answer36. Anthropic's docs follow the same pattern: the model decides whether to call a tool, which one, and what arguments to pass39. Toolformer showed that language models can themselves learn when to call an API, what arguments to pass, and how to use the result23. ReAct interleaves reasoning steps with action steps so the model can plan and act in alternation24.
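The loop is easier to see in code than in prose. The sketch below invents a toy message format and a toy model (call_model and toy_model are illustrative, not any vendor's real API); what it preserves is the control flow: the model requests a tool, the application executes it, and the result goes back into the context.

```python
# Minimal sketch of the tool-use loop. The message format, call_model, and toy_model
# are invented for illustration and are not any vendor's real API; the control flow
# (model requests a tool, application runs it, result goes back in) is the point.
import json

def get_fx_rate(base: str, quote: str) -> float:
    return {"USD/EUR": 0.92}[f"{base}/{quote}"]          # stand-in for a real forex API

TOOLS = {"get_fx_rate": get_fx_rate}

def run(messages, call_model):
    while True:
        reply = call_model(messages)                     # the model answers or asks for a tool
        if reply.get("tool_call") is None:
            return reply["content"]                      # final answer for the user
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])  # the application executes the tool
        messages.append({"role": "tool", "content": json.dumps(result)})

def toy_model(messages):
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool_call": {"name": "get_fx_rate",
                              "arguments": {"base": "USD", "quote": "EUR"}}}
    rate = json.loads(tool_results[-1]["content"])
    return {"tool_call": None, "content": f"500 USD is about {500 * rate:.2f} EUR."}

print(run([{"role": "user", "content": "Convert 500 USD to EUR"}], toy_model))
```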
How to think about agents
A working definition: agent ≈ LLM + goal + tools + memory + multi-step planning. Instead of answering one question at a time, an agent decomposes a goal, chooses tools, inspects intermediate results, and decides what to do next. This is powerful—and brittle: any misread of the goal, any flawed plan, any wrong tool argument, any misinterpretation of a tool result can derail the whole chain.
Example
A user asks, “What is today's USD to EUR rate? And how much is 500 USD in EUR?” A plain LLM is likely to guess. Wired up to a forex lookup tool and a calculator, the model retrieves the rate, runs the conversion, then writes the explanation.
Common misconception
The misconception: agents are already reliable digital employees. The correct view: an agent is a design pattern, not a magical personality. It can dramatically extend capability—and can also amplify any single error across many steps.
Bottom line
Tools give the model external capabilities. Agents give the model a multi-step execution framework.
Chapter 14 · How multimodal models see, hear, and watch
The intuition
Multimodal models did not suddenly grow eyes and ears. They learned to turn images, audio, and video into representations that look, to the model, similar to the way text already looks—and then process everything in one shared space.
Under the hood
Vision Transformer (ViT) demonstrated that an image can be cut into patches and fed through a Transformer like any other sequence29. CLIP showed that images and text can be projected into a shared semantic space30. Flamingo and LLaVA went further by bridging a vision encoder into a language model so the system can handle interleaved image–text input, visual question answering, screenshot understanding, and multimodal dialogue31,32. Video is typically “a sequence of frames plus a timeline.”
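The patch idea from ViT fits in a few lines of numpy: cut the image into 16x16 patches and flatten each one, so the image becomes a sequence of "visual tokens." The random array stands in for a real photo.

```python
# Minimal sketch of the ViT idea: cut an image into 16x16 patches and flatten each one,
# so the image becomes a sequence of "visual tokens". The random array stands in for a photo.
import numpy as np

image = np.random.rand(224, 224, 3)                     # toy 224x224 RGB image
patch = 16
grid = 224 // patch                                     # 14 patches per side

patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)                                    # (196, 768): 196 patch "tokens" of 768 numbers
```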
Example
You hand the model a screenshot of an earnings dashboard and ask, “What looks like the main driver of margin compression?” The model encodes the image into visual features, fuses those features with your text question, and produces an explanation.
Common misconception
The misconception: a multimodal model literally “sees” the way a person does. The correct view: the model sees a numerical representation produced by an encoder—not subjective visual experience.
Bottom line
Multimodality is not “a text model with an image hat.” It is the unification of multiple input modalities into one computable representation space.
Chapter 15 · Why models appear to reason
The intuition
A lot of what looks like reasoning is the model having seen, during training, an enormous number of problem–step–answer patterns—and having learned the language templates that go with them.
Under the hood
Chain-of-Thought prompting shows that explicit intermediate steps in the prompt boost performance on hard reasoning tasks25. Self-Consistency improves accuracy further by sampling multiple reasoning paths and voting on the answer26. Program-of-Thoughts hands the actual arithmetic off to an interpreter27. Tree-of-Thoughts lets the model explore a search tree and backtrack28. The common theme is clear: making the steps explicit, verifying externally, and searching across alternatives all raise the model's ceiling on solving problems.
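Self-consistency in particular is simple enough to sketch. Below, sample_reasoning_path is a hypothetical stand-in for calling the model at non-zero temperature; the voting logic is the technique itself.

```python
# Minimal sketch of self-consistency: sample several reasoning paths and vote on the answer.
# sample_reasoning_path is a hypothetical stand-in for calling the model at non-zero temperature.
from collections import Counter
import random

def sample_reasoning_path(question: str) -> str:
    # Toy behavior: most sampled paths land on the right answer, some do not.
    return random.choices(["11", "11", "11", "13"])[0]

question = "Roger has 5 balls and buys 2 cans with 3 balls per can. How many balls?"
answers = [sample_reasoning_path(question) for _ in range(9)]
answer, votes = Counter(answers).most_common(1)[0]
print(f"sampled answers: {answers} -> majority vote: {answer} ({votes}/9)")
```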
Example
“Roger has 5 balls. He buys 2 more cans, with 3 balls per can. How many balls in total?” A direct one-shot answer is more likely to be wrong. The same model, instructed to compute “2 × 3 = 6” first and then “5 + 6 = 11,” is much more reliable—and even more so if “2 × 3” is actually evaluated by a calculator.
Common misconception
The misconception: as long as the model writes the steps out, the logic is sound and the answer is right. The correct view: writing the steps usually helps, but it does not guarantee that any individual step is correct. “Looks like reasoning” and “reliable mathematical or logical correctness” are not the same thing.
Bottom line
LLMs can display reasoning behavior—more like “has learned a wide repertoire of problem-solving patterns” than “has an infallible logic engine inside.”
Chapter 16 · Why prompts matter
The intuition
A prompt is the brief you hand to an assistant. The clearer you are about who they are, what you want, the constraints, and the format, the more likely you are to get back what you actually need.
Under the hood
OpenAI's docs put it directly: prompting is how you instruct the model, and output quality often follows how well you prompt35. Anthropic's prompt-engineering docs add an important reminder: not every failure should be solved by tweaking the prompt—some problems are better fixed by choosing a different model, redesigning the system, adding retrieval, or improving evaluation38. System-message design also matters because system prompts shape role, tone, format, and safety boundaries.
Anatomy of a good prompt
Role, goal, background, constraints, output format, quality criteria. Six blocks, all useful. Example: “You are a financial analyst. Based on the attached PDF, summarize the risks. Output a four-column table: Risk · Evidence · Impact · Uncertainty. If you cannot find evidence for a risk, write ‘no supporting evidence in the document.’” This prompt's value is not in making the model smarter—it is in narrowing the task, making the output checkable, and removing room for hand-waving.
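One way to keep the six blocks honest is to treat the prompt as a template rather than free-form text. The sketch below is illustrative only; the field names and wording are not a standard, just one way to structure the brief.

```python
# Illustrative only: the six prompt blocks as a template. The field names and wording
# are not a standard, just one way to keep the brief narrow and the output checkable.
PROMPT_TEMPLATE = """\
Role: {role}
Goal: {goal}
Background: {background}
Constraints: {constraints}
Output format: {output_format}
Quality criteria: {quality_criteria}
"""

prompt = PROMPT_TEMPLATE.format(
    role="You are a financial analyst.",
    goal="Summarize the investment risks in the attached earnings report.",
    background="The attached PDF is the company's latest quarterly filing.",
    constraints="If you cannot find evidence for a risk, write 'no supporting evidence in the document'.",
    output_format="A four-column table: Risk, Evidence, Impact, Uncertainty.",
    quality_criteria="Every risk must cite a page number from the document.",
)
print(prompt)
```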
Common misconception
The misconception: with a long enough, clever enough prompt, anything can be solved. The correct view: prompts work within the limits of the model's capabilities, the available context, the wired-up tools, and the surrounding system design. Important, yes. Universal master key, no.
Bottom line
A good prompt is not an incantation. It is a precise, executable, and verifiable task brief.
Chapter 17 · One complete LLM product request
The intuition
In a real product, “user asks, model answers” rides on top of an entire backend: authentication, safety policy, retrieval, tools, output validation, logging, and evaluation.
Under the hood
Take the request “Analyze this earnings report and summarize the investment risks.” In production, a real system typically: accepts the request, runs authentication and safety checks, reads the file, chunks it, retrieves and reranks the relevant passages, assembles the prompt, calls the model, invokes calculation tools when needed, generates a structured output, attaches citations, returns the result, and logs everything for later evaluation. OpenAI's structured-outputs and function-calling docs cover exactly how to constrain output schemas and wire up tools36. The HELM evaluation framework reminds us that production quality is more than accuracy—robustness, calibration, fairness, toxicity, and efficiency all matter34.
Example
If the earnings report includes metrics like year-over-year growth or cash-flow coverage ratios, a well-designed system asks the model to extract the numbers and hands the actual arithmetic to a calculator. That is almost always more reliable than asking the model to do mental math.
Common misconception
The misconception: a strong base model is enough to build a strong product. The correct view: production quality is the joint design of model + data + tools + prompts + safety + evaluation + cost discipline.
Bottom line
When you see a fluent answer in a real product, it is almost never just a model speaking. It is a whole engineering pipeline.
Part IV
Misconceptions and learning roadmap
Chapter 18 · Common misconceptions
#1 · An LLM is just a search engine
Why it's wrong: Search engines retrieve from existing documents. LLMs generate text from context. These are different operations.
Correct view: Many great products combine an LLM with search or RAG—but the two are not the same thing.
#2 · An LLM knows everything
Why it's wrong: The parameters compress a lot of knowledge, but knowledge goes stale, is missing, or was never well-represented in training.
Correct view: Treat it as “a widely-read assistant who cannot always verify what it remembers,” not as a comprehensive database.
#3 · More parameters always mean a better model
Why it's wrong: Training tokens, data quality, and training strategy matter just as much.
Correct view: Bigger models have higher ceilings, but Chinchilla and the LLaMA family show that “smaller but trained well” can outperform “larger but trained poorly.”3,5
#4 · If it sounds fluent, it is correct
Why it's wrong: Linguistic fluency and factual correctness are not the same property.
Correct view: Hallucinations tend to occur precisely when the model sounds most confident.
#5 · RAG eliminates hallucinations
Why it's wrong: Retrieval can fail. Ranking can be wrong. Chunking can be wrong. The model can still misuse the context it received.
Correct view: RAG is a serious mitigation, not a guarantee.
#6 · Prompting can solve any problem
Why it's wrong: Some failures are about model capability, missing tools, or system design—not prompt wording.
Correct view: Prompts are important. They are not a replacement for retrieval, tools, evaluation, or safety design.
#7 · Agents can autonomously handle complex tasks
Why it's wrong: Multi-step planning propagates and amplifies any single error across the entire chain.
Correct view: Agents are a powerful, high-risk design pattern. They still require constraints, monitoring, and evaluation.
#8 · The model has a real human consciousness
Why it's wrong: Papers and official docs talk about language modeling, alignment, and behavior design—not about establishing consciousness.
Correct view: From an engineering perspective, the safest framing is “a powerful statistical generator,” not “a digital mind.”
#9 · The model only copies its training data
Why it's wrong: If it only copied, it could not do few-shot generalization or composition2; and yet research shows models can also memorize and leak fragments of training data17.
Correct view: Both generalization and memorization are real. Acknowledge both, not just one.
#10 · Open-source is automatically worse (or closed-source is automatically better)
Why it's wrong: Capability depends on the model, the training, the data, the surrounding system, and the use case. No tribal verdict holds in every situation.
Correct view: Compare on task performance, cost, controllability, safety, latency, and deployment requirements—not on the “open vs closed” label.
Chapter 19 · One overview diagram
This chapter introduces no new concepts. It puts the training side and the inference side on the same canvas so the whole pipeline is visible at once.
One-sentence takeaway
If you remember only one thing, remember this: at its core, an LLM predicts the next token from context; product capability comes from “the model itself plus the external systems wrapped around it.”
Chapter 20 · Learning roadmap
For non-technical users
Focus on tokens, context, hallucinations, prompting, RAG, and tool calls. The goal is not to train a model—it is to use models correctly, spot the failure modes, and ask better questions.
For PMs, operators, and founders
Focus on application architecture, RAG, agents, cost, latency, risk, and evaluation. The goal is to design a real LLM pipeline that ships to production, not to admire a demo.
For early-career engineers
Focus on API usage, prompt engineering, vector databases, RAG, structured outputs, tool calling, and evaluation. The goal is to build LLM applications that actually behave reliably.
For AI engineers
Go deep on Transformers, training, alignment, inference optimization, KV cache, throughput and latency, model compression, and benchmarking. The goal is to optimize the model and the system—not just call an existing API.
Glossary
Definitions below follow the usage of the papers and official documentation cited throughout this primer.
Representation
- Token
- The smallest unit of text the model actually processes.
- Tokenization
- The process of splitting raw text into tokens.
- Embedding
- The dense numeric vector that each token is mapped into.
- Vector space
- The high-dimensional space those vectors live in.
- Layer
- One block in the network that progressively refines the representation.
- Parameter
- A learnable numeric weight inside the model.
Architecture & runtime
- Transformer
- The modern sequence-modeling architecture built around attention.
- Self-attention
- The mechanism by which each position weights every other position in the context.
- Multi-head attention
- Multiple attention heads operating in parallel, each capturing different relations.
- Inference
- The model running in production—generating outputs from inputs.
- Decoding
- The procedure for picking each output token from the next-token distribution.
- Temperature / top-k / top-p
- Sampling parameters that control randomness and the candidate set.
- Context window
- The maximum number of tokens that can fit in one request.
- System prompt
- The system-level message that sets role, rules, and constraints.
Training & alignment
- Pretraining
- Language-modeling training over a huge corpus to establish broad capability.
- Fine-tuning
- Continued training on more specific data.
- SFT (Supervised Fine-Tuning)
- Fine-tuning on curated instruction–response demonstrations.
- RLHF (Reinforcement Learning from Human Feedback)
- Aligning the model using reinforcement learning on human preferences.
- DPO (Direct Preference Optimization)
- A simpler, more direct alternative to traditional RLHF for preference optimization.
- Alignment
- Shaping model behavior to match human goals and safety norms.
- Safety
- Design and training that reduce harmful or risky outputs.
- Evaluation
- Multi-dimensional measurement of model and system quality.
Systems & products
- Hallucination
- Output that sounds plausible but is factually wrong.
- RAG (Retrieval-Augmented Generation)
- Retrieve external knowledge first, then generate with it in context.
- Vector database
- A system that stores embeddings and runs efficient similarity search.
- Tool calling
- Letting the model emit structured calls that the application then executes.
- Agent
- A multi-step execution pattern: model + goal + tools + memory + planning.
- Multimodal model
- A model that can jointly process text, images, audio, or video.
- Latency
- The wall-clock delay from request to response.
- Cost
- The total cost of inference, storage, bandwidth, and operations.
- Deployment
- Shipping the model and its surrounding system into production.
References
Papers and official documentation cited throughout this primer. Where possible, links point to primary sources (arXiv preprints, official API documentation, or research blog posts).
Architecture & pretraining
- [1]Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- [2]Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020. arXiv:2005.14165
- [3]Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971
- [4]Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361
- [5]Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556
Representations, tokens & interpretability
- [6]Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (word2vec). arXiv:1301.3781
- [7]Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). arXiv:1508.07909
- [8]Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805
- [9]Tenney, I., Das, D., & Pavlick, E. (2019). BERT Rediscovers the Classical NLP Pipeline. arXiv:1905.05950
- [10]Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT’s Attention. arXiv:1906.04341
Decoding & sampling
- [11]Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration (nucleus sampling). ICLR 2020. arXiv:1904.09751
Alignment, SFT & safety
- [12]Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). arXiv:2203.02155
- [13]Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO). arXiv:2305.18290
- [14]Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073
Hallucination & uncertainty
- [15]Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation. arXiv:2202.03629
- [16]Kalai, A., Nachum, O., et al. (2025). Why Language Models Hallucinate (OpenAI research). openai.com/research/why-language-models-hallucinate
- [17]Carlini, N., et al. (2021). Extracting Training Data from Large Language Models. arXiv:2012.07805
Retrieval-augmented generation (RAG)
- [18]Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
- [19]Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering (DPR). arXiv:2004.04906
- [20]Johnson, J., Douze, M., & Jégou, H. (2017). Billion-scale similarity search with GPUs (Faiss). arXiv:1702.08734
- [21]Yan, S., et al. (2024). Corrective Retrieval Augmented Generation (CRAG). arXiv:2401.15884
- [22]Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172
Tool use & agents
- [23]Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761
- [24]Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
Reasoning & chain-of-thought
- [25]Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
- [26]Wang, X., et al. (2022). Self-Consistency Improves Chain-of-Thought Reasoning in Language Models. arXiv:2203.11171
- [27]Chen, W., et al. (2022). Program of Thoughts Prompting: Disentangling Computation from Reasoning. arXiv:2211.12588
- [28]Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601
Multimodality
- [29]Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021. arXiv:2010.11929
- [30]Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
- [31]Alayrac, J.-B., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. arXiv:2204.14198
- [32]Liu, H., et al. (2023). Visual Instruction Tuning (LLaVA). arXiv:2304.08485
Serving systems & long context
- [33]Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). arXiv:2309.06180
Evaluation
- [34]Liang, P., et al. (2022). Holistic Evaluation of Language Models (HELM). arXiv:2211.09110
Official developer documentation
- [35]OpenAI. Tokenization & Tokens · API guide. platform.openai.com/docs/guides/text-generation
- [36]OpenAI. Function Calling & Structured Outputs. platform.openai.com/docs/guides/function-calling · structured-outputs
- [37]OpenAI. Prompt Caching. platform.openai.com/docs/guides/prompt-caching
- [38]Anthropic. Context windows & prompt engineering. docs.claude.com/.../context-windows · prompt-engineering
- [39]Anthropic. Tool use with Claude. docs.claude.com/.../tool-use/overview
- [40]Anthropic. Prompt caching. docs.claude.com/.../prompt-caching
All links point to publicly accessible primary sources. Updated paper versions and broken links welcome—please flag any issues.
Disclaimer
This primer is the author's personal study notes and teaching write-up. Nothing here constitutes investment, legal, medical, or other professional advice. The cited papers, official documentation, and third-party materials are used for educational illustration; the author makes no representation, express or implied, as to their accuracy or completeness.
Views expressed are solely the author's and do not represent any employer or third-party organization. Product, model, and company names mentioned belong to their respective owners.
Any input you enter into the interactive demos runs locally in your browser. The author does not collect or store any user input.

