LLM Foundations
The vocabulary you'll reach for in every design discussion. Get these right and most "AI bugs" become obvious config problems.
Core mechanics
- Token — sub-word unit. ~0.75 words/token English. You pay per token, in and out.
- Context window — max tokens (prompt + output) the model can attend to. Frontier models now reach 1M; cost & latency rise with fill.
- Prompt vs completion — input is usually 2–5× cheaper than output. Output length dominates cost in chat/agent loops.
- Tokenizer — provider-specific (BPE-family). "Count tokens, not characters" before truncating.
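The per-token billing above can be turned into a rough budget check. This is a minimal sketch, assuming the ~0.75 words/token heuristic from the list; the prices are hypothetical placeholders, and a provider's own tokenizer or token-counting endpoint is the source of truth.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough English-only estimate; a real tokenizer will give different counts."""
    words = len(text.split())
    return max(1, round(words / words_per_token))


def estimate_cost(prompt_tokens: int, output_tokens: int,
                  in_per_mtok: float, out_per_mtok: float) -> float:
    """Cost in dollars, given prices per million tokens (hypothetical values)."""
    return (prompt_tokens * in_per_mtok + output_tokens * out_per_mtok) / 1_000_000


prompt = "Explain idempotency keys in three short sentences."
p_tokens = estimate_tokens(prompt)
# Hypothetical prices: $3 per 1M input tokens, $15 per 1M output tokens.
cost = estimate_cost(p_tokens, output_tokens=500, in_per_mtok=3.0, out_per_mtok=15.0)
print(p_tokens, f"${cost:.6f}")
```

With these placeholder prices the 500 output tokens account for almost all of the cost, which is the "output length dominates" point in practice.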
Generation parameters
- temperature (0–1+) — randomness. 0 ≈ deterministic; raise for creative tasks.
- top_p — nucleus sampling. Tune temp or top_p, not both.
- max_tokens — output cap. Always set it; runaway output = runaway bill.
- stop sequences, frequency/presence_penalty — fine control over repetition & cutoffs.
- seed — best-effort reproducibility (not guaranteed).
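To make temperature and top_p concrete, here is a toy sampler over a three-token vocabulary. It is a sketch of the general technique, not any provider's implementation; the logits and the `sample_token` name are made up for illustration.

```python
import math
import random
from typing import Optional


def sample_token(logits: dict, temperature: float = 1.0, top_p: float = 1.0,
                 rng: Optional[random.Random] = None) -> str:
    """Pick one token from raw logits using temperature and nucleus (top_p) sampling."""
    rng = rng or random.Random()
    if temperature <= 0:
        # Temperature 0 degenerates to greedy decoding: always the top logit.
        return max(logits, key=logits.get)

    # Softmax over temperature-scaled logits (subtract the max for stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = sorted(((tok, w / total) for tok, w in weights.items()),
                   key=lambda kv: kv[1], reverse=True)

    # Nucleus: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens, token_probs = zip(*nucleus)
    return rng.choices(tokens, weights=token_probs, k=1)[0]


logits = {"the": 2.0, "a": 1.0, "banana": -1.0}  # made-up scores
print(sample_token(logits, temperature=0))        # greedy pick
print(sample_token(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Lower temperature sharpens the distribution before the nucleus cut, so the two knobs interact; that is why the advice is to tune one and leave the other at its default.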
Model types you'll actually pick between
| Type | What it is | Reach for it when… |
|---|---|---|
| Instruct / Chat | Aligned to follow instructions in turns | The default for product features |
| Reasoning | Spends extra "thinking" tokens before answering (extended/adaptive thinking, o-series) | Multi-step logic, math, planning, hard agentic tasks |
| Base / Foundation | Raw next-token predictor, not aligned | Rarely in app code; fine-tuning experiments |
| Multimodal | Accepts image (and audio/video) input alongside text | OCR-lite, screenshots, document understanding, vision copilots |
| Embedding | Maps text → fixed vector; no text output | Search, RAG retrieval, clustering, dedup |
| Reranker | Scores (query, doc) relevance directly (cross-encoder) | Second-stage precision after vector recall |
| MoE | Mixture-of-experts — routes tokens to a subset of params | An architecture detail, not a knob; explains big-but-fast models |
API Providers
Hosted, pay-per-token. Snapshot of the mid-2026 landscape — treat exact model names & prices as volatile; the shape of each lineup is stable.
| Provider | Current flagship / workhorse | Distinctive |
|---|---|---|
| Anthropic (Claude) | Opus 4.8 · Sonnet 4.6 · Haiku 4.5 | Strong coding/agents, ~1M context, prompt caching, adaptive thinking, native tool use & MCP |
| OpenAI (GPT) | GPT-5.5 · 5.x mini/nano · o-series | Broad ecosystem, unified router + reasoning variants, image gen, Realtime/voice |
| Google (Gemini) | Gemini 3.x Pro · 3 Flash | Long context, tight GCP/Workspace integration, strong multimodal |
| xAI (Grok) | Grok 4.x | Real-time data leaning, X integration |
| DeepSeek | DeepSeek V/R series | Budget frontier, strong reasoning per dollar, open weights available |
| Inference hosts | Groq · Fireworks · Together · OpenRouter | Serve open models (Llama, Mistral, Qwen, DeepSeek) fast & cheap; OpenRouter = one key, many models |
One integration pattern, three SDKs
APIs have converged on an OpenAI-style messages shape. Most providers also expose an OpenAI-compatible endpoint, so you can often swap base_url + model string and keep your code.
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY
msg = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a terse backend assistant.",
    messages=[{"role": "user", "content": "Explain idempotency keys."}],
)
print(msg.content[0].text)
```
```python
from openai import OpenAI

# Same client points at OpenAI, Groq, Together, local servers…
client = OpenAI(base_url="https://api.openai.com/v1")
r = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Explain idempotency keys."}],
)
print(r.choices[0].message.content)
```
```python
from google import genai

client = genai.Client()  # GEMINI_API_KEY
r = client.models.generate_content(
    model="gemini-3-flash",
    contents="Explain idempotency keys.",
)
print(r.text)
```
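Since many providers accept the OpenAI-shaped call, swapping base_url + model string can live in configuration. The sketch below shows one way to do that; the registry, the placeholder model strings, and the environment-variable names are assumptions for illustration, and the actual network call is left as a comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    base_url: str
    model: str
    api_key_env: str  # name of the environment variable holding the key


# Hypothetical registry: model strings and env-var names are placeholders.
PROVIDERS = {
    "openai": Provider("https://api.openai.com/v1", "gpt-5.5", "OPENAI_API_KEY"),
    "groq": Provider("https://api.groq.com/openai/v1", "example-open-model", "GROQ_API_KEY"),
    "local": Provider("http://localhost:11434/v1", "example-local-model", "LOCAL_API_KEY"),
}


def request_args(provider_name: str, prompt: str) -> dict:
    """Build the keyword arguments for an OpenAI-shaped chat completion call."""
    provider = PROVIDERS[provider_name]
    return {
        "model": provider.model,
        "messages": [{"role": "user", "content": prompt}],
    }


args = request_args("local", "Explain idempotency keys.")
print(PROVIDERS["local"].base_url, args["model"])
# With the openai package installed, the call itself would look like:
#   p = PROVIDERS["local"]
#   client = OpenAI(base_url=p.base_url, api_key=os.environ[p.api_key_env])
#   client.chat.completions.create(**args)
```

Application code only ever names a registry key, so moving a feature to another provider is a config change rather than a code change.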
Wrap provider calls behind one thin interface (e.g. llm.complete()) so provider/model is config, not code. LiteLLM and OpenRouter do this for you across 100+ models with a single OpenAI-shaped call.
Local & Self-Hosted
When you need data residency, zero per-token cost, offline operation, or control over the exact weights. The tradeoff is you own the latency, the VRAM, and the ops.
| Tool | Best for | Notes |
|---|---|---|
| Ollama | Fastest path to a local model + API | One-line pulls, OpenAI-compatible endpoint on :11434, great for prototyping & edge boxes (Jetson, mini-PCs) |
| llama.cpp | Max control, GGUF quantization, CPU/edge | The engine under many tools; supports distributed RPC inference across GPUs; pick quant (Q4_K_M, Q8) to fit VRAM |
| vLLM | High-throughput serving | PagedAttention + continuous batching; the production choice for self-hosted multi-user inference |
| LM Studio | Desktop GUI + local server | Friendly model browser, also exposes an OpenAI-compatible server |
| TGI / TensorRT-LLM | Enterprise serving on NVIDIA | HF Text-Gen-Inference; TensorRT for squeezed latency on data-center GPUs |
Open weights to know
- Llama (Meta) — broad ecosystem default
- Qwen (Alibaba) — strong multilingual & coding, many sizes
- Mistral / Mixtral — efficient, MoE options
- DeepSeek — strong reasoning, open R-series
- Gemma / Phi — small, capable, edge-friendly
Sizing rule of thumb
VRAM ≈ params × bytes-per-param. FP16 ≈ 2 B/param, so a 7B model ≈ 14 GB; 4-bit quant cuts the weights ~4× (≈ 3.5 GB in theory, ≈ 4–5 GB in practice because formats like Q4_K_M also store scales and keep some tensors at higher precision) at a small quality cost. Add headroom for the KV cache (grows with context length × batch). Quantize first, distribute (multi-GPU RPC) when one card can't hold the weights.
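The rule of thumb can be written out as two small functions. This is a back-of-envelope sketch: the layer count, KV-head count, and head dimension below are assumed values for a generic 7B-class model, and real runtimes add their own overhead on top.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int = 1, bytes_per_value: int = 2) -> float:
    """KV cache: one key and one value vector per layer, per KV head, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_len * batch / 1e9


print(weights_gb(7, 16))   # FP16 weights for 7B params
print(weights_gb(7, 4))    # pure 4-bit weights, before format overhead
# Assumed architecture: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
print(round(kv_cache_gb(32, 32, 128, context_len=4096), 2))
```

The cache term scales linearly with both context length and batch size, so a model that fits comfortably for one short chat can still run out of memory under long-context or multi-user load.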
Embeddings & Vector DBs
The substrate of retrieval. An embedding turns text into a vector; a vector DB finds nearest neighbors fast.
Embedding essentials
- Dimension — fixed per model (e.g. 768/1024/1536). Index & query must use the same model.
- Similarity — cosine (default), dot product, or L2. Match the metric the model was trained for.
- Normalize vectors to unit length so dot product and cosine agree (cosine then reduces to a cheap dot product); many libs do this for you.
- Re-embed on model change — you cannot mix embeddings from different models in one index.
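The relationship between these metrics is easy to check by hand. The sketch below uses tiny 2-dimensional vectors as stand-ins for real embeddings and plain Python instead of a vector library.

```python
import math


def dot(a, b):
    if len(a) != len(b):
        # Different dimensions usually mean different embedding models.
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine(a, b))                      # cosine similarity of the raw vectors
print(dot(normalize(a), normalize(b)))   # same value via dot product on unit vectors
```

The dimension check is the small-scale version of the re-embed rule: vectors from a 768-dimension model cannot be compared with vectors from a 1024-dimension one, and even equal dimensions from different models are not comparable in meaning.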
Vector store options
- pgvector — Postgres extension. Best default if you already run PG; keeps vectors next to relational data.
- Qdrant / Weaviate / Milvus — purpose-built, rich filtering, scale.
- Chroma / LanceDB — lightweight, local-first, great for dev.
- Pinecone — fully managed, zero-ops.
- FAISS — a library, not a server; in-memory ANN building block.
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  doc_id    text,
  content   text,
  metadata  jsonb,
  embedding vector(1024)  -- must match your model
);

-- Approximate-NN index: HNSW (fast, accurate) for cosine
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```
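Because pgvector lives in Postgres, you can combine vector search with ordinary SQL filters (e.g. WHERE tenant_id = ?) in one query, which is essential for multi-tenant and permissioned RAG.
RAG & RAGFlow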
Retrieval-Augmented Generation = fetch relevant context, stuff it into the prompt, let the model answer grounded in your data. The hard part isn't the LLM call — it's the retrieval quality, and that starts with chunking.
The RAG pipeline
ingest → parse → chunk → embed → upsert(vector DB)
query → embed → ANN search → rerank → assemble context → LLM → cite
Chunking methods
Chunk too big → diluted relevance & wasted context. Too small → lost meaning. Overlap preserves continuity across boundaries.
| Method | How | Tradeoff |
|---|---|---|
| Fixed-size | N tokens/chars, fixed overlap | Dead simple; blind to structure, cuts mid-sentence |
| Recursive character | Split on ¶ → sentence → word until under size | The pragmatic default (LangChain's RecursiveCharacterTextSplitter) |
| Sentence / token-aware | Respect sentence & token boundaries | Cleaner units; needs a tokenizer/NLP pass |
| Structure-aware | Split by Markdown/HTML headings, code blocks, tables | Keeps semantic units intact; format-specific |
| Semantic | Embed sentences, cut where similarity drops | Topically coherent chunks; compute cost up front |
| Parent–child / hierarchical | Retrieve small, return enclosing parent for context | Best precision+context combo; more index plumbing |
| Contextual | Prepend an LLM-written summary of where the chunk sits | Big recall gains (Anthropic's "contextual retrieval"); LLM cost per chunk |
| Late chunking | Embed the long doc first, pool per-chunk after | Chunks keep document-level context; needs long-context embedder |
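The recursive character row can be sketched in a few lines. This is a simplified illustration of the idea, not LangChain's implementation: it measures size in characters, drops separators at chunk boundaries, and adds no overlap.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones on oversized pieces."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left (e.g. one very long word): fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_chars, finer))

    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


doc = ("Para one is short.\n\n"
       "Para two has several sentences. Each one is brief. "
       "Together they exceed the limit.")
for c in recursive_split(doc, max_chars=40):
    print(repr(c))
```

A production splitter would count tokens rather than characters and carry some overlap between neighbouring chunks; both are left out here to keep the recursion visible.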
RAGFlow specifically (infiniflow/ragflow · open source)
RAGFlow is a full RAG engine built around deep document understanding rather than naive text splitting. Its differentiator is DeepDoc: layout analysis, OCR, and table-structure recognition that parse PDFs, scans, and Office files into structured blocks before chunking — so tables, figures, and headings survive ingestion.
Template-based chunking
Instead of one splitter, RAGFlow ships document-type templates you assign per file/knowledge base. Each applies layout-aware rules tuned to that shape:
- General — default mixed layout
- Q&A — paired question/answer rows
- Manual / Book / Paper — heading & section aware
- Table — preserves rows/columns
- Laws / Resume / Presentation / Email / Picture / One — domain-specific
What you get out of the box
- OCR + layout + table recognition (DeepDoc)
- Chunk visualization & manual editing — you can see and fix chunks
- Built-in embedding + reranking + citations with source traceback
- Knowledge-graph / GraphRAG extraction option
- REST API + Python SDK; self-hostable via Docker Compose
Existing libraries (build-your-own RAG)
| Library | Role |
|---|---|
| LangChain | Splitters, loaders, retrievers, chains — broad glue layer |
| LlamaIndex | Retrieval-first: node parsers, SemanticSplitterNodeParser, query engines |
| Haystack | Production pipelines, strong eval & component model |
| Unstructured | Document partitioning — turn PDFs/HTML/docx into clean elements |
| Chonkie / semantic-text-splitter | Focused, fast chunking libraries when you don't want a framework |
| Rerankers | Cohere Rerank, BGE-reranker, Jina — cross-encoder second stage |
Custom implementation — minimal, honest RAG
~40 lines, no framework. This is the whole loop: chunk → embed → store → retrieve → ground.
```python
import psycopg  # `conn` below is a psycopg connection to the DB holding `chunks`
from openai import OpenAI

client = OpenAI()
EMB = "text-embedding-3-large"
CHAT = "gpt-5.5"
EMB_DIM = 1024  # must match the vector(1024) column; this model defaults to 3072


# 1. CHUNK: sliding window over whitespace-split words, with overlap.
#    Words stand in for tokens here; use a tokenizer for real budgets.
def chunk(text, size=800, overlap=120):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


# 2. EMBED (batch the calls in real code)
def embed(texts):
    r = client.embeddings.create(model=EMB, input=texts, dimensions=EMB_DIM)
    return [d.embedding for d in r.data]


# 3. INGEST → pgvector
def ingest(conn, doc_id, text):
    parts = chunk(text)
    for c, v in zip(parts, embed(parts)):
        conn.execute(
            "INSERT INTO chunks(doc_id, content, embedding) VALUES (%s, %s, %s::vector)",
            (doc_id, c, v),
        )
    conn.commit()


# 4. RETRIEVE: ANN top-k by cosine distance (<=> is pgvector's cosine operator).
#    The explicit ::vector cast turns the Python list parameter into a vector.
def retrieve(conn, query, k=5):
    qv = embed([query])[0]
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (qv, k),
    ).fetchall()
    return [r[0] for r in rows]


# 5. GROUND: stuff context, instruct the model to answer from it or abstain
def answer(conn, question):
    ctx = "\n\n---\n\n".join(retrieve(conn, question))
    system_prompt = ("Answer ONLY from the context. "
                     "If it's not there, say you don't know.")
    r = client.chat.completions.create(model=CHAT, messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{ctx}\n\nQ: {question}"},
    ])
    return r.choices[0].message.content
```
Tool Calling & Structured Output
How LLMs stop being chatbots and start acting on systems. Two related primitives behind the features the JDs call "AI-enabled" and "explainable".
Tool / function calling
You describe functions (name + JSON-schema params). The model decides when to call one and returns a structured call; you execute it and feed the result back. The model never runs code itself — it requests, you fulfill.
- Loop: model → tool_use → you run it → tool_result → model continues
- Schema quality = call quality. Describe params like API docs.
- Always validate args before executing — treat them as untrusted input.
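The "validate, then execute" step can be sketched without any SDK. The validator below covers only a tiny subset of JSON Schema (required fields and primitive types), and `get_order_status` is a hypothetical stand-in for a real service call; in production a full validator such as jsonschema or Pydantic would take its place.

```python
def get_order_status(order_id: str) -> str:
    """Hypothetical tool implementation; a real one would query your order service."""
    return "shipped" if order_id == "A-91" else "unknown"


TOOLS = {
    "get_order_status": {
        "fn": get_order_status,
        "schema": {"type": "object",
                   "properties": {"order_id": {"type": "string"}},
                   "required": ["order_id"]},
    },
}

JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(schema: dict, args) -> list:
    """Return a list of problems; an empty list means the args pass this minimal check."""
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    problems = [f"missing required field: {name}"
                for name in schema.get("required", []) if name not in args]
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"{name} should be {spec['type']}")
    return problems


def run_tool_call(name: str, args) -> dict:
    """Treat the model's request as untrusted input: check it before running anything."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    problems = validate_args(tool["schema"], args)
    if problems:
        return {"ok": False, "error": "; ".join(problems)}
    return {"ok": True, "result": tool["fn"](**args)}


print(run_tool_call("get_order_status", {"order_id": "A-91"}))
print(run_tool_call("get_order_status", {"order_id": 91}))
```

Returning the error as data rather than raising lets you send it back as the tool_result, which gives the model a chance to correct its own call.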
Structured outputs
Force responses to conform to a JSON Schema so downstream code can parse reliably — no regex-scraping prose. Now native on the major APIs (and via Pydantic helpers).
- Use for: extraction, classification, form-filling, mapping output → UI.
- Define the schema once; share it between the LLM call and your validator.
- Explainability hook: add a reasons/citations field so the model justifies each value.
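One way to share a schema between the LLM call and your validator is to keep it as a single constant and check every response against it before use. The sketch below uses only the standard library; the ticket-triage schema and field names are invented for illustration, and Pydantic or a JSON Schema validator would normally replace the hand-written checks.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: pass it to the provider's structured-output option
# and reuse it here to validate what comes back.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
        "urgent": {"type": "boolean"},
        "reasons": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["category", "urgent", "reasons"],
    "additionalProperties": False,
}


@dataclass
class Ticket:
    category: str
    urgent: bool
    reasons: list


def parse_ticket(raw: str) -> Ticket:
    """Parse model output and raise ValueError unless it matches TICKET_SCHEMA."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    props = TICKET_SCHEMA["properties"]
    missing = [k for k in TICKET_SCHEMA["required"] if k not in data]
    extra = [k for k in data if k not in props]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    if data["category"] not in props["category"]["enum"]:
        raise ValueError(f"bad category: {data['category']!r}")
    if not isinstance(data["urgent"], bool):
        raise ValueError("urgent must be a boolean")
    reasons = data["reasons"]
    if not isinstance(reasons, list) or not all(isinstance(r, str) for r in reasons):
        raise ValueError("reasons must be a list of strings")
    return Ticket(**data)


raw = '{"category": "billing", "urgent": true, "reasons": ["mentions a double charge"]}'
print(parse_ticket(raw))
```

The `reasons` list is the explainability hook: the UI can show it next to the classification, and an eval can check that each reason actually appears in the source text.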
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY

tools = [{
    "name": "get_order_status",
    "description": "Look up an order's status by ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

msg = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where's order A-91?"}],
)

if msg.stop_reason == "tool_use":
    call = next(b for b in msg.content if b.type == "tool_use")
    # YOU execute: get_order_status is your own function; validate call.input first
    result = get_order_status(**call.input)
    # …append tool_result and call again to get the final answer
```
Agents & MCP
An agent is a loop: the model is given tools and a goal, then plans → acts → observes → repeats until done. "Agentic workflows" + "AI copilots" in the JDs live here.
Patterns (simplest first)
- Single tool-use loop — model + tools, run until no more tool calls. Covers most real "agents".
- Workflow / graph — you wire fixed steps (router → retrieve → draft → check). Predictable, debuggable. Prefer this.
- Autonomous multi-agent — agents spawning agents. Powerful, but hard to control and to budget. Reach for last.
- Reflection — model critiques & revises its own output before returning.
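The single tool-use loop is small enough to show whole. In this sketch the model is replaced by a scripted stub so the control flow runs offline; the message shapes and the `scripted_model` function are assumptions for illustration, not any provider's wire format.

```python
def scripted_model(messages):
    """Stand-in for an LLM call: asks for one tool, then answers from its result."""
    observations = [m["content"] for m in messages if m["role"] == "tool"]
    if not observations:
        return {"type": "tool_call", "tool": "get_order_status",
                "args": {"order_id": "A-91"}}
    return {"type": "final", "text": f"Order A-91 is {observations[-1]}."}


def run_agent(model, tools: dict, goal: str, max_steps: int = 5) -> str:
    """Plan -> act -> observe, repeated until the model stops asking for tools."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = model(messages)
        if reply["type"] == "final":
            return reply["text"]
        # Act: run the requested tool. Observe: feed the result back to the model.
        observation = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "tool", "content": str(observation)})
    return "stopped: step budget exhausted"


tools = {"get_order_status": lambda order_id: "shipped"}  # hypothetical tool
print(run_agent(scripted_model, tools, "Where's order A-91?"))
```

The `max_steps` cap is the important design choice: without it, a model that keeps requesting tools turns into an unbounded bill.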
MCP — Model Context Protocol
An open standard for exposing tools, data, and prompts to any LLM client over a uniform interface. Write an MCP server once (your app's capabilities as tools) and any MCP-aware client/agent can use it — decoupling tools from any single model or framework.
- Transports: stdio (local) or Streamable HTTP (remote), which supersedes the older HTTP+SSE transport.
- Server exposes tools, resources, prompts.
- The clean way to make app logic "AI-accessible" without bespoke glue per model.
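On the wire, MCP messages are JSON-RPC 2.0. The toy dispatcher below handles only `tools/list` and `tools/call` so the message shapes are visible; it skips the initialize handshake, transports, and error details, and the `TOOLS` registry is hypothetical. A real server would be built on an MCP SDK rather than on this.

```python
import json

# Hypothetical registry of one tool exposed by the server.
TOOLS = {
    "get_order_status": {
        "description": "Look up an order's status by ID.",
        "input_schema": {"type": "object",
                         "properties": {"order_id": {"type": "string"}},
                         "required": ["order_id"]},
        "fn": lambda order_id: "shipped",
    },
}


def tools_call_request(request_id: int, name: str, arguments: dict) -> str:
    """What a client sends to invoke a tool."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/call",
                       "params": {"name": name, "arguments": arguments}})


def handle(raw: str, tools: dict) -> str:
    """Toy server-side dispatcher for two MCP methods."""
    req = json.loads(raw)
    method, params = req["method"], req.get("params", {})
    if method == "tools/list":
        result = {"tools": [{"name": name, "description": tool["description"],
                             "inputSchema": tool["input_schema"]}
                            for name, tool in tools.items()]}
    elif method == "tools/call":
        output = tools[params["name"]]["fn"](**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": str(output)}], "isError": False}
    else:
        return json.dumps({"jsonrpc": "2.0", "id": req.get("id"),
                           "error": {"code": -32601, "message": "Method not found"}})
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


request = tools_call_request(1, "get_order_status", {"order_id": "A-91"})
print(handle(request, TOOLS))
```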
Frameworks
LangGraph (stateful graphs, the durable choice) · LangChain agents · LlamaIndex agents · provider-native Agents SDKs · CrewAI / AutoGen (multi-agent). Start with a hand-written tool loop or LangGraph; reach for multi-agent frameworks only when a single loop genuinely can't express the task.
Production Concerns
The difference between a demo and a feature. This is where "understand AI limitations" and "reliability, performance, UX" from the JDs cash out.
Cost & latency
- Prompt caching — cache the stable system/context prefix; cached input tokens are billed at a steep discount (up to ~90% savings on that portion).
- Model routing — cheap model for easy calls, escalate hard ones.
- Batch API — async, ~50% off for non-urgent jobs.
- Stream tokens (SSE) so perceived latency ≈ time-to-first-token.
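Model routing can start as a simple "try cheap, check, escalate" function. The sketch below uses stub callables in place of real model calls, and the acceptance check is a deliberately naive assumption; in practice it might be schema validation, a confidence score, or a classifier.

```python
from typing import Callable, Tuple


def route(prompt: str,
          cheap: Callable[[str], str],
          strong: Callable[[str], str],
          accept: Callable[[str], bool]) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its answer fails the check."""
    draft = cheap(prompt)
    if accept(draft):
        return "cheap", draft
    return "strong", strong(prompt)


# Stubs standing in for two model tiers (hypothetical behaviour).
def cheap_model(prompt: str) -> str:
    return "4" if prompt == "2 + 2?" else "I'm not sure."


def strong_model(prompt: str) -> str:
    return "A detailed answer."


def looks_confident(answer: str) -> bool:
    return "not sure" not in answer.lower()


print(route("2 + 2?", cheap_model, strong_model, looks_confident))
print(route("Plan a zero-downtime migration.", cheap_model, strong_model, looks_confident))
```

Escalation costs one extra cheap call on hard prompts, so this pays off only when most traffic is easy; logging which tier answered makes that ratio measurable.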
Reliability
- Retries w/ exponential backoff on 429/5xx.
- Timeouts + fallback model on outage.
- Idempotency keys for tool actions.
- Structured output + schema validation, not string parsing.
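Retries with exponential backoff fit in one helper. This is a generic sketch: `HTTPStatusError` is a made-up exception standing in for whatever your HTTP client or SDK raises, and the delays and jitter range are arbitrary starting points.

```python
import random
import time


class HTTPStatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_retries(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                 sleep=time.sleep, rng=random.random):
    """Retry on 429 and 5xx with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except HTTPStatusError as err:
            retryable = err.status == 429 or 500 <= err.status < 600
            if not retryable or attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait between 50% and 100% of the delay to avoid thundering herds.
            sleep(delay * (0.5 + rng() / 2))


# Demo: fail twice with 429, then succeed. Sleeps are recorded instead of waited.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise HTTPStatusError(429)
    return "ok"

waits = []
print(with_retries(flaky, sleep=waits.append), waits)
```

Retrying is only safe when the call is idempotent, which is why idempotency keys for tool actions sit in the same list.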
Safety & trust
- Prompt injection — treat retrieved/user text as untrusted; never let it grant tool powers.
- PII — redact before sending; mind data residency.
- Grounding + citations to curb hallucination.
- Eval in CI: golden sets, faithfulness, regression checks.
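Redacting before sending can begin with pattern-based scrubbing. The sketch below catches only two obvious patterns (email addresses and US-style SSNs) with simple regexes chosen for illustration; it is nowhere near complete PII detection, which usually needs a dedicated library or service.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace obvious identifiers with placeholders before the text leaves your system."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


print(redact("Mail jo@example.com, SSN 123-45-6789."))
```

Placeholders keep the sentence structure intact, so the model can still reason about the message without seeing the raw values.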
Mapped to the Job
Where each posting's AI asks land in this manual. The DB / web / deploy skills are the table stakes around the AI work — listed briefly; the AI column is where you differentiate.