LLM Integration — Field Manual

LLM Foundations

The vocabulary you'll reach for in every design discussion. Get these right and most "AI bugs" become obvious config problems.

Core mechanics

Token — sub-word unit. ~0.75 words/token English. You pay per token, in and out.
Context window — max tokens (prompt + output) the model can attend to. Frontier models now reach ~1M tokens; cost & latency rise with fill.
Prompt vs completion — input is usually 2–5× cheaper than output. Output length dominates cost in chat/agent loops.
Tokenizer — provider-specific (BPE-family). "Count tokens, not characters" before truncating.
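
A back-of-envelope cost estimate follows directly from these definitions. The sketch below uses the ~0.75 words/token heuristic from the list above; the per-million-token prices are hypothetical placeholders rather than any provider's real rates, and a real tokenizer should replace the heuristic before you truncate or bill.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English average, taken from the list above

def estimate_tokens(text: str) -> int:
    """Heuristic token count: words / 0.75, rounded up."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)

def estimate_cost(prompt_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in dollars, given per-million-token prices (hypothetical inputs)."""
    return (prompt_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

prompt = "Explain idempotency keys in three short sentences."
p_tokens = estimate_tokens(prompt)  # 7 words -> 10 estimated tokens
# Hypothetical prices: $3 per 1M input tokens, $15 per 1M output tokens.
cost = estimate_cost(p_tokens, 500, in_price_per_m=3.0, out_price_per_m=15.0)
```

With these made-up prices the 500 output tokens account for almost all of the cost, which is the point of the "output length dominates" line.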

Generation parameters

temperature 0–1+ — randomness (the upper bound varies by provider; some APIs accept values up to 2). 0 ≈ deterministic; raise for creative tasks.
top_p — nucleus sampling. Tune temp or top_p, not both.
max_tokens — output cap. Always set it; runaway output = runaway bill.
stop sequences, frequency/presence_penalty — fine control over repetition & cutoffs.
seed — best-effort reproducibility (not guaranteed).
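
To make temperature and top_p concrete, here is a minimal pure-Python sketch of how a sampler might apply them to a list of logits. It is an illustration of the general technique, not any provider's actual implementation; the function names are made up for this example.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized over the kept set

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index after applying temperature, then nucleus filtering."""
    probs = softmax_with_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]
```

Both knobs narrow the same distribution, which is why tuning one at a time is easier to reason about.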

Model types you'll actually pick between

Type	What it is	Reach for it when…
Instruct / Chat	Aligned to follow instructions in turns	The default for product features
Reasoning	Spends extra "thinking" tokens before answering (extended/adaptive thinking, o-series)	Multi-step logic, math, planning, hard agentic tasks
Base / Foundation	Raw next-token predictor, not aligned	Rarely in app code; fine-tuning experiments
Multimodal	Accepts image (and audio/video) input alongside text	OCR-lite, screenshots, document understanding, vision copilots
Embedding	Maps text → fixed vector; no text output	Search, RAG retrieval, clustering, dedup
Reranker	Scores (query, doc) relevance directly (cross-encoder)	Second-stage precision after vector recall
MoE	Mixture-of-experts — routes tokens to a subset of params	An architecture detail, not a knob; explains big-but-fast models

Tier mental model: nearly every provider ships a flagship (max capability), a balanced workhorse (your default), and a mini/nano (cheap, high-volume routing & classification). Build to the balanced tier, route trivial calls down, escalate hard calls up.
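
A routing rule for this tier model can start as a few lines of code. The sketch below is a toy heuristic: the tier labels, the task names, and the 20,000-token threshold are all assumptions chosen for illustration, and real routing is usually tuned against evals.

```python
TIERS = ["mini", "balanced", "flagship"]  # hypothetical tier labels, cheapest first

def pick_tier(task: str, prompt_tokens: int) -> str:
    """Toy routing heuristic; task labels and the 20k threshold are assumptions."""
    if task in {"classify", "route", "extract_field"}:
        return "mini"
    if task in {"plan", "multi_step_reasoning"} or prompt_tokens > 20_000:
        return "flagship"
    return "balanced"

def escalate(tier: str) -> str:
    """Move one tier up, staying put if already at the top."""
    i = TIERS.index(tier)
    return TIERS[min(i + 1, len(TIERS) - 1)]
```

A caller would typically try `pick_tier(...)` first and call `escalate(...)` only when a validation or quality check on the cheaper answer fails.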

API Providers

Hosted, pay-per-token. Snapshot of the mid-2026 landscape — treat exact model names & prices as volatile; the shape of each lineup is stable.

Provider	Current flagship / workhorse	Distinctive
Anthropic (Claude)	Opus 4.8 · Sonnet 4.6 · Haiku 4.5	Strong coding/agents, ~1M context, prompt caching, adaptive thinking, native tool use & MCP
OpenAI (GPT)	GPT-5.5 · 5.x mini/nano · o-series	Broad ecosystem, unified router + reasoning variants, image gen, Realtime/voice
Google (Gemini)	Gemini 3.x Pro · 3 Flash	Long context, tight GCP/Workspace integration, strong multimodal
xAI (Grok)	Grok 4.x	Real-time data leaning, X integration
DeepSeek	DeepSeek V/R series	Budget frontier, strong reasoning per dollar, open weights available
Inference hosts	Groq · Fireworks · Together · OpenRouter	Serve open models (Llama, Mistral, Qwen, DeepSeek) fast & cheap; OpenRouter = one key, many models

One integration pattern, three SDKs

APIs have converged on an OpenAI-style messages shape. Most providers also expose an OpenAI-compatible endpoint, so you can often swap base_url + model string and keep your code.

claude.py

from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY

msg = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a terse backend assistant.",
    messages=[{"role":"user","content":"Explain idempotency keys."}],
)
print(msg.content[0].text)

openai_compatible.py

from openai import OpenAI
# Same client points at OpenAI, Groq, Together, local servers…
client = OpenAI(base_url="https://api.openai.com/v1")

r = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role":"user","content":"Explain idempotency keys."}],
)
print(r.choices[0].message.content)

gemini.py

from google import genai
client = genai.Client()  # GEMINI_API_KEY

r = client.models.generate_content(
    model="gemini-3-flash",
    contents="Explain idempotency keys.",
)
print(r.text)

Abstraction tip: wrap calls behind one internal interface (a thin llm.complete()) so provider/model is config, not code. LiteLLM and OpenRouter do this for you across 100+ models with a single OpenAI-shaped call.
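
One possible shape for that thin interface is sketched below. The names (`LLMConfig`, `register`, `complete`, `echo_backend`) are invented for this example; the echo backend is a stand-in so the sketch runs offline, where a real backend would wrap one of the SDK calls shown above.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A backend takes (model, system, prompt, max_tokens) and returns text.
Backend = Callable[[str, str, str, int], str]

@dataclass(frozen=True)
class LLMConfig:
    provider: str
    model: str
    max_tokens: int = 1024

_BACKENDS: Dict[str, Backend] = {}

def register(provider: str, backend: Backend) -> None:
    """Make a provider available under a config-friendly name."""
    _BACKENDS[provider] = backend

def complete(cfg: LLMConfig, prompt: str, system: str = "") -> str:
    """Single entry point for app code; provider and model come from config."""
    try:
        backend = _BACKENDS[cfg.provider]
    except KeyError:
        raise ValueError(f"no backend registered for {cfg.provider!r}") from None
    return backend(cfg.model, system, prompt, cfg.max_tokens)

# Stand-in backend so the sketch runs offline; a real one would wrap an SDK call.
def echo_backend(model: str, system: str, prompt: str, max_tokens: int) -> str:
    return f"[{model}] {prompt}"

register("echo", echo_backend)
reply = complete(LLMConfig(provider="echo", model="demo-model"), "ping")
```

Swapping providers then means changing the `LLMConfig` values, not the call sites.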

Local & Self-Hosted

When you need data residency, zero per-token cost, offline operation, or control over the exact weights. The tradeoff is you own the latency, the VRAM, and the ops.

Tool	Best for	Notes
Ollama	Fastest path to a local model + API	One-line pulls, OpenAI-compatible endpoint on :11434, great for prototyping & edge boxes (Jetson, mini-PCs)
llama.cpp	Max control, GGUF quantization, CPU/edge	The engine under many tools; supports distributed RPC inference across machines; pick a quant (Q4_K_M, Q8_0) to fit VRAM
vLLM	High-throughput serving	PagedAttention + continuous batching; the production choice for self-hosted multi-user inference
LM Studio	Desktop GUI + local server	Friendly model browser, also exposes an OpenAI-compatible server
TGI / TensorRT-LLM	Enterprise serving on NVIDIA	TGI = Hugging Face Text Generation Inference; TensorRT-LLM for squeezed latency on data-center GPUs

Open weights to know

Llama (Meta) — broad ecosystem default
Qwen (Alibaba) — strong multilingual & coding, many sizes
Mistral / Mixtral — efficient, MoE options
DeepSeek — strong reasoning, open R-series
Gemma / Phi — small, capable, edge-friendly

Sizing rule of thumb

VRAM ≈ params × bytes-per-param. FP16 ≈ 2 B/param, so a 7B model ≈ 14 GB. 4-bit quantization cuts the weights ~4× (≈ 3.5 GB in theory; ≈ 4–5 GB in practice once quantization metadata and higher-precision layers are counted) at a small quality cost. Add headroom for the KV cache (grows with context length × batch). Quantize first; distribute (multi-GPU or RPC) only when one card can't hold the weights.
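
The rule of thumb is easy to turn into a calculator. The sketch below covers weights and KV cache only and ignores activations and framework overhead; the model shape in the usage note (32 layers, 32 KV heads, head dimension 128) is an illustrative assumption, not a spec for any particular model.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Weights only: params x bytes-per-param, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int = 1,
                bytes_per_value: float = 2.0) -> float:
    """KV cache: one K and one V tensor per layer, per token, per sequence."""
    values = 2 * layers * kv_heads * head_dim * context_len * batch
    return values * bytes_per_value / 1e9

fp16_weights = weight_gb(7, "fp16")   # 14.0 GB
int4_weights = weight_gb(7, "int4")   # 3.5 GB before quantization overhead
cache = kv_cache_gb(layers=32, kv_heads=32, head_dim=128, context_len=4096)
```

For the assumed shape, a single 4,096-token sequence adds roughly 2 GB of FP16 KV cache, and that figure scales linearly with both context length and batch size.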

Pick hosted vs local deliberately: hosted wins on capability-per-effort and elastic scale; local wins on privacy, recurring cost at volume, and latency-floor control. Most teams ship hosted first, then move hot/sensitive paths local.

Embeddings & Vector DBs

The substrate of retrieval. An embedding turns text into a vector; a vector DB finds nearest neighbors fast.

Embedding essentials

Dimension — fixed per model (e.g. 768/1024/1536). Index & query must use the same model.
Similarity — cosine (default), dot product, or L2. Match the metric the model was trained for.
Normalize vectors when you use dot product as a stand-in for cosine (cosine itself ignores vector length); many libs do this for you.
Re-embed on model change — you cannot mix embeddings from different models in one index.
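
A small pure-Python sketch shows why normalization and matching dimensions matter. The vectors are toy values chosen for easy arithmetic; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product; refuses vectors of different dimension."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch: were both vectors made by the same model?")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [6.0, 8.0]   # same direction, different length
raw_dot = dot(a, b)                            # 50.0, inflated by vector length
unit_dot = dot(normalize(a), normalize(b))     # about 1.0, same as cosine(a, b)
```

Once vectors are unit length, dot product and cosine give the same ranking, which is why many stores let you index normalized vectors with the cheaper dot-product metric.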

Vector store options

pgvector — Postgres extension. Best default if you already run PG; keeps vectors next to relational data.
Qdrant / Weaviate / Milvus — purpose-built, rich filtering, scale.
Chroma / LanceDB — lightweight, local-first, great for dev.
Pinecone — fully managed, zero-ops.
FAISS — a library, not a server; in-memory ANN building block.

schema.sql — pgvector

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  doc_id    text,
  content   text,
  metadata  jsonb,
  embedding vector(1024)        -- must match your model
);
-- Approximate-NN index: HNSW (fast, accurate) for cosine
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

Filter + search: the value of pgvector/Qdrant is combining ANN search with metadata filters (WHERE tenant_id = ?) in one query — essential for multi-tenant and permissioned RAG.
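
A tenant-scoped query against the `chunks` table above might be built like this. It is a sketch under two assumptions: the tenant ID lives under a `tenant_id` key in the `metadata` JSONB column (the schema above has no dedicated column), and the vector is passed in pgvector's bracketed text form and cast with `::vector`. The function name is made up for this example.

```python
def build_filtered_search(query_vec, tenant_id, k=5):
    """Return (sql, params) for a tenant-scoped nearest-neighbour query.

    Assumes the tenant is stored under metadata->>'tenant_id'; adapt this
    to a dedicated, indexed column if you have one.
    """
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]'.
    vec_literal = "[" + ",".join(repr(float(x)) for x in query_vec) + "]"
    sql = (
        "SELECT id, content, embedding <=> %s::vector AS distance "
        "FROM chunks "
        "WHERE metadata->>'tenant_id' = %s "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (vec_literal, tenant_id, k)

sql, params = build_filtered_search([0.1, 0.2], "acme", k=3)
```

The returned pair can be handed to a driver call such as `conn.execute(sql, params)`. Keeping the tenant filter as a bound parameter, never string-formatted into the SQL, matters as much here as in any other query.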

RAG & RAGFlow

Retrieval-Augmented Generation = fetch relevant context, stuff it into the prompt, let the model answer grounded in your data. The hard part isn't the LLM call — it's the retrieval quality, and that starts with chunking.

The RAG pipeline

ingest → parse → chunk → embed → upsert(vector DB)
query → embed → ANN search → rerank → assemble context → LLM → cite

Chunking methods

Chunk too big → diluted relevance & wasted context. Too small → lost meaning. Overlap preserves continuity across boundaries.

Method	How	Tradeoff
Fixed-size	N tokens/chars, fixed overlap	Dead simple; blind to structure, cuts mid-sentence
Recursive character	Split on ¶ → sentence → word until under size	The pragmatic default (LangChain's RecursiveCharacterTextSplitter)
Sentence / token-aware	Respect sentence & token boundaries	Cleaner units; needs a tokenizer/NLP pass
Structure-aware	Split by Markdown/HTML headings, code blocks, tables	Keeps semantic units intact; format-specific
Semantic	Embed sentences, cut where similarity drops	Topically coherent chunks; compute cost up front
Parent–child / hierarchical	Retrieve small, return enclosing parent for context	Best precision+context combo; more index plumbing
Contextual	Prepend an LLM-written summary of where the chunk sits	Big recall gains (Anthropic's "contextual retrieval"); LLM cost per chunk
Late chunking	Embed the long doc first, pool per-chunk after	Chunks keep document-level context; needs long-context embedder
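
As one concrete instance from the table, the recursive approach can be sketched in plain Python. This is a simplified illustration sized in characters with no overlap, not LangChain's implementation; the separator order and the `max_chars` default are assumptions.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into pieces still too long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate          # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # A single piece is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

doc = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 100
chunks = recursive_split(doc, max_chars=60)
```

The two short paragraphs are packed into one chunk because together they fit the budget, while the long run of words falls through to the space separator. Adding overlap or token-based sizing would be the next refinement.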

RAGFlow specifically (infiniflow/ragflow · open source)

RAGFlow is a full RAG engine built around deep document understanding rather than naive text splitting. Its differentiator is DeepDoc: layout analysis, OCR, and table-structure recognition that parse PDFs, scans, and Office files into structured blocks before chunking — so tables, figures, and headings survive ingestion.

Template-based chunking

Instead of one splitter, RAGFlow ships document-type templates you assign per file/knowledge base. Each applies layout-aware rules tuned to that shape:

General — default mixed layout
Q&A — paired question/answer rows
Manual / Book / Paper — heading & section aware
Table — preserves rows/columns
Laws / Resume / Presentation / Email / Picture / One — domain-specific

What you get out of the box

OCR + layout + table recognition (DeepDoc)
Chunk visualization & manual editing — you can see and fix chunks
Built-in embedding + reranking + citations with source traceback
Knowledge-graph / GraphRAG extraction option
REST API + Python SDK; self-hostable via Docker Compose

When to choose RAGFlow: messy, real-world documents (scanned PDFs, financial tables, contracts) where parsing quality is the bottleneck and you want a batteries-included engine + UI rather than wiring a framework yourself.

Existing libraries (build-your-own RAG)

Library	Role
LangChain	Splitters, loaders, retrievers, chains — broad glue layer
LlamaIndex	Retrieval-first: node parsers, SemanticSplitterNodeParser, query engines
Haystack	Production pipelines, strong eval & component model
Unstructured	Document partitioning — turn PDFs/HTML/docx into clean elements
Chonkie / semantic-text-splitter	Focused, fast chunking libraries when you don't want a framework
Rerankers	Cohere Rerank, BGE-reranker, Jina — cross-encoder second stage

Custom implementation — minimal, honest RAG

~40 lines, no framework. This is the whole loop: chunk → embed → store → retrieve → ground.

rag.py

import psycopg  # psycopg 3; pass a connection from psycopg.connect(...) as conn
from openai import OpenAI
client = OpenAI()
EMB = "text-embedding-3-large"; CHAT = "gpt-5.5"

# 1. CHUNK — fixed-size word windows with overlap (words stand in for tokens)
def chunk(text, size=800, overlap=120):
    words = text.split()
    step = size - overlap   # must stay positive, so keep overlap < size
    return [" ".join(words[i:i+size]) for i in range(0, len(words), step)]

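# 2. EMBED (batch the calls in real code)
def embed(texts):
    # text-embedding-3-large defaults to 3072 dims; request 1024 to match vector(1024)
    r = client.embeddings.create(model=EMB, input=texts, dimensions=1024)
    return [d.embedding for d in r.data]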

# 3. INGEST → pgvector
def ingest(conn, doc_id, text):
    parts = chunk(text)
    for c, v in zip(parts, embed(parts)):
        conn.execute("INSERT INTO chunks(doc_id,content,embedding) VALUES (%s,%s,%s)",
                     (doc_id, c, v))

# 4. RETRIEVE — ANN top-k by cosine distance
def retrieve(conn, query, k=5):
    qv = embed([query])[0]
    # psycopg sends a Python list as a float array, so cast it to vector
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (qv, k)).fetchall()
    return [r[0] for r in rows]

# 5. GROUND — stuff context, instruct the model to cite & abstain
def answer(conn, question):
    ctx = "\n\n---\n\n".join(retrieve(conn, question))
    sys = ("Answer ONLY from the context. "
           "If it's not there, say you don't know.")
    r = client.chat.completions.create(model=CHAT, messages=[
        {"role":"system","content":sys},
        {"role":"user","content":f"Context:\n{ctx}\n\nQ: {question}"}])
    return r.choices[0].message.content

To make it production-grade, add (in order of payoff): a reranker after step 4 · hybrid search (BM25/keyword ∪ vector) for exact terms & acronyms · metadata filters for tenancy/permissions · citations back to chunk IDs · and eval (retrieval recall@k + answer faithfulness) before you trust it.
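
For the hybrid-search step, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). RRF is a technique swapped in here for illustration, since the text above does not prescribe a fusion method; the chunk IDs are hypothetical and `k=60` is a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["c7", "c2", "c9"]   # hypothetical BM25 results, best first
vector_hits = ["c2", "c4", "c7"]    # hypothetical ANN results, best first
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Chunks that appear in both lists rise to the top, and because RRF works on ranks alone it needs no score normalization between the two retrievers.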

Tool Calling & Structured Output

How LLMs stop being chatbots and start acting on systems. These two related primitives underpin what the JDs call "AI-enabled" and "explainable" features.

Tool / function calling

You describe functions (name + JSON-schema params). The model decides when to call one and returns a structured call; you execute it and feed the result back. The model never runs code itself — it requests, you fulfill.

Loop: model → tool_use → you run it → tool_result → model continues
Schema quality = call quality. Describe params like API docs.
Always validate args before executing — treat them as untrusted input.
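
Validating arguments before execution can be as small as the sketch below. It checks only `required`, per-property `type`, and unknown keys on a flat object schema; the function name is made up for this example, and a full JSON Schema validator library is the better choice for nested or constrained schemas.

```python
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(args, schema):
    """Return a list of problems with a tool call's arguments (empty = valid)."""
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = _TYPES.get(props[name].get("type"))
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected and (not isinstance(value, expected)
                         or (isinstance(value, bool) and expected is not bool)):
            errors.append(f"wrong type for {name}")
    return errors

ORDER_SCHEMA = {"type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"]}
```

When the list is non-empty, a reasonable policy is to send the errors back to the model as the tool result instead of executing anything, so it can correct the call.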

Structured outputs

Force responses to conform to a JSON Schema so downstream code can parse reliably — no regex-scraping prose. Now native on the major APIs (and via Pydantic helpers).

Use for: extraction, classification, form-filling, mapping output → UI.
Define the schema once; share it between the LLM call and your validator.
Explainability hook: add a reasons/citations field so the model justifies each value.
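
Even with native structured outputs, it is worth validating the reply on your side. The sketch below parses a classification result that carries a `reasons` field; the label set and the `Classification` type are hypothetical, and it uses only the standard library where a Pydantic model would be the usual production choice.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"billing", "shipping", "other"}  # hypothetical label set

@dataclass(frozen=True)
class Classification:
    label: str
    reasons: list

def parse_classification(raw: str) -> Classification:
    """Parse and validate a model reply that should be a JSON object."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    label, reasons = data.get("label"), data.get("reasons")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {sorted(ALLOWED_LABELS)}")
    if (not isinstance(reasons, list) or not reasons
            or not all(isinstance(r, str) for r in reasons)):
        raise ValueError("reasons must be a non-empty list of strings")
    return Classification(label=label, reasons=reasons)

ok = parse_classification('{"label": "shipping", "reasons": ["mentions a tracking number"]}')
```

Because every failure surfaces as a `ValueError`, the caller can handle malformed JSON and schema violations in one place, for example by retrying the call once.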

tool_call.py — Anthropic

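from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY

def get_order_status(order_id: str) -> str:
    return "shipped"   # placeholder: replace with your real lookup

tools = [{
  "name": "get_order_status",
  "description": "Look up an order's status by ID.",
  "input_schema": {"type":"object",
    "properties":{"order_id":{"type":"string"}},
    "required":["order_id"]}
}]

messages = [{"role":"user","content":"Where's order A-91?"}]
msg = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
    tools=tools, messages=messages)

if msg.stop_reason == "tool_use":
    call = next(b for b in msg.content if b.type=="tool_use")
    result = get_order_status(**call.input)   # YOU execute (validate args first)
    # Echo the assistant turn, attach the tool_result, and call again
    messages += [
        {"role":"assistant","content":msg.content},
        {"role":"user","content":[{"type":"tool_result",
            "tool_use_id":call.id,"content":str(result)}]}]
    final = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
        tools=tools, messages=messages)
    print(final.content[0].text)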

Agents & MCP

An agent is a loop: the model is given tools and a goal, then plans → acts → observes → repeats until done. "Agentic workflows" + "AI copilots" in the JDs live here.

Patterns (simplest first)

Single tool-use loop — model + tools, run until no more tool calls. Covers most real "agents".
Workflow / graph — you wire fixed steps (router → retrieve → draft → check). Predictable, debuggable. Prefer this.
Reflection — model critiques & revises its own output before returning.
Autonomous multi-agent — agents spawning agents. Powerful, hard to control & cost. Reach for last.

MCP — Model Context Protocol

An open standard for exposing tools, data, and prompts to any LLM client over a uniform interface. Write an MCP server once (your app's capabilities as tools) and any MCP-aware client/agent can use it — decoupling tools from any single model or framework.

Transports: stdio (local) or Streamable HTTP (remote; supersedes the older HTTP+SSE transport).
Server exposes tools, resources, prompts.
The clean way to make app logic "AI-accessible" without bespoke glue per model.
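
Under the hood MCP messages are JSON-RPC 2.0, and tools are reached through the `tools/list` and `tools/call` methods. The sketch below handles just those two requests in plain Python to show the message shapes; it is not a conforming server (no initialization handshake, no transport), the tool handler is a placeholder, and in practice you would build on an official MCP SDK.

```python
TOOLS = {
    "get_order_status": {
        "description": "Look up an order's status by ID.",
        "inputSchema": {"type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"]},
        # Placeholder handler; a real server would query your order system.
        "handler": lambda args: f"Order {args['order_id']}: shipped",
    }
}

def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request for tools/list or tools/call."""
    rid, method = request.get("id"), request.get("method")
    if method == "tools/list":
        tools = [{"name": n, "description": t["description"],
                  "inputSchema": t["inputSchema"]} for n, t in TOOLS.items()]
        return {"jsonrpc": "2.0", "id": rid, "result": {"tools": tools}}
    if method == "tools/call":
        params = request.get("params", {})
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return {"jsonrpc": "2.0", "id": rid,
                    "error": {"code": -32602, "message": "unknown tool"}}
        text = tool["handler"](params.get("arguments", {}))
        return {"jsonrpc": "2.0", "id": rid,
                "result": {"content": [{"type": "text", "text": text}],
                           "isError": False}}
    return {"jsonrpc": "2.0", "id": rid,
            "error": {"code": -32601, "message": "method not found"}}

response = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                   "params": {"name": "get_order_status",
                              "arguments": {"order_id": "A-91"}}})
```

The tool description and input schema are the same information you would otherwise hand to each provider's tool-calling API, which is what makes the server reusable across clients.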

Frameworks

LangGraph (stateful graphs, the durable choice) · LangChain agents · LlamaIndex agents · provider-native Agents SDKs · CrewAI / AutoGen (multi-agent). Start with a hand-written tool loop or LangGraph; reach for multi-agent frameworks only when a single loop genuinely can't express the task.

Agent failure modes to design against: infinite loops (cap iterations) · cost blowups (budget tokens & calls) · hallucinated tool args (validate) · context overflow (summarize/trim history) · silent wrong answers (add a verification step + human-in-the-loop on high-stakes actions).
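
Several of these guards fit naturally into the loop itself. The sketch below is a single tool-use loop with an iteration cap, a tool-call budget, and handling for hallucinated tool names; `model_step` is a stand-in for a real LLM call, and its return shape is an assumption made for this example.

```python
def run_agent(model_step, tools, goal, max_steps=5, max_tool_calls=8):
    """Single tool-use loop with hard caps.

    `model_step(history)` stands in for an LLM call. It must return either
    {"tool": name, "args": {...}} or {"final": text}.
    """
    history = [{"role": "user", "content": goal}]
    calls = 0
    for _ in range(max_steps):
        action = model_step(history)
        if "final" in action:
            return {"status": "done", "answer": action["final"], "tool_calls": calls}
        name, args = action.get("tool"), action.get("args", {})
        if name not in tools:
            # Hallucinated tool name: report it back instead of crashing.
            history.append({"role": "tool", "content": f"error: unknown tool {name}"})
            continue
        if calls >= max_tool_calls:
            return {"status": "budget_exceeded", "answer": None, "tool_calls": calls}
        calls += 1
        history.append({"role": "tool", "content": str(tools[name](**args))})
    return {"status": "max_steps", "answer": None, "tool_calls": calls}

def scripted_model(history):
    # Hypothetical policy: call the tool once, then answer from its output.
    if history[-1]["role"] == "tool":
        return {"final": f"Status: {history[-1]['content']}"}
    return {"tool": "get_order_status", "args": {"order_id": "A-91"}}

result = run_agent(scripted_model,
                   {"get_order_status": lambda order_id: "shipped"},
                   "Where's order A-91?")
```

Returning an explicit status instead of raising lets the caller decide whether a capped run should fall back to a human, a retry, or a canned reply.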

Production Concerns

The difference between a demo and a feature. This is where "understand AI limitations" and "reliability, performance, UX" from the JDs cash out.

Cost & latency

Prompt caching — cache the stable system/context prefix; cached input tokens can cost up to ~90% less.
Model routing — cheap model for easy calls, escalate hard ones.
Batch API — async, ~50% off for non-urgent jobs.
Stream tokens (SSE) so perceived latency ≈ time-to-first-token.

Reliability

Retries w/ exponential backoff on 429/5xx.
Timeouts + fallback model on outage.
Idempotency keys for tool actions.
Structured output + schema validation, not string parsing.
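
The first two items can be sketched with the standard library alone. `TransientError` below is a stand-in for whatever exception your SDK raises on 429/5xx responses, and the delays and attempt counts are illustrative defaults rather than recommendations.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a 429 or 5xx response from a provider SDK."""

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Retry fn() on TransientError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())  # jitter spreads out simultaneous retries

def call_with_fallback(primary, fallback, **retry_kwargs):
    """Try the primary model with retries, then the fallback once."""
    try:
        return call_with_retries(primary, **retry_kwargs)
    except TransientError:
        return fallback()
```

The `sleep` and `rng` parameters exist so tests can run instantly and deterministically. In production you would also put a timeout on each call so that a hung request counts as a failure.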

Safety & trust

Prompt injection — treat retrieved/user text as untrusted; never let it grant tool powers.
PII — redact before sending; mind data residency.
Grounding + citations to curb hallucination.
Eval in CI: golden sets, faithfulness, regression checks.

Observability: log prompts, responses, token counts, latency, and tool calls from day one (Langfuse, Phoenix, or your own table). You cannot debug, cost-optimize, or evaluate what you don't trace. "Explainable AI" in product terms usually means: surface citations, confidence/reasons, and the tool trace to the user.

Mapped to the Job

Where each posting's AI asks land in this manual. The DB / web / deploy skills are the table stakes around the AI work — listed briefly; the AI column is where you differentiate.

AI-Reviewed Full-Stack Dev

// reviewing & correcting Manus-AI-generated code

AI integration	§02 providers · §06 tools — know what the model can't reliably do

AI limitations	§08 — hallucination, injection, non-determinism; review AI output like a junior's PR

AI-dev tools	§07 agents — Manus & co. are agentic coders; you verify, constrain, correct

+ Front & back end	table stakes — fix real bugs the agent introduces

ShockPoint Full-Stack Engineer

// AI-enabled decision-support platforms

LLM APIs	§02 — provider abstraction, routing, streaming

Agentic workflows	§07 — LangGraph loops, copilots, MCP

Retrieval + vector DB	§04–§05 — pgvector/Qdrant, RAG over ops docs

Explainable AI	§06 + §08 — citations, reasons, tool traces in the UI

+ FastAPI / PostgreSQL	backend for APIs, auth, file pipelines

+ React / Tailwind	dashboards, copilot & analytics UIs

+ Docker / CI / cloud	Vercel/Railway/AWS deploy, GitHub workflows

Interview-ready framing: for either role, lead with a story where you integrated an LLM into a real full-stack system — chose the model tier, built retrieval, added tool calls, then made it reliable (eval, caching, fallbacks) — and could explain the limitations you designed around. That's the exact shape both teams are buying.