Engineering Reference · v1

LLM Integration
Field Manual

A working reference for the mid-level engineer shipping AI features: providers and models, retrieval (RAG / RAGFlow), tool calling, agents, and the production glue that ties LLMs into real full-stack systems.

scope · integration, not training | updated · Jun 2026 | stack · FastAPI / pgvector / React
01

LLM Foundations

The vocabulary you'll reach for in every design discussion. Get these right and most "AI bugs" become obvious config problems.

Core mechanics

  • Token — sub-word unit. ~0.75 words/token English. You pay per token, in and out.
  • Context window — max tokens (prompt + output) the model can attend to. Frontier models now reach 1M; cost & latency rise with fill.
  • Prompt vs completion — input is usually 2–5× cheaper than output. Output length dominates cost in chat/agent loops.
  • Tokenizer — provider-specific (BPE-family). Count tokens, not characters, before truncating.
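
The per-token pricing above reduces to simple arithmetic. A rough sketch, assuming the ~0.75 words/token heuristic for English and placeholder prices (the rates below are invented for illustration; use your provider's tokenizer and price sheet for real numbers):

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English heuristic from the text; real tokenizers vary

def estimate_tokens(text: str) -> int:
    """Approximate token count from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

def estimate_cost(tokens_in: int, tokens_out: int,
                  usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost in USD given per-million-token prices for input and output."""
    return (tokens_in * usd_per_mtok_in + tokens_out * usd_per_mtok_out) / 1_000_000

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
prompt = "Explain idempotency keys in three sentences."
cost = estimate_cost(estimate_tokens(prompt), 200, 3.0, 15.0)
```

With these made-up rates the 200 output tokens account for almost all of the cost, which is the "output length dominates" point in numbers.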

Generation parameters

  • temperature 0–1+ — randomness. 0 ≈ deterministic; raise for creative tasks.
  • top_p — nucleus sampling. Tune temp or top_p, not both.
  • max_tokens — output cap. Always set it; runaway output = runaway bill.
  • stop sequences, frequency/presence_penalty — fine control over repetition & cutoffs.
  • seed — best-effort reproducibility (not guaranteed).
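
To make temperature and top_p concrete, here is a toy sketch of what they do to a next-token distribution. The three logits are invented; real providers apply these steps server-side, and the exact order of operations may differ per API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}   # renormalise over the kept tokens

logits = [2.0, 1.0, 0.1]                  # toy scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.5)
nucleus = top_p_filter(softmax_with_temperature(logits), top_p=0.9)
```

Here `sharp` puts more mass on the top token than `flat`, and `nucleus` drops the least likely token entirely. Both knobs reshape the same distribution, which is why tuning one at a time is easier to reason about.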

Model types you'll actually pick between

Type | What it is | Reach for it when…
Instruct / Chat | Aligned to follow instructions in turns | The default for product features
Reasoning | Spends extra "thinking" tokens before answering (extended/adaptive thinking, o-series) | Multi-step logic, math, planning, hard agentic tasks
Base / Foundation | Raw next-token predictor, not aligned | Rarely in app code; fine-tuning experiments
Multimodal | Accepts image (and audio/video) input alongside text | OCR-lite, screenshots, document understanding, vision copilots
Embedding | Maps text → fixed vector; no text output | Search, RAG retrieval, clustering, dedup
Reranker | Scores (query, doc) relevance directly (cross-encoder) | Second-stage precision after vector recall
MoE | Mixture-of-experts — routes tokens to a subset of params | An architecture detail, not a knob; explains big-but-fast models
Tier mental model: nearly every provider ships a flagship (max capability), a balanced workhorse (your default), and a mini/nano (cheap, high-volume routing & classification). Build to the balanced tier, route trivial calls down, escalate hard calls up.
02

API Providers

Hosted, pay-per-token. Snapshot of the mid-2026 landscape — treat exact model names & prices as volatile; the shape of each lineup is stable.

Provider | Current flagship / workhorse | Distinctive
Anthropic (Claude) | Opus 4.8 · Sonnet 4.6 · Haiku 4.5 | Strong coding/agents, ~1M context, prompt caching, adaptive thinking, native tool use & MCP
OpenAI (GPT) | GPT-5.5 · 5.x mini/nano · o-series | Broad ecosystem, unified router + reasoning variants, image gen, Realtime/voice
Google (Gemini) | Gemini 3.x Pro · 3 Flash | Long context, tight GCP/Workspace integration, strong multimodal
xAI (Grok) | Grok 4.x | Real-time data leaning, X integration
DeepSeek | DeepSeek V/R series | Budget frontier, strong reasoning per dollar, open weights available
Inference hosts | Groq · Fireworks · Together · OpenRouter | Serve open models (Llama, Mistral, Qwen, DeepSeek) fast & cheap; OpenRouter = one key, many models

One integration pattern, three SDKs

APIs have converged on an OpenAI-style messages shape. Most providers also expose an OpenAI-compatible endpoint, so you can often swap base_url + model string and keep your code.

claude.py
from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY

msg = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a terse backend assistant.",
    messages=[{"role":"user","content":"Explain idempotency keys."}],
)
print(msg.content[0].text)
openai_compatible.py
from openai import OpenAI
# Same client points at OpenAI, Groq, Together, local servers…
client = OpenAI(base_url="https://api.openai.com/v1")

r = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role":"user","content":"Explain idempotency keys."}],
)
print(r.choices[0].message.content)
gemini.py
from google import genai
client = genai.Client()  # GEMINI_API_KEY

r = client.models.generate_content(
    model="gemini-3-flash",
    contents="Explain idempotency keys.",
)
print(r.text)
Abstraction tip: wrap calls behind one internal interface (a thin llm.complete()) so provider/model is config, not code. LiteLLM and OpenRouter do this for you across 100+ models with a single OpenAI-shaped call.
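
One way the thin `llm.complete()` wrapper could look. This is a sketch, not a prescribed design: the `LLM` class, the `Adapter` signature, and the `echo_adapter` stand-in are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A provider adapter is any callable taking (model, system, prompt) and returning text.
Adapter = Callable[[str, str, str], str]

@dataclass
class LLM:
    adapters: Dict[str, Adapter]   # provider name -> adapter
    provider: str                  # chosen from config, not hard-coded at call sites
    model: str

    def complete(self, prompt: str, system: str = "") -> str:
        try:
            adapter = self.adapters[self.provider]
        except KeyError:
            raise ValueError(f"unknown provider: {self.provider}") from None
        return adapter(self.model, system, prompt)

def echo_adapter(model: str, system: str, prompt: str) -> str:
    """Stand-in adapter for tests; a real one would call a provider SDK here."""
    return f"[{model}] {prompt}"

llm = LLM(adapters={"echo": echo_adapter}, provider="echo", model="demo-model")
reply = llm.complete("Explain idempotency keys.")
```

Call sites only ever see `llm.complete(...)`, so swapping provider or model becomes a config change, and the echo adapter doubles as a test double.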
03

Local & Self-Hosted

When you need data residency, zero per-token cost, offline operation, or control over the exact weights. The tradeoff is you own the latency, the VRAM, and the ops.

Tool | Best for | Notes
Ollama | Fastest path to a local model + API | One-line pulls, OpenAI-compatible endpoint on :11434, great for prototyping & edge boxes (Jetson, mini-PCs)
llama.cpp | Max control, GGUF quantization, CPU/edge | The engine under many tools; supports distributed RPC inference across GPUs; pick quant (Q4_K_M, Q8) to fit VRAM
vLLM | High-throughput serving | PagedAttention + continuous batching; the production choice for self-hosted multi-user inference
LM Studio | Desktop GUI + local server | Friendly model browser, also exposes an OpenAI-compatible server
TGI / TensorRT-LLM | Enterprise serving on NVIDIA | HF Text-Gen-Inference; TensorRT for squeezed latency on data-center GPUs

Open weights to know

  • Llama (Meta) — broad ecosystem default
  • Qwen (Alibaba) — strong multilingual & coding, many sizes
  • Mistral / Mixtral — efficient, MoE options
  • DeepSeek — strong reasoning, open R-series
  • Gemma / Phi — small, capable, edge-friendly

Sizing rule of thumb

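VRAM ≈ params × bytes-per-param. FP16 is 2 bytes/param, so a 7B model needs ≈ 14 GB for weights; 4-bit quantization cuts that ~4× to ≈ 3.5 GB of raw weights (≈ 4–5 GB in practice, since common quant formats keep some tensors at higher precision and store per-block scales), at a small quality cost. Add headroom for the KV cache (grows with context length × batch). Quantize first; distribute (multi-GPU RPC) only when one card can't hold the weights.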
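
The rule of thumb as a small calculator. A sketch under stated assumptions: the KV-cache formula is the standard keys-plus-values count per layer, and the model shape used (32 layers, 32 KV heads, head dimension 128) is an illustrative 7B-class configuration, not a spec for any particular model.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, batch: int = 1, bytes_per_value: int = 2) -> float:
    """KV cache size: keys and values (the factor 2) for every layer, head and token."""
    total = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens * batch
    return total / 1e9

fp16 = weights_gb(7, 16)    # 7B params at 16 bits
q4 = weights_gb(7, 4)       # same model at an idealised 4 bits per param
# Illustrative 7B-class shape (assumed): 32 layers, 32 KV heads, head_dim 128.
kv = kv_cache_gb(layers=32, kv_heads=32, head_dim=128, context_tokens=4096)
```

For this assumed shape a single 4,096-token sequence adds roughly 2 GB of cache on top of the weights, and that figure scales linearly with both context length and batch size.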

Pick hosted vs local deliberately: hosted wins on capability-per-effort and elastic scale; local wins on privacy, recurring cost at volume, and latency-floor control. Most teams ship hosted first, then move hot/sensitive paths local.
04

Embeddings & Vector DBs

The substrate of retrieval. An embedding turns text into a vector; a vector DB finds nearest neighbors fast.

Embedding essentials

  • Dimension — fixed per model (e.g. 768/1024/1536). Index & query must use the same model.
  • Similarity — cosine (default), dot product, or L2. Match the metric the model was trained for.
  • Normalize vectors if you rank by dot product and want cosine behavior; cosine itself ignores vector length. Many libs normalize for you.
  • Re-embed on model change — you cannot mix embeddings from different models in one index.
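
A small sketch of how the metrics relate, using plain Python lists as stand-ins for embeddings. The vectors are toy values; the dimension check mirrors the rule that index and query vectors must come from the same model.

```python
import math

def dot(a, b):
    if len(a) != len(b):
        raise ValueError("dimension mismatch: were these embedded by the same model?")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [6.0, 8.0]          # same direction, different length
cos = cosine_similarity(a, b)          # scale-invariant
raw_dot = dot(a, b)                    # grows with vector length
unit_dot = dot(normalize(a), normalize(b))   # equals cosine once both are unit length
```

After normalization, dot product and cosine give the same ranking, so a dot-product index over unit vectors behaves like a cosine index.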

Vector store options

  • pgvector — Postgres extension. Best default if you already run PG; keeps vectors next to relational data.
  • Qdrant / Weaviate / Milvus — purpose-built, rich filtering, scale.
  • Chroma / LanceDB — lightweight, local-first, great for dev.
  • Pinecone — fully managed, zero-ops.
  • FAISS — a library, not a server; in-memory ANN building block.
schema.sql — pgvector
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  doc_id    text,
  content   text,
  metadata  jsonb,
  embedding vector(1024)        -- must match your model
);
-- Approximate-NN index: HNSW (fast, accurate) for cosine
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
Filter + search: the value of pgvector/Qdrant is combining ANN search with metadata filters (WHERE tenant_id = ?) in one query — essential for multi-tenant and permissioned RAG.
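
A sketch of a tenant-scoped query against the `chunks` table above. Two assumptions to flag: the schema has no `tenant_id` column, so this stores the tenant in the `metadata` jsonb field, and the query vector is passed in pgvector's text form and cast with `::vector`. The helper names are hypothetical.

```python
def to_vector_literal(vec):
    """Format a list of floats as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def filtered_search(tenant_id: str, query_vec, k: int = 5):
    """Return (sql, params) for a tenant-scoped nearest-neighbour query."""
    sql = (
        "SELECT id, content FROM chunks "
        "WHERE metadata->>'tenant_id' = %s "          # assumed: tenant lives in the jsonb column
        "ORDER BY embedding <=> %s::vector "          # <=> is pgvector's cosine distance
        "LIMIT %s"
    )
    return sql, (tenant_id, to_vector_literal(query_vec), k)

sql, params = filtered_search("acme", [0.1, 0.2, 0.3], k=3)
# With a psycopg connection: rows = conn.execute(sql, params).fetchall()
```

Keeping the tenant filter in a parameterized WHERE clause means isolation is enforced by the database rather than by prompt text. One caveat worth testing for: with an approximate index, pgvector applies the filter after the index scan, so a very selective filter can return fewer than k rows.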
05

RAG & RAGFlow

Retrieval-Augmented Generation = fetch relevant context, stuff it into the prompt, let the model answer grounded in your data. The hard part isn't the LLM call — it's the retrieval quality, and that starts with chunking.

The RAG pipeline

ingest → parse → chunk → embed → upsert(vector DB)
    query → embed → ANN search → rerank → assemble context → LLM → cite

Chunking methods

Chunk too big → diluted relevance & wasted context. Too small → lost meaning. Overlap preserves continuity across boundaries.

Method | How | Tradeoff
Fixed-size | N tokens/chars, fixed overlap | Dead simple; blind to structure, cuts mid-sentence
Recursive character | Split on ¶ → sentence → word until under size | The pragmatic default (LangChain's RecursiveCharacterTextSplitter)
Sentence / token-aware | Respect sentence & token boundaries | Cleaner units; needs a tokenizer/NLP pass
Structure-aware | Split by Markdown/HTML headings, code blocks, tables | Keeps semantic units intact; format-specific
Semantic | Embed sentences, cut where similarity drops | Topically coherent chunks; compute cost up front
Parent–child / hierarchical | Retrieve small, return enclosing parent for context | Best precision+context combo; more index plumbing
Contextual | Prepend an LLM-written summary of where the chunk sits | Big recall gains (Anthropic's "contextual retrieval"); LLM cost per chunk
Late chunking | Embed the long doc first, pool per-chunk after | Chunks keep document-level context; needs long-context embedder
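
To illustrate the recursive-character row, here is a minimal stand-alone splitter. It is a simplified sketch of the idea (try the coarsest separator first, fall back to finer ones), measured in characters and without overlap; it is not LangChain's implementation.

```python
def recursive_split(text, max_chars=200, seps=("\n\n", ". ", " ")):
    """Split on the coarsest separator first, recursing into pieces that are still too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not seps:                                   # nothing left to split on: hard cut
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = seps[0], seps[1:]
    chunks, buf = [], ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= max_chars:
            buf = candidate                        # keep packing pieces into this chunk
            continue
        if buf:
            chunks.append(buf)
        if len(piece) > max_chars:                 # too long on its own: go one level finer
            chunks.extend(recursive_split(piece, max_chars, finer))
            buf = ""
        else:
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks

doc = ("Alpha beta gamma. Delta epsilon zeta.\n\n"
       "Eta theta iota kappa lambda mu nu xi omicron pi rho sigma tau.")
chunks = recursive_split(doc, max_chars=40)
```

The first paragraph fits in one chunk and stays intact; the second is too long, so it falls through to word-level packing. Production splitters also add overlap and count tokens rather than characters.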

RAGFlow specifically (infiniflow/ragflow · open source)

RAGFlow is a full RAG engine built around deep document understanding rather than naive text splitting. Its differentiator is DeepDoc: layout analysis, OCR, and table-structure recognition that parse PDFs, scans, and Office files into structured blocks before chunking — so tables, figures, and headings survive ingestion.

Template-based chunking

Instead of one splitter, RAGFlow ships document-type templates you assign per file/knowledge base. Each applies layout-aware rules tuned to that shape:

  • General — default mixed layout
  • Q&A — paired question/answer rows
  • Manual / Book / Paper — heading & section aware
  • Table — preserves rows/columns
  • Laws / Resume / Presentation / Email / Picture / One — domain-specific

What you get out of the box

  • OCR + layout + table recognition (DeepDoc)
  • Chunk visualization & manual editing — you can see and fix chunks
  • Built-in embedding + reranking + citations with source traceback
  • Knowledge-graph / GraphRAG extraction option
  • REST API + Python SDK; self-hostable via Docker Compose
When to choose RAGFlow: messy, real-world documents (scanned PDFs, financial tables, contracts) where parsing quality is the bottleneck and you want a batteries-included engine + UI rather than wiring a framework yourself.

Existing libraries (build-your-own RAG)

Library | Role
LangChain | Splitters, loaders, retrievers, chains — broad glue layer
LlamaIndex | Retrieval-first: node parsers, SemanticSplitterNodeParser, query engines
Haystack | Production pipelines, strong eval & component model
Unstructured | Document partitioning — turn PDFs/HTML/docx into clean elements
Chonkie / semantic-text-splitter | Focused, fast chunking libraries when you don't want a framework
Rerankers | Cohere Rerank, BGE-reranker, Jina — cross-encoder second stage

Custom implementation — minimal, honest RAG

~40 lines, no framework. This is the whole loop: chunk → embed → store → retrieve → ground.

rag.py
import psycopg  # `conn` below is a psycopg 3 connection
from openai import OpenAI
client = OpenAI()
EMB = "text-embedding-3-large"; CHAT = "gpt-5.5"
DIM = 1024  # must match vector(1024) in schema.sql

def vec(v):
    # pgvector text form, e.g. '[0.1,0.2]'; cast with ::vector in SQL
    return "[" + ",".join(map(str, v)) + "]"

# 1. CHUNK — fixed-size word windows with overlap (words stand in for tokens)
def chunk(text, size=800, overlap=120):
    assert 0 <= overlap < size, "overlap must be smaller than size"
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i+size]) for i in range(0, len(words), step)]

# 2. EMBED (batch the calls in real code)
def embed(texts):
    # dimensions= shortens the vectors to fit the schema's column width
    r = client.embeddings.create(model=EMB, input=texts, dimensions=DIM)
    return [d.embedding for d in r.data]

# 3. INGEST → pgvector
def ingest(conn, doc_id, text):
    parts = chunk(text)
    for c, v in zip(parts, embed(parts)):
        conn.execute("INSERT INTO chunks(doc_id,content,embedding) VALUES (%s,%s,%s::vector)",
                     (doc_id, c, vec(v)))
    conn.commit()

# 4. RETRIEVE — ANN top-k by cosine distance (<=>)
def retrieve(conn, query, k=5):
    qv = embed([query])[0]
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec(qv), k)).fetchall()
    return [r[0] for r in rows]

# 5. GROUND — stuff context, instruct the model to stay in it & abstain
def answer(conn, question):
    ctx = "\n\n---\n\n".join(retrieve(conn, question))
    sys = ("Answer ONLY from the context. "
           "If it's not there, say you don't know.")
    r = client.chat.completions.create(model=CHAT, messages=[
        {"role":"system","content":sys},
        {"role":"user","content":f"Context:\n{ctx}\n\nQ: {question}"}])
    return r.choices[0].message.content
To make it production-grade, add (in order of payoff): a reranker after step 4 · hybrid search (BM25/keyword ∪ vector) for exact terms & acronyms · metadata filters for tenancy/permissions · citations back to chunk IDs · and eval (retrieval recall@k + answer faithfulness) before you trust it.
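
For the hybrid-search step, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat RRF as this sketch's choice; the chunk IDs are invented and `k=60` is a widely used default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["c7", "c2", "c9"]   # e.g. from BM25 / full-text search
vector_hits = ["c2", "c4", "c7"]    # e.g. from the ANN query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and cosine distances live on different scales. Chunks found by both retrievers rise to the top, and the fused list is what you would hand to the reranker.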
06

Tool Calling & Structured Output

How LLMs stop being chatbots and start acting on systems. These two related primitives sit behind what job descriptions (JDs) call "AI-enabled" and "explainable" features.

Tool / function calling

You describe functions (name + JSON-schema params). The model decides when to call one and returns a structured call; you execute it and feed the result back. The model never runs code itself — it requests, you fulfill.

  • Loop: model → tool_use → you run it → tool_result → model continues
  • Schema quality = call quality. Describe params like API docs.
  • Always validate args before executing — treat them as untrusted input.
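
A minimal sketch of the "validate before executing" rule, checking model-supplied arguments against the same kind of `input_schema` you send to the model. It covers only required fields, unexpected fields, and top-level types; a real service would more likely use a full JSON Schema validator or Pydantic.

```python
JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
              "boolean": bool, "object": dict, "array": list}

def validate_args(args: dict, schema: dict) -> list:
    """Return a list of problems; empty means the args passed these basic checks."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected field: {name}")
            continue
        expected = props[name].get("type")
        py_type = JSON_TYPES.get(expected)
        # bool is a subclass of int in Python, so reject it for numeric types explicitly
        if py_type and (not isinstance(value, py_type)
                        or (expected in ("integer", "number") and isinstance(value, bool))):
            errors.append(f"{name}: expected {expected}")
    return errors

schema = {"type": "object",
          "properties": {"order_id": {"type": "string"}},
          "required": ["order_id"]}
ok = validate_args({"order_id": "A-91"}, schema)
bad = validate_args({"order_id": 91, "drop_table": True}, schema)
```

Returning the error list (rather than raising) makes it easy to send the problems back to the model as a tool result so it can correct the call.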

Structured outputs

Force responses to conform to a JSON Schema so downstream code can parse reliably — no regex-scraping prose. Now native on the major APIs (and via Pydantic helpers).

  • Use for: extraction, classification, form-filling, mapping output → UI.
  • Define the schema once; share it between the LLM call and your validator.
  • Explainability hook: add a reasons/citations field so the model justifies each value.
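
A standard-library sketch of the "define once, validate always" idea, including a `reasons` field as the explainability hook. The `Classification` shape and the label set are hypothetical; with a provider's native structured-output feature or Pydantic you would express the same contract as a schema.

```python
import json
from dataclasses import dataclass

@dataclass
class Classification:
    label: str
    confidence: float
    reasons: list   # the explainability hook: short justifications from the model

ALLOWED_LABELS = {"bug", "feature_request", "question"}   # hypothetical label set

def parse_classification(raw: str) -> Classification:
    """Parse model output as JSON and enforce the contract before downstream code uses it."""
    data = json.loads(raw)                       # raises ValueError on non-JSON text
    result = Classification(**data)              # raises TypeError on missing/extra keys
    if result.label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {result.label}")
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if not result.reasons:
        raise ValueError("at least one reason is required")
    return result

raw = '{"label": "bug", "confidence": 0.82, "reasons": ["stack trace in report"]}'
parsed = parse_classification(raw)
```

Because the parser rejects anything outside the contract, the UI can render `parsed.reasons` next to the label without defensive string handling.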
tool_call.py — Anthropic
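from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY

def get_order_status(order_id):   # placeholder; swap in your real lookup
    return f"Order {order_id}: shipped"

tools = [{
  "name": "get_order_status",
  "description": "Look up an order's status by ID.",
  "input_schema": {"type":"object",
    "properties":{"order_id":{"type":"string"}},
    "required":["order_id"]}
}]
messages = [{"role":"user","content":"Where's order A-91?"}]

msg = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
    tools=tools, messages=messages)

if msg.stop_reason == "tool_use":
    call = next(b for b in msg.content if b.type=="tool_use")
    result = get_order_status(**call.input)   # YOU execute (validate call.input first)
    # Echo the assistant turn, then return the result keyed by tool_use_id
    messages += [{"role":"assistant","content":msg.content},
                 {"role":"user","content":[{"type":"tool_result",
                     "tool_use_id":call.id,"content":result}]}]
    msg = client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
        tools=tools, messages=messages)
print(msg.content[0].text)   # final answer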
07

Agents & MCP

An agent is a loop: the model is given tools and a goal, then plans → acts → observes → repeats until done. "Agentic workflows" + "AI copilots" in the JDs live here.

Patterns (simplest first)

  • Single tool-use loop — model + tools, run until no more tool calls. Covers most real "agents".
  • Workflow / graph — you wire fixed steps (router → retrieve → draft → check). Predictable, debuggable. Prefer this.
  • Autonomous multi-agent — agents spawning agents. Powerful, but hard to control and to budget. Reach for last.
  • Reflection — model critiques & revises its own output before returning.
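
The single tool-use loop, sketched with a scripted stand-in for the model so it runs offline. The action dictionaries (`type`, `tool`, `args`) are a made-up internal format, not any provider's wire format; in real code `model_step` would wrap an API call and translate its response.

```python
def run_agent(model_step, tools, goal, max_iterations=5):
    """Single tool-use loop: call the model, run requested tools, stop on a final answer."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_iterations):                  # hard cap guards against infinite loops
        action = model_step(history)
        if action["type"] == "final":
            return action["content"]
        tool = tools.get(action["tool"])
        if tool is None:
            observation = f"error: unknown tool {action['tool']}"
        else:
            observation = tool(**action["args"])
        history.append({"role": "assistant", "content": action})
        history.append({"role": "tool", "content": observation})
    raise RuntimeError("agent hit the iteration cap without finishing")

# Scripted stand-in for a model: asks for one lookup, then answers from the observation.
def fake_model(history):
    if history[-1]["role"] == "tool":
        return {"type": "final", "content": f"Status: {history[-1]['content']}"}
    return {"type": "tool_call", "tool": "get_order_status", "args": {"order_id": "A-91"}}

tools = {"get_order_status": lambda order_id: f"{order_id} shipped"}
answer = run_agent(fake_model, tools, "Where's order A-91?")
```

The loop itself is a dozen lines; most of the engineering effort goes into the guard rails around it, such as the iteration cap here, plus argument validation and token budgets.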

MCP — Model Context Protocol

An open standard for exposing tools, data, and prompts to any LLM client over a uniform interface. Write an MCP server once (your app's capabilities as tools) and any MCP-aware client/agent can use it — decoupling tools from any single model or framework.

  • Transports: stdio (local) or Streamable HTTP (remote), which superseded the earlier HTTP+SSE transport in the spec.
  • Server exposes tools, resources, prompts.
  • The clean way to make app logic "AI-accessible" without bespoke glue per model.
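
Under the hood MCP messages are JSON-RPC 2.0. The sketch below shows the rough shape of a `tools/call` exchange as I understand the spec, with a toy in-process dispatcher in place of a real server. It is for intuition only: in practice you would use an official MCP SDK, and the tool here is a hypothetical example.

```python
import json

def make_tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking a server to run one tool."""
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments}}

def handle_request(request: dict, tools: dict) -> dict:
    """Toy server-side dispatch: run the named tool and wrap the result as text content."""
    params = request["params"]
    tool = tools.get(params["name"])
    if tool is None:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32602, "message": f"Unknown tool: {params['name']}"}}
    text = tool(**params["arguments"])
    return {"jsonrpc": "2.0", "id": request["id"],
            "result": {"content": [{"type": "text", "text": text}], "isError": False}}

tools = {"get_order_status": lambda order_id: f"{order_id} shipped"}   # hypothetical tool
request = make_tool_call(1, "get_order_status", {"order_id": "A-91"})
response = handle_request(request, tools)
wire = json.dumps(request)   # what would travel over stdio or HTTP
```

Because the envelope is the same for every tool, a client needs no per-tool glue: it discovers tools, sends `tools/call`, and reads back content blocks.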

Frameworks

LangGraph (stateful graphs, the durable choice) · LangChain agents · LlamaIndex agents · provider-native Agents SDKs · CrewAI / AutoGen (multi-agent). Start with a hand-written tool loop or LangGraph; reach for multi-agent frameworks only when a single loop genuinely can't express the task.

Agent failure modes to design against: infinite loops (cap iterations) · cost blowups (budget tokens & calls) · hallucinated tool args (validate) · context overflow (summarize/trim history) · silent wrong answers (add a verification step + human-in-the-loop on high-stakes actions).
08

Production Concerns

The difference between a demo and a feature. This is where "understand AI limitations" and "reliability, performance, UX" from the JDs cash out.

Cost & latency

  • Prompt caching — cache the stable system/context prefix; cached input tokens can be billed at up to ~90% off.
  • Model routing — cheap model for easy calls, escalate hard ones.
  • Batch API — async, ~50% off for non-urgent jobs.
  • Stream tokens (SSE) so perceived latency ≈ time-to-first-token.

Reliability

  • Retries w/ exponential backoff on 429/5xx.
  • Timeouts + fallback model on outage.
  • Idempotency keys for tool actions.
  • Structured output + schema validation, not string parsing.
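
A sketch of retries with exponential backoff and jitter. `RetryableError` is a stand-in for whatever your SDK raises on 429/5xx responses, and the delays are illustrative defaults. The `sleep` parameter is injectable so the example (and your tests) can run without waiting.

```python
import random
import time

class RetryableError(Exception):
    """Stand-in for a 429 or 5xx response from a provider."""

def with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Retry `call` on RetryableError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                                   # out of attempts: surface the error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))               # jitter spreads out retry storms

attempts = {"n": 0}
def flaky():
    """Fails twice with a retryable error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableError("429")
    return "ok"

delays = []
result = with_backoff(flaky, sleep=delays.append)       # record delays instead of sleeping
```

Only retry calls that are safe to repeat. For tool actions with side effects, pair this with the idempotency keys mentioned above so a retried request cannot execute twice.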

Safety & trust

  • Prompt injection — treat retrieved/user text as untrusted; never let it grant tool powers.
  • PII — redact before sending; mind data residency.
  • Grounding + citations to curb hallucination.
  • Eval in CI: golden sets, faithfulness, regression checks.
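
For the eval bullet, a minimal retrieval metric you could run over a golden set in CI. The golden examples and the canned retriever are invented for the sketch; in practice `retrieve` would call your real search function and return chunk IDs.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each golden example needs at least one relevant ID")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(set(relevant_ids))

def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over a golden set of (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in golden_set]
    return sum(scores) / len(scores)

# Hypothetical golden set and a canned retriever standing in for the real one.
golden = [("refund policy?", ["c1", "c4"]), ("api rate limit?", ["c9"])]
canned = {"refund policy?": ["c1", "c2", "c3"], "api rate limit?": ["c9", "c1"]}
score = mean_recall_at_k(golden, canned.get, k=3)
```

A CI gate can then be a one-line assertion that the score stays above a threshold you chose from a known-good baseline, which catches regressions from a chunking or embedding change before users do.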
Observability: log prompts, responses, token counts, latency, and tool calls from day one (Langfuse, Phoenix, or your own table). You cannot debug, cost-optimize, or evaluate what you don't trace. "Explainable AI" in product terms usually means: surface citations, confidence/reasons, and the tool trace to the user.
09

Mapped to the Job

Where each posting's AI asks land in this manual. The DB / web / deploy skills are the table stakes around the AI work — listed briefly; the AI column is where you differentiate.

AI-Reviewed Full-Stack Dev

// reviewing & correcting Manus-AI-generated code
AI integration | §02 providers · §06 tools — know what the model can't reliably do
AI limitations | §08 — hallucination, injection, non-determinism; review AI output like a junior's PR
AI-dev tools | §07 agents — Manus & co. are agentic coders; you verify, constrain, correct
+ Front & back end | table stakes — fix real bugs the agent introduces

ShockPoint Full-Stack Engineer

// AI-enabled decision-support platforms
LLM APIs | §02 — provider abstraction, routing, streaming
Agentic workflows | §07 — LangGraph loops, copilots, MCP
Retrieval + vector DB | §04–§05 — pgvector/Qdrant, RAG over ops docs
Explainable AI | §06 + §08 — citations, reasons, tool traces in the UI
+ FastAPI / PostgreSQL | backend for APIs, auth, file pipelines
+ React / Tailwind | dashboards, copilot & analytics UIs
+ Docker / CI / cloud | Vercel/Railway/AWS deploy, GitHub workflows
Interview-ready framing: for either role, lead with a story where you integrated an LLM into a real full-stack system — chose the model tier, built retrieval, added tool calls, then made it reliable (eval, caching, fallbacks) — and could explain the limitations you designed around. That's the exact shape both teams are buying.