Reference · Data Engineering for AI

Graph Databases
Neo4j · Neptune · Knowledge Graphs

When relationships are the data. How to model, store, and query connected information — and why graph + LLM (GraphRAG) is becoming table stakes for senior AI/ML roles.

Property Graph · RDF / Triples · Cypher · Gremlin · SPARQL · GraphRAG · Semantic Search
01

Why a graph at all?

A relational database stores relationships implicitly — as foreign keys you re-discover with JOINs at query time. A graph database stores relationships as first-class, materialized records. Walking from one entity to a connected one is a pointer hop, not a JOIN.

That single design choice flips the cost model. In SQL, a "friends-of-friends-of-friends" query pays for one more JOIN per hop, and each JOIN is an index lookup or scan whose cost depends on how large the tables are. In a graph, the price grows with the size of the answer: you only touch the nodes and relationships you actually traverse. This is called index-free adjacency: each node physically points to its neighbors.
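The cost difference can be sketched in a few lines of plain Python. This is a toy model, not a benchmark: the edge list, the function names, and the "rows touched" counter are all illustrative assumptions, and the JOIN side deliberately uses a full scan per hop (a real relational engine would usually use an index, which narrows the gap but keeps the dependence on table size).

```python
from collections import defaultdict

# Hypothetical toy data: (source, target) rows of a KNOWS edge table.
EDGES = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"),
         ("erin", "frank"), ("frank", "grace"), ("grace", "heidi")]

def hops_by_join(edges, start, depth):
    """Relational style without an index: every hop re-scans the edge table."""
    frontier, rows_touched = {start}, 0
    for _ in range(depth):
        next_frontier = set()
        for src, dst in edges:              # full scan per hop
            rows_touched += 1
            if src in frontier:
                next_frontier.add(dst)
        frontier = next_frontier
    return frontier, rows_touched

def build_adjacency(edges):
    """Graph style: each node holds direct references to its neighbors."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return adjacency

def hops_by_adjacency(adjacency, start, depth):
    """Each hop touches only the edges leaving the current frontier."""
    frontier, edges_touched = {start}, 0
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency.get(node, []):
                edges_touched += 1
                next_frontier.add(neighbor)
        frontier = next_frontier
    return frontier, edges_touched

join_result, join_cost = hops_by_join(EDGES, "alice", 3)
adj_result, adj_cost = hops_by_adjacency(build_adjacency(EDGES), "alice", 3)
print(join_result, join_cost)   # {'dave'} 18
print(adj_result, adj_cost)     # {'dave'} 3
```

Both functions return the same answer; only the work differs. Adding unrelated rows to `EDGES` raises the first cost and leaves the second unchanged.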

Node

An entity — a person, movie, account, document, gene. Carries labels (its type) and properties (key/value data).

Relationship

A typed, directed connection between two nodes. Can carry its own properties (e.g. a RATED edge with stars: 5).

Traversal

Following relationships from a starting node. The core operation — and where graphs win against JOIN-heavy SQL by orders of magnitude.
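The three building blocks above map onto a very small data structure. The following is a minimal sketch with hypothetical class and method names (`Node`, `Relationship`, `PropertyGraph`, `traverse`); it is meant to show the shape of the model, not how any particular database stores it.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    labels: set                                   # the node's type(s)
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    rel_type: str                                 # e.g. "RATED"
    start: int                                    # id of the source node
    end: int                                      # id of the target node
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.outgoing = {}                        # node id -> list of Relationship

    def add_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)
        self.outgoing.setdefault(node_id, [])
        return self.nodes[node_id]

    def add_relationship(self, rel_type, start, end, **properties):
        rel = Relationship(rel_type, start, end, properties)
        self.outgoing[start].append(rel)
        return rel

    def traverse(self, start, rel_type):
        """Follow outgoing relationships of one type from a start node."""
        return [(rel, self.nodes[rel.end])
                for rel in self.outgoing[start] if rel.rel_type == rel_type]

graph = PropertyGraph()
graph.add_node(1, ["Person"], name="Alice")
graph.add_node(2, ["Movie"], title="The Matrix")
graph.add_relationship("RATED", 1, 2, stars=5)

rated = graph.traverse(1, "RATED")
for rel, movie in rated:
    print(movie.properties["title"], rel.properties["stars"])   # The Matrix 5
```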

[Interactive figure: a small property graph with draggable nodes, colored by label (Person, Movie, Company, Genre); running a query highlights its traversal.]
When the graph wins

Deep, variable-length relationship queries (recommendations, fraud rings, dependency chains, org hierarchies, knowledge graphs). The deeper the traversal, the bigger the gap vs. SQL.

When it doesn't

Flat tabular data, heavy column aggregations / OLAP, or workloads that are mostly single-table scans. A graph DB adds operational cost for no traversal benefit — reach for relational or columnar instead.

02

Two data models you must distinguish

"Graph database" hides a fork in the road. Almost every vendor sits on one side or the other — and Neptune notably supports both. Choosing wrong means rewriting your data layer.

Labeled Property Graph (LPG)

Nodes + edges, both can hold properties. Developer-friendly, schema-optional, intuitive for application data. Edges have an identity and attributes.

Used by: Neo4j, Memgraph, TigerGraph, JanusGraph, Neptune (Gremlin / openCypher), ArangoDB.

Query with: Cypher / openCypher, Gremlin, GSQL.

RDF Triple Store

Everything is a subject — predicate — object triple. W3C-standardized, globally addressable via IRIs, supports formal schemas (RDFS/OWL) and logical inference / reasoning.

Used by: Neptune (RDF), GraphDB, Stardog, AllegroGraph, Virtuoso, Blazegraph.

Query with: SPARQL.

Same fact, both models

PROPERTY GRAPH                         RDF TRIPLES
(:Person {name:"Alice"})               :Alice  :worksAt  :Acme .
        │ :WORKS_AT {since:2021}       :Alice  rdf:type  :Person .
        ▼                              :Acme   rdf:type  :Company .
(:Company {name:"Acme"})               :Alice  :name     "Alice" .

The property graph keeps Alice as one rich object; RDF shreds her into atomic statements. RDF's atomicity is what makes it great for merging knowledge from many sources and for reasoning, but it's more verbose for everyday application CRUD.
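To make the "shredding" concrete, here is a small sketch that turns property-graph records into subject, predicate, object tuples. The helper names (`node_to_triples`, `edge_to_triple`) and the prefix style are assumptions made for illustration. Note that a plain triple has no slot for an edge property such as `since: 2021`; in RDF that needs extra modeling (for example reification or RDF-star), which this sketch leaves out.

```python
def node_to_triples(node_id, label, properties):
    """Shred one property-graph node into (subject, predicate, object) triples."""
    triples = [(node_id, "rdf:type", f":{label}")]
    for key, value in properties.items():
        triples.append((node_id, f":{key}", value))
    return triples

def edge_to_triple(source_id, predicate, target_id):
    """A plain edge maps to exactly one triple; edge properties are dropped here."""
    return (source_id, predicate, target_id)

triples = (
    node_to_triples(":Alice", "Person", {"name": "Alice"})
    + node_to_triples(":Acme", "Company", {"name": "Acme"})
    + [edge_to_triple(":Alice", ":worksAt", ":Acme")]
)

for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

One node with one property becomes two statements, which is the verbosity the paragraph above describes, and also why triples from different sources merge so easily: each statement stands alone.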

Decision rule

Building an app where relationships drive features (recommendations, social, fraud)? Property graph. Integrating heterogeneous data, need interoperability, ontologies, or inference (life sciences, gov / federal data standards, the semantic web)? RDF.

03

Neo4j & Cypher

Neo4j is the market-leading property-graph database. Its query language, Cypher, reads like ASCII art of the pattern you're matching — nodes in (), relationships in -[]->. openCypher (the open spec) is also supported by Neptune and Memgraph, so the skill transfers.

Writing data: CREATE & MERGE

seed.cypher · Cypher
// CREATE always inserts — running this twice makes duplicates.
CREATE (a:Person {name: 'Alice', born: 1990})

// MERGE is "match-or-create" (upsert). The pattern in MERGE is the
// uniqueness key — Neo4j matches it; if absent, it creates it.
MERGE (m:Movie {title: 'The Matrix'})
  ON CREATE SET m.released = 1999, m.added = timestamp()
  ON MATCH SET  m.lastSeen = timestamp()

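// Connect them with a typed, directed relationship that has a property.
// Only the relationship itself is the merge key; the rating is set
// afterwards, so a changed score updates the edge instead of adding one.
MATCH (a:Person {name:'Alice'}), (m:Movie {title:'The Matrix'})
MERGE (a)-[r:RATED]->(m)
SET r.stars = 5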
Walkthrough & tradeoffs
  1. CREATE is unconditional — fast, but re-running an ingestion script duplicates nodes. Use it only for guaranteed-fresh data.
  2. MERGE is the idempotent workhorse for ETL: safe to re-run. The catch — back the merge key with a unique constraint/index (CREATE CONSTRAINT ... IS UNIQUE), or MERGE does a full label scan and gets slow at scale.
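  3. Tradeoff: MERGE matches the entire pattern, properties included. If a mutable property (like stars on the RATED edge) sits inside the MERGE pattern, a changed value no longer matches and Neo4j creates a second relationship. Keep only the identity in the merge key and put everything else in SET.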
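The difference between the two write styles can be mimicked with an in-memory list. This is a toy stand-in, not the Neo4j API: `ToyStore`, `create`, and `merge` are hypothetical names, and ratings are modeled as plain records rather than relationships to keep the sketch short.

```python
class ToyStore:
    """In-memory stand-in for a graph store; not the Neo4j API."""

    def __init__(self):
        self.nodes = []

    def create(self, label, **props):
        """Like CREATE: always appends, so re-running duplicates."""
        node = {"label": label, **props}
        self.nodes.append(node)
        return node

    def merge(self, label, key, on_create=None, on_match=None):
        """Like MERGE: match on label + key properties, else create."""
        for node in self.nodes:
            if node["label"] == label and all(node.get(k) == v for k, v in key.items()):
                node.update(on_match or {})
                return node
        node = {"label": label, **key, **(on_create or {})}
        self.nodes.append(node)
        return node

store = ToyStore()
for _ in range(2):                    # simulate running an ingestion script twice
    store.create("Person", name="Alice")
    store.merge("Movie", {"title": "The Matrix"},
                on_create={"released": 1999}, on_match={"seen_again": True})

people = [n for n in store.nodes if n["label"] == "Person"]
movies = [n for n in store.nodes if n["label"] == "Movie"]
print(len(people), len(movies))       # 2 1

# Footgun: a mutable property inside the merge key changes the identity,
# so a new score creates a second record instead of updating the first.
store.merge("Rating", {"user": "Alice", "movie": "The Matrix", "stars": 5})
store.merge("Rating", {"user": "Alice", "movie": "The Matrix", "stars": 4})
ratings = [n for n in store.nodes if n["label"] == "Rating"]
print(len(ratings))                   # 2
```

The linear search inside `merge` is also a fair picture of what happens without an index on the key: every upsert walks all existing records.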

Reading data: the pattern is the query

traverse.cypher · Cypher
// "Who works at Acme?" — match the shape, return the people.
MATCH (p:Person)-[:WORKS_AT]->(c:Company {name:'Acme'})
RETURN p.name

// Variable-length traversal: friends up to 3 hops out.
// [:KNOWS*1..3] = follow KNOWS edges between 1 and 3 times.
MATCH (me:Person {name:'Alice'})-[:KNOWS*1..3]->(reach:Person)
RETURN DISTINCT reach.name

// Recommendation: movies my friends rated highly that I haven't seen.
MATCH (me:Person {name:'Alice'})-[:KNOWS]->(f)-[r:RATED]->(m:Movie)
WHERE r.stars >= 4 AND NOT (me)-[:RATED]->(m)
RETURN m.title, count(*) AS votes
ORDER BY votes DESC LIMIT 5
Walkthrough & tradeoffs
  1. You describe the shape of the data you want; Neo4j finds every subgraph that matches. The recommendation query above is one readable statement — the SQL equivalent is multiple self-JOINs plus a NOT EXISTS subquery.
  2. *1..3 is the superpower and the footgun. Variable-length paths can explode combinatorially on dense graphs — always bound the depth and prefer DISTINCT to collapse duplicate paths.
  3. NOT (me)-[:RATED]->(m) is a negative pattern filter (the graph counterpart of an anti-join). It is cheap in a graph because it's a neighbor check on nodes already in hand, not a table anti-join.
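For readers who think in loops rather than patterns, the bounded traversal and the recommendation query can be approximated in plain Python. The dictionaries and function names below are made-up sample data and helpers. `reachable_within` mirrors the `*1..max_hops` plus DISTINCT idea as a breadth-first search; one simplification is that it never returns the start node, even if a cycle leads back to it.

```python
from collections import Counter

# Hypothetical sample data: who knows whom, and who rated what.
KNOWS = {"Alice": ["Bob", "Carol"], "Bob": ["Dave"], "Carol": ["Dave"],
         "Dave": ["Erin"], "Erin": ["Frank"]}
RATED = {"Alice": {"Heat": 5},
         "Bob": {"Heat": 4, "The Matrix": 5},
         "Carol": {"The Matrix": 4, "Cats": 2}}

def reachable_within(start, max_hops):
    """People reachable in 1..max_hops KNOWS hops, each returned once."""
    visited, frontier, found = {start}, {start}, set()
    for _ in range(max_hops):
        frontier = {n for node in frontier for n in KNOWS.get(node, [])} - visited
        visited |= frontier            # the depth bound plus this set stop blow-up
        found |= frontier
    return found

def recommend(me, min_stars=4, limit=5):
    """Movies my direct friends rated highly that I have not rated."""
    seen = RATED.get(me, {})
    votes = Counter()
    for friend in KNOWS.get(me, []):
        for movie, stars in RATED.get(friend, {}).items():
            if stars >= min_stars and movie not in seen:   # the negative filter
                votes[movie] += 1
    return votes.most_common(limit)

reach = reachable_within("Alice", 3)
recs = recommend("Alice")
print(sorted(reach))   # ['Bob', 'Carol', 'Dave', 'Erin']
print(recs)            # [('The Matrix', 2)]
```

Dave is reachable through both Bob and Carol but appears once, which is the job DISTINCT does in the Cypher version.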
Why Neo4j for AI/ML work

It ships a Graph Data Science library (PageRank, community detection, node embeddings, link prediction) and native vector indexes (introduced in Neo4j 5.11), so you can store embeddings on nodes and run similarity search and graph traversal in one place. That combination is the backbone of modern GraphRAG.

04

Amazon Neptune

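Neptune is AWS's fully managed graph database. Its defining trait: it speaks both graph models. Same cluster, your choice of API: property graph via Gremlin or openCypher, or RDF via SPARQL. Gremlin and openCypher operate on the same property-graph data; RDF data loaded for SPARQL is kept separate and is not visible to them.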

Managed

No servers to patch. Auto-scaling storage to 128 TiB, up to 15 read replicas, multi-AZ failover, continuous backup to S3.

Dual model

Property graph (Gremlin / openCypher) and RDF (SPARQL 1.1) on one engine. Pick per workload.

Neptune Analytics

In-memory analytics + built-in vector search, plus a managed GraphRAG toolkit that wires into Amazon Bedrock.

Gremlin — imperative traversal

Where Cypher is declarative pattern-matching, Gremlin (Apache TinkerPop) is a step-by-step traversal pipeline — you literally chain the walk: start here, go out this edge, filter, repeat.

traverse.groovy · Gremlin
// Who works at Acme?
g.V().has('Company','name','Acme')   // start vertex
 .in('WORKS_AT')                     // hop back along incoming edges
 .values('name')                     // emit the property

// Friends up to 2 hops: emit() returns the vertices reached after each
// repeat, so 1-hop and 2-hop friends both come back (without it, times(2)
// yields only the vertices exactly 2 hops away).
g.V().has('Person','name','Alice')
 .repeat(out('KNOWS')).times(2).emit()
 .dedup().values('name')
Walkthrough & tradeoffs
  1. Each .step() transforms a stream of graph elements — it reads like a Unix pipe over the graph. Great for programmatic, dynamically-built traversals.
  2. Tradeoff vs. Cypher: Gremlin is more verbose and harder to read for complex patterns, but it's an embeddable, language-agnostic API (Java, Python, JS, Go) — handy when the query is generated by code rather than hand-written.
  3. On Neptune you can also send openCypher for the same property graph, so teams coming from Neo4j aren't forced into Gremlin.
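The "Unix pipe over the graph" idea can be imitated with a tiny chainable class. To be clear, this is a toy that borrows Gremlin's step names; it is not the TinkerPop or gremlinpython API, and `Traversal`, `V`, `in_`, and the sample graph are all hypothetical.

```python
class Traversal:
    """Toy step pipeline in the spirit of Gremlin; not the TinkerPop API."""

    def __init__(self, graph, current):
        self.graph = graph
        self.current = list(current)          # the stream of vertex ids so far

    def has(self, label, key, value):
        vertices = self.graph["vertices"]
        keep = [v for v in self.current
                if vertices[v]["label"] == label and vertices[v].get(key) == value]
        return Traversal(self.graph, keep)

    def out(self, edge_label):
        nxt = [dst for v in self.current
               for (src, lbl, dst) in self.graph["edges"]
               if src == v and lbl == edge_label]
        return Traversal(self.graph, nxt)

    def in_(self, edge_label):
        nxt = [src for v in self.current
               for (src, lbl, dst) in self.graph["edges"]
               if dst == v and lbl == edge_label]
        return Traversal(self.graph, nxt)

    def dedup(self):
        return Traversal(self.graph, dict.fromkeys(self.current))

    def values(self, key):
        return [self.graph["vertices"][v][key] for v in self.current]

def V(graph):
    """Start a traversal from every vertex."""
    return Traversal(graph, graph["vertices"].keys())

GRAPH = {
    "vertices": {1: {"label": "Person", "name": "Alice"},
                 2: {"label": "Person", "name": "Bob"},
                 3: {"label": "Company", "name": "Acme"},
                 4: {"label": "Person", "name": "Carol"}},
    "edges": [(1, "WORKS_AT", 3), (2, "WORKS_AT", 3),
              (1, "KNOWS", 2), (2, "KNOWS", 4)],
}

employees = V(GRAPH).has("Company", "name", "Acme").in_("WORKS_AT").values("name")
second_hop = (V(GRAPH).has("Person", "name", "Alice")
              .out("KNOWS").out("KNOWS").dedup().values("name"))
print(employees)    # ['Alice', 'Bob']
print(second_hop)   # ['Carol']
```

Because every step returns a new `Traversal`, a program can assemble the chain conditionally at runtime, which is the property that makes this style convenient for generated queries.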

SPARQL — querying RDF

query.sparql · SPARQL
PREFIX : <http://example.org/>
SELECT ?company ?coworker WHERE {
  ?p   :name      "Alice" .   # bind Alice
  ?p   :worksAt   ?company .       # her employer
  ?co  :worksAt   ?company .       # anyone at that employer
  ?co  :name      ?coworker .
  FILTER(?co != ?p)               # exclude Alice herself
}
Walkthrough & tradeoffs
  1. The WHERE block is a set of triple patterns with shared variables (?company). The engine finds all variable bindings that satisfy every pattern at once — a graph pattern match expressed as joins over triples.
  2. Strength: this query federates trivially. Add SERVICE <remote-endpoint> and you join across another organization's knowledge graph — the killer feature for open / government / life-sciences data.
  3. Tradeoff: RDF + SPARQL has a steeper learning curve and more ceremony (IRIs, prefixes, ontologies) than property-graph APIs. You pay that cost to buy interoperability and reasoning.
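How a set of triple patterns with shared variables gets solved can be sketched as a naive nested-loop join over a list of triples. This is only an illustration of the matching semantics under assumed sample data; real SPARQL engines use indexes and query planning, and the names `match_pattern` and `solve` are hypothetical.

```python
TRIPLES = [
    (":alice", ":name", "Alice"), (":bob", ":name", "Bob"), (":carol", ":name", "Carol"),
    (":alice", ":worksAt", ":acme"), (":bob", ":worksAt", ":acme"),
    (":carol", ":worksAt", ":initech"),
]

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend a binding so the pattern equals the triple; None on conflict."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended

def solve(patterns, triples):
    """Find every variable binding that satisfies all patterns at once."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [ext for b in bindings for t in triples
                    if (ext := match_pattern(pattern, t, b)) is not None]
    return bindings

PATTERNS = [
    ("?p", ":name", "Alice"),
    ("?p", ":worksAt", "?company"),
    ("?co", ":worksAt", "?company"),
    ("?co", ":name", "?coworker"),
]

# The list comprehension condition plays the role of FILTER(?co != ?p).
rows = [(b["?company"], b["?coworker"])
        for b in solve(PATTERNS, TRIPLES) if b["?co"] != b["?p"]]
print(rows)   # [(':acme', 'Bob')]
```

The shared `?company` variable is what joins Alice's employer to her coworkers: once bound by the second pattern, it constrains the third.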

Neo4j vs. Neptune — how to choose

Axis | Neo4j | Amazon Neptune
Data model | Property graph only | Property graph + RDF
Query languages | Cypher (+ openCypher, GQL) | Gremlin, openCypher, SPARQL
Ops model | Self-host, or Aura (managed) | Fully managed, AWS-native only
Analytics / ML | Rich: Graph Data Science library, native vectors | Neptune Analytics + vectors; thinner algorithm library
Ecosystem | Largest community, Bloom viz, drivers everywhere | Tight AWS integration (Bedrock, IAM, S3, SageMaker)
Pick when | You want the richest graph tooling & Cypher DX, or multi-cloud / on-prem | You're all-in on AWS, need RDF, or want zero graph ops burden
Federal context

For agencies and contractors (CGI Federal, etc.), Neptune's FedRAMP-authorized AWS footprint and RDF support align with government data-standard and interoperability requirements — a common reason it shows up in those job descriptions alongside Neo4j.

05

Knowledge graphs, semantic search & GraphRAG

This is the section that earns the line item on senior AI/ML postings. A knowledge graph turns documents and data into a network of typed entities and relationships — and pairing it with an LLM fixes the biggest weakness of plain vector RAG.

The problem with vanilla vector RAG

Standard RAG embeds text chunks, finds the k most similar to a question, and stuffs them into the prompt. It's excellent at local questions ("what does the contract say about X?") but weak at global, multi-hop ones ("how are these three programs connected?") — because the answer isn't in any single chunk; it's in the relationships across chunks. Similarity search can't follow a relationship.

What GraphRAG adds

DOCS ──► LLM entity + relation extraction ──► KNOWLEDGE GRAPH
                                                    │
          ┌─────────────────────────────────────────┘
          ▼
query ─► vector match entry nodes ─► traverse related entities ─► assemble context ─► LLM answer

graphrag_retrieve.cypher · Cypher
// 1) Vector search: find chunks most similar to the question embedding.
CALL db.index.vector.queryNodes('chunkEmbeddings', 5, $qEmbedding)
YIELD node AS chunk, score

// 2) Expand: pull entities mentioned in those chunks and ONE hop of
//    their relationships — the structural context vectors can't reach.
MATCH (chunk)-[:MENTIONS]->(e:Entity)-[rel]-(neighbor:Entity)
RETURN chunk.text, e.name, type(rel) AS relation, neighbor.name
LIMIT 40
Walkthrough & tradeoffs
  1. Step 1 is ordinary semantic search — the embeddings live on graph nodes, so no separate vector DB is required.
  2. Step 2 is the graph-native part: from each matched chunk we hop to the entities it mentions and their neighbors, returning explicit relation labels. The LLM now sees how facts connect, not just that they're textually similar.
  3. Tradeoff: GraphRAG costs more to build (LLM extraction is slow + expensive, and the graph needs curation). It pays off on connected, multi-hop corpora; for a flat FAQ, plain vector RAG is cheaper and good enough.
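The two retrieval steps can also be traced without a database. The sketch below uses made-up three-dimensional "embeddings", sample chunks, and sample relations; `cosine` and `retrieve` are hypothetical helpers, and a production system would use a real embedding model and a vector index rather than a sorted list.

```python
import math

# Hypothetical corpus: each chunk has text, a toy embedding, and extracted entities.
CHUNKS = {
    "c1": {"text": "Program A is funded by Agency X.",
           "embedding": [1.0, 0.0, 0.0], "mentions": ["Program A", "Agency X"]},
    "c2": {"text": "Program B shares a vendor with Program A.",
           "embedding": [0.8, 0.6, 0.0], "mentions": ["Program B", "Program A"]},
    "c3": {"text": "Cafeteria hours changed.",
           "embedding": [0.0, 0.0, 1.0], "mentions": []},
}
# Hypothetical knowledge-graph edges between entities.
RELATIONS = [
    ("Program A", "FUNDED_BY", "Agency X"),
    ("Program B", "SHARES_VENDOR_WITH", "Program A"),
    ("Agency X", "OVERSEES", "Program C"),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(q_embedding, k=2):
    # Step 1: vector search picks the k most similar chunks as entry points.
    ranked = sorted(CHUNKS, reverse=True,
                    key=lambda cid: cosine(q_embedding, CHUNKS[cid]["embedding"]))[:k]
    # Step 2: expand one hop from every entity those chunks mention.
    entities = {e for cid in ranked for e in CHUNKS[cid]["mentions"]}
    facts = [(s, r, o) for (s, r, o) in RELATIONS if s in entities or o in entities]
    return ranked, facts

chunk_ids, facts = retrieve([1.0, 0.1, 0.0])
print(chunk_ids)    # ['c1', 'c2']
for fact in facts:
    print(fact)
```

The fact about Program C reaches the context even though no retrieved chunk mentions it. That one-hop expansion is the part plain similarity search cannot provide.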

Other knowledge-graph payoffs

Entity resolution

"Bob Smith", "Robert Smith", "R. Smith" → one node. The graph makes dedup and identity-linking a structural operation, not a fuzzy guess in isolation.

Explainability

Every answer traces a concrete path of typed edges — auditable provenance, which matters enormously in regulated and federal settings.

Fraud & risk

Rings, shared devices, and circular money flows are cycles and shared-neighbor patterns: natural graph queries that in SQL need recursive CTEs or long self-JOIN chains and slow down sharply as depth grows.

Recommendations

Collaborative filtering becomes a two-hop traversal (you → similar users → their items), the same shape as the recommendation query in section 03.
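As one illustration of the payoffs above, the "circular money flow" pattern from the fraud case reduces to finding a path that leaves an account and returns to it. The sketch below is a plain depth-first search over invented account ids; `find_cycle` and its `max_depth` bound are assumptions, and a graph database would express the same thing as a bounded variable-length pattern.

```python
# Hypothetical transfers: account -> accounts it sent money to.
TRANSFERS = {
    "acct1": ["acct2"],
    "acct2": ["acct3"],
    "acct3": ["acct1", "acct4"],
    "acct4": [],
}

def find_cycle(graph, start, max_depth=6):
    """Return one path that leaves start and comes back to it, or None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == start:
                return path + [start]          # the money came home: a ring
            # Skip accounts already on this path and respect the depth bound.
            if nxt not in path and len(path) < max_depth:
                stack.append((nxt, path + [nxt]))
    return None

ring = find_cycle(TRANSFERS, "acct1")
no_ring = find_cycle(TRANSFERS, "acct4")
print(ring)      # ['acct1', 'acct2', 'acct3', 'acct1']
print(no_ring)   # None
```

The returned path doubles as the explanation: it lists the exact edges that make up the ring, which is the auditability point made under Explainability.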

06

The broader landscape

Neo4j and Neptune anchor the field, but the right pick depends on scale, model, and where it runs. A quick map of the rest.

Database | Model | Query lang | Distinctive strength
Neo4j | Property | Cypher | Market leader; richest tooling, GDS library, native vectors
Amazon Neptune | Property + RDF | Gremlin / openCypher / SPARQL | Fully managed AWS; dual-model; Bedrock GraphRAG
Memgraph | Property | Cypher | In-memory, real-time + streaming; Cypher-compatible drop-in
TigerGraph | Property | GSQL | Massively parallel; deep-link analytics on huge graphs
JanusGraph | Property | Gremlin | Open-source, scales on Cassandra/HBase + Elasticsearch
Dgraph | Property | DQL / GraphQL | Distributed, GraphQL-native API for app teams
ArangoDB | Multi-model | AQL | Graph + document + key/value in one engine
FalkorDB | Property | Cypher | Sparse-matrix linear algebra; fast, low-latency GraphRAG
GraphDB / Stardog | RDF | SPARQL | Reasoning, ontologies, enterprise semantic layers

A pragmatic decision path

Need RDF / reasoning / data-standard interop?
 └─ yes ─► Neptune (RDF), GraphDB, or Stardog
 └─ no  ─► property graph:
      ├─ all-in on AWS, want zero ops ───────► Neptune
      ├─ want richest tooling + Cypher DX ───► Neo4j
      ├─ real-time / streaming, in-memory ──► Memgraph
      └─ extreme scale analytics ───────────► TigerGraph
Interview-ready summary

Know the two models (property vs. RDF), one query language deeply (Cypher transfers across Neo4j + Neptune + Memgraph), why index-free adjacency beats JOINs on deep traversals, and how GraphRAG uses vector entry points + graph expansion to answer global, multi-hop questions that vanilla RAG can't. That's the senior AI/ML graph story.