Graph Databases — Neo4j, Neptune & the Knowledge Graph Stack

Why a graph at all?

A relational database stores relationships implicitly — as foreign keys you re-discover with JOINs at query time. A graph database stores relationships as first-class, materialized records. Walking from one entity to a connected one is a pointer hop, not a JOIN.

That single design choice flips the cost model. In SQL, the price of a "friends-of-friends-of-friends" query grows with the size of the tables: each hop is another JOIN, and every JOIN pays for index lookups or scans whose cost depends on how many rows the table holds. In a graph, the price grows with the size of the neighborhood you actually traverse, not with the total size of the dataset. This is called index-free adjacency: each node physically points to its neighbors.
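To make the cost model concrete, here is a minimal Python sketch, not a benchmark. It assumes a toy edge list (the names `EDGES`, `hops_by_scan`, and `hops_by_adjacency` are made up for illustration) and models the JOIN side as a full scan with no index, which is the worst case; a real relational engine with a B-tree index does better, but its per-hop cost still depends on table size.

```python
from collections import defaultdict

# Hypothetical toy data: KNOWS edges stored as (source, target) rows,
# the way a SQL join table would hold them.
EDGES = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]
# Unrelated rows that a table scan still has to read.
EDGES += [(f"u{i}", f"u{i + 1}") for i in range(1000)]


def hops_by_scan(start, hops):
    """Join-style walk: every hop re-reads the whole edge table (no index)."""
    frontier, rows_read = {start}, 0
    for _ in range(hops):
        nxt = set()
        for src, dst in EDGES:
            rows_read += 1
            if src in frontier:
                nxt.add(dst)
        frontier = nxt
    return frontier, rows_read


# Adjacency built once: each node holds direct references to its neighbors.
ADJ = defaultdict(list)
for src, dst in EDGES:
    ADJ[src].append(dst)


def hops_by_adjacency(start, hops):
    """Graph-style walk: only the edges leaving the current frontier are touched."""
    frontier, edges_followed = {start}, 0
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for neighbor in ADJ.get(node, []):
                edges_followed += 1
                nxt.add(neighbor)
        frontier = nxt
    return frontier, edges_followed


scan_result, rows_read = hops_by_scan("alice", 3)
adj_result, edges_followed = hops_by_adjacency("alice", 3)
```

Both walks reach the same node, but the scan's work scales with the table while the adjacency walk's work scales with the edges actually followed.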

Node

An entity — a person, movie, account, document, gene. Carries labels (its type) and properties (key/value data).

Relationship

A typed, directed connection between two nodes. Can carry its own properties (e.g. a RATED edge with stars: 5).

Traversal

Following relationships from a starting node. The core operation — and where graphs win against JOIN-heavy SQL by orders of magnitude.
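The three terms above map onto a very small data structure. The sketch below is an illustrative in-memory model, not how any particular database stores data; the class and function names (`Node`, `Relationship`, `connect`, `traverse`) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity semantics: two nodes are equal only if they are the same object
class Node:
    labels: set
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # relationships that start at this node


@dataclass(eq=False)
class Relationship:
    rel_type: str            # typed
    start: Node              # directed: start -> end
    end: Node
    properties: dict = field(default_factory=dict)


def connect(start, rel_type, end, **props):
    """Create a typed, directed relationship and attach it to its start node."""
    rel = Relationship(rel_type, start, end, props)
    start.out_edges.append(rel)
    return rel


def traverse(start, rel_type):
    """Follow outgoing relationships of one type from a starting node."""
    return [rel.end for rel in start.out_edges if rel.rel_type == rel_type]


alice = Node({"Person"}, {"name": "Alice"})
matrix = Node({"Movie"}, {"title": "The Matrix"})
rated = connect(alice, "RATED", matrix, stars=5)
```

Here `traverse(alice, "RATED")` returns the movie node directly by following the stored reference; no lookup by key is involved.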

[Interactive figure: a small property graph with draggable nodes; running a query highlights its traversal. Node labels: Person, Movie, Company, Genre.]

When the graph wins

Deep, variable-length relationship queries (recommendations, fraud rings, dependency chains, org hierarchies, knowledge graphs). The deeper the traversal, the bigger the gap vs. SQL.

When it doesn't

Flat tabular data, heavy column aggregations / OLAP, or workloads that are mostly single-table scans. A graph DB adds operational cost for no traversal benefit — reach for relational or columnar instead.

Two data models you must distinguish

"Graph database" hides a fork in the road. Almost every vendor sits on one side or the other — and Neptune notably supports both. Choosing wrong means rewriting your data layer.

Labeled Property Graph (LPG)

Nodes + edges, both can hold properties. Developer-friendly, schema-optional, intuitive for application data. Edges have an identity and attributes.

Used by: Neo4j, Memgraph, TigerGraph, JanusGraph, Neptune (Gremlin / openCypher), ArangoDB.

Query with: Cypher / openCypher, Gremlin, GSQL.

RDF Triple Store

Everything is a subject — predicate — object triple. W3C-standardized, globally addressable via IRIs, supports formal schemas (RDFS/OWL) and logical inference / reasoning.

Used by: Neptune (RDF), GraphDB, Stardog, AllegroGraph, Virtuoso, Blazegraph.

Query with: SPARQL.

Same fact, both models

PROPERTY GRAPH                      RDF TRIPLES
(:Person {name:"Alice"})            :Alice  :worksAt  :Acme .
      │ :WORKS_AT {since:2021}      :Alice  rdf:type  :Person .
      ▼                             :Acme   rdf:type  :Company .
(:Company {name:"Acme"})            :Alice  :name     "Alice" .

The property graph keeps Alice as one rich object; RDF shreds her into atomic statements. RDF's atomicity is what makes it great for merging knowledge from many sources and for reasoning, but it's more verbose for everyday application CRUD.
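The "shredding" can be sketched in a few lines of Python. This is an illustration under assumptions, not a converter for real data: triples are plain tuples, the function names are invented, and the statement node used for the edge property (built with the standard `rdf:subject` / `rdf:predicate` / `rdf:object` reification terms) is only one of several ways to attach `since: 2021` to an RDF edge.

```python
def node_to_triples(node_id, label, props):
    """Shred one property-graph node into (subject, predicate, object) triples."""
    triples = [(node_id, "rdf:type", f":{label}")]
    triples += [(node_id, f":{key}", value) for key, value in props.items()]
    return triples


def edge_to_triples(src_id, predicate, dst_id, props=None, stmt_id=":stmt1"):
    """Shred one edge. A plain triple has no slot for edge properties, so any
    properties hang off a separate statement node (hypothetical id stmt_id)."""
    triples = [(src_id, predicate, dst_id)]
    if props:
        triples += [
            (stmt_id, "rdf:subject", src_id),
            (stmt_id, "rdf:predicate", predicate),
            (stmt_id, "rdf:object", dst_id),
        ]
        triples += [(stmt_id, f":{key}", value) for key, value in props.items()]
    return triples


triples = (
    node_to_triples(":Alice", "Person", {"name": "Alice"})
    + node_to_triples(":Acme", "Company", {})
    + edge_to_triples(":Alice", ":worksAt", ":Acme", {"since": 2021})
)
```

The one rich Alice object becomes several independent statements, and the single edge property costs four extra triples, which is the verbosity the paragraph above refers to.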

Decision rule

Building an app where relationships drive features (recommendations, social, fraud)? Property graph. Integrating heterogeneous data, need interoperability, ontologies, or inference (life sciences, gov / federal data standards, the semantic web)? RDF.

Neo4j & Cypher

Neo4j is the market-leading property-graph database. Its query language, Cypher, reads like ASCII art of the pattern you're matching — nodes in (), relationships in -[]->. openCypher (the open spec) is also supported by Neptune and Memgraph, so the skill transfers.

Writing data: CREATE & MERGE

seed.cypher (Cypher)

// Three separate statements; each ends with ';' so the file runs as a script.

// CREATE always inserts: running this twice makes duplicates.
CREATE (a:Person {name: 'Alice', born: 1990});

// MERGE is "match-or-create" (upsert). The pattern in MERGE is the
// uniqueness key: Neo4j matches it; if absent, it creates it.
MERGE (m:Movie {title: 'The Matrix'})
  ON CREATE SET m.released = 1999, m.added = timestamp()
  ON MATCH SET  m.lastSeen = timestamp();

// Connect them with a typed, directed relationship that has a property.
MATCH (a:Person {name:'Alice'}), (m:Movie {title:'The Matrix'})
MERGE (a)-[r:RATED {stars: 5}]->(m);

Walkthrough & tradeoffs

CREATE is unconditional — fast, but re-running an ingestion script duplicates nodes. Use it only for guaranteed-fresh data.
MERGE is the idempotent workhorse for ETL: safe to re-run. The catch — back the merge key with a unique constraint/index (CREATE CONSTRAINT ... IS UNIQUE), or MERGE does a full label scan and gets slow at scale.
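Tradeoff: MERGE matches the entire pattern, properties included. Put a value that can change into the pattern (like stars: 5 on the RATED edge) and a later run with stars: 4 finds no match and creates a second RATED relationship. Keep only the identifying key in the MERGE pattern and move everything else to SET.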
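The difference between the two write modes can be imitated in memory. The sketch below is a simplified Python model of "always insert" versus "match-or-create", not Neo4j's implementation; `TinyGraph` and its methods are hypothetical names, and the linear search stands in for the label scan that happens when no index backs the merge key.

```python
class TinyGraph:
    """Toy node store used to contrast CREATE-like and MERGE-like writes."""

    def __init__(self):
        self.nodes = []  # each node: {"label": str, "props": dict}

    def create(self, label, **props):
        """Unconditional insert: never checks for an existing node."""
        node = {"label": label, "props": dict(props)}
        self.nodes.append(node)
        return node

    def merge(self, label, key, on_create=None, on_match=None):
        """Match a node whose label and key properties all agree, else create it."""
        for node in self.nodes:  # linear scan, like MERGE without an index
            if node["label"] == label and all(
                node["props"].get(k) == v for k, v in key.items()
            ):
                node["props"].update(on_match or {})
                return node
        return self.create(label, **key, **(on_create or {}))


g = TinyGraph()
g.create("Person", name="Alice")
g.create("Person", name="Alice")  # second run of the script: a duplicate node

for _ in range(2):  # re-running the merge is safe
    g.merge("Movie", {"title": "The Matrix"}, on_create={"released": 1999})

# Pitfall: a changing value inside the key defeats the match.
g.merge("Movie", {"title": "Arrival", "stars": 5})
g.merge("Movie", {"title": "Arrival", "stars": 4})  # no match, so a second node
```

After this runs the store holds two Alice nodes, one "The Matrix" node, and two "Arrival" nodes: the last pair is the unintended-duplicate case.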

Reading data: the pattern is the query

traverse.cypher (Cypher)

// Three independent queries; each ends with ';' so the file runs as a script.

// "Who works at Acme?": match the shape, return the people.
MATCH (p:Person)-[:WORKS_AT]->(c:Company {name:'Acme'})
RETURN p.name;

// Variable-length traversal: friends up to 3 hops out.
// [:KNOWS*1..3] = follow KNOWS edges between 1 and 3 times.
MATCH (me:Person {name:'Alice'})-[:KNOWS*1..3]->(reach:Person)
RETURN DISTINCT reach.name;

// Recommendation: movies my friends rated highly that I haven't seen.
MATCH (me:Person {name:'Alice'})-[:KNOWS]->(f)-[r:RATED]->(m:Movie)
WHERE r.stars >= 4 AND NOT (me)-[:RATED]->(m)
RETURN m.title, count(*) AS votes
ORDER BY votes DESC LIMIT 5;

Walkthrough & tradeoffs

You describe the shape of the data you want; Neo4j finds every subgraph that matches. The recommendation query above is one readable statement — the SQL equivalent is multiple self-JOINs plus a NOT EXISTS subquery.
*1..3 is the superpower and the footgun. Variable-length paths can explode combinatorially on dense graphs — always bound the depth and prefer DISTINCT to collapse duplicate paths.
NOT (me)-[:RATED]->(m) is a negated pattern filter (an anti-join, in SQL terms). It is cheap in a graph because it's a neighbor check, not a table anti-join.
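The "footgun" in the variable-length note is easy to see by counting. The following Python sketch enumerates every path of 1 to N hops on a small dense graph and compares the number of paths with the number of distinct nodes reached. It is a simplified model: the function name and the toy graph are assumptions, and the rule that one relationship is not reused within a path is meant to mirror Cypher's relationship uniqueness inside a single MATCH pattern.

```python
def count_paths_and_reach(adj, start, max_hops):
    """Enumerate all paths of 1..max_hops edges from `start`.

    Returns (number_of_paths, set_of_distinct_end_nodes). An edge is never
    used twice within the same path.
    """
    paths, reach = 0, set()
    stack = [(start, frozenset(), 0)]  # (current node, edges used so far, depth)
    while stack:
        node, used_edges, depth = stack.pop()
        if depth == max_hops:
            continue
        for neighbor in adj.get(node, []):
            edge = (node, neighbor)
            if edge in used_edges:
                continue
            paths += 1
            reach.add(neighbor)
            stack.append((neighbor, used_edges | {edge}, depth + 1))
    return paths, reach


# Hypothetical dense graph: five people who all know each other.
PEOPLE = "abcde"
DENSE = {p: [q for q in PEOPLE if q != p] for p in PEOPLE}

paths, reach = count_paths_and_reach(DENSE, "a", max_hops=3)
```

On this five-node graph the walk finds dozens of paths but only five distinct nodes, and the ratio gets worse with every extra hop. That is the work DISTINCT collapses and the reason an upper bound on depth matters.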

Why Neo4j for AI/ML work

It ships a Graph Data Science library (PageRank, community detection, node embeddings, link prediction) and native vector indexes (added in the 5.x line, from 5.11), so you can store embeddings on nodes and do similarity search and graph traversal in one place. That combination is the backbone of modern GraphRAG.

Amazon Neptune

Neptune is AWS's fully managed graph database. Its defining trait: it speaks both graph models. Same cluster, your choice of API — property graph via Gremlin or openCypher, or RDF via SPARQL.

Managed

No servers to patch. Auto-scaling storage to 128 TiB, up to 15 read replicas, multi-AZ failover, continuous backup to S3.

Dual model

Property graph (Gremlin / openCypher) and RDF (SPARQL 1.1) on one engine. Pick per workload.

Neptune Analytics

In-memory analytics + built-in vector search, plus a managed GraphRAG toolkit that wires into Amazon Bedrock.

Gremlin — imperative traversal

Where Cypher is declarative pattern-matching, Gremlin (Apache TinkerPop) is a step-by-step traversal pipeline — you literally chain the walk: start here, go out this edge, filter, repeat.

traverse.groovy (Gremlin)

// Where does Alice work?
g.V().has('Person','name','Alice')   // start vertex
 .out('WORKS_AT')                    // hop along outgoing edge
 .values('name')                     // emit the property

// Friends up to 2 hops: emit() yields the vertices found after each
// repeat, not only those exactly 2 hops out.
g.V().has('Person','name','Alice')
 .repeat(out('KNOWS')).times(2).emit()
 .dedup().values('name')

Walkthrough & tradeoffs

Each .step() transforms a stream of graph elements — it reads like a Unix pipe over the graph. Great for programmatic, dynamically-built traversals.
Tradeoff vs. Cypher: Gremlin is more verbose and harder to read for complex patterns, but it's an embeddable, language-agnostic API (Java, Python, JS, Go) — handy when the query is generated by code rather than hand-written.
On Neptune you can also send openCypher for the same property graph, so teams coming from Neo4j aren't forced into Gremlin.
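The "Unix pipe over the graph" idea can be mimicked with Python generators. This is a conceptual sketch only: it does not use the real gremlinpython driver, the function names merely echo Gremlin step names, and the toy `VERTICES` / `OUT_EDGES` data is invented for the example.

```python
# Hypothetical toy graph: vertex id -> properties, plus labeled outgoing edges.
VERTICES = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Person", "name": "Carol"},
    4: {"label": "Company", "name": "Acme"},
}
OUT_EDGES = {
    1: [("KNOWS", 2), ("WORKS_AT", 4)],
    2: [("KNOWS", 3)],
}


def V():
    """Source step: stream every vertex id."""
    yield from VERTICES


def has(stream, label, key, value):
    """Filter step: keep vertices with the given label and property value."""
    for v in stream:
        props = VERTICES[v]
        if props["label"] == label and props.get(key) == value:
            yield v


def out(stream, edge_label):
    """Move step: follow outgoing edges with the given label."""
    for v in stream:
        for label, target in OUT_EDGES.get(v, []):
            if label == edge_label:
                yield target


def repeat_out_emit(stream, edge_label, times):
    """Rough analogue of repeat(out(label)).times(n).emit(): yield after every hop."""
    frontier = list(stream)
    for _ in range(times):
        frontier = list(out(frontier, edge_label))
        yield from frontier


def dedup(stream):
    """Drop elements already seen earlier in the stream."""
    seen = set()
    for v in stream:
        if v not in seen:
            seen.add(v)
            yield v


def values(stream, key):
    """Map step: emit one property of each vertex."""
    for v in stream:
        yield VERTICES[v][key]


alice = lambda: has(V(), "Person", "name", "Alice")
employer = list(values(out(alice(), "WORKS_AT"), "name"))
friends = list(values(dedup(repeat_out_emit(alice(), "KNOWS", 2)), "name"))
```

Because every step is just a function over a stream, a program can assemble the chain at runtime, which is the property that makes this style convenient for generated queries.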

SPARQL — querying RDF

query.sparql (SPARQL)

PREFIX : <http://example.org/>
SELECT ?company ?coworker WHERE {
  ?p   :name      "Alice" .   # bind Alice
  ?p   :worksAt   ?company .       # her employer
  ?co  :worksAt   ?company .       # anyone at that employer
  ?co  :name      ?coworker .
  FILTER(?co != ?p)               # exclude Alice herself
}

Walkthrough & tradeoffs

The WHERE block is a set of triple patterns with shared variables (?company). The engine finds all variable bindings that satisfy every pattern at once — a graph pattern match expressed as joins over triples.
Strength: this query federates trivially. Add SERVICE <remote-endpoint> and you join across another organization's knowledge graph — the killer feature for open / government / life-sciences data.
Tradeoff: RDF + SPARQL has a steeper learning curve and more ceremony (IRIs, prefixes, ontologies) than property-graph APIs. You pay that cost to buy interoperability and reasoning.
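How shared variables turn triple patterns into joins can be shown with a tiny matcher. The Python sketch below is a naive nested-loop evaluation of a basic graph pattern over an in-memory list of tuples; real SPARQL engines use indexes and a query planner. The data and the helper names (`match_pattern`, `solve`) are assumptions for illustration.

```python
# Hypothetical toy dataset as (subject, predicate, object) tuples.
TRIPLES = [
    (":alice", ":name", "Alice"),
    (":bob", ":name", "Bob"),
    (":carol", ":name", "Carol"),
    (":alice", ":worksAt", ":acme"),
    (":bob", ":worksAt", ":acme"),
    (":carol", ":worksAt", ":globex"),
]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so `pattern` matches `triple`; return None on conflict."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable keeps its earlier value; a different value is a conflict.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def solve(patterns, triples):
    """Find every binding that satisfies all patterns (a join per pattern)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match_pattern(pattern, t, b)) is not None
        ]
    return bindings


# The same four patterns as the SPARQL query above.
QUERY = [
    ("?p", ":name", "Alice"),
    ("?p", ":worksAt", "?company"),
    ("?co", ":worksAt", "?company"),
    ("?co", ":name", "?coworker"),
]
results = [
    (b["?company"], b["?coworker"])
    for b in solve(QUERY, TRIPLES)
    if b["?co"] != b["?p"]  # the FILTER clause
]
```

Each pattern narrows or extends the set of candidate bindings, and the shared `?company` variable is what forces the second and third patterns to agree on the employer.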

Neo4j vs. Neptune — how to choose

Axis	Neo4j	Amazon Neptune
Data model	Property graph only	Property graph + RDF
Query languages	Cypher (+ openCypher, GQL)	Gremlin, openCypher, SPARQL
Ops model	Self-host, or Aura (managed)	Fully managed, AWS-native only
Analytics / ML	Rich: Graph Data Science library, native vectors	Neptune Analytics + vectors; thinner algorithm library
Ecosystem	Largest community, Bloom viz, drivers everywhere	Tight AWS integration (Bedrock, IAM, S3, SageMaker)
Pick when	You want the richest graph tooling & Cypher DX, or multi-cloud / on-prem	You're all-in on AWS, need RDF, or want zero graph ops burden

Federal context

For agencies and contractors (CGI Federal, etc.), Neptune's FedRAMP-authorized AWS footprint and RDF support align with government data-standard and interoperability requirements — a common reason it shows up in those job descriptions alongside Neo4j.

Knowledge graphs, semantic search & GraphRAG

This is the section that earns the line item on senior AI/ML postings. A knowledge graph turns documents and data into a network of typed entities and relationships — and pairing it with an LLM fixes the biggest weakness of plain vector RAG.

The problem with vanilla vector RAG

Standard RAG embeds text chunks, finds the k most similar to a question, and stuffs them into the prompt. It's excellent at local questions ("what does the contract say about X?") but weak at global, multi-hop ones ("how are these three programs connected?") — because the answer isn't in any single chunk; it's in the relationships across chunks. Similarity search can't follow a relationship.

What GraphRAG adds

DOCS ──► LLM entity + relation extraction ──► KNOWLEDGE GRAPH
                                                     │
  ┌──────────────────────────────────────────────────┘
  ▼
query ─► vector match entry nodes ─► traverse related entities ─► assemble context ─► LLM answer

Build: an LLM reads your corpus and extracts entities + relationships, writing them into a graph (Neo4j / Neptune). Optionally run community detection and pre-summarize clusters.
Retrieve (hybrid): use a vector index to find the most relevant entry nodes, then traverse their relationships to pull in connected facts the embedding alone would miss. You get both semantic similarity and explicit structure.
Answer: the assembled subgraph becomes grounded, explainable context — and because edges are explicit, you can cite the path, reducing hallucination.

graphrag_retrieve.cypher (Cypher)

// 1) Vector search: find chunks most similar to the question embedding.
CALL db.index.vector.queryNodes('chunkEmbeddings', 5, $qEmbedding)
YIELD node AS chunk, score

// 2) Expand: pull entities mentioned in those chunks and ONE hop of
//    their relationships — the structural context vectors can't reach.
MATCH (chunk)-[:MENTIONS]->(e:Entity)-[rel]-(neighbor:Entity)
RETURN chunk.text, e.name, type(rel) AS relation, neighbor.name
LIMIT 40

Walkthrough & tradeoffs

Step 1 is ordinary semantic search — the embeddings live on graph nodes, so no separate vector DB is required.
Step 2 is the graph-native part: from each matched chunk we hop to the entities it mentions and their neighbors, returning explicit relation labels. The LLM now sees how facts connect, not just that they're textually similar.
Tradeoff: GraphRAG costs more to build (LLM extraction is slow + expensive, and the graph needs curation). It pays off on connected, multi-hop corpora; for a flat FAQ, plain vector RAG is cheaper and good enough.
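The two retrieval steps can also be walked through in plain Python on toy data. This sketch assumes hand-written two-dimensional embeddings and a three-chunk corpus (all names and values are invented); it only illustrates the vector-entry-then-expand idea and says nothing about production retrieval quality.

```python
import math

# Hypothetical toy corpus: chunk text with made-up 2-D embeddings.
CHUNKS = {
    "c1": {"text": "Alice joined Acme in 2021.", "embedding": [1.0, 0.0]},
    "c2": {"text": "Acme acquired Globex.", "embedding": [0.8, 0.6]},
    "c3": {"text": "The weather was mild.", "embedding": [0.0, 1.0]},
}
# Chunk -> entities it mentions, and typed relations between entities.
MENTIONS = {"c1": ["Alice", "Acme"], "c2": ["Acme", "Globex"], "c3": []}
RELATIONS = [("Alice", "WORKS_AT", "Acme"), ("Acme", "ACQUIRED", "Globex")]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def retrieve(q_embedding, k=1):
    # Step 1: vector search for the k chunks most similar to the question.
    top = sorted(
        CHUNKS,
        key=lambda c: cosine(CHUNKS[c]["embedding"], q_embedding),
        reverse=True,
    )[:k]
    # Step 2: expand one hop (either direction) from the mentioned entities.
    facts = set()
    for chunk_id in top:
        for entity in MENTIONS[chunk_id]:
            for src, rel, dst in RELATIONS:
                if entity in (src, dst):
                    facts.add((src, rel, dst))
    return top, sorted(facts)


top_chunks, facts = retrieve([1.0, 0.0], k=1)
```

With `k=1` only chunk `c1` is retrieved by similarity, yet the expansion step still surfaces the Acme/Globex relation that lives in a different chunk. That is the kind of connected fact similarity search alone would miss.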

Other knowledge-graph payoffs

Entity resolution

"Bob Smith", "Robert Smith", "R. Smith" → one node. The graph makes dedup and identity-linking a structural operation, not a fuzzy guess in isolation.

Explainability

Every answer traces a concrete path of typed edges — auditable provenance, which matters enormously in regulated and federal settings.

Fraud & risk

Rings, shared devices, and circular money flows are cycles and shared-neighbor patterns: natural graph queries that are hard to express and slow to run as SQL self-joins.

Recommendations

Collaborative filtering becomes a two-hop traversal (you → similar users → their items), as in the recommendation query in the Cypher section above.
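As one worked instance of these payoffs, the recommendation traversal fits in a few lines of Python. The sketch assumes toy `KNOWS` and `RATED` dictionaries (invented data) and uses direct friends as the stand-in for "similar users"; it follows the same logic as the Cypher recommendation query earlier: friends' high ratings, minus what the user has already rated, ranked by vote count.

```python
from collections import Counter

# Hypothetical toy data: who knows whom, and star ratings per person.
KNOWS = {"Alice": ["Bob", "Carol"]}
RATED = {
    "Alice": {"The Matrix": 5},
    "Bob": {"The Matrix": 5, "Inception": 4, "Cats": 1},
    "Carol": {"Inception": 5, "Arrival": 4},
}


def recommend(user, min_stars=4, limit=5):
    """Two hops: user -> friends -> movies they rated highly."""
    seen = set(RATED.get(user, {}))  # what the user already rated
    votes = Counter()
    for friend in KNOWS.get(user, []):                          # hop 1
        for movie, stars in RATED.get(friend, {}).items():      # hop 2
            if stars >= min_stars and movie not in seen:
                votes[movie] += 1
    return votes.most_common(limit)


recommendations = recommend("Alice")
```

"The Matrix" is excluded because Alice already rated it, and "Cats" is excluded by the star threshold, leaving the two movies her friends liked.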

The broader landscape

Neo4j and Neptune anchor the field, but the right pick depends on scale, model, and where it runs. A quick map of the rest.

Database	Model	Query lang	Distinctive strength
Neo4j	Property	Cypher	Market leader; richest tooling, GDS library, native vectors
Amazon Neptune	Property + RDF	Gremlin / openCypher / SPARQL	Fully managed AWS; dual-model; Bedrock GraphRAG
Memgraph	Property	Cypher	In-memory, real-time + streaming; Cypher-compatible drop-in
TigerGraph	Property	GSQL	Massively parallel; deep-link analytics on huge graphs
JanusGraph	Property	Gremlin	Open-source, scales on Cassandra/HBase + Elasticsearch
Dgraph	Property	DQL / GraphQL	Distributed, GraphQL-native API for app teams
ArangoDB	Multi-model	AQL	Graph + document + key/value in one engine
FalkorDB	Property	Cypher	Sparse-matrix linear algebra; fast, low-latency GraphRAG
GraphDB / Stardog	RDF	SPARQL	Reasoning, ontologies, enterprise semantic layers

A pragmatic decision path

Need RDF / reasoning / data-standard interop?
 ├─ yes ─► Neptune (RDF), GraphDB, or Stardog
 └─ no  ─► property graph:
      ├─ all-in on AWS, want zero ops ───────► Neptune
      ├─ want richest tooling + Cypher DX ───► Neo4j
      ├─ real-time / streaming, in-memory ──► Memgraph
      └─ extreme scale analytics ───────────► TigerGraph
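The decision path can be written down as a small function, which makes the branch order explicit. This is a sketch of the heuristic above, not a selection tool: the function name and flags are hypothetical, and checking the property-graph branches in this particular order (with Neo4j as the fallback) is an assumption, since the path does not rank them.

```python
def pick_graph_db(needs_rdf=False, all_in_aws=False,
                  realtime_streaming=False, extreme_scale=False):
    """Return candidate databases following the decision path above."""
    if needs_rdf:
        # RDF / reasoning / data-standard interop comes first.
        return ["Neptune (RDF)", "GraphDB", "Stardog"]
    # Everything below is the property-graph side of the fork.
    if all_in_aws:
        return ["Neptune"]
    if realtime_streaming:
        return ["Memgraph"]
    if extreme_scale:
        return ["TigerGraph"]
    # Assumed default: richest tooling and Cypher developer experience.
    return ["Neo4j"]


choice = pick_graph_db(all_in_aws=True)
```

Real selections weigh several of these factors at once, so treat the output as a starting shortlist rather than an answer.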

Interview-ready summary

Know the two models (property vs. RDF), one query language deeply (Cypher transfers across Neo4j + Neptune + Memgraph), why index-free adjacency beats JOINs on deep traversals, and how GraphRAG uses vector entry points + graph expansion to answer global, multi-hop questions that vanilla RAG can't. That's the senior AI/ML graph story.