When relationships are the data. How to model, store, and query connected information — and why graph + LLM (GraphRAG) is becoming table stakes for senior AI/ML roles.
A relational database stores relationships implicitly — as foreign keys you re-discover with JOINs at query time. A graph database stores relationships as first-class, materialized records. Walking from one entity to a connected one is a pointer hop, not a JOIN.
That single design choice flips the cost model. In SQL, the price of a "friends-of-friends-of-friends" query grows with the size of the tables: each hop is another JOIN, and every JOIN has to find matching rows through an index lookup or a scan over the whole table. In a graph, the price grows with the size of the answer, because you only touch the nodes you actually traverse. This is called index-free adjacency: each node physically points to its neighbors.
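To make the cost argument concrete, here is a minimal Python sketch. It assumes the worst case for the relational side (an unindexed edge table that is fully scanned on every hop); real databases use indexes, so read the counters as an illustration of the scaling argument, not a benchmark. All names and sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (person, friend) rows, as they might sit in a SQL table.
EDGES = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "dave"),
    ("erin", "frank"), ("frank", "grace"), ("grace", "heidi"),
]


def hops_by_join(edges, start, depth):
    """Relational style (assumed unindexed): every hop re-scans the edge table."""
    frontier, rows_scanned = {start}, 0
    for _ in range(depth):
        next_frontier = set()
        for src, dst in edges:  # full scan, regardless of frontier size
            rows_scanned += 1
            if src in frontier:
                next_frontier.add(dst)
        frontier = next_frontier
    return frontier, rows_scanned


def hops_by_adjacency(adjacency, start, depth):
    """Graph style: each node holds direct references to its neighbors."""
    frontier, edges_touched = {start}, 0
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency[node]:  # only edges leaving the frontier
                edges_touched += 1
                next_frontier.add(neighbor)
        frontier = next_frontier
    return frontier, edges_touched


ADJACENCY = defaultdict(list)
for src, dst in EDGES:
    ADJACENCY[src].append(dst)

join_result, join_cost = hops_by_join(EDGES, "alice", 3)
graph_result, graph_cost = hops_by_adjacency(ADJACENCY, "alice", 3)
print(join_result, join_cost)    # work proportional to table size times depth
print(graph_result, graph_cost)  # work proportional to edges actually walked
```

Both functions return the same people. The difference is that the join-style cost keeps growing as unrelated rows are added to the table, while the adjacency-style cost depends only on the neighborhood around the start node.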
Node: an entity, such as a person, movie, account, document, or gene. Carries labels (its type) and properties (key/value data).
Relationship: a typed, directed connection between two nodes. Can carry its own properties (e.g. a RATED edge with stars: 5).
Traversal: following relationships from a starting node. This is the core operation, and it is where graphs can beat JOIN-heavy SQL by orders of magnitude on deep queries.
Good fit: deep, variable-length relationship queries (recommendations, fraud rings, dependency chains, org hierarchies, knowledge graphs). The deeper the traversal, the bigger the gap vs. SQL.
Poor fit: flat tabular data, heavy column aggregations / OLAP, or workloads that are mostly single-table scans. A graph DB adds operational cost for no traversal benefit, so reach for relational or columnar instead.
"Graph database" hides a fork in the road. Almost every vendor sits on one side or the other — and Neptune notably supports both. Choosing wrong means rewriting your data layer.
Property graph: nodes and edges, both of which can hold properties. Developer-friendly, schema-optional, intuitive for application data. Edges have an identity and attributes.
Used by: Neo4j, Memgraph, TigerGraph, JanusGraph, Neptune (Gremlin / openCypher), ArangoDB.
Query with: Cypher / openCypher, Gremlin, GSQL.
RDF: everything is a subject-predicate-object triple. W3C-standardized, globally addressable via IRIs, supports formal schemas (RDFS/OWL) and logical inference / reasoning.
Used by: Neptune (RDF), GraphDB, Stardog, AllegroGraph, Virtuoso, Blazegraph.
Query with: SPARQL.
The property graph keeps Alice as one rich object; RDF shreds her into atomic statements. RDF's atomicity is what makes it great for merging knowledge from many sources and for reasoning, but it's more verbose for everyday application CRUD.
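The contrast can be sketched in a few lines of Python. This is an illustration only: the dictionary layout, the `http://example.org/` namespace, and the `node_to_triples` helper are assumptions made for the example, not the storage format of any real engine.

```python
# Property-graph view: Alice is one node with a label and a bag of properties.
alice_node = {
    "id": 1,
    "labels": ["Person"],
    "properties": {"name": "Alice", "born": 1990},
}

EX = "http://example.org/"  # hypothetical namespace for the example


def node_to_triples(node, base=EX):
    """Shred one property-graph node into (subject, predicate, object) triples."""
    subject = f"{base}node/{node['id']}"
    # One triple per label ...
    triples = [(subject, "rdf:type", f"{base}{label}") for label in node["labels"]]
    # ... and one triple per property.
    triples += [
        (subject, f"{base}{key}", value)
        for key, value in node["properties"].items()
    ]
    return triples


alice_triples = node_to_triples(alice_node)
for triple in alice_triples:
    print(triple)
```

One node with two properties becomes three independent statements. Each statement can be merged with statements from another source that uses the same subject IRI, which is the property that makes RDF convenient for integration and more verbose for CRUD.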
Building an app where relationships drive features (recommendations, social, fraud)? Property graph. Integrating heterogeneous data, need interoperability, ontologies, or inference (life sciences, gov / federal data standards, the semantic web)? RDF.
Neo4j is the market-leading property-graph database. Its query language, Cypher, reads like ASCII art of the pattern you're matching: nodes in (), relationships in -[]->. openCypher (the open spec) is also supported by Neptune and Memgraph, so the skill transfers.

    // CREATE always inserts: running this twice makes duplicates.
    CREATE (a:Person {name: 'Alice', born: 1990});

    // MERGE is "match-or-create" (upsert). The pattern in MERGE is the
    // uniqueness key: Neo4j matches it; if absent, it creates it.
    MERGE (m:Movie {title: 'The Matrix'})
      ON CREATE SET m.released = 1999, m.added = timestamp()
      ON MATCH SET m.lastSeen = timestamp();

    // Connect them with a typed, directed relationship that has a property.
    MATCH (a:Person {name: 'Alice'}), (m:Movie {title: 'The Matrix'})
    MERGE (a)-[r:RATED {stars: 5}]->(m);

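The difference between the two write clauses in the snippet above can be mimicked with a plain in-memory store. The sketch below is a toy model written for this explanation (the `TinyGraph` class and its method names are hypothetical, not a Neo4j API); it only illustrates the "always insert" versus "match or create" semantics.

```python
class TinyGraph:
    """Toy node store used to contrast CREATE-like and MERGE-like writes."""

    def __init__(self):
        self.nodes = []

    def create(self, label, **props):
        # CREATE-like: unconditional insert, so re-running duplicates data.
        node = {"label": label, **props}
        self.nodes.append(node)
        return node

    def merge(self, label, key, on_create=None, on_match=None):
        # MERGE-like: look for a node matching the label and the whole key.
        # This linear search plays the role of an unindexed label scan.
        for node in self.nodes:
            if node["label"] == label and all(node.get(k) == v for k, v in key.items()):
                node.update(on_match or {})
                return node
        node = {"label": label, **key, **(on_create or {})}
        self.nodes.append(node)
        return node

    def count(self, label):
        return sum(1 for node in self.nodes if node["label"] == label)


g = TinyGraph()

# Running the same "ingestion" twice with create() duplicates Alice.
g.create("Person", name="Alice", born=1990)
g.create("Person", name="Alice", born=1990)

# Running it twice with merge() leaves exactly one movie node.
first = g.merge("Movie", {"title": "The Matrix"}, on_create={"released": 1999})
second = g.merge("Movie", {"title": "The Matrix"}, on_match={"seen_again": True})

print(g.count("Person"), g.count("Movie"))
```

Only the `key` argument decides whether a match exists, which is why the choice of merge key matters: anything placed in the key that can differ between runs turns the upsert back into an insert.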
CREATE is unconditional: fast, but re-running an ingestion script duplicates nodes. Use it only for guaranteed-fresh data.
MERGE is the idempotent workhorse for ETL: safe to re-run. The catch: back the merge key with a unique constraint/index (CREATE CONSTRAINT ... IS UNIQUE), or MERGE does a full label scan and gets slow at scale.
MERGE matches the whole pattern, properties included. On a pattern like the RATED edge, a mutable property such as stars in the merge key creates a second relationship whenever the stored value differs, so be precise about what goes in the merge key vs. the SET.

    // "Who works at Acme?" Match the shape, return the people.
    MATCH (p:Person)-[:WORKS_AT]->(c:Company {name: 'Acme'})
    RETURN p.name;

    // Variable-length traversal: friends up to 3 hops out.
    // [:KNOWS*1..3] = follow KNOWS edges between 1 and 3 times.
    MATCH (me:Person {name: 'Alice'})-[:KNOWS*1..3]->(reach:Person)
    RETURN DISTINCT reach.name;

    // Recommendation: movies my friends rated highly that I haven't seen.
    MATCH (me:Person {name: 'Alice'})-[:KNOWS]->(f)-[r:RATED]->(m:Movie)
    WHERE r.stars >= 4 AND NOT (me)-[:RATED]->(m)
    RETURN m.title, count(*) AS votes
    ORDER BY votes DESC LIMIT 5;

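For readers who think more easily in imperative code, the recommendation query above corresponds roughly to the following Python sketch. The dictionaries stand in for KNOWS and RATED relationships; the people, titles, and ratings are made-up sample data.

```python
from collections import Counter

# Hypothetical sample data standing in for KNOWS and RATED relationships.
KNOWS = {"alice": ["bob", "carol"]}
RATED = {
    "alice": {"The Matrix": 5},
    "bob": {"The Matrix": 5, "Inception": 4, "Cats": 1},
    "carol": {"Inception": 5, "Arrival": 4},
}


def recommend(me, min_stars=4, limit=5):
    """Movies that friends rated at least `min_stars` and `me` has not rated."""
    seen = RATED.get(me, {})
    votes = Counter()
    for friend in KNOWS.get(me, []):                          # (me)-[:KNOWS]->(f)
        for movie, stars in RATED.get(friend, {}).items():    # (f)-[r:RATED]->(m)
            if stars >= min_stars and movie not in seen:      # WHERE ... AND NOT ...
                votes[movie] += 1                             # count(*) AS votes
    return votes.most_common(limit)                           # ORDER BY votes DESC LIMIT


recommendations = recommend("alice")
print(recommendations)
```

The `movie not in seen` test is the neighbor check mentioned in the text: it looks only at the movies attached to one person, not at a whole ratings table.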
NOT EXISTS subquery.
*1..3 is the superpower and the footgun. Variable-length paths can explode combinatorially on dense graphs, so always bound the depth and prefer DISTINCT to collapse duplicate paths.
NOT (me)-[:RATED]->(m) is a negated-pattern filter, cheap in a graph because it's a neighbor check rather than a table anti-join.
Neo4j ships a Graph Data Science library (PageRank, community detection, node embeddings, link prediction) and native vector indexes (since v5), so you can store embeddings on nodes and do similarity search and graph traversal in one place. That combination is the backbone of modern GraphRAG.
Neptune is AWS's fully managed graph database. Its defining trait: it speaks both graph models. Same cluster, your choice of API — property graph via Gremlin or openCypher, or RDF via SPARQL.
No servers to patch. Auto-scaling storage to 128 TiB, up to 15 read replicas, multi-AZ failover, continuous backup to S3.
Property graph (Gremlin / openCypher) and RDF (SPARQL 1.1) on one engine. Pick per workload.
In-memory analytics + built-in vector search, plus a managed GraphRAG toolkit that wires into Amazon Bedrock.
Where Cypher is declarative pattern-matching, Gremlin (Apache TinkerPop) is a step-by-step traversal pipeline: you literally chain the walk (start here, go out this edge, filter, repeat).

    // Where does Alice work?
    g.V().has('Person', 'name', 'Alice')   // start vertex
      .out('WORKS_AT')                     // hop along outgoing edge
      .values('name')                      // emit the property

    // Friends up to 2 hops: repeat the 'out' step twice and emit after each hop.
    g.V().has('Person', 'name', 'Alice')
      .repeat(out('KNOWS')).times(2).emit()
      .dedup().values('name')

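The pipeline idea can be imitated with Python generators. This is a hedged sketch of the concept only: `out`, `dedup`, and `repeat_out` are hypothetical helpers written for this example, not the TinkerPop API. It also shows one detail worth knowing: in TinkerPop, repeat(...).times(n) yields only the traversers left after the final iteration, while adding emit() also yields the intermediate ones.

```python
# Hypothetical toy graph: vertex -> {edge label -> [neighbor vertices]}.
EDGES = {
    "alice": {"KNOWS": ["bob", "carol"]},
    "bob": {"KNOWS": ["dave"]},
    "carol": {"KNOWS": ["dave"]},
    "dave": {},
}


def out(vertices, label):
    """Step: replace each incoming vertex with its outgoing neighbors."""
    for vertex in vertices:
        yield from EDGES[vertex].get(label, [])


def dedup(vertices):
    """Step: drop vertices that have already passed through the stream."""
    seen = set()
    for vertex in vertices:
        if vertex not in seen:
            seen.add(vertex)
            yield vertex


def repeat_out(vertices, label, times, emit=False):
    """Step: apply `out` a fixed number of times, optionally emitting each hop."""
    current = list(vertices)
    for _ in range(times):
        current = list(out(current, label))
        if emit:
            yield from current      # intermediate results, like emit()
    if not emit:
        yield from current          # only the final hop, like times(n) alone


exactly_two = list(dedup(repeat_out(["alice"], "KNOWS", 2)))
up_to_two = list(dedup(repeat_out(["alice"], "KNOWS", 2, emit=True)))
print(exactly_two)
print(up_to_two)
```

Each function consumes a stream and produces a stream, so steps compose by nesting, much as Gremlin steps compose by chaining.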
Each .step() transforms a stream of graph elements, so a traversal reads like a Unix pipe over the graph. Great for programmatic, dynamically-built traversals.

    PREFIX : <http://example.org/>
    SELECT ?company ?coworker WHERE {
      ?p  :name    "Alice" .      # bind Alice
      ?p  :worksAt ?company .     # her employer
      ?co :worksAt ?company .     # anyone at that employer
      ?co :name    ?coworker .
      FILTER(?co != ?p)           # exclude Alice herself
    }

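The SPARQL query above is evaluated by finding variable bindings that satisfy every triple pattern. A naive version of that matching fits in a short Python sketch. It is a teaching toy under stated assumptions (string terms, variables marked with a leading "?", nested-loop evaluation, hypothetical sample triples), not how a production triple store plans queries.

```python
# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("p1", "name", "Alice"), ("p1", "worksAt", "acme"),
    ("p2", "name", "Bob"),   ("p2", "worksAt", "acme"),
    ("p3", "name", "Carol"), ("p3", "worksAt", "globex"),
]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            # A shared variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def solve(patterns, triples):
    """Join the patterns one by one, keeping only consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


query = [
    ("?p", "name", "Alice"),
    ("?p", "worksAt", "?company"),
    ("?co", "worksAt", "?company"),
    ("?co", "name", "?coworker"),
]
rows = [b for b in solve(query, TRIPLES) if b["?co"] != b["?p"]]  # FILTER
coworkers = [(b["?company"], b["?coworker"]) for b in rows]
print(coworkers)
```

The shared variable `?company` is what ties the second and third patterns together; it does the work that a join condition does in SQL.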
The WHERE block is a set of triple patterns with shared variables (?company). The engine finds all variable bindings that satisfy every pattern at once: a graph pattern match expressed as joins over triples.
Add SERVICE <remote-endpoint> and you join across another organization's knowledge graph, the killer feature for open / government / life-sciences data.

| Axis | Neo4j | Amazon Neptune |
|---|---|---|
| Data model | Property graph only | Property graph + RDF |
| Query languages | Cypher (+ openCypher, GQL) | Gremlin, openCypher, SPARQL |
| Ops model | Self-host, or Aura (managed) | Fully managed, AWS-native only |
| Analytics / ML | Rich: Graph Data Science library, native vectors | Neptune Analytics + vectors; thinner algorithm library |
| Ecosystem | Largest community, Bloom viz, drivers everywhere | Tight AWS integration (Bedrock, IAM, S3, SageMaker) |
| Pick when | You want the richest graph tooling & Cypher DX, or multi-cloud / on-prem | You're all-in on AWS, need RDF, or want zero graph ops burden |
For agencies and contractors (CGI Federal, etc.), Neptune's FedRAMP-authorized AWS footprint and RDF support align with government data-standard and interoperability requirements — a common reason it shows up in those job descriptions alongside Neo4j.
This is the section that earns the line item on senior AI/ML postings. A knowledge graph turns documents and data into a network of typed entities and relationships — and pairing it with an LLM fixes the biggest weakness of plain vector RAG.
Standard RAG embeds text chunks, finds the k most similar to a question, and stuffs them into the prompt. It's excellent at local questions ("what does the contract say about X?") but weak at global, multi-hop ones ("how are these three programs connected?") — because the answer isn't in any single chunk; it's in the relationships across chunks. Similarity search can't follow a relationship.

    // 1) Vector search: find chunks most similar to the question embedding.
    CALL db.index.vector.queryNodes('chunkEmbeddings', 5, $qEmbedding)
    YIELD node AS chunk, score

    // 2) Expand: pull entities mentioned in those chunks and ONE hop of
    //    their relationships, the structural context vectors can't reach.
    MATCH (chunk)-[:MENTIONS]->(e:Entity)-[rel]-(neighbor:Entity)
    RETURN chunk.text, e.name, type(rel) AS relation, neighbor.name
    LIMIT 40

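The same two-step retrieval (vector entry points, then one hop of graph expansion) can be sketched without a database. Everything in the block is an assumption made for illustration: the two-dimensional "embeddings", the chunk texts, the entity names, and the `retrieve` helper. A real pipeline would use an embedding model and a graph store instead.

```python
import math

# Hypothetical chunks with toy 2-D embeddings and the entities they mention.
CHUNKS = [
    {"text": "Program A is a grant program.", "embedding": [1.0, 0.0], "mentions": ["Program A"]},
    {"text": "Program B trains analysts.", "embedding": [0.0, 1.0], "mentions": ["Program B"]},
]

# Hypothetical typed relationships between entities.
RELATIONS = [
    ("Program A", "FUNDED_BY", "Agency X"),
    ("Program B", "FUNDED_BY", "Agency X"),
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def retrieve(question_embedding, k=1):
    # Step 1: vector search picks the k chunks most similar to the question.
    ranked = sorted(
        CHUNKS,
        key=lambda chunk: cosine(question_embedding, chunk["embedding"]),
        reverse=True,
    )
    # Step 2: expand each mentioned entity by one hop, in either direction.
    context = []
    for chunk in ranked[:k]:
        for entity in chunk["mentions"]:
            for source, relation, target in RELATIONS:
                if entity in (source, target):
                    neighbor = target if entity == source else source
                    context.append((chunk["text"], entity, relation, neighbor))
    return context


context_rows = retrieve([1.0, 0.1], k=1)
print(context_rows)
```

The row that reaches the prompt names Agency X even though no retrieved chunk mentions it. A second hop from Agency X would reach Program B, which is the kind of cross-chunk connection that similarity search alone does not surface.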
relation labels. The LLM now sees how facts connect, not just that they're textually similar.
"Bob Smith", "Robert Smith", "R. Smith" → one node. The graph makes dedup and identity-linking a structural operation, not a fuzzy guess in isolation.
Every answer traces a concrete path of typed edges — auditable provenance, which matters enormously in regulated and federal settings.
Rings, shared devices, and circular money flows are cycles and shared-neighbor patterns — natural graph queries, near-impossible at speed in SQL.
Collaborative filtering becomes a two-hop traversal (you → similar users → their items), as in Q3 of the interactive graph above.
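The fraud patterns mentioned above (shared devices and circular money flows) reduce to two small graph routines. The sketch below uses hypothetical account and device identifiers and plain dictionaries; it shows the shape of the queries, not a production detection method.

```python
from collections import defaultdict

# Hypothetical sample data.
USES = [("acct1", "deviceA"), ("acct2", "deviceA"), ("acct3", "deviceB"), ("acct4", "deviceA")]
TRANSFERS = {"acct1": ["acct2"], "acct2": ["acct4"], "acct4": ["acct1"], "acct3": ["acct1"]}


def shared_device_groups(uses, min_accounts=2):
    """Shared-neighbor pattern: accounts that point at the same device."""
    by_device = defaultdict(set)
    for account, device in uses:
        by_device[device].add(account)
    return {
        device: sorted(accounts)
        for device, accounts in by_device.items()
        if len(accounts) >= min_accounts
    }


def find_cycle(transfers, start):
    """Cycle pattern: depth-first search for a transfer path back to `start`."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for receiver in transfers.get(node, []):
            if receiver == start:
                return path + [start]
            if receiver not in path:  # avoid looping through other cycles
                stack.append((receiver, path + [receiver]))
    return None


device_groups = shared_device_groups(USES)
ring = find_cycle(TRANSFERS, "acct1")
print(device_groups)
print(ring)
```

Both routines touch only the edges around the accounts in question, which is the same locality argument that makes these patterns natural graph queries.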
Neo4j and Neptune anchor the field, but the right pick depends on scale, model, and where it runs. A quick map of the rest.
| Database | Model | Query lang | Distinctive strength |
|---|---|---|---|
| Neo4j | Property | Cypher | Market leader; richest tooling, GDS library, native vectors |
| Amazon Neptune | Property + RDF | Gremlin / openCypher / SPARQL | Fully managed AWS; dual-model; Bedrock GraphRAG |
| Memgraph | Property | Cypher | In-memory, real-time + streaming; Cypher-compatible drop-in |
| TigerGraph | Property | GSQL | Massively parallel; deep-link analytics on huge graphs |
| JanusGraph | Property | Gremlin | Open-source, scales on Cassandra/HBase + Elasticsearch |
| Dgraph | Property | DQL / GraphQL | Distributed, GraphQL-native API for app teams |
| ArangoDB | Multi-model | AQL | Graph + document + key/value in one engine |
| FalkorDB | Property | Cypher | Sparse-matrix linear algebra; fast, low-latency GraphRAG |
| GraphDB / Stardog | RDF | SPARQL | Reasoning, ontologies, enterprise semantic layers |
Know the two models (property vs. RDF), one query language deeply (Cypher transfers across Neo4j + Neptune + Memgraph), why index-free adjacency beats JOINs on deep traversals, and how GraphRAG uses vector entry points + graph expansion to answer global, multi-hop questions that vanilla RAG can't. That's the senior AI/ML graph story.