Neo4j Tutorial: Learn Graph Databases & Cypher from Scratch

Matt Tanner
Head of Developer Relations
April 17, 2026

If you've ever tried to model a friend-of-a-friend lookup, a fraud ring, or a recommendation engine in a traditional relational database, you already know the pain. SQL gets ugly fast when relationships are the point of the query, not an afterthought. Joins pile up, plans explode, and the code you actually wanted to write disappears under a wall of JOIN ... ON boilerplate.

Neo4j takes the opposite approach to storing connected data: relationships are first-class citizens, just like the things they connect. It uses a property graph model, a query language called Cypher that looks like ASCII art, and a runtime that traverses interconnected data instead of joining tables.

This tutorial walks you through Neo4j from zero: what graph databases actually are, how nodes and relationships work, how to install Neo4j, the Cypher query language you'll use every day, and the modeling and indexing decisions that determine whether your graph stays fast at scale. By the end, you'll be able to load data, write meaningful queries, and reason about when a graph database is the right tool, and when it's worth reaching for a different architecture instead.

What is Neo4j?

Neo4j is a native graph database. The "native" part matters: instead of storing data in tables and computing relationships at query time, Neo4j uses native graph storage, where nodes, relationships, and properties live on disk in a structure in which every relationship is a direct pointer between two nodes. No join tables, no foreign keys, and no query planner agonizing over the cheapest join-order permutation, because there's no join order to pick in the first place.

The project started in 2007 and is now maintained by Neo4j, Inc. It ships as an open source community edition, a commercial enterprise edition, and a managed cloud service called Neo4j AuraDB. The official product page calls it a "high-speed graph database with unbounded scale, security, and data integrity", and Neo4j claims that "the property graph data model enables queries to run 1000x faster than relational databases" for relationship-heavy workloads.

Neo4j is also the home of Cypher, the query language we'll spend the rest of this tutorial in. Cypher started as a Neo4j-specific language and is now an open standard through the openCypher project, which is why several other graph systems (including PuppyGraph) use it too.

What Are Graph Databases?

Unlike traditional relational databases that store data in tables and model relationships through foreign keys, a graph database stores data as a graph: a set of vertices (nodes) connected by edges (relationships). Picture a small social network. Alice and Bob are nodes. The fact that Alice follows Bob is a relationship. The date she started following him is a property of that relationship. Nothing here needs a separate "follows" table or a foreign key.

Graph databases shine in two situations. The first is when your queries care about patterns of connection across complex relationships, not individual rows or records. "Find all customers who share an IP address with a flagged account, but only through accounts created in the last 30 days" is the kind of fraud detection query that graphs answer in milliseconds and traditional relational databases answer in coffee breaks. The second is when your schema evolves constantly, because adding a new relationship type is cheaper than migrating a 200-million-row join table.

Two main flavors exist. Property graphs (Neo4j, Memgraph, JanusGraph, PuppyGraph) attach key-value properties to both nodes and relationships and use languages like Cypher or Gremlin. RDF triple stores (GraphDB, AllegroGraph, Stardog) model everything as subject-predicate-object triples, query with SPARQL, and are often used to power knowledge graphs. This tutorial sticks to the property graph model, which is what most application developers reach for first.

Core Graph Concepts (Nodes, Relationships, Properties)

The Neo4j graph data model has three building blocks, and together they cover most of the conceptual surface you need to learn.

Nodes represent entities. A node can have one or more labels that work like tags: :Person, :Product, :Host. A single node can carry multiple labels, so a :Person can also be an :Employee and a :Customer.
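As a minimal sketch of what multi-label nodes look like in practice (the node name and properties here are made up for illustration):

CREATE (c:Person:Employee {name: 'Carol', hired: date('2023-06-01')});

-- Either label matches the same node independently
MATCH (e:Employee) RETURN e.name;

The second query finds Carol even though it never mentions :Person, which is what makes labels behave like tags rather than table membership.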

Relationships connect two nodes and always have a direction and exactly one type. Direction is encoded with an arrow (-[:FOLLOWS]->) and the type is written in screaming snake case by convention. A relationship always belongs to a (start, end) pair. You can still query a directed relationship in either direction at read time, so the direction is a modeling choice, not a query restriction.

Properties are key-value pairs on nodes or relationships. Values can be strings, numbers, booleans, dates, points, lists of primitives, or null. Neo4j is schema-optional: you don't declare which properties belong to which label, although you can add constraints later to enforce uniqueness or existence.

Put together, (alice:Person {name: 'Alice'})-[:FOLLOWS {since: date('2024-01-15')}]->(bob:Person {name: 'Bob'}) captures two nodes, one typed relationship, and three properties in one line. That density is what makes Cypher feel different from SQL once you get used to it.

Setting Up Neo4j (Local & Cloud)

You have three reasonable options for getting Neo4j running.

Neo4j Desktop is the easiest first-time install. It bundles the database, a query editor (Neo4j Browser), and a project management UI. Create a local database, set a password, click Start, and you're talking to Bolt on localhost:7687 and the browser on localhost:7474. Neo4j Browser doubles as a graph visualization tool: every Cypher result that contains nodes and relationships renders as an interactive graph you can pan, zoom, and inspect. For richer visualization of larger graphs, Neo4j Bloom ships with Desktop and adds search-driven exploration.

Docker is the right choice for avoiding GUIs or building a reproducible CI setup:

docker run \
  --name neo4j-tutorial \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/changeme123 \
  -d neo4j:5

Open http://localhost:7474, log in with neo4j and your password, and you have a working database in under a minute. Set NEO4J_AUTH on first start so you initialize authentication explicitly, rather than trying to log in with the default credentials baked into the image.

Neo4j AuraDB is the managed cloud option, described as "a fully managed Neo4j database, hosted in the cloud and requires no installation". There's a free tier for learning and paid tiers when you need real capacity. Any of the three works for this tutorial.

Introduction to Cypher Query Language

Cypher is what makes Neo4j feel different from a relational database. The language is declarative, like SQL: you describe the pattern you want, not the algorithm to find it. The difference is that the pattern is drawn, not joined.

Neo4j's own documentation describes Cypher as "Neo4j's declarative and GQL conformant query language" that is "similar to SQL, but optimized for graphs." Node patterns use round brackets and relationships use square brackets between arrows. So (p:Person)-[:LIKES]->(t:Technology) reads as "a Person node connected by a LIKES relationship to a Technology node," and that exact string is also valid Cypher.

The five clauses you'll use most are:

  • MATCH finds existing patterns in the graph. It's the equivalent of SELECT ... FROM ... JOIN.
  • CREATE writes new nodes and relationships. No existence checks.
  • MERGE is the upsert: it matches if the pattern exists, and creates it if it doesn't. In production code, pair MERGE with a uniqueness constraint to avoid accidental duplicates.
  • WHERE filters bound variables. It looks and behaves like SQL's WHERE.
  • RETURN projects the result, like SELECT at the end of a SQL query.

A complete read query that finds everyone Alice follows looks like this:

MATCH (a:Person {name: 'Alice'})-[:FOLLOWS]->(friend:Person)
RETURN friend.name AS name
ORDER BY name;

Two things are worth pointing out. First, the pattern itself is the join: there's no explicit JOIN keyword, because the arrow already says "follow this relationship." Second, you can chain patterns of arbitrary length, so (a)-[:FOLLOWS*2..3]->(friend_of_friend) finds everyone two or three hops away from Alice, which is a query that would be painful to write in SQL.

Cypher reference and further learning

This post walks through the parts of Cypher you need to follow the rest of the tutorial, but it isn't a full language reference. For a deep dive into every clause, function, and expression, the official Neo4j Cypher manual is the canonical source. It's worth bookmarking alongside this Neo4j Cypher tutorial, especially once you start writing longer queries that use aggregation, WITH pipelines, CALL subqueries, and list comprehensions. Cypher is also an openCypher standard, which means the query style you learn here transfers to other graph systems that implement it.

Creating and Managing Graph Data

Time to load data. The simplest way to add a single node is CREATE:

CREATE (a:Person {name: 'Alice', joined: date('2024-01-15')});

That writes one labeled node with two properties. Adding a relationship uses the same pattern syntax:

MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
CREATE (a)-[:FOLLOWS {since: date('2024-02-01')}]->(b);

In production code, you almost always want MERGE instead of CREATE, because MERGE is idempotent. Pair it with a uniqueness constraint, and you get an upsert that won't create duplicate Alices on rerun:

CREATE CONSTRAINT person_name_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS UNIQUE;

MERGE (a:Person {name: 'Alice'})
  ON CREATE SET a.joined = date()
  ON MATCH SET a.last_seen = datetime();
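MERGE works on relationships too. A common pattern, sketched here with the Alice/Bob nodes from earlier, is to bind both endpoints with MATCH first so MERGE only has to decide about the edge itself:

MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
MERGE (a)-[f:FOLLOWS]->(b)
  ON CREATE SET f.since = date();

Rerunning this never creates a second FOLLOWS edge between the same pair, and since is only set the first time.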

For bulk loads, LOAD CSV streams rows from a URL or local file and exposes each row as a Cypher map:

LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (p:Person {id: row.id})
  SET p.name = row.name, p.joined = date(row.joined);

For very large imports, the neo4j-admin database import command reads CSV files directly into the store files offline and is the right tool for hundreds of millions of rows. Updates use SET to add or change properties, REMOVE to drop a property or label, and DELETE (or DETACH DELETE to drop a node and all its relationships at once) to remove data.
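Those update and delete clauses look like this in practice (property names here are illustrative):

// Add one property, drop another
MATCH (a:Person {name: 'Alice'})
SET a.bio = 'Graph enthusiast'
REMOVE a.last_seen;

// Delete a node and every relationship attached to it in one step
MATCH (b:Person {name: 'Bob'})
DETACH DELETE b;

Note that a plain DELETE on a node that still has relationships fails; DETACH DELETE is the usual escape hatch.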

Querying and Traversing Relationships

Traversal queries are where Cypher earns its keep. Once your graph has a few thousand nodes, the questions you want to ask stop looking like row lookups and start looking like paths.

Fixed-length hops

The simplest traversal is a fixed-length hop. Find every technology Alice's followers like:

MATCH (a:Person {name: 'Alice'})<-[:FOLLOWS]-(follower:Person)-[:LIKES]->(tech:Technology)
RETURN tech.name, count(*) AS mentions
ORDER BY mentions DESC
LIMIT 10;

Notice the back-arrow on the left, which traverses FOLLOWS in reverse without re-modeling anything. Cypher lets you walk a directed relationship in either direction at query time.

Variable-length paths

For variable-length paths, use the *min..max syntax. To find everyone within three hops of Alice:

MATCH (a:Person {name: 'Alice'})-[:FOLLOWS*1..3]-(other:Person)
RETURN DISTINCT other.name;

Variable-length matches are powerful, but the cost grows quickly with depth and graph density. Always cap the upper bound and anchor the query on a node you can find with an index. An open-ended [:FOLLOWS*] on a graph with millions of edges is a great way to discover what an OOM looks like in practice.

Shortest paths and graph algorithms

For shortest-path queries Cypher provides the shortestPath function, which uses bidirectional BFS. For richer graph algorithms like PageRank, community detection, or weighted shortest paths, Neo4j ships the Graph Data Science (GDS) library, which exposes a separate set of procedures you call with CALL gds.<algorithm>.stream(...) (for example, CALL gds.pageRank.stream('myGraph')).
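A shortestPath query looks like this; the hop cap of 6 is an arbitrary choice for this sketch, and capping it is good practice for the same cost reasons as any variable-length match:

MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
MATCH p = shortestPath((a)-[:FOLLOWS*..6]->(b))
RETURN [n IN nodes(p) | n.name] AS hops, length(p) AS distance;

If no path of six hops or fewer exists, the second MATCH simply produces no rows.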

Data Modeling and Best Practices

The biggest mistake new graph developers make is modeling a graph the way they'd model a relational schema. The biggest win is recognizing that the relationship is the model.

A few rules of thumb that will save you pain:

Model your verbs as relationships, not nodes. If you find yourself creating a :Purchase node with two relationships (:BY_USER, :OF_PRODUCT), ask whether (:User)-[:PURCHASED {at: ...}]->(:Product) would do the job. It usually does, and it's faster to traverse. Use a node for the purchase only when the event has its own identity, its own relationships, or properties that only make sense as a separate entity (returns, refunds, line items).
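The difference shows up directly in query shape. Both versions below answer "what did this user buy?"; the labels and properties are hypothetical:

// Relationship-only model: one hop per purchase
MATCH (u:User {id: 'u42'})-[p:PURCHASED]->(prod:Product)
RETURN prod.name, p.at;

// Purchase-as-node model: two hops, justified only when the
// purchase has its own identity (refunds, line items, returns)
MATCH (u:User {id: 'u42'})<-[:BY_USER]-(o:Purchase)-[:OF_PRODUCT]->(prod:Product)
RETURN prod.name, o.at, o.refunded;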

Use specific relationship types instead of one generic type with a kind property. (:User)-[:PURCHASED]-> and (:User)-[:VIEWED]-> traverse independently, while (:User)-[:INTERACTED {kind: 'purchase'}]-> forces every traversal to load and filter every interaction. Cypher is much faster when it can prune by type.

Anchor production traversals on an indexed property. Queries that don't start from a known node end up scanning the label index, which gets slow on large graphs. Make sure the start of most traversals has an indexed id, email, or similar unique property.

Avoid super-nodes. A super-node is a node with millions of incoming or outgoing relationships of the same type. They're the graph equivalent of a celebrity row in a relational table: every traversal that touches one stalls. If you can't avoid them, partition with an intermediate node or filter the relationship by a property as early in the query as possible.

Treat the schema as living documentation, but know its limits. Neo4j is schema-optional. You can declare uniqueness and existence constraints, but nothing forces a :Person node to have an email property, or a :PURCHASED relationship to connect a :User to a :Product rather than a :Tweet. The "ontology" of a Neo4j graph is a convention: it lives in the application code, and the developer's head, and the database will happily run a query referencing a relationship type that doesn't exist and return an empty result, rather than rejecting the query outright. That's fine for hand-written queries where the author knows the shape of the data. It becomes a problem the moment an LLM is on the other side, which is the next section's topic.
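You can see this behavior with a one-character typo. The relationship type below is deliberately misspelled:

// :FOLOWS (typo) exists nowhere in the graph, but Neo4j runs this anyway
MATCH (p:Person)-[:FOLOWS]->(q:Person)
RETURN q.name;

The query returns zero rows. Nothing in the result distinguishes "you misspelled the relationship type" from "no such connections exist", which is exactly the trap described above.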

Performance, Indexing, and Optimization

Neo4j performance comes down to two things: making sure traversals start at the right place, and making sure the query planner has enough information to pick a good execution plan. Indexes do both.

Index types you'll actually use

Neo4j 5 supports several index types. Range indexes are the default and "support most types of predicates" (equality, range, and prefix). Text indexes optimize STRING operators like CONTAINS and ENDS WITH. Point indexes handle spatial predicates over POINT values. Token lookup indexes speed up label and relationship-type scans. Neo4j also offers full-text indexes for search and vector indexes for similarity search over embeddings, which is the index type GraphRAG workloads care about.

CREATE INDEX person_email IF NOT EXISTS
FOR (p:Person) ON (p.email);
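If your queries lean on CONTAINS or ENDS WITH rather than equality, a text index is the better fit; the syntax mirrors the range index above:

CREATE TEXT INDEX person_name_text IF NOT EXISTS
FOR (p:Person) ON (p.name);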

Reading query plans

Use EXPLAIN to inspect a query plan without running it, and PROFILE to run the query and see actual database hits. Watch db hits and rows in the plan. A plan that starts with NodeByLabelScan instead of NodeIndexSeek is usually a sign of a missing or unused index.
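For example, with the person_email index from above in place, profiling a lookup-anchored traversal should show the plan starting from an index seek rather than a label scan:

PROFILE
MATCH (p:Person {email: 'alice@example.com'})-[:FOLLOWS]->(f:Person)
RETURN f.name;

If the first operator in the output is NodeByLabelScan instead of NodeIndexSeek, the index is missing, not yet online, or the predicate doesn't match its indexed property.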

The ETL Hurdle

At scale, the limits become architectural. Neo4j publishes that it can "horizontally scale your databases for up to 100 TB+ of data", but reaching that scale in production analytics means running a clustered enterprise deployment and building the ETL pipelines that copy your source systems into the graph store, keep them in sync, and reload whenever upstream schemas change.

That ETL layer is the hidden cost of a native graph database. Every table you care about gets extracted, transformed into nodes and edges, loaded, and reloaded whenever it changes. You end up with two copies of the data, two definitions of what an entity means, and two windows where reporting is stale or offline. For graph analytics, where the point is usually to traverse relationships across data that already lives in a warehouse or lakehouse, that overhead is increasingly hard to justify.

This is the gap PuppyGraph fills. PuppyGraph is a distributed graph query engine that connects directly to relational sources like Snowflake, Databricks, BigQuery, Postgres, and Iceberg lakehouses with zero ETL and no data duplication, then exposes them as a graph you can query with openCypher or Gremlin. Under the hood, it's a scale-out MPP engine with native sharding, vectorized execution, and a query optimizer built specifically for graph complexity, which enables sub-second response times for deep multi-hop traversals. Customer benchmarks cite "5-hop queries on 1B+ edges in under 3 seconds". There's no separate graph store to provision, no pipelines to maintain, and no nightly reload window, because the graph is always querying the current warehouse state. For read-heavy graph analytics on warehouse or lakehouse data, that's usually a better fit than standing up a second database.

GraphRAG and the Ontology Problem

Things get more complicated when graphs leave the prototype stage and show up inside production GraphRAG and agentic AI systems. Neo4j's marketing now positions the database as a natural backbone for GraphRAG, but there's a structural problem with that story: the ontology a GraphRAG agent depends on is not actually enforced inside a property graph. It's a convention. A developer writes (:Person)-[:WORKS_AT]->(:Company) on a whiteboard, codifies it in application code, and hopes every writer obeys. The database will happily accept a :Person node with no name, a :WORKS_AT edge pointing at a :Tweet, or a property called employer on one node and company on the next. Worse, it will execute a query that references a relationship type that doesn't exist and return an empty result, instead of rejecting it outright and telling the caller what went wrong.

For a human engineer reading a query plan, that's fine. For an LLM generating Cypher from that "schema," it's a dead end. The agent has no authoritative definition to introspect, no way to distinguish a legitimately empty result from a query that referenced a nonexistent edge type, and no structured signal it can use to self-correct. The agent isn't wrong about the data model. The data model was always a suggestion.

Agents need the opposite: an ontology that the system enforces, not one they take on faith. PuppyGraph approaches the problem from that angle. You define a graph schema over your existing lakehouse or warehouse tables, declaring vertex types, edge types, and their properties, and that schema is the enforced ontology. Every query is validated against the schema before execution. Reference an entity type that doesn't exist and PuppyGraph rejects the query and tells the agent exactly what's invalid and what the valid options are. That turns the ontology into a two-way contract between the system and the agent: the schema gives the model the context it needs to generate accurate queries, and when it gets something wrong, the structured errors give it precisely the signal it needs to correct itself. Pair that with zero-ETL access to the data already living in Snowflake, Databricks, BigQuery, or an Iceberg lakehouse, and you get a graph layer that's both ontology-enforced and always current. No second database, no pipelines, no drift.

Conclusion

Neo4j is still one of the best ways to learn how graph databases work. The property graph model is intuitive, Cypher is genuinely fun to write, and the surrounding ecosystem (Browser, Bloom, GDS, AuraDB) gives you real tools instead of toys. For prototyping a recommendation system, mapping knowledge graphs, or running fraud detection, you can be productive in Neo4j in a single afternoon.

Where the picture changes is at scale and with agents in the mix. A graph analytics workload running against warehouse data doesn't need a second database and a nightly ETL pipeline to get good traversal performance, and a GraphRAG agent doesn't need a schema it can't trust. If your next project touches either of those, it's worth comparing a zero-ETL, ontology-enforced graph layer alongside a traditional Neo4j deployment. Start with the free PuppyGraph Developer Edition or read how PuppyGraph fits into a GraphRAG architecture for the full agentic angle.


Matt is a developer at heart with a passion for data, software architecture, and writing technical content. In the past, Matt worked at some of the largest finance and insurance companies in Canada before pivoting to working for fast-growing startups.

Get started with PuppyGraph!

PuppyGraph empowers you to seamlessly query one or multiple data stores as a unified graph model.

Developer Edition — $0/month
  • Forever free
  • Single node
  • Designed for proving your ideas
  • Available via Docker install

Enterprise Edition — priced by the memory and CPU of the server that runs PuppyGraph
  • 30-day free trial with full features
  • Everything in Developer, plus enterprise features
  • Designed for production
  • Available via AWS AMI & Docker install

* No payment required
