PuppyGraph is the first and only real time, zero-ETL graph query engine in the market, empowering data teams to query existing relational data stores as a unified graph model that deployed in under 10 minutes, bypassing traditional graph databases' cost, latency, and maintenance hurdles. Capable of scaling with petabytes of data and executing complex 10-hop queries in seconds, PuppyGraph supports use cases from enhancing LLMs with knowledge graphs to fraud detection, cybersecurity and more. Trusted by industry leaders, including Coinbase, AMD, Netskope, Palo Alto Network, eBay, and more.

How does PuppyGraph compare to Neo4j?

Unlike Neo4j, which requires you to load and sync data into its proprietary graph store, PuppyGraph runs directly on your data sources—eliminating ETL, reducing TCO, and enabling faster time-to-value. PuppyGraph also integrates natively with Databricks Unity Catalog, Google BigQuery, and AlloyDB.

What are the performance benefits of PuppyGraph?

PuppyGraph delivers multi-hop traversals in seconds over billions of edges. Real customer stories cite 5-hop queries on 1B+ edges in under 3 seconds.

Does PuppyGraph support my cloud data stack?

Yes. PuppyGraph natively integrates with Databricks Unity Catalog, Google BigQuery, AlloyDB, and AWS, keeping a single governed copy of your data.

How does PuppyGraph handle data governance and security?

PuppyGraph leverages your existing catalog and security (Unity Catalog, BigQuery, AlloyDB), so all graph queries respect your current access controls.

Can PuppyGraph power AI and LLM applications (GraphRAG)?

Yes. PuppyGraph enables Graph-based Retrieval Augmented Generation (GraphRAG) directly on your governed data—providing explainable, multi-hop context for LLMs and enterprise AI.

See all articles

Table of Contents

Introduction to MySQL

Database Concept

Recursive Join: The Ultimate Guide

Sa Wang

Software Engineer

No items found.

May 22, 2026

A standard SQL join combines two tables on a matching condition and returns a single result set. A recursive join is the version that loops: it joins a table to itself repeatedly, feeding each round’s output back in as the next round’s input, until nothing new comes out. That shape is what you reach for whenever the structure you’re querying is a hierarchy or a graph whose depth you don’t know in advance: org charts, threaded comments, file systems, product part trees, network paths.

This post walks through what a recursive join actually is at the level of SQL semantics, the standard WITH RECURSIVE syntax that implements it, two worked examples, how it differs from a self join, and the failure modes that bite people in production.

Get Started with PuppyGraph for FREE

What is a recursive join?

A recursive join is a query pattern that joins a table (or a derived table) with itself one or more times, where the number of join steps is not fixed at query-writing time but determined at runtime by the data. Each step extends the working set by one hop along whatever relationship the join condition encodes, and the query terminates when a step produces no new rows.

The standard mechanism for expressing this in SQL is the recursive common table expression (recursive CTE), introduced in SQL:1999 and now supported by PostgreSQL, SQL Server, Oracle, SQLite, and MySQL 8.0 or later. Some dialects also offer alternative forms (Oracle’s CONNECT BY, which DB2 also adopted), but WITH RECURSIVE is the portable surface across engines that implement the SQL:1999 recursive CTE feature.

It’s worth separating the concept from the syntax. Recursive join is colloquial vocabulary among practitioners; the SQL standard itself does not define the term. It describes the operation: an iterative self-join driven by the previous iteration’s output. Recursive CTE is the language construct most databases give you to write that operation down. People use the terms interchangeably in conversation, and that is usually fine, but the distinction matters when comparing dialects or when explaining why some queries that look recursive (e.g., a fixed JOIN ... JOIN ... JOIN chain) are not.

How recursive joins work

Conceptually, every recursive CTE has three pieces, and understanding the execution loop they define is the whole concept. The SQL standard itself only formally distinguishes the anchor and recursive members; the third piece, termination, is implicit in how the recursive member behaves when the data runs out.

The anchor member is a non-recursive query that produces the starting rows. It runs once, and its output becomes the initial working table for the recursion. In an org-chart example, the anchor selects the CEO. In a path-finding example, it selects the source vertex.

The recursive member is a query that references the CTE’s own name. It runs against the current working table (not the full accumulated result) and produces new rows. Each invocation extends the recursion by one hop. The recursive member is joined to the underlying table via whatever relationship encodes the hop: manager_id = employee_id for org hierarchies, parent_id = id for tree structures, from_node = to_node for path traversal.

The termination condition is implicit: the recursion stops when the recursive member produces zero rows. There is no explicit EXIT clause. Termination depends on the data eventually running out of new rows to produce, which is why cycles in the data are a serious failure mode (covered later).

Annotated SQL snippet of a minimal recursive CTE (WITH RECURSIVE subordinates (id, depth) AS (...)) that walks an employee hierarchy, with four colored callouts: the anchor member (SELECT 1 AS id, 0 AS depth), the recursive member (SELECT e.id, s.depth + 1 FROM employees e JOIN subordinates s ON e.manager_id = s.id), the termination condition (WHERE s.depth < 5), and the outer query that uses the CTE (SELECT id, depth FROM subordinates).

The key mental model is that each iteration sees only the previous iteration’s output, not the entire accumulated result. The final result returned by the CTE is the union of every iteration’s output (anchor included), but that accumulation is invisible to the recursive member itself. This is what makes recursive CTEs set-based rather than procedural: the database evaluates them as a series of relational operations, not as a programmatic loop with mutable state. That property is also what lets a query planner do meaningful work on them, instead of treating them as opaque imperative blocks.

Recursive join syntax in SQL

The portable form looks like this:

WITH RECURSIVE cte_name (column_list) AS (
    -- Anchor member
    SELECT ...
    FROM base_table
    WHERE <starting condition>

    UNION ALL

    -- Recursive member
    SELECT ...
    FROM base_table
    JOIN cte_name ON <hop condition>
    WHERE <optional guard>
)
SELECT ... FROM cte_name;

‍

Four things are worth pointing out in this skeleton.

The RECURSIVE keyword is required in PostgreSQL, MySQL, and SQLite. SQL Server omits it (any CTE that references itself is treated as recursive); Oracle treats WITH clauses as recursive when they self-reference, without the keyword. The overall structure of the CTE body is portable across these engines, though specific expressions (string concatenation, array types, function names) still vary by dialect.

UNION ALL is the standard combinator between anchor and recursive members. UNION (without ALL) is allowed in some dialects but is rare in practice: it forces a duplicate-elimination step on every iteration, which is both expensive and usually not what you want, since duplicates often carry meaningful path information.

Per the SQL standard and most mainstream engines, the recursive member must reference the CTE itself exactly once. Multiple self-references and mutually recursive CTEs (where t1 references t2 and t2 references t1) are not part of the standard and are rejected by most engines. Subtle violations (e.g., a self-reference inside a subquery that the planner cannot flatten) usually surface as a clear error message at parse time.

The column list on the CTE is optional but recommended. Spelling it out makes the recursive member’s column alignment explicit and prevents type-mismatch surprises across iterations.

Get Started with PuppyGraph for FREE

Recursive CTEs and recursive joins

The relationship between the two terms is straightforward once you separate the pattern from the mechanism. A recursive CTE is the standard SQL surface for writing a recursive join; a recursive join is what the engine ends up executing.

That separation matters because not every recursive CTE is only a recursive join. The anchor member can be an arbitrary query, including one that already involves joins, aggregations, or filters. The recursive member can also join the CTE against several tables in one step, not just the base table being walked. A common pattern is to recursively walk an edge table while joining each new row against a vertex table to pick up attributes:

WITH RECURSIVE traversal AS (
    SELECT v.id, v.name, 0 AS depth
    FROM vertices v
    WHERE v.id = 'start'

    UNION ALL

    SELECT v.id, v.name, t.depth + 1
    FROM traversal t
    JOIN edges e ON e.from_id = t.id
    JOIN vertices v ON v.id = e.to_id
)
SELECT * FROM traversal;

‍

The recursive join here is between traversal and edges; the additional join to vertices is a regular join layered on top of each iteration’s output. Conflating the two structurally similar joins in the recursive member is one source of confusion for readers new to the pattern.

Dialect-wise, the main divergence is Oracle’s CONNECT BY, which expresses the same iterative-walk semantics in a non-standard syntax that predates SQL:1999. Modern Oracle also supports the SQL:1999 recursive WITH (recursive subquery factoring), though without accepting the RECURSIVE keyword itself; a lot of legacy Oracle code is still written in CONNECT BY. The two are functionally equivalent for hierarchy traversal; they differ in how cycles, ordering, and depth tracking are expressed.

Recursive join examples

Two examples cover most of what you’ll encounter: a tree walk (one parent per child) and a graph walk (many neighbors per vertex). The second one is the harder of the two because of cycles.

The examples below use PostgreSQL syntax (|| for string concatenation, ARRAY[...] and = ANY(...) for array membership) because it makes the recursion pattern itself easiest to read. The structural shape carries over to other engines, but specific expressions translate: SQL Server uses + or CONCAT(...) for string concatenation, while MySQL uses CONCAT(...) only (its + is numeric addition, not string concat), and both typically encode the visited set as a delimited string rather than an array; Oracle has its own array and collection types. The recursion mechanics are the same in all of them.

Employee hierarchy. The standard introductory example. Given an employees table with id, name, and manager_id, list every employee under a given manager along with their depth in the chain.

WITH RECURSIVE subordinates (id, name, manager_id, depth, path) AS (
    SELECT id, name, manager_id, 0 AS depth, name AS path
    FROM employees
    WHERE id = 1   -- start from the CEO

    UNION ALL

    SELECT e.id, e.name, e.manager_id, s.depth + 1, s.path || ' > ' || e.name
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT id, name, depth, path FROM subordinates ORDER BY depth, name;

‍

The anchor selects the CEO. The recursive member finds each employee whose manager_id matches an id already in subordinates, and accumulates a depth counter and a path string as it goes. Termination is automatic because the org chart is a tree and eventually all leaves are reached.

Side-by-side diagram: on the left, an employee org tree of seven nodes across four levels (CEO at the top in amber; below it 2 VP Eng, 3 VP Sales; below them 4 Director, 7 Carol; below Director 5 Alice and 6 Bob). On the right, the result set of the recursive CTE shown as a four-column table (id, name, depth, path) sorted by depth then name. The depth-0 row (CEO) is highlighted amber to match the anchor node; each subsequent row’s id matches a node in the tree.

Graph path traversal with cycle guard. Now the harder case: an edges table representing a directed graph. Find all vertices reachable from a starting vertex, without entering an infinite loop on cycles.

WITH RECURSIVE reachable (vertex, path, depth) AS (
    SELECT 'A' AS vertex, ARRAY['A'] AS path, 0 AS depth

    UNION ALL

    SELECT e.to_vertex, r.path || e.to_vertex, r.depth + 1
    FROM reachable r
    JOIN edges e ON e.from_vertex = r.vertex
    WHERE NOT e.to_vertex = ANY(r.path)
      AND r.depth < 10
)
SELECT DISTINCT vertex, depth FROM reachable;

‍

Two guards are doing real work. The NOT e.to_vertex = ANY(r.path) predicate prevents the recursion from re-entering a vertex it has already visited along the current path; this is the standard cycle-detection idiom in recursive CTEs and depends on the path column being accumulated. The r.depth < 10 cap is a defensive backstop: if the cycle guard has a bug or the data has a pathological shape, the depth limit prevents the query from running indefinitely. Production queries should always include one of these guards, and ideally both.

Get Started with PuppyGraph for FREE

Recursive joins vs self joins

A self join and a recursive join both involve a table joined with itself, which is why the two get conflated. The distinction is depth.

A self join expresses a fixed number of hops known at query-writing time. A query that lists each employee alongside their direct manager is a one-hop self join; a query that lists each employee alongside their grand-manager is a two-hop self join, written as two explicit joins. The depth is part of the query text. A recursive join expresses an unknown number of hops, decided at runtime by the data.

Dimension	Self join	Recursive join
Hops in query	Fixed; equal to number of explicit joins	Variable; one logical join, iterated
Depth known at write time	Yes	No
Termination	N/A	Implicit (no new rows from recursive member)
Cycle behavior	Cannot loop (depth is bounded by query text)	Unbounded without an explicit cycle guard; halted in practice by engine recursion limits or resource exhaustion
Planner work	Standard join optimization	Planned as an iterative operator that materializes intermediate results between rounds
Typical use	Simple lookups against a related row	Hierarchy / graph traversal of arbitrary depth

The synthesis worth carrying away: when the query author knows the depth, a self join is cheaper, simpler, and easier for the planner to optimize. When the depth is determined by the data, a recursive join is the only correct option, and the cost (intermediate materialization, cycle handling, less aggressive optimization) is the price of that generality. Reaching for a recursive join when a fixed self join would do is a common over-engineering mistake; reaching for a chain of self joins when the depth is genuinely variable is a correctness bug waiting to happen.

Common errors in recursive joins

Most recursive-join failures fall into a small set of patterns.

Infinite recursion from unhandled cycles. A graph with cycles, joined recursively without a cycle guard, will not terminate on its own. Most engines have some backstop, but relying on the backstop is not the same as writing a query that terminates correctly. Include a path-based guard, an explicit depth cap, or both. Tree data is acyclic by definition and does not need a guard; arbitrary graph data does.

Forgetting UNION ALL and getting UNION instead. Plain UNION removes duplicates on every iteration. For tree traversals where each node has exactly one parent, this is wasteful but harmless. For graph traversals where the same vertex can be reached via multiple paths, plain UNION silently discards the alternate paths, which is rarely what the query author intended.

Quadratic blow-up on dense data. The recursive member joins against the entire working table from the previous iteration on every round. If the branching factor is high and the depth is more than a handful of levels, the working table grows quickly, and materializing those intermediate results dominates runtime.

Index misses on the join column. The recursive member’s join needs an index on the column being joined. Without it, every iteration is a full scan of the base table, multiplied by the size of the working table. This is the most common cause of recursive CTEs being “slow” in practice.

Treating depth as semantically meaningful when it isn’t. The order in which the recursive member produces rows is not specified by SQL. If a query depends on rows arriving in BFS order (level by level), explicitly select and sort on a depth column at the end; do not assume the database produces rows in that order natively.

A pattern visible across all of these: recursive CTEs express variable-depth traversal correctly, but the performance and behavior characteristics differ from the rest of SQL in ways that are easy to miss. Although they are written in SQL syntax, the operation they describe is graph traversal: walking from row to row along join keys, one hop at a time. The engine then has to do work that relational databases were never built to optimize for. Graph query engines treat variable-depth traversal as a native primitive instead; in Cypher, a multi-hop traversal of unknown depth is MATCH (a)-[*1..10]->(b), with the depth bound built into the syntax.

Graph queries on your existing tables

The Cypher example above only helps if you can actually run Cypher against the data, and that has traditionally meant migrating it into a dedicated graph database like Neo4j, with a separate store, a separate ingestion pipeline, and the operational cost of keeping two copies of the data in sync. For workloads where the source data lives in a relational database (e.g., Postgres), a cloud data warehouse (e.g., Snowflake), an open table format (e.g., Apache Iceberg), or another SQL-accessible store, that migration cost is usually what rules out using a graph engine in the first place, even when variable-depth traversal is exactly the pattern the workload needs.

PuppyGraph is built for that gap. It sits as a graph query layer directly on top of existing relational tables: the underlying data stays where it is, PuppyGraph reads from those tables, applies a user-defined graph schema, and exposes a graph view that accepts openCypher and Gremlin. The multi-hop traversals that motivated this post become first-class queries against the same data, with no ETL step and no second store to keep in sync. The recursive-CTE machinery (cycle guards, depth caps, careful indexing, intermediate materialization) is replaced by query patterns that the engine treats as native operations.

Get Started with PuppyGraph for FREE

Conclusion

Recursive joins are SQL’s way to express variable-depth traversal without dropping into procedural code. The pattern is small: an anchor, a recursive member that extends it one hop at a time, and an implicit termination when the recursive member runs dry. The difficulty is operational, not conceptual: cycle handling, performance at depth, indexing, and the gap between what the query says and what the planner can do with it. When that operational cost starts to dominate, a graph query layer like PuppyGraph is worth considering, since it treats variable-depth traversal as a native operation against the same tables the recursive CTEs were running on.

The PuppyGraph Developer Edition is forever-free to point at your own schema, and a demo session is available for a guided walkthrough on whether the graph-layer approach fits a specific data shape.

No items found.

Sa Wang

Software Engineer

Sa Wang is a Software Engineer with exceptional mathematical ability and strong coding skills. He holds a Bachelor's degree in Computer Science and a Master's degree in Philosophy from Fudan University, where he specialized in Mathematical Logic.

‍