Connected Components: Graph Algorithm Guide

Software Engineer | July 18, 2025

Graphs are a natural way to represent relationships between users, machines, transactions, or any other connected entities. A central question in graph analysis is: which parts of the graph are actually connected? In many applications, this isn’t just a structural detail. It determines what’s reachable, what’s isolated, and how information or risk can spread.

The concept of connected components captures this. In an undirected graph, a connected component is a set of nodes where every node is reachable from any other in the same set. Identifying these components is often a first step before deeper analysis like community detection, anomaly tracking, or attack path modeling.

This post explores how connected components are computed. We’ll start with basic graph concepts and walk through the serial algorithms based on graph traversal and disjoint set union, then explain parallel algorithms using graph contraction. We’ll also discuss how this computation fits into vertex-centric graph processing models. Finally, we’ll demonstrate how to compute connected components directly on tabular data using PuppyGraph, without data migration or ETL.

Graphs and Connectivity Concepts

A graph is a set of nodes (also called vertices) connected by edges. It’s a foundational structure for representing relationships—between users, devices, events, or any entities that interact. Following convention, we typically denote the number of nodes by n and the number of edges by m.

Graphs can be undirected, where edges imply mutual relationships (like friendships), or directed, where edges have direction (like who follows whom on a social platform).

In the property graph model, which is widely used in practice, both nodes and edges can carry labels and attributes. For example, a node might represent a user with attributes like name and role, and an edge might represent an access relationship with an associated timestamp. These properties make graphs more expressive and useful in domains like fraud detection, cybersecurity, and infrastructure modeling.

Connectivity

In an undirected graph, connectivity means there is a path between two nodes. If you can get from node A to node B by following a sequence of edges, then A and B are connected.

A connected component is a maximal group of nodes where every node is reachable from every other node in the same group. If a graph has more than one connected component, it is called disconnected.

Figure: In this graph, there are 5 connected components.

For a more formal treatment, connectivity can also be defined in terms of vertex cuts. A vertex cut (or separating set) is a set of nodes whose removal increases the number of connected components in the graph. A graph is said to be k-connected if it has more than k vertices and every vertex cut contains at least k nodes. The connectivity of a graph is the largest integer k for which the graph is k-connected.

This formal definition is useful when reasoning about fault tolerance or structural robustness—for example, in network design or critical infrastructure analysis. But in this post, we focus on computing basic connected components: the simpler notion of which nodes are reachable from one another in the graph.

Serial Algorithms for Connected Components

The simplest way to compute connected components is with a graph traversal. The idea is straightforward: start from a node, explore everything it can reach, and assign all reachable nodes to the same component. Repeat the process for any node that hasn't yet been visited.

DFS and BFS

Both DFS and BFS can be used to explore the graph one component at a time. Here's how it works:

  1. Initialize a component ID counter.

  2. For each unvisited node:

    • Start a DFS or BFS from that node.

    • Mark all visited nodes as belonging to the current component.

    • Increment the component ID.

This method visits every node once and examines every edge a constant number of times, so the total time complexity is O(n + m), where n is the number of nodes and m is the number of edges.

DFS is often used because it's simple to implement recursively, while BFS, though similar in total work, can serve as the basis for parallel algorithms due to its level-based structure.
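To make this concrete, here is a minimal BFS-based sketch in Python. The adjacency-list input format and all names are illustrative rather than tied to any particular library.

from collections import deque

def connected_components(adj):
    """adj: dict mapping each node to a list of neighbors (undirected)."""
    component = {}   # node -> component ID
    next_id = 0
    for start in adj:
        if start in component:
            continue                     # already reached by an earlier traversal
        component[start] = next_id       # flood out from 'start'
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component[v] = next_id
                    queue.append(v)
        next_id += 1
    return component

# Example: two components, {0, 1, 2} and {3, 4}.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(connected_components(adj))  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}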

Flood Fill Analogy

This process is sometimes compared to flood filling in image processing. Imagine pouring ink on a node: the ink spreads out through all connected nodes, coloring them as part of one component. Then you move to the next unvisited node and repeat. Each flood fill represents one connected component.

Disjoint Set Union (Union-Find)

Another efficient approach is to use the Disjoint Set Union (DSU) data structure, which maintains a partition of nodes into disjoint sets. Initially, each node is its own set. As you process edges, you union the sets of the two endpoints. After all edges are processed, nodes in the same set belong to the same connected component. 

The DSU structure supports two operations:

  • find(x): returns the representative of the set containing x.

  • union(x, y): merges the sets containing x and y.

With path compression and union by rank, both operations run in amortized time O(α(n)), where α is the inverse Ackermann function, a value that grows extremely slowly. In practice, this makes the algorithm effectively constant time for all reasonable input sizes.
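Here is a compact Python sketch of DSU with both optimizations; this version uses path halving, a common variant of path compression, and the names are illustrative.

class DSU:
    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root
        self.rank = [0] * n           # upper bound on tree height

    def find(self, x):
        # Path halving: make x skip to its grandparent on the way to the root.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        # Union by rank: attach the shorter tree under the taller one.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

# Process edges in any order; nodes with the same root share a component.
dsu = DSU(5)
for u, v in [(0, 1), (1, 2), (3, 4)]:
    dsu.union(u, v)
print([dsu.find(x) for x in range(5)])  # [0, 0, 0, 3, 3]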

With this method, you can process edges in arbitrary order, even when they arrive as a stream. The rooted-tree structure it maintains also informs the design of the parallel graph contraction algorithms in the next section.

Parallel Algorithms for Connected Components Based on Graph Contraction

Many parallel algorithms for finding connected components rely on graph contraction, including the pioneering Shiloach-Vishkin algorithm, Reif’s randomized contraction algorithm, and its deterministic variants.

We won’t go into the specifics of the underlying computation models here. Instead, you can imagine an idealized machine with unbounded parallelism—as many computational cores as needed—and concurrent read/write access to shared memory. This is a theoretical model, but it allows us to focus on the logical structure of parallel algorithms without getting bogged down in hardware constraints.

Even if we don’t have unlimited parallel hardware in practice, many of these algorithms can still be serialized and executed efficiently on real machines by simulating the parallel steps.

Graph Contraction and Hooking

The basic idea behind these algorithms is simple: if there is an edge connecting two nodes, we can merge them into a single node. The resulting node logically inherits all edges from the two original nodes. In graph theory, this operation is called a contraction. By repeatedly contracting edges, we gradually reduce the size of the graph while preserving its connected structure.

This contraction process can also be interpreted using another data structure: rooted trees, optionally with self-loops at the roots. A rooted tree is a tree in which one node is designated as the root, and every other node has a unique parent, forming a directed acyclic structure. A rooted tree is determined by a parent function P(x), where each node points to its parent. For the root, P(x)=x; that is, the root is its own parent.

In the context of connected components, we use rooted trees to represent partial components during the contraction process. Initially, each node forms a singleton tree with itself as the root. As contractions proceed, trees are merged by making the root of one tree point to a node in another tree, forming progressively larger trees that represent connected components. This merging step is commonly referred to as hooking.

Figure: Consider that nodes 1, 2, and 3, along with the edges between them, are processed and contracted, forming a single connected component. In the corresponding rooted forest, the rooted trees of nodes 2 and 3 are hooked to the rooted tree of node 1.

Rooted trees are a valuable data structure for the connected component problem, akin to the disjoint set union data structure. However, three issues require consideration.

  1. To perform contraction in parallel, we need to ensure that operations do not interfere with each other. If we use rooted trees to represent the current components, we must prevent a tree from being hooked to multiple other trees at the same time.
  2. We want to avoid having the algorithm take too many rounds. Consider a star graph, with a center node connected to many peripheral nodes. A bad case would be if the center node’s tree is hooked to a different peripheral node in each round, one by one. This kind of pattern leads to unnecessarily slow progress in contraction and is exactly what we want to prevent.
  3. Consider an edge between nodes a and b, and suppose we want to hook the tree containing a into the tree containing b. To do this, we first need to find the root of the tree that a belongs to. This find-root operation can become expensive if the trees are deep and we don't manage the structure efficiently.

A Randomized Algorithm by Reif

To address the issues above, Reif proposed a simple randomized algorithm that uses the random mate technique. The algorithm itself is surprisingly simple, even though its analysis is nontrivial: a good example of how randomization can achieve strong results through a straightforward process.

Algorithm Description and Analysis

The algorithm proceeds in rounds. In each round, a coin is flipped in parallel for each node, assigning it either heads or tails. Then, for each edge whose endpoints received different results, the root of the tree containing the heads node is hooked to the tree containing the tails node, in parallel; note that these are concurrent writes. Finally, a parallel path compression step updates parent pointers so that each node points directly to the root of its tree, keeping the trees shallow.

This approach is straightforward and, surprisingly, avoids all three issues. Let us examine them individually.

  1. Only roots whose coin comes up heads connect their tree, and they connect to a neighboring node that flipped tails. A heads root may hook toward several tails neighbors at once, but as noted earlier, the writes to P(x) are concurrent, and any single winner is acceptable. The contraction results in a collection of star graphs and isolated nodes.
  2. In each round, the expected number of edges contracted is m/4, so in expectation the graph shrinks to 3/4 of its size. Thus, the expected number of iteration rounds is O(log n).
  3. As noted in point 1, the contraction produces a collection of star graphs and isolated nodes. If the rooted trees before the hooking process have a height of at most 2, then after hooking, the resulting rooted trees have a height of at most 3. Thus, in each round, the path compression step needs only a constant number of pointer jumps to restore the trees to a height of 2.

While we omit a formal proof of correctness or complexity, the above observations provide a solid intuitive understanding of how the algorithm works.
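For intuition, here is a serial Python simulation of the random-mate rounds. It is a sketch of the logic, not a parallel implementation: coins are read at the roots (effectively one coin per supervertex), sequential assignments stand in for the concurrent writes, and the compression loop simply flattens every tree back to a star. All names are illustrative.

import random

def reif_cc(n, edges):
    """Serial simulation of Reif's random-mate contraction.
    n: number of nodes; edges: list of (u, v) pairs."""
    P = list(range(n))                   # parent pointers; a root satisfies P[x] == x
    live = list(edges)
    while live:
        heads = [random.random() < 0.5 for _ in range(n)]
        # Hooking: the root of a heads tree is hooked onto a tails root.
        # In the parallel model these are concurrent writes; any winner is fine.
        for u, v in live:
            for a, b in ((u, v), (v, u)):
                if heads[P[a]] and not heads[P[b]]:
                    P[P[a]] = P[b]
        # Path compression: flatten every tree back to a star.
        # (In the parallel algorithm a constant number of jumps suffices here.)
        changed = True
        while changed:
            changed = False
            for x in range(n):
                if P[P[x]] != P[x]:
                    P[x] = P[P[x]]
                    changed = True
        # Keep only edges whose endpoints are still in different trees.
        live = [(u, v) for u, v in live if P[u] != P[v]]
    return P  # P[x] is the representative of x's component

print(reif_cc(5, [(0, 1), (1, 2), (3, 4)]))  # e.g. [0, 0, 0, 3, 3]; representatives vary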

A Deterministic Variant

While the randomized algorithm is simple and effective, one may ask whether a similar approach can be implemented deterministically. The answer is yes, though the challenge lies in designing a hooking strategy that ensures consistent progress and avoids pathological cases—without relying on randomness to break symmetry.

A preliminary approach is to establish a total order over the nodes based on their unique IDs, connecting the tree with the larger root ID to the tree with the smaller root ID during each hooking operation. This approach addresses issue 1 effectively but is less effective for issue 2. For example, in a star graph where the center node has the largest ID, it may unfortunately require approximately n rounds to complete. Additionally, issue 3 poses a challenge: in a single round, parallel hooking may produce a tall tree, such as when nodes are linked in descending ID order. Consequently, there is no guarantee on the height of the trees, which may complicate the path compression step.

To address issue 2, consider a node with at least one neighbor of smaller ID. Its tree will be hooked onto exactly one other tree; which one is determined by the outcome of the concurrent writes, a detail that does not affect correctness. The number of successful hooking operations therefore equals the number of such nodes. While this value cannot be controlled directly, the number of nodes with a smaller-ID neighbor plus the number of nodes with a larger-ID neighbor is at least the number of non-isolated nodes, so one of the two counts is always large. This observation suggests a hooking rule: compute both counts; if the former is greater, connect trees with larger root IDs to those with smaller root IDs, and otherwise connect trees with smaller root IDs to those with larger root IDs. For example, in a star graph where the center node has the smallest ID, choosing to connect larger roots to smaller roots attaches all peripheral nodes to the center in a single round. A sketch of this direction-choosing rule follows.
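This Python sketch assumes flat trees, so P[x] is already the root of x; the function name and return strings are illustrative.

def choose_hook_direction(P, edges):
    """Count nodes with a smaller-ID (resp. larger-ID) neighbor, by root IDs,
    and pick the hooking direction that merges more trees this round."""
    has_smaller = set()   # nodes with some neighbor whose root ID is smaller
    has_larger = set()    # nodes with some neighbor whose root ID is larger
    for u, v in edges:
        ru, rv = P[u], P[v]
        if ru == rv:
            continue                     # endpoints already in the same tree
        if ru < rv:
            has_smaller.add(v)
            has_larger.add(u)
        else:
            has_smaller.add(u)
            has_larger.add(v)
    if len(has_smaller) >= len(has_larger):
        return "hook larger roots onto smaller roots"
    return "hook smaller roots onto larger roots"

# Star with center 0 (smallest ID): one round suffices in this direction.
print(choose_hook_direction([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)]))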

To address the third issue, we employ a technique known as shortcutting or pointer jumping: setting P(x) = P(P(x)) for all nodes in parallel. During the path compression step, this operation is repeated for several sub-rounds until all trees have a height of at most 2. Because all nodes jump simultaneously, each pointer skips past 1 ancestor in the first sub-round, 2 in the second, 4 in the third, and so on, so the depth of every node shrinks geometrically. A single round's compression may take more than a constant number of sub-rounds, for instance when a long chain is formed during the hooking process, but the total number of pointer-jumping sub-rounds across the entire algorithm is O(log n). Thus, the overall parallel time complexity remains O(log n).
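In code, one pointer-jumping sub-round is a single synchronous assignment over all nodes. Below is a minimal serial sketch, where reading the old pointers first mimics the synchronous parallel step; the names are illustrative.

def pointer_jump(P):
    """Repeat P[x] = P[P[x]] until every node points directly at its root.
    Each sub-round roughly halves the depth of every node."""
    while True:
        new_P = [P[P[x]] for x in range(len(P))]  # read old pointers, then write
        if new_P == P:
            return P
        P = new_P

# A chain 4 -> 3 -> 2 -> 1 -> 0 flattens in O(log n) sub-rounds.
print(pointer_jump([0, 0, 1, 2, 3]))  # [0, 0, 0, 0, 0]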

Shiloach-Vishkin Algorithm

The Shiloach-Vishkin (SV) algorithm is a pioneering solution for the connected components problem. Although introduced last here for clarity, it predates the previously discussed algorithms. The SV algorithm employs an iterative framework of hooking and path compression. The path compression step is identical to the deterministic variant described earlier, utilizing the pointer jumping technique. We focus here on the hooking rule.

Each iteration round consists of two hooking sub-rounds. In the first sub-round, trees with larger root IDs are connected to trees with smaller root IDs. After this sub-round, some rooted trees, termed stagnant, remain unchanged from the start of the round. In the second sub-round, these stagnant trees are connected to “any” adjacent rooted trees if they have nodes linked to other rooted trees. The word “any” is used here, as the choice of connection is determined by concurrent writing, and the specific outcome is irrelevant.

Compared to the previous deterministic algorithm, the first hooking sub-round connects trees with larger root IDs to trees with smaller root IDs, while the second hooking sub-round can connect trees with smaller root IDs to trees with larger root IDs. This provides an alternative way to address issue 2. The parallel time complexity is O(log n), for reasons similar to those for the previous algorithm.
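The following serial Python sketch walks through one SV round under these rules. It is illustrative rather than a faithful PRAM implementation: sequential writes and an explicit changed set stand in for the concurrent machinery, and stagnant trees are detected in a simplified way.

def find_root(P, x):
    while P[x] != x:
        x = P[x]
    return x

def sv_round(P, edges):
    """One Shiloach-Vishkin round (serial sketch).
    P: parent array; roots satisfy P[x] == x."""
    changed = set()                      # roots of trees that hooked or were hooked onto
    # Sub-round 1: hook trees with larger root IDs onto smaller root IDs.
    for u, v in edges:
        ru, rv = find_root(P, u), find_root(P, v)
        if ru != rv:
            hi, lo = (ru, rv) if ru > rv else (rv, ru)
            P[hi] = lo
            changed.update((hi, lo))
    # Sub-round 2: stagnant trees hook onto "any" adjacent tree.
    for u, v in edges:
        ru, rv = find_root(P, u), find_root(P, v)
        for a, b in ((ru, rv), (rv, ru)):
            if a != b and a not in changed:
                P[a] = b                 # in parallel, the last concurrent write wins
                changed.update((a, b))
    # Path compression by pointer jumping until all trees are flat.
    n = len(P)
    while any(P[P[x]] != P[x] for x in range(n)):
        P[:] = [P[P[x]] for x in range(n)]
    return P

P = list(range(5))
for _ in range(3):                       # O(log n) rounds suffice
    sv_round(P, [(0, 1), (1, 2), (3, 4)])
print(P)                                 # [0, 0, 0, 3, 3]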

Connected Components in the Vertex-Centric Programming Model

In this section, we will explore solving the connected components problem using a different paradigm for expressing parallel graph algorithms: the vertex-centric programming model, together with the method of label propagation.

Vertex-Centric Programming Model

The vertex-centric programming model is a popular abstraction for expressing graph algorithms in parallel. It was first formalized in Google’s Pregel system and has since been adopted by frameworks like Apache Giraph, GraphX, and Flink Gelly. This model is particularly well-suited for distributed computing environments.

In this model, the computation is structured around the vertices of the graph rather than global control flow. Each vertex runs a user-defined function that can read its own state, send messages to neighboring vertices, and receive messages sent in the previous round. Execution proceeds in a sequence of supersteps, and the system synchronizes all vertices between rounds.

More formally, in each superstep, every vertex:

  1. Receives messages sent to it in the previous round.

  2. Updates its local state based on those messages.

  3. Sends new messages to its neighbors (if necessary).

  4. Optionally votes to halt if no further updates are needed.

The system halts when all vertices are inactive and no messages are in flight.

Connected Components via Label Propagation

In the vertex-centric model, connected components are typically computed using a method known as label propagation. This approach is simple, intuitive, and fits well with the message-passing style of the model.

The idea is to let each vertex maintain a label that represents the component it currently believes it belongs to. Initially, this is just the vertex’s own ID. Over time, vertices exchange labels with their neighbors and adopt the smallest one they see. Eventually, all vertices in the same connected component converge to the same label.

Here’s how the algorithm works step by step:

  1. Initialization:
    Each vertex v sets its label to its own ID: L(v)=v.

  2. Superstep loop:
    In each superstep:

    • Every vertex sends its current label L(v) to all its neighbors.

    • Each vertex receives the labels from its neighbors and updates its own label to the minimum value received (including its current label).

  3. Termination:
    The algorithm continues until no label changes in a round—that is, the system reaches a fixed point.

This process propagates the minimum node ID throughout each connected component. Since label values only decrease and never increase, the algorithm always converges in a finite number of steps.
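A minimal bulk-synchronous sketch in Python: the loop below mirrors the superstep structure described earlier, with an explicit inbox standing in for message passing and the active set playing the role of voting to halt. All names are illustrative.

def label_propagation(adj):
    """adj: dict mapping each node to a list of neighbors (undirected)."""
    labels = {v: v for v in adj}         # initialization: L(v) = v
    active = set(adj)                    # vertices whose label changed last superstep
    while active:
        # Send: active vertices send their current label to all neighbors.
        inbox = {v: [] for v in adj}
        for v in active:
            for u in adj[v]:
                inbox[u].append(labels[v])
        # Receive and update: adopt the minimum label seen, if smaller.
        active = set()
        for v, msgs in inbox.items():
            if msgs and min(msgs) < labels[v]:
                labels[v] = min(msgs)
                active.add(v)            # changed, so it sends again next superstep
    return labels

# Two components converge to labels 0 and 3, the minimum IDs.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(label_propagation(adj))  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}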

Label propagation is not asymptotically optimal, but it is easy to implement, scales well, and fits naturally into vertex-centric and bulk-synchronous systems. It requires no central coordination, uses minimal communication, and is widely used in practice. While it may take more iterations to converge—especially on graphs with large diameters or weak connectivity—its simplicity and compatibility with distributed infrastructure make it a practical choice for large-scale connected component computation.

Connected Components in PuppyGraph

PuppyGraph is the first and only real-time, zero-ETL graph query engine on the market, empowering data teams to query existing relational data stores as a unified graph model that can be deployed in under 10 minutes, bypassing the cost, latency, and maintenance hurdles of traditional graph databases. It supports connected component computation directly on top of existing relational data without requiring data movement or ETL. This makes it well suited for production environments where the graph is defined over tables in SQL databases or data lakes, and where performance, scale, and ease of use all matter.

PuppyGraph uses a graph schema to define how tabular data should be interpreted as a graph. The schema specifies which tables represent vertices and edges, how columns map to identifiers and attributes, and how relationships are constructed. After uploading the schema, you can query the graph, including its built-in algorithms, using openCypher or Gremlin.

PuppyGraph provides two built-in algorithms for component analysis (see the documentation for more details): Weakly Connected Components (WCC) and Connected Component Finding. They serve different purposes and are suited to different types of graph queries.

The Weakly Connected Components algorithm identifies all groups of nodes that are connected when edge directions are ignored. This is useful when analyzing directed graphs where directionality may not matter for the purpose of understanding structure—such as identifying isolated user clusters or disconnected subnetworks. The result assigns each node a componentId representing its connected group.

To run WCC, you can use either Gremlin or openCypher syntax. A simple example in Cypher looks like this:

CALL algo.wcc({
  labels: ['User'],
  relationshipTypes: ['LINK']
}) YIELD id, componentId
RETURN id, componentId

For more control, PuppyGraph also supports parameters like maximum iterations, minimum component size, and edge weight filtering.

The Connected Component Finding algorithm, on the other hand, is designed to compute reachability from a given set of start nodes, taking into account edge direction. It returns the set of reachable nodes along with the number of hops from the nearest start node. This is especially useful in exploratory or scenario-based queries, such as “which machines can be reached from a compromised account”.

Here is an example usage in Cypher:

CALL algo.connectedComponentFinding({
  relationshipTypes: ['Knows', 'WorkAt'],
  startIds: ['User[2]'],
  direction: 'BOTH',
  maxIterations: 10
}) YIELD ID, hop
RETURN ID, hop

Both algorithms scale to large datasets and can optionally export results directly to object storage for downstream analysis.

Conclusion

Connected components are a foundational concept in graph analysis. In this post, we walked through both serial and parallel approaches—from classic algorithms like DFS and Union-Find to scalable methods such as Reif’s contraction-based technique and the Shiloach-Vishkin algorithm. We also looked at how label propagation is used in vertex-centric systems for distributed processing.

PuppyGraph makes it easy to run connected component analysis directly on your relational data: no ETL, no data duplication. Whether you’re working with directed or undirected graphs, you can use high-level queries or built-in algorithms to uncover structure at scale.

Try the PuppyGraph Developer Edition, or book a demo to see how it works in action.

Sa Wang
Software Engineer

Sa Wang is a Software Engineer with exceptional mathematical ability and strong coding skills. He holds a Bachelor's degree in Computer Science and a Master's degree in Philosophy from Fudan University, where he specialized in Mathematical Logic.
