Apache Spark Graph Analytics: From GraphX to GraphFrames

Sa Wang | Software Engineer | June 30, 2025

Apache Spark is a powerful framework for distributed data processing, widely used in analytics, machine learning, and real-time data pipelines. While it is best known for handling structured and semi-structured data at scale, Spark also supports graph analytics, where data is modeled as a graph and analyzed in terms of relationships and connectivity.

Spark offers two major tools for graph analytics. The first is GraphX, a built-in API based on the original Resilient Distributed Dataset (RDD) abstraction. The second is GraphFrames, an external library built on Spark DataFrames that offers a higher-level API and additional functionality, such as pattern matching and seamless integration with relational queries.

In this post, we’ll explore some core concepts behind general-purpose graph systems and examine how GraphX and GraphFrames are designed within the Spark ecosystem. Though they differ in interface and capabilities, they share common principles in representing and processing graph data. We’ll also briefly discuss the broader idea of graph queries on relational data, beyond what’s available in Spark.

Property Graph Model 

Let's start with the property graph model, the data model commonly used by graph systems, including GraphX and GraphFrames. A property graph represents data as a set of vertices (also known as nodes) and edges (also known as relationships), where both vertices and edges can carry arbitrary key-value attributes. This model offers a flexible way to describe rich, connected data.

Each vertex typically represents an entity, such as a user, a product, or a device, and may store information like name, category, or status. Edges describe relationships between these entities, such as follows, purchased, or connected to, and can also have properties, like timestamps or weights.

For example, in a simple e-commerce graph:

  • A vertex might represent a user with properties like name = "Alice" and age = 30.

  • Another vertex might represent a product with properties like title = "Laptop" and price = 1000.

  • An edge between them might represent a purchase, with a property like timestamp = "2024-10-01".

This model allows both structure and data to be embedded directly in the graph, making it suitable for a wide range of graph analytics and queries.
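To make this concrete, here is the e-commerce example above written as plain Scala values. The case classes are purely illustrative, not part of any Spark API:

case class Vertex(id: Long, props: Map[String, Any])
case class Rel(src: Long, dst: Long, label: String, props: Map[String, Any])

val alice    = Vertex(1L, Map("name" -> "Alice", "age" -> 30))
val laptop   = Vertex(2L, Map("title" -> "Laptop", "price" -> 1000))
val purchase = Rel(1L, 2L, "purchased", Map("timestamp" -> "2024-10-01"))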

The Graph-Parallel Abstraction

Graph algorithms often require repeated computations that operate over the neighborhood structure of a graph—for example, updating a node’s value based on its neighbors. To express this kind of computation efficiently, many graph systems adopt a graph-parallel abstraction, which shifts the focus from global data operations to vertex-centric logic.

In the graph-parallel model, each vertex runs a small program that can read its own state, exchange messages with its neighbors, and update its value accordingly. These vertex programs execute in parallel across all vertices and proceed in iterations (or supersteps), gradually converging to a result. Well-known systems like Pregel and PowerGraph popularized this model with their bulk-synchronous and GAS (Gather-Apply-Scatter) designs.

This abstraction fits naturally with many graph algorithms such as PageRank, label propagation, and connected components. With the abstraction, developers can focus on local rules: how a vertex updates based on the messages it receives, and which messages it sends next.

Distributed Dataflow in Spark

Spark belongs to a class of systems known as distributed dataflow frameworks, which generalize the MapReduce model for large-scale parallel data processing. These systems are characterized by a few key features:

  1. Typed collections as the core data model. Computation operates on collections of records or objects—structured or unstructured—that can be partitioned across a cluster. These collections generalize the idea of relational tables and provide a flexible foundation for analytics.

  2. A coarse-grained, data-parallel programming model. Users express computation using high-level, deterministic operators like map, filter, groupBy, and join. These transformations are applied to entire partitions of data at once, rather than record by record, enabling efficient execution and fault recovery.

  3. Execution as a directed acyclic graph (DAG) of tasks. Each job is compiled into a DAG where nodes represent parallel tasks and edges capture data dependencies. The scheduler breaks the job into stages, each consisting of tasks that process partitions independently. This allows Spark to pipeline transformations and optimize execution plans across the full job.

Building on the foundation of distributed dataflow, Spark extends the programming model with a rich set of operators and multiple layers of abstraction. These features make Spark both expressive and adaptable to a wide range of workloads, including graph processing.

Expanded Operators in Spark

Spark divides its operations into two categories: transformations and actions. Transformations are lazy, meaning they define a new dataset but do not trigger execution. Instead, they construct a logical plan that Spark later optimizes and executes when an action is invoked.

Transformations are the heart of Spark’s programming model and include a broad set of high-level operators:

  • Element-wise operations: map, flatMap, and filter apply user-defined logic to each record independently.

  • Grouping and aggregation: groupByKey, reduceByKey, and aggregateByKey collect and summarize values by key—essential for operations like counting neighbors or aggregating messages in graph algorithms.

  • Join operations: join, leftOuterJoin, cogroup, and others allow combining datasets based on shared keys, a fundamental step in connecting vertex and edge data.

  • Set and structural transformations: such as union, distinct, repartition, and coalesce, which manipulate the layout and composition of datasets across partitions.

By chaining these transformations, users can define complex computation pipelines that Spark executes efficiently and in parallel. This composability is especially useful in graph processing, where multiple stages of filtering, joining, and aggregating are often required.
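As a small illustration, here is a hypothetical pipeline that chains transformations from each of the categories above; nothing executes until the final action. (`sc` is an existing SparkContext, and the edge data is made up.)

val edges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L), (1L, 2L)))

val neighborCounts = edges
  .distinct()            // set transformation: drop the duplicate edge
  .groupByKey()          // grouping: gather destination vertices per source
  .mapValues(_.size)     // element-wise: count the neighbors

neighborCounts.collect() // action: only now does Spark execute the lazy plan
// e.g., Array((1,2), (2,1))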

Storage Abstractions: RDDs and DataFrames

Spark provides two primary abstractions for representing and manipulating distributed data: Resilient Distributed Datasets (RDDs) and DataFrames. Both support parallel transformations and follow the same execution model, but they differ in structure, flexibility, and optimization.

RDDs are Spark’s original abstraction. An RDD is an immutable, distributed collection of objects that supports functional transformations such as map, filter, and groupByKey. RDDs are type-safe, allowing users to define collections of specific objects (e.g., RDD[Int] or RDD[(String, Double)]). They offer fine-grained control over data layout and partitioning, which is particularly important in systems like GraphX that require custom control over data distribution.

RDDs maintain a lineage graph, which records how a dataset was derived. If a partition is lost due to failure, Spark can recompute it using this lineage, avoiding the need for replication. While RDDs offer full flexibility, they are treated as opaque collections by the engine, which limits Spark’s ability to optimize execution.

DataFrames, introduced later, are a higher-level abstraction built on top of RDDs. DataFrame is actually a type alias for Dataset[Row] in Scala. A DataFrame represents a distributed table with named columns and a known schema, much like a relation in a database. Users can write transformations in a declarative style, chaining methods like select, filter, groupBy, and join, or using SQL syntax directly.

Unlike RDDs, DataFrames benefit from Catalyst, Spark’s query optimizer, which performs logical rewrites, predicate pushdown, and physical planning to improve performance. This makes DataFrames especially efficient for relational workloads and enables automatic query optimization that RDDs cannot provide.

Although both RDDs and DataFrames support lazy evaluation and distributed processing, they are suited to different needs:

  • RDDs are preferred when working with complex, low-level transformations, custom data types, or when control over partitioning is essential.

  • DataFrames are preferred when working with structured data, especially when performance and concise syntax are priorities.
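To make the contrast concrete, here is the same aggregation sketched in both styles. The inputs purchasesRdd and purchasesDf are assumed to exist with a (user, amount) shape; they are not defined elsewhere in this post.

import org.apache.spark.sql.functions.sum

// RDD style: functional and type-safe, but opaque to the optimizer
val totalsRdd = purchasesRdd.reduceByKey(_ + _)   // purchasesRdd: RDD[(String, Double)]

// DataFrame style: declarative and schema-aware, optimized by Catalyst
val totalsDf = purchasesDf                        // columns: user, amount
  .groupBy("user")
  .agg(sum("amount").as("total"))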

GraphX: Spark’s Built-in Graph Library


GraphX is the original graph processing API in Apache Spark, offering a distributed property graph model tightly integrated with Spark’s core RDD abstraction. It enables users to represent, transform, and analyze graphs using a combination of functional operators and iterative algorithms—all without leaving the Spark ecosystem.

Graph Representation

class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]
}

GraphX models graphs using two typed RDDs:

  • A vertex RDD: VertexRDD[VD], where VD is the vertex attribute type.

  • An edge RDD: EdgeRDD[ED], where ED is the edge attribute type.

These are logically combined into a Graph[VD, ED], ensuring edge references are consistent with the vertex set. GraphX also provides a triplets view, which pairs each edge with the attributes of its source and destination vertices. This enables expressive logic across neighborhood structures without requiring users to write explicit joins.
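As a minimal sketch, a graph can be assembled from a vertex RDD and an edge RDD like this (`sc` is an existing SparkContext; the data is illustrative):

import org.apache.spark.graphx.{Edge, Graph}

val users = sc.parallelize(Seq(
  (1L, "Alice"),
  (2L, "Bob")
))
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows")
))
val graph: Graph[String, String] = Graph(users, follows)

// The triplets view pairs each edge with its endpoint attributes
graph.triplets.collect().foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))
// Alice follows Bob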

Some Core Graph Operators

GraphX provides a set of core graph operators that let users transform graph structure and attributes using familiar functional programming patterns. These operators are lazy and composable, preserving Spark’s lineage-based execution model. They fall into several categories:

Property Operators

These operators modify the attributes associated with vertices or edges:

  • mapVertices: applies a function to each vertex attribute.

  • mapEdges: transforms edge attributes.

  • mapTriplets: updates edge attributes using the full triplet (source, edge, destination).

Example: If the vertex attribute is a name like “alice”, it will be transformed to “ALICE”. This is useful for preprocessing or standardizing attributes.

val upperGraph = graph.mapVertices { case (id, attr) => attr.toUpperCase }

Structural Operators

Structural operators change the topology of the graph:

  • subgraph: filters vertices and edges based on predicates, producing an induced subgraph.

  • reverse: reverses the direction of all edges.

  • groupEdges: merges multiple edges between the same pair of vertices using a user-defined function.

Example: This creates a subgraph containing only active vertices and any edge that connects two such vertices. It’s useful for pruning irrelevant parts of the graph.

val activeSubgraph = graph.subgraph(vpred = (id, attr) => attr != "inactive")

Join Operators

Join operators enrich a graph with external data:

  • joinVertices: joins an RDD with the vertex set to update attributes.

  • outerJoinVertices: similar, but allows handling missing values explicitly.

Example: If vertex 1 originally had the attribute “Alice”, this becomes “Alice in CS”.

val dept: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "CS"), (2L, "Math")
))

val updatedGraph = graph.joinVertices(dept) {
  case (id, oldAttr, newDept) => s"$oldAttr in $newDept"
}

Neighborhood Aggregation

A core capability in GraphX is expressing computations that depend on a vertex’s local neighborhood. This is made possible through the aggregateMessages operator, which underpins many graph-parallel algorithms and replaced the earlier mapReduceTriplets operator to improve performance.

class Graph[VD, ED] {
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]
}

The aggregateMessages operator runs in two stages:

  1. Message sending (sendMsg):

    Each edge can send one or more messages to its source and/or destination vertex using the EdgeContext.

  2. Message merging (mergeMsg):

    Messages arriving at the same vertex are merged using a commutative and associative function.

The result is an RDD of messages aggregated by vertex ID.

Example: Count the in-degrees of vertices. You can also use the built-in inDegrees operator directly, which internally relies on essentially the same aggregation.

val inDegrees = graph.aggregateMessages[Int](
  triplet => triplet.sendToDst(1),   // send 1 to each destination
  _ + _                              // anonymous function to sum messages at each vertex
)

Pregel API

While core graph operators like aggregateMessages are powerful for expressing single-pass neighborhood computations, many graph algorithms require iterative, vertex-centric computation. For this, GraphX provides the Pregel API—a high-level abstraction inspired by Google’s Pregel model.

In this model, each vertex maintains a state and exchanges messages with its neighbors over a series of supersteps. The computation proceeds until a termination condition is reached, typically when no vertex receives new messages.

class GraphOps[VD, ED] {
  def pregel[A]
      (initialMsg: A,
       maxIterations: Int = Int.MaxValue,
       activeDirection: EdgeDirection = EdgeDirection.Out)
      (vprog: (VertexId, VD, A) => VD,
       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
       mergeMsg: (A, A) => A)
    : Graph[VD, ED] = {
    // Receive the initial message at each vertex
    var g = graph.mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()

    // Compute the initial messages
    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop until no messages remain or maxIterations is reached
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages and update the vertices.
      g = g.joinVertices(messages)(vprog).cache()
      val oldMessages = messages
      // Send new messages, skipping edges where neither side received a message. We must cache
      // messages so it can be materialized on the next line, allowing us to uncache the previous
      // iteration.
      messages = GraphXUtils.mapReduceTriplets(
        g, sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
      activeMessages = messages.count()
      i += 1
    }
    g
  }
}

To use Pregel, you define three functions:

  • Vertex program (vprog): updates a vertex’s value given an incoming message

  • Send message (sendMsg): defines what messages a vertex sends along outgoing edges

  • Message combiner (mergeMsg): merges messages destined for the same vertex

You also provide an initial message (initialMsg), which is sent to all vertices before the first iteration.

Example: Shortest Path from a Source Vertex.

val sourceId: VertexId = 1L
val initialGraph = graph.mapVertices((id, _) => 
  if (id == sourceId) 0.0 else Double.PositiveInfinity
)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)  // merge messages
)

Built-in Algorithms

In addition to manually writing graph algorithms, you can use built-in GraphX algorithms for common use cases, such as connectivity, PageRank, and Shortest Paths. The following algorithms are included in GraphX:

  • connectedComponents: Identifies weakly connected components in the graph. Each vertex is labeled with the smallest vertex ID in its component.
  • stronglyConnectedComponents: Computes strongly connected components for directed graphs within a bounded number of iterations.
  • pageRank: Estimates the importance of each vertex using the classic PageRank algorithm. Available in both dynamic and fixed-iteration versions.
  • personalizedPageRank: A variation of PageRank that biases the random walk toward a given source vertex.
  • shortestPaths: Computes the shortest path distances from a set of source vertices to all reachable nodes using a breadth-first-style traversal.
  • triangleCount: Counts the number of triangles (closed triplets) each vertex is part of—useful for community detection and clustering analysis.

These built-in functions return either a transformed graph (with updated vertex attributes) or an RDD of vertex results. They can be chained with custom logic or additional transformations using the GraphX API.

Example: PageRank with convergence tolerance 0.0001. 

val result = graph.pageRank(0.0001).vertices
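Because the result is an ordinary VertexRDD, it can be chained with further transformations. For instance, here is a sketch that joins the ranks back to the vertex attributes, assuming the attributes are names as in the earlier examples:

// Pair each rank with its vertex attribute (assumed here to be a name)
val ranksByName = graph.vertices.join(result).map {
  case (id, (name, rank)) => (name, rank)
}
ranksByName.collect().foreach(println)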

GraphFrames: A DataFrame-Based API


GraphFrames is an external Spark package that brings graph processing to Spark DataFrames. You can say GraphFrames is to DataFrames what GraphX is to RDDs. While GraphX relies on RDDs and functional operators, GraphFrames builds entirely on Spark SQL and DataFrames, enabling expressive graph queries that benefit from Catalyst optimization and tight integration with relational data. It offers high-level APIs in Scala, Java, and Python, combining GraphX functionality with features that leverage Spark DataFrames, such as motif finding.

Graph Representation

class GraphFrame { 
  def vertices: DataFrame 
  def edges: DataFrame
  def triplets: DataFrame
}

A GraphFrame is constructed from two DataFrames:

  • A vertex DataFrame, where each row represents a vertex and includes a column named id as the unique identifier.
  • An edge DataFrame, where each row represents an edge and includes src and dst columns for source and destination IDs.
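As a minimal sketch, a GraphFrame is assembled from the two DataFrames like this (`spark` is an existing SparkSession; the data is illustrative):

import org.graphframes.GraphFrame

val v = spark.createDataFrame(Seq(
  ("1", "Alice", 30),
  ("2", "Bob", 28)
)).toDF("id", "name", "age")

val e = spark.createDataFrame(Seq(
  ("1", "2", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(v, e)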

Like GraphX, it also provides a triplets view, which can be constructed conceptually as a three-way join of the edge and vertex DataFrames:

e.join(v, e("src") === v("id")).join(v, e("dst") === v("id"))

(In practice, the two vertex sides must be aliased to avoid ambiguous column references.)

Operators, Algorithms, and API

Like GraphX, GraphFrames must support a set of core graph operators. However, one of the key advantages of GraphFrames is that most of these operators can be expressed using standard DataFrame operations. Thanks to the expressiveness of Spark SQL, many transformations over vertices and edges—such as filtering, joins, and projections—can be written directly as relational queries.

The main remaining challenge is neighborhood aggregation—the kind of graph-parallel computation modeled in GraphX by the aggregateMessages operator.

In GraphX, aggregateMessages implements a two-phase message-passing model similar to the Gather phase in the GAS abstraction. Conceptually, it performs a projection followed by a grouping and reduction over the triplet view. This logic can be illustrated in SQL terms as:

SELECT t.dstId, reduceF(mapF(t)) AS msgSum
FROM triplets AS t
GROUP BY t.dstId

Here, mapF transforms each triplet into an intermediate message, and reduceF aggregates those messages by destination vertex. 

As a result, GraphFrames can reimplement the behavior of aggregateMessages using native DataFrame transformations. In principle, this means that Pregel-style iterative algorithms and many built-in GraphX functions can be reproduced in GraphFrames. Indeed, GraphFrames offers a library of common graph algorithms, much like GraphX.

Here is an example of counting the in-degrees of vertices. You can compute this directly with groupBy and count, which shows the power of DataFrame’s relational operators.

val inDegrees = g.edges.groupBy("dst").count()
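GraphFrames also ships a more general aggregateMessages builder in the spirit of GraphX. The sketch below follows the pattern from the GraphFrames documentation, summing the ages of each vertex’s in-neighbors; it assumes the vertex DataFrame has an age column (as in the construction example above), and the exact builder API may vary slightly across GraphFrames versions.

import org.apache.spark.sql.functions.sum
import org.graphframes.lib.AggregateMessages

val AM = AggregateMessages
val summedAges = g.aggregateMessages
  .sendToDst(AM.src("age"))          // each edge sends the source's age to its destination
  .agg(sum(AM.msg).as("summedAges")) // merge messages arriving at each vertex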

GraphFrames also supports conversion between GraphFrames and GraphX graphs, enabling users to move between the APIs as needed. This allows GraphFrames to inherit the flexibility of GraphX while adding SQL expressiveness and optimization on top.
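A hedged sketch of the round trip (toGraphX produces a GraphX graph with Row attributes; fromGraphX rebuilds a GraphFrame; exact signatures may vary across GraphFrames versions):

// GraphFrame -> GraphX: vertex and edge attributes become Rows
val gx = g.toGraphX   // org.apache.spark.graphx.Graph[Row, Row]

// GraphX -> GraphFrame: works for attribute types Spark SQL can encode
val gf = GraphFrame.fromGraphX(graph)   // graph: Graph[String, String] from the GraphX sections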

Motif Finding

Despite some nuance between motif finding and pattern matching, we will use them interchangeably here to mean searching for structural patterns in a graph. Motif finding is a unique feature of GraphFrames not available in GraphX. While GraphX requires manual composition of joins and filters to express subgraph patterns, GraphFrames provides a high-level find API that supports concise, declarative pattern queries optimized through Catalyst.

GraphFrames motif finding uses a simple Domain-Specific Language (DSL) for expressing structural queries, resembling a simplified version of Cypher. A motif is a pattern of vertices and edges defined using a concise syntax, such as “(a)-[e]->(b)”, which represents a directed edge from vertex “a” to vertex “b”, with the edge labeled 'e'. Multiple edges can be chained to form complex patterns, for example, “(a)-[e1]->(b); (b)-[e2]->(c)”, which identifies all paths of length two: a → b → c. This motif can be used to identify follow chains in a social network.

val paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")

The result returns a DataFrame containing all such structures in the graph, with columns corresponding to each named element (vertices or edges) in the motif. In this case, the returned columns will be 'a', 'b', 'c', 'e1', and 'e2'.

You can filter the results by applying conditions on the attributes of matched elements using standard DataFrame filters:

paths.filter("a.name = 'Alice' and e2.relationship = 'follows'")

This returns all 2-hop paths where Alice is the source and the second edge is a “follows” relationship.

Comparing GraphX and GraphFrames

There’s much we haven’t covered in this article, such as optimizations and system-level implementation, as they fall outside its scope. Our focus is on the basic concepts and usage. For convenience, here’s an overview comparing GraphX and GraphFrames, which we discussed earlier.

Feature/Aspect | GraphX | GraphFrames
Core Data Abstraction | Resilient Distributed Dataset (RDD) | DataFrame
Supported Languages | Scala, Java | Scala, Java, Python
API Style | Functional, type-safe, lower-level | Declarative, relational-style, high-level
Graph Representation | Property graph with vertex and edge RDDs, supporting arbitrary attribute types | Property graph with vertex and edge DataFrames
Graph Views | vertices, edges, and triplets | vertices, edges, triplets, and arbitrary pattern views
Graph Algorithms | PageRank, Connected Components, Shortest Paths, Triangle Count, SVD++, … | All GraphX algorithms, plus BFS, Strongly Connected Components, …
Pattern Matching | Not supported | Supported via the find() API
Graph/Relational Integration | Weak; primarily graph-focused | Strong; seamless integration of graph and relational queries
Current Status | Core Spark component | External package, under development, may be included in core Spark

Graph Analytics on Relational Data

We’ve seen how GraphX and GraphFrames enable Spark to perform graph analytics by offering graph-based interfaces over RDDs and DataFrames. This makes it feasible to run graph algorithms and traversals directly on relational data. GraphFrames, in particular, offers a convenient mix of SQL and graph abstractions, making it appealing for users familiar with DataFrames.

However, both systems have important limitations:

  1. No pattern matching in GraphX. GraphX supports standard graph operators and algorithms but does not provide a way to express structural queries like “find common friends” or “match a path with certain constraints.” GraphFrames includes a motif-finding API for basic pattern matching, but it’s limited in scope and lacks the expressiveness of full graph query languages such as openCypher or Gremlin.
  2. Limited support for heterogeneous graphs in GraphFrames. GraphFrames does not support multiple types of vertices or edges: all nodes and all relationships must share the same schema. To represent different kinds of entities or interactions, users must manually add type columns and filter results accordingly. GraphX is more flexible here, as it allows arbitrary types in vertex and edge attributes, enabling labeled property graph representations.
  3. Performance trade-offs. GraphX and GraphFrames run on top of Spark’s general-purpose batch engine. Graph operations like multi-hop traversals or recursive joins often result in expensive shuffles and materialization of intermediate results. These systems work well for moderate workloads but tend to struggle with deeply connected or large-scale graphs.

This is where PuppyGraph comes in.

PuppyGraph is the first and only real-time, zero-ETL graph query engine on the market, empowering data teams to query existing relational data stores as a unified graph model that can be deployed in under 10 minutes, bypassing the cost, latency, and maintenance hurdles of traditional graph databases. Instead of requiring users to manually build vertex and edge tables or move data into a dedicated graph database, PuppyGraph lets you define a labeled property graph model directly on top of your existing SQL tables. Once the schema is defined, you can query the graph immediately using openCypher or Gremlin. PuppyGraph also supports several common algorithms, such as PageRank, Connected Components, and Label Propagation, and continues to expand its capabilities.

PuppyGraph is also built for performance and scalability. It uses a vectorized, distributed execution engine that separates compute from storage, enabling it to scale to petabyte-sized datasets. The engine is specifically optimized for graph workloads, making operations like multi-hop traversals and neighborhood aggregation highly efficient. Complex queries like 10-hop traversals can be executed in seconds—without the overhead of Spark’s shuffle-heavy batch execution. 

Figure: PuppyGraph Architecture Diagram
Figure: An example architecture using PuppyGraph with other SQL query engines

Conclusion

Apache Spark supports graph analytics through two complementary tools: GraphX and GraphFrames. GraphX offers fine-grained control over graph structure and algorithms using RDDs, while GraphFrames provides a higher-level API with SQL integration and pattern-matching capabilities via DataFrames. Together, they demonstrate how collections or relational data can be modeled and analyzed as graphs within the Spark ecosystem.

For scenarios that require expressive graph querying or tighter integration with existing relational data sources, dedicated engines like PuppyGraph can extend these capabilities further. Feel free to try the forever-free PuppyGraph Developer Edition or book a demo with our team.


Sa Wang
Software Engineer

Sa Wang is a Software Engineer with exceptional math abilities and strong coding skills. He earned his Bachelor's degree in Computer Science from Fudan University and has been studying Mathematical Logic in the Philosophy Department at Fudan University, expecting to receive his Master's degree in Philosophy in June this year. He and his team won a gold medal in the Jilin regional competition of the China Collegiate Programming Contest and received a first-class award in the Shanghai regional competition of the National Student Math Competition.
