PuppyGraph is the first and only real time, zero-ETL graph query engine in the market, empowering data teams to query existing relational data stores as a unified graph model that deployed in under 10 minutes, bypassing traditional graph databases' cost, latency, and maintenance hurdles. Capable of scaling with petabytes of data and executing complex 10-hop queries in seconds, PuppyGraph supports use cases from enhancing LLMs with knowledge graphs to fraud detection, cybersecurity and more. Trusted by industry leaders, including Coinbase, AMD, Netskope, Palo Alto Network, eBay, and more.

How does PuppyGraph compare to Neo4j?

Unlike Neo4j, which requires you to load and sync data into its proprietary graph store, PuppyGraph runs directly on your data sources—eliminating ETL, reducing TCO, and enabling faster time-to-value. PuppyGraph also integrates natively with Databricks Unity Catalog, Google BigQuery, and AlloyDB.

What are the performance benefits of PuppyGraph?

PuppyGraph delivers multi-hop traversals in seconds over billions of edges. Real customer stories cite 5-hop queries on 1B+ edges in under 3 seconds.

Does PuppyGraph support my cloud data stack?

Yes. PuppyGraph natively integrates with Databricks Unity Catalog, Google BigQuery, AlloyDB, and AWS, keeping a single governed copy of your data.

How does PuppyGraph handle data governance and security?

PuppyGraph leverages your existing catalog and security (Unity Catalog, BigQuery, AlloyDB), so all graph queries respect your current access controls.

Can PuppyGraph power AI and LLM applications (GraphRAG)?

Yes. PuppyGraph enables Graph-based Retrieval Augmented Generation (GraphRAG) directly on your governed data—providing explainable, multi-hop context for LLMs and enterprise AI.

See all articles

Table of Contents

Introduction to MySQL

Data Lakehouse

Iceberg vs Delta Lake : Key Differences Explained

Jaz Ku

Solution Architect

No items found.

November 7, 2025

Iceberg vs Delta Lake : Key Differences Explained

The data lakehouse rose as a way to pair warehouse-style guarantees with the low cost and elasticity of object storage. Projects like Apache Iceberg and Delta Lake add ACID transactions, time travel, and safe schema evolution to open file formats so teams get reliability without giving up affordable storage. Although, differences in their strategies lead to different strengths and weaknesses that you might want to consider when choosing which table format works best for your organization.

In this post, we start with a quick refresher on what a data lakehouse is. We then cover how Iceberg and Delta Lake approach table structure and key features, compare where each excels, and close with a simple guide on when to choose one over the other.

Get Started with PuppyGraph for FREE

What is a Data Lakehouse?

A data lakehouse combines a data lake’s low-cost, flexible storage with a warehouse’s management and performance guarantees. The idea is to keep data in open formats on object storage, then add a transactional layer so many engines can read and write safely. This setup has three parts: the storage layer, the transaction layer, and the compute layer.

Storage Layer

The storage layer relies on inexpensive, elastic object storage for data files and the table-format metadata produced by Iceberg or Delta Lake. The data can be structured, semi-structured, or unstructured, commonly in open formats like Parquet. Tables are registered in a catalog, which tracks the active metadata file. Engines consult this metadata to plan queries, skip irrelevant files, and enforce ACID snapshots. Decoupling storage from compute lets you grow capacity independently and support many engines reading the same datasets.

Transaction Layer

Both Iceberg and Delta Lake live in the transaction layer. It is in this layer that files on object storage are turned into reliable tables that many engines can share safely.

Table format sets layout, schema and partition evolution, and snapshot consistency.
Transaction manager handles commits, conflict detection, and isolation, usually with help from the catalog.
Table services perform maintenance such as compaction, file and manifest rewrite, clustering or sorting, snapshot retention, orphan cleanup, and stats or indexing.
Catalog is the registry. It holds table metadata, offers discovery and authorization APIs, and often helps coordinate transactions.

Compute Layer

The compute layer is where engines do the work. Batch and interactive SQL, graph engines, streaming systems, notebooks, and BI tools connect through connectors that understand the table format, so reads and writes keep ACID guarantees. In a lakehouse, many engines can use the same tables without copying data. For example, run SQL analytics in Trino, then use PuppyGraph for multi-hop graph analysis on the same Iceberg or Delta tables.

Figure: Example Architecture with Multiple Data Sources and Compute Engines

What is Apache Iceberg?

Goals of Apache Iceberg

Apache Iceberg is a table format that was created in 2017 by Netflix’s Ryan Blue and Daniel Weeks. It was designed as a modern replacement for Hive tables, which struggled with schema evolution, partition management, and safe concurrent writes. At large scale, these gaps made pipelines fragile and analytics unreliable. Iceberg addresses these issues with five core goals:

Consistency: Provides reliable query results through ACID transactions and snapshot isolation, even when multiple engines access the same table.
Performance: Improves efficiency with optimized metadata handling, partition pruning, and scan planning that reduce the cost of queries at scale.
Ease of use: Reduces complexity with features like hidden partitioning and familiar SQL commands for table creation and management.
Evolvability: Supports schema changes and adapts to new execution engines with minimal disruption to downstream workloads.
Scalability: Designed to manage petabyte-scale datasets and handle concurrent access in distributed environments.

Structure of Apache Iceberg

Apache Iceberg is simply a table format with a catalog interface. This separation keeps responsibilities clear and makes it easier to run the same tables across multiple engines.

Figure: The Apache Iceberg Architecture (source)

Data Layer

The data layer consists of the files that hold the actual table content, typically stored in cloud or on-premises object storage. Iceberg is file-agnostic, letting users choose what underlying file format to use for storage: Parquet, Avro, and ORC. For handling deletes and updates, Iceberg’s default strategy is CoW, where new data files are created when changes occur. If a table is frequently updated, Iceberg can also use MoR. This approach allows updates without rewriting large datasets while still maintaining consistent query performance. From Iceberg v3 onwards, each data file comes with its own deletion vector to mask out deleted rows during reads. This is a departure from its original design, where delete files were stored separately and scattered across storage.

Metadata Layer

The metadata layer tracks the structure and version history of a table. These files are stored in Avro format, which is compact, schema-driven, and widely supported across languages and platforms. Because Avro embeds its own schema and has strong compatibility rules, engines can all parse the same files consistently, making the metadata layer engine-agnostic.

It is built from three key components:

Metadata files: Define table properties, schema, partition specs, and pointers to the current snapshot.
Manifest lists: Record which manifest files belong to a snapshot, enabling snapshot isolation and time travel.
Manifest files: Contain detailed information about data files, such as partition values, row counts, and file-level statistics.

This layered design makes features like schema evolution, time travel, and atomic operations possible. Query engines like Trino and PuppyGraph rely on these metadata files to plan queries efficiently, pruning unnecessary files and ensuring consistent results without scanning the entire dataset.

Catalog Layer

The catalog tracks table definitions and the pointer to the current metadata file. Iceberg uses optimistic concurrency for commits: a writer reads the current pointer, stages a new metadata file, then performs an atomic compare-and-swap. If another commit wins, the update is rejected and the writer retries against the latest snapshot.

This delivers atomic commits and snapshot isolation for a single table, so readers see either the old state or the new state, never a partial write. Iceberg supports Hive Metastore, AWS Glue, JDBC, Nessie, and REST catalogs. This means that Iceberg can support custom catalogs via pluggable Java APIs or the REST Catalog protocol.

Key Features of Apache Iceberg

Iceberg uses hidden partitioning, which lets you define transforms like day(ts), bucket(id, N), or truncate(col, k) while queries keep filtering on the original columns. Engines use the stored metadata to prune partitions automatically, so SQL stays simple and user error drops. Iceberg also supports partition evolution, letting you move from day(ts) to hour(ts) or add bucket(user_id, 16) as needs change, with both layouts coexisting in one table.

Apache Iceberg greatly prioritizes interoperability, with its REST Catalog API standardizing table and namespace operations over HTTP. Create, alter, list, and commit behave the same across engines, which makes multi-engine setups cleaner, reduces custom glue, and avoids tight coupling to a single metastore.

Planning is fast because engines read a metadata tree before touching data files. A manifest list points to manifests, and each manifest summarizes many data files with partition ranges and column statistics such as min, max, and null counts. Planners can skip whole manifests and files up front, and the same metadata is queryable for debugging and governance.

Get Started with PuppyGraph for FREE

What is Delta Lake?

Goals of Delta Lake

Delta Lake is a table format from Databricks, open-sourced in 2019 with the Linux Foundation. Many of the same engineers wrote “Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark” (Armbrust et al., 2018), which explains Delta Lake’s tight integration with Spark, being fully compatible with Spark APIs and integrating with Spark’s Structured Streaming to allow for both batch and streaming operations.

The early prototype of Structured Streaming focused on adding semantic guarantees on eventually consistent storage, but it hit scaling limits:

Files were held in memory
Only a single writer was supported
No transactions or conflict resolution

Delta Lake served to provide scalable transaction logs that could handle massive data volumes and complex operations, with an emphasis on efficiency and scalability.

Structure of Delta Lak[e

Both Apache Iceberg and Delta Lake bring ACID guarantees to data lakes, but they take different paths. Instead of Iceberg’s metadata tree, Delta Lake keeps a transaction log of JSON files that are periodically compacted into Parquet checkpoints.

Data Files

Delta Lake tables store their underlying data in Parquet file format, although it can still interact with other common file formats like ORC and Avro as a separate source via the Spark engine. These files contain the actual data and are stored in a distributed cloud or on-premises file storage system.

Transaction Log

The Delta log is an ordered history of every change to a table. Each commit writes a JSON file under _delta_log describing the operation, added or removed files, and the schema at that point. This log underpins ACID behavior.

While Delta Lake traditionally relied on CoW, it now also supports MoR similar to Iceberg, with deletion vectors being introduced in Release 2.3, and the ability to merge them in Release 3.1. Deletion vector files are stored in one of two ways: they can be inline within the Delta transaction log's metadata or as separate sidecar files stored alongside the original data files in the _delta_log directory.

Metadata

Table schema, partitions, and configuration live in the transaction log and are accessible through SQL, Spark, Rust, and Python APIs. Engines use this information for schema enforcement and evolution, partition pruning, and data skipping.

Checkpoints

Delta Lake periodically compacts its transaction log into Parquet checkpoint files. By default, this is set to occur every 10 commits. Readers load the latest checkpoint, then apply newer JSON log files instead of replaying the entire history. This speeds up table initialization and recovery.

Key Features

Delta Lake is catalog optional, keeping the setup simple with path-based tables on object storage. When you need governance or discovery, you can still register the tables in Unity Catalog or Hive Metastore, keep the storage layout unchanged, and add permissions and lineage later.

Streaming and batch unification comes from Delta’s transaction log, which lets engines that implement the Delta protocol coordinate concurrent writers, use versioned reads for reproducibility, and apply idempotent commits so real-time and offline pipelines share one source of truth. Although it has strong roots in Spark, the project is adopting a more interoperable stance with the Delta protocol and UniForm.

Figure: Delta Lake Interoperability Capabilities with Delta UniForm (source)

Checkpointed transaction logs are designed to be frequent and lightweight, so readers load the latest Parquet checkpoint and apply only a short tail of recent commits. That keeps overhead low for continuous, high-speed ingestion and makes recovery fast, since only changes since the last checkpoint need to be replayed.

Get Started with PuppyGraph for FREE

Iceberg vs Delta Lake: Feature Comparison

Feature	Delta Lake	Apache Iceberg
Metadata model	Append-only transaction log (JSON) with Parquet checkpoints	Versioned snapshots with metadata tree, manifest list, and manifests
Storage file formats	Parquet for data files	File-agnostic; Parquet, ORC, Avro in the same table if needed
Time travel	By version or timestamp	By snapshot ID or timestamp
Schema evolution	Adds and renames via column mapping, limited type changes, enforced on write. Less flexible than Iceberg.	Field IDs enable safe adds, renames, reorders, and type evolution with compatibility checks
CDC / Incremental reads	Change Data Feed exposes inserts, updates, deletes	Incremental planning via snapshot diffs; equality and position deletes used, no single standardized CDF
Partitioning	Recommends Liquid Clustering over static partitions for more flexibility	Hidden partitioning with transforms and partition evolution
Planning and pruning	Reads checkpoints and log; data skipping via stats, optional Z-order or Liquid Clustering	Reads metadata tree; manifest stats enable file and manifest pruning
Catalog model	Optional. Path-based tables, or register with Unity Catalog or Hive Metastore	Catalog-centric; catalogs include REST Catalog, Hive Metastore, AWS Glue, and more, so teams can choose what fits governance and operational needs
Interoperability	Works best with Spark, but is becoming more interoperable via the Delta protocol, delta-rs/Delta Kernel, and UniForm for cross-format access	Engine-agnostic design; REST Catalog standardizes operations

When to Choose Iceberg vs Delta Lake

As the projects mature, Iceberg and Delta Lake are looking more alike. Both bring ACID tables to object storage, schema evolution, and time travel capabilities. Although, there are still a few cases where one fits better than the other.

When to Choose Iceberg

Large batch analytics at scale: Petabyte-class scans and wide joins benefit from Iceberg’s metadata tree, manifest stats, and hidden partitioning, which let engines prune early and plan fast across thousands of files.
Multi-engine, vendor-neutral setups: The spec and REST Catalog make behavior consistent across Spark, Trino, Flink, Snowflake, and common catalogs like Hive Metastore or AWS Glue.
Flexible schema and partition evolution: Hidden partitioning with transforms reduces SQL friction, and partition spec changes don’t rewrite existing data; old and new layouts can coexist safely.

When to Choose Delta Lake

Databricks and Spark ecosystem: If your pipelines, jobs, and governance live on Databricks or Spark, Delta Lake makes life easy. Path-based setup is simple, with Unity Catalog available when you need central permissions and lineage.
Real-time analytics and streaming: The transactional log, checkpoints, versioned reads, and optional CDF keep streaming and batch on the same tables with consistent semantics.
Smaller to medium-sized datasets: Merge-on-write performs well at this scale, with OPTIMIZE, Z-order, and Liquid Clustering helping keep file counts and query times in check.

The Future of Interoperability

Databricks acquired Tabular, the company founded by the original creators of Apache Iceberg, which signals a push toward alignment between Iceberg and Delta Lake.

Delta Lake is becoming more cross-format friendly through Delta UniForm and the growing Delta protocol ecosystem, while Apache XTable offers metadata translation across Delta, Iceberg, and Hudi.

On the Iceberg side, the community is discussing v4 features like single-file commits and ways to reduce metadata overhead for smaller tables, which echoes Delta’s “one metadata file” approach and explores collapsing parts of the metadata tree where appropriate.

While the internals still differ, users are less locked in. Governance layers such as Unity Catalog now manage tables across formats and expose Iceberg via REST so multiple engines can participate, and UniForm can present Delta tables to Iceberg readers without duplicating data.

Get Started with PuppyGraph for FREE

Where PuppyGraph Fits

Similar to how storage is moving towards choosing the right table format for the job, PuppyGraph introduces a new kind of query engine over your existing data – one specialized for complex, multi-hop graph queries.

Before this, teams either forced SQL engines through heavy joins to stitch context, or stood up a separate graph database with ETL, extra copies, and drift.

Figure: Comparing PuppyGraph and the Traditional Framework of Graph Data Modeling

With PuppyGraph, you keep one source of truth and pick the right engine. Run SQL with Trino or Spark, and run Gremlin or openCypher with PuppyGraph on the same Iceberg or Delta tables. No migrations, no sync jobs, no trade-off between SQL and graph.

Want a unified graph view of all your data? PuppyGraph can do that too. It connects to sources including PostgreSQL, Apache Iceberg, Delta Lake, BigQuery, and others. You keep data where it belongs and scale query power independently, which supports petabyte-level workloads without duplicating data or managing fragile pipelines. PuppyGraph then builds a virtual graph layer over this data. Graph models are defined through simple JSON schema files, making it easy to update, version, or switch graph views without touching the underlying data.

Figure: PuppyGraph Supported Data Sources

PuppyGraph also helps to cut costs. Our pricing is usage based, so you only pay for the queries you run. There is no second storage layer to fund, and data stays in place under your existing governance. With fewer pipelines to build, monitor, and backfill, day-to-day maintenance drops along with your bill.

PuppyGraph also supports Gremlin and openCypher, two expressive graph query languages ideal for modeling user behavior. Pattern matching, path finding, and grouping sequences become straightforward. These types of questions are difficult to express in SQL, but natural to ask in a graph.

Figure: PuppyGraph Cybersecurity Graph Demo Revealing Lateral Movement Risks on Custom-built UI

As data grows more complex, the teams that win ask deeper questions faster. PuppyGraph fits that need. It powers cybersecurity use cases like attack path tracing and lateral movement, observability work like service dependency and blast-radius analysis, fraud scenarios like ring detection and shared-device checks, and GraphRAG pipelines that fetch neighborhoods, citations, and provenance. If you run interactive dashboards or APIs with complex multi-hop queries, PuppyGraph serves results in real time.

Getting started is quick. Most teams go from deployment to query in minutes. You can run PuppyGraph with Docker, AWS AMI, GCP Marketplace, or deploy it inside your VPC for full control.

Get Started with PuppyGraph for FREE

Conclusion

A modern lakehouse gives you warehouse-grade reliability on low-cost object storage, and both Iceberg and Delta Lake deliver on that promise with ACID guarantees, time travel, and safe evolution. The real choice comes down to needs and context: Iceberg shines for large, read-heavy analytics across many engines and catalogs, while Delta Lake fits teams centered on Spark/Databricks and real-time pipelines backed by a transactional log and checkpoints.

As the ecosystems converge, you’re less locked in. Interoperability efforts and governance layers make it easier to mix engines and formats without replatforming. Pick the format that fits today’s workloads, knowing the gap is narrowing.

This shift towards interoperability is also extending to the query layer of data lakehouses. Where graph analytics enter the picture, you no longer need a separate graph database. PuppyGraph overlays your existing Iceberg or Delta tables so you can run Gremlin or openCypher side by side with SQL, keep one source of truth, and skip ETL and data copies. That means faster iteration, fewer pipelines to maintain, and lower cost.

Ready to try graph analytics on your data? Get started with PuppyGraph’s forever-free Developer Edition, or book a demo with our team.

No items found.

Jaz Ku

Solution Architect

Jaz Ku is a Solution Architect with a background in Computer Science and an interest in technical writing. She earned her Bachelor's degree from the University of San Francisco, where she did research involving Rust’s compiler infrastructure. Jaz enjoys the challenge of explaining complex ideas in a clear and straightforward way.

Get started with PuppyGraph!

PuppyGraph empowers you to seamlessly query one or multiple data stores as a unified graph model.

Developer Edition

Forever free
Single noded
Designed for proving your ideas
Available via Docker install

Free Download

Enterprise Edition

30-day free trial with full features
Everything in developer edition & enterprise features
Designed for production
Available via AWS AMI & Docker install

* No payment required

Start Free Trial

Book Demo

Iceberg vs Delta Lake : Key Differences Explained