Apache Iceberg Trino: Modern Data Lakehouse Explained

Apache Iceberg is an open table format created at Netflix in 2017 to overcome the limitations of Hive tables at petabyte scale. By 2020, it became a top-level Apache Software Foundation project and is now widely adopted across the data lakehouse ecosystem. Its design brings reliability and openness to analytic storage, ensuring that data stored in different file formats can still be safely queried and updated across multiple engines without being tied to a single vendor.
Trino complements this model as an open-source distributed SQL engine built for low-latency analytics. When paired with Iceberg, it can query governed lakehouse tables directly, giving teams both speed and flexibility without duplicating data. This reflects a larger shift in the industry: the future of analytics is trending toward the decoupling of storage and compute, where open formats like Iceberg define the data layer and engines like Trino provide scalable, interchangeable compute.
In this blog, we’ll explore what Apache Iceberg and Trino are, how the two integrate, how to tune them for maximum performance, and some practical use cases.
What is Apache Iceberg?

Goals of Apache Iceberg
Iceberg was designed as a modern replacement for Hive tables, which struggled with schema evolution, partition management, and safe concurrent writes. At large scale, these gaps made pipelines fragile and analytics unreliable. Iceberg addresses these issues with five core goals:
- Consistency: Provides reliable query results through ACID transactions and snapshot isolation, even when multiple engines access the same table.
- Performance: Improves efficiency with optimized metadata handling, partition pruning, and scan planning that reduce the cost of queries at scale.
- Ease of use: Reduces complexity with features like hidden partitioning and familiar SQL commands for table creation and management.
- Evolvability: Supports schema changes and adapts to new execution engines with minimal disruption to downstream workloads.
- Scalability: Designed to manage petabyte-scale datasets and handle concurrent access in distributed environments.
What is Trino?

Trino is an ANSI SQL compliant query engine for the modern data stack. It began at Facebook as Presto to replace slow Hive and MapReduce jobs with fast, interactive SQL on a massive Hadoop warehouse.
Today it is an open source, distributed engine that queries data where it lives through connectors to object storage, lakehouse table formats, and databases. You use standard SQL while Trino scales out and keeps storage and compute separate for BI, ad hoc analysis, and federated queries across many sources.

Architecture & Internals
To understand how Apache Iceberg and Trino complement each other, we first have to take a look at the underlying architectures of these products.
Structure of Apache Iceberg
Iceberg’s design separates concerns into distinct layers, which makes it easier to ensure consistency, scalability, and interoperability across engines like Trino, Spark, and Flink.

Data Layer
The data layer consists of the files that hold the actual table content, typically stored in cloud or on-premises object storage. For handling deletes and updates, Iceberg’s default strategy is Copy-on-Write (COW), where new data files are created when changes occur. If a table is frequently updated, Iceberg can also use Merge-on-Read (MOR). In this mode, the data layer also includes delete files that track deleted rows, either by position within a data file or by equality on column values. This approach allows updates without rewriting large datasets while still maintaining consistent query performance.
Metadata Layer
The metadata layer tracks the structure and version history of a table. The top-level table metadata file is JSON, while manifest lists and manifest files are written in Avro, a compact, schema-driven format with strong compatibility rules and broad language support. Because these files are self-describing, engines like Trino, Spark, and Flink can all parse the same metadata consistently, making this layer engine-agnostic.
It is built from three key components:
- Metadata files: Define table properties, schema, partition specs, and pointers to the current snapshot.
- Manifest lists: Record which manifest files belong to a snapshot, enabling snapshot isolation and time travel.
- Manifest files: Contain detailed information about data files, such as partition values, row counts, and file-level statistics.
This layered design makes features like schema evolution, time travel, and atomic operations possible. Query engines like Trino and PuppyGraph rely on these metadata files to plan queries efficiently, pruning unnecessary files and ensuring consistent results without scanning the entire dataset.
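Trino exposes this metadata directly through hidden system tables. A minimal sketch, assuming the iceberg.demo.items table created later in this post:

SELECT committed_at, snapshot_id, operation
FROM iceberg.demo."items$snapshots";

SELECT *
FROM iceberg.demo."items$manifests";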
Catalog Layer
The catalog tracks table definitions and the pointer to the current metadata file. Iceberg uses optimistic concurrency for commits: a writer reads the current pointer, stages a new metadata file, then performs an atomic compare-and-swap. If another commit wins, the update is rejected and the writer retries against the latest snapshot.
This delivers atomic commits and snapshot isolation for a single table, so readers see either the old state or the new state, never a partial write. Iceberg ships with support for Hive Metastore, AWS Glue, JDBC, Nessie, and REST catalogs, and custom catalogs can be added through its pluggable Java APIs or the REST Catalog protocol.
Structure of Trino Execution Engine

Coordinator
A coordinator is one type of Trino server, with every cluster having one coordinator node. The coordinator parses SQL, builds the logical plan, and applies core optimizations such as predicate and column pruning, join reordering, and pushdown when supported. It then creates the distributed plan, schedules work on workers, manages resource allocation, tracks progress, and handles retries.
Workers
A worker is the other type of Trino server, and a cluster can consist of zero or more worker nodes. Workers execute the coordinator’s plan by reading assigned splits (independent slices of input) and running operators like scans, filters, joins, and aggregations. They exchange intermediate data and manage memory, spilling to disk when needed. Planning creates only the required splits via pruning, and the cluster scales by adding workers.
Connectors & Catalogs
Connectors let Trino interact with external systems. Each connector implements interfaces such as the Metadata API and the Data Location API, which the coordinator calls for statistics and layout information to optimize queries. Common sources include Kafka, lakehouse table formats, PostgreSQL and other operational databases, warehouses such as Snowflake and Teradata, and non-relational systems like MongoDB.
In Trino, a catalog is a configured instance of a connector. It contains one or more schemas that can hold tables, views, and materialized views. As a unified query engine, Trino allows you to configure and use many catalogs to connect to multiple data sources simultaneously.
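As a quick illustration, here are a few exploratory statements; the iceberg catalog and demo table are the ones configured later in this post, and any catalog you configure can be addressed the same way:

SHOW CATALOGS;
SHOW SCHEMAS FROM iceberg;

-- Tables are always addressed as catalog.schema.table
SELECT * FROM iceberg.demo.items LIMIT 10;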
Combining SQL & Graph workloads: Example Architecture
Modern data stacks follow a common pattern: separate storage from compute. Data lives in cheap, durable object storage, and engines like Trino and PuppyGraph can be chosen based on fit for the use case, not limited by a vendor’s bundled options.

Storage Layer
Data sits in object storage and is governed by Apache Iceberg. Using an open table format avoids vendor lock-in and lets multiple engines share the same tables. Iceberg’s metadata and snapshot model enable safe concurrent writes, schema evolution, time travel, and efficient pruning.
Query Layer
Trino is the unified SQL engine. This means that one Trino cluster can attach multiple catalogs at the same time, allowing you to query Iceberg tables alongside other sources, such as PostgreSQL and Databricks, in a single statement. Each source is exposed through its connector, and the catalog name becomes the first part of the table path.
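For example, a single statement can join an Iceberg table with a table from an operational database. The sketch below assumes a postgresql catalog and illustrative table names, separate from the hands-on example later in this post:

SELECT o.id, o.total, c.name
FROM iceberg.sales.orders AS o            -- Iceberg table in object storage
JOIN postgresql.public.customers AS c     -- operational table in PostgreSQL
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2024-01-01';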
BI & Viz Layer
Clients such as Tableau, Looker, notebooks, and services connect to Trino over JDBC or ODBC and issue SQL. Trino implements a largely ANSI SQL compatible dialect with some engine and connector specific functions.
Apache Iceberg + Trino Integration
Trino’s Iceberg connector supports Apache Iceberg table spec versions 1 and 2. You choose the data file format with the format table property, using Parquet, ORC, or Avro. Iceberg records file paths in its metadata, so Trino plans from metadata first and touches the storage layer only for the files it needs.
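For instance, both properties can be set when creating a table through the Iceberg connector; the table and columns below are illustrative:

CREATE TABLE iceberg.demo.events (
  id BIGINT,
  ts TIMESTAMP(6),
  payload VARCHAR
)
WITH (
  format = 'PARQUET',   -- data file format: PARQUET, ORC, or AVRO
  format_version = 2    -- Iceberg table spec version
);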
Trino normally reads configuration from /etc/trino: node.properties, jvm.config, config.properties, and an optional log.properties. The official Trino Docker image ships with sensible defaults, so for a simple setup you usually only need to add a catalog file under /etc/trino/catalog, configuring the connection to your data source.
Hands-on Example
To see the integration in action, we can spin up a simple Docker container with the following docker-compose.yaml:
services:
  rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc anonymous set public minio/warehouse;
      tail -f /dev/null
      "
  trino:
    image: trinodb/trino:latest
    container_name: trino
    depends_on:
      - rest
      - minio
    networks:
      iceberg_net:
    ports:
      - 8080:8080
    volumes:
      - ./post-init.sql:/post-init.sql:ro
      - ./catalog/iceberg.properties:/etc/trino/catalog/iceberg.properties:ro
networks:
  iceberg_net:
    name: trino-iceberg
We can configure the Trino connector for Apache Iceberg in ./catalog/iceberg.properties, which the compose file mounts into the container at /etc/trino/catalog/iceberg.properties:
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest:8181/
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=us-east-1
s3.path-style-access=true
s3.aws-access-key=admin
s3.aws-secret-key=password
Then, the post-init.sql script will create and populate our Iceberg table with dummy data once the containers are running:
CREATE SCHEMA IF NOT EXISTS iceberg.demo;
DROP TABLE IF EXISTS iceberg.demo.items;
CREATE TABLE iceberg.demo.items (
id INT,
name VARCHAR
);
INSERT INTO iceberg.demo.items VALUES
(1, 'marko'),
(2, 'peter');
Now that we have everything defined, we can start the containers:
docker compose up -d
Create and populate the table:
docker compose exec -T trino trino --server http://localhost:8080 -f /post-init.sql
With the table in place, we can enter the Trino shell to begin querying:
docker compose exec -it trino trino
To make sure everything is set up correctly, we can run a simple SQL command:
SELECT * FROM iceberg.demo.items;
Performance & Optimization Techniques
Performance in this stack comes from two layers working together: the table format and the query engine. Apache Iceberg abstracts away storage complexities so users can write queries that match their mental model without giving up speed. Trino builds on that by applying both standard and engine-specific optimizations, reducing the amount of data scanned and minimizing the work pushed to Iceberg tables. In this section, we delve deeper into how this is made possible.
Iceberg Performance Features
Partition Transforms & Hidden Partitioning
Iceberg lets you define partitions with transform expressions such as day(ts), hour(ts), bucket(id, n), or truncate(col, k) rather than creating and maintaining extra columns just for partitioning. If a table is partitioned by day(ts), you can still filter on ts, since Iceberg applies the transform behind the scenes to prune partitions. Readers don’t have to reference partition keys, and the engine can skip whole partitions early.
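A sketch of what this looks like in Trino SQL, using an illustrative table:

-- Partition by day(event_ts) without adding a separate date column
CREATE TABLE iceberg.demo.events_by_day (
  id BIGINT,
  event_ts TIMESTAMP(6),
  payload VARCHAR
)
WITH (partitioning = ARRAY['day(event_ts)']);

-- Filtering on the raw timestamp still prunes partitions
SELECT count(*)
FROM iceberg.demo.events_by_day
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00';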
Metadata-driven Pruning
Snapshots, manifest lists, and manifest files record partition values and file stats. Engines plan from this metadata first, then read only the files that survive pruning. This allows the execution engines to read only what matters and avoid unnecessary I/O.
Copy-On-Write (COW) vs Merge-On-Read (MOR)
Equality and position delete files enable updates without full rewrites. Copy-On-Write favors read performance by rewriting files on change, while Merge-On-Read favors write throughput by applying deletes at read time. Understanding the requirements of your workload helps you optimize how you handle deletes and updates in Iceberg tables.
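One way to see which kinds of files a table currently carries is Trino’s $files metadata table, whose content column distinguishes data files from delete files; a sketch against the demo table from this post:

SELECT content,        -- 0 = data, 1 = position deletes, 2 = equality deletes
       file_path,
       record_count
FROM iceberg.demo."items$files";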
Schema & Partition Evolution
Iceberg lets you evolve columns or change the partition spec without costly operations like rewriting table data or migrating to a new table. The layout stays aligned with how queries actually filter and group, and users can keep querying their data without rewriting their SQL.
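In Trino SQL this is ordinary DDL. A hedged sketch against the illustrative events table from earlier; partition evolution via SET PROPERTIES requires a reasonably recent Trino release:

-- Schema evolution: add and rename columns without rewriting data
ALTER TABLE iceberg.demo.events ADD COLUMN country VARCHAR;
ALTER TABLE iceberg.demo.events RENAME COLUMN payload TO body;

-- Partition evolution: new data uses the new spec; existing files keep their old layout
ALTER TABLE iceberg.demo.events SET PROPERTIES partitioning = ARRAY['month(ts)'];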
Trino Performance Features
Cost-based Optimizations
Trino’s cost-based optimizer uses table and column statistics, along with information about the physical layout of data, that it obtains through connectors; a connector can even expose multiple layouts for a single table. Using estimated costs, Trino chooses the most efficient join order and join distribution, deciding whether to broadcast a small input or partition both sides, and reordering joins and aggregates to do less work.
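With the Iceberg connector, statistics can be collected and inspected from SQL; a minimal sketch using the demo table:

-- Collect table and column statistics for the cost-based optimizer
ANALYZE iceberg.demo.items;

-- Inspect the statistics Trino will plan with
SHOW STATS FOR iceberg.demo.items;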
Pushdown
Trino pushes filters, projections, and other supported operations into connectors so less data is read. With Iceberg, predicate pushdown drives partition and file pruning, projection pushdown reads only needed columns, and vectorized readers handle Parquet and ORC efficiently. Pushdown depth depends on connector support and table layout.
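You can check how much of a query is handled by the connector with EXPLAIN; the exact plan output varies by Trino version, but the filter should appear as part of the Iceberg table scan rather than as a separate step over all rows. A sketch using the partitioned table from above:

EXPLAIN
SELECT id, payload
FROM iceberg.demo.events_by_day
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00';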
Adaptive Plan Optimizations
Trino refines the plan using information at runtime. Dynamic filtering narrows the probe-side scan of a join by sending discovered key values back to the scan, which lets Iceberg prune more files. Trino can also spill to disk when memory is tight so large joins and aggregations finish reliably.
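Dynamic filtering is enabled by default, but it can be toggled per session; the join below reuses the illustrative tables from earlier:

SET SESSION enable_dynamic_filtering = true;

SELECT e.id, e.payload
FROM iceberg.demo.events_by_day AS e
JOIN iceberg.demo.items AS i
  ON e.id = i.id
WHERE i.name = 'marko';   -- keys discovered on the build side narrow the scan of events_by_day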
Parallelism
Trino’s execution engine consists of one or more worker nodes. Capitalizing on its strength as a distributed SQL query engine, Trino’s optimizer breaks the plan into stages, parts of the plan that can be executed across workers, so the same computation runs on different subsets of the input data simultaneously.
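To see how a plan is broken into stages that run across workers, use the distributed explain; fragment details vary by version:

EXPLAIN (TYPE DISTRIBUTED)
SELECT name, count(*) AS cnt
FROM iceberg.demo.items
GROUP BY name;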
Operations & Management with Trino
Keeping your Trino environment healthy goes beyond speeding up queries. As usage grows and more teams rely on the platform, operations and management become crucial for ensuring reliability, security, and accurate insights.
Cost-aware Analytics
Data stays in low-cost object storage, and Trino scans only what Iceberg metadata says is relevant. Pruning from manifests and stats means fewer bytes and fewer CPU seconds. Compute scales independently of storage, and because both layers are open, you avoid vendor lock-in while keeping spend predictable.
Interactive BI
Skip the warehouse load. Trino serves fast SQL directly on Iceberg tables, using metadata to prune files and read just the needed columns. BI tools hit one endpoint, and Iceberg’s ACID snapshots keep views consistent and up to date. Operations like compaction and snapshot cleanup can be run from SQL.
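For example, the Iceberg connector exposes table maintenance as SQL; a sketch using the demo table, with illustrative retention thresholds:

-- Compact small files into larger ones
ALTER TABLE iceberg.demo.items EXECUTE optimize;

-- Expire old snapshots and clean up files no longer referenced by any snapshot
ALTER TABLE iceberg.demo.items EXECUTE expire_snapshots(retention_threshold => '7d');
ALTER TABLE iceberg.demo.items EXECUTE remove_orphan_files(retention_threshold => '7d');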
Cross-source Joins
Join Iceberg facts with reference data in PostgreSQL or a warehouse from a single Trino query. One endpoint, many sources, and policies stay centralized. The open table format plus an open engine means you can change storage or engines later without rewriting data or pipelines.
Reproducible Analytics
Iceberg snapshots enable time travel and quick rollback, and Trino exposes both in SQL by snapshot ID or timestamp. Teams can audit, compare runs, and restore known-good states without moving data. The combination yields trustworthy metrics and shorter investigations.
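In Trino SQL this looks as follows; the snapshot ID and timestamp are placeholders you would read from the $snapshots metadata table, and the rollback procedure may differ slightly across Trino versions:

-- Query the table as of a specific snapshot or point in time
SELECT * FROM iceberg.demo.items FOR VERSION AS OF 1234567890123456789;
SELECT * FROM iceberg.demo.items FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';

-- Roll the table back to a known-good snapshot
CALL iceberg.system.rollback_to_snapshot('demo', 'items', 1234567890123456789);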
PuppyGraph: The Trino of Graphs
One of Apache Iceberg’s biggest strengths is interoperability. That means you can pair the right compute engine with the job at hand. While Trino excels at SQL analytics, a graph query engine such as PuppyGraph shines when it comes to relationship-centric questions like paths, communities, and influence.

PuppyGraph is the first real-time, zero-ETL graph query engine. It is not a traditional graph database but a query engine that runs directly on top of your existing data infrastructure, letting data teams query relational stores in data warehouses, data lakes, and databases as a unified graph and get up and running in under 10 minutes. This zero-ETL approach is its core differentiator: it avoids the cost, latency, and maintenance of a separate graph database and the complex Extract, Transform, Load (ETL) pipelines that would feed one.
Instead of migrating data into a specialized store, PuppyGraph connects to sources including PostgreSQL, Apache Iceberg, Delta Lake, BigQuery, and others, then builds a virtual graph layer over them. Graph models are defined through simple JSON schema files, making it easy to update, version, or switch graph views without touching the underlying data.
This approach aligns with the broader shift in modern data stacks to separate compute from storage. You keep data where it belongs and scale query power independently, which supports petabyte-level workloads without duplicating data or managing fragile pipelines.
PuppyGraph also helps to cut costs. Our pricing is usage based, so you only pay for the queries you run. There is no second storage layer to fund, and data stays in place under your existing governance. With fewer pipelines to build, monitor, and backfill, day-to-day maintenance drops along with your bill.


PuppyGraph also supports Gremlin and openCypher, two expressive graph query languages ideal for modeling user behavior. Pattern matching, path finding, and grouping sequences become straightforward. These types of questions are difficult to express in SQL, but natural to ask in a graph.

As data grows more complex, the teams that win ask deeper questions faster. PuppyGraph fits that need. It powers cybersecurity use cases like attack path tracing and lateral movement, observability work like service dependency and blast-radius analysis, fraud scenarios like ring detection and shared-device checks, and GraphRAG pipelines that fetch neighborhoods, citations, and provenance. If you run interactive dashboards or APIs with complex multi-hop queries, PuppyGraph serves results in real time.
Getting started is quick. Most teams go from deploy to query in minutes. You can run PuppyGraph with Docker, AWS AMI, GCP Marketplace, or deploy it inside your VPC for full control.
Conclusion
Iceberg and Trino separate storage and compute without lock-in. Iceberg defines the table contract for schema, partitions, snapshots, and ACID commits. Trino plans from that metadata, prunes partitions and files, and reads only the columns it needs. One cluster can join Iceberg with systems like PostgreSQL or a warehouse through catalogs, so teams query from a single endpoint while data stays in object storage.
The pairing is practical to run. Schema evolution and time travel support change and audits. Routine maintenance such as snapshot expiration, compaction, and manifest rewrites keeps latency steady and costs predictable. Because both layers are open, you can scale compute or swap engines later without reshaping storage.
When the questions are about relationships, switch to graph. PuppyGraph maps the same Iceberg tables to vertices and edges, executes deep traversals efficiently, and keeps Iceberg as the source of truth. You keep the same Iceberg catalog and choose the model that fits the job: SQL with Trino, graph with PuppyGraph.
Want graph analytics without moving data? Try the free PuppyGraph Developer Edition, or book a demo with our team to see how PuppyGraph fits into your architecture.
Get started with PuppyGraph!
Developer Edition
- Forever free
- Single node
- Designed for proving your ideas
- Available via Docker install
Enterprise Edition
- 30-day free trial with full features
- Everything in Developer Edition, plus enterprise features
- Designed for production
- Available via AWS AMI & Docker install