Apache Iceberg Trino: Modern Data Lakehouse Explained

Apache Iceberg is an open table format created at Netflix in 2017 to overcome the limitations of Hive tables at petabyte scale. By 2020, it became a top-level Apache Software Foundation project and is now widely adopted across the data lakehouse ecosystem. Its design brings reliability and openness to analytic storage, ensuring that data stored in different file formats can still be safely queried and updated across multiple engines without being tied to a single vendor.
Trino complements this model as an open-source distributed SQL engine built for low-latency analytics. When paired with Iceberg, it can query governed lakehouse tables directly, giving teams both speed and flexibility without duplicating data. This reflects a larger shift in the industry: the future of analytics is trending toward the decoupling of storage and compute, where open formats like Iceberg define the data layer and engines like Trino provide scalable, interchangeable compute.
In this blog, we’ll explore what Apache Iceberg and Trino are, how the two integrate, how to tune them for maximum performance, and some practical use cases.
What is Apache Iceberg?

Goals of Apache Iceberg
Iceberg was designed as a modern replacement for Hive tables, which struggled with schema evolution, partition management, and safe concurrent writes. At large scale, these gaps made pipelines fragile and analytics unreliable. Iceberg addresses these issues with five core goals:
- Consistency: Provides reliable query results through ACID transactions and snapshot isolation, even when multiple engines access the same table.
- Performance: Improves efficiency with optimized metadata handling, partition pruning, and scan planning that reduce the cost of queries at scale.
- Ease of use: Reduces complexity with features like hidden partitioning and familiar SQL commands for table creation and management.
- Evolvability: Supports schema changes and adapts to new execution engines with minimal disruption to downstream workloads.
- Scalability: Designed to manage petabyte-scale datasets and handle concurrent access in distributed environments.
What is Trino?

Trino is an ANSI SQL compliant query engine for the modern data stack. It began at Facebook as Presto to replace slow Hive and MapReduce jobs with fast, interactive SQL on a massive Hadoop warehouse.
Today it is an open source, distributed engine that queries data where it lives through connectors to object storage, lakehouse table formats, and databases. You use standard SQL while Trino scales out and keeps storage and compute separate for BI, ad hoc analysis, and federated queries across many sources.

Architecture & Internals
To understand how Apache Iceberg and Trino complement each other, we first have to take a look at the underlying architectures of these products.
Structure of Apache Iceberg
Iceberg’s design separates concerns into distinct layers, which makes it easier to ensure consistency, scalability, and interoperability across engines like Trino, Spark, and Flink.

Data Layer
The data layer consists of the files that hold the actual table content, typically stored in cloud or on-premises object storage. For handling deletes and updates, Iceberg’s default strategy is Copy-on-Write (COW), where new data files are created when changes occur. If a table is frequently updated, Iceberg can also use Merge-on-Read (MOR). In this mode, the data layer also includes delete files that track deleted rows, either by position within a data file or by equality on column values. This approach allows updates without rewriting large datasets while still maintaining consistent query performance.
Metadata Layer
The metadata layer tracks the structure and version history of a table. The top-level table metadata file is JSON, while manifest lists and manifest files are written in Avro, a compact, schema-driven format with strong compatibility rules and broad language support. Because these files are self-describing, engines like Trino, Spark, and Flink can all parse the same metadata consistently, making this layer engine-agnostic.
It is built from three key components:
- Metadata files: Define table properties, schema, partition specs, and pointers to the current snapshot.
- Manifest lists: Record which manifest files belong to a snapshot, enabling snapshot isolation and time travel.
- Manifest files: Contain detailed information about data files, such as partition values, row counts, and file-level statistics.
This layered design makes features like schema evolution, time travel, and atomic operations possible. Query engines like Trino and PuppyGraph rely on these metadata files to plan queries efficiently, pruning unnecessary files and ensuring consistent results without scanning the entire dataset.
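Trino exposes this metadata directly through hidden system tables. A minimal sketch, assuming the iceberg.demo.items table created later in this post:

SELECT committed_at, snapshot_id, operation
FROM iceberg.demo."items$snapshots";

SELECT *
FROM iceberg.demo."items$manifests";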
Catalog Layer
The catalog tracks table definitions and the pointer to the current metadata file. Iceberg uses optimistic concurrency for commits: a writer reads the current pointer, stages a new metadata file, then performs an atomic compare-and-swap. If another commit wins, the update is rejected and the writer retries against the latest snapshot.
This delivers atomic commits and snapshot isolation for a single table, so readers see either the old state or the new state, never a partial write. Iceberg ships with support for Hive Metastore, AWS Glue, JDBC, Nessie, and REST catalogs, and custom catalogs can be added through its pluggable Java APIs or the REST Catalog protocol.
Structure of Trino Execution Engine

Coordinator
A coordinator is one type of Trino server, with every cluster having one coordinator node. The coordinator parses SQL, builds the logical plan, and applies core optimizations such as predicate and column pruning, join reordering, and pushdown when supported. It then creates the distributed plan, schedules work on workers, manages resource allocation, tracks progress, and handles retries.
Workers
A worker is the other type of Trino server, and a cluster can consist of zero or more worker nodes. Workers execute the coordinator’s plan by reading assigned splits (independent slices of input) and running operators like scans, filters, joins, and aggregations. They exchange intermediate data and manage memory, spilling to disk when needed. Planning creates only the required splits via pruning, and the cluster scales by adding workers.
Connectors & Catalogs
Connectors let Trino interact with external systems. Each connector implements interfaces such as the Metadata API and the Data Location API, which the coordinator calls for statistics and layout information to optimize queries. Common sources include Kafka, lakehouse table formats, PostgreSQL and other operational databases, warehouses such as Snowflake and Teradata, and non-relational systems like MongoDB.
In Trino, a catalog is a configured instance of a connector. It contains one or more schemas that can hold tables, views, and materialized views. As a unified query engine, Trino allows you to configure and use many catalogs to connect to multiple data sources simultaneously.
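As a quick illustration, here are a few exploratory statements; the iceberg catalog and demo table are the ones configured later in this post, and any catalog you configure can be addressed the same way:

SHOW CATALOGS;
SHOW SCHEMAS FROM iceberg;

-- Tables are always addressed as catalog.schema.table
SELECT * FROM iceberg.demo.items LIMIT 10;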
Combining SQL & Graph workloads: Example Architecture
Modern data stacks follow a common pattern: separate storage from compute. Data lives in cheap, durable object storage, and engines like Trino and PuppyGraph can be chosen based on fit for the use case, not limited by a vendor’s bundled options.

Storage Layer
Data sits in object storage and is governed by Apache Iceberg. Using an open table format avoids vendor lock-in and lets multiple engines share the same tables. Iceberg’s metadata and snapshot model enable safe concurrent writes, schema evolution, time travel, and efficient pruning.
Query Layer
Trino is the unified SQL engine. This means that one Trino cluster can attach multiple catalogs at the same time, allowing you to query Iceberg tables alongside other sources, such as PostgreSQL and Databricks, in a single statement. Each source is exposed through its connector, and the catalog name becomes the first part of the table path.
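For example, a single statement can join an Iceberg table with a table from an operational database. The sketch below assumes a postgresql catalog and illustrative table names, separate from the hands-on example later in this post:

SELECT o.id, o.total, c.name
FROM iceberg.sales.orders AS o            -- Iceberg table in object storage
JOIN postgresql.public.customers AS c     -- operational table in PostgreSQL
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2024-01-01';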
BI & Viz Layer
Clients such as Tableau, Looker, notebooks, and services connect to Trino over JDBC or ODBC and issue SQL. Trino implements a largely ANSI SQL compatible dialect with some engine and connector specific functions.
Apache Iceberg + Trino Integration
Trino’s Iceberg connector supports Apache Iceberg table spec versions 1 and 2. You choose the data file format with the format table property, using Parquet, ORC, or Avro. Iceberg records file paths in its metadata, so Trino plans from metadata first and touches the storage layer only for the files it needs.
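For instance, both properties can be set when creating a table through the Iceberg connector; the table and columns below are illustrative:

CREATE TABLE iceberg.demo.events (
  id BIGINT,
  ts TIMESTAMP(6),
  payload VARCHAR
)
WITH (
  format = 'PARQUET',   -- data file format: PARQUET, ORC, or AVRO
  format_version = 2    -- Iceberg table spec version
);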
Trino normally reads configuration from /etc/trino: node.properties, jvm.config, config.properties, and an optional log.properties. The official Trino Docker image ships with sensible defaults, so for a simple setup you usually only need to add a catalog file under /etc/trino/catalog, configuring the connection to your data source.
Hands-on Example
To see the integration in action, we can spin up a simple Docker container with the following docker-compose.yaml:
services:
  rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc anonymous set public minio/warehouse;
      tail -f /dev/null
      "
  trino:
    image: trinodb/trino:latest
    container_name: trino
    depends_on:
      - rest
      - minio
    networks:
      iceberg_net:
    ports:
      - 8080:8080
    volumes:
      - ./post-init.sql:/post-init.sql:ro
      - ./catalog/iceberg.properties:/etc/trino/catalog/iceberg.properties:ro
networks:
  iceberg_net:
    name: trino-iceberg
We can configure the Trino connector for Apache Iceberg in ./catalog/iceberg.properties, which the compose file mounts into the container at /etc/trino/catalog/iceberg.properties:
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest:8181/
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.region=us-east-1
s3.path-style-access=true
s3.aws-access-key=admin
s3.aws-secret-key=password
Then, the post-init.sql script will create and populate our Iceberg table with dummy data once the containers are running:
CREATE SCHEMA IF NOT EXISTS iceberg.demo;
DROP TABLE IF EXISTS iceberg.demo.items;
CREATE TABLE iceberg.demo.items (
id INT,
name VARCHAR
);
INSERT INTO iceberg.demo.items VALUES
(1, 'marko'),
(2, 'peter');
Now that we have everything defined, we can start the containers:
docker compose up -d
Create and populate the table:
docker compose exec -T trino trino --server http://localhost:8080 -f /post-init.sql
With the table in place, we can enter the Trino shell to begin querying:
docker compose exec -it trino trino
To make sure everything is set up correctly, we can run a simple SQL command:
SELECT * FROM iceberg.demo.items;
Performance & Optimization Techniques
Performance in this stack comes from two layers working together: the table format and the query engine. Apache Iceberg abstracts away storage complexities so users can write queries that match their mental model without giving up speed. Trino builds on that by applying both standard and engine-specific optimizations, reducing the amount of data scanned and minimizing the work pushed to Iceberg tables. In this section, we delve deeper into how this is made possible.
Iceberg Performance Features
Partition Transforms & Hidden Partitioning
Iceberg lets you define partitions with transform expressions such as day(ts), hour(ts), bucket(id, n), or truncate(col, k) rather than creating and maintaining extra columns just for partitioning. If a table is partitioned by day(ts), you can still filter on ts, since Iceberg applies the transform behind the scenes to prune partitions. Readers don’t have to reference partition keys, and the engine can skip whole partitions early.
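A sketch of what this looks like in Trino SQL, using an illustrative table:

-- Partition by day(event_ts) without adding a separate date column
CREATE TABLE iceberg.demo.events_by_day (
  id BIGINT,
  event_ts TIMESTAMP(6),
  payload VARCHAR
)
WITH (partitioning = ARRAY['day(event_ts)']);

-- Filtering on the raw timestamp still prunes partitions
SELECT count(*)
FROM iceberg.demo.events_by_day
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00';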
Metadata-driven Pruning
Snapshots, manifest lists, and manifest files record partition values and file stats. Engines plan from this metadata first, then read only the files that survive pruning. This allows the execution engines to read only what matters and avoid unnecessary I/O.
Copy-On-Write (COW) vs Merge-On-Read (MOR)
Equality and position delete files enable updates without full rewrites. Copy-On-Write favors read performance by rewriting files on change, while Merge-On-Read favors write throughput by applying deletes at read time. Understanding the requirements of your workload helps you optimize how you handle deletes and updates in Iceberg tables.
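One way to see which kinds of files a table currently carries is Trino’s $files metadata table, whose content column distinguishes data files from delete files; a sketch against the demo table from this post:

SELECT content,        -- 0 = data, 1 = position deletes, 2 = equality deletes
       file_path,
       record_count
FROM iceberg.demo."items$files";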
Schema & Partition Evolution
Iceberg lets you evolve columns or change the partition spec without costly operations like rewriting table data or migrating to a new table. The layout stays aligned with how queries actually filter and group, and users can keep querying their data without rewriting their SQL.
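In Trino SQL this is ordinary DDL. A hedged sketch against the illustrative events table from earlier; partition evolution via SET PROPERTIES requires a reasonably recent Trino release:

-- Schema evolution: add and rename columns without rewriting data
ALTER TABLE iceberg.demo.events ADD COLUMN country VARCHAR;
ALTER TABLE iceberg.demo.events RENAME COLUMN payload TO body;

-- Partition evolution: new data uses the new spec; existing files keep their old layout
ALTER TABLE iceberg.demo.events SET PROPERTIES partitioning = ARRAY['month(ts)'];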
Trino Performance Features
Cost-based Optimizations
Trino’s cost-based optimizer uses table and column statistics, along with information about the physical layout of data, that it obtains through connectors; a connector can even expose multiple layouts for a single table. Using estimated costs, Trino chooses the most efficient join order and join distribution, deciding whether to broadcast a small input or partition both sides, and reordering joins and aggregates to do less work.
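With the Iceberg connector, statistics can be collected and inspected from SQL; a minimal sketch using the demo table:

-- Collect table and column statistics for the cost-based optimizer
ANALYZE iceberg.demo.items;

-- Inspect the statistics Trino will plan with
SHOW STATS FOR iceberg.demo.items;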
Pushdown
Trino pushes filters, projections, and other supported operations into connectors so less data is read. With Iceberg, predicate pushdown drives partition and file pruning, projection pushdown reads only needed columns, and vectorized readers handle Parquet and ORC efficiently. Pushdown depth depends on connector support and table layout.
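You can check how much of a query is handled by the connector with EXPLAIN; the exact plan output varies by Trino version, but the filter should appear as part of the Iceberg table scan rather than as a separate step over all rows. A sketch using the partitioned table from above:

EXPLAIN
SELECT id, payload
FROM iceberg.demo.events_by_day
WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00';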
Adaptive Plan Optimizations
Trino refines the plan using information at runtime. Dynamic filtering narrows the probe-side scan of a join by sending discovered key values back to the scan, which lets Iceberg prune more files. Trino can also spill to disk when memory is tight so large joins and aggregations finish reliably.
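Dynamic filtering is enabled by default, but it can be toggled per session; the join below reuses the illustrative tables from earlier:

SET SESSION enable_dynamic_filtering = true;

SELECT e.id, e.payload
FROM iceberg.demo.events_by_day AS e
JOIN iceberg.demo.items AS i
  ON e.id = i.id
WHERE i.name = 'marko';   -- keys discovered on the build side narrow the scan of events_by_day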
Parallelism
Trino’s execution engine consists of one or more worker nodes. Capitalizing on its strength as a distributed SQL query engine, Trino’s optimizer breaks the plan into stages, parts of the plan that can be executed across workers, so the same computation runs on different subsets of the input data simultaneously.
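To see how a plan is broken into stages that run across workers, use the distributed explain; fragment details vary by version:

EXPLAIN (TYPE DISTRIBUTED)
SELECT name, count(*) AS cnt
FROM iceberg.demo.items
GROUP BY name;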
Operations & Management with Trino
Keeping your Trino environment healthy goes beyond speeding up queries. As usage grows and more teams rely on the platform, operations and management become crucial for ensuring reliability, security, and accurate insights.
Cost-aware Analytics
Data stays in low-cost object storage, and Trino scans only what Iceberg metadata says is relevant. Pruning from manifests and stats means fewer bytes and fewer CPU seconds. Compute scales independently of storage, and because both layers are open, you avoid vendor lock-in while keeping spend predictable.
Interactive BI
Skip the warehouse load. Trino serves fast SQL directly on Iceberg tables, using metadata to prune files and read just the needed columns. BI tools hit one endpoint, and Iceberg’s ACID snapshots keep views consistent and up to date. Operations like compaction and snapshot cleanup can be run from SQL.
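For example, the Iceberg connector exposes table maintenance as SQL; a sketch using the demo table, with illustrative retention thresholds:

-- Compact small files into larger ones
ALTER TABLE iceberg.demo.items EXECUTE optimize;

-- Expire old snapshots and clean up files no longer referenced by any snapshot
ALTER TABLE iceberg.demo.items EXECUTE expire_snapshots(retention_threshold => '7d');
ALTER TABLE iceberg.demo.items EXECUTE remove_orphan_files(retention_threshold => '7d');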
Cross-source Joins
Join Iceberg facts with reference data in PostgreSQL or a warehouse from a single Trino query. One endpoint, many sources, and policies stay centralized. The open table format plus an open engine means you can change storage or engines later without rewriting data or pipelines.
Reproducible Analytics
Iceberg snapshots enable time travel and quick rollback, and Trino exposes both in SQL by snapshot ID or timestamp. Teams can audit, compare runs, and restore known-good states without moving data. The combination yields trustworthy metrics and shorter investigations.
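In Trino SQL this looks as follows; the snapshot ID and timestamp are placeholders you would read from the $snapshots metadata table, and the rollback procedure may differ slightly across Trino versions:

-- Query the table as of a specific snapshot or point in time
SELECT * FROM iceberg.demo.items FOR VERSION AS OF 1234567890123456789;
SELECT * FROM iceberg.demo.items FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';

-- Roll the table back to a known-good snapshot
CALL iceberg.system.rollback_to_snapshot('demo', 'items', 1234567890123456789);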
PuppyGraph: The Trino of Graphs
One of Apache Iceberg’s biggest strengths is interoperability. That means you can pair the right compute engine with the job at hand. While Trino excels at SQL analytics, a graph query engine such as PuppyGraph shines when it comes to relationship-centric questions like paths, communities, and influence.

PuppyGraph is the first real-time, zero-ETL graph query engine. It is not a traditional graph database but a query engine that runs directly on top of your existing data infrastructure, letting data teams query relational stores in data warehouses, data lakes, and databases as a unified graph and get up and running in under 10 minutes. This zero-ETL approach is its core differentiator: it avoids the cost, latency, and maintenance of a separate graph database and the complex Extract, Transform, Load (ETL) pipelines that would feed one.
Instead of migrating data into a specialized store, PuppyGraph connects to sources including PostgreSQL, Apache Iceberg, Delta Lake, BigQuery, and others, then builds a virtual graph layer over them. Graph models are defined through simple JSON schema files, making it easy to update, version, or switch graph views without touching the underlying data.
This approach aligns with the broader shift in modern data stacks to separate compute from storage. You keep data where it belongs and scale query power independently, which supports petabyte-level workloads without duplicating data or managing fragile pipelines.
PuppyGraph also helps to cut costs. Our pricing is usage based, so you only pay for the queries you run. There is no second storage layer to fund, and data stays in place under your existing governance. With fewer pipelines to build, monitor, and backfill, day-to-day maintenance drops along with your bill.


PuppyGraph also supports Gremlin and openCypher, two expressive graph query languages ideal for modeling user behavior. Pattern matching, path finding, and grouping sequences become straightforward. These types of questions are difficult to express in SQL, but natural to ask in a graph.

As data grows more complex, the teams that win ask deeper questions faster. PuppyGraph fits that need. It powers cybersecurity use cases like attack path tracing and lateral movement, observability work like service dependency and blast-radius analysis, fraud scenarios like ring detection and shared-device checks, and GraphRAG pipelines that fetch neighborhoods, citations, and provenance. If you run interactive dashboards or APIs with complex multi-hop queries, PuppyGraph serves results in real time.
Getting started is quick. Most teams go from deploy to query in minutes. You can run PuppyGraph with Docker, AWS AMI, GCP Marketplace, or deploy it inside your VPC for full control.
Conclusion
Iceberg and Trino separate storage and compute without lock-in. Iceberg defines the table contract for schema, partitions, snapshots, and ACID commits. Trino plans from that metadata, prunes partitions and files, and reads only the columns it needs. One cluster can join Iceberg with systems like PostgreSQL or a warehouse through catalogs, so teams query from a single endpoint while data stays in object storage.
The pairing is practical to run. Schema evolution and time travel support change and audits. Routine maintenance such as snapshot expiration, compaction, and manifest rewrites keeps latency steady and costs predictable. Because both layers are open, you can scale compute or swap engines later without reshaping storage.
When the questions are about relationships, switch to graph. PuppyGraph maps the same Iceberg tables to vertices and edges, executes deep traversals efficiently, and keeps Iceberg as the source of truth. You keep the same Iceberg catalog and choose the model that fits the job: SQL with Trino, graph with PuppyGraph.
Want graph analytics without moving data? Try the free PuppyGraph Developer Edition, or book a demo with our team to see how PuppyGraph fits into your architecture.
Get started with PuppyGraph!
Developer Edition
- Forever free
- Single node
- Designed for proving your ideas
- Available via Docker install
Enterprise Edition
- 30-day free trial with full features
- Everything in Developer Edition, plus enterprise features
- Designed for production
- Available via AWS AMI & Docker install