Modern data architectures are changing. Data lakes alone can’t keep up with real-time business demands, and when operational and analytical workloads are handled in separate systems, it becomes hard to build consistent and trusted data products. At Open Source Data Summit 2025, I presented how combining Apache Iceberg with data streaming technologies like Apache Kafka and Flink bridges this gap and creates a governed, scalable foundation for both real-time and batch workloads. The talk is now available on demand for free, and the slides are available for download.
Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various architecture discussions and best practices.
Apache Iceberg is an open table format designed for modern lakehouses. It enables governed, ACID-compliant tables in object storage that are accessible across engines like Kafka, Flink, Spark, Trino, and beyond.
Iceberg is quickly becoming the default for organizations that want one consistent table format across streaming, lakehouse, and AI workloads — without locking themselves into a single vendor ecosystem. Store data only once in your own S3 bucket, and give every engine or application access to it.
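To make the “store once, query anywhere” idea concrete, here is a minimal sketch that reads such a shared table from Python with PyIceberg. The catalog URI, warehouse bucket, and table name are placeholder assumptions; any Iceberg-compatible REST catalog would work the same way.

```python
# Minimal sketch (placeholder names): reading one shared Iceberg table from
# Python with PyIceberg. The same table can be queried by Spark, Trino, Flink,
# or any other Iceberg-aware engine without copying the data.
from pyiceberg.catalog import load_catalog

# Connect to an Iceberg REST catalog; URI and warehouse are placeholders.
catalog = load_catalog(
    "lakehouse",
    **{
        "uri": "https://catalog.example.com",
        "warehouse": "s3://my-company-lakehouse/",
    },
)

# Load a table that streaming pipelines keep up to date.
orders = catalog.load_table("sales.orders")

# Scan a slice of the table into an Arrow table for local analysis.
recent = orders.scan(row_filter="order_date >= '2025-01-01'").to_arrow()
print(recent.num_rows)
```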
Paired with a Data Streaming Platform, Apache Iceberg enables powerful patterns:
For more technical background and examples, check out this related blog post: Apache Iceberg – The Open Table Format for Lakehouse and Data Streaming.
Other table formats exist. Delta Lake from Databricks is the most prominent example. It has strong momentum in the Databricks ecosystem and continues to evolve. Databricks even acquired Tabular, the company founded by the original creators of Apache Iceberg, to improve interoperability between Delta Lake and Iceberg and to support customers that need both formats.
Even with that step, Delta Lake remains tightly connected to the Databricks platform. Apache Iceberg, in contrast, is vendor-neutral and broadly supported across cloud providers, query engines, and open source projects. This neutrality is a major reason why more enterprises choose Iceberg when they want flexibility, portability, and open governance across their data architecture.
Other table formats such as Apache Hudi exist, alongside catalog projects like Nessie and various proprietary approaches, but none have reached the level of adoption, ecosystem support, and traction that Iceberg and Delta Lake have achieved.
I wrote an entire blog series to discuss how Confluent and Databricks integrate through Apache Iceberg and Delta Lake to build a unified platform for streaming and lakehouse analytics.
While the architecture is powerful, getting operational data into Iceberg tables is not trivial. A few of the key technical challenges include:
These are solvable challenges — but they require experience and tooling. A Data Streaming Platform like Confluent Cloud with its product Tableflow helps by integrating governance, schema management, stream processing, and catalog integration out of the box.
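To illustrate how much plumbing such tooling removes, here is a rough, hand-rolled sketch of Kafka-to-Iceberg ingestion using confluent-kafka and PyIceberg. All names, the batching strategy, and the schema handling are illustrative assumptions, not a recommended implementation.

```python
# Rough sketch (illustrative names and config) of hand-rolled Kafka-to-Iceberg
# ingestion: consume JSON events, micro-batch them, and append to an Iceberg
# table via PyIceberg.
import json

import pyarrow as pa
from confluent_kafka import Consumer
from pyiceberg.catalog import load_catalog

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder cluster
    "group.id": "iceberg-ingest",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["orders"])              # placeholder topic

catalog = load_catalog("lakehouse", **{"uri": "https://catalog.example.com"})
table = catalog.load_table("sales.orders")  # table and schema must already exist

# Naive micro-batching: collect events, then append them as one Iceberg snapshot.
batch = []
while len(batch) < 1000:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    batch.append(json.loads(msg.value()))

table.append(pa.Table.from_pylist(batch))   # Arrow schema must match the table
consumer.commit()                           # at-least-once, not exactly-once
consumer.close()
```

Even this simplified loop has to care about batching, commit ordering, and schema alignment, and it still ignores schema evolution, exactly-once guarantees, compaction, and catalog synchronization. That is exactly where an integrated platform pays off.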
The idea of writing Kafka topics directly into Iceberg tables — sometimes called a “Zero Copy” approach — is appealing. But there are trade-offs to consider:
In many cases, storing data twice is a better pattern. Kafka stores the raw event log, similar to a Write-Ahead Log (WAL) in databases, which provides durability and the ability to reprocess or replay events if needed. Meanwhile, Apache Iceberg stores the structured, governed tables for downstream analytics, machine learning, or SQL workloads. This separation mirrors proven database architecture patterns and ensures both flexibility and reliability in modern data systems.
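As a small illustration of the Kafka-as-WAL idea, the sketch below replays a topic from the earliest offset with a fresh consumer group to rebuild a derived view, while analytical consumers keep reading the governed Iceberg tables untouched. The topic name, group id, and handler are hypothetical.

```python
# Sketch: replaying the raw Kafka event log (the "WAL") to rebuild a derived
# view. Topic name, group id, and the handler are illustrative only.
from confluent_kafka import Consumer


def handle(key, value):
    # Placeholder for real reprocessing logic, e.g. recomputing a corrected
    # Iceberg table or repopulating a downstream store.
    print(key, value)


replay = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-rebuild-2025-11",  # fresh group id starts from scratch
    "auto.offset.reset": "earliest",       # read the topic from the beginning
    "enable.auto.commit": False,
})
replay.subscribe(["orders"])

while True:
    msg = replay.poll(5.0)
    if msg is None:      # nothing new within the timeout: replay is done
        break
    if msg.error():
        continue
    handle(msg.key(), msg.value())

replay.close()
```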
Below is my session that walks through the above discussion in more detail.
Title: Data Streaming Meets the Lakehouse – Apache Iceberg for Unified Real-Time and Batch Analytics
Event: Open Source Data Summit 2025
Date: November 13, 2025
Access: Always free and open to all
Watch the session here:
View the slides (PDF):
What’s the bigger picture? Modern enterprises are shifting left. That means they no longer treat data governance and modeling as downstream tasks in the warehouse. Instead, they apply structure, quality, and metadata closer to the source — as data is created.
This is the core idea behind the Shift Left Architecture. Streaming pipelines ingest, enrich, and validate data before it’s written into lakehouse tables. That enables trusted, real-time data products used by AI agents, dashboards, or microservices – all with consistent lineage and governance.
Read more about this approach here: The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products.
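As a toy illustration of shifting left, the sketch below validates and enriches events between a raw topic and a curated topic before anything lands in lakehouse tables. The topic names and validation rules are made up; in practice this logic typically runs in Flink or Kafka Streams and is backed by a schema registry and data contracts.

```python
# Toy "shift left" sketch: validate and enrich events in flight, before they
# reach any lakehouse table. Topic names and rules are illustrative.
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-validation",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.raw"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

REQUIRED_FIELDS = {"order_id", "customer_id", "amount"}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if not REQUIRED_FIELDS.issubset(event):
        producer.produce("orders.dlq", value=msg.value())  # quarantine bad events
        continue
    event["currency"] = event.get("currency", "EUR")       # simple enrichment
    producer.produce("orders.validated", value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks
```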
Excellent real-world examples come from Siemens Digital Industries (Manufacturing) and Siemens Healthineers (Healthcare). They’ve built a real-time data streaming platform at global scale. Operational data flows from hundreds of business units into a central, governed data mesh. From there, it feeds everything from analytics to automation and AI — all in real time.
This is where the future is heading. Combining data streaming and Iceberg is not just about formats or engines. It’s about building real-time, trusted data infrastructure for the next decade.
Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various architecture discussions and best practices.