Apache Iceberg

Data Streaming Meets Lakehouse: Apache Iceberg for Unified Real-Time and Batch Analytics

Modern data architectures are changing. Data lakes alone can’t keep up with real-time business demands. But when operational and analytical workloads are handled in separate systems, it becomes hard to build consistent and trusted data products. At Open Source Data Summit 2025, I talked about how combining Apache Iceberg with data streaming technologies like Apache Kafka and Flink bridges this gap. It creates a governed, scalable foundation for both real-time and batch workloads. The talk is now available on demand for free. Slides are also available to download.

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. And make sure to download my free book about data streaming use cases, including various architecture discussions and best practices.

The Case for Apache Iceberg — and Why It Matters for Data Streaming

Apache Iceberg is an open table format designed for modern lakehouses. It enables governed, ACID-compliant tables in object storage that are accessible across engines like Kafka, Flink, Spark, Trino, and beyond.

Iceberg is quickly becoming the default for organizations that want one consistent table format across streaming, lakehouse, and AI workloads — without locking themselves into a single vendor ecosystem. Store data only once in your own S3 bucket, and give every engine or application access to it.
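
To make the "store once, access from anywhere" idea concrete, here is a minimal sketch that reads such a table directly from Python with PyIceberg. The catalog name, REST endpoint, warehouse path, and table identifier are placeholders, not a reference setup.

```python
# Minimal sketch: read the same Iceberg table that Kafka, Flink, or Spark write to,
# directly from Python via PyIceberg. Catalog URI, warehouse, and table name are
# hypothetical placeholders -- adjust them to your own environment.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",                               # logical catalog name (assumption)
    **{
        "type": "rest",                        # e.g. an Iceberg REST catalog
        "uri": "https://catalog.example.com",  # placeholder endpoint
        "warehouse": "s3://my-bucket/warehouse",
    },
)

# Load the governed table once -- any other engine (Spark, Trino, Flink)
# reads the exact same data files and metadata.
orders = catalog.load_table("retail.orders")

# Materialize a scan to Arrow for local analysis.
df = orders.scan().to_arrow()
print(df.num_rows)
```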

Paired with a Data Streaming Platform, Apache Iceberg enables powerful patterns:

  • Real-time streams turn into governed Iceberg tables
  • Tables are reused across batch, SQL, and machine learning workloads
  • Data quality and governance rules are enforced at the point of ingestion
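
As a rough sketch of the first pattern, the following PyFlink snippet declares a Kafka topic as a streaming source and continuously writes it into a governed Iceberg table. The topic, catalog, and table names are assumptions, and the Flink Kafka and Iceberg connector JARs must be available on the classpath.

```python
# Sketch (assumptions: Kafka topic "orders", Hadoop-style Iceberg catalog on S3,
# Kafka + Iceberg connector JARs on the Flink classpath). Not a drop-in implementation.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# 1) Declare the Kafka topic as a streaming source table.
t_env.execute_sql("""
    CREATE TABLE kafka_orders (
        order_id STRING,
        amount   DOUBLE,
        order_ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# 2) Register an Iceberg catalog backed by object storage.
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 's3://my-bucket/warehouse'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lakehouse.retail")
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.retail.orders (
        order_id STRING,
        amount   DOUBLE,
        order_ts TIMESTAMP(3)
    )
""")

# 3) Continuously write the stream into the governed Iceberg table.
t_env.execute_sql("""
    INSERT INTO lakehouse.retail.orders
    SELECT order_id, amount, order_ts FROM kafka_orders
""")
```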

For more technical background and examples, check out this related blog post: Apache Iceberg – The Open Table Format for Lakehouse and Data Streaming.

Apache Iceberg vs. Delta Lake and Other Table Formats

Other table formats exist. Delta Lake from Databricks is the most prominent example. It has strong momentum in the Databricks ecosystem and continues to evolve. Databricks even acquired Tabular, the company founded by the original creators of Apache Iceberg, to improve interoperability between Delta Lake and Iceberg and to support customers that need both formats.

Even with that step, Delta Lake remains tightly connected to the Databricks platform. Apache Iceberg, in contrast, is vendor-neutral and broadly supported across cloud providers, query engines, and open source projects. This neutrality is a major reason why more enterprises choose Iceberg when they want flexibility, portability, and open governance across their data architecture.

Other open table formats such as Apache Hudi exist, along with catalog projects like Nessie and various proprietary approaches, but none have reached the level of adoption, ecosystem support, and traction that Iceberg and Delta Lake have achieved.

I wrote an entire blog series to discuss how Confluent and Databricks integrate through Apache Iceberg and Delta Lake to build a unified platform for streaming and lakehouse analytics.

Be Aware: Streaming to a Data Lake is Complicated

While the architecture is powerful, getting operational data into Iceberg tables is not trivial. A few of the key technical challenges include:

  • Schematization: Events in Kafka are often unstructured or semi-structured. They must be enriched and structured consistently before writing to Iceberg.
  • Type Conversions: Apache Kafka and Flink use formats like Avro or Protobuf. These need to match the data types expected by Iceberg-compatible engines.
  • Schema Evolution: Schemas change over time. Supporting safe, forward- and backward-compatible evolution in Iceberg requires careful handling.
  • Data Quality Rules: Validations such as null checks, range limits, and referential integrity need to be applied before writing data. This must happen in motion.
  • Sync Metadata to Catalog: Tables in Iceberg need to be registered in catalogs like Hive Metastore or AWS Glue. That sync must be managed and kept up to date.
  • Compaction: High-frequency event streams produce many small files. These need to be compacted into optimized, query-friendly large files over time.
(Source: Confluent)
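
To make a couple of these challenges concrete, the sketch below shows the kind of manual housekeeping Iceberg tables need on the lakehouse side, using Spark with the Iceberg extensions: a schema evolution step and a small-file compaction run. Catalog and table names are placeholders; a managed platform automates exactly this kind of work.

```python
# Sketch: manual Iceberg maintenance with Spark (placeholders: catalog "lakehouse",
# table "retail.orders", S3 warehouse path). A managed platform automates this.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-maintenance")
    # Iceberg SQL extensions and catalog configuration (paths are assumptions)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE lakehouse.retail.orders ADD COLUMN customer_id STRING")

# Compaction: merge the many small files produced by streaming ingestion
# into larger, query-friendly files.
spark.sql("CALL lakehouse.system.rewrite_data_files(table => 'retail.orders')")
```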

These are solvable challenges — but they require experience and tooling. A Data Streaming Platform like Confluent Cloud with its Tableflow product helps by providing governance, schema management, stream processing, and catalog integration out of the box.

Kafka Iceberg Topics and the “Zero Copy” Conversation

The idea of writing Kafka topics directly into Iceberg tables — sometimes called a “Zero Copy” approach — is appealing. But there are trade-offs to consider:

  • Latency and CPU Impact: Streaming directly to Iceberg introduces write latency. Operational systems that need millisecond responsiveness may suffer.
  • Losing Replayability and Recovery of Events: Kafka is a durable event log. Writing data directly into Iceberg removes the ability to replay and reprocess events from source.
  • Data Processing at Rest, Not in Motion: Transformations like Bronze => Silver => Gold are only applied once data lands. This delays decisions and adds complexity.
  • Hidden Operational Complexity: Managing Iceberg tables adds overhead: compaction, retention, catalog sync, tiered storage tuning — all of this must be operated reliably.

In many cases, storing data twice is a better pattern. Kafka stores the raw event log, similar to a Write-Ahead Log (WAL) in databases, which provides durability and the ability to reprocess or replay events if needed. Meanwhile, Apache Iceberg stores the structured, governed tables for downstream analytics, machine learning, or SQL workloads. This separation mirrors proven database architecture patterns and ensures both flexibility and reliability in modern data systems.
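
A practical consequence of keeping Kafka as the raw event log: if a downstream Iceberg table has to be rebuilt, for example after a bug in a transformation, the topic can simply be replayed from the beginning. Here is a minimal sketch with the kafka-python client; the topic name, brokers, and the write helper are hypothetical.

```python
# Sketch: replay the full Kafka event log (acting like a WAL) to rebuild
# downstream Iceberg tables. Topic name and brokers are placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="broker:9092",
    auto_offset_reset="earliest",    # start from the oldest retained event
    enable_auto_commit=False,        # a rebuild job manages its own progress
    group_id="orders-rebuild-2025",  # fresh group id => no committed offsets
)

for record in consumer:
    # Re-apply the corrected transformation and write the results to the
    # Iceberg table (e.g. via Flink, Spark, or PyIceberg appends).
    process_and_write_to_iceberg(record.value)  # hypothetical helper
```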

Watch the Session On Demand – Free Access + Slides

Below is my session that walks through the above discussion in more detail.

Title: Data Streaming Meets the Lakehouse – Apache Iceberg for Unified Real-Time and Batch Analytics

Event: Open Source Data Summit 2025

Date: November 13, 2025

Access: Always free and open to all

Watch the session here:

Look at the slides (PDF):


Outlook: Shift Left with Streaming-Powered Data Products

What’s the bigger picture? Modern enterprises are shifting left. That means they no longer treat data governance and modeling as downstream tasks in the warehouse. Instead, they apply structure, quality, and metadata closer to the source — as data is created.

This is the core idea behind the Shift Left Architecture. Streaming pipelines ingest, enrich, and validate data before it’s written into lakehouse tables. That enables trusted, real-time data products used by AI agents, dashboards, or microservices – all with consistent lineage and governance.
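
As a small illustration of shifting left, the sketch below enforces structure at the producer: every event is validated against an Avro schema via Schema Registry before it ever enters Kafka. Broker and registry URLs, the topic, and the schema itself are assumptions, not a reference configuration.

```python
# Sketch: enforce structure at the source by producing Avro records that are
# validated against a Schema Registry subject before they reach Kafka.
# Broker/registry URLs, topic name, and the schema are placeholders.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

ORDER_SCHEMA = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "https://schema-registry.example.com"})
value_serializer = AvroSerializer(schema_registry, ORDER_SCHEMA)

producer = SerializingProducer({
    "bootstrap.servers": "broker:9092",
    "value.serializer": value_serializer,
})

# A record that violates the schema (missing field, wrong type) fails here,
# at the point of creation -- not hours later in the warehouse.
producer.produce(topic="orders", value={"order_id": "o-123", "amount": 42.0})
producer.flush()
```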

Read more about this approach here: The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products.

Excellent real-world examples come from Siemens Digital Industries (Manufacturing) and Siemens Healthineers (Healthcare). They’ve built a real-time data streaming platform at global scale. Operational data flows from hundreds of business units into a central, governed data mesh. From there, it feeds everything from analytics to automation and AI — all in real time.

This is where the future is heading. Combining data streaming and Iceberg is not just about formats or engines. It’s about building real-time, trusted data infrastructure for the next decade.


Kai Waehner

bridging the gap between technical innovation and business value for real-time data streaming, processing and analytics
