Why Databricks and Snowflake Speak the Kafka Protocol: Ingestion vs. Architecture

Kafka-compatible ingestion is showing up everywhere now, including in analytics platforms that have nothing to do with running Kafka. Databricks added Kafka-compatible APIs to its Zerobus Ingest service. Snowflake went a step further at its Summit and announced Datastream, a native, fully Kafka-compatible streaming service. In both cases, existing Kafka producers can stream straight into the platform with no code changes, just a config change.
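As a rough illustration of what "just a config change" can look like, the sketch below leaves the producer logic untouched and swaps only the connection settings. The keys are standard Kafka client configuration names. The endpoint hostnames and the `producer_config` and `changed_keys` helpers are hypothetical placeholders, not real Databricks or Snowflake addresses, and real authentication settings will differ per platform.

```python
# Standard Kafka client settings shared by every target (the keys are real
# Kafka client config names; the values are illustrative).
BASE_CONFIG = {
    "acks": "all",
    "compression.type": "lz4",
    "security.protocol": "SASL_SSL",
}

# Hypothetical endpoints: placeholders, not real service addresses.
ENDPOINTS = {
    "self_managed_kafka": "kafka-1.internal.example:9092",
    "lakehouse_ingest": "ingest.analytics-platform.example:9092",
}


def producer_config(target: str) -> dict:
    """Return the producer settings for a target; only the endpoint changes."""
    return {**BASE_CONFIG, "bootstrap.servers": ENDPOINTS[target]}


def changed_keys(before: dict, after: dict) -> set:
    """Return the config keys whose values differ between two setups."""
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


old = producer_config("self_managed_kafka")
new = producer_config("lakehouse_ingest")
print(changed_keys(old, new))  # the bootstrap address is the only difference
```

In practice credentials and TLS settings usually change too, so treat the single-key diff as the best case rather than a guarantee.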

This is good news for the ecosystem. It also confirms a point I have made for years: the Kafka API has become the de facto standard for moving events around, the same way the Amazon S3 API became the standard for object storage.

But the trend hides an important distinction. Using the Kafka API for ingestion into an analytics platform is a very different thing from building an event-driven enterprise architecture on Kafka. And in practice most companies do both: they run Kafka as the operational backbone, and one of the most common sink connectors in that setup feeds the lakehouse. This post is about that difference, ingestion versus architecture, why it matters, and why the two approaches are complementary rather than competing.

What Is Apache Kafka? Messaging, Storage, Connect, and Streams

Most people know Kafka as messaging. Pub-sub. A way to send events from A to B in real time. That is the part everyone learns first, and it is real, but it is only a quarter of the picture.

Apache Kafka (the open-source framework) is four things in one platform:

Messaging is the real-time publish and subscribe layer that decouples producers from consumers.

Storage is the durable, replayable commit log. A Kafka topic is not a transient buffer that forgets data the moment it is read. Events are persisted, ordered, and can be replayed by any consumer at any time. This is what makes Kafka a system of record for events, not just a pipe.
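To make the "log, not buffer" idea concrete, here is a minimal in-memory sketch of an append-only log with offsets. The `TopicLog` class is a toy written for this post, not Kafka's implementation: it ignores partitions, replication, and retention. It only shows that reading does not remove data and that each reader chooses its own starting offset.

```python
class TopicLog:
    """Toy append-only log for a single partition (not Kafka's implementation)."""

    def __init__(self):
        self._records = []

    def append(self, value) -> int:
        """Persist a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return (offset, value) pairs from the given offset; nothing is deleted."""
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, value) for i, value in enumerate(chunk)]


log = TopicLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

# Two independent readers, each with its own position in the same log.
realtime_reader = log.read(offset=2)  # picks up only the newest event
batch_reader = log.read(offset=0)     # replays everything from the start
```

Because reads never mutate the log, a new consumer added next month can start at offset 0 and see the same ordered history.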

Kafka Connect is the integration framework. Hundreds of connectors move data into and out of Kafka from databases, message queues, SaaS applications, cloud storage, and yes, lakehouses.

Kafka Streams and stream processing handle continuous processing of data while it is in motion. You filter, join, aggregate, and enrich events as they flow, rather than landing everything first and processing it later in batch.
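The filter, enrich, and aggregate steps can be sketched in plain Python with a generator. This is not the Kafka Streams API (which is a Java library); the event shape, the `CUSTOMER_REGION` lookup table, and the `process` function are assumptions made up for illustration. The point is only that each event is handled as it arrives and state is updated incrementally.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical reference data used for enrichment (a stand-in for a table join).
CUSTOMER_REGION = {"c1": "EU", "c2": "US"}


def process(events: Iterable[dict]) -> Iterator[dict]:
    """Filter, enrich, and aggregate each event as it arrives."""
    revenue_by_region = defaultdict(float)  # running state, updated per event
    for event in events:
        if event["type"] != "payment":  # filter: keep payments only
            continue
        region = CUSTOMER_REGION.get(event["customer"], "unknown")  # enrich
        revenue_by_region[region] += event["amount"]  # aggregate
        yield {"region": region, "running_total": revenue_by_region[region]}


stream = [
    {"type": "payment", "customer": "c1", "amount": 10.0},
    {"type": "click", "customer": "c1", "amount": 0.0},
    {"type": "payment", "customer": "c2", "amount": 5.0},
    {"type": "payment", "customer": "c1", "amount": 2.5},
]
results = list(process(stream))
```

A real stream processor adds what this sketch leaves out: fault-tolerant state, windowing, and scaling across partitions.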

The point is simple: Kafka is a platform, not a message queue. That combination of storage, integration, and processing on top of messaging is exactly what lets it serve as a backbone for an entire enterprise, not just a transport between two systems. The persistent log also decouples producers from consumers in time, so each system reads the same events at its own pace, whether real-time, batch, or request-response. That is what makes Kafka a tool for data consistency across the whole estate, not only real-time speed and scale.

Kafka Protocol vs. Kafka Framework: What ‘Kafka-Compatible’ Really Means

Here is the distinction that explains the whole “Kafka is everywhere” trend, and it is one people constantly blur.

The Kafka API, also called the Kafka protocol, is the open wire protocol, specified as part of the Apache 2.0-licensed Apache Kafka project. It defines how clients talk to brokers: the requests, the binary format, the rules. It is open, and anyone can implement against it.

The Kafka framework is the actual open-source software: the brokers, Kafka Connect, Kafka Streams. It is one implementation of that protocol, the original one.
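To show how little of the framework the protocol itself requires, the sketch below encodes a Kafka request header by hand with Python's standard `struct` module. It assumes the version 1 request header layout from the public protocol guide (API key, API version, correlation ID, client ID) and uses API key 18, ApiVersions, as the example. The function name is my own, and the sketch stops at building bytes: it does not open a connection or parse a response.

```python
import struct

API_KEY_API_VERSIONS = 18  # ApiVersions request, per the Kafka protocol guide


def encode_request_header_v1(api_key: int, api_version: int,
                             correlation_id: int, client_id: str) -> bytes:
    """Encode a Kafka request header (v1) preceded by its 4-byte size prefix.

    Layout, big-endian: INT16 api_key, INT16 api_version,
    INT32 correlation_id, then client_id as INT16 length + UTF-8 bytes.
    """
    client_bytes = client_id.encode("utf-8")
    header = struct.pack(">hhih", api_key, api_version, correlation_id,
                         len(client_bytes)) + client_bytes
    # Each request on the wire is preceded by its length as a big-endian INT32.
    return struct.pack(">i", len(header)) + header


frame = encode_request_header_v1(API_KEY_API_VERSIONS, 0, 1, "demo")
print(frame.hex())
```

Any server that can parse frames like this and answer in the matching response format can present itself as a Kafka endpoint, whatever runs behind it.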

Because the protocol is open, other systems can speak it without using the framework underneath. Snowflake is refreshingly direct about this with Datastream: it is compatible with the Kafka wire protocol but does not use Kafka under the hood. The technology is pure Snowflake. That is the protocol-versus-framework split in a single sentence, stated by the vendor. And it is precisely why the Kafka API became the de facto standard for event streaming.

But there is a crucial caveat. Protocol-compatible does not mean feature-complete. Many systems that advertise “Kafka-compatible” implement only the messaging core, and even there they often miss features. Storage semantics, connectors, stream processing, and exactly-once guarantees are frequently left out. So when a product says it supports the Kafka API, that tells you the on-ramp works. It does not tell you that you have a streaming platform underneath. The devil is in the details, and the details are usually everything except the basic produce-and-consume path.
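One way to start checking a "Kafka-compatible" claim is to compare the API keys an endpoint advertises against the ones a feature needs. The sketch below uses API key numbers from the Kafka protocol guide, but the grouping into features is my own simplification, and the `ingest_only` set is a hypothetical endpoint, not a statement about any specific product.

```python
# API keys from the Kafka protocol, grouped by the feature that relies on them.
# The grouping is a simplification for illustration.
REQUIRED_APIS = {
    "produce_consume": {0, 1, 3},                    # Produce, Fetch, Metadata
    "consumer_groups": {8, 9, 10, 11, 12, 13, 14},   # offsets, coordinator, group membership
    "transactions": {22, 24, 25, 26, 28},            # InitProducerId through TxnOffsetCommit
}


def missing_features(advertised: set) -> dict:
    """Map each feature to the API keys the endpoint does not advertise."""
    gaps = {}
    for feature, needed in REQUIRED_APIS.items():
        absent = needed - advertised
        if absent:
            gaps[feature] = sorted(absent)
    return gaps


# Hypothetical ingest-only endpoint: the basic path plus ApiVersions (18).
ingest_only = {0, 1, 3, 18}
gaps = missing_features(ingest_only)
print(gaps)
```

Advertising an API is necessary but not sufficient: an endpoint can accept a request and still behave differently from Apache Kafka, so a check like this is a first filter, not a substitute for testing the semantics you depend on.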

Two Jobs for the Kafka API: Ingestion vs. Architecture

Once you separate the protocol from the platform, two very different use cases come into focus.

The first is Kafka as the operational backbone. Event-driven microservices. Real-time applications. The central nervous system that connects operational and analytical systems across the business. Kafka as the strategic integration layer that replaces the old ESB, ETL, and iPaaS middleware. In this world, data is processed in motion, consumed by many systems at once, and the workloads are mission-critical, stateful, and continuous.

The second is Kafka as an ingestion path into analytics. Here the Kafka API is a convenient, standardized on-ramp to get events into a warehouse or lakehouse, where they are processed analytically, often in micro-batches. One producer, one destination, analytics at the end.

The first is an event-driven architecture. The second is a smarter pipe into an analytics platform.

And here is the part that matters most: most companies do not pick one. They do both. They deploy Kafka for the event-driven architecture, run their operational workloads on it, and then one of the most common sink connectors in that whole setup is the one feeding the lakehouse. Kafka runs the live business, and Snowflake or Databricks gets a continuous feed of those same events for analytics, reporting, and AI. The two are complementary, not competing.

Streaming Platform vs. Lakehouse Ingestion: Confluent, Databricks, Snowflake

This is where the recent vendor announcements fit in, and it helps to see the two approaches side by side.

On one side are the data streaming platforms. Whether you run open-source Apache Kafka yourself or work with a vendor such as Confluent, the goal is the same: build the full platform around the event-driven architecture. Operational and analytical workloads, Connect for integration, stream processing for data in motion, the durable log as a system of record. There is a rich and fast-moving market here, with strong options well beyond the obvious names. For an overview of how it is evolving, see my Data Streaming Landscape 2026.

On the other side are the analytics platforms adding Kafka-compatible ingestion. Databricks Zerobus and Snowflake Datastream are two clear examples. Their goal is narrower and perfectly reasonable: get producer data into their platform without requiring a separate Kafka cluster or a separate streaming vendor. Snowflake is explicit that Datastream is purpose-built for teams that want to replace their Kafka infrastructure with a native Snowflake service, landing topics directly as governed Snowflake or Iceberg tables. Worth noting on maturity: Databricks has rolled out its Kafka-compatible APIs in beta with the rest of Zerobus generally available, while Snowflake Datastream is still heading into private preview. Both implement the protocol as an ingest interface into their own platform, not as a general-purpose streaming backbone.

Both approaches are valid. They solve different problems.

When to Use Kafka vs. Lakehouse Ingestion

If all you need is to land events in the lakehouse, then Kafka-compatible ingestion built directly into the analytics platform is a fine choice. There is nothing wrong with it. It is one less system to run, one less vendor to manage, and the config-change migration story is real. No need to overthink it.

But be clear-eyed about what most enterprises actually do. They do not use Kafka only for ingestion into analytics. They use it for operational use cases, for a strategic event-driven enterprise architecture, and as the integration platform that replaces legacy middleware. For that, an analytics platform’s ingest service is not a substitute. It was never designed to be one. It is the on-ramp, or at most a replacement for the on-ramp, not the backbone.

So choose based on the use case, not the headline. If you only need the pipe into the lakehouse, take the pipe. If you are building the operational nervous system of your business, you need the platform, and the lakehouse becomes one important consumer of it rather than the center of it.

More vendors adopting the Kafka API is a win for everyone. It confirms the protocol is the standard. The only thing to stay sharp on is which problem you are solving: ingestion into analytics, or the event-driven enterprise architecture that runs the business.


Kai Waehner

bridging the gap between technical innovation and business value for real-time data streaming and applied AI.


Published by

Kai Waehner
