Data Warehouse Archives

5 minute read

The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)

By Kai Waehner
1. April 2025

Batch processing introduces delays, complexity, and data quality issues that modern businesses can no longer afford. This article outlines the most common problems with batch workflows—ranging from outdated insights to compliance risks—and illustrates each with real-world examples. It also highlights how real-time data streaming offers a more reliable, scalable, and future-proof alternative.

12 minute read

How Microsoft Fabric Lakehouse Complements Data Streaming (Apache Kafka, Flink, et al.)

By Kai Waehner
12. October 2024

In today’s data-driven world, understanding data at rest versus data in motion is crucial for businesses. Data streaming frameworks like Apache Kafka and Apache Flink enable real-time data processing. Meanwhile, lakehouses like Snowflake, Databricks, and Microsoft Fabric excel in long-term data storage and detailed analysis, perfect for reports and AI training. This blog post explores how these technologies complement each other in enterprise architecture.

8 minute read

The Shift Left Architecture – From Batch and Lakehouse to Real-Time Data Products with Data Streaming

By Kai Waehner
15. June 2024

Data integration is a hard challenge in every enterprise. Batch processing and Reverse ETL are common practices in a data warehouse, data lake, or lakehouse. Data inconsistency, high compute cost, and stale information are the consequences. This blog post introduces a new design pattern to solve these problems: The Shift Left Architecture enables a data mesh with real-time data products to unify transactional and analytical workloads with Apache Kafka, Flink, and Iceberg. Consistent information is handled with stream processing or ingested into Snowflake, Databricks, Google BigQuery, or any other analytics / AI platform. The result is more flexibility, lower cost, and a data-driven company culture with faster time-to-market for innovative software applications.

8 minute read

Snowflake Data Integration Options for Apache Kafka (including Iceberg)

By Kai Waehner
22. April 2024

The integration between Apache Kafka and Snowflake is often cumbersome. Options include near real-time ingestion with a Kafka Connect connector, batch ingestion from large files, or leveraging a standard table format like Apache Iceberg. This blog post explores the alternatives and discusses their trade-offs. It concludes by showing how data streaming helps with hybrid architectures where data needs to be ingested from the private data center into Snowflake in the public cloud.

9 minute read

Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka

By Kai Waehner
19. April 2024

Snowflake is a leading cloud-native data warehouse. Integration patterns include batch data integration, Zero ETL, and near real-time data ingestion with Apache Kafka. This blog post explores the different approaches and discusses their trade-offs. In line with industry recommendations, it suggests avoiding anti-patterns like Reverse ETL and instead using data streaming to enhance the flexibility, scalability, and maintainability of the enterprise architecture.

12 minute read

SAP Datasphere and Apache Kafka as Data Fabric for S/4HANA ERP Integration

By Kai Waehner
3. January 2024

SAP is the leading ERP solution across industries around the world. Data integration with other data platforms, applications, databases, and APIs is one of the hardest challenges in the IT and software landscape. This blog post explores how SAP Datasphere in conjunction with the data streaming platform Apache Kafka enables a reliable, scalable and open data fabric for connecting SAP business objects of ECC and S/4HANA ERP with other real-time, batch, or request-response interfaces.

5 minute read

The Heart of the Data Mesh Beats Real-Time with Apache Kafka

By Kai Waehner
28. July 2022

If there were a buzzword of the hour, it would undoubtedly be “data mesh”! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

10 minute read

Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

By Kai Waehner
21. July 2022

The concepts and architectures of a data warehouse, a data lake, and data streaming complement each other in solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for the wrong use cases by vendors. Let’s explore this dilemma in a blog series. This is part 5: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake.

7 minute read

Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization

By Kai Waehner
18. July 2022

The concepts and architectures of a data warehouse, a data lake, and data streaming complement each other in solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for the wrong use cases by vendors. Let’s explore this dilemma in a blog series. This is part 4: Case Studies for cloud-native data streaming and data warehouses.

9 minute read

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure

By Kai Waehner
15. July 2022

The concepts and architectures of a data warehouse, a data lake, and data streaming complement each other in solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for the wrong use cases by vendors. Let’s explore this dilemma in a blog series. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.
