The combination of streaming machine learning (ML), Apache Kafka, and Confluent Tiered Storage enables you to build a scalable, reliable, and simple infrastructure for all machine learning tasks using the Apache Kafka ecosystem and Confluent Platform.
This blog post is a primer for the full article I wrote for the Confluent Blog:
Streaming Machine Learning with Tiered Storage and Without a Data Lake
Please read the full blog post for all the details.
The connected car example I use to enable predictive maintenance in real time is discussed and demoed in this post:
IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow
The following two sections explain the main concepts: Streaming Machine Learning and Tiered Storage as an add-on for Apache Kafka.
Let’s take a look at a new approach for model training and predictions that does not require a data lake. Instead, streaming machine learning is used: direct consumption of data streams from Confluent Platform into the machine learning framework.
This example features TensorFlow I/O and its Kafka plugin. The TensorFlow instance acts as a Kafka consumer to load new events into its memory. Consumption can happen in different ways, for example consuming only the newest events, starting from a specific offset or timestamp, or replaying all events from the beginning of the topic.
Most machine learning algorithms don’t support online model training today, but there are some exceptions like unsupervised online clustering. Therefore, the TensorFlow application typically takes a batch of the consumed events at once to train an analytic model.
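Below is a minimal sketch of this batch-style consumption and training using the tensorflow-io Kafka integration. The topic name, consumer group, bootstrap server, and the assumption that each event is a CSV-encoded sensor record with a binary label are illustrative placeholders, not details taken from the demo.

```python
# Minimal sketch: train a Keras model on events consumed from Kafka
# via tensorflow-io (pip install tensorflow-io). Topic, group id, servers, and the
# CSV encoding ("f1,...,f8,label") are hypothetical placeholders.
import tensorflow as tf
import tensorflow_io as tfio

# Join a consumer group and stream (message, key) pairs as tf.string tensors.
# With auto.offset.reset=earliest, a new group reads the topic from the beginning.
dataset = tfio.experimental.streaming.KafkaGroupIODataset(
    topics=["car-sensor"],
    group_id="tensorflow-trainer",
    servers="localhost:9092",
    configuration=["auto.offset.reset=earliest"],
)

def decode(raw_message, raw_key):
    # Assume each message is a CSV record with eight features and a 0/1 label.
    fields = tf.io.decode_csv(raw_message, record_defaults=[[0.0]] * 9)
    return tf.stack(fields[:-1]), fields[-1]

train_ds = dataset.map(decode).batch(128)

# A small binary classifier trained on the consumed batch of events.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=1)
```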
At a high level, the idea is very simple: Tiered Storage in Confluent Platform combines local Kafka storage with a remote storage layer. The feature moves bytes from one tier of storage to another. When using Tiered Storage, the majority of the data is offloaded to the remote store.
Here is a picture showing the separation between local and remote storage:
Tiered Storage allows the storage of data in Kafka long-term without having to worry about high cost, poor scalability, and complex operations. You can choose the local and remote retention time per Kafka topic. Another benefit of this separation is that you can now choose a faster SSD instead of HDD for local storage because it only stores the “hot data,” which can be just a few minutes’ or hours’ worth of information.
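As a rough sketch of this per-topic configuration, the snippet below creates a topic that keeps only a short “hotset” on local disks while the total retention is much longer. It assumes the brokers already have Tiered Storage enabled (e.g., confluent.tier.feature, confluent.tier.backend, and an S3 bucket configured); the topic name, sizes, and retention values are illustrative, and the exact property names should be verified against the Confluent Platform documentation for your version.

```python
# Sketch: create a topic with Tiered Storage enabled so older segments are
# offloaded to the remote store while only recent "hot" data stays on local disks.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "car-sensor",
    num_partitions=6,
    replication_factor=3,
    config={
        "confluent.tier.enable": "true",              # offload closed segments to the remote store
        "confluent.tier.local.hotset.ms": "3600000",  # keep roughly one hour of hot data locally
        "retention.ms": "31536000000",                # total retention (~1 year) across local + remote
    },
)

# create_topics() is asynchronous and returns a dict of futures keyed by topic name.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()
        print(f"Created topic {name}")
    except Exception as exc:
        print(f"Failed to create topic {name}: {exc}")
```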
In the Confluent Platform 5.4-preview release, Tiered Storage supports the S3 interface. However, it is implemented in a portable way that allows support for other object stores like Google Cloud Storage and filestores like HDFS to be added without requiring changes to the core implementation. For more details about the motivation behind and implementation of Tiered Storage, check out the blog post by our engineers.
Storing data long-term in Kafka allows you to easily implement use cases in which you’d want to process data in an event-based order again, for example re-training an analytic model on the complete history of events or replaying it into a new consumer application, as sketched below.
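Such a replay can be as simple as a consumer with a fresh consumer group and auto.offset.reset=earliest that re-reads the whole topic from the beginning; with Tiered Storage, segments that were offloaded to the remote store are fetched transparently by the brokers, so the consumer code does not change. The topic name, group id, and bootstrap server below are placeholders.

```python
# Sketch: replay the complete history of a topic with a plain Kafka consumer.
from confluent_kafka import Consumer

def process(key, value):
    """Placeholder for the application logic, e.g. feeding model re-training."""
    print(key, value)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "model-retraining-replay",   # a new group id so offsets start fresh
    "auto.offset.reset": "earliest",         # read from the beginning of the topic
    "enable.auto.commit": False,
})
consumer.subscribe(["car-sensor"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        # Feed the historical event into re-training, reprocessing, auditing, etc.
        process(msg.key(), msg.value())
finally:
    consumer.close()
```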
I am really excited about Tiered Storage as an add-on for Apache Kafka. What do you think? What are the use cases you see? Please let me know and share your feedback via LinkedIn, Twitter, or email.