When NOT to choose Amazon MSK Serverless for Apache Kafka?

Is Amazon MSK Serverless for Apache Kafka a Self-Driving Car or just a Car Engine
Apache Kafka became the de facto standard for data streaming. Various cloud offerings emerged and improved in the last years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to “the normal” partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

Apache Kafka became the de facto standard for data streaming. Various cloud offerings emerged and improved in the last years. Amazon MSK Serverless is the latest Kafka product from AWS. This blog post looks at its capabilities to explore how it relates to “the normal” partially managed Amazon MSK, when the serverless version is a good choice, and when other fully-managed cloud services like Confluent Cloud are the better option.

Is Amazon MSK Serverless for Apache Kafka a Self-Driving Car or just a Car Engine

Disclaimer: I work for Confluent. While AWS is a strong strategic partner, it also offers the competitive offering Amazon MSK. This post is not about comparing every feature but explaining the concepts behind the alternatives. Read articles and docs from the different vendors to make your own evaluation and decision. View this post as a list of criteria to not forget important aspects in your cloud service selection.

What is a fully-managed Kafka cloud offering?

Before we get started looking at Kafka cloud services like Amazon MSK and Confluent Cloud, it is important to understand what fully managed means actually:

Fully managed and serverless Apache Kafka

Make sure to evaluate the technical solution. Most Kafka cloud solutions market their offering as “fully managed”. However, almost all Kafka cloud offerings are only partially managed! In most cases, the customer must operate the Kafka infrastructure, fix bugs, and optimize scalability and performance.

Cloud-native Apache Kafka re-engineered for the cloud

Operating Apache Kafka as a fully-managed offering in the cloud requires several additional components to the core of open-source Kafka. A cloud-native Kafka SaaS has features like:

  • The latest stable version with non-disruptive rolling upgrades
  • Elastic scale (up and down!)
  • Self-balancing clusters that take over the complexity and risk of rebalancing partitions across Kafka brokers
  • Tiered storage for cost-efficient long-term storage and better scalability (as the cold storage does not require a rebalancing of partitions and other complex operations tasks)
  • Complete solution “on top of the infrastructure”, including connectors, stream processing, security, and data governance – all in a single fully-managed SaaS

To learn more about building a cloud-native Kafka service, I highly recommend reading the following paper:  The Cloud-Native Chasm: Lessons Learned from Reinventing Apache Kafka as a Cloud-Native, Online Service”.

Comparison of Apache Kafka products and cloud services

Apache Kafka became the de facto standard for data streaming. The open-source community is vast. Various vendors added Kafka and related tooling to their offerings or provide a Kafka cloud service. I wrote a blog post in 2021: “Comparison of Open Source Apache Kafka vs. Vendors including Confluent, Cloudera, Red Hat, Amazon MSK“:

Comparison of Apache Kafka Products and Services including Confluent Amazon MSK Cloudera IBM Red Hat

The article uses a car analogy – from the motor engine to the self-driving car – to explore the different Kafka offerings available on the market. I also cover a few other vehicles, meaning (partly) Kafka-compatible technologies. The goal is not a feature-by-feature comparison (that would be outdated the day after the publication). Instead, the intention is to educate about the different deployment models, product strategies, and trade-offs from the options.

The above post is worth reading to understand how comparing different Kafka solutions makes sense. However, products develop and innovate… Tech comparisons get outdated quickly. In the meantime, AWS released a new product: Amazon MSK Serverless. This blog post explores what it is, when to use it, and how it differs from other Kafka products. It compares especially Amazon MSK (the partially managed service) and Confluent Cloud (a fully-managed competitor to Amazon MSK Serverless).

How does Amazon MSK Serverless fit into the Kafka portfolio?

Keeping the car analogy of my previous post, I wonder: Is it a self-driving car, a complete car you drive by yourself, or just a car engine to build your own car? Interestingly, you can argue for all three. 🙂 Let’s explore this in the following sections.

Is Amazon MSK Serverless a Self-Driving Car of Apache Kafka

Introducing Amazon MSK Serverless

Amazon MSK Serverless is a cluster type for Amazon MSK to run Apache Kafka without having to manage and scale cluster capacity. MSK Serverless automatically provisions and scales compute and storage resources. Thus, you can use Apache Kafka on demand and pay for the data you stream and retain.

Amazon MSK is one of the hundreds of cloud services that AWS provides. AWS is a one-stop shop for all cloud needs. That’s definitely a key strength of AWS (and similar to Azure and GCP).

Amazon MSK Serverless is built to solve the problems that come with Amazon MSK (the partially managed Kafka service that is marketed as a fully-managed solution even though it is not): A lot of hidden ops, infra, and downtime costs. This AWS podcast has a great episode that introduces Amazon MSK Serverless and when to use it as a replacement for Amazon MSK.

What Amazon does NOT tell you about MSK Serverless

AWS has great websites, documentation, and videos for its cloud services. This is not different for Amazon MSK. However, a few important details are not obvious… 🙂 Let’s explore a few key points to make sure everybody understands what Amazon MSK Serverless is and what it is not.

Amazon MSK Serverless is incomplete Kafka

If you follow my blogs, then this might be boring. Despite that, too many people think about Kafka as a message queue and data transportation pipeline. That’s what it is, but Kafka is much more:

  • Real-time messaging at any scale
  • Data integration with Kafka Connect
  • Data processing (aka stream processing) with Kafka Streams (or 3rd party Kafka-native components like KSQL)
  • True decoupling (the most underestimated feature of Kafka because of its built-in storage capabilities) and replayability of events with flexible retention times
  • Data governance with service contracts using Schema Registry (to be fair, this is not part of open source Kafka, but a separate component and accessible from GitHub or by vendors like Confluent or Red Hat – but it is used in almost all serious Kafka projects)

As I won’t repeat myself, here are a few articles explaining why Kafka is more than a message queue like you find it in Amazon MSK Serverless:

TL;DR: AWS provides a cloud service for every problem. You can glue them together to build a solution. However, similar to a monolithic application that provides inflexibility in a single package, a mesh of too many independent glued services using different technologies is also very hard to operate and maintain. And the cost for so many services plus networking and data transfer will bring up many surprises in such a setup.

You should ask yourself a few questions:

  • How do you implement data integration and business logic with Amazon MSK Serverless?
  • What’s the consequence regarding cost, SLAs, and end-to-end latency respectively delivery guarantees of combining Amazon MSK Serverless with various other products like Amazon Kinesis Data Analytics, AWS Glue, AWS Data Pipeline, or a 3rd party integration tool?
  • What is your security and data governance strategy around streaming data? How do you build an event-based data hub that enforces compliant communication between data producers and independent downstream consumers?

Spoilt for Choice: Amazon MSK and Amazon MSK Serverless are different products

Amazon MSK is NOT fully-managed. It is partially managed. After providing the brokers, you need to deploy, operate and monitor Kafka brokers and Kafka Connect connectors, and realize rebalancing with the open source tool Cruise Control. Check out AWS’ latest MSK sizing post: “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost“. Seriously? A ten pages long and very technical article explaining how to operate a “fully-managed cloud Kafka service”?

You might think that Amazon MSK Serverless is the successor of Amazon MSK to solve these problems. However, there are now two products to choose from: Amazon MSK and Amazon MSK Serverless.

Amazon does NOT recommend using Amazon MSK Serverless for all use cases! It is recommended if you don’t know the workloads or if they often change in volume.
Amazon recommends “the normal” Amazon MSK for predictable workloads as it is more cost-effective (and because it is not workable because of its many tough limitations). MSK Connect is also not supported yet and coming at some point in the future.

It is totally okay to provide different products for different use cases. Confluent also has different offerings for different SLAs and functional requirements in its cloud offering. Multi-tenant basic clusters and dedicated clusters are available, but you never have to self-manage the cluster or fix bugs or performance issues yourself.

You should ask yourself a few questions:

  • Which projects require Amazon MSK and which require Amazon MSK Serverless?
  • How will the project scale as your grows?
  • What’s the migration/ upgrade plan if your workload exceeds MSK Serverless partition/retention limits?
  • What is the total cost of ownership (TCO) for MSK plus all the other cloud services I need to combine it with?

Amazon MSK Serverless excludes Kafka support

Amazon MSK service level agreements say: “The Service Commitment DOES NOT APPLY to any unavailability, suspension or termination … caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures …

Amazon MSK Serverless is part of the Amazon MSK product and has the same limitation. Excluding Kafka support from the MSK offering is (or should be) a blocker for any serious data streaming project!

Not much more to add here… Do you really want to buy a specific product that excludes the support for its core capability? Please also ask your manager if he agrees and takes the risk.

You should ask yourself a few questions:

  • Who is responsible and takes the risk if you hit a Kafka issue in your project using Amazon MSK or Amazon MSK Serverless?
  • How do you react to security incidents related to the Apache Kafka open source project?
  • How do you fix performance or scalability issues (on both the client and server side)?

When NOT to use Amazon MSK Serverless?

Let’s go back to the car analogy. Is Amazon MSK Serverless a self-driving car?

Obviously, Amazon MSK Serverless is self-driving. That’s what a serverless product is. Similar to Amazon S3 for object storage or AWS Lambda for serverless functions.

However, Amazon MSK Serverless is NOT a complete car! It does not provide enterprise support for its functionality. And it does not provide more than just the core of data streaming.

Therefore, Amazon MSK Serverless is a great self-driving AWS product for some use cases. But you should evaluate the following facts before deciding for or against this cloud service.

24/7 Enterprise support for the product

MSK excludes Kafka support from its product Amazon MSK. Amazon MSK Serverless is part of Amazon MSK and uses its SLAs.

I am amazed at how many enterprises use Amazon MSK without reading the SLAs. Most people are not aware that Kafka support is excluded from the product.

This makes Amazon MSK Serverless a car engine, not a complete car, right? Do you really want to build your own car and take over the burden and risk of failures while driving on the street?

If you need to deploy mission-critical workloads with 24/7 SLAs, you can stop reading and qualify out Amazon MSK (including Amazon MSK Serverless) until AWS adds serious SLAs to this product in the future.

Complete data streaming platform

AWS has a service for everything. You can glue them together. In our cars analogy, it would be many cars or vehicles in your enterprise architecture. Most of us learned the hard way that distributed microservices are no free lunch.

The monolithic data lake (now pitched as lakehouse) from vendors like Databricks and Snowflake) is no better approach. Use the right technology for a problem or data product. Finding the right mix between focus and independence is crucial. Kafka’s role is the central or decentralized real-time data hub to transport events. This includes data integration and processing, and to decouple systems from each other.

A modern data flow requires a simple, reliable and governed way to integrate and process data. Leveraging Kafka’s ecosystem like Kafka Connect and Kafka Streams enables mission-critical end-to-end latency and SLAs in a cost-efficient infrastructure. Development, operations, and monitoring are much harder and more costly if you glue together several services to build a real-time data hub.

However, Kafka is not a silver bullet. Hence, you need to understand when NOT to use Kafka and how it relates to data warehouses, data lakes, and other applications.

After a long introduction to this aspect, long story short: If you use Amazon MSK Serverless, it is the data ingestion component in your enterprise architecture. No fully managed components other than Kafka and no native integrations to other 1st party cloud AWS services like S3 or Redshift, and 3rd party cloud services like Snowflake, Databricks, or MongoDB. You must combine Amazon MSK Serverless with several other AWS services for event processing and storage. Additionally, connectivity needs to be implemented and operated by your project team using Kafka Connect connectors, or another 1st or 3rd party ETL tools, or custom glue code).

Amazon MSK Serverless only supports AWS Identity and Access Management (IAM) authentication, which limits you to Java clients only. There is no way to use the open source clients for other programming languages. Python, C++, .NET, Go, JavaScript, etc. are not supported with Amazon MSK Serverless.

MSK Connect allows deploying Kafka Connect connectors (that are available open source, licensed from other vendors, or self-built) into this platform. Similar to Amazon MSK, this is not a fully-managed product. You deploy, operate, and monitor the connectors by yourself. Look at the fully managed connectors in Confluent Cloud to understand the difference. Also, note that AWS will only support Connect workers. But it will not support the connectors themselves even if running on MSK Connect.

Event-driven architecture with true decoupling between the microservices

An event-driven architecture powered by data streaming is great for single integration infrastructure. However, the story goes far beyond this. Modern enterprise architectures leverage principles like microservices, domain-driven design, and data mesh to build decentralized applications and data products.

A streaming data exchange enables such a decentralized architecture with real-time data sharing. A critical capability for such a strategic enterprise component is long-term data storage. It

  • decouples independent applications
  • handles backpressure for slow consumers (like batch systems or request-response web services)
  • enables replayability of historical events (e.g., for a Python consumer in the machine learning platform from data engineers).

Data Mesh with Apache Kafka

The storage capability of Kafka is a key differentiator against message queues like IBM MQ, Rabbit MQ, or AWS SQS. Retention time is an important feature to set the right storage options per Kafka topic. Confluent makes Kafka even better by providing Tiered Storage to separate storage from compute for a much more cost-efficient and scalable solution with infinite storage capabilities.

Amazon MSK Serverless has a limited retention time of 24 hours. This is good enough for many data ingestion use cases but not for building a real-time data mesh across business units or even across organizations.  Another tough requirement of Amazon MSK Serverless is the limitation of 120 partitions. Not really a limit that allows building a strategic platform around it.

As Amazon MSK Serverless is a new product, expect the limitations to change and improve over time. Check the docs for updates. UPDATE Q1 2023: Amazon MSK Serverless added unlimited retention time and support for more partitions. That’s excellent news for this service. With this update, if retention time is your critical criterion, Amazon MSK Serverless is stronger since 2023. However, check the storage costs and compare different cloud offerings for this.

But anyway, these limitations prove how hard it is to build a fully-managed Kafka offering (like Confluent Cloud) compared to a partially managed Kafka offering (like “the normal” Amazon MSK).

Hybrid AWS and multi-cloud Kafka deployments

The most obvious point: Amazon MSK Serverless is only a reasonable discussion if you run your apps in the public AWS cloud. For anything else, like multi-cloud with Azure or GCP, AWS edge offerings like AWS Outpost or Wavelength, hybrid environments, or edge deployments like a factory or retail store, AWS is no option.

If you need to deploy outside the public AWS cloud, check my comparison of Kafka offerings, including Confluent, IBM, Red Hat, and Cloudera.

I want to emphasize that no product or service is 100% cloud agnostic. For instance, building Confluent Cloud on AWS, Azure, and GCP includes unique challenges under the hood. Confluent Cloud is built on Kubernetes. Hence, the template and many automation mechanisms can be reused across cloud vendors. But storage, compute, pricing, support, and many other characteristics and features differ at each cloud service provider.

Having said this, as you leverage a SaaS like Confluent Cloud with no knowledge or access to the technical infrastructure. You don’t see these issues under the hood. On the developer level, you produce and consume messages with the Kafka API and configure additional features like fully-managed connectors, data governance, or cluster linking. All the operations complexity is handled by the vendor. No matter which cloud you run on.

Coopetition: The winners are AWS and Confluent Cloud

The reason for this post was the evolution of the Amazon MSK product line.  Hence, if you read this a year later, the product features and limitations might look completely different once again. Use blog posts like this to understand how to evaluate different solutions and SaaS offerings. Then do your own accurate research before making a product decision.

Amazon MSK Serverless is a great new AWS service that helps AWS customers with some times of projects. But it has tough limitations for some other projects. Additionally, Amazon MSK (including Amazon MSK Serverless) excludes Kafka support! And it is not a complete data streaming platform. Be careful not to create a mess of glue code between tens of serverless cloud services and applications. Confluent Cloud is the much more sophisticated fully-managed Kafka cloud offering (on AWS and everywhere). I am not saying this because I am a Confluent employee but because almost everybody agrees on this 🙂 And it is not really a surprise as Confluent only focuses on data streaming with 2000 people employees and employs many full-time committers to the Apache Kafka open source project. Amazon has zero, by the way 🙂

By the way: Did you know you can use your AWS credits to consume Confluent Cloud like any other native AWS service? This is because of the strong partnership between Confluent and AWS. Yes, there is coopetition. That’s how the world looks like today…

Confluent Cloud provides a complete cloud-native platform including 99.99% SLA, fully managed connectors and stream processing, and maybe most interesting to readers of this post, integration with AWS services (S3, Redshift, Lambda, Kinesis, etc.) plus AWS security and networking (VPC Peering, Private Link, Transit Gateway, etc.). Confluent and AWS work closely together on hybrid deployments, leveraging AWS edge services like AWS Wavelength for 5G scenarios.

Which Kafka cloud service do you use today? What are your pros and cons? Do you plan a migration – e.g., from Amazon MSK to Confluent Cloud or from open source Kafka to Amazon MSK Serverless? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Dont‘ miss my next post. Subscribe!

We don’t spam! Read our privacy policy for more info.
If you have issues with the registration, please try a private browser tab / incognito mode. If it doesn't help, write me: kontakt@kai-waehner.de

1 comment
  1. There are some factual inaccuracies in the article above. First, MSK Serverless is covered by MSK’s SLA as of May this year and second, Lambda is supported as a first-party integraion.

Leave a Reply
You May Also Like
How to do Error Handling in Data Streaming
Read More

Error Handling via Dead Letter Queue in Apache Kafka

Recognizing and handling errors is essential for any reliable data streaming pipeline. This blog post explores best practices for implementing error handling using a Dead Letter Queue in Apache Kafka infrastructure. The options include a custom implementation, Kafka Streams, Kafka Connect, the Spring framework, and the Parallel Consumer. Real-world case studies show how Uber, CrowdStrike, Santander Bank, and Robinhood build reliable real-time error handling at an extreme scale.
Read More