In the early 2010s, LinkedIn faced a massive challenge: they needed to process billions of events per day in real time. Traditional message queues couldn’t handle the scale, and databases weren’t built for constant streaming. They built Apache Kafka, and in doing so, they created the “central nervous system” for the modern digital economy.
Today, Kafka is used by over 80% of the Fortune 100. But as companies move to the cloud, the question is no longer just what Kafka is, but how to run it effectively across AWS, GCP, and Azure.

1. What is Apache Kafka? A Deep Dive into the Architecture
At its core, Apache Kafka is a distributed event streaming platform. Unlike a traditional database that stores the state of the world (e.g., “User A has $50”), Kafka stores the events that led to that state (e.g., “User A deposited $20,” then “User A deposited $30”).
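To make the distinction concrete, here is a minimal Python sketch (no Kafka required; the event shapes are hypothetical) showing that the database-style “state” is just a fold over the event log:

```python
# State-as-a-fold: the current balance is derived by replaying events.
from functools import reduce

# Hypothetical event log for User A, oldest first.
events = [
    {"user": "A", "type": "deposit", "amount": 20},
    {"user": "A", "type": "deposit", "amount": 30},
]

def apply_event(balance: int, event: dict) -> int:
    """Compute the next state from the previous state plus one event."""
    if event["type"] == "deposit":
        return balance + event["amount"]
    return balance

print(reduce(apply_event, events, 0))  # 50 -- "User A has $50"
```

Because the log is the source of truth, you can always rebuild the state, audit how you got there, or derive entirely new views from the same events.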
The Core Components
- Producers: Applications that generate data (e.g., clickstreams, IoT sensors, logs).
- Brokers: The servers that form the Kafka cluster. They receive, store, and serve data.
- Topics: The logical category where data is stored (e.g., “customer-orders”).
- Partitions: Topics are divided into partitions for scalability. Each partition is an append-only log where data is strictly ordered by an offset.
- Consumers: Applications that read data. Consumers can be grouped into Consumer Groups to balance the load of reading across multiple partitions. The sketch after this list shows these pieces in motion.
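Here is a minimal sketch of these components using the confluent-kafka Python client, assuming a hypothetical local broker at localhost:9092 and the “customer-orders” topic from above:

```python
from confluent_kafka import Producer, Consumer

# Producer: messages with the same key always land on the same partition,
# which preserves per-key ordering.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("customer-orders", key="user-a", value='{"item": "shoes"}')
producer.flush()  # block until the broker acknowledges

# Consumer: every member of the same group.id splits the topic's
# partitions among themselves to balance the load.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processors",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-orders"])
msg = consumer.poll(5.0)  # returns None on timeout
if msg is not None and msg.error() is None:
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())
consumer.close()
```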
The Power of the “Log”
Kafka is unique because it treats data as a distributed commit log. This means:
- Immutability: Once an event is written, it cannot be changed.
- Persistence: Data is written to disk and replicated. You can replay data from seven days ago as easily as data from seven seconds ago (the replay sketch after this list shows how).
- High Throughput: By using sequential disk I/O and “Zero Copy” transfer, Kafka can handle millions of messages per second with minimal CPU overhead.
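That replay claim is not hypothetical: the consumer API lets you ask the broker for the offsets that were current at any timestamp and start reading from there. A sketch, assuming the “customer-orders” topic from above has three partitions:

```python
import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "replay-demo",
})

# Ask the broker which offset each partition held seven days ago.
seven_days_ago_ms = int((time.time() - 7 * 24 * 3600) * 1000)
wanted = [TopicPartition("customer-orders", p, seven_days_ago_ms)
          for p in range(3)]  # assumes 3 partitions
offsets = consumer.offsets_for_times(wanted, timeout=10.0)

# Pin the consumer to exactly those offsets and read history as if it were live.
consumer.assign(offsets)
msg = consumer.poll(5.0)
```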
2. Kafka on the Clouds: AWS vs. GCP vs. Azure
Choosing how to run Kafka is a strategic decision. You have three main paths: Self-Managed (EC2/VMs), Cloud-Native Managed (Amazon MSK, Google’s Managed Service for Apache Kafka), or Enterprise SaaS (Confluent Cloud).
Amazon Web Services (AWS): Amazon MSK
Amazon Managed Streaming for Apache Kafka (MSK) is the “Gold Standard” for teams already deep in the AWS ecosystem.
- How it works: AWS manages the EC2 instances and ZooKeeper/KRaft for you (a connection sketch follows this list).
- Pros: Seamless integration with IAM for security, CloudWatch for logs, and Lambda for event processing.
- Cons: It is “managed infrastructure,” not “serverless.” You still have to choose broker sizes (e.g., m5.large) and manually trigger some scaling operations.
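Because MSK is genuine Apache Kafka on managed EC2, any standard client works once pointed at the cluster’s bootstrap brokers. A minimal sketch, assuming a hypothetical MSK endpoint (the TLS listener conventionally sits on port 9094):

```python
from confluent_kafka import Producer

# The bootstrap string is a placeholder for the value copied from the AWS console.
producer = Producer({
    "bootstrap.servers": "b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094",
    "security.protocol": "SSL",  # MSK's TLS listener; IAM auth uses a different port
})
producer.produce("customer-orders", value=b"hello from MSK")
producer.flush()
```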
Google Cloud Platform (GCP): Google Managed Service for Apache Kafka
A newer entrant, Google’s native Kafka service aims for extreme simplicity.
- How it works: It uses a pay-as-you-go model based on vCPUs and storage, abstracting away the broker management.
- Pros: Excellent for big-data workloads; integrates natively with BigQuery and Vertex AI.
- Cons: Newer than MSK, meaning fewer “battle-tested” community patterns for complex edge cases.
Microsoft Azure: Event Hubs for Kafka
Azure takes a unique “API-compatible” approach.
- How it works: Azure Event Hubs exposes a Kafka-compatible endpoint. Your Kafka apps think they are talking to Kafka, but they are actually talking to Azure’s proprietary backend (a connection sketch follows this list).
- Pros: Fully serverless. No clusters to manage at all.
- Cons: Not “pure” Kafka. Some specific Kafka features (like certain Log Compaction settings) may behave differently.
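That compatibility is visible in the client config: you keep a stock Kafka client and simply point it at the Event Hubs namespace. A sketch with a hypothetical namespace (the “$ConnectionString” username is the documented convention for the Kafka endpoint):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",
    # The password is the namespace's full connection string (placeholder here).
    "sasl.password": "Endpoint=sb://my-namespace.servicebus.windows.net/;...",
})
producer.produce("customer-orders", value=b"hello from Event Hubs")
producer.flush()
```

The Event Hub itself plays the role of the Kafka topic, so there is no topic administration to speak of.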
3. The “10x” Better Philosophy: Confluent Cloud
As Confluent (the company founded by Kafka’s creators) argues, simply putting Kafka in a container on AWS isn’t enough. They spent over 3 million engineering hours building Kora, a cloud-native engine that makes Kafka “10x better.”
Why “10x” Matters
- 10x Faster Scaling: In open-source Kafka, adding a broker requires “rebalancing” data, which can take hours or days as data moves across the network. Confluent Cloud uses Intelligent Tiered Storage, separating compute from storage so you can scale in minutes.
- 10x Better Resiliency: With a 99.99% SLA, they handle the “on-call” duty. If a broker fails at 3 AM, their automated systems fix it before you even get an alert.
- The Complete Fabric: It includes a Schema Registry (to prevent bad data from breaking your apps) and Cluster Linking, which allows data to flow between AWS, GCP, and Azure as if they were one single cluster. The sketch after this list shows the Schema Registry guardrail in action.
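Here is a sketch of that guardrail using the confluent-kafka schema-registry client; the registry endpoint and credentials are placeholders:

```python
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{"type": "record", "name": "Order",
 "fields": [{"name": "item", "type": "string"},
            {"name": "quantity", "type": "int"}]}
"""

# Hypothetical Confluent Cloud registry endpoint and API key.
registry = SchemaRegistryClient({
    "url": "https://psrc-xxxxx.us-east-1.aws.confluent.cloud",
    "basic.auth.user.info": "API_KEY:API_SECRET",
})
serializer = AvroSerializer(registry, schema_str)
ctx = SerializationContext("customer-orders", MessageField.VALUE)

serializer({"item": "shoes", "quantity": 1}, ctx)  # valid: serializes fine
try:
    serializer({"item": "shoes", "quantity": "one"}, ctx)  # wrong type
except Exception as exc:
    print("Rejected before it ever reaches the topic:", exc)
```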
4. Cost Comparison: A Strategic View
When evaluating costs, you must look at Total Cost of Ownership (TCO), not just the monthly cloud bill.
| Metric | Self-Managed (EC2) | Amazon MSK | Confluent Cloud |
| --- | --- | --- | --- |
| Compute Cost | Low (raw EC2 prices) | Medium | High (premium SaaS) |
| Ops Cost | Very High (requires SREs) | Medium | Near Zero |
| Scaling Speed | Days | Hours | Minutes |
| Best For | Extreme custom needs | Standard AWS shops | Rapid growth & multi-cloud |
Pro Tip: While MSK might look 30% cheaper on the “server bill,” the cost of hiring two specialized Kafka engineers (approx. $300k+/year) often makes Confluent Cloud cheaper for mid-sized teams.
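A back-of-the-envelope sketch of that math in Python. The infrastructure figure is an invented assumption; only the 30% gap and the $300k staffing cost come from the tip above:

```python
# Hypothetical annual MSK infrastructure bill.
msk_infra = 240_000
# "30% cheaper on the server bill" implies Confluent's bill is msk / 0.7.
confluent_infra = msk_infra / 0.7          # ~$342,857
kafka_engineers = 300_000                  # two specialized engineers

msk_tco = msk_infra + kafka_engineers      # $540,000
confluent_tco = confluent_infra            # ops cost assumed near zero

print(f"MSK TCO:       ${msk_tco:,.0f}")
print(f"Confluent TCO: ${confluent_tco:,.0f}")
```

Under these assumptions the managed platform wins on TCO despite the higher sticker price; plug in your own numbers before deciding.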
5. Industry Use Cases
- Finance: Fraud detection (processing transactions against ML models in <50ms).
- Retail: Real-time inventory. When you buy the last pair of shoes in-store, the website updates instantly.
- Healthcare: Monitoring patient vitals via IoT and alerting doctors immediately if a threshold is crossed.
- Automotive: Connected cars streaming telemetry data to optimize engine performance and predict maintenance.
6. Pros and Cons Summary
Pros
- Scalability: Can grow from a small startup cluster to a global monster handling petabytes.
- Decoupling: Allows microservices to communicate without being “locked” to each other.
- Ecosystem: Thousands of pre-built connectors (S3, Snowflake, MongoDB, etc.).
Cons
- Complexity: The learning curve is steep. Terms like “ISR,” “LSO,” and “Idempotent Producers” require study.
- Operational Burden: If you don’t use a managed service, “Day 2” operations (upgrades, patching) are notoriously difficult.
- ZooKeeper/KRaft: The metadata layer is another moving part to manage, and another potential point of failure.
Conclusion
Apache Kafka has evolved from a LinkedIn internal project into the world’s most powerful event-streaming engine. Whether you choose the native simplicity of AWS MSK, the analytical power of GCP, or the 10x “Complete Platform” of Confluent, the goal remains the same: transforming your business from one that analyzes the past to one that acts in the present.