A World Without Kafka
Why Kafka Falls Short for Real-Time Analytics (and What Comes Next)
Apache Kafka had a remarkable run, powering event-driven architectures for more than a decade. But the landscape has evolved, revealing clear limitations of Kafka for real-time analytics as streaming analytics and decision-making use cases become more demanding. Kafka is increasingly being pushed to retrofit capabilities into a real-time analytics architecture it was never designed to support. Solving today’s streaming data pipeline and analytics challenges requires new capabilities. It’s time for a new kid on the block.
During the transition from batch processing to real-time streaming data, an open-source project developed inside LinkedIn gained significant attention and momentum: Apache Kafka. The goal was to simplify moving data from A to B in a scalable and resilient way using a publisher/subscriber model. Kafka enabled companies to build early streaming data pipelines and unlock a new class of event-driven use cases. An ever-growing ecosystem of connectors and integrations accelerated adoption and established Kafka as the preferred streaming storage layer. However, as real-time analytics architectures have evolved beyond simple ingestion, Kafka’s limitations for analytics workloads have become increasingly apparent.
(Image source: https://www.linkedin.com/pulse/apache-kafka-event-driven-architecture-prabhat-kumar/)
From an architectural standpoint, Kafka is not an analytics engine. It is a resilient and scalable record-based storage system for real-time, fresh data—often referred to as the hot layer. Analytics workloads therefore must be executed outside the Kafka cluster, continuously moving data between storage and processing engines, which increases network traffic and operational overhead. In addition, Kafka does not natively enforce schemas on data published to topics. While this flexibility was acceptable for early streaming use cases, modern real-time analytics platforms require schemas to ensure consistency, governance, and data quality. To compensate, Schema Registries emerged to enforce contracts between publishers and subscribers, adding complexity to Kafka-based analytics architectures.
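To illustrate the extra moving parts, here is a minimal sketch of that pattern using the confluent-kafka Python client: the schema contract lives in an external registry rather than in the broker, and the client-side serializer validates each record against it before publishing. The registry URL, broker address, topic name, and schema are placeholders, not a definitive setup.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# The contract lives outside the broker: an external registry holds the schema,
# and the client-side serializer validates every record against it.
schema_str = """
{
  "type": "record",
  "name": "Click",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "url", "type": "string"}
  ]
}
"""
registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # placeholder registry URL
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})       # placeholder broker address
value = serializer({"user_id": "42", "url": "/home"},
                   SerializationContext("clicks", MessageField.VALUE))
producer.produce("clicks", value=value)
producer.flush()
```

Kafka itself will happily accept bytes that bypass this check; the governance lives entirely in the clients and the registry, which is precisely the added complexity the article refers to.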
Last but not least, and perhaps most critically, Kafka is a record-based storage system. That is well-suited for use cases requiring a message queue, such as real-time ingestion or event-driven architectures, but it has considerable limitations for the current and future needs of real-time analytics projects. Processing engines such as Spark and Flink must consume entire records from a topic, even when only a portion of the event data (a few columns) is required. The result is unnecessary network traffic, degraded processing performance, and excessive storage requirements.
Record-based streaming storage components will still have their place in the data architecture. Solutions such as Kafka and Pulsar are well-suited to use cases requiring full record reads. Architectural patterns based on microservices can leverage these solutions to interchange data, decoupling functions from message transportation to improve performance, reliability, and scalability. Full record reads are also beneficial for ingestion pipelines in which data is stored in long-term storage systems, such as Object Storage, for historical and archival purposes. Bottlenecks and limitations arise when they are used for analytics workloads that require capabilities beyond a simple data transport layer.
Streaming Data Evolution
Today’s conversation is driven by a single aspect: evolution. In other words, new needs require new approaches to data management. Kafka addressed the initial needs for streaming data. This first wave was dominated mainly by real-time ingestion pipelines and discrete, simple event processing (SEP) analytics: essentially, the ability to move data from point A to point B and, in some cases, run simple data preparation and processing in between. Kafka, combined with Spark Streaming or ad-hoc connectors, was able to address those early use cases.

Fast-forward, and the second wave introduced complexity into the streaming pipeline. In addition to discrete data preparation, use cases at this stage required advanced analytics functions such as aggregation, enrichment, and complex event processing. Micro-batching fell short. A new architectural approach is needed: columnar storage with efficient projection pushdown and transparent data tiering, combined with sub-second processing engines. Apache Fluss and Apache Flink deliver on that promise and, together, constitute the third wave on the maturity scale and the future of streaming analytics.
Every tech article nowadays mentions AI/ML. This third-wave evolution enables companies to build real-time AI pipelines that embed advanced analytics techniques (such as GenAI) into streaming data. This increases the need for modern real-time storage systems with enhanced features that tier data across both fast streaming and historical layers, providing integrated, unified access to business data.
The New Kid On The Block
Apache Fluss is a modern, real-time data storage system for analytics. It consolidates years of experience and lessons learned from its predecessors while addressing the current and future needs of organizations. Fluss was born in an era in which more data is required to feed ML models, Lakehouses are part of the enterprise ecosystem, and cloud infrastructure is the preferred strategy for companies.
But data storage is just one piece of the architecture puzzle. Apache Flink provides the capabilities and resilience to process vast volumes of real-time data with sub-second latency, delivering the speed needed for future streaming applications. The ecosystem is not limited to Flink: additional processing engines and libraries are developing integrations with Fluss, further strengthening it.
Here are the main features Fluss brings to modern real-time analytics.
Stream as Table
Fluss stores data as schematized tables. This approach is suitable for most real-time use cases, including those that rely on both structured and semistructured data. By structuring streaming data, companies can enhance governance, improve data quality, and ensure that publishers and consumers share a common language. Fluss defines two types of tables:
- Log Tables are append-only, similar to Kafka topics. Use cases such as log monitoring, clickstreams, sensor readings, transaction logs, and others are good examples of append-only data. Events are immutable and should not be changed or updated.
- Primary Key (PK) Tables are mutable tables defined by a key. Records are initially inserted and subsequently updated or deleted over time according to the changelog they represent. A PK Table keeps the latest version of each record, allowing a record “lookup” access pattern. Changelog use cases, such as account balances, shopping carts, and inventory management, benefit from this approach. Kafka cannot provide this behavior natively and requires external key-value or NoSQL databases to track the current record status, resulting in complex, hard-to-maintain solutions.
In brief, PK Tables ensure record uniqueness based on the primary key and support INSERT, UPDATE, and DELETE operations, providing comprehensive record mutation capabilities. Log Tables, by contrast, are append-only and suited to cases where record updates are not required.
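As a rough sketch of how the two table types look in practice, the snippet below creates a Log Table and a PK Table through Flink SQL, shown here via PyFlink. The catalog type, address, and column names are illustrative assumptions based on the Fluss Flink connector, not a definitive setup.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment; the Fluss Flink connector is assumed to be on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a catalog pointing at a Fluss cluster (address and option names are placeholders).
t_env.execute_sql("""
    CREATE CATALOG fluss_catalog WITH (
      'type' = 'fluss',
      'bootstrap.servers' = 'localhost:9123'
    )
""")
t_env.execute_sql("USE CATALOG fluss_catalog")

# Log Table: append-only, no primary key -- conceptually close to a Kafka topic.
t_env.execute_sql("""
    CREATE TABLE clickstream (
      user_id STRING,
      url     STRING,
      ts      TIMESTAMP(3)
    )
""")

# Primary Key Table: rows are inserted, updated, and deleted by key, so the table
# always holds the latest state per item_id and supports record lookups.
t_env.execute_sql("""
    CREATE TABLE inventory (
      item_id  BIGINT,
      quantity INT,
      PRIMARY KEY (item_id) NOT ENFORCED
    )
""")
```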
Columnar Storage
The way Fluss stores data on disk is arguably its most fundamental architectural shift relative to other solutions. Unlike Kafka, Fluss leverages the Apache Arrow format to store data in a columnar layout, bringing the following benefits:
- Improved storage usage: storing data in a columnar format requires less disk space. Compression rates depend on multiple data characteristics, but initial benchmarks indicate a promising 5x improvement when using Apache Arrow as the underlying storage format. Less storage means less cost. Kafka provides only a few data compression options, which are not comparable to those available out of the box with Apache Arrow.
- Efficient querying that leverages column pruning. In general, fewer than half of the attributes of a given business event are queried or accessed, i.e., the column names you add to your SELECT statement. Projection pushdown is a technique that drops unnecessary attributes (aka column pruning) when retrieving data from the storage system. Kafka is all-or-nothing due to its record-based storage format.
- Both columnar compression and projection pushdown reduce network traffic: less data to move around results in happier network administrators. With Kafka, companies constantly experience network congestion and potentially high egress costs. A short sketch of the columnar layout and column projection follows this list.
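To make the columnar idea concrete, here is a small, self-contained Python sketch using Apache Arrow directly (the format Fluss builds on, per the above): values are laid out column by column, and a reader materializes only the columns it asks for. In Fluss the projection is applied on the storage side before data crosses the network; this sketch only illustrates the layout, with example column names.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# An Arrow table stores values column by column rather than record by record.
events = pa.table({
    "user_id":    ["u1", "u2", "u3"],
    "url":        ["/home", "/cart", "/checkout"],
    "latency_ms": [12, 48, 31],
    "payload":    ["...", "...", "..."],   # wide column most queries never touch
})

# Serialize the table as an Arrow IPC stream.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, events.schema) as writer:
    writer.write_table(events)

# A consumer that needs only two columns projects them and ignores the rest;
# projection pushdown avoids shipping the unused columns in the first place.
reader = ipc.open_stream(sink.getvalue())
projected = reader.read_all().select(["user_id", "latency_ms"])
print(projected)
```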

Lakehouse Unification
Kafka was built in the Data Lake era; Fluss was designed for the Lakehouse from the very start. This makes a big difference. Companies have realized that Data Lakes (or Data Swamps, in many cases) are difficult to keep running and to pay back the investments in licenses, hardware, and personnel required to build Big Data solutions. Fortunately, Lakehouses overcome those challenges. Lakehouses assert that data should be widely and easily accessible regardless of its age. Batch and real-time events overlap, and processing engines must be able to access both layers transparently.
These are the data tiering and unified-view capabilities Fluss can provide, in addition to the hot/fresh data layer:
- Warm layer, for data aged from minutes to hours, primarily stored in Object Storage solutions.
- Cold layer, for data aged from days to years. Lakehouse solutions such as Apache Paimon and Iceberg are the preferred platforms for this historical data, feeding ML models, retrospective analytics, and compliance.
- Zero-copy data tiering, aging data from the hot layer (Fluss tables) to the warm/cold layers (Object Storage and Lakehouse). This means a single copy of each data unit is available, either in the real-time layer or in the historical layer. Fluss manages the cutover between layers, facilitating querying and access. The Kafka approach relies on data duplication via a consumer/publisher job, resulting in increased storage costs and the need to convert Kafka topics to a Lakehouse table format. A sketch of enabling tiering on a Fluss table follows this list.
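As a hedged sketch of what enabling tiering might look like from Flink SQL, continuing the PyFlink setup above: the table is created with lakehouse tiering turned on so that older data ages out of Fluss into the configured Lakehouse (e.g., Paimon), while queries keep a unified view. The option name and the exact tiering behavior shown here are assumptions modeled on the Fluss connector and may differ across versions.

```python
# Continuing with the TableEnvironment (t_env) from the earlier sketch.
# 'table.datalake.enabled' is the assumed per-table switch for lakehouse tiering.
t_env.execute_sql("""
    CREATE TABLE orders (
      order_id BIGINT,
      amount   DECIMAL(10, 2),
      ts       TIMESTAMP(3),
      PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
      'table.datalake.enabled' = 'true'
    )
""")

# Downstream queries see one logical table: fresh rows served from Fluss,
# older rows served from the tiered Lakehouse layer, with no duplicate copy to manage.
t_env.execute_sql("SELECT order_id, amount FROM orders WHERE amount > 100").print()
```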
A Bright Future Ahead
Real-time data analytics is becoming a cornerstone of modern companies. Digital business models must deliver a better user experience and timely responses to customer interactions, which forces companies to build systems that harness and manage data in real time to create engaging, “wow” experiences. Acting now is not merely a matter of technical feasibility; for most enterprises, it is becoming a decisive advantage for survival in a highly competitive global market.
Fluss helps companies bridge the gap between the real-time and analytics worlds, offering unified access to both fresh, real-time data and historical, cold data. In brief, Fluss enables seamless data access regardless of a dataset’s age and simplifies complex analytics architectures that have been dragged along for years, primarily due to the lack of best-fit components and frameworks. With Fluss serving as the real-time storage layer for analytics, the Lakehouse gains the governance, simplicity, and scalability that future-proof modern architectures.
On the operational side, it offers significant advantages by reducing the complexity of managing, storing, and serving both real-time and batch data. These efficiencies translate into direct cost savings, primarily achieved through the optimized Fluss table format, a tiered storage system based on data temperature, and minimized overall pipeline CPU usage via predicate pushdown and column pruning. Collectively, these architectural elements alleviate the operational overhead associated with platform maintenance, accelerate the onboarding of new use cases, and facilitate seamless integration with the existing enterprise IT infrastructure.
Additional Resources
Fluss: Unified Streaming Storage For Next-Generation Data Analytics