Apache Kafka has had a remarkable run, powering event-driven architectures for more than a decade. But the landscape has evolved, and Kafka’s limitations for real-time analytics have become clear as modern streaming analytics and decision-making use cases grow more demanding. Kafka is increasingly being pushed to retrofit capabilities into a real-time analytics architecture it was never designed to support. Meeting today’s streaming data pipeline challenges and analytics requirements calls for new capabilities. It’s time for a new kid on the block.
During the transition from batch processing to real-time streaming data, an open-source project developed inside LinkedIn gained significant attention and momentum: Apache Kafka. The goal was to simplify moving data from A to B in a scalable and resilient way using a publisher/subscriber model. Kafka enabled companies to build early streaming data pipelines and unlock a new class of event-driven use cases. An ever-growing ecosystem of connectors and integrations accelerated adoption and established Kafka as the preferred streaming storage layer. However, as real-time analytics architectures have evolved beyond simple ingestion, Kafka’s limitations for analytics workloads have become increasingly apparent.
[Figure: Apache Kafka in an event-driven architecture. Source: https://www.linkedin.com/pulse/apache-kafka-event-driven-architecture-prabhat-kumar/]
From an architectural standpoint, Kafka is not an analytics engine. It is a resilient and scalable record-based storage system for real-time, fresh data—often referred to as the hot layer. Analytics workloads therefore must be executed outside the Kafka cluster, continuously moving data between storage and processing engines, which increases network traffic and operational overhead. In addition, Kafka does not natively enforce schemas on data published to topics. While this flexibility was acceptable for early streaming use cases, modern real-time analytics platforms require schemas to ensure consistency, governance, and data quality. To compensate, Schema Registries emerged to enforce contracts between publishers and subscribers, adding complexity to Kafka-based analytics architectures.
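To make the schema-contract point concrete, here is a minimal sketch of how schema enforcement is typically bolted onto a Kafka pipeline: the contract is validated by the serializer against an external Schema Registry, not by the broker itself. The class and property names follow common Confluent serializer conventions; the topic name, record schema, and registry endpoint are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SchemaEnforcedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The broker does not validate this contract; enforcement happens in the
        // serializer, backed by an external Schema Registry service.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // illustrative endpoint

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
          + "{\"name\":\"userId\",\"type\":\"string\"},"
          + "{\"name\":\"url\",\"type\":\"string\"}]}");

        GenericRecord event = new GenericData.Record(schema);
        event.put("userId", "u-42");
        event.put("url", "/checkout");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "u-42", event));
        }
    }
}
```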
Last but not least, and perhaps the most critical aspect, Kafka is a record-based storage system. That is well-suited for use cases requiring a message queue, such as real-time ingestion or event-driven architectures, but has considerable limitations in addressing the current and future needs for real-time projects. Processing engines such as Spark and Flink must consume the entire topic data, even though just a portion of the event data (columns) is required. The impact is unnecessary network traffic, degraded processing performance, and excessive storage requirements.
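As an illustration, consider a plain Kafka consumer that needs only a single field from each event. The topic and payload shape below are hypothetical, but the underlying mechanics are not: the full serialized record is fetched over the network and deserialized before any column can be projected out.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FullRecordConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The entire serialized event crosses the network and is deserialized,
                    // even though this job only cares about the "url" field inside it.
                    String fullPayload = record.value();
                    System.out.println("bytes read: " + fullPayload.length()
                        + ", field of interest: url");
                }
            }
        }
    }
}
```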
Record-based streaming storage components will still have their place in the data architecture. Solutions such as Kafka and Pulsar are well-suited to use cases requiring full record reads. Microservice-based architectural patterns can leverage these solutions to exchange data, decoupling business functions from message transport to improve performance, reliability, and scalability. Full record reads are also beneficial for ingestion pipelines, in which data is stored in long-term storage systems, such as Object Storage, for historical and archival purposes. Bottlenecks and limitations arise when they are used for analytics workloads that require capabilities beyond a simple data transport layer.
Today’s conversation is driven by a single aspect: evolution. In other words, new needs require new approaches to data management. Kafka addressed the initial needs of streaming data. This first wave was dominated mainly by real-time ingestion pipelines and discrete analytics (SEP, Simple Event Processing): essentially, the ability to move data from point A to point B and, in some cases, run simple data preparation and processing in between. Kafka, combined with Spark Streaming or ad-hoc connectors, was able to address those early use cases.
Fast-forward, and the second wave introduced complexity into the streaming pipeline. Beyond discrete data preparation, use cases at this stage required advanced analytics functions such as aggregation, enrichment, and complex event processing. Micro-batching fell short. A new architectural approach is needed: columnar storage with efficient projection pushdown and transparent data tiering, combined with sub-second processing engines. Apache Fluss and Apache Flink deliver on that promise and, together, constitute the future and the third wave on the maturity scale.
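To ground what a typical second-wave workload looks like, here is a minimal Flink sketch of a continuous windowed aggregation. The table definition uses the built-in datagen connector as a stand-in source; in a Fluss-based pipeline the table would instead come from the Fluss catalog. Table and column names are illustrative assumptions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingAggregation {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical source table; a Fluss-backed table would replace the datagen stub.
        tEnv.executeSql(
            "CREATE TABLE page_views ("
          + "  user_id STRING,"
          + "  url STRING,"
          + "  view_time TIMESTAMP(3),"
          + "  WATERMARK FOR view_time AS view_time - INTERVAL '5' SECOND"
          + ") WITH ('connector' = 'datagen')");

        // Continuous per-minute aggregation: the kind of second-wave workload that
        // micro-batching struggles to serve with sub-second freshness.
        tEnv.executeSql(
            "SELECT window_start, url, COUNT(*) AS views "
          + "FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(view_time), INTERVAL '1' MINUTES)) "
          + "GROUP BY window_start, window_end, url").print();
    }
}
```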
Every tech article nowadays mentions AI/ML. This third-wave evolution enables companies to build real-time AI pipelines that embed advanced analytics techniques (such as GenAI) into streaming data. This increases the need for modern real-time storage systems with enhanced features that tier data across both fast streaming and historical layers, providing integrated, unified access to business data.
Apache Fluss is a modern, real-time data storage system for analytics. It consolidates years of experience and lessons learned from its predecessors while addressing the current and future needs of organizations. Fluss was born in an era in which more data is required to feed ML models, Lakehouses are part of the enterprise ecosystem, and cloud infrastructure is the preferred strategy for companies.
But data storage is just one piece of the architecture puzzle. Apache Flink provides the capabilities and resilience to process vast volumes of real-time data with sub-second latency, delivering the speed needed for future streaming applications. And it is not limited to Flink: additional processing engines and libraries are developing integrations with Fluss, strengthening the ecosystem.
Here are the main features Fluss brings to modern real-time analytics.
Fluss stores data as schematized tables. This approach suits most real-time use cases, including those that rely on both structured and semistructured data. By structuring streaming data, companies can enhance governance, improve data quality, and ensure that publishers and consumers share a common language. Fluss defines two types of tables: append-only Log Tables, for event streams that are only ever appended to, and Primary Key Tables, which support updates and upserts keyed on a primary key.
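As a minimal sketch of what declaring such a table can look like from Flink SQL, the snippet below creates a primary-key table through a Fluss catalog. The catalog options ('type' = 'fluss', 'bootstrap.servers') follow the Fluss quickstart conventions; the server address, table name, and columns are illustrative assumptions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlussTableSetup {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Catalog options follow the Fluss quickstart; address and port are assumptions.
        tEnv.executeSql(
            "CREATE CATALOG fluss_catalog WITH ("
          + " 'type' = 'fluss',"
          + " 'bootstrap.servers' = 'localhost:9123')");
        tEnv.executeSql("USE CATALOG fluss_catalog");

        // A primary-key table: rows with the same key are upserted rather than appended.
        tEnv.executeSql(
            "CREATE TABLE customer_profile ("
          + "  customer_id BIGINT,"
          + "  segment STRING,"
          + "  last_seen TIMESTAMP(3),"
          + "  PRIMARY KEY (customer_id) NOT ENFORCED"
          + ")");
    }
}
```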
The way Fluss stores data on disk is arguably its most fundamental architectural shift relative to other solutions. Unlike Kafka, Fluss leverages the Apache Arrow format to store data in a columnar layout, which brings several benefits: consumers can read only the columns they need (projection pushdown and column pruning), cutting network traffic; processing engines spend less CPU deserializing data they will discard; and columnar layouts compress well, reducing storage requirements.
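As an illustrative sketch (reusing the hypothetical customer_profile table and Fluss catalog from the previous snippet), a query that touches only two columns lets a columnar storage layer serve just those columns instead of full records:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ProjectionQuery {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
            "CREATE CATALOG fluss_catalog WITH ("
          + " 'type' = 'fluss', 'bootstrap.servers' = 'localhost:9123')");
        tEnv.executeSql("USE CATALOG fluss_catalog");

        // Only customer_id and segment are referenced, so a columnar log format can
        // ship just those columns to the engine instead of every field of every record.
        tEnv.executeSql("SELECT customer_id, segment FROM customer_profile").print();
    }
}
```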
Kafka was built in the Data Lake era; Fluss, by contrast, was designed for the Lakehouse from the start, and that makes a big difference. Companies learned that Data Lakes (or Data Swamps, in many cases) are hard to keep running and rarely pay back the investments in licenses, hardware, and personnel required to build Big Data solutions. Fortunately, Lakehouses overcome those challenges. Lakehouses assert that data should be widely and easily accessible regardless of its age. Batch and real-time events overlap, and processing engines must be able to access both layers transparently.
In addition to the hot/fresh data layer, Fluss provides data tiering and unified-view capabilities: older data can be tiered transparently out of the streaming layer into Lakehouse storage, and queries can read fresh streaming data and tiered historical data together as a single table, without stitching two systems into the pipeline.
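Here is a minimal sketch of how this can look from the query side, assuming the Fluss catalog from the earlier snippets. The 'table.datalake.enabled' property follows what the Fluss documentation describes for Lakehouse tiering, but it should be treated as an assumption here, as should the table and column names.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TieredTableQuery {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
            "CREATE CATALOG fluss_catalog WITH ("
          + " 'type' = 'fluss', 'bootstrap.servers' = 'localhost:9123')");
        tEnv.executeSql("USE CATALOG fluss_catalog");

        // Assumed property: ask Fluss to tier older data into Lakehouse storage.
        tEnv.executeSql(
            "CREATE TABLE orders ("
          + "  order_id BIGINT,"
          + "  amount DECIMAL(10, 2),"
          + "  order_time TIMESTAMP(3)"
          + ") WITH ('table.datalake.enabled' = 'true')");

        // With tiering enabled, the goal is a single query over both fresh (streaming)
        // and tiered (Lakehouse) data, rather than a separate batch pipeline.
        tEnv.executeSql("SELECT order_id, amount FROM orders").print();
    }
}
```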
Real-time data analytics is becoming a cornerstone of modern companies. Digital business models must deliver a better user experience and timely responses to customer interactions, which pushes companies to build systems that harness and manage data in real time to create engaging, “wow” experiences. Acting on data now is not merely a matter of technical feasibility; for most enterprises, it is becoming a condition for survival in a highly competitive global market.
Fluss helps companies bridge the gap between the real-time and analytics worlds, offering unified access to both fresh, real-time data and historical, cold data. In brief, Fluss enables seamless data access regardless of a dataset’s age and simplifies complex data analytics architectures that have been dragged along for years, largely for lack of best-fit components and frameworks. With Fluss serving as the real-time storage layer for analytics, the Lakehouse gains the governance, simplicity, and scalability that future-proof modern architectures.
On the operational side, Fluss offers significant advantages by reducing the complexity of managing, storing, and serving both real-time and batch data. These efficiencies translate into direct cost savings, achieved primarily through the optimized Fluss table format, a two-tiered storage system based on data temperature, and minimized overall pipeline CPU usage via predicate pushdown and column pruning. Collectively, these architectural elements reduce the operational overhead of platform maintenance, accelerate the onboarding of new use cases, and ease integration with existing enterprise IT infrastructure.
Fluss: Unified Streaming Storage For Next-Generation Data Analytics