Introducing The Era of "Zero-State" Streaming Joins

Large-scale streaming joins, real-time data enrichment, and continuous analytics have historically been limited by the complexity of managing state using solutions like Apache Flink®. As streams grow in volume and velocity, operators face memory pressure, slow checkpoints, and long recovery times.

That complexity ends today with the official launch of the managed service for Apache Fluss™ (Incubating) on Ververica’s Unified Streaming Data Platform. Organizations can now take advantage of enterprise-grade streaming innovations that resolve long-standing challenges in modern streaming architectures and stateful stream processing.

Watch the Demo: 

Ready to see Ververica and Fluss in action together? 

(Brief) History of Streaming Joins

Real-time streaming is revolutionizing how organizations make decisions. At the core of this shift are streaming joins, which combine multiple continuous streams of data as they arrive, producing unified, real-time views of entities such as customers, devices, or orders.

Streaming joins enable real-time data enrichment, where raw event streams are enhanced with dynamic contextual information, such as adding product or user profile data, to provide a complete view for analytics, AI models, or operational decision-making.

For instance, a Customer 360 system merges streams from web and mobile apps, payment systems, Customer Relationship Management (CRM) platforms, marketing tools, and product telemetry to build a complete, up-to-date view of each customer. Similarly, IoT platforms join device telemetry with user activity to detect anomalies, while e-commerce platforms join clickstreams with purchase events for personalization and fraud detection.

By joining streams in motion, organizations can react instantly and maintain accurate, up-to-date insights. However, implementing streaming joins at scale has remained complex in practice.

The Curse of Streaming Joins

Anyone operating large-scale stream processing systems is familiar with the operational challenges. As state grows with each additional stream and key, the system must maintain increasingly large volumes of data. Over time, resource demand increases, checkpointing slows to a crawl, and recovery becomes more time-consuming. Each new stream makes things worse. The culprit? State.

These issues are not unique to Apache Flink; they are inherent to the nature of stateful stream processing. Managing large and continuously evolving state while preserving low latency and high availability is one of the most difficult problems in distributed computing.

For years, both the Apache Flink community and Ververica have been addressing these challenges through continuous innovation. Ververica’s VERA engine has introduced several key advancements designed to make large-scale state management more efficient and resilient, including:

  • Tiered storage: offloads cold state to reduce memory pressure while enabling larger state sizes.
  • Lazy state loading: speeds up startup and recovery by loading state on demand while the job runs.
  • Key-value separation: improves scalability by storing large value data externally while keeping metadata local.

At the same time, open-source Apache Flink continues to evolve rapidly. The recent introduction of disaggregated state with ForStDB as a new state backend demonstrates the community’s ongoing commitment to improving performance, scalability, and reliability in state management.

This ongoing evolution, from Flink’s pioneering architecture to Ververica’s enterprise-grade enhancements, has paved the way for the next step in simplifying and scaling stream processing.

Zero-State Streaming Joins: The Next Step

While these innovations reduce the pain of managing large state, stream processing systems still fundamentally need to reason about state. To address this challenge at its core, Apache Fluss introduces Zero-State Streaming Joins, a new approach where state is seamlessly moved from the compute layer into the streaming storage layer itself.

“Zero-state” doesn’t mean stateless. The state still exists, but it now resides in a more logical, scalable place. In traditional architectures, streaming joins can easily accumulate terabytes of state within Flink, leading to operational complexity and resource strain. Fluss changes this by offloading join state into a purpose-built, indexed storage layer that can efficiently manage, merge, and query data natively. This allows Apache Flink to focus purely on computation, without being burdened by heavy, persistent state.

Though conceptually simple, this represents a fundamental shift in how we design and operate stateful workloads. Fluss introduces key innovations such as partial updates and Delta Joins, which we’ll explore later in this post. Together, these transform streaming joins from a scalability bottleneck into a practical, efficient, and elastic foundation for real-time data architectures, making continuous data unification achievable at enterprise scale.

The Problem: Why Streaming Joins Don’t Scale Easily

The Multi-Stream Merge Scenario

Let’s look at an actual use case. An enterprise wants a real-time Customer 360 view. To accomplish this, they begin by merging data from:

  • Web and mobile apps (behavior, sessions)
  • Transactions (purchases, payments)
  • CRM systems (support tickets, preferences)
  • Marketing platforms (campaigns, engagement)
  • Product telemetry (feature usage)

Each stream updates independently, but all must converge on the same customer_id to deliver the 360 view that supports personalization, fraud detection, and intelligent support.

Traditional Architecture: Apache Kafka & Flink Joins

The common, standard implementation is relatively straightforward (below, Figure 1 shows an Apache Kafka-centric approach):

  1. Each source writes its own Kafka topic.
  2. Apache Flink consumes all topics.
  3. Flink performs multi-way joins on customer_id.

However, in production, this design often becomes difficult to scale and maintain.


Figure 1: Traditional Apache Kafka-Centric Architecture

Why Kafka + Flink Breaks at Scale

Streaming joins in Apache Flink are inherently stateful operations. To join multiple continuous streams, Flink must buffer incoming events until corresponding records with matching keys arrive from the other streams. This requires maintaining a large keyed state, where partial records and intermediate join results are stored in memory or on disk.

For every new event, Flink performs a lookup against this existing state to find matching entries and produce the joined output. To ensure fault tolerance, the entire state is periodically checkpointed, persisting potentially large volumes of data. As a result, streaming joins can become resource-intensive, consuming significant memory and storage while increasing checkpoint size and recovery times.
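To make the state growth concrete, here is a deliberately naive Python sketch of the buffering a stream-stream join performs. This is a toy model, not Flink’s actual implementation: every event is kept in keyed state so matches can be produced later, which means state grows with every distinct key and every event and never shrinks.

```python
from collections import defaultdict

class NaiveStreamJoin:
    """Toy model of a stream-stream inner join: every event is buffered
    in keyed state until matches arrive from the other side, so operator
    state grows with every distinct key and event."""

    def __init__(self):
        self.left_state = defaultdict(list)   # keyed buffer, side A
        self.right_state = defaultdict(list)  # keyed buffer, side B

    def on_left(self, key, event):
        self.left_state[key].append(event)    # state grows on every event
        return [(event, r) for r in self.right_state[key]]

    def on_right(self, key, event):
        self.right_state[key].append(event)   # ...and never shrinks
        return [(l, event) for l in self.left_state[key]]

    def state_size(self):
        return (sum(len(v) for v in self.left_state.values())
                + sum(len(v) for v in self.right_state.values()))
```

Every checkpoint must persist the entire `left_state`/`right_state` buffers, which is exactly why checkpoint size and duration track state growth.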

This creates several scaling issues, including:

  • High memory usage: Even a moderate multi-stream join can consume tens of gigabytes just to store keys and intermediate results.
  • Increased CPU load: Each incoming event triggers multiple lookups and join evaluations.
  • Slow checkpoints: As the state grows, checkpoint operations take longer, reducing throughput.
  • Network overhead: Large state transfers between TaskManagers increase shuffle load and recovery times.

The Exponential Scaling Problem

As more streams are added, the operational cost of streaming joins in Flink increases nonlinearly. A simple two-way join is typically manageable, but as the number of input streams grows, complexity and resource consumption rise rapidly. A four-way join already becomes challenging to maintain, requiring substantial memory and state management, while an eight-way join can easily overwhelm the cluster.

At enterprise scale, where joining data from ten or more sources is common, streaming joins can consume 60–80% of available cluster resources, with state sizes reaching terabytes. At that point, Flink spends most of its effort managing and checkpointing state rather than processing new events. The result is a cruel irony: as the system handles more valuable real-time data, the underlying infrastructure becomes significantly harder and more expensive to operate efficiently.

The Solution: Apache Fluss and Zero-State Streaming Joins

Apache Fluss redefines how streaming architectures handle state. Instead of treating data as endless append-only logs (like Kafka), Fluss introduces primary-key tables as a first-class, mutable storage primitive.

The key difference is that state is no longer maintained within Flink operators. In conventional architectures:

  • Each operator stores local state for joins, buffering, and lookups.
  • This local state is the primary cause of high memory usage and slow checkpoints.

With Fluss, operator state is offloaded to the streaming storage layer, significantly reducing state requirements and improving overall system performance. Fluss handles merging, coordination, and versioning natively, so Flink becomes a lightweight processing engine reading already-merged, queryable data from Fluss. This is the essence of zero-state streaming joins: not stateless computing, but moving the state to a layer purpose-built for it. The result is lean compute, durable state, and streaming joins that actually scale.

Figure 2: Modern Fluss-Centric Architecture

The table below summarizes the differences between the legacy Kafka and Flink and modern Fluss and Flink approaches:

                     Kafka & Flink                      Fluss & Flink
State Location       In Flink operators                 In Fluss KVStore
Materialized State   Buffered events from all streams   Single merged row per key
State Management     Checkpointing & recovery overhead  Handled by Fluss
Resource Cost        High compute & memory load         Optimized for storage, not compute
Compute Role         Stateful operator                  Zero operator state, reads changelog
Job Management       Hard                               Easy

Next, let’s jump into the specific Fluss features that enable this seamless, zero-state integration with Flink: partial updates, Delta Joins, and streaming lookup joins.

Partial Updates: Eliminating Multi-Stream Merge State

The Problem

You need to merge multiple event streams that update the same entity (like customers, orders, or devices). Each stream owns different attributes but shares a primary key.

Traditional Flink joins keep all streams in operator state and perform in-memory joins. This is costly, slow, and mismatched with the actual goal: maintaining only the latest version of each entity, not every intermediate draft.

How Partial Updates Work

With Fluss’s primary key tables, each source writes only the columns it owns, directly to a shared table keyed by the primary key. Think of Fluss as a Google Docs for streaming data: everyone can edit, and the record stays consistent.

Example: Partial Updates for Real-Time Customer 360

Let’s look at an example using our Customer 360 use case. In the example below, each stream writes partial updates, while Fluss merges them into a single row per customer. As a result, the system always reflects the freshest data, with no coordination logic or “restart-the-job” mornings.

Source      Columns Updated
Web app     last_page_viewed, session_duration
Payments    last_purchase_date, lifetime_value
CRM         last_contact_date, satisfaction_score
Marketing   email_engagement_score, last_campaign_response
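The merge behavior can be sketched in a few lines of plain Python. This is a toy model of partial-update semantics, not the Fluss implementation: each writer touches only the columns it owns, and the row converges to the union of the freshest values per column.

```python
def apply_partial_update(table, key, columns):
    """Merge a partial update into a primary-key table: only the columns
    owned by the writing stream are overwritten; every other column of
    the row is preserved."""
    row = table.setdefault(key, {})
    row.update(columns)
    return row

customers = {}
# Each source writes only the columns it owns for the same customer_id.
apply_partial_update(customers, "c42", {"last_page_viewed": "/pricing"})
apply_partial_update(customers, "c42", {"lifetime_value": 1250.0})
merged = apply_partial_update(customers, "c42", {"satisfaction_score": 4.8})
```

After all three writes, `customers["c42"]` is a single merged row containing every column, with no join operator or buffered events involved.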

Figure 3: How Partial Updates in Fluss Work

Want to try this yourself? Find a hands-on example here.

Delta Joins: The Bidirectional Lookup Join

The Problem

When two high-volume streams need to be joined (for example, clicks with orders, or profiles with events), traditional Flink joins maintain large buffers and join state for both sides. Each operator keeps keyed data until matching records arrive. Over time, this state grows unbounded, checkpointing slows, recovery takes longer, and jobs become fragile.

Most use cases don’t require re-joining the entire history every time a record changes. Instead, they only need incremental updates and small deltas that reflect changes on either side of the join. Conventional stream-stream joins, however, treat every incoming event as potentially requiring a full re-join, which leads to excessive memory and CPU consumption.

How Delta Joins Work


Figure 4: The Lookup Join

Delta Joins address this problem with a bidirectional lookup join approach, provided by Apache Fluss.

  • Instead of buffering full streams on both sides, each incoming record probes the other stream’s table stored in Fluss by key.
  • When a record arrives from stream A, it looks up the relevant key in stream B’s table, and when a record arrives from stream B, it does the reverse.
  • Joined results are stored in Fluss tables rather than in operator buffers.

This preserves the full semantics of a stream-stream join. Both sides can trigger outputs, while offloading state management to Fluss. Operators remain lightweight, and all state is maintained efficiently in the Fluss KVStore. Because Fluss is integrated into the streaming layer, there’s no need to manage or scale an external system. In addition, all the above is achieved by maintaining both indexes and local caches for even faster hot key access.
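The bidirectional probe can be modeled in a short Python sketch. This is illustrative only: the two dicts stand in for indexed Fluss tables, while real Delta Joins add changelogs, indexes, and local caches for hot keys.

```python
class DeltaJoin:
    """Toy bidirectional lookup join: each side is an indexed table
    keyed by the join key; an arriving record probes the *other* side's
    table instead of being buffered in operator state."""

    def __init__(self):
        self.left = {}   # stands in for stream A's Fluss table
        self.right = {}  # stands in for stream B's Fluss table

    def on_left(self, key, row):
        self.left[key] = row             # upsert into storage, not operator state
        other = self.right.get(key)      # probe the other side by key
        return (row, other) if other is not None else None

    def on_right(self, key, row):
        self.right[key] = row
        other = self.left.get(key)
        return (other, row) if other is not None else None
```

Either side can trigger a joined output, preserving stream-stream join semantics, while the operator itself holds no keyed buffers.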

Results: Performance and Scalability Gains

The typical baseline is a stream-to-stream join built on Apache Kafka and Flink. By externalizing state into Apache Fluss, Delta Joins let Flink perform incremental, delta-based joins without maintaining large in-memory operator state.

To see this in action, the benchmark results below come from the online e-commerce platform Taobao and its Fluss production practice. You can read more here.

As depicted below, this architecture brings dramatic improvements:

  • Flink CPU and memory usage are reduced by up to 85%.
  • Over 100 TB of operator state is eliminated in production deployments.
  • Checkpoint durations cut from 90 seconds to just 1 second.
  • Instant job startup with no state bootstrapping required.

Figure 5: Kafka and Flink Results vs. Fluss and Flink at Taobao

Because all state lives natively in Fluss, pipelines remain lightweight and fully fault-tolerant. Delta Joins transform what used to be a state-heavy, resource-intensive operation into a scalable, zero-state streaming primitive, enabling high-throughput real-time analytics without the traditional overhead.

Note: More work is coming, such as multi-way streaming joins, which are part of the project plan, along with improvements to the current Delta Join implementation. Visit the Fluss project future plan for details.

Streaming Lookup Joins: Updating Dimension Tables

Next, let’s take a moment to see what other features and benefits are available in Fluss. Although not directly related to zero-state streaming, there are scenarios where high-volume streams must be enriched by querying external systems, such as operational databases or key-value stores. In Kafka-centric architectures, users typically face two options: either query an operational database, which can introduce undesirable load and latency, or deploy and maintain an external key-value store to achieve fast access. Both approaches bring operational complexity, and in some cases, synchronization issues may arise. These considerations are beyond the scope of this blog, but you can find more information here.

With Apache Fluss, users can leverage streaming lookup joins on updating dimension tables using the system’s primary key table. In this setup, there is no need to deploy or scale an external database to handle lookups. This eliminates the operational overhead of managing caches, connectors, replication, or consistency guarantees across systems. The dimension table is co-located with the stream, maintained in Fluss’s primary key table, and backed by an immutable changelog that can be deterministically replayed.
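The enrichment pattern itself is simple to sketch in Python. This is a toy model: the dict stands in for Fluss’s primary key table, which in practice is an updating, replayable table co-located with the stream rather than an external store.

```python
def lookup_enrich(events, dim_table, key_field):
    """Toy streaming lookup join: each fact event probes an updating
    dimension table (a dict standing in for a primary-key table) and
    is emitted merged with the matching dimension row."""
    for event in events:
        dim_row = dim_table.get(event[key_field], {})  # lookup by primary key
        yield {**event, **dim_row}                      # enriched output

# The dimension table can be updated at any time; later lookups see the new values.
products = {"p1": {"name": "Widget", "category": "gadgets"}}
orders = [{"product_id": "p1", "qty": 2}]
enriched = list(lookup_enrich(orders, products, "product_id"))
```

Because the dimension side is an updating table, a changed row is reflected in all subsequent lookups with no cache invalidation or replication to manage.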


Figure 6: Streaming Lookup Join

In traditional external KV-store setups, scaling enrichment performance typically requires provisioning read replicas, sharding, or other database-level scaling strategies to sustain high query throughput. Each additional replica or shard increases operational overhead and may introduce further latency due to network round-trip times and consistency requirements.

In contrast, Fluss scales enrichment horizontally by adjusting the number of buckets in the primary key table. While network I/O between Flink operators and the Fluss KVStore still exists, scaling is simplified: instead of managing an external database cluster, you scale the table buckets, and Fluss can match throughput requirements. This makes enrichment linearly scalable with traffic, constrained primarily by streaming throughput rather than the capacity of a backing database.
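Bucket-based scaling boils down to deterministic key routing. The sketch below is a hypothetical illustration: CRC32 is a stand-in hash, and Fluss’s actual bucketing function may differ, but the principle is the same: each key maps to exactly one bucket, and throughput scales by raising the bucket count.

```python
import zlib

def bucket_for(key: str, num_buckets: int) -> int:
    """Deterministically route a primary key to one of N table buckets.
    CRC32 is an illustrative stand-in for the real bucketing function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# Scaling enrichment throughput means raising num_buckets,
# not provisioning read replicas or shards of an external database.
assignments = {k: bucket_for(k, 8) for k in ("c1", "c2", "c3", "c4")}
```

Every lookup for a given key always lands on the same bucket, so buckets can be served independently and in parallel.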

A typical question is: "How many queries per second can Fluss perform on an updatable dimension table?" Fluss is currently running in production at Alibaba, whose latest shared production numbers highlight 500k queries per second on a single table.


Figure 7: Alibaba’s Fluss Production Numbers

Why Streaming Joins Are Critical for Agentic AI

Agentic AI systems that make decisions and take actions in real time depend on live, unified context. Whether it’s a customer service agent handling a refund, a fraud detection system flagging suspicious activity, or a personal assistant managing appointments, these agents need an up-to-the-second view of the world. Our original use case of a Customer 360 view also relies on instant real-time data in order to feed agents enough context and data to respond autonomously and appropriately.

The challenge is that relevant data comes from multiple continuous streams. This might include orders, support interactions, loyalty and preference updates, user behavior, and inventory changes. Without real-time merging, each stream remains isolated, and the AI agent might make decisions based on incomplete or outdated information.

Streaming joins solve this problem by combining these streams into unified, live views. For example, consider a fraud detection agent monitoring transactions. A new payment arrives, a recent support ticket opens, and the user’s loyalty tier changes, all in seconds. Streaming joins allow the agent to see all of this in context instantly, enabling accurate, autonomous decisions. Without them, the agent is processing streams separately, which introduces delays, inconsistencies, or errors.

The importance of streaming joins grows with complexity. Multi-step interactions, such as a conversation where the agent needs to reference prior actions, amplify the need for fresh, integrated context at each step. Each millisecond of delay compounds, impacting both accuracy and responsiveness.

In short, streaming joins are the backbone of real-time context for agentic AI. They enable agents to act with up-to-the-moment intelligence, turning disparate data streams into an actionable, coherent understanding.

Note: There is ongoing work on zero-state streaming joins, including multi-stream delta joins. Keep an eye on the Fluss future project plan.

Next Step: Zero-State Streaming Aggregations

After tackling zero-state streaming joins, the next frontier is streaming aggregations. Just like joins, traditional Flink aggregations accumulate large amounts of keyed state inside operators. Examples of this include sums, counts, averages, or more complex column-wise merges.

The next step is to move the aggregation state out of Flink operators and into the storage layer, using an aggregation merge engine. This continues the zero-state philosophy: decoupling compute from state, letting Flink focus on transformations while Fluss manages aggregation deterministically and efficiently.

How Fluss Handles Aggregations

Similar to Apache Paimon, Fluss also introduces the concept of the Aggregation Merge Engine. In simple terms, a merge engine decides how multiple records should be merged based on the same primary key. With the aggregation merge engine:

  • Each stream writes incremental updates to a primary-key table.
  • The storage layer applies aggregation functions per column, merging updates automatically.
  • Flink operators remain lightweight, no longer buffering full state or performing heavy merges in memory.

This mirrors how Fluss offloads join state: Flink emits updates, Fluss handles merging, and the system becomes both scalable and queryable in real time.
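The merge-engine idea can be sketched in plain Python. This is a toy model, not the Fluss implementation: each column carries its own merge function, and the storage layer, not the operator, folds incoming updates into the stored row.

```python
def make_agg_table(column_aggs):
    """Toy aggregation merge engine: column_aggs maps column name to a
    merge function (sum, max, ...). Columns without a function default
    to last-value-wins. Returns (table, upsert)."""
    table = {}

    def upsert(key, row):
        current = table.setdefault(key, {})
        for col, value in row.items():
            if col in current:
                merge = column_aggs.get(col, lambda old, new: new)
                current[col] = merge(current[col], value)
            else:
                current[col] = value
        return current

    return table, upsert

# Hypothetical per-column aggregation for a customer metrics table.
table, upsert = make_agg_table({
    "order_count": lambda old, new: old + new,  # running sum
    "last_seen": max,                           # keep latest timestamp
})
upsert("c1", {"order_count": 1, "last_seen": 10})
merged = upsert("c1", {"order_count": 2, "last_seen": 7})
```

The writer only ever emits increments; the merged, query-ready totals live in the table, so the compute side holds no aggregation state at all.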


Figure 8: How the Aggregation Merge Engine Works

With zero-state aggregations, Flink operators become compute-light. No longer storing intermediate totals or buffering large state, Flink can focus entirely on transformations. As a result, the architecture becomes highly scalable and predictable, as adding new keys or streams no longer risks memory blow-up. Recovery and checkpoints are faster since the aggregation state is already persisted and merged in the storage layer. Aggregated results are query-ready immediately, enabling downstream analytics or AI pipelines to consume fresh data without delay. In essence, this approach extends zero-state principles from joins to aggregations, decoupling computation from state and creating a streaming architecture that scales predictably.

Fluss Combined Usage Best Practices

State management has long been the bottleneck in real-time stream processing. Traditional Flink streaming joins require operators to maintain large states, while connecting to external systems introduces latency, network overhead, and operational complexity. This combination often leads to high memory usage, slow checkpoints, long recovery times, and scaling challenges.

Apache Fluss addresses these issues by externalizing all state into its integrated streaming key-value store (KVStore). Streaming joins no longer rely on massive local state, as lookup joins access the KVStore directly, eliminating the need for external databases while ensuring low-latency, transactional consistency. In addition, Delta Joins perform incremental, bidirectional updates efficiently without rebuilding the full join state.

The full potential of Fluss is realized when these patterns are used together. Partial updates consolidate entity state, streaming lookup joins enrich high-volume streams with real-time context, and Delta Joins apply incremental changes directly on the stored tables. By offloading all state management to Fluss, this combination delivers low-latency, queryable views, minimal in-memory operator state, and scalable real-time pipelines capable of supporting analytics and AI workloads at enterprise scale.

Key Takeaways (Zero-State Benefits)

  • Lightweight operators and efficient resource use: State is externalized to streaming storage, allowing operators to run with minimal local state and significantly reducing Flink’s memory and CPU footprint.
  • Fast checkpoints: Minimal operator state means checkpoints complete quickly and consistently.
  • Instant restarts and rescaling: Jobs recover immediately without rebuilding or reloading state.
  • Queryable state: Live streaming state can be directly accessed for debugging, analytics, or AI model features without impacting running jobs.


More Resources

Introducing The Era of Zero-State Streaming Joins
