Stop Recomputing Everything: The Case for Streaming Lakehouses
A risk manager at a major bank watches a corporate client draw hard on their credit line all morning. Multiple large transfers. A spike in card activity. A missed loan payment. The risk dashboard should reflect this escalating exposure right now. It does not. It pulls from last night's batch run. By the time the system catches up, the exposure has already compounded.
This is not an edge case. It is the default outcome of how most data platforms operate today.
And it is about to get worse. Organisations deploy AI models for fraud detection, credit scoring, anomaly detection, and regulatory compliance. Every model is only as reliable as the data it reasons over. A fraud model running on data that is eight hours stale does not detect fraud. It detects history. A compliance engine fed by last night's batch run creates an 18-hour blind spot. Violations accumulate undetected. The architecture that feeds AI is as important as the model itself. Stale data in means unreliable AI out. No model sophistication compensates for that.

Data platforms face three demands. Lower end-to-end latency across every pipeline layer. Cost-efficient streaming without sacrificing freshness. Stream-batch unification on one consistent architecture. Batch cannot satisfy all three. The Streamhouse™ does.
Apache Paimon, a streaming-native table format built on a Log-Structured Merge-tree (LSM-tree) model, runs this architecture in production today. Apache Fluss™ extends it further. A streaming storage system built for real-time analytics and AI workloads. Fluss decouples the streaming layer from the table format. The same architecture works with Apache Iceberg, Paimon, Lance, or (soon) Hudi. The result: a fully open, unified platform. Streaming and batch are not parallel systems to reconcile. They are a single architecture expressed at different points on a freshness continuum.
This post examines the Streamhouse through the lens of open table formats, which deliver a low-cost path to near-real-time latency. A separate post covers the full architecture, with Apache Fluss extending to Apache Iceberg.

Background: What a Data Lakehouse Actually Is
The data lakehouse emerged from two decades of architectural evolution.
Data warehouses offered structure and reliability. They lacked flexibility and scale. Data lakes offered cheap, flexible storage. They lacked governance and query reliability. The lakehouse unified both. Warehouse guarantees applied to lake scale and openness. Store everything. Query anything. Trust the results.
Apache Iceberg, Delta Lake, and Apache Hudi made this practical. They brought ACID transactions, schema enforcement, and time-travel capabilities to object storage. The lakehouse became the dominant architecture for modern data platforms.

Most implementations carried one constraint forward from the batch era.
The Core Limitation: Batch Lakehouses Run on Stale Data
Most lakehouses today use the medallion architecture. Three logical layers at increasing levels of refinement:
- Bronze: Raw, unprocessed data from source systems.
- Silver: Cleaned, enriched, joined datasets built from Bronze.
- Gold: Curated, aggregated business metrics built from Silver.
The structure is sound. The update mechanism is the problem.
Each layer refreshes through scheduled batch jobs. Raw data accumulates. At a fixed interval, hourly or daily, a job reads everything, transforms it, and writes it downstream. Then the next job does the same. By the time a business event propagates from Bronze to Gold, hours have passed.
Three Compounding Consequences
1. Delayed insights. Batch pipelines introduce a fixed lag between event and reflection. If Silver runs every four hours and Gold runs daily after it, an event that just misses a cycle takes more than a full business day to reach Gold. Decision-makers operate on a historical view. Not the current one.
2. Expensive, redundant recomputation. A batch job that updates total outstanding loan balance rescans the entire historical dataset. It recomputes the aggregate from scratch. The number of new records arriving does not change the work done. A dataset holds 90 days of transaction history. 1,000 new records land today. The batch job still reads and processes all 90 days. Datasets grow. Overhead grows with them. A pipeline that ran in two hours in year one runs in eight hours by year three. Same schedule. Same logic. Four times the cost.
3. Orchestration overhead and fragility. Batch pipelines require schedulers, dependency chains, retry logic, monitoring, and engineers to maintain all of it. A failure in the Bronze job blocks Silver. Silver blocks Gold. Gold blocks every downstream dashboard and application. More pipelines mean a larger, more brittle dependency web.
Delayed insights produce decisions based on yesterday's data. Redundant recomputation drives pipeline costs up every year with no corresponding gain. A single scheduler failure propagates to every consumer.
Most "Real-Time" Platforms Are Not Real-Time
Most platforms marketed as real-time are micro-batch systems dressed in streaming language. They ingest events quickly. They still process them in small scheduled windows. The marketing says streaming. The architecture says batch with a shorter interval.
The distinction matters. A fraud signal that arrives 15 minutes late is not real-time. It is a fast batch job. A compliance alert that fires after the trading window closes is not monitoring. It is an audit finding. A credit risk score computed on a four-hour-old snapshot is not dynamic. It is yesterday's answer delivered slightly faster.
Real-time means the system processes each event as it arrives, updates state incrementally, and propagates changes downstream within seconds. Not minutes. Not "near real-time with an asterisk". Seconds.
Any architecture that accumulates data before processing operates in batch mode. Frequency does not change that.
The Streamhouse: Continuous Data Flow Across All Layers
The Streamhouse replaces periodic batch processing with continuous, incremental pipelines.
Events flow through Bronze, Silver, and Gold as they arrive. Each new event triggers only the computation required to incorporate that change. The system maintains running state. It applies incremental updates. The difference: a database that processes individual transactions, versus one that reloads its contents from backup every night.
Raw events, enriched datasets, and business metrics evolve in near real time. They reflect the current state of source systems.
The Data Continuity Pattern: How the Layers Stay in Sync
Data continuity is the defining concept. Each layer of the medallion architecture stays automatically and continuously in sync with the layers below it as events flow through the system.
In a batch architecture, synchronisation is a scheduled, manual concern. If the Silver job has not run today, Silver is stale. Gold is also stale. In a Streamhouse, synchronisation is structural. Changes propagate downstream automatically, layer by layer, as they arrive. Four patterns make this work.

Pattern 1: Event-Driven Layer Propagation
Each layer listens for changes from the layer below. It does not wait to be told when to run. A new event in Bronze triggers Silver pipelines immediately. When Silver updates, Gold responds. Updates cascade automatically. No scheduler. No manual trigger. No accumulated wait time.
Computation speed determines latency. Not cron frequency.
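A minimal sketch makes the "no scheduler" point concrete. Using Apache Flink's Python Table API as one possible engine, two long-running INSERT statements wire Silver to Bronze and Gold to Silver. Table names, columns, and the surrounding catalog setup (including the changelog configuration an intermediate table would need in a real deployment) are illustrative assumptions. The point: both statements run continuously, so a row landing in Bronze cascades downstream with no cron trigger.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming mode: each INSERT below is a long-running job, not a scheduled run.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumes bronze_events, silver_customer_state and gold_exposure already exist
# in a streaming-capable table format such as Apache Paimon.

# Bronze -> Silver: picks up each new Bronze row the moment it lands.
t_env.execute_sql("""
    INSERT INTO silver_customer_state
    SELECT customer_id,
           SUM(amount)     AS balance_delta,
           MAX(event_time) AS last_seen
    FROM bronze_events
    GROUP BY customer_id
""")

# Silver -> Gold: reacts to every Silver change, no cron in between.
t_env.execute_sql("""
    INSERT INTO gold_exposure
    SELECT COUNT(DISTINCT customer_id) AS active_customers,
           SUM(balance_delta)          AS net_exposure
    FROM silver_customer_state
""")
```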
Pattern 2: Streaming Materialised Views
Traditional data lake file formats were designed for immutable writes. You append new data. Updating an existing record meant rewriting entire files or partitions. Continuous, record-level updates were prohibitively expensive.
A Streamhouse built on streaming table formats like Apache Paimon operates differently. Incoming changes land in fast in-memory buffers and flush to sorted, compact files. Background compaction merges them over time. High-frequency updates are a first-class operation. Not a workaround.
When a customer's credit utilisation changes, the change propagates immediately. No surrounding data gets rewritten. No read-time debt accumulates. Incremental, high-frequency updates across all three lakehouse layers become practical.
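As a hedged illustration of what record-level updates look like with Apache Paimon as the table format: a table declared with a primary key treats every write to an existing key as an upsert, absorbed by the LSM-tree write path rather than a file or partition rewrite. The warehouse path, column names, and values below are assumptions made for the sake of the example.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog on object storage (warehouse path is illustrative).
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'paimon',
        'warehouse' = 's3://my-bucket/warehouse'
    )
""")
t_env.execute_sql("USE CATALOG lakehouse")

# Primary-key table: a write to an existing customer_id is an upsert,
# absorbed by the LSM write path instead of a file or partition rewrite.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS silver_credit_utilisation (
        customer_id BIGINT,
        utilisation DOUBLE,
        updated_at  TIMESTAMP(3),
        PRIMARY KEY (customer_id) NOT ENFORCED
    )
""")

# One customer's utilisation changes: only that key is rewritten.
t_env.execute_sql("""
    INSERT INTO silver_credit_utilisation
    VALUES (42, 0.87, TIMESTAMP '2024-05-01 09:30:00')
""")
```

In a batch-era format the same change would mean rewriting the affected files; here it is one more entry in the write buffer, merged later by background compaction (Pattern 4).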
Pattern 3: Change Data Capture (CDC) at the Source
Data continuity begins upstream of the lakehouse itself. Change Data Capture (CDC) streams database changes (inserts, updates, deletes) out of operational systems the moment they occur. No bulk exports. No schedules.
CDC converts Bronze from a periodic data dump into a continuously accurate mirror of source systems. Bronze receives a precise, ordered log of every change as it happens. This removes the first and largest source of latency in traditional lakehouse pipelines.
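One common way to implement this is a Flink CDC source. The sketch below declares one over a hypothetical MySQL accounts table; the hostname, credentials, database, and table names are placeholders, and the MySQL CDC connector jar is assumed to be on the classpath. The source emits an initial snapshot, then every subsequent insert, update, and delete as it happens.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# CDC source over an operational MySQL table: an initial snapshot, then every
# insert, update and delete streamed from the binlog as it happens.
# Hostname, credentials, database and table names are placeholders.
t_env.execute_sql("""
    CREATE TABLE accounts_cdc (
        account_id   BIGINT,
        customer_id  BIGINT,
        balance      DECIMAL(18, 2),
        credit_limit DECIMAL(18, 2),
        PRIMARY KEY (account_id) NOT ENFORCED
    ) WITH (
        'connector'     = 'mysql-cdc',
        'hostname'      = 'core-banking-db.internal',
        'port'          = '3306',
        'username'      = 'cdc_reader',
        'password'      = '******',
        'database-name' = 'bank',
        'table-name'    = 'accounts'
    )
""")
```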
Pattern 4: Continuous Compaction and Schema Propagation
High-frequency streaming writes produce many small files. Good for freshness. Bad for query performance over time. The Streamhouse runs continuous background compaction. Small files merge into optimally sized ones without interrupting incoming updates or downstream queries.
Schema propagation matters too. When a source schema changes (a new field, a revised data type), that change flows downstream to Silver and Gold automatically. No manual pipeline updates. No downstream failures. In batch architectures, schema changes halt pipelines and require coordinated intervention. In a Streamhouse, the system handles them by design.
These four patterns ensure Bronze, Silver, and Gold are not independent tables that share a naming convention. They are a single, continuously evolving dataset viewed at different levels of refinement. Always in sync. Always current.
Concrete Example: Building a Real-Time Financial Platform
Take a major commercial bank. One of the most demanding environments for any data architecture.
The bank's systems generate a constant stream of events. Payment transactions, card activity, account updates, money transfers, stock market data, mobile banking interactions. Thousands to millions of events per minute.

Bronze: Continuous Event Ingestion
Bronze operates as a continuous receiver, driven by CDC. Events arrive from operational systems in real time. Loan drawdowns, card transactions, balance updates, credit limit revisions. They arrive as append-only event streams or CDC logs that capture every source database change in order. Upserts keep Bronze as a continuously accurate reflection of source systems. Updated within seconds of each change.
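Continuing the earlier sketches, Bronze ingestion then reduces to a single long-running statement: a continuous upsert from the CDC source into a Paimon primary-key table. The names are illustrative and assume the CDC source and Bronze table from the previous sketches are already registered in the session.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assumes accounts_cdc (the CDC source sketched under Pattern 3) and
# bronze_accounts (a Paimon primary-key table, as under Pattern 2) are
# already registered in this session.
# This INSERT never terminates: every captured change is upserted into
# Bronze within seconds, keeping it a live mirror of the source table.
t_env.execute_sql("""
    INSERT INTO bronze_accounts
    SELECT account_id, customer_id, balance, credit_limit
    FROM accounts_cdc
""")
```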
Silver: Real-Time Customer and Risk Intelligence
Silver turns raw events into structured, enriched datasets. Event-driven propagation makes the difference here.
The moment a new event lands in Bronze, Silver pipelines process it. The pipeline maintains a continuously updated financial health profile for every customer. That profile is the foundation of dynamic credit risk assessment.
When a significant withdrawal event propagates from Bronze to Silver, the pipeline:
- Retrieves the customer's current state: live credit utilisation, outstanding balances, upcoming payment schedule, recent transfer history. Maintained as an incrementally updated profile.
- Enriches the event with contextual signals: compliance flags, Know Your Customer (KYC) status, sanctions checks, relevant market indicators.
- Recalculates risk metrics: debt-to-income ratio, credit utilisation rate, liquidity buffer.
- Writes the updated profile. Immediately available to Gold and downstream applications.
A single event triggers the process. Only changed data gets updated.
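A minimal sketch of that Silver step, again using Flink's Python Table API and assuming a bronze_transactions event stream and a Paimon-backed silver_customer_profile table keyed on customer_id already exist (table names, columns, and metrics are illustrative): the continuous aggregation maintains one row per customer, and a new withdrawal in Bronze updates only that customer's row.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Continuously maintained per-customer profile. Flink keeps the running
# aggregates as state, so one new Bronze event updates exactly one row.
t_env.execute_sql("""
    INSERT INTO silver_customer_profile
    SELECT customer_id,
           SUM(amount)                                                    AS net_cash_flow,
           SUM(CASE WHEN txn_type = 'WITHDRAWAL' THEN amount ELSE 0 END)  AS total_withdrawals,
           COUNT(*)                                                       AS event_count,
           MAX(event_time)                                                AS last_activity
    FROM bronze_transactions
    GROUP BY customer_id
""")
```

A real profile would carry more signals (compliance flags, KYC status, risk ratios), but the shape is the same: one continuously running query, one incrementally maintained row per customer.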
The practical outcomes:
- Dynamic credit decisions. Approve or adjust credit limits against the customer's actual financial position at the time of the request. Not yesterday's snapshot.
- Proactive risk alerts. Surface customers approaching critical risk thresholds as their behaviour evolves throughout the day. Relationship managers act before exposure compounds.
- Continuous compliance monitoring. Flag unusual activity as it happens. Not in the next morning's report.
Gold: Continuously Maintained Business Metrics
Gold is where dashboards, risk applications, regulatory reporting tools, and executive teams consume data.
In a batch architecture, Gold updates at most once per day. Portfolio risk dashboards reflect yesterday's positions. For a bank managing large credit exposures, the gap between current data and day-old data is a risk and compliance gap.
In a Streamhouse, Gold aggregates update incrementally as events flow through Silver. The compute model is fundamentally different.
Yesterday, 90 loan transactions were processed. Net outstanding balance: -€100M. Today, 10 new transactions arrive totalling +€5M.

Batch approach: Load all 100 transactions. Sum from scratch. Result: -€95M. Batch scans the same 90 historical records again, as it did yesterday and the day before.
Streaming approach: Retrieve existing state (-€100M). Apply today's delta (+€5M). Result: -€95M. No historical rescan. Processing time scales with the 10 new transactions. Not the 100 total.
The result is identical. Streaming costs far less compute. At production scale, where datasets span years and aggregates cover millions of records, the difference translates directly into infrastructure cost and pipeline run time. A batch job rescanning 500GB of historical data daily costs the same every day it runs. A streaming pipeline processing 5GB of daily incremental changes costs a fraction of that. That fraction does not grow as the historical dataset grows.
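The compute model is easy to restate in a few lines of plain Python. This is only an illustration of the arithmetic, not of any engine's internals; the per-record amounts are synthetic, chosen so the history nets to the figures above.

```python
# Worked example in plain Python. The per-record amounts are synthetic,
# chosen only so the history nets to -100M EUR, as in the text.

# 90 historical transactions netting to -100,000,000 EUR.
history = [-2_600_000] * 50 + [750_000] * 40
assert len(history) == 90 and sum(history) == -100_000_000

# 10 new transactions arriving today, totalling +5,000,000 EUR.
new_transactions = [500_000] * 10

# Batch model: rebuild the aggregate from scratch -> touches all 100 records.
batch_result = sum(history + new_transactions)

# Streaming model: keep yesterday's result as state, apply only the delta
# -> touches the 10 new records.
state = -100_000_000
streaming_result = state + sum(new_transactions)

assert batch_result == streaming_result == -95_000_000
```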
Credit exposure, portfolio risk distributions, liquidity gap metrics, and loan performance indicators all reflect the bank's actual position at the current moment. Not the prior close of business.
The Cost Case for Streaming on the Lakehouse
Streaming lakehouses cost less to operate. Total Cost of Ownership (TCO) is structurally lower than in a batch architecture.
Compute scales with change, not with history. In a batch system, pipeline cost is a function of dataset size. Dataset size only grows. In a streaming system, compute cost is a function of incoming change volume. Change volume is bounded. Historical dataset size is not. An organisation with three years of transaction history pays the same streaming compute cost per day as it did in year one. A batch pipeline rescanning that full history does not.

Infrastructure is right-sized and consistent. Batch architectures require large compute clusters that spin up, process everything at maximum speed, and spin down. The job must finish within the batch window. Peak-load sizing becomes mandatory. Streaming architectures run continuously at a lower, steadier resource level. Predictable to budget. Cheaper to operate.
Operational overhead drops. The four data continuity patterns each remove distinct categories of operational cost: orchestration infrastructure, manual schema migration, bulk export jobs, and engineering hours spent recovering from pipeline dependency failures.
The primary cost factors unique to streaming are 24/7 infrastructure uptime and background compaction. Modern streaming engines handle both in production. At meaningful data volumes, these costs are a fraction of the compute savings from removing historical rescans.
Sovereignty and Compliance: Architecture Is Not Neutral
For regulated industries, the question of where data flows is not only a technical decision. It is a regulatory one.
The Digital Operational Resilience Act (DORA) requires financial institutions to demonstrate operational resilience in real time. The General Data Protection Regulation (GDPR) mandates control over where personal data resides and how it moves. MiFID II demands transaction reporting with strict timeliness requirements. Basel IV tightens capital adequacy calculations that depend on current exposure data. Not yesterday's snapshot.
A batch architecture creates compliance blind spots. An 18-hour gap between event and detection is not an operational inconvenience. It is a regulatory liability. Streaming closes that gap. The streaming platform itself must also meet the sovereignty bar.
Ververica is the only enterprise streaming platform built in Europe for European compliance demands. Our architecture satisfies the strictest residency and sovereignty requirements anywhere. Your data stays under your control. In your jurisdiction. Beyond the reach of foreign surveillance.
Ververica's platform supports multiple deployment models. Managed cloud. Private Virtual Private Clouds (VPCs). Air-gapped Kubernetes environments. Zero-Trust implementations. This structural approach aligns with the governance demands of DORA, GDPR, and MiFID II.
Most US hyperscaler-native streaming platforms cannot make that guarantee. For a European bank, a Middle Eastern sovereign wealth fund, or any institution where data residency is a board-level concern, the architecture choice is a jurisdictional choice. The Streamhouse on Ververica satisfies both.
What the Streamhouse Runs
Continuous data flow plus in-sync layers opens a different category of applications. Not faster versions of existing capabilities. Products that batch pipelines cannot produce at all.
Fraud detection systems that operate on current data. Credit risk models that score against the customer's actual position, not a day-old approximation. Compliance engines that monitor in real time, closing the blind spot that batch architectures create by design. Recommendation engines that reflect what is happening now. Portfolio rebalancing systems that act on live market conditions. AI agents that reason over the actual state of the world. Not yesterday's snapshot.
AI models are only as good as the data underneath them. The Streamhouse keeps that data current, governed, and continuously propagated. That is what makes real-time AI real.

Conclusion: The Lakehouse Evolves
The lakehouse architecture solved real limitations of traditional data warehouses and first-generation data lakes. Organisations now depend on near-real-time intelligence for risk management, compliance, AI, and customer operations. The batch-first lakehouse has reached its limits.
The Streamhouse extends the lakehouse model. Same layered structure. Same reliability guarantees. Continuous data flow and automated layer synchronisation replace the rigid batch cycle. Bronze, Silver, and Gold become a living system. Always current. Always consistent. Always cascading. The four data continuity patterns hold them together.
Credit exposure, portfolio risk, liquidity metrics. All reflecting current positions. AI models reasoning over governed, real-time data. Not stale snapshots dressed in dashboards. That is the competitive edge.
Real-World Streaming Lakehouse Deployments
These patterns are not theoretical. Organisations across industries have adopted Streamhouse architectures to address exactly these challenges. The following talks provide concrete implementation perspectives from engineering teams who built and operate these systems at scale:
- Yelp: Real-time business intelligence across its marketplace.
- Fresha: Real-time booking and operational data for the world's largest beauty and wellness marketplace.
- Unity3D: High-volume game telemetry and analytics at scale.
- Lalamove: Real-time operations and driver matching for on-demand logistics.
- TikTok: Petabyte-scale streaming data pipelines for content recommendations and platform analytics.
- Shopee: Real-time inventory, pricing, and fraud signals across one of the largest e-commerce platforms.
Apache Flink® and Apache Fluss™ are trademarks of the Apache Software Foundation. Streamhouse™ is a trademark exclusively licensed to Ververica GmbH.


