What is Apache Fluss™?
Apache Fluss is an open-source, unified streaming storage layer designed to optimize real-time data processing with Apache Flink®. It bridges the gap between streaming and analytical storage by providing a single, high-performance platform for both real-time updates and historical queries on data.
To get immediate insights from ever-growing data volumes, modern enterprises face a constant challenge: how to efficiently store and query vast streams of information in real time while retaining the ability to perform deep historical analysis. Traditional architectures often consist of disparate systems for messaging, stream processing, and long-term storage. These separate systems introduce complexity, latency, and significant operational overhead.
This is precisely the gap that Apache Fluss aims to close. Fluss is an open-source project that acts as a unified streaming storage layer, built to support next-generation data analytics and to revolutionize real-time data processing with Apache Flink.
Fluss stores data in a columnar format rather than a row-based log format, filling a gap in streaming technologies. It is a streaming storage layer that resembles a data warehouse while also being native to Flink. It allows decisions based on both historical and current data at the millisecond level because you don’t have to re-read the entire data set each time to get the relevant data out.
Why Fluss and Columnar Streaming Reads Are Essential for Analytics
Fluss supports columnar streaming reads, retrieving data by column instead of by row. This approach significantly boosts performance in analytics scenarios in several ways:
- Targeted Data Access: Only the necessary columns are read, reducing data transfer and processing time. For example, in a dataset with 50 columns, if analytics only require 3, columnar reads avoid loading irrelevant data.
- Optimized Compression: Columnar formats store similar data types together, enabling better compression ratios and faster decompression during reads.
- Accelerated Query Speed: Processes analytics queries like aggregations and filters directly on columns, leading to significant speedups compared to row-based processing.
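The three points above can be illustrated with a small, self-contained sketch. It does not use any Fluss API. The 1,000-row, 50-column dataset, the column names c0 to c49, and the helper functions are all invented for illustration, and run-length encoding stands in for whatever compression a real columnar format applies.

```python
from itertools import groupby

# Hypothetical dataset: 1,000 rows x 50 columns (c0..c49); all values are made up.
NUM_ROWS, NUM_COLS = 1000, 50
rows = [
    {f"c{j}": (i // 100 if j == 0 else i * j) for j in range(NUM_COLS)}
    for i in range(NUM_ROWS)
]

# Columnar layout: one contiguous list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def cells_touched_row_layout(rows):
    """A row-oriented scan passes over every field of every row."""
    return sum(len(row) for row in rows)


def cells_touched_column_layout(columns, wanted):
    """A column-oriented scan only touches the requested columns."""
    return sum(len(columns[name]) for name in wanted)


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


wanted = ["c0", "c1", "c2"]
row_cells = cells_touched_row_layout(rows)             # targeted access: baseline
col_cells = cells_touched_column_layout(columns, wanted)
encoded_c0 = run_length_encode(columns["c0"])          # compression on one column
total_c1 = sum(columns["c1"])                          # aggregation on one column

print(row_cells, col_cells, len(encoded_c0), total_c1)
```

In this toy setup the column scan touches 3,000 cells instead of 50,000, and the low-cardinality column c0 collapses into 10 runs. Real gains depend on the data and on the storage format.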
The Evolution of Data Storage: Why Fluss is Essential
For years, the backbone of event-driven architecture and data stream processing has largely relied on message queues like Apache Kafka®. However, these systems are not designed to serve as a primary analytical storage layer. This limitation creates a significant bottleneck for applications demanding both high-throughput writes and low-latency analytical reads directly on streaming data. Challenges include:
- High Network Costs and Bottlenecks: Kafka's architecture often necessitates extensive data movement across networks, leading to considerable infrastructure costs and performance issues, particularly when scaling for comprehensive real-time analytics.
- Missing Columnar Streaming Storage: There is a gap in the ecosystem for a streaming storage solution that natively supports a columnar format, which is essential for fast analytical queries and efficient data compression. Existing solutions struggle to adapt to these rigorous needs.
- Complex and Disconnected Pipelines: Integrating message queues, stream processing frameworks like Apache Flink, and separate OLAP (Online Analytical Processing) systems for analysis often results in complex, multi-layered pipelines that are difficult to manage, debug, and optimize.
Recognizing these longstanding challenges, a team of experts with deep knowledge of Apache Flink began work on a new project to create a dedicated streaming storage layer tuned for today's stream processing demands. This effort produced Fluss, which is currently an incubating project at the Apache Software Foundation.
Fluss is named after the German word for 'river,' symbolizing the continuous data flow it's built to handle. Fluss marks a significant advance toward a fully unified platform for both batch and streaming data, because it allows for seamless data management from the point of ingestion through to the point of analysis. To learn more, read the blog: "Fluss: Unified Streaming Storage For Next-Generation Data Analytics."
A Unified Streaming Storage Solution
As a high-performance, scalable, and fully integrated storage system, Fluss is designed to power real-time analytics. It combines the best attributes of streaming and analytical storage, providing a unified layer that eliminates the need for separate message queues and additional OLAP systems in many analytical workflows. At its core, Fluss:
- Offers Unified Batch and Stream Processing: Fluss provides a singular platform that handles both batch and streaming data seamlessly. This integration is crucial for optimizing infrastructure for sophisticated AI, ML, and analytical workloads, allowing businesses to perform efficient historical processing alongside live data streams.
- Bridges the Gap for Real-time Analytics: Fluss directly addresses the shortcomings of traditional architectures by offering a storage layer optimized for continuous updates and lightning-fast analytical queries on streaming data. This is particularly beneficial for applications where low latency is paramount.
- Streamlines Data Pipelines: By integrating natively with Apache Flink, Fluss simplifies complex data pipelines. It removes the necessity for intermediate Kafka topics, reducing infrastructure costs, enhancing scalability, and improving overall performance for high-throughput, low-latency analytics.
For use cases that rely on fast and efficient analytics (like powering dashboards, detecting anomalies, or training machine learning models), columnar data formats are essential.
- Kafka’s log-based design is suited for transporting events but falls short when deep analysis or rapid insights are required.
- Systems like Fluss fill this gap by offering columnar streaming reads, enabling immediate, high-performance access to data for analytical workloads while maintaining compatibility with streaming use cases.
In short, while Kafka excels as a message broker for streaming events, its row-based log format limits its utility for analytics. For analytics at scale and speed, a columnar storage and processing layer like Fluss becomes crucial.
The key differentiators Fluss offers are:
- Supports large-scale data in motion and allows that data to flow faster (just like a river)
- Bridges compute and data lakes
- Delivers data faster, with less delay and millisecond-level latency
With Fluss, the line between storing data and making decisions with that data becomes indistinguishable.
Apache Fluss is production-ready: it ran internally with the team that developed it before it became an open-source project, demonstrating its readiness and robust capabilities. Its status as an Apache project underscores its commitment to open-source collaboration and community-driven development, further solidifying its role as a key component of stream processing architectures. This important step was announced in the recent blog: Fluss Is Now Open Source.
Core Features and Advantages of Fluss
Fluss is packed with innovative features that set it apart as a premier streaming storage solution, making stream processing with Apache Flink even more powerful. Some of these features include:
Sub-Second Latency
- What it Does: Delivers sub-second latency for both streaming reads and writes.
- Why it Matters: Critical for time-sensitive applications like monitoring systems and financial platforms, ensuring instant data availability for actionable insights.
Stream-Table Duality with Updates and Changelogs
- What it Does: Supports stream-table duality, providing changelogs for efficient updates and consistent data flow.
- Why it Matters: Enables accurate real-time and historical insights within a unified system.
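As a rough illustration of stream-table duality, the sketch below shows both directions: updates to a keyed table produce a changelog, and replaying that changelog rebuilds the same table. The row kinds (+I, -U, +U, -D) follow Flink's changelog convention. The ChangelogTable class, the materialize function, and the order data are hypothetical and are not part of the Fluss API.

```python
class ChangelogTable:
    """Toy primary-key table that records every change as a changelog entry."""

    def __init__(self):
        self.rows = {}
        self.changelog = []

    def upsert(self, key, row):
        old = self.rows.get(key)
        if old is None:
            self.changelog.append(("+I", key, row))   # insert
        else:
            self.changelog.append(("-U", key, old))   # update-before image
            self.changelog.append(("+U", key, row))   # update-after image
        self.rows[key] = row

    def delete(self, key):
        old = self.rows.pop(key, None)
        if old is not None:
            self.changelog.append(("-D", key, old))   # delete


def materialize(changelog):
    """Replay a changelog into a table (stream -> table direction)."""
    table = {}
    for kind, key, row in changelog:
        if kind in ("+I", "+U"):
            table[key] = row
        elif kind == "-D":
            table.pop(key, None)
        # "-U" carries the previous row image; a keyed replay can skip it.
    return table


# Table -> stream direction: every write below emits changelog entries.
orders = ChangelogTable()
orders.upsert("order-1", {"status": "created", "amount": 40})
orders.upsert("order-2", {"status": "created", "amount": 15})
orders.upsert("order-1", {"status": "paid", "amount": 40})
orders.delete("order-2")

kinds = [kind for kind, _, _ in orders.changelog]
replayed = materialize(orders.changelog)
print(kinds, replayed)
```

A consumer that only sees the changelog ends up with the same rows as the table itself, which is why the same data can serve both streaming and table-style queries.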
Ad-hoc, Interactive Queries
- What it Does: Provides a fully queryable storage layer for direct inspection of data.
- Why it Matters: Simplifies debugging, reduces development complexity, and enables immediate access to live insights without additional processing layers.
Unified Batch and Stream
- What it Does: Seamlessly combines batch and streaming data processing.
- Why it Matters: Optimizes infrastructure for AI, ML, and analytics workloads, allowing smooth transitions between historical and real-time processing.
Projection Pushdown
- What it Does: Optimizes streaming reads by fetching only the required fields.
- Why it Matters: Minimizes data transfer, improves query performance up to 10x, and reduces network costs.
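The effect of projection pushdown on data transfer can be sketched without any real storage system. In the example below, fetch is a hypothetical stand-in for a storage server, JSON stands in for the wire format, and the record fields are invented. The point is only where the projection happens: on the client after transfer, or on the server before it.

```python
import json

# Hypothetical log of 1,000 wide records; field names and values are invented.
records = [
    {"user_id": i, "amount": i % 7, "country": "DE", "payload": "x" * 200}
    for i in range(1000)
]


def fetch(records, projection=None):
    """Stand-in for a storage server answering a read request.

    With a projection, unused fields are dropped before serialization,
    so they never cross the (simulated) network.
    """
    if projection is not None:
        records = [{name: r[name] for name in projection} for r in records]
    return json.dumps(records).encode("utf-8")


# Without pushdown: ship every field, then project on the client.
full_bytes = fetch(records)
client_side = [
    {"user_id": r["user_id"], "amount": r["amount"]}
    for r in json.loads(full_bytes)
]

# With pushdown: the server ships only the two requested fields.
projected_bytes = fetch(records, projection=["user_id", "amount"])
server_side = json.loads(projected_bytes)

print(len(full_bytes), len(projected_bytes))
```

Both paths produce the same result rows, but the pushed-down read transfers far fewer bytes. The actual saving depends on how wide the records are and how few fields the query needs.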
Columnar Streaming Reads
- What it Does: Stores and processes data in a columnar format for streaming reads.
- Why it Matters: Improves compression efficiency and accelerates analytics, making it suitable for high-volume, real-time applications.
Integration with Lakehouses
- What it Does: Supports bi-directional communication with lakehouse tiered storage systems like Apache Paimon and Apache Iceberg.
- Why it Matters: Enables efficient initialization of streaming jobs from batch sources, and ensures seamless synchronization between batch and streaming data.
Seamless State Initialization and Synchronization
- What it Does: Fluss allows a streaming job to load state directly from batch sources.
- Why it Matters: This capability enables seamless state initialization and synchronization between batch and streaming data, providing a truly unified data experience.
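One common way to initialize streaming state from a batch source is to load a snapshot and then apply only the log entries written after it. The sketch below assumes the snapshot records the log offset it was taken at. The snapshot layout, the log entries, and both functions are hypothetical and only illustrate the idea, not how Fluss implements it.

```python
# Hypothetical batch snapshot (for example, a lakehouse table) taken at log offset 3.
snapshot = {"offset": 3, "rows": {"sensor-a": 20.5, "sensor-b": 18.0}}

# Hypothetical streaming log; each entry is (offset, key, value).
log = [
    (1, "sensor-a", 19.0),
    (2, "sensor-b", 18.0),
    (3, "sensor-a", 20.5),
    (4, "sensor-c", 22.1),
    (5, "sensor-a", 21.0),
]


def bootstrap_state(snapshot, log):
    """Load state from the snapshot, then apply only newer log entries."""
    state = dict(snapshot["rows"])
    applied = 0
    for offset, key, value in log:
        if offset <= snapshot["offset"]:
            continue  # already reflected in the snapshot
        state[key] = value
        applied += 1
    return state, applied


def replay_from_scratch(log):
    """Baseline: rebuild the same state by reading the whole log."""
    state = {}
    for _, key, value in log:
        state[key] = value
    return state


state, applied = bootstrap_state(snapshot, log)
print(state, applied)
```

The bootstrapped state matches a full replay while reading only the two entries newer than the snapshot, which is the saving that matters when the log is long.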
Simplified Pipeline Architecture
- What it Does: With its native integration with Apache Flink, Fluss eliminates the need for intermediate Kafka topics and additional OLAP systems.
- Why it Matters: This dramatically simplifies pipeline architecture and reduces infrastructure costs while enhancing scalability.
Fluss in the Ververica Unified Streaming Data Platform
Fluss is not just a standalone technology; it's a critical component that enhances the capabilities of Ververica's Unified Streaming Data Platform. By providing a scalable, unified batch and streaming data solution, Fluss addresses key challenges in real-time data processing and storage within the Ververica ecosystem.
- Enhancing Flink's Capabilities: Fluss complements Apache Flink's powerful stream processing capabilities by providing the missing piece: an optimized storage layer for real-time analytics. This allows users to leverage Apache Flink for both computation and a direct, high-performance storage solution.
- Driving Real-Time Intelligence: Fluss is designed to power complex real-time intelligence pipelines, enabling organizations to build highly responsive systems that react instantly to data changes. This is fundamental for modern event-driven architecture and applications that demand immediate insights.
- Cost Efficiency and Performance: By streamlining the architecture and optimizing for real-time analytics, Fluss significantly reduces operational overhead and infrastructure costs compared to traditional multi-system approaches. Its sub-second latency and columnar reads contribute directly to superior performance, making data stream processing more efficient than ever before.
- Foundation for AI/ML Workloads: The unified batch and streaming capabilities, combined with real-time updates, make Fluss an ideal foundation for feeding fresh data to AI and ML models, accelerating model training and inference.
In addition, Fluss equips businesses to stay competitive, adapt quickly, and future-proof investments through scalable, efficient, and modern data solutions. Key results include:
Stay Competitive
- How it Helps: Fluss gives you a technological edge by enabling faster, smarter decisions.
- What is the Impact: Fluss processes and delivers data instantly, enabling your teams to respond to events like market changes, system alerts, or customer actions in real time.
Adapt Quickly
- How it Helps: Fluss provides seamless integration and fast data flow.
- What is the Impact: You can adapt to changing market conditions and customer needs quickly.
Future-Proof Your Investments
- How it Helps: Fluss provides a scalable, adaptable, and forward-looking data infrastructure designed to support evolving technology trends and business needs.
- What is the Impact: Fluss is an essential part of a forward-looking data strategy. By providing a platform that evolves with technological advancements and business demands, Fluss empowers organizations to make investments that deliver long-term value and adaptability.
Simplify Your Data Architecture
- How it Helps: By combining real-time and historical data handling into one platform, Fluss eliminates the need for separate systems.
- What is the Impact: Operational complexity drops, and the architecture stays compatible with future data demands.
Seamless Integration With Emerging Technologies
- How it Helps: Fluss supports advanced AI, machine learning, and analytics workloads, so your business can leverage cutting-edge tools and methodologies.
- What is the Impact: You can adopt these technologies without overhauling your data architecture.
Built For Scalability
- How it Helps: Fluss is designed to scale with growing data volumes.
- What is the Impact: Fluss ensures that your business can handle increasing workloads as operations expand without sacrificing performance.
Compatibility With Modern Data Ecosystems
- How it Helps: Fluss integrates seamlessly with popular data lakehouse systems (like Apache Paimon and Apache Iceberg).
- What is the Impact: Fluss ensures that your business can adapt to the latest data management paradigms without disrupting existing workflows.
Open-Source Flexibility With Commercial Support
- How it Helps: Fluss combines open-source principles and flexibility with Ververica’s expert help.
- What is the Impact: Fluss ensures your business retains control and avoids vendor lock-in while benefiting from Ververica’s commercial enhancements and support.
Optimize Resources
- How it Helps: Fluss delivers efficient data processing and reduced network costs.
- What is the Impact: Fluss provides a cost-effective solution, especially for data-intensive industries like finance, retail, and technology.
Use Cases Transformed by Fluss
The introduction of Fluss opens up new possibilities and transforms existing use cases across various industries:
- Real-time Dashboards and Monitoring Systems: Powering dashboards with sub-second latency, providing up-to-the-minute views of business operations, system health, and key performance indicators.
- Streaming ETL and ELT: Streamlining the Extract, Transform, Load processes by enabling continuous data movement and transformation directly within the streaming storage layer.
- Real-time Intelligence Pipelines: Building sophisticated pipelines for fraud detection, personalized recommendations, anomaly detection, and other applications that require immediate decision-making based on live data.
- Streaming Data Warehouses: Serving as the real-time data layer on the lakehouse, allowing for continuously updated data warehouses that support both streaming-first architectures and traditional batch queries.
- Operational Analytics: Enabling direct data inspection for debugging and troubleshooting live applications, reducing time-to-insight for operational teams.
Fluss's ability to provide real-time updates makes it a natural fit for scenarios where data freshness is critical, ensuring that decisions are always based on the most current information.
The Road Ahead
The decision to open-source Fluss and donate it to the Apache Software Foundation marks a significant milestone. This commitment ensures that Fluss will benefit from community-driven development, fostering innovation and wider adoption. As the project continues to evolve, capabilities will expand and further solidify Fluss’s position as a leading solution for unified streaming storage. Ververica remains committed to supporting and contributing to Fluss, ensuring its continued integration and optimization within Ververica’s Unified Streaming Data Platform.
Apache Fluss represents a pivotal innovation in the world of stream processing and data analytics. By addressing the critical need for a unified, high-performance streaming storage layer, it simplifies complex data architectures, reduces costs, and accelerates the delivery of real-time insights. Its deep integration with Apache Flink makes stream processing with Apache Flink more powerful and efficient than ever before. For organizations striving to build agile, event-driven architecture and unlock the full potential of their data stream processing capabilities, Fluss offers a clear path forward, streamlining the journey from raw data to actionable intelligence.
FAQ
How is Fluss different from Apache Flink?
Flink is a stream processing engine for building and running pipelines, while Fluss is a storage/serving layer that keeps materialized views over streams for fast queries and lookups.
What problems does Fluss solve?
Fluss reduces stack complexity by replacing a patchwork of stream processor + message bus + cache + OLAP store with a single streaming data layer that provides consistent, sub‑second reads.
Can Fluss handle late or out‑of‑order events?
Yes, Fluss is designed to work with time‑aware processing (e.g., event‑time and watermarks) so derived tables stay consistent as late data arrives.
What are typical use cases for Fluss?
Common uses include real‑time dashboards, fraud/risk scoring, personalization, operational analytics, leaderboards, and fast dimension lookups for streaming joins.