What is Apache Paimon™?

Apache Paimon is an open-source, stream-native data lakehouse format. It's designed to bring real-time data stream processing capabilities directly to your data lake, enabling efficient updates, consistent changelogs, and powerful analytical queries on continuously changing data. It integrates tightly with Apache Flink®, allowing you to build unified pipelines for both streaming and batch workloads.

In the world of big data, processing and analyzing vast amounts of information in real time is a key challenge. Traditional data systems struggle to handle both historical data and continuous data streams effectively.

This is where Apache Paimon comes in. Paimon acts as a stream-native data lakehouse that allows for real-time updates directly within your data lake. For companies using Apache Flink to process data streams, Apache Paimon is a foundational tool that is included as part of Ververica's Streamhouse architecture.

The Genesis of Apache Paimon

Apache Paimon was created to fill a critical gap in the stream processing landscape. Apache Flink has evolved into a powerful unified engine for both batch and streaming data, but it lacked a direct, queryable storage layer for the intermediate and final tables of a streaming pipeline. Flink's "Dynamic Tables" are powerful, yet they aren't directly queryable, which limits the immediate accessibility of continuously updated data.

As a result, FLIP-188 ("Introduce Built-in Dynamic Table Storage") was proposed as an initiative that eventually evolved into Apache Paimon. The fundamental idea is both simple and profound: provide Apache Flink with a robust storage layer that leverages a table format, allowing intermediate data in dynamic tables to be directly accessible and queryable. This concept aligns perfectly with the expanding "Lakehouse" paradigm, which combines the flexibility and scalability of data lakes on affordable storage (like S3) with the structured querying and optimization typically found in data warehouses. While other technologies like Apache Iceberg™, Delta Lake, and Apache Hudi also embrace the Lakehouse approach, Paimon's unique strength lies in its stream-first design. This makes it exceptionally suitable for continuous updates and real-time analytics, as detailed in the recent blog: Apache Paimon: The Streaming Lakehouse.

Ververica integrated Paimon as a key ecosystem component, particularly in our Streamhouse architecture, because Paimon provides a robust, stream-native storage layer that significantly enhances Flink's capabilities for real-time data processing and advanced analytics. This strategic integration allows Ververica to offer comprehensive solutions for enterprises seeking modern, zero-trust-compliant data infrastructures.

A Stream-Native Data Lakehouse

Apache Paimon is an open-source table format designed to enable the construction of a real-time Lakehouse architecture with both streaming and batch operations. It combines the benefits of a data lake format with a Log-Structured Merge-tree (LSM) structure, bringing real-time streaming updates directly into the data lake.

Key characteristics that define Apache Paimon include:

  • Unified Data Lakehouse Storage: Paimon serves as a single storage layer that bridges the gap between batch and streaming data. It offers the scalability and flexibility of data lakes alongside the structured querying and schema enforcement of data warehouses. This unification simplifies data architectures, reducing complexity and operational overhead.
  • Real-Time Ingestion with CDC Support: Paimon excels at real-time ingestion, with strong support for Flink Change Data Capture (CDC). This enables incremental updates, allowing data changes from operational databases to be continuously written to Paimon tables, ensuring data freshness.
  • Unified Workloads (Batch and OLAP): Paimon is optimized for both analytical queries (OLAP) and batch processing, making it versatile for diverse workloads within a single system. This means you can run long-running historical queries alongside real-time analytical dashboards.
  • Tight Integration with Apache Flink: Apache Paimon integrates tightly with Apache Flink. Flink can seamlessly use Paimon as both a source (to read data) and a sink (to write data). This deep integration empowers Flink jobs to process data in real time, incrementally update Paimon tables, and query those tables for immediate analytical insights. This makes stream processing with Apache Flink significantly more powerful and flexible.
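
To make this source/sink integration concrete, here is a minimal sketch using Flink's Table API: it registers a Paimon catalog, writes a generated stream into a Paimon table, and reads the same table back as a stream. It assumes the Paimon Flink connector is on the classpath; the warehouse path, table names, and the datagen source are placeholders rather than a recommended setup.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PaimonSourceSinkSketch {
    public static void main(String[] args) {
        // Streaming mode, so the same Paimon table can be written to and read from continuously.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a Paimon catalog; the warehouse path is a local placeholder.
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + " 'type' = 'paimon',"
                + " 'warehouse' = 'file:///tmp/paimon-warehouse')");
        tEnv.executeSql("USE CATALOG paimon");

        // A throwaway datagen source standing in for a real upstream stream.
        tEnv.executeSql("CREATE TEMPORARY TABLE events_source (user_id BIGINT, clicks INT)"
                + " WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

        // Tables created inside a Paimon catalog are Paimon tables; without a primary key this one is append-only.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS events (user_id BIGINT, clicks INT)");

        // Paimon as a sink: continuously write the stream into the lakehouse table.
        tEnv.executeSql("INSERT INTO events SELECT * FROM events_source");

        // Paimon as a source: a streaming read that follows new snapshots as they are committed.
        tEnv.executeSql("SELECT * FROM events /*+ OPTIONS('scan.mode' = 'latest') */").print();
    }
}

Run in batch mode, the same SELECT would return a one-off snapshot of the table instead of a continuous stream, which is exactly the batch/streaming unification described above.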

Core Features and Capabilities of Apache Paimon

Apache Paimon’s design incorporates several powerful features that make it an ideal choice for data stream processing and building a modern data lakehouse:

  • Real-Time Updates with Primary Key Tables: Paimon supports primary-key tables that enable real-time streaming updates of large amounts of data. This is crucial for maintaining data freshness, with queryable updates often available within minutes or even at sub-minute latency, depending on checkpoint intervals. Paimon offers various "merge engines" (like deduplicate, partial-update, aggregate, or first-row) to handle multiple records with the same primary key, providing flexibility in how updates are applied (a table-definition sketch follows this list).
  • Flexible Updates with Merge Engines: A key strength of Paimon lies in its support for rich merge engines. This allows users to define how records are updated, whether by keeping the last row, performing partial updates (updating only specific columns), or aggregating records. This flexibility is vital for complex data stream processing scenarios.
  • Change-Tracking Updates with Changelog Producers: Paimon supports "changelog producers" that generate correct and complete changelogs from any data source. This is essential for downstream consumers to always see correct results and for building an accurate event-driven architecture.
  • Append-Only Tables: For use cases that only require data insertion (e.g., log data synchronization) and do not need update or delete operations, Paimon offers append-only tables. These tables provide large-scale batch and streaming processing capabilities efficiently.
  • Data Lake Capabilities: Inheriting the advantages of a data lake, Paimon provides low-cost storage, high reliability, and scalable metadata management. It supports features like Time Travel, allowing users to query previous versions of data by leveraging snapshots (see the read sketch after this list), and Full Schema Evolution, adapting to changes in data structure without disrupting pipelines.
  • Query Data Skipping: Paimon enhances query performance through indexes (like min/max) that filter irrelevant files, leading to faster data retrieval.
  • Resource-Efficient Automatic Data Compaction: To optimize read and write speeds, Paimon automatically and asynchronously combines smaller files into larger ones through a process called compaction. This prevents "small file" problems common in data lakes, resulting in faster queries and improved performance.
  • Robust Upsert Support: Paimon's use of LSM trees and various merge engines, including the powerful partial update merge engine, allows for efficient upserts (updates or inserts) directly on the Lakehouse. This can eliminate the need for costly streaming joins on primary keys, making streaming ETL more cost-effective.
  • True Streaming Reads: Paimon provides a consumer-id mechanism and related safeguards to ensure true streaming reads. Downstream consumers can reliably track changes even as snapshot management expires and deletes older data files (sketched after this list).
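
To illustrate the primary-key features above, here is a minimal sketch, reusing the tEnv TableEnvironment from the earlier example, that defines a primary-key table with a partial-update merge engine and a lookup changelog producer. The table name, columns, and option choices are illustrative assumptions, not a prescribed schema.

// A primary-key table: records sharing a customer_id are merged with the partial-update
// engine, and a complete changelog is produced for downstream streaming consumers.
tEnv.executeSql(
        "CREATE TABLE customer_profile ("
      + "  customer_id BIGINT,"
      + "  email       STRING,"
      + "  last_login  TIMESTAMP(3),"
      + "  total_spend DECIMAL(18, 2),"
      + "  PRIMARY KEY (customer_id) NOT ENFORCED"
      + ") WITH ("
      + "  'merge-engine'       = 'partial-update',"  // non-null columns overwrite, other columns keep their last value
      + "  'changelog-producer' = 'lookup'"           // generate a complete changelog at commit time
      + ")");

Swapping the merge engine changes how colliding records are merged without touching the rest of the pipeline.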

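Reading is configured the same way, through table options passed as SQL hints. The fragment below continues the sketches above: a batch-mode time-travel query against an earlier snapshot, and a streaming read registered under a consumer id. The snapshot id, consumer id, and warehouse path are placeholders.

// Time travel is typically a batch query: read the table as it existed at an earlier snapshot.
TableEnvironment batchEnv =
        TableEnvironment.create(EnvironmentSettings.inBatchMode());
batchEnv.executeSql("CREATE CATALOG paimon WITH ("
        + " 'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon-warehouse')");
batchEnv.executeSql("USE CATALOG paimon");
batchEnv.executeSql(
        "SELECT * FROM customer_profile /*+ OPTIONS('scan.snapshot-id' = '5') */").print();

// Streaming read with a consumer id (using the streaming tEnv from the first sketch): progress is
// recorded under this id, and snapshots the consumer has not yet read are protected from expiration.
tEnv.executeSql(
        "SELECT * FROM customer_profile /*+ OPTIONS('consumer-id' = 'dashboard-reader') */").print();
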
Apache Paimon in the Ververica Ecosystem

Within the Ververica ecosystem, Apache Paimon plays a pivotal role in enabling a truly Unified Streaming Data Platform. It seamlessly integrates with other key components like Apache Flink (the core processing engine) and Flink CDC (for real-time data ingestion).

Foundation for Streamhouse: As described in "The Streamhouse Evolution", Apache Paimon is the streaming storage layer for Ververica's Streamhouse architecture. This combination allows for a single, cohesive platform for both real-time stream processing and historical analysis on the data lake.

Enhanced Flink Capabilities: Paimon directly extends Apache Flink's capabilities by providing a persistent, transactional, and queryable storage layer for its dynamic tables. This allows Flink jobs to not only process data in motion but also to maintain and query the state of that data efficiently in a cost-effective data lake.

Building Real-Time Data Views: Paimon is instrumental in building real-time data views. Data from various sources is ingested (often via Flink CDC) into Paimon tables, aggregated by Apache Flink, and then made available for real-time visualization through BI tools or custom applications. This end-to-end streaming pipeline eliminates recurrent batch jobs, as elaborated in the Building Real-Time Data Views with Streamhouse blog.
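
A simplified sketch of such a pipeline is shown below: changes are captured from a hypothetical MySQL orders table with the Flink CDC connector, mirrored into a Paimon table, and aggregated into a second Paimon table that serves the real-time view. Hostnames, credentials, table names, and columns are placeholders, and the example assumes the Flink CDC MySQL connector and the Paimon Flink connector are available.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RealTimeViewSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql("CREATE CATALOG paimon WITH ("
                + " 'type' = 'paimon', 'warehouse' = 'file:///tmp/paimon-warehouse')");
        tEnv.executeSql("USE CATALOG paimon");

        // Change stream from an operational database via the Flink CDC MySQL connector
        // (hostname, credentials, and source table are placeholders).
        tEnv.executeSql("CREATE TEMPORARY TABLE mysql_orders ("
                + "  order_id BIGINT, customer_id BIGINT, amount DECIMAL(18, 2),"
                + "  PRIMARY KEY (order_id) NOT ENFORCED"
                + ") WITH ("
                + "  'connector' = 'mysql-cdc', 'hostname' = 'mysql-host', 'port' = '3306',"
                + "  'username' = 'reader', 'password' = 'secret',"
                + "  'database-name' = 'shop', 'table-name' = 'orders')");

        // Mirror of the operational table in the lake, kept fresh by primary-key upserts;
        // the lookup changelog producer keeps the downstream aggregate correct.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS orders ("
                + "  order_id BIGINT, customer_id BIGINT, amount DECIMAL(18, 2),"
                + "  PRIMARY KEY (order_id) NOT ENFORCED"
                + ") WITH ('changelog-producer' = 'lookup')");

        // Continuously maintained aggregate that BI tools or applications can query directly.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS customer_spend ("
                + "  customer_id BIGINT, total_spend DECIMAL(18, 2),"
                + "  PRIMARY KEY (customer_id) NOT ENFORCED)");

        // Two continuous jobs: ingest the change stream, then maintain the aggregate view.
        tEnv.executeSql("INSERT INTO orders SELECT * FROM mysql_orders");
        tEnv.executeSql("INSERT INTO customer_spend "
                + "SELECT customer_id, CAST(SUM(amount) AS DECIMAL(18, 2)) FROM orders GROUP BY customer_id");
    }
}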

Data Sovereignty and Governance: Paimon aligns with Ververica's focus on data sovereignty and compliance by keeping data storage within the customer's cloud environment. It also enforces granular access policies and integrates with cloud-native security tools, ensuring that organizations retain full control over their data.

Scalability and Cost Efficiency: Paimon is designed to handle massive-scale data workloads and supports hybrid or multi-cloud environments. Its optimization for streaming significantly reduces costs associated with traditional batch-processing architectures by leveraging cheap object storage while maintaining performance.

Developer-Friendly Design: Paimon supports tools and APIs familiar to developers working with Apache Flink, making it easier to adopt within the Ververica Unified Streaming Data Platform and simplifying the operational complexity of managing data stream processing pipelines.

Use Cases and Applications

The capabilities of Apache Paimon open up a wide array of use cases, particularly in scenarios demanding both real-time data freshness and cost-effective storage:

  • Real-Time ETL: Transforming and loading data continuously into a data lake for immediate consumption.
  • Real-Time Data Warehousing: Building a data warehouse that is continuously updated with fresh data, enabling real-time analytics and reporting.
  • Streaming Data Marts: Creating specialized, continuously updated data marts for specific business functions.
  • Feature Stores for AI/ML: Providing fresh data features for machine learning models by continuously updating tables that serve as feature stores.
  • Backfilling and Historical Analysis: Combining real-time updates with the ability to query historical snapshots for auditing, debugging, or retrospective analysis.
  • Change Data Capture (CDC) into Data Lake: Leveraging Flink CDC to ingest changes from transactional databases into the data lake in real time for unified analytics.

Apache Paimon is particularly beneficial for businesses looking to upgrade high-latency batch jobs to near real-time, assess the ROI of stream processing, or seamlessly migrate existing Lakehouse workloads to a streaming-first paradigm.

Conclusion

Apache Paimon represents a significant leap forward in data stream processing and data lake architectures. By providing a stream-native, unified storage layer for Apache Flink, it effectively bridges the long-standing gap between real-time and historical data. Its robust features, tight integration with Flink, and role within Ververica's Streamhouse architecture empower organizations to build highly scalable, cost-efficient, and responsive data pipelines. For businesses seeking to harness the full power of their streaming data, achieve a true event-driven architecture, and unlock immediate insights from their data lakes, Apache Paimon stands as a pivotal component. It leads the way towards a future where stream processing with Apache Flink on a data lake is not just a possibility, but an accessible and efficient reality.

FAQ

What makes Paimon stream-first vs Iceberg or Delta?

How does Paimon work with Apache Flink (source/sink, CDC, streaming reads)?

Does Paimon support primary keys, upserts, partial updates?

What latency can Paimon achieve for streaming reads/writes?

What engines query Paimon tables (Flink, Spark, Trino, StarRocks, Doris)?