Time: June 9th 2022, 9am PDT / 12pm EDT / 6pm CEST
Apache Iceberg brings numerous benefits, such as snapshot isolation, transactional commits, fast scan planning, and time travel support. These features solve important correctness and performance challenges for batch processing use cases. Although originally adopted for batch workloads, Iceberg can also be leveraged as a streaming source. Compared to periodically scheduled batch ETL jobs, streaming reads can reduce the processing delay from hours to minutes.
In this talk, we will discuss how the Flink Iceberg source enables streaming reads from Iceberg tables, where long-running Flink jobs continuously poll for new data and process it as soon as it is committed. We will cover the design of the source operator, focusing in particular on the streaming read mode. We will compare the Kafka and Iceberg sources for streaming reads and discuss how the Iceberg streaming source can power common stream processing use cases. Finally, we will present the performance evaluation results of Iceberg streaming reads.
Steven Zhen Wu
Flink consumers read from Kafka as a scalable, high-throughput, and low-latency data source. However, scaling out data streams becomes challenging when migrations and multiple Kafka clusters are required. To address this, we introduced a new Kafka source that reads sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure.
In this presentation, we will walk through the source design and show how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.