Flink Forward Session Preview: Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg at Netflix

Are you thinking of joining Flink Forward Europe 2019? As a first-time speaker at the event, I am excited to meet the Apache Flink community and share how we leveraged Apache Flink and Apache Iceberg (Incubating) at Netflix to build streaming event-time partitioning at scale. In this post, I will give a sneak preview of my talk Streaming Event-Time Partitioning With Apache Flink and Apache Iceberg that you can attend October 8, 2019!

If you haven’t done so already, go ahead and register to learn more from exciting use cases and technical deep dives on October 8 & 9, or attend one of the Apache Flink training sessions scheduled on October 7!

Here’s what my talk will be about at this year’s Flink Forward Europe:

Streaming Event-Time Partitioning With
Apache Flink
and Apache Iceberg

FFEU19 - Julia Bennett - Netflix

Background

At Netflix, we’ve seen a lot of success and also valuable learnings building some of our core data pipelines with near real-time stream processing in Flink. One challenge that we hadn’t yet tackled though was landing data directly from a stream into a table partitioned on event time. Instead, we were often relying on post-processing batch jobs that “put the data in the right place”.

Our playback data, in particular, depends heavily on a repartitioning pattern to handle a long tail of late-arriving events. This partitioning ensures that all playbacks that occurred on any given date can easily be queried together, regardless of what time we actually processed those events. This is a critical dataset with high volume used broadly across Netflix, powering product experiences, A/B test metrics, and offline insights.

With the introduction of Apache Iceberg as a new table format being developed at Netflix and a supporting internal Flink integration, we decided it was time to give streaming event-time partitioning a shot. As an important dataset with a growing number of low latency requirements, playback data was the perfect candidate to put it to the test.

 

Topics covered

In this talk, I will focus on what you do with your data after your streaming pipelines by discussing a pattern for performing event-time partitioning in Flink on high volume streams. 

I’ll give an overview of how we process playback data at Netflix and why event-time partitioning is so important. I’ll briefly introduce Iceberg and how it helps solve this problem, and then describe our implementation of generic event-time partitioning on streams through configurable Flink components that leverage Iceberg as an output table format. And of course, discuss how we applied this generic implementation to improve playback data! I’ll share some of the challenges we hit along the way and talk through tradeoffs with this approach. 

 

Key takeaways

Interested in what you will learn from my session? Here are some key takeaways the Apache Flink community can expect: 

  • A generic pattern for event-time partitioning at scale with Flink and Iceberg

  • Challenges and tradeoffs with a streaming approach to event-time partitioning 

  • Overview of how we process Netflix’s core playback data and learnings from moving it to this pattern 

  • A brief introduction to Apache Iceberg as a new table format and how it integrates with Flink


Make sure to secure your spot before September 15 by registering on the Flink Forward website. Sessions this year cover multiple aspects of stream processing with Apache Flink such as use cases, technology deep dives, ecosystem, operations, community and research talks among others! Grab your pass and come to meet the Apache Flink community in Berlin this October!

FFEU 2019 - Banner - wide

About the author: 

Julia_Bennett_HeadshotJulia Bennett is a member of the data engineering team for personalization at Netflix that delivers recommendations made for each user. The team is responsible for building large scale data processing used in training and scoring of the various machine learning models that power the Netflix UI experience. They have recently been working on moving some of the company’s core datasets from being processed in a once-a-day daily batch ETL to being processed in near real time using Apache Flink. Before joining Netflix, Julia completed her PhD in mathematics from The University of Texas at Austin.

 

Tags: Flink Forward