Technology Deep Dive Track
Improving throughput and latency with Flink’s network stack
Flink's network stack is designed with two goals in mind: (a) having low latency for data passing through, and (b) achieving an optimal throughput. It already achieves a good trade-off between these two but we will continue to tune it further for an optimal one. Flink 1.5 did not only further reduce overhead inside the network stack, but also introduced some groundbreaking changes in the stack itself, introducing an own (credit-based) flow control mechanism and a general change towards a more network-event-driven pull approach. With our own flow control, we steer what data is sent on the wire and make sure that the receiver has capacity to handle it. If there is no capacity, we will backpressure earlier and allow checkpoint barriers to pass through more quickly which improves checkpoint alignment and overall checkpoint speed.
In this talk, we will describe in detail how the network stack works with respect to (de)serialization, memory management and buffer pools, the different actor threads involved, configuration, as well as the design to improve throughput and latency trade-offs in the network and processing pipelines. We will go into detail on what is happening during checkpoint alignments and how credit-based flow control improves them. Additionally, we will present common pitfalls as well as debugging techniques to find network-related issues in your Flink jobs.
Nico Kruber is an Apache Flink contributor and works as a software engineer at data Artisans. He is very passionate about open source projects and likes to dig deeper into Flink’s core to further improve performance and extend Flink’s capabilities where possible. Before joining data Artisans, Nico was working on his PhD in parallel and distributed systems at Zuse-Institute Berlin.