Use Case Track  

Using Flink to inspect live data as it flows through a data pipeline

 

One of the hardest challenges with authoring a data pipeline in Flink is understanding what your data looks like at each stage of the pipeline. Pipeline authors would love to answer questions like "why is no data coming through my filter?" or "why did my regex not extract any fields?" or "is my pipeline even reading anything from Kafka?" Unit and integration testing pipeline logic goes a long way, and metrics are another great tool to understand what a pipeline is doing, but sometimes you need the data itself to answer why a pipeline is behaving the way it is.

 

To answer these questions for ourselves and our customers, at Splunk we created a simple yet robust architecture for extracting data as it moves through a pipeline. In this talk, you'll learn about our implementation of this architecture, including the lessons we learned while building it and how you can apply the architecture yourself. You'll hear how to rewrite your Flink job graph at job submission time, how to retrieve data from all the nodes in the job graph, and how to expose this information to a user interface through a REST API.
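The graph-rewriting details are the subject of the talk itself, but the per-node "tap" idea can be sketched with Flink's DataStream API. The example below is a minimal, hypothetical illustration (the TapSink class and tap helper are illustrative names, not Splunk's implementation): it attaches a sampling sink alongside each stage so a slice of that stage's records can be inspected without altering the pipeline's output.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class PipelineTapExample {

    // Sink that forwards a sample of records for inspection. In a real system
    // this would publish to a queue or collection service backing a REST API;
    // here it simply logs one record in every sampleEveryN.
    public static class TapSink<T> extends RichSinkFunction<T> {
        private final String tapName;
        private final long sampleEveryN;
        private transient long seen;

        public TapSink(String tapName, long sampleEveryN) {
            this.tapName = tapName;
            this.sampleEveryN = sampleEveryN;
        }

        @Override
        public void invoke(T value, Context context) {
            if (seen++ % sampleEveryN == 0) {
                System.out.println("[tap:" + tapName + "] " + value);
            }
        }
    }

    // Attach a tap alongside a stage and return the stream unchanged, so the
    // rest of the pipeline is unaffected.
    public static <T> DataStream<T> tap(DataStream<T> stream, String tapName) {
        stream.addSink(new TapSink<>(tapName, 100)).name("tap-" + tapName);
        return stream;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> source = tap(env.fromElements("a", "bb", "ccc"), "source");
        DataStream<String> filtered = tap(source.filter(s -> s.length() > 1), "after-filter");

        filtered.print();
        env.execute("pipeline-with-taps");
    }
}

Adding taps by hand like this clutters the pipeline code, which is why rewriting the job graph automatically at submission time, as described in the talk, is attractive: every node gets tapped without the author changing a line.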

 

Authors

Matthew Dailey
Splunk


Matt Dailey is a backend software engineer with experience creating and maintaining large-scale distributed systems, in both batch processing with Hadoop and stream processing with Apache Flink. He has been using Apache Flink since spring 2017.