Building Stateful Streaming Pipelines at GoDaddy with Flink

Written by Ankit Jhalaria | 02 April 2020

Are you thinking of joining the Virtual Flink Forward on April 22 - 24? It’s the first time the GoDaddy team will present at a Flink Forward event, and we are beyond excited to share our experience with Apache Flink: how we used the open source framework to move our pipelines from batch processing to real-time stream processing and make data available to our downstream consumers. Read on for a sneak preview of my session, Building Stateful Streaming Pipelines, which you can join remotely on April 23!

If you haven’t done so already, go ahead and register for the event to learn about the new developments around Apache Flink. Here’s what my talk will be about in a few weeks:

Building Stateful Streaming Pipelines

Building a streaming platform from the ground up is an interesting problem to solve at scale. At GoDaddy, we chose Apache Beam as the programming model for writing both batch and streaming pipelines, which we run on Flink on AWS. To also support batch jobs, which primarily run on Spark in our own data centers, we deploy the same Beam code on Spark. Internally, microservices can store data in SQL Server and make that data available to teams via database replication. Some microservices make data available via RESTful services as well. The data platform team at GoDaddy provides a unified view of our business to our teams. Regardless of where and how data is stored or returned, we run our streaming pipelines in Flink by combining data from multiple ingress sources, including SQL Server CDC (Change Data Capture) logs.
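
As a rough illustration of what "stateful" means when consuming a CDC log, here is a minimal sketch in plain Python. It is not Beam or Flink code and not the pipeline described in the talk: the ChangeEvent type, its field names, and the apply_changes function are hypothetical, and the state is an in-memory dict rather than a managed, fault-tolerant state backend. It only shows the core idea of folding an ordered stream of change events into the latest row per key.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One hypothetical CDC record: an operation on a row identified by key."""
    op: str                     # "insert", "update", or "delete"
    key: str                    # primary key of the changed row
    row: Optional[dict] = None  # new row image; None for deletes


def apply_changes(state: Dict[str, dict],
                  events: Iterable[ChangeEvent]) -> Dict[str, dict]:
    """Fold change events, in arrival order, into per-key state."""
    for event in events:
        if event.op == "delete":
            # Dropping a key that was never seen is treated as a no-op.
            state.pop(event.key, None)
        elif event.op in ("insert", "update"):
            # Keep only the latest row image for each key.
            state[event.key] = dict(event.row or {})
        else:
            raise ValueError(f"unknown operation: {event.op!r}")
    return state


# Illustrative events only; real CDC records carry more metadata.
events = [
    ChangeEvent("insert", "order-1", {"status": "created"}),
    ChangeEvent("insert", "order-2", {"status": "created"}),
    ChangeEvent("update", "order-1", {"status": "paid"}),
    ChangeEvent("delete", "order-2"),
]
current = apply_changes({}, events)
```

In a real streaming job the per-key state would live in the runner's state backend and survive restarts through checkpoints; the dict here stands in for that only to keep the sketch self-contained.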

Topics covered

Some of the core topics my session will cover are the following:

Building a data platform from scratch: Where should you start? What do you need to consider before you start developing your platform, and how does that affect the development process later on?
Learnings from our journey to deploying our e-commerce production pipelines on Flink: What do you need to be aware of? How do you successfully deploy your pipelines in production? What challenges did we face, and how did we overcome them?
Future-proofing your pipeline architecture so that you can run anywhere (in the cloud or on-premises): How do you make sure that your architecture can scale at different levels and in different environments? How do you ensure a cloud-native infrastructure that can also be deployed on-premises?

Key takeaways

Things to keep in mind when running pipelines at scale with Apache Flink
A review of a streaming + batch architecture for anyone starting the journey of consuming data from multiple sources
Common errors when multiple massive pipelines are deployed on Flink, along with what can potentially be tuned and which parameters to consider

Registration for the Virtual Flink Forward is free, and you can join from anywhere in the world through the Flink Forward website. I look forward to virtually meeting the Apache Flink community and learning about all the exciting developments around the technology!

About the author:

Ankit Jhalaria is a Principal Software Engineer at GoDaddy, where he is responsible for building and maintaining the company’s streaming data platform using Apache Beam and Apache Flink to make data available to downstream customers. He previously worked at AtScale, building a BI platform, and at Yahoo, where he worked on large-scale applications with MapReduce and Hadoop. He holds a Master’s in Computer Science from USC. He is an Apache Beam contributor, and he enjoys spending time with his family.