A few weeks back, I attended Flink Forward in San Francisco. This was my first time at the conference and I was very excited to see that the Apache Flink Community is alive and well in 2019!
As a software engineer at FinTech Studios, using Flink to migrate a variety of our data enrichment and anomaly detection use cases on our live news data streams, it was a great opportunity for me to learn more about the latest developments with Flink and where the framework is heading.
Flink Forward, in San Francisco on April 1st and 2nd, joined the community, consumers, and committers alike to share how Flink has grown in the last year and where we’re headed over the next. Among the many steps forward, a few high-level takeaways stood out:
- Unification of Batch and Streaming
- Operational maturity and a growing list of managed solutions
- Machine Learning as a first-class citizen
- The emergence of Self-Service Flink Jobs as a Platform
Unified Batch and Streaming
The Flink community is, like many other communities, tired of maintaining duplicate logic, especially in streaming and batch applications for the same system — the Flink Team is unifying the APIs for data analytics, such as the Table API, so devs can leverage the same code for both needs.
Other projects, like Apache Beam, also aim to allow one codebase for both streaming and batch. They recently released the first stage of support for targeting Flink as a runner.
Alongside codebase unification, many are moving away from Lambda Architecture and towards Kappa Architecture. Uber has coined Kappa+ in a move to provide more concrete implementation guidelines and constraints around the types of jobs that fit into the unified offline-realtime paradigm.
Operations for Everyone
Many in the Flink community, FinTech Studios included, have struggled and built homegrown tools to manage deployments, and in response, many companies have released managed solutions, Ververica, AWS, and Eventadaor.io among them.
AWS released Flink as a runner for Kinesis Data Analytics; Ververica offers the similar Ververica Platform, which has familiar configuration options to a manual deployment and excellent job lifecycle management; and Google Cloud Platform’s Beam and DataFlow both have managed deployments as well.
Though there are many managed solutions, many teams are moving to Flink because their internal ops have started to support it at scale.
Kubernetes seems to dominate both externally and internally managed deployments; it is also the only supported cluster on the Ververica Platform.
Along with deployment operations, Flink jobs are more compatible than ever with the inclusion of Avro schema migrations in 1.7 and TypeSerializers-based schema migrations in 1.8, making it possible to upgrade a job without having to rebuild the state.
Machine Learning within Flink
Following the news that Alibaba and Ververica are joining forces, Alibaba will be contributing the major improvements they’ve made in their internal Flink fork, Blink, over the next year. Among the improvements, a few make training machine learning models much simpler, such as a library of common ML algorithms and improvements on the Table API.
There have been considerable attempts to incorporate trained machine learning models into production streaming applications, such as using RPC + Py4j to communicate with Python-hosted models. TensorFlow Extended was presented as another option for the productionisation of ML models.
The use of native machine learning models in Flink is still not an easy task and, though it is improving, microservices and perhaps the AsyncIO API is currently the easiest way for model integration.
Self-Service Flink Jobs as a Platform
As streaming data starts to account for more and more of business data, the business stakeholders need ways to leverage and work with the streams. Dynamically-defined Flink jobs to analyze and surface streaming data have popped up this year, among them Netflix’s Consolidated Logging (CL) and Cogility’s Cogynt, allowing stakeholders to do complex analytics on streamed data with little-to-no code.
At Netflix, their CL platform allows users to define filtering with simple SQL, analytics transformations, and dynamic output streams on their entire application logging stream, which handles hundreds of billions of events a day. This platform drives their core services, such as personalization and A/B testing.
Overall, Apache Flink is proving to be incredibly flexible. The Ververica team is committed to the growth of the ecosystem and have one of the most helpful, open cultures out there — I’m not sure I ever heard a member say that a use case was not possible with Flink, you just might have to get a little creative.
At FinTech Studios, we work with tens of millions of news documents a month, NER, data extraction + enrichment, and anomaly detection are just a few of our use cases! To find out more about our company, what we do and our open positions visit our website!
The next Flink Forward is scheduled for October 7-9, 2019 in Berlin and the Call for Presentations is open. Make sure to submit your talk and present how you use Flink to the open source community!
About the Author:
Austin Cawley-Edwards is a Senior Software Engineer at FinTech Studios, utilizing real-time data to bring clarity to financial news. He frequently works with Apache Flink, RabbitMQ, and Elasticsearch. He also loves API design, being part of collaborative communities, and sometimes JavaScript.