Flink Forward Session Preview: Elastic Data Processing with Apache Flink and Apache Pulsar

February 28, 2019 | by Sijie Guo

Excited for Flink Forward San Francisco 2019? As one of the 30+ conference speakers, I want to give a sneak preview of my upcoming Flink Forward talk: Elastic Data Processing with Apache Flink and Apache Pulsar.

Flink Forward returns to San Francisco for the third year in a row to showcase the latest developments around Apache Flink and the stream processing ecosystem. This year, it introduces exciting use cases to the Flink and stream processing communities. If you haven’t done so, please go ahead and register to find out the latest stream processing developments!

Here’s what you can expect from my talk during Flink Forward this year:

Elastic Data Processing with Apache Flink and Apache Pulsar


Background

As fast data needs continue to expand, the adoption of stream computing as a framework for low-latency data processing is increasing by the day. Computing frameworks like Apache Flink unify batch and stream processing into a single computing engine, with “streams” as the unified data representation. But while extensive work has been done at the computing and API layers, very little has been done at the messaging and storage layers.

In reality, we still live in a world where data is segregated into silos created by various storage and messaging technologies. The two main types of data are still stored in very different ways: software engineers use message queues and log storage systems for near-real-time event data, while using filesystems and object stores for the static data fed into batch processing. This means that even with a unified computing engine, data scientists still need to write separate programs to process the different data silos, and SRE teams have to operate two different sets of data infrastructure. As a result, there is no single source of truth, and the overall operational picture for developer teams remains messy.

Flink addresses the computation side of this problem by standardizing on “streams”: everything is treated as a stream, and batch processing is simply the special case of processing a bounded stream.
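To make that idea concrete, here is a toy sketch (plain Python, deliberately not Flink code) of why batch falls out of streaming for free: one incremental word-count works unchanged over a bounded input and over an endless source.

```python
from collections import Counter
from itertools import count, islice

def running_word_count(stream):
    """Incrementally count words; the logic is identical for bounded
    (batch) and unbounded (streaming) inputs."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield dict(counts)  # emit the updated state after each event

# Batch: a bounded stream -- drain it and keep only the final result.
*_, batch_result = running_word_count(["flink", "pulsar", "flink"])

# Streaming: the same logic, consumed event by event from an endless source.
endless = (w for _ in count() for w in ["flink", "pulsar"])
first_three = list(islice(running_word_count(endless), 3))
```

The only difference between the two modes is whether the consumer ever stops: a bounded stream has a final state, an unbounded one only has the latest state so far.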

Similarly, we address the messy operational story by storing data as streams. Only one copy of the data is stored (the source of truth), and it can be accessed as streams (via pub-sub interfaces) or as segments (for batch processing). I call this “segmented streams”. The approach has been applied in many Apache BookKeeper-based data systems, such as Twitter’s EventBus, EMC’s Pravega, and Apache Pulsar.
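As a rough illustration (a toy Python model, not Pulsar’s or BookKeeper’s actual storage code), here is the “segmented streams” idea: a single copy of the data is sealed into immutable segments as it grows, and the same data can be read either as one ordered stream or segment by segment.

```python
class SegmentedStream:
    """Toy model of a segmented stream: one copy of the data, readable
    both as an ordered stream (pub-sub style) and as sealed, immutable
    segments (suitable for parallel batch reads)."""

    def __init__(self, segment_size=2):
        self.segment_size = segment_size
        self.sealed = []   # immutable, completed segments
        self.active = []   # the current writable segment

    def append(self, event):
        self.active.append(event)
        if len(self.active) == self.segment_size:
            self.sealed.append(tuple(self.active))  # seal: now immutable
            self.active = []

    def read_stream(self):
        """Streaming view: every event in order, across segment boundaries."""
        for segment in self.sealed:
            yield from segment
        yield from self.active

    def read_segments(self):
        """Batch view: sealed segments that could be processed in parallel."""
        return list(self.sealed)

s = SegmentedStream(segment_size=2)
for event in ["a", "b", "c", "d", "e"]:
    s.append(event)
```

Both views are backed by the same storage, which is the point: there is no separate batch copy to keep in sync with the stream.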

While Flink unifies computation around “streams”, Apache Pulsar (together with Apache BookKeeper) unifies data around “streams”. Together, they can form a unified data architecture for serving many data-driven businesses.


Topics covered

In this presentation, I will talk about the “segmented-streams” concept and architecture. Using Apache Pulsar as an example, I will explain how various Apache BookKeeper-based systems are built on this concept, and how Apache Flink can integrate with such systems for elastic batch and stream processing over segmented streams.


Finally, I will explain how we integrate Apache Pulsar and Apache Flink via streaming and batch connectors, and how Flink can leverage Pulsar’s built-in schema management.
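Pulsar topics can carry a schema that is enforced when messages are produced, which is what lets a connector derive typed records instead of raw bytes. As a rough illustration only (a toy Python model; the class and method names here are hypothetical, not the Pulsar API), here is the kind of guarantee a schema-aware topic provides:

```python
class SchemaEnforcedTopic:
    """Toy model of a schema-aware topic: the topic owns a schema
    (field name -> Python type), and every produce is validated
    against it, so consumers can rely on the shape of each record."""

    def __init__(self, name, schema):
        self.name = name
        self.schema = schema
        self.messages = []

    def produce(self, record):
        # Reject records whose fields don't match the declared schema.
        if set(record) != set(self.schema):
            raise ValueError(f"fields {sorted(record)} do not match schema")
        for field, expected_type in self.schema.items():
            if not isinstance(record[field], expected_type):
                raise ValueError(f"{field!r} must be {expected_type.__name__}")
        self.messages.append(record)

topic = SchemaEnforcedTopic("events", {"user": str, "clicks": int})
topic.produce({"user": "alice", "clicks": 3})     # accepted
try:
    topic.produce({"user": "bob", "clicks": "many"})  # wrong type
    rejected = False
except ValueError:
    rejected = True
```

Because invalid records never reach the topic, a downstream engine like Flink can trust the declared types end to end rather than re-validating every message.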


Key takeaways

I hope you find the topic of my talk as exciting and interesting as I do! Attendees will learn about the following:

  • What are “segmented-streams”? And what is Apache Pulsar?
  • Why does a “segmented-streams” system such as Apache Pulsar fit well with Apache Flink for elastic batch and stream processing?
  • How do we integrate Pulsar and Flink? What are the challenges?
  • What is the future roadmap for the Pulsar and Flink integration?

If you are interested in more sessions about how Flink integrates with the data processing ecosystem and technologies such as deployment and resource management frameworks (e.g. DC/OS, Kubernetes, YARN), message queues (e.g. Apache Kafka, Amazon Kinesis, Apache Pulsar), databases (e.g. Apache Cassandra, Redis), durable storage or logging and metrics, some of the talks below might interest you as well:

Don’t forget to register before March 23 to secure your spot and immerse yourself in the exciting world of stream processing and Apache Flink! See you in San Francisco in a few weeks!
