Skip to content

Introducing Ververica’s New Brand Identity!

Read more

Announcing Cascading on Apache Flink™


See also the related announcement at the Cascading blog. 

                logo_home_cascading flink_squirrel_200_color

Today we are thrilled to announce the first availability of Cascading on Flink, a result of a community-driven effort that brings together the Cascading and Apache Flink™ communities.

Cascading is a proven framework for designing and developing Big Data applications that run in Hadoop with a higher-level API than MapReduce. Cascading is one of the first frameworks in the Hadoop ecosystem, and is very widely used.

Cascading supports multiple compute fabrics but most programs still use MapReduce. MapReduce has a lot of unnecessary overhead because (1) it does not exploit in-memory computation, instead exchanging data via HDFS, and (2) it uses a batch paradigm for data processing.  As a result of MapReduce performance, Cascading users could run into issues with missed service levels.

Apache Flink™ is a streaming data processor with native support for large-scale batch workloads. With Cascading on Flink, Cascading programs are executed on Apache Flink™, taking advantage of its unique set of runtime features. Among these features are Flink's flexible network stack, which supports low-latency pipelined data transfers as well as batch transfers for massive scale-out. Flink's active memory management and custom serialization stack enable highly efficient operations on binary data and effectively prevent JVM OutOfMemoryErrors as well as frequent Garbage Collection pauses. Cascading on Flink takes advantage of Flink's in-memory operators that gracefully go to disk in case of scarce memory resources. Due to the memory-safe execution, very little parameter tuning is necessary to reliably execute Cascading programs on Flink.

With Flink, we treat batch as a special case of streaming, and advocate that well-designed stream processors can do batch analytics as well or better than legacy batch processors. Indeed, Flink’s DataSet API, the API that Cascading programs are also internally compiled to sits on top of the same stream processor as Flink’s streaming API. Cascading on Flink is an important step towards this direction: without a single code change, Cascading users can get better performance by swapping the backend to Flink.

Cascading is also a very welcome addition to a growing number of frameworks that can use Flink as a backend. Flink can now run Hadoop MapReduce programs via its Hadoop compatibility package, Storm programs via its Storm compatibility package, Cascading programs via the runner announced today, and Google Dataflow programs via the Dataflow Flink runner.

Download Cascading on Flink.

Kostas Tzoumas
Article by:

Kostas Tzoumas


Our Latest Blogs

Announcing Ververica Platform 2.7 for Apache Flink® 1.15 featured image
by Daisy Tsang July 12, 2022

Announcing Ververica Platform 2.7 for Apache Flink® 1.15

The latest minor release includes full support for Flink 1.15, several improvements to the user experience, and a brand new look!
Read More
Introducing Ververica’s New Brand Identity featured image
by Alexander Walden June 21, 2022

Introducing Ververica’s New Brand Identity

Today we are extremely excited to introduce our new visual brand identity for Ververica and our products, such as Ververica Platform! This has been in the works for many months, and a project we...
Read More
Monitoring Large-Scale Apache Flink Applications, Part 2: Metrics for Troubleshooting featured image
by Nico Kruber May 31, 2022

Monitoring Large-Scale Apache Flink Applications, Part 2: Metrics for Troubleshooting

The previous article in this series focused on continuous application monitoring and presented the most useful metrics from our point of view. However, Flink’s metrics system offers a lot more, and...
Read More