Announcing Cascading on Apache Flink™

September 21, 2015 | by Kostas Tzoumas

See also the related announcement at the Cascading blog. 

                logo_home_cascading flink_squirrel_200_color

Today we are thrilled to announce the first availability of Cascading on Flink, a result of a community-driven effort that brings together the Cascading and Apache Flink™ communities.

Cascading is a proven framework for designing and developing Big Data applications that run in Hadoop with a higher-level API than MapReduce. Cascading is one of the first frameworks in the Hadoop ecosystem, and is very widely used.

Cascading supports multiple compute fabrics but most programs still use MapReduce. MapReduce has a lot of unnecessary overhead because (1) it does not exploit in-memory computation, instead exchanging data via HDFS, and (2) it uses a batch paradigm for data processing.  As a result of MapReduce performance, Cascading users could run into issues with missed service levels.

Apache Flink™ is a streaming data processor with native support for large-scale batch workloads. With Cascading on Flink, Cascading programs are executed on Apache Flink™, taking advantage of its unique set of runtime features. Among these features are Flink's flexible network stack, which supports low-latency pipelined data transfers as well as batch transfers for massive scale-out. Flink's active memory management and custom serialization stack enable highly efficient operations on binary data and effectively prevent JVM OutOfMemoryErrors as well as frequent Garbage Collection pauses. Cascading on Flink takes advantage of Flink's in-memory operators that gracefully go to disk in case of scarce memory resources. Due to the memory-safe execution, very little parameter tuning is necessary to reliably execute Cascading programs on Flink.

With Flink, we treat batch as a special case of streaming, and advocate that well-designed stream processors can do batch analytics as well or better than legacy batch processors. Indeed, Flink’s DataSet API, the API that Cascading programs are also internally compiled to sits on top of the same stream processor as Flink’s streaming API. Cascading on Flink is an important step towards this direction: without a single code change, Cascading users can get better performance by swapping the backend to Flink.

Cascading is also a very welcome addition to a growing number of frameworks that can use Flink as a backend. Flink can now run Hadoop MapReduce programs via its Hadoop compatibility package, Storm programs via its Storm compatibility package, Cascading programs via the runner announced today, and Google Dataflow programs via the Dataflow Flink runner.

Download Cascading on Flink.

Topics: General | Company Updates

Kostas Tzoumas
Article by:

Kostas Tzoumas

Related articles


Sign up for Monthly Blog Notifications

Please send me updates about products and services of Ververica via my e-mail address. Ververica will process my personal data in accordance with the Ververica Privacy Policy.

Our Latest Blogs

by Victor Xu July 13, 2021

Troubleshooting Apache Flink with Byteman


What would you do if you need to see more details of some Apache Flink application logic at runtime, but there's no logging in that code path? An option is modifying the Flink source...

Read More

Announcing Ververica Platform 2.5 for Apache Flink 1.13

New release includes full support for Apache Flink 1.13, with greatly expanded streaming SQL, new performance monitoring, and many new application management features.

Read More
by Nico Kruber May 11, 2021

SQL Query Optimization with Ververica Platform 2.4

In my last blog post, Simplifying Ververica Platform SQL Analytics with UDFs, I showed how easy it is to get started with SQL analytics on Ververica Platform and leverage the power of user-defined...

Read More