Skip to content

Ververica Cloud, a fully-managed cloud service for stream processing!

Learn more

Announcing Cascading on Apache Flink™


by

See also the related announcement at the Cascading blog. 

logo_home_cascading flink_squirrel_200_color

Today we are thrilled to announce the first availability of Cascading on Flink, a result of a community-driven effort that brings together the Cascading and Apache Flink™ communities.

Cascading is a proven framework for designing and developing Big Data applications that run in Hadoop with a higher-level API than MapReduce. Cascading is one of the first frameworks in the Hadoop ecosystem, and is very widely used.

Cascading supports multiple compute fabrics but most programs still use MapReduce. MapReduce has a lot of unnecessary overhead because (1) it does not exploit in-memory computation, instead exchanging data via HDFS, and (2) it uses a batch paradigm for data processing.  As a result of MapReduce performance, Cascading users could run into issues with missed service levels.

Apache Flink™ is a streaming data processor with native support for large-scale batch workloads. With Cascading on Flink, Cascading programs are executed on Apache Flink™, taking advantage of its unique set of runtime features. Among these features are Flink's flexible network stack, which supports low-latency pipelined data transfers as well as batch transfers for massive scale-out. Flink's active memory management and custom serialization stack enable highly efficient operations on binary data and effectively prevent JVM OutOfMemoryErrors as well as frequent Garbage Collection pauses. Cascading on Flink takes advantage of Flink's in-memory operators that gracefully go to disk in case of scarce memory resources. Due to the memory-safe execution, very little parameter tuning is necessary to reliably execute Cascading programs on Flink.

With Flink, we treat batch as a special case of streaming, and advocate that well-designed stream processors can do batch analytics as well or better than legacy batch processors. Indeed, Flink’s DataSet API, the API that Cascading programs are also internally compiled to sits on top of the same stream processor as Flink’s streaming API. Cascading on Flink is an important step towards this direction: without a single code change, Cascading users can get better performance by swapping the backend to Flink.

Cascading is also a very welcome addition to a growing number of frameworks that can use Flink as a backend. Flink can now run Hadoop MapReduce programs via its Hadoop compatibility package, Storm programs via its Storm compatibility package, Cascading programs via the runner announced today, and Google Dataflow programs via the Dataflow Flink runner.

Download Cascading on Flink.

New call-to-action

Kostas Tzoumas
Article by:

Kostas Tzoumas

Comments

Our Latest Blogs

Q&A with Erik de Nooij: Insights into Apache Flink and the Future of Streaming Data featured image
by Kaye Lincoln 09 April 2024

Q&A with Erik de Nooij: Insights into Apache Flink and the Future of Streaming Data

Ververica is proud to host the Flink Forward conferences, uniting Apache Flink® and streaming data communities. Each year we nominate a Program Chair to select a broad range of Program Committee...
Read More
Ververica donates Flink CDC - Empowering Real-Time Data Integration for the Community featured image
by Ververica 03 April 2024

Ververica donates Flink CDC - Empowering Real-Time Data Integration for the Community

Ververica has officially donated Flink Change Data Capture (CDC) to the Apache Software Foundation. In this blog, we’ll explore the significance of this milestone, and how it positions Flink CDC as a...
Read More
Announcing the Release of Apache Flink 1.19 featured image
by Lincoln Lee 18 March 2024

Announcing the Release of Apache Flink 1.19

The Apache Flink PMC is pleased to announce the release of Apache Flink 1.19.0. As usual, we are looking at a packed release with a wide variety of improvements and new features. Overall, 162 people...
Read More