Announcing Cascading on Apache Flink™

Written by Kostas Tzoumas | 21 September 2015

See also the related announcement at the Cascading blog.

Today we are thrilled to announce the first availability of Cascading on Flink, a result of a community-driven effort that brings together the Cascading and Apache Flink™ communities.

Cascading is a proven framework for designing and developing Big Data applications that run on Hadoop, offering a higher-level API than MapReduce. It was one of the first frameworks in the Hadoop ecosystem and remains very widely used.
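
To give a flavor of that higher-level API, here is a minimal sketch of the classic Cascading word count, loosely following the pattern used in the Cascading tutorials. The input and output paths, the application class name, and the field names are placeholders chosen for illustration:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        // source and sink taps on HDFS; args[0] and args[1] are placeholder paths
        Tap docTap = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap wcTap = new Hfs(new TextDelimited(new Fields("word", "count"), "\t"), args[1]);

        // split each line into words, group by word, and count per group
        Pipe wcPipe = new Pipe("wordcount");
        wcPipe = new Each(wcPipe, new Fields("line"),
            new RegexSplitGenerator(new Fields("word"), "\\s+"));
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(new Fields("count")), Fields.ALL);

        FlowDef flowDef = FlowDef.flowDef()
            .setName("wordcount")
            .addSource(wcPipe, docTap)
            .addTailSink(wcPipe, wcTap);

        // run on classic Hadoop MapReduce
        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, WordCount.class);
        Flow wcFlow = new HadoopFlowConnector(properties).connect(flowDef);
        wcFlow.complete();
      }
    }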

Cascading supports multiple compute fabrics, but most Cascading programs still run on MapReduce. MapReduce carries significant overhead because (1) it does not exploit in-memory computation, exchanging intermediate data via HDFS instead, and (2) it imposes a rigid batch paradigm on data processing. As a result, Cascading programs running on MapReduce can be slow enough to miss their service-level targets.

Apache Flink™ is a streaming data processor with native support for large-scale batch workloads. With Cascading on Flink, Cascading programs execute on Apache Flink™ and take advantage of its unique set of runtime features. Among these is Flink's flexible network stack, which supports both low-latency pipelined data transfers and batch transfers for massive scale-out. Flink's active memory management and custom serialization stack enable highly efficient operations on binary data and effectively prevent JVM OutOfMemoryErrors as well as frequent garbage collection pauses. Cascading on Flink also benefits from Flink's in-memory operators, which gracefully spill to disk when memory is scarce. Thanks to this memory-safe execution, very little parameter tuning is needed to run Cascading programs reliably on Flink.

With Flink, we treat batch as a special case of streaming, and we advocate that a well-designed stream processor can run batch analytics as well as or better than legacy batch processors. Indeed, Flink's DataSet API, the API that Cascading programs are internally compiled to, sits on top of the same stream processor as Flink's streaming API. Cascading on Flink is an important step in this direction: without a single code change, Cascading users can get better performance simply by swapping the backend to Flink.
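
As a sketch of what that swap looks like in practice, and assuming the backend exposed by Cascading on Flink is a FlinkConnector class (the exact package and name should be checked against the project's documentation), the word count above would only change in how the flow is connected:

    // assumption: the Flink backend exposes a FlinkConnector class;
    // the package below may differ from the actual cascading-flink release
    import com.dataartisans.flink.cascading.FlinkConnector;

    // the pipe assembly and FlowDef stay exactly as before;
    // only the FlowConnector implementation is replaced
    Flow wcFlow = new FlinkConnector(properties).connect(flowDef);
    wcFlow.complete();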

Cascading is also a very welcome addition to a growing number of frameworks that can use Flink as a backend. Flink can now run Hadoop MapReduce programs via its Hadoop compatibility package, Storm programs via its Storm compatibility package, Cascading programs via the runner announced today, and Google Dataflow programs via the Dataflow Flink runner.

Download Cascading on Flink.