Stream Processing & Apache Flink - News and Best Practices

Apache Flink® Master Branch Monthly: New in Flink in June 2018

Written by Till Rohrmann | 10 July 2018

Thanks to Dawid Wysakowicz, Piotr Nowojski, and Stefan Richter for their contributions and support in this blog post.

At the beginning of the year, we introduced the Flink Master Monthly blog post series (see the parts for January 2018 here and February 2018 here) to highlight features that were merged into Apache Flink’s master branch during the previous month but aren’t yet part of a stable release.

Apache Flink’s major version releases occur every few months, and there’s a constant stream of activity as new features are merged to the Flink master branch in between releases. Keeping an eye on what’s going into Flink’s master is one of the best ways to stay up-to-date on new work that hasn’t yet made it into an official release.

While the Apache Flink community was focused on releasing Flink 1.5.0 in May 2018, (for more details on what is updated in Flink 1.5.0, read our blog post here) June was also a busy month with some very valuable additions that you can explore below.

If you’d like to see a full list of newly-merged features from a given time period, Git is your friend. You can run the following git command for a full list of all additions for June:

git shortlog -e --since="01 June 2018" --before="01 July 2018"

[FLINK-9418] Migrate SharedBuffer to use MapState

  • Hardens the CEP library and makes it production ready. The migration to MapState makes ShareBuffer smarter with the memory management since only the necessary parts, such as tail entries, are deserialized rather than the entire structure. This change greatly stabilizes the CEP library and allows its use for applications with far more complex patterns that maintain more state.

[FLINK-9366] & [FLINK-8620] DistributedCache works with Distributed File System & BlobStore

  • In May the team worked on the FLINK-8620 issue that modified the distributed cache to disseminate files via the BlobStore, instead of downloading them from a distributed file system. Previously, task managers would download the requested files from the DFS, while now they can retrieve it from the BlobStore. This requires the client to pre-emptively upload all used files with the distributed cache.

  • Following from FLINK-8620, the community working on FLINK-9366 in June to allow both the old and new behavior of DistributedCache. This means that files from the local filesystem can be distributed via the BlobStore, based on a path scheme, while files in DFS will be downloaded and cached directly from DFS.

[FLINK-3952][runtime] Upgrade to Netty 4.1

  • The latest Netty version, which contains a lot of bug fixes, is a major improvement in Flink that includes significant enhancements available in Netty 4.1, and particularly HTTP/2 codecs. Additionally, upgrading to the latest Netty version will allow Flink to take advantage of all available Hadoop patches for Netty 4.1 once tested and fully merged.

[FLINK-9487][state] Prepare InternalTimerHeap for asynchronous snapshots

  • This is the first step in our efforts to make timer services scalable and improve checkpoint speed.

[FLINK-8430] [table] Implement stream-stream non-window full outer join

  • This commit completes the family of the most important non-windowed join operations in Flink’s streaming SQL. In addition to time-windowed joins, the upcoming Flink version will support INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN. A design document about another class of joins for streaming enrichment has just been published. Feel free to

    Thanks to Dawid Wysakowicz, Piotr Nowojski, and Stefan Richter for their contributions and support in this blog post.

    At the beginning of the year, we introduced the Flink Master Monthly blog post series (see the parts for January 2018 here and February 2018 here) to highlight features that were merged into Apache Flink’s master branch during the previous month but aren’t yet part of a stable release.

    Apache Flink’s major version releases occur every few months, and there’s a constant stream of activity as new features are merged to the Flink master branch in between releases. Keeping an eye on what’s going into Flink’s master is one of the best ways to stay up-to-date on new work that hasn’t yet made it into an official release.

    While the Apache Flink community was focused on releasing Flink 1.5.0 in May 2018, (for more details on what is updated in Flink 1.5.0, read our blog post here) June was also a busy month with some very valuable additions that you can explore below.

    If you’d like to see a full list of newly-merged features from a given time period, Git is your friend. You can run the following git command for a full list of all additions for June:

    git shortlog -e --since="01 June 2018" --before="01 July 2018"

    [FLINK-9418] Migrate SharedBuffer to use MapState

    • Hardens the CEP library and makes it production ready. The migration to MapState makes ShareBuffer smarter with the memory management since only the necessary parts, such as tail entries, are deserialized rather than the entire structure. This change greatly stabilizes the CEP library and allows its use for applications with far more complex patterns that maintain more state.

    [FLINK-9366] & [FLINK-8620] DistributedCache works with Distributed File System & BlobStore

    • In May the team worked on the FLINK-8620 issue that modified the distributed cache to disseminate files via the BlobStore, instead of downloading them from a distributed file system. Previously, task managers would download the requested files from the DFS, while now they can retrieve it from the BlobStore. This requires the client to pre-emptively upload all used files with the distributed cache.

    • Following from FLINK-8620, the community working on FLINK-9366 in June to allow both the old and new behavior of DistributedCache. This means that files from the local filesystem can be distributed via the BlobStore, based on a path scheme, while files in DFS will be downloaded and cached directly from DFS.

    [FLINK-3952][runtime] Upgrade to Netty 4.1

    • The latest Netty version, which contains a lot of bug fixes, is a major improvement in Flink that includes significant enhancements available in Netty 4.1, and particularly HTTP/2 codecs. Additionally, upgrading to the latest Netty version will allow Flink to take advantage of all available Hadoop patches for Netty 4.1 once tested and fully merged.

    [FLINK-9487][state] Prepare InternalTimerHeap for asynchronous snapshots

    • This is the first step in our efforts to make timer services scalable and improve checkpoint speed.

    [FLINK-8430] [table] Implement stream-stream non-window full outer join

    • This commit completes the family of the most important non-windowed join operations in Flink’s streaming SQL. In addition to time-windowed joins, the upcoming Flink version will support INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN. A design document about another class of joins for streaming enrichment has just been published. Feel free to join the discussion there!

    [FLINK-8861] [sql-client] Add support for batch queries in SQL Client

    • Even though Flink SQL has unified batch and streaming semantics, the SQL Client has not been able to execute batch queries so far. With this contribution, the SQL Client becomes a more useful tool for experimenting with Flink SQL and quick prototyping.

    [FLINK-7789] Add handler for Async IO operator timeouts

    • Extends the AsyncIO operator such that the user can specify how to handle timeouts of the asynchronous operations. This is a major improvement that gives greater flexibility to Flink users.

    To stay up-to-date with the latest additions and news about Apache Flink and data Artisans you can subscribe to the Apache Flink mailing list. Follow @VervericaData to be notified when our next blog post is ready.

    join the discussion there!

[FLINK-8861] [sql-client] Add support for batch queries in SQL Client

  • Even though Flink SQL has unified batch and streaming semantics, the SQL Client has not been able to execute batch queries so far. With this contribution, the SQL Client becomes a more useful tool for experimenting with Flink SQL and quick prototyping.

[FLINK-7789] Add handler for Async IO operator timeouts

  • Extends the AsyncIO operator such that the user can specify how to handle timeouts of the asynchronous operations. This is a major improvement that gives greater flexibility to Flink users.

To stay up-to-date with the latest additions and news about Apache Flink and data Artisans you can subscribe to the Apache Flink mailing list. Follow @VervericaData to be notified when our next blog post is ready.