
MATCH_RECOGNIZE: where Flink SQL and Complex Event Processing meet.


by Marta Paes

With the unavoidable and ever-growing presence of sensors and smart devices, Complex Event Processing (CEP) is fast becoming a critical paradigm for enterprises to keep ahead of the curve and turn real-time, potentially infinite data streams into actionable business intelligence in loco. If you have ever worked with SQL at large scale under CEP requirements, chances are you had at least one emotional breakdown over the simple task of identifying meaningful sequences of events in a data set, which may have sent you spiralling down into complex contraptions of self-joins and other highly expensive workarounds. The introduction of row pattern recognition in the SQL standard and the ensuing rise of MATCH_RECOGNIZE as a native SQL clause brought new hope for implementing this in efficient and scalable execution scenarios.


Marking its much anticipated consolidation with Flink CEP, initial support for MATCH_RECOGNIZE in Flink SQL was announced as part of the release of Apache Flink 1.7. Building on our previous introduction to Flink SQL, we will guide you through an additional example that showcases the capabilities of MATCH_RECOGNIZE and how easily you can build powerful real-time applications without ever touching a line of Java or Scala code. If you are new to Flink SQL altogether, take a peek into the Apache Flink SQL Training first to get up to speed!

Spinning up the SQL Client environment

A well-rounded demo environment is available on GitHub as a multi-container Docker application based on the interactive Flink SQL Client. To get it up and running, simply execute:

git clone https://github.com/aljoscha/sql-demo
cd sql-demo
docker-compose up -d
docker-compose exec sql-client ./sql-client.sh

The predefined dataset is our usual TaxiRides, describing taxi rides in and around New York City. The goal here is to use Flink SQL to build and run a query that detects rides with mid-stops along the way. Some points we need to consider in advance are: what sequence of events defines such a ride? And how do we go about ensuring event ordering to structure processing? To ease you into the answers, we will build the statement up as we explain each clause, glossing over scenarios that are possible in general but not yet part of the beta implementation. For the impatient: the complete statement and further tweaks can be found in the Overview section.
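For orientation, the TaxiRides table registered in the demo environment exposes roughly the following schema (a sketch; check the demo repository for the exact definition):

TaxiRides
  rideId   BIGINT        -- unique ride id, shared by the start and end event
  taxiId   BIGINT        -- id of the taxi
  driverId BIGINT        -- id of the driver
  isStart  BOOLEAN       -- TRUE for a departure event, FALSE for an arrival
  lon      FLOAT         -- longitude of the event
  lat      FLOAT         -- latitude of the event
  psgCnt   INT           -- number of passengers
  rowTime  TIMESTAMP(3)  -- event time attribute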

Input Definition

The input argument of MATCH_RECOGNIZE is a row pattern table feeding from whatever source object you declare in your base SQL statement. Since views are also a new feature in Apache Flink 1.7, we will restrict our TaxiRides dataset to only consider rides that either start or end in New York City, and use that as input:


CREATE VIEW taxiRideNYC AS
SELECT rideId,
       rowTime,
       CASE WHEN isStart THEN 'DEP' ELSE 'ARR' END AS rideType,
       toAreaId(lon, lat) AS rideArea,
       taxiId,
       driverId,
       psgCnt
FROM TaxiRides
WHERE isInNYC(lon, lat);

Most use cases for pattern matching focus on uncovering trends attached to a single dimension or a small subset of dimensions in the data. This can be achieved by applying logical partitioning and ordering to the input row pattern table using the PARTITION BY and ORDER BY clauses. In our case, we want to find stops within the same ride, so we will partition the input by driverId and order it chronologically within each partition:


PARTITION BY driverId
ORDER BY rowTime

It is highly recommended to always partition the input table using the PARTITION BY clause; otherwise, MATCH_RECOGNIZE will be translated into a non-parallel operator to ensure global ordering. In addition, the table must be primarily sorted on the time attribute contained in each row to ensure that the order of the events can be derived in the first place.

Pattern Definition and Navigation

Going back to one of the initial questions, we need to reason about what pattern of events portrays a ride with mid-stops. Patterns are specified within the PATTERN clause using the powerful and expressive syntax of regular expressions. Each pattern is composed of basic building blocks called row pattern variables (or event types), to which operators such as quantifiers can be applied (with a few limitations). As a starting point, we are looking for a departure event followed by an arrival event, for the same driverId.


PATTERN (S E)
DEFINE S AS S.rideType = 'DEP',
       E AS E.rideType = 'ARR'

We can easily extend this pattern to look only for sequences of rides with two or more ({2,}) mid-stops:


PATTERN (S M{2,} E)
DEFINE S AS S.rideType = 'DEP',
       M AS M.rideId <> S.rideId,
       E AS E.rideType = 'ARR' AND E.rideId = S.rideId

As shown above, the variables must be associated with conditions that incoming events are required to meet in order to be included in the pattern. The DEFINE clause, containing the definition for these conditions, is mandatory: pattern variables lacking definition default to true and will therefore map any row.
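For instance, if we dropped the definition of M altogether (a sketch, not the query we will run), M would default to true and map any row between the departure and the arrival:

PATTERN (S M{2,} E)
DEFINE S AS S.rideType = 'DEP',
       E AS E.rideType = 'ARR' AND E.rideId = S.rideId
-- M is left undefined, so it defaults to true and maps any row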

Conditions can also be defined in a way that, as processing iterates, rows are evaluated based on properties of other rows, for example using the logical offset operators LAST and FIRST.
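As a quick sketch of what that looks like (this variation is not part of our final query), we could require that the passenger count never drops from one mid-stop to the next, using LAST with a logical offset of 1 to reach back to the previously mapped M row:

DEFINE S AS S.rideType = 'DEP',
       M AS M.rideId <> S.rideId
            AND (LAST(M.psgCnt, 1) IS NULL OR M.psgCnt >= LAST(M.psgCnt, 1)),
       E AS E.rideType = 'ARR' AND E.rideId = S.rideId
-- LAST(M.psgCnt, 1) is NULL for the first row mapped to M,
-- so the IS NULL check lets the sequence start.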

Output Definition

The output of MATCH_RECOGNIZE is a row pattern table whose configuration depends on the definition of three main output dimensions within the statement: output mode, measures and after match strategy.

Output Mode

Depending on the desired granularity, the output can return a single summary row for each match of the specified pattern (ONE ROW PER MATCH) or one row for each row of the match (ALL ROWS PER MATCH). For the vast majority of use cases, the default ONE ROW PER MATCH will be the ideal solution; if you need fine-grained control over the output (say, for debugging your statement), ALL ROWS PER MATCH and its extended syntax can also turn out to be very useful, but this mode is not yet supported in Flink SQL.
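For reference, the output mode sits between the MEASURES and AFTER MATCH clauses; a minimal sketch spelling out the default explicitly:

MATCH_RECOGNIZE (
  PARTITION BY driverId
  ORDER BY rowTime
  MEASURES S.rideId AS sRideId
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (S E)
  DEFINE S AS S.rideType = 'DEP',
         E AS E.rideType = 'ARR'
)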

Measures

Row pattern columns are defined in the MEASURES clause, which can be thought of as the SELECT of MATCH_RECOGNIZE. For non-measure columns, the names and declared types are inherited from the corresponding columns of the input table; for measure columns, these are computed based on an expression after a particular match is found.


MEASURES
  S.rideId AS sRideId

Pattern navigation functions (see Pattern Definition and Navigation) are allowed within the MEASURES clause, with aggregates such as SUM and AVG shipping in the upcoming release of Apache Flink 1.8. Imagine we want to check, for the extended pattern example, the evolution of the number of passengers throughout the ride:


MEASURES
  S.rideId AS sRideId,
  S.psgCnt AS sPsgCnt,
  FIRST(M.psgCnt) AS psgCntFirstStop,
  LAST(M.psgCnt) AS psgCntLastStop

As an appetizer, once the next release rolls out, you should also be able to count the number of stops, for example:


MEASURES
  S.rideId AS sRideId,
  S.psgCnt AS sPsgCnt,
  FIRST(M.psgCnt) AS psgCntFirstStop,
  LAST(M.psgCnt) AS psgCntLastStop,
  COUNT(M.rideId) AS nStops

After Match Strategy

Last but not least: what happens once a (non-empty) match is found? This behaviour is handled within the optional AFTER MATCH SKIP clause, which specifies where to resume the pattern matching process and thus how many successful matches a single event can belong to. By default, pattern matching resumes at the next row after the last row of the current match (SKIP PAST LAST ROW), but there are three other possible strategies:

  • SKIP TO NEXT ROW: resume at the row after the first row of the current match;

  • SKIP TO LAST <rpv>: resume at the last row that is mapped to the row pattern variable rpv;

  • SKIP TO FIRST <rpv>: resume at the first row that is mapped to the row pattern variable rpv.

There is quite some room for runtime errors when handling this clause, for instance when attempting to resume pattern matching at the same row where the last match started or trying to skip to rows that were never mapped to a row pattern variable in the first place. Make sure you keep an eye out for such caveats!
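As an illustration (a sketch, not part of our final query), resuming at the first mid-stop event allows overlapping matches, while skipping to the very first row of the match is rejected:

AFTER MATCH SKIP TO FIRST M
-- AFTER MATCH SKIP TO FIRST S would attempt to resume at the same
-- row where the last match started and raises a runtime error.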

Overview

Now that we have a clear understanding of the syntax of MATCH_RECOGNIZE and how to best put it to use, we can submit the full query using the SQL client and let it do its thing:


SELECT *
FROM taxiRideNYC
MATCH_RECOGNIZE (
  PARTITION BY driverId
  ORDER BY rowTime
  MEASURES
    S.rideId AS sRideId,
    S.psgCnt AS sPsgCnt,
    FIRST(M.psgCnt) AS psgCntFirstStop,
    LAST(M.psgCnt) AS psgCntLastStop
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (S M{2,} E)
  DEFINE S AS S.rideType = 'DEP',
         M AS M.rideId <> S.rideId,
         E AS E.rideType = 'ARR' AND E.rideId = S.rideId
);

Under the hood, this pattern recognition feature is making use of Flink CEP, the Complex Event Processing library of Apache Flink. One way to visualize what is really happening is to access the Flink WebUI under http://localhost:8081 as you execute the statement.

One important consideration when using MATCH_RECOGNIZE is that it does not use the configured state retention time. Given that the SQL standard also does not provide an option to cap the execution time of a pattern, some attention must be paid to the declaration of pattern predicates in order to minimize the workload and keep state from running wild. But fear not: this will also be addressed in the upcoming release of Apache Flink 1.8 with the introduction of the additional (and non-standard SQL) WITHIN clause to enforce a time constraint on pattern execution, which in turn helps with efficient memory management by limiting the overall state size that has to be maintained internally.
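Once that release is out, the pattern above should be expressible with a time bound along these lines (a sketch based on the announced, non-standard syntax):

PATTERN (S M{2,} E) WITHIN INTERVAL '1' HOUR
-- Matches that do not complete within one hour are discarded
-- and their intermediate state is cleaned up.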

Excited yet? This feature is still in its beta phase, so we encourage you to hit the documentation, play around, and get back to us with any suggestions or requests for future implementations!

