Flink SQL: Deduplication


Flink SQL has emerged as the de facto standard for low-code data analytics. It has managed to unify batch and stream processing while simultaneously staying true to the SQL standard. In addition, it provides a rich set of advanced features for real-time use cases. In a nutshell, Flink SQL provides the best of both worlds: it gives you the ability to process streaming data using SQL, but it also supports batch processing.

Ververica Platform makes Flink SQL even more accessible and efficiently scalable across teams. The platform comes with additional tools for developing SQL scripts, managing user-defined functions (UDFs), catalogs and connectors, as well as operating the resulting long-running queries.

We have seen that there are many use cases for Flink SQL, and we are excited to see what you will build with it. In this blog post, we will explain what deduplication is and show how you can accomplish this with Flink SQL.


What is deduplication in stream processing?

Deduplication is the process of removing duplicate records from a dataset, usually to improve data quality. In stream processing, deduplication is especially important because it can also improve the performance of the system by reducing the amount of data that has to be processed downstream.

Data deduplication works by identifying and removing duplicate records from the data stream. This is typically done by comparing each incoming record against previously seen records or against a reference dataset; when a duplicate record is found, it is dropped from the stream.
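As a simple illustration (the inline events table and its columns are made up for this example), removing exact duplicates from a small, bounded set of rows can be as easy as a SELECT DISTINCT:

-- Illustrative only: exact duplicates removed with DISTINCT over an inline table
SELECT DISTINCT user_id, event_type
FROM (
  VALUES (1, 'click'), (1, 'click'), (2, 'view')
) AS events(user_id, event_type);

In a real pipeline, duplicates arrive continuously over time rather than in one batch, which is where the ROW_NUMBER()-based approach shown later in this post comes in.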

Benefits of deduplication

There are many benefits to deduplicating data, including:

  • improved performance – by removing duplicate data, you can reduce the amount of data that needs to be processed, which can improve performance
  • reduced storage requirements – duplicate data takes up unnecessary space, so removing it can free up valuable storage space
  • higher accuracy – duplicate data can lead to inaccurate results, so removing it can improve the accuracy of your data analysis
  • boosted efficiency – deduplicating data can make data processing more efficient by reducing the amount of data that needs to be processed

How to deduplicate data with Flink SQL

There are different ways that duplicate events can end up in your data sources, from human error to application bugs. Regardless of the origin, unclean data can have a real impact on the quality (and correctness) of your results.

In some cases, data producers generate records with the same ID for streaming data changes. These records may include Insert, Update, and Delete records, and they may need to be deduplicated as part of the business logic in the pipeline before being aggregated or joined with other streams. The purpose of deduplication in this context is to ensure that only unique records are processed and to avoid any issues that may arise from duplicate data.
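As a rough sketch of this pattern (the order_changes table, its columns, and the aggregation below are hypothetical and only serve to illustrate the idea), you could keep only the latest change per order_id before aggregating per currency:

-- Keep only the most recent change per order, then aggregate the deduplicated stream
SELECT currency,
       SUM(amount) AS total_amount
FROM (
  SELECT currency,
         amount,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY change_time DESC) AS rownum
  FROM order_changes
)
WHERE rownum = 1
GROUP BY currency;

Because only one row per order_id survives, a later change to the same order replaces the earlier one in the aggregation instead of being counted twice.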

Suppose that your order system occasionally generates duplicate events with the same order_id, but you're only interested in keeping the most recent event for downstream processing.

As a first step, you can use a combination of the COUNT function and the HAVING clause to check whether (and which) orders have more than one event, and then keep only one row per order using ROW_NUMBER(). In practice, deduplication is a special case of a Top-N query where N is 1 (rownum = 1) and the ordering column is either the processing time or the event time of the rows.

In the example query below, the source table orders is backed by the built-in datagen connector, which continuously generates rows in memory.


CREATE TABLE orders (
  id INT,
  order_time AS CURRENT_TIMESTAMP,
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECONDS
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10',
  'fields.id.kind' = 'random',
  'fields.id.min' = '1',
  'fields.id.max' = '100'
);

-- Check for duplicates in the `orders` table
SELECT id AS order_id,
       COUNT(*) AS order_cnt
FROM orders o
GROUP BY id
HAVING COUNT(*) > 1;

-- Use deduplication to keep only the latest record for each `order_id`
-- (ORDER BY order_time DESC keeps the most recent event; ASC would keep the earliest)
SELECT order_id, order_time
FROM (
  SELECT id AS order_id,
         order_time,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY order_time DESC) AS rownum
  FROM orders
)
WHERE rownum = 1;
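Flink can recognize this ROW_NUMBER() pattern and execute it as an optimized deduplication operator rather than a general Top-N, so only the latest row per key is kept in state. To make the deduplicated stream available downstream, you could write it into a sink table. The sketch below is illustrative only: the latest_orders table is hypothetical and uses the built-in print connector, and, depending on your Flink version, CURRENT_TIMESTAMP may be typed as TIMESTAMP_LTZ(3), so the sink column type should be adjusted to match.

-- Hypothetical sink table, for illustration only
CREATE TABLE latest_orders (
  order_id INT,
  order_time TIMESTAMP_LTZ(3)
) WITH (
  'connector' = 'print'
);

-- Continuously emit the latest event per order into the sink
INSERT INTO latest_orders
SELECT order_id, order_time
FROM (
  SELECT id AS order_id,
         order_time,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY order_time DESC) AS rownum
  FROM orders
)
WHERE rownum = 1;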

Summary

In this article, you learned what data deduplication is and why it matters in stream processing. You've also seen how to use Flink SQL to write queries that solve this type of problem.

We encourage you to run these examples on Ververica Platform. You can follow these simple steps to install the platform.

To learn more about Flink SQL, check out the following resources:

Ververica Academy

Article by:

Ververica
