This is a guest post by Jakub Piasecki, Director of Technology at Freeport Metrics about using stream processing and Apache Flink in the IoT industry. The content is based on the post that originally appeared on the Freeport Metrics Blog.
Data processing in the IoT industry poses unique challenges that make streaming processing with Flink a reliable solution. Some specific challenges that the IoT industry faces when it comes to data processing are the following:
- Devices produce way more data then users do. This makes traditional databases inefficient for handling the large amounts of continuous data produced by edge devices (i.e. “things”).
- IoT users expect real-time information they can act on immediately. Consequently, ETL (extract-transform-load) pipelines or batch data operations are not best suited for real-time response like the above.
- Connectivity is never guaranteed in the IoT industry, especially when you send data from edge devices over cellular networks.
Freeport Metrics has worked on multiple data streaming projects in the IoT industry, such as building a solar energy monitoring and billing platform, optimizing wind farm data processing, and recently created a large-scale, real-time RFID asset tracking platform that uses Apache Flink as the underlying stream processing framework. Below are 7 reasons why stream processing with Apache Flink in the IoT is a must-have:
1. Real-Time Data Processing Is a Game Changer
The IoT industry requires immediate information and action to be taken from any device activity. Let’s, for example, imagine a situation when it’s windy outside but your wind turbine produces no energy. Alternatively, let’s think of a case where a precious asset is about to leave your facility for an unknown reason. In both instances, the business needs to know and have access to such information immediately.
Operating on data streams instead of batches changes the programming paradigm fundamentally and allows for triggering calculations immediately when data is available, setting timely alerts, or detecting event patterns in a continuous manner.
Additionally, processing data as it is produced may be optimal for performance in some cases. For example, no computations are needed if no new event occurred, which means you don’t have to recompute the whole data set periodically to get fresh results.
2. Event Time is the Correct Way of Ordering Data in the IoT Industry
When data from your devices travels through a cellular network, it’s essential to account for latency and network failures. Even if you send it over a more stable connection, you cannot beat the laws of physics and the distance of your IoT devices from your data center will inevitably increase latency.
As an example, let’s imagine a factory and a machine or automotive part moving through a production line with sensors along it. There is no guarantee that readings from those sensors will arrive over the network in the order they were captured.
More often than not, when dealing with data from IoT devices, it is sensible to process events ordered based on the time they occurred (event time), not when they arrived at the data center or when they were processed (ingestion and processing time, respectively). Because of that, event time support is a must-have when selecting a data processing framework.
Read more about event time in Apache Flink and how it differs from processing or ingestion time in one of our earlier blog posts.
3. Tools for Dealing with Messy Data
Data pre-processing on the edge is usually the hardest part of the process. It is even harder when you don’t fully control the source, as is often true in the IoT world. You can end up with a serious portion of glue/clean-up code and bizarre conditional logic.
Of course, stream processing doesn’t fix your data for you, but it proposes a couple of nice tools. The most useful in our opinion is windowing — a concept of grouping elements of an unbounded stream into finite sets based on dimensions like time or element counts for further processing. Let me give you a couple examples.
If your data is noisy (e.g. analog sensors, GPS), you can write your own window processing function that can be as simple as averaging it or doing something more sophisticated.
Freeport Metrics worked with power meters that send data fast, but are failure-prone due to the Modbus protocol and therefore only produce precise data batch files at the end of the day. In this case, you can use Modbus data to calculate approximate live statistics to display to end users and then replace them with batch input later so it can be used for accurate billing. This can be achieved by writing a trigger which produces partial results as new data comes and closes the window when it gets batch data.
Sometimes, there is just no way of telling if all your data has arrived. In this case, you would trigger windows according to some heuristic watermark that can be empirically calculated (think of it as an event that pushes time forward in the system) or assuming an expiring timeout. Flink also lets you specify allowed lateness of elements and provides side outputs to handle events that come later.
4. Segmentation Allows for Parallel Processing
Often times different users of an IoT system are interested in calculations performed only on a subset of data. Let’s imagine that you created a platform that allows cat owners to see where their beloved pets are wandering. Each owner needs data only from the GPS tracker of his/her own cat. Flink introduces the concept of grouping by key for that purpose. Once a stream is partitioned it can be processed in parallel, enabling you to scale up horizontally. Of course, a key doesn’t have to be bound to a single IoT device or location. For example, in the case of fleet management, you may want to group different signals related to a single vehicle together (e.g. GPS, hardware sensors, license plate scan at parking gates)
We also recommend exploringdata Artisans Streaming Ledger which allows for distributed transactions between parallel streams across shared states and tables (available in the River Edition of data Artisans Platform).
5. Local State is Crucial to Performance
While “every programmer should know that the latency numbers” change each year thanks to progress in hardware and infrastructure, some fundamental rules remain constant:
- The closer data is to you, the faster you can process it;
- Disk I/O is no good for performance.
Apache Flink lets you keep data right where calculations are performed using the local state. More importantly, the state is fault tolerant using lightweight checkpointing, which limits I/O.
Don’t be mislead that local state is just another form of read-only local cache. It truly shines when you update it with event data. For example, you could store historical values of sensor readings in the local state and update it with new data to calculate live statistics.
Some people even question the need for another persistence layer and propose using Flink as the single-source-of-truth. If you want to get really philosophical, see this talk about the convergence of streaming and microservice architecture by Viktor Klang from Lightbend.
6. Flink Loves Messaging
When you think of stream processing, you very often also think of messaging systems such as Apache Kafka, AWS Kinesis or RabbitMQ — highly scalable and reliable for large volume event ingestion. Flink provides first class support for all three, both as producers and consumers, at the same time making its distributed nature play nicely with characteristic performance-enhancing features such as partitioning or sharding. Exactly-once, end-to-end processing is also extended from Flink to these systems (when used as consumers), if your use case requires it.
One of our earlier blog posts explains in detail how Apache Flink manages Kafka consumer offsets.
7. Data Streaming is Conceptually Simple (Once You Get Used to It)
Last but not least, once you adopt a data streaming approach, it just feels natural. Although there’s a steep learning curve for your team adapting to managing state properly in Flink or working with operator parallelism, once you get used to it, you can focus on the core logic of your application as most of the ‘dirty’ work has been handled for you by the framework.
At Freeport Metrics, we experienced something similar when we switched from batch processing to stream processing with Apache Flink and with most technological transitions to newer, updated frameworks or platforms. At some point, you just know that it’s the right tool for the job and you wonder how you ever lived without it.
The observations above make us believe that data processing in the IoT industry can truly benefit from adopting a stream processing framework such as Apache Flink. Flink’s features, connectors, fault-tolerance, and reliability provide one of the best frameworks suited to resolve the challenges that IoT companies face when dealing with the vast amount of data they are processing every day, minute, or second.
This article was initially posted on Freeport Metrics’ blog here. Special credits go to Jakub Piasecki, for allowing us to use the story and contributing some of his real-life data streaming and Apache Flink use cases on the data Artisans blog. For more information and stream processing projects from Freeport Metrics, you can visit their blog here.
Apache Flink, Flink®, Apache®, the squirrel logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.