Technology Deep Dive Track
Using a sharded Akka distributed data cache as a Flink pipelines integration buffer
A common and reliable way to buffer streaming data in between Flink pipelines is a pair of Flink Kafka Source and Sink. However, in some low-latency streaming firehouse use-cases this option is not the best choice: a) backlog will quickly accumulate in Kafka if Source consumption rate can’t keep up with the Sink production b) Kafka broker implies a double dispatch of all data via broker network IO with unnecessary network hops to and from the Flink cluster. In the Walmart Labs Smart Pricing group, we encountered one such use-case operating Walmart.com Marketplace real-time price regulation service, managing more than 100M of the 3rd party marketplace items in real-time on at least a daily basis. As an alternative to a Kafka-based integration firehose, we decided to introduce a reliable and scalable in-memory-cache-based streaming buffer as an integration hub for our Flink-based Walmart.com Marketplace pipelines. Our main motivation was to avoid firehose backlog accumulation at all cost and maximize total end-to-end throughput. Our implementation is powered by a sharded (using Akka Cluster-Sharding) collection of replicated Akka Distributed Data caches, co-located with Flink Task Managers. Flink pipelines are interacting with this streaming buffer via a pair of custom partitioned Flink Sink and Source components that we wrote specifically to expose this cache to Flink. The resulting latency and throughput performance is better than what a Kafka-broker-based approach offers: a) there is almost never a foreground data exchange over cache cluster network IO as nearly all (determined by the cache miss rate) data is written and read in Flink pipelines through local memory b) cache data size, miss rate and updates volume can be managed via both the shard fill rate (not every write needs to be a new cache record – as opposed to messaging systems like Kafka) and the number of shards to keep alive in memory (if some shards are not actively accessed – they can be automatically killed and spilled to a permanent storage to be recovered later via Akka Persistence). The downside of this cache buffer is a large RAM demand: cache shards are memory-hungry and co-locating it with Flink Task Managers means that this memory will be unavailable to allocate to the Flink Task Manager heap and/or direct buffer.
Andrew TorsonPrincipal Data Engineer Walmart Labs
Andrew Torson is a Principal Data Engineer in the Smart Pricing team within the Walmart Labs organization. His current work is focused on big-fast-data pipelines in Flink and Spark for the Walmart e-commerce business, leveraging ML-based retail pricing algorithms. Before joining Walmart Labs, Andrew worked as a data scientist and data engineer on a handful of IoT projects in the area of mobile robotics(warehousing/manufacturing/marine container terminals industries), using ML-based tools in Scala, Python and Java. Andrew holds a PhD in Operations Management from NYU and also worked as a ML scientist in the Siemens Research/Labs after his graduation, which led him towards his current product data engineer track.