How to manage your RocksDB memory size in Apache Flink

26 February 2020 by Stefan Richter

This blog post describes some configuration options that will help you efficiently manage the memory size of the RocksDB state backend in Apache Flink. In an earlier post, we described the alternative state backend options supported in Flink. In this post, we describe how RocksDB operates within Flink and then cover some important configurations for effective resource consumption.

Update: Starting from Flink 1.10, Flink manages RocksDB's memory automatically, as described here

RocksDB State Backend in Apache Flink

Before diving into the configuration parameters, let’s first revisit how RocksDB is used for state management in Apache Flink. When you choose RocksDB as your state backend, your state lives as a serialized byte string in either off-heap memory or on the local disk. RocksDB is a key-value store that is organized as a log-structured merge tree (LSM-tree). When used to store your keyed state in Flink, the key consists of the serialized bytes of the <Keygroup, Key, Namespace>, while the value consists of the serialized bytes of your state. Every time you register a keyed state, it is mapped to a column family (similar to a table in a traditional database) and the key-value pairs are stored as serialized bytes within RocksDB. This means that data has to be serialized or deserialized with every READ or WRITE operation, which can compromise performance when compared to the alternative, in-memory state backends bundled in Flink.
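To make the key layout more concrete, here is a minimal sketch of how a <Keygroup, Key, Namespace> triple could be flattened into a single byte string. The class name, the method name, and the exact encoding (a 2-byte key group followed by two length-prefixed strings) are assumptions made for illustration only; this is not Flink's actual serialization format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Toy illustration only; this is NOT Flink's actual key format. */
final class CompositeKeySketch {

    /** Concatenates key group, key, and namespace into one byte string. */
    static byte[] compositeKey(int keyGroup, String key, String namespace) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(keyGroup); // key group as a 2-byte prefix
            out.writeUTF(key);        // length-prefixed user key
            out.writeUTF(namespace);  // length-prefixed namespace (for example, a window)
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] k = compositeKey(7, "user-42", "window-1");
        System.out.println("composite key length in bytes: " + k.length);
    }
}
```

Every state access has to build such a byte string (and serialize or deserialize the value as well), which is where the per-operation overhead mentioned above comes from.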

Using RocksDB as a state backend has many advantages: it is not affected by garbage collection, it often provides a lower memory overhead of representation compared to heap objects and it is currently the only option that supports incremental checkpointing. Additionally, with RocksDB your state size is only limited by the availability of your local disk space, which best suits Flink applications that rely on large state operations.

If you are not familiar with RocksDB, the diagram below illustrates its basic READ and WRITE operations.

A WRITE operation in RocksDB stores data in the currently active memory table (Active MemTable). When a MemTable is full, it becomes a READ ONLY MemTable and is replaced by a new, empty Active MemTable. READ ONLY MemTables are periodically flushed to disk by background threads into key-sorted, read-only files, the so-called SSTables. SSTables, in turn, are immutable, but they are consolidated through background compaction, a multiway merge of SSTables. As mentioned earlier, with RocksDB every registered state is a column family, which means that every state has its own set of MemTables and SSTables.
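The write path described above can be sketched with a small in-memory model. This is a toy under stated assumptions, not RocksDB code: the class ToyWritePath and its methods are hypothetical, capacity is counted in entries rather than bytes, "SSTables" are plain sorted maps instead of files, and flushing and compaction are triggered by hand instead of by background threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path; sizes are counted in entries, not bytes. */
final class ToyWritePath {
    private final int memTableCapacity; // stand-in for write_buffer_size
    private TreeMap<String, String> active = new TreeMap<>();
    final Deque<TreeMap<String, String>> readOnly = new ArrayDeque<>(); // newest first
    final List<TreeMap<String, String>> ssTables = new ArrayList<>();   // oldest first

    ToyWritePath(int memTableCapacity) {
        this.memTableCapacity = memTableCapacity;
    }

    void put(String key, String value) {
        active.put(key, value);
        if (active.size() >= memTableCapacity) {
            readOnly.addFirst(active); // the full table becomes read-only
            active = new TreeMap<>();  // and is replaced by a new, empty one
        }
    }

    /** What a background thread would do: flush the oldest read-only table. */
    void flushOldest() {
        TreeMap<String, String> oldest = readOnly.pollLast();
        if (oldest != null) {
            ssTables.add(oldest); // a TreeMap is already key-sorted
        }
    }

    /** Compaction as a multiway merge: newer tables overwrite older entries. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> table : ssTables) { // oldest to newest
            merged.putAll(table);
        }
        ssTables.clear();
        if (!merged.isEmpty()) {
            ssTables.add(merged);
        }
    }

    int activeSize() {
        return active.size();
    }
}
```

In this model, one ToyWritePath instance corresponds to one column family, so a job with several registered states would hold several independent sets of tables.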

Figure: RocksDB READ and WRITE operations

READ operations in RocksDB first access the Active MemTable to answer a query. If the searched key is not found there, the READ operation accesses the READ ONLY MemTables from the most recent to the oldest until the key is found. If the key is not found in any MemTable, the READ operation accesses the SSTables, again starting from the most recent. SSTable data is obtained either from the BlockCache (which holds uncompressed blocks of the table files), from the OS’s file cache, or, in the worst case, from the local disk. Optional indexes such as SST-level bloom filters can help to avoid hitting the disk.
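The lookup order can be summarized in a few lines of code. The following is a simplified sketch, not RocksDB's implementation: ToyReadPath and its get method are hypothetical names, tables are modeled as plain maps with non-null values, and the BlockCache, the OS file cache, bloom filters, and deletions are all left out.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy model of the read path: the newest data is consulted first. */
final class ToyReadPath {

    static Optional<String> get(String key,
                                Map<String, String> active,
                                List<Map<String, String>> readOnlyNewestFirst,
                                List<Map<String, String>> ssTablesNewestFirst) {
        // 1. The Active MemTable holds the most recent writes.
        if (active.containsKey(key)) {
            return Optional.of(active.get(key));
        }
        // 2. READ ONLY MemTables, from the most recent to the oldest.
        for (Map<String, String> table : readOnlyNewestFirst) {
            if (table.containsKey(key)) {
                return Optional.of(table.get(key));
            }
        }
        // 3. SSTables, again starting from the most recent.
        for (Map<String, String> table : ssTablesNewestFirst) {
            if (table.containsKey(key)) {
                return Optional.of(table.get(key));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> active = Collections.singletonMap("a", "3");
        List<Map<String, String>> readOnly =
                Collections.singletonList(Collections.singletonMap("a", "1"));
        List<Map<String, String>> ssTables =
                Collections.singletonList(Collections.singletonMap("b", "2"));
        System.out.println(get("a", active, readOnly, ssTables)); // found in the Active MemTable
        System.out.println(get("b", active, readOnly, ssTables)); // found only in an SSTable
    }
}
```

The sketch also shows why a lookup for a key that does not exist is the most expensive case: every layer has to be consulted before the search can give up.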

3 configurations to manage your RocksDB memory consumption

Now that we have established how RocksDB works with Apache Flink, let’s have a look at the configuration options that can help you manage your RocksDB memory size more effectively. Please note that the options below are not exhaustive: you can also manage the state size of your Flink application with the State TTL (Time-To-Live) feature introduced in Apache Flink 1.6. The following three configurations are a good starting point to help you manage your RocksDB resource consumption efficiently:

1. Configuration of the block_cache_size

This configuration ultimately controls the maximum amount of memory used to cache uncompressed blocks. As the number of cached blocks increases, memory consumption increases as well, so by configuring this upfront you can maintain a specific level of memory consumption.

2. Configuration of the write_buffer_size

This configuration sets the maximum size of a MemTable in RocksDB. Both Active and READ ONLY MemTables contribute to RocksDB’s memory consumption, so adjusting this early may save you some trouble later.

3. Configuration of the max_write_buffer_number

This configuration controls the maximum number of MemTables held in memory before RocksDB flushes them to the local disk as SSTables. This is essentially the maximum number of “READ ONLY” MemTables in memory.

In addition to the resources mentioned above, you can optionally configure indexes and bloom filters, which consume additional memory, as well as the table cache. The table cache not only occupies additional memory in RocksDB, it also holds open file descriptors to the SST files. Their number is unlimited by default, which can collide with the limits of your operating system if not configured correctly.
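To get a feeling for how the three settings add up, the following back-of-the-envelope calculation may help. It is a rough sketch, not an official formula: it assumes that every registered state (column family) gets its own block cache and its own set of MemTables, and it ignores indexes, bloom filters, and the table cache. The class name, the method name, and the example values are made up for illustration.

```java
/** Rough upper-bound estimate; ignores indexes, bloom filters, and the table cache. */
final class RocksDbMemoryEstimate {

    /**
     * Assumption: each state (column family) has its own block cache and up to
     * maxWriteBufferNumber MemTables of writeBufferSize bytes each.
     */
    static long upperBoundBytes(int numStates,
                                long blockCacheSize,
                                long writeBufferSize,
                                int maxWriteBufferNumber) {
        long memTables = Math.multiplyExact(writeBufferSize, (long) maxWriteBufferNumber);
        long perState = Math.addExact(blockCacheSize, memTables);
        return Math.multiplyExact(perState, (long) numStates);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // Example values only: 4 states, 8 MiB block cache, 64 MiB write buffer, 2 buffers.
        long estimate = upperBoundBytes(4, 8 * mib, 64 * mib, 2);
        System.out.println("estimated upper bound: " + estimate / mib + " MiB");
    }
}
```

With these example values, each state contributes 8 MiB + 2 × 64 MiB = 136 MiB, so four states come to roughly 544 MiB per RocksDB instance before any of the additional consumers mentioned above are counted.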

We have just guided you through some configuration options for using RocksDB as a state backend in Flink that will help you manage its memory size efficiently. For more configuration options, we suggest checking out the RocksDB tuning guide or the Apache Flink documentation. If you want to deepen your understanding of how to best configure Flink, the dA Apache Flink Advanced Training covers Capacity Planning and Deployment in detail.