“Play it again, Sam”: Bookmarking, Slicing, and Replaying Unbounded Data Streams for Analytics Applications
Pravega is a novel storage system that exposes the data stream, rather than the object or the file, as a first-class abstraction. In Pravega, a stream is a consistently ordered, durable, available, and elastic series of data events. Pravega is designed to ingest, store, and serve potentially unbounded data streams with high performance, and it adapts to workload fluctuations through auto-scaling. Developers can extract value and insights from stream data by connecting Pravega to a stream processor; Apache Flink is a strong candidate due to its advanced stream processing features. We provide a ready-to-use connector that enables Flink jobs to process data stored in Pravega in stream (ordered) or batch (unordered) fashion. Common developer needs, such as batch-reading a slice of old data events, rewinding or fast-forwarding parts of a stream, or bookmarking a specific point of a stream that is being read, become simple tasks with Pravega. In this talk, we present the main abstraction that supports such operations over streams: the stream cut, a compact data structure that represents an event boundary across a collection of streams. Developers can easily instruct applications to use stream cuts to move back and forth on a data stream; stream cuts are cheap to create and allow efficient seeks in a stream. In Pravega, developers may use pairs of stream cuts to replay arbitrarily old stream slices in a more natural way than batch-loading data from traditional file- and object-based storage systems. Stream cuts can be created to bookmark a stream based on time (e.g., data created in a single day in a company), event references (e.g., a series of events for which an anomaly has been detected), or any other criterion. Moreover, stream cuts complement Flink features such as savepoints: developers who snapshot the state of an application can also capture the precise range of events used as input to reach that state.
We illustrate with examples, including Flink samples, how this simple yet powerful abstraction can be exploited.
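To make the idea of bookmarking and slicing concrete, here is a minimal toy sketch in Python. It is not Pravega's client API or the Flink connector: `ToyStream`, `StreamCut`, `cut_at_tail`, and `read_slice` are hypothetical names invented for illustration, the stream lives in memory, and positions are event indexes in a fixed set of segments (an assumption of this toy model). It only shows why a cut can be compact (one position per segment) and how a pair of cuts bounds a slice that can be replayed in batch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StreamCut:
    """Hypothetical stand-in for a stream cut: one position per segment."""
    positions: Dict[int, int]  # segment id -> index of the next event to read


class ToyStream:
    """In-memory stream made of a fixed number of append-only segments."""

    def __init__(self, num_segments: int) -> None:
        self.segments: List[List[str]] = [[] for _ in range(num_segments)]

    def append(self, routing_key: str, event: str) -> None:
        # Events sharing a routing key land in the same segment, so they stay ordered.
        segment = sum(routing_key.encode()) % len(self.segments)
        self.segments[segment].append(event)

    def cut_at_tail(self) -> StreamCut:
        # Bookmarking is cheap: only one integer per segment is recorded.
        return StreamCut({i: len(s) for i, s in enumerate(self.segments)})

    def read_slice(self, start: StreamCut, end: StreamCut) -> List[str]:
        # Batch-style read: each segment is read on its own, so no order
        # is guaranteed across segments.
        events: List[str] = []
        for i, segment in enumerate(self.segments):
            events.extend(segment[start.positions[i]:end.positions[i]])
        return events


stream = ToyStream(num_segments=2)
stream.append("sensor-a", "a1")
stream.append("sensor-b", "b1")
start = stream.cut_at_tail()   # bookmark before the events of interest
stream.append("sensor-a", "a2")
stream.append("sensor-b", "b2")
end = stream.cut_at_tail()     # bookmark after the events of interest
stream.append("sensor-a", "a3")

# Replays only the events written between the two bookmarks.
replayed = stream.read_slice(start, end)
```

In this sketch the slice between `start` and `end` can be replayed at any later time, regardless of how many events have been appended since, which mirrors the replay of old stream slices described above.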
Raúl Gracia-Tinedo, Dell EMC
Raúl Gracia-Tinedo is a senior software engineer at Dell EMC working on Pravega, a novel distributed storage system for data streams. Prior to joining Dell EMC, he worked as a postdoc in the context of European research projects (FP7 CloudSpaces, H2020 IOStack) and as an intern at IBM Research and Tel-Aviv University. He holds a Ph.D. in Computer Engineering (2015, outstanding thesis award) from Universitat Rovira i Virgili (Spain). Raúl is a highly motivated researcher and engineer interested in distributed systems, cloud storage, and data analytics, and has published more than 20 papers.