Flink for online Machine Learning and real-time processing at Weibo

June 15, 2020 | by Yu Qian

The Flink Forward event gave me an amazing opportunity to present to the global Apache Flink community how Weibo uses Apache Flink for real-time data processing and Machine Learning on our platform. In the following sections, I will introduce Weibo, describe the architecture of our Machine Learning platform, and explain how we use Apache Flink to develop our real-time Machine Learning pipelines. Finally, I will outline how we plan to extend Flink's usage at Weibo and give a glimpse into our experience with the open source technology in our organization.


What is Weibo

Weibo (新浪微博) is the largest and most popular social media platform in China. The website is a microblogging platform (similar to Twitter or Reddit) with functionality including messaging, private messaging, commenting on articles, reposting, as well as video and picture recommendations based on content consumption and interests. In 2019, Weibo had more than 220 million daily active users (DAUs), while monthly active users reached 516 million in the same year. Based on people's social activity — such as consuming, posting and sharing news and updates all over the world — the team at Weibo developed a social network that connects users and maps content to people based on their activities and interests.

Figure 1: Weibo Functionality: Following Feed, Video Feed, Video Recommendation, Picture Recommendation and more

 

Weibo’s Machine Learning Platform (WML)

As illustrated in Figure 2 below, Weibo’s Machine Learning Platform (WML) consists of a multi-layered architecture, from cluster and resource management all the way up to model training and online inference components. At the core of the platform, our cluster deployments, consisting of online, offline and high-performance computing clusters, run our applications and pipelines.

Figure 2: Weibo’s Machine Learning Platform architecture


On top of the cluster layer sits the scheduling layer of our platform, which consists of two internally-developed frameworks (WeiBox and WeiFlow) for submitting jobs to different clusters in a unified manner. We additionally utilize YARN and Kubernetes for resource management. The third layer of our Machine Learning Platform is the computing layer, which includes our internally-developed WeiLearn framework (explained in further detail in the following sections of this post) that allows users of our platform to define their own algorithms and construct their own UDFs, along with multiple integrated data processing frameworks such as Hadoop, Flink, Storm and TensorFlow. The model training and online predicting layers of our architecture sit at the very top and deliver different application scenarios for our company, including feature generation, sample generation, online model training and online inference, among others.

To dive a bit deeper into how Machine Learning is executed at Weibo, let us introduce two of our own, internally-developed frameworks: WeiLearn and WeiFlow. WeiLearn, illustrated on the left side of Figure 3 below, was established at Weibo as a framework that lets our developers write and plug in their own UDFs, following three main steps: Input, Process and Output for our offline data processing jobs, and Source, Process and Sink for our real-time data processing jobs. WeiFlow, on the other hand, handles task dependencies and scheduling with Cron expressions, supporting operations such as rerunning from a specific task or backfilling multiple days of data within a specific time period.
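As a rough illustration of this three-step contract, the sketch below wires hypothetical Source, Process and Sink stages together in plain Java. The interface names (`SourceUdf`, `ProcessUdf`, `SinkUdf`) and the runner are assumptions made for illustration only, not WeiLearn's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a three-step UDF contract for real-time
// jobs (Source, Process, Sink); names and signatures are
// assumptions, not WeiLearn's actual internal API.
interface SourceUdf<T> { List<T> read(); }
interface ProcessUdf<T, R> { Optional<R> process(T record); }
interface SinkUdf<R> { void write(R result); }

public class WeiLearnSketch {
    // Run the three user-defined stages in order; records the
    // Process step maps to Optional.empty() are dropped.
    public static <T, R> List<R> run(SourceUdf<T> source,
                                     ProcessUdf<T, R> proc,
                                     SinkUdf<R> sink) {
        List<R> emitted = new ArrayList<>();
        for (T record : source.read()) {
            proc.process(record).ifPresent(result -> {
                sink.write(result);
                emitted.add(result);
            });
        }
        return emitted;
    }

    public static void main(String[] args) {
        SourceUdf<String> source = () -> List.of("click:1", "noise", "read:2");
        ProcessUdf<String, String> proc = r ->
                r.contains(":") ? Optional.of(r.toUpperCase()) : Optional.empty();
        SinkUdf<String> sink = r -> System.out.println("sink <- " + r);
        run(source, proc, sink); // "noise" is filtered out by the Process stage
    }
}
```

The point of the pattern is that developers only implement the three stage interfaces; the framework owns the execution loop, so the same user code can be scheduled onto offline or real-time engines.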

Figure 3: Weibo’s WeiLearn and WeiFlow frameworks


After a few successful iterations, the Machine Learning Platform at Weibo now supports more than 100 billion parameters in our model training and over 1 million queries per second (QPS), while we managed to reduce our iteration cycle down to 10 minutes from a weekly cadence in the earlier iterations of the platform.

Figure 4: Performance of WeiLearn 6.0, included in the WML platform

 

Apache Flink for Machine Learning at Weibo 

Flink plays a significant role in Weibo's Machine Learning Platform, specifically in the real-time computing component of the platform. The infrastructure layer of the real-time computing platform consists of Apache Flink and Apache Storm clusters, along with Grafana for metrics and monitoring, and Apache Flume plus the ELK stack (Elasticsearch + Logstash + Kibana) as our logging system.

Using Flink’s unique set of abstractions and its unified APIs, we were able to consolidate our Machine Learning pipelines at Weibo. We previously had separate pipelines for online and offline model and inference training, using a different compute engine for each of them, including Storm, Flink, Spark Streaming, Hive and MapReduce. On top of that, we had multiple application development frameworks to use and implement, which resulted in maintenance overhead for the development teams.

We soon started asking ourselves whether maintaining two different processing frameworks was necessary for our needs, the answer to which — for the majority of the cases — was no. This is how we initiated the process of integrating the separate pipelines into one single, unified Machine Learning pipeline for both offline and real time data using Apache Flink at its core.

Figure 5: Integrated Machine Learning Pipelines at Weibo with Apache Flink

 

Online Model Training Pipelines with Flink at Weibo

Flink is used in our online model training pipelines, illustrated in Figure 6 below. We utilize Flink for the sample generating service of our pipeline by filtering, mapping and performing multi-stream joins using Apache Flink's timers and state. We then feed the resulting data into our sample pool, a collection of sample metadata. Once the sample pool is generated, our model training service (WeiLearn) consumes the sample stream to perform feature processing, feature prediction and gradient computations before storing the parameters in WeiPS. WeiPS is our internal parameter service consisting of two separate clusters, one for online predictions and one for online model training, to ensure the stability of our online model predicting work. As a final step, we use CI/CD and CT/CD tools to continuously train and deploy our models, alongside model evaluations, stabilizations and consistency checks that are executed automatically. The models are then served to our sorting service using fbthrift RPC for maximum performance.
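To make the gradient/parameter split concrete, here is a deliberately simplified, single-machine sketch of a parameter-server style update loop in plain Java. The class name, the plain SGD update rule and the in-memory map are illustrative assumptions; WeiPS's actual storage, replication and protocol are internal to Weibo.

```java
import java.util.HashMap;
import java.util.Map;

// Greatly simplified sketch of a parameter-server style update loop,
// only to illustrate the pull-gradient-push cycle described above;
// not WeiPS's actual implementation.
public class ParamServerSketch {
    private final Map<String, Double> params = new HashMap<>();
    private final double learningRate;

    ParamServerSketch(double learningRate) {
        this.learningRate = learningRate;
    }

    // Pull the current weight for a feature (0.0 if unseen).
    double pull(String feature) {
        return params.getOrDefault(feature, 0.0);
    }

    // Push a gradient computed by a training worker:
    // plain SGD update w <- w - lr * g.
    void push(String feature, double gradient) {
        params.put(feature, pull(feature) - learningRate * gradient);
    }

    public static void main(String[] args) {
        ParamServerSketch ps = new ParamServerSketch(0.1);
        ps.push("userGender", 2.0);   // w = 0.0 - 0.1 * 2.0  = -0.2
        ps.push("userGender", -1.0);  // w = -0.2 - 0.1 * -1.0 = -0.1
        System.out.println(ps.pull("userGender"));
    }
}
```

In the real pipeline, the push side is fed by the gradient computations in WeiLearn while the pull side serves online prediction, which is why WeiPS runs the two workloads on separate clusters.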

Figure 6: Example of an online model training pipeline at Weibo

 

Sample generation and multi-stream joins with Apache Flink

As mentioned before, Apache Flink plays a crucial role in the sample generating and sample pooling service of Weibo's Machine Learning Platform (illustrated in Figure 7 below). Our service combines both offline data and real-time events as input to a unified data processing layer, powered by Apache Flink, that performs general computing, multi-stream joining and deep learning. Here, we introduce additional feature generation (including features for Weibo posts, users, relationships and multimedia content) on both the online and offline data for further processing. Once the computation is complete, the results are shared with our sample pooling service before being used for model training.

Figure 7: The Flink-powered Sample Service in the Weibo Machine Learning Platform


One of the great advantages of using Flink is the framework’s ability to perform multi-stream joins effortlessly. Weibo’s sample service ingests different data sources as input for filtering and mapping functions (UDFs can be plugged in via our WeiLearn framework, explained previously in this article), such as the one shown below:


@Override
public boolean filter(Tuple2<String, DefaultOutModel> data) throws Exception {
    if (this.isEmptyTuple(data)) {
        return false; // filter out empty records
    }

    DefaultOutModel outModel = data.f1;
    String business = outModel.getRecord("business");

    if (business.isEmpty() || !("xxx").equals(business)) {
        return false; // only consider "xxx" business
    }
    return true; // keep matching records
}

@Override
public Tuple2<String, DefaultOutModel> map(String source) throws Exception {
    Map<String, String> detailMap = JsonUtil.fromJsonToObject(source, Map.class);
    DefaultOutModel outModel = new DefaultOutModel();

    // put the raw lk_hour field into the output
    outModel.putRecord("lk_hour",
        String.valueOf(detailMap.get("lk_hour")));

    // append features read from feature engineering into the output
    this.appendFeature(outModel, getFeature("userGender"));

    String blogId = String.valueOf(detailMap.get("blogId"));
    String userId = String.valueOf(detailMap.get("userId"));

    // key = userId_blogId, value = output
    return this.processOut(Tuple2.of(userId + "_" + blogId, outModel));
}

The data is then distributed by key and enters a 10-minute joining time window operation that combines them, before applying additional filtering and mapping functions or appending additional features.

Figure 8: Multi-stream joins with Apache Flink


Let us now explain how the joining time window in Apache Flink allows us to effectively manage out-of-order or late-arriving events. We frequently find that some events, such as a “click” event, arrive at our window function almost instantly, while other event types, such as a “read” event, might arrive a few seconds or minutes later. The time window function collects users’ activities during a 10-minute timeframe. We store every event in RocksDB as key-value pairs before sending the combined results as output to the Sample Stream. This guarantees that the different types of events can be joined together effectively.
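The buffer-then-join behavior described above can be sketched in plain Java as follows. This is a toy stand-in: an in-memory map instead of Flink's RocksDB-backed keyed state, and an explicit onWindowEnd() call instead of Flink's window triggers. It is meant only to show why a late "read" event still joins with the earlier "click" for the same key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified illustration of a keyed joining time window: events for
// the same userId_blogId key are buffered until the window fires,
// then emitted as one combined record. Not the actual Flink operator.
public class WindowJoinSketch {
    // key (userId_blogId) -> event types seen inside the current window
    private final Map<String, Set<String>> buffer = new HashMap<>();

    // Called for every incoming event, regardless of arrival order.
    void onEvent(String key, String eventType) {
        buffer.computeIfAbsent(key, k -> new TreeSet<>()).add(eventType);
    }

    // Called when the 10-minute window closes: emit the joined events
    // per key and clear the per-window state.
    Map<String, Set<String>> onWindowEnd() {
        Map<String, Set<String>> joined = new HashMap<>(buffer);
        buffer.clear();
        return joined;
    }

    public static void main(String[] args) {
        WindowJoinSketch w = new WindowJoinSketch();
        w.onEvent("u1_b7", "click"); // arrives immediately
        w.onEvent("u1_b7", "read");  // arrives minutes later, same window
        System.out.println(w.onWindowEnd()); // {u1_b7=[click, read]}
    }
}
```

As long as both events land within the same 10-minute window, the order in which they arrive does not matter; they are combined when the window fires.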

Figure 9: 10-minute joining time window in Flink

 

Next steps with Flink at Weibo 

As we showed in the previous sections, the use of Apache Flink allowed us to unify our online and offline machine learning pipelines at Weibo. Since this unification cannot be applied to all our internal use cases, our next steps with Flink include significant efforts towards building a unified data warehouse with Apache Flink. Some of this work includes increasing the use of SQL (both Flink SQL and Hive SQL) as a way to provide a single code base and development mode for both batch and stream processing jobs in our organization. We additionally want to provide the appropriate abstraction APIs, giving our developers unified table registration APIs as well as metadata for schema registry, connectors and formats, as illustrated in Figure 10 below.
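As one hypothetical illustration of what such a unified SQL code base enables, the statement below uses a standard Flink SQL tumbling group window. The table and column names are made up for this example; the point is that the same query text can be executed by either a streaming or a batch runtime.

```sql
-- Hypothetical tables; in a unified warehouse the same statement can
-- back both a streaming job and a batch backfill.
INSERT INTO dws_user_actions_10min
SELECT
  user_id,
  action_type,
  TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS window_start,
  COUNT(*) AS action_cnt
FROM ods_user_actions
GROUP BY
  user_id,
  action_type,
  TUMBLE(event_time, INTERVAL '10' MINUTE);
```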

Figure 10: Implementing a unified data warehouse powered by Flink at Weibo


Apache Flink has been a great enabler for our efforts to unify batch and stream processing jobs at Weibo. The framework has proven extremely scalable, scaling to hundreds of instances and processing petabytes of data for our machine learning pipelines. For more information about the use of Flink at Weibo, including how we perform joining time windows as well as our plans for using Flink and TensorFlow for deep learning, you can watch the recording of our Flink Forward Virtual 2020 presentation. If you want to find out more or have any questions, feel free to get in touch at any time.
