Democratizing data in GO-JEK
At GO-JEK, we build products that help millions of Indonesians commute, shop, eat and pay, daily. Data at GO-JEK doesn’t grow linearly with the business, but exponentially, as people build new products and log new activities on top of the growth of the business. We currently see 6 billion events daily, and rising. GO-JEK currently has 18+ products. Every team publishes events as Protobuf messages to Kafka clusters in order to have a well-defined schema and to ensure backward compatibility. This makes data available to all teams for different use cases. To make sense of this raw data, we needed a data aggregation pipeline. We found Flink to be useful.

First use case/requirement for real-time aggregation:

We needed to implement Dynamic Surge Pricing. To do this, we needed real-time data on bookings being created and drivers available to accept bookings, per minute per s2Id (http://s2geometry.io/). We created two Flink jobs to achieve this.

What are daggers?

After the successful implementation of Surge Pricing, we realised that real-time data aggregation can solve a lot of problems. So instead of creating a different job for each of these use cases ourselves, we came up with a DIY solution for creating Flink jobs. We created a generic application known as DAGGERS on top of Flink that takes parameters such as the topic from which the user wants to read data, along with options including watermark intervals, delays and parallelism.

What is Datlantis?

To give users a DIY interface, we created a portal called Datlantis, which allows them to create and deploy massive, production-grade real-time data aggregation pipelines within minutes. Datlantis uses Flink's Monitoring REST API to communicate with the Flink cluster, both to monitor current jobs and to deploy new ones. Users can now simply select Kafka topics from any of the Kafka clusters and write a simple SQL query on the UI, which spawns a new Flink job.
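To make the kind of aggregation such a query expresses more concrete, here is a minimal sketch in plain Python of a one-minute tumbling-window count grouped by S2 cell, similar in spirit to the bookings-per-minute-per-s2Id requirement described earlier. It is not DAGGERS or Flink code: the `BookingEvent` type, its field names and the sample cell ids are hypothetical, and the sketch ignores watermarks, late events and parallelism, which Flink handles for real jobs.

```python
from collections import Counter
from typing import Iterable, NamedTuple

WINDOW_SECONDS = 60  # one-minute tumbling windows


class BookingEvent(NamedTuple):
    """Hypothetical, simplified stand-in for a booking Protobuf message."""
    event_time: int  # event timestamp, in epoch seconds
    s2_id: str       # S2 cell id the booking falls in


def window_start(event_time: int) -> int:
    """Return the start (epoch seconds) of the tumbling window containing event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_per_minute_per_cell(events: Iterable[BookingEvent]) -> Counter:
    """Count events per (window start, S2 cell id) pair."""
    counts = Counter()
    for event in events:
        counts[(window_start(event.event_time), event.s2_id)] += 1
    return counts


# Illustrative input: three bookings in the first minute, one in the second.
events = [
    BookingEvent(0, "cell-a"),
    BookingEvent(30, "cell-a"),
    BookingEvent(59, "cell-b"),
    BookingEvent(60, "cell-a"),
]
result = count_per_minute_per_cell(events)
```

Each key in `result` is a `(window_start, s2_id)` pair, which corresponds roughly to one output row of a windowed GROUP BY in SQL.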
Users also have the option to select one more Kafka data stream in order to write JOIN queries. The Flink job first pushes data to InfluxDB, which lets the user visualize their data on Grafana dashboards. Once the logic of the SQL is verified using the dashboard, the Flink job is promoted to push the data to Kafka. Users can manage their Flink jobs on Datlantis: they can edit, stop or restart the jobs, or change the job configurations. They can also see the logs of their Flink jobs on Datlantis itself.

The reasoning behind pushing data back to Kafka is that the aggregated data becomes available to all the other teams. Our application FIREHOSE takes care of consuming this data from Kafka and pushing it to different sinks, such as a relational DB sink, HTTP sink, gRPC sink, Influx sink, Redis sink, etc. This data is then pushed to our cold storage, which enables us to do historical analysis.

Data Pipeline:

Producer Apps → Kafka → Deserialization → DataStream → SQL → Result → Serialize → Kafka → Consumer Apps/InfluxDB

This DIY approach enabled not only developers, but also data analysts and even Product Managers, to write simple queries and solve complex use cases like:

- System uptime
- Real-time customer segmentation
- Fraud control
- Dynamic surge pricing
- Allocation metrics for city managers

Deployment:

We have different Kafka clusters depending on the type and throughput of data, for example transactional data, driver location pings, API logs, etc. We have deployed multiple Flink clusters for different teams, with different configurations for the resources allocated to task managers. On Datlantis, users have the option to create DAGGERS on their team’s Flink cluster. We used to deploy Flink on YARN using Chef, but have moved to Kubernetes using Helm charts, as it is easier to scale. For checkpointing on YARN-based Flink clusters, we used the underlying HDFS, while for Kubernetes Flink clusters we use GlusterFS.
Datlantis also takes care of automatically scaling the number of task managers on Kubernetes depending on the number of jobs. Our build pipelines make sure the latest production version of the Dagger JAR is available on the Flink clusters. To handle JobManager failures, we have implemented a Kubernetes controller that listens for a JobManager restart and re-uploads the latest JAR to it.

Metrics: we use Influx, Grafana and Telegraf for monitoring

For monitoring purposes, we use the StatsD reporter provided for Flink. Our monitoring stack consists of Telegraf as the StatsD agent, InfluxDB as the time series database, Grafana for visualization and Kapacitor for alerts. Datlantis takes care of creating alerts when a new job is created. Users are provided with a health dashboard that they can use to track the health of their job. Alerts are sent to specific teams via their Slack channels and PagerDuty.
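For readers unfamiliar with StatsD, the sketch below shows the plain-text line format that a StatsD agent such as Telegraf accepts over UDP, where a gauge is written as `name:value|g`. This is only an illustration of the wire format, not the Flink reporter itself: the metric name, host and helper functions are hypothetical, and 8125 is assumed here because it is the conventional StatsD port.

```python
import socket


def format_gauge(name: str, value: float) -> str:
    """Format a gauge in the StatsD plain-text line format: <name>:<value>|g."""
    return f"{name}:{value}|g"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one StatsD line over UDP (fire-and-forget, no acknowledgement).

    The host and port are assumptions for illustration; point them at
    wherever the StatsD agent is actually listening.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical metric name; real names depend on the reporter's configuration.
line = format_gauge("dagger.example_job.records_out", 42)
```

Calling `send_metric(line)` would emit the datagram to a local agent. Because UDP is connectionless, the sender gets no confirmation that the metric was received.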
Rohil Surana works at GO-JEK as a Product Engineer in the Data Engineering team. He has been solving problems in data streaming and data warehousing at GO-JEK. He prefers a hands-on approach to solving problems while learning at the same time. He loves to travel, learn about different places and try their food.
Prakhar Mathur completed his bachelor's degree at the Indian Institute of Technology, Jodhpur. He currently works at GO-JEK as a Product Engineer with the Data Engineering team, solving problems around data publishing and making data easily available to the organisation.