Managing Flink operations at GoJek
At GO-JEK, we build products that help millions of Indonesians commute, shop, eat and pay, daily. Data Engineering team is responsible to create a reliable data infrastructure across all of GO-JEK’s 18+ products. We use Flink extensively to provide real-time streaming aggregation and analytics for billions of data points generated on daily basis. Working at such a large scale makes it really important to automate operations from infrastructure, failover, and monitoring. This way we can push features faster without causing chaos and disruption to the production environment. 1. Provisioning and deployment: With the nature of business at GoJek, we found ourselves provisioning Flink clusters quite often. Currently we run around 1000 jobs across 10 clusters for different data streams with increasing number of requests day by day. We also provision on the fly clusters with custom configuration for load testing, experimentation and chaos engineering. Provisioning these many clusters from ground up required lot of man hours and involved setting up virtual machines, monitoring agents, access management, configuration management, load testing and data stream integration. Our current setup has Flink over Yarn clusters as well as Kubernetes. We use our in-house provisioning tool Odin, built on top of Terraform and Chef for Yarn clusters and Kubernetes controllers for Kubernetes based deployments. It enables us to safely and predictably create and modify Flink infrastructure. Odin has helped us reduce provisioning time by 99% despite increasing number of requests. 2. Isolation and access control: Given the real-time and distributed nature of GoJek's services, events are classified into different streams depending on nature, time and transactional criticality, sensitivity and volume of data. Which requires setting up separate clusters based on security concerns, team segregation, job loads and criticality which comes at the cost of handling large volume data replication and maintenance. 3. Data quality control: The quality of ingestion events are controlled by Protobuf based version controlled strict event type schema with fully automated deployment pipeline. Deployed jobs are locked to a certain data schema and version which helps us accidental breaking schema changes and backward compatibility during migration and failover. 4. Monitoring and alerting: All the clusters are monitored using dedicated TICK setup. We monitors clusters for resource utilization, job stats and business impact per job. 5. Failover and Upgrading: Failover and upgrade operations are fully automated for yarn cluster failover, input stream failovers e.g. Kafka failover with stateless job strategies. Which helps us moving jobs from one cluster to another without any data loss or broken metric flow. 6. Chaos engineering and load testing: Loki is our disaster simulation tool that helps ensure the Flink infrastructure can tolerate random failures and excess job load. It exposes engineers to failures more frequently and incentivises them to build resilient services.
Ravi Suhag works on the Data Engineering team at GoJek which is responsible for handling data infrastructure for all GoJek Products. To know more about the speaker please visit: www.ravisuhag.com
Sumanth works on the Data Engineering team at GoJek which is responsible for handling data infrastructure for all GoJek Products.