Oliver Koo | Staff Software Engineer
At Pinterest, our mobile infrastructure is core to delivering a high-quality experience for our users. In this blog, I’ll showcase how the Pinterest Mobile Builds team has been leveraging Honeycomb(™) since 2021 to enhance observability and performance in our mobile builds and continuous integration (CI) workflows.
Our mobile builds team relies on Honeycomb(™) as a powerful data engine to visualize build metrics, analyze trends, and make data-driven decisions. From monitoring build times to categorizing errors, Honeycomb(™) gives us critical insights into our CI workflows, enabling us to proactively address issues and optimize performance.
We’ve built dashboards that establish baseline metrics, tracking key CI indicators like build times, pipeline success rates, and cluster utilization for both iOS and Android builds. While many data platforms or CI providers can offer these capabilities out of the box, the real magic happens when we need to go deeper: when trends look abnormal, or when nuanced analysis is required to uncover hidden issues.
This is where Honeycomb(™) truly excels. Its intuitive query builder makes slicing and dicing data seamless, enabling us to drill into granular details with ease. Features like derived columns let us create dynamic metrics on the fly, while its blazing-fast performance ensures that even with 1 million events sent daily[1] just for our CI build dataset, most queries complete in under a second.
This unparalleled visibility transforms how we understand and improve our CI processes. We can pinpoint bottlenecks, diagnose issues in near real time, and implement improvements faster than ever before, all with the confidence that we’re making informed, data-driven decisions. Honeycomb(™) doesn’t just give us data, it gives us clarity.
Here are a couple of examples of how I used Honeycomb(™) to uncover interesting patterns in our CI builds and pinpoint bottlenecks.
Recognizing Bottleneck Jobs in Builds
When querying our build counts along with p95 and p50 build times in a CI pipeline, I noticed two distinct scenarios:
- On the left, there’s a spike in build count, but the p95 and p50 metrics remain unchanged. Since build times aren’t impacted, there’s no need to investigate further, allowing me to save time and focus elsewhere.
- On the right, the build volume remains consistent, but there’s a noticeable spike in the p95 build time. This deviation is worth investigating further.
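The triage rule behind these two scenarios can be sketched in a few lines of Python. This is only an illustration, not how Honeycomb(™) computes its percentiles: the nearest-rank `percentile` helper, the `classify_window` function, and the 1.25x tolerance are all assumptions made for the example.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def summarize(durations):
    """Build count plus p50/p95 build time for one time window."""
    return {
        "count": len(durations),
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
    }


def classify_window(baseline, current, tolerance=1.25):
    """Decide whether a window deserves a closer look (hypothetical rule)."""
    if current["p95"] > baseline["p95"] * tolerance:
        return "investigate"   # build times moved, whatever the volume did
    if current["count"] > baseline["count"] * tolerance:
        return "volume only"   # more builds, same build times
    return "normal"


# Dummy data: build durations in minutes.
baseline = summarize([10] * 20)
busy = summarize([10] * 40)              # count spike, timings unchanged
slow = summarize([10] * 17 + [30] * 3)   # same count, slow tail

print(classify_window(baseline, busy))
print(classify_window(baseline, slow))
```

Checking p95 before the build count mirrors the reasoning above: a volume spike alone is not worth chasing unless build times moved too.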
By clicking into the specific build causing the spike, I can view the build trace. In Honeycomb(™) terms, a “trace” represents a complete unit of work for one or more services in an environment. In our case, the trace corresponds to a CI build, with child spans representing individual jobs within that build. These spans can in turn include child spans for job steps, such as script execution or other tasks within a job.
The trace view revealed that one job, “super secretive tests”, was taking significantly longer to complete, becoming the bottleneck and causing the spike in p95 build time. Since one slow build isn’t enough to move the p95 metric, I hypothesized that similar slowdowns were occurring across other builds. To investigate further, I looked up the Buildkite URLs using the web_url attribute in Honeycomb(™) and analyzed more builds directly in Buildkite.
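To make the build, job, and step hierarchy concrete, here is a minimal sketch that models a trace as a tree of spans and walks down the slowest child at each level. The `Span` class, its field names, and the durations are hypothetical and use dummy data; they are not the Honeycomb(™) data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work: a build, a job, or a step inside a job."""
    name: str
    duration_s: float
    children: list = field(default_factory=list)


def bottleneck_path(root):
    """Follow the slowest child at each level, from the build down to a step."""
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.duration_s)
        path.append(node.name)
    return path


# Dummy trace: one CI build with three jobs, one of them broken into steps.
build = Span("build", 900, [
    Span("lint", 60),
    Span("super secretive tests", 700, [
        Span("agent wait", 30),
        Span("script execution", 650),
    ]),
    Span("unit tests", 140),
])

print(" > ".join(bottleneck_path(build)))
```

Running the same walk over several builds from the spike window is one way to check whether the same job keeps showing up as the bottleneck.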
You might notice this trace view is similar to the “Waterfall View” that Buildkite launched in 2023. However, we continue to use the Honeycomb(™) trace view for several reasons:
1. Seamless Integration with Honeycomb(™): The trace view integrates directly with Honeycomb(™), allowing us to move from analyzing build trends to zooming into specific builds for a deeper dive.
2. Flexibility and Customization: Honeycomb’s trace view lets us break Buildkite builds down into more than just jobs, separating out specific segments such as agent wait time and script execution. It allows us to log and analyze the parts of the build and job that are most relevant to our workflows, such as the execution of various build hooks or environment setups. We can even go deeper and instrument build scripts to log the build time of specific segments within the script. To illustrate my points, I created a demo image below using dummy data. This image shows an example build where each Buildkite job is broken down into a sequence of executions within the job. Additionally, within a Bazel build script, we instrumented the process to log the execution time of specific Bazel targets. If desired, you could even log the build time for each target individually. The possibilities are endless!
Later we can aggregate these segments to answer questions like, “What’s the average repo cloning time across different pipelines?” or, “What are the p50 and p95 times for Bazel build and test phases in my PR pipeline?” These are valuable observability metrics that can help your team prioritize optimizations, reduce build times, and improve overall developer productivity.
3. Established Habits: We’ve been using Honeycomb’s trace view since 2021, long before the Waterfall View was launched. By now, it’s become a familiar and trusted part of our process.
These advantages make Honeycomb’s trace view an invaluable tool for understanding our CI processes, diagnosing issues, and improving efficiency.
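As a rough illustration of the segment aggregation mentioned in the second point, the sketch below answers the “average repo cloning time across different pipelines” question over a handful of dummy events. The event shape (`pipeline`, `segment`, `duration_s`) and the pipeline names are assumptions for this example, not our actual schema; in practice this grouping is a query in Honeycomb(™) rather than custom code.

```python
from collections import defaultdict
from statistics import mean


def average_by_pipeline(events, segment):
    """Average duration of one segment, grouped by pipeline."""
    grouped = defaultdict(list)
    for event in events:
        if event["segment"] == segment:
            grouped[event["pipeline"]].append(event["duration_s"])
    return {pipeline: mean(durations) for pipeline, durations in grouped.items()}


# Dummy segment events, one per instrumented step of a job.
events = [
    {"pipeline": "ios-pr", "segment": "repo_clone", "duration_s": 40.0},
    {"pipeline": "ios-pr", "segment": "repo_clone", "duration_s": 60.0},
    {"pipeline": "ios-pr", "segment": "bazel_build", "duration_s": 420.0},
    {"pipeline": "android-pr", "segment": "repo_clone", "duration_s": 30.0},
    {"pipeline": "android-pr", "segment": "bazel_test", "duration_s": 300.0},
]

print(average_by_pipeline(events, "repo_clone"))
```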
Using Correlation to Identify Potential Root Causes
Honeycomb’s correlation feature is another game changer. It allows us to overlay query results with other dashboards, creating a breadcrumb trail to identify abnormalities or outliers.
For instance, I noticed a spike in p95 build times for iOS CI jobs. Using correlation, I compared the p95 data to CI cluster utilization graphs and noticed a simultaneous spike in job wait times. Honeycomb’s synchronized dotted line across graphs confirmed the alignment, leading to a strong hypothesis: long CI agent wait times were causing the build time spike.
From there, I clicked into the build trace to confirm my hypothesis. Sure enough, the trace revealed that the build experienced unusually long wait times for CI agents. By sampling more builds from the same time period, I could confirm the root cause and focus on solutions.
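The idea of lining up two graphs on the same time axis can be approximated in code. The sketch below is not Honeycomb’s correlation feature, just a simple stand-in that flags time windows where both series spike together; the “twice the median” spike rule and all values are assumptions using dummy data.

```python
from statistics import median


def spike_windows(series, factor=2.0):
    """Time windows whose value exceeds factor x the series median."""
    baseline = median(series.values())
    return {window for window, value in series.items() if value > factor * baseline}


def simultaneous_spikes(first, second, factor=2.0):
    """Windows where both series spike at the same time."""
    return sorted(spike_windows(first, factor) & spike_windows(second, factor))


# Dummy hourly series, in seconds.
p95_build_time = {"09:00": 600, "10:00": 620, "11:00": 1500, "12:00": 610}
agent_wait_time = {"09:00": 20, "10:00": 25, "11:00": 300, "12:00": 22}

print(simultaneous_spikes(p95_build_time, agent_wait_time))
```

A shared spike window is only a hypothesis, which is why the trace-level confirmation described above still matters.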
Without Honeycomb(™), conducting this type of investigation would be extremely tedious, requiring a manual, build-by-build analysis. Honeycomb(™) provides a holistic view that lets us quickly pinpoint root causes, saving time and effort while improving our CI process efficiency.
One of our recent initiatives with Honeycomb(™) is error categorization for mobile builds. While still in its early stages, the results have been promising. Our main goals are:
- Deeper Insight into Build Failures: CI build failures can stem from various causes, such as compilation errors, flaky tests, or network issues. By analyzing logs and extracting specific errors, we’ve identified the top contributors to CI instability. This insight allows us to prioritize resources and address critical issues more effectively.
- Streamlining On-Call and Reducing Noise: Historically, our team was notified of every CI issue, regardless of the root cause. With error categorization, we can now classify failure types in real time and automate alerts, routing them to the appropriate team’s on-call channel. This streamlines on-call duties and minimizes interruptions. For instance, test failures now automatically notify the responsible team without requiring our intervention.
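A minimal sketch of the classify-then-route idea, assuming a simple first-match regex rule table. The category names, patterns, and channel names below are hypothetical placeholders for illustration, not our production rules.

```python
import re

# Hypothetical rules, checked in order: (category, pattern, on-call channel).
RULES = [
    ("network",
     re.compile(r"connection (reset|timed out)|could not resolve host", re.I),
     "#ci-infra-oncall"),
    ("test_failure",
     re.compile(r"\btests? failed\b|failed in", re.I),
     "#owning-team-oncall"),
    ("compilation_error",
     re.compile(r"\berror:|compilation failed", re.I),
     "#owning-team-oncall"),
]

# Anything unrecognized falls back to the builds team (hypothetical channel).
DEFAULT = ("unknown", "#mobile-builds-oncall")


def categorize(log_text):
    """Return (category, channel) for the first rule that matches the log."""
    for category, pattern, channel in RULES:
        if pattern.search(log_text):
            return category, channel
    return DEFAULT


print(categorize("fatal: Could not resolve host: github.com"))
print(categorize("//app:unit_tests FAILED in 12.3s"))
print(categorize("Foo.swift:10:5: error: cannot find 'bar' in scope"))
```

Rule order matters in this sketch: network failures are checked first so that an infrastructure problem is not misreported as a test or compile failure.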
While the system is still being refined, it has already proven to be a valuable tool for improving CI management efficiency. The diagram below illustrates the architecture of our error categorization system, showing how we integrate Buildkite logs with Honeycomb(™) by leveraging AWS EventBridge and the Buildkite Jobs API.
While Honeycomb(™) is essential for CI observability, its applications extend beyond build metrics. Teams across Pinterest use it to gain real-time insights into performance data and tailor observability to their needs.
For instance, we track iOS local build metrics alongside machine details in Honeycomb(™), which helps us prioritize laptop upgrades for developers. Another use case involves analyzing Android Develocity build data (read more about this in another Pinterest Engineering blog post).
At Pinterest, we’re continuously improving our build processes, and Honeycomb(™) has been an important partner in this journey. We’re excited to explore new use cases and expand our data-driven observability practices, enabling our teams to focus on delivering exceptional user experiences.
[1] Pinterest Internal Data
[2] Pinterest Internal Data
[3] Pinterest Internal Data
[4] Pinterest Internal Data
[5] Pinterest Internal Data
[6] Pinterest Internal Data
To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.