CI/CD Dashboards for Observability of Software Build Failures
What’s the problem with flake, you say?
When you’re one of 42 developers spread across 5 time zones waiting to merge work into master, flaky failures in CI can mean the difference between getting 5 tickets done and merged… versus 0. This problem is severely compounded when the work has complex dependency chains (e.g., Mary’s frontend PR depends on Joe’s backend PR).
What can be done?
A precondition to correcting a problem or optimizing a system is understanding its points of failure and bottlenecks. This is not fun, by the way: you have to look at the data, and in some cases be willing to do some trial-and-error discovery. That’s what I did with one CI pipeline in 2020. Below is a brief show-and-tell of the tools and processes I put in place in the summer and fall of 2020 to quantitatively mitigate our problem of flaky failures.
What rewards did I see?
Sped up the build — from 95 min to 35 min
Reduced flaky failure rates from 0.63% to 0.02%, approaching the probability of being struck by lightning ⚡ (according to the National Lightning Safety Council)
Made failures a lot more actionable — people were notified in Slack right when a failure happened, with clear prompts to take the next logical action: investigate the logs, talk to the owner of the check, report a flaky build
Allowed my team to put more code into production each day. As the saying goes, “coffee is for closers,” and in software you can’t close without merging to prod
Details
In July 2020, we started harvesting metrics from Jenkins about our automated tests on each build (a rough sketch of pulling these from the Jenkins API follows this list) – things like…
Git branch, build #, and Git commit author
Time taken to execute
Total tests passed
Total tests failed
Total tests skipped
Overall status
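Here is a minimal sketch of how per-build metrics like these can be harvested. The Jenkins URL, job name, and build number are placeholders, and the field names assume the standard build and JUnit test-report JSON endpoints, so adjust to your own setup:

```python
# Minimal sketch: harvest per-build test metrics from the Jenkins JSON API.
# JENKINS_URL and JOB are placeholders; field names follow the stock
# /api/json and /testReport/api/json responses and may vary by plugin setup.
import requests

JENKINS_URL = "https://jenkins.example.com"
JOB = "my-pipeline"

def harvest_build_metrics(build_number: int, auth=None) -> dict:
    base = f"{JENKINS_URL}/job/{JOB}/{build_number}"
    build = requests.get(f"{base}/api/json", auth=auth, timeout=30).json()
    tests = requests.get(f"{base}/testReport/api/json", auth=auth, timeout=30).json()

    return {
        "build_number": build_number,
        "status": build.get("result"),               # e.g. SUCCESS / FAILURE / ABORTED
        "duration_min": round(build.get("duration", 0) / 60000, 1),  # Jenkins reports ms
        "authors": [c.get("fullName") for c in build.get("culprits", [])],
        # Git branch/commit details live in the build's "actions" (Git plugin metadata)
        "tests_passed": tests.get("passCount"),
        "tests_failed": tests.get("failCount"),
        "tests_skipped": tests.get("skipCount"),
    }

# Example: push this dict into whatever store backs your dashboard
print(harvest_build_metrics(1234))
```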
We took inventory of our automation
Owners of tests were clearly identified
Every test was clearly described in plain English so that new developers, designers, and product owners could understand what each test was intended to validate without having to decipher a single line of code (a sketch of one way to attach this metadata to tests appears after this list). This made it easy for the right people to chime in and say things like…
Why the f*** are we spending 7 minutes testing THIS?
☝🏽 This triggered an intense, months-long Marie Kondo-inspired drive to rid our pipeline of garbage tests that no longer sparked joy
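To give a flavor of the inventory, here is an illustrative sketch (not our actual code) of tagging each test with an owner and a plain-English description so the metadata can be collected automatically and reviewed by non-developers; the decorator, owner, and test below are hypothetical examples:

```python
# Illustrative sketch: attach an owner and a plain-English description to
# each automated test so an inventory can be generated from the codebase.
def describe(owner: str, description: str):
    def wrapper(test_func):
        test_func.owner = owner          # who to ping when it breaks
        test_func.description = description  # what the test validates, in plain English
        return test_func
    return wrapper

@describe(
    owner="mary@example.com",
    description="A signed-in user can add an item to their cart and check out.",
)
def test_checkout_happy_path():
    ...  # the actual assertions live here
```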
A Shotgun Webhook integration with CI alerted engineers in a Slack channel to test failures as soon as they occurred – even before the build was complete. Notifications included the test description, the Jenkins build URL, and a button to create a Jira ticket right on the spot so the test owner could quickly fix up the test
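For illustration, here is roughly what that notification path can look like using a Slack incoming webhook and a Block Kit button. The webhook URL and the pre-filled Jira create-issue link are placeholders, not our real integration:

```python
# Sketch: post a test-failure notification to Slack with a link to the
# Jenkins build and a button that opens a pre-filled Jira create-issue page.
# SLACK_WEBHOOK_URL and jira_create_url are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_failure(test_name: str, description: str, build_url: str, jira_create_url: str):
    payload = {
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": (
                        f":rotating_light: *{test_name}* failed\n"
                        f"{description}\n<{build_url}|View Jenkins build>"
                    ),
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Create Jira ticket"},
                        "url": jira_create_url,
                    }
                ],
            },
        ]
    }
    requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10).raise_for_status()
```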
Flaky test failure rates went down, month by month (a rough sketch of how such a rate can be computed follows this list):
Jul 2020: 0.63%
Aug 2020: 0.06% (*an outlier: many tests were simultaneously disabled in late July and August to improve their stability, then re-activated later on)
Sep 2020: 0.23%
Oct 2020: 0.13%
Nov 2020: 0.09%
Dec 2020: 0.05%
Jan 2021: 0.02%
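For context on what these percentages mean, here is the working definition I’m assuming (the post doesn’t spell out the denominator): a failure counts as flaky if the same commit passes on retry with no code change, and the rate is flaky failures divided by total test executions.

```python
# Assumed definition (not stated explicitly above): flaky failure rate =
# flaky failures / total test executions, expressed as a percentage.
def flaky_failure_rate(flaky_failures: int, total_test_runs: int) -> float:
    if total_test_runs == 0:
        return 0.0
    return 100.0 * flaky_failures / total_test_runs

# Example: 63 flaky failures across 10,000 test executions -> 0.63%
print(f"{flaky_failure_rate(63, 10_000):.2f}%")
```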
As an organization, we developed a habit of zero tolerance for flaky tests and a firm expectation of full confidence in our test suite
The quality of our CI/CD pipeline improved dramatically
Test stability: 97% improvement
Avg time to remediate flaky tests: down from 11 days to 0 days
Avg time flaky test tickets spend in the open state: down from 8.3 hours to 40 minutes