CI/CD Dashboards for Observability of Software Build Failures


What’s the problem with flake, you say?

When you’re one of 42 developers spread across 5 time zones waiting to merge work into master, flaky failures in CI can mean the difference between getting 5 tickets done and merged… versus 0. This problem is severely compounded when software work has complex dependency chains, e.g., Mary’s frontend PR depends on Joe’s backend PR.

What can be done?

A precondition to correcting a problem or optimizing a system is understanding its points of failure and bottlenecks. This is not fun, by the way: you have to look at the data and, in some cases, be willing to do some trial-and-error discovery. That’s what I did with one CI pipeline in 2020. Below is a brief show-and-tell of the tools and processes I put into place in the summer and fall of 2020 to measure and mitigate our problem of flaky failures.

What were the rewards I saw?

  • Sped up the build — from 95 min to 35 min

  • Reduced flaky failure rates from 0.63% to 0.02%, approaching the probability of being struck by lightning ⚡ (according to the National Lightning Safety Council)

  • Made failures a lot more actionable — people were notified in Slack right when a failure happened, with clear prompts to take the next logical action: investigate the logs, talk to the owner of the check, report a flaky build

  • Allowed my team to put more code into production each day. As the saying goes, coffee is for closers, and in software, you can’t close without merging to prod


Details

  • In July 2020, we started harvesting metrics from Jenkins about our automated tests on each build – things like…

    • Git branch, build #, and Git commit author

    • Time taken to execute

    • Total tests passed

    • Total tests failed

    • Total tests skipped

    • Overall Status
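
As a quick illustration, here is one way the fields above might be pulled from Jenkins’ JSON API. This is a sketch rather than the exact tooling we ran: the environment variables, and the assumption that the job publishes a JUnit-style test report (so /testReport/api/json exists), are mine.

```python
# Sketch only: pull per-build metrics from the Jenkins JSON API.
# The env vars and field lookups below are assumptions; the exact layout of
# "actions"/"changeSets" varies with Jenkins plugins and versions.
import os

import requests

JENKINS_URL = os.environ["JENKINS_URL"]  # e.g. https://jenkins.example.com
AUTH = (os.environ["JENKINS_USER"], os.environ["JENKINS_TOKEN"])


def harvest_build_metrics(job: str, build_number: int) -> dict:
    base = f"{JENKINS_URL}/job/{job}/{build_number}"
    build = requests.get(f"{base}/api/json", auth=AUTH, timeout=30).json()
    tests = requests.get(f"{base}/testReport/api/json", auth=AUTH, timeout=30).json()

    # Git branch/commit live in plugin-specific "actions"; treat as best-effort.
    branch = sha = None
    for action in build.get("actions", []):
        revision = action.get("lastBuiltRevision") if isinstance(action, dict) else None
        if revision:
            sha = revision.get("SHA1")
            branches = revision.get("branch", [])
            branch = branches[0].get("name") if branches else None

    # Commit authors come from the build's changeSets (may be empty on re-runs).
    authors = {
        item.get("author", {}).get("fullName")
        for change_set in build.get("changeSets", [])
        for item in change_set.get("items", [])
    }

    return {
        "job": job,
        "build_number": build_number,
        "git_branch": branch,
        "git_commit": sha,
        "commit_authors": sorted(a for a in authors if a),
        "timestamp": build.get("timestamp"),          # epoch milliseconds
        "duration_minutes": round(build.get("duration", 0) / 60000, 1),
        "tests_passed": tests.get("passCount"),
        "tests_failed": tests.get("failCount"),
        "tests_skipped": tests.get("skipCount"),
        "overall_status": build.get("result"),        # SUCCESS, UNSTABLE, FAILURE, ...
    }
```

Each record can then be written to whatever store backs the dashboard (a database table, a spreadsheet, a time-series service, and so on).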

  • We took inventory of our automation

    • Owners of tests were clearly identified

    • Every test was clearly described in plain English so that new developers, designers, and product owners could understand what each test was intended to validate without having to decipher a single line of code. This made it easy for the right people to chime in and say things like… Why the f*** are we spending 7 minutes testing THIS?

    • ☝🏽 This triggered an intense, months-long, Marie Kondo-inspired drive to rid our pipeline of garbage tests that no longer sparked joy
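
How you record owners and descriptions will depend on your test framework; below is a hypothetical sketch using pytest markers (the marker name, owner, and test are illustrative, not lifted from our pipeline) showing how that metadata can live right next to each test and be enforced at collection time.

```python
# Hypothetical sketch: attach an owner and a plain-English description to each
# test with a custom pytest marker, and refuse to run tests that lack them.
import pytest


# test_checkout.py
@pytest.mark.metadata(
    owner="mary@example.com",  # illustrative owner
    description="A signed-in user can add an item to the cart and check out",
)
def test_checkout_happy_path():
    ...


# conftest.py (register the "metadata" marker in pytest.ini to silence warnings)
def pytest_collection_modifyitems(items):
    """Abort the run early if any collected test is missing its metadata."""
    missing = [
        item.nodeid
        for item in items
        if item.get_closest_marker("metadata") is None
    ]
    if missing:
        raise pytest.UsageError(
            "Tests without owner/description metadata:\n" + "\n".join(missing)
        )
```

Because the owner and description are data rather than code, they can be surfaced in dashboards and in the failure notifications described next.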

  • A Shotgun Webhook integration with CI alerted engineers in a Slack channel to test failures as soon as they occurred – even before the build was complete. Notifications included the test description, the Jenkins build URL, and a button to create a Jira ticket right on the spot so the test owner could quickly fix up the test
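
For a feel of what such a notification can look like, here is a rough sketch of a failure alert posted through a Slack incoming webhook, with a link-style button that opens a pre-filled Jira create screen. The webhook URL and Jira link are placeholders; this illustrates the idea rather than reproducing the actual integration.

```python
# Sketch: post a test-failure alert to Slack via an incoming webhook, with a
# button that opens a pre-filled Jira "create issue" form. The webhook URL,
# Jira site, and project/issue-type IDs below are placeholders.
import urllib.parse

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def notify_test_failure(test_name: str, description: str, build_url: str) -> None:
    jira_link = (
        "https://example.atlassian.net/secure/CreateIssueDetails!init.jspa?"
        + urllib.parse.urlencode(
            {"pid": "10000", "issuetype": "1", "summary": f"Fix flaky/failing test: {test_name}"}
        )
    )
    payload = {
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": (
                        f":rotating_light: *{test_name}* failed\n"
                        f"{description}\n"
                        f"<{build_url}|View the Jenkins build>"
                    ),
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Create Jira ticket"},
                        "url": jira_link,
                    }
                ],
            },
        ]
    }
    requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10).raise_for_status()
```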

  • Flaky test failure rates went down

    • Jul 2020: 0.63%

    • Aug 2020: 0.06% (* this is an outlier because many tests were simultaneously disabled in late July and August to improve their stability, then re-activated later on)

    • Sep 2020: 0.23%

    • Oct 2020: 0.13%

    • Nov 2020: 0.09%

    • Dec 2020: 0.05%

    • Jan 2021: 0.02%
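
In case you want to track a rate like the percentages above, one simple formulation (an assumption on my part, not necessarily the exact arithmetic behind these numbers) is: builds whose failure was later classified as flaky, divided by all builds in the month.

```python
# Sketch: compute a monthly flaky-failure rate from harvested build records.
# Assumes each record carries a "timestamp" (epoch millis, as in the Jenkins
# sketch above) and an assumed boolean "failed_due_to_flaky_test" flag set
# when a failure is later classified as flaky.
from collections import defaultdict
from datetime import datetime


def monthly_flaky_rate(builds: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    flaky: dict[str, int] = defaultdict(int)
    for build in builds:
        month = datetime.fromtimestamp(build["timestamp"] / 1000).strftime("%Y-%m")
        totals[month] += 1
        if build.get("failed_due_to_flaky_test"):
            flaky[month] += 1
    return {month: round(100 * flaky[month] / totals[month], 2) for month in totals}
```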

  • As an organization, we adopted a zero-tolerance attitude toward flaky tests and a solid expectation of full confidence in our tests

  • The quality of our CI/CD pipeline improved dramatically

    • Test stability: 97% improvement

    • Avg time to remediate flaky tests: down from 11 days to 0 days

    • Avg time flaky test tickets spend in the open state: down from 8.3 hours to 40 minutes