Ben Willenbring

CI/CD Dashboards for Observability of Software Build Failures

What’s the problem with Flake, you say?

When you’re one of 42 developers spread across 5 time zones waiting to merge work into master, flaky failures in CI can mean the difference between getting 5 tickets done and merged… versus 0. This problem is severely compounded when software work has complex dependency chains — e.g., Mary’s frontend PR depends on Joe’s backend PR.

What can be done?

A precondition to correcting a problem or optimizing a system is understanding its points of failure and bottlenecks. This is not fun, by the way — you have to look at the data and, in some cases, be willing to do some trial-and-error discovery. That’s what I did with one CI pipeline in 2020. Below is a brief show-and-tell of the tools and processes I put into place in the summer and fall of 2020 to measure and mitigate our problem of flaky failures.

What were the rewards I saw?

  • Sped up the build — from 95 min to 35 min

  • Reduced flaky failure rates from 0.63% to 0.02% — approaching the probability of being struck by lightning ⚡ (according to the National Lightning Safety Council)

  • Made failures a lot more actionable — people were notified in Slack right when a failure happened, with clear prompts to take the next logical action: investigate the logs, talk to the owner of the check, report a flaky build

  • Allowed my team to put more code into production each day — as the saying goes, coffee is for closers, and in software, you can’t close without merging to prod


Details

  • In July 2020, we started harvesting metrics from Jenkins about our automated tests on each build (a rough sketch of this harvesting step appears after this list) – things like…

    • Git branch, build #, and Git commit author

    • Time taken to execute

    • Total tests passed

    • Total tests failed

    • Total tests skipped

    • Overall status

  • We took inventory of our automation (a sketch of one way to capture owners and descriptions appears after this list)

    • Owners of tests were clearly identified

    • Every test was clearly described in plain English so that new developers, designers, and product owners could understand what each test was intended to validate without having to decipher a single line of code. This made it easy for the right people to chime in and say things like… Why the f*** are we spending 7 minutes testing THIS?

    • ☝🏽 This triggered an intense, months-long, Marie Kondo-inspired drive to rid our pipeline of garbage tests that no longer sparked joy

  • A Shotgun webhook integration with CI alerted engineers in a Slack channel to test failures as soon as they occurred – even before the build was complete. Notifications included the test description, the Jenkins build URL, and a button to create a Jira ticket on the spot so the test owner could quickly fix the test (a sketch of this style of notification appears after this list)

  • Flaky test failure rates went down

    • Jul 2020: 0.63%

    • Aug 2020: 0.06% (* this is an outlier because many tests were simultaneously disabled in late July and August to improve their stability, then re-activated later on)

    • Sep 2020: 0.23%

    • Oct 2020: 0.13%

    • Nov 2020: 0.09%

    • Dec 2020: 0.05%

    • Jan 2021: 0.02%

  • As an organization, we acquired a new habit of zero tolerance for flaky tests and a firm expectation of full confidence in our test suite

  • The quality of our CI/CD pipeline improved dramatically

    • Test stability: 97% improvement

    • Avg time to remediate flaky tests: down from 11 days to 0 days

    • Avg time flaky test tickets spend in the open state: down from 8.3 hours to 40 minutes
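
Below are a few rough sketches of the pieces described above. First, the harvesting step: the snippet pulls per-build metadata and aggregated test counts from the Jenkins JSON API (the JUnit plugin’s testReport endpoint). It is a minimal illustration rather than the script we actually ran; the Jenkins URL, job name, and credentials are placeholders, and branch detection is omitted because it depends on how your jobs are organized.

```python
# harvest_build_metrics.py -- minimal sketch of pulling per-build test metrics
# from the Jenkins JSON API. JENKINS_URL, JOB_NAME, and AUTH are placeholders;
# the testReport fields come from the standard JUnit report plugin.
import requests

JENKINS_URL = "https://jenkins.example.com"   # placeholder
JOB_NAME = "my-pipeline"                      # placeholder
AUTH = ("ci-bot", "api-token")                # placeholder credentials


def harvest(build_number: int) -> dict:
    base = f"{JENKINS_URL}/job/{JOB_NAME}/{build_number}"

    # Core build metadata: overall status, duration, and the commit authors
    # whose changes landed in this build.
    build = requests.get(f"{base}/api/json", auth=AUTH, timeout=30).json()
    authors = [
        item["author"]["fullName"]
        for cs in build.get("changeSets", [])
        for item in cs.get("items", [])
    ]

    # Aggregated JUnit test report: pass / fail / skip counts for the build.
    report = requests.get(f"{base}/testReport/api/json", auth=AUTH, timeout=30).json()

    return {
        "build_number": build_number,
        "status": build.get("result"),                        # e.g. SUCCESS / FAILURE
        "duration_minutes": round(build.get("duration", 0) / 60000, 1),
        "commit_authors": authors,
        "tests_passed": report.get("passCount"),
        "tests_failed": report.get("failCount"),
        "tests_skipped": report.get("skipCount"),
    }


if __name__ == "__main__":
    print(harvest(1234))  # ship this record to your metrics store of choice
```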
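
Second, the inventory: one lightweight way to keep an owner and a plain-English description attached to every test is to register them in code. The decorator below is purely illustrative (the owner handle, test name, and example description are made up), but it shows how the inventory can be dumped for designers and product owners to read without opening a single test file.

```python
# test_inventory.py -- sketch of a lightweight test inventory. Every test
# carries an owner and a plain-English description of what it validates.
TEST_INVENTORY = {}


def documented_test(owner: str, description: str):
    """Register a test's owner and plain-English intent in TEST_INVENTORY."""
    def wrapper(test_func):
        TEST_INVENTORY[test_func.__name__] = {
            "owner": owner,
            "description": description,
        }
        return test_func
    return wrapper


@documented_test(
    owner="@mary",  # hypothetical owner
    description="A logged-in artist can upload a new version and see it in the review playlist.",
)
def test_version_upload_appears_in_playlist():
    ...  # actual test body lives here


if __name__ == "__main__":
    # Print the inventory so non-engineers can read what each test checks.
    for name, meta in TEST_INVENTORY.items():
        print(f"{name}\n  owner: {meta['owner']}\n  checks: {meta['description']}\n")
```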
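
Third, the failure alerts. The real integration went through Shotgun’s webhook system; the stand-alone sketch below shows the same shape of message using a plain Slack incoming webhook, with the test description, the Jenkins build URL, and a button that links out to Jira. WEBHOOK_URL and JIRA_CREATE_URL are placeholders.

```python
# notify_slack.py -- minimal sketch of a failure alert posted to Slack via an
# incoming webhook. Illustration only; the post's actual integration used
# Shotgun webhooks rather than this stand-alone script.
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"        # placeholder
JIRA_CREATE_URL = "https://jira.example.com/secure/CreateIssue.jspa"   # placeholder


def notify_failure(test_name: str, description: str, build_url: str) -> None:
    """Post a Block Kit message with the failing test and the next actions."""
    payload = {
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f":x: *{test_name}* failed\n_{description}_",
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View Jenkins build"},
                        "url": build_url,
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Create Jira ticket"},
                        "url": JIRA_CREATE_URL,
                    },
                ],
            },
        ]
    }
    resp = requests.post(WEBHOOK_URL, json=payload, timeout=10)
    resp.raise_for_status()


if __name__ == "__main__":
    notify_failure(
        test_name="test_version_upload_appears_in_playlist",
        description="A logged-in artist can upload a new version and see it in the review playlist.",
        build_url="https://jenkins.example.com/job/my-pipeline/1234/",
    )
```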