1. 11
  1. 9

    I’ve done this with teams before. Always regretted it. Flaky Tests should probably be called Sloppy Tests. They always point to a problem that no one wants to take the time to deal with.

    1. 2

      Flakiness isn’t always test flakiness. We also have infra flakiness (intermittent infra failures) that is hard or impossible to solve. On test retries, I somewhat agree with you, but as always this is a matter of tradeoffs: do you prefer faster product development, or an extremely reliable product with developers spending time chasing failures that are false positives most of the time?

      1. 1

        I haven’t tried this retry approach, but my gut reaction is to agree with you. Reading the article my first reaction was “why not just fix the flaky tests”?

        If the tests fail sporadically and often, how can you assume it’s the tests at fault and not the application code? And if it’s the latter, it’s affecting customers.

        1. 1

          When new software is running on pre-production hardware, the line is not so easy to draw. Flaky tests could be caused by one or the other, and filtering them out (based on which one is the cause) is not exactly straightforward.

        2. 1

          It sounds bananas for GUI development, but it can make sense for statistical software, where some failures are to be expected. Maybe failures are unavoidable for some GUI environments? I can’t think why off the top of my head, though.

          1. 1

            The biggest difficulty is that flaky tests in end-to-end environments involve basically every part of the system, so any sort of non-determinism or race condition (timing is almost always at the core of these) can be involved. Thank god JavaScript is single-threaded.

            I once had a test fail intermittently for weeks before I realised that a really subtle CSS rule causing a 0.1s color fade would cause differences in results if the test was executing ‘too fast’.

          2. 3

            Docker can fail to load the container, bundler can fail while installing some dependencies, and so can git fetch. All of those failures can be retried

            If you are retrying an action connected to an external service (whether it’s something you run or something on the internet), please, please implement exponential backoff (here is a personal example). I will never forget the phrase “you are threatening to destabilize Git hosting at Google.”
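            To make the backoff point concrete, here is a minimal Python sketch of retrying with exponential backoff plus random jitter. The function name `retry_with_backoff` and all of its parameters and defaults are made up for illustration; they are not from the article or from any particular library, and the right limits depend on the service you are calling.

```python
import random
import time


def retry_with_backoff(action, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retryable=(Exception,), sleep=time.sleep,
                       rng=random.random):
    """Call action() until it succeeds or max_attempts is reached.

    After failed attempt n (0-based), the wait is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)), so the upper bound doubles each
    time and many clients retrying at once do not hit the service in lockstep.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

            For the `git fetch` case quoted above, something like `retry_with_backoff(lambda: subprocess.run(["git", "fetch"], check=True), retryable=(subprocess.CalledProcessError,))` would retry only on a non-zero exit code. Narrowing `retryable` to errors that are plausibly transient matters: retrying a real bug just wastes CI time.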

            1. 2

              Great post, I’m always interested in how companies deal with flakiness.

              At Mozilla we attempted automatic retries, but we have so much flakiness that it was significantly impacting CI resources and we turned it back off. Instead, we let the next push act as the retry and have dedicated staff to monitor the CI and look for failure patterns (they can still manually retry if needed). There is also tooling to mute known intermittents from the UI, so developers have a better sense of whether or not they broke something.

              Having people manually perform a task that could be automated is not a sexy solution, but it works fairly well and is probably the right trade-off for us in a cost-benefit analysis.

              1. 2

                significantly impacting CI resources

                We’ve seen that too, specifically for iOS, where we have more limited resources :(, but automatic retries are faster than developers looking for failures.

                have dedicated staff to monitor the CI and look for failure patterns

                I interned at Mozilla almost two years ago and I remember that there was a project to solve this. Sad to hear that it hasn’t been fully solved yet.

                1. 1

                  You’re probably thinking of the autoclassify feature. That is being used and has reduced the amount of manual work sheriffs need to do. I don’t think it was ever intended to outright replace the sheriffs though.

                  Tbh, I’m glad Mozilla isn’t throwing crazy resources at the intermittent problem. We have a system that works pretty effectively and for a fraction of the cost it would take to automate. That’s not to say we’ll stop making incremental improvements. Maybe one day it will be fully automated, just not through massive spending and heroic effort.

                2. 2

                  At FB, we retried 3 times; if that didn’t work, we emailed the commit author. If a test was failing (with retries) on lots of diffs we would email the contact address for the feature under test and take the test out of rotation. Very infrequently, we would re-run the failed tests; if one of them passed 50 times, we’d put it back in the rotation (or someone would push a fix and manually re-enable the test).
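                  Roughly, that policy could be sketched like this in Python. This is a reconstruction for illustration, not the actual FB tooling: the class name, the `quarantine_after=10` threshold standing in for "lots of diffs", the reading of "retried 3 times" as one run plus three retries, and the reading of "passed 50 times" as 50 passes in a row are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FlakyTestPolicy:
    retries: int = 3             # from the comment: "we retried 3 times"
    quarantine_after: int = 10   # assumed stand-in for "lots of diffs"
    passes_to_restore: int = 50  # from the comment: "passed 50 times"
    failing_diffs: dict = field(default_factory=dict)  # test name -> diff ids
    quarantined: set = field(default_factory=set)

    def run_on_diff(self, test_name, diff_id, run_test):
        """Return 'skipped', 'passed', 'notify_author', or 'quarantined'."""
        if test_name in self.quarantined:
            return "skipped"
        # One initial run plus up to `retries` retries; stop at the first pass.
        if any(run_test() for _ in range(1 + self.retries)):
            return "passed"
        diffs = self.failing_diffs.setdefault(test_name, set())
        diffs.add(diff_id)
        if len(diffs) >= self.quarantine_after:
            self.quarantined.add(test_name)
            return "quarantined"  # caller emails the feature's contact address
        return "notify_author"    # caller emails the commit author

    def try_restore(self, test_name, run_test):
        """Re-run a quarantined test; restore it only if every run passes."""
        if test_name not in self.quarantined:
            return False
        if all(run_test() for _ in range(self.passes_to_restore)):
            self.quarantined.discard(test_name)
            self.failing_diffs.pop(test_name, None)
            return True
        return False
```

                  Here `run_test` is any callable returning `True` on a pass. Sending the emails is left to the caller, based on the returned status.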

                  significantly impacting CI resources

                  Yes. We did notice that :D

                  1. 1

                    If a test was failing (with retries) on lots of diffs we would email the contact address for the feature under test and take the test out of rotation.

                    We do this as well. Every intermittent has an orangefactor score, which is a proxy for how much pain it causes everyone (basically just failure rate and frequency). Once an intermittent passes a certain threshold, the relevant test is disabled after a short grace period to get it fixed.

                3. 2

                  I feel like monorepos have been a hot topic these last few weeks. At least in my bubble (called work) they are, and it seems to be the same here and on the orange site.

                  Last year, we moved our Android apps and libraries to a monorepo and increased the size of our Android team.

                  Shopify is certainly not just an Android shop, so this move does not mean that everything moved into one repo. I also saw that fuzziness at work, where “Monorepo” was the project name for merging the repositories of two projects, leaving lots of other repos on their own.

                  The term “Monorepo” has lost its literal meaning: Mono = Single as in “a single repo for the whole company”. Even Google does not have a single repo for the whole company because Android, Chrome, Go are not inside the big repository.

                  The term is probably not about size. I would assume we can find a small company which uses a single repo just like Google and Facebook, but the repo is smaller than a project repo somewhere else.

                  Any idea for a good definition?

                  One approach could be: In a common repo you will find some build system stuff at the root. In a monorepo you will find only folders at the root, with the build system stuff inside them. There is no “build everything” in a monorepo.

                  1. 2

                    I always took monorepo to mean per product, not per company.

                    1. 2

                      The definition of monorepo is definitely not clear. We don’t understand it as a single repository for the entire company, but as a repository containing different projects that share code. Using this definition, we have two mobile monorepos: one for Android, and another one for iOS.