I have never seen an organization examine any retries. In fact, the reason they use a retry mechanism is to deliberately reduce the amount of time they spend investigating failures. The underlying assumption is that the failures are due to the test, and not the system. This is a very, very risky assumption.
If an automated test gives different results on subsequent runs, that means that some key variable is not under the test’s control. The thing that’s varying may be in the system, or in the test code or test environment.
Risk 1: Masking real system failures
If the uncontrolled variable is in the system, then each failure is trying to give you information about the system. If you retry these tests, and ignore failures when the subsequent run indicates success, you are deliberately ignoring the very information your test is designed to give you.
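The mechanism is easy to sketch. Here is a naive retry wrapper in Python (the names are hypothetical, purely for illustration): it reports success if any attempt passes, discarding exactly the failure signal described above.

```python
import random

def flaky_check():
    # Stand-in for a test whose outcome depends on some uncontrolled
    # variable; hypothetical, for illustration only.
    return random.random() > 0.3  # fails roughly 30% of the time

def run_with_retries(check, attempts=3):
    """Naive retry wrapper: reports success if ANY attempt passes,
    silently discarding the information carried by the failures."""
    for _ in range(attempts):
        if check():
            return True
    return False
```

With three attempts, a check that fails 30% of the time “passes” about 97% of the time (1 − 0.3³ ≈ 0.973), so the failure that was trying to tell you something about the system is almost never surfaced.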
If you diligently investigate each failure, you reduce this risk. But as I said earlier, I have never seen any organization diligently investigate failures when a retry passes.
Risk 2: Destroying Trust
If the uncontrolled variable is in the test code or test environment, then the test itself is, to some extent, unreliable. In my experience, it takes very, very little unreliability to destroy an organization’s trust in its automated tests. And it is very, very difficult to regain this trust.
My strong advice: Do not use automated retries. Instead, find the uncontrolled source of variability, and get control of it. If you cannot find the source, or if you cannot gain control of it, mark the test as unreliable, and run unreliable tests in a separate test run.
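One way to quarantine unreliable tests, sketched with only the standard library (the `unreliable` decorator and `run_suite` runner are hypothetical names; in a real suite you would use your framework’s tagging, such as pytest markers selected with `-m`):

```python
# Tests tagged as unreliable are excluded from the trusted run and
# exercised in a separate run, so one suite's results stay trustworthy.
UNRELIABLE = set()

def unreliable(test_func):
    """Marks a test as having a known, uncontrolled source of variability."""
    UNRELIABLE.add(test_func)
    return test_func

def run_suite(tests, include_unreliable):
    """Run either the trusted suite or the quarantined one -- never both mixed."""
    selected = [t for t in tests if (t in UNRELIABLE) == include_unreliable]
    results = {}
    for t in selected:
        try:
            t()
            results[t.__name__] = "pass"
        except AssertionError:
            results[t.__name__] = "fail"
    return results

def test_login():
    assert 1 + 1 == 2

@unreliable
def test_depends_on_external_service():
    assert True  # in reality, the outcome varies with the uncontrolled variable

# Trusted run:      run_suite([test_login, test_depends_on_external_service], False)
# Quarantined run:  run_suite([test_login, test_depends_on_external_service], True)
```

The point of the split is that a failure in the trusted run always means something, while the quarantined run is understood to be untrustworthy until its source of variability is found and controlled.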
Reliability is Enormously Important
One trap I see in organizations: “Pass all the tests” as a goal for test automation.
Passing tests is a fine goal when you’re developing the system. It’s dangerous when you’re automating tests. For test automation, the goal is not to pass all the tests. The goal is: Write tests that reliably tell you something you repeatedly want to know.
Reliability is hugely important. Accommodating flaky tests through retries is dangerous. It can mask real system problems. And it can destroy your organization’s trust in its automated tests.
If the tests are not trustworthy, fix them, get rid of them, or mark the untrustworthy runs as untrustworthy.
(Why yes, the horse on which I am sitting is very tall.)