How to Retry Failed Tests in Parallel

As you may know from experience, tests are often unstable. A common workaround is to retry failed tests several times, but this can double or even triple the build time.

In this article, we’ll explain how we solved this problem and share a tool that our engineers developed for successfully retrying failed tests in parallel.

Work at Wrike

Our autotest project contains over 53,000 tests that we run in anywhere from 80 to 150 threads, depending on the build. However, we found that most of the build time is often taken up by retries of a few tests, which leave most of the threads idle, and we wanted to find a way to reduce this. (After all, we pay for the dynamic agents in TeamCity and the dynamic environment!)

Here’s an example of a build timeline from Allure. In this build, 50 seconds of work out of 90 is spent retrying one test:

Because of this result, we wanted to reduce the retry time by using more threads.

The problem of long retries in JUnit 5

In the autotest project, we use Java SE 17 and JUnit 5, with Maven as the build tool, so the tests are run via the Maven Surefire Plugin.

Previously, we used JUnit 4, and the Surefire Plugin would retry the failed tests of each class without waiting for the first run of the other classes to finish.

Test timeline: With JUnit 4, retries of Tests 1 and 2 start regardless of whether Test 3 has finished, provided all tests are in different classes.

But now, with JUnit 5, the Maven Surefire Plugin waits for Test 3 to finish first before retrying Tests 1 and 2. This increases the test runtime for a project with a large number of classes.

Test timeline 2: With JUnit 5, Tests 1 and 2 are retried only once Test 3 is complete, even if the tests are in different classes.
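The difference between the two schedules can be put into numbers with a toy model. The sketch below is not Surefire code: the class, the record, and all durations are hypothetical, and it assumes one free thread per test and round-based retries (each retry round waits for the previous round to finish).

```java
import java.util.List;

// Toy model of the two retry schedules described above.
// All names and durations are made up for illustration.
final class RetryTimeline {

    // One test: duration of a single run and how many times it fails before passing.
    record TestCase(String name, int runSeconds, int failuresBeforePass) {}

    // JUnit 4-style: each test is retried right after its own failure,
    // so it occupies its thread for (failures + 1) consecutive runs.
    static int retryImmediately(List<TestCase> tests) {
        int finish = 0;
        for (TestCase t : tests) {
            finish = Math.max(finish, t.runSeconds() * (t.failuresBeforePass() + 1));
        }
        return finish;
    }

    // JUnit 5-style: a retry round starts only after every test
    // of the previous round has finished.
    static int retryAfterAllFinish(List<TestCase> tests) {
        int maxFailures = 0;
        for (TestCase t : tests) {
            maxFailures = Math.max(maxFailures, t.failuresBeforePass());
        }
        int total = 0;
        for (int round = 0; round <= maxFailures; round++) {
            int roundLength = 0;
            for (TestCase t : tests) {
                // A test takes part in this round if it has not passed yet.
                if (t.failuresBeforePass() >= round) {
                    roundLength = Math.max(roundLength, t.runSeconds());
                }
            }
            total += roundLength;
        }
        return total;
    }

    public static void main(String[] args) {
        List<TestCase> tests = List.of(
                new TestCase("Test 1", 20, 1),
                new TestCase("Test 2", 20, 1),
                new TestCase("Test 3", 40, 0));
        System.out.println("Retry immediately:      " + retryImmediately(tests) + " s");
        System.out.println("Retry after all finish: " + retryAfterAllFinish(tests) + " s");
    }
}
```

With these made-up numbers, the first schedule finishes in 40 seconds and the second in 60, because the retries of Tests 1 and 2 cannot start until Test 3 is done.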

As the number of modules in our Maven project grew, this problem became even more acute. Each module would wait for the test run to complete and then retry the tests — only after that did the next module’s tests start.

We partially solved this problem with our proprietary tool, Maven Modules Merger, which reduces build time by merging several Maven modules into one. You can read more about that in this article.

But even with our Merger tool, retries could still take up most of the build time (see the “All tests in one module” scenario in the image above).

So we had an idea: What if we could retry tests in parallel and not wait until the test fails several times in a row? It would certainly take much less time. A test would be considered passed if it had passed at least once already.

Here is a timeline for parallel retries:

The idea is that each test would be retried several times in parallel, increasing the number of repetitions for each test but reducing the overall build time. This method would also give us more statistics on failed tests, because they would run more times.
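As a rough illustration of the idea, the sketch below runs several attempts of one test at the same time and treats the test as passed if any attempt passes. It is a minimal stand-alone example built on `ExecutorService`, not the tool described in this article; the class and method names are hypothetical, and a "test" is modeled as a `Callable<Boolean>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs several attempts of one test in parallel.
final class ParallelRetry {

    // Returns true if at least one of `attempts` parallel runs passes.
    // A run passes when the Callable returns true without throwing.
    static boolean passesAtLeastOnce(Callable<Boolean> test, int attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts);
        try {
            // Start all attempts at once instead of one after another.
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < attempts; i++) {
                results.add(pool.submit(test));
            }
            boolean passed = false;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        passed = true;
                    }
                } catch (ExecutionException failedAttempt) {
                    // An exception fails this attempt only, not the whole test.
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    return passed;
                }
            }
            return passed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Simulated flaky test: only the third invocation passes.
        AtomicInteger calls = new AtomicInteger();
        boolean passed = passesAtLeastOnce(() -> calls.incrementAndGet() == 3, 3);
        System.out.println("Passed at least once: " + passed);
    }
}
```

This version waits for every attempt to finish, which keeps the statistics for all runs. A variant could stop at the first success, at the cost of losing the results of the remaining attempts.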

The only question that remained was whether the tests retried in parallel could provide the same success rate as those retried sequentially, so we decided to give it a go.

However, we didn’t find any ready-made solutions for parallel retries. We tried to modify the JUnit 5 extension from junit-pioneer, but it’s implemented through TestTemplate, meaning we couldn’t use it with another TestTemplate (e.g., with parameterized tests; see issue #405). Modifying RepeatedTest wasn’t an option for the same reason: it is also a TestTemplate, so it doesn’t work with parameterized tests either. JUnit 5 itself does not support even sequential retries by default.

So we decided to extend the JUnitPlatformProvider class from the Maven Surefire Plugin, which can retry tests sequentially.

Implementing parallel retries

During the implementation, we encountered the following two major problems:

The Allure report might mark a test as failed even if it has passed once.
The standard JUnit 5 synchronization mechanisms only work within a single test run. This means that @ResourceLock, @Execution, and @Isolated annotations will not work correctly in a parallel retry.

Fixing the Allure report

During a parallel retry, a test might be marked as failed in the Allure report even though one of its attempts passed, for example when an earlier attempt succeeded and a later one failed. This is because Allure sorts the results of each test by start time and takes the status from the latest one, as shown below:

We wanted a test that passed at least once to be marked as passed. For that to happen, every failed attempt needs a start time earlier than that of the successful one.

The logic for determining the order of retries cannot be changed, because retries are sorted when the Allure report is compiled from the result files. However, the start times in these files can be edited: the start time of each failed attempt can be replaced with the start time of the retry run. A successful attempt starts later than that, so when the results are sorted, one of the successful retries is always the last one.
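The start-time rewrite can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `Attempt` is a hypothetical stand-in for one Allure result file (status plus start time in milliseconds), "the retry start time" is read as the moment the parallel retry run began, and every passed attempt is assumed to start strictly after that moment.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the start-time rewrite. `Attempt` is a hypothetical stand-in
// for one Allure result file, not a real Allure class.
final class AllureStartTimeFix {

    record Attempt(String status, long startMillis) {}

    // Moves every non-passed attempt back to the moment the retry run began,
    // so that a passed attempt, if there is one, sorts last.
    static List<Attempt> rewriteStartTimes(List<Attempt> attempts, long retryRunStartMillis) {
        List<Attempt> fixed = new ArrayList<>();
        for (Attempt a : attempts) {
            boolean passed = "passed".equals(a.status());
            fixed.add(passed ? a : new Attempt(a.status(), retryRunStartMillis));
        }
        return fixed;
    }

    // Mimics how the final status is chosen: the attempt with the latest start time wins.
    static String finalStatus(List<Attempt> attempts) {
        return attempts.stream()
                .max(Comparator.comparingLong(Attempt::startMillis))
                .map(Attempt::status)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        // The retry run starts at t=1000; the passed attempt starts before the failed ones.
        List<Attempt> attempts = List.of(
                new Attempt("passed", 1_010L),
                new Attempt("failed", 1_020L),
                new Attempt("failed", 1_030L));
        System.out.println("Before: " + finalStatus(attempts));
        System.out.println("After:  " + finalStatus(rewriteStartTimes(attempts, 1_000L)));
    }
}
```

In this example the status is "failed" before the rewrite, because the latest attempt failed, and "passed" after it. If every attempt fails, all of them end up with the same start time and the test stays failed.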