Eliminating order-dependent test failures using `rspec --bisect`

Last week, we dedicated some time to improving the reliability of the Enectiva test suite. Enectiva is a non-trivial application, so its test suite is approaching 9K test cases and weird stuff can happen.

As with any project older than a few months, there are tests of varying quality and value. Following the scout rule, we try to improve the tests bit by bit, but sometimes more targeted action is called for. Among the targets are flaky, order-dependent test failures: tests which pass when run on their own but fail when run in a specific sequence because one test affects the others. Let’s look at an example.

Detecting order-dependent test failures using rspec --bisect

Getting rid of order-dependent failures is usually not hard; it’s just boring and time-consuming. You know which test case fails, but you don’t know which test is causing it to fail. With ten tests, it’s easy to figure out, but it’s a daunting task when your haystack contains thousands of test cases. Luckily, there are tools to automate it; for our setup it’s rspec --bisect.

In order to weed order-dependent tests out, you first need to get a test suite run which manifests the failure. If you follow best practices, you run your tests in random order. This ensures that you are exposed to as many order-dependent failures as possible - if you ran your tests in the same order every time and the problem didn’t manifest, you’d never find out about it [1]. Randomization surfaces the problem but makes it hard to reproduce, since every run is different. The tool authors have thought about this, so most tools give you a seed with every run.

The seed is the argument to the pseudo-random generator which shuffles the test cases. If you run the same suite with the same seed, the tests will be executed in exactly the same order. This allows you to grab the seed of a run which manifests the failure and reproduce it at will.
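The principle is easy to demonstrate in plain Ruby (this is an illustration of seeded shuffling, not RSpec’s actual implementation):

```ruby
# Shuffling with an explicitly seeded PRNG is deterministic: the same
# seed always yields the same order, which makes a failing run reproducible.
tests = %w[user_spec order_spec invoice_spec report_spec login_spec]

run1 = tests.shuffle(random: Random.new(1234))
run2 = tests.shuffle(random: Random.new(1234))
run3 = tests.shuffle(random: Random.new(5678))

run1 == run2  # same seed, same order
run1 == run3  # different seed, (almost certainly) a different order
```

This is exactly why copying the seed from a failing CI run onto your local machine reproduces the same ordering.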

Equipped with a checked-out commit and a seed, we can use bisect to find a minimal set of tests which leads to the failure. Ideally, this will be two tests - one causing the failure, another one manifesting it. Here is how bisect finds them:

  1. run the test suite as usual (with the same order) to make sure there is a test failure;
  2. ignore all tests after the failing one, they can’t affect it;
  3. run the test suite in the same order again but include only a half of the tests before the failing one;
  4. if the failure reoccurs, the affecting test is in the selected half, otherwise, it’s in the other one;
  5. take this half of the tests and run the test suite only on a half of those [2]; and
  6. repeat, repeat, repeat until it shrinks down the number of tests to a minimum which produces the error.

There are undoubtedly many subtleties and details that are happening under the hood, but as long as you only need to use bisect, this should be a good enough mental model.
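For the single-culprit case, the core of that procedure is a binary search. Here is a toy sketch of the idea (mine, not RSpec’s code), where `run_fails` stands in for “run this subset plus the failing test and check whether it fails”:

```ruby
# Repeatedly halve the candidate list, keeping whichever half still
# triggers the failure, until a single culprit remains.
def find_culprit(tests, run_fails)
  candidates = tests
  while candidates.size > 1
    half = candidates.take(candidates.size / 2)
    candidates = run_fails.call(half) ? half : candidates - half
  end
  candidates.first
end

# Hypothetical suite of 10 tests where test 7 pollutes shared state:
tests = (1..10).to_a
run_fails = ->(subset) { subset.include?(7) }
find_culprit(tests, run_fails)  # => 7
```

The real tool has to cope with multiple interacting tests and with re-running actual RSpec processes, but the halving logic is the same.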

Now, what did bisect uncover in our test suite?

Failure 1 - test libraries

Using rspec --bisect I managed to find a problematic interaction between a legacy controller test and a feature test. The feature test created a user with a last-login time in order to test password recovery. However, the last-login time wasn’t being saved by the factory. I didn’t find the root cause and elected to rewrite the test setup from

create :user, last_login_at: 2.days.ago

to

create :user do |user|
  user.update_attributes last_login_at: 2.days.ago
end

It’s not pretty (and adds an extra SQL UPDATE), but it’s better than occasionally failing tests.

Failure 2 - master at hiding

The second attempt at finding a minimal sequence was less successful. After 15 hours, I killed bisect in round 683. This can happen: the starting conditions significantly affect bisect’s performance. If the failing test is eleventh with that particular seed, bisect only needs to search ten cases; if it’s the 9001st, bisect has to search 9000 test cases, and that takes time. It would eventually converge to a minimal test sequence reproducing the failure, but sometimes it’s simply easier to wait for another ordering (another seed) which yields the same failure in a position where bisect can find a minimal reproduction sequence faster.

Something sneakier might also be going on: more than two tests conspiring against you, which slows bisect down dramatically. I have a hunch that this is what’s happening here, but not enough time to verify it.

Failure 3 - under-defined behaviour

The second scenario I tracked down successfully had been popping up for quite some time, but only sparsely. It didn’t make much sense because the code under test returns a hierarchy of objects flattened into an array sorted in a nearly breadth-first fashion. For example, for a forest like:

- Complex
    - Building A
        - Office unit A1
    - Building B
        - Office unit B1
- Standalone building C
    - Office unit C1

the returned array would be Complex, Standalone building C, Building A, Building B, Office unit A1, Office unit B1, Office unit C1.

In order to present the forest consistently, we sort each level by type of node and then by name (e.g. all office buildings A-Z, then all apartment buildings A-Z). To complicate the scenario, when a user is not allowed to access all the nodes, the top-most accessible nodes are presented as if they were roots and their ancestors are withheld, e.g. if a user has access only to Building A (+ subtree), Office unit B1 and Standalone building C (+ subtree), their view of the world would consist of:

- Building A
    - Office unit A1
- Office unit B1
- Standalone building C
    - Office unit C1

The problematic test was expecting the nodes in the order: Standalone building C, Building A, Office unit A1, Office unit B1, Office unit C1 (a subset of the complete sequence). I’d get this order when running the test in isolation, but from time to time it would fail with Office unit C1 jumping to the second position, in front of Building A. Why?

We use the ancestry gem to manage the tree structure. Of the several approaches to modelling a forest in a relational database, ancestry uses an ancestry field (and a DB column) holding the path of IDs describing the node’s position in the tree, from a root down to the node. Rewriting the tree above with IDs and ancestry values, we get (ancestry, ID, name):

- '', 1, Complex
    - '1', 2, Building A
        - '1/2', 3, Office unit A1
    - '1', 4, Building B
        - '1/4', 5, Office unit B1
- '', 6, Standalone building C
    - '6', 7, Office unit C1

Clearly, when sorting this tree by ancestry (and breaking ties by type and name), the first records will be Complex and Standalone building C because their ancestry is empty. Next come Complex’s children, because their ancestry is '1'. If this were a true breadth-first order, Office unit C1 would come next, but because we rely on the ancestry field, which uses numeric IDs, we get Office unit A1 and Office unit B1 instead.

Clearly, the implementation drives the design here. I was a bit surprised that this hadn’t been discussed earlier, but in the end, it doesn’t make any difference to the user. We feed the array to a library which renders the forest, and while it relies on the order of nodes, the order matters only among sibling nodes on one level in the same branch - and that is satisfied.

Now, back to the occasional failure: because the ordering is not truly breadth-first but only approximates it by relying on numeric IDs, the IDs matter. If the IDs are assigned so that Standalone building C’s ID is lower than Building A’s, Office unit C1 will appear earlier than Office unit A1. Given that the test setup is sequential and IDs are assigned from an ascending sequence, this cannot happen within the test alone. However, the sorting is not numeric, it’s alphanumeric! Standalone building C can get ID 10 and Building A can get 5 if there’s an earlier test which “uses up” IDs 1-3. And while 5 < 10 (integers), '5' > '10' (strings), which yields the alternative ordering.
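The string-versus-integer comparison is easy to check in an IRB session:

```ruby
# Integers compare numerically; strings compare lexicographically,
# character by character - '5' sorts after '10' because '5' > '1'.
5 < 10           # => true
'5' < '10'       # => false
[10, 5, 2].sort  # => [2, 5, 10]
%w[10 5 2].sort  # => ["10", "2", "5"]
```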

So in the end, the problem was the IDs. Even though we eliminated hardcoded IDs from the test suite a long time ago (precisely because they lead to order-dependent failures), they can still cause problems. Given the relaxed requirements, the solution was to allow both orderings as valid. To completely eliminate the variability, we’d have to force the order to be truly breadth-first - sorting first by the depth of the node, then by ancestry and the other fields [3].
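A truly breadth-first ordering could be sketched like this, using the example tree from above (the node hashes and the depth computation are mine for illustration, not the actual Enectiva code):

```ruby
# Sorting by [depth, ancestry, name] makes depth the primary key, so the
# order no longer depends on how the numeric IDs happen to compare as strings.
nodes = [
  { ancestry: '',    name: 'Complex' },
  { ancestry: '1',   name: 'Building A' },
  { ancestry: '1/2', name: 'Office unit A1' },
  { ancestry: '1',   name: 'Building B' },
  { ancestry: '1/4', name: 'Office unit B1' },
  { ancestry: '',    name: 'Standalone building C' },
  { ancestry: '6',   name: 'Office unit C1' },
]

depth = ->(node) { node[:ancestry].empty? ? 0 : node[:ancestry].split('/').size }

nodes.sort_by { |n| [depth.call(n), n[:ancestry], n[:name]] }.map { |n| n[:name] }
# => ["Complex", "Standalone building C", "Building A", "Building B",
#     "Office unit C1", "Office unit A1", "Office unit B1"]
```

With this ordering, Office unit C1 always lands after Building A and Building B, regardless of which IDs the database hands out.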

To recap, this exercise revealed three different semi-random test failures. One was caused by a weird library interaction, another by an under-defined algorithm (and test) combined with reliance on specific numeric IDs, and the third is a master at hiding.

That’s a pretty typical result. We’ve eliminated two out of three issues. One tried to teach us something about our tools, one led us to a better understanding of the application, and one absolutely guarantees that the suite is going to fail again in the future without an apparent reason. Even if we managed to track that one down, there’s no guarantee that there are no other combinations of failing tests. We haven’t eliminated all of them, just raised our confidence in the reliability of the test suite - when it fails, it’s more likely to actually mean a problem.

  1. Unless you just introduced the problem - either the test which affects others or the one which is affected by them. At that point, you’ll know immediately, but only about that one combination. ↩︎

  2. It’s basically a binary search. ↩︎

  3. This could be a valid use case for Postgres’ generated columns, which I just learned about. ↩︎

We're looking for developers to help us save energy

If you're interested in what we do and you would like to help us save energy, drop us a line at jobs@enectiva.cz.
