Saturday, 25 January 2014

Unit Tests - Best Practice

In my last article I looked at the inherent difficulties of creating Unit Tests. This time I will look at how they are often poorly implemented. For each of the points I also discuss how they can be avoided, that is what is best practice.

1. False success


This is a common problem that is either generally unrecognized or not discussed. At least, I have never seen it discussed.

The root problem is caused by the fact that Unit Tests are code and code often has bugs. Generally, for Unit Tests you think that there is just a binary result - the test passes or fails, but since Unit Tests can also have bugs, there are actually four different possibilities:

     1. Code is correct, Unit Test passes
     2. Code is incorrect, Unit Test fails
     3. Code is correct, Unit Test fails
     4. Code is incorrect, Unit Test passes

The first case is obviously the desirable situation. The next two are problems that will be flagged and consequently tracked down and fixed. Of course, in both cases 3 and 4 there are bugs in the Unit Tests, but case 4 is the major problem as it is indistinguishable from case 1.

Even more insidious is when the code is correct and the Unit Test passes (case 1) then later code changes causes a bug that should be detected by the Unit Test but it still passes (case 4)!

... code changes
cause a bug ...
but the Unit Test
still passes!

To be honest, I've rarely seen this situation but I suspect that there are a large number of hidden problems of this type lurking in Unit Tests. Why? Well, in the past I have fixed quite a few bugs in unit test code which were causing the test to fail (case 3), but I cannot recall ever having discovered and fixed a bug which caused a test to return a false positive (case 4). Since both types of bugs seem equally likely it seems that there are several such problems in those Unit Tests that have never been discovered.

So how do we avoid this problem? Using TDD goes a long way to eliminating the problem. That is, write the test first and make sure it fails before you write the code that makes it pass.

2. Incomplete Tests

Probably the most common problem with Unit Tests is they're incomplete. That is, there are many cases that are not tested. In fact, studies have shown that on
the most common
problem ...
is they're
incomplete
average Unit Tests only cover about half the code (though they usually cover the most commonly executed parts). For example, much error-handling is poorly tested.

A good way to estimate how complete your Unit Tests are is to use a code coverage tool. This checks  which lines of code are actually executed when running the tests. The aim should be to have 100% code coverage. If 100% coverage is not reasonable, for some reason, then the lines that are not tested should be thoroughly code-reviewed.

It is also important to remember that code coverage is only an indicator of how thorough your Unit Tests are. Even with 100% code coverage there may be gaps in your tests. For example, consider this test:


void LessThanTest()
{
    BigInt op1 = new BigInt(1);
    BigInt op2 = new BigInt(2);

    AssertTrue(op1 < op2);  // 1 < 2 is true
    AssertFalse(op2 < op1); // 2 < 1 is false
}

This code may have 100% code coverage but it still does not test the boundary condition where both operands are equal. We also need this test:



    AssertFalse(op1 < op1);  // 1 < 1 is false


3. Poor Code


Some developers have the perception that test code is not as important as application code and does not have to be written to the same standard. This may be a result of conditioning from earlier times where test rigs were written then discarded.

Or it may be a consequence of the "customer-centric" culture of most development. Unit Tests are not part of the application so are of no concern to the customer hence not important.

Forget that. It is just as important for Unit Test code to be well-written and maintained as it is for any code.

This is discussed at length in Clean Code by Robert C. "Uncle Bob" Martin, in the section Keeping Tests Clean (page 123). In that, Uncle Bob explains how a team created quick and dirty test code which made their tests unmaintainable which led to the Unit Tests being abandoned which in turn led to application code that stagnated because it could not easily evolve.

My experience is very different to Uncle Bob's. I have found that Unit Tests rarely need to change as the software evolves (assuming they test one simple thing - see my next point). Instead new tests are created to test new functionality, leaving the old tests unchanged to test old functionality. Only when existing behavior changes do existing tests need to change; but this usually requires a rewrite of the module including completely rewriting all the tests.

Of course, I agree completely with Uncle Bob that Unit Test code should be kept as clean as production code. However, in my experience this is not a major issue as good coders habitually create good code irrespective of where it is used.

The moral is that test code should be of good quality -- but there are many aspects to the quality of Unit Test code. For example, Unit Tests usually need not be optimized; on the other hand, they should not be too slow otherwise they will be disabled (see item 6 below). I discuss other aspects of writing good Unit Tests in the following sections.

4. Multi-purpose Tests

I have seen a lot of tests that are too long, trying to test many things. The problem is that if the behavior of one thing changes the whole test may be affected even though most of what is being tested has not changed. You may have to rewrite a large test just for one small difference.

A better way is to have a separate Unit Test for each piece of functionality. One school of thought is that there should only be one assertion per test. However, I think this is unnecessary and even complicates things. I prefer Uncle Bob's rule of one concept per test (see Clean Code page 131).

Why are large cumbersome tests created? One argument given to me for creating a large multi-purpose test is that the set-up and tear-down would have to be repeated if the code was split up into multiple tests. And copying and pasting the same set-up and tear-down code into multiple tests would be a violation of the DRY (Don't Repeat Yourself) principle.

My original counter-argument to this was that it's better to have multiple simple tests and violate DRY since the test code is not production code. However, this goes against my previous point that Unit Test code is not "2nd class", and hence should not violate DRY. On further consideration the obvious solution is to place the set-up and tear-down code into separate functions that are called as necessary from each of the tests.

It's important to note that one concept per test does not preclude having multiple tests per concept. In fact, it is quite normal to have a lot of almost identical tests verifying different aspects of the same concept.

In summary, each Unit Test should only test one concept. As a consequence you will have a great deal of very short tests, but that is a good thing. This makes it easier to find the test(s) related to some functionality, and makes it easier to modify, add and delete tests when required.

5. Tests that depend on other tests

Sometimes you see tests that call other tests in order to share code. Additionally, tests can unintentionally depend on side-effects of other tests, such as the existence of a file or other properties of the environment.
 Like the last point this is also bad for the maintainability and understandability of your Unit Tests.

To avoid this, without violating DRY, you can extract infrastructure (set-up and tear-down) code into separate functions which are then called from different tests. These infrastructure functions should never test anything themselves.

It also sometimes happens that a test has side-effects that subsequent tests will depend upon, whether intentional or accidental. A way to detect this is to initially runs tests individually. Also, if your test framework allows, let it run the tests in random order.

In brief, tests should be self-contained and independent of other tests.

6. Tests with external dependencies

As we saw above, having tests depend on other tests is not a good idea. A similar problem is when tests depend on some external module or function. Sometimes you need a test double to stand in for an external dependency.

How far you go with creating test doubles is a matter of judgement and considerable debate. One school of thought is that all external dependencies should be eliminated. This may necessitate creating a lot of test doubles and in my opinion this is not always necessary. For example, you rarely need test doubles for the operating system or CRT (compiler run-time) library calls.

Apart from avoiding troublesome externals, test doubles have another use. You may need to use them to simulate conditions that the external does not normally produce. I talked about this at length last month

My rule of thumb is if the external is fast, reliable and predictable you don't need a test double.

An example of when you would use a double is if it takes a long time since you risk that the tests will run too slowly and be disabled. Also, if the externals have bugs you don't want to spend time investigating failing Unit Tests only to find that they they are caused by external bugs.

In summary, use test doubles when external dependencies cause problems (or error conditions need to be emulated). See my post last month for more information on why, when, and how to use test doubles.

7. Black box testing

black box testing
is the
wrong approach
An important thing to remember is that the tests should be written by whoever wrote the code in order to take advantage of knowledge of how it works internally. Employing black box testing (where the tester knows nothing of the internals) is the wrong approach since, even with the simplest modules, testing every combination of inputs is simply not practical.

In brief, Unit Tests need to be written by someone familiar with the internals of the SUT. This is commonly referred to as White Box Testing. I talked about this in detail in my previous post on White Box Testing.

8. Delayed Test Writing

In the previous point we saw that Unit Tests should be written by the developer(s) who wrote the SUT. A similar requirement is that the Unit Tests should be written at the same time as the code they are testing.  This is another example of DIRE (see my blog next month for more on DIRE).

If writing the tests is delayed then the code will no longer be fresh in the developers' minds and things will be missed. In my experience this will result in poor and incomplete tests (if they are even done at all). Further, it greatly improves the verifiability of the code (see Verifiability).

In summary, Unit Tests should be written at the same time as the code being tested to take advantage of information only understood/remembered at the time. Ideally the developers should practice TDD (see point 1 above), where the tests are written before the code.

9. Tests not run regularly

Tests need to be run regularly. When found, problems with the tests need to be fixed immediately. I saw one project where the Unit Tests had not been run for some weeks (perhaps months). When I ran them more than half failed due to a change in behavior. This wasted quite a bit of time while we worked out what had happened.

It may seem obvious but tests should be run regularly and maintained to be consistent with the application code that they test.

10. No High-Level Tests

One final problem is when lovely Unit Tests are written for low-level modules but "high-level" modules are not tested. In many designs larger modules are built from smaller ones. In other designs some modules communicate with many other modules and are not easily tested.

In these cases, even though they are less amenable to Unit Tests, these modules should still have Unit Tests. For example, a high-level module that communicates with a lot of other modules may require a lot of test doubles.


Consider the analogy of a plane engine. An engine is composed of many parts each of which is thoroughly tested before being incorporated into the final engine design. For example, the spark plug supplier would have "Unit Tested" the spark plug design in isolation from the engine in which it is to be used.

So all parts are tested before being incorporated into the prototype for the engine. However, just because all the parts are thought to work correctly does not mean that we should not test the engine in toto before incorporating it into the plane prototype.

Too often, in software development, you see lot of low-level Unit Tests that test things like the "spark plugs". Then you have high-level acceptance tests - which is the equivalent of a test flight (with the engine already in the plane). What is missing is testing at the levels in between. Some people might call this integration testing (testing that module communicate with each other correctly) but it is more than that.


Don't stop creating tests when you get to higher levels in the design.

Summary

Unit Tests are extremely worthwhile but only if implemented properly. You need to watch out for the above 10 pitfalls and try to follow the corresponding guidelines (in red).

This is almost my last post on Unit Tests, which started last October with the post on Change.

In my next post I will summarize all the things I have said about Unit Tests and highlight the salient points.

No comments:

Post a Comment