Friday 1 June 2012

Self Describing Code

I'm currently reading a book called Clean Code by Robert (Uncle Bob) Martin, and I intend to post a review when I have finished it.  Generally, there are a lot of good ideas in it, but one that I find awful is the idea of what is commonly called self-describing code.  Though Uncle Bob never mentions that specific term (perhaps due to its previous bad press), he prescribes to all the main tenets as exemplified by these quotes:

"[a name] should tell you why it exists, what it does, and how it is used" - page 18

"Don't be afraid to make a name long." - page 39
"comments compensate for our failure to express ourself in code" - page 84
"If a name requires a comment, then the name does not reveal its intent." - page 18
"A long descriptive name is better than a long descriptive comment." - page 39

Self Describing Code


I recall the idea of self-describing code being bandied about in the early 1980's.  I thought it had been discredited and forgotten.  It seems to have had a resurgence.


I believe it all began when many coding standards insisted that all code be commented.  This had the dreadful consequence that many useless comments were often seen.  Moreover, as the code was modified the comments became out of date.  The consequence was that anyone trying to understand the code did not read the comments because they were, at best, of no value whatsoever, and, at worst, grossly misleading.


Many people then had the idea that maybe we don't even need comments.  With a simple and elegant design and meaningful variable and function names the code can describe itself.  Further, it can't become out of date the way comments can.  Then they would give a beautiful example showing how some simple code could be written, and be understandable, without any comments.


Counter-Argument


There are a few problems with the above argument.  First the names of variables and functions can become out of date just as easily as comments can -- and in my experience misleading variable names are far more common than misleading comments.  A bigger problem is that most real-world code is not as simple as the sort of examples usually given; it is not always possible to refactor the code to a state where it is self-describing, or do so in a reasonable amount of time.  (Of course, writing good comments can take some time too, but a lot less than rewriting the code.)


The worst consequence of the idea is that many programmers use it as an excuse for not adding comments, arguing that their code is self-describing.  Usually it is really that they cannot be bothered to explain the code in words.  I would go even further and say that writing good comments can make the coder think more closely about what the code is doing which, in turn, can lead to better code and less bugs.


Finally, self-describing code is usually (paradoxically) harder to read.  It usually involves very long identifiers which clutter the code and make it difficult to quickly scan (see my blog post Long Identifiers Make Code Unreadable).  Uncle Bob also recommends creating lots of temporary variables and tiny functions for the sole purpose that the extra identifiers can be have long, meaningful names that explain the code.  This is what makes the code harder to scan - see the example below.


Example


In the section on choosing descriptive name (Clean Code page 309), Uncle Bob gives an example of bad code and how it should be fixed.  Let's look at his solution and how it could be better.


First we look at the original:

public int x()
{
    int q = 0;
    int z = 0;
    for (int kk = 0; kk < 10; ++kk)
    {
        if (l[z] == 10)
        {
            q += 10 + l[z+1] + l[z+2];
            z += 1;
        }
        else if (l[z] + l[z+1] == 10)
        {
             q += 10 + l[z+2];
             z += 2;
        }
        else
        {
            q += l[z] + l[z+1];
            z += 2;
        }
    }
    return q;
}


I agree the variable names are lousy, but if Uncle Bob was honest about it I think even he would admit that nobody really writes code like this.

Lets look at Uncle Bob's refactoring:

public int score() {
    int score = 0;
    int frame = 0;
    for (int frameNumber = 0; frameNumber < 10; frameNumber++) {
        if (isStrike(frame)) {
            score += 10 + nextTwoBallsForStrike(frame);
            frame += 1;
        } else if (isSpare(frame)) {
            score += 10 + nextBallForSpare(frame);
            frame += 2;
        } else {
            score += twoBallsInFrame(frame);
            frame += 2;
        }
    }
    return score;
}

This is much better, and makes the code fairly obvious especially if you are familiar with the rules for scoring in Ten Pin Bowling.  There are some obvious problems with the code - for example, the integers frame and frameNumber have confusingly similar names.  Also are the magic numbers (with value10) in lines 4, 6, and 9 related?

My biggest concern is with the very small functions used to replace less than a single line of code.  These were simply introduced in lieu of a comment to try to explain what the code does.  The trouble is the (fairly long) names don't really tell me what the function does.

Looking at the name nextTwoBallForStrike I might be able to guess what it does, especially if I know the scoring rules, but personally I would still be inclined to check just to make sure.  All this takes time and prevents fluid reading of the code.

This is how I would have refactored the code:


static const int max_frames = 10;                    // Number of frames in a game


// Returns total score for a game of ten-pin bowling. On entry the number of fallen pins
// for each ball is provided in fallen[] and the number of balls bowled is in balls_bowled.
public int score()
{
    int points = 0;                                            // Cumulative count of points for the game
    int ball = 0;                                               // Current shot or "ball" (1 or 2 per frame)
    int frame;                                                  // Current frame of the game


    for (frame = 0; frame < max_frames; ++frame)
    {
        if (fallen[ball] == 10)
        {
            // Strike (all pins knocked over from 1 ball) gives bonus of next two balls
            points += 10 + fallen[ball + 1] + fallen[ball + 2];
            ++ball;                                                 // Move to next frame (no 2nd ball)
        }
        else if (fallen[ball] + fallen[ball + 1] == 10)
        {
            // Spare (all pins knocked over in 2 balls) gives bonus of following ball
            assert(fallen[ball + 1] > 0);
            points += 10 + fallen[ball + 2];
            ball += 2;                                             // Move to first ball of next frame
        }
        else
        {
            // Open frame just gives score of all fallen pins of both frame's balls
            assert(fallen[ball] + fallen[ball + 1] < 10);
            points += fallen[ball] + fallen[ball + 1];
            ball += 2;                                             // Move to first shot of next frame
        }
    }
    assert(frame == max_frames && ball == balls_bowled);
    assert(points <= max_frames*30);                // Maximum possible score
    return points;
}




How is this any better than Uncle Bob's version?  The main difference is the comments.  There are actually two types of comments: overview comments that appear above a section of code and explanatory comments that appear to the right of the code.

The overview comments allow you to get a quick idea of what is happening without trying to understand the code.  Scanning the above function it would take at most a few seconds to understand roughly what it is doing.  When you work with software of thousands or millions of lines of code then being able to track down what you are looking for quickly with these sorts of comments can save hours or even days of time.

Admittedly, Uncle Bob conveys the same information by adding new functions, with descriptive names, rather than adding comments.  However, I think the comments add a lot more useful information, especially for someone unfamiliar with the scoring system in ten pin bowling (ie, most people).  Moreover, the comments allow you to see if the code is of interest and whether you need to inspect the actual code.

An explanatory comment is additional information at the right end of a line of code.  You can often simply ignore them since reading the code may be enough if you already know what it is trying to do, but you can refer to them if you need more information.

When reading these sorts of comments I usually try to understand the code first then read the comment and if the comment makes no sense I will go and read the code again.  (Most of the time this is because I did not read the code properly.)  This method of using comments to double-check my understanding of the code has saved me from numerous stupid mistakes.

I also made a few other improvements to the code.  I replaced the magic number 10 with the named constant "max_frames" as the number of frames in a game of ten pin bowling can possibly be more than 10.  It also differentiates it from the other magic number 10, which is the number of pins used.  (I didn't add a named constant for "max_pins" as it does not vary and feel it would make the code harder to read.)  I also added a few assertions to check for internal consistency and to ensure that the data supplied to the function makes sense.


Little Functions and Temporary Variables


Uncle Bob's strategy is thus to introduce extra little functions and temporary variables, with long explanatory names, in lieu of comments.


The problem with little functions is that they make it hard to quickly scan the code and spot problems.  (In fact the very useful facility now provided by many languages is that of lambda functions.  These allow you to avoid having to create little functions, which is the exact opposite of what Uncle Bob is trying to achieve.)  It's always better to have the code there where you can see it than to have to go and find a function, or worse not check but wrongly interpret the function's behavior based on its name.


Lots of temporary variables have always been a sign of an inexperienced programmer, at least to me.  Often what is written in ten lines of convoluted code can be expressed in one or two lines without all the temporary variables.  All the extra code make it harder and slower to read.  They also increase the chance of an error due to a name conflict or typo.

Maintenance of Comments

OK, so maybe comments can be useful when the code is first written but what about the major argument for self-describing code that comments rapidly become out of date and useless as the code is modified.  Admittedly this is a problem, but just because something is done badly does not necessarily mean you should abandon it.  For example, the design of most if not all software degrades over time (see my blog post Why Good Programs Go Bad) - using the above argument we should abandon any attempt at a good design.

I have found that code reviews are an effective way to ensure comments are maintained.  In the places I have worked, comments are noticeably better where code reviews are done.  To ensure comments are maintained a good idea is to add it to a check-list for code reviewers.  (Another item for the checklist is that identifiers, such as variable names, have also been updated to reflect their current use.)

Conclusion

Abandoning comments and trying to make code self-describing is a bad idea; well-written and maintained comments are absolutely invaluable for saving time and preventing defects.  Overview comments save time by allowing you to quickly scan code and drill down to the relevant location to make a change.  Comments also provide more information when you do not fully comprehend the code you are reading.

The most important thing is that comments allow you to double-check your understanding of the code.  I have lost count of the number of times that I have not fully understood some code I was reading and been saved from making a silly mistake when a comment made me aware of my misunderstanding.  I may be not as good at reading and understanding code as Uncle Bob and other super programmers, but then not everybody who has to read code is a super programmer.

No comments:

Post a Comment