Friday, 18 November 2011

Zero

Zero may be nothing to you (pun intended) but it has caused many problems throughout history.  In computer programming it is particularly important, since incorrect handling of zero can cause anything from minor bugs to disasters.  In fact, in one well-documented disaster a rocket was reportedly lost due to a software defect related to the incorrect handling of zero in a Fortran program.

So at the risk of raising much ado about nothing (I promise: that is the last stupid pun :) I will give some examples of the sort of things that can go wrong if you don't handle zero properly.  But first a brief history.

History of Zero

First, we should distinguish between zero as a digit and zero as an integer.  The symbol for zero can be used just as a digit, ie, a placeholder in a positional numbering system (eg, the zero in 307 indicates that there are no "tens").  A symbol for an empty position was first used by the Babylonians.  But the idea of zero as an integer in its own right was not invented until later, by the Mayans and independently in India.

The Sumerians were probably the first to use a positional number system but not having a symbol for zero created problems.  If you wanted to send a message to the market that you had 30 goats to sell there was no way to distinguish between 30 and 3.  A farmer could avoid this problem by instead selling 29 or 31 goats but as numbers became more important the lack of a zero became problematic.

The Babylonians started with a similar system to the Sumerians, but eventually adopted the notation of a small mark to signify an empty position in a number.  Greek astronomers were the first to use a circle for zero as a placeholder, but it was not until centuries later that Indian mathematicians treated zero as a full integer like one, two, etc.  One reason mathematicians probably avoided zero was the problem of what dividing a number by zero should mean, or the even weirder idea of 0/0 (zero divided by zero).

Zero in Algorithms

Sometimes having empty input makes no sense.  For example, how do you take the average of zero numbers?

  int average(const int array[], int n)
  {
      int sum = 0;
      for (int i = 0; i < n; ++i)
          sum += array[i];
      return sum/n;
  }

The above code will have a divide by zero error at run-time when presented with an empty array.  We can easily fix this:

  int average(const int array[], int n)
  {
      if (n > 0)
      {
          int sum = 0;
          for (int i = 0; i < n; ++i)
              sum += array[i];
          return sum/n;
      }
      else
          return 0;  /* may be more appropriate to return -1 or otherwise signal an error */
  }

Now suppose we want to find the sum of an array of numbers.  Having just coded the above average() function we might be tempted to write:

  int sum(const int array[], int n)
  {
      if (n > 0)
      {
          int sum = 0;
          for (int i = 0; i < n; ++i)
              sum += array[i];
          return sum;
      }
      else
          return 0;
  }

However, in this case, taking the sum of zero numbers does make sense.  The code should simply be written as:

  int sum(const int array[], int n)
  {
      int sum = 0;
      for (int i = 0; i < n; ++i)
          sum += array[i];
      return sum;
  }

[Actually, doing it the first way would be better in Fortran as we will see below.]

I will now look at a few ways some programming languages do not handle zero well.

Fortran

Fortran was the first programming language I learnt and you would expect that I would have a certain fondness for it.  Unfortunately, I always found it very peculiar.  One of the stupidest things in (early) Fortran is that a DO loop (its version of a FOR loop) is always executed at least once, even when the bounds specify zero iterations.  This has resulted in countless bugs where boundary conditions were not handled properly.  To avoid such bugs you usually have to wrap the DO loop in an IF statement which guards against the empty case.

Pascal

Soon after learning Fortran I learnt Pascal and it was an enormous relief.  It is much simpler to understand, more elegant and, of course, it handles FOR loops correctly.  However, I later came to appreciate that it has some deficiencies when compared to C-like languages.  Ranges in Pascal are specified using inclusive bounds, so, for example, the range "1..10" means the integers from 1 to 10 inclusive.  You can express a range with one element such as "1..1", but how do you express a range with zero elements?

A common mathematical notation is to express ranges with square brackets when the ends are included and round brackets when the ends are excluded.  Many things are simplified when an inclusive lower bound and an exclusive upper bound are used.  Some people refer to this as using "asymmetric bounds".  For example, the set of numbers from zero (inclusive) to one (exclusive) is shown as [0, 1).

Now you might wonder why you would want to express a range with no elements.  Actually, it is often very important to be able to do this sort of thing.  Many algorithms work on data sets that may have zero elements as a degenerate case.  If this case is not handled, the code will fail when presented with empty input.

It is better for many reasons to use asymmetric bounds for ranges (as is conventional in C), and I will discuss these reasons in a later post.  But one of the main advantages is that you can easily specify an empty (ie, zero-length) range.

Brief

In my first job I used to edit my C code in a word processor (WordStar), and this was a very painful experience.  Soon after its release (1985?) I started using the programmer's editor called Brief which was simply wonderful in comparison.  (Perhaps the best thing was the unlimited undo which was not restricted to changes in the text but other things like changes to the selection and layout - in fact the Undo for my hex editor, HexEdit, is modelled closely on it.)

Brief allowed you to "mark" text and copy or append it to its internal "clipboard".  There were several ways of marking text including "non-inclusive" mode.  Non-inclusive mode was very useful as it used the asymmetric bounds pattern mentioned above.  This allowed you to mark the start and one past the end of a block of text for further processing.

I found non-inclusive mode very useful for creating macros in Brief's built-in programming language.  I created several macros that built up text in the clipboard by appending to the clipboard in non-inclusive mode.  But there was one major problem - copying or appending to the clipboard generated an error when the start and end of the marked range were at the same place.  That is, you could not copy a zero-length selection.

I even wrote to the creators of Brief (who were obviously very clever guys for creating such a brilliant tool).  The reply I got was basically that there could be no use for a selection of zero length.  Of course, they were wrong, as many of my macros failed whenever they attempted to append a zero-length selection to the clipboard.

The moral is that even very clever people may not appreciate the importance of zero.

C

One of the good things about the C language is that in almost all cases zero is handled correctly.  Experienced C programmers usually reach the point where they need give only cursory consideration to boundary conditions.  Generally, if you write the code in the simplest, most obvious way, it will just work for all inputs with no special code for boundary conditions.

Even the syntax of the language adopts the "zero is a normal number" approach.  For example, in C the semi-colon is used as a statement terminator (rather than a statement separator as in Pascal).  A compound statement which contains N statements will have N semi-colons - so a compound statement with no nested statements is not a special case.

Consider compound statements in Pascal and C:

Pascal:
  begin statement1; statement2 end;   (* 2 statements, 1 semi-colon *)
  begin statement1 end;               (* 1 statement,  0 semi-colons *)
  begin end;                          (* 0 statements, can't have -1 semi-colons! *)
C:
  { statement1; statement2; }         /* 2 statements, 2 semi-colons */
  { statement1; }                     /* 1 statement,  1 semi-colon */
  { }                                 /* 0 statements, 0 semi-colons */

How can this really be that important?  One case where it matters is software that generates C code.  In some circumstances you may need to generate a compound statement containing no nested statements, and the syntax of C means you don't need to handle that boundary condition specially.

Also, if you have ever edited much Pascal code you will know how painful it is to add (or move) statements at the end of a compound statement.  You have to go to the end of the previous line and add a semi-colon.  Worse, if you forget to do so, as usually happens, you get syntax errors on the next build.

Note that C does not take this to extremes though.  Commas in function parameter lists are separators not terminators.

  func(param1, param2);   // 2 parameters, 1 comma
  func(param1);           // 1 parameter,  0 commas
  func();                 // 0 parameters, can't have -1 commas!

Array initialisations, however, do let you use commas as terminators: a trailing comma after the last element is allowed (though optional).  This makes it easy to edit arrays, eg, when appending or reordering elements.

  float array[] = {
    1.0,
    2.449489742,
    3.141592653,
  };

Malloc Madness

Given C's generally good behaviour when it comes to zero, it was very surprising to find a certain proposal during the drafting of the C standard in the late 1980s.  Some on the C standard committee were keen to have malloc(0) return NULL.

All C compilers until then based their behaviour on the de facto standard, ie, the behaviour of the original UNIX C compiler created by Dennis Ritchie (often referred to as K&R C).  This behaviour was for malloc(0) to return a valid pointer into the heap.  Of course, you could not do anything with such a pointer (except compare it to other pointers and free it), since it points to a zero-sized object.

In my opinion, anyone on the committee in favour of malloc(0) returning NULL showed such a lack of understanding of the C language, and of the importance of correctly handling zero, that they should have immediately been asked to leave.  At the time I scanned a fair bit of source code I had written, and found that more than half the calls to malloc() (about 100, from memory) could be passed a value of zero under boundary conditions.

In other words there were calls to malloc() like this:

  data_t * p = malloc(nelts * sizeof(data_t));

where the algorithm relied on nelts possibly having the value zero and the return pointer p not being NULL in this case.

The main problem with the proposal is that malloc() can also return NULL when memory is exhausted.  Lots of existing code would have to be changed to distinguish the two cases where NULL could be returned.  So code like this:

  if ((ptr = malloc(n)) == NULL)
      // handle error

would have to be changed to:

  if ((ptr = malloc(n)) == NULL && n > 0)
      // handle error

A further problem with the proposal is that it would then be impossible to distinguish between different zero-sized objects.  Of course, a simple workaround would be:

  #define good_malloc(s) malloc((s) == 0 ? 1 : (s))

But there is no point in creating a macro to fix a broken malloc() when malloc() should simply behave correctly in the first place.

There is an even more subtle problem.  Can you see it?

Luckily, sense (partially) prevailed: malloc(0) is not required to return NULL.  As a compromise the behaviour was made "implementation-defined".  Fortunately, I have never heard of a C compiler stupid enough to return NULL from malloc(0).

C#

I have always liked the C# language and the .Net CLR ever since I first read about it in the C/C++ Users Journal.  For example, see my article from 2004 at http://www.codeproject.com/KB/cs/overflow_checking.aspx.  However, even in C# there are problems with the handling of zero.  Actually the problem below is more to do with the .Net CLR than C# per se.

Consider the base class of all data types, called System.Object.  This is effectively an "empty" object since it has no data fields.  However, the CLR's default implementation of System.Object.Equals() does a silly thing -- it just compares references.  This means that two objects will not compare equal unless they are the very same instance.

Side Note: It could be argued that returning false makes sense if the objects have different types (see System.Object.GetType); but it could also be argued that having zero oranges is the same as having zero apples. In any case the derived class version of Equals() should already have handled this situation.

Nothing should always be considered equal to nothing, hence System.Object.Equals() should always return true.

Now you might argue that you can't create a System.Object by itself and that a derived type (which we will call Derived in the example below) should properly implement its own version of Derived.Equals() that compares its members appropriately.  The problem with this is that if all the members compare equal then the base class version of Equals() should be called like this:

  public override bool Equals(object other)
  {
      if (other is a different type or any members have different values)
          return false;
      else
          return base.Equals(other);
  }

However, if base.Equals() is System.Object.Equals() then base.Equals() will return false (unless this and other are the same instance), which is wrong.  OK, so why don't we forget about calling the base class and just return true?

But there is still a maintenance problem.  What if Derived is later changed to derive from another class?  Derived.Equals() should then call base.Equals(), but it would be easy for this necessary change to be missed.

Things would be so much simpler if System.Object.Equals() simply returned true.

BTW, my rules of thumb when implementing Equals() are these:

1. Always override Equals() (and GetHashCode()) for your class.
2. Never call base.Equals() if your class derives directly from System.Object (implicit derivation).
3. Remember to call base.Equals() if your class derives from another class.
4. Be careful with Equals() when inserting a new derivation between a derived class and System.Object.