Saturday, 6 July 2013

Best Practice in C for Modules

Today I discuss best practice in designing and implementing modules in C (and C++). The most important aspect is that each module should have a simple, well-defined interface given by its header file.

Modules and Interfaces in C

Last year I talked about how crucial the principle of divide and conquer is to software design (see Handling Software Complexity). To divide a problem, software is split into pieces that can, to some extent, be isolated from each other. These pieces can take many forms and have different names but I try to stick to the generic term module. Modules connect to each other by means of interfaces. The idea that you only need to understand the interface of a module, rather than how it is implemented internally, isolates the complexity. This makes designing complex software not just easier but possible.

The idea of using modules to create software is almost as old as software itself (despite what OO adherents would have you believe). There are many benefits, but the main one is information hiding; a simple well-defined interface hides the complexity of how a module is actually implemented. (There are other benefits, such as facilitating polymorphism, which I will not go into now - maybe in a future post.)

The advantages of information hiding are large, and well-documented, which makes it all the more surprising that I continue to encounter C code that ignores best practice in this area which was established 25 or 30 years ago. Let's start by looking at how interfaces are defined, using header files, then look at poor practices and the resulting problems.

Header Files

C (and C++) have always used header (.H) files to provide the interface for modules. Small modules are created using a single source (.C) file. Each module (.C file) should have its own header file which exposes its interface, or what can be used from the outside.

A larger module may combine several smaller modules and generate a (link-time or run-time) library. A library is typically implemented as a separate project with its own header file that exposes its public interface. For large pieces of software you may need even larger modules composed of multiple libraries, where you could use a single header file that #includes the header files for all the parts that are exposed publicly. An example is MS Windows which is composed of numerous DLLs (run-time libraries) the interfaces to which are exposed through the header file Windows.h.

The crucial thing is to have good header files that expose well-defined interfaces. Next I will look at some cases where header files are used poorly (or circumvented entirely).

Function Declarations

When I first used C, calling other modules was error-prone. Functions were declared in a header file so that, when the compiler built other modules, it knew the type of value each function returned, but you could not declare what parameters a function took. In other words, the C language did not allow you to precisely define the interface to a module. (Function prototypes, introduced during the development of the first C standard in the late 1980s, addressed this problem.)

Here is an example of the sort of nasty bug that I and others encountered before the addition of function prototypes to C:

/* module.c - implementation of module */
int func(long value)
{
   ...

/* module.h - interface to module */
int func(); /* declares return type but NOT parameters */
   ...

/* caller.c - user of module */
   ...
   int small_value = 8;
   ...
   func(small_value);

The above code would almost always work but very occasionally would crash or behave very strangely due to the fact that func() was getting large values. Because the problem was not reproducible it was difficult to track down.

In brief, this was caused by several things:
  1. In this environment int was 16 bits and long was 32 bits. The caller was pushing 2 bytes on the stack (see diagram) but the callee was expecting 4 bytes.
  2. C calling conventions are such that the caller pops parameters off the stack. The callee doesn't care how many bytes were pushed.
  3. Little-endian byte order means that an integer value looks exactly like the same long value if it happens to be followed by two zero bytes.
  4. The bytes following the integer on the stack were almost always zero, but very occasionally were not.


The net effect was that func() would almost always get the correct value passed to it since the bytes following the 2 bytes were zero making small_value look like the equivalent 32-bit value. Only when the bytes following it on the stack were not zero would something strange happen.
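The effect can be imitated without relying on undefined behaviour. This is only a minimal sketch, not the original code: the helper read_le32() and the two byte arrays are hypothetical stand-ins for the callee reading four little-endian bytes from a stack where the caller pushed only two.

```c
#include <stdint.h>

/* Hypothetical helper: interpret 4 bytes as a little-endian 32-bit value,
   which is effectively what the callee did with the stack contents. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* The caller pushed only the first two bytes (the 16-bit value 8).
   The last two bytes are whatever happened to be on the stack. */
const unsigned char usual_stack[4]   = { 0x08, 0x00, 0x00, 0x00 };
const unsigned char unlucky_stack[4] = { 0x08, 0x00, 0x01, 0x00 };
```

Here read_le32(usual_stack) gives 8, as intended, while read_le32(unlucky_stack) gives 65544 (0x00010008), the kind of unexpectedly large value that func() occasionally received.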

This sort of bug is exactly the reason that function prototypes were invented. They allow a better module interface:

/* module.h - interface to module */
int func(long);

Now the compiler can ensure that parameters of the correct type are being provided, and even provide implicit conversions if available.
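As a rough illustration of the implicit conversion, here is the earlier example collapsed into one listing. The body of func() and the wrapper call_with_small_value() are made up for the sketch; only the prototype comes from the text above.

```c
/* Prototype, as it would appear in module.h. */
int func(long value);

/* Definition, as it would appear in module.c.
   Hypothetical body: reports whether the value fits in 16 bits. */
int func(long value)
{
    return value >= -32768L && value <= 32767L;
}

/* caller.c: because the prototype is visible, the int argument is
   converted to long before the call is made. */
int call_with_small_value(void)
{
    int small_value = 8;
    return func(small_value);
}
```

With the prototype in scope the callee always sees a full long, regardless of the sizes of int and long on the platform.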

Note that, even after the advent of function prototypes, this problem can still occur when using functions that take a variable number of parameters. Function prototypes only allow a fixed number and type of parameters to be specified, so nothing in the prototype of printf() tells the compiler that anything is wrong with this (although many modern compilers special-case printf-style format strings and can warn about it):

   printf("%s", 42);


  • declare function prototypes for all functions (the compiler should enforce this anyway)
  • be very careful calling functions that take a variable number of parameters
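To show what "being very careful" means in practice, here is a small sketch using a hypothetical variadic function, sum_longs(), which is not from any real library. The callee reads every variable argument as a long, so the caller has to supply exactly that type itself.

```c
#include <stdarg.h>

/* Hypothetical variadic function: adds up `count` arguments,
   every one of which must be passed as a long. */
long sum_longs(int count, ...)
{
    va_list args;
    long total = 0;
    int i;

    va_start(args, count);
    for (i = 0; i < count; ++i)
        total += va_arg(args, long);   /* reads each argument as a long */
    va_end(args);
    return total;
}

long sum_example(void)
{
    int small_value = 8;
    /* The cast is essential: the compiler does not convert the
       variable arguments for us, as it would with a full prototype. */
    return sum_longs(2, (long)small_value, 34L);
}
```

Passing small_value without the cast would recreate the original bug on any platform where int and long differ in size.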


Duplicate Declarations

A closely related problem is when a function is declared in multiple places. I recently encountered a problem because a function that should have been private to a module was being called from outside the module like this:

/* module.h - interface to module */
int f(int);

/* module.c - implementation of module */
void private_func(int);
...

int f(int val)
{
   ...
   private_func(1);
   ...

void private_func(int i)
{
   ...

/* other.c - implements an other module */
   ...
   void private_func(); // duplicate declaration!!!
   private_func();
   ...

Obviously, the private function had taken no parameters at some stage, when someone decided they needed to bypass the interface and call the function directly. Later it was changed to take an integer parameter, but it was still being called without parameters in other.c.

This is a good example of the advantages of the DRY (don't repeat yourself) principle. A function should only ever be declared in one place. The public functions of a module should be declared only in the header file for the module.

Even without the nasty bug caused by inconsistent function declarations, circumventing the interface to a module (by accessing private functions from outside the module) is a bad idea. It makes the interface to the module less clearly defined, which makes maintaining the software a nightmare. Moreover, calling the private function might leave the module in an inconsistent state.

A way to prevent this, whether intentional or accidental, is to declare all private functions as static (see Unintentional Global Variables below). This makes them impossible to call by name from outside the source file.
  • only declare function prototypes in one place (the header file for the module)
  • include the header file in the module source file so the compiler can check that the declaration matches the actual function definition
  • include the header file in all other modules that use the module so the compiler can ensure that all parameters are passed correctly
  • declare private functions and (file scope) variables static
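Putting these points together, the module from the example above might be laid out roughly as follows. The two files are shown in one listing for brevity, and the function bodies are invented for the sketch.

```c
/* ---- module.h: the one and only declaration of the public function ---- */
#ifndef MODULE_H
#define MODULE_H
int f(int val);
#endif

/* ---- module.c: would begin with #include "module.h" ---- */

/* Private helper: static, so no other source file can link to it. */
static int private_func(int i)
{
    return i * 2;              /* hypothetical body */
}

int f(int val)
{
    return private_func(val) + 1;
}
```

With this layout the stray declaration in other.c fails at link time instead of silently calling the function with the wrong parameters.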

Struct Alignment

Another nasty problem that can occur is related to how pad bytes are added to structs in C header files (and classes in C++).

It is common for an interface to a C module to use structs (or usually pointers to structs) as parameters to its functions. Hence these structs are part of the interface and must be added to the header file which defines the interface to the module. For example, the Windows header files are full of such structs (often typedef'ed) such as struct _SYSTEMTIME (typedef'ed as SYSTEMTIME).

One problem to watch for is that the caller's and the callee's interpretation of the memory layout may differ if they have different alignment settings at the point where they #include the header. This is usually controlled by means of a command line option and/or #pragma pack. See my blog entry on Alignment for a complete description of this problem and how to create structs that avoid it.

In C++ the same thing can happen with classes (or references and pointers to classes). In fact, it is even more common to pass these around. However, it is unusual for C++ programs to change alignment settings, so this is not normally a problem.
  • protect structs in header files from alignment problems by using #pragma pack or (better) by making the struct alignment-agnostic
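One way to make a struct alignment-agnostic is sketched below. It assumes a C11 compiler (for _Static_assert) and a mainstream platform with 8-bit bytes. The struct and its member names are hypothetical. Members are ordered largest first and the filler byte is explicit, so there is nowhere for the compiler to insert padding whatever the packing setting.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with members ordered largest first. */
struct record {
    uint32_t id;        /* offset 0 */
    uint16_t kind;      /* offset 4 */
    uint8_t  flags;     /* offset 6 */
    uint8_t  reserved;  /* offset 7: explicit filler, not hidden padding */
};

/* Compile-time checks: the build fails if any padding sneaks in. */
_Static_assert(offsetof(struct record, kind) == 4, "padding before kind");
_Static_assert(offsetof(struct record, flags) == 6, "padding before flags");
_Static_assert(sizeof(struct record) == 8, "padding in struct record");
```

Placing the checks in the header next to the struct means every module that includes it verifies the layout under its own alignment settings.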

Unintentional Global Variables

A lot of people, like me, learnt C from K&R. There are also many other C (and C++) books that have a similar coding style for their examples. This can be a problem, not because the example code isn't of the highest standard, but because the code is written for brevity. This style is not recommended for production code, but is nevertheless emulated. For example, K&R often uses very short variable names and declares function prototypes within the code instead of in a separate header file (see "Duplicate Declarations" above for why this is a bad idea), etc.

Another thing, in the K&R code, is that functions are never declared static, even when they are obviously "private". However, in large programs it is important to declare all private functions (and file-scope variables) static (in the same way that private members of a class in C++ are declared private). A simple code example:

int privateVar;
int privateFunc() { ... }

  ...
  privateVar = ...

If the above are only used within the one module (ie only in the current source file) then they should be made invisible from outside:

static int privateVar;
static int privateFunc() { ... }

This more clearly delineates what is and is not part of the interface and has the following advantages:
  1. it helps someone reading the code to quickly see what is private to the module
  2. it avoids accidental name conflicts that cause build errors or even subtle run-time errors
  3. it prevents other modules using private parts of a module, possibly undermining its consistency
  4. it speeds up the link process as the linker has to deal with fewer names
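As a slightly fuller sketch, here is a hypothetical counter module (all names are invented for illustration). Its state and helper are static, and the only way in from outside is through the two public functions, which would be declared in the module's header.

```c
/* counter.c: a hypothetical module whose state is private. */

static int count;                 /* file scope, invisible to other files */

static int clamp(int value)       /* private helper */
{
    return value < 0 ? 0 : value;
}

/* Public interface: these two would be declared in counter.h. */
void counter_add(int amount)
{
    count = clamp(count + amount);
}

int counter_value(void)
{
    return count;
}
```

Because no other file can touch count directly, the module can guarantee that the value never goes negative.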

C++ Classes

Typical C++ code is better than C code in creating well-defined modules. This is because the language (and the culture of its use) encourages modularity by the use of classes. For example, there is usually a clear distinction between what is public and what is private, which makes for better, well-defined interfaces.

However, there are still some poor practices in C++. A common one is putting many classes into the same source file. As a general rule it's better to put each class into its own .CPP (or .CC, .CXX, etc) file with a corresponding .H (or .HPP) header file. Of course, sometimes a module is better defined by more than one class. In such cases the classes would be closely related (using inheritance or at least the friend mechanism), such as a container class and its iterator.

Classes should also be as small as possible, doing one simple thing. I know this means that even small programs may have lots of source files, which some people see as a problem. This should not be a disincentive as you can use the file-system to organize the files into directories and most IDEs (like Visual Studio) support grouping different source files into folders.

In summary, a low-level module in C++ is typically composed of a single class (or a few closely related classes). The class methods should be implemented in a single source (.CPP) file. There should be a corresponding header file (same name but with .H extension) which declares the class and any related types etc. A large class could have several source (.CPP) files but should still have only a single header (.H) file - though, a class of that size may be indicative of a poor design.

One problem with C++ is that private parts of a class appear in the header file (since they are part of the class declaration) even though they are not part of the public interface. Users of the module should, of course, ignore these private methods and members. (They only appear in the header so the compiler can work out the size of objects of the class.)

C++ Name Mangling

One very good thing about C++ when compared to C is the use of what is commonly called name mangling. C++ is safer because all global names (functions, variables, member functions, etc) have a "name" that not only encodes the identifier but also its type (including number and type of parameters for functions). This will detect many of the problems mentioned above.
Note that name-mangling can be disabled in C++ using extern "C". This allows C and C++ object modules to be linked together.

For example, if you try to call function func() from one module passing a short but func() is defined (in another module) to take a parameter of type long then these two functions will have different "mangled" names. That is, the compiler will use different names in the object files (eg, .OBJ files on Windows) for the caller and callee and the linker will generate an error such as "unresolved external".
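For a C module that must also be callable from C++, the usual idiom is to wrap the declarations in the header as sketched below. The body of func() is hypothetical. The price is that functions declared this way lose the link-time type check just described, since their names no longer encode the parameter types.

```c
/* module.h: hypothetical header usable from both C and C++ */
#ifndef MODULE_H
#define MODULE_H

#ifdef __cplusplus
extern "C" {            /* C++ only: use unmangled C names for these */
#endif

int func(long value);

#ifdef __cplusplus
}
#endif

#endif /* MODULE_H */

/* module.c: the C implementation */
int func(long value)
{
    return value > 32767L;   /* hypothetical body */
}
```

A C compiler never defines __cplusplus, so it sees only the plain prototype.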

Interfaces

While we are discussing C++ we may as well talk about interfaces. Some languages (like C# and Java) have explicit support for interfaces. In C you create an "interface" by declaring a group of related functions in a header file. In C++ you can create an interface using an abstract base class. This is a class which defines all the functions you can call but does not actually contain any code (or data). That is, all functions are declared pure virtual with no actual implementation. (Internally this is simply an array of function pointers called a VTable.) Obviously, such a class is not much use by itself, but it allows another class to derive from it in order to implement the "interface".

What is an interface?

The interface keyword in Java and C# is used to declare an abstract type that other classes may derive from. The interface does not provide any implementation; the deriving class implements the interface's methods.


However, because C++ does not directly support interfaces I often see abstract classes which are trying to be interfaces but are not quite right. Here is an example. How many problems can you see?

class interface
{
private:
   void internal();
   int count;
public:
   interface() {}
   ~interface();
   virtual void process() =0;
   virtual void other() =0;
};

First, an interface class does not implement anything so it should not have any data members (like count) or private methods (like internal). An interface is never instantiated so it also should not have a constructor. (Also note that the above constructor is pointless in any case as the compiler will generate a default constructor like this if necessary.)

Finally, an interface class destructor should always be declared virtual. This ensures that if the object which implements the interface is destroyed using a pointer to the interface then the correct destructor for the object is called.

class interface
{
public:
   virtual ~interface() {}   // must have a body, even if it is empty
   virtual void process() =0;
   virtual void other() =0;
};

In summary, an interface class (abstract base class) should:
  • have no private parts
  • have no member variables
  • have no constructors
  • have a virtual destructor
  • declare all other functions pure virtual
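The "array of function pointers" mentioned earlier can be written out by hand in C, which makes it clear what an interface boils down to. This is only a sketch with invented names (processor_ops, adder, run_twice), not a description of how any particular C++ compiler lays out its VTable.

```c
/* The "interface": a table of function pointers (hypothetical names). */
struct processor_ops {
    void (*process)(void *self, int input);
    int  (*result)(const void *self);
};

/* One implementation: keeps a running total. */
struct adder {
    int total;
};

static void adder_process(void *self, int input)
{
    ((struct adder *)self)->total += input;
}

static int adder_result(const void *self)
{
    return ((const struct adder *)self)->total;
}

const struct processor_ops adder_ops = { adder_process, adder_result };

/* Client code sees only the interface, never struct adder itself. */
int run_twice(const struct processor_ops *ops, void *object, int input)
{
    ops->process(object, input);
    ops->process(object, input);
    return ops->result(object);
}
```

Note that the table itself holds no data and no code, only the list of operations, which is exactly the discipline recommended for the C++ interface class.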

Summary

Having simple, well-defined interfaces allows the design of software to be more easily understood and consequently built and maintained.

In C, this is done by using one source file per low-level module. Private parts of the module (functions and file-scope variables) are declared static so they are not visible from the outside. The public parts (ie, the interface to the module) are declared in the header file for the module. Functions should be declared in one place only (DRY) to prevent nasty bugs caused by inconsistent declarations.

These ideas are similar in C++ except that a module is typically implemented by a class.

DO

  • create a separate source (implementation) and header (interface) file for each module
  • clearly differentiate between public and private parts of the module
  • declare function prototypes for all public module functions in the module header file
  • make all private module functions (and file scope variables, if any) static

DON'T

  • declare functions in more than one place
  • declare the interface for more than one module in the same header file
  • allow the padding of structures in header files to depend on the current alignment setting 
