Saturday, 25 May 2013

STL's Dark Secret


STL has lots of things like iterators and a large number of algorithms but where it gets its most use is with its containers. (After all you can use containers without using algorithms or even iterators but the reverse is not true.) And the most used container is undoubtedly vector. Vector is arguably the star of STL. One reason I like it is that you never have to use new[] (array allocator), and perhaps forget to call delete[] (array deallocator) or accidentally use delete instead of delete[], etc. Of course, vector has many other nice features.

Despite its enormous utility and popularity there is one problem with vector that is never discussed. I am not exactly sure why - perhaps the workaround is too ugly to be considered - I call it STL's Dark Secret. It's simply that vector has much worse performance in some circumstances than the traditional way of working in C (arrays and memcpy()/memmove()/memset()). I will discuss this in a moment but I will first explain how I discovered this.

STL - First Experiences

When I first read about and tried STL, I loved it. My only concern was that all the abstraction came at a cost in performance. The very first thing I tried in XML was to compare the performance of the C run-time routine qsort() with the STL algorithm sort() on a vector. This seemed like a good test as both implement the Quicksort algorithm (at least in the implementation I was trying). To my surprise and delight the STL sort performed more than four times faster. (I soon realized that because the STL function is a template that the comparisons could be in-lined avoiding the many function calls that the non-template function qsort() requires.)

So it appeared that performance was not a problem but later I noticed some strange behavior when using a large vector. My program would occasionally "pause" when adding items to the end of the vector (using push_back()). This was due to the vector re-sizing itself - effectively doing a realloc() then copying the existing elements to the new memory. (I know you can usually get around this problem by using reserve() as discussed below.) Of course, when a vector runs out of memory then it has no choice but to do this. It still seemed that the re-sizing process was slower than it should be, so I decided to do some performance tests on vector.

Note that when I talk about the standard C library function memcpy() I also mean memmove().

These functions perform the same operation if the source and destination memory ranges do not overlap. If they do overlap the behaviour of memcpy() is undefined and you must use memmove().
I compared the performance of vector memory movements with direct memory copying using memcpy(). That is, I created large arrays of bytes and copied them using memcpy() then also persuaded a vector to move exactly the same amount of memory around. (There are two different ways in which vector needs to do this but they have similar performance, as discussed later.) To my utmost horror I found that when vector moved elements around it was more than an order of magnitude slower than using memcpy(). That is vectors were 10 times slower than they should be. (My tests used a vector of several million objects each of 20 bytes in size.)

What's going on?

The problem is that whenever vector has to move an element around in memory it has to call the copy- constructor or the copy-assignment operator and perhaps even the destructor for the object. When you are moving millions of objects around this can take some time no matter how fast these member functions are.

However, that is not the whole explanation. I found that even a vector of char was 4 times slower than using memcpy()  A char does not have a copy constructor (etc), unless the compiler somehow contrives to create one. Further, even a vector of int (which should be the fastest for reasons I will not delve into here) was more than three times slower than using memcpy().

It seems that the real problem is that vector operates on each element one at a time, whereas memcpy() just uses special assembler instructions to zip through the memory irrespective of the size or contents of the objects it is copying.


So when does vector have to move around large numbers of its elements? There are two circumstances:

1. Forced Re-Size

As I mentioned above, whenever you add elements to a vector (for example, by using push_back()) there is a chance you will hit the current capacity, and the allocated vector memory has to be expanded. One way the memory could be increased is by using realloc(), which means that the elements may not actually have to be moved if there is free memory (on the memory heap) immediately after the vector. However, my experience is that this almost never happens and the vector has to be moved to a new memory location with the consequent delay due to the copy-constructor and destructor being called for every existing element in the vector.

This diagram clearly shows how this works. First a larger block of memory is allocated and the existing elements are copied from the old memory to the new memory using the class's copy-constructor. Then the original objects are destroyed and the old memory freed. Finally the new object is added to the end of the vector.

Note that C++11 has introduced the move-constructor (and the related move-assignment operator). This means that the steps above can be combined. That is the copy-constructor followed by the destructor can be replaced by a move-constructor. This may give some performance boost to vectors in the future.

2. Inserting Near Start

When you insert a new element into a vector the vector has to shuffle up all the elements following the new element. This is why inserting elements into a vector is generally frowned upon (except at the end of course, using push_back()). However, sometimes it is necessary, and that's when it would be nice if vectors were faster.


This diagram shows how a new item is inserted into a vector at offset 2. First all elements from offset 2 are moved forward one position - the last element (E) is copied into the empty space at the end of the vector using the copy-constructor then each element below is move up using the copy-assignment operator. Now we have 2 copies of the element at offset 2 (C) so the destructor is called for the original one. Finally, the new element is constructed at offset 2.

3. Erasing Near Start

Erasing elements at or near the start of a vector is similar to inserting near the start, except that the elements have to be moved in the opposite direction.  Consequently, I will give no further special consideration to this case.

Why does vector work like this?

You might think that vector could simply use memcpy() to move the vector elements in memory, and that compiler/library vendors insist on using copy-constructors etc in some pedantic attempt at doing things properly. In general vectors must work like this because copying memory does not always give the same results as a copy-constructor.

If an object is self-referential (ie, has a pointer into its own internal data) then you cannot simply copy it to another place in memory. You must use the copy-constructor. The class SelfRef (below) shows an example of a class that stores strings. Short strings are kept in a small internal buffer to avoid memory allocations.

I include a copy-constructor, copy-assignment operator and destructor for completeness, but just have a look at the constructor to see what is happening. (Note that the code has not been tested and may have bugs.)

class SelfRef
{
private:
  char buf[16];
  char * pstr;

public:
  SelfRef() : pstr(buf) { buf[0] = '\0'; }

  explicit SelfRef(const char * str)
  {
    size_t len = strlen(str);
    if (len < sizeof(buf) - 1)
    {
      strcpy(buf, str);
      pstr = buf;
    }
    else
    {
      pstr = new char[len+1];
      strcpy(pstr, str);
    }
  }

  ~SelfRef()
  {
    if (pstr != buf)
      delete[] pstr;
  }

  SelfRef(const SelfRef &other)
  {
    if (other.pstr == other.buf)
    {
      strcpy(buf, other.buf);
      pstr = buf;
    }
    else
    {
      pstr = new char[strlen(other.pstr)+1];
      strcpy(pstr, other.pstr);
    }
  }

  SelfRef & operator=(const SelfRef & other)
  {
    if (&other != this)
    {
      // Destroy existing instance
      if (pstr != buf)
        delete[] pstr;

      if (other.pstr == other.buf)
      {
        strcpy(buf, other.buf);
        pstr = buf;
      }
      else
      {
        pstr = new char[strlen(other.pstr)+1];
        strcpy(pstr, other.pstr);
      }
    }
    return * this;
  }
  ....
};

If memcpy() is used to copy an object of ClassRef then the ptr member will end up pointing to the wrong place!

Solutions/Work-arounds

One approach to avoiding the performance problems with vector is to create a vector clone that behaves exactly like vector but uses memcpy()/memmove() to move elements around. Indeed, I created such a class myself which I called fector (for fast vector).  (If anyone has the course for it could they send it to me as I can't find it.) Of course, the one limitation of this class is it could not be used with self-referential objects as discussed above, but these sorts of objects are rare.

There are also a few tricks to using standard vector that can improve performance.

1. Use the reserve() method of vector to avoid resizing.

For example, if you know that your vector size will end up at about 1,000,000 elements then reserve memory for that many. This avoids several memory allocations and (more importantly) copying of many vector elements in memory.

Many implementations of vector will ask the memory heap manager for double the memory every time they run out. So if you add 1,000,000 elements to an empty vector using push_back()  the vector will have to re-size up to 20 times. (Some implementations only expand by 50% each time which means the vector will be re-sized up to 34 times). Though a few calls to malloc() and free() are not that significant, it's the moving of elements each time that can cause a large delay. There will be close to a million calls to a copy-constructor and destructor which can be avoided.

std::vector<int> v;
v.reserve(1000000); // reserve memory for expected elements

Sometimes it is impossible to know in advance exactly how big your vector will grow. The maximum size may depend on some external influence that can vary greatly. In this case you might allocate enough space for the worst possible case, especially if speed is important.

You might also be able to obtain a rough estimate somehow. For example, if you are reading a list of integers (as text) from a file you might be able to get a rough idea of the number of integers in the file based on the size of the file and the typical numbers of characters used for an integer.

v.reserve(file.size() / ave_chars _per_integer);
read_integers(v, file);

2. The other place that vector performance can be improved is when you are inserting elements into a large vector. Inserting at (or near) the end is not a problem, but inserting at (or near) the start will cause all the following elements to move forward. Here is an example of some code that will take a large amount of time:

std::vector<int> v(10000000);
...
v.insert(v.begin(), 52);

You can achieve the same effect using memcpy(). This does the same thing but much faster:

v.push_back(0); // extend the vector with a dummy element
memmove(&v[1], &v[0], (v.size()-1)*sizeof(vv.value_type));
v[0] = 52;

Warnings: First, the behavior of the above code is considered to be undefined, and, though it works with vector<int>, it will not work with all data types. As mentioned above it will not work for self-referential data types since pointers to internal data will be invalidated. It should also not be used for classes unless you are sure they are POD (plain old data) classes, or at least classes that do not define their own destructor, copy-constructor, + copy-assignment operator.

The vector does not know that its elements have been moved around so there will be incorrect call(s) to destructors, etc. In the last line above the vector does not know that v[0] has been moved to v[1] so will try to destroy it before assigning the new value of 52. (Technically, it call the copy-assignment operator which is equivalent to a destructor call followed by a copy-constructor.) This is OK for ints since an int "destructor" does nothing, but for class objects this could result in memory or resource leaks and other problems.

So the above technique is OK for scalar types and POD classes. You can use it for a class or struct that just has simple data types like int (and even pointers if they are not self-referential). However, you may be better off not using it in a class since you never know if some future code changes will add a new field to the class which is not a POD data type. You should definitely not use if for such things as a vector of strings or a vector of vectors.

Conclusion

It is a shame that the most useful, and probably most commonly used, feature of the STL, namely vector, has this performance problem. I find it odd that the problem is never discussed. For example, even Scott Meyers' excellent book Effective STL only mentions the first problem (see Item 14. Use reserve to avoid unnecessary reallocations.) Even then it does not mention that when vectors copy elements it is much slower than using memcpy().

If you are aware of the problem, and understand how vectors work, then the effects can be ameliorated by:
  • reserving memory of sufficient size for the largest number of expected elements
  • instead of insert() using memmove() on large vectors of scalar and POD types
  • OR use a vector replacement that uses memcpy() [but don't use it on self-referential objects]