Sunday, 8 November 2020

Rangeset - Go Generic Set

 Last month I introduced you to generics in Go.  This time I discuss a useful and substantial generic container that I created, and my experiences using the go2go tool to write and test it.

What about generic chans?

Last month I also mentioned generic chans and some potential cool uses.

Sorry, but first I want to discuss the rangeset package which has been working for a while now.

I'll talk about a generic channels package I'm working on very soon - promise!

This container I called rangeset. It contains a Set type similar to the one in the   go2go example set package but with some interesting properties and performance advantages.

Its main advantage is that it can save huge amounts of memory for sets that have a large number of contiguous elements, and can be faster too.  It will regurgitate elements in order, unlike sets based on hash tables.  You can also invert a set and create the Universal set unlike any other Go set implementations I have seen.

Introduction

I first created a primitive type of "range" set in C many decades ago and found it useful in many scenarios.  With the addition of templates (and STL) to C++ I created a complete STL compatible class.  Many people found it very useful as it had much better performance (esp. memory usage) characteristics than the C++ std::set for may common types of sets.

Sets are typically implemented as a balanced binary tree (eg: C++ std::set) or a hash table (eg: C++ std:unordered_set).  An important aspect of sets is how fast you can check if an element is in a set and balanced trees have a lookup time of O(log(n)) whereas hash tables are O(1).  In practice, I have found balanced trees are adequate (since O(log) is not too bad) unless you have very large sets.  They also have the advantage that you can "iterate" the elements in order which you can't do with a hash table (without additional complications like a separate index).

“lookup time ...
is O(log(r))”

A range set takes a different approach that makes it useful for sets with large numbers of elements that are all in one or a few contiguous ranges. Internally it uses a list of ranges, where each range is simply the first and (one past) the end of a contiguous span of elements. In my original C++ version of rangeset I used a linked-list of ranges. In the Go version I completely rewrote it to use a slice.

The lookup time for a range set is O(log(r)), where r is the number of contiguous ranges.  In the worst case (where no elements are next to any other elements) the lookup time is no better than a balanced tree, or O(log(n)), where n is the number of elements. But in the best case (where all elements are in a single range) it is O(1) the same as a hash table.

Space performance is more interesting. In the worst case it is O(n) which is still better than a binary tree or a hash table.  In the best case it is O(1).  In fact, storing the universal set (the set of all elements) requires a tiny amount of memory -- just one range.

It's important to remember that you need to know something of the likely distribution of elements -- it's only useful for large sets if the elements tend to clump together.  For example, the set of all prime numbers (which only has one contiguous range, ie: 2-3) is counter-indicated, but a good fit would be something like a set containing all numbers with a leading decimal digit "2" (ie: 2,20-29,200-299,2000-2999,...). It turns out that there are many real-world scenarios where a range set is useful.

One final thing to note is that the element type of the Go rangeset.Set type is constrained to be an integer type (byte, uint, int64, etc).  Like a balanced tree the elements must be orderable - ie support comparison using less than (<) etc. (A hash table based set only requires the elements to be comparable - ie support the == and != operations.)  In theory a range set could be created that allowed floating point or string elements but I found that these are not useful in practice and I found it handy to be able to increment values which is only supported by integer types.

Sets in Go (without Generics)

Maps as Sets 

Traditionally sets in Go are implemented using a map with the map key type being the set's element type and ignoring the map value.  So for a set of  ints you would use a map with a key of  int.  Perhaps the obvious value type for the map is a bool (ie map[int]bool for a set of ints), but whether an element is in the set or not is better represented by the presence of the key in the map so an "empty" value type is used. The empty type conventionally used in Go is the empty struct, which uses no memory.  So a set of ints would be map[int]struct{}.

When using a map as a set the set element type is restricted to types that are comparable - ie types that can have == and != applied.  Hence you cannot have a set of slices, maps or funcs, since these types cannot be compared (except to nil).  Additionally, a type is not comparable if it is a struct, array or interface that contains non-comparable component(s).  (See my previous post for a full explanation of comparable types.)

Go's built-in map has no special support for sets so if you want to perform set operations you have to write functions for the specific type of set you are implementing. For example, you would need a function like this to find the intersection of sets of strings:

func Intersect(set ...map[string]struct{}) map[string]struct{} {
    retval := make(map[string]struct{})
    if len(set) == 0 { return retval }

    outer:
    for elt := range set[0] {
        for _, other := range set[1:] {
            if _, ok := other[elt]; !ok {
                continue outer
            }
        }
        retval[elt] = struct{}{} // found in all sets
    }
    return retval
}

Another deficiency of using a map is that the syntax is a little clumsier than it would be with a set type built into the language (or a generic set).  Creating sets requires dealing with the unused empty struct as the map's value type.

    setA := StringSet{"a": {}, "b": {}, "c": {}}
    setB := StringSet{"c": {}, "d": {}, "e": {}}
    setC := StringSet{"e": {}, "a": {}, "c": {}}
    log.Println(Intersect(setA, setB, setC))  // {"c"}

Open Source Set Packages

Recognizing the problems with using the built-in map type to implement sets, there have been several open-source packages created (such as the excellent golang-set). These provide methods which allow all manner of set operations such as union, intersection, comparison (for equality), and some I don't even understand like Cartesian product.

But since Go does not (yet) have parametric polymorphism, they must use interfaces for polymorphism.  This means internally they use a map something like this:

    type set map[interface{}]struct{}

The problem with storing the elements in an interface is that you lose compile-time type saftey. For example, it would be easy to accidentally add an integer to a set of strings and you could only discover the problem at run-time using type assertions. You could even add a value of a non-comparable type to a set which would cause a run-time panic.

Furthermore, it also affects performance since set elements must be "boxed" and "unboxed" as they are stored in an interface{}

Sets using Generics

Go Authors sets package

Generics allow the above-mentioned problems with using sets in Go to be addressed (without the need for another built-in type).  Undoubtedly, when Go has generics, sets will be one of the first things added to the Go standard library. In fact the "dev.gog2go" branch of the Go compiler repository includes a "proposed" implementation of exactly that in a sets package.

Internally this sets package uses a map (see sets.go2 line 10) to implement a generic set.  It has the safety and performance of a map with convenience and reliability of tried and tested methods. There are basic methods (create a set, copy a set, compare sets, add/delete elements, size of the set, etc) and set operations (union, intersection, subset, etc)

There are also some useful methods to get or filter the elements of the set. But note that, since the elements are stored in a map, the order the elements are processed/filtered is indeterminate.

C++ range_set 

I first became familiar with parametric polymorphism when "templates" were added to C++.  I was particularly enamored with STL library originally designed by Alex Stepanov, and later incorporated into the C++ standard library.  I won't go into all the wonderful details but one of the most useful things it provided was containers (vectors, lists, deques, maps, sets, etc).

My first experiment with C++ templates was to create a "range" set which was drop-in compatible with the STL set class, apart from the fact that the element type had to be an integer type. (Technically it could be any type that implemented operator<() and operator++() which would include integers, and possibly, other user-defined types.)  Internally it was implemented using a linked-list of range structs rather than the balanced binary tree used by the STL set class.

If you are interested, the full source code (C++ header file) can be found here. I also wrote an article for the C/C++ User Journal in June 1999 that describes it (see A Container for a Set of Ranges).

Go rangeset package

A few months ago I decided to try implementing a generic range set class in Go using the experimental go2go tool. Although my C++ implementation used a linked list of ranges, I found that in Go a slice of ranges worked as well if not better.  Each "range" simply stores the bounds of the range using asymmetric bounds (inclusive lower bound, exclusive upper bound). All operations maintain the property that the ranges are kept in numeric order and non-overlapping.

Like my C++ range set, the Go implementation tries to be compatible with Go authors sets package mentioned above. I have used the same method names, including ContainsAddSetSubSetIterate, etc, so a rangeset could act as a drop-in replacement.  Of course, there are additional methods that take advantage of the unique properties of a rangeset, such as the ability to return all the ranges.  And as I've already mentioned there is a limitation that the element type must be an integer type.

Here is a simple example.  (For a complete list of methods and functions see the README file.)

    var squares, evens rangeset.Set[int]
    for i := 1; i <= 12; i++ {
        squares.Add(i * i)
    }
    for i := 2; i < 1000; i += 2 {
        evens.Add(i)
    }

    // Find all the even squares
    fmt.Println(Intersect(squares, evens)) // {4,16,36,64,100,144}

Rangeset Implementation

Using the go2go Tool

I developed the rangeset package using the go2go tool.  It is an amazing tool with very complete support for what I think Go generics will look like.  In essence it translates .go2 source files (containing generics types/functions) into .go files which can then be built as normal.

It does have a few minor deficiencies such as no ability to do conditional compilation (build tags). It has very good support for tests, except that it does not support external tests, though that may be fixed by the time you read this. (Note that rangeset has a lot of tests as I discuss below).

There are many places you can find detailed steps on how to install and use the go2go tool but in brief you need to have a recent Go compiler and git installed, then clone the latest Go repo and build the dev.go2go branch.  Then you can just run

$ go tool go2go build

to translate .go2 file to .go and invoke the Go compiler to complete the build of the package.

Spans

Internally, a set in the rangeset package (rangeset.Set) uses a slice where each element is a struct containing the start and end elements of a range. Essentially it is just:

  type Set[T constraints.Integer]  []struct{bottom, top T}

Note that constraints.Integer is a constraint defined in the constraints package that just specifies all of Go's integer types.

In years of using my C++ rangeset class I found it occasionally useful to have access to the actual ranges, so I created a separate type for this. I called this type Span to avoid confusion with other uses of the word range and to avoid conflict with Go's range keyword.

  type Span[T constraints.Integer]  struct{bottom, top T}

There is a Spans() method that returns all the ranges in the set as a slice of Span[T].  You can also quickly add a range of values to a rangeset using the AddRange() method. Of course, internally Spans are merged and split as necessary to maintain a normalized layout, which is the most efficient and allows operations such as comparing sets for equality. Interestingly this means that adding an element to a rangeset can reduce its memory usage if the new element is the missing value between two existing ranges, causing them to be joined.

See for the actual code defining these types.

It's also important to note that ranges are specified (and stored internally in Spans) using asymmetric bounds.  A counter-note is that when a set is serialized to a string the ranges are encoded using inclusive bounds; this avoids confusion to users who may be exposed to these strings.

Number of Elements

To be compatible with the Go Authors set, rangeset.Set implements a Len() method.  The problem with this method is that it returns an int but, depending on the element type, the number of elements can easily exceed the maximum int value. For example, in Go implementations where an int is 32 bits then uint, uint32, int64, uint64, and uintpr can have sets with more elements than the maximum int.  (In the Go Authors set this is not typically a problem as you run out of memory before you can have a set that big.)  Even a the set of all int or int32 (ie, rangeset.Universal[int]()) has one more element than can be represented in an int!  In these cases the Len() method will silently overflow, since for compatibility it can't return an error.

If this can possibly occur then you need to use the Length() method instead. This returns two values -- the number of elements (as uint64) and the number of ranges in the set.  Note that even using a uint64 for the length the number of elements in a universal set of int64, uint64 (and possibly int, uint and uintptr) is one more than the maximum uint64 - in this case you also need to check the number of ranges.  That is, if both values returned from rangeset.Length() are zero then you have an empty set, but if length == 0 and number of ranges == 1 then you have a universal set.

Set Operations

There are simple methods to add or delete elements (or ranges of elements) and check for the presence of an element in a set. There are also methods to perform set operations such as Intersect() which find the set intersection of a set with another set.  Similarly, the AddSet() method performs a set union, and SubSet() performs a set difference operation.

There are also functions that take a variable number of sets as parameters and return things like the union, intersection or whether they are all equal. I may later add other fancy set operations like power-sets and Cartesian products if there is demand for it.

The Complement operation and Universal Set

The Complement() method returns the inverse of a set - that is, any element in the set is removed and any element not in the set is added.  This operation is fast, only taking time proportional to the number of ranges - O(r).  The complement of the empty set is the universal set (and vice versa).

The function rangeset.Universal[T]() returns a Universal set - that is, as set that contains all elements possible for its element type.  It's faster than taking the complement of an empty set.

Serialization and Deserialization

The String() method encodes a range set as a string; the NewFromString() function does the opposite converting an encoded string to a set. These are useful for serializing sets for transmission or storage, but can also be handy for seeing or even entering sets.

The format of the string is that the set is always enclosed in braces ({}) and elements or ranges are separated by a comma (,), while a range is two numbers separated by a colon (:).  For example: "{1,3,5:9}" encodes a set of 7 elements. Note that to make it easier for users to deal with these string they use "inclusive" bounds (unlike how they are represented normally using asymmetric bounds as discussed above), so the string "{1:10}" represents ten elements (1 to 10 inclusive).

The string produced by the String() method is normalized - ie, elements and ranges are in order and not overlapping, but this is not a requirement of the string passed to NewFromString(). Plus, the special element "E" represent the end, or very first/last element of a set, so this code produces a set with the elements 1 and 99 to 255 inclusive:

    rangeset.NewFromString[uint8]("{99:E,1}")

Furthermore the shortcut "{U}" means the universal set - the same as "{E:E}".

Obtaining and Processing All Elements

Like the Go Authors set, a rangeset has a method to return all elements as a slice (called Values()), but be careful as this could produce a huge slice if there are any large ranges.  There is also a Spans() method that returns all the ranges as a slice of Spans, which may be safer.

Also like the Go Authors set, there is a method (called Iterate()) that can be used to run a function on every method in the set. Also `Filter()` can be used to selectively delete elements from the set.

Unlike the Go Authors set, there are also methods to handle channels of elements. Iterator() (not to be confused with Iterate() mentioned above) returns a chan that receives all the elements in order.  In contrast, ReadAll() adds all items it receives on a channel to the set.  Both of these methods take a context.Context as their first parameter so they can be cancelled at any time, or after a specified timeout.

Testing

The rangeset package has about 20 table-driven and 20 non-table driven test functions. In  all there are about 700 tests with 100% code coverage. I believe that every boundary condition and possibility is tested, but please tell me if you spot a gap anywhere.

Note that currently all the tests are internal tests since I could not get external test (using package rangeset_test) to work.  However, all the tests are "external" tests - ie only test the public API of the package.  I will convert the tests to external tests when possible.

To run all the tests:

$ go tool go2go test

Conclusion

A rangeset is not appropriate for every type of set.  It has some performance advantages for sets with large ranges of contiguous elements.  In fact very large sets can be implemented in a small amount of memory which would be impossible with other types of sets, such as sets with trillions of elements.

Many operations, such as adding, deleting and finding elements, use a binary search on the list of ranges.  In the worst case (where few or no elements are contiguous) this gives time complexity of O(log) but in the best case (all elements are in a single contiguous range) the time complexity of O(1), like other set implementations.  In practice it can have better performance than even a hash table.

Unlike hash table based sets, when the elements are iterated (or otherwise processed) they are seen in order. An unusual and sometimes useful feature is the ability to invert a set and even create a universal set (inverse of the empty set).

Another cool feature is the ability to send/receive sets to/from a chan.  The Iterator() method returns a (generic) chan that returns all the elements of the set (in order); the ReadAll() method adds new elements to a set by reading from a chan. These channels can be used with a generic channels package I am currently working on which I will talk about next time!


No comments:

Post a Comment