Archive

Posts Tagged ‘Hash’

Hashing Strings and Pointers – Avoiding Common Pitfalls

October 12, 2011 7 comments

Introduction
In a previous post, I suggested using fixed size hash tables to map strings and pointers. Since this code really needs to be fast, reducing hash collisions probabilities is very important to maximize the hash table performances. But there is other pitfalls that needs to be addressed too.

Hashing Strings
The most common error when hashing strings is to perform a string comparison (using strcmp) when traversing the hash bin list. Even if the hash function is quite bad and has a very high probability of generating the same hash value for different strings, this is not required at all.

For my tests, I hashed the complete dictionary found in /usr/dict/words which contained 212710 unique words. Here’s some of the results:

hairbrush hash_value=0x7f3ef2bb hash_bin=0x32bb
encomiums hash_value=0xa758b2bb hash_bin=0x32bb
engrained hash_value=0x32abb2bb hash_bin=0x32bb
   seavir hash_value=0x32abb2bb hash_bin=0x32bb

As you can see, all of these strings ended up in the same hash bin 0x32bb, but have different hash values (except for ‘engrained’ and ‘seavir’).

In the hash table, it looks like this:

[0x32ba] phrenologically(0x888fb2ba)->Alfreda(0xee9432ba)->...
[0x32bb] encomiums(0xa758b2bb)->engrained(0x32abb2bb)->seavir(0x32abb2bb)->...
[0x32bc] centillion(0xb44232bc)->cyclometer(0xc5a172bc)->...

As an optimization, we can only compare the original hash value (ie the value before it was made to fit the hash table size) until we found a match. Then, we need to do a string comparison in order to make sure that the strings match. This greatly optimize the hash performances and can be used with all non-PODs data types.

Attentive readers might have noticed that, in my previous post on memory tracking, I totally skipped the string comparison when hashing the allocation tags. I decided to trade possible errors due to hash value collisions with execution speed, simply because the hash function is so efficient that I never hit a single collision.

The hash function I’m using was published in Dr. Dobb’s Algorithm Alley by Bob Jenkins, in September 1997. I’ve been using this function in several games over the years and never hit a single hash collision, sometimes hashing hundreds of thousands of unique strings.

Hashing Pointers
When hashing pointers, special care must be taken due to alignment. It is tempting to simply use the pointer itself to index in the hash table, but this is a very bad idea, as pointers are mostly aligned on 4, 8, 16 bytes or even more. Alignment makes the lowest bits set to 0, and depending on the hash table size, it may lead to a non-uniform distribution, thus reducing the hash table to a simple linked-list in the worst case.

Over the years, I’ve been using Thomas Wang’s hash function to hash pointers on several platforms and it worked very well in all cases:

tU32 HashPointer(void* Ptr)
{
    tU32 Value = (tU32)Ptr;
    Value = ~Value + (Value << 15);
    Value = Value ^ (Value >> 12);
    Value = Value + (Value << 2);
    Value = Value ^ (Value >> 4);
    Value = Value * 2057;
    Value = Value ^ (Value >> 16);
    return Value;
}

Conclusion
Always test your hash functions to make sure distribution is uniform, especially for complex types (strings, GUIDs, etc). Pay special attention to types that may vary on different platforms (pointers, etc). Also, whenever possible, prefer a fixed-size hash table that fits the data you’re hashing, as opposed to using a templated, varying size one that may use more memory and won’t allow custom traversal comparison code.

Advertisements
Categories: Programming Tags:
%d bloggers like this: