
Hashing Strings and Pointers – Avoiding Common Pitfalls

Introduction
In a previous post, I suggested using fixed-size hash tables to map strings and pointers. Since this code really needs to be fast, reducing the probability of hash collisions is very important to maximize hash table performance. But there are other pitfalls that need to be addressed too.

Hashing Strings
The most common error when hashing strings is to perform a full string comparison (using strcmp) on every node while traversing a hash bin's list. Even if the hash function is quite bad and has a very high probability of generating the same hash value for different strings, comparing every string is not required at all.

For my tests, I hashed the complete dictionary found in /usr/dict/words, which contained 212710 unique words. Here are some of the results:

hairbrush hash_value=0x7f3ef2bb hash_bin=0x32bb
encomiums hash_value=0xa758b2bb hash_bin=0x32bb
engrained hash_value=0x32abb2bb hash_bin=0x32bb
   seavir hash_value=0x32abb2bb hash_bin=0x32bb

As you can see, all of these strings ended up in the same hash bin 0x32bb, but they have different hash values (except for ‘engrained’ and ‘seavir’, which collide).

In the hash table, it looks like this:

[0x32ba] phrenologically(0x888fb2ba)->Alfreda(0xee9432ba)->...
[0x32bb] encomiums(0xa758b2bb)->engrained(0x32abb2bb)->seavir(0x32abb2bb)->...
[0x32bc] centillion(0xb44232bc)->cyclometer(0xc5a172bc)->...

As an optimization, we can compare only the original hash value (i.e. the value before it was reduced to fit the hash table size) until we find a match; only then do we need a single string comparison to make sure the strings really match. This greatly improves hash table performance, and the technique works for all non-POD data types.
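
As a sketch, the traversal might look like this (the node layout and names are illustrative, not taken from the original post):

#include <stddef.h>
#include <string.h>

typedef unsigned int tU32;

typedef struct Node
{
    struct Node* Next;
    tU32         HashValue;  /* full 32-bit hash, stored at insertion time */
    const char*  Key;
} Node;

/* Walk one bin: compare the cheap 32-bit hash first, and fall back to
   strcmp only when the full hash values already match. */
const Node* FindInBin(const Node* Bin, const char* Key, tU32 HashValue)
{
    for (const Node* It = Bin; It != NULL; It = It->Next)
    {
        if (It->HashValue == HashValue && strcmp(It->Key, Key) == 0)
            return It;
    }
    return NULL;
}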

Attentive readers might have noticed that, in my previous post on memory tracking, I totally skipped the string comparison when hashing the allocation tags. I decided to trade possible errors due to hash-value collisions for execution speed, simply because the hash function is efficient enough that I never hit a single collision.

The hash function I’m using was published in Dr. Dobb’s Algorithm Alley by Bob Jenkins, in September 1997. I’ve been using this function in several games over the years and never hit a single hash collision, sometimes hashing hundreds of thousands of unique strings.
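
The post doesn't reproduce the function, and Jenkins' material presents more than one; for illustration, here is the widely circulated "one-at-a-time" hash associated with it (assuming this variant — it may not be the exact one used here):

#include <stddef.h>

typedef unsigned int tU32;

/* Bob Jenkins' "one-at-a-time" string hash. Whether this exact variant
   is the one used in the post is an assumption. */
tU32 HashString(const char* Key, size_t Length)
{
    tU32 Hash = 0;
    for (size_t i = 0; i < Length; ++i)
    {
        Hash += (unsigned char)Key[i];
        Hash += Hash << 10;
        Hash ^= Hash >> 6;
    }
    Hash += Hash << 3;
    Hash ^= Hash >> 11;
    Hash += Hash << 15;
    return Hash;
}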

Hashing Pointers
When hashing pointers, special care must be taken because of alignment. It is tempting to simply use the pointer itself to index into the hash table, but this is a very bad idea: pointers are usually aligned on 4-, 8-, 16-byte or even larger boundaries. Alignment forces the lowest bits to 0, and depending on the hash table size, this can lead to a non-uniform distribution, reducing the hash table to a simple linked list in the worst case.

Over the years, I’ve been using Thomas Wang’s hash function to hash pointers on several platforms and it worked very well in all cases:

tU32 HashPointer(void* Ptr)
{
    /* Cast through a pointer-sized integer to avoid a compile error on
       64-bit platforms; the mix below operates on the low 32 bits. */
    tU32 Value = (tU32)(size_t)Ptr;
    Value = ~Value + (Value << 15);  /* Value = (Value << 15) - Value - 1 */
    Value = Value ^ (Value >> 12);
    Value = Value + (Value << 2);
    Value = Value ^ (Value >> 4);
    Value = Value * 2057;            /* Value = Value + (Value << 3) + (Value << 11) */
    Value = Value ^ (Value >> 16);
    return Value;
}
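
The mixed value still has to be reduced to a bin index. The bins in the string example above are consistent with a 16384-entry power-of-two table (hash & 0x3FFF), so the reduction presumably looks something like this (the helper name is hypothetical):

#define HASH_TABLE_SIZE 0x4000  /* 16384 bins; inferred from the 0x32xx bins above */

tU32 PointerToBin(void* Ptr)
{
    return HashPointer(Ptr) & (HASH_TABLE_SIZE - 1);
}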

Conclusion
Always test your hash functions to make sure the distribution is uniform, especially for complex types (strings, GUIDs, etc.). Pay special attention to types that may vary across platforms (pointers, etc.). Also, whenever possible, prefer a fixed-size hash table that fits the data you’re hashing, as opposed to a templated, varying-size one that may use more memory and won’t allow custom traversal comparison code.

  1. MatthewH
    October 12, 2011 at 9:33 am | #1

    Hashing is a deep, non-trivial problem, and you are only asking for trouble if you “roll your own”. It is one of those things that programmers love to do because “it’s fun”, and one of the things they usually screw up big time because they underestimate the scope of the problem. Also, knowing the point at which a hash table outperforms another solution, like a B-tree, is critical.

    You casually close with: “Always test your hash functions to make sure distribution is uniform…”

    So, if the hash is *not* uniform, how do you fix that? It is easy to say “make sure it is uniform”, but hard to do, and even harder to understand the theory and math behind a good solution.

    I recommend spending some time on Paul Hsieh’s site “A Zillion Monkeys” (http://www.azillionmonkeys.com/qed/hash.html) and Bob Jenkins’ site (http://burtleburtle.net/bob/index.html); see the Hashing pages (a lot of information, analysis, and even comments about the same DDJ article you based your code on).

    • October 12, 2011 at 1:55 pm | #2

      In the past, I’ve checked for uniformity simply by spying on the hash table contents in the Memory window :) In fact, just last week while implementing a custom hash table I had a bunch of clustered elements before realizing I was poorly hashing some pointers, as JF warns about in this post. This resulted in really poor performance. In my case, I fixed it using the “pseudorandomizing transform” found in Dinkumware’s <xhash> header (as shipped with Visual C++), only because I had it handy and it solved the problem. But I think you are right in that you certainly need to know what you are doing, and studying others’ hash test suites, or implementing your own, can aid in the learning process.

      • jfdube
        October 13, 2011 at 10:14 am | #3

        Breaking into the debugger is something I do too, to quickly check the uniformity of the hashing function. The only thing to keep in mind is that even if all buckets contain something, we don’t know whether the buckets are filled evenly. Adding an element count along with the element pointers in each bucket would help us identify this particular problem!

        • Daniel Trebbien
          October 13, 2011 at 12:10 pm | #4

          Uniformity is actually a sign that something is not random. Take a look at the two scatter plots at https://imgur.com/a/SMR6l . Which one shows a uniform sampling? Did you say the blue one? In fact, the red one shows a quasi-random sampling of points, but to the human eye it looks “more random” than the scatter plot of a true random sampling of points from the plane. That’s why a probabilistic approach should be used to test for randomness. See also: https://www.random.org/analysis/

  2. jfdube
    October 12, 2011 at 9:52 am | #5

    Hi MatthewH.

    Thanks for your comments, they are very appreciated.

    By “roll your own” I assume you’re talking about the hashing function, not the hash table code itself, which is pretty easy to write. You’re completely right, then; this is why I used well-known hash functions. I also did a couple of years of testing in real projects (games) on several platforms, and they have proven to give excellent results, hence the blog post.

    To answer your question about testing distribution uniformity: if the hash is not uniform, I simply play with the hash function until it is. It’s all about knowing what your data is and running proper tests to make sure the distribution is reasonably uniform. You may be right when it comes to devising hash functions for data from unknown sources, though, which I don’t recommend. Always write custom hash functions for the data you’re hashing.
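
    A minimal sketch of one such test, with illustrative names and table size: histogram the bucket counts, then compare them to the uniform expectation with a chi-squared statistic.

    /* Chi-squared uniformity check over bucket counts (illustrative names).
       For a uniform hash, the result should be close to NUM_BINS - 1. */
    #define NUM_BINS 16384

    double ChiSquared(const unsigned Counts[NUM_BINS], unsigned Total)
    {
        double Expected = (double)Total / NUM_BINS;
        double Chi2 = 0.0;
        for (unsigned i = 0; i < NUM_BINS; ++i)
        {
            double Diff = Counts[i] - Expected;
            Chi2 += Diff * Diff / Expected;
        }
        return Chi2;
    }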

  3. Daniel Trebbien
    October 12, 2011 at 11:18 am | #6

    One trick to improve a hash function operating on a pointer `Ptr` is to divide by `sizeof *Ptr`; see the C++ sketch below. This has the benefit that if the hash function is applied to multiple objects that are allocated by a pool allocator, then the low-order zero bits that account for the size of the object in bytes are factored out.
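
    A minimal sketch of this trick, reusing the HashPointer mix from the post (the wrapper and its name are hypothetical):

    #include <stddef.h>

    /* Dividing by sizeof(T) strips the low-order zero bits introduced when
       objects of type T come from a pool allocator, before feeding the
       value into an integer mix such as HashPointer above. */
    template <typename T>
    tU32 HashObjectPointer(const T* Ptr)
    {
        size_t Value = (size_t)Ptr / sizeof(T);
        return HashPointer((void*)Value);
    }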

    A few other comments:

    1. It is important to consider which pointers you will be using the hash function on. If you use a pool allocator, then the trick of dividing by the object size is appropriate. However, if the pointers all result from `malloc` or `new`, then it is important to consider your standard library’s memory alignment guarantees. For example, MSVCRT aligns memory on 16-byte boundaries, as does GNU libc on 64-bit systems. To use the hash function on `malloc`ed blocks, one should probably divide by the alignment boundary size.

    2. A rigorous statistical technique should be used for testing that the output distribution is uniform, such as the Kolmogorov–Smirnov goodness-of-fit test. The Dataplot software package has built-in routines for performing Kolmogorov–Smirnov tests for 60+ distributions. It would be interesting to see the results of K–S tests on Thomas Wang’s hash function.

    3. There are possible software security issues with choosing to hash pointer values, in the form of unintended information disclosure. Depending on the application, hashing pointers might not be such a good idea. An alternative is to generate random integer “handles” to represent unique instances of an object and use those as the hash values rather than a hash of the pointer values.
