Preventing Duplicates: Hash Table vs. Dictionary vs. Binary Search Tree

This week I took a look at the relative performance of C# hash tables, dictionaries, and binary search trees. It’s not uncommon for me to deal with objects that have IDs of type long; for example, a user ID in Hotmail is a long. When working with such objects I often need to determine whether a particular object has already been seen, so that I can prevent duplicates of some kind. To perform this kind of checking in C# you can use a Hashtable collection from the System.Collections namespace, a Dictionary collection from the System.Collections.Generic namespace, or a custom binary search tree coded from scratch. I wanted to get a rough idea of which technique was fastest. Bottom line: the Dictionary object is fastest in most cases and the Hashtable is slowest in most cases.

I ran code something like this for the Hashtable:

Random random = new Random(0);
DateTime startTime = DateTime.Now;
Hashtable ht = new Hashtable();
for (int i = 0; i < 10000; ++i)
{
    long value = (long)(long.MaxValue * random.NextDouble());
    // note: ContainsValue walks every stored entry linearly; it is not a hash lookup
    if (ht.ContainsValue(value) == false)
    {
        // do something with value
        int key = HashCode(value);
        ht.Add(key, value);
    }
}
DateTime endTime = DateTime.Now;
TimeSpan elapsedTime = endTime - startTime;
Console.WriteLine("Time = " + elapsedTime.TotalMilliseconds);

I generate a bunch of random long values, check to see if each value is in the Hashtable or not, and if not, insert the value.
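
One caveat about the code above: ContainsValue on a Hashtable walks every stored entry looking for a match, so each check gets slower as the table fills. A fairer use of the Hashtable, and not the version I timed, keys each entry on the long itself so that ContainsKey can do a real hash lookup. A minimal sketch:

Hashtable ht = new Hashtable();
long value = 12345L; // some ID to check
// the boxed long is the key, so ContainsKey hashes it
// instead of scanning every stored entry
if (ht.ContainsKey(value) == false)
{
    ht.Add(value, null); // only membership matters; the entry's value is unused
}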

I used the standard high-low XOR trick to generate a hash code for a long value:

static int HashCode(long value)
{
    // XOR the high 32 bits of the long into the low 32 bits
    return (int)value ^ (int)(value >> 32);
}
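
As far as I know, this is the same computation that long.GetHashCode performs in Microsoft’s implementation, though that is an implementation detail rather than a documented guarantee. A quick way to check:

long v = 9876543210L;
Console.WriteLine(HashCode(v));     // hash from the helper above
Console.WriteLine(v.GetHashCode()); // built-in hash; as far as I know these agree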

For the Dictionary I coded along the lines of:

Random random = new Random(0);
DateTime startTime = DateTime.Now;
Dictionary<long, long> d = new Dictionary<long, long>();
for (int i = 0; i < 10000; ++i)
{
    long value = (long)(long.MaxValue * random.NextDouble());
    // ContainsKey is a true hash lookup; the value serves as its own key
    if (d.ContainsKey(value) == false)
    {
        // do something with value
        long key = value;
        d.Add(key, value);
    }
}
DateTime endTime = DateTime.Now;
TimeSpan elapsedTime = endTime - startTime;
Console.WriteLine("Time = " + elapsedTime.TotalMilliseconds);
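
Using the value as its own key works, but it stores each long twice. If the only goal is duplicate detection, the HashSet<long> collection (added to System.Collections.Generic in .NET 3.5) expresses the intent more directly. A sketch, assuming .NET 3.5 or later is available:

HashSet<long> seen = new HashSet<long>();
Random random = new Random(0);
for (int i = 0; i < 10000; ++i)
{
    long value = (long)(long.MaxValue * random.NextDouble());
    // Add returns false when the value is already present, so the
    // membership test and the insert are a single hashed operation
    if (seen.Add(value) == true)
    {
        // do something with value
    }
}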

The test for the custom binary search tree was similar; a sketch of the kind of tree code I mean appears below. With 10,000 random longs, the hash table took 990.0 milliseconds, the dictionary 2.0 milliseconds, and the binary search tree 8.0 milliseconds. Most of the Hashtable’s poor showing is explained by ContainsValue, which scans every entry linearly, while the Dictionary’s ContainsKey is a hashed, near-constant-time lookup. Although the tests were very uncontrolled and incomplete, they suggest to me that as a rule of thumb a Dictionary object is the best approach when preventing duplicates where the ID is type long.
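
I didn’t show the binary search tree code above; here is a minimal sketch of the kind of custom tree I mean, assuming an unbalanced tree keyed directly on the long value (the class and method names here are mine, not from a library):

class Node
{
    public long Value;
    public Node Left, Right;
    public Node(long value) { this.Value = value; }
}

class LongTree
{
    private Node root;

    // returns true if the value was inserted, false if it was already present
    public bool InsertIfAbsent(long value)
    {
        if (root == null) { root = new Node(value); return true; }
        Node curr = root;
        while (true)
        {
            if (value == curr.Value) return false; // duplicate seen before
            if (value < curr.Value)
            {
                if (curr.Left == null) { curr.Left = new Node(value); return true; }
                curr = curr.Left;
            }
            else
            {
                if (curr.Right == null) { curr.Right = new Node(value); return true; }
                curr = curr.Right;
            }
        }
    }
}

Because the inserted values are uniformly random, the tree stays close to balanced in expectation, which is why a from-scratch tree like this can hold its own against the Dictionary in this kind of test.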
