A Virtual Enormous Hash Table with SQL

Recently I’ve been working with huge data sets. The sheer size presents all kinds of challenges because the data just won’t fit into memory. For example, I was trying to extract unique IDs from a data set where the ID is a .NET long (which in my situation meant there were 9,223,372,036,854,775,807 possible IDs).

A normal approach is to use a C# Hashtable object. However, with gigantic data sets I ran into many situations where the built-in GetHashCode() method generated duplicate hash codes, because GetHashCode() returns an int and int.MaxValue is only 2,147,483,647. I tried to create a custom ChainedHashTable class but kept running into the same duplicate-hash issue. I also tried a custom simple BinarySearchTree class, which worked, but the performance was fairly poor.

The approach I ended up using was to treat SQL as a virtual hash table. By this I mean I created a SQL database containing a single table with a single column of SQL type bigint to hold my IDs. I scanned through my data set and inserted each ID that was not already in the table with a statement along the lines of:
sqlInsertString = "if not exists(select * from tblIDs where id=" + currID + ") " +
                  "insert into tblIDs values(" + currID + ")";
This approach works but I’m not convinced it’s the best approach and I’m investigating alternatives.
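To make the idea concrete, here is a minimal sketch of the same technique using Python's built-in sqlite3 module as a stand-in for the SQL Server setup described above. The table and column names (tblIDs, id) come from the post; INSERT OR IGNORE is a SQLite idiom playing the role of the "if not exists ... insert" check, and the sample ID list is invented for illustration:

```python
import sqlite3

# In-memory database standing in for the real SQL database.
conn = sqlite3.connect(":memory:")
# PRIMARY KEY on id enforces uniqueness; SQLite INTEGER holds 64-bit signed values,
# so it covers the full range of a .NET long (like SQL bigint).
conn.execute("CREATE TABLE tblIDs (id INTEGER PRIMARY KEY)")

def add_id(curr_id: int) -> None:
    # Parameterized insert; OR IGNORE silently skips IDs already in the table,
    # mirroring the 'if not exists(...) insert' pattern from the post.
    conn.execute("INSERT OR IGNORE INTO tblIDs VALUES (?)", (curr_id,))

# Scan a (simulated) stream of IDs containing duplicates.
for curr_id in [3, 9223372036854775807, 3, 42, 42, 7]:
    add_id(curr_id)
conn.commit()

unique_ids = [row[0] for row in conn.execute("SELECT id FROM tblIDs ORDER BY id")]
print(unique_ids)  # each distinct ID appears exactly once
```

A side benefit of the parameterized form is that the ID never gets spliced into the SQL string by hand, and the PRIMARY KEY constraint makes the duplicate check the database's job rather than a separate query.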
This entry was posted in Software Test Automation.