Pulling Random Information from a Very Large SQL Server Table using C#

The problem I was looking at this week was to write C# code that allowed me to repeatedly pull one or more values from a randomly selected row in a very large SQL Server table. With a small table I could have used the NEWID trick that is well known and posted in several places on the Internet, but that trick crawls to a near halt when applied to a large (greater than about 10 million rows) table. Working backwards, I wanted the C# interface to resemble:

string server = "someServer";
string db = "dbEmployees";
string table = "tblEmployees":
int buffSize = 10;

RandomEmpGenerator reg =
  new RandomEmpGenerator(server, db, table, buffSize);
for (int i = 0; i < 20; ++i)
  long id = reg.GetNextEmp();

I'm imagining a database of employees and I want to generate random employee IDs. One approach is to use the TABLESAMPLE clause, available in SQL Server 2005 and later. For example,

FROM tblEmployees

will return sort-of random empIDs. The problem is that the number returned could be anything — 0 records to well over 1,000 records.

My class is structured:

class RandomEmpGenerator
  private long[] buff;
  private int curr;
  private int buffSize;
  private string server;
  private string db;
  private string table;
  // ctor here -- allocate buff, etc.
  // LoadBuffer()
  // TryToLoadBuffer()

  private long GetNextEmp()
    if (this.curr < this.buffSize)
      return this.buff[this.curr++];
    else {
      return this.buff[this.curr++];

The idea is to maintain a buffer of employee IDs. The curr variable is an index into the buffer; if we're not at the end of the buffer return the next ID otherwise reload the buffer then return the first ID.

private void LoadBuffer() // try to fill buffer
  int tot = 0;  
  int maxAttempts = 20;  
  int attempts = 0;
  int sampleSize = this.buffSize * 1000;
  while (tot < this.buffSize && attempts < maxAttempts)
    tot += TryToLoadBuffer(sampleSize);   
    sampleSize = sampleSize * 2;
  if (tot < this.buffSize)
    throw new Exception("Unable to LoadBuffer completely");

LoadBuffer calls a helper named TryLoadBuffer that could return enough values to fill the buffer but could return fewer.

private int TryToLoadBuffer(int sampleSize) // return count loaded
  try {
    string connString = "Server=" +
      dbServer + ";Database=" +
      database + ";Trusted_Connection=True";
    long eid;
    int ct = 0;   
    this.curr = 0; 

    string queryString = "SELECT TOP " +
      this.buffSize + " empID FROM " +
      table +
      " TABLESAMPLE (" + sampleSize + " ROWS)";

    using (SqlConnection sc = new SqlConnection(connString)) {
      SqlCommand cmd = new SqlCommand(queryString, sc);
      cmd.CommandType = System.Data.CommandType.Text;
      cmd.CommandTimeout = 60;                         
      SqlDataReader sdr = cmd.ExecuteReader();
      while (sdr.Read() == true) {
        eid = long.Parse(sdr["eid"].ToString());
        this.buff[ct++] = eid;
      sdr.Close(); sc.Close();
    } // using
    return ct; // could be less than buffer Size
  catch {
    throw new Exception("Unhandled Exception");
} // TryLoadBuffer

The key is the TABLESAMPLE clause available to SQL Server 2005 and later. I request about 10,000 random rows but a quirk of TABLESAMPLE is that it could return fewer (the TOP clause prevents more than buffSize records) so I track how many values are actually returned and loaded into the buffer. There are a ton of details but if you examine the code you should be able to figure them out and adapt the code to meet your own scenarios. The class parameterizes the value in the TABLESAMPLE clause and increases it significantly (doubles it) each time through the loop in LoadBuffer — if TryLoadBuffer isn't filling the buffer then it's likely the TABLESAMPLE value is too small.

This entry was posted in Software Test Automation. Bookmark the permalink.