Selecting a Random ID Value from a Very Large SQL Table

I ran into a really interesting and challenging problem recently. I had a huge (about 250 million rows) SQL table that had a column with (non-unique) IDs that were type bigint (64 bits or long in C# terms). I wanted to select a more or less random ID value from the table. If the table had been small I could have used the well-known “order by newid()” technique. After a few hours of experimentation, the approach I took was to first extract into a table variable a relatively small semi-random sample of rows from the huge table using the T-SQL tablesample clause. Then I used the order by newid() trick on the table variable. Here’s a sketch of how I did it:

declare @mytable table — table variable
(
pid bigint
)

insert into @mytable — grab approx. 1000 rows of a 1% sample
select top(1000) pid
from tblData
tablesample (1 percent)

declare @result bigint

select @result = (select top(1) pid from @mytable order by newid() )

The tablesample clause is kind of wonky. You’d guess that the code above means select 1% of the rows which is only sort of true. It actually means “select all rows from about 1% of the physical pages”, which will be only approximately 1% of the data — it could be more or less. See MSDN documentation for details. I used sp_spaceused to determine that I had about 20,000 8KB pages. Notice I used the top(1000) clause to grab just 1,000 rows — otherwise 1% of 250,000,000 records (2,500,000 records) is still a lot of records.

With this code I can generate a sort-of, quasi, semi-random ID on the fly. Alternatively I can use the code to populate a table of IDs and then select them as needed.

Notes: 1. If you have a file that is the underlying source of the data in the SQL table, an alternative approach is to use file seek to select a random byte position within the file, and then advance to the next complete line of data (in the case of a text file) or record (in the case of a binary file). 2. Actually my huge table had two ID columns so I had to do some fancy footwork using the UNION operator as shown below, but my actual scenario obscures the main idea. Additionally, I wrapped my T-SQL code up in a stored procedure because I wanted to call it from a C# program using ADO.NET — that’s another story, including the detail that stored procedures can’t explicitly return type bigint so I had to use an output parameter.

declare @mytable1 table
(
pid1 bigint,
pid2 bigint
)

insert into @mytable1 select top(1000) pid1, pid2
  from tblData tablesample system (1 percent)

declare @mytable2 table
(
pid bigint
)

insert into @mytable2 select top 50 pid1 from @mytable1
  union (select top 50 pid2 from @mytable1)

declare @result bigint

select @result = ( select top 1  pid from @mytable2 order by newid() )

print @result

 

This entry was posted in Software Test Automation. Bookmark the permalink.