I ran into a really interesting and challenging problem recently. I had a huge (about 250 million rows) SQL table that had a column with (non-unique) IDs that were type bigint (64 bits or long in C# terms). I wanted to select a more or less random ID value from the table. If the table had been small I could have used the well-known “order by newid()” technique. After a few hours of experimentation, the approach I took was to first extract into a table variable a relatively small semi-random sample of rows from the huge table using the T-SQL tablesample clause. Then I used the order by newid() trick on the table variable. Here’s a sketch of how I did it:
declare @mytable table — table variable
insert into @mytable — grab approx. 1000 rows of a 1% sample
select top(1000) pid
tablesample (1 percent)
declare @result bigint
select @result = (select top(1) pid from @mytable order by newid() )
The tablesample clause is kind of wonky. You’d guess that the code above means select 1% of the rows which is only sort of true. It actually means “select all rows from about 1% of the physical pages”, which will be only approximately 1% of the data — it could be more or less. See MSDN documentation for details. I used sp_spaceused to determine that I had about 20,000 8KB pages. Notice I used the top(1000) clause to grab just 1,000 rows — otherwise 1% of 250,000,000 records (2,500,000 records) is still a lot of records.
With this code I can generate a sort-of, quasi, semi-random ID on the fly. Alternatively I can use the code to populate a table of IDs and then select them as needed.
Notes: 1. If you have a file that is the underlying source of the data in the SQL table, an alternative approach is to use file seek to select a random byte position within the file, and then advance to the next complete line of data (in the case of a text file) or record (in the case of a binary file). 2. Actually my huge table had two ID columns so I had to do some fancy footwork using the UNION operator as shown below, but my actual scenario obscures the main idea. Additionally, I wrapped my T-SQL code up in a stored procedure because I wanted to call it from a C# program using ADO.NET — that’s another story, including the detail that stored procedures can’t explicitly return type bigint so I had to use an output parameter.
declare @mytable1 table
insert into @mytable1 select top(1000) pid1, pid2
from tblData tablesample system (1 percent)
declare @mytable2 table
insert into @mytable2 select top 50 pid1 from @mytable1
union (select top 50 pid2 from @mytable1)
declare @result bigint
select @result = ( select top 1 pid from @mytable2 order by newid() )