Quickly Selecting a Random Line from a Huge Text File using Seek

Consider the following scenario. You have a very large text file and you want to quickly select one or more random lines from it using a C# program. A naive approach would be to first scan the file and count how many lines it contains, then generate a random integer line number between 0 and the number of lines (less 1), then start at the beginning of the file and read lines until you reach the random line number. If the text file is huge this approach would likely take too long, especially if you want to repeatedly sample lines. One way to get a more or less random line is to use the Seek method on the StreamReader's underlying BaseStream. The idea is to determine how many bytes there are in the file, generate a random byte number in range, go directly to that random byte, then read the next available line.
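For comparison, here is a minimal sketch of the naive two-pass approach (sourceFile and a Random object named random are assumed to be defined, as in the main example below):

// naive approach: one pass to count lines, a second pass to fetch the chosen line
int numLines = 0;
using (StreamReader counter = new StreamReader(sourceFile))
  while (counter.ReadLine() != null)
    ++numLines;

int target = random.Next(numLines); // 0 .. numLines-1
string randomLine = null;
using (StreamReader reader = new StreamReader(sourceFile))
  for (int i = 0; i <= target; ++i)
    randomLine = reader.ReadLine();

For example, suppose you want to read and store information from 10 random lines that are in a huge text file: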

FileStream ifs = new FileStream(sourceFile, FileMode.Open);
StreamReader sr = new StreamReader(ifs);
string line = "";
string[] tokens = null;

// determine extent of source file
long lastPos = sr.BaseStream.Seek(0, SeekOrigin.End);
for (int i = 0; i < 10; ++i)
{
  // generate a random position
  double pct = random.NextDouble(); // [0.0, 1.0)
  long randomPos = (long)(pct * lastPos);
  if (pct >= 0.99)
    randomPos -= 1024; // if near the end, back up a bit

  sr.BaseStream.Seek(randomPos, SeekOrigin.Begin);

  line = sr.ReadLine(); // consume curr partial line
  line = sr.ReadLine(); // this will be a full line
  sr.DiscardBufferedData(); // sync reader's buffer with the stream before the next Seek

  // extract info from line here
}
sr.Close(); ifs.Close();

First I open the file using a StreamReader. Next I determine how many bytes there are in the file using BaseStream.Seek and SeekOrigin.End. Then I enter a loop to examine 10 sort-of random lines. I assume there is a Random object named random and use it to generate a percentage value between 0.0 and 1.0. By multiplying the random percentage by the number of bytes I get a random byte position. Next I commit a hack by checking to see if I'm close to the end of the file, that is, if the random percentage is greater than or equal to 0.99, and if so, I back off a bit (arbitrarily 1024 bytes) so that when I read, I won't fly off the end of the file. You may want to check for this in a much more sophisticated way.

Next I use Seek to jump directly to the random byte position. I'm likely in the middle of a line so I consume the partial line using ReadLine and then perform a second ReadLine to get a full line. At this point I use the mysterious DiscardBufferedData method to sync the StreamReader's internal buffer with the physical stream position. There's a lot to this but basically when you perform multiple Seeks, you want to call DiscardBufferedData after reading and before the next Seek; otherwise the reader may serve stale buffered data. Now I have my random line and I can extract whatever information I want from it, perhaps by using String.Split, placing the parsed line into string array tokens.
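For instance, if each line held comma-delimited fields (the delimiter here is an assumption; use whatever your file actually contains), the extraction step might look like:

tokens = line.Split(','); // e.g., "id,value,label" -> { "id", "value", "label" }
string firstField = tokens[0];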

Actually, for my situation I created a class to emit random parts of the text file. The class has a buffer of values I’m interested in. When called, if the buffer is not empty I return the next value. If the buffer has become empty, I use the code above to reload the buffer.
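I haven't shown that class here, but a minimal sketch of the idea might look like the following (the class and method names are just illustrative, and Reload uses the same Seek technique as above):

using System;
using System.Collections.Generic;
using System.IO;

public class RandomLineBuffer
{
  private readonly string sourceFile;
  private readonly Random random;
  private readonly Queue<string> buffer = new Queue<string>();

  public RandomLineBuffer(string sourceFile, Random random)
  {
    this.sourceFile = sourceFile;
    this.random = random;
  }

  public string NextLine()
  {
    if (buffer.Count == 0)
      Reload();
    return buffer.Dequeue();
  }

  private void Reload()
  {
    using (FileStream ifs = new FileStream(sourceFile, FileMode.Open))
    using (StreamReader sr = new StreamReader(ifs))
    {
      long lastPos = sr.BaseStream.Seek(0, SeekOrigin.End);
      for (int i = 0; i < 10; ++i)
      {
        double pct = random.NextDouble();
        long randomPos = (long)(pct * lastPos);
        if (pct >= 0.99)
          randomPos -= 1024; // crude guard against reading past the end

        sr.BaseStream.Seek(randomPos, SeekOrigin.Begin);
        sr.DiscardBufferedData(); // sync reader's buffer with the new position

        sr.ReadLine();               // consume current partial line
        string line = sr.ReadLine(); // a full line
        if (line != null)
          buffer.Enqueue(line);
      }
    }
  }
}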


One Response to Quickly Selecting a Random Line from a Huge Text File using Seek

  1. The performance benefits of this approach are quite interesting. There are some side effects which are worth bearing in mind, though their importance depends greatly on the situation.

    1) we will never select the first line of the file.
    2) if the file is small, the .99 figure might be too large: we might seek to within the last line even with pct < 0.99. In this case we'll have a blank line (or whatever ReadLine() does if we're at the end of the file).
    3) if the file is small, 99% of the way through the file might be less than 1024 characters from the beginning.
    4) if lines are short, the "backtrack by 1024" will cause us never to read the last few lines of the file; instead we will read the lines just before them twice as often as earlier lines.

    If any of these is important, we might consider seeking to a random position within the file, then backtracking to either the first previous newline (or the beginning of the file); we would then read from this point to the next newline (or the end of the file), as sketched below.
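    A minimal sketch of that backtracking idea (illustrative only; it assumes '\n' line endings and single-byte characters):

    using (FileStream fs = new FileStream(sourceFile, FileMode.Open))
    {
      long pos = (long)(random.NextDouble() * fs.Length);

      // backtrack to just after the previous '\n', or to the start of the file
      while (pos > 0)
      {
        fs.Seek(pos - 1, SeekOrigin.Begin);
        if (fs.ReadByte() == '\n')
          break;
        --pos;
      }

      // read forward to the next '\n', or to the end of the file
      fs.Seek(pos, SeekOrigin.Begin);
      var sb = new System.Text.StringBuilder();
      int b;
      while ((b = fs.ReadByte()) != -1 && b != '\n')
        if (b != '\r')
          sb.Append((char)b);
      string randomLine = sb.ToString();
    }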

    However, even this approach has a side effect which cuts to the heart of the "seek to a random position to save scanning for line breaks" optimization:

    5) if the file has lines of varying length, we will favor "lines which follow a long line" at the expense of "lines which follow a short line." (The backtracking approach will instead favor long lines at the expense of short lines.) For example, in a file containing a 1,000-byte line followed by a 10-byte line, a random byte position lands in the long line about 99% of the time, so the short line after it is almost always the one selected. If all of the lines in the file are roughly the same length, this is not a concern.

    When I worked in retail, I noticed that there were often times during the day when there were no customers in the store. These would be followed by times where the store was crowded. The customers would then complain "it's always so crowded in here!"

    From my point of view, this was obviously incorrect; but from their point of view, this statement is true, because when the store was empty, there were no customers there to experience the emptiness of the store!
