Testing Log Data

I ran into a very interesting problem recently. I was working with a very large log file (several billion lines). The file was a text file where each line was a log entry. Each entry contained various IDs, and the file was sorted by one of those IDs. For example, here is a simplified version:
sessionID  machine1  machine2
123456     567       678
123456     567       789
234567     678       901
234567     789       901
234567     890       678
I wanted to process the log file in memory, but because the file was far too big to fit, I needed to process it in chunks of about 1,000,000 lines at a time. So my idea was to create a buffer as a List (I was using C#) and read in lines. The problem was that I did not want to break the log data across the key ID field. In the hypothetical example above, I wanted to make sure that the three lines with ID 234567 stayed together. There are plenty of ways to solve this problem, but the one I came up with seemed very effective, at least in my scenario. The approach I finally settled on was this:
create primaryBufferList
create secondaryBufferQueue
loop thru log file
  read a line
  if primaryBufferList not full
    add line to primaryBufferList
  else
    add line to secondaryBufferQueue
  end if
  if both buffers full
    check = sessionID of last line in primary
    while secondary not empty and sessionID of first item in secondary = check
      dequeue secondary into primary
    end while
    // primary now has non-broken log lines
    process primaryBufferList
    empty primaryBufferList
    transfer any lines in secondary to primary
  end if
end loop
// some lines may be left in buffers
transfer any lines in secondary to primary
process primaryBufferList
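For readers who want something runnable, here is a minimal Python sketch of the two-buffer idea in the pseudocode above. It is not my original C# code. The names process_in_chunks, session_id, primary_size and secondary_size are made up for illustration, and the sketch assumes the key ID is the first whitespace-separated field on each line and that the input is already sorted by that key.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def session_id(line: str) -> str:
    # Assumption: the key ID is the first whitespace-separated field.
    return line.split()[0]


def process_in_chunks(
    lines: Iterable[str],
    process: Callable[[List[str]], None],
    primary_size: int,
    secondary_size: int,
) -> None:
    """Feed `process` chunks of lines without splitting a run of equal IDs.

    Caveat: a run of identical IDs longer than secondary_size can still
    be split, so the secondary buffer should be larger than the longest
    run expected in the data.
    """
    primary: List[str] = []
    secondary: Deque[str] = deque()

    for line in lines:
        # Fill the primary buffer first, then spill into the secondary.
        if len(primary) < primary_size:
            primary.append(line)
        else:
            secondary.append(line)

        if len(primary) >= primary_size and len(secondary) >= secondary_size:
            # Pull over any lines that continue the last ID in primary.
            check = session_id(primary[-1])
            while secondary and session_id(secondary[0]) == check:
                primary.append(secondary.popleft())

            # Primary now ends on an ID boundary.
            process(primary)

            # Whatever is left in secondary starts the next chunk.
            primary = list(secondary)
            secondary.clear()

    # Some lines may be left in the buffers at end of file.
    primary.extend(secondary)
    if primary:
        process(primary)
```

To use this on a real file, pass the open file object as lines (Python reads it lazily, one line at a time) and a function that handles one chunk as process. Note that a chunk can come out somewhat longer than primary_size, because lines that continue the last ID are pulled over from the secondary buffer.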
This approach was easier for me to debug and manage than several more obvious approaches I tried, such as using a "curr line" and "next line" approach. To summarize, the algorithm I describe here is useful for processing a very large text file in situations where there is some relationship between consecutive lines in the file.