Splitting a File of Data into Training and Test Files

A common task when working with machine learning is to split a file of source data into a file of test data (typically 20% of the lines) and a file of training data (the remaining 80%). There are many ways to do this. One approach that is useful when there's no extra processing involved (such as normalizing the data) is a file-only approach. By this I mean the program never reads the whole source file into memory; it streams through the file line by line and writes each line to either the training file or the test file.

In pseudo-code:

determine number of source lines
determine number of train, test items
generate a random ordering of lines
create dictionaries that indicate if
  a source line is train or test
loop each line of source
  if line belongs to train:
    write line to train file
  else:
    write line to test file
end-loop
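
To make the index bookkeeping concrete, here is a minimal sketch for a hypothetical source file of 10 lines. The value of N and the use of plain sets (rather than dictionaries) are assumptions made for illustration only.

```python
import numpy as np

N = 10                       # assumed number of source lines
num_train = int(0.80 * N)    # 8 lines go to train
num_test = N - num_train     # 2 lines go to test

np.random.seed(1)
indices = np.arange(N)       # array [0, 1, . . N-1]
np.random.shuffle(indices)   # random ordering of line numbers

# the first num_train shuffled line numbers are train, the rest are test
train_lines = set(indices[:num_train].tolist())
test_lines = set(indices[num_train:].tolist())

print("train line numbers:", sorted(train_lines))
print("test line numbers: ", sorted(test_lines))
```

Every line number from 0 to N-1 lands in exactly one of the two collections, so a single pass over the source file can route each line by a simple membership check.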

As always, the devil is in the details. And there are dozens of design and implementation options. When working with ML, getting data ready is never fun. Never.

# make_train_test.py
# does not read source into memory
# useful when no processing needed

import numpy as np

def file_len(fname):
  # count lines by streaming; returns 0 for an empty file
  count = 0
  f = open(fname)
  for line in f:
    count += 1
  f.close()
  return count

def main():
  source_file = ".\\source_file.txt"
  train_file = ".\\train_file.txt"
  test_file = ".\\test_file.txt"

  N = file_len(source_file)
  num_train = int(0.80 * N)
  num_test = N - num_train

  np.random.seed(1)
  indices = np.arange(N)  # array [0, 1, . . N-1]
  np.random.shuffle(indices)

  train_dict = {}
  test_dict = {}
  for i in range(0,num_train):
    k = indices[i]; v = i  # v is never read; only the key matters
    train_dict[k] = v
  for i in range(num_train,N):
    k = indices[i]; v = i
    test_dict[k] = v  

  f_source = open(source_file, "r")
  f_train = open(train_file, "w")
  f_test = open(test_file, "w")

  line_num = 0
  for line in f_source:
    if line_num in train_dict: # checks for key
      f_train.write(line)
    else:
      f_test.write(line)
    line_num += 1

  f_source.close()
  f_train.close()
  f_test.close() 

if __name__ == "__main__":
  main()
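
One of the many possible design variations, sketched under a few assumptions: the function name split_file and its parameters are hypothetical, a single set of training line numbers replaces the two dictionaries, with-statements close the files even if an error occurs, and np.random.default_rng is swapped in for np.random.seed so the function does not touch NumPy's global random state. Because the generator differs, the same seed will not reproduce the same split as the program above.

```python
import numpy as np

def split_file(source_file, train_file, test_file,
               train_frac=0.80, seed=1):
  # first pass: count lines without holding them in memory
  with open(source_file, "r") as f:
    n = sum(1 for _ in f)
  num_train = int(train_frac * n)

  # choose which line numbers go to the training file
  rng = np.random.default_rng(seed)
  train_lines = set(rng.permutation(n)[:num_train].tolist())

  # second pass: route each line to one of the two output files
  with open(source_file, "r") as f_source, \
       open(train_file, "w") as f_train, \
       open(test_file, "w") as f_test:
    for line_num, line in enumerate(f_source):
      if line_num in train_lines:
        f_train.write(line)
      else:
        f_test.write(line)

  return num_train, n - num_train  # (train count, test count)
```

A call such as split_file("source_file.txt", "train_file.txt", "test_file.txt") would return the number of lines written to each output file. Like the original program, this reads the source file twice but never stores its lines in memory.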