Splitting a File of Data into Training and Test Files

A common task when working with machine learning is to split a file of data into a test file (typically 20% of the items) and a training file (the remaining 80%). There are many ways to do this. One approach that is useful when there's no extra processing involved (such as normalizing the data) is a file-only approach. By this I mean one that does not start by reading the entire source file into memory.
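For contrast, here is a minimal sketch of the in-memory alternative: read everything, shuffle, and slice. The function name and the 80/20 split are my choices, not part of any library.

```python
import random

def split_in_memory(src, train_path, test_path, pct_train=0.80):
  # read the entire source file into memory
  with open(src, "r") as f:
    lines = f.readlines()
  random.shuffle(lines)  # random ordering of all lines
  num_train = int(pct_train * len(lines))
  # first pct_train of the shuffled lines -> train, rest -> test
  with open(train_path, "w") as f:
    f.writelines(lines[:num_train])
  with open(test_path, "w") as f:
    f.writelines(lines[num_train:])
```

This is simpler, but it holds the whole source file in RAM, which is exactly what the file-only approach below avoids.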

In pseudo-code:

determine number of source lines
determine number of train, test items
generate a random ordering of lines
create dictionaries that indicate if
  a source line is train or test
loop each line of source
  if line belongs to train:
    write line to train file
  else:
    write line to test file

As always, the devil is in the details. And there are dozens of design and implementation options. When working with ML, getting data ready is never fun. Never.

# make_train_test.py
# does not read source into memory
# useful when no processing needed

import numpy as np

def file_len(fname):
  # count the number of lines in a file
  with open(fname, "r") as f:
    for (i, line) in enumerate(f): pass
  return i+1

def main():
  source_file = ".\\source_file.txt"
  train_file = ".\\train_file.txt"
  test_file = ".\\test_file.txt"

  N = file_len(source_file)
  num_train = int(0.80 * N)
  num_test = N - num_train

  indices = np.arange(N)  # array [0, 1, . . N-1]
  np.random.shuffle(indices)  # random ordering of line numbers

  train_dict = {}
  test_dict = {}
  for i in range(0, num_train):
    k = indices[i]; v = 1  # only the key matters
    train_dict[k] = v      # dicts act as sets here
  for i in range(num_train, N):
    k = indices[i]; v = 1
    test_dict[k] = v

  f_source = open(source_file, "r")
  f_train = open(train_file, "w")
  f_test = open(test_file, "w")

  line_num = 0
  for line in f_source:
    if line_num in train_dict:  # checks for key
      f_train.write(line)
    else:
      f_test.write(line)
    line_num += 1

  f_source.close(); f_train.close(); f_test.close()


if __name__ == "__main__":
  main()
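One design note: the two dictionaries are really being used as sets, since only key membership is ever tested. A Python set gives the same constant-time membership check with less ceremony. A sketch of the same train/test marking using a set (the function name is mine, not from the listing):

```python
import numpy as np

def make_train_indices(n, pct_train=0.80):
  # return a set of line numbers assigned to the training file
  indices = np.arange(n)          # array [0, 1, . . n-1]
  np.random.shuffle(indices)      # random ordering of line numbers
  num_train = int(pct_train * n)
  return set(indices[:num_train].tolist())

# usage: a source line goes to the train file
# exactly when its line number is in the set
train_idx = make_train_indices(10)
```

The write loop then becomes `if line_num in train_idx:` with no second lookup structure needed, since any index not in the set belongs to the test file.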