Reading Data from a File into a NumPy Matrix

In many machine learning scenarios, you have to read training data from a text file into a matrix. When using Python with TensorFlow or CNTK, I often use the NumPy loadtxt() function.

Suppose you have a text file with four predictor variables followed by three values that represent a 1-of-N encoding:

# dummy data dataToRead.txt
1.5, 2.5, 3.5, 4.5, 1, 0, 0
5.5, 6.5, 7.5, 8.5, 0, 1, 0
9.5, 8.5, 7.5, 6.5, 0, 0, 1

To read the predictor values into a NumPy matrix you can use:

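import numpy as np

# Read only the four predictor columns (0-3) as 32-bit floats
ftrs = np.loadtxt("dataToRead.txt",
                   dtype=np.float32,
                   comments="#",        # skip lines that start with '#'
                   delimiter=",",
                   converters=None,
                   skiprows=0,
                   usecols=(0,1,2,3),   # predictor columns only
                   unpack=False,
                   ndmin=0)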
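
The same call can read the 1-of-N encoded values by pointing usecols at the last three columns. The following is a minimal sketch, not part of the original demo: the name lbls is made up for illustration, and an in-memory io.StringIO stands in for dataToRead.txt so the snippet runs on its own.

```python
import io
import numpy as np

# Stand-in for dataToRead.txt (assumption: same contents as the dummy file)
data = io.StringIO(
    "# dummy data dataToRead.txt\n"
    "1.5, 2.5, 3.5, 4.5, 1, 0, 0\n"
    "5.5, 6.5, 7.5, 8.5, 0, 1, 0\n"
    "9.5, 8.5, 7.5, 6.5, 0, 0, 1\n"
)

# Columns 4, 5 and 6 hold the 1-of-N encoded values
lbls = np.loadtxt(data,
                  dtype=np.float32,
                  comments="#",
                  delimiter=",",
                  usecols=(4, 5, 6))

print(lbls.shape)  # (3, 3)
```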

The arguments should be mostly self-explanatory, except for three less common ones. The converters argument, typically a dictionary that maps a column index to a function, allows you to convert data on the fly, for example to normalize values or deal with missing data. The unpack argument, if True, returns the transposed matrix so that each column can be assigned to its own variable. The ndmin argument specifies the minimum number of dimensions of the returned array.
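
To make the three less common arguments concrete, here is a small sketch. The helper fresh(), the divide-by-10 scaling, and the variable names are assumptions chosen for illustration only, and the data is held in memory rather than read from dataToRead.txt.

```python
import io
import numpy as np

TEXT = ("1.5, 2.5, 3.5, 4.5, 1, 0, 0\n"
        "5.5, 6.5, 7.5, 8.5, 0, 1, 0\n"
        "9.5, 8.5, 7.5, 6.5, 0, 0, 1\n")

def fresh():
    # Hypothetical helper: a new file-like object for each read
    return io.StringIO(TEXT)

# converters: scale column 0 by 1/10 while reading (arbitrary example)
scaled = np.loadtxt(fresh(), delimiter=",", usecols=(0, 1, 2, 3),
                    converters={0: lambda s: float(s) / 10.0})

# unpack=True: the result is transposed, so each column gets its own name
c0, c1, c2, c3 = np.loadtxt(fresh(), delimiter=",",
                            usecols=(0, 1, 2, 3), unpack=True)

# ndmin=2: a one-line file still comes back as a 2-D matrix
one_row = io.StringIO("1.5, 2.5, 3.5, 4.5\n")
a = np.loadtxt(one_row, delimiter=",")            # shape (4,)
one_row.seek(0)
b = np.loadtxt(one_row, delimiter=",", ndmin=2)   # shape (1, 4)

print(scaled[:, 0], c0, a.shape, b.shape)
```

The ndmin=2 setting is worth considering whenever a data file might contain just one row, because code that expects a matrix would otherwise receive a vector.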

Using loadtxt() is quick and easy. But it reads the file in a single call and has no built-in support for serving the data in chunks, as you often want for mini-batch processing. Both TensorFlow and CNTK have built-in reader functions you can use for that purpose.
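
When the whole data set fits in memory, one workaround is to load everything once and slice the resulting matrices yourself. The sketch below assumes that situation. The function iter_batches() is a hypothetical helper, not part of NumPy, TensorFlow, or CNTK, and the arrays are typed in directly to match the dummy data.

```python
import numpy as np

def iter_batches(x, y, batch_size):
    """Yield successive (x, y) slices of at most batch_size rows."""
    for start in range(0, x.shape[0], batch_size):
        end = start + batch_size
        yield x[start:end], y[start:end]

# Same values as the dummy file, as if already read with loadtxt()
ftrs = np.array([[1.5, 2.5, 3.5, 4.5],
                 [5.5, 6.5, 7.5, 8.5],
                 [9.5, 8.5, 7.5, 6.5]], dtype=np.float32)
lbls = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]], dtype=np.float32)

batches = list(iter_batches(ftrs, lbls, batch_size=2))
for bx, by in batches:
    print(bx.shape, by.shape)
```

With three rows and a batch size of 2, the last batch holds the single leftover row. This approach does not help when the file is too large to load at once, which is where the built-in readers come in.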
