Using the CNTK Built-In File Reader Functions

Microsoft CNTK is a very powerful code library for machine learning. The library is written in C++ but has a Python API for convenience.

I’ve been taking a very deep dive into CNTK v2.0 Release Candidate 1. Version 2.0 should be released to the public sometime in the next few months.

Yesterday I spent quite a bit of time with experiments to understand the built-in file reader functions. For example, suppose you have a data file like so:

5.0,3.5,1.3,0.3,1,0,0
4.5,2.3,1.3,0.3,1,0,0
5.5,2.6,4.4,1.2,0,1,0
6.1,3.0,4.6,1.4,0,1,0
6.2,3.4,5.4,2.3,0,0,1
5.9,3.0,5.1,1.8,0,0,1

This is part of the famous Iris Dataset. The first four numbers in each row are the predictor variables (sepal length, sepal width, petal length, petal width). The next three numbers represent the species: (1,0,0) = “setosa”, (0,1,0) = “versicolor”, (0,0,1) = “virginica”.

In order to use this data with CNTK you’d have to write a custom Python function that parses the data file into two matrices, one for the predictor values, one for the label values. Not too difficult, but quite time-consuming.
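For example, a minimal sketch of such a parser, using NumPy (the function name and column layout are my assumptions based on the file shown above):

# sketch: custom parser for the comma-delimited Iris file
import numpy as np

def load_iris_data(path):
  data = np.loadtxt(path, dtype=np.float32, delimiter=",")
  features = data[:, 0:4]  # first four columns = predictors
  labels = data[:, 4:7]    # last three columns = one-hot species
  return features, labels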

An alternative is to create a file that uses a special CNTK format, and then use built-in CNTK reader functions. The data above, in CNTK format, is:

|features 5.0 3.5 1.3 0.3 |labels 1 0 0
|features 4.5 2.3 1.3 0.3 |labels 1 0 0
|features 5.5 2.6 4.4 1.2 |labels 0 1 0
|features 6.1 3.0 4.6 1.4 |labels 0 1 0
|features 6.2 3.4 5.4 2.3 |labels 0 0 1
|features 5.9 3.0 5.1 1.8 |labels 0 0 1

Here the words “features” and “labels” aren’t special, so I could have used “predictors” and “species”, for example.

Reading this data file would start with:

# reader_demo.py
# demo the CNTK built-in reader

import cntk as C
import numpy as np
from cntk.io import CTFDeserializer, MinibatchSource, \
  StreamDef, StreamDefs
from cntk.io import INFINITELY_REPEAT

def create_reader(path, is_training, input_dim, output_dim):
  return MinibatchSource(
    CTFDeserializer(path, StreamDefs(
      labels = StreamDef(field='labels', shape=output_dim,
        is_sparse=False),
      features = StreamDef(field='features', shape=input_dim,
        is_sparse=False)
    )),
    randomize = is_training,
    max_sweeps = INFINITELY_REPEAT if is_training else 1)

The program-defined create_reader function looks a bit messy but is essentially boilerplate. The calling code could be:
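Note that the field= arguments in the StreamDef calls must match the tags in the data file exactly. If I had tagged the file with |predictors and |species instead, the StreamDefs part of create_reader would become something like this (a sketch, using the alternate tag names from above):

# hypothetical variant for a file tagged |predictors and |species
StreamDefs(
  labels = StreamDef(field='species', shape=output_dim,
    is_sparse=False),
  features = StreamDef(field='predictors', shape=input_dim,
    is_sparse=False)
)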

print("\nEnd CNTK reader demo \n")

input_dim = 4
output_dim = 3

input_Var = C.input(input_dim, np.float32) 
label_Var = C.input(output_dim, np.float32)

theFile = "dummyData_cntk.txt"
batch_size = 2
my_reader = create_reader(theFile, True,
  input_dim, output_dim)
my_input_map = {
  label_Var  : my_reader.streams.labels,
  input_Var  : my_reader.streams.features
}

for i in range(0, 5):
  print("Reading batch " + str(i))
  currBatch = my_reader.next_minibatch(batch_size,
    input_map = my_input_map)

print("\nEnd CNTK reader demo \n")

The input_Var and label_Var objects are somewhat mysterious, and a full explanation is outside the scope of this post. The my_reader object fetches chunks of the file at a time and returns a batch of feature and label data that can be passed to a CNTK training function.
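To verify that the reader is actually working, you can inspect the MinibatchData objects in the dictionary that next_minibatch returns. For example, inside the loop you could add something like:

  # sketch: peek at what next_minibatch returned
  print(currBatch[input_Var].num_samples)  # 2, the batch size
  print(currBatch[label_Var].num_samples)  # 2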

Moral: Dealing with data is always rather annoying. Because CNTK is a low-level library, you can write Python code to parse a data file in whatever format you have, or you can create a special CNTK-format version of your data and then use the built-in reader functions.
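If you go the CNTK-format route, the conversion itself is just a few lines of ordinary Python. A minimal sketch (the input file name is hypothetical; the output name matches the demo above):

# sketch: convert the comma-delimited Iris file to CNTK format
with open("iris_raw.txt", "r") as fin, \
     open("dummyData_cntk.txt", "w") as fout:
  for line in fin:
    tokens = line.strip().split(",")
    fout.write("|features " + " ".join(tokens[0:4]) +
      " |labels " + " ".join(tokens[4:7]) + "\n")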
