More About Iterating Through a CNTK Minibatch Input Data Object

Recently, I wanted to iterate through a built-in CNTK library input data structure, a MinibatchData object, or just minibatch for short. With the help of a work colleague (Nikos), I finally figured out how to do it, but the technique is ugly. In a previous post I described how to walk through a text file in CNTK data format, which, although a bit tricky in the details, is simple in principle.

The alternative I was looking at is to iterate through a CNTK minibatch object. Although the code (below) is short, it’s very tricky, and not at all obvious. My demo program creates a special CNTK reader. The reader has a next_minibatch() function that returns a minibatch, which is actually a Python dictionary.

To get at the data in the minibatch dictionary, you have to get the keys, create a list from the keys (because, weirdly, the keys aren’t directly indexable), then fetch the data using the keys list and the asarray() function. But unfortunately, the order of the keys in the dictionary can vary from run to run, so the technique isn’t practical unless you sort the keys list, which is way more trouble than it’s worth. In short, to walk through CNTK input values, you’re better off using np.loadtxt() or plain Python to iterate through the source data file rather than iterating through the minibatch collection that holds the data in memory.
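To illustrate the file-reading alternative, here’s a minimal sketch of pulling the values out of one line of CNTK text-format data with plain Python. The stream tags ('predictors' and 'passengers') match the demo program below, but the parse_ctf_line() helper is my own illustration, not part of the CNTK library:

```python
import numpy as np

def parse_ctf_line(line):
  # A CNTK-format line looks like:
  #   |predictors 1.0 2.0 3.0 4.0 |passengers 5.0
  # Split on '|' to get each named stream, then peel off the tag.
  fields = {}
  for chunk in line.strip().split('|'):
    if not chunk.strip():
      continue  # skip the empty piece before the first '|'
    parts = chunk.split()
    tag, values = parts[0], parts[1:]
    fields[tag] = np.array(values, dtype=np.float32)
  return fields

line = "|predictors 1.0 2.0 3.0 4.0 |passengers 5.0"
fields = parse_ctf_line(line)
print(fields["predictors"])  # [1. 2. 3. 4.]
print(fields["passengers"])  # [5.]
```

Looping this helper over the lines of the data file gives you the features and labels in a fixed, predictable order, with none of the key-ordering headaches described above.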

I really like CNTK a lot. But this is a tiny bit crazy. It shouldn’t be that hard to walk through a critically important data structure. I bet that the CNTK team will be adding an easy-access function at some point in the near future. To be fair, CNTK was only released about 9 weeks ago, so a little roughness is expected. And in my opinion, CNTK is much, much easier to use than its direct competitor, the TensorFlow library.

# read_exp.py
# use Nikos' solution to fetch the contents of a minibatch

import numpy as np
import cntk as C

def create_reader(path, is_training, input_dim, output_dim):
  return C.io.MinibatchSource(C.io.CTFDeserializer(path,
    C.io.StreamDefs(
      features = C.io.StreamDef(field='predictors', shape=input_dim,
        is_sparse=False),
      labels = C.io.StreamDef(field='passengers', shape=output_dim,
        is_sparse=False)
    )), randomize = is_training,
    max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)

the_file = "tsr_sample_cntk.txt"

input_dim = 4
output_dim = 1
input_Var = C.ops.input(input_dim, np.float32)
label_Var = C.ops.input(output_dim, np.float32)

rdr = create_reader(the_file, False, input_dim, output_dim)

my_input_map = {
  input_Var : rdr.streams.features,
  label_Var : rdr.streams.labels
}

np.set_printoptions(precision=2)
print("\nFeatures and Labels: ")
for i in range(0,6):  # each data item
  mb = rdr.next_minibatch(1, input_map = my_input_map)
  keys = list(mb.keys())
  print(mb[keys[0]].asarray()) # no order guarantee !!
  print(mb[keys[1]].asarray())
  print("")

print("\nEnd experiment \n")
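If I'm reading the CNTK API correctly, the key-order problem can be sidestepped entirely: when next_minibatch() is given an input_map, the returned dictionary is keyed by the input variables themselves, so you can index with input_Var and label_Var directly instead of positions in a keys list. The sketch below demonstrates the indexing idea with hypothetical stand-in classes (Var, MBData) rather than the real library, so treat the CNTK-specific claim as an assumption to verify:

```python
import numpy as np

# Hypothetical stand-ins for a CNTK input variable and MinibatchData
# object -- just to show the indexing pattern without the library.
class Var:
  def __init__(self, name):
    self.name = name

class MBData:
  def __init__(self, arr):
    self._arr = arr
  def asarray(self):
    return self._arr

input_Var = Var("features")
label_Var = Var("labels")

# Simulated next_minibatch() result: a dict keyed by the input
# variables. Insertion order here is deliberately "wrong" to show
# that indexing by variable doesn't depend on key order.
mb = {label_Var: MBData(np.array([[5.0]])),
      input_Var: MBData(np.array([[1.0, 2.0, 3.0, 4.0]]))}

print(mb[input_Var].asarray())  # features, regardless of dict order
print(mb[label_Var].asarray())  # labels
```

In the demo program above, that would mean replacing the keys-list lines with mb[input_Var].asarray() and mb[label_Var].asarray(), with no sorting required.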
