(Note: this blog post is closely related to an earlier post, “Preparing MNIST Data for use by a CNTK Program”)
The MNIST (“Modified National Institute of Standards and Technology”) image dataset is often used to demonstrate image classification. The dataset has 60,000 images for training a model, and 10,000 images for evaluating a trained model.
Each image is 28 pixels wide by 28 pixels high, which is 784 pixels. Each image represents a single handwritten digit, ‘0’ through ‘9’. Somewhat weirdly, the 60,000 training data items are stored in two files: one file contains the pixel values and the second file contains the associated label (‘0’ to ‘9’) values. The test data is also stored in two files.
Additionally, each of the four files is stored in a proprietary binary format, using big-endian byte order rather than the little-endian order used by Intel-based machines. And to top it off, the four files are compressed as .gz files, which can’t be unzipped by a Windows-based machine by default. In short, getting MNIST data into a usable form is not trivial.
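To see what the big-endian format means in practice, here is a minimal sketch using Python’s standard struct module. The image file header is four big-endian 32-bit integers: a magic number, the image count, the row count, and the column count. A synthetic header is packed here so the snippet runs without the data file; with a real file you would read the first 16 bytes instead.

```python
import struct

# Simulate the 16-byte header of the MNIST training image file.
# The ">" prefix means big-endian; "iiii" means four 32-bit ints.
header = struct.pack(">iiii", 2051, 60000, 28, 28)

# Unpack it the same way a reader of the real file would.
magic, n_images, n_rows, n_cols = struct.unpack(">iiii", header)
print(magic, n_images, n_rows, n_cols)  # 2051 60000 28 28
```

With a real file, replace the packed bytes with f.read(16) on the opened binary file.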
Step 1: Go to the MNIST storage site at http://yann.lecun.com/exdb/mnist/ and download these four files to your machine, into a directory named ZippedBinary:
train-images-idx3-ubyte.gz (60,000 train images)
train-labels-idx1-ubyte.gz (60,000 train labels)
t10k-images-idx3-ubyte.gz (10,000 test images)
t10k-labels-idx1-ubyte.gz (10,000 test labels)
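If you prefer to script the download, here is a minimal sketch using Python’s standard urllib module. It assumes the four files are still served at the base URL above, which may change over time.

```python
import os
import urllib.request

# Assumed base URL from the download site; adjust if it has moved.
BASE_URL = "http://yann.lecun.com/exdb/mnist/"
FILES = ["train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz",
         "t10k-images-idx3-ubyte.gz", "t10k-labels-idx1-ubyte.gz"]

def download_all(dest_dir="ZippedBinary"):
    # create the target directory, then fetch each file into it
    os.makedirs(dest_dir, exist_ok=True)
    for fn in FILES:
        urllib.request.urlretrieve(BASE_URL + fn,
                                   os.path.join(dest_dir, fn))
```

Calling download_all() would populate the ZippedBinary directory, network access permitting.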
Step 2: Unzip the four files into a directory named UnzippedBinary. To unzip the files you’ll need a utility program. I strongly recommend the free 7-Zip at https://www.7-zip.org/. After unzipping, I recommend adding a “.bin” file extension to each name to remind you the files are in a proprietary binary format. So you should now have:
train-images-idx3-ubyte.bin
train-labels-idx1-ubyte.bin
t10k-images-idx3-ubyte.bin
t10k-labels-idx1-ubyte.bin
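As an alternative to 7-Zip, Python’s standard gzip module can do the unzipping. Here is a minimal sketch that decompresses one .gz file to a destination path (the commented-out call shows the file names from Steps 1 and 2):

```python
import gzip
import shutil

def unzip(src, dst):
    # stream-decompress the .gz source into the destination file
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# unzip(".\\ZippedBinary\\train-images-idx3-ubyte.gz",
#       ".\\UnzippedBinary\\train-images-idx3-ubyte.bin")
```

You would call unzip() once per file, naming each output with the “.bin” extension as described above.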
Step 3: Suppose the desired format of a data file containing the images is:
0 0 0 0 0 1 0 0 0 0 * 0 .. 170 52 .. 0
0 0 1 0 0 0 0 0 0 0 * 0 .. 254 66 .. 0
. . .
Each line is one image. The first 10 values are the digit/label information in one-hot encoded form, where the position of the 1 bit indicates the digit. So the two images above are ‘5’ and ‘2’. There is a dummy ‘*’ character between the label values and the pixel values, which is just for readability. The next 784 values are the pixels for the image, where each is between 0 and 255. Here’s a program to create training or test files in this format, with a specified number of images:
# converter_keras.py

def generate(img_bin_file, lbl_bin_file, result_file, n_images):
  img_bf = open(img_bin_file, "rb")  # pixels
  lbl_bf = open(lbl_bin_file, "rb")  # labels
  res_tf = open(result_file, "w")    # result file

  img_bf.read(16)  # discard image header info
  lbl_bf.read(8)   # discard label header info

  for i in range(n_images):    # number images requested
    # digit label first
    lbl = ord(lbl_bf.read(1))  # get label like '3'
    encoded = [0] * 10         # make one-hot vector
    encoded[lbl] = 1
    for k in range(10):
      res_tf.write(str(encoded[k]))
      res_tf.write(" ")        # like 0 0 0 1 0 0 0 0 0 0
    res_tf.write("* ")         # arbitrary separator char

    # now do the image pixels
    for j in range(784):       # get 784 vals for each image
      val = ord(img_bf.read(1))
      res_tf.write(str(val))
      if j != 783: res_tf.write(" ")  # avoid trail space
    res_tf.write("\n")         # next image

  img_bf.close(); lbl_bf.close()  # close the binary files
  res_tf.close()                  # close the result file

# ==========================================================

def main():
  generate(".\\UnzippedBinary\\train-images-idx3-ubyte.bin",
           ".\\UnzippedBinary\\train-labels-idx1-ubyte.bin",
           ".\\mnist_train_keras_3.txt",
           n_images = 3)  # first n images

if __name__ == "__main__":
  main()
Executing this program generates a file named mnist_train_keras_3.txt with 3 images in the format described above. You could change the three file names and rerun to make a file of test data.
In most situations, you could now read the labels and pixels into two different matrices, because that’s what Keras will need:
import numpy as np

y_data = np.loadtxt(the_file, delimiter=" ",
  usecols=range(0,10), dtype=np.float32)    # labels
x_data = np.loadtxt(the_file, delimiter=" ",
  usecols=range(11,795), dtype=np.float32)  # pixels; skips the '*' in column 10
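For a Keras model, the loaded pixel values are typically scaled to [0, 1], and for a CNN reshaped to (n, 28, 28, 1). Here is a minimal sketch; synthetic data stands in for the loaded x_data so the snippet runs on its own.

```python
import numpy as np

# Synthetic stand-in for x_data: 3 images of 784 pixel values.
x_data = np.random.randint(0, 256, size=(3, 784)).astype(np.float32)

x_data /= 255.0                        # normalize pixels to [0, 1]
x_cnn = x_data.reshape(-1, 28, 28, 1)  # NHWC layout for a Keras CNN
print(x_cnn.shape)  # (3, 28, 28, 1)
```

For a fully connected network you would keep the flat (n, 784) shape and skip the reshape.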
When doing machine learning, getting your data ready is almost always the most time-consuming, annoying, and difficult part of the project.