The MNIST image dataset is often used as the “Hello World” example for image recognition in machine learning. The dataset has 60,000 training images for creating a prediction model and 10,000 test images for evaluating the accuracy of that model.
Each grayscale image represents a single, hand-drawn digit from 0 to 9. Each image is 28 x 28 pixels, where each pixel value is between 0 (pure white) and 255 (pure black).
There are a total of four files located at http://yann.lecun.com/exdb/mnist/. The first file has the 60,000 training labels (0 to 9). The second file has the corresponding pixel values for each training image (28 * 28 = 784 values per image). The third file has the 10,000 test labels (0 to 9). The fourth file has the corresponding pixel values for the test images.
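As a quick sanity check on the arithmetic, here is a small sketch that computes how big each unzipped file should be. It assumes the 16-byte image header and 8-byte label header that the conversion code later in this post skips over; the function name expected_sizes is just something I made up for illustration.

```python
# Sketch: expected sizes of the unzipped MNIST files.
# Assumes a 16-byte header on image files and an 8-byte header on label files,
# one byte per pixel, and one byte per label.
IMAGE_HEADER_BYTES = 16
LABEL_HEADER_BYTES = 8
PIXELS_PER_IMAGE = 28 * 28  # 784

def expected_sizes(n_images):
    """Return (image_file_bytes, label_file_bytes) for n_images items."""
    image_bytes = IMAGE_HEADER_BYTES + n_images * PIXELS_PER_IMAGE
    label_bytes = LABEL_HEADER_BYTES + n_images
    return image_bytes, label_bytes

for name, n in [("training", 60000), ("test", 10000)]:
    img, lbl = expected_sizes(n)
    print(name, "images:", img, "bytes; labels:", lbl, "bytes")
```

If an unzipped file on disk has a different size than this predicts, the download or extraction probably went wrong.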
The raw data files are stored compressed (gzip) and in a custom binary format. To read or use the MNIST data as plain text, you must convert the binary data into text data. I’ve seen many utility programs to do the conversion, usually written in Python, and most are hard to follow. I set out to write the simplest conversion utility possible.
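Each binary file starts with a short header before the data bytes. The sketch below decodes that header instead of discarding it. It assumes the layout described on the MNIST page: big-endian 32-bit integers, with a magic number of 2051 for image files (followed by count, rows, columns) and 2049 for label files (followed by count). The function name parse_idx_header is my own, not part of any library.

```python
import struct

def parse_idx_header(data):
    """Decode the header from the first bytes of an unzipped MNIST file.

    Assumes big-endian unsigned 32-bit fields. Returns
    (magic, count, rows, cols) for image files and (magic, count) for
    label files.
    """
    magic, count = struct.unpack(">II", data[:8])
    if magic == 2051:  # image file: rows and columns follow
        rows, cols = struct.unpack(">II", data[8:16])
        return magic, count, rows, cols
    if magic == 2049:  # label file: only the item count
        return magic, count
    raise ValueError("unexpected magic number: " + str(magic))

# Usage, assuming an unzipped image file at a path of your choosing:
#   with open(path_to_image_file, "rb") as f:
#       print(parse_idx_header(f.read(16)))
```

Checking the magic number is a cheap way to catch passing the label file where the image file was expected.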
The first decision is to choose a format for the resulting text file. I arbitrarily decided that I wanted the result to look like this:
digit 7
pixls 0 0 25 253 . . .
digit 2
pixls 0 0 127 84 . . .
digit 9
pixls 0 0 0 172 . . .
etc.
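To show that this text format is easy to consume, here is a minimal sketch of a reader that turns the digit/pixls line pairs back into (label, pixels) pairs. The function name read_mnist_text is hypothetical, and the sample lines are shortened to four pixel values each rather than the real 784.

```python
def read_mnist_text(lines):
    """Yield (label, pixels) pairs from an iterable of text lines."""
    label = None
    for line in lines:
        tokens = line.split()  # split() also drops any trailing space
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == "digit":
            label = int(tokens[1])
        elif tokens[0] == "pixls":
            pixels = [int(t) for t in tokens[1:]]
            yield label, pixels

# Shortened sample data (a real "pixls" line has 784 values).
sample = ["digit 7\n", "pixls 0 0 25 253 \n",
          "digit 2\n", "pixls 0 0 127 84 \n"]
for label, pixels in read_mnist_text(sample):
    print(label, pixels)
```

Because the function accepts any iterable of lines, an open text file can be passed to it directly.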
The next step is to download the four compressed data files from the URL above and unzip them. The files are in .gz format, which my Windows machine couldn’t open natively, so I needed the 7-Zip utility. After installing it, I right-clicked on each .gz file, selected “7-Zip” then “Extract files”, and unzipped to a directory I named Unzipped. To keep things clear, I added a “.bin” extension to each unzipped file so I could remember that they’re binary files.
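If you’d rather not install 7-Zip, the extraction can also be done with Python’s standard gzip module. This is a minimal sketch, not what I actually used; the function name gunzip and the file names in the usage comment are assumptions, so adjust them to match your downloads.

```python
import gzip
import shutil

def gunzip(src_path, dst_path):
    """Decompress the .gz file at src_path and write the result to dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream copy, no need to load it all

# Usage (hypothetical file names; the Unzipped directory must already exist):
#   gunzip("t10k-labels-idx1-ubyte.gz",
#          ".\\Unzipped\\t10k-labels.idx1-ubyte.bin")
```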
The next step is to write the utility function. Here’s the code:
# converter_mnist.py

def convert(img_file, label_file, txt_file, n_images):
    lbl_f = open(label_file, "rb")   # MNIST has labels (digits)
    img_f = open(img_file, "rb")     # and pixel vals separate
    txt_f = open(txt_file, "w")      # output file to write to

    img_f.read(16)   # discard header info
    lbl_f.read(8)    # discard header info

    for i in range(n_images):   # number images requested
        lbl = ord(lbl_f.read(1))  # get label (one byte, as an int)
        txt_f.write("digit " + str(lbl) + "\n")
        txt_f.write("pixls ")
        for j in range(784):  # get 784 vals from the image file
            val = ord(img_f.read(1))
            txt_f.write(str(val) + " ")  # will leave a trailing space
        txt_f.write("\n")  # next image

    img_f.close(); txt_f.close(); lbl_f.close()

def main():
    convert(".\\Unzipped\\t10k-images.idx3-ubyte.bin",
            ".\\Unzipped\\t10k-labels.idx1-ubyte.bin",
            "mnist_test.txt", 3)

if __name__ == "__main__":
    main()
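One easy way to check that a conversion worked is to draw an image as text and see whether it looks like the labeled digit. Here is a rough sketch; the function name render and the threshold of 128 are arbitrary choices of mine, and the tiny 3 x 3 example stands in for a real 28 x 28 image.

```python
def render(pixels, width=28, threshold=128):
    """Return a multi-line string: '#' for dark pixels, '.' for light ones."""
    rows = []
    for start in range(0, len(pixels), width):
        row = pixels[start:start + width]
        rows.append("".join("#" if v >= threshold else "." for v in row))
    return "\n".join(rows)

# Tiny 3 x 3 stand-in for an image: a vertical stroke, like the digit 1.
tiny = [0, 255, 0,
        0, 255, 0,
        0, 255, 0]
print(render(tiny, width=3))
```

For real data, pass the 784 values from one “pixls” line and leave width at its default of 28.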
I made the code as simple as I could. If you are new to machine learning, the ability to work with MNIST data is important. And you need code like this to get the raw, zipped, binary MNIST data into a usable format.