Getting MNIST Data into a Text File

The MNIST image dataset is often used as the “Hello World” example for image recognition in machine learning. The dataset has 60,000 training images to create a prediction model and 10,000 test images to evaluate the accuracy of that model.

Each grayscale image represents a single, hand-drawn digit from 0 to 9. Each image is 28 x 28 pixels, where each pixel value is between 0 (pure white) and 255 (pure black).

There are a total of four files located at http://yann.lecun.com/exdb/mnist/. The first file has the 60,000 training labels (0 to 9). The second file has the corresponding pixel values for each training image (28 * 28 = 784 values per image). The third file has the 10,000 test labels (0 to 9). The fourth file has the corresponding pixel values for the test images.

The raw data files are stored zipped and in a custom binary format. To use the MNIST data in a program that expects text input, you must convert the binary data into text data. I’ve seen many utility programs to do the conversion, usually written in Python, and most are hard to follow. I set out to write the simplest conversion utility possible.
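The binary format is simpler than it sounds. As I understand the format description on the MNIST page, each file starts with a short header of 4-byte, big-endian integers: an image file has a magic number (2051), the image count, the row count, and the column count (16 bytes total), and a label file has a magic number (2049) and the label count (8 bytes total). Here is a minimal sketch that reads those headers. The function name read_idx_header is my own invention for illustration.

```python
import struct

def read_idx_header(path):
    """Return the header fields of an unzipped MNIST file as a dict."""
    with open(path, "rb") as f:
        # Every header field is a 4-byte, big-endian unsigned integer.
        magic, count = struct.unpack(">II", f.read(8))
        if magic == 2051:    # image file: 16-byte header
            rows, cols = struct.unpack(">II", f.read(8))
            return {"kind": "images", "count": count,
                    "rows": rows, "cols": cols}
        if magic == 2049:    # label file: 8-byte header
            return {"kind": "labels", "count": count}
        raise ValueError("unexpected magic number: " + str(magic))
```

Calling this on an unzipped test image file should report 10,000 images of 28 rows by 28 columns, which is a quick way to confirm that a download and unzip worked.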

The first decision is to choose a format for the resulting text file. I arbitrarily decided that I wanted the result to look like this:

digit 7
pixls 0 0 25 253 . . .

digit 2
pixls 0 0 127 84 . . .

digit 9
pixls 0 0 0 172 . . .

etc.
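One nice property of this format is that it is easy to read back. Here is a minimal sketch of a reader, assuming each item is a “digit” line followed by a “pixls” line, with or without blank lines in between. The function name read_mnist_text is hypothetical.

```python
def read_mnist_text(txt_file):
    """Parse digit/pixls line pairs into a list of (label, pixels) tuples."""
    items = []
    label = None
    with open(txt_file, "r") as f:
        for line in f:
            tokens = line.split()   # split() also drops a trailing space
            if not tokens:
                continue            # skip blank lines between items
            if tokens[0] == "digit":
                label = int(tokens[1])
            elif tokens[0] == "pixls":
                pixels = [int(t) for t in tokens[1:]]
                items.append((label, pixels))
    return items
```

Each tuple holds the label as an int and the pixel values as a list of ints, so for full MNIST images each list would have 784 entries.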

The next step is to download the four zipped data files from the URL above and unzip them. The files are in .gz format, which Windows couldn’t open natively, so I used the 7-Zip utility. After installing it, I right-clicked on each .gz file, selected “7-Zip” then “Extract files”, and unzipped to a directory I named Unzipped. To keep things clear, I added a “.bin” extension to each unzipped file so I could remember that they’re binary.
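If you would rather not install 7-Zip, Python’s standard gzip module can do the same unzipping. This is an alternative sketch, not what I did above, and the function name gunzip is my own.

```python
import gzip
import shutil

def gunzip(src_path, dst_path):
    """Decompress one .gz file to dst_path, streaming in chunks."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, gunzip("t10k-images-idx3-ubyte.gz", ".\\Unzipped\\t10k-images.idx3-ubyte.bin") would produce a file with the same name I used. The .gz file name here is an assumption about what the download is called, and the Unzipped directory must already exist.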

The next step is to write the utility function. Here’s the code:

# converter_mnist.py

def convert(img_file, label_file, txt_file, n_images):
  lbl_f = open(label_file, "rb")   # MNIST has labels (digits)
  img_f = open(img_file, "rb")     # and pixel vals separate
  txt_f = open(txt_file, "w")      # output file to write to

  img_f.read(16)   # discard 16-byte image file header
  lbl_f.read(8)    # discard 8-byte label file header

  for i in range(n_images):   # number of images requested
    lbl = ord(lbl_f.read(1))  # get label (one byte, as an int)
    txt_f.write("digit " + str(lbl) + "\n")
    txt_f.write("pixls ")
    for j in range(784):  # get 784 vals from the image file
      val = ord(img_f.read(1))
      txt_f.write(str(val) + " ")  # will leave a trailing space
    txt_f.write("\n")  # next image

  img_f.close(); txt_f.close(); lbl_f.close()

def main():
  convert(".\\Unzipped\\t10k-images.idx3-ubyte.bin",
          ".\\Unzipped\\t10k-labels.idx1-ubyte.bin",
          "mnist_test.txt", 3)

if __name__ == "__main__":
  main()

I made the code as simple as I could. If you are new to machine learning, the ability to work with MNIST data is important, and code like this is one way to get the raw, zipped, binary MNIST data into a usable format.
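A quick way to sanity-check the converted data is to draw an image as text. Here is a minimal sketch that turns a list of 784 pixel values into 28 rows of characters. The function name to_ascii and the cutoff of 127 are arbitrary choices for illustration.

```python
def to_ascii(pixels, width=28):
    """Return a multi-line string: '#' for dark pixels, '.' for light ones."""
    rows = []
    for start in range(0, len(pixels), width):
        row = pixels[start:start + width]
        # Values above the cutoff are treated as ink.
        rows.append("".join("#" if v > 127 else "." for v in row))
    return "\n".join(rows)
```

Printing the result for the pixel values of one image should show a rough outline of the digit named on the matching “digit” line.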


One Response to Getting MNIST Data into a Text File

  1. Peter Boos says:

    This question might be outside your realm, and it’s not directly related to this topic. I think you work at MS, or at least they hire you, so you might have some indication of what MS plans for C# and ML. Does MS still actively embrace C#, or is everything in ML at MS going to Python now?

    The reason I ask: for software devs it’s problematic to release Python code, because the source is open (it can only be hidden behind a web server with PHP or so). I do understand that for researchers and students Python has a nice syntax, and sure, it’s a great language, but it’s not ideal for making a business case for client-side apps.
