A well-known benchmark dataset for machine learning is the IMDB Movie Review Dataset. There are 50,000 written reviews that are labeled positive (good movie) or negative. Therefore, the dataset can be used to create a sentiment analysis model.
The Keras library has a pre-packaged version of the dataset, but I wanted to generate the training and test data myself, directly from the 50,000 source text files.
As I expected, generating the data from source material was very tricky and time-consuming. But I definitely learned a lot about processing files using Python.
I wrote a program called make_data_files.py that does just that. First I read 12,500 positive training reviews, 12,500 negative training reviews, 12,500 positive test reviews, and 12,500 negative test reviews into memory. Next I created a vocabulary dictionary of distinct words where the key is a word (like ‘the’) and the value is the rank by frequency (‘the’ is 1 because it’s most common).
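Here is a minimal sketch of the vocabulary idea, not the actual code from make_data_files.py. The function name make_vocab is hypothetical, and the tokenization (lowercase, split on whitespace) is an assumption; real reviews also need punctuation and HTML tags stripped.

```python
from collections import Counter

def make_vocab(reviews):
    """Map each distinct word to its 1-based rank by frequency.

    Assumption: words are lowercased and split on whitespace only.
    """
    counts = Counter()
    for review in reviews:
        counts.update(review.lower().split())
    # most_common() returns (word, count) pairs, most frequent first
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Tiny made-up stand-in for the 50,000 real reviews
reviews = [
    "the movie was great",
    "the book was better than the movie",
]
vocab = make_vocab(reviews)
print(vocab["the"])   # rank of the most common word
print(len(vocab))     # number of distinct words
```

With the real data, the reviews list would hold the text of every review file read into memory.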
I used the vocabulary dictionary to encode the raw text because neural models only understand numbers. I followed the Keras format mostly. Keras uses a value of 0 for padding when making all reviews the same length. Keras uses a value of 1 to indicate start of sequence — this is useless so I dropped 1. Keras uses 2 to indicate “out-of-vocabulary”, an unknown word, and so did I.
Each encoded word is offset by 3. So ‘the’ has rank 1 (most frequent), but is encoded as 1 + 3 = 4. The offset reserves 0, 1, and 2 for the special values. Weirdly, in the Keras scheme 3 is never used. OK, whatever.
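The encoding scheme can be sketched like this. It's an illustration under assumptions, not my actual program: the name encode_review and the tiny vocabulary are made up, padding on the left is an assumed choice, and over-long reviews are simply cut to the first max_len words.

```python
PAD = 0      # padding value
OOV = 2      # out-of-vocabulary (unknown word)
OFFSET = 3   # added to every frequency rank

def encode_review(text, vocab, max_len):
    """Encode a review as a fixed-length list of integers.

    Known words become rank + OFFSET; unknown words become OOV.
    Assumptions: whitespace tokenization, left padding, and
    truncation that keeps the first max_len words.
    """
    ids = []
    for word in text.lower().split():
        if word in vocab:
            ids.append(vocab[word] + OFFSET)
        else:
            ids.append(OOV)
    ids = ids[:max_len]
    return [PAD] * (max_len - len(ids)) + ids

# Hypothetical three-word vocabulary: word -> frequency rank
vocab = {"the": 1, "movie": 2, "book": 3}
encoded = encode_review("read the book forget the movie", vocab, 8)
print(encoded)  # [0, 0, 2, 4, 6, 2, 4, 5]
```

Notice that no word can ever be encoded as 3, because the smallest rank is 1 and 1 + 3 = 4.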
After doing some experiments, I realized that the Keras version of the IMDB data has a different train-test split from the source data. For example, the raw data (from http://ai.stanford.edu/~amaas/data/sentiment/) has a short six-word negative review, “read the book forget the movie,” in the negative test data (file 6850_2.txt), but the Keras dataset stores this review as a negative training review. I’m not sure why this is so.
The moral of the story is that using pre-packaged data when exploring machine learning is convenient, but in a realistic scenario you have to generate data yourself, and it’s almost always very difficult (and not very much fun, though developers’ definitions of fun are a bit different from those of ordinary people).