A Subjective Comparison of TensorFlow, PyTorch, Keras, and scikit-learn

I regularly use five neural network code libraries: TensorFlow, PyTorch, Keras, CNTK, and scikit-learn. I’m sometimes asked how they compare with each other. There are many ways to answer that question, but here’s a very subjective chart I use that compares them on ease of use and flexibility:

TensorFlow has a very difficult learning curve but has by far the best documentation and most examples available. PyTorch is slightly easier to use than TensorFlow in the sense that it feels more like the APIs were written by developers for use by developers, but PyTorch documentation is weak and skimpy.

Microsoft’s CNTK is dying — it’s been put into maintenance mode. This is a shame because I really like CNTK. But CNTK just never caught on. CNTK is widely used internally at Microsoft but even there its use is decreasing.

Keras is a wrapper library over TensorFlow. Keras is vastly easier to use than TensorFlow and has decent customizability. I usually recommend Keras for experienced developers who are just starting with neural systems.

The scikit-learn library has been around for over 10 years. In addition to neural networks, scikit-learn has traditional machine learning functionality (clustering, logistic regression, etc.) and is well-documented. The scikit-learn library is a good choice for beginners who have relatively little coding experience.

Note that there are many other criteria that could be used to compare these libraries (performance, the number of job postings that ask for each, and so on) and there are many other neural system code libraries.



Four old (maybe 1970s) magazine advertisements for rum. For commodities like rum, advertising agencies have to use secondary comparison criteria such as perceived cool factor or lifestyle.

Posted in Machine Learning | Leave a comment

Emily and LaKeisha and Machine Learning

When I was in college, I got BA degrees in psychology from the University of California at Irvine (Go Anteaters!) and applied mathematics from California State University at Fullerton (Go Titans!). I learned pretty quickly that applying statistics to people usually doesn’t end well. Applying machine learning to people has pitfalls too.

Many research studies have shown that Black sounding names like LaKeisha and Rasheed are perceived to be associated with people who have lower intelligence, higher likelihood of being violent, higher likelihood of drug use and criminal behavior, and so on. White sounding names like Emily and Greg are perceived to be associated with higher intelligence and so on. These negative perceptions of Black attributes are shared by both Black and white observers.

There are many variations of a study where researchers create two identical resumes, one for “Greg Baker” and one for “Rasheed Jefferson”. The Greg Baker resume is consistently received more favorably. For example, the research paper “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination” by M. Bertrand and S. Mullainathan was widely publicized in the media.

Well, this bias isn’t rocket science; it’s rather expected. And in the job/resume scenario, employers would also wrestle with the idea that hiring a minority employee creates a significant risk of an eventual spurious discrimination lawsuit. The problem of course is that these statistics are just aggregate metrics that describe a group of people, not a specific person.


Google image search for “murder arrest”.

According to Wikipedia articles, the statistics are pretty stark. Black people as a whole have an average IQ that is about one standard deviation (15 points) less than the average for white people. Black students score much lower on average on the SAT (177 points) and ACT tests. By age 23, half of all Black males have been arrested and convicted. Black males commit over half of all murders even though Black males are less than seven percent of the population. In short, Black sounding names are associated with a long laundry list of negative attributes.


High IQ scores are positively correlated with many positive attributes such as career success and income. Low IQ scores are correlated with violence and criminal behavior. See “Thirty Years of Research on Race Differences in Cognitive Ability”, Rushton and Jensen.

Worse perhaps than the cold statistics is the constant barrage of negative media information. News feeds seem to continuously feature stories of Black people committing heinous crimes. And a Google image search for almost anything related to serious crime turns up images of almost entirely Black people. Media may be reflecting reality but it is likely producing negative associations too.

So, the cautionary tale for machine learning is this: if an ML prediction system is created that applies to people (for example, an automated resume scanning system for a human resources department), it’s possible to unintentionally include name information. The prediction system could then learn to associate Black names with all the bad statistics associated with Black people as a group, rather than evaluating each individual independently. This is why I only use machine learning for things like predicting sports scores, and not for systems that deal with people.

The moral of the story is simple: be cautious when combining people and numbers. I evaluate people using my head and my heart, not numbers.

Update: Literally minutes after I wrote this blog post a news story appeared that described the conviction of a teen named Dawnta who murdered a female police officer in Baltimore. ML systems that train using news feeds could find hundreds of stories like this and be influenced by them.



Fractals combine art and numbers. I remember programming an image of the Mandelbrot set (center image) years ago using 8086 assembly language on an IBM PC.

Posted in Machine Learning | Leave a comment

Padding Sentences for PyTorch NLP Recurrent Neural Networks

Working with PyTorch recurrent neural networks (LSTMs in particular) is extremely frustrating. The code is still very new and poorly documented, and there aren’t many good examples available.

I’ve been looking at sentiment analysis on the IMDB movie review dataset for several weeks. One of the many mysteries is exactly how to deal with sentences/reviews that have different lengths. The simplest approach is to feed each sentence, one at a time, without any padding. But if you want to batch sentences together for more efficient training, you need to pad the sentences with a dummy value so that all sentences have the same number of words.
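As a rough illustration (not the exact code from any of my experiments), here is a minimal sketch of right-padding a small batch of reviews with a dummy 0 index; the reviews and word indices are made up:

```python
import torch

# hypothetical batch: each review is already encoded as a list of integer word indices
reviews = [[12, 5, 87, 3], [9, 44], [7, 21, 6]]

max_len = max(len(r) for r in reviews)   # length of the longest review in the batch
PAD = 0                                  # dummy index reserved for padding

# pad each review on the right with 0 so every review has the same length
padded = [r + [PAD] * (max_len - len(r)) for r in reviews]
batch = torch.tensor(padded, dtype=torch.int64)
print(batch)    # shape [3, 4]
```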

In one experiment, I just used the sentences without any padding and then fed each sentence, one at a time, to the LSTM during training (“online training”). But I got relatively weak results. I don’t fully understand why the results were weak.

In another experiment, I padded each review to the same length by adding dummy 0 values at the end of each sentence/review. In the embedding layer I specified padding_idx=0 so that the embedding knew that 0 wasn’t a real word. But I did not mask off the 0 values when I sent each sentence to the LSTM, and I also didn’t mask when computing the loss value. I did this mostly because I don’t yet fully understand pack_padded_sequence() and pad_packed_sequence(). I used online training. In spite of the crude approach, I got pretty good results anyway.
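To make the second experiment concrete, here is a bare-bones sketch of the kind of setup I’m describing; the vocabulary size, dimensions, and the single hard-coded review are invented purely for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 50, 100    # made-up sizes

# padding_idx=0 tells the embedding layer that index 0 is a dummy; its vector
# stays at zero and is not updated during training
embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, 2)               # two classes: negative / positive sentiment

x = torch.tensor([[12, 5, 87, 3, 0, 0]])    # one review, padded with two 0s
emb = embed(x)                              # shape [1, 6, 50]
out, (h_n, c_n) = lstm(emb)                 # no masking: the LSTM sees the padding too
logits = fc(h_n[-1])                        # classify from the final hidden state
print(logits.shape)                         # torch.Size([1, 2])
```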


Left: Padded sentences with 0 and telling the embedding layer about 0. Right: Padded sentences with 0 but not telling the embedding layer about 0. Results are essentially the same.

In a third experiment, I padded the sentences but I just ran the sentences as-is — I didn’t even tell the embedding layer about the special 0 value. Unexpectedly, I got quite good results. I speculate that the padding contains useful information somehow — perhaps an indirect measure of sentence length which helps the network understand. Maybe.

The moral of the story is that, as far as I can determine right now, it’s better to always pad input sentences to the same length. You’re likely to get better results even with online training, and equal-length sentences allow you to use batch training if you want to. Maybe.



Three examples of frustration by cartoonist Gary Larson.

Posted in PyTorch | Leave a comment

A Neural Approach to Combinatorial Optimization

The best-known example of combinatorial optimization is the Traveling Salesman Problem (TSP). For example, suppose you have five cities, with a known distance between each pair of cities. The goal is to visit each city exactly once so that the total distance travelled is as small as possible. If the five cities are labelled A, B, C, D, E then one path might be A – E – C – D – B.

Combinatorial optimization problems are difficult. If there are five cities then there are 5! = 5 * 4 * 3 * 2 * 1 = 120 possible combinations (technically permutations — “combinatorics” is a general term that includes mathematical combinations and permutations). But if there are 100 cities then there are 100! possible paths and 100! = 9.33e+157 which is unimaginably large.
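Just as a quick sanity check of those counts, in Python:

```python
import math

print(math.factorial(5))               # 120 possible orderings of five cities
print(len(str(math.factorial(100))))   # 158 -- the number of digits in 100!
print(float(math.factorial(100)))      # approximately 9.33e+157
```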

There aren’t very many good general purpose combinatorial optimization algorithms. The two algorithms I use most often are simulated annealing and simulated bee colony.
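To make one of these concrete, here is a bare-bones simulated annealing sketch for a small random TSP instance. This isn’t code from any of my systems; the cooling schedule, iteration count, and city coordinates are all arbitrary:

```python
import math
import random

random.seed(0)
n = 20                                            # number of cities
cities = [(random.random(), random.random()) for _ in range(n)]

def tour_length(tour):
    # total round-trip distance for a given visiting order
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))

curr = list(range(n))
random.shuffle(curr)
curr_len = tour_length(curr)
temp = 1.0                                        # arbitrary starting temperature

for _ in range(20000):
    i, j = sorted(random.sample(range(n), 2))
    cand = curr[:i] + curr[i:j + 1][::-1] + curr[j + 1:]   # reverse a segment (2-opt style move)
    cand_len = tour_length(cand)
    # always accept a better tour; accept a worse one with temperature-dependent probability
    if cand_len < curr_len or random.random() < math.exp((curr_len - cand_len) / temp):
        curr, curr_len = cand, cand_len
    temp *= 0.9995                                # geometric cooling

print(f"tour length found: {curr_len:.4f}")
```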

Anyway, this past weekend I was geeking out and came across an obscure research paper titled “Neural Combinatorial Optimization with Reinforcement Learning” written by five researchers from Google. I hate to be critical, but the paper was poorly written. Let me explain what I mean by that and why it matters.


An example section from the research paper.

I read research papers almost daily, and I’m very familiar with combinatorial optimization, reinforcement learning, and neural networks. But my first read of this research paper baffled me. It was written as if the authors were writing for themselves or trying to impress a selection committee instead of writing to a general, outside audience. So by “poorly written” I really mean poorly written for an external audience — from a technical perspective I have no doubt that the paper is well written based on the reputation of some of the authors.

All this matters because the technique presented in the research paper might be a fantastic new breakthrough in combinatorial optimization, but because nobody outside of the group of people who wrote the article will take the hours required to figure out what the paper is trying to communicate, the paper will likely be quickly forgotten. I’d really like to dive into the details of the paper, but I just don’t have the week or so it’d take to dissect them.

To be sure, authors of research papers are in a bind because in order to be accepted at a research conference, a paper has to have the appearance of deep research. This leads to a situation where authors unnecessarily complicate explanations. I like reading famous old research papers — they are brilliantly simple and aren’t trying to impress a committee.



Four illustrations by Robert McGinnis, one of the most prolific artists of the 1960s. He did book covers, movie posters (including all the James Bond films of the era), magazine ads, and more. His style is simple, distinctive, and easy to understand. I think these images are of 1960s movie actresses but I’m not sure.

Posted in Machine Learning | Leave a comment

Attention Mechanisms in Natural Language Processing

The last 24 months or so have produced amazing advances in natural language processing (NLP). Many of these ideas are extremely complex. One such idea is called “attention”. Other breakthrough ideas include transformers, ELMo, and BERT. Accurately describing attention would take dozens of pages (or more) so instead I’ll give an intuitive explanation at the expense of correctness.

Attention is most often explained in the context of a sequence-to-sequence language translation problem. Suppose you want to translate the Latin “Discipulus sum” to the English “I am a student”. Before explaining further, it’s necessary to know that in modern NLP, words are represented by a vector of numbers; for example “discipulus” could be represented by the three numeric values (0.1234, 1.0987, 2.4680). This is called a word embedding. In practice, a word embedding usually has between 100 and 500 numeric values. Exactly how word embeddings are created is a complex topic, so for now just assume that any word can be represented by a numeric vector of values.
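To make the idea a bit more concrete, here is a tiny PyTorch sketch; the three-word vocabulary and the embedding size of 4 are made up purely for illustration:

```python
import torch
import torch.nn as nn

# toy vocabulary: each word gets an integer index
vocab = {"discipulus": 0, "sum": 1, "puella": 2}

# each index maps to a learned vector of 4 values (real systems use 100-500)
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

idx = torch.tensor([vocab["discipulus"]])
print(embed(idx))   # a 1 x 4 tensor of values; random until the embedding is trained
```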

OK, back to the translation problem. Without using NLP attention you would create a first recurrent neural network and feed “discipulus” followed by “sum” to the network (you’d send the numeric word embeddings). After “discipulus” the RNN would have an internal cell state. After “sum” the RNN would have a final cell state and produce an output value which would be a vector of numeric values.

After encoding the source sentence, you would have a second RNN and feed it the output vector from the first RNN. The second RNN would then produce a sequence of vectors, each representing an English word. In other words, the entire source sentence is encoded as a single big vector (of numeric values) and that single vector is used to generate the entire translation.
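Here is a highly simplified sketch of this encode-then-decode idea, using GRUs and greedy decoding just to keep the code short. Every size, word index, and the start-of-sentence token are invented, and a real translation system is far more involved:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, src_vocab, tgt_vocab = 8, 16, 50, 60   # made-up sizes

src_embed = nn.Embedding(src_vocab, embed_dim)
encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, tgt_vocab)

src = torch.tensor([[3, 7]])            # "discipulus sum" as made-up word indices
_, h = encoder(src_embed(src))          # h is the single vector that summarizes the source

tok = torch.tensor([[0]])               # assumed start-of-sentence token index
for _ in range(4):                      # generate up to four target words
    out, h = decoder(tgt_embed(tok), h) # the whole translation hangs off that one vector
    tok = to_vocab(out[:, -1]).argmax(dim=-1, keepdim=True)
    print(tok.item())                   # index of the predicted English word
```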

This approach works quite well for short sentences, but the problem is that it’s very difficult to encode an entire sentence of realistic length as a single vector of values.

The idea of attention is to encode the source sentence as before, but during the decoding done by the second RNN, you keep a weight associated with each of the hidden cell states in the first (encoding) RNN, and you emit output vectors/words based on how heavily weighted each input hidden state is. For example, in most cases you’d expect the first word of the translation to be based mostly (but not entirely) on the first word of the input sentence, so the first word of the translation would be constructed from all the hidden cell states of the first input RNN, but mostly using the first hidden state.
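Here is a tiny numeric sketch of just the core computation: dot-product scores between a decoder state and each encoder hidden state, a softmax to turn the scores into weights, and a weighted sum (the "context" vector). The sizes and random values are made up, and real systems often use a small learned network instead of a plain dot product:

```python
import torch
import torch.nn.functional as F

# pretend the encoder produced one hidden state per source word ("discipulus", "sum")
encoder_states = torch.randn(2, 16)   # [source length, hidden size]
decoder_state = torch.randn(16)       # decoder state while emitting one target word

scores = encoder_states @ decoder_state              # one raw score per source word
weights = F.softmax(scores, dim=0)                   # attention weights; they sum to 1.0
context = (weights.unsqueeze(1) * encoder_states).sum(dim=0)   # weighted mix of encoder states

print(weights)         # e.g. tensor([0.83, 0.17]) -- how much each source word matters
print(context.shape)   # torch.Size([16]); this context vector helps generate the target word
```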

Sheesh. Believe me, it’s very difficult to explain and I simplified a lot and left out a lot. However, like many things, if you keep reading about NLP attention, every explanation will add a bit of knowledge and eventually things make sense.

The field of natural language processing has exploded in terms of both capability and complexity. In my opinion, a machine learning researcher used to be able to study NLP alongside other areas of machine learning, but in the past two years NLP has become so complex that studying it now requires complete dedication to the topic.



Four advertising illustrations by Danish artist Mads Berg. His geometric style of art grabs my attention immediately, which is what advertising should do.

Posted in Machine Learning | Leave a comment

Zoltar Prepares for the 2019 NFL Football Season

Zoltar is my machine learning system that predicts the outcomes of NFL football games. The first game of the 2019 season is Thursday, September 5 (about three weeks from now) so I’m starting to get Zoltar ready.

I’ve had several different versions of Zoltar over the years. The problem of predicting NFL football scores lends itself to all kinds of interesting algorithms and technologies. Every year I try a different twist or two, usually some kind of new optimization algorithm. Currently I’m using the C# programming language with custom hybrid reinforcement plus neural optimization.

My preliminary work involves getting three data files ready: schedule data, Las Vegas point spread data, and results data. The schedule data has one complication: five games will be played at a neutral site, four in London and one in Mexico City. In NFL football, home field advantage is an important prediction factor.

I store all my data as plain text files for simplicity. The key programming challenge is defining data structures to hold the schedule, point-spread, and result data.
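Zoltar itself is written in C#, but just to illustrate the kinds of records I mean, here is a rough Python-flavored sketch; the class and field names are invented and certainly don’t match my actual code:

```python
from dataclasses import dataclass

# illustrative record types only -- the real structures differ
@dataclass
class ScheduledGame:
    week: int
    home_team: str
    away_team: str
    neutral_site: bool = False   # the London and Mexico City games

@dataclass
class PointSpread:
    week: int
    favorite: str
    spread: float                # e.g. -3.5 means the favorite gives 3.5 points

@dataclass
class GameResult:
    week: int
    home_team: str
    away_team: str
    home_score: int
    away_score: int

game = ScheduledGame(week=1, home_team="Bears", away_team="Packers")
print(game)
```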

It took a couple of hours, but I successfully wrote code to read the game schedule data into memory. My next step will be to get the prediction engine up and running to make preliminary predictions for the first week’s games. Then I need to tune the “advice” engine, which determines whether or not Zoltar advises betting on a particular game.



My prediction system is named after the Zoltar fortune telling machine you can find in arcades. That arcade machine is named after the “Zoltar Speaks” machine from the 1988 movie “Big” starring Tom Hanks. And that movie machine was based on a 1950s-era arcade machine called Zoltan.

Posted in Zoltar | Leave a comment

I Take Another Look at the New Microsoft ML.NET Library

The Microsoft ML.NET library is a C# language code library for machine learning. ML.NET has been under development and available in pre-release form for quite some time, but was finally released for public use as version 1.2 a few weeks ago. The motivation for ML.NET is that Python is in fact the default language for machine learning, but C# is the default language for applications on a Windows platform, and this makes it very difficult to integrate a trained Python model into a C# application. With the ML.NET library, C# developers can create a model in C# and easily drop it into a C# application.

A few weeks ago I used the AutoML tool to automatically generate a ML.NET model. It worked very nicely, but the auto-generated code wasn’t totally clean and efficient. So I set out to refactor the demo I created using AutoML, by coding an ML.NET program from scratch.

My demo problem was to predict employee job Satisfaction (low, medium, high) from Age, Hourly/Salaried status, job Role (tech, management, sales), and annual Income.

The demo code took me about 4 hours to write, which isn’t too bad. But I had a head start because I’ve been looking at ML.NET for many months. Additionally, ML.NET is derived from an internal Microsoft library named TLC, which in turn was derived from an internal library named TMSN, and I’d used TMSN and TLC many times.

Writing ML.NET code from scratch wasn’t trivial, but machine learning problems aren’t trivial. It will be interesting to see if ML.NET catches on in the developer community, or if ML.NET falls by the wayside like so many other code libraries and frameworks. I like ML.NET quite a bit but I’m not representative of the target audience.

My example prediction model uses “Fast Trees,” an algorithm that’s not easy to implement without a code library.



It’s not easy to make trees look interesting. I think these four examples by four different artists succeeded in making nice, interesting images of trees.

Posted in Machine Learning | 1 Comment