Latent Dirichlet Allocation – What is It?

Latent Dirichlet allocation (LDA) is a machine learning technique that is most often used to analyze the topics in a set of documents. The problem scenario is best explained by a concrete example. Suppose you have 100 documents, where each document is a one-page news story. First you select the number of topics, k. Suppose you set k = 3 and, unknown to you, the three latent topics are “sports”, “politics”, and “business”.

Next, you analyze the documents, using word frequencies. Suppose that the words (“score”, “win”, “record”, “team”) map mostly to the “sports” topic. Words (“democrat”, “republican”, “law”, “bill”) map mostly to the “politics” topic. And words (“profits”, “sales”, “revenue”, “tax”) map mostly to the “business” topic. But notice that a word can correspond to more than one topic. For example, the word “loss” could be associated with sports (as in a team had a loss in a game), business (as in profits and losses), or politics (a candidate suffered a loss in an election).

Next, you can use the word-topic mapping information to analyze each of the 100 documents. Your results might be something like Document #1 is 85% sports, 5% politics, and 10% business. Document #2 is 1% sports, 48% politics, and 51% business. And so on. Note that I’ve greatly simplified this explanation at the expense of some technical accuracy.
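The simplified idea above can be sketched in a few lines of Python. This is only a word-counting illustration, not real LDA: real LDA infers the word-topic weights from the documents, while here the weights, the function name topic_mix, and the sample document are all made up for the example.

```python
# Hypothetical word-topic weights. Real LDA would learn these from the
# documents; here they are invented to illustrate the idea.
WORD_TOPIC = {
    "score":   {"sports": 0.90, "politics": 0.05, "business": 0.05},
    "team":    {"sports": 0.90, "politics": 0.05, "business": 0.05},
    "bill":    {"sports": 0.00, "politics": 0.80, "business": 0.20},
    "law":     {"sports": 0.00, "politics": 0.90, "business": 0.10},
    "revenue": {"sports": 0.00, "politics": 0.10, "business": 0.90},
    "loss":    {"sports": 0.40, "politics": 0.20, "business": 0.40},
}
TOPICS = ("sports", "politics", "business")

def topic_mix(words):
    """Return the fraction of each topic in a list of words."""
    totals = {t: 0.0 for t in TOPICS}
    for w in words:
        # Words without a known mapping are ignored.
        for t, wt in WORD_TOPIC.get(w, {}).items():
            totals[t] += wt
    grand = sum(totals.values())
    if grand == 0.0:
        return {t: 1.0 / len(TOPICS) for t in TOPICS}  # no evidence: uniform
    return {t: v / grand for t, v in totals.items()}

doc = "the team had a loss but the score was close".split()
mix = topic_mix(doc)
for t in TOPICS:
    print(f"{t}: {mix[t]:.0%}")
```

Notice that “loss” contributes to all three topics at once, which is why the result is a mix of percentages rather than a single label.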

There’d be one graph for each of the three topics, and one graph for each of the 100 documents.

The mathematics behind latent Dirichlet allocation are based on the Dirichlet probability distribution, which is a fascinating topic in its own right. I tend to think of the Dirichlet probability distribution as an extension of the Beta distribution. But that’s not useful information for most people. I often use a Beta distribution in my work (directly or indirectly), so Beta is a good mental point of reference for me.
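To make the Beta connection concrete, here is a minimal sketch using only Python's standard library. It relies on the standard construction of a Dirichlet sample (draw independent gamma values, then normalize them); the function name dirichlet_sample and the alpha values are my own choices for illustration. With two components, the first coordinate follows a Beta distribution. With three or more, you get a vector of proportions that sums to 1, much like the topic percentages of a document.

```python
import random

def dirichlet_sample(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)

# Three components: a random split of 100% across three topics.
p = dirichlet_sample([1.0, 1.0, 1.0], rng)
print([round(x, 3) for x in p], "sum =", round(sum(p), 6))

# Two components: the first coordinate behaves like Beta(2, 5),
# whose mean is 2 / (2 + 5).
n = 20000
mean_first = sum(dirichlet_sample([2.0, 5.0], rng)[0] for _ in range(n)) / n
print("sample mean:", round(mean_first, 3), " Beta mean:", round(2 / 7, 3))
```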

Latent Dirichlet allocation is an unsupervised machine learning technique because the analysis is based only on word frequencies, so raw, unlabeled documents can be analyzed. One way to think about latent Dirichlet allocation is that it has similarities to clustering, but with probabilistic assignments (each document gets a proportion for each topic, rather than being placed into exactly one cluster).

The term Dirichlet is capitalized because it’s named after Johann Peter Gustav Lejeune Dirichlet, a German mathematician who lived in the first half of the 1800s. It’s not exactly known how to pronounce “Dirichlet” because the surname was coined by his grandfather. The “ch” can be pronounced like an “sh” sound, or a hard “k” sound. And the ending “et” can be pronounced in French fashion as “lay” or as “let” with a hard “t” sound.

Latent Dirichlet allocation was first explained in a 2003 research paper, but like most techniques, the key ideas were published earlier. In machine learning, the acronym LDA is ambiguous because it can also stand for linear discriminant analysis, which is a completely different technique. (Linear discriminant analysis is a relatively crude classification technique based on analysis of variance.) So I sometimes mildly scold my colleagues if they use LDA in a presentation without defining which LDA they’re referring to.

I like old travel posters. I remember waiting for my French grandfather to arrive at the Los Angeles airport from Paris on a TWA flight.

Posted in Machine Learning

Neural Network Library Dropout Layers

Until quite recently, neural network libraries like TensorFlow and CNTK didn’t exist, so if you wanted to create a neural network, you’d have to do so by writing raw code using C/C++ or C# or Java or similar.

In those days, implementing neural network dropout took four steps. You’d write code to tag the nodes to be dropped on each training iteration, edit the code that computes output so that it skips the dropped nodes, edit the back-propagation training code in the same way, and then modify the final weights to account for the fact that dropout was used during training.
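Here is a minimal sketch of that from-scratch approach, covering only the forward pass and the final weight adjustment (the back-propagation changes are omitted). The function names, the tanh hidden activation, and the tiny 2-4-1 network are assumptions made for the illustration, not code from any particular implementation.

```python
import math
import random

def forward(x, ih_wts, h_biases, ho_wts, o_biases, dropped=frozenset()):
    """Raw outputs of a one-hidden-layer net, skipping dropped hidden nodes."""
    nh, no = len(h_biases), len(o_biases)
    h = [0.0] * nh
    for j in range(nh):
        if j in dropped:
            continue  # a dropped node contributes nothing on this iteration
        s = h_biases[j] + sum(x[i] * ih_wts[i][j] for i in range(len(x)))
        h[j] = math.tanh(s)
    return [o_biases[k] + sum(h[j] * ho_wts[j][k] for j in range(nh))
            for k in range(no)]

def pick_dropped(nh, rate, rng):
    """Tag each hidden node as dropped with probability `rate`."""
    return frozenset(j for j in range(nh) if rng.random() < rate)

def scale_for_inference(ho_wts, rate):
    """After training, shrink hidden-output weights by the keep probability."""
    keep = 1.0 - rate
    return [[w * keep for w in row] for row in ho_wts]

rng = random.Random(1)
ni, nh, no, rate = 2, 4, 1, 0.5
ih = [[rng.uniform(-1, 1) for _ in range(nh)] for _ in range(ni)]
ho = [[rng.uniform(-1, 1) for _ in range(no)] for _ in range(nh)]
hb, ob = [0.0] * nh, [0.0] * no

x = [0.5, -0.3]
dropped = pick_dropped(nh, rate, rng)             # new tags on every iteration
train_out = forward(x, ih, hb, ho, ob, dropped)   # what training would see
final_out = forward(x, ih, hb, scale_for_inference(ho, rate), ob)
print("dropped:", sorted(dropped))
print("training output:", train_out)
print("final output:", final_out)
```

The weight scaling at the end compensates for the fact that, on average, only a fraction (1 - rate) of the hidden nodes were active during training.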

The approach I just described was a bit tricky, but not quite as difficult as the description may sound. But still, in the old days (like 2-3 years ago), almost everything about writing neural network code was non-trivial.

So, my point is, I really, really understand dropout because I’ve read the source research papers, and I’ve implemented dropout from scratch many times.

Then in 2015 and 2016, along came TensorFlow and Keras and CNTK and other libraries. The approach used by these libraries is quite simple. Instead of creating a custom network, you place a so-called dropout layer into the network. During training, the dropout layer randomly sets some of its input values to 0.0, which effectively drops the corresponding nodes in the layer immediately before the dropout layer.

Library code could resemble:

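# a sketch; assumes a recent Keras install
from keras import Input, Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Input(shape=(4,)))    # 4 input values
model.add(Dense(6))             # hidden layer, 6 nodes
model.add(Dropout(rate=0.5))    # applies to the hidden layer's outputs
model.add(Dense(3))             # output layer, 3 nodes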
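What a dropout layer does to the values flowing through it can be sketched in plain Python. This is an illustration under assumptions, not library source code: the function name dropout_layer is made up, and it uses the “inverted” form of dropout (survivors are scaled up during training), which is how the Keras Dropout layer is documented to behave.

```python
import random

def dropout_layer(values, rate, rng, training=True):
    """Zero each incoming value with probability `rate` while training.

    Survivors are scaled by 1 / (1 - rate) so that no weight adjustment is
    needed after training. At prediction time, values pass through unchanged.
    Assumes 0 <= rate < 1.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
hidden_out = [0.2, -0.7, 0.5, 0.9, -0.1, 0.4]   # outputs of a 6-node hidden layer
print(dropout_layer(hidden_out, 0.5, rng))                   # training
print(dropout_layer(hidden_out, 0.5, rng, training=False))   # prediction
```

Because the layer only sees values, not nodes, the same function works whether it is placed after the input layer or after a hidden layer.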

The only way I could fully understand this mechanism was to sketch out a few pictures. Notice if you place a dropout layer immediately after the input layer, you are dropping input values, which is sometimes called jittering (although jittering can also mean adding noise to input values). If you place a dropout layer after the output layer, you’re dropping output values — which doesn’t make sense in any scenario I’ve ever seen.

I don’t think there’s a moral to this story. But an analogy might be something like this: In the 1920s and 1930s, everyone who drove a car probably had to have pretty good knowledge of how cars worked, so that they could fix the cars when they broke. But as time went on, understanding things like how to adjust the ignition timing became less and less important. Maybe that’s true of deep neural networks.

But it’s still good to know how things work.

To the best of my knowledge, dropout was introduced in a 2012 research paper and described in more detail in a 2014 follow-up paper. Dropout became widely known in late 2015. There are a couple of very deep research papers about the mathematics behind dropout (and how it averages virtual sub-networks). The best explanation for me is in a paper at:

HAMR (Harvard Ambulatory Micro Robot) is about the same size as a roach. Even though all life is sacred in some sense, I do not like roaches. Ugh. Hate ’em.

Posted in Machine Learning

Machine Learning and the GDPR

Many of the client companies I talk to bring up the potential impact of the European Union GDPR (General Data Protection Regulation), which is set to take effect on May 25, 2018. I’m not an expert in this area, but my bottom line is that the GDPR rules with respect to machine learning are very vague, will likely impose huge costs on all companies that have European customers, and could open the way to massive amounts of litigation.

Loosely speaking, the GDPR sets up strict rules for how any company that does business in Europe handles personal customer information. This is a good thing in principle. Potential fines are astronomical (up to 4% of a company’s global annual “turnover”, but this being a European thing, the exact amount can be determined by the GDPR entity, unfortunately creating an incentive for illegal behavior by just about everyone involved).

The ultimate aim of collecting personal information is to make use of it in some way. Although the GDPR rules are labyrinthine, briefly, 1.) personal data must be anonymized (possibly making it useless for many ML applications) or pseudonymized, 2.) algorithms that use personal data must be explainable (where exactly what that means isn’t clearly defined), 3.) decisions reached using personal data, even if they are completely neutral, cannot have an effect related to race, politics, gender, religion, etc., etc., etc.
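As a rough illustration of the pseudonymization idea in item 1, one common technique is to replace each identifier with a keyed hash, so that records belonging to the same person can still be linked without storing the identifier itself. The sketch below uses Python's standard hmac and hashlib modules; the function name, the key, and the sample records are hypothetical, and whether this approach satisfies the GDPR is a legal question I can't answer.

```python
import hashlib
import hmac

def pseudonymize(customer_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), hex encoded."""
    return hmac.new(secret_key, customer_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()

# Hypothetical key; a real key would be stored separately and securely.
key = b"demo-key-kept-separately"

records = [
    ("alice@example.com", 42.0),
    ("bob@example.com", 17.5),
    ("alice@example.com", 8.0),
]
pseudo = [(pseudonymize(cid, key), amount) for cid, amount in records]
for token, amount in pseudo:
    print(token[:12], amount)
```

The first and third records get the same token, so a model can still learn per-customer patterns, but the token alone doesn't reveal the email address.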

From the aptly-named Web site

Each of these three areas is quite interesting and very complex. And the poor phrasing of GDPR “recitals” makes any concrete discussion impossible. But for data anonymization, suppose you are a hospital and you want to use a person’s medical information to determine the best set of treatments. If your data is completely anonymized, you may not be able to use it effectively.

For algorithm explainability, it’s not clear what this means at all. Do companies have to explain the exact algorithm used, thereby giving away a competitive advantage? Or do companies have to just say what class (such as a decision tree, or a neural network) they’re using? One interesting related topic here is called counterfactual information, where a company could say something like, “Your credit application would have been approved if your income had been $10,000 higher.” The requirement of algorithmic explainability could be a dream come true for unscrupulous lawyers.
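A counterfactual statement like the credit example can be generated by a simple search: keep nudging one input until the decision flips. The sketch below assumes a made-up scoring rule (income minus twice the debt, compared to a threshold); the function names and all numbers are hypothetical and are only there to show the mechanics.

```python
def approve(income, debt, threshold=50_000):
    """Hypothetical scoring rule: approve when the score reaches the threshold."""
    score = income - 2 * debt
    return score >= threshold

def income_counterfactual(income, debt, step=1_000, max_raise=200_000):
    """Smallest income increase (in multiples of `step`) that flips a denial."""
    if approve(income, debt):
        return 0  # already approved, nothing to explain
    for extra in range(step, max_raise + step, step):
        if approve(income + extra, debt):
            return extra
    return None  # no increase up to max_raise changes the decision

needed = income_counterfactual(income=45_000, debt=2_500)
if needed:
    print(f"Your application would have been approved "
          f"if your income had been ${needed:,} higher.")
```

Notice that the explanation reveals nothing about the scoring rule itself, which is part of the appeal of counterfactuals for companies worried about giving away a competitive advantage.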

The problems with disparate effect related to just about any personal category are overwhelming. Virtually any algorithm will have a varying impact on several classes of people. For example, a machine learning advertising recommendation system could discriminate against millionaires who have red hair and are left-handed, thereby making them victims and allowing them to alert the GDPR, which in turn could legally extract millions or even billions of dollars from the offending company.

Of course, my examples here are exaggerations. But the point is, the GDPR makes these crazy scenarios at least feasible. And the cost to a company of defending against such actions could easily put them out of business.

I suspect there will be all kinds of unintended consequences of the GDPR. The regulations could easily stifle machine learning innovation by big companies with lots to lose, and push innovation to small startups. The GDPR could greatly reduce mergers and acquisitions because a large company that acquires a small company that is liable in some way inherits that liability. And on and on.

The intentions of the GDPR may be good, but the realization appears to be very weak. But I’m no lawyer (thank goodness) so only time will tell regarding the impact of GDPR.

Signage with unintended consequences

Posted in Machine Learning

A Neural Network in C Language

The C programming language was one of the first languages I learned. My other first languages before C were COBOL and Fortran (ugh to them), BASIC (loved it), and Pascal (loved it; where I learned pointers).

I don’t write C programs very much these days, but I like the language and so every now and then I code up something for fun. So, one Sunday morning I set out to implement a neural network in C. I was surprised that my C skills were still pretty good and I had my demo program up and running in a couple of hours.

It was difficult to choose a design but in the end I used a typedef-struct to hold the key data structures of the hidden and output nodes (no explicit input nodes in my design), the input-hidden and hidden-output weights, and the hidden and output biases.

typedef struct {
  int ni, nh, no;
  float *h_nodes, *o_nodes; 
  float **ih_wts, **ho_wts;
  float *h_biases, *o_biases;
} nn_t;

After that, I had to refresh my memory for C language dynamic memory allocation using malloc() and a few other details, and then it was just a question of writing-testing-debugging.

To verify my code was working correctly, I used Keras to create and train a 4-5-3 neural network for the ubiquitous Iris Dataset problem, fetched the resulting model weights, placed them into my C version, and checked that the output was the same (which it was).

Although beauty is in the eye of the beholder, I find the C language to be very beautiful — simple and logical — and a good example of how design-by-one-person (or two people in the case of C) is often better than design-by-committee.

int main()
{
  printf("\nBegin neural net with C demo \n");

  nn_t net;
  construct(&net, 4, 5, 3);

  /* 43 values: 4x5 input-hidden weights, 5 hidden biases,
     5x3 hidden-output weights, 3 output biases */
  float wts[43] = {
    0.7555, -0.0460,  0.3509, -0.3423, -0.5747,
    0.6113, -0.7757,  0.6457, -0.8525, -0.8164,
   -0.3907, -0.1120,  0.0006,  0.5239,  1.0379,
   -0.1949, -0.2606, -0.7172, -0.8620,  1.0875,

    0.0021, -0.0225,  0.0019, -0.0558, -0.1134,

   -0.7609,  0.1546,  0.4329,
   -0.7539,  0.4817,  0.8748,
   -0.7350, -0.6023, -0.6515,
   -0.0345, -0.8043, -0.0536,
   -1.9457,  0.3096,  2.2589,

   -0.0784,  0.1829, -0.1045 };
  set_weights(&net, wts);
  printf("\nneural net: \n");

  float inpts[4] = { 6.1, 3.1, 5.1, 1.1 };
  int i;
  printf("\ninput: \n");
  for (i = 0; i < 4; ++i)
    printf("%0.1f ", inpts[i]);

  float* probs = eval(net, inpts);

  printf("\n\noutput: \n");
  for (i = 0; i < 3; ++i)
    printf("%0.4f ", probs[i]);

  return 0;
}
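For readers who want to cross-check numbers without a C compiler, here is a rough Python equivalent of the forward pass. It rests on assumptions the listing doesn't confirm: that the 43 values are laid out as input-hidden weights (row by row), hidden biases, hidden-output weights, then output biases, and that the network uses tanh hidden activation with softmax output. The name eval_net is made up for this sketch.

```python
import math

WTS = [
    0.7555, -0.0460,  0.3509, -0.3423, -0.5747,
    0.6113, -0.7757,  0.6457, -0.8525, -0.8164,
   -0.3907, -0.1120,  0.0006,  0.5239,  1.0379,
   -0.1949, -0.2606, -0.7172, -0.8620,  1.0875,
    0.0021, -0.0225,  0.0019, -0.0558, -0.1134,
   -0.7609,  0.1546,  0.4329,
   -0.7539,  0.4817,  0.8748,
   -0.7350, -0.6023, -0.6515,
   -0.0345, -0.8043, -0.0536,
   -1.9457,  0.3096,  2.2589,
   -0.0784,  0.1829, -0.1045,
]

def eval_net(wts, x, ni=4, nh=5, no=3):
    """Forward pass for an ni-nh-no net (assumed: tanh hidden, softmax output)."""
    assert len(wts) == ni * nh + nh + nh * no + no
    pos = 0
    ih = [wts[pos + i * nh: pos + (i + 1) * nh] for i in range(ni)]
    pos += ni * nh
    hb = wts[pos: pos + nh]
    pos += nh
    ho = [wts[pos + j * no: pos + (j + 1) * no] for j in range(nh)]
    pos += nh * no
    ob = wts[pos: pos + no]

    h = [math.tanh(hb[j] + sum(x[i] * ih[i][j] for i in range(ni)))
         for j in range(nh)]
    z = [ob[k] + sum(h[j] * ho[j][k] for j in range(nh)) for k in range(no)]
    m = max(z)                       # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

probs = eval_net(WTS, [6.1, 3.1, 5.1, 1.1])
print(" ".join(f"{pr:0.4f}" for pr in probs))
```

If the assumed layout or activations differ from the C version, the printed probabilities will differ too, so treat a mismatch as a hint about the assumptions rather than as a bug in either program.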

Sometimes beauty (and not-beauty) are not in the eye of the beholder — they’re painfully obvious.

Posted in Machine Learning

Recap of the 2018 Interop ITX Conference

I recently attended, and spoke at, the 2018 Interop ITX conference. The event ran from April 30 to May 4 at the Mirage hotel in Las Vegas. I believe that Interop ITX is the largest vendor-neutral (meaning not organized by an IT company like Cisco) conference for IT. See

I estimate there were between 3500 and 5000 attendees, speakers, and vendors at the conference. (I based my estimate on counting chairs in the huge room where attendees are fed lunch.) The attendees I talked to came from all kinds of backgrounds: big companies, small companies, government employees, and everything in-between.

Most of the big players in IT were represented in some way at the conference. For example, I talked to people from Google, VMware, IBM, and so on.

I gave a talk titled “Understanding Deep Neural Networks” where I explained exactly what a regular neural network is, and how multiple hidden layers create a deep neural network. Then I described LSTM (long short-term memory) networks, and CNNs (convolutional neural networks). My talk was a bit more technical than most I sat in on at Interop, but that was my intention.

I also sat on a panel discussion about data science in general, but with an emphasis on career implications. My fellow panel members were Genetha Gray (Intel) and Kim Schmidt (a consultant specializing in AWS). They were both very knowledgeable and articulate. My strongest opinion, in response to a question from the moderator, was that with regard to security, everyone should have a strong basic knowledge of encryption and hashing algorithms.

Some of the trends I observed at Interop were: IoT remains a big topic of conversation, blockchain wasn’t discussed as much as I thought it would be, and as expected, there was huge interest in machine learning and AI. The two most interesting types of companies I saw at Interop were companies in Legal/Law (NLP and document analysis), and in bio-related fields (pharma, genomics, etc.).

All in all, it was an excellent event. Interop ITX has been around for many years and in my mind what sets it apart is how the event changes significantly every year, reflecting trends in IT. For example, 10 years ago, most of the vendors at Interop were hardware related: cables, routers, etc. Then a few years ago, it was all about “software defined networks”. And then this year there was a lot of buzz around AI. And who knows what the trend will be next year? Uh, actually, I do: it will probably be AI again.

If you work in the IT world, I recommend that you give strong consideration to attending the 2019 Interop ITX event.

Vegas is bizarre. The Rhumbar lounge at the Mirage has hookah waterpipes, which were a topic of conversation between my talks. It was strange to see men and women in business attire using these things. Well, strange to me because I’d never seen one before except in “Alice in Wonderland” (1951) and didn’t know what it was. The hookahs and business people were a weird marriage of . . . well, I’m not sure of exactly what.

Posted in Conferences

The 2018 Women’s World Chess Championship

A match for the women’s world chess championship is being played as I write this post. The two contestants are Zhongyi Tan, the current champion, and Wenjun Ju, the challenger. Both players are from China.

The match is being played in Shanghai and is the best of 10 games. After the first three games, the challenger has a 2.5 – 0.5 lead. But there are still seven games to go so anything can happen.

The third game was very exciting. Wenjun Ju had the white pieces and played the Catalan opening which has a reputation for being safe and conservative. But after only 26 moves, the challenger had the champion in dire straits. In this position, black has just played Kd7, attacking white’s rook.

The challenger pounced with Qd4 check, leaving her rook unprotected! If black takes the rook, Qd6 checkmate follows, and if the king retreats, black loses her queen in three moves (easy for a grandmaster to see, but not so obvious to chess mortals like me).

There’s a long and fascinating connection between chess and computer science and artificial intelligence. I hope to discuss some of these ideas in a future post.

Posted in Miscellaneous

The IMDB Movie Sentiment Analysis Problem using a CNTK LSTM Network

I successfully tackled the IMDB Movie Review dataset sentiment analysis problem. Coding up the system using the CNTK neural library was both easy and difficult: getting the data was difficult, but writing the code was not as difficult as I’d anticipated.

The problem is to create a prediction model for the IMDB dataset. There are 25,000 written reviews for training. Each review is labeled positive (“Great movie!”) or negative (“This film wasn’t bad, it was terrible.”). There are also 25,000 hold-out reviews for testing.

I used an LSTM (“long short-term memory”) network. Getting the data into a format suitable for CNTK was a major challenge. Somewhat unexpectedly, writing the code to create and train the prediction model wasn’t all that bad, although of course difficulty is all relative. My model achieved 91.20% accuracy on 23,130 of the 25,000 test reviews (I filtered out all reviews longer than 500 words).
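The length filter and the accuracy figure are both simple to compute. The sketch below shows the idea in plain Python on a tiny made-up sample; the function names are hypothetical, and splitting on whitespace is an assumption about how words were counted.

```python
def filter_by_length(reviews, max_words=500):
    """Keep (text, label) pairs whose text has at most `max_words` words."""
    return [(text, label) for text, label in reviews
            if len(text.split()) <= max_words]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Tiny made-up sample; 1 = positive, 0 = negative.
reviews = [
    ("Great movie!", 1),
    ("This film wasn't bad, it was terrible.", 0),
    ("word " * 600, 1),          # too long, will be filtered out
]
kept = filter_by_length(reviews)
labels = [label for _, label in kept]
print(len(kept), "of", len(reviews), "reviews kept")
print("accuracy:", accuracy([1, 1], labels))   # pretend the model said 1, 1
```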

Luckily I stumbled on a very good sequence classification demo in the CNTK documentation. That demo accepts sequences of words (a sentence — typically 3-6 words) and classifies the sequence/sentence as one of five categories. Unfortunately the demo data had no explanation so I don’t know what the words were or what the five classes were, but I had enough experience with CNTK to not really need an explanation.

One interesting aspect of the demo data is that the input words were one-hot encoded (sparsely) in the data files, and then the words were encoded again programmatically using an Embedding layer. The Keras library accepts word indexes rather than one-hot encodings. Also, it appears that CNTK doesn’t support the Adam optimization algorithm on sparse input data on a CPU-only system. Handling details like these is always part of working with new technologies.
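The two input styles lead to the same result, which a small plain-Python sketch can show. Multiplying a one-hot vector by an embedding matrix just selects one row of that matrix, so passing a word index and looking up the row directly is equivalent. The matrix values, sizes, and function names below are made up for illustration and don't come from CNTK or Keras.

```python
import random

VOCAB_SIZE, EMBED_DIM = 6, 3
rng = random.Random(0)

# Hypothetical embedding matrix: one row of EMBED_DIM values per word.
E = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_one_hot(vec, matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    return [sum(vec[i] * matrix[i][d] for i in range(len(vec)))
            for d in range(len(matrix[0]))]

def embed_index(index, matrix):
    """Look up the embedding row directly from a word index."""
    return list(matrix[index])

word_index = 4
a = embed_one_hot(one_hot(word_index, VOCAB_SIZE), E)
b = embed_index(word_index, E)
print(a)
print(b)
```

The index form avoids storing and multiplying long vectors of mostly zeros, which matters when the vocabulary has tens of thousands of words.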

I ran into many glitches of course, and there are several parts of the code I don’t fully understand yet. But that’s pretty normal when dealing with complex machine learning problems.

All in all, it was a very challenging and satisfying exercise.

Many of my friends find travel to be highly satisfying. It’s hard to disagree. “Experience, travel – these are an education in themselves.” – Euripides, circa 450 BC.

Posted in CNTK, Machine Learning