Machine Learning, Data Science, and Statistics

There are no universally agreed-upon definitions for the terms “machine learning”, “data science”, and “statistics”. In my mind, classical statistics consists of traditional techniques that were developed from roughly the 1920s through the 1970s. Classical techniques include correlation, linear regression, and the t-test for hypothesis testing.

In my mind, machine learning consists of techniques that make predictions based on data and usually require computer analysis. Examples include logistic regression classification, neural network classification, and k-means clustering.

[Image: diagram of the relationships among machine learning, data science, and statistics]

In my mind, data science is a general term that includes classical statistics, machine learning, and other topics such as database theory and practice.

And in my mind, artificial intelligence is a term that refers to systems that loosely mimic human behavior. Topics include speech recognition and pattern (visual) recognition.

I recently sat in on an interesting talk at the SAS Analytics conference. The talk was a general overview of machine learning and was given by Brett Wujek of SAS. The talk had a PowerPoint slide that attempted to illustrate the relationships between various terms.

On the one hand, an exercise like this is somewhat futile because it’s very subjective. But on the other hand, it’s an interesting attempt to clarify relationships.

Posted in Machine Learning

Handling File Upload using ASP.NET ASHX

On a recent project, I needed to do a Web-based file upload. On the client side I decided to use plain vanilla HTML and JavaScript, as opposed to a commercial or open source library, or jQuery, or Flash.

On the server side, to accept a file upload and save it, there were also many options. The two leading options for my particular scenario were a PHP handler and an ASP.NET ASHX handler.

[Image: file upload handling using ASHX]

I hadn’t worked with ASHX before. ASHX is a generic handler, meaning it’s very raw and is designed to do general-purpose things with an HTTP request from a client, as opposed to being designed to generate a specific kind of response such as a Web page.

Anyway, I was pleasantly surprised that, even though ASHX scripts do have a certain amount of overkill abstraction that’s common to all ASP.NET technologies, using ASHX was not as bad as I’d feared it’d be.

The code is quite self-explanatory:

<%@ WebHandler Language="C#" Class="Handler" %>

using System;
using System.Web;

public class Handler : IHttpHandler {

  public void ProcessRequest (HttpContext context) {
    //context.Response.ContentType = "text/plain";
    //context.Response.Write("Hello World");

    try
    {
      HttpPostedFile postedFile = context.Request.Files[0];
      if (postedFile.ContentLength == 0)
        throw new Exception("Empty file received");

      // cannot restrict accept type on client
      //if (postedFile.ContentType == "image/png")  
      //  throw new Exception("PNG files not allowed");

      context.Response.Write(postedFile.ContentType);

      string fn = System.IO.Path.GetFileName(postedFile.FileName);
      // to save in 'App_Data':
      // string path =  
      //   HttpContext.Current.Server.MapPath("~/App_Data/");
      postedFile.SaveAs("C:\\Data\\Junk\\Uploads2\\" + fn);
      context.Response.Write("Server received " + fn);  //
    }
    catch (Exception ex)
    {
      context.Response.Write("Error occurred on server " +
        ex.Message);
    }

  } // ProcessRequest

  public bool IsReusable {
    get {
      return false;
    }
  }

} // class Handler

In general I try to avoid ASP.NET technologies, but handling a file upload using ASHX is a good way to go in a scenario where you’re working with Microsoft Web (IIS and ASP.NET) technologies.

Posted in Miscellaneous

NFL 2016 Week 4 Predictions – Zoltar Looks for Redemption

Zoltar is my NFL prediction program. Here are Zoltar’s predictions for week 4 of the 2016 NFL season:

Zoltar:     bengals  by    8  dog =    dolphins    Vegas:     bengals  by    7
Zoltar:       colts  by    0  dog =     jaguars    Vegas:       colts  by  2.5
Zoltar:    patriots  by   10  dog =       bills    Vegas:    patriots  by  4.5
Zoltar:      texans  by   10  dog =      titans    Vegas:      texans  by  6.5
Zoltar:       lions  by    0  dog =       bears    Vegas:       lions  by    3
Zoltar:        jets  by    2  dog =    seahawks    Vegas:    seahawks  by  1.5
Zoltar:      ravens  by    2  dog =     raiders    Vegas:      ravens  by  3.5
Zoltar:    panthers  by    1  dog =     falcons    Vegas:    panthers  by  3.5
Zoltar:    redskins  by   10  dog =      browns    Vegas:    redskins  by  9.5
Zoltar:     broncos  by    5  dog =  buccaneers    Vegas:     broncos  by    3
Zoltar:   cardinals  by    6  dog =        rams    Vegas:   cardinals  by    9
Zoltar:      saints  by    0  dog =    chargers    Vegas:    chargers  by    4
Zoltar: fortyniners  by    2  dog =     cowboys    Vegas:     cowboys  by    3
Zoltar:    steelers  by    2  dog =      chiefs    Vegas:    steelers  by    6
Zoltar:     vikings  by   10  dog =      giants    Vegas:     vikings  by    4

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. So for week 4:

1. Zoltar recommends betting on the Vegas favorite Patriots over the Bills because Zoltar predicts the Patriots will win by 10 points, and easily cover the 4.5 point spread.

2. Zoltar recommends betting on the Vegas favorite Texans over the Titans because Zoltar predicts the Texans will win by 10 points, and cover the 6.5 point spread.

3. Zoltar recommends betting on the Vegas underdog Saints over the Chargers because Zoltar predicts a virtual tie, and so the Chargers will not cover the 4.0 point spread.

4. Zoltar recommends betting on the Vegas underdog Chiefs over the Steelers because Zoltar predicts the Steelers will win by only 2 points, and not cover the 6.0 point spread.

5. Zoltar recommends betting on the Vegas favorite Vikings over the Giants because Zoltar predicts the Vikings will win by 10 points, and easily cover the 4.0 point spread.


In week 3, Zoltar went a so-so 3-2 against the Vegas point spread. For the year, Zoltar is 9-5 against the Vegas spread, for 64% accuracy. Historically, Zoltar’s accuracy against the Vegas spread is usually between 62% and 72% over the course of an entire season. Theoretically, if you must bet $110 to win $100 (typical), then you’ll make money if you predict at 53% accuracy or better.
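
To see where the 53% figure comes from, here is the quick break-even arithmetic. If you bet $110 to win $100 and your win rate is p, you break even when winnings equal losses:

100(p) - 110(1 - p) = 0
210(p) = 110
p = 110 / 210 = 0.524

So any win rate above roughly 52.4% (call it 53%) is profitable.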

Just for fun, I track how well Zoltar and Bing Predictions do when trying to predict just which team will win a game. This isn’t useful except for parlay betting.

In week 3, just predicting winners, Zoltar was a poor 8-8. Bing was 8-8 also, and so was the Vegas point spread.

For the season, just predicting winners, Zoltar is 27-21 (56% accuracy). Bing is 25-23 (52% accuracy).

Note: Zoltar sometimes predicts a 0-point margin of victory. In those situations, to pick a winner so I can compare against Bing, Zoltar picks the home team to win during the first four weeks of the season; after week 4, Zoltar uses historical data from the current season. This strategy hurt Zoltar in week 3 because 4 of the 5 home teams in such situations lost.

[Image: Zoltar presentation title slide]

Posted in Machine Learning

Docker for Windows

I was taking a look at Docker for Windows. It’s difficult to explain what Docker is if you haven’t seen it before. I tend to think of Docker as a mechanism that’s somewhat similar to running a virtual Linux machine on your physical Windows machine with a VHD (virtual hard drive) file.

However, Docker is much more sophisticated and instead of loading an entire virtual machine, you load only a stripped down “image” and run it as a “container”.

[Image: Docker for Windows]

First I installed Docker for Windows on my Windows 10 machine. Docker starts running in the background immediately after installation. Then I launched a CMD shell and issued the command:

docker run -it ubuntu:16.04

which loosely means, “find the image for v 16.04 of Ubuntu Linux and start it in interactive terminal mode.”

Because I hadn’t downloaded the image before, Docker reached out on the Internet to Docker Hub (a huge collection of images) and fetched the image and started a container. In effect I was running Ubuntu.

From the Ubuntu prompt, I issued an echo command (print a message) and an ls (list directory contents) command. I exited Ubuntu.

The docker ps command showed all running containers (I only had one) and the docker images command showed the Ubuntu image.

Off screen, I stopped the container by issuing a docker stop mad_shockley command (mad_shockley is the name Docker auto-generated for my container) and then force-removed the image with docker rmi ubuntu:16.04 -f.
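
In summary, the whole session looks roughly like this (a sketch; the container name Docker auto-generates for you will differ):

docker run -it ubuntu:16.04     (fetch the image if necessary, start an interactive container)
docker ps                       (list running containers)
docker images                   (list local images)
docker stop mad_shockley        (stop the container)
docker rmi ubuntu:16.04 -f      (force-remove the image)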

Docker is very, very complicated but very, very interesting.

Posted in Miscellaneous

Log Loss and Cross Entropy are Almost the Same

Log loss and cross entropy are measures of error used in machine learning. The underlying math is the same. Log loss is usually used when there are just two possible outcomes, encoded as 0 and 1. Cross entropy is usually used when there are three or more possible outcomes.

Suppose you are looking at a really weirdly shaped spinner that can come up “red”, “blue”, or “green”. You do some sort of physics analysis and predict the probabilities of (red, blue, green) are (0.1, 0.5, 0.4). Then you spin the spinner many thousands of times to determine the true probabilities and get (0.3, 0.2, 0.5). To measure the accuracy of your prediction you can use cross entropy error which is best explained by example:

CE = -( ln(0.1)(0.3) + ln(0.5)(0.2) + ln(0.4)(0.5) )
   = -( (-2.30)(0.3) + (-0.69)(0.2) + (-0.92)(0.5) )
   = 1.29

In words, cross entropy is the negative sum of the products of the logs of the predicted probabilities times the actual probabilities. Smaller values indicate a better prediction.
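
For readers who prefer code, here is a minimal C# sketch (my own illustration, not code from any particular library) that computes cross entropy error for the spinner example:

using System;

public class CrossEntropyDemo
{
  // cross entropy: negative sum of ln(predicted) times actual
  static double CrossEntropy(double[] predicted, double[] actual)
  {
    double sum = 0.0;
    for (int i = 0; i < predicted.Length; ++i)
      sum += Math.Log(predicted[i]) * actual[i];
    return -sum;
  }

  static void Main()
  {
    double[] predicted = new double[] { 0.1, 0.5, 0.4 };
    double[] actual = new double[] { 0.3, 0.2, 0.5 };
    Console.WriteLine(CrossEntropy(predicted, actual));  // approx 1.29
  }
}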

But scenarios where the actual values are themselves probabilities are rare. Most often the actual outcomes are discrete categories. For example, suppose you are trying to predict the political leaning (conservative, moderate, liberal) of a person based on their age, annual income, and so on. You encode (conservative, moderate, liberal) as [(1,0,0) (0,1,0) (0,0,1)]. Now suppose that for a particular age, income, etc. your prediction is (0.3, 0.6, 0.1) = moderate, because the middle value is largest.

Using cross entropy, the error is:

CE = -( ln(0.3)(0) + ln(0.6)(1) + ln(0.1)(0) )
   = -( 0 + (-0.51)(1) + 0 )
   = 0.51

Notice that in this common scenario, because of the 0s and a single 1 encoding, only one term contributes to cross entropy.
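
The same hypothetical CrossEntropy function from the sketch above handles this one-hot case directly:

double[] predicted2 = new double[] { 0.3, 0.6, 0.1 };  // prediction probabilities
double[] actual2 = new double[] { 0.0, 1.0, 0.0 };     // one-hot actual: moderate
Console.WriteLine(CrossEntropy(predicted2, actual2));  // approx 0.51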

Now log loss is the same as cross entropy, but in my opinion, the term log loss is best used when there are only two possible outcomes. This simultaneously simplifies and complicates things.

For example, suppose you’re trying to predict (male, female) from things like annual income, years of education, etc. If you encode male = 0 and female = 1 (notice you only need one number for a binary prediction), your prediction is 0.3 (male, because it’s less than 0.5), and the actual result is female, the log loss is:

LL = -( ln(0.3) )
   = 1.20

In words, for log loss with binary prediction, you just take the negative log of your predicted probability of the true result. This is the same as cross entropy.
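
Here is a corresponding minimal C# sketch (again my own illustration) for the binary case, where p is the predicted probability of the class encoded as 1:

using System;

public class LogLossDemo
{
  // log loss for a binary outcome; p is the predicted probability of class 1
  static double LogLoss(double p, int actualClass)
  {
    if (actualClass == 1)
      return -Math.Log(p);
    else
      return -Math.Log(1.0 - p);
  }

  static void Main()
  {
    // prediction 0.3 (probability of female = 1), actual result is female
    Console.WriteLine(LogLoss(0.3, 1));  // approx 1.20
  }
}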

So, pretty simple in principle. However, understanding all the things related to log loss and cross entropy is difficult because there are so many interrelated details, plus, vocabulary is used inconsistently.

Posted in Machine Learning

2016 DevIntersection Conference Pre-Event Interview

The 2016 DevIntersection conference is one of my favorite events for software developers and IT engineers. I will be speaking there about the R language and machine learning.

I recently did a short, four-minute interview with Richard Campbell, the technical chair for DevIntersection. See https://channel9.msdn.com/Shows/The-DEVintersection-Countdown-Show/James-McCaffrey.

[Image: pre-event interview, James and Richard]

The 2016 DevIntersection will be in Las Vegas, at the MGM Grand Hotel, from October 25-28. See http://www.devintersection.com. The event will be co-located with the ITEdge Intersection conference for IT people, and the AngleBrackets conference for Web developers.

[Image: pre-event interview, James]

It’s hard to quantify the value you get from going to a conference. The costs are significant: dollar costs (software conferences typically run about $2,000 to $5,000 per person, so you have to get your company to foot the bill), and there are time-away-from-work costs. But for me, the benefits far outweigh the costs. I return to work with renewed energy, enthusiasm, and motivation, and I always learn something useful.

Come and hang out with me at DevIntersection in Las Vegas!

Posted in Conferences

Handling File Uploads using PHP

On a recent work project, I had to create a little system that allows users to upload files from their home machines to a Web server. It was a much more difficult problem than I thought it’d be.

It took me a few days to figure out the client side — a Web page that contains code to allow users to select files and send them. Once I had that part figured out, I turned my attention to the server side — fetching the files that were sent and then saving them.

In principle, any server-side technology can fetch uploaded files. But for my particular scenario, the two primary candidate technologies were 1.) a simple PHP script, and 2.) a simple ASP.NET ASHX script.

I got a basic PHP version (no error-checking) working, after futzing around with getting PHP installed and working with IIS (I don’t use Apache very often).

The PHP code is:

<?php

 if (isset($_FILES['myFile'])) {
   echo $_FILES['myFile']['tmp_name'];
   echo "\n";
   echo $_FILES['myFile']['name'];
   move_uploaded_file($_FILES['myFile']['tmp_name'],
     "C:/Data/Junk/Uploads/" . $_FILES['myFile']['name']);
 }

?>

The client-side code sets a key named ‘myFile’ with the uploaded file as its value. The server-side script looks for that key; PHP automatically saves the uploaded file under a temporary name in a temporary directory, and then move_uploaded_file() moves the temp file into an Uploads directory on the server.

The basic idea is simple, but there are tons of details.

[Image: handling file uploads using PHP]

Posted in Miscellaneous