“Logistic Regression Using the scikit Library” in Visual Studio Magazine

I wrote an article titled “Logistic Regression Using the scikit Library” in the February 2023 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2023/02/01/scikit.aspx.

Logistic regression is a machine learning technique for binary classification. For example, you might want to predict the sex of a person (male or female) based on their age, state where they live, income and political leaning. There are many other techniques for binary classification, but logistic regression was one of the earliest developed and the technique is considered a fundamental machine learning skill for data scientists.

There are many tools and code libraries that you can use to perform logistic regression. The scikit-learn library (also called scikit or sklearn) is based on the Python language and is one of the most popular machine learning libraries.

My article presents a complete end-to-end demo of logistic regression using the scikit LogisticRegression class.

Interestingly, there are two main variations of scikit LogisticRegression. One uses L-BFGS optimization for training, and the other uses SGD (stochastic gradient descent) for training. My demo uses the SGD approach.

Confusingly, scikit also has an SGDClassifier class that supports several types of classification algorithms trained using SGD, one of which is logistic regression. Unnecessary overlap like this sometimes happens in open source projects like scikit that are developed by many different contributors.

Creating and training a logistic regression model is almost too easy:

  from sklearn.linear_model import LogisticRegression

  print("Creating logistic regression model")
  model = LogisticRegression(random_state=0,
    solver='sag', max_iter=1000, penalty='none')
  model.fit(train_x, train_y)
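After fit() returns, the trained model exposes score(), predict() and predict_proba() for evaluation and prediction. Here is a self-contained sketch; train_x and train_y are synthetic stand-ins for the demo data, and penalty is left at its 'l2' default because the string 'none' was removed in newer scikit versions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data for the article's demo data
rng = np.random.default_rng(1)
train_x = rng.normal(size=(200, 3))
train_y = (train_x[:, 0] - train_x[:, 2] > 0).astype(int)

model = LogisticRegression(random_state=0,
  solver='sag', max_iter=1000)
model.fit(train_x, train_y)

acc = model.score(train_x, train_y)       # mean classification accuracy
probs = model.predict_proba(train_x[:1])  # pseudo-probabilities for each class
pred = model.predict(train_x[:1])         # predicted class label, 0 or 1
print(acc, probs, pred)
```

The two columns returned by predict_proba() sum to 1.0 and can be read as the model's confidence in class 0 and class 1 respectively.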

The most difficult part about using the scikit library is that each machine learning class usually has a lot of constructor parameters. The signature for the scikit LogisticRegression class is:

LogisticRegression(penalty='l2', *, dual=False, tol=0.0001,
  C=1.0, fit_intercept=True, intercept_scaling=1,
  class_weight=None, random_state=None, solver='lbfgs',
  max_iter=100, multi_class='auto', verbose=0,
  warm_start=False, n_jobs=None, l1_ratio=None)

It can take a lot of time to read through the scikit documentation to figure out the purpose of each parameter.

Machine learning has come a long way since the invention of logistic regression classification in the 1940s (although the technique wasn’t widely accepted as useful until the 1970s). New technologies can develop very quickly. Left: In 1904, the “Gobron-Brillie Gordon Bennett” car, driven by Louis Rigolly, was the first to reach 100 mph. Center: In 1927, the “Sunbeam 1000 hp Mystery” driven by Henry Segrave, was the first to reach 200 mph. Right: In 1935, the “Campbell-Railton Blue Bird”, driven by Malcolm Campbell, was the first to reach 300 mph.
