## Feature Engineering and Machine Learning

Suppose you want to predict a person’s annual income based on their number of years of experience, age, number of years of education, and so on. In classical statistics it’s common to spend a lot of time on feature engineering: deciding which predictors to use and which not to use, and creating derived predictors from raw predictors. One example might be creating an “age-education” variable that is the square root of the age times the years of education.
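
A derived predictor like this is just a function of the raw predictors. A minimal sketch (the function name is mine, and I'm reading the description as sqrt(age) multiplied by years of education):

```python
import math

def age_education(age, years_education):
    # Hypothetical derived predictor: square root of age, times years of education
    return math.sqrt(age) * years_education

# For a 25-year-old with 16 years of education:
age_education(25, 16)  # 5.0 * 16 = 80.0
```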

But in neural prediction systems it’s quite rare to perform much feature engineering. The idea is that during training, the network will learn which predictors aren’t important and assign them very small weights, and, because of the non-linear activation functions, non-linear combinations of predictor values are created automatically.

This morning (as I write this post) I decided to do some feature engineering on the airline passenger dataset to verify that it doesn’t work well. This is a time series regression problem where the goal is to predict the number of airline passengers. A data setup for a straightforward approach looks like:

```|curr 1.12 1.18 1.32 1.29 |next 1.21
|curr 1.18 1.32 1.29 1.21 |next 1.35
|curr 1.32 1.29 1.21 1.35 |next 1.48
. . .
```
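
Each training item is a sliding window over the raw series: the current four months as predictors and the next month as the target. A sketch of building such pairs (my helper name, window size 4 as in the data above):

```python
def make_windows(series, window=4):
    # Build (current-values, next-value) pairs from a time series,
    # matching the |curr ... |next data format shown above.
    pairs = []
    for i in range(len(series) - window):
        pairs.append((series[i:i + window], series[i + window]))
    return pairs

passengers = [1.12, 1.18, 1.32, 1.29, 1.21, 1.35, 1.48]
pairs = make_windows(passengers)
# pairs[0] is ([1.12, 1.18, 1.32, 1.29], 1.21)
```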

The first line means there were (112,000, 118,000, 132,000, 129,000) passengers in months 1-4 and 121,000 passengers in month 5. Using this approach gives pretty good results with a standard neural network, and not-as-good results with a more sophisticated LSTM recurrent network. I then created a derived dataset using feature engineering:

```|curr 1.12 1.18 1.32 1.29 |next_pct 1.0804 |next_raw 1.21
|curr 1.18 1.32 1.29 1.21 |next_pct 1.1441 |next_raw 1.35
|curr 1.32 1.29 1.21 1.35 |next_pct 1.1212 |next_raw 1.48
. . .
```

Instead of predicting the raw passenger count, I predicted the percentage increase relative to the first value in the sequence. The first line means that in month 5, the passenger count was 1.0804 times 1.12, which is 1.21.
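
The engineered target is just the ratio of the next value to the first value in the current window. A sketch that reproduces the derived file's numbers (the function name is mine):

```python
def next_pct(window, next_raw):
    # Engineered target: ratio of the next value to the first
    # value in the current window, rounded as in the data file.
    return round(next_raw / window[0], 4)

next_pct([1.12, 1.18, 1.32, 1.29], 1.21)  # 1.0804, the first row's |next_pct
```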

Anyway, after thrashing around a bit with a PyTorch LSTM network I got some results. The results are a bit difficult to interpret, but overall the feature engineering approach I tried doesn’t appear to be promising, as expected.
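
For context, a minimal sketch of the kind of PyTorch LSTM regressor involved. This is not the post's actual network; the sizes and names here are my assumptions, sketching a model that maps a 4-month window to a predicted next count:

```python
import torch
import torch.nn as nn

class PassengerLSTM(nn.Module):
    # Hypothetical LSTM regressor: a sequence of monthly counts in,
    # one predicted next count out.
    def __init__(self, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim,
                            batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):                # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)            # out: (batch, seq_len, hidden_dim)
        return self.fc(out[:, -1, :])    # regress from the last time step

model = PassengerLSTM()
x = torch.tensor([[[1.12], [1.18], [1.32], [1.29]]])  # one window: (1, 4, 1)
pred = model(x)                                       # shape (1, 1)
```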

Things like this happen all the time. In the field of machine learning, you spend a lot of time creating systems that just don’t work well. An important mindset for success is learning to deal with failures, which are much more common than successes.

Dealing with failure is common across many fields. My friends who work in sales have the ability to not let their form of failure (not making a sale) affect them. Good baseball players fail more than half the time when batting but don’t dwell on the failures. And so on.

There is a lot of research evidence indicating that women fear failure much more than men do. For example, see “Gender Differences in Fear of Failure amongst Engineering Students” by Nelson, Newman, McDaniel, and Buboltz. This fear causes women to quickly drop out of computer science and engineering classes. On the other hand, fashion models seem to have little fear of fashion failure.


### 1 Response to Feature Engineering and Machine Learning

1. Thorsten Kleppe says:

Thank you for these deep thoughts, I never thought about that until now, bam! The “age-education” variable opens a new door. 🙂

On another note, I was asking myself why you don’t use syllables in your LSTM for the IMDB problem?
The new modifications for batch-norm/group-norm seem to help LSTMs, but I’m not really in the LSTM area of ML.

I hope you keep on with your ideas, and I am pretty sure it will pay off.

My work has had so many setbacks, but sometimes it seems some ideas worked well.
For batch norm I tried out 100 different implementations and it’s not working well at this time, but on the Edison scale I reached 1%, so I will keep on!

Some other ideas worked well the first time. On MNIST I was multiplying the outputs (plus the needed new weights) from 1*10 to 2*10 or more and taking the max, which meant I had to divide my target one into a partial one (partialOne = 1.0f / outputMulti) for every digit class.

With more outputs, like 5*10=50, the start becomes worse and worse, but the crazy part was that in the first test every result with a multiplied output was better at the end than the original 1*10 output, up to 30*10 or more. After that I tweaked more: the partialOne with special techniques, and then, instead of taking the best, I merged the results back to ten by taking the sum of the classes to create a new original 10 outputs and asking for the new max. The second test was bad, and all the great new ideas were significantly disturbed. Today I was reading about boosting on your blog, and I think I will give it a new try.

My goal for this year was a NN with accuracy > 90% on the first 10,000 training images of MNIST, and on a good day, I did it. So what’s coming next?