I was doing some experiments with anomaly detection when I noticed an effect that surprised me. Briefly, if you one-hot encode categorical data and then use deep neural autoencoding reconstruction error to identify anomalous data, the categorical data might overwhelm strictly numeric data. Let me explain.

I started with some synthetic employee data that looks like:

M 36 anaheim $44,500.00 supp F 24 anaheim $29,500.00 tech F 27 boulder $28,600.00 tech M 19 concord $32,700.00 mgmt . . .

The fields are sex, age, city, income, job-type. The data was preprocessed by one-hot encoding the categorical data and normalizing the numeric data:

0 0.36 1 0 0 0.4450 0 1 0 1 0.24 1 0 0 0.2950 0 0 1 1 0.27 0 1 0 0.2860 0 0 1 0 0.19 0 0 1 0.3270 1 0 0 . . .
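A minimal sketch of this preprocessing. The divisors (100 for age, 100,000 for income) and the alphabetical one-hot orderings are my inference from the numbers shown above, not something stated explicitly:

```python
# Hypothetical encoding sketch: sex is a single 0/1 value, age and
# income are divided by assumed constants, city and job-type are
# one-hot encoded in alphabetical order.
CITIES = ["anaheim", "boulder", "concord"]
JOBS = ["mgmt", "supp", "tech"]

def one_hot(value, domain):
    return [1.0 if value == v else 0.0 for v in domain]

def encode(sex, age, city, income, job):
    item = [0.0 if sex == "M" else 1.0]  # sex: M = 0, F = 1 (assumed)
    item += [age / 100.0]                # assumed age divisor
    item += one_hot(city, CITIES)
    item += [income / 100_000.0]         # assumed income divisor
    item += one_hot(job, JOBS)
    return item                          # 9 values total

print(encode("M", 36, "anaheim", 44_500, "supp"))
# matches the first normalized item above
```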

I created a 9-4-2-4-9 autoencoder where the 9 input values are condensed down to a latent vector of size 2 and then expanded back to 9 values. The difference between the original 9 values of an item and the reconstructed 9 values is a measure of how anomalous the item is: low reconstruction error indicates a normal data item and high reconstruction error indicates an anomalous item.
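The reconstruction-error score itself is just the sum of squared differences between the 9 input values and the 9 output values. A minimal sketch, with a made-up reconstruction for illustration:

```python
# Reconstruction error = sum of squared differences between an item's
# 9 input values and the autoencoder's 9 reconstructed output values.
def recon_error(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

x = [0.0, 0.36, 1.0, 0.0, 0.0, 0.445, 0.0, 1.0, 0.0]  # original item
y = [0.1, 0.30, 0.9, 0.1, 0.0, 0.400, 0.1, 0.8, 0.1]  # hypothetical reconstruction

print(recon_error(x, y))  # larger values suggest a more anomalous item
```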

But then I created a clearly anomalous data item, a 9-year-old who makes $90,000.00:

M 9 concord $90,000.00 mgmt

Then, after encoding and normalizing, the autoencoder reconstruction error anomaly detection did not flag the item as anomalous. Hmmm. What went wrong?

After a few hours of experimentation, I think I know what happened. Of the 9 item values, 7 are due to categorical encoding (all except for age and income). The 7 values due to categorical encoding overwhelm the 2 numeric values. After the fact, this made total sense. The city and job-type predictors should not contribute three terms each while the sex, age, and income only contribute one term.
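A toy illustration of the imbalance. The column positions follow the layout shown earlier, and the reconstruction values are made up: even though the age and income errors are the meaningful signal, the 7 categorical columns can contribute most of the total error:

```python
# Split a hypothetical reconstruction error into categorical vs.
# numeric contributions. Columns 0, 2-4, 6-8 come from categorical
# encoding (sex, city, job-type); columns 1 and 5 are age and income.
CAT_COLS = [0, 2, 3, 4, 6, 7, 8]
NUM_COLS = [1, 5]

x = [0.0, 0.09, 0.0, 0.0, 1.0, 0.90, 1.0, 0.0, 0.0]  # the 9-year-old item
y = [0.2, 0.15, 0.2, 0.2, 0.6, 0.60, 0.6, 0.2, 0.2]  # hypothetical reconstruction

cat_err = sum((x[i] - y[i]) ** 2 for i in CAT_COLS)
num_err = sum((x[i] - y[i]) ** 2 for i in NUM_COLS)
print(cat_err, num_err)  # the 7 categorical columns dominate the total
```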

So, what to do? I tried two ideas, both of which seemed to work well.

The first idea was to just weight the one-hot values by taking their average squared error. For example, the city values are in columns (2,3,4) so instead of accumulating the three corresponding squared error terms, a weighted approach would sum those three squared values and then divide the sum by 3. Simple, and apparently effective.
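A sketch of the weighted idea, assuming the column grouping from the layout above (sex, age, city, income, job-type). Each one-hot group's squared errors are summed and divided by the group's size, so city and job-type each contribute one averaged term, just like the single-column predictors:

```python
# Weighted reconstruction error: average each one-hot group's squared
# errors over the group size. Column groups are assumed from the
# 9-value layout shown earlier.
GROUPS = [[0], [1], [2, 3, 4], [5], [6, 7, 8]]  # sex, age, city, income, job

def weighted_error(x, y):
    total = 0.0
    for group in GROUPS:
        sq = sum((x[i] - y[i]) ** 2 for i in group)
        total += sq / len(group)  # divide the group's sum by its size
    return total

x = [0.0, 0.36, 1.0, 0.0, 0.0, 0.445, 0.0, 1.0, 0.0]  # original item
y = [0.1, 0.30, 0.9, 0.1, 0.0, 0.400, 0.1, 0.8, 0.1]  # hypothetical reconstruction
print(weighted_error(x, y))  # smaller than the unweighted sum
```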

A second idea I tried, which also appeared to work well, was to re-encode the categorical data using 0.25 and 0.75 instead of 0 and 1. For example, “boulder” would be encoded as (0.25 0.75 0.25) instead of (0 1 0).

The idea is that error values of encoded data are dampened. For example, suppose an input of “boulder” = (0 1 0) generates an output of (0.2 0.5 0.3). The squared error is (0 – 0.2)^2 + (1 – 0.5)^2 + (0 – 0.3)^2 = 0.04 + 0.25 + 0.09 = 0.38. But if “boulder” is encoded as (0.25 0.75 0.25) then the error term is (0.25 – 0.2)^2 + (0.75 – 0.5)^2 + (0.25 – 0.3)^2 = 0.0025 + 0.0625 + 0.0025 = 0.0675. Using the modified (0.25 0.75) encoding scheme instead of the raw (0 1) scheme, the error term due to the one-hot encoded variable has been significantly reduced.
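The arithmetic above can be reproduced directly, scoring the same hypothetical network output against the raw (0 1) encoding and the dampened (0.25 0.75) re-encoding:

```python
# Same network output scored against the raw one-hot target versus
# the 0.25/0.75 re-encoded target, matching the worked example.
def sq_err(target, output):
    return sum((t - o) ** 2 for t, o in zip(target, output))

output = [0.2, 0.5, 0.3]                    # hypothetical network output
raw = sq_err([0.0, 1.0, 0.0], output)       # 0.38
damp = sq_err([0.25, 0.75, 0.25], output)   # 0.0675
print(raw, damp)
```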

It will require some experimentation to determine how well the weighted and re-encoding ideas work. And if the ideas are valid, it will require additional experiments to determine the details. But either way, good fun. What could possibly go wrong?

You are a genius. All of these little tricks are worth their weight in gold, and who knows?

You could try initializing all weights with very small positive values. I tried your neural regression example and it was very insightful. You can see the importance of the trained weights relative to the other inputs.

https://github.com/grensen/regression_house_net5_demo/blob/main/RegressionDemo/regression_8_10_10_1.txt

Weights 2–11 = air conditioning = not so important.

Weights 12–21 = square feet = most important.

And so on…

Combined with your idea we could remove the bias from the input? Oo

What should go wrong? ^^

Inspirational as always 🙂

Interesting idea. Feature weighting is tricky and usually isn’t done with neural networks — the idea is that the ordinary weights and bias values are enough. But this also involves the extent to which the predictor values are normalized. Like many things in neural networks, the ideas are not fully understood. JM