I try to write at least one PyTorch program every day. PyTorch is complicated and the only way I can learn new techniques, and avoid losing some of my existing PyTorch knowledge, is to write programs.

One morning I decided to implement an autoencoder. I consider autoencoders to be one of the four basic types of neural networks that all data scientists should know. (The other three are binary classifiers, multi-class classifiers, and regression models.) An autoencoder learns to predict its own input. Autoencoders can be used for 1.) dimensionality reduction, which is sort of like data compression, 2.) anomaly detection, 3.) denoising data, or 4.) converting mixed-type data into purely numeric data so the data can be processed by numeric-only algorithms such as k-means clustering.

For anomaly detection, the basic idea is to train an autoencoder to predict its own input values, then use the trained model to find the item(s) that have the largest reconstruction error. For example, suppose you have employee data like (sex, age, income) where a 32-year-old male employee who makes $55,000.00 is normalized and encoded as (-1, 0.32, 0.55). If you feed this input to the trained autoencoder, it should spit back a result very close to the same three input values. Suppose you get back (-0.90, 0.40, 0.60). Then the squared error for that item is the sum of the three squared differences: 0.0100 + 0.0064 + 0.0025 = 0.0189.
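The arithmetic in this example can be checked with a few lines of plain Python. This is just a minimal sketch: the function name `squared_error` is my own and is not part of the demo program, and the output values are the hypothetical ones from the example, not results from a trained model.

```python
def squared_error(x, y):
    """Sum of squared differences between an input and its reconstruction."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

x = (-1.0, 0.32, 0.55)   # normalized and encoded (sex, age, income)
y = (-0.90, 0.40, 0.60)  # hypothetical autoencoder output

err = squared_error(x, y)
print("squared error = %0.4f" % err)  # squared error = 0.0189
```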

If you analyze every data item and find the one with the largest reconstruction error, it is likely that the item you found is anomalous in some way, compared to the other items.
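The search for the worst item can be sketched without any neural network at all, assuming you already have each input and its reconstruction. The names `mean_squared_error` and `most_anomalous` are hypothetical helpers, and both lists below are made-up values for illustration only.

```python
def mean_squared_error(x, y):
    """Squared error averaged over the number of features."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

def most_anomalous(items, reconstructions):
    """Return (index, error) of the item with the largest error."""
    errors = [mean_squared_error(x, y)
              for x, y in zip(items, reconstructions)]
    worst = max(range(len(errors)), key=lambda i: errors[i])
    return worst, errors[worst]

# made-up inputs and made-up reconstructions, for illustration only
items = [(-1.0, 0.32, 0.55), (1.0, 0.45, 0.62), (1.0, 0.19, 0.88)]
recons = [(-0.90, 0.40, 0.60), (0.95, 0.47, 0.60), (0.40, 0.35, 0.50)]

idx, err = most_anomalous(items, recons)
print("most anomalous item: index %d, error %0.4f" % (idx, err))
```

Dividing by the number of features doesn't change which item is worst. It only makes the error values easier to compare across datasets that have different numbers of columns.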

Even though autoencoders are probably the simplest form of the four basic neural network types, there are still several ways to go wrong. For instance, several of the autoencoder examples I saw on the Internet applied ReLU activation to the final decoder layer. Because ReLU returns only non-negative values, ReLU isn’t a good choice if any of the input values (and therefore desired output values) can be negative, such as encoding sex as -1 or +1.

Note: In cases where all input is scaled to between 0 and 1, you could apply sigmoid() activation on the output nodes. Or if all input is scaled to between -1 and +1, you could apply tanh() activation on the output nodes. I don’t know of any research on this topic, and the few experiments I’ve performed have not had conclusive results.
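To make the range argument concrete, here is a small plain-Python sketch that compares the output ranges of the three activation functions. The pre-activation values are made up, and `relu` and `sigmoid` are hand-written stand-ins rather than the PyTorch functions.

```python
import math

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# made-up pre-activation values from a final decoder layer
pre_activations = [-5.0, -1.0, 0.0, 1.0, 5.0]

for name, fn in (("relu", relu), ("sigmoid", sigmoid), ("tanh", math.tanh)):
    outs = [fn(v) for v in pre_activations]
    print("%-8s min = %7.4f  max = %7.4f" % (name, min(outs), max(outs)))

# with ReLU on the output, a target of -1 can never be matched:
# the output is at least 0, so the squared error is at least 1
target = -1.0
best_relu_err = min((target - relu(v)) ** 2 for v in pre_activations)
print("smallest squared error for target -1 with ReLU = %0.1f" % best_relu_err)
```

The sigmoid outputs stay strictly between 0 and 1 and the tanh outputs stay strictly between -1 and +1, which is why each one only makes sense when the inputs are scaled to the matching range.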

Autoencoders. Good fun.

*You’d think that it would be difficult to go wrong when designing an album cover for saxophone music. But I think it’s fair to say that these three examples are anomalies of good cover design.*

    # employee_auto.py
    # autoencoder reconstruction error
    # PyTorch 1.6.0-CPU Anaconda3-2020.02  Python 3.7.6
    # Windows 10

    import numpy as np
    import torch as T

    device = T.device("cpu")

    # -----------------------------------------------------------

    class EmployeeDataset(T.utils.data.Dataset):
        def __init__(self, src_file, num_rows=None):
            # sex  age   city     income  job
            # -1   0.27  0  1  0  0.7610  0  0  1
            # +1   0.19  0  0  1  0.6550  0  1  0
            # city: anaheim, boulder, concord
            # job: mgmt, supp, tech
            tmp_x = np.loadtxt(src_file, max_rows=num_rows,
                usecols=range(0,9), delimiter="\t", skiprows=0,
                dtype=np.float32)
            self.x_data = T.tensor(tmp_x, dtype=T.float32)

        def __len__(self):
            return len(self.x_data)

        def __getitem__(self, idx):
            preds = self.x_data[idx]
            sample = { 'predictors' : preds }
            return sample

    # -----------------------------------------------------------

    class Net(T.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.enc1 = T.nn.Linear(9, 4)  # 9-4-2-4-9
            self.enc2 = T.nn.Linear(4, 2)
            self.dec1 = T.nn.Linear(2, 4)
            self.dec2 = T.nn.Linear(4, 9)

            T.nn.init.xavier_uniform_(self.enc1.weight)
            T.nn.init.zeros_(self.enc1.bias)
            T.nn.init.xavier_uniform_(self.enc2.weight)
            T.nn.init.zeros_(self.enc2.bias)
            T.nn.init.xavier_uniform_(self.dec1.weight)
            T.nn.init.zeros_(self.dec1.bias)
            T.nn.init.xavier_uniform_(self.dec2.weight)
            T.nn.init.zeros_(self.dec2.bias)

        def forward(self, x):
            z = T.tanh(self.enc1(x))
            z = T.tanh(self.enc2(z))
            z = T.tanh(self.dec1(z))
            z = self.dec2(z)  # no activation
            return z

    # -----------------------------------------------------------

    def analyze_error(model, ds):
        largest_err = 0.0
        worst_x = None
        worst_y = None
        n_features = len(ds[0]['predictors'])

        for i in range(len(ds)):
            X = ds[i]['predictors']
            with T.no_grad():
                Y = model(X)  # should be same as X
            err = T.sum((X-Y)*(X-Y)).item()  # SSE all features
            err = err / n_features           # sort of norm'ed SSE

            if err > largest_err:
                largest_err = err
                worst_x = X
                worst_y = Y

        print("Largest error found: %0.4f" % largest_err)
        print("Worst actual X = " + str(worst_x))
        print("Worst computed Y = " + str(worst_y))

    # -----------------------------------------------------------

    def main():
        # 0. get started
        print("\nBegin Employee autoencoder demo \n")
        T.manual_seed(1)
        np.random.seed(1)

        # 1. create DataLoader objects
        print("Creating Employee Dataset ")
        train_file = ".\\Data\\employee_all.txt"
        train_ds = EmployeeDataset(train_file)  # all 240 rows

        bat_size = 10
        train_ldr = T.utils.data.DataLoader(train_ds,
            batch_size=bat_size, shuffle=True)

        # 2. create network
        net = Net().to(device)

        # 3. train autoencoder model
        max_epochs = 1000
        ep_log_interval = 100
        lrn_rate = 0.005

        loss_func = T.nn.MSELoss()
        optimizer = T.optim.Adam(net.parameters(), lr=lrn_rate)

        print("\nbat_size = %3d " % bat_size)
        print("loss = " + str(loss_func))
        print("optimizer = Adam")
        print("max_epochs = %3d " % max_epochs)
        print("lrn_rate = %0.3f " % lrn_rate)

        print("\nStarting training")
        net.train()
        for epoch in range(0, max_epochs):
            epoch_loss = 0  # for one full epoch

            for (batch_idx, batch) in enumerate(train_ldr):
                X = batch['predictors']
                Y = batch['predictors']

                optimizer.zero_grad()
                oupt = net(X)
                loss_obj = loss_func(oupt, Y)  # a tensor
                epoch_loss += loss_obj.item()  # accumulate
                loss_obj.backward()
                optimizer.step()

            if epoch % ep_log_interval == 0:
                print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))
        print("Done ")

        # 4. find item with largest reconstruction error
        print("\nAnalyzing data for largest reconstruction error \n")
        net = net.eval()
        analyze_error(net, train_ds)

        print("\nEnd Employee autoencoder demo")

    if __name__ == "__main__":
        main()

Autoencoders can be so impressive.

The de-noise example blew my mind the first time:

1. Take two copies of a picture: one as the target and one to which you add a lot of noise.

2. Let the autoencoder train, watch what happens, and compare the original, the noisy image, and the autoencoder result (I did that with popcorn for a long time).

Another crazy thing is to do the opposite of anomaly detection: take the example with the lowest error and make it the target for its class. Training this way produces beauties (or sometimes an anomaly). Stunning!

Did you ever try an autoencoder for Zoltar?

No I never applied an autoencoder to Zoltar — either the input data (game results) or the output data (game predictions). A very interesting idea.

Really useful post! Many thanks. Where can we find the data, please? I think it first appears in the “Regression Using PyTorch” post. How do you preprocess/normalise it? For example, why is income divided by 100K?

Thanks!

I generated the data randomly but after running my demo I didn’t save the data. I made 240 rows of data. The gender is random 50% male, 50% female. The ages are random between 18 and 68. The city is random Anaheim, Boulder or Concord. The income is random between 23,000 and 89,000. The job is random mgmt, supp, or tech. I preprocessed the data manually in an Excel spreadsheet. Ages were divided by 100 and incomes were divided by 100,000 so that all numeric values are between 0.0 and 1.0. This makes it so that during training large values like incomes don’t overwhelm small values like age.
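Since the original file is gone, here is a rough sketch of how similar data could be regenerated in Python instead of Excel. Several things here are assumptions on my part: the helper names (`one_hot`, `make_row`, `make_data`), the seed, drawing ages and incomes uniformly as whole numbers, and +1 meaning female (only -1 for male is stated in the post). The column order follows the comments in the demo's Dataset class.

```python
import random

CITIES = ["anaheim", "boulder", "concord"]
JOBS = ["mgmt", "supp", "tech"]

def one_hot(value, categories):
    """Encode value as a list of 0s with a single 1."""
    return [1 if value == c else 0 for c in categories]

def make_row(rng):
    sex = rng.choice([-1, 1])                       # -1 = male, +1 assumed female
    age = rng.randint(18, 68) / 100.0               # divide by 100
    city = one_hot(rng.choice(CITIES), CITIES)
    income = rng.randint(23000, 89000) / 100000.0   # divide by 100,000
    job = one_hot(rng.choice(JOBS), JOBS)
    return [sex, age] + city + [income] + job       # 9 values per row

def make_data(n_rows=240, seed=0):
    rng = random.Random(seed)  # the seed is arbitrary
    return [make_row(rng) for _ in range(n_rows)]

rows = make_data()
for row in rows[:3]:
    print("\t".join(str(v) for v in row))  # tab-delimited, like the demo file
```

The rows this produces will not match the original 240 rows, so the item with the largest reconstruction error will be different too.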

Ok, many thanks! Your blog is very interesting and your examples are very clear