It’s very difficult, but fun, to keep up with all the new ideas in machine learning. I was recently alerted to the new swish() activation function for neural networks. My thanks to fellow ML enthusiast Thorsten Kleppe for pointing swish() out to me when he mentioned the similarity between swish() and gelu() in a Comment to an earlier post. I don’t know Thorsten personally, but he seems like a very bright and creative guy.
In the early days of NNs, logistic sigmoid() was the most common activation function. Then came tanh(). Then relu() was found to work better for deep neural networks. Many variations of relu() followed, but none were consistently better, so relu() has been used as a de facto default since about 2015. The swish() function was devised in 2017.
I made this graph of sigmoid(), swish(), and relu() using Excel.
It’s sort of a cross between logistic sigmoid() and relu(). The three related activation functions are:
sigmoid(x) = 1.0 / (1.0 + exp(-x))
relu(x) = 0.0 if x is less than 0.0, x if x is greater than or equal to 0.0
swish(x) = x * (1.0 / (1.0 + exp(-x))) = x * sigmoid(x)
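The three definitions above can be written as plain-Python functions. This is just a sketch to show the relationships between the functions, not production code:

```python
import math

def sigmoid(x):
    # logistic sigmoid: squashes x into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # rectified linear unit: passes x through if non-negative
    return x if x >= 0.0 else 0.0

def swish(x):
    # swish is x weighted by its own sigmoid
    return x * sigmoid(x)

print(sigmoid(0.0))  # 0.5
print(relu(-3.0))    # 0.0
print(swish(0.0))    # 0.0
```

Notice that for large positive x, sigmoid(x) approaches 1.0, so swish(x) approaches x and behaves like relu(); for large negative x, swish(x) approaches 0.0.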
The Wikipedia entry on swish() points out that swish() is sometimes called sil() or silu(), which stands for sigmoid-weighted linear unit. At the time I'm writing this blog post, Keras and TensorFlow have a built-in swish() function (released about 10 weeks ago), but the PyTorch library does not have a swish() function. However, it's trivial to implement inside a PyTorch neural network class, for example:
import torch as T  # PyTorch

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, 10)  # 6-(10-10)-3
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 3)

  def swish(self, x):
    return x * T.sigmoid(x)

  def forward(self, x):
    # z = T.tanh(self.hid1(x))  # replace tanh() w/ swish()
    # z = T.tanh(self.hid2(z))
    z = self.swish(self.hid1(x))
    z = self.swish(self.hid2(z))
    z = self.oupt(z)  # no softmax for multi-class
    return z
Update: I just discovered that PyTorch 1.7 does have a built-in swish() function. It is called SiLU().
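A quick way to confirm that the built-in SiLU() is the same function as hand-coded swish() is to compare the two on the same tensor. This is a minimal sketch assuming PyTorch 1.7 or later is installed:

```python
import torch as T  # PyTorch 1.7+

silu = T.nn.SiLU()  # built-in swish(), named SiLU
x = T.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

built_in = silu(x)
by_hand = x * T.sigmoid(x)  # hand-coded swish()

print(T.allclose(built_in, by_hand))  # True
```

So an existing network that uses a hand-coded swish() can swap in T.nn.SiLU() (or T.nn.functional.silu()) with no change in behavior.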
The fact that PyTorch didn't have a built-in swish() function for so long is interesting. Adding such a trivial function just bloats a large library even further. But if swish() had been in PyTorch I would have discovered it earlier. So, adding what are essentially unnecessary functions to PyTorch can have a minor upside.
The demo run on the left uses tanh() activation with a learning rate of 0.01. The demo run on the right uses swish() activation [I call it sil(), for "sigmoid-weighted linear unit"] with a learning rate of 0.02. The results are very similar.
I took an existing 6-(10-10)-3 classifier I had, which used tanh() on the two hidden layers, and replaced tanh() with swish(). This is sort of a "shallow deep NN". Compared to the NN with tanh() and a learning rate of 0.01, the swish() version learned a bit slower. But when I used a learning rate of 0.02 with swish(), I got essentially the same results. So, swish() worked fine, and I believe the research claims that swish() is superior to relu() and tanh() for very deep NNs.
The field of machine learning is very exciting. There are significant new developments, such as the use of the swish() activation function, being discovered all the time.
The swish() activation function is named for its shape. In science fiction movies, a colored hair swish is usually associated with a character who is ambiguous in some way. From left to right: Two fabricants (clones) from "Cloud Atlas" (2012). Yukio (played by actress Shiori Kutsuna), a female ninja, from "Deadpool 2" (2018). Michelle (played by actress Bai Ling), an assassin with a heart of gold, from "The Gene Generation" (2007). Psylocke (played by actress Mei Melancon), a mutant who possesses psionic powers, from "X-Men: The Last Stand" (2006).