The GELU Activation function

The current state-of-the-art neural architecture for natural language processing is the Transformer. While I was looking at the PyTorch implementation of Transformer functions, I noticed that one of the options for the activation function in various modules is “gelu”. I had heard of GELU before but didn’t know much about it, so I did a bit of research.

The GELU (“Gaussian Error Linear Unit”) function is sort of a modification of the ReLU (“Rectified Linear Unit”) function. Cutting to the chase, GELU is defined as gelu(x) = x * Phi(x), where Phi is the cumulative distribution function of the standard Gaussian. In practice it’s usually computed with the tanh approximation gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
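
In code, the exact definition and the tanh approximation look like this (a minimal sketch using only Python’s standard math module; the function names are mine):

import math

def gelu_exact(x):
    # gelu(x) = x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # the tanh approximation used in the Excel graph below and in many frameworks
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, gelu_exact(x), gelu_tanh(x))   # the two versions agree to several decimals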

I was best able to understand GELU by looking at a graph of it that I made using Excel:

I put the first x value of -4.0 in cell B3. The formula I used for GELU in cell D3 was =0.5*B3*(1+(TANH(SQRT(2/PI())*(B3*B3*B3*0.044715+B3)))), and then I copied it down the column.
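
To double-check the spreadsheet formula, here is a small sketch that evaluates the same expression with PyTorch tensors and compares it to the library’s built-in gelu(). The 0.5 step between x values is my assumption (the post only says the first x value is -4.0), and the approximate="tanh" option is only available in fairly recent PyTorch versions.

import math
import torch
import torch.nn.functional as F

x = torch.arange(-4.0, 4.01, 0.5)   # x values like the spreadsheet column (0.5 step assumed)
spreadsheet = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
builtin = F.gelu(x, approximate="tanh")
print(torch.max(torch.abs(spreadsheet - builtin)))   # maximum difference should be ~0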

For x values less than about -3.0, gelu(x) is close to 0. For x values between -1.0 and 0.0, gelu(x) is slightly negative, dipping to about -0.17. For x = 0.0, the value of gelu(x) is 0.0. And for positive values of x, gelu(x) is approximately x.

Compared to ReLU or leaky ReLU, GELU has the theoretical advantage of being differentiable for all values of x, but has the in-practice disadvantage of being much, much more complex to compute.
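
For what it’s worth, the derivative of GELU has a simple closed form, gelu'(x) = Phi(x) + x * phi(x), where phi is the standard normal density, so it is defined for every x (unlike ReLU’s kink at x = 0). A quick sketch that checks the analytic derivative against a central finite difference:

import math

def gelu(x):
    # exact GELU: x * Phi(x)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # analytic derivative: Phi(x) + x * phi(x), defined for every x
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return Phi + x * phi

h = 1e-5
for x in (-2.0, 0.0, 2.0):
    numeric = (gelu(x + h) - gelu(x - h)) / (2.0 * h)
    print(x, gelu_grad(x), numeric)   # the two values should agree closely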

I did a little experiment. I took an existing PyTorch 6-(10-10)-3 neural network classifier that used tanh() activation on the two hidden layers, and ran it. Then I replaced the tanh() activation with gelu() activation and did a second run. The results were pretty much the same, as I expected them to be.

The demo run in the shell on the left used tanh() on both hidden layers. The demo run in the shell on the right used gelu() activation. The loss and accuracy results are essentially the same for this relatively simple neural network, but interestingly, the prediction on a dummy set of input values was much different.
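
Here is a rough sketch of what the two network variants might look like. The 6-(10-10)-3 layer sizes come from the experiment above; the class name, the switchable activation, and the dummy input are my own assumptions, and the training loop is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    # a 6-(10-10)-3 classifier with a choice of hidden-layer activation
    def __init__(self, activation="tanh"):
        super().__init__()
        self.hid1 = nn.Linear(6, 10)
        self.hid2 = nn.Linear(10, 10)
        self.oupt = nn.Linear(10, 3)
        self.act = torch.tanh if activation == "tanh" else F.gelu

    def forward(self, x):
        z = self.act(self.hid1(x))
        z = self.act(self.hid2(z))
        return self.oupt(z)      # raw logits; pair with CrossEntropyLoss for training

dummy = torch.randn(1, 6)        # a dummy input item with 6 features
print(Net("tanh")(dummy))
print(Net("gelu")(dummy))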

The GELU activation function is rather exotic and my hunch is that it’s only useful for complex neural architectures like Transformers.


One of the reasons for the world-wide success of the James Bond novels and movies is that they feature exotic places and events. Here are four of the more than 20 covers for “You Only Live Twice”. The cover on the left is from the first edition, published in the U.K. in March, 1964. It was the 12th and final Bond book by Ian Fleming (May 1908 – August 1964).


2 Responses to The GELU Activation function

  1. Thorsten Kleppe says:

    My first thought was that you were talking about the Swish function (GELU == Swish), right?

    Here is the Swish function: swish(x) = x * sigmoid(a * x).

    And its derivative: swish'(x) = sigmoid(a * x) + a * x * sigmoid(a * x) * (1 - sigmoid(a * x)).

    There are other very interesting comments on reddit about the relationship between these two activation functions.

    DanielHendrycks
    “I should say the function space of x*sigmoid(a*x) and x*Phi(a*x) is approximately the same. Generally nonlinearities with learnable hyperparameters can beat those without hyperparameters, but there is an added risk of overfitting.”

    The next recipe could be to take 80% ReLU and 20% GELU and mix them up.
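
    As a quick numeric sketch of that relationship (the constant a = 1.702 is the value usually quoted for making x * sigmoid(a * x) track GELU; the helper names are mine):

    import math

    def swish(x, a=1.0):
        return x / (1.0 + math.exp(-a * x))    # x * sigmoid(a*x)

    def gelu(x):
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))   # x * Phi(x)

    # with a = 1.702 the sigmoid form stays close to GELU
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, round(gelu(x), 4), round(swish(x, 1.702), 4))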

  2. Pingback: GELU (Gaussian Error Linear Unit) [Activation Function] | CVML Expert Guide
