When a PyTorch neural network is instantiated, it’s common practice to use implicit weight and bias initialization. In the case of a Linear layer, the PyTorch documentation is not clear, and the source code is surprisingly complicated.

I spent several hours experimenting with Linear initialization, and after a lot of work I was able to implement a demo program where explicit weight and bias initialization code produces values identical to those produced by the default implicit mechanism. For Linear layers, PyTorch uses what is called the Kaiming (aka He) algorithm.

*Note: Kaiming/He initialization is closely related to, but different from, Xavier/Glorot initialization.*
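For context, the two schemes differ mainly in which fan values they use: Kaiming/He uniform looks only at fan-in, while Xavier/Glorot uniform uses both fan-in and fan-out. Here is a rough sketch of the two bound formulas as I understand them (the helper functions are my own, not part of PyTorch), applied to the demo's Linear(3, 4) hidden layer:

```python
import math

def kaiming_uniform_bound(fan_in, a=0.0):
  # He/Kaiming uniform: sample from U(-b, b) where
  # b = gain * sqrt(3 / fan_in), gain = sqrt(2 / (1 + a^2))
  gain = math.sqrt(2.0 / (1.0 + a * a))
  return gain * math.sqrt(3.0 / fan_in)

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
  # Glorot/Xavier uniform: b = gain * sqrt(6 / (fan_in + fan_out))
  return gain * math.sqrt(6.0 / (fan_in + fan_out))

# for a Linear(3, 4) layer: fan_in = 3, fan_out = 4
print(kaiming_uniform_bound(3))    # depends only on fan-in
print(xavier_uniform_bound(3, 4))  # depends on fan-in and fan-out
```

Notice that only the Xavier bound changes if the number of output nodes changes.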

The demo network is 3-4-2 (3 input, 4 hidden, 2 output):

```python
import math
import torch as T
device = T.device('cpu')

class Net(T.nn.Module):
  def __init__(self, init_type):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(3, 4)  # 3-4-2
    self.oupt = T.nn.Linear(4, 2)

    if init_type == 'default':
      T.manual_seed(1)  # now calls kaiming_uniform_()
    elif init_type == 'explicit':
      T.manual_seed(1)
      T.nn.init.kaiming_uniform_(self.hid1.weight,
        a=math.sqrt(5.0))
      bound = 1 / math.sqrt(3.0)  # fan-in = 3
      with T.no_grad():
        self.hid1.bias.uniform_(-bound, bound)
      T.nn.init.kaiming_uniform_(self.oupt.weight,
        a=math.sqrt(5.0))
      bound = 1 / math.sqrt(4.0)  # fan-in = 4
      with T.no_grad():
        self.oupt.bias.uniform_(-bound, bound)
  . . .
```

So creating a neural network that uses the default implicit initialization is:

net1 = Net('default').to(device)

And using explicit initialization is:

net2 = Net('explicit').to(device)

In either case, the initial values of the weights and biases are the same.

The code is short but very tricky. A big stumbling block was positioning the T.manual_seed(1) calls. Because the torch random number generator is invoked several times in sequence, the only way to demonstrate identical results is to reset the seed at the same point in each code path.
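The underlying idea can be illustrated with the standard-library random module (used here as a stand-in; the torch generator behaves the same way in this respect): each draw advances the generator's state, so reproducing earlier values requires resetting the seed first.

```python
import random

random.seed(1)
first = [random.random() for _ in range(4)]  # consumes four draws

# the generator has now advanced; more draws give new values
more = [random.random() for _ in range(4)]

random.seed(1)  # reset to the original state
again = [random.random() for _ in range(4)]

print(first == again)  # True: same seed, same position in the stream
print(first == more)   # False: the stream had already advanced
```

This is why both the 'default' and 'explicit' branches in the demo reset the seed immediately after the Linear layers are constructed.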

Another pitfall was the mysterious math.sqrt(5) value for the equally mysterious "a" parameter of kaiming_uniform_(). And yet another pitfall was the tensor.uniform_() function, which is used to initialize the biases.
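If I'm reading the formulas correctly, plugging a = sqrt(5) into the Kaiming uniform bound explains the magic value: the gain becomes sqrt(2 / (1 + 5)) = sqrt(1/3), so the weight bound collapses to exactly 1 / sqrt(fan_in), the same bound the demo uses for the biases. A quick sketch (the helper function is my own, not PyTorch code):

```python
import math

def kaiming_weight_bound(fan_in, a):
  # kaiming_uniform_ samples from U(-b, b) where
  # b = gain * sqrt(3 / fan_in), gain = sqrt(2 / (1 + a^2))
  gain = math.sqrt(2.0 / (1.0 + a * a))
  return gain * math.sqrt(3.0 / fan_in)

for fan_in in (3, 4):  # the demo's hid1 and oupt fan-in values
  w = kaiming_weight_bound(fan_in, math.sqrt(5.0))
  b = 1.0 / math.sqrt(fan_in)  # bias bound used in the demo
  print(fan_in, w, b)          # the two bounds coincide
```

In other words, with a = math.sqrt(5) the "Kaiming" weight initialization reduces to a plain uniform distribution over (-1/sqrt(fan_in), +1/sqrt(fan_in)).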

Note: This blog post is essentially a follow-up to an earlier blog post where I made a careless mistake: https://jamesmccaffrey.wordpress.com/2022/01/06/pytorch-explicit-vs-implicit-weight-and-bias-initialization/.

I like PyTorch a lot, but the weight and bias initialization code and documentation is a weak point. By the way, Kaiming initialization was devised specifically for very deep convolutional networks. I’m not sure why the designers of PyTorch decided to use Kaiming as the default for Linear layers.

*In my earlier blog post, my mistake was a simple misspelling. Misspellings can happen in sports too. Left: From a college football game at University of Southern California. The USC slogan is “Fight On!” (I did my grad work at USC). Right: From an NFL professional football game for the New York Jets. I’m not kidding.*

Demo code:

```python
# explore_nn_init.py
# weight and bias init investigation
# PyTorch 1.12.1+cpu  Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

# -----------------------------------------------------------
# 1. possible to get same results as init.kaiming_uniform_ ?
# -----------------------------------------------------------

import math
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self, init_type):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(3, 4)  # 3-4-2
    self.oupt = T.nn.Linear(4, 2)

    if init_type == 'default':
      T.manual_seed(1)  # now calls kaiming_uniform_()
    elif init_type == 'explicit':
      T.manual_seed(1)
      T.nn.init.kaiming_uniform_(self.hid1.weight,
        a=math.sqrt(5.0))
      bound = 1 / math.sqrt(3.0)  # fan-in = 3
      with T.no_grad():
        self.hid1.bias.uniform_(-bound, bound)
      T.nn.init.kaiming_uniform_(self.oupt.weight,
        a=math.sqrt(5.0))
      bound = 1 / math.sqrt(4.0)  # fan-in = 4
      with T.no_grad():
        self.oupt.bias.uniform_(-bound, bound)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = self.oupt(z)  # no activation: CrossEntropyLoss()
    return z

# -----------------------------------------------------------

def main():
  print("\nBegin Linear layer init demo ")
  T.manual_seed(1)

  print("\n==================== ")
  print("\nCreating a 3-4-2 network default init ")
  net1 = Net('default').to(device)
  print("\nhid1 wts and biases: ")
  print(net1.hid1.weight.data)
  print(net1.hid1.bias.data)
  print("\noupt wts and biases: ")
  print(net1.oupt.weight.data)
  print(net1.oupt.bias.data)

  print("\n==================== ")
  print("\nCreating a 3-4-2 network explicit init ")
  net2 = Net('explicit').to(device)
  print("\nhid1 wts and biases: ")
  print(net2.hid1.weight.data)
  print(net2.hid1.bias.data)
  print("\noupt wts and biases: ")
  print(net2.oupt.weight.data)
  print(net2.oupt.bias.data)

  print("\n==================== ")
  print("\nEnd initialization demo ")

if __name__ == "__main__":
  main()
```
