PyTorch Explicit vs. Implicit Weight and Bias Initialization

Sometimes library code is too helpful. In particular, I don’t like library code that uses default mechanisms. One example is PyTorch library weight and bias initialization. Consider this PyTorch neural network definition:

import torch as T
device = T.device("cpu")

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(3, 4)  # 3-(4-5)-2
    self.hid2 = T.nn.Linear(4, 5)
    self.oupt = T.nn.Linear(5, 2)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = self.oupt(z)
    return z

. . .

net = Net().to(device)

The code defines a 3-(4-5)-2 neural network. But how are the weights and bias values initialized? If you don’t explicitly specify weight and bias initialization code, PyTorch will use default code.


Left: A 3-(4-5)-2 neural network with default weight and bias initialization. Right: The same network but with explicit weight and bias initialization gives identical values.

I don’t like invisible default code. Default code can change over time — and usually does. This makes program runs non-reproducible. As it turns out, for Linear() layers, PyTorch uses fairly complicated default weight and bias initialization. I went to the initialization source code at C:\Users\(user)\Anaconda3\Lib\site-packages\torch\nn\modules\linear.py and saw default initialization is kaiming_uniform() for weights and uniform() for biases, but with some tricky parameters.
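
For reference, the relevant library code (paraphrased here from the PyTorch 1.10 source; check your local copy of linear.py because details can change between versions) looks roughly like this:

# inside class Linear in torch\nn\modules\linear.py (paraphrased)
def reset_parameters(self) -> None:
  # weights: Kaiming (He) uniform with a "magic" a = sqrt(5) parameter
  init.kaiming_uniform_(self.weight, a=math.sqrt(5))
  if self.bias is not None:
    # biases: uniform in (-1/sqrt(fan_in), +1/sqrt(fan_in))
    fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
    bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
    init.uniform_(self.bias, -bound, bound)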

I copy/pasted the library code into the __init__ method and got code that produces the exact same initial weights and biases but is explicit:

import math
import torch as T
device = T.device("cpu")

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(3, 4)  # 3-(4-5)-2
    self.hid2 = T.nn.Linear(4, 5)
    self.oupt = T.nn.Linear(5, 2)

    T.nn.init.kaiming_uniform_(self.hid1.weight,
      a=math.sqrt(5.0))
    bound = 1 / math.sqrt(3)
    T.nn.init.uniform_(self.hid1.bias, -bound, bound)

    T.nn.init.kaiming_uniform_(self.hid2.weight, 
      a=math.sqrt(5.0))
    bound = 1 / math.sqrt(4)
    T.nn.init.uniform_(self.hid2.bias, -bound, bound)

    T.nn.init.kaiming_uniform_(self.oupt.weight, 
      a=math.sqrt(5.0))
    bound = 1 / math.sqrt(5)
    T.nn.init.uniform_(self.oupt.bias, -bound, bound)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = self.oupt(z)
    return z

. . .

net = Net().to(device)

The sqrt(5.0) is a magic parameter for kaiming_uniform_(). In the sqrt(3), sqrt(4), sqrt(5) terms for the biases, the 3, 4, 5 are the “fan_in” values for each layer, that is, the number of inputs to the layer.
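
If you work through the math, kaiming_uniform_() with a = sqrt(5.0) also reduces to a bound of 1/sqrt(fan_in) for the weights, so the weights and biases end up sampled from the same range. A quick way to convince yourself that the explicit calls reproduce the default scheme exactly is to initialize a single layer both ways from the same seed (a minimal sketch; note the re-seed before the explicit calls, because the Linear() constructor has already consumed values from the random stream):

import math
import torch as T

T.manual_seed(1)
lin_default = T.nn.Linear(3, 4)    # default init runs in the constructor

lin_explicit = T.nn.Linear(3, 4)   # throw-away default init
T.manual_seed(1)                   # rewind the RNG to the same state
T.nn.init.kaiming_uniform_(lin_explicit.weight, a=math.sqrt(5.0))
bound = 1 / math.sqrt(3)           # fan_in = 3
T.nn.init.uniform_(lin_explicit.bias, -bound, bound)

print(T.equal(lin_default.weight, lin_explicit.weight))  # expect True
print(T.equal(lin_default.bias, lin_explicit.bias))      # expect True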

The downside to explicit weight and bias initialization is more code. But in non-demo production scenarios, it’s almost always better to use explicit code rather than rely on implicit default code that can lead to non-reproducibility.



The goal of photorealistic art is to create an explicit representation of reality. The Art Deco movement of the 1920s and 1930s used implicit representations of reality. From left to right: Georges Lepape, Erté, Tamara de Lempicka.


Demo code.

# layer_default_init.py
# see C:\Users\(user)\Anaconda3\Lib\site-packages
#   \torch\nn\modules\linear.py

# PyTorch 1.10.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import math
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

class Net(T.nn.Module):
  def __init__(self, init_type):
    super(Net, self).__init__()
    # T.manual_seed(1)
    self.hid1 = T.nn.Linear(3, 4)  # 3-(4-5)-2
    self.hid2 = T.nn.Linear(4, 5)
    self.oupt = T.nn.Linear(5, 2)

    if init_type == 'default':
      pass
    elif init_type == 'explicit':
      T.nn.init.kaiming_uniform_(self.hid1.weight, 
        a=math.sqrt(5.0))
      bound = 1 / math.sqrt(3.0)
      T.nn.init.uniform_(self.hid1.bias, -bound, bound)

      T.nn.init.kaiming_uniform_(self.hid2.weight, 
        a=math.sqrt(5.0))
      bound = 1 / math.sqrt(4.0)
      T.nn.init.uniform_(self.hid2.bias, -bound, bound)

      T.nn.init.kaiming_uniform_(self.oupt.weight, 
        a=math.sqrt(5.0))
      bound = 1 / math.sqrt(5.0)
      T.nn.init.uniform_(self.oupt.bias, -bound, bound)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = self.oupt(z)  # CrossEntropyLoss() 
    return z

def main():
  print("\nBegin ")
  T.manual_seed(1)

  # print("\nCreating a 3-(4-5)-2 network default init ")
  # net = Net('default').to(device)

  print("\nCreating a 3-(4-5)-2 network explicit init ")
  net = Net('explicit').to(device)

  print("\nhid1 wts and biases: ")
  print(net.hid1.weight.data)
  print(net.hid1.bias.data)

  print("\nhid2 wts and biases: ")
  print(net.hid2.weight.data)
  print(net.hid2.bias.data)


  print("\noupt wts and biases: ")
  print(net.oupt.weight.data)
  print(net.oupt.bias.data)

  print("\nEnd ")

if __name__ == "__main__":
  main()

3 Responses to PyTorch Explicit vs. Implicit Weight and Bias Initialization

  1. Thorsten Kleppe says:

    You saw right through this one, an extremely interesting example. Bias initialization seems a bit risky in combination with ReLU activation: some neurons might never become active at all. The TensorFlow Playground avoids this problem by using a default initialization of 0.1 for the bias (see nn.ts, line 25, on GitHub). It’s impressive how well you can make these difficult examples transparent.

    Have you seen this paper?
    “Classification of Imbalanced Data Using Deep Learning with Adding Noise”

    Best regards

  2. At some point in the past (I don’t remember when), one of the major NN frameworks (I think it was Keras, but it might have been PyTorch) initialized bias values to 0.0 by default. As you mention, this is a tiny bit risky. In practice it wasn’t a problem, but now all the NN frameworks I know of avoid a default value of 0.0 for biases.

    I hadn’t seen the paper you mention. I looked it over briefly. Dealing with imbalanced data has always been a problem. I’m not sure exactly why adding noise helps, but I didn’t have time to read the article carefully — I have been swamped with work for the past few months.

    Happy New Year, Thorsten. JM

    • Thorsten Kleppe says:

      From time to time you also read about using an initial guess for the bias. That sounds useful for regression problems, for example estimating house prices, where you might set the output bias somewhere near the mean target value.

      The idea from the paper is probably to keep any single class from being overwhelmed. The basic idea is simply to add noise to the outputs before the softmax activation. The equation the paper recommends is c = ND × e^σ + m, where m is the output vector before softmax and ND is the standard normal distribution. Unfortunately, the paper is very new; I couldn’t find any further information, and again there are more questions than answers.
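
      In rough PyTorch-style code, my reading of that equation (sigma as a hand-picked noise scale, and the tiny model and batch below are just stand-ins I made up) would be something like:

      import math
      import torch as T

      net = T.nn.Linear(4, 3)            # stand-in model: 4 inputs, 3 classes
      batch_x = T.randn(8, 4)            # dummy batch of 8 items
      batch_y = T.randint(0, 3, (8,))    # dummy class labels
      sigma = 0.1                        # assumed noise-scale hyperparameter
      m = net(batch_x)                   # raw outputs (logits) before softmax
      noise = T.randn_like(m)            # ND: standard normal, same shape as m
      c = noise * math.exp(sigma) + m    # c = ND * e^sigma + m
      loss = T.nn.CrossEntropyLoss()(c, batch_y)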

      I also use the technique itself, but so far only with the aim of increasing the accuracy of a trained model. For that I take the most recent error cases as noise on the outputs for each class. My quick-and-dirty tool against imbalanced data was the per-class recall (TP / (TP + FN)), which I used to balance the training.

      The idea of making a network a lifelong learner that keeps increasing loss and accuracy so it can keep learning seems very exciting. A technique that automatically regulates a network against imbalanced data and also brings a bit more accuracy would be something very neat.

      Forgive my bad manners; sometimes it seems like what you do just happens naturally.
      Nicole and I also wish you a Happy New Year 2022, James.

      Btw, is 2022 the year of reassessment networks?
