Using the Simplest Possible Transformer Sequence-to-Sequence Example

I’ve been exploring PyTorch Transformer architecture models for sequence-to-sequence problems for several months. Transformer architecture (TA) systems are among the most complicated software systems I’ve ever worked with.

I recently completed a demo implementation of my idea of the simplest possible sequence-to-sequence problem. That demo was incomplete because it trained a seq-to-seq model but did not use the trained model to make a prediction.

Unlike relatively simple neural networks, such as a multi-class classifier, using a trained seq-to-seq model is a significant challenge. So I took the trained model and wrote a demo program to use the model to make a prediction.
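The core of the challenge is that a seq-to-seq model generates its output one token at a time, feeding each predicted token back in as decoder input. Here is a minimal sketch of that greedy autoregressive loop, with a hypothetical score() function standing in for a trained Transformer (the names greedy_decode, score, and dummy_score are my own, not from the demo):

```python
# greedy autoregressive decoding sketch
# token 1 = start-of-sequence, token 2 = end-of-sequence

def greedy_decode(score, max_len=20):
  tgt = [1]              # decoder input starts with SOS
  for _ in range(max_len):
    nxt = score(tgt)     # index of most-likely next token
    tgt.append(nxt)      # feed prediction back in
    if nxt == 2:         # EOS token ends generation
      break
  return tgt

# dummy scorer: predicts 5 three times, then EOS
def dummy_score(tgt):
  return 5 if len(tgt) < 4 else 2

print(greedy_decode(dummy_score))  # [1, 5, 5, 5, 2]
```

The real demo below does exactly this, except score() is replaced by a forward pass through the trained Transformer followed by argmax over the vocabulary logits.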

My input sequence is [1, 4, 5, 6, 7, 6, 5, 4, 2]. The 1 is the start-of-sequence token and the 2 is the end-of-sequence token. Token 3 is for unknown and token 0 is for padding; I didn’t use 0 or 3 in my demo. The correct output is [1, 5, 6, 7, 8, 7, 6, 5, 2] (each data token incremented by 1). My demo didn’t do too well, but at least it emitted a legal output sequence: [1, 5, 5, 4, 5, 8, 4, 4, 2].
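The special-token convention can be sketched in a few lines. The names PAD, SOS, EOS, UNK are labels I’ve added for clarity; the demo just uses the raw integer values:

```python
# special-token convention used by the demo
PAD, SOS, EOS, UNK = 0, 1, 2, 3   # padding, start, end, unknown
data = [4, 5, 6, 7, 6, 5, 4]      # raw data tokens start at 4

src = [SOS] + data + [EOS]        # wrap with start/end markers
print(src)  # [1, 4, 5, 6, 7, 6, 5, 4, 2]
```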

There are many things that I don’t fully understand about Transformer seq-to-seq systems, including my own demo. But for difficult machine learning topics, persistence and determination are the keys to successful learning.

Transformer software systems are difficult to figure out. There are a surprisingly large number of movies where a human transforms into a snake. Here are three where the plot is difficult to figure out. Left: “The Reptile” (1966) is an English movie about a young woman who transforms into a snake because of a Malay curse. Center: “Cult of the Cobra” (1955) is a movie about six men who unintentionally witness a ceremony of an evil cult of women who can transform into snakes. You’d think they’d stay away from mysterious women with dark reptilian eyes after that, but no, they don’t. Right: “The Sorcerer and the White Snake” (2011) is a Chinese movie. The plot baffled me but there are two women who can turn into snakes.

Demo code:

# Transformer seq-to-seq usage example

# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np
import torch as T
import math

device = T.device('cpu')

# -----------------------------------------------------------

class TransformerNet(T.nn.Module):
  def __init__(self):
    # vocab_size = 12, embed_dim = d_model = 4, seq_len = 9/10
    super(TransformerNet, self).__init__()  # classic syntax
    self.embed = T.nn.Embedding(12, 4)       # word embedding
    self.pos_enc = PositionalEncoding(4)    # positional
    self.trans = T.nn.Transformer(d_model=4, nhead=2, \
      dropout=0.0, batch_first=True)  # d_model div by nhead
    self.fc = T.nn.Linear(4, 12)  # embed_dim to vocab_size
  def forward(self, src, tgt, tgt_mask):
    s = self.embed(src)
    t = self.embed(tgt)

    s = self.pos_enc(s)  # [bs,seq=10,embed]
    t = self.pos_enc(t)  # [bs,seq=9,embed]

    z = self.trans(src=s, tgt=t, tgt_mask=tgt_mask)
    z = self.fc(z)   
    return z 

# -----------------------------------------------------------

class PositionalEncoding(T.nn.Module):  # documentation code
  def __init__(self, d_model: int, dropout: float=0.0,
   max_len: int=5000):
    super(PositionalEncoding, self).__init__()  # old syntax
    self.dropout = T.nn.Dropout(p=dropout)
    pe = T.zeros(max_len, d_model)  # like 10x4
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10_000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0).transpose(0, 1)
    self.register_buffer('pe', pe)  # allows state-save

  def forward(self, x):
    x = x + self.pe[:x.size(0), :]  # note: x.size(0) is the batch dim for batch-first input
    return self.dropout(x)

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PyTorch Transformer seq-to-seq use demo ")

  # 1. create Transformer network
  print("\nCreating batch-first Transformer network ")
  model = TransformerNet().to(device)

  # 2. load trained model wts and biases
  print("\nLoading saved model weights and biases ")
  fn = ".\\Models\\"
  model.load_state_dict(T.load(fn))
  model.eval()  # set evaluation mode

# -----------------------------------------------------------
  src = T.tensor([[1, 4,5,6,7,6,5,4, 2]],
    dtype=T.int64).to(device)
  print("\nsrc sequence: ")
  print(src)

  print("\ncorrect output: ")
  print("[[1, 5, 6, 7, 8, 7, 6, 5, 2]]")

  print("\nPredicted output: ")
  tgt_in = T.tensor([[1]], dtype=T.int64).to(device)  # SOS
  for i in range(20):  # max output 20 tokens
    n = tgt_in.size(1)
    t_mask = \
      T.nn.Transformer.generate_square_subsequent_mask(n).to(device)
    with T.no_grad():
      preds = model(src, tgt_in, tgt_mask=t_mask) 
      # [bs,tgt_in,embed] 
    next_token = T.argmax( preds[-1][-1] )  # last set 12 values
    # print(next_token); input()
    next_token = next_token.reshape(1,1)

    tgt_in = T.cat((tgt_in, next_token), dim=1)

    if next_token[0][0].item() == 2:  # EOS
      break

  print(tgt_in)

  print("\nEnd PyTorch Transformer seq-to-seq use demo ")

if __name__ == "__main__":
  main()
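The decoding loop relies on the causal target mask. The PyTorch static method generate_square_subsequent_mask(n) returns an n-by-n matrix with 0.0 on and below the diagonal and -inf above it, so decoder position i can attend only to positions 0 through i. A short illustration:

```python
import torch as T

# causal mask for a target sequence of length 3
m = T.nn.Transformer.generate_square_subsequent_mask(3)
print(m)
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])
```

The -inf entries are added to the raw attention scores before softmax, which zeroes out attention to future positions.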
