## Using the Simplest Possible Transformer Sequence-to-Sequence Example

I’ve been exploring PyTorch Transformer architecture models for sequence-to-sequence problems for several months. Transformer architecture systems are among the most complicated software systems I’ve ever worked with.

I recently completed a demo implementation of my idea of the simplest possible sequence-to-sequence example. That demo is incomplete because it trained a seq-to-seq model but did not use the trained model to make a prediction. See https://jamesmccaffrey.wordpress.com/2022/09/09/simplest-transformer-seq-to-seq-example/.

Unlike relatively simple neural networks, such as a multi-class classifier, using a trained seq-to-seq model is a significant challenge. So I took the trained model and wrote a demo program that uses the model to make a prediction. My input sequence is [1, 4, 5, 6, 7, 6, 5, 4, 2]. The 1 is start-of-sequence, the 2 is end-of-sequence. Token 3 is for unknown and token 0 is for padding. I didn’t use 0 or 3 in my demo. The correct output is [1, 5, 6, 7, 8, 7, 6, 5, 2]. My demo didn’t do too well, but at least it emitted a legal output sequence: [1, 5, 5, 4, 5, 8, 4, 4, 2].
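The input/output pair shows the pattern the model was trained on: each interior token of the output is the corresponding input token plus 1, with the start-of-sequence and end-of-sequence markers left alone. A few lines of plain Python (just an illustration of the mapping, not part of the demo program) make the scheme concrete:

```
# token scheme from the demo: 0 = pad, 1 = SOS, 2 = EOS, 3 = unk
SOS, EOS = 1, 2
src = [SOS, 4, 5, 6, 7, 6, 5, 4, EOS]

# the correct output adds 1 to each interior token
tgt = [SOS] + [t + 1 for t in src[1:-1]] + [EOS]
print(tgt)  # [1, 5, 6, 7, 8, 7, 6, 5, 2]
```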

There are many things that I don’t fully understand about Transformer seq-to-seq systems, including my own demo. But for difficult machine learning topics, persistence and determination are the keys to successful learning. Transformer software systems are difficult to figure out. There are a surprisingly large number of movies where a human transforms into a snake. Here are three where the plot is difficult to figure out. Left: “The Reptile” (1966) is an English movie about a young woman who transforms into a snake because of a Malay curse. Center: “Cult of the Cobra” (1955) is a movie about six men who unintentionally witness a ceremony of an evil cult of women who can transform into snakes. You’d think they’d stay away from mysterious women with dark reptilian eyes after that, but no, they don’t. Right: “The Sorcerer and the White Snake” (2011) is a Chinese movie. The plot baffled me but there are two women who can turn into snakes.

Demo code:

```
# seq2seq_use.py
# Transformer seq-to-seq usage example

# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np
import torch as T
import math

device = T.device('cpu')

# -----------------------------------------------------------

class TransformerNet(T.nn.Module):
  def __init__(self):
    # vocab_size = 12, embed_dim = d_model = 4, seq_len = 9/10
    super(TransformerNet, self).__init__()  # classic syntax
    self.embed = T.nn.Embedding(12, 4)      # word embedding
    self.pos_enc = PositionalEncoding(4)    # positional
    self.trans = T.nn.Transformer(d_model=4, nhead=2,
      dropout=0.0, batch_first=True)  # d_model div by nhead
    self.fc = T.nn.Linear(4, 12)  # embed_dim to vocab_size

  def forward(self, src, tgt, tgt_mask):
    s = self.embed(src)
    t = self.embed(tgt)

    s = self.pos_enc(s)  # [bs,seq=10,embed]
    t = self.pos_enc(t)  # [bs,seq=9,embed]

    z = self.trans(src=s, tgt=t, tgt_mask=tgt_mask)
    z = self.fc(z)   # map embed_dim back to vocab_size logits
    return z

# -----------------------------------------------------------

class PositionalEncoding(T.nn.Module):  # documentation code
  def __init__(self, d_model: int, dropout: float=0.0,
   max_len: int=5000):
    super(PositionalEncoding, self).__init__()  # old syntax
    self.dropout = T.nn.Dropout(p=dropout)
    pe = T.zeros(max_len, d_model)  # like 10x4
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10_000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0)  # [1, max_len, d_model] batch-first
    self.register_buffer('pe', pe)  # allows state-save

  def forward(self, x):
    # x is batch-first: [bs, seq_len, embed]
    x = x + self.pe[:, :x.size(1), :]
    return self.dropout(x)

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PyTorch Transformer seq-to-seq use demo ")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Transformer network
  print("\nCreating batch-first Transformer network ")
  model = TransformerNet().to(device)

  # 2. load trained model wts and biases
  print("\nLoading saved model weights and biases ")
  fn = ".\\Models\\transformer_seq_model_200_epochs.pt"
  model.load_state_dict(T.load(fn))
  model.eval()  # set mode for inference

# -----------------------------------------------------------

  # 3. use the trained model to make a prediction
  src = T.tensor([[1, 4, 5, 6, 7, 6, 5, 4, 2]],
    dtype=T.int64).to(device)
  print("\nsrc sequence: ")
  print(src)
  print("\ncorrect output: ")
  print("[[1, 5, 6, 7, 8, 7, 6, 5, 2]]")

  print("\nPredicted output: ")
  tgt_in = T.tensor([[1]], dtype=T.int64).to(device)  # SOS
  for i in range(20):  # max output 20 tokens
    n = tgt_in.size(1)
    tgt_mask = \
      T.triu(T.full((n, n), float('-inf')),
        diagonal=1).to(device)  # causal: no peek-ahead
    with T.no_grad():
      preds = model(src, tgt_in, tgt_mask)  # [bs, n, 12]

    next_token = T.argmax(preds[-1][-1])  # last set 12 values
    # print(next_token); input()
    next_token = next_token.reshape(1,1)

    tgt_in = T.cat((tgt_in, next_token), dim=1)
    print(tgt_in)

    if next_token.item() == 2:  # EOS
      break

  print("\nEnd PyTorch Transformer seq-to-seq use demo ")

if __name__ == "__main__":
  main()
```
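The decoding loop feeds the growing tgt_in sequence back into the model one step at a time, with a causal mask so the decoder cannot attend to positions after the one being predicted. One standard way to build such a mask (the triu construction used in the demo, sketched here at length 4 so the shape is easy to see):

```
import torch as T

# causal mask for a target sequence of length 4:
# -inf above the main diagonal blocks attention to future tokens,
# 0.0 on and below the diagonal allows attention to past tokens
n = 4
tgt_mask = T.triu(T.full((n, n), float('-inf')), diagonal=1)
print(tgt_mask)
```

Row i of the mask governs what query position i may attend to, so the first row allows only position 0 and the last row allows everything.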