I’ve been looking at deep neural Transformer Architecture (TA) systems for several months. In terms of conceptual ideas and engineering details, they are probably the most complex software systems I’ve ever worked with.

*Update: A few weeks after I wrote this blog post, I created my own example of a sequence-to-sequence problem. See:*

jamesmccaffrey.wordpress.com/2022/09/09/simplest-transformer-seq-to-seq-example/

and

*jamesmccaffrey.wordpress.com/2022/09/12/using-the-simplest-possible-transformer-sequence-to-sequence-example/*

Everyone I know, including me, learns ML in the same way: 1.) find an example program, 2.) get it to run, 3.) add print() statements and make changes to figure out exactly how the example program works, 4.) gradually add new code/ideas.

*I found this transformer seq-to-seq example on the Internet. It has several flaws.*

So, it all starts with finding a working example program.

As far as I’ve been able to determine, there are no really good example programs on the Internet that demonstrate PyTorch TA sequence-to-sequence. I spent hours dissecting one of the main examples returned by a Google search, a blog post written by a student. I found roughly a dozen issues with the example, most minor but some significant.

The data for the example was generated programmatically rather than coming from some monstrously huge English-to-German NLP dataset. This is good, but the data made no sense to me. For example, consider this statement in the data-generation function:

start = np.random.randint(0, 1)

The np.random.randint(a, b) function returns a random integer greater than or equal to a and strictly less than b. So the statement always assigns 0.
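To see the off-by-one concretely, here's a tiny sketch (my own, not from the blog post) showing that randint(0, 1) can only ever produce 0, and that randint(0, 2) is presumably what was intended:

```python
import numpy as np

np.random.seed(1)

# np.random.randint(low, high) samples from the half-open
# interval [low, high), so randint(0, 1) can only return 0
vals = [np.random.randint(0, 1) for _ in range(10)]
print(vals)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# randint(0, 2) -- presumably what was intended -- returns 0 or 1
vals2 = sorted(set(np.random.randint(0, 2) for _ in range(100)))
print(vals2)  # [0, 1]
```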

A few hours into the examination I displayed the values of a predicted output and the expected output just before they were passed to the loss function during training:

```python
print("pred shape: ")
print(pred.shape)
print("y_expected shape: ")
print(y_expected.shape)
input()

loss = loss_fn(pred, y_expected)
```

The shapes were:

```
pred shape:
torch.Size([2, 4, 9])
y_expected shape:
torch.Size([2, 9])
```

Different shapes. This is a bit confusing but possibly correct, because the CrossEntropyLoss function expects a model output in the shape [batch_size, nb_classes, *additional_dims] and a target in the shape [batch_size, *additional_dims] containing the class indices in the range [0, nb_classes-1].
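A minimal sanity check (my own sketch, with made-up tensor values) confirms that CrossEntropyLoss accepts exactly these shapes, the so-called K-dimensional case. Here nb_classes = 4 because the generated data has only four token values (0, 1, SOS = 2, EOS = 3):

```python
import torch as T

loss_fn = T.nn.CrossEntropyLoss()

# model output: [batch_size=2, nb_classes=4, seq_len=9] -- raw logits
pred = T.randn(2, 4, 9)
# target: [batch_size=2, seq_len=9] -- class indices in [0, 3]
y_expected = T.randint(0, 4, (2, 9))

loss = loss_fn(pred, y_expected)  # accepted: the K-dimensional case
print(loss)  # a scalar tensor
```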

Interestingly, I think I learned more by dissecting the glitches in the example program than I would have if the program had been correct.

The point of this blog post is that transformer architecture sequence-to-sequence systems are incredibly complicated. But I’m confident I will figure them out eventually.

*The TA seq-to-seq examples I found on the Internet were disappointing. I like these two sci-fi movies but overall they were disappointing to me because they could have been so much better.*

*Left: “John Carter” (2012) is based on my favorite sci-fi novel of all time, “A Princess of Mars” (1912) by Edgar Rice Burroughs. Two terrible choices for the lead actor and actress. Poor story line and editing.*

*Right: “Valerian and the City of a Thousand Planets” (2017) was a follow-up in some sense to one of my favorite films of all time, “The Fifth Element” (1997) by director Luc Besson. Two even worse choices for lead actor and actress: a hero who looks like a 15-year old girl, and a heroine who was whining and obnoxious. Ugh. Both movies could have been great instead of merely OK.*

Some code I pulled from the example program I was examining. It has many flaws.

```python
# experiment.py
# examine code from a blog post

import numpy as np
import torch as T

device = T.device('cpu')

# --------------------------------------------------------

def generate_random_data(n):
  SOS_token = np.array([2])  # array with single value 2.0
  EOS_token = np.array([3])
  length = 8
  data = []

  # 1,1,1,1,1,1 -> 1,1,1,1,1  # what?
  for i in range(n // 3):
    X = np.concatenate((SOS_token, np.ones(length), EOS_token))
    y = np.concatenate((SOS_token, np.ones(length), EOS_token))
    data.append([X, y])

  # 0,0,0,0 -> 0,0,0,0
  for i in range(n // 3):
    X = np.concatenate((SOS_token, np.zeros(length), EOS_token))
    y = np.concatenate((SOS_token, np.zeros(length), EOS_token))
    data.append([X, y])

  # 1,0,1,0 -> 1,0,1,0,1  # what??
  for i in range(n // 3):
    X = np.zeros(length)
    start = np.random.randint(0, 1)  # WTF? always 0
    X[start::2] = 1
    y = np.zeros(length)
    if X[-1] == 0:
      y[::2] = 1
    else:
      y[1::2] = 1
    X = np.concatenate((SOS_token, X, EOS_token))
    y = np.concatenate((SOS_token, y, EOS_token))
    data.append([X, y])

  np.random.shuffle(data)
  return data  # a list of lists of arrays!!

# --------------------------------------------------------

def batchify_data(data, batch_size=3, padding=False, padding_token=-1):
  batches = []
  for idx in range(0, len(data), batch_size):
    # We make sure we dont get the last bit if its
    # not batch_size size
    if idx + batch_size < len(data):
      if padding:
        max_batch_length = 0
        # Get longest sequence in batch
        for seq in data[idx : idx + batch_size]:
          if len(seq) > max_batch_length:
            max_batch_length = len(seq)

        # Append X padding tokens until max length
        for seq_idx in range(batch_size):
          remaining_length = max_batch_length - \
            len(data[idx + seq_idx])
          data[idx + seq_idx] += [padding_token] * \
            remaining_length

      batches.append(np.array(data[idx : idx + \
        batch_size]).astype(np.int64))

  print(f"{len(batches)} batches of size {batch_size}")
  return batches

# --------------------------------------------------------

def get_tgt_mask_static(size) -> T.tensor:
  # original version was a model method !?
  # Generates a square matrix where each row
  # allows one word more to be seen
  mask = T.tril(T.ones(size, size) == 1)  # Lower triangular
  mask = mask.float()
  mask = mask.masked_fill(mask == 0, float('-inf'))  # Convert zeros to -inf
  mask = mask.masked_fill(mask == 1, float(0.0))     # Convert ones to 0

  # EX for size=5:
  # [[0., -inf, -inf, -inf, -inf],
  #  [0.,   0., -inf, -inf, -inf],
  #  [0.,   0.,   0., -inf, -inf],
  #  [0.,   0.,   0.,   0., -inf],
  #  [0.,   0.,   0.,   0.,   0.]]
  return mask

# --------------------------------------------------------

print("\nBegin demo \n")
np.random.seed(1)
T.manual_seed(1)

train_data = generate_random_data(90)
# print(train_data[0])
# input()
# [array([2., 1., 0., 1., 0., 1., 0., 1., 0., 3.]),
#  array([2., 1., 0., 1., 0., 1., 0., 1., 0., 3.])]
# a list containing two arrays, X, y

# print(train_data[0][0])  # first data item X
# input()
# [2., 1., 0., 1., 0., 1., 0., 1., 0., 3.]

# print(train_data)

train_dataloader = batchify_data(train_data)

# --------------------------------------------------------

for batch in train_dataloader:
  print("----------")
  X, y = batch[:, 0], batch[:, 1]
  X, y = T.tensor(X).to(device), T.tensor(y).to(device)
  print("X: ")
  print(X)  # has SOS and EOS
  input()
  print("y: ")
  print(y)  # identical to X !?
  input()

  y_input = y[:, :-1]    # SOS at front but no EOS at end
  y_expected = y[:, 1:]  # no SOS at front, but EOS at end
  print("y_input: ")
  print(y_input)
  input()
  print("y_expected: ")
  print(y_expected)
  input()

  sequence_length = y_input.size(1)
  tgt_mask = get_tgt_mask_static(sequence_length).to(device)
  print("seq len: ")
  print(sequence_length)  # 9 -- inc. SOS
  input()
  print("tgt_mask: ")
  print(tgt_mask)
  input()
  print("----------")

# --------------------------------------------------------

print("\nEnd demo \n")
```
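To check my understanding of what the target mask does, I find it helpful to apply it to a dummy score matrix and verify that each softmax row attends only to the current and earlier positions. This is my own sketch, not part of the blog post; get_tgt_mask_static is reproduced from the code above, and the all-zero score matrix is a made-up stand-in for real attention scores:

```python
import torch as T

def get_tgt_mask_static(size):
  # lower-triangular causal mask: 0.0 where attention
  # is allowed, -inf where it is blocked
  mask = T.tril(T.ones(size, size) == 1)
  mask = mask.float()
  mask = mask.masked_fill(mask == 0, float('-inf'))
  mask = mask.masked_fill(mask == 1, float(0.0))
  return mask

mask = get_tgt_mask_static(4)
scores = T.zeros(4, 4)  # pretend all raw attention scores are equal
attn = T.softmax(scores + mask, dim=1)
print(attn)
# row i spreads attention uniformly over positions 0..i:
# [[1.0000, 0.0000, 0.0000, 0.0000],
#  [0.5000, 0.5000, 0.0000, 0.0000],
#  [0.3333, 0.3333, 0.3333, 0.0000],
#  [0.2500, 0.2500, 0.2500, 0.2500]]
```

Recent PyTorch versions also provide a built-in equivalent, T.nn.Transformer.generate_square_subsequent_mask(sz), which returns the same -inf/0.0 pattern.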
