This is a summary of large language models, written for myself as a reference and to aid understanding. Thanks to Sebastian Raschka, Andrej Karpathy and all others whose resources have benefited me.
Large language models (LLMs), such as OpenAI's ChatGPT or Anthropic's Claude, are deep neural networks based on the decoder architecture of the Transformer model. These models possess remarkable capabilities to understand, generate, and interpret human language.
Unlike earlier NLP models designed for specific tasks, LLMs are generalised and versatile. They have hundreds of billions of parameters—weights in the network that are optimised during training to predict the next word in a sequence. Due to their ability to generate content, LLMs fall under the broader category of Generative AI.
The encoder module processes text, encoding it into vectors that capture contextual information to pass into the decoder blocks. The decoder block, in turn, takes the encoded input and generates an output.
Both modules can also be used separately and effectively. The encoder module, due to its ability to capture the contextual information of inputs, is particularly suited for masked word prediction (e.g., "There was an _____ with the car"). This approach is used in BERT (short for Bidirectional Encoder Representations from Transformers).
On the other hand, the decoder module, due to its generative nature, can predict the following tokens (e.g., "There was an issue with the ___"). This approach forms the basis of GPT models (short for Generative Pretrained Transformers).
GPT models, primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities. These models are adept at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalise to completely unseen tasks without any prior specific examples. On the other hand, few-shot learning involves learning from a minimal number of examples the user provides as input.
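As a small illustration (the wording below is invented for this example), the difference between the two is simply what goes into the prompt:

# zero-shot: only the task description, no worked examples
zero_shot = (
    "Classify the sentiment of this review as positive or negative:\n"
    "'The film was a waste of two hours.'"
)

# few-shot: a handful of worked examples are included in the prompt itself
few_shot = (
    "Review: 'Loved every minute.' -> positive\n"
    "Review: 'Dull and predictable.' -> negative\n"
    "Review: 'The film was a waste of two hours.' ->"
)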
As mentioned, the GPT model uses only the decoder module. It operates sequentially, moving left to right, predicting each word based on those that precede it.
Due to training on such a vast amount of data, the model, although primarily designed to excel at next-word prediction, exhibits emergent behavior. This refers to behavior that the model was not specifically trained for but arises as a consequence of being exposed to such extensive data.
The challenging part of building a large language model (LLM) lies in stage 1, where the data is prepared, and the attention mechanism is implemented. Once this stage is complete, the remaining decoder blocks are constructed to form fully functional decoder modules.
An input to a transformer, whether audio, text, or video, cannot be processed in its raw form; all data must be standardised. The easiest and most convenient way to represent the input is as embeddings: numerical representations of words in a continuous vector space, where words with similar meanings or contexts are placed closer together. They encode semantic and syntactic relationships between words, enabling models to understand and process language more effectively.
Word embeddings are the most common choice for text, though sentence and paragraph embeddings are also possible; for GPT, word embeddings are sufficient.
Word embeddings can have varying dimensions, from one to thousands.
While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimising the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimised to the specific task and data at hand.
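A minimal sketch of such a trainable embedding layer (toy sizes; real LLMs use vocabularies of tens of thousands of tokens and hundreds of dimensions):

import torch

torch.manual_seed(0)
# one 4-dimensional vector per vocabulary entry, learned during training
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
token_ids = torch.tensor([1, 5, 7])
vectors = embedding(token_ids)          # looks up one row of the weight matrix per token ID
print(vectors.shape)                    # torch.Size([3, 4])
print(embedding.weight.requires_grad)   # True: the embeddings are updated during training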
We will now tokenise the text. First, we load the data from a stored file:
with open("verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
print("Total number of character:", len(raw_text))
Total number of character: 20479
We can now split the text on whitespace and punctuation using the regex module, discarding empty strings and whitespace-only items:
import re
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
4649
Now, we move on to converting these tokens into integer representations called token IDs, which serve as an intermediate step before generating embedding vectors. To achieve this, we first build a vocabulary—a mapping of unique tokens, including words and special characters, to unique integers.
We convert tokens to token IDs instead of embeddings directly because token IDs serve as a compact, discrete representation that acts as a bridge between raw text and the embedding layer. This intermediate step ensures that each unique token is assigned a unique integer, enabling efficient lookup and management of embeddings.
#Converting tokens into token IDs
all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(vocab_size)
vocab = {token:integer for integer, token in enumerate(all_words)}
1159
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
We can now create a simple tokeniser class that combines the text splitting and the encoding into token IDs:
class SimpleTokeniserV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Split text on special characters and whitespace
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that the join above inserts before punctuation
        text = re.sub(r'\s+([,.?_!"()\'])', r'\1', text)
        return text
# If text is: "Hello, world!"
# After re.split():
# preprocessed = ['Hello', ',', '', ' ', 'world', '!', '']
# After the cleaning step:
# preprocessed = ['Hello', ',', 'world', '!']
Testing it out, you can see we achieve the first part of our model
tokeniser = SimpleTokeniserV1(vocab)
text = """It's the last he painted"""
ids = tokeniser.encode(text)
print(ids)
[58, 2, 872, 1013, 615, 541, 763]
We also add a decode method to our tokeniser, although at this point it is not quite as useful. What would print(tokeniser.decode(ids)) output?
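For reference, with the decode method above (which strips the space the join inserts before punctuation), the round trip looks roughly like this; note the leftover space after the apostrophe, a side effect of the simple split-and-join approach:

print(tokeniser.decode(ids))
# It' s the last he painted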
To encode, we first instantiate our tokeniser class, passing vocab as an argument:
tokeniser = SimpleTokeniserV1(vocab)
When encoding text with ids = tokeniser.encode("It's the last he painted"), the program relies on the two mappings created in __init__:
self.str_to_int = vocab
#('As', 17)
#('At', 18)
#('Be', 19)
#('Begin', 20)
self.int_to_str = {i:s for s,i in vocab.items()}
#(17, 'As')
#(18, 'At')
...
This process is used for encoding and decoding. When entering the encode function, the text is first split based on white spaces and special characters, and then token IDs are generated.
ids = [self.str_to_int[s] for s in preprocessed]
Alternatively, the same logic can be written in a more explicit way:
ids = []
for s in preprocessed:
    ids.append(self.str_to_int[s])
In pseudocode, it looks like this:
for word in tokenised_words:
    append self.str_to_int[word] to token_ids
It's important to note that for a word to be tokenised, it must exist in the vocabulary; we cannot conjure a token ID for an arbitrary word. This leads to the following issue:
text = "Hello, do you like tea?"
x = tokeniser.encode(text)
print(x)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[10], line 2
1 text = "Hello, do you like tea?"
----> 2 x = tokeniser.encode(text)
3 print(x)
Cell In[7], line 10, in SimpleTokeniserV1.encode(self, text)
8 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
9 preprocessed = [item.strip() for item in preprocessed if item.strip()]
---> 10 ids = [self.str_to_int[s] for s in preprocessed]
11 return ids
KeyError: 'Hello'
To overcome the issue of tokens not appearing in our vocab, we add special tokens to the vocabulary to deal with certain contexts: an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary, and an <|endoftext|> token that we can use to separate two unrelated text sources.
When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM.
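As a small illustration (text2 is an invented example sentence), two unrelated sources can be joined with the marker before tokenisation:

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)
# Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.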
We can now take our tokens and add these two special tokens, which increases our vocab size to 1161.
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}
We can now update our tokeniser class by changing the preprocessed line in SimpleTokeniserV1: we check whether each word exists in our vocab, and if not, we substitute the <|unk|> token.
class SimpleTokeniserV2:
def __init__(self, vocab):
self.str_to_int = vocab
self.int_to_str = {i:s for s,i in vocab.items()}
def encode(self, text):
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
preprocessed = [item.strip() if item.strip() in self.str_to_int else "<|unk|>"
for item in preprocessed if item.strip()]
ids = [self.str_to_int[s] for s in preprocessed]
return ids
def decode(self, ids):
text = " ".join([self.int_to_str[i] for i in ids])
text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
return text
We can now tokenise text that contains unknown words. Running
print(tokeniser.decode(tokeniser.encode(text)))
with the new SimpleTokeniserV2, rather than getting an error, the <|unk|> special token replaces tokens that do not exist within our vocab:
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.
Thus far we have encoded simply by assigning each word a number in ascending order; a more sophisticated approach is byte pair encoding (BPE), which breaks words down into smaller subword units.
BPE (Byte Pair Encoding) is a data compression technique that iteratively merges the most frequently occurring pairs of bytes or characters.
The main advantages of BPE include: 1. Handling out-of-vocabulary words effectively 2. Reducing vocabulary size while maintaining meaning 3. Better handling of rare words and morphological variations
For example, the word "understanding" might be tokenised as "under" + "stand" + "ing", allowing the model to recognise parts of unfamiliar words based on common subwords.
The process of BPE typically follows these steps: start with a vocabulary of individual characters (or bytes), count how often each adjacent pair of symbols occurs, merge the most frequent pair into a new symbol, and repeat until the desired vocabulary size is reached.
A simple example:
Original text: "low lower lowest"
Initial tokens: "l", "o", "w", "e", "r", "s", "t"
After merges: "low", "er", "est"
This approach allows the model to handle new words by breaking them into known subparts. For instance: - "unhappy" → "un" + "happy" - "playing" → "play" + "ing" - "cryptocurrency" → "crypto" + "currency"
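A quick way to see this in practice is to feed a made-up string through the tiktoken library (introduced just below); the exact splits depend on the tokeniser's learned merges, so treat the output as illustrative:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Akwirw ier")            # a nonsense string not in any vocabulary
print([enc.decode([i]) for i in ids])     # a list of subword pieces, no <|unk|> needed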
Modern LLMs like GPT use BPE because it offers a good balance between vocabulary size and token effectiveness. It helps the model process words more efficiently while maintaining semantic understanding, although tokenisation schemes continue to evolve and newer models may use variants of or alternatives to plain BPE.
We will use the cl100k_base encoding to tokenise the text; its vocabulary contains roughly 100,000 tokens.
import tiktoken
tokeniser = tiktoken.get_encoding("cl100k_base") # the encoding used by GPT-3.5/GPT-4; "gpt2" is the older alternative
text = "Hello, do you like tea? <|endoftext|> In the sunlit terra"
integers = tokeniser.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
The output is unsurprising:
[9906, 11, 656, 499, 1093, 15600, 30, 220, 100257, 763, 279, 7160, 32735, 60661]
We have covered tokenisation, but before converting tokens to embeddings we must generate the input-target pairs required for training an LLM. To build the intuition, we can mimic the effect of a torch.tril-style causal mask with some simple slicing, as shown below.
To get started, we will first tokenise the whole The Verdict short story we worked with earlier using the BPE tokeniser introduced in the previous section:
# creating input-target pairs
# first we tokenise the whole short story
with open("verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
enc_text = tokeniser.encode(raw_text)
print(len(enc_text))
4943
We now have a list of 4943 tokens. We drop the first 50 with enc_sample = enc_text[50:] (which gives a slightly more interesting passage to work with), and then mimic the shifted-by-one structure with some slicing:
# let x = input tokens, y = target tokens, where y=x[pos+1]
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x : {x}")
print(f"y :\t {y}")
x : [323, 9749, 5678, 304]
y : [9749, 5678, 304, 264]
You can see the offset which we can loop over
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(context, "--->", desired)
[323] ---> 9749
[323, 9749] ---> 5678
[323, 9749, 5678] ---> 304
[323, 9749, 5678, 304] ---> 264
We can decode the tokens with tokeniser.decode(context) to get a more human-coherent output
and ---> established
and established ---> himself
and established himself ---> in
and established himself in ---> a
We've now created input-target pairs that we can use for LLM training. This is a rudimentary implementation; what we are really interested in is returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict.
The code implementation will operate on token IDs directly since the encode method of the BPE tokeniser performs both tokenisation and conversion into token IDs as a single step. For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes
As a quick reminder of PyTorch, this is how we'd use tensors to model a single neuron with a w·x + b configuration and compute the gradients of a binary cross-entropy (BCE) loss:
import torch
import torch.nn.functional as F
from torch.autograd import grad
y = torch.tensor([0.1])
x1 = torch.tensor([1.1]) #input value
w1 = torch.tensor([2.2], requires_grad=True) #input weight
b = torch.tensor([0.0], requires_grad=True) #bias
z = x1 * w1 + b # net input
a = torch.sigmoid(z) #activation function
# a = sigmoid((input value * input weight) + bias)
loss = F.binary_cross_entropy(a,y)
grad_L_w1 = grad(loss, w1 , retain_graph=True)
grad_L_b = grad(loss, b , retain_graph=True)
print(grad_L_w1) #(tensor([0.9002]),)
print(grad_L_b) # (tensor([0.8183]),)
Here we only do one forward and backward pass. The binary cross-entropy loss: imagine you're trying to predict whether something is "yes" or "no" (binary classification). The binary_cross_entropy loss function measures how wrong your prediction a is compared to the true answer y. If your prediction a (which is between 0 and 1, thanks to the sigmoid) is far from the target y (the true label), the loss is high; if a is close to y, the loss is low. Essentially, it quantifies the error of the model's single prediction.
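As a sanity check, here is a minimal sketch computing the same loss by hand, using the approximate sigmoid output from the snippet above:

import torch
import torch.nn.functional as F

a = torch.tensor([0.9183])   # approx. sigmoid(1.1 * 2.2 + 0.0)
y = torch.tensor([0.1])

manual = -(y * torch.log(a) + (1 - y) * torch.log(1 - a))  # BCE formula written out
print(manual)
print(F.binary_cross_entropy(a, y))  # should print (almost) the same value, roughly 2.26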
The NeuralNetwork class in PyTorch defines a multi-layered perceptron (MLP) model. The __init__ method constructs the network's architecture as a sequential arrangement of layers: three linear layers (torch.nn.Linear) interspersed with ReLU activation functions (torch.nn.ReLU). Each torch.nn.Linear layer performs a linear transformation on the input data, projecting it to a different dimension, while the ReLU activations introduce non-linearity, enabling the network to learn complex relationships.
The forward method defines the data flow through the network. Input data x is passed through the self.layers sequence. The output of the final linear layer, before any output activation function is applied, is termed the logits. Logits represent the raw, unnormalised predictions of the network. These logits are then typically fed into a loss function or further processed by an activation function (such as Sigmoid for binary classification or Softmax for multi-class classification) depending on the specific task the network is designed to solve.
class NeuralNetwork(torch.nn.Module):
def __init__(self, num_inputs, num_outputs):
super().__init__()
self.layers = torch.nn.Sequential(
torch.nn.Linear(num_inputs, 30),
torch.nn.ReLU(),
torch.nn.Linear(30, 20),
torch.nn.ReLU(),
torch.nn.Linear(20, num_outputs),
)
def forward(self, x):
logits = self.layers(x)
return logits
To calculate the total trainable parameters, the code iterates through each parameter (p) in the model.parameters(). For each parameter that requires gradient computation (p.requires_grad), p.numel() counts the total number of elements in that parameter tensor, representing the individual weights and biases. Summing up these element counts across all trainable parameters gives the num_params, the total count of parameters the model will learn during training.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)
We can run a forward pass with model(X) to obtain the model's raw output, where X is a random input tensor representing a single data sample with 50 features.
with torch.no_grad():
out = torch.softmax(model(X), dim=1)
print(out) #tensor([[0.3113, 0.3934, 0.2952]])
#without softmax -> tensor([[-0.1262, 0.1080, -0.1792]])
torch.no_grad()*** temporarily disables PyTorch's automatic gradient tracking. The code performs a forward pass to get the model's raw output logits, then applies the softmax function along dimension 1 to convert the logits into probabilities, making the output interpretable as a probability distribution before printing.
*** omitting torch.no_grad() during the forward pass would still cause PyTorch to build a computation graph to track operations for potential gradient calculations later. This graph construction consumes memory and adds computational overhead, even if you don't intend to use backpropagation.
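A quick way to see the difference, assuming the model and X from above: without no_grad the output carries a grad_fn, which means a graph was recorded.

out_with_graph = model(X)
with torch.no_grad():
    out_without_graph = model(X)

print(out_with_graph.grad_fn)      # e.g. <AddmmBackward0 object at ...>
print(out_without_graph.grad_fn)   # None: no computation graph was recorded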
The Dataset class serves as an abstract class representing a dataset. To work with custom datasets, one must subclass Dataset and implement three essential methods: __init__, __getitem__, and __len__. The __init__ method is the constructor, responsible for initialising the dataset, typically by loading or referencing the data features and labels; in the ToyDataset example, __init__ stores input features X and corresponding labels y. The crucial __getitem__ method, given an index, retrieves and returns a single data sample (features and label) from the dataset; for ToyDataset, __getitem__(index) accesses and returns the index-th feature and label pair. Finally, __len__ simply returns the total number of samples in the dataset, which in ToyDataset is determined by the number of labels.
from torch.utils.data import Dataset
class ToyDataset(Dataset):
def __init__(self, X, y):
self.features = X
self.labels= y
def __getitem__(self, index):
one_x = self.features[index]
one_y = self.labels[index]
return one_x, one_y
def __len__(self):
return self.labels.shape[0]
train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)
Once a custom Dataset is defined, the DataLoader class facilitates efficient data loading during training and evaluation. DataLoader takes a Dataset object as input and provides an iterable over the dataset, yielding data in batches. Key parameters of DataLoader include dataset, specifying the Dataset to load from; batch_size, defining the number of samples in each batch; shuffle, which, when set to True, randomises the order of samples in each epoch, preventing biases from data ordering during training; and num_workers, controlling the number of subprocesses used for data loading. Setting num_workers to a value greater than zero can significantly speed up data loading, especially for CPU-bound preprocessing, but setting it to 0, as in the example, means data loading happens in the main process, simplifying debugging.
from torch.utils.data import DataLoader
train_loader = DataLoader(
dataset= train_ds,
batch_size= 2,
shuffle= True,
num_workers= 0
)
test_loader = DataLoader(
dataset = test_ds,
batch_size = 2,
shuffle=True,
num_workers=0
)
To train it, a NeuralNetwork model is initialised along with an SGD optimiser that adjusts the model's parameters using a learning rate of 0.375. Training proceeds for a specified number of epochs (num_epochs). In each epoch, the model is set to training mode (model.train()) and the code iterates through train_loader to process data in batches. For each batch, a forward pass calculates the logits, the loss is computed with F.cross_entropy by comparing logits to labels, gradients are zeroed, backpropagation is performed with loss.backward(), and the optimiser updates the model parameters via optimiser.step(). Training progress is logged by printing the epoch number, batch index, and training loss. After the training loop, the model is set to evaluation mode (model.eval()) and used to predict on the X_train data, demonstrating the trained model's output on the training data.
import torch.nn.functional as F
torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimiser= torch.optim.SGD(model.parameters(), lr=0.375)
num_epochs = 3
for epoch in range(num_epochs):
model.train()
for batch_idx, (features, labels) in enumerate(train_loader):
logits = model(features)
loss = F.cross_entropy(logits, labels)
optimiser.zero_grad()
loss.backward()
optimiser.step()
### LOGGING
print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
f" | Train Loss: {loss:.2f}")
model.eval()
print(f"{X_train} \n") # ... Epoch: 002/003 | Batch 000/002 | Train Loss: 0.49 ...
model.eval()
with torch.no_grad():
outputs = model(X_train)
print(outputs)
"""
tensor([[-1.2000, 3.1000],
[-0.9000, 2.9000],
[-0.5000, 2.6000],
[ 2.3000, -1.1000],
[ 2.7000, -1.5000]])
tensor([[ 2.2694, -3.3964],
[ 2.0256, -3.0810],
[ 1.6876, -2.6358],
[-1.1288, 1.1294],
[-1.2999, 1.3172]])
"""
The model's raw outputs are converted into class probabilities using softmax along dimension 1; subsequently, the class with the highest probability for each sample is determined using argmax along dimension 1 to obtain and print the predicted class labels.
torch.set_printoptions(sci_mode= False)
probas = torch.softmax(outputs, dim=1)
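Continuing from the probabilities above (a small sketch assuming the probas and y_train tensors from the earlier snippets), argmax picks the most likely class per row, which we can compare directly with the labels:

predictions = torch.argmax(probas, dim=1)   # index of the highest probability per sample
print(predictions)                          # e.g. tensor([0, 0, 0, 1, 1])
print(predictions == y_train)               # element-wise correctness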
The compute_accuracy function evaluates the performance of a given model on a dataset provided by a DataLoader. It sets the model to evaluation mode (model.eval()), iterates through the dataloader in batches, performs a forward pass without gradient calculation using torch.no_grad(), and determines the predicted class labels by taking the argmax of the logits. By comparing these predictions to the true labels, it accumulates the number of correct predictions and returns the accuracy as a percentage.
def compute_accuracy(model, dataloader):
model = model.eval()
correct = 0.0
total_examples = 0
for idx, (features, labels) in enumerate(dataloader):
with torch.no_grad():
logits = model(features)
predictions = torch.argmax(logits, dim=1) #In torch.argmax(outputs, dim=1), dim=1 specifies that the index of the maximum value should be returned across the columns (dimension 1) of the outputs tensor, effectively selecting the class with the highest score for each sample in the batch.
compare = labels == predictions
correct += torch.sum(compare)
total_examples += len(compare)
return (correct/ total_examples).item()*100
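Usage is then simply the following (example values; the exact accuracy depends on the training run):

print(compute_accuracy(model, train_loader))  # e.g. 100.0
print(compute_accuracy(model, test_loader))   # e.g. 100.0 on this tiny toy dataset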
Within the __init__ method of GPTDatasetV1, the code first stores the provided tokeniser for later use in encoding text. It then initialises two empty lists, self.input_ids and self.target_ids, which will store the numerical representations of the input and target sequences respectively. The input text txt is tokenised into numerical IDs using tokeniser.encode(txt) and stored as token_ids. The code then iterates through token_ids using a sliding-window approach, determined by max_length and stride. In each step of the loop, it extracts a chunk of token_ids of length max_length as input_chunk and a corresponding target_chunk shifted one position to the right, representing the next-token prediction task. Finally, both input_chunk and target_chunk are converted into PyTorch tensors and appended to the self.input_ids and self.target_ids lists, effectively preparing pairs of input and target sequences for training a GPT-like model.
import torch
from torch.utils.data import Dataset, DataLoader
class GPTDatasetV1(Dataset):
def __init__(self, txt, tokeniser, max_length, stride):
self.tokeniser = tokeniser
self.input_ids= []
self.target_ids =[]
token_ids = tokeniser.encode(txt)
for i in range(0, len(token_ids) - max_length, stride):
input_chunk = token_ids[i: i + max_length]
target_chunk= token_ids[i+1: i + max_length+ 1]
self.input_ids.append(torch.tensor(input_chunk))
self.target_ids.append(torch.tensor(target_chunk))
def __len__(self):
return len(self.input_ids)
def __getitem__(self, idx):
return self.input_ids[idx], self.target_ids[idx]
We must now make a dataloader. We set the tokeniser to cl100k_base (the encoding used by GPT-3.5/GPT-4 models), create an instance of our Dataset class, and then build the dataloader using PyTorch's default DataLoader class.
def create_dataloader_v1(txt, batch_size=4 ,max_length=256, stride=128, shuffle=True, drop_last=True):
tokeniser = tiktoken.get_encoding("cl100k_base")
dataset = GPTDatasetV1(txt, tokeniser, max_length, stride)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
return dataloader
Visually you can see how it works
with open("verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
dataloader = create_dataloader_v1(raw_text, batch_size=3, max_length=4, stride=2, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
for i in range(first_batch[0].shape[0]):  # iterate over the rows in the batch
    print(f"{first_batch[0][i]}\t{first_batch[1][i]}\n")
It is important to note what stride does here and the impact of setting its value too low or too high: the stride controls how far the sliding window moves between samples, so a small stride produces heavily overlapping inputs (more samples, but more redundancy and potential overfitting), while a stride equal to max_length produces non-overlapping inputs.
tensor([ 40, 473, 1846, 2744]) tensor([ 473, 1846, 2744, 3463])
tensor([1846, 2744, 3463, 7762]) tensor([2744, 3463, 7762, 480])
tensor([3463, 7762, 480, 285]) tensor([ 7762, 480, 285, 22464])
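To make the effect of stride concrete, here is a small sketch comparing a heavily overlapping loader with a non-overlapping one (reusing create_dataloader_v1 and raw_text from above):

dl_overlap = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
dl_no_overlap = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=4, shuffle=False)

print(next(iter(dl_overlap)))      # consecutive samples share 3 of their 4 tokens
print(next(iter(dl_no_overlap)))   # consecutive samples share no tokens
print(len(dl_overlap), len(dl_no_overlap))  # the stride=1 loader yields roughly 4x as many batches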
The last step for preparing the input text for LLM training is to convert the token IDs into embedding vectors
It is important to note that we initialise these embedding weights with random values as a preliminary step. This initialization serves as the starting point for the LLM's learning process. We will optimise the embedding weights as part of the LLM training
For the sake of simplicity and illustration, suppose we have a small vocabulary of only 6 words (instead of, say, the 50,257 words in the GPT-2 BPE tokeniser vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions):
vocab_size = 6
output_dim = 3
torch.manual_seed(123)
embedding_layer= torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)
"""
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
[ 0.9178, 1.5810, 1.3010],
[ 1.2753, -0.2010, -0.1606],
[-0.4015, 0.9666, -1.1481],
[-1.1589, 0.3255, -0.6315],
[-2.8400, -0.7849, -1.4096]], requires_grad=True)
"""
We can then output certain embeddings based on the index of the embedding_layer, for example using an input_ids = torch.tensor([2, 3, 5, 1]) and printing embedding_layer(input_ids) would give us what is in the embedding_layer at index 2, 3, 5, 1.
We converted the token IDs into a continuous vector representation, the so-called token embeddings. In principle, this is a suitable input for an LLM. However, a minor shortcoming of LLMs is that their self-attention mechanism doesn't have a notion of position or order for the tokens within a sequence.
There are two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings. Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. For instance, the first token will have a specific positional embedding, the second token another distinct embedding, and so on.
Let's scale up our embedding vectors, to 256 dimensions on a vocab of 50,257
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(token_embedding_layer) #Embedding(50257, 256)
If we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=True)
data_iter = iter(dataloader)
inputs, target = next(data_iter)
"""
Token IDs:
tensor([[ 8275, 28601, 315, 22295],
[ 439, 499, 2019, 382],
[ 1268, 568, 5602, 1555],
[ 358, 1047, 2744, 559],
[27000, 304, 459, 26762],
[ 3169, 430, 527, 2163],
[15746, 757, 11, 279],
[ 279, 3177, 1555, 54499]])
Inputs shape:
torch.Size([8, 4])
"""
The token ID tensor is 8x4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each. We can pass this into our embedding layer to obtain 256-dimensional vectors:
token_embeddings = token_embedding_layer(inputs)
""" we just print our token_embeddings[7] - there are 8 in total
tensor([[ 2.2409, -1.1483, 0.7415, ..., -0.6072, 0.8110, -2.3937],
[-0.6711, -1.5402, 0.4477, ..., -0.6150, -1.7679, -0.5633],
[-0.4210, 0.0806, -0.4998, ..., -0.5750, 0.9891, 1.0474],
[ 0.7883, 0.2666, 1.3887, ..., -0.0707, 0.2416, -0.3078]],
grad_fn=<SelectBackward0>)
torch.Size([8, 4, 256])
"""
Each token ID is now embedded as a 256-dimensional vector. For a GPT model's absolute embedding approach, we just need to create another embedding layer that has the same dimension as the token_embedding_layer:
# absolute positional embedding
context_length = max_length #4
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
# torch.arange(context_length) produces the position indices 0 .. context_length-1; the positional embeddings tell the model whether a token is at the beginning, middle, or end of the sequence.
print(pos_embeddings)
print(pos_embeddings.shape)
"""
tensor([[-1.5332, 0.0175, 1.4208, ..., 0.9951, -0.8595, 2.6080],
[-0.2507, 0.1291, -0.8199, ..., -0.5172, 1.4405, -0.0447],
[-0.2071, 0.5227, 1.8507, ..., 0.5915, -0.1654, 1.4362],
[-0.2940, 0.5734, 1.0133, ..., -0.6064, 0.3872, -0.9926]],
grad_fn=<EmbeddingBackward0>)
torch.Size([4, 256])
"""
As shown in the preceding code example, the input to the pos_embeddings is usually a placeholder vector torch.arange(context_length), which contains a sequence of numbers 0, 1, ..., up to the maximum input length − 1. The context_length is a variable that represents the supported input size of the LLM. Here, we choose it similar to the maximum length of the input text. In practice, input text can be longer than the supported context length, in which case we have to truncate the text.
As we can see, the positional embedding tensor consists of four 256-dimensional vectors. We can now add these directly to the token embeddings, where PyTorch will broadcast the 4x256-dimensional pos_embeddings tensor onto each 4x256-dimensional token embedding tensor in each of the 8 batches.
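In code, using the token_embeddings and pos_embeddings tensors from above, the broadcasted addition is a one-liner:

input_embeddings = token_embeddings + pos_embeddings  # broadcast over the batch dimension
print(input_embeddings.shape)                         # torch.Size([8, 4, 256])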
In summary: we created the input text by reading from a text file, then tokenised it by splitting the text word by word and removing line breaks and empty strings. We use the <|endoftext|> and <|unk|> special tokens to mark the end of a text segment (until the next one begins) and to stand in for words that do not exist in the vocabulary, respectively. The tokenised text is then turned into token IDs; we could simply enumerate the vocabulary for this, but it is better to use byte pair encoding (cl100k_base). Using a dataloader, overriding __getitem__ and __len__, we can format raw_text with txt, tokeniser, max_length and stride inputs, where the tokeniser is cl100k_base (or gpt2).
This provides us with token IDs.
Next, we embed the tokens with torch.nn.Embedding, given a vocab_size and output_dim; in our case the token embeddings have shape 4 x 256 per sample. We then create positional embeddings using torch.arange over the context length. Our input embeddings are then token embeddings + positional embeddings, which are fed into the transformer.
We will now look at the attention mechanism. We will build it up in stages before finally understanding and reaching the multi-head attention actually used in transformers.
Before transformer LLMs, it was common to use RNNs for language modeling tasks such as language translation. RNNs work fine for translating short sentences but don't work well for longer texts as they don't have direct access to previous words in the input. One major shortcoming in this approach is that the RNN must remember the entire encoded input in a single hidden state before passing it to the decoder
Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series. It is a mechanism in transformers that is used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence.
Let's start simple. The goal of self-attention is to compute a context vector, for each input element, that combines information from all other input elements. In the example depicted in this figure, we compute the context vector z(2) . The importance or contribution of each input element for computing z(2) is determined by the attention weights α21 to α2T. When computing z(2), the attention weights are calculated with respect to input element x(2) and all other inputs.
Consider the sentence "Your journey starts with one step." Each word in this sentence, like "Your," "journey," etc., is first converted into a vector of numbers called an embedding vector, let's say 3-dimensional for simplicity. These are our initial input vectors.
The magic of self-attention lies in creating "context vectors." Imagine we are focusing on the word "journey" in our sentence. Self-attention's goal is to produce a new, enriched embedding vector for "journey" called a context vector. This context vector for "journey" isn't just about "journey" alone. Instead, it cleverly incorporates information from all the words in the sentence: "Your," "journey," "starts," "with," "one," and "step."
These context vectors are super important for language models. They enable the model to understand how words relate to each other within a sentence. For example, the context vector for "journey" will capture its relationship to "Your" and "starts," allowing the model to understand the sentence as a whole, rather than just isolated words. Later on, the model will learn to build these context vectors using trainable weights, making them highly effective for tasks like predicting the next word in a sequence.
import torch
vocab_size = 6
output_dim = 3
torch.manual_seed(123)
embedding_tensor = torch.nn.Embedding(vocab_size, output_dim)
inputs = embedding_tensor.weight
print(inputs)
"""
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690], #your
[ 0.9178, 1.5810, 1.3010], #journey
[ 1.2753, -0.2010, -0.1606], #starts
[-0.4015, 0.9666, -1.1481], #with
[-1.1589, 0.3255, -0.6315], #one
[-2.8400, -0.7849, -1.4096]], requires_grad=True) #step
"""
The first step of implementing self-attention is to compute the intermediate values ω, referred to as attention scores. The figure below shows the first step in creating the context vector for "journey": calculating attention scores. We are treating "journey", x(2), as the query. This "query" word is going to "attend" to all other words in the sentence (including itself); we need to figure out how much attention "journey" should pay to "Your," "journey," "starts," "step," etc.
To do this, we calculate the attention scores ω, which are computed as dot products: for each word in the sentence, we take its embedding vector and calculate the dot product with the embedding vector of our "query" word, "journey" x(2).
Since we compute these scores by taking the dot product of the query x(2) with every other input token, we can program it quite easily:
# dot product of all x values with respect to x(2) to get the attention scores
query = inputs[1]
print("%s\n"%query)
attn_scores_2 = torch.empty(inputs.shape[0])
for i , x_i in enumerate(inputs):
attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
"""
tensor([0.9178, 1.5810, 1.3010])
tensor([-0.1913, 5.0345, 0.6438, -0.3341, -1.3706, -5.6812]) # attention scores!
"""
Beyond viewing the dot product operation as a mathematical tool that combines two vectors to yield a scalar value, the dot product is a measure of similarity because it quantifies how much two vectors are aligned: a higher dot product indicates a greater degree of alignment or similarity between the vectors. In the context of self-attention mechanisms, the dot product determines the extent to which elements in a sequence attend to each other: the higher the dot product, the higher the similarity and attention score between two elements.
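To make this concrete, the dot product is just an element-wise multiply-and-accumulate; the loop below (for the first input and our query from above) should match torch.dot exactly:

res = 0.
for idx, element in enumerate(inputs[0]):
    res += element * query[idx]          # multiply matching components and accumulate
print(res)
print(torch.dot(inputs[0], query))       # same value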
After computing the attention scores ω21 to ω2T with respect to the input query x (2) , the next step is to obtain the attention weights α21 to α2T by normalizing the attention scores using softmax.
# normalise with a standard softmax
torch.set_printoptions(sci_mode=False, precision=3) #configures how PyTorch tensors are displayed
import torch.nn.functional as F
attn_weights_2_tmp = F.softmax(attn_scores_2, dim=0)
print("Attention weights:", [f"{weight:.4f}" for weight in attn_weights_2_tmp])
print("Sum:", attn_weights_2_tmp.sum())
"""
Attention weights: ['0.0052', '0.9765', '0.0121', '0.0046', '0.0016', '0.0000']
Sum: tensor(1.000)
"""
Now that we computed the normalised attention weights, we are ready for the final step by calculating the context vector z(2) by multiplying the embedded input tokens, x(i) , with the corresponding attention weights and then summing the resulting vectors.
The calculation of the context vector is a weighted sum of all input vectors, this involves multiplying each input vector by its corresponding attention weight
In calculation we would do: [0.4, 0.1, 0.8] x 0.1 = [0.04, 0.01, 0.08], [0.5, 0.8, 0.6] x 0.2 = [0.10, 0.16, 0.12] etc. and then add them up together by 'column', in python:
query = inputs[1]
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
context_vec_2 += attn_weights_2_tmp[i] * x_i
print(context_vec_2)
"""
tensor([0.910, 1.545, 1.261])
"""
The highlighted row shows the attention weights for the second input element as a query, as we computed in the previous section.
Now we generalise this to all inputs at once. We follow the same three steps as before, just in matrix form: first compute the attention scores, then the attention weights, and finally the context vectors.
# dot product for all pair x_i, x_j
attn_scores = torch.empty(6,6)
for i, x_i in enumerate(inputs):
for j, x_j in enumerate(inputs):
attn_scores[i, j] = torch.dot(x_i, x_j)
A more efficient way to do this is to use matrix multiplication with the transpose of the inputs:
# use matrix multiplication with transpose of vector inputs
attn_scores = inputs @ inputs.T
print(attn_scores)
"""
tensor([[ 0.174, -0.191, 0.493, -0.113, -0.342, -0.580],
[-0.191, 5.034, 0.644, -0.334, -1.371, -5.681],
[ 0.493, 0.644, 1.693, -0.522, -1.442, -3.238],
[-0.113, -0.334, -0.522, 2.414, 1.505, 2.000],
[-0.342, -1.371, -1.442, 1.505, 1.848, 3.926],
[-0.580, -5.681, -3.238, 2.000, 3.926, 10.668]])
"""
We can then normalise each row with softmax (our inputs differ from the book's example, so the numbers will not match its figure):
attn_weights = torch.softmax(attn_scores, dim=1)
print(attn_weights)
"""
tensor([[ 0.205, 0.142, 0.282, 0.154, 0.122, 0.096],
[ 0.005, 0.976, 0.012, 0.005, 0.002, 0.000],
[ 0.166, 0.193, 0.552, 0.060, 0.024, 0.004],
[ 0.035, 0.028, 0.023, 0.442, 0.178, 0.292],
[ 0.011, 0.004, 0.004, 0.072, 0.101, 0.808],
[ 0.000, 0.000, 0.000, 0.000, 0.001, 0.999]])
"""
Now, the last step is to compute the context vectors, again using matrix multiplication:
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)
"""
tensor([[ 0.082, 0.244, -0.284],
[ 0.910, 1.545, 1.261],
[ 0.874, 0.228, 0.045],
[-1.147, 0.290, -1.005],
[-2.428, -0.528, -1.282],
[-2.838, -0.783, -1.409]])
"""
So far we have created a simplified self-attention mechanism; we will now extend it with trainable weights.
Note that the context vector becomes the new, context-enriched representation of "journey": it replaces the original embedding vector for "journey" and is then used as the input for the subsequent layers of the neural network, enabling the model to process information that is aware of the relationships between words in the sentence.
The self-attention mechanism with trainable weights builds on the previous concepts: we want to compute context vectors as weighted sums over the input vectors specific to a certain input element. The most notable difference is the introduction of weight matrices that are updated during model training. These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors.
We will implement the self-attention mechanism step by step by introducing the three trainable weight matrices Wq ,Wk , and Wv. These three matrices are used to project the embedded input tokens, x(i), into query, key, and value vectors as illustrated.
In the first step of the self-attention mechanism with trainable weight matrices, we compute query (q), key (k), and value (v) vectors for input elements x. Similar to previous sections, we designate the second input, x(2) , as the query input. The query vector q(2) is obtained via matrix multiplication between the input x(2) and the weight matrix Wq. Similarly, we obtain the key and value vectors via matrix multiplication involving the weight matrices Wk and Wv.
We now make the attention weights α trainable, rather than deriving them directly from dot products between the raw inputs. Let us set this up first, with some variables as shown below:
x_2 = inputs[1]
d_in = inputs.shape[1]
d_out = 2
print("inputs:\n",inputs)
print("\ninputs_shape[1]:\n",x_2)
print("\nd_in:\n",d_in)
print("\nd_out:\n",d_out)
"""
inputs:
tensor([[ 0.337, -0.178, -0.169],
[ 0.918, 1.581, 1.301],
[ 1.275, -0.201, -0.161],
[-0.401, 0.967, -1.148],
[-1.159, 0.325, -0.632],
[-2.840, -0.785, -1.410]])
x_2 (inputs[1]):
tensor([0.918, 1.581, 1.301])
d_in:
3
d_out:
2
"""
d_in (input dimension): Determined by the input data's shape (specifically the number of features). d_out (output dimension): Determined by you (the designer), based on what you want the output to be (e.g., number of classes, desired output feature size)
Next, we initialise the three weight matrices Wq, Wk, and Wv
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=True)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=True)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=True)
print(W_query, W_key, W_value)
"""
Parameter containing:
tensor([[0.296, 0.517],
[0.252, 0.689],
[0.074, 0.867]], requires_grad=True) Parameter containing:
tensor([[0.137, 0.102],
[0.184, 0.726],
[0.315, 0.687]], requires_grad=True) Parameter containing:
tensor([[0.076, 0.197],
[0.316, 0.402],
[0.119, 0.827]], requires_grad=True)
"""
Next, we compute the query, key, and value vectors. It is important to note that W_query, W_key, and W_value are trainable weight matrices: their values are adjusted during training to help the model learn. In the weight matrices W, the term "weight" is short for "weight parameters", the values of a neural network that are optimised during training. This is not to be confused with the attention weights: as we saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input. In summary, weight parameters are the fundamental, learned coefficients that define the network's connections, while attention weights are dynamic, context-specific values.
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)
print(key_2)
"""
tensor([0.766, 2.690], grad_fn=<SqueezeBackward4>)
tensor([0.826, 2.136], grad_fn=<SqueezeBackward4>)
"""
We can obtain all keys and values via matrix multiplication:
keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
print(f"\nKeys: {keys}\n\nValues: {values}")
"""
keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])
Keys: tensor([[-0.040, -0.211],
[ 0.826, 2.136],
[ 0.087, -0.126],
[-0.239, -0.128],
[-0.297, -0.316],
[-0.977, -1.830]], grad_fn=<MmBackward0>)
Values: tensor([[-0.051, -0.145],
[ 0.724, 1.892],
[ 0.014, 0.037],
[ 0.139, -0.641],
[-0.060, -0.620],
[-0.630, -2.040]], grad_fn=<MmBackward0>)
"""
The attention score computation is a dot-product computation similar to what we have used in the simplified self-attention mechanism. The new aspect here is that we are not directly computing the dot-product between the input elements but using the query and key obtained by transforming the inputs via the respective weight matrices.
The dot product between a query q (produced by W_q) and a key k (produced by W_k) intuitively tells us how "relevant" or "similar" that key is to the query in the learned embedding space. As a refresher:
W_q (Query Weight Matrix): W_q transforms the input into queries, representing what the current position is "looking for" in other positions to gather information. It learns to extract the relevant aspects of the input needed to form effective queries.
W_k (Key Weight Matrix): W_k transforms the input into keys, acting as labels or indices representing the information available at each position. It learns to create representations that can be effectively compared with queries to determine relevance.
W_v (Value Weight Matrix): W_v transforms the input into values, holding the actual information content to be extracted from each position and aggregated based on attention weights. It learns to represent the information that is potentially valuable to be attended to and passed forward.
So, to compute the attention score ω22, we use the query and key vectors derived from x(2):
keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)
"""
tensor(6.380, grad_fn=<DotBackward0>)
"""
Or for all of them
attn_scores_2 = query_2 @ keys.T
print(attn_scores_2)
"""
tensor([-0.597, 6.380, -0.272, -0.527, -1.079, -5.670],
grad_fn=<SqueezeBackward4>)
"""
We will now scale the attention scores and apply softmax to obtain the attention weights:
# Attention Is All You Need:
# Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) V
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)
"""
tensor([ 0.007, 0.972, 0.009, 0.007, 0.005, 0.000],
grad_fn=<SoftmaxBackward0>)
"""
The difference from earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension of the keys. The reason for this normalisation is to improve training performance by avoiding small gradients. For instance, when scaling up the embedding dimension, which is typically greater than a thousand for GPT-like LLMs, large dot products can result in very small gradients during backpropagation due to the softmax function applied to them. As dot products increase, the softmax function behaves more like a step function, resulting in gradients nearing zero. These small gradients can drastically slow down learning or cause training to stagnate. The scaling by the square root of the embedding dimension is the reason why this self-attention mechanism is also called scaled dot-product attention.
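A tiny illustration of why the scaling matters (toy scores, not from our example): without scaling, softmax is already close to a one-hot vector, which means near-zero gradients for the other positions.

scores = torch.tensor([1.0, 4.0, 8.0])           # pretend these are large dot products
print(torch.softmax(scores, dim=-1))             # ~[0.001, 0.018, 0.981]: heavily peaked
print(torch.softmax(scores / 8 ** 0.5, dim=-1))  # ~[0.063, 0.184, 0.753]: noticeably softer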
In the final step of the self-attention computation, we compute the context vector by combining all value vectors via the attention weights.
Similar to before, where we computed the context vector as a weighted sum over the input vectors, we now compute the context vector as a weighted sum over the value vectors. Here, the attention weights serve as a weighting factor that weighs the respective importance of each value vector.
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)
"""
tensor([0.704, 1.830], grad_fn=<SqueezeBackward4>)
"""
So far, we have only computed a single context vector; we will now extend the functionality. Before we do, let's recap the query, key, value structure.
The terms "key," "query," and "value" in the context of attention mechanisms are borrowed from the domain of information retrieval and databases, where similar concepts are used to store, search, and retrieve information.
A "query" is analogous to a search query in a database. It represents the current item (e.g., a word or token in a sentence) the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them.
The "key" is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match with the query.
The "value" in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values.
Putting it all together we get this:
import torch.nn as nn
class SelfAttention_v1(nn.Module):
def __init__(self, d_in, d_out):
super().__init__()
self.d_out = d_out
self.W_query = nn.Parameter(torch.rand(d_in, d_out))
self.W_key = nn.Parameter(torch.rand(d_in, d_out))
self.W_value = nn.Parameter(torch.rand(d_in, d_out))
def forward(self, x):
keys = x @ self.W_key
queries = x @ self.W_query
values = x @ self.W_value
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weights @ values
return context_vec
SelfAttention_v1 is a class derived from nn.Module, which is a fundamental building block of PyTorch models, which provides necessary functionalities for model layer creation and management.
The init method initialises trainable weight matrices (W_query, W_key, and W_value) for queries, keys, and values, each transforming the input dimension d_in to an output dimension d_out.
During the forward pass, using the forward method, we compute the attention scores (attn_scores) by multiplying queries and keys, normalizing these scores using softmax. Finally, we create a context vector by weighting the values with these normalised attention scores.
torch.manual_seed(123)
sa_v1= SelfAttention_v1(d_in, d_out)
print(inputs , "\n")
print(sa_v1.forward(inputs))
"""
tensor([[ 0.337, -0.178, -0.169],
[ 0.918, 1.581, 1.301],
[ 1.275, -0.201, -0.161],
[-0.401, 0.967, -1.148],
[-1.159, 0.325, -0.632],
[-2.840, -0.785, -1.410]])
tensor([[ -0.001, -0.321],
[ 0.704, 1.830],
[ 0.202, 0.289],
[ -0.128, -0.686],
[ -0.266, -1.071],
[ -0.600, -1.962]], grad_fn=<MmBackward0>)
"""
In self-attention, we transform the input vectors in the input matrix X with the three weight matrices, Wq, Wk, and Wv. Then, we compute the attention weight matrix based on the resulting queries (Q) and keys (K). Using the attention weights and values (V), we then compute the context vectors (Z). (For visual clarity, we focus on a single input text with n tokens in this figure, not a batch of multiple inputs. Consequently, the 3D input tensor is simplified to a 2D matrix in this context)
Self-attention involves the trainable weight matrices Wq, Wk, and Wv. These matrices transform input data into queries, keys, and values, which are crucial components of the attention mechanism. As the model is exposed to more data during training, it adjusts these trainable weights, as we will see in upcoming chapters.
We can improve the SelfAttention_v1 implementation further by utilizing PyTorch's nn.Linear layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimised weight initialization scheme, contributing to more stable and effective model training.
import torch.nn as nn
class SelfAttention_v2(nn.Module):
def __init__(self, d_in, d_out, qkv_bias=False):
super().__init__()
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
def forward(self, x):
keys = self.W_key(x)
values = self.W_value(x)
query = self.W_query(x)
attn_raw = query @ keys.T
attn_weight = torch.softmax(attn_raw / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weight @ values
return context_vec
Note that nn.Linear in SelfAttention_v2 uses a different weight initialisation scheme than nn.Parameter(torch.rand(d_in, d_out)) in SelfAttention_v1, which causes the two mechanisms to produce different results.
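As a quick sanity check (this mirrors a common exercise; it assumes instances sa_v1 and sa_v2 of the two classes, with sa_v2 created just below): nn.Linear stores its weight with shape (d_out, d_in), so copying the transposed weights from sa_v2 into sa_v1 should make both modules produce identical outputs.

sa_v1.W_query = torch.nn.Parameter(sa_v2.W_query.weight.T)
sa_v1.W_key = torch.nn.Parameter(sa_v2.W_key.weight.T)
sa_v1.W_value = torch.nn.Parameter(sa_v2.W_value.weight.T)
print(torch.allclose(sa_v1(inputs), sa_v2(inputs)))  # True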
Now we modify the standard self-attention mechanism to create a causal attention mechanism, which is essential for developing an LLM in the subsequent chapters. Causal attention, also known as masked attention, is a specialised form of self-attention. It restricts a model to only consider previous and current inputs in a sequence when processing any given token. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once.
Consequently, when computing attention scores, the causal attention mechanism ensures that the model only factors in tokens that occur at or before the current token in the sequence.
In causal attention, we mask out the attention weights above the diagonal such that for a given input, the LLM can't access future tokens when computing the context vectors using the attention weights. For example, for the word "journey" in the second row, we only keep the attention weights for the words before ("Your") and in the current position ("journey").
torch.manual_seed(123)
sa_v2= SelfAttention_v2(d_in, d_out)
So let's get the attention weights like normal
keys = sa_v2.W_key(inputs)
queries = sa_v2.W_query(inputs)
attn_scores= queries @ keys.T
attn_weights = torch.softmax(attn_scores /keys.shape[-1]**0.5, dim=-1)
print(attn_weights)
"""
tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
[0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
[0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
[0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
[0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<SoftmaxBackward0>)
"""
We can use tril to mask this now:
##masking (inefficient way, method 1 for understanding)
context_length = attn_scores.shape[1] #6
x = torch.ones(context_length, context_length)
mask_simple = torch.tril(x)
##Returns the lower triangular part of the matrix (2-D tensor) or batch of matrices input ,
##the other elements of the result tensor out are set to 0
print("mask:\n", mask_simple)
"""
mask:
tensor([[1., 0., 0., 0., 0., 0.],
[1., 1., 0., 0., 0., 0.],
[1., 1., 1., 0., 0., 0.],
[1., 1., 1., 1., 0., 0.],
[1., 1., 1., 1., 1., 0.],
[1., 1., 1., 1., 1., 1.]])
"""
then apply it to our attention weights:
masked_simple = attn_weights * mask_simple
print("masked tensor (unnormalised):\n", masked_simple)
"""
masked tensor (unnormalised):
tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
[0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
[0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<MulBackward0>)
"""
We must now renormalise the attention weights to sum up to 1 again in each row. We can achieve this by dividing each element in each row by the sum in each row:
row_sums = masked_simple.sum(dim=1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)
"""
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
[0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
[0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<DivBackward0>)
"""
row_sums = masked_simple.sum(dim=1, keepdim=True): Calculates the sum of each row in the masked_simple matrix, and stores these sums in row_sums. keepdim=True ensures row_sums remains a column-like matrix.
masked_simple_norm = masked_simple / row_sums: Divides each element in each row of masked_simple by its corresponding row sum, effectively normalizing each row. This makes the values in each row sum up to (approximately) 1.
When we apply a mask and then renormalise the attention weights, it might initially appear that information from future tokens (which we intend to mask) could still influence the current token, because their values were part of the original softmax calculation. The key insight, however, is that renormalising after masking is equivalent to recalculating the softmax over the smaller, unmasked subset. Softmax only depends on relative scores, so zeroing out the future positions and rescaling each row yields exactly the distribution we would have obtained if the masked tokens had never been included in the softmax denominator in the first place.
So far we have calculated the attention scores, applied softmax to obtain the attention weights, masked them with a tril mask, and renormalised each row. There is, however, a more efficient way to handle this: instead of renormalising, we can mask the attention scores with negative-infinity values before applying the softmax function.
The softmax function converts its inputs into a probability distribution. When negative infinity values (-∞) are present in a row, the softmax function treats them as zero probability. (Mathematically, this is because e^-∞ approaches 0.)
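As a quick sanity check (a minimal sketch, not part of the original listing), we can confirm that a -inf score receives zero probability after softmax:
import torch
scores = torch.tensor([1.0, 2.0, float("-inf")])
print(torch.softmax(scores, dim=0))
# tensor([0.2689, 0.7311, 0.0000]); the -inf entry contributes nothing to the denominator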
Note: for torch.triu, the diagonal parameter controls which diagonals are kept:
diagonal=0: includes the main diagonal and the elements above it.
diagonal>0: shifts the boundary upward, excluding the main diagonal and more lower diagonals as the value increases.
diagonal<0: also includes some elements below the main diagonal.
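Here is a small illustration (not from the original text) of how the diagonal argument shifts which elements torch.triu keeps:
import torch
m = torch.ones(3, 3)
print(torch.triu(m, diagonal=0))   # keeps the main diagonal and everything above it
print(torch.triu(m, diagonal=1))   # keeps only the elements strictly above the main diagonal (our causal mask)
print(torch.triu(m, diagonal=-1))  # additionally keeps the first diagonal below the main one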
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
print("Mask:\n", mask, '\n')
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print("Masked tensor (unnormalised):\n", masked)
##Returns the upper triangular part of a matrix (2-D tensor) or batch of matrices input ,
##the other elements of the result tensor out are set to 0.
"""
Mask:
tensor([[0., 1., 1., 1., 1., 1.],
[0., 0., 1., 1., 1., 1.],
[0., 0., 0., 1., 1., 1.],
[0., 0., 0., 0., 1., 1.],
[0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0.]])
Masked tensor (unnormalised):
tensor([[ 0.000, -inf, -inf, -inf, -inf, -inf],
[ 0.038, 0.723, -inf, -inf, -inf, -inf],
[ 0.010, 0.235, 0.098, -inf, -inf, -inf],
[ -0.005, -0.242, -0.108, 0.899, -inf, -inf],
[ -0.015, -0.379, -0.160, 0.600, 0.441, -inf],
[ -0.051, -1.097, -0.454, 0.782, 0.829, 1.967]],
grad_fn=<MaskedFillBackward0>)
"""
Now, all we need to do is apply the softmax function to these masked results, and we are done:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)
"""
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
[0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
[0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
[0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
grad_fn=<SoftmaxBackward0>)
"""
(ignore the inconsistencies in output)
We could now use the modified attention weights to compute the context vectors via context_vec = attn_weights @ values
Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively "dropping" them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It's important to emphasise that dropout is only used during training and is disabled afterward.
In the transformer architecture, including models like GPT, dropout in the attention mechanism is typically applied in one of two places: after computing the attention weights, or after applying the attention weights to the value vectors. Here we apply dropout after computing the attention weights, as this is the more common variant in practice.
Using the causal attention mask (upper left), we apply an additional dropout mask (upper right) to zero out additional attention weights to reduce overfitting during training.
We use a dropout rate of 50%, which means masking out half of the attention weights. To illustrate, we first apply PyTorch's dropout implementation to a 6×6 tensor consisting of ones:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)
example = torch.ones(6,6)
print(dropout(example))
"""
tensor([[0., 0., 2., 2., 0., 2.],
[2., 0., 2., 0., 2., 0.],
[2., 2., 0., 0., 2., 0.],
[0., 2., 2., 2., 2., 0.],
[0., 0., 0., 0., 0., 2.],
[2., 2., 0., 0., 2., 2.]])
"""
When applying dropout to an attention weight matrix with a rate of 50%, half of the elements in the matrix are randomly set to zero. To compensate for the reduction in active elements, the values of the remaining elements in the matrix are scaled up by a factor of 1/0.5 =2. This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.
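To see that the 1/(1 - p) rescaling keeps the expected value roughly unchanged, here is a small sketch (illustrative tensor size and hypothetical names, not from the original text) comparing the mean before and after dropout:
dropout_demo = torch.nn.Dropout(0.5)
big = torch.ones(1000, 1000)
print(big.mean())                # tensor(1.)
print(dropout_demo(big).mean())  # approximately 1.0, since the surviving half is scaled by 1/0.5 = 2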
Let's apply this to our attn_weights
torch.manual_seed(123)
print(dropout(attn_weights))
"""
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
[0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
[0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
[0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
grad_fn=<MulBackward0>)
"""
We will now incorporate the causal attention and dropout modifications into the SelfAttention Python class. This class will then serve as a template for developing multi-head attention.
Before we begin, one more thing: we must ensure that the code can handle batches consisting of more than one input, so that the CausalAttention class supports the batch outputs produced by the data loader we implemented earlier.
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)
print(batch)
"""
torch.Size([2, 6, 3])
tensor([[[ 0.337, -0.178, -0.169],
[ 0.918, 1.581, 1.301],
[ 1.275, -0.201, -0.161],
[-0.401, 0.967, -1.148],
[-1.159, 0.325, -0.632],
[-2.840, -0.785, -1.410]],
[[ 0.337, -0.178, -0.169],
[ 0.918, 1.581, 1.301],
[ 1.275, -0.201, -0.161],
[-0.401, 0.967, -1.148],
[-1.159, 0.325, -0.632],
[-2.840, -0.785, -1.410]]])
"""
This results in a 3D tensor consisting of 2 input texts with 6 tokens each, where each token is a 3-dimensional embedding vector
class CausalAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
super().__init__()
self.d_out = d_out
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.dropout = nn.Dropout(dropout)
#register_buffer stores the mask with the module, so it is automatically moved to the same
#device as the model's parameters, avoiding device mismatch errors
self.register_buffer(
'mask',
torch.triu(torch.ones(context_length, context_length), diagonal=1)
)
def forward(self, x):
b, num_tokens, d_in = x.shape #2 , 6, 3 e.g above
keys = self.W_key(x)
values = self.W_value(x)
queries = self.W_query(x)
#we use keys.transpose(1, 2) rather than keys.T because x is now a batched 3D tensor rather than a 2D matrix
attn_scores = queries @ keys.transpose(1,2)
attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
context_vector = attn_weights @ values
return context_vector
This is quite similar and self-explanatory. We can use the CausalAttention class as follows, just like SelfAttention previously:
#The resulting context vector is a 3D tensor where each token is now
#represented by a 2D embedding
torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, 2, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)
print("\ncontext_vecs:\n", context_vecs)
"""
context_vecs.shape: torch.Size([2, 6, 2])
context_vecs:
tensor([[[ 0.122, -0.215],
[-0.150, -0.123],
[ 0.014, -0.171],
[ 0.304, -0.312],
[ 0.342, -0.315],
[ 0.671, -0.435]],
[[ 0.122, -0.215],
[-0.150, -0.123],
[ 0.014, -0.171],
[ 0.304, -0.312],
[ 0.342, -0.315],
[ 0.671, -0.435]]], grad_fn=<UnsafeViewBackward0>)
"""
The term "multi-head" refers to dividing the attention mechanism into multiple "heads," each operating independently. In this context, a single causal attention module can be considered single-head attention, where there is only one set of attention weights processing the input sequentially
In practical terms, implementing multi-head attention involves creating multiple instances of the self-attention mechanism, each with its own weights, and then combining their outputs. Using multiple instances of the self-attention mechanism can be computationally intensive, but it's crucial for the kind of complex pattern recognition that models like transformer-based LLMs are known for.
The multi-head attention module in this figure depicts two single-head attention modules stacked on top of each other. So, instead of using a single matrix Wv for computing the value matrices, in a multi-head attention module with two heads, we now have two value weight matrices: Wv1 and Wv2 . The same applies to the other weight matrices, Wq and Wk. We obtain two sets of context vectors Z1 and Z2 that we can combine into a single context vector matrix Z.
The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections -- the results of multiplying the input data (like the query, key, and value vectors in attention mechanisms) by a weight matrix.
We can achieve this by implementing a simple MultiHeadAttentionWrapper class that stacks multiple CausalAttention instances:
class MultiHeadAttentionWrapper(nn.Module):
"""
nn.ModuleList is a PyTorch container specifically designed to hold lists of nn.Module instances (like your CausalAttention modules).
"""
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
self.heads = nn.ModuleList([CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) for _ in range(num_heads)])
"""
This line takes the output tensors produced by each individual attention head (when processing input x), and concatenates them along the last dimension.
"""
def forward(self, x):
return torch.cat([h(x) for h in self.heads], dim =-1)
For example, if we use this MultiHeadAttentionWrapper class with two attention heads (via num_heads=2) and a CausalAttention output dimension of d_out=2, this results in 4-dimensional context vectors (d_out*num_heads=4).
Using the MultiHeadAttentionWrapper, we specified the number of attention heads (num_heads). If we set num_heads=2, as shown in this figure, we obtain a tensor with two sets of context vector matrices. In each context vector matrix, the rows represent the context vectors corresponding to the tokens, and the columns correspond to the embedding dimension of each head (d_out=2). We concatenate these context vector matrices along the column dimension. Since we have 2 attention heads and an embedding dimension of 2 per head, the final embedding dimension is 2 × 2 = 4.
torch.manual_seed(123)
context_length = batch.shape[1]
## the final output dimension will be d_out * num_heads = 4
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, 2, False)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
"""
tensor([[[ 0.122, -0.215, 0.270, 0.158],
[-0.150, -0.123, 0.022, 0.380],
[ 0.014, -0.171, 0.289, 0.144],
[ 0.304, -0.312, 0.337, 0.071],
[ 0.342, -0.315, 0.279, 0.134],
[ 0.671, -0.435, 0.442, -0.069]],
[[ 0.122, -0.215, 0.270, 0.158],
[-0.150, -0.123, 0.022, 0.380],
[ 0.014, -0.171, 0.289, 0.144],
[ 0.304, -0.312, 0.337, 0.071],
[ 0.342, -0.315, 0.279, 0.134],
[ 0.671, -0.435, 0.442, -0.069]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])
"""
The first dimension of the resulting context_vecs tensor is 2 since we have two input texts (the input texts are duplicated, which is why the context vectors are exactly the same for those). The second dimension refers to the 6 tokens in each input. The third dimension refers to the 4-dimensional embedding of each token.
This works, but note that the heads are processed sequentially via [head(x) for head in self.heads] in the forward method. We can improve this implementation by processing the heads in parallel.
Instead of maintaining two separate classes, MultiHeadAttentionWrapper and CausalAttention, we can combine both of these concepts into a single MultiHeadAttention class.
Instead of running the heads sequentially and concatenating them, we integrate the multi-head functionality within a single class: it splits the input into multiple heads by reshaping the projected query, key, and value tensors, and then combines the results from these heads after computing attention.
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by number of heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads
self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
self.out_proj = nn.Linear(d_out, d_out)
self.dropout = nn.Dropout(dropout)
self.register_buffer(
'mask',
torch.triu(torch.ones(context_length, context_length), diagonal=1))
def forward(self, x):
b, num_tokens, d_in = x.shape
keys = self.W_key(x)
queries = self.W_query(x)
values = self.W_value(x)
## the dimensions of keys before the .view() operation are (b, num_tokens, d_out).
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
keys = keys.transpose(1,2)
queries = queries.transpose(1,2)
values = values.transpose(1,2)
attn_scores = queries @ keys.transpose(2,3)
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)
context_vec = (attn_weights @ values).transpose(1,2)
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec)
return context_vec
The .view() operation in PyTorch is fundamentally about reshaping tensors. Imagine you have a block of data arranged in a particular shape, and you want to rearrange it into a different shape without altering the underlying data itself. That's precisely what .view() does. In the context of multi-head attention, the initial input features, after being projected into keys, queries, and values, are in a shape that represents batch size, sequence length, and the overall feature dimension (d_out). To implement the multi-head mechanism, we need to conceptually divide this feature dimension into multiple 'heads'. This is achieved using .view(). It takes the flat feature dimension and reshapes it into two new dimensions: one representing the number of heads and the other representing the dimension of each head (head_dim). For instance, a tensor initially shaped as (batch_size, sequence_length, d_out) might be reshaped using .view() into (batch_size, sequence_length, num_heads, head_dim). This reshaping doesn't change the data, but rather organises it in a way that makes it conceptually and computationally ready for processing by multiple independent attention heads
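A shape-only sketch of this reshaping step (the numbers are illustrative and the variable name hypothetical, not the original data):
keys_demo = torch.randn(2, 6, 4)        # (b, num_tokens, d_out)
keys_demo = keys_demo.view(2, 6, 2, 2)  # (b, num_tokens, num_heads, head_dim)
print(keys_demo.shape)                  # torch.Size([2, 6, 2, 2]); same data, feature dimension split into heads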
The .transpose() operation, on the other hand, is about rearranging the order of dimensions within a tensor. Think of it as if you're rotating or re-orienting a multi-dimensional array. In the MultiHeadAttention code, .transpose() is strategically employed to reorder the dimensions to facilitate efficient batched matrix operations, which are the core of the attention mechanism. Specifically, after using .view() to create the 'head' dimension, the code then uses .transpose() to move the 'number of heads' dimension to be right after the batch dimension. So, a tensor initially of shape like (batch_size, sequence_length, num_heads, head_dim) is transformed to (batch_size, num_heads, sequence_length, head_dim). This dimension reordering is not arbitrary; it's carefully designed to prepare the tensors for subsequent matrix multiplications that calculate the attention scores and context vectors
To illustrate this batched matrix multiplication, suppose we have the following example tensor:
a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573], #A
[0.8993, 0.0390, 0.9268, 0.7388],
[0.7179, 0.7058, 0.9156, 0.4340]],
[[0.0772, 0.3565, 0.1479, 0.5331],
[0.4066, 0.2318, 0.4545, 0.9737],
[0.4606, 0.5159, 0.4220, 0.5786]]]])
Now, we perform a batched matrix multiplication between the tensor itself and a view of the tensor where we transposed the last two dimensions, num_tokens and head_dim:
print(a @ a.transpose(2, 3))
tensor([[[[1.3208, 1.1631, 1.2879],
[1.1631, 2.2150, 1.8424],
[1.2879, 1.8424, 2.0402]],
[[0.4391, 0.7003, 0.5903],
[0.7003, 1.3737, 1.0620],
[0.5903, 1.0620, 0.9912]]]])
The code snippet demonstrates how matrix multiplication is performed on a 4-dimensional tensor, which is a common operation within multi-head attention mechanisms, particularly when calculating attention scores. The key idea here is that we're not just doing a single matrix multiplication, but rather performing many matrix multiplications at once, in batches, and in parallel across the attention heads. The example starts by creating a 4D tensor a. If you look at its shape implicitly, it's designed to represent (batch_size=1, num_heads=2, num_tokens=3, head_dim=4). When we perform a @ a.transpose(2, 3), PyTorch intelligently understands that it needs to perform matrix multiplication between the last two dimensions, which are (num_tokens, head_dim) and (head_dim, num_tokens) after the transpose, for each of the 'heads' and for each batch item.
This batched matrix multiplication is simply a more compact way of computing the matrix multiplication for each head separately:
first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)
second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)
##The results are exactly the same results that we obtained when using the batched matrix multiplication print(a @ a.transpose(2, 3)) earlier:
"""
First head:
tensor([[1.3208, 1.1631, 1.2879],
[1.1631, 2.2150, 1.8424],
[1.2879, 1.8424, 2.0402]])
Second head:
tensor([[0.4391, 0.7003, 0.5903],
[0.7003, 1.3737, 1.0620],
[0.5903, 1.0620, 0.9912]])
"""
On a big-picture level, in the previous MultiHeadAttentionWrapper, we stacked multiple single-head attention layers that we combined into a multi-head attention layer. The MultiHeadAttention class takes an integrated approach: it starts with a multi-head layer and then internally splits this layer into individual attention heads.
In the MultiHeadAttentionWrapper class with two attention heads, we initialised two weight matrices Wq1 and Wq2 and computed two query matrices Q1 and Q2, as illustrated at the top of this figure. In the MultiHeadAttention class, we initialise one larger weight matrix Wq, perform only one matrix multiplication with the inputs to obtain a query matrix Q, and then split the query matrix into Q1 and Q2, as shown at the bottom of this figure. We do the same for the keys and values, which are not shown to reduce visual clutter.
As a reminder, d_in is the input dimension (the embedding "resolution" of each token), while d_out is simply a design choice.
Continuing with MultiHeadAttention, after computing the attention weights and context vectors, the context vectors from all heads are transposed back to the shape (b, num_tokens, num_heads, head_dim). These vectors are then reshaped (flattened) into the shape (b, num_tokens, d_out), effectively combining the outputs from all heads
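A shape-only sketch (illustrative sizes and a hypothetical variable name) of how the per-head outputs are merged back into a single d_out dimension:
context_demo = torch.randn(2, 2, 6, 2)                  # (b, num_heads, num_tokens, head_dim)
context_demo = context_demo.transpose(1, 2)             # (b, num_tokens, num_heads, head_dim)
context_demo = context_demo.contiguous().view(2, 6, 4)  # (b, num_tokens, d_out) with d_out = num_heads * head_dim
print(context_demo.shape)                               # torch.Size([2, 6, 4])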
Additionally, we added a so-called output projection layer (self.out_proj) to MultiHeadAttention after combining the heads, which is not present in the CausalAttention class. This output projection layer is not strictly necessary, but it is commonly used in many LLM architectures, which is why we add it here for completeness.
The MultiHeadAttention class can be used similarly to the SelfAttention and CausalAttention classes we implemented earlier:
torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs.shape)
print(context_vecs)
"""
torch.Size([2, 6, 2])
tensor([[[ 0.126, 0.699],
[ 0.208, 0.451],
[ 0.165, 0.586],
[ 0.035, 0.975],
[ 0.030, 0.985],
[-0.077, 1.308]],
[[ 0.126, 0.699],
[ 0.208, 0.451],
[ 0.165, 0.586],
[ 0.035, 0.975],
[ 0.030, 0.985],
"""
We will now code the other building blocks of an LLM and assemble them into a GPT-like model that we will train in the next chapter to generate human-like text
LLMs, such as GPT (which stands for Generative Pretrained Transformer), are large deep neural network architectures designed to generate new text one word (or token) at a time.
We have already covered input tokenisation, embedding layers, and masked multi-head attention; we will now focus on implementing the core structure of the GPT model, including the transformer blocks. We are scaling up to the size of a small GPT-2 model, specifically the smallest version with 124 million parameters, as described in Radford et al.'s paper.
Let's detail the configuration
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"context_length": 1024, # Context length
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-Key-Value bias
}
To note: in a linear layer, the basic operation is:
output = input * weight + bias
input: the input data (in our case, the input embedding x).
weight: a learnable weight matrix, learned during training, that determines how different features of the input are weighted and combined to produce the output.
bias: an optional learnable bias vector, added to the result of the matrix multiplication (input * weight). The bias allows the linear transformation to shift the output; without it, the transformation would always map a zero input to a zero output, so a bias gives the model more flexibility to learn complex relationships.
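We can verify this relationship directly with a small check (hypothetical names, not part of the original code): nn.Linear stores a weight of shape (out_features, in_features) and computes x @ W.T + b:
import torch
import torch.nn as nn
lin = nn.Linear(3, 2)                       # weight shape (2, 3), bias shape (2,)
x_demo = torch.randn(1, 3)
manual = x_demo @ lin.weight.T + lin.bias
print(torch.allclose(lin(x_demo), manual))  # True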
Let us start by building a backbone for our GPT.
import torch
import torch.nn as nn
class DummyGPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
self.drop_emb = nn.Dropout(cfg["drop_rate"])
#* unpacks elements of the list created by the list comprehension and passes them as individual arguments to the nn.Sequential constructor
self.trf_blocks = nn.Sequential(*[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]) #c: Creates a sequence of 'n_layers' DummyTransformerBlock instances
self.final_norm = DummyLayerNorm(cfg["emb_dim"]) #c: Initialises a DummyLayerNorm layer for final normalization
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False) #c: Initialises a linear layer to project embeddings to vocab size for output logits (no bias)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape #c: Gets the batch size and sequence length from the input index tensor shape
tok_embeds = self.tok_emb(in_idx)
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
x = tok_embeds + pos_embeds
x = self.drop_emb(x)
x = self.trf_blocks(x) #c: Passes the input through the sequence of transformer blocks
x = self.final_norm(x) #c: Applies final layer normalization to the output of transformer blocks
logits = self.out_head(x) #c: Passes the normalised output through the output linear layer to get prediction logits
return logits
class DummyTransformerBlock(nn.Module): #C: Defines a class 'DummyTransformerBlock' inheriting from nn.Module (representing a transformer block)
def __init__(self, cfg):
super().__init__()
def forward(self, x): #D: Defines the forward pass method for DummyTransformerBlock (currently a placeholder)
return x
class DummyLayerNorm(nn.Module): #E: Defines a class 'DummyLayerNorm' inheriting from nn.Module (representing layer normalization)
def __init__(self, normalized_shape, eps=1e-5): #F: Initialises DummyLayerNorm with normalised shape and epsilon for numerical stability
super().__init__()
def forward(self, x):
return x
The DummyGPTModel class in this code defines a simplified version of a GPT-like model using PyTorch's neural network module (nn.Module). The model architecture in the DummyGPTModel class consists of token and positional embeddings, dropout, a series of transformer blocks (DummyTransformerBlock), a final layer normalization (DummyLayerNorm), and a linear output layer (out_head). The configuration is passed in via a Python dictionary, for instance, the GPT_CONFIG_124M dictionary we created earlier. The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalisation, and finally produces logits with the linear output layer. The code above is already functional, as we will see later after we prepare the input data. However, for now, note that we have used placeholders (DummyLayerNorm and DummyTransformerBlock) for the transformer block and layer normalisation, which we will develop later.
Next, we will prepare the input data and initialise a new GPT model to illustrate its usage.
Let's implement some tokenisation again
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
"""
tensor([[ 6109, 3626, 6100, 345], #A
[ 6109, 1110, 6622, 257]])
"""
Next, we initialise a new 124 million parameter DummyGPTModel instance and feed it the tokenised batch:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)
"""
Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.929, 0.275, -0.756, ..., -1.607, 0.270, -0.589],
[-0.448, 0.173, 0.535, ..., -0.393, 1.529, 0.856],
[ 0.568, 1.605, -0.216, ..., 1.162, 0.138, 0.742],
[ 0.045, 2.479, -0.884, ..., 1.322, -0.086, -0.586]],
[[-1.547, -0.054, -1.057, ..., -1.806, -0.449, -0.675],
[-0.842, 0.824, -0.110, ..., -0.143, 0.208, 1.205],
[ 0.136, 1.186, -0.145, ..., 0.087, -0.159, 0.155],
[ 0.167, -0.814, 0.231, ..., 2.504, -0.306, -0.308]]],
grad_fn=<UnsafeViewBackward0>)
"""
The output tensor has two rows corresponding to the two text samples. Each text sample consists of 4 tokens; each token is a 50,257-dimensional vector, which matches the size of the tokeniser's vocabulary. The embedding has 50,257 dimensions because each of these dimensions refers to a unique token in the vocabulary. At the end of this chapter, when we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words.
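As a preview of that postprocessing (a hedged sketch, not the final implementation), the simplest conversion is a greedy argmax over the vocabulary dimension followed by decoding with the tokeniser; since the model is untrained, the decoded text will be meaningless:
next_token_ids = torch.argmax(logits, dim=-1)         # shape (2, 4): the most likely vocabulary index at each position
print(next_token_ids.shape)
print(tokenizer.decode(next_token_ids[0].tolist()))   # decode the predictions for the first text sample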
To understand the output better, we can see how the matrices are altered throughout each layer:
class DummyGPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
# Token embedding layer: vocab_size x emb_dim (50257 x 768)
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
# Positional embedding layer: context_length x emb_dim (1024 x 768)
self.drop_emb = nn.Dropout(cfg["drop_rate"])
# Dropout layer with specified dropout rate (0.1)
self.trf_blocks = nn.Sequential(*[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
# Sequential container of n_layers (12) DummyTransformerBlocks
self.final_norm = DummyLayerNorm(cfg["emb_dim"])
# Layer normalization applied to the embedding dimension (768)
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
# Linear layer to project embeddings to vocab size (768 -> 50257), no bias
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
# in_idx shape: (batch_size, seq_len) = (2, 4) from input batch
tok_embeds = self.tok_emb(in_idx)
# Token embeddings: (batch_size, seq_len, emb_dim) = (2, 4, 768), each token ID becomes an embedding vector
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
# Positional embeddings: (seq_len, emb_dim) = (4, 768), creates embeddings for each position in the sequence
x = tok_embeds + pos_embeds
# Add token and positional embeddings: (batch_size, seq_len, emb_dim) = (2, 4, 768), element-wise addition
x = self.drop_emb(x)
# Apply dropout to embeddings: (batch_size, seq_len, emb_dim) = (2, 4, 768), shape remains the same, some elements are randomly zeroed
x = self.trf_blocks(x)
# Pass through transformer blocks: (batch_size, seq_len, emb_dim) = (2, 4, 768), shape remains the same in dummy blocks, but would change in real Transformer blocks
x = self.final_norm(x)
# Apply final layer norm: (batch_size, seq_len, emb_dim) = (2, 4, 768), shape remains the same, normalises features across emb_dim
logits = self.out_head(x)
# Output linear layer: (batch_size, seq_len, vocab_size) = (2, 4, 50257), projects emb_dim to vocab_size to get logits for each token in vocab
return logits
The trf_blocks represent the core processing units of a Transformer model. In a real GPT model, DummyTransformerBlock would be replaced by actual Transformer blocks. nn.Sequential is used to stack these blocks one after another. The output of one DummyTransformerBlock becomes the input to the next, creating a pipeline of processing.
The final_norm represents a Layer Normalization layer placed after the Transformer blocks and before the output head. This helps stabilise training, leading to better generalisation performance and faster convergence. Unlike Batch Normalization, Layer Normalization operates on each input independently (normalising across the feature dimension for each example), making it well-suited for sequential data and variable sequence lengths; it also works well with small batch sizes.
The out_head is the final linear layer that projects the model's internal representations into the vocabulary space. A linear layer is used for the output head because we want a linear projection from the learned embedding space to the vocabulary space: to predict the next token, we map each position's representation into the vocabulary space, producing logits.
Let's now implement the components we have so far only stubbed out as dummy classes.
Training deep neural networks with many layers can sometimes prove challenging due to issues like vanishing or exploding gradients. These issues lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimises the loss function.
We will implement layer normalisation to improve the stability and efficiency of neural network training. The main idea behind layer normalisation is to adjust the outputs or activations of a neural network to have a mean of zero and a variance of one (unit variance). This adjustment speeds up convergence to effective weights and ensures consistent, reliable training. Layer normalisation is typically applied before and after the multi-head attention module and before the final output layer.
An illustration of layer normalization where the 5 layer outputs, also called activations, are normalised such that they have a zero mean and variance of 1.
We can recreate this example, like so:
import torch
torch.manual_seed(123)
batch_example = torch.randn(2, 5) #A
print(batch_example, end="\n\n")
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)
"""
tensor([[-0.1115, 0.1204, -0.3696, -0.2404, -1.1969],
[ 0.2093, -0.9724, -0.7550, 0.3239, -0.1085]])
tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
[0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
grad_fn=<ReluBackward0>)
"""
The neural network layer we have coded consists of a Linear layer followed by a non-linear activation function, ReLU, which is a standard activation function in neural networks. Let's now find the mean and variance:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
"""
Mean:
tensor([[0.1324],
[0.2170]], grad_fn=<MeanBackward1>)
Variance:
tensor([[0.0231],
[0.0398]], grad_fn=<VarBackward0>)
"""
Using keepdim=True in operations like mean or variance calculation ensures that the output tensor retains the same shape as the input tensor, even though the operation reduces the tensor along the dimension specified via dim. For instance, without keepdim=True, the returned mean tensor would be a 2-dimensional vector [0.1324, 0.2170] instead of a 2×1-dimensional matrix [[0.1324], [0.2170]].
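A quick shape check on the out tensor from above (a small sketch, not in the original listing):
print(out.mean(dim=-1).shape)                # torch.Size([2]); the reduced dimension is dropped
print(out.mean(dim=-1, keepdim=True).shape)  # torch.Size([2, 1]); the reduced dimension is kept with size 1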
An illustration of the dim parameter when calculating the mean of a tensor. For instance, if we have a 2D tensor (matrix) with dimensions [rows, columns], using dim=0 will perform the operation across rows (vertically, as shown at the bottom), resulting in an output that aggregates the data for each column. Using dim=1 or dim=-1 will perform the operation across columns (horizontally, as shown at the top), resulting in an output aggregating the data for each row.
Using dim=-1 for operations such as mean or variance calculation is the same as using dim=1. This is because -1 refers to the tensor's last dimension, which corresponds to the columns in a 2D tensor. Later, when adding layer normalisation to the GPT model, which produces 3D tensors with shape [batch_size,num_tokens, embedding_size], we can still use dim=-1 for normalisation across the last dimension, avoiding a change from dim=1 to dim=2.
Next, let us apply layer normalization to the layer outputs we obtained earlier. The operation consists of subtracting the mean and dividing by the square root of the variance (also known as standard deviation):
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalised layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)
"""
Normalised layer outputs:
tensor([[ 0.6159, 1.4126, -0.8719, 0.5872, -0.8719, -0.8719],
[-0.0189, 0.1121, -1.0876, 1.5173, 0.5647, -1.0876]],
grad_fn=<DivBackward0>)
Mean:
tensor([[9.9341e-09],
[0.0000e+00]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.0000],
[1.0000]], grad_fn=<VarBackward0>)
"""
Let's now encapsulate this process in a PyTorch module that we can use in the GPT model
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + self.eps)
return self.scale * norm_x + self.shift
nn.Parameter is a special class in PyTorch that is used to register tensors as learnable parameters of a module. When you wrap a tensor with nn.Parameter, PyTorch treats it as a parameter of the neural network. Parameters registered using nn.Parameter are automatically tracked by the optimiser (like Adam, SGD) when you train your neural network. This means their values are updated during the backpropagation and optimization process. When you call model.parameters() on your LayerNorm module or any module containing nn.Parameter, these parameters are returned as part of the model's learnable parameters.
This specific implementation of layer Normalization operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim). The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization. The scale and shift are two trainable parameters (of the same dimension as the input) that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task. This allows the model to learn appropriate scaling and shifting that best suit the data it is processing.
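We can confirm that scale and shift are registered as learnable parameters (a small check with a hypothetical instance name):
ln_check = LayerNorm(emb_dim=5)
print([(name, tuple(p.shape)) for name, p in ln_check.named_parameters()])
# [('scale', (5,)), ('shift', (5,))]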
Let's test the LayerNorm module in practice and apply it to the batch input
print(batch_example)
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
"""
tensor([[-0.1115, 0.1204, -0.3696, -0.2404, -1.1969],
[ 0.2093, -0.9724, -0.7550, 0.3239, -0.1085]])
Mean:
tensor([[-2.9802e-08],
[ 0.0000e+00]], grad_fn=<MeanBackward1>)
Variance:
tensor([[1.0000],
[1.0000]], grad_fn=<VarBackward0>)
"""
Thus far we have created the backbone and the layer normalisation; next we will focus on the GELU activation function, which is one of the activation functions used in LLMs instead of ReLU.
We implement a small neural network submodule that is used as part of the transformer block in LLMs, beginning with the GELU activation function. GELU and SwiGLU are more complex, smooth activation functions that incorporate Gaussian and sigmoid-gated linear units, respectively. The exact version of GELU is defined as GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.
In practice, however, it's common to implement a computationally cheaper approximation: GELU(x) ≈ 0.5·x·(1 + tanh[√(2/π)·(x + 0.044715·x³)]).
Comparing ReLU with GELU, we can see that the latter increases more smoothly. The smooth, non-monotonic shape of GELU helps LLMs learn more complex features and gives better gradient flow during training, especially in deep networks, leading to better performance. Moreover, unlike ReLU, which outputs zero for any negative input, GELU allows a small, non-zero output for negative values. This means that during training, neurons that receive negative input can still contribute to the learning process, albeit to a lesser extent than positive inputs.
In code, we can implement this function as a PyTorch module as follows:
class GELU(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
return 0.5 * x * (1 + torch.tanh(torch.sqrt(torch.tensor(2.0 / torch.pi)) *(x + 0.044715 * torch.pow(x, 3))))
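As a sanity check (a minimal sketch; it assumes a PyTorch version in which torch.nn.functional.gelu accepts approximate="tanh"), we can compare our implementation against PyTorch's built-in tanh approximation:
gelu = GELU()
x_demo = torch.linspace(-3, 3, 7)
print(torch.allclose(gelu(x_demo), torch.nn.functional.gelu(x_demo, approximate="tanh"), atol=1e-6))  # True, both use the same tanh approximation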
Let's build up our FeedForward module starting with GELU to implement a small neural network.
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
The FeedForward module is a small neural network consisting of two Linear layers and a GELU activation function. In the 124 million parameter GPT model, it receives the input batches with tokens that have an embedding size of 768 each via the GPT_CONFIG_124M dictionary, where GPT_CONFIG_124M["emb_dim"] = 768.
The FeedForward network in Transformers employs a 4x dimension increase because it dramatically boosts the model's capacity to learn intricate patterns. By expanding the embedding dimension by a factor of four in the first linear layer and then applying the non-linear GELU activation within this larger space, the network gains a significantly richer and more expressive internal representation. This expanded dimensionality acts as a higher-resolution canvas for feature transformation, enabling the model to capture more nuanced relationships in the data.
Our FFN does just that:
ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768) #A
out = ffn(x)
print(out.shape)
"""
torch.Size([2, 3, 768])
"""
We have now implemented not only GELU but also a feed forward network. Only shortcut connections remain before we can assemble the transformer block.
Shortcut connections were proposed for deep networks in computer vision (specifically, in residual networks) to mitigate the challenge of vanishing gradients. The vanishing gradient problem refers to the issue where gradients (which guide weight updates during training) become progressively smaller as they propagate backward through the layers, making it difficult to effectively train earlier layers.
Below is a comparison between a deep neural network consisting of 5 layers without (on the left) and with shortcut connections (on the right). Shortcut connections involve adding the inputs of a layer to its outputs, effectively creating an alternate path that bypasses certain layers.
A shortcut connection creates an alternative, shorter path for the gradient to flow through the network by skipping one or more layers, which is achieved by adding the output of one layer to the output of a later layer. This is why these connections are also known as skip connections. They play a crucial role in preserving the flow of gradients during the backward pass in training
Skip connections provide an alternative pathway for gradients to flow more directly from later layers to earlier layers, bypassing several layers of transformations. Imagine a standard neural network layer transforming an input x into F(x). With a skip connection, instead of just passing F(x) to the next layer, we add the original input x to the output, resulting in H(x) = F(x) + x. This seemingly simple addition has a profound impact on gradient flow.
During backpropagation, when calculating the gradient of the loss with respect to the input x, the gradient now has two components due to the skip connection: one from the transformation path F(x) and another directly from the identity path x. Mathematically, the gradient of H(x) = F(x) + x with respect to x includes 1 (from the derivative of x with respect to x) in addition to the derivative of F(x) with respect to x. This constant addition of 1, or more accurately, the identity gradient path, ensures that even if the gradients from F(x) become very small due to passing through multiple layers, the gradient signal flowing back through the skip connection remains significant.
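We can see the extra identity path directly with autograd in a toy example (not from the original text): for H(x) = F(x) + x with F(x) = w * x, the gradient dH/dx is w + 1 rather than just w:
x_demo = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(0.1)
h = w * x_demo + x_demo   # F(x) + x, with F(x) = w * x
h.backward()
print(x_demo.grad)        # tensor(1.1000); the extra 1 comes from the identity (skip) path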
Here we have a simplified version of the Feed-Forward Network part of a Transformer block, including the crucial concept of skip connections:
class ExampleDeepNeuralNetwork(nn.Module):
def __init__(self, layer_sizes, use_shortcut):
super().__init__()
self.use_shortcut = use_shortcut # Store whether to use shortcut connections, controlled by the 'use_shortcut' parameter
self.layers = nn.ModuleList([ # Use nn.ModuleList to hold a list of layers, which is properly registered as part of the module
# Implement 5 layers, each being a Sequential block of Linear layer followed by GELU activation
nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), nn.GELU()), # Layer 1: Linear transformation from input size to the second size, followed by GELU activation
nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), nn.GELU()), # Layer 2: Linear transformation from the second size to the third size, followed by GELU activation
nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), nn.GELU()), # Layer 3: Linear transformation from the third size to the fourth size, followed by GELU activation
nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), nn.GELU()), # Layer 4: Linear transformation from the fourth size to the fifth size, followed by GELU activation
nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), nn.GELU()) # Layer 5: Linear transformation from the fifth size to the output size, followed by GELU activation
])
def forward(self, x):
for layer in self.layers: # Iterate through each layer defined in self.layers ModuleList
layer_output = layer(x) # Pass the input 'x' through the current layer, getting the 'layer_output'
if self.use_shortcut and x.shape == layer_output.shape: # Check if shortcut connections are enabled AND if input 'x' shape is the same as 'layer_output' shape
x = x + layer_output # If both conditions are true, apply a shortcut connection: add the original input 'x' to the 'layer_output'
else:
x = layer_output # If shortcut is not enabled or shapes are different, simply update 'x' to be the 'layer_output' for the next layer
return x # After passing through all layers, return the final transformed 'x'
The code implements a deep neural network with 5 layers, each consisting of a Linear layer and a GELU activation function. In the forward pass, we iteratively pass the input through the layers and optionally add the shortcut connections. To note: nn.Sequential is designed to create a linear pipeline of operations. When you pass input to an nn.Sequential container, it automatically flows through the layers in the exact order they are defined, nn.ModuleList is simply a container to hold a list of nn.Module objects. It registers these modules as part of your main module, so PyTorch knows about their parameters. Crucially, nn.ModuleList does not define how the data flows between these modules. It just holds them in a list, unlike nn.Sequential which would treat these nested nn.Sequential blocks as just another set of operations to be applied in a fixed sequence. It would execute them one after another, but we would lose the ability to insert our conditional shortcut logic between each of these nested
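A minimal illustration of the difference (illustrative layer sizes, hypothetical names): nn.Sequential routes the data automatically, while nn.ModuleList leaves the data flow to us, which is exactly where the conditional shortcut logic fits in:
seq = nn.Sequential(nn.Linear(3, 3), nn.GELU())     # calling seq(x) runs the layers in order automatically
mods = nn.ModuleList([nn.Linear(3, 3), nn.GELU()])  # only a registered container; calling mods(x) is not defined
x_demo = torch.randn(1, 3)
out_seq = seq(x_demo)
out_mods = x_demo
for m in mods:
    out_mods = m(out_mods)  # we route the data manually, so custom logic (e.g. shortcuts) can be inserted here
print(out_seq.shape, out_mods.shape)  # torch.Size([1, 3]) torch.Size([1, 3])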
Let's use this code to first initialise a neural network without shortcut connections. Here, each layer will be initialised such that it accepts an example with 3 input values and returns 3 output values. The last layer returns a single output value:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123) # specify the random seed for the initial weights for reproducibility
model_without_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=False)
Next, we implement a function that computes the gradients in the model's backward pass:
def print_gradients(model, x):
# Forward pass
output = model(x)
target = torch.tensor([[0.]])
# Calculate loss based on how close the target
# and output are
loss = nn.MSELoss()
loss = loss(output, target)
# Backward pass to calculate the gradients
loss.backward()
for name, param in model.named_parameters():
if 'weight' in name:
# Print the mean absolute gradient of the weights
print(f"{name} has gradient mean of {param.grad.abs().mean()}")
output = model(x) feeds the input x through the provided model to get the model's prediction, stored in the output variable. This is the standard forward propagation in a neural network. We then specify a loss function that computes how close the model output is to a user-specified target (here, for simplicity, the value 0). When calling loss.backward(), PyTorch computes the loss gradient for each layer in the model. We can iterate through the weight parameters via model.named_parameters(). Suppose we have a 3×3 weight parameter matrix for a given layer; in that case, this layer will have 3×3 gradient values, and we print the mean absolute gradient of these 3×3 gradient values to obtain a single gradient value per layer, making it easier to compare the gradients between layers.
print_gradients(model_without_shortcut, sample_input)
"""
layers.0.0.weight has gradient mean of 0.0006052234675735235
layers.1.0.weight has gradient mean of 0.00036035312223248184
layers.2.0.weight has gradient mean of 0.00214573135599494
layers.3.0.weight has gradient mean of 0.00419655442237854
layers.4.0.weight has gradient mean of 0.015148814767599106
"""
As we can see based on the output of the print_gradients function, the gradients become smaller as we progress from the last layer (layers.4) to the first layer (layers.0), which is a phenomenon called the vanishing gradient problem.
Let's now instantiate a model with skip connections and see how it compares:
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=True)
print_gradients(model_with_shortcut, sample_input)
"""
layers.0.0.weight has gradient mean of 0.22186796367168427
layers.1.0.weight has gradient mean of 0.207092747092247
layers.2.0.weight has gradient mean of 0.32923877239227295
layers.3.0.weight has gradient mean of 0.2667771875858307
layers.4.0.weight has gradient mean of 1.3268063068389893
"""
Shortcut connections are important for overcoming the limitations posed by the vanishing gradient problem in deep neural networks. Shortcut connections are a core building block of very large models such as LLMs, and they will help facilitate more effective training by ensuring consistent gradient flow across layers when we train the GPT model.
We will now connect all of the previously covered concepts (layer normalization, GELU activations, feed forward module, and shortcut connections) in a transformer block. The transformer block is a fundamental building block of GPT and other LLM architectures. This block, which is repeated a dozen times in the 124 million parameter GPT-2 architecture, combines several concepts we have previously covered: multi-head attention, layer normalization, dropout, feed forward layers, and GELU activations.
When a transformer block processes an input sequence, each element in the sequence (for example, a word or subword token) is represented by a fixed-size vector (in the case of Figure 4.13, 768 dimensions). The operations within the transformer block, including multi-head attention and feed forward layers, are designed to transform these vectors in a way that preserves their dimensionality.
The idea is that the self-attention mechanism in the multi-head attention block identifies and analyzes relationships between elements in the input sequence. In contrast, the feed forward network modifies the data individually at each position. This combination not only enables a more nuanced understanding and processing of the input but also enhances the model's overall capacity for handling complex data patterns. In code, we can create the TransformerBlock as follows:
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dropout=cfg["drop_rate"],
qkv_bias=cfg["qkv_bias"])
self.ff = FeedForward(cfg)
self.norm1 = LayerNorm(cfg["emb_dim"])
self.norm2 = LayerNorm(cfg["emb_dim"])
self.drop_resid = nn.Dropout(cfg["drop_rate"])
def forward(self, x):
#look at the transformer above it matches it!
shortcut = x
x = self.norm1(x)
x = self.att(x)
x = self.drop_resid(x)
x = x + shortcut # Add the original input back
shortcut = x #B
x = self.norm2(x)
x = self.ff(x)
x = self.drop_resid(x)
x = x + shortcut #C
return x
The given code defines a TransformerBlock class in PyTorch that includes a multi-head attention mechanism (MultiHeadAttention) and a feed forward network (FeedForward), both configured based on a provided configuration dictionary (cfg), such as GPT_CONFIG_124M. Layer normalization (LayerNorm) is applied before each of these two components, and dropout is applied after them to regularise the model and prevent overfitting. This is also known as Pre-LayerNorm. Older architectures, such as the original transformer model, applied layer normalization after the self-attention and feed-forward networks instead, known as Post-LayerNorm, which often leads to worse training dynamics. The class also implements the forward pass, where each component is followed by a shortcut connection that adds the input of the block to its output.
Let's instantiate a transformer block and feed it some sample data:
torch.manual_seed(123)
x = torch.rand(2, 4, 768) #A
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
"""
Input shape: torch.Size([2, 4, 768])
Output shape: torch.Size([2, 4, 768])
"""
As we can see from the code output, the transformer block maintains the input dimensions in its output, indicating that the transformer architecture processes sequences of data without altering their shape throughout the network. The preservation of shape throughout the transformer block architecture is not incidental but a crucial aspect of its design. This design enables its effective application across a wide range of sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship. However, the output is a context vector that encapsulates information from the entire input sequence, as we learned. This means that while the physical dimensions of the sequence (length and feature size) remain unchanged as it passes through the transformer block, the content of each output vector is re-encoded to integrate contextual information from across the entire input sequence. With the transformer block implemented in this section, we now have all the building blocks.
The transformer block combines layer normalization, the feed forward network, including GELU activations, and shortcut connections
We started with a big-picture overview of a GPT architecture that we called DummyGPTModel. In this DummyGPTModel code implementation, we showed the input and outputs to the GPT model, but its building blocks remained a black box using a DummyTransformerBlock and DummyLayerNorm class as placeholders.
Now we are replacing the DummyTransformerBlock and DummyLayerNorm with real classes to make the model fully functional. Let's remind ourselves of the transformer model again, though:
An overview of the GPT model architecture. This figure illustrates the flow of data through the GPT model. Starting from the bottom, tokenised text is first converted into token embeddings, which are then augmented with positional embeddings. This combined information forms a tensor that is passed through a series of transformer blocks shown in the center (each containing multi-head attention and feed forward neural network layers with dropout and layer normalization), which are stacked on top of each other and repeated 12 times.
In the case of GPT this transformer block is repeated 12 times. The output from the final transformer block then goes through a final layer normalization step before reaching the linear output layer. This layer maps the transformer's output to a high-dimensional space (in this case, 50,257 dimensions, corresponding to the model's vocabulary size) to predict the next token in the sequence.
Let's now implement the architecture we see
import torch
import torch.nn as nn
class GPTModel(nn.Module):
def __init__(self, cfg):
super().__init__()
# Token Embedding Layer: Converts token indices to dense embeddings
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
# Positional Embedding Layer: Embeddings for token positions in the sequence
self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
# Dropout Layer for Embeddings: Applies dropout to the combined token and positional embeddings to prevent overfitting
self.drop_emb = nn.Dropout(cfg["drop_rate"])
# Sequential Transformer Blocks: Stack of TransformerBlock layers (defined elsewhere)
self.trf_blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
# Final Layer Normalization: Layer normalization applied after the transformer blocks
self.final_norm = LayerNorm(cfg["emb_dim"])
# Output Head Linear Layer: Maps the final embeddings to the vocabulary size for prediction logits
self.out_head = nn.Linear(
cfg["emb_dim"], cfg["vocab_size"], bias=False # bias=False as per GPT-2 settings
)
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape # Get batch size and sequence length from input indices tensor shape
# Token Embeddings: Look up embeddings for the input token indices
tok_embeds = self.tok_emb(in_idx) # Shape: (batch_size, seq_len, emb_dim)
# Positional Embeddings: Create positional indices and get positional embeddings
pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) #Shape: (seq_len, emb_dim)
# Note: torch.arange creates positions 0 to seq_len-1, device is set to be the same as input tensor
# Combine Embeddings: Add token embeddings and positional embeddings to inject positional information
x = tok_embeds + pos_embeds # Shape: (batch_size, seq_len, emb_dim) - Broadcasting pos_embeds across batch
# Embedding Dropout: Apply dropout to the combined embeddings
x = self.drop_emb(x) # Shape: (batch_size, seq_len, emb_dim) - Dropout applied to embedding layer output
# Transformer Blocks: Pass the embeddings through the sequence of Transformer Blocks
x = self.trf_blocks(x) # Shape: (batch_size, seq_len, emb_dim) - Processed through all transformer layers
# Final Layer Normalization: Apply Layer Normalization to the output of the transformer blocks
x = self.final_norm(x) # Shape: (batch_size, seq_len, emb_dim) - Normalised output features
# Output Head: Linear layer to project embeddings to logits for vocabulary prediction
logits = self.out_head(x) # Shape: (batch_size, seq_len, vocab_size) - Logits for each token in vocab
return logits # Return the logits, representing the model's prediction for the next token in the sequence
The init constructor of this GPTModel class initialises the token and positional embedding layers using the configurations passed in via a Python dictionary, cfg. These embedding layers are responsible for converting input token indices into dense vectors and adding positional information. Next, the init method creates a sequential stack of TransformerBlock modules equal to the number of layers specified in cfg. Following the transformer blocks, a LayerNorm layer is applied, standardising the outputs from the transformer blocks to stabilise the learning process. Finally, a linear output head without bias is defined, which projects the transformer's output into the vocabulary space of the tokeniser to generate logits for each token in the vocabulary.
The forward method takes a batch of input token indices, computes their embeddings, applies the positional embeddings, passes the sequence through the transformer blocks, normalises the final output, and then computes the logits, representing the next token's unnormalised probabilities. We will convert these logits into tokens and text outputs later.
Let's initialise this now:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
"""
Input batch:
tensor([[6109, 3626, 6100, 345],
[6109, 1110, 6622, 257]])
Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.0072, -0.2137, -0.3467, ..., -0.3240, -0.2338, -0.1070],
[ 0.7063, -0.7429, -0.6645, ..., -0.6530, -0.1393, -0.2166],
[ 0.8226, -0.2982, -0.4547, ..., 0.0913, -0.6949, -0.2091],
[-0.3933, 0.3074, -0.1343, ..., 1.0464, 0.4620, -0.5298]],
[[ 0.1386, -0.4081, -0.1563, ..., -0.0892, -0.0672, -0.0157],
[ 0.2353, -0.1177, -0.1307, ..., 1.0729, -0.3517, 0.3905],
[ 0.7367, 0.3377, -0.4311, ..., 0.8471, 0.2219, -0.2541],
[ 0.0166, -0.0721, 0.3451, ..., 1.1352, -0.4069, 0.0309]]],
grad_fn=<UnsafeViewBackward0>)
"""
As we can see, the output tensor has the shape [2, 4, 50257], since we passed in 2 input texts with 4 tokens each. The last dimension, 50,257, corresponds to the vocabulary size of the tokeniser. In the next section, we will see how to convert each of these 50,257-dimensional output vectors back into tokens.
We will implement the code that converts the tensor outputs of the GPT model back into text.
The question when generating text is: how does a GPT model go from output tensors to the generated text? It involves several steps: decoding the output tensors, selecting tokens based on a probability distribution, and then converting them to human-readable text.
The process begins by encoding the input text into token IDs, which are then fed into the GPT model. The outputs of the model are then converted back into text and appended to the original input text. This is rather simple to understand and implement.
In each step, the model outputs a matrix with vectors representing potential next tokens. The vector corresponding to the next token is extracted and converted into a probability distribution via the softmax function. Within the vector containing the resulting probability scores, the index of the highest value is located, which translates to the token ID. This token ID is then decoded back into text, producing the next token in the sequence. Finally, this token is appended to the previous inputs, forming a new input sequence for the subsequent iteration. This step-by-step process enables the model to generate text sequentially, building coherent phrases and sentences from the initial input context.
We will now implement a function for the model to generate text
def generate_text_simple(model, idx, max_new_tokens, context_size): # Function to generate text iteratively
for _ in range(max_new_tokens): # Loop to generate 'max_new_tokens' number of new tokens (Step 6 - iterative process)
idx_cond = idx[:, -context_size:] # Take the last 'context_size' tokens from the input 'idx' as context (Step 1 - Input is token IDs, implicitly using encoded input from step 1)
with torch.no_grad(): # Disable gradient calculation during inference for efficiency
logits = model(idx_cond) # Pass the conditioned input through the GPT model to get logits (Step 2 - GPT model returns logits matrix)
logits = logits[:, -1, :] # Get logits for only the last token in the sequence (Step 3 - Extract last vector/row for next token logits)
probas = torch.softmax(logits, dim=-1) # Convert logits to probabilities using softmax along the vocabulary dimension (Step 4 - Softmax to probability distribution)
idx_next = torch.argmax(probas, dim=-1, keepdim=True) # Sample the next token index by choosing the token with the highest probability (argmax) (Step 5 - Argmax to get token ID)
idx = torch.cat((idx, idx_next), dim=1) # Append the predicted token index 'idx_next' to the input sequence 'idx' (Step 6 - Append token ID for next round)
return idx # Return the complete sequence of token indices, including the generated tokens
Here we have a simple generative loop for an LLM. It iterates for a specified number of new tokens to be generated, crops the current context to fit the model's maximum context size, computes predictions, and then selects the next token based on the highest-probability prediction.
The softmax function is monotonic, meaning it preserves the order of its inputs when transformed into outputs. So, in practice, the softmax step is redundant since the position with the highest score in the softmax output tensor is the same position in the logit tensor. In other words, we could apply the torch.argmax function to the logits tensor directly and get identical results. However, we coded the conversion to illustrate the full process of transforming logits to probabilities, which can add additional intuition, such as that the model generates the most likely next token, which is known as greedy decoding.
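To make this concrete, here is a minimal sanity-check sketch (with made-up logit values for a tiny hypothetical 5-token vocabulary) showing that argmax over the raw logits and argmax over their softmax pick the same index:
import torch
logits = torch.tensor([1.2, -0.4, 3.1, 0.7, 2.2]) # hypothetical logits, not from our model
probas = torch.softmax(logits, dim=-1)
print(torch.argmax(logits)) # tensor(2)
print(torch.argmax(probas)) # tensor(2) - same index, because softmax preserves the ordering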
We will also introduce additional sampling techniques where we modify the softmax outputs such that the model doesn't always select the most likely token, which introduces variability and creativity in the generated text.
This process of generating one token ID at a time and appending it to the context using the generate_text_simple function is further illustrated; below are six iterations of a token prediction cycle, where the model takes a sequence of initial token IDs as input, predicts the next token, and appends this token to the input sequence for the next iteration.
Let us try this exact sequence with our function and model. First, we must encode the input context into token IDs.
start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0) #A without this the shape would be torch.Size([4]); unsqueeze() is a tensor operation that inserts a new dimension of size one at a specified position in the tensor's shape
print("encoded_tensor.shape:", encoded_tensor.shape)
"""
encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])
"""
The unsqueeze(0) operation in the code transforms the encoded_tensor by adding a new dimension at the beginning (dimension index 0). Initially, encoded_tensor represents a 1-dimensional sequence of token IDs; unsqueeze(0) turns this 1D tensor into a 2D tensor with a shape of (1, sequence_length). The '1' at the beginning represents a batch size of one.
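As a small illustration using the same token IDs, unsqueeze(0) adds a leading batch dimension and squeeze(0) removes it again:
ids = torch.tensor([15496, 11, 314, 716]) # shape: torch.Size([4])
batched = ids.unsqueeze(0) # shape: torch.Size([1, 4]) - a batch containing one sequence
print(ids.shape, batched.shape)
print(batched.squeeze(0).shape) # back to torch.Size([4])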
Now we will put the model into .eval() mode, which disables random components such as dropout during inference, and then generate an output:
model.eval() #A
out = generate_text_simple(
model=model,
idx=encoded_tensor,
max_new_tokens=6,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))
"""
Output: tensor([[15496, 11, 314, 716, 27018, 7283, 46275, 41426, 33167, 33239]])
Output length: 10
"""
Using the .decode method of the tokeniser, we can convert the IDs back into text:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
"""
Hello, I am Feature IT snowballProtect youngstersMu
"""
Nothing of value was output, since we have not trained the model yet but only created the architecture. Still, it works, and that leads us to the final step.
We implemented the data sampling and attention mechanism and coded the LLM architecture; now we focus on implementing a training function and pretraining the LLM.
We will also learn about basic model evaluation techniques to measure the quality of the generated text, which is a requirement for optimising the LLM during the training process. We will also load pretrained weights, allowing us to have a solid starting point.
In the context of LLMs and other deep learning models, weights refer to the trainable parameters that the learning process adjusts. These weights are also known as weight parameters or simply parameters. After initializing a layer (new_layer = torch.nn.Linear(...)), we can access its weights through the .weight attribute, new_layer.weight. Additionally, for convenience, PyTorch allows direct access to all a model's trainable parameters, including weights and biases, through the method model.parameters(), which we will use later when implementing the model training.
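As a small sketch of this (using an arbitrarily sized layer purely for illustration), we can inspect a layer's weights and count its trainable parameters via .parameters():
import torch
new_layer = torch.nn.Linear(768, 50257, bias=False) # arbitrary sizes for illustration
print(new_layer.weight.shape) # torch.Size([50257, 768])
num_params = sum(p.numel() for p in new_layer.parameters()) # total trainable parameters
print(num_params) # 38597376, i.e. 50257 * 768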
We will begin by focusing on steps 1), 2), and 3) of our pipeline.
Let us go over the text generation we implemented. We start by initialising the GPT model that we will evaluate and train, using the GPTModel class and config:
import torch
GPT_CONFIG_124M = {
"vocab_size": 50257,
"context_length": 256, #We reduced this to help with computation
"emb_dim": 768,
"n_heads": 12,
"n_layers": 12,
"drop_rate": 0.1, #B
"qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
We now have a model that takes input token IDs and outputs rows of logit vectors; to complete the pipeline, we just need to add the text-to-token-ID conversion at the start and the token-ID-to-text conversion at the end.
The i-th row of the logits is the probability distribution over the vocabulary for the next token, predicted by the model after considering the input sequence up to and including the i-th input token. Each row of logits is context-aware and represents a prediction based on the sequence up to that point.
First, the tokeniser converts input text into a series of token IDs. Second, the model receives these token IDs and generates corresponding logits, which are vectors representing the probability distribution for each token in the vocabulary. Third, these logits are converted back into token IDs, which the tokeniser decodes into human-readable text, completing the cycle from textual input to textual output.
Let's try it out
def text_to_token_ids(text, tokenizer):
encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
encoded_tensor = torch.tensor(encoded).unsqueeze(0)
return encoded_tensor
def token_ids_to_text(token_ids, tokenizer):
flat = token_ids.squeeze(0) # remove batch dimension
return tokenizer.decode(flat.tolist())
start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids(start_context, tokenizer),
max_new_tokens=10,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
"""
Output text:
Every effort moves you rentingetic minion mobilised Macicone warrantyuler respirmediated
"""
Based on the output, it's clear the model isn't yet producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality," we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model's performance throughout its training process.
Typically, for each input we have a defined target, which allows us to compute a loss and backpropagate it; calculating the loss for an LLM works in a similar way.
For each of the 3 input tokens, shown on the left, we compute a vector containing probability scores corresponding to each token in the vocabulary. The index position of the highest probability score in each vector represents the most likely next token ID. These token IDs associated with the highest probability scores are selected and mapped back into a text that represents the text generated by the model.
In our model, the token IDs would range from 0 to 50,256 rather than 0 to 6.
Let's consider two input examples, which are mapped into token IDs:
input1 = text_to_token_ids("every effort moves", tokenizer)
input2 = text_to_token_ids("I really like", tokenizer)
inputs = torch.cat((input1, input2), dim=0)
"""
inputs = torch.tensor([[16833, 3626, 6100], # ["every effort moves",
[40, 1107, 588]]) # "I really like"]
"""
Matching these inputs, the targets contain the token IDs we aim for the model to produce:
targets = torch.tensor([[3626, 6100, 345 ], # [" effort moves you",
[588, 428, 11311]]) # " really like chocolate"]
Note that we are doing this with two examples, not one; the "effort..." and "really..." sentences are unrelated.
We now feed the inputs into the model to calculate the logit vectors for the two input examples, each comprising three tokens, and apply the softmax function to transform these logit values into probability scores:
with torch.no_grad(): #A
logits = model(inputs)
probas = torch.softmax(logits, dim=-1) # Probability of each token in the vocabulary
print(probas.shape)
print(probas)
"""
torch.Size([2, 3, 50257])
tensor([[[2.5756e-05, 1.0833e-05, 1.6042e-05, ..., 2.5733e-05,
6.8686e-06, 1.6034e-05],
[1.0105e-05, 9.4379e-06, 7.7280e-06, ..., 4.0692e-05,
5.7055e-06, 1.0792e-05],
[3.2463e-05, 9.2218e-06, 1.6283e-05, ..., 3.4169e-05,
1.4085e-05, 1.1884e-05]],
[[2.1001e-05, 1.7538e-05, 1.6416e-05, ..., 1.1503e-05,
5.3201e-05, 1.0935e-05],
[7.0654e-06, 1.8018e-05, 9.2447e-06, ..., 3.4000e-05,
9.1224e-06, 1.5547e-05],
[3.1890e-05, 3.1677e-05, 3.8751e-05, ..., 6.9175e-06,
5.5779e-05, 1.2184e-05]]])
"""
The first number, 2, corresponds to the two examples (rows) in the inputs, also known as the batch size. The second number, 3, corresponds to the number of tokens in each input (row). Finally, the last number corresponds to the vocabulary size, since the output head produces one logit (and hence one probability) per token in the vocabulary.
Following the conversion from logits to probabilities via the softmax function, we can, just as in the generate_text_simple function implemented earlier, convert the resulting probability scores back into token IDs via argmax:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)
"""
Token IDs:
tensor([[[36195],
[16031],
[42826]],
[[14212],
[ 7822],
[38509]]])
"""
We can decode this now
# Finally, step 5 converts the token IDs back into text:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")
"""
Targets batch 1: effort moves you
Outputs batch 1: lif savesNetflix
"""
Again not trained yet, but let's add an evaluation method so we can measure loss. The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs. This softmax probability is also used in the evaluation metric we are implementing to numerically assess the model's generated outputs: the higher the probability in the correct positions, the better.
In this figure we only use a 7-token vocabulary to fit everything, which implies random probabilities will be around 1/7. GPT-2, however, has a vocabulary of 50,257 tokens, so most starting probabilities will be around 1/50,257, which is roughly 0.00002.
For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]] # for the first text, select at each of the 3 positions the probability assigned to the corresponding target token ID ([3626, 6100, 345])
print("Text 1:", target_probas_1)
text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
"""
Text 1: tensor([4.1353e-05, 1.9397e-05, 1.1213e-05])
Text 2: tensor([3.1776e-05, 1.3458e-05, 5.2655e-06])
targets:
tensor([[ 3626, 6100, 345],
[ 588, 428, 11311]])
"""
The goal of training an LLM is to maximise these values, getting them as close to a probability of 1 as possible.
To maximise the softmax probability values corresponding to the target tokens, we must update the model weights so that the model outputs higher values for the respective token IDs we want to generate. This weight update happens via backpropagation of a loss. We have our targets already set.
Steps 1 to 3 calculate the token probabilities corresponding to the target tensor; we then transform these via the negative average log probability. This is the loss we want to compute.
We have already done steps 1 to 3, so now we take the logarithm of these probabilities:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)
"""
tensor([-10.0934, -10.8504, -11.3984, -10.3568, -11.2159, -12.1543])
"""
Working with logarithms of probability scores is more manageable in mathematical optimisation than handling the scores directly. We now combine these into a single score by computing the average:
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)
"""
tensor(-11.0115)
"""
The goal is to get the average log probability as close to 0 as possible by updating the model's weights as part of the training process. To obtain a value we minimise, we take the negative of it:
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
"""
tensor(11.0115)
"""
The "cross-entropy loss" in the formalised and computationally efficient way of implementing the idea of minimizing the negative average log probability, the cross-entropy loss is designed to quantify how "surprised" our model is by the actual correct tokens in our dataset. A higher cross-entropy loss means the model is making less accurate and less confident predictions on average (lower probabilities for correct tokens, hence larger negative log probabilities and a larger average). A lower cross-entropy loss means the model's predictions are getting better, assigning higher probabilities to the correct tokens (closer to 1, hence log probabilities closer to 0 and a smaller negative average).
The cross entropy loss is a popular measure in machine learning and deep learning that measures the difference between two probability distributions--typically, the true distribution of labels (here, tokens in a dataset) and the predicted distribution from a model. Before we apply the cross entropy function, let's briefly recall the shape of the logits and target tensors:
print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)
"""
Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])
"""
As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)
"""
Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])
"""
Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain the probability scores. Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities. PyTorch's cross_entropy function will take care of all these steps for us:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
"""
tensor(11.0115)
"""
As you can see, this matches the value we calculated manually.
Perplexity is a metric used to evaluate language models, offering a more intuitive understanding of model performance than just loss values. It essentially measures how "perplexed" or "confused" a model is when predicting the next word in a sequence. A lower perplexity score implies that the model is less uncertain and more confident in its predictions. Mathematically, perplexity is derived from the cross-entropy loss; in fact, it's the exponentiated cross-entropy loss. It measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution.
Perplexity can be calculated as perplexity = torch.exp(loss), which here returns tensor(60568.8867). It is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size the model is uncertain about at each step. In this example, it translates to the model being unsure about which among roughly 60,568 words or tokens in the vocabulary to generate as the next token; this is of course only an estimate.
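As a minimal sketch using the loss value we just computed:
perplexity = torch.exp(loss)
print(perplexity)
"""
tensor(60568.8867)
"""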
We will now prepare the training and validation datasets to train the LLM, and then calculate the cross entropy for the training and validation sets. To compute the loss on the training and validation datasets, we use a very small dataset: the "The Verdict" short story.
To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8xA100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000.
As seen we can load our story
file_path = "verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
text_data = file.read()
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)
"""
Characters: 20479
Tokens: 5145
"""
5,145 tokens is not enough to train a capable model, but this is just for demonstration purposes; we will load pretrained weights later.
Next, we divide the dataset into a training and a validation set and use the data loaders we made a long while back to prepare the batches for LLM training. This process is visualised below
For visualization purposes, we use a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training.
We are training the model with training data presented in similarly-sized chunks for simplicity and efficiency. However, in practice, it can also be beneficial to train an LLM with variable-length inputs to help the LLM to better generalise across different types of inputs when it is being used.
We define a train_ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
Using the train_data and val_data subsets, we can now create the respective data loader reusing the create_dataloader_v1
torch.manual_seed(123)
train_loader = create_dataloader_v1(
train_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=True,
shuffle=True
)
val_loader = create_dataloader_v1(
val_data,
batch_size=2,
max_length=GPT_CONFIG_124M["context_length"],
stride=GPT_CONFIG_124M["context_length"],
drop_last=False,
shuffle=False
)
We used a relatively small batch size in the preceding code to reduce the computational resource demand because we were working with a very small dataset. In practice, training LLMs with batch sizes of 1,024 or larger is not uncommon. As an optional check, we can iterate through the data loaders to ensure that they were created correctly
x is the input sequence of token IDs. It's what you feed into your GPT model as the context to predict the next token.
y is the target sequence of token IDs. It represents the "correct" next tokens that the model should ideally predict given the input sequence x.
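In other words, y is simply x shifted by one position, as produced by the sliding-window approach of the data loader. A tiny sketch with hypothetical token IDs (not taken from our dataset):
text_chunk = [290, 4920, 2241, 287, 257] # hypothetical chunk of token IDs from the raw text
x = text_chunk[:-1] # [290, 4920, 2241, 287] - input sequence
y = text_chunk[1:]  # [4920, 2241, 287, 257] - target: each position's label is the next token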
Shapes torch.Size([2, 256]) for both x and y:
print("Train loader:")
for x, y in train_loader:
print(x.shape, y.shape)
print("\nValidation loader:")
for x, y in val_loader:
print(x.shape, y.shape)
"""
Train loader:
torch.Size([2, 256]) torch.Size([2, 256]) #1
torch.Size([2, 256]) torch.Size([2, 256]) #2
torch.Size([2, 256]) torch.Size([2, 256]) #3
torch.Size([2, 256]) torch.Size([2, 256]) #4
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256])
torch.Size([2, 256]) torch.Size([2, 256]) #9
Validation loader:
torch.Size([2, 256]) torch.Size([2, 256]) #1
"""
Now we implement a function to calculate the cross entropy loss of a given batch returned via the training and validation loaders:
def calc_loss_batch(input_batch, target_batch, model, device):
# Move input and target batches to the specified device (GPU or CPU)
input_batch, target_batch = input_batch.to(device), target_batch.to(device)
# Get the model's output logits for the input batch
logits = model(input_batch)
# Calculate cross-entropy loss
loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
# Return the calculated loss
return loss
We can now use this calc_loss_batch utility function, which computes the loss for a single batch, to implement the following calc_loss_loader function that computes the loss over all the batches sampled by a given data loader:
def calc_loss_loader(data_loader, model, device, num_batches=None):
total_loss = 0. # Initialise a variable to accumulate the total loss
if num_batches is None: # Check if num_batches is not provided (is None)
num_batches = len(data_loader) # If None, set it to total number of batches in the data_loader (process all batches)
else:
num_batches = min(num_batches, len(data_loader)) #B - use the smaller value between num_batches and the actual number of batches in data_loader (limit batches if requested)
for i, (input_batch, target_batch) in enumerate(data_loader): # Loop through the data_loader, iterating over batches of (input_batch, target_batch) and their index 'i'
if i < num_batches: # Check if the current batch index 'i' is less than the specified num_batches (process only up to num_batches)
loss = calc_loss_batch(input_batch, target_batch, model, device) # Calculate the loss for the current batch using the calc_loss_batch function
total_loss += loss.item() #C - Accumulate the batch loss (converted to a Python float using .item()) to the total_loss
else: # If the current batch index 'i' is no longer less than num_batches (processed enough batches)
break # Exit the loop, stop processing further batches
return total_loss / num_batches #D - After processing the desired number of batches, return the average loss by dividing the total_loss by the num_batches
Although a bit dense to read, this is fairly self-explanatory. By default, the calc_loss_loader function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then computes and averages the loss over the total number of batches.
Let's now see this calc_loss_loader function in action, applying it to the training and validation set loaders:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") #A
model.to(device)
train_loss = calc_loss_loader(train_loader, model, device) #B
val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)
"""
Training loss: 10.98758347829183
Validation loss: 10.982392311096191
"""
The loss values are relatively high because the model has not yet been trained. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets. Now that we have a way to measure the quality of the generated text, in the next section, we train the LLM to reduce this loss so that it becomes better at generating text.
In this section, we finally implement the code for pretraining the LLM, our GPTModel. For this, we focus on a straightforward training loop, as illustrated. There are more advanced techniques that could be included, such as learning rate warmup, cosine annealing, and gradient clipping, but we will not implement them here and only discuss them at the end.
This is all fairly straightforward; the code will help us understand it further.
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter, start_context):
train_losses, val_losses, track_tokens_seen = [], [], []
tokens_seen, global_step = 0, -1 # Initialise token counter and global step counter
for epoch in range(num_epochs): #1) For each training epoch (outer loop for epochs)
model.train() # Set model to training mode (enable dropout, batchnorm, etc.)
for input_batch, target_batch in train_loader: # 2) For each batch in training set (inner loop for batches)
optimizer.zero_grad() # 3) Reset loss gradients from previous batch
loss = calc_loss_batch(input_batch, target_batch, model, device) # 4) Calculate loss on current batch
loss.backward() #D - 5) Backward pass to calculate loss gradients
optimizer.step() #E - 6) Update model weights using loss gradients
tokens_seen += input_batch.numel() # Track number of tokens seen during training
global_step += 1 # Increment global step counter
if global_step % eval_freq == 0: #7) Check if it's time to evaluate (every eval_freq steps)
train_loss, val_loss = evaluate_model( # Evaluate model on train and validation sets
model, train_loader, val_loader, device, eval_iter)
train_losses.append(train_loss) # Store training loss for tracking
val_losses.append(val_loss) # Store validation loss for tracking
track_tokens_seen.append(tokens_seen) # Store tokens seen at evaluation step
print(f"Ep {epoch+1} (Step {global_step:06d}): " # Print epoch, step, train/val losses
f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
generate_and_print_sample( #8) Generate sample text for visual inspection (optional step)
model, train_loader.dataset.tokenizer, device, start_context
)
return train_losses, val_losses, track_tokens_seen # Return lists of train/val losses and tokens seen for plotting/analysis
As you can see, this code is missing some functions but is generally straightforward to understand. The evaluate_model function calculates the loss over the training and validation sets while ensuring the model is in evaluation mode, with gradient tracking and dropout disabled.
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
model.eval() # evaluation mode (turns off dropout, batchnorm in eval mode)
with torch.no_grad(): #Disable gradient calculation during evaluation for efficiency (no training here)
train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter) # Calculate average loss over a subset of the TRAIN loader (optional training loss eval)
val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter) # Calculate average loss over a subset of the VALIDATION loader (key validation loss eval)
model.train() # 5) Set model back to training mode AFTER evaluation (important if training continues after evaluation, though maybe redundant here)
return train_loss, val_loss # 6) Return the calculated training and validation losses
Similar to evaluate_model, the generate_and_print_sample function is a convenience function that we use to track whether the model improves during the training. In particular, the generate_and_print_sample function takes a text snippet (start_context) as input, converts it into token IDs, and feeds it to the LLM to generate a text sample using the generate_text_simple function we used earlier:
def generate_and_print_sample(model, tokenizer, device, start_context):
model.eval() # Set model to evaluation mode (turns off dropout, etc.)
context_size = model.pos_emb.weight.shape[0] # Get context size from positional embedding dimension
encoded = text_to_token_ids(start_context, tokenizer).to(device) # Encode start context to token IDs and move to device
with torch.no_grad(): # Disable gradient calculation for efficient generation
token_ids = generate_text_simple( # Generate new tokens using the model
model=model, idx=encoded,
max_new_tokens=50, context_size=context_size # Set max tokens to generate and context size
)
decoded_text = token_ids_to_text(token_ids, tokenizer) # Decode generated token IDs back to text
print(decoded_text.replace("\n", " ")) # Compact print format: replace newlines with spaces
model.train() # Set model back to training mode after generation
AdamW is an optimisation algorithm, and the loss function produces the value that is backpropagated. AdamW is a variant of Adam that improves the weight decay approach, which aims to minimise model complexity and prevent overfitting by penalising larger weights. This adjustment allows AdamW to achieve more effective regularization and better generalization, and it is thus frequently used in the training of LLMs.
With the optimiser now chosen, we can train our LLM for 10 epochs. Executing the train_model_simple function starts the training process:
torch.manual_seed(123) # Set random seed for reproducibility
model = GPTModel(GPT_CONFIG_124M) # Initialise the GPT model with the specified configuration
model.to(device) # Move the model to the specified device (CPU or GPU)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0) # Initialise AdamW optimiser for model parameters
num_epochs = 10 # Set the number of training epochs
train_losses, val_losses, tokens_seen = train_model_simple( # Train the model using the simple training loop
model, train_loader, val_loader, optimizer, device,
num_epochs=num_epochs, eval_freq=5, eval_iter=1, # Set training parameters: epochs, eval frequency and iterations
start_context="Every effort moves you" # Set the starting context for sample text generation during training
)
The output is as shown below. Based on the results printed during the training, the training loss improves drastically, starting with a value of 9.781 and converging to 0.967. The language skills of the model have improved quite a lot. In the beginning, the model is only able to append commas to the start context or repeat the word "and". At the end of the training, it can generate grammatically correct text.
Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933
Every effort moves you,,,,,,,,,,,,.
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, and, and, a[...] #A
...
Ep 9 (Step 000080): Train loss 1.386, Val loss 6.212
Every effort moves you know," was not that my hostess was "interesting of the frame. "There: make yourself comfortable--and here are the cigars you like." "Oh, I felt a little a "There were days when I
Ep 10 (Step 000085): Train loss 0.967, Val loss 6.253
Every effort moves you know," was not that my hostess was "interesting": on that Mrs. "Yes--and by me to me to have been through--it was fitting that Mrs. "Oh, I had the donkey. "There were days when I
Similar to the training set loss, we can see that the validation loss starts high (9.933) and decreases during the training. However, it never becomes as small as the training set loss and remains at 6.253 after the 10th epoch.
We can plot the validation and training loss on a graph and observe a sharp decrease, which is a sign that the model is learning. However, the training set loss continues to decrease while the validation loss does not, which is a sign that the model is overfitting to the training data.
import matplotlib.pyplot as plt
import torch # Import torch, even if not directly used in the corrected plotting code, it's in the original context
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
fig, ax1 = plt.subplots(figsize=(5, 3)) # Create figure and primary axes ax1
ax1.plot(epochs_seen, train_losses, label="Training loss") # Plot training loss vs epochs on ax1
ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss") # Plot validation loss vs epochs on ax1, using a dash-dot linestyle
ax1.set_xlabel("Epochs") # Set x-axis label for ax1 as 'Epochs'
ax1.set_ylabel("Loss") # Set y-axis label for ax1 (and ax2 as they share y-axis) as 'Loss'
ax1.legend(loc="upper right") # Display legend in the upper right corner of ax1
ax2 = ax1.twiny() #A - Create a secondary axes ax2 sharing the y-axis with ax1
ax2.plot(tokens_seen, train_losses, alpha=0) #B - Plot tokens_seen vs train_losses with alpha=0 to make the line invisible, just to set x-ticks for ax2
ax2.set_xlabel("Tokens seen") # Set x-axis label for ax2 as 'Tokens seen'
ax2.set_xbound(tokens_seen[0], tokens_seen[-1]) # Set the x-axis bounds for ax2 to match the range of tokens_seen
fig.tight_layout() # Adjust plot layout to prevent labels from overlapping
plt.show() # Display the plot
epochs_tensor = torch.linspace(1, num_epochs, len(train_losses)) # Corrected epochs tensor to start from 1 and represent epochs
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses) # Call the plot_losses function to generate and display the plot
This memorization is expected since we are working with a very, very small training dataset and training the model for multiple epochs. Usually, it's common to train a model on a much, much larger dataset for only one epoch. We could instead train on the roughly 60,000 public domain books from Project Gutenberg, where this memorization does not occur; we may also look at an open-web dataset derived from Reddit posts.
We will now explore sampling methods employed by LLMs to mitigate memorization effects, resulting in more novel generated text. We will cover two techniques, temperature scaling and top-k sampling, to improve the generate_text_simple function, which is shown again below as a reminder:
def generate_text_simple(model, idx, max_new_tokens, context_size): # Function to generate text iteratively
for _ in range(max_new_tokens): # Loop to generate 'max_new_tokens' number of new tokens (Step 6 - iterative process)
idx_cond = idx[:, -context_size:] # Take the last 'context_size' tokens from the input 'idx' as context (Step 1 - Input is token IDs, implicitly using encoded input from step 1)
with torch.no_grad(): # Disable gradient calculation during inference for efficiency
logits = model(idx_cond) # Pass the conditioned input through the GPT model to get logits (Step 2 - GPT model returns logits matrix)
logits = logits[:, -1, :] # Get logits for only the last token in the sequence (Step 3 - Extract last vector/row for next token logits)
probas = torch.softmax(logits, dim=-1) # Convert logits to probabilities using softmax along the vocabulary dimension (Step 4 - Softmax to probability distribution)
idx_next = torch.argmax(probas, dim=-1, keepdim=True) # Sample the next token index by choosing the token with the highest probability (argmax) (Step 5 - Argmax to get token ID)
idx = torch.cat((idx, idx_next), dim=1) # Append the predicted token index 'idx_next' to the input sequence 'idx' (Step 6 - Append token ID for next round)
return idx # Return the complete sequence of token indices, including the generated tokens
But first, since we have finished training, we should transfer the model back from the GPU to the CPU, since inference with a relatively small model does not require a GPU. Also, after training, we put the model into evaluation mode to turn off random components such as dropout.
model.to("cpu")
model.eval()
Next, as we have already done, we can pass the GPTModel instance into the generate_text_simple function, and use the LLM to generate one token at a time
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=25,
context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
"""
Output text:
Every effort moves you know," was not that my hostess was "interesting": on that point I could have given Miss Croft the fact,
"""
Since we have now trained our LLM, we no longer get gibberish output like last time. At each generation step, the selected token is the one with the largest probability score among all tokens in the vocabulary. This means that the LLM will always generate the same output, even if we run the generate_text_simple function above multiple times on the same start context.
"Well, come while he's not looking,"
We will introduce two concepts to control the randomness and diversity of the text: temperature scaling and top-k sampling
Temperature scaling is a technique that adds a probabilistic selection process to the next-token generation task. Previously, inside the generate_text_simple function, we always sampled the token with the highest probability as the next token using torch.argmax, also known as greedy decoding. To generate text with more variety, we can replace the argmax with a function that samples from a probability distribution.
To illustrate the probabilistic sampling with a concrete example, let's briefly discuss the next-token generation process using a very small vocabulary
vocab = {
"closer": 0,
"every": 1,
"effort": 2,
"forward": 3,
"inches": 4,
"moves": 5,
"pizza": 6,
"toward": 7,
"you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}
Assume the LLM is given the start context "every effort moves you" and generates the following next-token logits:
next_token_logits = torch.tensor(
[4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
As mentioned, inside generate_text_simple we convert the logits into probabilities via the softmax function and obtain the token ID corresponding to the generated token via the argmax function, which we can then map back into text via the inverse vocabulary:
probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])
Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index 3), the generated word is "forward".
To implement a probabilistic sampling process, we can now replace the argmax with the multinomial function in PyTorch
torch.manual_seed(123)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])
As an aside, the multinomial distribution describes the probability of observing a specific combination of counts (x_1, x_2, ..., x_k) across k different categories in n independent trials; here, torch.multinomial with num_samples=1 simply draws a single token index in proportion to the given probabilities.
The printed output is "forward" just like before. What happened? The multinomial function samples the next token proportional to its probability score. In other words, "forward" is still the most likely token and will be selected by multinomial most of the time but not all the time. To illustrate this, let's implement a function that repeats this sampling 1000 times:
def print_sampled_tokens(probas, inverse_vocab):
torch.manual_seed(123)
num_samples = 1000 # Define the number of tokens to sample for frequency analysis
sample = [torch.multinomial(probas, num_samples=1).item() for _ in range(num_samples)] # Sample token indices based on probabilities using multinomial distribution, repeat for num_samples
sampled_ids = torch.bincount(torch.tensor(sample), minlength=len(inverse_vocab)) # Count the occurrences of each token ID in the sampled list, ensure output size matches vocabulary
for i, freq in enumerate(sampled_ids): # Iterate through the frequency counts for each token ID
if freq > 0: # Print only if the frequency is greater than 0 (token was sampled)
print(f"{freq} x {inverse_vocab[i]}") # Print the frequency and the corresponding token from inverse_vocab
"""
The sampling output is as follows:
73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward
"""
As we can see based on the output, the word "forward" is sampled most of the time (582 out of 1000 times), but other tokens such as "closer", "inches", and "toward" will also be sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_and_print_sample function, the LLM would sometimes generate texts such as "every effort moves you toward", "every effort moves you inches", and "every effort moves you closer" instead of "every effort moves you forward".
We can further control the distribution and selection process via a concept called temperature scaling, where temperature scaling is just a fancy description for dividing the logits by a number greater than 0:
def softmax_with_temperature(logits, temperature):
scaled_logits = logits / temperature
return torch.softmax(scaled_logits, dim=0)
Temperatures greater than 1 result in more uniformly distributed token probabilities, and temperatures smaller than 1 result in more confident (sharper or more peaky) distributions.
A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature to 0.1 sharpens the distribution, so the most likely token (here "forward") will have an even higher probability score. Vice versa, increasing the temperature to 5 makes the distribution more uniform.
Example:
logits = torch.tensor([2.0, 1.0, 0.5])
Probabilities (Temperature=1) or softmax(logits) : [0.665 0.2447 0.0903]
Probabilities (Temperature=5): [0.394 0.324 0.282]
Probabilities (Temperature=0.1): [9.9998e-01 2.0606e-05 3.7675e-07] -> mimics argmax()
The higher the temperature, the more diverse the sampling; the lower the temperature, the greedier (more deterministic) the selection.
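As a short sketch, we can apply softmax_with_temperature to the next_token_logits from the small-vocabulary example above and sample with torch.multinomial; the exact counts will vary, but low temperatures should concentrate the samples on "forward" while high temperatures spread them across more of the vocabulary:
for T in [0.1, 1.0, 5.0]:
    torch.manual_seed(123)
    scaled_probas = softmax_with_temperature(next_token_logits, T)
    samples = [torch.multinomial(scaled_probas, num_samples=1).item() for _ in range(1000)]
    counts = torch.bincount(torch.tensor(samples), minlength=len(vocab))
    print(f"Temperature {T}:", counts.tolist())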
We implemented a probabilistic sampling approach coupled with temperature scaling to increase the diversity of the outputs. However, one downside of this approach is that it sometimes leads to grammatically incorrect or completely nonsensical outputs such as "every effort moves you pizza".
So we introduce top-k sampling, which, when combined with probabilistic sampling and temperature scaling, can improve the text generation results.
In top-k sampling, we can restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores.
Using top-k sampling with k=3, we focus on the 3 tokens associated with the highest logits and mask out all other tokens with negative infinity (-inf) before applying the softmax function. This results in a probability distribution with a probability value of 0 assigned to all non-top-k tokens.
next_token_logits = torch.tensor(
[4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)
top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)
"""
Top logits: tensor([6.7500, 6.2800, 4.5100])
Top positions: tensor([3, 7, 0])
"""
Subsequently, we apply PyTorch's where function to set the logit values of tokens that are below the lowest logit value within our top-3 selection to negative infinity (-inf).
new_logits = torch.where(
condition=next_token_logits < top_logits[-1], #A
input=torch.tensor(float('-inf')), #B
other=next_token_logits #C
)
print(new_logits)
"""
tensor([4.5100, -inf, -inf, 6.7500, -inf, -inf, -inf, 6.2800, -inf])
"""
Lastly, let's apply the softmax function to turn these into next-token probabilities:
topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)
"""
tensor([0.0615, 0.0000, 0.0000, 0.5775, 0.0000, 0.0000, 0.0000, 0.3610, 0.0000])
"""
We can now apply the temperature scaling and the multinomial function for probabilistic sampling introduced earlier to select the next token among these three non-zero probability scores.
We can now combine the two concepts, temperature scaling and top-k sampling, with our generate_text_simple function to create the generate function:
def generate(model, idx, max_new_tokens, context_size, temperature, top_k=None): # top_k defaults to None, meaning no top-k sampling if not provided
for _ in range(max_new_tokens): #A - Iterate to generate 'max_new_tokens' number of new tokens
idx_cond = idx[:, -context_size:] # Get the last 'context_size' tokens from the current sequence 'idx' as input (context window)
with torch.no_grad(): # Disable gradient calculations for efficient inference
logits = model(idx_cond) # Pass the context through the model to get logits for the next token
logits = logits[:, -1, :] # Take logits for only the last token in the sequence (we only predict the next token)
if top_k is not None: #B - Apply top-k filtering if top_k value is provided
top_logits, _ = torch.topk(logits, top_k) # Get the top 'top_k' logits and their indices
min_val = top_logits[:, -1] # Get the smallest logit value among the top 'top_k' logits (threshold for filtering)
logits = torch.where( # Use torch.where to filter logits
logits < min_val, # Condition: if a logit is less than the minimum of top-k logits
torch.tensor(float('-inf')).to(logits.device), # Replace logits outside top-k with negative infinity, effectively masking them out in softmax
logits # Otherwise, keep the original logit value (for top-k logits)
)
if temperature > 0.0: #C - Apply temperature scaling if temperature > 0
logits = logits / temperature # Divide logits by temperature to control probability distribution
probs = torch.softmax(logits, dim=-1) # Apply softmax to get probabilities from the scaled logits
idx_next = torch.multinomial(probs, num_samples=1) # Sample the next token index from the probability distribution
else: #D - Greedy decoding if temperature is 0 (deterministic)
idx_next = torch.argmax(logits, dim=-1, keepdim=True) # Select the token with the highest logit as the next token
idx = torch.cat((idx, idx_next), dim=1) # Append the sampled/chosen next token index to the current sequence 'idx'
return idx # Return the complete generated sequence of token IDs
This is long; however, it is quite self-explanatory now that we have seen all of its pieces.
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort moves you", tokenizer),
max_new_tokens=15,
context_size=GPT_CONFIG_124M["context_length"],
top_k=25,
temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
"""
Output text:
Every effort moves you stand to work on surprise, a one of us had gone with
"""
As we can see, the generated text is very different from the one we previously generated via the generate_text_simple function ("Every effort moves you know," was not that my hostess was "interesting"...), which was a memorised passage from the training set.
We have discussed how to numerically evaluate the training progress and pretrain an LLM from scratch. Even though both the LLM and dataset were relatively small, this exercise showed that pretraining LLMs is computationally expensive. Thus, it is important to be able to save the LLM so that we don't have to rerun the training every time we want to use it in a new session.
The recommended way is to save a model's so-called state_dict, a dictionary mapping each layer to its parameters, using the torch.save function as follows:
torch.save(model.state_dict(), "model.pth") #The .pth extension is a convention for PyTorch files, though we could technically use any file extension
After saving the model weights via the state_dict, we can load the model weights into a new GPTModel model instance as follows:
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth"))
model.eval()
If we plan to continue pretraining a model later, for example, using the train_model_simple function we defined earlier in this chapter, saving the optimiser state is also recommended.
Adaptive optimisers such as AdamW store additional parameters for each model weight. AdamW uses historical data to adjust learning rates for each model parameter dynamically. Without it, the optimiser resets, and the model may learn suboptimally or even fail to converge properly, which means that it will lose the ability to generate coherent text. Using torch.save, we can have both the model and optimiser state_dict contents as follows:
torch.save({
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
},
"model_and_optimizer.pth"
)
Then, we can restore the model and optimiser states as follows by first loading the saved data via torch.load and then using the load_state_dict method:
checkpoint = torch.load("model_and_optimizer.pth")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()