This is a short summary of large language models, written for myself as a reference and to aid understanding. Thanks to Sebastian Raschka & Andrej Karpathy.
Large language models (LLMs), such as OpenAI's ChatGPT or Anthropic's Claude, are deep neural networks based on the decoder architecture of the Transformer model. These models possess remarkable capabilities to understand, generate, and interpret human language.
Unlike earlier NLP models designed for specific tasks, LLMs are generalised and versatile. They can have billions to hundreds of billions of parameters, the weights in the network that are optimized during training to predict the next word in a sequence. Due to their ability to generate content, LLMs fall under the broader category of Generative AI.
The encoder module processes text, encoding it into vectors that capture contextual information to pass into the decoder blocks. The decoder block, in turn, takes the encoded input and generates an output.
Both modules can also be used separately and effectively. The encoder module, due to its ability to capture the contextual information of inputs, is particularly suited for masked word prediction (e.g., "There was an _____ with the car"). This approach is used in BERT (short for Bidirectional Encoder Representations from Transformers).
On the other hand, the decoder module, due to its generative nature, can predict the following tokens (e.g., "There was an issue with the ___"). This approach forms the basis of GPT models (short for Generative Pretrained Transformers).
GPT models, primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities. These models are adept at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior specific examples. On the other hand, few-shot learning involves learning from a minimal number of examples the user provides as input.
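The difference between the two shows up purely in the prompt. A small, hypothetical illustration (the prompt wording and examples here are mine, not from the source):
# Zero-shot: the task is described, but no worked examples are given.
zero_shot_prompt = "Translate English to French: cheese ->"
# Few-shot: a handful of examples are provided before the query.
few_shot_prompt = (
    "Translate English to French:\n"
    "sea otter -> loutre de mer\n"
    "plush giraffe -> girafe peluche\n"
    "cheese ->"
)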
As mentioned, the GPT model uses only the decoder module. It operates autoregressively, in a left-to-right manner, where each word is predicted based on the words that come before it.
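As a rough illustration of that loop, here is a minimal sketch; the "model" below is a stand-in that just returns random scores, not a real GPT:
import random

toy_vocab = ["there", "was", "an", "issue", "with", "the", "car"]

def toy_model(context):
    # A real model would score every vocabulary token given the context;
    # here the scores are random, purely for illustration.
    return {token: random.random() for token in toy_vocab}

context = ["There", "was", "an"]
for _ in range(4):
    scores = toy_model(context)
    next_token = max(scores, key=scores.get)   # greedy: pick the highest-scoring token
    context.append(next_token)                 # the prediction is fed back in as input
print(" ".join(context))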
Due to training on such a vast amount of data, the model, although primarily designed to excel at next-word prediction, exhibits emergent behavior. This refers to behavior that the model was not specifically trained for but arises as a consequence of being exposed to such extensive data.
The challenging part of building a large language model (LLM) lies in stage 1, where the data is prepared, and the attention mechanism is implemented. Once this stage is complete, the remaining decoder blocks are constructed to form fully functional decoder modules.
An input to a transformer, whether audio, text, or video, cannot be processed in raw form; all data must be standardised. The easiest and most convenient way to feed in the data is as embeddings, which are numerical representations of words in a continuous vector space, where words with similar meanings or contexts are placed closer together. Embeddings encode semantic and syntactic relationships between words, enabling models to understand and process language more effectively.
Word embeddings are the most commonly used for text, although sentence and paragraph embeddings are also possibilities; for GPT, word (token) embeddings are sufficient. Word embeddings can have varying dimensions, from one to thousands.
While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimising the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand.
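A minimal sketch of such a trainable embedding layer, assuming PyTorch is available (the dimensions below are arbitrary choices, not values from the source):
import torch

torch.manual_seed(123)
vocab_size = 1159        # size of the toy vocabulary built later in these notes
embedding_dim = 256      # arbitrary choice; GPT-2 small uses 768

embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)   # trainable weights
token_ids = torch.tensor([58, 2, 872])         # token IDs produced by a tokeniser
print(embedding_layer(token_ids).shape)        # torch.Size([3, 256])
# Unlike Word2Vec vectors, these weights start random and are updated during LLM training.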
We will now tokenise the text, as shown below. First, we load the data from a stored file:
with open("verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
print("Total number of character:", len(raw_text))
Total number of character: 20479
We can now split the text into tokens on whitespace and punctuation using the regex module, discarding the whitespace tokens:
import re
# Split on punctuation, double dashes, and whitespace (kept as separate tokens)
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
# Drop empty strings and pure-whitespace tokens
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
4649
Now, we move on to converting these tokens into integer representations called token IDs, which serve as an intermediate step before generating embedding vectors. To achieve this, we first build a vocabulary—a mapping of unique tokens, including words and special characters, to unique integers.
We convert tokens to token IDs instead of embeddings directly because token IDs serve as a compact, discrete representation that acts as a bridge between raw text and the embedding layer. This intermediate step ensures that each unique token is assigned a unique integer, enabling efficient lookup and management of embeddings.
#Converting tokens into token IDs
all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(vocab_size)
vocab = {token:integer for integer, token in enumerate(all_words)}
1159
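The vocabulary entries shown below can then be inspected by iterating over the mapping; a minimal sketch:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 15:    # only show the first few entries
        break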
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
We can now create a simple tokeniser class, one that combines the text splitting and the encoding into token IDs.
class SimpleTokeniserV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Split text on special characters and whitespace
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that join() inserted before punctuation
        text = re.sub(r'\s+([,.?_!"()\'])', r'\1', text)
        return text
# If text is: "Hello, world!"
# After re.split():
# preprocessed = ['Hello', ',', '', 'world', '!']
# After the cleaning loop:
# preprocessed = ['Hello', ',', 'world', '!']
Testing it out, you can see we have achieved the first part of our model:
tokeniser = SimpleTokeniserV1(vocab)
text = """It's the last he painted"""
ids = tokeniser.encode(text)
print(ids)
[58, 2, 872, 1013, 615, 541, 763]
We also add a decode method to our tokeniser, although at this point it is not quite as useful. What would print(tokeniser.decode(ids)) output?
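With the decode method above, the output would be roughly the following; the space after the apostrophe survives because this simple scheme joins tokens with spaces and only strips the space before punctuation:
print(tokeniser.decode(ids))
# It' s the last he painted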
To encode, we first initialise our tokeniser class, passing vocab as an argument:
tokeniser = SimpleTokeniserV1(vocab)
When encoding text with ids = tokeniser.encode("It's the last he painted"), the method relies on the two lookup tables created in the constructor:
self.str_to_int = vocab
#('As', 17)
#('At', 18)
#('Be', 19)
#('Begin', 20)
self.int_to_str = {i:s for s,i in vocab.items()}
#(17, 'As')
#(18, 'At')
...
These two mappings are used for encoding and decoding respectively. When entering the encode function, the text is first split on whitespace and special characters, and then token IDs are looked up.
ids = [self.str_to_int[s] for s in preprocessed]
Alternatively, the same logic can be written as an explicit loop:
ids = []
for s in preprocessed:
    ids.append(self.str_to_int[s])
In pseudocode, it looks like this:
for word in tokenised_words:
    append str_to_int[word] to token_ids
It's important to note that for a word to be tokenised, it must exist in the vocabulary; we cannot conjure a token ID for arbitrary words. This leads to the following issue:
text = "Hello, do you like tea?"
x = tokeniser.encode(text)
print(x)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[10], line 2
1 text = "Hello, do you like tea?"
----> 2 x = tokeniser.encode(text)
3 print(x)
Cell In[7], line 10, in SimpleTokeniserV1.encode(self, text)
8 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
9 preprocessed = [item.strip() for item in preprocessed if item.strip()]
---> 10 ids = [self.str_to_int[s] for s in preprocessed]
11 return ids
KeyError: 'Hello'
To overcome the issue of tokens not appearing in our vocab, we will add special tokens to the vocabulary to handle certain contexts: an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary, and an <|endoftext|> token that we can use to separate two unrelated text sources.
When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM.
We can now take our tokens and add the two special tokens, increasing our vocab size to 1161.
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}
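A quick sanity check (a sketch): since the special tokens are appended after the sorted word list, they should occupy the last two IDs.
print(len(vocab.items()))        # 1161
for token, idx in list(vocab.items())[-2:]:
    print(token, idx)            # <|unk|> and <|endoftext|> receive the highest IDs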
We can now update our tokeniser class by changing the preprocessed line in SimpleTokeniserV1: each token is checked against our vocab and, if it is not present, replaced with <|unk|>.
class SimpleTokeniserV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        # Replace any token that is not in the vocabulary with <|unk|>
        preprocessed = [item.strip() if item.strip() in self.str_to_int else "<|unk|>"
                        for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that join() inserted before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
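To reproduce the decoded output shown below, the tokeniser can be applied to two sentences joined by <|endoftext|>; note that the second sentence here is an assumption reconstructed from that output (its final word just needs to be absent from the vocabulary):
tokeniser = SimpleTokeniserV2(vocab)
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."   # "palace" is assumed; any out-of-vocab word behaves the same
text = " <|endoftext|> ".join((text1, text2))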
We can now tokenise this text. Rather than getting an error, the <|unk|> token replaces those tokens that do not exist within our vocab:
print(tokeniser.decode(tokeniser.encode(text)))
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.
Thus far we have encoded simply by assigning each word a number in ascending order; however, a more sophisticated approach is byte pair encoding (BPE), which breaks words down into smaller subword units.
BPE (Byte Pair Encoding) is a data compression technique that iteratively merges the most frequently occurring pairs of bytes or characters.
The main advantages of BPE include: 1. Handling out-of-vocabulary words effectively 2. Reducing vocabulary size while maintaining meaning 3. Better handling of rare words and morphological variations
For example, the word "understanding" might be tokenized as "under" + "stand" + "ing", allowing the model to recognise parts of unfamiliar words based on common subwords.
The process of BPE typically follows these steps: start from a vocabulary of individual characters (or bytes), count the most frequent adjacent pair of symbols, merge that pair into a new token, and repeat until the target vocabulary size is reached.
A simple example:
Original text: "low lower lowest"
Initial tokens: "l", "o", "w", "e", "r", "s", "t"
After merges: "low", "er", "est"
This approach allows the model to handle new words by breaking them into known subparts. For instance: - "unhappy" → "un" + "happy" - "playing" → "play" + "ing" - "cryptocurrency" → "crypto" + "currency"
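To make the merge mechanics concrete, here is a minimal sketch of the pair-counting and merging loop on the toy corpus above (illustrative only; real implementations work on bytes and continue until a target vocabulary size is reached):
from collections import Counter

words = ["low", "lower", "lowest"]
symbol_seqs = [tuple(w) for w in words]            # start from single characters

def most_frequent_pair(seqs):
    # Count every adjacent symbol pair across all words
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(seqs, pair):
    # Fuse every occurrence of the chosen pair into a single symbol
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(tuple(out))
    return merged

for _ in range(2):                                 # two merge rounds: ('l','o') then ('lo','w')
    symbol_seqs = merge_pair(symbol_seqs, most_frequent_pair(symbol_seqs))
print(symbol_seqs)   # [('low',), ('low', 'e', 'r'), ('low', 'e', 's', 't')]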
Modern LLMs like GPT use BPE because it offers a good balance between vocabulary size and token effectiveness. It helps the model process words more efficiently while maintaining semantic understanding, although newer models may use BPE variants or other subword tokenisation schemes with larger vocabularies.
We will use the cl100k_base encoding to encode the tokens; this is the encoding used by GPT-3.5 and GPT-4, with a vocabulary of roughly 100,000 tokens.
import tiktoken
tokeniser = tiktoken.get_encoding("cl100k_base")   # GPT-2 would instead use "gpt2"
text = "Hello, do you like tea? <|endoftext|> In the sunlit terra"
integers = tokeniser.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
The output is unsurprising:
[9906, 11, 656, 499, 1093, 15600, 30, 220, 100257, 763, 279, 7160, 32735, 60661]
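As a quick check, decoding the IDs should recover the original string (tiktoken's decode reverses encode):
print(tokeniser.decode(integers))
# Hello, do you like tea? <|endoftext|> In the sunlit terra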