← Home

Large Language Models

2024-11-21T00:33:48+0000

This is a short summary of large language models, written for myself to remember and to aid understanding. Thanks to Sebastian Raschka & Andrej Karpathy.

Contents

  1. Introduction
  2. Encoder & Decoder
  3. Data Preparation

1. Introduction

Large language models (LLMs), such as OpenAI's ChatGPT or Anthropic's Claude, are deep neural networks based on the decoder architecture of the Transformer model. These models possess remarkable capabilities to understand, generate, and interpret human language.

Unlike earlier NLP models designed for specific tasks, LLMs are generalised and versatile. They have hundreds of billions of parameters—weights in the network that are optimized during training to predict the next word in a sequence. Due to their ability to generate content, LLMs fall under the broader category of Generative AI.

2. Encoder & Decoder

Transformer architecture

The encoder module processes text, encoding it into vectors that capture contextual information to pass into the decoder blocks. The decoder block, in turn, takes the encoded input and generates an output.

Both modules can also be used separately and effectively. The encoder module, due to its ability to capture the contextual information of inputs, is particularly suited for masked word prediction (e.g., "There was an _____ with the car"). This approach is used in BERT (short for Bidirectional Encoder Representations from Transformers).

On the other hand, the decoder module, due to its generative nature, can predict the following tokens (e.g., "There was an issue with the ___"). This approach forms the basis of GPT models (short for Generative Pretrained Transformers).

GPT models, primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities. These models are adept at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior specific examples. On the other hand, few-shot learning involves learning from a minimal number of examples the user provides as input.
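To illustrate the difference, here are two hypothetical prompts (the wording is my own, not model output): a zero-shot prompt describes the task with no examples, while a few-shot prompt includes a handful of worked examples before the actual query.

# Zero-shot: the task is described, but no examples are given.
zero_shot_prompt = "Translate the following sentence into German: 'The car broke down.'"

# Few-shot: a few input/output pairs precede the query the model should complete.
few_shot_prompt = (
    "English: Good morning -> German: Guten Morgen\n"
    "English: Thank you -> German: Danke\n"
    "English: The car broke down -> German:"
)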

2.1 GPT decoder

As mentioned, the GPT model uses only the decoder module. It operates sequentially, moving left to right and predicting each word based on the words that come before it.

decoder architecture
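To make the left-to-right idea concrete, here is a minimal sketch of an autoregressive generation loop. The predict_next function is a hypothetical stand-in for a real model (it just returns a canned continuation); the point is the loop structure, where each prediction conditions on all prior tokens.

def predict_next(tokens):
    # Stand-in for a trained model: map the last token to a canned next token.
    canned = {"issue": "with", "with": "the", "the": "car"}
    return canned.get(tokens[-1], "<|endoftext|>")

tokens = ["There", "was", "an", "issue"]
for _ in range(3):
    tokens.append(predict_next(tokens))  # each step sees everything generated so far
print(" ".join(tokens))  # There was an issue with the car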

Due to training on such a vast amount of data, the model, although primarily designed to excel at next-word prediction, exhibits emergent behavior. This refers to behavior that the model was not specifically trained for but arises as a consequence of being exposed to such extensive data.

2.2 Building an LLM

The challenging part of building a large language model (LLM) lies in stage 1, where the data is prepared and the attention mechanism is implemented. Once this stage is complete, the remaining decoder blocks are constructed to form a fully functional decoder module.

blueprint

3. Data Preparation

An input to a transformer, whether audio, text, or video, cannot be processed in raw form; all data must be standardised. The easiest and most convenient way to input the data is as embeddings, which are numerical representations of words in a continuous vector space, where words with similar meanings or contexts are placed closer together. They encode semantic and syntactic relationships between words, enabling models to understand and process language more effectively.

Word embeddings are the most commonly used for text, though paragraph and sentence embeddings are also possible; for GPT, word embeddings are sufficient.

Word embeddings can have varying dimensions, from one to thousands.

While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimising the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand.
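As a minimal sketch (assuming PyTorch), an embedding layer is just a trainable lookup table from token IDs to vectors. The vocabulary size and embedding dimension below are illustrative values, not taken from the text above.

import torch

vocab_size = 1159       # illustrative: matches the vocabulary size built later in this post
embedding_dim = 256     # illustrative: each token is mapped to a 256-dimensional vector

# A trainable lookup table: row i is the embedding vector for token ID i.
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([58, 2, 872])   # three example token IDs
vectors = embedding_layer(token_ids)     # shape: (3, 256)
print(vectors.shape)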

3.1 Tokenisation

We will now tokenise the text, as shown below

llm tokens

Implementation Example

We will load the data from a stored file

with open("verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of character:", len(raw_text))
Total number of character: 20479

We can now split the text on whitespace and punctuation using the regex module, discarding empty and whitespace-only items

import re

preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
4649

3.2 Converting to Token IDs

Now, we move on to converting these tokens into integer representations called token IDs, which serve as an intermediate step before generating embedding vectors. To achieve this, we first build a vocabulary—a mapping of unique tokens, including words and special characters, to unique integers.

We convert tokens to token IDs instead of embeddings directly because token IDs serve as a compact, discrete representation that acts as a bridge between raw text and the embedding layer. This intermediate step ensures that each unique token is assigned a unique integer, enabling efficient lookup and management of embeddings.

Implementation Example

# Converting tokens into token IDs
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)
1159

# Build the vocabulary: map each unique token to a unique integer
vocab = {token:integer for integer, token in enumerate(all_words)}

A few sample entries from the vocabulary:

(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)

We can now create a simple tokeniser class that combines the text splitting and the encoding to token IDs

class SimpleTokeniserV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        # Split text on special characters and whitespace
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        # Map IDs back to token strings and join them with spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that was inserted before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

# If text is: "Hello, world!"
# After re.split():
# preprocessed = ['Hello', ',', '', ' ', 'world', '!', '']

# After the cleaning loop:
# preprocessed = ['Hello', ',', 'world', '!']

Testing it out, you can see we have achieved the first part of our model

tokeniser = SimpleTokeniserV1(vocab)
text = """It's the last he painted"""
ids = tokeniser.encode(text)
print(ids)
[58, 2, 872, 1013, 615, 541, 763]

We also added a decode method to our tokeniser, although at this point it is not quite as useful

tokens for llm

What would print(tokeniser.decode(ids)) output?
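With the decode method fixed as above, the round trip should come out roughly like this (the apostrophe was split into its own token, so the spacing is not identical to the original input):

print(tokeniser.decode(ids))
# It' s the last he painted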

To encode, we first initialise our tokeniser class, passing vocab as an argument: tokeniser = SimpleTokeniserV1(vocab)

When we then encode text with ids = tokeniser.encode("It's the last he painted"), the tokeniser uses two mappings created in __init__:

self.str_to_int = vocab 
#('As', 17)
#('At', 18)
#('Be', 19)
#('Begin', 20)

self.int_to_str = {i:s for s,i in vocab.items()}
#(17, 'As')
#(18, 'At')
...

This process is used for encoding and decoding. When entering the encode function, the text is first split based on white spaces and special characters, and then token IDs are generated.

ids = [self.str_to_int[s] for s in preprocessed]

Alternatively, the same logic can be written in a more explicit way:

ids = []
for s in preprocessed:
    ids.append(self.str_to_int[s])

In pseudocode, it looks like this:

for word in tokenised_words:
    append self.str_to_int[word] to token_ids

It's important to note that for a word to be tokenised, it must exist in the vocabulary; we cannot conjure a token ID for an arbitrary word. This leads to the following issue

text = "Hello, do you like tea?"
x = tokeniser.encode(text)
print(x)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 2
      1 text = "Hello, do you like tea?"
----> 2 x = tokeniser.encode(text)
      3 print(x)

Cell In[7], line 10, in SimpleTokeniserV1.encode(self, text)
      8 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
      9 preprocessed = [item.strip() for item in preprocessed if item.strip()]
---> 10 ids = [self.str_to_int[s] for s in preprocessed]
     11 return ids

KeyError: 'Hello'

3.3 Adding Special Tokens

To overcome the issue of tokens not appearing in our vocabulary, we will add special tokens to deal with certain contexts: an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary, and an <|endoftext|> token that we can use to separate two unrelated text sources.

unk and endtext

When working with multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM.

We can now take our tokens and add the two special tokens, which increases our vocabulary size to 1161.

all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])
vocab = {token:integer for integer, token in enumerate(all_tokens)}
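As a quick sanity check, we can print the last two vocabulary entries to see the IDs assigned to the special tokens:

for token, idx in list(vocab.items())[-2:]:
    print(token, idx)
# <|unk|> 1159
# <|endoftext|> 1160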

We can now update our tokeniser class by changing the preprocessing step in `SimpleTokeniserV1`: we check whether each token exists in our vocabulary and, if it does not, substitute <|unk|>.

class SimpleTokeniserV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed  = [item.strip() if item.strip() in self.str_to_int else "<|unk|>" 
                           for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text) 
        return text

We can now tokenise text, where <|unk|> and <|endoftext|> will be assigned IDs 1159 and 1160 respectively.
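For the decoded output below to make sense, the tokeniser needs to be re-created from the updated class, and the input needs to be two unrelated snippets joined by <|endoftext|>. A minimal sketch of that setup (the exact wording of the second sentence is an assumption inferred from the output below):

tokeniser = SimpleTokeniserV2(vocab)
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."  # assumed: ends with a word not in the vocab
text = " <|endoftext|> ".join((text1, text2))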

print(tokeniser.decode(tokeniser.encode(text)))

Rather than getting an error, our special tokens replace those tokens that do not exist within our vocab.

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.

3.4 Byte Pair Encoding

Thus far we have encoded simply by assigning each word a number in ascending order; a more sophisticated approach is byte pair encoding (BPE), which breaks words down into smaller subword units.

BPE (Byte Pair Encoding) is a data compression technique that iteratively merges the most frequently occurring pairs of bytes or characters.

The main advantages of BPE include:

  1. Handling out-of-vocabulary words effectively
  2. Reducing vocabulary size while maintaining meaning
  3. Better handling of rare words and morphological variations

For example, the word "understanding" might be tokenized as "under" + "stand" + "ing", allowing the model to recognise parts of unfamiliar words based on common subwords.

The process of BPE typically follows these steps:

  1. Start with individual characters as the base vocabulary
  2. Count frequency of adjacent pairs
  3. Merge the most frequent pair
  4. Repeat until reaching desired vocabulary size

A simple example:

Original text: "low lower lowest"
Initial tokens: "l", "o", "w", "e", "r", "s", "t"
After merges: "low", "er", "est"
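To make the merge loop above concrete, here is a minimal, illustrative sketch of BPE training in plain Python. It is not the tokeniser GPT models actually ship with, just the counting-and-merging idea; the corpus and the number of merges are arbitrary choices.

from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across all words (each word is a tuple of symbols).
    counts = Counter()
    for word, freq in words.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Start from individual characters, then repeatedly merge the most frequent pair.
corpus = "low lower lowest".split()
words = Counter(tuple(w) for w in corpus)
for _ in range(3):
    pair_counts = get_pair_counts(words)
    best = max(pair_counts, key=pair_counts.get)
    words = merge_pair(words, best)
print(list(words))
# After 3 merges: [('low',), ('lowe', 'r'), ('lowe', 's', 't')]
# (the exact subwords depend on how many merges are run and how ties are broken)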

This approach allows the model to handle new words by breaking them into known subparts. For instance:

  - "unhappy" → "un" + "happy"
  - "playing" → "play" + "ing"
  - "cryptocurrency" → "crypto" + "currency"

Modern LLMs like GPT use BPE because it offers a good balance between vocabulary size and token effectiveness. It helps the model process words more efficiently while maintaining semantic understanding. However, future tokenisers may move towards refinements of, or alternatives to, plain BPE.

We will use the cl100k_base encoding (the tokeniser used by GPT-4 models) to encode the tokens; it has a vocabulary of roughly 100,000 tokens

import tiktoken
tokeniser = tiktoken.get_encoding("cl100k_base")  # GPT-2 would use tiktoken.get_encoding("gpt2") instead
text = "Hello, do you like tea? <|endoftext|> In the sunlit terra"
integers = tokeniser.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

The output is unsurprising

[9906, 11, 656, 499, 1093, 15600, 30, 220, 100257, 763, 279, 7160, 32735, 60661]
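As a quick check, decoding those IDs with the same tokeniser reproduces the original string exactly, since BPE encoding is lossless:

print(tokeniser.decode(integers))
# Hello, do you like tea? <|endoftext|> In the sunlit terra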