Generative modeling is a branch of machine learning that involves training a model to produce new data that is similar to a given dataset.
Discriminative modeling estimates p(y|x)
That is, discriminative modeling aims to model the probability of a label y given some observation x.
Generative modeling estimates p(x)
That is, generative modeling aims to model the probability of observing an observation x. Sampling from this distribution allows us to generate new observations.
In the generative modelling framework, we have a dataset of observations, X, which we assume to have been generated by an unknown distribution p_data, and we want to build a model p_model that mimics p_data.
We focus on three things:
Representation learning involves mapping high-dimensional data to a lower-dimensional latent space, where each point in the latent space serves as a compact representation of the original data. A mapping function is then learned to translate points between the latent space and the original data domain.
For example, a training set of grayscale images of biscuit tins can be represented in a lower-dimensional latent space with just two features: height and width. A machine learning model learns to identify these features as the most meaningful dimensions, then maps points in the latent space back to the original image domain using a mapping function. This enables generating new images by sampling and manipulating the latent space, such as increasing the height dimension to make a taller tin.
Latent Space: This is an abstract, lower-dimensional space where data is represented in a simplified form. It captures the essential features of the data while discarding irrelevant details. For example, a latent space for images might represent high-level features like size, shape, or color.
So, in short, a manifold in high-dimensional pixel space is mapped to a simpler latent space that can be sampled from.
A probability density function p(x) maps a point x in the sample space to a non-negative number, and the integral of p(x) over all points in the sample space must equal 1. There is only one true density function p_data(x); there are, however, many density functions p_model(x) that can be used to estimate p_data(x).
The likelihood of θ given some observed point x is defined to be the value of the density function parameterized by θ, at the point x: $$ \mathcal{L}(\theta | x) = p_\theta(x) $$
If we have a whole dataset X of independent observations, then we can write: $$ \mathcal{L}(\theta | \mathbf{X}) = \prod_{x \in \mathbf{X}} p_\theta(x) $$
Since the product of a large number of terms between 0 and 1 can be quite computationally difficult to work with, we often use the log-likelihood instead.
Expanding the product:
$$ \mathcal{L}(\theta | \mathbf{X}) = p_\theta(x_1) \, p_\theta(x_2) \cdots p_\theta(x_n) $$
Taking the logarithm of the product:
$$ \log \mathcal{L}(\theta | \mathbf{X}) = \log \big( p_\theta(x_1) \, p_\theta(x_2) \cdots p_\theta(x_n) \big) $$
Using basic log rules (the log of a product is the sum of the logs), we get the log-likelihood:
$$ \ell(\theta | \mathbf{X}) = \sum_{x \in \mathbf{X}} \log p_\theta(x) $$
Neural networks typically minimize a loss function, so we can equivalently talk about finding the set of parameters that minimizes the negative log-likelihood:
$$ \hat{\theta} = \arg\min_{\theta} \left( -\sum_{x \in \mathbf{X}} \log p_\theta(x) \right) $$
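As a concrete illustration, here is a minimal sketch (with made-up data values and an arbitrary learning rate) that fits the mean of a one-dimensional Gaussian p_θ(x) by minimising the negative log-likelihood with gradient descent:
import torch
# Made-up one-dimensional observations
X = torch.tensor([1.2, 0.8, 1.5, 0.9, 1.1])
mu = torch.zeros(1, requires_grad=True)  # the parameter theta we want to learn
sigma = 1.0                              # standard deviation kept fixed for simplicity
optimiser = torch.optim.SGD([mu], lr=0.05)
for step in range(200):
    optimiser.zero_grad()
    # Negative log-likelihood of the data under a Gaussian with mean mu and std sigma
    nll = (0.5 * ((X - mu) ** 2) / sigma**2).sum() + 0.5 * len(X) * torch.log(torch.tensor(2 * torch.pi))
    nll.backward()
    optimiser.step()
print(mu.item())  # approaches the sample mean (~1.1), the maximum likelihood estimate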
Unstructured data refers to data that does not come naturally arranged into columns of informative features, such as image, audio, or video data. Individual pixels (or samples, or characters) are uninformative on their own, which makes this kind of data difficult to model with methods that rely on predefined features.
Random forests and linear regression are poor at capturing information from unstructured data because they rely on predefined features and cannot automatically learn representations from raw data. Deep Neural Networks (DNNs), on the other hand, excel at handling unstructured data such as images, audio, and text. This is because DNNs can automatically learn hierarchical feature representations directly from the raw data through multiple layers of abstraction. This ability to learn complex patterns and representations makes DNNs particularly powerful for tasks involving unstructured data.
In a forward pass through a neural network, each layer transforms the input using a nonlinear function applied to a weighted sum of its inputs, culminating in the output layer which predicts the probability of a category. Training involves adjusting these weights to minimise prediction error by comparing outputs to ground truth and using backpropagation to update weights in the direction that reduces error. This iterative process enables the network to learn and identify features that improve prediction accuracy.
Neural networks excel at learning features from input data without human intervention, eliminating the need for feature engineering. For instance, consider a trained network predicting if a face is smiling:
Each layer progressively combines features from the previous layer, naturally arising from the training process, without explicit instructions on feature detection.
A Multi-Layer Perceptron (MLP) is a type of artificial neural network composed of multiple layers of nodes, often referred to as perceptrons. It is fully connected, meaning every node in one layer is connected to every node in the next. An MLP typically consists of an input layer (receiving data), one or more hidden layers (performing nonlinear transformations using activation functions like ReLU or sigmoid), and an output layer (producing the final result).
The MLP is a discriminative (rather than generative) model, but supervised learning will still play a role in many types of generative models.
In the example below, we will classify images from the CIFAR-10 dataset, a collection of images that each fit into one of 10 categories. The image data consists of integers between 0 and 255 for each pixel channel. To improve training, these are scaled to the range [0, 1] by dividing by 255. Neural networks output probabilities for each class (e.g., [0.1, 0.3, 0.6, ...]), so we need the ground truth in the same vector format to calculate the error during training (e.g., using categorical cross-entropy loss). Therefore, we must encode the class labels accordingly, hence we will use one-hot encoding, e.g., integer label 3 → one-hot vector: [0,0,0,1,0,0,0,0,0,0].
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load CIFAR-10 dataset
transform = transforms.Compose([
transforms.ToTensor(), # Converts to [0, 1] range
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
#Normalize pixel values from 0-255 to 0-1 range
x_train = trainset.data.astype('float32') / 255.0
x_test = testset.data.astype('float32') / 255.0
NUM_CLASSES = 10
# Convert integer class labels to one-hot encoded vectors
y_train = torch.nn.functional.one_hot(torch.tensor(trainset.targets), NUM_CLASSES).float()
y_test = torch.nn.functional.one_hot(torch.tensor(testset.targets), NUM_CLASSES).float()
The trainset.data is a 4D array representing the images in the CIFAR-10 dataset. Its shape is (50000, 32, 32, 3): 50,000 images, each 32 pixels high and 32 pixels wide, with 3 colour channels.
Each pixel in this array has a value between 0 and 255, representing the intensity of the corresponding colour channel:
[[[ 34, 56, 123], [255, 100, 67], ...], # Row 1
[[ 12, 45, 90], [200, 134, 33], ...], # Row 2 (each inner triple is [R, G, B])
...
]
.astype('float32') converts the integer pixel values (e.g. 0–255) to floating-point values.
/ 255.0 divides each pixel value by 255 to normalize the range from [0, 255] to [0, 1].
Before normalisation:
[[[34, 56, 123], [255, 100, 67], ...],
[[12, 45, 90], [200, 134, 33], ...], ...]
After normalisation:
[[[0.133, 0.220, 0.482], [1.000, 0.392, 0.262], ...],
[[0.047, 0.176, 0.353], [0.784, 0.525, 0.129], ...], ...]
As for creating the model, PyTorch provides nn.Sequential, which allows you to define a model where layers are applied in a simple, linear order. The model starts by flattening the input from its original shape (batch_size, 3, 32, 32) into a vector of length 3072, followed by two fully connected (nn.Linear) layers with ReLU activations, and ends with a final nn.Linear layer that outputs one score (logit) per class for the 10 classes; the softmax that converts these scores into probabilities is applied inside the loss function (nn.CrossEntropyLoss) rather than inside the model.
import torch
import torch.nn as nn
class SimpleMLP(nn.Module):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        self.model = nn.Sequential(
            nn.Flatten(),                 # Flattens inputs from (batch_size, 3, 32, 32) to (batch_size, 3072)
            nn.Linear(32 * 32 * 3, 200),  # Fully connected layer with 200 neurons
            nn.ReLU(),
            nn.Linear(200, 150),          # 200 neurons -> 150 neurons
            nn.ReLU(),
            nn.Linear(150, 10),           # Output layer with 10 neurons (one per class)
            # nn.Softmax(dim=1)           # Not needed here: nn.CrossEntropyLoss applies softmax internally
        )

    def forward(self, x):
        return self.model(x)
# Instantiate and print the model
model = SimpleMLP().to(device)
print(model)
Here is the architecture of the model
To build our MLP, we use three key building blocks: the input, a Flatten layer, and Dense (fully connected, nn.Linear) layers. The input specifies the shape of each data element (e.g. (3, 32, 32)), without requiring the batch size, as neural networks can process varying batch sizes dynamically. The Flatten layer converts the multidimensional input (e.g. 3×32×32) into a flat vector (e.g. length 3072), which is necessary because Dense layers require flat inputs. The Dense layer, a fundamental building block, connects every unit in the current layer to every unit in the previous layer via weighted connections, and applies a nonlinear activation function (e.g., ReLU). This allows the network to learn complex, non-linear relationships in the data rather than just linear combinations, making Dense layers critical for feature extraction and classification.
ReLU (Rectified Linear Unit) outputs f(x)=max(0,x), activating neurons only when x>0. It is computationally efficient and widely used in hidden layers but can suffer from the "dead neuron" problem, where neurons stop updating due to gradients becoming zero for negative inputs. Leaky ReLU addresses this by allowing a small slope (ax) for x<0, preventing neurons from dying completely. Sigmoid, on the other hand, maps inputs to the range (0,1), making it useful for binary classification tasks. However, it suffers from the vanishing gradient problem as its derivative diminishes for extreme input values, leading to slow or stalled learning.
In practice, ReLU is the default choice for hidden layers due to its simplicity and efficiency, while Leaky ReLU is preferred when the dead neuron issue is significant. Sigmoid is primarily used in the output layer for binary classification or probabilistic outputs. The derivatives of these functions differ: ReLU has a constant derivative of 1 for positive inputs and 0 otherwise, while Leaky ReLU has a small non-zero derivative for negative inputs, avoiding gradient stagnation. Sigmoid’s derivative is f(x)=f(x)(1−f(x)), which shrinks for large or small values of x, leading to the vanishing gradient problem in deep networks. This is why ReLU or Leaky ReLU is generally favored for deep architectures.
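To make these behaviours concrete, here is a small sketch comparing the three activations on the same (arbitrary) inputs:
import torch
import torch.nn.functional as F
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(F.relu(x))                            # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(F.leaky_relu(x, negative_slope=0.1))  # tensor([-0.2000, -0.0500, 0.0000, 1.5000])
print(torch.sigmoid(x))                     # tensor([0.1192, 0.3775, 0.5000, 0.8176])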
This is the softmax function, which converts a vector of raw scores (logits) into a probability distribution over the classes: $$ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} $$
Loss functions are mathematical formulations that quantify the difference between the predicted values p and the actual target values y. During backpropagation, the gradient of the loss with respect to the model parameters, ∂L/∂θ, is computed and used to update the weights.
Mean squared error: Used for regression tasks, it penalizes larger deviations between predictions and targets.
Mathematical Form: $$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - p_i)^2 $$
Backpropagation Gradient: $$ \frac{\partial \text{MSE}}{\partial p_i} = -\frac{2}{n}(y_i - p_i) $$
Categorical Cross-Entropy: Used for multi-class classification tasks where each observation belongs to exactly one class. It measures the negative log-probability of the true class.
Mathematical Form: $$ \text{CCE} = -\sum_{i=1}^{n} y_i \log(p_i) $$
Here, y_i is 1 only for the true class; otherwise, it’s 0 (one-hot encoded).
Backpropagation Gradient: $$ \frac{\partial \text{CCE}}{\partial p_i} = -\frac{y_i}{p_i} $$
Binary Cross-Entropy: Used for binary classification tasks or multilabel classification problems. It combines the loss from the positive and negative classes.
Mathematical Form: $$ \text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \big( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \big) $$
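As a quick sketch of how these three losses look in PyTorch (the tensors below are made-up toy values):
import torch
import torch.nn as nn
# Regression: mean squared error
mse = nn.MSELoss()
print(mse(torch.tensor([2.5, 0.0]), torch.tensor([3.0, -0.5])))  # mean of (0.5^2, 0.5^2) = 0.25
# Multi-class: categorical cross-entropy (CrossEntropyLoss takes raw logits + integer labels)
cce = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, 0.1]])  # scores for 3 classes, batch of 1
print(cce(logits, torch.tensor([0])))      # small loss: the true class already has the largest logit
# Binary / multilabel: binary cross-entropy on probabilities
bce = nn.BCELoss()
print(bce(torch.tensor([0.9, 0.2]), torch.tensor([1.0, 0.0])))   # mean of -log(0.9) and -log(0.8)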
We can now define the loss function and optimiser
import torch.optim as optim
# Define the loss function
criterion = nn.CrossEntropyLoss() # For categorical cross-entropy
# Define the optimiser
optimiser = optim.Adam(model.parameters(), lr=0.001)
We will now train the model. The network's weights are initialised to small random values, then the network performs training steps. At each step, one batch of images passes through the network, and the error is backpropagated. This is mini-batch gradient descent: rather than computing the gradient over the entire dataset at once (batch gradient descent) or over a single example at a time (pure stochastic gradient descent), we update the weights using small batches, typically of between 32 and 256 images. Once all the data has passed through the network, one epoch is complete; one epoch therefore consists of (dataset size ÷ batch size) training steps, rounded up.
batch_size = 32
from torch.utils.data import DataLoader
train_loader = DataLoader(
trainset, batch_size=batch_size, shuffle=True, num_workers=2)
test_loader = DataLoader(
testset, batch_size=batch_size, shuffle=False, num_workers=2)
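As a quick sanity check of the steps-per-epoch arithmetic described above:
print(len(trainset))      # 50000 training images
print(len(train_loader))  # 1563 steps per epoch = ceil(50000 / 32)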
# Note: the DataLoader above already yields (image, label) batches directly from trainset,
# so the x_train / y_train tensors created earlier are not needed in this training loop.
epochs = 10
for epoch in range(epochs):
    model.train()                     # set model to training mode
    epoch_loss = 0.0                  # track loss per epoch
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimiser.zero_grad()         # zero gradients from the previous step
        predictions = model(batch_x)  # forward pass: compute predictions (raw logits)
        loss = criterion(predictions, batch_y)  # CrossEntropyLoss expects logits and integer class labels
        loss.backward()               # backward pass: compute gradients
        optimiser.step()              # update model parameters
        epoch_loss += loss.item()     # accumulate batch loss for the epoch
    avg_loss = epoch_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
The output shows the average loss for each epoch:
Epoch 1/10, Loss: 1.8445
Epoch 2/10, Loss: 1.6727
Epoch 3/10, Loss: 1.5933
Epoch 4/10, Loss: 1.5474
Epoch 5/10, Loss: 1.5052
Epoch 6/10, Loss: 1.4810
Epoch 7/10, Loss: 1.4572
Epoch 8/10, Loss: 1.4335
Epoch 9/10, Loss: 1.4188
Epoch 10/10, Loss: 1.4005
We know the model achieves a loss of around 1.4 on the training set, but how does it perform on data it has never seen? To find out, we simply repeat the last step, but run it on the test dataset instead.
x_test = torch.tensor(x_test)
y_test = torch.tensor(y_test)
# Set the model to evaluation mode
model.eval()
# Initialize variables for tracking accuracy
correct = 0
total = 0
# Evaluate the model
with torch.no_grad():  # Disable gradient computation
    for batch_x, batch_y in test_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        # Forward pass: Compute predictions
        predictions = model(batch_x)
        # Convert predictions to class indices
        predicted_classes = torch.argmax(predictions, dim=1)
        # Update accuracy metrics
        correct += (predicted_classes == batch_y).sum().item()
        total += batch_y.size(0)
# Calculate and print overall accuracy
accuracy = correct / total
print(f"Test Accuracy: {accuracy * 100:.2f}%")
Test Accuracy: 49.05%
We can see that the model achieves 49.05% accuracy on images that it has never seen before. Note that if the model were guessing randomly, it would achieve approximately 10% accuracy (because there are 10 classes), so 49% is a good result given that we have used a very basic neural network.
In our current network, there is nothing that explicitly considers the spatial structure of the input images. The first step involves flattening the image into a single vector, which is then passed into the first dense layer. However, to account for spatial structure, we will use a Convolutional Neural Network (CNN).
CNNs are motivated by the need to recognize patterns with tolerance to variations in location, as well as spatial and temporal shifts. This tolerance is achieved by applying the same learned pattern detectors at every location: instead of learning a separate set of weights for each position, CNNs share weights across subregions of the image, a concept known as weight sharing. This is accomplished by using cross-correlation as the transfer function.
A kernel (also called a filter or mask) is used to extract features from the input image. The values in this kernel are learned through backpropagation. The kernel is systematically applied across the entire input image (or the output of the previous layer), effectively "scanning" it. This process is analogous to using a sliding window to analyze different parts of the image.
As the kernel moves across the image, it performs element-wise multiplication with the section of the image it covers. The resulting values are summed up to produce a single output pixel in the feature map for that specific position. By repeating this process across the image, the CNN extracts meaningful features while preserving the spatial structure, making it highly effective for image-related tasks.
In a convolutional layer, consider a simple example of a 5×5 grayscale image as the input. A 3×3 kernel (also referred to as a filter or mask, denoted as W) is used to perform the convolution operation.
The kernel moves systematically across the input image, position by position, according to defined steps called strides. At each position, the kernel overlays a 3×3 section of the image. The values of the kernel and the overlapping image pixels are multiplied element-wise and then summed up to produce a single output value. This output value represents a feature map pixel for that position.
Here is an example
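The following minimal sketch applies a 3×3 kernel to the top-left 3×3 patch of a 5×5 image by hand, and checks the result against PyTorch's F.conv2d (the image and kernel values are made up for illustration):
import torch
import torch.nn.functional as F
# A made-up 5x5 grayscale "image" and a 3x3 kernel
image = torch.arange(25, dtype=torch.float32).reshape(5, 5)
kernel = torch.tensor([[1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0],
                       [1.0, 0.0, -1.0]])
# One position by hand: element-wise multiply the top-left 3x3 patch, then sum
patch = image[0:3, 0:3]
manual = (patch * kernel).sum()
# Full feature map via cross-correlation (no padding, stride 1 -> 3x3 output)
feature_map = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3))
print(manual)              # -6.0
print(feature_map[0, 0])   # the top-left entry matches the manual computation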
Let's build a basic conv2D layer applied to grayscale images
import torch
import torch.nn as nn
# Define the CNN as a class
class SimpleConvNet(nn.Module):
    def __init__(self):
        super(SimpleConvNet, self).__init__()
        # Convolutional Layer 1
        # in_channels = 1 (grayscale image, single channel)
        # out_channels = 2 (number of filters)
        # kernel_size = 3x3
        # stride = 1 (step size for sliding the kernel)
        # padding = 1 ('same' padding for a 3x3 kernel with stride 1, to maintain spatial dimensions)
        self.conv_layer_1 = nn.Conv2d(
            in_channels=1,       # Single input channel (grayscale image)
            out_channels=2,      # Two output filters
            kernel_size=(3, 3),  # Size of the convolution kernel (3x3)
            stride=1,            # Stride of 1
            padding=1            # "same" padding ensures output has same spatial dimensions as input
        )

    def forward(self, x):
        # Pass input through the convolutional layer
        x = self.conv_layer_1(x)
        return x
# Initialize the model
model = SimpleConvNet()
# Example input: Batch of 1 grayscale image of size 64x64
example_input = torch.randn(1, 1, 64, 64) # Shape: (batch_size, channels, height, width)
# Forward pass through the model
output = model(example_input)
# Print the output shape
print("Output shape:", output.shape)
torch.Size([1, 2, 64, 64])
In the context of convolutional neural networks (CNNs), channels refer to the depth or number of feature maps (or layers) in the input or output tensor.
For example: A grayscale image of size 64 × 64 is represented as a tensor of shape (1, 64, 64), where "1" is the number of input channels. An RGB image of the same size would be a tensor of shape (3, 64, 64).
Output channels represent the number of learned features or patterns after applying the convolution. As the network goes deeper, the number of output channels typically increases, allowing the model to learn a richer set of features. As for input channels, each filter contains one 2D kernel per input channel, and the per-channel responses are summed to produce a single output channel. For example, with an RGB image, a filter looks at the red, green, and blue channels separately and then combines that information into one feature map.
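A quick way to see this is to inspect the weight shape of a convolutional layer; in the sketch below, each of the 10 filters holds one 4×4 kernel for every one of the 3 input channels:
import torch.nn as nn
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=4)
print(conv.weight.shape)  # torch.Size([10, 3, 4, 4]) -> (out_channels, in_channels, kernel_height, kernel_width)
print(conv.bias.shape)    # torch.Size([10]) -> one bias per output channel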
Usually, there is no padding in convolution operations. However, we can apply padding by adding zeros around the edges of the input, which allows the convolution to preserve or increase the spatial dimensions of the output. This is particularly useful when we want the output to have the same size as the input, ensuring that important edge features are not lost during the convolution process.
Stride refers to the step size at which a convolutional kernel moves across the input image. By default, the stride is 1, meaning the kernel slides one pixel at a time, overlapping adjacent regions. Increasing the stride reduces the spatial dimensions of the output by skipping pixels during the convolution process, effectively downsampling the feature map. For example, a stride of 2 halves the height and width of the output compared to the input. Stride is useful for reducing computational complexity and capturing broader patterns in the input.
Dilation expands the receptive field of a convolutional kernel by inserting gaps (empty spaces) between the elements of the kernel. This allows the network to capture larger context or patterns without increasing the kernel size or the number of parameters. A dilation rate of 1 corresponds to a standard convolution, while a rate of 2 inserts one gap between kernel elements, effectively spreading the kernel. Dilation is particularly effective in tasks like segmentation and object detection, where understanding larger structures is important.
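To see how padding, stride, and dilation change the output size, here is a small sketch that passes the same 1×1×32×32 input through a few differently configured layers (the configurations are chosen purely for illustration):
import torch
import torch.nn as nn
x = torch.randn(1, 1, 32, 32)  # (batch, channels, height, width)
print(nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)(x).shape)   # [1, 1, 30, 30] - no padding shrinks the map
print(nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)(x).shape)   # [1, 1, 32, 32] - "same" size preserved
print(nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)(x).shape)   # [1, 1, 16, 16] - stride 2 halves H and W
print(nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)(x).shape) # [1, 1, 32, 32] - dilated kernel, wider receptive field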
Let us now apply the same idea to our CIFAR-10 dataset, where the input has three channels (RGB).
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # Convolutional Layer 1
        # in_channels = 3 (input has 3 channels, e.g., RGB)
        # out_channels = 10 (number of filters)
        # kernel_size = (4, 4)
        # stride = 2
        # padding = 1 (for a 4x4 kernel with stride 2, this halves the spatial dimensions: 32 -> 16)
        self.conv_layer_1 = nn.Conv2d(
            in_channels=3,    # Input has 3 channels (e.g., RGB image)
            out_channels=10,  # Number of output filters
            kernel_size=4,    # 4x4 convolutional kernel
            stride=2,         # Stride of 2
            padding=1         # Halves the spatial dimensions
        )
        # Convolutional Layer 2
        # in_channels = 10 (output of the previous layer)
        # out_channels = 20 (number of filters)
        # kernel_size = (3, 3)
        # stride = 2
        # padding = 1 (for a 3x3 kernel with stride 2, this halves the spatial dimensions: 16 -> 8)
        self.conv_layer_2 = nn.Conv2d(
            in_channels=10,   # Input comes from the previous layer
            out_channels=20,  # Number of output filters
            kernel_size=3,    # 3x3 convolutional kernel
            stride=2,         # Stride of 2
            padding=1         # Halves the spatial dimensions
        )
        # Fully Connected (Dense) Layer
        # Output size = 10 (number of classes)
        # Softmax is applied inside the loss (nn.CrossEntropyLoss) rather than here
        self.fc = nn.Linear(20 * 8 * 8, 10)  # Flattened size = 20 filters x 8 x 8 spatial size

    def forward(self, x):
        # Apply the first convolutional layer
        x = self.conv_layer_1(x)
        # Apply the second convolutional layer
        x = self.conv_layer_2(x)
        # Flatten the feature maps for the fully connected layer
        x = torch.flatten(x, start_dim=1)  # Flatten all dimensions except the batch dimension
        # Pass through the fully connected layer
        x = self.fc(x)
        return x
# Instantiate the model
model = ConvNet()
# Example input: Batch of 1 RGB image of size 32x32
example_input = torch.randn(1, 3, 32, 32) # Shape: (batch_size, channels, height, width)
# Forward pass through the model
output = model(example_input)
# Print the output shape
print("Output shape:", output.shape) # Should be (1, 10) for 1 image and 10 classes
This may look complicated, but it becomes easy once you see it visually. Looking back at padding and the 'same' setting: 'same' padding ensures that (for a stride of 1) the output of the convolution has the same spatial size as the input. Without padding, the output shrinks, because the kernel cannot be centred on edge pixels; padding allows the convolutional filter to process those edge pixels, which would otherwise be ignored.
Looking at the model, we have an input of size 32 x 32 x 3. We then apply 10 filters (masks) of size 4 x 4 x 3, which slide over the input. The 3 in the filter size corresponds to the RGB channels. Each filter produces an output, resulting in 10 output channels. The output size is reduced to 16 x 16 due to the stride of 2. The rest of the model's architecture is straightforward.
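To verify these shapes, here is a quick sketch that passes a dummy CIFAR-10-sized input through the two convolutional layers of the model defined above and prints the intermediate sizes:
import torch
x = torch.randn(1, 3, 32, 32)
h1 = model.conv_layer_1(x)
h2 = model.conv_layer_2(h1)
print(h1.shape)  # torch.Size([1, 10, 16, 16]) - 10 filters, spatial size halved by the stride of 2
print(h2.shape)  # torch.Size([1, 20, 8, 8])   - 20 filters, halved again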
One issue with DNNs is ensuring that the weights of the network remain within a reasonable range of values. If they start to become too large, this is a sign that your network is suffering from what is known as the exploding gradient problem: as errors are propagated backward through the network, the calculation of the gradient in the earlier layers can sometimes grow exponentially large, causing wild fluctuations in the weight values. To prevent this from happening, you need to understand the root cause of the exploding gradient problem.
One of the reasons for scaling input data to a neural network is to ensure a stable start to training over the first few iterations. Since the weights of the network are initially randomized, unscaled input could potentially create huge activation values that immediately lead to exploding gradients. For example, instead of passing pixel values from 0–255 into the input layer, we usually scale these values to between –1 and 1. Because the input is scaled, it’s natural to expect the activations from all future layers to be relatively well scaled as well. Initially this may be true, but as the network trains and the weights move further away from their random initial values, this assumption can start to break down. This phenomenon is known as covariate shift. To remain stable, when the network updates the weights, each layer implicitly assumes that the distribution of its input from the layer beneath is approximately consistent across iterations. However, since there is nothing to stop any of the activation distributions shifting significantly in a certain direction, this can sometimes lead to runaway weight values and an overall collapse of the network.
Batch normalization is a technique that drastically reduces this problem.
We can place batch normalization layers after dense or convolutional layers to normalize the output. When training a model with batch normalization, the mean and standard deviation of the input features for each channel are computed across the batch of data. These statistics are used to normalize the data for each channel so that it has a mean of 0 and a standard deviation of 1. This helps the model learn more efficiently by stabilizing the distribution of inputs to each layer.
However, during prediction (or inference), we often make predictions on a single sample (or a very small batch). In such cases, there is no "batch" over which to calculate the mean and standard deviation. To address this, during the training phase, batch normalization layers maintain a moving average of the mean and standard deviation for each channel across all batches. These moving averages are updated over time using the statistics from each batch, and they approximate the overall mean and standard deviation of the training data.
In PyTorch, we implement batch normalisation like so:
# Batch normalization for 2D inputs (e.g., images)
batch_norm = nn.BatchNorm2d(num_features=10, momentum=0.1)  # PyTorch's momentum weights the new batch statistics, so 0.1 here corresponds to a Keras-style momentum of 0.9
Batch normalization uses moving averages of the batch mean and variance during training to normalize inputs. This moving average provides a smoothed estimate of the statistics over time, which is then used during inference (when batch statistics are unavailable). The momentum parameter controls how much weight is given to the previous moving average versus the current batch statistics when updating these estimates.
A high momentum (e.g., 0.9) means the moving averages update slowly and give more weight to past values, resulting in smoother but slower adaptation to changes in the data distribution. A low momentum (e.g., 0.1) means the moving averages update more quickly, giving more weight to the current batch, which can react faster to changes but may be noisier.
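A small sketch of this behaviour (the input batch is made up; PyTorch updates the running statistics as running = (1 - momentum) * running + momentum * batch_statistic):
import torch
import torch.nn as nn
bn = nn.BatchNorm2d(num_features=2, momentum=0.1)
x = torch.randn(8, 2, 4, 4) * 3 + 5   # made-up batch with per-channel mean ~5
bn.train()
_ = bn(x)                  # training mode: normalise with batch statistics, update running averages
print(bn.running_mean)     # has moved 10% of the way from 0 towards the batch mean (~0.5)
bn.eval()
_ = bn(x)                  # eval mode: normalise with the stored running averages instead
print(bn.running_mean)     # unchanged - no update happens at inference time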
Any successful machine learning algorithm must ensure that it generalizes to unseen data, rather than simply remembering the training dataset. If an algorithm performs well on the training dataset but not the test dataset, we say that it is suffering from overfitting. To counteract this problem, we use regularization techniques, which ensure that the model is penalised if it starts to overfit.
Dropout layers are very simple. During training, each dropout layer chooses a random set of units from the preceding layer and sets their output to 0. Dropout layers are used most commonly after dense layers since these are the most prone to overfitting due to the higher number of weights, though you can also use them after convolutional layers.
Batch normalization has also been shown to reduce overfitting, and therefore many modern deep learning architectures don’t use dropout at all, relying solely on batch normalization for regularization. As with most deep learning principles, there is no golden rule that applies in every situation.
dropout = nn.Dropout(p=0.25) # p is the probability of dropping a unit
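A quick sketch of what the layer does in each mode:
import torch
import torch.nn as nn
dropout = nn.Dropout(p=0.25)
x = torch.ones(8)
dropout.train()
print(dropout(x))  # roughly a quarter of entries zeroed, the rest scaled up by 1/(1 - 0.25)
dropout.eval()
print(dropout(x))  # identity at inference time: all ones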
Let's put all the pieces together now.
import torch
import torch.nn as nn
import torch.nn.functional as F
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # Define the layers
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.bn3 = nn.BatchNorm2d(64)
        self.conv4 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(64)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(64 * 8 * 8, 128)  # 64 channels x 8 x 8 spatial size after two stride-2 convolutions
        self.bn_fc = nn.GroupNorm(num_groups=1, num_channels=128)  # normalisation over the 128 features (nn.BatchNorm1d(128) is the batch-norm equivalent)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Input: (batch_size, 3, 32, 32)
        # First Conv Block
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.leaky_relu(x)
        # Second Conv Block
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.leaky_relu(x)
        # Third Conv Block
        x = self.conv3(x)
        x = self.bn3(x)
        x = F.leaky_relu(x)
        # Fourth Conv Block
        x = self.conv4(x)
        x = self.bn4(x)
        x = F.leaky_relu(x)
        # Flatten the tensor
        x = self.flatten(x)
        # Fully Connected Layers
        x = self.fc1(x)
        x = self.bn_fc(x)
        x = F.leaky_relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        # Output: (batch_size, 10)
        # NOTE: this returns softmax probabilities; with nn.CrossEntropyLoss (which applies
        # log-softmax itself), it would be preferable to return x (the raw logits) instead.
        return F.softmax(x, dim=1)


# Instantiate the model
model = ConvNet()
print(model)

# Example input to test the model
example_input = torch.randn(1, 3, 32, 32)  # (batch_size, channels, height, width)
output = model(example_input)
print(output.shape)  # Should print (1, 10)
We use four stacked Conv2d layers, each followed by a BatchNorm2d and a LeakyReLU activation. After flattening the resulting tensor, we pass the data through a fully connected (Linear) layer of size 128, again followed by a normalisation layer (GroupNorm over the 128 features in the code above) and a LeakyReLU. This is immediately followed by a Dropout layer for regularization, and the network concludes with an output Linear layer of size 10.
Training and evaluation are the same as in the last example.
Epoch 1/10, Loss: 2.0214
Epoch 2/10, Loss: 1.8958
Epoch 3/10, Loss: 1.8430
Epoch 4/10, Loss: 1.8126
Epoch 5/10, Loss: 1.7858
Epoch 6/10, Loss: 1.7667
Epoch 7/10, Loss: 1.7489
Epoch 8/10, Loss: 1.7347
Epoch 9/10, Loss: 1.7232
Epoch 10/10, Loss: 1.7089
x_test = torch.tensor(x_test)
y_test = torch.tensor(y_test)
# Set the model to evaluation mode
model.eval()
# Initialize variables for tracking accuracy
correct = 0
total = 0
# Evaluate the model
with torch.no_grad():  # Disable gradient computation
    for batch_x, batch_y in test_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        predictions = model(batch_x)  # Forward pass: Compute predictions
        predicted_classes = torch.argmax(predictions, dim=1)  # Convert predictions to class indices
        correct += (predicted_classes == batch_y).sum().item()  # Update accuracy metrics
        total += batch_y.size(0)
# Calculate and print overall accuracy
accuracy = correct / total
print(f"Test Accuracy: {accuracy * 100:.2f}%")
Test Accuracy: 73.64%
Comparing the two models, the CNN is far more accurate (73.6% vs 49.1%) yet reports a higher training loss. A large part of this comes from how the outputs are fed to the loss: the CNN applies softmax inside forward(), and nn.CrossEntropyLoss then applies its own log-softmax on top, so the CNN's predictions can never appear fully confident to the loss and its reported loss stays high regardless of how good the predictions are. The simple MLP passes raw logits to the loss and can therefore drive the loss lower even though its predictions are wrong far more often. Accuracy only depends on which class receives the highest score, which is why the CNN wins on accuracy despite the higher loss; returning raw logits from the CNN (removing the softmax from forward()) would allow its loss to fall as well.