
Parallel Systems

2024-11-03T01:11:38+0000

This is a short summary of parallel computing, written for my own understanding. Thanks to Kayvon Fatahalian and Kunle Olukotun.

Contents

  1. A Modern Multi-Core Processor

Introduction

Parallel computing is a method of executing multiple calculations or processes simultaneously by dividing a task into smaller parts that can run independently across multiple processors. This approach increases computational speed and efficiency, making it ideal for handling large datasets, complex simulations, or tasks requiring significant processing power.

A Modern Multi-Core Processor

A simple processor has three main components: the Fetch/Decode unit, which retrieves and interprets instructions; the Execution Unit (ALU), which performs arithmetic and logical operations; and the Execution Context (registers and state), which stores the data needed for instruction execution.

Example instructions that change the state of the processor look like this:

ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
...
...
st  addr[r2], r0

But this is sequential and slow. Typically the compiler can introduce parallelism into our programs, but not in all cases.

dependency graph

Instruction-level parallelism (ILP) focuses on executing multiple instructions simultaneously by identifying independent operations within a sequence of tasks. Each instruction's dependencies must be analysed to avoid conflicts, ensuring computations can progress without errors. By breaking tasks into independent pieces and overlapping their execution, ILP improves efficiency, making processors faster and more capable of handling complex programs.
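To make this concrete, here is a small illustrative fragment (not from the original lectures): the first two statements share no operands, so the hardware is free to execute them in the same cycle, while the third depends on both results and must wait.

float ilp_example(float x, float y) {
    float a = x * x;      // independent of b
    float b = y + 2.0f;   // independent of a, so it can issue in the same cycle as a
    float c = a * b;      // depends on both a and b, so it must wait for their results
    return c;
}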

superscalar

Unlike a scalar processor, which executes one instruction per clock cycle, a superscalar processor enhances performance by decoding and executing multiple instructions per clock cycle, leveraging multiple execution units. It identifies independent instructions within a program and processes them simultaneously, using out-of-order execution logic to optimize throughput.

The data behind an execution context ultimately lives in memory, typically DRAM. However, one issue with DRAM is its latency. In the best case, a DRAM access takes roughly 248 cycles at 4 GHz, whereas an L1 cache hit takes 4 cycles, L2 takes 12 cycles, and L3 takes 38 cycles. This means that a simple instruction like ld r0, addr[r1] could take nearly 250 cycles, rather than the 1 cycle it might take on paper.
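Converting those cycle counts into time makes the gap clearer: at 4 GHz one cycle lasts 0.25 ns, so a best-case DRAM access costs about $$ 248 \text{ cycles} \times 0.25\ \text{ns/cycle} \approx 62\ \text{ns} $$ versus roughly 1 ns (4 cycles) for an L1 hit.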

cache example

This cache simulation shows how a memory array interacts with a cache system using a Least Recently Used (LRU) replacement policy. The total cache capacity is 8 bytes, divided into 4-byte cache lines, meaning only two lines can fit in the cache at a time. As addresses are accessed, cache actions such as "cold miss" (when data is loaded into cache for the first time) and "hit" (when data is found in cache) are tracked. When the cache is full, LRU replaces the least recently accessed line.

Temporal locality is observed when a recently accessed memory location is reused soon afterwards, resulting in cache hits. Spatial locality is evident when sequential memory locations within the same cache line (e.g., 0x0 to 0x3) are accessed: loading one line brings in the adjacent addresses as well, reducing misses.
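Below is a minimal sketch of the simulation described above, assuming a fully associative 8-byte cache with two 4-byte lines, LRU replacement, and a made-up address stream; the structure and output format are illustrative, not taken from the original.

#include <cstdint>
#include <cstdio>

struct Line { uint32_t tag; bool valid; int last_used; };

int main() {
    Line cache[2] = {{0, false, 0}, {0, false, 0}};            // two 4-byte lines = 8 bytes total
    uint32_t accesses[] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x8, 0x0}; // hypothetical address stream
    int time = 0;
    for (uint32_t addr : accesses) {
        uint32_t tag = addr / 4;   // addresses within the same 4-byte line share a tag
        time++;
        int hit = -1, lru = 0;
        for (int i = 0; i < 2; i++) {
            if (cache[i].valid && cache[i].tag == tag) hit = i;
            if (cache[i].last_used < cache[lru].last_used) lru = i;
        }
        if (hit >= 0) {
            cache[hit].last_used = time;
            printf("0x%x: hit\n", (unsigned)addr);
        } else {
            // Cold miss if the victim line was never filled, otherwise an LRU eviction.
            printf("0x%x: %s\n", (unsigned)addr, cache[lru].valid ? "miss (evict LRU)" : "cold miss");
            cache[lru] = {tag, true, time};
        }
    }
    return 0;
}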

Let’s look at an example where parallelism cannot be achieved automatically by the compiler; instead, the programmer must explicitly implement it.

void sinx(int N, int terms, float *x, float *y) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i]; // Start with x^3
        int denom = 6;                   // 3!
        int sign = -1;                   // Alternate sign

        for (int j = 1; j < terms; j++) {
            value += sign * numer / denom;   // Add the next term
            numer *= x[i] * x[i];           // Increment numerator to x^(2j+1)
            denom *= (2 * j + 2) * (2 * j + 3); // Update denominator
            sign *= -1;                      // Flip sign
        }

        y[i] = value; // Store the result
    }
}

This program approximates sin(x) using its Taylor expansion. It is correct, but inefficient: it processes each element of the array sequentially, which makes it slow for large arrays or high precision (more terms). The important observation is that the operations for each x[i] are independent of every other element, so they can be parallelized. By leveraging parallel computing, the program could handle multiple elements of x simultaneously, significantly improving performance.

Taylor expansion for sin(x): $$ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots $$

Running this on a scalar or even superscalar processor as above would not make its execution parallel: we might have two sets of ALUs and decode units (as seen above), but the instructions would only run on one of them, because the machine code is sequential and there is no apparent way to make the instructions independent of each other for execution. To better understand how this program can be parallelised, here is some pseudo-code:

1. Input: N (array size), terms (number of terms in Taylor series), x (input array), y (output array)

2. For each element in array x:
   a. Set initial value = x[i]
   b. Initialize:
      - numer = x[i]^3 (first power term)
      - denom = 6 (factorial of 3)
      - sign = -1 (to alternate signs)
   c. For each term in the series (from 1 to terms - 1):
      - Add the current term to value:
        value += sign * (numer / denom)
      - Update numer = numer * x[i]^2
      - Update denom = denom * (next factorial terms)
      - Flip the sign (sign = sign * -1)
   d. Store the result in y[i]
3. Output: Array y containing sin(x) values for all elements.

Since little ILP is available in such code, and neither scalar nor superscalar processors can do much to speed it up, we have to turn to multi-core processors, in a setup called Task-Level Parallelism (TLP), where the computations for x[i] and x[j] are completely independent. Since no data dependencies exist between the tasks assigned to each core, they can be executed in parallel without any bottlenecks. Each core operates independently on its own execution context, enabling full utilisation of resources.

two core

So, since there is no way for the compiler to automatically make this parallel, we create a thread so that one half of the elements runs on one core while the other half runs on another core.

taylor par

The my_args struct is a container for passing multiple arguments to the thread function. It includes the size of the input array (N), the number of terms for the Taylor expansion (terms), and pointers to the input (x) and output (y) arrays. This structure simplifies argument management, making it easy to pass all required data to the thread function.

We use std::thread to spawn one thread that runs sinx on the first N/2 elements, while the main thread calls sinx on the other half of N. That gives two instruction streams running separately on two cores; my_thread.join() waits for my_thread to finish before the program proceeds.
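A sketch of what that two-threaded version might look like, reusing the sequential sinx from earlier; the struct fields follow the description above, but the exact function and variable names are illustrative.

#include <thread>

void sinx(int N, int terms, float* x, float* y); // sequential version from above

struct my_args {
    int N;       // number of elements this thread handles
    int terms;   // number of Taylor-series terms
    float* x;    // input slice
    float* y;    // output slice
};

void my_thread_start(my_args* args) {
    sinx(args->N, args->terms, args->x, args->y); // run sinx on this thread's half
}

void parallel_sinx(int N, int terms, float* x, float* y) {
    my_args args{N / 2, terms, x, y};               // first half of the array
    std::thread my_thread(my_thread_start, &args);  // spawn a thread for the first half
    sinx(N - N / 2, terms, x + N / 2, y + N / 2);   // main thread handles the second half
    my_thread.join();                               // wait for the spawned thread to finish
}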

So, we have taken control and explicitly defined how to run this program in parallel, and we can extend this to 4, 8, 16, or even 32 cores depending on our CPU. Imagine having 32 cores and still running the program serially, at roughly 1/32 of the speed it could reach.
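Generalising the same pattern, a hypothetical version that splits the array across an arbitrary number of threads might look like this (the thread-count parameter and names are assumptions, not from the original):

#include <algorithm>
#include <thread>
#include <vector>

void sinx(int N, int terms, float* x, float* y); // sequential version from above

void parallel_sinx_n(int N, int terms, float* x, float* y, int n_threads) {
    std::vector<std::thread> threads;
    int chunk = (N + n_threads - 1) / n_threads;      // elements per thread, rounded up
    for (int t = 0; t < n_threads; t++) {
        int begin = t * chunk;
        int count = std::min(chunk, N - begin);
        if (count > 0)
            threads.emplace_back(sinx, count, terms, x + begin, y + begin);
    }
    for (auto& th : threads) th.join();               // wait for all slices to finish
}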

taylor par

An interesting aspect of this program is that all iterations of the loop operate independently, updating separate memory locations in the result array. Each computation of sin(x[i]) reads a specific input value from x[i] and writes the result to its corresponding output result[i], without interfering with other iterations. This is closely related to how matrices and tensors behave during multiplication in machine learning workloads.

dpe

SIMD allows the same instruction to be broadcast to all ALUs, enabling parallel execution of the operation on multiple data points simultaneously. This approach exploits the fact that the same computation (e.g., an addition or multiplication) often needs to be performed on many data elements, such as in arrays or matrices. Because SIMD treats an array of data as a vector, we only need to perform vector operations on different vectors to make bulk changes to memory, fetching a whole vector of values at a time. So for N elements, each element can be processed independently on a single SIMD-capable core.

dpe

The vectorised version of the program looks like this:

#include <immintrin.h>

// Assumes N is a multiple of 8.
void sinx(int N, int terms, float* x, float* y) {
    float three_fact = 6.0f; // 3!
    for (int i = 0; i < N; i += 8) {
        __m256 origx = _mm256_loadu_ps(&x[i]);        // Load 8 elements from x (unaligned-safe)
        __m256 value = origx;                         // Initialize value with x[i:i+8]
        __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx)); // x^3
        __m256 denom = _mm256_broadcast_ss(&three_fact); // Broadcast 3! to all lanes
        int sign = -1;

        for (int j = 1; j < terms; j++) {
            // Compute value += sign * numer / denom
            __m256 tmp = _mm256_div_ps(_mm256_mul_ps(_mm256_set1_ps((float)sign), numer), denom);
            value = _mm256_add_ps(value, tmp);

            // Update numer to the next odd power of x and denom to the next factorial
            numer = _mm256_mul_ps(numer, _mm256_mul_ps(origx, origx));
            float next_fact = (float)((2 * j + 2) * (2 * j + 3));
            denom = _mm256_mul_ps(denom, _mm256_set1_ps(next_fact));
            sign *= -1; // Flip sign
        }

        _mm256_storeu_ps(&y[i], value); // Store 8 results into y
    }
}

This vectorized program leverages AVX (Advanced Vector Extensions) intrinsics to process eight array elements simultaneously by utilizing 256-bit vector registers. Instead of handling one element at a time, the program groups eight floating-point values (e.g., x[i:i+8]) into a single vector register (__m256) using intrinsic functions like _mm256_load_ps. Operations such as multiplication, addition, and division are applied to the entire vector in parallel (e.g., _mm256_mul_ps), significantly accelerating the computation. The results are then stored back into the output array using _mm256_store_ps. This approach optimizes memory updates and arithmetic operations by reducing the overhead of repeated scalar instructions, making it highly efficient for large datasets. AVX extensions, part of modern CPU instruction sets, enable this seamless vectorisation and parallelism.

avx

This means that 16 cores with SIMD support can each process 8-wide float vectors (a 256-bit register holds eight 32-bit floats), performing 8 simultaneous operations per clock cycle. With 16 cores × 8 lanes per core, this results in 128 operations being performed in parallel, significantly boosting computational throughput.

avx

Summary: Three Different Forms of Parallel Execution

Superscalar:

Exploit ILP (Instruction-Level Parallelism) within an instruction stream. Process different instructions from the same instruction stream in parallel (within a core). Parallelism is automatically discovered by the hardware during execution.

SIMD:

Multiple ALUs controlled by the same instruction (within a core). Efficient for data-parallel workloads: amortize control costs over many ALUs. Vectorization is done by the compiler (explicit SIMD) or at runtime by hardware (implicit SIMD).

Multi-core:

Use multiple processing cores. Provides thread-level parallelism: simultaneously execute a completely different instruction stream on each core. Software creates threads to expose parallelism to hardware (e.g., via threading APIs).

avx

This is what an Intel Core i7-7700K CPU looks like:

avx

We already saw how caches reduce latency; we can also use prefetching to reduce stalls, although even this cannot hide all latency.

avx

Practically speaking, we can never achieve a 0-latency scenario, but multithreading allows us to hide stalls caused by latency. When one thread stalls (e.g., waiting for memory), the processor switches to another thread that is runnable, keeping the ALUs busy. By overlapping the execution of multiple threads (e.g., Thread 1 processes elements 0-7, Thread 2 processes 8-15, etc.), we minimize idle time in the core. This approach ensures better utilization of hardware resources, effectively masking latency while maximizing throughput.
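From the software side, one way to give the hardware enough runnable contexts is to spawn at least as many threads as the machine exposes hardware threads; a rough sketch, assuming std::thread::hardware_concurrency() reports those hardware threads and using a made-up memory-bound workload:

#include <algorithm>
#include <thread>
#include <vector>

// Sketch: one software thread per hardware thread, so that when one thread
// stalls on a memory access the core has another runnable context whose
// instructions can keep the ALUs busy.
void scale_all(float* data, int N) {
    unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    int chunk = (N + (int)n_threads - 1) / (int)n_threads;
    for (unsigned t = 0; t < n_threads; t++) {
        int begin = (int)t * chunk;
        int end   = std::min(N, begin + chunk);
        if (begin < end)
            pool.emplace_back([=] {
                for (int i = begin; i < end; i++)
                    data[i] *= 2.0f;   // memory-bound loop; cache misses cause stalls
            });
    }
    for (auto& th : pool) th.join();   // wait for all threads
}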

avx

In modern multithreaded processors, each hardware thread running on a core requires its own independent set of registers and memory context to store its data, intermediate results, and state. This separation allows multiple threads to execute concurrently without interfering with each other, hence the four separate execution contexts in the figure, each a logically partitioned slice of storage backed by DRAM.