Matrix multiplication is the holy grail of deep neural networks and modern-day language-comprehending behemoths. As MLEs or data scientists, our fingers are too quick to type tf.matmul
or torch.matmul
and we never look back. But don’t tell me you’ve never had a millisecond’s infatuation with knowing what might be happening to that matrix when it enters the GPU! If you have, you’re in the right place. Join me on a journey through the fascinating intricacies inside a GPU.
I’ll show you how these compute powerhouses crunch the numbers. You’ll learn three little-known but impressive things GPUs do when they come face-to-face with matrices. By the end of this blog post, you’ll have a good understanding of how matrix multiplication works inside GPUs.
GEMM, or general matrix multiplication, is the kernel that gets executed when GPUs perform matrix multiplication.
C = a (A · B) + b C
Here, a and b are scalars, A is an MxK matrix, B is a KxN matrix, and therefore C is an MxN matrix. It’s as simple as that! You might wonder why that trailing addition exists. It turns out this is a very common pattern in neural networks (e.g. adding a bias, applying ReLU, adding a residual connection).
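For context, a GEMM along these lines is what you’re implicitly invoking every time you type tf.matmul or torch.matmul: on Nvidia GPUs the frameworks ultimately dispatch to libraries such as cuBLAS. Here’s a minimal host-side sketch of such a call, purely for illustration (it is not the frameworks’ actual dispatch code): cuBLAS expects column-major storage, d_A, d_B and d_C are assumed pre-filled device buffers, alpha and beta play the roles of a and b above, and error checking is omitted.
#include <cublas_v2.h>

// C (MxN) = alpha * A (MxK) * B (KxN) + beta * C, all column-major, all on the device.
void gemm_with_cublas(cublasHandle_t handle, int M, int N, int K,
                      const float *d_A, const float *d_B, float *d_C,
                      float alpha, float beta) {
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha,
                d_A, M,   // lda = M for a column-major MxK matrix
                d_B, K,   // ldb = K for a column-major KxN matrix
                &beta,
                d_C, M);  // ldc = M for a column-major MxN matrix
}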
If you were asked to write a matrix multiplication algorithm from first principles, here’s what you’d do (unless you were gifted a GPU in place of a brain, and wouldn’t that save money for an MLE!).
for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < K; ++k)
            C[i][j] += A[i][k] * B[k][j];
Here’s an animated visual that shows you what this does.
But did you know GPUs despise this implementation 🤔? To understand why that’s the case, you need to understand the GPU memory architecture.
For all comparisons and specifications, I’ll be using the Nvidia A100 GPU.
A GPU has three main memory levels:
- Global memory or HBM (what you typically refer to as GPU memory and what you see when you run nvidia-smi)
- Shared memory (a local memory that is dedicated to a single streaming multiprocessor [or SM] and shared between the threads running on that SM)
- Registers (individually allocated to threads to carry out their workload)
This is what it looks like:
The first thing to note is that shared memory (referred to as SRAM from now on) is way smaller than the HBM, let alone the registers. So your matrix is not going to fit in there (in most cases). If we go back to our animation, for a single row of A, all columns of B need to be retrieved, and the process is repeated for every row of A. This means the GPU needs to do many, many reads to compute the output. And the HBM (~1.5TB/s) is more than an order of magnitude slower than SRAM (~19TB/s).
To put that in numbers, say you want to multiply a 10x20 and a 20x30 matrix: you need to read the columns of B 10x30=300 times. Is there a better way we can do this?
Turns out a simple trick can go a long way here! Simply flip the order of the loops so that k becomes the outermost loop. And you’re done! 😮
for (int k = 0; k < K; ++k)
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            C[i][j] += A[i][k] * B[k][j];
We did not touch the actual computation, just the order of the loops, so we should get the same result as before. Here’s what the matrix multiplication looks like now!
You see, we only bring one column of A and one row of B at a time and never look back. This requires far fewer reads than the original implementation. The only difference is that before we were computing the inner product between two vectors; now we’re computing an outer product.
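To put rough numbers on “far fewer reads”, here’s a back-of-the-envelope sketch I’m adding for illustration. It assumes nothing is cached for the i-j-k ordering, and that the staged column of A and row of B are perfectly reused for the k-outer ordering.
// Element loads from slow memory needed to feed all M*N*K multiply-adds.
// i-j-k ordering, nothing cached: every multiply-add re-reads A[i][k] and B[k][j].
long long loads_ijk(long long M, long long N, long long K) { return 2 * M * N * K; }

// k-outer ordering: load column k of A (M values) and row k of B (N values) once per k.
long long loads_k_outer(long long M, long long N, long long K) { return K * (M + N); }

// For the 10x20 times 20x30 example above: 2*10*30*20 = 12,000 loads vs 20*(10+30) = 800 loads.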
But we still need the entire C to sit in SRAM while this happens, and it might be too big to fit. What does CUDA do then? That brings us to the second trick.
Not to worry! I’m not going to blast you with any complex mathematics or Leetcode algorithms. The main thing to keep in mind is that a matrix is a 2D layout of individual tiles. The following animation does justice to what I’m trying to explain.
The result of the green block 💚 comes from the light blue strip of A 💙 and the light yellow strip of B 💛. Taking this a step further, to compute the output you can bring in one block of A’s strip and one block of B’s strip at a time, compute the output, and accumulate the result in the green box.
This gives us a flexible framework where we can load arbitrarily sized blocks (or tiles) of A and B and still compute the final answer. We don’t have to stop there; we can keep recursively dividing the problem into even smaller problems. i.e. the matrix is broken into tiles, tiles are broken into fragments, and fragments into individual values.
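To make this concrete, here’s roughly what one level of that tiling looks like as a CUDA kernel. This is a minimal textbook-style sketch I’m adding for illustration, assuming square TILE x TILE tiles and that M, N and K are exact multiples of TILE; real library kernels (e.g. CUTLASS) are considerably more elaborate.
#define TILE 16

// Each thread block computes one TILE x TILE tile of C; each thread owns one element of it.
__global__ void tiled_matmul(const float *A, const float *B, float *C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];   // staging area in shared memory (SRAM)
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C this thread computes
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C this thread computes
    float acc = 0.0f;

    // Walk along the K dimension one tile at a time.
    for (int t = 0; t < K / TILE; ++t) {
        // Stage one tile of A and one tile of B from HBM into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Accumulate the partial product of the two staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;   // one write back to HBM per output element
}
Notice how each tile of A and B is fetched from HBM once per thread block and then reused TILE times from shared memory, which is exactly the read saving tiling is after.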
And this lends itself nicely to the execution architecture of a GPU. There are three layers to a kernel execution in a GPU. For simplicity, we’ll say an SM runs a single thread block (although in practice they execute several concurrently, to reduce something known as the tail effect).
- Threads
- Warps (a collection of 32 threads)
- Thread blocks (a collection of several warps)
The exact number of threads in a thread block depends on the specific architecture. For example, an A100 has the following specifications (we’ll sanity-check a kernel launch against them right after the list).
- Maximum of 2048 threads per SM
- Maximum of 1024 threads per block
- Maximum of 32 thread blocks per SM
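As promised, here’s a hypothetical launch of the tiled_matmul sketch from earlier, just to see how these limits play out. Assume M = N = K = 4096 and that d_A, d_B and d_C are device buffers that have already been allocated and filled.
dim3 block(TILE, TILE);           // 16 x 16 = 256 threads per block, well under the 1024-per-block limit
dim3 grid(N / TILE, M / TILE);    // (4096/16)^2 = 65,536 blocks, scheduled across the SMs in waves
tiled_matmul<<<grid, block>>>(d_A, d_B, d_C, M, N, K);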
Sidebar #1: Magic of the power of 2
Going back to tiling, it has been found (heuristically) that a matrix tile of size 256x128 per thread block gives reasonable efficiency for most problems. Therefore, it’s a common tile size used by CUDA.
You might have heard about the best practice of keeping the batch size and hidden dimension size as powers of two. This is where it comes from! When your matrix dimensions are powers of two, the matrix divides fully into a set of tiles with no remainder. If not, your code becomes less efficient.
GPU computations are more efficient when your matrix dimensions are powers of 2
What happens when it’s not a power of 2?
Sidebar #2: Tile quantization
What happens is an effect known as tile quantization. In other words, if you have a tile row dimension of 128 but your matrix has 257 elements in a row, you’ll need not two, but three tiles in a row (i.e. 256+1). This is illustrated below.
The problem with this is that a thread block does the same amount of computation regardless of how much useful data resides in its tile. So you’re taking away the opportunity for your GPU to do useful computation, leading to inefficiencies.
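A quick back-of-the-envelope helper (my own illustration, with made-up names) makes the waste concrete:
// Ceiling division: how many tiles get launched to cover n_cols columns.
int tiles_needed(int n_cols, int tile_cols) {
    return (n_cols + tile_cols - 1) / tile_cols;
}

// For tile_cols = 128 and n_cols = 257: tiles_needed = 3, covering 384 columns,
// so only 257/384 ≈ 67% of the launched columns carry useful data.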
A similar effect is known as wave quantization, where the matrix is over-sized and the SMs collectively cannot process all of its thread blocks at once. Then the GPU needs to do the computation in 2 “waves”. However, this is less of a concern for modern GPUs as they leverage concurrency to reduce wave quantization.
Tile quantization happens when the data partially spills over a tile’s boundary; wave quantization happens when the work spills over the SMs’ capacity.
The final trick is kernel fusion. More often than not, it is faster to do all the computations in one kernel than to have two kernels called one after the other. Why? Because one kernel needs to write its data to HBM and the other needs to read it back. We already talked about how slow that is. A better approach is to simply combine the two operations into one. Some examples are:
As can be seen here (I’m sure PyTorch has a similar glossary), TensorFlow offers many fused kernels that combine commonly co-occurring operations into a single kernel. In code, it means something like this:
for (int m = 0; m < M; m += Mtile)
    for (int n = 0; n < N; n += Ntile)
        for (int i = 0; i < Mtile; ++i)
            for (int j = 0; j < Ntile; ++j) {
                int row = m + i;
                int col = n + j;
                float tmp = 0.0f;
                for (int k = 0; k < K; ++k)
                    tmp += A[row][k] * B[k][col];
                // Do other things (i.e. the fused operations) while tmp is still in a register
                C[row][col] = tmp;
            }
In other words, we hold on dearly to our tmp variable until we’ve finished all our computations. Only then do we write the result back to C.
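To make that “do other things” spot concrete, here’s what a fused epilogue could look like inside the innermost block above, say a bias add followed by a ReLU. This is my own sketch, not a specific TensorFlow fused kernel, and bias is a hypothetical per-column vector of length N.
float tmp = 0.0f;
for (int k = 0; k < K; ++k)
    tmp += A[row][k] * B[k][col];
tmp += bias[col];                 // fused bias add, no extra HBM round-trip
tmp = tmp > 0.0f ? tmp : 0.0f;    // fused ReLU on the value still sitting in a register
C[row][col] = tmp;                // single write of the final result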
That’s it folks. I hope this was an enjoyable excursion through the weeds of a GPU. If you’re interested in the audio-visual version, here’s the link to my YouTube video.
To recap, we discussed three things that make GPUs really fast at matrix multiplication.
- GPUs abandon the friendlier inner product implementation of matmul and embrace the more read-efficient outer product implementation of matmul
- GPUs split the matrices into smaller blocks (and blocks into fragments) and split the compute load across thread blocks, warps and threads.
- GPUs employ kernel fusion to bring commonly co-occurring functionality together, improving GPU efficiency.
If you enjoyed this story, feel free to subscribe to Medium, and you will get notified when fresh content from me comes out, as well as unlock full access to thousands of quality stories from other authors.
Unless otherwise noted, all images are by the author.