Some related fun facts:
1. That roofline curve idea applies to multiple processes, computers, and data centers just as well. If you have enough "cache" (disk, RAM, whatever), you can do a distributed matmul and effectively use every coprocessor at nearly 100% efficiency (see the roofline sketch after this list).
2. If you need f32 intermediate precision, you can approximate it with Kahan-like ideas and still take advantage of the f16 core, at somewhere in the 25%-50% efficiency range (still much better than the <10% you get by ignoring the tensor core); a split-precision sketch follows below.
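To make point 1 concrete, here's a minimal Python sketch of the roofline arithmetic for a square matmul. The peak-FLOPs and bandwidth constants are rough published H100 SXM figures used as assumptions, not measurements; swap in interconnect bandwidth and aggregate cluster compute and the same min() formula gives the distributed version.

    # Roofline sketch (assumed rough H100 SXM numbers, not measurements):
    PEAK_FLOPS = 989e12   # half-precision tensor-core FLOP/s
    BANDWIDTH = 3.35e12   # HBM bytes/s

    def matmul_intensity(n, bytes_per_elem=2):
        """FLOPs per byte for an n x n x n matmul, assuming A and B are
        read once and C is written once (perfect cache reuse)."""
        return (2 * n**3) / (3 * n**2 * bytes_per_elem)

    def attainable_flops(n):
        """Roofline: you are either bandwidth-bound or compute-bound."""
        return min(PEAK_FLOPS, matmul_intensity(n) * BANDWIDTH)

    for n in (256, 1024, 4096, 16384):
        pct = 100 * attainable_flops(n) / PEAK_FLOPS
        print(f"n={n:6d}  intensity={matmul_intensity(n):7.1f} FLOP/byte  "
              f"attainable ~= {pct:5.1f}% of peak")

Once n/3 FLOP/byte exceeds the machine's FLOP-to-byte ratio (about 295 here, i.e. n around 900), the matmul is compute-bound and can in principle run near peak; the same argument holds a level up, with disk/RAM as the "cache" and network bandwidth in place of HBM.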
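And a NumPy sketch of the idea behind point 2: split each f32 input into a high f16 part plus an f16 residual, then spend three half-precision products (instead of one) to recover most of the lost input precision, accumulating in f32 as tensor cores do. This particular two-term split is my illustration of the technique, not the commenter's code.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 512
    A = rng.standard_normal((n, n)).astype(np.float32)
    B = rng.standard_normal((n, n)).astype(np.float32)

    def split_f16(M):
        """Represent an f32 matrix as hi + lo, each exactly storable in f16."""
        hi = M.astype(np.float16)
        lo = (M - hi.astype(np.float32)).astype(np.float16)
        return hi.astype(np.float32), lo.astype(np.float32)

    A_hi, A_lo = split_f16(A)
    B_hi, B_lo = split_f16(B)
    exact = A @ B

    # One product with f16-rounded inputs (plain f16 matmul, f32 accumulate):
    one_product = A_hi @ B_hi
    # Three products recover most of the input rounding error, at roughly
    # a third of peak f16 throughput (hence the 25%-50% range above):
    three_products = A_hi @ B_hi + A_hi @ B_lo + A_lo @ B_hi

    print("1-product max error:", np.abs(one_product - exact).max())
    print("3-product max error:", np.abs(three_products - exact).max())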
"An H100 GPU has 989 TFLOPs of half-precision matrix multiply compute, and ~60 TFLOPs of “everything else”"
I always thought there was a lot of crossover between gaming GPUs & DC GPUs (and the volume is why NVIDIA is so far ahead). Are tensor cores somehow related to the pre-tensor-core SMs (like an abstraction on top of SMs)?
Great article — and several other high-quality deep dives linked at the end! Here's another one on the H100 that I found particularly useful: <https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...>
I agree with the author that programming GEMM on newer GPUs is a very different experience, though I'm wondering if "newer GPUs are [actually strictly] better"? It seems like there should still be some highly cost-effective use cases for T4 GPUs — aren't there?
In my parallel programming class we used several techniques to increase the speed of matrix multiplication, and compared them. I vaguely remember using OpenMP and CUDA. I need to look into my backups to see if I still have that code, especially the CUDA version; I wonder how similar it is to the tensor core approach.
Great deep dive. I've learned a lot already and haven't even finished the introduction
Closely related to this, if you're interested in the topic, is DeepMind's guide on how to scale your model.
https://jax-ml.github.io/scaling-book/roofline/
If I ever need a fast matmul, you're hired.
Multiplication algorithm: https://en.wikipedia.org/wiki/Multiplication_algorithm
From https://news.ycombinator.com/item?id=40519828 re: LLMs and matrix multiplication with tensors:
> "You Need to Pay Better Attention" (2024) https://arxiv.org/abs/2403.01643 :
>> Our first contribution is Optimised Attention, which performs similarly to standard attention, but has 3/4 as many parameters and one matrix multiplication fewer per head. Next, we introduce Efficient Attention, which performs on par with standard attention with only 1/2 as many parameters and two matrix multiplications fewer per head and is up to twice as fast as standard attention. Lastly, we introduce Super Attention, which surpasses standard attention by a significant margin in both vision and natural language processing tasks while having fewer parameters and matrix multiplications.
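For reference, here's a minimal NumPy sketch of a single head of standard scaled-dot-product attention, annotated with where its parameter matrices and matrix multiplications are. It's the baseline the abstract is counting against, not an implementation of the paper's Optimised/Efficient/Super variants.

    import numpy as np

    def standard_attention_head(x, W_q, W_k, W_v, W_o):
        """One head of standard attention: 4 parameter matrices and the
        matmuls they imply (the quantities the paper's variants reduce)."""
        Q = x @ W_q                                     # projection matmul 1
        K = x @ W_k                                     # projection matmul 2
        V = x @ W_v                                     # projection matmul 3
        scores = Q @ K.T / np.sqrt(Q.shape[-1])         # QK^T matmul
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax
        ctx = weights @ V                               # attention-weighted sum
        return ctx @ W_o                                # per-head output projection

    # Toy shapes: sequence length 8, model width 16, head width 4.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16))
    W_q, W_k, W_v = (rng.standard_normal((16, 4)) for _ in range(3))
    W_o = rng.standard_normal((4, 16))
    print(standard_attention_head(x, W_q, W_k, W_v, W_o).shape)  # (8, 16)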
From "Transformer is a holographic associative memory" (2025) https://news.ycombinator.com/item?id=43029899 .. https://westurner.github.io/hnlog/#story-43028710 :
>>> Convolution is in fact multiplication in Fourier space (this is the convolution theorem [1]) which says that Fourier transforms convert convolutions to products.
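A quick NumPy check of that convolution-theorem claim: circular convolution in the signal domain matches pointwise multiplication in the Fourier domain.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    a = rng.standard_normal(n)
    b = rng.standard_normal(n)

    # Circular convolution computed directly from the definition...
    direct = np.array([sum(a[k] * b[(i - k) % n] for k in range(n))
                       for i in range(n)])
    # ...and via the convolution theorem: FFT, multiply pointwise, inverse FFT.
    via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

    print(np.allclose(direct, via_fft))  # True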
From https://news.ycombinator.com/item?id=41322088 :
> "A carbon-nanotube-based tensor processing unit" (2024)
"Karatsuba Matrix Multiplication and Its Efficient Hardware Implementations" (2025) https://arxiv.org/abs/2501.08889 .. https://news.ycombinator.com/item?id=43372227