First, I would like to extend my sincere gratitude to Prof. Zeyi Wen and my senior colleague, Jihang Li, from the Data Science and Analytics Thrust at the Hong Kong University of Science and Technology (Guangzhou), for their invaluable guidance and resource support.

I am finally beginning this series as a way to commemorate my brief tenure as a Research Assistant. GEMM stands for GEneral Matrix Multiplication; "General" here refers to standard matrices without distinctive characteristics, such as an elongated shape or sparsity. During my RA appointment, my primary task was to implement two operator fusions (merging computations that were originally executed as separate operations into a single, unified operation) involving GEMM. The goal was to achieve a speedup over the baseline approach, which calls the vendor library cuBLAS (NVIDIA's library for GPU-based scientific computing) twice in succession. Consequently, this work necessitated a deep dive into GEMM. The chosen focus was single-precision GEMM, or SGEMM (a decision made for reasons I cannot quite recall).
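To make that baseline concrete, here is a minimal sketch. The actual fused operators from my RA work are beside the point; purely for illustration, assume the pattern is a chained product D = (A·B)·C, so the function name, shapes, and scratch buffer below are all hypothetical. What matters is the structure: two successive cuBLAS calls with an intermediate result T that makes a round trip through global memory, which is exactly the traffic operator fusion tries to eliminate.

```cuda
#include <cublas_v2.h>

// Hypothetical baseline for a chained GEMM D = (A * B) * C, expressed as
// two successive cuBLAS SGEMM calls. All matrices are column-major, as
// cuBLAS expects. T is an m x n scratch buffer in device memory.
void baseline_two_gemms(cublasHandle_t handle,
                        int m, int k, int n, int p,
                        const float *A,   // m x k
                        const float *B,   // k x n
                        const float *C,   // n x p
                        float *T,         // m x n intermediate
                        float *D) {       // m x p result
    const float one = 1.0f, zero = 0.0f;
    // First call: T = A * B (result written out to global memory).
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &one, A, m, B, k, &zero, T, m);
    // Second call: D = T * C (T is read back in from global memory).
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, p, n, &one, T, m, C, n, &zero, D, m);
}
```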

I have gone through nearly all the Chinese-language and international blog tutorials on GEMM I could find, and related academic papers turn out to be few and far between. In reality, academic research on GEMM peaked decades ago with one or two foundational papers; few truly groundbreaking works have emerged since, and significant progress will likely have to wait for the next major architectural change in NVIDIA GPUs. Most Chinese-language blogs are unhelpful for learners: they are either overly abstract and inaccessible, or too superficial, leaving readers wanting more. Of course, there are exceptions that are genuinely well written. Many international blogs are excellent, but they undoubtedly raise the barrier to entry for native Chinese speakers who are new to the field.

My hope is to give more Chinese researchers a clear and accessible introduction to GEMM. The primary focus of this series is therefore educational, aiming to provide a comprehensive overview of the entire research process. (I will compile the high-quality blogs I referenced into a list at the end of this article.) Inevitably, there may be oversights or inaccuracies; I welcome corrections and suggestions from readers and encourage you to contact me.

Roadmap

It is undeniable that developing a GEMM operator with ultimate performance requires a deep understanding of low-level principles. Readers must have a firm grasp of GPU hardware architecture and of how software drives it to work in parallel, as well as a clear awareness of the various points where performance can be optimized. This makes some familiarity with CUDA (NVIDIA's language for GPU programming) and PTX (the intermediate, machine-independent assembly language into which CUDA code is compiled) essential. Abstractions and high-level languages are primarily a convenience for developers, not for researchers seeking peak performance. This includes frameworks like Triton (a high-level parallel programming language developed by OpenAI) and more recent tools such as TileLang (an AI compiler designed at Peking University and adopted by DeepSeek 3.2; my senior colleague finds writing in it to be a thoroughly miserable experience).
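As a small taste of the abstraction level involved, below is a deliberately naive CUDA SGEMM kernel: one thread per output element, no tiling, nowhere near peak performance. It is a sketch for orientation only, not the approach developed later in this series; compiling it with `nvcc -ptx` is also the easiest way to see the PTX the compiler emits.

```cuda
// Naive SGEMM: C = alpha * A * B + beta * C, row-major, one thread per
// output element. Every thread streams an entire row of A and column of B
// from global memory -- precisely the traffic later chapters optimize away.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int kk = 0; kk < K; ++kk)
            acc += A[row * K + kk] * B[kk * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
// Example launch: dim3 block(16, 16);
//                 dim3 grid((N + 15) / 16, (M + 15) / 16);
//                 sgemm_naive<<<grid, block>>>(M, N, K, 1.0f, A, B, 0.0f, C);
```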

Building on this foundational understanding of the hardware, we can delve into the application of GEMM algorithms on GPUs. Accordingly,

  • Chapter 1 will provide a comprehensive and accessible introduction to CUDA and PTX, covering the essential prerequisite knowledge needed for subsequent algorithm design and implementation.
  • In Chapter 2, I will briefly survey the fundamental algorithms for matrix multiplication, from their execution on CPUs to their adaptation for GPUs, serving as a high-level overview of the methodologies.

The subsequent chapters will each offer a detailed code analysis and implementation of the algorithms introduced in Chapter 2.

Additionally, I will write a supplementary chapter dedicated to using Nsight Compute (NVIDIA's kernel profiler) to locate and analyze operator bottlenecks. Understanding where a bottleneck occurs is arguably even more critical than being able to write a high-performance operator in the first place.
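As a preview of that chapter, a typical command-line session might look like the following; the binary name `sgemm_bench` is hypothetical, but the `ncu` flags are standard.

```bash
# Collect the full set of profiling sections into a report file.
ncu --set full -o sgemm_report ./sgemm_bench

# Print the saved report to the console (or open it in the Nsight Compute GUI).
ncu --import sgemm_report.ncu-rep
```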

Target Audience and Resources

I have a formal background in computer science, so readers are expected to bring a foundational understanding of the subject; this blog may not be suitable for those transitioning into the field from other disciplines.

Below is a curated list of high-quality blogs from the web:

Recommended Chinese-Language Posts

All the methods listed above are based on the traditional, high-performance iteration-k approach. Other specialized techniques, such as split-k, stream-k, Tensor Cores, swizzling, and persistent kernels, will be explored in detail in a future advanced series.