
Taming Nanoseconds: Optimizing CUDA Kernel Launch Latency for Small GEMMs

By The adllm Team. Tags: CUDA, GPU, Performance Optimization, GEMM, Latency, cuBLAS

Executing sequences of very small matrix multiplications (GEMMs) on a GPU can present a significant performance challenge. While GPUs excel at massively parallel tasks, the overhead associated with launching a CUDA kernel for each tiny operation can easily dwarf the actual computation time. This article dives deep into practical strategies for mitigating kernel launch latency, enabling developers to unlock the true potential of their GPUs for workloads involving numerous small GEMMs.

We’ll explore techniques ranging from manual kernel batching and leveraging CUDA Graphs to utilizing optimized libraries like cuBLAS and implementing persistent kernels. For each method, we’ll discuss its mechanics, ideal use cases, and provide illustrative code examples to guide your optimization efforts.

The Nanosecond Nemesis: Understanding Kernel Launch Latency

CUDA kernel launch latency is the time elapsed from the CPU invoking a kernel launch (e.g., via cudaLaunchKernel()) to the GPU actually beginning execution of that kernel’s code. This overhead, typically in the range of 5 to 30 microseconds, includes API call processing, driver interaction, and GPU hardware setup.

For computationally intensive kernels that run for milliseconds or seconds, this launch latency is negligible. However, when dealing with very small GEMMs (e.g., 4x4, 16x16, 32x32), the kernel execution time itself might only be a few microseconds or even nanoseconds. In such scenarios, if each small GEMM is launched as an individual kernel, the cumulative launch latency becomes the dominant factor, severely underutilizing the GPU’s computational power.

Consider a scenario:

  • Small GEMM execution time: 2 µs
  • Kernel launch latency: 10 µs
  • Total time per GEMM: 12 µs (83% overhead!)

The goal is to drastically reduce this overhead by minimizing the number of kernel launches or by amortizing the launch cost over a larger batch of work.
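
For contrast, the anti-pattern this article is about avoiding looks like the loop below. It is a minimal sketch: smallGemmKernel is a hypothetical kernel that computes a single M x N x K product, and h_A, h_B, h_C are host-side arrays holding device pointers to the individual matrices. Every iteration pays the full launch latency.

// Anti-pattern: one kernel launch per small GEMM.
// smallGemmKernel is hypothetical; h_A[i], h_B[i], h_C[i] hold device
// pointers to the i-th input/output matrices.
for (int i = 0; i < num_matrices; ++i) {
    smallGemmKernel<<<1, dim3(N_dim, M_dim), 0, stream>>>(
        h_A[i], h_B[i], h_C[i], M_dim, N_dim, K_dim); // ~10 us launch cost each
}
cudaStreamSynchronize(stream); // total ~= num_matrices * (launch + exec)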

Core Strategies to Mitigate Launch Latency

Several powerful techniques can be employed to tackle kernel launch latency for sequences of small GEMMs.

1. Kernel Fusion / Manual Batching

The most direct approach is to combine multiple small GEMM operations into a single, larger kernel. This “fused” or “batched” kernel is launched once but performs the work of many individual operations.

Concept: Design a CUDA kernel that accepts arrays of pointers to input/output matrices or a contiguous block of memory containing all matrix data. Each thread block, or even a group of threads within a block, can be assigned to compute one of the small matrix multiplications.

Illustrative Code (Conceptual):

This example shows a kernel designed to process a batch of small matrix multiplications. Each thread block handles one matrix multiplication. For simplicity, matrix dimensions (M, N, K) are assumed to be small and uniform for the batch.

__global__ void batchedSmallGemmKernel(
    const float** A_list, // Array of pointers to A matrices (on GPU)
    const float** B_list, // Array of pointers to B matrices (on GPU)
    float** C_list,       // Array of pointers to C matrices (on GPU)
    int num_matrices,     // Total number of matrices in the batch
    int M, int N, int K) {

    int matrix_idx = blockIdx.x;

    if (matrix_idx >= num_matrices) {
        return;
    }

    const float* A = A_list[matrix_idx];
    const float* B = B_list[matrix_idx];
    float* C = C_list[matrix_idx];

    // Simple GEMM: Each thread computes one element of C.
    // Assumes M, N are small enough for threads in a block.
    // e.g., for a 16x16 matrix, use a dim3(16, 16) block.
    int row = threadIdx.y; // C matrix row
    int col = threadIdx.x; // C matrix col

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int i = 0; i < K; ++i) {
            sum += A[row * K + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}

To launch this kernel, you would configure the grid to have num_matrices blocks and thread blocks appropriately sized for the small GEMMs (e.g., dim3(N, M) if each thread computes one element of C).

// Example Launch Configuration (CPU-side)
// Assumes M_dim, N_dim, K_dim are the dimensions of small matrices
// num_matrices is the batch size
dim3 block_dim(N_dim, M_dim); // e.g., (16, 16) for 16x16 C matrices
dim3 grid_dim(num_matrices);  // One block per matrix multiplication

// d_A_ptrs, d_B_ptrs, d_C_ptrs are GPU arrays of pointers
// to the actual matrix data (also on GPU).
batchedSmallGemmKernel<<<grid_dim, block_dim>>>(
    d_A_ptrs, d_B_ptrs, d_C_ptrs,
    num_matrices, M_dim, N_dim, K_dim);

When to Use:

  • When you have a large number of identical or similarly structured small GEMMs.
  • When operations are independent and can be easily parallelized within a single kernel.
  • When cuBLAS batched functions don’t perfectly fit the workload or when fine-grained control over the computation is needed.

Considerations:

  • Requires careful management of memory access patterns.
  • Shared memory can be used to optimize individual small GEMMs within the batch if dimensions allow, but for extremely small matrices, register usage might be sufficient.
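
As a rough illustration of the shared-memory option, the variant below stages each matrix pair into shared memory before computing. This is a sketch, assuming M, N, and K are small enough that both operand tiles fit in shared memory and that the kernel is launched with dim3(N, M) blocks and (M*K + K*N) * sizeof(float) bytes of dynamic shared memory.

// Shared-memory variant of the batched kernel (sketch).
__global__ void batchedSmallGemmKernelShared(
    const float** A_list, const float** B_list, float** C_list,
    int num_matrices, int M, int N, int K) {

    extern __shared__ float tile[];
    float* sh_A = tile;          // M x K operand
    float* sh_B = tile + M * K;  // K x N operand

    int matrix_idx = blockIdx.x;
    if (matrix_idx >= num_matrices) return;

    const float* A = A_list[matrix_idx];
    const float* B = B_list[matrix_idx];
    float* C = C_list[matrix_idx];

    int row = threadIdx.y;
    int col = threadIdx.x;
    int tid = row * blockDim.x + col;
    int nthreads = blockDim.x * blockDim.y;

    // Cooperatively copy both operands into shared memory.
    for (int i = tid; i < M * K; i += nthreads) sh_A[i] = A[i];
    for (int i = tid; i < K * N; i += nthreads) sh_B[i] = B[i];
    __syncthreads();

    if (row < M && col < N) {
        float sum = 0.0f;
        for (int i = 0; i < K; ++i) {
            sum += sh_A[row * K + i] * sh_B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}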

2. CUDA Graphs

CUDA Graphs allow you to define a sequence of GPU operations (kernel launches, memory copies) once and then launch the entire sequence multiple times with significantly reduced CPU overhead.

Concept: You “record” a sequence of operations into a graph object. This graph is then “instantiated” into an executable graph. Subsequent launches of this executable graph bypass much of the typical CPU overhead associated with individual API calls.

Illustrative Code (Conceptual):

This example demonstrates capturing a launch of our batchedSmallGemmKernel within a CUDA Graph.

// --- Graph Setup (once) ---
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStream_t stream;
cudaStreamCreate(&stream);

// Dummy parameters for capture
// Ensure these reflect the actual parameters you'll use or make them
// updatable if graph parameters change (more advanced).
// d_A_ptrs, d_B_ptrs, d_C_ptrs are pre-allocated arrays of pointers on GPU
// M_dim, N_dim, K_dim, num_matrices are known.
dim3 block_dim_graph(N_dim, M_dim);
dim3 grid_dim_graph(num_matrices);

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// All operations recorded on 'stream' will be part of the graph
// For example, H2D copies if input data changes per graph launch
// cudaMemcpyAsync(..., stream);

batchedSmallGemmKernel<<<grid_dim_graph, block_dim_graph, 0, stream>>>(
    d_A_ptrs, d_B_ptrs, d_C_ptrs,
    num_matrices, M_dim, N_dim, K_dim);

// For example, D2H copies if results are needed after each graph launch
// cudaMemcpyAsync(..., stream);

cudaStreamEndCapture(stream, &graph);

// Error checking for graph creation omitted for brevity

// Instantiate the graph
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
// Error checking for instantiation omitted

// --- Graph Execution (potentially many times in a loop) ---
for (int i = 0; i < num_iterations; ++i) {
    // If kernel parameters or data need updating:
    // cudaGraphExecKernelNodeSetParams(instance, node_in_graph, &params);
    // Or, for memcpy nodes:
    // cudaGraphExecMemcpyNodeSetParams(instance, node, &memcpyParams);
    // This example assumes static parameters/data for simplicity after capture.

    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream); // Wait for this launch to complete
}

// --- Cleanup ---
cudaGraphExecDestroy(instance);
cudaGraphDestroy(graph);
cudaStreamDestroy(stream);

When to Use:

  • When you have a fixed sequence of operations that is executed repeatedly.
  • Ideal for scenarios where the structure of work is static, even if data changes (updatable graph parameters).
  • Can encapsulate not just kernel launches but also memory copies and other CUDA API calls.

Considerations:

  • Graph creation and instantiation have their own one-time overhead.
  • Updating graph parameters (e.g., pointers, scalar arguments) is possible but adds some complexity compared to static graphs.
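
If parameters do need to change between launches, one approach is to look up the kernel node in the captured graph and patch its arguments with cudaGraphExecKernelNodeSetParams. The sketch below assumes the graph contains a single kernel node and that d_C_ptrs_new is a hypothetical replacement output-pointer array.

// Sketch: update the kernel node's arguments in the instantiated graph.
size_t num_nodes = 0;
cudaGraphGetNodes(graph, NULL, &num_nodes);   // query node count
cudaGraphNode_t nodes[8];                     // assumes a tiny graph
cudaGraphGetNodes(graph, nodes, &num_nodes);
cudaGraphNode_t kernel_node = nodes[0];       // single-kernel-node assumption

// Rebuild the argument list, e.g., with a new output pointer array.
void* kernel_args[] = { &d_A_ptrs, &d_B_ptrs, &d_C_ptrs_new,
                        &num_matrices, &M_dim, &N_dim, &K_dim };

cudaKernelNodeParams params = {};
params.func           = (void*)batchedSmallGemmKernel;
params.gridDim        = grid_dim_graph;
params.blockDim       = block_dim_graph;
params.sharedMemBytes = 0;
params.kernelParams   = kernel_args;

cudaGraphExecKernelNodeSetParams(instance, kernel_node, &params);
cudaGraphLaunch(instance, stream);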

3. cuBLAS Batched GEMM Functions

NVIDIA’s cuBLAS library provides highly optimized routines for batched GEMM operations, such as cublasSgemmBatched (single-precision), cublasDgemmBatched (double-precision), and cublasGemmBatchedEx (flexible precision, supports Tensor Cores).

Concept: You prepare arrays of pointers (residing on the GPU) that point to the individual A, B, and C matrices (also on the GPU). A single call to the batched cuBLAS function then executes all matrix multiplications.

Illustrative Code (Conceptual):

This demonstrates setting up and calling cublasSgemmBatched.

#include <cstdio>   // printf for the error macro
#include <cstdlib>  // exit
#include <cublas_v2.h>
// Error checking utility (simplified)
#define CUBLAS_CHECK(err) \
    if (err != CUBLAS_STATUS_SUCCESS) { \
        printf("cuBLAS error %d at %s:%d\n", err, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    }

// --- Setup (once or if batchCount/matrices change significantly) ---
cublasHandle_t handle;
CUBLAS_CHECK(cublasCreate(&handle));

// Parameters for small GEMMs
int M = 16, N = 16, K = 16;
int batchCount = 1000; // Number of small GEMMs in the batch
float alpha = 1.0f, beta = 0.0f; // Standard GEMM scalars

// 1. Allocate memory for arrays of pointers on HOST
float** h_A_array = (float**)malloc(batchCount * sizeof(float*));
float** h_B_array = (float**)malloc(batchCount * sizeof(float*));
float** h_C_array = (float**)malloc(batchCount * sizeof(float*));

// 2. Allocate individual matrices on GPU and store pointers in host arrays
//    Each matrix is M*K, K*N, M*N elements respectively.
for (int i = 0; i < batchCount; ++i) {
    cudaMalloc((void**)&h_A_array[i], M * K * sizeof(float));
    cudaMalloc((void**)&h_B_array[i], K * N * sizeof(float));
    cudaMalloc((void**)&h_C_array[i], M * N * sizeof(float));
    // Initialize/fill A and B matrices on GPU here...
    // e.g., cudaMemcpy(h_A_array[i], host_data_A_i, ..., cudaMemcpyHostToDevice)
}

// 3. Allocate memory for arrays of pointers on DEVICE
float** d_A_array;
float** d_B_array;
float** d_C_array;
cudaMalloc((void**)&d_A_array, batchCount * sizeof(float*));
cudaMalloc((void**)&d_B_array, batchCount * sizeof(float*));
cudaMalloc((void**)&d_C_array, batchCount * sizeof(float*));

// 4. Copy pointer arrays from HOST to DEVICE
cudaMemcpy(d_A_array, h_A_array, batchCount * sizeof(float*),
           cudaMemcpyHostToDevice);
cudaMemcpy(d_B_array, h_B_array, batchCount * sizeof(float*),
           cudaMemcpyHostToDevice);
cudaMemcpy(d_C_array, h_C_array, batchCount * sizeof(float*),
           cudaMemcpyHostToDevice);

// --- cuBLAS Call (can be in a loop if data changes but pointers don't) ---
CUBLAS_CHECK(cublasSgemmBatched(handle,
                                CUBLAS_OP_N, CUBLAS_OP_N, // No transpose
                                M, N, K,
                                &alpha,
                                (const float**)d_A_array, M, // lda
                                (const float**)d_B_array, K, // ldb
                                &beta,
                                d_C_array, M,              // ldc
                                batchCount));
cudaDeviceSynchronize(); // Wait for computation

// Retrieve results from d_C_array[i]...
// Cleanup (free all GPU memory, host arrays, destroy handle)
// ...
// CUBLAS_CHECK(cublasDestroy(handle));
// for (int i=0; i<batchCount; ++i) { cudaFree(h_A_array[i]); ... }
// free(h_A_array); ...

Important: cuBLAS assumes column-major storage. With CUBLAS_OP_N, lda, ldb, and ldc are the leading dimensions of the individual matrices: M for each A (M x K), K for each B (K x N), and M for each C (M x N), exactly as passed above. If your data is stored row-major (as in the custom kernel earlier), the usual trick is to compute C^T = B^T * A^T by swapping the A and B pointer arrays (and the M and N arguments), or to explicitly transpose the data first. Always consult the cuBLAS documentation for precise parameter definitions.

When to Use:

  • When your problem maps directly to standard GEMM operations.
  • Generally provides the best performance for batched GEMMs due to NVIDIA’s expert tuning.
  • For common data types and matrix sizes.

Considerations:

  • Requires data to be arranged as arrays of pointers on the GPU.
  • cublasGemmStridedBatchedEx is an alternative if matrices in the batch are regularly spaced in a single large memory allocation, which can simplify memory management.
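
If all matrices in a batch live in one contiguous allocation at a fixed stride, the pointer-array setup above disappears entirely. A minimal sketch using the single-precision variant cublasSgemmStridedBatched, assuming d_A, d_B, and d_C are single device allocations holding batchCount column-major matrices packed back to back:

// Sketch: strided batched GEMM over contiguous, regularly spaced matrices.
long long strideA = (long long)M * K;
long long strideB = (long long)K * N;
long long strideC = (long long)M * N;

CUBLAS_CHECK(cublasSgemmStridedBatched(handle,
                                       CUBLAS_OP_N, CUBLAS_OP_N,
                                       M, N, K,
                                       &alpha,
                                       d_A, M, strideA,  // lda, strideA
                                       d_B, K, strideB,  // ldb, strideB
                                       &beta,
                                       d_C, M, strideC,  // ldc, strideC
                                       batchCount));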

4. Persistent Kernels (Dynamic Work Queues)

For highly dynamic workloads where the number or parameters of small GEMMs change frequently, a persistent kernel approach can be beneficial.

Concept: Launch a “super-kernel” once with enough thread blocks to saturate the GPU. These threads run in a loop, atomically fetching work items (descriptions of small GEMMs to perform) from a queue in global memory. The kernel persists until the queue is empty.

Illustrative Code (Conceptual):

struct GemmTask {
    const float* A_ptr;
    const float* B_ptr;
    float* C_ptr;
    // Potentially M, N, K if they vary per task
};

__device__ unsigned int active_threads_count = 0; // Track active threads
__device__ unsigned int next_task_idx = 0;        // Global task counter

__global__ void persistentGemmKernel(
    GemmTask* task_queue,     // Array of tasks on GPU
    int total_tasks,
    int M, int N, int K) {    // Assuming M,N,K fixed for simplicity here

    // Optionally, each block could register itself in active_threads_count
    // (e.g., atomicAdd(&active_threads_count, 1) from one thread) so the
    // host can poll for completion; that bookkeeping is omitted here.

    extern __shared__ char shmem[]; // Dynamic shared memory

    __shared__ unsigned int s_task_id; // block-wide broadcast slot

    while (true) {
        // One thread per block atomically claims the next task ID.
        // atomicAdd is used instead of atomicInc because atomicInc wraps
        // the counter back to zero, which would hand out tasks twice.
        if (threadIdx.x == 0 && threadIdx.y == 0) {
            s_task_id = atomicAdd(&next_task_idx, 1u);
        }
        // Broadcast the task ID to every thread in the block via shared
        // memory (__shfl_sync only reaches threads in the same warp).
        __syncthreads();
        unsigned int current_task_id = s_task_id;

        if (current_task_id >= total_tasks) {
            break; // All tasks have been claimed
        }

        // Fetch task details
        const float* A = task_queue[current_task_id].A_ptr;
        const float* B = task_queue[current_task_id].B_ptr;
        float* C = task_queue[current_task_id].C_ptr;

        // Optional: Load A and B into shared memory if beneficial
        // float* sh_A = (float*)shmem;
        // float* sh_B = (float*)&shmem[M * K * sizeof(float)];
        // ... load data into sh_A, sh_B ...
        // __syncthreads();

        // Perform GEMM (similar to batchedSmallGemmKernel inner logic)
        int row = threadIdx.y;
        int col = threadIdx.x;
        if (row < M && col < N) {
            float sum = 0.0f;
            for (int i = 0; i < K; ++i) {
                // Read from A, B (or sh_A, sh_B if using shared memory)
                sum += A[row * K + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        }
        __syncthreads(); // Ensure all threads in block finish before next task
    }
    // Decrement active threads counter upon exit (conceptual)
    // if (threadIdx.x == 0 && threadIdx.y == 0) {
    //    atomicSub(&active_threads_count, 1);
    // }
}

CPU-side Management: The CPU populates task_queue in GPU memory and then launches persistentGemmKernel once. A mechanism (e.g., another small kernel, atomic counters, or events) is needed for the CPU to know when all tasks are complete.
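
A minimal host-side sketch of that flow, assuming a statically known task list (so completion can be detected simply by waiting on the stream; a truly dynamic producer/consumer setup needs additional signaling):

// Host-side sketch: fill the queue, reset the counter, launch once.
std::vector<GemmTask> h_tasks(total_tasks);
// ... fill h_tasks[i].A_ptr / B_ptr / C_ptr with device pointers ...

GemmTask* d_tasks;
cudaMalloc(&d_tasks, total_tasks * sizeof(GemmTask));
cudaMemcpy(d_tasks, h_tasks.data(), total_tasks * sizeof(GemmTask),
           cudaMemcpyHostToDevice);

// Reset the device-side task counter before each run.
unsigned int zero = 0;
cudaMemcpyToSymbol(next_task_idx, &zero, sizeof(unsigned int));

// Launch roughly enough blocks to fill the GPU (e.g., a small multiple
// of the SM count); each block loops until the queue is drained.
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);
persistentGemmKernel<<<num_sms * 4, dim3(N_dim, M_dim), 0, stream>>>(
    d_tasks, total_tasks, M_dim, N_dim, K_dim);

// Simplest completion signal: the kernel returns once the queue is empty.
cudaStreamSynchronize(stream);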

When to Use:

  • When tasks arrive dynamically or are not known far in advance.
  • Can achieve very high GPU utilization if there’s always work in the queue.
  • Amortizes launch cost over the entire lifetime of the persistent kernel.

Considerations:

  • Requires careful design of the work queue and synchronization mechanisms (atomic operations).
  • Determining when the persistent kernel can safely terminate can be tricky.
  • Shared memory usage must be managed carefully if tasks have varying requirements.

Choosing the Right Strategy

  • Static, Repetitive Workloads:
    1. cuBLAS Batched GEMM: Start here if your operations are standard GEMMs. Usually offers the best performance with least effort.
    2. CUDA Graphs (with cuBLAS or custom batched kernel): If the sequence involves cuBLAS calls or your custom batched kernel and is repeated many times, wrap it in a graph for minimal launch overhead.
    3. Custom Batched Kernel: If cuBLAS doesn’t fit or you need more control (e.g., fusion with other operations beyond GEMM).
  • Dynamic Workloads:
    1. Persistent Kernels: Best for unpredictable task arrival or parameters.
    2. Frequent Re-batching (custom kernel or cuBLAS): If tasks arrive in chunks, you might re-batch and launch frequently. CUDA Graphs with updatable parameters can also fit here to some extent.

Essential Tools for Diagnosis and Profiling

Identifying and quantifying kernel launch latency bottlenecks requires proper tools:

  1. NVIDIA Nsight Systems (nsys):

    • Visualizes the timeline of CPU API calls, CUDA runtime activity, and GPU kernel executions.
    • Excellent for observing gaps between kernels (launch latency) and the overall CPU-GPU interaction.
    • Use nsys profile ./your_application to generate a report.
  2. NVIDIA Nsight Compute (ncu):

    • Provides in-depth analysis of individual kernel performance.
    • While not directly for launch latency, it’s crucial for optimizing your custom batched or persistent kernels to ensure they are efficient.
    • Use ncu --target-processes all -o profile_name ./your_application and then ncu -i profile_name.ncu-rep to view results.
  3. CUDA Events:

    • Programmatically measure the duration of GPU operations.
    • Useful for fine-grained timing of kernel sequences or graph launches.
    
    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);
    
    cudaEventRecord(start_event, stream);
    // --- Operations to time (e.g., kernel launch, graph launch) ---
    myKernel<<<grid, block, 0, stream>>>(...);
    // ---
    cudaEventRecord(stop_event, stream);
    cudaEventSynchronize(stop_event); // Wait for stop_event to complete
    
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_event, stop_event);
    printf("GPU execution time: %f ms\n", milliseconds);
    
    cudaEventDestroy(start_event);
    cudaEventDestroy(stop_event);
    

Common Pitfalls to Avoid

  • Launching one kernel per small GEMM: The primary anti-pattern.
  • Excessive cudaDeviceSynchronize() calls: Synchronizing unnecessarily between small operations serializes execution and exposes latency. Use streams and asynchronous operations; a sketch follows this list.
  • Ignoring memory transfer costs: If data for small GEMMs is constantly moved between CPU and GPU, this can overshadow any kernel launch optimizations. Keep data on the GPU if it’s part of a larger GPU pipeline.
  • Inefficient custom kernels: A poorly written batched kernel might perform worse than frequent launches of a highly optimized (but tiny) kernel or cuBLAS. Profile custom kernels with Nsight Compute.
  • Not considering GPU occupancy: Ensure your batched or persistent kernels launch enough thread blocks to effectively utilize the GPU’s streaming multiprocessors.
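
A sketch of the stream-based alternative mentioned above: two independent batches (with hypothetical second-batch pointer arrays d_A2_ptrs, d_B2_ptrs, d_C2_ptrs) are issued asynchronously and synchronized only once at the end.

// Keep independent work on separate streams; synchronize once at the end
// instead of after every small operation.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

batchedSmallGemmKernel<<<grid_dim, block_dim, 0, s0>>>(
    d_A_ptrs, d_B_ptrs, d_C_ptrs, num_matrices, M_dim, N_dim, K_dim);
batchedSmallGemmKernel<<<grid_dim, block_dim, 0, s1>>>(
    d_A2_ptrs, d_B2_ptrs, d_C2_ptrs, num_matrices, M_dim, N_dim, K_dim);

// One synchronization point at the end of the pipeline.
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);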

Advanced Considerations

  • Strided Batched GEMM: cuBLAS offers cublasGemmStridedBatchedEx where matrices in a batch are located at regular intervals (strides) in memory. This can simplify memory management compared to arrays of pointers.
  • Tensor Cores: For compatible matrix sizes (e.g., 16x16x16 for FP16 input, FP32 accumulation) and GPU architectures, using cuBLAS functions ending in Ex or MMA intrinsics in custom kernels can leverage Tensor Cores for significantly faster GEMM computations, further improving throughput once launch latency is addressed (a hedged sketch follows this list).
  • Dynamic Parallelism: While a CUDA kernel can launch other kernels, this typically introduces its own overhead and is generally not the preferred solution for regular sequences of small GEMMs compared to the batching techniques discussed.
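
A hedged sketch of the Tensor Core path via cublasGemmBatchedEx, assuming the pointer arrays now reference FP16 (__half) matrices (hypothetical d_A_half_array, d_B_half_array) while d_C_array holds FP32 results; the cublasComputeType_t enum requires cuBLAS 11 or newer.

// FP16 inputs, FP32 accumulation via cublasGemmBatchedEx (sketch).
float alpha_f = 1.0f, beta_f = 0.0f;

CUBLAS_CHECK(cublasGemmBatchedEx(handle,
                                 CUBLAS_OP_N, CUBLAS_OP_N,
                                 M, N, K,
                                 &alpha_f,
                                 (const void* const*)d_A_half_array,
                                 CUDA_R_16F, M,   // Atype, lda
                                 (const void* const*)d_B_half_array,
                                 CUDA_R_16F, K,   // Btype, ldb
                                 &beta_f,
                                 (void* const*)d_C_array,
                                 CUDA_R_32F, M,   // Ctype, ldc
                                 batchCount,
                                 CUBLAS_COMPUTE_32F,   // accumulate in FP32
                                 CUBLAS_GEMM_DEFAULT));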

Conclusion

Optimizing CUDA kernel launch latency for sequences of very small matrix multiplications is crucial for achieving high GPU performance in relevant applications. By understanding the nature of this overhead and strategically applying techniques like kernel fusion, CUDA Graphs, cuBLAS batched routines, and persistent kernels, developers can transform latency-bound workloads into compute-bound ones. Always profile your application using tools like Nsight Systems to identify the true bottlenecks and validate the impact of your optimizations. The choice of strategy depends heavily on workload characteristics, but the principles discussed provide a strong foundation for taming those nanoseconds and maximizing your GPU’s efficiency.