NVIDIA CUDA 13.1 Drops CUB Boilerplate with New Single-Call API


Felix Pinkston
Jan 21, 2026 21:57

NVIDIA simplifies GPU development with CUB single-call API in CUDA 13.1, eliminating repetitive two-phase memory allocation code without performance loss.

NVIDIA has shipped a significant quality-of-life upgrade for GPU developers with CUDA 13.1, introducing a single-call API for the CUB template library that eliminates the clunky two-phase memory allocation pattern developers have worked around for years.

The change addresses a long-standing pain point. CUB—the C++ template library powering high-performance GPU primitives like scans, sorts, and histograms—previously required developers to call each function twice: once to calculate the required temporary storage, then again to actually run the algorithm. In practice, every CUB operation turned into a verbose dance of memory estimation, allocation, and execution.
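To make the pattern concrete, here is a minimal sketch of the classic two-phase call using cub::DeviceReduce::Sum; the helper function and the choice of cudaMallocAsync for the temporary buffer are illustrative, not taken from NVIDIA's announcement:

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Classic two-phase CUB pattern: call once with a null pointer to learn how
// much temporary storage the algorithm needs, allocate it, then call again
// to actually run the reduction.
void device_sum(const int* d_in, int* d_out, int num_items, cudaStream_t stream)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Phase 1: size query (no work is launched).
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items, stream);

    // Phase 2: allocate, then run the algorithm for real.
    cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items, stream);

    cudaFreeAsync(d_temp_storage, stream);
}
```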

PyTorch’s codebase tells the story. The framework wraps CUB calls in macros specifically to hide this two-step invocation, a workaround common across production codebases. Macros obscure control flow and complicate debugging—a trade-off teams accepted because the alternative was worse.

Zero Overhead, Less Code

The new API cuts straight to the point. What previously required explicit memory allocation now fits in a single line, with CUB handling temporary storage internally. NVIDIA’s benchmarks show the streamlined interface introduces zero performance overhead compared to the manual approach—memory allocation still happens, just under the hood via asynchronous allocation embedded within device primitives.
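Going by that description, the equivalent single-call version collapses to one statement. The sketch below assumes the new overload simply drops the temp-storage parameters; the exact signature is in the CUDA 13.1 CUB documentation.

```cpp
#include <cub/cub.cuh>

// Single-call form described in the release: CUB sizes, allocates, and frees
// the temporary storage internally using stream-ordered (asynchronous)
// allocation. Sketch only; verify the overload against the CUDA 13.1 docs.
void device_sum(const int* d_in, int* d_out, int num_items)
{
    cub::DeviceReduce::Sum(d_in, d_out, num_items);
}
```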

Critically, the old two-phase API remains available. Developers who need fine-grained control over memory—reusing allocations across multiple operations or sharing between algorithms—can continue using the existing pattern. But for the majority of use cases, the single-call approach should become the default.
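For those cases, the existing idiom of sizing one workspace and reusing it across calls still applies. The sketch below shows that long-standing two-phase reuse pattern; nothing in it is specific to 13.1.

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <algorithm>

// Reuse one temp-storage allocation for several CUB calls on the same stream:
// query each algorithm's requirement, allocate the maximum once, then run
// both against the shared buffer.
void sum_and_max(const int* d_in, int* d_sum, int* d_max, int n, cudaStream_t s)
{
    size_t bytes_sum = 0, bytes_max = 0;
    cub::DeviceReduce::Sum(nullptr, bytes_sum, d_in, d_sum, n, s);
    cub::DeviceReduce::Max(nullptr, bytes_max, d_in, d_max, n, s);

    void*  d_temp = nullptr;
    size_t bytes  = std::max(bytes_sum, bytes_max);
    cudaMallocAsync(&d_temp, bytes, s);

    cub::DeviceReduce::Sum(d_temp, bytes_sum, d_in, d_sum, n, s);
    cub::DeviceReduce::Max(d_temp, bytes_max, d_in, d_max, n, s);

    cudaFreeAsync(d_temp, s);
}
```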

The Environment Argument

Beyond simplifying basic calls, CUDA 13.1 introduces an extensible “env” argument that consolidates execution configuration. Developers can now combine custom CUDA streams, memory resources, deterministic requirements, and tuning policies through a single type-safe object rather than juggling multiple function parameters.
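NVIDIA's announcement doesn't spell out the full combinator syntax, but based on CCCL's execution-environment utilities, requesting run-to-run determinism would look roughly like the following; the cuda::execution::require and determinism names below should be verified against the CUDA 13.1 / CCCL documentation, as should the mechanism for combining streams, memory resources, and tuning policies into one env object.

```cpp
#include <cub/cub.cuh>
// The execution-environment utilities live in CCCL's cuda:: namespaces; the
// exact headers and the combinator for mixing streams, memory resources, and
// tuning policies should be confirmed in the CUDA 13.1 / CCCL documentation.

void deterministic_sum(const float* d_in, float* d_out, int num_items)
{
    // Ask for bitwise-reproducible (run-to-run deterministic) results and
    // pass the requirement as the env argument of the single-call API.
    auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
    cub::DeviceReduce::Sum(d_in, d_out, num_items, env);
}
```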

Memory resources—a new utility for allocation and deallocation—can be passed through this environment argument. NVIDIA provides default resources, but developers can substitute their own custom implementations or use CCCL-provided alternatives like device memory pools.

Currently, the environment interface supports core algorithms including DeviceReduce operations (Reduce, Sum, Min, Max, ArgMin, ArgMax) and DeviceScan operations (ExclusiveSum, ExclusiveScan). NVIDIA is tracking additional algorithm support via its CCCL GitHub repository.

Practical Implications

For teams maintaining GPU-accelerated applications, this update means less wrapper code and cleaner integration. The CUB library already serves as a foundational component of NVIDIA’s CUDA Core Compute Libraries, and simplifying its API reduces friction for developers building custom CUDA kernels.

The timing aligns with broader industry movement toward more accessible GPU programming. As AI workloads drive demand for optimized GPU code, lowering barriers to using high-performance primitives matters.

CUDA 13.1 is available now through NVIDIA’s developer portal. Teams currently using macro wrappers around CUB calls should evaluate migrating to the native single-call API—it delivers the same abstraction without the debugging headaches.
