CUB is a flexible library of cooperative thread block primitives and other utilities for CUDA kernel programming.
CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model:

- Parallel primitives
  - Warp-wide "collective" primitives
    - Cooperative warp-wide prefix scan, reduction, etc.
    - Safely specialized for each underlying CUDA architecture
  - Block-wide "collective" primitives
    - Cooperative I/O, sort, scan, reduction, histogram, etc.
    - Compatible with arbitrary thread block sizes and types
  - Device-wide primitives
    - Parallel sort, prefix scan, reduction, histogram, etc.
    - Compatible with CUDA dynamic parallelism
- Utilities
  - Fancy iterators
  - Thread and thread block I/O
  - PTX intrinsics
  - Device, kernel, and storage management
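To make the layering concrete, here is a minimal sketch that sums the same array twice: once per thread block with the block-wide collective `cub::BlockReduce`, and once across the whole device with `cub::DeviceReduce::Sum`. The kernel name `BlockSumKernel`, the helper `cub_reduce_demo`, the block size, and the input data are all assumptions made for illustration and are not part of CUB. CUDA error checking is mostly omitted for brevity, so treat this as a starting point rather than production code.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical demo kernel: each block sums BLOCK_THREADS consecutive ints.
template <int BLOCK_THREADS>
__global__ void BlockSumKernel(const int* d_in, int* d_block_sums)
{
    // Specialize the block-wide collective for int and this block size.
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
    // Shared scratch space the collective uses to communicate across threads.
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int item = d_in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
    // The returned aggregate is only valid in thread 0 of the block.
    int block_sum = BlockReduce(temp_storage).Sum(item);
    if (threadIdx.x == 0) d_block_sums[blockIdx.x] = block_sum;
}

// Hypothetical host helper: returns true if both GPU results match a CPU sum.
bool cub_reduce_demo()
{
    constexpr int kBlockThreads = 128;  // assumed block size
    constexpr int kNumBlocks    = 4;    // assumed grid size
    constexpr int kNumItems     = kBlockThreads * kNumBlocks;

    // Build the input and a CPU reference sum.
    std::vector<int> h_in(kNumItems);
    int expected = 0;
    for (int i = 0; i < kNumItems; ++i) { h_in[i] = i % 7; expected += h_in[i]; }

    int *d_in = nullptr, *d_block_sums = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, kNumItems * sizeof(int));
    cudaMalloc(&d_block_sums, kNumBlocks * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), kNumItems * sizeof(int), cudaMemcpyHostToDevice);

    // Block-wide layer: one partial sum per thread block.
    BlockSumKernel<kBlockThreads><<<kNumBlocks, kBlockThreads>>>(d_in, d_block_sums);

    // Device-wide layer: the first call (null temp storage) only reports the
    // required scratch size; the second call performs the reduction.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_total, kNumItems);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_total, kNumItems);

    // Copy results back; these copies wait for the preceding GPU work.
    std::vector<int> h_block_sums(kNumBlocks);
    int h_total = 0;
    cudaMemcpy(h_block_sums.data(), d_block_sums, kNumBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaFree(d_temp);
    cudaFree(d_total);
    cudaFree(d_block_sums);
    cudaFree(d_in);

    // Add up the per-block partial sums on the host and compare everything.
    int sum_of_blocks = 0;
    for (int s : h_block_sums) sum_of_blocks += s;
    return ok && sum_of_blocks == expected && h_total == expected;
}
```

Calling `cub_reduce_demo()` from a `main` function and compiling with `nvcc` should be enough to try it. The design point worth noticing is that the block-wide collective is a device-side class that borrows shared memory from the calling kernel, while the device-wide primitive is a host-callable function that launches its own kernels and asks the caller to supply temporary storage.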