Dask cuDF (a.k.a. dask-cudf or dask_cudf
) is an extension library for Dask DataFrame that provides a Pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, Dask cuDF is automatically registered as the "cudf"
dataframe backend for Dask DataFrame.
[!IMPORTANT] Dask cuDF does not provide support for multi-GPU or multi-node execution on its own. You must also deploy a distributed cluster (ideally with Dask-CUDA) to leverage multiple GPUs efficiently.
Please visit the official documentation page for detailed information about using Dask cuDF.
See the RAPIDS install page for the most up-to-date information and commands for installing Dask cuDF and other RAPIDS packages.
A very common Dask cuDF use case is single-node multi-GPU data processing. These workflows typically use the following pattern:
import dask
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client
if __name__ == "__main__":
# Define a GPU-aware cluster to leverage multiple GPUs
client = Client(
LocalCUDACluster(
CUDA_VISIBLE_DEVICES="0,1", # Use two workers (on devices 0 and 1)
rmm_pool_size=0.9, # Use 90% of GPU memory as a pool for faster allocations
enable_cudf_spill=True, # Improve device memory stability
local_directory="/fast/scratch/", # Use fast local storage for spilling
)
)
# Set the default dataframe backend to "cudf"
dask.config.set({"dataframe.backend": "cudf"})
# Create your DataFrame collection from on-disk
# or in-memory data
df = dd.read_parquet("/my/parquet/dataset/")
# Use cudf-like syntax to transform and/or query your data
query = df.groupby('item')['price'].mean()
# Compute, persist, or write out the result
query.head()
If you do not have multiple GPUs available, using LocalCUDACluster
is optional. However, it is still a good idea to enable cuDF spilling.
If you wish to scale across multiple nodes, you will need to use a different mechanism to deploy your Dask-CUDA workers. Please see the RAPIDS deployment documentation for more instructions.