cuDF - A GPU-accelerated DataFrame library for tabular data processing

cuDF (pronounced "KOO-dee-eff") is an Apache 2.0 licensed, GPU-accelerated DataFrame library for tabular data processing. The cuDF library is one part of the RAPIDS GPU Accelerated Data Science suite of libraries.
cuDF is composed of multiple libraries, including libcudf, pylibcudf, cudf, cudf-polars, and dask-cudf.
Notable projects that use cuDF include:
Operating system, GPU driver, and supported CUDA version information can be found in the RAPIDS Installation Guide.
A stable release of each cudf library is available on PyPI. You will need to match the major version number of your installed CUDA version with a -cu## suffix when installing from PyPI.
A development version of each library is available as a nightly release by including the -i https://pypi.anaconda.org/rapidsai-wheels-nightly/simple index.
# CUDA 13
pip install libcudf-cu13
pip install pylibcudf-cu13
pip install cudf-cu13
pip install cudf-polars-cu13
pip install dask-cudf-cu13
# CUDA 12
pip install libcudf-cu12
pip install pylibcudf-cu12
pip install cudf-cu12
pip install cudf-polars-cu12
pip install dask-cudf-cu12
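The suffix rule can be sketched as a small helper that derives the wheel name from a CUDA version string. This is only an illustration: `wheel_name` and `install_command` are hypothetical names, not part of cuDF, and the set of supported CUDA majors is taken from the commands listed above. The caller supplies the CUDA version; no detection is attempted, and the command is built but never executed.

```python
from typing import List

# Nightly index URL as given in the installation notes.
NIGHTLY_INDEX = "https://pypi.anaconda.org/rapidsai-wheels-nightly/simple"
# CUDA majors that appear in the install commands above (an assumption of this sketch).
SUPPORTED_CUDA_MAJORS = {12, 13}


def wheel_name(library: str, cuda_version: str) -> str:
    """Return the PyPI package name for `library`, using only the CUDA major version."""
    major = int(cuda_version.split(".")[0])
    if major not in SUPPORTED_CUDA_MAJORS:
        raise ValueError(f"no wheel suffix known for CUDA {cuda_version}")
    return f"{library}-cu{major}"


def install_command(library: str, cuda_version: str, nightly: bool = False) -> List[str]:
    """Build (but do not run) the pip command as an argument list."""
    cmd = ["pip", "install"]
    if nightly:
        cmd += ["-i", NIGHTLY_INDEX]
    cmd.append(wheel_name(library, cuda_version))
    return cmd


print(" ".join(install_command("cudf", "12.4")))
# pip install cudf-cu12
print(" ".join(install_command("dask-cudf", "13.0", nightly=True)))
# pip install -i https://pypi.anaconda.org/rapidsai-wheels-nightly/simple dask-cudf-cu13
```

Returning an argument list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.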
A stable release of each cudf library is available to be installed with the conda package manager by specifying the -c rapidsai channel.
A development version of each library is available as a nightly release by specifying the -c rapidsai-nightly channel instead.
conda install -c rapidsai libcudf
conda install -c rapidsai pylibcudf
conda install -c rapidsai cudf
conda install -c rapidsai cudf-polars
conda install -c rapidsai dask-cudf
To install cuDF from source, please follow the contribution guide, which details how to set up the build environment.
The following examples showcase reading a Parquet file, dropping rows that contain null values, and performing a groupby aggregation on the data.
Import cudf directly; its APIs are largely similar to those of pandas.
import cudf
df = cudf.read_parquet("data.parquet")
df.dropna().groupby(["A", "B"]).mean()
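To make the semantics of that pipeline concrete without a GPU or a Parquet file, here is a plain-Python sketch of what `dropna().groupby(["A", "B"]).mean()` computes. The rows and the `dropna_groupby_mean` helper are made up for illustration. This is not how cuDF implements the operation, only a reference for the expected result.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows standing in for the contents of data.parquet; None marks a null.
rows = [
    {"A": 1, "B": "x", "C": 10.0},
    {"A": 1, "B": "x", "C": 20.0},
    {"A": 1, "B": "y", "C": None},    # dropped: contains a null
    {"A": 2, "B": "y", "C": 5.0},
    {"A": None, "B": "y", "C": 7.0},  # dropped: contains a null
]


def dropna_groupby_mean(rows, keys):
    """Drop rows with any null, group by `keys`, and average the remaining columns."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    groups = defaultdict(lambda: defaultdict(list))
    for r in complete:
        group = tuple(r[k] for k in keys)
        for col, value in r.items():
            if col not in keys:
                groups[group][col].append(value)
    # Sort by group key so the output order is deterministic.
    return {
        group: {col: fmean(values) for col, values in cols.items()}
        for group, cols in sorted(groups.items())
    }


result = dropna_groupby_mean(rows, ["A", "B"])
print(result)
# {(1, 'x'): {'C': 15.0}, (2, 'y'): {'C': 5.0}}
```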
With a Python file containing pandas code:
import pandas as pd
df = pd.read_parquet("data.parquet")
df.dropna().groupby(["A", "B"]).mean()
Use cudf.pandas by invoking Python with -m cudf.pandas:
$ python -m cudf.pandas script.py
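When the same script has to run on machines with and without cuDF, the launch command can be chosen at run time. The sketch below uses a hypothetical `launch_command` helper, which is not part of cuDF. It relies only on `importlib.util.find_spec` to check whether a `cudf` package is importable and does not verify that a usable GPU is present.

```python
import importlib.util
import sys
from typing import List, Optional


def launch_command(script: str, accelerate: Optional[bool] = None) -> List[str]:
    """Build the interpreter command for `script`.

    If `accelerate` is None, cudf.pandas is used only when a `cudf`
    package can be found on the import path.
    """
    if accelerate is None:
        accelerate = importlib.util.find_spec("cudf") is not None
    cmd = [sys.executable]
    if accelerate:
        cmd += ["-m", "cudf.pandas"]
    cmd.append(script)
    return cmd


print(launch_command("script.py", accelerate=True))
print(launch_command("script.py", accelerate=False))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the script.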
If running the pandas code in an interactive Jupyter environment, call %load_ext cudf.pandas before
importing pandas.
In [1]: %load_ext cudf.pandas
In [2]: import pandas as pd
In [3]: df = pd.read_parquet("data.parquet")
In [4]: df.dropna().groupby(["A", "B"]).mean()
Using Polars' lazy API, call collect with engine="gpu" to run the operation on the GPU.
import polars as pl
lf = pl.scan_parquet("data.parquet")
lf.drop_nulls().group_by(["A", "B"]).mean().collect(engine="gpu")
For bug reports or feature requests, please file an issue on the GitHub issue tracker.
For questions or discussion about cuDF and GPU data processing, feel free to post in the RAPIDS Slack workspace.
cuDF is open to contributions from the community! Please see our guide for contributing to cuDF for more information.