Optimized primitives for collective multi-GPU communication
To install this package with conda, run one of the following:
conda install jjh_cio_testing::nccl
conda install jjh_cio_testing/label/in_defaults::nccl