LAMMPS-SNAP performance drop with LLVM/Clang on Nvidia A100

Hello, can someone help with this, please? 🙏 I am running LAMMPS-SNAP (23 June 2022 release, from the NERSC N10 Benchmarks / Materials by Design Workflow on GitLab), built on Kokkos 3.6.1 with the CUDA 12.2 backend (Nvidia A100 on Perlmutter), as a test application. I notice a ~20% difference in loop time when Clang (LLVM 17, 18, 19, and 20 all behave similarly) is used as the device compiler instead of nvcc 12.2, keeping the same MPI implementation. Can someone guide me to where the issue might be or how to start debugging?
I am new to both compilers and LAMMPS, so please bear with some naive questions.
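For context, the two builds differ only in the device compiler, roughly like this (a sketch; the nvcc_wrapper path is an assumption and the Perlmutter module setup is omitted):

# nvcc as device compiler, via the nvcc_wrapper script bundled with Kokkos
cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ARCH_AMPERE80=ON \
  -D CMAKE_CXX_COMPILER=$HOME/lammps/lib/kokkos/bin/nvcc_wrapper ../cmake

# clang as device compiler (Clang compiles CUDA sources directly, no wrapper)
cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ARCH_AMPERE80=ON \
  -D CMAKE_CXX_COMPILER=clang++ ../cmake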

Is there a reason you need to compile with Clang?

Would it not be more appropriate to use the vendor's compiler for their own hardware, since Nvidia is more likely to have a compiler highly optimized for the A100 (sometimes using undocumented tricks)? My guess is the Nvidia compiler knows how to produce optimized code for the latest A100 tensor cores while Clang does not.

It's also time to update your LAMMPS to the latest version (the Kokkos library in the develop branch has been upgraded to 4.3.01).

git clone https://github.com/lammps/lammps.git

then read:

https://docs.lammps.org/Build_cmake.html


Thank you so much for the response, really appreciate it. The only reason for using Clang on the device is that my organization is focusing on adopting open-source implementations instead of vendor-provided ones. The goal is to get the difference below 10%. I would also like to mention a few insights:

  1. The ComputeZi SNAP kernel shows the biggest difference in loop time of all kernels during profiling (timings gathered roughly as sketched below this list).
  2. The current LAMMPS stable version (commit: 46265e3) has a flag that is set only when Clang is the device compiler: KOKKOS_IMPL_CUDA_CLANG_WORKAROUND. I didn't find any documentation about why it exists or what it does.
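
For reference, the per-kernel timings came from runs like these (a sketch: the input name in.snap and the tool path are assumptions; the kernel timer is from https://github.com/kokkos/kokkos-tools):

# Per-kernel timings via the Kokkos Tools kernel timer
# (newer Kokkos also accepts the KOKKOS_TOOLS_LIBS variable)
export KOKKOS_PROFILE_LIBRARY=/path/to/kokkos-tools/kp_kernel_timer.so
srun -n 4 ./lmp -k on g 4 -sf kk -in in.snap

# Alternatively, with Nsight Systems (one report per rank)
srun -n 4 nsys profile -o snap.%q{SLURM_PROCID} ./lmp -k on g 4 -sf kk -in in.snap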

Please let me know what you think.

Please note that the SNAP code has been heavily optimized with the help of software engineers from both Nvidia and AMD to give the best performance possible with the “native” tools on the top supercomputers.

If you don't use those tools (for whatever reason), you must expect degraded performance.

Kokkos is still a rapidly moving target. By using the stable branch you are significantly behind current development. Please note that good performance comes from both Kokkos and the LAMMPS code.

These kinds of KOKKOS_IMPL flags are part of Kokkos, so you need to consult the Kokkos documentation and the Kokkos developers to find out more: https://kokkos.org/
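
If you just want to see what the macro guards, you can search the Kokkos sources bundled with LAMMPS directly (path relative to the LAMMPS source tree):

grep -rn "KOKKOS_IMPL_CUDA_CLANG_WORKAROUND" lib/kokkos/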

Thank you so much for the quick response. I do understand your point, and it does make sense that spending the time to optimize away the last 20% will not be worth it.

Also, regarding LAMMPS in the develop branch: it was mentioned somewhere in the build docs that the Kokkos version that ships with LAMMPS is heavily tested before every release, so do you think a naive replacement with the latest Kokkos version will work? (The two variants I have in mind are sketched below.)
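
# (a) swap out the Kokkos tree bundled with LAMMPS
cd lammps/lib
mv kokkos kokkos.bundled
git clone --branch 4.4.00 https://github.com/kokkos/kokkos.git

# (b) build against a separately installed Kokkos -- whether the
# EXTERNAL_KOKKOS CMake option exists in a given LAMMPS version is an
# assumption to verify in its build docs
cmake -D PKG_KOKKOS=yes -D EXTERNAL_KOKKOS=yes \
  -D Kokkos_ROOT=/path/to/kokkos-install ../cmake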

Gratefully,
Shubh

Yes, normally it works fine.

Thank you so much, and apologies for the delayed response. I tried a naive replacement with Kokkos 4.4.0 in LAMMPS (Update 4 for Stable release 2 August 2023) and got the following error:

LAMMPS (2 Aug 2023 - Update 4)
KOKKOS mode with Kokkos version 4.4.0 is enabled (src/KOKKOS/kokkos.cpp:108)
  will use up to 4 GPU(s) per node
  using 1 OpenMP thread(s) per MPI tasks
...
...
cxil_map: write error
MPICH ERROR [Rank 1] [job id 29930307.0] [Thu Aug 29 02:15:43 2024] [nid001005] - Abort(471444751) (rank 1 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
PMPI_Irecv(166)........: MPI_Irecv(buf=0x6a4f4be80, count=589824, MPI_DOUBLE, src=33, tag=0, MPI_COMM_WORLD, request=0x7ffc42ad33f4) failed
MPID_Irecv(529)........: 
MPIDI_irecv_unsafe(163): 
MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Invalid argument)

aborting job:
Fatal error in PMPI_Irecv: Other MPI error, error stack:

CMake build command:

cmake -D CMAKE_C_COMPILER=cc -D CMAKE_CXX_COMPILER=CC \
  -D CMAKE_BUILD_TYPE=Release \
  -D CMAKE_INSTALL_PREFIX=${PSCRATCH}/exaalt_device_compilers/LMP_TEST \
  -D LAMMPS_EXCEPTIONS=on \
  -D BUILD_SHARED_LIBS=yes \
  -D BUILD_MPI=yes \
  -D PKG_KOKKOS=yes -D Kokkos_ARCH_AMPERE80=ON -D Kokkos_ENABLE_CUDA=yes \
  -D PKG_MANYBODY=yes \
  -D PKG_REPLICA=yes \
  -D PKG_ML-SNAP=yes \
  -D PKG_EXTRA-FIX=yes \
  -D PKG_MPIIO=yes \
  -D LAMMPS_SIZES=BIGBIG \
  -D CMAKE_CXX_STANDARD=17 \
  ../cmake

Please let me know what you think about this.

Can you try disabling async malloc?

For CMake you need to configure with -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC:BOOL=OFF; for the traditional Makefile build, set KOKKOS_CUDA_OPTIONS = "enable_lambda,disable_malloc_async".
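
For example, with CMake the flag slots in next to the other Kokkos options (minimal sketch; the rest of your configure line above carries over unchanged):

cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ARCH_AMPERE80=ON \
  -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC:BOOL=OFF ../cmake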

Otherwise, your MPI may not be CUDA-aware.
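
On a Cray system like Perlmutter, GPU-aware MPI is also opt-in at run time, so it is worth confirming it is switched on (a sketch; assumes cray-mpich and the GPU modules are loaded, and in.snap is a placeholder input):

export MPICH_GPU_SUPPORT_ENABLED=1
srun -n 4 ./lmp -k on g 4 -sf kk -in in.snap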

My experience is that you need to build your own OpenMPI configured for CUDA and your local cluster environment; otherwise you will get into all kinds of trouble.

This is how I did it:

wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xzvf openmpi-5.0.5.tar.gz; cd openmpi-5.0.5

module load StdEnv/2023 gcc/12.3 cuda/12.2

./configure --prefix=$HOME/local/openmpi-5.0.5 --with-cuda=$CUDA_HOME \
--with-cuda-libdir=$CUDA_HOME/lib64/stubs --disable-io-romio \
--without-knem --with-io-romio-flags=--without-ze ; make -j 64 all; make install

StdEnv/2023 is the standard module for my local clusters; yours will be different.

OpenMPI configure options for your local cluster might also be different; consult the OpenMPI documentation:

https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/index.html

$HOME/local/modules/openmpi/5.0.5:

#%Module
# Tcl modulefile: $HOME is not a Tcl variable, so use $env(HOME);
# braces around the value would also suppress substitution
set prefix $env(HOME)/local/openmpi-5.0.5
set version 5.0.5
prepend-path CMAKE_PREFIX_PATH ${prefix}
prepend-path PATH ${prefix}/bin
prepend-path CPATH ${prefix}/include
prepend-path LIBRARY_PATH ${prefix}/lib
prepend-path LD_LIBRARY_PATH ${prefix}/lib
prepend-path MANPATH ${prefix}/share/man
prepend-path PKG_CONFIG_PATH ${prefix}/lib/pkgconfig
setenv MODULE_OPENMPI_PREFIX ${prefix}

and then you can use

module use $HOME/local/modules
module load StdEnv/2023 gcc/12.3 cuda/12.2 openmpi/5.0.5

before building and running lmp.


Thank you so much for the answer, I really appreciate your efforts. I will try this out, but I would like to mention that I am running with Cray-MPICH 8.1.28 on Perlmutter at NERSC, which uses gcc 12.3 for the host and nvcc 12.2 for the device. Also, I only see this error with Kokkos 4.4 and not with Kokkos 3.7.2. Would you still expect a difference with a custom OpenMPI build?

Thank you, I will test this out and report back. Also, LAMMPS (Update 4 for Stable release 2 August 2023) only builds with clang >= 19 for me. Shouldn't this be mentioned somewhere in the release docs (I am happy to help)? I had to try multiple versions starting from clang 16 to find one that works.

Please be careful with such statements. It is not correct that LAMMPS requires these Clang compiler versions. LAMMPS itself requires C++11 and thus supports a wide variety of compilers (back to GCC 4.8.x on CentOS 7, which is barely C++11-compatible).

This, however, is not true for Kokkos. LAMMPS 2 Aug 2023 has only been vetted with Kokkos 3.x, which requires a C++14 capable compiler. So by using a Kokkos 4.x version, you are taking a significant risk of incompatibilities; perhaps not in syntax (or else it would not compile), but in semantics.

The current LAMMPS development version (hopefully released later today as the new stable version) has been vetted with Kokkos 4.3.x, which requires C++17. I have been able to compile a LAMMPS executable from these sources, including GPU support for an AMD GPU, using the hipcc compiler, which is based on Clang 17.

You have to check the Kokkos release notes to determine which compilers are compatible with it. The requirements for your use mode (GPU support via Clang) may differ from those for Kokkos/OpenMP or Kokkos/HIP, for example.

Apologies for my naive conclusion; I will be more careful next time. I really appreciate the clarification. I am still learning, and this is incredibly helpful.

This works, thank you so much! Is it possible to elaborate on why this worked and why it is not the default?

The version of UCX used by some MPI implementations does not support cudaMallocAsync. For more info see these links:

You are not the first person to have this issue, and I'm at the point where I think we should change the default, at least until it gets fixed. I plan to submit a PR.


Thank you so much for the links, they help a lot.