Insufficient memory on accelerator error on GPU

Hi all,

I recently got LAMMPS to work on GPUs and wanted to test one of my systems to see how much of a performance upgrade I get. However, when I try to run the simulation on the GPU, I get the following error.

I have little to no knowledge about GPU errors. Could someone please help me understand why this error occurs and how to solve the problem?

I have attached all the input files to this post. cpu.out is the output file I get when I run the simulation on CPUs. I want to test how fast the GPU is compared to this CPU simulation.

We have an RTX 3090 GPU:

Device 0: NVIDIA GeForce RTX 3090, 82 CUs, 23/24 GB, 1.7 GHZ (Mixed Precision)

Error:

ERROR on proc 0: Insufficient memory on accelerator (src/GPU/pair_lj_cut_gpu.cpp:110)
Last command: minimize 0.0 0.0 1000 10000
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Cuda driver error 4 in call at file '/home/vmahajan/softwares/lammps-gpu/lib/gpu/geryon/nvd_timer.h' in line 98.
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Cuda driver error 4 in call at file '/home/vmahajan/softwares/lammps-gpu/lib/gpu/geryon/nvd_timer.h' in line 98.
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
"slurm-54484.out" 114L, 4856B 

Link to the input files: test_gpu - Google Drive
Thank you.

Since GPU support is a bit of a moving target, it is best to always test with the latest available LAMMPS version. You are using 2 Aug 2023 - Update 3, but the latest is 29 Aug 2024 - Update 1.

FYI, I have been able to run your input deck to completion on my laptop, which has an Intel CPU with an integrated GPU with the following specs (from ocl-get-devices):

Device 0: "Intel(R) Iris(R) Xe Graphics"
  Type of device:                                GPU
  Supported OpenCL Version:                      3.0
  Is a subdevice:                                No
  Double precision support:                      No
  Total amount of global memory:                 13.9662 GB
  Number of compute units/multiprocessors:       80
  Total amount of constant memory:               4294959104 bytes
  Total amount of local/shared memory per block: 65536 bytes
  Maximum group size (# of threads per block)    512
  Maximum item sizes (# threads for each dim)    512 x 512 x 512
  Clock rate:                                    1.3 GHz

This is with the current development branch from the git repository.

With 4 MPI processes and the GPU (in single precision mode since that is all the Xe GPU supports) I get the following timings for minimization and run:

Loop time of 119.254 on 4 procs for 1000 steps with 470300 atoms

61.8% CPU use with 4 MPI tasks x 1 OpenMP threads

[...]

Loop time of 233.505 on 4 procs for 2500 steps with 470300 atoms

Performance: 1.850 ns/day, 12.973 hours/ns, 10.706 timesteps/s, 5.035 Matom-step/s
76.0% CPU use with 4 MPI tasks x 1 OpenMP threads

With 4 MPI processes without GPU acceleration I get the following timings:

Loop time of 999.789 on 4 procs for 1000 steps with 470300 atoms

99.6% CPU use with 4 MPI tasks x 1 OpenMP threads

[...]

Loop time of 1444.19 on 4 procs for 2500 steps with 470300 atoms

Performance: 0.299 ns/day, 80.233 hours/ns, 1.731 timesteps/s, 814.124 katom-step/s
99.7% CPU use with 4 MPI tasks x 1 OpenMP threads

So the speedup on this machine is about 8.4x for minimization and 6.2x for MD.


Thank you so much, I will try it!

Thank you very much again. That is very promising! This is what I had expected.

I will install the latest version of LAMMPS and try again.

Hi Axel,

I tried the latest version but I still get the same error.

LAMMPS (29 Aug 2024 - Update 1)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0 0 0) to (240 240 240)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  470300 atoms
  reading velocities ...
  470300 velocities
  scanning bonds ...
  1 = max bonds/atom
  scanning angles ...
  1 = max angles/atom
  orthogonal box = (0 0 0) to (240 240 240)
  1 by 1 by 1 MPI processor grid
  reading bonds ...
  299 bonds
  reading angles ...
  298 angles
Finding 1-2 1-3 1-4 neighbors ...
  special bond factors lj:    0        0        0       
  special bond factors coul:  0        0        0       
     2 = max # of 1-2 neighbors
     2 = max # of 1-3 neighbors
     4 = max # of 1-4 neighbors
     6 = max # of special neighbors
  special bonds CPU = 0.048 seconds
  read_data CPU = 2.702 seconds
Finding 1-2 1-3 1-4 neighbors ...
  special bond factors lj:    0        0        0.5     
  special bond factors coul:  0        0        0       
     2 = max # of 1-2 neighbors
     2 = max # of 1-3 neighbors
     4 = max # of 1-4 neighbors
     6 = max # of special neighbors
  special bonds CPU = 0.039 seconds
300 atoms in group pol

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:
- GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
- Type Label Framework: https://doi.org/10.1021/acs.jpcb.3c08419
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

WARNING: Using a manybody potential with bonds/angles/dihedrals and special_bond exclusions (src/pair.cpp:243)

--------------------------------------------------------------------------
- Using acceleration for tersoff/gpu:
-  with 1 proc(s) per device.
-  with 1 thread(s) per proc.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 3090, 82 CUs, 23/24 GB, 1.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.

WARNING: Communication cutoff 0 is shorter than a bond length based estimate of 4.295. This may lead to errors. (src/comm.cpp:730)
WARNING: Increasing communication cutoff to 9.106544 for GPU pair style (src/GPU/pair_tersoff_gpu.cpp:228)

--------------------------------------------------------------------------
- Using acceleration for lj/cut:
-  with 1 proc(s) per device.
-  with 1 thread(s) per proc.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA GeForce RTX 3090, 82 CUs, 23/24 GB, 1.7 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.

Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 12
  ghost atom cutoff = 12
  binsize = 6, bins = 40 40 40
  4 neighbor lists, perpetual/occasional/extra = 4 0 0
  (1) pair tersoff, perpetual, skip from (4)
      attributes: full, newton on, ghost, cut 5.553272
      pair build: skip/ghost
      stencil: none
      bin: none
  (2) pair lj/cut, perpetual, skip from (3)
      attributes: full, newton on
      pair build: skip
      stencil: none
      bin: none
  (3) neighbor class addition, perpetual
      attributes: full, newton on
      pair build: full/bin
      stencil: full/bin/3d
      bin: standard
  (4) neighbor class addition, perpetual
      attributes: full, newton on, ghost, cut 5.553272
      pair build: full/bin/ghost
      stencil: full/ghost/bin/3d
      bin: standard
Setting up cg style minimization ...
  Unit style    : real
  Current step  : 0
WARNING: Communication cutoff adjusted to 12 (src/comm.cpp:739)
ERROR on proc 0: Insufficient memory on accelerator (src/GPU/pair_lj_cut_gpu.cpp:110)
Last command: minimize 0.0 0.0 1000 10000
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Cuda driver error 4 in call at file '/home/vmahajan/softwares/lammps-gpu/lib/gpu/geryon/nvd_timer.h' in line 98.
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Cuda driver error 4 in call at file '/home/vmahajan/softwares/lammps-gpu/lib/gpu/geryon/nvd_timer.h' in line 98.
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

Could this be due to a CUDA version issue?

Hi William,

I tested your input files on an A800 cluster and got a different error.
Device 0: NVIDIA A800-SXM4-80GB, 108 CUs, 76/79 GB, 1.4 GHZ (Mixed Precision)
The minimization ran for 371 steps and crashed with
Cuda driver error 700 in call at file '/share/home/XXX/lammps-29Aug2024/lib/gpu/geryon/nvd_memory.h' in line 502.

I don’t know if it is because there are too many particles and the VRAM is insufficient, or because of CUDA.

I don’t know, I currently do not have a usable Nvidia GPU on any of my development machines.
You could try compiling with OpenCL instead of CUDA. There are some differences in the code paths.

@ndtrung is our expert for the GPU package. Perhaps he has some idea.

@William_Moriarty I can reproduce the Insufficient memory on accelerator issue with NVIDIA GPUs with 16 and 40 GB of RAM. The issue seems to disappear with the OpenCL build on these NVIDIA GPUs. I need to take a closer look at the issue in the CUDA code path. For now I would follow Axel’s suggestion of using the OpenCL build (which is the default option, -DGPU_API=opencl).
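
For reference, a minimal CMake configuration along those lines might look like the sketch below. The source path, build directory name, and package list are assumptions here, so adjust them to your own tree and to the packages your input deck actually needs.

cd lammps
mkdir build-opencl && cd build-opencl
cmake ../cmake -D PKG_GPU=on -D GPU_API=opencl -D PKG_MOLECULE=on -D PKG_MANYBODY=on
cmake --build . -j 8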

For the hybrid pair style in use (tersoff + lj/cut), the performance difference between the OpenCL and CUDA builds might be small, because tersoff/gpu would dominate.

I tried compiling with OpenCL and the error goes away. There is another problem, though: after some number of steps, the log files and the trajectories are no longer updated.

Do you think it is a problem with the size of the system?

Hi,

I am testing OpenCL with some other systems and I get this error:

OpenCL error in file '/home/vmahajan/softwares/lammps-gpu-plumed/lib/gpu/geryon/ocl_memory.h' in line 655 : -5.

Error code -5 seems to indicate an out-of-resources error.

Do you have any idea about this error?

I tried with a much smaller system. Running nvidia-smi on the node running the simulation gives me the output below. I can see that a lot of memory is free (1049 MiB / 24576 MiB is used), or am I reading this wrong?

I am attaching the input files in case you want to try the system.
4.25.tar.xz (304.2 KB)

If you do not have PLUMED, just remove the respective lines from cg.in.

Thank you.

Fri Oct 25 12:56:34 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0 Off |                  N/A |
| 30%   33C    P2             107W / 350W |   1049MiB / 24576MiB |     28%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:41:00.0 Off |                  N/A |
| 30%   33C    P2             117W / 350W |   1049MiB / 24576MiB |     27%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  | 00000000:81:00.0 Off |                  N/A |
| 30%   38C    P2             115W / 350W |   1049MiB / 24576MiB |     27%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  | 00000000:C1:00.0 Off |                  N/A |
| 30%   38C    P2             107W / 350W |   1049MiB / 24576MiB |     28%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2431443      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    0   N/A  N/A   2431445      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    0   N/A  N/A   2431447      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    0   N/A  N/A   2431449      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    1   N/A  N/A   2431444      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    1   N/A  N/A   2431446      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    1   N/A  N/A   2431448      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    1   N/A  N/A   2431450      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    2   N/A  N/A   2431451      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    2   N/A  N/A   2431452      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    2   N/A  N/A   2431453      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    2   N/A  N/A   2431454      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    3   N/A  N/A   2431455      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    3   N/A  N/A   2431456      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    3   N/A  N/A   2431457      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
|    3   N/A  N/A   2431458      C   ...res/lammps-gpu-plumed/build/lmp_gpu      258MiB |
+---------------------------------------------------------------------------------------+

@William_Moriarty could you rebuild LAMMPS with the following change to the source file lib/gpu/lal_neighbor.cpp:

diff --git a/lib/gpu/lal_neighbor.cpp b/lib/gpu/lal_neighbor.cpp
index 288415e0e7..aca9b1d141 100644
--- a/lib/gpu/lal_neighbor.cpp
+++ b/lib/gpu/lal_neighbor.cpp
@@ -365,7 +365,9 @@ void Neighbor::get_host(const int inum, int *ilist, int *numj,
       int i=ilist[ii];
       three_ilist[i] = ii;
     }
-    three_ilist.update_device(inum,true);
+    // needs to transfer _max_atoms because three_ilist indexes all the atoms (local and ghost)
+    // not just inum (number of neighbor list items)
+    three_ilist.update_device(_max_atoms,true);
   }
 
   time_nbor.stop();
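
To pick up the change, the GPU library has to be recompiled. A minimal sketch, assuming a CMake build in the build directory visible in your nvidia-smi output (the patch file name is hypothetical; you can just as well edit lib/gpu/lal_neighbor.cpp by hand):

cd /home/vmahajan/softwares/lammps-gpu-plumed
patch -p1 < lal_neighbor_three_ilist.patch   # hypothetical file containing the diff above
cd build
cmake --build . -j 8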

Hi @ndtrung

I am sorry for the late response. I tried this, and it indeed works for LAMMPS + PLUMED on the GPU.

But the performance is very slow. I am attaching the log file for your reference.
log.lammps (179.8 KB)

Thank you so much. Could you also elaborate in a bit more detail on what the issue was? I am just curious.

Kind Regards.

Slow compared to what?

I was running the LAMMPS command like this: mpiexec -np 6 $lmp -in cg.in -sf gpu -pk gpu 0 neigh no, and I was getting less than 1 ns/day.

However, mpiexec -np 1 $lmp -in cg.in -sf gpu -pk gpu 0 neigh no was giving me ~95 ns/day.

I had done the same simulation on CPUs, and the performance I was getting was:

Performance: 80.968 ns/day, 0.296 hours/ns, 468.567 timesteps/s, 2.358 Matom-step/s
99.0% CPU use with 16 MPI tasks x 1 OpenMP threads

I was under the impression that using GPUs would give me an immense performance boost. Please correct me if I am wrong here.

@William_Moriarty your system has 5032 atoms, and you are using 6 MPI procs x 2 OpenMP threads, which is just too many for such a small system. As you can see, with 1 MPI proc and 1 thread the performance is better. You can try 2 MPI procs to see if there is any improvement, for example with the command below. Note that the number of neighbors per atom is pretty low (0.0357), so it is expected that GPU acceleration would provide only a modest speedup in this case (~95 vs. ~81 ns/day on the 16 MPI proc run).
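
A 2-proc run would be the same command you already used, only with a different -np (a sketch, assuming the same input and package settings):

mpiexec -np 2 $lmp -in cg.in -sf gpu -pk gpu 0 neigh no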

You are a victim of both successful marketing and a lack of your own research into the matter of GPU acceleration. And of ignoring some common sense, too.

First, there is Amdahl’s law, which states that acceleration is limited by the fraction of the total time that is accelerated (or parallelized). Even with perfect, infinite acceleration, you can only get a total speedup of 5x if 20% of your total time is not accelerated.
In your case, for example, the neighbor lists must be computed on the CPU and not on the GPU (you run with neigh no), so there is no acceleration for that part, plus there is the extra overhead of transferring the neighbor lists to the GPU.
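
Written out (just an illustrative restatement, with p the accelerated fraction of the total time and s the speedup of that fraction):

S_total = 1 / ((1 - p) + p / s)

So even with s going to infinity, p = 0.8 gives S_total = 1 / 0.2 = 5.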

Second, GPUs support massive parallelism, but that comes at a price: the individual compute units are not that fast by themselves, there are just a lot of them, so you need to be able to create many work units. With a system this small, you don’t have that. So the overhead of launching a calculation on the GPU becomes significant, and by oversubscribing the GPU with multiple MPI processes you amplify that effect.

Third, you have a somewhat aged consumer GPU, so there are limits there as well compared to much more expensive data center GPUs.

Finally, if you have an unlimited budget, CPUs will always beat GPUs, since the scaling limit is at far fewer atoms per CPU core, and with hybrid MPI + OpenMP parallelization you can even minimize the communication collisions. That said, on high-end supercomputers you cannot afford to build large enough CPU-only machines, so GPUs are the way to go to achieve very high throughput, but you need a suitable problem for them.

Bottom line: there is a right tool and a right way to use it for every problem. If you try to make something work that does not fit, you have to manage your expectations. You can do that by studying the available LAMMPS documentation and publications. With publications, online benchmarks, and so on, you always have to be careful, since people only show you the good stuff. So you also have to watch out for what is not shown, and that is usually where the performance is less than ideal or outright abysmal.


That makes sense. Thank you so much for your reply and for suggesting a fix.

It now works with OpenCL. I do not know whether CUDA is faster than OpenCL, but it would be great if this now worked with both of them. If I compile with CUDA instead of OpenCL, I get an error:
symbol lookup error: /home/vmahajan/softwares/lammps-gpu-plumed/build/lmp_gpu: undefined symbol: cuInit

Thank you so much, Axel, this makes a lot of sense. As you can see, my knowledge about GPUs is close to non-existent. I have certainly learned a lot!