Lmp_mpi: geryon/ucl_d_vec.h:350: int ucl_cudadr::UCL_D_Vec<numtyp>::resize(int) [with numtyp = int]: Assertion `_kind!=UCL_VIEW' failed

Dear all,

I have LAMMPS version 29 Aug 2024 (Update 1) installed with CUDA 12.2. My system has one RTX A6000 GPU. I am now testing it. The job starts normally but stops after some time.

I am now testing the wetting of water droplets on a metal surface. When the simulation system is small (Lx * Ly * Lz = 30d * 30d * 60d, d = 3.61 Å), the program runs normally. But when I enlarge the simulation system (Lx * Ly * Lz = 50d * 50d * 300d, d = 3.61 Å), the program starts normally but terminates abnormally after some time, and instead of the usual Error message it prints the following: lmp_mpi: geryon/ucl_d_vec.h:350: int ucl_cudadr::UCL_D_Vec<numtyp>::resize(int) [with numtyp = int]: Assertion `_kind!=UCL_VIEW' failed.

A strange phenomenon is that when I change the number of CPUs I use, the time at which the error occurs changes. By chance, when I changed the run command from "mpirun -np 20 lmp_mpi -sf gpu -pk gpu 1 -in in.runshi" to "mpirun -np 1 lmp_mpi -sf gpu -pk gpu 1 -in in.runshi", i.e., to use only one CPU, the program ran fine. But it runs far too slowly, which is obviously unacceptable for a larger system.

I searched the forum carefully for related problems and solutions, but did not find the same error message. I am not sure whether this is a bug in the way the software runs or a problem with my simulation setup. I am therefore attaching two archives containing the input files, log files, and error messages for two differently sized systems, and I look forward to your attention.

Since this is the first time I am asking a question here, I am not sure whether the information I have provided is detailed enough. If you need more information for troubleshooting, please let me know!

Many thanks in advance for any guidance on this issue.

Deyang
Large_system.rar (17.7 KB)
Small_system.rar (17.5 KB)

This looks like you are running out of memory on the GPU. There is not much you can do about this other than running a smaller system, or running on a machine that has more GPUs, GPUs with more RAM, or both.

P.S.: please note that .rar files are useless to most people, especially to those who are not running Windows.

Dear Axel, thank you for your attention and reply. I am sorry for submitting .rar archives. I have resubmitted the files from the two simulations and distinguished them by name.
Error Message (27.0 KB)
in.large (2.5 KB)
large_log.lammps (24.8 KB)
in.small (2.5 KB)
log_small.lammps (52.2 KB)
TIP3P-Ewald.txt (612 Bytes)

To validate your suggestion, I attempted to run the same example on a supercomputer platform (RTX 4090), a desktop computer (RTX A6000), and my own Windows laptop (RTX 3060), using 1 GPU + 6 CPUs. The input script is attached below:

variable x equal 50
variable y equal 50
variable z equal 300
variable T equal 300
variable time equal 1
variable Tdamp equal 100*${time}
variable Pdamp equal 1000*${time}

units real
dimension 3
boundary p p p
timestep ${time}
atom_style full
bond_style harmonic
angle_style harmonic
neighbor 2.0 bin
neigh_modify every 1 delay 1 check yes

lattice fcc 3.61
region box block 0 50 0 50 0 300
create_box 3 box &
bond/types 1 &
angle/types 1 &
extra/bond/per/atom 2 &
extra/angle/per/atom 1 &
extra/special/per/atom 2

region bottom block INF INF INF INF INF 3
create_atoms 1 region bottom

region water sphere 25 25 13 10
molecule water TIP3P-Ewald.txt
lattice sc 3.2
#create_atoms 0 region water mol water 12345432 units box
create_atoms 1 random 4000 123454321 water mol water 12345432 overlap 1.33 units box

group water type 2 3
group cu type 1

#parameter
set type 2 charge -1.04844
set type 3 charge 0.52422
mass 1 64
mass 2 15.9994
mass 3 1.00797
#kspace_modify slab 3.0
kspace_style pppm/tip4p 1.0e-4
pair_style lj/cut/tip4p/long 2 3 1 1 0.1250 12.0 12.0
pair_coeff 1 1 13.4443 2.27 # Cu
pair_coeff 2 2 0.16275 3.16435 # o
pair_coeff 3 3 0 1.0 # h
pair_coeff 1 2 1.479 2.720475282 #Cu-O
pair_coeff 1 3 0 1.0 #Cu-h
pair_coeff 2 3 0 1.0 #o-h
bond_coeff 1 1000 0.9572 # o*-h*
angle_coeff 1 100 104.52 # h*-o*-h*
write_data initial.data

compute water_temp water temp
thermo 100
thermo_style custom step press c_water_temp

##minimize
fix SHAKE water shake 0.0001 10 0 b 1 a 1
#fix 111 cu setforce 0 0 0
#fix zwalls all wall/reflect zlo EDGE zhi EDGE
dump 1 all atom 100 mini.xyz.*
minimize 1e-10 1e-10 10000 10000
undump 1

velocity water create 300 815327 mom yes rot yes dist gaussian

##run
reset_timestep 0
dump 1 all custom 1000 nvt.dump.* id type x y z
fix 1 all nvt temp 300 300 100
fix_modify 1 temp water_temp
run 100000
undump 1
unfix 1

It ran smoothly on the supercomputing platform, but on the desktop computer it stopped after 75,700 steps with the error message lmp_mpi: geryon/ucl_d_vec.h:350: int ucl_cudadr::UCL_D_Vec<numtyp>::resize(int) [with numtyp = int]: Assertion `_kind!=UCL_VIEW' failed. On the laptop the error occurred after 48,000 steps:
job aborted:
[ranks] message
[0] terminated
[1] application aborted
aborting MPI_COMM_WORLD (comm=0x44000000), error -1, comm rank 1
[2-5] terminated
---- error analysis -----
[1] on DESKTOP-0VHNQL2
lmp aborted the job. abort code -1
---- error analysis -----

Our current simulation system should already be quite small; even in the system I named "Large" there are only 35,000 copper atoms and 4,000 water molecules. The simulations I submitted are just trial calculations, and I hope to eventually simulate much larger systems.
In fact, I once ran a system with over one million atoms (platinum and argon atoms, LJ potential) on our desktop computer without encountering similar errors. I am not quite sure why a few thousand water molecules make the GPU run out of memory. Is it a problem with my input script? I hope you can provide more guidance.

Thank you again for your reply and assistance!

Does this read properly for you? It does not for me, which means that you have not read, or are not following, the forum guidelines about formatting quoted text.

Those atoms don't have bond information attached to them and probably use a much shorter cutoff. The main memory consumer in MD (and thus in LAMMPS) is not the per-atom storage but the memory required for neighbor lists, which scales with the product of the number of atoms and the average number of neighbors per atom.
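
As a rough, back-of-the-envelope illustration (the atom density used below is an assumed round number, not taken from your run): with the 12 Å pair cutoff plus the 2 Å neighbor skin from your script, an atom in a condensed region at roughly 0.1 atoms/Å^3 sees on the order of

(4/3) * pi * (14 Å)^3 * 0.1 atoms/Å^3 ≈ 1,100 neighbors

so the ~47,000 atoms of the "Large" system (35,000 Cu plus 3 x 4,000 water atoms) translate into tens of millions of neighbor-list entries, before counting any other GPU data structures. A short-cutoff, LJ-only system with the same atom count needs roughly an order of magnitude less.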

There is a logic problem here. Using multiple MPI ranks with just one GPU does not give you that much added speed, since the GPU's acceleration does not scale with the number of processes sharing it, and sharing it adds overhead. The speedup happens only on the CPU side, which is a small part relative to the GPU acceleration. Still, using 4 MPI processes, or 9 at most, should give you some speedup on the CPU side, provided you avoid load-balancing issues. Those are significant here, since your simulation cell is mostly empty, so you would need to add a command like processors * * 1 to avoid domain decomposition along the (mostly empty) z axis.

Thus getting more GPU acceleration requires using more GPUs, so that you can use more MPI ranks. Please also note that a major bottleneck of your system is the use of long-range electrostatics: kspace_style pppm/tip4p is not GPU accelerated, yet it is very expensive, since its cost depends not only on the number of atoms but also on the volume, and the latter is very large in your case. For that reason, it would seem advantageous to use a longer Coulomb cutoff to offset the cost of PPPM. Most certainly, you do not want a Coulomb cutoff that is shorter than the LJ cutoff; that simply makes no sense.

TL;DR:
Try the following changes (a minimal sketch of how they fit into the posted input follows the list):

  • add processors * * 1 before defining the box
  • change the Coulomb cutoff from 10 Angstrom to 12
  • use mpirun -np 4
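
A minimal sketch, assuming the changes are applied to the in.large script posted above (only the processors line is new; the run command reuses the flags from the original post):

# before the box is defined in in.large:
processors * * 1        # decompose only in x and y; gives a 2x2x1 grid with 4 MPI ranks

# LJ and Coulomb cutoffs both set to 12 Angstrom:
pair_style lj/cut/tip4p/long 2 3 1 1 0.1250 12.0 12.0

# launch with 4 MPI ranks sharing the single GPU:
mpirun -np 4 lmp_mpi -sf gpu -pk gpu 1 -in in.large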

Dear Axel,

Thank you for your guidance! Your professional knowledge and experience have resolved an issue that had been troubling me for a month. After verifying it myself, I found that using the processors command appropriately and adjusting the number of MPI processes eliminates the errors and speeds up the calculation. Thank you once again, and I hope my question can serve as a reference for other users in the community who encounter similar problems.

Lastly, I have a small question. Could you please explain what information led you to recommend 4 CPUs (mpirun -np 4)? This would be valuable to know for larger simulations.

This is based on geometry considerations. With 4 processes you have an even 2 by 2 decomposition for a square geometry in x and y. The next best option would be 3x3=9.
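
For illustration, using the 50 d x 50 d cross-section of the box posted above (the arrows just restate the resulting grid; nothing here goes beyond the commands already discussed):

mpirun -np 4 ... with processors * * 1  ->  2 x 2 x 1 grid, each rank owns a 25 d x 25 d column of the box
mpirun -np 9 ... with processors * * 1  ->  3 x 3 x 1 grid, each rank owns a ~16.7 d x 16.7 d column

A rank count such as 6 would force an uneven split like 3 x 2 x 1, giving elongated sub-domains with a worse surface-to-volume ratio and therefore more communication per atom.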

Ok! I understand, thank you very much for your help and reply!