Problem with GPU-enabled pair styles

Dear all, I’m a new LAMMPS user who has recently become involved in a project using LAMMPS for coarse-grained MD simulations. The systems under investigation are nucleosomes, so both proteins and nucleic acids. We found an HPS model implemented for this kind of system in the literature, but the developers built the model around the hybrid/overlay pair style, which does not run on GPUs:

pair_style hybrid/overlay table linear 4001 coul/long 40.0

bond_style hybrid table linear 3001 harmonic
angle_style hybrid table linear 3001 zero

Is there a way to overcome this problem without modifying the LAMMPS source code?

Thank you all in advance!

What kind of system are you running on? CPU, GPU, OS etc.?
What command line are you using to run LAMMPS?
What kind of speedup do you need from using the GPU?

Currently I’m running on CPUs, but they asked me if it is possible to speed up the simulation by using GPUs.
The command line I’m using is: mpirun --oversubscribe -np 8 lmp -in LAMMPS.run.inp (on a supercomputer with both MPI and OpenMP).
I’m performing tests on smaller systems (for example a single NCP; I’m also trying to set up a simulation with 10 NCPs, and the intention is to increase the number of nucleosomes by at least one order of magnitude). With one NCP of fewer than 3k atoms plus 1k ions, the current performance is more than 5 hours of wall time for 4 ns of MD.

So on what grounds do you then make the claim that pair style hybrid/overlay cannot be used with GPUs?

Who is “they”?

This information is useless without knowing what kind of hardware you are running on.
Why do you use --oversubscribe? Isn’t that counterproductive?

That sounds like a rather small system altogether.

That doesn’t sound like a lot. I’ve known projects where people had to run for weeks or months on thousands of processors (but those were quite a bit larger and were using more complex potentials).

LAMMPS has very good strong and weak scaling, especially for systems that do not require long-range electrostatics, so you should be able to get a good speedup when using more processors for a larger system.

Yes, you can.

That is what I understood from the pair_style hybrid command documentation (maybe I’m wrong). In addition, when I try to run simulations with GPUs enabled following the suggestions of the help desk (mpirun -gpu -np 4 --map-by socket:PE=8 --rank-by core lmp -k on g 4 -sf kk -in in.lj # 4 MPI tasks, 4 GPUs), it returns a segmentation fault.

My supervisors.

The characteristics of each node are the following:
--ntasks-per-node=4 # Number of MPI ranks per node
--cpus-per-task=8 # Number of threads per MPI rank
--gres=gpu:4 # Number of requested gpus per node, can vary between 1 and 4

(If it helps, I’m running on the Leonardo cluster of CINECA.)

Well, this is an interesting point, since this is the first time I work with this kind of system and simulation, and I had no idea about the expected performance. Thank you!

It is very important for any of these kinds of discussions, that you report which LAMMPS version you are using exactly and how it was compiled. If you capture the output from lmp -help and report everything up to the “List of individual style options included in this LAMMPS executable”, it should contain almost all of the useful information.

Any advice that is given by any of the LAMMPS developers will usually refer to the latest (feature) release, which is also what the default online documentation corresponds to. Currently, that is LAMMPS version 7 Feb 2024.

The pair style hybrid documentation explicitly lists this:

Accelerator Variants: hybrid/overlay/kk

When you run with the GPU package, no special pair style is needed, but you need to keep creating the neighbor lists on the CPU (LAMMPS should tell you this). This is done with: -sf gpu -pk gpu 0 neigh no
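
For example, a minimal sketch of such a run, reusing the MPI rank count and input file name from your earlier post (adjust both to how jobs are launched on your cluster):

mpirun -np 4 lmp -sf gpu -pk gpu 0 neigh no -in LAMMPS.run.inp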

When you run with the KOKKOS package, you either need to have a GPU-aware MPI library or you need to tell LAMMPS that it is not with: -pk kokkos gpu/aware off. The segfault you are seeing is likely a consequence of that.

This still doesn’t tell me anything about the hardware, but with this kind of request you should be using a LAMMPS version that has KOKKOS support for OpenMP and GPUs (= CUDA) included, and then use -k on g 4 t 8.
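
Putting this together, a sketch of a KOKKOS run matching your current request (4 MPI ranks, 4 GPUs, and 8 threads per rank per node; the input file name is assumed from your earlier post, and the build must include both the CUDA and OpenMP backends as described above):

mpirun -np 4 lmp -k on g 4 t 8 -sf kk -pk kokkos gpu/aware off -in LAMMPS.run.inp

Drop the -pk kokkos gpu/aware off part once you have confirmed that your MPI library is GPU-aware.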

With the GPU package you should have an executable that also includes the OPENMP package and then you can use -sf hybrid gpu omp -pk gpu 0 neigh no.

In both cases, OpenMP multi-threading can be added where no GPU acceleration is available. But for the GPU package you may also change your request to something like

--ntasks-per-node=16 # Number of MPI ranks per node
--cpus-per-task=2 # Number of threads per MPI rank
--gres=gpu:4 # Number of requested gpus per node, can vary between 1 and 4
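
A corresponding command line for one such node could then look like the following sketch (the input file name is assumed, the executable must include the GPU and OPENMP packages as described above, and -pk omp 2 matches the 2 threads per MPI rank requested here):

mpirun -np 16 lmp -sf hybrid gpu omp -pk gpu 0 neigh no -pk omp 2 -in LAMMPS.run.inp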

The GPU package is rather efficient when attaching multiple MPI tasks to the same GPU, since it only offloads part of the calculation to the GPU; this way it can achieve higher GPU occupancy and parallelize the non-GPU part better. Of course, it would be even better if there was a way to request the CUDA Multi-Process Service (MPS), but for using that correctly you need to contact your local user support.

Yes, it does. I had an account for a project there recently (but not anymore and I didn’t run LAMMPS on it).

Thank you very much for all the information!
The version currently compiled by the “managers” on the cluster is 20220623-openmpi-4.1.4-gcc-11.3.0-cuda-11.8, with KOKKOS.

This is a) an almost two-year-old version; for something that changes as quickly as GPU support with Kokkos, it is strongly recommended to use something more recent; and b) it would be nice if, for a change, you would provide the information I am asking for and not something that is related but not quite as specific.

Most of which can be found in this section of the LAMMPS manual:
https://docs.lammps.org/Speed_packages.html


I don’t have information about how the codes and software were compiled, since I am a user and not a sudoer on the HPC cluster. So please tell me exactly what you want to know and I will try to find the information.

I already did. Please see below which is quoted from a previous post of mine.
You don’t have to be a superuser to get this.
The same goes for the exact hardware, which you could collect with commands like lscpu, lspci, or nvidia-smi, but I am no longer concerned about that output at this point.

This is the output of lscpu:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping: 6
CPU MHz: 2601.000
CPU max MHz: 2601.0000
CPU min MHz: 800.0000
BogoMIPS: 5200.00
Virtualization: VT-x
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 49152K
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

Other information about the cluster architecture, displayed when logging in to the cluster:

Red Hat Enterprise Linux 8.7 (Ootpa)
Booster module: Atos Bull Sequana X2135 “Da Vinci” Blade
3456 compute nodes with:
- 32 cores Ice Lake at 2.60 GHz
- 4 x NVIDIA Ampere A100 GPUs, 64 GB
- 512 GB RAM

DataCentric General Purpose module (DCGP): Atos BullSequana X2140 Blade
1536 compute nodes with:
- 2 x 56 cores Intel Sapphire Rapids at 2.00 GHz
- 512 GB RAM

Internal Network: Nvidia Mellanox HDR DragonFly++

And again, you are not doing what I have asked for.

lmp -help
lmp: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

You either need to run on a compute node (after requesting an interactive session), or you need to copy libcuda.so.1 from a compute node into your account and then augment the LD_LIBRARY_PATH environment variable so that the executable can find the CUDA driver library and run.
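
A minimal sketch of the second option, assuming the driver library sits under /usr/lib64 on the compute node and using $HOME/cuda-driver-lib as an example destination (both paths are assumptions and may differ on Leonardo):

mkdir -p $HOME/cuda-driver-lib
# from an interactive session on a compute node:
cp /usr/lib64/libcuda.so.1 $HOME/cuda-driver-lib/
# afterwards, on the login node:
export LD_LIBRARY_PATH=$HOME/cuda-driver-lib:$LD_LIBRARY_PATH
lmp -help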

Is this okay (the output after lmp -help)?

Large-scale Atomic/Molecular Massively Parallel Simulator - 23 Jun 2022

Usage example: lmp -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:

-echo none/screen/log/both : echoing of input script (-e)
-help : print this help message (-h)
-in none/filename : read input from file or stdin (default) (-i)
-kokkos on/off … : turn KOKKOS mode on or off (-k)
-log none/filename : where to send log output (-l)
-mdi '<mdi flags>' : pass flags to the MolSSI Driver Interface
-mpicolor color : which exe in a multi-exe mpirun cmd (-m)
-cite : select citation reminder style (-c)
-nocite : disable citation reminder (-nc)
-package style … : invoke package command (-pk)
-partition size1 size2 … : assign partition sizes (-p)
-plog basename : basename for partition logs (-pl)
-pscreen basename : basename for partition screens (-ps)
-restart2data rfile dfile … : convert restart to data file (-r2data)
-restart2dump rfile dgroup dstyle dfile …
: convert restart to dump file (-r2dump)
-reorder topology-specs : processor reordering (-r)
-screen none/filename : where to send screen output (-sc)
-skiprun : skip loops in run and minimize (-sr)
-suffix gpu/intel/opt/omp : style suffix to apply (-sf)
-var varname value : set index style variable (-v)

OS: Linux “Red Hat Enterprise Linux 8.7 (Ootpa)” 4.18.0-425.19.2.el8_7.x86_64 x86_64

Compiler: GNU C++ 11.3.0 with OpenMP 4.5
C++ standard: C++14
MPI v3.1: Open MPI v4.1.4, package: Open MPI [email protected] Distribution, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022

Accelerator configuration:

GPU package API: CUDA
GPU package precision: mixed
KOKKOS package API: CUDA Serial
KOKKOS package precision: double

Compatible GPU present: yes

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_FFMPEG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint): 32-bit
sizeof(bigint): 64-bit

Available compression formats:

Extension: .gz Command: gzip
Extension: .bz2 Command: bzip2
Extension: .zst Command: zstd
Extension: .xz Command: xz
Extension: .lzma Command: xz

Installed packages:

ASPHERE BOCS CLASS2 CORESHELL DIELECTRIC DIFFRACTION DIPOLE DPD-BASIC
DPD-REACT DRUDE EXTRA-COMPUTE EXTRA-DUMP EXTRA-FIX EXTRA-MOLECULE EXTRA-PAIR
GPU KIM KOKKOS KSPACE MANYBODY MEAM MISC MOLECULE MOLFILE PHONON PLUGIN REAXFF
REPLICA RIGID SRD TALLY

Almost. It is missing the first two lines.