Can't Access Multiple GPUs when Using PPPM GPU Acceleration

Dear developers and users,

I’m running my simulation with an unmodified LAMMPS version (2 Aug 2023). I want to use multiple GPUs and the CUDA API to accelerate the PPPM algorithm. However, I can’t access more than one GPU when running the program. I have tried numerous ways to solve this problem, but I have not succeeded yet.

Hardware and Software Platform

My server has 64 CPU cores and 4 NVIDIA A100 GPUs; the hardware configuration is shown below:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                    0 |
| N/A   45C    P0              90W / 400W |   7405MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:4B:00.0 Off |                    0 |
| N/A   43C    P0              54W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   42C    P0              55W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:E3:00.0 Off |                    0 |
| N/A   44C    P0              63W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

And here are the libraries I use:

  1. cmake/3.26.3-intel-2021.4.0
  2. cuda/12.1.1
  3. oneapi/2021.4.0
  4. intel-oneapi-mpi/2021.4.0
  5. intel-oneapi-mkl/2021.4.0
  6. intel-oneapi-tbb/2021.4.0
  7. intel-oneapi-compilers/2021.4.0

Compile, Launch and Output

Here are my build commands:

$ cmake -C ../cmake/presets/most.cmake -C ../cmake/presets/nolib.cmake -D PKG_GPU=on -D GPU_API=cuda -D LAMMPS_MACHINE=pppm_gpu ../cmake
$ cmake --build .

And here is my launch command:

mpirun -n 2 ../../build_gpu/lmp_pppm_gpu -sf gpu -in in.spce-bulk-nvt

And this is my input file in.spce-bulk-nvt:

# SPC/E water box bulk
log PPPM500GPU2-2.out

package gpu 2 device_type nvidiagpu omp 2

units		real	
atom_style	full
read_data	equi_bulk.4000000.data
group O type 1
group H type 2
group water   type 1:2:1

replicate	1 1 1

pair_style	lj/cut/coul/long 10.0 10.0

kspace_style pppm/gpu 0.071

#kspace_style pppm 0.071

pair_coeff	1 1 0.1556 3.166
pair_coeff	* 2 0.0000 0.0000	
bond_style	harmonic
angle_style	harmonic
bond_coeff	1 1000.00 1.000
angle_coeff	1 100.0 109.47
special_bonds   lj/coul 0 0 0.5
neighbor        2.0 bin
neigh_modify	every 10 delay 10 check yes one 5000 

thermo_style custom step etotal temp
thermo_modify    line one 
thermo	100
#===================================================
fix	1 water shake 0.0001 5000 0 b 1 a 1
fix	2 water nvt temp 298 298 5
timestep	1

run 100000
write_data equi_bulk.*.data nocoeff

And here is the initialization output (I printed additional variables, including NodeRank, Proc(s) per node, FirstGPU, LastGPU, and Devices Number):

PPPM initialization ...
  using 12-bit tables for long-range coulomb (src/kspace.cpp:342)
  G vector (1/distance) = 0.04463336
  grid = 1 1 1
  stencil order = 5
  estimated absolute RMS force accuracy = 27.659355
  estimated relative force accuracy = 0.083295326
  using double precision KISS FFT
  3d grid and FFT values/proc = 216 1

--------------------------------------------------------------------------
- NodeRank: 0
- Proc(s) per node = 1
- FirstGPU: 0
- LastGPU: 0
- Devices Number: 1
- Using acceleration for pppm:
-  with 1 proc(s) per device.
-  with 2 thread(s) per proc.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA A100-SXM4-40GB, 108 CUs, 39/39 GB, 1.4 GHZ (Mixed Precision)
--------------------------------------------------------------------------

My Analysis

I tried to analyze this problem from three aspects:

  1. Hardware Platform
  2. Software Dependencies
  3. Launch Commands

Below I explain my analysis of each of these three aspects.

Hardware

I printed the GPUs’ device information with a simple CUDA program, and all 4 GPUs were detected correctly.
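
A minimal sketch of that device-query program (simplified for this post, so the exact code differs; it only uses the CUDA runtime API and can be compiled with nvcc) looks like this:

// gpu_query.cu -- minimal sketch of the device-detection check described above
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int ndev = 0;
  cudaError_t err = cudaGetDeviceCount(&ndev);
  if (err != cudaSuccess) {
    std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("Detected %d CUDA device(s)\n", ndev);
  for (int i = 0; i < ndev; ++i) {
    // Print a short summary for each device: name, SM count, total memory.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    std::printf("  device %d: %s, %d SMs, %.1f GB\n", i, prop.name,
                prop.multiProcessorCount,
                prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
  }
  return 0;
}

On my server this reports all 4 A100 devices, consistent with nvidia-smi.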

What’s more, I can also change the device ID by using the command:

package gpu 2 gpuID 1

In this case, the program uses device 1 to run the kernels.

Software Dependencies

I don’t have a good way to verify whether the software dependencies are correct. However, I remember that about a month ago, when I last used these dependencies, the GPU version of PPPM could correctly utilize multiple GPUs.

Launch Commands

After all the above efforts, I believe I need to examine the device code to understand how the GPU devices are allocated. The relevant file is lib/gpu/lal_device.cpp.

/**** Code from lib/gpu/lal_device.cpp ****/

// Get the rank/size within the world
MPI_Comm_rank(_comm_world,&_world_me);
MPI_Comm_size(_comm_world,&_world_size);
// Get the rank/size within the replica
MPI_Comm_rank(_comm_replica,&_replica_me);
MPI_Comm_size(_comm_replica,&_replica_size);

// Get the names of all nodes
int name_length;
char node_name[MPI_MAX_PROCESSOR_NAME];
auto node_names = new char[MPI_MAX_PROCESSOR_NAME*_world_size];
MPI_Get_processor_name(node_name,&name_length);
MPI_Allgather(&node_name,MPI_MAX_PROCESSOR_NAME,MPI_CHAR,&node_names[0],
              MPI_MAX_PROCESSOR_NAME,MPI_CHAR,_comm_world);
std::string node_string=std::string(node_name);
// Get the number of procs per node
std::map<std::string,int> name_map;
std::map<std::string,int>::iterator np;
for (int i=0; i<_world_size; i++) {
  std::string i_string=std::string(&node_names[i*MPI_MAX_PROCESSOR_NAME]);
  np=name_map.find(i_string);
  if (np==name_map.end())
    name_map[i_string]=1;
  else
    np->second++;
}
int procs_per_node=name_map.begin()->second;
_procs_per_node = procs_per_node;

// Assign a unique id to each node
int split_num=0, split_id=0;
for (np=name_map.begin(); np!=name_map.end(); ++np) {
  if (np->first==node_string)
    split_id=split_num;
  split_num++;
}
delete[] node_names;

// Set up a per node communicator and find rank within
MPI_Comm node_comm;
MPI_Comm_split(_comm_world, split_id, 0, &node_comm);
int node_rank;
MPI_Comm_rank(node_comm,&node_rank);
// ------------------- Device selection parameters----------------------

if (ndevices > procs_per_node)
  ndevices = procs_per_node; 

After reading through the code and printing key variables, I’ve identified two problems that I believe are crucial to the entire issue.

  1. The MPI world rank, _world_me, is always 0, which means all MPI processes use the same GPU device.
  2. The variable procs_per_node, which represents the number of procs per node, is always 1, so the program uses only one GPU regardless of the number specified in the input command.

I have no idea how to solve these problems. I’ve tried many different mpirun -n and -ppn combinations, but I’m still stuck. Could someone please offer guidance on my issue? Any suggestions, explanations, or examples would be greatly appreciated. Thank you in advance for your help!

There is little acceleration to be had for PPPM in the first place. When running with multiple MPI ranks, it is usually more efficient to run PPPM on the CPU(!) and only accelerate the pair style, while tuning the Coulomb cutoff for best performance (the same can be done with CPU-only runs), so that the pair style and the rest run concurrently on the GPU and CPU in the most effective way.

There is no evidence of that here, e.g. the output from nvidia-smi while you are running your job. Please note that your description is ambiguous as it does not explain whether you are intending to use multiple GPUs at all or multiple GPUs per MPI task. The latter is impossible, instead you can increase the number of MPI tasks. In fact, you could even use between 2 and 8 MPI tasks per GPU for increased utilization and better parallelization of the non-accelerated parts of the system.

Please also note that the amount of acceleration achieved is strongly dependent on the details of the system. Please see 7.4.1. GPU package — LAMMPS documentation

The “device_type” keyword only applies to OpenCL and has no effect when using CUDA.

With this command you are signaling to LAMMPS that there are only 2 GPUs per node that may be used. I.e. even with 4 MPI processes only the first 2 GPUs will be accessed.

Your analysis is incorrect.

Thank you for your quick response! I appreciate your help.

Thanks for telling me this. I just want to explore how the program’s performance can be improved when using the GPU version of PPPM. Therefore, my first step is to run the GPU version of PPPM in a multi-GPU environment.

That was my mistake. I will give more details later.

I have tried using multiple GPUs collectively and assigning multiple GPUs per MPI task.

I tried to launch two MPI processes and allocate two GPUs to each node, so that 4 GPUs would be used in total. The launch commands are below:

mpirun -n 2 -ppn 1 ../../build_gpu/lmp_pppm_gpu -sf gpu  -in in.spce-bulk-nvt
package gpu 2 omp 2 

The state of the devices, as printed by nvidia-smi, is as follows: only device 0 is in use. However, if the GPUs were allocated correctly, the utilization of all 4 GPUs would be around 99%, meaning device 0 and device 1 would be allocated to Node 1 and device 2 and device 3 would be allocated to Node 2. Each node has only one MPI process according to mpirun’s -ppn 1 setting. This indicates that I am unable to use multiple GPUs at all, or to assign multiple GPUs per MPI process.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                    0 |
| N/A   41C    P0              83W / 400W |    932MiB / 40960MiB |     99%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:4B:00.0 Off |                    0 |
| N/A   39C    P0              53W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   39C    P0              53W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:E3:00.0 Off |                    0 |
| N/A   40C    P0              61W / 400W |      8MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    376203      C   ../../build_gpu/lmp_pppm_gpu             456MiB |
|    0   N/A  N/A    376204      C   ../../build_gpu/lmp_pppm_gpu             456MiB |
+---------------------------------------------------------------------------------------+

As I already mentioned, this is NOT possible. You must have at least one MPI task per GPU. This is a good thing with the GPU package. Since only the parts that are well accelerated are ported to the GPU (i.e. primarily pair styles), you need to use MPI parallelization to speed up the rest. For your machine with so many CPU cores per node, you should use many of them, and MPI parallelization is the most effective parallelization in LAMMPS anyway. Thus for 4 GPUs you need to use at least 4 MPI tasks and should not use the “package” command (i.e. use the default settings). The number of GPUs given to the package command is the total number of GPUs per node.

This is all explained in detail in the documentation that I have pointed out.

This explanation is wrong and in contradiction with the LAMMPS manual.

This is correct and consistent with your request. Again, you have misunderstood the documentation.

You have only requested one GPU per node and you cannot use multiple GPUs per MPI process.

Thank you for your reply. I acknowledge that it was my mistake to misunderstand the documentation. Despite following your instructions and consulting the documentation, I am still confused about the program’s GPU allocation strategy.

I totally understand this rule now.

In my opinion, the mpirun command will launch processes on 2 nodes, and each node has 2 GPUs. Although there is only 1 MPI process per node, processes on different nodes should be allocated to different GPUs, e.g., process 0 to GPU 0 and process 1 to GPU 2.

And I ran the example commands from the documentation (7.4.1. GPU package — LAMMPS documentation).

The initial section of the documentation is as follows:

Use the “-sf gpu” command-line switch, which will automatically append “gpu” to styles that support it. Use the “-pk gpu Ng” command-line switch to set Ng = # of GPUs/node to use. If Ng is 0, the number is selected automatically as the number of matching GPUs that have the highest number of compute cores.

lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU 
mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node 
mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes

Note that if the “-sf gpu” switch is used, it also issues a default package gpu 0 command, which will result in automatic selection of the number of GPUs to use.

To fully use 4 GPUs, I tried letting 32 MPI processes share 4 GPUs on a 64-core node. According to the documentation, there should then be 8 MPI processes running on each GPU. The command below is almost the same as the commands in the documentation.

mpirun -np 32 ../../build_gpu/lmp_pppm_gpu -sf gpu -pk gpu 4 -in in.spce-bulk-nvt

However, nvidia-smi still shows that only device 0 is in use. In fact, I have tried many combinations of the total number of MPI tasks (e.g., 4, 16, 32, 64), the number of MPI tasks per node (with 1, 2, or 4 nodes), and the number of GPUs per node (e.g., 1, 2, 4). However, all of them utilize only one GPU for the entire program. I’m really confused by this problem.

Thank you once again for your prompt response! As a beginner in LAMMPS, your assistance has been greatly appreciated.

This is now reaching the point where the details that matter are specific to individual machines.
You need to work this out with the people managing your local machine. The only thing that I can do at this point is to give a demonstration of what does work.
I am running the “in.rhodo.scaled” input from the LAMMPS bench folder without modifications; it also uses the “data.rhodo” file. I have copied both into my LAMMPS build folder. I have built LAMMPS with CMake using:

cmake -S cmake -C cmake/presets/basic.cmake -D PKG_GPU=yes -D GPU_API=CUDA -D PKG_OPENMP=yes -B build-gpu

After compiling I am running from the “build-gpu” folder.

Our local cluster (well, one of them) has GPU nodes with two NVIDIA A100 GPUs with 80GB RAM and two AMD EPYC 7343 16-core processors. I am submitting a job running across 2 nodes, thus with a total of 4 MPI tasks, 4 GPUs, and 4 OpenMP threads per MPI task. For comparison, I am also running a CPU-only job with 64 MPI tasks in total and no OpenMP. While running the GPU-accelerated job, I also run nvidia-smi 5 seconds into the run. To benefit from those GPUs, I have to “blow up” the job to have enough work; this replicates the 32,000-atom system 4x4x4 = 64 times. Here are the submit scripts with the command lines:
GPU:

#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --time=1:00:00
#SBATCH --job-name=gpu-bench
#SBATCH --ntasks=4 --nodes=2
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-gpu=4

module load cuda
module load mpi/openmpi

cd ${SLURM_SUBMIT_DIR}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_GPU}

mpirun -n ${SLURM_NTASKS} --bind-to socket ./lmp -in in.rhodo.scaled -sf hybrid gpu omp -v x 4 -v y 4 -v z 4 -pk gpu 0 pair/only no &
sleep 5
nvidia-smi
wait

CPU:

#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --time=1:00:00
#SBATCH --job-name=gpu-bench
#SBATCH --ntasks=64 --nodes=2
#SBATCH --cpus-per-task=1

module load cuda
module load mpi/openmpi

cd ${SLURM_SUBMIT_DIR}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

mpirun -n ${SLURM_NTASKS} --bind-to core ./lmp -in in.rhodo.scaled -sf omp -v x 4 -v y 4 -v z 4

The output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           On | 00000000:21:00.0 Off |                    0 |
| N/A   32C    P0               63W / 300W|   4037MiB / 81920MiB |     17%   E. Process |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           On | 00000000:81:00.0 Off |                    0 |
| N/A   30C    P0               63W / 300W|   4037MiB / 81920MiB |     19%   E. Process |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   4019276      C   ./lmp                                      4034MiB |
|    1   N/A  N/A   4019277      C   ./lmp                                      4034MiB |
+---------------------------------------------------------------------------------------+

So both GPUs are used and each is attached to a different MPI process. As can also be seen, both GPUs are operated in “process exclusive mode”, i.e. I cannot attach multiple MPI ranks to a GPU.

The GPU initialization messages:

--------------------------------------------------------------------------
- Using acceleration for pppm:
-  with 1 proc(s) per device.
-  with 4 thread(s) per proc.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA A100 80GB PCIe, 108 CUs, 79/79 GB, 1.4 GHZ (Mixed Precision)
Device 1: NVIDIA A100 80GB PCIe, 108 CUs, 1.4 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Devices 0-1 on core 0...Done.


--------------------------------------------------------------------------
- Using acceleration for lj/charmm/coul/long:
-  with 1 proc(s) per device.
-  with 4 thread(s) per proc.
-  Horizontal vector operations: ENABLED
-  Shared memory system: No
--------------------------------------------------------------------------
Device 0: NVIDIA A100 80GB PCIe, 108 CUs, 79/79 GB, 1.4 GHZ (Mixed Precision)
Device 1: NVIDIA A100 80GB PCIe, 108 CUs, 1.4 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Devices 0-1 on core 0...Done.

The GPU job performance summary:

Loop time of 16.894 on 16 procs for 100 steps with 2048000 atoms

Performance: 1.023 ns/day, 23.464 hours/ns, 5.919 timesteps/s, 12.123 Matom-step/s
262.4% CPU use with 4 MPI tasks x 4 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 1.7311     | 1.8006     | 1.8902     |   4.3 | 10.66
Bond    | 2.4294     | 2.5375     | 2.6211     |   5.1 | 15.02
Kspace  | 3.8523     | 3.9901     | 4.155      |   6.8 | 23.62
Neigh   | 0.011163   | 0.014985   | 0.02307    |   3.9 |  0.09
Comm    | 1.1418     | 1.1696     | 1.1859     |   1.7 |  6.92
Output  | 0.0033298  | 0.0034808  | 0.0036718  |   0.2 |  0.02
Modify  | 7.1126     | 7.2247     | 7.3341     |   3.6 | 42.77
Other   |            | 0.153      |            |       |  0.91

The CPU performance summary:

Loop time of 25.1528 on 64 procs for 100 steps with 2048000 atoms

Performance: 0.687 ns/day, 34.934 hours/ns, 3.976 timesteps/s, 8.142 Matom-step/s
97.6% CPU use with 64 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 12.386     | 12.48      | 12.833     |   2.0 | 49.62
Bond    | 0.51685    | 0.52275    | 0.53133    |   0.4 |  2.08
Kspace  | 6.396      | 6.7365     | 6.8407     |   2.7 | 26.78
Neigh   | 2.6529     | 2.6596     | 2.6663     |   0.2 | 10.57
Comm    | 1.287      | 1.357      | 1.4407     |   2.9 |  5.40
Output  | 0.00064878 | 0.00066189 | 0.00077863 |   0.0 |  0.00
Modify  | 1.2369     | 1.3631     | 1.4313     |   4.0 |  5.42
Other   |            | 0.03276    |            |       |  0.13

In summary, everything works as documented when using the correct settings for the cluster at hand.

P.S.: For the sake of completeness: for this particular combination of hardware and problem type/size, the KOKKOS package is a bit faster:

Loop time of 11.7637 on 16 procs for 100 steps with 2048000 atoms

Performance: 1.469 ns/day, 16.339 hours/ns, 8.501 timesteps/s, 17.409 Matom-step/s
84.6% CPU use with 4 MPI tasks x 4 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 2.9709     | 3.0034     | 3.0314     |   1.4 | 25.53
Bond    | 0.081167   | 0.082019   | 0.082904   |   0.3 |  0.70
Kspace  | 4.921      | 4.9335     | 4.9493     |   0.5 | 41.94
Neigh   | 0.95663    | 0.9583     | 0.96       |   0.2 |  8.15
Comm    | 1.4384     | 1.4578     | 1.4788     |   1.3 | 12.39
Output  | 0.00020512 | 0.00033938 | 0.00039998 |   0.0 |  0.00
Modify  | 1.24       | 1.2479     | 1.256      |   0.6 | 10.61
Other   |            | 0.08035    |            |       |  0.68

The submit script in this case was:

#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --time=1:00:00
#SBATCH --job-name=gpu-bench
#SBATCH --ntasks=4 --nodes=2
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-gpu=4

module load cuda
module load mpi/openmpi

cd ${SLURM_SUBMIT_DIR}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_GPU}
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

mpirun -n ${SLURM_NTASKS} --bind-to socket --map-by socket ./lmp -in in.rhodo.scaled -k on g 2 t 4 -sf kk -v x 4 -v y 4 -v z 4  -pk kokkos newton on neigh half

Thank you! I will try it.

I tried it again, and I found that the key problem is the C++ compiler. I tried two different kinds of Intel compilers: icpc and icpx. Here are my two CMake presets.

# Preset Name: icpc_gpu.cmake
# preset that turns on a wide range of packages, some of which require
# external libraries. Compared to all_on.cmake some more unusual packages
# are removed. The resulting binary should be able to run most inputs.

set(ALL_PACKAGES
  BODY
  GPU
  KSPACE
  MANYBODY
  MOLECULE
  RIGID)

foreach(PKG ${ALL_PACKAGES})
  set(PKG_${PKG} ON CACHE BOOL "" FORCE)
endforeach()

set(BUILD_TOOLS ON CACHE BOOL "" FORCE)
set(GPU_API "cuda" CACHE STRING "" FORCE)
set(GPU_ARCH "sm90" CACHE STRING "" FORCE)


# preset that will enable the classic Intel compilers with support for MPI and OpenMP (on Linux boxes)

set(CMAKE_CXX_COMPILER "icpc" CACHE STRING "" FORCE)
set(CMAKE_C_COMPILER "icc" CACHE STRING "" FORCE)
set(CMAKE_Fortran_COMPILER "ifort" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS  "-mavx512f" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)

set(MPI_CXX "icpc" CACHE STRING "" FORCE)
set(MPI_CXX_COMPILER "mpicxx" CACHE STRING "" FORCE)

unset(HAVE_OMP_H_INCLUDE CACHE)
set(OpenMP_C "icc" CACHE STRING "" FORCE)
set(OpenMP_C_FLAGS "-qopenmp -qopenmp-simd" CACHE STRING "" FORCE)
set(OpenMP_C_LIB_NAMES "omp" CACHE STRING "" FORCE)
set(OpenMP_CXX "icpc" CACHE STRING "" FORCE)
set(OpenMP_CXX_FLAGS "-qopenmp -qopenmp-simd" CACHE STRING "" FORCE)
set(OpenMP_CXX_LIB_NAMES "omp" CACHE STRING "" FORCE)
set(OpenMP_Fortran_FLAGS "-qopenmp -qopenmp-simd" CACHE STRING "" FORCE)
set(OpenMP_omp_LIBRARY "libiomp5.so" CACHE PATH "" FORCE)



# Preset Name: icpx_gpu.cmake
# preset that turns on a wide range of packages, some of which require
# external libraries. Compared to all_on.cmake some more unusual packages
# are removed. The resulting binary should be able to run most inputs.

set(ALL_PACKAGES
  BODY
  GPU
  KSPACE
  MANYBODY
  MOLECULE
  RIGID)

foreach(PKG ${ALL_PACKAGES})
  set(PKG_${PKG} ON CACHE BOOL "" FORCE)
endforeach()

set(BUILD_TOOLS ON CACHE BOOL "" FORCE)
set(GPU_API "cuda" CACHE STRING "" FORCE)
set(GPU_ARCH "sm90" CACHE STRING "" FORCE)

# preset that will enable the LLVM based Intel compilers with support for MPI and OpenMP and Fortran (on Linux boxes)

set(CMAKE_CXX_COMPILER "icpx" CACHE STRING "" FORCE)
set(CMAKE_C_COMPILER "icx" CACHE STRING "" FORCE)
set(CMAKE_Fortran_COMPILER "ifx" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS  "-mavx512f" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)

set(MPI_CXX "icpx" CACHE STRING "" FORCE)
set(MPI_CXX_COMPILER "mpicxx" CACHE STRING "" FORCE)

unset(HAVE_OMP_H_INCLUDE CACHE)
set(OpenMP_C "icx" CACHE STRING "" FORCE)
set(OpenMP_C_FLAGS "-fopenmp" CACHE STRING "" FORCE)
set(OpenMP_C_LIB_NAMES "omp" CACHE STRING "" FORCE)
set(OpenMP_CXX "icpx" CACHE STRING "" FORCE)
set(OpenMP_CXX_FLAGS "-fopenmp" CACHE STRING "" FORCE)
set(OpenMP_CXX_LIB_NAMES "omp" CACHE STRING "" FORCE)
set(OpenMP_Fortran_FLAGS "-fopenmp" CACHE STRING "" FORCE)
set(OpenMP_omp_LIBRARY "libiomp5.so" CACHE PATH "" FORCE)

Using the above two presets, I generated the executable files lmp_icpc_gpu and lmp_icpx_gpu with the commands:

cmake -C ../cmake/presets/icpc_gpu.cmake -D LAMMPS_MACHINE=icpc_gpu ../cmake
cmake --build .

and

cmake -C ../cmake/presets/icpx_gpu.cmake -D LAMMPS_MACHINE=icpx_gpu ../cmake
cmake --build .

I launch the programs with these commands:

mpirun -n 4 icpc_gpu -sf gpu  -i ./in.spce-bulk-nvt

and

mpirun -n 4 icpx_gpu -sf gpu  -i ./in.spce-bulk-nvt

The in.spce-bulk-nvt is:

# SPC/E water box bulk
log PPPM500GPU4-OneAPI.out

package gpu 4 omp 2

units		real	
atom_style	full
read_data	equi_bulk.4000000.data
group O type 1
group H type 2
group water   type 1:2:1

replicate	1 1 1
 
pair_style	lj/cut/coul/long 10.0 10.0

kspace_style pppm/gpu 0.071

# kspace_style pppm 0.071

pair_coeff	1 1 0.1556 3.166
pair_coeff	* 2 0.0000 0.0000	
bond_style	harmonic
angle_style	harmonic
bond_coeff	1 1000.00 1.000
angle_coeff	1 100.0 109.47
special_bonds   lj/coul 0 0 0.5
neighbor        2.0 bin
neigh_modify	every 10 delay 10 check yes one 5000 
#===================================================
thermo_style custom step etotal temp
thermo_modify    line one 
thermo	100
#===================================================
fix	1 water shake 0.0001 5000 0 b 1 a 1
fix	2 water nvt temp 298 298 5
timestep	1

run 100000
write_data equi_bulk.*.data nocoeff

The result is that the program lmp_icpc_gpu can use multiple GPUs correctly, while the program lmp_icpx_gpu can only use one GPU.

icpc result:

icpx result:

It’s really confusing.


I have printed some information in the GPU allocation part of the code.

//file: lib/gpu/lal_device.cpp
//line: starting from line 110

// Get the rank/size within the world
MPI_Comm_rank(_comm_world,&_world_me);
MPI_Comm_size(_comm_world,&_world_size);
// Get the rank/size within the replica
MPI_Comm_rank(_comm_replica,&_replica_me);
MPI_Comm_size(_comm_replica,&_replica_size);

 // I added this to print the MPI environment.
printf("MPI: _world_me=%d _world_size=%d _replica_me=%d _replica_size=%d\n",_world_me,_world_size,_replica_me,_replica_size);

The combined output of lmp_icpc_gpu from all MPI processes is:

MPI: _world_me=0 _world_size=4 _replica_me=0 _replica_size=4 
MPI: _world_me=1 _world_size=4 _replica_me=1 _replica_size=4 
MPI: _world_me=2 _world_size=4 _replica_me=2 _replica_size=4 
MPI: _world_me=3 _world_size=4 _replica_me=3 _replica_size=4 


The output of lmp_icpx_gpu is the same for all MPI processes:

MPI: _world_me=0 _world_size=1 _replica_me=0 _replica_size=1
MPI: _world_me=0 _world_size=1 _replica_me=0 _replica_size=1

I disagree with that assessment. To me it seems the problem is the MPI library support. Please have a careful look at your CMake command outputs to see if it has correctly detected and compiled with MPI support and which library it detected. You can also get MPI version info if you run LAMMPS with the -h flag.

You don’t really need to add the output you did since the log file already contains the same information.
In one case you see: 1 by 1 by 1 MPI processor grid while the other will likely have 1 by 2 by 2 MPI processor grid or some similar distribution.

That means in one case your executable either does not support MPI at all or was compiled with an MPI library different from the one providing the mpirun command. As a consequence, you will be running multiple copies of a serial executable, and those will all pick GPU 0 as the one to be used, since they are all MPI rank 0; that is also why they all produce the exact same output for your MPI checking print statement.

In the case of the other executable, MPI is initialized correctly and you have 4 MPI ranks and then the GPU library will assign GPU 0 to MPI rank 0, GPU 1 to MPI rank 1, and so on.
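
A quick standalone way to verify this is a minimal MPI test program (a sketch; the file name mpi_check.cpp is just an example), compiled with the same MPI wrapper as the LAMMPS build and launched with the same mpirun. A correctly linked binary prints distinct ranks, while the broken combination described above prints "rank 0 of 1" from every copy:

// mpi_check.cpp -- minimal sketch to verify that an executable is linked
// against the same MPI library that provides the mpirun command
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Also report the host name, to see how ranks are distributed across nodes.
  char node[MPI_MAX_PROCESSOR_NAME];
  int len = 0;
  MPI_Get_processor_name(node, &len);

  std::printf("rank %d of %d on %s\n", rank, size, node);

  MPI_Finalize();
  return 0;
}

Compile it, e.g., with mpicxx mpi_check.cpp -o mpi_check (or whichever wrapper was used to build LAMMPS) and run it with mpirun -n 4 ./mpi_check.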

Thank you for your reply! You’re right — I hadn’t set the MPI compiler correctly.

The correct MPI compiler wrapper for icpx is mpiicpc, not mpicxx. This may also indicate an error in the cmake/presets/oneapi.cmake file.

The right version is:

set(MPI_CXX "icpx" CACHE STRING "" FORCE)
set(MPI_CXX_COMPILER "mpiicpc" CACHE STRING "" FORCE) # right version

No. These settings are highly system-specific and thus must be customized. What you suggest is definitely inconsistent (it suggests using icpc instead of icpx), but it works just by chance due to the way CMake treats MPI libraries: it does not use the corresponding MPI compiler wrapper but only queries the wrapper for the location of the MPI libraries and headers.

That said, if you have trouble using MPI and GPUs correctly, it is always a very bad idea to use the Intel compilers; rather, stick to the GNU compilers. There is no significant performance difference, and the GNU compilers are much better tested and supported on Linux platforms and thus much more reliable. If you are using GPU acceleration, the host compiler matters even less, and using the Intel compilers messes things up even more.


Agreed: typically use GNU compilers with NVIDIA CUDA, not the Intel compilers.


Thirded; from painful personal experience, the Intel compilers don’t play well with CUDA.


Just to be explicit about this, the official oneAPI documentation states that mpiicpc should be used with icpc and mpiicpx should be used with icpx.

Furthermore, you can actually read the source of all the MPI wrappers you are worried about, which are really just simple (but longwinded and carefully written) shell scripts. In fact mpiicpx does nothing besides call mpiicpc -cxx=icpx and disallow any other C++ compiler, which is a hilariously small functionality addition for an entirely new “wrapper”.
