How to Run LAMMPS on Multiple Processors with GPU Acceleration

I have a workstation with two processors, each with 10 cores. I want to run LAMMPS on all 20 cores using mpiexec with GPU acceleration, but the run fails with the following error:

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Initializing Device 0 on core 1...Done.
Initializing Device 0 on core 2...Done.
Initializing Device 0 on core 3...Done.
Initializing Device 0 on core 4...Done.
Initializing Device 0 on core 5...Done.
Initializing Device 0 on core 6...Done.
Initializing Device 0 on core 7...Done.
Initializing Device 0 on core 8...Done.
Initializing Device 0 on core 9...Done.
Initializing Device 0 on core 10...Done.
Initializing Device 0 on core 11...Done.
Initializing Device 0 on core 12...Done.
Initializing Device 0 on core 13...Done.
Initializing Device 0 on core 14...Done.
Initializing Device 0 on core 15...Done.
Initializing Device 0 on core 16...Done.
Initializing Device 0 on core 17...Done.
Initializing Device 0 on core 18...Done.
Initializing Device 0 on core 19...
job aborted:
[ranks] message

[0-3] terminated

[4] application aborted
aborting MPI_COMM_WORLD (comm=0x44000000), error -1, comm rank 4

[5-19] terminated

---- error analysis -----

[4] on WS10
LAMMPS aborted the job. Abort code -1

I run LAMMPS using the following command:

mpiexec -n 20 lmp -sf gpu -pk gpu 1 -in .\in.Peierls_StrainRate

It is worth noting that I can successfully run LAMMPS on a single processor with 10 cores. My operating system is Windows 10.

Any assistance or guidance on this issue would be greatly appreciated.

There is a lot of necessary information missing here:

  • What is your exact version of LAMMPS?
  • How did you compile/install it?
  • You wrote you are running on Windows 10. Are you using a Windows native executable or the Windows subsystem for Linux (WSL)?
  • What is your CPU and GPU hardware?
  • For the GPU what is the output of either ocl_get_devices or nvc_get_devices?
  • How much RAM do they have available?
  • Does the crash on the GPU only happen with your specific input, or also with the LAMMPS benchmark inputs and other examples shipped with LAMMPS?
  • Does the crash also happen when using only 1 MPI rank and the GPU?
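The single-rank test in the last question can be sketched like this (assuming the Windows `lmp` executable is on your PATH and one of the benchmark inputs shipped with LAMMPS, e.g. `in.lj` from the `bench` directory, has been copied into the working directory):

```shell
# Run a stock benchmark input on 1 MPI rank with the GPU package enabled.
# If this also crashes, the problem is not specific to your input file.
mpiexec -n 1 lmp -sf gpu -pk gpu 1 -in in.lj
```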

Please also note that when you share a single GPU across 20(!) MPI ranks, there is very little GPU acceleration to be gained. It may even lead to a slowdown unless you have an extremely capable GPU (which is usually not built into workstations).

Thank you, Alex, for your reply.
I am adding the requested details below:

  1. LAMMPS version: LAMMPS (2 Aug 2023)
  2. I installed it using the Windows .exe installer for LAMMPS
  3. CPU: Intel Xeon Silver 4114 CPU @ 2.2 GHz, GPU : NVIDIA RTX 6000
  4. Output of ocl_get_devices:
PS E:\VRPHD\MD\NEW\EdgePeierls_StrainRate_112_60K> ocl_get_devices.exe
Found 4 platform(s).

Platform 0:

Device 0: "Quadro P6000"
  Type of device:                                GPU
  Supported OpenCL Version:                      3.0
  Is a subdevice:                                No
  Double precision support:                      Yes
  Total amount of global memory:                 24 GB
  Number of compute units/multiprocessors:       30
  Total amount of constant memory:               65536 bytes     
  Total amount of local/shared memory per block: 49152 bytes     
  Maximum group size (# of threads per block)    1024
  Maximum item sizes (# threads for each dim)    1024 x 1024 x 64
  Clock rate:                                    1.645 GHz       
  ECC support:                                   No
  Device fission into equal partitions:          No
  Device fission by counts:                      No
  Device fission by affinity:                    No
  Maximum subdevices from fission:               1
  Shared memory system:                          No
  Subgroup support:                              No
  Shuffle support:                               Yes

Platform 1:
There is no device supporting OpenCL

Platform 2:

Device 0: "Intel(R) FPGA Emulation Device"
  Type of device:                                ACCELERATOR
  Supported OpenCL Version:                      1.20
  Is a subdevice:                                No
  Double precision support:                      Yes
  Total amount of global memory:                 127.71 GB
  Number of compute units/multiprocessors:       20
  Total amount of constant memory:               131072 bytes
  Total amount of local/shared memory per block: 262144 bytes
  Maximum group size (# of threads per block)    67108864
  Maximum item sizes (# threads for each dim)    67108864 x 67108864 x 67108864
  Clock rate:                                    2.2 GHz
  ECC support:                                   No
  Device fission into equal partitions:          Yes
  Device fission by counts:                      Yes
  Device fission by affinity:                    No
  Maximum subdevices from fission:               20
  Shared memory system:                          Yes
  Subgroup support:                              No
  Shuffle support:                               No

Platform 3:

Device 0: "Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz"
  Type of device:                                CPU
  Supported OpenCL Version:                      3.0
  Is a subdevice:                                No
  Double precision support:                      Yes
  Total amount of global memory:                 127.71 GB
  Number of compute units/multiprocessors:       20
  Total amount of constant memory:               131072 bytes
  Total amount of local/shared memory per block: 32768 bytes
  Maximum group size (# of threads per block)    8192
  Maximum item sizes (# threads for each dim)    8192 x 8192 x 8192
  Clock rate:                                    2.2 GHz
  ECC support:                                   No
  Device fission into equal partitions:          Yes
  Device fission by counts:                      Yes
  Device fission by affinity:                    No
  Maximum subdevices from fission:               20
  Shared memory system:                          Yes
  Subgroup support:                              Yes
  Shuffle support:                               Yes
  5. Total RAM in my system: 128 GB; GPU memory: 24 GB.
     When I run on 10 processors with GPU acceleration, my CPU and RAM utilization are around 45-50%.
  6. No, it does not crash when I run my input with fewer than 10 cores, but it does crash when I run it with more than 10 cores.

Is it possible to run LAMMPS on my system with 20 cores and GPU acceleration?

Please make sure you are using the 2 Aug 2023 Update1 version.

How much speedup do you get with 10 MPI ranks when you compare a run with the GPU against a run without it?

Does this crash with more than 10 MPI ranks only happen for your input or also with the LAMMPS benchmark examples (e.g. in.lj, in.rhodo, in.eam)?
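One way to answer both questions is a pair of timed runs on the same benchmark input (again assuming `lmp` is on the PATH and a benchmark file such as `in.lj` is in the working directory):

```shell
# CPU-only baseline on 10 MPI ranks:
mpiexec -n 10 lmp -in in.lj
# Same input with GPU acceleration; compare the wall times LAMMPS reports:
mpiexec -n 10 lmp -sf gpu -pk gpu 1 -in in.lj
```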

As I already stated, it is unlikely that you will see much GPU acceleration if a single GPU has to be shared across 20 MPI ranks. The capability of the GPU does not multiply with oversubscription, and there is some overhead associated with accessing a GPU device from multiple processes. For example, if you have a speedup of 10x versus a single CPU core, then when using 10 CPU cores the speedup may drop to less than 2x.
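The arithmetic in that example can be made concrete with a back-of-the-envelope sketch (a toy calculation, not a performance model; `S` and `N` are illustrative numbers, and the GPU is assumed to be saturated so its runtime does not shrink as ranks are added):

```shell
# S = measured GPU speedup over 1 CPU core, N = CPU cores in the baseline.
# The effective speedup of the GPU run over an N-core CPU-only run is then
# roughly S/N, before any GPU-sharing overhead is subtracted.
awk -v S=10 -v N=10 'BEGIN { printf "effective speedup: %.1fx\n", S/N }'
```

With S=10 and N=10 this prints `effective speedup: 1.0x`, i.e. no net gain over the CPU-only run, consistent with the "less than 2x" estimate above.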