Dear LAMMPS users,
I installed LAMMPS (2 Aug 2023 version) using CMake on a local cluster. The installation was successful, and ./nvc_get_devices reports:
Found 1 platform(s).
CUDA Driver Version: 12.0
Device 0: "NVIDIA GeForce RTX 4090"
Type of device: GPU
Compute capability: 8.9
Double precision support: Yes
Total amount of global memory: 23.6496 GB
Number of compute units/multiprocessors: 128
Number of cores: 24576
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 2.535 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive
Concurrent kernel execution: Yes
Device has ECC support enabled: No
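For reference, I configured the build along these lines (a representative sketch, not my exact command; the paths and the extra packages may differ on your system):

# run from the LAMMPS source tree; GPU_ARCH=sm_89 matches the card's
# compute capability 8.9, and the GPU precision defaults to mixed
mkdir build && cd build
cmake ../cmake \
  -D BUILD_MPI=on \
  -D PKG_GPU=on \
  -D GPU_API=cuda \
  -D GPU_ARCH=sm_89 \
  -D PKG_KSPACE=on \
  -D PKG_MOLECULE=on
cmake --build . -j 8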
I submitted some jobs from the bench folder with the following terminal command:
mpirun --np 1 .../build/lmp -sf gpu -pk gpu 0 -in in.lj
with output:
LAMMPS (2 Aug 2023 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
1 by 1 by 1 MPI processor grid
Created 32000 atoms
using lattice units in orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
create_atoms CPU = 0.002 seconds
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
* GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
The log file lists these citations in BibTeX format.
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
---
* Using acceleration for lj/cut:
* with 1 proc(s) per device.
* with 1 thread(s) per proc.
* Horizontal vector operations: ENABLED
* Shared memory system: No
---
Device 0: NVIDIA GeForce RTX 4090, 128 CUs, 23/24 GB, 2.5 GHZ (Mixed Precision)
Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Setting up Verlet run …
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 9.491 | 9.491 | 9.491 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733683 0 -4.6134358 -5.019707
100 0.75745333 -5.7585059 0 -4.6223614 0.20726081
Loop time of 0.0440848 on 1 procs for 100 steps with 32000 atoms
Performance: 979929.146 tau/day, 2268.355 timesteps/s, 72.587 Matom-step/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
Pair | 0.024244 | 0.024244 | 0.024244 | 0.0 | 54.99
Neigh | 2.6e-07 | 2.6e-07 | 2.6e-07 | 0.0 | 0.00
Comm | 0.0089505 | 0.0089505 | 0.0089505 | 0.0 | 20.30
Output | 9.178e-05 | 9.178e-05 | 9.178e-05 | 0.0 | 0.21
Modify | 0.0053737 | 0.0053737 | 0.0053737 | 0.0 | 12.19
Other | | 0.005424 | | | 12.30
Nlocal: 32000 ave 32000 max 32000 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 19657 ave 19657 max 19657 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 5
Dangerous builds not checked
---
Device Time Info (average):
---
Data Transfer: 0.0133 s.
Neighbor copy: 0.0001 s.
Neighbor build: 0.0005 s.
Force calc: 0.0043 s.
Device Overhead: 0.0031 s.
Average split: 1.0000.
Lanes / atom: 4.
Vector width: 32.
Prefetch mode: None.
Max Mem / Proc: 26.88 MB.
CPU Neighbor: 0.0021 s.
CPU Cast/Pack: 0.0093 s.
CPU Driver_Time: 0.0014 s.
CPU Idle_Time: 0.0122 s.
Total wall time: 0:00:00
But when I submit the job with 2 or more MPI tasks, it fails with the following error:
mpirun --np 2 .../build/lmp -sf gpu -pk gpu 0 -in in.lj
LAMMPS (2 Aug 2023 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
ERROR: Unable to initialize accelerator for use (src/GPU/gpu_extra.h:65)
Last command: package gpu 0
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 429.
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 430.
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 429.
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 430.
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[48720,1],1]
Exit code: 1
The output of nvcc --version is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
My aim in using multiple MPI tasks is to speed up the k-space calculation for a TIP4P water model. The log file shows that with 1 MPI task plus the GPU, the k-space calculation takes most of the time (since pppm/tip4p has no GPU-accelerated variant); a sketch of the intended setup follows the timing breakdown below:
Performance: 29.764 ns/day, 0.806 hours/ns, 172.246 timesteps/s, 1.413 Matom-step/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
Pair | 0.81155 | 0.81155 | 0.81155 | 0.0 | 2.80
Bond | 0.00054313 | 0.00054313 | 0.00054313 | 0.0 | 0.00
Kspace | 25.872 | 25.872 | 25.872 | 0.0 | 89.13
Neigh | 0.0012502 | 0.0012502 | 0.0012502 | 0.0 | 0.00
Comm | 0.33599 | 0.33599 | 0.33599 | 0.0 | 1.16
Output | 0.040797 | 0.040797 | 0.040797 | 0.0 | 0.14
Modify | 1.8704 | 1.8704 | 1.8704 | 0.0 | 6.44
Other | | 0.09564 | | | 0.33
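For context, the intended production run looks roughly like this (a hypothetical input fragment, not my actual script; the atom/bond/angle type numbers and the qdist value are placeholders for a TIP4P/2005-style model):

# in.tip4p (hypothetical): the pair style runs on the GPU via -sf gpu,
# while kspace stays on the CPU because pppm/tip4p has no /gpu variant
units real
atom_style full
pair_style lj/cut/tip4p/long 1 2 1 1 0.1546 12.0
kspace_style pppm/tip4p 1.0e-4

# intended launch: several MPI ranks sharing the one GPU, e.g.
# mpirun --np 8 .../build/lmp -sf gpu -pk gpu 0 -in in.tip4p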
Thanks in advance.