That is difficult to say. It depends on the kind of simulation, how large the problem is, and how frequently communication is required. You would have to provide some tangible data, e.g. by running the examples from the “bench” folder blown up to a representative size.
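If you need a ready-made test case, the bench inputs can be scaled from the command line. Here is a minimal sketch, assuming the unmodified in.lj benchmark input (which defines x/y/z index variables that multiply the default box size) and an executable in ../build; adjust paths and process counts to your setup:

```
# run the LJ benchmark on 8 MPI ranks with the box enlarged 4x in each direction
cd lammps/bench
mpirun -np 8 ../build/lmp -in in.lj -var x 4 -var y 4 -var z 4
```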
For a single node, the network performance is irrelevant. It is just the most common problem when people use clusters and get bad performance. I suggested it because no tangible information about the two clusters was given.
There are several other possible causes (a few quick checks for some of them are sketched after the list):
- there is no exclusive node access, and other users are using more resources than they asked for, thus reducing the availability of CPU cores
- one of the clusters has hyperthreading enabled and you are using the hyperthreads, which provide only a minimal speedup over real CPU cores
- there is CPU affinity set in one case and not the other, and you don’t set OMP_NUM_THREADS=1
- the cluster with the bad performance is badly managed and the node is not properly cleaned up when a job reaches its walltime limit
- the cluster with the bad performance is badly managed and allows users to log into the compute nodes and start calculations even if they have no job on that node
- you are not using OpenMP correctly
- your LAMMPS executable has not been compiled correctly for the available MPI library
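A few of these points can be checked quickly on a compute node. This is only a sketch with generic Linux commands, not a recipe specific to your clusters:

```
# does the node have hyperthreading enabled? (Thread(s) per core > 1)
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

# is OMP_NUM_THREADS set, and to what?
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"

# are there leftover or foreign processes eating CPU on the node?
top -b -n 1 | head -n 25
```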
To be able to tell what is happening, we need to know more details: how your LAMMPS version(s) were compiled (e.g. the output from lmp -h) and which command lines you are using.
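For example, redirecting the help output to a file is enough (the executable path below is just a placeholder):

```
# capture the build configuration reported by the LAMMPS binary
/path/to/lmp -h > lmp_config.txt
```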
For your reference, here are some numbers for using 1, 2, and 4 CPU cores on an Intel NUC machine with a 4-core CPU and the “rhodo” benchmark input.
1 MPI 1 OpenMP: mpirun -np 1 ../build/lmp -in in.rhodo
Loop time of 26.6078 on 1 procs for 100 steps with 32000 atoms
Performance: 0.649 ns/day, 36.955 hours/ns, 3.758 timesteps/s, 120.266 katom-step/s
99.7% CPU use with 1 MPI tasks x 1 OpenMP threads
2 MPI 1 OpenMP: mpirun -np 2 ../build/lmp -in in.rhodo
Loop time of 13.4511 on 2 procs for 100 steps with 32000 atoms
Performance: 1.285 ns/day, 18.682 hours/ns, 7.434 timesteps/s, 237.899 katom-step/s
99.8% CPU use with 2 MPI tasks x 1 OpenMP threads
4 MPI 1 OpenMP: mpirun -np 4 ../build/lmp -in in.rhodo
Loop time of 7.59636 on 4 procs for 100 steps with 32000 atoms
Performance: 2.275 ns/day, 10.550 hours/ns, 13.164 timesteps/s, 421.255 katom-step/s
99.6% CPU use with 4 MPI tasks x 1 OpenMP threads
1 MPI 1 OpenMP: OMP_NUM_THREADS=1 mpirun -np 1 ../build/lmp -in in.rhodo -sf omp
Loop time of 23.744 on 1 procs for 100 steps with 32000 atoms
Performance: 0.728 ns/day, 32.978 hours/ns, 4.212 timesteps/s, 134.771 katom-step/s
99.8% CPU use with 1 MPI tasks x 1 OpenMP threads
1 MPI 2 OpenMP: OMP_NUM_THREADS=2 mpirun -np 1 ../build/lmp -in in.rhodo -sf omp
Loop time of 12.684 on 2 procs for 100 steps with 32000 atoms
Performance: 1.362 ns/day, 17.617 hours/ns, 7.884 timesteps/s, 252.287 katom-step/s
199.7% CPU use with 1 MPI tasks x 2 OpenMP threads
1 MPI 4 OpenMP: OMP_NUM_THREADS=4 mpirun -np 1 ../build/lmp -in in.rhodo -sf omp
Loop time of 7.01865 on 4 procs for 100 steps with 32000 atoms
Performance: 2.462 ns/day, 9.748 hours/ns, 14.248 timesteps/s, 455.928 katom-step/s
398.9% CPU use with 1 MPI tasks x 4 OpenMP threads
2 MPI 2 OpenMP: OMP_NUM_THREADS=2 mpirun -np 2 ../build/lmp -in in.rhodo -sf omp
Loop time of 6.56561 on 4 procs for 100 steps with 32000 atoms
Performance: 2.632 ns/day, 9.119 hours/ns, 15.231 timesteps/s, 487.388 katom-step/s
199.5% CPU use with 2 MPI tasks x 2 OpenMP threads
with hyper-threading:
2 MPI 4 OpenMP: OMP_NUM_THREADS=4 mpirun -np 2 ../build/lmp -in in.rhodo -sf omp
Loop time of 5.67482 on 8 procs for 100 steps with 32000 atoms
Performance: 3.045 ns/day, 7.882 hours/ns, 17.622 timesteps/s, 563.895 katom-step/s
396.9% CPU use with 2 MPI tasks x 4 OpenMP threads
4 MPI 2 OpenMP: OMP_NUM_THREADS=2 mpirun -np 4 ../build/lmp -in in.rhodo -sf omp
Loop time of 5.39629 on 8 procs for 100 steps with 32000 atoms
Performance: 3.202 ns/day, 7.495 hours/ns, 18.531 timesteps/s, 593.000 katom-step/s
198.4% CPU use with 4 MPI tasks x 2 OpenMP threads
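If you want to compare such numbers against your clusters, it can help to make the process/thread placement explicit, so the hybrid MPI+OpenMP runs are not skewed by whatever affinity defaults the MPI library or batch system imposes. A minimal sketch, assuming Open MPI (the binding flags differ for other MPI implementations):

```
# pin 2 MPI ranks to separate sockets and give each rank 2 OpenMP threads
OMP_NUM_THREADS=2 mpirun -np 2 --bind-to socket --map-by socket \
    ../build/lmp -in in.rhodo -sf omp
```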