MPI breakdown

Hello everyone,

I am running LAMMPS with MPI and I am wondering whether I installed LAMMPS correctly. I used GNU compilers. The MPI task timing breakdown looks like this:

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 1601.8     | 2808.9     | 4369.8     | 993.6 | 42.31
Neigh   | 0.00014366 | 0.00018994 | 0.00023398 |   0.0 |  0.00
Comm    | 2203.9     | 3698.9     | 4923.3     | 855.3 | 55.71
Output  | 0.11648    | 0.15851    | 0.1741     |   3.7 |  0.00
Modify  | 58.658     | 124.24     | 252.16     | 478.0 |  1.87
Other   |            | 7.419      |            |       |  0.11

It seems to me that the Comm part is way too large. The input script looks like this:

units           metal                       # this keyword defines the units used in the simulation; you can check the definitions on the LAMMPS website
                                            # For GAP, always metal
atom_style      atomic                      # This defines how atoms are represented (e.g. whether they carry a charge or not)

boundary        p p p                       # Periodic boundary conditions
box tilt large                              # Here you specify that the simulation box is strongly skewed. You can skip this one if your simulation
                                            # box is cubic.
read_data       data.pos                    # Structure file. I think it would be easy for you to generate one. Just follow the rules for a
                                            # triclinic cell explained on the LAMMPS website
# Atomic masses
mass 3 207.2          
mass 2 132.9
mass 1  79.9

# Definition of potential
pair_style      quip
pair_coeff * * /scratch/djordjedangic/CsPbBr3/Fit_potential/v12/GAP.xml "IP GAP" 35 55 82     # GAP (machine-learning) potential via the QUIP interface

# Setup neighbor style, you do not really need to change this unless you try to melt GeTe
neighbor        1.0 nsq
neigh_modify    delay 100 

variable        t equal 100                 #Define the temperature (t)

variable        dt equal 0.001        
timestep        ${dt}                        # Define time step (0.001 ps, i.e. 1 fs, in metal units)

# This defines the output in the LOG file. What each keyword means is on LAMMPS website
thermo_style    custom step temp pe ke press pxx pyy pzz pxy pxz pyz 
thermo          200

# Define starting velocities of atoms
velocity        all create $t 153278 dist gaussian mom yes
velocity        all scale $t

# Equilibrium run 
fix             1 all nvt temp $t $t $(dt*100) # Here you say you are using NVT ensemble with temperature relaxation time of 100 timesteps
run             10000                          # Run it for 10k timesteps
unfix           1                              # Release fix 1 after the run

fix             11 all npt temp $t $t $(dt*100) x 0 0 $(dt*1000) y 0.0 0.0 $(dt*1000) z 0.0 0.0 $(dt*1000) couple none # NPT ensemble with uncoupled (anisotropic) pressure relaxation, temperature damping of 100 timesteps and pressure damping of 1000 timesteps
run             40000                          # Run it for 40k timesteps
unfix           11                             # Release fix 11 after the run
# the main part of the run
fix             2 all npt temp $t $t $(dt*100) x 0 0 $(dt*1000) y 0.0 0.0 $(dt*1000) z 0.0 0.0 $(dt*1000) couple none # NPT ensemble with uncoupled (anisotropic) pressure relaxation, temperature damping of 100 timesteps and pressure damping of 1000 timesteps

# define the output file 
# Here 200 means write a snapshot every 200 timesteps, the name of the file is data.atom and the format is the atom id and element,
# followed by atomic positions, velocities and forces in Cartesian coordinates
dump            tdep all custom 200 data.atom id element x y z vx vy vz fx fy fz 
dump_modify     tdep sort id
dump_modify     tdep element Br Cs Pb

run             200000                        #Number of timesteps for main part

Is this value of Comm expected? If it is not, is there something wrong with my input script?

Kind regards,

Dorde

How many atoms do you have in your system and how many MPI processes do you use?

Please note that you also have some load imbalance in your system, so your domain decomposition may be suboptimal. Can you provide an image showing your system geometry and the box?

Hi,

I have 40 atoms in the simulation. I use 36 cores with a 3 by 3 by 4 MPI processor grid. My starting structure is cubic perovskite.

For so few atoms, the high communication load is expected. Your calculation will probably run faster with fewer MPI processes (see the sketch below). This kind of setup can work for DFT-based MD codes, but for classical MD you typically cannot get strong scaling with fewer than a few hundred atoms. For “expensive” potentials like GAP this number may come down a bit, but roughly one atom per processor is far too few.
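For illustration, a minimal sketch of such a run from the command line, assuming an MPI-enabled LAMMPS executable called lmp_mpi and an input file called in.perovskite (both names are placeholders, not taken from your setup):

export OMP_NUM_THREADS=1                  # make sure no OpenMP threading interferes
mpirun -np 4 lmp_mpi -in in.perovskite    # 4 MPI ranks instead of 36, i.e. ~10 atoms per rank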

Thank you. I will try with a smaller number of cores to check this.

What would be the suggested number of atoms per core, for ML potentials and for empirical many-body potentials such as Tersoff?

People use such potentials to do simulations on very large systems. So, as I already remarked, hundreds to thousands of atoms per MPI process are common. Often people use even more.

Usually the system size is determined by the problem and the need for accuracy and good statistics. Most systems with very few atoms have significant finite-size effects, so larger systems are required to get meaningful results in the first place.

A typical workflow would be: first experiment with the applicability of a potential for a given type of geometry, and with the necessary input settings, on very small systems using only a few processes (1-8) on a desktop machine; then scale up to the size required to get reliable results and sampling; and finally do a strong scaling test, i.e. run with increasingly more MPI processes (often on a logarithmic scale) and find out when the gain from adding more processes is no longer worth the cost. A sketch of such a test is shown below.
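As an illustration, a strong scaling test can be scripted as a simple loop over the number of MPI ranks, comparing the "Loop time" line that LAMMPS prints at the end of each run; the executable and input file names below are placeholders:

for np in 1 2 4 8 16 32; do
    mpirun -np ${np} lmp_mpi -in in.perovskite -log log.np${np}
done
grep "Loop time" log.np*       # lower wall time is better; stop adding ranks when it stops dropping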

Hello,

I have tried to decrease the number of cores and I have seen a modest speed-up of the calculation.

However, I have noticed that if I set the number of OpenMP threads to 1 in my Slurm script, I see an even larger speed-up. I believe the communication overhead was coming from the fact that my LAMMPS runs were trying to use both MPI and OpenMP parallelization at the same time; I am not sure why this was happening. Either way, once I set the number of OpenMP threads to 1 (see the sketch below), I see the expected scaling of the computation time with an increasing number of MPI processes.
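For reference, a minimal sketch of the relevant part of such a Slurm script, assuming one MPI rank per CPU and no OpenMP threading (the executable and input file names are placeholders, not my actual files):

#!/bin/bash
#SBATCH --ntasks=4             # number of MPI processes
#SBATCH --cpus-per-task=1      # one CPU per MPI rank
export OMP_NUM_THREADS=1       # force a single OpenMP thread per rank
srun lmp_mpi -in in.perovskite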

Thank you.