Vacuum

Hi All,
I am running a simulation of liquid SPC/E water molecules, starting in a volume of 120 x 120 x 20 angstroms. I am also including roughly 30 angstroms of vacuum on either side to allow the liquid to expand slightly. I have noticed that if I increase the vacuum on either side from 30 angstroms to, say, 300, the simulation takes much longer to run. Why is this happening? Should this be happening?

Regards,
Peter

Hi All,
I am running a simulation of liquid SPC/E water molecules, starting in a volume of 120 x 120 x 20 angstroms. I am also including roughly 30 angstroms of vacuum on either side to allow the liquid to expand slightly. I have noticed that if I increase the vacuum on either side from 30 angstroms to, say, 300, the simulation takes much longer to run. Why is this happening? Should this be happening?

a) you should have a look at the performance summary output that LAMMPS prints. the manual discusses it and it can provide valuable insight into where things are slowing down and why.

b) if you are running in parallel and are not making any adjustments, you are creating a load imbalance.
c) if you are using long-range electrostatics with PPPM, part of the computational effort scales with the volume of the simulation box.
d) it probably should be happening, but its impact can be reduced. if you are doing a simulation of a water droplet, you are likely better off using shrink-wrapped boundary conditions (“s” or “m”), which will adapt automatically to the space required (see the sketch below).
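
for example, a minimal sketch (assuming a slab that is periodic in x and y with only z free to adapt; adjust to your geometry):

  boundary p p m    # p = periodic in x/y; m = shrink-wrapped with a minimum bound in z, so the box follows the atoms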

axel.

Hi Axel,
Thank you very much. After looking at the outputs and reading your comments, I believe my issue is a large load imbalance. The liquid simulation was an intermediate step toward a liquid-vapor simulation. When the two were combined, I was not rebalancing the processors, so the domain was still subdivided into equal-sized boxes, as described on the balance command manual page. Thanks for your help.
Regards,
Peter

Hi Axel,
Would you say using balance rcb would be the optimal solution for running a liquid-vapor simulation?
Regards,
Peter

If your model contains a lot of empty space due to the distribution of particles, even after shrink wrapping, or a considerable difference in density between parts of your system, then yes, rcb is likely to be an improvement. You can also try the tiled communication style (comm_style tiled), which load balances better than the default brick layout for that kind of case.
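
As a minimal sketch (the fix ID, rebalance interval, and imbalance threshold below are placeholders):

  comm_style tiled                   # allow non-brick (RCB) subdomains
  fix lb all balance 1000 1.1 rcb    # rebalance every 1000 steps when the imbalance factor exceeds 1.1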

Adrian

Hi All,
Thank you for the advice. Using balance rcb has sped up the performance quite a bit, but I still think it should be faster. When I run the simulation of just the liquid, there are roughly 100 atoms per core after using balance rcb. When I add the vapor phase, I also increase the number of cores used, which amounts to roughly 45 atoms per core. The liquid-vapor simulation is slower even with fewer atoms per core. Why might this be? Looking at the MPI task time breakdown, most of the time is spent in Kspace and Comm.
Thanks,
Peter

What interval is being used in your fix balance command?

Sincerely,
Adrian Diaz
Graduate Research Fellow – Mechanical and Aerospace Engineering
Herbert Wertheim College of Engineering
University of Florida


Every 1000 time steps.
Regards,
Peter

Hi All,
Thank you for the advice. Using balance rcb has sped up the performance quite a bit, but I still think it should be faster.

what makes you think this?

have you done a proper strong scaling test? if yes, please report its result. if not, please do it and learn from it.
please also learn about Amdahl’s law, and keep in mind that Amdahl’s law does not account for “parallel overhead”.
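
for example, a strong scaling test is just the identical input run on an increasing number of cores (the executable name, input name, and core counts below are placeholders):

  mpirun -np 16  lmp -in in.liquid-vapor -log log.16
  mpirun -np 32  lmp -in in.liquid-vapor -log log.32
  mpirun -np 64  lmp -in in.liquid-vapor -log log.64
  mpirun -np 128 lmp -in in.liquid-vapor -log log.128

then compare the “Loop time” lines: the parallel efficiency on N cores is (loop time of the smallest run x its core count) / (loop time on N cores x N).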

When I run the simulation of just the liquid, there are roughly 100 atoms per core after using balance rcb. When I add the vapor phase, I also increase the number of cores used, which amounts to roughly 45 atoms per core. The liquid-vapor simulation is slower even with fewer atoms per core. Why might this be? Looking at the MPI task time breakdown, most of the time is spent in Kspace and Comm.

no parallel software scales indefinitely. at 100 or fewer atoms per core, you are already well past the scale-out point for simple classical force fields without kspace. kspace, however, has its own parallel scaling limit, and that is dominated by the parallel FFTs. for normal classical MD calculations with kspace (via PPPM), parallel efficiency is usually best when the ratio between the Pair and Kspace times is about 2:1. the Pair part generally scales much better than the Kspace part.

as for why parallel FFTs become a problem when using too many MPI ranks, you need to look at what is done in PPPM.
all charges are smeared out and the resulting charge density grid is computed (this smearing corresponds to the damping added to the charge interactions in real space, which has to be compensated in reciprocal space). for a 3d FFT you need to do a series of 1d FFTs along each direction. for those to work, you need all grid points in that direction, but that is not how the atom data is distributed (or the grid computed) in a code with domain decomposition. thus the grid point data has to be redistributed, which requires MPI all-to-all communication. after that, each MPI rank holds “sticks” or “pencils” of the total grid. since you can now only distribute the grid in 2d (as the third direction has to be contiguous on a single MPI rank), you are limited in how much work you can distribute. add to that that you need a transpose between each set of 1d FFTs so that you can do them in the x-, y-, and z-directions, and that with regular PPPM you need to do one 3d FFT forward and one 3d FFT backward. for each of the transposes and the remaps back and forth between “sticks” and domain-decomposed data, you have communication overhead, and that overhead grows with the number of MPI ranks while the “productive work” per rank shrinks.

there is no sense in going beyond the strong scaling limit, and even reaching it is very wasteful.
you can limit the impact of what i like to call “the curse of the kspace” by using either a mix of multi-threading and MPI or the verlet/split run style. while multi-threading doesn’t parallelize as efficiently as MPI in LAMMPS, it allows PPPM to use fewer MPI ranks in the 3d FFTs and thus limits the parallel overhead from the all-to-all communication. with verlet/split, you move the kspace part to a separate partition of MPI ranks and can thus pick the number of ranks at which PPPM scales out.
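
as a rough sketch of both options (rank counts, thread counts, and file names are placeholders; for verlet/split the two partitions must also have compatible processor grids, see its documentation):

  # option 1: hybrid MPI + OpenMP, so fewer MPI ranks take part in the 3d FFTs
  mpirun -np 16 lmp -in in.liquid-vapor -sf omp -pk omp 4

  # option 2: kspace on its own (smaller) partition
  mpirun -np 16 lmp -in in.liquid-vapor -partition 12 4
  # plus, in the input script:
  run_style verlet/split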

since you have a rather low and inhomogeneous particle density, it is not clear how much you can gain from any of these. it also has to go in tandem with choosing the optimal cutoff for the coulomb interactions. finally, you need to check whether your kspace forces are properly converged. the accuracy-based estimator that LAMMPS uses by default to infer the reciprocal space parameters only works properly for a homogeneous density; for an inhomogeneous particle density, you may be lacking accuracy in the high-density parts.
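
one way to test this (the accuracy value and grid below are placeholders): rerun a short piece of the trajectory with a much tighter kspace accuracy, or with an explicitly chosen grid, and compare energies and forces in the dense region:

  kspace_style pppm 1.0e-6        # tightened accuracy for the comparison run
  # or bypass the homogeneous-density estimator and set the grid by hand:
  kspace_modify mesh 128 128 64   # hypothetical grid; must be chosen for your box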

axel

Hi Axel,
Thank you for the helpful information. I guess not knowing enough about parallel computing makes me think this. I have not done a proper scaling test. Would looking at the MPI breakdown for different numbers of cores be sufficient? I will report the results once I have done this test.

Just as a sanity check, my simulation of 20,000 SPC/E water molecules takes roughly 48 hours to finish 200,000 timesteps when using 4 Intel Xeon Phi nodes with 68 cores each and the fix balance rcb command to balance the load. Does this seem reasonable, or have I missed something very important that is causing a major slowdown? One thing I noticed is that I am not using pair_style lj/cut/intel as described on the accelerator packages page. I have included the script I am using to run this in case that is helpful.

i have no comments about the performance of xeon phi processors. i never used them.

the most obvious question: why use kspace style ewald and not pppm?
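
i.e. something like this in place of your ewald line (the 1.0e-4 accuracy is just a placeholder for whatever value you already use):

  kspace_style pppm 1.0e-4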

Hi Axel,
That is a good question. I had not read that the PPPM solver scales as N log N while Ewald scales as N^(3/2). Thank you.
Regards,
Peter

Hi Axel,
Using PPPM does make the simulation run quite a bit faster. Thank you for the help. However, I had to reduce the neighbor skin distance from 1.0 angstroms to 0.2 angstroms because, in the liquid phase, the z-direction cuts created by the load balancing to split atoms between processors are only about 0.5 angstroms apart. Does this have a negative effect on simulation accuracy, or does it just increase the number of times the neighbor lists are created?
Regards,
Peter

Hi Axel,
Using PPPM does make the simulation run quite a bit faster. Thank you for the help. However, I had to reduce the neighbor skin distance from 1.0 angstroms to 0.2 angstroms because, in the liquid phase, the z-direction cuts created by the load balancing to split atoms between processors are only about 0.5 angstroms apart. Does this have a negative effect on simulation accuracy, or does it just increase the number of times the neighbor lists are created?

make a test. you need to test how well converged the reciprocal space forces are anyway, if you paid attention to what i explained before.

you can also consider changing the initial processor distribution in different directions with the processors keyword to better match the particle distribution.
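
for example (the layout is a placeholder, assuming a box that is wide in x and y but has most of its atoms in a thin slab in z):

  processors * * 1    # let LAMMPS pick the x/y grid, but keep a single layer of ranks in z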

axel.

Hi Axel,
Thank you for the help. I did see what you mentioned earlier but do not know how to check the convergence of the forces in reciprocal space. Is there documentation describing this process?
Regards,
Peter

Hi Axel,
Thank you for the help. I did see what you mentioned earlier but do not know how to check the convergence of the forces in reciprocal space. Is there documentation describing this process?

this is a discussion you should have with your adviser or tutor. i am neither. i already went out of my way to explain rather elementary parallel programming facts, and i only did that knowing that this thread will be archived, so that in the future i can point others in a similar situation to it. but this is going too far. you also need to read up on ewald summation and related methodology so you know what you are doing anyway. you are not doing a conventional setup where the default heuristics and settings that people have used for many similar situations apply.

axel.

Hi Axel,
Do you offer tutoring? Just kidding. Thank you for the help. Sorry for taking the conversation too far. I will make sure to work harder on using the available resources to understand the framework of LAMMPS before asking questions again.
Regards,
Peter