Calculating speed on a parallel computer

Dear all LAMMPS users,

The simulation I'm running now is on my desktop with an i5-8500 and 2 x 8 GB of PC4-19200 RAM.

The system includes 7076 atoms, 4580 bonds, 2450 angles, 6 atom types, 7 bond types, and 7 angle types.

Also, I'm using the pair_style lj/cut/tip4p/long/omp, kspace_style pppm/tip4p/omp, fix shake, and fix rigid commands.
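
A minimal sketch of this kind of setup, where the type numbers, OM offset, cutoff, group names, and tolerances below are placeholders rather than my actual values:

pair_style   lj/cut/tip4p/long/omp 1 2 1 1 0.15 12.0   # O type, H type, O-H bond type, H-O-H angle type, OM distance, LJ/Coulomb cutoff
kspace_style pppm/tip4p/omp 1.0e-4                     # long-range solver matching the tip4p pair style
fix          shk water shake 0.0001 20 0 b 1 a 1       # constrain the water O-H bonds and H-O-H angle
fix          rig guest rigid/small molecule            # integrate the (hypothetical) guest molecules as rigid bodies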

It takes about 2 minutes per 5000 timesteps (1 fs timestep), so it will take about 5 months to finish these simulations, but I have to finish within 2 months.

I found a computing facility that I can use, which has 8437 nodes with Intel Xeon Phi 7250 (68-core) processors and 96 GB of RAM.

But I also found from tests that there is no big difference between a 6-core processor (i5-8500) and a 16-core processor (Ryzen 1950X), which suggests that calculation speed does not increase dramatically with the number of CPU cores.

I also found that GPUs can enhance simulation speed, but LAMMPS does not support the TIP4P PPPM solver on GPUs.

So my questions are:

  1. Are the results of the fix rigid command with a massless atom (OM) in water the same as with the TIP4P styles supported by LAMMPS?

  2. If 1) works, which would be better: using a parallel cluster (e.g. 10 or more nodes), or running on my desktop with a GPU and fix rigid?

  3. If 1) does not work, which strategies can I use to improve my simulation speed?

Any advice would be very helpful. Thanks

Dongwoo Kang


It takes about 2 minutes per 5000 timesteps (1 fs timestep), so it will take about 5 months to finish these simulations, but I have to finish within 2 months.

which command line did you use for that?

I found a computing facility that I can use, which has 8437 nodes with Intel Xeon Phi 7250 (68-core) processors and 96 GB of RAM.

you will only benefit significantly from the xeon phi if you make
effective use of the vector units (the cores are otherwise rather
slow) by using the USER-INTEL package, which has no support for
tip4p, IIRC.
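
for reference, a typical way of enabling that package from the command line looks like the sketch below (core and thread counts are placeholders; IIRC the suffix simply falls back to the plain styles where no intel variant exists, e.g. for the tip4p styles):

mpirun -np 64 /path/to/lmp -sf intel -pk intel 0 omp 2 -in in.system    # 64 MPI tasks, 2 threads each, no coprocessor offload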

But I also found from tests that there is no big difference between a 6-core processor (i5-8500) and a 16-core processor (Ryzen 1950X), which suggests that calculation speed does not increase dramatically with the number of CPU cores.

you are comparing apples and oranges here. the two CPUs have different
architectures and clock rates, so comparing simulation speed against
the number of cores alone is not a valid comparison.

I also found that GPUs can enhance simulation speed, but LAMMPS does not support the TIP4P PPPM solver on GPUs.

but the problem here is that you need a lot of atoms per GPU to get
significant speedups.

So my questions are:

1) Are the results of the fix rigid command with a massless atom (OM) in water the same as with the TIP4P styles supported by LAMMPS?

2) If 1) works, which would be better: using a parallel cluster (e.g. 10 or more nodes), or running on my desktop with a GPU and fix rigid?

using 10 or more nodes is not going to work for such a tiny system.
you will scale out when you reach a few hundred atoms per CPU core.
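
(for scale: 7076 atoms spread over 10 nodes x 68 cores would be roughly 10 atoms per MPI rank, far below a few hundred.)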

3) If 1) does not work, which strategies can I use to improve my simulation speed?

to make any serious assessment, you have to provide your complete
input (best in a compressed tar archive or at least with the data file
compressed with gzip).

axel.


It takes about 2 minutes per 5000 timesteps (1 fs timestep), so it will take about 5 months to finish these simulations, but I have to finish within 2 months.

are you sure about this estimate? i just estimated that this would
mean you are going to simulate about half a microsecond. what kind of
study needs that long a trajectory of such a small system? ...and does
it have to be a continuous trajectory? otherwise you could just create
a few hundred decorrelated equilibrated restarts and then run them all
concurrently.
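
a rough sketch of that idea, with the restart interval and file names as placeholders rather than anything from this thread:

restart 500000 equil.*.restart          # during one long equilibration run, write a restart every 0.5 ns
# then launch many independent continuation runs, one per restart file, e.g.:
#   mpirun -np 4 /path/to/lmp -var rfile equil.1000000.restart -in in.continue
# where in.continue begins with:  read_restart ${rfile}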

axel.

Dear axel,

Thanks for your kind replies.

The purpose of my simulation is to observe the crystal growth of a clathrate hydrate, and I want to simulate this for 500 ns.

From the calculation speed on my desktop (1 fs timestep, about 2500 fs/min), my estimate is 3.6 ns/day, so it'll take about 5 months to finish.
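
(That is, 2500 fs/min x 1440 min/day ≈ 3.6 ns/day, and 500 ns at 3.6 ns/day is about 140 days.)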

The reason the system is tiny is to reduce simulation time on my desktop. When I use the Xeon Phi, I can scale up my system if I want or have to.

But the important thing is simulation speed, because I want to simulate at more than 10 ns/day.

This is why I'm now concerned about the computing facility, because its fee is not cheap, and also I have much time for that 3.6 ns/day speed.

The USER-INTEL package does not support TIP4P, so do I have to use pair_style lj/cut/coul/long and fix rigid instead of TIP4P?

I attached my input files as .gz files.

I have no experience in this parallel computing field, so please excuse my limited background knowledge.

On Wed, Mar 6, 2019 at 5:04 PM, Axel Kohlmeyer <[email protected]> wrote:

in.190227mdew.gz (720 Bytes)

190227md.dat.gz (260 KB)

Dear axel,

I made a mistake above: “I have much time for that 3.6 ns/day speed” should have been “I do not have much time at that speed”.

Also, if I use the Xeon Phi, does the distribution of MPI tasks and OpenMP threads depend on my system? In other words, is trial and error required to find the right split?

Even if I scale up the simulation system, it will not be large (fewer than ~30000 atoms).

And I'm wondering about your opinion of TIP4P, because much previous research on systems like this uses TIP4P, but as you said the USER-INTEL and GPU packages do not support it; so would using pair_style lj/cut/coul/long + fix rigid + a manually added massless charged site (OM) be different from TIP4P or not?

Hoping for your advice.

Dongwoo Kang

On Thu, Mar 7, 2019 at 3:54 PM, Dongwoo Kang (Graduate School, Dept. of Chemical Engineering) <lukekang070@…5003…> wrote:


The purpose of my simulation is to observe the crystal growth of a clathrate hydrate, and I want to simulate this for 500 ns.

wouldn't that be something better done with monte-carlo simulations?
with MD you need large enough fluctuations to cross the activation
barriers, and those are limited in the small systems accessible to MD.

From the calculation speed on my desktop (1 fs timestep, about 2500 fs/min), my estimate is 3.6 ns/day, so it'll take about 5 months to finish.

you should be using MPI parallelization first and only add OpenMP
threads when there is no further speedup from MPI. MPI parallelization
covers everything, OpenMP only parts of the calculation, and that
parallelization - as implemented by the USER-OMP package - gets
increasingly inefficient as the number of threads increases.
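
for example (paths and core counts below are placeholders; the -sf/-pk switches are only needed if the input does not already select the /omp styles and a package omp command):

mpirun -np 6 /path/to/lmp -in in.190227mdew                                         # pure MPI on 6 cores
OMP_NUM_THREADS=2 mpirun -np 3 /path/to/lmp -sf omp -pk omp 2 -in in.190227mdew     # hybrid: 3 MPI tasks x 2 threads each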


The USER-INTEL package does not support TIP4P, so do I have to use pair_style lj/cut/coul/long and fix rigid instead of TIP4P?

only benchmarks can tell you if this is more efficient. probably not,
since you would have to lower the time step and thus lose again what
you gain from USER-INTEL.
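
just to illustrate what that alternative would look like as input (cutoff, accuracy, group name, and time step are placeholders, and whether fix rigid accepts a strictly massless OM site is exactly the open question here):

pair_style   lj/cut/coul/long 12.0
kspace_style pppm 1.0e-4
fix          rigw water rigid/small molecule   # each water molecule integrated as one small rigid body
timestep     0.5                               # i.e. a smaller step than with the shake/tip4p setup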

I attached my input files as .gz files.

thanks. i would recommend against using an explicit number of threads
in the package omp command. just use a 0 (wildcard), so you can
control the number of threads conveniently from the command line by
setting the OMP_NUM_THREADS environment variable.
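
i.e. something like (the input file name is the one attached above):

package omp 0                                        # in the input script: thread count taken from the environment
OMP_NUM_THREADS=4 /path/to/lmp -in in.190227mdew     # set the number of threads at run time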

I have no experience in this parallel computing field, so please excuse my limited background knowledge.

please contact your local HPC gurus. they should be able to help you
get started or point you in the right direction.

axel.

whether TIP4P is a good choice or not depends on many factors. other
people stick with TIP3P or use SPC/E. it may be convenient to pick a
water potential that overestimates the (self-)diffusion, so you have
a more mobile liquid and faster diffusion-limited processes. but then
again, all non-polarizable, pairwise-additive, point-charge water
models based on spherical potentials have significant limitations, so
there is no specific recommendable choice beyond what is compatible
with the parameters of the *other* entities in your simulation. if
those were parameterized for SPC/E, for example, then TIP4P would be
a very bad choice, and vice versa.

axel.

i have made some tests with your inputs.
after reducing the number of time steps to 5000 and changing to “package omp 0”
on my quad-core machine i get the following performance with:
OMP_NUM_THREADS=4 /path/to/lmp -in in.190227mdew

Loop time of 94.9935 on 4 procs for 5000 steps with 7076 atoms

Performance: 4.548 ns/day, 5.277 hours/ns, 52.635 timesteps/s
392.1% CPU use with 1 MPI tasks x 4 OpenMP threads

and with plain MPI using:

mpirun -np 4 /path/to/lmp -in in.190227mdew
Loop time of 131.978 on 4 procs for 5000 steps with 7076 atoms

Performance: 3.273 ns/day, 7.332 hours/ns, 37.885 timesteps/s
96.2% CPU use with 4 MPI tasks x 1 OpenMP threads

which is a bit surprising (as MPI parallelization is generally more efficient than OpenMP). but looking a bit further down explains it:

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Dear axel,

Thanks for your kind advice.

I've tried a test simulation on the Ryzen 1950X with 16 MPI tasks x 1 OpenMP thread after reading your replies, and the result was very good, with a simulation speed of 11.37 ns/day.

But on my i5-8500 CPU, the 6 MPI tasks x 1 OpenMP thread result was quite slow, 1.44 ns/day, using the same balance command you added in the email.

Anyway, the Ryzen result makes me a little hopeful, so I may have to do some trial and error on the computing facility with the Xeon Phi.

Thanks again for your advice.

Dongwoo Kang

On Fri, Mar 8, 2019 at 7:15 AM, Axel Kohlmeyer <[email protected]> wrote: