Possibly a bug when using tip4p together with verlet/split

Hi Axel,

I’ve done the following 3 tests following your suggestion (the in files and configuration file are in the attachments; a typical one is copied at the end), and in every in file there is only one run command, without any minimization.
First, the verlet/split run still didn’t work, and I got the same error:

ERROR on proc 0: TIP4P hydrogen is missing (…/pppm_tip4p.cpp:488)

Since there is only one run now, I think this is probably a bug in LAMMPS.
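
For reference, I launch the verlet/split runs roughly like this (the exact partition sizes and file names are in the attached scripts, so please take this only as an illustrative sketch; in-split is just a placeholder name here):

# in the in file, together with kspace_style pppm/tip4p
run_style verlet/split

# r-space work on 6 ranks, k-space (pppm) on 2
mpirun -np 8 /apps/lammps/24May13/lmp_openmpi -partition 6 2 -in in-split
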
Then I moved to the OpenMP scheme. First, I ran a job with 2 MPI tasks and no OMP threads as a benchmark; the runtime statistics are shown below:

Pair time (%) = 4.68271 (37.4681) Bond time (%) = 0.761198 (6.09063)
Kspce time (%) = 6.60392 (52.8404) Neigh time (%) = 0.196383 (1.57134)
Comm time (%) = 0.0914288 (0.731556) Outpt time (%) = 0.022729 (0.181864)
Other time (%) = 0.139489 (1.11611)

FFT time (% of Kspce) = 3.5833 (54.2602)
FFT Gflps 3d (1d only) = 2.52843 10.6664

Then I enabled OMP by adding the line

package omp 4 force/neigh

in the in file, and submitted the job using

mpirun -x OMP_NUM_THREADS=4 -np 2 /apps/lammps/24May13/lmp_openmpi -sf omp -screen scr.log -in in-omp

and got the runtime statistics shown below:

Pair time (%) = 7.32019 (9.97139) Bond time (%) = 1.26643 (1.7251)
Kspce time (%) = 60.8111 (82.8356) Neigh time (%) = 0.398665 (0.543052)
Comm time (%) = 2.77501 (3.78005) Outpt time (%) = 0.0216095 (0.029436)
Other time (%) = 0.818844 (1.11541)

FFT time (% of Kspce) = 85.8029 (141.097)
FFT Gflps 3d (1d only) = 0.105592 8.92023

Then I got puzzled.

  1. Although I know the OMP scheme may make the pair time (and bond, neigh, comm, etc.) longer than in MPI-task-only jobs, I didn’t expect the Kspace calculation to be almost 10 times longer. In other words, using 8 cores (2 MPI tasks with 4 OMP threads each) took 10 times longer than using 2 MPI tasks only.
  2. In the log file, is there any way I can find out how many cores are being used effectively?

Best

Ming

split-and-omp.tar.gz (717 KB)

Hi Axel,

I've done the following 3 tests following your suggestion (the in files and
configuration file are in the attachments; a typical one is copied at the
end), and in every in file there is only one run command, without any
minimization.
First, the verlet/split run still didn't work, and I got the same error:

ERROR on proc 0: TIP4P hydrogen is missing (../pppm_tip4p.cpp:488)

i have finally been able to reproduce this. i don't have an explanation, though.

Since there is only one run now, I think this is probably a bug in LAMMPS.
Then I moved to the OpenMP scheme. First, I ran a job with 2 MPI tasks and
no OMP threads as a benchmark; the runtime statistics are shown below:

Pair time (%) = 4.68271 (37.4681) Bond time (%) = 0.761198 (6.09063)
Kspce time (%) = 6.60392 (52.8404) Neigh time (%) = 0.196383 (1.57134)
Comm time (%) = 0.0914288 (0.731556) Outpt time (%) = 0.022729 (0.181864)
Other time (%) = 0.139489 (1.11611)

FFT time (% of Kspce) = 3.5833 (54.2602)
FFT Gflps 3d (1d only) = 2.52843 10.6664

Then I enabled OMP by adding the line

  package omp 4 force/neigh

in the in file, and submitted the job using

mpirun -x OMP_NUM_THREADS=4 -np 2 /apps/lammps/24May13/lmp_openmpi -sf omp
-screen scr.log -in in-omp

and got the runtime statistics shown below:

Pair time (%) = 7.32019 (9.97139) Bond time (%) = 1.26643 (1.7251)
Kspce time (%) = 60.8111 (82.8356) Neigh time (%) = 0.398665 (0.543052)
Comm time (%) = 2.77501 (3.78005) Outpt time (%) = 0.0216095 (0.029436)
Other time (%) = 0.818844 (1.11541)

FFT time (% of Kspce) = 85.8029 (141.097)
FFT Gflps 3d (1d only) = 0.105592 8.92023

Then I got puzzled.
1. Although I know the OMP scheme may make the pair time (and bond, neigh,
comm, etc.) longer than in MPI-task-only jobs, I didn't expect the Kspace
calculation to be almost 10 times longer. In other words, using 8 cores
(2 MPI tasks with 4 OMP threads each) took 10 times longer than using
2 MPI tasks only.

this should not happen. there seems to be some *other* problem with
your input when using more than 2 threads. again, it is not
straightforward to tell *where* the problem is, but i can reproduce it.

2. In the log file, is there any way I can find out how many cores are
being used effectively?

LAMMPS-ICMS has a much more detailed timer output. i am working with
steve to have this adapted, so it can be included in the upstream
version. but that only tells half of the story. the rest comes from
how you run the job. if you have an MPI installation that enforces
processor affinity per MPI task or does not properly distribute MPI
tasks across nodes, this is not easy to tell from within LAMMPS.
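
e.g. with Open MPI you can ask the launcher itself to show (and control) the
binding. the exact flags depend on the Open MPI version, so take this only
as a sketch:

# print how each rank is pinned; bind ranks to a socket so the 4 threads
# get 4 cores to themselves instead of fighting over a single core
mpirun --report-bindings --bind-to-socket -x OMP_NUM_THREADS=4 -np 2 \
    /apps/lammps/24May13/lmp_openmpi -sf omp -screen scr.log -in in-omp

if that output shows each rank pinned to a single core, that alone would
explain a large part of the kspace slowdown.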

please let me repeat and rephrase my earlier question, since i believe
there has to be a better way to run your system. i have implemented
most of the support code for it last night, but i need to know where
this is heading. so _please_ tell me:

what is your target system?
1) a water droplet on a single graphene sheet
2) a water droplet between two graphene sheets
3) a stack of graphene sheets with water droplets between them

what level of accuracy are you after?
1) as much as possible with a classical model for the entire system,
no matter the computational cost
2) i need mostly the water droplet structure
3) i need mostly the graphene sheet structure
4) i want something else (what?)

how long a trajectory are you after?

what are the maximum number of CPUs that you want to use and how are
they connected (desktop, dual CPU workstation, single cluster node,
multiple cluster nodes with ethernet, multiple cluster nodes with
infiniband or similar)?

thanks,
    axel.

Hi Axel,

Thank you for having spent so much time helping me. Since you can
reproduce the problem I met, at least it seems that I haven't made some
simple mistake.

My answers to your questions are listed below:

what is your target system?
1) a water droplet on a single graphene sheet
2) a water droplet between two graphene sheets
3) a stack of graphene sheets with water droplets between them

My target system is (3).

what level of accuracy are you after?
1) as much as possible with a classical model for the entire system, no
matter the computational cost
2) i need mostly the water droplet structure
3) i need mostly the graphene sheet structure
4) i want something else (what?)

The closest answer for me is (2), though the dynamics of the graphene should
also be right, e.g. the density of states (DoS).

how long a trajectory are you after?

I need trajectories at least on the order of tens of ns, preferably longer.
Right now I can generate roughly 6 ns/day.

what are the maximum number of CPUs that you want to use and how are they
connected (desktop, dual CPU workstation, single cluster node, multiple
cluster nodes with ethernet, multiple cluster nodes with infiniband or
similar)?
My main resources are HECToR and NCI (AU). For more details on the
specifications of these two world-class supercomputers, you can refer to
http://www.hector.ac.uk/service/hardware/
http://nf.nci.org.au/facilities/fujitsu.php

Besides, I can also use a few GPU clusters with up-to-date hardware, and of
course a desktop.

Thanks again for your kind help.

Best
Ming

Hi Axel,

Thank you for having spent so much time helping me. Since you can
reproduce the problem I met, at least it seems that I haven't made some
simple mistake.

My answers to your questions are listed below:

what is your target system?
1) a water droplet on a single graphene sheet
2) a water droplet between two graphene sheets
3) a stack of graphene sheets with water droplets between them

My target system is (3).

well, then - in my personal opinion - you should definitely switch to using

pair_style lj/long/tip4p/long
kspace_style pppm/disp/tip4p

the inhomogeneity of your system will result in significant artifacts
in the lennard-jones part of your model, especially since you are
using a rather short cutoff of 10 angstrom. with the long-range
treatment of the attractive LJ term, you effectively have an infinite
cutoff. i am not sure how good it is to model this with only one sheet
and one droplet, though.
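
roughly, the relevant lines would look like this (the tip4p atom/bond/angle
types and the O-M distance below are assumptions for TIP4P/2005 and have to
match your data file; check the pppm/disp docs for the recommended accuracy
and mixing settings):

pair_style    lj/long/tip4p/long long long 1 2 1 1 0.1546 10.0
kspace_style  pppm/disp/tip4p 1.0e-4
# 'long long' = long-range treatment of both dispersion and coulomb;
# 1 2 1 1 0.1546 = O type, H type, bond type, angle type, O-M distance;
# 10.0 = real-space cutoff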

what level of accuracy are you after?
1) as much as possible with a classical model for the entire system, no
matter the computational cost
2) i need mostly the water droplet structure
3) i need mostly the graphene sheet structure
4) i want something else (what?)

The closest answer for me is (2), though the dynamics of the graphene should
also be right, e.g. the density of states (DoS).

see above for the concern about the water droplets between the sheets
being forced to move synchronously as a result of periodicity.

how long a trajectory are you after?

I need trajectories at least on the order of tens of ns, preferably longer.
Right now I can generate roughly 6 ns/day.

what are the maximum number of CPUs that you want to use and how are they
connected (desktop, dual CPU workstation, single cluster node, multiple
cluster nodes with ethernet, multiple cluster nodes with infiniband or
similar)?
My main resources are HECToR and NCI (AU). For more details on the
specifications of these two world-class supercomputers, you can refer to
http://www.hector.ac.uk/service/hardware/
http://nf.nci.org.au/facilities/fujitsu.php

those are big machines, but the system you posted cannot scale to fill
a large chunk of them.
my followup question is: how far (how many CPUs at the most) do you
want it to scale?

Besides, I can also use a few GPU clusters with up-to-date hardware, and of
course a desktop.

GPUs are worth considering, but your setup does not favor using them much.

axel.

Hi Axel,

Thanks for pointing these out. As you said, the cutoff and periodicity of
the system always need to be treated carefully. In fact, we have carried out
a detailed analysis of these two possible artifacts and found that their
effects on the properties we care about are negligible.
The poor scalability of this very inhomogeneous system is exactly why I tried
to use verlet/split (and later switched to OMP, as you suggested). The main
problem is that when I use more than one node, the kspace calculation almost
doesn't scale! However, since at present the only solution is to change the
pair style as you said, I'll give it a try.
Regarding how far (how many CPUs at most) I want it to scale: as long as the
efficiency (especially for the electrostatic interaction) is reasonable, I
can use up to 1024 cores. As for GPUs, they just provide more threads, so
since OMP didn't work for my particular system, I think I'd better stick to
CPU-based calculations.

Best
Ming

Hi Axel,

Thanks for pointing these out. As you said, the cutoff and periodicity of
the system always need to be treated carefully. In fact, we have carried out
a detailed analysis of these two possible artifacts and found that their
effects on the properties we care about are negligible.
The poor scalability of this very inhomogeneous system is exactly why I tried
to use verlet/split (and later switched to OMP, as you suggested). The main
problem is that when I use more than one node, the kspace calculation almost
doesn't scale! However, since at present the only solution is to change the
pair style as you said, I'll give it a try.

please have a little patience. i am working on some changes that could help.
in the end, if you want to run as fast as possible, you want it all:
verlet/split *and* OpenMP.
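
once the tip4p issue in verlet/split is fixed, the combination would look
roughly like this (the partition sizes and thread counts are made-up numbers
that need tuning, in.split is a placeholder name, and iirc the r-space
partition has to be an integer multiple of the k-space one):

# in the input file
package    omp 2 force/neigh
run_style  verlet/split

# 8 ranks for pair/bond/neighbor, 4 ranks for pppm, 2 omp threads per rank
mpirun -x OMP_NUM_THREADS=2 -np 12 lmp_openmpi -partition 8 4 -sf omp -in in.split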

Regarding how far (how many CPUs at most) I want it to scale: as long as the
efficiency (especially for the electrostatic interaction) is reasonable, I
can use up to 1024 cores. As for GPUs, they just provide more threads, so
since OMP didn't work for my particular system, I think I'd better stick to
CPU-based calculations.

this is a flawed comparison. multi-threading and GPU acceleration have
very different performance properties, and how they are implemented in
LAMMPS is also very different. you cannot transfer anything from USER-OMP
to GPU and vice versa. GPU acceleration might indeed be very attractive
for speeding up kspace for your system, since kspace is the dominating
part of the calculation and you can get higher performance with fewer
MPI tasks (and are thus less affected by the 3d-FFT disaster), but for
the simple reason that there is no GPU pppm/tip4p support, it is not
possible.

axel.

Hi Axel,

Thanks for your ongoing work on this issue; I'm looking forward to it.
I contacted Steve 4 days ago about a GPU version of TIP4P, and he told
me that he would let Mark Brown comment on this. This would also make LAMMPS
even more popular, as TIP4P/2005 is regarded as one of the best rigid
water models to date.

Best
Ming

Hi Axel,

Thanks for your ongoing work on this issue; I'm looking forward to it.
I contacted Steve 4 days ago about a GPU version of TIP4P, and he told
me that he would let Mark Brown comment on this. This would also make LAMMPS
even more popular, as TIP4P/2005 is regarded as one of the best rigid
water models to date.

if you believe this so much and want it so much, you had better get ready
to contribute the code yourself. this is how open source software
works. i think your perspective on how much of an impact more tip4p
support has on LAMMPS is very biased by your personal desires.
"popularity" constitutes itself from a large populace, not from a single
person repeatedly stating it. i for one tend to lose interest when
people repeatedly state that something would serve the general public,
as i've observed that they actually only mean that it would serve
themselves well and that they would say anything for as long as they
don't have to do the dirty work themselves. i am not saying that this
is true in this case, but it is my observation on average.

besides, i've heard people argue about what the "better" water model
is since i was an undergrad, and it tires me to no end (hell, i've
even added a couple of my own to the pool), since this is an argument
that nobody can win.

axel.

Hi Axel,

Thanks for your ongoing work on this issue; I'm looking forward to it.

quick update. i did some auditing of the tip4p support in verlet/split
and can only come to the conclusion that it is horribly broken, and i
have submitted a patch to steve that prints a corresponding error message.
the comments in the code from the original author hint at the fact that
this person wasn't certain about this either.

axel.

Hi Axel,

Thanks for this useful information. I'll try the approach you suggested to
get better scalability.

Best
Ming

Hi Axel,

Thanks for this useful information. I'll try the approach you suggested to
get better scalability.

here are a couple of examples for improving performance.

the in-hybrid example runs almost twice as fast as your original input on
4 nodes with 12 CPU cores. in that one, you can also "play" with the
coulomb cutoff. to use it, you need to upgrade LAMMPS to the very latest
patchlevel, since only that includes the tip4p/long pair style for use
with hybrid overlay.
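
the gist of the in-hybrid setup is along these lines (type numbers, the O-M
distance, and the cutoffs are placeholders here; the attached file is what
counts):

pair_style    hybrid/overlay lj/cut 10.0 tip4p/long 1 2 1 1 0.1546 12.0
kspace_style  pppm/tip4p 1.0e-4
# lj/cut handles only the lennard-jones part, tip4p/long only the tip4p
# coulomb part, so the coulomb cutoff (12.0 here) can be varied
# independently of the LJ cutoff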

the in-regular input also runs with older versions of LAMMPS; it too has
less load imbalance than your original input and thus better performance.

axel.

in-hybrid.gz (962 Bytes)

in-regular.gz (941 Bytes)