# Bonds as rate-limiting step in very large simulation?

Dear all:

In some recent simulations of a polymer in ionic liquid, with about 300,000 atoms running on about 150 cores, we’re getting some strange results. For a simulation that has a complete set of partial charges and thus uses k-space methods, we’re seeing the bonded interactions as the rate-limiting step, and by a rather large margin:

Pair time (%) = 11.2315 (5.60342)
Bond time (%) = 162.37 (81.0069)
Kspce time (%) = 20.3289 (10.1421)
Neigh time (%) = 0.946642 (0.472284)
Comm time (%) = 1.04557 (0.521637)
Outpt time (%) = 0.00497484 (0.00248197)
Other time (%) = 4.51216 (2.25113)

However, I can’t think of a logical reason why this should be the case—particularly given the fact that there’s the k-space summation also taking place. Is there any good reason why this should be the case? (Visualization of the system doesn’t reveal any obvious problems, and the simulation has run successfully for several nanoseconds.)

Thanks,

–AEI

I am not qualified to answer (and I agree that this seems odd), but my
unofficial advice would be to try disabling or changing things and seeing
what happens.

Do you get the same results running on a single core (for a short
time)? (I assume yes.) Cut out the entire "Dihedrals" section of the
data file (disable all dihedrals) and see what happens. Do the same for
the "Angles" section and the "Impropers" section. (And set the
corresponding dihedral counter at the top of the data file to 0.) You
can use awk to filter out only the lines in any of these sections
that use a certain dihedral type:
awk '{if ($2 != 14) print $0}' < DihedralsSection.txt > NewDihedralsSection.txt
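As a self-contained illustration of that kind of filtering, here is a toy run on a fake three-line Dihedrals section (the IDs, types, and file names are made up for the example):

```shell
# toy "Dihedrals" section: dihedral-ID dihedral-type atom1 atom2 atom3 atom4
printf '1 14 5 6 7 8\n2 3 6 7 8 9\n3 14 7 8 9 10\n' > DihedralsSection.txt
# drop every dihedral whose type (field 2) is 14
awk '{if ($2 != 14) print $0}' < DihedralsSection.txt > NewDihedralsSection.txt
cat NewDihedralsSection.txt
# prints: 2 3 6 7 8 9
```

Remember to also reduce the dihedral count in the data-file header by the number of lines removed.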

I don't know if that helped at all. Perhaps somebody else has some
better advice. Good luck and tell us what you found.
Andrew

a ton of dihedrals?

i've seen something similar with implicit solvent models,
but never to this extreme. dihedral calculations, with lots
of sines and cosines, can be quite time consuming if there
are a lot of them, and the neighbor lists are short.

however, it is not immediately obvious how this would happen.

would it be possible to have a look at your input data somehow?

axel.

Dear all:

There are sample files and inputs for HPC and GPU versions at:

The problem appears to arise in the GPU version only; the HPC version works as expected, with the bonded calculations taking up about 1% as much time as in the GPU version.

Thanks,

—AEI

this is (somewhat) expected behavior.

it works as follows:
the pair style time is only the time needed to set up the
GPU kernels; the work is then dispatched to the GPU and
control returns to the CPU, which goes on to compute bonds
and kspace. since you also run kspace on the GPU, the CPU
then has to wait for the pair kernels to finish, and that
waiting time apparently gets accounted mostly to the bond
timing.

this hints that you should not run pppm on the GPU but on
the CPU instead. pppm doesn't accelerate as much as the
non-bonded part (particularly if you don't use single
precision FFTs, and thus no single precision pppm
acceleration). this way the non-bonded calculation continues
asynchronously on the GPU while bonds *and* pppm are
computed on the CPU, and the code only waits for the
non-bonded forces after pppm is done.

because of this behavior, you can also tweak the performance
a little by varying the coulomb cutoff, since this lets you
balance the load between CPU and GPU.
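as a rough sketch, the relevant input-script lines could look
something like this (a minimal sketch only: the pair style,
cutoff, and package settings are assumptions, not taken from the
original input, and the exact syntax is version dependent):

```
# keep pppm on the CPU, accelerate only the non-bonded part on the GPU
package      gpu force/neigh 0 0 1      # GPU package setup (illustrative settings)
pair_style   lj/cut/coul/long/gpu 10.0  # GPU pair style; cutoff tunable for load balance
kspace_style pppm 1.0e-4                # plain CPU pppm, *not* pppm/gpu
```

raising or lowering the coulomb cutoff then shifts work between
the GPU (pair) and the CPU (pppm).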

hope this helps to clarify matters,
axel.

Interesting. Thanks.

a few more comments on running
this specific input on the gpu.
my first assessment was based on
the information that the run was done
on 150 CPU cores, when in fact it was
done on only one, and that changes
things quite a bit.

i just ran some tests on one of
our test machines. it has a GTX 580
GPU, so it is not fully comparable to
the tesla C2050 used in the original run.

if you *have* to use only one cpu core and one
GPU, you should also give the USER-CUDA
package a try. i haven't tried it for a while, but
historically this code performed better for
a large number of atoms per GPU, while the GPU
package was always giving better performance
when running in capability mode. both packages
are continuously developed, so things may
change over time.
what i also see is that your run uses the GPU
with only one process. most machines
have multiple CPU cores per GPU, and
LAMMPS can take advantage of this by
attaching multiple MPI processes to the
same gpu.
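for illustration, such an oversubscribed launch might look like
this (the binary and input file names are placeholders, not from
the original run):

```
# four MPI ranks sharing one GPU (hypothetical binary/input names)
mpirun -np 4 ./lmp_gpu -in in.polymer
```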

running the input as is for the first 1000 steps
with one MPI task (and one GPU) yields:

Loop time of 194.372 on 1 procs (1 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.445 ns/day 53.992 hours/ns 5.145 timesteps/s
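as a sanity check on how these figures relate: timesteps/s is just
steps divided by loop time, and ns/day follows from it if one assumes
a 1 fs timestep (an assumption, but consistent with the numbers
reported here):

```shell
# reproduce the reported performance figures from the loop time,
# assuming a 1 fs timestep (so 1000 steps = 1.0e-3 ns)
awk -v t=194.372 -v steps=1000 'BEGIN {
  sps   = steps / t              # timesteps per second
  nsday = sps * 86400 * 1.0e-6   # ns simulated per day at 1 fs/step
  printf "%.3f timesteps/s, %.3f ns/day\n", sps, nsday
}'
```

which prints 5.145 timesteps/s and 0.445 ns/day, matching the log line above.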

running with two MPI processes i get:

Loop time of 106.498 on 2 procs (2 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.811 ns/day 29.583 hours/ns 9.390 timesteps/s

running with three MPI processes i get:

Loop time of 73.3959 on 3 procs (3 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.177 ns/day 20.388 hours/ns 13.625 timesteps/s

with four MPI i get:

Loop time of 56.1067 on 4 procs (4 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.540 ns/day 15.585 hours/ns 17.823 timesteps/s

with six MPI i get:

Loop time of 49.253 on 6 procs (6 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.754 ns/day 13.681 hours/ns 20.303 timesteps/s

only with eight MPI tasks does the trend stop:

Loop time of 51.2074 on 8 procs (8 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.687 ns/day 14.224 hours/ns 19.528 timesteps/s

due to the large number of particles per GPU, there is (initially)
no benefit from running pppm on the CPU, but when using more
processes per GPU, the situation changes:

Loop time of 280.669 on 1 procs (1 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.308 ns/day 77.964 hours/ns 3.563 timesteps/s

Loop time of 146.488 on 2 procs (2 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.590 ns/day 40.691 hours/ns 6.826 timesteps/s

Loop time of 100.034 on 3 procs (3 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.864 ns/day 27.787 hours/ns 9.997 timesteps/s

Loop time of 75.2256 on 4 procs (4 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.149 ns/day 20.896 hours/ns 13.293 timesteps/s

Loop time of 52.1402 on 6 procs (6 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.657 ns/day 14.483 hours/ns 19.179 timesteps/s

Loop time of 40.9061 on 8 procs (8 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 2.112 ns/day 11.363 hours/ns 24.446 timesteps/s

Loop time of 39.1422 on 10 procs (10 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.207 ns/day 10.873 hours/ns 25.548 timesteps/s

Loop time of 40.1902 on 12 procs (12 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.150 ns/day 11.164 hours/ns 24.882 timesteps/s

another alternative to consider is using both the GPU *and* OpenMP for
acceleration, i.e. the GPU for pair and kspace and OpenMP (with two
threads) for bonds.
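the corresponding input settings could be sketched roughly like this
(the style names assume the GPU and USER-OMP packages and a harmonic
bond style, none of which are confirmed by the original input; check
the exact syntax against the documentation for your version):

```
package      gpu force/neigh 0 0 1       # GPU package (illustrative settings)
package      omp 2                       # two OpenMP threads per MPI task (USER-OMP)
pair_style   lj/cut/coul/long/gpu 10.0   # pair on the GPU
kspace_style pppm/gpu 1.0e-4             # kspace on the GPU as well
bond_style   harmonic/omp                # bonded terms threaded on the CPU
```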

Loop time of 127.787 on 2 procs (1 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.676 ns/day 35.496 hours/ns 7.826 timesteps/s

Loop time of 72.3458 on 4 procs (2 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.194 ns/day 20.096 hours/ns 13.823 timesteps/s

Loop time of 52.1095 on 6 procs (3 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.658 ns/day 14.475 hours/ns 19.190 timesteps/s

Loop time of 50.6327 on 8 procs (4 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.706 ns/day 14.065 hours/ns 19.750 timesteps/s

Loop time of 49.8165 on 12 procs (6 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 1.734 ns/day 13.838 hours/ns 20.074 timesteps/s

Loop time of 51.5851 on 16 procs (8 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 1.675 ns/day 14.329 hours/ns 19.385 timesteps/s

...and finally the same exercise with only pair on the GPU and
bond and kspace using OpenMP:

Loop time of 168.817 on 2 procs (1 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.512 ns/day 46.894 hours/ns 5.924 timesteps/s

Loop time of 92.3525 on 4 procs (2 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.936 ns/day 25.653 hours/ns 10.828 timesteps/s

Loop time of 64.3889 on 6 procs (3 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.342 ns/day 17.886 hours/ns 15.531 timesteps/s

Loop time of 46.5042 on 8 procs (4 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.858 ns/day 12.918 hours/ns 21.503 timesteps/s

Loop time of 39.0201 on 12 procs (6 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.214 ns/day 10.839 hours/ns 25.628 timesteps/s

Loop time of 39.4776 on 16 procs (8 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.189 ns/day 10.966 hours/ns 25.331 timesteps/s

so the absolute fastest way to run this input on my test machine
(4x AMD Opteron 6238 (Interlagos) at 2.6GHz) is to use
6 MPI processes with 2 OpenMP threads and GPU acceleration
only for the pair forces. those tests were all done without processor
and memory affinity, which should particularly help the OpenMP
part of the code.

in comparison, using all-MPI on *all* 48 cores *with* processor
affinity runs at:
Loop time of 74.5643 on 48 procs (48 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 1.159 ns/day 20.712 hours/ns 13.411 timesteps/s

and on the 12 CPU cores that would be occupied by the best
GPU effort run the performance is:
Loop time of 278.323 on 12 procs (12 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 0.310 ns/day 77.312 hours/ns 3.593 timesteps/s

so depending on perspective, one can say that the
one GPU accelerates the simulation by a factor of
almost 2 (74.6 s vs. 39.0 s, against all 48 cores) or
about 7 (278.3 s vs. 39.0 s, against the 12 cores the
best GPU run occupies).

hopefully this illustrates some of the optimization
options available for such calculations.

cheers,
axel.