a few more comments on running
this specific input on the gpu.
my first assessment was based on
the information that the run was done
on 150 CPU cores, when in fact it was
done on only one, and that changes
things quite a bit.
i just ran some tests on one of
our test machines. it has a GTX 580
GPU, so it is not fully comparable to
the Tesla C2050 used in the original run.
if you *have* to use only one CPU core and one
GPU, you should also give the USER-CUDA
package a try. i haven't tried it in a while, but
historically, that code performed better for
a large number of atoms per GPU, while the GPU
package was always giving better performance
when running in capability mode. both packages
are continuously developed, so things may
change over time.
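if i remember the switches correctly, USER-CUDA is
enabled at run time roughly like this (lmp stands for
your LAMMPS binary and in.script for your input file;
treat the exact flags as an assumption and check the
documentation of your version):

  lmp -c on -sf cuda -in in.script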
what i see is that the GPU is used
by only one process. most machines
have multiple CPU cores per GPU, and
LAMMPS can take advantage of this by
attaching multiple MPI processes to the
same GPU.
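attaching several ranks to one GPU needs nothing special
beyond starting more MPI processes; with a current LAMMPS
version and the GPU package compiled in, this would look
roughly like the following (the -pk switch is newer syntax;
older versions set the same thing with a package gpu
command in the input script):

  mpirun -np 4 lmp -sf gpu -pk gpu 1 -in in.script

the "1" tells LAMMPS that all ranks share one GPU.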
running the input as is for the first 1000 steps
with one MPI task (and one GPU) yields:
Loop time of 194.372 on 1 procs (1 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.445 ns/day 53.992 hours/ns 5.145 timesteps/s
running with two MPI processes i get:
Loop time of 106.498 on 2 procs (2 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.811 ns/day 29.583 hours/ns 9.390 timesteps/s
running with three MPI processes i get:
Loop time of 73.3959 on 3 procs (3 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.177 ns/day 20.388 hours/ns 13.625 timesteps/s
with four MPI processes i get:
Loop time of 56.1067 on 4 procs (4 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.540 ns/day 15.585 hours/ns 17.823 timesteps/s
with six MPI processes i get:
Loop time of 49.253 on 6 procs (6 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.754 ns/day 13.681 hours/ns 20.303 timesteps/s
only with eight MPI processes does the trend stop:
Loop time of 51.2074 on 8 procs (8 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.687 ns/day 14.224 hours/ns 19.528 timesteps/s
due to the large number of particles per GPU, there is (initially)
no benefit from running pppm on the CPU, but when using more
processes per GPU, the situation changes (a sketch of how to
set this up follows after the numbers):
Loop time of 280.669 on 1 procs (1 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.308 ns/day 77.964 hours/ns 3.563 timesteps/s
Loop time of 146.488 on 2 procs (2 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.590 ns/day 40.691 hours/ns 6.826 timesteps/s
Loop time of 100.034 on 3 procs (3 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.864 ns/day 27.787 hours/ns 9.997 timesteps/s
Loop time of 75.2256 on 4 procs (4 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.149 ns/day 20.896 hours/ns 13.293 timesteps/s
Loop time of 52.1402 on 6 procs (6 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.657 ns/day 14.483 hours/ns 19.179 timesteps/s
Loop time of 40.9061 on 8 procs (8 MPI x 1 OpenMP) for 1000 steps with
125964 atoms
Performance: 2.112 ns/day 11.363 hours/ns 24.446 timesteps/s
Loop time of 39.1422 on 10 procs (10 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.207 ns/day 10.873 hours/ns 25.548 timesteps/s
Loop time of 40.1902 on 12 procs (12 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.150 ns/day 11.164 hours/ns 24.882 timesteps/s
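for reference, one way to keep pppm on the CPU while the
pair style runs on the GPU is to toggle the suffix inside
the input script. this is only a sketch with current
syntax, and the pair style is a placeholder, since the
actual input is not reproduced here:

  package gpu 1                              # all ranks share one GPU
  suffix gpu
  pair_style lj/charmm/coul/long 10.0 12.0   # becomes lj/charmm/coul/long/gpu
  suffix off
  kspace_style pppm 1.0e-4                   # plain pppm, stays on the CPU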
another alternative to consider is using both GPU *and* OpenMP for acceleration,
i.e. GPU for pair and kspace and OpenMP (with two threads) for bonds
(see the sketch after the numbers):
Loop time of 127.787 on 2 procs (1 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.676 ns/day 35.496 hours/ns 7.826 timesteps/s
Loop time of 72.3458 on 4 procs (2 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.194 ns/day 20.096 hours/ns 13.823 timesteps/s
Loop time of 52.1095 on 6 procs (3 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.658 ns/day 14.475 hours/ns 19.190 timesteps/s
Loop time of 50.6327 on 8 procs (4 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.706 ns/day 14.065 hours/ns 19.750 timesteps/s
Loop time of 49.8165 on 12 procs (6 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 1.734 ns/day 13.838 hours/ns 20.074 timesteps/s
Loop time of 51.5851 on 16 procs (8 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 1.675 ns/day 14.329 hours/ns 19.385 timesteps/s
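the mixed runs above follow the same pattern; again a sketch
with placeholder styles, where the omp suffix is switched on
for the bonded terms (package omp 2 requests two threads per
MPI rank; setting OMP_NUM_THREADS=2 works as well):

  package gpu 1
  package omp 2                              # two OpenMP threads per rank
  suffix gpu
  pair_style lj/charmm/coul/long 10.0 12.0   # pair on the GPU
  kspace_style pppm 1.0e-4                   # becomes pppm/gpu
  suffix omp
  bond_style harmonic                        # becomes harmonic/omp
  angle_style harmonic                       # same for the other bonded terms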
...and finally the same exercise with only pair on the GPU and
bond and kspace using OpenMP (again, a sketch follows the numbers):
Loop time of 168.817 on 2 procs (1 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.512 ns/day 46.894 hours/ns 5.924 timesteps/s
Loop time of 92.3525 on 4 procs (2 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 0.936 ns/day 25.653 hours/ns 10.828 timesteps/s
Loop time of 64.3889 on 6 procs (3 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.342 ns/day 17.886 hours/ns 15.531 timesteps/s
Loop time of 46.5042 on 8 procs (4 MPI x 2 OpenMP) for 1000 steps with
125964 atoms
Performance: 1.858 ns/day 12.918 hours/ns 21.503 timesteps/s
Loop time of 39.0201 on 12 procs (6 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.214 ns/day 10.839 hours/ns 25.628 timesteps/s
Loop time of 39.4776 on 16 procs (8 MPI x 2 OpenMP) for 1000 steps
with 125964 atoms
Performance: 2.189 ns/day 10.966 hours/ns 25.331 timesteps/s
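relative to the previous sketch, the only change is that
kspace_style is defined after switching to the omp suffix,
so pppm picks up the /omp variant and runs threaded on the CPU:

  suffix gpu
  pair_style lj/charmm/coul/long 10.0 12.0   # pair stays on the GPU
  suffix omp
  kspace_style pppm 1.0e-4                   # becomes pppm/omp
  bond_style harmonic                        # becomes harmonic/omp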
so the absolute fastest way to run this input on my test machine
(4x AMD Opteron 6238 (Interlagos) at 2.6GHz) would be to use
6 MPI processes with 2 OpenMP threads and GPU acceleration
only for pair forces. those tests were all done without using processor
and memory affinity, which should particularly help the
OpenMP part of the code.
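with OpenMPI, for example, a starting point would be something
like the lines below (the exact binding flags depend on the MPI
library and version, so treat this as an assumption and consult
mpirun's man page):

  export OMP_NUM_THREADS=2
  mpirun -np 6 --bind-to socket lmp -in in.script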
in comparison, using all-MPI on *all* 48 cores *with* processor
affinity runs at:
Loop time of 74.5643 on 48 procs (48 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 1.159 ns/day 20.712 hours/ns 13.411 timesteps/s
and on the 12 CPU cores that would be occupied by the best
GPU-accelerated run, the performance is:
Loop time of 278.323 on 12 procs (12 MPI x 1 OpenMP) for 1000 steps
with 125964 atoms
Performance: 0.310 ns/day 77.312 hours/ns 3.593 timesteps/s
so depending on perspective, one can say that the
one GPU accelerates this simulation by a factor of
almost 2 (versus all 48 cores) or about 7 (versus the
12 cores occupied by the GPU run).
hopefully this illustrates some of the optimization
options available for such calculations.
cheers,
axel.