fix shake + user-cuda not working with 'shake mol template-ID'

I'm trying to get USER-CUDA to work with a SHAKE water model (SPC).
FixShakeCuda::FixShakeCuda() fails when it encounters the 'mol' keyword,
e.g. in:
    ...
    fix 2 Solvent shake 0.0001 10 0 b 1 a 1 t 1 2 mol Water
    ...
because it only recognizes the single-character 'modes' and then bails out:

[fix_shake_cuda.cpp:125]
    ...
    while(next < narg) {

      if(strcmp(arg[next], "b") == 0) mode = 'b';
      else if(strcmp(arg[next], "a") == 0) mode = 'a';
      else if(strcmp(arg[next], "t") == 0) mode = 't';
      else if(strcmp(arg[next], "m") == 0) {
        ...

I'm not sure my attempts to modify this will lead to
anything good, but maybe somebody already has a patch?
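
For what it's worth, the keyword parsing in the plain fix_shake.cpp has
a 'mol' branch roughly like the sketch below; porting just this part is
untested, and FixShakeCuda would also need the molecule-template members
(onemols, nmol) plus all the downstream handling:

    // untested sketch, modeled on the 'mol' branch in fix_shake.cpp
    else if (strcmp(arg[next], "mol") == 0) {
      if (next + 1 >= narg) error->all(FLERR, "Illegal fix shake command");
      int imol = atom->find_molecule(arg[next+1]);   // look up template by ID
      if (imol == -1)
        error->all(FLERR, "Molecule template ID for fix shake does not exist");
      onemols = &atom->molecules[imol];   // members FixShakeCuda doesn't have yet
      nmol = onemols[0]->nset;
      next++;
    }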

Regards & Thanks

M.

the code in USER-CUDA predates the introduction of molecule templates,
and the USER-CUDA package is essentially abandoned. the LAMMPS
maintainers try to keep the code working, but there are by now several
known issues. adding support for molecule templates is likely
going to be a significant undertaking that would affect much more than
just fix_shake_cuda.cpp.

is there a reason why you cannot use the GPU package instead? on
modern hardware with multiple CPU cores per GPU, the GPU package
should be much more effective than USER-CUDA despite its more
minimalist approach, since several MPI tasks can oversubscribe the GPU.

axel.

Thank you for your recommendation; I was under the impression
USER-CUDA would be much faster than any other acceleration package. OK,
if it's no longer maintained and doesn't support recent features,
there's no point in using it anymore.

The gpu package supports all features I need, in compact systems
(liquid water, lbox=45A) it's considerably faster than user-omp
(GTX-980 on hexacore Intel) but gets very slow on large, sparse
systems (lbox=500A, few molecules, large pppm grid) - but maybe I'm
doing something wrong (user-cuda ~50x(!) faster than gpu).

   $> lmp_git_gpu -sf gpu -pk gpu 1 -in input.in
   ...
   using 12 OpenMP thread(s) per MPI task
   ...
   PPPM initialization ...
     G vector (1/distance) = 0.208377
     grid = 144 144 144
   ...
   - Using acceleration for pppm:
   - with 1 proc(s) per device.
   Device 0: GeForce GTX 980, 16 CUs, 3.6/4 GB, 1.3 GHZ (Single Precision)
  ...

Thanks & regards

M.

> The gpu package supports all features I need, in compact systems
> (liquid water, lbox=45A) it's considerably faster than user-omp
> (GTX-980 on hexacore Intel) but gets very slow on large, sparse

with or without hyperthreading enabled?

> systems (lbox=500A, few molecules, large pppm grid) - but maybe I'm
> doing something wrong (user-cuda ~50x(!) faster than gpu).

there are a number of issues. first of all, you don't seem to be using
multiple MPI tasks on the same GPU. with the GPU package, this is a
significant improvement, since you can run the rest of the code in
parallel. also, i would recommend running PPPM on the CPU rather than
on the GPU, and finally, tune the real space cutoff for optimal
performance. if your system is sparse, then you can usually speed
things up this way, since the amount of work per atom remains
reasonable and the efficiency of the GPU becomes higher with more
neighbors. on the other hand, the cost of PPPM depends on the volume
(mind you, the default kspace parameter estimators are only good for
homogeneous and dense systems; for others you likely need to override
the grid to get converged forces and energies) and then scales with
N*log(N) plus additional communication to redistribute the grid. using
a larger real space cutoff can significantly reduce the cost of kspace,
*and* the GPU package can run the pair style forces on the GPU
concurrently with the bonded and kspace computations on the CPU, so
some of the extra work pushed to the GPU effectively comes for free. it
takes some effort to tweak this, but it will be worth it.
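
to illustrate (the cutoff and kspace accuracy below are placeholders
you have to tune for your own system), such a setup could look roughly
like this:

    # in input.in: pair style explicitly on the GPU, pppm left on the
    # CPU (no /gpu suffix); enlarged real space cutoff
    package      gpu 1
    pair_style   lj/cut/coul/long/gpu 12.0 12.0
    kspace_style pppm 1.0e-4

    # launched with several MPI ranks sharing the one GPU:
    $> mpirun -np 4 lmp_git_gpu -in input.in

how many ranks per GPU are optimal is something you have to measure.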

also, i would refrain from using all single precision. at the very
least use mixed precision. especially if you have a large volume,
the impact on the accuracy of the forces from how far the atoms are
away from the origin is quite noticeable for large boxes. some careful
testing is definitely advisable, even for all-CPU runs.
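
for the GPU package, the precision is set when the gpu library is
compiled. assuming you build with one of the makefiles in lib/gpu
(the makefile name and machine target below are just whatever you
normally use), the relevant line is something like:

    # in lib/gpu/Makefile.linux (or whichever makefile you build with):
    CUDA_PRECISION = -D_SINGLE_DOUBLE   # mixed; -D_DOUBLE_DOUBLE for full double
    # then rebuild the library and re-link LAMMPS:
    $> cd lib/gpu && make -f Makefile.linux clean && make -f Makefile.linux
    $> cd ../../src && make <your-machine-target>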

axel.

> with or without hyperthreading enabled?

Hyperthreading enabled, USER-OMP installed, export OMP_NUM_THREADS=12
and export OMP_PROC_BIND=true set.

> there are a number of issues. first of all, you don't seem to be using
> multiple MPI tasks on the same GPU. with the GPU package, this is a
> significant improvement, since you can run the rest of the code in
> parallel.

I did this now with a combination of GPU and USER-OMP with 12 OMP threads
in a single MPI task, and explicitly used the CPU for kspace and the
GPU for the pair style. Starting with a 'dense' system of 2916 SPC water
molecules, fully shake-constrained and using lj/cut/coul/long 10.0 10.0,
I could reach acceptable performance (npt, 1 fs, 100K steps):

  gpu+user-omp 1080 sec

comparison:
  user-omp (12t) 1680 sec
  user-cuda (1t) 600 sec

I did not manage to get more than one MPI task running with the
single GPU ("ERROR: Accelerator sharing is not currently supported on system").

After a first test with a longer coulomb cutoff (starting with a factor
of 2, which results in a coarser grid), I see a tremendous speedup on
the sparse system! Especially after combining USER-OMP + GPU in a single
MPI task:

   $> lmp_git_gpu -sf gpu -pk gpu 1 -sf omp -pk omp 12 -in input.in

and using explicit accelerator styles as you recommended. That's it -
I now have something to experiment with.
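
For reference, the explicit-style part of such an input is roughly of
this form (with the package settings in the script instead of the -pk
flags; the cutoff and pppm accuracy are placeholders, not my tuned values):

    # pair force on the GPU, kspace on the CPU with USER-OMP threads
    package      gpu 1
    package      omp 12
    pair_style   lj/cut/coul/long/gpu 10.0 10.0
    kspace_style pppm/omp 1.0e-4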

> also, i would refrain from using all single precision. at the very
> least use mixed precision. [...]

This appears to be a very valuable recommendation, especially for
somebody like me who only recently started working with LAMMPS. I did
in fact compile the gpu library single/single and used a single-precision
fftw3f, but I will now double-check results on larger boxes with
a double/double or single/double version.

Regards & Thank you very much

M.

> Hyperthreading enabled, USER-OMP installed, export OMP_NUM_THREADS=12
> and export OMP_PROC_BIND=true set.

*bad* idea. USER-OMP is at best effective for half the number of
threads and rarely benefits from hyperthreading.
MPI parallelization is often more effective. that being said, most
/omp styles have optimizations similar to or better than the /opt pair
styles, so they are often significantly faster with just one thread
than their regular counterparts (but the code is a bit more complex to
read).
most people i know use USER-OMP with 2-3 threads per MPI task and 2-6
MPI tasks per node (depending on the number of cores per node). you
should always make sure that all threads are confined to a single
socket or subunit where they share caches.

in my tests, hyperthreading is usually only effective when running on
the CPU exclusively. and in OpenMP/MPI hybrid mode, you need to
run in bind-to-socket mode from the MPI library, or else you may be
binding threads from different MPI tasks to the same cores.
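
for example, with a recent OpenMPI (other MPI libraries use different
binding options), that would be something along these lines:

    # 6 MPI tasks x 2 OpenMP threads each, threads of a task bound to one socket
    $> export OMP_NUM_THREADS=2
    $> mpirun -np 6 --bind-to socket lmp_git_gpu -sf omp -pk omp 2 -in input.in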

> I did not manage to get more than one MPI task running with the
> single GPU ("ERROR: Accelerator sharing is not currently supported on
> system").

use nvidia-smi to reconfigure your GPU and make it a persistent change.
i wouldn't be surprised if you can squeeze out another factor of 2 or more.
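
e.g. roughly like this (assuming device 0 and root privileges):

    # put the GPU back into the default (shared) compute mode, and enable
    # persistence mode so the setting is not lost when the driver unloads
    $> sudo nvidia-smi -i 0 -c DEFAULT
    $> sudo nvidia-smi -i 0 -pm 1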

> *bad* idea. USER-OMP is at best effective for half the number of
> threads and rarely benefits from hyperthreading.
> MPI parallelization is often more effective. [...]
> most people i know use USER-OMP with 2-3 threads per MPI task and 2-6
> MPI tasks per node (depending on the number of cores per node). you
> should always make sure that all threads are confined to a single
> socket or subunit where they share caches.

I tested my example system according to your suggestions and everything
turned out as you predicted. It is now clear to me that any performance
gain depends significantly on the number of MPI tasks for the system
in question.

> in my tests, hyperthreading is usually only effective when running on
> the CPU exclusively. [...]

This is what I saw in my tests. On a single 6-core CPU, running 2 x 6
threads appears to be somewhat more effective than using the 6 cores
alone. Also, the proper combination of /omp and /gpu styles resulted
in the largest speedup:

user-omp, default: 1243.580 s
user-omp + 6 MPI x 2 OMP: 253.143 s
gpu, default: 470.295 s
gpu + 6 MPI x 2 OMP: 149.798 s

user-cuda, default: 122.577 s (for comparison)

combined lj/cut/coul/long/gpu + pppm/omp
    6 MPI x 1 OMP: 124.283 s
    6 MPI x 2 OMP: 115.841 s
    6 MPI x 4 OMP: 133.071 s (oversubscribing threads)

combined lj/cut/coul/long/omp + pppm/gpu
    6 MPI x 1 OMP: 575.908 s
    6 MPI x 2 OMP: 425.269 s

The system is npt, 8748 atoms (2916 SPC + shake), 1fs, 20K steps,
pppm w/30x30x30 grid (a small cube of liquid water at rho=0.97).

> use nvidia-smi to reconfigure your GPU and make it a persistent change.
> i wouldn't be surprised if you can squeeze out another factor of 2 or more.

Oops, yes - I had put it into exclusive mode yesterday in order
to "gain some performance" :frowning:

Thank you very much for your helpful hints, maybe I'm doing much
less wrong now than before ...

Regards

M.