GPU and rebuilding of the neighbor list

Hello all,

I would like to run some GPU jobs to speed up the simulation. The simulation system is composed of a carbon nanotube and water, and I have some questions about rebuilding the neighbor list.

In the previous calculations performed on CPUs, the neighbor list of the carbon nanotube was not rebuilt. Therefore, I used the command ‘neigh_modify exclude group cnt cnt’ to turn off the pair interactions between the carbon atoms.

My questions are:

1. When using GPUs, I have to specify the command ‘package gpu force 0 1 1’ at the beginning of the input script. It seems that the neigh_modify command is then no longer available. How can I combine the ‘neigh_modify’ and ‘package’ commands?

2. When I have the neighbor list rebuilt on the CPUs, everything works well, but the calculation is even much slower than on CPUs alone. I expected the calculation to be more efficient on GPUs. How does that happen?

I appreciate any help!

Thanks,

Hang

A few remarks:
-Could you post your input script?

-Why do you think that you can't use neigh_modify together with a package command?

-You probably should let the GPU do the neighbor list update as well and use force/neigh instead of force as an argument to the package command (a minimal sketch follows these remarks).

-The GPU package should be used with more than one MPI process per GPU. Did you do that?

-You could also try out the USER-CUDA package; depending on your system size and hardware configuration, it might be faster.

-What kind of GPU do you have? Small GPUs typical of some workstations (like a cheap Quadro) are not that well suited for computations.
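
As a minimal sketch (reusing the old-style package gpu arguments from your script; the GPU IDs and split value are just placeholders), the two alternatives would look like this:

# alternative 1: the CPU builds the neighbor list, the GPU computes the pair forces
package gpu force 0 1 1

# alternative 2: the GPU also builds the neighbor list
package gpu force/neigh 0 1 1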

Cheers
Christian

-------- Original Message --------

Hi Christian,

Thanks very much!

Here is the input file; please let me know if there is anything wrong.

newton off

package gpu force 0 1 1

units real
atom_style full

pair_style lj/cut/coul/long/gpu 12.0
kspace_style pppm/gpu 0.0001
pair_modify shift yes mix arithmetic
neighbor 0.3 bin
neigh_modify every 1 delay 10 check yes
boundary p p p

read_data data.lammps

timestep 1.0
thermo_style custom step temp etotal

group cnt type 1 2
group water type 3 4

neigh_modify exclude group cnt cnt

fix 1 cnt setforce 0 0 0

min_style cg
min_modify dmax 0.2
minimize 0.01 0.001 100 100000

fix rigid rigidpart shake 0.0000001 30 0 b 1 a 1

thermo_style custom step temp epair etotal press vol

fix 2 water nvt temp 300.0 300.0 100.0
run 1000
unfix 2
suffix gpu

2012/2/17 Christian Trott <ceearem@…116…>

A few remarks:
-Could you post your input script?

-Why do you think that you can't use neigh_modify together with a package command?

When I use neigh_modify with the package gpu command, the output energy is not the same (it is infinite) as the result calculated on the CPU. Therefore, I supposed that the neigh_modify command cannot work with the package command.

-You probably should let the GPU do the neighbor list update as well and use force/neigh instead of force as an argument to the package command.

I tried to do that, but I got the wrong energy.

-The GPU package should be used with more than one MPI process per GPU. Did you do that?

Yes, I did.

-You could also try out the USER-CUDA package; depending on your system size and hardware configuration, it might be faster.

I will try that later. Thanks for the reminder.

-What kind of GPU do you have? Small GPUs typical of some workstations (like a cheap Quadro) are not that well suited for computations.

There are 16 NVIDIA Tesla C2050 GPUs, each with 448 cores and 2 GB of memory.

2012/2/17 Christian Trott <ceearem@…116…>

Hi Hang,

regarding the gpu package flags, the GPU IDs are inclusive, meaning that “0 1” requests 2 GPUs 0 and 1 in the node. Did you try running with “0 0” to test with one GPU? If you have two GPUs physically connected to a node, I suggest you try running with more than 2 MPI processes with “0 1”.
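
For instance, something along these lines (a sketch reusing the old-style package gpu arguments from your script; the trailing "1" is your split value):

# use only GPU 0 on the node
package gpu force 0 0 1

# use GPUs 0 and 1 on the node (run more than 2 MPI processes so both GPUs are shared)
package gpu force 0 1 1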

As far as I know, the gpu package hasn’t supported “neigh_modify exclude” with “force/neigh” yet. You can check to see if “gpu package force” with “neigh_modify exclude” could be faster than “gpu package force/neigh” without “neigh_modify exclude” in your particular simulated systems.

For “force/neigh” without “neigh_modify exclude”, you can subtract the pairwise interaction energy within group cnt from the system potential energy to get the effective pe, which is equivalent to using “gpu package force”, or CPU, with “neigh_modify exclude”. For example,

compute pe_cnt cnt group/group cnt
variable pe_eff equal pe-c_pe_cnt/count(all)
thermo_style custom step temp etotal pe c_pe_cnt v_pe_eff

or, in case the thermo output is not normalized:

compute pe_cnt cnt group/group cnt
variable pe_eff equal pe-c_pe_cnt
thermo_style custom step temp etotal pe c_pe_cnt v_pe_eff
thermo_modify norm no

If the van der Waals energy between the atoms in the cnt is huge for some reason, you can set the epsilon value of their lj term to be zero so that the above subtraction gives meaningful numbers.
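
Something like the following (a hypothetical sketch; types 1 and 2 are the CNT atoms in your script, and the sigma value is a placeholder that no longer matters once epsilon is zero):

# zero the C-C LJ epsilon so the cnt-cnt vdW term contributes nothing
pair_coeff 1 1 0.0 3.4
pair_coeff 1 2 0.0 3.4
pair_coeff 2 2 0.0 3.4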

Cheers,
-Trung

2012/2/17 陈航燕 <physics.hangyan.chen@…24…>

Hi Trung,

I appreciate your help!

regarding the gpu package flags, the GPU IDs are inclusive, meaning that “0 1” requests 2 GPUs 0 and 1 in the node. Did you try running with “0 0” to test with one GPU? If you have two GPUs physically connected to a node, I suggest you try running with more than 2 MPI processes with “0 1”.

Thanks, I will do that.

As far as I know, the gpu package hasn’t supported “neigh_modify exclude” with “force/neigh” yet. You can check to see if “gpu package force” with “neigh_modify exclude” could be faster than “gpu package force/neigh” without “neigh_modify exclude” in your particular simulated systems.

That’s the problem. I didn’t find that the calculation was accelerated by using “gpu package force” with “neigh_modify exclude”. Most of the time was spent on the neighbor calculation.

Pair time (%) = 4.92542 (3.08564)    Bond time (%) = 0.00609803 (0.00382025)
Kspce time (%) = 12.495 (7.8278)    Neigh time (%) = 137.339 (86.0391)
Comm time (%) = 0.44152 (0.2766)    Outpt time (%) = 0.0417147 (0.0261331)
Other time (%) = 4.37508 (2.74087)

For “force/neigh” without “neigh_modify exclude”, you can subtract the pairwise interaction energy within group cnt from the system potential energy to get the effective pe, which is equivalent to using “gpu package force”, or CPU, with “neigh_modify exclude”. For example,

compute pe_cnt cnt group/group cnt
variable pe_eff equal pe-c_pe_cnt/count(all)
thermo_style custom step temp etotal pe c_pe_cnt v_pe_eff

or, in case the thermo output is not normalized:

compute pe_cnt cnt group/group cnt
variable pe_eff equal pe-c_pe_cnt
thermo_style custom step temp etotal pe c_pe_cnt v_pe_eff
thermo_modify norm no

If the van der Waals energy between the atoms in the cnt is huge for some reason, you can set the epsilon value of their lj term to be zero so that the above subtraction gives meaningful numbers.

From the above suggestion, it seems to me that I can only modify the energy output so that it appears normal, right? I did the test as you mentioned: the epsilon value was set to zero for the carbon atoms. However, I found that the output energy was normal (not infinite), but the pressure was ‘nan’. How did that happen?

Many many thanks!

some comments below.

That’s the problem. I didn’t find that the calculation was accelerated by using “gpu package force” with “neigh_modify exclude”. Most of the

well, the pair interactions are accelerated and if they are not,
there is something else wrong with your input.

time was spent on the neighbor calculation.

Pair time (%) = 4.92542 (3.08564)    Bond time (%) = 0.00609803 (0.00382025)
Kspce time (%) = 12.495 (7.8278)    Neigh time (%) = 137.339 (86.0391)
Comm time (%) = 0.44152 (0.2766)    Outpt time (%) = 0.0417147 (0.0261331)
Other time (%) = 4.37508 (2.74087)

For “force/neigh” without “neigh_modify exclude”, you can subtract the pairwise interaction energy within group cnt from the system potential energy to get the effective pe, which is equivalent to using “gpu package force”, or CPU, with “neigh_modify exclude”. For example,

compute pe_cnt cnt group/group cnt
variable pe_eff equal pe-c_pe_cnt/count(all)
thermo_style custom step temp etotal pe c_pe_cnt v_pe_eff

or, in case the thermo output is not normalized:

compute pe_cnt cnt group/group cnt
variable pe_eff equal pe-c_pe_cnt
thermo_style custom step temp etotal pe c_pe_cnt v_pe_eff
thermo_modify norm no

compute group/group is counterproductive since it will
recompute the pair interactions on the CPU again and
in a not overly efficient way. better to run all-CPU in
this case.

If the van der Waals energy between the atoms in the cnt is huge for some reason, you can set the epsilon value of their lj term to be zero so that the above subtraction gives meaningful numbers.

From the above suggestion, it seems to me that I can only modify the energy output so that it appears normal, right? I did the test as you mentioned: the epsilon value was set to zero for the carbon atoms. However, I found that the output energy was normal (not infinite), but the pressure was ‘nan’. How did that happen?

sounds a lot like there is something wrong with your simulation
setup outside of using GPUs. i would first try to do the same
setup without the GPU and then compare against the GPU
code with all double precision and then try mixed precision.
large forces can overflow in single precision much faster than
in double precision. also, i would suspect that your intra CNT
parameters are bad or the geometry that you use for it.

cheers,
axel.

Hello Axel,

Thanks for your comments.

I will do some more tests as you mentioned to figure out what’s going on:

sounds a lot like there is something wrong with your simulation
setup outside of using GPUs. i would first try to do the same
setup without the GPU and then compare against the GPU
code with all double precision and then try mixed precision.
large forces can overflow in single precision much faster than
in double precision. also, i would suspect that your intra CNT
parameters are bad or the geometry that you use for it.

Best,
Hang

Hi

I did some quick checks as well. For me, using just "package gpu force 0 1 1" together with neigh_modify exclude worked well enough. The best time I got with your configuration was with heavy oversubscription, though. With 2 GPUs and 12 MPI processes (on a dual hex-core node) I got 1.4 s (single prec) / 3.7 s (double prec) for 100 steps, as opposed to 6.3 s with CPUs alone. Note that I used "neighbor 2.0 bin" for single prec and "neighbor 1.0 bin" for double prec, though.

With the USER-CUDA package + USER-OMP for pppm and the bonded interactions I got 1.6 s in single prec and 2.7 s in double prec. For both I used "neighbor 1.0 bin".

Cheers
Christian

Detailed timings:
CPU:
Loop time of 6.06247 on 12 procs (12 MPI x 1 OpenMP) for 100 steps with 27623 atoms

Pair time (%) = 2.97785 (49.1195)    Bond time (%) = 7.08302e-05 (0.00116834)
Kspce time (%) = 2.01748 (33.2781)    Neigh time (%) = 0.909485 (15.0019)
Comm time (%) = 0.0434309 (0.716391)    Outpt time (%) = 9.68377e-05 (0.00159733)
Other time (%) = 0.114056 (1.88135)

GPU Double Prec:
Loop time of 3.70102 on 12 procs (12 MPI x 1 OpenMP) for 100 steps with 27623 atoms

Pair time (%) = 0.929584 (25.117)    Bond time (%) = 8.42015e-05 (0.00227509)
Kspce time (%) = 0.698344 (18.869)    Neigh time (%) = 0.859942 (23.2353)
Comm time (%) = 0.0438945 (1.18601)    Outpt time (%) = 9.49701e-05 (0.00256605)
Other time (%) = 1.16908 (31.588)

GPU Single Prec:
Loop time of 1.43997 on 12 procs (12 MPI x 1 OpenMP) for 100 steps with 27623 atoms

Pair time (%) = 0.0942881 (6.54791)    Bond time (%) = 0.000101666 (0.00706026)
Kspce time (%) = 0.520849 (36.1708)    Neigh time (%) = 0.507215 (35.224)
Comm time (%) = 0.0471223 (3.27245)    Outpt time (%) = 9.63608e-05 (0.00669186)
Other time (%) = 0.270299 (18.7712)

CUDA Double Prec:
Loop time of 2.79023 on 12 procs (2 MPI x 6 OpenMP) for 100 steps with 27623 atoms

Pair time (%) = 0.946571 (33.9244)    Bond time (%) = 0.000649452 (0.0232759)
Kspce time (%) = 0.65732 (23.5579)    Neigh time (%) = 0.399988 (14.3353)
Comm time (%) = 0.0991979 (3.55518)    Outpt time (%) = 0.000181317 (0.00649829)
Other time (%) = 0.686325 (24.5974)

CUDA Single Prec:
Loop time of 1.66011 on 12 procs (2 MPI x 6 OpenMP) for 100 steps with 27623 atoms

Pair time (%) = 0.247013 (14.8793)    Bond time (%) = 0.000660181 (0.0397674)
Kspce time (%) = 0.662051 (39.8801)    Neigh time (%) = 0.39837 (23.9966)
Comm time (%) = 0.080032 (4.8209)    Outpt time (%) = 0.000172019 (0.0103619)
Other time (%) = 0.271808 (16.3729)

-------- Original Message --------

In case it hasn't been clear from the other e-mails, if you use the "force" option instead of the "force/neigh" option, the neighbor list is built on the CPU using standard LAMMPS routines and copied to the accelerator. This is compatible with all of the neighbor options in LAMMPS.

"force/neigh" does not support neigh_modify exclude and LAMMPS will generate an error if you try to use this combination. There are several options to get around this. Examples are:

1. Use CPU neighbor list builds with the "force" option. In this case, you will want to run many MPI processes per GPU in order to parallelize the neighbor list build on the CPU. This can impact performance. In cases where you can scale up to multiple nodes with GPUs, the impact can be minimal.

2. Use GPU neighbor list builds and set the cutoff in the pair_coeff for the types you want to exclude to 0.
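
For option 2, a hypothetical sketch based on the posted script (types 1 and 2 are the CNT carbons; the epsilon and sigma values are placeholders for whatever the actual force field uses):

# per-pair LJ cutoff of 0.0 for the carbon-carbon type pairs, so those pairs are
# effectively skipped even when the neighbor list is built on the GPU
pair_coeff 1 1 0.07 3.55 0.0
pair_coeff 1 2 0.07 3.55 0.0
pair_coeff 2 2 0.07 3.55 0.0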

- Mike

Hello Axel,

I am so sorry that I made a mistake with the timestep setting (in metal units I should use a timestep of 0.001, not 1.0) when I tried to modify the input file to work with GPUs. Now everything (including the energy and pressure) looks good.

On February 20, 2012, at 5:19 PM, 陈航燕 <physics.hangyan.chen@…24…> wrote:

Hello Christian,

I appreciate your help so much!

I did the same test as you did with 2 GPUs and 12 MPI processes (single prec), which was the most efficient case in your tests. But the calculation took 56 seconds.

Your calculation was so much more efficient. I was thinking it’s due to the difference in GPUs, if I didn’t make a mistake in the GPU settings.

Best,

Hang

2012/2/20 Christian Trott <ceearem@…33…116…>

Hi Mike,

Thanks for your suggestions. Now I understand the strange results calculated from ‘force/neigh’ combined with ‘neigh_modify exclude’. I really appreciate that.

I would like to do several tests to figure out which one gives the best efficiency with GPUs.

Best,

Hang

2012/2/21 Brown, W. Michael <brownw@…33…79…>

You are probably using the wrong affinity settings. In particular with OpenMP this can be a deal breaker. For example, if you set a "one core per process" affinity, all the OpenMP threads of one MPI process try to run on the same core.

So to get 6 cores per process you would use the following OpenMPI command:

.../openmpi-1.4.4/bin/mpirun -np 2 -hostfile hosts -cpus-per-proc 6 lammps args

With mvapich2 it would be this:

.../mvapich2-1.8a2-cuda/bin/mpirun_rsh -np 2 -hostfile=hosts MV2_CPU_MAPPING=0-5:6-11 lammps args

Cheers
Christian

-------- Original Message --------