[lammps-users] Large output time on GPU

Hello,

I have a server with 4 CPUs and an Nvidia Quadro FX 2800M (compute capability 1.1). I built LAMMPS (2 Feb 11) with GPU support and ran the same input (modified for the GPU case) on 4 CPUs and on 4 CPUs + GPU. Here is the timing:

4 CPUs:
Loop time of 615.299 on 4 procs for 1000 steps with 73600 atoms

Pair  time (%) = 252.29 (41.0028)
Bond  time (%) = 270.805 (44.012)
Kspce time (%) = 26.4647 (4.30112)
Neigh time (%) = 0 (0)
Comm  time (%) = 14.7663 (2.39986)
**Outpt time (%) = 21.3677 (3.47274)**
Other time (%) = 29.6052 (4.81151)

FFT time (% of Kspce) = 0.247574 (0.935485)
FFT Gflps 3d (1d only) = 4.808 8.50605

4 CPUs + GPU:
Loop time of 536.939 on 4 procs for 1000 steps with 73600 atoms

Pair  time (%) = 0.710702 (0.132362)
Bond  time (%) = 290.326 (54.0706)
Kspce time (%) = 26.0955 (4.86004)
Neigh time (%) = 0 (0)
Comm  time (%) = 6.28615 (1.17074)
**Outpt time (%) = 188.124 (35.0364)**
Other time (%) = 25.3961 (4.7298)

FFT time (% of Kspce) = 0.247622 (0.948906)
FFT Gflps 3d (1d only) = 4.80707 8.28033

As you can see, the Pair time on the GPU decreased significantly, as expected, but the Output time increased 9 times. Is this normal behavior for GPU computations? What exactly is the output time and how can I decrease it?

Nikita Tropin

Output time is the time spent doing thermo output to
the screen (normally tiny) and writing dump file snapshots
(typically small, but you control how often that happens).
If your thermo or dump output includes some expensive
computations (it usually doesn't), then that increases the cost.

Try turning dump and thermo output (nearly) off and
see what happens.
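
A minimal way to do that (the dump ID, interval, and filename below are
placeholders, not taken from your input) would be something like:

  thermo 0             # thermo output only at the start and end of the run
  thermo_style one     # plain thermo line, no custom computes
  # comment out or remove any dump commands for the test, e.g.:
  # dump 1 all atom 100 dump.lammpstrj
  run 1000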

Steve

The accumulation of timings as it is currently done in the CPU version
of LAMMPS cannot capture how time is spent in the GPU code.
The reason for that is that the GPU code is launched to compute
asynchronously, which means that the pair style returns quickly
while the GPU is still busy; the CPU can then work on the other
parts of the calculation until it halts in fix gpu and waits until the
GPUs are done, after which the run continues. Because of that,
the waiting time gets accumulated into the wrong segment. The
gpu fix prints some GPU timing of its own independently; you have to
look at that.

On top of that, try running with only 1 CPU and don't
oversubscribe the GPU. For a compute 1.1 GPU, that
might be more efficient.

cheers,
    axel.

p.s.: to get reliable timings, we'd have to revamp the whole internal
profiling system in LAMMPS, as the current system implicitly assumes
that only one task is done at a time. With asynchronous GPU kernel
launches and also multi-threading, this is no longer the case, and thus
a different strategy, e.g. thread-safe per-class time accumulators, would
be needed. If done right, it would also allow us to show how well the
asynchronous parts of the code overlap and thus how effective they are.
Not an easy thing to do, though.

The LAMMPS pair time reported when running with accelerators is the time
the CPU waits for pair (and optionally neighbor) calculation on the GPU.

For your simulation, the CPU is not waiting for GPU computations because
it is busy running k-space at the same time the GPU is computing neighbors
and forces. Adding more processes per GPU, not decreasing the number,
would help here (assuming you have the cores).

When the CPU is not doing asynchronous work (no k-space, cpu/gpu hybrid
styles, etc.) the pair time reported by LAMMPS will be the time spent
doing pair (and neighbor) work on the GPU.

The output time should not increase when running with accelerators. I
would try turning off any dumps as recommended, and see if that helps. It
could be possible, though probably unlikely, that an issue with the way the
timing is performed causes the output time to increase with load
imbalance. This could be tested by placing an MPI barrier at the end of the
GPU fix post-force routine.

I am interested to know what you find out.

Also, I am finishing up a PPPM/gpu version for LAMMPS. If you, or anyone else,
would like to test it, let me know.

- Mike

That is exactly my case - I have a lot of computes in my thermo_style. If I turn off the output with "thermo 0", here are the results:

4 CPUs:
Loop time of 634.603 on 4 procs for 1000 steps with 73600 atoms

Pair  time (%) = 275.95 (43.4839)
Bond  time (%) = 288.126 (45.4025)
Kspce time (%) = 33.0725 (5.21153)
Neigh time (%) = 0 (0)
Comm  time (%) = 5.66031 (0.891945)
Outpt time (%) = 7.93949 (1.2511)
Other time (%) = 23.8549 (3.75903)

FFT time (% of Kspce) = 0.248146 (0.750309)
FFT Gflps 3d (1d only) = 4.79691 8.46782

4 CPUs + GPU:
Loop time of 353.317 on 4 procs for 1000 steps with 73600 atoms

Pair  time (%) = 0.704713 (0.199456)
Bond  time (%) = 288.111 (81.5446)
Kspce time (%) = 24.8321 (7.02827)
Neigh time (%) = 0 (0)
Comm  time (%) = 5.82459 (1.64854)
Outpt time (%) = 9.65612 (2.73299)
Other time (%) = 24.1885 (6.84612)

FFT time (% of Kspce) = 0.248778 (1.00184)
FFT Gflps 3d (1d only) = 4.78473 8.32729

The pair time decreased and the output time is the same. But I don't understand why this happens. Does it mean that the thermo_style computes run on the GPU, and run slower there than on the CPU?

Nikita Tropin

On 1 CPU + GPU the computation is 3 times slower than on 4 CPUs + GPU but faster than on 1 CPU alone, and the output time shows the same behavior:

1 CPU:
Loop time of 2285.58 on 1 procs for 1000 steps with 73600 atoms

Pair  time (%) = 1052.5 (46.0497)
Bond  time (%) = 1033.43 (45.2154)
Kspce time (%) = 92.7307 (4.0572)
Neigh time (%) = 0 (0)
Comm  time (%) = 1.53196 (0.067027)
Outpt time (%) = 58.8563 (2.57511)
Other time (%) = 46.5254 (2.0356)

1 CPU + GPU:
Loop time of 1827.98 on 1 procs for 1000 steps with 73600 atoms

Pair  time (%) = 0.647218 (0.0354063)
Bond  time (%) = 1034.57 (56.5968)
Kspce time (%) = 92.9573 (5.08526)
Neigh time (%) = 0 (0)
Comm  time (%) = 1.53856 (0.0841676)
Outpt time (%) = 651.632 (35.6478)
Other time (%) = 46.6252 (2.55065)

By the way, the output of fix gpu is different on 1 and 4 CPUs:

In the case of 1 CPU it does show that the GPU was used for 121 seconds, so
those seconds are definitely not in the Pair time only,

The screen output of GPU times is time spent on the GPU running the
kernels, not time on the CPU. The LAMMPS timings are CPU times. See last
e-mail.

but in the case of 4 CPUs it doesn't show any timing, only memory usage.

No timing is shown because I do not currently have a good way of doing
this accurately when multiple processes use the GPUs. When multiple
processes share a GPU, you don't know how many processes have kernels
running between timestamps because the order is not deterministic.
Additionally, on newer cards, multiple kernels can run at the same time on
the GPU, further complicating the issue.

- Mike

The pair time decreased and the output time is the same. But I don't
understand why this happens. Does it mean that the thermo_style computes
run on the GPU, and run slower there than on the CPU?

Computes do not run on the GPU (unless the compute were to call the
pair_style force loop again - can't think of a reason to do this though).

- Mike

How do you explain, then, that with CPU+GPU the compute time is 10 times larger than with CPU only?

Look at the timings below: with thermo=10 and thermo=0 the GPU time report is the same, so the computes are not running on the GPU, as you said. But the output time for 1 CPU + GPU is 10 times larger than on 1 CPU (651 vs. 58 seconds). My guess is that, when using the GPU, the compute routines somehow become slower.

Here is the test on 1 CPU and 1 CPU + GPU:

1 CPU with thermo=10
Loop time of 2285.58 on 1 procs for 1000 steps with 73600 atoms

Pair  time (%) = 1052.5 (46.0497)
Bond  time (%) = 1033.43 (45.2154)
Kspce time (%) = 92.7307 (4.0572)
Neigh time (%) = 0 (0)
Comm  time (%) = 1.53196 (0.067027)
Outpt time (%) = 58.8563 (2.57511)
Other time (%) = 46.5254 (2.0356)

1 CPU + GPU with thermo=10
Loop time of 1827.98 on 1 procs for 1000 steps with 73600 atoms

Pair  time (%) = 0.647218 (0.0354063)
Bond  time (%) = 1034.57 (56.5968)
Kspce time (%) = 92.9573 (5.08526)
Neigh time (%) = 0 (0)
Comm  time (%) = 1.53856 (0.0841676)
Outpt time (%) = 651.632 (35.6478)
Other time (%) = 46.6252 (2.55065)

Knowing almost nothing about your simulation, I can't. Can you narrow down which compute is causing the problem? Provide an input script? Reproduce with a simple LAMMPS benchmark or example?

- Mike

I found the compute causing the slow-down on the GPU - it is "compute group/group". Adding this compute to thermo_style increases the 4 CPUs + GPU output time from 9 to 180 seconds, and I think I now understand why. The documentation notes that "The energy and force are calculated by looping over a neighbor list of pairwise interactions", and I use "fix gpu force/neigh", so neighbor lists end up being built both on the GPU and on the CPU.
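
For reference, my setup is along these lines (the group names and the
compute ID here are placeholders, not the actual ones from my input):

  group    solute  type 1                  # two placeholder groups
  group    solvent type 2
  compute  gg solute group/group solvent   # loops over a pairwise neighbor list on the CPU
  thermo_style custom step temp pe c_gg    # evaluated at every thermo output
  thermo   10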

The solution I tried is to use "fix gpu force", but I ran into another problem there: when I start the computation with "fix gpu force" I get a "Could not allocate..." error.
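
For clarity, the two modes I am comparing look roughly like this (the fix ID
and the GPU-range/split arguments are only illustrative; see the fix gpu
documentation for your LAMMPS version for their exact meaning):

  # what I used originally: forces and neighbor lists are built on the GPU
  fix 0 all gpu force/neigh 0 0 1.0
  # the attempted workaround: forces on the GPU, neighbor lists on the CPU
  # fix 0 all gpu force 0 0 1.0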

There are many commands (e.g. fixes and computes) in LAMMPS that do things
the GPU knows nothing about. If you use them in your simulation then they
will be performed on the CPU.

Steve

The error "Could not allocate..." is an out-of-memory error. Using CPU
neighbor builds can require more GPU memory, because the neighbor lists are
repacked for optimal access on the GPU and this is not done in place.

- Mike