Memory Error when running on GPU

Hi LAMMPS users,

I have recently installed lammps-18Feb11 and want to run a simulation on my GeForce GTX 570 GPU (compute capability 2.0, 1.28 GB of memory).

The "peptide" test system runs fine, and very fast too, but the system I am interested in is substantially larger (167k atoms, input file is 46MB), and it does not run: I get the error message shown below.

Could it be that the 1.2 GB on the GPU is not enough? Last time I checked, the "1057996800 bytes" that "could not be allocated" according to the error message are equal to 0.99 GB, so that should fit. Or is there something else wrong that causes this message?

The error message appears when I use:
fix 0 all gpu force/neigh 0 0 X (tested with X = -1.0, 0.2, 0.4, 0.5, 1.0)
fix 0 all gpu force 0 0 X (with X = -1.0 or X > 0.4)

but not when using:
fix 0 all gpu force 0 0 X (with X = 0.4)

In the latter case, however, using the GPU is much slower than running on the quad-core Xeon CPU I have.

If it is indeed the memory, would getting a GTX 580 with 1.5 GB memory be sufficient to run the simulation? Or would adding a second GTX 570 help? Unfortunately we do not have a big pile of GPUs lying around so I cannot test this myself.

And as a more general question: how can I find out how much GPU memory I need for a given simulation?

Thanks!

Louic Vermeer

==== Error message below ====

(...)
PPPM initialization ...
   G vector = 0.209068
   grid = 48 48 36
   stencil order = 5
   RMS precision = 8.17169e-05
   brick FFT buffer size/proc = 115169 82944 25281

Just a suggestion: check how much memory is used in a CPU-only run and see whether it is bigger than 1.25G.

Best wishes,
Yangpeng Ou

On Mar 23, 2011, at 7:45 AM, Louic Vermeer wrote:

2011/3/23 Yangpeng Ou <[email protected]...>:

> Just a suggestion: check how much memory is used in a CPU-only run and see whether it is bigger than 1.25G.

The GPU code needs (much) more memory. I understand it is doing some
out-of-place operations on larger chunks of memory.

cheers,
    axel.

With X<0.4, the GPU is handling less than 40% of the particles - reducing memory usage.
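
For example, with the same fix syntax you are already using, a smaller split puts an even smaller fraction of the system on the card:

fix 0 all gpu force 0 0 0.3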

How much memory does LAMMPS say it is using when you run on the CPU?

- Mike

> With X<0.4, the GPU is handling less than 40% of the particles -
> reducing memory usage.
>
> How much memory does LAMMPS say it is using when you run on the CPU?

LAMMPS reports, when running on 1 core of my quad Xeon:
Memory usage per processor = 809.134 Mbytes

When running on all 4 cores of the quad:
Memory usage per processor = 244.154 Mbytes

How does this translate to memory usage on a GPU?

Memory requirements are generally higher for the GPU - by how much depends. First, full neighbor lists are used instead of half neighbor lists. Second, the GPU initially allocates storage for up to 300 neighbors per atom and grows from there if needed. Third, if you are using CPU neighbor lists or an expensive potential (such as Gay-Berne), the neighbor memory allocation is doubled.
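
As a rough back-of-envelope number (assuming 4-byte integer neighbor indices), just the initial neighbor list for your 167k-atom system is already

167,000 atoms x 300 neighbors/atom x 4 bytes ~ 200 MB,

and about twice that if the allocation is doubled, before counting positions, forces, charges, and the PPPM data.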

If the average number of neighbors per atom reported by LAMMPS <<300, you can reduce memory usage by changing this number in the lib/gpu directory. It is hard-coded so you will have to recompile. (The only instances of 300 in the lib/gpu directory are currently for max neighbors).

You can also dump out half of your simulation box on the CPU, run on the GPU, and see what the reported GPU memory usage is before getting another card.

Might be worth applying for an account on a parallel hybrid machine.

- Mike

> Memory requirements are generally higher for the GPU - by how much
> depends. First, full neighbor lists are used instead of half neighbor
> lists. Second, the GPU initially allocates storage for up to 300
> neighbors per atom and grows from there if needed. Third, if you are
> using CPU neighbor lists or an expensive potential (such as Gay-Berne),
> the neighbor memory allocation is doubled.

First of all, thanks for your answer.

I am using the all-atom CHARMM force field:
lj/charmm/coul/long 11 12
kspace pppm 1e-4

> If the average number of neighbors per atom reported by LAMMPS <<300,
> you can reduce memory usage by changing this number in the lib/gpu
> directory. It is hard-coded so you will have to recompile. (The only
> instances of 300 in the lib/gpu directory are currently for max neighbors).

Can you be more specific about where to change this? I would be happy to give it a try. If I understand the output (see below) correctly, I can substantially reduce the 300.

> You can also dump out half of your simulation box on the CPU, run on the
> GPU, and see what the reported GPU memory usage is before getting
> another card.

You mean something like this?
fix 0 all gpu force 0 0 0.4

A ratio of 0.5 doesn't run and gives a memory error; see below for the output of the run with a ratio of 0.4.

The GPU part of the output says: Max Mem / Proc: 807.03 MB
Does that mean that to run the full system on the GPU, I would need 807.03 / 0.4 = 2017 MB?

Thanks in advance,
Louic

It looks like your average number of neighbors/atom is > 300, so changing the
initial allocation will not help.

Regarding my comment about "dump out half your simulation box" - that was
poorly worded and not thought out. I meant to reduce the number of atoms
in your simulation. You could potentially do this with the region and
delete_atoms commands, but you would have to be careful about bonds. Maybe
you can delete the solvent or something.
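
For example, if the solvent is water, something like this would remove whole
water molecules (the atom types are placeholders for whatever your data file
uses), so no dangling bonds are left behind:

group solvent type 15 16      # placeholder types for the water O and H atoms
delete_atoms group solvent

Deleting by region instead would keep only part of the box, but then you have
to watch for bonds that cross the region boundary.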

- Mike

Hi lammps-users,

I just wanted to let you know that this problem has now been solved, although I don't know exactly why (I changed several things at the same time).

- I changed the primary video card to a GeForce GTX 580, with 1.5G of memory (it was a 570 with 1.2G).
- This required updated NVIDIA drivers, so I installed those, and I also installed LAMMPS-28Mar11.
- I put the GTX 570 that gave the memory error in the second PCI-E slot next to the 580.

I can now run my simulations on either of those video cards alone, or on both at the same time, using a "fix gpu" scaling factor of 1.0. So my guess about the memory problem is that either the updated drivers or the newer LAMMPS version corrected it. Another possibility is that the 1.5G of memory on the 580 is enough for my simulation to run, but the 1.2G on the 570 is not when it is used as the primary video card (i.e. when it is also being used to show things on my screen). This would be easy to test, of course, but I am more interested in running my simulations now that it all works.

Anyway, I am happy to report that the GTX 570 and 580, when used alone, make my simulations run 2.2x faster than the 4 cores of my Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz. When both cards are used (no SLI), the speedup compared to 4 CPU cores is 3.9x. Great!

... and thanks for your replies!

Louic