[lammps-users] Trying to compile with GPU option

Hi Axel,

did you try to run the “nvc_get_devices” program?
what is its output?

Found 1 platform(s).
Using platform: NVIDIA Corporation NVIDIA CUDA
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20

Device 0: “GeForce 9400M”
Type of device: GPU
Compute capability: 1.1
Double precision support: No
Total amount of global memory: 0.247681 GB
Number of compute units/multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum group size (# of threads per block) 512 x 512 x 64
Maximum item sizes (# threads for each dim) 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.1 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default
Concurrent kernel execution: No
Device has ECC support enabled: No

Regards,
Anna.

Hi Axel,

did you try to run the "nvc_get_devices" program?
what is its output?

Found 1 platform(s).
Using platform: NVIDIA Corporation NVIDIA CUDA
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20

Device 0: "GeForce 9400M"
Type of device: GPU
Compute capability: 1.1

a-ha.

Double precision support: No
Total amount of global memory: 0.247681 GB
Number of compute units/multiprocessors: 2
Number of cores: 16

ok. that means two things.

1) your hardware is working.
  to get over the linking errors, try changing
  src/MAKE/Makefile.mac by replacing:

gpu_SYSLIB = -lcudart

with:

gpu_SYSLIB = -lcudart -lcuda

2) you are not likely to get much acceleration out of it
    due to its limited compute capability.

as long as you are only interested in CUDA development
or general testing, that should be fine (although compute capability
1.1 is already considered "GPU stone age"). if you expect to
get your results much faster, you need to get newer and
more capable hardware. GPU acceleration doesn't come
easy for MD.
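a minimal sketch of that Makefile.mac edit, assuming a stock LAMMPS tree and that the gpu_SYSLIB line appears exactly as quoted above (adjust the path and pattern if your makefile differs):

```shell
# Hedged sketch: back up Makefile.mac and append -lcuda to gpu_SYSLIB.
# The file path and the exact gpu_SYSLIB line are taken from this thread.
sed -i.bak 's/^gpu_SYSLIB = -lcudart$/gpu_SYSLIB = -lcudart -lcuda/' src/MAKE/Makefile.mac
```

after that, re-run your usual make target (e.g. "make mac" from the src directory) to relink.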

axel.

Hi Axel,

Thanks for your reply.

1) your hardware is working.
to get over the linking errors, try changing
src/MAKE/Makefile.mac by replacing:

gpu_SYSLIB = -lcudart

with:

gpu_SYSLIB = -lcudart -lcuda

Thanks, that got rid of the first 2 errors. But I'm still getting
MPI-related compile errors... E.g. the _MPI_Type_create_hindexed symbol is
undefined, and there are plenty more symbols that are undefined as well.

2) you are not likely to get much acceleration out of it
   due to its limited compute capability.

as long as you are only interested in CUDA development
or general testing, that should be fine (although compute capability
1.1 is already considered "GPU stone age"). if you expect to
get your results much faster, you need to get newer and
more capable hardware. GPU acceleration doesn't come
easy for MD.

Yes that's fine. For now I'm just interested in CUDA development, and seeing
what sort of programs can compile, run & test under a development machine
with such hardware. Also I'm trying to free up CPU cycles by using GPU for
testing.

Regards,
Anna.

hi anna,

[...]

gpu_SYSLIB = -lcudart -lcuda

Thanks, that got rid of the first 2 errors. But I'm still getting
MPI-related compile errors... E.g. the _MPI_Type_create_hindexed symbol is
undefined, and there are plenty more symbols that are undefined as well.

ok. did you compile and want to use the MPI stub library
in the STUBS directory?

if not, you have to change the Makefile.mac to point to
your MPI installation.
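for reference, pointing the makefile at an MPI installation usually means editing the MPI-related variables; a sketch, assuming an Open MPI install under /usr/local and the MPI_INC/MPI_PATH/MPI_LIB variable names used in LAMMPS makefiles of this era (check your Makefile.mac for the exact names and paths on your system):

MPI_INC  = -I/usr/local/include
MPI_PATH = -L/usr/local/lib
MPI_LIB  = -lmpi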

if yes, you need to wait a few minutes until i have
backported the change that is needed to make the
STUBS library compatible with the GPU library, which
uses an MPI feature that was previously not used in
LAMMPS and thus not implemented in the stub library.

alternatively you can try the files from LAMMPS-ICMS:
http://git.icms.temple.edu/git/?p=lammps-icms.git;a=tree;f=src/STUBS

those are a little bit different, since they are
written in c and not in c++. it doesn't make a
difference for LAMMPS, but it does when somebody
wants to use the library interface from c code.

axel.

Hi Axel,

Thanks, I managed to get it to compile using an install of open-mpi and a
modified version of src/MAKE/Makefile.mac_mpi.

I am now trying to run the colloid example, and I'm getting the following
error:
ERROR: Invalid pair style

Regards,
Anna.

Hi Axel,

Thanks, I managed to get it to compile using an install of open-mpi and a
modified version of src/MAKE/Makefile.mac_mpi.

ok. cool.

I am now trying to run the colloid example, and I'm getting the following
error:
ERROR: Invalid pair style

colloid is an optional package. you have to do:

make yes-colloid
make mac_mpi

to add it to your lammps binary.
you can see which packages are active with:

make package-status

axel.

Hi Axel,

Thanks, I managed to get it to compile using an install of open-mpi and a
modified version of src/MAKE/Makefile.mac_mpi.

ok. cool.

I am now trying to run the colloid example, and I'm getting the following
error:
ERROR: Invalid pair style

colloid is an optional package. you have to do:

make yes-colloid
make mac_mpi

to add it to your lammps binary.
you can see which packages are active with:

make package-status

I added all of the packages except for meam & reax, and it compiled
successfully, and ran the normal examples. But...

When I try to run the gpulammps example for gb.in, it more or less freezes
my computer. I can still move the mouse, but the display doesn't update in
any other way.

I use
mpirun -np 2 lammps < gb.in

It gets as far as compiling the GPU program, but that's where it gets stuck.

Is there anything that I'm doing wrong?

Thanks,

Regards,
Anna.

I added all of the packages except for meam & reax, and it compiled
successfully, and ran the normal examples. But...

When I try to run the gpulammps example for gb.in, it more or less freezes
my computer. I can still move the mouse, but the display doesn't update in
any other way.

I use
mpirun -np 2 lammps < gb.in

It gets as far as compiling the GPU program, but that's where it gets stuck.

Is there anything that I'm doing wrong?

if you are in graphics mode, you have competition between
software wanting to update the display (and using the GPU
for it) and your GPU computing requests. even with a very
powerful GPU, running a CUDA code makes the graphics
go slower, sometimes a lot. the nvidia driver typically has
a timeout setting that will kill the CUDA job if it doesn't
free the GPU fast enough. thus i would do my experiments
in text mode rather than in graphics mode.

also, you should not oversubscribe the GPU, i.e. don't
try to use more MPI tasks than you have GPUs.
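one way to keep the MPI task count in line with the hardware is to count the devices that nvc_get_devices reports; a sketch, parsing the "Device N:" lines of the listing quoted earlier in this thread (binary names follow the thread and may differ on your machine):

```shell
# Count CUDA devices from the nvc_get_devices listing and launch one
# MPI task per GPU. The "Device N:" line format follows the output
# quoted earlier in this thread.
NGPU=$(./nvc_get_devices | grep -c '^Device ')
mpirun -np "$NGPU" lammps < gb.in
```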

you have only 16 cores of 1.1 compute capability,
little memory, and a stripped-down mobile GPU to
boot. there is not much performance to be expected.
if a gay-berne system ran 2x faster on the GPU
than on the CPU, that would probably be good
performance.

just compare your GPU with the specs of the previous
generation tesla card that i have here: it alone has
15x more cores.

axel.

Found 1 platform(s).
Using platform: NVIDIA Corporation NVIDIA CUDA
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20

Device 0: "Tesla C1060"
  Type of device: GPU
  Compute capability: 1.3
  Double precision support: Yes
  Total amount of global memory: 3.99982 GB
  Number of compute units/multiprocessors: 30
  Number of cores: 240
  Total amount of constant memory: 65536 bytes
  Total amount of local/shared memory per block: 16384 bytes
  Total number of registers available per block: 16384
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum group size (# of threads per block) 512 x 512 x 64
  Maximum item sizes (# threads for each dim) 65535 x 65535 x 1
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.296 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: No
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default
  Concurrent kernel execution: No
  Device has ECC support enabled: No

and that is just a 1.5 year old high-end card; compare this with your laptop GPU.

Hi Axel,

OK, thanks for that. I am mainly just seeing how easy it is to get lammps
running on GPUs, on an older laptop with CUDA.

Just as an update:
1) I ran lj_kspace.in, with and without mpirun, successfully
2) running "lammps < gb.in" produces a cuda error 700 at file
geryon/nvd_device.h at line 41

Regards,
Anna.

I see no reason that you should need to run in text mode. If there is a
timeout, it will be >5 seconds; your card doesn't have enough memory to
run a GB simulation that takes that long per timestep.

I have a MacBook Pro laptop with a 2.66 GHz Intel i7 and a GeForce GT
330M, compute capability 1.1, with 48 compute cores. It runs Gay-Berne
with 27000 particles without issue. The speedup versus a CPU-only run on
all 4 cores is about 6 times.

Running with multiple processes per GPU improves performance slightly, but
Axel is correct that the overhead for context switching on GPUs is
higher for older cards.

Did you compile with -arch=sm_11 in lib/gpu?
Can you attach the input script you are using?
Higher up in the output is there an error from LAMMPS (such as out of
memory on GPU)?
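(for reference, the architecture flag is usually set in the lib/gpu makefile; a sketch, assuming a CUDA_ARCH-style variable as found in typical lib/gpu makefiles of this era — the variable name is an assumption, and some makefiles put the flag directly on the nvcc command line instead:)

CUDA_ARCH = -arch=sm_11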

Thanks.

- Mike

Hi Mike,

Thanks for your reply.

Did you compile with -arch=sm_11 in lib/gpu?

Yes

Can you attach the input script you are using?

OK

Higher up in the output is there an error from LAMMPS (such as out of
memory on GPU)?

No

Regards,
Anna.