USER-CUDA select device

Hello lammps-users,

It looks as if I managed to compile the USER-CUDA package. I'm testing it on nodes with two GPU cards each. The problem is that LAMMPS always seems to select device 0, which is always busy with other processes, so an error is thrown.
Here is the error file output:

Using device 0: Tesla C2050

Cuda error: Cuda_AtomVecCuda_PackExchangeList: Kernel execution failed in file ‘atom_vec_cuda.cu’ in line 285 : invalid argument.

Is there a way to tell the program to use device 1? By analogy with the GPU package I tried

package cuda gpu/node 1

but this had no effect.

Thanks,

Nikita

Hi

you were on the right track, but "package cuda gpu/node 1" only means that 1 GPU per node should be used. What you want is a "special" GPU request rather than the automatic GPU assignment, so you would need to put there:
"package cuda gpu/node special 1 1"

I just saw that the documentation is still incomplete, so here is the short version:

gpu/node is the option that determines how many GPUs per node are used: if you request k GPUs per node, process m uses GPU m%k. The index refers to a sorted list, because the available GPUs are first sorted by their number of multiprocessors and then chosen from that list. So if you have, for example, two compute GPUs and one small GPU for the X screen, you would use "gpu/node 2". The sorting ensures that the small GPU is ignored and only the two most powerful GPUs on each node are used.
This works with GPUs in both exclusive and default mode.

If you want to specify a specific GPU (e.g. on a shared system where GPUs are in default mode), you can give a list of GPUs to be used on each node with the option "gpu/node special m g1 ... gm", where m is the number of GPUs and g1 to gm are their indices. Note that the indices are not the same as those reported by nvidia-smi, since for some reason the PCIe numbers, driver IDs, and runtime IDs of the GPUs do not necessarily coincide.

Cheers
Christian

P.S. That error seems strange anyway. What version are you using? And could you send me the input script so I can have a look?

-------- Original Message --------

Hi Christian,

You are right, selecting the wrong GPU was not the source of the error. Still, I guess it would have messed things up later. I recompiled the binary today and apparently I didn't pay attention to some compiler warnings at the end:


ipo: warning #11009: file format not recognized for /opt/sw/cuda/lib/libcufft.so
ipo: warning #11009: file format not recognized for /opt/sw/cuda/lib/libcudart.so
ipo: warning #11009: file format not recognized for /opt/sw/cuda/lib/libcudart.so
ld: skipping incompatible /opt/sw/cuda/lib/libcufft.so when searching for -lcufft
ld: skipping incompatible /opt/sw/cuda/lib/libcudart.so when searching for -lcudart
ld: skipping incompatible /opt/sw/cuda/lib/libcudart.so when searching for -lcudart
size …/lmp_linux
text data bss dec hex filename
37633102 257696 1612376864 1650267662 625d160e …/lmp_linux

so I assume that the incompatible libraries are the issue. I think something went wrong with pointing make to the libraries, as they are actually in /opt/sw/cuda/lib/lib/ and in general I would assume lib64 to be the correct folder… I'll tinker around with that today.

Anyway - the infile that I use can be found here: http://pastebin.com/3QYBwwkV - essentially it's the crack example with some lines added at the beginning to make it use CUDA. I don't use the CUDA or GPU example files: since my background is in the macro area, it's simply easier for me to judge the correctness of the crack example.

Regards,

Nikita

The skipping message is OK. That's just because both the 32-bit and 64-bit paths are included (so you actually don't have to care which one you compiled for, since the linker will choose the correct ones by itself). The "file format" warning is strange, though.

Anyway I reproduced your error, and it seems to be related to the boundary conditions (s s p) and certainly is a bug in my code.

I am going to investigate and let you know as soon as I have fixed it (hopefully today).

Cheers
Christian

-------- Original Message --------

Hi Nikita

I found the two bugs which were preventing successful completion of the runs:

in lib/cuda/domain_kernel.cu

line 208:
  else {minx=lo[2];maxx=hi[2];}

must be changed to:
  else {minz=lo[2];maxz=hi[2];}

And in atom_vec_cuda.cu, in the function Cuda_AtomVecCuda_PackExchangeList(), a guard

  if(n>1+return_value)

must be put in front of:
  cudaMemcpy(buf_send, sdata->buffer, (1+return_value)*sizeof(double), cudaMemcpyDeviceToHost);

I'll submit these fixes to Steve, so they will be rolled out with a patch in the next few days. But this should do it for you.

Cheers
Christian

P.S. Let me know if it works, and if you encounter other problems.

-------- Original Message --------

Hi Christian,

Thank you for your support, everything seems to work now.
Only one remark about the undefined PI: I didn't use math_const.h because PI gets assigned a value elsewhere (PI = 4.0*atan(1.0)). I just defined a double for PI at the beginning of the code and gave it the initial value from math_const.h; I didn't want to mess around with the code too much. But I guess it would make sense to use math_const.h: it also has PI/2 defined, which would eliminate one multiplication in pppm_cuda.cpp (as well as pppm_gpu.cpp).

Thanks again,

Nikita

Hi Nikita

The fixes to use PI from math_const.h are part of the latest patch now. If you encounter any further problems, let me know. Since the GPU code has not been in as wide use as the main part of LAMMPS and is still fairly new, the chance of encountering bugs is obviously higher. And it's almost impossible to cover all use cases, so I am more or less dependent on feedback from users.

Cheers
Christian

-------- Original Message --------

Hi Christian,

I think I'll soon have the opportunity to test some other, more serious simulations than just the crack example. I'll let you know about the results. Apart from some other initial problems I have right now anyway, I guess I will also have to tinker a bit with my input files to make them write out data less often without losing too much information. I think that by writing a log file at every step, I pretty much kill any performance boost I may get from CUDA…

Cheers,

Nikita