Error when trying to compile LAMMPS with CUDA and GPU enabled

Hi,

When I try to make LAMMPS I get this error for fft3d_cuda.cpp:

"/usr/local/cuda-5.0/include/host_defines.h", line 128: catastrophic error:
          #error directive: --- !!! UNKNOWN COMPILER: please provide a CUDA
          compatible definition for '__align__' !!! ---
  #error --- !!! UNKNOWN COMPILER: please provide a CUDA compatible
definition for '__align__' !!! ---

1 catastrophic error detected in the compilation of "fft3d_cuda.cpp".
Compilation terminated.
make[1]: *** [fft3d_cuda.o] Error 2
make[1]: Leaving directory `/home/qphll/lammps-20Apr12/src/Obj_gpu'
make: *** [gpu] Error 2

I'm working on a RHEL 5 machine with two NVIDIA Tesla C1060s; PGI and CUDA are completely up to date. This is the compilation command that it's trying to execute:

mpicxx -fast -DLAMMPS_GZIP -I../../lib/cuda -DLMP_USER_CUDA -DLMP_USER_OMP -I../../lib/reax
-I../../lib/poems -I../../lib/meam -I/home/qphll/mpich/include -DFFT_CUFFT -I/usr/local/cuda-5.0/include
-I/usr/local/cuda-5.0/include -DUNIX -DFFT_CUFFT -DCUDA_PRECISION=1 -DCUDA_ARCH=20 -c fft3d_cuda.cpp

There are some repeated options, but that shouldn't be the problem.
Any help is appreciated, and if you need more info about my machine or the LAMMPS options I can give it to you.

Cheers,
Craig Needham
Ph.D. Candidate, Westmoreland Group, NCSU
[email protected]…1442…


just use GCC and junk PGI.

axel.

I think there is a way around that, using some compiler directives. If I remember correctly, the MVAPICH2 user guide lists them somewhere.

Christian
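For reference, the preprocessor definitions usually suggested for compiling CUDA host code with PGI (the kind of thing the MVAPICH2 user guide describes) look roughly like the following. Treat the exact macro set as an assumption; it varies with the CUDA version, and the command below is a sketch, not a verified fix:

```shell
# Hypothetical workaround (unverified): supply the CUDA keyword macros
# that host_defines.h cannot map for an "unknown" compiler such as PGI.
mpicxx -fast \
  "-D__align__(n)=__attribute__((aligned(n)))" \
  "-D__location__(a)=__annotate__(a)" \
  -DCUDARTAPI= \
  -I/usr/local/cuda-5.0/include -c fft3d_cuda.cpp
```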


why even try? the PGI C++ compiler is such
a bad compiler, it is not worth it.

axel.

Thanks for the suggestion, Christian; I will certainly check it out. But after all my struggles with PGI over the last few months (this is the latest of many), I think I may be about ready to follow Axel's advice and just go with GCC. On that note, Axel: I am relatively familiar with the optimization capabilities of GCC, but I was wondering if there are any options you would normally invoke when compiling with GCC for GPUs? Also, is there any reason not to attempt to compile with the CUDA nvcc compiler? My goal with this compilation of LAMMPS is to get it working with our GPUs as efficiently as possible. Our group has a few compilations that work nicely but are not set up to use our GPUs.

Cheers,
Craig

> Thanks for the suggestion Christian I will certainly check it out, but after
> all my struggles with pgi over the last few months (this is the latest of
> many) I think I may be about ready to follow Axel's advice and just go with
> GCC. On that note Axel, I am relatively familiar with the optimization
> capabilities of GCC but I was wondering if there were any options that you
> would normally invoke when compiling with GCC for GPUs? Also, is there any

that is a strange question.
you need GCC to compile the host code, not the GPU code.

in general, compiler optimization needs to be used with care.
there is very little benefit to optimize code that is only executed
for a fraction of a percent of the total execution time. however,
if that code is miscompiled (as it becomes increasingly likely
with aggressive optimization), it can ruin your calculation.

the default settings in lib/gpu/Makefile.linux are fine.

as for optimization of the rest of LAMMPS, i've been
getting reliable and reasonably fast binaries with:

-O2 -fomit-frame-pointer -fno-rtti -fno-exceptions \
-march=native -ffast-math -mpc64 -finline-functions \
-funroll-loops -fstrict-aliasing
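In a LAMMPS machine makefile (e.g. something like src/MAKE/Makefile.g++; the exact file name depends on your tree), these flags would go into the compiler-flag variables. A sketch only, with variable names assumed from a typical makefile of that era:

```make
# Sketch: where the flags above would live in a src/MAKE machine makefile.
CC        = mpicxx
CCFLAGS   = -O2 -fomit-frame-pointer -fno-rtti -fno-exceptions \
            -march=native -ffast-math -mpc64 -finline-functions \
            -funroll-loops -fstrict-aliasing
LINK      = mpicxx
LINKFLAGS = -O2
```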

the AtC package is the only package that uses exceptions
and run time type information, so i don't compile it in.
the way the force kernels are currently written, there is very
little benefit to be had from vectorization and it is far from
trivial to rewrite them so there is.

if you use a lot of "complex" potentials that do a lot of transcendental
math, you should see if you can try the intel compiler with

-O3 -xHOST -no-prec-div -no-prec-sqrt -fast-transcendentals -pc64
-ansi-alias -fno-rtti -fno-exceptions

since the intel compiler comes with inline math functions (log, exp,
pow, cos, sin) that can be 2-3x faster than the libm/gcc counterparts.
normally, these tweaks don't make much of a difference, but some parts
of LAMMPS use these math functions a lot.

> reason not to attempt to compile with the CUDA NVCC compiler? My goal with

the only way to not use nvcc is to compile for OpenCL.
sources written in CUDA need to be compiled with a CUDA compiler.
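For context, the usual build sequence for the GPU package splits exactly along that line; a minimal sketch, with paths and targets assumed from a typical LAMMPS tree of that vintage:

```shell
# Assumed layout; set CUDA_HOME and the arch in Makefile.linux first.
cd lib/gpu
make -f Makefile.linux      # nvcc compiles the CUDA kernels here
cd ../../src
make yes-gpu                # enable the GPU package
make linux                  # gcc/mpicxx compiles the host code
```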

> this compilation of LAMMPS is to get it working with our GPUs as
> efficiently as possible. Our group has a few compilations that work nicely
> but are not set up to use our GPUs.

in my experience, more time is wasted by running superfluous or
flawed simulations than you can ever compensate for by generating
slightly faster executables. i believe that applying the 80:20 rule
is a useful thing. better to spend the rest of the effort planning
the simulations and finding/fixing bugs. ;-)

cheers,
    axel.

I won't necessarily argue with that :-)
