Problem compiling recent USER-CUDA, and problem with CUDA runs

Dear all,

I’m running into two different issues using and compiling USER-CUDA.
I think they may be related, and I hope you can give me some advice.

  1. A few months ago I installed LAMMPS with the CUDA package on a workstation
    with a K20. The LAMMPS version is 23-Nov-2013.

I was able to compile it, run a few tests and one long simulation (4 days) using
EAM. This is just to say that this version works.

Then, a few days ago I wanted to replicate some simulations I saw in a paper; I
wrote the attached script “cu_nanopillar…”.

I tested it on CPUs (1, 2 and 8 MPI tasks) on both the workstation and on my PC
(LAMMPS 20-Mar-2014, CPU only), and it "works".
The simulation box shrinks a little (~4 angstroms) along Z and then the system
heats up at the rate I am asking for.

However, when I try to run it with suffix=cuda, the script runs fine, but the output
changes: the simulation box grows along Z very quickly, and the system deforms
a lot. I made a few tests changing some relaxation time scales in the script (namely
in the fix npt), but they do not change the overall behavior of the system.
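
For reference, the kind of line I was tweaking looks roughly like this (the numbers
here are made up for illustration; the real values are in the attached script):

# hypothetical fix npt line: the last arguments of "temp" and "z" are the
# Tdamp / Pdamp relaxation times I was varying (in time units)
fix relax all npt temp 300.0 800.0 0.1 z 0.0 0.0 1.0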

You can clearly see the effect by comparing the outputs: just let the simulation run
for ~3000 timesteps, and observe that in the CPU-case the box shrinks along z,
and grows when using “user-cuda”.

This is rather unexpected, and led me to issue #2.

  2. I wanted to make sure that there wasn't any issue with my older version of
    LAMMPS, so I tried to compile two recent tarballs: the latest, 18-Apr-2014, and
    the 20-Mar-2014 one I was testing on my local machine (for non-CUDA stuff).

Unfortunately, I couldn't compile either of them. Apparently, the folder lib/cuda
has not changed since November, and in fact liblammpscuda apparently compiles
OK, and the .a file looks identical between the different versions.

When I switch to src/ and try to compile LAMMPS with "make openmpi", however,
the compilation fails, in two different ways for the two versions.

I already made sure that the makefiles are identical to the original (Nov) ones,
and besides, all LAMMPS versions compile cleanly without the USER-CUDA package.

Also, I stripped LAMMPS of all packages except "manybody" and "user-cuda",
to make sure there weren't any interferences, but this does not make any difference.
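
To be concrete, the sequence I'm running from src/ is roughly this (standard
package make targets):

make no-all          # disable every package
make yes-manybody    # re-enable MANYBODY (the eam pair styles live there)
make yes-user-cuda   # re-enable USER-CUDA
make openmpi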

The OS of the workstation is Fedora 18, with CUDA 5.5.

I am attaching the screen output of the compile failure for both versions.

As additional information, I am also unable to compile either version with "user-cuda"
on my desktop PC (it has a Quadro 600).

I just updated it to (K)ubuntu 14.04 from 13.10. The older version I had working is
11-Mar-2013; it was working just a few days ago, and it still compiles correctly
on this setup (tested today).

So, does any of this make sense? Do you have any idea of what I'm doing wrong?
I hope I was clear enough, but let me know if I can provide more details.

Also: thank you very much for your work and efforts!

Best,

Alessandro Sellerio

comperror_20140320_fc18.txt (3.84 KB)

cu_nanopillar_comments.lmp (2.75 KB)

comperror_20140418_fc18.txt (3.93 KB)

comperror_20140418_ubuntu.txt (6.86 KB)

> Dear all,
>    I'm running into two different issues using and compiling USER-CUDA.
> I think they may be related, and I hope you can give me some advice.

i don't think they are related. ;-)

in general, please be aware that USER-CUDA is no longer actively
maintained and will soon be deprecated in favor of a new package that
can handle GPU acceleration, multi-threading and vectorization in one
go. thus whenever there are some API changes in the base code,
USER-CUDA is likely to break until somebody reports it.

the GPU package, however, is still actively maintained.

> 1) A few months ago I installed LAMMPS with the CUDA package on a workstation
> with a K20. The LAMMPS version is 23-Nov-2013.
> I was able to compile it, run a few tests and one long simulation (4 days) using
> EAM. This is just to say that this version works.
>
> Then, a few days ago I wanted to replicate some simulations I saw in a paper; I
> wrote the attached script "cu_nanopillar...".
> I tested it on CPUs (1, 2 and 8 MPI tasks) on both the workstation and on my PC
> (LAMMPS 20-Mar-2014, CPU only), and it "works".
> The simulation box shrinks a little (~4 angstroms) along Z and then the system
> heats up at the rate I am asking for.
>
> However, when I try to run it with suffix=cuda, the script runs fine, but the output
> changes: the simulation box grows along Z very quickly, and the system deforms
> a lot. I made a few tests changing some relaxation time scales in the script (namely
> in the fix npt), but they do not change the overall behavior of the system.
>
> You can clearly see the effect by comparing the outputs: just let the simulation run
> for ~3000 timesteps, and observe that in the CPU case the box shrinks along z,
> and grows when using "user-cuda".
> This is rather unexpected, and led me to issue #2.

there is one important piece of information missing here:
did you compile the CUDA support in single precision or double
precision? particularly the stress tensor is very sensitive to
floating point truncation and thus has significant errors with single
precision math, which can lead to the described behavior.
to compare apples with apples, you have to check whether the same behavior
happens with double precision. but even then, it is unlikely that the
USER-CUDA code will be updated unless the fix is trivial, for the
reasons i outlined above.
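
for reference, the precision is selected when building the library in lib/cuda;
something along these lines (flag names from memory, please check lib/cuda/README
and Makefile.common for your version):

cd lib/cuda
make clean
make precision=2 arch=35   # 2 = double precision, arch 35 = Kepler (K20)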

> 2) I wanted to make sure that there wasn't any issue with my older version of
> LAMMPS, so I tried to compile two recent tarballs: the latest, 18-Apr-2014, and
> the 20-Mar-2014 one I was testing on my local machine (for non-CUDA stuff).
> Unfortunately, I couldn't compile either of them. Apparently, the folder lib/cuda
> has not changed since November, and in fact liblammpscuda apparently compiles
> OK, and the .a file looks identical between the different versions.
> When I switch to src/ and try to compile LAMMPS with "make openmpi", however,
> the compilation fails, in two different ways for the two versions.
>
> I already made sure that the makefiles are identical to the original (Nov) ones,
> and besides, all LAMMPS versions compile cleanly without the USER-CUDA package.
> Also, I stripped LAMMPS of all packages except "manybody" and "user-cuda",
> to make sure there weren't any interferences, but this does not make any difference.
> The OS of the workstation is Fedora 18, with CUDA 5.5.
> I am attaching the screen output of the compile failure for both versions.
>
> As additional information, I am also unable to compile either version with "user-cuda"
> on my desktop PC (it has a Quadro 600).
> I just updated it to (K)ubuntu 14.04 from 13.10. The older version I had working is
> 11-Mar-2013; it was working just a few days ago, and it still compiles correctly
> on this setup (tested today).
>
> So, does any of this make sense? Do you have any idea of what I'm doing wrong?

there is nothing wrong in what you are doing here. some time ago, we
integrated an optimization to the FFT support for IBM BG/Q machines,
which resulted in a change of the remap API. this was not reconciled
with the USER-CUDA package until april 2014. however at that point,
the USER-CUDA code was also corrected for some spelling mistakes, but
the corresponding changes in lib/cuda were left out (likely my fault).

you should be able to compile by applying the following change to the
code in lib/cuda:

diff --git a/lib/cuda/cuda_wrapper.cu b/lib/cuda/cuda_wrapper.cu
index c8bda6e..051d37d 100644
--- a/lib/cuda/cuda_wrapper.cu
+++ b/lib/cuda/cuda_wrapper.cu
@@ -254,7 +254,7 @@ void cuda_check_error(char* comment)
   printf("ERROR-CUDA %s %s\n", comment, cudaGetErrorString(cudaGetLastError()));
 }

-int CudaWrapper_CheckMemUseage()
+int CudaWrapper_CheckMemUsage()
 {
   size_t free, total;
   cudaMemGetInfo(&free, &total);
diff --git a/lib/cuda/cuda_wrapper_cu.h b/lib/cuda/cuda_wrapper_cu.h
index 5bcfaff..3f57446 100644
--- a/lib/cuda/cuda_wrapper_cu.h
+++ b/lib/cuda/cuda_wrapper_cu.h
@@ -36,7 +36,7 @@ extern "C" void CudaWrapper_CopyData(void* dev_dest, void* dev_source, unsigned
 extern "C" void* CudaWrapper_AllocPinnedHostData(unsigned nbytes, bool mapped = false, bool writeCombind =
 extern "C" void CudaWrapper_FreePinnedHostData(void* dev_data);
 extern "C" void cuda_check_error(char* comment);
-extern "C" int CudaWrapper_CheckMemUseage();
+extern "C" int CudaWrapper_CheckMemUsage();
 extern "C" double CudaWrapper_CheckUploadTime(bool reset = false);
 extern "C" double CudaWrapper_CheckDownloadTime(bool reset = false);
 extern "C" double CudaWrapper_CheckCPUBufUploadTime(bool reset = false);

> I hope I was clear enough, but let me know if I can provide more details.
> Also: thank you very much for your work and efforts!

in summary:

- please check if the patch from above makes the current version of
LAMMPS compile and link with USER-CUDA installed.
- please check whether compiling the CUDA support in double precision (if you
don't do so already) changes the behavior.
- please check against using the GPU package (example command lines below).
- please let us know if there are still issues after this, and which ones.
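
for the comparison, the command lines would look something like this (executable
and input file names are just placeholders; on versions of that vintage the GPU
package also wants a "package gpu" line near the top of the input script, e.g.
"package gpu force/neigh 0 0 1" to use one GPU):

mpirun -np 8 lmp_openmpi -sf cuda -c on -in in.cu_nanopillar   # USER-CUDA
mpirun -np 8 lmp_openmpi -sf gpu -in in.cu_nanopillar          # GPU package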

axel.

> > Dear all,
> > I'm running into two different issues using and compiling USER-CUDA.
> > I think they may be related, and I hope you can give me some advice.
>
> i don't think they are related. ;-)
>
> in general, please be aware that USER-CUDA is no longer actively
> maintained and will soon be deprecated in favor of a new package that
> can handle GPU acceleration, multi-threading and vectorization in one
> go. thus whenever there are some API changes in the base code,
> USER-CUDA is likely to break until somebody reports it.
>
> the GPU package, however, is still actively maintained.

Dear Axel,
   thank you for the quick reply. Today I was able to perform a few tests,
and I was able to solve my issues. I have to say, I feared these issues might
have been related due to some mistake on my part in compiling the libraries;
I'm glad that's not the case.

> there is one important piece of information missing here:
> did you compile the CUDA support in single precision or double
> precision? particularly the stress tensor is very sensitive to
> floating point truncation and thus has significant errors with single
> precision math, which can lead to the described behavior.
> to compare apples with apples, you have to check whether the same behavior
> happens with double precision. but even then, it is unlikely that the
> USER-CUDA code will be updated unless the fix is trivial, for the
> reasons i outlined above.

I double(!)-checked, and can confirm that everything was built with
double precision, on both machines. So I fear this might be a real
issue of the package/fix.

> there is nothing wrong in what you are doing here. some time ago, we
> integrated an optimization to the FFT support for IBM BG/Q machines,
> which resulted in a change of the remap API. this was not reconciled
> with the USER-CUDA package until april 2014. however at that point,
> the USER-CUDA code was also corrected for some spelling mistakes, but
> the corresponding changes in lib/cuda were left out (likely my fault).
>
> <CUT>

I can confirm that by renaming these functions I was able to compile with
the latest LAMMPS tarball.

> in summary:
>
> - please check if the patch from above makes the current version of
> LAMMPS compile and link with USER-CUDA installed.
> - please check whether compiling the CUDA support in double precision (if you
> don't do so already) changes the behavior.

I can also confirm that the "GPU" package compiles correctly on both machines/OSes
without issues.
I used the "Makefile.linux.double" makefile to compile double precision
libraries.
I also tested compiling and linking GPU against the K20's native architecture,
by changing the architecture line in this file to "arch=sm_35".
Both the libraries and LAMMPS compile and work, apparently without issues
(but I didn't do thorough tests or benchmarks).
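
Concretely, the relevant lines in lib/gpu/Makefile.linux.double end up looking
roughly like this (variable names as in my copy of the stock makefile; please
double-check against your version):

CUDA_ARCH      = -arch=sm_35          # native architecture of the K20
CUDA_PRECISION = -D_DOUBLE_DOUBLE     # full double precision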

> - please check against using the GPU package

The input script works using the GPU package, and gives the same results
as the CPU version (as expected). In this case, the speed appears to be on
par with, if not a bit faster than, USER-CUDA.

> - please let us know if there are still issues after this, and which ones.

Now everything works, thanks a lot! I'll be switching all the other machines
from USER-CUDA to GPU as I update/upgrade them.

Cheers,
   Alessandro