Problem with CUDA driver error

Ekhi_Arroyo · October 10, 2012, 10:51am

Thank you Christian.

I will try newer drivers and the CUDA 5 RC with the benchmark cases.

So, do you think that the error was definitely caused by memory issues?

And another question: is there a limit on the size of the simulation
given my graphics card (2 GB of memory)? Roughly how many atoms do you
estimate this hardware can handle in EAM cases without a crash? Can I
raise the practical limit by using single precision instead of double
precision?

Thanks for your help

_Brown_W_Michael · October 10, 2012, 3:28pm

You might hit more issues with the release candidate. I am not sure when the RC was updated last, but there are issues that are being worked through that can affect LAMMPS (and other codes) in different ways with the RC.

LAMMPS should generate an out-of-memory error if you use too much memory on the GPU. Can you check whether this output was generated (above the CUDA driver errors), or attach the output? If you are using csh, you can redirect all of the output using >& instead of > (in bash you can use 2>, I think, to redirect the error output to a file).
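As a small illustration of the redirection described above, here is a minimal sketch for sh/bash. The function demo_run and the demo.* file names are hypothetical stand-ins for the real program and its logs; in csh the equivalent of the last form is the >& operator mentioned above.

```shell
# Stand-in for the real program: writes one line to stdout and one to stderr.
# (demo_run is a hypothetical name used only for this illustration.)
demo_run() {
    echo "normal output"
    echo "error output" >&2
}

# Separate files: stdout goes to one file, stderr to another.
demo_run > demo.out 2> demo.err

# Single file: send stdout to the file, then point stderr at the same place.
demo_run > demo.all 2>&1
```

With the first form, any out-of-memory message written to the error stream would end up in demo.err rather than being lost on the terminal.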

You should also see this thread:

http://sourceforge.net/mailarchive/forum.php?thread_name=5056F042.1090807%40ubu.es&forum_name=lammps-users

Thanks. - Mike

Ekhi_Arroyo · October 10, 2012, 4:14pm

Hi Michael. Thanks for your help.

I'm still working with CUDA Toolkit 4.2 and driver 295.XX (the version referenced on the official CUDA Toolkit webpage).

I don't really know whether it is a memory problem or something else, but it's very strange because the error is random. Sometimes the simulation finishes fine, and other times it fails with the error. I couldn't find the cause of that behaviour.

I redirected the usual output and the error output from LAMMPS. The results are in the attached files (out.eam.gpu and error.eam.gpu). The command used to run the case in.eam.gpu was the following:

mpirun -np 4 /home/ekhi/bin/lmp_openmpi.gea.gpu -sf gpu -c off -v g 1 -v x 32 -v y 32 -v z 64 -v t 100 < in.eam.gpu > out.eam.gpu 2> error.eam.gpu
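Since the failure is intermittent, one way to see how often it occurs is to repeat the same run several times and keep each run's output separately. The following is only a sketch for sh/bash: run_repeated and the run.N.out / run.N.err file names are hypothetical, and it assumes a failing run exits with a non-zero status (if it does not, the saved error files would have to be searched instead).

```shell
# run_repeated N CMD [ARGS...]
# Runs CMD N times, saving stdout/stderr of run i to run.i.out / run.i.err,
# and prints how many runs exited with a non-zero status.
run_repeated() {
    n=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$n" ]; do
        # "$@" is the command and its arguments, passed through unchanged.
        if ! "$@" > "run.$i.out" 2> "run.$i.err"; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures of $n runs failed"
}
```

It would be called as run_repeated 10 followed by the mpirun command line. The input script should then be given with the LAMMPS -in switch rather than with <, because a standard-input redirect would be consumed by the first run.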

Thank you very much.

error.eam.gpu (972 Bytes)

out.eam.gpu (996 Bytes)

_Brown_W_Michael · October 12, 2012, 9:39pm

I tried this with a similar setup (Quadro with compute capability 3,
295.41 driver, toolkit 4.2) and could not reproduce your problem. Memory
usage was under 1GB with up to 12 procs, based on the NVIDIA diagnostic
tools.

A couple of similar issues have shown up on the mailing list that turned
out to be bad hardware. Both were GTX 6XX I believe. Can you try the
memory test in the thread I referred you to and see if that reports any
errors?

- Mike

Ekhi_Arroyo · October 15, 2012, 4:14pm

Hi.

I ran cuda_memtest and one error appears in test number 3. I attach the output and error files. It's weird because a couple of months ago I ran the same memtest with another Linux distribution (Scientific Linux 6) and another graphics card (another GTX 680, but built by PNY instead of MSI), and the test gave the same error. Because of that, I don't think it can be an isolated hardware problem (two different cards with the same memory error?), but all your contributions are welcome, XD.

Thank you for your consideration

cuda_memtest.error (134 Bytes)

cuda_memtest.out (1.05 KB)