I am running an 80,000-atom job on two Tesla cards (M2090). I need to run this job a little longer, ~50 ns, but after a few nanoseconds the job freezes at a random point and stops producing any output. GPU usage goes down to 0%, but the CPU still runs at 100% usage. The hard disk has sufficient space.
Here are the relevant details:
GPU cards: 2 x M2090
CPU threads: 12
pair style: lj/charmm/coul/long/gpu
package: gpu force/neigh 0 1 -1 (tried 0 1 1 as well)
LAMMPS version: 12 Oct 2012
Please let me know what may be wrong.
Thanks and regards
Mike can comment, but that is likely
far from enough info to figure out what
might be wrong.
Dear Prof. Plimpton,
What other info might be useful? Please let me know.
Does it always occur at the same timestep of the simulation?
If not, try checking the temperature of the cards with nvidia-smi in another window to see if you have a ventilation problem.
If so, please compile all of the C++ code with -g, run with gdb, and hit Ctrl-C when it hangs. Type bt to get a stack trace.
If it happens quickly, you can also send me the input and I can try to reproduce.
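In case it helps, the two checks above might look something like this on the command line. This is only a sketch: the make target, executable name, and input file are placeholders, and some Tesla boards do not report temperature through nvidia-smi.

```
# watch the GPU temperature once a minute in another window while the job runs
watch -n 60 nvidia-smi -q -d TEMPERATURE

# rebuild with debug symbols (add -g to the lib/gpu Makefile as well),
# then run LAMMPS under gdb
make clean-all
make mpi CCFLAGS="-g -O2"
gdb --args ./lmp_mpi -in in.myscript
(gdb) run
# ... when the job hangs, press Ctrl-C, then:
(gdb) bt
```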
I haven't had any problems with hangs for jobs lasting 12 hours on thousands of GPUs, but maybe there is something specific to your simulation.
Thanks a lot, Mr. Brown,
It happens at random points, after 40-50 hrs of runtime. It seems to be a temperature problem. Unfortunately, nvidia-smi doesn't report the temperature for my card.
I will do the exercise and try to find out. Is there a way I can resume the simulation without actually re-submitting it?
I think that it is very unlikely that your problem is an issue with the code because of the randomness (assuming you are using the same random seeds and number of procs each failed run). The GPU code is deterministic and has been tested pretty thoroughly with memory checkers.
I would recommend writing restart files from LAMMPS every few hours and then continue your simulations that way so that you can complete your work.
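A minimal sketch of that setup (file names, step counts, and package settings here are just placeholders):

```
# in the original input script: write alternating binary restart files
restart 50000 tmp.restart.a tmp.restart.b

# in a continuation script: re-issue the package command (it is not
# stored in the restart file), then pick up from the newer restart
package gpu force/neigh 0 1 -1
read_restart tmp.restart.a
run 25000000 upto
```

The two file names alternate, so the older file survives intact if the job dies mid-write, and "run N upto" continues to an absolute timestep rather than adding N more steps.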