I can reproduce the temperature blow-up with dsf/gpu, and also with coul/long/gpu + pppm. When I used pair hybrid with coul/dsf and table/gpu, the run seemed fine in my test. When I switched from fix npt to fix nvt, the GPU runs no longer showed the temperature blow-up, at least in my short runs (see the sketch below). Can you also test with fix nvt to confirm that?
This might have something to do with how some GPU pair styles behave on your particular system with a triclinic box under npt. I am still narrowing the issue down; hopefully someone on the list has better ideas.
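For reference, these are the kinds of combinations I compared; the cutoffs, damping values, and thermostat/barostat parameters below are only placeholders, not the values from your input:

    # blows up in my short runs
    pair_style    coul/dsf/gpu 0.05 10.0
    fix           1 all npt temp 300.0 300.0 100.0 tri 1.0 1.0 1000.0

    # also blows up
    pair_style    coul/long/gpu 10.0
    kspace_style  pppm 1.0e-4

    # seemed fine in my test (pair_coeff lines omitted)
    pair_style    hybrid coul/dsf 0.05 10.0 table/gpu linear 2000

    # no blow-up when the barostat is replaced by a thermostat
    fix           1 all nvt temp 300.0 300.0 100.0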
I’d like to help you debug this problem. It may be related to a previous issue with GPU + NPT that another user and I came across. (see thread http://lammps.sandia.gov/threads/msg42046.html)
I set up a (somewhat) simple example script that leads to similar problems, based on the rhodo example that's distributed with LAMMPS. I'm attaching the input script. It produces different results on the CPU and the GPU. The correct results can be obtained by setting the thermo output to every timestep for the GPU simulation. I think the issue I pointed out in the earlier email still stands: changing the frequency of the thermodynamic output should not change the final answer.
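To be explicit, the only change I need on the GPU side to recover the CPU answer is the output frequency, something like:

    thermo 1     # GPU run agrees with the CPU run
    # thermo 50  # with any less frequent setting the GPU run diverges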
Thank you all for your kind attention on this matter.
Trung, switching to NVT did work for me as well. For some of my simulations it will be useful to run on the GPU at a fixed volume.
Once again, thanks!
Can you rebuild the GPU package with the attached lal_answer.cpp file? (i.e. save a copy of the current lal_answer.cpp, put the attached file into lib/gpu/, then run make clean and make -f Makefile.your_machine to rebuild libgpu.a). After that, remove the current LAMMPS binary in src/ and rebuild LAMMPS.
I think I fixed a bug in the GPU package with accumulating the virial when energy is not accumulated (eflag == 0 and vflag != 0) for the GPU pair styles that require charges. I would like both of you to check with your systems to make sure that the bug fix works. I compared GPU double precision against CPU runs using your input files, and they all match in my short runs.
I still can't match the GPU double precision and CPU results for pair dsf and ewald. The minimization run matches 100% with dsf. T is not blowing up anymore, but it is still very high and tilts the box too far.
Thanks for looking into this. I also find that the new lal_answer does not fix the discrepancies. I think the key (for me) is that if the thermo output frequency is set to 1 (thermo 1), the correct pressures, box sizes, etc. are produced with the GPU code, but if the thermo output is less frequent, discrepancies arise. This was true both before and after the modification you made. BTW, are your tests using thermo output at every timestep? If so, you may not see the problem.
Luis:
If you set the thermo output frequency to every timestep do your problems resolve?
Output from the final timestep of in.gpu.rhodo-ex using CPU, GPU with thermo 1, and GPU with thermo 10, with the new lal_answer.cpp in double precision:
Yes, setting 'thermo 1' works for me too, though I find it really strange that this is so. Other means of computing the thermodynamic variables every timestep, like fix ave/time, don't do the same trick.
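For concreteness, something along these lines is what I mean; the specific computes and file name are just from a test script, not my production input:

    compute   myT all temp
    compute   myP all pressure myT
    fix       avg all ave/time 1 1 1 c_myT c_myP file thermo_every_step.dat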
Yes, I've also found that fix ave/time doesn't work properly with some quantities in a GPU-enabled run. Actually, I suspect it might work if the thermo output is set to 1 (although I haven't tested).
I tested with thermo 10. Let me look through the changes again; probably I changed something else in addition to lal_answer.cpp. I will get back to you later.
Sorry for my mistake in the previous attempt (I mixed up the working version of the source file and did not do a clean rebuild). Please build a clean libgpu.a with the attached lal_answer.cpp, and build a new LAMMPS binary. I also attach the log files I got for your input files for CPU and GPU double precision runs as references.
Mike, I made some changes to the input script(s) to reduce the divergence between CPU and GPU runs (such as using gpu force 0 0 1 and pair_modify table 0 for coul/long).
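In input-script terms, the relevant lines look roughly like this (the cutoff is a placeholder):

    package      gpu force 0 0 1     # forces on the GPU, neighbor build on the CPU
    pair_style   coul/long/gpu 10.0
    pair_modify  table 0             # analytic erfc() instead of the tabulated coulomb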
Luis, I generated a binary restart file at the end of a CPU run and read in the restart file when comparing CPU and GPU runs (coul/dsf and coul/long+pppm). You may want to generate a restart file using the LAMMPS version available on your system to ensure compatibility.
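The workflow is simply (the file name is just an example):

    # at the end of the CPU run
    write_restart  cpu_run.restart

    # both comparison runs, CPU and GPU, then start from the same state
    read_restart   cpu_run.restart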
Let me know how it works. Thanks for your cooperation,
I think this correction is a great advance, but the bug is not entirely fixed yet. I performed a fairly long simulation with ewald + gpu + triclinic, and here is the end of the output:
Most of the GPU-accelerated scripts end like that. I don't know if it helps, but this error always comes up near the last phase transition (tetragonal to cubic). Perhaps strong fluctuations of the box could be leading to this kind of error.
Did the CPU runs also fail when they got close to the phase transition (not necessarily at the exact same timesteps)? Can you restart the simulation from a restart file close to the point where it crashed (e.g. at t = 800000 for the run you were referring to) and run with and without the GPU, to see if the problem occurs only in the GPU run?
Also, when you mentioned running with ewald+gpu+triclinic, you are using coul/long/gpu + kspace ewald, yes?
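That is, I am assuming something along these lines, with the cutoff and accuracy as placeholders:

    pair_style    coul/long/gpu 10.0
    kspace_style  ewald 1.0e-4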
I restarted one of the crashed GPU runs without the GPU, and the simulation is going fine with pair_style dsf. The restart began at 950000, and it is now running well past the crash point.
As I said before, the CPU runs fine through the phase transitions.
Thanks for trying the restart run without the GPU. Have you also tried restarting with the GPU? If that simulation keeps crashing, the reason is not obvious to me, at least for now. You may try more defensive settings for neighbor list rebuilds (for example, every 1 delay 10 check yes) when it gets closer to the phase transitions, to see if that helps.
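In input-script terms:

    neigh_modify  every 1 delay 10 check yes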