Hybrid/overlay lj96/cut/gpu pressure blowup on newer Tesla card

Hello,

I am simulating a periodic crystalline system with GPU-accelerated lj96/cut and coul/long pair styles using the most recent stable version of LAMMPS (1 Feb 2014). My input script runs fine on an older Tesla card with the following deviceQuery output:

Device 0: "Tesla C2070"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 2.0
  Total amount of global memory: 5375 MBytes (5636554752 bytes)
  (14) Multiprocessors x ( 32) CUDA Cores/MP: 448 CUDA Cores...

On this card, I compiled the GPU library with double/double precision and the -arch=sm_21 flag using cuda/5.0.35.
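For reference, the relevant settings in the lib/gpu makefile look roughly like this (the exact makefile name and the CUDA_HOME path are specific to my machine):

  CUDA_HOME      = /usr/local/cuda-5.0   # local CUDA install path
  CUDA_ARCH      = -arch=sm_21           # sm_30 for the K20Xm build described below
  CUDA_PRECISION = -D_DOUBLE_DOUBLE      # double/double precision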

On a newer Tesla K20Xm card, I compiled the GPU library with -arch=sm_30 using both cuda/5.0.35 and cuda/6.0.37. With either toolkit, the pressure blows up and the simulation crashes after one step on the K20Xm whenever a long-range Coulombic solver is used with lj96/cut/gpu. I've tried these combinations of pair styles using "hybrid/overlay":

lj96/cut/gpu, coul/long/gpu, pppm/gpu -> blows up
lj96/cut/gpu, coul/cut, none -> works fine
lj96/cut, coul/long/gpu, pppm/gpu -> works fine
lj96/cut/gpu, coul/long/gpu, pppm -> blows up
lj96/cut/gpu, coul/long/gpu, ewald -> blows up
lj/cut/gpu (12-6 style), coul/long/gpu, pppm/gpu -> works fine

The problem seems to occur only when lj96/cut/gpu is used in conjunction with a long-range solver.
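In case it helps, the relevant part of my input script boils down to something like the following (the cutoffs, pair coefficients, and PPPM accuracy here are placeholders, not my actual values):

  package      gpu force/neigh 0 0 1               # GPU package setup
  pair_style   hybrid/overlay lj96/cut/gpu 12.0 coul/long/gpu 12.0
  pair_coeff   * * lj96/cut/gpu 0.1 3.5            # placeholder epsilon/sigma
  pair_coeff   * * coul/long/gpu
  kspace_style pppm/gpu 1.0e-4                     # pppm or ewald also blows up

Swapping the /gpu suffixes on and off as in the list above is the only change between the working and failing runs.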

The deviceQuery output for this card is:

Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version 6.0 / 5.0
  CUDA Capability Major/Minor version number: 3.5
  Total amount of global memory: 5760 MBytes (6039339008 bytes)
  (14) Multiprocessors x (192) CUDA Cores/MP: 2688 CUDA Cores

To test whether this problem is isolated to my local cluster, I also ran my input on Amazon's EC2 servers. On a cg1.4xlarge instance running a Tesla C2070, everything works fine. On a g2.2xlarge instance running a newer compute capability 3.5 GPU card, I get the same pressure explosion problem.

My input and data files are too large to post here, but I can email them to you if you'd like to try them. I'd really appreciate any feedback.

Thanks,
Jeff Camp
Sholl Group
Georgia Tech

I assume these combos work fine if you run all on CPU?
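I.e. does the all-CPU analog of the failing combo, something like this (cutoffs are placeholders), run OK?

  pair_style   hybrid/overlay lj96/cut 12.0 coul/long 12.0
  kspace_style pppm 1.0e-4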

Maybe Axel or Trung has an idea.

Steve

jeff,

before anybody will take a closer look, please try out the latest
development version. there have been updates and bugfixes.
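for example (this assumes the read-only git mirror; a daily tarball from lammps.sandia.gov works just as well):

  git clone git://git.lammps.org/lammps-ro.git lammps-dev
  cd lammps-dev/src
  make yes-gpu    # enable the GPU package (rebuild lib/gpu first)
  make linux      # or whatever machine makefile you use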

thanks,
     axel.