MPI error

Hi everyone,

I have encountered a confusing problem. When I run a code with a few molecules (about 20) in the box, it works well. However, when I put more molecules (about 150) inside, the code runs for a long time (300,000 timesteps) and then stops with no message in the log file, and the error file looks like this:

srun: error: cn175: task 0: Killed
srun: Terminating job step 550210.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 550210.0 ON cn175 CANCELLED AT 2016-12-23T03:55:11 ***
slurmstepd: *** STEP 550210.0 ON cn175 CANCELLED AT 2016-12-23T03:55:11 ***
srun: error: cn176: tasks 25,27,29,33-34,44,47: Killed
srun: error: cn175: tasks 4-6,9,13,18,22-23: Killed
srun: error: cn176: task 41: Killed
srun: error: cn175: task 1: Killed
srun: error: cn176: tasks 32,36: Killed
srun: error: cn175: tasks 2,14: Killed
srun: error: cn176: tasks 24,30,43,46: Killed
srun: error: cn175: tasks 8,10-11,17: Killed
srun: error: cn176: tasks 31,35,37: Killed
srun: error: cn175: tasks 3,7,15-16: Killed
srun: error: cn176: tasks 26,28,40,45: Killed
srun: error: cn176: tasks 38-39,42: Killed
srun: error: cn175: tasks 12,19-21: Killed

The code contains a homemade angle potential for computing properties of the molecules. When I change the physical values in the program, it still stops at (almost) the same timestep, so I have checked that the problem is not related to the physical process we are studying; it only depends on the number of timesteps run. That is very confusing. I think it is some error that accumulates during the MPI run, but I cannot track it down. Does anyone recognize this, or have any suggestions?

Thanks,

Huilin

If you’re saying that LAMMPS (as provided) runs w/out
a hang or MPI error, but with your added potential
it does not, then I think you have to do the debugging …

Steve

Hi Steve,

Thanks for your reply. Yes, when I use many processors it has the problem, but when I use a single core it works. In the potential I coded myself there are only a few statements involving MPI, as follows:

MPI_Allreduce(cm[0],cmall[0],3*nmolecules, MPI_DOUBLE,MPI_SUM,world);

MPI_Allreduce(ntot_tmp,ntot,nmolecules,MPI_INT,MPI_SUM,world);

and

MPI_Allreduce(Atot_tmp,Atot,nmolecules,MPI_DOUBLE,MPI_SUM,world);
MPI_Allreduce(Vtot_tmp,Vtot,nmolecules,MPI_DOUBLE,MPI_SUM,world);
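
For context, a minimal standalone sketch (not the actual potential code) of the reduction pattern these calls follow, assuming the local per-molecule arrays are zeroed and filled on each rank before the collective; the array size and the dummy per-rank contribution are just placeholders:

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int me, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int nmolecules = 4;                       // assumed example size
  std::vector<double> cm(3 * nmolecules, 0.0);    // local partial sums, zeroed
  std::vector<double> cmall(3 * nmolecules, 0.0); // global result

  // each rank would add contributions from its own atoms into cm here;
  // a dummy contribution stands in for that:
  for (int m = 0; m < nmolecules; m++) cm[3*m] = me;

  // every rank must call this with the same count and datatype,
  // even a rank that happens to own no atoms of some molecule
  MPI_Allreduce(cm.data(), cmall.data(), 3 * nmolecules,
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (me == 0)
    printf("cmall[0] = %g (expected %d)\n", cmall[0], nprocs * (nprocs - 1) / 2);
  MPI_Finalize();
  return 0;
}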

I don't know where the error is!

Huilin

I don't know where the error is!

No one is likely to debug code you wrote for you.
You'll have to do it yourself. Find the smallest
number of procs (e.g. 2) that reproduces the problem.
Add some print statements. Ensure that both
procs are reaching the Allreduce operations together.
Etc.
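
For example, a minimal standalone sketch of that kind of debug print, assuming the collective sits inside a per-timestep loop; in a LAMMPS class the rank and step would come from comm->me and update->ntimestep instead of the stand-ins below:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);

  for (long step = 0; step < 3; step++) {       // stand-in for the MD loop
    double local = me, global = 0.0;

    // every rank should print once per step; if one rank is missing
    // (e.g. it took an early return), the Allreduce below will hang
    printf("rank %d entering Allreduce at step %ld\n", me, step);
    fflush(stdout);

    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}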

Steve

I don't know where the error is!

Nobody else can know; you wrote the code. ...and just showing us a few
MPI commands won't help.

One common mistake that people make, which manifests only when using a
large number of processors, is writing code that does not handle the
situation of having no local atoms on a processor. This can easily be
tested on a small system.
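
A minimal standalone sketch of that point, with hypothetical names: the per-processor accumulation must still be correct when a rank owns zero atoms.

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int me, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // pretend only rank 0 owns atoms; every other rank has nlocal == 0,
  // which is exactly what happens with many procs and a small system
  int nlocal = (me == 0) ? 5 : 0;
  std::vector<double> x(nlocal, 1.0);       // dummy per-atom data

  double local_sum = 0.0;                   // well-defined even when nlocal == 0
  for (int i = 0; i < nlocal; i++) local_sum += x[i];

  double total = 0.0;
  MPI_Allreduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (me == 0) printf("total = %g (correct even with empty ranks)\n", total);
  MPI_Finalize();
  return 0;
}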

Another common mistake is not properly initializing variables. On
Linux, newly allocated storage (via "new", "malloc",
"memory->create()", etc.) is usually initialized to all zeroes, but
with more processors these blocks get smaller and may be recycled
from previously allocated and then freed storage, and those blocks are
not zeroed.
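
A small sketch of that point, with hypothetical names borrowed from the calls quoted above: per-timestep accumulators must be cleared explicitly before use rather than relying on fresh memory happening to be zero.

#include <cstring>
#include <cstdio>

int main()
{
  const int nmolecules = 100;

  // new[] / malloc() give uninitialized storage; reading it before writing
  // is undefined behaviour and may "work" until a block gets recycled
  double *Atot_tmp = new double[nmolecules];

  // explicit zeroing before every accumulation pass fixes that:
  memset(Atot_tmp, 0, nmolecules * sizeof(double));
  // or: for (int m = 0; m < nmolecules; m++) Atot_tmp[m] = 0.0;

  Atot_tmp[0] += 1.0;   // accumulation now starts from a well-defined zero
  printf("Atot_tmp[0] = %g\n", Atot_tmp[0]);

  delete[] Atot_tmp;
  return 0;
}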

As Steve mentioned, you will have to sort this out yourself. Debugging
complex code is not always straightforward and sometimes requires some
thinking, perseverance, and luck.

axel.

I remember you brought up this issue before. A hang most likely means that not every task is reaching or completing the Allreduce operation. You have to make sure every task executes those Allreduce calls, or the implied barrier will lock the program at that timestep.
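
A minimal standalone sketch of that failure mode, with hypothetical logic: if even one rank skips the collective, the others block in it forever.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);

  int nlocal = (me == 0) ? 0 : 10;   // rank 0 happens to own no atoms
  double local = nlocal, total = 0.0;

  // BROKEN: calling the collective conditionally means rank 0 never arrives
  // and every other rank waits in the implied barrier, so the run hangs:
  // if (nlocal > 0)
  //   MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  // CORRECT: every rank calls the collective unconditionally
  MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (me == 0) printf("total = %g\n", total);
  MPI_Finalize();
  return 0;
}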

Thanks very much for the suggestions. To rule out an MPI error, I ran the program on a single core, and it ran for more timesteps (about 400,000) than before (300,000). However, it still stops without any error message in the log file. It looks more like an error that accumulates with the timesteps.