MPI error in LAMMPS

Dear all,

I recently encountered an error that is confusing me. There is no error message printed in the log file; the only messages are like this:

srun: error: cn252: task 0: Killed
srun: Terminating job step 537118.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 537118.0 ON cn252 CANCELLED AT 2016-12-15T19:26:03 ***
slurmstepd: *** STEP 537118.0 ON cn252 CANCELLED AT 2016-12-15T19:26:03 ***
srun: error: cn252: tasks 5,21-22: Killed
srun: error: cn253: tasks 25,35,45: Killed
srun: error: cn252: tasks 7,14: Killed
srun: error: cn253: tasks 26,31-32,34: Killed
srun: error: cn252: task 15: Killed
srun: error: cn253: tasks 24,27-30,33: Killed
srun: error: cn253: tasks 37-40,43-44,46: Killed
srun: error: cn252: tasks 2-4,8-10,13,16,19-20: Killed
srun: error: cn253: tasks 36,41-42,47: Killed
srun: error: cn252: tasks 1,6,11-12,17-18,23: Killed

The program usually stops at a different timestep each run (though always close to the same point), and at a different point depending on the dump interval. I first thought this was due to some problem with MPI, but I cannot find it.

PS: I coded the angle and dihedral potentials myself, and I have tested that if I use the potentials already defined in LAMMPS there is no error, so the error is definitely happening in my code. I only use two MPI functions in my code, like the following:

MPI_Allreduce(cm[0], cmall[0], 3*nmolecules, MPI_DOUBLE, MPI_SUM, world);
MPI_Allreduce(ntot_tmp, ntot, nmolecules, MPI_INT, MPI_SUM, world);

and

MPI_Allreduce(&Atot_tmp[0], &Atot[0], nmolecules, MPI_DOUBLE, MPI_SUM, world);
MPI_Allreduce(&Vtot_tmp[0], &Vtot[0], nmolecules, MPI_DOUBLE, MPI_SUM, world);

which are used to compute the molecule centers of mass and the molecule volumes and areas, respectively. Has anybody encountered this, or does anybody have a suggestion about this problem?
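
For reference, here is a stripped-down, self-contained version of the reduction pattern I use. The atom and molecule data below are only placeholders and the loop body is illustrative; only the MPI_Allreduce calls mirror my real code:

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm world = MPI_COMM_WORLD;
  int me;
  MPI_Comm_rank(world, &me);

  const int nmolecules = 4;

  // per-rank partial sums: flattened [nmolecules][3] center-of-mass
  // accumulator, plus a per-molecule atom count
  std::vector<double> cm(3*nmolecules, 0.0), cmall(3*nmolecules, 0.0);
  std::vector<int> ntot_tmp(nmolecules, 0), ntot(nmolecules, 0);

  // placeholder "local atoms": each rank contributes two atoms per molecule
  for (int m = 0; m < nmolecules; m++) {
    ntot_tmp[m] = 2;
    cm[3*m+0] += 2.0*(me+1);   // sum of x coordinates of local atoms
    cm[3*m+1] += 2.0*(me+1);   // sum of y coordinates
    cm[3*m+2] += 2.0*(me+1);   // sum of z coordinates
  }

  // every rank must call both collectives, unconditionally
  MPI_Allreduce(cm.data(), cmall.data(), 3*nmolecules, MPI_DOUBLE, MPI_SUM, world);
  MPI_Allreduce(ntot_tmp.data(), ntot.data(), nmolecules, MPI_INT, MPI_SUM, world);

  if (me == 0)
    for (int m = 0; m < nmolecules; m++)
      printf("molecule %d: com x = %g (natoms = %d)\n",
             m, cmall[3*m]/ntot[m], ntot[m]);

  MPI_Finalize();
  return 0;
}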

Huilin Ye

This

MPI_Allreduce(ntot_tmp, ntot, nmolecules, MPI_INT, MPI_SUM, world);

should probably be this:

MPI_Allreduce(&ntot_tmp, &ntot, nmolecules, MPI_INT, MPI_SUM, world);

Steve

Hi Steve,

I think this may not be the reason. ntot and ntot_tmp are one-dimensional arrays, Atot and Vtot are one-dimensional arrays, and cm is a two-dimensional array.
Anyway, I can try it and see if this is the reason.

Thanks,

Huilin

I didn’t see the nmolecules length param. So if ntot is a vector, your code is correct; you should not use &ntot. Must be some other issue.
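
To spell out the distinction (the names below are purely illustrative, not from your code): an array name already decays to a pointer to its first element, so no & is needed and the count is the array length; the &-of-a-variable form is for reducing a single scalar, with a count of 1.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm world = MPI_COMM_WORLD;
  int me;
  MPI_Comm_rank(world, &me);

  const int nmolecules = 3;

  // array case: pass the array name (equivalently &ntot_tmp[0]), count = nmolecules
  int ntot_tmp[nmolecules] = {1, 2, 3};
  int ntot[nmolecules];
  MPI_Allreduce(ntot_tmp, ntot, nmolecules, MPI_INT, MPI_SUM, world);

  // scalar case: pass the address of the variable, count = 1
  int nlocal = me + 1;
  int nglobal = 0;
  MPI_Allreduce(&nlocal, &nglobal, 1, MPI_INT, MPI_SUM, world);

  if (me == 0) printf("ntot[0] = %d  nglobal = %d\n", ntot[0], nglobal);

  MPI_Finalize();
  return 0;
}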

Steve

MPI_Allreduce deadlocks whenever a task isn’t participating in the collective communication. Make sure every task in your job reaches the All_reduce.
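
To illustrate the failure mode (this is not your code, only the pattern): if even one task takes a branch that skips the collective, the remaining tasks block inside MPI_Allreduce forever. A minimal example that hangs when run on two or more tasks:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);

  double local = 1.0, global = 0.0;

  // rank-dependent condition around a collective: this is the bug
  if (me != 0)
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d done, global = %g\n", me, global);  // ranks > 0 never get here
  MPI_Finalize();
  return 0;
}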

Hi Adrian and Steve,

Thanks for your help. How can I make sure all tasks reach the All_reduce? Should I use MPI_Barrier?

My program can run for a long time (about 300,000 timesteps) and then stops with the error message I presented before.

Huilin

All_reduce already implies a barrier.

Hi Adrian,

Then how can I fix it? Based on my description, what do you think is causing this?

Thanks,

Huilin

You have to rigorously check your code to ensure your coded styles have no logic that can prevent a task from reaching the All_reduce call. I once had an issue where changing one variable (an averaging interval) in a compute prevented the compute from being called on every timestep by every task. This ultimately also resulted in an All_reduce deadlock, so these issues can be subtle.
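
Roughly, that subtle variant looks like the sketch below (placeholder names, not your actual compute): the collective sits inside a helper that is only invoked when some rank-local condition holds, so ranks whose condition differs fall out of step. The fix is to call the collective unconditionally and let idle ranks contribute zero.

#include <mpi.h>
#include <cstdio>

// helper that performs a reduction; every rank must call it every time
static double reduce_area(double local_area, MPI_Comm world)
{
  double total = 0.0;
  MPI_Allreduce(&local_area, &total, 1, MPI_DOUBLE, MPI_SUM, world);
  return total;
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);

  // pretend only odd ranks own any relevant atoms this step
  double local_area = (me % 2) ? 1.5 : 0.0;
  double total_area = 0.0;

  // BUGGY: skipping "empty" ranks desynchronizes the collective
  // if (local_area > 0.0) total_area = reduce_area(local_area, MPI_COMM_WORLD);

  // CORRECT: call unconditionally; contributing zero is harmless
  total_area = reduce_area(local_area, MPI_COMM_WORLD);

  if (me == 0) printf("total area = %g\n", total_area);
  MPI_Finalize();
  return 0;
}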