[lammps-users] Parallel Problems?

Hello,

I have a weird problem. I set up a small coordinate system for tweaking an input script for the friction simulations I am working on. The small test system runs fine in both serial and parallel. However, when I try to scale up the system size, LAMMPS will only run in serial. If I try to run in parallel, the output hangs indefinitely at the “run” command after all the input data has been read, with no complaints. The only difference between the input scripts for the small and large systems is the filenames. The only difference between the coordinate files is the number of atoms; the same code was used to generate both.

Any thoughts? Even odder: I just checked, and my original script will run on 1 and 8 processors but not on 4.

Thanks, Dave

The input script is kind of long but here it is for reference. It stalls at the first run command.

#Fe + lube simulation

dimension 3
boundary p p f

atom_style molecular
units metal
neighbor 0.5 bin

read_data tmp.dat
#read_restart fe+octane.restart.100000

#define pair styles
pair_style hybrid airebo 3.0 1 1 eam/fs lj/cut 9.433

#C-H interactions
pair_coeff * * airebo /gfs/software/lammps/potentials/CH.airebo C H NULL NULL NULL NULL NULL NULL
#Fe
pair_coeff * * eam/fs /gfs/software/lammps/potentials/Fe_mm.eam.fs NULL NULL Fe Fe Fe Fe Fe Fe
#Fe+C
pair_coeff 1 3* lj/cut 0.000904 3.7732 9.433
#Fe+H
pair_coeff 2 3* lj/cut 0.000208 3.3982 9.433

#set groups

group octane type 1 2
group topfix type 3
group topthrm type 4
group topfree type 5
group botfree type 6
group botthrm type 7
group botfix type 8
group mobile union topfree topthrm botfree botthrm octane
group topall union topfix topthrm topfree

#set initial velocity

velocity mobile create 300.0 4928459 rot yes mom yes dist gaussian
velocity botfix set 0.0 0.0 0.0 sum no units box

#set fixes

fix 1 all nve
fix 11 topfix viscous 0.01
fix 2 topfix setforce 0.0 0.0 NULL
fix 3 topfix aveforce NULL NULL -0.05
fix 4 mobile temp/berendsen 300.0 300.0 0.1
fix 5 octane temp/berendsen 300.0 300.0 0.1
fix 6 botfix setforce 0.0 0.0 0.0
fix 7 topall ave/atom 1 1 100 fx fy fz

#computes

compute c1 topthrm temp/com
compute c2 topthrm temp
compute c3 topfree temp/com
compute c4 topfree temp
compute c5 octane temp
compute c6 botfree temp
compute c7 botthrm temp
compute c8 mobile temp
compute c9 topall reduce sum f_7[1] f_7[2] f_7[3]

#run parameters
thermo 100
thermo_style custom step pe ke etotal press temp c_c8 c_c1 c_c2 c_c3 c_c4 c_c5 c_c6 c_c7 c_c9[1] c_c9[2] c_c9[3]
timestep 0.00025
dump 1 all atom 1000 fe+octane.dump
dump 2 all xyz 100 fe+octane.xyz
restart 1000 fe+octane.restart

#bring tip in with thermostat on all atoms

run 100

#start sliding

velocity topfix set 1.148 0.0 0.0 sum yes units box
unfix 4
fix 4 topthrm temp/berendsen 300. 300. 0.05
fix_modify 4 temp c1
fix 8 botthrm temp/berendsen 300. 300. 0.05

run 500

dave,

how large is "large"?

have you run "top" on the machine in question?
how much memory do the lmp_xxxx processes use?

try running with thermo output for every step.

cheers,
    axel.

Large is not much larger than small at present: ~4000 atoms vs. ~2000. That should still be a trivial number of atoms. Setting thermo 1 did nothing; it still hangs at the run command. It also doesn’t explain why my “small” system runs on 1 and 8 procs but not on 4.

Here’s the output of the top command:

top - 16:24:11 up 113 days, 4:44, 1 user, load average: 3.97, 2.37, 1.08
Tasks: 307 total, 5 running, 302 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.0%us, 0.0%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16430012k total, 1118796k used, 15311216k free, 200808k buffers
Swap: 1020116k total, 0k used, 1020116k free, 597428k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31319 schall2 25 0 182m 10m 5588 R 100 0.1 4:26.78 lmp_openmpi
31320 schall2 25 0 183m 11m 5820 R 100 0.1 4:26.90 lmp_openmpi
31321 schall2 25 0 183m 10m 5540 R 100 0.1 4:26.89 lmp_openmpi
31322 schall2 25 0 181m 9560 4844 R 100 0.1 4:26.89 lmp_openmpi

i would suspect that you have a problem with your initial configuration
and that your forces may be very high and then - depending on the
distribution of atoms across the processors - get summed up so that
they overflow or not. if you have the overflow, then your calculation
can get stuck in an iterative loop forever.

but this is speculation. try reducing the time step massively.

cheers,
    axel.
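
A minimal way to try both of those diagnostics in the script above; the specific values below are only examples, not numbers suggested in the thread:

#print thermo output every step and shrink the timestep to see
#whether the very first steps already misbehave
thermo 1
timestep 0.00001
run 20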

The timestep is already 0.25 fs, which is pretty small; that’s a typical number I use for airebo. Just for fun, I changed it to 0.025 fs. That didn’t help either.

Interestingly, if I change the thermo output to 1, even the parallel runs that worked before now stall at the run command. However, running with thermo 1 in serial works fine, and there is no funny business with the temperature or pressure either. I think the initial configuration is fine, but I’ll dump forces at step 1 and look for any outliers just to be sure. Would doing a minimization before MD help? I can always try it.
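
A sketch of both of those checks, assuming a throwaway dump file and generic minimizer tolerances (the file name and values below are placeholders, not from the thread):

#dump per-atom forces for the first step to look for outliers
dump 99 all custom 1 forces_check.dump id type fx fy fz
dump_modify 99 first yes
run 1
undump 99
#relax the starting geometry before the first MD run
minimize 1.0e-4 1.0e-6 100 1000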

Another thought: I am running an lj/cut potential for the metal-hydrocarbon interactions. What happens at the cutoff? Is there a discontinuity there that could be creating an overflow? Should I include a shift so it goes smoothly to zero?
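
For reference, lj/cut simply truncates the interaction at the cutoff, so there is a small jump in the energy there but the forces stay finite; if the energy discontinuity is a concern, pair_modify can shift the potential so it goes to zero at the cutoff (whether that is appropriate for every sub-style in a hybrid setup is worth checking in the pair_modify documentation):

#shift the pair energy to zero at the cutoff; forces are unchanged
pair_modify shift yes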

I’ve checked the forces and run a pre-MD minimization, and I still get a hang-up. I have narrowed the problem down to the “fix aveforce” command.

In short, if I run
fix 2 topfix setforce 0.0 0.0 NULL
#fix 3 topfix aveforce NULL NULL -0.05

with the fix aveforce line commented out of the script, all my systems run fine. When I uncomment it, the run hangs at the run command.

Occasionally I see another odd error when I run with the aveforce command uncommented:

ERROR: Variable name for fix aveforce does not exist.

Yet there are no variables in my aveforce command, just two NULLs and a numerical value. Is there a bug in this fix?

Thanks, Dave

that error message is very strange.

a bug in the fix? could be. i'll add it to my list of items to be looked at.

it could also be that your network is not perfect and drops
data unannounced when the load gets too high.

cheers,
    axel.

I can confirm getting this error message with similar settings. Axel and I are now investigating this.

Peter

Posted a patch (24Jun10) for this - an initialization line
was left out of fix aveforce when the variable option
was added.

Steve

Just to confirm, this patch appears to fix both the

ERROR: Variable name for fix aveforce does not exist.

and the issues I was having with parallel jobs hanging up.

Thanks for the quick fix! Much appreciated.

Dave Schall