Varying results with multiple processors

Vijay_Subramanian · April 26, 2011, 10:56pm

Hi all,

My special thanks to Axel and Steve, amongst others, for their inputs to my previous queries.

The issue: Errors introduced in a simulation when multiple processors are used.

Description of the issue:

Our group is currently trying to simulate crack propagation in metals. In doing so, we recently observed a complete change in the crack propagation behavior depending on the number of processors used, for the same input script. This was observed with both the Jan 15, 2010 and Feb 18, 2011 versions. The input script is listed below my signature.

We tested it on a 1x1x1 configuration (1 processor), 8x8x1 (64 processors), 8x12x1 (96 processors) and 16x16x1 (256 processors). Openmpi_gcc-1.2.5 was used in all the runs.

Deviations in temperature and other computed parameters (dumped in the log.lammps file) began from the 200th step. The thermo command has a step size of 200.

The differences in computed parameters are so drastic that in some cases the crack splits the specimen into two halves, while in other cases it is prematurely arrested at the mid-length of the specimen. Unless the job is run entirely on a single processor, it is difficult to know which result is correct.

We wonder if such artifacts are due to approximation errors while passing the computed parameters between different compute nodes. If so, we would like to know of possible ways to eliminate or reduce this issue.

We appreciate any thoughts and experiences concerning this issue.

Thanks,

Vijay

sjplimp · April 27, 2011, 2:18pm

There are 3 reasons this could happen.

(1) You set up the problem differently on different numbers of processors, e.g. when initializing velocities. Some velocity command options do this, some do not. See the doc page.

(2) You use a command like fix langevin for thermostatting that does something different on different numbers of processors.

(3) You run for a while and two runs on different numbers of processors slowly diverge.

The symptom of the first two is that the thermo output of your two runs differs very quickly. I would print out thermo every step, not every 200, and see what is happening. This is typically fixable by setting things up correctly or using alternate commands.
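
One way to act on this is to compare the per-step thermo values of two runs and report the first step at which they disagree. The sketch below is only illustrative: the function name first_divergent_step, the tolerance, and the sample temperatures are made up for the example, and it assumes the thermo columns have already been read from each log file into lists of (step, value) pairs.

```python
import math


def first_divergent_step(run_a, run_b, rel_tol=1e-12):
    """Return the first step whose values differ beyond rel_tol, or None.

    run_a and run_b are lists of (step, value) pairs in step order.
    The default tolerance is an arbitrary choice for this sketch.
    """
    for (step_a, val_a), (step_b, val_b) in zip(run_a, run_b):
        if step_a != step_b:
            raise ValueError(f"step mismatch: {step_a} vs {step_b}")
        if not math.isclose(val_a, val_b, rel_tol=rel_tol, abs_tol=0.0):
            return step_a
    return None


# Made-up temperatures for illustration only (not from a real run).
serial = [(0, 300.0), (1, 299.2), (2, 298.7), (3, 298.1)]
parallel = [(0, 300.0), (1, 299.2), (2, 298.9), (3, 297.5)]

print(first_divergent_step(serial, parallel))
```

A mismatch in the first step or two would point to cases (1) or (2); agreement over many steps followed by a gradual drift would point to case (3).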

The symptom of the third is the slow divergence. There is typically nothing you can do about it, as it is due to low-level round-off. For something like crack propagation, a specific event is likely very sensitive to initial conditions and round-off, so there is little hope that you could reproduce identical behavior at long timescales. Basically, crack propagation has a random, chaotic nature to it.
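
The two ingredients of this slow divergence can be illustrated with a toy example. The sketch below is not LAMMPS code. It first shows that floating-point addition depends on the order of the operands, which is what changes when the same sum is accumulated across a different number of processors. It then uses the logistic map as a stand-in for a chaotic system to show how a difference of about 1e-15 grows to order 0.1. The function names, starting value, and threshold are assumptions chosen for the illustration.

```python
def logistic(x):
    """One step of the chaotic logistic map x -> 4x(1 - x)."""
    return 4.0 * x * (1.0 - x)


def steps_until_separated(x, y, threshold=0.1, max_steps=1000):
    """Iterate both trajectories; return the first step where they differ
    by more than threshold, or None if that never happens."""
    for step in range(1, max_steps + 1):
        x, y = logistic(x), logistic(y)
        if abs(x - y) > threshold:
            return step
    return None


# Same three numbers, two summation orders: the results differ in the last bit.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# Two starting points that differ only at round-off level.
print(steps_until_separated(0.3, 0.3 + 1e-15))
```

The per-step difference starts at round-off level, so the early output of two runs can look identical, yet the trajectories are fully decorrelated after a modest number of steps.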

Steve

Vijay_Subramanian · April 27, 2011, 2:40pm

Thanks Steve,

That was quite descriptive and helpful. I will look at the thermo output step by step for long/short term deviations and update on this issue.

For now, it looks like fixing the number of processors (such as 64) for all my trials would hopefully make the crack propagation behavior consistent.

Vijay