[lammps-users] running lammps parallel

Dear all,
I have been trying to set up LAMMPS to run in parallel on an Athlon cluster. The website states that I should ask such questions of a local expert, but we have been working on this for weeks and have run into a lot of trouble. I just want to make sure it is not simply a command problem.

For the same system and inputs that LAMMPS runs on a single processor without any error, I get three types of errors when I run it in parallel.

  1. ERROR on proc 0: Failed to reallocate 204955488 bytes for array atom:dihedral_atom1
    mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 10.0.0.22, killing all.
    [0] MPI Abort by user Aborting program !
    [0] Aborting program!

It runs on a single processor, so I do not understand why it would run out of memory. Below is my batch script:

#PBS -l walltime=300:05:10
#PBS -l nodes=2:ppn=2
#PBS -N nptpolymer2
#PBS -S /bin/ksh
#PBS -j oe
cd $HOME/systems/polymer/parallel
mpiexec -np 4 ./lmp_linux < polymer2.in

In my input file I put this command: processors 2 2 1

I tried a smaller system (far fewer atoms), and it worked fine.

  2. To confirm that it works for smaller systems, I ran another small system with a different morphology on 4 processors. The output file contains the following error:
    Dihedral problem: 1 2 55 57 84 83
    1st atom: 1 nan nan nan
    2nd atom: 1 nan nan nan
    3rd atom: 1 nan nan nan
    4th atom: 1 nan nan nan
    Dihedral problem: 1 2 77 79 80 53
    1st atom: 1 nan nan nan
    2nd atom: 1 nan nan nan
    3rd atom: 1 nan nan nan
    4th atom: 1 nan nan nan

It continues like this. Even though the second dihedral involves a different set of atoms, it is again numbered as dihedral 1. It is not like this in my data file.

The log file does not contain this error message, but it prints nan for the energy, temperature …

  3. For larger systems I tried writing processors 1 4 1. It complains of a bad grid. For any other combination of node and processor counts, it fails to calculate anything. For instance:

nodes = 2, processors per node = 1
-np 2

or

nodes = 4, ppn = 2
-np 8

In this cluster there are two processors per node. It looks like there is a mapping problem, and I do not know how to get it working for large systems (such as 30,000 atoms) over extended runs (100 ps). The only successful run so far was for 3,000 atoms.

I hope this question was not too long. Thank you for taking the time to read it. Looking forward to your responses.
Best regards,
Burcu Eksioglu

Hi,
       While trying to compare the time a job requires in different cluster environments for each component (Pair, Communication, etc.), I found in a particular log file that three of those components (Bond, Output, and Other) have negative time (and %) values. Is that natural, or does it have something to do with the cluster itself (there are some stability issues in the cluster, as far as I know)? Thanks in advance.

Thanks
Arnab

I don't know how you could get a negative time for something
like Bond or Output, since each is computed explicitly as the
difference of two timer readings. A negative Other is possible,
I suppose, since it is inferred from the other times.
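
To illustrate the difference, here is a rough sketch (not the actual LAMMPS timing code) of the two ways a component time can be produced. The inferred leftover can dip below zero from timer jitter alone, while an explicitly timed component cannot:

// timers.cpp -- sketch of explicit vs. inferred component timing
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  double t_total = MPI_Wtime();

  // Explicitly timed component: the difference of two readings of
  // the same clock, so it cannot be negative unless the clock
  // itself misbehaves.
  double t0 = MPI_Wtime();
  /* ... bond computation would run here ... */
  double bond = MPI_Wtime() - t0;

  t0 = MPI_Wtime();
  /* ... pair computation would run here ... */
  double pair = MPI_Wtime() - t0;

  double total = MPI_Wtime() - t_total;

  // Inferred component: whatever is left of the total. Timer
  // resolution and jitter can push this slightly negative even
  // when nothing is wrong.
  double other = total - (bond + pair);
  printf("bond %g  pair %g  other %g\n", bond, pair, other);

  MPI_Finalize();
  return 0;
}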

Steve

You shouldn't need to use the "processors" command. Without
it, LAMMPS will simply use all the procs you are running on.

This sounds to me like a system problem. Can you successfully
run any parallel job on your system via mpirun? Can you write
your own simple parallel code using the same constructs that
LAMMPS uses to initialize itself and get it to run?
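
As a sanity check, a minimal MPI program along these lines (a sketch, but it exercises the same MPI_Init / MPI_Comm_rank / MPI_Comm_size startup calls that LAMMPS makes) should compile and run cleanly under your mpiexec line before LAMMPS can be expected to:

// mpitest.cpp -- minimal check that MPI itself works on the cluster
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this proc's ID
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs); // total number of procs
  printf("Hello from proc %d of %d\n", rank, nprocs);
  MPI_Finalize();
  return 0;
}

Compile it with your MPI compiler wrapper (e.g. mpicxx mpitest.cpp -o mpitest) and submit it through the same batch script, replacing the LAMMPS line with mpiexec -np 4 ./mpitest. If that fails, or prints fewer procs than expected, the problem is in the MPI/cluster setup rather than in LAMMPS.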

For the first error, you can look in the code and see where the error
is coming from. There will be a smalloc or other memory allocation
call. You could print out the values that are being passed. The non-
proc-0 processors probably have a large or negative value that is
causing the memory allocator to croak. That should give you a clue
as to why those procs have a bad value (possibly from a previous
broadcast from proc 0).
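
Something along these lines (illustrative only; the actual smalloc/srealloc call site and error text vary by LAMMPS version) shows the kind of per-processor print that will expose a bad size before the allocator croaks:

// debug_alloc.cpp -- sketch of printing the byte count each proc
// passes to the allocator; in LAMMPS itself the print would go just
// before the smalloc/srealloc call that raised the error
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

void *checked_malloc(long nbytes, const char *name, int me) {
  // A huge or negative nbytes here points at bad input on this
  // proc, e.g. a value never correctly broadcast from proc 0.
  printf("proc %d: requesting %ld bytes for %s\n", me, nbytes, name);
  if (nbytes <= 0) {
    fprintf(stderr, "proc %d: bad size for %s\n", me, name);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  return malloc((size_t) nbytes);
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int me;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  void *p = checked_malloc(1024, "atom:dihedral_atom1", me);
  free(p);
  MPI_Finalize();
  return 0;
}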

Steve