[lammps-users] MPICH version

Hello to all!

Which version of MPICH can LAMMPS work with on a cluster system?

Is it only MPICH1, only MPICH2, or maybe both?

LAMMPS uses only MPI calls from MPI 1.1. So
you should be able to link with 1.1 or 2.x (which is
backward compatible).
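
If you want to double-check what your installed library actually
reports, MPI_Get_version() returns the standard level it implements.
Here is a minimal sketch (the mpicxx wrapper name is an assumption
about how your MPICH was installed):

// version_check.cpp - print the MPI standard level the library implements
// build:  mpicxx version_check.cpp -o version_check
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int major = 0, minor = 0;
  MPI_Get_version(&major, &minor);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    printf("MPI standard version reported: %d.%d\n", major, minor);

  MPI_Finalize();
  return 0;
}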

Steve

Steve, maybe you have some ideas about this error:

I'm using MPICH2 release 1.0.8 and the 6Dec08 LAMMPS release on SMP
nodes (each node has two Xeon processors, and each processor has two
cores). When I run on one node everything is OK, but when I start on
two nodes I get a strange MPI error:

[[email protected]... ~]$ mpdtrace
w6
ap1

[[email protected]... ~]$ /usr/local/mpich2-1.0.7ver-sock/bin/mpiexec -n 8
/home/grid1/lmp_g++_poems < in.2D.11.12.2008
LAMMPS (21 May 2008)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................:
MPI_Bcast(buf=0xbfe4aadc, count=1, MPI_INT, root=0, MPI_COMM_WORLD)
failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the
process group structure with id <>[cli_3]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................:
MPI_Bcast(buf=0xbfe4aadc, count=1, MPI_INT, root=0, MPI_COMM_WORLD)
failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the
process group structure with id <>
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................:
MPI_Bcast(buf=0xbff08acc, count=1, MPI_INT, root=0, MPI_COMM_WORLD)
failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the
process group structure with id <>[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................:
MPI_Bcast(buf=0xbff08acc, count=1, MPI_INT, root=0, MPI_COMM_WORLD)
failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the
process group structure with id <>
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................:
MPI_Bcast(buf=0xbff6ee1c, count=1, MPI_INT, root=0, MPI_COMM_WORLD)
failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the
process group structure with id <>[cli_5]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)...............................:
MPI_Bcast(buf=0xbff6ee1c, count=1, MPI_INT, root=0, MPI_COMM_WORLD)
failed
MPIR_Bcast(198)..............................:
MPIC_Recv(81)................................:
MPIC_Wait(270)...............................:
MPIDI_CH3i_Progress_wait(215)................: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(640)...:
MPIDI_CH3_Sockconn_handle_connopen_event(887): unable to find the
process group structure with id <>
rank 5 in job 5 w6.gridzone.ru_45740 caused collective abort of all ranks
  exit status of rank 5: return code 1
rank 3 in job 5 w6.gridzone.ru_45740 caused collective abort of all ranks
  exit status of rank 3: return code 1
rank 1 in job 5 w6.gridzone.ru_45740 caused collective abort of all ranks
  exit status of rank 1: return code 1

I don't know what is causing your problem. The src/MAKE/Makefile.linux
is what I run on my box; it links to an MPICH2 that was built in the
standard way, with "make install" putting everything in /usr/local.
If you can't get LAMMPS to run with your installed MPI, can you get
any other MPI-based program (e.g. the test programs that ship with
MPICH) to run with it?
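
If it helps narrow things down, a minimal program that does nothing
but the MPI_Bcast pattern from your error stack (one MPI_INT broadcast
from rank 0) should either reproduce the failure across two nodes or
show that collectives work fine outside of LAMMPS. This is just a
sketch; the compiler/launcher paths below are assumptions based on the
install prefix you showed:

// bcast_test.cpp - broadcast a single int from rank 0, as in the error stack
// build:  /usr/local/mpich2-1.0.7ver-sock/bin/mpicxx bcast_test.cpp -o bcast_test
// run:    /usr/local/mpich2-1.0.7ver-sock/bin/mpiexec -n 8 ./bcast_test
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int value = (rank == 0) ? 42 : 0;   // only the root has the data
  MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("rank %d of %d received value %d\n", rank, size, value);

  MPI_Finalize();
  return 0;
}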

Steve

Yes, other programs (including the tests from the MPICH2 distribution
and my own) run in parallel and everything is OK.
When I start 8 MPI processes on one machine (an SMP computer)
everything works, but on two computers I get this error... I'm using
Makefile.g++_poems and an MPICH2 installed under a non-standard
prefix, which I link to in the makefile.
Here is the makefile itself:
CC = g++34
CCFLAGS = -g -O -I../../lib/poems \
          -DFFT_FFTW -DLAMMPS_GZIP -DMPICH_IGNORE_CXX_SEEK \
          -I/usr/local/mpich2-1.0.7ver-sock/include
DEPFLAGS = -M
LINK = g++34
LINKFLAGS = -g -O -L../../lib/poems -L/usr/local/mpich2-1.0.7ver-sock/lib
USRLIB = -lfftw -lmpich -lpoems
SYSLIB = -lpthread
ARCHIVE = ar
ARFLAGS = -rc
SIZE = size

# Link target

$(EXE): $(OBJ)
        $(LINK) $(LINKFLAGS) $(OBJ) $(USRLIB) $(SYSLIB) -o $(EXE)
        $(SIZE) $(EXE)

# Library target

lib: $(OBJ)
        $(ARCHIVE) $(ARFLAGS) $(EXE) $(OBJ)

# Compilation rules

%.o:%.cpp
        $(CC) $(CCFLAGS) -c $<

%.d:%.cpp
        $(CC) $(CCFLAGS) $(DEPFLAGS) $< > $@

# Individual dependencies

DEPENDS = $(OBJ:.o=.d)
include $(DEPENDS)

I guess I'd find a local expert in building/running MPI
programs. It's not likely to be a LAMMPS problem.

Steve