Regarding running a LAMMPS Program on HPC

Dear all,

I am trying to run a simulation on an HPC cluster. My LAMMPS version is now lammps-10Aug15, and the error still occurs with this latest version. My region is defined as: region whole block 0 20.0 -80.0 80.0 -80.0 80.0. When we increase the size of the simulation box, a segmentation fault occurs.
The error appears when running on multiple processors (nodes=5;ppn=20).
When we run the same program on a single processor, there is no error. Can anybody tell me how to debug this problem? I have consulted my HPC engineer, who says there is no problem with the HPC system.

With Regards,
Ritesh Satwani

[c5b:31559:0] Caught signal 11 (Segmentation fault)
[c5b:31551:0] Caught signal 11 (Segmentation fault)
[c5b:31552:0] Caught signal 11 (Segmentation fault)
[c5b:31554:0] Caught signal 11 (Segmentation fault)
[c5b:31557:0] Caught signal 11 (Segmentation fault)
[c5b:31563:0] Caught signal 11 (Segmentation fault)
[c5b:31562:0] Caught signal 11 (Segmentation fault)
[c5b:31567:0] Caught signal 11 (Segmentation fault)
[c5b:31570:0] Caught signal 11 (Segmentation fault)
[c5b:31569:0] Caught signal 11 (Segmentation fault)
[c5b:31566:0] Caught signal 11 (Segmentation fault)
[c5d:802 :0] Caught signal 11 (Segmentation fault)
[c5d:803 :0] Caught signal 11 (Segmentation fault)
[c5d:804 :0] Caught signal 11 (Segmentation fault)
[c5d:807 :0] Caught signal 11 (Segmentation fault)
[c5d:809 :0] Caught signal 11 (Segmentation fault)
[c5d:814 :0] Caught signal 11 (Segmentation fault)
[c5d:806 :0] Caught signal 11 (Segmentation fault)
[c5a:30025:0] Caught signal 11 (Segmentation fault)
[c5a:30024:0] Caught signal 11 (Segmentation fault)
[c5a:30026:0] Caught signal 11 (Segmentation fault)
[c5a:30029:0] Caught signal 11 (Segmentation fault)
[c5a:30031:0] Caught signal 11 (Segmentation fault)
[c5a:30028:0] Caught signal 11 (Segmentation fault)
[c5a:30030:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
==== backtrace ====
2 0x000000000006397c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.2.2989/src/mxm/util/debug/debug.c:641
3 0x0000000000063aec mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.2.2989/src/mxm/util/debug/debug.c:616
4 0x00000038d3e329a0 killpg() ??:0
5 0x000000000077698c _ZN9LAMMPS_NS8Neighbor8full_binEPNS_9NeighListE() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/neigh_full.cpp:305
6 0x000000000076e0d2 _ZN9LAMMPS_NS8Neighbor9build_oneEPNS_9NeighListEi() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/neighbor.cpp:1607
7 0x00000000004b6eb2 _ZN9LAMMPS_NS16ComputeCoordAtom15compute_peratomEv() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/compute_coord_atom.cpp:144
8 0x00000000005348a4 _ZN9LAMMPS_NS10DumpCustom5countEv() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/dump_custom.cpp:417
9 0x000000000052cb4a _ZN9LAMMPS_NS4Dump5writeEv() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/dump.cpp:292
10 0x000000000078a878 _ZN9LAMMPS_NS6Output5writeEl() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/output.cpp:303
11 0x0000000000ae56c0 _ZN9LAMMPS_NS6Verlet3runEi() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/verlet.cpp:310
12 0x0000000000ab2af9 _ZN9LAMMPS_NS3Run7commandEiPPc() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/run.cpp:175
13 0x0000000000705ef3 _ZN9LAMMPS_NS5Input15command_creatorINS_3RunEEEvPNS_6LAMMPSEiPPc() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/input.cpp:720
14 0x00000000007045bb _ZN9LAMMPS_NS5Input15execute_commandEv() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/input.cpp:703
15 0x000000000070587e _ZN9LAMMPS_NS5Input4fileEv() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/input.cpp:241
16 0x00000000007147e9 main() /scratch/compile/mukesh.gcc/mukesh/lammps/lammps-10Aug15/src/Obj_mpi/…/main.cpp:31
17 0x00000038d3e1ed1d __libc_start_main() ??:0
18 0x0000000000417629 _start() ??:0
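
To make the call chain in the backtrace easier to read, the mangled C++ symbols can be decoded. The following is a minimal sketch, not a full demangler: it assumes the symbols follow the Itanium C++ ABI nested-name scheme (`_ZN` followed by length-prefixed identifiers) and it ignores template arguments and parameter types. The helper name `qualified_name` is made up for this illustration.

```python
import re


def qualified_name(mangled: str) -> str:
    """Decode the leading `_ZN<len><id>...` part of a mangled symbol.

    Only the namespace/class/function names are recovered; template
    arguments and parameter types after them are ignored.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("not a nested Itanium-ABI symbol: " + mangled)
    pos = 3
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(mangled) and mangled[pos].isdigit():
        digits = re.match(r"\d+", mangled[pos:]).group()
        start = pos + len(digits)
        length = int(digits)
        parts.append(mangled[start:start + length])
        pos = start + length
    return "::".join(parts)


# Frames 5 to 8 of the backtrace, innermost first.
frames = [
    "_ZN9LAMMPS_NS8Neighbor8full_binEPNS_9NeighListE",
    "_ZN9LAMMPS_NS8Neighbor9build_oneEPNS_9NeighListEi",
    "_ZN9LAMMPS_NS16ComputeCoordAtom15compute_peratomEv",
    "_ZN9LAMMPS_NS10DumpCustom5countEv",
]

for symbol in frames:
    print(qualified_name(symbol))
```

Read from the bottom up, the decoded names show that the crash happens in Neighbor::full_bin, reached through Neighbor::build_one from ComputeCoordAtom::compute_peratom, which in turn is called while a custom dump counts its output (DumpCustom::count).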

It looks like the segfault happens while building the neighbor list. Are you sure your dynamics are OK, both for the single-core job and the parallel one? That is, does the kinetic energy of the system blow up shortly before the segfault?
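
One way to check for such a blow-up is to scan the thermo output in the log file. The sketch below is only an illustration under several assumptions: that thermo blocks start with a header line whose first field is `Step`, that the kinetic energy column is labeled `KinEng`, and that a jump by a factor of 5 between consecutive outputs counts as suspicious. The sample log text and its numbers are invented for the example and are not taken from the run discussed here.

```python
def read_thermo(log_text, column="KinEng"):
    """Collect (step, value) pairs for one thermo column from log text."""
    rows = []
    header = None
    for line in log_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Step":
            header = fields  # a new thermo block starts here
            continue
        if header is None:
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            header = None  # a non-numeric line ends the block
            continue
        if len(values) != len(header):
            header = None  # blank or malformed line ends the block
            continue
        if column in header:
            rows.append((int(values[0]), values[header.index(column)]))
    return rows


def first_jump(rows, factor=5.0):
    """Return the first step whose value exceeds factor times the previous one."""
    for (_, prev), (step, value) in zip(rows, rows[1:]):
        if prev > 0 and value > factor * prev:
            return step
    return None


# Invented sample in the assumed format; real logs will differ.
sample_log = """\
Step Temp KinEng PotEng
0 300 120.5 -4500.2
100 305 122.4 -4498.9
200 2950 1184.0 -3101.7
Loop time of 1.2 on 100 procs
"""

rows = read_thermo(sample_log)
print(rows)
print("first suspicious step:", first_jump(rows))
```

Printing thermo output more frequently in the steps before the crash makes such a scan more informative, since a blow-up can develop within a few timesteps.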

2015-08-24 10:41 GMT-04:00 Ritesh Satwani:

Dear all,
I am trying to run a simulation on an HPC cluster. My LAMMPS version is now
lammps-10Aug15, and the error still occurs with this latest version. My
region is defined as: region whole block 0 20.0 -80.0 80.0 -80.0 80.0. When
we increase the size of the simulation box, a segmentation fault occurs.
The error appears when running on multiple processors (nodes=5;ppn=20).
When we run the same program on a single processor, there is no error. Can
anybody tell me how to debug this problem? I have consulted my HPC engineer,
who says there is no problem with the HPC system.

there are multiple possible reasons for that; however, it is close to
impossible to tell them apart from only a stack trace and a couple of lines
of input.
furthermore, the stack trace indicates a segmentation fault at a location
that does not make any sense at all, so the cause has to be somewhat
subtle.

please make a copy of your input, remove everything that can be removed
while still causing the segmentation fault, and then post the reduced input
to the mailing list.

axel.
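
The reduction suggested above can be done by hand, but the idea can also be sketched as a small greedy loop: drop one line at a time and keep the removal only if the failure still occurs. This is a rough sketch under stated assumptions. The function `minimize_lines`, the predicate `stand_in_check`, and the example input lines are all hypothetical; apart from the region command quoted in the thread, the input is a placeholder and not the poster's actual script. In real use the predicate would have to launch the parallel job on the candidate input and report whether it still segfaults.

```python
def minimize_lines(lines, still_fails):
    """Greedily remove lines while still_fails(candidate) remains true."""
    kept = list(lines)
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(kept):
            candidate = kept[:i] + kept[i + 1:]
            if still_fails(candidate):
                kept = candidate  # line i was not needed to reproduce
                changed = True
            else:
                i += 1  # line i is needed, move on
    return kept


# Placeholder input; only the region line comes from the thread.
example_input = [
    "units metal",
    "region whole block 0 20.0 -80.0 80.0 -80.0 80.0",
    "compute cn all coord/atom 3.0",
    "thermo 100",
    "dump d1 all custom 100 out.dump id c_cn",
    "run 1000",
]


def stand_in_check(lines):
    """Stand-in for 'run the job and see if it crashes' (assumption)."""
    text = "\n".join(lines)
    return "coord/atom" in text and "dump" in text and "run" in text


reduced = minimize_lines(example_input, stand_in_check)
print("\n".join(reduced))
```

With the stand-in predicate, the loop keeps only the compute, dump, and run lines. A real reduced input would still need whatever commands are required to create the system, so the predicate must treat "fails to start" differently from "reproduces the segmentation fault".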