Please submit questions to the mailing list. To answer your question: you will find a recent discussion of a problem with the same symptoms, but with a more informative subject line: "Run stops without error - Reax/c"
Aidan
Dear Aidan,
Thank you so much for your kind advice.
Here is some more information. The input file is exactly the same.
in file
before reporting any problems with the LAMMPS code not working, you should first try running the same input with the very latest patch of LAMMPS.
your LAMMPS version is two years old. a lot of improvements have been made since. thus to avoid trying to correct problems that are already corrected, check your input with the latest patch.
also, there should be more output than log files, e.g. regular stdout and stderr output. if you are running using a batch system, they may be in separate files. especially the stderr output is crucial, as it will for certain contain output from the MPI library. no MPI parallel job just stops like what you are reporting without such output.
assuming that you are running under batch, what is the amount of time requested for this job? a quick back-of-the-envelope check indicates that your calculation was cut off after about 12hrs, a quite typical wall clock limit for batch queues. are you sure that your job wasn't simply terminated because you ran out of time?
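as a rough illustration of that kind of check, here is a small bash sketch that compares an elapsed time with a wall clock limit. the helper names, the 12:00:00 limit and the elapsed times are all made-up values for illustration, not taken from the actual job.

```shell
#!/bin/bash
# Hypothetical helpers; names and numbers are illustrative assumptions.

# Convert an HH:MM:SS wall clock limit into seconds.
walltime_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that fields like "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Print "yes" if the elapsed time (in seconds) is within a margin of the limit.
near_limit() {
    local elapsed=$1 limit=$2 margin=${3:-300}
    if (( elapsed >= limit - margin )); then
        echo "yes"
    else
        echo "no"
    fi
}

limit=$(walltime_to_seconds "12:00:00")   # assumed queue limit
near_limit 43100 "$limit"                 # assumed elapsed time close to 12 h
near_limit 3600 "$limit"                  # assumed elapsed time of 1 h
```

inside a job script, the elapsed time could be taken from bash's built-in SECONDS variable and echoed on the last line of the script.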
axel.
Please read the previous thread that I pointed out to you.
Dear Axel,
I took your advice: (1) updated LAMMPS to the 7 Dec 2015 version and ran the same input with that patch, (2) updated the MPI library,
(3) made sure that there is no wall clock limit. However, the problem appeared again.
I need some further advice. Thank you very much.
The input script is shown below:
nobody can give advice without information and you don't provide what
is *crucial* for determining the cause of the stop of your run.
i am *very* confident that there *is* additional information. there
*has* to be output printed to the console that is either printed to
the screen and that you are somehow discarding (e.g. by redirecting it
to /dev/null) or it is captured by the batch system and written to
files that you don't pay attention to or have disabled.
an application like LAMMPS doesn't just stop for no reason without any
indication of a problem. full stop.
provide the missing information, and people might help you.
axel.
Is this the calling script?
#!/bin/bash
#PBS -N lammps
#PBS -l nodes=1:ppn=16
#PBS -q new
project_name=in.TiO2
cd $PBS_O_WORKDIR
export LD_LIBRARY_PATH=/public/program/jpeg-8-itel2013/lib:/public/program/mpi/mpich2-1.5-intel2013/lib:/public/program/gcc-4.5.1/lib64/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/public/program/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:$LD_LIBRARY_PATH
ulimit -s unlimited
NSLOTS=`cat ${PBS_NODEFILE} | wc -l`
LMP_PATH=/public/src/lammps-7Dec15/src
OPENMPI_PATH=/public/program/mpi/mpich2-1.5-intel2013/bin
OUTDIR=/tmp/$USER/$PBS_JOBID.$PBS_JOBNAME
echo "--------------------- $PBS_JOBID INFORMATION ----------------------" > jobinfo.$PBS_JOBID
echo "" >> jobinfo.$PBS_JOBID
echo "ORIGINAL FILES locate :
$PBS_O_WORKDIR" >> jobinfo.$PBS_JOBID
echo "" >> jobinfo.$PBS_JOBID
echo "TEMPORARY FILES locate :
$OUTDIR" >> jobinfo.$PBS_JOBID
echo "" >> jobinfo.$PBS_JOBID
echo "PBS JOBNAME is :
$PBS_JOBNAME" >> jobinfo.$PBS_JOBID
echo "" >> jobinfo.$PBS_JOBID
echo "PBS JOB ID is :
$PBS_JOBID" >> jobinfo.$PBS_JOBID
echo "" >> jobinfo.$PBS_JOBID
echo "NUMBER of EXECUTIVE NODES is :
${NSLOTS}" >> jobinfo.$PBS_JOBID
echo "" >> jobinfo.$PBS_JOBID
mkdir -p $OUTDIR
cp -rf ${PBS_O_WORKDIR}/* ${OUTDIR}/
cd $OUTDIR
time ${OPENMPI_PATH}/mpirun -np ${NSLOTS} ${LMP_PATH}/lmp_linux -in ${project_name} > ${project_name}.log
cp -rf ${OUTDIR}/* ${PBS_O_WORKDIR}/
rm -rf $OUTDIR
that is your input *to* the batch system. what i am asking for is the
corresponding per-submission output *from* the batch system.
axel.
I just found this log:
LAMMPS (7 Dec 2015)
Reading data file …
orthogonal box = (-0.317941 -0.323637 -1.00378) to (54.8101 54.3212 50.2103)
4 by 2 by 2 MPI processor grid
reading atoms …
11700 atoms
WARNING: Resetting reneighboring criteria during minimization (…/min.cpp:168)
Neighbor list info …
2 neighbor list requests
update every 1 steps, delay 0 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 10 10 9
Setting up cg style minimization …
Unit style: real
Memory usage per processor = 136.866 Mbytes
Step Temp E_pair E_mol TotEng Press
0 0 -1064470.9 0 -1064470.9 -149113.04
1794 0 -1202287.6 0 -1202287.6 -1350.4279
Loop time of 1734.86 on 16 procs for 1794 steps with 11700 atoms
99.7% CPU use with 16 MPI tasks x no OpenMP threads
Minimization stats:
Stopping criterion = linesearch alpha is zero
Energy initial, next-to-last, final =
-1064470.86812 -1202287.59557 -1202287.59557
Force two-norm initial, final = 11039.9 69.7653
Force max component initial, final = 178.765 28.2834
Final line search alpha, max atom move = 1.64638e-12 4.65651e-11
Iterations, force evaluations = 1794 8806
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
If that's the log file that LAMMPS produced, then it may be truncated
(missing the last part with the error message) because the batch job
died before the file was flushed. Batch systems should also produce a
file that has everything that would have gone to the screen if you had
run interactively. That is where the error will likely be (at the end).
It would also be in the log file if you'd run interactively.
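A minimal sketch of how one might look for that file, assuming the Torque/PBS default naming (<jobname>.o<jobid> for stdout, <jobname>.e<jobid> for stderr) and that the files end up in the directory the job was submitted from. Your batch system may be configured differently, and the function name is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the last lines of every batch stderr file
# (assumed naming: <jobname>.e<jobid>) found in a directory.
show_stderr_tails() {
    local dir=${1:-.} count=${2:-20} f
    for f in "$dir"/*.e[0-9]*; do
        [ -e "$f" ] || continue        # the glob matched nothing
        echo "== $f =="
        tail -n "$count" "$f"
    done
}

# Example: inspect the current directory (assumed to be the submission directory).
show_stderr_tails . 20
```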
Steve
It is not possible for us to help you with this problem if you are unable to find the error message generated by the system. For example, under extreme conditions it is possible for the memory requirements of pair style reax/c to change very rapidly and this can result in a single process aborting in a very rough manner e.g.
if( total_hbonds >= hbonds->num_intrs ) {
  fprintf(stderr,
          "p%d: not enough space for hbonds! total=%d allocated=%d\n",
          system->my_rank, total_hbonds, hbonds->num_intrs );
  MPI_Abort( comm, INSUFFICIENT_MEMORY );
}
As I said about a month ago, if your computer system is unable to preserve this output to stderr, or you don’t know how to find it, then we can not diagnose your problem.
Aidan
Looking at the information again, I think if you add an ampersand (&) to the redirect you will get stderr in your output file, as follows:
time ${OPENMPI_PATH}/mpirun -np ${NSLOTS} ${LMP_PATH}/lmp_linux -in ${project_name} >& ${project_name}.log
Even when you did not do that, the standard error output is probably in another file, typically labelled with the job number.
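To illustrate the difference between the two redirections, here is a small self-contained sketch. fake_run is a made-up stand-in for the mpirun command line (it just writes one line to stdout and one to stderr), so nothing below is actual LAMMPS or MPI output.

```shell
#!/bin/bash
# fake_run is a hypothetical stand-in for "mpirun ... lmp_linux -in ...":
# it writes one line to stdout and one line to stderr.
fake_run() {
    echo "normal output on stdout"
    echo "error message on stderr" >&2
}

workdir=$(mktemp -d)

# Plain ">" captures stdout only; stderr is sent to a separate file here
# (in a batch job it would go wherever the batch system puts it).
fake_run > "$workdir/stdout_only.log" 2> "$workdir/stderr_only.log"

# ">&" (bash/csh) or the equivalent "> file 2>&1" captures both streams.
fake_run > "$workdir/combined.log" 2>&1

echo "stdout_only.log: $(wc -l < "$workdir/stdout_only.log") line(s)"
echo "combined.log:    $(wc -l < "$workdir/combined.log") line(s)"
```

With the combined form, an abort message like the hbonds one above would end up at the end of the same log file as the thermo output.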
Aidan