[lammps-users] Job fails to complete when running at higher core counts

I was doing some scalability testing, planning to run a large problem (365,000 atoms, 1M time steps) on 32, 64, 128, and 256 cores.

The jobs complete on the first three runs (32, 64, 128), but fail on the 256 core run. Actually, they never fully die, as the job still shows as running in the scheduler:

login-1-0:jpummil:/fasttmp/jpummil$ showq | grep jpummil
77057 jpummil Running 256 23:59:54 Thu May 20 15:25:51

But long after the problem should have completed, I kill the job with canceljob and then look at the output:

login-1-0:jpummil:/fasttmp/jpummil$ more LaMMPS.77057.moab
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file
or directory

I am pretty sure that lammps doesn’t use BLAS, so I am not certain why it is giving this error. Even if it does, the smaller runs ran to completion.

I am running lammps from an NFS mounted shared filesystem. Is there a possible issue with starting that many MPI tasks at once? Maybe I should move the binary to /fasttmp/jpummil which is a Lustre FS shared across the nodes with Infiniband?

Attached are my PBS script and my input deck…

Thank You!

in.testbench_huge (1.07 KB)

LaMMPS-OMPI.pbs (291 Bytes)

hi jeff,

> login-1-0:jpummil:/fasttmp/jpummil$ more LaMMPS.77057.moab
> /share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading
> shared libraries: libblas.so.3: cannot open shared object file: No such
> file

well, it looks as if some nodes don't have the shared BLAS library installed.

> I am pretty sure that lammps doesn't use BLAS, so I am not certain why it is
> giving this error. Even if it does, the smaller runs ran to completion.

if LAMMPS has been compiled with support for the ATC package, then
it _does_ use BLAS and LAPACK.

i suggest checking your compiled binary with ldd and then checking
whether all of the nodes have BLAS installed.
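
The ldd check above could be scripted roughly as follows. This is only a minimal sketch, not anything shipped with LAMMPS: the function names and the script name check_libs.py are made up for illustration, and it assumes ldd marks unresolved libraries with "=> not found", as glibc's ldd does.

```python
import subprocess
import sys


def missing_libs(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libblas.so.3 => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def check_binary(path):
    """Run ldd on the given binary and return its unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libs(result.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_libs.py /path/to/lmp_binary
    unresolved = check_binary(sys.argv[1])
    if unresolved:
        print("missing:", ", ".join(unresolved))
        sys.exit(1)
    print("all shared libraries resolved")
```

The result only describes the machine the script runs on, so it has to be run on every compute node (not just the login node) to find the ones missing libblas.so.3. How to launch it on each node depends on the site's scheduler and is left out here.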

cheers,
    axel.

(tg champion for temple)

Hey Axel!

You were correct as usual ;-)

Lib distribution problems on a few nodes...

Appreciate it!

Have a nice weekend.