I was doing some scalability testing with plans of running a large problem (365,000 atoms, 1M timesteps) on 32, 64, 128, and 256 cores.
The jobs complete on the first three runs (32, 64, 128), but fail on the 256-core run. Actually, they never fully die; the job still shows as running in the scheduler:
login-1-0:jpummil:/fasttmp/jpummil$ showq | grep jpummil
77057 jpummil Running 256 23:59:54 Thu May 20 15:25:51
But long after the problem should have completed, I killed the job with canceljob and looked at the output:
login-1-0:jpummil:/fasttmp/jpummil$ more LaMMPS.77057.moab
/share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI: error while loading shared libraries: libblas.so.3: cannot open shared object file: No such file or directory
(the same error line is repeated eight times, presumably once per failing task)
I am pretty sure that LAMMPS doesn't use BLAS, so I am not certain why it is giving this error. Even if it did, the smaller runs ran to completion with the same binary.
I am running LAMMPS from an NFS-mounted shared filesystem. Is there a possible issue with starting that many MPI tasks at once? Should I move the binary to /fasttmp/jpummil, which is a Lustre filesystem shared across the nodes over InfiniBand?
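In case it helps anyone else looking at this, one quick check I plan to try is asking the dynamic linker which libraries the executable actually resolves on a compute node (run inside an interactive job, not on the login node, so the environment matches what the MPI tasks see). A minimal sketch; /bin/sh is used here only so the snippet runs anywhere, and you would substitute the real path from the errors above:

```shell
# Substitute the LAMMPS executable path from the error messages
# (e.g. /share/apps/lammps/lammps-11Jan10/src/lmp_Star-OMPI).
BIN=/bin/sh

# ldd lists every shared library the binary needs and where each one
# resolves; unresolved libraries show up as "not found".
ldd "$BIN"

# Count the unresolved libraries (0 means everything resolves here).
missing=$(ldd "$BIN" | grep -c "not found")
echo "unresolved libraries: $missing"
```

If libblas.so.3 shows up as "not found" on some nodes but not others, that would point at an inconsistent node image or LD_LIBRARY_PATH rather than at LAMMPS itself.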
Attached are my PBS script and my input deck…
Thank you!
in.testbench_huge (1.07 KB)
LaMMPS-OMPI.pbs (291 Bytes)