Problem with MPIIO package.

Dear LAMMPS users,

I have been using LAMMPS to simulate PbTiO3 bulk and nanoparticles under different conditions, with no problems until a few days ago, using the 10/10/2018 version of LAMMPS.

Out of nowhere (I did not recompile LAMMPS), the simulations started to crash. The first one crashed at the timestep where it had to write a restart file. After that, I tried new runs: the program loads the system configuration but then crashes again. I opened the dump file and it contained no proper format, only special characters. So a problem with the MPIIO package seems evident, because the program crashes exactly when it has to write a dump (mpiio) or restart (mpiio) file. When I run the simulations without writing restarts or dump files, LAMMPS works fine, without any inconvenience.
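For context, the output commands in question look roughly like this (the dump ID, group, and filenames below are illustrative, not my actual input):

```
# MPIIO variants of the output commands (require the MPIIO package)
dump            d1 all custom/mpiio 1000 dump.pto.mpiio id type x y z
# a restart filename ending in ".mpiio" selects MPIIO restart output
restart         5000 restart.pto.mpiio
```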

Just to be clear: when I say "the program crashes" I mean that the thermo output stops being written and the job keeps running indefinitely.

I have been reading, trying to solve the problem, but found no answers, only a similar problem reported to the mailing list (https://sourceforge.net/p/lammps/mailman/message/7171080/) with a solution that did not work for me (I tried compiling the MPIIO package with different optimization flags).

Any help finding a solution would be greatly appreciated.

Thank you in advance.

i disagree with your conclusion based on the evidence you are providing; there are many other explanations.
if, as you say, you didn't recompile LAMMPS and it worked before, it is unlikely that the issue is caused by LAMMPS, and more likely caused by changes in your machine, the supporting libraries, the drivers, or the hardware.

are you actually writing restart files on jobs large enough (i.e. using enough MPI ranks) to warrant using MPIIO over the normal dump styles?
does the issue only happen with MPIIO dumps and restarts, or also with regular dumps?
what happens if you download and compile a new version of LAMMPS from scratch?
have you contacted your system folks? were there any issues with the machine recently? do you have enough space in your file system and/or quota to write the restart or dump files?

before a solution can be found, there has to be (much) more information.

axel.

Axel, thank you for the reply.

You are right, I was too hasty in saying that the problem is MPIIO. As you said, the most likely cause would be changes in the machines where I am carrying out my simulations. I will talk to my system folks to find out what changes have been made, because I didn't notice any apart from this inconvenience.

The reason I wrote is to try to figure out how to solve this; maybe I could do something starting from the code.

When I run the simulations on just one core, the restart and dump files (the normal ones) are written and the simulation continues fine. The problem is when I run in parallel: under that condition no restarts or dumps are written at all. The systems I need to simulate are too big to be run on just one core.

I have downloaded a new version of LAMMPS and did not get different results. Besides, I checked, and there is enough space to write new restart and dump files.

I will keep digging, but in the meantime it's always good to get some help from people who know (much) more than me.

Thank you.

you haven't answered the question as to why you are using MPIIO-style dump files and restarts, and whether the stalling happens only with them or with any dump style.

please also note that output to stdout is different between running on a single CPU and running on multiple CPUs with certain MPI libraries. since under MPI stdout/stderr has to be channeled from the different processes to the process at MPI rank zero, output may change from line-buffered to block-buffered, and on some HPC systems those blocks may be larger than you would normally expect.
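the effect is easy to demonstrate outside of LAMMPS. here is a small python sketch (nothing LAMMPS-specific, just CPython's own buffering rules) showing how a line printed into a pipe can sit invisibly in a block buffer while the program keeps running:

```python
import os
import subprocess
import sys
import time

# child prints one line but never flushes, then keeps running -- just like a
# simulation that is alive but whose output seems to have stopped.
CHILD = 'import time\nprint("early")\ntime.sleep(30)'

def first_bytes(extra_flags):
    """start the child, wait a moment, and return whatever reached the pipe."""
    p = subprocess.Popen([sys.executable, *extra_flags, "-c", CHILD],
                         stdout=subprocess.PIPE)
    time.sleep(1.0)                        # plenty of time for the print()
    os.set_blocking(p.stdout.fileno(), False)
    try:
        data = os.read(p.stdout.fileno(), 1024)
    except BlockingIOError:
        data = b""                         # nothing has been flushed yet
    p.kill()
    p.wait()
    return data

# stdout to a pipe is block-buffered, so the line has not arrived yet:
print(first_bytes([]))      # b''
# with unbuffered stdout ("-u") the same line arrives immediately:
print(first_bytes(["-u"]))  # b'early\n'
```

the same mechanism applies when an MPI launcher pipes each rank's stdout through rank zero: the code may be running fine while its output sits in a buffer.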

so it may be worth coming up with a very small test system and making some fast tests to narrow down under which of the many permutations of possible causes this "stall" or "i/o hang" happens.

please also understand, i am not saying it is impossible that this is an MPIIO issue (the code is actually quite complex, so things may go wrong there, too), only that it is not conclusive from the evidence you have produced so far. if you can post a small/fast/simple input that allows reproducing the stall, it might help a lot.

axel.

Axel, I have to confess I thought the only way to write restarts and dump files while running in parallel was via MPIIO. I admit I misunderstood. Now I can continue my simulations with normal restarts and dump files (the simulations go on well with them). I don't think efficiency or run time will be affected much in the simulations I am carrying out now.
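For anyone finding this thread later, the regular output commands already work in parallel. A sketch of the alternatives I am using now (group and filenames are placeholders):

```
# regular dump: a single file, written through MPI rank 0
dump            d1 all custom 1000 dump.pto id type x y z
# a "%" in the filename instead writes one file per MPI rank
dump            d2 all custom 1000 dump.pto.% id type x y z
# regular restart: works in parallel without the MPIIO package
restart         5000 restart.pto.1 restart.pto.2
```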

The weird fact is that I had been using restart/mpiio and dump/mpiio without any problem, and suddenly they stopped working. If in the future I simulate a bigger system and need these commands, the runs will still crash.

I can reproduce "the stall" simply by running one of the LAMMPS examples in parallel (I did it with coreshell, and I know it is overkill given the size of the system) and just changing the dump style to custom/mpiio. The log shows the system configuration being loaded, but then the program stalls (because the dump is written at timestep 0). The same happens when I add a restart/mpiio command: the output is written until the restart must be written, then the program stalls.
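A self-contained input along these lines (essentially the stock melt example with the dump style changed to custom/mpiio; it needs a LAMMPS binary built with the MPIIO package) should be enough to test, run e.g. as `mpirun -np 4 lmp -in in.melt-mpiio`:

```
# in.melt-mpiio -- minimal test of MPIIO dump output
units           lj
atom_style      atomic
lattice         fcc 0.8442
region          box block 0 10 0 10 0 10
create_box      1 box
create_atoms    1 box
mass            1 1.0
velocity        all create 3.0 87287
pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5
fix             1 all nve
# the dump is written at timestep 0, so a stall shows up immediately
dump            d1 all custom/mpiio 50 dump.melt.mpiio id type x y z
run             100
```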

Thanks again.