Hi Steve.
If what you want is a final dump file with the atoms in each snapshot
sorted by ID, then I would use Joanne’s Pizza.py script: read each
unsorted snapshot (as LAMMPS wrote it out), sort it, and write it
back out. You never have to hold more than one snapshot in memory.
If you can’t fit one snapshot in the memory of your post-processing box (e.g.
a billion atoms), then you need an out-of-core sort; such sorts exist, though
none is packaged as a LAMMPS tool.
I’m interested in the situation where a single snapshot at a given timestep will not fit into memory (sorry if this hasn’t been clear in the first couple of posts). My current solution (as Joanne suggested in her second email to me, which I included in my last post to the list) is to i) split the single timestep into many files, each corresponding to some range of atom IDs, then ii) sort each of those individually, then iii) cat them back together. This is what you mean by “out-of-core”, I guess.
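The i)–iii) steps above can be sketched roughly as follows. This is only an illustration, not the actual script: the bucket count, atom count, file names, and the assumption that each atom line starts with its integer ID are all made up for the example.

```python
import os

NBUCKETS = 16     # number of ID-range buckets (illustrative)
NATOMS = 1000     # total atoms in the snapshot (illustrative)

def bucket_of(atom_id):
    """Map an atom ID to a bucket index by contiguous ID range."""
    width = (NATOMS + NBUCKETS - 1) // NBUCKETS
    return min((atom_id - 1) // width, NBUCKETS - 1)

def external_sort(snapshot_lines, out_path):
    # i) split: stream atom lines into one file per ID range,
    #    so no more than one bucket ever needs to fit in memory
    buckets = [open("bucket_%d.tmp" % b, "w") for b in range(NBUCKETS)]
    for line in snapshot_lines:
        atom_id = int(line.split()[0])
        buckets[bucket_of(atom_id)].write(line)
    for f in buckets:
        f.close()
    # ii) sort each bucket individually, then
    # iii) cat the sorted buckets back together in bucket order
    with open(out_path, "w") as out:
        for b in range(NBUCKETS):
            path = "bucket_%d.tmp" % b
            with open(path) as f:
                lines = sorted(f, key=lambda s: int(s.split()[0]))
            out.writelines(lines)
            os.remove(path)
```

Because the ID ranges are contiguous and non-overlapping, concatenating the sorted buckets in order yields a fully sorted snapshot.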
I don’t see how your suggestion for LAMMPS to do the sorting would
work. Proc 0 can’t do a merge sort of all the contributions from the
other procs without holding data for all the atoms in its memory,
which would be too much for really large simulations.
Proc 0 doesn’t need to hold anything more in memory than it normally would. The setup I imagine is that proc 0 has N MPI read buffers open, one per worker proc, and the workers fill those buffers with atom data that is pre-sorted on the worker side. Normally proc 0 writes to disk sequentially, draining all the data from each processor in turn. In this scenario (this is what I mean by a merge sort), at every iteration of the “writeAtomToDisk” loop, proc 0 instead writes out whichever atom has the lowest ID among those sitting at the fronts of the N read buffers. It takes no more memory than the usual scenario (after adjusting the MPI read buffer sizes appropriately), and the computational overhead of doing the merge on proc 0 is probably dominated by the communication overhead anyway.
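For concreteness, here is a minimal sketch of that N-way merge, with the MPI read buffers stood in for by plain Python iterators (the names and the `(atom_id, data)` tuples are illustrative, not anything LAMMPS actually provides):

```python
import heapq

def merge_streams(streams, write_atom):
    """streams: iterators yielding (atom_id, data) in ascending-ID order,
    one per worker proc; write_atom plays the role of writing to disk."""
    # Seed a heap with the front atom of each stream, so finding the
    # globally lowest ID is O(log N) per atom written.
    heap = []
    for i, s in enumerate(streams):
        first = next(s, None)
        if first is not None:
            heapq.heappush(heap, (first[0], i, first[1]))
    # Pop the smallest front atom, write it, and refill from the stream
    # it came from -- only N atoms are ever held at once.
    while heap:
        atom_id, i, data = heapq.heappop(heap)
        write_atom(atom_id, data)
        nxt = next(streams[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], i, nxt[1]))
```

The memory footprint is one atom per stream (plus whatever buffering the transport does), which matches the claim that proc 0 needs nothing beyond its usual buffers.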
Besides, if you have to do that, then you might as well allocate the entire
array and stick the atoms in the correct place as they come in from
other procs, which is what Naveen implemented for DCD and XYZ formats.
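The alternative described there, for the case where one snapshot does fit on proc 0, amounts to scattering each incoming atom straight into its slot by ID so no sort is needed at all. A toy sketch (function name and tuple layout are made up; this is not Naveen’s actual code):

```python
def place_by_id(natoms, incoming):
    """incoming: (atom_id, data) pairs in arbitrary arrival order,
    with IDs 1..natoms; the ID doubles as the array index."""
    table = [None] * natoms
    for atom_id, data in incoming:
        table[atom_id - 1] = data
    return table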
But I’m interested in the situation where the system won’t fit into RAM on any given processor.
The only other solution is a true distributed-memory parallel sort,
but I have never been convinced it is worth writing or worth the
cost of executing each time you want to dump a snapshot. I view
it as a post-processing issue.
Doing this in post-processing certainly works, but it doesn’t leverage the fact that the data came from many CPUs, which could have helped with the sorting. It just doesn’t seem like it would take that much effort to do it online. If I have a couple of free hours, I’ll give it a go and let everyone know how it turns out.
Cheers,
Craig