[lammps-users] sorting large dump files

Hi all.

After painfully implementing and debugging a line-based mergesort routine, I realized that Joanne’s solution is much better.

The mergesort works but is pretty slow, since it has to make log2(N) passes over the file. You can bump it from excruciatingly painful to merely moderately painful by using a large buffer size for file I/O.

But… Since the index we want to sort on is dense and flat (just consecutive integers), it’s easiest to split the file into pieces based on index, each small enough to fit in RAM, sort those pieces, and then cat the sorted files together. I feel stupid for not having thought of that in the first place, but at least now I have a line-based mergesort that will still work when the keys aren’t so conveniently distributed.
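The split/sort/cat idea above can be sketched in a few lines of Python. This is an illustrative sketch, not Craig’s actual script; it assumes each dump data line starts with an integer atom ID and that IDs run densely from 1 to N, and all names in it are made up.

```python
# Sketch of the split/sort/cat out-of-core sort for one oversized
# snapshot. Assumes each line starts with an integer atom ID and
# IDs run densely from 1..n_atoms. Names are illustrative.
import os
import tempfile

def split_sort_cat(lines, n_atoms, bucket_size, out_path):
    """Bucket lines by ID range, sort each bucket in RAM, concatenate."""
    n_buckets = (n_atoms + bucket_size - 1) // bucket_size
    buckets = [tempfile.NamedTemporaryFile("w+", delete=False)
               for _ in range(n_buckets)]
    for line in lines:
        atom_id = int(line.split()[0])
        buckets[(atom_id - 1) // bucket_size].write(line)
    with open(out_path, "w") as out:
        for b in buckets:
            b.seek(0)
            chunk = b.readlines()          # one bucket fits in RAM
            chunk.sort(key=lambda s: int(s.split()[0]))
            out.writelines(chunk)
            b.close()
            os.unlink(b.name)
```

Only one bucket ever has to be held in memory, so the bucket size can be tuned to whatever the post-processing box can hold.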

It would be really convenient if LAMMPS had an option which, for each dump frame, would sort the atoms by index on each worker node before communicating with the head node. Then the head node would have the relatively easy task of merging pre-sorted streams into a single sorted file. This is an order-N operation, but so is just dumping everything to disk without sorting it, so it wouldn’t change the scaling. As Naveen pointed out, the current method for producing sorted xyz or dcd files will fail for big systems. (But maybe it’s useless for most people to have an xyz or dcd file that won’t fit into memory on a single processor?)

Cheers,
Craig

If what you want is a final dump file with the atoms in each snapshot
sorted by ID, then I would use Joanne's Pizza.py script. I.e. read each
unsorted snapshot (as LAMMPS wrote it out), sort it, and write it
back out. You never have to hold more than one snapshot in memory.
If you can't fit one snapshot in the memory of your post-processing box (e.g.
a billion atoms), then you need an out-of-core sort; such sorts exist, though
none is packaged as a LAMMPS tool.

I don't see how your suggestion for LAMMPS to do the sorting would
work. Proc 0 can't do a merge sort of all the contributions from the
other procs without holding data for all the atoms in its memory,
which would be too much for really large simulations.

Besides, if you have to do that, then you might as well allocate the entire
array and stick the atoms in the correct place as they come in from
other procs, which is what Naveen implemented for DCD and XYZ formats.

The only other solution is a true distributed-memory parallel sort,
but I have never been convinced it is worth writing or worth the
cost of executing each time you want to dump a snapshot. I view
it as a post-processing issue.

Steve

Hi Steve.

If what you want is a final dump file with the atoms in each snapshot
sorted by ID, then I would use Joanne’s Pizza.py script. I.e. read each
unsorted snapshot (as LAMMPS wrote it out), sort it, and write it
back out. You never have to hold more than one snapshot in memory.
If you can’t fit one snapshot in the memory of your post-processing box (e.g.
a billion atoms), then you need an out-of-core sort; such sorts exist, though
none is packaged as a LAMMPS tool.

I’m interested in the situation where a single snapshot at a given timestep will not fit into memory (sorry if this hasn’t been clear in the first couple of posts). My current solution (as Joanne suggested in her second email to me which I included in my last post to the list) is to i) split the single timestep into many files, each corresponding to some range of atom indices, then ii) sort each of those individually, then iii) cat them back together. This is what you mean by “out-of-core”, I guess.

I don’t see how your suggestion for LAMMPS to do the sorting would
work. Proc 0 can’t do a merge sort of all the contributions from the
other procs without holding data for all the atoms in its memory,
which would be too much for really large simulations.

Proc 0 doesn’t need to hold anything more in memory than it would normally. The setup I imagine is that proc 0 has N MPI readBuffers open, one from each worker proc. The worker nodes fill those buffers with atom data that is pre-sorted on the worker side. Usually proc 0 writes to disk sequentially, reading all the data from each processor in turn. In this scenario (this is what I mean by a merge sort), at every iteration of the “writeAtomToDisk” loop, proc 0 writes to disk whichever atom has the lowest index among those sitting at the fronts of the N readBuffers. It takes no more memory than the usual scenario (after adjusting the MPI read buffer sizes appropriately), and the computational overhead of doing the merge sort on proc 0 is probably dominated by the communication overhead anyway.
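The loop described above is the classic k-way merge of pre-sorted streams, which Python’s standard library provides as heapq.merge. A toy sketch, with plain iterators standing in for the N MPI readBuffers and a callback standing in for the hypothetical writeAtomToDisk step (“writeAtomToDisk” is Craig’s illustrative name, not a real LAMMPS routine):

```python
# Toy stand-in for the proposed proc 0 merge loop. Each "read buffer"
# is an iterator over (atom_id, data) pairs already sorted by ID on
# its worker. heapq.merge pops the globally smallest ID each step,
# holding only one pending item per stream in memory.
import heapq

def merge_sorted_streams(streams, write):
    """write(atom_id, data) plays the role of the writeAtomToDisk step."""
    for atom_id, data in heapq.merge(*streams):
        write(atom_id, data)
```

The memory footprint is one buffered item per stream, matching Craig’s point that the merge needs no more memory than the usual sequential drain.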

Besides, if you have to do that, then you might as well allocate the entire
array and stick the atoms in the correct place as they come in from
other procs, which is what Naveen implemented for DCD and XYZ formats.

But I’m interested in the situation where the system won’t fit into RAM on any single processor.

The only other solution is a true distributed-memory parallel sort,
but I have never been convinced it is worth writing or worth the
cost of executing each time you want to dump a snapshot. I view
it as a post-processing issue.

Doing this in post-processing certainly works, but it doesn’t leverage the fact that the data came from many CPUs, which could have helped with the sorting. It just doesn’t seem like it would take much effort to do it online. If I have a couple of free hours, I’ll give it a go and let everyone know how it turns out.

Cheers,
Craig

Hi Craig,

I just implemented a heapsort that works on the buf array in the pack
function of dump.cpp, sorting by the first value (per atom) in buf
(the things one does to avoid writing a thesis :). This is done on
each of the slave nodes; the final mergesort on the main processor
still needs to be implemented (which would require changing the
communication routines to send subsets of sorted atoms from each
slave processor).
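For readers who don’t have the attached code handy, the effect of that per-processor sort can be sketched in Python. This is not Naveen’s heapsort (which works in C++ on dump.cpp’s flat buf array); it just illustrates the same operation, assuming buf holds size_one values per atom with the sort key (e.g. the atom ID) stored as the first value of each atom’s slice:

```python
# Sketch of the per-processor sort: the dump buffer is a flat list of
# size_one values per atom; sort the per-atom slices by their first
# value and re-flatten. An in-place heapsort (as in the attached code)
# does the same thing without allocating the intermediate list.
def sort_buf(buf, size_one):
    atoms = [buf[i:i + size_one] for i in range(0, len(buf), size_one)]
    atoms.sort(key=lambda a: a[0])
    return [v for atom in atoms for v in atom]
```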

Naveen

Oops, I forgot to include the code (this is just the heapsort with a test
case).

I just need to integrate it into lammps.

Naveen

sort.tar (10 KB)

Hi Steve,

This would actually be useful in the general case for the dcd and xyz
formats, because they are currently limited by the size of the system. I'm
sure one day someone will want to output in dcd a system that is too big
to keep on one processor.

Naveen

Proc0 doesn't need to hold anything more in memory than it would
normally. The setup I imagine is that proc0 has N MPI readBuffers
open, one from each worker proc. The worker nodes are filling those
buffers with atom data that is pre-sorted on the worker side.
Usually proc zero writes to disk sequentially reading all data from
each processor. In this scenario (this is what I mean by a merge
sort), at every iteration in the "writeAtomToDisk" loop, proc0 writes
to disk whichever atom has the lowest index of those sitting on the
fronts of the N readBuffers. It takes no more memory than the usual
scenario (after adjusting the MPI read buffer sizes appropriately),
and the computational overhead of doing the merge sort on node0 is
possibly dominated by the communication overhead, anyway.


Hi all.

I might as well post an implementation of Joanne’s scheme in case anyone has a use for this.

Python 2.4 or greater is mandatory!

The “key=” option to the sort routine is what allows for reasonable efficiency. For some reason, Python’s file.readlines() seems to read only about 300 lines at a time when I give it a big size hint, even though I have huge I/O buffers set. That doesn’t seem to be too much of a performance hit, though; the sorting dominates.
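The “key=” idiom referred to above extracts the integer atom ID once per line, rather than re-parsing it on every comparison, which is what makes the sort fast enough in practice. A minimal sketch (illustrative, not Craig’s actual script), assuming the atom ID is the first whitespace-separated field of each line:

```python
# Sort snapshot lines numerically by atom ID using sort's key=
# parameter. split(None, 1) splits off only the first field, so the
# ID is parsed exactly once per line.
def sort_snapshot_lines(lines):
    return sorted(lines, key=lambda line: int(line.split(None, 1)[0]))
```

Sorting as strings would put “10” before “2”; converting the key to int gives the numeric order a dump file needs.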

It takes a couple of minutes to sort a 15-million-atom, 500 MB dump file. After tweaking the Python sorting, I feel less need to try to do the sorting online (although the fact that it’s so efficient leads me to believe that it shouldn’t be much overhead at all to do it online).