[lammps-users] sorting large dump files

Hi all.

I have some reasonably large dump files (15 million atoms; ASCII dump file ≈ 0.5 GB).

I want to compute the displacement of each atom since time 0. I usually use Pizza.py for this: sort the data and diff the two position vectors (modulo PBCs). In the present case, the dump file is large enough that Pizza.py's dump.py exhausts memory on loading. I tried shaving the header off the dump files and feeding them to GNU sort, but that exhausts memory too.
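(To make "diff the position vectors modulo PBCs" concrete, here is a minimal sketch of the per-atom displacement with the minimum-image convention, assuming two id-sorted (N,3) coordinate arrays and an orthogonal box; the names are illustrative, not pizza.py API.)

import numpy as np

def displacement_pbc(x0, x1, box):
    # Per-atom displacement between two id-sorted (N,3) coordinate arrays,
    # with the minimum-image convention applied along each (orthogonal) box
    # length.  Only valid if no atom moved more than half a box length.
    d = x1 - x0
    d -= box * np.round(d / box)   # fold each component back into [-L/2, L/2]
    return d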

I was about to implement a mergesort but was wondering if anyone has a better solution. (Sorry, slightly off-topic).

Thanks,
Craig

The dump tool in Pizza.py has an option for the big-memory
issue. You can open the file with d = dump("file",0)
which won't read any snapshots.

Then read them one at a time via d.next(), process each snapshot,
and throw it away via d.delete().
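A minimal sketch of that pattern (assuming the Pizza.py tools are importable, e.g. with pizza/src on the PYTHONPATH; the processing step is a placeholder):

from dump import dump            # Pizza.py dump tool

d = dump("dump.file", 0)         # 0 = don't read any snapshots yet
while 1:
    time = d.next()              # read the next snapshot from the file
    if time == -1:               # -1 means no snapshots left
        break
    # ... process the snapshot here ...
    d.tselect.none()             # deselect it ...
    d.delete()                   # ... so delete() discards it and frees memory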

Steve

I understand that it’s not a problem to read in the data line-by-line from disk, but I need to sort it. I was wondering if there is some mechanism via which LAMMPS can write dump files sorted on atom index.

I want to find the difference in each atom's position (or any other per-atom quantity) between any two dump files. This requires sorting the dump files by atom ID. Once the sorting is done, one can read a handful of lines at a time from each of the sorted dump files to diff them. The memory bottleneck isn't an issue after the sorting has been done, but the sorting step itself becomes a problem for big dump files.
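(For illustration, a minimal sketch of that lockstep diff, assuming both files have already been reduced to id-sorted per-atom lines with columns id type x y z ix iy iz and an orthogonal box (lx, ly, lz); the column layout and names are assumptions, not anything LAMMPS-specific.)

def diff_sorted(file0, file1, box):
    # Walk two id-sorted dump-body files in lockstep and print each atom's
    # displacement.  Only a few lines are ever held in memory.
    f0, f1 = open(file0), open(file1)
    while True:
        line0, line1 = f0.readline(), f1.readline()
        if not line0 or not line1:
            break
        a, b = line0.split(), line1.split()
        assert a[0] == b[0], "atom IDs out of step"
        # unwrapped position = wrapped coordinate + image flag * box length
        disp = [(float(b[2 + k]) + int(b[5 + k]) * box[k])
                - (float(a[2 + k]) + int(a[5 + k]) * box[k])
                for k in range(3)]
        print("%s %g %g %g" % (a[0], disp[0], disp[1], disp[2]))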

I'm working on a disk-based mergesort right now which should be OK for my needs, but it seems like it would be natural to optionally do this sorting in LAMMPS proper before each dump. If the dumps were pre-sorted individually on each processor, that would take care of a lot of the work… the head node would then only need to do a couple of generations (log2(numCPUs)) of mergesorting on the pre-sorted dump files from the individual processors.
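(For reference, a minimal sketch of the disk-based sort I have in mind, using Python's heapq.merge to merge id-sorted chunks; the chunk size and temporary file names are just illustrative.)

import heapq
import itertools

def sort_atom_lines(lines, chunk=1000000, tmpbase="sorted_chunk"):
    # External (disk-based) sort of per-atom dump lines by atom ID.
    # lines   : iterable of per-atom lines (header stripped, newlines kept)
    # chunk   : number of lines to sort in memory at once
    # tmpbase : prefix for temporary chunk files
    # Yields the lines back in ascending atom-ID order.
    atom_id = lambda line: int(line.split()[0])    # assumes ID is column 1
    tmpfiles = []
    it = iter(lines)
    while True:
        block = list(itertools.islice(it, chunk))
        if not block:
            break
        block.sort(key=atom_id)                    # in-memory sort of one chunk
        name = "%s.%d" % (tmpbase, len(tmpfiles))
        with open(name, "w") as f:
            f.writelines(block)
        tmpfiles.append(name)
    def keyed(f):
        # tag each line with its ID so heapq.merge compares on the ID
        for line in f:
            yield (int(line.split()[0]), line)
    for _id, line in heapq.merge(*[keyed(open(n)) for n in tmpfiles]):
        yield line

With each dump file sorted this way, the diff above only ever needs a few lines in memory at a time.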

Does nobody else ever have occasion to compare two snapshots of a big system off-line on a single CPU?

–Craig

Craig,

If what you want is a series of files that are each sorted by atom index, you can use the Pizza.py dump tool. I would write a script to feed to Pizza.py with the following commands:

if not globals().has_key("argv"): argv = sys.argv
infile = argv[1]          # dump file to read
outbase = argv[2]         # prefix for the per-frame output files
nmax = int(argv[3])       # number of frames to process
d = dump(infile,0)        # open the dump file without reading any snapshots
for i in range(0,nmax):
    print "Working on step ",i
    outfile = outbase + "." + str(i)
    d.next()                  # read the next snapshot
    d.map(1,"id",2,"type",3,"x",4,"y",5,"z",6,"ix",7,"iy",8,"iz")
    d.sort()                  # sort this snapshot by atom id
    d.write(outfile)          # write the sorted snapshot to its own file
    d.tselect.none()          # deselect it ...
    d.delete()                # ... so delete() discards it and frees memory
print "all done"

This will step through the dump file one frame at a time, sort that frame by atom id, write it to a new file, and then delete that frame. To run the script out of pizza, type:
@run script.py infile outbase nframes

This worked on the file I tried.

Joanne Budzien

Post-doc, Org.1814: Computational Materials Science and Engineering
Sandia National Laboratories
PO Box 5800
Albuquerque, NM 87185-1411

Voice: (505) 844-0959
Fax: (505) 844-9781
Email: jlbudzi@…3…

Hi Craig,

The dcd format requires that all the atoms are sorted in each frame. You
can take a look at the dump_dcd code to see how this is done (basically it
allocates a new array of size N and copies the per-atom data into each
location based on the atom's ID). This probably wouldn't work for a huge
system of 15 million atoms, but that's effectively the amount of extra
space you'd need if you wanted to do a mergesort on the main processor.
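(Conceptually, in Python rather than the actual C++ of dump_dcd, that scatter-by-ID step looks something like the following; the names are illustrative.)

import numpy as np

def gather_by_id(ids, coords, natoms):
    # Place each atom's coordinates at row (id - 1) of a full-size array,
    # which is what the id-based copy in dump_dcd amounts to.  The (natoms,3)
    # array is the O(N) extra storage mentioned above.
    sorted_coords = np.empty((natoms, 3))
    sorted_coords[np.asarray(ids) - 1] = coords    # scatter by 1-based atom ID
    return sorted_coords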

Naveen

Hi Naveen.

Yikes! coords has length natoms.

This means that the DCD output (and, it looks like, the xyz output too) puts a pretty harsh restriction on LAMMPS system size.

The point of mergesort is that it can be done efficiently on disk. So it will work… when I eventually get it working, that is.

I'm also borrowing some cycles on a 28 GB machine, which will solve my problems today, but that solution doesn't scale well ;-)

–Craig

Hi Craig,

Yeah, that implementation isn't the most scalable, but I've never had
to deal with systems bigger than 100,000 atoms, and I calculated that
even a million-atom system would only require an extra 24 MB of memory
on the main processor.
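(That figure works out if one assumes, e.g., three double-precision values per atom: 10^6 atoms × 3 × 8 bytes = 24 × 10^6 bytes ≈ 24 MB.)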

Merge-sort does work efficiently on disk, but it still might slow down
your simulation times unacceptably (since all the slave processors will
have to wait until the main processor completes the merge-sort). If you do
implement an efficient version for LAMMPS it would be useful to modify the
dcd code to use the same framework.

Naveen

I am wondering whether you want to read in the quantities by atom ID. If so, I often write the dump file with a custom dump style and include the atom's tag (i.e. its atom ID) among the dumped quantities, so I can read back the atom ID along with the other columns, e.g. tag x y z ix iy iz.