fast dump file reader (attempt)

I have attempted to write a fast dump file reader in Python. I’ve tested it on a file with 640 atoms and 4 attributes per atom; the file was processed at roughly 2-3 MB/s on a modern CPU.

While the Pizza.py dump reader is more flexible, its performance was very slow on dump files in the GB range.

The dump file must be well-formed (“perfect”); for interrupted runs, a new file should therefore be started (e.g. $timestep.dump).

Usage: instantiate a dumpreader object with the dump file, then call the method user_extractallvectors with a large chunksize. This returns nested generators: the outer generator reads chunksize timesteps at a time, and the inner generator yields the vector for each atom (of length up to chunksize).
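A minimal usage sketch, assuming the interface just described; dumpreader and user_extractallvectors are named above, but the constructor arguments, file name, and the processing stub are my own assumptions:

# Hypothetical usage of the reader described above; only the class and
# method names come from the post, the rest is assumed.
from dump2vecs import dumpreader

def process(atom_vectors):
    print(len(atom_vectors))            # placeholder for real analysis

reader = dumpreader("run.dump")         # assumed: takes the dump file name
chunksize = 1000                        # timesteps per chunk; larger is faster

for chunk in reader.user_extractallvectors(chunksize):
    # inner generator: one vector per atom, of length up to chunksize
    for atom_vectors in chunk:
        process(atom_vectors)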

Caveat: the reader is disrupted when the number of atoms changes between snapshots. To keep up the speed, I assumed the number of atoms doesn’t change often.

If the atom vectors are to be read more than once, I recommend keeping them in a database for fast access. That will be my next step.
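As an illustration only (the database part is not in the released script), a minimal sketch of such a cache using sqlite3; the schema and pickle serialization are my own choices:

# Hypothetical vector cache, not part of dump2vecs.py. Assumes each
# atom's vector is a plain Python list of floats.
import pickle
import sqlite3

con = sqlite3.connect("vectors.db")
con.execute("CREATE TABLE IF NOT EXISTS vecs (atom INTEGER PRIMARY KEY, data BLOB)")

def store(atom_id, vector):
    con.execute("INSERT OR REPLACE INTO vecs VALUES (?, ?)",
                (atom_id, pickle.dumps(vector)))
    con.commit()

def load(atom_id):
    row = con.execute("SELECT data FROM vecs WHERE atom = ?",
                      (atom_id,)).fetchone()
    return pickle.loads(row[0]) if row else None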

It can be downloaded here:

https://gforge.accre.vanderbilt.edu/plugins/scmsvn/viewcvs.php/dump2vecs.py?root=lammpstools&rev=4&view=markup

Please let me know what you think.

2-3 MB/s still sounds slow, but I haven’t benchmarked the Pizza.py dump tool.
If you just try to read a several-GB dump file with it in one shot, that’s not a good test,
b/c some machines don’t have enough memory, and it’s somewhat unpredictable
how much memory Python requires at that scale, so I’ve seen slowdowns.

You can use the dump tool in Pizza.py to read one snapshot (or a few) at
a time, process them, throw them away, and loop, which will not eat up memory.
I would imagine that in that mode it is faster than 2 MB/s.
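A sketch of that loop, based on the documented Pizza.py dump interface (dump(files, 0) to defer reading, next() to read one snapshot at a time, vecs() to pull columns); the deselect-and-delete step for freeing memory is my reading of the docs, not something I’ve tested here:

# Incremental reading with Pizza.py's dump tool (untested sketch).
from dump import dump                   # Pizza.py dump tool

d = dump("big.dump", 0)                 # 2nd arg 0 = don't read yet; use next()
while True:
    t = d.next()                        # timestep of next snapshot, -1 at EOF
    if t == -1:
        break
    x, y, z = d.vecs(t, "x", "y", "z")  # columns for this snapshot
    # ... process x, y, z here ...
    d.tselect.none()                    # deselect all snapshots ...
    d.delete()                          # ... and drop them to cap memory use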

If you get your script working and think it is a useful tool (and documented),
we can release it in the tools/python dir.

Steve

Well, I set out to write my script b/c the Pizza.py dump reader was too slow. On the same dump file it required ~0.4 s just to get a single atom’s vector AFTER it had scanned the file (the scan, during which the output displays the timesteps read, itself takes several seconds). At ~0.4 s per atom, looping through all 640 atoms takes over four minutes, which was simply unacceptable.

And to get around the memory restrictions, my first attempt was to use the Pizza.py dump tool in a loop. My tool instead reads in chunks, and the bigger the chunk, the better. I’ve made every attempt to use optimized Python constructions to read the dump file, so it’s definitely faster than the Pizza.py dump tool, which does a lot of looping in Python.
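To illustrate the kind of construction I mean (this is not the released code), one idiom is to read the file in large blocks and split once, so the per-line work happens in C rather than in a Python loop:

# Illustrative bulk-reading idiom, not taken from dump2vecs.py.
def read_blocks(path, blocksize=4 * 1024 * 1024):
    """Yield lists of complete lines from the file at path."""
    with open(path) as f:
        leftover = ""
        while True:
            block = f.read(blocksize)
            if not block:
                if leftover:
                    yield [leftover]   # file didn't end with a newline
                return
            lines = (leftover + block).split("\n")
            leftover = lines.pop()     # last piece may be a partial line
            yield lines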

The tool works now, but it is independent of the part that puts the data into a database, which is why I released it on its own. I thought more people would be interested in the reader than in the database step, which should be easy. I’ll release my way of doing that as well if there is interest.

It’s up to you. We release all sorts of tools (in lammps/tools) that we
didn’t write and don’t support. It does have to be self-contained and
documented. And you put your name on it to field questions.

Steve

I’ve made some modifications: it now reads sequences of dump files (though it will be slow if each file contains only one frame). The class to instantiate is now dumps2vecs. And I’ve added some documentation.
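For completeness, an updated usage sketch under the same assumptions as before; only the class name dumps2vecs and the multi-file capability come from the above:

# Hypothetical multi-file usage; the constructor arguments are assumed.
from dump2vecs import dumps2vecs

reader = dumps2vecs(["run1.dump", "run2.dump"])   # assumed: list of dump files
for chunk in reader.user_extractallvectors(1000):
    for atom_vectors in chunk:
        pass                                      # analysis as before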

Feedback is appreciated.