Reading very large (170 GB) LAMMPS dump atom output with Python/Pizza.py

Dear all,
My output files contain the id, type, and unscaled coordinates (instead of scaled coordinates) of ~1000 to ~35000 structureless beads of different sizes, dumped over 70000 timesteps. The file sizes range from ~100 MB to ~170 GB. I have read the Pizza.py documentation and know how to use it. I want to use the positions to calculate the local volume fraction or local density, and other quantities, after equilibration.
Since some of the output files are large, I want to analyze them on a cluster. My questions are:
1- Is the size of the output normal for 35000 beads?!
2- What is an efficient and fast way to analyze the data? Is Pizza.py a good candidate for opening the file and then doing calculations on the data in Python?
3- Is it possible to use Python in parallel? In particular, can I use Pizza.py in parallel?
4- I have also tried to write my own dump reader and then load the dump file into pandas DataFrames using the chunksize argument. Then I could use Dask to parallelize operations on the DataFrames. What about this approach?
Thank you very much for your help and advice.
Best regards,
Amir Sadeghi

1- Is the size of the output normal for 35000 beads?!

File size depends on the choice of output properties, the file format, and the frequency of output, not just on the number of particles, so this question cannot be answered in general.
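That said, a hedged back-of-the-envelope estimate puts your numbers in a plausible range. The ~45 bytes per atom line and ~200 bytes of per-frame headers below are assumptions; the real values depend on which columns you dump and the numeric precision.

# Rough size estimate for an ASCII "dump atom" file.
# bytes_per_line and header_bytes are assumptions; actual values depend on
# the columns written and the floating-point precision of the output.
n_atoms = 35_000
n_frames = 70_000
bytes_per_line = 45      # "id type x y z" as text, assumed average width
header_bytes = 200       # ITEM: lines plus box bounds per frame (assumed)

total = n_frames * (n_atoms * bytes_per_line + header_bytes)
print(f"~{total / 1e9:.0f} GB")   # prints "~110 GB", so ~170 GB is not unusual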

2- What is an efficient and fast way to analyze the data? Is Pizza.py a good candidate for opening the file and then doing calculations on the data in Python?

If you want efficient and fast, then (pure) Python is not the answer. Typical analysis calculations are on the order of 100x faster in compiled (C++, C, or Fortran) code.
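As a toy illustration of that gap (not a benchmark of any particular analysis), summing squared distances in a plain Python loop versus handing the same arithmetic to NumPy's compiled kernels typically differs by one to two orders of magnitude:

import time
import numpy as np

# Toy comparison: mean squared distance from the origin for 1e6 points.
xyz = np.random.rand(1_000_000, 3)

t0 = time.perf_counter()
acc = 0.0
for x, y, z in xyz:                  # pure-Python loop over rows
    acc += x * x + y * y + z * z
msd_py = acc / len(xyz)
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
msd_np = np.mean(np.sum(xyz * xyz, axis=1))   # same math, compiled NumPy kernels
t_np = time.perf_counter() - t0

print(f"python loop: {t_py:.3f} s, numpy: {t_np:.4f} s, speedup ~{t_py / t_np:.0f}x")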

3- Is it possible to use Python in parallel? In particular, can I use Pizza.py in parallel?

Technically, yes. There are multiple modules that allow parallel programming at different levels in Python. However, Python is not intrinsically parallel: access to the Python interpreter has to be serialized, so only distributed-data parallelization (separate processes working on separate pieces of data) really scales. All of that requires explicit parallel programming.

Pizza.py is rather minimal and most certainly not parallel. That said, if you can break your files into pieces and analyze each piece independently, you can run things in this “conveniently parallel” fashion without any parallel programming. Of course, that only works for static analysis that does not need correlations to previous or future steps.
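A minimal sketch of that “conveniently parallel” pattern, assuming you have already split the trajectory into per-chunk files. The file naming, the worker count, and the body of analyze_chunk are placeholders for whatever static analysis you actually need.

from multiprocessing import Pool
import glob

def analyze_chunk(filename):
    # Placeholder: open one piece of the trajectory and compute some static
    # per-frame quantity (e.g. a local density histogram), then return it.
    result = 0
    with open(filename) as f:
        for line in f:
            result += 1          # stand-in for the real per-frame analysis
    return filename, result

if __name__ == "__main__":
    chunks = sorted(glob.glob("dump.part.*"))   # assumed naming of the split files
    with Pool(processes=8) as pool:             # assumed worker count
        for fname, res in pool.map(analyze_chunk, chunks):
            print(fname, res)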

4- I have also tried to write my own dump reader and then load the dump file into pandas DataFrames using the chunksize argument. Then I could use Dask to parallelize operations on the DataFrames. What about this approach?

That doesn't solve the principal problem: Python is easy and convenient to program, but slow at heavy computations (unless the computation itself is handed off to a dynamically loaded compiled module, which is what NumPy and pandas do internally). You just add one more step to the process.
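If you do go the pandas route, a streaming reader that yields one frame at a time keeps memory bounded while still handing the per-frame arithmetic to pandas/NumPy. A rough sketch, assuming the standard text dump layout (ITEM: TIMESTEP / NUMBER OF ATOMS / BOX BOUNDS / ATOMS id type x y z); the function name, the file name, and the slab example are placeholders.

import pandas as pd

def read_dump_frames(path):
    # Yield one pandas DataFrame per snapshot of a text "dump atom" file,
    # so a 170 GB trajectory never has to fit into RAM at once.
    with open(path) as f:
        while True:
            header = f.readline()
            if not header:                 # end of file
                return
            f.readline()                   # timestep value
            f.readline()                   # "ITEM: NUMBER OF ATOMS"
            n_atoms = int(f.readline())
            f.readline()                   # "ITEM: BOX BOUNDS ..."
            box = [f.readline().split() for _ in range(3)]   # kept if you need the volume
            cols = f.readline().split()[2:]   # "ITEM: ATOMS id type x y z"
            rows = [f.readline().split() for _ in range(n_atoms)]
            yield pd.DataFrame(rows, columns=cols).astype(float)

# Example: count beads in a z-slab, frame by frame (the slab bounds are made up).
for frame in read_dump_frames("dump.atom"):
    in_slab = frame[(frame["z"] > 0.0) & (frame["z"] < 5.0)]
    print(len(in_slab))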

Please note that many simple analysis tools need to load the entire trajectory into RAM, so there may be limitations independent of the choice of software.

axel.