Working with large data files

Hello users!

I’m sure many of you have faced the problem of analyzing (post-processing) large output files from LAMMPS simulations, and specifically the fragments-vs-timestep “fra.out” file produced by the ReaxFF molfrag script.
In my last ReaxFF simulation I had a 2 GB bonds.reax file, which I converted to a 270 MB fra.out file using Aidan’s molfrag script located in the …/tools/reax dir.

However, that file contains over 46,000 columns. There’s absolutely no way of loading it into MATLAB / Origin / Excel / OpenOffice for analysis, at least on my dual-core, 1 GB Lenovo (plus, the software above has column limits of roughly 2K-10K).

As you all know, most of the columns are insignificant, containing mostly zeros (if a fragment shows up during the simulation, it gets its own column EVEN if it appears just ONCE during the whole run).
So I was wondering if there’s a useful way of dealing with this kind of problem.
What would you do in such a situation?

I’ve attached a Perl script that goes over the fra.out file, sums each column, and, if the sum is larger than 3% of the original molecule’s total (fragment intensity > 3%), prints that column to a new file. It could be useful for some people!
Nevertheless, it takes a few hours (!) on my machine to do that.
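
In case it helps, here is a rough Python sketch of the same column-sum-and-filter idea, done in two streaming passes so the whole table never has to sit in memory. The file layout (whitespace-delimited, one header line with one label per column, timestep in column 0, original molecule in column 1), the output file name, and the choice of reference column are my assumptions, not something taken from the molfrag output format, so adjust them to your file:

#!/usr/bin/env python
# Two-pass column filter for fra.out (assumed whitespace-delimited,
# one header line with one label per column, timestep in column 0,
# original molecule in column 1 -- adjust to your actual layout).

infile, outfile = "fra.out", "fra_filtered.out"   # placeholder file names
THRESHOLD = 0.03                                  # keep fragments above 3%

# Pass 1: accumulate the sum of every column without holding the table in memory.
with open(infile) as f:
    header = f.readline().split()
    sums = [0.0] * len(header)
    for line in f:
        for i, field in enumerate(line.split()):
            sums[i] += float(field)

# Keep the timestep column plus every fragment whose total exceeds
# 3% of the reference (original molecule) column's total.
reference = sums[1]
keep = [0] + [i for i in range(1, len(sums)) if sums[i] > THRESHOLD * reference]

# Pass 2: write only the selected columns (header line included).
with open(infile) as f, open(outfile, "w") as out:
    for line in f:
        fields = line.split()
        if not fields:
            continue
        out.write(" ".join(fields[i] for i in keep) + "\n")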

Does anyone have a suggestion / faster script / other way of dealing with this problem?

Thanks

Intensity.pl (1.62 KB)

Perl or Python is the way to go. A Python script should only
take a few tens of seconds to read a 270 MB file. I can't imagine it
would take hours to filter it down to something smaller unless you
are doing something very inefficient. In Python you could
look at using Numeric or NumPy arrays to store/process
vectors or arrays of numbers. They offer C-like speed and efficiency.

Steve
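
Along the lines Steve suggests, a minimal NumPy version of the same filter might look like the sketch below. It makes the same layout assumptions as the earlier sketch (whitespace-delimited, one header line, timestep in column 0, original molecule in column 1), uses placeholder file names, and needs enough RAM to hold the full table as floats:

import numpy as np

# Assumed layout: whitespace-delimited, one header line, timestep in
# column 0, original molecule (the 3% reference) in column 1.
with open("fra.out") as f:
    header = f.readline().split()

data = np.loadtxt("fra.out", skiprows=1, ndmin=2)  # skip the header line
sums = data.sum(axis=0)                            # vectorized column sums
keep = np.where(sums > 0.03 * sums[1])[0]          # fragments above 3% of reference
keep = np.union1d([0], keep)                       # always keep the timestep column

with open("fra_filtered.out", "w") as out:
    out.write(" ".join(header[i] for i in keep) + "\n")
    np.savetxt(out, data[:, keep], fmt="%g")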