Hello,
Do you have any guidance on how to read large files, around 150 GB in size?
I am trying to follow this; is there any other advice?
https://www.ovito.org/manual/python/introduction/advanced_topics.html#using-ovito-with-python-s-multiprocessing-module
How many cores should be used for reading the file?
Thank you
Reading/parsing the file is a single-core CPU operation. The question is what comes after that: what do you want to do with the dataset once it is loaded? You will certainly need a machine with a lot of RAM to process a dataset of that size, and not all functions of OVITO run multi-threaded or support billions of particles.
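For completeness, the manual page you linked parallelizes over trajectory frames, not within the reading of a single frame. A rough sketch of that pattern (not the exact example from the manual; the file name, worker count, and per-frame reduction below are placeholders) could look like this:

import multiprocessing
from ovito.io import import_file

def process_frame(frame):
    # Each call opens the file itself (kept simple for this sketch);
    # parsing an individual frame is still a single-core operation.
    pipeline = import_file('file.dump')  # placeholder file name
    data = pipeline.compute(frame)
    # Placeholder per-frame reduction: mean particle position.
    return data.particles.positions[...].mean(axis=0)

if __name__ == '__main__':
    num_frames = import_file('file.dump').source.num_frames
    # This only helps if the dump contains many frames; the worker count
    # is a tuning choice, limited mainly by available RAM.
    with multiprocessing.Pool(4) as pool:
        results = pool.map(process_frame, range(num_frames))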
I need OVITO to read the dump files and extract the IDs and positions to be processed later.
If reading in OVITO is a single-core operation, would it be faster to process the file via bash?
I’m not entirely sure what you’re trying to achieve. Could you clarify your goal? Is this question about maximizing I/O performance or minimizing your own effort?
If your aim is to extract particle IDs and positions from a large dump file, you could simply use the OVITO Python module for loading the file and accessing the data as NumPy views. Here’s a basic example:
from ovito.io import import_file

# Load the dump file and evaluate the pipeline (frame 0 by default).
data = import_file('file.dump').compute()
# Particle properties are exposed as read-only NumPy array views (no copy).
positions = data.particles.positions[...]
ids = data.particles.identifiers[...]
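If "processed later" means in a separate step, one option (a minimal sketch; the output file names are placeholders) is to write the two arrays to NumPy's binary format, which is much faster to reload than re-parsing the text dump:

import numpy as np
# Placeholder output names; np.save stores a copy in binary .npy format,
# which can later be reloaded with np.load (optionally memory-mapped).
np.save('ids.npy', ids)
np.save('positions.npy', positions)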
I’m also not sure what you meant by “process it via bash”. Bash is a shell, not an actual program for processing data files. Maybe you had the awk or sed commands in mind?