Repeatedly Applying the Apply Function to the Same Data Collection

Hello,

I am using version 3.14.1 of the OVITO library, with Python version 3.13.11, on a Windows operating system.

I am repeatedly calling the apply() method on the same data collection while monitoring the execution time of each iteration. I have noticed that after a sufficient number of iterations, each operation begins to take longer. The image below shows how the execution time grows with the number of iterations.

Here is my code:

from ovito.io import import_file
from ovito.modifiers import ClusterAnalysisModifier
import numpy as np
import time


def main():
    dump_file = "dump file which contains 4096 particles to be analyzed"
    
    pipeline = import_file(dump_file)
    pipeline_atoms = pipeline.compute()
    particles = pipeline_atoms.particles.count

    cluster_modifier = ClusterAnalysisModifier(cutoff=1.3, compute_gyration=True, sort_by_size=True, unwrap_particles=True, only_selected=True)

    time_interval_between_iterations = []
    
    start_time = time.time()
    for iteration in range(3000):
        # pipeline_atoms = pipeline.compute()
        print(f"Iteration {iteration} number of particles: {pipeline_atoms.particles.count} {list(pipeline_atoms.particles.keys())}")

        # randomly select 30% of the particles for clustering
        selection = np.zeros(particles, dtype=bool)
        selection[np.random.choice(particles, size=int(0.3 * particles), replace=False)] = True
        pipeline_atoms.particles_.create_property("Selection", data=selection)
        # apply the modifier directly to the in-memory data collection
        pipeline_atoms.apply(cluster_modifier)

        # read the per-cluster radius of gyration from the results table
        rg_values = np.asarray(pipeline_atoms.tables["clusters"]["Radius of Gyration"][...], dtype=float)

        current_time = time.time()
        time_interval = current_time - start_time
        time_interval_between_iterations.append(time_interval)
        start_time = current_time
    
    # plot the time intervals between iterations
    import matplotlib.pyplot as plt
    plt.plot(time_interval_between_iterations)
    plt.xlabel("Iteration")
    plt.ylabel("Time Interval (s)")
    plt.title("Time Interval Between Iterations")
    plt.show()


if __name__ == "__main__":
    main()
    # import ovito
    # print(f"OVITO version: {ovito.version}")

Is this behavior expected?

Thank you in advance.

Thank you for raising this question. This behavior is indeed unfortunate and initially surprised me as well. However, it is not a bug, but rather the consequence of intentional behavior.

Let me explain:
Some analysis modifiers, such as the ClusterAnalysisModifier used here, are designed to be usable multiple times within the same pipeline, for example with different settings:

pipeline.modifiers.append(ClusterAnalysisModifier(cutoff=1.8))
pipeline.modifiers.append(ClusterAnalysisModifier(cutoff=1.9))

To ensure that the second modifier instance does not overwrite the results of the first, and that the results of both remain accessible at the end of the pipeline, the outputs are automatically numbered consecutively in the output data collection:

data = pipeline.compute()
cluster_count1 = data.attributes['ClusterAnalysis.cluster_count']
cluster_count2 = data.attributes['ClusterAnalysis.cluster_count.2']
table1 = data.tables['clusters']
table2 = data.tables['clusters.2']

This behavior is important, especially in the OVITO GUI, where you typically work with a single linear pipeline and may, for example, want to compare the results of the two calculations at the end of the pipeline.

If you call the apply() method in a loop like this:

for i in range(N):
    data.apply(ClusterAnalysisModifier(...))

the result is equivalent to the example above: the DataCollection keeps growing with each iteration, accumulating new attributes and data objects such as tables.

Here lies the problem: OVITO's DataCollection class is currently not designed to hold hundreds or thousands of data objects. Because the runtime of some lookup and append operations grows quadratically with the number of stored objects, your script becomes noticeably slower with each iteration.
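To make the quadratic growth concrete, here is a minimal pure-Python sketch of the consecutive-numbering scheme described above (this only models the idea; it is not OVITO's actual implementation). Before each append, a free suffix such as 'clusters.2', 'clusters.3', … has to be found, which takes time proportional to the number of objects already stored, so N iterations cost on the order of N² membership tests in total:

```python
def next_key(existing, base):
    """Find the first free key: base, base.2, base.3, ...
    Linear scan over suffixes, modeling the consecutive-numbering
    behavior (an assumption about the scheme, not OVITO's code)."""
    if base not in existing:
        return base
    n = 2
    while f"{base}.{n}" in existing:
        n += 1
    return f"{base}.{n}"

tables = {}
for i in range(100):  # simulate 100 apply() calls on one collection
    key = next_key(tables, "clusters")
    tables[key] = object()  # placeholder for a results table

print(len(tables))               # 100 accumulated tables
print("clusters.100" in tables)  # True: the 100th result
```

The i-th call scans i suffixes before finding a free one, which is where the quadratic total cost comes from, independent of how fast each individual lookup is.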

You can avoid the problem quite easily by creating a new copy of the DataCollection in each iteration and applying the modifier only to this temporary copy:

for i in range(N):
    temp_data = data.clone()
    temp_data.apply(ClusterAnalysisModifier(...))

    clusters = temp_data.tables['clusters']

The clone() operation is cheap, because it only creates a shallow copy.
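Why a shallow copy is cheap can be illustrated with plain Python's copy module (this only models the idea; OVITO's clone() is its own implementation): the container is duplicated, but the large underlying objects are shared rather than copied, so adding results to the copy leaves the original collection untouched.

```python
import copy

# A toy "data collection": the dict is the container, the list
# stands in for a large per-particle array (illustrative names only).
data = {"positions": [0.0] * 1_000_000, "tables": {}}

temp = copy.copy(data)  # shallow copy: new dict, shared values

# The container is new, but the big array is shared, not duplicated:
print(temp is data)                            # False
print(temp["positions"] is data["positions"])  # True

# New entries added to the copy do not appear in the original:
temp["clusters"] = "results table"
print("clusters" in data)                      # False
```

In the same spirit, applying the modifier to the temporary clone adds the new attributes and tables only to that clone, which is then discarded, so the original DataCollection never grows.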

Admittedly, this was hardly foreseeable for you as a user and is a pitfall that is easy to stumble into. We will consider on our end what we can do about this in the future.


Thank you for your reply. I understand it now.