Saving an EnsembleOptimizer and using it for uncertainty prediction with new structures

What’s the best way to do this? I want to do some active learning, so I was hoping to do something like:

A, y = sc.get_fit_data(key='mixing_energy')
opt = EnsembleOptimizer((A, y),
                        fit_method=fit_method,
                        ensemble_size=ensemble_size)
opt.train()
ce = ClusterExpansion(cluster_space=cs, parameters=opt.parameters)
ce.write('cluster_model.ce')

And then read the ensemble model later to predict mixing_energy and mixing_energy_std:

ce_ensemble = ClusterExpansion.read('cluster_model.ce')
for structure in structures:
    mixing_energy, mixing_energy_std = ce_ensemble.predict(structure)

But it looks like this isn’t how the EnsembleOptimizer works and it only saves the mean model. Should I save the mixing_energy parameter variances to a separate cluster expansion for error estimation or is there an easier way that I’ve missed?

A cluster expansion cannot be created with multiple sets of parameters.

The EnsembleOptimizer holds the different sets of parameters in .parameters_splits, see e.g. here

You can then e.g. predict the mean and standard deviation of the energy of a structure with something like

import numpy as np

cv = cs.get_cluster_vector(atoms)
energies = np.dot(opt.parameters_splits, cv)
E_mean = np.mean(energies)
E_std = np.std(energies)

Thank you! I thought that was the case, so I was wondering what the preferred way of evaluating those parameters was. As for storing the ensemble of models, do you recommend that I store 'parameters_splits' as a '.npy' (NumPy array) or as a set of '.ce' files?

Also, I had some theory ideas that maybe we could talk about sometime.

I think it's probably easiest to just save a collection of CEs.
But depending on the use case, all those files may get large and uncertainty predictions slower (since if you do for ce in ces and call ce.predict(), the cluster vector is recomputed every time).
So for training-set generation I personally just save the parameters as .npy.
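As a sketch of that .npy approach (the random arrays below are stand-ins for the real objects: in practice parameters_splits would come from opt.parameters_splits after opt.train(), and the cluster vector from cs.get_cluster_vector(structure); the file name is illustrative):

```python
import numpy as np

# Stand-in for opt.parameters_splits, with shape (ensemble_size, n_parameters).
rng = np.random.default_rng(0)
parameters_splits = rng.normal(size=(50, 12))

# Persist the whole ensemble in a single file ...
np.save('parameters_splits.npy', parameters_splits)
# ... and reload it later, e.g. in the active-learning loop.
params = np.load('parameters_splits.npy')

def predict_with_std(params, cluster_vector):
    # One matrix-vector product yields every ensemble member's prediction,
    # so the cluster vector is computed only once per structure.
    energies = params @ cluster_vector
    return energies.mean(), energies.std()

cv = rng.normal(size=12)  # stand-in for cs.get_cluster_vector(structure)
E_mean, E_std = predict_with_std(params, cv)
```

This avoids both the per-CE file overhead and the repeated cluster-vector evaluation mentioned above.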

Happy to discuss, send me an email :slight_smile: