Tips on reducing computation time

Andrew_Wong · June 18, 2020, 1:43am

I have been trying to use the conversion featurizers on a personal dataset of compositions. However, this dataset has decimal-based compositions instead of integer ones (Cu0.75Pd0.25 vs. Cu3Pd, for example). After converting them to integer-based compositions, some became very large (for example, Cu158655Pd1684), so applying featurizers to the dataset takes a very long time. Is there anything I can do to reduce the computation time? I have already tried truncating the compositions by a factor of 1000. However, even after truncation, some compositions still had three-digit amounts and still took a significant amount of time to process. Thank you for your assistance in this matter.

ardunn · June 24, 2020, 6:35pm

Hey Andrew,

Could you paste the code you are using to instantiate and run the conversion featurizer? I am guessing you are using StrToComposition. It will be much easier to troubleshoot if we can see some of the code you are running.

Thanks,
Alex

Andrew_Wong · June 25, 2020, 1:08am

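A way to reproduce the slowdown is:

from pymatgen.core import Composition
from matminer.featurizers.conversions import CompositionToOxidComposition

ctoc = CompositionToOxidComposition()
# Guessing oxidation states for this composition is the slow step
ctoc.featurize(Composition("Cu159Pd2"))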

I used the time tracking feature, and the featurize call took 10.5 seconds to run. The time taken increases as the numbers in the composition increase. Furthermore, I have over 4500 similar compositions that need to be run, resulting in a very long processing time.

ardunn · June 26, 2020, 9:40pm

Hey Andrew,

We have an issue for this currently open on the matminer repo: https://github.com/hackingmaterials/matminer/issues/454

However, the underlying slowness is likely coming from pymatgen’s conversion of a composition without oxidation states into one with oxidation states. You may have more luck asking on the pymatgen forum on matsci.org.

On a more fundamental level though, it seems you are dealing with metal alloy compositions, no? What is the significance of these oxidation states in metals? From what I know, pymatgen simply uses known oxidation states of elements and the composition to solve a series of linear equations (e.g., Fe2O3 means Fe3+ and O2-). In metal alloys, I am not sure these oxidation states are meaningful, but perhaps someone else can advise (e.g., @alex).

Thanks,
Alex

Anubhav_Jain · June 26, 2020, 10:16pm

The oxidation guesses are really for ionic compounds. For an intermetallic like a Cu-Pd alloy, the only oxidation states you care about are Cu0+ and Pd0+.

From the user side, you could detect alloy compositions and just assign the oxidation states yourself rather than using the algorithm to guess the oxidation state. The algorithm takes a long time since it tries to enumerate all the possible oxidation states of all atoms and runs through them to see if they are charge balanced, then ranks solutions according to a probability. With a ton of atoms, the enumeration becomes huge.
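To get a feel for how quickly that enumeration grows, here is a rough sketch. It assumes that, for each element, every unordered assignment of candidate oxidation states to its atoms is considered (a multiset count), and that the per-element counts are multiplied. The numbers of candidate states below are made up for illustration and are not pymatgen's actual tables, so treat the result as an order-of-magnitude indication only.

```python
from math import comb


def n_assignments(n_atoms: int, n_states: int) -> int:
    """Count unordered ways to give each of n_atoms atoms one of n_states states."""
    return comb(n_atoms + n_states - 1, n_states - 1)


def total_candidates(amounts: dict, states_per_element: dict) -> int:
    """Multiply the per-element counts to get a rough size of the search space."""
    total = 1
    for el, n_atoms in amounts.items():
        total *= n_assignments(n_atoms, states_per_element[el])
    return total


# Hypothetical number of candidate oxidation states per element (assumption).
states = {"Cu": 3, "Pd": 2}

small = total_candidates({"Cu": 3, "Pd": 1}, states)
large = total_candidates({"Cu": 159, "Pd": 2}, states)
print(small, large)
```

Under these assumptions the count grows polynomially with the number of atoms per element, with a degree set by the number of candidate states, which is why reducing a composition to its smallest integer ratio helps so much.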

From the developer side, one should modify the pymatgen oxidation guess algorithm so that:

If all known oxidation states of the elements are positive, or all are negative, it just returns zero for everything without enumeration. In this case, the only possible oxidation state for Cu is 2+ and those for Pd are 2+ and 4+, so there is no way to achieve zero overall charge. So Cu and Pd would just be assigned 0+ almost instantaneously.
(Slightly more rigorous) If each atom is assigned its most negative oxidation state and the sum is still greater than zero, you can’t possibly charge balance. Likewise, if each atom is assigned its most positive oxidation state and the sum is still less than zero, you can’t possibly charge balance.

This would be a simple developer-side change that would make this algorithm robust to inputs like Cu3253535Pd3215.
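The bound check described above could look something like the sketch below. This is plain Python, not pymatgen code: the function names and the oxidation-state table are hypothetical, with the Cu and Pd states taken from the numbers quoted above and Fe and O added only for contrast.

```python
# Illustrative oxidation-state table (assumption, not pymatgen's data).
OXI_STATES = {"Cu": [2], "Pd": [2, 4], "Fe": [2, 3], "O": [-2]}


def can_charge_balance(amounts: dict, oxi_states: dict) -> bool:
    """Return False if the bounds prove that zero total charge is unreachable.

    True only means the bounds do not rule it out; enumeration is still needed.
    """
    lowest = sum(n * min(oxi_states[el]) for el, n in amounts.items())
    highest = sum(n * max(oxi_states[el]) for el, n in amounts.items())
    return lowest <= 0 <= highest


def quick_guess(amounts: dict, oxi_states: dict):
    """Assign 0 to every element when charge balance is impossible, else None."""
    if not can_charge_balance(amounts, oxi_states):
        return {el: 0 for el in amounts}
    return None  # caller would fall back to the full enumeration here


print(quick_guess({"Cu": 158655, "Pd": 1684}, OXI_STATES))
print(quick_guess({"Fe": 2, "O": 3}, OXI_STATES))
```

The check costs one pass over the elements regardless of how many atoms there are, so it would short-circuit alloy-like inputs before any enumeration starts.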

Andrew_Wong · June 27, 2020, 5:29am

Could you advise me on how I can assign the oxidation states by myself? I think this would help with my issue tremendously.

Anubhav_Jain · June 27, 2020, 6:42pm

You can use code like the following to assign oxidation states to a Composition object. If I get a chance, I’ll push a function to the Composition object that makes this easier:

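from pymatgen.core import Composition, Specie

# Oxidation state to assign to each element (0 for the metals in an alloy)
el_oxi = {"Fe": 0, "Pd": 0}
comp = Composition("Fe354Pd35")

# Build a new composition keyed by species (element plus oxidation state)
species_amts = {}
for elem, amt in comp.items():
    el = str(elem)
    species_amts[Specie(el, el_oxi[el])] = amt
comp_oxid = Composition(species_amts)

print(comp_oxid)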