Getting information about the compositions in a dataset (metal, insulator etc.)


I’m working with a dataset of chemical compositions and associated target values, and I’m wondering whether there is a way to obtain extra information about the inputs, in particular whether a given (inorganic) composition represents a metal, an insulator, or a semiconductor. I would like this in order to understand how chemically biased my dataset is. Unfortunately, I have little domain knowledge, and I suspect that a chemical composition alone does not allow strong conclusions, but I’m still wondering whether a simple classification approach (maybe using pymatgen/matminer or similar) could give an approximate discrimination.

Many thanks,


Hello Federico,
I do think I’ve seen machine learning studies which tried to predict metal/insulator just from the chemical composition, but I don’t remember them being very accurate - it’s a hard problem. You might be better off with a “brute-force” approach: it’s not difficult to obtain the calculated electronic band gap of a material, or any other supported_property, from the Materials Project API, which lets you classify it directly as a metal or an insulator (classifying a material as a semiconductor is more complex, requiring knowledge of crystal defects, carrier effective mass, etc.). If you do decide to do this, you might want to classify gap < 0.1 eV as a metal, because some calculations with coarse k-point grids may show a tiny gap when the true band structure is metallic. Conversely, the well-known DFT underestimation of the band gap may bias the results in the opposite direction, labelling some small-gap materials as metals.
Another potential issue is that the same chemical composition may exist in several different structural polymorphs, each of which will have a different band gap. It’s hard to guess whether that would be a major problem for you without knowing quite what you’re doing with the dataset. I would probably take the lowest-energy polymorph and say if that’s metallic, then the composition counts as metallic.
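To make the two rules above concrete (the 0.1 eV cut-off and taking the lowest-energy polymorph), here is a minimal sketch. It assumes you have already downloaded one record per material as a plain dict; the keys `pretty_formula`, `e_above_hull` and `band_gap` follow the legacy Materials Project field names, and the example records are invented for illustration, not real data.

```python
from collections import defaultdict

METAL_GAP_THRESHOLD_EV = 0.1  # calculated gaps below this are treated as metallic


def classify_by_gap(band_gap_ev, threshold_ev=METAL_GAP_THRESHOLD_EV):
    """Label a calculated band gap (in eV) as 'metal' or 'insulator'."""
    return "metal" if band_gap_ev < threshold_ev else "insulator"


def classify_compositions(records, threshold_ev=METAL_GAP_THRESHOLD_EV):
    """Classify each composition by the gap of its lowest-energy polymorph."""
    by_formula = defaultdict(list)
    for rec in records:
        by_formula[rec["pretty_formula"]].append(rec)

    labels = {}
    for formula, polymorphs in by_formula.items():
        # Lowest energy above hull = most stable polymorph of this composition.
        ground_state = min(polymorphs, key=lambda r: r["e_above_hull"])
        labels[formula] = classify_by_gap(ground_state["band_gap"], threshold_ev)
    return labels


# Hypothetical records (assumed layout), two polymorphs of "AB" and one of "CD2".
example_records = [
    {"pretty_formula": "AB", "e_above_hull": 0.00, "band_gap": 0.05},
    {"pretty_formula": "AB", "e_above_hull": 0.12, "band_gap": 1.40},
    {"pretty_formula": "CD2", "e_above_hull": 0.00, "band_gap": 2.30},
]

print(classify_compositions(example_records))
```

Here "AB" is labelled a metal because its most stable polymorph has a gap below the threshold, even though a higher-energy polymorph is gapped. Whether that is the right convention depends on what you do with the dataset.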


Hi Steven and thanks for the detailed answer.

I’ve recently found this work, in which ML models are trained to do what I asked about, but in addition to stoichiometry they also use structural information. As you pointed out, I was fairly sure that one cannot conclude much while remaining agnostic to structure.

As you suggested, I’m now interested in just training a classifier to predict metal/insulator based on band gap values. The only thing is that I would like to retrieve from the Materials Project only the entries with an experimentally matched crystal structure (they are shown in green). Do you know if there is a simple way to query only those?

Many thanks,


The first thing I would try would be MPRester.query with the legacy API, passing a criteria dict as an argument. I think you can restrict your query to materials which have at least one item in the icsd_ids list. See: mapidoc/materials/snl_final/about at master · materialsproject/mapidoc · GitHub
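A rough sketch of what such a query might look like. The criteria use MongoDB-style syntax; the `icsd_ids.0` / `$exists` form is my assumption for "at least one ICSD id" and I have not verified it against the server. The call itself assumes a pymatgen version whose `MPRester` still exposes the legacy `query` method and a legacy API key, so it is wrapped in a function and not executed here. `has_icsd_match` is a hypothetical helper for checking the same condition on records you already hold locally.

```python
# Assumed MongoDB-style criteria: the icsd_ids list has a first element.
ICSD_CRITERIA = {"icsd_ids.0": {"$exists": True}}

# Legacy Materials Project field names to return for each material.
PROPERTIES = ["material_id", "pretty_formula", "band_gap", "e_above_hull", "icsd_ids"]


def fetch_icsd_matched(api_key, extra_criteria=None):
    """Query the legacy API for materials with at least one ICSD id.

    Requires network access, a legacy API key, and a pymatgen release
    that still provides MPRester.query (an assumption, not checked here).
    """
    from pymatgen.ext.matproj import MPRester  # imported lazily on purpose

    criteria = dict(ICSD_CRITERIA)
    criteria.update(extra_criteria or {})
    with MPRester(api_key) as mpr:
        return mpr.query(criteria=criteria, properties=PROPERTIES)


def has_icsd_match(record):
    """Local equivalent of the criteria for an already-downloaded record."""
    return len(record.get("icsd_ids") or []) > 0
```

Passing something like `extra_criteria={"pretty_formula": "Fe2O3"}` would narrow the query to one composition, again assuming the legacy field names.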

Hi Federico,

There is relevant literature regarding ML models to predict the bandgap, for example, Learning properties of ordered and disordered materials from multi-fidelity data | Nature Computational Science

To select only experimental structures via MPRester you can use the theoretical tag and set it to False.
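As a sketch of the `theoretical=False` filter, assuming the newer `mp-api` client: the `materials.summary.search` endpoint and the field names below are what I believe recent releases use (some older versions expose it as `mpr.summary.search`), so treat them as assumptions. The fetch function needs a key and network access and is not called here; `to_record` is a hypothetical helper that flattens a returned document into a plain dict, demonstrated on a stand-in object rather than real data.

```python
from types import SimpleNamespace

FIELDS = ["material_id", "formula_pretty", "band_gap", "energy_above_hull"]


def fetch_experimental(api_key):
    """Return summary documents for entries not flagged as theoretical."""
    from mp_api.client import MPRester  # imported lazily; requires mp-api

    with MPRester(api_key) as mpr:
        # theoretical=False keeps only entries matched to an experimental structure.
        return mpr.materials.summary.search(theoretical=False, fields=FIELDS)


def to_record(doc):
    """Flatten a summary document into a plain dict for later classification."""
    return {
        "material_id": str(doc.material_id),
        "formula": doc.formula_pretty,
        "band_gap": doc.band_gap,
        "e_above_hull": doc.energy_above_hull,
    }


# Stand-in document with invented values, only to show the record layout.
fake_doc = SimpleNamespace(
    material_id="mp-0000", formula_pretty="AB", band_gap=0.0, energy_above_hull=0.0
)
print(to_record(fake_doc))
```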
Good luck!