References for expt_gap dataset

Hi all,

For the expt_gap dataset (experimental band gaps of 6354 inorganic semiconductors), is there an easy way to access the literature references for each of these entries? I assume the data came from Citrine originally?



And/or is the matbench_expt_gap dataset preferred, if it’s easier to get references for these values?

Hi Matt,

I’m not sure about getting the individual references for each entry, but there is more information on the provenance of the dataset here:

Hey Matt,

The matbench one may be preferred depending on what you are doing.

The original is a compendium of many experimental measurements, many of which conflict (e.g., is the band gap of X2Y3 0.3 eV or 0.28 eV?). The matbench dataset removes many of these conflicting entries (spans of more than 0.1 eV, iirc) and condenses the remaining groups of disagreeing measurements (of a single composition) to the measurement closest to the average of the group. However, removing these entries also means losing valuable data.
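To make that cleaning step concrete, here is a minimal sketch of the idea in plain Python. It is not the actual matbench code: the function name `condense_measurements`, the toy records, and the 0.1 eV cutoff (which I am only recalling from memory) are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def condense_measurements(records, max_span=0.1):
    """Collapse repeated measurements to one value per composition.

    records: iterable of (composition, gap_in_eV) pairs.
    max_span: assumed cutoff in eV; groups that disagree by more are dropped.
    """
    groups = defaultdict(list)
    for composition, gap in records:
        groups[composition].append(gap)

    condensed = {}
    for composition, gaps in groups.items():
        # Drop compositions whose reported gaps disagree too strongly.
        if max(gaps) - min(gaps) > max_span:
            continue
        avg = mean(gaps)
        # Keep the real measurement closest to the group average,
        # rather than the average itself.
        condensed[composition] = min(gaps, key=lambda g: abs(g - avg))
    return condensed


# Toy data (hypothetical compositions and values).
records = [
    ("X2Y3", 0.30), ("X2Y3", 0.28), ("X2Y3", 0.31),
    ("AB", 1.10), ("AB", 1.50),
    ("CD2", 2.00),
]
result = condense_measurements(records)
print(result)
```

Here the "AB" group is discarded because its two values differ by 0.4 eV, which is exactly the kind of data loss mentioned above.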

The idea is that since the original has conflicting measurements of identical compositions, it’s not good for validating models (unless you explicitly keep each composition's group together during train/test splitting). If X2Y3 = 0.3 eV is in your training set, predicting on X2Y3 = 0.28 eV in your test set will not tell you much about how your model is actually doing.
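If you do want to validate on the original dataset, a grouped split along these lines would avoid that leakage. This is only a sketch in plain Python: `grouped_train_test_split` and the toy records are hypothetical names and values, not part of matminer.

```python
import random
from collections import defaultdict


def grouped_train_test_split(records, test_fraction=0.2, seed=0):
    """Split (composition, gap) records so that no composition
    appears in both the training and the test set."""
    groups = defaultdict(list)
    for composition, gap in records:
        groups[composition].append((composition, gap))

    # Shuffle the unique compositions, not the individual measurements.
    compositions = sorted(groups)
    random.Random(seed).shuffle(compositions)

    n_test = max(1, round(len(compositions) * test_fraction))
    test = [r for c in compositions[:n_test] for r in groups[c]]
    train = [r for c in compositions[n_test:] for r in groups[c]]
    return train, test


# Toy data (hypothetical compositions and values).
records = [
    ("X2Y3", 0.30), ("X2Y3", 0.28),
    ("AB", 1.10), ("AB", 1.50),
    ("CD2", 2.00),
    ("EF", 0.75),
]
train, test = grouped_train_test_split(records, test_fraction=0.25)
train_comps = {c for c, _ in train}
test_comps = {c for c, _ in test}
print(sorted(train_comps), sorted(test_comps))
```

If you already use scikit-learn, `GroupShuffleSplit` or `GroupKFold` with the composition as the group label does the same job.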

In short:

  • To make production predictions, you might want to use the original, full expt_gap dataset.
  • To validate a model, use the matbench one.

To get more info on the dataset:

from matminer.datasets.dataset_retrieval import get_all_dataset_info
print(get_all_dataset_info("expt_gap"))  # description, columns, and reference info


Also Alex is right regarding the references. They are from a bunch of different studies, AFAIK. If you do use the matbench one, maybe also cite matminer.

Many thanks @ardunn! This is really useful information. The matbench dataset definitely seems like the more generally useful one.

So the source of this data seems to be this spreadsheet, which, frustratingly, also does not contain any references, though the paper does cite four additional publications:

  • Kiselyova, N. N.; Dudarev, V. A.; Korzhuyev, M. A. Database on the Bandgap of Inorganic Substances and Materials. Inorg. Mater. Appl. Res. 2016, 7, 34–39, DOI: 10.1134/S2075113316010093

  • Strehlow, W. H.; Cook, E. L. Compilation of Energy Band Gaps in Elemental and Binary Compound Semiconductors and Insulators. J. Phys. Chem. Ref. Data 1973, 2, 163–200, DOI: 10.1063/1.3253115

  • Joshi, N. V. Photoconductivity: Art, Science, and Technology; Marcel Dekker: New York, 1990.

  • Madelung, O. Semiconductors: Data Handbook; Springer: New York, 2004.

The frustrating thing about this is that the paper does not even indicate which of those four publications a given entry comes from. I imagine the 2016 reference might be the most complete, and has its own searchable database at

I’ve included the above in case it’s helpful to anyone else trying to track this down.