Candidate datasets

This topic gathers suggestions for candidate datasets for the Data Dictionaries working group effort.

A candidate dataset is one for which developing and publishing a corresponding data dictionary would be helpful.

We prefer that candidate datasets be already collected. Ideally they are already accessible online, but they may also be accessible only to a subset of our members. The important quality here is that the dataset exists, encoded in some form, somewhere, today. We do not wish to work with a “future” dataset that depends on computational or experimental work that is yet to be completed.

I have ~20 years of microscopy abstracts from the annual Microscopy and Microanalysis conference (13 GB, PDF format). I don’t think I can post these publicly online. But what if I ran a Python script that strips out everything but the text and produces a text-only payload? Would I be violating copyright if I did that? The data is already hosted on a NIST file server, so in principle, everyone in the WG can get access. Let me know what you all think.

What about the dataset “An Inter-Laboratory Study of Zn-Sn-Ti-O Thin Films using High-throughput Experimental Methods” (CKAN listing), from this paper?

It contains multiple modalities of spectral data and extracted properties for Zn–Sn–Ti–O thin films made by two different deposition methods.

Perhaps we could use just a subset of the dataset if the whole is too complex for a pilot-level effort?

Phonon database (Sci. Data 5, 180065 (2018)):

Electronic transport database (Sci. Data 4, 170085 (2017)):

Refractive index database (Phys. Rev. Materials 3, 044602 (2019)):

@briandecost this looks promising. I notice that you are a co-author on the paper. Are you intimately familiar with the dataset and its terminology for the purpose of data dictionary work?

@june.lau I thought that abstracts are generally okay for public distribution, as opposed to the full text of proceedings. For example, the Matscholar project makes full abstract text available. Whom might we contact to verify this?

In any case, we could post any computed data that is derived from the source data. For example, the Materials Project sources experimental crystalline structure data as CIF files from the ICSD and cannot share those directly, but can and does openly share CIF files that are the result of performing structural relaxations on the source structures using DFT. Similarly, I think we would be able to post any structured data resulting from your script executing against the abstracts, for example to recognize named entities.
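To make the "post derived data, not source text" idea concrete, here is a minimal sketch. It assumes the abstract text has already been extracted from the PDFs by some other step (not shown), and it uses a simple whole-word match against a small, hypothetical vocabulary of technique acronyms as a stand-in for real named-entity recognition. The names `TECHNIQUE_TERMS`, `derive_record`, and `to_json_lines` are illustrative only and do not come from any existing script.

```python
import json
import re
from collections import Counter

# Hypothetical vocabulary of technique acronyms (an assumption for illustration).
TECHNIQUE_TERMS = ("TEM", "SEM", "STEM", "EELS", "EDS", "AFM")

# Match whole words only, so "STEM" is not also counted as "TEM".
_TERM_PATTERN = re.compile(r"\b(" + "|".join(TECHNIQUE_TERMS) + r")\b")


def derive_record(abstract_id, text):
    """Return term counts for one abstract, without including the source text."""
    counts = Counter(_TERM_PATTERN.findall(text))
    return {"id": abstract_id, "technique_counts": dict(sorted(counts.items()))}


def to_json_lines(abstracts):
    """Serialize one derived record per line (JSON Lines), ordered by abstract id."""
    return "\n".join(
        json.dumps(derive_record(abstract_id, text), sort_keys=True)
        for abstract_id, text in sorted(abstracts.items())
    )
```

The output carries only identifiers and counts, so none of the original wording is redistributed. Whether that is sufficient from a copyright standpoint is still a question for whoever we contact about the abstracts.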

I’m familiar enough with the dataset to at least get some momentum going, and I can certainly pull in some of those co-authors to make sure everything is correct.

How do people feel about the three datasets proposed by @june.lau, @briandecost, and @gmrigna? Any additional dataset candidates?

@owodo @Jamie_McCusker @kennethk @stuchalk @rar @ml-evs @kjappelbaum @Zachary_Trautt

I’m on board with the Zn–Sn–Ti–O dataset. It has been on my to-do list to get it back up.

I think these are all good, interesting candidates, and I would be interested in working on the microscopy abstracts…

Looks good. I would also be interested in getting involved in the microstructural data aspects. I should have a dataset posted (with a URL handle) two weeks from now, containing 1700 microstructures generated using a computational model and annotated with descriptors.