Candidate datasets

This topic is to gather suggestions for datasets for the Data Dictionaries working group effort.

A dataset is something for which developing and publishing a corresponding dictionary would be helpful.

We prefer that candidate datasets be already collected. Ideally they are already accessible online, but they may also be accessible only to a subset of our members. The important quality here is that the dataset exists, encoded in some form, somewhere, today. We do not wish to work with a “future” dataset that depends on computational or experimental work that is yet to be completed.

I have ~20 years of microscopy abstracts from the annual Microscopy and Microanalysis conference (13 GB, PDF format). I don’t think I can post these publicly online. But what if I ran a Python script that strips out everything but the text and shared only that text payload? Would I be violating copyright if I did that? The data is already hosted on a NIST file server, so in principle, everyone in the WG can get access. Let me know what you all think.

What about the dataset “An Inter-Laboratory Study of Zn-Sn-Ti-O Thin Films using High-throughput Experimental Methods” (hosted on CKAN) from this paper?

There are multiple modalities of spectral data and extracted properties for Zn–Sn–Ti–O thin films made by two different deposition methods.

Perhaps just a subset of the dataset, if it is too complex for a pilot-level effort?


Phonon database (Sci. Data 5, 180065 (2018)):

Electronic transport database (Sci. Data 4, 170085 (2017)):

Refractive index database (Phys. Rev. Materials 3, 044602 (2019)):


@briandecost this looks promising. I notice that you are a co-author on the paper. Are you intimately familiar with the dataset and its terminology for the purpose of data dictionary work?

@june.lau I thought that abstracts are generally okay for public distribution, as opposed to the full text of proceedings. For example, the Matscholar project makes full abstract text available. Whom might we contact to verify this?

In any case, we could post any computed data that is derived from the source data. For example, the Materials Project sources experimental crystalline structure data as CIF files from the ICSD and cannot share those directly, but can and does openly share CIF files that are the result of performing structural relaxations on the source structures using DFT. Similarly, I think we would be able to post any structured data resulting from your script executing against the abstracts, for example to recognize named entities.

I’m familiar enough with the dataset to at least get some momentum going, and I can certainly pull in some of those co-authors to make sure everything is correct.


How do people feel about the three datasets proposed by @june.lau, @briandecost, and @gmrigna? Any additional dataset candidates?

@owodo @Jamie_McCusker @kennethk @stuchalk @rar @ml-evs @kjappelbaum @Zachary_Trautt

I’m on board with the Zn–Sn–Ti–O dataset. It has been on my to-do list to get it back up.


I think these are all good/interesting candidates, and I would be interested in working on the microscopy abstracts…


Looks good. I would also be interested in getting involved in the microstructural data aspects. I should have a dataset posted (with a URL handle) two weeks from now, containing 1700 microstructures generated using a computational model and annotated with descriptors.

@june.lau @dwinston Do we have a sense of what terms we would be structuring around the abstracts? Is this a bibliographic exercise, or are there existing keywords? If we need to build the index as part of the exercise, this catapults it into a much more expensive venture.

So, I’ve thought about this question in the past, and below is a list of terms that I came up with on a first pass (not exhaustive, maybe not even good). I imagined that there would be some clustering tool somewhere (for example, the Matscholar tool) that we might borrow. I’m not entirely sure if this is the right way to go about this problem.

		Example clusters:
		Energy Dispersive X-Ray Microanalysis
		(EDX) Energy Dispersive X-ray Spectroscopy
		(EDS) Energy Dispersive Spectroscopy
		(SXES) Soft x-ray emission spectroscopy
		(EELS) Electron energy-loss spectroscopy
		(EELS) Electron energy-loss spectrometry
		(VEELS) valence EELS
		(EFTEM) energy-filtered TEM
		(CBED) convergent beam electron diffraction
		(SAD) selected area diffraction
		(SAED) selected area electron diffraction
		Electron tomography
		(APT) atom probe tomography
		Atom probe
		(ESEM) Environmental SEM
		(ETEM) Environmental TEM/STEM
		(HRTEM) High-resolution TEM
		(t-SEM) transmission scanning electron microscopy
		In situ
		In operando
		(CLEM) Correlative light and electron microscopy
		Secondary electrons
		Biological specimen
		(NW) nanowires
		(NP) nanoparticles
		(specimen prep techniques)
		Staining (positive, negative)
		(FIB) Focused ion beam
		(microscope attribute)
		accelerating voltage
		high tension
		Landing voltage
		Landing energy
		(microscope technology)
		(TEM) Transmission electron microscope/microscopy
		(STEM) Scanning transmission electron microscope/microscopy 
		(SEM) Scanning electron microscope/microscopy
		(LEEM) Low-energy electron microscopy
			(SP-LEEM) spin-polarized LEEM
		(PEEM) Photoemission electron microscopy
		FIB-SEM, Dual beam
		(instrument type)
		SEMs: Gemini, S4XXXX, S5XXX, Quanta, etc.
		TEMs: ARM, Tecnai, Krios, etc.
		Dual-beams: Nova, Helios, Nvision, etc.
		(detector technology)
		germanium ED detector
		(EELS) Electron energy-loss spectrometer
		(PEELS) Parallel EELS 
		Pixelated detector
		Direct-electron detector
		Single-electron detector
		Silicon Drift Detectors
		CMOS detectors
		(HAADF) high-angle annular dark-field detector
		CMOS cameras: Oneview, etc.
		Direct cameras: K2, K3, Merlin, etc.
		CCD cameras: Ultrascan, Orius, Multiscan, etc.
		(detector class)
		(beam damage)
		Beam damage
		Dose rate
		Spectrum image
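
To make the idea of clusters a bit more concrete, here is a minimal sketch of how synonyms and acronyms from the list above could be folded into one label per cluster and then used to tag the plain text of an abstract. Everything named here (`CLUSTERS`, `build_patterns`, `tag_abstract`, the choice of which variants share a label, and the example sentence) is an assumption for illustration, not an existing tool or an agreed vocabulary. Acronyms are matched case-sensitively so that ordinary words such as "sad" or "apt" are not picked up; spelled-out phrases are matched case-insensitively.

```python
import re

# Hypothetical clusters: one label per cluster, with surface forms taken
# from the list above. Which variants belong together is an assumption.
CLUSTERS = {
    "EDS": ["EDS", "EDX", "energy dispersive x-ray spectroscopy",
            "energy dispersive spectroscopy",
            "energy dispersive x-ray microanalysis"],
    "EELS": ["EELS", "electron energy-loss spectroscopy",
             "electron energy-loss spectrometry"],
    "SAED": ["SAED", "SAD", "selected area diffraction",
             "selected area electron diffraction"],
    "APT": ["APT", "atom probe tomography", "atom probe"],
}


def build_patterns(clusters):
    """Compile one regular expression per cluster label."""
    patterns = {}
    for label, forms in clusters.items():
        parts = []
        # Longest forms first, so longer phrases are tried before shorter ones.
        for form in sorted(forms, key=len, reverse=True):
            escaped = re.escape(form)
            # All-caps acronyms stay case-sensitive; phrases ignore case.
            parts.append(escaped if form.isupper() else "(?i:" + escaped + ")")
        # The lookarounds keep matches from starting or ending mid-word.
        patterns[label] = re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)")
    return patterns


def tag_abstract(text, patterns):
    """Return the sorted cluster labels whose surface forms occur in text."""
    return sorted(label for label, pat in patterns.items() if pat.search(text))


PATTERNS = build_patterns(CLUSTERS)
example = ("We combined EDX mapping with electron energy-loss spectroscopy "
           "and atom probe data.")
print(tag_abstract(example, PATTERNS))  # ['APT', 'EDS', 'EELS']
```

Plain string matching like this would miss phrases that are broken across lines in text extracted from PDFs, and it does no disambiguation, so it is only a starting point next to a trained entity recognizer.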

It’s an exercise worth doing, but I was asking the meta-question of whether it had been done yet, or whether the annotation was part of the necessary development. If the abstracts were already annotated, our working group would just attack the focused question of how to express those terms in a common interchange format. As they are not already annotated, we would need to structure them first.

A reasonable metaphor here is that our working group wants to write an instruction manual, but if the dataset isn’t annotated yet, we need to write an instruction manual for a device that hasn’t been built yet.

Due to similar concerns, I think Zn–Sn–Ti–O is the better candidate to start with. The phonons also look good to me.

Okay, I think we’ve got three good candidate datasets available, and one on deck.

My summary / elaboration of the above:

I think this is a good set, spanning datasets derived from experiments, annotations, and calculations. I am on board with all of them. I may start with the M&M abstracts work in order to bring the Matscholar named entities online, to be sure we can annotate the abstracts readily via the Matscholar API and ease Ken’s valid concern.

I suggest we start three new topics on this discussion board, one for each of the identified primary datasets, in order to isolate their particular discussions.

I’m most interested in the thin film dataset; it already looks like an exemplary dataset in the field given the developing connections with HTEM and FAIR digital objects (I think due to Zach!).

We will have attendees representing HTEM at the upcoming OPTIMADE workshop (free to register and attend…) who can provide some context too.


I’m on board for the phonon database dataset. (The Zn-Sn-Ti-O thin films would also be ok if needed to distribute people more evenly).