Candidate datasets

dwinston · April 29, 2021, 4:07pm

This topic is to gather suggestions for datasets for the Data Dictionaries working group effort.

A dataset is something for which developing and publishing a corresponding dictionary would be helpful.

We prefer that candidate datasets be already collected. Ideally they are already accessible online, but they may also be accessible only to a subset of our members. The important quality here is that the dataset exists, encoded in some form, somewhere, today. We do not wish to work with a “future” dataset that depends on computational or experimental work that is yet to be completed.

june.lau · April 29, 2021, 6:10pm

I have ~20 years of microscopy abstracts from the annual Microscopy and Microanalysis conference (13 GB, PDF format). I don’t think I can post these publicly online. But what if I ran a Python script that strips out everything but the text and posted a text-only payload? Would I be violating copyright if I did that? The data is already hosted on a NIST file server, so in principle, everyone in the WG can get access. Let me know what you all think.

briandecost · April 29, 2021, 6:48pm

What about the dataset “An Inter-Laboratory Study of Zn-Sn-Ti-O Thin Films using High-throughput Experimental Methods” (on CKAN), from this paper?

There are multiple modalities of spectral data and extracted properties for Zn–Sn–Ti–O thin films made by two different deposition methods.

Perhaps just a subset of the dataset if it is too complex for a pilot level effort?

gmrigna · April 29, 2021, 5:52pm

Phonon database (Sci. Data 5, 180065 (2018)):

Electronic transport database (Sci. Data 4, 170085 (2017)):
https://doi.org/10.5061/dryad.gn001

Refractive index database (Phys. Rev. Materials 3, 044602 (2019)):
https://journals.aps.org/prmaterials/supplemental/10.1103/PhysRevMaterials.3.044602/db.csv

dwinston · April 29, 2021, 7:14pm

@briandecost this looks promising. I notice that you are a co-author on the paper. Are you intimately familiar with the dataset and its terminology for the purpose of data dictionary work?

dwinston · April 29, 2021, 7:24pm

@june.lau I thought that abstracts are generally okay for public distribution, as opposed to the full text of proceedings. For example, the Matscholar project makes full abstract text available. Whom might we contact to verify this?

In any case, we could post any computed data that is derived from the source data. For example, the Materials Project sources experimental crystalline structure data as CIF files from the ICSD and cannot share those directly, but can and does openly share CIF files that are the result of performing structural relaxations on the source structures using DFT. Similarly, I think we would be able to post any structured data resulting from your script executing against the abstracts, for example to recognize named entities.

briandecost · April 29, 2021, 7:34pm

I’m familiar enough with the dataset to at least get some momentum going, and I can certainly pull in some of those co-authors to make sure everything is correct

dwinston · May 13, 2021, 7:16pm

How do people feel about the three datasets proposed by @june.lau, @briandecost, and @gmrigna ? Any additional dataset candidates?

@owodo @Jamie_McCusker @kennethk @stuchalk @rar @ml-evs @kjappelbaum @Zachary_Trautt

Zachary_Trautt · May 14, 2021, 12:02pm

I’m on board with the Zn–Sn–Ti–O dataset. It has been on my to-do list to get it back up.

stuchalk · May 14, 2021, 1:25pm

I think these are all good/interesting candidates, and I would be interested in working on the microscopy abstracts…

owodo · May 17, 2021, 1:34am

Looks good. I would also be interested in getting involved in the microstructural data aspects. I should have data posted (with a URL handle) two weeks from now: 1700 microstructures, generated using the computational model and annotated with descriptors.

kennethk · May 17, 2021, 3:35pm

@june.lau @dwinston Do we have a sense of what terms we would be structuring around the abstracts? Is this a bibliographic exercise, or are there existing keywords? If we need to build the index as part of the exercise, this catapults it into a much more expensive venture.

june.lau · May 17, 2021, 4:18pm

So, I’ve thought about this question in the past, and below is a list of terms that I came up with on a first pass (not exhaustive, and maybe not even good). I imagined that there would be some clustering tool somewhere (for example, the Matscholar tool) that we might borrow. I’m not entirely sure this is the right way to go about this problem.

		Example clusters:
		
		(techniques)
		Energy Dispersive X-Ray Microanalysis
		(EDX) Energy Dispersive X-ray Spectroscopy
		(EDS) Energy Dispersive Spectroscopy
		(SXES) Soft x-ray emission spectroscopy
		
		(EELS) Electron energy-loss spectroscopy
		(EELS) Electron energy-loss spectrometry
		(VEELS) valence EELS
		(EFTEM) energy-filtered TEM
		
		(CBED) convergent beam electron diffraction
		(SAD) selected area diffraction
		(SAED) selected area electron diffraction
		
		Electron tomography
		(APT) atom probe tomography
		Atom probe
		
		(ESEM) Environmental SEM
		(ETEM) Environmental TEM/STEM
		
		(HRTEM) High-resolution TEM
		(t-SEM) transmission scanning electron microscopy
		
		In situ
		In operando
		
		(CLEM) Correlative light and electron microscopy
		
		(product)
		x-ray
		Photon
		Secondary electrons
		
		(specimens)
		Biological specimen
		Textile
		fiber
		Meteorite
		Polymer
		Tissue
		Semiconductors
		Superconductors
		Crystalline
		Amorphous
		Nanotubes
		Graphene
		2D
		(NW) nanowires
		(NP) nanoparticles
		
		
		(specimen prep techniques)
		Ultramicrotomy/microtome
		Staining (positive, negative)
		Vitrobot
		Blotting
		Plunge-freeze
		(FIB) Focused ion beam
		
		
		(microscope attribute)
		accelerating voltage
		high tension
		Landing voltage
		Landing energy
		
		(microscope technology)
		(TEM) Transmission electron microscope/microscopy
		(STEM) Scanning transmission electron microscope/microscopy 
		(SEM) Scanning electron microscope/microscopy
		(LEEM) Low-energy electron microscopy
			(SP-LEEM) spin-polarized LEEM
		(PEEM) Photoemission electron microscopy
		FIB-SEM, Dual beam
		
		
		
		(instrument type)
		SEMs: Gemini, S4XXXX, S5XXX, Quanta, etc.
		TEMs: ARM, Tecnai, Krios, etc.
		Dual-beams: Nova, Helios, Nvision, etc.
		
		
		
		(detector technology)
		germanium ED detector
		(EELS) Electron energy-loss spectrometer
		(PEELS) Parallel EELS 
		CCD
		Pixelated detector
		Direct-electron detector
		Single-electron detector
		Silicon Drift Detectors
		CMOS detectors
		(HAADF) high-angle annular dark-field detector
		CMOS cameras: Oneview, etc.
		Direct cameras: K2, K3, Merlin, etc.
		CCD cameras: Ultrascan, Orius, Multiscan, etc.
		
		(detector class)
		
		(beam damage)
		Beam damage
		dose
		Dose rate
		Radiation
		Electrolysis
		Knock-on
		
		(data)
		Spectrum
		Image
		Spectrum image
		4D STEM
		Diffraction
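As a rough illustration of how clusters like these could be used to tag abstract text, here is a minimal Python sketch. It assumes the abstracts have already been converted to plain text; the `CLUSTERS` mapping, the canonical labels, and the function names are hypothetical and cover only a few of the terms listed above. It uses plain regular-expression matching, not the Matscholar tooling.

```python
import re
from collections import Counter

# Hypothetical mapping (an assumption for illustration): canonical label ->
# full phrases (matched case-insensitively) and abbreviations (case-sensitive).
CLUSTERS = {
    "EDS": {
        "phrases": ["energy dispersive x-ray spectroscopy",
                    "energy dispersive spectroscopy"],
        "abbreviations": ["EDX", "EDS"],
    },
    "EELS": {
        "phrases": ["electron energy-loss spectroscopy",
                    "electron energy-loss spectrometry"],
        "abbreviations": ["EELS"],
    },
    "SAED": {
        "phrases": ["selected area electron diffraction",
                    "selected area diffraction"],
        "abbreviations": ["SAED", "SAD"],
    },
}


def compile_patterns(clusters):
    """Build one whole-word regex per canonical label."""
    patterns = {}
    for label, forms in clusters.items():
        phrases = "|".join(re.escape(p) for p in forms["phrases"])
        abbrs = "|".join(re.escape(a) for a in forms["abbreviations"])
        # (?i:...) makes only the phrase branch case-insensitive, so the
        # abbreviation "SAD" does not match the ordinary word "sad".
        patterns[label] = re.compile(
            rf"(?<!\w)(?:(?i:{phrases})|{abbrs})(?!\w)"
        )
    return patterns


def tag_abstract(text, patterns):
    """Count how often each cluster is mentioned in one abstract."""
    counts = Counter()
    for label, pattern in patterns.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[label] = hits
    return counts
```

Calling `tag_abstract(text, compile_patterns(CLUSTERS))` on each abstract would give per-abstract counts that could be aggregated across the corpus. Keeping abbreviations case-sensitive is a design choice to limit false positives; it will miss lowercase abbreviations.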

kennethk · May 17, 2021, 8:31pm

It’s an exercise worth doing, but I was asking the meta-question of whether it had been done yet, or whether the annotation was part of the necessary development. If the abstracts were already annotated, our working group would just attack the focused question of how to express those terms in a common interchange format. As they are not already annotated, we would need to structure them first.

A reasonable metaphor here is that our working group wants to write an instruction manual, but if the dataset isn’t annotated yet, we need to write an instruction manual for a device that hasn’t been built yet.
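To make the “common interchange format” step more concrete, here is a minimal Python sketch of what a single dictionary entry might look like once terms exist, serialized as JSON. The `DictionaryEntry` class and its field names are assumptions for illustration only, not a schema the working group has agreed on; the example term and synonym come from the “(microscope attribute)” cluster in the list above, and the definition is deliberately left empty.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DictionaryEntry:
    """One term in a data dictionary (hypothetical field names)."""
    term: str
    category: str
    definition: str = ""          # to be filled in by domain experts
    synonyms: list = field(default_factory=list)


def to_json(entries):
    """Serialize a list of entries to a JSON string."""
    return json.dumps([asdict(e) for e in entries], indent=2)


entries = [
    DictionaryEntry(
        term="accelerating voltage",
        category="microscope attribute",
        synonyms=["high tension"],
    ),
]

print(to_json(entries))
```

JSON is used here only because it is easy to read and exchange; the same entry could be expressed in other formats.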

kjappelbaum · May 19, 2021, 5:43am

Due to similar concerns, I think Zn–Sn–Ti–O is the better candidate to start with. The phonons also look good to me.

dwinston · May 19, 2021, 8:46pm

Okay, I think we’ve got three good candidate datasets available, and one on deck.

My summary / elaboration of the above:

- Zn–Sn–Ti–O thin films
  - experiments
  - article: https://doi.org/10.1021/acscombsci.8b00158
  - dataset @ catalog.data.gov
  - on board: Brian, Zachary
- M&M abstracts
  - annotations / meta-analysis
  - ~20 years of abstracts from the annual Microscopy and Microanalysis conference. PDF format, ~13 GB. Hosted on a file server accessible to June Lau.
  - can subset and formalize matscholar named entities, i.e. (a subset of) those found in the M&M corpus.
    - article: https://doi.org/10.1021/acs.jcim.9b00470
    - entities dataset: https://doi.org/10.6084/m9.figshare.8184413 (nearly 10M named entities)
    - entity normalizations: https://doi.org/10.6084/m9.figshare.8184365 (should reduce the above number)
  - on board: June, Stuart
- phonon database
  - calculations
  - dataset: https://doi.org/10.6084/m9.figshare.c.3938023.v1
  - article: https://doi.org/10.1038/sdata.2018.65
  - on board: Gian-Marco, Kevin
- (backup / on deck) microstructure calculations
  - data should be online in ~2 weeks
  - on board: Olga

I think this is a good set, spanning datasets derived from experiments, annotations, and calculations. I am on board with all of them – I may start with the M&M abstracts work in order to bring the matscholar named entities online, to be sure we can annotate the abstracts readily via the matscholar API and ease Ken’s valid concern.

I suggest we start three new topics on this discussion board, one for each of the identified primary datasets, in order to isolate their particular discussions.

ml-evs · May 20, 2021, 8:49am

I’m most interested in the thin film dataset; it already looks like an exemplary dataset in the field given the developing connections with HTEM and FAIR digital objects (I think due to Zach!).

We will have attendees representing HTEM at the upcoming OPTIMADE workshop (free to register and attend…) who can provide some context too.

rar · May 21, 2021, 6:23am

I’m on board for the phonon database dataset. (The Zn-Sn-Ti-O thin films would also be ok if needed to distribute people more evenly).