Data dict for M&M abstracts

Many of us met on 2021-06-02 re: the Microscopy and Microanalysis (M&M) conference abstracts dataset.

@june.lau and @briandecost are going to work with their NIST summer student to:

  1. Convert their directory of PDF abstracts to (ID, plain text) pairs, via tesseract-ocr or some alternative tool. IDs would ideally be DOIs.
  2. Use LBNLP to extract named entities via the pretrained matscholar NER model (thanks to @jdagdelen @ardunn et al.!).
  3. Select a small number (~10-30) of frequently encountered or important terms in the corpus as the initial dictionary terms. There are ~10M matscholar named entities, which we can reduce with their published normalization dataset, but we want to focus on M&M “all-stars” for the WG.
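
For step 3, here is a rough sketch of how the "frequently encountered" shortlist could be pulled out once we have the (abstract → named entities) mapping. This is only an illustration: the `top_terms` helper, the toy IDs, and the lowercase normalization are my assumptions, not anything from LBNLP or the matscholar normalization dataset.

```python
from collections import Counter


def top_terms(entities_by_abstract, n=20, normalize=str.lower):
    """Return the n entities that appear in the most abstracts.

    entities_by_abstract: dict mapping an abstract ID to a list of entity strings.
    normalize: hypothetical stand-in for a real normalization step.
    """
    doc_freq = Counter()
    for entities in entities_by_abstract.values():
        # Count each entity at most once per abstract (document frequency),
        # so one abstract repeating a term does not dominate the ranking.
        doc_freq.update({normalize(e) for e in entities})
    return doc_freq.most_common(n)


# Toy data with made-up IDs, for illustration only.
abstracts = {
    "id-1": ["TEM", "graphene", "EELS"],
    "id-2": ["tem", "EELS", "silicon"],
    "id-3": ["TEM", "STEM"],
}

print(top_terms(abstracts, n=2))
# [('tem', 3), ('eels', 2)]
```

Ranking by document frequency is just one reasonable choice; the "important" terms would still need a human pass.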

Once we have the (abstract → named entities) dataset, we can focus on developing the formal data dictionary. The actual approach is TBD, but I suspect we will use semantic web tech: vocabularies like DCMI and SKOS for informal annotation properties, and RDFS/OWL for machine-actionable properties.
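
To make the SKOS idea a bit more concrete, here is a minimal sketch of what a single dictionary entry might look like when written out as Turtle. Since the approach is still TBD, treat everything here as an assumption: the `example.org` IRI, the helper names, and the choice of `skos:prefLabel` / `skos:altLabel` / `skos:definition` are placeholders. It uses plain string formatting rather than an RDF library, to keep it dependency-free.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def turtle_literal(text, lang="en"):
    """Format text as a language-tagged Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"@{lang}'


def skos_concept(iri, pref_label, alt_labels=(), definition=None):
    """Serialize one term as a skos:Concept in Turtle (hypothetical helper)."""
    lines = [
        f"<{iri}> a skos:Concept",
        f"    skos:prefLabel {turtle_literal(pref_label)}",
    ]
    lines += [f"    skos:altLabel {turtle_literal(a)}" for a in alt_labels]
    if definition:
        lines.append(f"    skos:definition {turtle_literal(definition)}")
    # Predicate-object pairs are separated by ';' and the statement ends with '.'
    return " ;\n".join(lines) + " ."


# Placeholder IRI under example.org; the real namespace is undecided.
entry = skos_concept(
    "https://example.org/mm-dict/eels",
    "electron energy loss spectroscopy",
    alt_labels=["EELS"],
)
print(SKOS_PREFIX)
print()
print(entry)
```

Machine-actionable properties (RDFS/OWL) would be layered on top of annotations like these.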

The next meeting of the abstracts subgroup is 2021-06-16.

We met again today (2021-06-16). @Jamie_McCusker agreed to be technical lead for this subgroup wrt data dictionary implementation - yay!

I got lbnlp working with the matscholar_2020v1 package’s NER model as follows:

# in shell
git clone
cd lbnlp
# python 3.6 or 3.7 needed for tensorflow==1.15.0
# python 3.6 classifier in lbnlp/, so using 3.6
conda create -n lbnlp python=3.6
conda activate lbnlp
# quote the extras so shells like zsh don't treat [ner] as a glob pattern
pip install -e ".[ner]"
pip install jupyter
python -m ipykernel install --name lbnlp
jupyter notebook
# open new notebook, set to use lbnlp kernel

Then, I followed the instructions at

The ner_model = load("ner") part took a few minutes, and there were many tensorflow deprecation warnings, but no module requirement errors.

When I ran ner_model.tag_doc(doc), I got a helpful error message suggesting I run cde data download in a shell. I ran this in the notebook as !cde data download. Tagging then worked.

The tags use the “inside-outside-beginning” (IOB) tagging scheme for multi-token entities (see the Wikipedia article “Inside–outside–beginning (tagging)”). B- marks the first token of an entity, I- marks a token inside the same entity, and O marks a token outside any entity. So, for example, B-MAT followed by I-MAT means it’s a 2-token MAT entity. Thanks @ardunn for clarifying this for me!
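
In case it helps anyone else, here is a small sketch of collapsing IOB tags back into whole entities. I'm assuming the tagger output for one sentence can be treated as a list of (token, tag) pairs; the `iob_to_entities` helper and the example sentence are made up for illustration and are not part of lbnlp.

```python
def iob_to_entities(tagged_tokens):
    """Group (token, tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in tagged_tokens:
        prefix, _, etype = tag.partition("-")  # "B-MAT" -> ("B", "-", "MAT")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # B- starts a new entity; a stray I- of a different type
            # is treated as a beginning rather than dropped.
            flush()
            current_tokens, current_type = [token], etype
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Hand-written example tags, not real model output.
tagged = [
    ("Thin", "O"), ("films", "O"), ("of", "O"),
    ("gallium", "B-MAT"), ("nitride", "I-MAT"),
    ("were", "O"), ("imaged", "O"), ("by", "O"),
    ("TEM", "B-CMT"),
]
print(iob_to_entities(tagged))
# [('gallium nitride', 'MAT'), ('TEM', 'CMT')]
```

Joining tokens with a space is a simplification; it will not exactly reproduce the original text around punctuation or hyphens.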