Data dictionary for M&M abstracts

Many of us met on 2021-06-02 re: the Microscopy and Microanalysis (M&M) conference abstracts dataset.

@june.lau and @briandecost are going to work with their NIST summer student to

  1. Convert their directory of PDF abstracts to (ID, plain text) pairs, via tesseract-ocr or some alternative tool (a rough sketch follows this list); IDs would ideally be DOIs.
  2. Use LBNLP to extract named entities via the pretrained matscholar NER model (thanks to @jdagdelen @ardunn et al.!).
  3. Select a small number of frequently encountered / important terms (~10-30) in the corpus for initial dictionary terms (there are ~10M matscholar named entities, which we can reduce with their published normalization dataset, but we want to focus on M&M “all-stars” for the WG).
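
For step 1, here is a minimal sketch of one way to do the PDF -> plain-text conversion; pytesseract/pdf2image, the directory layout, and using filename stems as stand-in IDs are all assumptions on my part:

# Hypothetical step-1 sketch: OCR a directory of PDF abstracts into
# (ID, plain text) pairs. Assumes pytesseract + pdf2image (poppler);
# filename stems stand in for IDs until we can map them to DOIs.
from pathlib import Path

import pytesseract
from pdf2image import convert_from_path

def pdf_to_text(pdf_path):
    pages = convert_from_path(str(pdf_path))  # rasterize each page
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

records = [(p.stem, pdf_to_text(p)) for p in sorted(Path("abstracts/").glob("*.pdf"))]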

Once we have the (abstract → named entities) dataset, we can focus on developing the formal data dictionary. Our actual approach for this is TBD, but I suspect we will use semantic web tech and vocabulary systems like DCMI, SKOS, etc. for informal annotation properties and RDFS/OWL for machine-actionable properties.
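
As a toy example of where that could go, here is a sketch of a single dictionary entry as a SKOS concept via rdflib; the namespace URI, the term, and its definition are placeholders, not actual design decisions:

# Hypothetical data-dictionary entry modeled as a SKOS concept.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

MM = Namespace("https://example.org/mm-dict/")  # placeholder namespace

g = Graph()
g.bind("skos", SKOS)
term = MM["electron_tomography"]  # made-up example term
g.add((term, RDF.type, SKOS.Concept))
g.add((term, SKOS.prefLabel, Literal("electron tomography", lang="en")))
g.add((term, SKOS.definition, Literal("3D reconstruction from a TEM tilt series.")))

print(g.serialize(format="turtle"))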

The next meeting of the abstracts subgroup is 2021-06-16.

We met again today (2021-06-16). @Jamie_McCusker agreed to be technical lead for this subgroup wrt data dictionary implementation - yay!

I got lbnlp working with the matscholar_2020v1 package’s ner model as follows:

# in shell
git clone https://github.com/lbnlp/lbnlp/
cd lbnlp
# python 3.6 or 3.7 needed for tensorflow==1.15.0;
# lbnlp/setup.py declares a python 3.6 classifier, so using 3.6
conda create -n lbnlp python=3.6
conda activate lbnlp
pip install -e ".[ner]"  # extras quoted so zsh doesn't try to glob them
pip install jupyter
python -m ipykernel install --name lbnlp
jupyter notebook
# open new notebook, set to use lbnlp kernel

Then, I followed the instructions at https://lbnlp.github.io/lbnlp/pretrained/.

The ner_model = load("ner") step took a few minutes, and there were many tensorflow deprecation warnings, but no missing-module errors.

When I ran ner_model.tag_doc(doc), I got a helpful error message suggesting I run cde data download in a shell. I ran this in the notebook as !cde data download. Tagging then worked.
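
Putting that together, the notebook steps boil down to something like this (the import path is from the pretrained-models page linked above; the example doc string is just mine):

from lbnlp.models.load.matscholar_2020v1 import load

ner_model = load("ner")  # takes a few minutes; expect TF deprecation warnings

doc = "We imaged Fe2O3 nanoparticles using aberration-corrected STEM."
tags = ner_model.tag_doc(doc)  # if this errors, run: !cde data download
print(tags)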

The tags use the so-called “inside-outside-beginning” (IOB) tagging scheme (https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) for multi-token entities: B- means beginning, I- means inside, and O means outside. So, for example, B-MAT followed by I-MAT means it’s a 2-token MAT entity. Thanks @ardunn for clarifying this for me!
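
For example, assuming tag_doc returns (token, tag) pairs per sentence (as the matscholar output suggests), a small helper can merge B-/I- runs back into whole entities:

# Merge IOB-tagged tokens into (entity, category) pairs.
# Input format assumed: (token, tag) tuples, e.g. ("gallium", "B-MAT").
def merge_iob(tagged_tokens):
    entities, current, cat = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):  # a new entity begins
            if current:
                entities.append((" ".join(current), cat))
            current, cat = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continue the current entity
            current.append(token)
        else:  # "O" (or a stray I-): close out any open entity
            if current:
                entities.append((" ".join(current), cat))
            current, cat = [], None
    if current:
        entities.append((" ".join(current), cat))
    return entities

print(merge_iob([("gallium", "B-MAT"), ("nitride", "I-MAT"), ("films", "O")]))
# -> [('gallium nitride', 'MAT')]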

Here are gzipped JSON-Lines files (one JSON document per line) for each of abstracts, sentences, and taggings, including taggings normalized according to the published normalizations (a short snippet for reading them follows the list):

  • abstracts (~13.8k): ark:57802/md1snr3c886
    • (doi: str, n_sents: int) # DOI, number of sentences
  • sentences (~373k): ark:57802/md1rc4y7a15
    • (doi: str, idx_s: int, raw_s: str) # DOI, index of sentence in abstract (0-based), “raw” sentence (imperfect pdf->txt conversion)
  • taggings (~1M): ark:57802/md19sjz1c79
    • (doi: str, idx_s: int, ne: str, cat: str) # DOI, index of sentence in abstract (0-based), named entity, named-entity category (from matscholar)
  • taggings_normalized (~1M): ark:57802/md1nytzhs82
    • (doi: str, idx_s: int, ne: str, cat: str) # DOI, index of sentence in abstract (0-based), named entity (normalized via published normalizations), named-entity category (from matscholar)

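For reference, each file can be read like so ("taggings_normalized.jsonl.gz" is a placeholder filename; resolve the ark identifiers above to fetch the actual files):

# Read a gzipped JSON-Lines file: one JSON document per line.
import gzip
import json

with gzip.open("taggings_normalized.jsonl.gz", "rt", encoding="utf-8") as f:
    taggings = [json.loads(line) for line in f]

print(taggings[0])  # e.g. {"doi": ..., "idx_s": 0, "ne": ..., "cat": "MAT"}
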
This addresses step 2 above (“Use LBNLP to extract named entities via the pretrained matscholar NER model”). Thanks to @june.lau’s summer student for performing step 1 (PDFs -> text).

We now want to select a small set of frequently encountered / important / interesting terms for a data dictionary.
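
As a first cut at that selection, ranking the normalized taggings by document frequency (reusing taggings as loaded in the snippet above) should surface candidates:

from collections import Counter

# Count each (abstract, entity) pair once so within-abstract repeats
# don't dominate the ranking.
seen = {(t["doi"], t["ne"]) for t in taggings}
doc_freq = Counter(ne for _, ne in seen)

for ne, n in doc_freq.most_common(30):  # candidate "all-star" terms
    print(ne, n)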