Many of us met on 2021-06-02 re: the Microscopy and Microanalysis (M&M) conference abstracts dataset.
@june.lau and @briandecost are going to work with their NIST summer student to
- Convert their directory of PDF abstracts to (ID, plain text) pairs via tesseract-ocr or an alternative tool; IDs would ideally be DOIs.
- Use LBNLP to extract named entities via the pretrained matscholar NER model (thanks to @jdagdelen, @ardunn, et al.!).
- Select a small number (~10-30) of frequently encountered / important terms in the corpus as initial dictionary terms (there are ~10M matscholar named entities, which we can reduce with their published normalization dataset, but we want to focus on M&M "all-stars" for the WG).
Once we have the (abstract → named entities) dataset, we can focus on developing the formal data dictionary. Our actual approach for this is TBD, but I suspect we will use semantic web tech: vocabulary systems like DCMI and SKOS for informal annotation properties, and RDFS/OWL for machine-actionable properties.
The next meeting of the abstracts subgroup is 2021-06-16.
We met again today (2021-06-16). @Jamie_McCusker agreed to be technical lead for this subgroup wrt data dictionary implementation - yay!
I got lbnlp working with the matscholar_2020v1 package's `ner` model as follows:
```shell
# in a shell
git clone https://github.com/lbnlp/lbnlp/
cd lbnlp
# python 3.6 or 3.7 is needed for tensorflow==1.15.0;
# lbnlp/setup.py lists a python 3.6 classifier, so using 3.6
conda create -n lbnlp python=3.6
conda activate lbnlp
pip install -e ".[ner]"   # quotes guard against shell glob expansion of [ner]
pip install jupyter
python -m ipykernel install --name lbnlp
jupyter notebook
# in the browser: open a new notebook and set it to use the lbnlp kernel
```
Then I followed the instructions at https://lbnlp.github.io/lbnlp/pretrained/. The `ner_model = load("ner")` step took a few minutes and emitted many tensorflow deprecation warnings, but there were no module-requirement errors. When I ran `ner_model.tag_doc(doc)`, I got a helpful error message suggesting I run `cde data download` in a shell; I ran it in the notebook as `!cde data download`. Tagging then worked.
The tags use the "inside-outside-beginning" (IOB) tagging scheme for multi-token entities (see the Wikipedia article "Inside–outside–beginning (tagging)"): `B-` marks the beginning of an entity, `I-` marks a continuation inside it, and `O` marks tokens outside any entity. So, for example, `B-MAT` followed by `I-MAT` is a single two-token MAT entity. Thanks @ardunn for clarifying this for me!
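For downstream work it's handy to collapse the IOB tags back into whole entities. Here is a minimal sketch; the `(token, tag)` pair format is my assumption for illustration, so check what `tag_doc` actually returns:

```python
def collapse_iob(tagged_tokens):
    """Group (token, iob_tag) pairs into (entity_text, category) spans.

    B-XXX starts a new entity, I-XXX continues it, O is outside any entity.
    """
    entities = []
    current_tokens, current_cat = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current_tokens:  # close any open entity first
                entities.append((" ".join(current_tokens), current_cat))
            current_tokens, current_cat = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens:
            current_tokens.append(token)
        else:  # "O", or a stray I- with no open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_cat))
            current_tokens, current_cat = [], None
    if current_tokens:  # flush the last entity
        entities.append((" ".join(current_tokens), current_cat))
    return entities

# hypothetical tagging of "thin film of TiO2":
tagged = [("thin", "B-DSC"), ("film", "I-DSC"), ("of", "O"), ("TiO2", "B-MAT")]
print(collapse_iob(tagged))  # [('thin film', 'DSC'), ('TiO2', 'MAT')]
```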
Here are gzipped JSON-Lines files (one JSON document per line) for each of abstracts, sentences, and taggings, including taggings normalized according to the published normalizations:
- abstracts (~13.8k): ark:57802/md1snr3c886
  - `(doi: str, n_sents: int)` # DOI, number of sentences
- sentences (~373k): ark:57802/md1rc4y7a15
  - `(doi: str, idx_s: int, raw_s: str)` # DOI, index of sentence in abstract (0-based), "raw" sentence (imperfect pdf→txt conversion)
- taggings (~1M): ark:57802/md19sjz1c79
  - `(doi: str, idx_s: int, ne: str, cat: str)` # DOI, index of sentence in abstract (0-based), named entity, named-entity category (from matscholar)
- taggings_normalized (~1M): ark:57802/md1nytzhs82
  - `(doi: str, idx_s: int, ne: str, cat: str)` # same fields as taggings, with the named entity normalized via the published normalizations
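Since sentences and taggings share the `(doi, idx_s)` key, the files can be joined back together sentence-by-sentence. A stdlib-only sketch (the local filenames in the comments are placeholders, not the ARK-resolved names):

```python
import gzip
import json

def read_jsonl_gz(path):
    """Yield one dict per line from a gzipped JSON-Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def join_sentences_taggings(sentences, taggings):
    """Attach each tagging to its source sentence via the (doi, idx_s) key."""
    by_key = {}
    for s in sentences:
        by_key[(s["doi"], s["idx_s"])] = dict(s, tags=[])
    for t in taggings:
        key = (t["doi"], t["idx_s"])
        if key in by_key:
            by_key[key]["tags"].append((t["ne"], t["cat"]))
    return list(by_key.values())

# usage (placeholder filenames):
# joined = join_sentences_taggings(read_jsonl_gz("sentences.jsonl.gz"),
#                                  read_jsonl_gz("taggings.jsonl.gz"))
```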
This addresses #2 above (“Use LBNLP to extract named entities via the pretrained matscholar NER model”). Thanks to @june.lau’s summer student for performing #1 (pdfs->text).
We now want to select a small set of frequently encountered / important / interesting terms for a data dictionary.
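As a starting point for that selection, a simple frequency tally over the taggings_normalized rows surfaces candidate "all-star" terms. A minimal sketch, assuming the `(ne, cat)` fields listed above:

```python
from collections import Counter

def top_terms(taggings, n=20):
    """Rank (named entity, category) pairs by how often they occur."""
    counts = Counter((t["ne"], t["cat"]) for t in taggings)
    return counts.most_common(n)

# hypothetical rows for illustration:
rows = [
    {"ne": "TiO2", "cat": "MAT"},
    {"ne": "TiO2", "cat": "MAT"},
    {"ne": "SEM", "cat": "CMT"},
]
print(top_terms(rows, 1))  # [(('TiO2', 'MAT'), 2)]
```

Frequency alone won't capture "important / interesting", but it gives the subgroup a concrete shortlist to argue about.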