Many of us met on 2021-06-02 to discuss the Microscopy and Microanalysis (M&M) conference abstracts dataset. Next steps:
- Convert their directory of PDF abstracts to (ID, plain text) pairs, via tesseract-ocr or an alternative tool. IDs would ideally be DOIs.
- Use LBNLP to extract named entities via the pretrained matscholar NER model (thanks to @jdagdelen @ardunn et al.!).
- Select a small number (~10-30) of frequently encountered or important terms in the corpus as initial dictionary terms. There are ~10M matscholar named entities, which we can reduce with their published normalization dataset, but we want to focus on M&M “all-stars” for the WG.
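
The term-selection step could be sketched roughly as follows, assuming the NER output has already been collected into a mapping from abstract ID to a list of entity strings. The function name `top_terms`, the sample IDs, and the sample entities are all hypothetical, and lowercasing is only a crude stand-in for the matscholar normalization dataset. Terms are ranked by the number of abstracts that mention them, so a term repeated many times in a single abstract does not dominate.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_terms(
    entities_by_abstract: Dict[str, Iterable[str]],
    n: int = 20,
) -> List[Tuple[str, int]]:
    """Rank entities by the number of abstracts that mention them."""
    doc_freq: Counter = Counter()
    for entities in entities_by_abstract.values():
        # Lowercase and de-duplicate within one abstract, so each
        # abstract contributes at most one count per term.
        doc_freq.update({e.strip().lower() for e in entities if e.strip()})
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Hypothetical example data: these IDs and entities are made up.
sample = {
    "abstract-001": ["STEM", "EELS", "graphene", "stem"],
    "abstract-002": ["EELS", "TEM"],
    "abstract-003": ["STEM", "EELS"],
}
top = top_terms(sample, n=2)
print(top)
```

Frequency alone will not pick out every "important" term, so a ranked list like this is probably best treated as a shortlist for the WG to review by hand.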
Once we have the (abstract → named entities) dataset, we can focus on developing the formal data dictionary. Our actual approach for this is TBD, but I suspect we will use semantic web tech: vocabulary systems like DCMI, SKOS, etc. for informal annotation properties, and RDFS/OWL for machine-actionable properties.
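
Since the approach is still TBD, the following is only an illustration of what a single dictionary entry might look like if we go the SKOS route. It renders one term as a `skos:Concept` in Turtle using plain string formatting. The `mm:` namespace URI, the function names, and the placeholder definition are assumptions made for the example; a real implementation would more likely use an RDF library than hand-built strings.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "https://example.org/mm-dictionary/"  # hypothetical namespace


def turtle_literal(text: str) -> str:
    """Quote a string as an English-tagged Turtle literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"@en'


def concept_to_turtle(slug, pref_label, definition, alt_labels=()):
    """Render one dictionary term as a skos:Concept in Turtle.

    Assumes `slug` is already a valid Turtle local name
    (e.g. lowercase letters, digits, and hyphens).
    """
    lines = [
        f"@prefix skos: <{SKOS}> .",
        f"@prefix mm: <{BASE}> .",
        "",
        f"mm:{slug} a skos:Concept ;",
        f"    skos:prefLabel {turtle_literal(pref_label)} ;",
    ]
    for alt in alt_labels:
        lines.append(f"    skos:altLabel {turtle_literal(alt)} ;")
    lines.append(f"    skos:definition {turtle_literal(definition)} .")
    return "\n".join(lines)


ttl = concept_to_turtle(
    slug="eels",
    pref_label="electron energy loss spectroscopy",
    definition="Placeholder definition, to be written by the WG.",
    alt_labels=["EELS"],
)
print(ttl)
```

Here the SKOS label and definition properties play the role of the informal annotations; any machine-actionable RDFS/OWL properties would be layered on separately.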
The next meeting of the abstracts subgroup is 2021-06-16.