I am a professor at a university in Japan and would like to suggest some matminer-based optimization problems to the undergraduate students in my laboratory. The students will have about a month to work on the project, maybe less. They already know Python and the basics of matminer. Can anyone suggest good targets among the databases available to matminer? There are four students, so I need to find four problems. I can probably take a shot in the dark and come up with something myself, but if more experienced users (developers) out there have suggestions, I would be most grateful for your ideas.
Dear Paul,
I just brainstormed some quick ideas.
(1.) Taking training data (structures <-> properties) from several data providers, program a simple random forest regression predictor for one property (e.g. enthalpy of formation) that takes an arbitrary crystalline structure as input. Estimate the prediction quality (MAE, R2 score) on a validation dataset selected in advance. Then compare the training data from the different data providers and select the provider whose data are best suited for predicting the enthalpy of formation.
(2.) Using the MPDS data provider, take some physical property of the binary compounds for which values are available from all three of (i) first principles, (ii) machine learning, and (iii) peer-reviewed experimental literature (see Peer-reviewed vs. machine-learning vs. ab initio data | MPDS). Search for trends in these data and compare how these trends are manifested across the different data harvesting approaches (i-iii).
(3.) Take a reasonably large set of crystalline structures, then calculate and compare different descriptors for them. Based on those descriptors, assign the crystalline structures to prototypes (e.g. A1, A2, D01 in Strukturbericht notation), atomic environment types (e.g. octahedral, pyramidal, square-prism-cube) and classes (e.g. spinels, perovskites).
(4.) Program a new matminer data retrieval class for the OPTIMADE universal protocol (https://optimade.org). Currently there are >10 providers (see https://providers.optimade.org), and the amount of data offered in a standard format is more than impressive.
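To make idea (4.) a bit more concrete, here is a minimal sketch of what the core of such a retrieval helper might look like. It is only an illustration under stated assumptions: the class name OptimadeRetrievalSketch and its methods are hypothetical and not part of matminer, the base URL is a placeholder to be replaced with a real provider root from https://providers.optimade.org, and the code relies only on the standard OPTIMADE conventions of a /v1/structures endpoint with filter and page_limit query parameters and a JSON response whose "data" entries carry an "attributes" dictionary.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


class OptimadeRetrievalSketch:
    """Hypothetical helper for querying one OPTIMADE provider (not a matminer class)."""

    def __init__(self, base_url):
        # base_url: the provider's OPTIMADE root URL (assumption: supplied by the user).
        self.base_url = base_url.rstrip("/")

    def build_url(self, optimade_filter, page_limit=10):
        # Encode the OPTIMADE filter string and page size as query parameters.
        query = urlencode({"filter": optimade_filter, "page_limit": page_limit})
        return f"{self.base_url}/v1/structures?{query}"

    @staticmethod
    def parse_response(payload):
        # Flatten each entry of the "data" list into a plain dictionary (one table row).
        rows = []
        for entry in payload.get("data", []):
            attrs = entry.get("attributes", {})
            rows.append({
                "id": entry.get("id"),
                "formula": attrs.get("chemical_formula_reduced"),
                "nsites": attrs.get("nsites"),
            })
        return rows

    def get_records(self, optimade_filter, page_limit=10):
        # Performs a real network request; only the first page of results is read.
        with urlopen(self.build_url(optimade_filter, page_limit), timeout=30) as resp:
            return self.parse_response(json.load(resp))


if __name__ == "__main__":
    # Offline demonstration with a placeholder URL and a hand-written sample payload.
    retriever = OptimadeRetrievalSketch("https://example.org/")
    print(retriever.build_url('elements HAS ALL "Si","O"', page_limit=5))
    sample = {"data": [{"id": "x1", "attributes": {"chemical_formula_reduced": "O2Si", "nsites": 3}}]}
    print(OptimadeRetrievalSketch.parse_response(sample))
```

A student version would presumably wrap the rows in a pandas DataFrame, follow the pagination links, and fit the class into matminer's existing data retrieval conventions; those parts are left out here on purpose.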
Happy to inspire you!
Evgeni’s answer is already great.
If you are more interested in the methods development and less interested in developing novel datasets for new research, you could also look at trying to do data mining on the pre-prepared automatminer data sets. There are several data sets along with a “leaderboard” for who has done the best job so far doing machine learning on those data sets:
Dear Blokhin,
Thank you for the suggestions. I was thinking along the lines of something like your first suggestion. The students are electrical engineers and not very familiar with materials science, so I want to motivate them and at the same time teach them that machine learning is quite accessible using well-designed libraries. I am approaching this as a former national laboratory scientist who (just) moved to academia, so I want to be careful not to overwhelm the students, but at the same time give them a chance to grow and try new things.
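For anyone following the thread, the validation step of suggestion (1.) could be sketched roughly as below. This is only an illustration under stated assumptions: the function names mae, r2 and rank_providers are hypothetical, the provider names and numbers in the demo are placeholders rather than real data, and the predictions are assumed to come from models (for example scikit-learn's RandomForestRegressor) trained separately on each provider's data. In practice, sklearn.metrics already offers mean_absolute_error and r2_score; the metrics are written out here in plain Python so that the definitions are visible.

```python
def mae(y_true, y_pred):
    # Mean absolute error: average of |true - predicted| over the validation set.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    # Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rank_providers(y_valid, predictions_by_provider):
    # Score every provider's predictions on the same validation set,
    # then sort so the provider with the lowest MAE comes first.
    scores = {
        name: {"mae": mae(y_valid, preds), "r2": r2(y_valid, preds)}
        for name, preds in predictions_by_provider.items()
    }
    return sorted(scores.items(), key=lambda item: item[1]["mae"])


if __name__ == "__main__":
    # Placeholder numbers for illustration only (not real formation enthalpies).
    y_valid = [1.0, 2.0, 3.0]
    predictions = {
        "provider_a": [1.0, 2.0, 3.0],
        "provider_b": [2.0, 2.0, 2.0],
    }
    for name, score in rank_providers(y_valid, predictions):
        print(name, score)
```

The important design point is that every provider is judged on the same validation set, chosen before any training, so that the scores are directly comparable.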
Dear Anubhav,
Thank you for your ideas. I think I may follow your suggestion next time around. For now I am trying to put together introductory materials so that fourth-year students with EE backgrounds can follow things. I think I have more homework to do on how to present a good introduction to the students, as I am new to machine learning as well (up to now I was a senior materials scientist at a national lab here in Japan, doing a combination of synchrotron-based materials analysis [XAFS, HAXPES, etc.] and first-principles calculations). I have been using pymatgen for a while, but I need to study machine learning more myself. The Materials Project tutorials have been very helpful in this regard. It would be nice to attend the workshop this year as well, but being on the other side of the globe might make it difficult to attend in real time.
Best wishes,
Paul Fons
@111457 if you are interested in the most up-to-date website for these datasets, you can find it here: https://hackingmaterials.lbl.gov/matbench