Recording and questions for Jakoah Brgoch, "Finding Superhard Materials through Machine Learning"

mkhorton · July 9, 2021, 6:52pm

Speaker

Jakoah Brgoch, Associate Professor, University of Houston

Date

Monday July 12th, 10am (USA/Pacific)

Abstract

Superhard materials with a Vickers hardness above 40 GPa are essential in applications ranging from manufacturing to energy production. Finding new superhard materials has traditionally been guided by empirical design rules derived from classically known materials. However, the ability to quantitatively predict hardness remains a significant barrier in materials design. To address this challenge, we constructed an ensemble machine-learning model capable of directly predicting load-dependent hardness. The predictive power of our model was validated on eight unmeasured metal disilicides and a hold-out set of superhard materials. The trained model was then used to screen compounds in Pearson’s Crystal Data (PCD) set and combined with our recently developed machine-learning phase diagram tool to suggest previously unreported superhard compounds. Finally, industrial materials often experience tremendous heat during application; thus, we are building a method for predicting hardness at elevated temperatures.

Recording

A recording of this seminar is available here.

Questions

If you are unable to ask questions live, please feel welcome to ask any questions following the talk here and we will ask the speaker to check afterwards. Whether they will be able to answer questions or not depends on the speaker’s availability.

mkhorton · July 12, 2021, 6:18pm

Questions answered live

Questions are numbered according to the order they came in. Only questions relevant to this talk specifically are shown.

Is a high melting point one of the criteria for superhard materials?
How transferable are these “handcrafted” features that you’re using for this problem? It seems like you need a new set of features for every problem you’re working on, which can be time-consuming.
Crystalline vs. amorphous materials for hardness. What does your experience tell you regarding additional opportunities for discovery/existence of superhard, amorphous materials? Is crystallinity a prerequisite for superhard materials?
Aren’t the data sets from the Materials Project all calculated at 0 K? Temperature affects hardness. Diamond starts to be noticeably softer at 1000 °C.
Can you talk a little bit more about the feature reduction?
You briefly discussed how it can be tricky to interpret the importance score assigned to individual properties, e.g. electron density and valence electron density showing up as important for B or G respectively. Is it possible that those properties are highly correlated, and your feature selection procedure will choose one or the other based on which one has a (perhaps only slightly) stronger relationship with your target value?
If we are able to synthesize a material after predicting it via ML, why isn’t it being used in industry, in preference to the less hard materials?

Additional questions asked

Our apologies to all who asked questions that we did not have time to address during the talk.

For thin-film c-BN, what force should we apply to measure the hardness?
Can you predict hardness properties based on microstructure formation as multiple phases are formed upon sintering?
What sort of challenges come with applying ML to the discovery of materials that could have a huge impact on our society? For example, searching for new materials with high conversion of solar energy to electrical energy?
You specifically sanitized to remove the theoretical compounds from the MP database. Do you think that your model would perform poorly using the theoretical compounds as a test set? It would be interesting to see how it performs.
Reference slide 24 screening… Where does cubic C3N4 or BC2N fit on the graph?
On slide 24, what is the difference between Al2O3 in the top right and bottom right?
Is there a bias in the selection of materials where MP has run elastic tensor calculations? For example in how chemistries are selected or the size of the unit cells?
Did you consider learning/predicting G^3/B^2 directly?
What do you think about amorphous carbon obtained from fullerenes? Is it superhard? Can it be harder than diamond?
If we want to predict the gravimetric capacity of a battery based on known capacity data, which is very heterogeneous in nature, what range of RMSE is acceptable? For example, is an RMSE of around 40 to 45 acceptable? (Also, the training data set is very small.)
The input for the SVM model includes features like cohesive energy; how are these features obtained, from a materials database like MP or calculated from scratch using DFT?
Rule of thumb for “number of samples” vs. “number of predictors” for a good, quantitative supervised ML model?
How much experimental data did you train on? And how much computational data did you train on?
Also weighting the training cost to achieve higher accuracies in the higher hardness regions?
Is there any important correlation between any tolerance factor and the hardness of the material?
Great talk. More of a technical question. How do you deal with the different number of input parameters when you have binary - ternary materials?
Structural defects (point, line) can play a big role for material screening. Any thoughts on that?

Jakoah_Brgoch · July 27, 2021, 3:02am

Thanks, everyone, for attending the talk and asking some awesome questions. I am sorry it took a bit of time to get responses. I am happy to provide the best insight/answer I can. If you have any additional questions or want me to expand, feel free to email me ([email protected]).

For thin-film c-BN, what force should we apply to measure the hardness?

The applied force in a hardness measurement is a bit arbitrary. At low applied load, you are probing the material’s intrinsic properties more so than the bulk properties. These measurements are also more challenging because the indentation imprint is tiny. The hardness will be lower at higher applied load, but it is easier to measure because the imprint will be larger. A higher-load measurement is also more representative of the full material, including local effects like bonding and global effects like microstructure. The best bet is to measure the hardness at a range of loads and report the load-hardness curve.
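As a rough illustration of what reporting a load-hardness curve involves, the sketch below converts (load, indent diagonal) pairs into Vickers hardness using the standard indenter geometry (136° face angle). The function names and the sample numbers are hypothetical placeholders, not data from the talk.

```python
import math


def vickers_hardness_gpa(load_n: float, diagonal_um: float) -> float:
    """Vickers hardness in GPa from the load (N) and mean indent diagonal (micrometres)."""
    # HV = 2 * sin(136 deg / 2) * F / d^2; the geometric constant is about 1.8544.
    geometry = 2.0 * math.sin(math.radians(136.0 / 2.0))
    pressure_pa = geometry * load_n / (diagonal_um * 1e-6) ** 2
    return pressure_pa / 1e9


def load_hardness_curve(measurements):
    """Return (load, hardness) pairs sorted by increasing load."""
    return [(load, vickers_hardness_gpa(load, diag)) for load, diag in sorted(measurements)]


# Placeholder (load in N, diagonal in micrometres) pairs, invented for illustration only.
example = [(0.49, 4.9), (4.9, 17.5), (0.98, 7.2)]
for load, hardness in load_hardness_curve(example):
    print(f"{load:5.2f} N -> {hardness:5.1f} GPa")
```

Tabulating hardness against load in this way makes the load dependence explicit, instead of quoting a single number measured at one arbitrary load.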

Can you predict hardness properties based on microstructure formation as multiple phases are formed upon sintering?

In principle, it should be possible to make the hardness predictions with different microstructures. Unfortunately, we do not have access to sufficient training data based on the hardness-microstructure relationship, so this is not something we have tested yet. However, this is high on our “to-do” list. Thanks for asking!

What sort of challenges come with applying ML to the discovery of materials that could have a huge impact on our society? For example, searching for new materials with high conversion of solar energy to electrical energy?

The possible applications of ML for materials discovery are endless: as long as training data are available, ML is applicable. The key thing to remember is that asking the right questions about your data is vital. Try not to extrapolate, because that is not a strength of ML. Additionally, ML is just another tool that can be used to advance materials chemistry. When used right, however, it has tremendous potential.

You specifically sanitized to remove the theoretical compounds from the MP database. Do you think that your model would perform poorly using the theoretical compounds as a test set? It would be interesting to see how it performs.

We included the hypothetical compounds as well as the compounds that are experimentally confirmed. This work was unpublished, but overall we found that the performance remained consistent regardless of the training dataset size. This probably stems from the fact that ~3000 data points (our cleaned data set) is already approaching the minimum in terms of trained model statistics. We elected to stay with the experimentally confirmed phases to try to limit any issues with our experimental efforts.

Reference slide 24 screening… Where does cubic C3N4 or BC2N fit on the graph?

I don’t recall exactly where these compositions fall on the plot, but they are certainly in the top right corner. All of our predictions are available in the supporting information of our JACS publication (https://doi.org/10.1021/jacs.8b02717). Check it out, and if you can’t find the compound you are interested in, please shoot me a message and we will make the prediction.

On slide 24, what is the difference between Al2O3 in the top right and bottom right?

Like the above question, I don’t recall its exact position on the screening plot, but I am sure it was near the top right corner. Check out our publication’s SI and please let me know if you would like a specific prediction.

Is there a bias in the selection of materials where MP has run elastic tensor calculations? For example in how chemistries are selected or the size of the unit cells?

This is a great question. Although I am unsure how MP selects the compositions to run their elastic tensor calculations on, analyzing the training dataset shows a significant bias towards smaller crystal structures. We did our best to expand the training set with larger compounds, including by running calculations of our own to supplement the MP data. The data are still skewed, though, which is worth keeping in mind.

Did you consider learning/predicting G^3/B^2 directly?

We did try to predict G^3/B^2 directly, in two ways. In the first, we predicted G and B independently, as covered in the talk, and then used this information to calculate the G^3/B^2 hardness manually. In the second, we first calculated the ratio and then used it as training data. In both cases, we can predict G^3/B^2 with reasonable accuracy. However, we found that this term, although suggested to correlate with experimental hardness, seems to break down readily. G^3/B^2 works well for main group hard materials where covalent bonding dominates (like BN, diamond, Si), but it does not work well for metallic-type materials like WB4, etc. This is why we shifted away from predicting G^3/B^2 directly.
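For readers unfamiliar with the G^3/B^2 term: it appears in the empirical hardness model of Chen et al., H_V = 2(k^2 G)^0.585 - 3 with k = G/B, so that k^2 G = G^3/B^2. The sketch below assumes this is the model the answer refers to; the function name is hypothetical and the example moduli are only approximate, diamond-like values chosen for illustration.

```python
def chen_hardness_gpa(shear_gpa: float, bulk_gpa: float) -> float:
    """Estimate Vickers hardness (GPa) from shear modulus G and bulk modulus B (GPa).

    Uses the empirical relation H_V = 2 * (k^2 * G)^0.585 - 3 with k = G / B,
    where k^2 * G is the same quantity as G^3 / B^2.
    """
    pugh_ratio = shear_gpa / bulk_gpa           # k = G / B
    g3_over_b2 = pugh_ratio ** 2 * shear_gpa    # equals G^3 / B^2
    return 2.0 * g3_over_b2 ** 0.585 - 3.0


# Approximate, diamond-like moduli (assumed values, for illustration only).
print(round(chen_hardness_gpa(shear_gpa=535.0, bulk_gpa=443.0), 1))
```

Because the estimate depends only on G and B, it carries no information about bonding character, which is consistent with the answer's observation that it works for covalent main group solids but breaks down for metallic-type compounds.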

What do you think about amorphous carbon obtained from fullerenes? Is it superhard? Can it be harder than diamond?

These are fascinating materials too. They certainly have the potential to be harder than diamond. The challenge will be making these materials on a large scale. Nevertheless, these types of compounds have a lot of opportunities to move the field forward.

If we want to predict the gravimetric capacity of a battery based on known capacity data, which is very heterogeneous in nature, what range of RMSE is acceptable? For example, is an RMSE of around 40 to 45 acceptable? (Also, the training data set is very small.)

Thanks for the question. Unfortunately, I do not have any background in using ML for battery materials. I am sure someone else at a future MP talk may have a better answer for you.

The input for the SVM model includes features like cohesive energy; how are these features obtained, from a materials database like MP or calculated from scratch using DFT?

These features are all empirical inputs because that makes the model the most transferable. Information like cohesive energy is related to the elemental properties. We could also include features like DFT-calculated properties, but that would limit the screening capabilities to systems where DFT information is known. I am sure this would make the model more accurate, but at a cost. It is all about balancing predictive power vs. transferability.
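One common way to turn tabulated elemental properties into compound-level features is to take composition-weighted statistics. The sketch below is an assumption about the general approach, not the speaker's actual descriptor set; atomic mass stands in for any tabulated elemental property (such as cohesive energy), and all names are hypothetical.

```python
# Standard atomic weights, used here as a stand-in for any elemental property table.
ATOMIC_MASS = {"W": 183.84, "B": 10.81}


def composition_descriptors(composition: dict, elemental_property: dict) -> dict:
    """Build simple descriptors from element counts and one elemental property.

    composition maps element symbols to stoichiometric amounts, e.g. {"W": 1, "B": 4}.
    """
    total = sum(composition.values())
    values = [elemental_property[el] for el in composition]
    # Composition-weighted mean: each element contributes in proportion to its atomic fraction.
    weighted_mean = sum(
        (amount / total) * elemental_property[el] for el, amount in composition.items()
    )
    return {
        "mean": weighted_mean,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


print(composition_descriptors({"W": 1, "B": 4}, ATOMIC_MASS))
```

Descriptors of this kind need only a chemical formula, which is what lets a model screen compounds that have no DFT data.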

Rule of thumb for “number of samples” vs. “number of predictors” for a good, quantitative supervised ML model?

The best rule of thumb that I know of is to have at least 10 training data points for each descriptor. This should reduce concerns about overfitting. However, as with all ML projects, the more data, the better!

How much experimental data did you train on? And how much computational data did you train on?

We trained on 1063 Vickers hardness data
We trained on ~2600 bulk and shear moduli values from MP

Also weighting the training cost to achieve higher accuracies in the higher hardness regions?

Yes, that is the advantage of using an algorithm like XGBoost. Random forest, which does not use boosting, gave poor statistics at high hardness, whereas applying XGB improved the model’s predictive power dramatically.

Is there any important correlation between any tolerance factor and the hardness of the material?

I am not aware of any tolerance factor or similar metric that is correlated with hardness. That is what led us to predict Vickers hardness directly using ML.

Great talk. More of a technical question. How do you deal with the different number of input parameters when you have binary - ternary materials?

Thanks for a great question. It’s the same number of input parameters regardless of how many elements the compound contains. For example, if the compound is AB, the input would look like [0 1/2 1/2], whereas the compound ABC would be represented by [1/3 1/3 1/3]; it is still a three-dimensional vector. I hope that answers your question!
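A minimal sketch of this fixed-length encoding, reading the example literally: the atomic fractions are padded with leading zeros up to a fixed number of slots. The function name and the default of three slots are assumptions taken from the example above, not from the published model.

```python
def fraction_vector(stoichiometry, length=3):
    """Encode element amounts as atomic fractions in a fixed-length, zero-padded vector."""
    if len(stoichiometry) > length:
        raise ValueError("compound has more elements than available slots")
    total = sum(stoichiometry)
    fractions = [amount / total for amount in stoichiometry]
    # Unused slots are filled with zeros so every compound yields the same vector length.
    return [0.0] * (length - len(fractions)) + fractions


print(fraction_vector([1, 1]))     # binary AB
print(fraction_vector([1, 1, 1]))  # ternary ABC
```

Because the vector length is fixed in advance, binary and ternary compounds can be fed to the same model without any change to its input dimension.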

Structural defects (point, line) can play a big role for material screening. Any thoughts on that?

Yes, defects are vital for hardness. As of right now, there are not sufficient data available for relating hardness to point or line defects. We want to pursue this using methods like molecular dynamics, but that is a project for the future.