Some problems that confused me while learning ICET

Hello, everyone. I am using ICET to run MC simulations of a binary (Al-Cu) system. I have trained a CE model iteratively, in the same way as Unicle does. However, I ran into some problems that really confuse me, and I would appreciate any perspectives on them. My questions are as follows:
1. When training a CE model, the NMSD (normalized mean-squared displacement) is an important quantity for deciding whether a structure should be added to the structure container (DOI:10.1103/PhysRevB.96.014107). A higher NMSD means a larger distortion during structure optimization, which may cause the CE model to fail. So at the beginning, I only added structures with small NMSD to the structure container. The iteration process works as follows: use the CE model to predict the whole configuration space; if there are structures with lower energy than those in the training set, add them to the structure container and train the CE model again; repeat until no new lower-energy structures are found. During this process, I found some structures with lower energy but larger NMSD. If I add these structures to the SC, they may harm the CE model. In fact, predictions for structures with larger NMSD have larger errors relative to the DFT reference energies (the error can reach 60 meV), which is reasonable because the clusters in a structure may change during structure optimization if the distortion is large. If I don’t add these structures to the SC, the iteration process never ends, which means some potential ground-state structures are not included in the training set, and this may lead to mistakes.

So is there a method for deciding whether these structures should be added to the SC or not?

2. How can I judge whether a CE model is reasonable or not?
I used the trained CE model to predict the whole configuration space. It looks fine for almost all structures except the two pure-component structures (Al and Cu) and some structures with larger NMSD. For instance, the reference (formation) energies of pure Al and pure Cu are 0, but the predicted energies are 13.95 meV and 14.98 meV, respectively, which is much larger than the training error (3.9 meV) and the validation score (testing error, 4.68 meV). So confusing! As I described in the first question, structures with small NMSD usually correspond to accurate predictions. However, the NMSD values of these two pure-component structures are 0, yet the predictions are not as accurate as usual, which confuses me a lot.

Is this situation normal? If not, could you please give me some advice on how to find the underlying mistakes? (I think the larger error on the two pure-component structures may not affect the MC simulation, since the reference energies of these two structures are almost the highest in the configuration space. Still, it makes me uncomfortable, and I hope someone can offer a more convincing explanation.)

3. The ICET tutorial has a section on how to analyze ECIs (see “Analyzing ECIs” in the icet documentation).
In the source code, there is a method named ‘orbits_as_dataframe’. However, I went through the ClusterExpansion module and found no method with that name.

Could you please tell me where I can find this method?

4. When the number of elements increases, the number of enumerated structures explodes, which makes it infeasible to calculate all of them. Instead, we choose some of the enumerated structures as an initial training set and gradually improve the training set by iteration. Do you think it is necessary to use some principle to select the initial training set? For example, choosing the structures that differ the most according to a structural fingerprint?
From my point of view, cluster expansion can be regarded as a machine-learning (regression) process, so training data that contain more information about the configuration space will be better. An initial training set with the most dissimilar structures should contain more information about the configuration space. Picking out the most dissimilar structures could be implemented with fingerprints.

Have you thought about this question? Is it a necessary step? Or, from your point of view, what would be the ideal method for training a CE model with high reliability?

Many thanks and best wishes, looking forward to your reply!

  1. I don’t think there is exactly a “correct” way of doing this.
    If you’re getting large errors, have you tried increasing the cutoffs and the order of the expansion to see if this helps?
    How well do the CEs trained with only small-NMSD structures predict these low-energy structures? If they also have large errors, then I guess it doesn’t matter too much for the resulting CE whether you include them or not?
    (Note: if structures relax so much that e.g. atoms swap lattice sites, you can remap to a new configuration instead, which can help if this is a problem.)

  2. I think it’s quite common that CEs are less accurate for the “end-point” structures, since the training datasets are usually quite biased towards concentrations around 50%. You only have one training structure at 0% but probably hundreds close to 50%.
    You can add a weight for the 0% and 100% structures in the training to make sure these are better captured by the CE, but this of course comes at the cost of slightly larger errors for other structures.
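
A minimal sketch of what such a weighting does, assuming a plain least-squares fit: scaling each row of the fit matrix and of the target vector by the square root of its weight is equivalent to weighting that structure's squared residual. The two-column fit matrix, the energies, and the weight of 10 are made-up illustration values, and `weighted_least_squares` is a hypothetical helper, not an icet function.

```python
import numpy as np


def weighted_least_squares(A, y, weights):
    """Minimise sum_i w_i * (A_i . x - y_i)**2 by scaling rows with sqrt(w_i)."""
    sqrt_w = np.sqrt(np.asarray(weights, dtype=float))
    A_scaled = np.asarray(A, dtype=float) * sqrt_w[:, None]
    y_scaled = np.asarray(y, dtype=float) * sqrt_w
    x, *_ = np.linalg.lstsq(A_scaled, y_scaled, rcond=None)
    return x


# Hypothetical training set: Cu fraction and formation energy (meV/atom).
conc = np.array([0.0, 0.25, 0.5, 0.5, 0.5, 0.75, 1.0])
y = np.array([0.0, -30.0, -45.0, -40.0, -42.0, -28.0, 0.0])

# Deliberately crude model (constant + point term) so that it cannot fit everything.
sigma = 2.0 * conc - 1.0
A = np.column_stack([np.ones_like(sigma), sigma])

uniform = np.ones(len(y))
emphasised = uniform.copy()
emphasised[[0, -1]] = 10.0  # assumed weight for the 0% and 100% structures

x_uniform = weighted_least_squares(A, y, uniform)
x_weighted = weighted_least_squares(A, y, emphasised)


def endpoint_error(x):
    """Largest absolute residual on the two pure-component rows."""
    residual = A @ x - y
    return np.abs(residual[[0, -1]]).max()


print("end-point error, uniform weights: ", endpoint_error(x_uniform))
print("end-point error, weighted fit:    ", endpoint_error(x_weighted))
```

The weighted fit trades accuracy on the intermediate structures for accuracy on the end points, which is the cost mentioned above. How weights are passed to the actual fitting routine depends on the tool you use and is not shown here.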

Determining if energy errors are too large or not (related to both 1 and 2) is not straightforward.
What do you want to use the CE for?
If you want to e.g. use the CE to predict a phase diagram, then I think you should test computing this phase diagram with different CEs (with different numbers of training structures, different NMSD thresholds, etc.) to understand how these choices influence the final result.
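
As a first check before comparing full phase diagrams, one can look at how the prediction error splits between low-NMSD and high-NMSD structures for a few thresholds. This is only a sketch: the record fields (`nmsd`, `predicted`, `reference`), the numbers, and the thresholds are hypothetical, and the predictions would come from whatever CE you trained.

```python
import math


def rmse(errors):
    """Root-mean-square of a non-empty list of errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_by_nmsd(records, threshold):
    """Return (RMSE for nmsd <= threshold, RMSE for nmsd > threshold).

    A group with no structures yields None instead of an RMSE.
    """
    low = [r["predicted"] - r["reference"] for r in records if r["nmsd"] <= threshold]
    high = [r["predicted"] - r["reference"] for r in records if r["nmsd"] > threshold]
    return (rmse(low) if low else None, rmse(high) if high else None)


# Hypothetical data: energies in meV/atom, NMSD in arbitrary units.
records = [
    {"nmsd": 0.00, "predicted": 3.0, "reference": 0.0},
    {"nmsd": 0.02, "predicted": -54.0, "reference": -50.0},
    {"nmsd": 0.30, "predicted": -20.0, "reference": -80.0},
    {"nmsd": 0.45, "predicted": -110.0, "reference": -70.0},
]

for threshold in (0.01, 0.1, 0.4):
    low, high = rmse_by_nmsd(records, threshold)
    print(f"threshold={threshold}: low-NMSD RMSE={low}, high-NMSD RMSE={high}")
```

If the high-NMSD error stays large no matter which structures are in the training set, that supports the point above that including or excluding them changes little in the resulting CE.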

  3. I think this function has been renamed to to_dataframe, see docs here

  4. Yes, you can select the initial training set in smart ways, e.g. see here, but if you’re going to iterate after this point anyway and the initial training set is small, then I guess it doesn’t matter too much how it is selected?
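
For the fingerprint idea raised in the question, one simple way to pick "the most different" structures is greedy farthest-point sampling on the fingerprint vectors. This is a sketch of that generic technique, not the method behind the link above; the vectors below are made up, and in practice they could be cluster vectors or any other fingerprint you trust.

```python
import math


def farthest_point_sampling(vectors, n_select, start=0):
    """Greedily pick indices so that each new pick is as far as possible
    (Euclidean distance) from everything already selected."""
    selected = [start]
    # Distance from every vector to its nearest selected vector.
    min_dist = [math.dist(v, vectors[start]) for v in vectors]
    while len(selected) < min(n_select, len(vectors)):
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(v, vectors[nxt]))
    return selected


# Hypothetical 2D fingerprints for five candidate structures.
fingerprints = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(farthest_point_sampling(fingerprints, n_select=3))
```

Note that this maximises diversity only; it knows nothing about energies, so it will happily pick high-energy structures as well.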

so training data that contain more information about the configuration space will be better.

I agree, but one may also need to be a bit careful with this. You want the CE to be accurate for the type of structures that will show up in MC simulations. Including lots of high-energy structures that will never show up in MC simulations, and are thus not thermodynamically relevant, may not be the best idea even if they contain lots of information about the configuration space.

what would be the ideal method for training a CE model with high reliability?

I think starting from an initial structure set based on condition number (link above) and then doing some active learning (see e.g. here), where structures with large uncertainties are added to the training set, is a good approach. Then adding some of the ground-state structures to the training set is also likely a good idea.
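
A rough sketch of the selection step in such an active-learning loop, assuming the uncertainty is estimated as the spread of predictions from an ensemble of CEs (for example, CEs fitted to different subsets of the training data). This is one common proxy and not necessarily what the linked example does; the structure names and energies are made up.

```python
import statistics


def rank_by_uncertainty(ensemble_predictions, n_pick):
    """Return the n_pick structure names with the largest prediction spread.

    ensemble_predictions maps a structure name to the list of energies
    predicted for it by the different models in the ensemble.
    """
    spread = {
        name: statistics.pstdev(predictions)
        for name, predictions in ensemble_predictions.items()
    }
    return sorted(spread, key=spread.get, reverse=True)[:n_pick]


# Hypothetical predictions (meV/atom) from three CEs for three candidates.
candidates = {
    "A": [-10.0, -10.5, -9.5],
    "B": [-20.0, -5.0, -12.0],
    "C": [-3.0, -3.1, -2.9],
}
print(rank_by_uncertainty(candidates, n_pick=2))
```

The picked structures would then be computed with DFT, added to the training set, and the ensemble refitted before the next round.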

Hello, dear Erik! I really appreciate your detailed reply, which helped me a lot.
I came across some new problems while working with ICET, and I would really value your professional perspective.
1. I ran an MC simulation of the Al-Cu binary system in the canonical ensemble with the CE model trained before. I found that when the Cu concentration is low (e.g. below 0.05 at%), the result of the MC simulation is ‘bad’. The initial structure is disordered, with the Cu atoms distributed randomly in the Al bulk. After a long annealing process (from 20000K to 300K), the final structure is still disordered, with the Cu atoms still distributed randomly in the Al bulk, which gives no information and makes the MC simulation seem useless. When the concentration is high, however, the result seems to be ‘good’.
What I really want to study is the case of low Cu concentration. Could you give me some advice about this?

  2. I didn’t consider vacancies previously. Now I am going to add vacancies to the structures to study their influence. I enumerated the Al_Cu_X configuration space (12 atoms at most) with the concentration of X restricted to (0.083, 0.25), so each structure in this configuration space contains at least one and at most three vacancies.
    Since Al_Cu + Al_Cu_X0.083-0.25 = Al_Cu_X0-0.25, when I study Al_Cu_X0-0.25 I can choose structures from Al_Cu_X0.083-0.25 and Al_Cu instead.
    Because I have already enumerated the whole configuration space of Al_Cu, I want to make use of it.
    I used the CE model trained on the Al_Cu binary configuration space to predict the energies of the structures in Al_Cu_X0.083-0.25, hoping to select some low-energy structures to add to the training set. However, it didn’t work (errors popped up).
    I am really confused by these errors. I have checked that Al_Cu + Al_Cu_X0.083-0.25 = Al_Cu_X0-0.25,
    so all the structures should be based on the same primitive structure, but the errors seem to say that the structures in Al_Cu_X0.083-0.25 do not originate from the same primitive structure as the Al_Cu binary system. (I deleted the X sites from the structures in Al_Cu_X0.083-0.25 before predicting the energy.)
    Here is the information about the errors:

    Should I include structures with vacancies in the training set first, so that the CE model works for structures with vacancies?
    Looking forward to your reply!
    Best wishes!

Hello, dear Erik! Please ignore the second question. I figured it out. It turns out that deleting X is necessary for the DFT calculation but not for CE training. A CE model for a binary system definitely cannot evaluate a ternary system, even if one of the three species is deleted.
Hope everything goes well! Best wishes!

I don’t know why this would be bad or not give you any information?

For low concentrations, don’t you expect them to distribute randomly due to entropy?

Dear Erik, thank you for your reply!
In fact, I expected them to form clusters or something similar, which is what happens when the concentration is high. But your view does make sense from the perspective of entropy. Maybe when the concentration is low, entropy is the dominant effect.
Or maybe low-concentration systems require more trials to reach a state in which the atoms form clusters. I tried an MC simulation of a low-concentration system with more trials, but the result was still the same. So maybe your view is the reason, or even more trials are required.

At the beginning of my study, what I really wanted to investigate was the case of low Cu concentration. But now I realize that studying realistic concentrations via simulation is really difficult, because no matter how large the simulated system is, it is still small compared with a real material, which contains an almost countless number of atoms. What we simulate can only be seen as a very small part of the real material, which usually does not reflect the actual concentration of the whole material. From this perspective, the concentration I simulate cannot represent a real material with the same concentration. So I think I should change my thinking and look at my study from a new perspective.

By the way, it is really good to communicate with you here. Thank you and thank this community platform.
Best wishes! Hope everything goes well!

Maybe when the concentration is low, entropy is the dominant effect.
Or maybe low-concentration systems require more trials to reach a state in which the atoms form clusters.

I think in general yes, at low concentration (and finite temperature) entropy will push them to distribute randomly. But if you’re talking about T=0, then energy should dominate.
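
To put a rough number on the entropy argument, here is a sketch using the ideal (random) mixing entropy per lattice site, S = -k_B [c ln c + (1 - c) ln(1 - c)]. Dividing T·S by c gives the entropic free-energy gain per solute atom, which any clustering energy gain has to compete with. The ideal-solution form is an assumption, and the concentrations and temperature are example values only.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def ideal_mixing_entropy(c):
    """Ideal configurational entropy per lattice site (eV/K) at solute fraction c."""
    if c <= 0.0 or c >= 1.0:
        return 0.0
    return -K_B * (c * math.log(c) + (1.0 - c) * math.log(1.0 - c))


def entropy_term_per_solute(c, temperature):
    """T*S per solute atom (eV) for an ideal solution at solute fraction c."""
    return temperature * ideal_mixing_entropy(c) / c


for c in (0.0005, 0.005, 0.05):
    gain = entropy_term_per_solute(c, temperature=300.0)
    print(f"c = {c}: T*S per solute atom = {gain:.3f} eV")
```

The value per solute atom grows as the concentration decreases, which is why a dilute random solution can remain stable at finite temperature even when clustering lowers the energy.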

So if the lowest-energy state is really a cluster, then you “should” recover this in the cooling simulation down to T=0. But as you say, this possibly requires lots of sampling; you can e.g. check the acceptance ratio in the MC simulation to gauge how well the sampling is going.
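
To illustrate what the acceptance ratio tells you, here is a toy canonical (swap) Metropolis simulation on a periodic 1D chain. It is a self-contained sketch and does not use icet or mchammer: the chain, the like-pair energy `j_like`, and the temperatures are made-up values chosen so that clustering is favoured.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def chain_energy(chain, j_like):
    """Energy of a periodic chain: j_like (eV) per like nearest-neighbour pair."""
    n = len(chain)
    return sum(j_like for i in range(n) if chain[i] == chain[(i + 1) % n])


def acceptance_ratio(chain, temperature, n_trials, j_like=-0.05, seed=0):
    """Run Cu<->Al swap trials with the Metropolis criterion and return
    the fraction of accepted trials."""
    rng = random.Random(seed)
    chain = list(chain)
    e_old = chain_energy(chain, j_like)
    accepted = 0
    for _ in range(n_trials):
        # Canonical trial move: swap one Cu with one Al (composition is conserved).
        i = rng.choice([s for s, a in enumerate(chain) if a == "Cu"])
        k = rng.choice([s for s, a in enumerate(chain) if a == "Al"])
        chain[i], chain[k] = chain[k], chain[i]
        e_new = chain_energy(chain, j_like)
        d_e = e_new - e_old
        if d_e <= 0 or rng.random() < math.exp(-d_e / (K_B * temperature)):
            e_old = e_new
            accepted += 1
        else:
            chain[i], chain[k] = chain[k], chain[i]  # rejected: undo the swap
    return accepted / n_trials


# Start from a fully clustered configuration: 4 Cu in a row inside 28 Al.
start = ["Cu"] * 4 + ["Al"] * 28
hot = acceptance_ratio(start, temperature=1.0e6, n_trials=2000)
cold = acceptance_ratio(start, temperature=10.0, n_trials=2000)
print(f"acceptance ratio at 1e6 K: {hot:.3f}, at 10 K: {cold:.3f}")
```

A ratio close to zero over a long stretch of an annealing run means the configuration is effectively frozen, so adding trials at that temperature will not help much; the sampling then has to happen at the higher temperatures, with a slower cooling rate.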

Another thing you can try if the cluster configuration is lowest in energy is to do a heating simulation, e.g. start from T=0 with the cluster configuration as starting point, and then slowly heat the system and see when the cluster starts breaking up.

If the (defect) concentration is really small, like less than 0.1%, I’d expect the dilute-limit approximation to work very well, so maybe it doesn’t make sense to break your back trying to simulate these conditions, since analytical approximations work well?

By the way, it is really good to communicate with you here. Thank you and thank this community platform.

Thanks :) Happy that people are trying out our codes!