When a beginner starts training an NEP potential with GPUMD, an effective tool for machine-learned force fields (MLFFs), how can they effectively identify which structures are needed for training?
I’ve come across a few methods. One approach is to check the NEP/DFT training curves to see whether all the points fall close to the diagonal. If some do not, you can add the corresponding structures back into the training set. Another approach is to select more new structures related to your research as test data, compare them with the training data, and then perform PCA to determine whether the test data is already covered by the training data.
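For the PCA comparison, here is a minimal sketch of what I mean. It assumes you have already computed per-structure descriptor vectors (e.g. averaged NEP descriptors) and saved them as NumPy arrays; the file names are hypothetical:

```python
# Project training and test descriptors onto the first two principal
# components of the training set, to see whether the test data falls
# inside the region already covered by training.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

train_desc = np.load('train_descriptors.npy')  # shape (n_train, n_features)
test_desc = np.load('test_descriptors.npy')    # shape (n_test, n_features)

pca = PCA(n_components=2).fit(train_desc)      # fit on training data only
train_2d = pca.transform(train_desc)
test_2d = pca.transform(test_desc)

plt.scatter(train_2d[:, 0], train_2d[:, 1], s=5, alpha=0.5, label='train')
plt.scatter(test_2d[:, 0], test_2d[:, 1], s=5, alpha=0.5, label='test')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.savefig('pca_map.png', dpi=200)
```

Test points that land far outside the training cloud are good candidates for labelling and adding to the training set.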
Selecting appropriate structures is usually the hardest problem when training an NEP, or any MLFF for that matter.
With the NEP/DFT training curves, do you mean parity plots? If so, yes, I agree that this is an excellent tool for identifying which structures are not captured well. I would then look more closely at those structures and check whether something is wrong with them (very large forces, or perhaps model cutoffs that are too short to accurately capture the structure), and if so potentially exclude the structure or change the model hyperparameters.
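A quick sketch of such a parity plot; to my knowledge the force_train.out file written by the nep executable has the predicted force components in the first three columns and the DFT reference in the last three, but please verify against the GPUMD documentation:

```python
# Parity plot of NEP vs DFT force components, plus a simple outlier check.
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('force_train.out')
f_nep = data[:, :3].ravel()   # predicted force components (eV/Å)
f_dft = data[:, 3:6].ravel()  # DFT reference force components (eV/Å)

lims = [f_dft.min(), f_dft.max()]
plt.plot(lims, lims, 'k--', lw=1)         # the y = x line
plt.scatter(f_dft, f_nep, s=2, alpha=0.3)
plt.xlabel('DFT force component (eV/Å)')
plt.ylabel('NEP force component (eV/Å)')
plt.savefig('force_parity.png', dpi=200)

# Flag large deviations for closer inspection
err = np.abs(f_nep - f_dft)
print(f'max |error| = {err.max():.3f} eV/Å')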
However, I would not necessarily remove a structure from the training set just because it aligns well with the line; it could still be important even if the model already describes it well. One nice feature to keep in mind is that GPUMD lets you weight structures relative to each other in the loss function by changing the weight keyword in the .xyz entry for that structure: train.xyz and test.xyz — GPUMD documentation
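If I understand the docs correctly, the weight goes in the comment line of the extxyz frame. A hypothetical two-atom frame might look like this (all values illustrative; check the exact keyword spelling on the linked page), where weight=5.0 would make this structure count five times the default in the loss:

```
2
Lattice="4.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 4.0" energy=-10.5 weight=5.0 Properties=species:S:1:pos:R:3:forces:R:3
Si 0.0 0.0 0.0 0.01 -0.02 0.00
Si 2.0 2.0 2.0 -0.01 0.02 0.00
```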
As for the second point, I completely agree that the main way of selecting new structures should be to test the model on the problem you intend to use it for and make sure it works well there. This could be done, e.g., by running active learning in the temperature range in which you want to perform subsequent research-grade MD simulations. Selecting structures from these tests can then be done as you suggest, by comparing to the training data, or by computing the prediction uncertainty if you have, for example, an ensemble of models from active learning: active — GPUMD documentation
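For the ensemble route, a minimal sketch of what the uncertainty estimate could look like. This assumes the calorine package and its CPUNEP ASE calculator, and several independently trained NEP models; the file names are hypothetical:

```python
# Evaluate the same structure with several NEP models and use the spread
# in the predicted forces as an uncertainty estimate.
import numpy as np
from ase.io import read
from calorine.calculators import CPUNEP

models = ['nep1.txt', 'nep2.txt', 'nep3.txt', 'nep4.txt']
structure = read('candidate.xyz')

forces = []
for model in models:
    structure.calc = CPUNEP(model)
    forces.append(structure.get_forces())
forces = np.array(forces)  # shape (n_models, n_atoms, 3)

# Standard deviation over models, maximised over atoms and components;
# structures above some threshold are candidates for DFT labelling.
sigma_max = forces.std(axis=0).max()
print(f'max force std over ensemble: {sigma_max:.3f} eV/Å')
```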
Another approach I know of is to run MD on small cells under the target working conditions with the existing potential, observe the maximum time the simulation runs stably, take the trajectory frames just before the structure collapses, subsample them with farthest point sampling (FPS), run single-point DFT calculations on the selected frames, and add them to the training set to retrain the NEP potential. This is essentially an active learning loop.
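In case it helps anyone, here is a minimal sketch of FPS on per-structure descriptor vectors, e.g. taken from the frames just before the trajectory becomes unstable. The descriptor array and file name are hypothetical; any reasonable structural fingerprint will do:

```python
# Greedy farthest point sampling: repeatedly pick the frame that is
# farthest (in descriptor space) from everything selected so far.
import numpy as np

def farthest_point_sampling(X, n_samples, seed=0):
    """Pick n_samples rows of X that are maximally spread out."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]       # random starting point
    # distance of every point to the closest selected point so far
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(n_samples - 1):
        idx = int(np.argmax(d))                  # farthest remaining point
        selected.append(idx)
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1))
    return selected

descriptors = np.load('trajectory_descriptors.npy')  # (n_frames, n_features)
picked = farthest_point_sampling(descriptors, n_samples=50)
print('frames to label with single-point DFT:', sorted(picked))
```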
Good question and answers above. I would like to add that it is good practice to keep the different data sets in separate folders so that they can be traced back. Sometimes you may want to step back if some new data turns out to be “bad”.
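One simple way to do this is to keep each generation of structures in its own folder and concatenate only the batches you trust into train.xyz. A sketch using ASE’s extxyz reader/writer, with hypothetical folder names:

```python
# Assemble train.xyz from separately stored data set generations;
# rolling back a "bad" batch is just removing its entry from the list.
from ase.io import read, write

datasets = [
    'data/01_initial_md',
    'data/02_active_learning_300K',
    'data/03_high_pressure',   # drop this line to roll back this batch
]

frames = []
for folder in datasets:
    frames += read(f'{folder}/structures.xyz', index=':')

write('train.xyz', frames)
print(f'wrote {len(frames)} structures to train.xyz')
```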