NOMAD cannot recognize MD data produced by GROMACS

Hello everyone,

I have tried to upload data from some MD runs that I did, but NOMAD did not process it because it could not recognize that it came from GROMACS.
I tried it both in the “global” NOMAD and in the “local” NOMAD in our group.

Can someone help me and tell me what the problem is? What is the main file of a GROMACS MD simulation?

I have attached 2 examples of my MD runs, in case a NOMAD developer could take a look. Do you have an example of a successful GROMACS upload, so I can compare what is missing in my folders?

Actually, it does not allow me to upload the zip folders because they are too large, so I will have to find some other way to share them.

Thank you.

Hi Srdjan125,

My name is Joe, I am the domain expert for soft matter simulations for FAIRmat, and have done some development on the Gromacs NOMAD parser.

You can find examples of Gromacs data published in NOMAD by going to the “Explore” tab in NOMAD and then selecting “Gromacs” under Method–>Program_Name.

However, I can help you with your data if you share it with me via NOMAD. Go to the upload page for the Gromacs upload that was not parsed correctly and add me as a reviewer: click the “Manage upload members” button (the icon looks like 2 people), then search for “Rudzinski”. You should see 2 accounts; please add the one that says “Max Planck Institute…”, then select “Reviewer” as the role and submit.

Once you have done that, please post the upload id here, and I will look into the problem.

If you have further questions about parsing MD data in the future, feel free to tag me.

Best,
Joe

Hi @JFRudzinski , thanx for your answer.

After another attempt, NOMAD started processing the files, but at the end I got the message “failure” (for the npt run). The second process is still running, so I don’t know what will happen.

It seems that one of the issues is that my mdp files were …/npt_umbrella.mdp and …/md_umbrella.mdp, so I copied them into the folder and changed their names to match the *xtc, *edr and *log files (i.e. npt2.mdp, npt2.log, npt2.xtc etc). But even after this, NOMAD starts processing and then reports a failure.
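The renaming step above can be checked before uploading with a small script. This is only a sketch: the function name and the extension list are illustrative, taken from the files named in this post, and whether the parser strictly requires a shared basename is not confirmed in this thread.

```python
from pathlib import Path

# Extensions named in this post; this list is an assumption, adjust it to your runs.
RUN_EXTENSIONS = (".mdp", ".log", ".xtc", ".edr")


def missing_run_files(folder, basename, extensions=RUN_EXTENSIONS):
    """Return the extensions for which `<folder>/<basename><ext>` does not exist."""
    folder = Path(folder)
    return [ext for ext in extensions if not (folder / f"{basename}{ext}").is_file()]
```

For example, `missing_run_files("npt_run", "npt2")` lists which of npt2.mdp, npt2.log, npt2.xtc and npt2.edr are absent from a (hypothetical) folder `npt_run`; an empty list means all four share the basename.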

I have added you as reviewer there, upload ID is : q6FBgoyYTzS6k128o-PLcg

Thank you.

Hi srdjan125,

Thanks for sharing your data. I got it, and have been working on some of the issues with the Gromacs parser today.

There are likely several things that we need to fix. I opened an issue: Problems with Gromacs parser from umbrella sampling simulation · Issue #56 · nomad-coe/atomistic-parsers · GitHub. Just FYI, you can follow it if you want, but it will probably not be of much interest to you.

Once I diagnose all the problems, I will post again here to let you know how to proceed or the timeline with which we will be able to fix this problem for you.

Best,
Joe

@JFRudzinski , thanx!

Once the issue is fixed, will there be no problem using the parser for GROMACS umbrella sampling simulations with a coarse-grained force field? (The one I shared with you was an all-atom run.)
I know that for GROMACS it does not matter whether the simulation is all-atom or coarse-grained, as long as it can read the force field parameters.

I also had a problem uploading data for a “normal” all-atom MD run (not umbrella sampling, just the usual NVT and NPT runs). Should I add you as a reviewer there as well?

I also have some general questions about uploading GROMACS data:

a) Some of my runs were restarted (for example, I ran a 400 ns simulation and after some time decided to continue the run from 400 ns to 800 ns), which means that the log file is interrupted because the data of the new run is appended to the same file. Is this a problem for the GROMACS parser, or can it recognize such scenarios?

b) I noticed that uploading and processing the data takes a long time, and then I need to wait to see whether it is a success or a failure. Can I somehow tell NOMAD to process only every 20th snapshot? Or do I need to post-process the trajectory and extract every 20th snapshot myself before uploading to NOMAD?

c) Which files are the slowest to process, which ones must I include in the upload folder, and which ones can I skip (i.e. .edr, .log, .trr, .xtc etc.)?

Thanx!

@srdjan125

I will definitely answer all your questions here in depth. But allow me to first finish fixing all the issues of the data that you shared, then I can give you a more accurate response.

In the meantime, can you please also share the data for the “normal” MD run that you mentioned didn’t work?

Also, if you think of further questions or potential issues in the meantime, please don’t hesitate to post them. Then I can cover everything together at the end.

Thanks and hope to get back to you soon with the fixes!

@JFRudzinski thanx, this is the upload ID of the “normal” MD run: TZYSVNBYTrueiosjd1DY8w.
I tried to upload it again today, and I again got the message “FAILURE” for the NPT and NVT runs but “SUCCESS” for the minimization run.

The input files for the MD runs were prepared with CHARMM GUI, in case this information is useful for you.

It would be good to hear some practical tips on how to speed up the upload process (which data I can omit etc).

Thanx!

@srdjan125

My apologies for the delay in getting back to you. Actually, I have been focused on fixing some issues with the Gromacs parser that came to light due to your post and data. So, thank you for helping us to improve MD support in NOMAD.

Let me first address the specific issues you had with the data that you shared, and then I can try to answer some of your more general questions.

There were 2 relatively straightforward issues with parsing your data:

  1. Our previous parsing of Energy quantities from the log file was insufficient for more general cases. This was the main error that you were getting in your first set of uploads. This is now fixed.

  2. Some of your .trr files do not have positions stored. This was causing some problems with our parser because it previously did not look for an xtc file if a trr file existed. I have added a check to make sure that we can extract the positions from the trr file and if not we check for the xtc. In this way, we avoid the issue for your set of data.

With these fixes, I can successfully upload all your data in NOMAD. These fixes will soon be merged into the development branch. It may then take a little time for them to get pushed up to the beta deployment of NOMAD. I will check on this and try to return to you with some sort of estimate.

Now, let me try to address your other questions:

“srdjan125: Once the issue is fixed it will not be problem to use parser for GROMACS umbrella sampling simulations with coarse-grained force field? (the one I shared with you was all-atom run).”

Yes, there should now be no problem uploading your data (once the fixes go through). Note that while there should be no problem in uploading umbrella sampling simulations, we do not yet have explicit support for enhanced sampling methods, i.e., we do not store these parameters / output in a normalized format. The Gromacs parser will by default catch the constraint energies and some other quantities though, so they will in any case be available in the archive. Depending on your intended usage of NOMAD, this may be enough. This is something I am happy to discuss with you more via a Zoom call if you are interested.

Similar answer for the CG simulations. You should have no problem uploading the data. However, we have not yet developed all the appropriate metadata for CG simulations. This is something that is pretty high on our priority list.

“srdjan125: a) Some of my runs were restarted (for example I run 400 ns simulation and after some time I decide to continue run from 400 ns to 800 ns) which means that log file will be interrupted due to appending data of new run to same file. Is this a problem for GROMACS parser or it can recognize such scenarios?”

This is something that I have not yet looked at in depth, so we will have to treat this on a case-by-case basis to make sure that the parser can handle these situations. I know that if you run a simulation and then prune the trajectory file (i.e., subsample using trjconv), the parser has no problem storing the data accurately, even though the number of steps in the log file and in the trajectory file are different. So, based on this I would guess that your data would still be parsed correctly, but again we would need to test this. It would be great if, after the above fixes are made, you could try it out and let me know if it works or not. If you have problems then I can work on fixing the parser for you (as we did here).

“srdjan125: b) I noticed that upload and processing of data takes long time, and then I need to wait is it success or failure. Can I somehow tell to NOMAD to process every 20th snapshot? Or I need to do post-processing and take every 20th snapshot by myself before uploading into NOMAD?”

A couple things to note here:

The MD parsers will automatically calculate a few observables (mostly for equilibration detection purposes) like the molecular radial distribution functions and mean squared displacements. Most of the processing time for your data was actually being spent calculating these observables for the > 50k water molecules in your system. This is obviously not very useful, so I set a limit on the number of molecules for which these calculations will run. The first set of simulations that you sent are now both parsed in less than 10 min.

More generally, the parser will already automatically prune your trajectory data for storage in the archive if the cumulative number of atoms (i.e., n_atoms * n_frames) is greater than some threshold (I think set to 2.5M at the moment). This is simply for efficiency of features in the GUI, but the raw data is stored unpruned.
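As a rough illustration of the n_atoms * n_frames criterion above, the sketch below computes the smallest frame stride that brings a trajectory under a given threshold. It is not the parser's actual code: the function name is hypothetical, the 2.5M value is the one quoted from memory above, and how NOMAD actually selects the frames it keeps is not described in this thread.

```python
import math

# Threshold as recalled in the post above ("I think set to 2.5M"); treat it as an assumption.
MAX_CUMULATIVE_ATOMS = 2_500_000


def frame_stride(n_atoms, n_frames, max_cumulative_atoms=MAX_CUMULATIVE_ATOMS):
    """Smallest stride s with n_atoms * ceil(n_frames / s) <= max_cumulative_atoms."""
    if n_atoms <= 0 or n_frames <= 0:
        raise ValueError("n_atoms and n_frames must be positive")
    if n_atoms > max_cumulative_atoms:
        raise ValueError("even a single frame exceeds the threshold")
    # Largest number of frames that still fits under the threshold.
    max_frames = max_cumulative_atoms // n_atoms
    return math.ceil(n_frames / max_frames)
```

With 150,000 atoms and 1,000 frames, at most 16 frames fit under 2.5M, so the stride is 63; a system with 1,000 atoms and 100 frames is already below the threshold and gets a stride of 1.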

If you would like to further prune your data, as I mentioned you can prune your trajectory file without messing up the parser. However, there is no way at the moment to provide custom pruning instructions to NOMAD. We are working on more custom approaches for uploading MD data, which are much more flexible in these terms. Preliminary support and documentation should be available by the end of the year.
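For the manual pruning mentioned above, a minimal sketch that assembles a `gmx trjconv` call keeping every 20th frame via its `-skip` option. The file names and the helper function are hypothetical; the helper only builds the command and does not run it. Note that trjconv normally prompts for an output group on standard input, so it is left to be run interactively.

```python
import shlex


def trjconv_skip_command(trajectory, structure, output, skip=20):
    """Build a `gmx trjconv` argument list that writes every `skip`-th frame."""
    if skip < 1:
        raise ValueError("skip must be >= 1")
    return [
        "gmx", "trjconv",
        "-f", str(trajectory),  # input trajectory (e.g. .xtc or .trr)
        "-s", str(structure),   # run input / structure file (e.g. .tpr)
        "-o", str(output),      # pruned output trajectory
        "-skip", str(skip),     # only write every skip-th frame
    ]


# Hypothetical file names, for illustration only.
cmd = trjconv_skip_command("md.xtc", "md.tpr", "md_skip20.xtc", skip=20)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell where GROMACS is available. Keeping the original, unpruned trajectory alongside is advisable, since the pruned file replaces it in the upload.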

“srdjan125: c) which files are the most slow for processing and which one I must include in upload folder and which ones I can skip (ie .edr .log .trr .xtc etc).”

I would advise generally to include all the raw data files from your simulation in the upload (other than situations where your trajectory is too large to store the full thing, in which case you can first prune it for the upload). I hope that with my fixes you will already find the processing times reasonable. However, if you still find long processing times, please let me know and we can discuss further.

Again, I will get back to you when I have a better idea of when the fixes will be available in the beta.

Best,
Joe

@JFRudzinski thank you for the help! Sorry for the late response, I was on vacation.
As far as I am concerned, this is fine for me. I can contact you if I see some other issue when I upload more data.

@JFRudzinski I was a bit busy with other tasks (plus I was on vacation), but now I should proceed with collecting the data I created during my PhD.

Can I ask whether this GROMACS parser update is available on NOMAD, or is it still in the “testing phase”?

If it is available, how can the administrator update the parser on our local NOMAD? Does he just need to run an update?

@JFRudzinski one more question. If I upload data to the local NOMAD (of our group) and there is some problem, can I tag you there to check it?
Or can I do that only on the global NOMAD?

Thanx.

Hi @srdjan125

Yes, the updates are available on the standard/stable version of NOMAD now. I would first test the files that you sent before to see if your local NOMAD (Oasis) has been updated since the fix. If you find that the parsing fails (i.e., no update has been done), then I would ask your administrator to update to the current stable version. I presume that they will know how to do this. If not, please contact me again and I can direct your administrator accordingly.

There is no way that I can access files on your local NOMAD. If you have issues, I would advise uploading the problematic files (or examples thereof) to the central NOMAD repo, and then add me as a reviewer (same as last time). In any case this will be the best course of action, just in case your Oasis is behaving differently from the central NOMAD.

Best,
Joe

Hi @JFRudzinski I started uploading data and it works :).

Two of the uploads of the MD data went smoothly (both on the Oasis NOMAD and on the global NOMAD).

However, I have a problem with the third upload. It contains a lot of data, so I zipped it into several smaller portions (<= 32GB). I started uploading the other batches while the first batch was still being processed. Now the first zip folder has had the status “Processing …, 228/325 entries processed” for a whole day, and it is not moving on from that.

Did I make a mistake by uploading the other batches before the first one had finished processing?
Do I need to delete it and upload the batches one by one, waiting until each one is processed before going to the next?

This is the link for upload: https://nomad-lab.eu/prod/v1/gui/user/uploads/upload/id/KmkjkIrrQ4abpewUzQXieA .
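For reference, the splitting into portions of at most 32 GB described above can be sketched as a simple first-fit-decreasing packing. The function name, the run names and the sizes are hypothetical, and the sizes are whatever you measure for each run folder (the resulting zip sizes may differ).

```python
def pack_into_batches(sizes, limit):
    """Group items into batches whose total size stays <= limit (first-fit decreasing).

    `sizes` maps an item name (e.g. a run folder) to its size, in the same unit as `limit`.
    """
    batches = []  # each batch tracks its remaining free space and its item names
    for name, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        if size > limit:
            raise ValueError(f"{name} alone exceeds the batch limit")
        for batch in batches:
            if batch["free"] >= size:
                batch["names"].append(name)
                batch["free"] -= size
                break
        else:
            # No existing batch has room, so open a new one.
            batches.append({"free": limit - size, "names": [name]})
    return [batch["names"] for batch in batches]
```

For example, folders of 20, 15, 12 and 10 GB with a 32 GB limit end up in two batches: the 20 and 12 GB folders together, and the 15 and 10 GB folders together.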

Hi @srdjan125

Can you add me as a reviewer (Joseph Rudzinski, affiliation: FAIRmat) so that I can have a look at your upload? Likely it is stuck in the processing stage and I will have to request that it be removed so that you can try again.

Please share all your uploads that appear to be stuck in processing. Also, could you describe in a bit more detail what exactly your uploads contain (i.e., how did you split your data, how large are the systems, how long are the trajectories, etc.).

I performed a similarly large upload a few months ago with many entries per upload, and I also had these types of processing issues. As you suggested, it typically helps if you upload one at a time, but this issue can still happen with just a single upload processing. We are still not exactly sure why this is happening, so having your examples and more details will hopefully help us fix the problem.

FYI - I will be on vacation from 21.12 to 01.01, but if you post these details I will make sure to address the issue as soon as I am back.

Hi @JFRudzinski , thanx for answer.

In the meantime, I got help from another NOMAD developer, as seen in this new topic: My NOMAD upload is stuck - it is processing data for 4 days already - #25 by mscheidgen .

My uploads are mostly umbrella sampling runs with coarse-grained protein structures, and only the last trajectory snapshots are saved. Despite that, the data is quite large.

Hi @srdjan125, that’s great, I am glad that your problem is being addressed.

btw - did you know that there is workflow support that would allow you to link your simulations, e.g., your umbrella sampling runs, together so that they reference each other and display a workflow graph? HERE is a recent tutorial on MD simulations in NOMAD that we gave, which describes the various functionalities of NOMAD for MD. Under the Advanced tab, you can find information on workflows in particular.

Also, I was wondering if you have published any of your uploads yet? If so, would you mind sharing one of the upload-ids with me? We will be looking to add explicit support for enhanced sampling in the not so distant future, and it would be helpful to compile a list of relevant existing data already on NOMAD.

@JFRudzinski thanx for the tutorial. I have not published any of this data yet. I am leaving this Institute, so I am just summarizing and uploading all relevant research data.
Some of it should first undergo review by journals before being published. Part of the data is on our local NOMAD Oasis, which I think you cannot access.

If it is relevant, I can share the published uploads with you once that is done. If you want to see the data that is currently being processed, I can add you as a reviewer there.

Hi @srdjan125 if you can let me know when you publish your data, that would be great, thanks!