[SUGGESTION] "/all" or "/archives" endpoint for downloading *all* entries

Hi all, to encourage some action on this forum, here is a suggestion for a new type of OPTIMADE endpoint for your consideration.

Many users will want to download an entire dataset/database for use in e.g. ML tasks. With OPTIMADE currently, this may require thousands of paginated requests, putting undue strain on a server. My suggestion is to standardize a way of providing a compressed OPTIMADE-compliant JSON file for an entire database. This can be hosted as a static file, with a link provided under some OPTIMADE endpoint, with the archive updated as often as desired by the provider (see below for a mock-up output).
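As a rough sketch of the provider side, assuming the entries are already available as OPTIMADE-style dictionaries, the whole database could be dumped into one gzip-compressed JSON file and then served statically. The function names and the `archive_generated` field are made up for illustration and are not part of the specification.

```python
import gzip
import json
from datetime import datetime, timezone


def write_archive(entries, path):
    """Write all entries to one gzip-compressed JSON file; return the entry count."""
    document = {
        "data": entries,
        "meta": {
            # Hypothetical field: when this archive was generated.
            "archive_generated": datetime.now(timezone.utc).isoformat(),
            # Number of entries contained in the archive.
            "data_returned": len(entries),
        },
    }
    # "wt" opens the gzip stream in text mode so json.dump can write to it.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(document, handle)
    return len(entries)


def read_archive(path):
    """Read the archive back into a Python dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

A provider could rerun something like `write_archive` on whatever schedule suits them and simply replace the static file.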

We could either define a niche “/all” endpoint with corresponding “/all/structures” etc. for each entry type, which just hosts a link to the latest archive, or we could add “archive” as an entry type itself (allowing for e.g. “/info/archives”). This would allow for easier bundling of e.g. references within the same JSON file, and should work with our existing format by allowing entries with different “type” values in a given “data” block. The “meta” fields for this endpoint would include the timestamps of the dataset generation and of the request, versioning, and some hints as to how to decompress the archive (e.g. we could have enums for zip, tar, gz etc. that make the archive machine-actionable).
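To make the idea more concrete, here is one way a response from such an endpoint might look, together with a client that uses the compression hint to unpack the archive. This is only a sketch: the `archives` type, the `url`, `compression`, `entry_types` and `archive_generated` fields, the example URL, and all values are assumptions for illustration, not anything that has been agreed.

```python
import gzip
import io
import json
import tarfile
import zipfile

# Hypothetical response body; every field name and value here is illustrative.
MOCK_RESPONSE = {
    "data": [
        {
            "type": "archives",
            "id": "all-entries",
            "attributes": {
                "url": "https://example.org/optimade/archives/all.json.gz",
                "compression": "gz",  # one of the enum values below
                "entry_types": ["structures", "references"],
            },
        }
    ],
    "meta": {
        "time_stamp": "2021-06-01T12:00:00Z",  # time of the request
        "archive_generated": "2021-05-30T00:00:00Z",  # time the archive was built
        "api_version": "1.1.0",
    },
}

# Assumed enum of compression hints a provider could advertise.
SUPPORTED = ("gz", "zip", "tar")


def load_archive(raw: bytes, compression: str):
    """Decode downloaded archive bytes into a dictionary using the hint."""
    if compression not in SUPPORTED:
        raise ValueError(f"unsupported compression hint: {compression!r}")
    if compression == "gz":
        return json.loads(gzip.decompress(raw))
    if compression == "zip":
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            # Assume the first member is the JSON document.
            return json.loads(zf.read(zf.namelist()[0]))
    # compression == "tar": take the first regular file in the tarball.
    with tarfile.open(fileobj=io.BytesIO(raw)) as tf:
        member = next(m for m in tf.getmembers() if m.isfile())
        return json.loads(tf.extractfile(member).read())
```

A client would fetch the bytes from the advertised `url` with a single request and pass them to `load_archive` along with the advertised `compression` value, instead of walking through every page.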

As a potential drawback, this is extra work for implementers, and compressing their entire database in an OPTIMADE format may incur a greater compute cost than that of receiving multiple attempts to download the entire database, so of course this endpoint must be optional.

Let me know what you think!


This idea was discussed at the last OPTIMADE workshop and has been written up as issue #364, “Suggestion of an "/archives" endpoint”, in the Materials-Consortia/OPTIMADE repository on GitHub.

Currently there is no one available to write up a full feature request, but there seemed to be consensus that this could be a good idea.
