Changing references for all publications in a dataset

Hi, I’ve been trying to change the reference for a large number of publications (close to 3000) within a dataset. Each calculation was uploaded individually, so I have to use the API, and I’ve been having issues with it.

I can do it for a single calculation like this:

import requests

base_url = ''
query = {'calc_id': 'hlb48O0vNx9NtuO5txg_QWBfqdfZ'}
response = requests.post(
    base_url + "/entries/edit",
    headers={'Authorization': 'Bearer {}'.format(token_access)},
    json={'query': query,
          'metadata': {'references': ["", ""]}})

I’ve tried it for a few calculations and that works.

When I instead try specifying the whole dataset using this query:

query = {'datasets': {'dataset_id': 'w-aD3WUATVqP_GpRd-GU1g'}}

it does not seem to have any effect, even though the same query works using /entries/query.
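For completeness, here is a sketch of the dataset-wide request I am attempting. The helper functions and their names are just for illustration; base_url and the token are the same as in the single-entry example above:

```python
import requests

def build_edit_payload(dataset_id, references):
    # Build the JSON body for /entries/edit targeting a whole dataset,
    # mirroring the single-entry request above (hypothetical helper).
    return {'query': {'datasets': {'dataset_id': dataset_id}},
            'metadata': {'references': references}}

def edit_dataset(base_url, token, dataset_id, references):
    # Send the edit request. The edit itself runs asynchronously on the
    # server, so a successful response does not mean it is done yet.
    return requests.post(
        base_url + '/entries/edit',
        headers={'Authorization': 'Bearer {}'.format(token)},
        json=build_edit_payload(dataset_id, references))
```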

Should I change every calculation individually?


Did you give it some time? The /entries/edit call does not complete the operation itself; it triggers processing on the server that should eventually perform the edit.

Yeah, it’s been two days since I first tried.

It looks good to me and should work. The query is OK, and there are no errors in the logs, so I can’t tell you what went wrong. It has to be some kind of bug.

The simple solution would be that I just manually do the changes in our backend.

Unfortunately, you created individual uploads for each entry. Is there a reason why you did this? It makes it very hard to change the metadata. We also try to avoid this as much as possible; we have limits on simultaneously unpublished uploads, etc. You did a good job circumventing this via the API, but it is still less than ideal for us. Right now, your dataset constitutes 30% of all NOMAD uploads because of it. This makes future maintenance, backup, and migration tasks unnecessarily hard for us and has negative performance implications.

Can I ask you to re-upload the data as one upload? E.g. via one zip file with directories 0, 1, …, 2871 or something. I can then remove the tiny uploads, make sure the references are the ones you want, and migrate the dataset while keeping the DOI intact.

Oh, sorry about this, I didn’t realize that uploading many calculations individually is bad practice. I did it this way simply because it was easier to implement. I think that, when put all together, the zip file will be very large and will probably have to be split, but I can upload it in a few large increments.

It’s our fault. We have to document this better. It is really hard to steer user behavior without putting up too many limits. Sorry for the inconvenience. If you go above 32 GB, let me know. It’s better to have one upload of 64 GB than 3000 of just a few kB.

I’ve put everything in one zip file, but it is 580 GB. I can split it into smaller ones if that’s better.

If this is too big, I can remove the Wannier Hamiltonians, as they take up most of the space, although they were the main reason I wanted to upload this data.

Overall, we can handle the size if you want to keep the Hamiltonians. But yes, smaller pieces would be better, ideally 32 GB each.
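If it helps with the splitting: a simple greedy grouping of the per-entry directories by size gets you batches under a given limit, which you can then zip separately. A rough sketch, assuming directories like 0, 1, …, 2871 under one root (the helper name is mine):

```python
import os

def group_dirs_by_size(root, limit_bytes):
    """Greedily group the subdirectories of `root` into batches whose
    combined size stays below `limit_bytes` (hypothetical helper).
    Directories are taken in lexicographic listing order."""
    batches, current, current_size = [], [], 0
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Total size of all files under this entry's directory.
        size = sum(os.path.getsize(os.path.join(dirpath, f))
                   for dirpath, _, files in os.walk(path) for f in files)
        if current and current_size + size > limit_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch is a list of directory names that together fit under the limit; a directory larger than the limit still gets its own batch.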

Hi @Zeleznyj ,

I read “Wannier Hamiltonians” and I just wanted to jump in :slight_smile:

May I ask which code you are using for calculating the Wannier orbitals? Wannier90? Furthermore, are you storing the full reciprocal-space Hamiltonian? If so, you would only need to store the hopping matrix plus the material information; that would be enough to reconstruct the full k-space Hamiltonian. That might be a further way of saving some GB.

If you are using Wannier90, you would only need to keep the .win, .wout, and _hr.dat files for each Wannier interpolation. If it is not Wannier90, we can help you further.

Best regards,

I have split the files into zip files below 32 GB, but I’m still having trouble with uploading.

When I try it using Python requests it crashes with the error:

OverflowError: string longer than 2147483647 bytes

I also tried curl, but I get 413 Request Entity Too Large. This happens even with small files, so I must be doing something wrong.

The Wannier Hamiltonians are generated with the FPLO code. I’ve only included the Hamiltonian file itself, which contains the matrix elements, so I don’t think the size can be reduced further.
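In case it is useful for the thread: the OverflowError comes from requests trying to send the whole file as a single in-memory byte string, which hits a 2 GiB limit in Python’s ssl layer. Passing an open file object instead makes requests stream the body in chunks. A rough sketch; the upload endpoint path and helper names are my assumptions, not the exact API:

```python
import requests

def auth_header(token):
    # Same bearer-token header as in the edit calls (hypothetical helper).
    return {'Authorization': 'Bearer {}'.format(token)}

def upload_zip(base_url, token, path):
    # Passing the open file object (rather than its .read() contents)
    # lets requests stream the body in chunks, so files larger than
    # 2 GiB no longer trigger the OverflowError.
    with open(path, 'rb') as f:
        return requests.post(
            base_url + '/uploads',  # endpoint path is an assumption
            headers=auth_header(token),
            data=f)
```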

Can you post the curl command that you were using? I’ll try to reproduce this behaviour. Unfortunately, our logs do not contain any 413 requests in the last 7 days.

Sorry for not replying earlier, I was swamped with other stuff.

I figured out the problem with curl; I was using the wrong URL.

I managed to upload all the files, split into 21 uploads. Can you replace the old uploads now? I can put the new uploads into a separate dataset if that would help.

Either you publish with a new dataset and I remove the old dataset and uploads later, or the other way around. We can also reuse the dataset if that is somehow beneficial for you. Whatever you prefer.

Some criterion for identifying the old uploads would help. But I guess I can use the upload time? They were all done on the same day/week/etc.? I just want to make sure not to delete the wrong things.

OK, I put all the new uploads in a new dataset, “High-throughput AHE 2”. It’s fine if you just delete the old dataset; I don’t really care which dataset we use, but it would be great if we could keep the same DOI, since it is already referenced in a published paper.

I think I can move the DOI to the new dataset and delete the old one. Do you want to rename the new dataset, e.g. remove the “2”?

Sure, if you can remove the 2 that would be nice, but it doesn’t matter much. Thanks for your help.

I did the following:

  • renamed the new dataset
  • moved the existing DOI to the new dataset
  • deleted the old dataset
  • deleted the 2871 uploads associated with the old dataset

Everything looks good now from my end. Thanks for your help and sorry again for all the extra work and inconvenience.