ICSD cif does not work for pymatgen CifParser

roadtripper · April 9, 2020, 12:02am

Hi,

I tried to use pymatgen's CifParser to parse CIF files downloaded from ICSD and to output the BibTeX with `get_bibtex_string()`, but it seems that none of the ICSD CIFs work.

from pymatgen.io.cif import CifParser

icsd_fn = "YourCustomFileName1_CollCode2356.cif"
parser = CifParser(icsd_fn)
parser.get_bibtex_string()

The `get_bibtex_string()` function will give the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/anaconda/envs/strumining/lib/python3.6/site-packages/latexcodec/codec.py in encode(self, unicode_, errors)
    804         return (
--> 805             encoder.encode(unicode_, final=True),
    806             len(unicode_),

~/anaconda/envs/strumining/lib/python3.6/site-packages/latexcodec/lexer.py in encode(self, unicode_, final)
    478             return self.emptychar.join(
--> 479                 self.get_latex_bytes(unicode_, final=final))
    480         except UnicodeEncodeError as e:

~/anaconda/envs/strumining/lib/python3.6/site-packages/latexcodec/codec.py in get_latex_bytes(self, unicode_, final)
    725                 "expected unicode for encode input, but got {0} instead"
--> 726                 .format(unicode_.__class__.__name__))
    727         # convert character by character

TypeError: expected unicode for encode input, but got list instead

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-53-e8cf025fac04> in <module>()
----> 1 parser.get_bibtex_string()

~/anaconda/envs/strumining/lib/python3.6/site-packages/monty/dev.py in decorated(*args, **kwargs)
     90             if not self.condition:
     91                 raise RuntimeError(self.message)
---> 92             return _callable(*args, **kwargs)
     93 
     94         return decorated

~/anaconda/envs/strumining/lib/python3.6/site-packages/pymatgen/io/cif.py in get_bibtex_string(self)
   1123             entries['cif-reference-{}'.format(idx)] = Entry('article', list(bibtex_entry.items()))
   1124 
-> 1125         return BibliographyData(entries).to_string(bib_format='bibtex')
   1126 
   1127     def as_dict(self):

~/anaconda/envs/strumining/lib/python3.6/site-packages/pybtex/database/__init__.py in to_string(self, bib_format, **kwargs)
    284         """
    285         writer = find_plugin('pybtex.database.output', bib_format)(**kwargs)
--> 286         return writer.to_string(self)
    287 
    288     def to_bytes(self, bib_format, **kwargs):

~/anaconda/envs/strumining/lib/python3.6/site-packages/pybtex/database/output/__init__.py in to_string(self, bib_data)
     51 
     52     def to_string(self, bib_data):
---> 53         result = self._to_string_or_bytes(bib_data)
     54         return result if self.unicode_io else result.decode(self.encoding)
     55 

~/anaconda/envs/strumining/lib/python3.6/site-packages/pybtex/database/output/__init__.py in _to_string_or_bytes(self, bib_data)
     47     def _to_string_or_bytes(self, bib_data):
     48         stream = io.StringIO() if self.unicode_io else io.BytesIO()
---> 49         self.write_stream(bib_data, stream)
     50         return stream.getvalue()
     51 

~/anaconda/envs/strumining/lib/python3.6/site-packages/pybtex/database/output/bibtex.py in write_stream(self, bib_data, stream)
    167                 self._write_persons(stream, persons, role)
    168             for type, value in entry.fields.items():
--> 169                 self._write_field(stream, type, value)
    170             stream.write(u'\n}\n')

~/anaconda/envs/strumining/lib/python3.6/site-packages/pybtex/database/output/bibtex.py in _write_field(self, stream, type, value)
    121 
    122     def _write_field(self, stream, type, value):
--> 123         stream.write(u',\n    %s = %s' % (type, self.quote(self._encode(value))))
    124 
    125     def _format_name(self, stream, person):

~/anaconda/envs/strumining/lib/python3.6/site-packages/pybtex/database/output/bibtex.py in _encode(self, text)
    105         import latexcodec  # NOQA
    106 
--> 107         return codecs.encode(text, 'ulatex+{}'.format(self.encoding))
    108 
    109     def _encode_with_comments(self, text):

TypeError: encoding with 'ulatex+UTF-8' codec failed (TypeError: expected unicode for encode input, but got list instead)

Can anyone help with this? I tested CIFs from other databases, such as COD (Crystallography Open Database), and the function works fine with them. But none of the CIFs I got from ICSD work.

mkhorton · April 10, 2020, 9:33pm

Hi @roadtripper, welcome!

Can you share the header for the CIF file you’re looking at? It’s difficult to diagnose problems like this without being able to run an example. As a guess, it’s likely related to some special characters in either an author name or a title.

roadtripper · April 10, 2020, 11:19pm

Hi @mkhorton,

I was initially planning to attach the CIF files to the post, but I couldn't find a way to attach files, so I uploaded some CIFs to Google Drive instead.

I also show the header of one CIF below, though I'm not sure the formatting survived being pasted. I think this problem exists for all the ICSD CIFs I have tested so far.


#(C) 2019 by FIZ Karlsruhe - Leibniz Institute for Information Infrastructure.  All rights reserved.
data_2356-ICSD
_database_code_ICSD 2356
_audit_creation_date 1980-01-01
_audit_update_record 2012-08-01
_chemical_name_systematic 'Barium pentaoxodititanate'
_chemical_formula_structural 'Ba Ti2 O5'
_chemical_formula_sum 'Ba1 O5 Ti2'
_chemical_name_structure_type V2GaO5
_exptl_crystal_density_diffrn 5.13
_publ_section_title 'Refinement of barium dititanate'
loop_
_citation_id
_citation_journal_full
_citation_year
_citation_journal_volume
_citation_page_first
_citation_page_last
_citation_journal_id_ASTM
primary

;
Acta Crystallographica, Section B: Structural Crystallography and Crystal
Chemistry
; 1974 30 2894 2896 ACBCAR
loop_
_publ_author_name
'Tillmanns, E.'
[...]
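
A note on the header above: the `_citation_*` tags sit inside a `loop_`, so a loop-aware CIF reader would plausibly return each of them as a list, even with a single row. That would be consistent with the "expected unicode for encode input, but got list instead" message in the traceback, though the thread does not confirm it as the cause. A minimal sketch, assuming that reading, of flattening list-valued fields to plain strings before they are handed to a BibTeX writer. `flatten_citation_fields` and the sample dictionary are hypothetical and are not part of pymatgen or pybtex.

```python
from typing import Any, Dict


def flatten_citation_fields(fields: Dict[str, Any]) -> Dict[str, str]:
    """Return a copy of `fields` in which every value is a plain string.

    Hypothetical helper for illustration; not a pymatgen or pybtex function.
    """
    flat = {}
    for key, value in fields.items():
        if isinstance(value, (list, tuple)):
            # A one-row CIF loop yields one-element lists; join the items
            # in case the loop has several rows.
            value = "; ".join(str(item) for item in value)
        # Collapse the line breaks that multi-line CIF text fields carry.
        flat[key] = " ".join(str(value).split())
    return flat


# Values shaped the way a loop-aware reader might return them for the
# header above (an assumption, not actual pymatgen output).
looped = {
    "journal": [
        "Acta Crystallographica, Section B: Structural Crystallography "
        "and Crystal\nChemistry"
    ],
    "year": ["1974"],
    "volume": ["30"],
    "pages": "2894--2896",
}

flat = flatten_citation_fields(looped)
print(flat["journal"])
print(flat["year"])
```

Every value in `flat` is a `str`, which is the type the encoder in the traceback expects.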

mkhorton · April 10, 2020, 11:35pm

Ok, thanks! I’ll look into it.

peterschindler · April 17, 2020, 12:25am

@roadtripper Try deleting the first line of every cif file (it contains the copyright symbol, which I think causes the issue).

roadtripper · April 17, 2020, 12:58am

I tried, but it still doesn't work.

If I delete the first line, I get the same error log:

#(C) 2019 by FIZ Karlsruhe - Leibniz Institute for Information Infrastructure. All rights reserved.

If I go on to delete the first two lines, `parser.get_bibtex_string()` gives empty output:

#(C) 2019 by FIZ Karlsruhe - Leibniz Institute for Information Infrastructure.  All rights reserved.
data_2356-ICSD
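
One plausible explanation for the empty output after deleting the first two lines: the second line, `data_2356-ICSD`, is the data block header, and in CIF syntax the tags that follow belong to the block that header opens. With the header gone, a reader may find no block at all. A small sketch that only counts `data_` headers in the text; `data_block_names` is a hypothetical helper and does not reproduce how pymatgen parses the file.

```python
import re
from typing import List

# A CIF data block starts with a line of the form `data_<name>`.
DATA_HEADER = re.compile(r"^[ \t]*data_(\S*)", re.MULTILINE)


def data_block_names(cif_text: str) -> List[str]:
    """Return the names of the `data_` block headers found in `cif_text`."""
    return DATA_HEADER.findall(cif_text)


header = (
    "#(C) 2019 by FIZ Karlsruhe - Leibniz Institute for Information "
    "Infrastructure.  All rights reserved.\n"
    "data_2356-ICSD\n"
    "_database_code_ICSD 2356\n"
)

# Drop the copyright comment only, then the comment and the block header.
first_line_removed = header.split("\n", 1)[1]
first_two_lines_removed = header.split("\n", 2)[2]

print(data_block_names(header))                   # ['2356-ICSD']
print(data_block_names(first_line_removed))       # ['2356-ICSD']
print(data_block_names(first_two_lines_removed))  # []
```

Removing only the comment line leaves the block header in place, which fits the observation that the error is unchanged in that case.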