How to get the “structure” feature?

Hey Matminer! I have some CIF files from a recent experiment. If I want to obtain the “structure” column as shown in the image below, could you tell me how to get this column of data? Thank you very much for your reply.

I note that CompositionToStructureFromMP() only works if my compositions are ones that are in the MP database.

image

Hi @obaica

You can convert each of the CIFs (one by one) to pymatgen Structure objects using the Structure class method from_file (see: pymatgen.core.structure module — pymatgen 2022.8.23 documentation).

To convert a single one, you’d do:

from pymatgen.core.structure import Structure
structure_object = Structure.from_file("/path/to/my_file.cif")

If you have many of these CIFs and would like to put them in a dataframe as per your picture, you can use the PymatgenFunctionApplicator class from matminer.featurizers.conversions, as I show below (see here for the source code: matminer/conversions.py at 7f8520b97175db3c4fc6afe055cee664ebd77238 · hackingmaterials/matminer · GitHub). The requirement is that you have the CIF filenames either in a Python list or in a dataframe column.


from matminer.featurizers.conversions import PymatgenFunctionApplicator
from pymatgen.core.structure import Structure

# apply Structure.from_file to each input and store the result under "structure"
fileconverter = PymatgenFunctionApplicator(func=Structure.from_file, target_col_id="structure")

# if you want them as a list
# assuming your filenames are in an iterable called cif_filenames
structures = fileconverter.featurize_many(cif_filenames)

# if you want them as a dataframe
# assuming your cif filenames are in a df called "df" under a column name "cif_filenames"
df_with_structures = fileconverter.featurize_dataframe(df, "cif_filenames")
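
In case it helps, here is a minimal sketch of one way to build the cif_filenames list that the snippet above assumes. It uses only the standard library; the helper name list_cif_files and the directory name "structures" are made up for illustration and are not part of matminer or pymatgen.

```python
from pathlib import Path


def list_cif_files(directory):
    """Return the CIF file paths found directly inside `directory`.

    Hypothetical helper (not a matminer or pymatgen function).
    Paths are returned as sorted strings; the extension check is
    case-insensitive, and subdirectories are not searched.
    """
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".cif"
    )
```

You could then write cif_filenames = list_cif_files("structures") and pass that list to featurize_many, or wrap it in pd.DataFrame({"cif_filenames": cif_filenames}) for featurize_dataframe.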

Thank you very much for your patient and detailed reply.

This answer is very helpful to me.


Dear Matminer

  • I’ve tried both approaches above, featurize_many and featurize_dataframe, but neither of them can batch-process all the data. The error is shown below.
  • Here’s how I did it: (1) I used PymatgenFunctionApplicator.featurize to read the local CIF files and put the results into a list. Then, following the method described above, the values in the “structure” column all come out as ‘nan’. I also tried putting the filenames in a column called “cif_filenames”, and the result was the same.

image

@kaifeng_zhang You seem to already have the data you want in your first cell. After that, it looks like you are trying to call from_file on the pymatgen structures themselves. If we go through each cell it will be apparent why this is happening:

CELL 1: you apply PymatgenFunctionApplicator to each file in a for loop. The files are converted into pmg structures and appended to the s list.

CELL 2: You make a df from the structures. This is actually already the data you want.

CELL 3: You rename the column so that your structures are under the name “cif_filename”. But these are actually NOT CIF filenames, they are your actual structures. Then you rerun the featurizer using the cif_filenames column (which really holds structures) as input, which results in nan because these are not CIF file paths. With ignore_errors=True, this shows no error messages.

CELL 4: You do the same thing as cell 3 but without dataframes. Again, this results in a bunch of nans.
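
A quick way to catch this kind of mix-up before featurizing is to check that every entry in the input column really is a path to an existing file. The following is only a sketch using the standard library; find_bad_path_entries is a hypothetical helper name, not something matminer or pymatgen provides.

```python
import os


def find_bad_path_entries(values):
    """Return (index, reason) pairs for entries that are not existing file paths.

    Hypothetical helper for sanity-checking featurizer input. An entry is
    flagged if it is not a str / os.PathLike (for example, an already
    parsed Structure object) or if no file exists at that path.
    """
    problems = []
    for i, value in enumerate(values):
        if not isinstance(value, (str, os.PathLike)):
            problems.append((i, f"not a path, got {type(value).__name__}"))
        elif not os.path.isfile(value):
            problems.append((i, "file does not exist"))
    return problems
```

For a dataframe you could call find_bad_path_entries(df["cif_filenames"].tolist()); an empty list means every entry points at a file on disk. In the notebook above, every row would be flagged as "not a path", since the column holds structures rather than filenames.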

Here’s an example of reading and using pymatgen function applicator to do the same thing with 3 test CIF files I had:

Note if you are getting a bunch of nans, it is worth setting ignore_errors=False so you can debug more easily.
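
If you would rather see every failing file at once instead of stopping at the first exception, a plain loop that records the error per file is another option. This is a generic sketch, not matminer functionality: load_each is a hypothetical helper, and the loader argument is assumed to be something like Structure.from_file.

```python
def load_each(paths, loader):
    """Apply `loader` to each path, keeping successes and failures apart.

    Hypothetical helper. Returns two dicts keyed by path:
    `loaded` maps path -> loader result, and `failures` maps
    path -> a short "ExceptionType: message" string.
    """
    loaded, failures = {}, {}
    for path in paths:
        try:
            loaded[path] = loader(path)
        except Exception as exc:  # record the problem and keep going
            failures[path] = f"{type(exc).__name__}: {exc}"
    return loaded, failures
```

With pymatgen installed, something like loaded, failures = load_each(cif_filenames, Structure.from_file) would leave the unparseable files and their error messages in failures.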

@kaifeng_zhang Oh, and here is the code as copy+pastable format so you can adapt to your own purposes:

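from matminer.featurizers.conversions import PymatgenFunctionApplicator
from pymatgen.core.structure import Structure
import os
import pandas as pd

# featurizer that applies Structure.from_file and writes to a "structure" column
pfa = PymatgenFunctionApplicator(func=Structure.from_file, target_col_id="structure")

# one row per file in the "structures/" directory, stored as a path string
df = pd.DataFrame({"cif": [os.path.join("structures/", d) for d in os.listdir("structures/")]})
print(df)

# read each CIF path in the "cif" column into a pymatgen Structure
df = pfa.featurize_dataframe(df, col_id="cif")

# inspect the first converted structure and confirm its type
test_structure = df["structure"].iloc[0]
print(test_structure, type(test_structure))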