How to get the “structure” feature?

Hey Matminer! I have some CIF files from a recent experiment. How can I obtain a “structure” column like the one shown in the image below? Thank you very much for your reply.

I note that CompositionToStructureFromMP() only works if my compositions are already in the MP database.

image

Hi @obaica

You can convert each CIF (one at a time) to a pymatgen Structure object using the Structure class method from_file: pymatgen.core.structure module — pymatgen 2022.8.23 documentation

To convert a single one, you’d do:

from pymatgen.core.structure import Structure
structure_object = Structure.from_file("/path/to/my_file.cif")

If you have many of these CIFs and would like to put them in a dataframe as in your picture, you can use the PymatgenFunctionApplicator class from matminer.featurizers.conversions, as I show below (see here for the source code: matminer/conversions.py at 7f8520b97175db3c4fc6afe055cee664ebd77238 · hackingmaterials/matminer · GitHub). The only requirement is that you have the CIF filenames either in a Python list or in a dataframe column.


from matminer.featurizers.conversions import PymatgenFunctionApplicator
from pymatgen.core.structure import Structure

fileconverter = PymatgenFunctionApplicator(func=Structure.from_file, target_col_id="structure")

# if you want them as a list
# assuming your filenames are in an iterable called cif_filenames
# note: featurize_many returns one feature list per entry, so each item is [Structure]
structures = fileconverter.featurize_many(cif_filenames)

# if you want them as a dataframe
# assuming your cif filenames are in a df called "df" under a column name "cif_filenames"
df_with_structures = fileconverter.featurize_dataframe(df, "cif_filenames")

Thank you very much for your patient and detailed reply.

This answer is very helpful to me.


Dear Matminer

  • I’ve tried both approaches above, featurize_many and featurize_dataframe, but neither can batch-process all the data. The error is shown in the image below.
  • Here’s how I do it: (1) I used PymatgenFunctionApplicator.featurize to read the local CIF files and put the results into a list. Then, following the method described above, the "structure" column comes out as all ‘nan’. I also tried putting the filenames in a column named “cif_filenames”, and the result was the same.

image

@kaifeng_zhang You seem to already have the data you want in your first cell. After that, it looks like you are calling from_file on objects that are already pymatgen structures. Going through each cell makes it apparent why this is happening:

CELL 1: you apply PymatgenFunctionApplicator to each file in a for loop. The files are converted into pmg structures and appended to the s list.

CELL 2: You make a df from the structures. This is actually already the data you want.

CELL 3: You rename the column so that your structures are under the name “cif_filenames”. But this column does NOT contain CIF filenames; it contains your actual structures. You then rerun the featurizer using that column as input, which results in nan because Structure.from_file expects a file path, not a Structure object. With ignore_errors=True, no error messages are shown.

CELL 4: You do the same thing as cell 3 but without dataframes. Again, this results in a bunch of nans.

Here’s an example of reading and using pymatgen function applicator to do the same thing with 3 test CIF files I had:

Note if you are getting a bunch of nans, it is worth setting ignore_errors=False so you can debug more easily.
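A quick sanity check before featurizing can also catch the mistake from cells 3 and 4. The sketch below is only an illustration, not part of matminer: the helper name find_bad_cif_entries is hypothetical, and it simply assumes every entry should be a path to a file that exists on disk.

```python
from pathlib import Path


def find_bad_cif_entries(values):
    """Return (index, value, reason) for entries that cannot be CIF file paths.

    Hypothetical helper for illustration; it only checks that each entry is
    a path-like value pointing at an existing file, not that the file parses.
    """
    problems = []
    for i, value in enumerate(values):
        if not isinstance(value, (str, Path)):
            # e.g. a pymatgen Structure that was put in the column by mistake
            problems.append((i, value, f"not a path (got {type(value).__name__})"))
        elif not Path(value).is_file():
            problems.append((i, value, "file does not exist"))
    return problems


# Example usage (assuming a dataframe "df" with a "cif_filenames" column):
# for index, value, reason in find_bad_cif_entries(df["cif_filenames"]):
#     print(index, reason)
```

If this reports "not a path" for every row, the column most likely already holds structures rather than filenames.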

@kaifeng_zhang Oh, and here is the code in copy-pastable format so you can adapt it to your own purposes:

import os

import pandas as pd
from matminer.featurizers.conversions import PymatgenFunctionApplicator
from pymatgen.core.structure import Structure

# Wrap Structure.from_file so it can be applied to a whole dataframe column
pfa = PymatgenFunctionApplicator(func=Structure.from_file, target_col_id="structure")

# One row per file in the "structures/" directory (assumed to hold only CIFs)
df = pd.DataFrame({"cif": [os.path.join("structures/", d) for d in os.listdir("structures/")]})
print(df)

# Read each file and store the resulting Structure in the "structure" column
df = pfa.featurize_dataframe(df, col_id="cif")
test_structure = df["structure"].iloc[0]
print(test_structure, type(test_structure))
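One caveat with the listing above: os.listdir returns every entry in the directory, so a stray README or subfolder would also be passed to Structure.from_file. If your folder is not guaranteed to contain only CIFs, a small filter like the sketch below may help. The helper name list_cif_files is hypothetical, and it assumes the CIFs sit directly in the folder and end in ".cif" (any capitalization).

```python
from pathlib import Path


def list_cif_files(directory):
    """Return sorted paths (as strings) of the .cif files directly inside `directory`.

    Hypothetical helper for illustration; subdirectories are not searched.
    """
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".cif"
    )


# Example usage with the dataframe from the listing above:
# df = pd.DataFrame({"cif": list_cif_files("structures/")})
```

Sorting keeps the row order reproducible between runs, which os.listdir does not guarantee.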

Hi!
I want to do the inverse! I have a dataframe with a column containing structures. These materials do not have any CIF files, and I want to read this column with pymatgen directly via the IStructure class, without creating CIF files, so that I can use the structures with the ACSF function. The structure format is the same as POSCAR, but the number of lines is different.

Could you please help me?

Thank you in advance for your response.