Problems in preparing the training data

Dear all,
Recently I reproduced the Predicting bulk modulus example (https://nbviewer.jupyter.org/github/hackingmaterials/matminer_examples/blob/main/matminer_examples/machine_learning-nb/bulk_modulus.ipynb), starting from the data preparation step. I need to predict the bulk modulus of some structures. To see the format of the data, I first exported it before featurization using df.to_csv('export.csv'). The first structure in the CSV file is shown below:
,material_id,formula,space_group,structure,elastic_anisotropy,G_VRH,K_VRH,poisson_ratio
0,mp-10003,Nb4CoSi,124,"Full Formula (Nb8 Co2 Si2)
Reduced Formula: Nb4CoSi
abc : 6.221780 6.221780 5.022400
angles: 90.000000 90.000000 90.000000
Sites (12)
'# SP a b c


0 Nb 0.152391 0.333153 0.5
1 Nb 0.847609 0.666847 0.5
2 Nb 0.666847 0.152391 0.5
3 Nb 0.333153 0.847609 0.5
4 Nb 0.847609 0.333153 0
5 Nb 0.666847 0.847609 0
6 Nb 0.333153 0.152391 0
7 Nb 0.152391 0.666847 0
8 Co 0 0 0.75
9 Co 0 0 0.25
10 Si 0.5 0.5 0.75
11 Si 0.5 0.5 0.25",0.0306879170543,97.1416044794,194.2688843,0.28570074251999994
Using only the data exported from the database, I ran into a problem. I imported the data with:
import pandas as pd
df = pd.read_csv('export.csv')
df.head()
Adding the composition-based features works, but adding the density features fails with the following error:


RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/base.py", line 493, in featurize_wrapper
return self.featurize(*x)
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/structure.py", line 109, in featurize
output.append(s.density)
AttributeError: 'str' object has no attribute 'density'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/base.py", line 508, in featurize_wrapper
reraise(type(e), type(e)(msg), sys.exc_info()[2])
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/six.py", line 718, in reraise
raise value.with_traceback(tb)
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/base.py", line 493, in featurize_wrapper
return self.featurize(*x)
File "/Users/bob/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/structure.py", line 109, in featurize
output.append(s.density)
AttributeError: 'str' object has no attribute 'density'
TO SKIP THESE ERRORS when featurizing specific compounds, set 'ignore_errors=True' when running the batch featurize() operation (e.g., featurize_many(), featurize_dataframe(), etc.).
"""

The above exception was the direct cause of the following exception:

AttributeError Traceback (most recent call last)
in
2
3 df_feat = DensityFeatures()
----> 4 df = df_feat.featurize_dataframe(df, "structure") # input the structure column to the featurizer
5 df.head()

~/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/base.py in featurize_dataframe(self, df, col_id, ignore_errors, return_errors, inplace, multiindex, pbar)
335
336 # Compute the features
→ 337 features = self.featurize_many(df[col_id].values,
338 ignore_errors=ignore_errors,
339 return_errors=return_errors,

~/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/base.py in featurize_many(self, entries, ignore_errors, return_errors, pbar)
465 return_errors=return_errors,
466 ignore_errors=ignore_errors)
→ 467 return p.map(func, entries, chunksize=self.chunksize)
468
469 def featurize_wrapper(self, x, return_errors=False, ignore_errors=False):

~/Documents/software/condainstall/anaconda3/lib/python3.8/multiprocessing/pool.py in map(self, func, iterable, chunksize)
362 in a list that is returned.
363 '''
→ 364 return self._map_async(func, iterable, mapstar, chunksize).get()
365
366 def starmap(self, func, iterable, chunksize=None):

~/Documents/software/condainstall/anaconda3/lib/python3.8/multiprocessing/pool.py in get(self, timeout)
769 return self._value
770 else:
→ 771 raise self._value
772
773 def _set(self, i, obj):

~/Documents/software/condainstall/anaconda3/lib/python3.8/multiprocessing/pool.py in worker()
123 job, i, func, args, kwds = task
124 try:
→ 125 result = (True, func(*args, **kwds))
126 except Exception as e:
127 if wrap_exception and func is not _helper_reraises_exception:

~/Documents/software/condainstall/anaconda3/lib/python3.8/multiprocessing/pool.py in mapstar()
46
47 def mapstar(args):
—> 48 return list(map(*args))
49
50 def starmapstar(args):

~/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/base.py in featurize_wrapper()
506 "the batch featurize() operation (e.g., "
507 "featurize_many(), featurize_dataframe(), etc.)."
→ 508 reraise(type(e), type(e)(msg), sys.exc_info()[2])
509
510 def featurize(self, *x):

~/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/six.py in reraise()
716 value = tp()
717 if value.__traceback__ is not tb:
→ 718 raise value.with_traceback(tb)
719 raise value
720 finally:

~/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/base.py in featurize_wrapper()
491 return list(self.featurize(*x)) + [float("nan")]
492 else:
→ 493 return self.featurize(*x)
494 except BaseException as e:
495 if ignore_errors:

~/Documents/software/condainstall/anaconda3/lib/python3.8/site-packages/matminer/featurizers/structure.py in featurize()
107
108 if "density" in self.features:
→ 109 output.append(s.density)
110
111 if "vpa" in self.features:

AttributeError: 'str' object has no attribute 'density'
TO SKIP THESE ERRORS when featurizing specific compounds, set 'ignore_errors=True' when running the batch featurize() operation (e.g., featurize_many(), featurize_dataframe(), etc.).

If I add ignore_errors=True, I get no values for 'density', 'vpa', and 'packing fraction' for the corresponding structures.
I do not know what is wrong. Please help me with this.
Best regards,
chunbo zhang

Hey @chunbo_zhang

You need to consider the format you are serializing the data to. CSV is a potentially lossy format that makes it difficult to serialize object data to and from disk (for example, pymatgen structures, which your data appears to include).
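To make the failure mode concrete, here is a minimal sketch of what happens to an object column on a CSV round trip. ToyStructure is a hypothetical stand-in for a pymatgen Structure (so the snippet runs without pymatgen), and its density value is arbitrary; only the pandas calls are real.

```python
import io

import pandas as pd


class ToyStructure:
    """Hypothetical stand-in for a pymatgen Structure (illustration only)."""

    def __init__(self, formula, density):
        self.formula = formula
        self.density = density  # arbitrary placeholder value, not real data

    def __str__(self):
        return f"Full Formula ({self.formula})"


df = pd.DataFrame({"structure": [ToyStructure("Nb8 Co2 Si2", 8.0)]})

# Round-trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
df_csv = pd.read_csv(buffer, index_col=0)

before = type(df["structure"].iloc[0]).__name__
after = type(df_csv["structure"].iloc[0]).__name__
has_density = hasattr(df_csv["structure"].iloc[0], "density")

print(before, "->", after)
print("has .density after reload:", has_density)
```

Checking type(df["structure"].iloc[0]) on your own dataframe right after read_csv is a quick way to confirm that this is what produces the 'str' object has no attribute 'density' error.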

You can serialize dataframes that include structures to and from JSON on disk quite reliably, but you need to use the dedicated matminer.utils.io functions:

from matminer.utils.io import load_dataframe_from_json, store_dataframe_as_json


# if you want to store a dataframe on disk
store_dataframe_as_json(df, "my_df_file.json")

# If you want to load a dataframe from disk
df = load_dataframe_from_json("my_df_file.json")

This is a much more reliable method than using DataFrame.to_csv or DataFrame.to_json.

Also note that when you use CSV, pandas converts the pymatgen structures to strings. These make sense to us as humans examining them in a CSV (lattice data like you posted), but they are not easily turned back into pymatgen Python objects. The same problem occurs with pandas to_json, so I strongly encourage you to use the load_dataframe_from_json and store_dataframe_as_json functions.
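For intuition about why a dictionary-based JSON route can rebuild objects when a printed string cannot, here is a rough sketch. pymatgen's Structure does provide as_dict() and from_dict(); the ToyCell class below is a hypothetical stand-in that only mimics that pattern, and I am assuming (not showing) that the matminer helpers rely on a similar representation internally.

```python
import json


class ToyCell:
    """Hypothetical stand-in for an object with as_dict()/from_dict()."""

    def __init__(self, species, abc):
        self.species = species
        self.abc = abc

    def as_dict(self):
        # Everything needed to rebuild the object, as plain JSON types.
        return {"species": self.species, "abc": self.abc}

    @classmethod
    def from_dict(cls, d):
        return cls(d["species"], d["abc"])

    @property
    def volume(self):
        # Valid only because this toy cell is assumed to be orthogonal.
        a, b, c = self.abc
        return a * b * c


original = ToyCell(["Nb", "Co", "Si"], [6.22178, 6.22178, 5.0224])

# Serialize the dictionary form, then rebuild a real object from it.
text = json.dumps(original.as_dict())
restored = ToyCell.from_dict(json.loads(text))

print(type(restored).__name__, restored.volume == original.volume)
```

The restored object has its attributes and methods again, which is exactly what the string written by to_csv lacks.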

Dear ardunn, thanks for your helpful suggestions.