Test of all OpenKIM models with Asap

Hi OpenKIM folks,

I have now made a test in Asap’s test suite that tests all installed OpenKIM models. I installed all available models and tested them. Only three models failed, which is much better than with OpenKIM v1 where I never got rid of the last memory management issues!

ex_model_Ar_P_Morse_MultiCutoff

Does not set a Destroy pointer. Is that not compulsory now?
2018-12-19:11:05:03CET * 6 * error * 0x7f8e62f0ad70 * KIM_ModelImplementation.cpp:2166 * Model supplied Create() routine did not set pointer for Destroy.

EAM_IMD_BrommerGaehler_2006B_AlNiCo__MO_128037485276_003
EAM_NN_Johnson_1988_Cu__MO_887933271505_002

They both fail the test that the force is the numerical derivative of the energy. The test works the following way:
  
* A random supported element with a known crystal structure among FCC, BCC, HCP, SC and diamond is chosen, and a host crystal is created.
  - The test is done with a 7 x 7 x 7 unit cell system with orthogonal axes, and a 1 x 2 x 7 system with non-orthogonal axes.

* If the model supports more than one element, an impurity element is chosen and a random number of atoms are replaced by impurity atoms.

* Then all atomic positions are moved a distance drawn from a normal distribution with a width of 0.1 A.

* The energy and forces are calculated. One atom is then moved 0.001 A, and the energy and forces are recalculated. The change in energy should be consistent with the average of the two forces.
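
The check described above can be sketched in a few lines of Python. This is only an illustration, not Asap's actual test code: the Morse pair potential stands in for a KIM model, and the names morse_energy_forces and force_check, the 3 x 3 x 3 cluster, the displacement direction and the tolerance are all assumptions made for the example.

```python
import math
import random


def morse_energy_forces(positions, d=1.0, a=1.5, r0=1.2):
    """Stand-in for a model: total Morse pair energy and per-atom forces."""
    n = len(positions)
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            rij = [positions[j][k] - positions[i][k] for k in range(3)]
            r = math.sqrt(sum(c * c for c in rij))
            e = math.exp(-a * (r - r0))
            energy += d * (1.0 - e) ** 2 - d
            dvdr = 2.0 * d * a * e * (1.0 - e)
            for k in range(3):
                # dV/dr > 0 means attraction: atom i is pulled towards atom j.
                forces[i][k] += dvdr * rij[k] / r
                forces[j][k] -= dvdr * rij[k] / r
    return energy, forces


def force_check(calc, positions, atom=0, direction=(1.0, 0.0, 0.0), step=0.001):
    """Return (actual, predicted) energy change when one atom is displaced."""
    e0, f0 = calc(positions)
    moved = [list(p) for p in positions]
    for k in range(3):
        moved[atom][k] += step * direction[k]
    e1, f1 = calc(moved)
    # Predicted change: minus the average of the two forces times the displacement.
    predicted = -sum(0.5 * (f0[atom][k] + f1[atom][k]) * step * direction[k]
                     for k in range(3))
    return e1 - e0, predicted


# Small simple-cubic cluster; every atom is rattled (normal distribution, width 0.1).
rng = random.Random(42)
cluster = [[1.2 * x + rng.gauss(0.0, 0.1),
            1.2 * y + rng.gauss(0.0, 0.1),
            1.2 * z + rng.gauss(0.0, 0.1)]
           for x in range(3) for y in range(3) for z in range(3)]

actual, predicted = force_check(morse_energy_forces, cluster)
consistent = abs(actual - predicted) < 1e-6  # tolerance is an assumed value
print(actual, predicted, consistent)
```

Averaging the forces before and after the move makes the error of the prediction scale as the cube of the step, so with a 0.001 step any mismatch far above that level suggests the forces are not the derivative of the energy, or that the energy is not smooth.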

EAM_IMD_BrommerGaehler_2006B_AlNiCo__MO_128037485276_003 passes this test if Al is the host material and Ni or Co is the impurity. It fails miserably for all other combinations.

EAM_NN_Johnson_1988_Cu__MO_887933271505_002 consistently fails this test (no impurity atom in this case).

Best regards

Jakob

I also build OpenKIM and Asap with Intel compilers, as Asap benefits tremendously from this.

Now a new group of models fails with a memory deallocation error, all in the same way. These are the models based on the drivers EAM_Magnetic2GQuintic__MD_543355979582_002 and EAM_MagneticCubic__MD_620624592962_002.

They all fail when the model object is destroyed by Asap, with this error:

forrtl: severe (173): A pointer passed to DEALLOCATE points to an object that cannot be deallocated
Image PC Routine Line Source
libifcoremt.so.5 00007F9AD0103488 for_dealloc_alloc Unknown Unknown
libkim-api-v2-mod 00007F9AA2F81DE0 destroy Unknown Unknown
libkim-api-v2.so. 00007F9AD0835A84 _ZN3KIM19ModelImp Unknown Unknown
libkim-api-v2.so. 00007F9AD08358F0 _ZN3KIM19ModelImp Unknown Unknown
libkim-api-v2.so. 00007F9AD0808D30 _ZN3KIM5Model7Des Unknown Unknown
_asap_p3.so 00007F9AD0B4AC8C _ZN6AsapNS17OpenK Unknown Unknown
_asap_p3.so 00007F9AD0B4AC3A _ZN6AsapNS17OpenK Unknown Unknown
_asap_p3.so 00007F9AD0B1B5E0 Unknown Unknown Unknown
libpython3.6m.so. 00007F9ADBC9E3B8 Unknown Unknown Unknown
libpython3.6m.so. 00007F9ADBC5EFA6 Unknown Unknown Unknown
libpython3.6m.so. 00007F9ADBC9E328 Unknown Unknown Unknown
libpython3.6m.so. 00007F9ADBC5BB64 PyDict_SetItem Unknown Unknown
libpython3.6m.so. 00007F9ADBD2891C _PyEval_EvalFrame Unknown Unknown
libpython3.6m.so. 00007F9ADBD1C91D PyEval_EvalCodeEx Unknown Unknown
libpython3.6m.so. 00007F9ADBD1BE59 PyEval_EvalCode Unknown Unknown
libpython3.6m.so. 00007F9ADBD6B1FE PyRun_FileExFlags Unknown Unknown
libpython3.6m.so. 00007F9ADBD6AC74 PyRun_SimpleFileE Unknown Unknown
libpython3.6m.so. 00007F9ADBD87F94 Py_Main Unknown Unknown
python 00000000004018F3 main Unknown Unknown
libc-2.17.so 00007F9ADAACB3D5 __libc_start_main Unknown Unknown
python3.6 0000000000401729 Unknown Unknown Unknown

Best regards

Jakob

Hi Jakob,

Thanks for the reports and testing!

See below.

Hi OpenKIM folks,

I have now made a test in Asap’s test suite that tests all installed OpenKIM models. I installed all available models and tested them. Only three models failed, which is much better than with OpenKIM v1 where I never got rid of the last memory management issues!

ex_model_Ar_P_Morse_MultiCutoff

Does not set a Destroy pointer. Is that not compulsory now?

Yes, it is compulsory. This is a bug in the ex_model_Ar_P_Morse_MultiCutoff example. Thanks for the report.

2018-12-19:11:05:03CET * 6 * error * 0x7f8e62f0ad70 * KIM_ModelImplementation.cpp:2166 * Model supplied Create() routine did not set pointer for Destroy.

EAM_IMD_BrommerGaehler_2006B_AlNiCo__MO_128037485276_003
EAM_NN_Johnson_1988_Cu__MO_887933271505_002

They both fail the test that the force is the numerical derivative of the energy. The test works the following way:

* A random supported element with a known crystal structure among FCC, BCC, HCP, SC and diamond is chosen, and a host crystal is created.
  - The test is done with a 7 x 7 x 7 unit cell system with orthogonal axes, and a 1 x 2 x 7 system with non-orthogonal axes.

* If the model supports more than one element, an impurity element is chosen and a random number of atoms are replaced by impurity atoms.

* Then all atomic positions are moved a distance drawn from a normal distribution with a width of 0.1 A.

* The energy and forces are calculated. One atom is then moved 0.001 A, and the energy and forces are recalculated. The change in energy should be consistent with the average of the two forces.

EAM_IMD_BrommerGaehler_2006B_AlNiCo__MO_128037485276_003 passes this test if Al is the host material and Ni or Co is the impurity. It fails miserably for all other combinations.

EAM_NN_Johnson_1988_Cu__MO_887933271505_002 consistently fails this test (no impurity atom in this case).

The first model is known to have problems. See the Verification Check Dashboard here:

https://openkim.org/dev-kim-item/EAM_IMD_BrommerGaehler_2006B_AlNiCo__MO_128037485276_003

The second is more surprising. It might be because of the non-smooth cutoff, but we would have to look into it a bit more to see if there is a bug or some other explanation.
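
As a purely illustrative sketch of how a non-smooth cutoff could trip a test like this (whether this is what happens with the Johnson model has not been established), consider a single pair whose separation crosses the cutoff during the 0.001 displacement. The Morse form, the cutoff value rc = 2.0 and the function names are assumptions made for the example.

```python
import math


def morse(r, d=1.0, a=1.5, r_eq=1.2):
    """Return (V, dV/dr) for a Morse pair potential."""
    e = math.exp(-a * (r - r_eq))
    return d * (1.0 - e) ** 2 - d, 2.0 * d * a * e * (1.0 - e)


def pair(r, rc=2.0, shifted=False):
    """Energy and radial force on the outer atom for a pair truncated at rc."""
    if r >= rc:
        return 0.0, 0.0
    v, dvdr = morse(r)
    if shifted:
        v -= morse(rc)[0]  # shift so that the energy is continuous at rc
    return v, -dvdr


def mismatch(shifted, rc=2.0, step=0.001):
    """|actual - predicted| energy change when the pair crosses the cutoff."""
    r_in = rc - 0.5 * step
    e0, f0 = pair(r_in, rc, shifted)
    e1, f1 = pair(r_in + step, rc, shifted)
    predicted = -0.5 * (f0 + f1) * step
    return abs((e1 - e0) - predicted)


print("plain truncation:", mismatch(shifted=False))
print("energy-shifted  :", mismatch(shifted=True))
```

With plain truncation the energy jumps by roughly V(rc) while the forces stay finite, so the energy change cannot match the force-based estimate. Shifting the energy removes the jump in this sketch, although the force itself is still discontinuous at the cutoff.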

Ryan

Hi Jakob,

Interesting. I tried to test this, but have been unable to reproduce it.

Can you send details? Compiler versions, system specs, etc.

This thread:

https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/390944

suggests it could be a compiler bug, but there are other things it could possibly be too...

The first step is to reproduce on our end....

Thanks,

Ryan

Hi Jakob,

Interesting. I tried to test this, but have been unable to reproduce it.

Can you send details? Compiler versions, system specs, etc.

10:31 [slid] ~$ ifort --version
ifort (IFORT) 18.0.3 20180410
Copyright (C) 1985-2018 Intel Corporation. All rights reserved.

10:31 [slid] ~$ icc --version
icc (ICC) 18.0.3 20180410
Copyright (C) 1985-2018 Intel Corporation. All rights reserved.

10:45 [slid] ~$ gcc --version
gcc (GCC) 7.3.0

It is a CentOS Linux release 7.6.1810, but with EasyBuild on top of it, so the entire toolchain, all modules, Python, virtually all libraries etc are built with EasyBuild.

This thread:

https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/390944

suggests it could be a compiler bug, but there are other things it could possibly be too...

Hmm, that thread is from 2013, and I am using the 2018 version of the compilers. But it may also be the code doing something dubious:

So is the statement

    call a%set(t_new())
valid in F2008 at all? Should I avoid using the result of a function as an argument this way and introduce temporary variable instead?

    class(t), pointer :: tmp
    tmp => t_new()
    call a%set(tmp)

And a comment:

In F2008 a variable can be a function reference that has a data pointer result (which I don't think ifort does yet, but I think gfortran does). A variable can be a target - whatever the function result points at has to have had the target attribute. So you have a target actual with a target dummy. So things are ...umm... different. After that - I get confused and give up. Part of my confusion resulted in a post to c.l.f.

And I don’t know fortran, so I have no opinion of my own.

The first step is to reproduce on our end....

Indeed. I have just released Asap 3.11.3 with support for OpenKIM v. 2.0.0.beta3. I’ll announce it once I have updated the web pages. You should be able to install it and build it with the Intel compilers by following the instructions here:

https://wiki.fysik.dtu.dk/asap/Installation#optimized-installation

Remember to install ASE first:
python3 -m pip install ase --user

Best regards

Jakob

OK, I have been able to test this on two different systems: one with v16.0 and one with v18.0 of the Intel compilers

With the 16.0 compilers there are no segfaults.

With the 18.0 compilers the OpenKIM_AllModels.py (with no blacklist) segfaults in this way:

(gdb) run OpenKIM_AllModels.py
Starting program: /panfs/roc/groups/9/elliotrs/relliott/miniconda3/bin/python OpenKIM_AllModels.py
[Thread debugging using libthread_db enabled]
Missing separate debuginfo for /panfs/roc/groups/9/elliotrs/relliott/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0
[New Thread 0x7fffe9dcc700 (LWP 10551)]
[New Thread 0x7fffe93cb700 (LWP 10552)]
[New Thread 0x7fffe69ca700 (LWP 10553)]
[New Thread 0x7fffe3fc9700 (LWP 10554)]
[New Thread 0x7fffe15c8700 (LWP 10555)]
[New Thread 0x7fffdebc7700 (LWP 10556)]
[New Thread 0x7fffdc1c6700 (LWP 10557)]
[Thread 0x7fffe15c8700 (LWP 10555) exited]
[Thread 0x7fffdc1c6700 (LWP 10557) exited]
[Thread 0x7fffe9dcc700 (LWP 10551) exited]
[Thread 0x7fffe69ca700 (LWP 10553) exited]
[Thread 0x7fffe93cb700 (LWP 10552) exited]
[Thread 0x7fffe3fc9700 (LWP 10554) exited]
[Thread 0x7fffdebc7700 (LWP 10556) exited]
Detaching after fork from child process 10561.
Detaching after fork from child process 10562.
Detaching after fork from child process 10563.

KIM model: EAM_Magnetic2GQuintic_ChiesaDerletDudarev_2011_Fe__MO_140444321607_002
  Potential info: with 5th order knot functions
Supported elements: ['Fe']
Generated a bcc system with 686 Fe-atoms and 0 None-atoms
Lattice constant a = 2.87

Program received signal SIGSEGV, Segmentation fault.
__libc_free (mem=0x61) at malloc.c:3716
3716 if (chunk_is_mmapped(p)) /* release mmapped memory. */
(gdb) bt
#0 __libc_free (mem=0x61) at malloc.c:3716
#1 0x00007fffecec3f2c in for_deallocate () from /panfs/roc/intel/x86_64/2018/parallel_studio_xe_msi/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64/libifcoremt.so.5
#2 0x00007fffee0d3dd4 in kim_model_compute_arguments_module_mp_kim_model_compute_arguments_get_neighbor_list_ () from /home/elliotrs/relliott/kim-api/build/test-install/lib64/libkim-api-v2.so.2
#3 0x00007fffe9bc6f6f in compute_energy_forces () from /home/elliotrs/relliott/kim-api/build/test-install/lib64/kim-api-v2/model-drivers/EAM_Magnetic2GQuintic__MD_543355979582_002/libkim-api-v2-model-driver.so
#4 0x00007fffee124cc4 in KIM::ModelImplementation::ModelCompute(KIM::ComputeArguments const*) const () from /home/elliotrs/relliott/kim-api/build/test-install/lib64/libkim-api-v2.so.2
#5 0x00007fffee124664 in KIM::ModelImplementation::Compute(KIM::ComputeArguments const*) const () from /home/elliotrs/relliott/kim-api/build/test-install/lib64/libkim-api-v2.so.2
#6 0x00007fffefac79ef in AsapNS::OpenKIMcalculator::DoCalculate (this=0x61) at OpenKIMimport/OpenKIMcalculator.cpp:630
#7 0x00007fffefac77bf in AsapNS::OpenKIMcalculator::Calculate (this=0x61, pyatoms=0x40000) at OpenKIMimport/OpenKIMcalculator.cpp:598
#8 0x00007fffefac4390 in AsapNS::OpenKIMcalculator::GetPotentialEnergy (this=0x61, pyatoms=0x40000) at OpenKIMimport/OpenKIMcalculator.cpp:311
#9 0x00007fffefab7f47 in PyAsap_PotentialGetPotentialEnergy (self=0x61, args=0x40000, kwargs=0x7fffffffb600) at Interface/PotentialInterface.cpp:720
#10 0x00005555556b7ea4 in _PyMethodDef_RawFastCallKeywords ()
#11 0x00005555556c0bef in _PyMethodDescr_FastCallKeywords ()
#12 0x000055555572cc68 in _PyEval_EvalFrameDefault ()
#13 0x0000555555666528 in _PyEval_EvalCodeWithName ()
#14 0x00005555556b7645 in _PyFunction_FastCallKeywords ()
#15 0x00005555557285b0 in _PyEval_EvalFrameDefault ()
#16 0x0000555555666528 in _PyEval_EvalCodeWithName ()
#17 0x00005555556673a4 in PyEval_EvalCodeEx ()
#18 0x00005555556673cc in PyEval_EvalCode ()
#19 0x0000555555781304 in run_mod ()
#20 0x0000555555789611 in PyRun_FileExFlags ()
#21 0x0000555555789804 in PyRun_SimpleFileExFlags ()
#22 0x000055555578b17d in pymain_main.constprop.327 ()
#23 0x000055555578b3f0 in _Py_UnixMain ()
#24 0x00007ffff7648d20 in __libc_start_main (main=0x555555646e20 <main>, argc=2, ubp_av=0x7fffffffc5e8, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffc5d8) at libc-start.c:226
#25 0x0000555555737e32 in _start ()

This looks, to me, different from the backtrace Jakob presented.

I've also found that this same segfault occurs with the Fortran example models and the example simulators. So it is not an ASE or Asap thing. Not sure why Jakob is not seeing similar problems with other models...

So it seems to be isolated to the v18.0 Intel compilers.

The backtrace indicates a fortran deallocate but there is no such deallocate in the code anywhere in the indicated functions. So, I would guess that the deallocation is some sort of internal fortran behavior.

So, it seems (to me, at the moment) to be an ifort compiler bug....

Ryan

Hi Ryan,

So it seems to be isolated to the v18.0 Intel compilers.

Is it 18.0.0 or 18.0.3 ?

The Intel compilers ending in .0.0 are usually surprisingly buggy: both 18.0.0 and 16.0.0 (I think it was) fail to compile Asap correctly unless optimization is turned off. Asap did something a bit unusual with pointers, which is perfectly legal, and both versions optimized it incorrectly. Others have also reported that these versions are broken.

The backtrace indicates a fortran deallocate but there is no such deallocate in the code anywhere in the indicated functions. So, I would guess that the deallocation is some sort of internal fortran behavior.

Probably some kind of temporary variable. The page you linked to seems to indicate that some kinds of temporary expressions are not allowed, and can produce this kind of error, but I really did not understand it.

So, it seems (to me, at the moment) to be an ifort compiler bug....

Normally, compiler bugs are very rare, and the cause is usually a subtle bug in the code rather than in the compiler. But my experience with the Intel compilers is unfortunately that compiler bugs are a real issue. That may be the price of the very aggressive and efficient optimization. And the Fortran compiler is probably less battle-tested than the C/C++ compiler.

So I guess you are probably right.

Best regards

Jakob

OK,

I can now confirm the error that Jakob found using v18.0.3.

In this case the example models and simulators work just fine, but I see exactly the same error messages as Jakob reported when using asap with the EAM_Magnetic2GQuintic_ChiesaDerletDudarev_2011_Fe__MO_140444321607_002 model.

So, this seems to definitely be an issue strongly tied to the ifort compiler (and specifically the 18.0.0 and 18.0.3 versions).

Ryan