Suggestion: Two levels of tests

Hi OpenKIM folks!

I would like to make a suggestion, at the risk of suggesting the obvious, or something that is already planned :slight_smile:

As I understand it, tests in OpenKIM play the role of checking the quality of the models by calculating quantities that can then be compared against experiments or quantum calculations. Just as there are model drivers and models, and most models will probably be implemented through model drivers, I understand that there will be a similar two-level structure for tests (simulators and tests?), so that all the nitty-gritty of making neighbor lists etc. can be taken care of once and for all, and writing the tests will be relatively simple.

I would like to suggest extending this to a three-level structure: simulators, tests and code-tests. The purpose of a code-test is to check the *implementation* of the models, the tests and the simulators, and in particular to catch when something unexpectedly stops working due to apparently unrelated changes.

A code-test could be a simple way of specifying a model, a test, an expected result and a tolerance. All code-tests should then be run automatically on a regular basis, and a red flag raised if a result falls outside the expected tolerance. Preferably, a contributor should be able to run all code-tests relating to a given model, model driver, test or simulator before submitting a new version.

This could probably be implemented with code-tests as simple text files, and a job that runs all the tests described by these files. But it would require that tests have a standard way of getting their input (the model) and presenting their results. I guess such a standard will be needed anyway for the OpenKIM web infrastructure.
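
To make it concrete, something along these lines is what I have in mind. The file format, the item names and the way a test is invoked are all made up here purely for illustration:

```python
# Hypothetical code-test specification, stored as a simple text file
# (one "key = value" pair per line):
#
#   model     = example_model_Ar_LJ
#   test      = example_test_fcc_lattice_constant
#   expected  = 5.2478      # previously computed value
#   tolerance = 1e-6        # allowed deviation

import subprocess


def run_code_test(spec_file):
    """Run one code-test described by a spec file and report pass/fail."""
    spec = {}
    with open(spec_file) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments and blanks
            if not line:
                continue
            key, value = (s.strip() for s in line.split("=", 1))
            spec[key] = value

    # Pretend the test is an executable that takes a model name and prints
    # a single number on stdout -- this interface is invented for the sketch.
    out = subprocess.run([spec["test"], spec["model"]],
                         capture_output=True, text=True, check=True)
    result = float(out.stdout.strip())

    ok = abs(result - float(spec["expected"])) <= float(spec["tolerance"])
    print(f"{spec_file}: {'PASS' if ok else 'FAIL'} (got {result})")
    return ok
```

A nightly job would then just loop over all such files and collect the failures.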

Best regards

Jakob

Jakob:

We’ve been planning several code-tests: for example, tests to check whether the potential forces are the derivatives of the energies. Is there a reason to distinguish them from regular tests? I even think it might be good to store their results in the database…
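
As a rough sketch of such a check (the `energy` and `forces` callables here are just placeholders for whatever interface a Model actually exposes):

```python
import numpy as np


def check_forces(energy, forces, positions, h=1e-5, tol=1e-6):
    """Verify that forces equal minus the numerical gradient of the energy.

    `energy(positions)` returns a float, `forces(positions)` an (N, 3)
    array; both names are stand-ins for a Model's real interface.
    """
    analytic = forces(positions)
    numeric = np.zeros_like(positions)
    for i in range(positions.shape[0]):
        for k in range(3):
            plus = positions.copy()
            minus = positions.copy()
            plus[i, k] += h
            minus[i, k] -= h
            # Central difference: F = -dE/dx
            numeric[i, k] = -(energy(plus) - energy(minus)) / (2 * h)
    max_err = np.abs(analytic - numeric).max()
    return max_err < tol, max_err
```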

Jim

Hi everyone,

Jim is right, we have already been planning to have things like the derivatives check. As Jim says, there is not necessarily a reason to distinguish these from regular Tests. However, we are planning to do so, at least in some situations, and we have been calling them "verification checks". The reason for making this distinction is that we are planning for these to be checks that are run and must be passed (or else there must be a good reason for not passing them) before a Model is officially accepted into the openKIM system...

Notice, however, that all of this does not actually directly address Jakob's suggestion. Jakob was (I think) actually suggesting a different kind of check: one that can apply to Tests as well as Models (unlike the verification checks discussed above).

This seems like a good idea. In fact, let me modify Jakob's suggestion a bit. It seems to me that the idea is to have a way to notice when the results of a calculation have changed. That way, if they should not have changed, one can identify a previously hidden bug and start to track it down.

I think what this boils down to is the following: We would like a way to take two sets of results/output (Predictions) from a Test (or Verification Check) and compute a real number that represents the "sameness" of these two results. We would also like to have a tolerance that can be used to define "equal".

It seems to me that there is no hope (or at least very little hope) that this can be defined in general, so it would best be defined on a Test/Verification by Test/Verification basis. We could then encourage Test/Verification writers to also provide a script of some sort that takes in two sets of results and prints their sameness value. The same Test/Verification writer should also provide the tolerance value that can be used to make an equal/not-equal determination.
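
As an illustration only, such a script could be as simple as the following. The result-file format (here a flat list of "name value" pairs) and the tolerance are of course up to the individual Test writer:

```python
#!/usr/bin/env python
"""Hypothetical 'sameness' script for one particular Test.

Reads two result files and prints a single non-negative number:
0.0 means identical, larger means more different, followed by an
equal/not-equal verdict based on the published tolerance.
"""
import sys

TOLERANCE = 1e-8   # published alongside the Test


def read_results(path):
    # Each non-blank line is assumed to be "quantity_name value".
    with open(path) as f:
        return {name: float(value)
                for name, value in (line.split() for line in f if line.strip())}


def sameness(a, b):
    # Largest relative difference over all quantities the Test reports.
    return max(abs(a[k] - b[k]) / max(abs(a[k]), 1e-30) for k in a)


if __name__ == "__main__":
    old, new = (read_results(p) for p in sys.argv[1:3])
    value = sameness(old, new)
    print(value, "EQUAL" if value < TOLERANCE else "NOT EQUAL")
```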

This can certainly be used to check when the result of a single Test/Model pairing changes, which would indicate either that there is a bug or that some intentional update has been made to the Test or the Model. But it could also be used to identify different Models that produce the same result for a given Test.

Anyway, this would seem to eliminate the need to define a standard way for Tests to communicate. (Although such a standard may ultimately be necessary for the openKIM system anyway...)

Your thoughts are welcome...

Cheers,

Ryan

Dear Jim and Ryan,

Thanks for your replies!

Jim: I think that there *is* a need to distinguish these two kinds of test, as they answer two completely different questions. The tests in the database answer the question "Is this model a good model for calculating that quantity?". The code-tests (unit tests) answer the question "Is the code buggy?". The only reason to bunch them together is that the existence of the former kind of test makes the latter kind easy, bordering on trivial, to implement.

It is clear that consistency checks like numerical derivatives are one part of this code testing, and should be performed for all models. But it is easy to introduce bugs that produce wrong but self-consistent energies and forces. I actually managed to demonstrate that more than once during the workshop! :slight_smile: In particular, most errors in the neighbor-list generation or handling will produce self-consistent errors.

The developer of a model will be able to use existing tests to generate code-tests for his model, and vice versa for the test developers. The existence of such a test suite is essential for reliable code. We have used this extensively in, for example, our GPAW DFT package, where a test suite is run every night on the current SVN version. Hardly a week passes without something coming up, typically unexpected side effects of changes or bug fixes! Developers also run these tests on their own machines, sometimes finding architecture-dependent issues. In fact, these unit tests have become so important that there are even coverage tests checking that every line of code is executed at least once by the unit tests (or somebody understands why it is not necessary to execute that line). Coverage tests in a multi-language environment will be a challenge, however (to put it mildly).

The vast majority of the GPAW tests are actually calculations of real values, and writing them requires some effort. The OpenKIM infrastructure, on the other hand, should make it virtually trivial to write the vast majority of these, as it will just be a question of checking that the value of some already existing calculation is the same as it used to be. It might not even be necessary to do anything - one could in principle recalculate the whole KIM database, look for changed values, and then raise a red flag. Unfortunately, "changed" is not well-defined.

Since the KIM infrastructure will need some standardized way of collecting values from tests anyway, why not use it to make such tests easy to write? This is not urgent, though :slight_smile:

Best regards

Jakob

Jakob:

That sounds fantastic. I think we all agree that code-tests like this are crucial, although your experience suggests that they can be much more important than I had realized. (How do you write a "coverage test"?) I would support a separate name for them, even if the interface and functionality are similar, just to emphasize that we agree they are crucial for a professional code. (Part of our mission is to set standards for good potential programming practice, after all.) What does everyone think?

Jim

Writing a coverage test yourself is not a trivial task, and I have no idea how to do it.

GPAW actually only does coverage testing for the part implemented in Python. There are preexisting tools for that; the most widely used is apparently coverage.py (see http://nedbatchelder.com/code/coverage/). I think it is also possible to use various kinds of profiling tools for coverage testing of compiled code, but I have no experience with that.
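
For what it is worth, the basic pattern with coverage.py looks roughly like this (the API has varied a bit between versions, so take this as a sketch; on the command line the equivalent is `coverage run ...` followed by `coverage report`):

```python
import coverage


def run_test_suite():
    """Placeholder for the project's real test-suite entry point."""
    pass


cov = coverage.Coverage()       # older releases spell this coverage.coverage()
cov.start()
run_test_suite()                # run the tests while coverage is recording
cov.stop()
cov.save()
cov.report(show_missing=True)   # lists the lines that were never executed
```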

Best regards

Jakob