
Testing asymptotic model outputs #504

Closed · sbenthall opened this issue Feb 9, 2020 · 10 comments
@sbenthall
Contributor

@llorracc writes here:

I think we need to confront an issue that we have talked about many times but not reached any answer to: We need to figure out a way to devise some tests for whether the SUBSTANTIVE outputs of anything change when there has been a code change. Travis will tell us whether any code 'breaks' in the sense of not executing, but it is likely that at some point someone will make what they think is an innocuous "cleanup" push which will pass Travis but will meaningfully (and wrongly) change quantitative results.

We should start with some one particular example, a REMARK. If necessary, we can modify it to construct all of its output in some simple machine-readable way that excludes meaningless stuff like creation timestamps etc. Then the test would be basically whether the new output files generated by rerunning it (say, the do_all.sh script) are different at all from the output files generated on the last run. If so, we would flag it as an error.

I'm about to do a bit of work on BufferStockTheory, and will see if that is the REMARK we should use to start with.

@sbenthall
Contributor Author

@llorracc I agree that this would be a good feature to support:

We should start with some one particular example, a REMARK. If necessary, we can modify it to construct all of its output in some simple machine-readable way that excludes meaningless stuff like creation timestamps etc. Then the test would be basically whether the new output files generated by rerunning it (say, the do_all.sh script) are different at all from the output files generated on the last run. If so, we would flag it as an error.

In particular, I think standardizing a machine-readable way of testing model outputs is important.
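As a rough illustration of what such a standard could look like, here is a minimal sketch of an output-comparison test: rerun a REMARK's do_all.sh, hash the regenerated output files with timestamp-like strings stripped out, and compare against a stored baseline. The directory names, file pattern, and timestamp regex below are assumptions for illustration, not an existing REMARK convention.

```python
# Sketch: compare regenerated REMARK outputs against a stored baseline,
# ignoring volatile metadata such as creation timestamps.
# The paths, file pattern, and timestamp regex are illustrative assumptions.
import hashlib
import re
import subprocess
from pathlib import Path

TIMESTAMP = re.compile(rb"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def normalized_digest(path: Path) -> str:
    """Hash a file's contents with timestamp-like strings blanked out."""
    data = TIMESTAMP.sub(b"<TIMESTAMP>", path.read_bytes())
    return hashlib.sha256(data).hexdigest()


def test_remark_outputs_unchanged():
    # Rerun the REMARK; assume it writes its results into ./output.
    subprocess.run(["bash", "do_all.sh"], check=True)

    baseline = {p.name: normalized_digest(p) for p in Path("baseline").glob("*.csv")}
    fresh = {p.name: normalized_digest(p) for p in Path("output").glob("*.csv")}

    assert fresh == baseline, "substantive REMARK output changed"
```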

That said, I see some of the other points you've raised differently. In particular:

Travis will tell us whether any code 'breaks' in the sense of not executing, but it is likely that at some point someone will make what they think is an innocuous "cleanup" push which will pass Travis but will meaningfully (and wrongly) change quantitative results.

I take issue with a few points here:

  1. Travis should be running the unit test suite, which is defined here. Unit tests should do more than simply test whether or not code "executes". They should also test the functionality of the code to make sure that, on well-understood cases covering a representative range of the functional scope, the code gives correct results (see the sketch after this list).

  2. If there are not good unit tests that confirm the functionality on simple cases, then there is no reason besides independent confidence in the quality of the code to think that current substantive REMARK results are correct.

  3. REMARKs, which are intended to be static representations of research output, can guarantee that the results stay static despite changes to the underlying library by having their HARK dependency pegged to a specific release. Indeed, this is the correct way to keep REMARK output static, because REMARKs are archival. It is not a good idea to use REMARK results, per se, as tests of HARK library functionality, though in principle one could be adapted into an automated test.
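To make the first point above concrete, here is a minimal sketch of a "well-understood case" unit test: check a numerical routine against an answer known in closed form. The function under test is a stand-in for whatever piece of library code the test covers.

```python
# Sketch of a "well-understood case" unit test: compare a numerical
# routine against a known closed-form answer. discounted_sum() is a
# stand-in for the library code actually under test.
import unittest


def discounted_sum(payment, R, T):
    """Present value of `payment` received for T periods at gross rate R."""
    return sum(payment / R ** t for t in range(1, T + 1))


class WellUnderstoodCaseTest(unittest.TestCase):
    def test_matches_geometric_series_formula(self):
        payment, R, T = 1.0, 1.03, 50
        closed_form = payment * (1 - R ** -T) / (R - 1)
        self.assertAlmostEqual(discounted_sum(payment, R, T), closed_form, places=10)


if __name__ == "__main__":
    unittest.main()
```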

Long story short, if you're concerned about the stability of the code functionality given minor changes, the right thing to do is:

  • ensure the adequacy of the current unit test suite
  • have an automated testing requirement as part of the new contribution policy.

@sbenthall
Contributor Author

I see now that HARK is missing a lot of unit tests.
There are no unit tests on ConsumptionSaving classes, for example.

@MridulS
Member

MridulS commented Feb 9, 2020 via email

@sbenthall
Contributor Author

Oh awesome @MridulS

I made a PR for adding just one test -- #506 -- so @llorracc can see what that would look like.
It would be easy enough to do something like that for all the ConsumptionSaving types.

If those were in place, that would do a lot to catch whether a change to the HARK library code was affecting any substantive results.
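Roughly, such a test pins a solved model's output to a previously vetted number. The sketch below assumes the current IndShockConsumerType API (default construction, solve(), and solution[0].cFunc); the expected value is a placeholder rather than the actual number used in #506.

```python
# Rough sketch of a per-model regression test for a ConsumptionSaving type.
# The expected value is a placeholder; a real test would pin it to a
# hand-checked (or previously vetted) number.
import unittest

from HARK.ConsumptionSaving.ConsIndShockModel import IndShockConsumerType


class IndShockConsumerTypeTest(unittest.TestCase):
    def test_consumption_function_is_stable(self):
        agent = IndShockConsumerType()  # default calibration
        agent.solve()
        # Consumption at market resources m = 1.0 in the first solved period
        # should not drift when the library internals change.
        c_at_one = agent.solution[0].cFunc(1.0)
        self.assertAlmostEqual(c_at_one, 0.88, places=2)  # placeholder value
```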

@llorracc
Collaborator

llorracc commented Feb 9, 2020

Let me take issue with some of the issues you took issue with:

"1. If there are not good unit tests that confirm the functionality on simple cases, then there is no reason besides independent confidence in the quality of the code to think that current substantive REMARK results are correct."

I'm all for unit tests; fine. But the implicit argument here is "if there is a merge that changes our results, we have no idea whether the old ones or the new ones are more likely to be right." Nonsense. The existing code has been vetted and gone over many times by many people, and in most cases produces results that are very similar to results that are well understood in many papers in the published literature. If a merge request changes the results substantially, we DO have a strong suspicion that the new results are very likely to be wrong, because they have not been examined nearly as carefully as the old ones. At a minimum, we want to examine them -- the best case scenario is that we discover a bug that has been missed by everybody, and furthermore has been independently created by all the other similar papers, in which case a major publication is in the offing.

The kinds of results I'm talking about are things like "as wealth goes to infinity, the portfolio share approaches the value in the Merton-Samuelson model." With all due respect to Travis, this is not the kind of thing that it does.
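That kind of limiting result can still be encoded as an ordinary automated test, though. The sketch below checks that a solved model's risky portfolio share at very large wealth is close to the Merton-Samuelson share, i.e. the risk premium divided by (CRRA times the variance of the risky return). The helper names and parameter values are illustrative; a real test would plug in the solved model's own share function and calibration.

```python
# Sketch of a substantive/asymptotic test: as wealth grows large, the
# model's risky portfolio share should approach the Merton-Samuelson share.
# `solved_share` is a stand-in for whatever the solved model exposes,
# and all parameter values here are illustrative.
import numpy as np


def merton_samuelson_share(risk_prem, crra, risky_std):
    """Frictionless optimal risky share: risk premium / (CRRA * variance)."""
    return risk_prem / (crra * risky_std ** 2)


def check_merton_samuelson_limit(solved_share, crra, risk_prem, risky_std,
                                 big_wealth=1e6, atol=1e-3):
    """Assert that the model's share at very large wealth is near the
    Merton-Samuelson share (capped at 1 if the model rules out leverage)."""
    limit = min(merton_samuelson_share(risk_prem, crra, risky_std), 1.0)
    assert np.isclose(solved_share(big_wealth), limit, atol=atol)
```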

"REMARKs, which are intended to be static representations of research output, can guarantee that the results stay static despite changes to the underlying library by having their HARK dependency pegged to a specific release. Indeed, this is the correct way to keep REMARK output static, because REMARKs are archival. It is not a good idea to use REMARK results, per se, as tests of HARK library functionality, though in principle one could be adapted into an automated test."

Right now, zero percent of our REMARKs are in that category. Saying that it is not useful to run tests for whether new code merges break them is assuming that everything about them is already perfect (or at least frozen).

And, actually, for REMARKs that DO reach the "permanent archive" stage, that will be because we have considerable confidence that their substantive results are right. If anything, such REMARKs are going to be MORE useful for finding problems with new code than work-in-progress REMARKs will be, since the latter might well have their own bugs.

The only (important) caveat to this is that sometimes we will make "breaking changes" to the code base. But, even for that case, the automatic test of vetted and reliable substantive results is useful, because if results for an archived REMARK change as a result of a "breaking change" we have two options: either we deliberately mark that REMARK as compatible with "all releases earlier than" whichever is the breaking change, or we pay some (appropriate) attention to whether the "breaking change" is worth doing. (I'm thinking here more of cases where we had not REALIZED that it was a breaking change, than of cases where we had previously made a deliberate decision to make a breaking change. We should at least know that the change is a breaking one.)

To be clear, I'm NOT saying that we should be UPDATING the frozen REMARKs to use new versions of the code; it's fine for the "master" version of them to remain frozen. But that doesn't mean that we can't test whether the frozen code still works with a new release.

In any case, I'd be nervous about merging in lots of apparently small changes over a short span of time without a single "substantive" test in place. We should have at least one or two such substantive tests before we make lots of little, seemingly-innocuous changes throughout the codebase; precisely because they seem innocuous, those are the hardest kind of change to pin down if they DO alter something subtle but important that never occurred to the person making the change.

@sbenthall
Contributor Author

sbenthall commented Feb 10, 2020 via email

@llorracc
Collaborator

llorracc commented Feb 10, 2020 via email

@sbenthall
Contributor Author

Ok, I think I see what you're getting at now.
I just sent off a question to a friend of mine who does computational research support for astrophysicists.
That's my best lead on this.

@sbenthall
Contributor Author

Pytest supports approximate equality. h/t @MridulS
https://stackoverflow.com/questions/8560131/pytest-assert-almost-equal
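For example (the number being pinned and the tolerance below are arbitrary):

```python
# pytest.approx lets a test pin a substantive number without requiring
# bit-for-bit equality; the value and tolerance here are arbitrary.
import pytest


def test_marginal_propensity_to_consume():
    mpc = 0.0963947  # stand-in for a value computed from a solved model
    assert mpc == pytest.approx(0.0964, rel=1e-3)
```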


@sbenthall sbenthall changed the title Testing/handling of changes to substantive research output when library code changes Testing asymptotic model outputs Mar 11, 2020
@MridulS MridulS closed this as completed Aug 24, 2020