Testing asymptotic model outputs #504

sbenthall · 2020-02-09T19:02:07Z

I think we need to confront an issue that we have talked about many times but not reached any answer to: We need to figure out a way to devise some tests for whether the SUBSTANTIVE outputs of anything change when there has been a code change. Travis will tell us whether any code 'breaks' in the sense of not executing, but it is likely that at some point someone will make what they think is an innocuous "cleanup" push which will pass Travis but will meaningfully (and wrongly) change quantitative results.

We should start with some one particular example, a REMARK. If necessary, we can modify it to construct all of its output in some simple machine-readable way that excludes meaningless stuff like creation timestamps etc. Then the test would be basically whether the new output files generated by rerunning it (say, the do_all.sh script) are different at all from the output files generated on the last run. If so, we would flag it as an error.

I'm about to do a bit of work on BufferStockTheory, and will see if that is the REMARK we should use to start with.

sbenthall · 2020-02-09T19:17:25Z

@llorracc I agree that this would be a good feature to support:

We should start with some one particular example, a REMARK. If necessary, we can modify it to construct all of its output in some simple machine-readable way that excludes meaningless stuff like creation timestamps etc. Then the test would be basically whether the new output files generated by rerunning it (say, the do_all.sh script) are different at all from the output files generated on the last run. If so, we would flag it as an error.

In particular, I think standardizing a machine-readable way of testing model outputs is important.

That said, on some other points you've raised I see things differently from you. In particular:

Travis will tell us whether any code 'breaks' in the sense of not executing, but it is likely that at some point someone will make what they think is an innocuous "cleanup" push which will pass Travis but will meaningfully (and wrongly) change quantitative results.

I take issue with a few points here:

Travis should be running the unit test suite, which is defined here. Unit tests should do more than simply test whether or not code '"executes". They should also test the functionality of the code to make sure that on well-understood cases that cover a representative range of functional scope, the code is giving the correct results.
If there are not good unit tests that confirm the functionality on simple cases, then there is no reason besides independent confidence in the quality of the code to think that current substantive REMARK results are correct.
REMARKs, which are intended to be static representations of research output, can guarantee that the results stay static despite changes to the underlying library by having their HARK dependency pegged to a specific release. Indeed, this is the correct way to keep REMARK output static, because REMARKs are archival. It is not a good idea to use REMARK results, per se, as tests of HARK library functionality, though in principle one could be adapted into an automated test.

Long story short, if you're concerned about the stability of the code functionality given minor changes, the right thing to do is:

ensure the adequacy of the current unit test suite
have an automated testing requirement as part of the new contribution policy.

sbenthall · 2020-02-09T19:34:18Z

I see now that HARK is missing a lot of unit tests.
There are no unit tests on ConsumptionSaving classes, for example.

MridulS · 2020-02-09T21:50:34Z

I have started creating some unit tests for Consumption Saving classes in the refactor PR.

…

On 10-Feb-2020, at 1:04 AM, Sebastian Benthall ***@***.***> wrote: I see now that HARK is missing a lot of unit tests. There are no unit tests on ConsumptionSaving classes, for example. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#504?email_source=notifications&email_token=ABI5RFG7VKV5YUIYY73K7S3RCBLDXA5CNFSM4KSDTWX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELGVYJQ#issuecomment-583883814>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABI5RFHNJAET6NE7Y3QBMDTRCBLDXANCNFSM4KSDTWXQ>.

…on-ark#504

sbenthall · 2020-02-09T22:10:42Z

Oh awesome @MridulS

I made a PR for adding just one test -- #506 -- so @llorracc can see what that would look like.
It would be easy enough to do something like that for all the ConsumptionSaving types.

If those were in place, that would do a lot to catch whether a change to the HARK library code was effecting any substantive results.

llorracc · 2020-02-09T23:41:20Z

Let me take issue with some of the issues you took issue with:

"1. If there are not good unit tests that confirm the functionality on simple cases, then there is no reason besides independent confidence in the quality of the code to think that current substantive REMARK results are correct."

I'm all for unit tests; fine. But the implicit argument here is "if there is a merge that changes our results, we have no idea whether the old ones or the new ones are more likely to be right." Nonsense. The existing code has been vetted and gone over many times by many people, and in most cases produces results that are very similar to results that are well understood in many papers in the published literature. If a merge request changes the results substantially, we DO have a strong suspicion that the new results are very likely to be wrong, because they have not been examined nearly as carefully as the old ones. At a minimum, we want to examine them -- the best case scenario is that we discover a bug that has been missed by everybody, and furthermore has been independently created by all the other similar papers, in which case a major publication is in the offing.

The kinds of results I'm talking about are things like "as wealth goes to infinity, the portfolio share approaches the value in the Merton-Samuelson model." With all due respect to Travis, this is not the kind of thing that it does.

"REMARKs, which are intended to be static representations of research output, can guarantee that the results stay static despite changes to the underlying library by having their HARK dependency pegged to a specific release. Indeed, this is the correct way to keep REMARK output static, because REMARKs are archival. It is not a good idea to use REMARK results, per se, as tests of HARK library functionality, though in principle one could be adapted into an automated test."

Right now, zero percent of our REMARKs are in that category. Saying that it is not useful to run tests for whether new code merges break them is assuming that everything about them is already perfect (or at least frozen).

And, actually, for REMARKs that DO reach the "permanent archive" stage, that will be because we have considerable confidence that their substantive results are right. If anything, such REMARKs are going to be MORE useful for finding problems with new code than work-in-progress REMARKs will be, since the latter might well have their own bugs. The only (important) caveat to this is that sometimes we will make "breaking changes" to the code base. But, even for that case, the automatic test of vetted and reliable substantive results is useful, because if results for an archived REMARK change as a result of a "breaking change" we have two options: Either we deliberately mark that REMARK as compatible with "all releases earlier than" whichever is the breaking change, or we pay some (appropriate) attention to whether the "breaking change" is worth doing. (I'm thinking here more of cases where we had not REALIZED that it was a breaking change, than of cases where we had previously made a deliberate decision to make a breaking change. We should at least know that the change is a breaking one).

To be clear, I'm NOT saying that we should be UPDATING the frozen REMARKs to use new versions of the code; it's fine for the "master" version of them to remain frozen. But that doesn't mean that we can't test whether the frozen code still works with a new release.

In any case, I'd be nervous about merging in lots of apparently small changes over a short span of time, without a single "substantive" test in place. We should have at least one or two such substantive tests in place before we make, especially, lots of little and seemingly-innocuous changes throughout the codebase; precisely because they are seemingly innocuous, those are the hardest kind to pin down if they DO make some subtle but important change that did not occur to the person who made the code change precisely because it was subtle.

sbenthall · 2020-02-10T02:03:12Z

I think we agree:

- it's good to have more automated test coverage before making changes to the core source code. When I proposed my changes earlier, I didn't realize that test coverage was currently so poor. I'll add some tests to my PRs to try to track the functionality that I'm trying to preserve while refactoring.

kinds of results I'm talking about are things like "as wealth goes to infinity, the portfolio share approaches the value in the Merton-Samuelson model." With all due respect to Travis, this is not the kind of thing that it does.

I think you may be underestimating what it's possible to do with automated testing and Travis. Travis's full name is "Travis Continuous Integration". It's designed for coordinating efforts between very large teams of people continuous building and improving on systems deployed to thousands of users in production. I think that from a software perspective, what you are describing amounts to running some code and testing to see if the result matches expectations. That's absolutely what any automated test would do. So, let's assume you've written the code to (a) run this simulation and (b) test the result. Where would it be best to put this test? If it doesn't require much code besides what's in HARK to run, it would make sense to include it in HARK's test suite. It wouldn't technically be a "unit" test (it's a test of a different kind), but it could happily live in the test suite with the other tests. If it *does* require a lot of code external to HARK, then I suppose it should go in a different repository. We could make an issue in the REMARK repository and continue the discussion of REMARK testing there. A key issue, in my view, is that if you are writing a test for (HARK + ExtraCode), then if the test is positive, that does not guarantee either that HARK or ExtraCode work entirely as expected. It could be a false positive based on their interaction, or because of specific parameters used. Similarly, if such a test came up negative, you wouldn't know if it was HARK or ExtraCode that was the problem. That's why in software testing, it's generally a good idea to get coverage on the simplest, most understandable units first, then build the tests up. In general, having good automated tests, along with good documentation and clean design, are indicators of software quality. Overconfidence in software is FAR more common than underconfidence in software. The better documented, the more clearly written, and better tested the code is, the more attractive it will be to other users and contributors. I understand that you are nervous about things changing; I think that's partly because as it is currently, the software is fragile. I think you do understand that I am trying to make changes that will improve the robustness of the project moving forward.

llorracc · 2020-02-10T09:16:02Z

I wasn't suggesting that the kinds of tests I have in mind could not be done with Travis. And they do not envision using any code outside of Econ-ARK/HARK (plus the code of the REMARK being tested). We've had a number of inconclusive conversations along the lines of "if we wanted to test whether the portfolio share is targeting its analytical limit as wealth goes to infinity, how exactly would we set that up? It never _reaches_ infinity, and various numerical issues could cause substantively small differences in the exact last few digits of the output. Even rounding errors could matter, so we can't just compare the outputs byte-by-byte and declare failure if there is not an exact match." There was never any doubt that we could use Travis for this; what we were unclear about is what the best way would be to formulate such tests, and especially to naturally integrate them into the automatic output of the remark. I feel sure that other scientific computing projects must have done things like this extensively before. Like, presumably astrophysicists studying black holes test whether when a simulated particle crosses the event horizon, the simulation ever shows it popping back out (except as "Hawking radiation!"). @sebastian Benthall <[email protected]> @mridul Seth <[email protected]> if you could do some research about how other fields do this before our meeting Thu, that would help us reach a strategy. I will give some thought to the particular example I want to start with -- probably BufferStockTheory, since I am making final revisions to it right now.

…

On Mon, Feb 10, 2020 at 3:03 AM Sebastian Benthall ***@***.***> wrote: > > I think we agree: - it's good to have more automated test coverage before making changes to the core source code. When I proposed my changes earlier, I didn't realize that test coverage was currently so poor. I'll add some tests to my PRs to try to track the functionality that I'm trying to preserve while refactoring. > kinds of results I'm talking about are things like "as wealth goes to > infinity, the portfolio share approaches the value in the Merton-Samuelson > model." With all due respect to Travis, this is not the kind of thing that > it does. > I think you may be underestimating what it's possible to do with automated testing and Travis. Travis's full name is "Travis Continuous Integration". It's designed for coordinating efforts between very large teams of people continuous building and improving on systems deployed to thousands of users in production. I think that from a software perspective, what you are describing amounts to running some code and testing to see if the result matches expectations. That's absolutely what any automated test would do. So, let's assume you've written the code to (a) run this simulation and (b) test the result. Where would it be best to put this test? If it doesn't require much code besides what's in HARK to run, it would make sense to include it in HARK's test suite. It wouldn't technically be a "unit" test (it's a test of a different kind), but it could happily live in the test suite with the other tests. If it *does* require a lot of code external to HARK, then I suppose it should go in a different repository. We could make an issue in the REMARK repository and continue the discussion of REMARK testing there. A key issue, in my view, is that if you are writing a test for (HARK + ExtraCode), then if the test is positive, that does not guarantee either that HARK or ExtraCode work entirely as expected. It could be a false positive based on their interaction, or because of specific parameters used. Similarly, if such a test came up negative, you wouldn't know if it was HARK or ExtraCode that was the problem. That's why in software testing, it's generally a good idea to get coverage on the simplest, most understandable units first, then build the tests up. In general, having good automated tests, along with good documentation and clean design, are indicators of software quality. Overconfidence in software is FAR more common than underconfidence in software. The better documented, the more clearly written, and better tested the code is, the more attractive it will be to other users and contributors. I understand that you are nervous about things changing; I think that's partly because as it is currently, the software is fragile. I think you do understand that I am trying to make changes that will improve the robustness of the project moving forward. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#504?email_source=notifications&email_token=AAKCK74T7WJXMQSOU6BNO6TRCCYWDA5CNFSM4KSDTWX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELHADOQ#issuecomment-583926202>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAKCK72DUNJH3N7RJSBHZ7DRCCYWDANCNFSM4KSDTWXQ> .

-- - Chris Carroll

sbenthall · 2020-02-10T13:40:21Z

Ok, I think I see what you're getting at now.
I just sent off a question to a friend of mine who does computational research support for astrophysicists.
That's my best lead on this.

sbenthall · 2020-02-10T15:53:28Z

Pytest supports approximate equality. h/t/ @MridulS
https://stackoverflow.com/questions/8560131/pytest-assert-almost-equal

sbenthall · 2020-02-10T15:57:51Z

Fuzz testing:
https://wiki.python.org/moin/PythonTestingToolsTaxonomy#Fuzz_Testing_Tools

sbenthall mentioned this issue Feb 9, 2020

[WIP] clean 'newborn' code on IndShockConsumerType.getShocks(), fixes #494 #499

Closed

sbenthall added a commit to sbenthall/HARK that referenced this issue Feb 9, 2020

adding test for default solution to PerfForesightConsumerType. See ec…

c8dffd8

…on-ark#504

sbenthall mentioned this issue Feb 9, 2020

adding test for default solution to PerfForesightConsumerType #506

Merged

This was referenced Feb 10, 2020

Tests for IndShockConsumerType based on BufferStock #510

Closed

IndShockConsumerType checkConditions() returns nothing, always #525

Closed

sbenthall mentioned this issue Feb 21, 2020

AIC condition is not checked on IndShockConsumerType #534

Closed

sbenthall changed the title ~~Testing/handling of changes to substantive research output when library code changes~~ Testing asymptotic model outputs Mar 11, 2020

MridulS closed this as completed Aug 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Testing asymptotic model outputs #504

Testing asymptotic model outputs #504

sbenthall commented Feb 9, 2020

sbenthall commented Feb 9, 2020

sbenthall commented Feb 9, 2020

MridulS commented Feb 9, 2020 via email

sbenthall commented Feb 9, 2020

llorracc commented Feb 9, 2020

sbenthall commented Feb 10, 2020 via email

llorracc commented Feb 10, 2020 via email

sbenthall commented Feb 10, 2020

sbenthall commented Feb 10, 2020

sbenthall commented Feb 10, 2020

Testing asymptotic model outputs #504

Testing asymptotic model outputs #504

Comments

sbenthall commented Feb 9, 2020

sbenthall commented Feb 9, 2020

sbenthall commented Feb 9, 2020

MridulS commented Feb 9, 2020 via email

sbenthall commented Feb 9, 2020

llorracc commented Feb 9, 2020

sbenthall commented Feb 10, 2020 via email

llorracc commented Feb 10, 2020 via email

sbenthall commented Feb 10, 2020

sbenthall commented Feb 10, 2020

sbenthall commented Feb 10, 2020