Testing asymptotic model outputs #504
@llorracc I agree that this would be a good feature to support:
In particular, I think standardizing a machine-readable way of testing model outputs is important. That said, on some other points you've raised I see things differently from you, and I take issue with a few points here:
Long story short, if you're concerned about the stability of the code functionality given minor changes, the right thing to do is:
On 10-Feb-2020, Sebastian Benthall wrote:

> I see now that HARK is missing a lot of unit tests. There are no unit tests on ConsumptionSaving classes, for example.

I have started creating some unit tests for the ConsumptionSaving classes in the refactor PR.
Oh awesome @MridulS I made a PR for adding just one test -- #506 -- so @llorracc can see what that would look like. If those were in place, that would do a lot to catch whether a change to the HARK library code was affecting any substantive results.
Let me take issue with some of the issues you took issue with:

> 1. If there are not good unit tests that confirm the functionality on simple cases, then there is no reason besides independent confidence in the quality of the code to think that current substantive REMARK results are correct.

I'm all for unit tests; fine. But the implicit argument here is "if there is a merge that changes our results, we have no idea whether the old ones or the new ones are more likely to be right." Nonsense. The existing code has been vetted and gone over many times by many people, and in most cases produces results that are very similar to results that are well understood in many papers in the published literature. If a merge request changes the results substantially, we DO have a strong suspicion that the new results are very likely to be wrong, because they have not been examined nearly as carefully as the old ones. At a minimum, we want to examine them -- the best case scenario is that we discover a bug that has been missed by everybody, and furthermore has been independently created by all the other similar papers, in which case a major publication is in the offing.

The kinds of results I'm talking about are things like "as wealth goes to infinity, the portfolio share approaches the value in the Merton-Samuelson model." With all due respect to Travis, this is not the kind of thing that it does.

> REMARKs, which are intended to be static representations of research output, can guarantee that the results stay static despite changes to the underlying library by having their HARK dependency pegged to a specific release. Indeed, this is the correct way to keep REMARK output static, because REMARKs are archival. It is not a good idea to use REMARK results, per se, as tests of HARK library functionality, though in principle one could be adapted into an automated test.

Right now, zero percent of our REMARKs are in that category. Saying that it is not useful to run tests for whether new code merges break them is assuming that everything about them is already perfect (or at least frozen). And, actually, for REMARKs that DO reach the "permanent archive" stage, that will be because we have considerable confidence that their substantive results are right. If anything, such REMARKs are going to be MORE useful for finding problems with new code than work-in-progress REMARKs will be, since the latter might well have their own bugs.

The only (important) caveat to this is that sometimes we will make "breaking changes" to the code base. But, even for that case, the automatic test of vetted and reliable substantive results is useful, because if results for an archived REMARK change as a result of a "breaking change" we have two options: either we deliberately mark that REMARK as compatible with "all releases earlier than" whichever is the breaking change, or we pay some (appropriate) attention to whether the "breaking change" is worth doing. (I'm thinking here more of cases where we had not REALIZED that it was a breaking change than of cases where we had previously made a deliberate decision to make a breaking change. We should at least know that the change is a breaking one.)

To be clear, I'm NOT saying that we should be UPDATING the frozen REMARKs to use new versions of the code; it's fine for the "master" version of them to remain frozen. But that doesn't mean that we can't test whether the frozen code still works with a new release.
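For concreteness: "pegged to a specific release" usually just means a line like `econ-ark==0.10.1` in the REMARK's requirements file (the version number here is purely illustrative). An archived REMARK could also check at runtime that it is being run against the release it was pegged to; a minimal sketch, assuming the package is installed under its PyPI name `econ-ark`:

```python
# Minimal sketch: fail loudly if this archived REMARK is re-run against a
# different HARK release than the one it was pegged to.
# The version string is purely illustrative.
import pkg_resources

PINNED_HARK_VERSION = "0.10.1"  # hypothetical archived release

installed = pkg_resources.get_distribution("econ-ark").version
assert installed == PINNED_HARK_VERSION, (
    f"This REMARK was archived against econ-ark {PINNED_HARK_VERSION}, "
    f"but econ-ark {installed} is installed."
)
```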
In any case, I'd be nervous about merging in lots of apparently small changes over a short span of time without a single "substantive" test in place. We should have at least one or two such substantive tests in place before we make, especially, lots of little and seemingly-innocuous changes throughout the codebase; precisely because they are seemingly innocuous, those are the hardest kind to pin down if they DO make some subtle but important change that did not occur to the person who made the code change precisely because it was subtle.
I think we agree: it's good to have more automated test coverage before making changes to the core source code.

When I proposed my changes earlier, I didn't realize that test coverage was currently so poor. I'll add some tests to my PRs to try to track the functionality that I'm trying to preserve while refactoring.
> The kinds of results I'm talking about are things like "as wealth goes to infinity, the portfolio share approaches the value in the Merton-Samuelson model." With all due respect to Travis, this is not the kind of thing that it does.
I think you may be underestimating what it's possible to do with automated testing and Travis. Travis's full name is "Travis Continuous Integration"; it's designed for coordinating the efforts of very large teams of people continuously building and improving systems deployed to thousands of users in production.
I think that from a software perspective, what you are describing amounts to running some code and testing to see if the result matches expectations. That's absolutely what any automated test would do.

So, let's assume you've written the code to (a) run this simulation and (b) test the result. Where would it be best to put this test?
If it doesn't require much code besides what's in HARK to run, it would make sense to include it in HARK's test suite. It wouldn't technically be a "unit" test (it's a test of a different kind), but it could happily live in the test suite with the other tests.
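To make the shape of such a test concrete, here is a self-contained sketch. The `compute_share_at_wealth` helper below is a toy stand-in for "solve the portfolio model and evaluate the risky share at a given wealth level"; a real test in HARK's suite would call the actual solver there instead, and the limit would be the Merton-Samuelson share rather than a made-up number.

```python
# Sketch of an asymptotic "limiting result" test. compute_share_at_wealth is
# a toy stand-in for the numerically solved portfolio share; a real test
# would obtain this value from HARK instead.
import math
import unittest


def compute_share_at_wealth(wealth, limit=0.5, scale=10.0):
    # Toy function that approaches `limit` as wealth grows.
    return limit * (1.0 - math.exp(-wealth / scale))


class TestAsymptoticShare(unittest.TestCase):
    def test_share_approaches_analytical_limit(self):
        analytical_limit = 0.5  # in a real test: the Merton-Samuelson share
        share_at_large_wealth = compute_share_at_wealth(1e6)
        # Never demand exact equality of floating-point results; the explicit
        # tolerance (here, six decimal places) is the point of the design.
        self.assertAlmostEqual(share_at_large_wealth, analytical_limit, places=6)


if __name__ == "__main__":
    unittest.main()
```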
If it *does* require a lot of code external to HARK, then I suppose it should go in a different repository. We could make an issue in the REMARK repository and continue the discussion of REMARK testing there.
A key issue, in my view, is that if you are writing a test for (HARK + ExtraCode), then a positive result does not guarantee that either HARK or ExtraCode works entirely as expected. It could be a false positive based on their interaction, or because of the specific parameters used. Similarly, if such a test came up negative, you wouldn't know whether it was HARK or ExtraCode that was the problem.

That's why in software testing, it's generally a good idea to get coverage on the simplest, most understandable units first, then build the tests up.
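As an illustration of "simplest units first": the very first tests can check nothing more than that a single small formula returns the values it should; integration-style tests like the asymptotic one above then sit on top of that foundation. The `crra_utility` helper below is defined inline only to keep the sketch self-contained; in practice one would import the corresponding function from HARK.

```python
# Unit-test the smallest, most understandable piece first: one formula,
# checked against hand-computed values.
import unittest


def crra_utility(c, rho):
    """CRRA utility u(c) = c**(1 - rho) / (1 - rho), for rho != 1."""
    return c ** (1.0 - rho) / (1.0 - rho)


class TestCRRAUtility(unittest.TestCase):
    def test_hand_computed_values(self):
        # With rho = 2, u(c) = -1/c.
        self.assertAlmostEqual(crra_utility(1.0, 2.0), -1.0)
        self.assertAlmostEqual(crra_utility(2.0, 2.0), -0.5)
```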
In general, having good automated tests, along with good documentation and clean design, is an indicator of software quality. Overconfidence in software is FAR more common than underconfidence in software. The better documented, more clearly written, and better tested the code is, the more attractive it will be to other users and contributors. I understand that you are nervous about things changing; I think that's partly because, as it currently stands, the software is fragile. I think you do understand that I am trying to make changes that will improve the robustness of the project moving forward.
I wasn't suggesting that the kinds of tests I have in mind could not be done with Travis. And they do not envision using any code outside of Econ-ARK/HARK (plus the code of the REMARK being tested).
We've had a number of inconclusive conversations along the lines of: "If we wanted to test whether the portfolio share is targeting its analytical limit as wealth goes to infinity, how exactly would we set that up? It never _reaches_ infinity, and various numerical issues could cause substantively small differences in the exact last few digits of the output. Even rounding errors could matter, so we can't just compare the outputs byte-by-byte and declare failure if there is not an exact match."

There was never any doubt that we could use Travis for this; what we were unclear about is what the best way would be to formulate such tests, and especially how to integrate them naturally into the automatic output of the REMARK.
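For what it's worth, the usual answer to the "last few digits" worry is to compare against the analytical benchmark with an explicit tolerance rather than byte-by-byte; numpy's testing helpers are built for exactly this. A sketch, with made-up numbers:

```python
# Compare numerical output to a benchmark within an explicit tolerance,
# rather than requiring bit-for-bit equality. The numbers are made up.
import numpy as np

simulated_shares = np.array([0.49991, 0.49997, 0.49999])  # share at ever-larger wealth
analytical_limit = 0.5                                     # e.g. the Merton-Samuelson share

# Passes if every element is within 0.1% (relative) of the benchmark,
# and fails with a message showing the mismatch otherwise.
np.testing.assert_allclose(simulated_shares, analytical_limit, rtol=1e-3)
```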
I feel sure that other scientific computing projects must have done things like this extensively before. Like, presumably, astrophysicists studying black holes test whether, when a simulated particle crosses the event horizon, the simulation ever shows it popping back out (except as "Hawking radiation!").
Sebastian Benthall and @MridulS, if you could do some research about how other fields do this before our meeting on Thursday, that would help us reach a strategy. I will give some thought to the particular example I want to start with -- probably BufferStockTheory, since I am making final revisions to it right now.
- Chris Carroll
OK, I think I see what you're getting at now.
Pytest supports approximate equality (h/t @MridulS).
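For the record, that looks like this (the numbers and tolerance are just examples):

```python
# pytest.approx lets a test assert "equal up to a tolerance" directly.
import pytest


def test_share_is_near_analytical_limit():
    share_at_large_wealth = 0.49998  # stand-in for model output
    assert share_at_large_wealth == pytest.approx(0.5, rel=1e-3)
```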
@llorracc writes here: