functional simulation test #1148

Merged: 8 commits, Sep 28, 2022

Conversation

@sbenthall (Contributor) commented May 31, 2022:

Addresses #1105 by replacing tests on simulation results that target specific values with tests that target relations between values, based on the transition equations of the model.

It may also introduce another way to test simulations.
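A minimal sketch of the idea, using the IndShockConsumerType transition mNrm = bNrm + TranShk (the class and the transition are real HARK objects, but this snippet is an illustration of the approach, not code from this diff; parameter choices are placeholders):

import unittest

import numpy as np

from HARK.ConsumptionSaving.ConsIndShockModel import IndShockConsumerType


class FunctionalSimulationSketch(unittest.TestCase):
    def test_transition_holds(self):
        agent = IndShockConsumerType(cycles=0, AgentCount=100, T_sim=50)
        agent.solve()
        agent.track_vars = ["bNrm", "mNrm", "TranShk"]
        agent.initialize_sim()
        agent.simulate()

        # Old style: pin the simulated history to one RNG-specific number.
        # self.assertAlmostEqual(np.mean(agent.history["mNrm"]), <RNG-specific value>)

        # New style: check the transition equation m_t = b_t + theta_t,
        # which should hold in every simulated period regardless of the RNG.
        self.assertTrue(
            np.allclose(
                agent.history["mNrm"],
                agent.history["bNrm"] + agent.history["TranShk"],
            )
        )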

Please ensure your pull request adheres to the following guidelines:

  • Tests for new functionality/models or Tests to reproduce the bug-fix in code.
  • Updated documentation of features that add new functionality.
  • Update CHANGELOG.md with major/minor changes.

@codecov-commenter commented May 31, 2022:

Codecov Report

Base: 73.73% // Head: 73.62% // Decreases project coverage by -0.10% ⚠️

Coverage data is based on head (fad1a26) compared to base (ec3a7bf).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1148      +/-   ##
==========================================
- Coverage   73.73%   73.62%   -0.11%     
==========================================
  Files          72       72              
  Lines       11561    11513      -48     
==========================================
- Hits         8524     8476      -48     
  Misses       3037     3037              
Impacted Files Coverage Δ
.../ConsumptionSaving/tests/test_ConsAggShockModel.py 98.00% <ø> (-0.24%) ⬇️
...umptionSaving/tests/test_ConsGenIncProcessModel.py 100.00% <ø> (ø)
...mptionSaving/tests/test_ConsPortfolioFrameModel.py 100.00% <ø> (ø)
...ConsumptionSaving/tests/test_ConsPrefShockModel.py 100.00% <ø> (ø)
...ptionSaving/tests/test_IndShockConsumerTypeFast.py 100.00% <ø> (ø)
...ConsumptionSaving/tests/test_ConsPortfolioModel.py 100.00% <100.00%> (ø)
...nsumptionSaving/tests/test_IndShockConsumerType.py 75.47% <100.00%> (-0.39%) ⬇️
...tionSaving/tests/test_PerfForesightConsumerType.py 100.00% <100.00%> (ø)
...tionSaving/tests/test_TractableBufferStockModel.py 100.00% <100.00%> (ø)
HARK/tests/test_distribution.py 100.00% <100.00%> (ø)


@sbenthall changed the title from "[WIP] change TractableBufferstockModel to use functional simulation test, s…" to "[WIP] functional simulation test" on Jun 1, 2022
@sbenthall (Contributor Author) commented:

In order to preserve the old values in case a low-precision test is desired, I'll comment out the old tests and leave them in the suite.

@sbenthall self-assigned this on Aug 15, 2022
@sbenthall requested a review from Mv77 on September 20, 2022, 21:45
@sbenthall changed the title from "[WIP] functional simulation test" to "functional simulation test" on Sep 20, 2022
np.mean(self.economy.MrkvNow_hist),
0.4818181818181818
)

Contributor:

In conversations with Chris and others, this type of test was controversial.

Notice it is not testing for a single simulation value, but for the mean of a long history of simulations. Chris liked these tests because they are informative (if the mean of 10k simulations is 30% off of where we expect it to be, there is something wrong). You (Seb) disagreed because technically there is a greater-than-zero probability of that event occurring even if all the code is fine.

I see both points.

I'd vote for keeping them (because they are informative) but making them have a very loose tolerance. Say, a 10% relative tolerance for the mean of a 10k-observation simulation.
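Concretely, something like this (a sketch only; the tolerance is illustrative):

np.testing.assert_allclose(
    np.mean(self.economy.MrkvNow_hist),
    0.4818181818181818,  # the value currently hard-coded in the test
    rtol=0.10,           # loose 10% relative tolerance
)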

Contributor:

However, I'm happy to revise my position if this is heretical from a software engineering/comp-sci point of view.

@@ -185,15 +189,15 @@ def test_methods(self):

# testing update_solution_terminal()
self.assertEqual(
self.agent.solution_terminal.cFunc[0](10,self.economy.MSS),
self.agent.solution_terminal.cFunc[0](10, 13.32722),
Contributor:

This test checks a solution, not a simulation. I'd leave it in; there is nothing random about it.

Contributor Author:

What is economy.MSS?

Contributor Author:

Oh, it's self.kSS * self.RfreeSS + self.wRteSS. Of course. I'll change it back.

Contributor:

Actually, MSS is stochastic, so this test and the one above can go.

10
)

self.assertAlmostEqual(
self.economy.agents[0].solution[0].cFunc[0](
10,self.economy.MSS
10, 13.32722
Contributor:

Leave this one in. It checks a solution, not a simulation.

@@ -100,4 +100,5 @@ def test_simulation(self):
self.agent.initialize_sim()
self.agent.simulate()

self.assertAlmostEqual(np.mean(self.agent.history["mLvl"]), 1.2043946738813716)
# simulation test -- seed/generator specific
Contributor:

This tests for a moment, not a specific draw. Might leave it in or not.

@@ -36,15 +36,19 @@ def test_simOnePeriod(self):
self.pcct.track_vars += ["aNrm"]
self.pcct.initialize_sim()

self.assertFalse(np.any(self.pcct.shocks["Adjust"]))
# simulation test -- seed/generator specific
Contributor:

Is agent.pcct an analogue of agent.history?

Contributor Author:

Here, self.pcct refers to the AgentType object.
This is testing the stored values of that object's shocks dictionary after some number of simulated steps.

Contributor Author:

So, this is a shock and should be removed.

self.assertAlmostEqual(self.pcct.shocks["PermShk"][0], 0.85893446)

self.assertAlmostEqual(self.pcct.shocks["TranShk"][0], 1.0)
self.assertAlmostEqual(
Contributor:

These are what you have called 'functional' tests.

I want to ask what kind of error you see these catching.

The only way I can see this failing is someone messing up the transition equations of the simulation method, or introducing a bug in the time indexing of hark.core so that the states and shocks get badly out of sync. Both are valuable, I just want to know if you see other possible cases.

Contributor Author:

These are the generalization of the specific value test that makes the fewest assumptions.
I think you have pointed out two ways the test could fail.

I've been thinking about it but can't come up with others. Why do you ask?


self.assertAlmostEqual(self.pcct.controls["Share"][0], 0.8627164488246847)
self.assertAlmostEqual(self.pcct.controls["cNrm"][0], 1.67874799)
# a drawn shock; may not be robust to RNG/distribution implementations
Contributor:

I'd remove this one? It tests a specific draw

Contributor Author:

ok

@@ -94,8 +95,9 @@ def test_simulated_values(self):
self.agent.simulate()

self.assertAlmostEqual(self.agent.MPCnow[1], 0.5711503906043797)
Contributor:

I think this test might have to go too. MPCnow depends on current assets, which are stochastic.

Contributor Author:

good to know, thanks

# simulation tests -- seed/generator specific
# But these are based on aggregate population statistics.
# WARNING: May fail stochastically, or based on specific RNG types.
self.assertAlmostEqual(c_std2, 0.0376882)
Contributor:

If we keep these in we might want a low tolerance for them?

Contributor Author:

that's a good idea.

Contributor Author:

But if there's no objection to me removing them, and it sounds like there isn't, I will.

I'll respond to your point about tests of sampled moments in the main thread.

@@ -58,7 +59,8 @@ def test_simulated_values(self):
self.agent.simulate()
self.assertAlmostEqual(self.agent.MPCnow[1], 0.5711503906043797)
Contributor:

MPCnow is stochastic

@@ -65,19 +65,20 @@ def test_simulation(self):
) # This implicitly uses the assign_parameters method of AgentType

# Create PFexample object
self.agent_infinite.track_vars = ["mNrm"]
self.agent_infinite.track_vars = ["bNrm", "mNrm", "TranShk"]
self.agent_infinite.initialize_sim()
self.agent_infinite.simulate()

self.assertAlmostEqual(
Contributor:

Another one of those we might keep in with a high tolerance

Contributor Author:

What I've done here (just to mix it up...) is keep the test of the moment (the mean) but turn it into a test of the transition function.
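Roughly along these lines, inside the same test method as the diff above (a sketch; the exact assertion in the PR may differ):

self.assertAlmostEqual(
    np.mean(self.agent_infinite.history["mNrm"], axis=1)[100],
    np.mean(
        self.agent_infinite.history["bNrm"]
        + self.agent_infinite.history["TranShk"],
        axis=1,
    )[100],
)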


def test_MVNormal(self):

## Are these tests generator/backend specific?
dist = MVNormal()

self.assertTrue(
Contributor:

This one is generator-specific

np.mean(self.agent_infinite.history["mNrm"], axis=1)[100],
-29.140261331951606,
)
# simulation test -- seed/generator specific
Contributor:

Might keep in with high tolerance

@Mv77 (Contributor) left a comment:

@sbenthall this looks like a move in the direction we should be heading. It removes a bunch of the tests that are RNG-sensitive.

There is the question of whether we want to keep the ones that check for a moment (e.g. the mean) of a large number of draws, which should be much less sensitive to RNG. I see arguments both for and against them.

@llorracc (Collaborator) commented:

@Mv77 thanks for looking over all of this.

To elaborate a bit on the history, @sbenthall and I had some past disagreements about what should be tested, until finally I realized that his point was that the traditional purpose of such tests was to discover places where a change just caused code to stop working in even the most elemental sense. What I had wanted was to test whether the code produced the "right answer" in a substantive sense (or at least the same answer to a substantive question that had been produced by previous versions of the code).

We now have some tests of one kind and some of the other, and perhaps we ought to try to distinguish them more explicitly so that any software engineers who come to the project will understand which tests are of the "does it run" kind and which are of the "is it right" kind. (I can see how "is it right" doesn't make sense for many software projects, like, say, a Word processor or whatever).

In any case, for the "is it right" kinds of tests my sense is that the appropriate choice is probably to set some threshold of the kind you propose -- is the new answer within x percent of the old answer. But I think "x" should probably be something like 0.1 percent, not 15 percent, and not the 12 digits or 8 digits of floating point precision we had used in some of the "is it right" tests before.

@Mv77 (Contributor) commented Sep 22, 2022:

Yes, I remember being part of some of those conversations, and I remember them generating some disagreement.

I see the value for having both kinds of tests.

I trust @sbenthall's wisdom that perhaps a unit test is not the right way to check whether the stochastic properties of objects "behave well." Maybe we should figure out a more fitting way to do that, raising warnings or who knows what.

But that's one for the future, I think. With this PR Seb clears a bunch of tests that we have agreed definitely should not be there, and that is a very good thing. So I'd be very happy to merge this in and postpone the useful discussion of how to do the other type of tests.

@llorracc (Collaborator) commented Sep 22, 2022 via email

@Mv77 (Contributor) commented Sep 22, 2022:

I will wait for Seb to reply to the review comments.

@sbenthall (Contributor Author) commented:

Hello,

Thanks for the review @Mv77

I've removed/repaired tests as per your recommendations.

As for testing sampled moments, I see it like this:

  • We do not want to have tests that sometimes, rarely, fire a false positive. Why? Because in 10 years when HARK is being used by the Fed, ECB, and Goldman Sachs to dynamically model the economy with trillions of dollars on the line, we don't want a stochastic test failure in a Continuous Integration pipeline (that runs the tests a hundred times a day) to set off a lot of mysterious, hard-to-debug warning bells and cost the economy because we wrote some sloppy tests when it didn't matter.
  • Technically, with a fixed random seed, these won't be stochastic tests. But we are rewriting these tests so that we don't need to rely on a particular RNG or distribution implementation; in the future we may support many alternatives.
  • I think you, @Mv77, are totally right to ask of tests, "What is this a test for?" A test that checks whether some value, computed from a bunch of given parameters, falls within some approximate range will, on a negative result, tell you that 'something went wrong'. But it's better to test smaller pieces of the functionality, so they tell you what went wrong.

@Mv77 (Contributor) commented Sep 28, 2022:

Yup, I accept your wisdom about false positives, and your optimism about the future!

This is not a concern that blocks this PR but rather a question, which could be placed somewhere else for continued discussion: how does the comp-sci/software development community deal with 'testing' properties of the software that should always hold, but only in a fuzzy sense? Like the accuracy of a numerical method: you know that changes to the algorithm can introduce small changes in the exact output, but you'd like a check that lets you know if the output moves away from a known theoretical answer.

I trust your advice that unit tests are not the place to do this. But is there some other way to do this? How do Matlab and Mathematica check that their differential equation solvers do not break when they make a marginal improvement to their matrix multiplication algorithm?
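For what it's worth, one common pattern is a regression test against a closed-form benchmark with a deliberately loose tolerance (a generic sketch, not HARK code):

import numpy as np

def test_sample_mean_matches_theory():
    # E[X] = exp(mu + sigma**2 / 2) for X ~ Lognormal(mu, sigma).
    mu, sigma, n = 0.0, 0.1, 100_000
    draws = np.random.default_rng(0).lognormal(mean=mu, sigma=sigma, size=n)

    # Loose tolerance: an algorithmic change should not move the estimate
    # far from the known theoretical answer, but exact equality is not expected.
    np.testing.assert_allclose(draws.mean(), np.exp(mu + sigma**2 / 2), rtol=1e-2)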

@llorracc (Collaborator) commented:

My sense is that the way to proceed on all of this is to set fairly loose tests regarding the final results, and successively tighter tests for the components that lead to the final result.

If the existing code says that the value of $\bar{A}/\bar{P}$ is $w$, and some change to the code produces a number that is not within 1 percent of $w$, it is almost certain that somewhere a bug has been introduced. Since there are a bunch of steps (W, X, Y, and Z) that contribute to producing the value of $w$, the natural progression would be to successively introduce tests for whether the Z step produces the correct value $z$, the Y step produces the correct value $y$, and the X step produces the correct value $x$.

But I would not agree with the proposition that we should not test for $w$ being within 1 percent of the "correct" value of $w$ because that would not immediately tell us which step was the culprit. It seems like the natural process of debugging why $w$ is off would be to introduce tests at each of the earlier stages. You might only get to the X test and realize the bug was there; then adding the tests for the Y and Z stages can wait until there's a case where both the W test and the X test fail, in which case you need to write a Y test, and if the model passes that Y test you need to write a Z test.

This lets us build our testing apparatus as needed, and prevents us from writing tests for cases where in practice there is never a problem.
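In test code, that tiering might look something like this (hypothetical step names; a sketch of the strategy, not an actual HARK test):

import numpy as np

def check_pipeline(step_x, step_y, step_z, final_statistic,
                   expected_x, expected_y, expected_z, expected_w):
    # Deterministic component steps get tight tolerances...
    np.testing.assert_allclose(step_x(), expected_x, rtol=1e-8)
    np.testing.assert_allclose(step_y(), expected_y, rtol=1e-8)
    np.testing.assert_allclose(step_z(), expected_z, rtol=1e-8)

    # ...while the simulated final answer (e.g. Abar/Pbar) gets a loose,
    # roughly 1 percent band: a miss here says "something broke somewhere",
    # and the tighter component checks say where.
    np.testing.assert_allclose(final_statistic(), expected_w, rtol=1e-2)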

@sbenthall (Contributor Author) commented:

how does the comp-sci/software development community deal with 'testing' the robustness of things that should always be true in a fuzzy sense about the software?

I assume that you mean probabilistic, rather than fuzzy; fuzzy logic is a rather different beast, which allows for continuously valued truth values.

It's a good question, and not one I've looked into carefully before. But after doing a little searching and intuiting...

It looks like there are at least a couple things we haven't considered:

  • Getting a good probabilistic estimate requires sampling the system over multiple realizations (or seeds). That means that the accuracy of the test comes with a computational cost, which has to be traded off.
  • Rigorous thinking about the statistics of what's being measured is important for calibrating these tests. Hand-waving "almosts" won't cut it; a rough sketch of what that calibration could look like follows below.
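For example (a sketch, not from the thread): the test band for a sampled mean can be tied to the standard error of the estimator, which makes the implied false-positive rate explicit.

import numpy as np

def mean_and_tolerance(draws, k=5.0):
    """Sample mean plus a test band of k standard errors.

    With k = 5, a correct implementation fails the check only with a
    vanishingly small probability, so the tolerance is calibrated to the
    statistics of the quantity being measured rather than hand-picked.
    """
    draws = np.asarray(draws, dtype=float)
    mean = draws.mean()
    standard_error = draws.std(ddof=1) / np.sqrt(draws.size)
    return mean, k * standard_error

# Usage sketch (hypothetical names):
# mean, tol = mean_and_tolerance(agent.history["mNrm"].flatten())
# assert abs(mean - expected_mean) < tol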

These two articles (from 2016 and 2018) are near-top Google hits on the topic, and indicate that this is an active research area and unsolved problem. "Probabilistic programming" is a relatively recent research area which has yet to find mainstream uptake and applications.

https://alexey.radul.name/ideas/2016/on-testing-probabilistic-programs/

https://www.cs.cornell.edu/~legunsen/pubs/DuttaETAL18ProbFuzz.pdf

@Mv77 (Contributor) commented Sep 28, 2022:

I think I meant fuzzy in the fuzzy-logic sense. The tests we are talking about would not be of the type "Is A == B?", which is either true or false. We'd like to have an answer to "Is moment_a close to true answer B?", which admits answers like "kind of."

The fuzzy truth value of the answer to that question is, as you said, stochastic.

@Mv77 merged commit e6c936b into econ-ark:master on Sep 28, 2022
@sbenthall added this to the 0.13.0 milestone on Jan 4, 2023