functional simulation test #1148
Conversation
Codecov Report — Base: 73.73% // Head: 73.62% // Decreases project coverage by 0.11%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #1148 +/- ##
==========================================
- Coverage 73.73% 73.62% -0.11%
==========================================
Files 72 72
Lines 11561 11513 -48
==========================================
- Hits 8524 8476 -48
Misses 3037 3037
☔ View full report at Codecov.
In order to preserve the old values in case a low-precision test is desired, I'll comment out the old tests and leave them in the suite.
np.mean(self.economy.MrkvNow_hist),
0.4818181818181818
)
In conversations with Chris and others, this type of test was controversial.
Notice it is not testing for a single simulation value, but for the mean of a long history of simulations. Chris liked these tests because they are informative (if the mean of 10k simulations is 30% off of where we expect it to be, there is something wrong). You (Seb) disagreed because technically there is a greater-than-zero probability of that event occurring even if all the code is fine.
I see both points.
I'd vote for keeping them (because they are informative) but giving them a very loose tolerance. Say, 10% relative tolerance for the mean of a 10k-obs simulation.
However, I'm happy to revise my position if this is heretical from a software-engineering/comp-sci point of view.
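A minimal sketch of what such a loose-tolerance moment test could look like, reusing the MrkvNow_hist check from the snippet above (the test name is illustrative; the recorded target is the value from the old test, and rtol=0.10 is the 10% suggestion made here):

```python
import numpy as np

def test_markov_state_mean(self):
    # Loose-tolerance check on a simulated moment, rather than an exact
    # match against a seed-specific value.  rtol=0.10 follows the 10%
    # relative tolerance suggested above for the mean of a 10k-obs simulation.
    np.testing.assert_allclose(
        np.mean(self.economy.MrkvNow_hist),
        0.4818181818181818,  # target recorded from an earlier run
        rtol=0.10,
    )
```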
@@ -185,15 +189,15 @@ def test_methods(self):

# testing update_solution_terminal()
self.assertEqual(
self.agent.solution_terminal.cFunc[0](10,self.economy.MSS),
self.agent.solution_terminal.cFunc[0](10, 13.32722),
This test checks a solution, not a simulation. I'd leave it in; there is nothing random about it.
What is economy.MSS?
Oh, it's self.kSS * self.RfreeSS + self.wRteSS. Of course. I'll change it back.
Actually, MSS is stochastic, so this test and the one above can go.
10
)

self.assertAlmostEqual(
self.economy.agents[0].solution[0].cFunc[0](
10,self.economy.MSS
10, 13.32722
Leave this one in. It checks a solution, not a simulation.
@@ -100,4 +100,5 @@ def test_simulation(self):
self.agent.initialize_sim()
self.agent.simulate()

self.assertAlmostEqual(np.mean(self.agent.history["mLvl"]), 1.2043946738813716)
# simulation test -- seed/generator specific
This tests for a moment, not a specific draw. Might leave it in or not.
@@ -36,15 +36,19 @@ def test_simOnePeriod(self):
self.pcct.track_vars += ["aNrm"]
self.pcct.initialize_sim()

self.assertFalse(np.any(self.pcct.shocks["Adjust"]))
# simulation test -- seed/generator specific
Is agent.pcct an analogue of agent.history?
Here, self.pcct refers to the AgentType object. This is testing the stored values of that object's shocks dictionary after some number of simulated steps.
So, this is a shock and should be removed.
self.assertAlmostEqual(self.pcct.shocks["PermShk"][0], 0.85893446)

self.assertAlmostEqual(self.pcct.shocks["TranShk"][0], 1.0)
self.assertAlmostEqual(
These are what you have called 'functional' tests.
I want to ask what kind of error you see these catching.
The only way I can see this failing is someone messing up the transition equations of the simulation method, or introducing a bug in the time indexing of hark.core so that the states and shocks get badly out of sync. Both are valuable, I just want to know if you see other possible cases.
These are the generalization of the specific value test that makes the fewest assumptions.
I think you have pointed out two ways the test could fail.
I've been thinking about it but can't come up with others. Why do you ask?
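For concreteness, a sketch of the kind of "functional" test being discussed: it checks that the simulated state and shock histories satisfy the model's transition equation, so it fails when the transition code or the time indexing gets out of sync, regardless of the RNG. The variable names follow the track_vars used elsewhere in this PR; the specific relation mNrm_t = bNrm_t + TranShk_t is an assumption about the model.

```python
import numpy as np

def test_transition_consistency(self):
    # Functional test: simulated histories should satisfy the model's
    # transition equation no matter which generator produced the shocks.
    # Assumes the normalized transition mNrm_t = bNrm_t + TranShk_t.
    hist = self.agent_infinite.history
    np.testing.assert_allclose(
        hist["mNrm"],
        hist["bNrm"] + hist["TranShk"],
        rtol=1e-10,
    )
```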
self.assertAlmostEqual(self.pcct.controls["Share"][0], 0.8627164488246847)
self.assertAlmostEqual(self.pcct.controls["cNrm"][0], 1.67874799)
# a drawn shock; may not be robust to RNG/distribution implementations
I'd remove this one? It tests a specific draw
ok
@@ -94,8 +95,9 @@ def test_simulated_values(self):
self.agent.simulate()

self.assertAlmostEqual(self.agent.MPCnow[1], 0.5711503906043797)
I think this test might have to go too. MPCnow depends on current assets, which are stochastic.
good to know, thanks
# simulation tests -- seed/generator specific
# But these are based on aggregate population statistics.
# WARNING: May fail stochastically, or based on specific RNG types.
self.assertAlmostEqual(c_std2, 0.0376882)
If we keep these in we might want a low tolerance for them?
that's a good idea.
But if there's no objection to me removing them, and it sounds like there isn't, I will.
I'll respond to your point about tests of sampled moments in the main thread.
@@ -58,7 +59,8 @@ def test_simulated_values(self):
self.agent.simulate()
self.assertAlmostEqual(self.agent.MPCnow[1], 0.5711503906043797)
MPCnow is stochastic
@@ -65,19 +65,20 @@ def test_simulation(self):
) # This implicitly uses the assign_parameters method of AgentType

# Create PFexample object
self.agent_infinite.track_vars = ["mNrm"]
self.agent_infinite.track_vars = ["bNrm", "mNrm", "TranShk"]
self.agent_infinite.initialize_sim()
self.agent_infinite.simulate()

self.assertAlmostEqual(
Another one of those we might keep in with a high tolerance
What I've done here (just to mix it up...) is keep the test of the moment (the mean) but turn it into a test of the transition function.
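Presumably something along these lines: the cross-sectional mean is still tested, but through the transition equation rather than against a hard-coded simulated value (again, the relation mNrm = bNrm + TranShk is an assumption about the model, and the test name is illustrative):

```python
import numpy as np

def test_mean_transition(self):
    # Test the moment (the cross-sectional mean) via the transition
    # function instead of pinning a seed-specific number.
    hist = self.agent_infinite.history
    np.testing.assert_allclose(
        np.mean(hist["mNrm"], axis=1),
        np.mean(hist["bNrm"] + hist["TranShk"], axis=1),
        rtol=1e-10,
    )
```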
HARK/tests/test_distribution.py
Outdated

def test_MVNormal(self):

## Are these tests generator/backend specific?
dist = MVNormal()

self.assertTrue(
This one is generator-specific
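If a multivariate-normal check is worth keeping, a generator-agnostic alternative is to compare sample moments of a large draw against the known parameters with loose tolerances. The sketch below uses NumPy directly rather than HARK's MVNormal class, purely to illustrate the pattern:

```python
import numpy as np

def test_mvnormal_moments():
    # Moment-based check: loose tolerances on the sample mean and covariance
    # instead of asserting generator-specific draw values.
    mu, Sigma = np.zeros(2), np.eye(2)
    draws = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=100_000)
    np.testing.assert_allclose(draws.mean(axis=0), mu, atol=0.02)
    np.testing.assert_allclose(np.cov(draws, rowvar=False), Sigma, atol=0.05)
```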
np.mean(self.agent_infinite.history["mNrm"], axis=1)[100],
-29.140261331951606,
)
# simulation test -- seed/generator specific
Might keep in with high tolerance
@sbenthall this looks like a move in the direction we should be heading. It removes a bunch of the tests that are RNG-sensitive.
There is the question of whether we want to keep the ones that check for a moment (e.g. the mean) of a large number of draws, which should be much less sensitive to RNG. I see arguments both for and against them.
@Mv77 thanks for looking over all of this. To elaborate a bit on the history, @sbenthall and I had some past disagreements about what should be tested, until finally I realized that his point was that the traditional purpose of such tests was to discover places where a change just caused code to stop working in even the most elemental sense. What I had wanted was to test whether the code produced the "right answer" in a substantive sense (or at least the same answer to a substantive question that had been produced by previous versions of the code).

We now have some tests of one kind and some of the other, and perhaps we ought to try to distinguish them more explicitly so that any software engineers who come to the project will understand which tests are of the "does it run" kind and which are of the "is it right" kind. (I can see how "is it right" doesn't make sense for many software projects, like, say, a word processor or whatever.)

In any case, for the "is it right" kinds of tests my sense is that the appropriate choice is probably to set some threshold of the kind you propose -- is the new answer within x percent of the old answer. But I think "x" should probably be something like 0.1 percent, not 15 percent, and not the 12 digits or 8 digits of floating-point precision we had used in some of the "is it right" tests before.
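A rough sketch of how the two kinds of tests might be made explicit in the suite. The test names and the EXPECTED_C_AT_10 benchmark are hypothetical, and the 0.1% relative tolerance is the threshold proposed above:

```python
import numpy as np

def test_does_it_run(self):
    # "Does it run" test: exercise the pipeline end to end with no
    # assertions about the numbers it produces.
    self.agent.solve()
    self.agent.initialize_sim()
    self.agent.simulate()

def test_is_it_right(self):
    # "Is it right" test: compare against the answer recorded from a
    # previous version of the code, within 0.1% relative tolerance.
    self.agent.solve()
    np.testing.assert_allclose(
        self.agent.solution[0].cFunc(10.0),
        EXPECTED_C_AT_10,  # hypothetical recorded benchmark value
        rtol=1e-3,
    )
```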
Yes, I remember I was part of some of those conversations and I remember them generating some disagreement. I see the value of having both kinds of tests. I trust @sbenthall's wisdom that perhaps a unit test is not the right way to check whether the stochastic properties of objects "behave well." Maybe we should figure out a more fitting way to do that, raising warnings or who knows what. But that's one for the future, I think. With this PR Seb clears a bunch of tests that we have agreed definitely should not be there, and that is a very good thing. So I'd be very happy to merge this in and postpone the useful discussion of how to do the other type of tests.
Go ahead and merge. (My message was partly just to codify our past conversations for future contributors.)
I will wait for Seb to reply to the review comments.
Hello, thanks for the review @Mv77. I've removed/repaired tests per your recommendations. As for testing sampled moments, I see it like this:
Yup, I accept your wisdom about false positives, and your optimism about the future! This is not a concern that blocks this PR, but a question instead, which could be placed somewhere else for continued discussion: how does the comp-sci/software-development community deal with "testing" the robustness of things that should always be true in a fuzzy sense about the software? Like the accuracy of a numerical method: you know that changes to the algorithm can introduce small changes in the exact output, but you'd like to have a check that lets you know if the output moves away from a known theoretical answer. I trust your advice that unit tests are not the place to do this. But is there some other way to do this? How do Matlab and Mathematica check that their differential equation solvers do not break when they make a marginal improvement to their matrix multiplication algorithm?
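Not an answer, but one common pattern is an accuracy check against a known closed-form solution, with a tolerance loose enough to survive marginal algorithmic changes. A generic sketch, using SciPy rather than anything HARK-specific:

```python
import numpy as np
from scipy.integrate import solve_ivp

def check_solver_accuracy():
    # dy/dt = -y with y(0) = 1 has the closed form y(t) = exp(-t).
    # The solver's internals may change, but its output should stay within
    # a loose tolerance of the theoretical answer.
    t_eval = np.linspace(0.0, 5.0, 50)
    sol = solve_ivp(lambda t, y: -y, (0.0, 5.0), [1.0], t_eval=t_eval)
    np.testing.assert_allclose(sol.y[0], np.exp(-t_eval), rtol=1e-2, atol=1e-4)
```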
My sense is that the way to proceed on all of this is to set fairly loose tests regarding the final results, and successively tighter tests for the components that lead to the final result. If the existing code says that the value of But I would not agree with the proposition that we should not test for This lets us build our testing apparatus as needed, and prevents us from writing tests for cases where in practice there is never a problem.
I assume that you mean probabilistic, rather than fuzzy; fuzzy logic being a rather different beast, which allows for continuously valued truth values. It's a good question, and not one I've looked into carefully before. But after doing a little searching and intuiting... it looks like there are at least a couple of things we haven't considered:
These two articles (from 2016 and 2018) are near-top Google hits on the topic, and indicate that this is an active research area and unsolved problem. "Probabilistic programming" is a relatively recent research area which has yet to find mainstream uptake and applications.
https://alexey.radul.name/ideas/2016/on-testing-probabilistic-programs/
https://www.cs.cornell.edu/~legunsen/pubs/DuttaETAL18ProbFuzz.pdf
I think I meant fuzzy in the fuzzy logic sense. The tests we are talking about would not be of the type "Is A==B" which is either true or false. We'd like to have an answer to "Is The fuzzy truth value of the answer to that question is, as you said, stochastic. |
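One way to make that stochastic truth value explicit is to phrase the check as a hypothesis test, failing only when a sample moment is statistically implausible under the null that the code is correct. A sketch with hypothetical names:

```python
import numpy as np

def check_simulated_mean(history, expected_mean, n_sigma=5.0):
    # Fail only if the sample mean is implausibly far from its expected
    # value, measured in standard errors -- a probabilistic assertion
    # rather than an exact equality.
    sample = np.asarray(history, dtype=float).ravel()
    std_err = sample.std(ddof=1) / np.sqrt(sample.size)
    z = abs(sample.mean() - expected_mean) / std_err
    assert z < n_sigma, f"simulated mean is {z:.1f} standard errors from target"
```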
Addresses #1105 by replacing tests on simulation results that target specific values with tests that target relations between values, based on the transition equations of the model.
It may also introduce another way to test simulations.