Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unnecessary vpf check? #237

Closed
lgarrison opened this issue Nov 6, 2020 · 14 comments · Fixed by #238
Closed

unnecessary vpf check? #237

lgarrison opened this issue Nov 6, 2020 · 14 comments · Fixed by #238

Comments

@lgarrison
Copy link
Collaborator

There's a check in vpf to make sure the total volume in spheres doesn't exceed the box volume: https://github.com/manodeep/Corrfunc/blob/master/Corrfunc/theory/vpf.py#L207

But I'm not sure that condition is necessary? Spheres are allowed to overlap in vpf, right?

@manodeep do you know why this is there?

@manodeep
Copy link
Owner

manodeep commented Nov 7, 2020

it's a sanity check to make sure that the total number of spheres requested is sensible. If the total volume exceeds the box volume, then there must be some non-independent data-points, and therefore, the total number of spheres used is no longer a fair representation of the amount of scatter expected. In those cases, it is betterr to use a smaller number of random spheres (which might overlap). Does that make sense?

@lgarrison
Copy link
Collaborator Author

I guess I was thinking about it like this. The VPF tells you what the probability is of having 0 (or N) particles in a sphere thrown down at random in a box. So if we throw down as many such random spheres as possible, isn't the fraction with N exactly what we want to know?

I agree a single box is only going to be a single sampling of the "true" VPF, but I would think limiting the number of spheres just gives you a noisier estimate of the VPF, not a less biased one.

Perhaps the real issue is that one shouldn't trust the large-radius VPF measurements from one box, but maybe one wants to be able to get a low-noise estimate of it, even if it is biased, so one can average over multiple boxes, for example. And at small scales, maybe it's useful to be able to "oversample" to beat down noise some more when it's at a radius that you do trust the measurement.

@manodeep
Copy link
Owner

manodeep commented Nov 8, 2020

I am not sure how repeatedly re-sampling the same volume would give a less noisy estimate of the (large-R) VPF. If anything, the larger number of spheres will make you underestimate the error on the VPF. Of course, the only way to really get the scatter in the VPF is by looking at the scatter in the measurement over many different realisations.

Having said all that, I have never tested the bias-variance - seems like an easy enough test to run and check whether limiting the total volume is meaningful. And I can certainly see an use-case for a keyword parameter that allows the user to over-ride the volume check.

@lgarrison
Copy link
Collaborator Author

For the large-R VPF, if one is going out to 20% of the box, for example, then that's only 30 spheres. So the granularity of the VPF is only 3%! That's a pretty extreme case, but perhaps we should let the user make this choice. What do you think about changing the error to a warning? That's slightly less overhead than a new parameter, but I'm also fine with that.

@manodeep
Copy link
Owner

If the user really wants to probe Rmax to 20% of Lbox, then restricting the total number of spheres seems like a good idea!

Though I like the idea of changing to a warning, I have found that people just ignore warnings. It would be a little more work to put in a override_volume_check keyword, but at least that will ensure that users have to consciously make that choice.

@lgarrison
Copy link
Collaborator Author

lgarrison commented Nov 10, 2020 via email

@manodeep
Copy link
Owner

Are you concerned with the extra code required to over-ride the default behaviour? To me, it seems like adding a boolean keyword override_volume_check or allow_oversampling would not be that much code but would let the user get the desired behaviour. Regardless, I do agree that the behaviour might be desirable - I am trying to avoid making any default choices that can impact the fidelity of the results.

I am not sure how you would capture a rare, small-scale event with the vpf? The measurement (i.e., the mean of nspheres) is unlikely to be affected much, the standard deviation might be - is that the kind of effect you are thinking about?

@lgarrison
Copy link
Collaborator Author

Yes, it's a combination of not wanting to add more surface area to the API, and a lingering feeling that this check shouldn't be there at all. The condition forces you to miss a lot of sphere configurations!

By "rare events", I mean low-amplitude counts-in-cells. I think for any radius (doesn't have to be small), if you go to high enough (or small enough) counts, the probability p(N) will be very low. And then you probably want at least 10/p(N) spheres to get above Poisson noise. But the volume condition doesn't know anything about p(N), so in many cases it won't let you use the appropriate number of spheres!

@manodeep
Copy link
Owner

Seems like you feel fairly strongly about this :). Changing to a warning falls under the same scope as checking for the ratio - Rmax/Lbox in the correlation function caiculations - something we do not do currently.

Since the change will alter the default behaviour, should we milestone for a minor release (i.e., 2.4.0) rather than a patch release?

@lgarrison
Copy link
Collaborator Author

Sorry! I swear I wasn't planning on dying on this hill... I haven't worked with the VPF before, so I'm just trying to think it through and make sure I've wrapped my head around what's going on. And hopefully doing so will help newcomers to the code, especially since I think many students use corrfunc!

So do you think you agree with my assessment? If so, then yes, 2.4.0 sounds good to me. Do you think we could also do the boxsize change (#199) in that release?

@manodeep
Copy link
Owner

Hehehe - definitely seemed like the hill :). All good - let's milestone this change for 2.4.0. Did you want to PR this up?

Seems like the boxsize PR has some merge conflicts - are you able to fix those? Otherwise, I think that one is good to go but I will take one final look (it's been a while)

I am assuming this means that the next release will be v2.4.0. There are a lot of issues that are milestoned for 2.4.0 - we might have to go through and check what's feasible

@lgarrison
Copy link
Collaborator Author

Yes, sounds good, I'll take a look tomorrow.

laperezNYC added a commit to laperezNYC/Corrfunc that referenced this issue Nov 18, 2020
Following the discussion in issue manodeep#237 and after conversations with LGarrison, I've completely removed the check that confirmed that the volume inhabited by the dropped spheres was less than the volume of the sample. This check was unnecessary for the VPF to run, and actively discouraged the oversampling of the sample volume with test spheres that one needs to precisely measure the VPF at a given radius.
@laperezNYC
Copy link
Contributor

laperezNYC commented Nov 18, 2020

Hi @manodeep! I'm Lucia, and I'm the one who brought up this issue to @lgarrison while using CORRFUNC for my current projects. I was throwing CORRFUNC at much smaller volumes that I'm sure you wrote this for, and was struggling with the code stumbling for radii where I knew the VPF could still be trustworthy and measured, and finding it was giving inconsistent results because of the undersampling the check imposed. (So if anyone is dying on this hill, it's me, lol.)

My paper that's in press right now is very relevant here, and @lgarrison 's mentioned some things that I brought up to him from my work: https://arxiv.org/abs/2011.03556. Specifically, Section 3.3 discusses some common-sense limits of when the VPF can be measured given a sample's density and volume, which quantifies the concern of when a radius is too large. Still, regardless of whether the radius you're measuring the VPF for is valid or can give accurate results, one should still oversample with the dropped test spheres to at least guarantee that the act of measuring the VPF isn't adding additional uncertainty.

I'm about to PR a branch with the updated vpf.py that completely omits the check, though I did draft one earlier that gave folks the option to override the check with a flag. CORRFUNC is awesome and I'm happy to contribute a tiny improvement to it!

@manodeep
Copy link
Owner

Hi @laperezNYC - my name is Manodeep - nice to e-meet you! Thank you for using Corrfunc and even more thanks for contributing an improvement.

@manodeep manodeep linked a pull request Nov 19, 2020 that will close this issue
manodeep added a commit that referenced this issue Jan 5, 2021
* Removed independent volume check in vpf.py

Following the discussion in issue #237 and after conversations with LGarrison, I've completely removed the check that confirmed that the volume inhabited by the dropped spheres was less than the volume of the sample. This check was unnecessary for the VPF to run, and actively discouraged the oversampling of the sample volume with test spheres that one needs to precisely measure the VPF at a given radius.

* Update vpf.py

Without the check of independent volumes, there's no need to overtly calculate the volume the samples inhabit.

* Update CHANGES.rst

Updated 2.4.0 with this new enhancement to theory.vpf

* Update CHANGES.rst

I promise I can type!

* Tweaked wording

@laperezNYC Tweaked the wording a bit. If this looks acceptable to you, then I will merge in the PR

Co-authored-by: Manodeep Sinha <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants