unnecessary vpf check? #237

lgarrison · 2020-11-06T21:15:13Z

There's a check in vpf to make sure the total volume in spheres doesn't exceed the box volume: https://github.com/manodeep/Corrfunc/blob/master/Corrfunc/theory/vpf.py#L207

But I'm not sure that condition is necessary? Spheres are allowed to overlap in vpf, right?

@manodeep do you know why this is there?

manodeep · 2020-11-07T01:57:58Z

it's a sanity check to make sure that the total number of spheres requested is sensible. If the total volume exceeds the box volume, then there must be some non-independent data-points, and therefore, the total number of spheres used is no longer a fair representation of the amount of scatter expected. In those cases, it is betterr to use a smaller number of random spheres (which might overlap). Does that make sense?

lgarrison · 2020-11-07T04:59:56Z

I guess I was thinking about it like this. The VPF tells you what the probability is of having 0 (or N) particles in a sphere thrown down at random in a box. So if we throw down as many such random spheres as possible, isn't the fraction with N exactly what we want to know?

I agree a single box is only going to be a single sampling of the "true" VPF, but I would think limiting the number of spheres just gives you a noisier estimate of the VPF, not a less biased one.

Perhaps the real issue is that one shouldn't trust the large-radius VPF measurements from one box, but maybe one wants to be able to get a low-noise estimate of it, even if it is biased, so one can average over multiple boxes, for example. And at small scales, maybe it's useful to be able to "oversample" to beat down noise some more when it's at a radius that you do trust the measurement.

manodeep · 2020-11-08T23:56:25Z

I am not sure how repeatedly re-sampling the same volume would give a less noisy estimate of the (large-R) VPF. If anything, the larger number of spheres will make you underestimate the error on the VPF. Of course, the only way to really get the scatter in the VPF is by looking at the scatter in the measurement over many different realisations.

Having said all that, I have never tested the bias-variance - seems like an easy enough test to run and check whether limiting the total volume is meaningful. And I can certainly see an use-case for a keyword parameter that allows the user to over-ride the volume check.

lgarrison · 2020-11-09T18:11:41Z

For the large-R VPF, if one is going out to 20% of the box, for example, then that's only 30 spheres. So the granularity of the VPF is only 3%! That's a pretty extreme case, but perhaps we should let the user make this choice. What do you think about changing the error to a warning? That's slightly less overhead than a new parameter, but I'm also fine with that.

manodeep · 2020-11-10T22:08:27Z

If the user really wants to probe Rmax to 20% of Lbox, then restricting the total number of spheres seems like a good idea!

Though I like the idea of changing to a warning, I have found that people just ignore warnings. It would be a little more work to put in a override_volume_check keyword, but at least that will ensure that users have to consciously make that choice.

lgarrison · 2020-11-10T22:24:23Z

Well, even a much more reasonable case, like 5% of the box, is only 2000 spheres, which may not be enough to capture rare small-scale events that may nonetheless hold interesting information. If the user wants the vpf across a range of Rmax scales, we might say from a computational perspective that it makes sense for the user to run vpf many times, decreasing nspheres as one increases Rmax. But the structure of the API doesn't really suggest that, since we are providing `nbins` to allow simultaneously measuring across many scales. So... given that the vpf is cheap, I am in favor of lifting the restriction on spheres altogether and making it a warning! What do you think?

…

On Tue, Nov 10, 2020 at 5:08 PM Manodeep Sinha ***@***.***> wrote: If the user really wants to probe Rmax to 20% of Lbox, then restricting the total number of spheres seems like a good idea! Though I like the idea of changing to a warning, I have found that people just ignore warnings. It would be a little more work to put in a override_volume_check keyword, but at least that will ensure that users have to consciously make that choice. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#237 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLA7S77LXC5TQCUKYZPVDLSPG2WTANCNFSM4TNEVXUA> .

-- Lehman Garrison <[email protected]> Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

manodeep · 2020-11-11T04:15:52Z

Are you concerned with the extra code required to over-ride the default behaviour? To me, it seems like adding a boolean keyword override_volume_check or allow_oversampling would not be that much code but would let the user get the desired behaviour. Regardless, I do agree that the behaviour might be desirable - I am trying to avoid making any default choices that can impact the fidelity of the results.

I am not sure how you would capture a rare, small-scale event with the vpf? The measurement (i.e., the mean of nspheres) is unlikely to be affected much, the standard deviation might be - is that the kind of effect you are thinking about?

lgarrison · 2020-11-11T13:41:28Z

Yes, it's a combination of not wanting to add more surface area to the API, and a lingering feeling that this check shouldn't be there at all. The condition forces you to miss a lot of sphere configurations!

By "rare events", I mean low-amplitude counts-in-cells. I think for any radius (doesn't have to be small), if you go to high enough (or small enough) counts, the probability p(N) will be very low. And then you probably want at least 10/p(N) spheres to get above Poisson noise. But the volume condition doesn't know anything about p(N), so in many cases it won't let you use the appropriate number of spheres!

manodeep · 2020-11-12T02:30:02Z

Seems like you feel fairly strongly about this :). Changing to a warning falls under the same scope as checking for the ratio - Rmax/Lbox in the correlation function caiculations - something we do not do currently.

Since the change will alter the default behaviour, should we milestone for a minor release (i.e., 2.4.0) rather than a patch release?

lgarrison · 2020-11-12T03:03:16Z

Sorry! I swear I wasn't planning on dying on this hill... I haven't worked with the VPF before, so I'm just trying to think it through and make sure I've wrapped my head around what's going on. And hopefully doing so will help newcomers to the code, especially since I think many students use corrfunc!

So do you think you agree with my assessment? If so, then yes, 2.4.0 sounds good to me. Do you think we could also do the boxsize change (#199) in that release?

manodeep · 2020-11-17T02:25:34Z

Hehehe - definitely seemed like the hill :). All good - let's milestone this change for 2.4.0. Did you want to PR this up?

Seems like the boxsize PR has some merge conflicts - are you able to fix those? Otherwise, I think that one is good to go but I will take one final look (it's been a while)

I am assuming this means that the next release will be v2.4.0. There are a lot of issues that are milestoned for 2.4.0 - we might have to go through and check what's feasible

lgarrison · 2020-11-17T02:33:53Z

Yes, sounds good, I'll take a look tomorrow.

Following the discussion in issue manodeep#237 and after conversations with LGarrison, I've completely removed the check that confirmed that the volume inhabited by the dropped spheres was less than the volume of the sample. This check was unnecessary for the VPF to run, and actively discouraged the oversampling of the sample volume with test spheres that one needs to precisely measure the VPF at a given radius.

laperezNYC · 2020-11-18T19:25:33Z

Hi @manodeep! I'm Lucia, and I'm the one who brought up this issue to @lgarrison while using CORRFUNC for my current projects. I was throwing CORRFUNC at much smaller volumes that I'm sure you wrote this for, and was struggling with the code stumbling for radii where I knew the VPF could still be trustworthy and measured, and finding it was giving inconsistent results because of the undersampling the check imposed. (So if anyone is dying on this hill, it's me, lol.)

My paper that's in press right now is very relevant here, and @lgarrison 's mentioned some things that I brought up to him from my work: https://arxiv.org/abs/2011.03556. Specifically, Section 3.3 discusses some common-sense limits of when the VPF can be measured given a sample's density and volume, which quantifies the concern of when a radius is too large. Still, regardless of whether the radius you're measuring the VPF for is valid or can give accurate results, one should still oversample with the dropped test spheres to at least guarantee that the act of measuring the VPF isn't adding additional uncertainty.

I'm about to PR a branch with the updated vpf.py that completely omits the check, though I did draft one earlier that gave folks the option to override the check with a flag. CORRFUNC is awesome and I'm happy to contribute a tiny improvement to it!

manodeep · 2020-11-19T02:05:53Z

Hi @laperezNYC - my name is Manodeep - nice to e-meet you! Thank you for using Corrfunc and even more thanks for contributing an improvement.

@laperezNYC

* Removed independent volume check in vpf.py Following the discussion in issue #237 and after conversations with LGarrison, I've completely removed the check that confirmed that the volume inhabited by the dropped spheres was less than the volume of the sample. This check was unnecessary for the VPF to run, and actively discouraged the oversampling of the sample volume with test spheres that one needs to precisely measure the VPF at a given radius. * Update vpf.py Without the check of independent volumes, there's no need to overtly calculate the volume the samples inhabit. * Update CHANGES.rst Updated 2.4.0 with this new enhancement to theory.vpf * Update CHANGES.rst I promise I can type! * Tweaked wording @laperezNYC Tweaked the wording a bit. If this looks acceptable to you, then I will merge in the PR Co-authored-by: Manodeep Sinha <[email protected]>

laperezNYC mentioned this issue Nov 18, 2020

Removed independent volume check in vpf.py #238

Merged

manodeep linked a pull request Nov 19, 2020 that will close this issue

Removed independent volume check in vpf.py #238

Merged

manodeep closed this as completed in #238 Jan 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unnecessary vpf check? #237

unnecessary vpf check? #237

lgarrison commented Nov 6, 2020

manodeep commented Nov 7, 2020

lgarrison commented Nov 7, 2020

manodeep commented Nov 8, 2020

lgarrison commented Nov 9, 2020

manodeep commented Nov 10, 2020

lgarrison commented Nov 10, 2020 via email

manodeep commented Nov 11, 2020

lgarrison commented Nov 11, 2020

manodeep commented Nov 12, 2020

lgarrison commented Nov 12, 2020

manodeep commented Nov 17, 2020

lgarrison commented Nov 17, 2020

laperezNYC commented Nov 18, 2020 •

edited

Loading

manodeep commented Nov 19, 2020

unnecessary vpf check? #237

unnecessary vpf check? #237

Comments

lgarrison commented Nov 6, 2020

manodeep commented Nov 7, 2020

lgarrison commented Nov 7, 2020

manodeep commented Nov 8, 2020

lgarrison commented Nov 9, 2020

manodeep commented Nov 10, 2020

lgarrison commented Nov 10, 2020 via email

manodeep commented Nov 11, 2020

lgarrison commented Nov 11, 2020

manodeep commented Nov 12, 2020

lgarrison commented Nov 12, 2020

manodeep commented Nov 17, 2020

lgarrison commented Nov 17, 2020

laperezNYC commented Nov 18, 2020 • edited Loading

manodeep commented Nov 19, 2020

laperezNYC commented Nov 18, 2020 •

edited

Loading