CCDCesque critical value specification #66
I've actually thought a little bit about this too. In the short term, I get the feeling that a lot of the thresholds users are using were found through trial and error, so in terms of selecting the correct threshold I am not sure changing it to a p-value will make much of a difference (although it would seem a little more statistical). That does beg the question of whether trial and error will really yield better results than deriving the threshold from a chi distribution, however. That being said, it is something important to consider, and documentation could definitely be useful.

What I think would be nice would be some sort of well-thought-out workflow for choosing which bands to use and what threshold to set. This would be something done for each scene, and a little more involved than just clicking around in TSTools and changing the threshold. If we could take the time to document these 'change scores' or residuals for each band for the different land cover classes and change classes, we could easily test what the threshold should be in a more robust fashion. Thoughts?
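To make Eric's idea concrete, here is a minimal sketch of what documenting 'change scores' per class could look like. Everything in it is hypothetical: the per-pixel scores and class labels would have to come from actual model runs and reference data, and the function name is made up for illustration.

```python
# Hypothetical sketch: given per-pixel change scores and land cover / change
# class labels (e.g. from reference data), summarize the score distribution
# per class so a threshold can be chosen more systematically than by
# trial and error in TSTools.
import numpy as np

def summarize_scores(scores, labels):
    """Per-class quantiles of the change score (both inputs are placeholders)."""
    summary = {}
    for cls in np.unique(labels):
        class_scores = scores[labels == cls]
        summary[cls] = np.percentile(class_scores, [50, 90, 95, 99])
    return summary
```

For example, a threshold sitting above the upper quantiles of the stable-class scores but below the bulk of the change-class scores would be a defensible choice, and the same summary would show how that choice shifts for different band combinations.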
I am in favor of converting the `threshold` parameter to `p_value`. I definitely understand the concern that this could be misleading/confusing since we are not actually finding change using a formal null hypothesis testing framework, but I think the chi distribution value we are currently using is no less confusing, and most of us are selecting it by trial and error in a less than meaningful way. By internally selecting the critical value based on a user-defined p-value and the number of test indices, we would (a) ensure comparability across model runs with a variable number of test indices (something that is currently easy to overlook!) and (b) put the critical value into terms that are likely more familiar to users (i.e., I think it would be easier to explain that a p-value of 0.1 will be more sensitive to change than a p-value of 0.01 than to explain where the threshold of "3.368" came from).

I also agree that there should be better documentation explaining the interpretation of the p-value/critical value in this context and how it is selected. In the next couple of days, I plan to write up what I learned yesterday from Chris and me discussing all of the inputs, as a starting point.

To Eric's point about a workflow, I am also working on some test runs comparing change results for different parameter sets. I'm using a relatively small subset (~500,000 pixels) in the Broadmoor/Boston area, but it should be interesting to see if/how tweaking different parameters, fit types, and design matrices will impact the change results. I'll keep you both posted on what I find, and whether there are any useful lessons learned that we could share as examples in the documentation. Once I have a good process for comparing runs and a more standard set of model runs, it might also be worth running a few other subsets for further comparison.
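For reference, the conversion Valerie describes is essentially one call to `scipy.stats.chi.isf`. The sketch below is illustrative rather than a proposed yatsm API (the helper name is made up), but it shows how the critical value grows with the number of test indices at a fixed p-value, and that the "3.368" quoted above corresponds to p=0.01 with three test indices.

```python
# Sketch of the p_value -> critical value conversion discussed above.
# The helper name is illustrative, not part of yatsm.
from scipy import stats

def critical_value(p_value, n_test_indices):
    """Critical value for the root-sum-of-squares change score."""
    return stats.chi.isf(p_value, df=n_test_indices)

for k in (3, 4, 5):
    print(k, round(critical_value(0.01, k), 3))
# 3 3.368   <- the "3.368" threshold mentioned above (p=0.01, three test bands)
# 4 3.644
# 5 3.884

# A larger p-value lowers the critical value, i.e. more sensitivity to change
print(round(critical_value(0.10, 3), 3))  # 2.5
```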
Some arguments against, which currently reflect my opinion:
We could also just allow both in the parameter specification. I'm definitely in favor of better documentation (see #35 for documenting how we go about selecting parameters). Valerie's test runs ought to also give us some actual numbers behind what has so far been a speculative sense of the sensitivity of each parameter, and this would make for some great documentation material. I'm also up for better framing the `threshold` parameter.
I think it makes more sense to continue using the threshold and try to
I guess my main concern with the current `threshold` is that it is not adjusted for the number of test indices. I think the p-value approach is advantageous because it handles the connection to `test_indices` for you. If there is a way to account for differing numbers of test indices in another way, that could work too. For example, at one point yesterday, I think we talked about dividing the L2 norm by the number of test bands (or something like that?). Anything that ensures that when I change the test indices, my comparison to the threshold is adjusted accordingly would address my concern.

Maybe another option would be to abstract away from a p-value per se. I mean, when you select a threshold, you are in some sense arbitrarily relating it to the chi (not chi-squared) distribution. Maybe we could have a range of values that correspond to p-values internally but aren't p-values themselves. I might also just be over-complicating this.

I don't know how many users will actually want to produce and compare maps from different test indices. But after doing a couple of runs changing my inputs to see whether the original bands vs. BGW produced different results, only to find that the results were not actually a fair comparison unless I adjusted the threshold accordingly, I think it is worth coming up with a better way of handling the relationship between `threshold` and `test_indices`.
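One way to picture the comparability problem, and the "divide by the number of test bands" idea, is the toy simulation below. It assumes independent standard-normal scaled residuals (an idealization) and uses a square-root-of-k normalization as just one variant of what Valerie mentions; none of this is yatsm code.

```python
# Toy illustration: the root-sum-of-squares score drifts upward as more bands
# are tested, so a fixed threshold means different sensitivity for different
# test_indices. Dividing by sqrt(k) keeps the scores on a roughly common scale.
import numpy as np

rng = np.random.default_rng(0)

def change_score(scaled_residuals):
    # Root sum of squared scaled residuals across the test bands
    return np.sqrt(np.sum(scaled_residuals ** 2))

for k in (3, 5, 7):
    scores = np.array([change_score(rng.standard_normal(k)) for _ in range(10000)])
    print(k, scores.mean().round(2), (scores / np.sqrt(k)).mean().round(2))
```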
OK, long post below outlining why I'm going to eliminate the `p_value` option. I'm in favor of keeping `threshold`, and I think I understand the argument in favor of `p_value`, but the larger problem is that the chi distribution isn't suited for this purpose anyway. We're not actually looking at the sum of squares of standard normally distributed variables, because the residuals are scaled by the RMSE rather than by the standard deviation of the residuals.
Simple code example in R using `mtcars`:

```r
> fit <- lm(mpg ~ hp, data=mtcars)
> rmse <- mean(residuals(fit) ^ 2) ^ 0.5
> resid_std <- var(residuals(fit)) ^ 0.5
> print(c(rmse, resid_std))
[1] 3.740297 3.800146
```

So if we're not actually summing squares of standard normal variables, the chi distribution doesn't strictly describe our "test statistic". It seems quite a bit of a shame though to have to dismiss it on a technicality. Maybe this leads to some discussion about reshaping the "test statistic". We could, for example, just use the standard deviation of the residuals instead of the RMSE when scaling.
The "test statistic" used in CCDC is the square root of the sum of squared scaled residuals for all bands tested. This "test statistic" isn't normalized by how many bands you're adding, so the critical value needs to depend on how many test indices you have. If you use more indices to test with, then you'll need to increase the critical value by some amount.
Zhu derives the "test statistic" critical value for `p=0.01` and `k=len(test_indices)` using the inverse survival function of the `scipy.stats.chi` distribution, since we're probably summing squares of normally distributed variables (the scaled residuals).

Questions of independence, normality, statistical soundness, etc. aside, my biggest concern is that we don't really care about finding change according to some null hypothesis testing framework p-value. CCDC is, at best, vaguely statistical, and we've never analytically or numerically explored what the distribution of the "test statistic" is under the null hypothesis of no change. However, using `scipy.stats.chi.isf` does convey the important message that the critical value depends on how many bands are being tested.

So, the proposed solution either includes:
1. Converting the `threshold` parameter to `p_value` and retrieving the `threshold` using `scipy.stats.chi.isf` for a given number of `test_indices`, or
2. Allowing both `threshold` and `p_value`, with `threshold` being the default input that overrides `p_value` if both are specified

Thoughts, @bullocke, @valpasq, @parevalo, and @xjtang?
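On the point above that the distribution of the "test statistic" under no change has never been explored numerically, a simulation along the following lines could be a starting point. It leans on exactly the idealizations questioned earlier in the thread (independent, exactly standard-normal scaled residuals), so it shows what the chi-based critical value assumes rather than what CCDC actually produces.

```python
# Toy numerical check of the "no change" distribution of the test statistic
# against the scipy.stats.chi critical value, assuming independent N(0, 1)
# scaled residuals.
import numpy as np
from scipy import stats

p_value = 0.01
k = 3  # number of test indices (illustrative)
rng = np.random.default_rng(42)

scores = np.sqrt((rng.standard_normal((100_000, k)) ** 2).sum(axis=1))
crit = stats.chi.isf(p_value, df=k)  # ~3.368 for p=0.01, k=3

# Under these assumptions the false alarm rate matches p_value; with
# RMSE-scaled, correlated band residuals it generally would not.
print(crit, (scores > crit).mean())
```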