-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Confidence Adjustment for Larger Shingle Sizes #407
Conversation
This PR addresses further adjustments to the confidence calculation issue discussed in PR 405. While PR 405 successfully resolved the issue for a shingle size of 4, it did not achieve the same results for larger shingle sizes like 8. Key Changes 1. Refinement of seenValues Calculation: * Previously, the formula increased confidence even as numImputed (number of imputations seen) increased because seenValues (all values seen) also increased. * This PR fixes the issue by counting only non-imputed values as seenValues. 2. Upper Bound for numImputed: * The numImputed is now upper bounded to the shingle size. * The impute fraction calculation, which uses numberOfImputed * 1.0 / shingleSize, now ensures the fraction does not exceed 1. 3. Decrementing numberOfImputed: * The numberOfImputed is decremented when there is no imputation. * Previously, numberOfImputed remained unchanged when there is an imputation as there was both an increment and a decrement, keeping the imputation fraction constant. This PR ensures the imputation fraction accurately reflects the current state. This adjustment ensures that the forest update decision, which relies on the imputation fraction, functions correctly. The forest is updated only when the imputation fraction is below the threshold of 0.5. Testing * Added test scenarios with various shingle sizes to verify the changes. Signed-off-by: Kaituo Li <[email protected]>
* the current timestamp (if all input values are missing), the timestamp of the | ||
* imputed tuple is the current timestamp, and we increment numberOfImputed. | ||
* | ||
* To check if imputed values are still present in the shingle, we use the first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least this part of the comment is still valid right? Would be good to have as I was pretty confused what previousTimeStamps[0] != previousTimeStamps[1]
really meant to do. or maybe exaplain that part in
Additionally so I am understaind correct, we are saying if the last 10 values were imputed and shingle size is 8, we will most likely return false here and not allow an update until we get an actual value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me add the comment back.
we are saying if the last 10 values were imputed and shingle size is 8, we will most likely return false here and not allow an update until we get an actual value?
yes, your understanding is correct.
numberOfImputed = numberOfImputed + 1; | ||
numberOfImputed = Math.min(numberOfImputed + 1, shingleSize); | ||
} else if (numberOfImputed > 0) { | ||
numberOfImputed = numberOfImputed - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point are we guaranteed to have seen a new value that is not imputed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.
Do we want to do this because cause we only care for the fraction in relation to the current shingle? |
yes. |
Signed-off-by: Kaituo Li <[email protected]>
Description of changes:
This PR addresses further adjustments to the confidence calculation issue discussed in PR 405. While PR 405 successfully resolved the issue for a shingle size of 4, it did not achieve the same results for larger shingle sizes like 8.
Key Changes
Testing
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.