Fix Confidence Adjustment for Larger Shingle Sizes
This PR addresses further adjustments to the confidence calculation issue discussed in PR 405. While PR 405 resolved the issue for a shingle size of 4, it did not achieve the same results for larger shingle sizes such as 8.

Key Changes

1. Refinement of seenValues Calculation:
   * Previously, the formula increased confidence even as numImputed (number of imputations seen) increased, because seenValues (all values seen) also increased.
   * This PR fixes the issue by counting only non-imputed values as seenValues.
2. Upper Bound for numImputed:
   * numImputed is now upper bounded by the shingle size.
   * The impute fraction calculation, which uses numberOfImputed * 1.0 / shingleSize, now ensures the fraction does not exceed 1.
3. Decrementing numberOfImputed:
   * numberOfImputed is decremented when there is no imputation.
   * Previously, numberOfImputed remained unchanged when there was an imputation, because there was both an increment and a decrement, keeping the imputation fraction constant. This PR ensures the imputation fraction accurately reflects the current state.

This adjustment ensures that the forest update decision, which relies on the imputation fraction, functions correctly. The forest is updated only when the imputation fraction is below the threshold of 0.5.

Testing

* Added test scenarios with various shingle sizes to verify the changes.

Signed-off-by: Kaituo Li <[email protected]>
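
The bookkeeping described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the changed files: the class `ImputationTracker` and its method names are hypothetical, and it assumes each non-imputed observation displaces exactly one imputed slot in the shingle. Only the counter names, the `numberOfImputed * 1.0 / shingleSize` formula, and the 0.5 threshold come from the description above.

```java
/**
 * Hypothetical sketch of the imputation bookkeeping described in the commit
 * message. Class and method names are illustrative, not the project's API.
 */
final class ImputationTracker {
    // Threshold taken from the commit description: update only below 0.5.
    private static final double UPDATE_THRESHOLD = 0.5;

    private final int shingleSize;
    private int numberOfImputed; // kept within [0, shingleSize]
    private long seenValues;     // counts only non-imputed observations

    ImputationTracker(int shingleSize) {
        if (shingleSize <= 0) {
            throw new IllegalArgumentException("shingleSize must be positive");
        }
        this.shingleSize = shingleSize;
    }

    /** Records one observation; {@code imputed} is true when the value was filled in. */
    void record(boolean imputed) {
        if (imputed) {
            // Upper bound: the count never exceeds the shingle size.
            numberOfImputed = Math.min(numberOfImputed + 1, shingleSize);
        } else {
            // Assumption: a real value pushes one imputed slot out of the shingle.
            numberOfImputed = Math.max(numberOfImputed - 1, 0);
            seenValues++; // imputed values do not count as seen
        }
    }

    /** Fraction of the shingle that is imputed; never exceeds 1. */
    double imputedFraction() {
        return numberOfImputed * 1.0 / shingleSize;
    }

    /** The forest is updated only while the imputed fraction is below the threshold. */
    boolean shouldUpdateForest() {
        return imputedFraction() < UPDATE_THRESHOLD;
    }

    long seenValues() {
        return seenValues;
    }
}
```

With a shingle size of 8, a long run of imputed values saturates the fraction at 1 and blocks forest updates, and updates resume only after enough real values arrive to bring the fraction back under 0.5.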
Showing 3 changed files with 73 additions and 52 deletions.