Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Confidence Adjustment for Larger Shingle Sizes #407

Merged
merged 2 commits into from
Aug 1, 2024

Conversation

kaituo
Copy link
Collaborator

@kaituo kaituo commented Jul 31, 2024

Description of changes:

This PR addresses further adjustments to the confidence calculation issue discussed in PR 405. While PR 405 successfully resolved the issue for a shingle size of 4, it did not achieve the same results for larger shingle sizes like 8.

Key Changes

  1. Refinement of seenValues Calculation:
  • Previously, the formula increased confidence even as numImputed (number of imputations seen) increased because seenValues (all values seen) also increased.
  • This PR fixes the issue by counting only non-imputed values as seenValues.
  1. Upper Bound for numImputed:
  • The numImputed is now upper bounded to the shingle size.
  • The impute fraction calculation, which uses numberOfImputed * 1.0 / shingleSize, now ensures the fraction does not exceed 1.
  1. Decrementing numberOfImputed:
  • The numberOfImputed is decremented when there is no imputation.
  • Previously, numberOfImputed remained unchanged when there is an imputation as there was both an increment and a decrement, keeping the imputation fraction constant. This PR ensures the imputation fraction accurately reflects the current state. This adjustment ensures that the forest update decision, which relies on the imputation fraction, functions correctly. The forest is updated only when the imputation fraction is below the threshold of 0.5.

Testing

  • Added test scenarios with various shingle sizes to verify the changes.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@kaituo kaituo requested a review from greaa-aws July 31, 2024 22:06
This PR addresses further adjustments to the confidence calculation issue discussed in PR 405. While PR 405 successfully resolved the issue for a shingle size of 4, it did not achieve the same results for larger shingle sizes like 8.

Key Changes
1. Refinement of seenValues Calculation:
* Previously, the formula increased confidence even as numImputed (number of imputations seen) increased because seenValues (all values seen) also increased.
* This PR fixes the issue by counting only non-imputed values as seenValues.
2. Upper Bound for numImputed:
* The numImputed is now upper bounded to the shingle size.
* The impute fraction calculation, which uses numberOfImputed * 1.0 / shingleSize, now ensures the fraction does not exceed 1.
3. Decrementing numberOfImputed:
* The numberOfImputed is decremented when there is no imputation.
* Previously, numberOfImputed remained unchanged when there is an imputation as there was both an increment and a decrement, keeping the imputation fraction constant. This PR ensures the imputation fraction accurately reflects the current state. This adjustment ensures that the forest update decision, which relies on the imputation fraction, functions correctly. The forest is updated only when the imputation fraction is below the threshold of 0.5.

Testing
* Added test scenarios with various shingle sizes to verify the changes.

Signed-off-by: Kaituo Li <[email protected]>
* the current timestamp (if all input values are missing), the timestamp of the
* imputed tuple is the current timestamp, and we increment numberOfImputed.
*
* To check if imputed values are still present in the shingle, we use the first
Copy link
Contributor

@amitgalitz amitgalitz Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least this part of the comment is still valid right? Would be good to have as I was pretty confused what previousTimeStamps[0] != previousTimeStamps[1] really meant to do. or maybe exaplain that part in
Additionally so I am understaind correct, we are saying if the last 10 values were imputed and shingle size is 8, we will most likely return false here and not allow an update until we get an actual value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me add the comment back.

we are saying if the last 10 values were imputed and shingle size is 8, we will most likely return false here and not allow an update until we get an actual value?

yes, your understanding is correct.

numberOfImputed = numberOfImputed + 1;
numberOfImputed = Math.min(numberOfImputed + 1, shingleSize);
} else if (numberOfImputed > 0) {
numberOfImputed = numberOfImputed - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point are we guaranteed to have seen a new value that is not imputed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

@amitgalitz
Copy link
Contributor

The numImputed is now upper bounded to the shingle size.

Do we want to do this because cause we only care for the fraction in relation to the current shingle?

@kaituo
Copy link
Collaborator Author

kaituo commented Aug 1, 2024

The numImputed is now upper bounded to the shingle size.

Do we want to do this because cause we only care for the fraction in relation to the current shingle?

yes.

Signed-off-by: Kaituo Li <[email protected]>
@kaituo kaituo merged commit f2984b5 into aws:main Aug 1, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants