added comment
Signed-off-by: Kaituo Li <[email protected]>
kaituo committed Aug 1, 2024
1 parent 568199d commit 765c748
Showing 1 changed file with 47 additions and 0 deletions.
@@ -116,6 +116,44 @@ protected boolean updateAllowed() {

@Override
protected void updateTimestamps(long timestamp) {
/*
* For imputations done on timestamps other than the current one (specified by
* the timestamp parameter), the timestamp of the imputed tuple matches that of
* the input tuple, and we increment numberOfImputed. For imputations done at
* the current timestamp (if all input values are missing), the timestamp of the
* imputed tuple is the current timestamp, and we increment numberOfImputed.
*
* To check if imputed values are still present in the shingle, we use the
* condition (previousTimeStamps[0] == previousTimeStamps[1]). This works
* because previousTimeStamps has a size equal to the shingle size and is filled
* with the current timestamp.
*
* For example, if the last 10 values were imputed and the shingle size is 8,
* the condition will most likely return false until all 10 imputed values are
* removed from the shingle.
*
* However, there are scenarios where we might miss decrementing
* numberOfImputed:
*
* 1. Not all values in the shingle are imputed.
* 2. We accumulated numberOfImputed when the current timestamp had missing
* values.
*
* As a result, this could cause the data quality measure to decrease
* continuously since we are always counting missing values that should
* eventually be reset to zero. To address the issue, we add code in the method
* updateForest to decrement numberOfImputed when we move to a new timestamp,
* provided there is no imputation. This ensures the imputation fraction does
* not increase as long as the imputation is continuing. This also ensures that
* the forest update decision, which relies on the imputation fraction,
* functions correctly. The forest is updated only when the imputation fraction
* is below the threshold of 0.5.
*
* Why not merge the decrement code in updateTimestamps and updateForest into
* one place? Doing so would cause Consistency.ImputeTest to fail when testing
* with and without imputation, as the RCF scores would not change. The method
* updateTimestamps is also called from other places (e.g., updateState and
* dischargeInitial), not only from updateForest.
*/
if (previousTimeStamps[0] == previousTimeStamps[1]) {
numberOfImputed = numberOfImputed - 1;
}
@@ -142,8 +180,12 @@ void updateForest(boolean changeForest, double[] input, long timestamp, RandomCu
updateShingle(input, scaledInput);
updateTimestamps(timestamp);
if (isFullyImputed) {
// numberOfImputed is capped at the shingle size so that the impute fraction,
// calculated as numberOfImputed * 1.0 / shingleSize, does not exceed 1.
numberOfImputed = Math.min(numberOfImputed + 1, shingleSize);
} else if (numberOfImputed > 0) {
// Decrement numberOfImputed when the new value is not imputed
numberOfImputed = numberOfImputed - 1;
}
if (changeForest) {
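
The cap-and-decrement bookkeeping in this hunk can be sketched in isolation. The class below is a hypothetical stand-in, not the library's preprocessor: the names `ImputeCounterSketch`, `record`, and `imputeFraction` are assumptions made for illustration, and `updateAllowed` simply applies the 0.5 threshold mentioned in the `updateTimestamps` comment. It reuses the comment's scenario of 10 consecutive imputed values with a shingle size of 8.

```java
// Hypothetical sketch of the counter logic shown in updateForest.
// It is NOT the library class; only the two branches are copied from the diff.
class ImputeCounterSketch {
    private final int shingleSize;
    private int numberOfImputed = 0;

    ImputeCounterSketch(int shingleSize) {
        this.shingleSize = shingleSize;
    }

    // Cap the counter on a fully imputed update, decrement it otherwise.
    void record(boolean isFullyImputed) {
        if (isFullyImputed) {
            numberOfImputed = Math.min(numberOfImputed + 1, shingleSize);
        } else if (numberOfImputed > 0) {
            numberOfImputed = numberOfImputed - 1;
        }
    }

    // Same expression as in the diff comment; never exceeds 1 because of the cap.
    double imputeFraction() {
        return numberOfImputed * 1.0 / shingleSize;
    }

    // Assumption: the forest is updated only while the fraction is below 0.5.
    boolean updateAllowed() {
        return imputeFraction() < 0.5;
    }

    public static void main(String[] args) {
        ImputeCounterSketch counter = new ImputeCounterSketch(8);
        for (int i = 0; i < 10; i++) {
            counter.record(true); // 10 imputed values, shingle size 8
        }
        System.out.println(counter.imputeFraction()); // 1.0
        for (int i = 0; i < 5; i++) {
            counter.record(false); // 5 observed values drain the counter to 3
        }
        System.out.println(counter.imputeFraction()); // 0.375
        System.out.println(counter.updateAllowed()); // true
    }
}
```

Without the cap, the 10 imputed values would push the fraction to 1.25, and it would take 10 observed values rather than 8 to bring the counter back to zero.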
@@ -166,6 +208,11 @@ public void update(double[] point, float[] rcfPoint, long timestamp, int[] missi
return;
}
generateShingle(point, timestamp, missing, getTimeFactor(timeStampDeviations[1]), true, forest);
// The confidence formula depends on numberOfImputed (the number of recent
// imputations seen) and valuesSeen (all values seen). To ensure confidence
// decreases when numberOfImputed increases, we count only non-imputed values
// in valuesSeen.
if (missing == null || missing.length != point.length) {
++valuesSeen;
}
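
A minimal sketch of the counting rule above, assuming a hypothetical `ValuesSeenSketch` class that holds only the counter. The condition is copied from the diff; everything else, including how `missing` is interpreted (taken here to be an array of indices of missing dimensions), is an assumption for illustration. The confidence formula itself is not shown in this commit and is not reproduced here.

```java
// Hypothetical sketch: only the if-condition comes from the diff.
class ValuesSeenSketch {
    private int valuesSeen = 0;

    // Skip the increment only when `missing` has one entry per dimension of
    // `point`, i.e. when the whole tuple would have to be imputed.
    void update(double[] point, int[] missing) {
        if (missing == null || missing.length != point.length) {
            ++valuesSeen;
        }
    }

    int getValuesSeen() {
        return valuesSeen;
    }

    public static void main(String[] args) {
        ValuesSeenSketch sketch = new ValuesSeenSketch();
        double[] point = { 1.0, 2.0 };
        sketch.update(point, null);               // nothing missing: counted
        sketch.update(point, new int[] { 0 });    // partially missing: counted
        sketch.update(point, new int[] { 0, 1 }); // fully missing: not counted
        System.out.println(sketch.getValuesSeen()); // 2
    }
}
```

Note that under this condition a partially missing tuple still counts as seen; only a tuple with every dimension missing is excluded.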