In pre-release versions of Rollup we used to concatenate all the rollup doc keys together to form the doc's ID, but at some point this was changed to a rolling CRC32. I don't recall the specifics about why we did this, probably just to save space and avoid giant IDs.
Unfortunately, this was poorly thought out. 32-bit IDs lead to collisions very quickly as the doc count grows. An existing Rollup index with over 200k docs has a very high chance of containing at least one collision, meaning a leaf doc for one rollup interval overwrote a previous leaf doc. Ultimately this leads to data loss and subtly incorrect results. It also limits the total number of docs the rollup index can hold.
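For the curious, here is a back-of-the-envelope birthday-bound check (just a sketch, not anything from the Rollup codebase) showing why 200k docs is already dangerous territory for a 32-bit ID space:

```java
// Birthday-bound estimate: probability that n uniformly random 32-bit IDs
// contain at least one collision, p ≈ 1 - exp(-n(n-1) / (2 * 2^32)).
public class CollisionEstimate {
    public static void main(String[] args) {
        double idSpace = Math.pow(2, 32);
        for (long n : new long[] {50_000, 100_000, 200_000, 500_000}) {
            double p = 1.0 - Math.exp(-((double) n * (n - 1)) / (2.0 * idSpace));
            System.out.printf("%,d docs -> P(collision) ≈ %.4f%n", n, p);
        }
    }
}
```

At 200k docs the estimate works out to roughly a 99% chance of at least one collision, and it only climbs from there.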
Our current plan for the fix:

- Move back to concatenating the job ID and composite keys together to form a unique ID (a sketch follows this list).
  - Insert a delimiter after the job ID to guarantee that collisions can't be created (accidentally or otherwise) based on the job name.
  - Guarantee that the date histo key comes immediately after the job ID. Due to date-format restrictions, this means we don't need to blacklist the delimiter in job names... just ensure it can't appear in a valid date pattern.
  - If the ID is >= 32766 bytes (the hard limit Lucene places on a single term), fall back to hashing with a sufficiently large hash.
- Add a flag to the persistent task's state that indicates whether the job is running on the old or the new ID scheme.
  - If the indexer is still running, continue to use the old ID scheme.
  - Whenever the job checkpoints (on stop, on failure, or periodically), toggle the flag and switch to the new scheme. Since the flag is persisted as well, any future triggers of the job will continue to use the new IDs.
  - All new jobs start with the flag toggled and use the new ID scheme.
- Bump the internal Rollup version, so that we have some diagnostic power for future problem reports.
- Benchmark the size of the concatenated IDs to make sure they don't bloat the index too much. Prefix compression should work well on these IDs, but it's good to double-check. If it's really bad, we can hash the values directly instead and skip the 32k cutover entirely.
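To make the plan concrete, here is a minimal sketch of the ID construction. This is not the actual implementation: the `$` delimiter, the `.` separator between composite keys, and the SHA-256 fallback are all illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.List;

public class RollupIdSketch {
    // Lucene rejects terms longer than 32766 bytes, so the _id must stay under that.
    private static final int MAX_ID_BYTES = 32766;
    // Hypothetical delimiter: '$' cannot appear in a valid date-format pattern,
    // so placing the date histo key right after "jobId$" prevents one job name
    // from forging another job's prefix.
    private static final char DELIMITER = '$';

    static String buildId(String jobId, long dateHistoKey, List<Object> otherKeys) throws Exception {
        StringBuilder sb = new StringBuilder(jobId).append(DELIMITER).append(dateHistoKey);
        for (Object key : otherKeys) {
            sb.append('.').append(key); // '.' between composite keys is illustrative
        }
        String id = sb.toString();
        byte[] utf8 = id.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < MAX_ID_BYTES) {
            return id;
        }
        // Fallback for oversized IDs: a sufficiently large hash (SHA-256 here,
        // purely as an example) keeps collisions astronomically unlikely.
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(utf8);
        return jobId + DELIMITER + Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildId("my_rollup_job", 1530000000000L, List.of("host-1", "GET")));
        // -> my_rollup_job$1530000000000.host-1.GET
    }
}
```

Note that the hash fallback is still deterministic: the same composite keys always produce the same digest, which is all the rollback-and-overwrite logic below needs.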
Changing the ID scheme in the middle of a job is acceptable as long as we have checkpointed our position. Deterministic IDs are only required so that, if we are interrupted before we get to the next checkpoint, we can roll back to the last checkpoint and just overwrite existing docs. So as long as we change the ID scheme on checkpoint, we know there are no "partial results" that may need overwriting.
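As a sketch of that flag-on-checkpoint idea (class, field, and method names here are hypothetical, not the real persistent task state):

```java
// Hypothetical persistent-task state illustrating the flag-on-checkpoint idea.
// All names are invented for the sketch; the real Rollup job state is more involved.
public class RollupJobStateSketch {
    private boolean newIdScheme;    // persisted alongside the rest of the task state
    private Object lastCheckpoint;  // position the indexer can safely roll back to

    /** Mid-run, the indexer keeps whichever scheme the run started with. */
    boolean useNewIdScheme() {
        return newIdScheme;
    }

    /**
     * Called on stop, on failure, or on a periodic checkpoint. Everything up to
     * the checkpoint is durably written, so there are no "partial results" that
     * a rolled-back run would need to overwrite under the old IDs; switching
     * schemes here is therefore safe.
     */
    void onCheckpoint(Object checkpoint) {
        this.lastCheckpoint = checkpoint;
        this.newIdScheme = true;  // new jobs start with this already set
        persist();                // stand-in for updating the persistent task state
    }

    private void persist() { /* no-op in this sketch */ }
}
```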
We do indeed have indices that, when rolled up, pass the 200k-doc barrier. As such, Rollup is not usable for us in its current state until we know for sure that we won't generate any collisions.
@draimondi-amadeus Merged and backported the fix to 6.4. Thanks for reporting this, glad we caught it sooner rather than later! I'm still kicking myself over such a silly mistake.
Hi @polyfractal , yes, I saw the code merge. Thank you for such a quick action! 👍
No worries, I understand that Rollup is still experimental. I greatly appreciate your efforts in improving this really useful function.
We'll deal with 7.0 upgrade issues in a follow-up.
/cc @jimczi @colings86 @pcsanwald @clintongormley