In pre-release versions of Rollup we used to concatenate all the rollup doc keys together to form the doc's ID, but at some point this was changed to a rolling CRC32. I don't recall the specifics about why we did this, probably just to save space and avoid giant IDs.
Unfortunately, this was poorly thought out. 32-bit IDs lead to collisions very quickly as the doc count grows. An existing Rollup index with over 200k docs has a very high chance of containing at least one collision, meaning a leaf doc for one rollup interval overwrote a previous leaf doc. Ultimately this leads to data loss and subtly incorrect results. It also limits the total number of docs the rollup index can hold.
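For the curious, here is a back-of-the-envelope birthday-bound check (just a sketch, not anything from the Rollup codebase) showing why 200k docs is already dangerous territory for a 32-bit ID space:

```java
// Birthday-bound estimate: probability that n uniformly random 32-bit IDs
// contain at least one collision, p ≈ 1 - exp(-n(n-1) / (2 * 2^32)).
public class CollisionEstimate {
    public static void main(String[] args) {
        double idSpace = Math.pow(2, 32);
        for (long n : new long[] {50_000, 100_000, 200_000, 500_000}) {
            double p = 1.0 - Math.exp(-((double) n * (n - 1)) / (2.0 * idSpace));
            System.out.printf("%,d docs -> P(collision) ≈ %.4f%n", n, p);
        }
    }
}
```

At 200k docs the estimate works out to roughly a 99% chance of at least one collision, and it only climbs from there.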
Our current plan for the fix:

- Move back to concatenating the job ID and composite keys together to form a unique ID (a sketch follows this list).
  - Insert a delimiter after the job ID to guarantee that collisions can't be created (accidentally or otherwise) based on the job name.
  - Guarantee that the date histo key comes immediately after the job ID. Due to date-format restrictions, this means we don't need to blacklist the delimiter in job names... just ensure it can't appear in a valid date pattern.
  - If the ID is >= 32766 bytes (the hard limit Lucene places on a single term), fall back to hashing with a sufficiently large hash.
- Add a flag to the persistent task's state that indicates whether the job is running on the old or the new ID scheme.
  - If the indexer is still running, continue to use the old ID scheme.
  - Whenever the job checkpoints (on stop, on failure, or periodically), toggle the flag and switch to the new scheme. Since the flag is persisted as well, any future triggers of the job will continue to use the new IDs.
  - All new jobs start with the flag toggled and use the new ID scheme.
- Bump the internal Rollup version, so that we have some diagnostic power for future problem reports.
- Benchmark the size of the concatenated IDs to make sure they don't bloat the index too much. Prefix compression should work well on these IDs, but it's good to double-check. If it's really bad, we can hash the values directly instead and skip the 32k cutover entirely.
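To make the plan concrete, here is a minimal sketch of the ID construction. This is not the actual implementation: the `$` delimiter, the `.` separator between composite keys, and the SHA-256 fallback are all illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.List;

public class RollupIdSketch {
    // Lucene rejects terms longer than 32766 bytes, so the _id must stay under that.
    private static final int MAX_ID_BYTES = 32766;
    // Hypothetical delimiter: '$' cannot appear in a valid date-format pattern,
    // so placing the date histo key right after "jobId$" prevents one job name
    // from forging another job's prefix.
    private static final char DELIMITER = '$';

    static String buildId(String jobId, long dateHistoKey, List<Object> otherKeys) throws Exception {
        StringBuilder sb = new StringBuilder(jobId).append(DELIMITER).append(dateHistoKey);
        for (Object key : otherKeys) {
            sb.append('.').append(key); // '.' between composite keys is illustrative
        }
        String id = sb.toString();
        byte[] utf8 = id.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < MAX_ID_BYTES) {
            return id;
        }
        // Fallback for oversized IDs: a sufficiently large hash (SHA-256 here,
        // purely as an example) keeps collisions astronomically unlikely.
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(utf8);
        return jobId + DELIMITER + Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildId("my_rollup_job", 1530000000000L, List.of("host-1", "GET")));
        // -> my_rollup_job$1530000000000.host-1.GET
    }
}
```

Note that the hash fallback is still deterministic: the same composite keys always produce the same digest, which is all the rollback-and-overwrite logic below needs.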
Changing the ID scheme in the middle of a job is acceptable as long as we have checkpointed our position. Deterministic IDs are only required so that, if we are interrupted before we get to the next checkpoint, we can roll back to the last checkpoint and just overwrite existing docs. So as long as we change the ID scheme on checkpoint, we know there are no "partial results" that may need overwriting.
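As a sketch of that flag-on-checkpoint idea (class, field, and method names here are hypothetical, not the real persistent task state):

```java
// Hypothetical persistent-task state illustrating the flag-on-checkpoint idea.
// All names are invented for the sketch; the real Rollup job state is more involved.
public class RollupJobStateSketch {
    private boolean newIdScheme;    // persisted alongside the rest of the task state
    private Object lastCheckpoint;  // position the indexer can safely roll back to

    /** Mid-run, the indexer keeps whichever scheme the run started with. */
    boolean useNewIdScheme() {
        return newIdScheme;
    }

    /**
     * Called on stop, on failure, or on a periodic checkpoint. Everything up to
     * the checkpoint is durably written, so there are no "partial results" that
     * a rolled-back run would need to overwrite under the old IDs; switching
     * schemes here is therefore safe.
     */
    void onCheckpoint(Object checkpoint) {
        this.lastCheckpoint = checkpoint;
        this.newIdScheme = true;  // new jobs start with this already set
        persist();                // stand-in for updating the persistent task state
    }

    private void persist() { /* no-op in this sketch */ }
}
```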
We do indeed have indices that, when rolled up, pass the 200k-doc barrier. As such, Rollup is not usable for us in its current state until we know for sure that we won't generate any collisions.
@draimondi-amadeus Merged and backported the fix to 6.4. Thanks for reporting this, glad we caught it sooner rather than later! I'm still kicking myself over such a silly mistake.
Hi @polyfractal , yes, I saw the code merge. Thank you for such a quick action! 👍
No worries, I understand that Rollup is still experimental. I greatly appreciate your efforts in improving this really useful function.
We'll deal with 7.0 upgrade issues in a follow-up.
/cc @jimczi @colings86 @pcsanwald @clintongormley