Tolerate b1 shard index inconsistencies during TSM conversion #5647

jonseymour · 2016-02-12T07:45:25Z

These commits address data loss during TSM conversion that results from a shard index inconsistency. The inconsistency stems from a defect that was fixed (3348dab) in version 0.9.3. The data loss itself was reported in #5606.

The changes have been tested against a 47GB database of 32 shards and 1.1B points. Prior to the fix, around 600M points (4GB) were unexpectedly dropped from the converted database. With the fix, the full 1.1B points were preserved, resulting in a converted database size of 8.8GB.

Feedback is welcome about whether the list of orphaned series should be dumped to the output and whether the tracker should be used to dump the warning messages.

Another possible area of improvement is how the RepairIndex configuration is associated with the b1 reader. Should I pass it as a parameter to the reader constructor, or pass a reader-specific options/configuration object?

A similar change should probably also be made to the bz1 reader, but I didn't have a test case I could use to verify the integrity of such a change. I am happy to extend the change to the bz1 reader if that is desired.

It is probably worth noting that any Influx database containing data created with a version earlier than 0.9.3 may be susceptible to silent data loss during TSM conversion until this fix is applied.

A case (influxdata#5606) was found where a lot of data unexpectedly disappeared from a database following a TSM conversion.

The proximate cause was an inconsistency between the root Bolt DB bucket list and the meta data in the "series" bucket of the same shard. There were apparently valid series in Bolt DB buckets that were no longer referenced by the meta data in the "series" bucket - so-called orphaned series. Since the conversion process only iterated across the series found in the meta data, it caused the orphaned series to be removed from the converted shards. This resulted in the unexpected removal of data from the TSM shards that had previously been accessible (despite the meta data inconsistency) in the b1 shards.

The root cause of the meta data inconsistency in the case above is not understood but, in that case, removal of the orphaned series wasn't the appropriate resolution of the inconsistency.

This change detects occurrences of meta data inconsistency in shards during conversion and, by default, will cause the conversion process to fail (--repair-index=fail) if any such inconsistency is found. The user can force the conversion to occur by choosing to resolve the inconsistency either by assuming the meta data is incorrect and attempting to convert the orphaned series (--repair-index=repair) or by assuming the meta data is correct and ignoring the orphaned series (--repair-index=ignore).

Currently, detection and recovery of orphaned series is only supported for b1 shards; bz1 shards are not supported.

Signed-off-by: Jon Seymour <[email protected]>

jwilder · 2016-02-12T14:26:42Z

@joelegasse can you take a look?

joelegasse · 2016-02-12T16:48:31Z

@jonseymour Do you have an example of how to create the orphaned series in the old versions? I'd like to see if bz1 is similarly affected.

As far as your code, the change looks good, but I would suggest a few things:

- Modify the constructor of b1.Reader, rather than adding an exported field
- Change the usage text of the command line flag to remove the repetition, maybe "Determines how to handle orphaned series: 'fail', 'repair', or 'ignore'"
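A rough sketch of what these two suggestions could look like together, assuming a simplified stand-in for the reader. The `Reader` type, `NewReader` signature, `parseRepairIndex` helper, and shard path here are hypothetical and are not the actual b1 package API; the flag name and usage text are the ones proposed in this thread.

```go
package main

import (
	"flag"
	"fmt"
)

// Reader is a hypothetical stand-in for the b1 reader. The repair mode is
// kept in an unexported field and set only through the constructor.
type Reader struct {
	path        string
	repairIndex string
}

// NewReader validates the mode up front so that an invalid value cannot
// reach the conversion loop.
func NewReader(path, repairIndex string) (*Reader, error) {
	switch repairIndex {
	case "fail", "repair", "ignore":
		return &Reader{path: path, repairIndex: repairIndex}, nil
	default:
		return nil, fmt.Errorf("invalid repair-index value %q", repairIndex)
	}
}

// RepairIndex exposes the configured mode read-only.
func (r *Reader) RepairIndex() string { return r.repairIndex }

// parseRepairIndex registers the flag with the shortened usage text and
// returns the value found in args.
func parseRepairIndex(args []string) (string, error) {
	fs := flag.NewFlagSet("influx_tsm", flag.ContinueOnError)
	mode := fs.String("repair-index", "fail",
		"Determines how to handle orphaned series: 'fail', 'repair', or 'ignore'")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *mode, nil
}

func main() {
	mode, err := parseRepairIndex([]string{"-repair-index=repair"})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	r, err := NewReader("path/to/shard", mode) // placeholder path
	if err != nil {
		fmt.Println("reader error:", err)
		return
	}
	fmt.Println("repair mode:", r.RepairIndex())
}
```

Passing the mode through the constructor keeps the setting immutable after construction, which is the main argument for it over an exported field.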

But those might be moot depending on the answer to a question... For the "orphaned" series, are they query-able if left in their current shards? If so, is there any reason we would want to ignore/drop the data during the conversion? In other words, why would we want to make this configurable, instead of just automatically repairing?

joelegasse · 2016-02-12T18:16:39Z

@jonseymour After talking with the team, it looks like there's no reason not to just scan for all the series and include them, completely ignoring the index. I'll create another PR that will handle both b1 and bz1 shards.

Thank you for digging in to this and figuring out what was going wrong. :-)

joelegasse · 2016-02-12T23:29:13Z

I've incorporated, simplified, and extended your changes to both b1 and bz1 in #5665, so I'm going to close this PR.

jonseymour added 2 commits February 12, 2016 20:41

If Open failed, then r.tx may not be initialized.

21c4b93

Signed-off-by: Jon Seymour <[email protected]>

jonseymour force-pushed the repair-b1-shard branch from 18f675b to 21c4b93 on February 12, 2016 09:42

joelegasse closed this Feb 12, 2016

This was referenced Feb 13, 2016

influx_tsm: ignore shard index when converting b1 shards #5674

Closed

influx_tsm: ignore shard index when converting b1 shards #5675

Merged

influx_tsm: convert non-indexed series, too #5665

Closed
