
backup: "invalid header size" error during restore #40670

Closed · Fixed by #40888
solongordon opened this issue Sep 11, 2019 · 9 comments

Assignees: pbardea
Labels: A-disaster-recovery, C-bug

Comments

@solongordon (Contributor) commented Sep 11, 2019

I encountered a concerning error while trying to restore a registration cluster backup to a roachprod cluster:

root@localhost:26257/defaultdb> RESTORE TABLE registration.* FROM 's3://cockroach-reg-backups/2019-09-01?AWS_ACCESS_KEY_ID=<redacted>&AWS_SECRET_ACCESS_KEY=<redacted>';
pq: importing 12095 ranges: importing span /Table/78/1/"\r{=RΡ\xadEj\x88iM\x03g+N\x0e"/4/1920-09-16T09:02:26.852719999Z/"SELECT _, _, _ FROM _ AS OF SYSTEM TIME _ WHERE _ = _"/1/0/"$ internal-read orphaned table leases"-k7~k\xd4N\xea\xa7\t\xa9\xe4\xa6\\\x01\x8a"/1/1920-10-04T04:48:11.013350999Z/"SELECT _, _ FROM _ WHERE (_ IN ($1, $2, __more1__)) AND (_ < $4) ORDER BY _ LIMIT _"/1/0/"$ internal-gc-jobs"}: adding to batch: /Table/71/1/"\ri\x91\x815}H\xf6\x9a\xdbDP2\x1e\xa2t"/1/1920-07-14T04:41:27.182670999Z/"UPDATE _ SET _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = -_, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _ WHERE _ = _"/0/0/"40f09bee"/0/1561058311.817775579,4 -> /TUPLE/4:4:Bytes/v2.0.2/1:5:Int/356418/1:6:False/false/5:11:Int/81/1:12:Int/0/1:13:Int/81/2:15:Float/0.5/1:16:Float/40.5/1:17:Float/0.0002460487839506174/1:18:Float/1.2843239990119444e-05/1:19:Float/0.00011604993209876539/1:20:Float/4.570659468070251e-06/1:21:Float/0.0009339686666666668/1:22:Float/0.00017130743693108405/1:23:Float/0.003720578672839506/1:24:Float/0.0030473801431714874/1:25:Float/0.005016646055555556/1:26:Float/0.005024995901326392: computing stats for SST [/Table/71/1/"\ri\x91\x815}H\xf6\x9a\xdbDP2\x1e\xa2t"/1/1920-05-05T15:38:08.254344999Z/"SELECT _, _, _, _, _, _, _, _, _ FROM _ WHERE _ IN (_, _)"/0/0/"40f09bee"/0, /Table/71/1/"\ri\x91\x815}H\xf6\x9a\xdbDP2\x1e\xa2t"/1/1920-07-14T04:41:27.178937999Z/"UPDATE _ SET _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = -_, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _ WHERE _ = _"/0/0/"40f09bee"/0/NULL): /Table/71/1/"\ri\x91\x815}H\xf6\x9a\xdbDP2\x1e\xa2t"/1/1920-05-05T21:38:06.091400999Z/"UPDATE _ SET _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ =
_, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _, _ = _ WHERE _ = _"/0/0/"40f09bee"/0: invalid header size: 4

I tried the same restore on a few different cockroach versions and observed the error on v19.2.0-beta.20190826 and later, but not on v19.2.0-alpha.20190805 and earlier.

Repro steps:

CLUSTER=$USER-secure
roachprod create $CLUSTER -n 3 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod stage $CLUSTER:1-3 cockroach
roachprod start $CLUSTER:1-3 --secure
roachprod sql $CLUSTER:1 --secure

Then run the following statements, filling in sensitive info as necessary:

SET CLUSTER SETTING cluster.organization = '<redacted>';
SET CLUSTER SETTING enterprise.license = '<redacted>';
CREATE DATABASE registration;
RESTORE TABLE registration.* FROM 's3://cockroach-reg-backups/2019-09-01?AWS_ACCESS_KEY_ID=<redacted>&AWS_SECRET_ACCESS_KEY=<redacted>';

The error should appear within 30 seconds.

@solongordon added the C-bug and A-disaster-recovery labels on Sep 11, 2019
@jordanlewis (Member)

@pbardea @solongordon is this a release blocker?

@solongordon (Contributor, Author)

Yes, @lucy-zhang added it to the list this morning.

@pbardea has bisected this issue to a commit which bumped the Pebble version. So far the reg cluster backups are the only known example of the error.

@pbardea self-assigned this on Sep 12, 2019
@pbardea (Contributor) commented Sep 17, 2019

cc @petermattis
Through experimentation I found that the issue seems to be related to the introduction of the two-level index block in Pebble. Commenting out this line: https://github.com/cockroachdb/pebble/blob/master/sstable/writer.go#L434 seems to allow the import to progress. It's not clear to me yet how this relates to a value.

@pbardea (Contributor) commented Sep 18, 2019

It also seems that when the import succeeds without that line, we are able to read the data from the restore. I'm unsure if that's expected, considering that I think this means the topLevelIndex is empty (I assume it may just scan all the data blocks in the SST in this case?).

@petermattis (Collaborator)

Huh, commenting out the line you indicated seems really problematic as we would create invalid sstables (the top-level index would be broken). Can you instead try setting Options.IndexBlockSize = math.MaxInt32? That is the "correct" way to disable two-level indexes.
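For illustration, here is a minimal Go sketch of that workaround: writing an sstable through pebble/sstable with IndexBlockSize set to math.MaxInt32. The filename is made up, and the signatures follow pebble's sstable package from roughly this era, so they may differ in other versions:

```go
package main

import (
	"log"
	"math"

	"github.com/cockroachdb/pebble/sstable"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	f, err := vfs.Default.Create("example.sst") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	// With IndexBlockSize at MaxInt32 the index block never fills up, so
	// the writer keeps a single-level index instead of splitting it into
	// a top-level index pointing at multiple index blocks.
	w := sstable.NewWriter(f, sstable.WriterOptions{
		IndexBlockSize: math.MaxInt32,
	})
	if err := w.Set([]byte("key"), []byte("value")); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```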

@pbardea (Contributor) commented Sep 18, 2019

It looks like that also resolves the issue. (The RESTORE is not yet complete, but it usually errors out quite quickly -- I'll update when the RESTORE completes.)

In this case, does it look like this is a Pebble issue? (I haven't found anything above this in the stack that looks amiss otherwise.) If so, I can file an issue and set the index block size as described above as a temporary work-around until the two-level index issue is resolved.

(For posterity: Yesterday I also noticed that the issue disappeared when I toggled https://github.com/cockroachdb/pebble/blob/master/sstable/writer.go#L364 to w.twoLevelIndex = false, which I believe would also force the usage of a single index block.)

@petermattis (Collaborator)

> In this case, does it look like this is a Pebble issue? (I haven't found anything above this in the stack that looks amiss otherwise.) If so, I can file an issue and set the index block size as described above as a temporary work-around until the two-level index issue is resolved.

Yes. Two-level indexes were only recently added to pebble. We don't actually enable them for RocksDB. Totally fine to disable them.

It will be useful for the issue you file to have reproduction instructions. Please include the SHA of cockroachdb you were running.

> (For posterity: Yesterday I also noticed that the issue disappeared when I toggled https://github.com/cockroachdb/pebble/blob/master/sstable/writer.go#L364 to w.twoLevelIndex = false, which I believe would also force the usage of a single index block.)

Right. That's the brute force way to disable two-level indexes.

pbardea added a commit to pbardea/cockroach that referenced this issue on Sep 18, 2019:
Setting the IndexBlockSize to MaxInt disables two-level indexes. Using
two-level indexes causes issues when restoring some registration
cluster backups. This change serves as a work-around until
cockroachdb/pebble#285 is resolved.

Fixes cockroachdb#40670.

Release justification: RESTOREs of registration cluster backups started
failing after two-level indexes were enabled in Pebble. This was a
release-blocking bug, and this fix allows these backups to be restored
again until the two-level index issue is investigated further.

Release note: None
@petermattis (Collaborator)

If I understand correctly, Pebble is being used to write the sstables which are then ingested into RocksDB, right? It is possible RocksDB has a bug in handling two-level indexes.

@craig (bot) closed this as completed in dd5aa30 on Sep 19, 2019
@petermattis (Collaborator)

Correcting my misunderstanding above: Pebble is being used to write the sstables, and golang/leveldb/table is then used to iterate over them in order to compute range stats. golang/leveldb/table doesn't understand two-level indexes. We should really change that code to use pebble/sstable instead, though there is also a bug in Pebble here: pebble/sstable.Writer should create LevelDB-compatible tables when asked to do so (and it wasn't).
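For contrast with golang/leveldb/table, here is a hedged sketch of iterating an sstable with pebble/sstable, which does understand two-level indexes. The filename is illustrative and the signatures again follow pebble's sstable package from roughly this era:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble/sstable"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	f, err := vfs.Default.Open("example.sst") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	// pebble's reader resolves a two-level index transparently: the
	// top-level index locates an index block, which in turn locates the
	// data blocks.
	r, err := sstable.NewReader(f, sstable.ReaderOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	it, err := r.NewIter(nil /* lower */, nil /* upper */)
	if err != nil {
		log.Fatal(err)
	}
	defer it.Close()

	for k, v := it.First(); k != nil; k, v = it.Next() {
		fmt.Printf("%s => %s\n", k.UserKey, v)
	}
}
```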
