CBG-3292: Handling for corrupt db config #6377

gregns1 · 2023-08-23T13:43:34Z

First I have changed it so that when sync gateway on the config interval looks for database configs in the bucket, will also check that the config bucket name matches the bucket the config is found in. If it doesn't we add the db context to the server context corrupt database map to keep track of. Then the changes will remove the in memory representation of the db config and the db context off of the server context all together.
This forces operation on the database to fail so I have added some handling for a suitable error message to be returned to the user in this eventuality informing them that they need to update config to correct the bucket name.
Lastly we needed to be able to correct this corrupt config which is a challenge as at this stage all requests to the db would fail. But I added handling to handle a PUT/POST requests to the db to update the config to correct the corrupt bucket name.
Test seems to panic against rosmar when trying to update bucket config so added walrus check to avoid it against rosmar.

Pre-review checklist

Removed debug logging (fmt.Print, log.Print, ...)
Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Integration Tests

GSI=true,xattrs=true https://jenkins.sgwdev.com/job/SyncGateway-Integration/1982/

rest/config.go

rest/admin_api.go

rest/adminapitest/admin_api_test.go

rest/config.go

rest/handler.go

rest/server_context.go

torcolvin

I left comments on the code, and the code comments will hold true regardless, but there is a bigger problem, maybe.

We store the map by dbName but if you run a backup on bucketA with db1 to bucketB and bucketC, there are actually three buckets with a matching database name, but we index the invalid names by dbname only so fixing in bucketB will remove this from bucketC. This code may behave OK, but if that's the case, I'd prefer a test for this. I'd probably try to break that test from the other test cases, or at least parametrize the test for the different cases to make it easier to understand what the test is doing in the future.

Additionally, I'm worried about the frequency of logging here and I'd like other people's input. It's definitely possible that this isn't possible to fix quickly and I don't know the right level to log on. It might be right to log the first time on warning and then log on info for the subsequent errors. (Just throwing out ideas).

rest/config.go

rest/handler.go

rest/utilities_testing.go

rest/adminapitest/admin_api_test.go

rest/handler.go

torcolvin

As discussed offline, we need to worry about different cases and can write tests for them - I'd each of these as a separate test case so it's clear what they are doing.

start RestTester with no buckets and write a dbconfig to a bucket with the SDK that is an invalid bucket (this is a restore of a bucket with nothing running). Assert that the database doesn't exist or we hit the point in time where it doesn't work
make sure if you do a PUT with a mismatched bucket name in the config it will fail (this may already be tested)
in bucketA and bucketB, write two dbconfigs with bucket name as bucketC; then start RestTester, make sure neither of them are picked up (this is the case where we have two buckets that are backed up/restored but no working buckets)
create bucketA and bucketB with dbconfigs that that both list bucket name as bucketA, then start RestTester make sure database for bucketA is picked up and there is logging for bucketB having the wrong bucket name (this is the case where we did a backup/restore of bucketA as bucketB and started SG)
create a working dbconfig in bucketA in a resttester, make sure db is running. write the config into bucketB while the RestTester is running, make sure the database backed by bucketA is still running and there's appropriate logging. (this is the case where SG is running and backup/restore hits)

These roughly match the scenarios of what happens if you start a SG when you have multiple buckets with different information and also what happens if a SG is already running and a bucket appears.

Because an invalid database can never appear, I don't think we need code in handler.go - if there's an bucket name mismatch, we just won't load the database, so you can't query it with /db -> it'll just return 404 as not found because the database wasn't loaded.

The last change we want to make to the bootstrap logic is to log a warning the first time we see a mismatched bucket, but then after log a warning. This is a change from the code on a timer that I wrote.

rest/server_context.go

torcolvin

Most comments are just nitpicks about readability.

There is one bigger problem which I think it's worth writing a new test or two for - what happens if you have

dbconfig A, groupID A backed by bucket A in bucket A
dbconfig B, groupID B backed by bucket A in bucket B
dbconfig C, groupID A backed by bucket A in bucket B

It's possible most tests need to do include groupID so make sure that we are tracking dbName, bucket, groupID because it seems like our invalid database structure drops groupID.

Generally I think groupID is used to have a duplicate configuration but avoid having certain nodes participate in import.

rest/adminapitest/admin_api_test.go

base/constants.go

rest/adminapitest/admin_api_test.go

rest/config.go

rest/handler.go

rest/server_context.go

torcolvin · 2023-09-11T16:21:41Z

Most comments are just nitpicks about readability.

There is one bigger problem which I think it's worth writing a new test or two for - what happens if you have

dbconfig A, groupID A backed by bucket A in bucket A

dbconfig B, groupID B backed by bucket A in bucket B

dbconfig C, groupID A backed by bucket A in bucket B

It's possible most tests need to do include groupID so make sure that we are tracking dbName, bucket, groupID because it seems like our invalid database structure drops groupID.

Generally I think groupID is used to have a duplicate configuration but avoid having certain nodes participate in import.

This is actually not an issue outside of tests because Sync Gateway only has a single ServerContext and it can only have one groupID

* CBG-3292: force config update on corrupt db config

CBG-3292: force config update on corrupt db config

c1431a2

gregns1 self-assigned this Aug 23, 2023

fix lint error, fix potential panic too

15f6d97

torcolvin reviewed Aug 30, 2023

View reviewed changes

updates from review

f438766

gregns1 assigned torcolvin and unassigned gregns1 Aug 31, 2023

gregns1 requested a review from torcolvin August 31, 2023 13:33

torcolvin reviewed Aug 31, 2023

View reviewed changes

rest/handler.go Outdated Show resolved Hide resolved

torcolvin reviewed Sep 1, 2023

View reviewed changes

rest/server_context.go Outdated Show resolved Hide resolved

gregns1 added 3 commits September 1, 2023 15:58

updates to add more tests and address review commnets

c808369

go fmt run

56701ac

updates to update logging

1f8026f

torcolvin reviewed Sep 5, 2023

View reviewed changes

torcolvin assigned gregns1 and unassigned torcolvin Sep 5, 2023

gregns1 added 2 commits September 5, 2023 16:49

updates to address comments

ca97d61

fix test assertions

2befa71

torcolvin enabled auto-merge (squash) September 11, 2023 16:21

torcolvin approved these changes Sep 11, 2023

View reviewed changes

torcolvin merged commit b35e52e into master Sep 11, 2023
26 checks passed

torcolvin deleted the CBG-3292 branch September 11, 2023 18:28

torcolvin pushed a commit that referenced this pull request Sep 15, 2023

CBG-3292: Handling for corrupt db config (#6377)

8ef6388

* CBG-3292: force config update on corrupt db config

bbrks pushed a commit that referenced this pull request Oct 3, 2023

CBG-3292: 3.1.2 Backport: Handling for corrupt db config (#6377)

22ba9bc

* CBG-3292: force config update on corrupt db config

bbrks pushed a commit that referenced this pull request Oct 3, 2023

CBG-3292: 3.1.2 Backport: Handling for corrupt db config (#6377)

5888d47

* CBG-3292: force config update on corrupt db config

This was referenced Oct 3, 2023

[3.1.2 Backport] CBG-3494: Handling for corrupt db config #6487

Merged

[3.1.2 Backport] CBG-3387: Handle nil buckets in db config #6488

Merged

bbrks pushed a commit that referenced this pull request Mar 28, 2024

CBG-3292: Handling for corrupt db config (#6377)

d7b7101

* CBG-3292: force config update on corrupt db config

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CBG-3292: Handling for corrupt db config #6377

CBG-3292: Handling for corrupt db config #6377

gregns1 commented Aug 23, 2023 •

edited

Loading

torcolvin left a comment

torcolvin left a comment

torcolvin left a comment

torcolvin commented Sep 11, 2023

CBG-3292: Handling for corrupt db config #6377

CBG-3292: Handling for corrupt db config #6377

Conversation

gregns1 commented Aug 23, 2023 • edited Loading

Pre-review checklist

Integration Tests

torcolvin left a comment

Choose a reason for hiding this comment

torcolvin left a comment

Choose a reason for hiding this comment

torcolvin left a comment

Choose a reason for hiding this comment

torcolvin commented Sep 11, 2023

gregns1 commented Aug 23, 2023 •

edited

Loading