Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBG-3742: Allow registry rollbacks based on db config doc rollbacks #6709

Merged
merged 18 commits into from
Mar 5, 2024

Conversation

bbrks
Copy link
Member

@bbrks bbrks commented Feb 29, 2024

CBG-3742: Allow registry rollbacks based on db config doc rollbacks.

  • Registry will roll back to state of an older db config when possible (allows for vbucket rollbacks, db config doc backup/restore).
  • If a rollback like this causes a collection conflict, the database is marked as invalid and the registry rolls back with this information. The database will not be loaded, but can be repaired with a normal db config update.

Example log output for a rollback causing conflicting collections:

2024-02-29T15:39:54.284Z [INF] Config: db:c1_db1 Registry rollback required for bucket: sg_int_rosmar_0ce23081d42ca93d4beda50d123d681d, groupID: default, dbName:c1_db1
2024-02-29T15:39:54.284Z [INF] Config: db:c1_db1 Rolling back config registry to align with db config version 1-a for db: c1_db1, bucket:sg_int_rosmar_0ce23081d42ca93d4beda50d123d681d configGroup:default
2024-02-29T15:39:54.284Z [INF] Config: db:c1_db1 Rollback requested but registry did not include previous version for db <ud>c1_db1</ud> - using config doc as previous version
2024-02-29T15:39:54.284Z [WRN] db:c1_db1 db <ud>c1_db1</ud> config rollback would cause collection conflicts (<ud>map[sg_test_0.sg_test_1:c1_db2]</ud>) - marking database as invalid to allow for manual repair -- rest.(*GatewayRegistry).rollbackDatabaseConfig() at config_registry.go:234
2024-02-29T15:39:54.284Z [INF] Config: db:c1_db1 Successful config registry rollback for bucket: sg_int_rosmar_0ce23081d42ca93d4beda50d123d681d, configGroup: default, db: c1_db1
2024-02-29T15:39:54.285Z [WRN] Must repair invalid database config for "c1_db1" for it to be usable! Conflicting collections detected -- rest.(*invalidDatabaseConfigs).addInvalidDatabase() at config.go:334

TODO

  • REST API level testing of db config (avoid direct usage of sc.BootstrapContext in TestPersistentConfigRegistryRollbackCollectionConflictAfterDbConfigRollback)
    • Want to make sure that it's not possible to load an invalid db config accidentally (e.g. an on-demand db config load)

Pre-review checklist

  • Removed debug logging (fmt.Print, log.Print, ...)
  • Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
  • Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Integration Tests

@bbrks bbrks self-assigned this Feb 29, 2024
@bbrks bbrks marked this pull request as ready for review March 1, 2024 15:54
@bbrks bbrks requested a review from torcolvin March 1, 2024 15:55
@bbrks bbrks removed their assignment Mar 1, 2024
Copy link
Collaborator

@torcolvin torcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of something off the top of my head, but is there a more than 2 way rollback scenario we should worry about?

base/stats.go Outdated Show resolved Hide resolved
rest/config_registry.go Outdated Show resolved Hide resolved
rest/persistent_config_test.go Outdated Show resolved Hide resolved
rest/persistent_config_test.go Show resolved Hide resolved
rest/persistent_config_test.go Outdated Show resolved Hide resolved
rest/config.go Outdated
logMessage += " Conflicting collections detected"
} else {
// Nothing is expected to hit this case, but we might add more invalid sentinel values and forget to update this code.
logMessage += "Invalid config with no known cause."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it reasonable to log the config contents here so we have a chance to update it? I think it might expose information, but we could at least list the name of the document so it's easier for support to pull this from a customer.

I think this is low priority because we shouldn't hit this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't like to log a full config and I wouldn't expect to require bucket access to the config doc to repair it. Especially thinking forwards to system collections where that won't be possible.

The only reason we'd end up here is some new code in SG called _handleInvalidDatabaseConfig without updating this bit of code to appropriately handle that case.

I've reworded the log message to be less accusatory about the config itself being invalid. It could be something else that made us consider the database invalid.

logArgs = append(logArgs, base.MD(d.dbNames[dbname].configBucketName), base.MD(d.dbNames[dbname].persistedBucketName))
} else if cnf.Version == invalidDatabaseConflictingCollectionsVersion {
base.SyncGatewayStats.GlobalStats.ConfigStat.DatabaseCollectionCollisions.Add(1)
logMessage += " Conflicting collections detected"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be logged upstream, but I think it would great to somewhere print the name of the registry / config docs and what collections are conflicting

Copy link
Member Author

@bbrks bbrks Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not feasible to do that here on load, but the warning log that occurred when the rollback set this invalid flag logs what collection(s) caused it.

2024-02-29T15:39:54.284Z [WRN] db:c1_db1 db <ud>c1_db1</ud> config rollback would cause collection conflicts (<ud>map[sg_test_0.sg_test_1:c1_db2]</ud>) - marking database as invalid to allow for manual repair -- rest.(*GatewayRegistry).rollbackDatabaseConfig() at config_registry.go:234
2024-02-29T15:39:54.285Z [WRN] Must repair invalid database config for "c1_db1" for it to be usable! Conflicting collections detected -- rest.(*invalidDatabaseConfigs).addInvalidDatabase() at config.go:334

Given the rollback that just happened, the customer probably doesn't want to change the non-rolled back database, so I'm not sure there's value in knowing which database this one conflicts with. The rollback is the problem that needs correcitng and we log the collection that needs removing.

if registryDatabase.PreviousVersion == nil {
return fmt.Errorf("Rollback requested but registry did not include previous version for db %s", base.MD(dbName))
base.InfofCtx(ctx, base.KeyConfig, "Rollback requested but registry did not include previous version for db %s - using config doc as previous version", base.UD(dbName))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say "config doc" might be "last read of doc %s" if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes any material difference to change this? Last read is generally implied.

rest/persistent_config_test.go Outdated Show resolved Hide resolved
rest/persistent_config_test.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@adamcfraser adamcfraser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@torcolvin torcolvin assigned bbrks and unassigned torcolvin Mar 5, 2024
@bbrks bbrks requested a review from torcolvin March 5, 2024 21:03
@bbrks bbrks assigned torcolvin and unassigned bbrks Mar 5, 2024
@bbrks
Copy link
Member Author

bbrks commented Mar 5, 2024

Added subtest to cover simultaneous db rollback

@adamcfraser
Copy link
Collaborator

Changes look good, just a test tweak needed for the CI failures.

@bbrks bbrks requested a review from adamcfraser March 5, 2024 22:21
@adamcfraser adamcfraser merged commit fe58465 into master Mar 5, 2024
30 checks passed
@adamcfraser adamcfraser deleted the CBG-3742 branch March 5, 2024 22:53
bbrks added a commit that referenced this pull request Mar 6, 2024
…acks based on db config doc rollbacks (#6709)

* Extend startBootstrapServerWithoutConfigPolling polling interval

* tidy return

* Use config from bucket to roll back if no previous version is found (config vbucket rollback/config restore)

* wip

* Mark conflicting rollbacks as invalid in db registry and prevent loading by SG. Allows for manual repair of database config.

* Improve log message for invalid database configurations - handle multiple scenarios

* Handle unknown reasons why a db could be invalid

* Do database repair via normal Admin REST API rather to ensure externally recoverable

* Equal(len())->Len()

* Rename stat

* UD->MD fix

* lower retry timeout for testing

* Require 3 datastores

* Replace registry entry with one built from config

* Reword unexpected InvalidDatabase case

* Push mutliDatabsaeRollback into subtest

* fix ineffassign

* Fix non-deterministic slice ordering from RequireInvalidDatabaseConfigNames helper
adamcfraser pushed a commit that referenced this pull request Mar 7, 2024
#6718)

* [3.1.4 Backport] CBG-3751: Cherry-pick CBG-3742: Allow registry rollbacks based on db config doc rollbacks (#6709)

* Extend startBootstrapServerWithoutConfigPolling polling interval

* tidy return

* Use config from bucket to roll back if no previous version is found (config vbucket rollback/config restore)

* wip

* Mark conflicting rollbacks as invalid in db registry and prevent loading by SG. Allows for manual repair of database config.

* Improve log message for invalid database configurations - handle multiple scenarios

* Handle unknown reasons why a db could be invalid

* Do database repair via normal Admin REST API rather to ensure externally recoverable

* Equal(len())->Len()

* Rename stat

* UD->MD fix

* lower retry timeout for testing

* Require 3 datastores

* Replace registry entry with one built from config

* Reword unexpected InvalidDatabase case

* Push mutliDatabsaeRollback into subtest

* fix ineffassign

* Fix non-deterministic slice ordering from RequireInvalidDatabaseConfigNames helper

* New tests require bootstrap connection
bbrks added a commit that referenced this pull request Mar 28, 2024
…6709)

* Extend startBootstrapServerWithoutConfigPolling polling interval

* tidy return

* Use config from bucket to roll back if no previous version is found (config vbucket rollback/config restore)

* wip

* Mark conflicting rollbacks as invalid in db registry and prevent loading by SG. Allows for manual repair of database config.

* Improve log message for invalid database configurations - handle multiple scenarios

* Handle unknown reasons why a db could be invalid

* Do database repair via normal Admin REST API rather to ensure externally recoverable

* Equal(len())->Len()

* Rename stat

* UD->MD fix

* lower retry timeout for testing

* Require 3 datastores

* Replace registry entry with one built from config

* Reword unexpected InvalidDatabase case

* Push mutliDatabsaeRollback into subtest

* fix ineffassign

* Fix non-deterministic slice ordering from RequireInvalidDatabaseConfigNames helper
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants