Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: data race in (*diskStatsMap).initDiskStatsMap() #91414

Closed
renatolabs opened this issue Nov 7, 2022 · 8 comments · Fixed by #101097 or #101148
Closed

server: data race in (*diskStatsMap).initDiskStatsMap() #91414

renatolabs opened this issue Nov 7, 2022 · 8 comments · Fixed by #101097 or #101148
Assignees
Labels
A-admission-control A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team

Comments

@renatolabs
Copy link
Contributor

renatolabs commented Nov 7, 2022

This data race is sometimes observed when TestClusterConnectivity is run with the race detector enabled:

==================                                                                                                                                                                                                               
WARNING: DATA RACE                                                                                                                                                                                                               
Write at 0x00c0038ca920 by goroutine 20258:                                                                                                                                                                                      
  github.com/cockroachdb/cockroach/pkg/server.(*diskStatsMap).initDiskStatsMap()                                                                                                                                                 
      github.com/cockroachdb/cockroach/pkg/server/node.go:839 +0x5c                                                                                                                                                              
  github.com/cockroachdb/cockroach/pkg/server.(*Node).registerEnginesForDiskStatsMap()                                                                                                                                           
      github.com/cockroachdb/cockroach/pkg/server/node.go:859 +0x3b38                                                                                                                                                            
  github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart()                                                                                                                                                               
      github.com/cockroachdb/cockroach/pkg/server/server.go:1703 +0x3ad0                                                                                                                                                         
  github.com/cockroachdb/cockroach/pkg/server.(*Server).Start()                                                                                                                                                                  
      github.com/cockroachdb/cockroach/pkg/server/server.go:1045 +0x38                                                                                                                                                           
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()                                                                                                                                                              
      github.com/cockroachdb/cockroach/pkg/server/testserver.go:593 +0x50                                                                                                                                                        
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()                                                                                                                                        
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:601 +0x9c                                                                                                                                        
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func1()                                                                                                                                        
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:381 +0xbc                       
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func4()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:382 +0x44
                                                                                                                
Previous read at 0x00c0038ca920 by goroutine 21009:                                                             
  github.com/cockroachdb/cockroach/pkg/server.(*diskStatsMap).empty()   
      github.com/cockroachdb/cockroach/pkg/server/node.go:835 +0x44                                             
  github.com/cockroachdb/cockroach/pkg/server.(*diskStatsMap).tryPopulateAdmissionDiskStats()
      github.com/cockroachdb/cockroach/pkg/server/node.go:808 +0x74                                             
  github.com/cockroachdb/cockroach/pkg/server.(*Node).GetPebbleMetrics()                                        
      github.com/cockroachdb/cockroach/pkg/server/node.go:866 +0xc4
  github.com/cockroachdb/cockroach/pkg/util/admission.(*StoreGrantCoordinators).SetPebbleMetricsProvider.func1()
      github.com/cockroachdb/cockroach/pkg/util/admission/grant_coordinator.go:107 +0x164                       
                                                                                                                
Goroutine 20258 (running) created at:                                                                           
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()                             
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:380 +0x568                      
  github.com/cockroachdb/cockroach/pkg/server_test.TestClusterConnectivity.func2()          
      github.com/cockroachdb/cockroach/pkg/server_test/pkg/server/connectivity_test.go:275 +0x5d8
  testing.tRunner()                                                                                             
      GOROOT/src/testing/testing.go:1446 +0x188                                                                 
  testing.(*T).Run.func1()                                                                                      
      GOROOT/src/testing/testing.go:1493 +0x40                                                                  
                                                                                                                
Goroutine 21009 (running) created at:                                                                           
  github.com/cockroachdb/cockroach/pkg/util/admission.(*StoreGrantCoordinators).SetPebbleMetricsProvider()      
      github.com/cockroachdb/cockroach/pkg/util/admission/grant_coordinator.go:98 +0x258
  github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart()                              
      github.com/cockroachdb/cockroach/pkg/server/server.go:1592 +0x2e9c          
  github.com/cockroachdb/cockroach/pkg/server.(*Server).Start()           
      github.com/cockroachdb/cockroach/pkg/server/server.go:1045 +0x38                                          
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()                                             
      github.com/cockroachdb/cockroach/pkg/server/testserver.go:593 +0x50                                       
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:601 +0x9c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func1()                       
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:381 +0xbc
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func4()
      github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:382 +0x44                       
==================                                                                                              

Jira issue: CRDB-21249

@renatolabs renatolabs added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Nov 7, 2022
@renatolabs
Copy link
Contributor Author

@knz assigned this to you as you recently changed that code. Feel free to reassign as you see fit, if necessary. Thank you.

@knz
Copy link
Contributor

knz commented Jan 10, 2023

This is actually for the storage team.

cc @jbowens @itsbilal for triage.

@knz knz removed their assignment Jan 10, 2023
@knz knz added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jan 10, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Jan 10, 2023
@knz knz changed the title server: data race when running TestClusterConnectivity server: data race in (*diskStatsMap).initDiskStatsMap() Jan 10, 2023
@knz knz changed the title server: data race in (*diskStatsMap).initDiskStatsMap() server: data race in (*diskStatsMap).initDiskStatsMap() Jan 10, 2023
@jbowens
Copy link
Collaborator

jbowens commented Jan 10, 2023

@sumeerbhola — do you mind taking a look here? looks like a race in the admission control disk-stats during node initialization

@sumeerbhola
Copy link
Collaborator

The race is because the call that initializes the diskStatsMap happens at

if err := s.node.registerEnginesForDiskStatsMap(s.cfg.Stores.Specs, s.engines); err != nil {

which is after the call to SetPebbleMetricsProvider at
s.storeGrantCoords.SetPebbleMetricsProvider(ctx, s.node, s.node)

Both these calls are in Server.PreStart.

I can hoist the latter above the former. But the positioning of the latter was not done with much knowledge, so there may be other bugs in this sequencing: specifically, the initialization of diskStatsMap must happen after all stores have had their StoreIdentKey written. The write to the StoreIdentKey happens in kvserver.InitEngine which is called in 3 places: initializeFirstStoreAfterJoin, bootstrapCluster, initializeAdditionalStores. Given storeIDs are unique across a cluster, I am guessing there is some sort of negotiation that causes their assignment, which would mean we don't have all the storeIDs for the engines at a node when the node is starting.
@knz is there something I can read regarding storeID bootstrapping? Also, can new stores be added to a node without restarting it?

@knz
Copy link
Contributor

knz commented Jan 11, 2023

@irfansharif can you perhaps chime in on sumeer's question here?

@jbowens
Copy link
Collaborator

jbowens commented Apr 3, 2023

Also seen here #99567.

@knz
Copy link
Contributor

knz commented Apr 10, 2023

Given storeIDs are unique across a cluster, I am guessing there is some sort of negotiation that causes their assignment, which would mean we don't have all the storeIDs for the engines at a node when the node is starting.

Yes, this is correct. However, the startup sequence contains logic that blocks it from completing until the store IDs are known.
In function (*Server) PreStart(), you see this:

state, initialStart, err ≔ initServer.ServeAndWait(...)

This makes the PreStart function wait until the cluster has initialized and the store IDs are known.

Currently, both the calls to s.node.registerEnginesForDiskStatsMap and s.storeGrantCoords.SetPebbleMetricsProvider come after this point, which is what we want -- they couldn't read the store IDs before then.

We can invert their ordering, as long as they remain in PreStart and after the call to ServeAndWait completes.

Also, can new stores be added to a node without restarting it?

Very theoretically, yes, but we would need to patch a few loose ends before it can fully work.

@knz
Copy link
Contributor

knz commented Apr 10, 2023

ok here's a PR: #101097

irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 10, 2023
Fixes cockroachdb#91414.
Fixes cockroachdb#101010.
Fixes cockroachdb#100902.

The call to registerEnginesForDiskStatsMap needs to wait until the store
IDs are known.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 10, 2023
Fixes cockroachdb#91414.
Fixes  cockroachdb#101010.
Fixes cockroachdb#100902.

There was a race between registerEnginesForDiskStatsMap and
SetPebbleMetricsProvider, the latter making use of a map constructed by
the former but appeared in the opposite order in code.

Release note: None
craig bot pushed a commit that referenced this issue Apr 10, 2023
101148: server: (re-)fix data race in server initialization r=irfansharif a=irfansharif

Fixes #91414.
Fixes  #101010.
Fixes #100902.

There was a race between registerEnginesForDiskStatsMap and SetPebbleMetricsProvider, the latter making use of a map constructed by the former but appeared in the opposite order in code.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 10, 2023
Fixes #91414.
Fixes  #101010.
Fixes #100902.

There was a race between registerEnginesForDiskStatsMap and
SetPebbleMetricsProvider, the latter making use of a map constructed by
the former but appeared in the opposite order in code.

Release note: None
blathers-crl bot pushed a commit that referenced this issue Apr 10, 2023
Fixes #91414.
Fixes  #101010.
Fixes #100902.

There was a race between registerEnginesForDiskStatsMap and
SetPebbleMetricsProvider, the latter making use of a map constructed by
the former but appeared in the opposite order in code.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-admission-control A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team
Projects
None yet
5 participants