storage: prevent startup after physical corruption #103899

jbowens · 2023-05-25T16:39:46Z

Replication-level corruption crashes the node and also writes a special file (base.PreventedStartupFile) that prevents the node from starting again:

cockroach/pkg/kv/kvserver/replica_corruption.go

Lines 57 to 69 in df21930

    
    	preventStartupMsg := fmt.Sprintf(`ATTENTION:
    this node is terminating because replica %s detected an inconsistent state.
    Please contact the CockroachDB support team. It is not necessarily safe
    to replace this node; cluster data may still be at risk of corruption.
    A file preventing this node from restarting was placed at:
    %s
    `, r, path)
    	if err := fs.WriteFile(r.store.TODOEngine(), path, []byte(preventStartupMsg)); err != nil {
    		log.Warningf(ctx, "%v", err)
    	}

This is done because the node requires manual intervention. Today, corruption of physical media (e.g., bit rot resulting in node-local data loss) similarly requires manual intervention to replace the node. Although we hope to support automatic recovery from physical corruption using other replicas some day (#67568), in the meantime we require that the node with local corruption be decommissioned and replaced. We should consider writing base.PreventedStartupFile before crashing these nodes, to prevent the automatic restart of a node that we fully expect to crash again.

Jira issue: CRDB-28250

nicktrav · 2023-07-24T21:17:03Z

@RahulAggarwal1016 is going to pick this one up. Thanks, Rahul!

Currently, if a node encounters sstable corruption, that node will crash and try to restart automatically. Since the node is likely to crash again, we would like to prevent it from attempting to restart. As a result, this PR creates a `PreventStartupFile` when a node experiences sstable corruption.

Fixes: cockroachdb#103899

Release note: None

107828: storage: Write PreventStartupFile on Node SSTFile Corruption r=RahulAggarwal1016 a=RahulAggarwal1016

Currently, if a node encounters sstable corruption, that node will crash and try to restart automatically. Since the node is likely to crash again, we would like to prevent it from attempting to restart. As a result, this PR creates a `PreventStartupFile` when a node experiences sstable corruption.

Note: There is another place where a node crashes due to sstable corruption: https://github.com/cockroachdb/cockroach/blob/ba053dd3ff75dbca557e3b21195ed4cdff250660/pkg/storage/pebble_iterator.go#L958. However, I am not quite sure how to access a `Pebble` instance there without it getting a little messy.

Fixes: #103899

Release-note: None

Co-authored-by: Rahul Aggarwal <[email protected]>

jbowens added labels May 25, 2023: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), A-storage (Relating to our storage engine (Pebble) on-disk storage.), T-storage (Storage Team), E-quick-win (Likely to be a quick win for someone experienced.)

jbowens self-assigned this May 31, 2023

jbowens removed their assignment Jul 24, 2023

nicktrav assigned raggar Jul 24, 2023

raggar mentioned this issue Jul 28, 2023

storage: Write PreventStartupFile on Node SSTFile Corruption #107828

Merged

joshimhoff added the O-sre label (For issues SRE opened or otherwise cares about tracking.) Aug 14, 2023

craig bot closed this as completed in bbfd162 Aug 14, 2023

jbowens added this to [Deprecated] Storage Jun 4, 2024

jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
