Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

SlotMeta is_connected field isn't updated as expected #27820

Closed
steviez opened this issue Sep 15, 2022 · 4 comments · Fixed by #28069
Closed

SlotMeta is_connected field isn't updated as expected #27820

steviez opened this issue Sep 15, 2022 · 4 comments · Fixed by #28069
Assignees
Labels
validator Issues that relate to the validator

Comments

@steviez
Copy link
Contributor

steviez commented Sep 15, 2022

Problem

The SlotMeta::is_connected field was originally designed to indicate whether a given slot n had a full replayable sequence back to genesis without any holes. It follows that all slots in this sequence must be full in order to be replayable. In a more compact form, we're saying

is_connected(n) = is_connected(parent_of_n) && is_full(n)

where n is the direct child of parent_of_n.

What this design failed to account for is that validators might start from a snapshot at some slot s > 0. This is a perfectly valid way for a validator to bootstrap (recall that by definition, s is a root); however, that validator doesn't / didn't have a path back to genesis. As such, is_connected will be false for all slots that this validator processes. This has a potential ripple effect as the is_connected field is currently being used to signal to ReplayStage when there is more work to do. See #27786, which is what sparked the dig into this issue.

Proposed Solution

From some discussion, an initial idea is to expand is_connected to link back to geneiss OR to a slot that we load snapshot from.

  • Starting completely clean (no snapshot or blockstore) seems like it'd be easy enough; mark snapshot slot as connected (although we wouldn't technically have these shreds) and then as additional slots are inserted, they'll be able to get marked as is_connected inductively as well
  • Starting with a fetched snapshot but existing blockstore would require a little extra. For example, suppose the blockstore has slots in the range [b_start, b_end] and we fetch a snapshot at slot s:
    • Fetching a snapshot for s > b_end has no issues; s marked as connected and progresses onward (so functionally similar to starting without any blockstore)
    • Fetching a snapshot for b_start < s < b_end would mark s as connected, but if slots (s, b_end] are already present in blockstore, their connected status wouldn't change.
      • Maybe we could update blockstore_process::load_frozen_forks() to mark these slots as connected ?
    • Fetching a snapshot for s < b_start would be similar to above case; mark s connected and then potentially need to update slots as connected in load_frozen_forks()

Technically, when we fetch a snapshot at slot s, we don't actually need s in the blockstore as the bank unpacked from the snapshot is frozen after playing s. There might be some subtlety with getting this correct

Additional Context / Reading

The original design doc for blockstore is here, and below is the SlotMeta struct in code.

pub struct SlotMeta {
// The number of slots above the root (the genesis block). The first
// slot has slot 0.
pub slot: Slot,
// The total number of consecutive shreds starting from index 0
// we have received for this slot.
pub consumed: u64,
// The index *plus one* of the highest shred received for this slot. Useful
// for checking if the slot has received any shreds yet, and to calculate the
// range where there is one or more holes: `(consumed..received)`.
pub received: u64,
// The timestamp of the first time a shred was added for this slot
pub first_shred_timestamp: u64,
// The index of the shred that is flagged as the last shred for this slot.
// None until the shred with LAST_SHRED_IN_SLOT flag is received.
#[serde(with = "serde_compat")]
pub last_index: Option<u64>,
// The slot height of the block this one derives from.
// The parent slot of the head of a detached chain of slots is None.
#[serde(with = "serde_compat")]
pub parent_slot: Option<Slot>,
// The list of slots, each of which contains a block that derives
// from this one.
pub next_slots: Vec<Slot>,
// True if this slot is full (consumed == last_index + 1) and if every
// slot that is a parent of this slot is also connected.
pub is_connected: bool,
// Shreds indices which are marked data complete.
pub completed_data_indexes: BTreeSet<u32>,
}

@steviez
Copy link
Contributor Author

steviez commented Sep 15, 2022

Quick thought that needs more examination would be to change is_connected field to be is_parent_connected. is_connected currently contains duplicate information as we can determine from other fields whether we are full or not (see how SlotMeta::is_full() is implemented). This bit of information is useful for knowing if there is work to be done in the issue I linked above; however, there might be other consequences of making the change.

Ultimately, it'd be nice to not have to change the struct layout of SlotMeta as that raises all sorts of compatibility issues

@steviez steviez self-assigned this Sep 16, 2022
@steviez
Copy link
Contributor Author

steviez commented Sep 28, 2022

In regards to my previous comment, I think we need two bits of state:

  • is_connected
  • is_parent_connected

My defense for that is that when we start from a snapshot at slot S, we may not necessarily have the shreds for slot S. So, depending on is_full() && is_parent_connected to build the chain from a snapshot won't work. And, the whole reason for this discussion is that is_connected doesn't capture the subtlety that work can be done on a child slot if the parent is connected but before the child slot is full (ie if it has the initial couple shreds)

We can capture both of these bits in the same amount of space that the boolean previously occupied; however, this is a compatibility issue regardless as trying to decode a byte that is not one of {0, 1} is a deserialization error.

@steviez
Copy link
Contributor Author

steviez commented Sep 29, 2022

For completeness, another option that would avoid adding the extra bit of state to SlotMeta would be:

  • Changing is_connected to is_parent_connected
  • Querying the blockstore for the parent SlotMeta in Blockstore::slot_has_updates() and in Blockstore::handle_chaining_for_slot()
    • In both of these instances for a given slot S, we want to know if S's parent is both connected and full

@t-nelson
Copy link
Contributor

t-nelson commented Oct 20, 2022

regarding avoiding the struct layout change avoidance, the existing field is a bool which encodes to a u8 where false is strictly 0x00u8 and true 0x01u8, so we have some free bits to play with.

  • treat it as an #[repr(u8)] enum, and reserve variants 0 and 1 as legacy values
  • treat it as a bitflags!{struct Flags: u8 { ... }} from the bitflags crate, where the LSB is legacy

oh. i thought this all seemed familiar 😅

@sakridge sakridge added the validator Issues that relate to the validator label Oct 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
validator Issues that relate to the validator
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants