-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Monitor/Fix deadlock bugs in bootstrapping #136
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Nevertheless, I have some questions on the design of making bootstrap and block handler concurrent, as per below comments. Not sure if I miss something or not.
@@ -24,6 +25,8 @@ type MonitorConfig struct { | |||
LivenessCheckIntervalSeconds uint64 `mapstructure:"liveness-check-interval-seconds"` | |||
// Max lasting BTC heights that a checkpoint is not reported before an alarm is sent | |||
MaxLiveBtcHeights uint64 `mapstructure:"max-live-btc-heights"` | |||
// the confirmation depth to consider a BTC block as confirmed | |||
BtcConfirmationDepth uint64 `mapstructure:"btc-confirmation-depth"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a part of monitor's config? Looks like its value is retrieved from Babylon and we could not specify its value in a config file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Before this PR, the value is retrieved from Babylon, but after a second thought, I think it's better the make it also configurable on the Monitor side. One reason is that we should not trust any queries from Babylon. All the information obtained from Babylon should be verifiable (we only trust the genesis file). Another reason is that this makes testing easier.
prevHash := &tipBlock.Header.PrevBlock | ||
for i := int(tipHeight-baseHeight) - 1; i >= 0; i-- { | ||
for i := len(chainBlocks) - 2; i >= 0; i-- { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a comment describing why -2
here
types/ckpt_bookkeeper.go
Outdated
cb.Lock() | ||
defer cb.Unlock() | ||
|
||
func (cb *CheckpointsBookkeeper) exists(id string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
func (cb *CheckpointsBookkeeper) exists(id string) bool { | |
func (cb *CheckpointsBookkeeper) Has(id string) bool { |
is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, has
is better but I'm not sure we should make it exposed as it is used internally and not thread-safe.
"need to restart the bootstrapping process", event.Height, err.Error()) | ||
bs.Synced.Store(false) | ||
bs.Bootstrap() | ||
if bs.Synced.Swap(false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean only when the scanner is synced, an error here triggers re-bootstrapping? Is it possible that the scanner fails to handle a new block while syncing, thus missing this new block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question. Yep, it is possible when the bootstrapping is happening but failed to include the new block. In this case, the bootstrapping will happen again when the next block comes. I don't think it would be a problem since this happens rarely and it will always be synced.
This PR is to fix bugs in the monitor found during the integration test with other components. The cause of the bugs is mostly related to deadlocks across goroutines. With this change, the monitor worked well in verifying historical BTC checkpoints. A subsequent PR would fix bugs in handling incoming BTC blocks.