-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock in txsub.System.Tick()
and tx_bad_seq
errors
#815
Conversation
This commit fixes to issues in `txsub` package. First, issue was a deadlock in `Tick()` function: ```go sys.tickMutex.Lock() if sys.tickInProgress { logger.Debug("ticking in progress") return } sys.tickInProgress = true sys.tickMutex.Unlock() ``` When two `Tick()` methods were called simultanously one of them could `Lock()` the mutex without calling `Unlock()` it properly. This could basically break the `txsub` system as sequence number could not be updated. The second issue was a problem with `App` initialization. `App.networkPassphrase` was set to test network passphrase by default. If `txsub` system was initialized before updating the network passphrase using connected `stellar-core`, the system would not be able to correctly find the transaction result in a database as calculated hash was different than the actual hash of the transaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a great fix to me, Bartek! I had a few questions, I know this is a rush fix, so feel free to ignore suggestions in the interest of getting it out the door.
services/horizon/internal/app.go
Outdated
log.Error("Network passhprase of stellar-core does not match Horizon configuration. Exiting...") | ||
os.Exit(1) | ||
} | ||
|
||
a.coreVersion = resp.Info.Build | ||
a.networkPassphrase = resp.Info.Network |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line can be removed now, right?
services/horizon/internal/app.go
Outdated
// Check if NetworkPassphrase is different, if so exit Horizon as it can break the | ||
// state of the application. | ||
if resp.Info.Network != a.networkPassphrase { | ||
log.Error("Network passhprase of stellar-core does not match Horizon configuration. Exiting...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change passhprase
to passphrase
, and maybe improve error message to print the two passphrase values.
services/horizon/internal/app.go
Outdated
@@ -64,7 +64,7 @@ func NewApp(config Config) (*App, error) { | |||
|
|||
result := &App{config: config} | |||
result.horizonVersion = app.Version() | |||
result.networkPassphrase = build.TestNetwork.Passphrase | |||
result.networkPassphrase = config.NetworkPassphrase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove App.networkPassphrase
and just use App.Config.NetworkPassphrase
as the single source of truth?
@@ -178,7 +187,8 @@ func (sys *System) Tick(ctx context.Context) { | |||
// Make sure Tick is not run concurrently | |||
sys.tickMutex.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might accomplish this locking with two functions:
SetTickInProgress
:
sys.tickMutex.Lock()
defer sys.tickMutex.Unlock()
if sys.tickInProgress {
logger.Debug("ticking in progress")
return false
}
sys.tickInProgress = true
return true
UnsetTickInProgress
:
sys.tickMutex.Lock()
defer sys.tickMutex.Unlock()
sys.tickInProgress = false
and then call those from Tick
(which checks the results of SetTickInProgress
and exits if false
, and calls UnsetTickInProgress()
in defer
.
I think that'd make this a little safer from this kind of bug in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. The ultimate solution is to remove global ticking service completely (#613). It caused a lot of concurrency problems in the past (including this one) and can be implemented much easier using infinite loop.
if viper.GetString("network-passphrase") == "" { | ||
stdLog.Fatal("Invalid config: network-passphrase is blank. Please specify --network-passphrase on the command line or set the NETWORK_PASSPHRASE environment variable.") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, is it a breaking change to make this required? It seems like it might be, and it'd be better to leave network-passphrase
optional + default to testnet as we do currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, set default value to testnet passphrase but in the next minor we need to make this empty by default. Ex. if you run ingestion node without web server and when stellar-core is not running it should be explicitly set by Horizon admin (#817).
I've tested this branch in an environment where I can reproduce the issue and the problem is solved. |
@oryband thanks for checking this. Would you mind checking this again in your environment (I pushed one last commit with updates). |
@bartekn I tested again, and it works. Thanks for sorting this issue out! 🏆 |
@bartekn can you release a new stable version that includes this branch? |
@oryband today. |
WHOOHOO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion: maybe it's worth breaking PRs fixing two issues into two smaller ones in the future? e.g. one for the tick, one for the passphrase
@@ -170,25 +179,40 @@ func (sys *System) submitOnce(ctx context.Context, env string) SubmissionResult | |||
return sr | |||
} | |||
|
|||
// setTickInProgress sets `tickInProgress` to `true` if it's not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we set tickInProgress
to true
only if it is false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! #824
@@ -73,12 +74,20 @@ func (sys *System) Submit(ctx context.Context, env string) (result <-chan Result | |||
// check the configured result provider for an existing result | |||
r := sys.Results.ResultByHash(ctx, info.Hash) | |||
|
|||
if r.Err != ErrNoResults { | |||
if r.Err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing wrong with the logic here but what is the reason that we have an Err field in r as oppose to have ResultByHash return r and err directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also used as an object returned by channel.
will resolve conflicts in following commits * tag 'horizon-v0.15.4': Horizon 0.15.4 CHANGELOG (stellar#818) Fix deadlock in `txsub.System.Tick()` and `tx_bad_seq` errors (stellar#815) Add information about test scenarios to docs (stellar#813) trades endpoint does not have resolution (stellar#733) fix links and update orderbook field names (stellar#750) Doc: flesh out docs for developers (stellar#804) Fix graceful shutdown (stellar#803) Log client name and version (stellar#727) Use go vet in CI (stellar#611) Fix panic when missing an argument for horizon db backfill (stellar#806)
This commit fixes two issues in
txsub
package. First, issue was adeadlock in
Tick()
function:When two
Tick()
methods were called simultanously one of them couldLock()
the mutex without callingUnlock()
properly. This couldbasically break the
txsub
system as sequence number could not beupdated.
The second issue was a problem with
App
initialization.App.networkPassphrase
was set to test network passphrase by default.If
txsub
system was initialized before updating the network passphraseusing connected
stellar-core
, the system would not be able tocorrectly find the transaction result in a database as calculated hash
was different than the actual hash of the transaction.
Close #305.