Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Horizon returned bad sequence error although the transaction was added to ledger #2628

Closed
bantalon opened this issue May 27, 2020 · 16 comments · Fixed by #2805 or #2877
Closed

Horizon returned bad sequence error although the transaction was added to ledger #2628

bantalon opened this issue May 27, 2020 · 16 comments · Fixed by #2805 or #2877
Assignees
Labels
Milestone

Comments

@bantalon
Copy link

What version are you using?

I am working on the public network through the public Horizon instance: https://horizon.stellar.org. I posted transactions to the network using this API: https://www.stellar.org/developers/horizon/reference/endpoints/transactions-create.html.

What did you do?

On may 8, I started posting a transaction on 06:47:19 UTC. 30 seconds later, on 06:47:49, a https://www.stellar.org/developers/horizon/reference/errors/timeout.html was returned. As instructed in the documentation, I retried sending the same transaction after a second delay. Within less than a second I got a bad sequence failure:

{
 "type": "https://stellar.org/horizon-errors/transaction_failed",
 "title": "Transaction Failed",
 "status": 400,
 "detail": "The transaction failed when submitted to the stellar network. The `extras.result_codes` field on this response contains further details. Descriptions of each code can be found at: https://www.stellar.org/developers/guides/concepts/list-of-operations.html",
 "extras": {
 "envelope_xdr": "AAAAAGRFCbG4m7/GKYJDKVwWrhSAhfgjSjOILbb0cOAN+rH+AAAAZAGfEqQAAC2bAAAAAQAAAABetP9IAAAAAF61DoQAAAABAAAAG1hMTSBlMmUgbW9uaXRvciB0cmFuc2FjdGlvbgAAAAABAAAAAAAAAAEAAAAA+dkdOXuuwsV5OpRuUk9Owqe6ocz1ukhgrYqnKliCQA8AAAAAAAAAAAAAAAEAAAAAAAAAAQ36sf4AAABAhnftLR7Lx9hKXojyGkKDLu4bdI4+wgx/qSPxEW3Vr9CAWBDQPngRWihDppCNzpmKTn3Q3wsM0z3qOJl6+72zCA==",
 "result_codes": {
 "transaction": "tx_bad_seq"
 },
 "result_xdr": "AAAAAAAAAAD////7AAAAAA=="
 }
}

My code treated this as failure.
However, I was surprised to find the transaction on the chain: https://stellar.expert/explorer/public/tx/45a9674370754d50b50df23e03d7fa87bd6ee70bd2db2431a8ab874218c3115a.

What did you expect to see?

I expected to see a successful submit result, either on the first attempt which ended in timeout or on the second attempt which ended in tx_bad_seq.

What did you see instead?

I got a wrong misleading tx_bad_seq failure.

@bantalon bantalon added the bug label May 27, 2020
@bartekn
Copy link
Contributor

bartekn commented May 27, 2020

Thanks for the report. I can't reproduce it with the latest version (ex. I get a success response for your tx). Going to check horizon.stellar.org logs from May, 8th (I think it was running 1.2.1 at this time) and will give you more info.

@bartekn bartekn self-assigned this May 27, 2020
@bantalon
Copy link
Author

I think the tx_bad_seq is a temporary erroneous response. After a short period of time all goes back to normal, and trying to post the same transaction again results in success. Feels like there is a short period of time in which the transaction is not yet identified by Horizon as pushed, but the account sequence number already goes up.

@bantalon
Copy link
Author

bantalon commented Jun 4, 2020

@bartekn - Any updates on this one?
I am refining my transaction pushing algorithm and I wish to know if to handle tx_bad_seq as a final result (as I think it should be) or as a provisional one (which is required if this issue persists).

@bartekn
Copy link
Contributor

bartekn commented Jun 4, 2020

Sorry for delay commenting this issue. I think this should be fixed by #2601 to be released next week (and deployed to horizon.stellar.org next Wednesday).

Before Horizon 1.4.0 txsub system is using two databases: stellar-core DB to get sequence numbers and horizon DB to get transaction results (if it's ingesting failed transactions and horizon.stellar.org do). The issue you are experiencing is due to a delay between stellar-core closing a ledger and horizon ingesting it. When you submit a tx that's already added to a ledger closed by stellar-core but not ingested by horizon yet the following happens: the sequence number on a tx is <= than the current sequence of the account but tx results are not found. In such case tx_bad_seq is returned.

I'm closing it now, please confirm if you still experience it after next week deploy.

@bartekn bartekn closed this as completed Jun 4, 2020
@bantalon
Copy link
Author

bantalon commented Jul 6, 2020

@bartekn
We did not see this again until yesterday.
However, yesterday (2020-07-05) at 16:58:33 UTC we started submitting transaction 615cf369....
At 16:59:04 we got an Horizon timeout error. We retried after 1 second. We got a tx_bad_seq response. See full response below.
Can you reopen the ticket or should I file a new one?

{
  "type": "https://stellar.org/horizon-errors/transaction_failed",
  "title": "Transaction Failed",
  "status": 400,
  "detail": "The transaction failed when submitted to the stellar network. The `extras.result_codes` field on this response contains further details.  Descriptions of each code can be found at: https://www.stellar.org/developers/guides/concepts/list-of-operations.html",
  "extras": {
    "envelope_xdr": "AAAAAGRFCbG4m7/GKYJDKVwWrhSAhfgjSjOILbb0cOAN+rH+AAAAZAGfEqQAAEHzAAAAAQAAAABfAgWLAAAAAF8CCQ8AAAABAAAAG1hMTSBlMmUgbW9uaXRvciB0cmFuc2FjdGlvbgAAAAABAAAAAAAAAAEAAAAA+dkdOXuuwsV5OpRuUk9Owqe6ocz1ukhgrYqnKliCQA8AAAAAAAAAAAAAAAEAAAAAAAAAAQ36sf4AAABAOCbRkyGoptNNRB8yDNXC2VqD5GyYBMtevrlKj2RvgzSykZZ4GShhgwAkEnTO0dlsdeu5dHJK1VQD0uX05CUjBw==",
    "result_codes": {
      "transaction": "tx_bad_seq"
    },
    "result_xdr": "AAAAAAAAAGT////7AAAAAA=="
  }
}

@bartekn
Copy link
Contributor

bartekn commented Jul 6, 2020

@bantalon I found another problem causing this. Reopening.

@bartekn bartekn reopened this Jul 6, 2020
@bartekn bartekn added this to the Horizon 1.6.0 milestone Jul 6, 2020
@bartekn
Copy link
Contributor

bartekn commented Jul 6, 2020

We make two DB queries: first checking if tx result exists in a DB (ResultByHash), second getting the current sequence number of a source account (GetSequenceNumbers):

// check the configured result provider for an existing result
r := sys.Results.ResultByHash(ctx, hash)
if r.Err == nil {
sys.Log.Ctx(ctx).WithField("hash", hash).Info("Found submission result in a DB")
sys.finish(ctx, hash, response, r)
return
}
if r.Err != ErrNoResults {
sys.Log.Ctx(ctx).WithField("hash", hash).Info("Error getting submission result from a DB")
sys.finish(ctx, hash, response, r)
return
}
// From now: r.Err == ErrNoResults
sourceAccount := envelope.SourceAccount()
// The database doesn't (yet) store muxed accounts, so we query
// the corresponding AccountId
accid := sourceAccount.ToAccountId()
sourceAddress := accid.Address()
curSeq, err := sys.Sequences.GetSequenceNumbers([]string{sourceAddress})
if err != nil {
sys.finish(ctx, hash, response, Result{Err: err})
return
}

Queries are not sent in a transaction (REPEATABLE READ) so it's possible that ResultByHash and GetSequenceNumbers are getting data for different ledgers: one without a tx ingested, second with tx ingested which can result in bad_seq errors.

Because we are using Horizon DB only I think we should remove ResultProvider and SequenceProvider interfaces and use history.Q directly to send queries.

@bartekn
Copy link
Contributor

bartekn commented Jul 17, 2020

@bantalon this was fixed in #2805 and will be included in Horizon 1.6.0 release (release: Monday 20 July, deploy to horizon.stellar.org: Wednesday 22 July).

@bartekn bartekn closed this as completed Jul 17, 2020
@bantalon
Copy link
Author

@bartekn - It happened again on July 23 on transaction 2df5d0369f89de16193b757800c5f565acfb45eb4cbb8c5ff76ea0077f0fdfac.

  • We first submitted the transaction at 00:27:46 UTC. Got a timeout after 30 seconds and slept a single second.
  • Submitted again at 00:28:17 UTC. Got a timeout after 30 seconds and slept a single second.
  • Submitted again at 00:28:49 UTC. Got a timeout after 30 seconds and slept a single second.
  • Submitted again at 00:29:20 UTC. Got a timeout after 30 seconds and slept a single second.
  • At 00:29 we submitted again and immediately got a bad sequence failure:
{
  "type": "https://stellar.org/horizon-errors/transaction_failed",
  "title": "Transaction Failed",
  "status": 400,
  "detail": "The transaction failed when submitted to the stellar network. The `extras.result_codes` field on this response contains further details.  Descriptions of each code can be found at: https://www.stellar.org/developers/guides/concepts/list-of-operations.html",
  "extras": {
    "envelope_xdr": "AAAAADOwt23lzX6o0FuxpqtyusGaOAnkHTxj+sQ+rqOcYAHpAAAAZAGjkGEAAEyZAAAAAQAAAABfGNhSAAAAAF8Y29YAAAABAAAAG1hMTSBlMmUgbW9uaXRvciB0cmFuc2FjdGlvbgAAAAABAAAAAAAAAAEAAAAA98VQmHHZ8ZVEc4kkMKTryZwyxSBC4B+Q8vtapAp3FzMAAAAAAAAAAAAAAAEAAAAAAAAAAZxgAekAAABA1CWSBsCvuEeYSET9xPtW8t8KEJ+dn/9DrstONnk8NhbW7vyDaHyhDdr3SkNWoEEOxKgkpBI8Juc4Cg0fDlTsDA==",
    "result_codes": {
      "transaction": "tx_bad_seq"
    },
    "result_xdr": "AAAAAAAAAGT////7AAAAAA=="
  }
}

When was 1.6.0 deployment completed on https://horizon.stellar.org?

@bartekn
Copy link
Contributor

bartekn commented Jul 27, 2020

@bantalon 1.6.0 deployment was a couple hours earlier. @tamirms can you debug it?

@bartekn bartekn reopened this Jul 27, 2020
@bartekn bartekn modified the milestones: Horizon 1.6.0, Horizon 1.7.0 Jul 27, 2020
@bantalon
Copy link
Author

The following may help with the analysis: legitimate tx_bad_seq responses come with result_xdr in which feeCharged is 0. In the faulty responses I added here, the feeCharged is greater than 0 (100 in both cases).

@2opremio 2opremio changed the title Horizon returned returned bad sequence error although the transaction was added to ledger Horizon returned bad sequence error although the transaction was added to ledger Jul 30, 2020
@tamirms
Copy link
Contributor

tamirms commented Jul 30, 2020

At 2020-07-23 00:28:50 UTC the transaction was submitted to Horizon for a 3rd attempt after the previous two attempts resulted in timeouts. However, at 00:29:20 Horizon discarded the transaction because enough time had passed to trigger a timeout. It seems that Stellar Core finally accepted the transaction in ledger 30741789 shortly after Horizon timed out at 00:29:20.

Horizon again received a request to submit the same transaction at 00:29:21. But the transaction was not present in the Horizon DB so Horizon tried to submit the transaction to Stellar Core and Stellar Core responded with a bad sequence error since the transaction was already present in the previous ledger.

According to Horizon ingestion logs, Horizon finally ingested ledger 30741789 at 00:29:22.

time="2020-07-23T00:29:22.891Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=30741789)" next_state="resume(latestSuccessfullyProcessedLedger=30741789)" pid=31178 service=expingest

Horizon can respond with a false positive bad sequence error when Horizon's ingestion lags behind Stellar Core.

We can determine an approximate upper bound for how often this issue occurs by searching for the "AAAAAAAAAGT////7AAAAAA==" (result_xdr in which feeCharged is 100) in the Horizon logs.
I did a sample search and unfortunately I mostly found cases where there were legitimate bad sequence errors.
To get an accurate count of how often this issue occurs we would need to add a metric to the transaction submission system.

To solve this issue we could sleep before calling txResultByHash after a bad sequence error.

// any error other than "txBAD_SEQ" is a failure
isBad, err := sr.IsBadSeq()
if err != nil {
sys.finish(ctx, hash, response, Result{Err: err})
return
}
if !isBad {
sys.finish(ctx, hash, response, Result{Err: sr.Err})
return
}
// If error is txBAD_SEQ, check for the result again
tx, err = txResultByHash(db, hash)

This wouldn't completely solve the issue because this solution assumes that if Horizon is behind Stellar Core it will catch up by the time we finish sleeping. We could pick a sufficiently high sleep duration which would cover most cases. However, we can never guarantee that Horizon will be consistent with Stellar Core after the sleep duration.

Another solution is that we could check if the latest ingested ledger matches the latest ledger in Stellar Core. If the Stellar Core ledger is ahead of Horizon, we can block until Horizon catches up. The only drawback of this approach is that we are adding another dependency on functionality in Stellar Core. We have been trying to reduce coupling between Horizon and Stellar Core as much as possible.

The last solution that is worth mentioning is that we could modify Stellar Core to provide a new error code (e.g. ErrTransactionAlreadyExists) whenever someone tries to submit a transaction which has already been included in a recent ledger.

For now we should probably update the documentation for Horizon's transaction submission endpoint with a warning that, if Horizon's ingestion system is lagging behind Stellar Core, Horizon might respond with an ErrBadSequence error when the transaction was actually included in a ledger that has not yet been ingested by Horizon.

@bartekn
Copy link
Contributor

bartekn commented Jul 30, 2020

To solve this issue we could sleep before calling txResultByHash after a bad sequence error.

During the meeting today I suggested a solution (similar to this one) that simply ignores bad sequence error returned by Stellar-Core and relies on Horizon view only. Later during that meeting I said that I was wrong because if a user submits a tx with sequence number larger than current + 1 they would never see a bad seq error. But after looking at the code I think the original solution (simply ignore bad seq from Stellar-Core) can work:

  1. Transaction is only submitted to Stellar-Core when it's sequence number is +1 the current sequence of the account in Horizon DB (Horizon maintains a queue for accounts' seqnums).

  2. If user submits a tx with a seq larger then current + 1 Horizon waits for 10 sec. for a smaller sequence number to be submitted. If this doesn't happen it returns bad sequence error (this case confused me during the call):

    if time.Since(q.lastActiveAt) > q.timeout {
    for q.Size() > 0 {
    ch, _ := q.pop()
    ch <- ErrBadSequence
    close(ch)
    }
    }

  3. If tx is finally submitted to Stellar-Core it means that according to Horizon DB, sequence number of the tx is exactly the sequence number of the source account + 1. Now, if it returns bad sequence error there are two cases:

    1. The transaction in current requests incremented a sequence number.
    2. Another transaction submitted elsewhere incremented a sequence number.

    So, I think we should replace the code that checks for tx status for the second time:

    // If error is txBAD_SEQ, check for the result again
    tx, err = txResultByHash(db, hash)
    if err == nil {
    // If the found use it as the result
    sys.finish(ctx, hash, response, Result{Transaction: tx})
    } else {
    // finally, return the bad_seq error if no result was found on 2nd attempt
    sys.finish(ctx, hash, response, Result{Err: sr.Err})
    }

    with the code that checks (in 1 sec. intervals until HTTP request timeout) if a sequence number of the source account is equal of greater than tx seqnum. If it is, check if transaction exists in Horizon DB. If it does, return it's status. If not (case ii above), return bad sequence.

Does it make sense?

@tamirms
Copy link
Contributor

tamirms commented Jul 31, 2020

@bartekn yes, I think that solution will work. I'll create a PR on monday with the implementation

@bartekn
Copy link
Contributor

bartekn commented Aug 11, 2020

This should be fixed in Horizon 1.7.0 (will be deployed to horizon.stellar.org on Wednesday). Please reopen if there's still something wrong with the transaction submission.

@bartekn bartekn closed this as completed Aug 11, 2020
@bartekn
Copy link
Contributor

bartekn commented Aug 12, 2020

@bantalon unfortunately there's a delay deploying this version to production (we are doing some maintenance tasks connected to this deployment). Please observe horizon_version on https://horizon.stellar.org/ to check if it's on 1.7.0 already. Hopefully, it will be deployed by the end of this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment