Horizon returned bad sequence error although the transaction was added to ledger #2628

bantalon · 2020-05-27T06:49:50Z

What version are you using?

I am working on the public network through the public Horizon instance: https://horizon.stellar.org. I posted transactions to the network using this API: https://www.stellar.org/developers/horizon/reference/endpoints/transactions-create.html.

What did you do?

On may 8, I started posting a transaction on 06:47:19 UTC. 30 seconds later, on 06:47:49, a https://www.stellar.org/developers/horizon/reference/errors/timeout.html was returned. As instructed in the documentation, I retried sending the same transaction after a second delay. Within less than a second I got a bad sequence failure:

{
 "type": "https://stellar.org/horizon-errors/transaction_failed",
 "title": "Transaction Failed",
 "status": 400,
 "detail": "The transaction failed when submitted to the stellar network. The `extras.result_codes` field on this response contains further details. Descriptions of each code can be found at: https://www.stellar.org/developers/guides/concepts/list-of-operations.html",
 "extras": {
 "envelope_xdr": "AAAAAGRFCbG4m7/GKYJDKVwWrhSAhfgjSjOILbb0cOAN+rH+AAAAZAGfEqQAAC2bAAAAAQAAAABetP9IAAAAAF61DoQAAAABAAAAG1hMTSBlMmUgbW9uaXRvciB0cmFuc2FjdGlvbgAAAAABAAAAAAAAAAEAAAAA+dkdOXuuwsV5OpRuUk9Owqe6ocz1ukhgrYqnKliCQA8AAAAAAAAAAAAAAAEAAAAAAAAAAQ36sf4AAABAhnftLR7Lx9hKXojyGkKDLu4bdI4+wgx/qSPxEW3Vr9CAWBDQPngRWihDppCNzpmKTn3Q3wsM0z3qOJl6+72zCA==",
 "result_codes": {
 "transaction": "tx_bad_seq"
 },
 "result_xdr": "AAAAAAAAAAD////7AAAAAA=="
 }
}

My code treated this as failure.
However, I was surprised to find the transaction on the chain: https://stellar.expert/explorer/public/tx/45a9674370754d50b50df23e03d7fa87bd6ee70bd2db2431a8ab874218c3115a.

What did you expect to see?

I expected to see a successful submit result, either on the first attempt which ended in timeout or on the second attempt which ended in tx_bad_seq.

What did you see instead?

I got a wrong misleading tx_bad_seq failure.

The text was updated successfully, but these errors were encountered:

bartekn · 2020-05-27T13:11:41Z

Thanks for the report. I can't reproduce it with the latest version (ex. I get a success response for your tx). Going to check horizon.stellar.org logs from May, 8th (I think it was running 1.2.1 at this time) and will give you more info.

bantalon · 2020-05-28T11:33:41Z

I think the tx_bad_seq is a temporary erroneous response. After a short period of time all goes back to normal, and trying to post the same transaction again results in success. Feels like there is a short period of time in which the transaction is not yet identified by Horizon as pushed, but the account sequence number already goes up.

bantalon · 2020-06-04T11:50:07Z

@bartekn - Any updates on this one?
I am refining my transaction pushing algorithm and I wish to know if to handle tx_bad_seq as a final result (as I think it should be) or as a provisional one (which is required if this issue persists).

bartekn · 2020-06-04T12:33:50Z

Sorry for delay commenting this issue. I think this should be fixed by #2601 to be released next week (and deployed to horizon.stellar.org next Wednesday).

Before Horizon 1.4.0 txsub system is using two databases: stellar-core DB to get sequence numbers and horizon DB to get transaction results (if it's ingesting failed transactions and horizon.stellar.org do). The issue you are experiencing is due to a delay between stellar-core closing a ledger and horizon ingesting it. When you submit a tx that's already added to a ledger closed by stellar-core but not ingested by horizon yet the following happens: the sequence number on a tx is <= than the current sequence of the account but tx results are not found. In such case tx_bad_seq is returned.

I'm closing it now, please confirm if you still experience it after next week deploy.

bantalon · 2020-07-06T06:31:34Z

@bartekn
We did not see this again until yesterday.
However, yesterday (2020-07-05) at 16:58:33 UTC we started submitting transaction 615cf369....
At 16:59:04 we got an Horizon timeout error. We retried after 1 second. We got a tx_bad_seq response. See full response below.
Can you reopen the ticket or should I file a new one?

{
  "type": "https://stellar.org/horizon-errors/transaction_failed",
  "title": "Transaction Failed",
  "status": 400,
  "detail": "The transaction failed when submitted to the stellar network. The `extras.result_codes` field on this response contains further details.  Descriptions of each code can be found at: https://www.stellar.org/developers/guides/concepts/list-of-operations.html",
  "extras": {
    "envelope_xdr": "AAAAAGRFCbG4m7/GKYJDKVwWrhSAhfgjSjOILbb0cOAN+rH+AAAAZAGfEqQAAEHzAAAAAQAAAABfAgWLAAAAAF8CCQ8AAAABAAAAG1hMTSBlMmUgbW9uaXRvciB0cmFuc2FjdGlvbgAAAAABAAAAAAAAAAEAAAAA+dkdOXuuwsV5OpRuUk9Owqe6ocz1ukhgrYqnKliCQA8AAAAAAAAAAAAAAAEAAAAAAAAAAQ36sf4AAABAOCbRkyGoptNNRB8yDNXC2VqD5GyYBMtevrlKj2RvgzSykZZ4GShhgwAkEnTO0dlsdeu5dHJK1VQD0uX05CUjBw==",
    "result_codes": {
      "transaction": "tx_bad_seq"
    },
    "result_xdr": "AAAAAAAAAGT////7AAAAAA=="
  }
}

bartekn · 2020-07-06T11:39:44Z

@bantalon I found another problem causing this. Reopening.

bartekn · 2020-07-06T11:47:02Z

We make two DB queries: first checking if tx result exists in a DB (ResultByHash), second getting the current sequence number of a source account (GetSequenceNumbers):

go/services/horizon/internal/txsub/system.go

Lines 85 to 110 in 0067572

    
           // check the configured result provider for an existing result 
        
           r := sys.Results.ResultByHash(ctx, hash) 
        
           if r.Err == nil { 
        
           	sys.Log.Ctx(ctx).WithField("hash", hash).Info("Found submission result in a DB") 
        
           	sys.finish(ctx, hash, response, r) 
        
           	return 
        
           } 
        
           if r.Err != ErrNoResults { 
        
           	sys.Log.Ctx(ctx).WithField("hash", hash).Info("Error getting submission result from a DB") 
        
           	sys.finish(ctx, hash, response, r) 
        
           	return 
        
           } 
        
           // From now: r.Err == ErrNoResults 
        
           sourceAccount := envelope.SourceAccount() 
        
           // The database doesn't (yet) store muxed accounts, so we query 
        
           // the corresponding AccountId 
        
           accid := sourceAccount.ToAccountId() 
        
           sourceAddress := accid.Address() 
        
           curSeq, err := sys.Sequences.GetSequenceNumbers([]string{sourceAddress}) 
        
           if err != nil { 
        
           	sys.finish(ctx, hash, response, Result{Err: err}) 
        
           	return 
        
           }

Queries are not sent in a transaction (REPEATABLE READ) so it's possible that ResultByHash and GetSequenceNumbers are getting data for different ledgers: one without a tx ingested, second with tx ingested which can result in bad_seq errors.

Because we are using Horizon DB only I think we should remove ResultProvider and SequenceProvider interfaces and use history.Q directly to send queries.

bartekn · 2020-07-17T11:36:37Z

@bantalon this was fixed in #2805 and will be included in Horizon 1.6.0 release (release: Monday 20 July, deploy to horizon.stellar.org: Wednesday 22 July).

bantalon · 2020-07-26T17:41:30Z

@bartekn - It happened again on July 23 on transaction 2df5d0369f89de16193b757800c5f565acfb45eb4cbb8c5ff76ea0077f0fdfac.

We first submitted the transaction at 00:27:46 UTC. Got a timeout after 30 seconds and slept a single second.
Submitted again at 00:28:17 UTC. Got a timeout after 30 seconds and slept a single second.
Submitted again at 00:28:49 UTC. Got a timeout after 30 seconds and slept a single second.
Submitted again at 00:29:20 UTC. Got a timeout after 30 seconds and slept a single second.
At 00:29 we submitted again and immediately got a bad sequence failure:

{
  "type": "https://stellar.org/horizon-errors/transaction_failed",
  "title": "Transaction Failed",
  "status": 400,
  "detail": "The transaction failed when submitted to the stellar network. The `extras.result_codes` field on this response contains further details.  Descriptions of each code can be found at: https://www.stellar.org/developers/guides/concepts/list-of-operations.html",
  "extras": {
    "envelope_xdr": "AAAAADOwt23lzX6o0FuxpqtyusGaOAnkHTxj+sQ+rqOcYAHpAAAAZAGjkGEAAEyZAAAAAQAAAABfGNhSAAAAAF8Y29YAAAABAAAAG1hMTSBlMmUgbW9uaXRvciB0cmFuc2FjdGlvbgAAAAABAAAAAAAAAAEAAAAA98VQmHHZ8ZVEc4kkMKTryZwyxSBC4B+Q8vtapAp3FzMAAAAAAAAAAAAAAAEAAAAAAAAAAZxgAekAAABA1CWSBsCvuEeYSET9xPtW8t8KEJ+dn/9DrstONnk8NhbW7vyDaHyhDdr3SkNWoEEOxKgkpBI8Juc4Cg0fDlTsDA==",
    "result_codes": {
      "transaction": "tx_bad_seq"
    },
    "result_xdr": "AAAAAAAAAGT////7AAAAAA=="
  }
}

When was 1.6.0 deployment completed on https://horizon.stellar.org?

bartekn · 2020-07-27T13:45:39Z

@bantalon 1.6.0 deployment was a couple hours earlier. @tamirms can you debug it?

bantalon · 2020-07-27T22:04:34Z

The following may help with the analysis: legitimate tx_bad_seq responses come with result_xdr in which feeCharged is 0. In the faulty responses I added here, the feeCharged is greater than 0 (100 in both cases).

tamirms · 2020-07-30T20:19:25Z

At 2020-07-23 00:28:50 UTC the transaction was submitted to Horizon for a 3rd attempt after the previous two attempts resulted in timeouts. However, at 00:29:20 Horizon discarded the transaction because enough time had passed to trigger a timeout. It seems that Stellar Core finally accepted the transaction in ledger 30741789 shortly after Horizon timed out at 00:29:20.

Horizon again received a request to submit the same transaction at 00:29:21. But the transaction was not present in the Horizon DB so Horizon tried to submit the transaction to Stellar Core and Stellar Core responded with a bad sequence error since the transaction was already present in the previous ledger.

According to Horizon ingestion logs, Horizon finally ingested ledger 30741789 at 00:29:22.

time="2020-07-23T00:29:22.891Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=30741789)" next_state="resume(latestSuccessfullyProcessedLedger=30741789)" pid=31178 service=expingest

Horizon can respond with a false positive bad sequence error when Horizon's ingestion lags behind Stellar Core.

We can determine an approximate upper bound for how often this issue occurs by searching for the "AAAAAAAAAGT////7AAAAAA==" (result_xdr in which feeCharged is 100) in the Horizon logs.
I did a sample search and unfortunately I mostly found cases where there were legitimate bad sequence errors.
To get an accurate count of how often this issue occurs we would need to add a metric to the transaction submission system.

To solve this issue we could sleep before calling txResultByHash after a bad sequence error.

go/services/horizon/internal/txsub/system.go

Lines 147 to 160 in 59324f1

    
           // any error other than "txBAD_SEQ" is a failure 
        
           isBad, err := sr.IsBadSeq() 
        
           if err != nil { 
        
           	sys.finish(ctx, hash, response, Result{Err: err}) 
        
           	return 
        
           } 
        
           if !isBad { 
        
           	sys.finish(ctx, hash, response, Result{Err: sr.Err}) 
        
           	return 
        
           } 
        
           // If error is txBAD_SEQ, check for the result again 
        
           tx, err = txResultByHash(db, hash)

This wouldn't completely solve the issue because this solution assumes that if Horizon is behind Stellar Core it will catch up by the time we finish sleeping. We could pick a sufficiently high sleep duration which would cover most cases. However, we can never guarantee that Horizon will be consistent with Stellar Core after the sleep duration.

Another solution is that we could check if the latest ingested ledger matches the latest ledger in Stellar Core. If the Stellar Core ledger is ahead of Horizon, we can block until Horizon catches up. The only drawback of this approach is that we are adding another dependency on functionality in Stellar Core. We have been trying to reduce coupling between Horizon and Stellar Core as much as possible.

The last solution that is worth mentioning is that we could modify Stellar Core to provide a new error code (e.g. ErrTransactionAlreadyExists) whenever someone tries to submit a transaction which has already been included in a recent ledger.

For now we should probably update the documentation for Horizon's transaction submission endpoint with a warning that, if Horizon's ingestion system is lagging behind Stellar Core, Horizon might respond with an ErrBadSequence error when the transaction was actually included in a ledger that has not yet been ingested by Horizon.

bartekn · 2020-07-30T22:16:01Z

To solve this issue we could sleep before calling txResultByHash after a bad sequence error.

During the meeting today I suggested a solution (similar to this one) that simply ignores bad sequence error returned by Stellar-Core and relies on Horizon view only. Later during that meeting I said that I was wrong because if a user submits a tx with sequence number larger than current + 1 they would never see a bad seq error. But after looking at the code I think the original solution (simply ignore bad seq from Stellar-Core) can work:

Transaction is only submitted to Stellar-Core when it's sequence number is +1 the current sequence of the account in Horizon DB (Horizon maintains a queue for accounts' seqnums).

If user submits a tx with a seq larger then current + 1 Horizon waits for 10 sec. for a smaller sequence number to be submitted. If this doesn't happen it returns bad sequence error (this case confused me during the call):

go/services/horizon/internal/txsub/sequence/queue.go

Lines 106 to 112 in 43f14b3

    
           if time.Since(q.lastActiveAt) > q.timeout { 
        
           	for q.Size() > 0 { 
        
           		ch, _ := q.pop() 
        
           		ch <- ErrBadSequence 
        
           		close(ch) 
        
           	} 
        
           }

If tx is finally submitted to Stellar-Core it means that according to Horizon DB, sequence number of the tx is exactly the sequence number of the source account + 1. Now, if it returns bad sequence error there are two cases:

The transaction in current requests incremented a sequence number.
Another transaction submitted elsewhere incremented a sequence number.

So, I think we should replace the code that checks for tx status for the second time:

go/services/horizon/internal/txsub/system.go

Lines 159 to 168 in 43f14b3

    
           // If error is txBAD_SEQ, check for the result again 
        
           tx, err = txResultByHash(db, hash) 
        
           if err == nil { 
        
           	// If the found use it as the result 
        
           	sys.finish(ctx, hash, response, Result{Transaction: tx}) 
        
           } else { 
        
           	// finally, return the bad_seq error if no result was found on 2nd attempt 
        
           	sys.finish(ctx, hash, response, Result{Err: sr.Err}) 
        
           }

with the code that checks (in 1 sec. intervals until HTTP request timeout) if a sequence number of the source account is equal of greater than tx seqnum. If it is, check if transaction exists in Horizon DB. If it does, return it's status. If not (case ii above), return bad sequence.

Does it make sense?

tamirms · 2020-07-31T06:31:18Z

@bartekn yes, I think that solution will work. I'll create a PR on monday with the implementation

bartekn · 2020-08-11T11:06:40Z

This should be fixed in Horizon 1.7.0 (will be deployed to horizon.stellar.org on Wednesday). Please reopen if there's still something wrong with the transaction submission.

bartekn · 2020-08-12T19:44:59Z

@bantalon unfortunately there's a delay deploying this version to production (we are doing some maintenance tasks connected to this deployment). Please observe horizon_version on https://horizon.stellar.org/ to check if it's on 1.7.0 already. Hopefully, it will be deployed by the end of this week.

bantalon added the bug label May 27, 2020

bartekn self-assigned this May 27, 2020

bartekn closed this as completed Jun 4, 2020

bartekn reopened this Jul 6, 2020

bartekn added this to the Horizon 1.6.0 milestone Jul 6, 2020

bartekn removed their assignment Jul 8, 2020

tamirms self-assigned this Jul 8, 2020

tamirms mentioned this issue Jul 10, 2020

services/horizon/internal/txsub: Check transaction submission results using repeatable read transaction #2805

Merged

7 tasks

abuiles linked a pull request Jul 10, 2020 that will close this issue

services/horizon/internal/txsub: Check transaction submission results using repeatable read transaction #2805

Merged

7 tasks

bartekn closed this as completed Jul 17, 2020

bartekn reopened this Jul 27, 2020

bartekn modified the milestones: Horizon 1.6.0, Horizon 1.7.0 Jul 27, 2020

2opremio changed the title ~~Horizon returned returned bad sequence error although the transaction was added to ledger~~ Horizon returned bad sequence error although the transaction was added to ledger Jul 30, 2020

tamirms mentioned this issue Aug 3, 2020

services/horizon/internal/txsub: Fix false positive bad sequence error when submitting transactions #2877

Merged

7 tasks

bartekn closed this as completed Aug 11, 2020

ire-and-curses mentioned this issue Nov 11, 2021

services/horizon: chained transactions are delayed #4062

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Horizon returned bad sequence error although the transaction was added to ledger #2628

Horizon returned bad sequence error although the transaction was added to ledger #2628

bantalon commented May 27, 2020

bartekn commented May 27, 2020

bantalon commented May 28, 2020

bantalon commented Jun 4, 2020

bartekn commented Jun 4, 2020

bantalon commented Jul 6, 2020

bartekn commented Jul 6, 2020 •

edited

Loading

bartekn commented Jul 6, 2020 •

edited

Loading

bartekn commented Jul 17, 2020 •

edited by ire-and-curses

Loading

bantalon commented Jul 26, 2020

bartekn commented Jul 27, 2020

bantalon commented Jul 27, 2020

tamirms commented Jul 30, 2020

bartekn commented Jul 30, 2020

tamirms commented Jul 31, 2020

bartekn commented Aug 11, 2020

bartekn commented Aug 12, 2020

Horizon returned bad sequence error although the transaction was added to ledger #2628

Horizon returned bad sequence error although the transaction was added to ledger #2628

Comments

bantalon commented May 27, 2020

What version are you using?

What did you do?

What did you expect to see?

What did you see instead?

bartekn commented May 27, 2020

bantalon commented May 28, 2020

bantalon commented Jun 4, 2020

bartekn commented Jun 4, 2020

bantalon commented Jul 6, 2020

bartekn commented Jul 6, 2020 • edited Loading

bartekn commented Jul 6, 2020 • edited Loading

bartekn commented Jul 17, 2020 • edited by ire-and-curses Loading

bantalon commented Jul 26, 2020

bartekn commented Jul 27, 2020

bantalon commented Jul 27, 2020

tamirms commented Jul 30, 2020

bartekn commented Jul 30, 2020

tamirms commented Jul 31, 2020

bartekn commented Aug 11, 2020

bartekn commented Aug 12, 2020

bartekn commented Jul 6, 2020 •

edited

Loading

bartekn commented Jul 6, 2020 •

edited

Loading

bartekn commented Jul 17, 2020 •

edited by ire-and-curses

Loading