Fix captive core integration tests #3144

Closed

Conversation

@2opremio (Contributor) commented Oct 20, 2020:

Closes #3153

Follow-up to #3137

@2opremio (Contributor, Author) commented Oct 21, 2020:

After some changes suggested by @bartekn (3d5c083, plus setting RUN_STANDALONE=false in the quickstart core instance) I have made progress, but it still doesn't fully work because:

  1. Captive Core's temporary db directory disappears, breaking execution (#3145).
  2. The wait time until the first ledger checkpoint: Captive Core doesn't start streaming until Core closes the first checkpoint ledger. This means we need to wait ~5 minutes until Horizon is operative and, in turn, ~5 minutes before running each test (see the checkpoint arithmetic sketched below). Unfortunately, run-from (Switch to stellar-core run-from command when added to Stellar-Core, #2842) won't help with this. As a workaround, we could build an alternative quickstart Docker image with a pre-generated initial checkpoint. However, this would still be a problem for end users testing or using a standalone network.
  3. Captive Core's online mode fails to start at ledger 1 (#3157).
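
For context on where the ~5 minutes in item 2 comes from: on a default network, Core publishes a history checkpoint every 64 ledgers, and the first checkpoint is ledger 63. A minimal Go sketch of that arithmetic (the helper names are mine, not from the test harness):

```go
package main

import "fmt"

// checkpointFrequency: stellar-core publishes a history checkpoint
// every 64 ledgers on a default network.
const checkpointFrequency = 64

// isCheckpoint reports whether seq is a checkpoint ledger. Checkpoints
// land on sequences 63, 127, 191, ..., i.e. (seq+1) % 64 == 0.
func isCheckpoint(seq uint32) bool {
	return (seq+1)%checkpointFrequency == 0
}

// nextCheckpoint returns the first checkpoint ledger at or after seq.
func nextCheckpoint(seq uint32) uint32 {
	return (seq/checkpointFrequency+1)*checkpointFrequency - 1
}

func main() {
	// With the default ~5s close time, a fresh network needs roughly
	// 63 * 5s ≈ 5 minutes to reach its first checkpoint (ledger 63).
	fmt.Println(isCheckpoint(63), nextCheckpoint(1)) // true 63
}
```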

@2opremio (Contributor, Author) commented Oct 28, 2020:

I have finally managed to get the Captive Core tests to start after fast-forwarding to the first checkpoint.

However, transaction submissions time out because it takes Captive Core ~50 seconds to obtain the new ledger :(

Core seems happy and Captive Core seems connected (note the "authenticated_count" : 1):

$ docker exec -ti horizon-integration curl localhost:11626/info
{
   "info" : {
      "build" : "stellar-core 15.0.0 (8d6e6d74cfae65943211bdcc5ba083976d525a04)",
      "history_failure_rate" : "0.0",
      "ledger" : {
         "age" : 787,
         "baseFee" : 100,
         "baseReserve" : 100000000,
         "closeTime" : 1603903411,
         "hash" : "3fb8b9d0ff3fd2b6a7801f6cc08bdc7bf26fc1a2029947efcd5b40cf82850e9d",
         "maxTxSetSize" : 100,
         "num" : 66,
         "version" : 15
      },
      "network" : "Standalone Network ; February 2017",
      "peers" : {
         "authenticated_count" : 1,
         "pending_count" : 0
      },
      "protocol_version" : 15,
      "quorum" : {
         "node" : "self",
         "qset" : {
            "agree" : 1,
            "cost" : 0,
            "delayed" : 0,
            "disagree" : 0,
            "fail_at" : 0,
            "hash" : "d05a3b",
            "lag_ms" : null,
            "ledger" : 65,
            "missing" : 0,
            "phase" : "EXTERNALIZE",
            "validated" : true
         },
         "transitive" : {
            "critical" : null,
            "intersection" : true,
            "last_check_ledger" : 2,
            "node_count" : 0
         }
      },
      "startedOn" : "2020-10-28T16:42:26Z",
      "state" : "Synced!"
   }
}
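
The curl check above is what the test harness effectively automates. Below is a minimal sketch of blocking until Core reports itself synced by polling the same /info endpoint; waitForSynced is a hypothetical helper, not the harness's actual waitForCore:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// waitForSynced polls stellar-core's /info endpoint until the node
// reports `"state": "Synced!"` or the context expires.
func waitForSynced(ctx context.Context, infoURL string) error {
	for {
		var payload struct {
			Info struct {
				State string `json:"state"`
			} `json:"info"`
		}
		resp, err := http.Get(infoURL)
		if err == nil {
			err = json.NewDecoder(resp.Body).Decode(&payload)
			resp.Body.Close()
		}
		if err == nil && payload.Info.State == "Synced!" {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	// Same endpoint queried with curl above.
	fmt.Println(waitForSynced(ctx, "http://localhost:11626/info"))
}
```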

However, from the Horizon logs, it seems like Horizon cannot obtain new ledgers quickly enough and Captive Core loses consensus:

time="2020-10-28T16:42:39.908Z" level=info msg="Starting request" app_name=undefined app_version=undefined client_name=go-stellar-sdk client_version=2.1.0 forwarded_ip= host="localhost:8000" ip=172.17.0.1 ip_port="172.17.0.1:43132" method=POST path="/transactions?tx=AAAAAgAAAABzdv3ojkzWHMD7KUoXhrPx0GH18vHKV0ZfqpMiEblG1gAAAGQAAAAAAAAAAQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAADgAAAAAAAAAABfXhAAAAAAEAAAAAAAAAAHN2%2FeiOTNYcwPspSheGs%2FHQYfXy8cpXRl%2BqkyIRuUbWAAAAAAAAAAAAAAABEblG1gAAAED19qgymvbjGRkccKkWlneE5jftWbUU0M%2FlJs8XDqxyNQvu5%2FVU13TRqOlUzhfhkcqg8jdyssNyyXdH4glOvYYM" pid=199 referer=undefined req=a603cfd9f5d0/RvIGIwd6CP-000008 streaming=false
time="2020-10-28T16:42:39.908Z" level=info msg="Processing transaction" hash=ee53d9a038fd138cbc9fd459eef06c17adfdfa1003261739941079fd6daa7b74 pid=199 req=a603cfd9f5d0/RvIGIwd6CP-000008 service=txsub.System tx=AAAAAgAAAABzdv3ojkzWHMD7KUoXhrPx0GH18vHKV0ZfqpMiEblG1gAAAGQAAAAAAAAAAQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAADgAAAAAAAAAABfXhAAAAAAEAAAAAAAAAAHN2/eiOTNYcwPspSheGs/HQYfXy8cpXRl+qkyIRuUbWAAAAAAAAAAAAAAABEblG1gAAAED19qgymvbjGRkccKkWlneE5jftWbUU0M/lJs8XDqxyNQvu5/VU13TRqOlUzhfhkcqg8jdyssNyyXdH4glOvYYM tx_type=EnvelopeTypeEnvelopeTypeTx
time="2020-10-28T16:42:39.917Z" level=info msg="Submitter result" duration=0.005336614 err="<nil>" pid=199 req=a603cfd9f5d0/RvIGIwd6CP-000008 service=txsub.submitter
time="2020-10-28T16:42:39.917Z" level=info msg="Created a new submission for a transaction" hash=ee53d9a038fd138cbc9fd459eef06c17adfdfa1003261739941079fd6daa7b74 pid=199 service=txsub.submissionList
time="2020-10-28T16:42:40.473Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:42:40.475Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest

[...]

time="2020-10-28T16:43:09.668Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest
time="2020-10-28T16:43:10.119Z" level=warning msg="Cleared submission due to timeout" hash=ee53d9a038fd138cbc9fd459eef06c17adfdfa1003261739941079fd6daa7b74 listeners=1 pid=199 service=txsub.submissionList
time="2020-10-28T16:43:10.119Z" level=info msg="request failed due to error" error="problem: timeout" pid=199 req=a603cfd9f5d0/RvIGIwd6CP-000008 stack=unknown
time="2020-10-28T16:43:10.119Z" level=info msg="Finished request" app_name=undefined app_version=undefined bytes=310 client_name=go-stellar-sdk client_version=2.1.0 duration=30.21106663 forwarded_ip= host="localhost:8000" ip=172.17.0.1 ip_port="172.17.0.1:43132" method=POST path="/transactions?tx=AAAAAgAAAABzdv3ojkzWHMD7KUoXhrPx0GH18vHKV0ZfqpMiEblG1gAAAGQAAAAAAAAAAQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAADgAAAAAAAAAABfXhAAAAAAEAAAAAAAAAAHN2%2FeiOTNYcwPspSheGs%2FHQYfXy8cpXRl%2BqkyIRuUbWAAAAAAAAAAAAAAABEblG1gAAAED19qgymvbjGRkccKkWlneE5jftWbUU0M%2FlJs8XDqxyNQvu5%2FVU13TRqOlUzhfhkcqg8jdyssNyyXdH4glOvYYM" pid=199 referer=undefined req=a603cfd9f5d0/RvIGIwd6CP-000008 route=/transactions status=504 streaming=false
time="2020-10-28T16:43:10.671Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:43:10.676Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest
time="2020-10-28T16:43:11.679Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:43:11.686Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest
GATSN [Herder WARNING] Lost track of consensus
GATSN [Herder WARNING] Out of sync context: {
   "queue" : null,
   "scp" : {
      "54" : {
         "ballotProtocol" : {
            "ballot" : "(4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ])",
            "heard" : true,
            "phase" : "EXTERNALIZE",
            "state" : "i: 54 | EXTERNALIZE | b: (4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | p: (4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | p': (<null_ballot>) | h: (4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | c: (1,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | M: 2"
         },
         "nomination" : {
            "roundnumber" : 0,
            "started" : false
         },
         "quorum_sets" : {
            "d05a3b" : {
               "t" : 1,
               "v" : [ "local_core" ]
            }
         },
         "statements" : [
            [
               1603903357,
               "{ENV@local_core |  i: 54 | EXTERNALIZE | c: (1,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | h.n: 1 | (lastD): d05a3b }",
               false
            ],

[...]

After a while, Captive Core seems to recover consensus and Horizon obtains the ledger, but it takes ~50 seconds:

GATSN [Herder INFO] Quorum information for 64 : {"agree":1,"cost":948,"delayed":0,"disagree":0,"fail_at":1,"hash":"d05a3b","lag_ms":0,"ledger":64,"missing":0,"phase":"EXTERNALIZE"}

GATSN [Herder WARNING] Ledger took 45.009415267 seconds, SCP information:{"node":"GATSN","qset":{"agree":1,"cost":{"66":{"local_core":624}},"delayed":null,"disagree":null,"fail_at":1,"fail_with":["local_core"],"hash":"d05a3b","lag_ms":{"local_core":0},"ledger":66,"missing":null,"phase":"EXTERNALIZE","value":{"t":1,"v":["local_core"]}},"transitive":{"critical":null,"intersection":true,"last_check_ledger":54,"node_count":2}}

GATSN [Ledger INFO] Got consensus: [seq=66, prev=4b64a4, txs=1, ops=1, sv: [ SIGNED@local_core txH: 0d5de6, ct: 1603903411, upgrades: [ ] ]]
GATSN [Tx INFO] applying ledger 66 (txs:1, ops:1)
GATSN [Ledger INFO] Closed ledger: [seq=66, hash=3fb8b9]
GATSN [Ledger INFO] Changing state LM_CATCHING_UP_STATE -> LM_SYNCED_STATE
GATSN [Herder INFO] Transitive closure of quorum has changed, re-analyzing.
GATSN [SCP INFO] Calculating 2-node network quorum intersection
GATSN [SCP INFO] Examining 1 node groups for intersection-criticality
GATSN [SCP INFO] No intersection-critical groups found
time="2020-10-28T16:43:22.762Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:43:22.764Z" level=info msg="Processing ledger" commit=true ledger=true pid=199 sequence=66 service=ingest state=true
time="2020-10-28T16:43:22.774Z" level=info msg="Processed ledger" commit=true duration=0.010409042 ledger=true pid=199 sequence=66 service=ingest state=true stats_accounts_created=0 stats_accounts_removed=0 stats_accounts_updated=3 stats_claimable_balances_created=1 stats_claimable_balances_removed=0 stats_claimable_balances_updated=0 stats_data_created=0 stats_data_removed=0 stats_data_updated=0 stats_offers_created=0 stats_offers_removed=0 stats_offers_updated=0 stats_operations=1 stats_operations_account_merge=0 stats_operations_allow_trust=0 stats_operations_begin_sponsoring_future_reserves=0 stats_operations_bump_sequence=0 stats_operations_change_trust=0 stats_operations_claim_claimable_balance=0 stats_operations_create_account=0 stats_operations_create_claimable_balance=1 stats_operations_create_passive_sell_offer=0 stats_operations_end_sponsoring_future_reserves=0 stats_operations_in_failed=0 stats_operations_in_successful=1 stats_operations_inflation=0 stats_operations_manage_buy_offer=0 stats_operations_manage_data=0 stats_operations_manage_sell_offer=0 stats_operations_path_payment_strict_receive=0 stats_operations_path_payment_strict_send=0 stats_operations_payment=0 stats_operations_revoke_sponsorship=0 stats_operations_set_options=0 stats_transactions=1 stats_transactions_failed=0 stats_transactions_successful=1 stats_trust_lines_created=0 stats_trust_lines_removed=0 stats_trust_lines_updated=0
time="2020-10-28T16:43:22.775Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=66)" pid=199 service=ingest
time="2020-10-28T16:43:22.777Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=66 ingest_sequence=67 pid=199 service=ingest

Full logs:

horizon-stdout---supervisor-XKwdGb.log.gz
stellar-core-stdout---supervisor-vcQO0t.log.gz

@@ -309,7 +310,8 @@ func (i *Test) waitForCore() {

// For some reason, we need to do some waiting blackmagic for Core to publish the checkpoint.
// Otherwise, it indefinitely stays in `"status" : [ "Publishing 1 queued checkpoints [63-63]: Waiting: prepare-snapshot" ]`
time.Sleep(5 * time.Second)
Contributor:

I merged a PR with a method that can help here. The problem with the code above (closing ledgers in a loop) is that manualclose is not blocking (TIL!). So running it 62 times doesn't mean 62 new ledgers will be closed.
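
(For illustration only, since the merged method isn't shown in this thread: a minimal sketch of one way to make manualclose effectively blocking, assuming Core's HTTP command port on 11626. It issues the command and then polls /info until info.ledger.num advances; closeLedger and ledgerNum are hypothetical names.)

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const coreURL = "http://localhost:11626" // Core's HTTP command port

// ledgerNum reads info.ledger.num from Core's /info endpoint.
func ledgerNum() (uint32, error) {
	var payload struct {
		Info struct {
			Ledger struct {
				Num uint32 `json:"num"`
			} `json:"ledger"`
		} `json:"info"`
	}
	resp, err := http.Get(coreURL + "/info")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&payload)
	return payload.Info.Ledger.Num, err
}

// closeLedger issues a manualclose and then polls /info until the ledger
// sequence actually advances, since manualclose itself returns before
// the close has happened.
func closeLedger() error {
	before, err := ledgerNum()
	if err != nil {
		return err
	}
	resp, err := http.Get(coreURL + "/manualclose")
	if err != nil {
		return err
	}
	resp.Body.Close()
	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		now, err := ledgerNum()
		if err != nil {
			return err
		}
		if now > before {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("ledger did not advance past %d", before)
}

func main() {
	// Closing 62 ledgers this way really closes 62 ledgers.
	for i := 0; i < 62; i++ {
		if err := closeLedger(); err != nil {
			fmt.Println(err)
			return
		}
	}
}
```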

Contributor (Author):

I don't think it will help in this case because, even if the ledgers keep advancing, the publishing status doesn't disappear unless you wait long enough before closing a ledger.

}
}

// For some reason, we need to do some waiting blackmagic for Core to publish the checkpoint.
Contributor:

It's not blackmagic; TIL, when I worked on the other PR, that Core publishes the checkpoint after closing the next ledger after the checkpoint ledger. Maybe it's just the behaviour in manual mode 🤷.

Contributor (Author):

The waiting times seem like blackmagic ...

Contributor (Author):

Also, the problem here isn't only advancing the ledger sequences but also getting Core to publish the checkpoint. I have observed that, unless you wait long enough before manually closing a ledger, Core remains in the checkpoint-publishing state even if the ledger sequences keep advancing.
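
One possible replacement for the fixed time.Sleep above is to wait explicitly on the publish status string quoted earlier ("Publishing 1 queued checkpoints ..."), with the caveat from this thread that the status may only clear after waiting long enough anyway. A minimal sketch; waitForPublish is a hypothetical helper:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// waitForPublish polls /info until no "Publishing N queued checkpoints"
// status message remains, i.e. the history checkpoint has been published.
func waitForPublish(infoURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var payload struct {
			Info struct {
				Status []string `json:"status"`
			} `json:"info"`
		}
		resp, err := http.Get(infoURL)
		if err != nil {
			return err
		}
		err = json.NewDecoder(resp.Body).Decode(&payload)
		resp.Body.Close()
		if err != nil {
			return err
		}
		publishing := false
		for _, s := range payload.Info.Status {
			if strings.Contains(s, "Publishing") {
				publishing = true
				break
			}
		}
		if !publishing {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for checkpoint publish")
}

func main() {
	fmt.Println(waitForPublish("http://localhost:11626/info", 30*time.Second))
}
```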

@2opremio (Contributor, Author) commented:

New log dump, including trace logs for Core (both captive and normal Core):
logs.tar.gz

@2opremio (Contributor, Author) commented Oct 29, 2020:

After a few conversations with @rokopt and @graydon, it turns out that Core's MANUAL_CLOSE wasn't originally intended for ...

  1. ... speeding through ledgers in a fresh network. Hence the unexpected magical waits I had to add when (successfully) adding MANUAL_CLOSE support to the non-captive-core integration tests (services/horizon: Update integration tests to use MANUAL_CLOSE, #3167).

  2. ... multi-node networks (in our case, one Core node and another Captive Core node). That's why Captive Core falls out of sync after the first checkpoint.

@graydon was nice enough to collect the issues I encountered at stellar/stellar-core#2778. The Core team will try to resolve them so that MANUAL_CLOSE fully supports our use case.

To move forward with the Captive Core integration tests, we have two options:

  1. Simply wait for the Core team to (hopefully) solve the issues at stellar/stellar-core#2778 (Synchronization issues when driving core with manual-close from horizon).

  2. Stop using two Core nodes for the integration tests. @bartekn already suggested doing this a while ago, but we decided to move forward with a two-node configuration to more closely resemble how Captive Core would connect to an existing network.
    Using a single Core node for both receiving transactions and emitting metadata would allow us to:

    • Emit metadata from ledger 1 instead of needing to wait for the first checkpoint. Having access to the first checkpoint is only required when joining a network, which isn't needed when using a single node. This means we wouldn't need to fast-forward to ledger 63.
    • Use MANUAL_CLOSE to speed up the tests without synchronization issues.

    However:

    • I don't think we can do that with an unmodified Captive Core ingestion backend, and I am not sure what changes it would require or how much work it would entail.
    • As I already mentioned, it would move us away from a real-world scenario, in which Captive Core joins an existing network.

Regardless, I think I am going to park this work for a while and clear my mind by working on higher-priority issues.

@2opremio force-pushed the fix-captive-core-integration-tests branch from 556c695 to b9be268 on October 29, 2020 19:49.
@2opremio force-pushed the fix-captive-core-integration-tests branch from b9be268 to 9723255 on October 29, 2020 19:52.
@2opremio closed this on Jan 4, 2021.