Fix captive core integration tests #3144

Closed

Conversation

@2opremio (Contributor) commented Oct 20, 2020:

Closes #3153

Follow-up to #3137

@2opremio (Contributor, Author) commented Oct 21, 2020:

After some changes suggested by @bartekn (3d5c083, plus setting RUN_STANDALONE=false in the quickstart core instance) I have made progress, but it still doesn't fully work because:

  1. Captive Core's temporary db directory disappears, breaking execution (#3145).
  2. The wait time until the first ledger checkpoint: Captive Core doesn't start streaming until Core closes the first checkpoint ledger. This means we need to wait ~5 minutes until Horizon is operative and, in turn, ~5 minutes before running each test (see the checkpoint arithmetic sketched below). Unfortunately, run-from (Switch to stellar-core run-from command when added to Stellar-Core, #2842) won't help with this. As a workaround, we could build an alternative quickstart Docker image with a pre-generated initial checkpoint. However, this would still be a problem for end users testing or using a standalone network.
  3. Captive Core's online mode fails to start at ledger 1 (#3157).
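
For context on where the ~5 minutes in item 2 comes from: on a default network, Core publishes a history checkpoint every 64 ledgers, and the first checkpoint is ledger 63. A minimal Go sketch of that arithmetic (the helper names are mine, not from the test harness):

```go
package main

import "fmt"

// checkpointFrequency: stellar-core publishes a history checkpoint
// every 64 ledgers on a default network.
const checkpointFrequency = 64

// isCheckpoint reports whether seq is a checkpoint ledger. Checkpoints
// land on sequences 63, 127, 191, ..., i.e. (seq+1) % 64 == 0.
func isCheckpoint(seq uint32) bool {
	return (seq+1)%checkpointFrequency == 0
}

// nextCheckpoint returns the first checkpoint ledger at or after seq.
func nextCheckpoint(seq uint32) uint32 {
	return (seq/checkpointFrequency+1)*checkpointFrequency - 1
}

func main() {
	// With the default ~5s close time, a fresh network needs roughly
	// 63 * 5s ≈ 5 minutes to reach its first checkpoint (ledger 63).
	fmt.Println(isCheckpoint(63), nextCheckpoint(1)) // true 63
}
```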

@2opremio (Contributor, Author) commented Oct 28, 2020:

I have finally managed to get the Captive Core tests to start after fast-forwarding to the first checkpoint.

However, transaction submissions time out because it takes Captive Core ~50 seconds to obtain the new ledger :(

Core seems happy and Captive Core seems connected (note the "authenticated_count" : 1):

$ docker exec -ti horizon-integration curl localhost:11626/info
{
   "info" : {
      "build" : "stellar-core 15.0.0 (8d6e6d74cfae65943211bdcc5ba083976d525a04)",
      "history_failure_rate" : "0.0",
      "ledger" : {
         "age" : 787,
         "baseFee" : 100,
         "baseReserve" : 100000000,
         "closeTime" : 1603903411,
         "hash" : "3fb8b9d0ff3fd2b6a7801f6cc08bdc7bf26fc1a2029947efcd5b40cf82850e9d",
         "maxTxSetSize" : 100,
         "num" : 66,
         "version" : 15
      },
      "network" : "Standalone Network ; February 2017",
      "peers" : {
         "authenticated_count" : 1,
         "pending_count" : 0
      },
      "protocol_version" : 15,
      "quorum" : {
         "node" : "self",
         "qset" : {
            "agree" : 1,
            "cost" : 0,
            "delayed" : 0,
            "disagree" : 0,
            "fail_at" : 0,
            "hash" : "d05a3b",
            "lag_ms" : null,
            "ledger" : 65,
            "missing" : 0,
            "phase" : "EXTERNALIZE",
            "validated" : true
         },
         "transitive" : {
            "critical" : null,
            "intersection" : true,
            "last_check_ledger" : 2,
            "node_count" : 0
         }
      },
      "startedOn" : "2020-10-28T16:42:26Z",
      "state" : "Synced!"
   }
}
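
The curl check above is what the test harness effectively automates. Below is a minimal sketch of blocking until Core reports itself synced by polling the same /info endpoint; waitForSynced is a hypothetical helper, not the harness's actual waitForCore:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// waitForSynced polls stellar-core's /info endpoint until the node
// reports `"state": "Synced!"` or the context expires.
func waitForSynced(ctx context.Context, infoURL string) error {
	for {
		var payload struct {
			Info struct {
				State string `json:"state"`
			} `json:"info"`
		}
		resp, err := http.Get(infoURL)
		if err == nil {
			err = json.NewDecoder(resp.Body).Decode(&payload)
			resp.Body.Close()
		}
		if err == nil && payload.Info.State == "Synced!" {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	// Same endpoint queried with curl above.
	fmt.Println(waitForSynced(ctx, "http://localhost:11626/info"))
}
```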

However, from the Horizon logs, it seems like Horizon cannot obtain new ledgers quickly enough and Captive Core loses consensus:

time="2020-10-28T16:42:39.908Z" level=info msg="Starting request" app_name=undefined app_version=undefined client_name=go-stellar-sdk client_version=2.1.0 forwarded_ip= host="localhost:8000" ip=172.17.0.1 ip_port="172.17.0.1:43132" method=POST path="/transactions?tx=AAAAAgAAAABzdv3ojkzWHMD7KUoXhrPx0GH18vHKV0ZfqpMiEblG1gAAAGQAAAAAAAAAAQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAADgAAAAAAAAAABfXhAAAAAAEAAAAAAAAAAHN2%2FeiOTNYcwPspSheGs%2FHQYfXy8cpXRl%2BqkyIRuUbWAAAAAAAAAAAAAAABEblG1gAAAED19qgymvbjGRkccKkWlneE5jftWbUU0M%2FlJs8XDqxyNQvu5%2FVU13TRqOlUzhfhkcqg8jdyssNyyXdH4glOvYYM" pid=199 referer=undefined req=a603cfd9f5d0/RvIGIwd6CP-000008 streaming=false
time="2020-10-28T16:42:39.908Z" level=info msg="Processing transaction" hash=ee53d9a038fd138cbc9fd459eef06c17adfdfa1003261739941079fd6daa7b74 pid=199 req=a603cfd9f5d0/RvIGIwd6CP-000008 service=txsub.System tx=AAAAAgAAAABzdv3ojkzWHMD7KUoXhrPx0GH18vHKV0ZfqpMiEblG1gAAAGQAAAAAAAAAAQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAADgAAAAAAAAAABfXhAAAAAAEAAAAAAAAAAHN2/eiOTNYcwPspSheGs/HQYfXy8cpXRl+qkyIRuUbWAAAAAAAAAAAAAAABEblG1gAAAED19qgymvbjGRkccKkWlneE5jftWbUU0M/lJs8XDqxyNQvu5/VU13TRqOlUzhfhkcqg8jdyssNyyXdH4glOvYYM tx_type=EnvelopeTypeEnvelopeTypeTx
time="2020-10-28T16:42:39.917Z" level=info msg="Submitter result" duration=0.005336614 err="<nil>" pid=199 req=a603cfd9f5d0/RvIGIwd6CP-000008 service=txsub.submitter
time="2020-10-28T16:42:39.917Z" level=info msg="Created a new submission for a transaction" hash=ee53d9a038fd138cbc9fd459eef06c17adfdfa1003261739941079fd6daa7b74 pid=199 service=txsub.submissionList
time="2020-10-28T16:42:40.473Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:42:40.475Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest

[...]

time="2020-10-28T16:43:09.668Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest
time="2020-10-28T16:43:10.119Z" level=warning msg="Cleared submission due to timeout" hash=ee53d9a038fd138cbc9fd459eef06c17adfdfa1003261739941079fd6daa7b74 listeners=1 pid=199 service=txsub.submissionList
time="2020-10-28T16:43:10.119Z" level=info msg="request failed due to error" error="problem: timeout" pid=199 req=a603cfd9f5d0/RvIGIwd6CP-000008 stack=unknown
time="2020-10-28T16:43:10.119Z" level=info msg="Finished request" app_name=undefined app_version=undefined bytes=310 client_name=go-stellar-sdk client_version=2.1.0 duration=30.21106663 forwarded_ip= host="localhost:8000" ip=172.17.0.1 ip_port="172.17.0.1:43132" method=POST path="/transactions?tx=AAAAAgAAAABzdv3ojkzWHMD7KUoXhrPx0GH18vHKV0ZfqpMiEblG1gAAAGQAAAAAAAAAAQAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAADgAAAAAAAAAABfXhAAAAAAEAAAAAAAAAAHN2%2FeiOTNYcwPspSheGs%2FHQYfXy8cpXRl%2BqkyIRuUbWAAAAAAAAAAAAAAABEblG1gAAAED19qgymvbjGRkccKkWlneE5jftWbUU0M%2FlJs8XDqxyNQvu5%2FVU13TRqOlUzhfhkcqg8jdyssNyyXdH4glOvYYM" pid=199 referer=undefined req=a603cfd9f5d0/RvIGIwd6CP-000008 route=/transactions status=504 streaming=false
time="2020-10-28T16:43:10.671Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:43:10.676Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest
time="2020-10-28T16:43:11.679Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:43:11.686Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=65 ingest_sequence=66 pid=199 service=ingest
GATSN [Herder WARNING] Lost track of consensus
GATSN [Herder WARNING] Out of sync context: {
   "queue" : null,
   "scp" : {
      "54" : {
         "ballotProtocol" : {
            "ballot" : "(4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ])",
            "heard" : true,
            "phase" : "EXTERNALIZE",
            "state" : "i: 54 | EXTERNALIZE | b: (4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | p: (4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | p': (<null_ballot>) | h: (4294967295,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | c: (1,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | M: 2"
         },
         "nomination" : {
            "roundnumber" : 0,
            "started" : false
         },
         "quorum_sets" : {
            "d05a3b" : {
               "t" : 1,
               "v" : [ "local_core" ]
            }
         },
         "statements" : [
            [
               1603903357,
               "{ENV@local_core |  i: 54 | EXTERNALIZE | c: (1,[ SIGNED@local_core txH: 6a8524, ct: 1603903399, upgrades: [ ] ]) | h.n: 1 | (lastD): d05a3b }",
               false
            ],

[...]

After a while, Captive Core seems to recover consensus and Horizon obtains the ledger, but it takes ~50 seconds:

GATSN [Herder INFO] Quorum information for 64 : {"agree":1,"cost":948,"delayed":0,"disagree":0,"fail_at":1,"hash":"d05a3b","lag_ms":0,"ledger":64,"missing":0,"phase":"EXTERNALIZE"}

GATSN [Herder WARNING] Ledger took 45.009415267 seconds, SCP information:{"node":"GATSN","qset":{"agree":1,"cost":{"66":{"local_core":624}},"delayed":null,"disagree":null,"fail_at":1,"fail_with":["local_core"],"hash":"d05a3b","lag_ms":{"local_core":0},"ledger":66,"missing":null,"phase":"EXTERNALIZE","value":{"t":1,"v":["local_core"]}},"transitive":{"critical":null,"intersection":true,"last_check_ledger":54,"node_count":2}}

GATSN [Ledger INFO] Got consensus: [seq=66, prev=4b64a4, txs=1, ops=1, sv: [ SIGNED@local_core txH: 0d5de6, ct: 1603903411, upgrades: [ ] ]]
GATSN [Tx INFO] applying ledger 66 (txs:1, ops:1)
GATSN [Ledger INFO] Closed ledger: [seq=66, hash=3fb8b9]
GATSN [Ledger INFO] Changing state LM_CATCHING_UP_STATE -> LM_SYNCED_STATE
GATSN [Herder INFO] Transitive closure of quorum has changed, re-analyzing.
GATSN [SCP INFO] Calculating 2-node network quorum intersection
GATSN [SCP INFO] Examining 1 node groups for intersection-criticality
GATSN [SCP INFO] No intersection-critical groups found
time="2020-10-28T16:43:22.762Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=65)" pid=199 service=ingest
time="2020-10-28T16:43:22.764Z" level=info msg="Processing ledger" commit=true ledger=true pid=199 sequence=66 service=ingest state=true
time="2020-10-28T16:43:22.774Z" level=info msg="Processed ledger" commit=true duration=0.010409042 ledger=true pid=199 sequence=66 service=ingest state=true stats_accounts_created=0 stats_accounts_removed=0 stats_accounts_updated=3 stats_claimable_balances_created=1 stats_claimable_balances_removed=0 stats_claimable_balances_updated=0 stats_data_created=0 stats_data_removed=0 stats_data_updated=0 stats_offers_created=0 stats_offers_removed=0 stats_offers_updated=0 stats_operations=1 stats_operations_account_merge=0 stats_operations_allow_trust=0 stats_operations_begin_sponsoring_future_reserves=0 stats_operations_bump_sequence=0 stats_operations_change_trust=0 stats_operations_claim_claimable_balance=0 stats_operations_create_account=0 stats_operations_create_claimable_balance=1 stats_operations_create_passive_sell_offer=0 stats_operations_end_sponsoring_future_reserves=0 stats_operations_in_failed=0 stats_operations_in_successful=1 stats_operations_inflation=0 stats_operations_manage_buy_offer=0 stats_operations_manage_data=0 stats_operations_manage_sell_offer=0 stats_operations_path_payment_strict_receive=0 stats_operations_path_payment_strict_send=0 stats_operations_payment=0 stats_operations_revoke_sponsorship=0 stats_operations_set_options=0 stats_transactions=1 stats_transactions_failed=0 stats_transactions_successful=1 stats_trust_lines_created=0 stats_trust_lines_removed=0 stats_trust_lines_updated=0
time="2020-10-28T16:43:22.775Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=65)" next_state="resume(latestSuccessfullyProcessedLedger=66)" pid=199 service=ingest
time="2020-10-28T16:43:22.777Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=66 ingest_sequence=67 pid=199 service=ingest

Full logs:

horizon-stdout---supervisor-XKwdGb.log.gz
stellar-core-stdout---supervisor-vcQO0t.log.gz

@@ -309,7 +310,8 @@ func (i *Test) waitForCore() {

// For some reason, we need to do some waiting blackmagic for Core to publish the checkpoint.
// Otherwise, it indefinitely stays in `"status" : [ "Publishing 1 queued checkpoints [63-63]: Waiting: prepare-snapshot" ]`
time.Sleep(5 * time.Second)
Contributor:

I merged a PR with a method that can help here. The problem with the code above (closing ledgers in a loop) is that manualclose is not blocking (TIL!). So running it 62 times doesn't mean 62 new ledgers will be closed.
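
(For illustration only, since the merged method isn't shown in this thread: a minimal sketch of one way to make manualclose effectively blocking, assuming Core's HTTP command port on 11626. It issues the command and then polls /info until info.ledger.num advances; closeLedger and ledgerNum are hypothetical names.)

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const coreURL = "http://localhost:11626" // Core's HTTP command port

// ledgerNum reads info.ledger.num from Core's /info endpoint.
func ledgerNum() (uint32, error) {
	var payload struct {
		Info struct {
			Ledger struct {
				Num uint32 `json:"num"`
			} `json:"ledger"`
		} `json:"info"`
	}
	resp, err := http.Get(coreURL + "/info")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&payload)
	return payload.Info.Ledger.Num, err
}

// closeLedger issues a manualclose and then polls /info until the ledger
// sequence actually advances, since manualclose itself returns before
// the close has happened.
func closeLedger() error {
	before, err := ledgerNum()
	if err != nil {
		return err
	}
	resp, err := http.Get(coreURL + "/manualclose")
	if err != nil {
		return err
	}
	resp.Body.Close()
	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		now, err := ledgerNum()
		if err != nil {
			return err
		}
		if now > before {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("ledger did not advance past %d", before)
}

func main() {
	// Closing 62 ledgers this way really closes 62 ledgers.
	for i := 0; i < 62; i++ {
		if err := closeLedger(); err != nil {
			fmt.Println(err)
			return
		}
	}
}
```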

Contributor (Author):

I don't think it will help in this case because, even if the ledgers keep advancing, the publishing status doesn't disappear unless you wait long enough before closing a ledger.

}
}

// For some reason, we need to do some waiting blackmagic for Core to publish the checkpoint.
Contributor:

It's not blackmagic; TIL, when I worked on the other PR, that Core publishes the checkpoint after closing the next ledger after the checkpoint ledger. Maybe it's just the behaviour in manual mode 🤷.

Contributor (Author):

The waiting times seem like blackmagic ...

Contributor (Author):

Also, the problem here isn't only advancing the ledger sequences but also getting Core to publish the checkpoint. I have observed that, unless you wait long enough before manually closing a ledger, Core remains in the checkpoint-publishing state even if the ledger sequences keep advancing.
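
One possible replacement for the fixed time.Sleep above is to wait explicitly on the publish status string quoted earlier ("Publishing 1 queued checkpoints ..."), with the caveat from this thread that the status may only clear after waiting long enough anyway. A minimal sketch; waitForPublish is a hypothetical helper:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// waitForPublish polls /info until no "Publishing N queued checkpoints"
// status message remains, i.e. the history checkpoint has been published.
func waitForPublish(infoURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var payload struct {
			Info struct {
				Status []string `json:"status"`
			} `json:"info"`
		}
		resp, err := http.Get(infoURL)
		if err != nil {
			return err
		}
		err = json.NewDecoder(resp.Body).Decode(&payload)
		resp.Body.Close()
		if err != nil {
			return err
		}
		publishing := false
		for _, s := range payload.Info.Status {
			if strings.Contains(s, "Publishing") {
				publishing = true
				break
			}
		}
		if !publishing {
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("timed out waiting for checkpoint publish")
}

func main() {
	fmt.Println(waitForPublish("http://localhost:11626/info", 30*time.Second))
}
```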

@2opremio (Contributor, Author) commented:

New log dump, including trace logs for Core (both captive and normal Core):
logs.tar.gz

@2opremio (Contributor, Author) commented Oct 29, 2020:

After a few conversations with @rokopt and @graydon, it turns out that Core's MANUAL_CLOSE wasn't originally intended for ...

  1. ... speeding through ledgers in a fresh network. Hence the unexpected magical waits I had to add when (successfully) adding MANUAL_CLOSE support to the non-captive-core integration tests (services/horizon: Update integration tests to use MANUAL_CLOSE, #3167).

  2. ... multi-node networks (in our case, one Core node and another Captive Core node). That's why Captive Core falls out of sync after the first checkpoint.

@graydon was nice enough to collect the issues I encountered at stellar/stellar-core#2778. The Core team will try to resolve them so that MANUAL_CLOSE fully supports our use case.

To move forward with the Captive Core integration tests, we have two options:

  1. Simply wait for the Core team to (hopefully) solve the issues at stellar/stellar-core#2778 (Synchronization issues when driving core with manual-close from horizon).

  2. Stop using two Core nodes for the integration tests. @bartekn already suggested doing this a while ago, but we decided to move forward with a two-node configuration to more closely resemble how Captive Core would connect to an existing network.
    Using a single Core node for both receiving transactions and emitting metadata would allow us to:

    • Emit metadata from ledger 1 instead of needing to wait for the first checkpoint. Having access to the first checkpoint is only required when joining a network, which isn't needed when using a single node. This means we wouldn't need to fast-forward to ledger 63.
    • Use MANUAL_CLOSE to speed up the tests without synchronization issues.

    However:

    • I don't think we can do that with an unmodified Captive Core ingestion backend, and I am not sure what changes it would require or how much work it would entail.
    • As I already mentioned, it would move us away from a real-world scenario, in which Captive Core joins an existing network.

Regardless, I think I am going to park this work for a while and clear my mind by working on higher-priority issues.

@2opremio force-pushed the fix-captive-core-integration-tests branch from 556c695 to b9be268 on October 29, 2020 19:49.
@2opremio force-pushed the fix-captive-core-integration-tests branch from b9be268 to 9723255 on October 29, 2020 19:52.
@2opremio closed this on Jan 4, 2021.