services/horizon: Allow captive core to start from any ledger. #3160

abuiles · 2020-10-23T17:50:45Z

PR Checklist

PR Structure

This PR has reasonably narrow scope (if not, break it down into smaller PRs).
This PR avoids mixing refactoring changes with feature changes (split into two PRs
otherwise).
This PR's title starts with name of package that is most changed in the PR, ex.
services/friendbot, or all or doc if the changes are broad or impact many
packages.

Thoroughness

This PR adds tests for the most critical parts of the new functionality or fixes.
I've updated any docs (developer docs, .md
files, etc... affected by this change). Take a look in the docs folder for a given service,
like this one.

Release planning

I've updated the relevant CHANGELOG (here for Horizon) if
needed with deprecations, added features, breaking changes, and DB schema changes.
I've decided if this PR requires a new major/minor version according to
semver, or if it's mainly a patch change. The PR is targeted at the next
release branch if it's not a patch change.

What

Allow captive core to start from any ledger.

Why

Previously, online captive core could only start from a limited set of ledgers, because we always tried to start it from the previous checkpoint ledger.

This was problematic, since it did not work for ledgers smaller than 63.

Known limitations

[TODO or N/A]

bartekn · 2020-10-26T12:12:12Z

ingest/ledgerbackend/captive_core_backend.go

@@ -290,14 +290,18 @@ func (c *CaptiveStellarCore) runFromParams(from uint32) (runFrom uint32, ledgerH
 		// ledger and then fast-forward to the desire ledger
 		//
 		//
-		runFrom = roundDownToFirstReplayAfterCheckpointStart(from) - 1
+		runFrom = from - 1


If from is 1, wouldn't this always fail, because runFrom will be 0, which is an invalid ledger sequence? While this will never happen in Horizon, I think it can be called by other users. I think we need a table in the comment with return values for the following arguments: 1, 2, 3 (corner cases because core would start from ledger 2 instead of 1), 62, 63, 64 (corner cases around the first checkpoint), 126, 127, 128 (corner cases for the general case). We also need tests for each of these.
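To make the concern concrete, here is a minimal sketch of the `from - 1` arithmetic from the diff, evaluated over the corner cases listed above. `naiveRunFrom` is a hypothetical helper name, not a function in the repository, and the sketch assumes (as this comment does) that sequence 0 is not a valid ledger.

```go
package main

import (
	"errors"
	"fmt"
)

// naiveRunFrom is a hypothetical helper mirroring `runFrom = from - 1`
// from the diff. It rejects results that are not valid ledger sequences.
func naiveRunFrom(from uint32) (uint32, error) {
	if from == 0 {
		// from - 1 would wrap around to the maximum uint32 value.
		return 0, errors.New("from must be a positive ledger sequence")
	}
	runFrom := from - 1
	if runFrom == 0 {
		// Assumption taken from the review comment: 0 is an invalid sequence.
		return 0, errors.New("runFrom would be 0, which is not a valid ledger sequence")
	}
	return runFrom, nil
}

func main() {
	// Corner cases named in the review comment.
	for _, from := range []uint32{1, 2, 3, 62, 63, 64, 126, 127, 128} {
		runFrom, err := naiveRunFrom(from)
		if err != nil {
			fmt.Printf("from=%d: error: %v\n", from, err)
			continue
		}
		fmt.Printf("from=%d: runFrom=%d\n", from, runFrom)
	}
}
```

Only from=1 is rejected among the listed values; the others simply yield from-1, which says nothing yet about whether that ledger is available in the archives.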

But before that, I think there's another problem with the approach here. Let's say that we need to restart Horizon (e.g. for a version upgrade) and from is 2 ledgers after a checkpoint; for simplicity: 127+2=129. Then we won't see ledger 128 (from-1) in the archives until the next checkpoint is closed, so only in about 5 minutes. This is a long time. What we can do instead is start from the previous checkpoint (it should already be in the archive) and fast-forward from there. Surprisingly, I think this should simplify this function.

Finally, we still have the trust issue: it's possible that a bad actor has changed the archives and we'll learn about it only when core errors. And, as Nicolas mentioned internally, that can be after possibly hours of catchup.
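The "start from the previous checkpoint" idea can be sketched as follows. This is an illustration only: `checkpointAtOrBefore` and `checkpointFrequency` are hypothetical names, and the code assumes checkpoints fall on ledgers 63, 127, 191, and so on (one every 64 ledgers), which matches the corner cases discussed here. Whether the caller should pass from or from-1 is left open.

```go
package main

import "fmt"

// checkpointFrequency is an assumption: one checkpoint every 64 ledgers,
// so checkpoint ledgers are 63, 127, 191, ...
const checkpointFrequency = 64

// checkpointAtOrBefore is a hypothetical helper. It returns the latest
// checkpoint ledger that is <= seq, or ok=false when seq precedes the
// first checkpoint (63).
func checkpointAtOrBefore(seq uint32) (checkpoint uint32, ok bool) {
	if seq < checkpointFrequency-1 {
		return 0, false
	}
	return ((seq+1)/checkpointFrequency)*checkpointFrequency - 1, true
}

func main() {
	for _, seq := range []uint32{62, 63, 64, 126, 127, 128, 129} {
		if cp, ok := checkpointAtOrBefore(seq); ok {
			fmt.Printf("seq=%d: fast-forward from checkpoint %d\n", seq, cp)
		} else {
			fmt.Printf("seq=%d: no checkpoint published yet\n", seq)
		}
	}
}
```

For the restart example above (from=129), the helper returns 127, a ledger that should already be in the archive, so there is no need to wait for the next checkpoint to close.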

To sum it up I think we should:

1. Fix the wait-for-checkpoint issue I explained above, because this will be a problem anyway.

2. Add more tests to check corner cases as above (1, 2, 3, 62, 63, 64, 126, 127, 128).

3. In another PR, let's allow CaptiveCore to get ledger hashes we know about. This can be done by passing a store that implements a known interface with a method that returns a hash based on a sequence number (@tamirms's idea).

Implemented 1 and 2 in 8adeebf. Also added a small tool to run tests against a running standalone network to ensure the params are correct.
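For the third item, which is deferred to another PR, one possible shape of such a store is sketched below. Everything here is hypothetical: `LedgerHashStore`, `GetLedgerHash`, `verifyLedgerHash`, and the in-memory `mapHashStore` are illustrative names, not the API that was eventually adopted.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrHashMismatch signals that the archive disagrees with a trusted hash.
var ErrHashMismatch = errors.New("ledger hash does not match trusted hash")

// LedgerHashStore is a hypothetical interface: given a sequence number,
// it returns the ledger hash the caller already trusts, if any.
type LedgerHashStore interface {
	GetLedgerHash(seq uint32) (hash string, ok bool, err error)
}

// mapHashStore is an in-memory implementation used only for illustration.
type mapHashStore map[uint32]string

func (m mapHashStore) GetLedgerHash(seq uint32) (string, bool, error) {
	h, ok := m[seq]
	return h, ok, nil
}

// verifyLedgerHash compares a hash read from an archive with the trusted
// one. Sequences unknown to the store are accepted in this sketch.
func verifyLedgerHash(store LedgerHashStore, seq uint32, archiveHash string) error {
	trusted, ok, err := store.GetLedgerHash(seq)
	if err != nil {
		return fmt.Errorf("looking up ledger %d: %w", seq, err)
	}
	if ok && trusted != archiveHash {
		return ErrHashMismatch
	}
	return nil
}

func main() {
	store := mapHashStore{127: "abc"} // placeholder hash, not real data
	fmt.Println(verifyLedgerHash(store, 127, "abc"))
	fmt.Println(verifyLedgerHash(store, 127, "def"))
}
```

Checking the hash before starting core would surface a tampered archive up front, rather than after a long catchup.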

2opremio · 2020-10-27T14:43:29Z

Thanks!

This seems to fix #3157! (I tested it using #3144.)

However, I find it confusing that the stats obtained from GET / don't reflect the log messages. For instance, even though Horizon was outputting this:

time="2020-10-27T14:36:46.525Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=61)" next_state="resume(latestSuccessfullyProcessedLedger=61)" pid=198 service=expingest
time="2020-10-27T14:36:46.533Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=61 ingest_sequence=62 pid=198 service=expingest

The root stats are still at 0 (including the CoreSequence and IngestSequence):

I would expect the IngestSequence and CoreSequence to be consistent in both the log messages and the root endpoint.

bartekn · 2020-10-27T14:53:59Z

@2opremio I think I ran into this while working on this PR but haven't debugged it much yet. I suspect that changes in #3106 broke something. If you /bin/bash the container and run curl localhost:8000 there you'll see correct values. So it looks like two Horizons are running? @tamirms can you confirm/take a look?

2opremio · 2020-10-27T15:01:21Z

It's strange, because after ledger 64 is reached (according to the logs) the CoreSequence I obtain from the integration tests is correct.

2opremio · 2020-10-27T15:16:20Z

If you /bin/bash the container and run curl localhost:8000 there you'll see correct values.

True.

$ docker exec -ti horizon-integration curl localhost:8000 | grep ledger
    "ledger": {
      "href": "http://localhost:8000/ledger/{sequence}",
    "ledgers": {
      "href": "http://localhost:8000/ledgers{?cursor,limit,order}",
  "ingest_latest_ledger": 18,
  "history_latest_ledger": 18,
  "history_elder_ledger": 2,
  "core_latest_ledger": 18,

bartekn · 2020-10-27T15:20:02Z

@2opremio I noticed there is a new env variable: HORIZON_INTEGRATION_ENABLE_CAPTIVE_CORE. I haven't checked it but maybe it will fix it.

I'm going to approve this PR, but please 👍 too because I partially worked on it. And maybe let's move further discussion of the container issue to a new issue.

2opremio · 2020-10-27T15:24:48Z

OK, just for the record. This seems to be the problem:

$ docker ps
CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                    NAMES
9a1d35cf918b        stellar/quickstart:testing2   "/start --standalone…"   2 minutes ago       Up 2 minutes        0.0.0.0:32797->1570/tcp, 0.0.0.0:32796->5432/tcp, 0.0.0.0:32795->6060/tcp, 0.0.0.0:32794->8000/tcp, 0.0.0.0:32793->11625/tcp, 0.0.0.0:32792->11626/tcp   horizon-integration

I think it should be 0.0.0.0:8000->8000/tcp instead

cla-bot bot added the cla: yes label Oct 23, 2020

abuiles changed the base branch from master to captive-run-from October 23, 2020 17:51

abuiles changed the title ~~services/horizon: Allow online captive core to start from any ledger.~~ services/horizon: Allow captive core to start from any ledger. Oct 23, 2020

abuiles requested review from bartekn and 2opremio October 23, 2020 18:08

abuiles marked this pull request as ready for review October 23, 2020 18:08

abuiles force-pushed the start-from-any-ledger branch from 25ad1bf to eee405a Compare October 23, 2020 18:10

abuiles mentioned this pull request Oct 23, 2020

Captive Core's run-from implementation doesn't work for ledgers < 63 #3155

Closed

bartekn reviewed Oct 26, 2020

abuiles force-pushed the captive-run-from branch from 6bf5b55 to 8b3a263 Compare October 26, 2020 17:59

Allow captive core backend to start at any ledger.

560a216

abuiles force-pushed the start-from-any-ledger branch from eee405a to 560a216 Compare October 26, 2020 21:37

Fixes

8adeebf

bartekn linked an issue Oct 27, 2020 that may be closed by this pull request

Captive core's online mode fails to start at ledger 1 #3157

Closed

Check if err.

5a447a6

bartekn approved these changes Oct 27, 2020

2opremio approved these changes Oct 27, 2020

bartekn merged commit b5619d6 into stellar:captive-run-from Oct 27, 2020
