services/horizon: Allow captive core to start from any ledger. #3160

Merged: 3 commits, Oct 27, 2020


abuiles

@abuiles abuiles commented Oct 23, 2020

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Allow captive core to start from any ledger.

Why

Previously, online captive core could only start from a limited set of ledgers, because we always tried to start it from the previous checkpoint ledger.

This was probably problematic, since it wouldn't work for ledgers smaller than 63.
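
To make the constraint concrete, here is a minimal sketch of the checkpoint arithmetic. It assumes a checkpoint frequency of 64 ledgers, so that checkpoints fall on sequences 63, 127, 191, and so on. The helper names `isCheckpoint` and `prevCheckpoint` are hypothetical and are not the functions changed in this PR.

```go
package main

import "fmt"

// checkpointFrequency is an assumption for this sketch: a checkpoint is
// published every 64 ledgers, at sequences 63, 127, 191, ...
const checkpointFrequency = 64

// isCheckpoint reports whether seq is a checkpoint ledger.
func isCheckpoint(seq uint32) bool {
	return (seq+1)%checkpointFrequency == 0
}

// prevCheckpoint returns the last checkpoint strictly before seq.
// ok is false when no such checkpoint exists, i.e. for seq <= 63.
func prevCheckpoint(seq uint32) (cp uint32, ok bool) {
	if seq <= checkpointFrequency-1 {
		return 0, false
	}
	// n is the number of checkpoints with a sequence smaller than seq.
	n := seq / checkpointFrequency
	return n*checkpointFrequency - 1, true
}

func main() {
	for _, seq := range []uint32{1, 62, 63, 64, 129} {
		cp, ok := prevCheckpoint(seq)
		fmt.Printf("ledger %d: isCheckpoint=%v prevCheckpoint=%d exists=%v\n",
			seq, isCheckpoint(seq), cp, ok)
	}
}
```

For any ledger up to and including 63, `prevCheckpoint` reports that no earlier checkpoint exists, which is the situation described above.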

Known limitations

[TODO or N/A]

@cla-bot cla-bot bot added the cla: yes label Oct 23, 2020
@abuiles abuiles changed the base branch from master to captive-run-from October 23, 2020 17:51
@abuiles abuiles changed the title from "services/horizon: Allow online captive core to start from any ledger." to "services/horizon: Allow captive core to start from any ledger." Oct 23, 2020
@abuiles abuiles requested review from bartekn and 2opremio October 23, 2020 18:08
@abuiles abuiles marked this pull request as ready for review October 23, 2020 18:08
@abuiles abuiles force-pushed the start-from-any-ledger branch from 25ad1bf to eee405a Compare October 23, 2020 18:10
@@ -290,14 +290,18 @@ func (c *CaptiveStellarCore) runFromParams(from uint32) (runFrom uint32, ledgerH
 // ledger and then fast-forward to the desire ledger
 //
 //
-runFrom = roundDownToFirstReplayAfterCheckpointStart(from) - 1
+runFrom = from - 1

If from is 1, wouldn't this always fail, because runFrom would be 0, which is an invalid ledger sequence? While this will never happen in Horizon, I think this function can be called by other users. I think we need a table in the comment with return values for the following arguments: 1, 2, 3 (corner cases because core would start from ledger 2 instead of 1), 62, 63, 64 (corner cases around the first checkpoint), 126, 127, 128 (corner cases for the general case). We also need tests for each of these.

But before that, I think there's another problem with the approach here. Let's say that we need to restart Horizon (e.g. for a version upgrade) and from is 2 ledgers after a checkpoint; for simplicity, 127+2=129. Then we won't see ledger 128 (from-1) in the archives until the next checkpoint is closed, so in about 5 minutes. This is a long time. What we can do is start from the previous checkpoint (it should be in the archive already) and fast-forward from there. Surprisingly, I think this should simplify this function.
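
The suggestion above can be sketched as follows. This is only an illustration under stated assumptions, not the code that landed in this PR: it assumes checkpoints at sequences 63, 127, 191, ..., and the names `latestCheckpointAtOrBefore` and `planStart` are hypothetical. The values it prints are what this sketch yields, not the return values of `runFromParams`. Requests that have no usable checkpoint (from below 64) are returned as errors here, because how to handle them is exactly the set of corner cases discussed in this review.

```go
package main

import (
	"errors"
	"fmt"
)

// latestCheckpointAtOrBefore returns the newest checkpoint <= seq,
// assuming checkpoints at 63, 127, 191, ... (hypothetical helper).
func latestCheckpointAtOrBefore(seq uint32) (uint32, bool) {
	if seq < 63 {
		return 0, false
	}
	return (seq+1)/64*64 - 1, true
}

// planStart picks a checkpoint that should already be in the archive and
// reports how many ledgers must be fast-forwarded before reaching from.
func planStart(from uint32) (checkpoint, fastForward uint32, err error) {
	if from < 2 {
		// from-1 would be 0, which is not a valid ledger sequence.
		return 0, 0, errors.New("from must be at least 2")
	}
	cp, ok := latestCheckpointAtOrBefore(from - 1)
	if !ok {
		return 0, 0, errors.New("no checkpoint at or before from-1")
	}
	return cp, from - 1 - cp, nil
}

func main() {
	for _, from := range []uint32{1, 2, 3, 62, 63, 64, 126, 127, 128, 129} {
		cp, ff, err := planStart(from)
		if err != nil {
			fmt.Printf("from=%d: %v\n", from, err)
			continue
		}
		fmt.Printf("from=%d: start at checkpoint %d, fast-forward %d ledgers\n", from, cp, ff)
	}
}
```

With from=129 the sketch starts at checkpoint 127 and fast-forwards one ledger (128), instead of waiting for ledger 128 to appear in the archive.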

Finally, we still have the trust issue: it's possible that a bad actor has changed the archives, and we'll learn about it only when core errors. And, as Nicolas mentioned internally, that can possibly be after hours of catchup.

To sum it up I think we should:

  1. Fix the wait-for-checkpoint issue I explained above because this will be a problem anyway.
  2. Add more tests to check corner cases as above (1, 2, 3, 62, 63, 64, 126, 127, 128).
  3. In another PR, let's allow CaptiveCore to get ledger hashes we know about. This can be done by passing a store that implements a known interface with a method to return a hash based on a sequence number (@tamirms's idea).
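
As a rough illustration of item 3, a store of known ledger hashes could look like the sketch below. Everything here is hypothetical: the interface name `LedgerHashStore`, its method `GetLedgerHash`, the in-memory implementation, and the `verifyHash` helper are assumptions made for this example and are not an API defined in this PR.

```go
package main

import "fmt"

// LedgerHashStore is a hypothetical interface for a source of ledger
// hashes that the caller already trusts (for example, its own database).
type LedgerHashStore interface {
	// GetLedgerHash returns the trusted hash for seq. ok is false when
	// the store has no entry for that sequence.
	GetLedgerHash(seq uint32) (hash string, ok bool, err error)
}

// memoryStore is a trivial in-memory implementation used for the example.
type memoryStore map[uint32]string

func (m memoryStore) GetLedgerHash(seq uint32) (string, bool, error) {
	h, ok := m[seq]
	return h, ok, nil
}

// verifyHash compares a hash read from a history archive with the trusted
// one, so that a tampered archive is detected before a long catchup.
func verifyHash(store LedgerHashStore, seq uint32, archiveHash string) error {
	trusted, ok, err := store.GetLedgerHash(seq)
	if err != nil {
		return err
	}
	if !ok {
		// Nothing to compare against; the caller decides what to do.
		return nil
	}
	if trusted != archiveHash {
		return fmt.Errorf("ledger %d: archive hash %q does not match trusted hash %q",
			seq, archiveHash, trusted)
	}
	return nil
}

func main() {
	store := memoryStore{127: "aaaa"} // placeholder hash, not real data
	fmt.Println(verifyHash(store, 127, "aaaa"))
	fmt.Println(verifyHash(store, 127, "bbbb"))
}
```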


Implemented 1 and 2 in 8adeebf. Also added a small tool to run tests against a running standalone network to ensure the params are correct.

@abuiles abuiles force-pushed the start-from-any-ledger branch from eee405a to 560a216 Compare October 26, 2020 21:37
@2opremio

2opremio commented Oct 27, 2020

Thanks!

This seems to fix #3157! (I tested it using #3144.)

However, I find it confusing that the stats obtained from GET / don't reflect the log messages. For instance, even if Horizon was outputting this:

time="2020-10-27T14:36:46.525Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=61)" next_state="resume(latestSuccessfullyProcessedLedger=61)" pid=198 service=expingest
time="2020-10-27T14:36:46.533Z" level=info msg="Waiting for ledger to be available in stellar-core" core_sequence=61 ingest_sequence=62 pid=198 service=expingest

The root stats are still at 0 (including the CoreSequence and IngestSequence):

Screenshot 2020-10-27 at 15 42 13

I would expect the IngestSequence and CoreSequence to be consistent in both the log messages and the root endpoint.

@bartekn

bartekn commented Oct 27, 2020

@2opremio I think I ran into this while working on this PR but haven't debugged it much yet. I suspect that changes in #3106 broke something. If you /bin/bash the container and run curl localhost:8000 there, you'll see correct values. So it looks like two Horizons are running? @tamirms can you confirm/take a look?

@bartekn bartekn linked an issue Oct 27, 2020 that may be closed by this pull request
@2opremio

2opremio commented Oct 27, 2020

It's strange, because after ledger 64 is reached (according to the logs) the CoreSequence I obtain from the integration tests is correct.

@2opremio

> If you /bin/bash the container and run curl localhost:8000 there you'll see correct values.

True.

$ docker exec -ti horizon-integration curl localhost:8000 | grep ledger
    "ledger": {
      "href": "http://localhost:8000/ledger/{sequence}",
    "ledgers": {
      "href": "http://localhost:8000/ledgers{?cursor,limit,order}",
  "ingest_latest_ledger": 18,
  "history_latest_ledger": 18,
  "history_elder_ledger": 2,
  "core_latest_ledger": 18,
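
The same check can be scripted. The sketch below decodes the ledger counters from a root response and flags the "everything is still 0" symptom. The JSON field names are taken from the output above; the struct, function names, and the trimmed sample body are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rootLedgers holds only the counters of interest from the root response.
// The type itself is hypothetical; the field names come from the output.
type rootLedgers struct {
	IngestLatest  uint32 `json:"ingest_latest_ledger"`
	HistoryLatest uint32 `json:"history_latest_ledger"`
	CoreLatest    uint32 `json:"core_latest_ledger"`
}

// parseRoot decodes the counters and ignores every other field.
func parseRoot(body []byte) (rootLedgers, error) {
	var r rootLedgers
	err := json.Unmarshal(body, &r)
	return r, err
}

// allZero reports the symptom described in this thread: every counter is 0
// even though the logs show ingestion making progress.
func (r rootLedgers) allZero() bool {
	return r.IngestLatest == 0 && r.HistoryLatest == 0 && r.CoreLatest == 0
}

func main() {
	// Trimmed sample built from the values shown above.
	sample := []byte(`{"ingest_latest_ledger": 18, "history_latest_ledger": 18,
		"history_elder_ledger": 2, "core_latest_ledger": 18}`)
	r, err := parseRoot(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("ingest=%d core=%d allZero=%v\n", r.IngestLatest, r.CoreLatest, r.allZero())
}
```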

@bartekn

bartekn commented Oct 27, 2020

@2opremio I noticed there is a new env variable: HORIZON_INTEGRATION_ENABLE_CAPTIVE_CORE. I haven't checked it but maybe it will fix it.

I'm going to approve this PR, but please 👍 too because I partially worked on this. And maybe let's move the discussion about the container issue to a new issue.

@2opremio

2opremio commented Oct 27, 2020

OK, just for the record. This seems to be the problem:

$ docker ps
CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                    NAMES
9a1d35cf918b        stellar/quickstart:testing2   "/start --standalone…"   2 minutes ago       Up 2 minutes        0.0.0.0:32797->1570/tcp, 0.0.0.0:32796->5432/tcp, 0.0.0.0:32795->6060/tcp, 0.0.0.0:32794->8000/tcp, 0.0.0.0:32793->11625/tcp, 0.0.0.0:32792->11626/tcp   horizon-integration

I think it should be 0.0.0.0:8000->8000/tcp instead.
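
To check which host port a container port ends up on, the PORTS column can be parsed directly. The sketch below is a small standalone helper, with the hypothetical name `hostPortFor`, that reads a string in the format shown by docker ps above; it does not call Docker itself.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hostPortFor scans a docker ps PORTS string such as
// "0.0.0.0:32794->8000/tcp, 0.0.0.0:32793->11625/tcp" and returns the host
// port mapped to containerPort. ok is false when no mapping is found.
func hostPortFor(ports string, containerPort int) (hostPort int, ok bool) {
	for _, entry := range strings.Split(ports, ",") {
		host, container, found := strings.Cut(strings.TrimSpace(entry), "->")
		if !found {
			continue
		}
		// Drop the protocol suffix, e.g. "/tcp".
		portStr, _, _ := strings.Cut(container, "/")
		cp, err := strconv.Atoi(portStr)
		if err != nil || cp != containerPort {
			continue
		}
		// The host side looks like "0.0.0.0:32794"; keep what follows
		// the last colon.
		hp, err := strconv.Atoi(host[strings.LastIndex(host, ":")+1:])
		if err != nil {
			continue
		}
		return hp, true
	}
	return 0, false
}

func main() {
	ports := "0.0.0.0:32795->6060/tcp, 0.0.0.0:32794->8000/tcp, 0.0.0.0:32793->11625/tcp"
	if hp, ok := hostPortFor(ports, 8000); ok {
		fmt.Printf("container port 8000 is published on host port %d\n", hp)
	}
}
```

With the mapping from the output above, container port 8000 is published on host port 32794, so a request to localhost:8000 on the host does not go through that mapping.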

@bartekn bartekn merged commit b5619d6 into stellar:captive-run-from Oct 27, 2020
Successfully merging this pull request may close these issues.

Captive core's online mode fails to start at ledger 1
3 participants