Improve the detection of captive-core failures #3158

Closed
2opremio opened this issue Oct 23, 2020 · 1 comment · Fixed by #3187

2opremio commented Oct 23, 2020

While adding captive-core integration tests (#3153), I found that captive core's run-from execution failed but the failure went unnoticed.

Most relevant log lines:

GA76T [default INFO] *
GA76T [default INFO] * Target ledger 1 is not newer than last closed ledger 1 - nothing to do
GA76T [default INFO] * If you really want to catchup to 1 run stellar-core new-db
GA76T [default INFO] *
[...]
time="2020-10-23T12:28:02.726Z" level=info msg="Range prepared" duration=1.01071565 ledger=2 pid=196 service=ingest

So Horizon seems satisfied with Core's preparation of the range. However, after that we get:

time="2020-10-23T12:28:04.740Z" level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=1)" error="Error running processors on ledger: Protocol version not supported: Error getting ledger: unmarshalling framed LedgerCloseMeta: unmarshalling XDR frame header: xdr:DecodeUint: EOF while decoding 4 bytes - read: '[]'" next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest

My guess is that captive-core returned a 0 exit code. However, it seems like captive core's output was empty, and we could have detected that.

Furthermore, I think that the execution of captive core (with and without HTTP) could probably be more defensive.
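The kind of defensive check being suggested could look something like the sketch below: treat an empty output stream as a failure even when the subprocess exits 0. This is a hypothetical illustration using a generic command, not the actual Horizon/ledgerbackend code; `runChecked` is an invented helper.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runChecked runs a command, capturing stdout, and treats both a
// non-zero exit code and empty output as failures. A command that is
// expected to stream ledger metadata should never exit 0 with no
// output, so that case is surfaced as an error too.
func runChecked(name string, args ...string) ([]byte, error) {
	var out bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("%s exited with error: %w", name, err)
	}
	if out.Len() == 0 {
		return nil, fmt.Errorf("%s exited 0 but produced no output", name)
	}
	return out.Bytes(), nil
}

func main() {
	// "true" exits 0 with no output, so the empty-output check fires.
	if _, err := runChecked("true"); err != nil {
		fmt.Println(err)
	}
}
```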

Full log context (using `run-from`)

time="2020-10-23T12:27:59.708Z" level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=1)" error="error preparing range: opening subprocess: error calculating ledger and hash for stelar-core run: error trying to read ledger header 1 from HAS: error opening ledger stream: Bad HTTP response '404 File not found' for GET 'http://localhost:1570/ledger/00/00/00/ledger-0000003f.xdr.gz'" next_state=start pid=196 service=ingest
time="2020-10-23T12:28:00.708Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1)" next_state=start pid=196 service=ingest
time="2020-10-23T12:28:00.711Z" level=info msg="Resuming ingestion system from last processed ledger..." last_ledger=1 pid=196 service=ingest
time="2020-10-23T12:28:01.711Z" level=info msg="Ingestion system state machine transition" current_state=start next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:01.715Z" level=info msg="Released ingestion lock to prepare range" pid=196 service=ingest
time="2020-10-23T12:28:01.715Z" level=info msg="Preparing range" ledger=2 pid=196 service=ingest
<startup> [default INFO] Config from /opt/stellar-default/standalone/core/etc/stellar-captive-core.cfg
<startup> [default INFO] Generated QUORUM_SET: {
   "t" : 1,
   "v" : [ "standalone_1" ]
}

GA76T [default INFO] Starting stellar-core stellar-core 15.0.0~rc1 (b0de6c17bb61f618359e07f8e4a76f6b050af59a)
GA76T [Database INFO] Connecting to: sqlite3://:memory:
GA76T [Ledger INFO] Streaming metadata to file descriptor 3
GA76T [SCP INFO] LocalNode::LocalNode@GA76T qSet: d05a3b
GA76T [default INFO] Listening on 127.0.0.1:21626 for HTTP requests
GA76T [default INFO] * 
GA76T [default INFO] * The database has been initialized
GA76T [default INFO] * 
GA76T [Database INFO] Applying DB schema upgrade to version 13
GA76T [Database INFO] DB schema is in current version
GA76T [Ledger INFO] Established genesis ledger, closing
GA76T [Ledger INFO] Root account seed: SC5O7VZUXDJ6JBDSZ74DSERXL7W3Y5LTOAMRF7RQRL3TAGAPS7LUVG3L
GA76T [History INFO] Archive 'standalone_1' has 'get' command only, will not be written
GA76T [History WARNING] No writable archives configured, history will not be written.
GA76T [Ledger INFO] Starting up application
GA76T [default INFO] Connection effective settings:
GA76T [default INFO] TARGET_PEER_CONNECTIONS: 8
GA76T [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
GA76T [default INFO] MAX_PENDING_CONNECTIONS: 500
GA76T [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 56
GA76T [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 444
GA76T [Ledger INFO] Last closed ledger (LCL) hash is 76df1b9ff762a7a8af1f3c88549fe742e488e80fde561e34b1b6115bdac8d397
GA76T [Ledger INFO] LCL is genesis: [seq=1, hash=76df1b]
GA76T [Ledger INFO] Assumed bucket-state for LCL: [seq=1, hash=76df1b]
GA76T [Ledger INFO] Changing state LM_BOOTING_STATE -> LM_CATCHING_UP_STATE
GA76T [default INFO] *
GA76T [default INFO] * Target ledger 1 is not newer than last closed ledger 1 - nothing to do
GA76T [default INFO] * If you really want to catchup to 1 run stellar-core new-db
GA76T [default INFO] *
GA76T [default INFO] Application destructing
GA76T [default INFO] Application destroyed
time="2020-10-23T12:28:02.726Z" level=info msg="Range prepared" duration=1.01071565 ledger=2 pid=196 service=ingest
time="2020-10-23T12:28:03.728Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1)" next_state=start pid=196 service=ingest
time="2020-10-23T12:28:03.732Z" level=info msg="Resuming ingestion system from last processed ledger..." last_ledger=1 pid=196 service=ingest
time="2020-10-23T12:28:04.734Z" level=info msg="Ingestion system state machine transition" current_state=start next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:04.739Z" level=info msg="Processing ledger" commit=true ledger=true pid=196 sequence=2 service=ingest state=true
time="2020-10-23T12:28:04.739Z" level=info msg="FONS FONS GetLedger(2)" pid=196
time="2020-10-23T12:28:04.740Z" level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=1)" error="Error running processors on ledger: Protocol version not supported: Error getting ledger: unmarshalling framed LedgerCloseMeta: unmarshalling XDR frame header: xdr:DecodeUint: EOF while decoding 4 bytes - read: '[]'" next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:05.741Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1)" next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:05.746Z" level=info msg="Released ingestion lock to prepare range" pid=196 service=ingest
time="2020-10-23T12:28:05.746Z" level=info msg="Preparing range" ledger=2 pid=196 service=ingest
<startup> [default INFO] Config from /opt/stellar-default/standalone/core/etc/stellar-captive-core.cfg
<startup> [default INFO] Generated QUORUM_SET: {
   "t" : 1,
   "v" : [ "standalone_1" ]
}

GDFMI [default INFO] Starting stellar-core stellar-core 15.0.0~rc1 (b0de6c17bb61f618359e07f8e4a76f6b050af59a)
GDFMI [Database INFO] Connecting to: sqlite3://:memory:
GDFMI [Ledger INFO] Streaming metadata to file descriptor 3
GDFMI [SCP INFO] LocalNode::LocalNode@GDFMI qSet: d05a3b
GDFMI [default INFO] Listening on 127.0.0.1:21626 for HTTP requests
GDFMI [default INFO] * 
GDFMI [default INFO] * The database has been initialized
GDFMI [default INFO] * 
GDFMI [Database INFO] Applying DB schema upgrade to version 13
GDFMI [Database INFO] DB schema is in current version
GDFMI [Ledger INFO] Established genesis ledger, closing
GDFMI [Ledger INFO] Root account seed: SC5O7VZUXDJ6JBDSZ74DSERXL7W3Y5LTOAMRF7RQRL3TAGAPS7LUVG3L
GDFMI [History INFO] Archive 'standalone_1' has 'get' command only, will not be written
GDFMI [History WARNING] No writable archives configured, history will not be written.
GDFMI [Ledger INFO] Starting up application
GDFMI [default INFO] Connection effective settings:
GDFMI [default INFO] TARGET_PEER_CONNECTIONS: 8
GDFMI [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
GDFMI [default INFO] MAX_PENDING_CONNECTIONS: 500
GDFMI [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 56
GDFMI [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 444
GDFMI [Ledger INFO] Last closed ledger (LCL) hash is 76df1b9ff762a7a8af1f3c88549fe742e488e80fde561e34b1b6115bdac8d397
GDFMI [Ledger INFO] LCL is genesis: [seq=1, hash=76df1b]
GDFMI [Ledger INFO] Assumed bucket-state for LCL: [seq=1, hash=76df1b]
GDFMI [Ledger INFO] Changing state LM_BOOTING_STATE -> LM_CATCHING_UP_STATE
GDFMI [default INFO] *
GDFMI [default INFO] * Target ledger 1 is not newer than last closed ledger 1 - nothing to do
GDFMI [default INFO] * If you really want to catchup to 1 run stellar-core new-db
GDFMI [default INFO] *
GDFMI [default INFO] Application destructing
GDFMI [default INFO] Application destroyed
panic: send on closed channel

goroutine 11824 [running]:
github.com/stellar/go/ingest/ledgerbackend.(*stellarCoreRunner).start.func1(0xc00080e630)
	/Users/fons/stellar-go/ingest/ledgerbackend/stellar_core_runner_posix.go:41 +0x5f
created by github.com/stellar/go/ingest/ledgerbackend.(*stellarCoreRunner).start
	/Users/fons/stellar-go/ingest/ledgerbackend/stellar_core_runner_posix.go:40 +0x194
starting horizon...
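The panic above is a goroutine sending on a channel that the shutdown path has already closed. A minimal, self-contained illustration of that failure mode and one common guard (selecting on a dedicated done channel so the writer stops instead of sending) is below; this is generic Go, not the actual Horizon fix:

```go
package main

import "fmt"

// send tries to deliver v on ch, but gives up if done has been
// closed. Unlike a bare `ch <- v`, this cannot panic on shutdown:
// the writer observes done and drops the value instead.
func send(ch chan<- int, done <-chan struct{}, v int) bool {
	select {
	case ch <- v:
		return true
	case <-done:
		// Shutdown requested: skip the send.
		return false
	}
}

func main() {
	ch := make(chan int)
	done := make(chan struct{})
	close(done) // simulate shutdown happening before the send
	fmt.Println(send(ch, done, 1))
}
```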

bartekn commented Oct 28, 2020

I was about to create another issue for this. Here's a small repro script I wrote (based on captive-core-start-tester):

package main

import (
	"fmt"

	"github.com/stellar/go/ingest/ledgerbackend"
	"github.com/stellar/go/network"
)

func main() {
	check(6218)
}

func check(ledger uint32) bool {
	c, err := ledgerbackend.NewCaptive(
		"stellar-core",
		"stellar-core-testnet.cfg",
		network.TestNetworkPassphrase,
		[]string{"http://history.stellar.org/prd/core-testnet/core_testnet_001/"},
	)
	if err != nil {
		panic(err)
	}
	defer c.Close()

	err = c.PrepareRange(ledgerbackend.UnboundedRange(ledger))
	if err != nil {
		fmt.Println(err)
		return false
	}

	ok, meta, err := c.GetLedger(ledger)
	if err != nil {
		fmt.Println(err)
		return false
	}

	if !ok {
		fmt.Println("no ledger")
		return false
	}

	if meta.LedgerSequence() != ledger {
		fmt.Println("wrong ledger", meta.LedgerSequence())
		return false
	}

	fmt.Println(ledger, "ok")
	return true
}

After starting it, find the stellar-core process with ps and kill it. You should still receive an error, but with a message like Stellar-Core exited unexpectedly.

@bartekn bartekn self-assigned this Nov 3, 2020
bartekn added a commit that referenced this issue Nov 9, 2020
…ore backend (#3187)

This commit introduces `bufferedLedgerMetaReader`  which decouples
buffering and unmarshaling from `stellarCoreRunner` and
`CaptiveStellarCore`.

`bufferedLedgerMetaReader` fixes multiple issues:

* It fixes #3132 by increasing internal buffers' sizes to hold more
ledgers. It makes catchup code much faster.
* It fixes #3158 - `bufferedLedgerMetaReader` allowed rewriting shutdown
code to a much simpler version. Now `bufferedLedgerMetaReader` and
`CaptiveStellarCore` listen to a single shutdown signal:
`stellarCoreRunner.getProcessExitChan()`. When Stellar-Core process
terminates `bufferedLedgerMetaReader.Start` go routine will stop and
`CaptiveStellarCore` will return a user friendly error in `PrepareRange`
and `GetLedger` methods. When `CaptiveStellarCore.Close()` is called, it
kills the Stellar-Core process, triggering the shutdown code explained
above.
* It decouples buffering and unmarshaling into a single struct. This makes
`stellarCoreRunner` and `CaptiveStellarCore` simpler.
* It fixes a possible OOM issue when the network closes a series of large
ledgers. In such a case `bufferedLedgerMetaReader` will wait for the buffer
to be consumed before reading more ledgers into memory, preventing
increased memory usage.
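The shutdown pattern the commit message describes (readers selecting on both the metadata stream and a single process-exit signal) could be sketched roughly as below. The type and method names here are illustrative, not the actual ledgerbackend API:

```go
package main

import (
	"errors"
	"fmt"
)

// reader sketches a buffered metadata reader that watches a single
// process-exit channel. When the core process dies, the exit channel
// is closed and callers get a friendly error instead of a confusing
// XDR EOF from a half-read stream.
type reader struct {
	metaCh chan []byte   // framed LedgerCloseMeta bytes
	exitCh chan struct{} // closed when the core process terminates
}

func (r *reader) next() ([]byte, error) {
	select {
	case meta := <-r.metaCh:
		return meta, nil
	case <-r.exitCh:
		return nil, errors.New("stellar-core exited unexpectedly")
	}
}

func main() {
	r := &reader{metaCh: make(chan []byte), exitCh: make(chan struct{})}
	close(r.exitCh) // simulate the process dying
	_, err := r.next()
	fmt.Println(err)
}
```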