Improve the detection of captive-core failures #3158

Closed
2opremio opened this issue Oct 23, 2020 · 1 comment · Fixed by #3187

2opremio commented Oct 23, 2020

While adding captive-core integration tests (#3153), I found that captive core's run-from execution failed but the failure went unnoticed.

Most relevant log lines:

GA76T [default INFO] *
GA76T [default INFO] * Target ledger 1 is not newer than last closed ledger 1 - nothing to do
GA76T [default INFO] * If you really want to catchup to 1 run stellar-core new-db
GA76T [default INFO] *
[...]
time="2020-10-23T12:28:02.726Z" level=info msg="Range prepared" duration=1.01071565 ledger=2 pid=196 service=ingest

So Horizon seems satisfied with Core's preparation of the range. However, after that we get:

time="2020-10-23T12:28:04.740Z" level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=1)" error="Error running processors on ledger: Protocol version not supported: Error getting ledger: unmarshalling framed LedgerCloseMeta: unmarshalling XDR frame header: xdr:DecodeUint: EOF while decoding 4 bytes - read: '[]'" next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest

My guess is that captive-core returned a 0 exit code. However, it seems like captive core's output was empty, and we could have detected that.

Furthermore, I think that the execution of captive core (with and without HTTP) could probably be more defensive.
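The kind of defensive check being suggested could look something like the sketch below: treat an empty output stream as a failure even when the subprocess exits 0. This is a hypothetical illustration using a generic command, not the actual Horizon/ledgerbackend code; `runChecked` is an invented helper.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runChecked runs a command, capturing stdout, and treats both a
// non-zero exit code and empty output as failures. A command that is
// expected to stream ledger metadata should never exit 0 with no
// output, so that case is surfaced as an error too.
func runChecked(name string, args ...string) ([]byte, error) {
	var out bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("%s exited with error: %w", name, err)
	}
	if out.Len() == 0 {
		return nil, fmt.Errorf("%s exited 0 but produced no output", name)
	}
	return out.Bytes(), nil
}

func main() {
	// "true" exits 0 with no output, so the empty-output check fires.
	if _, err := runChecked("true"); err != nil {
		fmt.Println(err)
	}
}
```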

Full log context (using `run-from`)

time="2020-10-23T12:27:59.708Z" level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=1)" error="error preparing range: opening subprocess: error calculating ledger and hash for stelar-core run: error trying to read ledger header 1 from HAS: error opening ledger stream: Bad HTTP response '404 File not found' for GET 'http://localhost:1570/ledger/00/00/00/ledger-0000003f.xdr.gz'" next_state=start pid=196 service=ingest
time="2020-10-23T12:28:00.708Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1)" next_state=start pid=196 service=ingest
time="2020-10-23T12:28:00.711Z" level=info msg="Resuming ingestion system from last processed ledger..." last_ledger=1 pid=196 service=ingest
time="2020-10-23T12:28:01.711Z" level=info msg="Ingestion system state machine transition" current_state=start next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:01.715Z" level=info msg="Released ingestion lock to prepare range" pid=196 service=ingest
time="2020-10-23T12:28:01.715Z" level=info msg="Preparing range" ledger=2 pid=196 service=ingest
<startup> [default INFO] Config from /opt/stellar-default/standalone/core/etc/stellar-captive-core.cfg
<startup> [default INFO] Generated QUORUM_SET: {
   "t" : 1,
   "v" : [ "standalone_1" ]
}

GA76T [default INFO] Starting stellar-core stellar-core 15.0.0~rc1 (b0de6c17bb61f618359e07f8e4a76f6b050af59a)
GA76T [Database INFO] Connecting to: sqlite3://:memory:
GA76T [Ledger INFO] Streaming metadata to file descriptor 3
GA76T [SCP INFO] LocalNode::LocalNode@GA76T qSet: d05a3b
GA76T [default INFO] Listening on 127.0.0.1:21626 for HTTP requests
GA76T [default INFO] * 
GA76T [default INFO] * The database has been initialized
GA76T [default INFO] * 
GA76T [Database INFO] Applying DB schema upgrade to version 13
GA76T [Database INFO] DB schema is in current version
GA76T [Ledger INFO] Established genesis ledger, closing
GA76T [Ledger INFO] Root account seed: SC5O7VZUXDJ6JBDSZ74DSERXL7W3Y5LTOAMRF7RQRL3TAGAPS7LUVG3L
GA76T [History INFO] Archive 'standalone_1' has 'get' command only, will not be written
GA76T [History WARNING] No writable archives configured, history will not be written.
GA76T [Ledger INFO] Starting up application
GA76T [default INFO] Connection effective settings:
GA76T [default INFO] TARGET_PEER_CONNECTIONS: 8
GA76T [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
GA76T [default INFO] MAX_PENDING_CONNECTIONS: 500
GA76T [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 56
GA76T [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 444
GA76T [Ledger INFO] Last closed ledger (LCL) hash is 76df1b9ff762a7a8af1f3c88549fe742e488e80fde561e34b1b6115bdac8d397
GA76T [Ledger INFO] LCL is genesis: [seq=1, hash=76df1b]
GA76T [Ledger INFO] Assumed bucket-state for LCL: [seq=1, hash=76df1b]
GA76T [Ledger INFO] Changing state LM_BOOTING_STATE -> LM_CATCHING_UP_STATE
GA76T [default INFO] *
GA76T [default INFO] * Target ledger 1 is not newer than last closed ledger 1 - nothing to do
GA76T [default INFO] * If you really want to catchup to 1 run stellar-core new-db
GA76T [default INFO] *
GA76T [default INFO] Application destructing
GA76T [default INFO] Application destroyed
time="2020-10-23T12:28:02.726Z" level=info msg="Range prepared" duration=1.01071565 ledger=2 pid=196 service=ingest
time="2020-10-23T12:28:03.728Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1)" next_state=start pid=196 service=ingest
time="2020-10-23T12:28:03.732Z" level=info msg="Resuming ingestion system from last processed ledger..." last_ledger=1 pid=196 service=ingest
time="2020-10-23T12:28:04.734Z" level=info msg="Ingestion system state machine transition" current_state=start next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:04.739Z" level=info msg="Processing ledger" commit=true ledger=true pid=196 sequence=2 service=ingest state=true
time="2020-10-23T12:28:04.739Z" level=info msg="FONS FONS GetLedger(2)" pid=196
time="2020-10-23T12:28:04.740Z" level=error msg="Error in ingestion state machine" current_state="resume(latestSuccessfullyProcessedLedger=1)" error="Error running processors on ledger: Protocol version not supported: Error getting ledger: unmarshalling framed LedgerCloseMeta: unmarshalling XDR frame header: xdr:DecodeUint: EOF while decoding 4 bytes - read: '[]'" next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:05.741Z" level=info msg="Ingestion system state machine transition" current_state="resume(latestSuccessfullyProcessedLedger=1)" next_state="resume(latestSuccessfullyProcessedLedger=1)" pid=196 service=ingest
time="2020-10-23T12:28:05.746Z" level=info msg="Released ingestion lock to prepare range" pid=196 service=ingest
time="2020-10-23T12:28:05.746Z" level=info msg="Preparing range" ledger=2 pid=196 service=ingest
<startup> [default INFO] Config from /opt/stellar-default/standalone/core/etc/stellar-captive-core.cfg
<startup> [default INFO] Generated QUORUM_SET: {
   "t" : 1,
   "v" : [ "standalone_1" ]
}

GDFMI [default INFO] Starting stellar-core stellar-core 15.0.0~rc1 (b0de6c17bb61f618359e07f8e4a76f6b050af59a)
GDFMI [Database INFO] Connecting to: sqlite3://:memory:
GDFMI [Ledger INFO] Streaming metadata to file descriptor 3
GDFMI [SCP INFO] LocalNode::LocalNode@GDFMI qSet: d05a3b
GDFMI [default INFO] Listening on 127.0.0.1:21626 for HTTP requests
GDFMI [default INFO] * 
GDFMI [default INFO] * The database has been initialized
GDFMI [default INFO] * 
GDFMI [Database INFO] Applying DB schema upgrade to version 13
GDFMI [Database INFO] DB schema is in current version
GDFMI [Ledger INFO] Established genesis ledger, closing
GDFMI [Ledger INFO] Root account seed: SC5O7VZUXDJ6JBDSZ74DSERXL7W3Y5LTOAMRF7RQRL3TAGAPS7LUVG3L
GDFMI [History INFO] Archive 'standalone_1' has 'get' command only, will not be written
GDFMI [History WARNING] No writable archives configured, history will not be written.
GDFMI [Ledger INFO] Starting up application
GDFMI [default INFO] Connection effective settings:
GDFMI [default INFO] TARGET_PEER_CONNECTIONS: 8
GDFMI [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
GDFMI [default INFO] MAX_PENDING_CONNECTIONS: 500
GDFMI [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 56
GDFMI [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 444
GDFMI [Ledger INFO] Last closed ledger (LCL) hash is 76df1b9ff762a7a8af1f3c88549fe742e488e80fde561e34b1b6115bdac8d397
GDFMI [Ledger INFO] LCL is genesis: [seq=1, hash=76df1b]
GDFMI [Ledger INFO] Assumed bucket-state for LCL: [seq=1, hash=76df1b]
GDFMI [Ledger INFO] Changing state LM_BOOTING_STATE -> LM_CATCHING_UP_STATE
GDFMI [default INFO] *
GDFMI [default INFO] * Target ledger 1 is not newer than last closed ledger 1 - nothing to do
GDFMI [default INFO] * If you really want to catchup to 1 run stellar-core new-db
GDFMI [default INFO] *
GDFMI [default INFO] Application destructing
GDFMI [default INFO] Application destroyed
panic: send on closed channel

goroutine 11824 [running]:
github.com/stellar/go/ingest/ledgerbackend.(*stellarCoreRunner).start.func1(0xc00080e630)
	/Users/fons/stellar-go/ingest/ledgerbackend/stellar_core_runner_posix.go:41 +0x5f
created by github.com/stellar/go/ingest/ledgerbackend.(*stellarCoreRunner).start
	/Users/fons/stellar-go/ingest/ledgerbackend/stellar_core_runner_posix.go:40 +0x194
starting horizon...
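The panic above is a goroutine sending on a channel that the shutdown path has already closed. A minimal, self-contained illustration of that failure mode and one common guard (selecting on a dedicated done channel so the writer stops instead of sending) is below; this is generic Go, not the actual Horizon fix:

```go
package main

import "fmt"

// send tries to deliver v on ch, but gives up if done has been
// closed. Unlike a bare `ch <- v`, this cannot panic on shutdown:
// the writer observes done and drops the value instead.
func send(ch chan<- int, done <-chan struct{}, v int) bool {
	select {
	case ch <- v:
		return true
	case <-done:
		// Shutdown requested: skip the send.
		return false
	}
}

func main() {
	ch := make(chan int)
	done := make(chan struct{})
	close(done) // simulate shutdown happening before the send
	fmt.Println(send(ch, done, 1))
}
```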

bartekn commented Oct 28, 2020

I was about to create another issue for this. Here's a small repro script I wrote (based on captive-core-start-tester):

package main

import (
	"fmt"

	"github.com/stellar/go/ingest/ledgerbackend"
	"github.com/stellar/go/network"
)

func main() {
	check(6218)
}

func check(ledger uint32) bool {
	c, err := ledgerbackend.NewCaptive(
		"stellar-core",
		"stellar-core-testnet.cfg",
		network.TestNetworkPassphrase,
		[]string{"http://history.stellar.org/prd/core-testnet/core_testnet_001/"},
	)
	if err != nil {
		panic(err)
	}
	defer c.Close()

	err = c.PrepareRange(ledgerbackend.UnboundedRange(ledger))
	if err != nil {
		fmt.Println(err)
		return false
	}

	ok, meta, err := c.GetLedger(ledger)
	if err != nil {
		fmt.Println(err)
		return false
	}

	if !ok {
		fmt.Println("no ledger")
		return false
	}

	if meta.LedgerSequence() != ledger {
		fmt.Println("wrong ledger", meta.LedgerSequence())
		return false
	}

	fmt.Println(ledger, "ok")
	return true
}

After starting it, find the stellar-core process with ps and kill it. You should still receive an error, but with a message like Stellar-Core exited unexpectedly.

@bartekn bartekn self-assigned this Nov 3, 2020
bartekn added a commit that referenced this issue Nov 9, 2020
…ore backend (#3187)

This commit introduces `bufferedLedgerMetaReader`  which decouples
buffering and unmarshaling from `stellarCoreRunner` and
`CaptiveStellarCore`.

`bufferedLedgerMetaReader` fixes multiple issues:

* It fixes #3132 by increasing internal buffers' sizes to hold more
ledgers. It makes catchup code much faster.
* It fixes #3158 - `bufferedLedgerMetaReader` allowed rewriting shutdown
code to a much simpler version. Now `bufferedLedgerMetaReader` and
`CaptiveStellarCore` listen to a single shutdown signal:
`stellarCoreRunner.getProcessExitChan()`. When Stellar-Core process
terminates `bufferedLedgerMetaReader.Start` go routine will stop and
`CaptiveStellarCore` will return a user friendly error in `PrepareRange`
and `GetLedger` methods. When `CaptiveStellarCore.Close()` is called, it
kills the Stellar-Core process, triggering the shutdown code explained
above.
* It decouples buffering and unmarshaling into a single struct. This makes
`stellarCoreRunner` and `CaptiveStellarCore` simpler.
* It fixes a possible OOM issue when the network closes a series of large
ledgers. In such a case `bufferedLedgerMetaReader` will wait for the buffer
to be consumed before reading more ledgers into memory, preventing
increased memory usage.
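The shutdown pattern the commit message describes (readers selecting on both the metadata stream and a single process-exit signal) could be sketched roughly as below. The type and method names here are illustrative, not the actual ledgerbackend API:

```go
package main

import (
	"errors"
	"fmt"
)

// reader sketches a buffered metadata reader that watches a single
// process-exit channel. When the core process dies, the exit channel
// is closed and callers get a friendly error instead of a confusing
// XDR EOF from a half-read stream.
type reader struct {
	metaCh chan []byte   // framed LedgerCloseMeta bytes
	exitCh chan struct{} // closed when the core process terminates
}

func (r *reader) next() ([]byte, error) {
	select {
	case meta := <-r.metaCh:
		return meta, nil
	case <-r.exitCh:
		return nil, errors.New("stellar-core exited unexpectedly")
	}
}

func main() {
	r := &reader{metaCh: make(chan []byte), exitCh: make(chan struct{})}
	close(r.exitCh) // simulate the process dying
	_, err := r.next()
	fmt.Println(err)
}
```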