services/horizon/docker/ledgerexporter: deploy ledgerexporter image as service #4490

Merged
4 changes: 3 additions & 1 deletion exp/services/ledgerexporter/main.go
@@ -29,6 +29,7 @@ func main() {
continueFromLatestLedger := flag.Bool("continue", false, "start export from the last exported ledger (as indicated in the target's /latest path)")
endingLedger := flag.Uint("end-ledger", 0, "ledger at which to stop the export (must be a closed ledger), 0 means no ending")
writeLatestPath := flag.Bool("write-latest-path", true, "update the value of the /latest path on the target")
captiveCoreUseDb := flag.Bool("captive-core-use-db", true, "configure captive core to store database on disk in working directory rather than in memory")
flag.Parse()

logger.SetLevel(supportlog.InfoLevel)
@@ -51,6 +52,7 @@ func main() {
CheckpointFrequency: 64,
Log: logger.WithField("subservice", "stellar-core"),
Toml: captiveCoreToml,
UseDB: *captiveCoreUseDb,
}
core, err := ledgerbackend.NewCaptive(captiveConfig)
logFatalIf(err, "Could not create captive core instance")
@@ -91,7 +93,7 @@ func main() {
err = core.PrepareRange(context.Background(), ledgerRange)
logFatalIf(err, "could not prepare range")

for nextLedger := startLedger; nextLedger <= endLedger; {
for nextLedger := startLedger; endLedger < 1 || nextLedger <= endLedger; {
Contributor Author:

While testing with END=0, I ran into this: the export loop was stopping before generating anything.

Contributor:

Good catch

ledger, err := core.GetLedger(context.Background(), nextLedger)
if err != nil {
logger.WithError(err).Warnf("could not fetch ledger %v, retrying", nextLedger)
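To make the END=0 semantics above concrete, here is a minimal, self-contained sketch (not code from this PR) of the fixed loop guard: a zero end ledger means the export is unbounded, while any non-zero value acts as an inclusive upper bound.

```go
package main

import "fmt"

// shouldExport mirrors the fixed loop condition from the main.go diff above:
// endLedger == 0 means "no ending", so only a non-zero endLedger bounds the export.
func shouldExport(nextLedger, endLedger uint32) bool {
	return endLedger < 1 || nextLedger <= endLedger
}

func main() {
	fmt.Println(shouldExport(100, 0))   // true: END=0 keeps exporting indefinitely
	fmt.Println(shouldExport(100, 100)) // true: the end ledger is inclusive
	fmt.Println(shouldExport(101, 100)) // false: past the requested end ledger
}
```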
1 change: 0 additions & 1 deletion historyarchive/s3_archive.go
@@ -121,7 +121,6 @@ func (b *S3ArchiveBackend) PutFile(pth string, in io.ReadCloser) error {
params := &s3.PutObjectInput{
Bucket: aws.String(b.bucket),
Key: aws.String(key),
ACL: aws.String(s3.ObjectCannedACLPublicRead),
Contributor Author (@sreuland, Jul 29, 2022):

The current txmeta bucket uses a bucket policy for auth, which requires credentials (access key/id) from an IAM principal (I initially used a user in the AWS account; we still need to create a non-user service account). However, when a bucket is policy-based, explicit ACLs set by the client on writes can trigger an auth error, and this ACL was causing AccessControlListNotSupported.

I can either make the new txmeta bucket ACL-based and revert this, or follow the AWS recommendation of using S3 bucket policies and migrate existing buckets by adding a policy with an Allow statement for public read. I noticed only 3 buckets on S3 at present, and this was the only SDK usage of S3 ACLs; the other S3 usages don't send an ACL.

Contributor Author:

I went with an alternative approach and added an ACL config option - 452b20c; this way the client can work with buckets in either permissions configuration.

Body: bytes.NewReader(buf.Bytes()),
}
req, _ := b.svc.PutObjectRequest(params)
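For illustration, a minimal sketch of the ACL-config-option approach mentioned in the thread above. The field name, wiring, and use of PutObject here are assumptions made for the sketch, not the exact contents of commit 452b20c: the idea is that a canned ACL is only attached to the request when configured, so policy-only buckets (ACLs disabled) no longer fail with AccessControlListNotSupported.

```go
package s3sketch

import (
	"bytes"
	"io"
	"path"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// S3ArchiveBackend is simplified here; acl is the hypothetical config option.
type S3ArchiveBackend struct {
	svc    *s3.S3
	bucket string
	prefix string
	acl    string // e.g. s3.ObjectCannedACLPublicRead, or "" for buckets with ACLs disabled
}

// PutFile uploads the file body and only sends an explicit ACL when one is configured.
func (b *S3ArchiveBackend) PutFile(pth string, in io.ReadCloser) error {
	var buf bytes.Buffer
	if _, err := io.Copy(&buf, in); err != nil {
		return err
	}
	in.Close()

	params := &s3.PutObjectInput{
		Bucket: aws.String(b.bucket),
		Key:    aws.String(path.Join(b.prefix, pth)),
		Body:   bytes.NewReader(buf.Bytes()),
	}
	if b.acl != "" {
		params.ACL = aws.String(b.acl)
	}
	_, err := b.svc.PutObject(params)
	return err
}
```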
30 changes: 30 additions & 0 deletions services/horizon/docker/ledgerexporter/captive-core-testnet.cfg
@@ -0,0 +1,30 @@
PEER_PORT=11725
DATABASE = "sqlite3:///cc/stellar.db"

UNSAFE_QUORUM=true
FAILURE_SAFETY=1

[[HOME_DOMAINS]]
HOME_DOMAIN="testnet.stellar.org"
QUALITY="HIGH"

[[VALIDATORS]]
NAME="sdf_testnet_1"
HOME_DOMAIN="testnet.stellar.org"
PUBLIC_KEY="GDKXE2OZMJIPOSLNA6N6F2BVCI3O777I2OOC4BV7VOYUEHYX7RTRYA7Y"
ADDRESS="core-testnet1.stellar.org"
HISTORY="curl -sf http://history.stellar.org/prd/core-testnet/core_testnet_001/{0} -o {1}"

[[VALIDATORS]]
NAME="sdf_testnet_2"
HOME_DOMAIN="testnet.stellar.org"
PUBLIC_KEY="GCUCJTIYXSOXKBSNFGNFWW5MUQ54HKRPGJUTQFJ5RQXZXNOLNXYDHRAP"
ADDRESS="core-testnet2.stellar.org"
HISTORY="curl -sf http://history.stellar.org/prd/core-testnet/core_testnet_002/{0} -o {1}"

[[VALIDATORS]]
NAME="sdf_testnet_3"
HOME_DOMAIN="testnet.stellar.org"
PUBLIC_KEY="GC2V2EFSXN6SQTWVYA5EPJPBWWIMSD2XQNKUOHGEKB535AQE2I6IXV2Z"
ADDRESS="core-testnet3.stellar.org"
HISTORY="curl -sf http://history.stellar.org/prd/core-testnet/core_testnet_003/{0} -o {1}"
92 changes: 92 additions & 0 deletions services/horizon/docker/ledgerexporter/ledgerexporter.yml
@@ -0,0 +1,92 @@
# this file contains the ledgerexporter deployment and its config artifacts.
# when importing the manifest with kubectl create, resources are only created; any that already exist are skipped.
#
# make sure to include the namespace destination, since the manifest does not specify one,
# otherwise it'll go in your current kubectl context.
#
# if defining the secrets for the first time, substitute the <base64 encoded value here> placeholders.
#
# $ kubectl create -f ledgerexporter.yml -n horizon-dev
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    fluxcd.io/ignore: "true"
  labels:
    app: ledgerexporter
  name: ledgerexporter-pubnet-env
data:
  START: "2"
  END: "0"
  # can only have CONTINUE or START set, not both.
  #CONTINUE: "true"
  WRITE_LATEST_PATH: "true"
  CAPTIVE_CORE_USE_DB: "true"
  HISTORY_ARCHIVE_URLS: "https://history.stellar.org/prd/core-live/core_live_001,https://history.stellar.org/prd/core-live/core_live_002,https://history.stellar.org/prd/core-live/core_live_003"
  NETWORK_PASSPHRASE: "Public Global Stellar Network ; September 2015"
  ARCHIVE_TARGET: "s3://horizon-ledgermeta-pubnet"
Contributor Author:

I've configured the horizon-ledgermeta-pubnet bucket in the same AWS account as Batch, with bucket owner enforced (ACLs disabled) and an inline policy that defines the allowed/disallowed statements, following AWS recommendations.

Contributor:

You mean you changed the existing bucket settings?

Unfortunately, the S3-writing code (HistoryArchive) assumes ACLs are enabled (because it writes them). That's why they were enabled.

I am not against the change but then we should also change the writing code.

Contributor Author:

I haven't changed any existing bucket permissions, but ACL usage looks fairly constrained: only one place in the S3 writing code sets the object ACL to include public read during a put, which I removed (and I left a comment on one potential path to migrate off it). There are several other places where S3 object writes (puts) are done with the S3 upload manager, and those don't specify ACLs.

So the net question is identifying how many existing buckets could have been written to by this routine in historyarchive/s3_archive.go. With just the removal of the ACL here, puts still work against the existing buckets, but the objects won't have public read until the bucket permissions are updated to disable ACLs and a policy is added with an Allow Everyone/Public Read statement.

Contributor Author:

I updated the approach and added an ACL config option on S3 - 452b20c; this way the client can work with buckets in either permissions configuration.

Contributor Author:

@2opremio, I added more permissions on horizon-ledgermeta-prodnet-test and horizon-index: a policy granting write to EC2 (Batch access) and an IAM user (k8s access), plus a PublicRead rule.

I noticed a slight difference in their main config: horizon-ledgermeta-prodnet-test had ACLs enabled/bucket owner preferred, while horizon-index had ACLs disabled/bucket owner enforced. The policy applies on top of either.
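As a rough illustration of the bucket-policy route discussed in this thread (the bucket name and region are placeholders, and the policy below is only the public-read piece, not the full inline policy on the real buckets), applying an Allow Everyone/Public Read statement with the same AWS SDK the exporter already uses might look like this:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// publicReadPolicy grants anonymous read on objects, which is what the removed
// per-object ObjectCannedACLPublicRead ACL used to provide.
const publicReadPolicy = `{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicRead",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-ledgermeta-bucket/*"
    }
  ]
}`

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := s3.New(sess)

	// Attach the policy to the bucket so objects stay publicly readable
	// even when clients no longer send a per-object ACL.
	_, err := svc.PutBucketPolicy(&s3.PutBucketPolicyInput{
		Bucket: aws.String("example-ledgermeta-bucket"),
		Policy: aws.String(publicReadPolicy),
	})
	if err != nil {
		log.Fatalf("could not apply bucket policy: %v", err)
	}
}
```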

---
apiVersion: v1
kind: Secret
metadata:
  labels:
    app: ledgerexporter
  name: ledgerexporter-pubnet-secret
type: Opaque
data:
  AWS_REGION: <base64 encoded value here>
Contributor Author (@sreuland, Jul 29, 2022):

AWS credentials get loaded into the cluster as secrets, which the deployment pulls into the ledgerexporter container as env variables.

  AWS_ACCESS_KEY_ID: <base64 encoded value here>
  AWS_SECRET_ACCESS_KEY: <base64 encoded value here>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    fluxcd.io/ignore: "true"
    deployment.kubernetes.io/revision: "3"
  labels:
    app: ledgerexporter
  name: ledgerexporter-deployment
spec:
  selector:
    matchLabels:
      app: ledgerexporter
  replicas: 1
  template:
    metadata:
      annotations:
        fluxcd.io/ignore: "true"
        # if we expect to add metrics at some point to ledgerexporter
        # this just needs to be set to true
        prometheus.io/port: "6060"
        prometheus.io/scrape: "false"
      labels:
        app: ledgerexporter
    spec:
      containers:
      - envFrom:
        - secretRef:
            name: ledgerexporter-pubnet-secret
        - configMapRef:
            name: ledgerexporter-pubnet-env
        image: stellar/horizon-ledgerexporter:latest
        imagePullPolicy: Always
        name: ledgerexporter
        resources:
          limits:
            cpu: 1
            memory: 4Gi
          requests:
            cpu: 250m
            memory: 500m
        volumeMounts:
        - mountPath: /cc
          name: tempfs-volume
      dnsPolicy: ClusterFirst
      volumes:
      - name: tempfs-volume
        emptyDir:
          medium: Memory
Contributor Author:

Captive core's on-disk db (stellar.db) will be on this tempfs RAM drive.

Contributor Author:

Scratch that: after further testing on the cluster, I found the tempfs RAM volume was not suitable/stable for captive core's on-disk db usage. The size of this volume is derived from the underlying k8s host node's RAM and is limited to half of it by default, so it's variable; any time the pod restarts it could end up on a different node with different RAM, and I saw a few OOMKilled errors due to this. I switched the volume to use a PV/PVC.




19 changes: 14 additions & 5 deletions services/horizon/docker/ledgerexporter/start
@@ -1,17 +1,25 @@
#! /usr/bin/env bash
set -e

START="${START:=0}"
START="${START:=2}"
END="${END:=0}"
CONTINUE="${CONTINUE:=false}"
# Writing to /latest is disabled by default to avoid race conditions between parallel container runs
WRITE_LATEST_PATH="${WRITE_LATEST_PATH:=false}"
NETWORK_PASSPHRASE="${NETWORK_PASSPHRASE:=Public Global Stellar Network ; September 2015}"
HISTORY_ARCHIVE_URLS="${HISTORY_ARCHIVE_URLS:=https://s3-eu-west-1.amazonaws.com/history.stellar.org/prd/core-live/core_live_001}"
CAPTIVE_CORE_CONFIG="${CAPTIVE_CORE_CONFIG:=/captive-core-pubnet.cfg}"
CAPTIVE_CORE_USE_DB="${CAPTIVE_CORE_USE_DB:=true}"

if [ -z "$ARCHIVE_TARGET" ]; then
echo "error: undefined ARCHIVE_TARGET env variable"
exit 1
fi

if [ "$NETWORK_PASSPHRASE" = "Test SDF Network ; September 2015" ]; then
CAPTIVE_CORE_CONFIG="/captive-core-testnet.cfg"
fi

# Calculate params for AWS Batch
if [ ! -z "$AWS_BATCH_JOB_ARRAY_INDEX" ]; then
# The batch should have three env variables:
@@ -39,9 +47,10 @@ fi
echo "START: $START END: $END"

export TRACY_NO_INVARIANT_CHECK=1
/ledgerexporter --target $ARCHIVE_TARGET \
--captive-core-toml-path /captive-core-pubnet.cfg \
--history-archive-urls 'https://history.stellar.org/prd/core-live/core_live_001' --network-passphrase 'Public Global Stellar Network ; September 2015' \
--continue="$CONTINUE" --write-latest-path="$WRITE_LATEST_PATH" --start-ledger "$START" --end-ledger "$END"
/ledgerexporter --target "$ARCHIVE_TARGET" \
--captive-core-toml-path "$CAPTIVE_CORE_CONFIG" \
--history-archive-urls "$HISTORY_ARCHIVE_URLS" --network-passphrase "$NETWORK_PASSPHRASE" \
--continue="$CONTINUE" --write-latest-path="$WRITE_LATEST_PATH" \
--start-ledger "$START" --end-ledger "$END" --captive-core-use-db "$CAPTIVE_CORE_USE_DB"

echo "OK"