exp/lighthorizon/build/index-batch: merge map/reduce updates to latest on feature branch #4543
@@ -0,0 +1,7 @@

```markdown
# `stellar/horizon-indexer`

This Docker image contains the ledger/checkpoint indexing executables. It allows running multiple instances of `map`/`reduce` on a single machine or running them in [AWS Batch](https://aws.amazon.com/batch/).

## Env variables

See the [package documentation](../../index/cmd/batch/doc.go) for more details.
```

> **Review comment** (on the `stellar/horizon-indexer` heading): Isn't the image called …?
>
> **Reply:** Yes, thanks for the catch; fixed in the small follow-up PR #4552.
@@ -0,0 +1,43 @@

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: 'batch-map-job'
spec:
  completions: 52
  parallelism: 10
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: 'worker'
          image: 'stellar/lighthorizon-index-batch'
          imagePullPolicy: Always
          envFrom:
            - secretRef:
                name: <reference to secret name here if needed for source/target>
          env:
            - name: RUN_MODE
              value: "map"
            - name: BATCH_SIZE
              value: "10048"
            - name: FIRST_CHECKPOINT
              value: "41426080"
            - name: WORKER_COUNT
              value: "8"
            - name: TXMETA_SOURCE
              value: "<url of txmeta source>"
            - name: JOB_INDEX_ENV
              value: "JOB_COMPLETION_INDEX"
            - name: NETWORK_PASSPHRASE
              value: "pubnet"
            - name: INDEX_TARGET
              value: "url of target index"
          resources:
            limits:
              cpu: 4
              memory: 5Gi
            requests:
              cpu: 500m
              memory: 500Mi
```

> **Review comment** (on `parallelism: 10` / `completionMode: Indexed`): What does this mean?
>
> **Reply:** The k8s Job will only run 10 pods concurrently, and it will keep starting a new pod whenever one exits until 52 pods have been launched, each pod getting assigned a unique completion index.
>
> **Follow-up:** Very cool, thanks!

> **Review comment** (on `FIRST_CHECKPOINT: "41426080"`): Arbitrary?
>
> **Reply:** Yes; @2opremio mentioned running over roughly the last month to start with to verify the job stream, and this value is around July 1st.

> **Review comment** (on `WORKER_COUNT: "8"` vs `cpu: 4`): 8 workers but 4 CPUs? 🤔
>
> **Reply:** Yes, a little bit of time slicing. These values are meant to be suggestive; on the dev cluster where this is aiming to deploy there isn't a huge amount of resources available, and this also ties into the …
>
> **Follow-up:** I think the process is mostly IO-bound, so it's fine to have more workers than CPUs.
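To make the indexed-completion mechanics above concrete, here is a minimal sketch (not the actual lighthorizon worker code) of how a map pod could turn its completion index into a ledger range. It assumes `BATCH_SIZE` counts ledgers and that Kubernetes injects `JOB_COMPLETION_INDEX` into each pod, as the spec above configures.

```go
// Minimal sketch only; not the actual lighthorizon map code. Assumes
// BATCH_SIZE is a ledger count and that each pod receives its index
// (0..51 for completions: 52) via JOB_COMPLETION_INDEX.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	idx, _ := strconv.ParseUint(os.Getenv("JOB_COMPLETION_INDEX"), 10, 64)
	first, _ := strconv.ParseUint(os.Getenv("FIRST_CHECKPOINT"), 10, 64)
	size, _ := strconv.ParseUint(os.Getenv("BATCH_SIZE"), 10, 64)

	// Pod i covers ledgers [first + i*size, first + (i+1)*size).
	start := first + idx*size
	end := start + size
	fmt.Printf("map pod %d: ledgers [%d, %d)\n", idx, start, end)
}
```

Under those assumptions the whole Job covers 52 * 10048 = 522,496 ledgers starting at 41,426,080, which at roughly 5 seconds per ledger is about a month of pubnet history, consistent with the "around July 1st" reply above.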
@@ -0,0 +1,42 @@

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: 'batch-reduce-job'
spec:
  completions: 52
  parallelism: 10
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: 'worker'
          image: 'stellar/lighthorizon-index-batch'
          imagePullPolicy: Always
          envFrom:
            - secretRef:
                name: <reference to secret name here if needed for source/target>
          env:
            - name: RUN_MODE
              value: "reduce"
            - name: MAP_JOB_COUNT
              value: "52"
            - name: REDUCE_JOB_COUNT
              value: "52"
            - name: WORKER_COUNT
              value: "8"
            - name: INDEX_SOURCE_ROOT
              value: "<url of index location>"
            - name: JOB_INDEX_ENV
              value: "JOB_COMPLETION_INDEX"
            - name: INDEX_TARGET
              value: "<url of index location>"
          resources:
            limits:
              cpu: 4
              memory: 5Gi
            requests:
              cpu: 500m
              memory: 500Mi
```
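The reduce spec needs both `MAP_JOB_COUNT` and `REDUCE_JOB_COUNT` because every reduce pod has to look at the output of all 52 map jobs while owning only its own slice of the final index. The sketch below shows one generic way such a partition can work (hash each key to a reduce shard); it illustrates the pattern only and is not the actual lighthorizon reduce logic.

```go
// Generic hash-partition sketch; not the actual lighthorizon reduce code.
// Each reduce pod scans all map outputs but merges only the keys (e.g.
// account IDs) whose hash equals its own completion index, so the
// REDUCE_JOB_COUNT shards stay disjoint without any coordination.
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key onto one of reduceJobs shards.
func shardFor(key string, reduceJobs uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % reduceJobs
}

func main() {
	const reduceJobs = 52
	const myIndex = 7 // this pod's JOB_COMPLETION_INDEX (hypothetical)

	// Hypothetical keys seen while scanning one map job's output.
	keys := []string{"GAAA...", "GBBB...", "GCCC..."}
	for _, k := range keys {
		if shardFor(k, reduceJobs) == myIndex {
			fmt.Println("merged by this pod:", k)
		}
	}
}
```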
@@ -25,6 +25,12 @@ func NewFileBackend(dir string, parallel uint32) (*FileBackend, error) {

```go
        parallel = 1
    }

    err := os.MkdirAll(dir, fs.ModeDir|0755)
    if err != nil {
        log.Errorf("Unable to mkdir %s, %v", dir, err)
        return nil, err
    }

    return &FileBackend{
        dir:      dir,
        parallel: parallel,
```

> **Review comment** (on the `os.MkdirAll` block): This is a behavior change. I'm for it, but can we use this as a chance to add a docstring? e.g. …
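The reviewer's suggested docstring was cut off in this view; a hypothetical doc comment along those lines, reflecting only what the diff above shows (directory creation and the `parallel` fallback), might read:

```go
// Hypothetical docstring only; the reviewer's actual suggestion is not
// shown in this view.

// NewFileBackend returns a file-based index backend rooted at dir. As of
// this change, the directory (including any missing parents) is created via
// os.MkdirAll instead of being required to already exist; any MkdirAll error
// is returned. parallel bounds the backend's concurrency and falls back to 1
// when zero.
func NewFileBackend(dir string, parallel uint32) (*FileBackend, error)
```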
@@ -68,7 +68,7 @@ for (( i=0; i < $BATCH_COUNT; i++ ))

```diff
 do
   echo -n "Creating map job $i... "

-  NETWORK_PASSPHRASE='testnet' MODULES='accounts_unbacked,transactions' \
+  NETWORK_PASSPHRASE='testnet' JOB_INDEX_ENV='AWS_BATCH_JOB_ARRAY_INDEX' MODULES='accounts_unbacked,transactions' \
   AWS_BATCH_JOB_ARRAY_INDEX=$i BATCH_SIZE=$BATCH_SIZE FIRST_CHECKPOINT=$FIRST \
   TXMETA_SOURCE=file://$1 INDEX_TARGET=file://$2 WORKER_COUNT=1 \
   ./map &
```

> **Review comment** (on the `JOB_INDEX_ENV` addition): Same comment here as on the other PR: why do we need the name of the variable holding the value rather than just passing the value itself? e.g. …
>
> **Reply:** Hmm, good point, this would be a good simplification; might squeeze it in.
>
> **Reply:** Oh, I think I see why. Under different orchestration environments like k8s or AWS Batch, that layer injects the job index number as a container env variable whose name can be anything (`AWS_BATCH_JOB_ARRAY_INDEX`, or `JOB_COMPLETION_INDEX` for k8s), and there is no shell layer like here to do the interpolation. The only way the app code launched in the container can realistically find the value is to be given the known env var key name for the job index; the app code then performs a second lookup against that name in the environment to get the value.
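To illustrate the two-step lookup described in that reply (a sketch, not the actual map/reduce entrypoint code): the orchestration layer exports the index under a platform-specific name, and the worker only needs to be told which name to read.

```go
// Sketch of the JOB_INDEX_ENV indirection; not the actual batch entrypoint.
// JOB_INDEX_ENV holds the *name* of the variable that the orchestration
// layer populates with this task's index (AWS_BATCH_JOB_ARRAY_INDEX on AWS
// Batch, JOB_COMPLETION_INDEX on Kubernetes).
package main

import (
	"fmt"
	"os"
	"strconv"
)

func jobIndex() (uint32, error) {
	name := os.Getenv("JOB_INDEX_ENV")
	if name == "" {
		return 0, fmt.Errorf("JOB_INDEX_ENV is not set")
	}
	raw := os.Getenv(name) // second lookup, against the injected variable
	idx, err := strconv.ParseUint(raw, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid job index %q in $%s: %w", raw, name, err)
	}
	return uint32(idx), nil
}

func main() {
	idx, err := jobIndex()
	if err != nil {
		panic(err)
	}
	fmt.Println("job index:", idx)
}
```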
> **Review comment:** I'm not sure it's a good idea to use testnet (since it can be reset).

> **Review comment:** Also, there's a missing space before the `horizon-light:` job.

> **Reply:** Yes, the intent was to avoid the long catch-up seen on pubnet ranges. The trade-off is a potential window of a couple of minutes when a newly reset testnet hasn't generated at least 150 ledgers yet, during which this CI job step would fail if it were invoked at the same time. I changed it in a small follow-up PR #4552 to reference the same early range but on pubnet; maybe that suffices. I still need to check how long catch-up runs there; if it's quick, then it may be enough.