Introduce sanity/compatibility test for live clusters #12175
25d39ce
to
a405b68
Compare
runtime/src/bank.rs
Outdated
@@ -110,7 +110,7 @@ type TransactionAccountRefCells = Vec<Rc<RefCell<Account>>>;
 type TransactionLoaderRefCells = Vec<Vec<(Pubkey, RefCell<Account>)>>;

 // Eager rent collection repeats in cyclic manner.
-// Each cycle is composed of <partiion_count> number of tiny pubkey subranges
+// Each cycle is composed of <partition_count> number of tiny pubkey subranges
(while intentionally triggering the CI) let's increase my karma. :)
ci/buildkite-pipeline.sh
Outdated
@@ -125,8 +125,9 @@ wait_step() {
 }

 all_test_steps() {
-command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
-wait_step
+#command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
Obviously, I need to revert these before merging!
# for your pain-less copy-paste

# UPDATE docs/src/clusters.md TOO!!
test_with_live_cluster "testnet" \
When backporting to v1.2, I'll remove this line.
@mvines I think this PR is getting into pretty good shape. Could you review it? I changed it to launch and run on an ad hoc GCE instance, and the test duration is pretty short (~10 min) for both testnet and mainnet-beta.
docs/src/clusters.md
Outdated
@@ -36,15 +36,15 @@ solana config set --url https://devnet.solana.com

```bash
$ solana-validator \
--entrypoint entrypoint.devnet.solana.com:8001 \
Reordered to follow the logical order of use: entrypoint (contact the cluster) => [trusted] validator (fetch genesis/snapshot) => expected-... (finally, assert the expected things)
ci/live-cluster-sanity.sh
Outdated
instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
if [[ -z $instance_deleted ]]; then
global variables! \ o /
I think it's safe to just try to delete here
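The suggestion to attempt the delete unconditionally, instead of tracking an `instance_deleted` global, could look roughly like the sketch below. This is not the PR's code: `delete_instance` is a hypothetical stand-in for `./net/gce.sh delete`, and it only logs the call so the example runs without GCE access.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for "./net/gce.sh delete -p <prefix>": it only records
# the call so the sketch can run without GCE access.
cleanup_log=$(mktemp)
delete_instance() {
  echo "delete $1" >> "$cleanup_log"
}

# The job body runs in a subshell so that its EXIT trap fires when the job ends,
# whether it finished normally or bailed out early.
run_job() (
  prefix=$1
  # No "already deleted?" flag: always try, and ignore a failed second attempt.
  trap 'delete_instance "$prefix" || true' EXIT
  if [ "${2:-}" = "fail" ]; then
    exit 1 # simulated failure before the normal-path delete
  fi
  delete_instance "$prefix" # normal-path delete; the trap simply tries again
)
```

Calling `run_job demo` records two delete attempts, while `run_job demo fail` records one, made by the trap. The real delete would have to tolerate being run twice, which the `|| true` in the trap is meant to cover.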
docs/src/clusters.md
Outdated
@@ -74,20 +74,21 @@ solana config set --url https://testnet.solana.com

##### Example `solana-validator` command-line

[comment]: <> (UPDATE ci/live-cluster-sanity.sh TOO!!)
I noticed Docusaurus can't handle this correctly; this is why the Travis build was failing.
this is now fixed.
ci/live-cluster-sanity.sh
Outdated
-d '{"jsonrpc":"2.0","id":1, "method":"validatorExit"}' \
http://localhost:18899

(sleep 3 && kill "$tail_pid") &
This trick gives an elegant `wait` without needing `set +e`.
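One reading of this comment: `wait` with several pids returns the exit status of the last pid listed, so putting the delayed killer last keeps the overall status at 0 even though the killed process exits non-zero. A small sketch of that behavior follows, using `sleep 30` as an assumed stand-in for the long-running `tail` process and a 1-second delay instead of 3.

```bash
#!/usr/bin/env bash
set -e

# Stand-in for the long-running `tail -f` on the validator log.
sleep 30 &
tail_pid=$!

# Delayed killer: exits 0 once the kill has been delivered.
(sleep 1 && kill "$tail_pid") &
kill_pid=$!

# `wait` returns the status of the last pid listed. The killed process exits
# with 128+SIGTERM, but listing the killer last makes the overall status 0,
# so `set -e` does not abort the script here.
wait "$tail_pid" "$kill_pid"
status=$?
echo "wait status: $status"
```

Swapping the argument order to `wait "$kill_pid" "$tail_pid"` would make `wait` report the killed process's status instead, and `set -e` would then end the script.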
ci/live-cluster-sanity.sh
Outdated
./net/ssh.sh "$instance_ip" mkdir cluster-sanity

validator_log="$cluster_label-validator.log"
./net/ssh.sh "$instance_ip" -Llocalhost:18899:localhost:18899 ./solana-validator \
Combined with `--private-rpc` and `--rpc-bind-address`, the exposure to the public internet is minimized by `-L...`.
no longer needed
ci/live-cluster-sanity.sh
Outdated
show_log
done

echo "--- Monitoring validator $cluster_label"
Should I also add the `catchup` phase?
let's skip this for now. This will increase the test time.
ci/live-cluster-sanity.sh
Outdated
--trusted-validator 9QxCLckBiJc783jnMvXZubK4wH86Eqqvashtrwvcsgkv \
--expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
# for your pain-less copy-paste
I wonder whether it would be nice to upload the fetched snapshots to Buildkite as artifacts, for reproducible testing if anything odd happens.
done by uploading snapshots only if the build failed.
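A rough sketch of "upload only if the build failed" is shown below. It is an illustration, not the PR's implementation: the helper name, the `snapshot-*` file pattern, and the injectable `UPLOAD_CMD` variable are all assumptions. Outside Buildkite it falls back to `echo`, and in CI it could be pointed at `buildkite-agent artifact upload`.

```bash
#!/usr/bin/env bash
# Assumption: the uploader is injectable so the sketch runs outside Buildkite.
# In CI it could be set to: UPLOAD_CMD="buildkite-agent artifact upload"
UPLOAD_CMD=${UPLOAD_CMD:-echo would-upload}

# Hypothetical helper: upload snapshot archives only when the run failed.
upload_snapshots_on_failure() {
  local status=$1 dir=$2 archive
  if [ "$status" -eq 0 ]; then
    return 0 # success: nothing worth keeping
  fi
  # Assumption: archives are named snapshot-*; adjust to the real layout.
  for archive in "$dir"/snapshot-*; do
    [ -e "$archive" ] || continue # an unmatched glob stays literal; skip it
    $UPLOAD_CMD "$archive"
  done
}
```

A caller would pass the validator run's exit status and the directory holding the fetched snapshots, for example `upload_snapshots_on_failure "$?" ./cluster-sanity`.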
ci/live-cluster-sanity.sh
Outdated
(sleep 3 && kill "$tail_pid") &
kill_pid=$!
wait "$ssh_pid" "$tail_pid" "$kill_pid"
guard with timeout N
Well, it turned out this is rather complicated. Hint: `wait` must be a shell builtin, but `timeout` is just a normal command. Let's skip this.
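For context, `timeout N wait ...` cannot work because `timeout` runs an external program, while `wait` only makes sense inside the shell that owns the child processes. One builtin-only alternative is a watchdog subshell, sketched below. The helper name `wait_with_deadline` is hypothetical, and this was not adopted in the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait for a background pid, but kill it after a deadline.
wait_with_deadline() {
  local pid=$1 seconds=$2 status=0 watchdog
  # Watchdog: kill the target once the deadline passes.
  (sleep "$seconds" && kill "$pid" 2>/dev/null) >/dev/null 2>&1 &
  watchdog=$!
  # Record the target's exit status without tripping `set -e`.
  wait "$pid" || status=$?
  # Cancel the watchdog if the target finished first. Its inner `sleep` may
  # linger until the deadline, which is harmless for short deadlines.
  kill "$watchdog" 2>/dev/null || true
  wait "$watchdog" 2>/dev/null || true
  return "$status"
}
```

Usage would be along the lines of `some_command & wait_with_deadline "$!" 600`. A target killed by the watchdog shows up as a non-zero status (128 plus the signal number), so the caller can tell a timeout apart from a clean exit.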
ci/live-cluster-sanity.sh
Outdated
source ci/_
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
If `BUILDKITE_BRANCH` is empty (like if `ci/live-cluster-sanity.sh` is run locally), set `escaped_branch` to `$(whoami)` perhaps?
done
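A minimal sketch of what such a fallback could look like is below. The function name `escape_branch` is hypothetical, and the actual change in the PR may differ. It keeps the same `tr`/`sed` pipeline as the reviewed line, using `sed -E` in place of `-r` for portability.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): turn a branch name into a string that
# is safe to embed in an instance name.
escape_branch() {
  local branch=$1
  # Fall back to the local user name when no branch is given, for example when
  # the script is run outside Buildkite.
  if [ -z "$branch" ]; then
    branch=$(whoami)
  fi
  # Non-alphanumerics become "-", then leading/trailing dashes and "head"
  # markers are stripped.
  printf '%s' "$branch" | tr -c "[:alnum:]" - | sed -E "s#(^-*|-*head-*|-*$)##g"
}

escaped_branch=$(escape_branch "${BUILDKITE_BRANCH:-}")
echo "$escaped_branch"
```

With this sketch, a pull-request branch such as `pull/12175/head` is reduced to `pull-12175`, and a local run yields the current user name.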
ci/live-cluster-sanity.sh
Outdated
# ensure to delete leftover cluster
./net/gce.sh delete -p "$instance_prefix" || true
# only bootstrap, no normal validator
./net/gce.sh create -p "$instance_prefix" -n 0
Let's ensure the instances are shut down promptly if something goes wrong:
-./net/gce.sh create -p "$instance_prefix" -n 0
+./net/gce.sh create -p "$instance_prefix" -n 0 --self-destruct-hours 1
Thanks for the tip about this nice option. I didn't know about it.
ci/live-cluster-sanity.sh
Outdated
_ cargo +"$rust_stable" build --bins --release
_ ./net/scp.sh ./target/release/solana-validator "$instance_ip:."
echo 500000 | ./net/ssh.sh "$instance_ip" sudo tee /proc/sys/vm/max_map_count > /dev/null
Instead of this, let's copy `solana-sys-tuner` in so it can set max_map_count and we verify that code path too
done
@@ -36,15 +36,15 @@ solana config set --url https://devnet.solana.com

```bash
The doc/ and `bank.rs` changes in here look just fine, why don't you just land those as a separate PR while we work through the `ci/` files in this PR
There is no strong reason to create separate PRs; I just thought these changes weren't worth their own PRs. The `bank.rs` changes are needed to trigger the `live-cluster` tests (yeah, I could improve `ci/buildkite-pipeline.sh`). And the `docs` changes include the update notice that refers to this PR. So separating them introduces a bit of work.
@@ -169,6 +172,12 @@ all_test_steps() {
artifact_paths: "log-*.txt"
agents:
- "queue=cuda"
- command: "ci/live-cluster-sanity.sh"
Do we need to run this on every PR? It seems like a nightly would be sufficient
Yeah, I think this is worth running on every PR. Here are some reasons:
- nightly is a bit too infrequent in my opinion;
- According to the insights, it seems that we're merging 20 PRs per business day (100 per week/500 per month). Then, assume roughly half of them are Rust (validator) related (quick guess from https://buildkite.com/solana-labs/solana/builds?branch=master&page=2). With those numbers in mind, bisecting a regression will take about 3 steps (2 ** 3 =~ 10) on average with nightly. This is tedious in my opinion; bisecting is very effective for a very wide window, but not so effective for a small window.
- I can tolerate hourly, but then why not every PR? ;)
- This doesn't make the whole CI longer from the PR author's perspective (`local-cluster` is the longest at this pipeline phase...)
- it's less ideal compared to unit tests, but this test could serve as a smoke test around process startup, where test coverage is currently particularly weak.
- Running on every PR could work as a last-minute sanity check in the case of a hotfix.
- `live-cluster` occupies `queue=gce-deploy`, which isn't so crowded compared to `queue=default`.
- gossip/turbine/bpf execution code changes will benefit from testing against the actual production environment as part of the normal CI build. These areas currently lack integration tests with fixture data extracted from the real environment. So, no need to manually run a validator each time for minor changes.
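The bisect estimate in this comment is the base-2 logarithm of the number of candidate commits. The sketch below takes the roughly 10 validator-related PRs per day from the comment and computes the worst-case step count, which is the ceiling of log2 and so comes out slightly above the "about 3" average. The function name is illustrative only.

```bash
#!/usr/bin/env bash
# Worst-case number of bisect steps for a window of N candidate commits:
# the smallest k such that 2^k >= N.
bisect_steps() {
  local commits=$1 steps=0 span=1
  while [ "$span" -lt "$commits" ]; do
    span=$((span * 2))
    steps=$((steps + 1))
  done
  echo "$steps"
}

echo "nightly (10 candidate PRs): $(bisect_steps 10) steps"
echo "every PR (1 candidate PR): $(bisect_steps 1) steps"
```

With a per-PR run there is a single candidate, so no bisecting is needed at all, which is the core of the argument for running on every PR.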
> `live-cluster` occupies `queue=gce-deploy` which isn't so crowded compared to the `queue=default`.

I believe there are only one or two agents running `gce-deploy` ATM. So we'll want to bump that up first. It should just be a matter of ensuring the gcloud CLI tools are installed and pointed at the correct project, then adding a systemd service for the new agent
ci/live-cluster-sanity.sh
Outdated
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
I think metrics will complain about this since there won't be a database named `$instance_prefix`. (I hit similar trying to get cute with the rolling upgrades instance names)
Fortunately, this PR doesn't use much of the `net/*.sh` scripts, so it isn't affected. Anyway, I've set up a metrics database specifically for this job.
LGTM, now! Rolling out more BK agents is currently blocked on #12527, though
This reverts commit ae24ab6.
0cbbf00
to
f74ea2c
Compare
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
Problem
bisecting is hurting... (hard-labored fruit this time: #12176)
Summary of Changes
I think that, if CI time and resources allow, this should run on every PR instead of as a nightly CI job. And it seems that running it doesn't take much time.
- [ ] todo: what to do if the tested cluster is dead? Maybe an easy turn-off knob like a GitHub `skip-live-cluster` label? (EDIT: Well, let's skip this for now? Clusters are pretty stable nowadays.)
- [ ] todo: if the cluster is dead, fall back to some periodic backup of snapshot + minimum ledger? (EDIT: Well, let's skip this for now? Clusters are pretty stable nowadays.)