Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce sanity/compatibility test for live clusters #12175

Closed
wants to merge 57 commits into from

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Sep 10, 2020

Problem

bisecting is hurting... (hard-labored fruit this time: #12176)

Summary of Changes

I think if ci time and resource is allowed, this should be run on each prs instead of nightly ci job. and it seems that running this doesn't take much time.

- [ ] todo what to do if the tested cluster is dead? Maybe easy turn-off knob like github's skip-live-cluster label? (EDIT: Well, let's skip this for now? clusters are pretty stable nowadays)

- [ ] todo if the cluster is dead, fallback to some periodic backup of snapshot + minimum ledger? (EDIT: Well, let's skip this for now? clusters are pretty stable nowadays)

@ryoqun ryoqun force-pushed the live-cluster-sanity branch from 25d39ce to a405b68 Compare September 18, 2020 12:37
@@ -110,7 +110,7 @@ type TransactionAccountRefCells = Vec<Rc<RefCell<Account>>>;
type TransactionLoaderRefCells = Vec<Vec<(Pubkey, RefCell<Account>)>>;

// Eager rent collection repeats in cyclic manner.
// Each cycle is composed of <partiion_count> number of tiny pubkey subranges
// Each cycle is composed of <partition_count> number of tiny pubkey subranges
Copy link
Member Author

@ryoqun ryoqun Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(while intentionally triggering the CI) let's increase my karma. :)

@@ -125,8 +125,9 @@ wait_step() {
}

all_test_steps() {
command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
wait_step
#command_step checks ". ci/rust-version.sh; ci/docker-run.sh \$\$rust_nightly_docker_image ci/test-checks.sh" 20
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, I need to revert these before merging!

# for your pain-less copy-paste

# UPDATE docs/src/clusters.md TOO!!
test_with_live_cluster "testnet" \
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When backporting to v1.2, I'll remove this line.

@ryoqun ryoqun requested a review from mvines September 18, 2020 13:23
@ryoqun ryoqun marked this pull request as ready for review September 18, 2020 13:26
@ryoqun
Copy link
Member Author

ryoqun commented Sep 18, 2020

@mvines I think this pr is getting in pretty good shape. Could you review this? I changed to launch and run on adhoc GCE instance and the test duration is pretty short (~10 min) for both testnet and mainnet-beta.

@ryoqun ryoqun changed the title [wip] Live cluster sanity Introduce sanity/compatibility test for live clusters Sep 18, 2020
@@ -36,15 +36,15 @@ solana config set --url https://devnet.solana.com

```bash
$ solana-validator \
--entrypoint entrypoint.devnet.solana.com:8001 \
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reordered for logical use order: entrypoint (contact to the cluster) => [trusted] validator (fetch genesis/snapshot) => expected-... (let's assert expected things finally)

instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
if [[ -z $instance_deleted ]]; then
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global variables! \ o /

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to just try to delete here

@@ -74,20 +74,21 @@ solana config set --url https://testnet.solana.com

##### Example `solana-validator` command-line

[comment]: <> (UPDATE ci/live-cluster-sanity.sh TOO!!)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed docusaus can't handle this correctly, this is the reason of failing travis build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now fixed.

-d '{"jsonrpc":"2.0","id":1, "method":"validatorExit"}' \
http://localhost:18899

(sleep 3 && kill "$tail_pid") &
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this trick realizes set +e-less elegant wait.

./net/ssh.sh "$instance_ip" mkdir cluster-sanity

validator_log="$cluster_label-validator.log"
./net/ssh.sh "$instance_ip" -Llocalhost:18899:localhost:18899 ./solana-validator \
Copy link
Member Author

@ryoqun ryoqun Sep 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combined with --private-rpc and --rpc-bind-address, the exposure to the public internet is minimized by -L....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed

show_log
done

echo "--- Monitoring validator $cluster_label"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I also add the catchup phase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's skip this for now. This will increase the test time.

--trusted-validator 9QxCLckBiJc783jnMvXZubK4wH86Eqqvashtrwvcsgkv \
--expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
# for your pain-less copy-paste

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder it's nice to have to upload fetched snapshots to the buildkite as artifacts for reproducible testing if anything odd happens.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done by uploading snapshots only if the build failed.

@mvines mvines requested a review from t-nelson September 19, 2020 02:29

(sleep 3 && kill "$tail_pid") &
kill_pid=$!
wait "$ssh_pid" "$tail_pid" "$kill_pid"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guard with timeout N

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well it turned out this is rather complicated. hint: wait must be shell builtin but timeout is just a normal command. let's skip this.

source ci/_
source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If BUILDKITE_BRANCH is empty (like if ci/live-cluster-sanity.sh is run locally), set escaped_branch to $(whoami) perhaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# ensure to delete leftover cluster
./net/gce.sh delete -p "$instance_prefix" || true
# only bootstrap, no normal validator
./net/gce.sh create -p "$instance_prefix" -n 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's ensure the instances are shut down promptly if something goes wrong:

Suggested change
./net/gce.sh create -p "$instance_prefix" -n 0
./net/gce.sh create -p "$instance_prefix" -n 0 --self-destruct-hours 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for tipping this nice option. I didn't know it.


_ cargo +"$rust_stable" build --bins --release
_ ./net/scp.sh ./target/release/solana-validator "$instance_ip:."
echo 500000 | ./net/ssh.sh "$instance_ip" sudo tee /proc/sys/vm/max_map_count > /dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, let's copy solana-sys-tuner in so it can set max_map_count and we verify that code path too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -36,15 +36,15 @@ solana config set --url https://devnet.solana.com

```bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc/ and bank.rs changes in here look just fine, why don't you just land those as a separate PR while we work through the ci/ files in this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no strong reason to create separate PRs. I just thought it's not worth to be its own prs. bank.rs changes are needed to trigger live-cluster tests. (yeah, I could improve the ci/buildkite-pipeline.sh). And docs chagnes somewhat mentions this pr about the update notice. So separating them introduces a bit of work.

@@ -169,6 +172,12 @@ all_test_steps() {
artifact_paths: "log-*.txt"
agents:
- "queue=cuda"
- command: "ci/live-cluster-sanity.sh"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to run this on every PR? It seems like a nightly would be suffcient

Copy link
Member Author

@ryoqun ryoqun Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this worth on every PR. These are some reasons:

  • nightly is a bit too infrequent in my opinion;
    • According to the insights, it seems that we're merging 20 PRs per business day (100 per weak/500 per month). Then, assume roughly half of it is rust (validator) related (quick guess from https://buildkite.com/solana-labs/solana/builds?branch=master&page=2). Under that numbers in mind, bisecting regressions will take about 3 steps (2 ** 3 =~ 10) in average with nightly. This is tedious in my opinion; bisecting is very effective for the very-wide window, it's not so much effective in small window.
      • I can tolerate with hourly, but then why not every-pr? ;)
    • This doesn't make the whole CI longer from the PR author's perspective (local-cluster is the longest at this pipeline phase...)
  • it's less ideal compared to unit-tests, but this test could serve as a smoke test around process startup, whose tests are currently particularly weak.
  • Running every PR could work as a last minute sanity check in the case of hotfix.
  • live-cluster occupies queue=gce-deploy which isn't so crowded compared to the queue=default.
  • gossip/turbine/bpf exeuction code changes will benefit from testing with actual production environment as part of normal CI build. These area currently lacks integration tests with fixture data extracted from the real environment. So, no need to manually run validator each time for minor changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

live-cluster occupies queue=gce-deploy which isn't so crowded compared to the queue=default.

I believe there are only one or two agents running gce-deploy ATM. So we'll want to bump that up first. It should just be a matter of ensuring the gcloud CLI tools are installed and pointed at the correct project, then adding a systemd service for the new agent

source ci/rust-version.sh stable

escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think metrics will complain about this since there won't be a database named $instance_prefix. (I hit similar trying to get cute with the rolling upgrades instance names)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately, this pr doesn't use much of net/*.shs. This isn't affected. Anyway, I've specifically setup a metric database for this job.

instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
if [[ -z $instance_deleted ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to just try to delete here

@ryoqun ryoqun changed the title Introduce sanity/compatibility test for live clusters [NOTE TO SELF; UNCOMMENT]Introduce sanity/compatibility test for live clusters Sep 28, 2020
@ryoqun ryoqun changed the title [NOTE TO SELF; UNCOMMENT]Introduce sanity/compatibility test for live clusters [NOTE TO SELF; UNCOMMENT] Introduce sanity/compatibility test for live clusters Sep 28, 2020
@ryoqun
Copy link
Member Author

ryoqun commented Sep 28, 2020

@mvines @t-nelson I think this is good for another review in code-wise. The only remaining job is to increase build-agent for gce-deploy.

t-nelson
t-nelson previously approved these changes Sep 28, 2020
Copy link
Contributor

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, now! Rolling out more BK agents is currently blocked on #12527, though

@ryoqun ryoqun force-pushed the live-cluster-sanity branch from 0cbbf00 to f74ea2c Compare July 9, 2021 18:01
@stale
Copy link

stale bot commented Jul 20, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jul 20, 2021
@stale
Copy link

stale bot commented Jul 30, 2021

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jul 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale [bot only] Added to stale content; results in auto-close after a week.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants