
Introduce sanity/compatibility test for live clusters #12175

Closed · ryoqun wants to merge 57 commits

Changes from all commits · 57 commits
350ff35
Test against live clusters
ryoqun Sep 10, 2020
250029f
Fix typo to trigger CI
ryoqun Sep 18, 2020
7e3d007
Enable shellcheck
ryoqun Sep 18, 2020
a147f0a
clean up
ryoqun Sep 18, 2020
3e53824
Sync cluster info with docs
ryoqun Sep 18, 2020
b535877
Bad copy pasta...
ryoqun Sep 18, 2020
4d1463a
Just use artifact_paths
ryoqun Sep 18, 2020
f8706be
Minor polishments
ryoqun Sep 18, 2020
87e19ed
Remove needless sleep and display longer line more
ryoqun Sep 18, 2020
9e4c48f
Extract some
ryoqun Sep 18, 2020
2f14108
Reorder a bit
ryoqun Sep 18, 2020
553b17e
Fix shellcheck
ryoqun Sep 18, 2020
764ccc4
Use more compatible commenting hack?
ryoqun Sep 24, 2020
5fc203e
Minor review comments
ryoqun Sep 24, 2020
69dd2b2
Wrong place...
ryoqun Sep 24, 2020
570887e
Extract into new remote shell
ryoqun Sep 24, 2020
2999eda
Fix shellcheck
ryoqun Sep 24, 2020
bfd3c96
Forgot the &...
ryoqun Sep 24, 2020
80b3009
Collect logs and run sys-tuner
ryoqun Sep 24, 2020
5e83689
Fix shellcheck
ryoqun Sep 24, 2020
3ed3901
Really fix shellcheck
ryoqun Sep 24, 2020
acdedc1
Fix killing tuner and really collect logs for profit
ryoqun Sep 24, 2020
1999544
Really kill
ryoqun Sep 24, 2020
2cf1318
Really kill sys-tuner...
ryoqun Sep 25, 2020
ce3db63
Collect logs even if failed, enable metrics, snapshot upload
ryoqun Sep 25, 2020
6011e5a
Adjust log file path and really upload snapshot
ryoqun Sep 25, 2020
aa14827
Don't always upload snapshots
ryoqun Sep 25, 2020
9c80451
Revert comment out
ryoqun Oct 1, 2020
f6ea13f
Rename to nicely align with local-cluster
ryoqun Oct 1, 2020
b00613e
Increase duration of monitoring phase
ryoqun Oct 1, 2020
0edc8ce
Run ledger-tool verify too
ryoqun Oct 5, 2020
389fb2a
Maybe bpf_loader.so needed only for `ledger-tool`?
ryoqun Oct 5, 2020
0fdde8a
Well, this shouldn't needed anymore
ryoqun Oct 9, 2020
70d6826
Silly me.
ryoqun Oct 9, 2020
956af9f
meh...
ryoqun Oct 10, 2020
5350ca0
Reduce rooted slots also rename confusing var
ryoqun Oct 13, 2020
ec336b5
more var renaming fix....
ryoqun Oct 13, 2020
c85aa05
Double timeout (testnet is slow for some reason)
ryoqun Oct 26, 2020
29fa355
Increase timeout...
ryoqun Oct 28, 2020
dd85437
Tooooo much log
ryoqun Oct 28, 2020
7078191
Chery pick bank frozen INFO message
ryoqun Oct 30, 2020
f39821b
Cherry-pick more logs.
ryoqun Oct 30, 2020
1f4ea03
less log
ryoqun Oct 30, 2020
a3f2739
Restore --expected-shred-version for faster boot?
ryoqun Oct 30, 2020
da49d67
longer timeout for ledger-tool and high-legel logs
ryoqun Oct 31, 2020
2902804
disable audit
ryoqun Oct 31, 2020
8a1ccdc
longer
ryoqun Oct 31, 2020
056e1da
Revert "disable audit"
ryoqun Nov 2, 2020
31c4b1a
Remove expected shred version?
ryoqun Nov 15, 2020
dfc13bc
Update remote-live-cluster-sanity.sh
ryoqun Dec 13, 2020
d6e8c02
Update buildkite-pipeline.sh
ryoqun Dec 13, 2020
d2a9e33
Update to new validator subcommands
ryoqun Mar 7, 2021
7f44ee5
Add --identity
ryoqun Mar 8, 2021
f116cab
Add --no-poh-speed-test....
ryoqun Mar 8, 2021
157960c
Add more entrypoints
ryoqun Mar 8, 2021
ca1a3c0
Add --force.....
ryoqun Mar 9, 2021
f74ea2c
Add --force
ryoqun Mar 25, 2021
9 changes: 9 additions & 0 deletions ci/buildkite-pipeline.sh
@@ -169,6 +169,15 @@ all_test_steps() {
artifact_paths: "log-*.txt"
agents:
- "queue=cuda"
- command: "ci/live-cluster-sanity.sh"
Contributor:

Do we need to run this on every PR? It seems like a nightly would be sufficient.

@ryoqun (Member, Author), Sep 24, 2020:

Yeah, I think this is worth running on every PR. Some reasons:

  • Nightly is a bit too infrequent in my opinion:
    • According to the insights, we're merging about 20 PRs per business day (100 per week / 500 per month). Assume roughly half of those are rust (validator) related (a quick guess from https://buildkite.com/solana-labs/solana/builds?branch=master&page=2). With those numbers, bisecting a regression across a nightly window takes about 3 steps on average (2^3 ≈ 10; see the quick estimate below). That's tedious in my opinion; bisecting is very effective over a very wide window, but much less so over a small one.
      • I could tolerate hourly, but then why not every PR? ;)
    • This doesn't make the whole CI run longer from the PR author's perspective (local-cluster is the longest step in this pipeline phase...).
  • It's less ideal than unit tests, but this test can serve as a smoke test around process startup, which is currently particularly weakly tested.
  • Running on every PR can act as a last-minute sanity check in the case of a hotfix.
  • live-cluster occupies queue=gce-deploy, which isn't as crowded as queue=default.
  • gossip/turbine/bpf execution code changes benefit from being tested against the actual production environment as part of the normal CI build. These areas currently lack integration tests with fixture data extracted from the real environment, so there's no need to manually run a validator for every minor change.
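
A back-of-the-envelope check of that bisection cost (the ~10 suspect merges per nightly window is my rough guess above, not a measured number):

```bash
# steps to bisect one nightly window of ~10 suspect merges: log2(10)
awk 'BEGIN { suspects = 10; printf "bisect steps ~= %.1f\n", log(suspects) / log(2) }'
# prints: bisect steps ~= 3.3
```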

Contributor:

> live-cluster occupies queue=gce-deploy, which isn't as crowded as queue=default.

I believe there are only one or two agents running gce-deploy at the moment, so we'll want to bump that up first. It should just be a matter of ensuring the gcloud CLI tools are installed and pointed at the correct project, then adding a systemd service for the new agent.
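
A minimal sketch of what that could look like (the project name, unit name, agent user, and binary path below are placeholders, not taken from our infra):

```bash
# point the gcloud CLI at the correct project
gcloud config set project our-ci-project

# register one more Buildkite agent on the gce-deploy queue
sudo tee /etc/systemd/system/buildkite-agent-gce-deploy.service <<'EOF'
[Unit]
Description=Buildkite agent (queue=gce-deploy)
After=network-online.target

[Service]
User=buildkite-agent
ExecStart=/usr/bin/buildkite-agent start --tags "queue=gce-deploy"
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now buildkite-agent-gce-deploy.service
```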

name: "live-cluster"
timeout_in_minutes: 60
artifact_paths:
- "*/validator.log"
- "*/sys-tuner.log"
- "*/snapshot-*.tar.*"
agents:
- "queue=gce-deploy"
EOF
else
annotate --style info \
78 changes: 78 additions & 0 deletions ci/live-cluster-sanity.sh
@@ -0,0 +1,78 @@
#!/usr/bin/env bash
set -e
cd "$(dirname "$0")/.."

source ci/_
source ci/rust-version.sh stable

if [[ -n $CI ]]; then
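# make the branch name safe for use in an instance name (alphanumerics and dashes only)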
escaped_branch=$(echo "$BUILDKITE_BRANCH" | tr -c "[:alnum:]" - | sed -r "s#(^-*|-*head-*|-*$)##g")
instance_prefix="testnet-live-sanity-$escaped_branch"
else
instance_prefix="testnet-live-sanity-$(whoami)"
fi

# make sure any leftover cluster from a previous run is deleted
_ ./net/gce.sh delete -p "$instance_prefix" || true
# only bootstrap, no normal validator
_ ./net/gce.sh create -p "$instance_prefix" -n 0 --self-destruct-hours 1
instance_ip=$(./net/gce.sh info | grep bootstrap-validator | awk '{print $3}')

on_trap() {
set +e
_ ./net/gce.sh delete -p "$instance_prefix"
}
trap on_trap INT TERM EXIT

_ cargo +"$rust_stable" build --bins --release
_ ./net/scp.sh \
./ci/remote-live-cluster-sanity.sh \
./target/release/{solana,solana-keygen,solana-validator,solana-ledger-tool,solana-sys-tuner} \
"$instance_ip:."

test_with_live_cluster() {
cluster_label="$1"
rm -rf "./$cluster_label"
mkdir "./$cluster_label"

validator_failed=
_ ./net/ssh.sh "$instance_ip" ./remote-live-cluster-sanity.sh "$@" || validator_failed=$?

# let's collect logs for profit!
for log in $(./net/ssh.sh "$instance_ip" ls 'cluster-sanity/*.log'); do
_ ./net/scp.sh "$instance_ip:$log" "./$cluster_label"
done

if [[ -n $validator_failed ]]; then
# let's even collect the snapshot for diagnostics
for log in $(./net/ssh.sh "$instance_ip" ls 'cluster-sanity/ledger/snapshot-*.tar.*'); do
_ ./net/scp.sh "$instance_ip:$log" "./$cluster_label"
done

(exit "$validator_failed")
fi
}

# UPDATE docs/src/clusters.md TOO!!
test_with_live_cluster "mainnet-beta" \
--entrypoint mainnet-beta.solana.com:8001 \
--entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
--entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
--entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
--entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
--trusted-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
--trusted-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
--trusted-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
--trusted-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
--expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
# for your painless copy-paste

# UPDATE docs/src/clusters.md TOO!!
test_with_live_cluster "testnet" \
@ryoqun (Member, Author):

When backporting to v1.2, I'll remove this line.

--entrypoint entrypoint.testnet.solana.com:8001 \
--trusted-validator 5D1fNXzvv5NjV1ysLjirC4WY92RNsVH18vjmcszZd8on \
--trusted-validator ta1Uvfb7W5BRPrdGnhP9RmeCGKzBySGM1hTE4rBRy6T \
--trusted-validator Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN \
--trusted-validator 9QxCLckBiJc783jnMvXZubK4wH86Eqqvashtrwvcsgkv \
--expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
# for your painless copy-paste
124 changes: 124 additions & 0 deletions ci/remote-live-cluster-sanity.sh
@@ -0,0 +1,124 @@
#!/usr/bin/env bash

handle_error() {
action=$1
set +e
kill "$validator_then_ledger_tool_pid" "$tail_pid"
wait "$validator_then_ledger_tool_pid" "$tail_pid"
echo "--- Error: validator failed to $action"
exit 1
}

show_log() {
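# dump the last few freshly captured log lines (trimmed), then reset the tail buffer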
if find cluster-sanity/log-tail -not -empty | grep ^ > /dev/null; then
echo "##### new log:"
timeout 0.01 cat cluster-sanity/log-tail | tail -n 3 | cut -c 1-300 || true
truncate --size 0 cluster-sanity/log-tail
echo
fi
}

rm -rf cluster-sanity
mkdir cluster-sanity

cluster_label="$1"
shift

echo "--- Starting validator $cluster_label"

validator_log="cluster-sanity/validator.log"
sys_tuner_log="cluster-sanity/sys-tuner.log"
metrics_host="https://metrics.solana.com:8086"
export SOLANA_METRICS_CONFIG="host=$metrics_host,db=testnet-live-cluster,u=scratch_writer,p=topsecret"
export RUST_LOG="warn,solana_runtime::bank=info,solana_validator=info,solana_core=info,solana_ledger=info,solana_core::repair_service=warn"

# shellcheck disable=SC2024 # create log as non-root user
sudo ./solana-sys-tuner --user "$(whoami)" &> "$sys_tuner_log" &
sys_tuner_pid=$!

(
echo "$(date): VALIDATOR STARTED." &&
./solana-keygen new --force --no-passphrase --silent --outfile ./identity.json &&
./solana-validator \
--identity ./identity.json \
--ledger ./cluster-sanity/ledger \
--no-untrusted-rpc \
--no-poh-speed-test \
--log - \
--init-complete-file ./cluster-sanity/init-completed \
--private-rpc \
--rpc-port 8899 \
--rpc-bind-address localhost \
--snapshot-interval-slots 0 \
"$@" &&
echo "$(date): VALIDATOR FINISHED AND LEDGER-TOOL STARTED." &&
./solana-ledger-tool \
--ledger cluster-sanity/ledger \
verify &&
echo "$(date): LEDGER-TOOL FINISHED."
) &> "$validator_log" &

validator_then_ledger_tool_pid=$!
tail -F "$validator_log" > cluster-sanity/log-tail 2> /dev/null &
tail_pid=$!

attempts=200
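# wait for the validator to signal readiness via its --init-complete-file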
while ! [[ -f cluster-sanity/init-completed ]]; do
attempts=$((attempts - 1))
if [[ (($attempts == 0)) || ! -d "/proc/$validator_then_ledger_tool_pid" ]]; then
handle_error "start"
fi

sleep 3
echo "##### validator is starting... (until timeout: $attempts) #####"
show_log
done
echo "##### validator finished starting! #####"

echo "--- Monitoring validator $cluster_label"

# shellcheck disable=SC2012 # ls here is handy for sorted snapshots
snapshot_slot=$(ls -t cluster-sanity/ledger/snapshot-*.tar.* |
head -n 1 |
grep -o 'snapshot-[0-9]*-' |
grep -o '[0-9]*'
)
current_root=$snapshot_slot
goal_root=$((snapshot_slot + 50))

attempts=200
while [[ $current_root -le $goal_root ]]; do
attempts=$((attempts - 1))
if [[ (($attempts == 0)) || ! -d "/proc/$validator_then_ledger_tool_pid" ]]; then
handle_error "root new slots"
fi

sleep 3
current_root=$(./solana --url http://localhost:8899 slot --commitment root)
echo "##### validator is running ($current_root/$goal_root)... (until timeout: $attempts) #####"
show_log
done
echo "##### validator finished running! #####"

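# ask the validator to exit cleanly; the && chain in the subshell above then
# proceeds to the ledger-tool verify step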
./solana-validator \
--ledger cluster-sanity/ledger \
exit --force

attempts=4000
while [[ -d "/proc/$validator_then_ledger_tool_pid" ]]; do
attempts=$((attempts - 1))
if [[ (($attempts == 0)) ]]; then
handle_error "ledger tool"
fi

sleep 3
echo "##### ledger-tool is running... (until timeout: $attempts) #####"
show_log
done
echo "##### ledger-tool finished running! #####"

# well, kill $sys_tuner_pid didn't work for some reason; maybe sudo doesn't relay signals?
(set -x && sleep 3 && kill "$tail_pid" && sudo pkill -f solana-sys-tuner) &
kill_pid=$!

wait "$validator_then_ledger_tool_pid" "$sys_tuner_pid" "$tail_pid" "$kill_pid"
39 changes: 21 additions & 18 deletions docs/src/clusters.md
@@ -42,15 +42,13 @@ solana config set --url https://api.devnet.solana.com

```bash
Member:

The doc/ and bank.rs changes in here look just fine; why don't you land those as a separate PR while we work through the ci/ files in this PR?

@ryoqun (Member, Author):

There's no strong reason to create separate PRs; I just didn't think they were worth being their own PRs. The bank.rs changes are needed to trigger the live-cluster tests (yeah, I could improve ci/buildkite-pipeline.sh instead), and the docs changes reference this PR in the update notice. So separating them introduces a bit of extra work.

$ solana-validator \
- --identity validator-keypair.json \
- --vote-account vote-account-keypair.json \
- --trusted-validator dv1LfzJvDF7S1fBKpFgKoKXK5yoSosmkAdfbxBo1GqJ \
+ --identity ~/validator-keypair.json \
+ --vote-account ~/vote-account-keypair.json \
--no-untrusted-rpc \
--ledger ledger \
--rpc-port 8899 \
--dynamic-port-range 8000-8010 \
--entrypoint entrypoint.devnet.solana.com:8001 \
--expected-genesis-hash EtWTRABZaYq6iMfeYKouRu166VU2xqa1wcaWoxPkrZBG \
--wal-recovery-mode skip_any_corrupted_record \
--limit-ledger-size
```
@@ -87,22 +85,24 @@ solana config set --url https://api.testnet.solana.com

##### Example `solana-validator` command-line

+ [comment]: # (UPDATE ci/live-cluster-sanity.sh TOO!!)

```bash
$ solana-validator \
- --identity validator-keypair.json \
- --vote-account vote-account-keypair.json \
- --entrypoint entrypoint.testnet.solana.com:8001 \
- --entrypoint entrypoint2.testnet.solana.com:8001 \
- --entrypoint entrypoint3.testnet.solana.com:8001 \
--trusted-validator 5D1fNXzvv5NjV1ysLjirC4WY92RNsVH18vjmcszZd8on \
--trusted-validator 7XSY3MrYnK8vq693Rju17bbPkCN3Z7KvvfvJx4kdrsSY \
--trusted-validator Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN \
--trusted-validator 9QxCLckBiJc783jnMvXZubK4wH86Eqqvashtrwvcsgkv \
- --expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
+ --identity ~/validator-keypair.json \
+ --vote-account ~/vote-account-keypair.json \
+ --no-untrusted-rpc \
--ledger ledger \
--rpc-port 8899 \
--dynamic-port-range 8000-8010 \
+ --entrypoint entrypoint.testnet.solana.com:8001 \
+ --entrypoint entrypoint2.testnet.solana.com:8001 \
+ --entrypoint entrypoint3.testnet.solana.com:8001 \
+ --expected-genesis-hash 4uhcVJyU9pJkvQyS88uRDiswHXSCkY3zQawwpjk2NsNY \
--wal-recovery-mode skip_any_corrupted_record \
--limit-ledger-size
```
@@ -142,25 +142,28 @@ solana config set --url https://api.mainnet-beta.solana.com

##### Example `solana-validator` command-line

+ [comment]: # (UPDATE ci/live-cluster-sanity.sh TOO!!)

```bash
$ solana-validator \
- --identity ~/validator-keypair.json \
- --vote-account ~/vote-account-keypair.json \
- --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
- --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
- --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
- --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
- --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
--trusted-validator 7Np41oeYqPefeNQEHSv1UDhYrehxin3NStELsSKCT4K2 \
--trusted-validator GdnSyH3YtwcxFvQrVVJMm1JhTS4QVX7MFsX56uJLUfiZ \
--trusted-validator DE1bawNcRJB9rVm3buyMVfr8mBEoyyu73NBovf2oXJsJ \
--trusted-validator CakcnaRDHka2gXyfbEd2d3xsvkJkqsLw2akB3zsN1D2S \
- --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
- --expected-shred-version 64864 \
+ --identity ~/validator-keypair.json \
+ --vote-account ~/vote-account-keypair.json \
+ --no-untrusted-rpc \
--ledger ledger \
--rpc-port 8899 \
+ --private-rpc \
--dynamic-port-range 8000-8010 \
+ --entrypoint entrypoint.mainnet-beta.solana.com:8001 \
+ --entrypoint entrypoint2.mainnet-beta.solana.com:8001 \
+ --entrypoint entrypoint3.mainnet-beta.solana.com:8001 \
+ --entrypoint entrypoint4.mainnet-beta.solana.com:8001 \
+ --entrypoint entrypoint5.mainnet-beta.solana.com:8001 \
+ --expected-genesis-hash 5eykt4UsFv8P8NJdTREpY1vzqKqZKvdpKuc147dw2N9d \
--wal-recovery-mode skip_any_corrupted_record \
--limit-ledger-size
```
6 changes: 3 additions & 3 deletions runtime/src/bank.rs
@@ -3898,8 +3898,8 @@ impl Bank {
let cycle_params = self.determine_collection_cycle_params(epoch);
let (_, _, in_multi_epoch_cycle, _, _, partition_count) = cycle_params;

- // use common codepath for both very likely and very unlikely for the sake of minimized
- // risk of any miscalculation instead of negligibly faster computation per slot for the
+ // use common code-path for both very-likely and very-unlikely for the sake of minimized
+ // risk of any mis-calculation instead of negligible faster computation per slot for the
// likely case.
let mut start_partition_index =
Self::partition_index_from_slot_index(start_slot_index, cycle_params);
@@ -3911,7 +3911,7 @@
let in_middle_of_cycle = start_partition_index > 0;
if in_multi_epoch_cycle && is_special_new_epoch && in_middle_of_cycle {
// Adjust slot indexes so that the final partition ranges are continuous!
- // This is need because the caller gives us off-by-one indexes when
+ // This is needed because the caller gives us off-by-one indexes when
// an epoch boundary is crossed.
// Usually there is no need for this adjustment because cycles are aligned
// with epochs. But for multi-epoch cycles, adjust the indexes if it