Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: don't scp files via local host when doing treedist #99103

Closed
aliher1911 opened this issue Mar 21, 2023 · 3 comments · Fixed by #99283
Closed

roachprod: don't scp files via local host when doing treedist #99103

aliher1911 opened this issue Mar 21, 2023 · 3 comments · Fixed by #99283
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team

Comments

@aliher1911
Copy link
Contributor

aliher1911 commented Mar 21, 2023

Roachprod uses syncedCluster.UseTreeDist=true by default. This options allows doing tree fanout when copying files by seeding one node with file and then using scp user@host1:file user@host2:file. Unfortunately it is not doing what one might think and retrieves file from host1 and then sending it to host2. You can see that if you try doing put with UseTreeDist and comparing its execution time with put to single host first and then comparing with scp file user@host2 executed directly on host1. This behaviour is documented in man scp.

The immediate impact of this behaviour for example is inability to use pkg/cmd/roachtest/roachstress.sh -c 10 without explicitly setting -p 1 to disable parallel test runs. If multiple tests are started in parallel, network bandwidth is saturated (for remote workers using ADSL) and only copying cockroach takes 10-15 minutes of allocated test run time which in most cases cause tests to timeout. The other negative effect is when doing roachtest failure investigation, each iteration takes at least 10 minutes to copy files. This effect became worse with introduction of cockroach-default.

We can work that around by calling scp on the host1 directly via ssh. It should be fine to do so as we know that cluster hosts have certificate auth set up in dev clusters. Alternatively, we could disable TreeDist altogether as it is making things worse in my experience because we need to transfer data twice as much.

Jira issue: CRDB-25704

@aliher1911 aliher1911 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Mar 21, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 21, 2023

cc @cockroachdb/test-eng

@aliher1911
Copy link
Contributor Author

Test start timing example:

10:34:10 test_runner.go:666: [w0] starting test: replicagc-changed-peers/restart=false:1
10:34:10 cluster.go:683: test status: uploading default cockroach binary to nodes
10:34:10 cluster.go:683: test status: uploading file
10:34:12 cluster_synced.go:1690: 1679394722-01-n6cpu4: putting (dist) /Users/.../go/src/github.com/cockroachdb/cockroach/artifacts/02a019a415c/cockroach ./cockroach-default on nodes [1 2 3 4 5 6]
10:38:45 cluster_synced.go:1888:
10:38:45 cluster_synced.go:1891:    1: done
10:38:45 cluster_synced.go:1891:    2: done
10:38:45 cluster_synced.go:1891:    3: done
10:38:45 cluster_synced.go:1891:    4: done
10:38:45 cluster_synced.go:1891:    5: done
10:38:45 cluster_synced.go:1891:    6: done
10:38:45 cluster.go:683: test status:
10:38:45 cluster.go:683: test status: running test
10:38:45 test_runner.go:838: [w0] === RUN   replicagc-changed-peers/restart=false
=== RUN   replicagc-changed-peers/restart=false
10:38:45 cluster.go:683: test status: uploading file
10:38:46 cluster_synced.go:1690: 1679394722-01-n6cpu4: putting (dist) /Users/.../go/src/github.com/cockroachdb/cockroach/artifacts/02a019a415c/cockroach ./cockroach on nodes [1 2 3 4 5 6]
10:43:23 cluster_synced.go:1888:
10:43:23 cluster_synced.go:1891:    1: done
10:43:23 cluster_synced.go:1891:    2: done
10:43:23 cluster_synced.go:1891:    3: done
10:43:23 cluster_synced.go:1891:    4: done
10:43:23 cluster_synced.go:1891:    5: done
10:43:23 cluster_synced.go:1891:    6: done
10:43:23 cluster.go:683: test status:
10:43:23 cluster.go:683: test status: starting nodes :1-3

@aliher1911
Copy link
Contributor Author

I think we are missing -A -R flags to enable direct copy between hosts and agent forwarding to ensure authentication to target host could be performed regardless of hosts being correctly setup to talk to eachother.

@aliher1911 aliher1911 self-assigned this Mar 22, 2023
craig bot pushed a commit that referenced this issue Mar 23, 2023
99283: roachprod: change scp flags to avoid copy via local host r=tbg a=aliher1911

Previously default scp option made it to copy files via local host. This was causing unnecessary delays and network usage because files were first copied to one node, then distributed from that node using scp between remote locations. Unfortunately the default scp behaviour is to fetch from src and put to dst to avoid authenticating from src to dst.
This commit adds scp options to do direct copy.

Fixes: #99103

Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
@craig craig bot closed this as completed in 9aa8536 Mar 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-testeng TestEng Team
Projects
None yet
1 participant