roachprod: increase concurrent unauthenticated SSH connections #37001

Merged
merged 1 commit into cockroachdb:master from nvanbenschoten/sshLimit on Apr 25, 2019

Conversation

nvanbenschoten (Member)

This PR bumps the permitted number of concurrent unauthenticated
SSH connections from 10 to 64. Above this limit, sshd starts
randomly dropping connections. It's possible this is what we have
been running into with the frequent "Connection closed by remote
host" errors in roachtests.

See:
- https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Load_Balancing
- http://edoceo.com/notabene/ssh-exchange-identification

Release note: None
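
For reference, the sshd setting involved here is MaxStartups, which caps concurrent unauthenticated connections and, in its three-value form, randomly drops new connections once the first threshold is reached. The following is only a minimal sketch of how a machine setup script might apply such a bump; the ":30:128" rate and full values are illustrative assumptions, and only the 10-to-64 change comes from this PR.

```
# Raise the cap on concurrent unauthenticated SSH connections (MaxStartups).
# Only the 10 -> 64 start value is from this PR; the ":30:128" rate/full
# values are illustrative assumptions.
echo "MaxStartups 64:30:128" | sudo tee -a /etc/ssh/sshd_config
sudo service ssh reload   # the service may be named sshd on some distros
```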
@cockroach-teamcity (Member)

This change is Reviewable

@petermattis petermattis (Collaborator) left a comment

:lgtm: though why do you suspect we're performing so many concurrent ssh connections? roachprod and roachtest will open up a number of connections to a host during a test, but most of them are done serially.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @petermattis)

@tbg (Member)

tbg commented Apr 22, 2019

I think @bdarnell pointed out that the VMs usually get port scans as soon as they come up; perhaps we're just sometimes seeing close to 10 scanbots brute-forcing SSH logins.
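
One quick way to sanity-check the scanbot theory on a live VM would be something like the following hypothetical spot check, assuming the standard Debian/Ubuntu auth log location (RHEL-family hosts use /var/log/secure):

```
# Count sshd log lines for unauthenticated probes, grouped by source IP.
sudo grep -E "Invalid user|Failed password" /var/log/auth.log \
  | grep -oE "from ([0-9]{1,3}\.){3}[0-9]{1,3}" \
  | sort | uniq -c | sort -rn | head
```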

@bdarnell bdarnell (Contributor) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/cmd/roachprod/vm/aws/support.go, line 91 at r1 (raw file):

# sshguard can prevent frequent ssh connections to the same host. Disable it.
sudo service sshguard stop

If we're getting port scanned, I think we want to keep sshguard enabled. Otherwise the attackers will be able to take up as many of those connection slots as they want. Maybe we should be reenabling sshguard on gce instead of turning it off on aws?

If sshguard is blocking our own tools, we should figure out how to reuse connections (which may be as easy as turning on ControlMaster).
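
If connection reuse does turn out to be the better fix, a minimal sketch of what turning on ControlMaster could look like for the connecting user follows; the socket path and persistence timeout here are assumptions, not roachprod's actual configuration:

```
# Multiplex SSH sessions over one authenticated connection per host,
# so repeated commands reuse an existing connection instead of opening
# new unauthenticated ones. Paths and timeout are placeholder choices.
mkdir -p ~/.ssh/sockets
cat >> ~/.ssh/config <<'EOF'
Host *
  ControlMaster auto
  ControlPath ~/.ssh/sockets/%r@%h-%p
  ControlPersist 10m
EOF
chmod 600 ~/.ssh/config
```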

@tbg (Member)

tbg commented Apr 23, 2019

FWIW, in this test #36412 the number of ..... indicates that it took quite a while until we got the connection closed out from under us. Isn't that incongruent with the theory here? I'd expect the start to either succeed within a reasonable amount of time, or to be clipped early.

BTW, the gossip test is interesting because it looks like it could be misconstrued as a DDoS:

for j := 0; j < 100; j++ {
    deadNode = nodes.randNode()[0]
    c.Stop(ctx, c.Node(deadNode))
    waitForGossip()
    c.Start(ctx, t, c.Node(deadNode))
}

@tbg (Member)

tbg commented Apr 23, 2019

Here's one that I haven't seen so far:

   3: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine [email protected]:./cockroach [email protected]:./cockroach
		Connection reset by 35.192.62.250 port 22

From #36759

@tbg (Member)

tbg commented Apr 23, 2019

I went and linked some other issues to this PR. Not necessarily all the same, though.

@bdarnell (Contributor)

FWIW, in this test #36412 the number of ..... indicates that it took quite a while until we got the connection closed out from under us. Isn't that incongruent with the theory here? I'd expect the start to either succeed within a reasonable amount of time, or to be clipped early.

Yeah, that one would not be helped by this change. This change would only affect failures without any other output, I think.

Here's one that I haven't seen so far: "Connection reset by 35.192.62.250 port 22" (from #36759)

That's the same symptom I saw on port 26257 in the SYN cookie issue (#36745). However, the dmesg logs here do not show SYN cookie activations, so that's not the issue here.
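
For anyone chasing the same symptom, the kind of dmesg check being referred to is roughly the following (a hypothetical spot check, not a command quoted from the linked issue):

```
# SYN-cookie activations, if any, show up in the kernel log as
# "Possible SYN flooding on port <N>. Sending cookies."
dmesg -T | grep -i "SYN flooding"
```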

@ajwerner (Contributor)

One place where we do create a large number of connections is when we set up the known_hosts files for each host:

c.Parallel("scanning hosts", len(c.Nodes), 0, func(i int) ([]byte, error) {

Each node in parallel will create an ssh session with every other host.

That being said, some of these referenced failures (#36759) happen during roachprod put, for which I have less intuition about how that would create too many connections. The roachmart test is a 9-node cluster, so even with treedist it shouldn't create too many connections.

@petermattis (Collaborator)

Each node in parallel will create an ssh session with every other host.

Hmm, I wonder if we could do the ssh-keyscan on only one machine and then push out the .ssh/known_hosts file to the others.
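
A rough sketch of that idea, with placeholder host names and paths rather than the eventual roachprod implementation:

```
# Hypothetical sketch: scan every host key once, from a single machine,
# then distribute the resulting known_hosts file to all nodes.
HOSTS=(node1 node2 node3)            # placeholder host names
ssh-keyscan -T 60 "${HOSTS[@]}" > known_hosts
for h in "${HOSTS[@]}"; do
  scp -o StrictHostKeyChecking=no known_hosts "$h:.ssh/known_hosts"
done
```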

@bdarnell (Contributor)

The roachmart test is a 9-node cluster, so even with treedist it shouldn't create too many connections.

Treedist wouldn't put us over the 10-connection limit on its own, but it would (I think) use 8, and then a couple of extra connections (port scans, or parts of roachprod/roachtest we're not thinking about at the moment) could put it over the top.

Hmm, I wonder if we could do the ssh-keyscan on only one machine and then push out the .ssh/known_hosts file to the others.

Good idea.

In any case, the default limit of 10 seems really low so I think it's a good idea to increase it no matter what else we do.

@petermattis (Collaborator)

In any case, the default limit of 10 seems really low so I think it's a good idea to increase it no matter what else we do.

Agreed!

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Apr 24, 2019
Before this PR we would run ssh-keyscan in parallel on all of the hosts.
I suspect the retry loop in the keyscan script is to deal with the connection
failures which cockroachdb#37001 attempts to mitigate but I don't feel eager to rip them
out right now. In addition to performing the keyscan from a single host and
distributing to all, this change also distributes the known_hosts file to the
shared user in anticipation of further changes to enable the shared user to
become the default user for roachprod commands with gce.

Release note: None
@ajwerner (Contributor)

Each node in parallel will create an ssh session with every other host.

Hmm, I wonder if we could do the ssh-keyscan on only one machine and then push out the .ssh/known_hosts file to the others.

#37077

craig bot pushed a commit that referenced this pull request Apr 24, 2019
37077: roachprod: keyscan from a single host and distribute known_hosts r=ajwerner a=ajwerner

Before this PR we would run ssh-keyscan in parallel on all of the hosts.
I suspect the retry loop in the keyscan script is to deal with the connection
failures which #37001 attempts to mitigate but I don't feel eager to rip them
out right now. In addition to performing the keyscan from a single host and
distributing to all, this change also distributes the known_hosts file to the
shared user in anticipation of further changes to enable the shared user to
become the default user for roachprod commands with gce.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@nvanbenschoten nvanbenschoten (Member, Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mberhault)


pkg/cmd/roachprod/vm/aws/support.go, line 91 at r1 (raw file):
@mberhault introduced this in cockroachdb/roachprod@1f3f698. He says:

sshguard will block ssh after logging in a few times in succession (maybe 3 times a minute? no idea what the defaults are). This got triggered when using ssh to configure a cluster where we performed each step across all nodes, then moved on to the next step across all nodes. This would turn into multiple ssh sessions in quick succession and get blocked.

Re-enabling sshguard might trigger more of these obscure ssh failures, so I'm inclined to keep it disabled until we find that we need it with this increased number of unauthenticated connections. Does that sound reasonable?

@bdarnell bdarnell (Contributor) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/cmd/roachprod/vm/aws/support.go, line 91 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

@mberhault introduced this in cockroachdb/roachprod@1f3f698. He says:

sshguard will block ssh after logging in a few times in succession (maybe 3 times a minute? no idea what the defaults are). This got triggered when using ssh to configure a cluster where we performed each step across all nodes, then moved on to the next step across all nodes. This would turn into multiple ssh sessions in quick succession and get blocked.

Re-enabling sshguard might trigger more of these obscure ssh failures, so I'm inclined to keep it disabled until we find that we need it with this increased number of unauthenticated connections. Does that sound reasonable?

ok

@nvanbenschoten (Member, Author)

bors r+

@craig (Contributor)

craig bot commented Apr 24, 2019

Build failed (retrying...)

@craig (Contributor)

craig bot commented Apr 24, 2019

Build failed (retrying...)

@craig (Contributor)

craig bot commented Apr 24, 2019

Build failed

@nvanbenschoten (Member, Author)

This failed because of #37085, which I happened to already be looking at (fix is #37105). I'll give it one more go before waiting.

bors r+

craig bot pushed a commit that referenced this pull request Apr 24, 2019
37001: roachprod: increase concurrent unauthenticated SSH connections r=nvanbenschoten a=nvanbenschoten

This PR bumps the permitted number of concurrent unauthenticated
SSH connections from 10 to 64. Above this limit, sshd starts
randomly dropping connections. It's possible this is what we have
been running into with the frequent "Connection closed by remote
host" errors in roachtests.

See:
- https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Load_Balancing
- http://edoceo.com/notabene/ssh-exchange-identification

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig (Contributor)

craig bot commented Apr 25, 2019

Build succeeded

@craig craig bot merged commit 24cca0f into cockroachdb:master Apr 25, 2019
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/sshLimit branch May 2, 2019 17:40
tbg added a commit to tbg/cockroach that referenced this pull request May 9, 2019
See cockroachdb#36929. Whenever these flakes happen, it'll be good to have verbose
logs. Anecdotally we're seeing fewer of them now, perhaps due to cockroachdb#37001.

Release note: None
craig bot pushed a commit that referenced this pull request May 9, 2019
37424: roachprod: verbose sshd logging r=ajwerner a=tbg

See #36929. Whenever these flakes happen, it'll be good to have verbose
logs. Anecdotally we're seeing fewer of them now, perhaps due to #37001.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 24, 2019
Fixes cockroachdb#38785.
Fixes cockroachdb#35326.

Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at cockroachdb#37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.

The underlying issue in cockroachdb#38785 and in cockroachdb#35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None
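
As a shell-level illustration of the mitigation described in that commit message (the actual change lives in roachtest's Go code; the cluster name and duration below are placeholders):

```
# Bound a roachprod invocation so a hung SSH connection surfaces as a
# timeout error instead of wedging the test.
CLUSTER=my-test-cluster   # placeholder cluster name
if ! timeout --signal=INT 5m roachprod start "$CLUSTER"; then
  echo "roachprod start timed out or failed" >&2
fi
```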
craig bot pushed a commit that referenced this pull request Sep 24, 2019
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at #37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.

The underlying issue in #38785 and in #35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734.
Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config,
  the error message would not be clear.
- the licensing code would deadlock silently on OSS-only
  builds (because the license failure channel was not written in that
  control branch).
- the error/warning messages would be interleaved on the same line as
  the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the set up of the example database and workload would be performed
  asynchronously, with unclear signalling of when the user
  can expect to use them interactively.

After this patch:
- it's possible to override the license acquisition URL with
  COCKROACH_DEMO_LICENSE_URL, this is used in tests.
- setting up the example database, partitioning and workload is done
  before presenting the interactive prompt.
- partitioning the example database, if requested by
  --geo-partitioned-replicas, waits for license acquisition to
  complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early(earlier).

For example:

- OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an
unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with
enterprise features in firewalled network environments, it is now
possible to redirect the license acquisition with the environment
variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for
example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>