transports/ssh: support proxy_jump #4951
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #4951 +/- ##
===========================================
- Coverage 80.12% 80.12% -0.00%
===========================================
Files 515 515
Lines 36699 36742 +43
===========================================
+ Hits 29402 29435 +33
- Misses 7297 7307 +10
Thanks for the addition @dev-zero! From my side this is very welcome. We should just think about how best to avoid confusing new users with two ways of configuration, and guide them to the new proxyjump command, both in the docs and through the verdi CLI. I could even imagine deprecating (and eventually removing) the proxycommand option in the CLI, i.e. just keeping support for it in the Python API for existing profiles. But maybe keeping both is better...
The proxycommand may be needed for some more exotic cases where people are tunneling through
Oh, and how beautifully it works :-) Just submitted 100+ calculations, not a single wait, no workers hitting 100% CPU.
(force-pushed from 6767b22 to 210b199)
@ltalirz ok, updated the docs and help strings.
(force-pushed from 210b199 to 6afe61f)
Thanks! With "docs" are you referring to the aiida-core documentation? By the way, do I understand correctly that for you this fix got rid of the symptom #4876 (comment)? Do you know whether it is due to the use of the … It seems to me that the separate process group was introduced by @greschd in 00ed7f6; perhaps he can comment on the reasoning behind it.
Ah, right, I thought more about the documentation strings for the CLI.
That, and all other SSH-related issues I've been seeing.
Guessing from paramiko/paramiko#1180 I'd say it is the implementation of
Wow, so using … In any case, in the short term it makes the proxyjump support even more pressing (if this does indeed get rid of the problem). @sphuber @giovannipizzi do you have any comments on the basics of the implementation here?
Interesting, glad you found the source of your problems. I have to say, I wouldn't have thought to look at the SSH connections in that infinite recursion error. The weird thing is, I have been using the current setup to submit calculations over SSH with proxy very heavily, as in tens of thousands of jobs, without problems. This was for the high-throughput project on Daint, which we connect to through Ela. The configuration of the computer authentication in AiiDA that we have been using looks like
Would you have expected this to also run into the issue you have been observing? Or is that the kind of setup that you referred to as
and so here
Yes, that's what I was referring to. In principle you wouldn't need the `netcat` in there, but could use the `-W` argument for `ssh` instead (which was my setup, and I believe also the one of some others). Maybe the behaviour changes depending on that specific configuration, for example the reaction to receiving a `SIGTERM`, or the buffering (in your case ssh forwards the `SIGTERM` to netcat, which probably terminates immediately, while ssh by itself when using `-W` tries to play it a bit safer). The `~/.ssh/config` may also play a role: for example if you have a `MasterSocket`, then you may get a corresponding log message from SSH on stderr.
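For reference, the two legacy ProxyCommand variants discussed here and the new ProxyJump directive, as they would look in `~/.ssh/config` (hostnames are placeholders, not the actual CSCS configuration):

```
# Legacy: spawn an extra ssh process that runs netcat on the jump host
Host cluster-via-nc
    HostName cluster.example.org
    ProxyCommand ssh jump.example.org nc %h %p

# Legacy: same idea, but let ssh itself forward stdio (no netcat needed)
Host cluster-via-w
    HostName cluster.example.org
    ProxyCommand ssh -W %h:%p jump.example.org

# Modern (OpenSSH >= 7.3): no extra process per hop
Host cluster
    HostName cluster.example.org
    ProxyJump jump.example.org
```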
What puzzled me was the fact that my workers were stalling, meaning 100% CPU but not processing any jobs. Stopping the daemon generated a timeout, and when I checked the processes, the workers were still running, with one SSH zombie process each. Zombie processes usually mean that the caller didn't follow "good protocol" (e.g. doing only a `.wait()` instead of a `.communicate()`, which is bad if the kernel is still waiting for the process to drain a buffer). Following how Paramiko does communication a bit further (namely using only `.recv()`) led me to believe that they are driving the `Popen` (disguised as a socket) in exactly the way the Python docs warn can lead to deadlocks, even though they seem to have been careful.
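The pipe-buffer deadlock mentioned here is easy to reproduce in isolation. A minimal sketch (not aiida-core or Paramiko code) of why `.communicate()` is the "good protocol" and a bare `.wait()` is not:

```python
import subprocess
import sys

# Child writes ~1 MiB to stdout; the OS pipe buffer (typically 64 KiB on
# Linux) fills up and the child blocks in write() until someone reads.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"],
    stdout=subprocess.PIPE,
)

# Calling child.wait() here instead would deadlock: the child is blocked
# writing, and we would be blocked waiting for it to exit. communicate()
# drains the pipe to EOF while waiting, avoiding the deadlock.
out, _ = child.communicate()
assert len(out) == 1 << 20
assert child.returncode == 0
```

This is exactly the scenario the Python `subprocess` docs warn about; a caller that only ever calls `.recv()`-style reads on one end while waiting can end up with zombie children when the buffer never drains.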
Doing a `setsid()` on the SSH `Popen` might also not be the best idea (it essentially daemonizes the process), but if I remember correctly that was needed to solve some other problem.
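What "doing a `setsid()` on the `Popen`" means in practice, as a hedged sketch (Unix-only; this illustrates the mechanism, not aiida-core's actual code): `start_new_session=True` runs `setsid()` in the child between fork and exec, so signals sent to the parent's process group no longer reach it.

```python
import os
import subprocess

# start_new_session=True makes the child call setsid(), detaching it
# from our session and controlling terminal.
child = subprocess.Popen(["sleep", "30"], start_new_session=True)

# The child is now its own session leader...
assert os.getsid(child.pid) == child.pid
# ...in a different session from us...
assert os.getsid(child.pid) != os.getsid(0)

# ...which also means a SIGTERM delivered to our process group will not
# reach it: it has to be terminated explicitly, or it lingers.
child.terminate()
child.wait()
```

That last point is plausibly why lingering/zombie ssh processes were observed: a daemonized child survives its parent's process group being signalled.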
I don't see how this patch solves the recursion problem per se, but I think the "setup" of deadlocked workers generated by this problem triggers other issues, while the improper termination of the workers is then probably responsible for the `Process unreachable` problems I've been seeing.
Stopping the daemon, reconfiguring the host to use `proxy_jump` and then restarting the daemon solved all the issues I have reported or seen reported w.r.t. SSH, CPU usage and recursion errors **so far**. I can say more once CSCS is up again, but yesterday the remaining ~200 jobs which were stuck in `Created` or awaiting upload, plus ~100 more steps, were handled within <1 min, while before it took >5 min to submit one single job.
Sorry that there's still a lot of speculation in some parts, but maybe we can fill those gaps together.
Thanks @dev-zero !
Since this adds support for a new feature, and in addition solves issues, I'm in favour of this!
I didn't check the implementation in detail but the idea seems good.
If we believe that the proxy jump is the "better" way of doing things, I agree with documenting it as the default suggestion in our docs (and just mentioning proxy_command for advanced users). I wouldn't deprecate the other one for now, as there might be use cases for it (the facts that proxy_command comes later and that its help text points to the proxy jump instead are already enough).
As a comment, I've also been using the proxy command a lot with Daint recently. However, in the past (I think AiiDA 0.x) I remember having (different? or maybe similar) issues with it, where (when using a different nc executable on Daint, and probably a different proxy_command string) the processes were remaining as zombies. This was happening on Daint (maybe in the end they were also happening on my computer), but I noticed because I was contacted by the admins, who saw hundreds of processes on the jump host Ela :-D
Final comment: it would be good to check the various YAML files we have lying around, to make sure that they either 1. continue working or 2. are updated to the new proxy jump. In particular for AiiDAlab (pinging @yakutovicha and @csadorf).
On this PR: I would add changes to the docs as discussed, as well as tests on the CLI to test the new CLI option.
for completeness the
and the
Update: one likely has to explicitly close the
There's one caveat to the solution: since this relies on more Paramiko SSH connections (one per proxy plus one for the target machine), I seem to spawn too many threads. From
which causes:
My guess is a combination of:
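A quick way to see the thread multiplication described here (a hedged diagnostic sketch, not part of the patch): each `paramiko.Transport` runs its own worker thread, so a chain of N hops opened by each of M daemon workers multiplies quickly.

```python
import threading

def thread_report():
    """Return (count, names) of live threads in this interpreter.

    Useful for spotting leaked paramiko.Transport worker threads: with
    proxy_jump, each hop of each open connection contributes one.
    """
    threads = threading.enumerate()
    return len(threads), sorted(t.name for t in threads)

count, names = thread_report()
assert count >= 1          # at least the main thread is always alive
assert "MainThread" in names
```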
(force-pushed from 0a28f31 to 7ee5b58, then from 7ee5b58 to 01b343c)
Ok, all done. The following warnings are, as far as I can tell, reported from bogus locations, since the CI run from the commit before showed the same messages, simply emitted from a different location. I also ran the reported test case via a script which monkey-patched file and socket open/close and printed tracebacks. NEW:
PREVIOUS:
Also, those warnings from the previous commit don't show up in this PR's CI run, and there was nothing in between which could have fixed them.
Since I was leaking threads/file descriptors/whatnot in an earlier version of this patch, I stumbled (via HN) over this blog article about file descriptors (and how a lot of other things like sockets and timers are file descriptors nowadays): http://0pointer.net/blog/file-descriptor-limits.html

$ for pid in $(pgrep verdi) ; do echo "PID = $pid with $(ls /proc/$pid/fd/ | wc -l) file descriptors" ; done

Might be something to extend
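The shell one-liner above has a straightforward Python equivalent, should one want to build such a check into a status command. A hedged sketch (Linux-only, since it relies on `/proc`; the function name is made up here):

```python
import os

def count_open_fds(pid="self"):
    """Count open file descriptors of a process by listing /proc/<pid>/fd.

    Works only on Linux; raises FileNotFoundError elsewhere or if the
    process does not exist.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))

n = count_open_fds()
assert n >= 3  # stdin, stdout and stderr are open at minimum
```

Comparing this number against `resource.getrlimit(resource.RLIMIT_NOFILE)` would flag a worker approaching its descriptor limit.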
Interesting... I briefly read the link. So we should make sure we don't use a limit over 1024, at least as long as we use paramiko? Ok for me to add to verdi test if it's important to check. @sphuber @ramirezfranciscof @chrisjsewell what do you think?
Correct. There might be other plugins which still use
@ltalirz I wasn't happy with the documentation and redid it. I hope it is clearer now.
Please squash commits on merge.
Sorry for not having looked at this earlier, but I'm still having a quick look now, so don't merge yet! 😅
Thanks a lot for this, @dev-zero! We've been having issues with using ProxyCommand when connecting to CSCS clusters, so having support for ProxyJump will be a big plus.
I haven't reviewed the code in much detail; I trust @ltalirz has that covered. I just wanted to try this out by following the documentation to see if everything is clear. However, I haven't been able to make this work for my connection to Daint. Maybe I messed up somewhere (it's a Sunday after all)? Here's the relevant piece in the ~/.ssh/config:
$ grep -A 8 daint-test ~/.ssh/config
Host daint-test
HostName daint.cscs.ch
User mbercx
IdentityFile ~/.ssh/id_rsa-daint
ProxyJump [email protected]
Host ela.cscs.ch
IdentityFile ~/.ssh/id_rsa-daint
ssh daint-test works fine. I first set up the computer with:
$ verdi computer setup --config computer/daint-test.yaml
Success: Computer<5> daint-test created
Info: Note: before the computer can be used, it has to be configured with the command:
Info: verdi computer configure ssh daint-test
where
$ more computer/daint-test.yaml
---
label: daint-test
description: Piz Daint with XC50 Haswell GPU (12 CPUs, 1 GPU) - PRACE project
hostname: daint.cscs.ch
transport: ssh
scheduler: slurm
shebang: '#!/bin/bash'
work_dir: /scratch/snx3000/{username}/aiida
mpirun_command: srun -u -n {tot_num_mpiprocs} --hint=nomultithread --unbuffered
mpiprocs_per_machine: 12
prepend_text: |
#SBATCH -A pr110
#SBATCH -C gpu
append_text: ' '
I then configured the ssh transport interactively:
$ verdi computer configure ssh daint-test
Info: enter "?" for help
Info: enter "!" to ignore the default and set no value
User name [mbercx]:
Port number [22]:
Look for keys [True]:
SSH key file []:
Connection timeout in s [60]:
Allow ssh agent [True]:
SSH proxy jump []: [email protected]
SSH proxy command []:
Compress file transfers [True]:
GSS auth [False]:
GSS kex [False]:
GSS deleg_creds [False]:
GSS host [daint.cscs.ch]:
Load system host keys [True]:
Key policy (RejectPolicy, WarningPolicy, AutoAddPolicy) [RejectPolicy]:
Use login shell when executing command [True]:
Connection cooldown time (s) [30.0]:
Info: Configuring computer daint-test for user [email protected].
Success: daint-test successfully configured for [email protected]
When trying to run a job, it gets stuck in the upload step:
$ verdi process list
PK Created Process label Process State Process status
---- --------- --------------- --------------- ----------------------------------
559 5m ago PwBaseWorkChain ⏵ Waiting Waiting for child processes: 564
564 5m ago PwCalculation ⏵ Waiting Waiting for transport task: upload
Total results: 2
Info: last time an entry changed state: 5m ago (at 17:25:28 on 2021-06-06)
And the report has the following error message:
$ verdi process report 564
*** 564: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 1 LOG MESSAGES:
+-> ERROR at 2021-06-06 17:25:59.102930+00:00
| Traceback (most recent call last):
| File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
| result = await coro()
| File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 80, in do_upload
| transport = await cancellable.with_interrupt(request)
| File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
| result = await next(wait_iter)
| File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
| return f.result() # May raise f.exception().
| File "/usr/lib/python3.8/asyncio/futures.py", line 175, in result
| raise self._exception
| File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/engine/transports.py", line 89, in do_open
| transport.open()
| File "/home/mbercx/envs/aiida-dev/code/aiida-core/aiida/transports/plugins/ssh.py", line 502, in open
| proxy_client.connect(proxy['host'], **proxy_connargs)
| File "/home/mbercx/.virtualenvs/aiida-dev/lib/python3.8/site-packages/paramiko/client.py", line 435, in connect
| self._auth(
| File "/home/mbercx/.virtualenvs/aiida-dev/lib/python3.8/site-packages/paramiko/client.py", line 765, in _auth
| raise SSHException("No authentication methods available")
| paramiko.ssh_exception.SSHException: No authentication methods available
My OpenSSH version is 7.6:
$ ssh -V
OpenSSH_7.6p1 Ubuntu-4ubuntu0.3, OpenSSL 1.0.2n 7 Dec 2017
And I'm using Ubuntu 18.04.5 LTS:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
Let me know if I messed up somewhere! In this case, sorry for delaying the merge. 😅
I've also left some minor comments re the documentation.
Co-authored-by: Marnik Bercx <[email protected]>
@mbercx difficult to say what went wrong, since the error from the log is missing, which would contain more information (especially the final login information used to log in to the proxy). Here's my config and the output of
Hi @ltalirz and @dev-zero! So, it turns out the issue was related to the fact that I was relying on look_for_keys: True, i.e. when configuring the test computer with the following YAML file:
username: mbercx
port: 22
look_for_keys: True
key_filename: ""
timeout: 60
allow_agent: True
proxy_jump: [email protected]
proxy_command: ""
compress: True
gss_auth: False
gss_kex: False
gss_deleg_creds: False
gss_host: daint.cscs.ch
load_system_host_keys: True
key_policy: RejectPolicy
use_login_shell: True
safe_interval: 30.0
Sidenote: I'm using double quotes to avoid prompts when configuring the computer with the --config option. Can you see any issues with that? 🤔
This resulted in the same error I also get when trying to run calculations, when I do:
$ verdi computer test daint-test
Info: Testing computer<daint-test> for user<[email protected]>...
* Opening connection... 07/14/2021 06:24:16 AM <31330> aiida.transport.SshTransport: [ERROR] Error connecting to proxy 'ela.cscs.ch' through SSH:
[SshTransport] No authentication methods available, connect_args were: {'username': 'mbercx', 'port': 22, 'look_for_keys': True, 'timeout': 60, 'allow_agent': True, 'compress': True, 'gss_auth': False, 'gss_kex': False, 'gss_deleg_creds': False, 'gss_host': 'daint.cscs.ch'}
[FAILED]: Error while trying to connect to the computer
Use the `--print-traceback` option to see the full traceback.
Warning: 1 out of 0 tests failed
(I added another carriage return to make sure the No authentication methods available message is clearly visible.)
This may be related to the fact that I have a rather large number of keys in my ~/.ssh/ folder. If I specify the correct key in the filename:
key_filename: /home/mbercx/.ssh/id_rsa-daint
and reconfigure the computer, all is well on the western front:
$ verdi computer test daint-test
Info: Testing computer<daint-test> for user<[email protected]>...
* Opening connection... [OK]
* Checking for spurious output... [OK]
* Getting number of jobs from scheduler... 07/14/2021 06:33:20 AM <32573> aiida.scheduler.slurm: [WARNING] Unrecognized job_state 'RD' for job id 31333865
[OK]: 1887 jobs found in the queue
* Determining remote user name... [OK]: mbercx
* Creating and deleting temporary file... [OK]
Success: all 5 tests succeeded
I also tried submitting some calculations, and didn't encounter any more issues!
Wrt the large number of keys: AFAIK the SSH server considers every login attempt with a key it doesn't know as an unsuccessful login attempt, and will likely disconnect after 3 attempts, probably resulting in the error message you see. If you have been using
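For plain OpenSSH (paramiko does not read this option), the usual remedy for the too-many-keys disconnect is to pin the key for the host and stop offering the rest. A hedged config sketch, mirroring the jump host used in this thread:

```
Host ela.cscs.ch
    IdentityFile ~/.ssh/id_rsa-daint
    # Offer only the key(s) named above instead of every key the
    # agent or ~/.ssh/ holds, so failed attempts don't pile up
    IdentitiesOnly yes
```

On the AiiDA side, setting `key_filename` explicitly (as done above) achieves the same effect for paramiko's `look_for_keys` behaviour.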
Right, makes sense, thanks @dev-zero! This all looks good to me. Just updated the branch; can you (squash and) merge? I typically prefer to write my own commit messages, so I'll leave that to you. 🙃
@mbercx I'm not authorized to merge
Alright, happy to write a short commit message, but since it'll be associated with your account, maybe you'd like to write it yourself so I can add it when I merge?
@dev-zero just gave you maintainer rights for the aiida-core repo; let me know if you're still unable to merge
@ltalirz thanks, looks good
SSH provides multiple ways to forward connections. The legacy way is via the SSH ProxyCommand directive, which spawns a separate process for each jump host/proxy. Controlling those processes is error-prone, and lingering/hanging processes have been observed (#4940 and others, depending on the setup).

This commit adds support for the SSH ProxyJump feature, which makes it possible to set up an arbitrary number of proxy jumps without additional processes, by creating TCP channels over existing (Paramiko) connections. This gives good control over the lifetime of the different connections, and since a user's SSH config is not re-read after the initial setup, it provides a controlled environment. Hence it has been decided to make this new directive the recommended default in the documentation, while still supporting both ways.

Co-authored-by: Marnik Bercx <[email protected]>
Co-authored-by: Leopold Talirz <[email protected]>
Cherry-pick: da179dc
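The "TCP channels over existing connections" mechanism the commit message describes can be sketched with Paramiko. The helpers `parse_proxy_jump` and `connect_via_jumps` below are hypothetical (not aiida-core's actual code); the Paramiko calls they use (`SSHClient.connect(sock=...)` and `Transport.open_channel("direct-tcpip", ...)`) are the real API this relies on.

```python
def parse_proxy_jump(spec):
    """Split an OpenSSH-style ProxyJump value such as
    'user1@jump1,jump2:2222' into (username, host, port) triples.

    Hypothetical helper; does not handle bracketed IPv6 addresses.
    """
    hops = []
    for hop in spec.split(","):
        username = None
        if "@" in hop:
            username, hop = hop.split("@", 1)
        port = 22
        if ":" in hop:
            hop, port = hop.rsplit(":", 1)
            port = int(port)
        hops.append((username, hop, port))
    return hops

def connect_via_jumps(target, jump_spec, **connect_kwargs):
    """Open an SSHClient to `target`, tunnelling a direct-tcpip channel
    through each jump host instead of spawning proxy processes."""
    import paramiko  # imported lazily; requires a reachable SSH chain to run

    sock = None
    clients = []
    hops = parse_proxy_jump(jump_spec) + [(None, target, 22)]
    for i, (user, host, port) in enumerate(hops):
        client = paramiko.SSHClient()
        client.load_system_host_keys()
        # sock=None for the first hop (plain TCP); afterwards it is a
        # channel tunnelled through the previous hop's transport.
        client.connect(host, port=port, username=user, sock=sock,
                       **connect_kwargs)
        clients.append(client)
        if i < len(hops) - 1:
            nxt_user, nxt_host, nxt_port = hops[i + 1]
            sock = client.get_transport().open_channel(
                "direct-tcpip", (nxt_host, nxt_port), ("", 0))
    return clients[-1]

assert parse_proxy_jump("user1@jump1,jump2:2222") == [
    ("user1", "jump1", 22), (None, "jump2", 2222)]
```

Because every hop lives inside the previous Paramiko transport, closing the outermost client tears down the whole chain, which is exactly the lifetime control the commit message claims over the process-per-hop ProxyCommand approach.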