Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First crude smoke test for SSHRemoteIO #69

Merged
merged 5 commits into from
Sep 20, 2023
Merged

Conversation

mih
Copy link
Member

@mih mih commented Sep 13, 2023

See if the whole shebang of identityfile, custom username, with custom port still yields a working IO abstraction.

This changeset includes a patch to datalad-core:

This patch aims to fix a hanging Python sessions after the execution
of an SSH remote command call with no particular stdin input.

Interpretation from #68

Without overriding stdin, the subprocess, i.e. ssh, and python share
the same file pointer. It seems that stdin is configured in a way that
unexpected by the interpreter and messes with python's way to read from
sys.stdin.

This patch passes an explicit b'' as stdin to the SSH client
execution process to effectively achieve a separate fiel descriptor for
that client process.

This patch should not interfere with the implementation of the sshrun
command in datalad-core. It uses a dedicated not-None value for any
execution. However, the compatibility and interference of this patch
should be subject to a thorough investigation and widespread testing
before this changeset is proposed for datalad-core.

Closes #68

@mih mih mentioned this pull request Sep 13, 2023
@mih mih force-pushed the tst-remoteio branch 2 times, most recently from 483ce89 to 88401a6 Compare September 18, 2023 08:31
@mih
Copy link
Member Author

mih commented Sep 18, 2023

Tests cannot run ATM, because of some certificate expiration issue on the windows working images of appveyor. Time will fix it, I guess.

@adswa
Copy link
Member

adswa commented Sep 18, 2023

I have just a preliminary observation: The test in this PR here doesn't run through on my Windows machine.

The patch is applied ([DEBUG] Apply patch to datalad.support.sshconnector._exec_ssh), but it stops doing anything after the DEBUG message "Estimating digests for ...", and I need to kill the Python process in the task manager.

(py39) C:\Users\adina\repos\datalad-ria>pytest -s -v datalad_ria\tests\test_ssh_remote_io.py
[DEBUG] Enable posting DataLad config overrides CLI/ENV as GIT_CONFIG items in process ENV
[DEBUG] Not retro-fitting GitRepo with deprecated symbols, datalad-deprecated package not found
[DEBUG] Apply datalad-next patch to annexrepo.py:AnnexRepo.enable_remote
[DEBUG] Building doc for <class 'datalad.local.configuration.Configuration'>
[DEBUG] Building doc for <class 'datalad_next.patches.configuration.Configuration'>
[DEBUG] Building doc for <class 'datalad.local.configuration.Configuration'>
[DEBUG] Failed to import requests_ftp, thus no ftp support: ModuleNotFoundError(No module named 'requests_ftp')
[DEBUG] Apply datalad-next patch to create_sibling_ghlike.py:_GitHubLike._set_request_headers
[DEBUG] Apply datalad-next patch to interface.(utils|base).py:_execute_command_
[DEBUG] Building doc for <class 'datalad.core.local.status.Status'>
[DEBUG] Building doc for <class 'datalad.core.local.diff.Diff'>
[DEBUG] Building doc for <class 'datalad.core.distributed.push.Push'>
[DEBUG] Apply patch to datalad.core.distributed.push._transfer_data
[DEBUG] Patching datalad.core.distributed.push.Push docstring and parameters
[DEBUG] Building doc for <class 'datalad.core.distributed.push.Push'>
[DEBUG] Patching datalad.support.AnnexRepo.get_export_records (new method)
[DEBUG] Apply patch to datalad.core.distributed.push._push
[DEBUG] Apply patch to datalad.distribution.siblings._enable_remote
[DEBUG] Building doc for <class 'datalad.distribution.update.Update'>
[DEBUG] Building doc for <class 'datalad.distribution.siblings.Siblings'>
[DEBUG] Replace special remote _main() with datalad-next's progress logging enabled variant
[DEBUG] Building doc for <class 'datalad.local.subdatasets.Subdatasets'>
[DEBUG] Building doc for <class 'datalad.distributed.create_sibling_gitlab.CreateSiblingGitlab'>
[DEBUG] Apply patch to datalad.distributed.create_sibling_gitlab._proc_dataset
[DEBUG] Stop advertising discontinued "hierarchy" layout for `create_siblign_gitlab()`
[DEBUG] Building doc for <class 'datalad.distributed.create_sibling_gitlab.CreateSiblingGitlab'>
[DEBUG] Apply patch to datalad.support.sshconnector._exec_ssh
[DEBUG] Run ['git', 'version'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['git', 'version'] with status 0
[DEBUG] Run ['git', 'annex', 'version', '--raw'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['git', 'annex', 'version', '--raw'] with status 0
[DEBUG] Created temporary directory named C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename ' ' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename '/' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename '|' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename ';&%b5{}\'"' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename ";&%b5{}'<" is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename ";&%b5{}'>" is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename ";&%b5{}' " is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Filename ";&%b5{}'.datc " is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_iq2vzp3q
[DEBUG] Tested 14 obscure filename candidates. The winner: ";&%b5{}'.datc"
[DEBUG] Processing entrypoints
[DEBUG] Done processing entrypoints
[DEBUG] Building doc for <class 'datalad.core.local.create.Create'>
[DEBUG] Building doc for <class 'datalad.core.local.save.Save'>
[DEBUG] Building doc for <class 'datalad.core.distributed.clone.Clone'>
[DEBUG] Building doc for <class 'datalad.distribution.get.Get'>
[DEBUG] Building doc for <class 'datalad.distribution.install.Install'>
[DEBUG] Building doc for <class 'datalad.local.unlock.Unlock'>
[DEBUG] Building doc for <class 'datalad.core.local.run.Run'>
[DEBUG] Building doc for <class 'datalad.distributed.create_sibling_github.CreateSiblingGithub'>
[DEBUG] Building doc for <class 'datalad.distributed.create_sibling_gogs.CreateSiblingGogs'>
[DEBUG] Building doc for <class 'datalad.distributed.create_sibling_gin.CreateSiblingGin'>
[DEBUG] Building doc for <class 'datalad.distributed.create_sibling_gitea.CreateSiblingGitea'>
[DEBUG] Building doc for <class 'datalad.distributed.create_sibling_ria.CreateSiblingRia'>
[DEBUG] Building doc for <class 'datalad.distribution.create_sibling.CreateSibling'>
[DEBUG] Building doc for <class 'datalad.distributed.drop.Drop'>
[DEBUG] Building doc for <class 'datalad.local.remove.Remove'>
[DEBUG] Building doc for <class 'datalad.local.addurls.Addurls'>
[DEBUG] Building doc for <class 'datalad.local.copy_file.CopyFile'>
[DEBUG] Building doc for <class 'datalad.local.download_url.DownloadURL'>
[DEBUG] Building doc for <class 'datalad.local.foreach_dataset.ForEachDataset'>
[DEBUG] Building doc for <class 'datalad.local.rerun.Rerun'>
[DEBUG] Building doc for <class 'datalad.local.run_procedure.RunProcedure'>
[DEBUG] Building doc for <class 'datalad.local.wtf.WTF'>
[DEBUG] Building doc for <class 'datalad.local.clean.Clean'>
[DEBUG] Building doc for <class 'datalad.local.add_archive_content.AddArchiveContent'>
[DEBUG] Building doc for <class 'datalad.local.add_readme.AddReadme'>
[DEBUG] Building doc for <class 'datalad.local.export_archive.ExportArchive'>
[DEBUG] Building doc for <class 'datalad.distributed.export_archive_ora.ExportArchiveORA'>
[DEBUG] Building doc for <class 'datalad.distributed.export_to_figshare.ExportToFigshare'>
[DEBUG] Building doc for <class 'datalad.local.no_annex.NoAnnex'>
[DEBUG] Building doc for <class 'datalad.local.check_dates.CheckDates'>
[DEBUG] Building doc for <class 'datalad.distribution.uninstall.Uninstall'>
[DEBUG] Building doc for <class 'datalad.distribution.create_test_dataset.CreateTestDataset'>
[DEBUG] Building doc for <class 'datalad.support.sshrun.SSHRun'>
[DEBUG] Building doc for <class 'datalad.interface.shell_completion.ShellCompletion'>
[DEBUG] Processing entrypoints
[DEBUG] Loading entrypoint ria from datalad.extensions
[DEBUG] Building doc for <class 'datalad_ria.create_sibling_ria.CreateSiblingRia'>
[DEBUG] Loaded entrypoint ria from datalad.extensions
[DEBUG] Loading entrypoint next from datalad.extensions
[DEBUG] Building doc for <class 'datalad_next.commands.credentials.Credentials'>
[DEBUG] Building doc for <class 'datalad_next.commands.create_sibling_webdav.CreateSiblingWebDAV'>
[DEBUG] Building doc for <class 'datalad_next.commands.tree.TreeCommand'>
[DEBUG] Building doc for <class 'datalad_next.commands.download.Download'>
[DEBUG] Building doc for <class 'datalad_next.commands.ls_file_collection.LsFileCollection'>
[DEBUG] Loaded entrypoint next from datalad.extensions
[DEBUG] Done processing entrypoints
=================================================================== test session starts ====================================================================
platform win32 -- Python 3.9.0, pytest-7.3.2, pluggy-1.0.0 -- C:\Users\adina\env\py39\Scripts\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\adina\repos\datalad-ria
plugins: cov-4.1.0
collected 1 item

datalad_ria/tests/test_ssh_remote_io.py::test_SSHRemoteIO_basics [DEBUG] Created temporary directory named C:\Users\adina\AppData\Local\Temp\datalad_temp_hz9tvufk
[DEBUG] Run ['git', '--git-dir=b:\\nul', 'config', '-z', '-l', '--show-origin'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['git', '--git-dir=b:\\nul', 'config', '-z', '-l', '--show-origin'] with status 0
[DEBUG] UI set to QuietConsoleLog(out=<TextIOWrapper>)
[DEBUG] Run ['git', 'version'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['git', 'version'] with status 0
[DEBUG] Not retro-fitting GitRepo with deprecated symbols, datalad-deprecated package not found
[DEBUG] Run ['git', 'annex', 'version', '--raw'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['git', 'annex', 'version', '--raw'] with status 0
[DEBUG] Created temporary directory named C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename ' ' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename '/' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename '|' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename ';&%b5{}\'"' is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename ";&%b5{}'<" is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename ";&%b5{}'>" is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename ";&%b5{}' " is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Filename ";&%b5{}'.datc " is not supported on Windows under C:\Users\adina\AppData\Local\Temp\datalad_temp_bnx7xftv
[DEBUG] Tested 14 obscure filename candidates. The winner: ";&%b5{}'.datc"
[DEBUG] chdir 'C:\\Users\\adina\\repos\\datalad-ria' -> 'C:\\Users\\adina\\AppData\\Local\\Temp\\httpserveu00j3d07'
[DEBUG] HTTP: serving C:\Users\adina\AppData\Local\Temp\httpserveu00j3d07 under http://127.0.0.1:64784/
[DEBUG] Estimating digests for C:\Users\adina\AppData\Local\Temp\datalad_temp_hz9tvufk\.gitconfig (cwd=None) h status 0  with status 0 ow-origin']

In a python session, things look as follows:

>>> ssh = SSHRemoteIO('ssh://sshuser@localhost:2222')

doesn't do anything (seemlingly) BUT I can CTRL-C it in the Python process and reuse the python session afterwards. So its not the same hanging as before.

>>> ssh = SSHRemoteIO('ssh://sshuser@localhost:2222')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\adina\env\py39\lib\site-packages\datalad\distributed\ora_remote.py", line 351, in __init__
    line = self.shell.stdout.readline()

@adswa
Copy link
Member

adswa commented Sep 18, 2023

            import pdb; pdb.set_trace()
349             self.shell.stdin.write(b"echo RIA-REMOTE-LOGIN-END\n")
350             self.shell.stdin.flush()
351             while True:
352  ->             line = self.shell.stdout.readline()
353                 if line == b"RIA-REMOTE-LOGIN-END\n":
354                     break
355             # TODO: Same for stderr?

If I get into this while loop in a debugger, stdout only contains empty lines, so it seems trapped in there.

@christian-monch
Copy link

christian-monch commented Sep 18, 2023

            import pdb; pdb.set_trace()
349             self.shell.stdin.write(b"echo RIA-REMOTE-LOGIN-END\n")
350             self.shell.stdin.flush()
351             while True:
352  ->             line = self.shell.stdout.readline()
353                 if line == b"RIA-REMOTE-LOGIN-END\n":
354                     break
355             # TODO: Same for stderr?

If I get into this while loop in a debugger, stdout only contains empty lines, so it seems trapped in there.

Thanks for tracking that down. readline() would return '' to indicate EOF. That happens on my Windows machine, if the ssh-subprocess, in self.shell, fails to connect to the server. Could that be the case here?

[Edit]

We should definitely check for that condition and probably raise an error, that shows the content of self.ssh.sshri.as_str().

@adswa
Copy link
Member

adswa commented Sep 18, 2023

Thanks for tracking that down. readline() would return '' to indicate EOF. That happens on my Windows machine, if the ssh-subprocess, in self.shell, fails to connect to the server. Could that be the case here?

I'm getting a few conflicting findings, maybe you can make sense of them:


In [1]: from datalad.distributed.ora_remote import SSHRemoteIO
[DEBUG] Not retro-fitting GitRepo with deprecated symbols, datalad-deprecated package not found

In [2]: SSHRemoteIO('ssh://sshuser@localhost:2222')
initing initialized
obtaining a shell
> c:\users\adina\env\py39\lib\site-packages\datalad\distributed\ora_remote.py(351)__init__()
-> self.shell.stdin.write(b"echo RIA-REMOTE-LOGIN-END\n")
(Pdb) p self.ssh.sshri.as_str()
'sshuser@localhost'
(Pdb) self.ssh
NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++40 chars++er')>>)
(Pdb) self.ssh.is_open()
False

I interpreted the "is_open() -> False" as weird and an indication that there could be a connection issue.
However, through methods like self.ssh.get_git_version I can send commands to that ssh server (in that specific case, it reports that git is not installed, but as that method doesn't declare the stdin/stdout pipes, the process hangs afterwards).

We should definitely check for that condition and probably raise an error, that shows the content of self.ssh.sshri.as_str().

the cmd with which subprocess.Popen() gets called looks suspicious to me because it has lost the port:

(Pdb) p cmd
['ssh', 'sshuser@localhost']

Edit: If I hardcode the port (sshuser@localhost:2222), I still end up in a seemingly endless loop.

@christian-monch
Copy link

christian-monch commented Sep 18, 2023

the cmd with which subprocess.Popen() gets called looks suspicious to me because it has lost the port:

(Pdb) p cmd
['ssh', 'sshuser@localhost']

That is definitely wrong. There is some old code, i.e. SSHRI.as_str(), which seems to be faulty. It is definitely missing a clause to check for port and some more things (see below). It has to be patched/corrected. If a port is contained in the destination, the destination should read ssh://sshuser@localhost:2222.

Edit: If I hardcode the port (sshuser@localhost:2222), I still end up in a seemingly endless loop.

Can you try with ssh://sshuser@localhost:2222? It works on my machine.

If that works, we just have to fix the (IMHO needlessly complicated) SSHRI code (or, if that would break datalad, bundle an updated version with a patched SSHRemoteIO.__init__()-function). Hopefully that is the last fix in datalad-core ;-)

@christian-monch
Copy link

christian-monch commented Sep 18, 2023

Edit: If I hardcode the port (sshuser@localhost:2222), I still end up in a seemingly endless loop.

We have to check whether the shell process has returned with a non-zero value in the loop. The self.shell.stdin.write() might fail if the ssh-process exits, but that is not guaranteed, due to race-conditions. The loop should look something like this:

        while True:
            status = self.shell.poll()
            if status not in (0, None):
                raise Exception(f'ssh exited with status {status}')     # TODO use proper exception class
            line = self.shell.stdout.readline()
            if line == '':
                raise Exception(f'ssh closed stdout unexpectedly')
            if line == b"RIA-REMOTE-LOGIN-END\n":
                break

That should leave the loop if any errors occur.

@christian-monch
Copy link

christian-monch commented Sep 18, 2023

I am preparing a commit, that includes the necessary patches. Let's see how that works.

@mih
Copy link
Member Author

mih commented Sep 19, 2023

re SSHRI and friends: My preference would be to drop all that code entirely (and eventually). SSH is perfectly capable of working with URLs, our URLOperations framework works with URLs, a user provides URLs. Any back and forth conversion to these RI seems needlessly complicated, and serves little purpose.

general note re "connection open" etc.: Please keep in mind that on Windows SSH does not do connection multiplexing. This means that there is never an open connection, and all this code was likely written with the assumption that such oddity cannot exist. I suspect that these flags have little meaning on windows and they are coded up to make-pretend and satisfy other code's assumptions on how the world should behave (but isn't).

@mih
Copy link
Member Author

mih commented Sep 19, 2023

I have extracted everything from this PR that is not narrowly focused on SSHRemoteIO into separate PRs:

They have been merged into main

@codecov-commenter
Copy link

codecov-commenter commented Sep 19, 2023

Codecov Report

Patch coverage: 94.11% and project coverage change: -0.60% ⚠️

Comparison is base (c93ece8) 97.53% compared to head (a102c53) 96.93%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #69      +/-   ##
==========================================
- Coverage   97.53%   96.93%   -0.60%     
==========================================
  Files          13       15       +2     
  Lines         162      196      +34     
==========================================
+ Hits          158      190      +32     
- Misses          4        6       +2     
Files Changed Coverage Δ
datalad_ria/patches/enabled.py 100.00% <ø> (ø)
datalad_ria/tests/utils.py 100.00% <ø> (ø)
datalad_ria/patches/sshremoteio.py 93.10% <93.10%> (ø)
datalad_ria/tests/test_ssh_remote_io.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mih
Copy link
Member Author

mih commented Sep 20, 2023

This is looking good now! I left a comment re documenting the path forward. But otherwise we should get this merged!

It would be good to rebase this branch, and squash the history to the relevant bits. Likely just two commits:

  1. The added test that documents that datalad-core functionality is broken
  2. The patch that makes the test pass

WDYT?

See if the whole shebang of identityfile, custom username, with custom
port still yields a working IO abstraction.
This commit adds a patch that replaces
SSHRemoteIO.__init__() with an udated version.
The updated version uses proper arguments to
open a the shell-ssh, if the ssh-connection
is an instance of NoMultiplexSSHConnection.
@christian-monch christian-monch merged commit 8b76391 into main Sep 20, 2023
3 checks passed
@christian-monch christian-monch deleted the tst-remoteio branch September 20, 2023 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SSHRemoteIO on windows hangs
4 participants