Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make special remote processes report progress logs to git-annex #329

Merged
merged 4 commits into from
May 3, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions changelog.d/20230503_133855_michael.hanke_progress.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
### 💫 Enhancements and new features

- Patch the process entrypoint of DataLad's git-annex special remote
implementations to funnel internal progress reporting to git-annex
via standard `PROGRESS` protocol messages. This makes it obsolete
(in many cases) to implement custom progress reporting, and the
use of the standard `log_progress()` helper (either directly or
indirectly) is sufficient to let both a parent DataLad process
or git-annex see progress reports from special remotes.
Fixes https://github.com/datalad/datalad-next/issues/328 via
https://github.com/datalad/datalad-next/pull/329 (by @mih)
1 change: 1 addition & 0 deletions datalad_next/patches/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,4 +8,5 @@
push_optimize,
siblings,
test_keyring,
customremotes_main,
)
136 changes: 136 additions & 0 deletions datalad_next/patches/customremotes_main.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
"""Connect ``log_progress``-style progress reporting to git-annex

This patch introduces a dedicated progress log handler as a proxy between
standard datalad progress logging and a git-annex special remote as
an approach to report (data transfer) progress to a git-annex parent process.

This functionality is only (to be) used in dedicated special remote processes.
"""

import logging
from typing import (
Dict,
Type,
)

from datalad_next.utils.patch import apply_patch
from datalad_next.annexremotes import SpecialRemote


def only_progress_logrecords(record: logging.LogRecord) -> bool:
"""Log filter to ignore any non-progress log message"""
return hasattr(record, 'dlm_progress')


class AnnexProgressLogHandler(logging.Handler):
"""Log handler to funnel progress logs to git-annex

For this purpose the handler wraps
:class:`datalad_next.annexremotes.SpecialRemote` instance.
When it receives progress log messages, it converts any
increment reports to absolute values, and then calls
the special remote's ``send_progress()`` method, which will
cause the respective progress update protocol message to
be issued.

.. note::

Git-annex only supports "context-free" progress reporting. When a
progress report is send, it is assumed to be on a currently running
transfer. Only a single integer value can be reported, and it
corresponds to the number of bytes transferred.

This approach implemented here cannot distinguish progress reports
that corresponding to git-annex triggered data transfers and other
(potentially co-occurring) operations. The likelihood of unrelated
operations reporting progress is relatively low, because this
handler is only supposed to be used in dedicated special remote
processes, but remains possible.

This implementation is set up to support tracking multiple
processes, and could report one of them selectively. However, at
present any progress update is relayed to git-annex directly.
This could lead to confusing and non-linear progress reporting.
"""
def __init__(self, annexremote: SpecialRemote):
super().__init__()
self.annexremote = annexremote
self._ptrackers: Dict[str, int] = {}

def emit(self, record: logging.LogRecord):
"""Process a log record

Any incoming log record, compliant with
http://docs.datalad.org/design/progress_reporting.html
is processed. Increment reports are converted to absolute
values, and each update is eventually passed on to special remote,
which issues a progress report to git-annex.
"""
if not hasattr(record, 'dlm_progress'):
# a filter should have been used to prevent this call
return

maint = getattr(record, 'dlm_progress_maint', None)
if maint in ('clear', 'refresh'):
return
pid = getattr(record, 'dlm_progress')
update = getattr(record, 'dlm_progress_update', None)
if pid not in self._ptrackers:
# this is new
prg = getattr(record, 'dlm_progress_initial', 0)
self._ptrackers[pid] = prg
self.annexremote.send_progress(prg)
elif update is None:
# not an update -> done
self._ptrackers.pop(pid)
else:
prg = self._ptrackers[pid]
if getattr(record, 'dlm_progress_increment', False):
prg += update
else:
prg = update
self._ptrackers[pid] = prg
self.annexremote.send_progress(prg)


def patched_underscore_main(args: list, cls: Type[SpecialRemote]):
"""Full replacement for datalad.customremotes.main._main()

Its only purpose is to create a running instance of a SpecialRemote.
The only difference to the original in datalad-core is that once this
instance exists, it is linked to a log handler that converts incoming
progress log messages to the equivalent annex protocol progress reports.

This additional log handler is a strict addition to the log handling
setup established at this point. There should be no interference with
any other log message processing.

.. seealso::

:class:`AnnexProgressLogHandler`
"""
assert cls is not None
from annexremote import Master
master = Master()
remote = cls(master)
master.LinkRemote(remote)

# we add an additional handler to the logger to deal with
# progress reports
dlroot_lgr = logging.getLogger('datalad')
phandler = AnnexProgressLogHandler(remote)
phandler.addFilter(only_progress_logrecords)
dlroot_lgr.addHandler(phandler)

# run the remote
master.Listen()
# cleanup
if hasattr(remote, 'stop'):
remote.stop()


apply_patch(
'datalad.customremotes.main', None, '_main',
patched_underscore_main,
msg='Replace special remote _main() '
"with datalad-next's progress logging enabled variant")
62 changes: 62 additions & 0 deletions datalad_next/patches/tests/test_annex_progress_logging.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@

def test_uncurl_progress_reporting_to_annex(existing_dataset, monkeypatch):
"""Set up a repo that is used to download a key,
check that we see progress reports
"""
repo = existing_dataset.repo
# enable uncurl to get a datalad code piece generate progress info
repo.call_annex([
'initremote',
'uncurl',
'type=external',
'externaltype=uncurl',
'encryption=none',
])
# 1.7MB download, should never change
testfilekey = 'MD5E-s1725572--3f9f0f5c05517686c008115a611586b1.zip'
testfileurl = \
'https://github.com/datalad/datalad/archive/refs/tags/0.18.3.zip'
testfilename = 'datalad.zip'

# register the key in the dataset with the source URL
repo.call_annex(['registerurl', testfilekey, testfileurl])

# record the key to be available from uncurl
uncurl_uuid = repo.call_annex_records(['info', 'uncurl'])[0]['uuid']
repo.call_annex(['setpresentkey', testfilekey, uncurl_uuid, '1'])

# place key in worktree (not strictly required, but a more common setup)
repo.call_annex(['fromkey', '--force', testfilekey, testfilename])

# intercept progress logs in this process. in order for progress reports
# to appear here, uncurl needs to report them to git-annex, and our runner
# setup needs to catch them and call `log_progress`. So this really is an
# end-to-end test.

logs = []

# patch the log_progress() used in annexrepo.py
def catch_progress(*args, **kwargs):
logs.append(kwargs)

import datalad.support.annexrepo
monkeypatch.setattr(
datalad.support.annexrepo,
"log_progress",
catch_progress,
)

# trigger a download. use git-annex directly such that there is
# little chance that the uncurl remote process is talking to a
# datalad parent process directly
repo._call_annex_records(
args=['get'],
files=[testfilename],
progress=True,
total_nbytes=1725572,
)
# check that we got the progress init report
assert any('total' in log for log in logs)
# and at least one progress update -- do not check for more, because
# on fast systems this may take very little time
assert any('update' in log for log in logs)
3 changes: 0 additions & 3 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -115,9 +115,6 @@
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True

# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'https://docs.python.org/': None}

# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
Expand Down
1 change: 1 addition & 0 deletions docs/source/patches.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ DataLad patches
common_cfg
configuration
create_sibling_ghlike
customremotes_main
distribution_dataset
interface_utils
push_optimize
Expand Down