Implement replacement for `datalad status` #586

mih · 2024-01-07T17:52:25Z

With #580 and related iterators I quickly hacked together an iterator for changes in git repos (see #323 for the background). This essentially comprises all pieces necessary to reimplement datalad status.

This implementation would differ significantly, both in code base and also in approach.

it would not be able to "list" the full content of a repository -- this is possible via ls-file-collection (git|annex)worktree, and it is a lot faster with that approach
it would be "change-driven", ie. only operate and report on changes in a repo (the underlying implementation could still deliver a "status" and a "diff", like the one in datalad-core
it would not run any datalad-core repo-class-level code, but exclusively rely on datalad_next.iter_collections

Based on initial performance tests of an unpublished draft, we can expect up to 100x speed increase for sizable repos (status report in <20ms instead of 1.5s (for a test repo with >30k files).

`datalad status` behavior summary

Here is what present datalad status in core does.

{
  "action": "status",
  "bytesize": 63,
  "gitshasum": "30e9ebb10380d9460522d063e53d39adecef4d94",
  "parentds": "/tmp/some",
  "path": "/tmp/some/.datalad/config",
  "prev_gitshasum": "30e9ebb10380d9460522d063e53d39adecef4d94",
  "refds": "/tmp/some",
  "state": "clean",
  "status": "ok",
  "type": "file"
}

bytesize is only present for files in Git (not for annexed files, unless --annex is given).

state can be the expected clean, added, deleted, modified, ...

`datalad status --annex` behavior summary

{
  "action": "status",
  "backend": "MD5E",
  "bytesize": 4,
  "gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "hashdirlower": "ff4/c57/",
  "hashdirmixed": "pF/Zf/",
  "humansize": "4 B",
  "key": "MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f",
  "keyname": "ba1f2511fc30423bdbb183fe33f3dd0f",
  "mtime": "unknown",
  "parentds": "/tmp/some",
  "path": "/tmp/some/testme",
  "prev_gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "refds": "/tmp/some",
  "state": "clean",
  "status": "ok",
  "type": "file"
}

`datalad status --annex availability` behavior summary

{
  "action": "status",
  "backend": "MD5E",
  "bytesize": 4,
  "gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "has_content": true,
  "humansize": "4 B",
  "key": "MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f",
  "keyname": "ba1f2511fc30423bdbb183fe33f3dd0f",
  "mtime": "unknown",
  "objloc": "/tmp/some/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f",
  "parentds": "/tmp/some",
  "path": "/tmp/some/testme",
  "prev_gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "refds": "/tmp/some",
  "state": "clean",
  "status": "ok",
  "type": "file"
}

There is nothing here that could not be reported based on the existing iterators.

Beyond the already existing draft of iter_gitdiff() it makes sense to also implement iter_annexdiff(). It would also implement the pattern of iter_annexworktree() wrapping iter_gitworktree(), and injecting the corresponding annex attributes to the reported items.

API considerations

Right now, I see no need to make the API of the reimplementation be different from the one in datalad-core. Obviously, the report_filetype argument that is already ignored would not be supported.

The text was updated successfully, but these errors were encountered:

mih · 2024-01-08T07:51:00Z

Few more thoughts on recursive operation and listing untracked files within status.

DataLad advertises restricted recursion as a feature (i.e. recurse a limited number of dataset levels). This represents a challenge. Git does not support that (either recurse or not, only). Therefore the handling cannot be left to Git.

Conceptually, it is not clear to me why we support that. If it is about "seamless monorepos", it seems questionable to require an awareness of repository boundaries (which is necessary for specifying numerical limits). I vaguely remember that there was a use case for "immediate subdatasets", but I cannot recall specifics.

Not providing restricted recursion would also encourage usage that does not involve needlessly deep repository hierarchies.

If this is to be discontinued, a main benefit could be added support for Git's pathspecs. Presently datalad largely ignores that possibility, and it would be rather complicated to implement mangling of pathspecs for seemless recursive Git calls on subdataset -- without using Gits own recursion mechanism.

Listing untracked files recursively in subdatasets

Presently git ls-files does not support recursion for --other:

❯ git ls-files --recurse-submodules --other
fatal: ls-files --recurse-submodules unsupported mode

Combining with --stage or --cached does not change that.

A recursive listing of untracked files would therefore require a recursive discovery of submodules and then individual ls-files --other calls.

This means that an implementation of something like a iter_gituntracked(recursive=...) implies running iter_gitworktree. Jointly, this implies that an implementation of a status command capable of listing untracked files recursively, needs to run a recursive worktree discovery, hence gains awareness of all repository content, even of only a change report is desired.

It appears to still make sense to implement a dedicated iter_gituntracked, even if it would call iter_gitworktree inside. This would enable flexible combination with a iter_gitdiff, without always having to go through iter_gitworktree.

mih · 2024-01-08T10:50:32Z

Looking jointly at all Git commands that are involved in a status report (ls-files (for untracked), diff-index for changes): there is no sufficient support for submodule-recursive operations. This needs to be implemented manually.

What calls would need to be done:

non-recursive, no untracked: just diff-index (or iter_gitdiff)
non-recursive and untracked: (1) plus ls-files --other
recursive, no untracked: (1) plus another diff-index for any submodules reported as modified
recursive and untracked: (3) plus discover all submodules, and run ls-files --other in each of them

An alternative look at the required capabilities from the POV of datalad save (the primary users of datalad status).

recursive operation is mandatory: (3) and (4) above
--updated suggests that an optimized implementation for tracked content only is useful: (3) above)
an implementation cannot rely on a (clean) index for incremental staging
constraining a save via any number of path(spec)s is needed. It would make sense to also constrain the change discovery in the same way. This means adding a pathspec argument to the interfaces of all underlying iterators

Conclusions:

(A) we need something that can run git diff-index datalad-style recursive, and supports pathspec-based contraining
(B) we need something that can perform datalad-style recursive discovery of untracked files, possibly/ideally in conjuction with (A)

My gut feeling is that we should first implement (A), and then do a separate implementation of (B), following by an refactoring to get a useful codebase -- while maintaining two distinct API entrypoints.

This implementation is the first to emit `CommandResult` type result items, ie. dataclass instances rather than result dicts. It also uses uniform parameter validation, enabling substantially simplified implementation (e.g., of the result renderer). The user-facing appearance remains (largely?) the same. TODO more detailed analysis The command options `untracked` and `recursive` now both take (optional) qualifiying values, but also work with any value specification at the CLI. Closes datalad#586 (eventually)

This implementation is the first to emit `CommandResult` type result items, ie. dataclass instances rather than result dicts. It also uses uniform parameter validation, enabling substantially simplified implementation (e.g., of the result renderer). The user-facing appearance remains (largely) the same. The command documentation contains a summary of the key differences. Closes datalad#586

This was referenced Jan 8, 2024

Add GitPathSpec support class and constraint #587

Closed

New iter_gitdiff #589

Merged

Initial draft of iter_gitstatus() and datalad next-status #592

Merged

mih closed this as completed in #592 Jan 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement replacement for `datalad status` #586

Implement replacement for `datalad status` #586

mih commented Jan 7, 2024 •

edited

Loading

mih commented Jan 8, 2024 •

edited

Loading

mih commented Jan 8, 2024 •

edited

Loading

Implement replacement for datalad status #586

Implement replacement for datalad status #586

Comments

mih commented Jan 7, 2024 • edited Loading

datalad status behavior summary

datalad status --annex behavior summary

datalad status --annex availability behavior summary

API considerations

mih commented Jan 8, 2024 • edited Loading

mih commented Jan 8, 2024 • edited Loading

Implement replacement for `datalad status` #586

Implement replacement for `datalad status` #586

mih commented Jan 7, 2024 •

edited

Loading

`datalad status` behavior summary

`datalad status --annex` behavior summary

`datalad status --annex availability` behavior summary

mih commented Jan 8, 2024 •

edited

Loading

mih commented Jan 8, 2024 •

edited

Loading