Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement replacement for datalad status #586

Closed
mih opened this issue Jan 7, 2024 · 2 comments · Fixed by #592
Closed

Implement replacement for datalad status #586

mih opened this issue Jan 7, 2024 · 2 comments · Fixed by #592

Comments

@mih
Copy link
Member

mih commented Jan 7, 2024

With #580 and related iterators I quickly hacked together an iterator for changes in git repos (see #323 for the background). This essentially comprises all pieces necessary to reimplement datalad status.

This implementation would differ significantly, both in code base and also in approach.

  • it would not be able to "list" the full content of a repository -- this is possible via ls-file-collection (git|annex)worktree, and it is a lot faster with that approach
  • it would be "change-driven", ie. only operate and report on changes in a repo (the underlying implementation could still deliver a "status" and a "diff", like the one in datalad-core
  • it would not run any datalad-core repo-class-level code, but exclusively rely on datalad_next.iter_collections

Based on initial performance tests of an unpublished draft, we can expect up to 100x speed increase for sizable repos (status report in <20ms instead of 1.5s (for a test repo with >30k files).

datalad status behavior summary

Here is what present datalad status in core does.

{
  "action": "status",
  "bytesize": 63,
  "gitshasum": "30e9ebb10380d9460522d063e53d39adecef4d94",
  "parentds": "/tmp/some",
  "path": "/tmp/some/.datalad/config",
  "prev_gitshasum": "30e9ebb10380d9460522d063e53d39adecef4d94",
  "refds": "/tmp/some",
  "state": "clean",
  "status": "ok",
  "type": "file"
}

bytesize is only present for files in Git (not for annexed files, unless --annex is given).

state can be the expected clean, added, deleted, modified, ...

datalad status --annex behavior summary

{
  "action": "status",
  "backend": "MD5E",
  "bytesize": 4,
  "gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "hashdirlower": "ff4/c57/",
  "hashdirmixed": "pF/Zf/",
  "humansize": "4 B",
  "key": "MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f",
  "keyname": "ba1f2511fc30423bdbb183fe33f3dd0f",
  "mtime": "unknown",
  "parentds": "/tmp/some",
  "path": "/tmp/some/testme",
  "prev_gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "refds": "/tmp/some",
  "state": "clean",
  "status": "ok",
  "type": "file"
}

datalad status --annex availability behavior summary

{
  "action": "status",
  "backend": "MD5E",
  "bytesize": 4,
  "gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "has_content": true,
  "humansize": "4 B",
  "key": "MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f",
  "keyname": "ba1f2511fc30423bdbb183fe33f3dd0f",
  "mtime": "unknown",
  "objloc": "/tmp/some/.git/annex/objects/pF/Zf/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f/MD5E-s4--ba1f2511fc30423bdbb183fe33f3dd0f",
  "parentds": "/tmp/some",
  "path": "/tmp/some/testme",
  "prev_gitshasum": "670d89de1eaf38bd9740fdc274ef214e0f63408f",
  "refds": "/tmp/some",
  "state": "clean",
  "status": "ok",
  "type": "file"
}

There is nothing here that could not be reported based on the existing iterators.

Beyond the already existing draft of iter_gitdiff() it makes sense to also implement iter_annexdiff(). It would also implement the pattern of iter_annexworktree() wrapping iter_gitworktree(), and injecting the corresponding annex attributes to the reported items.

API considerations

Right now, I see no need to make the API of the reimplementation be different from the one in datalad-core. Obviously, the report_filetype argument that is already ignored would not be supported.

@mih
Copy link
Member Author

mih commented Jan 8, 2024

Few more thoughts on recursive operation and listing untracked files within status.

DataLad advertises restricted recursion as a feature (i.e. recurse a limited number of dataset levels). This represents a challenge. Git does not support that (either recurse or not, only). Therefore the handling cannot be left to Git.

Conceptually, it is not clear to me why we support that. If it is about "seamless monorepos", it seems questionable to require an awareness of repository boundaries (which is necessary for specifying numerical limits). I vaguely remember that there was a use case for "immediate subdatasets", but I cannot recall specifics.

Not providing restricted recursion would also encourage usage that does not involve needlessly deep repository hierarchies.

If this is to be discontinued, a main benefit could be added support for Git's pathspecs. Presently datalad largely ignores that possibility, and it would be rather complicated to implement mangling of pathspecs for seemless recursive Git calls on subdataset -- without using Gits own recursion mechanism.

Listing untracked files recursively in subdatasets

Presently git ls-files does not support recursion for --other:

❯ git ls-files --recurse-submodules --other
fatal: ls-files --recurse-submodules unsupported mode

Combining with --stage or --cached does not change that.

A recursive listing of untracked files would therefore require a recursive discovery of submodules and then individual ls-files --other calls.

This means that an implementation of something like a iter_gituntracked(recursive=...) implies running iter_gitworktree. Jointly, this implies that an implementation of a status command capable of listing untracked files recursively, needs to run a recursive worktree discovery, hence gains awareness of all repository content, even of only a change report is desired.

It appears to still make sense to implement a dedicated iter_gituntracked, even if it would call iter_gitworktree inside. This would enable flexible combination with a iter_gitdiff, without always having to go through iter_gitworktree.

@mih
Copy link
Member Author

mih commented Jan 8, 2024

Looking jointly at all Git commands that are involved in a status report (ls-files (for untracked), diff-index for changes): there is no sufficient support for submodule-recursive operations. This needs to be implemented manually.

What calls would need to be done:

  1. non-recursive, no untracked: just diff-index (or iter_gitdiff)
  2. non-recursive and untracked: (1) plus ls-files --other
  3. recursive, no untracked: (1) plus another diff-index for any submodules reported as modified
  4. recursive and untracked: (3) plus discover all submodules, and run ls-files --other in each of them

An alternative look at the required capabilities from the POV of datalad save (the primary users of datalad status).

  • recursive operation is mandatory: (3) and (4) above
  • --updated suggests that an optimized implementation for tracked content only is useful: (3) above)
  • an implementation cannot rely on a (clean) index for incremental staging
  • constraining a save via any number of path(spec)s is needed. It would make sense to also constrain the change discovery in the same way. This means adding a pathspec argument to the interfaces of all underlying iterators

Conclusions:

  • (A) we need something that can run git diff-index datalad-style recursive, and supports pathspec-based contraining
  • (B) we need something that can perform datalad-style recursive discovery of untracked files, possibly/ideally in conjuction with (A)

My gut feeling is that we should first implement (A), and then do a separate implementation of (B), following by an refactoring to get a useful codebase -- while maintaining two distinct API entrypoints.

mih added a commit to mih/datalad-next that referenced this issue Jan 12, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 12, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 13, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 13, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 15, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 15, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 15, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 15, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 15, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 16, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 16, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 16, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 16, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 17, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 17, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 17, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 17, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely?) the same.

TODO more detailed analysis

The command options `untracked` and `recursive` now both take (optional)
qualifiying values, but also work with any value specification at the
CLI.

Closes datalad#586 (eventually)
mih added a commit to mih/datalad-next that referenced this issue Jan 17, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely) the same. The command
documentation contains a summary of the key differences.

Closes datalad#586
mih added a commit to mih/datalad-next that referenced this issue Jan 17, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely) the same. The command
documentation contains a summary of the key differences.

Closes datalad#586
mih added a commit to mih/datalad-next that referenced this issue Jan 17, 2024
This implementation is the first to emit `CommandResult` type result
items, ie. dataclass instances rather than result dicts.

It also uses uniform parameter validation, enabling substantially
simplified implementation (e.g., of the result renderer).

The user-facing appearance remains (largely) the same. The command
documentation contains a summary of the key differences.

Closes datalad#586
@mih mih closed this as completed in #592 Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant