
NF: datalad tree command #92

Merged: 135 commits, Aug 24, 2022
Conversation

catetrai (Contributor)

New command datalad tree for displaying directory/dataset hierarchies.

Closes #78

Note:

  • Tests are of the 'end-to-end' kind for now. More proper unit tests can be added once we consolidate the implementation.
  • As discussed in What about a datalad tree command? #78, I have included only long-form parameters for now. I think it would be worth considering short-form variants as well, to make it feel similar to tree. My suggestion would be -L for --depth (consistent with tree syntax) and -R for --dataset-depth (consistent with --recursion-limit in other datalad commands).
  • Performance may be an issue for large and deep dataset hierarchies. I played around with running the command on the 'mother of all superdatasets' at datasets.datalad.org, installing some datasets recursively. To me it looks like a classic Not Great, Not Terrible™. There may very well be a more efficient algorithm for the search, so let me know if you think it's worth looking into.
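
For context on the traversal cost mentioned above, a depth-limited directory walk of the kind tree performs can be sketched in plain Python. This is an illustration of the general approach using os.scandir, not the PR's actual implementation; walk_limited is a hypothetical name:

```python
import os

def walk_limited(root, max_depth=None):
    """Yield (path, depth) for directories under root, breadth-first,
    descending at most max_depth levels below the root. The cost grows
    with the number of directories visited, which for deep hierarchies
    is what dominates the command's runtime."""
    queue = [(root, 0)]
    while queue:
        path, depth = queue.pop(0)
        yield path, depth
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached: do not descend further
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=True):
                        queue.append((entry.path, depth + 1))
        except PermissionError:
            pass  # skip unreadable directories
```

A depth limit prunes whole subtrees, so the walk never pays for anything below the cutoff; without a limit, the cost is proportional to the full directory count.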

As a first-time contributor, this was really fun! The development environment (including tests and building docs) was straightforward to set up following the instructions. I was also able to reuse helpers in the utils modules (e.g. for creating test directory trees) which saved me lots of time.

Looking forward to your review! Please be brutal :-)

@catetrai (Contributor, Author)

I believe the last aspect to talk about is the finalization of the API, including, but not limited to, short options.

I updated the command docstring, parameters and examples to reflect the current implementation, so we can use this as a starting point for discussing changes:

class TreeCommand(Interface):
    """Visualize directory and dataset hierarchies

    This command mimics the UNIX/MS-DOS 'tree' utility to generate and
    display a directory tree, with DataLad-specific enhancements.

    It can serve the following purposes:

    1. Glorified 'tree' command
    2. Dataset discovery
    3. Programmatic directory traversal

    *Glorified 'tree' command*

    The rendered command output uses 'tree'-style visualization::

        /tmp/mydir
        ├── [DS~0] ds_A/
        │   └── [DS~1] subds_A/
        └── [DS~0] ds_B/
            ├── dir_B/
            │   ├── file.txt
            │   ├── subdir_B/
            │   └── [DS~1] subds_B0/
            └── [DS~1] (not installed) subds_B1/

        5 datasets, 2 directories, 1 file

    Dataset paths are prefixed by a marker indicating subdataset
    hierarchy level, like ``[DS~1]``. This is the absolute subdataset
    level, meaning it may also take into account superdatasets located
    above the tree root and thus not included in the output.

    If a subdataset is registered but not installed (such as after a
    non-recursive ``datalad clone``), it will be prefixed by ``(not
    installed)``. Only DataLad datasets are considered, not pure
    git/git-annex repositories.

    The 'report line' at the bottom of the output shows the count of
    displayed datasets, in addition to the count of directories and
    files. In this context, datasets and directories are mutually
    exclusive categories.

    By default, only directories (no files) are included in the tree,
    and hidden directories are skipped. Both behaviours can be changed
    using command options.

    Symbolic links are always followed. This means that a symlink
    pointing to a directory is traversed and counted as a directory
    (unless it would create a loop in the tree).

    *Dataset discovery*

    Using the [CMD: ``--dataset-depth`` CMD][PY: ``dataset_depth`` PY]
    option, this command generates the layout of dataset hierarchies
    based on subdataset nesting level, regardless of their location in
    the filesystem.

    In this case, tree depth is determined by subdataset depth. This
    mode is therefore suited for discovering available datasets when
    their location is not known in advance.

    By default, only datasets are listed, without their contents. If
    [CMD: ``--depth`` CMD][PY: ``depth`` PY] is specified additionally,
    the contents of each dataset will be included up to [CMD:
    ``--depth`` CMD][PY: ``depth`` PY] directory levels.

    Tree filtering options such as [CMD: ``--include-hidden`` CMD][PY:
    ``include_hidden`` PY] only affect which directories are
    reported/displayed, not which directories are traversed to find
    datasets.

    *Programmatic directory traversal*

    The command yields a result record for each tree node (dataset,
    directory or file). The following properties are reported, where
    available:

    "path"
        Absolute path of the tree node

    "type"
        Type of tree node: "dataset", "directory" or "file"

    "depth"
        Directory depth of the node relative to the tree root

    "exhausted_levels"
        Depth levels for which no nodes are left to be generated (the
        respective subtrees have been 'exhausted')

    "count"
        Dict with cumulative counts of datasets, directories and files
        in the tree up until the current node. The file count is only
        included if the command is run with the [CMD:
        ``--include-files`` CMD][PY: ``include_files`` PY] option.

    "dataset_depth"
        Subdataset depth level relative to the tree root. Only included
        for node type "dataset".

    "dataset_abs_depth"
        Absolute subdataset depth level. Only included for node type
        "dataset".

    "dataset_is_installed"
        Whether the registered subdataset is installed. Only included
        for node type "dataset".

    "symlink_target"
        If the tree node is a symlink, the path to the link target

    "is_broken_symlink"
        If the tree node is a symlink, whether it is a broken symlink
    """
    result_renderer = 'tailored'

    _params_ = dict(
        path=Parameter(
            args=("path",),
            nargs='?',
            doc="""path to directory from which to generate the tree.
            Defaults to the current directory.""",
            constraints=EnsureStr() | EnsureNone()),
        depth=Parameter(
            args=("--depth",),
            doc="""maximum level of subdirectories to include in the
            tree. If not specified, will generate the full tree with no
            depth constraint. If paired with [CMD: ``--dataset-depth``
            CMD][PY: ``dataset_depth`` PY], refers to the maximum
            directory level to generate underneath each dataset.""",
            constraints=EnsureInt() & EnsureRange(min=0) | EnsureNone()),
        dataset_depth=Parameter(
            args=("--dataset-depth",),
            doc="""maximum level of nested subdatasets to include in the
            tree. 0 means only top-level datasets, 1 means top-level
            datasets and their immediate subdatasets, etc.""",
            constraints=EnsureInt() & EnsureRange(min=0) | EnsureNone()),
        include_files=Parameter(
            args=("--include-files",),
            doc="""include files in the tree""",
            action='store_true'),
        include_hidden=Parameter(
            args=("--include-hidden",),
            doc="""include hidden files/directories in the tree. This
            option does not affect which directories will be searched
            for datasets when specifying [CMD: ``--dataset-depth``
            CMD][PY: ``dataset_depth`` PY]. For example, datasets
            located underneath the hidden folder `.datalad` will be
            reported even if [CMD: ``--include-hidden`` CMD][PY:
            ``include_hidden`` PY] is omitted.""",
            action='store_true'),
    )

    _examples_ = [
        dict(text="Show up to 3 levels of subdirectories below the "
                  "current directory, including files and hidden "
                  "contents",
             code_py="tree(depth=3, include_files=True, include_hidden=True)",
             code_cmd="datalad tree --depth 3 --include-files --include-hidden"),
        dict(text="Find all top-level datasets located anywhere under "
                  "``/tmp``",
             code_py="tree('/tmp', dataset_depth=0)",
             code_cmd="datalad tree /tmp --dataset-depth 0"),
        dict(text="Report first- and second-level subdatasets and their "
                  "directory contents, up to 1 subdirectory deep within "
                  "each dataset",
             code_py="tree(dataset_depth=2, depth=1)",
             code_cmd="datalad tree --dataset-depth 2 --depth 1"),
    ]
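
To illustrate the 'programmatic directory traversal' mode described in the docstring, the yielded result records can be consumed as plain dicts. The sketch below uses hand-written stand-in records (not real command output), and `uninstalled_datasets` is a hypothetical helper, not part of the extension:

```python
# Sketch: filtering tree() result records to find subdatasets that are
# registered but not installed. The record keys follow the docstring
# above; the sample records are hand-written stand-ins, not real output.

def uninstalled_datasets(records):
    """Return paths of dataset records that are not installed."""
    return [
        r["path"] for r in records
        if r.get("type") == "dataset"
        and not r.get("dataset_is_installed", True)
    ]

sample_records = [
    {"path": "/tmp/mydir/ds_A", "type": "dataset",
     "dataset_depth": 0, "dataset_is_installed": True},
    {"path": "/tmp/mydir/ds_B/dir_B", "type": "directory", "depth": 1},
    {"path": "/tmp/mydir/ds_B/subds_B1", "type": "dataset",
     "dataset_depth": 1, "dataset_is_installed": False},
]

print(uninstalled_datasets(sample_records))
```

With real output, the records would come from iterating over the command's generator rather than a hand-built list.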

Most other datalad commands have --recursive and --recursion-limit to tackle similar use cases. Do you think we can map that concept to tree too? Maybe for tree it does not make sense to have --recursive default to False, but otherwise it should match quite closely.

Yes, for me --recursion-limit has similar semantics to --dataset-depth, with the difference that, since tree does not operate on a dataset, the limit is applied to whatever (sub)datasets are found below the tree root. But I think that's close enough to still be intuitive to grasp if someone is familiar with -r/-R usage in other datalad commands. I propose we rename the --dataset-depth option to --recursion-limit (with short option -R). --recursive could be used for an unconstrained search (which we don't actually have in the current implementation -- one needs to specify some arbitrarily large value like --dataset-depth=100). What do you say?

I want to start with thinking about a conceptual separation of rendering related vs discovery related parameters.

I had previously considered distinguishing between two kinds of exclusion filters: 'display+traversal' exclusion filters (=directories will not be yielded nor traversed when searching for datasets) vs. 'display only' exclusion filters (=directories will not be yielded standalone, but may be traversed in the dataset search).

For example, without the --include-hidden option, hidden directories are neither reported nor traversed in the regular tree mode. But in dataset search mode (if --dataset-depth is specified), they will be traversed (and therefore reported) if they are the parent of a valid dataset. So we could say that the option affects reporting but not discovery.

Is this the point you wanted to address? Or am I missing the mark?
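
The distinction between the two filter kinds can be sketched with a toy walker that takes two separate predicates (a hypothetical illustration, not the command's actual code): `traverse_ok` gates descent, `report_ok` gates reporting, and a 'display only' filter simply leaves `traverse_ok` unrestricted:

```python
import os

def walk(root, traverse_ok, report_ok):
    """Toy directory walker separating 'traversal' from 'display'
    filtering: traverse_ok decides whether to descend into a directory,
    report_ok decides whether to yield it."""
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir(follow_symlinks=False):
            continue
        if report_ok(entry.path):
            yield entry.path
        if traverse_ok(entry.path):
            yield from walk(entry.path, traverse_ok, report_ok)

# A 'display only' hidden-directory filter: hidden directories are never
# reported themselves, but they are still traversed, so content nested
# beneath them (e.g. a dataset under .datalad) can still be discovered.
not_hidden = lambda p: not os.path.basename(p).startswith(".")
```

In the real command, a discovered dataset additionally causes its (possibly hidden) parent directories to be shown so the rendered tree stays connected; the sketch omits that detail.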


@mih (Member) previously approved these changes Aug 21, 2022 and left a comment:

Alrighty! I think we had a stellar run and this came out beautiful. Thanks for having this kind of stamina!

From my POV we can merge this now without any further changes. We can add the short options now, or in a subsequent PR.

The performance is cool from my POV too -- I think we have more substantial margins for improvement in the rest of datalad, compared to what is done in this PR.

I need to set up the contributor acknowledgement framework we use in other extensions, too, to acknowledge you properly for this contribution -- I will get to that soonish!

Thanks much! I love it!

@catetrai (Contributor, Author)

It was my pleasure @mih! Thank you for the feedback and guidance throughout. Glad to have had the chance to learn a bit about datalad internals, too (well, more like take a peek underneath the API). The debugging tips are also invaluable for daily usage.

I will rename the options and add short variants as part of this PR.

If you or the team have any improvement suggestions (also on documentation, code style / naming conventions, refactoring the tests a bit, etc) I'm happy to follow up on separate PRs.

@mih (Member) commented Aug 23, 2022:

Huh, one of the datalad-core tests started failing. It seems to be an issue with git-annex. I do not see an immediate connection. Will investigate....

I am rerunning the tests on the main branch, if it is also showing up there, we can go ahead with the merge. I somewhat expect that....

@mih (Member) commented Aug 24, 2022:

Yeah, unrelated and should be gone with the 0.17.3 release that just came out.

@mih (Member) commented Aug 24, 2022:

I just saw that a changelog snippet was still missing, so I added one with scriv create.

@mih (Member) left a comment:

Good to go!

@mih merged commit 62803bd into datalad:main on Aug 24, 2022