
NF: datalad tree command #92

Merged: 135 commits, Aug 24, 2022
Conversation

catetrai (Contributor)

New command datalad tree for displaying directory/dataset hierarchies.

Closes #78

Note:

  • Tests are of the 'end-to-end' kind for now. More proper unit tests can be added once we consolidate the implementation.
  • As discussed in What about a datalad tree command? #78, I have included only long-form parameters for now. I think it would be worth considering short-form variants as well, to make it feel similar to tree. My suggestion would be -L for --depth (consistent with tree syntax) and -R for --dataset-depth (consistent with --recursion-limit in other datalad commands).
  • Performance may be an issue for large and deep dataset hierarchies. I played around with running the command on the 'mother of all superdatasets' at datasets.datalad.org, installing some datasets recursively. To me it looks like a classic Not Great, Not Terrible™. There may very well be a more efficient algorithm for the search, so let me know if you think it's worth looking into.
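
For context on the traversal cost mentioned above, a depth-limited directory walk of the kind tree performs can be sketched in plain Python. This is an illustration of the general approach using os.scandir, not the PR's actual implementation; walk_limited is a hypothetical name:

```python
import os

def walk_limited(root, max_depth=None):
    """Yield (path, depth) for directories under root, breadth-first,
    descending at most max_depth levels below the root. The cost grows
    with the number of directories visited, which for deep hierarchies
    is what dominates the command's runtime."""
    queue = [(root, 0)]
    while queue:
        path, depth = queue.pop(0)
        yield path, depth
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached: do not descend further
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=True):
                        queue.append((entry.path, depth + 1))
        except PermissionError:
            pass  # skip unreadable directories
```

A depth limit prunes whole subtrees, so the walk never pays for anything below the cutoff; without a limit, the cost is proportional to the full directory count.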

As a first-time contributor, this was really fun! The development environment (including tests and building docs) was straightforward to set up following the instructions. I was also able to reuse helpers in the utils modules (e.g. for creating test directory trees) which saved me lots of time.

Looking forward to your review! Please be brutal :-)

@catetrai (Contributor, Author)

I believe the last aspect to talk about is the finalization of the API, including, but not limited to, short options.

I updated the command docstring, parameters and examples to reflect the current implementation, so we can use this as a starting point for discussing changes:

class TreeCommand(Interface):
    """Visualize directory and dataset hierarchies

    This command mimics the UNIX/MS-DOS 'tree' utility to generate and
    display a directory tree, with DataLad-specific enhancements.

    It can serve the following purposes:

    1. Glorified 'tree' command
    2. Dataset discovery
    3. Programmatic directory traversal

    *Glorified 'tree' command*

    The rendered command output uses 'tree'-style visualization::

        /tmp/mydir
        ├── [DS~0] ds_A/
        │   └── [DS~1] subds_A/
        └── [DS~0] ds_B/
            ├── dir_B/
            │   ├── file.txt
            │   ├── subdir_B/
            │   └── [DS~1] subds_B0/
            └── [DS~1] (not installed) subds_B1/

        5 datasets, 2 directories, 1 file

    Dataset paths are prefixed by a marker indicating subdataset
    hierarchy level, like ``[DS~1]``. This is the absolute subdataset
    level, meaning it may also take into account superdatasets located
    above the tree root and thus not included in the output.

    If a subdataset is registered but not installed (such as after a
    non-recursive ``datalad clone``), it will be prefixed by ``(not
    installed)``. Only DataLad datasets are considered, not pure
    git/git-annex repositories.

    The 'report line' at the bottom of the output shows the count of
    displayed datasets, in addition to the count of directories and
    files. In this context, datasets and directories are mutually
    exclusive categories.

    By default, only directories (no files) are included in the tree,
    and hidden directories are skipped. Both behaviours can be changed
    using command options.

    Symbolic links are always followed. This means that a symlink
    pointing to a directory is traversed and counted as a directory
    (unless it would create a loop in the tree).

    *Dataset discovery*

    Using the [CMD: ``--dataset-depth`` CMD][PY: ``dataset_depth`` PY]
    option, this command generates the layout of dataset hierarchies
    based on subdataset nesting level, regardless of their location in
    the filesystem.

    In this case, tree depth is determined by subdataset depth. This
    mode is therefore suited for discovering available datasets when
    their location is not known in advance.

    By default, only datasets are listed, without their contents. If
    [CMD: ``--depth`` CMD][PY: ``depth`` PY] is specified additionally,
    the contents of each dataset will be included up to [CMD:
    ``--depth`` CMD][PY: ``depth`` PY] directory levels.

    Tree filtering options such as [CMD: ``--include-hidden`` CMD][PY:
    ``include_hidden`` PY] only affect which directories are
    reported/displayed, not which directories are traversed to find
    datasets.

    *Programmatic directory traversal*

    The command yields a result record for each tree node (dataset,
    directory or file). The following properties are reported, where
    available:

    "path"
        Absolute path of the tree node

    "type"
        Type of tree node: "dataset", "directory" or "file"

    "depth"
        Directory depth of the node relative to the tree root

    "exhausted_levels"
        Depth levels for which no nodes are left to be generated (the
        respective subtrees have been 'exhausted')

    "count"
        Dict with cumulative counts of datasets, directories and files
        in the tree up until the current node. The file count is only
        included if the command is run with the [CMD:
        ``--include-files`` CMD][PY: ``include_files`` PY] option.

    "dataset_depth"
        Subdataset depth level relative to the tree root. Only included
        for node type "dataset".

    "dataset_abs_depth"
        Absolute subdataset depth level. Only included for node type
        "dataset".

    "dataset_is_installed"
        Whether the registered subdataset is installed. Only included
        for node type "dataset".

    "symlink_target"
        If the tree node is a symlink, the path to the link target

    "is_broken_symlink"
        If the tree node is a symlink, whether it is a broken symlink
    """
    result_renderer = 'tailored'

    _params_ = dict(
        path=Parameter(
            args=("path",),
            nargs='?',
            doc="""path to directory from which to generate the tree.
            Defaults to the current directory.""",
            constraints=EnsureStr() | EnsureNone()),
        depth=Parameter(
            args=("--depth",),
            doc="""maximum level of subdirectories to include in the
            tree. If not specified, will generate the full tree with no
            depth constraint. If paired with [CMD: ``--dataset-depth``
            CMD][PY: ``dataset_depth`` PY], refers to the maximum
            directory level to generate underneath each dataset.""",
            constraints=EnsureInt() & EnsureRange(min=0) | EnsureNone()),
        dataset_depth=Parameter(
            args=("--dataset-depth",),
            doc="""maximum level of nested subdatasets to include in the
            tree. 0 means only top-level datasets, 1 means top-level
            datasets and their immediate subdatasets, etc.""",
            constraints=EnsureInt() & EnsureRange(min=0) | EnsureNone()),
        include_files=Parameter(
            args=("--include-files",),
            doc="""include files in the tree""",
            action='store_true'),
        include_hidden=Parameter(
            args=("--include-hidden",),
            doc="""include hidden files/directories in the tree. This
            option does not affect which directories will be searched
            for datasets when specifying [CMD: ``--dataset-depth``
            CMD][PY: ``dataset_depth`` PY]. For example, datasets
            located underneath the hidden folder `.datalad` will be
            reported even if [CMD: ``--include-hidden`` CMD][PY:
            ``include_hidden`` PY] is omitted.""",
            action='store_true'),
    )

    _examples_ = [
        dict(text="Show up to 3 levels of subdirectories below the "
                  "current directory, including files and hidden "
                  "contents",
             code_py="tree(depth=3, include_files=True, include_hidden=True)",
             code_cmd="datalad tree --depth 3 --include-files --include-hidden"),
        dict(text="Find all top-level datasets located anywhere under "
                  "``/tmp``",
             code_py="tree('/tmp', dataset_depth=0)",
             code_cmd="datalad tree /tmp --dataset-depth 0"),
        dict(text="Report first- and second-level subdatasets and their "
                  "directory contents, up to 1 subdirectory deep within "
                  "each dataset",
             code_py="tree(dataset_depth=2, depth=1)",
             code_cmd="datalad tree --dataset-depth 2 --depth 1"),
    ]
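
To illustrate the 'programmatic directory traversal' mode described in the docstring, the yielded result records can be consumed as plain dicts. The sketch below uses hand-written stand-in records (not real command output), and `uninstalled_datasets` is a hypothetical helper, not part of the extension:

```python
# Sketch: filtering tree() result records to find subdatasets that are
# registered but not installed. The record keys follow the docstring
# above; the sample records are hand-written stand-ins, not real output.

def uninstalled_datasets(records):
    """Return paths of dataset records that are not installed."""
    return [
        r["path"] for r in records
        if r.get("type") == "dataset"
        and not r.get("dataset_is_installed", True)
    ]

sample_records = [
    {"path": "/tmp/mydir/ds_A", "type": "dataset",
     "dataset_depth": 0, "dataset_is_installed": True},
    {"path": "/tmp/mydir/ds_B/dir_B", "type": "directory", "depth": 1},
    {"path": "/tmp/mydir/ds_B/subds_B1", "type": "dataset",
     "dataset_depth": 1, "dataset_is_installed": False},
]

print(uninstalled_datasets(sample_records))
```

With real output, the records would come from iterating over the command's generator rather than a hand-built list.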

Most other datalad commands have --recursive and --recursion-limit to tackle similar use cases. Do you think we can map that concept to tree too? Maybe for tree it does not make sense to have --recursive default to False, but otherwise it should match quite closely.

Yes, for me --recursion-limit has similar semantics to --dataset-depth, with the difference that, since tree does not operate on a dataset, the limit is applied to whatever (sub)datasets are found below the tree root. But I think that's close enough to still be intuitive to grasp if someone is familiar with -r/-R usage in other datalad commands. I propose we rename the --dataset-depth option to --recursion-limit (with short option -R). --recursive could be used for an unconstrained search (which we don't actually have in the current implementation -- one needs to specify some arbitrarily large value like --dataset-depth=100). What do you say?

I want to start with thinking about a conceptual separation of rendering related vs discovery related parameters.

I had previously considered distinguishing between two kinds of exclusion filters: 'display+traversal' exclusion filters (=directories will not be yielded nor traversed when searching for datasets) vs. 'display only' exclusion filters (=directories will not be yielded standalone, but may be traversed in the dataset search).

For example, without the --include-hidden option, hidden directories are neither reported nor traversed in the regular tree mode. But in dataset search mode (if --dataset-depth is specified), they will be traversed (and therefore reported) if they are the parent of a valid dataset. So we could say that the option affects reporting but not discovery.

Is this the point you wanted to address? Or am I missing the mark?
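
The distinction between the two filter kinds can be sketched with a toy walker that takes two separate predicates (a hypothetical illustration, not the command's actual code): `traverse_ok` gates descent, `report_ok` gates reporting, and a 'display only' filter simply leaves `traverse_ok` unrestricted:

```python
import os

def walk(root, traverse_ok, report_ok):
    """Toy directory walker separating 'traversal' from 'display'
    filtering: traverse_ok decides whether to descend into a directory,
    report_ok decides whether to yield it."""
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir(follow_symlinks=False):
            continue
        if report_ok(entry.path):
            yield entry.path
        if traverse_ok(entry.path):
            yield from walk(entry.path, traverse_ok, report_ok)

# A 'display only' hidden-directory filter: hidden directories are never
# reported themselves, but they are still traversed, so content nested
# beneath them (e.g. a dataset under .datalad) can still be discovered.
not_hidden = lambda p: not os.path.basename(p).startswith(".")
```

In the real command, a discovered dataset additionally causes its (possibly hidden) parent directories to be shown so the rendered tree stays connected; the sketch omits that detail.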


@mih (Member) previously approved these changes Aug 21, 2022 and left a comment:

Alrighty! I think we had a stellar run and this came out beautiful. Thanks for having this kind of stamina!

From my POV we can merge this now without any further changes. We can add the short options now, or in a subsequent PR.

The performance is cool from my POV too -- I think we have more substantial margins for improvement in the rest of datalad, compared to what is done in this PR.

I need to set up the contributor acknowledgement framework we use in other extensions, too, to acknowledge you properly for this contribution -- I will get to that soonish!

Thanks much! I love it!

@catetrai (Contributor, Author)

It was my pleasure @mih! Thank you for the feedback and guidance throughout. Glad to have had the chance to learn a bit about datalad internals, too (well, more like take a peek underneath the API). The debugging tips are also invaluable for daily usage.

I will rename the options and add short variants as part of this PR.

If you or the team have any improvement suggestions (also on documentation, code style / naming conventions, refactoring the tests a bit, etc) I'm happy to follow up on separate PRs.

@mih (Member) commented Aug 23, 2022:

Huh, one of the datalad-core tests started failing. It seems to be an issue with git-annex. I do not see an immediate connection. Will investigate....

I am rerunning the tests on the main branch, if it is also showing up there, we can go ahead with the merge. I somewhat expect that....

@mih (Member) commented Aug 24, 2022:

Yeah, unrelated and should be gone with the 0.17.3 release that just came out.

@mih (Member) commented Aug 24, 2022:

I just saw that a changelog snippet was still missing, so I added one with scriv create.

@mih (Member) left a comment:

Good to go!

@mih merged commit 62803bd into datalad:main on Aug 24, 2022