This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Consistency between DataTree methods and pathlib.PurePath methods #283

Closed
14 tasks
TomNicholas opened this issue Nov 27, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@TomNicholas
Member

TomNicholas commented Nov 27, 2023

@eschalkargans suggested in #281 that the API of DataTree objects could more closely follow that of pathlib.PurePath objects. I think aligning the APIs and nomenclature in this way is a good idea. In general I think it's conceptually useful to think of a DataTree object as if it were an instance of pathlib.PurePosixPath (even though the actual implementation should not work like that).

There are various methods we might want to add/change to make them more compatible:

Inspired by pathlib.PurePath:

  • DataTree.match should be renamed to DataTree.glob
  • Add a new method DataTree.match that returns a boolean like PurePath.match does
  • DataTree.lineage should be renamed to .parents
  • Add an .is_relative_to method (in pathlib, only passing extra arguments to it is deprecated, since Python 3.12)
  • A new .joinpath method could be useful
  • DataTree.relative_to should possibly have a walk_up parameter (see Use new walk_up parameter in pathlib for traversing relative paths #258)
  • A new .with_name method might be useful
  • A new .with_segments method might be useful
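For reference, here is a short sketch of the pathlib.PurePosixPath behaviours the bullets above refer to. The path "/root/group/var" is made up for illustration, and only calls available since Python 3.9 are used, so walk_up and with_segments (both Python 3.12+) are left out.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/root/group/var")

# match: boolean test of a single path against a glob-style pattern.
# A relative pattern is matched from the right-hand end of the path.
is_match = path.match("*/var")

# parents: sequence of ancestors, nearest first.
ancestors = [str(p) for p in path.parents]

# is_relative_to: True if `path` lies underneath the given path.
inside = path.is_relative_to("/root")

# joinpath / with_name: build new paths; the original is immutable.
child = path.joinpath("attr")
renamed = path.with_name("other")
```

Whether the DataTree methods should return node objects or paths in each of these cases is left open here.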

Inspired by pathlib.Path (i.e. concrete paths):

  • A new DataTree.walk method might be a better way to expose the logic in iterators.py
  • A new .rename method might be useful
  • A new .replace method might be useful
  • A new .rglob method (though having this and .glob seems overkill)

Several of these might be useful abstractions internally, especially .joinpath, .walk, and .replace.
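As a rough illustration of what a walk method could look like, here is a minimal sketch over a toy nested dict rather than a real DataTree. The function name walk_tree, the dict layout, and the (path, group_names, leaf_names) tuple shape are all assumptions for this example, loosely modelled on Path.walk.

```python
from pathlib import PurePosixPath
from typing import Iterator


def walk_tree(
    tree: dict, path: PurePosixPath = PurePosixPath("/")
) -> Iterator[tuple[PurePosixPath, list[str], list[str]]]:
    """Yield (path, group_names, leaf_names) top-down for a nested dict.

    Hypothetical helper: nested dicts stand in for groups, and any other
    value stands in for a leaf.
    """
    groups = [k for k, v in tree.items() if isinstance(v, dict)]
    leaves = [k for k, v in tree.items() if not isinstance(v, dict)]
    yield path, groups, leaves
    for name in groups:
        # joinpath builds the child's path from the current one
        yield from walk_tree(tree[name], path.joinpath(name))


toy = {"a": {"x": 1, "b": {"y": 2}}, "z": 3}
visited = [(str(p), g, l) for p, g, l in walk_tree(toy)]
```

Note that the sketch already leans on joinpath internally, which is the kind of reuse suggested above.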

EDIT: Let's also document this similarity:

@TomNicholas TomNicholas added the enhancement New feature or request label Nov 27, 2023
@etienneschalk
Contributor

Hi @TomNicholas , I would like to help with the code on this one. Do you think this might be a good first issue? Thanks!

@TomNicholas
Member Author

Sure @etienneschalk! I think each of these bullet points is really its own little issue, so feel free to open a PR for any one of them. (Maybe leave the tree-walking related ones for now though because I think those will be a little more complicated.)

@TomNicholas
Member Author

Once we have completed some of these it would also be nice to add a little section in the documentation that points out this similarity explicitly to users. Also we can then reorganise the grouping of methods in api.rst to have a section for Path-like methods.

@etienneschalk
Contributor

Pathlib

The following are some notes I took while reading the pathlib documentation and thinking about equivalences in DataTree usage.

Listing

Curated list

This list only contains methods I did not classify as "Irrelevant".
The "Irrelevant" tag is subjective and reflects my own understanding; I may have missed important methods.

Pure Paths

  • PurePath.parts
    • "parsed" path
  • PurePath.root
    • Relevant to differentiate between absolute and relative paths. This is already done by PurePath.is_absolute()
    • For DataTree.root, same comment as parents
    • Note: root = parents[-1]? No, currently the parents are rewound until finding a parent with root is None. Could it be simplified with parents[-1], if the path hierarchy is already known in advance?
  • PurePath.parents
    • The DataTree.parents should use the paths obtained via its NodePath identifier inside of the root's DataTree to produce the list of parents' DataTree.
    • Note: this means all Nodes must be aware of the root. Which is the case via the root attribute. Trees are aware of being a root or a subtree.
  • PurePath.parent
    • Same comment as parents
    • Note: parent == parents[0]?
  • PurePath.name
    • Might be useful if absolute paths are used as internal IDs inside of the tree, for string reprs. PurePaths are hashable and can be used as IDs
  • PurePath.is_absolute()
    • Interesting, as Node IDs should be absolute.
  • PurePath.is_relative_to(other)
    • Can be interesting for quickly knowing if a node is inside of a larger tree, with path-only lookup?
  • PurePath.joinpath
    • Cannot see the immediate utility for an end user, might be useful internally
  • PurePath.match
    • This is a "single-element" version of glob, only checking if a single path conforms to the pattern
    • Might be useful to implement DataTree.glob by mapping it against all paths contained in the tree.
  • PurePath.relative_to(other, walk_up=False)
    • Might be useful to detach a node from a tree, to generate its new paths identifiers.
  • PurePath.with_name(name)
    • Might be useful to rename a node and update the path representing it inside of its root DataTree.
  • PurePath.with_segments(*pathsegments)
    • Can be useful because the docs say it can be used with classes deriving from PurePath (e.g. PurePosixPath), like NodePath
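Two of the notes above can be checked directly with pathlib: building a glob by mapping PurePath.match over all known paths, and the parent/parents/root relationships. The helper name glob_paths and the list of paths are hypothetical, and a real DataTree.glob would iterate over the tree's own node paths instead.

```python
from pathlib import PurePosixPath


def glob_paths(paths, pattern):
    """Hypothetical glob: keep every path for which PurePath.match is True."""
    return [p for p in paths if PurePosixPath(p).match(pattern)]


all_paths = ["/a", "/a/b", "/a/b/temp", "/a/c/temp", "/d"]
hits = glob_paths(all_paths, "*/temp")

node = PurePosixPath("/a/b/temp")
# parent is the first entry of parents
parent_is_first = node.parent == node.parents[0]
# The last entry of parents is the root. list(...) is used because direct
# negative indexing on parents is only supported from Python 3.10.
root_is_last = list(node.parents)[-1] == PurePosixPath(node.root)
```

For pure paths, at least, root does coincide with the last element of parents, which is the simplification the note on PurePath.root asks about.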

Concrete Paths

Concrete paths could be implemented by a companion DataTreePath class attached to a DataTree instance.

  • Path.glob()
    • Can be used to map PurePath.match against all paths contained by the bound instance of DataTree
    • Regarding case_sensitive, since DataTree works with PurePosixPath, keep the POSIX default: True
  • Path.is_dir()
    • It might be useful to discriminate between DataTree and Dataset (directory-like) and DataArray (file-like)
    • Maybe a better name like is_group could help, or is_aggregation
    • Note: Dataset may actually be closer to a leaf? At first glance, no, as it is non-atomic. One could argue that a DataArray is non-atomic too (it carries dimension coordinates)
  • Path.is_file()
    • Mirrors Path.is_dir()
    • Maybe a better name like is_dataarray could help, or is_leaf
  • Path.is_symlink()
    • To be considered if symbolic nodes are to be implemented
  • Path.iterdir()
    • Like ls
  • Path.walk
    • A good candidate method to implement to explore a DataTree
    • Introduced in Python 3.12 only
    • Currently, from a developer's point of view, I use Path.rglob("*") when I need to iterate through a directory, so maybe walk is dispensable.
  • Path.mkdir
    • Probably irrelevant, but kwargs like parents=True, exist_ok might be useful when working with groups.
  • Path.rename
    • Might be useful to rename a node inside of the root tree
  • Path.replace
    • Similar to Path.rename for DataTree, see https://bugs.python.org/issue27886 for discussion on that topic. replace is more forceful than rename: if the target path already exists, it is unconditionally replaced.
  • Path.absolute()
    • Can be useful for browsing the DataTree
  • Path.resolve()
    • Similar to absolute, but also takes symlinks into account. To be considered if symbolic links are to be implemented in DataTree
  • Path.rglob
    • Similar to Path.glob, with the ** prefix. Depends on developer's taste
  • Path.rmdir
    • To remove an entire subtree from the tree? Might be useful in conjunction with relative_to
  • Path.samefile
    • I cannot see a use for this right now
  • Path.symlink_to
    • To be considered if symbolic links are to be implemented in DataTree
  • Path.touch
    • Create an empty DataArray at that location?
  • Path.unlink
    • The naming might be confusing to work with DataTree.
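To make the rename/replace distinction and the relative_to idea concrete, here is a minimal sketch on a flat mapping of path strings to values. The function move_subtree and its overwrite flag are hypothetical, and the choice that a rename fails on collision while a replace overwrites is just one possible semantics, not a decided DataTree behaviour.

```python
from pathlib import PurePosixPath


def move_subtree(nodes: dict, src: str, dst: str, overwrite: bool) -> dict:
    """Return a new mapping with `src` and its descendants re-keyed under `dst`.

    overwrite=False mimics a strict rename (error if `dst` exists);
    overwrite=True mimics replace (whatever lives at `dst` is dropped).
    """
    src_path, dst_path = PurePosixPath(src), PurePosixPath(dst)
    if not overwrite and str(dst_path) in nodes:
        raise FileExistsError(str(dst_path))
    result = {}
    for key, value in nodes.items():
        key_path = PurePosixPath(key)
        if key_path == dst_path or dst_path in key_path.parents:
            continue  # drop the old occupant of the destination
        if key_path == src_path or src_path in key_path.parents:
            # relative_to yields the part below `src`, re-rooted at `dst`
            key_path = dst_path / key_path.relative_to(src_path)
        result[str(key_path)] = value
    return result


nodes = {"/a": "A", "/a/x": "X", "/b": "B"}
renamed = move_subtree(nodes, "/a", "/c", overwrite=False)
replaced = move_subtree(nodes, "/a", "/b", overwrite=True)
```

Calling move_subtree(nodes, "/a", "/b", overwrite=False) raises FileExistsError in this sketch, since "/b" is already taken.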

Full list

Pure Paths

  • PurePath.parts
    • "parsed" path
  • PurePath.drive Irrelevant
    • Irrelevant for PurePosixPath implementation of PurePath
  • PurePath.root
    • Relevant to differentiate between absolute and relative paths. This is already done by PurePath.is_absolute()
    • For DataTree.root, same comment as parents
    • Note: root = parents[-1]? No, currently the parents are rewound until finding a parent with root is None. Could it be simplified with parents[-1], if the path hierarchy is already known in advance?
  • PurePath.anchor Irrelevant
    • drive + root = same as root for PurePosixPath = irrelevant
  • PurePath.parents
    • The DataTree.parents should use the paths obtained via its NodePath identifier inside of the root's DataTree to produce the list of parents' DataTree.
    • Note: this means all Nodes must be aware of the root. Which is the case via the root attribute. Trees are aware of being a root or a subtree.
  • PurePath.parent
    • Same comment as parents
    • Note: parent == parents[0]?
  • PurePath.name
    • Might be useful if absolute paths are used as internal IDs inside of the tree, for string reprs. PurePaths are hashable and can be used as IDs
  • PurePath.suffix Irrelevant
  • PurePath.suffixes Irrelevant
  • PurePath.stem Irrelevant
  • PurePath.as_posix() Irrelevant
  • PurePath.as_uri() Irrelevant
  • PurePath.is_absolute()
    • Interesting, as Node IDs should be absolute.
  • PurePath.is_relative_to(other)
    • Can be interesting for quickly knowing if a node is inside of a larger tree, with path-only lookup?
  • PurePath.is_reserved() Irrelevant
  • PurePath.joinpath Irrelevant for end user
    • Cannot see the immediate utility for an end user, might be useful internally
  • PurePath.match
    • This is a "single-element" version of glob, only checking if a single path conforms to the pattern
    • Might be useful to implement DataTree.glob by mapping it against all paths contained in the tree.
  • PurePath.relative_to(other, walk_up=False)
    • Might be useful to detach a node from a tree, to generate its new paths identifiers.
  • PurePath.with_name(name)
    • Might be useful to rename a node and update the path representing it inside of its root DataTree.
  • PurePath.with_stem(stem) Irrelevant
    • Irrelevant (same reason as stem, there is no concept of extension in DataTree paths)
  • PurePath.with_suffix Irrelevant for same reason
  • PurePath.with_segments(*pathsegments)
    • Can be useful because the docs say it can be used with classes deriving from PurePath (e.g. PurePosixPath), like NodePath

Concrete Paths

Concrete paths could be implemented by a companion DataTreePath class attached to a DataTree instance.

  • Path.cwd() irrelevant
  • Path.home() irrelevant
  • Path.stat() irrelevant
  • Path.chmod() irrelevant
  • Path.exists() irrelevant
    • Can be used to determine if the path is contained in the bound instance of DataTree
  • Path.expanduser() irrelevant
  • Path.glob()
    • Can be used to map PurePath.match against all paths contained by the bound instance of DataTree
    • Regarding case_sensitive, since DataTree works with PurePosixPath, keep the POSIX default: True
  • Path.group() irrelevant
  • Path.is_dir()
    • It might be useful to discriminate between DataTree and Dataset (directory-like) and DataArray (file-like)
    • Maybe a better name like is_group could help, or is_aggregation
    • Note: Dataset may actually be closer to a leaf? At first glance, no, as it is non-atomic. One could argue that a DataArray is non-atomic too (it carries dimension coordinates)
  • Path.is_file()
    • Mirrors Path.is_dir()
    • Maybe a better name like is_dataarray could help, or is_leaf
  • Path.is_junction() irrelevant
  • Path.is_mount() irrelevant
  • Path.is_symlink()
    • To be considered if symbolic nodes are to be implemented
  • Path.is_socket() irrelevant
  • Path.is_fifo() irrelevant
  • Path.is_block_device() irrelevant
  • Path.is_char_device() irrelevant
  • Path.iterdir()
    • Like ls
  • Path.walk
    • A good candidate method to implement to explore a DataTree
    • Introduced in Python 3.12 only
    • Currently, from a developer's point of view, I use Path.rglob("*") when I need to iterate through a directory, so maybe walk is dispensable.
  • Path.lchmod irrelevant
  • Path.lstat irrelevant
  • Path.mkdir
    • Probably irrelevant, but kwargs like parents=True, exist_ok might be useful when working with groups.
  • Path.open irrelevant
  • Path.owner irrelevant
  • Path.read_bytes irrelevant
  • Path.read_text irrelevant
  • Path.readlink irrelevant
  • Path.rename
    • Might be useful to rename a node inside of the root tree
  • Path.replace
    • Similar to Path.rename for DataTree, see https://bugs.python.org/issue27886 for discussion on that topic. replace is more forceful than rename: if the target path already exists, it is unconditionally replaced.
  • Path.absolute()
    • Can be useful for browsing the DataTree
  • Path.resolve()
    • Similar to absolute, but also takes symlinks into account. To be considered if symbolic links are to be implemented in DataTree
  • Path.rglob
    • Similar to Path.glob, with the ** prefix. Depends on developer's taste
  • Path.rmdir
    • To remove an entire subtree from the tree? Might be useful in conjunction with relative_to
  • Path.samefile
    • I cannot see a use for this right now
  • Path.symlink_to
    • To be considered if symbolic links are to be implemented in DataTree
  • Path.hardlink_to Irrelevant ?
  • Path.touch
    • Create an empty DataArray at that location?
  • Path.unlink
    • The naming might be confusing to work with DataTree.
  • Path.write_bytes Irrelevant
  • Path.write_text Irrelevant

Ideas

  • Use the NodePath as the DataTree's identifier, and use path.name in the repr
  • Systematically accept PurePosixPath | str for methods expecting a path
  • Do not forbid dots in names, we cannot make assumptions of the variable names in a DataTree
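A small sketch of the last two ideas, assuming a hypothetical normalising helper as_node_path (NodePath itself is not used here, only PurePosixPath). It shows that a str or a path can be accepted interchangeably, and that dots in a name are harmless as long as suffix and stem are simply ignored.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def as_node_path(path: PurePosixPath | str) -> PurePosixPath:
    """Hypothetical helper: accept either a str or a path, return a path."""
    return PurePosixPath(path)


p = as_node_path("/group/v1.2.final")
same = as_node_path(p) == p

# The full name keeps its dots; pathlib merely also offers a suffix/stem
# split, which a tree API is free to ignore.
name = p.name
suffix = p.suffix
```

Here name is the full "v1.2.final", while suffix is ".final", which is why the suffix-related methods are tagged as irrelevant in the lists above.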

Ideas of questions for an FAQ.
An FAQ is a powerful documentation format; it is used for instance in the ruff documentation: https://docs.astral.sh/ruff/faq/
The idea is to answer as quickly as possible the questions that seem mundane to someone who knows the tool but are not obvious at all to someone starting to use it

  • Question: Can a Node belong to multiple trees?
  • Answer: I think not, as the parent has cardinality of 0..1 (0 if root, 1 if subtree)

See https://github.com/pydata/xarray/blob/fffb03c8abf5d68667a80cedecf6112ab32472e7/xarray/datatree_/datatree/datatree.py#L425

@property
def parent(self: DataTree) -> DataTree | None:
    ...
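The 0..1 cardinality can be illustrated with a toy node class. ToyNode and its attach method are hypothetical and are not DataTree's implementation; in particular, silently detaching the child from its previous parent is an assumption made for this sketch (raising an error would be another valid choice).

```python
from __future__ import annotations


class ToyNode:
    """Toy tree node with at most one parent (hypothetical, not DataTree)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: ToyNode | None = None
        self.children: dict[str, ToyNode] = {}

    def attach(self, child: ToyNode) -> None:
        # A node has a single parent slot, so attaching it elsewhere
        # first removes it from its previous parent's children.
        if child.parent is not None:
            del child.parent.children[child.name]
        child.parent = self
        self.children[child.name] = child

    @property
    def root(self) -> ToyNode:
        # Walk upward until a node without a parent is reached.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


tree_a, tree_b, leaf = ToyNode("a"), ToyNode("b"), ToyNode("leaf")
tree_a.attach(leaf)
tree_b.attach(leaf)  # leaf moves from tree_a to tree_b
```

After the second attach, leaf.parent is tree_b and tree_a has no children left, so the node belongs to exactly one tree at any time.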

@TomNicholas
Member Author

Closing in favour of pydata/xarray#9448 upstream.
