-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate treenode module. #8757
Migrate treenode module. #8757
Conversation
PR (#8702) added nbytes representation in DataArrays and Dataset repr, this adds it to the datatree tests.
Moves treenode.py and test_treenode.py. Updates some typing. Updates imports from treenode.
Add test tree structure for easier understanding.
There must be a better way, but I don't know it. particularly the list comprehension casts.
182a5ac
to
32e7453
Compare
This test was broken becuase only the root node was being tested and none of the previous nodes were represented in the __str__.
Thanks @flamingbear . I am just wondering if you have any overall thoughts on the design / implementation here? e.g. the EDIT: If you think it's all fine then that's great, I'm just trying to create an explicit opportunity for you to ask "why didn't you do it this other way?" |
So my overall thoughts as I was moving it / grokking it was that the separation of concerns between the movement, linking of nodes and parenting rules vs the finding by names and paths as pretty clear. The hooks for before and after attaching children seem thoughtful. I haven't dived deep into the datatree code yet but this all makes sense so far. If you were to scrap this for a nested dictionary, wouldn't you end up with pretty complicated code to handle moving/adding nodes? My gut says this is a solid choice to represent tree data. |
@@ -25,7 +21,7 @@ class InvalidTreeError(Exception): | |||
|
|||
|
|||
class NotFoundInTreeError(ValueError): | |||
"""Raised when operation can't be completed because one node is part of the expected tree.""" | |||
"""Raised when operation can't be completed because one node is not part of the expected tree.""" | |||
|
|||
|
|||
class NodePath(PurePosixPath): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're exposing it (and keep it as a pure path, which I think we should), should we move this class to a separate module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we exposing NodePath
? I suggested it, but after talking with @etienneschalk decided maybe it wasn't particularly helpful xarray-contrib/datatree#205.
What's the logic for making it a pure path? The posix path is to ensure the slashes are in a consistent direction (remember this isn't a real filepath so shouldn't change automatically on windows systems).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this, but I believe the difference is that PurePosixPath
doesn't have methods that require read/write access (like touch
, remove
, mkdir
, iterdir
, ...), while PosixPath
does.
The main difference between filesystems and datatree
is that there's only a single filesystem instance at a time, while you can have multiple DataTree
objects at the same time. This means that a NodePath
inheriting from PosixPath
would have to be created from the DataTree
object and would require a lot of discipline to not be confusing.
On the other hand, a NodePath
that inherits from PurePosixPath
would help a bit with path manipulations (name
, root
, joinpath()
, parts
, parent
, parents
, ...) that we would otherwise have to use posixpath
for. I didn't check if posixpath
has the same functionality as PurePosixPath
, though, and they might also not be as convenient to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I wrote my above comment too quickly - we both agree it should stay as a pure path.
We could also consider adding extra methods to this NodePath
class if it simplifies other parts of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related issue: xarray-contrib/datatree#283 (Consistency between DataTree methods and pathlib.PurePath methods)
To me the graal would be to have datatree more and more compatible with PurePosixPath
(what NodePath
already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath
, without the need to expose NodePath
publicly).
As @keewis mentioned , there is a single instance of the filesystem whereas there can be multiple DataTrees. So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath
methods.
NodePath ~ PurePosixPath
DataTree ~ PosixPath
The idea here is to delegate to pathlib
the diffcult task of choosing the best names for methods that help manipulating trees. For an audience of developers that are already used to writing scripts with pathlib
, this would make working with datatree automatic! And this also gives plenty of ideas of features. Also, with hierarchical-formats such as Zarr that rely on the filesystem, openable by datatree, the proximity is innate.
Concrete example: working with multi-resolution rasters (a usecase of datatree). By using f-strings and glob, you can write resolution independant code once, like you would have done by manipulating a data structure on the file system
The schema on the pathlib doc (arrows are inheritance):
flowchart BT
PurePosixPath --> PurePath
Path --> PurePath
PureWindowsPath --> PurePath
PosixPath --> PurePosixPath
PosixPath --> Path
WindowsPath --> PureWindowsPath
WindowsPath --> Path
How I see it:
flowchart BT
subgraph Pure
PurePosixPath --> PurePath
NodePath --> PurePosixPath
end
subgraph Concrete
DataTree --> NodePath
PosixPath --> PurePosixPath
end
(I removed Path
for simplification)
I am not 100% sure it fully makes sense... The nuance is that when you instanciate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etienneschalk for all your thoughts here!
So a single DataTree could act like a "small filesystem on its own" and implement the PosixPath methods.
I am not 100% sure it fully makes sense... The nuance is that when you instanciate a Path, it is always implicit that it is bound to the filesystem, while each DataTree instantiation creates its own data representation.
I strongly feel that DataTree / NodePath objects should not be associated with concrete filesystem paths.
However this
To me the graal would be to have datatree more and more compatible with PurePosixPath (what NodePath already does by inheriting from it, but we can imagine having an API accepting both strings and PurePosixPath, without the need to expose NodePath publicly).
seems like a reasonable idea, and I think we should continue the discussion of that in xarray-contrib/datatree#283.
For now I don't think there is anything we need to do in this PR specifically - we can revisit the API choices / accepting path-like types in a future PR.
personally, would probably rename test_io.py to something related to
datatree io and move all of the tests into xarray/testsat the top level.
That seems fine to me. Integrate the IO tests because the backends are
integrated. Separate the tests of tree functionality so that it's easy to
test different parts of the codebase independently during development
(that's why having a separate test_treenode.py is useful). But whether
test_treenode.py lives in a subdirectory or not just doesn't really matter
I think.
…On Mon, Feb 19, 2024, 11:02 AM Justus Magin ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
On xarray/tests/datatree/test_treenode.py
<#8757 (comment)>:
let's discuss this in the meeting tomorrow, but yeah, I don't have any
strong opinions.
—
Reply to this email directly, view it on GitHub
<#8757 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AISNPI366CTJ6KUSLPD4PILYUOAS7AVCNFSM6AAAAABDK6TBVGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTQOBYHA4DONJVGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Question is I did update below the released line to give Tom some credit. I hope that's is allowable.
So that failure looks like a common one as opposed to something from this PR. Is there anything specific to address in the PR before it can be merged? |
We made an executive decision in the datatree meeting today to not bother with subdirectories (for tests or code). We can always reorganise this later if we want to. |
I don't think so - @keewis what say you? |
* main: (31 commits) correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713) Expand use of `.oindex` and `.vindex` (pydata#8790) Return a dataclass from Grouper.factorize (pydata#8777) [skip-ci] Fix upstream-dev env (pydata#8839) Add dask-expr for windows envs (pydata#8837) [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835) Add `dask-expr` to environment-3.12.yml (pydata#8827) Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736) Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784) try to get the `upstream-dev` CI to complete again (pydata#8823) Bump the actions group with 1 update (pydata#8818) Update documentation for clarity (pydata#8817) DOC: link to zarr.convenience.consolidate_metadata (pydata#8816) Refactor Grouper objects (pydata#8776) Grouper object design doc (pydata#8510) Bump the actions group with 2 updates (pydata#8804) tokenize() should ignore difference between None and {} attrs (pydata#8797) fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791) Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782) Migrate treenode module. (pydata#8757) ...
* main: (42 commits) correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713) Expand use of `.oindex` and `.vindex` (pydata#8790) Return a dataclass from Grouper.factorize (pydata#8777) [skip-ci] Fix upstream-dev env (pydata#8839) Add dask-expr for windows envs (pydata#8837) [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835) Add `dask-expr` to environment-3.12.yml (pydata#8827) Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736) Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784) try to get the `upstream-dev` CI to complete again (pydata#8823) Bump the actions group with 1 update (pydata#8818) Update documentation for clarity (pydata#8817) DOC: link to zarr.convenience.consolidate_metadata (pydata#8816) Refactor Grouper objects (pydata#8776) Grouper object design doc (pydata#8510) Bump the actions group with 2 updates (pydata#8804) tokenize() should ignore difference between None and {} attrs (pydata#8797) fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791) Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782) Migrate treenode module. (pydata#8757) ...
Migrate datatree's treenode module the base class representing node of a tree, with methods for accessing nodes and traversal.
datatree/treenode.py
Track merging datatree into xarray #8572whats-new.rst
whats-new.rst
api.rst