generated from datalad/datalad-extension-template
-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance impact of PathBasedItem
#581
Comments
This was referenced Jan 5, 2024
mih
added a commit
to mih/datalad-next
that referenced
this issue
Jan 5, 2024
Previously, the `name` attribute was some kind of `PurePath` instance. The precise nature varied across item types and corresponding iterators. This led to two problems: - Performance (see datalad#581): the creation of a Path instance is not particularly cheap. If a consumer does not need such an instance, and unconditional creation is a waste of resources - Abstraction (see datalad#430): the Path representation of a path-like name does not always match the exact semantics of the original name (it might have a trailing slash that has a purpose, etc) Both problems are now addressed by relaxing the type of `.name` of a `PathBasedItem`. It can now be anything. Consumers that require an actual Path instance can use the `.path` attribute. An analog access is also implemented for `.link_target` (which now remains literal), that is accompanied by `.link_target_path`, where it was necessary. This change in approach removed the need for the `_ZipFileDirPath` workaround that was used to re-establish compatibility of a Path-based `.name` with the conventions in a `ZipFile` collection/container. With the `.name` attribute retaining its native semantics, the workaround is no longer needed as was removed. In order to document the (lack of) implemented homogeneous conventions for path-based items, the docstrings of all iterators have been amended with information on the nature of the `.name` attribute yielded by them. The corresponding data classes have received docstrings for the newly added properties for Path access. These properties uniformly use the `cached_property` decorator. For the lifetime of an item, the nature of such a path should never change, and caching it automatically addresses the significant creation cost on accessing a path representation repeatedly. Closes datalad#554 Closes datalad#581
mih
added a commit
to mih/datalad-next
that referenced
this issue
Jan 5, 2024
Previously, the `name` attribute was some kind of `PurePath` instance. The precise nature varied across item types and corresponding iterators. This led to two problems: - Performance (see datalad#581): the creation of a Path instance is not particularly cheap. If a consumer does not need such an instance, and unconditional creation is a waste of resources - Abstraction (see datalad#430): the Path representation of a path-like name does not always match the exact semantics of the original name (it might have a trailing slash that has a purpose, etc) Both problems are now addressed by relaxing the type of `.name` of a `PathBasedItem`. It can now be anything. Consumers that require an actual Path instance can use the `.path` attribute. An analog access is also implemented for `.link_target` (which now remains literal), that is accompanied by `.link_target_path`, where it was necessary. This change in approach removed the need for the `_ZipFileDirPath` workaround that was used to re-establish compatibility of a Path-based `.name` with the conventions in a `ZipFile` collection/container. With the `.name` attribute retaining its native semantics, the workaround is no longer needed as was removed. In order to document the (lack of) implemented homogeneous conventions for path-based items, the docstrings of all iterators have been amended with information on the nature of the `.name` attribute yielded by them. The corresponding data classes have received docstrings for the newly added properties for Path access. These properties uniformly use the `cached_property` decorator. For the lifetime of an item, the nature of such a path should never change, and caching it automatically addresses the significant creation cost on accessing a path representation repeatedly. Closes datalad#554 Closes datalad#581
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This dataclass is used in pretty much all iterators of
iter_collections
. In #580 I saw that the conversion of path strings toPurePath
instances can represent as much as half of the runtime of such an iterator.It seems worthwhile to consider optimizations in this regard.
One possibility would be to change
PathBasedItem
to leave itsname
property untouched (Any
), and add an accessor property (@cached_property
)path
that performs the conversion lazily. With this approach, any conversion performance cost would only need to be paid when actually necessary. Moreover, consumer would be given the choice of format of an itemname
.The text was updated successfully, but these errors were encountered: