-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache Manifest files #787
Cache Manifest files #787
Conversation
One test is failing CI (test_duckdb_url_import).
Seems to be a download error and not due to changes in this PR. |
aa27a48
to
173ddb9
Compare
Hmm, it's my first time to see this error. I've merged a PR that bumps BTW, thanks for working on this! |
81a85ab
to
43a2f35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
384526c
to
8c2e79a
Compare
@chinmay-bhat @Fokko @HonahX Sorry for the last-minute feedback, but wouldn't it be better if we explicitly pass
|
@@ -228,6 +229,13 @@ def __eq__(self, other: Any) -> bool: | |||
) | |||
|
|||
|
|||
@lru_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are your thoughts on this, wouldn't be clearer if we could explicitly pass the max size for the lru_cache annotation, to store max entries rather than going default.
@lru_cache(maxsize=128)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the default is 128, I don't think we need to explicitly define maxsize=128
.
Also, _manifests()
is not a public API, so we would probably not set the cache size through user config via pyiceberg.yaml. Is there a use-case where the user would want to set the cache size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree the default size is what you mentioned, It was just an example, but the idea was to make it more readable & configurable.
Please feel free to take your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 128 big enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 128 should be enough. We're caching manifest_list
here and the number of manifest lists should be <100 in most cases.
We can always make this configurable in the future :)
I'm not sure why Is this a duck db problem? Or do I need to open a new PR (from main branch + my changes) to resolve it? |
Being able to configure (and also disable) the cachine would be a very nice
touch
Op ma 10 jun 2024 om 09:52 schreef Chinmay Bhat ***@***.***>
… ***@***.**** commented on this pull request.
------------------------------
In pyiceberg/table/snapshots.py
<#787 (comment)>
:
> @@ -228,6 +229,13 @@ def __eq__(self, other: Any) -> bool:
)
***@***.***_cache
As the default is 128
<https://docs.python.org/3/library/functools.html#functools.lru_cache>, I
don't think we need to explicitly define maxsize=128.
Also, _manifests() is not a public API, so we would *probably* not set
the cache size through user config via pyiceberg.yaml. Is there a use-case
where the user would want to set the cache size?
—
Reply to this email directly, view it on GitHub
<#787 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIU5KBBXVDMDRY62QQEY2LZGVLMNAVCNFSM6AAAAABIUZG46GVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCMBWHEZTSOJSGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@chinmay-bhat Looks like the CI is still a bit sad :( |
Hi @Fokko, I'm curious to know why we would want to support custom sized manifest caches. Here's my initial implementation for configuring and disabling cache: <top of file>
MANIFEST_CACHE_SIZE = "manifest-cache-size"
config = Config()
manifest_cache_size = config.get_int(MANIFEST_CACHE_SIZE) if config.get_int(MANIFEST_CACHE_SIZE) else 128
....
@lru_cache(maxsize=manifest_cache_size)
def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
"""Return the manifests from the manifest list."""
file = io.new_input(manifest_list)
return list(read_manifest_list(file))
class Snapshot:
....
def manifests(self, io: FileIO, use_cache: bool = True) -> List[ManifestFile]:
"""Return the manifests for the given snapshot."""
if self.manifest_list:
if use_cache:
return _manifests(io, self.manifest_list)
else:
_manifests.cache_clear()
return _manifests.__wrapped__(io, self.manifest_list)
return [] Is this in the right direction?
@Fokko @HonahX I need help resolving this error. Error seems unrelated to the changes in this PR. I've added more info in this comment. |
I tried this PR locally and did not observe this issue. My testing platforms are
It seems this issue can only be reproduced in github action😱:
|
I created a new PR against my fork, and once the GitHub actions failed, I manually re-tried them. With
Also, with |
FYI this is caused by transient networking failures when downloading the duckdb iceberg extension. Using the pre-release duckdb library ( |
#1149 bumps duckdb to v1.1.0 which has the PR to retry extension installation |
@chinmay-bhat do you mind rebasing off |
8c2e79a
to
a54c570
Compare
rebased off main! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! CI finally passed!
Thanks for the contribution
This reverts commit 1971fcf.
* cache manifests * update API * small fix * move cache to module level * update signature and check
* cache manifests * update API * small fix * move cache to module level * update signature and check
Closes #595.