Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache Manifest files #787

Merged
merged 5 commits into from
Sep 10, 2024
Merged

Conversation

chinmay-bhat
Copy link
Contributor

@chinmay-bhat chinmay-bhat commented Jun 2, 2024

Closes #595.

@chinmay-bhat chinmay-bhat marked this pull request as ready for review June 2, 2024 06:23
@chinmay-bhat
Copy link
Contributor Author

One test is failing CI (test_duckdb_url_import).

FAILED tests/integration/test_writes/test_writes.py::test_duckdb_url_import - duckdb.duckdb.IOException: IO Error: Failed to download extension "iceberg" at URL "http://extensions.duckdb.org/v0.10.3/linux_amd64_gcc4/iceberg.duckdb_extension.gz"
Extension "iceberg" is an existing extension.
 (ERROR Connection)
= 1 failed, 797 passed, 5 skipped, 2735 deselected, 160 warnings in 261.13s (0:04:21) =
make: *** [Makefile:46: test-integration] Error 1
Error: Process completed with exit code 2.

Seems to be a download error and not due to changes in this PR.
Also, the same test passes locally. Can we re-trigger the tests?

@chinmay-bhat chinmay-bhat force-pushed the cache_manifest_files branch from aa27a48 to 173ddb9 Compare June 3, 2024 17:10
@HonahX
Copy link
Contributor

HonahX commented Jun 4, 2024

Hmm, it's my first time to see this error. I've merged a PR that bumps duckdb to 1.0.0: #793. Hope that can fix the issue here.

BTW, thanks for working on this!

pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
@chinmay-bhat chinmay-bhat force-pushed the cache_manifest_files branch 2 times, most recently from 81a85ab to 43a2f35 Compare June 9, 2024 07:21
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
pyiceberg/table/snapshots.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chinmay-bhat chinmay-bhat force-pushed the cache_manifest_files branch from 384526c to 8c2e79a Compare June 10, 2024 05:59
@chinmay-bhat
Copy link
Contributor Author

@Fokko @HonahX Thanks for the review!!

@MehulBatra
Copy link
Contributor

MehulBatra commented Jun 10, 2024

@chinmay-bhat @Fokko @HonahX Sorry for the last-minute feedback, but wouldn't it be better if we explicitly pass
(maxsize=128) for the lru_cache annotation, When a new call comes in, the decorator’s implementation will evict the least recently used of the existing 128 entries to make a place for the new item. that way it would make it clearer for the end user and increase the code visibility, in the near future we can also take that as a user config via pyiceberg.yaml

@lru_cache(maxsize=128)
def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
    """Return the manifests from the manifest list."""
    file = io.new_input(manifest_list)
    return list(read_manifest_list(file))

@@ -228,6 +229,13 @@ def __eq__(self, other: Any) -> bool:
)


@lru_cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are your thoughts on this, wouldn't be clearer if we could explicitly pass the max size for the lru_cache annotation, to store max entries rather than going default.
@lru_cache(maxsize=128)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the default is 128, I don't think we need to explicitly define maxsize=128.

Also, _manifests() is not a public API, so we would probably not set the cache size through user config via pyiceberg.yaml. Is there a use-case where the user would want to set the cache size?

Copy link
Contributor

@MehulBatra MehulBatra Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree the default size is what you mentioned, It was just an example, but the idea was to make it more readable & configurable.
Please feel free to take your call.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 128 big enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 128 should be enough. We're caching manifest_list here and the number of manifest lists should be <100 in most cases.
We can always make this configurable in the future :)

@chinmay-bhat
Copy link
Contributor Author

chinmay-bhat commented Jun 10, 2024

I'm not sure why test_duckdb_url_import continues to fail with the same error. I've rebased onto latest main branch, and the test runs locally. I also don't see this issue in other open PyIceberg PRs which have recently cleared all tests.

Is this a duck db problem? Or do I need to open a new PR (from main branch + my changes) to resolve it?

@Fokko
Copy link
Contributor

Fokko commented Jun 10, 2024 via email

@Fokko
Copy link
Contributor

Fokko commented Jun 10, 2024

@chinmay-bhat Looks like the CI is still a bit sad :(

@chinmay-bhat
Copy link
Contributor Author

chinmay-bhat commented Jun 11, 2024

Being able to configure (and also disable) the cachine would be a very nice touch

Hi @Fokko, I'm curious to know why we would want to support custom sized manifest caches.
I agree user should be able to disable the cache.

Here's my initial implementation for configuring and disabling cache:

<top of file>
MANIFEST_CACHE_SIZE = "manifest-cache-size"

config = Config()
manifest_cache_size = config.get_int(MANIFEST_CACHE_SIZE) if config.get_int(MANIFEST_CACHE_SIZE) else 128
....
@lru_cache(maxsize=manifest_cache_size)
def _manifests(io: FileIO, manifest_list: str) -> List[ManifestFile]:
    """Return the manifests from the manifest list."""
    file = io.new_input(manifest_list)
    return list(read_manifest_list(file))

class Snapshot:
....
    def manifests(self, io: FileIO, use_cache: bool = True) -> List[ManifestFile]:
        """Return the manifests for the given snapshot."""
        if self.manifest_list:
            if use_cache:
                return _manifests(io, self.manifest_list)
            else:
                _manifests.cache_clear()
                return _manifests.__wrapped__(io, self.manifest_list)
        return []

Is this in the right direction?
For reference: how to disable lru_cache and lru_cache doc

Looks like the CI is still a bit sad :(

@Fokko @HonahX I need help resolving this error. Error seems unrelated to the changes in this PR. I've added more info in this comment.

@HonahX
Copy link
Contributor

HonahX commented Jun 12, 2024

Is this a duck db problem? Or do I need to open a new PR (from main branch + my changes) to resolve it?

I tried this PR locally and did not observe this issue. My testing platforms are

  • MacOS 14.5 - Apple Silicone
  • Ubuntu 24.04 - Intel x86/64

It seems this issue can only be reproduced in github action😱:

@chinmay-bhat
Copy link
Contributor Author

chinmay-bhat commented Jun 12, 2024

I created a new PR against my fork, and once the GitHub actions failed, I manually re-tried them.
https://github.com/chinmay-bhat/iceberg-python/pull/1/checks?sha=8c2e79a9c62f98c51eb56ae65369a7bf3f6d49f4

With lru_cache,

  • Python Integration / integration-test (make test-integration) passes.
  • Python CI / lint-and-test (make test-coverage) fails.

Also, with lru_cache, test-integration and test-coverage both pass locally.
My environment is MacOS 14.4.1 - Apple Silicon M2

@kevinjqliu
Copy link
Contributor

FYI this is caused by transient networking failures when downloading the duckdb iceberg extension. Using the pre-release duckdb library ([email protected]) resolves this issue because retry was added in duckdb#13122.
See duckdb#13808 for more info. Might need to wait for duckdb v1.1.0 release

@kevinjqliu
Copy link
Contributor

#1149 bumps duckdb to v1.1.0 which has the PR to retry extension installation

@kevinjqliu
Copy link
Contributor

@chinmay-bhat do you mind rebasing off main to re-run CI?

@chinmay-bhat
Copy link
Contributor Author

rebased off main!

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! CI finally passed!
Thanks for the contribution

@kevinjqliu kevinjqliu merged commit 1971fcf into apache:main Sep 10, 2024
8 checks passed
@chinmay-bhat chinmay-bhat deleted the cache_manifest_files branch September 11, 2024 08:27
kevinjqliu added a commit that referenced this pull request Sep 12, 2024
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* cache manifests

* update API

* small fix

* move cache to module level

* update signature and check
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* cache manifests

* update API

* small fix

* move cache to module level

* update signature and check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement caching of manifest-files
6 participants