Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement caching of manifest-files #595

Closed
Fokko opened this issue Apr 10, 2024 · 8 comments · Fixed by #787 or #1187
Closed

Implement caching of manifest-files #595

Fokko opened this issue Apr 10, 2024 · 8 comments · Fixed by #787 or #1187
Assignees

Comments

@Fokko
Copy link
Contributor

Fokko commented Apr 10, 2024

Feature Request / Improvement

We currently loop over the manifests of a snapshot often just once. But now when we're compounding the operations (DELETE+APPEND), there is a fair chance that read a manifest more than once. The spec states that manifest files are immutable, this means that we can cache is locally using a method annotated with the lru_cache.

@swapdewalkar
Copy link

I am trying to working on this, is it possible to assign it to me?

@Fokko
Copy link
Contributor Author

Fokko commented Apr 10, 2024

@swapdewalkar Thanks for picking this up! I've just assigned it to you

@MehulBatra
Copy link
Contributor

Hi @swapdewalkar I wanted to check in and see if you have any updates on this task. If you need any assistance or if there are any obstacles, please let me know—I will be happy to help!

@chinmay-bhat
Copy link
Contributor

chinmay-bhat commented May 26, 2024

Hi, can we increase the scope of this issue to cache/store all_manifests, data_manifests & delete_manifests? Or do I create a new issue for this? This feature would be useful for tasks like Incremental Scans (Append, Changelog, etc) where we frequently access manifest files. I imagine we would like this feature to be similar to the java implementation.

Also, since @swapdewalkar hasn't responded yet and if they do not have the time/bandwidth for the issue, I'm happy to give this a shot! :)

@Fokko
Copy link
Contributor Author

Fokko commented May 26, 2024

@chinmay-bhat I think we can generalize this quite easily, since from the spec:

Once written, data and metadata files are immutable until they are deleted.

I think we could go as easy to have a lru-cache based on the path to the metadata to cache it :)

@chinmay-bhat
Copy link
Contributor

chinmay-bhat commented May 26, 2024

Thanks @Fokko for the quick response! Really appreciate it!

based on the path to the metadata to cache it

I'm not clear on this. Are you saying we can simply add lru_cache to def manifests(self, io: FileIO) in class Snapshot? And then whenever we need data manifests or delete manifests, we iterate over the cached manifests? Wouldn't it be better to cache those too, since as you said, the files are immutable?

For ex:

@lru_cache
def manifests(self, io: FileIO):
......


@lru_cache
def data_manifests(self):
   return [manifest_file for manifest_file in self.manifests(self.io) if manifest_file.content == ManifestContent.DATA]

@Fokko
Copy link
Contributor Author

Fokko commented May 26, 2024

@chinmay-bhat I don't think it is as easy as that. We should ensure that the manifest_list path is part of the cache. We could share the cache between calls, since if you do subsequent queries, and the snapshot hasn't been updated, this would speed up the call quite a bit.

We could also make the FileIO part of the caching key. I don't think that's stricktly required, but if something changed in the FileIO we might want to invalidate the cache, but I'm open to arguments here.

@chinmay-bhat
Copy link
Contributor

chinmay-bhat commented May 27, 2024

Thank you for clarifying! Here's how I imagine manifests() would look like :)

@lru_cache()
def manifests(self, io: FileIO, manifest_list_location: str) -> List[ManifestFile]:
    if manifest_list_location is not None:
        file = io.new_input(manifest_list_location)
        return list(read_manifest_list(file))
    return []

When we call snapshot.manifests(self.io, snapshot.manifest_list), if manifest_list is the same, we simply query the cached files. But if the snapshot is updated, manifest_list is also updated, and calling manifests() triggers a re-read of manifest files.
Is this similar to what you have in mind?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment