[Feature Request]Delta kernel can not get file stats #3771

Open
dongxiao1198 opened this issue Oct 16, 2024 · 4 comments
Labels
enhancement New feature or request

dongxiao1198 commented Oct 16, 2024

Feature request

Which Delta project/connector is this regarding?

  • [✓] Kernel

Overview

Since delta-standalone has been deprecated, we are migrating our project from delta-standalone to delta-kernel.
However, we found that delta-kernel cannot return file stats when scanning the file list.

In delta-standalone, we can get file stats in this class : . We can also get the change logs
using "Iterator getChanges" in io.delta.standalone.DeltaLog, which is not available in delta-kernel either.

Motivation

  • our project needs the min and max stats of each file to do some optimization
  • the change logs will be used to maintain the file list cache incrementally (since we cannot list all files each time we scan this table)
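
One common use of per-file min/max stats is file skipping: a file is dropped from the scan when its value range cannot overlap the query's range. The sketch below illustrates that idea only. `FileStats` and `prune` are hypothetical names invented for this example and are not part of delta-kernel or delta-standalone; the stats are assumed to cover a single long-typed column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinMaxPruning {
    /** Hypothetical per-file stats for one long-typed column (not a Delta type). */
    static final class FileStats {
        final String path;
        final Long min; // null when the stat is unavailable
        final Long max; // null when the stat is unavailable

        FileStats(String path, Long min, Long max) {
            this.path = path;
            this.min = min;
            this.max = max;
        }
    }

    /** Returns the paths of files whose [min, max] range may overlap [lo, hi]. */
    static List<String> prune(List<FileStats> files, long lo, long hi) {
        List<String> kept = new ArrayList<>();
        for (FileStats f : files) {
            // Without stats the file cannot be ruled out, so it is kept.
            boolean unknown = f.min == null || f.max == null;
            if (unknown || (f.max >= lo && f.min <= hi)) {
                kept.add(f.path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileStats> files = Arrays.asList(
            new FileStats("part-0.parquet", 0L, 99L),
            new FileStats("part-1.parquet", 100L, 199L),
            new FileStats("part-2.parquet", null, null));
        // Query range 120..150: part-0 is skipped, part-2 is kept for lack of stats.
        System.out.println(prune(files, 120, 150)); // prints [part-1.parquet, part-2.parquet]
    }
}
```

Keeping files with missing stats makes the pruning best-effort: it may read more files than strictly necessary, but it never drops a file that could contain matching rows.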

Further details

Willingness to contribute

  • [✓] No. I cannot contribute this feature at this time.
@dongxiao1198 dongxiao1198 added the enhancement New feature or request label Oct 16, 2024
@dongxiao1198 dongxiao1198 changed the title [Feature Request]Delta kernel can not get file statistic [Feature Request]Delta kernel can not get file stats Oct 16, 2024

wgtmac commented Nov 4, 2024

@nastra Could you please take a look at this?

Contributor

nastra commented Nov 4, 2024

FYI @scottsand-db

@scottsand-db
Collaborator

Hi @wgtmac -- can you please tell me a bit more about your use case for file stats and for getChanges?

We allow you to include a filter during the ScanBuilder -- what more would you need the file stats for?

Could you also please look at this internal (not public) API for getChanges in Kernel and see if that fits your use case? We can consider making it public.

public CloseableIterator<ColumnarBatch> getChanges(

wgtmac commented Nov 5, 2024

Thanks for the reply from @scottsand-db and help from @nastra!

We use the delta kernel as a metadata client in our proprietary lakehouse to read from delta lake tables. To efficiently make splits at any snapshot and cache the file lists, we need to get the following metadata from the API, all of which is available in delta standalone:

  1. Column stats: Carry the column stats (at least the min/max values, if available) of each parquet file, so that we can prune the list of files to scan on a best-effort basis.
  2. Get latest snapshot version: A cheap way to return the current version without actually replaying the delta logs.
  3. Get change logs between arbitrary snapshots: Sometimes we need to cache the file list of a specific version and then incrementally sync it to the latest version. It would be great if the delta client supported incremental scans that return file list changes within a specified version range.
  4. Stateful table object: This is similar to request 3 above. The current table object is pinned to a snapshot and cannot call update() to incrementally sync to the latest version, which the standalone library supports.
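
To make requests 3 and 4 concrete, here is a minimal sketch of how a cached file list could be brought forward from one version to a later one by replaying add/remove changes. It does not use any delta-kernel or delta-standalone API: `FileListCache`, `Change`, and `sync` are hypothetical names for illustration, and the changes are assumed to arrive sorted by ascending version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FileListCache {
    /** Hypothetical change record: one file added or removed at a table version. */
    static final class Change {
        final long version;
        final boolean isAdd;
        final String path;

        Change(long version, boolean isAdd, String path) {
            this.version = version;
            this.isAdd = isAdd;
            this.path = path;
        }
    }

    private final Set<String> files = new LinkedHashSet<>();
    private long version;

    FileListCache(long version, List<String> initialFiles) {
        this.version = version;
        this.files.addAll(initialFiles);
    }

    /** Applies changes newer than the cached version; returns the new version. */
    long sync(List<Change> changes) {
        long base = version; // changes at or below this version are already reflected
        for (Change c : changes) {
            if (c.version <= base) {
                continue;
            }
            if (c.isAdd) {
                files.add(c.path);
            } else {
                files.remove(c.path);
            }
            version = Math.max(version, c.version);
        }
        return version;
    }

    List<String> files() {
        return new ArrayList<>(files);
    }

    public static void main(String[] args) {
        FileListCache cache = new FileListCache(3, Arrays.asList("a.parquet", "b.parquet"));
        long v = cache.sync(Arrays.asList(
            new Change(3, true, "stale.parquet"), // ignored: already at version 3
            new Change(4, false, "a.parquet"),
            new Change(4, true, "c.parquet"),
            new Change(5, true, "d.parquet")));
        System.out.println(v + " " + cache.files()); // prints 5 [b.parquet, c.parquet, d.parquet]
    }
}
```

The cache only needs the changes after its own version, which is why a change-log API over a version range avoids listing every file on each scan.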

Hopefully my explanation makes sense.
