Expose the Delta Log in a DataFrame that's easy for analysis #1031

Closed
MrPowers opened this issue Dec 20, 2022 · 4 comments · Fixed by #1033
Labels
enhancement New feature or request

Comments

@MrPowers
Collaborator

The _delta_log contains all sorts of valuable information for end users. Valuable chunks of Delta Log data are stored in JSON files that aren't easy for users to access.

It'd be great if the file name, file size, modification time, and column statistics were exposed to the user in a DataFrame so they could better manage their Delta table. Here's a possible interface:

import deltalake as dl

dt = dl.DeltaTable("./tmp/delta-table")
dt.delta_log_detail()

That would return a DataFrame with these columns:

  • file_name
  • file_size
  • modification_time
  • data_change
  • col_a_min
  • col_a_max
  • col_b_min
  • col_b_max
  • ...
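As a rough illustration of where these columns would come from, here is a minimal sketch that replays the JSON commit files in `_delta_log` using only the standard library. The function name `active_files` is hypothetical and is not part of deltalake. It assumes every commit is still present as a JSON file and ignores Parquet checkpoints, so it is not a complete log reader.

```python
import json
from pathlib import Path


def active_files(table_path):
    """Replay the JSON commits and return one dict per file still in the table.

    Assumption: no commits have been cleaned up and checkpoints are ignored.
    """
    log_dir = Path(table_path) / "_delta_log"
    files = {}
    # Commit names are zero-padded version numbers, so a lexicographic
    # sort is also the version order.
    for commit in sorted(log_dir.glob("*.json")):
        if not commit.stem.isdigit():
            continue  # skip anything that is not a plain commit file
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                row = {
                    "file_name": add["path"],
                    "file_size": add["size"],
                    "modification_time": add["modificationTime"],
                    "data_change": add["dataChange"],
                }
                # Per-file statistics are stored as a JSON-encoded string.
                stats = json.loads(add.get("stats") or "{}")
                for col, value in stats.get("minValues", {}).items():
                    row[f"{col}_min"] = value
                for col, value in stats.get("maxValues", {}).items():
                    row[f"{col}_max"] = value
                files[add["path"]] = row
            elif "remove" in action:
                # A later commit dropped this file from the table.
                files.pop(action["remove"]["path"], None)
    return list(files.values())
```

The returned list of dicts could be handed to a DataFrame constructor; the point here is only that the proposed columns map directly onto fields of the `add` actions.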

Here are the types of questions the users could answer with this metadata:

  • How many files in my Delta table are smaller than 10,000 bytes?
  • What’s the current distribution of the col_a_max values?
  • How many bytes of data did we ingest yesterday?

This would help users a lot before they perform expensive computations.
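To make the three questions concrete, here is a small sketch over rows shaped like the proposed columns. The sample rows are invented for illustration, and the helper names `ms` and `bytes_added_on` are hypothetical. It assumes `modification_time` is in epoch milliseconds (the unit the Delta log uses for `modificationTime`) and that it is a reasonable proxy for ingestion time.

```python
from collections import Counter
from datetime import date, datetime, timedelta, timezone


def ms(year, month, day, hour=0):
    """Epoch milliseconds in UTC for the given date and hour."""
    moment = datetime(year, month, day, hour, tzinfo=timezone.utc)
    return int(moment.timestamp() * 1000)


# Hypothetical rows; every value here is made up.
rows = [
    {"file_name": "part-0.parquet", "file_size": 4_200,
     "modification_time": ms(2022, 12, 19, 8), "data_change": True, "col_a_max": 10},
    {"file_name": "part-1.parquet", "file_size": 58_000,
     "modification_time": ms(2022, 12, 19, 21), "data_change": True, "col_a_max": 10},
    {"file_name": "part-2.parquet", "file_size": 9_100,
     "modification_time": ms(2022, 12, 20, 3), "data_change": True, "col_a_max": 25},
]

# How many files are smaller than 10,000 bytes?
small_files = sum(1 for r in rows if r["file_size"] < 10_000)

# What is the distribution of the col_a_max values?
col_a_max_distribution = Counter(r["col_a_max"] for r in rows)


def bytes_added_on(rows, day):
    """Sum the sizes of data-changing files whose timestamp falls on `day` (UTC)."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    start_ms = int(start.timestamp() * 1000)
    end_ms = int((start + timedelta(days=1)).timestamp() * 1000)
    return sum(
        r["file_size"]
        for r in rows
        if r["data_change"] and start_ms <= r["modification_time"] < end_ms
    )


# How many bytes were ingested "yesterday", relative to a fixed example date?
today = date(2022, 12, 20)
ingested_yesterday = bytes_added_on(rows, today - timedelta(days=1))
```

With a real DataFrame the same questions would become filters and aggregations; none of them requires reading the data files themselves.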

@wjones127
Collaborator

That does sound useful. We should be careful about the naming, though, since I think this could be confused with two other ideas:

  1. Delta History
  2. We might also want an API that returns a DataFrame containing the full delta log across versions (IIUC what you are proposing just shows the files in the current version, or generally in a particular version).
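To illustrate the second idea, a full-log view could keep every `add` and `remove` action and tag it with the commit version, instead of collapsing to the files live in one version. This is only a sketch: the function name `log_actions` and the output columns are hypothetical, and it reads just the JSON commit files, not checkpoints.

```python
import json
from pathlib import Path


def log_actions(table_path):
    """Return one dict per add/remove action, tagged with its commit version."""
    rows = []
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        if not commit.stem.isdigit():
            continue  # only plain commit files such as 00000000000000000003.json
        version = int(commit.stem)
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            for kind in ("add", "remove"):
                if kind in action:
                    rows.append({
                        "version": version,
                        "action": kind,
                        "path": action[kind]["path"],
                        # size is optional on remove actions
                        "size": action[kind].get("size"),
                    })
    return rows
```

Filtering these rows to a single version's surviving `add` actions would give back the per-version view proposed above.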

@MrPowers
Collaborator Author

Yep, I'm open to suggestions for names here. FYI, for other interested parties, history() has already been implemented in this lib.

Making it easy to grab the full log across versions / for a given version would be ideal. That's a good point. That'd be especially ideal if there was a log entry for vacuum commands and we could indicate the data that's already been vacuumed.

@chitralverma
Contributor

Can we also add an indicator of the "number of versions available" somewhere in this metadata?

@wjones127
Collaborator

@chitralverma Unfortunately that isn't trivial, since we don't track that anywhere currently, and figuring it out requires looking through the log to see which files are still around. I've created #1037 to track that.

wjones127 added a commit that referenced this issue Jan 11, 2023
# Description

Exposes a function to get a DataFrame of add actions for a selected version
of the table.

TODO:

 * [x] add unit tests
 * [x] write user guide
 * [x] handle partition columns
 * [x] handle stats
 * [x] handle tags
 * [x] add a `flatten` option

# Related Issue(s)

- closes #1031

chitralverma pushed a commit to chitralverma/delta-rs that referenced this issue Mar 17, 2023