feat: expose function to get table of add actions #1033
@wjones127 Can we add an indicator of the "number of versions available" somewhere in this metadata?
985ab2e to 2e356ef
Example:
In [1]: from deltalake import DeltaTable, write_deltalake
In [2]: import pyarrow as pa
In [3]: data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
In [4]: write_deltalake("tmp", data, partition_by=["x"])
In [5]: dt = DeltaTable("tmp")
In [6]: dt.get_add_actions_df()
Out[6]:
pyarrow.RecordBatch
path: string
size_bytes: int64
modification_time: timestamp[ms]
data_change: bool
partition_values: struct<x: int64>
  child 0, x: int64
num_records: int64
null_count: struct<y: int64 not null>
  child 0, y: int64 not null
min: struct<y: int64 not null>
  child 0, y: int64 not null
max: struct<y: int64 not null>
  child 0, y: int64 not null
In [7]: dt.get_add_actions_df().to_pandas()
Out[7]:
                                                path  size_bytes        modification_time  data_change  partition_values  num_records  null_count       min       max
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565  1970-01-20 08:40:08.071         True          {'x': 2}            1    {'y': 0}  {'y': 5}  {'y': 5}
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565  1970-01-20 08:40:08.071         True          {'x': 3}            1    {'y': 0}  {'y': 6}  {'y': 6}
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565  1970-01-20 08:40:08.071         True          {'x': 1}            1    {'y': 0}  {'y': 4}  {'y': 4}
In [8]: dt.get_add_actions_df(flatten=True).to_pandas()
Out[8]:
                                                path  size_bytes        modification_time  data_change  partition.x  num_records  null_count.y  min.y  max.y
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565  1970-01-20 08:40:08.071         True            2            1             0      5      5
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565  1970-01-20 08:40:08.071         True            3            1             0      6      6
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565  1970-01-20 08:40:08.071         True            1            1             0      4      4
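To make the difference between Out[7] and Out[8] concrete, here is a rough pure-Python sketch of how nested struct columns can be flattened into dotted column names. It is only an illustration, not the implementation in this PR: `flatten_row` is a hypothetical helper, and it produces `partition_values.x` where the real output uses the shorter `partition.x`.

```python
from typing import Any, Dict


def flatten_row(row: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dict values into dotted column names (hypothetical helper)."""
    flat: Dict[str, Any] = {}
    for name, value in row.items():
        if isinstance(value, dict):
            # Recurse so that deeper structs also become dotted names.
            for child_name, child_value in flatten_row(value, sep).items():
                flat[f"{name}{sep}{child_name}"] = child_value
        else:
            flat[name] = value
    return flat


# Values copied from the first row of Out[7]; the path column is left out
# because it is truncated in that output.
row = {
    "size_bytes": 565,
    "data_change": True,
    "partition_values": {"x": 2},
    "num_records": 1,
    "null_count": {"y": 0},
    "min": {"y": 5},
    "max": {"y": 5},
}
flat = flatten_row(row)
print(flat)
```

The struct columns disappear and each child becomes its own top-level column, which is the shape pandas users typically expect.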
b3810ac to 92c63de
@@ -70,6 +70,7 @@

#![deny(warnings)]
#![deny(missing_docs)]
#![allow(rustdoc::invalid_html_tags)]
I need this in order to build the docs for some reason. Newer lint?
Great work! Really looking forward to moving arrow deeper into our log handling :).
Left one minor naming comment that you may want to look at; otherwise LGTM!
python/deltalake/table.py
Outdated
@@ -440,3 +440,37 @@ def __stringify_partition_values(
                str_value = str(value)
            out.append((field, op, str_value))
        return out

    def get_add_actions_df(self, flatten: bool = False) -> pyarrow.RecordBatch:
Not super important, but when I see "df" in Python, I always think pandas DataFrame. Since we are returning a record batch, maybe a different name is more fitting for this function? Maybe `get_add_action_table`, like the one used internally?
I'm open to that. What do you think @MrPowers?
Another possibility is to return a flattened Pandas DataFrame by default, but allow returning a record batch:
@overload
def get_add_actions_df(self, flatten: bool, as_pandas: Literal[True]) -> pandas.DataFrame: ...
@overload
def get_add_actions_df(self, flatten: bool, as_pandas: Literal[False]) -> pyarrow.RecordBatch: ...
def get_add_actions_df(self, flatten=True, as_pandas=True):
    ...
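For anyone unfamiliar with the pattern, the following is a minimal runnable sketch of how `typing.overload` with `Literal` lets a type checker pick the return type from a boolean flag. Everything here is a stand-in chosen so the snippet runs without pandas or pyarrow: `FakeTable`, `Rows`, and `Columns` are hypothetical names, and the `flatten` parameter is left out to keep the example short.

```python
from typing import Any, Dict, List, Literal, Union, overload

Rows = List[Dict[str, Any]]     # hypothetical stand-in for pandas.DataFrame
Columns = Dict[str, List[Any]]  # hypothetical stand-in for pyarrow.RecordBatch


class FakeTable:
    """Hypothetical stand-in for DeltaTable that stores add actions column-wise."""

    def __init__(self, columns: Columns) -> None:
        self._columns = columns

    @overload
    def get_add_actions_df(self, as_pandas: Literal[True] = ...) -> Rows: ...
    @overload
    def get_add_actions_df(self, as_pandas: Literal[False]) -> Columns: ...

    def get_add_actions_df(self, as_pandas: bool = True) -> Union[Rows, Columns]:
        if not as_pandas:
            # "Record batch" path: hand back the columnar data unchanged.
            return self._columns
        # "DataFrame" path: pivot the columns into one dict per row.
        names = list(self._columns)
        return [dict(zip(names, values)) for values in zip(*self._columns.values())]


table = FakeTable({"num_records": [1, 1], "size_bytes": [565, 565]})
as_rows = table.get_add_actions_df()
as_columns = table.get_add_actions_df(as_pandas=False)
print(as_rows)
print(as_columns)
```

The trade-off is that the flag adds a second code path to document and test, whereas the chained `.to_pandas()` style keeps a single return type.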
Personally, I prefer the chained style (`.to_pandas()`), as it is consistent with loading the table data. Then again, my personal preference is just that 😆. But @MrPowers seems to know the community quite well :).
I think `get_add_actions` and `to_pandas()` are fine. I'm not the best authority for the Pythonic way of doing things 😉
I am really excited about this functionality!!!
2b88ed6 to 8b5bab9
# Description
Exposes a function to get a dataframe of add actions for the selected version of the table.

TODO:
* [x] add unit tests
* [x] write user guide
* [x] handle partition columns
* [x] handle stats
* [x] handle tags
* [x] add a `flatten` option

# Related Issue(s)
- closes delta-io#1031

# Documentation
<!--- Share links to useful documentation --->