-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix broken documentation of PartitionedDataSet #1710
Conversation
Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Nok Chan <[email protected]>
kedro/io/partitioned_dataset.py
Outdated
@@ -36,6 +36,10 @@ class PartitionedDataSet(AbstractDataSet): | |||
underlying dataset definition. For filesystem level operations it uses `fsspec`: | |||
https://github.com/intake/filesystem_spec. | |||
|
|||
It also support advanced features like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also support advanced features like | |
It also supports advanced features like |
kedro/io/partitioned_dataset.py
Outdated
@@ -36,6 +36,10 @@ class PartitionedDataSet(AbstractDataSet): | |||
underlying dataset definition. For filesystem level operations it uses `fsspec`: | |||
https://github.com/intake/filesystem_spec. | |||
|
|||
It also support advanced features like | |||
`lazy saving <https://kedro.readthedocs.io/en/stable/data/\ | |||
kedro_io.html#partitioned-dataset-lazy-saving>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth doing make build-docs
and checking these links are working ok, since the rst syntax for links is weird and very easy to get wrong. e.g. I think you're missing a final _
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky since the stable
branch doesn't have this doc yet. You are absolutely right that I missed a _
though!
kedro/io/partitioned_dataset.py
Outdated
>>> df = pd.DataFrame([{"DAY_OF_MONTH": str(i), "VALUE": i} for i in range(1, 11)]) | ||
|
||
# Convert it to a dict of pd.DataFrame with DAY_OF_MONTH as the dict key | ||
>>> dict_df = dict(tuple(df.groupby("DAY_OF_MONTH"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does dict_df
look like at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AntonyMilneQB As the comment suggests, it's a Dict[str, pd.Dataframe]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Might be worth just showing what dict_df
looks like in a comment:
# Convert it to a dict of pd.DataFrame with DAY_OF_MONTH as the dict key.
# e.g. dict_df["1"] is pd.DataFrame({"DAY_OF_MONTH": "1", "VALUE": 1}, index=[0])
Or maybe doing this a bit more explicitly as something like this?
dict_df = {row["DAY_OF_MONTH"]: row for index, row in df.iterrows()}
Or even going through df.iloc[...]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AntonyMilneQB Yeah I agree, this is one of those pandas tricks that isn't very explicit but are necessary for performance. For documentation purposes it could be more explicit
dict_df = {day_of_month: df[df["DAY_OF_MONTH"] == day_of_month]
for day_of_month in df['DAY_OF_MONTH']}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making these improvements! 👍
Co-authored-by: Merel Theisen <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice improvement, thank you!
Co-authored-by: Antony Milne <[email protected]>
Co-authored-by: Antony Milne <[email protected]>
Signed-off-by: Nok Chan [email protected]
Description
While looking at kedro-org/kedro-plugins#165, I found that the original documentation is not formatted correctly.
In addition, I think the original document is not easy to follow since the example is not runnable. This 2 years old tutorial is more useful than the doc for me.
PartitionedDataSet
is one of the more complicated dataset, so documenting its usage of it with an example is quite important to help user get started.We also need to document about lazy evaluation of PartitionedDataSet and how it works. AFAIK the lazy evaluation is only on the
load
side but not thesave
part?(Updated: so lazy saving is actually available but it's not a well-documented feature, it's documented here maybe we should have this information included in the Dataset API page since this is usually the entrypoint people looking for information?
Lazy saving was introduced in #744
Development notes
Doc changes, modify the docs with an example that can be run.
Checklist
RELEASE.md
file