Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken documentation of PartitionedDataSet #1710

Merged
merged 13 commits into from
Jul 25, 2022

Conversation

noklam
Copy link
Contributor

@noklam noklam commented Jul 18, 2022

Signed-off-by: Nok Chan [email protected]

Description

While looking at kedro-org/kedro-plugins#165, I found that the original documentation is not formatted correctly.

In addition, I think the original document is not easy to follow since the example is not runnable. This 2 years old tutorial is more useful than the doc for me.

PartitionedDataSet is one of the more complicated dataset, so documenting its usage of it with an example is quite important to help user get started.

We also need to document about lazy evaluation of PartitionedDataSet and how it works. AFAIK the lazy evaluation is only on the load side but not the save part?

(Updated: so lazy saving is actually available but it's not a well-documented feature, it's documented here maybe we should have this information included in the Dataset API page since this is usually the entrypoint people looking for information?

Lazy saving was introduced in #744

Development notes

Doc changes, modify the docs with an example that can be run.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Nok Chan <[email protected]>
@noklam noklam requested a review from idanov as a code owner July 18, 2022 10:27
Signed-off-by: Nok Chan <[email protected]>
@noklam noklam requested review from antonymilne and deepyaman July 18, 2022 10:51
Signed-off-by: Nok Chan <[email protected]>
@@ -36,6 +36,10 @@ class PartitionedDataSet(AbstractDataSet):
underlying dataset definition. For filesystem level operations it uses `fsspec`:
https://github.com/intake/filesystem_spec.

It also support advanced features like
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It also support advanced features like
It also supports advanced features like

@@ -36,6 +36,10 @@ class PartitionedDataSet(AbstractDataSet):
underlying dataset definition. For filesystem level operations it uses `fsspec`:
https://github.com/intake/filesystem_spec.

It also support advanced features like
`lazy saving <https://kedro.readthedocs.io/en/stable/data/\
kedro_io.html#partitioned-dataset-lazy-saving>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth doing make build-docs and checking these links are working ok, since the rst syntax for links is weird and very easy to get wrong. e.g. I think you're missing a final _ here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky since the stable branch doesn't have this doc yet. You are absolutely right that I missed a _ though!

>>> df = pd.DataFrame([{"DAY_OF_MONTH": str(i), "VALUE": i} for i in range(1, 11)])

# Convert it to a dict of pd.DataFrame with DAY_OF_MONTH as the dict key
>>> dict_df = dict(tuple(df.groupby("DAY_OF_MONTH")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does dict_df look like at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AntonyMilneQB As the comment suggests, it's a Dict[str, pd.Dataframe]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Might be worth just showing what dict_df looks like in a comment:

# Convert it to a dict of pd.DataFrame with DAY_OF_MONTH as the dict key.
# e.g. dict_df["1"] is pd.DataFrame({"DAY_OF_MONTH": "1", "VALUE": 1}, index=[0])

Or maybe doing this a bit more explicitly as something like this?

dict_df = {row["DAY_OF_MONTH"]: row for index, row in df.iterrows()}

Or even going through df.iloc[...].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AntonyMilneQB Yeah I agree, this is one of those pandas tricks that isn't very explicit but are necessary for performance. For documentation purposes it could be more explicit

dict_df = {day_of_month: df[df["DAY_OF_MONTH"] == day_of_month]
     for day_of_month in df['DAY_OF_MONTH']}

@noklam noklam requested a review from yetudada as a code owner July 19, 2022 12:59
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these improvements! 👍

kedro/io/partitioned_dataset.py Outdated Show resolved Hide resolved
kedro/io/partitioned_dataset.py Outdated Show resolved Hide resolved
noklam and others added 2 commits July 21, 2022 17:37
@noklam noklam changed the title Fix broken documentation of PartitionedDataSet Fix broken documentation of PartitionedDataSet and add better error message when project is misconfigured Jul 21, 2022
@noklam noklam changed the title Fix broken documentation of PartitionedDataSet and add better error message when project is misconfigured Fix broken documentation of PartitionedDataSet Jul 21, 2022
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice improvement, thank you!

docs/source/data/kedro_io.md Outdated Show resolved Hide resolved
kedro/io/partitioned_dataset.py Outdated Show resolved Hide resolved
noklam and others added 2 commits July 21, 2022 18:00
@noklam noklam removed the request for review from idanov July 21, 2022 17:01
@noklam noklam merged commit 3ece5f0 into main Jul 25, 2022
@noklam noklam deleted the fix/improve_partition_dataset_doc branch July 25, 2022 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants