Fix broken documentation of PartitionedDataSet #1710

noklam · 2022-07-18T10:27:28Z

Signed-off-by: Nok Chan [email protected]

Description

While looking at kedro-org/kedro-plugins#165, I found that the original documentation is not formatted correctly.

In addition, I think the original document is not easy to follow because the example is not runnable. This two-year-old tutorial was more useful to me than the docs.

PartitionedDataSet is one of the more complicated datasets, so documenting its usage with an example is important to help users get started.

We also need to document the lazy evaluation of PartitionedDataSet and how it works. AFAIK lazy evaluation applies only to loading, not to saving?

(Updated: lazy saving is actually available, but it is not a well-documented feature. It's documented here; maybe we should include this information in the Dataset API page, since that is usually the entry point for people looking for information?)

Lazy saving was introduced in #744
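
To make the lazy loading and lazy saving idea above concrete, here is a minimal plain-Python sketch of the pattern. It is not Kedro's implementation: `storage`, `make_loader`, `lazy_load` and `lazy_save` are hypothetical names invented for this illustration, and an in-memory dict stands in for files on disk. The assumption is only that partitions are exchanged as a dict whose values may be callables that produce the data on demand.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory "storage" standing in for partition files on disk.
storage: Dict[str, str] = {"day_1": "rows for day 1", "day_2": "rows for day 2"}
load_calls = []  # records which partitions were actually read


def make_loader(partition_id: str) -> Callable[[], str]:
    """Return a function that reads one partition only when it is called."""
    def load() -> str:
        load_calls.append(partition_id)
        return storage[partition_id]
    return load


def lazy_load() -> Dict[str, Callable[[], str]]:
    """Lazy loading: map each partition id to a load function, reading nothing yet."""
    return {pid: make_loader(pid) for pid in sorted(storage)}


def lazy_save(partitions: Dict[str, Any], target: Dict[str, Any]) -> None:
    """Lazy saving: a callable value is evaluated just before it is written."""
    for pid, data in sorted(partitions.items()):
        if callable(data):
            data = data()
        target[pid] = data


partitions = lazy_load()
first = partitions["day_1"]()  # only this partition is read

saved: Dict[str, Any] = {}
lazy_save({"day_3": lambda: "rows for day 3", "day_4": "rows for day 4"}, saved)
```

The point of the pattern is that only one partition needs to be in memory at a time, on both the load and the save side.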

Development notes

Doc changes: modify the docs with an example that can be run.

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Nok Chan <[email protected]>

antonymilne · 2022-07-19T07:14:56Z

kedro/io/partitioned_dataset.py

@@ -36,6 +36,10 @@ class PartitionedDataSet(AbstractDataSet):
    underlying dataset definition. For filesystem level operations it uses `fsspec`:
    https://github.com/intake/filesystem_spec.

+    It also support advanced features like


Suggested change

-    It also support advanced features like
+    It also supports advanced features like

antonymilne · 2022-07-19T07:15:25Z

kedro/io/partitioned_dataset.py

@@ -36,6 +36,10 @@ class PartitionedDataSet(AbstractDataSet):
    underlying dataset definition. For filesystem level operations it uses `fsspec`:
    https://github.com/intake/filesystem_spec.

+    It also support advanced features like
+    `lazy saving <https://kedro.readthedocs.io/en/stable/data/\
+    kedro_io.html#partitioned-dataset-lazy-saving>`


It's worth doing make build-docs and checking these links are working ok, since the rst syntax for links is weird and very easy to get wrong. e.g. I think you're missing a final _ here.

This is a bit tricky since the stable branch doesn't have this doc yet. You are absolutely right that I missed a _ though!
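
For reference, a minimal sketch of the link syntax discussed in this thread, with the trailing underscore in place. `ExampleDataSet` is a hypothetical class used only to hold the docstring, and the URL is written on one line here rather than split with a backslash as in the diff; whether the anchor resolves depends on the published docs.

```python
class ExampleDataSet:
    """Hypothetical class, used only to show the reStructuredText link syntax.

    It also supports advanced features like
    `lazy saving <https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset-lazy-saving>`_.
    """


# An rst external link has the form `link text <URL>`_ ; without the final
# underscore the backquoted text is not treated as a hyperlink.
docstring = ExampleDataSet.__doc__
```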

antonymilne · 2022-07-19T07:17:11Z

kedro/io/partitioned_dataset.py

+        >>> df = pd.DataFrame([{"DAY_OF_MONTH": str(i), "VALUE": i} for i in range(1, 11)])
+
+        # Convert it to a dict of pd.DataFrame with DAY_OF_MONTH as the dict key
+        >>> dict_df = dict(tuple(df.groupby("DAY_OF_MONTH")))


What does dict_df look like at this point?

@AntonyMilneQB As the comment suggests, it's a Dict[str, pd.DataFrame]

Ah, I see. Might be worth just showing what dict_df looks like in a comment:

# Convert it to a dict of pd.DataFrame with DAY_OF_MONTH as the dict key.
# e.g. dict_df["1"] is pd.DataFrame({"DAY_OF_MONTH": "1", "VALUE": 1}, index=[0])

Or maybe doing this a bit more explicitly as something like this?

dict_df = {row["DAY_OF_MONTH"]: row for index, row in df.iterrows()}

Or even going through df.iloc[...].

@AntonyMilneQB Yeah, I agree. This is one of those pandas tricks that aren't very explicit but are necessary for performance. For documentation purposes it could be more explicit:

dict_df = {day_of_month: df[df["DAY_OF_MONTH"] == day_of_month] for day_of_month in df['DAY_OF_MONTH']}
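
To show what `dict_df` looks like, the sketch below builds it both ways discussed in this thread and checks that they agree. It assumes only pandas; the variable names `by_groupby` and `by_comprehension` are made up for the comparison, and `.unique()` is added to the comprehension so each key is computed once. Note that the `df.iterrows()` variant suggested earlier yields `pd.Series` rows rather than one-row DataFrames, so it is not included here.

```python
import pandas as pd

df = pd.DataFrame([{"DAY_OF_MONTH": str(i), "VALUE": i} for i in range(1, 11)])

# Compact version: groupby yields (key, sub-DataFrame) pairs.
by_groupby = dict(tuple(df.groupby("DAY_OF_MONTH")))

# Explicit version: filter the frame once per distinct key.
by_comprehension = {
    day: df[df["DAY_OF_MONTH"] == day] for day in df["DAY_OF_MONTH"].unique()
}

# Each value is a DataFrame holding the rows for that key.
first_partition = by_groupby["1"].to_dict("records")
```

Either dict could then be passed to a dataset that expects a mapping of partition id to DataFrame.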

merelcht

Thanks for making these improvements! 👍

kedro/io/partitioned_dataset.py

Co-authored-by: Merel Theisen <[email protected]>

antonymilne

Very nice improvement, thank you!

docs/source/data/kedro_io.md

kedro/io/partitioned_dataset.py

Co-authored-by: Antony Milne <[email protected]>

Fix broken doc

d3b6197

Signed-off-by: Nok Chan <[email protected]>

noklam requested a review from idanov as a code owner July 18, 2022 10:27

add runnable example

51b7325

Signed-off-by: Nok Chan <[email protected]>

noklam requested review from antonymilne and deepyaman July 18, 2022 10:51

doc update

8ed679c

Signed-off-by: Nok Chan <[email protected]>

antonymilne reviewed Jul 19, 2022

noklam added 4 commits July 19, 2022 11:12

Fix doc after review

e08c13c

attempt to fix broken link

b620460

link udate

3d11e54

more doc update

0d31da5

noklam requested a review from yetudada as a code owner July 19, 2022 12:59

merelcht approved these changes Jul 21, 2022

kedro/io/partitioned_dataset.py (outdated)

kedro/io/partitioned_dataset.py (outdated)

noklam and others added 2 commits July 21, 2022 17:37

Update kedro/io/partitioned_dataset.py

ec90536

Co-authored-by: Merel Theisen <[email protected]>

Update kedro/io/partitioned_dataset.py

31a1d52

Co-authored-by: Merel Theisen <[email protected]>

noklam changed the title ~~Fix broken documentation of PartitionedDataSet~~ Fix broken documentation of PartitionedDataSet and add better error message when project is misconfigured Jul 21, 2022

noklam changed the title ~~Fix broken documentation of PartitionedDataSet and add better error message when project is misconfigured~~ Fix broken documentation of PartitionedDataSet Jul 21, 2022

antonymilne approved these changes Jul 21, 2022

docs/source/data/kedro_io.md (outdated)

kedro/io/partitioned_dataset.py (outdated)

noklam and others added 2 commits July 21, 2022 18:00

Update kedro/io/partitioned_dataset.py

f040e6d

Co-authored-by: Antony Milne <[email protected]>

Update docs/source/data/kedro_io.md

55cbcc9

Co-authored-by: Antony Milne <[email protected]>

noklam removed the request for review from idanov July 21, 2022 17:01

noklam and others added 2 commits July 22, 2022 17:23

Merge branch 'main' into fix/improve_partition_dataset_doc

0da9b08

update linkcheck

dbf621a

noklam merged commit 3ece5f0 into main Jul 25, 2022

noklam deleted the fix/improve_partition_dataset_doc branch July 25, 2022 10:19
