Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add preview to datasets as specified in the Kedro catalog under metadata #1374

Merged
merged 16 commits into from
Jun 8, 2023

Conversation

rashidakanchwala
Copy link
Contributor

@rashidakanchwala rashidakanchwala commented May 30, 2023

Description

This ticket is part 2 of the preview dataset feature development. In the first part, we enabled preview functionality for CSV and Excel datasets in the Kedro Framework. In Kedro-viz, we showed previews for all CSV and Excel datasets when the user turned on the preview option under the settings panel.

In this ticket, we are removing the preview option from the settings panel. Now, if the user wants to preview a dataset, they need to specify it via the catalog.yml file.

type: pandas.CSVDataSet
filepath: /filepath/to/dataset
metadata:
  kedro-viz:
    layer: raw
    preview_args:
          nrows: 3

In the above the user wants to see the first 3 rows of the dataset. Previously we always showed the first 40 rows, but now we show the number of rows specified by the user in the catalog.yml file.

Development notes

QA notes

The tests and linter is currently failing but the changes to this are already part of the first PR related to layers so it should be fixed once that's merged. So we can ignore that in this PR.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

@rashidakanchwala rashidakanchwala marked this pull request as ready for review May 31, 2023 09:35
@rashidakanchwala rashidakanchwala requested review from noklam, Huongg, antonymilne and merelcht and removed request for tynandebold May 31, 2023 10:01
@rashidakanchwala rashidakanchwala changed the title Add preview to datasets as specified in the Kedro catalog Add preview to datasets as specified in the Kedro catalog under metadata May 31, 2023
@antonymilne
Copy link
Contributor

My suggested syntax for this is:

type: pandas.CSVDataSet
filepath: /filepath/to/dataset
metadata:
  kedro-viz:
    layer: raw
    preview_args:
      nrows: 5

This would match load_args and save_args and mean that future preview functions can take an arbitrary set of arguments. Otherwise you're hardcoding it here to only accept nrows as a parameter.

You'd call the preview function as dataset._preview(**preview_args).

There's a question about what the default behaviour should be though. If I don't specify preview_args, should the preview show at all?

  • if yes, then should a user be able to turn it off if they like and how would they do this?
  • if no, then you always need to specify nrows explicitly, which is annoying.

Copy link
Contributor

@Huongg Huongg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @rashidakanchwala all looks good to me, just left 2 small comments but nothing seems to be an issue to me.

@@ -46,7 +46,7 @@ const MetadataModal = ({ metadata, onToggle, visible }) => {
</div>
{hasPreview && (
<div className="pipeline-metadata-modal__preview-text">
Previewing first 40 rows only
Previewing first {metadata.preview.data.length} rows only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this copy line might not make sense anymore. We use it originally because we were only going to show the first 40 rows. But now since users can define it themself, do we need it anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so. but should we still show it, it's nice to give them a number of rows of what they are viewing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still useful to have the message but maybe rephrase it to: "Previewing first ... rows"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Merel here. Let's still give them the number.




# companies:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we not going to remove these completely? if not then it might be helpful to have a comment to say why we just commented them out here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for testing purpose, I will remove them before the merge. thanks :)

@merelcht
Copy link
Member

I agree with @antonymilne 's comment above. preview_args would be much more flexible and allow for other things than just number of rows to be shown in the preview.

There's a question about what the default behaviour should be though. If I don't specify preview_args, should the preview show at all?

IMO if the user doesn't specify preview_args on the dataset it shouldn't preview at all. It seems confusing if it would show without adding anything in the catalog. Also, if we add more options and not just number of rows, what would you show by default?

@rashidakanchwala
Copy link
Contributor Author

rashidakanchwala commented May 31, 2023

My suggested syntax for this is:

type: pandas.CSVDataSet
filepath: /filepath/to/dataset
metadata:
  kedro-viz:
    layer: raw
    preview_args:
      nrows: 5

This would match load_args and save_args and mean that future preview functions can take an arbitrary set of arguments. Otherwise you're hardcoding it here to only accept nrows as a parameter.

You'd call the preview function as dataset._preview(**preview_args).

There's a question about what the default behaviour should be though. If I don't specify preview_args, should the preview show at all?

  • if yes, then should a user be able to turn it off if they like and how would they do this?
  • if no, then you always need to specify nrows explicitly, which is annoying.

The preview shouldn't show at all. Basically, if you want to show preview then you need to have the preview_args in your paml and nrows would follow.

Currently, we don't have any other options. But I guess if the user doesn't add nrows, it would default to 40, which is what is specified already in kedro-datasets(https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/pandas/csv_dataset.py#L197O) . I will still test and make sure it works

@@ -595,7 +600,7 @@ def __post_init__(self, data_node: DataNode):
self.tracking_data = dataset.load()
elif data_node.is_preview_node():
try:
self.preview = dataset._preview() # type: ignore
self.preview = dataset._preview(data_node.get_preview_nrows())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it still trigger the _preview method if the row = 0?

Comment on lines 489 to 492
def is_preview_node(self):
"""Checks if the current node has a preview"""
return hasattr(self.kedro_obj, "_preview")
metadata = getattr(self.kedro_obj, "metadata", {}) or {}
return bool(metadata.get("kedro-viz", {}).get("preview"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like it shouldn't be that complicated.
Many questions here:

  • Does preview: 0 count as True or False? It will be False now, then the `get("preview", 0) path will never be executed.

The same pattern accessing the nested dict appears in multiple places, maybe refactor it with a small util function

Signed-off-by: Rashida Kanchwala <[email protected]>
@antonymilne
Copy link
Contributor

antonymilne commented Jun 5, 2023

@merelcht

There's a question about what the default behaviour should be though. If I don't specify preview_args, should the preview show at all?

IMO if the user doesn't specify preview_args on the dataset it shouldn't preview at all. It seems confusing if it would show without adding anything in the catalog. Also, if we add more options and not just number of rows, what would you show by default?

I think the problem with this approach is that the user always has to specify nrows. Currently this has a default value in kedro-datasets:

_preview(self, nrows: int = 40)

... but as it stands there's no way to actually use that default value, since as soon as you write preview_args then you need to specify nrows. In theory it might be nice for a user to just switch preview on but without needing to supply any more information.

For the case of pandas datasets it's not too much effort for the user, but in future maybe we'll have preview methods for lots of different datasets. In that case it would be nice for a user to just be able to say preview: True without needing to specify all the preview_args which might well be different for different dataset tyes.

It's not a big thing so I'm ok with this if it's what we want to do, but I'd probably make nrows compulsory in kedro-datasets in this case.

The alternative solution is maybe we have two keys: preview and preview_args, like this:

preview: true
preview_args:
  nrows: ...

where preview_args is optional and only for "advanced" users who want to fine-tune the preview shown.

Or another option is that preview is enabled by default and they need to specify preview: false to disable it.

Overall if we haven't already done so I think we need @yetudada and/or @NeroOkwa to weigh in on what the expected behaviour would be for this feature because I think there's a few legitimate options.

@merelcht
Copy link
Member

merelcht commented Jun 5, 2023

@antonymilne We discussed this in stand up last week and agreed that the user needs to be specific about what needs to be previewed. So they will have to specify preview_args or nothing will show. There just won't be a default at all. We can always iterate on that solution once we get user feedback on their experience, but at the moment we don't even know yet what potential other preview functions we or other people will add.

Comment on lines 494 to 498
try:
viz_metadata = metadata["kedro-viz"]
except (AttributeError, KeyError):
return None
return viz_metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be replaced with the getattr method since it's not a nested key?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additional comment: Would it be better to just set this as a property viz_metadata instead of a get method?

Copy link
Contributor Author

@rashidakanchwala rashidakanchwala Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did try this but I got some weird errors, so changed it to get_ . but i agree, it's better as a property. I will ping u again, if u have a couple of min to help me with the errors that would be appreciated. let me try doing it again.

Update - it works now, i will push :)

Copy link
Contributor Author

@rashidakanchwala rashidakanchwala Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Antony suggested doing it this way based on EAFP

try:
       viz_metadata = metadata["kedro-viz"]
 except (AttributeError, KeyError):

Signed-off-by: Rashida Kanchwala <[email protected]>
Signed-off-by: Rashida Kanchwala <[email protected]>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rashidakanchwala I'll message you on Slack as well because I'm a bit confused now about the implementation: are we using preview_args or preview with a numeric value? It seems mixed in the changes you've made..

demo-project/conf/base/catalog_01_raw.yml Show resolved Hide resolved
metadata:
kedro-viz:
layer: feature
preview: 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we would just do the preview_args and not the preview?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just old code. i removed it. thanks for flagging

try:
self.viz_metadata = metadata["kedro-viz"]
except (AttributeError, KeyError):
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful here to emit an error message for debugging purposes?

return False
return is_preview

def get_preview_nrows(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets all preview_args right? So I think it makes more sense to name the function get_preview_args

Signed-off-by: Rashida Kanchwala <[email protected]>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there!

return is_preview

def get_preview_args(self):
"""Gets the number of rows for the preview dataset"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Gets the number of rows for the preview dataset"""
"""Gets the preview arguments for a dataset"""

@@ -368,6 +369,17 @@ def test_data_node_metadata(self):
assert data_node_metadata.filepath == "/tmp/dataset.csv"
assert data_node_metadata.run_command == "kedro run --to-outputs=dataset"

def test_get_preview_nrows(self):
metadata = {"kedro-viz": {"preview": 3}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this test needs to be updated to use preview_args right?

Copy link
Contributor Author

@rashidakanchwala rashidakanchwala Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this. i missed it. the tests are still failing but that's because of the kedro.extras and that should go away once the other PR is merged.

rashidakanchwala and others added 3 commits June 6, 2023 18:00
Signed-off-by: Rashida Kanchwala <[email protected]>
Signed-off-by: Rashida Kanchwala <[email protected]>
Signed-off-by: Rashida Kanchwala <[email protected]>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! ⭐ Don't forget to add the change to the release notes.

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small suggestions but generally this looks great ⭐

Comment on lines +499 to +503
try:
is_preview = bool(self.viz_metadata["preview_args"])
except (AttributeError, KeyError):
return False
return is_preview
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ok how it is, but I don't understand why we need the viz_metadata property. This isn't used for anything other than preview args, so why can't we just do this?

def is_preview_node(self):
    try:
        self.kedro_obj.metadata["kedro-viz"]["preview_args"]
    except (AttributeError, KeyError):  # pragma: no cover
        logger.debug("Kedro-viz preview args not found for %s", self.full_name)
        return False
    else:
        return True

def get_preview_args(self):
        """Gets the preview arguments for a dataset"""
        return self.kedro_obj.metadata["kedro-viz"]["preview_args"]

But it's not a big thing so if you think it's clearer how you have it then don't worry about it 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I see this is probably the opposite of what @noklam was suggesting, so feel free to ignore and just do whatever you think is best 😁

Copy link
Contributor Author

@rashidakanchwala rashidakanchwala Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think viz_metadata might be eventually used for more than preview. for e.g. looking at this (#662) ticket. if we were to also show table statistics we could probably have

metadata :
   kedro-viz: 
       preview_args : 
             nrows=4 
      profiler_args:
              show = true.. 

something like it

@@ -368,6 +369,18 @@ def test_data_node_metadata(self):
assert data_node_metadata.filepath == "/tmp/dataset.csv"
assert data_node_metadata.run_command == "kedro run --to-outputs=dataset"

def test_get_preview_args(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please can we also have a test for when preview_args does not exist to check assert not data_node.is_preview_node()?

layer=None,
dataset=dataset,
)
assert data_node.is_preview_node() is True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert data_node.is_preview_node() is True
assert data_node.is_preview_node()

Signed-off-by: Rashida Kanchwala <[email protected]>
@rashidakanchwala rashidakanchwala merged commit 48ad392 into main Jun 8, 2023
@rashidakanchwala rashidakanchwala deleted the preview-datasets-2 branch June 8, 2023 18:19
@tynandebold tynandebold mentioned this pull request Jun 9, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants