Kedro-Viz to show preview of data #907

rashidakanchwala · 2022-06-13T15:23:04Z

Description

Kedro-viz supports Plotly.
Plotly has cool tables - https://plotly.com/python/table/

The idea is simply to show the first 5 or 10 rows of the dataset in Kedro-viz.

Implementation

Since we already support Plotly, this would be easy to do: we just read the first 5 rows from the data and display them as a table.

There is an argument that loading so many datasets might make kedro-viz slow. But loading only happens when the metadata panel is clicked, which is one dataset at a time. Also maybe on Kedro we can allow users to specify which datasets they want to preview on Kedro-viz using catalog.yml preview = true

datajoely · 2022-06-13T15:40:25Z

Would love this!

One note on implementation - we need a workflow to avoid opening enormous files for no reason.

The situation I'm worried about is specifically pandas.CSVDataSet being 1 begillion rows long and us loading that for 5 rows of data.
For spark.SparkDataSet we can append a .limit(5) on there to avoid this.
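
For plain CSV files the same effect is possible without reading the whole file. Here is a minimal sketch using only the standard library; the helper name `preview_csv` is made up for illustration. With pandas, the `nrows` argument of `pandas.read_csv` gives the same behaviour.

```python
import csv
from itertools import islice


def preview_csv(lines, n_rows=5):
    """Return the header and at most n_rows data rows from an iterable of CSV lines.

    Only the lines that are needed get consumed, so the cost does not grow
    with the size of the file.
    """
    reader = csv.reader(lines)
    header = next(reader, [])
    rows = list(islice(reader, n_rows))
    return header, rows


def fake_big_csv():
    """Stand-in for a very large file: 1,000,000 data rows generated lazily."""
    yield "id,value\n"
    for i in range(1_000_000):
        yield f"{i},{i * 2}\n"


header, rows = preview_csv(fake_big_csv(), n_rows=5)
print(header)  # ['id', 'value']
print(rows[-1])  # ['4', '8']
```

An open file handle can be passed in place of `fake_big_csv()`, since `csv.reader` accepts any iterable of lines.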

limdauto · 2022-06-13T15:47:25Z

@datajoely I think we should add an optional head API to Kedro Dataset if we were to do this. This allows viz to preview beyond pandas or spark and avoid performance bottleneck. The thing that knows how to optimise head is the dataset implementation, not viz.
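
A rough sketch of what that could look like, using made-up classes rather than real Kedro ones: viz asks the dataset for a `head` if it offers one, and otherwise skips the preview instead of falling back to a full load.

```python
class InMemoryRows:
    """Hypothetical dataset that knows a cheap way to return its first rows."""

    def __init__(self, rows):
        self._rows = rows

    def load(self):
        return list(self._rows)

    def head(self, n=5):
        # The dataset decides how to do this efficiently; here it is a slice.
        return self._rows[:n]


class OpaqueDataset:
    """Hypothetical dataset with no head(): viz cannot preview it cheaply."""

    def load(self):
        raise RuntimeError("expensive load, should not be triggered by viz")


def preview_for_viz(dataset, n=5):
    """Return a preview if the dataset offers head(), else None."""
    head = getattr(dataset, "head", None)
    if callable(head):
        return head(n)
    return None


print(preview_for_viz(InMemoryRows(list(range(100))), n=3))  # [0, 1, 2]
print(preview_for_viz(OpaqueDataset()))  # None
```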

datajoely · 2022-06-13T15:48:29Z

Yeah agreed

antonymilne · 2022-06-14T07:37:13Z

I like this idea and have thought about similar schemes in the past. So since you've brought it up here, let me dump some thoughts I had before here also...

Two basic questions:

is plotly the right thing to use for this? It's a good option since we already have it available, but maybe there are better libraries out there for handling tables (e.g. it doesn't look like plotly would handle many hundreds of columns well, which is not at all uncommon in a kedro pipeline)
how general should we make this? As per @limdauto's comment, maybe we have a general head method that can be used for any dataset. Could we incorporate the current behaviour for matplotlib and plotly datasets into this more generic mechanism? Going beyond a dataset preview, what if I don't want to show the first n rows but would rather just show the size of the dataframe (rows and columns) in the metadata side panel? (which seems equally useful to me and maybe more practical for large dataframes)

Just using plotly for pandas and/or spark dataframes would be totally great for an MVP and to get user feedback, but I just want to brainstorm how we might want to make this more generic in the longer term.

The question of adding custom properties to datasets comes up quite a bit, e.g. #662 (put number of rows in dataset on kedro-viz), https://github.com/quantumblacklabs/private-kedro/issues/1148 (add metadata to catalog entries that can be consumed by plugins), kedro-org/kedro#1076 (very long-standing issue on how to add metadata to catalog entries). This is not just limited to kedro-viz: there's a more general kedro question of how to attach metadata to a catalog entry. Let me just focus on the kedro-viz question here though.

#662 (comment) spells out my rough idea for this: user-customisable dataset widgets. This is quite similar to the idea of kedro-viz extensions, only:

these widgets are shown in the metadata panel rather than a whole new screen (which has both pros and cons but basically means there's much more limited space for them)
widgets are lighter weight and more restricted in how they must be written (unlike an extension, it doesn't start its own server etc.)

As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset. Let me call this a "trackable".

In the future I think there should be two possible methods for this:

via experiment tracking - this is already work in progress. You can write code to calculate whatever trackable you like in a node and then save it to a tracking dataset. Crucially this will give you a sense of how the trackable changes between one kedro run and the next, since I should be able to go back in time and visualise the pipeline and datasets of historic runs.

some kind of customisable "widget" which allows me to give, in the catalog, as many trackables as I like, e.g. (completely made up example syntax)
shuttles:
    type: pandas.CSVDataSet
    filepath: ...
    viz_widgets:
        - number_of_rows
        - number_of_na: [column1, column2, column3]
        - my_custom_widget
Where we supply with kedro viz a few common widgets like number_of_rows, but a user can define their own my_custom_widget also so it's very flexible. The natural place for this information to be shown on kedro viz would be the side panel on the right hand side that appears when you click on a dataset. But it would be super cool if somehow we could make the pipeline visualisation customisable with user-pluggable widgets too.

According to this scheme, previewing the first 5 rows of a dataset would be some kind of dataframe_head: {rows: 5} widget that we provide within kedro-viz. This could even be automatically applied to all the datasets of the right type. There could be some kind of marketplace for user-defined widgets (small javascript apps I guess?).
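
To make the scheme a bit more concrete, here is a minimal sketch of a widget registry. It is written in Python purely for illustration (real widgets might well be small javascript apps), and everything in it, including the `widget` decorator and `run_widgets`, is made up to match the example syntax above. Rows are plain dicts so that it runs without pandas.

```python
WIDGETS = {}


def widget(name):
    """Register a function as a named widget (hypothetical API)."""
    def register(func):
        WIDGETS[name] = func
        return func
    return register


@widget("number_of_rows")
def number_of_rows(rows):
    return len(rows)


@widget("number_of_na")
def number_of_na(rows, columns):
    # Count missing (None) values per requested column.
    return {col: sum(row.get(col) is None for row in rows) for col in columns}


@widget("dataframe_head")
def dataframe_head(rows, n=5):
    return rows[:n]


def run_widgets(rows, spec):
    """spec mirrors a viz_widgets entry: widget name -> keyword arguments."""
    return {name: WIDGETS[name](rows, **kwargs) for name, kwargs in spec.items()}


shuttles = [
    {"column1": 1, "column2": None},
    {"column1": None, "column2": None},
    {"column1": 3, "column2": "x"},
]
result = run_widgets(
    shuttles,
    {
        "number_of_rows": {},
        "number_of_na": {"columns": ["column1", "column2"]},
        "dataframe_head": {"n": 1},
    },
)
print(result)
```

A user-defined `my_custom_widget` would be registered the same way, which is what keeps the built-in and custom widgets on an equal footing.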

Is the idea of a marketplace of custom widgets for kedro-viz datasets a huge overkill for this? At the moment, absolutely yes. We could achieve what @rashidakanchwala describes much more simply. And at the moment I think kedro-viz extensions would be better to work on than dataset widgets. But I think it's worth thinking about where this might end up in future, since it might spark other people's ideas and potentially affects design decisions up front. e.g.

Also maybe on Kedro we can allow users to specify which datasets they want to preview on Kedro-viz using catalog.yml preview = true

This seems too ad-hoc and hacky to me, like the current implementation of layer which is a dataset property but only really used by kedro-viz. So if we end up with lots of such parameters I think we should consider exactly where they should live so that catalog entries don't become too bloated.

yetudada · 2022-06-22T10:29:41Z

The exploration for seeing dataset statistics by @GabrielComymQB:

merelcht · 2022-06-22T11:03:54Z

Notes from Technical Design session:

The team discussed a possible solution to preview data in Viz both on the metadata panel and the experiment tracking panel.

Some questions raised around the goal of showing a preview:

Do we want to show just a preview of the data, or perhaps insights (e.g. # of columns, mean, median..)?
Should users be able to customise what is shown in such a preview?

The consensus is that just a blanket preview of showing the first 5-10 rows wouldn't be useful with all data, and thus the preview should be customisable.

Possible solution:
The solution discussed in the meeting is adding a _preview() method to datasets that specifies how data should be displayed on the Viz side. This _preview() method will be customisable so if a user doesn't like the default implementation they can override it to suit their needs. The result will be displayed in the metadata and experiment tracking panels.

A downside of this solution is that we would essentially be adding visualisation specific code to the framework side, blurring the boundaries between Kedro Viz and Kedro Framework. But the _preview() method could be useful in a jupyter flow as well.

Follow up questions/actions:

What types of data would the _preview() method return? What are the optimal types to display data in Viz?
Specifically, users have expressed the need to log CSV data, what do they want to see from this CSV data?
Are there any other solutions, perhaps with more of the heavy lifting on the Viz side, that would solve this issue?
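
On the first question, one option (an assumption for illustration, not something decided in the session) is for _preview() to return a plain JSON-serialisable dict of column names plus a bounded number of rows, which the Viz backend could pass to the frontend unchanged. The classes below are hypothetical stand-ins, not Kedro datasets.

```python
import json


class RowsDataSet:
    """Hypothetical dataset holding rows as dicts, standing in for a CSV dataset."""

    def __init__(self, rows):
        self._rows = rows

    def _load(self):
        return self._rows

    def _preview(self, nrows=5):
        # Default implementation: column names plus the first nrows rows.
        rows = self._load()[:nrows]
        columns = list(rows[0]) if rows else []
        return {"columns": columns, "data": [[r[c] for c in columns] for r in rows]}


class RowCountDataSet(RowsDataSet):
    """A user who prefers summary information overrides _preview()."""

    def _preview(self, nrows=5):
        return {"columns": ["rows"], "data": [[len(self._load())]]}


rows = [{"id": i, "name": f"shuttle_{i}"} for i in range(20)]
payload = RowsDataSet(rows)._preview(nrows=2)
summary = RowCountDataSet(rows)._preview()
print(json.dumps(payload))
print(json.dumps(summary))
```

Because both previews share one shape, the frontend would only need a single table renderer regardless of what the dataset chooses to show.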

antonymilne · 2022-06-23T05:04:47Z

A few more thoughts on the preview method approach. Let's say that we solve the question of what types of data preview can return (shouldn't be too hard) and are happy with this living on kedro framework as a new dataset method (I'm more sceptical here). Here's a possibly representative example of what someone might want to do:

for some pandas.CSVDataSets in their pipeline, show number of rows
for some other pandas.CSVDataSets in their pipeline, show first 5 rows

The simplest way to implement this would be for the user to write two new sorts of dataset, something like this:

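from kedro.extras.datasets import pandas  # Kedro 0.18-era location of the pandas datasets

class CSVDataSetWithNumberOfRows(pandas.CSVDataSet):
    def preview(self):
        # Show only the number of rows in the loaded dataframe
        return len(self._load())

class CSVDataSetWithHead(pandas.CSVDataSet):
    def preview(self):
        # Show the first 5 rows (the default for DataFrame.head)
        return self._load().head()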

Then in the catalog file you need to change the relevant dataset type from pandas.CSVDataSet to path.to.CSVDataSetWithNumberOfRows and path.to.CSVDataSetWithHead.

This seems quite unsatisfactory:

it feels heavy-handed to require a new dataset class just to alter how preview renders in kedro-viz. The load/save behaviour of the dataset is what really matters in kedro, and that's the same for all these classes
it doesn't scale well: even if you want every pandas.CSVDataSet to preview the same way, you have to change the type for all your catalog entries (might eventually be solved by improvements to kedro config system)

Fundamentally I think the problem here is that datasets are not easily composed. I cannot easily "mix in" a new behaviour without creating a whole new class. @limdauto mentioned once that Dmitrii had prototyped some new component-based dataset architecture that looks more like my widgets example above. This might be a major change to how kedro datasets work though, which I don't think is on the cards for the foreseeable future.

In reality, is this a problem? Possibly not; maybe we just hard code a sensible default preview into pandas.CSVDataSet and only a few advanced users who are happy writing custom classes would even think of trying to change this. If we value a user being able to customise the preview behaviour then a dataset preview method does feel awkward to me though.

Problem is, I'm not sure I have a better alternative... Maybe hooks + a viz.yml config file somehow? Certainly this would keep the functionality on the kedro-viz side much more. Let me ponder this and write it up as an alternative proposal.

datajoely · 2022-06-27T10:06:35Z

I think [tool.kedro.viz] pyproject.toml section would be helpful you know. In fact, everything in the settings modal could be pre-defined there?

rashidakanchwala · 2022-10-10T11:58:24Z

Hi team,

I was thinking maybe the _preview method can be in Viz as it is a viz implementation. And within the Kedro project catalog.yml we define it like below so the Viz knows how/what to handle for different datasets?

feature_engineering_output:
    type: pandas.CSVDataSet
    filepath: ${base_location}/04_feature/feature_importance_output.csv
    layer: feature
    preview:
        enable: true
        showRows: 5
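
On the Viz side, reading such an entry could look roughly like this. The `preview`, `enable` and `showRows` keys come from the example above; the function name `preview_settings` and the defaults are assumptions, and the entry is shown as an already-parsed dict.

```python
def preview_settings(catalog_entry, default_rows=5):
    """Return the number of rows to preview, or None if preview is disabled."""
    preview = catalog_entry.get("preview") or {}
    if not preview.get("enable", False):
        return None
    return int(preview.get("showRows", default_rows))


entry = {
    "type": "pandas.CSVDataSet",
    "filepath": "${base_location}/04_feature/feature_importance_output.csv",
    "layer": "feature",
    "preview": {"enable": True, "showRows": 5},
}
print(preview_settings(entry))  # 5
print(preview_settings({"type": "pandas.CSVDataSet"}))  # None
```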

@MerelTheisenQB , @datajoely , @tynandebold , @idanov

datajoely · 2022-10-10T12:24:33Z

What about adding preview logic to the AbstractDataSet class? And then also implementing it for the pandas and spark datasets today?

pandas -> .head(5)
spark -> .limit(5).toPandas().head()

tynandebold · 2022-10-12T14:24:02Z

Notes from Technical Design session:

We'll go with the use of transcoding and the @Preview symbol to denote in the catalog that this dataset will be both a normal dataset and have a preview attached to it.
In the Viz UI we'll only load the data on click when the metadata panel is rendered
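
A minimal sketch of the load-on-click behaviour. The handler and loader names are hypothetical, and the caching is an extra assumption on top of what was agreed, added so that repeated clicks do not reload the same data.

```python
from functools import lru_cache

LOAD_LOG = []  # records every real load, to show when loading happens


@lru_cache(maxsize=32)
def load_preview(dataset_name):
    """Stand-in for the expensive part: reading the first rows of a dataset."""
    LOAD_LOG.append(dataset_name)
    return {"dataset": dataset_name, "rows": [[1, 2], [3, 4]]}


def on_metadata_panel_open(dataset_name):
    """Hypothetical handler called when a user clicks a dataset node."""
    return load_preview(dataset_name)


# Rendering the flowchart triggers no loads at all.
print(LOAD_LOG)  # []

on_metadata_panel_open("shuttles")
on_metadata_panel_open("shuttles")  # served from the cache
on_metadata_panel_open("companies")
print(LOAD_LOG)  # ['shuttles', 'companies']
```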

A question: what icon would we have for a node with a data preview inside it?

We need to come up with a different way to show that this dataset has more information
If a dataset has multiple pieces of information, the icon could have some layers if there are multiple things to show

rashidakanchwala · 2022-10-19T08:32:57Z

Closing this ticket as design and implementation work for the feature is mentioned on ticket #1136

rashidakanchwala · 2023-03-06T13:36:50Z

Update: I had a discussion with @merelcht, and the preview function will be written on the Kedro side. We are unsure whether it will only provide the preview or whether it will also share metadata about the dataset (number of rows/columns etc.)

I am reopening this ticket as the front-end design is done but there are still ongoing discussions around implementation.

tynandebold · 2023-03-06T14:40:42Z

This work will touch Kedro datasets as well as the backend and frontend of Viz.

The first dataset we should add a preview method to is pandas.CSVDataSet.

For the frontend work, the design was done in #1136, so check there for reference.

rashidakanchwala added Idea Type: Discussion labels Jun 13, 2022

rashidakanchwala added this to Kedro-Viz Jun 13, 2022

rashidakanchwala moved this to Inbox in Kedro-Viz Jun 13, 2022

tynandebold moved this from Inbox to Backlog in Kedro-Viz Jun 13, 2022

yetudada added the Type: Technical Design label Jun 20, 2022

yetudada changed the title ~~Kedro-viz to show Data preview~~ Kedro-Viz to show preview of data Jun 20, 2022

merelcht mentioned this issue Jun 22, 2022

Ability to link plots to an experiment kedro-org/kedro#1626

Closed

tynandebold added the Design: Research label Aug 1, 2022

tynandebold mentioned this issue Aug 22, 2022

Remove pandas and plotly dependency #999

Closed

antonymilne mentioned this issue Aug 24, 2022

Provide simple mechanism for adding icons to datasets #480

Closed

1 task

antonymilne mentioned this issue Sep 5, 2022

Handle ImageDataSets on Kedro-viz #1043

Closed

1 task

tynandebold added this to Kedro Framework Oct 10, 2022

tynandebold added the Technical Design label Oct 10, 2022

tynandebold removed this from Kedro Framework Oct 10, 2022

tynandebold moved this from Backlog to Todo in Kedro-Viz Oct 10, 2022

tynandebold removed the Type: Technical Design label Oct 10, 2022

tynandebold added this to Kedro Framework Oct 11, 2022

rashidakanchwala mentioned this issue Oct 18, 2022

Design for Kedro Viz @preview datasets #1136

Closed

1 task

rashidakanchwala moved this from In Progress to Done in Kedro-Viz Oct 18, 2022

rashidakanchwala closed this as completed Oct 19, 2022

rashidakanchwala moved this to Done in Kedro Framework Oct 19, 2022

tynandebold mentioned this issue Nov 14, 2022

Preview dataset metadata in kedro-viz side panel #1149

Closed

rashidakanchwala reopened this Mar 6, 2023

github-project-automation bot moved this from Done to In Progress in Kedro Framework Mar 6, 2023

rashidakanchwala moved this from In Progress to To Do in Kedro Framework Mar 6, 2023

rashidakanchwala moved this from Done to Todo in Kedro-Viz Mar 6, 2023

tynandebold added Issue: Feature Request, Python, Javascript and removed Idea, Type: Discussion, Design: Research, Technical Design, Python labels Mar 6, 2023

Huongg moved this from Todo to In Progress in Kedro-Viz Mar 6, 2023

Huongg self-assigned this Mar 6, 2023

Huongg moved this from In Progress to In Review in Kedro-Viz Mar 16, 2023

This was referenced Mar 16, 2023

preview-csv-dataset kedro-org/kedro-plugins#129

Merged

Preview dataset #1288

Merged

merelcht mentioned this issue Mar 20, 2023

Enable adding new attributes to datasets kedro-org/kedro#2440

Closed

Huongg moved this from In Review to Done in Kedro-Viz Mar 24, 2023

tynandebold closed this as completed Mar 24, 2023

github-project-automation bot moved this from To Do to Done in Kedro Framework Mar 24, 2023
