Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of HTML repr for HighLevelGraph layers #7763

Merged
merged 22 commits into from
Jun 14, 2021

Conversation

GenevieveBuckley
Copy link
Contributor

@GenevieveBuckley GenevieveBuckley commented Jun 4, 2021

Here's an implementation of a repr_html() for the HighLevelGraph layer. It should give us a nice starting point to discuss what information we do or don't want displayed, and how we want it to look.

@GenevieveBuckley
Copy link
Contributor Author

Examples:

from dask.datasets import timeseries

ddf = timeseries().shuffle("id", shuffle="tasks").head(compute=False)
ddf.dask

html-repr-ddf-Screenshot from 2021-06-04 19-35-24

import dask.array as da

x = da.ones((9,9), chunks=(3,3))
y = x.T
z = y[0:4,0:4]

z.dask

html-repr-dask-array-Screenshot from 2021-06-04 19-36-14

@GenevieveBuckley
Copy link
Contributor Author

I've tried to keep the look similar to @jacobtomlinson's work here dask/distributed#4857 just for consistency (although it seems I've forgotten to add a little blue square next to each layer, oops).

TODO:

  • left align the second column in the tables
  • make the layer text the same grey colour as the highlevelgraph info text
  • check dependencies still look good even if there are a lot of them
  • add the little blue squares next to each layer

For discussion

I would like to shift all the layer specific stuff into a _repr_html_() on Layer (not just HighLevelGraph). But currently we're including the dependencies information in the layer info table... and only the high level graph knows what those dependencies are. Is there a clean way to resolve this? Ideally we don't want a difference between the layer info shown in the HTML repr from either the high level graph or a single layer.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, really exciting to see this.

I'm pleased to see the design resources being used, but I think we should try and make the HLG repr visually different from the cluster/client/scheduler/worker repr. If a user is scrolling through a long notebook looking for one or the other they should be able to identify them at a glance.

We also have some information about the width of each layer right? Perhaps we could show something visually to represent that. I'm thinking along the lines of the minimaps you see in many text editors these days.

@GenevieveBuckley
Copy link
Contributor Author

I'm pleased to see the design resources being used, but I think we should try and make the HLG repr visually different from the cluster/client/scheduler/worker repr. If a user is scrolling through a long notebook looking for one or the other they should be able to identify them at a glance.

Huh, that's an interesting point to consider. Maybe let's turn those squares into circles & maybe use a different color?

We also have some information about the width of each layer right? Perhaps we could show something visually to represent that. I'm thinking along the lines of the minimaps you see in many text editors these days.

I don't know what you mean by width, could you clarify this? The layers are all pretty heterogeneous in terms of what attributes they have (but we can write special case logic or start to enforce more homogeneous descriptive attributes)

@jacobtomlinson
Copy link
Member

jacobtomlinson commented Jun 4, 2021

This is the design reference I've been working from.

I'm keen to try and stay within the colours and shapes shown there in the various diagrams. I would be tempted to say let's specifically keep this diagram in mind when putting together this repr.

image

So as you say maybe something like little grey circles would work well? Maybe a blue circle for the HLG and grey for the layers?

@jacobtomlinson
Copy link
Member

By width I was specifically referring to the n_parts attribute in DataframeIOLayer.

DataFrameIOLayer<name='make-timeseries-4a2ca177f118daf3d8a7db6ab747672c', n_parts=30, columns=None>

But I see what you mean, there isn't much consistency between the layers so this could be tricky.

@jacobtomlinson
Copy link
Member

Also it looks like Layer has a get_dependencies method so that could be used for individual layer reprs.

def get_dependencies(self, key: Hashable, all_hlg_keys: Iterable) -> set:

@jacobtomlinson
Copy link
Member

I threw together a quick mock up in Figma. This is kinda what I had in mind.

image

@jacobtomlinson
Copy link
Member

That little graph icon can be included with

<svg width="76" height="71" viewBox="0 0 76 71" fill="none" xmlns="http://www.w3.org/2000/svg">
<circle cx="61.5" cy="36.5" r="13.5" fill="#F2F2F2" stroke="#1D1D1D" stroke-width="2"/>
<circle cx="14.5" cy="14.5" r="13.5" fill="#F2F2F2" stroke="#1D1D1D" stroke-width="2"/>
<circle cx="14.5" cy="56.5" r="13.5" fill="#F2F2F2" stroke="#1D1D1D" stroke-width="2"/>
<path d="M28 16L30.5 16C33.2614 16 35.5 18.2386 35.5 21L35.5 32.0001C35.5 34.7615 37.7386 37.0001 40.5 37.0001L43 37.0001" stroke="black" stroke-width="1.5"/>
<path d="M40.5 37L40.5 37.75L40.5 37.75L40.5 37ZM35.5 42L36.25 42L35.5 42ZM35.5 52L34.75 52L35.5 52ZM30.5 57L30.5 57.75L30.5 57ZM41.5001 36.25L40.5 36.25L40.5 37.75L41.5001 37.75L41.5001 36.25ZM34.75 42L34.75 52L36.25 52L36.25 42L34.75 42ZM30.5 56.25L28.0001 56.25L28.0001 57.75L30.5 57.75L30.5 56.25ZM34.75 52C34.75 54.3472 32.8472 56.25 30.5 56.25L30.5 57.75C33.6756 57.75 36.25 55.1756 36.25 52L34.75 52ZM40.5 36.25C37.3244 36.25 34.75 38.8243 34.75 42L36.25 42C36.25 39.6528 38.1528 37.75 40.5 37.75L40.5 36.25Z" fill="black"/>
<circle cx="28" cy="16" r="2.25" fill="#E5E5E5" stroke="black" stroke-width="1.5"/>
<circle cx="28" cy="57" r="2.25" fill="#E5E5E5" stroke="black" stroke-width="1.5"/>
<path d="M45.25 36.567C45.5833 36.7594 45.5833 37.2406 45.25 37.433L42.25 39.1651C41.9167 39.3575 41.5 39.117 41.5 38.7321V35.2679C41.5 34.883 41.9167 34.6425 42.25 34.8349L45.25 36.567Z" fill="#1D1D1D"/>
</svg>

Group 2

@mrocklin
Copy link
Member

mrocklin commented Jun 4, 2021 via email

@GenevieveBuckley
Copy link
Contributor Author

  1. What I really want is to have the is_materialized info for all layers available at a glance. Ideally I could make the icon be either a solid or an empty circle, depending on status.
  2. Hm, I had been assuming the get_dependencies method would materialize the graph. In retrospect, I don't have a good reason for assuming that, other than the fact lots of innocuous seeming things seem to cause unwanted materialization. I'll check!
  3. Your icon with the html snippet is pretty cute! I'll try adding it in and we'll see how it looks.

@GenevieveBuckley
Copy link
Contributor Author

Update: it turns out that result.dask.dependencies[layer_key] is not exactly equivalent to dask_layer.get_dependencies(). The get_dependencies method appears to return more low level information.

Here's a short example using the timeseries dataset:

from dask.datasets import timeseries
ddf = timeseries().shuffle("id", shuffle="tasks").head(compute=False)
layer_key = 'simple-shuffle-029255f702ea9cf087d6955375c8e9ad'

This is what we get when we look at the high level graph dependencies

>>> ddf.dask.dependencies[layer_key]
{'make-timeseries-2d58a4d4763bdeb3eca63e8b5ffd6c3b'}

But when we use the layer get_dependencies method, there are two things that are different

  1. We can't use the same string as the key anymore, we need a tuple key instead; and
  2. The results are more low-level information than we saw previously
>>> layer = ddf.dask.layers[layer_key]
>>> tuple_key = ('simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0)
>>> layer.get_dependencies(tuple_key, layer.keys())

{('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 0),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 1),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 2),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 3),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 4),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 5),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 6),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 7),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 8),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 9),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 10),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 11),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 12),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 13),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 14),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 15),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 16),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 17),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 18),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 19),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 20),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 21),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 22),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 23),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 24),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 25),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 26),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 27),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 28),
 ('split-simple-shuffle-029255f702ea9cf087d6955375c8e9ad', 0, 29)}

@GenevieveBuckley
Copy link
Contributor Author

Starting to look nice:
html-hlg-grey-icons-Screenshot from 2021-06-07 18-17-46

The default is to have the "Layers" bit expanded, so all the layer headings are visible but not the detailed layer information.

TODO:

  • make the layer text the same grey colour as the highlevelgraph info text - DONE. They match now, although I've chosen a darker grey that passes a contrast accessibility check (join the discussion here)
  • add the little blue squares next to each layer Add grey icons like Jacob suggests - DONE
  • make the layer icons look different depending on whether they are materialized or not (light grey = unmaterialized; dark grey = materialized) - DONE
  • left align the second column in the tables It turns out that this is being overridden by jupyter's CSS settings (at least, that's why stack overflow says my align tag is being ignored), so I think we should probably leave this alone.
  • check dependencies still look good even if there are a lot of them - working on this now. It doesn't really look great imo, there's a lot of mostly-redundant information

0px;">{highlevelgraph_info}</p>

<div style="">
<details open style="margin-left: 0px;">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know you could default them to open, that's awesome!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I didn't know that either until I really wanted to do it! I'm very happy that turned out to be easy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, this "details open" trick is causing problems with the test checking HTML validity. So I think we'll just get rid of it entirely for now.

@GenevieveBuckley
Copy link
Contributor Author

What I'm thinking about now is how to get the layer names from within the layers themselves. From the high level graph, I just used the dictionary key for that particular layer. I'm not totally sure what the canonical way to get that information from inside the layer is yet.

There's no requirement for layers to have a .name attribute, and so many of them don't have this info stored nicely. For example, anything that is a MaterializedLayer doesn't have a name.

@mrocklin
Copy link
Member

mrocklin commented Jun 7, 2021

For tests I recommend that we ...

  1. Verify that the HTML is clean (I think that there are dashboard plots

@mrocklin
Copy link
Member

mrocklin commented Jun 7, 2021

Whoops, errant send, trying again

For tests I recommend that we ...

  1. Verify that the HTML is clean, see distributed.utils.is_valid_xml
  2. Check that certain terms occur in the output, like the names of the types of a layer or two.

Clearly we can't test the visual aesthetics, but having a couple of basic tests here is likely to keep this code healthy long term

@mrocklin
Copy link
Member

mrocklin commented Jun 7, 2021

I think that it would be good to have a nicer string here

Screen Shot 2021-06-07 at 9 48 49 AM

I wonder if there is a nicer way to show this information.

Screen Shot 2021-06-07 at 9 49 29 AM

For example, saying is_materialized: True seems clinical. It's how we would write it out in code. Instead, we might have an icon somewhere for this. We might also include the number of nodes in each layer.

These things could be in a table, they could also be in the summary. We can stuff more information there for at-a-glance information.

dask/highlevelgraph.py Outdated Show resolved Hide resolved
"is_materialized": layer.is_materialized(),
"dependencies": dependencies,
}
return info
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important, but my preference would be to keep logic like this inlined in order to avoid indirection (blog post here: https://matthewrocklin.com/blog/work/2019/06/23/avoid-indirection )

However, really I'd prefer that we move this down to Layer subclasses (even more indirection, I acknowledge).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it should go with the Layer.

I don't agree that it should be inlined with the _repr_html_ function logic, because imo that will make it harder for people making representations for layer subclasses. (Perhaps I misunderstand your point here though)

For someone writing a _repr_html function for a Layer subclass, it seems a lot easier to grab the default dictionary of interesting layer information with this function,, then add more dictionary keys/values specific to that particular Layer subclass, and then turn that into a nice HTML string. If you only have the HTML string, you might end up either duplicating a lot of the logic here for each layer subclass, OR faffing around removing HTML div tags, and trying to insert the new extra information into the right place in the HTML string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that dupliation would ensure (we would have to address this with methods, functions, etc..

However it could open up larger variations than just different keys/values in the table. For example we might include images like the chunk structure, or different coloring based on layer type.

@GenevieveBuckley
Copy link
Contributor Author

We might also include the number of nodes in each layer.

Maybe this is a silly question, but how do I know what the number of nodes in a layer is?

@github-actions github-actions bot added the array label Jun 8, 2021
@jsignell
Copy link
Member

I love this! I feel like it's already helping me gain insight. I pushed it a little using different chunking with Matt's example:

import dask.array as da

x = da.random.random((10000, 10000), chunks=(100, 100))
y = x + x.T - x.mean(axis=0)

image

image

Some ideas:

  1. Having the same summary items as arrray makes a lot of sense to me - this is the same as Matt's 1 above, but maybe including shape in there as well?
  2. The label area of the repr table seems wider than it needs to be.
  3. The size of the headers feels large to me. Especially when compared to the dask.array repr.

dask/highlevelgraph.py Outdated Show resolved Hide resolved
@GenevieveBuckley
Copy link
Contributor Author

GenevieveBuckley commented Jun 11, 2021

Here's what it looks like now

Dataframe example

image

Array example

image

Note: there's something odd on my machine making the little circle icons not line up with the headings. I also don't seem to get the little arrows indicating you can expand the collapsible details. However, Jacob, Matt, and Julia all have really pretty screenshots in this thread, so I think it's probably fine for everyone else.

@GenevieveBuckley
Copy link
Contributor Author

Sorry to keep coming back with suggestions

Don't be sorry, it's fun to tinker.

In some more aggressive future we might consider including the dask array SVG array chart directly in the HTML, that's probably future work though

Sure, sounds good.

Here's a summary of the changes I've made in the last round of edits:

  • fixed bug showing dtype as None where it shouldn't
  • add array shape to collection_annotations
  • add array chunksize to collection annotations
  • Add columns to dataframe collection annotations
  • Add npartitions to the dataframe collection annotations
  • Removed verbose array chunks information from collection_annotations (we can always add this back in if we need it later)
  • Removed verbose dataframe divisions information from collection_annotations (we can always add this back in if we need it later)

Here's what didn't change:

  • Matt suggests: For class types can we try repr(cls) rather than str(cls) so that we get something like the following:
    In [4]: da.Array
    Out[4]: dask.array.core.Array
    Answer: I'm sorry, I can't make this render without the <class '...'> brackets (which is what I think you're asking me to do). I tried these combinations of things in collection_annotations, and all of them render the same way in the output table:
    "type": self.__class__,
    "type1": str(self.__class__),
    "type2": repr(self.__class__),
    "type4": type(self),
    "type5": str(type(self)),
    "type6": repr(type(self)),
  • Julia suggests: making the font size for the headings smaller.
    Sure, I'm happy to implement this is there is agreement on what size it should be. I used some of the choices from Jacob's HTML work here Add HTML repr to scheduler_info and incorporate into client and cluster reprs distributed#4857 as a guide, so we'd want to be consistent everywhere, if possible.

@mrocklin
Copy link
Member

Answer: I'm sorry, I can't make this render without the <class '...'> brackets (which is what I think you're asking me to do). I tried these combinations of things in collection_annotations, and all of them render the same way in the output table:

Yeah, I just tried this myself. I expected it to work because of this:

image

I suspect that the IPython display function must have special cased this, causing the confusion.

@mrocklin
Copy link
Member

I'm happy to withdraw my request.

@mrocklin
Copy link
Member

Tests seem to be sad. Happy to merge otherwise though

@jrbourbeau
Copy link
Member

Just merged main to make tests happy

@jrbourbeau
Copy link
Member

For class types can we try repr(cls) rather than str(cls) so that we get something like the following

FWIW we have a typename utility function which should help with this:

In [1]: import dask.array as da

In [2]: from dask.utils import typename

In [3]: typename(da.Array)
Out[3]: 'dask.array.core.Array'

@jacobtomlinson
Copy link
Member

I see what @jsignell means about the titles. Perhaps for the layers we should drop the size down to <h4>?

@jsignell
Copy link
Member

I think h4 for the layers would be nice, but I am also totally happy with this going in as-is. It is a vast improvement over
image

@jrbourbeau jrbourbeau mentioned this pull request Jun 14, 2021
4 tasks
@mrocklin
Copy link
Member

Let's merge this in. We can iterate.

@mrocklin mrocklin merged commit ca94cb9 into dask:main Jun 14, 2021
@GenevieveBuckley
Copy link
Contributor Author

I didn't check what it looks like when the jupyter notebook is in dark mode, that's probably another useful thing to try.

@jacobtomlinson
Copy link
Member

Most of the text is readable. The darker gray isn't so good and the icons need updating.

image

@jacobtomlinson
Copy link
Member

I've opened #7809 to add dark mode support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants