Google Summer of Code 2021 Project #107

freyam · 2021-08-10T10:43:45Z

I will be writing about my work this summer with @GenevieveBuckley and @martindurant on the different representations of Dask computations.

This work was part of the annual Google Summer of Code program, in which students get the opportunity to work with mentors on large-scale projects.

The blog post will contain a list of all the merged work and explain what the new features mean for users 🚀

GenevieveBuckley · 2021-08-10T13:06:40Z

From our Slack discussion...

Here's the structure I think we should have for the dask-blog post

Section 1: Visualizing high level graphs

Add node size scaling to the Graphviz output for the high level graphs dask#7869
Add tooltips to graphviz dask#7973
Add colors to represent high level layer types dask#7974
Also fixed a bug in dask visualize: Fixing calling .visualize() with filename=None dask#7740
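
As a rough illustration of what the three visualization PRs above add (node size scaling, tooltips, and colors per layer type), here is a minimal sketch that builds Graphviz DOT source by hand. It is not Dask's implementation: the layer list, the `font_size` scaling rule, and the color choices are all hypothetical. Only the DOT attributes (`fontsize`, `tooltip`, `fillcolor`) are standard Graphviz.

```python
import math

# Hypothetical layer summary: (name, layer type, number of tasks).
LAYERS = [
    ("io", "io", 100),
    ("blockwise", "blockwise", 100),
    ("shuffle", "shuffle", 10_000),
]

# Hypothetical colors per layer type; Dask's real palette may differ.
COLORS = {"io": "lightblue", "blockwise": "lightgreen", "shuffle": "salmon"}


def font_size(n_tasks, min_size=20, max_size=60, max_tasks=10_000):
    """Scale the font size with the log of the task count, clamped to a range."""
    if n_tasks <= 1:
        return min_size
    frac = min(math.log(n_tasks) / math.log(max_tasks), 1.0)
    return round(min_size + frac * (max_size - min_size))


def to_dot(layers):
    """Build DOT source with one sized, colored, tooltipped node per layer."""
    lines = ["digraph hlg {", "  node [shape=box, style=filled];"]
    for name, kind, n_tasks in layers:
        lines.append(
            f'  "{name}" [fontsize={font_size(n_tasks)}, '
            f'fillcolor="{COLORS.get(kind, "white")}", '
            f'tooltip="{kind}: {n_tasks} tasks"];'
        )
    # Chain the layers in order, purely for illustration.
    for (src, _, _), (dst, _, _) in zip(layers, layers[1:]):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(LAYERS)
print(dot_source)
```

Rendering the printed source with Graphviz (SVG output shows the tooltips on hover) should give larger nodes for layers with more tasks.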

Section 2: HTML representation
(maybe link to previous blogposts/twitter threads that talk about HTML reprs in Dask)

Add dask.array SVG to the HTML Repr dask#7886
Add HTML Repr for ProcessInterface Class and all its subclasses distributed#5181
Add HTML Repr for Security Class distributed#5178

All of these (except the bugfix) will need nice before and after screenshots. Putting those together would be a fantastic start (feel free to make a new folder inside the dask-blog/images directory so they're all grouped in one place)
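
For readers unfamiliar with the HTML reprs in Section 2: Jupyter looks for a `_repr_html_` method on an object and, if present, renders the returned HTML instead of the plain `repr`. A minimal sketch of that mechanism follows. The `ExampleSettings` class and its fields are hypothetical stand-ins, not the actual `Security` or `ProcessInterface` classes from distributed.

```python
from html import escape


class ExampleSettings:
    """Hypothetical settings object with both a text repr and an HTML repr."""

    def __init__(self, **options):
        self.options = options

    def __repr__(self):
        # Plain-text fallback, shown in terminals and logs.
        args = ", ".join(f"{k}={v!r}" for k, v in self.options.items())
        return f"ExampleSettings({args})"

    def _repr_html_(self):
        # Jupyter calls this method and renders the returned HTML string.
        rows = "".join(
            f"<tr><th>{escape(str(k))}</th><td>{escape(str(v))}</td></tr>"
            for k, v in self.options.items()
        )
        return f"<table>{rows}</table>"


settings = ExampleSettings(require_encryption=True, tls_ca_file="ca.pem")
html_output = settings._repr_html_()
print(repr(settings))
print(html_output)
```

In a notebook, evaluating `settings` as the last expression in a cell would display the table rather than the text repr.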

_posts/2021-08-23-gsoc-2021-project.md

Co-authored-by: Genevieve Buckley <[email protected]>

GenevieveBuckley · 2021-08-16T07:08:41Z

We talked earlier about the difference in audience/purpose between your Medium blogpost and this one.

Medium blogpost = here's all the stuff I worked on
Dask blogpost = here's an overview of some new features for Dask users

This draft reads much more like the first than the second, with a lot of first-person sentences ("I worked...", "I changed...", "I tweaked..."). We'll probably want to adjust it to better suit the second audience.

freyam · 2021-08-16T07:09:44Z

I agree. I will edit accordingly.

GenevieveBuckley · 2021-08-16T07:16:38Z

General suggestions:

Shorter headings (shift the PR links/titles into the text below)
More descriptive alt-text for images. Ideally each should be a full sentence that makes sense without other supporting information (i.e. we can't expect a reader to know what "PR 5178" refers to; we need to say so)

BTW, I'm happy to write or re-write text content, and will probably do some of this before we publish the final piece.

freyam · 2021-08-19T05:46:54Z

Hiii Genevieve, I read the draft you just pushed to the branch. It's amazing 💯!

I had also made some adjustments and tweaks of my own yesterday. It's not much, and next to yours it feels quite unprofessional.

I think we should go with yours. I already have the images and some extra text ready, and will commit them once I get to my laptop 😀

GenevieveBuckley · 2021-08-19T05:54:24Z

@freyam - there are still some important to-do items listed here, mostly involving adding the rest of the demonstration examples.

@jacobtomlinson - you might like to take a brief look over some of this (most relevant to your interests is the second section on HTML representations). No worries if you're busy though.

freyam · 2021-08-19T07:25:23Z

Updated ✔️

martindurant

Just some small thoughts.

jacobtomlinson

Looks great to me

GenevieveBuckley · 2021-08-20T07:23:15Z

_posts/2021-08-23-gsoc-2021-project.md

+- Dataframe shuffles are particularly expensive operations. You can [read more about this here](https://docs.dask.org/en/latest/dataframe-best-practices.html#avoid-full-data-shuffling).
+- Reading and writing data to/from storage/network services is often high-latency and therefore a bottleneck.
+- Blockwise layers are generally efficient for computation.
+- All layers are materialized during computation.


I don't know if we should write more about materialized layers here. I can't think of a good way to say:

ideally we won't see many materialized layers before compute() is called

but we might see some and that's ok

but you might also accidentally materialize layers without meaning to, perhaps by counting the number of tasks or looking at the HTML repr (which in turn counts the number of tasks)

and fixing that is a job for dask developers, not dask users

I think on balance this might be more confusing than helpful. If anyone has ideas or thoughts around this I'd be interested to hear them.
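
To make the "accidental materialization" point concrete, here is a toy model in plain Python. It is only a sketch of the idea and does not use Dask: `LazyLayer` and its methods are hypothetical, and real Dask layers are more involved. The point it shows is that a layer can stay as a compact description until something asks for the number of tasks, at which point the full task dictionary gets built.

```python
class LazyLayer:
    """Toy graph layer that builds its task dict only on demand (hypothetical)."""

    def __init__(self, name, n_tasks):
        self.name = name
        self._n_tasks = n_tasks
        self._tasks = None  # nothing has been materialized yet

    def is_materialized(self):
        return self._tasks is not None

    def _materialize(self):
        # Expand the compact description into one entry per task.
        if self._tasks is None:
            self._tasks = {
                (self.name, i): ("noop", i) for i in range(self._n_tasks)
            }
        return self._tasks

    def __len__(self):
        # Counting tasks goes through the full dict, forcing materialization.
        return len(self._materialize())


layer = LazyLayer("add", 1000)
before = layer.is_materialized()
n_tasks = len(layer)  # e.g. a repr that reports the task count
after = layer.is_materialized()
print(before, n_tasks, after)
```

In this toy, anything that calls `len(layer)`, such as a repr that reports a task count, flips the layer to materialized as a side effect.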

GenevieveBuckley · 2021-08-20T07:26:03Z

Thank you @martindurant and @jacobtomlinson
I plan to merge this next week (most likely Australian Tuesday / US Monday)

If either of you have thoughts about this point https://github.com/dask/dask-blog/pull/107/files#r692726228 before then, let me know.

freyam · 2021-08-24T04:30:54Z

💛

base ready

7a4b2d5

update

aa17501

freyam marked this pull request as ready for review August 15, 2021 11:32

GenevieveBuckley reviewed Aug 16, 2021
