Update figures (dask#8401)
A start at updating the docs with new figures, re dask/community#135.
scharlottej13 authored and aeisenbarth committed Jan 6, 2022
1 parent 15870bf commit c5bd255
Showing 43 changed files with 6,962 additions and 2,182 deletions.
2 changes: 1 addition & 1 deletion docs/source/array-design.rst
@@ -4,7 +4,7 @@ Internal Design
Overview
--------

.. image:: images/array.png
.. image:: images/array.svg
:width: 40 %
:align: right
:alt: 12 rectangular blocks arranged as a 4-row, 3-column layout. Each block includes 'x' and its location in the table starting with ('x',0,0) in the top-left, and a size of 5x8.
6 changes: 3 additions & 3 deletions docs/source/array-overlap.rst
@@ -32,20 +32,20 @@ Explanation

Consider two neighboring blocks in a Dask array:

.. image:: images/unoverlapping-neighbors.png
.. image:: images/unoverlapping-neighbors.svg
:width: 30%
:alt: Two neighboring blocks which do not overlap.

We extend each block by trading thin nearby slices between arrays:

.. image:: images/overlapping-neighbors.png
.. image:: images/overlapping-neighbors.svg
:width: 30%
:alt: Two neighboring blocks with thin strips along their shared border representing data shared between them.
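In one dimension, this slice-trading can be sketched in a few lines of plain Python. This is a toy illustration only; the ``share_boundaries`` name and list-based blocks are hypothetical, not Dask's actual implementation:

```python
def share_boundaries(left, right, depth=1):
    """Give each block a thin slice from its neighbor, so that
    boundary elements have the neighboring data they depend on."""
    left_overlapped = left + right[:depth]    # left gains right's leading edge
    right_overlapped = left[-depth:] + right  # right gains left's trailing edge
    return left_overlapped, right_overlapped

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(share_boundaries(a, b))
# ([1, 2, 3, 4, 5], [4, 5, 6, 7, 8])
```

After a boundary-dependent function (say, a sliding-window mean) runs on each extended block, the borrowed elements are trimmed away again.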

We do this in all directions, including also diagonal interactions with the
overlap function:

.. image:: images/overlapping-blocks.png
.. image:: images/overlapping-blocks.svg
:width: 40%
:alt: A two-dimensional grid of blocks where each one has thin strips around their borders representing data shared from their neighbors. They include small corner bits for data shared from diagonal neighbors as well.

3 changes: 2 additions & 1 deletion docs/source/array.rst
@@ -42,9 +42,10 @@ Dask Array.
Design
------

.. image:: images/dask-array-black-text.svg
.. image:: images/dask-array.svg
:alt: Dask arrays coordinate many numpy arrays
:align: right
:scale: 35%

Dask arrays coordinate many NumPy arrays (or "duck arrays" that are
sufficiently NumPy-like in API such as CuPy or Sparse arrays) arranged into a
10 changes: 6 additions & 4 deletions docs/source/dataframe.rst
@@ -40,14 +40,16 @@ Dask DataFrame.
Design
------

.. image:: images/dask-dataframe.svg
:alt: Column of four squares collectively labeled as a Dask DataFrame with a single constituent square labeled as a Pandas DataFrame.
:width: 35%
:align: right

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the
index. A Dask DataFrame is partitioned *row-wise*, grouping rows by index value
for efficiency. These Pandas objects may live on disk or on other machines.
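As a rough sketch in plain Python — the ``divisions`` tuple and ``partition`` helper below are hypothetical illustrations of the idea, not Dask internals — row-wise partitioning by index value looks like this:

```python
import bisect

# (index, value) rows to be grouped into partitions by index value.
rows = [(0, 'a'), (3, 'b'), (5, 'c'), (8, 'd'), (9, 'e')]
divisions = (0, 5, 9)  # partition boundaries on the index; the last is inclusive

def partition(rows, divisions):
    """Assign each row to the partition whose index range contains it."""
    parts = [[] for _ in range(len(divisions) - 1)]
    for idx, val in rows:
        i = bisect.bisect_right(divisions, idx) - 1
        i = min(i, len(parts) - 1)  # fold the inclusive endpoint into the last partition
        parts[i].append((idx, val))
    return parts

print(partition(rows, divisions))
# [[(0, 'a'), (3, 'b')], [(5, 'c'), (8, 'd'), (9, 'e')]]
```

Because rows are grouped by index value, operations that align on the index (lookups, joins, resampling) can route work to a single partition instead of scanning all of them.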

.. image:: images/dask-dataframe.svg
:alt: Dask DataFrames coordinate many Pandas DataFrames
:width: 40%

|
Dask DataFrame copies the Pandas API
------------------------------------
34 changes: 22 additions & 12 deletions docs/source/graphs.rst
@@ -5,7 +5,7 @@ Task Graphs

Internally, Dask encodes algorithms in a simple format involving Python dicts,
tuples, and functions. This graph format can be used in isolation from the
dask collections. Working directly with dask graphs is rare, unless you intend
dask collections. Working directly with dask graphs is rare, though, unless you intend
to develop new modules with Dask. Even then, :doc:`dask.delayed <delayed>` is
often a better choice. If you are a *core developer*, then you should start here.

@@ -39,27 +39,29 @@ computation, often a function call on a non-trivial amount of data. We
represent these tasks as nodes in a graph with edges between nodes if one task
depends on data produced by another. We call upon a *task scheduler* to
execute this graph in a way that respects these data dependencies and leverages
parallelism where possible, multiple independent tasks can be run
parallelism where possible, so multiple independent tasks can be run
simultaneously.

|
.. figure:: images/map-reduce-task-scheduling.svg
:scale: 40%

There are a number of methods for task scheduling, including embarrassingly parallel, MapReduce, and full task scheduling.

|
Many solutions exist. This is a common approach in parallel execution
frameworks. Often task scheduling logic hides within other larger frameworks
(Luigi, Storm, Spark, IPython Parallel, and so on) and so is often reinvented.

Dask is a specification that encodes task schedules with minimal incidental
complexity using terms common to all Python projects, namely dicts, tuples,
(e.g. Luigi, Storm, Spark, IPython Parallel, etc.) and so is often reinvented.
Dask is a specification that encodes full task scheduling with minimal incidental
complexity using terms common to all Python projects, namely, dicts, tuples,
and callables. Ideally this minimum solution is easy to adopt and understand
by a broad community.

Example
-------

.. image:: _static/dask-simple.png
:height: 400px
:alt: A simple dask dictionary
:align: right


Consider the following simple program:

.. code-block:: python
@@ -82,6 +84,14 @@ We encode this as a dictionary in the following way:
'y': (inc, 'x'),
'z': (add, 'y', 10)}
Which is represented by the following Dask graph:

.. image:: _static/dask-simple.png
:height: 400px
:alt: A simple dask dictionary

|
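Such a graph can be evaluated by a few lines of Python. The interpreter below is a minimal sketch for illustration — Dask's real schedulers add caching, ordering, and parallelism — and ``simple_get`` is a hypothetical name, not part of Dask's API:

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# The example graph: keys map to literal data or to task tuples
# of the form (function, arg1, arg2, ...).
dsk = {'x': 1,
       'y': (inc, 'x'),
       'z': (add, 'y', 10)}

def simple_get(dsk, key):
    """Recursively evaluate one key of a Dask-style graph."""
    value = dsk[key]
    if isinstance(value, tuple) and callable(value[0]):
        func, *args = value
        # Arguments that are keys in the graph are evaluated recursively;
        # anything else is passed through as a literal.
        return func(*(simple_get(dsk, a) if isinstance(a, str) and a in dsk else a
                      for a in args))
    return value

print(simple_get(dsk, 'z'))  # inc(1) = 2, then add(2, 10) = 12
```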
While less pleasant than our original code, this representation can be analyzed
and executed by other Python code, not just the CPython interpreter. We don't
recommend that users write code in this way, but rather that it is an
14 changes: 12 additions & 2 deletions docs/source/how-to/adaptive.rst
@@ -16,8 +16,18 @@ two situations:
Particularly efficient users may learn to manually add and remove workers
during their session, but this is rare. Instead, we would like the size of a
Dask cluster to match the computational needs at any given time. This is the
goal of the *adaptive deployments* discussed in this document. These are
particularly helpful for interactive workloads, which are characterized by long
goal of the *adaptive deployments* discussed in this document.

|
.. image:: ../images/dask-adaptive.svg
:alt: Dask adaptive scaling
:align: center
:scale: 40%

|
These are particularly helpful for interactive workloads, which are characterized by long
periods of inactivity interrupted with short bursts of heavy activity.
Adaptive deployments can result in faster analyses that give users much
more power, while putting much less pressure on computational resources.
55 changes: 44 additions & 11 deletions docs/source/how-to/deploy-dask-clusters.rst
@@ -6,6 +6,18 @@ locally on your own machine or on a distributed cluster. If you are just
getting started, then this page is unnecessary. Dask does not require any setup
if you only want to use it on a single computer.

You can continue reading or watch the screencast below:

.. raw:: html

<iframe width="560"
height="315"
src="https://www.youtube.com/embed/TQM9zIBzNBo"
style="margin: 0 auto 20px auto; display: block;"
frameborder="0"
allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>

Dask has two families of task schedulers:

1. **Single machine scheduler**: This scheduler provides basic features on a
@@ -16,7 +28,16 @@ Dask has two families of task schedulers:
more features, but also requires a bit more effort to set up. It can
run locally or distributed across a cluster.

If you import Dask, set up a computation, and then call ``compute``, then you
|
.. figure:: ../images/dask-overview-distributed-callout.svg
:alt: Dask is composed of three parts. "Collections" create "Task Graphs" which are then sent to the "Scheduler" for execution. There are two types of schedulers that are described in more detail below.

High level collections are used to generate task graphs which can be executed on a single machine or a cluster. Using the Distributed scheduler enables creation of a Dask cluster for multi-machine computation.

|
If you import Dask, set up a computation, and call ``compute``, then you
will use the single-machine scheduler by default. To use the ``dask.distributed``
scheduler you must set up a ``Client``

@@ -34,18 +55,28 @@ scheduler you must set up a ``Client``
Note that the newer ``dask.distributed`` scheduler is often preferable, even on
single workstations. It contains many diagnostics and features not found in
the older single-machine scheduler. The following pages explain in more detail
how to set up Dask on a variety of local and distributed hardware.
the older single-machine scheduler.

.. raw:: html
There are also a number of different *cluster managers* available, so you can use
Dask distributed with a range of platforms. These *cluster managers* deploy a scheduler
and the necessary workers as determined by communicating with the *resource manager*.
`Dask Jobqueue <https://github.com/dask/dask-jobqueue>`_, for example, is a set of
*cluster managers* for HPC users and works with job queueing systems
(in this case, the *resource manager*) such as `PBS <https://en.wikipedia.org/wiki/Portable_Batch_System>`_,
`Slurm <https://en.wikipedia.org/wiki/Slurm_Workload_Manager>`_,
and `SGE <https://en.wikipedia.org/wiki/Oracle_Grid_Engine>`_.
Those workers are then allocated physical hardware resources.

<iframe width="560"
height="315"
src="https://www.youtube.com/embed/TQM9zIBzNBo"
style="margin: 0 auto 20px auto; display: block;"
frameborder="0"
allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>
.. figure:: ../images/dask-cluster-manager.svg
:scale: 50%

An overview of cluster management with Dask distributed.

To summarize, you can use the default single-machine scheduler to run Dask
on your local machine. If you'd like to use a cluster *or* simply take advantage
of the :doc:`extensive diagnostics <../diagnostics-distributed>`,
you can use Dask distributed. The following resources explain
in more detail how to set up Dask on a variety of local and distributed hardware:

- Single Machine:
- :doc:`Default Scheduler <deploy-dask/single-machine>`: The no-setup default.
@@ -54,6 +85,8 @@ how to set up Dask on a variety of local and distributed hardware.
the newer system on a single machine. This provides more advanced
features while still requiring almost no setup.
- Distributed computing:
- `Beginner's Guide to Configuring a Dask distributed Cluster <https://blog.dask.org/2020/07/30/beginners-config>`_
- `Overview of cluster management options <https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters>`_
- :doc:`Manual Setup <deploy-dask/cli>`: The command line interface to set up
``dask-scheduler`` and ``dask-worker`` processes. Useful for IT or
anyone building a deployment solution.
Binary file removed docs/source/images/array.png
Binary file modified docs/source/images/async-embarrassing.gif
Binary file removed docs/source/images/collections-schedulers.png
