Update figures (dask#8401)
A start at updating the docs with new figures, re dask/community#135.
scharlottej13 authored and aeisenbarth committed Jan 6, 2022
1 parent 15870bf commit c5bd255
Showing 43 changed files with 6,962 additions and 2,182 deletions.
2 changes: 1 addition & 1 deletion docs/source/array-design.rst
@@ -4,7 +4,7 @@ Internal Design
Overview
--------

.. image:: images/array.png
.. image:: images/array.svg
:width: 40 %
:align: right
:alt: 12 rectangular blocks arranged as a 4-row, 3-column layout. Each block includes 'x' and its location in the table starting with ('x',0,0) in the top-left, and a size of 5x8.
6 changes: 3 additions & 3 deletions docs/source/array-overlap.rst
@@ -32,20 +32,20 @@ Explanation

Consider two neighboring blocks in a Dask array:

.. image:: images/unoverlapping-neighbors.png
.. image:: images/unoverlapping-neighbors.svg
:width: 30%
:alt: Two neighboring blocks which do not overlap.

We extend each block by trading thin nearby slices between arrays:

.. image:: images/overlapping-neighbors.png
.. image:: images/overlapping-neighbors.svg
:width: 30%
:alt: Two neighboring blocks with thin strips along their shared border representing data shared between them.
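In one dimension, this slice-trading can be sketched in a few lines of plain Python. This is a toy illustration only; the ``share_boundaries`` name and list-based blocks are hypothetical, not Dask's actual implementation:

```python
def share_boundaries(left, right, depth=1):
    """Give each block a thin slice from its neighbor, so that
    boundary elements have the neighboring data they depend on."""
    left_overlapped = left + right[:depth]    # left gains right's leading edge
    right_overlapped = left[-depth:] + right  # right gains left's trailing edge
    return left_overlapped, right_overlapped

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(share_boundaries(a, b))
# ([1, 2, 3, 4, 5], [4, 5, 6, 7, 8])
```

After a boundary-dependent function (say, a sliding-window mean) runs on each extended block, the borrowed elements are trimmed away again.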

We do this in all directions, including also diagonal interactions with the
overlap function:

.. image:: images/overlapping-blocks.png
.. image:: images/overlapping-blocks.svg
:width: 40%
:alt: A two-dimensional grid of blocks where each one has thin strips around their borders representing data shared from their neighbors. They include small corner bits for data shared from diagonal neighbors as well.

3 changes: 2 additions & 1 deletion docs/source/array.rst
@@ -42,9 +42,10 @@ Dask Array.
Design
------

.. image:: images/dask-array-black-text.svg
.. image:: images/dask-array.svg
:alt: Dask arrays coordinate many numpy arrays
:align: right
:scale: 35%

Dask arrays coordinate many NumPy arrays (or "duck arrays" that are
sufficiently NumPy-like in API such as CuPy or Sparse arrays) arranged into a
10 changes: 6 additions & 4 deletions docs/source/dataframe.rst
@@ -40,14 +40,16 @@ Dask DataFrame.
Design
------

.. image:: images/dask-dataframe.svg
:alt: Column of four squares collectively labeled as a Dask DataFrame with a single constituent square labeled as a Pandas DataFrame.
:width: 35%
:align: right

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the
index. A Dask DataFrame is partitioned *row-wise*, grouping rows by index value
for efficiency. These Pandas objects may live on disk or on other machines.
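As a rough sketch in plain Python — the ``divisions`` tuple and ``partition`` helper below are hypothetical illustrations of the idea, not Dask internals — row-wise partitioning by index value looks like this:

```python
import bisect

# (index, value) rows to be grouped into partitions by index value.
rows = [(0, 'a'), (3, 'b'), (5, 'c'), (8, 'd'), (9, 'e')]
divisions = (0, 5, 9)  # partition boundaries on the index; the last is inclusive

def partition(rows, divisions):
    """Assign each row to the partition whose index range contains it."""
    parts = [[] for _ in range(len(divisions) - 1)]
    for idx, val in rows:
        i = bisect.bisect_right(divisions, idx) - 1
        i = min(i, len(parts) - 1)  # fold the inclusive endpoint into the last partition
        parts[i].append((idx, val))
    return parts

print(partition(rows, divisions))
# [[(0, 'a'), (3, 'b')], [(5, 'c'), (8, 'd'), (9, 'e')]]
```

Because rows are grouped by index value, operations that align on the index (lookups, joins, resampling) can route work to a single partition instead of scanning all of them.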

.. image:: images/dask-dataframe.svg
:alt: Dask DataFrames coordinate many Pandas DataFrames
:width: 40%

|
Dask DataFrame copies the Pandas API
------------------------------------
34 changes: 22 additions & 12 deletions docs/source/graphs.rst
@@ -5,7 +5,7 @@ Task Graphs

Internally, Dask encodes algorithms in a simple format involving Python dicts,
tuples, and functions. This graph format can be used in isolation from the
dask collections. Working directly with dask graphs is rare, unless you intend
dask collections. Working directly with dask graphs is rare, though, unless you intend
to develop new modules with Dask. Even then, :doc:`dask.delayed <delayed>` is
often a better choice. If you are a *core developer*, then you should start here.

@@ -39,27 +39,29 @@ computation, often a function call on a non-trivial amount of data. We
represent these tasks as nodes in a graph with edges between nodes if one task
depends on data produced by another. We call upon a *task scheduler* to
execute this graph in a way that respects these data dependencies and leverages
parallelism where possible, multiple independent tasks can be run
parallelism where possible, so multiple independent tasks can be run
simultaneously.

|
.. figure:: images/map-reduce-task-scheduling.svg
:scale: 40%

There are a number of methods for task scheduling, including embarrassingly parallel, MapReduce, and full task scheduling.

|
Many solutions exist. This is a common approach in parallel execution
frameworks. Often task scheduling logic hides within other larger frameworks
(Luigi, Storm, Spark, IPython Parallel, and so on) and so is often reinvented.

Dask is a specification that encodes task schedules with minimal incidental
complexity using terms common to all Python projects, namely dicts, tuples,
(e.g. Luigi, Storm, Spark, IPython Parallel, etc.) and so is often reinvented.
Dask is a specification that encodes full task scheduling with minimal incidental
complexity using terms common to all Python projects, namely, dicts, tuples,
and callables. Ideally this minimum solution is easy to adopt and understand
by a broad community.

Example
-------

.. image:: _static/dask-simple.png
:height: 400px
:alt: A simple dask dictionary
:align: right


Consider the following simple program:

.. code-block:: python
@@ -82,6 +84,14 @@ We encode this as a dictionary in the following way:
'y': (inc, 'x'),
'z': (add, 'y', 10)}
Which is represented by the following Dask graph:

.. image:: _static/dask-simple.png
:height: 400px
:alt: A simple dask dictionary

|
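Such a graph can be evaluated by a few lines of Python. The interpreter below is a minimal sketch for illustration — Dask's real schedulers add caching, ordering, and parallelism — and ``simple_get`` is a hypothetical name, not part of Dask's API:

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# The example graph: keys map to literal data or to task tuples
# of the form (function, arg1, arg2, ...).
dsk = {'x': 1,
       'y': (inc, 'x'),
       'z': (add, 'y', 10)}

def simple_get(dsk, key):
    """Recursively evaluate one key of a Dask-style graph."""
    value = dsk[key]
    if isinstance(value, tuple) and callable(value[0]):
        func, *args = value
        # Arguments that are keys in the graph are evaluated recursively;
        # anything else is passed through as a literal.
        return func(*(simple_get(dsk, a) if isinstance(a, str) and a in dsk else a
                      for a in args))
    return value

print(simple_get(dsk, 'z'))  # inc(1) = 2, then add(2, 10) = 12
```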
While less pleasant than our original code, this representation can be analyzed
and executed by other Python code, not just the CPython interpreter. We don't
recommend that users write code in this way, but rather that it is an
14 changes: 12 additions & 2 deletions docs/source/how-to/adaptive.rst
@@ -16,8 +16,18 @@ two situations:
Particularly efficient users may learn to manually add and remove workers
during their session, but this is rare. Instead, we would like the size of a
Dask cluster to match the computational needs at any given time. This is the
goal of the *adaptive deployments* discussed in this document. These are
particularly helpful for interactive workloads, which are characterized by long
goal of the *adaptive deployments* discussed in this document.

|
.. image:: ../images/dask-adaptive.svg
:alt: Dask adaptive scaling
:align: center
:scale: 40%

|
These are particularly helpful for interactive workloads, which are characterized by long
periods of inactivity interrupted with short bursts of heavy activity.
Adaptive deployments can result in faster analyses that give users much
more power, while putting much less pressure on computational resources.
55 changes: 44 additions & 11 deletions docs/source/how-to/deploy-dask-clusters.rst
@@ -6,6 +6,18 @@ locally on your own machine or on a distributed cluster. If you are just
getting started, then this page is unnecessary. Dask does not require any setup
if you only want to use it on a single computer.

You can continue reading or watch the screencast below:

.. raw:: html

<iframe width="560"
height="315"
src="https://www.youtube.com/embed/TQM9zIBzNBo"
style="margin: 0 auto 20px auto; display: block;"
frameborder="0"
allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>

Dask has two families of task schedulers:

1. **Single machine scheduler**: This scheduler provides basic features on a
@@ -16,7 +28,16 @@ Dask has two families of task schedulers:
more features, but also requires a bit more effort to set up. It can
run locally or distributed across a cluster.

If you import Dask, set up a computation, and then call ``compute``, then you
|
.. figure:: ../images/dask-overview-distributed-callout.svg
:alt: Dask is composed of three parts. "Collections" create "Task Graphs" which are then sent to the "Scheduler" for execution. There are two types of schedulers that are described in more detail below.

High level collections are used to generate task graphs which can be executed on a single machine or a cluster. Using the Distributed scheduler enables creation of a Dask cluster for multi-machine computation.

|
If you import Dask, set up a computation, and call ``compute``, then you
will use the single-machine scheduler by default. To use the ``dask.distributed``
scheduler you must set up a ``Client``

@@ -34,18 +55,28 @@ scheduler you must set up a ``Client``
Note that the newer ``dask.distributed`` scheduler is often preferable, even on
single workstations. It contains many diagnostics and features not found in
the older single-machine scheduler. The following pages explain in more detail
how to set up Dask on a variety of local and distributed hardware.
the older single-machine scheduler.

.. raw:: html
There are also a number of different *cluster managers* available, so you can use
Dask distributed with a range of platforms. These *cluster managers* deploy a scheduler
and the necessary workers as determined by communicating with the *resource manager*.
`Dask Jobqueue <https://github.com/dask/dask-jobqueue>`_, for example, is a set of
*cluster managers* for HPC users and works with job queueing systems
(in this case, the *resource manager*) such as `PBS <https://en.wikipedia.org/wiki/Portable_Batch_System>`_,
`Slurm <https://en.wikipedia.org/wiki/Slurm_Workload_Manager>`_,
and `SGE <https://en.wikipedia.org/wiki/Oracle_Grid_Engine>`_.
Those workers are then allocated physical hardware resources.

<iframe width="560"
height="315"
src="https://www.youtube.com/embed/TQM9zIBzNBo"
style="margin: 0 auto 20px auto; display: block;"
frameborder="0"
allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>
.. figure:: ../images/dask-cluster-manager.svg
:scale: 50%

An overview of cluster management with Dask distributed.

To summarize, you can use the default single-machine scheduler to run Dask
on your local machine. If you'd like to use a cluster *or* simply take advantage
of the :doc:`extensive diagnostics <../diagnostics-distributed>`,
you can use Dask distributed. The following resources explain
in more detail how to set up Dask on a variety of local and distributed hardware:

- Single Machine:
- :doc:`Default Scheduler <deploy-dask/single-machine>`: The no-setup default.
@@ -54,6 +85,8 @@ how to set up Dask on a variety of local and distributed hardware.
the newer system on a single machine. This provides more advanced
features while still requiring almost no setup.
- Distributed computing:
- `Beginner's Guide to Configuring a Dask distributed Cluster <https://blog.dask.org/2020/07/30/beginners-config>`_
- `Overview of cluster management options <https://blog.dask.org/2020/07/23/current-state-of-distributed-dask-clusters>`_
- :doc:`Manual Setup <deploy-dask/cli>`: The command line interface to set up
``dask-scheduler`` and ``dask-worker`` processes. Useful for IT or
anyone building a deployment solution.
Binary file removed docs/source/images/array.png
Binary file modified docs/source/images/async-embarrassing.gif
Binary file removed docs/source/images/collections-schedulers.png
