lincc-frameworks · dougbrn · Sep 11, 2023 · Sep 11, 2023 · Sep 11, 2023 · nevencaplar
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -6,4 +6,5 @@ ipython
 jupytext
 jupyter
 matplotlib
-eztao
+eztao
+ray
diff --git a/docs/tutorials.rst b/docs/tutorials.rst
@@ -13,3 +13,4 @@ functionality.
     Binning Sources in the Ensemble <tutorials/binning_slowly_changing_sources>
     Structure Function Showcase <tutorials/structure_function_showcase>
     Loading Data into the Ensemble <tutorials/tape_datasets>
+    Using Ray with the Ensemble <tutorials/using_ray_with_the_ensemble>
diff --git a/docs/tutorials/using_ray_with_the_ensemble.ipynb b/docs/tutorials/using_ray_with_the_ensemble.ipynb
@@ -0,0 +1,184 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "bcb10f72-948f-475e-a856-4f5c9516fd5e",
+   "metadata": {},
+   "source": [
+    "# Using Dask on Ray with the Ensemble\n",
+    "\n",
+    "[Ray](https://docs.ray.io/en/latest/ray-overview/index.html) is an open-source unified framework for scaling AI and Python applications. Ray provides a scheduler for Dask ([dask_on_ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)) which allows you to build data analyses using Dask’s collections and execute the underlying tasks on a Ray cluster. We have found with TAPE that the Ray scheduler is often more performant than Dasks scheduler. Ray can be used on TAPE using the setup shown in the following example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ace065cd-5c75-4282-bca5-36ebe6868234",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import ray\n",
+    "from ray.util.dask import enable_dask_on_ray, disable_dask_on_ray\n",
+    "from tape import Ensemble\n",
+    "from tape.analysis.structurefunction2 import calc_sf2\n",
+    "\n",
+    "context = ray.init()\n",
+    "\n",
+    "# Use the Dask config helper to set the scheduler to ray_dask_get globally,\n",
+    "# without having to specify it on each compute call.\n",
+    "enable_dask_on_ray()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e6e9fa72-5811-4750-8ba8-bcd762eb80fa",
+   "metadata": {},
+   "source": [
+    "We import ray, and just need to invoke two commands. `context = ray.init()` starts a local ray cluster, and we can use this context object to retrieve the url of the ray dashboard, as shown below. `enable_dask_on_ray()` is a dask configuration function that sets up all Dask work to use the established Ray cluster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "04453edd-b22b-43cb-abc3-e61e0c958b04",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(context.dashboard_url)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9ad55cc-2203-4145-be1c-0af331805624",
+   "metadata": {},
+   "source": [
+    "For TAPE, the only needed change is to specify `client=False` when initializing an `Ensemble` object. Because the Dask configuration has been set, the Ensemble will automatically use the established Ray cluster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2cf7c608-fc46-455e-a7f7-04e8b64d52ec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ens=Ensemble(client=False) # Do not use a client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a1b904e-7bf6-4dd5-b1e6-0c6229a98739",
+   "metadata": {},
+   "source": [
+    "From here, we are free to work with TAPE as normal."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e0e3bf1a-f9b9-45be-9fea-390d25380794",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ens.from_dataset(\"s82_qso\")\n",
+    "ens._source = ens._source.repartition(npartitions=10)\n",
+    "ens.batch(calc_sf2, use_map=False)  # use_map is false as we repartition naively, splitting per-object sources across partitions"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "c5692d75",
+   "metadata": {},
+   "source": [
+    "## Timing Comparison\n",
+    "\n",
+    "As mentioned above, we generally see that Ray is more performant than Dask. Below is a simple timing comparison."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "f128cdbf",
+   "metadata": {},
+   "source": [
+    "### Ray Timing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dd960e10",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "\n",
+    "ens=Ensemble(client=False) # Do not use a client\n",
+    "ens.from_dataset(\"s82_qso\")\n",
+    "ens._source = ens._source.repartition(npartitions=10)\n",
+    "ens.batch(calc_sf2, use_map=False)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "228e5114",
+   "metadata": {},
+   "source": [
+    "### Dask Timing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "24a8f466",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "disable_dask_on_ray() # unsets the dask_on_ray configuration settings"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1552c2b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%time\n",
+    "\n",
+    "ens = Ensemble()\n",
+    "ens.from_dataset(\"s82_qso\")\n",
+    "ens._source = ens._source.repartition(npartitions=10)\n",
+    "ens.batch(calc_sf2, use_map=False)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.11"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/pyproject.toml b/pyproject.toml
@@ -48,7 +48,8 @@ dev = [
     "ipython", # Also used in building notebooks into Sphinx
     "matplotlib", # Used in sample notebook intro_notebook.ipynb
     "eztao==0.4.1", # Used in Structure Function example notebook
-    "bokeh", # Used to render dask client dashboard in Scaling to Large Data notebook 
+    "bokeh", # Used to render dask client dashboard in Scaling to Large Data notebook
+    "ray[default]" # Used in the Ray on Ensemble notebook 
 ]
 
 [project.urls]
-Original file line number
+Diff line change
@@ Expand Up / @@ -6,4 +6,5 @@ ipython @@
     jupytext
     jupyter
     matplotlib
-    eztao
+    eztao
+    ray