adds a dask_on_ray tutorial #225

Merged
merged 2 commits into from
Sep 11, 2023
3 changes: 2 additions & 1 deletion docs/requirements.txt
@@ -6,4 +6,5 @@ ipython
 jupytext
 jupyter
 matplotlib
-eztao
+eztao
+ray
1 change: 1 addition & 0 deletions docs/tutorials.rst
@@ -13,3 +13,4 @@ functionality.
 Binning Sources in the Ensemble <tutorials/binning_slowly_changing_sources>
 Structure Function Showcase <tutorials/structure_function_showcase>
 Loading Data into the Ensemble <tutorials/tape_datasets>
+Using Ray with the Ensemble <tutorials/using_ray_with_the_ensemble>
184 changes: 184 additions & 0 deletions docs/tutorials/using_ray_with_the_ensemble.ipynb
@@ -0,0 +1,184 @@
{
Member

@nevencaplar nevencaplar Sep 11, 2023

At least for me, trying to recreate this workflow, this does not produce anything. Maybe Kostya can say if the same happens for him?


Contributor

I see 127.0.0.1:8265

Collaborator Author

It also doesn't seem to produce anything in the readthedocs build
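
One way to make the dashboard cell robust to this, sketched under an assumption that is not confirmed in this thread: `context.dashboard_url` may come back empty (or as a bare `host:port` string such as the `127.0.0.1:8265` reported above) depending on whether the dashboard is running. `format_dashboard_url` is a hypothetical helper for illustration, not part of Ray or TAPE.

```python
def format_dashboard_url(dashboard_url):
    """Return a clickable URL, or a fallback message if no URL is available.

    Assumption: the input is the value of `context.dashboard_url`, which may
    be None, an empty string, or a bare "host:port" string.
    """
    if not dashboard_url:
        return "Ray dashboard is not available in this session"
    if dashboard_url.startswith(("http://", "https://")):
        return dashboard_url
    # Add a scheme so the printed value can be opened directly in a browser.
    return f"http://{dashboard_url}"


# Example with the value reported in this thread.
print(format_dashboard_url("127.0.0.1:8265"))
```

In the notebook this would be called as `print(format_dashboard_url(context.dashboard_url))`, so the cell prints a message even when no URL is returned.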

Member

@nevencaplar nevencaplar Sep 11, 2023

I believe it would be valuable to add a sentence or two about the use of the explicit option `use_map=False` here.


Collaborator Author

Would adding a comment inline be sufficient? We have a section on this in the docs here: https://tape.readthedocs.io/en/latest/tutorials/scaling_to_large_data.html#Data-Partitioning-and-Parallelization

Member

@nevencaplar nevencaplar Sep 12, 2023

An inline comment, and/or a mention that it is elaborated on elsewhere, would work.

Collaborator Author

Inline comment was added
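
For readers of this thread, the reason `use_map=False` matters after a naive repartition can be illustrated with a toy model in plain Python. This is only a sketch of the idea (per-partition mapping versus a global group-by), not TAPE's or Dask's implementation; the data values and all function names here are made up for illustration.

```python
from collections import defaultdict

# Toy "source table" of (object_id, flux) rows. Values are illustrative only.
rows = [(1, 10.0), (1, 12.0), (1, 14.0), (2, 5.0), (2, 7.0), (3, 1.0)]


def naive_partitions(rows, npartitions):
    """Split rows into equal-sized chunks, ignoring object boundaries."""
    size = -(-len(rows) // npartitions)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def mean_per_object(rows):
    """Group rows by object_id and average the flux of each group."""
    groups = defaultdict(list)
    for object_id, flux in rows:
        groups[object_id].append(flux)
    return {oid: sum(vals) / len(vals) for oid, vals in groups.items()}


def map_style(partitions):
    """Apply the function to each partition independently (map-like)."""
    return [mean_per_object(part) for part in partitions]


def groupby_style(partitions):
    """Gather every row first, then group by object (groupby-like)."""
    all_rows = [row for part in partitions for row in part]
    return mean_per_object(all_rows)


partitions = naive_partitions(rows, npartitions=3)
per_partition = map_style(partitions)
global_result = groupby_style(partitions)

print(per_partition)   # object 1 appears in two partitions with different means
print(global_result)   # one mean per object, computed over all of its rows
```

With three partitions, the rows of object 1 land in two different chunks, so the map-style pass produces two partial answers for it, while the groupby-style pass sees all of its rows at once. That is the situation the inline comment on `use_map=False` describes.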

Contributor

@hombit hombit Sep 11, 2023

Since this tutorial is short, we have room to add a comparison between Dask and Ray. Something like this:

One cell

%%time
ens = Ensemble()
ens.from_dataset("s82_qso")
ens._source = ens._source.repartition(npartitions=10)
ens.batch(calc_sf2, use_map=False)

Next cell

%%time
ens = Ensemble(client=False)
ens.from_dataset("s82_qso")
ens._source = ens._source.repartition(npartitions=10)
ens.batch(calc_sf2, use_map=False)
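
If the same comparison is ever needed outside a notebook, where the `%%time` cell magic is not available, a small timing helper could stand in for it. The sketch below uses only the standard library; `timed` and `workload` are hypothetical names, and `workload` is a placeholder for the `ens.batch(calc_sf2, use_map=False)` call so that the example runs without Ray or TAPE installed.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def workload(n):
    """Placeholder computation standing in for the real batch call."""
    return sum(i * i for i in range(n))


timings = {}

# In the real comparison, each block would build an Ensemble and run the batch.
with timed("ray", timings):
    total_ray = workload(100_000)

with timed("dask", timings):
    total_dask = workload(100_000)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

A single run like this is noisy, so repeating each block a few times and comparing the best or median time would give a fairer picture.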



"cells": [
{
"cell_type": "markdown",
"id": "bcb10f72-948f-475e-a856-4f5c9516fd5e",
"metadata": {},
"source": [
"# Using Dask on Ray with the Ensemble\n",
"\n",
"[Ray](https://docs.ray.io/en/latest/ray-overview/index.html) is an open-source unified framework for scaling AI and Python applications. Ray provides a scheduler for Dask ([dask_on_ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)) which allows you to build data analyses using Dask’s collections and execute the underlying tasks on a Ray cluster. We have found with TAPE that the Ray scheduler is often more performant than Dask’s scheduler. Ray can be used with TAPE using the setup shown in the following example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ace065cd-5c75-4282-bca5-36ebe6868234",
"metadata": {},
"outputs": [],
"source": [
"import ray\n",
"from ray.util.dask import enable_dask_on_ray, disable_dask_on_ray\n",
"from tape import Ensemble\n",
"from tape.analysis.structurefunction2 import calc_sf2\n",
"\n",
"context = ray.init()\n",
"\n",
"# Use the Dask config helper to set the scheduler to ray_dask_get globally,\n",
"# without having to specify it on each compute call.\n",
"enable_dask_on_ray()"
]
},
{
"cell_type": "markdown",
"id": "e6e9fa72-5811-4750-8ba8-bcd762eb80fa",
"metadata": {},
"source": [
"We import `ray` and need to invoke only two commands. `context = ray.init()` starts a local Ray cluster and returns a context object, which we can use to retrieve the URL of the Ray dashboard, as shown below. `enable_dask_on_ray()` is a Dask configuration helper that routes all Dask work to the established Ray cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "04453edd-b22b-43cb-abc3-e61e0c958b04",
"metadata": {},
"outputs": [],
"source": [
"print(context.dashboard_url)"
]
},
{
"cell_type": "markdown",
"id": "f9ad55cc-2203-4145-be1c-0af331805624",
"metadata": {},
"source": [
"For TAPE, the only needed change is to specify `client=False` when initializing an `Ensemble` object. Because the Dask configuration has been set, the Ensemble will automatically use the established Ray cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2cf7c608-fc46-455e-a7f7-04e8b64d52ec",
"metadata": {},
"outputs": [],
"source": [
"ens = Ensemble(client=False)  # Do not use a client"
]
},
{
"cell_type": "markdown",
"id": "6a1b904e-7bf6-4dd5-b1e6-0c6229a98739",
"metadata": {},
"source": [
"From here, we are free to work with TAPE as normal."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0e3bf1a-f9b9-45be-9fea-390d25380794",
"metadata": {},
"outputs": [],
"source": [
"ens.from_dataset(\"s82_qso\")\n",
"ens._source = ens._source.repartition(npartitions=10)\n",
"ens.batch(calc_sf2, use_map=False)  # use_map=False because the naive repartition above can split an object's sources across partitions"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c5692d75",
"metadata": {},
"source": [
"## Timing Comparison\n",
"\n",
"As mentioned above, we generally see that Ray is more performant than Dask. Below is a simple timing comparison."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f128cdbf",
"metadata": {},
"source": [
"### Ray Timing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd960e10",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"ens = Ensemble(client=False)  # Do not use a client\n",
"ens.from_dataset(\"s82_qso\")\n",
"ens._source = ens._source.repartition(npartitions=10)\n",
"ens.batch(calc_sf2, use_map=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "228e5114",
"metadata": {},
"source": [
"### Dask Timing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24a8f466",
"metadata": {},
"outputs": [],
"source": [
"disable_dask_on_ray() # unsets the dask_on_ray configuration settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1552c2b8",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"ens = Ensemble()\n",
"ens.from_dataset(\"s82_qso\")\n",
"ens._source = ens._source.repartition(npartitions=10)\n",
"ens.batch(calc_sf2, use_map=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -48,7 +48,8 @@ dev = [
 "ipython", # Also used in building notebooks into Sphinx
 "matplotlib", # Used in sample notebook intro_notebook.ipynb
 "eztao==0.4.1", # Used in Structure Function example notebook
-"bokeh", # Used to render dask client dashboard in Scaling to Large Data notebook
+"bokeh", # Used to render dask client dashboard in Scaling to Large Data notebook
+"ray[default]" # Used in the Ray on Ensemble notebook
 ]

 [project.urls]