Switch the Ensemble to default to not using a distributed client #379

Merged
merged 4 commits into from
Feb 16, 2024
11 changes: 1 addition & 10 deletions docs/tutorials/binning_slowly_changing_sources.ipynb
@@ -263,15 +263,6 @@
"ax.set_xlabel(\"Time (MJD)\")\n",
"ax.set_ylabel(\"Source Count\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ens.client.close() # Tear down the ensemble client"
]
}
],
"metadata": {
@@ -290,7 +281,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
55 changes: 21 additions & 34 deletions docs/tutorials/scaling_to_large_data.ipynb
@@ -28,9 +28,7 @@
"source": [
"### The `Dask` Client and Scheduler\n",
"\n",
"An important aspect of `Dask` to understand for optimizing its performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entrypoint for setting up a distributed system. The Distributed Client enables asynchronous computation, where Dask's `compute` and `persist` methods are able to run in the background and persist results in memory while we continue doing other work.\n",
"\n",
"In the TAPE `Ensemble`, by default a Distributed Client is spun up in the background, which can be accessed using `Ensemble.client_info()`:\n",
"An important aspect of `Dask` to understand for optimizing its performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entrypoint for setting up a distributed system. By default, the TAPE `Ensemble` operates without a Distributed Client. A Distributed Client can be spun up in the background by passing `client=True` when initializing the `Ensemble`, and the resulting client can be inspected using `Ensemble.client_info()`:\n",
"\n"
]
},
@@ -42,7 +40,7 @@
"source": [
"from tape import Ensemble\n",
"\n",
"ens = Ensemble()\n",
"ens = Ensemble(client=True)\n",
"\n",
"ens.client_info()"
]
@@ -72,7 +70,7 @@
"metadata": {},
"outputs": [],
"source": [
"ens = Ensemble(n_workers=3, threads_per_worker=2)\n",
"ens = Ensemble(client=True, n_workers=3, threads_per_worker=2)\n",
"\n",
"ens.client_info()"
]
@@ -141,23 +139,6 @@
"This may be preferable for those who want full control over the `Dask` client API, which can be beneficial when working on external machines or services, or when a more complex setup is desired."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, there may be instances where you would prefer to not use the Distributed Client, particularly when working with smaller amounts of data. In these instances, we allow users to disable the creation of a Distributed Client by passing `client=False`, as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ens = Ensemble(client=False)"
]
},
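The two removed cells above reflect that `client=False` is now the default rather than an opt-out. Taken together with the other changes in this PR, the `client` argument effectively has three modes: `False` (no client, the new default), `True` (spin one up in the background), or an existing client object. A minimal sketch of that dispatch, with hypothetical names (`MiniEnsemble`, `FakeClient`) rather than TAPE's actual code:

```python
class FakeClient:
    """Stand-in for dask.distributed.Client (illustration only)."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.closed = False

    def close(self):
        self.closed = True


class MiniEnsemble:
    """Hypothetical sketch of the three `client` modes; not TAPE's code."""
    def __init__(self, client=False, **client_kwargs):
        if client is True:
            # client=True: spin up a new client in the background
            self.client = FakeClient(**client_kwargs)
        elif client is False:
            # client=False (the new default): no Distributed Client at all
            self.client = None
        else:
            # An existing client object: use it as-is; the caller
            # owns its lifecycle and is responsible for closing it.
            self.client = client
```

Keeping the user-supplied client untouched is what makes the "bring your own client" workflow shown below possible.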
{
"attachments": {},
"cell_type": "markdown",
@@ -181,13 +162,15 @@
"ens = Ensemble(client=client)\n",
"\n",
"# Read in data from a parquet file\n",
"ens.from_parquet(\"../../tests/tape_tests/data/source/test_source.parquet\",\n",
" id_col='ps1_objid',\n",
" time_col='midPointTai',\n",
" flux_col='psFlux',\n",
" err_col='psFluxErr',\n",
" band_col='filterName',\n",
" partition_size='5KB')\n",
"ens.from_parquet(\n",
" \"../../tests/tape_tests/data/source/test_source.parquet\",\n",
" id_col=\"ps1_objid\",\n",
" time_col=\"midPointTai\",\n",
" flux_col=\"psFlux\",\n",
" err_col=\"psFluxErr\",\n",
" band_col=\"filterName\",\n",
" partition_size=\"5KB\",\n",
")\n",
"\n",
"ens.info()"
]
@@ -211,8 +194,12 @@
"from tape.analysis.stetsonj import calc_stetson_J\n",
"import numpy as np\n",
"\n",
"mapres = ens.batch(calc_stetson_J, use_map=True) # will not know to look at multiple partitions to get lightcurve data\n",
"groupres = ens.batch(calc_stetson_J, use_map=False) # will know to look at multiple partitions, with shuffling costs\n",
"mapres = ens.batch(\n",
" calc_stetson_J, use_map=True\n",
") # will not know to look at multiple partitions to get lightcurve data\n",
"groupres = ens.batch(\n",
" calc_stetson_J, use_map=False\n",
") # will know to look at multiple partitions, with shuffling costs\n",
"\n",
"print(\"number of lightcurve results in mapres: \", len(mapres))\n",
"print(\"number of lightcurve results in groupres: \", len(groupres))\n",
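The `use_map` tradeoff in the cell above can be illustrated without Dask at all: mapping over partitions independently is cheap but splits any lightcurve that spans a partition boundary, while a global group-by sees each lightcurve whole at the cost of shuffling data between partitions. The dictionaries and counts below are a hypothetical toy model, not TAPE's actual data layout:

```python
# Two partitions of per-object flux lists; "obj2" spans the boundary.
partitions = [
    {"obj1": [1.0, 1.2], "obj2": [0.5]},
    {"obj2": [0.6, 0.4], "obj3": [2.0]},
]

# use_map=True analogue: process each partition independently,
# so a boundary-spanning object yields two partial results.
map_results = {}
for part in partitions:
    for obj, fluxes in part.items():
        map_results.setdefault(obj, []).append(len(fluxes))

# use_map=False analogue: shuffle all observations into global
# per-object groups first, then compute one result per object.
grouped = {}
for part in partitions:
    for obj, fluxes in part.items():
        grouped.setdefault(obj, []).extend(fluxes)
group_results = {obj: len(fluxes) for obj, fluxes in grouped.items()}
```

Counting results in each reproduces the mismatch the tutorial prints: the map-style pass produces four partial lightcurve results (obj2 appears twice, incomplete both times), while the group-by pass produces three complete ones.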
@@ -249,7 +236,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "py310",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -263,11 +250,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "08968836a6367873274ed1d5e98a07391f42fc3a62bd5aba54afbd7b11ba8673"
"hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
}
}
},