Commit

Merge 462314b into b698d38
dougbrn authored Feb 16, 2024
2 parents b698d38 + 462314b commit 3f6450e
Showing 8 changed files with 207 additions and 291 deletions.
11 changes: 1 addition & 10 deletions docs/tutorials/binning_slowly_changing_sources.ipynb
@@ -263,15 +263,6 @@
"ax.set_xlabel(\"Time (MJD)\")\n",
"ax.set_ylabel(\"Source Count\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ens.client.close() # Tear down the ensemble client"
]
}
],
"metadata": {
@@ -290,7 +281,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
"version": "3.10.6"
},
"vscode": {
"interpreter": {
55 changes: 21 additions & 34 deletions docs/tutorials/scaling_to_large_data.ipynb
@@ -28,9 +28,7 @@
"source": [
"### The `Dask` Client and Scheduler\n",
"\n",
"An important aspect of `Dask` to understand for optimizing it's performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entrypoint for setting up a distributed system. The Distributed Client enables asynchronous computation, where Dask's `compute` and `persist` methods are able to run in the background and persist in memory while we continue doing other work.\n",
"\n",
"In the TAPE `Ensemble`, by default a Distributed Client is spun up in the background, which can be accessed using `Ensemble.client_info()`:\n",
"An important aspect of `Dask` to understand for optimizing it's performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entrypoint for setting up a distributed system. By default, the Tape `Ensemble` operates without a Distributed Client. In the TAPE `Ensemble`, we can have a Distributed Client spun up in the background by indicating `client=True` when initializing the Ensemble, which can be accessed using `Ensemble.client_info()`:\n",
"\n"
]
},
@@ -42,7 +40,7 @@
"source": [
"from tape import Ensemble\n",
"\n",
"ens = Ensemble()\n",
"ens = Ensemble(client=True)\n",
"\n",
"ens.client_info()"
]
@@ -72,7 +70,7 @@
"metadata": {},
"outputs": [],
"source": [
"ens = Ensemble(n_workers=3, threads_per_worker=2)\n",
"ens = Ensemble(client=True, n_workers=3, threads_per_worker=2)\n",
"\n",
"ens.client_info()"
]
@@ -141,23 +139,6 @@
"This may be preferable for those who want full control of the `Dask` client API, which may be beneficial when working on external machines/services or when a more complex setup is desired."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, there may be instances where you would prefer to not use the Distributed Client, particularly when working with smaller amounts of data. In these instances, we allow users to disable the creation of a Distributed Client by passing `client=False`, as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ens=Ensemble(client=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -181,13 +162,15 @@
"ens = Ensemble(client=client)\n",
"\n",
"# Read in data from a parquet file\n",
"ens.from_parquet(\"../../tests/tape_tests/data/source/test_source.parquet\",\n",
" id_col='ps1_objid',\n",
" time_col='midPointTai',\n",
" flux_col='psFlux',\n",
" err_col='psFluxErr',\n",
" band_col='filterName',\n",
" partition_size='5KB')\n",
"ens.from_parquet(\n",
" \"../../tests/tape_tests/data/source/test_source.parquet\",\n",
" id_col=\"ps1_objid\",\n",
" time_col=\"midPointTai\",\n",
" flux_col=\"psFlux\",\n",
" err_col=\"psFluxErr\",\n",
" band_col=\"filterName\",\n",
" partition_size=\"5KB\",\n",
")\n",
"\n",
"ens.info()"
]
@@ -211,8 +194,12 @@
"from tape.analysis.stetsonj import calc_stetson_J\n",
"import numpy as np\n",
"\n",
"mapres = ens.batch(calc_stetson_J, use_map=True) # will not know to look at multiple partitions to get lightcurve data\n",
"groupres = ens.batch(calc_stetson_J, use_map=False) # will know to look at multiple partitions, with shuffling costs\n",
"mapres = ens.batch(\n",
" calc_stetson_J, use_map=True\n",
") # will not know to look at multiple partitions to get lightcurve data\n",
"groupres = ens.batch(\n",
" calc_stetson_J, use_map=False\n",
") # will know to look at multiple partitions, with shuffling costs\n",
"\n",
"print(\"number of lightcurve results in mapres: \", len(mapres))\n",
"print(\"number of lightcurve results in groupres: \", len(groupres))\n",
@@ -249,7 +236,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "py310",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -263,11 +250,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
"version": "3.10.11"
},
"vscode": {
"interpreter": {
"hash": "08968836a6367873274ed1d5e98a07391f42fc3a62bd5aba54afbd7b11ba8673"
"hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
}
}
},
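Note on the `use_map` comments in the `ens.batch` hunk: the following is a rough, dependency-free illustration of why a per-partition map can return more results than a grouped computation when one lightcurve spans several partitions. The toy rows, the `mean_flux_by_object` helper, and the plain-Python "partitions" are assumptions made for illustration. The sketch uses neither Dask nor TAPE, and a mean stands in for `calc_stetson_J`.

```python
from collections import defaultdict

# Toy partitions of (object_id, flux) rows.
# Object 2's lightcurve is split across both partitions.
partitions = [
    [(1, 10.0), (1, 12.0), (2, 5.0)],
    [(2, 7.0), (3, 1.0), (3, 3.0)],
]


def mean_flux_by_object(rows):
    """Compute one mean flux per object id found in `rows`."""
    grouped = defaultdict(list)
    for object_id, flux in rows:
        grouped[object_id].append(flux)
    return {oid: sum(vals) / len(vals) for oid, vals in grouped.items()}


# Map-style: each partition is processed on its own, so object 2
# produces one partial result per partition.
map_results = [
    item for part in partitions for item in mean_flux_by_object(part).items()
]

# Groupby-style: rows are brought together first (the "shuffle"),
# so each object produces exactly one result.
all_rows = [row for part in partitions for row in part]
group_results = list(mean_flux_by_object(all_rows).items())

print("map-style results:    ", len(map_results))    # 4
print("groupby-style results:", len(group_results))  # 3
```

In this toy case the map-style path is cheaper because no rows move between partitions, but it is only correct when every lightcurve lives entirely inside a single partition.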