From 026ff22b85c45bcf2d7eb3a223b12e1898b51f13 Mon Sep 17 00:00:00 2001
From: Doug Branton
Date: Fri, 15 Mar 2024 09:33:33 -0700
Subject: [PATCH] remove out of date scaling nb

---
 docs/tutorials.rst | 1 -
 docs/tutorials/scaling_to_large_data.ipynb | 263 ---------------------
 2 files changed, 264 deletions(-)
 delete mode 100644 docs/tutorials/scaling_to_large_data.ipynb

diff --git a/docs/tutorials.rst b/docs/tutorials.rst
index 60106b2b..4f0ffcac 100644
--- a/docs/tutorials.rst
+++ b/docs/tutorials.rst
@@ -11,7 +11,6 @@ functionality.
 Working with HiPSCat and LSDB data
 Working with the TAPE Timeseries object
 Common Data Operations with TAPE
- Scaling to Large Data Volume
 Working with Structure Function
 Using Ray with the Ensemble
 Showcase - Binning Sources in the Ensemble
diff --git a/docs/tutorials/scaling_to_large_data.ipynb b/docs/tutorials/scaling_to_large_data.ipynb
deleted file mode 100644
index 09fbd99a..00000000
--- a/docs/tutorials/scaling_to_large_data.ipynb
+++ /dev/null
@@ -1,263 +0,0 @@
-{
- "cells": [
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Scaling to Large Data Volume\n",
- "\n",
- "TAPE is built with survey-scale time-domain data in mind. This notebook offers information on how one would use TAPE to navigate large datasets.\n",
- "\n"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Useful `Dask` features of the `Ensemble`\n",
- "\n",
- "Much of the scalability of TAPE comes from its `Dask` roots. `Dask` has a whole host of features for organizing, visualizing, and executing large-scale jobs. One such example is the \"Lazy\" execution discussed in the [\"Working with the TAPE Ensemble object\"](https://tape.readthedocs.io/en/latest/tutorials/working_with_the_ensemble.html) notebook, where specific lines of code are not automatically run at execution time, and instead are added to a scheduler and await a signal to run the calculation. This gives the user greater control over when they'd like the bulk of the computation to be run."
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### The `Dask` Client and Scheduler\n",
- "\n",
- "An important aspect of `Dask` to understand for optimizing its performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entrypoint for setting up a distributed system. By default, the TAPE `Ensemble` operates without a Distributed Client. However, we can have a Distributed Client spun up in the background by indicating `client=True` when initializing the `Ensemble`, and it can then be inspected using `Ensemble.client_info()`:\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from tape import Ensemble\n",
- "\n",
- "ens = Ensemble(client=True)\n",
- "\n",
- "ens.client_info()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "ens.client.close() # tear down the client when we're done with it"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "By calling `Ensemble.client_info()`, we get an interactive output that gives us information about the Client, the Cluster setup, and available workers. 
In addition, an address is provided to access the `Dask` [Dashboard](https://docs.dask.org/en/stable/dashboard.html). This is an incredibly useful interactive tool for monitoring your in-progress workflows and diagnosing potential issues or slowdowns. It's highly encouraged to check out the `Dask` documentation on this, but also to open this up and play around with it yourself!\n", - "\n", - "In many cases, the client that is built by default may not be ideal for your workflow. In these instances, we have a few ways of building a tailored client. The first is to pass client arguments along to the `Ensemble` itself." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ens = Ensemble(client=True, n_workers=3, threads_per_worker=2)\n", - "\n", - "ens.client_info()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this case, we are only interested in giving the client some clear values in the number of workers, and the number of threads per worker. We end up with a fairly lightweight cluster built underneath." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ens.client.close()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Another option is to simply create a client external to the `Ensemble` and pass it in." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from dask.distributed import Client\n", - "\n", - "client = Client()\n", - "\n", - "ens = Ensemble(client=client)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "client" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ens.client_info() # We see that the two are equivalent" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This may be preferable for those who want full control of the `Dask` client API, which may be beneficial when working on external machines/services or when a more complex setup is desired." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data Partitioning and Parallelization\n", - "\n", - "Partitioning is the subdivision of a singular (large) dataset into many (small) datasets. It is a crucial concept to understand when working at scale, as almost always the input data will be too large to exist all in memory at once. Partitioning allows data to be read into memory in manageable chunks. \n", - "\n", - "Partitioning is crucial for the performance of TAPE parallelization as well, as in general workers are allocated to partitions, so having more workers than partitions may incur more idle time on workers than you'd like.\n", - "\n", - "Ideally, your input data should be pre-partitioned as any repartitioning of data will be expensive computationally. However, TAPE is able to repartition input data if required." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ens = Ensemble(client=client)\n", - "\n", - "# Read in data from a parquet file\n", - "ens.from_parquet(\n", - " \"../../tests/tape_tests/data/source/test_source.parquet\",\n", - " id_col=\"ps1_objid\",\n", - " time_col=\"midPointTai\",\n", - " flux_col=\"psFlux\",\n", - " err_col=\"psFluxErr\",\n", - " band_col=\"filterName\",\n", - " partition_size=\"5KB\",\n", - ")\n", - "\n", - "ens.info()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here we read in our example dataset, but this time specify the `partition_size` keyword, setting it to '5KB'. This lets the ensemble know that the input dataset, which consists of a single partition, should be repartitioned into 5KB sized chunks. This results in a new dataframe with 32 partitions. Alternatively, we could have supplied `n_partitions` to specify a target number of output partitions. \n", - "\n", - "However, **be careful**, as in this instance we've likely split data from each timeseries into multiple partitions. The consequence of this is that operations on a single lightcurve now need to access multiple partitions, which adds significant data shuffling costs to your workflow. If your timeseries is split across multiple partitions, you'll also want to explicitly set `use_map=False` in `Ensemble.batch` calls, as otherwise it will use `Dask` [map_partitions](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) which will only operate on the lightcurve data it sees in each partition!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from tape.analysis.stetsonj import calc_stetson_J\n", - "import numpy as np\n", - "\n", - "mapres = ens.batch(\n", - " calc_stetson_J, use_map=True\n", - ") # will not know to look at multiple partitions to get lightcurve data\n", - "groupres = ens.batch(\n", - " calc_stetson_J, use_map=False\n", - ") # will know to look at multiple partitions, with shuffling costs\n", - "\n", - "print(\"number of lightcurve results in mapres: \", len(mapres))\n", - "print(\"number of lightcurve results in groupres: \", len(groupres))\n", - "print(\"True number of lightcurves in the dataset:\", len(np.unique(ens.source.index)))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As you can see, map_partitions finds chunks of the lightcurve in each partition separately and computes them separately, while the alternative (a groupby) does know to look in multiple partitions, with data shuffling costs. The reason why map_partitions is available as an option is that it is a performant choice when the lightcurves are optimally partitioned. This is part of why it's really ideal that this partitioning is already done before you've touched the data, as there's a better chance that the input source has stored their lightcurves with this requirement met." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ens.client.close()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### No one size fits all\n", - "\n", - "With regards to `Dask`, the client setup, and partitioning there is no one solution that will apply to all use cases. 
We have largely provided the tools to tailor these as needed, but it does require some learning about what options are available and how they apply to your particular use case. We've gone over the general points above, but if you're interested in further learning, `Dask` has a [best practices](https://docs.dask.org/en/stable/best-practices.html) page that discusses some dos and don'ts, which we recommend."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.11"
- },
- "vscode": {
- "interpreter": {
- "hash": "83afbb17b435d9bf8b0d0042367da76f26510da1c5781f0ff6e6c518eab621ec"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}