Commit ab1dc5c: Merge branch 'main' into single_lc
wilsonbb committed Apr 5, 2024
2 parents ab10382 + d72d1e7
Showing 12 changed files with 138 additions and 93 deletions.
19 changes: 4 additions & 15 deletions README.md
@@ -29,23 +29,12 @@ TAPE is available to install with pip, using the "lf-tape" package name:
pip install lf-tape
```

## Getting started - for developers
## Contributing

Download the code and install dependencies in a conda environment, then run the unit tests to verify that the packages are properly installed.
[![GitHub issue custom search in repo](https://img.shields.io/github/issues-search/lincc-frameworks/tape?color=purple&label=Good%20first%20issues&query=is%3Aopen%20label%3A%22good%20first%20issue%22)](https://github.com/lincc-frameworks/tape/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)

```
$ conda create -n seriesenv python=3.11
$ conda activate seriesenv
$ git clone https://github.com/lincc-frameworks/tape
$ cd tape/
$ pip install .
$ pip install .[dev] # it may be necessary to use `pip install .'[dev]'` (with single quotes) depending on your machine.
$ pip install pytest
$ pytest
```
See the [Contribution Guide](https://tape.readthedocs.io/en/latest/gettingstarted/contributing.html) for complete installation instructions and contribution best practices.

## Acknowledgements

LINCC Frameworks is supported by Schmidt Futures, a philanthropic initiative founded by Eric and Wendy Schmidt, as part of the Virtual Institute of Astrophysics (VIA).
This project is supported by Schmidt Sciences.
1 change: 1 addition & 0 deletions docs/examples.rst
@@ -4,5 +4,6 @@ Examples
Some examples of how to use the TAPE package are provided in these notebooks.

.. toctree::
   :maxdepth: 1

   Use Lomb–Scargle Periodograms for SDSS Stripe 82 RR Lyrae <examples/rrlyr-period>
4 changes: 3 additions & 1 deletion docs/gettingstarted.rst
@@ -6,6 +6,8 @@ we encourage you to open an issue on the
`TAPE github repository <https://github.com/lincc-frameworks/tape/issues>`_.

.. toctree::
   :maxdepth: 1

   Installing TAPE <gettingstarted/installation>
   Quickstart Guide <gettingstarted/quickstart>
   Contribution Guide <gettingstarted/contributing>
20 changes: 20 additions & 0 deletions docs/gettingstarted/contributing.rst
@@ -0,0 +1,20 @@
Contribution Guide
==================

Dev Guide - Getting Started
---------------------------

Download the code and install dependencies in a conda environment, then run the unit tests to verify that the packages are properly installed.

.. code-block:: bash

   conda create -n seriesenv python=3.11
   conda activate seriesenv
   git clone https://github.com/lincc-frameworks/tape
   cd tape/
   pip install .
   pip install .[dev]  # it may be necessary to use `pip install .'[dev]'` (with single quotes) depending on your machine.
   pip install pytest
   pytest
5 changes: 3 additions & 2 deletions docs/gettingstarted/quickstart.ipynb
@@ -21,7 +21,8 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install lf-tape"
"\n",
"%pip install lf-tape --quiet\n"
]
},
{
@@ -210,7 +211,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
"version": "3.10.11"
},
"vscode": {
"interpreter": {
6 changes: 3 additions & 3 deletions docs/index.rst
@@ -11,8 +11,7 @@ TAPE offers a complete ecosystem for loading, filtering, and analyzing
timeseries data. TAPE is built to enable users to run provided and user-defined
analysis functions at scale in a parallelized and/or distributed manner.

Over the survey lifetime of the `LSST <https://www.lsst.org/about>`_, on order
of ~billions of objects will have multiband lightcurves available, and TAPE has
Over the survey lifetime of the `LSST <https://www.lsst.org/about>`_, billions of objects will have multiband lightcurves available, and TAPE has
been built as a framework with the goal of making analysis of LSST-scale
data accessible.

@@ -23,13 +22,14 @@ How to Use This Guide
==============================================

Begin with the `Getting Started <https://tape.readthedocs.io/en/latest/gettingstarted.html>`_ guide to learn the basics of installation and
walk through a simple example of using TAPE.

The `Tutorials <https://tape.readthedocs.io/en/latest/tutorials.html>`_ section showcases the fundamental features of TAPE.

API-level information about TAPE is viewable in the
`API Reference <https://tape.readthedocs.io/en/latest/autoapi/index.html>`_ section.

Learn more about contributing to this repository in our :doc:`Contribution Guide <gettingstarted/contributing>`.


.. toctree::
1 change: 1 addition & 0 deletions docs/tutorials.rst
@@ -5,6 +5,7 @@ These pages contain a set of tutorial notebooks for working through core and more
functionality.

.. toctree::
:maxdepth: 1

Loading Data into TAPE <tutorials/tape_datasets>
Working with the TAPE Ensemble object <tutorials/working_with_the_ensemble>
1 change: 1 addition & 0 deletions docs/tutorials/batch_showcase.ipynb
@@ -45,6 +45,7 @@
"for i in range(10, 110):\n",
" obj_ids.append(np.array([i] * 1250))\n",
" mjds.append(np.arange(0.0, 1250.0, 1.0))\n",
"\n",
"obj_ids = np.concatenate(obj_ids)\n",
"mjds = np.concatenate(mjds)\n",
"\n",
4 changes: 2 additions & 2 deletions docs/tutorials/common_data_operations.ipynb
@@ -110,7 +110,7 @@
"Often, you'll want to peek at your data even though the full-size is too large for memory.\n",
"\n",
"> **_Note:_**\n",
"By default this only looks at the first partition of data, so any operations that remove all data from the first partition will produce an empty head result. Specify `npartitions=-1` to grab from all partitions.\n"
"Some partitions may be empty, and `head` will have to traverse these empty partitions to find enough rows for your result. For an empty table with many (O(100k)) partitions, this traversal can be costly even though the final result is empty. "
]
},
{
@@ -119,7 +119,7 @@
"metadata": {},
"outputs": [],
"source": [
"ens.source.head(5, npartitions=-1) # grabs the first 5 rows\n",
"ens.source.head(5) # grabs the first 5 rows\n",
"\n",
"# can also use tail to grab the last 5 rows"
]
71 changes: 1 addition & 70 deletions docs/tutorials/using_ray_with_the_ensemble.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Using Dask on Ray with the Ensemble\n",
"\n",
"[Ray](https://docs.ray.io/en/latest/ray-overview/index.html) is an open-source unified framework for scaling AI and Python applications. Ray provides a scheduler for Dask ([dask_on_ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)) which allows you to build data analyses using Dask’s collections and execute the underlying tasks on a Ray cluster. We have found with TAPE that the Ray scheduler is often more performant than Dask's scheduler. Ray can be used on TAPE using the setup shown in the following example."
"[Ray](https://docs.ray.io/en/latest/ray-overview/index.html) is an open-source unified framework for scaling AI and Python applications. Ray provides a scheduler for Dask ([dask_on_ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)) which allows you to build data analyses using Dask’s collections and execute the underlying tasks on a Ray cluster. Ray can be used on TAPE using the setup shown in the following example."
]
},
{
@@ -86,75 +86,6 @@
" calc_sf2, use_map=False\n",
") # use_map is false as we repartition naively, splitting per-object sources across partitions"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c5692d75",
"metadata": {},
"source": [
"## Timing Comparison\n",
"\n",
"As mentioned above, we generally see that Ray is more performant than Dask. Below is a simple timing comparison."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f128cdbf",
"metadata": {},
"source": [
"### Ray Timing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd960e10",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"ens = Ensemble(client=False) # Do not use a client\n",
"ens.from_dataset(\"s82_qso\", sorted=True)\n",
"ens.source = ens.source.repartition(npartitions=10)\n",
"ens.batch(calc_sf2, use_map=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "228e5114",
"metadata": {},
"source": [
"### Dask Timing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24a8f466",
"metadata": {},
"outputs": [],
"source": [
"disable_dask_on_ray() # unsets the dask_on_ray configuration settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1552c2b8",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"ens = Ensemble()\n",
"ens.from_dataset(\"s82_qso\", sorted=True)\n",
"ens.source = ens.source.repartition(npartitions=10)\n",
"ens.batch(calc_sf2, use_map=False).compute()"
]
}
],
"metadata": {
40 changes: 40 additions & 0 deletions src/tape/ensemble_frame.py
@@ -845,6 +845,46 @@ def repartition(
        )
        return self._propagate_metadata(result)

    def head(self, n=5, compute=True, npartitions=None):
        """Returns `n` rows of data for previewing purposes.

        Parameters
        ----------
        n : int, optional
            The number of desired rows. Default is 5.
        compute : bool, optional
            Whether to compute the result immediately. Default is True.
        npartitions : int, optional
            `npartitions` is not supported and is ignored if provided. Instead
            all partitions may be used.

        Returns
        -------
        `pandas.DataFrame`
            A pandas DataFrame with up to `n` rows of data.
        """
        if npartitions is not None:
            warnings.warn(
                "The 'npartitions' parameter is not supported for TAPE dataframes. All partitions may be used."
            )

        if not compute:
            # Just use the Dask head method
            return super().head(n, compute=False)

        if n <= 0:
            return super().head(0)

        # Iterate over the partitions until we have enough rows
        dfs = []
        remaining_rows = n
        for partition in self.partitions:
            if remaining_rows == 0:
                break
            # Note that partition is itself a _Frame object, so we need to compute to avoid infinite recursion
            partition_head = partition.compute().head(remaining_rows)
            dfs.append(partition_head)
            remaining_rows -= len(partition_head)

        return pd.concat(dfs)


class EnsembleSeries(_Frame, dd.Series):
    """A barebones extension of a Dask Series for Ensemble data."""
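The partition-iteration logic of the new `head` method can be illustrated in isolation. Below is a minimal sketch using plain pandas DataFrames as stand-ins for already-computed partitions (the function name `head_across_partitions` is hypothetical, for illustration only):

```python
import pandas as pd

def head_across_partitions(partitions, n):
    """Collect up to `n` rows by walking partitions in order,
    skipping over any that turn out to be empty."""
    if n <= 0:
        return pd.DataFrame()
    dfs = []
    remaining_rows = n
    for partition in partitions:
        if remaining_rows == 0:
            break
        partition_head = partition.head(remaining_rows)
        dfs.append(partition_head)
        remaining_rows -= len(partition_head)
    return pd.concat(dfs)

# An empty first partition is simply skipped rather than producing an
# empty result, which is the behavior the new TAPE `head` is after.
parts = [
    pd.DataFrame(columns=["flux"]),           # empty partition
    pd.DataFrame({"flux": [1.0, 2.0, 3.0]}),
    pd.DataFrame({"flux": [4.0, 5.0]}),
]
print(len(head_across_partitions(parts, 4)))  # 4
```

Asking for more rows than exist just returns every row, matching the "sane behavior" case exercised in the tests below.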
59 changes: 59 additions & 0 deletions tests/tape_tests/test_ensemble_frame.py
@@ -1,5 +1,6 @@
""" Test EnsembleFrame (inherited from Dask.DataFrame) creation and manipulations. """

from math import floor
import numpy as np
import pandas as pd
from tape import (
@@ -470,3 +471,61 @@ def test_partition_slicing(parquet_ensemble_with_divisions):

    assert ens.source.npartitions == 2  # should return exactly 2 partitions
    assert len(ens.object) < prior_src_len  # should affect objects


@pytest.mark.parametrize(
    "data_fixture",
    [
        "parquet_ensemble",
        "parquet_ensemble_with_divisions",
    ],
)
def test_head(data_fixture, request):
    """
    Tests that head returns the correct number of rows.
    """
    ens = request.getfixturevalue(data_fixture)

    # Test with repartitioning the source frame
    frame = ens.source
    frame = frame.repartition(npartitions=10)

    assert frame.npartitions == 10

    # Check that a warning is raised when npartitions are requested.
    with pytest.warns(UserWarning):
        frame.head(5, npartitions=5)

    # Test inputs that should return an empty frame
    assert len(frame.head(-100)) == 0
    assert len(frame.head(0)) == 0
    assert len(frame.head(-1)) == 0

    assert len(frame.head(100, compute=False).compute()) == 100

    one_res = frame.head(1)
    assert len(one_res) == 1
    assert isinstance(one_res, TapeFrame)
    assert set(one_res.columns) == set(frame.columns)

    def_result = frame.head()
    assert len(def_result) == 5
    assert isinstance(def_result, TapeFrame)
    assert set(def_result.columns) == set(frame.columns)

    def_result = frame.head(24)
    assert len(def_result) == 24
    assert isinstance(def_result, TapeFrame)
    assert set(def_result.columns) == set(frame.columns)

    # Test that we have sane behavior even when the number of rows requested
    # is larger than the number of rows in the frame.
    assert len(frame.head(2 * len(frame))) == len(frame)

    # Choose a value that will be guaranteed to hit every partition for this data.
    # Note that with parquet_ensemble_with_divisions some of the partitions are
    # empty, testing that as well.
    rows = floor(len(frame.compute()) * 0.98)
    result = frame.head(rows)
    assert len(result) == rows
    assert isinstance(result, TapeFrame)
    assert set(result.columns) == set(frame.columns)
