Commit ab1dc5c: Merge branch 'main' into single_lc
wilsonbb committed Apr 5, 2024
2 parents ab10382 + d72d1e7
Showing 12 changed files with 138 additions and 93 deletions.
19 changes: 4 additions & 15 deletions README.md
@@ -29,23 +29,12 @@ TAPE is available to install with pip, using the "lf-tape" package name:
pip install lf-tape
```

## Getting started - for developers
## Contributing

Download the code and install dependencies in a conda environment, then run the unit tests to verify that the packages are properly installed.
[![GitHub issue custom search in repo](https://img.shields.io/github/issues-search/lincc-frameworks/tape?color=purple&label=Good%20first%20issues&query=is%3Aopen%20label%3A%22good%20first%20issue%22)](https://github.com/lincc-frameworks/tape/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)

```
$ conda create -n seriesenv python=3.11
$ conda activate seriesenv
$ git clone https://github.com/lincc-frameworks/tape
$ cd tape/
$ pip install .
$ pip install .[dev] # it may be necessary to use `pip install .'[dev]'` (with single quotes) depending on your machine.
$ pip install pytest
$ pytest
```
See the [Contribution Guide](https://tape.readthedocs.io/en/latest/gettingstarted/contributing.html) for complete installation instructions and contribution best practices.

## Acknowledgements

LINCC Frameworks is supported by Schmidt Futures, a philanthropic initiative founded by Eric and Wendy Schmidt, as part of the Virtual Institute of Astrophysics (VIA).
This project is supported by Schmidt Sciences.
1 change: 1 addition & 0 deletions docs/examples.rst
@@ -4,5 +4,6 @@ Examples
Some examples of how to use the TAPE package are provided in these notebooks.

.. toctree::
   :maxdepth: 1

   Use Lomb–Scargle Periodograms for SDSS Stripe 82 RR Lyrae <examples/rrlyr-period>
4 changes: 3 additions & 1 deletion docs/gettingstarted.rst
@@ -6,6 +6,8 @@ we encourage you to open an issue on the
`TAPE github repository <https://github.com/lincc-frameworks/tape/issues>`_.

.. toctree::
   :maxdepth: 1

   Installing TAPE <gettingstarted/installation>
   Quickstart Guide <gettingstarted/quickstart>
   Contribution Guide <gettingstarted/contributing>
20 changes: 20 additions & 0 deletions docs/gettingstarted/contributing.rst
@@ -0,0 +1,20 @@
Contribution Guide
==================

Dev Guide - Getting Started
---------------------------

Download the code and install dependencies in a conda environment, then run the unit tests to verify that the packages are properly installed.

.. code-block:: bash

   conda create -n seriesenv python=3.11
   conda activate seriesenv
   git clone https://github.com/lincc-frameworks/tape
   cd tape/
   pip install .
   pip install .[dev]  # it may be necessary to use `pip install .'[dev]'` (with single quotes) depending on your machine.
   pip install pytest
   pytest
5 changes: 3 additions & 2 deletions docs/gettingstarted/quickstart.ipynb
@@ -21,7 +21,8 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install lf-tape"
"\n",
"%pip install lf-tape --quiet\n"
]
},
{
@@ -210,7 +211,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
"version": "3.10.11"
},
"vscode": {
"interpreter": {
6 changes: 3 additions & 3 deletions docs/index.rst
@@ -11,8 +11,7 @@ TAPE offers a complete ecosystem for loading, filtering, and analyzing
timeseries data. TAPE is built to enable users to run provided and user-defined
analysis functions at scale in a parallelized and/or distributed manner.

Over the survey lifetime of the `LSST <https://www.lsst.org/about>`_, on order
of ~billions of objects will have multiband lightcurves available, and TAPE has
Over the survey lifetime of the `LSST <https://www.lsst.org/about>`_, billions of objects will have multiband lightcurves available, and TAPE has
been built as a framework with the goal of making analysis of LSST-scale
data accessible.

@@ -23,13 +22,14 @@ How to Use This Guide
==============================================

Begin with the `Getting Started <https://tape.readthedocs.io/en/latest/gettingstarted.html>`_ guide to learn the basics of installation and
walk through a simple example of using TAPE.

The `Tutorials <https://tape.readthedocs.io/en/latest/tutorials.html>`_ section showcases the fundamental features of TAPE.

API-level information about TAPE is viewable in the
`API Reference <https://tape.readthedocs.io/en/latest/autoapi/index.html>`_ section.

Learn more about contributing to this repository in our :doc:`Contribution Guide <gettingstarted/contributing>`.


.. toctree::
1 change: 1 addition & 0 deletions docs/tutorials.rst
@@ -5,6 +5,7 @@ These pages contain a set of tutorial notebooks for working through core and more
functionality.

.. toctree::
:maxdepth: 1

Loading Data into TAPE <tutorials/tape_datasets>
Working with the TAPE Ensemble object <tutorials/working_with_the_ensemble>
1 change: 1 addition & 0 deletions docs/tutorials/batch_showcase.ipynb
@@ -45,6 +45,7 @@
"for i in range(10, 110):\n",
" obj_ids.append(np.array([i] * 1250))\n",
" mjds.append(np.arange(0.0, 1250.0, 1.0))\n",
"\n",
"obj_ids = np.concatenate(obj_ids)\n",
"mjds = np.concatenate(mjds)\n",
"\n",
4 changes: 2 additions & 2 deletions docs/tutorials/common_data_operations.ipynb
@@ -110,7 +110,7 @@
"Often, you'll want to peek at your data even though the full-size is too large for memory.\n",
"\n",
"> **_Note:_**\n",
"By default this only looks at the first partition of data, so any operations that remove all data from the first partition will produce an empty head result. Specify `npartitions=-1` to grab from all partitions.\n"
"Some partitions may be empty, and `head` will have to traverse these empty partitions to find enough rows for your result. For an empty table with many (O(100k)) partitions, this traversal can be costly even though the final result is empty. "
]
},
{
@@ -119,7 +119,7 @@
"metadata": {},
"outputs": [],
"source": [
"ens.source.head(5, npartitions=-1) # grabs the first 5 rows\n",
"ens.source.head(5) # grabs the first 5 rows\n",
"\n",
"# can also use tail to grab the last 5 rows"
]
71 changes: 1 addition & 70 deletions docs/tutorials/using_ray_with_the_ensemble.ipynb
@@ -7,7 +7,7 @@
"source": [
"# Using Dask on Ray with the Ensemble\n",
"\n",
"[Ray](https://docs.ray.io/en/latest/ray-overview/index.html) is an open-source unified framework for scaling AI and Python applications. Ray provides a scheduler for Dask ([dask_on_ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)) which allows you to build data analyses using Dask’s collections and execute the underlying tasks on a Ray cluster. We have found with TAPE that the Ray scheduler is often more performant than Dask's scheduler. Ray can be used on TAPE using the setup shown in the following example."
"[Ray](https://docs.ray.io/en/latest/ray-overview/index.html) is an open-source unified framework for scaling AI and Python applications. Ray provides a scheduler for Dask ([dask_on_ray](https://docs.ray.io/en/latest/ray-more-libs/dask-on-ray.html)) which allows you to build data analyses using Dask’s collections and execute the underlying tasks on a Ray cluster. Ray can be used on TAPE using the setup shown in the following example."
]
},
{
@@ -86,75 +86,6 @@
" calc_sf2, use_map=False\n",
") # use_map is false as we repartition naively, splitting per-object sources across partitions"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c5692d75",
"metadata": {},
"source": [
"## Timing Comparison\n",
"\n",
"As mentioned above, we generally see that Ray is more performant than Dask. Below is a simple timing comparison."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f128cdbf",
"metadata": {},
"source": [
"### Ray Timing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd960e10",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"ens = Ensemble(client=False) # Do not use a client\n",
"ens.from_dataset(\"s82_qso\", sorted=True)\n",
"ens.source = ens.source.repartition(npartitions=10)\n",
"ens.batch(calc_sf2, use_map=False)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "228e5114",
"metadata": {},
"source": [
"### Dask Timing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24a8f466",
"metadata": {},
"outputs": [],
"source": [
"disable_dask_on_ray() # unsets the dask_on_ray configuration settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1552c2b8",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"ens = Ensemble()\n",
"ens.from_dataset(\"s82_qso\", sorted=True)\n",
"ens.source = ens.source.repartition(npartitions=10)\n",
"ens.batch(calc_sf2, use_map=False).compute()"
]
}
],
"metadata": {
40 changes: 40 additions & 0 deletions src/tape/ensemble_frame.py
@@ -845,6 +845,46 @@ def repartition(
        )
        return self._propagate_metadata(result)

    def head(self, n=5, compute=True, npartitions=None):
        """Returns `n` rows of data for previewing purposes.

        Parameters
        ----------
        n : int, optional
            The number of desired rows. Default is 5.
        compute : bool, optional
            Whether to compute the result immediately. Default is True.
        npartitions : int, optional
            `npartitions` is not supported and is ignored if provided. Instead
            all partitions may be used.

        Returns
        -------
        `pandas.DataFrame`
            A pandas DataFrame with up to `n` rows of data.
        """
        if npartitions is not None:
            warnings.warn(
                "The 'npartitions' parameter is not supported for TAPE dataframes. All partitions may be used."
            )

        if not compute:
            # Just use the Dask head method
            return super().head(n, compute=False)

        if n <= 0:
            return super().head(0)

        # Iterate over the partitions until we have enough rows
        dfs = []
        remaining_rows = n
        for partition in self.partitions:
            if remaining_rows == 0:
                break
            # Note that partition is itself a _Frame object, so we need to compute to avoid infinite recursion
            partition_head = partition.compute().head(remaining_rows)
            dfs.append(partition_head)
            remaining_rows -= len(partition_head)

        return pd.concat(dfs)


class EnsembleSeries(_Frame, dd.Series):
    """A barebones extension of a Dask Series for Ensemble data."""
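The partition-iteration logic of the new `head` method can be illustrated in isolation. Below is a minimal sketch using plain pandas DataFrames as stand-ins for already-computed partitions (the function name `head_across_partitions` is hypothetical, for illustration only):

```python
import pandas as pd

def head_across_partitions(partitions, n):
    """Collect up to `n` rows by walking partitions in order,
    skipping over any that turn out to be empty."""
    if n <= 0:
        return pd.DataFrame()
    dfs = []
    remaining_rows = n
    for partition in partitions:
        if remaining_rows == 0:
            break
        partition_head = partition.head(remaining_rows)
        dfs.append(partition_head)
        remaining_rows -= len(partition_head)
    return pd.concat(dfs)

# An empty first partition is simply skipped rather than producing an
# empty result, which is the behavior the new TAPE `head` is after.
parts = [
    pd.DataFrame(columns=["flux"]),           # empty partition
    pd.DataFrame({"flux": [1.0, 2.0, 3.0]}),
    pd.DataFrame({"flux": [4.0, 5.0]}),
]
print(len(head_across_partitions(parts, 4)))  # 4
```

Asking for more rows than exist just returns every row, matching the "sane behavior" case exercised in the tests below.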
59 changes: 59 additions & 0 deletions tests/tape_tests/test_ensemble_frame.py
@@ -1,5 +1,6 @@
""" Test EnsembleFrame (inherited from Dask.DataFrame) creation and manipulations. """

from math import floor
import numpy as np
import pandas as pd
from tape import (
@@ -470,3 +471,61 @@ def test_partition_slicing(parquet_ensemble_with_divisions):

    assert ens.source.npartitions == 2  # should return exactly 2 partitions
    assert len(ens.object) < prior_src_len  # should affect objects


@pytest.mark.parametrize(
    "data_fixture",
    [
        "parquet_ensemble",
        "parquet_ensemble_with_divisions",
    ],
)
def test_head(data_fixture, request):
    """
    Tests that head returns the correct number of rows.
    """
    ens = request.getfixturevalue(data_fixture)

    # Test with repartitioning the source frame
    frame = ens.source
    frame = frame.repartition(npartitions=10)

    assert frame.npartitions == 10

    # Check that a warning is raised when npartitions are requested.
    with pytest.warns(UserWarning):
        frame.head(5, npartitions=5)

    # Test inputs that should return an empty frame
    assert len(frame.head(-100)) == 0
    assert len(frame.head(0)) == 0
    assert len(frame.head(-1)) == 0

    assert len(frame.head(100, compute=False).compute()) == 100

    one_res = frame.head(1)
    assert len(one_res) == 1
    assert isinstance(one_res, TapeFrame)
    assert set(one_res.columns) == set(frame.columns)

    def_result = frame.head()
    assert len(def_result) == 5
    assert isinstance(def_result, TapeFrame)
    assert set(def_result.columns) == set(frame.columns)

    def_result = frame.head(24)
    assert len(def_result) == 24
    assert isinstance(def_result, TapeFrame)
    assert set(def_result.columns) == set(frame.columns)

    # Test that we have sane behavior even when the number of rows requested
    # is larger than the number of rows in the frame.
    assert len(frame.head(2 * len(frame))) == len(frame)

    # Choose a value that will be guaranteed to hit every partition for this data.
    # Note that with parquet_ensemble_with_divisions some of the partitions are
    # empty, testing that as well.
    rows = floor(len(frame.compute()) * 0.98)
    result = frame.head(rows)
    assert len(result) == rows
    assert isinstance(result, TapeFrame)
    assert set(result.columns) == set(frame.columns)
