Faster fastpath #9001

Merged
merged 1 commit into from May 5, 2024

Conversation

@hmaarrfk (Contributor) commented May 5, 2024

I noticed that slicing into datetime64[ns] arrays was really starting to limit our overall performance.

What is strangest is that it is faster on a lazy array than on an in-memory array.

I think it comes down to the fact that the code on the main branch triggers:

def _possibly_convert_objects(values):

which does a numpy -> pandas -> numpy conversion for safety.

But that seems a little over the top.
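
For illustration, here is a rough sketch of the kind of round trip that helper performs; the real implementation in xarray handles more cases, and the function name suffix and details below are assumptions:

import numpy as np
import pandas as pd

def _possibly_convert_objects_sketch(values):
    # Round-trip through pandas so that object arrays of datetimes/timedeltas
    # come back as datetime64/timedelta64; arrays that are already converted
    # still pay for this round trip.
    as_series = pd.Series(values.ravel())               # numpy -> pandas
    return np.asarray(as_series).reshape(values.shape)  # pandas -> numpy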

My benchmark of interest is:

Indexing into a 1D array with a scalar to pull a single entry out of it.

This often occurs when we slice into our larger datasets.

On main:

$ python software_timestamp_benchmark.py
software_timestamp.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8664.24it/s]
h5netcdf  computed  : 100%|███████████| 100000/100000 [00:16<00:00, 5976.98it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8453.77it/s]
netcdf4   computed  : 100%|███████████| 100000/100000 [00:16<00:00, 5924.96it/s]
software_timestamp_float64.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 9042.25it/s]
h5netcdf  computed  : 100%|██████████| 100000/100000 [00:07<00:00, 14157.30it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 9005.22it/s]
netcdf4   computed  : 100%|██████████| 100000/100000 [00:07<00:00, 13935.46it/s]

On this branch:

$ python software_timestamp_benchmark.py
software_timestamp.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8951.20it/s]
h5netcdf  computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14793.26it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8774.30it/s]
netcdf4   computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14671.77it/s]
software_timestamp_float64.nc
h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:10<00:00, 9364.32it/s]
h5netcdf  computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14955.89it/s]
netcdf4   lazy      : 100%|███████████| 100000/100000 [00:10<00:00, 9324.61it/s]
netcdf4   computed  : 100%|██████████| 100000/100000 [00:06<00:00, 14697.09it/s]
The benchmark
import xarray
# from tqdm import tqdm
tqdm = lambda x, *args, **kwargs: x  # drop-in no-op so the script runs without tqdm

for filename in [
    # 'software_timestamp_float64.nc',
    'software_timestamp.nc',
]:
    print(filename)
    for engine in [
        'h5netcdf',
        # 'netcdf4',
    ]:
        for compute in [
            # False,
            True,
        ]:
            dataset = xarray.open_dataset(filename, engine=engine)
            N_frames = dataset.sizes['frame_number']
            N_iters = 100_000

            software_timestamp = dataset['software_timestamp']
            if compute:
                software_timestamp = software_timestamp.compute()
                compute_str = "computed"
            else:
                compute_str = "lazy"
            desc = f"{engine:10s}{compute_str:10s}"
            for i in tqdm(range(N_iters), desc=desc):
                software_timestamp.isel(frame_number=i % N_frames)
            dataset.close()

I can try to expand the benchmark, though it is difficult to "see" these slowdowns with toy examples sometimes.

software_timestamp.zip

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

xref: #2799
xref: #7045

@hmaarrfk hmaarrfk closed this May 5, 2024
@hmaarrfk hmaarrfk reopened this May 5, 2024
@max-sixty max-sixty added the plan to merge Final call for comments label May 5, 2024
    if fastpath and getattr(data, "ndim", 0) > 0:
        # can't use fastpath (yet) for scalars
        return cast("T_DuckArray", _maybe_wrap_data(data))
    ndim = getattr(data, "ndim", None)
Collaborator

Small point relative to avoiding _maybe_wrap_data:

Wouldn't we only want to get ndim if fastpath is True? Then we could just change the comparison to 0 into a comparison to None on L272?

Contributor Author

Thank you for the thorough and quick review. You are correct.

Contributor

Looks like both pd.Index and ExtensionArray define ndim, so this should be OK. It'd be nice to add this as a comment.
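
For reference, a minimal sketch of what the resulting check could look like; the function name below is made up, and the real as_compatible_data code in xarray has typing, casts, and additional branches around it:

def fastpath_ndim_check_sketch(data, fastpath: bool = False):
    # Only consult ndim when the fast path was requested; pd.Index and
    # ExtensionArray both define ndim, so they can take this branch too.
    if fastpath and getattr(data, "ndim", None) is not None:
        return data  # skip _maybe_wrap_data on the fast path
    ...  # otherwise fall through to the general conversion/wrapping logic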

@max-sixty
Collaborator

Merging pending tests. I'm not 100% sure about the implications of skipping _maybe_wrap_data in some cases, so if anyone knows whether it was required, we can reassess. Thanks @hmaarrfk

@max-sixty max-sixty merged commit 50f8726 into pydata:main May 5, 2024
28 checks passed
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request May 6, 2024
This targets optimization for datasets with many "scalar" variables
(that is, variables without any dimensions). This can happen when you
have many pieces of small metadata that relate to various facts about
an experimental condition.

For example, we have about 80 of these in our datasets (and I want to
increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we
often have one dimension that is in the tens of thousands.

However, it has become quite slow to index into the dataset.

We therefore often "carefully slice out the metadata we need" prior to
doing anything with our dataset, but that isn't quite possible when you
want to orchestrate things with a parent application.

These optimizations are likely "minor", but considering the results of
the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001) - 2.5k it/s
* With pydata#9002 - 4.2k it/s
* With this pull request (on top of pydata#9002) - 6.1k it/s

Thanks for considering.
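
To make the setup above concrete, here is a hedged sketch of a dataset with that shape (all names and sizes invented for illustration): many dimensionless metadata variables next to one long dimension, plus the repeated scalar indexing the benchmark measures.

import numpy as np
import xarray as xr

# ~80 dimensionless "metadata" variables plus one long dimension.
variables = {f"meta_{i}": ((), float(i)) for i in range(80)}
variables["software_timestamp"] = (
    "frame_number",
    np.arange(50_000).astype("datetime64[ns]"),
)
ds = xr.Dataset(variables)

# The hot path: repeatedly pulling single entries out of a 1-D variable.
value = ds["software_timestamp"].isel(frame_number=123)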
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request May 6, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request May 6, 2024
andersy005 pushed a commit that referenced this pull request May 10, 2024
andersy005 added a commit that referenced this pull request May 10, 2024
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
andersy005 added a commit to hmaarrfk/xarray that referenced this pull request May 10, 2024
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 12, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 12, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 19, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 22, 2024