Test chunking (including Hypothesis tests) #57

Merged (25 commits, Jun 22, 2021)
Changes from 19 commits

Commits (25 total):
7879749  new fixtures (TomNicholas, May 25, 2021)
588451e  new tests, including hypothesis tests (TomNicholas, May 25, 2021)
1716d57  clean up test class structure (TomNicholas, May 25, 2021)
731f6f1  pass strategy to final monster test (TomNicholas, May 25, 2021)
ae91eb5  add Hypothesis library to CI environments (TomNicholas, May 25, 2021)
6f4359f  removed uneccessary space (TomNicholas, May 25, 2021)
f652bfd  silence PerformanceWarning (TomNicholas, May 26, 2021)
fd69d10  reverted align_arrays=False (TomNicholas, May 26, 2021)
1379ac3  eliminated possibility of repeated dims/vars in fixtures (TomNicholas, May 26, 2021)
2fd7192  removed non-redundant imports (TomNicholas, May 26, 2021)
ce32da4  fixed linting (TomNicholas, May 26, 2021)
7d248d1  reinstated hpothesis tests in their own file (TomNicholas, May 27, 2021)
68df0f2  remove rogue print statement (TomNicholas, May 27, 2021)
0ccce21  standardized all n_* variable names (TomNicholas, May 27, 2021)
f04edcf  generalised fixtures to create n-dimensional outputs (TomNicholas, May 27, 2021)
2d022a2  demoted fixtures to functions (TomNicholas, May 27, 2021)
9979379  test chunked weights (TomNicholas, May 27, 2021)
f0baec4  linting (TomNicholas, May 27, 2021)
932c0ee  tests for unaligned chunks (TomNicholas, May 27, 2021)
6ab521a  Removed more rogue print statements (TomNicholas, May 27, 2021)
c957a6d  and more (TomNicholas, May 27, 2021)
3cc2c11  un-raveled where unnecessary (TomNicholas, May 27, 2021)
ef11c86  Generalise shape of example dataset to ND (TomNicholas, May 27, 2021)
5b173a6  parameterized broadcast test to test reducing over both dimensions se… (TomNicholas, May 27, 2021)
6fc4161  Trigger tests (TomNicholas, Jun 22, 2021)
1 change: 1 addition & 0 deletions ci/environment-3.7.yml
@@ -7,6 +7,7 @@ dependencies:
- dask
- numpy=1.16
- pytest
- hypothesis
- pip
- pip:
- codecov
1 change: 1 addition & 0 deletions ci/environment-3.8.yml
@@ -7,6 +7,7 @@ dependencies:
- dask
- numpy=1.18
- pytest
- hypothesis
- pip
- pip:
- codecov
1 change: 1 addition & 0 deletions ci/environment-3.9.yml
@@ -7,6 +7,7 @@ dependencies:
- dask
- numpy
- pytest
- hypothesis
- pip
- pip:
- codecov
25 changes: 25 additions & 0 deletions xhistogram/test/fixtures.py
@@ -1,5 +1,8 @@
import uuid
import dask
import dask.array as dsa
import numpy as np
import xarray as xr


def empty_dask_array(shape, dtype=float, chunks=None):
@@ -12,3 +15,25 @@ def raise_if_computed():
        a = a.rechunk(chunks)

    return a


def example_dataarray(shape=(5, 20)):
    data = np.random.randn(*shape)
    dims = [f"dim_{i}" for i in range(len(shape))]
    da = xr.DataArray(data, dims=dims, name="T")
    return da


def example_dataset(n_dim=2, n_vars=2):
    """Random dataset with every variable having the same shape"""

    shape = (8, 9, 10, 11)[:n_dim]
    dims = [f"dim_{i}" for i in range(len(shape))]
    var_names = [uuid.uuid4().hex for _ in range(n_vars)]
    ds = xr.Dataset()
    for i in range(n_vars):
        name = var_names[i]
        data = np.random.randn(*shape)
        da = xr.DataArray(data, dims=dims, name=name)
        ds[name] = da
    return ds
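
For orientation, a hedged usage sketch of the two helpers added above; the shapes, chunk sizes, and variable names below are arbitrary examples chosen for illustration, not taken from the PR:

# Illustrative only: exercising the new fixtures added in this diff.
from xhistogram.test.fixtures import example_dataarray, example_dataset

da = example_dataarray(shape=(5, 20))    # 2-D DataArray named "T" with dims dim_0, dim_1
ds = example_dataset(n_dim=3, n_vars=2)  # Dataset holding two random (8, 9, 10) variables

# Chunking a copy gives a dask-backed input, as the chunking tests below do.
da_chunked = da.chunk((2, 7))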
143 changes: 143 additions & 0 deletions xhistogram/test/test_chunking.py
@@ -0,0 +1,143 @@
import numpy as np
import pytest

from .fixtures import example_dataarray
from ..xarray import histogram


@pytest.mark.parametrize("weights", [False, True])
@pytest.mark.parametrize("chunksize", [1, 2, 3, 10])
@pytest.mark.parametrize("shape", [(10,), (10, 4)])
def test_chunked_weights(chunksize, shape, weights):

    data_a = example_dataarray(shape).chunk((chunksize,))

    if weights:
        weights = example_dataarray(shape).chunk((chunksize,))
        weights_arr = weights.values.ravel()
    else:
        weights = weights_arr = None

    nbins_a = 6
    bins_a = np.linspace(-4, 4, nbins_a + 1)

    h = histogram(data_a, bins=[bins_a], weights=weights)

    assert h.shape == (nbins_a,)

    hist, _ = np.histogram(data_a.values.ravel(), bins=bins_a, weights=weights_arr)

    np.testing.assert_allclose(hist, h.values)


@pytest.mark.parametrize("xchunksize", [1, 2, 3, 10])
@pytest.mark.parametrize("ychunksize", [1, 2, 3, 12])
class TestFixedSize2DChunks:
    def test_2d_chunks(self, xchunksize, ychunksize):

        data_a = example_dataarray(shape=(10, 12)).chunk((xchunksize, ychunksize))

        nbins_a = 8
        bins_a = np.linspace(-4, 4, nbins_a + 1)

        h = histogram(data_a, bins=[bins_a])

        assert h.shape == (nbins_a,)

        hist, _ = np.histogram(data_a.values.ravel(), bins=bins_a)

        np.testing.assert_allclose(hist, h.values)

    def test_2d_chunks_broadcast_dim(
        self,
        xchunksize,
        ychunksize,
    ):
        data_a = example_dataarray(shape=(10, 12)).chunk((xchunksize, ychunksize))

        nbins_a = 8
        bins_a = np.linspace(-4, 4, nbins_a + 1)

        reduce_dim, broadcast_dim = data_a.dims
        h = histogram(data_a, bins=[bins_a], dim=(reduce_dim,)).transpose()

        assert h.shape == (nbins_a, data_a.sizes[broadcast_dim])

        def _np_hist(*args, **kwargs):
            h, _ = np.histogram(*args, **kwargs)
            return h

        hist = np.apply_along_axis(_np_hist, axis=0, arr=data_a.values, bins=bins_a)

        np.testing.assert_allclose(hist, h.values)

    def test_2d_chunks_2d_hist(self, xchunksize, ychunksize):

        data_a = example_dataarray(shape=(10, 12)).chunk((xchunksize, ychunksize))
        data_b = example_dataarray(shape=(10, 12)).chunk((xchunksize, ychunksize))

        nbins_a = 8
        nbins_b = 9
        bins_a = np.linspace(-4, 4, nbins_a + 1)
        bins_b = np.linspace(-4, 4, nbins_b + 1)

        h = histogram(data_a, data_b, bins=[bins_a, bins_b])

        assert h.shape == (nbins_a, nbins_b)

        hist, _, _ = np.histogram2d(
            data_a.values.ravel(),
            data_b.values.ravel(),
            bins=[bins_a, bins_b],
        )

        np.testing.assert_allclose(hist, h.values)


@pytest.mark.parametrize("xchunksize", [1, 2, 3, 10])
@pytest.mark.parametrize("ychunksize", [1, 2, 3, 12])
class TestUnalignedChunks:
    def test_unaligned_data_chunks(self, xchunksize, ychunksize):
        data_a = example_dataarray(shape=(10, 12)).chunk((xchunksize, ychunksize))
        print(data_a.chunks)
        data_b = example_dataarray(shape=(10, 12)).chunk(
            (xchunksize + 1, ychunksize + 1)
        )
        print(data_b.chunks)

        nbins_a = 8
        nbins_b = 9
        bins_a = np.linspace(-4, 4, nbins_a + 1)
        bins_b = np.linspace(-4, 4, nbins_b + 1)

        h = histogram(data_a, data_b, bins=[bins_a, bins_b])

        assert h.shape == (nbins_a, nbins_b)

        hist, _, _ = np.histogram2d(
            data_a.values.ravel(),
            data_b.values.ravel(),
            bins=[bins_a, bins_b],
        )

        np.testing.assert_allclose(hist, h.values)

    def test_unaligned_weights_chunks(self, xchunksize, ychunksize):

        data_a = example_dataarray(shape=(10, 12)).chunk((xchunksize, ychunksize))
        weights = example_dataarray(shape=(10, 12)).chunk(
            (xchunksize + 1, ychunksize + 1)
        )

        nbins_a = 8
        bins_a = np.linspace(-4, 4, nbins_a + 1)

        h = histogram(data_a, bins=[bins_a], weights=weights)

        assert h.shape == (nbins_a,)

        hist, _ = np.histogram(
            data_a.values.ravel(), bins=bins_a, weights=weights.values.ravel()
        )

        np.testing.assert_allclose(hist, h.values)
86 changes: 86 additions & 0 deletions xhistogram/test/test_chunking_hypotheses.py
@@ -0,0 +1,86 @@
import numpy as np
import pytest

from .fixtures import example_dataarray, example_dataset
from ..xarray import histogram

pytest.importorskip("hypothesis")

import hypothesis.strategies as st  # noqa
from hypothesis import given  # noqa


@st.composite
def chunk_shapes(draw, n_dim=3, max_arr_len=10):
    """Generate different chunking patterns for an N-D array of data."""
    chunks = []
    for n in range(n_dim):
        shape = draw(st.integers(min_value=1, max_value=max_arr_len))
        chunks.append(shape)
    return tuple(chunks)


class TestChunkingHypotheses:
    @given(chunk_shapes(n_dim=1, max_arr_len=20))
    def test_all_chunking_patterns_1d(self, chunks):

        data = example_dataarray(shape=(20,)).chunk(chunks)

        nbins_a = 8
        bins = np.linspace(-4, 4, nbins_a + 1)

        h = histogram(data, bins=[bins])

        assert h.shape == (nbins_a,)

        hist, _ = np.histogram(
            data.values.ravel(),
            bins=bins,
        )

        np.testing.assert_allclose(hist, h)

    # TODO mark as slow?
    @given(chunk_shapes(n_dim=2, max_arr_len=8))
    def test_all_chunking_patterns_2d(self, chunks):

        data_a = example_dataarray(shape=(5, 20)).chunk(chunks)
        data_b = example_dataarray(shape=(5, 20)).chunk(chunks)

        nbins_a = 8
        nbins_b = 9
        bins_a = np.linspace(-4, 4, nbins_a + 1)
        bins_b = np.linspace(-4, 4, nbins_b + 1)

        h = histogram(data_a, data_b, bins=[bins_a, bins_b])

        assert h.shape == (nbins_a, nbins_b)

        hist, _, _ = np.histogram2d(
            data_a.values.ravel(),
            data_b.values.ravel(),
            bins=[bins_a, bins_b],
        )

        np.testing.assert_allclose(hist, h.values)

    # TODO mark as slow?
    @pytest.mark.parametrize("n_vars", [1, 2, 3, 4])
    @given(chunk_shapes(n_dim=2, max_arr_len=7))
Review comment (Contributor):
Might be nice to also test dims= and weights= with Hypothesis. It can be nice to throw all the possible axes of variation into a Hypothesis test as an easy way to check all possible cases, without having to write as many individual tests.

TomNicholas (Member, Author), May 27, 2021:
What exactly do you mean? If I make the test_all_chunking_patterns_dd_hist accept a dims (or reduce_axes) argument then I also need a np.histogramdd function that can handle that generality. Is there a quick way to achieve that in the test? Possibly with np.apply_over_axes?

For the weights then I guess I could pass weights and allow the data and weights to have different chunking patterns - is that what you meant?

Contributor:
Yeah, I suppose it's trickier to test that, since you'd need something to do N-D histograms (xhistogram) to verify the results.

I suppose you could just compare against histogram of the computed (NumPy) arrays, and make it purely a test of the dask functionality. If we're confident the NumPy code paths are well-tested, that seems reasonable to me.

But it was just a thought; I think the tests here are already quite good, so fine to leave it as-is too.
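(A hedged sketch of this dims= suggestion, not part of the PR, follows the end of this diff.)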

    def test_all_chunking_patterns_dd_hist(self, n_vars, chunk_shapes):
        ds = example_dataset(n_dim=2, n_vars=n_vars)
        ds = ds.chunk({d: c for d, c in zip(ds.dims.keys(), chunk_shapes)})

        n_bins = (7, 8, 9, 10)[:n_vars]
        bins = [np.linspace(-4, 4, n + 1) for n in n_bins]

        h = histogram(*[da for name, da in ds.data_vars.items()], bins=bins)

        assert h.shape == n_bins

        input_data = np.stack(
            [da.values.ravel() for name, da in ds.data_vars.items()], axis=-1
        )
        hist, _ = np.histogramdd(input_data, bins=bins)

        np.testing.assert_allclose(hist, h.values)
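
Picking up the review thread above about also driving dims= with Hypothesis: below is a minimal sketch of that idea, assuming the same example_dataarray fixture and histogram API already used in this PR, and checking against NumPy applied along the reduced axis of the computed array. The strategy and test names (chunks_and_axis, test_chunking_patterns_with_dim) are hypothetical and were not part of the PR.

# Hypothetical sketch, not part of this PR: draw both the chunk shape and the
# dimension to reduce over, then verify against NumPy on the computed array.
import numpy as np
import hypothesis.strategies as st
from hypothesis import given

from .fixtures import example_dataarray
from ..xarray import histogram


@st.composite
def chunks_and_axis(draw, shape=(5, 20)):
    """Draw a chunk shape matching `shape` plus an axis to histogram over."""
    chunks = tuple(draw(st.integers(min_value=1, max_value=s)) for s in shape)
    axis = draw(st.integers(min_value=0, max_value=len(shape) - 1))
    return chunks, axis


@given(chunks_and_axis())
def test_chunking_patterns_with_dim(params):
    chunks, axis = params
    data = example_dataarray(shape=(5, 20)).chunk(chunks)

    nbins = 8
    bins = np.linspace(-4, 4, nbins + 1)

    reduce_dim = data.dims[axis]
    h = histogram(data, bins=[bins], dim=(reduce_dim,))

    def _np_hist(a):
        counts, _ = np.histogram(a, bins=bins)
        return counts

    # np.apply_along_axis puts the bin axis where the reduced axis was; move it
    # last to match xhistogram's (broadcast dims ..., bins) output layout.
    expected = np.apply_along_axis(_np_hist, axis=axis, arr=data.values)
    np.testing.assert_allclose(np.moveaxis(expected, axis, -1), h.values)

Weights could be folded into the same strategy by drawing a second, independent chunk shape for a weights array, along the lines the author suggests in the thread above.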