first attempt to support awkward arrays (#647)
* first attempt to support awkward arrays

* remove comments

* better comment

* add type to gen_adata

* first attempt at concat

* remove comment

* add outer concat

* add awkward to test dep

* add awk arr to data gen

* fix test base

* init test for concat

* fix concatenate tests

* create mock class for awkward array

* remove space

* import ak when needed

* relative import of awk array

* fix optional dep import

* resolve conflicts

* draft IO for awkward arrays

* add awkward to docs and save form to attrs

* Update dependencies

* Update dim_len

* ignore vscode directory

* Validate that awkward arrays align to axes

* Fix reindexing during merge

* fix lint

* remove duplicate import

* Test different types of awkward arrays in different slots

* Better function to generate awkward arrays

* Better dim_len for awkward arrays

* Working out how to best check the dim_len

* Only accept awkward arrays that are "regular" in the aligned dimension

The conversion is left to the user. Explicit is better than implicit.

* Switch to v2 API

* WIP rewrite awkward array generation

* Improve awkward array generation and dim_len check

* Switch to new awkward array generation in all tests

* Fix test_transpose

* Fix/workaround more tests

* Add test for setting anndata slots to awkward arrays

* enable tests for 3d ragged array in layers

* Cleanup

* Fix that X could not be set when creating AnnData object from scratch.

Apparently the checks are quite different than when adding a Layer.

* Remove code to make awkward array regular after merge.

This is now done by the awkward array library.

* Do not explicitly copy awkward arrays

* Implement transposing awkward arrays

* Add docs stub and update type hints

* Fix: dtype not available during merge if both X are awkward

* Fix IO

* Request pre-release version of awkward

* Exclude awkward layer in loom tests

* Pull in only changes relevant to obsm/varm

* Update tests

* Fix type hints

* Update error message in aligned mapping

* Use compat module to support both awkward v1.9rc and 2.x

* restructure tests

* Add tests for copies and view

* Remove unused import

* Fix how actual shape is computed in aligned mapping

* Attempt to support views with ak.behavior

* Use shallow copy

* Add dim_len_awkward function including tests

* Test that assigning an awkward v1 array fails

* Add stub for element-wise IO tests

* Restructure dim_len_awkward

* Add more test cases for awkward IO

* WIP add tests for concatenating AwkArrays with missing values

* Fix AwkwardArrayView

* Simplify awkward array view code

* Use None to remove name from awkward array

* Mark test_no_awkward_v1 as xfail for uns

* Add test for categorical arrays

* Update docs/fileformat-prose.rst

Co-authored-by: Isaac Virshup <[email protected]>

* Update anndata/_core/aligned_mapping.py

Co-authored-by: Isaac Virshup <[email protected]>

* Update anndata/tests/helpers.py

Co-authored-by: Isaac Virshup <[email protected]>

* Update awkward tests to use assert_equal with exact=True

* Bump required version

* Update categorical syntax, add new categorical test

* Start concat tests for awkward

* Add release notes

* Add testcases for dim_len with awkward arrays of strings

* Fix dim_len for arrays of strings

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Awkward v2 fixes

Several functions changed until the stable awkward v2 version was released.

* Exclude awkward arrays from fill_value concat test

* fix flake8

* Add IO testcase for AIRR data

* Fix link

* Get inner join working for concatenation

* Bump some concatenation cases to a later PR

* Generate empty arrays for outer join

* Raise NotImplementedError when creating a view of an awkward array with custom behavior

* Add warning when setting awkward array in aligned mapping

* Get much more of concatenation 'working'

* Use warning instead of logging

* extend todo comment about views

* Fix IO, and to_memory for views of awkward arrays

* Removed a number of test cases that we're not targeting

This fixed a number of tests because we had a 1d awkward array being generated, and we currently don't support 1d arrays in obsm well. Tracked in #652.

* Implement outer indexing on axis 0 of an awkward array

* Fix gen_awkward when one of the dimensions has size 0

* Fix equality function for awkward arrays. Was throwing an error when the arrays weren't broadcastable.

* Modify outer concatenation test to accept current behaviour of awkward array

* Add tests for mixed type concatenation with awkward arrays

* Add warning about outer joins

* Call ak._util.arrays_approx_equal instead of rolling our own

* update awkward to 2.0.7 (unfortunately: errors)

* remove unnecessary checks from AwkwardArrayView

* Workaround scikit-hep/awkward#2209

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed extra layer of nesting from on-disk format for awkward arrays

---------

Co-authored-by: Gregor Sturm <[email protected]>
Co-authored-by: Isaac Virshup <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
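
The commit message says that only awkward arrays that are "regular" in the aligned dimension are accepted, and that a dimension of variable length is rejected. As a rough illustration of that idea, the sketch below uses plain nested Python lists instead of awkward arrays. The helper name `nested_dim_len` is hypothetical and is not anndata's `dim_len`; it only mirrors the convention of returning `None` for a ragged dimension.

```python
from typing import Optional, Sequence


def nested_dim_len(data: Sequence, axis: int) -> Optional[int]:
    """Return the length of `data` along `axis`, or None if that axis is ragged.

    Hypothetical helper for illustration only; it works on nested lists,
    not on awkward arrays.
    """
    if axis == 0:
        return len(data)
    # Collect the length every row reports for the remaining axes.
    lengths = {nested_dim_len(row, axis - 1) for row in data}
    if len(lengths) == 1:
        return lengths.pop()
    # Rows disagree (or there are no rows), so no single length exists.
    return None


regular = [[1, 2], [3, 4], [5, 6]]
ragged = [[1], [2, 3], []]

print(nested_dim_len(regular, 0), nested_dim_len(regular, 1))
print(nested_dim_len(ragged, 0), nested_dim_len(ragged, 1))
```

Under this convention, a container can compare the reported length against its own shape and raise when the result is `None`.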
4 people authored Feb 7, 2023
1 parent 4ccf91c commit a9e634c
Showing 19 changed files with 1,049 additions and 32 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -27,4 +27,5 @@ test.h5ad

 # IDEs
 /.idea/
+/.vscode/

7 changes: 6 additions & 1 deletion anndata/__init__.py
@@ -18,7 +18,12 @@
     read_mtx,
     read_zarr,
 )
-from ._warnings import OldFormatWarning, WriteWarning, ImplicitModificationWarning
+from ._warnings import (
+    OldFormatWarning,
+    WriteWarning,
+    ImplicitModificationWarning,
+    ExperimentalFeatureWarning,
+)

 # backwards compat / shortcut for default format
 from ._io import read_h5ad as read
48 changes: 39 additions & 9 deletions anndata/_core/aligned_mapping.py
@@ -1,18 +1,22 @@
 from abc import ABC, abstractmethod
 from collections import abc as cabc
+from copy import copy
 from typing import Union, Optional, Type, ClassVar, TypeVar  # Special types
 from typing import Iterator, Mapping, Sequence  # ABCs
 from typing import Tuple, List, Dict  # Generic base types
+import warnings

 import numpy as np
 import pandas as pd
 from scipy.sparse import spmatrix

-from ..utils import deprecated, ensure_df_homogeneous
+from ..utils import deprecated, ensure_df_homogeneous, dim_len
 from . import raw, anndata
 from .views import as_view
 from .access import ElementRef
 from .index import _subset
+from anndata.compat import AwkArray
+from anndata._warnings import ExperimentalFeatureWarning


 OneDIdx = Union[Sequence[int], Sequence[bool], slice]
@@ -46,15 +50,37 @@ def _ipython_key_completions_(self) -> List[str]:

     def _validate_value(self, val: V, key: str) -> V:
         """Raises an error if value is invalid"""
+        if isinstance(val, AwkArray):
+            warnings.warn(
+                "Support for Awkward Arrays is currently experimental. "
+                "Behavior may change in the future. Please report any issues you may encounter!",
+                ExperimentalFeatureWarning,
+                # stacklevel=3,
+            )
+            # Prevent from showing up every time an awkward array is used
+            # You'd think `once` works, but it doesn't at the repl and in notebooks
+            warnings.filterwarnings(
+                "ignore",
+                category=ExperimentalFeatureWarning,
+                message="Support for Awkward Arrays is currently experimental.*",
+            )
         for i, axis in enumerate(self.axes):
-            if self.parent.shape[axis] != val.shape[i]:
+            if self.parent.shape[axis] != dim_len(val, i):
                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
-                raise ValueError(
-                    f"Value passed for key {key!r} is of incorrect shape. "
-                    f"Values of {self.attrname} must match dimensions "
-                    f"{self.axes} of parent. Value had shape {val.shape} while "
-                    f"it should have had {right_shape}."
-                )
+                actual_shape = tuple(dim_len(val, a) for a, _ in enumerate(self.axes))
+                if actual_shape[i] is None and isinstance(val, AwkArray):
+                    raise ValueError(
+                        f"The AwkwardArray is of variable length in dimension {i}. "
+                        f"Try ak.to_regular(array, {i}) before including the array in AnnData"
+                    )
+                else:
+                    raise ValueError(
+                        f"Value passed for key {key!r} is of incorrect shape. "
+                        f"Values of {self.attrname} must match dimensions "
+                        f"{self.axes} of parent. Value had shape {actual_shape} while "
+                        f"it should have had {right_shape}."
+                    )

         if not self._allow_df and isinstance(val, pd.DataFrame):
             name = self.attrname.title().rstrip("s")
             val = ensure_df_homogeneous(val, f"{name} {key!r}")
@@ -84,7 +110,11 @@ def parent(self) -> Union["anndata.AnnData", "raw.Raw"]:
     def copy(self):
         d = self._actual_class(self.parent, self._axis)
         for k, v in self.items():
-            d[k] = v.copy()
+            if isinstance(v, AwkArray):
+                # Shallow copy since awkward array buffers are immutable
+                d[k] = copy(v)
+            else:
+                d[k] = v.copy()
         return d

     def _view(self, parent: "anndata.AnnData", subset_idx: I):
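
`_validate_value` emits the experimental warning and then immediately installs an "ignore" filter for the same message, so the warning is shown only the first time. The following is a minimal standard-library sketch of that pattern. The `ExperimentalFeatureWarning` class defined here is a stand-in for anndata's class of the same name, and `warn_experimental_once` is a hypothetical helper.

```python
import warnings


class ExperimentalFeatureWarning(Warning):
    """Stand-in for anndata's warning class of the same name."""


_MESSAGE = "Support for Awkward Arrays is currently experimental."


def warn_experimental_once() -> None:
    """Warn, then silence later warnings that start with the same message."""
    warnings.warn(_MESSAGE, ExperimentalFeatureWarning, stacklevel=2)
    # New filters are inserted at the front of the filter list, so this
    # "ignore" rule takes precedence over filters installed earlier.
    warnings.filterwarnings(
        "ignore",
        category=ExperimentalFeatureWarning,
        message=_MESSAGE + ".*",
    )


# Record warnings in an isolated context so the demo leaves no global state.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_experimental_once()
    warn_experimental_once()

print(len(caught))
```

One consequence of this design is that the filter is process-wide: once installed, it also hides the warning for any other caller that emits the same message.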
7 changes: 4 additions & 3 deletions anndata/_core/anndata.py
@@ -45,7 +45,7 @@
 )
 from .sparse_dataset import SparseDataset
 from .. import utils
-from ..utils import convert_to_dict, ensure_df_homogeneous
+from ..utils import convert_to_dict, ensure_df_homogeneous, dim_len
 from ..logging import anndata_logger as logger
 from ..compat import (
     ZarrArray,
@@ -55,6 +55,7 @@
     _move_adj_mtx,
     _overloaded_uns,
     OverloadedDict,
+    AwkArray,
 )


@@ -1861,7 +1862,7 @@ def _check_dimensions(self, key=None):
         if "obsm" in key:
             obsm = self._obsm
             if (
-                not all([o.shape[0] == self._n_obs for o in obsm.values()])
+                not all([dim_len(o, 0) == self._n_obs for o in obsm.values()])
                 and len(obsm.dim_names) != self._n_obs
             ):
                 raise ValueError(
@@ -1871,7 +1872,7 @@ def _check_dimensions(self, key=None):
         if "varm" in key:
             varm = self._varm
             if (
-                not all([v.shape[0] == self._n_vars for v in varm.values()])
+                not all([dim_len(v, 0) == self._n_vars for v in varm.values()])
                 and len(varm.dim_names) != self._n_vars
             ):
                 raise ValueError(
12 changes: 11 additions & 1 deletion anndata/_core/file_backing.py
@@ -8,7 +8,7 @@

 from . import anndata
 from .sparse_dataset import SparseDataset
-from ..compat import ZarrArray, DaskArray
+from ..compat import ZarrArray, DaskArray, AwkArray


 class AnnDataFileManager:
@@ -123,3 +123,13 @@ def _(x, copy=True):
 @to_memory.register(Mapping)
 def _(x: Mapping, copy=True):
     return {k: to_memory(v, copy=copy) for k, v in x.items()}
+
+
+@to_memory.register(AwkArray)
+def _(x, copy=True):
+    from copy import copy as shallow_copy  # alias avoids shadowing the `copy` flag
+
+    if copy:
+        return shallow_copy(x)
+    else:
+        return x
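
The handler above takes a `copy` flag and also needs the standard-library `copy` function, so the two names can collide inside the function body. One way to keep them apart is to import the function under an alias. The sketch below shows this with `functools.singledispatch`, using `list` as a stand-in for the awkward array type; the `to_memory` defined here is a simplified illustration, not anndata's function.

```python
from copy import copy as shallow_copy
from functools import singledispatch


@singledispatch
def to_memory(x, copy: bool = True):
    """Simplified stand-in dispatcher; unregistered types are rejected."""
    raise NotImplementedError(f"No handler for {type(x).__name__}")


@to_memory.register(list)
def _(x, copy: bool = True):
    # The function is named `shallow_copy`, so `copy` still refers to the flag.
    return shallow_copy(x) if copy else x


original = [1, 2, 3]
copied = to_memory(original)           # new list with the same items
same = to_memory(original, copy=False)  # the very same object

print(copied is original, same is original)
```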
9 changes: 8 additions & 1 deletion anndata/_core/index.py
@@ -7,7 +7,7 @@
 import numpy as np
 import pandas as pd
 from scipy.sparse import spmatrix, issparse
-from ..compat import DaskArray, Index, Index1D
+from ..compat import AwkArray, DaskArray, Index, Index1D


 def _normalize_indices(
@@ -145,6 +145,13 @@ def _subset_df(df: pd.DataFrame, subset_idx: Index):
     return df.iloc[subset_idx]


+@_subset.register(AwkArray)
+def _subset_awkarray(a: AwkArray, subset_idx: Index):
+    if all(isinstance(x, cabc.Iterable) for x in subset_idx):
+        subset_idx = np.ix_(*subset_idx)
+    return a[subset_idx]
+
+
 # Registration for SparseDataset occurs in sparse_dataset.py
 @_subset.register(h5py.Dataset)
 def _subset_dataset(d, subset_idx):
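
`_subset_awkarray` passes the index through `np.ix_` when every entry is iterable, which turns per-axis index lists into an outer (cross-product) selection. The small NumPy-only sketch below shows the difference between paired and outer indexing on an ordinary array; the array and index values are made up for illustration.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)
rows, cols = [0, 2], [1, 3]

# Plain fancy indexing pairs the lists up: elements (0, 1) and (2, 3).
paired = a[rows, cols]

# np.ix_ builds open-mesh index arrays, selecting every row/column combination.
outer = a[np.ix_(rows, cols)]

print(paired.tolist())
print(outer.tolist())
```

Subsetting by observations and variables is an outer selection, which is presumably why the index is converted before it is applied.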
88 changes: 81 additions & 7 deletions anndata/_core/merge.py
@@ -18,7 +18,7 @@
     Literal,
 )
 import typing
-from warnings import warn
+from warnings import warn, filterwarnings

 from natsort import natsorted
 import numpy as np
@@ -27,9 +27,10 @@
 from scipy.sparse import spmatrix

 from .anndata import AnnData
-from ..utils import asarray
-from ..compat import DaskArray
+from ..compat import AwkArray, DaskArray
+from ..utils import asarray, dim_len
 from .index import _subset, make_slice
+from anndata._warnings import ExperimentalFeatureWarning

 T = TypeVar("T")

@@ -154,6 +155,13 @@ def equal_sparse(a, b) -> bool:
         return False


+@equal.register(AwkArray)
+def equal_awkward(a, b) -> bool:
+    from ..compat import awkward as ak
+
+    return ak.almost_equal(a, b)
+
+
 def as_sparse(x):
     if not isinstance(x, sparse.spmatrix):
         return sparse.csr_matrix(x)
@@ -366,12 +374,14 @@ def apply(self, el, *, axis, fill_value=None):
         Missing values are to be replaced with `fill_value`.
         """
-        if self.no_change and (el.shape[axis] == len(self.old_idx)):
+        if self.no_change and (dim_len(el, axis) == len(self.old_idx)):
             return el
         if isinstance(el, pd.DataFrame):
             return self._apply_to_df(el, axis=axis, fill_value=fill_value)
         elif isinstance(el, sparse.spmatrix):
             return self._apply_to_sparse(el, axis=axis, fill_value=fill_value)
+        elif isinstance(el, AwkArray):
+            return self._apply_to_awkward(el, axis=axis, fill_value=fill_value)
         elif isinstance(el, DaskArray):
             return self._apply_to_dask_array(el, axis=axis, fill_value=fill_value)
         else:
@@ -468,6 +478,22 @@ def _apply_to_sparse(self, el: spmatrix, *, axis, fill_value=None) -> spmatrix:

         return out

+    def _apply_to_awkward(self, el: AwkArray, *, axis, fill_value=None):
+        import awkward as ak
+
+        if self.no_change:
+            return el
+        elif axis == 1:  # Indexing by field
+            if self.new_idx.isin(self.old_idx).all():  # inner join
+                return el[self.new_idx]
+            else:  # outer join
+                # TODO: this code isn't actually hit, we should refactor
+                raise Exception("This should be unreachable, please open an issue.")
+        else:
+            if len(self.new_idx) > len(self.old_idx):
+                el = ak.pad_none(el, 1, axis=axis)  # axis == 0
+            return el[self.old_idx.get_indexer(self.new_idx)]
+

 def merge_indices(
     inds: Iterable[pd.Index], join: Literal["inner", "outer"]
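
`_apply_to_awkward` relies on two pandas `Index` methods: `isin` to decide whether the new index is fully contained in the old one, and `get_indexer` to translate new labels into positions in the old index. The pandas-only sketch below shows what those calls return; the labels are invented for illustration.

```python
import pandas as pd

old_idx = pd.Index(["a", "b", "c"])
new_idx = pd.Index(["c", "a", "d"])

# True only if every new label already exists in the old index.
all_present = bool(new_idx.isin(old_idx).all())

# Position of each new label within the old index; -1 marks "not found".
positions = old_idx.get_indexer(new_idx)

print(all_present)
print(positions.tolist())
```

Because `-1` is also a valid positional index in Python (the last element), code that consumes these positions has to treat missing labels explicitly.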
@@ -534,6 +560,17 @@ def concat_arrays(arrays, reindexers, axis=0, index=None, fill_value=None):
         )
         df.index = index
         return df
+    elif any(isinstance(a, AwkArray) for a in arrays):
+        from ..compat import awkward as ak
+
+        if not all(
+            isinstance(a, AwkArray) or a is MissingVal or 0 in a.shape for a in arrays
+        ):
+            raise NotImplementedError(
+                "Cannot concatenate an AwkwardArray with other array types."
+            )
+
+        return ak.concatenate([f(a) for f, a in zip(reindexers, arrays)], axis=axis)
     elif any(isinstance(a, sparse.spmatrix) for a in arrays):
         sparse_stack = (sparse.vstack, sparse.hstack)[axis]
         return sparse_stack(
@@ -579,6 +616,15 @@ def gen_inner_reindexers(els, new_index, axis: Literal[0, 1] = 0):
             lambda x, y: x.intersection(y), (df_indices(el) for el in els)
         )
         reindexers = [Reindexer(df_indices(el), common_ind) for el in els]
+    elif any(isinstance(el, AwkArray) for el in els if not_missing(el)):
+        if not all(isinstance(el, AwkArray) for el in els if not_missing(el)):
+            raise NotImplementedError(
+                "Cannot concatenate an AwkwardArray with other array types."
+            )
+        common_keys = intersect_keys(el.fields for el in els)
+        reindexers = [
+            Reindexer(pd.Index(el.fields), pd.Index(list(common_keys))) for el in els
+        ]
     else:
         min_ind = min(el.shape[alt_axis] for el in els)
         reindexers = [
Expand All @@ -596,10 +642,38 @@ def gen_outer_reindexers(els, shapes, new_index: pd.Index, *, axis=0):
else (lambda _, shape=shape: pd.DataFrame(index=range(shape)))
for el, shape in zip(els, shapes)
]
else:
# if fill_value is None:
# fill_value = default_fill_value(els)
elif any(isinstance(el, AwkArray) for el in els if not_missing(el)):
import awkward as ak

if not all(isinstance(el, AwkArray) for el in els if not_missing(el)):
raise NotImplementedError(
"Cannot concatenate an AwkwardArray with other array types."
)
warn(
"Outer joins on awkward.Arrays will have different return values in the future."
"For details, and to offer input, please see:\n\n\t"
"https://github.com/scverse/anndata/issues/898",
ExperimentalFeatureWarning,
)
filterwarnings(
"ignore",
category=ExperimentalFeatureWarning,
message=r"Outer joins on awkward.Arrays will have different return values.*",
)
# all_keys = union_keys(el.fields for el in els if not_missing(el))
reindexers = []
for el in els:
if not_missing(el):
reindexers.append(lambda x: x)
else:
reindexers.append(
lambda x: ak.pad_none(
ak.Array([]),
len(x),
0,
)
)
else:
max_col = max(el.shape[1] for el in els if not_missing(el))
orig_cols = [el.shape[1] if not_missing(el) else 0 for el in els]
reindexers = [