add VLenNDArray #200

sofroniewn · 2019-09-06T23:30:42Z

This PR closes #199 by adding support for ragged nD arrays, where each array can be a different shape and dimensionality. It does this using the scheme described in #199.

Its usage is as follows:

import numcodecs
import numpy as np
x = np.array([[1, 3, 5], [[4, 3], [2, 1]], [[7, 9]]], dtype='object')
codec = numcodecs.VLenNDArray('<i4')
codec.decode(codec.encode(x))
array([array([1, 3, 5], dtype=int32), array([[4, 3], [2, 1]], dtype=int32),
       array([[7, 9]], dtype=int32)], dtype=object)
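For anyone trying the example: whether np.array builds a 1-D object array from nested lists depends on the list shapes and the NumPy version. A minimal sketch of a more explicit construction follows. The helper name make_ragged is hypothetical and not part of this PR.

```python
import numpy as np


def make_ragged(items, dtype="<i4"):
    """Pack arrays of differing shape into a 1-D object array (hypothetical helper)."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        # Assigning by integer index stores the whole array as a single element.
        out[i] = np.asarray(item, dtype=dtype)
    return out


x = make_ragged([[1, 3, 5], [[4, 3], [2, 1]], [[7, 9]]])
print([a.shape for a in x])  # [(3,), (2, 2), (1, 2)]
```

Preallocating with np.empty and filling element by element avoids any attempt by NumPy to broadcast the inputs into a regular array.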

I have not added tests yet, but will do so. I will adapt the tests from test_vlen_array.py.

Any comments on the implementation are appreciated. I'm pretty new to this code base, so may have made some wrong choices.

Oh and I also seem to have a bunch of .c files that came when I ran cythonize -a -i ./numcodecs/vlen_nd.pyx that I may or may not have wanted to change; any advice around those would be appreciated too.

TODO:

Unit tests and/or doctests in docstrings
tox -e py37 passes locally
tox -e py27 passes locally
Docstrings and API docs for any new/modified user-facing classes and functions
Changes documented in docs/release.rst
tox -e docs passes locally
AppVeyor and Travis CI passes
Test coverage to 100% (Coveralls passes)

sofroniewn · 2019-09-07T00:33:59Z

I have now added tests in test_vlen_ndarray.py, based on the tests in test_vlen_array.py.

sofroniewn · 2019-09-07T00:52:31Z

hmm - tests are failing on from numcodecs.vlen_nd import VLenNDArray, but that works fine locally. I'm not quite sure what's going on there.

sofroniewn · 2019-09-07T17:35:20Z

Most tests pass now - I had to add the fixtures folder, fix the cython metadata in the .c file (which had included some paths on my computer) and I had to add a setup.py extension.

There's still one docstring test failing for py3.7 because of where the output line gets split.
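One common way to make a doctest insensitive to where output lines wrap is doctest's NORMALIZE_WHITESPACE option. Whether that is the right fix for the failing docstring in this PR is not known. The function show_pairs below is a hypothetical stand-in, not code from numcodecs.

```python
import doctest


def show_pairs():
    """Hypothetical stand-in for a docstring whose expected output spans two lines.

    >>> show_pairs()  # doctest: +NORMALIZE_WHITESPACE
    [(1, 'a'),
     (2, 'b')]
    """
    return [(1, 'a'), (2, 'b')]


def count_failures(func):
    """Run the doctests in func's docstring and return how many fail."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    return doctest.DocTestRunner(verbose=False).run(test).failed


print(count_failures(show_pairs))
```

With the directive in place, any run of whitespace in the expected output matches any run of whitespace in the actual output, so a different line split no longer causes a failure.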

sofroniewn · 2019-09-07T17:46:45Z

I also want to note that when I try and run pytest -v numcodecs locally I get the following error messages which I think pertain to parts of the codebase I'm not trying to interact with and were likely due to problems with my installation - which followed the procedure described in your contributing guide (including running pip install -r requirements_dev.txt and python setup.py build_ext --inplace, but without the virtual environment):

____________________________________________________________________________ ERROR collecting numcodecs/tests/test_blosc.py ____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_blosc.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_blosc.py:12: in <module>
    from numcodecs import blosc
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so, 2): Symbol not found: _blosc_cbuffer_complib
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/blosc.cpython-37m-darwin.so
_____________________________________________________________________________ ERROR collecting numcodecs/tests/test_lz4.py _____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_lz4.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_lz4.py:9: in <module>
    from numcodecs.lz4 import LZ4
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so, 2): Symbol not found: _LZ4_compressBound
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/lz4.cpython-37m-darwin.so
____________________________________________________________________________ ERROR collecting numcodecs/tests/test_zstd.py _____________________________________________________________________________
ImportError while importing test module '/Users/nicholassofroniew/Github/numcodecs/numcodecs/tests/test_zstd.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
numcodecs/tests/test_zstd.py:9: in <module>
    from numcodecs.zstd import Zstd
E   ImportError: dlopen(/Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so, 2): Symbol not found: _ZSTD_compress
E     Referenced from: /Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so
E     Expected in: flat namespace
E    in /Users/nicholassofroniew/Github/numcodecs/numcodecs/zstd.cpython-37m-darwin.so

NumesSanguis · 2020-03-16T01:57:35Z

@sofroniewn Thank you for your effort in trying to support ragged nD arrays. Just want to ask if there is any update on this feature?

sofroniewn · 2020-03-16T02:00:16Z

No update - maybe I can ping @jakirkham and @ryan-williams to take another look, help me get the tests passing, and resolve conflicts. I'll note that I think the conflicts and test failures come from problems with my dev environment, not from the code I'm trying to add.

alimanfoo · 2020-03-25T11:49:20Z

Hi @sofroniewn, just to apologise for not looking at this sooner.

Re the conflicting .c files, those will just be due to changes in non-essential information that gets output by cython and which depends on which computer the C files were generated on. I'd suggest just removing any changes to those C files from this PR.

alimanfoo

Hi @sofroniewn, apologies again for slow review.

Implementation looks fine to me. Only question is whether to use a single byte for the number of dimensions rather than 4 byte int, could save a bit of space.

A couple of small comments on docstrings.

Also would need some API docs.

alimanfoo · 2020-03-25T11:51:49Z

numcodecs/vlen_nd.pyx

+
+def check_out_param(out, n_items):
+    if not isinstance(out, np.ndarray):
+        raise TypeError('out must be 1-dimensional array')


Suggested change

-        raise TypeError('out must be 1-dimensional array')
+        raise TypeError('out must be a numpy array')
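For context, a sketch of how the check might read with the suggested message applied. Only the isinstance test appears in the diff excerpt. The dtype and size checks below are assumptions added for illustration and may differ from the actual code in vlen_nd.pyx.

```python
import numpy as np


def check_out_param(out, n_items):
    # Shown in the diff: the check is on the type, so the message names the type.
    if not isinstance(out, np.ndarray):
        raise TypeError('out must be a numpy array')
    # Assumed checks, not shown in the excerpt above.
    if out.dtype != object:
        raise ValueError('out must be an object array')
    out = out.reshape(-1)
    if out.shape[0] < n_items:
        raise ValueError('out is too small')
    return out


out = check_out_param(np.empty(3, dtype=object), 2)
print(out.shape)  # (3,)
```

The original message mentioned dimensionality even though the check only tests the type, which is why the review proposes the new wording.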

alimanfoo · 2020-03-25T11:52:36Z

numcodecs/vlen_nd.pyx

+
+
+class VLenNDArray(Codec):
+    """Encode variable-length n-dimensional arrays via UTF-8.


Suggested change

-    """Encode variable-length n-dimensional arrays via UTF-8.
+    """Encode an array of variable-length n-dimensional arrays.

alimanfoo · 2020-03-25T11:55:50Z

numcodecs/vlen_nd.pyx

+            data_length += l + 4 * (n + 2)  # 4 bytes to store number of
+                                            # dimensions, 4 bytes per
+                                            # dimension to store dimension
+                                            # and 4 bytes to store the length


Could use a single byte to store the number of dimensions?
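To make the size question concrete, here is a pure-Python sketch of a per-item layout matching the comment in the diff: a dimension count, 4 bytes per dimension, a 4-byte data length, then the data. The field order, the function names, and the ndim_fmt parameter are assumptions for illustration. This is not the Cython code from the PR.

```python
import struct

import numpy as np


def encode_item(arr, ndim_fmt="<I"):
    """Serialise one array as: ndim | shape (4 bytes per dim) | nbytes | data."""
    arr = np.ascontiguousarray(arr, dtype="<i4")
    data = arr.tobytes()
    header = struct.pack(ndim_fmt, arr.ndim)
    header += struct.pack("<%dI" % arr.ndim, *arr.shape)
    header += struct.pack("<I", len(data))
    return header + data


def decode_item(buf, ndim_fmt="<I"):
    """Inverse of encode_item for '<i4' data."""
    (ndim,) = struct.unpack_from(ndim_fmt, buf, 0)
    offset = struct.calcsize(ndim_fmt)
    shape = struct.unpack_from("<%dI" % ndim, buf, offset)
    offset += 4 * ndim
    (nbytes,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    flat = np.frombuffer(buf, dtype="<i4", count=nbytes // 4, offset=offset)
    return flat.reshape(shape)


a = np.array([[4, 3], [2, 1]], dtype="<i4")
four_byte = encode_item(a, "<I")  # 16 data bytes + 4 * (2 + 2) header bytes
one_byte = encode_item(a, "<B")   # same layout with a 1-byte dimension count
print(len(four_byte), len(one_byte))  # 32 29
```

Under these assumptions a 1-byte count saves 3 bytes per item and caps the number of dimensions at 255.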

alimanfoo · 2020-03-25T12:06:07Z

> I also want to note that when I try and run pytest -v numcodecs locally I get the following error messages which I think pertain to parts of the codebase I'm not trying to interact with and were likely due to problems with my installation

These error messages are a bit odd; they suggest some problem with how the other extension modules were compiled. I'm afraid I haven't got anything very intelligent to suggest, other than cleaning out all the .so and .c files and trying a full build from scratch.
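A cautious way to act on that suggestion is to list the candidate files before deleting anything. The helper below is a hypothetical sketch. It treats a .c file as generated only when a .pyx with the same name sits beside it, which is an assumption about the repository layout.

```python
from pathlib import Path


def find_build_artifacts(package_dir):
    """Return compiled extensions and Cython-generated C files in package_dir."""
    root = Path(package_dir)
    artifacts = list(root.glob("*.so")) + list(root.glob("*.pyd"))
    # Assumption: a .c file next to a same-named .pyx was produced by Cython.
    for pyx in root.glob("*.pyx"):
        c_file = pyx.with_suffix(".c")
        if c_file.exists():
            artifacts.append(c_file)
    return sorted(artifacts)


# Dry run: review the list, then remove each path with Path.unlink() if it looks right.
for path in find_build_artifacts("numcodecs"):
    print(path)
```

After removing the files, rerunning python setup.py build_ext --inplace should regenerate them, provided Cython is installed.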

ericpre · 2022-03-16T19:25:58Z

@sofroniewn, by any chance, would you still have interest in finishing this PR? :)

martindurant · 2022-03-16T20:12:01Z

This idea could probably be superseded by the proposal for an awkward-zarr project for GSoC.

sofroniewn · 2022-03-22T04:15:20Z

So sorry both for dropping the ball on this - if there are already plans for this to be superseded please press on with those, or if you have a contributor who wants to take over this PR, please do. Thanks!!

NumesSanguis · 2022-03-22T06:14:27Z

@martindurant Do you have a link to that GSoC proposal? I could only find this open issue, but not sure if that's related?:
zarr-developers/community#42

MSanKeys963 · 2022-03-22T11:05:39Z

@NumesSanguis here's the ideas-list.md and here's the Awkward Array project details.

sofroniewn added 2 commits September 6, 2019 16:22

add VLenNDArray

d285f4f

add test

b151757

sofroniewn added 3 commits September 7, 2019 09:23

add fixture

7baf6ee

fix cython metadata

ccd3456

add setup.py extension

d2b6bb4

NumesSanguis mentioned this pull request Mar 23, 2020

can VLenArray support 2D arrays #199

Open

alimanfoo reviewed Mar 25, 2020


CSSFrancis mentioned this pull request Mar 15, 2022

Ragged Arrays with Ragged Dim>2 don't Save hyperspy/hyperspy#2904

Closed

meggart mentioned this pull request Apr 14, 2023

a bug fix for the general ragged array JuliaIO/Zarr.jl#112

Open

dstansby added the "New codec" (Suggestion for a new codec) label Aug 21, 2024
