json de-/serialisation from/to string or file #1207

ianna · 2022-01-07T13:37:44Z

ak._v2.operations.convert.from_json
ak._v2.operations.convert.from_json_file
ak._v2.operations.convert.to_json
ak._v2.operations.convert.to_json_file

codecov · 2022-01-07T13:47:51Z

Codecov Report

Merging #1207 (1dd4594) into main (8e68200) will decrease coverage by 0.09%.
The diff coverage is 69.34%.

Impacted Files	Coverage Δ
src/awkward/_v2/_util.py	`79.13% <7.14%> (-4.67%)`	⬇️
src/awkward/_v2/contents/bytemaskedarray.py	`80.82% <9.09%> (-2.23%)`	⬇️
src/awkward/_v2/contents/indexedarray.py	`59.42% <10.00%> (-1.23%)`	⬇️
src/awkward/_v2/contents/unmaskedarray.py	`55.25% <20.00%> (-0.83%)`	⬇️
src/awkward/_v2/operations/io/ak_to_json_file.py	`47.05% <43.75%> (-27.95%)`	⬇️
src/awkward/_v2/contents/emptyarray.py	`69.14% <50.00%> (-0.23%)`	⬇️
src/awkward/_v2/contents/listarray.py	`87.76% <50.00%> (-0.19%)`	⬇️
src/awkward/_v2/contents/content.py	`81.59% <58.33%> (-0.47%)`	⬇️
src/awkward/_v2/contents/regulararray.py	`81.97% <58.82%> (-0.88%)`	⬇️
src/awkward/_v2/operations/convert/ak_to_json.py	`68.75% <65.51%> (-6.25%)`	⬇️
... and 11 more

jpivarski · 2022-01-21T16:35:38Z

@ianna Be careful to exclude the pybind11 directory from your PR, so that you don't accidentally downgrade it. (It's happened before: see #1223.)

This will fix it in your local repo, and then check it in:

cd pybind11/
git checkout master
git pull origin master
git checkout 45f792efdd92da094548e2095d6efdbfa7e536ee   # ignore warnings about detached HEAD; we know...
cd ..
git add pybind11
git commit -m "Leave 'pybind11' at version 2.9.0."
git push

ianna · 2022-01-31T10:53:47Z

@jpivarski - Here is a reference for the PR as it is now. Note, this v2 to_json code does not have the optimisations we have discussed, so this is the worst estimate.

v2 from_json is 4x faster than v1:

In [4]: %%timeit
   ...: ak.from_json("[[1.1, 2.2, 3], [], [4, 5.5]]")
   ...: 
   ...: 
670 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: %%timeit
   ...: ak._v2.operations.convert.from_json("[[1.1, 2.2, 3], [], [4, 5.5]]")
   ...: 
   ...: 
152 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

v2 to_json is 2x slower than v1:

In [7]: b = ak.layout.NumpyArray(
   ...:     np.array(
   ...:         [
   ...:             [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]],
   ...:             [[10.1, 20.2, 30.3], [40.4, 50.5, 60.6]],
   ...:         ]
   ...:     )
   ...: )

In [8]: %%timeit
   ...: ak.to_json(b)
   ...: 
   ...: 
32.7 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [9]: b2 = ak._v2.contents.NumpyArray(
   ...:     np.array(
   ...:         [
   ...:             [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]],
   ...:             [[10.1, 20.2, 30.3], [40.4, 50.5, 60.6]],
   ...:         ]
   ...:     )
   ...: )

In [10]: %%timeit
    ...: ak._v2.operations.convert.to_json(b2)
    ...: 
    ...: 
62.3 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

jpivarski · 2022-01-31T20:25:29Z

These are very peculiar test cases—they're serializing/deserializing a very small array many times. In a simple performance model with O(n) asymptotic dependence and O(1) constant-time overhead, these tests are only measuring the O(1) part. Usually, we're only interested in the O(n) part.

for more information, see https://pre-commit.ci

- accept newline as line delimiter '\n' (0x0A) as well as carriage return and newline '\r\n' (0x0D0A) - ignore empty lines '\n\n' - raise an error if JSON is nor parsable

ianna · 2022-02-03T14:27:26Z

@jpivarski - The v2 json de-/serialisation functions in this PR use Python json and do not depend on RapidJSON. The latter can be removed when the v1 code is phased out!

One issue is that Python json decoder understands 'NaN', 'Infinity', and '-Infinity' as their corresponding float values. Allowing the decoder understand configurable strings is doable, but a bit tricky (https://bugs.python.org/issue29992) The question is how important is to keep this functionality?

ianna · 2022-02-03T14:46:42Z

@jpivarski - The v2 json de-/serialisation functions in this PR use Python json and do not depend on RapidJSON. The latter can be removed when the v1 code is phased out!

One issue is that Python json decoder understands 'NaN', 'Infinity', and '-Infinity' as their corresponding float values. Allowing the decoder understand configurable strings is doable, but a bit tricky (https://bugs.python.org/issue29992) The question is how important is to keep this functionality?

I think, I'll implement it as a post processing step in the recursively applied action.

jpivarski · 2022-02-03T14:51:35Z

The Python JSON decoder recognizes non-quoted strings as nan and infinity:

>>> json.loads('[1.1, 2.2, NaN, 3.3]')
[1.1, 2.2, nan, 3.3]
>>> json.loads('[1.1, 2.2, "NaN", 3.3]')
[1.1, 2.2, 'NaN', 3.3]

which is different from what the arguments to to_json/from_json do. Python's JSON encoder/decoder is writing and reading invalid JSON (a feature that can be turned off with some flag, but then NaN/infinity become errors).

The arguments to to_json/from_json write and read valid JSON, but change its interpretation as an Awkward Array. Without the string-converting argument, '[1.1, 2.2, "NaN", 3.3]' would be interpreted as a union array; with it, it's interpreted as a numeric array with a NaN value. The same is true of the complex-conversion argument: [{"real": 1.1, "imag": 2.2}] is valid JSON that could be interpreted as a record array (by default) or as a numeric array with complex type (with the argument).

If we keep in our arguments, there will be two ways to deal with NaN/infinity, by quoting (our arguments) or not quoting (what the Python json library does by default). Our argument for complex numbers is the only way to deal with complex numbers, though.

On RapidJSON, I hadn't been planning to drop it—that was how we could deal with continuous streams of JSON input or output. Also, the JSONSchema-based reader (a v2 thing) depends on it. But we can leave the PR like this and do its reintroduction at a later time. One way or the other, we did have to change the way we use it, so removing it and reintroducing it will make that a cleaner break.

jpivarski · 2022-02-03T14:53:01Z

I think, I'll implement it as a post processing step in the recursively applied action.

We should not postprocess the JSON string—that's too dangerous. Getting it right essentially involves parsing and reserializing JSON, which duplicates that effort.

ianna · 2022-02-03T15:34:39Z

I think, I'll implement it as a post processing step in the recursively applied action.

We should not postprocess the JSON string—that's too dangerous. Getting it right essentially involves parsing and reserializing JSON, which duplicates that effort.

No, I was thinking of the following:

    _CONSTANTS = {}
    if nan_string is not None:
        _CONSTANTS[nan_string] = json.decoder.NaN
    if infinity_string is not None:
        _CONSTANTS[infinity_string] = json.decoder.PosInf
    if minus_infinity_string is not None:
        _CONSTANTS[minus_infinity_string] = json.decoder.NegInf

    def replace_constants_action(node, **kwargs):
        if isinstance(node, ak._v2.contents.ListOffsetArray):
            if (
                node.parameter("__array__") == "bytestring"
                or node.parameter("__array__") == "string"
            ):
                out = node._awkward_strings_to_constants(_CONSTANTS)
                return None if out is None else ak._v2.contents.NumpyArray(out)
        else:
            return None

    if len(_CONSTANTS) > 0:
        layout = layout.recursively_apply(replace_constants_action)

and ListOffsetArray function:

    def _awkward_strings_to_constants(self, constants):
        if (
            self.parameter("__array__") == "bytestring"
            or self.parameter("__array__") == "string"
        ):
            content = ak._v2._util.tobytes(self._content.data)
            starts, stops = self.starts, self.stops
            out = [None] * starts.length
            result = [None] * starts.length
            for i in range(starts.length):
                out[i] = content[starts[i] : stops[i]].decode(errors="surrogateescape")
                if out[i] in constants:
                    result[i] = constants[out[i]]
            return None if result[0] is None else result

jpivarski · 2022-02-03T16:49:38Z

More like this: Awkward → Awkward preprocessing and postprocessing:

def nonfinite_to_union(node, **kwargs):
    if isinstance(node, ak._v2.contents.NumpyArray):
        is_nonfinite = ~node.nplike.isfinite(node.data)  # true for inf, -inf, nan
        is_posinf = is_nonfinite & (node.data > 0)  # true for inf only
        is_neginf = is_nonfinite & (node.data < 0)  # true for -inf only
        is_nan = node.nplike.isnan(node.data)  # true for nan only
        tags = node.nplike.zeros(len(node), np.int8)
        tags[is_nonfinite] = 1
        index = node.nplike.arange(len(node), dtype=np.int64)
        index[is_posinf] = 0
        index[is_neginf] = 1
        index[is_nan] = 2
        return ak._v2.contents.UnionArray(
            ak._v2.index.Index8(tags),
            ak._v2.index.Index64(index),
            [
                node,
                ak._v2.operations.convert.from_iter(
                    ["inf", "-inf", "nan"], highlevel=False
                ),
            ],
        )

def union_to_nonfinite(node, **kwargs):
    if isinstance(node, ak._v2.contents.UnionArray) and len(node.contents) == 2:
        for i in (0, 1):
            not_i = 0 if i == 1 else 1
            if (
                isinstance(node.contents[not_i], ak._v2.contents.NumpyArray)
                and len(node.contents[not_i].shape) == 1
                and node.contents[i].parameter("__array__") == "string"   # not bytestring
                and node.contents[i].length <= 3
                and set(node.contents[i].to_list()).issubset({"inf", "-inf", "nan"})
            ):
                is_string = (node.tags.data == i)
                not_string = ~is_string
                strings = node.contents[i].to_list()   # we know this is small
                finite_numbers = node.contents[not_i].data
                all_numbers = node.nplike.empty(len(node), finite_numbers.dtype)
                all_numbers[not_string] = finite_numbers[node.index.data[not_string]]
                try:
                    posinf_index = (node.index.data == strings.index("inf"))
                    all_numbers[posinf_index & is_string] = np.inf
                except ValueError:   # in case the union doesn't have "inf"
                    pass
                try:
                    neginf_index = (node.index.data == strings.index("-inf"))
                    all_numbers[neginf_index & is_string] = -np.inf
                except ValueError:   # in case the union doesn't have "-inf"
                    pass
                try:
                    nan_index = (node.index.data == strings.index("nan"))
                    all_numbers[nan_index & is_string] = np.nan
                except ValueError:   # in case the union doesn't have "nan"
                    pass
                return ak._v2.contents.NumpyArray(all_numbers)

Given

with_nonfinite = ak._v2.from_iter([1.1, 2.2, float("inf"), 3.3, -float("inf"), 4.4, float("nan"), 5.5, 6.6])
without_nonfinite = ak._v2.from_iter([1.1, 2.2, 3.3, 4.4, 5.5, 6.6])

They can be applied like

ak._v2.Array(with_nonfinite.layout.recursively_apply(nonfinite_to_union, numpy_to_regular=True))
ak._v2.Array(without_nonfinite.layout.recursively_apply(nonfinite_to_union, numpy_to_regular=True))

(The actions assume that the NumpyArray is flat, so numpy_to_regular is needed.)

This also assumes that nplike has isfinite and isnan, which it doesn't yet, but that's easy to add. The above actions also don't accept user-supplied strings, but that's just a matter of replacing these constant strings.

ianna · 2022-02-04T15:10:11Z

@jpivarski - thanks! All strings are now user-configurable, the tests pass. Note, the _awkward_strings_to_nonfinite converts JSON to either a NumpyArray or a UnionArray. The latter is needed if there are other strings a user did not request to convert. Please, see tests/v2/test_0437-stream-of-many-json-files.py:

    # read json file containing multiple definitions of 'nan' and 'inf'
    # user-defined strings
    # replace can only work for one string definition
    array = ak._v2.operations.io.from_json_file(
        os.path.join(path, "samples/test-nan-inf.json"),
        infinity_string="Infinity",
        nan_string="None at all",
    )

Actually, now we could introduce multiple strings such as infinity_strings={"Infinity", "Inf", "inf", "бесконечность"}

Some tests were removed as ndjson noncompliant. The JSON decoder class implements ndjson specs: see https://github.com/ndjson/ndjson-spec

- accept newline as line delimiter '\n' (0x0A) as well as carriage return and newline '\r\n' (0x0D0A)
- ignore empty lines '\n\n'
- raise an error if JSON is nor parsable

Please, have a look at the PR when you have time and comment (or commit directly to the branch if you feel like ;-) Thanks!

jpivarski

There are a lot of improvements in this PR that I want to take as-is: the std::move of GrowableBuffers and general clean-up of the C++, the clean-up of tests to exclude examples we no longer support, and the pre/post-processing of arrays for бесконечный, не число, and комплексное numbers. :)

Sometime after this PR, I plan to revisit to/from_json to pass file-like objects to RapidJSON, integrate the JSONSchema case as an argument to from_json, rather than a separate function, and make newline-delimited JSON something that the user has to ask for (either as a separate function or as an argument). We won't be interpreting strings as filenames (so the path handling in _v2/_util won't be needed). But it will be much easier to do that starting from this PR.

Sorry that I gave conflicting/contradictory suggestions during development—this is a problem that needs a lot of design decisions, since there are so many cases involved in JSON-handling (data that isn't directly encodable, newline vs not newline, files vs strings, Python's builtin module vs RapidJSON).

ianna · 2022-02-04T16:16:54Z

Here is an idea (not mine) how to read from a JSON stream using Python json: consume opening [, call json.raw_decode(s) on a bit starting with the first { to get an object and its ending offset, then consume a comma and whitespace in between, repeat.

ianna marked this pull request as draft January 7, 2022 13:37

ianna changed the title ~~from_json and to_json for v2~~ convert string, stream, or file from/to json for v2 Jan 7, 2022

ianna force-pushed the ianna/from_to_json_v2 branch from f14250f to 4b9a541 Compare January 20, 2022 10:50

jpivarski mentioned this pull request Jan 21, 2022

style: pyupgrade to 3.6 #1244

Merged

agoose77 mentioned this pull request Jan 25, 2022

Fix: support mixed array types in NumpyLike.to_rectilinear #1248

Merged

ianna force-pushed the ianna/from_to_json_v2 branch 2 times, most recently from ba88860 to d507810 Compare January 25, 2022 12:59

ianna and others added 18 commits February 2, 2022 08:48

move unique_ptr and cleanup unused

b4678ca

commit before updating to master

da493fa

remove warning

2a99b15

ak_from_json and tests added

2f97132

test added

d046690

pre-commit formatting

cf44c86

add a test

ff2f77a

add tests

b69f8b5

Leave 'pybind11' at version 2.9.0.

b8315c6

use the same functions for both versions

2ccf1e8

[pre-commit.ci] auto fixes from pre-commit.com hooks

fc02d4b

for more information, see https://pre-commit.ci

rebase and fix an error reported by pre commit

f307dea

add tests to read from json files

3ef9c4b

cleanup

2d6acec

to json and tests

b682584

pass parameters

6ea5abf

to_json without special encoder class

55cd866

rebase and pre-commit

403d41c

ianna force-pushed the ianna/from_to_json_v2 branch from c9636b6 to 403d41c Compare February 2, 2022 07:51

ianna added 2 commits February 2, 2022 09:46

add test for complex with nan and inf

3124876

cleanup

a3742e4

ianna force-pushed the ianna/from_to_json_v2 branch from dcab232 to a3742e4 Compare February 2, 2022 10:01

ianna added 7 commits February 2, 2022 13:07

disable pylint's unspecified-encoding when opening a binary file

0a4d28b

read file as text and add encoding

19ff483

change encoding to utf-8

4e3bf95

add encoding

66d589f

drop RapidJson for from_json, first go at ndjson:

1b114e5

- accept newline as line delimiter '\n' (0x0A) as well as carriage return and newline '\r\n' (0x0D0A) - ignore empty lines '\n\n' - raise an error if JSON is nor parsable

cleanup

4ae2a39

remove dependency on rapidjson, implement ndjson specs

043576c

ianna added 3 commits February 4, 2022 11:51

user-defined strings postprocessing

e3b4bfc

final cleanup

a686b18

all done

1dd4594

ianna changed the title ~~convert string, stream, or file from/to json for v2~~ json de-/serialisation from/to string or file Feb 4, 2022

ianna marked this pull request as ready for review February 4, 2022 15:10

jpivarski approved these changes Feb 4, 2022

View reviewed changes

jpivarski merged commit dde5c91 into main Feb 4, 2022

jpivarski deleted the ianna/from_to_json_v2 branch February 4, 2022 16:07

agoose77 mentioned this pull request Aug 3, 2022

Development: use standardised PR title prefixes? #1577

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json de-/serialisation from/to string or file #1207

json de-/serialisation from/to string or file #1207

ianna commented Jan 7, 2022 •

edited

Loading

codecov bot commented Jan 7, 2022 •

edited

Loading

jpivarski commented Jan 21, 2022

ianna commented Jan 31, 2022

jpivarski commented Jan 31, 2022

ianna commented Feb 3, 2022

ianna commented Feb 3, 2022

jpivarski commented Feb 3, 2022

jpivarski commented Feb 3, 2022

ianna commented Feb 3, 2022 •

edited

Loading

jpivarski commented Feb 3, 2022

ianna commented Feb 4, 2022

jpivarski left a comment

ianna commented Feb 4, 2022

json de-/serialisation from/to string or file #1207

json de-/serialisation from/to string or file #1207

Conversation

ianna commented Jan 7, 2022 • edited Loading

codecov bot commented Jan 7, 2022 • edited Loading

Codecov Report

jpivarski commented Jan 21, 2022

ianna commented Jan 31, 2022

jpivarski commented Jan 31, 2022

ianna commented Feb 3, 2022

ianna commented Feb 3, 2022

jpivarski commented Feb 3, 2022

jpivarski commented Feb 3, 2022

ianna commented Feb 3, 2022 • edited Loading

jpivarski commented Feb 3, 2022

ianna commented Feb 4, 2022

jpivarski left a comment

Choose a reason for hiding this comment

ianna commented Feb 4, 2022

ianna commented Jan 7, 2022 •

edited

Loading

codecov bot commented Jan 7, 2022 •

edited

Loading

ianna commented Feb 3, 2022 •

edited

Loading