Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json de-/serialisation from/to string or file #1207

Merged
merged 30 commits into from
Feb 4, 2022
Merged

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Jan 7, 2022

  • ak._v2.operations.convert.from_json
  • ak._v2.operations.convert.from_json_file
  • ak._v2.operations.convert.to_json
  • ak._v2.operations.convert.to_json_file

@ianna ianna marked this pull request as draft January 7, 2022 13:37
@ianna ianna changed the title from_json and to_json for v2 convert string, stream, or file from/to json for v2 Jan 7, 2022
@codecov
Copy link

codecov bot commented Jan 7, 2022

Codecov Report

Merging #1207 (1dd4594) into main (8e68200) will decrease coverage by 0.09%.
The diff coverage is 69.34%.

Impacted Files Coverage Δ
src/awkward/_v2/_util.py 79.13% <7.14%> (-4.67%) ⬇️
src/awkward/_v2/contents/bytemaskedarray.py 80.82% <9.09%> (-2.23%) ⬇️
src/awkward/_v2/contents/indexedarray.py 59.42% <10.00%> (-1.23%) ⬇️
src/awkward/_v2/contents/unmaskedarray.py 55.25% <20.00%> (-0.83%) ⬇️
src/awkward/_v2/operations/io/ak_to_json_file.py 47.05% <43.75%> (-27.95%) ⬇️
src/awkward/_v2/contents/emptyarray.py 69.14% <50.00%> (-0.23%) ⬇️
src/awkward/_v2/contents/listarray.py 87.76% <50.00%> (-0.19%) ⬇️
src/awkward/_v2/contents/content.py 81.59% <58.33%> (-0.47%) ⬇️
src/awkward/_v2/contents/regulararray.py 81.97% <58.82%> (-0.88%) ⬇️
src/awkward/_v2/operations/convert/ak_to_json.py 68.75% <65.51%> (-6.25%) ⬇️
... and 11 more

@jpivarski
Copy link
Member

@ianna Be careful to exclude the pybind11 directory from your PR, so that you don't accidentally downgrade it. (It's happened before: see #1223.)

This will fix it in your local repo, and then check it in:

cd pybind11/
git checkout master
git pull origin master
git checkout 45f792efdd92da094548e2095d6efdbfa7e536ee   # ignore warnings about detached HEAD; we know...
cd ..
git add pybind11
git commit -m "Leave 'pybind11' at version 2.9.0."
git push

@ianna
Copy link
Collaborator Author

ianna commented Jan 31, 2022

@jpivarski - Here is a reference for the PR as it is now. Note, this v2 to_json code does not have the optimisations we have discussed, so this is the worst estimate.

v2 from_json is 4x faster than v1:

In [4]: %%timeit
   ...: ak.from_json("[[1.1, 2.2, 3], [], [4, 5.5]]")
   ...: 
   ...: 
670 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: %%timeit
   ...: ak._v2.operations.convert.from_json("[[1.1, 2.2, 3], [], [4, 5.5]]")
   ...: 
   ...: 
152 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

v2 to_json is 2x slower than v1:

In [7]: b = ak.layout.NumpyArray(
   ...:     np.array(
   ...:         [
   ...:             [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]],
   ...:             [[10.1, 20.2, 30.3], [40.4, 50.5, 60.6]],
   ...:         ]
   ...:     )
   ...: )

In [8]: %%timeit
   ...: ak.to_json(b)
   ...: 
   ...: 
32.7 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [9]: b2 = ak._v2.contents.NumpyArray(
   ...:     np.array(
   ...:         [
   ...:             [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]],
   ...:             [[10.1, 20.2, 30.3], [40.4, 50.5, 60.6]],
   ...:         ]
   ...:     )
   ...: )

In [10]: %%timeit
    ...: ak._v2.operations.convert.to_json(b2)
    ...: 
    ...: 
62.3 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@jpivarski
Copy link
Member

These are very peculiar test cases—they're serializing/deserializing a very small array many times. In a simple performance model with O(n) asymptotic dependence and O(1) constant-time overhead, these tests are only measuring the O(1) part. Usually, we're only interested in the O(n) part.

@ianna
Copy link
Collaborator Author

ianna commented Feb 3, 2022

@jpivarski - The v2 json de-/serialisation functions in this PR use Python json and do not depend on RapidJSON. The latter can be removed when the v1 code is phased out!

One issue is that Python json decoder understands 'NaN', 'Infinity', and '-Infinity' as their corresponding float values. Allowing the decoder understand configurable strings is doable, but a bit tricky (https://bugs.python.org/issue29992) The question is how important is to keep this functionality?

@ianna
Copy link
Collaborator Author

ianna commented Feb 3, 2022

@jpivarski - The v2 json de-/serialisation functions in this PR use Python json and do not depend on RapidJSON. The latter can be removed when the v1 code is phased out!

One issue is that Python json decoder understands 'NaN', 'Infinity', and '-Infinity' as their corresponding float values. Allowing the decoder understand configurable strings is doable, but a bit tricky (https://bugs.python.org/issue29992) The question is how important is to keep this functionality?

I think, I'll implement it as a post processing step in the recursively applied action.

@jpivarski
Copy link
Member

The Python JSON decoder recognizes non-quoted strings as nan and infinity:

>>> json.loads('[1.1, 2.2, NaN, 3.3]')
[1.1, 2.2, nan, 3.3]
>>> json.loads('[1.1, 2.2, "NaN", 3.3]')
[1.1, 2.2, 'NaN', 3.3]

which is different from what the arguments to to_json/from_json do. Python's JSON encoder/decoder is writing and reading invalid JSON (a feature that can be turned off with some flag, but then NaN/infinity become errors).

The arguments to to_json/from_json write and read valid JSON, but change its interpretation as an Awkward Array. Without the string-converting argument, '[1.1, 2.2, "NaN", 3.3]' would be interpreted as a union array; with it, it's interpreted as a numeric array with a NaN value. The same is true of the complex-conversion argument: [{"real": 1.1, "imag": 2.2}] is valid JSON that could be interpreted as a record array (by default) or as a numeric array with complex type (with the argument).

If we keep in our arguments, there will be two ways to deal with NaN/infinity, by quoting (our arguments) or not quoting (what the Python json library does by default). Our argument for complex numbers is the only way to deal with complex numbers, though.

On RapidJSON, I hadn't been planning to drop it—that was how we could deal with continuous streams of JSON input or output. Also, the JSONSchema-based reader (a v2 thing) depends on it. But we can leave the PR like this and do its reintroduction at a later time. One way or the other, we did have to change the way we use it, so removing it and reintroducing it will make that a cleaner break.

@jpivarski
Copy link
Member

I think, I'll implement it as a post processing step in the recursively applied action.

We should not postprocess the JSON string—that's too dangerous. Getting it right essentially involves parsing and reserializing JSON, which duplicates that effort.

@ianna
Copy link
Collaborator Author

ianna commented Feb 3, 2022

I think, I'll implement it as a post processing step in the recursively applied action.

We should not postprocess the JSON string—that's too dangerous. Getting it right essentially involves parsing and reserializing JSON, which duplicates that effort.

No, I was thinking of the following:

    _CONSTANTS = {}
    if nan_string is not None:
        _CONSTANTS[nan_string] = json.decoder.NaN
    if infinity_string is not None:
        _CONSTANTS[infinity_string] = json.decoder.PosInf
    if minus_infinity_string is not None:
        _CONSTANTS[minus_infinity_string] = json.decoder.NegInf

    def replace_constants_action(node, **kwargs):
        if isinstance(node, ak._v2.contents.ListOffsetArray):
            if (
                node.parameter("__array__") == "bytestring"
                or node.parameter("__array__") == "string"
            ):
                out = node._awkward_strings_to_constants(_CONSTANTS)
                return None if out is None else ak._v2.contents.NumpyArray(out)
        else:
            return None

    if len(_CONSTANTS) > 0:
        layout = layout.recursively_apply(replace_constants_action)

and ListOffsetArray function:

    def _awkward_strings_to_constants(self, constants):
        if (
            self.parameter("__array__") == "bytestring"
            or self.parameter("__array__") == "string"
        ):
            content = ak._v2._util.tobytes(self._content.data)
            starts, stops = self.starts, self.stops
            out = [None] * starts.length
            result = [None] * starts.length
            for i in range(starts.length):
                out[i] = content[starts[i] : stops[i]].decode(errors="surrogateescape")
                if out[i] in constants:
                    result[i] = constants[out[i]]
            return None if result[0] is None else result

@jpivarski
Copy link
Member

More like this: Awkward → Awkward preprocessing and postprocessing:

def nonfinite_to_union(node, **kwargs):
    if isinstance(node, ak._v2.contents.NumpyArray):
        is_nonfinite = ~node.nplike.isfinite(node.data)  # true for inf, -inf, nan
        is_posinf = is_nonfinite & (node.data > 0)  # true for inf only
        is_neginf = is_nonfinite & (node.data < 0)  # true for -inf only
        is_nan = node.nplike.isnan(node.data)  # true for nan only
        tags = node.nplike.zeros(len(node), np.int8)
        tags[is_nonfinite] = 1
        index = node.nplike.arange(len(node), dtype=np.int64)
        index[is_posinf] = 0
        index[is_neginf] = 1
        index[is_nan] = 2
        return ak._v2.contents.UnionArray(
            ak._v2.index.Index8(tags),
            ak._v2.index.Index64(index),
            [
                node,
                ak._v2.operations.convert.from_iter(
                    ["inf", "-inf", "nan"], highlevel=False
                ),
            ],
        )

def union_to_nonfinite(node, **kwargs):
    if isinstance(node, ak._v2.contents.UnionArray) and len(node.contents) == 2:
        for i in (0, 1):
            not_i = 0 if i == 1 else 1
            if (
                isinstance(node.contents[not_i], ak._v2.contents.NumpyArray)
                and len(node.contents[not_i].shape) == 1
                and node.contents[i].parameter("__array__") == "string"   # not bytestring
                and node.contents[i].length <= 3
                and set(node.contents[i].to_list()).issubset({"inf", "-inf", "nan"})
            ):
                is_string = (node.tags.data == i)
                not_string = ~is_string
                strings = node.contents[i].to_list()   # we know this is small
                finite_numbers = node.contents[not_i].data
                all_numbers = node.nplike.empty(len(node), finite_numbers.dtype)
                all_numbers[not_string] = finite_numbers[node.index.data[not_string]]
                try:
                    posinf_index = (node.index.data == strings.index("inf"))
                    all_numbers[posinf_index & is_string] = np.inf
                except ValueError:   # in case the union doesn't have "inf"
                    pass
                try:
                    neginf_index = (node.index.data == strings.index("-inf"))
                    all_numbers[neginf_index & is_string] = -np.inf
                except ValueError:   # in case the union doesn't have "-inf"
                    pass
                try:
                    nan_index = (node.index.data == strings.index("nan"))
                    all_numbers[nan_index & is_string] = np.nan
                except ValueError:   # in case the union doesn't have "nan"
                    pass
                return ak._v2.contents.NumpyArray(all_numbers)

Given

with_nonfinite = ak._v2.from_iter([1.1, 2.2, float("inf"), 3.3, -float("inf"), 4.4, float("nan"), 5.5, 6.6])
without_nonfinite = ak._v2.from_iter([1.1, 2.2, 3.3, 4.4, 5.5, 6.6])

They can be applied like

ak._v2.Array(with_nonfinite.layout.recursively_apply(nonfinite_to_union, numpy_to_regular=True))
ak._v2.Array(without_nonfinite.layout.recursively_apply(nonfinite_to_union, numpy_to_regular=True))

(The actions assume that the NumpyArray is flat, so numpy_to_regular is needed.)

This also assumes that nplike has isfinite and isnan, which it doesn't yet, but that's easy to add. The above actions also don't accept user-supplied strings, but that's just a matter of replacing these constant strings.

@ianna ianna changed the title convert string, stream, or file from/to json for v2 json de-/serialisation from/to string or file Feb 4, 2022
@ianna
Copy link
Collaborator Author

ianna commented Feb 4, 2022

@jpivarski - thanks! All strings are now user-configurable, the tests pass. Note, the _awkward_strings_to_nonfinite converts JSON to either a NumpyArray or a UnionArray. The latter is needed if there are other strings a user did not request to convert. Please, see tests/v2/test_0437-stream-of-many-json-files.py:

    # read json file containing multiple definitions of 'nan' and 'inf'
    # user-defined strings
    # replace can only work for one string definition
    array = ak._v2.operations.io.from_json_file(
        os.path.join(path, "samples/test-nan-inf.json"),
        infinity_string="Infinity",
        nan_string="None at all",
    )

Actually, now we could introduce multiple strings such as infinity_strings={"Infinity", "Inf", "inf", "бесконечность"}

Some tests were removed as ndjson noncompliant. The JSON decoder class implements ndjson specs: see https://github.com/ndjson/ndjson-spec

- accept newline as line delimiter '\n' (0x0A) as well as carriage return and newline '\r\n' (0x0D0A)
- ignore empty lines '\n\n'
- raise an error if JSON is nor parsable

Please, have a look at the PR when you have time and comment (or commit directly to the branch if you feel like ;-) Thanks!

@ianna ianna marked this pull request as ready for review February 4, 2022 15:10
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of improvements in this PR that I want to take as-is: the std::move of GrowableBuffers and general clean-up of the C++, the clean-up of tests to exclude examples we no longer support, and the pre/post-processing of arrays for бесконечный, не число, and комплексное numbers. :)

Sometime after this PR, I plan to revisit to/from_json to pass file-like objects to RapidJSON, integrate the JSONSchema case as an argument to from_json, rather than a separate function, and make newline-delimited JSON something that the user has to ask for (either as a separate function or as an argument). We won't be interpreting strings as filenames (so the path handling in _v2/_util won't be needed). But it will be much easier to do that starting from this PR.

Sorry that I gave conflicting/contradictory suggestions during development—this is a problem that needs a lot of design decisions, since there are so many cases involved in JSON-handling (data that isn't directly encodable, newline vs not newline, files vs strings, Python's builtin module vs RapidJSON).

@jpivarski jpivarski merged commit dde5c91 into main Feb 4, 2022
@jpivarski jpivarski deleted the ianna/from_to_json_v2 branch February 4, 2022 16:07
@ianna
Copy link
Collaborator Author

ianna commented Feb 4, 2022

Here is an idea (not mine) how to read from a JSON stream using Python json: consume opening [, call json.raw_decode(s) on a bit starting with the first { to get an object and its ending offset, then consume a comma and whitespace in between, repeat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants