json de-/serialisation from/to string or file #1207
Conversation
ianna
commented
Jan 7, 2022
- ak._v2.operations.convert.from_json
- ak._v2.operations.convert.from_json_file
- ak._v2.operations.convert.to_json
- ak._v2.operations.convert.to_json_file
@ianna Be careful to exclude the pybind11 directory from your PR, so that you don't accidentally downgrade it. (It's happened before: see #1223.) This will fix it in your local repo, and then check it in:

cd pybind11/
git checkout master
git pull origin master
git checkout 45f792efdd92da094548e2095d6efdbfa7e536ee # ignore warnings about detached HEAD; we know...
cd ..
git add pybind11
git commit -m "Leave 'pybind11' at version 2.9.0."
git push
@jpivarski - Here is a performance reference for the PR as it is now, comparing v1 and v2.

from_json:

In [4]: %%timeit
   ...: ak.from_json("[[1.1, 2.2, 3], [], [4, 5.5]]")
   ...:
   ...:
670 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [5]: %%timeit
   ...: ak._v2.operations.convert.from_json("[[1.1, 2.2, 3], [], [4, 5.5]]")
   ...:
   ...:
152 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

to_json:

In [7]: b = ak.layout.NumpyArray(
   ...:     np.array(
   ...:         [
   ...:             [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]],
   ...:             [[10.1, 20.2, 30.3], [40.4, 50.5, 60.6]],
   ...:         ]
   ...:     )
   ...: )

In [8]: %%timeit
   ...: ak.to_json(b)
   ...:
   ...:
32.7 µs ± 237 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [9]: b2 = ak._v2.contents.NumpyArray(
   ...:     np.array(
   ...:         [
   ...:             [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]],
   ...:             [[10.1, 20.2, 30.3], [40.4, 50.5, 60.6]],
   ...:         ]
   ...:     )
   ...: )

In [10]: %%timeit
    ...: ak._v2.operations.convert.to_json(b2)
    ...:
    ...:
62.3 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
These are very peculiar test cases—they're serializing/deserializing a very small array many times. In a simple performance model with O(n) asymptotic dependence and O(1) constant-time overhead, these tests are only measuring the O(1) part. Usually, we're only interested in the O(n) part.
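One hedged way to separate those two terms is to time the parser at two very different input sizes and difference the results. The sketch below (not from the PR; the sizes and `json.loads` stand in for the real serializers) illustrates the performance model in the comment above:

```python
import json
import timeit

def time_loads(n, number=20, repeats=5):
    """Best-of-repeats time per call to parse a JSON list of n floats (seconds)."""
    payload = json.dumps([1.1] * n)
    return min(timeit.repeat(lambda: json.loads(payload), number=number, repeat=repeats)) / number

# Timing two very different sizes separates the O(1) call overhead from the
# O(n) per-element cost: the small case is dominated by fixed overhead, the
# large case by the linear term.
small = time_loads(10)
large = time_loads(10_000)
per_element = (large - small) / (10_000 - 10)
print(f"~fixed overhead: {small:.2e} s, ~per element: {per_element:.2e} s")
```

A benchmark on a tiny array, as above, reports essentially only the `small`-style number, which is why it can rank implementations differently than a realistic workload would.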
- accept newline '\n' (0x0A) as a line delimiter, as well as carriage return and newline '\r\n' (0x0D0A)
- ignore empty lines '\n\n'
- raise an error if the JSON is not parsable
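The delimiter rules above can be sketched in a few lines of plain Python; `str.splitlines()` already treats both '\n' and '\r\n' as line breaks. The function name here is illustrative, not the PR's API:

```python
import json

def iter_json_lines(text):
    """Yield one parsed JSON value per non-empty line.

    str.splitlines() splits on '\n' (0x0A) and '\r\n' (0x0D0A) alike,
    matching the delimiter rules above; blank lines are ignored, and a
    non-parsable line raises json.JSONDecodeError (a ValueError).
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines such as '\n\n'
        yield json.loads(line)  # raises if the JSON is not parsable

records = list(iter_json_lines('[1, 2]\r\n\n{"x": 3}\n'))
```

This reads a whole string at once; a streaming variant would apply the same rules per chunk.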
@jpivarski - The v2 JSON de-/serialisation functions in this PR use Python json and do not depend on RapidJSON. The latter can be removed when the v1 code is phased out! One issue is that the Python json decoder understands unquoted NaN and Infinity tokens.
I think I'll implement it as a post-processing step in the recursively applied action.
The Python JSON decoder parses the unquoted tokens NaN and Infinity as floats, but leaves the quoted spellings as strings:

>>> json.loads('[1.1, 2.2, NaN, 3.3]')
[1.1, 2.2, nan, 3.3]
>>> json.loads('[1.1, 2.2, "NaN", 3.3]')
[1.1, 2.2, 'NaN', 3.3]

which is different from what our nan_string/infinity_string arguments do: those arguments match quoted strings. If we keep them, there will be two ways to deal with NaN/infinity, by quoting (our arguments) or not quoting (what the Python json library does by default). Our argument for complex numbers is the only way to deal with complex numbers, though.

On RapidJSON, I hadn't been planning to drop it; that was how we could deal with continuous streams of JSON input or output. Also, the JSONSchema-based reader (a v2 thing) depends on it. But we can leave the PR like this and do its reintroduction at a later time. One way or the other, we did have to change the way we use it, so removing it and reintroducing it will make that a cleaner break.
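For the unquoted side specifically, the standard library exposes a hook: `json.loads` calls its `parse_constant` argument for exactly the tokens 'NaN', 'Infinity', and '-Infinity' (custom unquoted spellings are not supported, so user-configured strings still need the quoted-string treatment discussed here). A small sketch of both behaviors:

```python
import json
import math

# By default, the unquoted tokens become Python floats.
parsed = json.loads("[1.1, NaN, Infinity]")
assert math.isnan(parsed[1]) and math.isinf(parsed[2])

# parse_constant is invoked only for 'NaN', 'Infinity', and '-Infinity',
# so it can be used to reject (or remap) non-finite tokens.
def reject_nonfinite(token):
    raise ValueError(f"non-finite JSON token not allowed: {token}")

try:
    json.loads("[1.1, NaN]", parse_constant=reject_nonfinite)
except ValueError as err:
    message = str(err)
```

Because this hook never sees quoted strings, it cannot by itself implement the nan_string-style arguments.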
We should not postprocess the JSON string—that's too dangerous. Getting it right essentially involves parsing and reserializing JSON, which duplicates that effort.
No, I was thinking of the following:

_CONSTANTS = {}
if nan_string is not None:
    _CONSTANTS[nan_string] = json.decoder.NaN
if infinity_string is not None:
    _CONSTANTS[infinity_string] = json.decoder.PosInf
if minus_infinity_string is not None:
    _CONSTANTS[minus_infinity_string] = json.decoder.NegInf

def replace_constants_action(node, **kwargs):
    if isinstance(node, ak._v2.contents.ListOffsetArray):
        if (
            node.parameter("__array__") == "bytestring"
            or node.parameter("__array__") == "string"
        ):
            out = node._awkward_strings_to_constants(_CONSTANTS)
            return None if out is None else ak._v2.contents.NumpyArray(out)
        else:
            return None

if len(_CONSTANTS) > 0:
    layout = layout.recursively_apply(replace_constants_action)

and a ListOffsetArray function:

def _awkward_strings_to_constants(self, constants):
    if (
        self.parameter("__array__") == "bytestring"
        or self.parameter("__array__") == "string"
    ):
        content = ak._v2._util.tobytes(self._content.data)
        starts, stops = self.starts, self.stops
        out = [None] * starts.length
        result = [None] * starts.length
        for i in range(starts.length):
            out[i] = content[starts[i] : stops[i]].decode(errors="surrogateescape")
            if out[i] in constants:
                result[i] = constants[out[i]]
        return None if result[0] is None else result
More like this. Awkward → Awkward preprocessing and postprocessing:

def nonfinite_to_union(node, **kwargs):
    if isinstance(node, ak._v2.contents.NumpyArray):
        is_nonfinite = ~node.nplike.isfinite(node.data)  # true for inf, -inf, nan
        is_posinf = is_nonfinite & (node.data > 0)       # true for inf only
        is_neginf = is_nonfinite & (node.data < 0)       # true for -inf only
        is_nan = node.nplike.isnan(node.data)            # true for nan only
        tags = node.nplike.zeros(len(node), np.int8)
        tags[is_nonfinite] = 1
        index = node.nplike.arange(len(node), dtype=np.int64)
        index[is_posinf] = 0
        index[is_neginf] = 1
        index[is_nan] = 2
        return ak._v2.contents.UnionArray(
            ak._v2.index.Index8(tags),
            ak._v2.index.Index64(index),
            [
                node,
                ak._v2.operations.convert.from_iter(
                    ["inf", "-inf", "nan"], highlevel=False
                ),
            ],
        )

def union_to_nonfinite(node, **kwargs):
    if isinstance(node, ak._v2.contents.UnionArray) and len(node.contents) == 2:
        for i in (0, 1):
            not_i = 0 if i == 1 else 1
            if (
                isinstance(node.contents[not_i], ak._v2.contents.NumpyArray)
                and len(node.contents[not_i].shape) == 1
                and node.contents[i].parameter("__array__") == "string"  # not bytestring
                and node.contents[i].length <= 3
                and set(node.contents[i].to_list()).issubset({"inf", "-inf", "nan"})
            ):
                is_string = (node.tags.data == i)
                not_string = ~is_string
                strings = node.contents[i].to_list()  # we know this is small
                finite_numbers = node.contents[not_i].data
                all_numbers = node.nplike.empty(len(node), finite_numbers.dtype)
                all_numbers[not_string] = finite_numbers[node.index.data[not_string]]
                try:
                    posinf_index = (node.index.data == strings.index("inf"))
                    all_numbers[posinf_index & is_string] = np.inf
                except ValueError:  # in case the union doesn't have "inf"
                    pass
                try:
                    neginf_index = (node.index.data == strings.index("-inf"))
                    all_numbers[neginf_index & is_string] = -np.inf
                except ValueError:  # in case the union doesn't have "-inf"
                    pass
                try:
                    nan_index = (node.index.data == strings.index("nan"))
                    all_numbers[nan_index & is_string] = np.nan
                except ValueError:  # in case the union doesn't have "nan"
                    pass
                return ak._v2.contents.NumpyArray(all_numbers)

Given

with_nonfinite = ak._v2.from_iter([1.1, 2.2, float("inf"), 3.3, -float("inf"), 4.4, float("nan"), 5.5, 6.6])
without_nonfinite = ak._v2.from_iter([1.1, 2.2, 3.3, 4.4, 5.5, 6.6])

they can be applied like

ak._v2.Array(with_nonfinite.layout.recursively_apply(nonfinite_to_union, numpy_to_regular=True))
ak._v2.Array(without_nonfinite.layout.recursively_apply(nonfinite_to_union, numpy_to_regular=True))

(The actions assume that the NumpyArray is flat, so numpy_to_regular=True is passed. This also assumes that …)
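The tagging logic in nonfinite_to_union can be checked with plain NumPy, independent of Awkward; this sketch (input values are illustrative) shows what the tags/index buffers come out to for a small array:

```python
import numpy as np

data = np.array([1.1, np.inf, 2.2, -np.inf, np.nan, 3.3])

is_nonfinite = ~np.isfinite(data)      # true for inf, -inf, nan
is_posinf = is_nonfinite & (data > 0)  # true for inf only (nan > 0 is False)
is_neginf = is_nonfinite & (data < 0)  # true for -inf only
is_nan = np.isnan(data)                # true for nan only

# tag 0 -> the numeric content, tag 1 -> the ["inf", "-inf", "nan"] strings
tags = np.zeros(len(data), np.int8)
tags[is_nonfinite] = 1
# index points into the numeric content for finite values, and into the
# string content (positions 0, 1, 2) for inf, -inf, nan respectively
index = np.arange(len(data), dtype=np.int64)
index[is_posinf] = 0
index[is_neginf] = 1
index[is_nan] = 2
```

So finite entries keep their original positions and non-finite entries are redirected to one of three shared strings, which is exactly what makes the resulting UnionArray JSON-encodable.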
@jpivarski - thanks! All strings are now user-configurable and the tests pass. Note the following:

# read json file containing multiple definitions of 'nan' and 'inf'
# user-defined strings
# replace can only work for one string definition
array = ak._v2.operations.io.from_json_file(
    os.path.join(path, "samples/test-nan-inf.json"),
    infinity_string="Infinity",
    nan_string="None at all",
)

Actually, now we could introduce multiple strings, such as … Some tests were removed as …

Please have a look at the PR when you have time and comment (or commit directly to the branch if you feel like it ;-) Thanks!
There are a lot of improvements in this PR that I want to take as-is: the std::move of GrowableBuffers and general clean-up of the C++, the clean-up of tests to exclude examples we no longer support, and the pre/post-processing of arrays for infinite, not-a-number, and complex numbers. :)
Sometime after this PR, I plan to revisit to/from_json to pass file-like objects to RapidJSON, integrate the JSONSchema case as an argument to from_json, rather than a separate function, and make newline-delimited JSON something that the user has to ask for (either as a separate function or as an argument). We won't be interpreting strings as filenames (so the path handling in _v2/_util
won't be needed). But it will be much easier to do that starting from this PR.
Sorry that I gave conflicting/contradictory suggestions during development—this is a problem that needs a lot of design decisions, since there are so many cases involved in JSON-handling (data that isn't directly encodable, newline vs not newline, files vs strings, Python's builtin module vs RapidJSON).
Here is an idea (not mine) of how to read from a JSON stream using Python json: consume opening …
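One common shape for that idea (a sketch of the general technique, not necessarily the referenced one) uses json.JSONDecoder.raw_decode, which parses a single value starting at a given position and returns both the value and the position where it stopped:

```python
import json

def iter_json_stream(buffer):
    """Yield successive JSON values from a concatenated stream in a string.

    raw_decode parses one value at a given position and returns
    (value, end_position), so we can walk through the buffer value by
    value, skipping whitespace between them.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(buffer):
        # skip whitespace/newlines between concatenated values
        while pos < len(buffer) and buffer[pos].isspace():
            pos += 1
        if pos >= len(buffer):
            break
        value, pos = decoder.raw_decode(buffer, pos)
        yield value

values = list(iter_json_stream('[1, 2] {"x": 3}\n"done"'))
```

A true file/socket stream additionally needs buffering, since raw_decode works on an in-memory string; that is where hand-consuming opening/closing brackets (as in the referenced idea) or RapidJSON's stream readers come in.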