Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for nested JSON structure while reading data in JSONEachRow input format #3144

Merged

Conversation

veloman-yunkan
Copy link
Contributor

This enhancement adds to clickhouse a new option --input_format_import_nested_json which enables reading nested fields from similarly structured JSON (when JSONEachRow input format is used). For example, for a table created as follows

CREATE TABLE test (
  fieldX String
  fieldY String
  myNestedField Nested (
    a Uint8,
    b String
  )
)

the following JSON

{"fieldX": "hello", "fieldY": "World", "myNestedField": {"a":[1], "b":["2"]}}

is processed in --input_format_import_nested_json=1 mode with the same outcome as for the JSON below:

{"fieldX": "hello", "fieldY": "World", "myNestedField.a":[1], "myNestedField.b":["2"]}

Handling of unknown fields is respected.

I have performed some refactoring of the code in dbms/src/Formats/JSONEachRowRowInputStream.cpp, and also introduced a new test dbms/tests/queries/0_stateless/00715_json_each_row_input_nested.sh that covers this new functionality.

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en with the exception of the clause "9. Liability" (however hypothetical, it is still an unmanageable risk and I am not going to undertake it).

The new test demonstrates current handling of flat JSON data corresponding
to nested columns.
Also changed the names of previously extracted functions to camel case.
By default mapping of nested json data to nested tables is disabled. To
enable the import of nested json data (into corresponding nested tables)
clickhouse must be run with the --input_format_import_nested_json=1
option.
@veloman-yunkan
Copy link
Contributor Author

A note on the failed Travis CI build - a couple of changed header files forced almost full recompilation and the Travis CI job was interrupted due to exceeded maximum time limit. At the time of the job termination the build was completed and most of the tests have been run. Strangely enough the only failed test is the new test introduced by me. While playing with clickhouse I noticed that the requirement of different nested columns to have the same length was not enforced (at least it was running without an error in my environment) so I created the test that way. Now, in the Travis CI run it gives the error that I expected to see initially. I have already fixed the test and will push the changes to my repo shortly.

@veloman-yunkan
Copy link
Contributor Author

This time the build itself timed out.

@alexey-milovidov
Copy link
Member

This time the build itself timed out.

Don't worry, Travis build barely works.

@alexey-milovidov
Copy link
Member

Looks (almost) Ok.

alexey-milovidov added a commit that referenced this pull request Sep 17, 2018
@alexey-milovidov alexey-milovidov merged commit f51d791 into ClickHouse:master Sep 17, 2018
@alexey-milovidov alexey-milovidov self-requested a review September 17, 2018 20:36
@alexey-milovidov
Copy link
Member

alexey-milovidov commented Sep 17, 2018

Thank you!

FYI. I also was thinking about another possible way for implementation.

We have named tuples. Nested column may be represented as a single column of type Array(Tuple(a UInt8, b String)). Then we can serialize/deserialize named tuples in natural way in JSON, like {"a":1,"b":"Hello, world"}. And Array(Tuple(a UInt8, b String)) will be serialized/deserialized as [{"a":1,"b":"Hello, world"},{"a":2,"b":"test"}]. Note that this serialization is different that the way you've implemented.

@veloman-yunkan
Copy link
Contributor Author

@alexey-milovidov Thank you for accepting my first contribution!

I like your idea from the use model point of view, but it seems to me that implementing it would require more code (though, I can definitely be wrong).

@champtar
Copy link
Contributor

Hi @veloman-yunkan @alexey-milovidov
This PR break my setup

git revert 46ef387ce338cbf537a19b0bb9662528328ef99e
git revert 8852660b241a963ef6a5b4a795c69d7af984e0ea -m 1

Is enough to fix it

I was not able to make a minimal reproduction case yet, but I can privately share an example for now

I get errors like:

DB::Exception: Unknown field found while parsing JSONEachRow format: t_host":"2, e.what() = DB::Exception

field name is request_host in this exemple

champtar added a commit to champtar/ClickHouse that referenced this pull request Sep 25, 2018
@veloman-yunkan
Copy link
Contributor Author

Hi @champtar,

I reviewed my changes and couldn't spot anything that could break existing functionality (unless the old code contained hidden bugs).

Did you observe this problem with a clean build?

@champtar
Copy link
Contributor

Hi @veloman-yunkan
It's with a clean build
I can share an example curl post that work before but not after with you if you want
(sadly I can't share it in public & I was not able to do a short repro case)

@veloman-yunkan
Copy link
Contributor Author

@champtar Please send the curl post example to my email address.

@champtar
Copy link
Contributor

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants