[FEA] Provide option to read fields with mixed types #11947

Closed
GregoryKimball opened this issue Oct 19, 2022 · 5 comments
Labels
  • 0 - Backlog: In queue waiting for assignment
  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.

Comments

@GregoryKimball
Contributor

GregoryKimball commented Oct 19, 2022

Is your feature request related to a problem? Please describe.
The cudf_experimental JSON reader does not faithfully read fields with mixed nesting.

If the field contains scalar and list then the scalar values are set to null:

>>> json_str = '[{"a":1},{"a":[1]}]'
>>> cudf.read_json(json_str, engine='cudf')
      a
0  None
1   [1]

If the field contains scalar and struct then the scalar values are set to null:

>>> json_str = '[{"a":1},{"a":{}}]'
>>> cudf.read_json(json_str, engine='cudf')
      a
0  None
1    {}

If the field contains different levels of nested lists, the shallower lists are set to null:

>>> json_str = '[{"a":[1]},{"a":[[1]]}]'
>>> cudf.read_json(json_str, engine='cudf')
        a
0  [None]
1   [[1]]

If the field contains list and struct types, we throw:

>>> json_str = '[{"a":[1]},{"a":{"b":1}}]'
>>> cudf.read_json(json_str, engine='cudf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/cudf/io/json.py", line 111, in read_json
    df = libjson.read_json(
  File "json.pyx", line 50, in cudf._lib.json.read_json
  File "json.pyx", line 138, in cudf._lib.json.read_json
RuntimeError: CUDF failure at: /opt/conda/conda-bld/work/cpp/src/io/json/json_column.cu:576: A mix of lists and structs within the same column is not supported
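
To check whether a given input is affected by the cases above, it can help to scan it on the CPU first. The following is a minimal sketch using only Python's standard `json` module; it is not part of cuDF, `kind_of` and `mixed_fields` are hypothetical helper names, and it assumes the input is a JSON array of objects like the examples above.

```python
import json
from collections import defaultdict


def kind_of(value):
    """Classify a parsed JSON value as 'struct', 'list', 'null', or 'scalar'."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "list"
    if value is None:
        return "null"
    return "scalar"


def mixed_fields(json_str):
    """Return {field: kinds} for top-level fields holding more than one kind.

    Nulls are ignored, since they can sit alongside any other kind.
    """
    kinds = defaultdict(set)
    for record in json.loads(json_str):
        for name, value in record.items():
            kind = kind_of(value)
            if kind != "null":
                kinds[name].add(kind)
    return {name: sorted(seen) for name, seen in kinds.items() if len(seen) > 1}


print(mixed_fields('[{"a":1},{"a":[1]}]'))        # {'a': ['list', 'scalar']}
print(mixed_fields('[{"a":[1]},{"a":{"b":1}}]'))  # {'a': ['list', 'struct']}
print(mixed_fields('[{"a":1},{"a":2}]'))          # {}
```

This check only compares the outermost kind of each value, so it does not flag lists of differing depth such as `[1]` versus `[[1]]`.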

Describe the solution you'd like
We've talked about a few solutions:

  • model the column as a struct that represents a union column
  • add a union type to libcudf
  • read the field as multiple columns and put a subscript in the name
  • coerce the field's values to strings instead of trying to represent the mixed nesting
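
As an illustration of the last option, here is a rough CPU-side sketch of what "coerce to strings" could mean, written with the standard `json` module only. It is an assumption about the intended behavior, not the libcudf design; `coerce_mixed_to_strings` is a hypothetical name, and values of a mixed field are replaced by their compact JSON text.

```python
import json


def _kind(value):
    """Classify a parsed JSON value as 'struct', 'list', or 'scalar'."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "list"
    return "scalar"


def coerce_mixed_to_strings(records):
    """Return a copy of records where mixed-kind fields hold JSON text."""
    # First pass: collect the kinds seen for each top-level field.
    kinds = {}
    for record in records:
        for name, value in record.items():
            if value is not None:
                kinds.setdefault(name, set()).add(_kind(value))
    mixed = {name for name, seen in kinds.items() if len(seen) > 1}

    # Second pass: serialize every non-null value of a mixed field.
    return [
        {
            name: json.dumps(value, separators=(",", ":"))
            if name in mixed and value is not None
            else value
            for name, value in record.items()
        }
        for record in records
    ]


records = json.loads('[{"a":[1]},{"a":{"b":1}}]')
print(coerce_mixed_to_strings(records))
# [{'a': '[1]'}, {'a': '{"b":1}'}]
```

The trade-off is that the column stays readable as plain strings, but the nested structure has to be re-parsed by whoever consumes it.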

Describe alternatives you've considered
Without a workaround, JSON inputs including a field with mixed types will not be readable.

Additional context
This item still needs some design work. Please comment and share a data sample if it is impacting you.

GregoryKimball added the feature request and cuIO labels on Oct 19, 2022
GregoryKimball added this to the Nested JSON reader milestone on Oct 19, 2022
GregoryKimball added the 0 - Backlog label on Dec 1, 2022
@isVoid
Contributor

isVoid commented Dec 16, 2022

It is common in GeoJSON to have different levels of nested list data in the same column. Here's a column containing a polygon and a linestring. (You can visualize this at https://geojson.io/)
map.zip
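
To make the depth mismatch concrete: in GeoJSON, a LineString's `coordinates` is a list of positions, while a Polygon's `coordinates` is a list of rings, each of which is a list of positions. The sketch below measures that nesting with plain Python. The coordinates are made up for illustration and are not taken from map.zip, and `list_depth` is a hypothetical helper.

```python
def list_depth(value):
    """Return the deepest list nesting level of a value (0 for a non-list)."""
    if not isinstance(value, list):
        return 0
    return 1 + max((list_depth(item) for item in value), default=0)


# Illustrative geometries only; not the contents of the attached file.
line_string = {"type": "LineString", "coordinates": [[0.0, 0.0], [1.0, 1.0]]}
polygon = {
    "type": "Polygon",
    "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]],
}

print(list_depth(line_string["coordinates"]))  # 2
print(list_depth(polygon["coordinates"]))      # 3
```

A column holding both geometries therefore mixes depth-2 and depth-3 lists under the same `coordinates` field.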

GregoryKimball added the libcudf label on Apr 2, 2023
@GregoryKimball
Contributor Author

After some internal discussion we are favoring the option:

model the column as a struct that represents a union column

Here is an example of what this representation could look like:

[
  { "a": "foo" },
  { "a": { "b": "bar" } }
]
>>> df
                                             a
0         {'0': 'foo', '1': None, 'offset': 0}
1  {'0': None, '1': {'b': 'bar'}, 'offset': 1}
>>> df['a'].dtype
StructDtype({'0': dtype('O'), '1': StructDtype({'b': dtype('O')}), 'offset': dtype('int64')})
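
The layout above can be mimicked in plain Python to see how the children and the `offset` field relate. This is only a sketch under assumptions: it reads `offset` as the index of the child that is populated for a row (which matches the example), groups values by Python type rather than by cuDF dtype, and `to_union_structs` is a hypothetical name.

```python
def to_union_structs(values):
    """Encode mixed-type values as struct-like dicts with one child per type.

    Children are named "0", "1", ... in order of first appearance, and
    "offset" holds the index of the child that is populated for that row.
    None values are not treated specially in this sketch.
    """

    def type_key(value):
        # Assumption: group by Python type; a real reader would use column dtypes.
        return type(value).__name__

    # Assign each distinct type a child index in first-seen order.
    child_index = {}
    for value in values:
        child_index.setdefault(type_key(value), len(child_index))

    rows = []
    for value in values:
        active = child_index[type_key(value)]
        row = {str(i): None for i in range(len(child_index))}
        row[str(active)] = value
        row["offset"] = active
        rows.append(row)
    return rows


for row in to_union_structs(["foo", {"b": "bar"}]):
    print(row)
# {'0': 'foo', '1': None, 'offset': 0}
# {'0': None, '1': {'b': 'bar'}, 'offset': 1}
```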

@pdufour

pdufour commented Jul 7, 2023

This would be great :) Also running into this with this file:

(Value gets set to None the first two times, then the empty object is kept)

{"Type":"insert","Key":[1],"SeqNo":1,"Timestamp":1,"Fields":[{"Name":"added_id","Value":1},{"Name":"row_key","Value":"//1=="},{"Name":"column_key","Value":"2"},{"Name":"ref_key","Value":1},{"Name":"updated","Value":"2020-12-08T02:15:07.664543Z"},{"Name":"body","Value":{}}]}

(Formatted JSON):

{
    "Type": "insert",
    "Key": [
        1
    ],
    "SeqNo": 1,
    "Timestamp": 1,
    "Fields": [
        {
            "Name": "added_id",
            "Value": 1
        },
        {
            "Name": "row_key",
            "Value": "//1=="
        },
        {
            "Name": "column_key",
            "Value": "2"
        },
        {
            "Name": "ref_key",
            "Value": 1
        },
        {
            "Name": "updated",
            "Value": "2020-12-08T02:15:07.664543Z"
        },
        {
            "Name": "body",
            "Value": {}
        }
    ]
}
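
For reference, the kinds of `Value` that appear in this record can be listed with a short standard-library sketch. It only inspects the data on the CPU and does not reproduce the cuDF behavior; `json_kind` is a hypothetical helper.

```python
import json

record = json.loads(
    '{"Type":"insert","Key":[1],"SeqNo":1,"Timestamp":1,"Fields":['
    '{"Name":"added_id","Value":1},'
    '{"Name":"row_key","Value":"//1=="},'
    '{"Name":"column_key","Value":"2"},'
    '{"Name":"ref_key","Value":1},'
    '{"Name":"updated","Value":"2020-12-08T02:15:07.664543Z"},'
    '{"Name":"body","Value":{}}]}'
)


def json_kind(value):
    """Name the JSON kind of a parsed value."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "list"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bool):
        return "bool"
    if value is None:
        return "null"
    return "number"


value_kinds = [json_kind(field["Value"]) for field in record["Fields"]]
for field, kind in zip(record["Fields"], value_kinds):
    print(field["Name"], kind)
# added_id number
# row_key string
# column_key string
# ref_key number
# updated string
# body struct
```

So the nested `Fields[].Value` field mixes numbers, strings, and a struct within a single record.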

@GregoryKimball
Contributor Author

Also see #14239

@GregoryKimball
Contributor Author

Closed by #14572
