[BUG] differences in parsed data from cudf.read_json and pandas.read_json #10745

przymusp · 2022-04-27T08:44:03Z

Describe the bug
Reading the same JSON Lines file with pandas and cuDF yields differently parsed data.

Steps/Code to reproduce bug

>>> import cudf
>>> import pandas as pd
>>> pdf = pd.read_json("bad_cudf.json", lines=True)
>>> cdf = cudf.read_json("bad_cudf.json", lines=True)
>>> pdf["score"].sum()
55127
>>> cdf["score"].sum()
54363.0

Expected behavior

Parsing should be the same for pandas and cuDF, i.e. pdf["score"].sum() == cdf["score"].sum()

Environment overview (please complete the following information)

Environment location: Docker
Method of cuDF install: Docker
- If method of install is [Docker], provide docker pull & docker run commands used
  "DockerVersion": "20.10.14",
  sudo docker run -v "/home/eror/GPU-IDUB":"/mnt" -it 2d32b9cc4ade

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

in attachment

Additional context
ENV and sample data
bad_cudf.json.gz
cudf_env.log


kkaczmarski · 2022-04-27T09:05:25Z

cudf returns different values than pandas when parsing json? not good.

shwina · 2022-04-27T11:42:56Z

Hmm -- this looks to be more an issue with the reduction ('sum') than the json parsing itself (edit: no it's not - almost definitely an error with the json reader)

In [36]: pdf = pd.read_json("bad_cudf.json.gz", lines=True)

In [37]: gdf = cudf.read_json("bad_cudf.json.gz", lines=True)

In [38]: (gdf['score'] == pdf['score']).all()
/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/core/single_column_frame.py:345: FutureWarning: Binary operations between host objects such as <class 'pandas.core.series.Series'> and <class 'cudf.core.series.Series'> are deprecated and will be removed in a future release. Please convert it to a cudf object before performing the operation.
  warnings.warn(
Out[38]: True

In [39]: gdf['score'].sum()
Out[39]: 54363.0

In [40]: pdf['score'].sum()
Out[40]: 55127

I also notice that cuDF infers the "score" column as floats, while pandas infers it as ints.

przymusp · 2022-04-27T12:00:29Z

The conversion to floats is one problem for sure.
But for the original error, I still think it is a parsing error or some memory error:

>>> a = pdf["score"]
>>> b = cdf["score"].to_pandas()

>>> a[a != b]
656       4
744       1
2352      9
4780     87
7294     14
9612    649

>>> b[a != b]
656    NaN
744    NaN
2352   NaN
4780   NaN
7294   NaN
9612   NaN
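
The comparison above can be wrapped in a small helper. This is a minimal sketch using only pandas; `find_mismatches` is a hypothetical name, not part of pandas or cuDF, and the sample values are illustrative stand-ins for the real columns (in this thread, `b` would be `cdf["score"].to_pandas()`).

```python
import pandas as pd


def find_mismatches(a: pd.Series, b: pd.Series) -> pd.DataFrame:
    """Return rows where a and b differ, treating null-vs-null as equal.

    Hypothetical helper for illustration; both Series must share an index.
    """
    both_null = a.isna() & b.isna()
    # NaN != x is True in pandas, so rows missing on one side only are flagged.
    differs = (a != b) & ~both_null
    return pd.DataFrame({"a": a[differs], "b": b[differs]})


# Illustrative data: index 1 was parsed as missing on one side only.
a = pd.Series([3, 4, 7])
b = pd.Series([3.0, float("nan"), 7.0])
mismatches = find_mismatches(a, b)
print(mismatches)
```

Showing both columns side by side makes it easier to see which side dropped the value.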

przymusp · 2022-04-27T12:03:50Z

Reduction works

>>> cdf["score"][ a == b].sum()
54363.0
>>> pdf["score"][ a == b].sum()
54363
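
As a cross-check, the six values that pandas parsed but cuDF returned as NaN account for the whole gap between the two sums. The numbers below are copied from the outputs earlier in this thread.

```python
# Values pandas parsed at the rows where cuDF returned NaN (from a[a != b]).
missing_in_cudf = [4, 1, 9, 87, 14, 649]

pandas_sum = 55127  # pdf["score"].sum()
cudf_sum = 54363    # cdf["score"].sum(), printed as 54363.0

gap = pandas_sum - cudf_sum
print(gap, sum(missing_in_cudf))  # both are 764
```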

shwina · 2022-04-27T14:40:24Z

You're right - this is almost definitely a bug in the json reader. What I didn't realize was that NA in cuDF (and Pandas) has a boolean value of True.
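
A minimal sketch of why the `.all()` check earlier in the thread could report True despite the missing values. It uses pandas nullable integers as a stand-in for cuDF's null-masked columns, which is an assumption: the cuDF code path is not run here. In this model, comparing against a null yields a null rather than False, and `all()` skips nulls by default.

```python
import pandas as pd

# Nullable Int64 columns; the second value is missing on one side only.
left = pd.Series([3, None, 7], dtype="Int64")
right = pd.Series([3, 4, 7], dtype="Int64")

eq = left == right                # boolean dtype: [True, <NA>, True]
all_default = eq.all()            # skipna=True by default, so <NA> is ignored
n_unknown = int(eq.isna().sum())  # comparisons with no definite answer

print(eq.tolist(), all_default, n_unknown)
```

Checking `eq.isna().any()` alongside `eq.all()` would have exposed the missing rows in a comparison like this.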

przymusp · 2022-04-27T18:22:51Z

I think the problem is associated with the parsing of nested json objects (all problematic lines had nested objects).
One more problem I noticed is the difference in DataFrame shape. This is probably connected with the nested JSON objects and the error above.

>>> cdf.shape
(9963, 101)
>>> pdf.shape
(9963, 50)
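
To check which records contain nested objects, the JSON Lines file can be scanned with the standard library. This is a sketch under assumptions: `lines_with_nested_values` is a hypothetical helper, each line is assumed to hold one JSON object, and the sample records are made up for illustration rather than taken from bad_cudf.json.

```python
import io
import json


def lines_with_nested_values(fh):
    """Yield (line_number, keys) for records that have dict or list values.

    Hypothetical helper; line numbers are zero-based positions in the file.
    """
    for lineno, line in enumerate(fh):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        nested = [k for k, v in record.items() if isinstance(v, (dict, list))]
        if nested:
            yield lineno, nested


# Made-up sample; for the real file use gzip.open("bad_cudf.json.gz", "rt").
sample = io.StringIO(
    '{"id": 1, "score": 4}\n'
    '{"id": 2, "score": 9, "media": {"type": "image"}}\n'
)
nested_rows = list(lines_with_nested_values(sample))
print(nested_rows)
```

The reported line numbers can then be compared against the row indices where the two readers disagree.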

bdice · 2022-05-19T23:07:00Z

Hi @przymusp! I looked into this issue with @shwina. You're correct that the nested objects are the issue here. Currently, reading nested JSON data directly is not supported by cuDF (see #8827), but it is a high priority feature for the cuDF library.

A workaround is to read the nested JSON data with pandas, and then transfer the data to cuDF with cudf.from_pandas(df). We caught and fixed a bug #10761 which arose from some of the nested data fields being empty in all rows. Below is a code snippet for performing the analysis you described in the original post, and the results are now correct. To run this code snippet, you will need a nightly version of 22.06a which contains the fix #10761.

>>> import cudf
>>> import pandas as pd
>>> pdf = pd.read_json("bad_cudf.json.gz", lines=True)
>>> cdf = cudf.from_pandas(pdf)
>>> print(pdf["score"].sum())
55127
>>> print(cdf["score"].sum())
55127

I hope this is helpful, and please reach out if you have any other questions! 😄

przymusp · 2022-05-23T09:53:01Z

Thanks for fixing this. Do you have a roadmap for nested JSON Lines support?

elstehle · 2022-05-23T14:10:55Z

Thanks for your interest in the nested JSON parser, @przymusp! We are currently aiming to land an experimental version in the 22.08 release.

przymusp · 2022-06-01T07:41:48Z

Thanks for the info.

przymusp added the Needs Triage and bug labels Apr 27, 2022

galipremsagar added the cuIO label Apr 27, 2022

przymusp closed this as completed Jun 1, 2022

bdice removed the Needs Triage label Mar 4, 2024
