[BUG] differences in parsed data from cudf.read_json and pandas.read_json #10745

Closed
przymusp opened this issue Apr 27, 2022 · 10 comments
Labels
bug Something isn't working cuIO cuIO issue

Comments

@przymusp

Describe the bug
Reading the same JSON Lines file with pandas and cuDF produces different parsed data.

Steps/Code to reproduce bug

>>> import cudf
>>> import pandas as pd
>>> pdf = pd.read_json("bad_cudf.json", lines=True)
>>> cdf = cudf.read_json("bad_cudf.json", lines=True)
>>> pdf["score"].sum()
55127
>>> cdf["score"].sum()
54363.0

Expected behavior

Parsing should give the same result in pandas and cuDF, i.e. pdf["score"].sum() == cdf["score"].sum()

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: Docker
    • If method of install is [Docker], provide docker pull & docker run commands used
      "DockerVersion": "20.10.14",
      sudo docker run -v "/home/eror/GPU-IDUB":"/mnt" -it 2d32b9cc4ade

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

in attachment

Additional context
ENV and sample data
bad_cudf.json.gz
cudf_env.log

@przymusp przymusp added Needs Triage Need team to review and classify bug Something isn't working labels Apr 27, 2022
@kkaczmarski

cuDF returns different values than pandas when parsing JSON? Not good.

@shwina
Contributor

shwina commented Apr 27, 2022

Hmm, this looks to be more an issue with the reduction ('sum') than the JSON parsing itself (edit: no, it's not; this is almost definitely an error in the JSON reader).

In [36]: pdf = pd.read_json("bad_cudf.json.gz", lines=True)

In [37]: gdf = cudf.read_json("bad_cudf.json.gz", lines=True)

In [38]: (gdf['score'] == pdf['score']).all()
/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/core/single_column_frame.py:345: FutureWarning: Binary operations between host objects such as <class 'pandas.core.series.Series'> and <class 'cudf.core.series.Series'> are deprecated and will be removed in a future release. Please convert it to a cudf object before performing the operation.
  warnings.warn(
Out[38]: True

In [39]: gdf['score'].sum()
Out[39]: 54363.0

In [40]: pdf['score'].sum()
Out[40]: 55127

I also notice that cuDF infers the "score" column as floats, while pandas infers it as ints.
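
One plausible reason for an integer column to come back as floats is that the reader produced missing values for some rows. The sketch below shows this in pandas alone, with made-up values; it assumes nothing about how cuDF infers dtypes and only illustrates that a single missing entry turns the column into floats and is skipped by `sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: one column fully parsed, one with a missing entry.
complete = pd.Series([4, 1, 9])
with_gap = pd.Series([4, np.nan, 9])

# A missing value forces a float dtype, and sum() skips it by default,
# so the total silently drops the missing row's contribution.
print(complete.dtype, complete.sum())
print(with_gap.dtype, with_gap.sum())
```

Checking `isna().sum()` on the column right after reading is a cheap way to spot this before comparing totals.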

@przymusp
Author

przymusp commented Apr 27, 2022

The conversion to floats is one problem, for sure.
But for the original error, I still think it is a parsing error or some kind of memory error:

>>> a = pdf["score"]
>>> b = cdf["score"].to_pandas()

>>> a[a != b]
656       4
744       1
2352      9
4780     87
7294     14
9612    649

>>> b[a != b]
656    NaN
744    NaN
2352   NaN
4780   NaN
7294   NaN
9612   NaN
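
The `a != b` filter above also flags rows where both sides are NaN, because NaN never compares equal to itself. A minimal sketch of a comparison that ignores rows missing on both sides, using made-up values; the helper name `mismatched_rows` is hypothetical and not part of pandas or cuDF.

```python
import numpy as np
import pandas as pd


def mismatched_rows(left: pd.Series, right: pd.Series) -> pd.DataFrame:
    """Return rows where the series differ, ignoring rows missing on both sides."""
    both_missing = left.isna() & right.isna()
    # NaN != NaN is True, so mask out rows that are missing in both series.
    differs = (left != right) & ~both_missing
    return pd.DataFrame({"left": left[differs], "right": right[differs]})


# Hypothetical data: row 0 failed to parse on the right-hand side,
# row 3 is genuinely missing in both.
a = pd.Series([4.0, 7.0, 9.0, np.nan])
b = pd.Series([np.nan, 7.0, 9.0, np.nan])
diff = mismatched_rows(a, b)
print(diff)
```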

@przymusp
Author

The reduction itself works:

>>> cdf["score"][ a == b].sum()
54363.0
>>> pdf["score"][ a == b].sum()
54363
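
As a quick arithmetic check using only numbers quoted in this thread, the six scores that cuDF returned as NaN account for the whole gap between the two totals:

```python
# Scores pandas parsed at the six rows where cuDF returned NaN.
missing_scores = [4, 1, 9, 87, 14, 649]

pandas_total = 55127
cudf_total = 54363

gap = pandas_total - cudf_total
print(sum(missing_scores), gap)  # both are 764
```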

@shwina
Contributor

shwina commented Apr 27, 2022

You're right - this is almost definitely a bug in the json reader. What I didn't realize was that NA in cuDF (and Pandas) has a boolean value of True.
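
The passing equality check can be reproduced in pandas alone with nullable dtypes: comparing against a missing value yields `<NA>`, and `.all()` skips missing entries by default. This is a sketch with made-up values, assuming cuDF's null handling behaves similarly; it is not taken from the thread.

```python
import pandas as pd

# Hypothetical data: the second value is missing on one side,
# mimicking a row that the reader failed to parse.
expected = pd.Series([4, 1, 9], dtype="Int64")
parsed = pd.Series([4, pd.NA, 9], dtype="Int64")

# Comparing with a missing value gives <NA> rather than False.
equal = expected == parsed

# all() skips missing entries by default, so the check passes.
all_default = equal.all()

# Treating "unknown" as a mismatch makes the check fail as intended.
all_strict = equal.fillna(False).all()

print(equal.tolist())
print(all_default, all_strict)
```

Filling missing comparison results with False before calling `.all()` makes an equality check sensitive to rows that one reader failed to parse.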

@galipremsagar galipremsagar added the cuIO cuIO issue label Apr 27, 2022
@przymusp
Author

I think the problem is associated with the parsing of nested JSON objects (all problematic lines had nested objects).
One more problem I noticed is the difference in DataFrame shape, which is probably connected to the nested JSON objects and the error above.

>>> cdf.shape
(9963, 101)
>>> pdf.shape
(9963, 50)
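
To illustrate how nested objects can change the column count, the sketch below builds two frames from the same records in pandas: one that keeps each nested object in a single column, and one that flattens it with `pd.json_normalize`. The field names are made up, and this makes no claim about how the cuDF reader arrives at its 101 columns.

```python
import json

import pandas as pd

# Hypothetical JSON Lines records with one nested object each.
raw_lines = [
    '{"score": 4, "media": {"kind": "image", "width": 640}}',
    '{"score": 1, "media": {"kind": "video", "width": 1280}}',
]
records = [json.loads(line) for line in raw_lines]

# Nested objects kept as dicts in a single "media" column.
nested = pd.DataFrame(records)

# Nested objects flattened into "media.kind" and "media.width".
flat = pd.json_normalize(records)

print(nested.shape, flat.shape)
print(list(flat.columns))
```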

@bdice
Contributor

bdice commented May 19, 2022

Hi @przymusp! I looked into this issue with @shwina. You're correct that the nested objects are the issue here. Currently, reading nested JSON data directly is not supported by cuDF (see #8827), but it is a high priority feature for the cuDF library.

A workaround is to read the nested JSON data with pandas and then transfer the data to cuDF with cudf.from_pandas(df). While looking into this, we caught and fixed a bug (#10761) that arose when some of the nested data fields were empty in all rows. Below is a code snippet that performs the analysis you described in the original post, and the results are now correct. To run it, you will need a nightly build of 22.06a, which contains the fix from #10761.

>>> import cudf
>>> import pandas as pd
>>> pdf = pd.read_json("bad_cudf.json.gz", lines=True)
>>> cdf = cudf.from_pandas(pdf)
>>> print(pdf["score"].sum())
55127
>>> print(cdf["score"].sum())
55127

I hope this is helpful, and please reach out if you have any other questions! 😄
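
When using this workaround, it can help to know which columns of the pandas frame actually carry nested values before transferring it. A minimal pandas-only sketch; the helper `nested_columns` and the sample frame are hypothetical and not part of pandas or cuDF.

```python
import pandas as pd


def nested_columns(df: pd.DataFrame) -> list[str]:
    """List the columns that hold at least one dict or list value."""
    return [
        name
        for name in df.columns
        if df[name].map(lambda value: isinstance(value, (dict, list))).any()
    ]


# Hypothetical frame: "media" holds nested objects, "score" does not.
pdf = pd.DataFrame({"score": [4, 1], "media": [{"kind": "image"}, None]})
print(nested_columns(pdf))
```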

@przymusp
Author

Thanks for fixing this. Do you have a roadmap for nested JSON Lines support?

@elstehle
Contributor

Thanks for your interest in the nested JSON parser, @przymusp! We are currently looking at landing an experimental version in the 22.08 release.

@przymusp
Author

przymusp commented Jun 1, 2022

Thanks for the info.

@przymusp przymusp closed this as completed Jun 1, 2022
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
6 participants