[BUG] differences in parsed data from cudf.read_json and pandas.read_json #10745
Comments
cudf returns different values than pandas when parsing json? not good. |
Hmm -- this looks to be more an issue with the reduction (
In [36]: pdf = pd.read_json("bad_cudf.json.gz", lines=True)
In [37]: gdf = cudf.read_json("bad_cudf.json.gz", lines=True)
In [38]: (gdf['score'] == pdf['score']).all()
/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/core/single_column_frame.py:345: FutureWarning: Binary operations between host objects such as <class 'pandas.core.series.Series'> and <class 'cudf.core.series.Series'> are deprecated and will be removed in a future release. Please convert it to a cudf object before performing the operation.
warnings.warn(
Out[38]: True
In [39]: gdf['score'].sum()
Out[39]: 54363.0
In [40]: pdf['score'].sum()
Out[40]: 55127
I also notice that cuDF infers the |
Conversion to floats is one problem for sure.
>>> a = pdf["score"]
>>> b = cdf["score"].to_pandas()
>>> a[a != b]
656 4
744 1
2352 9
4780 87
7294 14
9612 649
>>> b[a != b]
656 NaN
744 NaN
2352 NaN
4780 NaN
7294 NaN
9612 NaN |
Reduction works
>>> cdf["score"][a == b].sum()
54363.0
>>> pdf["score"][a == b].sum()
54363 |
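The numbers in the sessions above can be cross-checked with a little arithmetic. The sketch below uses only the values already posted in this thread and assumes that both libraries skip NaN when summing (the default `skipna=True` behaviour of `Series.sum()`); the helper name `nan_skipping_sum` is made up for illustration.

```python
import math

# Values pandas parsed at the six mismatching rows (copied from the a[a != b] output).
pandas_values = {656: 4, 744: 1, 2352: 9, 4780: 87, 7294: 14, 9612: 649}
# cuDF produced NaN for those same rows (from the b[a != b] output).
cudf_values = {row: math.nan for row in pandas_values}


def nan_skipping_sum(values):
    """Sum while ignoring NaN, mimicking a skipna=True reduction."""
    return sum(v for v in values if not math.isnan(v))


# Total contributed by the six rows in the pandas result.
missing_total = sum(pandas_values.values())
# Gap between the two sums reported earlier in the thread.
reported_gap = 55127 - 54363

print(missing_total, reported_gap)
print(nan_skipping_sum(cudf_values.values()))
```

If the two printed totals agree, the six NaN rows account for the whole gap, which would be consistent with the reduction being fine and the reader being at fault.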
You're right - this is almost definitely a bug in the json reader. What I didn't realize was that |
I think the problem is associated with the parsing of nested json objects (all problematic lines had nested objects).
>>> cdf.shape
(9963, 101)
>>> df.shape
(9963, 50) |
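One way to check which records contain nested objects, without relying on either reader, is to scan the JSON Lines input with the standard library. This is only a rough sketch: the function name `lines_with_nested_values` and the sample records (including the `id`, `meta`, and `tags` fields) are invented for illustration and are not taken from the attached file.

```python
import json


def lines_with_nested_values(lines):
    """Return the indices of JSON Lines records that hold a dict or list value."""
    nested = []
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Only top-level objects are inspected; other JSON values are skipped.
        if isinstance(record, dict) and any(
            isinstance(value, (dict, list)) for value in record.values()
        ):
            nested.append(i)
    return nested


# Hypothetical sample data, not the contents of bad_cudf.json.gz.
sample = [
    '{"id": 1, "score": 4}',
    '{"id": 2, "score": 1, "meta": {"source": "a"}}',
    '{"id": 3, "score": 9, "tags": ["x", "y"]}',
]
nested_rows = lines_with_nested_values(sample)
print(nested_rows)
```

To run it against a gzipped file, the lines could be read with `gzip.open(path, "rt")` and passed to the same function; the resulting indices can then be compared with the rows where the two readers disagree.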
Hi @przymusp! I looked into this issue with @shwina. You're correct that the nested objects are the issue here. Currently, reading nested JSON data directly is not supported by cuDF (see #8827), but it is a high priority feature for the cuDF library. A workaround is to read the nested JSON data with pandas, and then transfer the data to cuDF with
>>> import cudf
>>> import pandas as pd
>>> pdf = pd.read_json("bad_cudf.json.gz", lines=True)
>>> cdf = cudf.from_pandas(pdf)
>>> print(pdf["score"].sum())
55127
>>> print(cdf["score"].sum())
55127
I hope this is helpful, and please reach out if you have any other questions! 😄 |
Thanks for fixing this. Do you have a roadmap for nested JSON lines? |
Thanks for your interest in the nested JSON parser, @przymusp! We are currently looking at landing an experimental release with the 22.08 release. |
Thanks for the info. |
Describe the bug
Reading the same JSON file with pandas and cudf returns differently parsed data.
Steps/Code to reproduce bug
Expected behavior
Parsing should be the same for pandas and cudf, i.e.
pdf["score"].sum() == cdf["score"].sum()
Environment overview (please complete the following information)
docker pull & docker run commands used
"DockerVersion": "20.10.14",
sudo docker run -v "/home/eror/GPU-IDUB":"/mnt" -it 2d32b9cc4ade
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
in attachment
Additional context
Add any other context about the problem here.
ENV and sample data
bad_cudf.json.gz
cudf_env.log