BUG: Inconsistent results using pd.json_normalize() on a generator object versus list (off by one) #35923
Comments
I don't believe the documentation states that a generator is an accepted value for
that said the issue is here: https://github.com/pandas-dev/pandas/blob/v1.1.1/pandas/io/json/_normalize.py#L269-L279 specifically
this consumes the first yielded result from the generator
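To make the mechanism concrete, here is a minimal plain-Python sketch of how a pre-check that scans the input can swallow the first record of a generator. It is an illustration only, not the pandas source: `records` and `looks_nested` are hypothetical names, and the `any(...)` scan is an assumption about the general shape of such a check.

```python
def records():
    """Hypothetical generator standing in for the input records."""
    yield {"a": 1}
    yield {"a": 2}


def looks_nested(data):
    # Illustrative pre-check (an assumption, not the actual pandas code).
    # Each record is mapped to a list of booleans; any() stops at the first
    # non-empty list, so it pulls exactly one record from a generator.
    return any([isinstance(v, dict) for v in rec.values()] for rec in data)


gen = records()
looks_nested(gen)
remaining = list(gen)   # the first record was consumed by the check

as_list = list(records())
looks_nested(as_list)   # a list can be scanned and then read again in full
```

After the check, `remaining` holds only the second record, while `as_list` still holds both, which matches the off-by-one difference between generator and list input described in this issue.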
Ahh, I see. This was my first time passing a generator to json_normalize and it seemed like it worked since I had many records. Perhaps a warning or error could be raised if a generator is passed to this method. Shall I close this now?
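The "raise an error" idea could look roughly like the guard below. This is a sketch under stated assumptions: `require_materialized` is a hypothetical helper name, it is not part of pandas, and the error message is invented for illustration.

```python
import types


def require_materialized(data):
    """Hypothetical guard: reject generators instead of silently losing a record."""
    if isinstance(data, types.GeneratorType):
        raise TypeError(
            "generator input is not supported here; pass list(data) instead"
        )
    return data


# A list passes through unchanged.
checked = require_materialized([{"a": 1}, {"a": 2}])
```

A caller would wrap the input with this guard before normalizing it, so the failure is loud rather than a missing row.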
@WillAyd is this intended to be supported?
Losing the first record is a nasty surprise - would take a patch here for sure
@WillAyd did some debugging and found out the issue is caused by this line https://github.com/pandas-dev/pandas/blob/master/pandas/io/json/_normalize.py#L270 The for loop with
I am favouring the additional runtime option, since I think the user will only provide a generator if there are memory constraints. But LMK if you see this differently
take
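The memory-versus-runtime trade-off raised here can be sketched in plain Python. Both variants below are illustrations, not the patch that was eventually merged: `peek` and `records` are hypothetical names, and the choice of `itertools.chain` to re-attach the inspected item is an assumption.

```python
from itertools import chain

_EMPTY = object()  # sentinel so that a legitimate None record is not mistaken for "empty"


def peek(iterable):
    """Return (first_item, iterator that still yields every item, including the first)."""
    it = iter(iterable)
    first = next(it, _EMPTY)
    if first is _EMPTY:
        return None, iter(())
    # Put the inspected item back in front of the rest of the stream.
    return first, chain([first], it)


def records():
    """Hypothetical stream of records."""
    yield {"id": 1}
    yield {"id": 2}
    yield {"id": 3}


# Variant 1: materialize everything up front (simple, but holds all records in memory).
as_list = list(records())

# Variant 2: inspect only the first record and keep the rest lazy.
first, lazy = peek(records())
lazy_result = list(lazy)
```

Both variants end up with all three records; the difference is whether the whole input has to be held in memory before any inspection happens.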
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
Code Sample, a copy-pastable example
Only one value is returned with this:
This returns all values though:
And so does this:
Problem description
Using pd.json_normalize() on a generator always seems to reduce the expected results by 1. I first noticed this on a REST API where a column informed me that I should expect 901 results but I kept getting 900 results each time. When I tried to append the results to a list and normalize that, I got the expected 901 results.
Expected Output
Perhaps this is an expected output. It just caused me some headaches earlier and it was not immediately obvious that I was missing one record. I would expect that my example above would result in the same 2 row DataFrame.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0