QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303
Comments
the pandas library to read a CSV file where some values in column y, such as -99, should be treated as missing values (NaN). However, the approach you are using, na_values={"y":-99}, might not work as expected because na_values is typically used to specify a list of strings that should be recognized as NaN. df = pd.read_csv('test.csv', na_values={"y": [-99]}) You already used the above. Hope that helps. |
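For reference, a minimal sketch of the string-based form of na_values scoped to a single column. The CSV text below is made up for illustration and is an assumption, not the test.csv from this issue:

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv; the real file is not reproduced here.
csv_text = "x,y\n1,-99\n2,5\n3,7\n"

# Pass the sentinel as a string and scope it to column "y" only,
# so a -99 appearing in any other column would be left untouched.
df = pd.read_csv(io.StringIO(csv_text), na_values={"y": ["-99"]})

# Boolean mask of which rows in "y" were read as missing.
y_is_missing = df["y"].isna().tolist()
print(df)
print(y_is_missing)
```

Reading from io.StringIO keeps the example self-contained; a file path works the same way.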
@sshu2017 - thanks for the report. Instead of screenshots, can you edit the OP to have text-based examples? It is extra work for maintainers to try to reproduce behavior using screenshots. |
The docstring for
You are not providing strings, so this is undefined behavior. Perhaps we should raise if strings are not provided though. |
Sorry. I just updated the post. |
I see! I tried using strings and things start to look better.
Results:
I think df7 and df8 are fine, but df5 and df6 are still a little strange: the "-99" took care of both "-99" and "-99.0" in the df5 case, while the "-99.0" only took care of the "-99.0" in the df6 case. But all 4 of them make more sense now. Should we add a check so that if someone, like me, happens to provide non-string values to na_values, an exception (or a warning) is raised? If yes, I am more than happy to submit a PR for it. Thank you @rhshadrach ! |
An exception, I think. |
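A rough sketch of what such a check might look like. The helper name check_na_values_are_strings and its error message are hypothetical; this is not code from pandas or from any PR, only an illustration of the idea under the assumption that na_values may be a scalar, a list-like, or a dict of those:

```python
from collections.abc import Mapping


def check_na_values_are_strings(na_values):
    """Hypothetical helper: raise ValueError if any na_values entry is not a str."""
    if na_values is None:
        return
    # Normalize to (column, values) pairs; column is None for the global form.
    if isinstance(na_values, Mapping):
        groups = list(na_values.items())
    else:
        groups = [(None, na_values)]
    for column, values in groups:
        # A bare scalar such as -99 or "-99" is treated as a one-element list.
        if not isinstance(values, (list, tuple, set, frozenset)):
            values = [values]
        for value in values:
            if not isinstance(value, str):
                where = f" for column {column!r}" if column is not None else ""
                raise ValueError(
                    f"na_values must be strings, got {value!r}{where}"
                )


# String sentinels pass silently; a numeric sentinel is rejected.
check_na_values_are_strings({"y": ["-99", "-99.0"]})
try:
    check_na_values_are_strings({"y": -99})
except ValueError as err:
    print(err)
```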
Hi @rhshadrach, it seems like this issue was discussed and dealt with 11 years ago. Maybe we could just cherry-pick commit #3841? I am not sure why the changes made in commit #3841 are not in the latest version, for example, the 3.0.0.dev0+1320.gd093fae3cd version I built locally. |
I would guess trying to cherry pick a commit from 11 years ago would be problematic. It seems to me we should be testing for equality here when determining whether to make a replacement. I'm classifying this as a bugfix. Further investigations and PRs to fix are welcome! |
Hi @rhshadrach, I created a branch, and now a ValueError is raised when the user passes a non-string value as na_values. But many tests indicate that non-string values are acceptable, for example: this test, this test, and this test. So I am wondering if this change is a bit too much. Please kindly advise. In case you want to see my code changes, here's the comparison of my branch with the main branch. Also, I think the issue is in the C parser only and the Python parser is working as expected, as shown below. (The pyarrow parser requires all na_values to be strings, so it is all good.) Code:
Output:
Maybe we could fix the C parser? Or make the C parser behave like the pyarrow parser and only accept strings for na_values? |
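One way to compare the parsers side by side is sketched below. The helper missing_mask_by_engine and the CSV text are hypothetical and are not the code or data posted in this thread; the pyarrow engine is left out on the assumption that it may not be installed. No output is shown for the non-string case because it depends on the pandas version:

```python
import io

import pandas as pd


def missing_mask_by_engine(csv_text, na_values, column, engines=("c", "python")):
    """Hypothetical helper: read the same CSV with each engine and
    report which rows of `column` came back as missing."""
    masks = {}
    for engine in engines:
        df = pd.read_csv(io.StringIO(csv_text), na_values=na_values, engine=engine)
        masks[engine] = df[column].isna().tolist()
    return masks


# Made-up data, not the file from this issue.
csv_text = "x,y\n1,-99\n2,5\n3,7\n"

# String sentinel: the documented form.
string_masks = missing_mask_by_engine(csv_text, {"y": ["-99"]}, "y")
print(string_masks)

# Non-string sentinel: the form whose handling is in question here.
numeric_masks = missing_mask_by_engine(csv_text, {"y": [-99]}, "y")
print(numeric_masks)
```

If the two engines print different masks for the same na_values, that points at the parser rather than at the input.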
Ah - thanks. It looks like the documentation for it only taking strings was added here: 20161d9. Agreed we should not restrict |
Hi @rhshadrach , sorry but just to confirm - you are suggesting that the c parser should be fixed and it should be able to take in not only
but also
Is that correct? If yes, I can start working on it, but it may take me a while since I am not very familiar with C. |
@sshu2017 - yes, that is correct. |
Research
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/46397526/how-to-use-na-values-option-in-the-pd-read-csv-function
Question about pandas
I have a simple csv file that looks like this:
and when I tried a few different na_values, I got different column y back:
Results:
I'm not sure whether this is a bug or by design, so I am just asking a general question here. Thank you!
The pandas version is 2.2.1, in case it is needed.