Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

Closed
2 tasks done
sshu2017 opened this issue Jul 24, 2024 · 12 comments · Fixed by #59755
Closed
2 tasks done

QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

sshu2017 opened this issue Jul 24, 2024 · 12 comments · Fixed by #59755
Labels
Bug IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@sshu2017
Copy link
Contributor

sshu2017 commented Jul 24, 2024

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/46397526/how-to-use-na-values-option-in-the-pd-read-csv-function

Question about pandas

I have a simple csv file that looks like this:

x,y,z
a,-99,100
b,-99,200
c,-99.0,300
d,-99.0,400

and when I tried a few different na_values, I got different column y back:

import pandas as pd

df1 = pd.read_csv('test.csv', na_values={"y": -99})
print("df1 = \n", df1)

df2 = pd.read_csv('test.csv', na_values={"y": -99.0})
print("\ndf2 = \n", df2)

df3 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]})
print("\ndf3 = \n", df3)

df4 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]})
print("\ndf4 = \n", df4)

Results:

df1 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df2 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df3 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df4 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

I'm not sure if this is a bug or it is by design, so just throwing out a general question here. Thank you!

Pandas version is 2.2.1, just in case needed.

@sshu2017 sshu2017 added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jul 24, 2024
@Lokeshram12
Copy link

the pandas library to read a CSV file where some values like -99 in the column y should be treated as missing values (NaN). However, the approach you are using na_values={"y":-99} might not work as expected because na_values is typically used to specify a list of strings that should be recognized as NaN.
To correctly handle this situation in pandas, you can use the na_values parameter with a dictionary where keys are column names and values are lists of values to be treated as NaN.

df = pd.read_csv('test.csv', na_values={"y": [-99]})

The above you already used.Hope you got it

@rhshadrach
Copy link
Member

@sshu2017 - thanks for the report. Instead of screenshots, can you edit the OP to have text-based examples? It is extra work for maintainers to try to reproduce behavior using screenshots.

@rhshadrach rhshadrach added Needs Info Clarification about behavior needed to assess issue Bug IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Jul 24, 2024
@rhshadrach
Copy link
Member

The docstring for na_values states:

Additional strings to recognize as NA/NaN.

You are not providing strings, so this is undefined behavior. Perhaps we should raise if strings are not provided though.

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action and removed Usage Question labels Jul 24, 2024
@sshu2017
Copy link
Contributor Author

sshu2017 commented Aug 5, 2024

@sshu2017 - thanks for the report. Instead of screenshots, can you edit the OP to have text-based examples? It is extra work for maintainers to try to reproduce behavior using screenshots.

Sorry. I just updated the post.

@sshu2017
Copy link
Contributor Author

sshu2017 commented Aug 5, 2024

The docstring for na_values states:

Additional strings to recognize as NA/NaN.

You are not providing strings, so this is undefined behavior. Perhaps we should raise if strings are not provided though.

I see! I tried using strings and things start to look better.

df5 = pd.read_csv('test.csv', na_values={"y": "-99"})
print("\ndf5 = \n", df5)

df6 = pd.read_csv('test.csv', na_values={"y": "-99.0"})
print("\ndf6 = \n", df6)

df7 = pd.read_csv('test.csv', na_values={"y": ["-99", "-99.0"]})
print("\ndf7 = \n", df7)

df8 = pd.read_csv('test.csv', na_values={"y": ["-99.0", "-99"]})

Results:

df5 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df6 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df7 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df8 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

I think df7 and df8 are fine but df5 and df6 are still a little strange - the "-99" took care of both "-99" and "-99.0" in the df5 case, while the "-99.0" only took care of the "-99.0" in the df6 case. But all 4 of them are making more sense now.

Should we add a check so if someone, like me, happen to provide non-string values to the na_values, an exception (or a warning) will be raised? If yes, I am more than happy to submit a PR for it. Thank you @rhshadrach !

@rhshadrach
Copy link
Member

Should we add a check so if someone, like me, happen to provide non-string values to the na_values, an exception (or a warning) will be raised?

An exception, I think.

@sshu2017
Copy link
Contributor Author

Hi @rhshadrach , seem like this issue has been discussed and dealt with 11 years ago. Maybe we could just cherry pick the commit #3841? Not sure why the changes made in the commit #3841 are not in the latest version, for example, the 3.0.0.dev0+1320.gd093fae3cd version I built locally.

@rhshadrach rhshadrach added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Discussion Requires discussion from core team before further action Needs Info Clarification about behavior needed to assess issue labels Aug 13, 2024
@rhshadrach
Copy link
Member

Maybe we could just cherry pick the commit #3841?

I would guess trying to cherry pick a commit from 11 years ago would be problematic.

It seems to me we should be testing for equality here when determining whether to make a replacement. I'm classifying this as a bugfix. Further investigations and PRs to fix are welcome!

@sshu2017
Copy link
Contributor Author

Hi @rhshadrach, I created a branch and now an ValueError would be raised when user send in a non-string value as na_values. But many tests indicate that non-string values are acceptable, for example: this test, this test, and this test. So I am wondering if this change is a bit too much. Please kindly advise. In case you want to see my code changes, here's the comparison of my branch with the main branch.
(sorry it's a little messy because my PyCharm accidentally formatted the file a little bit).

Also I think the issue is in the c parser only and python parser is working as expected, as shown below. (pyarrow parser requires all na_values to be strings so it is all good).

Codes:

import pandas as pd
print("pd version = ", pd.__version__)

df1 = pd.read_csv('test.csv', na_values={"y": -99}, engine="python")
print("df1 = \n", df1)
df2 = pd.df1 = pd.read_csv('test.csv', na_values={"y": -99.0}, engine="python")
print("\ndf2 = \n", df2)
df3 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]}, engine="python")
print("\ndf3 = \n", df3)
df4 = pd.readdf1 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]}, engine="python")
print("\ndf4 = \n", df4)


df5 = pd.read_csv('test.csv', na_values={"y": -99}, engine="c")
print("\ndf5 = \n", df5)
df6 = pd.df1 = pd.read_csv('test.csv', na_values={"y": -99.0}, engine="c")
print("\ndf6 = \n", df6)
df7 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]}, engine="c")
print("\ndf7 = \n", df7)
df8 = pd.readdf1 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]}, engine="c")
print("\ndf8 = \n", df8)

Output:

pd version =  2.2.1
df1 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df2 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df3 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df4 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df5 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df6 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df7 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df8 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

Maybe we could fix the c parser? or make c parse behave like pyarrow parser and only accept strings for na_values?

@rhshadrach
Copy link
Member

Ah - thanks. It looks like the documentation for it only taking strings was added here: 20161d9.

Agreed we should not restrict na_values to only being strings.

@sshu2017
Copy link
Contributor Author

sshu2017 commented Aug 26, 2024

Hi @rhshadrach , sorry but just to confirm - you are suggesting that the c parser should be fixed and it should be able to take in not only

"-99"

but also

-99

Is it correct? If yes, I can start working on it but it may take me a while since I am not so familiar with C.

@rhshadrach
Copy link
Member

@sshu2017 - yes, that is correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants