QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

sshu2017 · 2024-07-24T04:10:46Z

Research

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/46397526/how-to-use-na-values-option-in-the-pd-read-csv-function

Question about pandas

I have a simple csv file that looks like this:

x,y,z
a,-99,100
b,-99,200
c,-99.0,300
d,-99.0,400

and when I tried a few different na_values, I got different column y back:

import pandas as pd

df1 = pd.read_csv('test.csv', na_values={"y": -99})
print("df1 = \n", df1)

df2 = pd.read_csv('test.csv', na_values={"y": -99.0})
print("\ndf2 = \n", df2)

df3 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]})
print("\ndf3 = \n", df3)

df4 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]})
print("\ndf4 = \n", df4)

Results:

df1 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df2 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df3 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df4 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

I'm not sure if this is a bug or it is by design, so just throwing out a general question here. Thank you!

Pandas version is 2.2.1, just in case needed.

The text was updated successfully, but these errors were encountered:

Lokeshram12 · 2024-07-24T07:09:27Z

the pandas library to read a CSV file where some values like -99 in the column y should be treated as missing values (NaN). However, the approach you are using na_values={"y":-99} might not work as expected because na_values is typically used to specify a list of strings that should be recognized as NaN.
To correctly handle this situation in pandas, you can use the na_values parameter with a dictionary where keys are column names and values are lists of values to be treated as NaN.

df = pd.read_csv('test.csv', na_values={"y": [-99]})

The above you already used.Hope you got it

rhshadrach · 2024-07-24T21:16:56Z

@sshu2017 - thanks for the report. Instead of screenshots, can you edit the OP to have text-based examples? It is extra work for maintainers to try to reproduce behavior using screenshots.

rhshadrach · 2024-07-24T21:30:51Z

The docstring for na_values states:

Additional strings to recognize as NA/NaN.

You are not providing strings, so this is undefined behavior. Perhaps we should raise if strings are not provided though.

sshu2017 · 2024-08-05T06:08:10Z

@sshu2017 - thanks for the report. Instead of screenshots, can you edit the OP to have text-based examples? It is extra work for maintainers to try to reproduce behavior using screenshots.

Sorry. I just updated the post.

sshu2017 · 2024-08-05T06:17:36Z

The docstring for na_values states:

Additional strings to recognize as NA/NaN.

You are not providing strings, so this is undefined behavior. Perhaps we should raise if strings are not provided though.

I see! I tried using strings and things start to look better.

df5 = pd.read_csv('test.csv', na_values={"y": "-99"})
print("\ndf5 = \n", df5)

df6 = pd.read_csv('test.csv', na_values={"y": "-99.0"})
print("\ndf6 = \n", df6)

df7 = pd.read_csv('test.csv', na_values={"y": ["-99", "-99.0"]})
print("\ndf7 = \n", df7)

df8 = pd.read_csv('test.csv', na_values={"y": ["-99.0", "-99"]})

Results:

df5 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df6 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df7 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df8 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

I think df7 and df8 are fine but df5 and df6 are still a little strange - the "-99" took care of both "-99" and "-99.0" in the df5 case, while the "-99.0" only took care of the "-99.0" in the df6 case. But all 4 of them are making more sense now.

Should we add a check so if someone, like me, happen to provide non-string values to the na_values, an exception (or a warning) will be raised? If yes, I am more than happy to submit a PR for it. Thank you @rhshadrach !

rhshadrach · 2024-08-07T21:50:35Z

Should we add a check so if someone, like me, happen to provide non-string values to the na_values, an exception (or a warning) will be raised?

An exception, I think.

sshu2017 · 2024-08-11T04:29:40Z

Hi @rhshadrach , seem like this issue has been discussed and dealt with 11 years ago. Maybe we could just cherry pick the commit #3841? Not sure why the changes made in the commit #3841 are not in the latest version, for example, the 3.0.0.dev0+1320.gd093fae3cd version I built locally.

rhshadrach · 2024-08-13T11:04:45Z

Maybe we could just cherry pick the commit #3841?

I would guess trying to cherry pick a commit from 11 years ago would be problematic.

It seems to me we should be testing for equality here when determining whether to make a replacement. I'm classifying this as a bugfix. Further investigations and PRs to fix are welcome!

sshu2017 · 2024-08-18T02:23:21Z

Hi @rhshadrach, I created a branch and now an ValueError would be raised when user send in a non-string value as na_values. But many tests indicate that non-string values are acceptable, for example: this test, this test, and this test. So I am wondering if this change is a bit too much. Please kindly advise. In case you want to see my code changes, here's the comparison of my branch with the main branch.
(sorry it's a little messy because my PyCharm accidentally formatted the file a little bit).

Also I think the issue is in the c parser only and python parser is working as expected, as shown below. (pyarrow parser requires all na_values to be strings so it is all good).

Codes:

import pandas as pd
print("pd version = ", pd.__version__)

df1 = pd.read_csv('test.csv', na_values={"y": -99}, engine="python")
print("df1 = \n", df1)
df2 = pd.df1 = pd.read_csv('test.csv', na_values={"y": -99.0}, engine="python")
print("\ndf2 = \n", df2)
df3 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]}, engine="python")
print("\ndf3 = \n", df3)
df4 = pd.readdf1 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]}, engine="python")
print("\ndf4 = \n", df4)


df5 = pd.read_csv('test.csv', na_values={"y": -99}, engine="c")
print("\ndf5 = \n", df5)
df6 = pd.df1 = pd.read_csv('test.csv', na_values={"y": -99.0}, engine="c")
print("\ndf6 = \n", df6)
df7 = pd.read_csv('test.csv', na_values={"y": [-99, -99.0]}, engine="c")
print("\ndf7 = \n", df7)
df8 = pd.readdf1 = pd.read_csv('test.csv', na_values={"y": [-99.0, -99]}, engine="c")
print("\ndf8 = \n", df8)

Output:

pd version =  2.2.1
df1 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df2 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df3 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df4 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df5 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df6 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

df7 = 
    x   y    z
0  a NaN  100
1  b NaN  200
2  c NaN  300
3  d NaN  400

df8 = 
    x     y    z
0  a -99.0  100
1  b -99.0  200
2  c   NaN  300
3  d   NaN  400

Maybe we could fix the c parser? or make c parse behave like pyarrow parser and only accept strings for na_values?

rhshadrach · 2024-08-20T21:11:49Z

Ah - thanks. It looks like the documentation for it only taking strings was added here: 20161d9.

Agreed we should not restrict na_values to only being strings.

sshu2017 · 2024-08-26T18:57:46Z

Hi @rhshadrach , sorry but just to confirm - you are suggesting that the c parser should be fixed and it should be able to take in not only

"-99"

but also

-99

Is it correct? If yes, I can start working on it but it may take me a while since I am not so familiar with C.

rhshadrach · 2024-09-03T17:35:41Z

@sshu2017 - yes, that is correct.

sshu2017 added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jul 24, 2024

rhshadrach added Needs Info Clarification about behavior needed to assess issue Bug IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Jul 24, 2024

rhshadrach added Needs Discussion Requires discussion from core team before further action and removed Usage Question labels Jul 24, 2024

sshu2017 mentioned this issue Aug 11, 2024

Odd behavior from read_csv with na_values set to non-string values #3611

Closed

rhshadrach added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Discussion Requires discussion from core team before further action Needs Info Clarification about behavior needed to assess issue labels Aug 13, 2024

sshu2017 mentioned this issue Sep 9, 2024

Fix/na_values_GH59303 #59755

Merged

5 tasks

mroeschke closed this as completed in #59755 Sep 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

sshu2017 commented Jul 24, 2024 •

edited

Loading

Lokeshram12 commented Jul 24, 2024

rhshadrach commented Jul 24, 2024

rhshadrach commented Jul 24, 2024

sshu2017 commented Aug 5, 2024

sshu2017 commented Aug 5, 2024

rhshadrach commented Aug 7, 2024

sshu2017 commented Aug 11, 2024

rhshadrach commented Aug 13, 2024

sshu2017 commented Aug 18, 2024

rhshadrach commented Aug 20, 2024

sshu2017 commented Aug 26, 2024 •

edited

Loading

rhshadrach commented Sep 3, 2024

QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

QST: Is this expected behavior when pd.read_csv() with na_values arguments? #59303

Comments

sshu2017 commented Jul 24, 2024 • edited Loading

Research

Link to question on StackOverflow

Question about pandas

Lokeshram12 commented Jul 24, 2024

rhshadrach commented Jul 24, 2024

rhshadrach commented Jul 24, 2024

sshu2017 commented Aug 5, 2024

sshu2017 commented Aug 5, 2024

rhshadrach commented Aug 7, 2024

sshu2017 commented Aug 11, 2024

rhshadrach commented Aug 13, 2024

sshu2017 commented Aug 18, 2024

rhshadrach commented Aug 20, 2024

sshu2017 commented Aug 26, 2024 • edited Loading

rhshadrach commented Sep 3, 2024

sshu2017 commented Jul 24, 2024 •

edited

Loading

sshu2017 commented Aug 26, 2024 •

edited

Loading