ERR/API: non-string types #13877

Closed
michaelaye opened this issue Aug 2, 2016 · 8 comments · Fixed by #23167
Labels
Error Reporting Incorrect or improved errors from pandas Strings String extension data type and string data

Comments

@michaelaye
Contributor

michaelaye commented Aug 2, 2016

xref #5602, #9343, #13806 (comments)

Code Sample, a copy-pastable example if possible

In [2]: s = pd.Series(['a', 123, 'b'])

In [3]: s.str.startswith('a')
Out[3]:
0     True
1      NaN
2    False
dtype: object

Expected Output

raise an exception
Reasoning: if the input contains NaN, returning NaN is fine, but silently converting a non-string value to NaN is hard to spot. The user cannot easily find and drop integers hidden in an object-dtype ('O') Series.
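
As a possible workaround for the problem described above, the non-string entries can be located with an explicit isinstance check before any .str method is called. This is only a minimal sketch: the variable names (is_str, non_str, hidden, matches) are illustrative and not part of pandas.

```python
import pandas as pd

s = pd.Series(['a', 123, 'b'])

# True where the element is an actual Python str.
is_str = s.map(lambda x: isinstance(x, str))

# Non-string values that are not missing: these are the entries
# that .str methods silently turn into NaN.
non_str = ~is_str & s.notna()
hidden = s[non_str]
print(hidden)

# Restrict to genuine strings before using the .str accessor.
matches = s[is_str].str.startswith('a')
print(matches)
```

Filtering on is_str also drops missing values; use non_str instead when NaN entries should be kept and only the stray integers removed.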

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: 0.7.2
IPython: 5.0.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@michaelaye
Contributor Author

Potential connection: #9343

@shoyer
Member

shoyer commented Aug 2, 2016

@michaelaye I fixed your example -- Series.startswith does not exist (you need to go through the .str accessor)

@michaelaye
Contributor Author

ah, sorry. thanks!

@sinhrks
Member

sinhrks commented Aug 2, 2016

I think all related string methods should have the following kwds to cover most use cases:

  • errors: whether to raise or coerce non-str input
  • na: specify value to fill NaN (only exists in few methods)
pd.Series(['a', 2, 3]).str.startswith('a')
# 0    True
# 1     NaN
# 2     NaN
# dtype: object

pd.Series(['a', 2, 3]).str.startswith('a', na=False)
# 0     True
# 1    False
# 2    False
# dtype: bool

related to #5672 and #13806 (comments)
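
To make the proposed keywords concrete, here is a rough sketch of the semantics as a standalone helper. The function startswith_checked, its defaults, and the TypeError it raises are assumptions for illustration only; pandas does not provide this function, and the thread does not settle on a final design.

```python
import pandas as pd


def startswith_checked(s, pat, errors="raise", na=False):
    """Hypothetical helper (not a pandas API) sketching errors= and na=.

    errors="raise":  raise TypeError if any non-missing value is not a str.
    errors="coerce": replace every non-string value (including NaN) with `na`.
    """
    if errors not in ("raise", "coerce"):
        raise ValueError("errors must be 'raise' or 'coerce'")

    is_str = s.map(lambda x: isinstance(x, str))
    bad = ~is_str & s.notna()

    if errors == "raise" and bad.any():
        raise TypeError(
            f"non-string values at index {s[bad].index.tolist()}"
        )

    # Apply str.startswith to strings only; everything else becomes `na`.
    return s.map(lambda x: x.startswith(pat) if isinstance(x, str) else na)


coerced = startswith_checked(pd.Series(['a', 2, 3]), 'a', errors="coerce")
print(coerced)
```

With errors="raise", the same input raises a TypeError that names the offending index labels, which is the behaviour the issue asks for.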

@sinhrks sinhrks added Error Reporting Incorrect or improved errors from pandas Strings String extension data type and string data labels Aug 2, 2016
@TomAugspurger
Contributor

TomAugspurger commented Aug 2, 2016

Part of this would be solved by a proper string type, since pd.Series(['a', 123, 'b']).str.* would raise an error. But given the type system we have today, a keyword seems reasonable.

na: specify value to fill NaN (only exists in few methods)

Would fill_value be a better name? More consistent with other operations.

@jreback
Contributor

jreback commented Aug 2, 2016

It is trivial to determine string types.

This simply needs an implementation, and it would also deal with #9343.

In [1]: pd.lib.infer_dtype(pd.Series(['a', 123, 'b']).values)
Out[1]: 'mixed-integer'

In [2]: pd.lib.infer_dtype(pd.Series(['a', np.nan, 'b']).values)
Out[2]: 'mixed'

In [3]: pd.lib.infer_dtype(pd.Series(['a', 'c', 'b']).values)
Out[3]: 'string'

There is a slight perf penalty for this, but prob worth it.
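
The check above could be turned into a guard along the following lines. This sketch uses pandas.api.types.infer_dtype, the public location of the function in later pandas releases (the pd.lib path shown above is the 2016 one). The helper require_strings is hypothetical and is not how pandas itself implements the check.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def require_strings(values):
    """Hypothetical guard: raise unless all non-missing values are str."""
    # skipna=True ignores NaN/None when inferring the type.
    kind = infer_dtype(values, skipna=True)
    if kind != "string":
        raise TypeError(f"expected only strings, got inferred type {kind!r}")
    return values


kind_mixed = infer_dtype(['a', 123, 'b'], skipna=True)
kind_with_nan = infer_dtype(['a', np.nan, 'b'], skipna=True)
print(kind_mixed, kind_with_nan)

# Passes: missing values are skipped, the rest are strings.
require_strings(pd.Series(['a', np.nan, 'b'], dtype=object))
```

The inference scans every element, which is where the performance penalty mentioned above comes from.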

@jreback jreback changed the title string startswith/endswith functions silently coerce non-string types ERR/API: non-string types Aug 2, 2016
@jreback jreback added this to the Next Major Release milestone Aug 2, 2016
@jreback
Contributor

jreback commented Aug 2, 2016

I like @sinhrks's proposal, which unifies the API (and these same kwds are common in the pandas codebase), e.g. errors=raise|ignore|coerce is a pretty intuitive way to select behavior. Note that we generally default to errors='raise' (thinking of datetimes), though here for boolean methods I think errors='coerce' is appropriate (e.g. a non-string is by definition False).

@h-vetinari
Contributor

h-vetinari commented Jun 20, 2018

@sinhrks @TomAugspurger

Would fill_value be a better name? More consistent with other operations.

There are several methods with 'fill_value' (often also in combination with 'errors') - quick scan of doc/build/html/generated yields 72 hits. On the other hand, the .str.cat-method uses the name 'na_rep', although in a (very) slightly different way.

6 participants