Strange behaviour of delim_whitespace in pd.read_table #36381

Closed
2 of 3 tasks
mcocdawc opened this issue Sep 15, 2020 · 8 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

@mcocdawc
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
# Create an empty file; the error persists with non-empty files
open('./test_file', 'w').close()
pd.read_table('./test_file', delim_whitespace=True)

Problem description


~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    623     if delim_whitespace and delimiter != default_sep:
    624         raise ValueError(
--> 625             "Specified a delimiter with both sep and "
    626             "delim_whitespace=True; you can only specify one."
    627         )

ValueError: Specified a delimiter with both sep and delim_whitespace=True; you can only specify one.

Expected Output

Parsed dataframe.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.6.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.12.14-lp151.28.67-default
Version : #1 SMP Fri Sep 4 15:23:21 UTC 2020 (2c5a14f)
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.2
numpy : 1.17.3
pytz : 2018.5
dateutil : 2.7.3
pip : 19.3.1
setuptools : 40.5.0
Cython : 0.27.3
pytest : None
hypothesis : None
sphinx : 1.7.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.0.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 2.0.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.0
sqlalchemy : 1.2.14
tables : None
tabulate : 0.8.5
xarray : None
xlrd : None
xlwt : None
numba : 0.48.0

@mcocdawc mcocdawc added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 15, 2020
mcocdawc added a commit to mcocdawc/chemcoord that referenced this issue Sep 15, 2020
@jreback
Contributor

jreback commented Sep 15, 2020

sep is defaulted to ,
pls read the doc string

@jreback jreback added IO CSV read_csv, to_csv Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 15, 2020
@jreback jreback added this to the No action milestone Sep 15, 2020
@jreback jreback closed this as completed Sep 15, 2020
@mcocdawc
Contributor Author

mcocdawc commented Sep 15, 2020

@jreback
The error persists with:
pd.read_table('./test_file', delim_whitespace=True, sep=None, delimiter=None)

@tacaswell
Contributor

@jreback this issue should be re-opened.

The issue is that read_table changes the default value of sep to '\t', which is consistent with its docstring (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html) but inconsistent with the default of ',' that is hard-coded in read_csv:

pandas/pandas/io/parsers.py

Lines 694 to 705 in 8f6ec1e

@Appender(
    _doc_read_csv_and_table.format(
        func_name="read_table",
        summary="Read general delimited file into DataFrame.",
        _default_sep=r"'\\t' (tab-stop)",
    )
)
def read_table(
    filepath_or_buffer: FilePathOrBuffer,
    sep="\t",
    delimiter=None,
    # Column and Index Locations and Names

vs

pandas/pandas/io/parsers.py

Lines 603 to 615 in 8f6ec1e

# gh-23761
#
# When a dialect is passed, it overrides any of the overlapping
# parameters passed in directly. We don't want to warn if the
# default parameters were passed in (since it probably means
# that the user didn't pass them in explicitly in the first place).
#
# "delimiter" is the annoying corner case because we alias it to
# "sep" before doing comparison to the dialect values later on.
# Thus, we need a flag to indicate that we need to "override"
# the comparison to dialect values by checking if default values
# for BOTH "delimiter" and "sep" were provided.
default_sep = ","
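
The mismatch between the two defaults can be modelled without pandas. The following is a simplified sketch, not the actual pandas source: `read_csv_model`, `read_table_model` and `raises_value_error` are hypothetical stand-ins that keep only the aliasing of `sep` to `delimiter` and the check quoted in the traceback.

```python
DEFAULT_SEP = ","  # mirrors the hard-coded default_sep quoted above


def read_csv_model(sep=",", delimiter=None, delim_whitespace=False):
    """Hypothetical, stripped-down model of the delimiter check."""
    # "delimiter" falls back to "sep" when it was not given.
    if delimiter is None:
        delimiter = sep
    # The check from the traceback: anything other than "," is treated
    # as an explicitly specified delimiter.
    if delim_whitespace and delimiter != DEFAULT_SEP:
        raise ValueError(
            "Specified a delimiter with both sep and "
            "delim_whitespace=True; you can only specify one."
        )
    return r"\s+" if delim_whitespace else delimiter


def read_table_model(sep="\t", delimiter=None, delim_whitespace=False):
    """Hypothetical model of read_table: same logic, different default."""
    return read_csv_model(
        sep=sep, delimiter=delimiter, delim_whitespace=delim_whitespace
    )


def raises_value_error(func, **kwargs):
    """Return True if calling func(**kwargs) raises ValueError."""
    try:
        func(**kwargs)
    except ValueError:
        return True
    return False


csv_fails = raises_value_error(read_csv_model, delim_whitespace=True)
table_fails = raises_value_error(read_table_model, delim_whitespace=True)
none_fails = raises_value_error(
    read_table_model, delim_whitespace=True, sep=None, delimiter=None
)
print(csv_fails, table_fails, none_fails)
```

In this model only the `read_csv` variant accepts `delim_whitespace=True` on its own. The `read_table` variant fails because its default `"\t"` differs from `","`, and passing `sep=None` does not help because `None` also differs from `","`.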


import pandas as pd
with open('/tmp/test.dat', 'w') as fout:
    for j in range(5):
        fout.write(f'{j} {j*2}    {j**2}\n')

pd.read_table('/tmp/test.dat', delim_whitespace=True)

This probably has to be done with a proper sentinel object?

@tacaswell
Contributor

@mcocdawc
Contributor Author

So we want to override the default of a keyword argument from another function.
With the current local assignment default_sep = "," in parsers.py (line 615), I think this is not possible.

If I understand it correctly, our sentinel object has to carry two meanings:

  1. The argument is absent and the default should be taken. This could be done with None or an empty string.
  2. The default value itself. This could be done by directly assigning the separator, as it is done now. The problem is that the current approach cannot carry meaning 1 along with it.

One possibility is:

class Optional:
    def __init__(self, val):
        self.val = val
        
    def __bool__(self):
        return False
    
    def __repr__(self):
        return f'Optional "{self.val}"'

def read_csv(sep=Optional(','), delimiter=None, delim_whitespace=False):
    
    # At most one of sep, delimiter and delim_whitespace may be set explicitly.
    if sum(bool(x) for x in (sep, delimiter, delim_whitespace)) > 1:
        raise ValueError(
            "Specified a delimiter with two of {sep, delimiter, delim_whitespace} "
            "while only one or zero are allowed.")

    if any([sep, delimiter]):
        delimiter = delimiter if delimiter else sep
    elif delim_whitespace:
        # Raw string; \s+ matches runs of whitespace (\S+ would match non-whitespace).
        delimiter = r'\s+'
    else:
        delimiter = sep.val

    return delimiter, delim_whitespace

def read_table(sep=Optional('\t'), delimiter=None, delim_whitespace=False):
    return read_csv(**locals())

I would be interested in fixing, but I guess the decision on which sentinel object to choose is an important API decision, which should be done by you.

@mcocdawc
Contributor Author

@tacaswell
Does https://github.com/pandas-dev/pandas/pull/36560/files really fix the problem?

read_table(sep='\t', delim_whitespace=True)

will not throw an error, although it should if I read the documentation correctly?
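
To make the concern concrete, here is a sketch under the assumption that a fix only compares the delimiter against each function's own default. `parse_model` and `read_table_model` are hypothetical names, and the code is not taken from the linked pull request.

```python
def parse_model(sep, delimiter, delim_whitespace, default_sep):
    """Hypothetical check that compares against a per-function default."""
    if delimiter is None:
        delimiter = sep
    if delim_whitespace and delimiter != default_sep:
        raise ValueError(
            "Specified a delimiter with both sep and "
            "delim_whitespace=True; you can only specify one."
        )
    return r"\s+" if delim_whitespace else delimiter


def read_table_model(sep="\t", delimiter=None, delim_whitespace=False):
    # Assumption: read_table passes its own default along for the comparison.
    return parse_model(sep, delimiter, delim_whitespace, default_sep="\t")


implicit = read_table_model(delim_whitespace=True)
explicit = read_table_model(sep="\t", delim_whitespace=True)
print(implicit == explicit)
```

Under that assumption the explicit `sep="\t"` is indistinguishable from the default, so no error is raised even though both a separator and `delim_whitespace=True` were specified.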

@simonjayhawkins
Member

@mcocdawc since this issue is closed, maybe better for visibility to comment directly on #35958 or #36560

@simonjayhawkins simonjayhawkins added Duplicate Report Duplicate issue or pull request and removed IO CSV read_csv, to_csv Usage Question labels Sep 23, 2020
@simonjayhawkins
Member

> I would be interested in fixing, but I guess the decision on which sentinel object to choose is an important API decision, which should be done by you.

we use lib.no_default as a sentinel
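
A minimal sketch of how such a sentinel could separate "argument not passed" from "argument passed with the default value". It uses a plain `object()` named `NO_DEFAULT` as a stand-in for `lib.no_default`; `resolve_delimiter` and `read_table_model` are hypothetical names, and this is not necessarily how pandas implements it.

```python
NO_DEFAULT = object()  # stand-in sentinel; not pandas' lib.no_default


def resolve_delimiter(sep=NO_DEFAULT, delimiter=None,
                      delim_whitespace=False, default_sep=","):
    """Hypothetical helper: pick the delimiter and detect conflicting arguments."""
    # Identity comparison tells us whether the caller passed sep at all.
    sep_given = sep is not NO_DEFAULT
    if delim_whitespace and (sep_given or delimiter is not None):
        raise ValueError(
            "Specified a delimiter with both sep and "
            "delim_whitespace=True; you can only specify one."
        )
    if delim_whitespace:
        return r"\s+"
    if delimiter is not None:
        return delimiter
    return sep if sep_given else default_sep


def read_table_model(sep=NO_DEFAULT, delimiter=None, delim_whitespace=False):
    """Hypothetical read_table: only the fallback default differs."""
    return resolve_delimiter(sep, delimiter, delim_whitespace, default_sep="\t")


print(repr(read_table_model()))
print(repr(read_table_model(delim_whitespace=True)))
```

With this approach `read_table_model(delim_whitespace=True)` is accepted, while `read_table_model(sep="\t", delim_whitespace=True)` raises a `ValueError`, because the sentinel records that `sep` was passed explicitly.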
