Strange behaviour of `delim_whitespace` in `pd.read_table` #36381

mcocdawc · 2020-09-15T11:37:56Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
# The error persists with non-empty files
! touch test_file
pd.read_table('./test_file', delim_whitespace=True)

Problem description


~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    623     if delim_whitespace and delimiter != default_sep:
    624         raise ValueError(
--> 625             "Specified a delimiter with both sep and "
    626             "delim_whitespace=True; you can only specify one."
    627         )

ValueError: Specified a delimiter with both sep and delim_whitespace=True; you can only specify one.

Expected Output

Parsed dataframe.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2a7d332
python : 3.6.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.12.14-lp151.28.67-default
Version : #1 SMP Fri Sep 4 15:23:21 UTC 2020 (2c5a14f)
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.2
numpy : 1.17.3
pytz : 2018.5
dateutil : 2.7.3
pip : 19.3.1
setuptools : 40.5.0
Cython : 0.27.3
pytest : None
hypothesis : None
sphinx : 1.7.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.0.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 2.0.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.2.0
sqlalchemy : 1.2.14
tables : None
tabulate : 0.8.5
xarray : None
xlrd : None
xlwt : None
numba : 0.48.0


jreback · 2020-09-15T11:52:11Z

sep defaults to ','.
Please read the docstring.

mcocdawc · 2020-09-15T11:54:02Z

@jreback
The error persists with:
pd.read_table('./test_file', delim_whitespace=True, sep=None, delimiter=None)

tacaswell · 2020-09-22T16:34:48Z

@jreback this issue should be re-opened.

The issue is that read_table changes the default value of sep to '\t', which is consistent with the docstring (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html) but inconsistent with the hard-coded default of ',' in read_csv:

pandas/pandas/io/parsers.py

Lines 694 to 705 in 8f6ec1e

    
    @Appender(
        _doc_read_csv_and_table.format(
            func_name="read_table",
            summary="Read general delimited file into DataFrame.",
            _default_sep=r"'\\t' (tab-stop)",
        )
    )
    def read_table(
        filepath_or_buffer: FilePathOrBuffer,
        sep="\t",
        delimiter=None,
        # Column and Index Locations and Names

vs

pandas/pandas/io/parsers.py

Lines 603 to 615 in 8f6ec1e

    
    # gh-23761
    #
    # When a dialect is passed, it overrides any of the overlapping
    # parameters passed in directly. We don't want to warn if the
    # default parameters were passed in (since it probably means
    # that the user didn't pass them in explicitly in the first place).
    #
    # "delimiter" is the annoying corner case because we alias it to
    # "sep" before doing comparison to the dialect values later on.
    # Thus, we need a flag to indicate that we need to "override"
    # the comparison to dialect values by checking if default values
    # for BOTH "delimiter" and "sep" were provided.
    default_sep = ","

import pandas as pd
with open('/tmp/test.dat', 'w') as fout:
    for j in range(5):
        fout.write(f'{j} {j*2}    {j**2}\n')

pd.read_table('/tmp/test.dat', delim_whitespace=True)

This probably has to be done with a proper sentinel object?
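The mismatch described above can be reproduced without pandas. This is only a minimal sketch, not the pandas source: `read_csv_sketch` and `read_table_sketch` are hypothetical stand-ins that keep nothing but the delimiter check quoted in the traceback and the two different `sep` defaults.

```python
def read_csv_sketch(sep=",", delimiter=None, delim_whitespace=False):
    """Hypothetical stand-in mirroring only the check shown in the traceback."""
    default_sep = ","  # hard-coded, as in the quoted parsers.py snippet
    if delimiter is None:
        delimiter = sep  # "delimiter" is treated as an alias for "sep"
    if delim_whitespace and delimiter != default_sep:
        raise ValueError(
            "Specified a delimiter with both sep and "
            "delim_whitespace=True; you can only specify one."
        )
    return r"\s+" if delim_whitespace else delimiter


def read_table_sketch(sep="\t", delimiter=None, delim_whitespace=False):
    """Hypothetical stand-in: same logic, but with read_table's default sep."""
    return read_csv_sketch(
        sep=sep, delimiter=delimiter, delim_whitespace=delim_whitespace
    )


# read_csv's own default equals default_sep, so the check passes.
csv_result = read_csv_sketch(delim_whitespace=True)

# read_table's default "\t" differs from ",", so the check misfires
# even though the caller never passed sep.
try:
    read_table_sketch(delim_whitespace=True)
    table_error = None
except ValueError as exc:
    table_error = str(exc)
```

In this sketch the comparison against `","` cannot tell "the caller passed a tab" from "the wrapper's default is a tab", which is why a sentinel comes up as a possible remedy.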

tacaswell · 2020-09-22T21:14:44Z

duplicate of #35958 and fixed by https://github.com/pandas-dev/pandas/pull/36560/files

mcocdawc · 2020-09-22T21:15:38Z

So we want to override the default of a keyword argument from another function.
With the local assignment default_sep = "," at parsers.py::615, I don't think this is currently possible.

If I understand it correctly, our sentinel object has to carry two meanings:

1. The argument is absent and the default should be used. This could be done with None or an empty string.
2. The default value itself. This could be done by directly assigning the separator, as is done now. The problem is that the current approach cannot carry meaning 1 over.

One possibility is:

class Optional:
    def __init__(self, val):
        self.val = val
        
    def __bool__(self):
        return False
    
    def __repr__(self):
        return f'Optional "{self.val}"'

def read_csv(sep=Optional(','), delimiter=None, delim_whitespace=False):
    
    if sum(bool(x) for x in (sep, delimiter, delim_whitespace)) > 1:
        raise ValueError(
            "Specified a delimiter with two of {sep, delimiter, or delim_whitespace} "
            "while only one or zero allowed.")
  
    if any([sep, delimiter]):
        delimiter = delimiter if delimiter else sep
    elif delim_whitespace:
        delimiter = r'\s+'
    else:
        delimiter = sep.val

    return delimiter, delim_whitespace

def read_table(sep=Optional('\t'), delimiter=None, delim_whitespace=False):
    return read_csv(**locals())

I would be interested in fixing, but I guess the decision on which sentinel object to choose is an important API decision, which should be done by you.

mcocdawc · 2020-09-23T07:49:40Z

@tacaswell
Does https://github.com/pandas-dev/pandas/pull/36560/files really fix the problem?

read_table(sep='\t', delim_whitespace=True)

will not throw an error, although it should if I read the documentation correctly?

simonjayhawkins · 2020-09-23T12:04:04Z

@mcocdawc since this issue is closed, maybe better for visibility to comment directly on #35958 or #36560

simonjayhawkins · 2020-09-23T12:07:04Z

> I would be interested in fixing, but I guess the decision on which sentinel object to choose is an important API decision, which should be done by you.

we use lib.no_default as a sentinel
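To illustrate the sentinel idea, here is a minimal sketch under stated assumptions. `no_default` below is a local stand-in object, not the real `lib.no_default`, and `_resolve_delimiter`, `read_csv_sentinel` and `read_table_sentinel` are hypothetical names; this is not how the linked PR or pandas itself is implemented.

```python
class _NoDefault:
    """Local stand-in for a sentinel; NOT the real pandas object."""

    def __repr__(self):
        return "<no_default>"


no_default = _NoDefault()


def _resolve_delimiter(sep, delimiter, delim_whitespace, default_sep):
    # Hypothetical helper: an identity check against the sentinel tells
    # "sep omitted" apart from "sep passed explicitly", whatever its value.
    if delimiter is None:
        delimiter = sep
    if delim_whitespace:
        if delimiter is not no_default:
            raise ValueError(
                "Specified a delimiter with both sep and "
                "delim_whitespace=True; you can only specify one."
            )
        return r"\s+"
    return default_sep if delimiter is no_default else delimiter


def read_csv_sentinel(sep=no_default, delimiter=None, delim_whitespace=False):
    return _resolve_delimiter(sep, delimiter, delim_whitespace, default_sep=",")


def read_table_sentinel(sep=no_default, delimiter=None, delim_whitespace=False):
    return _resolve_delimiter(sep, delimiter, delim_whitespace, default_sep="\t")


# Omitting sep works for both functions, each with its own default.
whitespace_ok = read_table_sentinel(delim_whitespace=True)

# Passing sep explicitly together with delim_whitespace=True raises,
# even when the value equals the function's default.
try:
    read_table_sentinel(sep="\t", delim_whitespace=True)
    explicit_error = None
except ValueError as exc:
    explicit_error = str(exc)
```

With this shape, each wrapper only supplies its own `default_sep`, and the conflict check no longer depends on comparing separator values.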

mcocdawc added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 15, 2020

mcocdawc added a commit to mcocdawc/chemcoord that referenced this issue Sep 15, 2020

BUG: workaround for bug in pd.read_table

4fa3f4f

delim_whitespace=True does not work pandas-dev/pandas#36381

jreback added IO CSV read_csv, to_csv Usage Question and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 15, 2020

jreback added this to the No action milestone Sep 15, 2020

jreback closed this as completed Sep 15, 2020

simonjayhawkins added Duplicate Report Duplicate issue or pull request and removed IO CSV read_csv, to_csv Usage Question labels Sep 23, 2020

tacaswell mentioned this issue Sep 23, 2020

[BUG]: Fix regression in read_table with delim_whitespace=True #36560

Merged

5 tasks

phofl mentioned this issue Sep 27, 2020

BUG: Read_Table and Read_Csv does not raise when delim_whitespace=True and sep=default is given #36583

Closed

3 tasks
