Cannot implement any function supporting "1 or more column" CSV files when using sep param #738

Closed
adamkennedy opened this issue Jul 19, 2014 · 0 comments
@adamkennedy

I'm not sure whether this is considered a bug, design flaw, or intentional feature so I present the following as simply an "issue".

The following code demonstrates a simplified, synthetic equivalent of our problem. The unconventional separator in this example is intentional, because it rules out falling back on auto-detection (which fails in this case), but the problem applies equally to any separator.

The intent here is to demonstrate that explicitly setting the "sep" to any value is unsafe.

> summary.csv <- function (input) {
+   csv <- fread(input, sep = '`', verbose = TRUE)
+   summary(csv)
+ }

> text2 = '"Foo"`"Bar"
+ 1`2
+ '

> text1 = '"Foo"
+ 1
+ '

> summary.csv(text2)
Input contains a \n (or is ""). Taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Looking for supplied sep '`' on line 2 (the last non blank line in the first 'autostart') ... found ok
Found 2 columns
First row with 2 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 11 (first 5 rows)
Type codes: 11 (after applying colClasses and integer64)
Type codes: 11 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 NULL)
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 1x2 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.001s        Total
      Foo         Bar   
 Min.   :1   Min.   :2  
 1st Qu.:1   1st Qu.:2  
 Median :1   Median :2  
 Mean   :1   Mean   :2  
 3rd Qu.:1   3rd Qu.:2  
 Max.   :1   Max.   :2  

> summary.csv(text1)
Input contains a \n (or is ""). Taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Looking for supplied sep '`' on line 2 (the last non blank line in the first 'autostart') ... 
Error in fread(input, sep = "`", verbose = TRUE) (from .active-rstudio-document#2) : 
  The supplied 'sep' was not found on line 2. To read the file as a single character column set sep='\n'.
>

In this notional example, we would like to implement some arbitrary functionality which operates on CSV files containing one or more columns.

As a slightly more concrete example, imagine a notional script which takes an arbitrary-length list of CSV files and inner joins them all together. One of these files might contain just a single column, acting as a row filter on the other files.
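
As a rough sketch of the behaviour we are after, the join scenario above could look like the following. The helper names `read_sep` and `join_all` are hypothetical, and base R's `read.table()` is used purely as a stand-in for `fread()`, to illustrate a reader that accepts an explicit separator for both the multi-column and the single-column input. It is not a claim about how `fread()` should be implemented.

```r
# Hypothetical helper: parse delimited text with an explicit separator.
# read.table() is a stand-in for fread() here; it accepts input in which
# the separator never appears (i.e. a single column).
read_sep <- function(input, sep = "`") {
  read.table(text = input, sep = sep, header = TRUE, stringsAsFactors = FALSE)
}

# Hypothetical helper: inner join an arbitrary-length list of inputs.
# merge() joins on the column names the two tables share and, by default,
# keeps only the matching rows (an inner join).
join_all <- function(inputs, sep = "`") {
  Reduce(function(a, b) merge(a, b), lapply(inputs, read_sep, sep = sep))
}

wide   <- '"Foo"`"Bar"\n1`2\n3`4\n'  # two columns
filter <- '"Foo"\n1\n'               # one column, acting as a row filter

joined <- join_all(list(wide, filter))
print(joined)
```

Here the single-column input is treated as an ordinary one-column table, so the calling code needs no special case for it.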

There are any number of other cases where we might want to support CSV files that contain a single column.

fread() does clearly support this use case, because the sep auto-detection happily identifies the single column case and continues without emitting a warning.

However, should we attempt to enforce a specific separator with an explicit sep param, the single column case fails with an error, even though the CSV content in this case is completely legally formatted.

The two situations appear to be strangely symmetrically "helpful". When you use auto-detection the fread() function will helpfully continue (silently with no warning) using a guess that it is a single column (even when that guess is wrong, as above). When you supply a specific sep the fread() function will helpfully stop you loading a legal single column CSV because you might have specified the sep wrong.

Moreover, the advice provided in the error is impossible to follow except in the case where the R session is interactive and a human is directly in control of the fread() call or some parent function that supports overriding the sep value by passing it down.

When the R session is not interactive, or when the fread() call is buried under several layers of higher level functionality, changing the sep to "\n" to disable the helpful error is not only something we can't do, but is also functionally incorrect. The file does not actually have newline separators; it is still a legal CSV file compliant with the original sep value.

To resolve this, I recommend the removal of the Error at minimum, since it provides advice that cannot possibly be followed in anything other than the most trivial case.

If a function that is calling fread() is confident enough in the sep value it wants to provide it explicitly, then fread() should take that choice at face value and allow the loading of single column CSV files without complaint, whether they are correctly single column or not.

As things stand, it is impossible to implement anything with explicit sep, as it will error during parse for the single column case. And the only way to avoid this problem is to never use sep at all.

@mattdowle mattdowle added this to the v1.9.6 milestone Oct 25, 2014