Name repair for duplicated columns inconsistent between `read_csv` and `spec_csv` #1387

blaidd4drwg · 2022-03-11T12:59:02Z

When trying to create column specifications of a file that contains duplicated variable names, spec_csv() renames the variables differently (e.g "error_1") than read_csv() ("error...1").

Let test.csv be a simple CSV with duplicated variables "error" and one observation:

echo "a,error,b,error,c,error\n1,string1,2,string2,3,string3" >> test.csv

readr::spec_csv("test.csv")

cols(
  a = col_double(),
  error = col_character(),
  b = col_double(),
  error_1 = col_character(),
  c = col_double(),
  error_2 = col_character()
)

Warning message:
Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error' => 'error_2' [6]

readr::read_csv("test.csv")

New names:                                                                                                                                                                                                  
* error -> error...2
* error -> error...4
* error -> error...6
Rows: 1 Columns: 6
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): error...2, error...4, error...6
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 6
      a error...2     b error...4     c error...6
  <dbl> <chr>     <dbl> <chr>     <dbl> <chr>    
1     1 string1       2 string2       3 string3

The text was updated successfully, but these errors were encountered:

blaidd4drwg · 2022-03-28T11:10:35Z

NB: I dug through some readr code and it seems to me that read_csv and spec_csv are not using the same code-base: spec_csv uses the function col_spec_standardise where duplicated column names are dealt with:

  if (anyDuplicated(col_names)) {
    dups <- duplicated(col_names)

    old_names <- col_names
    col_names <- make.unique(col_names, sep = "_")

    warning(
      "Duplicated column names deduplicated: ",
      paste0(
        encodeString(old_names[dups], quote = "'"),
        " => ",
        encodeString(col_names[dups], quote = "'"),
        " [", which(dups), "]",
        collapse = ", "
      ),
      call. = FALSE
    )
  }

And read_csv is essentially a wrapper for vroom::vroom which then deals with column names (in the C++ code part I think?).

It seems to me that the code to make unique names for duplicated column names is not correct and should follow the same pattern as vroom/vroom_ or vice versa.

Edit:

read_csv uses vctrs::vec_as_names to create syntactically valid names (in col_types_standardise). Thus spec_csv should use the same call.

vieruodu · 2022-08-20T07:18:32Z

Sometimes I know the duplicated columns exisited, but I just wanna ignore it. However, with the current defualt name_repair, looking like "num...1", I cannot ignore it, I have to deal with it.

sbearrows · 2022-08-23T20:45:24Z

Note to self:

Right now spec_csv() is going through edition 1 read_csv() which ends up in col_spec_standardise and that has it's own custom name repair machinery

readr/R/col_types.R

Lines 441 to 458 in 85cf1e8

    
           if (anyDuplicated(col_names)) { 
        
             dups <- duplicated(col_names) 
        
             old_names <- col_names 
        
             col_names <- make.unique(col_names, sep = "_") 
        
             warning( 
        
               "Duplicated column names deduplicated: ", 
        
               paste0( 
        
                 encodeString(old_names[dups], quote = "'"), 
        
                 " => ", 
        
                 encodeString(col_names[dups], quote = "'"), 
        
                 " [", which(dups), "]", 
        
                 collapse = ", " 
        
               ), 
        
               call. = FALSE 
        
             ) 
        
           }

The fact that we force spec_csv() to edition 1 is due to: c88303c
We don't see a way to allow spec_csv() to use edition 2 because vroom doesn't support guessing without parsing the data (as the above commit mentions).

value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")

# note that edition 1 actually guesses column types
readr::with_edition(1, readr::read_csv(value, n_max = 0, guess_max = 1000))
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> # A tibble: 0 × 6
#> # … with 6 variables: a <dbl>, error <chr>, b <dbl>, error_1 <chr>, c <dbl>,
#> #   error_2 <chr>
#> # ℹ Use `colnames()` to see all variable names

# while edition 2 returns all characters
readr::read_csv(value, n_max = 0, guess_max = 1000)
#> New names:
#> Rows: 0 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (6): a, error...2, b, error...4, c, error...6
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> # A tibble: 0 × 6
#> # … with 6 variables: a <chr>, error...2 <chr>, b <chr>, error...4 <chr>,
#> #   c <chr>, error...6 <chr>
#> # ℹ Use `colnames()` to see all variable names

^{Created on 2022-08-23 by the reprex package (v2.0.1.9000)}

There are three options for moving forward:

Bring over the name repair machinery from vroom a la https://github.com/tidyverse/vroom/blob/339073c4fc5cfdcbeb29ff716d41c2aaaf6bad43/R/col_types.R#L389 and add the name_repair argument to all instances where read_delimited() is called. This would require a rather large change to the code base but it would mean the name_repair argument would now be respected for edition 1 read_csv() (and friends).
Bring over the name repair machinery from vroom as above, but force name_repair = unique and continue to not respect name_repair. It would make spec_csv() and read_csv() perform the same for default name repair and we wouldn't need to change a large part of the code base. We currently don't respect name_repair in edition 1, so continuing this behavior might not be a big deal.
Do nothing because the (potentially large) changes would occur in edition 1, which is feature frozen or the changes wouldn't be a perfect solution due to name_repair still not being respected. We also don't feel like name repair for duplicate columns names is a safe or advised maneuver to rely on when copying a spec from spec_csv() for use in read_csv().

jennybc · 2022-08-30T22:44:36Z

It seems like a stopgap readr 2e / vroom version of spec_csv() could just call read_csv() and then return the spec from that result.

(BTW various things seem wonky with this printed output)

library(readr)

value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")

dat <- read_csv(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`

spec(dat)
#> cols(
#>   a = col_double(),
#>   error...2 = col_character(),
#>   b = col_double(),
#>   error...4 = col_character(),
#>   c = col_double(),
#>   error...6 = col_character()
#> )

readr 2e / vroom column names: a, error...2, b, error...4, c, error...6

What happens if we just ask for the spec from the same input?

spec_csv(value)
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> cols(
#>   a = col_double(),
#>   error = col_character(),
#>   b = col_double(),
#>   error_1 = col_character(),
#>   c = col_double(),
#>   error_2 = col_character()
#> )

Problem: we get readr 1e-style column names: a, error, b, error_1, c, error_2

It does seem reasonable to expect spec_csv() to return the same spec as spec(read_csv()), at least with respect to column names.

A simple fix would be to call read_csv() on the input with n_max = guess_max and then return that spec.

spec_csv_vroom <- function(..., guess_max = 1000) {
  tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
  spec(tmp)
}

spec_csv_vroom(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> cols(
#>   a = col_double(),
#>   error...2 = col_character(),
#>   b = col_double(),
#>   error...4 = col_character(),
#>   c = col_double(),
#>   error...6 = col_character()
#> )

Is there a reason not to do something along these lines?

DavisVaughan · 2022-08-30T23:18:46Z

Is there a reason not to do something along these lines?

Performance? I guess? i.e. this commit from Jim talks about vroom not supporting guessing without parsing
c88303c

But since spec_csv() and friends are probably relatively rarely used, the benefits of updating them to 2e probably greatly outweigh performance costs - especially since the long term goal is to remove 1e code

sbearrows · 2022-08-30T23:33:52Z

Is there a reason not to do something along these lines?

Same as @DavisVaughan, it would solve the problem especially in the long term for removing readr edition 1 code, but having spec_csv() be just a wrapper for spec(read_csv(n_max = guess_max)) seems odd

jennybc · 2022-08-30T23:37:38Z

Yeah the only thing I could think of was performance (or elegance). But I think returning the wrong column names is a much bigger sin that the downsides of a simple spec_*() implementation.

"readr doesn't have to parse to guess col types but vroom does" seems a bit misleading. readr is definitely consulting (tokenizing and typing) up to guess_max rows of data, in order to guess the column types. This particular point doesn't really resonate with me. I suppose if we were truly only column type guessing, we wouldn't load the data atoms into a data frame, but this seems like an awfully subtle thing to be worried about. Does anyone really "get" this point?

having spec_csv() be just a wrapper for spec(read_csv(n_max = guess_max)) seems odd

How so? It seems like the point of spec_csv() is to return the spec associated with the data frame you'd get if you read that file.

sbearrows · 2022-08-31T16:17:21Z

I know that spec_csv() is basically already a wrapper for ed1 spec(read_csv(n_max = 0, guess_max = guess_max)) so I guess it's not the "wrapper" bit that feels odd. Maybe I'm just blocked by what I think spec_csv() should be doing (ie guessing column types but not reading anything).

jennybc · 2022-08-31T16:52:02Z

guessing column types but not reading anything

But you can't guess column types without reading from the file.

blaidd4drwg changed the title ~~Name repair different for read_csv() and spec_csv()~~ Name repair for duplicated columns inconsistent between read_csv and spec_csv Mar 28, 2022

sbearrows added the feature a feature request or enhancement label Apr 5, 2022

sbearrows linked a pull request Sep 13, 2022 that will close this issue

send spec_*() to vroom #1436

Draft

lwjohnst86 mentioned this issue Apr 16, 2024

Read column specifications using readr::spec_csv() rostools/r-cubed-intermediate#15

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Name repair for duplicated columns inconsistent between `read_csv` and `spec_csv` #1387

Name repair for duplicated columns inconsistent between `read_csv` and `spec_csv` #1387

blaidd4drwg commented Mar 11, 2022

blaidd4drwg commented Mar 28, 2022 •

edited

Loading

vieruodu commented Aug 20, 2022

sbearrows commented Aug 23, 2022 •

edited

Loading

jennybc commented Aug 30, 2022 •

edited

Loading

DavisVaughan commented Aug 30, 2022

sbearrows commented Aug 30, 2022

jennybc commented Aug 30, 2022

sbearrows commented Aug 31, 2022

jennybc commented Aug 31, 2022

Name repair for duplicated columns inconsistent between read_csv and spec_csv #1387

Name repair for duplicated columns inconsistent between read_csv and spec_csv #1387

Comments

blaidd4drwg commented Mar 11, 2022

blaidd4drwg commented Mar 28, 2022 • edited Loading

vieruodu commented Aug 20, 2022

sbearrows commented Aug 23, 2022 • edited Loading

jennybc commented Aug 30, 2022 • edited Loading

DavisVaughan commented Aug 30, 2022

sbearrows commented Aug 30, 2022

jennybc commented Aug 30, 2022

sbearrows commented Aug 31, 2022

jennybc commented Aug 31, 2022

Name repair for duplicated columns inconsistent between `read_csv` and `spec_csv` #1387

Name repair for duplicated columns inconsistent between `read_csv` and `spec_csv` #1387

blaidd4drwg commented Mar 28, 2022 •

edited

Loading

sbearrows commented Aug 23, 2022 •

edited

Loading

jennybc commented Aug 30, 2022 •

edited

Loading