Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Name repair for duplicated columns inconsistent between read_csv and spec_csv #1387

Open
blaidd4drwg opened this issue Mar 11, 2022 · 9 comments · May be fixed by #1436
Open

Name repair for duplicated columns inconsistent between read_csv and spec_csv #1387

blaidd4drwg opened this issue Mar 11, 2022 · 9 comments · May be fixed by #1436
Labels
feature a feature request or enhancement

Comments

@blaidd4drwg
Copy link

When trying to create column specifications of a file that contains duplicated variable names, spec_csv() renames the variables differently (e.g "error_1") than read_csv() ("error...1").

Let test.csv be a simple CSV with duplicated variables "error" and one observation:

echo "a,error,b,error,c,error\n1,string1,2,string2,3,string3" >> test.csv
readr::spec_csv("test.csv")
cols(
  a = col_double(),
  error = col_character(),
  b = col_double(),
  error_1 = col_character(),
  c = col_double(),
  error_2 = col_character()
)

Warning message:
Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error' => 'error_2' [6] 
readr::read_csv("test.csv")
New names:                                                                                                                                                                                                  
* error -> error...2
* error -> error...4
* error -> error...6
Rows: 1 Columns: 6
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): error...2, error...4, error...6
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 6
      a error...2     b error...4     c error...6
  <dbl> <chr>     <dbl> <chr>     <dbl> <chr>    
1     1 string1       2 string2       3 string3
@blaidd4drwg
Copy link
Author

blaidd4drwg commented Mar 28, 2022

NB: I dug through some readr code and it seems to me that read_csv and spec_csv are not using the same code-base: spec_csv uses the function col_spec_standardise where duplicated column names are dealt with:

  if (anyDuplicated(col_names)) {
    dups <- duplicated(col_names)

    old_names <- col_names
    col_names <- make.unique(col_names, sep = "_")

    warning(
      "Duplicated column names deduplicated: ",
      paste0(
        encodeString(old_names[dups], quote = "'"),
        " => ",
        encodeString(col_names[dups], quote = "'"),
        " [", which(dups), "]",
        collapse = ", "
      ),
      call. = FALSE
    )
  }

And read_csv is essentially a wrapper for vroom::vroom which then deals with column names (in the C++ code part I think?).

It seems to me that the code to make unique names for duplicated column names is not correct and should follow the same pattern as vroom/vroom_ or vice versa.

Edit:

read_csv uses vctrs::vec_as_names to create syntactically valid names (in col_types_standardise). Thus spec_csv should use the same call.

@blaidd4drwg blaidd4drwg changed the title Name repair different for read_csv() and spec_csv() Name repair for duplicated columns inconsistent between read_csv and spec_csv Mar 28, 2022
@sbearrows sbearrows added the feature a feature request or enhancement label Apr 5, 2022
@vieruodu
Copy link

Sometimes I know the duplicated columns exisited, but I just wanna ignore it. However, with the current defualt name_repair, looking like "num...1", I cannot ignore it, I have to deal with it.

@sbearrows
Copy link
Contributor

sbearrows commented Aug 23, 2022

Note to self:

Right now spec_csv() is going through edition 1 read_csv() which ends up in col_spec_standardise and that has it's own custom name repair machinery

readr/R/col_types.R

Lines 441 to 458 in 85cf1e8

if (anyDuplicated(col_names)) {
dups <- duplicated(col_names)
old_names <- col_names
col_names <- make.unique(col_names, sep = "_")
warning(
"Duplicated column names deduplicated: ",
paste0(
encodeString(old_names[dups], quote = "'"),
" => ",
encodeString(col_names[dups], quote = "'"),
" [", which(dups), "]",
collapse = ", "
),
call. = FALSE
)
}

The fact that we force spec_csv() to edition 1 is due to: c88303c
We don't see a way to allow spec_csv() to use edition 2 because vroom doesn't support guessing without parsing the data (as the above commit mentions).

value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")

# note that edition 1 actually guesses column types
readr::with_edition(1, readr::read_csv(value, n_max = 0, guess_max = 1000))
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> # A tibble: 0 × 6
#> # … with 6 variables: a <dbl>, error <chr>, b <dbl>, error_1 <chr>, c <dbl>,
#> #   error_2 <chr>
#> # ℹ Use `colnames()` to see all variable names

# while edition 2 returns all characters
readr::read_csv(value, n_max = 0, guess_max = 1000)
#> New names:
#> Rows: 0 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (6): a, error...2, b, error...4, c, error...6
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> # A tibble: 0 × 6
#> # … with 6 variables: a <chr>, error...2 <chr>, b <chr>, error...4 <chr>,
#> #   c <chr>, error...6 <chr>
#> # ℹ Use `colnames()` to see all variable names

Created on 2022-08-23 by the reprex package (v2.0.1.9000)

There are three options for moving forward:

  1. Bring over the name repair machinery from vroom a la https://github.com/tidyverse/vroom/blob/339073c4fc5cfdcbeb29ff716d41c2aaaf6bad43/R/col_types.R#L389 and add the name_repair argument to all instances where read_delimited() is called. This would require a rather large change to the code base but it would mean the name_repair argument would now be respected for edition 1 read_csv() (and friends).
  2. Bring over the name repair machinery from vroom as above, but force name_repair = unique and continue to not respect name_repair. It would make spec_csv() and read_csv() perform the same for default name repair and we wouldn't need to change a large part of the code base. We currently don't respect name_repair in edition 1, so continuing this behavior might not be a big deal.
  3. Do nothing because the (potentially large) changes would occur in edition 1, which is feature frozen or the changes wouldn't be a perfect solution due to name_repair still not being respected. We also don't feel like name repair for duplicate columns names is a safe or advised maneuver to rely on when copying a spec from spec_csv() for use in read_csv().

@jennybc
Copy link
Member

jennybc commented Aug 30, 2022

It seems like a stopgap readr 2e / vroom version of spec_csv() could just call read_csv() and then return the spec from that result.

(BTW various things seem wonky with this printed output)

library(readr)

value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")

dat <- read_csv(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`

spec(dat)
#> cols(
#>   a = col_double(),
#>   error...2 = col_character(),
#>   b = col_double(),
#>   error...4 = col_character(),
#>   c = col_double(),
#>   error...6 = col_character()
#> )

readr 2e / vroom column names: a, error...2, b, error...4, c, error...6

What happens if we just ask for the spec from the same input?

spec_csv(value)
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> cols(
#>   a = col_double(),
#>   error = col_character(),
#>   b = col_double(),
#>   error_1 = col_character(),
#>   c = col_double(),
#>   error_2 = col_character()
#> )

Problem: we get readr 1e-style column names: a, error, b, error_1, c, error_2

It does seem reasonable to expect spec_csv() to return the same spec as spec(read_csv()), at least with respect to column names.

A simple fix would be to call read_csv() on the input with n_max = guess_max and then return that spec.

spec_csv_vroom <- function(..., guess_max = 1000) {
  tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
  spec(tmp)
}

spec_csv_vroom(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> cols(
#>   a = col_double(),
#>   error...2 = col_character(),
#>   b = col_double(),
#>   error...4 = col_character(),
#>   c = col_double(),
#>   error...6 = col_character()
#> )

Is there a reason not to do something along these lines?

@DavisVaughan
Copy link
Member

Is there a reason not to do something along these lines?

Performance? I guess? i.e. this commit from Jim talks about vroom not supporting guessing without parsing
c88303c

But since spec_csv() and friends are probably relatively rarely used, the benefits of updating them to 2e probably greatly outweigh performance costs - especially since the long term goal is to remove 1e code

@sbearrows
Copy link
Contributor

Is there a reason not to do something along these lines?

Same as @DavisVaughan, it would solve the problem especially in the long term for removing readr edition 1 code, but having spec_csv() be just a wrapper for spec(read_csv(n_max = guess_max)) seems odd

@jennybc
Copy link
Member

jennybc commented Aug 30, 2022

Yeah the only thing I could think of was performance (or elegance). But I think returning the wrong column names is a much bigger sin that the downsides of a simple spec_*() implementation.

"readr doesn't have to parse to guess col types but vroom does" seems a bit misleading. readr is definitely consulting (tokenizing and typing) up to guess_max rows of data, in order to guess the column types. This particular point doesn't really resonate with me. I suppose if we were truly only column type guessing, we wouldn't load the data atoms into a data frame, but this seems like an awfully subtle thing to be worried about. Does anyone really "get" this point?

having spec_csv() be just a wrapper for spec(read_csv(n_max = guess_max)) seems odd

How so? It seems like the point of spec_csv() is to return the spec associated with the data frame you'd get if you read that file.

@sbearrows
Copy link
Contributor

I know that spec_csv() is basically already a wrapper for ed1 spec(read_csv(n_max = 0, guess_max = guess_max)) so I guess it's not the "wrapper" bit that feels odd. Maybe I'm just blocked by what I think spec_csv() should be doing (ie guessing column types but not reading anything).

@jennybc
Copy link
Member

jennybc commented Aug 31, 2022

guessing column types but not reading anything

But you can't guess column types without reading from the file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants