Automatic detection of dec=',' in Europe #2431

Closed
Tracked by #3189
mattdowle opened this issue Oct 18, 2017 · 22 comments · Fixed by #4482
Labels
enhancement, fread, top request

Comments

@mattdowle
Member

I'm not sure what base and readr do in this regard, but currently in fread, dec is '.' by default and needs to be set manually to ',' in Europe for numerics that use a comma as the decimal separator. It could instead be detected automatically, as sep already is. Please +1 this issue if you'd like this.

Further, dec could be automatically detected per-column for files where some numeric columns use ',' and other columns use '.'. But does anyone need that?
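
The whole-file variant of this idea can be sketched in base R. This is only an illustration of the heuristic, not how fread does or would implement it: the function name guess_dec and the scoring rule (count fields that look like numbers containing the candidate decimal mark) are assumptions made for the example.

```r
# Hypothetical helper, not part of data.table: guess the decimal mark for a
# whole file by counting fields that look numeric under each candidate.
guess_dec <- function(lines, sep) {
  fields <- unlist(strsplit(lines, sep, fixed = TRUE))
  score <- function(dec) {
    # digits, then exactly one decimal mark, then digits
    pattern <- paste0("^[+-]?[0-9]+[", dec, "][0-9]+$")
    sum(grepl(pattern, fields))
  }
  # Ties (including "no decimals at all") fall back to the current default '.'
  if (score(",") > score(".")) "," else "."
}

guess_dec(c("a;b", "1,5;2,5"), sep = ";")
guess_dec(c("a;b", "1.5;2.5"), sep = ";")
```

A per-column variant would apply the same scoring to each column separately, which is the extra complexity the question above is asking about.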

@GegznaV

GegznaV commented Oct 20, 2017

It would be enough if dec were chosen correctly for the whole file.

@clarkdk

clarkdk commented Oct 22, 2017

+1 for auto-dec. dec by column would not be needed. I don’t think a mixed-dec CSV could be written or read by Excel.

@Boyoron

Boyoron commented Nov 23, 2018

pls.

@s-fleck

s-fleck commented Nov 11, 2019

fread() also sometimes ignores a manually set dec = ",". This happens randomly and quite infrequently, and is therefore hard to reproduce. Restarting the R session fixes the issue. I've been having this problem for years, but only occasionally, so I did not report it before.

@jangorecki
Member

That sounds like a different issue from the one here.

@MichaelChirico
Member

Any sample files for this issue?

@jangorecki
Member

fread("a;b\n1,5;2,5", sep=";", dec=",")

@MichaelChirico
Member

A real-world example would be better 😄

I saw a 🇫🇷 government website using sep=';' as well, is that common in such files? Or sep='\t' maybe?

@jangorecki
Member

AFAIR this is how Excel produces CSV files in France and Poland.

@GegznaV

GegznaV commented May 25, 2020

@MichaelChirico, here are some examples of data: data.zip. Inside the ZIP:

kojos.csv – data with various measurements of leg parts. Three columns with European numbers.
16.0001.trt – spectroscopic data created by the software of a spectrometer. The data of interest start at line 9. Two semicolon-separated columns with European numbers.
ezerai – two tab-separated columns with European numbers and UTF-8 encoding. Data from Wikipedia. (I do not expect fread() to read this dataset ezerai correctly with the default settings).

@GegznaV

GegznaV commented May 25, 2020

Wouldn't it be more logical to automatically choose \t tab as sep when it is present instead of some other character? (See the example dataset ezerai)

@jangorecki
Member

jangorecki commented May 25, 2020

AFAIR Excel uses sep=";" when writing to CSV (in those countries where dec=",").

@MichaelChirico
Member

@GegznaV fread may choose \t -- there is some logic to determine what fread "thinks" is the correct separator among , | ; \t and space, see here

@MichaelChirico
Member

Thanks a bunch for the data sets Vilmantas.

  • fread('16.0001.trt') fails because of the extraneous info in the first 6 lines (the auto-skip logic is not up to the task); fread('16.0001.trt', skip=7L) and fread('16.0001.trt', skip=8L) both work automatically (both get the column names wrong: the names are on line 7, but there is a subheader row on line 8)
  • fread('kojos.csv') works great
  • fread('ezerai') is not working "automagically". The issue is that both , and \t as sep lead to 3 columns (which matches the header), and the logic for selecting sep is agnostic to column types -- there is a priority order, and , comes first. I filed #4487 ("fread sep='auto' could try using detect_types in the event of a tie") -- I'm hopeful fread's logic could actually detect sep='\t' on its own. fread('ezerai', sep='\t') works.
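
To make the tie concrete, here is a small base-R sketch. The three lines are made up to mimic the situation described for ezerai (they are not the real file), and numeric_share is a hypothetical helper that stands in for "look at column types"; it is not fread's detect_types.

```r
# Made-up tab-separated lines with decimal commas and commas in the header.
lines <- c("Name\tArea, km2\tDepth, m",
           "Lake A\t44,79\t33,3",
           "Lake B\t12,50\t8,1")

# Number of fields per line for a given separator.
n_fields <- function(x, sep) lengths(strsplit(x, sep, fixed = TRUE))
counts <- sapply(c(comma = ",", tab = "\t"), function(s) n_fields(lines, s))
counts  # every line splits into 3 fields under either separator

# Hypothetical tie-break: share of data fields that parse as numbers
# once the candidate decimal mark is normalised to '.'.
numeric_share <- function(sep, dec) {
  body <- unlist(strsplit(lines[-1], sep, fixed = TRUE))
  body <- sub(dec, ".", body, fixed = TRUE)
  mean(!is.na(suppressWarnings(as.numeric(body))))
}

numeric_share("\t", ",") > numeric_share(",", ".")
```

Field counts alone cannot separate the two candidates here, while a type-aware score favours the tab, which is the kind of signal a tie-break could use.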

@jangorecki
Member

jangorecki commented May 25, 2020

Remember not to try to handle every possible input. For example

both get the column names wrong, which are on line 7 but have a subheader column on line 8

It sounds like fread would need to skip the first 6 lines, then read the seventh, skip the 8th, and read the rest. Handling that is doable, but it imposes maintenance overhead, can introduce new bugs, etc. It might be better to provide a more general interface where skip can be a vector, so that users need to understand what is wrong with their files and then just use skip=c(1:6,8). I recall someone already asked for the possibility to read particular lines of a file.
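
Until something like a vector-valued skip exists, the same effect can be approximated by dropping the unwanted lines in R and passing the rest through fread's text argument. A minimal sketch, assuming a made-up file with the layout described above (six lines of preamble, names on line 7, a subheader on line 8); the file contents and column names are invented for illustration.

```r
library(data.table)

# Made-up file: 6 preamble lines, header on line 7, subheader on line 8.
raw <- c(paste("instrument info", 1:6),
         "wavelength;intensity",
         "nm;a.u.",
         "400,5;0,12",
         "401,0;0,15")
path <- tempfile(fileext = ".trt")
writeLines(raw, path)

# Emulate skip = c(1:6, 8) by removing those lines before parsing.
all_lines <- readLines(path)
keep <- all_lines[-c(1:6, 8)]

dt <- fread(text = keep, sep = ";", dec = ",", header = TRUE)
dt
```

This reads the whole file into memory as text first, so it gives up some of fread's speed on large files; it is a workaround rather than a replacement for the proposed feature.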

@MichaelChirico
Member

Yea I don't think there's an automatic way on this one that's not a fragile house of cards to support. skip=vector would be awesome anyway.

@GegznaV

GegznaV commented May 25, 2020

However, if we return to the original issue, "Automatic detection of dec=',' in Europe", it seems that PR #4482 does what is expected.

When could one expect these changes to be on CRAN?

@jangorecki
Member

jangorecki commented May 25, 2020

@GegznaV probably not very soon. Note that we provide Windows binaries, so Rtools/compilation is not needed. If you are on R 3.6 you can just run

install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table", type="win.binary")

If you are on another version you can try

install.packages("https://rdatatable.gitlab.io/data.table/bin/windows/contrib/3.6/data.table_1.12.9.zip", repos=NULL)

Note that those 3.6 binaries will soon move to 4.0.

@GegznaV

GegznaV commented May 25, 2020

OK, can I expect the new version of data.table on CRAN somewhere around mid-August (before the new school year in September)? Or are your dates even further out?

@jangorecki
Member

We don't have any fixed release dates. The next version on CRAN might end up being just a patch release without new features like this.

@MichaelChirico MichaelChirico added top request One of our most-requested issues and removed High labels Jun 7, 2020