Automatic detection of dec=',' in Europe #2431

mattdowle · 2017-10-18T21:43:24Z

I'm not sure what base and readr do in this regard, but currently in fread, dec=='.' by default and needs manually setting to ',' in Europe for numerics with comma as the decimal separator. It could instead be detected automatically like sep already is. Please +1 this issue if you'd like this.

Further, dec could be automatically detected per-column for files where some numeric columns use ',' and other columns use '.'. But does anyone need that?

GegznaV · 2017-10-20T20:35:43Z

It would be enough if dec were chosen correctly for the whole file.

clarkdk · 2017-10-22T09:06:25Z

+1 for auto-dec. dec by column would not be needed. I don’t think a mixed dec csv could be written or read by Excel.

Boyoron · 2018-11-23T07:07:23Z

pls.

s-fleck · 2019-11-11T14:37:52Z

fread() also sometimes ignores a manually set dec = ",". This happens randomly and quite infrequently and is therefore hard to reproduce. Restarting the R session fixes the issue. I've been having this problem for years, but only occasionally, so I did not report it before.

jangorecki · 2019-11-11T21:35:39Z

It sounds like a different issue than the one here

MichaelChirico · 2020-05-22T07:46:09Z

Any sample files for this issue?

jangorecki · 2020-05-22T11:15:32Z

fread("a;b\n1,5;2,5", sep=";", dec=",")
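A minimal sketch of what this toy example demonstrates, assuming a recent data.table in which fread() accepts the `text=` argument. It contrasts an explicit dec="," with a forced dec="."; the variable names are illustrative only.

```r
library(data.table)

# Toy input from the thread: ';' separates fields, ',' marks decimals.
txt <- "a;b\n1,5;2,5"

# With dec stated explicitly, both columns are parsed as numeric.
dt <- fread(text = txt, sep = ";", dec = ",")
str(dt)

# Forcing dec = "." means "1,5" cannot be parsed as a number,
# so the columns are read as character instead.
dt_dot <- fread(text = txt, sep = ";", dec = ".")
str(dt_dot)
```

What fread() does when dec is left unspecified depends on the installed version, so that case is deliberately not shown here.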

MichaelChirico · 2020-05-22T11:27:44Z

A real-world example would be better 😄

I saw a 🇫🇷 government website using sep=';' as well, is that common in such files? Or sep='\t' maybe?

jangorecki · 2020-05-22T12:09:07Z

AFAIR this is how Excel produces csv files in France and Poland.

MichaelChirico · 2020-05-22T12:43:09Z

Here are some nightmarishly bad ones 😂

https://datos.gob.es/en/catalogo/ea0010587-porcentaje-de-gastos-en-i-d-respecto-al-pib-a-precios-de-mercado-por-comunidades-autonomas-serie-2000-2017-estadistica-sobre-actividades-en-i-d-en-el-sector-empresas-identificador-api-t14-p057-a2017-l0-02007-px

MichaelChirico · 2020-05-24T18:23:07Z

@cderv @dmpe @thohan88 @GabijaSakalyte @Boyoron @IndreSakalauskaite @Amygdalae @AndriusJasinevicius @dvaitkus @Katazyna-Stankevic @labutytegreta @rasainsodaite @ievajuozapaityte @gertrudam @RPrakapaite @Grazvile @bugampo @raugulis @Ignnn @iurbon @EvitaJ @rutele13 @jstonkus @pociuteagne @Kaamile @LinaAnu @zyginta @evelina11101 @silvimi @Auguste11 @1075353 @Andrealek @esadausk @vaiiva @supermenas @ramintares @viktorija-romovaite @egle-lele @RokasStat @ema-malinauskaite @1611003 @zyle1 @tokotrienoliai @DanasKl @danielius-mockus @Gabriele-gif @domasrupkus @GerdaSkin @emyliuxe @GegznaV @clarkdk @s-fleck

Sorry for the wide ping. I have a PR addressing this issue in #4482 -- it would be great if anyone could provide some "real world" sample CSVs rather than testing on my toy examples. Thanks in advance!

GegznaV · 2020-05-25T08:26:18Z

@MichaelChirico, here are some examples of data: data.zip. Inside the ZIP:

kojos.csv – data with various measurements of leg parts. Three columns with European numbers.
16.0001.trt – spectroscopic data created by the software of a spectrometer. The data of interest start at line 9. Two semicolon-separated columns with European numbers.
ezerai – two tab-separated columns with European numbers and UTF-8 encoding. Data from Wikipedia. (I do not expect fread() to read this dataset ezerai correctly with the default settings).

GegznaV · 2020-05-25T08:54:42Z

Wouldn't it be more logical to automatically choose \t tab as sep when it is present instead of some other character? (See the example dataset ezerai)

jangorecki · 2020-05-25T09:25:50Z

AFAIR excel uses sep=";" when writing to csv (in those countries where dec=",").

MichaelChirico · 2020-05-25T11:43:37Z

@GegznaV fread may choose \t -- there is some logic to determine what fread "thinks" is the correct separator among , | ; \t and space, see here

MichaelChirico · 2020-05-25T12:04:02Z

Thanks a bunch for the data sets Vilmantas.

fread('16.0001.trt') fails because of the extraneous info in the first 6 lines (auto-skip logic not up to the task); fread('16.00001.trt', skip=7L) and fread('16.00001.trt', skip=8L) both work automatically (both get the column names wrong, which are on line 7 but have a subheader column on line 8)
fread('kojos.csv') works great
fread('ezerai') is not working "automagically". The issue is that both , and \t as sep lead to 3 columns (which matches the header), and the logic for selecting sep is agnostic to column types -- there is a priority order, and , comes first. I filed #4487 ("fread sep='auto' could try using detect_types in the event of a tie") -- I'm hopeful fread's logic could actually detect sep='\t' on its own. fread('ezerai', sep='\t') works.
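As a hedged illustration of the explicit override mentioned for ezerai, the sketch below uses made-up tab-separated data (not the contents of the real file) with ',' as the decimal mark. It does not reproduce the separator tie itself; it only shows that stating sep and dec together bypasses separator auto-detection.

```r
library(data.table)

# Hypothetical tab-separated data with ',' as the decimal mark.
txt <- "lake\tarea\tdepth\nA\t44,79\t33,3\nB\t13,3\t60,5\n"

# Stating both sep and dec sidesteps separator auto-detection entirely.
dt <- fread(text = txt, sep = "\t", dec = ",")
str(dt)
```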

jangorecki · 2020-05-25T12:14:08Z

Remember not to try to handle every possible input. For example

both get the column names wrong, which are on line 7 but have a subheader column on line 8

It sounds like fread would need to skip the first 6 lines, then read the seventh, skip the 8th, and read the rest. Handling that is doable, but it imposes maintenance overhead, can introduce new bugs, etc. It might be better to provide a more general interface where skip can be a vector, so users need to understand what is wrong with their files and then just use skip=c(1:6,8). I recall someone already asked for the possibility to read particular lines of a file.
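Until something like a vector-valued skip exists, a similar effect can be approximated by dropping lines before parsing. This is only a sketch under stated assumptions: the file layout below is hypothetical (6 preamble lines, header on line 7, subheader on line 8), and it relies on fread() accepting a character vector of lines through `text=`.

```r
library(data.table)

# Hypothetical file mirroring the layout described above.
path <- tempfile(fileext = ".trt")
writeLines(c(
  paste("preamble line", 1:6),   # lines 1-6: extraneous info
  "wavelength;intensity",        # line 7: column names
  "nm;a.u.",                     # line 8: subheader to discard
  "400,5;1,25",
  "401,0;1,50"
), path)

lines <- readLines(path)
keep <- lines[-c(1:6, 8)]        # drop the preamble and the subheader

# Parse the remaining lines; sep and dec are stated explicitly.
dt <- fread(text = keep, sep = ";", dec = ",")
str(dt)
```

This reads the whole file into memory first, so it suits small files rather than the large inputs fread() is usually chosen for.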

MichaelChirico · 2020-05-25T12:34:05Z

Yea I don't think there's an automatic way on this one that's not a fragile house of cards to support. skip=vector would be awesome anyway.

GegznaV · 2020-05-25T14:35:51Z

However, if we return to the original issue, "Automatic detection of dec=',' in Europe", it seems that PR #4482 does what is expected.

When could one expect these changes to be on CRAN?

jangorecki · 2020-05-25T14:54:48Z

@GegznaV probably not very soon. Note that we provide windows binaries so Rtools/compilation is not needed. If you are on R 3.6 you can just

install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table", type="win.binary")

If you are on another version, you can try

install.packages("https://rdatatable.gitlab.io/data.table/bin/windows/contrib/3.6/data.table_1.12.9.zip", repos=NULL)

Note that those 3.6 binaries will soon move to 4.0.

GegznaV · 2020-05-25T15:16:27Z

OK, can I expect the new version of data.table on CRAN somewhere around mid-August (before the new school year in September)? Or are your dates even further out?

jangorecki · 2020-05-25T15:19:45Z

We don't have any fixed release dates. The next version on CRAN might end up being just a patch release without new features like this one.

mattdowle added enhancement fread labels Oct 18, 2017

mattdowle mentioned this issue Oct 18, 2017

does not recognize numbers written in European style #2430

Closed

statsccpr mentioned this issue Oct 23, 2017

optional arg that returns a list of parse symbols fread() used to intuit raw file #2437

Open

MichaelChirico mentioned this issue Dec 6, 2018

Master list of most-requested issues #3189

Open

75 tasks

MichaelChirico mentioned this issue May 24, 2020

Automatic detection of dec (. or ,) #4482

Merged

2 tasks

MichaelChirico mentioned this issue May 25, 2020

fread sep='auto' could try using detect_types in the event of a tie #4487

Open

GegznaV mentioned this issue May 25, 2020

Make fread() automatically handle UTF-8 encoded files on Windows #4490

Open

MichaelChirico added the High label May 30, 2020

MichaelChirico added the top request label and removed the High label Jun 7, 2020

MichaelChirico closed this as completed in #4482 Apr 11, 2024
