[R-Forge #4903] set encoding in fread #563

Closed
arunsrinivasan opened this issue Jun 8, 2014 · 32 comments

@arunsrinivasan
Member

Submitted by: michele de meo; Assigned to: Nobody; R-Forge link

It's useful to specify the encoding before importing a csv.
For example:

fread(... , encoding='UTF8').

This way we can avoid the tedious use of the file() function required with read.table (and not supported by fread):

read.table(file("mycsv.csm", encoding='UTF8'), ... ).
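
For reference, a minimal sketch of the read.table pattern described above, using base R only. The temporary file and its contents are made up for illustration, and "latin1" is assumed as the source encoding so that the re-encoding is visible; the exact result depends on the session's native encoding.

```r
# Write a small Latin-1 encoded file to a temporary path (illustrative data).
tmp <- tempfile(fileext = ".csv")
latin1_bytes <- c(charToRaw("name\n"),
                  as.raw(c(0xe6, 0xf8, 0xe5)),  # "æøå" in Latin-1
                  charToRaw("\n"))
writeBin(latin1_bytes, tmp)

# Declare the file's encoding on the connection; R re-encodes the input
# to the session's native encoding while reading.
df <- read.table(file(tmp, encoding = "latin1"), header = TRUE,
                 stringsAsFactors = FALSE)
```

The request in this issue is for fread to accept the encoding directly, so that the file() wrapper is not needed.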

Michele De Meo

@EDiLD2

EDiLD2 commented Sep 24, 2014

+1 for this

@bnbwn

bnbwn commented Nov 17, 2014

+1

@stanasa

stanasa commented Dec 28, 2014

+1

@sirvydasdagys

+1

@harmonica2

+1

@zeltak

zeltak commented Mar 16, 2015

+1

@MarcinKosinski

+1

@ghost

ghost commented Mar 18, 2015

+1 Would be really useful. I started hacking on the code locally, but it's too hard for me because it calls a C function (readfile) and I don't know (for the moment) how to handle encoding in C. If anyone has some knowledge of C...

@grgurev

grgurev commented Mar 18, 2015

+1

@lucarno

lucarno commented Mar 24, 2015

+1

@seyedamo

seyedamo commented Apr 9, 2015

+1

@jcizel

jcizel commented Apr 9, 2015

+1

@Alectoria

+1

@tophcito

+1

@clarkdk

clarkdk commented Apr 29, 2015

+1

@JohnsonHsieh

+1

@whizzalan

+1

@ysgit

ysgit commented May 27, 2015

+1

@ZeroStack

+1

@mattdowle added the High label Jun 1, 2015

@rshmyrev

+1

@dbuijs

dbuijs commented Jun 16, 2015

+1, and in the meantime, a workaround:

You need iconv available from the console. You can check this with the following command:

Sys.which("iconv")

As long as this gives you a path to a binary, the following will work:

# From ISO-8859-1, To UTF-8
new.data.table <- fread("iconv -f ISO-8859-1 -t UTF-8 mytextfile.txt")
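
Where no iconv binary is on the path (Sys.which("iconv") returns an empty string), a similar conversion can be sketched with base R's iconv() function. The helper name reencode_to_utf8 and its arguments are hypothetical, not part of data.table, and this is only one possible approach under the assumption that the whole file fits comfortably in memory.

```r
# Hypothetical helper: re-encode a text file to UTF-8 using base R only.
reencode_to_utf8 <- function(path, from = "ISO-8859-1") {
  # Read the lines as-is; no re-encoding happens at this step.
  lines <- readLines(path, warn = FALSE)
  # Convert each line from the declared source encoding to UTF-8.
  converted <- iconv(lines, from = from, to = "UTF-8")
  # Write the converted bytes unchanged to a temporary file.
  out <- tempfile(fileext = ".txt")
  writeLines(converted, out, useBytes = TRUE)
  out
}

# Usage (assumes data.table is loaded and mytextfile.txt exists):
# new.data.table <- fread(reencode_to_utf8("mytextfile.txt"))
```

This passes every line through R strings, so it is likely slower than the shell pipeline on large files.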

@shrektan
Member

+1


@panda2727

+1

@duf59

duf59 commented Aug 7, 2015

+1

@romunov
Contributor

romunov commented Aug 7, 2015

+1

Note, however, that ?Encoding uses UTF-8 (not UTF8).

@rentrop

rentrop commented Aug 22, 2015

+1

@leoluyi

leoluyi commented Aug 24, 2015

+1

@arunsrinivasan
Member Author

TL;DR

Could you please test your files with encoding = "UTF-8" or "Latin-1" and report back on whether it solves the issue (especially on Windows)? Thanks.


About the fix:

Looking at the read.table() function, the encoding is set upfront with Encoding(..) <- .... Looking at the source of Encoding<-, it calls an internal function, do_setencoding. Looking at the source of that function, https://github.com/wch/r-source/blob/ca5348f0b5e3f3c2b24851d7aff02de5217465eb/src/main/util.c#L1115, it seems quite straightforward to fix this (I think) through the use of the mkCharLenCE function that R's C API seems to expose.

fread() gains a new argument, encoding, with valid values unknown, UTF-8 and Latin-1. The default behaviour hasn't changed. Windows users will have to set the encoding argument explicitly. This could change in the future (testing for a performance hit would perhaps also be a deciding factor).
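
To illustrate what Encoding<- does at the R level, here is a small base R sketch contrasting marking a string's encoding with converting its bytes. It shows base R behaviour only and is not a statement about what fread does internally; the sample bytes are made up for illustration.

```r
# Three Latin-1 bytes ("æøå") with no declared encoding.
x <- rawToChar(as.raw(c(0xe6, 0xf8, 0xe5)))
Encoding(x)                                  # "unknown"

# Marking: only the declared encoding changes, the bytes stay the same.
marked <- x
Encoding(marked) <- "latin1"
identical(charToRaw(marked), charToRaw(x))   # TRUE

# Converting: the bytes themselves are rewritten as UTF-8.
converted <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(converted)                         # c3 a6 c3 b8 c3 a5
```

Marking is cheap because no bytes are touched, but it only gives correct output when the declared encoding matches the bytes actually in the file.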

@clarkdk

clarkdk commented Aug 31, 2015

Arun, There's a typo in README.md

27. fread() gains eocnding argument. ... (eocnding --> encoding)

@jangorecki
Member

Speaking of typos, there is a missing () after flush.console in the bmerge function:

if (verbose) {cat("done in",round(proc.time()[3]-last.started.at,3),"secs\n");flush.console}

@dbuijs

dbuijs commented Sep 1, 2015

Thank you for this!!!!

A question of clarification: does the encoding option in fread just set the marked encoding, or does it convert from what you've declared to the system's native encoding?

@andreasio

I have an ISO-8859-1 Windows file that I can't fread correctly on Linux. This workaround (#563 (comment)) works, and read.csv works (i.e. produces the correct øæå letters instead of e.g. \xe6), but fread(... encoding = "Latin-1") does not.
