utf-8 encoding after yaml.load_file #6

Closed
HenricoWitvliet opened this issue May 8, 2013 · 21 comments · Fixed by #32
Comments

@HenricoWitvliet

I've got a file with utf-8 characters. yaml.load_file loads the character strings correctly, but Encoding() reports their encoding as unknown. For now I use Encoding(...) <- 'UTF-8' to set the encoding.
It would be nice if the character strings had the utf-8 encoding bit set.
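
A minimal sketch of the workaround described above, using only base R. The string is built from raw bytes here (an assumption made for illustration, not what yaml.load_file does internally) so that it starts out unmarked regardless of locale:

```r
# Bytes of "café" in UTF-8; rawToChar() returns an unmarked string.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
Encoding(x)             # "unknown"

# Declare the bytes to be UTF-8. This only sets the mark; no bytes change.
Encoding(x) <- "UTF-8"
Encoding(x)             # "UTF-8"
```

Note that pure ASCII strings always report "unknown", so the mark only matters for strings that contain non-ASCII bytes.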

@RinatMenyashev

same problem

@viking
Contributor

viking commented Feb 13, 2014

This same behavior occurs when using R core functions like readLines, at least in Linux. As far as I know, R does not do any kind of encoding detection. If you run example(Encoding), what is your output?

@HenricoWitvliet
Author

Since a yaml file is encoded in unicode, I would expect strings to be given this encoding. The character string that yaml.load_file returns in my example is utf-8 encoded. But I haven't tried an example yaml in utf-16, so I don't know if setting a bit in every string would be enough.

@viking
Contributor

viking commented Feb 17, 2014

Ah, I see. I didn't realize that all YAML documents are unicode, but the YAML specification agrees with you. The specification says that by default, the encoding is UTF-8. For UTF-16, the document must provide a byte-order mark:
http://yaml.org/spec/1.1/#id868742

It looks like LibYAML has an encoding property:
http://pyyaml.org/wiki/LibYAML#StylisticEventAttributes

I'll add this into the next update.
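
The byte-order-mark rule from the YAML 1.1 specification mentioned above could be sketched roughly as follows. `detect_yaml_encoding` is a hypothetical helper written for illustration; it is not part of the yaml package or of LibYAML, and it only distinguishes the cases named in this comment:

```r
# Hypothetical helper: guess a YAML stream's encoding from its leading bytes.
detect_yaml_encoding <- function(bytes) {
  stopifnot(is.raw(bytes))
  starts_with <- function(prefix) {
    length(bytes) >= length(prefix) &&
      identical(bytes[seq_along(prefix)], as.raw(prefix))
  }
  if (starts_with(c(0xff, 0xfe))) return("UTF-16LE")  # little-endian BOM
  if (starts_with(c(0xfe, 0xff))) return("UTF-16BE")  # big-endian BOM
  "UTF-8"  # default; an optional UTF-8 BOM (EF BB BF) also ends up here
}

detect_yaml_encoding(charToRaw("key: value"))  # "UTF-8"
```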

@viking
Contributor

viking commented Feb 17, 2014

As it turns out, R does not support UTF-16 at all in Encoding() as of version 3.0.2.

@yihui
Contributor

yihui commented Apr 28, 2015

We just ran into the same problem. It would be nice if you could explicitly mark the encoding of character strings as UTF-8. Thanks! (We probably do not need to worry about UTF-16.)

@viking
Contributor

viking commented Apr 29, 2015

I had forgotten about this issue, unfortunately. I will take a fresh look at it.

@yihui
Contributor

yihui commented Apr 29, 2015

Thanks! FWIW, this is our current workaround: rstudio/rmarkdown#421 (Recursively mark the character elements of yaml.load() output as UTF-8)
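
A rough sketch of what such a recursive workaround might look like. `mark_utf8` is a hypothetical name, and this is not the code from rstudio/rmarkdown#421; it only illustrates the idea of walking a nested list and marking every character element (and list name) as UTF-8:

```r
# Hypothetical helper: recursively mark character data in a nested list.
mark_utf8 <- function(x) {
  if (is.character(x)) {
    Encoding(x) <- "UTF-8"               # sets the mark, leaves bytes as-is
  } else if (is.list(x)) {
    x[] <- lapply(x, mark_utf8)          # x[] keeps names and attributes
    if (!is.null(names(x))) Encoding(names(x)) <- "UTF-8"
  }
  x                                      # other types are returned untouched
}

# Example input with unmarked UTF-8 strings at two nesting levels.
s <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
res <- mark_utf8(list(title = s, params = list(label = s, n = 1L)))
Encoding(res$params$label)               # "UTF-8"
```

Under these assumptions it would be applied to the parser output, e.g. `mark_utf8(yaml::yaml.load(text))`.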

@ofurkusi

ofurkusi commented Feb 4, 2016

There seem to be two issues here, one with yaml.load_file and another with yaml.load.

When yaml.load_file calls readLines without explicitly setting the encoding to UTF-8, the contents of a valid UTF-8 encoded YAML file are read into a string whose encoding is marked as unknown (while in fact being UTF-8). On Windows, R treats the string as latin1 (I guess), so the characters are all garbled when displayed. With encoding="UTF-8" added as a parameter to readLines, the raw text input is read correctly and marked as UTF-8 before being passed on to yaml.load.

While I suggest setting the encoding="UTF-8" parameter for readLines in yaml.load_file, that does not seem to be enough to fix the problem. Once yaml.load starts processing the text read by readLines, it messes the characters up again by reverting the encoding to unknown.
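
The readLines part of this can be checked with base R alone. The sketch below writes a small UTF-8 file (the file contents and the variable names are made up for illustration) and reads it back with and without the `encoding` argument, which marks the strings but does not re-encode them:

```r
# Write "name: café" as raw UTF-8 bytes so the test does not depend on locale.
path <- tempfile(fileext = ".yaml")
writeBin(c(charToRaw("name: caf"), as.raw(c(0xc3, 0xa9, 0x0a))), path)

plain  <- readLines(path)                      # bytes kept, no encoding mark
marked <- readLines(path, encoding = "UTF-8")  # same bytes, marked as UTF-8

Encoding(plain)                                # "unknown"
Encoding(marked)                               # "UTF-8"
identical(charToRaw(plain), charToRaw(marked)) # TRUE: only the mark differs

unlink(path)
```

This covers only the input side; as noted above, the strings returned by yaml.load would still need to be marked separately.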

@yihui
Contributor

yihui commented Jun 29, 2016

We were bitten by this issue again: rstudio/bookdown#142. Is there a chance that you could fix it? The fix should be fairly simple (mark the input and output strings as UTF-8); I'm just not familiar with C.

yihui added a commit to rstudio/bookdown that referenced this issue Jun 29, 2016
not sure when this bug can be fixed: vubiostat/r-yaml#6
yihui added a commit to yihui/r-ninja that referenced this issue Jun 29, 2016
…alternative form of chapter_name (due to the bug vubiostat/r-yaml#6, we cannot use R expressions in YAML that contains multibyte characters)
@shrektan

We encountered the same issue as well, although it can be solved as @yihui did in https://github.com/rstudio/bookdown/blob/3ed7fc6bd30e2832948d28298dee5cd546339fc8/R/utils.R#L82

We think it would be nicer if it were fixed in the yaml package itself.

Thanks.

@yihui
Contributor

yihui commented Oct 19, 2016

And bitten by this again: rstudio/rmarkdown#841, so yet another patch...

@viking
Contributor

viking commented Oct 19, 2016

Unfortunately I have precious little time to work on this project at present. A pull request would be appreciated.

@yihui
Contributor

yihui commented Oct 19, 2016

@viking Okay, actually that is all I need from you. I'll try to find someone to do the work and submit a pull request. Thanks!

@yihui
Contributor

yihui commented Oct 20, 2016

@viking Done in #32. Tested on Windows and *nix.

In the long run, if you feel it is difficult for you to maintain this package, you may consider finding a new maintainer. It seems you are in a situation similar to that of the tikzDevice package, which I was highly interested in but whose original authors lacked time. The yaml package is critical to the R Markdown world, and I hope you will consider increasing the bus factor so this important project can be carried forward nicely in the future.

BTW, I found this article very inspiring: "I gave commit rights to someone I didn't know, I could never have guessed what happened next!"

@viking
Contributor

viking commented Oct 27, 2016

Thank you.

@yihui
Contributor

yihui commented Nov 3, 2016

@viking Any chance you could make a CRAN release soon? I hate bugging you like this, but without the CRAN release, we just keep hearing users report this issue. Here again: http://rmarkdown.rstudio.com/r_notebooks.html#comment-2982649887

@viking
Contributor

viking commented Nov 3, 2016

I'll get to it soon. Not being funded to do this means that I have other priorities. Please recognize that.

@viking
Contributor

viking commented Nov 3, 2016

I don't wish to continue this discussion here. I will let you know when the new version is on CRAN.

@yihui
Contributor

yihui commented Nov 3, 2016

Yep definitely understood, and much appreciated!

@viking
Contributor

viking commented Nov 12, 2016

New version is up on CRAN as of about 10 minutes ago.

jsonn pushed a commit to jsonn/pkgsrc that referenced this issue Mar 11, 2017
Upstream changes:

CHANGES IN knitr VERSION 1.15.1

@yihui yihui released this on 23 Nov 2016 · 49 commits to master since this release
NEW FEATURES

    added a new hook function hook_pngquant() that can call pngquant to optimize PNG images (thanks, @slowkow, #1320)

BUG FIXES

    not really a knitr bug, but knit_params() should be better at dealing with multibyte characters now due to the bug fix in the yaml package vubiostat/r-yaml#6


CHANGES IN knitr VERSION 1.15

@yihui yihui released this on 10 Nov 2016 · 63 commits to master since this release
NEW FEATURES

    NA values can be displayed using different characters (including empty strings) in kable(); you can set the option knitr.kable.NA, e.g. options(knitr.kable.NA = '') to hide NA values (#1283)
    added a fortran95 engine (thanks, @stefanedwards, #1282)
    added a block2 engine for R Markdown documents as an alternative to the block engine; it should be faster and supports arbitrary Pandoc's Markdown syntax, but it is essentially a hack; note when the output format is LaTeX/PDF, you have to define \let\BeginKnitrBlock\begin \let\EndKnitrBlock\end in the LaTeX preamble
    figure captions specified in the chunk option fig.cap are also applied to HTML widgets (thanks, @byzheng, rstudio/bookdown#118)
    when the chunk option fig.show = 'animate' and ffmpeg.format = 'gif', a GIF animation of the plots in the chunk will be generated for HTML output (https://twitter.com/thomasp85/status/785800003436421120)
    added a width argument to write_bib() so long lines in bib entries can be wrapped
    the inline syntax r#code is also supported besides r code; this can make sure the inline expression is not split when the line is wrapped (thanks, Dave Jarvis)
    provided a global R option knitr.use.cwd so users can choose to evaluate the R code chunks in the current working directory after setting options(knitr.use.cwd = TRUE); the default is to evaluate code in the directory of the input document, unless the knitr option opts_knit$set(root.dir = ...) has been set
    if options(knitr.digits.signif = TRUE), numbers from inline expressions will be formatted using getOption('digits') as the number of significant digits, otherwise (the default behavior) getOption('digits') is treated as the number of decimal places (thanks, @numatt, #1053)
    the chunk option engine.path can also be a list of paths to the engine executables now, e.g., you can set knitr::opts_chunk$set(engine.path = list(python = '/anaconda/bin/python', perl = '/usr/local/bin/perl')), then when a python code chunk is executed, /anaconda/bin/python will be called instead of the system default (rstudio/rmarkdown#812)
    introduced a mechanism to protect text output in the sense that it will not be touched by Pandoc during the conversion from R Markdown to another format; this is primarily for package developers to extend R Markdown; see ?raw_output for details (which also shows new functions extract_raw_output() and restore_raw_output())

MAJOR CHANGES

    the minimal version of R required for knitr is 3.1.0 now (#1269)
    the formatR package is an optional package since the default chunk option tidy = FALSE has been there for a long time; if you use tidy = TRUE, you need to install formatR separately if it is not installed
    :set +m is no longer automatically added to haskell code chunks (#1274)

MINOR CHANGES

    the package option opts_knit$get('stop_on_error') has been removed
    the confusing warning message about knitr::knit2html() when building package vignettes using the knitr::rmarkdown engine without pandoc/pandoc-citeproc has been removed (#1286)
    the default value of the quiet argument of plot_crop() was changed from !opts_knit$get('progress') to TRUE, i.e., by default the messages from cropping images are suppressed

BUG FIXES

    the chunk option cache.vars did not really behave like what was documented (thanks, @simonKTH, #1280)
    asis_output() should not be merged with normal character output when results='hold' (thanks, @kevinushey, #1310)


CHANGES IN knitr VERSION 1.14

@yihui yihui released this on 12 Aug 2016 · 845 commits to master since this release
NEW FEATURES

    improved caching for Rcpp code chunks: the shared library built from the C++ code will be preserved on disk and reloaded the next time if caching is enabled (chunk option cache = TRUE), so that the exported R functions are still usable in later R code chunks; note this feature requires Rcpp >= 0.12.5.6 (thanks, @jjallaire, #1239)
    added a helper function all_rcpp_labels(), which is simply all_labels(engine == 'Rcpp') and can be used to extract all chunk labels of Rcpp chunks
    added a new engine named sql that uses the DBI package to execute SQL queries, and optionally assign the result to a variable in the knitr session; see http://rmarkdown.rstudio.com/authoring_knitr_engines.html for details (#1241)
    fig.keep now accepts numeric values to index low-level plots to keep (#1265)

BUG FIXES

    fixed #1211: pandoc('foo.md') generates foo_utf8.html instead of foo.html by default
    fixed #1236: include = FALSE for code chunks inside blockquotes did not work (should return > instead of a blank line) (thanks, @fmichonneau)
    fixed #1217: define the command \hlipl for syntax highlighting for Rnw documents (thanks, @conjugateprior)
    fixed #1215: restoring par() settings might fail when the plot window is partitioned, e.g. par(mfrow = c(1, 2)) (thanks, @jrwishart @jmichaelgilbert)
    fixed #1250: in the quiet mode, knit() should not emit the message "processing file ..." when processing child documents (thanks, @KZARCA)

MAJOR CHANGES

    knitr will no longer generate screenshots automatically for HTML widgets if the webshot package or PhantomJS is not installed

MINOR CHANGES

    if dev = 'cairo_pdf', the cairo_pdf device will be used to record plots (previously the pdf device was used) (#1235)
    LaTeX short captions now go up to the first ., : or ; character followed by a space or newline (thanks, @knokknok, #1249)