Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rmarkdown::render() corrupts input file #2534

Closed
J-Moravec opened this issue Jan 16, 2024 · 6 comments · Fixed by #2535
Closed

rmarkdown::render() corrupts input file #2534

J-Moravec opened this issue Jan 16, 2024 · 6 comments · Fixed by #2535
Assignees
Labels
bug an unexpected problem or unintended behavior

Comments

@J-Moravec
Copy link
Contributor

J-Moravec commented Jan 16, 2024

Issue:

Made a typo and instead of rmarkdown::render("report.rmd") wrote rmarkdown::render("report.rds"). To my surprise, rmarkdown finished without error, but the rds file was mangled. This shouldn't happen.

Why this is an issue:

Somewhere in the pipeline, the input file is being modified. This means that potentially, the call of rmarkdown::render(input) can corrupt the input file. This breaks the assumption of the call being safe.

MRE:

saveRDS("test", "test.rds")
rmarkdown::render("test.rds") # no error
readRDS("test.rds") # error

Reproduced on both the latest cran and the development version of rmarkdown.

Output of xfun::session_info("rmarkdown")

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Locale:
  LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C              
  LC_TIME=en_NZ.UTF-8        LC_COLLATE=en_NZ.UTF-8    
  LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8   
  LC_PAPER=en_NZ.UTF-8       LC_NAME=C                 
  LC_ADDRESS=C               LC_TELEPHONE=C            
  LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

Package version:
  base64enc_0.1.3   bslib_0.6.1       cachem_1.0.8      cli_3.6.2        
  digest_0.6.34     ellipsis_0.3.2    evaluate_0.23     fastmap_1.1.1    
  fontawesome_0.5.2 fs_1.6.3          glue_1.7.0        graphics_4.1.2   
  grDevices_4.1.2   highr_0.10        htmltools_0.5.7   jquerylib_0.1.4  
  jsonlite_1.8.8    knitr_1.45        lifecycle_1.0.4   memoise_2.0.1    
  methods_4.1.2     mime_0.12         R6_2.5.1          rappdirs_0.3.3   
  rlang_1.1.3       rmarkdown_2.25.2  sass_0.4.8        stats_4.1.2      
  tinytex_0.49      tools_4.1.2       utils_4.1.2       xfun_0.41        
  yaml_2.3.8       

Pandoc version: 2.9.2.1
@cderv
Copy link
Collaborator

cderv commented Jan 17, 2024

Thanks for the report.

For context about what happens here:

There is no check against supported extension - so any file can indeed be passed to render. Probably something to change as you suggest.

What is happening when .rds is passed, is rmarkdown::render() running, with default format html_document().
Among processing, pre_processor will do something on the input file

preserved_chunks <<- extract_preserve_chunks(input_file)

This process will read to the input file and sometimes write to the input file. So write_utf8 on the rds input is what is corrupting it.

extract_preserve_chunks <- function(input_file, extract = extractPreserveChunks) {
input_str <- read_utf8(input_file)
preserve <- extract(input_str)
if (!identical(preserve$value, input_str)) write_utf8(preserve$value, input_file)
preserve$chunks
}

@yihui it seems we do not have indeed any check on the extension regarding the supported input file. Should it be a check related to what knitr support or do you think it is safe to consider only known input ?

I see

rmarkdown/R/render.R

Lines 348 to 349 in c306a43

# check whether this document requires a knit
requires_knit <- tolower(xfun::file_ext(input)) %in% c("r", "rmd", "rmarkdown", "qmd")

Meaning it seems render() may support other type of content... If so, then check for text file input only could be a safer check to not break potential edge case usage.

@yihui
Copy link
Member

yihui commented Jan 17, 2024

Yes, we should definitely avoid writing to the original input file. 07ec182 should fix this issue (the bug seems to have existed for 10 years and no one has discovered it before). Actually we can get rid of this extractPreserveChunks() business if Pandoc is greater than v2 and getOption("htmltools.preserve.raw") is TRUE, but this assumption may not be true to everyone.

I searched in the project and it seems most other write_utf8() calls are fine. This one looks suspicious, though:

write_utf8(.preserve_yaml(input_lines, input_lines2), input_file)

I'm not sure if checking extensions is a robust way. Checking if the file is text or binary may be better, but there is no 100% robust way AFAIK.

Anyway, #2535 should fix this issue.

@yihui yihui self-assigned this Jan 17, 2024
@yihui yihui added the bug an unexpected problem or unintended behavior label Jan 17, 2024
@cderv
Copy link
Collaborator

cderv commented Jan 17, 2024

Thanks for the quick fix - I did not think this part was off...

Actually we can get rid of this extractPreserveChunks() business if Pandoc is greater than v2 and getOption("htmltools.preserve.raw") is TRUE, but this assumption may not be true to everyone.

Yeah it may not be true always, we are still a but blocked by Pandoc version.

I'm not sure if checking extensions is a robust way. Checking if the file is text or binary may be better, but there is no 100% robust way AFAIK.

Even if not robust, maybe this is better than allowing to use render() on any type of file. We don't knit based on an extension test, but we do run pandoc on the input without any check I think. It seems odd, but this is like that since years now... crazy ! 😅

J-Moravec added a commit to J-Moravec/rmarkdown that referenced this issue Jan 17, 2024
J-Moravec added a commit to J-Moravec/rmarkdown that referenced this issue Jan 17, 2024
@J-Moravec
Copy link
Contributor Author

Awesome. Thanks for the prompt response and quick fix. The error is not triggered any more as tested using #2536.

Still, I am terrified by the write_utf8(..., input_file). Would it be wiser to write into temp copy instead, if writing is necessary?

yihui added a commit that referenced this issue Jan 17, 2024
@yihui
Copy link
Member

yihui commented Jan 17, 2024

@cderv I agree it will be safer to disallow running render() on arbitrary input files. I just don't know a proper way to disallow it. I'm okay with checking the input file extension. Perhaps .md, .markdown, together with other .r* or .q* extensions will cover most use cases (how about also .txt? I don't know).

@J-Moravec Yes, it looks scary. In the case of extract_preserve_chunks(), we can't write to a temporary copy (well, I think we can, but it will be quite complicated). I just committed dfc4775 and I think it will be much safer now.

Copy link

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants