
use writeLines for fread(text=.) #4805

Merged: 1 commit merged into master, May 11, 2021

Conversation

MichaelChirico (Member)

Cleaner & equivalent behavior
codecov bot commented Nov 11, 2020

Codecov Report

Merging #4805 (23e47a3) into master (70b6b13) will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #4805   +/-   ##
=======================================
  Coverage   99.47%   99.47%           
=======================================
  Files          73       73           
  Lines       14557    14557           
=======================================
  Hits        14481    14481           
  Misses         76       76           
Impacted Files   Coverage Δ
R/fread.R        100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -35,7 +35,7 @@ yaml=FALSE, autostart=NA, tmpdir=tempdir(), tz="")
   if (!is.character(text)) stop("'text=' is type ", typeof(text), " but must be character.")
   if (!length(text)) return(data.table())
   if (length(text) > 1L) {
-    cat(text, file=(tmpFile<-tempfile(tmpdir=tmpdir)), sep="\n") # avoid paste0() which could create a new very long single string in R's memory
+    writeLines(text, tmpFile<-tempfile(tmpdir=tmpdir)) # avoid paste0() which could create a new very long single string in R's memory
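For context, an illustrative sketch (not part of the PR): the two calls emit the same lines, and the only byte-level difference is that writeLines() terminates the final line with a newline, which cat(..., sep="\n") does not. fread() treats both files identically.

```r
txt <- c("a,b", "1,2")
f1 <- tempfile()
f2 <- tempfile()
cat(txt, file = f1, sep = "\n")  # old code path: no newline after the last element
writeLines(txt, f2)              # new code path: every line newline-terminated
readChar(f1, file.size(f1))  # "a,b\n1,2"
readChar(f2, file.size(f2))  # "a,b\n1,2\n"
```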
MichaelChirico (Member Author):
We could also use useBytes=TRUE here to facilitate writing UTF-8 strings on Windows:

https://stackoverflow.com/q/10675360/3576984
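A sketch of that idea (hypothetical, not what was merged): with useBytes = TRUE, writeLines() emits the bytes of a UTF-8-marked string as-is instead of translating them through the native encoding, which matters on Windows where the native locale may not be UTF-8.

```r
txt <- "r\u00e9sum\u00e9"                  # UTF-8 string, encoding-marked in R
tmpFile <- tempfile()
writeLines(txt, tmpFile, useBytes = TRUE)  # write the raw UTF-8 bytes, no re-encoding
readLines(tmpFile, encoding = "UTF-8")
```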

@mrdwab commented Jan 6, 2021:

@MichaelChirico This isn't something I've tested, but maybe fwrite(as.data.table(text), tmpFile<-tempfile(tmpdir = tmpdir), col.names = FALSE) is also worth exploring. I'm basing this assumption on the speed I observed here.

Or maybe list(text) instead of as.data.table(text), since fwrite should work with that too?
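A sketch of that suggestion (untested in the PR itself; it assumes a one-column write where each element of `text` becomes one output line):

```r
library(data.table)

text <- c("make,mpg", "Mazda RX4,21", "Datsun 710,22.8")
tmpFile <- tempfile(tmpdir = tempdir())
# One-column input: each element of `text` becomes one line of output.
# quote = FALSE and col.names = FALSE keep the lines verbatim, with no header row.
fwrite(list(text), tmpFile, quote = FALSE, col.names = FALSE)
identical(readLines(tmpFile), text)  # TRUE
```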

MichaelChirico (Member Author):
In general I think it may warrant some benchmarking. I have actually run into an issue recently where cat() is very slow for long input: in my case I was cat-ing a whole file character by character, e.g. cat(strsplit(readChar(f, file.size(f)), NULL)[[1]], sep = '', file = f), which is quite slow.

MichaelChirico commented May 10, 2021

A benchmark of approaches (h/t @mrdwab for suggesting fwrite):

library(data.table)
library(microbenchmark)
library(withr)

mtcars = as.data.table(mtcars, keep.rownames = "make")
rows = do.call(paste, c(unname(mtcars), list(sep = ",")))
meta = data.table(
  n_rows = c(1e2, 1e5, 1e8),
  n_reps = c(1e3, 1e2, 5)
)
for (ii in seq_len(nrow(meta))) {
  cat(sprintf("timing for emitting %d rows of input\n", meta$n_rows[ii]))
  txt = sample(rows, meta$n_rows[ii], TRUE)
  print(microbenchmark::microbenchmark(
    times = meta$n_reps[ii],
    cat = with_tempfile("tmp", cat(txt, file = tmp, sep = "\n")),
    writeLines = with_tempfile("tmp", writeLines(txt, tmp)),
    fwrite = with_tempfile("tmp", fwrite(list(txt), tmp, quote = FALSE))
  ))
}

output:

timing for emitting 100 rows of input
Unit: microseconds
       expr     min       lq     mean   median       uq      max neval cld
        cat 355.504 363.2115 381.2248 366.9645 374.1155 2365.032  1000   c
 writeLines  92.257  96.9025 104.1850  99.6420 101.9755  268.826  1000 a  
     fwrite 129.107 140.6755 156.1185 144.9870 150.6950 2355.349  1000  b 
timing for emitting 100000 rows of input
Unit: microseconds
       expr        min         lq       mean    median        uq       max neval cld
        cat 290911.277 295095.536 303709.532 305186.89 308638.26 330139.37   100   c
 writeLines  21014.126  25466.695  32991.918  36208.78  37545.37  48450.52   100  b 
     fwrite    158.899    281.687   9117.068  11869.05  12574.75  17852.94   100 a  
timing for emitting 100000000 rows of input
Unit: microseconds
       expr          min          lq        mean      median          uq         max neval cld
        cat 2.841903e+08 284357875.9 289614328.1 290874438.7 294160661.5 294488373.2     5   c
 writeLines 2.261030e+07  24129058.5  27682231.0  26859014.4  28238340.4  36574441.3     5  b 
     fwrite 2.593020e+02    412013.1    336822.1    419335.1    421485.6    431017.2     5 a  

So fwrite() is the clear winner if people are using fread() to ingest huge character vectors, but I don't think that's a primary use case here. I lean toward keeping writeLines() for simplicity.

Also, the need to supply quote=FALSE makes me a bit wary that using fwrite() could become a maintenance burden: fwrite() will always do some inspection of the input to decide how best to emit it, which writeLines() never does.
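To illustrate the quoting concern (a sketch, not from the PR): with the default quote="auto", fwrite() re-quotes any line that happens to contain the separator or quote characters, so the bytes fread() would read back differ from the original input; quote=FALSE is needed to pass lines through verbatim.

```r
library(data.table)

line <- 'he said "hi", then left'   # contains both a comma and embedded quotes
f_auto <- tempfile()
f_raw  <- tempfile()
fwrite(list(line), f_auto, col.names = FALSE)                 # default quote = "auto"
fwrite(list(line), f_raw,  col.names = FALSE, quote = FALSE)
readLines(f_auto)  # not identical to `line`: field is quoted, embedded quotes doubled
readLines(f_raw)   # identical to `line`
```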

@mattdowle mattdowle added this to the 1.14.1 milestone May 11, 2021
@mattdowle mattdowle merged commit 3764969 into master May 11, 2021
@mattdowle mattdowle deleted the fread-text-writelines branch May 11, 2021 00:46
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023