fpaste: fwrite output as a character vector #4572

Open
mrdwab opened this issue Jun 23, 2020 · 8 comments
Labels
fwrite top request One of our most-requested issues

@mrdwab

mrdwab commented Jun 23, 2020

Given the speed of fwrite, it can be used in conjunction with fread as an alternative to do.call(paste, ...) to flatten multiple columns into a character vector. It would be nice to be able to capture the output of fwrite directly as a character vector.

It is much faster than some of the other idiomatic approaches that are often considered.

Here's the behavior I'm hoping to be able to replicate:

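library(data.table)

fpaste <- function(dt, sep = ",") {
  x <- tempfile()
  on.exit(unlink(x))  # remove the temporary file when done
  # write the rows without a header, then read each line back as one string
  fwrite(dt, file = x, sep = sep, col.names = FALSE)
  fread(x, sep = "\n", header = FALSE)
}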

d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i')) 
cols = c("b", "c", "d")

fpaste(d[cols], "-")
#       V1
# 1: a-d-g
# 2: b-e-h
# 3: c-f-i
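
For reference, the base R idiom that this round trip is meant to replace, do.call(paste, ...), can be sketched as follows. It is a minimal illustration on the same toy data, assuming character columns; it is not part of the proposed API.

```r
# Same toy data as above; stringsAsFactors = FALSE keeps the columns as
# character on older R versions too.
d <- data.frame(a = 1:3, b = c("a", "b", "c"), c = c("d", "e", "f"),
                d = c("g", "h", "i"), stringsAsFactors = FALSE)
cols <- c("b", "c", "d")

# Pass the selected columns to paste() as separate arguments, plus sep.
flat <- do.call(paste, c(d[cols], sep = "-"))
flat
# [1] "a-d-g" "b-e-h" "c-f-i"
```

Unlike fpaste, this returns a plain character vector directly, which is the shape of output the request is asking fwrite to produce.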

Here's a comparison with a straightforward paste in a data.table:

set.seed(1) 
d2 <- d[sample(1:3,1e6,TRUE),]
d3 <- as.data.table(d2)

bench::mark(fpaste(d2[cols], "-")$V1, d3[, paste(b, c, d, sep = "-")])
## # A tibble: 2 x 13
##   expression                          min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr>                      <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 fpaste(d2[cols], "-")$V1         90.2ms  93.8ms     10.8     8.41MB     3.60     3     1      278ms
## 2 d3[, paste(b, c, d, sep = "-")] 220.9ms 223.2ms      4.43   30.55MB     0        3     0      678ms
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
@MichaelChirico
Member

For the record, I tried using capture.output instead of disk I/O, and it's far slower (I gave up running the benchmark).

@mrdwab
Author

mrdwab commented Jun 24, 2020

@MichaelChirico I forgot to mention in my original post that I also tried capture.output and gave up. I then tried R.utils::captureOutput, which performed much better but was still slower than fpaste.

@ColeMiller1
Contributor

This sounds impressive. I do not understand how writing to file is faster than manipulating it in RAM. What is happening? Why is capture.output discussed?

@jangorecki
Member

jangorecki commented Jun 24, 2020

You can still keep everything in RAM with fwrite and fread if the tempfile lives on a ramdisk, i.e. tempdir is set to a ramdisk location (search NEWS.md for "ramdisk"). I assume capture.output is being discussed because fwrite can print to the console.
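
A rough sketch of the ramdisk idea, under stated assumptions: ram_tempfile and fpaste_ram are hypothetical names, and "/dev/shm" is assumed to be a RAM-backed directory (true on most Linux systems, not elsewhere). When it is missing or not writable, the helper falls back to the regular tempdir().

```r
# Hypothetical helper: temp-file path on a RAM-backed directory if available.
# "/dev/shm" is an assumption; adjust it to wherever your ramdisk is mounted.
ram_tempfile <- function(ramdisk = "/dev/shm") {
  usable <- dir.exists(ramdisk) && file.access(ramdisk, mode = 2) == 0
  tempfile(tmpdir = if (usable) ramdisk else tempdir())
}

# Same round trip as fpaste, but the intermediate file stays in RAM when
# the ramdisk is usable.
fpaste_ram <- function(dt, sep = ",") {
  x <- ram_tempfile()
  on.exit(unlink(x))  # clean up the intermediate file
  data.table::fwrite(dt, file = x, sep = sep, col.names = FALSE)
  data.table::fread(x, sep = "\n", header = FALSE)$V1
}

# Usage, guarded so the sketch still loads without data.table installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  d <- data.frame(b = c("a", "b", "c"), c = c("d", "e", "f"),
                  stringsAsFactors = FALSE)
  print(fpaste_ram(d, "-"))
}
```

Setting the TMPDIR environment variable before starting R is another way to move tempdir() itself, which would make the original fpaste use the ramdisk without any code change.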

@mrdwab
Author

mrdwab commented Jun 24, 2020

@ColeMiller1 My initial thought was to just use fwrite with file = "" and use fread on that. But that just prints the output. capture.output could be used to convert that into a string, but it's really slow.

Using the relevant parts of R.utils::captureOutput I tried:

fpaste2 <- function(dt, sep = ",", envir = parent.frame()) {
  eval({
    file <- rawConnection(raw(0L), open = "w")
    on.exit({
      if (!is.null(file)) close(file)
    })
    capture.output(fwrite(dt, sep = sep, col.names = FALSE), file = file)
    fread(rawToChar(rawConnectionValue(file)), sep = "\n", header = FALSE)
  }, envir = envir, enclos = envir)
}

This performs well. It's at least as fast as, if not faster than, do.call(stringi::stri_join, c(d2[cols], sep = "-")), but not as fast as writing to a file and re-reading it.

@jangorecki
Member

This will not work for sep="" because fwrite expects a non-empty separator.

@jangorecki jangorecki changed the title fwrite output as a character vector fpaste: fwrite output as a character vector Nov 21, 2020
@jangorecki
Member

My use case for sep="" is to mimic paste0("id",1:1e9).
This paste0 command alone takes 40 minutes to evaluate, most likely because of R's global string cache.
If I could do fwrite(data.frame(a="id",b=1:1e9), sep="") then I could potentially save those 40 minutes. I actually need to write the result to csv rather than to the console, so populating R's global string cache just to dump it to csv is really inefficient.
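
Until something like that exists, one possible workaround is to build and write the ids in chunks so that only one chunk of strings is alive at a time. This is a base R sketch, not a data.table feature: write_ids_chunked is a hypothetical name, and whether chunking actually relieves the string-cache cost at 1e9 rows is an assumption that has not been measured here.

```r
# Hypothetical helper: write prefix1, prefix2, ..., prefixN to `path`,
# one id per line, building at most `chunk_size` strings at a time.
write_ids_chunked <- function(path, n, prefix = "id", chunk_size = 1e6) {
  con <- file(path, open = "w")
  on.exit(close(con))
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # start:end yields integers, so large ids are not printed as "1e+05"
    writeLines(paste0(prefix, start:end), con)
    start <- end + 1
  }
  invisible(path)
}

# Small usage example: ten ids written in chunks of three.
out <- tempfile(fileext = ".csv")
write_ids_chunked(out, n = 10, chunk_size = 3)
readLines(out)
```

The total work is the same as a single paste0 call; the design choice is only to bound how many strings exist at once.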

@msummersgill

Just now seeing this today, but I think there certainly is an opportunity to improve vectorized string concatenation performance with an fpaste() function.

Back in 2018, I had a use case where this was the bottleneck in a data pipeline. I posted to Stack Overflow, https://stackoverflow.com/questions/48233309/fast-concatenation-of-data-table-columns-into-one-string-column , and in the course of investigating, I was surprised to find the same thing others described here - it was faster to fwrite the dataset to disk, use sed to perform the concatenation, and fread to pull it back into memory.

One of the answers, by Martin Modrák, proposed repurposing some of the code from /src/fwrite.c and ran 8x faster than the previous best - an optimized sprintf call. From there, I put that code into a single-function package - fastConcat - that we still use in production at my employer. https://github.com/msummersgill/fastConcat
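
For readers unfamiliar with the sprintf baseline, a vectorized sprintf concatenation might look roughly like the sketch below. The exact call used in the Stack Overflow thread is not reproduced here; the format string and toy data are illustrative assumptions.

```r
d <- data.frame(b = c("a", "b", "c"), c = c("d", "e", "f"),
                d = c("g", "h", "i"), stringsAsFactors = FALSE)
cols <- c("b", "c", "d")

# Build one "%s" per column, joined by the separator: "%s-%s-%s".
fmt <- paste(rep("%s", length(cols)), collapse = "-")

# sprintf is vectorized over its arguments, so each row is formatted at once.
via_sprintf <- do.call(sprintf, c(list(fmt), unname(as.list(d[cols]))))
via_sprintf
# [1] "a-d-g" "b-e-h" "c-f-i"
```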

fastConcat::concat() only supports single digit integers (the use case was highly specific), but it is a working proof of concept that the code in /src/fwrite.c could probably be re-purposed to create a data.table::fpaste() with performance at least an order of magnitude better than base::paste().

@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024