-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: export/import DataFrame as raw vector #1072
Conversation
In general I prefer separate functions as you've implemented. Otherwise a separate 'file' argument if writing a file, then just need a check for This kind of thing makes sense for generics, but when the arguments are different types, I find it cleaner not to overload. |
@shikokuchuo Thank you for your comment. I agree with you, but I think there are cases here where it is better to focus on consistency with other language APIs in Polars and with other packages in R.
In other words, in both Polars and R, it seems that it is acceptable to have different behavior for read functions depending on the argument type (although these functions are not generic functions, of course). So I am thinking here that the function to convert a raw vector of Arrow IPC to a DataFrame should be used by rewriting |
That's fine. My comment is just 'in general' - if there are more important considerations such as API consistency as you point out, don't let me stop you! |
It seems working fine with This example is from https://shikokuchuo.net/mirai/articles/databases.html#database-hosting---using-arrow-database-connectivity library(mirai)
daemons(1)
#> [1] 1
everywhere({
library(DBI)
con <<- dbConnect(adbi::adbi("adbcsqlite"), uri = ":memory:")
})
serialization(
refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(future = TRUE),
polars::pl$read_ipc),
class = "nanoarrow_array_stream"
)
m <- mirai(dbWriteTableArrow(con, "iris", iris))
call_mirai(m)$data
#> [1] TRUE
m <- mirai(dbReadTableArrow(con, "iris"))
call_mirai(m)$data
#> shape: (150, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
#> │ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
#> │ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ … ┆ … ┆ … ┆ … ┆ … │
#> │ 6.7 ┆ 3.0 ┆ 5.2 ┆ 2.3 ┆ virginica │
#> │ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ virginica │
#> │ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ virginica │
#> │ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ virginica │
#> │ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
#> └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘ Created on 2024-05-03 with reprex v2.1.0 |
Great! I can confirm that the above works. For me this also works: serialization(
refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(),
polars::pl$read_ipc),
class = "nanoarrow_array_stream"
) I didn't find the documentation for what that Apart from this, is there scope to make it even more ergonomic to have a direct counterpart to the I think once |
Polars introduced the StringView type earlier than other Arrow implementations and uses it as the default string type, which can cause problems when passing Arrow IPC to other Arrow implementations. Also, the df <- polars::pl$DataFrame(string = "foo")
df$to_raw_ipc(future = FALSE) |>
arrow::read_ipc_file()
#> string
#> 1 foo
df$to_raw_ipc(future = TRUE) |>
arrow::read_ipc_file()
#> Error: cannot handle Array of type <utf8_view> Created on 2024-05-03 with reprex v2.1.0 This is not a problem for exchanging data between Polars and should result in a slight performance increase due to the lack of extra conversions.
I too think something like that would be worth adding, but there is no consensus yet. |
Thanks for exposing these anyway - I think they will be useful for your users if they want to work with parallel / distributed computing. I will add something to the Or once this is merged / released actually :) |
I will merge this for now. If function names need to be changed, I believe they can be changed later. |
Functions originally implemented on the Rust side for use in inter-process communication are refactored and made available on the R side.
This is useful when implementing asynchronous processing in R (use with the
mirai
package for example).I am considering what function should be used on the R side.
Python Polars uses
write_ipc
andread_ipc
to switch between writing to the session and writing to the file depending on the first argument, but I feel that it is better to have the same function for reading and a separate function for writing, as in the Rarrow
package.