ARROW-15168: [R] Add S3 generics to create main Arrow objects #12817
Conversation
Force-pushed from 0f351f8 to 18de4ce
Ok! This is ready for a review. The main motivation here is to allow other packages to define conversions to Arrow objects. This is most useful for Table objects, since we currently convert to Table before writing. Other methods I added here are important for lower-level packages like the work-in-progress 'substrait' and 'narrow', where there are analogues for these conversions.

Adding the S3 methods was reasonably straightforward; however, actually using them was quite complicated because R vector -> Array conversion is highly optimized and mostly done in C++. The approach I took was to keep the existing code, changing as little as possible, for R objects that we handle internally. For other objects, the C++ code calls back into R via `as_arrow_array()`.

In a nutshell, before, users were stuck with the built-in behaviour for objects with a custom class. This was usually OK, but occasionally sub-optimal, non-functional, or corrupting. I've picked a probably rare example because record-style vectors are new, but the current set of conversions assumes that they're list-like (as opposed to data-frame-like):

```r
# install.packages("arrow")
library(arrow, warn.conflicts = FALSE)
tbl <- tibble::tibble(
  x = 1:5,
  points = wk::xy(x = 1:5, y = 6:10)
)
# wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
str(unclass(tbl$points))
#> List of 2
#>  $ x: num [1:5] 1 2 3 4 5
#>  $ y: num [1:5] 6 7 8 9 10
# in the release version this fails
Table$create(tbl)
#> Error: Invalid: All columns must have the same length
# ...or generates bogus output
as.data.frame(Table$create(x = 1:2, points = tbl$points))
#> Warning: Invalid metadata$r
#> # A tibble: 2 × 2
#>       x         points
#>   <int> <list<double>>
#> 1     1            [5]
#> 2     2            [5]
```

After this PR you can define the conversion methods for the class:

```r
# remotes::install_github("apache/arrow/r#12817")
library(arrow, warn.conflicts = FALSE)
tbl <- tibble::tibble(
  x = 1:5,
  points = wk::xy(x = 1:5, y = 6:10)
)
# wk::xy() is a record-style vector, like POSIXlt, with a vctrs implementation
str(unclass(tbl$points))
#> List of 2
#>  $ x: num [1:5] 1 2 3 4 5
#>  $ y: num [1:5] 6 7 8 9 10
# this now fails:
tf <- tempfile()
write_feather(tbl, tf)
#> Error:
#> ! Can't infer Arrow data type from object inheriting from wk_xy / wk_rcrd
# until...
type.wk_xy <- function(x, ...) {
  vctrs_extension_type(vctrs::vec_ptype(x))
}
as_arrow_array.wk_xy <- function(x, ...) {
  vctrs_extension_array(x)
}
# now works!
write_feather(tbl, tf)
read_feather(tf)
#> # A tibble: 5 × 2
#>       x  points
#>   <int> <wk_xy>
#> 1     1   (1 6)
#> 2     2   (2 7)
#> 3     3   (3 8)
#> 4     4   (4 9)
#> 5     5  (5 10)
# if for some reason the extension type is not loaded, we get the storage type
# with no warning (maybe not ideal?)
arrow::unregister_extension_type("arrow.r.vctrs")
read_feather(tf)
#> # A tibble: 5 × 2
#>       x points$x    $y
#>   <int>    <dbl> <dbl>
#> 1     1        1     6
#> 2     2        2     7
#> 3     3        3     8
#> 4     4        4     9
#> 5     5        5    10
```

Created on 2022-04-14 by the reprex package (v2.0.1)
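The extension mechanism described above can be sketched in plain base R, without arrow installed. All names here (`as_arrow_array_sketch`, `struct_array_sketch`) are hypothetical stand-ins for illustration: a generic that third-party packages can implement for their own classes, plus a default method that errors informatively instead of guessing.

```r
# Hypothetical generic standing in for as_arrow_array(); no arrow required.
as_arrow_array_sketch <- function(x, ...) UseMethod("as_arrow_array_sketch")

# Default: refuse to guess, mirroring the "Can't infer Arrow data type" error
as_arrow_array_sketch.default <- function(x, ...) {
  stop(
    "Can't infer Arrow data type from object inheriting from ",
    paste(class(x), collapse = " / ")
  )
}

# A package that defines a record-style class registers its own method,
# converting the fields rather than treating the object as list-like:
as_arrow_array_sketch.wk_xy <- function(x, ...) {
  fields <- unclass(x)
  structure(fields, class = "struct_array_sketch")
}

pts <- structure(list(x = 1:3, y = 4:6), class = c("wk_xy", "wk_rcrd"))
class(as_arrow_array_sketch(pts))
#> [1] "struct_array_sketch"
```

The key design point is that the default method errors rather than falling back to list-like conversion, which is what previously produced the bogus output shown above.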
This looks great Dewey. I had a few small comments and questions. Seems like a great step forward in supporting third-party types.
r/R/util.R
Outdated
```r
    " must be an object of class 'data.frame', 'RecordBatch', or 'Table', not '",
    class(env[[deparse(call$x)]])[[1]],
    "'."
as_writable_table <- function(x, arg_name = "x") {
```
If we are passing down the `arg_name`, do we want to also take the call object to pass in? That would make it so the function calling this helper would be shown instead of `as_writable_table` in the error chain.
I opted to just simplify this a bit...it only gets used in a few places! I'll send you a message about the use of `sys.call()` vs `caller_env()`...I did some digging and it makes sense to me now why the docs say to use `caller_env()`.
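The trade-off being discussed can be sketched in base R: a shared helper can report the *calling* function in its error by looking one frame up with `sys.call(-1)` (rlang's `caller_env()`/`abort()` is the tidier equivalent). The names below are hypothetical, for illustration only.

```r
# Hypothetical error helper that names its caller in the message
stop_not_writable <- function() {
  caller <- deparse(sys.call(-1))  # the call one frame up, e.g. write_thing(1)
  stop("`x` passed to ", caller, " is not a writable table", call. = FALSE)
}

write_thing <- function(x) stop_not_writable()

err <- tryCatch(write_thing(1), error = function(e) conditionMessage(e))
err
#> [1] "`x` passed to write_thing(1) is not a writable table"
```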
```cpp
// preserving R type information
for (R_xlen_t i = 0; i < schema->num_fields(); i++) {
  if (schema->field(i)->type()->id() == Type::EXTENSION) {
    metadata_columns[i] = R_NilValue;
```
Does this handle nested ExtensionType columns? Is that a worry at all?
This was a really good catch! It led me down a bit of a rabbit hole but it now works!
```r
# remotes::install_github("apache/arrow/r#12817")
library(arrow, warn.conflicts = FALSE)
nested <- tibble::tibble(x = vctrs::new_vctr(1:5, class = "custom_vctr"))
infer_type(nested)
#> StructType
#> struct<x: <custom_vctr[0]>>
as_arrow_array(nested)
#> StructArray
#> <struct<x: <custom_vctr[0]>>>
#> -- is_valid: all not null
#> -- child 0 type: <custom_vctr[0]>
#>   [
#>     1,
#>     2,
#>     3,
#>     4,
#>     5
#>   ]
```

Created on 2022-04-19 by the reprex package (v2.0.1)
A few notes; I didn't get all the way through the PR, just the changes in `r/R`.
r/R/array.R
Outdated
```r
#' @export
as_arrow_array.POSIXlt <- function(x, ..., type = NULL) {
  as_arrow_array.vctrs_vctr(x, ..., type = type)
```
Is there a risk of an infinite loop if you have an object with class `c("vctrs_vctr", "POSIXlt")`? Or do we trust `NextMethod()` to do sane things?
I hadn't considered that! It doesn't look like that example causes problems and I haven't historically had problems with `NextMethod()`. Is there a related case I should be checking?
```r
# remotes::install_github("apache/arrow/r#12817")
library(arrow, warn.conflicts = FALSE)
x <- as.POSIXlt(Sys.Date())
class(x) <- c("vctrs_vctr", "POSIXlt")
array <- as_arrow_array(x)
x2 <- as.vector(array)
class(x2)
#> [1] "vctrs_vctr" "POSIXlt"
```
No, I was just seeing the `vctrs_vctr` method doing `NextMethod()` and the `POSIXlt` method calling the `vctrs_vctr` method, so I thought they would bounce back to each other; maybe they do, but `NextMethod()` handles it.
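The dispatch question here can be illustrated in base R (no arrow required): `NextMethod()` continues along the object's class vector from the *current* method's position rather than restarting dispatch, so methods invoked this way can't bounce back to each other. The generic `f` below is hypothetical.

```r
# NextMethod() walks the class vector forward; it never revisits a method.
f <- function(x) UseMethod("f")
f.vctrs_vctr <- function(x) c("vctrs_vctr", NextMethod())
f.POSIXlt <- function(x) c("POSIXlt", NextMethod())
f.default <- function(x) "default"

x <- structure(1, class = c("vctrs_vctr", "POSIXlt"))
f(x)
#> [1] "vctrs_vctr" "POSIXlt"   "default"
```

The loop risk Dewey mentions below arises when a method is called *directly* (not via dispatch), which is why moving the error into a plain function is safer.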
You were totally right...I was using `NextMethod()` to re-use the error generation code, so you could trigger an infinite loop with an invalid `type` and an oddly formed `POSIXlt`. A better approach was to pull the error code into its own function (this also came up in the `data.frame` method that was needed to make nested extension types work).
r/R/array.R
Outdated
```r
# If from_constructor is FALSE, we use the built-in logic exposed by
# Array$create(). If there is no built-in conversion, C++ will call back
# here with from_constructor = TRUE to generate a nice error message.
```
Why does C++ have to call back here to create an error message?
I agree that it's weird. It's because array creation occurs from both R and C++ (and most of the time the call comes from C++). For example, `Table$create()` does:

- [R] `Table$create()`
- [C++] `Table__from_dots()`
- [C++] `arrow::r::RConverter` (using all the old code pathways if we handle that object type internally), or `arrow::r::RExtensionConverter` -> [R] `as_arrow_array()`

Other R calls that use this or a similar path are `Array$create()`, `ChunkedArray$create()` and `RecordBatch$create()`. Perhaps some of that usage should use `as_arrow_array()` instead, but table creation in particular makes good use of threading and this was the best I came up with that didn't undo all of that optimization!
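The two-sided arrangement described above can be sketched in base R. All names here are hypothetical stand-ins: `convert_builtin()` stands in for the optimized C++ converters, and the constructor tries it first, falling back to S3 dispatch only for classes it doesn't handle internally.

```r
# Stand-in for the fast C++ conversion of natively handled types:
convert_builtin <- function(x) {
  if (is.null(oldClass(x)) && is.atomic(x)) list(storage = x) else NULL
}

# The S3 extension point; its default also tries the built-in path first
as_array_sketch <- function(x, ...) UseMethod("as_array_sketch")
as_array_sketch.default <- function(x, ...) {
  out <- convert_builtin(x)
  if (is.null(out)) stop("no conversion for class ", class(x)[[1]])
  out
}

# Constructor: fast path first (no S3 dispatch overhead), then dispatch
create_sketch <- function(x) {
  out <- convert_builtin(x)
  if (is.null(out)) out <- as_array_sketch(x)
  out
}

# A package method for a custom class picks its own storage representation:
as_array_sketch.my_vctr <- function(x, ...) list(storage = unclass(x))

create_sketch(1:3)$storage
#> [1] 1 2 3
create_sketch(structure(1:3, class = "my_vctr"))$storage
#> [1] 1 2 3
```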
What happens if you don't error if `from_constructor = TRUE`?

I think I understand how you get here now, but I don't know that I'll remember next time I look at this. IIUC it's because the extension type handler is calling `as_arrow_array()` from C++, and if you have implemented `as_arrow_array.YourType`, you won't hit this default method. But if you do, it means that we don't know what to do with this extension type, so we error. Right?

But aren't all extension types based on regular types, which we can convert?
I updated some names and rewrote the comments to see if this can be clearer. If there's no error when `from_vec_to_array = TRUE`, we'd get an infinite loop. We will always need to check all the S3 methods and `vec_to_Array()`, regardless of whether somebody calls `as_arrow_array()` or something that goes through the Converter API in C++, so we need this kind of logic somewhere (in theory it could live on the C++ side too, but I think it's easier to write here).

I also renamed the `RExtensionType` converter to `AsArrowArrayConverter`...there's no requirement that `as_arrow_array()` return an extension array (although most useful methods probably will).

In a few places I had used `NextMethod()` to fall back and error, and, as you pointed out, that runs the risk of an infinite loop. I moved that error to a function to make it clearer/less error-prone when a conversion is not possible.
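The refactor described here can be sketched in base R: rather than a `NextMethod()` chain falling through to a default that errors (which can loop if methods call each other directly), every conversion path calls one named error helper. The helper name and message below are illustrative, not the PR's exact code.

```r
# A single, directly callable error helper: no dispatch, so no loop risk.
stop_cant_convert_array <- function(x, type = NULL) {
  msg <- paste0(
    "Can't create Array from object of class ",
    paste(class(x), collapse = " / ")
  )
  if (!is.null(type)) msg <- paste0(msg, " with type ", type)
  stop(msg, call. = FALSE)
}

err <- tryCatch(
  stop_cant_convert_array(structure(1, class = "oddly_formed"), type = "int32"),
  error = function(e) conditionMessage(e)
)
err
#> [1] "Can't create Array from object of class oddly_formed with type int32"
```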
r/R/type.R
Outdated
```diff
@@ -69,16 +70,43 @@ FLOAT_TYPES <- c("float16", "float32", "float64", "halffloat", "float", "double"
 #' type(mtcars)
 #' type(Sys.Date())
 #' @export
-type <- function(x) UseMethod("type")
+type <- function(x, ...) UseMethod("type")
```
Should this migrate to be called `data_type`? It seems inconsistent with `as_data_type`.
...or `as_data_type` could be `as_type()`? (I don't remember why I didn't do that in the first place.) If renaming `type()` is an option, my vote would be `infer_type()` or `infer_data_type()` since we're not exactly constructing a type here, we're giving a hypothetical future type from something that might be converted to an Array.
r/R/type.R
Outdated
```diff
 #' @export
-type <- function(x) UseMethod("type")
+type <- function(x) infer_type(x)
```
This is probably safe to do, but you should add to deprecated.R something like

```r
type <- function(x) {
  .Deprecated("infer_type")
  infer_type(x)
}
```

in case anyone was using this.
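The suggested shim can be demonstrated in base R (the `_sketch` names below are hypothetical stand-ins for the real `type()`/`infer_type()`): `.Deprecated()` signals a warning pointing at the new name while the old entry point keeps working.

```r
# Stand-in for the new generic:
infer_type_sketch <- function(x) class(x)[[1]]

# The deprecation shim for the old name:
type_sketch <- function(x) {
  .Deprecated("infer_type_sketch")
  infer_type_sketch(x)
}

# The old name still returns the right answer...
res <- suppressWarnings(type_sketch(1L))
res
#> [1] "integer"

# ...and the warning it signals names the replacement:
msg <- tryCatch(type_sketch(1L), warning = function(w) conditionMessage(w))
```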
One other thing you might want to add here: `as_arrow_array.pyarrow.lib.Array` and the similar reticulations
Force-pushed from e19827d to 245b0bf
r/R/array.R
Outdated
```diff
@@ -217,6 +217,125 @@ Array$create <- function(x, type = NULL) {
 Array$import_from_c <- ImportArray

 #' Convert an object to an Arrow Array
 #'
 #' Whereas `Array$create()` constructs an [Array] from the built-in data types
```
This leaves me with some questions. Are `as_arrow_array()` conversions not optimized? If `as_arrow_array()` is going to call the regular Arrow converters for normal types, and it would work with extension types, why would I ever call `Array$create()`? When would `Array$create()` and `as_arrow_array()` do different things?

Maybe you can illustrate through some `@examples`.
I updated the docs to clarify this...they're almost the same, but `Array$create()` tries to convert in C++ first and doesn't have S3 dispatch overhead in all the common cases. I don't know if that tiny difference matters other than that it helps when testing to make sure that the C++-first pathway does the right thing.
r/R/array.R
Outdated
```r
    storage_type = type$storage_type()
  )
} else {
  stop_cant_convert_array(x, type)
```
Why not `NextMethod()` anymore?
Because you were right about the possibility of infinite loops! We sometimes try the S3 methods first (i.e., when calling `as_arrow_array()`) and sometimes try C++ conversion first (e.g., when calling `Table$create()`)...I found it a bit easier to predict what will happen if we always just stop when conversion isn't possible.
r/R/array.R
Outdated
```r
arrays <- Map(as_arrow_array, x, types)
names(arrays) <- names

# ...because there isn't a StructArray$create() yet
```
There isn't (but there is in C++), but I think the reason we didn't add bindings to create from a vector of Arrays is because `Array$create(data.frame)` produces a StructArray. Why can't we use that here?
@wjones127's review highlighted the fact that the C++ conversion didn't handle `ExtensionType` fields within a `StructArray`, and because `as_arrow_array()` doesn't necessarily return an `ExtensionType`, it might do the wrong thing for an S3 vctr that maps to a regular `Array` type (e.g., `narrow::narrow_vctr()`). I went down a bit of a rabbit hole trying to make it work in C++; however, the `StructArray` converter is tightly coupled to the `StructBuilder`, whereas the `AsArrowArrayConverter` completely circumvents the Builder.

All this to say that the C++-level `can_convert_native()` now recurses through data.frames and returns `false` if any nested column won't work in C++ 🤯. This means we also need an S3 method to handle that case, and hence have to create a `StructArray` from `Array` objects. I have a vague recollection that I didn't see the point of `StructArray$create()` at the time (I was wrong!). I added a note about that next to the S3 method for data.frame since it's a little confusing.
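The two pieces described here can be sketched in base R with hypothetical names: a recursive check that refuses the native path if any nested column needs the S3 fallback, and a `data.frame` method that assembles per-column results into a struct-like value.

```r
# Recursive native-convertibility check: FALSE if any nested column is classed
can_convert_native_sketch <- function(x) {
  if (is.data.frame(x)) {
    all(vapply(x, can_convert_native_sketch, logical(1)))
  } else {
    # "native" here: an unclassed base vector, standing in for the types the
    # C++ converters handle without calling back into R
    is.null(oldClass(x)) && is.atomic(x)
  }
}

# data.frame fallback: convert each column, then assemble a struct-like record
as_array_sketch <- function(x, ...) UseMethod("as_array_sketch")
as_array_sketch.default <- function(x, ...) x
as_array_sketch.data.frame <- function(x, ...) {
  arrays <- lapply(x, as_array_sketch)
  structure(arrays, class = "struct_array_sketch")
}

df <- data.frame(a = 1:2)
df$b <- structure(1:2, class = "custom_vctr")

can_convert_native_sketch(data.frame(a = 1:2))
#> [1] TRUE
can_convert_native_sketch(df)
#> [1] FALSE
class(as_array_sketch(df))
#> [1] "struct_array_sketch"
```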
Ok. Since using the C interface here is clearly a hack, and apparently there is a reason to add a StructArray$create method, could you please make (or find, it could already exist) a jira and link the id here as a TODO?
Done! ARROW-16266 (also added as a `TODO(ARROW-16266)` comment here)
r/R/array.R
Outdated
```r
# If from_constructor is TRUE, this is a call from C++ for which S3 dispatch
# failed to find a method for the object. If this is the case, we error.
```
Presumably C++ has also tried using the regular type inference as well?
Yes! I added that to the comment (and to the one in type.R).
r/tests/testthat/test-Array.R
Outdated
```r
  float64()
}

# slightly different error if type was specified
```
Forgive the dead-horse beating, but it's hard for me to tell that this is the case by looking at the test file.
The flip side of that is that using regex it's impossible to tell that the error message is useful! With `expect_error_snapshot()`, the full errors are here: https://github.com/apache/arrow/pull/12817/files#diff-b944d8c5dbb671c5af92fedc0598a90f27d0aed3d3811175fa557c3ccf6cd339
You mean you don't speak `regex[\?!]{2}`
```r
# df <- data.frame(x = I(list(structure(1, foo = "bar"), structure(2, baz = "qux"))))
# class(df$x) <- c("sfc_MULTIPOLYGON", "sfc", "list")
# withr::with_options(
#   list("arrow.preserve_row_level_metadata" = TRUE), {
```
But this isn't the default in 7.0.0 is it?
It isn't...maybe a better name for the file would be 6.0.0? I wanted to keep this test to make sure that if somebody had written files this way they could be read back in again (since it seems like that's our general policy), but if row-level metadata was never enabled by default then maybe we don't need this test?
Force-pushed from 0132e49 to d4536ac
This is looking good! Just a few more notes
r/R/array.R
Outdated
```r
array <- vctrs_extension_array(
  x,
  ptype = type$ptype(),
  storage_type = type$storage_type()
)
return(array)
```
The other cases don't `return()` so maybe this one doesn't need to either?
```diff
-array <- vctrs_extension_array(
-  x,
-  ptype = type$ptype(),
-  storage_type = type$storage_type()
-)
-return(array)
+vctrs_extension_array(
+  x,
+  ptype = type$ptype(),
+  storage_type = type$storage_type()
+)
```
r/R/python.R
Outdated
```r
pa <- reticulate::import("pyarrow", convert = FALSE)
out <- pa$Table$from_arrays(x$columns, schema = x$schema)
# But set the convert attribute on the return object to the requested value
# Going through RecordBatchReader maintains schema metadata (e.g.,
```
This change is clever, though I have two concerns:

- Why would we lose schema metadata the other way? We pass the schema through the C interface. If it is losing metadata, that sounds like a bug to fix.
- I'm not sure how the Table to RecordBatchReader code works, but we should make sure that it's not allocating/concatenating Arrays so that the chunks of the ChunkedArrays line up into RecordBatches. If it did, that would be suboptimal. I haven't read the source but it would be worth a look.
I made a JIRA (ARROW-16269) for the schema metadata thing...the gist of it is that the schema metadata roundtrips fine but you end up with a situation where `roundtripped_table$col1$type` isn't the same as `roundtripped_table$schema$col1$type`.

Going through `RecordBatchReader` does re-chunk everything to line up into record batches, BUT the slicing is zero-copy (according to the comments):
Lines 240 to 244 in 1157e67:

```cpp
/// \brief Compute a stream of record batches from a (possibly chunked) Table
///
/// The conversion is zero-copy: each record batch is a view over a slice
/// of the table's columns.
class ARROW_EXPORT TableBatchReader : public RecordBatchReader {
```
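The re-chunking arithmetic behind this can be sketched in a few lines of base R (an illustration of the idea, not arrow's actual code): record-batch boundaries are the union of every column's chunk offsets, so each batch can be a zero-copy view over a slice of each column, with no concatenation required.

```r
# Two chunked columns over the same 5 rows, chunked differently:
col1_chunks <- c(3, 2)  # column 1 chunk lengths
col2_chunks <- c(1, 4)  # column 2 chunk lengths

# Batch boundaries: every offset where *any* column starts a new chunk
boundaries <- sort(unique(c(cumsum(col1_chunks), cumsum(col2_chunks))))
boundaries
#> [1] 1 3 5

# The resulting record batch lengths (each a slice of every column)
batch_lengths <- diff(c(0, boundaries))
batch_lengths
#> [1] 1 2 2
```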
r/R/record-batch-reader.R
Outdated
```r
#' @rdname as_record_batch_reader
#' @export
as_record_batch_reader.RecordBatch <- function(x, ...) {
  as_record_batch_reader(as_arrow_table(x))
```
You could go directly to the reader (https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L316), though this is probably fine.
It's in scope and was easy so I added it!
@github-actions crossbow submit -g r

Revision: 39c05b1 Submitted crossbow builds: ursacomputing/crossbow @ actions-1906

@github-actions crossbow submit test-r-arrow-backwards-compatibility

Revision: aae6133 Submitted crossbow builds: ursacomputing/crossbow @ actions-1911

@github-actions crossbow submit test-r-arrow-backwards-compatibility

@github-actions crossbow submit test-r-linux-valgrind

Revision: f3d3f30 Submitted crossbow builds: ursacomputing/crossbow @ actions-1915
…nd as_arrow_array()
…create() instead of direct C++ call
Force-pushed from f3d3f30 to 6a6ea12
@github-actions crossbow submit test-r-arrow-backwards-compatibility

Revision: b21126e Submitted crossbow builds: ursacomputing/crossbow @ actions-1918
Benchmark runs are scheduled for baseline = c04c3ff and contender = c13ef50. c13ef50 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have high level of regressions.
The `type()` function was deprecated in #12817 (ARROW-15168), but the documentation did not indicate that it had been deprecated. Closes #13748 from eitsupi/r-deprecated-type. Authored-by: SHIMA Tatsuya <[email protected]> Signed-off-by: Nic Crane <[email protected]>