Commit
ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code
Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* with the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:

* 5b501c5 is the main switch to use InMemoryDataset
* b31fb5e deletes `array_expression`
* 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
* 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* a0914f6 + eee491a contain ARROW-12696

Closes #10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
1 parent b34c8f6, commit 9347731
Showing 37 changed files with 1,360 additions and 1,708 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
# Licensed to the Apache Software Foundation (ASF) under one | ||
# or more contributor license agreements. See the NOTICE file | ||
# distributed with this work for additional information | ||
# regarding copyright ownership. The ASF licenses this file | ||
# to you under the Apache License, Version 2.0 (the | ||
# "License"); you may not use this file except in compliance | ||
# with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an | ||
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
|
||
|
||
# The following S3 methods are registered on load if dplyr is present | ||
|
||
arrange.arrow_dplyr_query <- function(.data, ..., .by_group = FALSE) { | ||
call <- match.call() | ||
exprs <- quos(...) | ||
if (.by_group) { | ||
# when the data is grouped and .by_group is TRUE, order the result by | ||
# the grouping columns first | ||
exprs <- c(quos(!!!dplyr::groups(.data)), exprs) | ||
} | ||
if (length(exprs) == 0) { | ||
# Nothing to do | ||
return(.data) | ||
} | ||
.data <- arrow_dplyr_query(.data) | ||
# find and remove any dplyr::desc() and tidy-eval | ||
# the arrange expressions inside an Arrow data_mask | ||
sorts <- vector("list", length(exprs)) | ||
descs <- logical(0) | ||
mask <- arrow_mask(.data) | ||
for (i in seq_along(exprs)) { | ||
x <- find_and_remove_desc(exprs[[i]]) | ||
exprs[[i]] <- x[["quos"]] | ||
sorts[[i]] <- arrow_eval(exprs[[i]], mask) | ||
if (inherits(sorts[[i]], "try-error")) { | ||
msg <- paste('Expression', as_label(exprs[[i]]), 'not supported in Arrow') | ||
return(abandon_ship(call, .data, msg)) | ||
} | ||
names(sorts)[i] <- as_label(exprs[[i]]) | ||
descs[i] <- x[["desc"]] | ||
} | ||
.data$arrange_vars <- c(sorts, .data$arrange_vars) | ||
.data$arrange_desc <- c(descs, .data$arrange_desc) | ||
.data | ||
} | ||
arrange.Dataset <- arrange.ArrowTabular <- arrange.arrow_dplyr_query | ||
|
||
# Helper to handle desc() in arrange() | ||
# * Takes a quosure as input | ||
# * Returns a list with two elements: | ||
# 1. The quosure with any wrapping parentheses and desc() removed | ||
# 2. A logical value indicating whether desc() was found | ||
# * Performs some other validation | ||
find_and_remove_desc <- function(quosure) { | ||
expr <- quo_get_expr(quosure) | ||
descending <- FALSE | ||
if (length(all.vars(expr)) < 1L) { | ||
stop( | ||
"Expression in arrange() does not contain any field names: ", | ||
deparse(expr), | ||
call. = FALSE | ||
) | ||
} | ||
# Use a while loop to remove any number of nested pairs of enclosing | ||
# parentheses and any number of nested desc() calls. In the case of multiple | ||
# nested desc() calls, each one toggles the sort order. | ||
while (identical(typeof(expr), "language") && is.call(expr)) { | ||
if (identical(expr[[1]], quote(`(`))) { | ||
# remove enclosing parentheses | ||
expr <- expr[[2]] | ||
} else if (identical(expr[[1]], quote(desc))) { | ||
# remove desc() and toggle descending | ||
expr <- expr[[2]] | ||
descending <- !descending | ||
} else { | ||
break | ||
} | ||
} | ||
return( | ||
list( | ||
quos = quo_set_expr(quosure, expr), | ||
desc = descending | ||
) | ||
) | ||
} |
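
The unwrapping loop in `find_and_remove_desc()` can be tried in isolation. The following is a minimal base-R sketch, not the code from this commit: it works on a plain quoted expression rather than an rlang quosure, it skips the field-name validation, and the name `strip_desc` is hypothetical.

```r
# Hypothetical helper: peel off enclosing parentheses and desc() calls from a
# quoted expression, toggling the sort direction once per desc().
strip_desc <- function(expr) {
  descending <- FALSE
  while (is.call(expr)) {
    head <- expr[[1]]
    if (identical(head, quote(`(`))) {
      # remove one pair of enclosing parentheses
      expr <- expr[[2]]
    } else if (identical(head, quote(desc))) {
      # remove one desc() and flip the direction
      expr <- expr[[2]]
      descending <- !descending
    } else {
      # any other call is the real sort expression; stop unwrapping
      break
    }
  }
  list(expr = expr, desc = descending)
}

nested <- strip_desc(quote(desc((desc(x)))))  # two desc() calls cancel out
single <- strip_desc(quote(desc(x + 1)))      # one desc() around a call
```

Because the check compares the call head against the bare symbol `desc`, a namespaced call such as `dplyr::desc(x)` would not be unwrapped by this loop alone.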
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# Licensed to the Apache Software Foundation (ASF) under one | ||
# or more contributor license agreements. See the NOTICE file | ||
# distributed with this work for additional information | ||
# regarding copyright ownership. The ASF licenses this file | ||
# to you under the Apache License, Version 2.0 (the | ||
# "License"); you may not use this file except in compliance | ||
# with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an | ||
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
|
||
|
||
# The following S3 methods are registered on load if dplyr is present | ||
|
||
collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) { | ||
x <- ensure_group_vars(x) | ||
x <- ensure_arrange_vars(x) # this sets x$temp_columns | ||
# Pull only the selected rows and cols into R | ||
# See dataset.R for Dataset and Scanner(Builder) classes | ||
tab <- Scanner$create(x)$ToTable() | ||
# Arrange rows | ||
if (length(x$arrange_vars) > 0) { | ||
tab <- tab[ | ||
tab$SortIndices(names(x$arrange_vars), x$arrange_desc), | ||
names(x$selected_columns), # this omits x$temp_columns from the result | ||
drop = FALSE | ||
] | ||
} | ||
if (as_data_frame) { | ||
df <- as.data.frame(tab) | ||
tab$invalidate() | ||
restore_dplyr_features(df, x) | ||
} else { | ||
restore_dplyr_features(tab, x) | ||
} | ||
} | ||
collect.ArrowTabular <- function(x, as_data_frame = TRUE, ...) { | ||
if (as_data_frame) { | ||
as.data.frame(x, ...) | ||
} else { | ||
x | ||
} | ||
} | ||
collect.Dataset <- function(x, ...) dplyr::collect(arrow_dplyr_query(x), ...) | ||
|
||
compute.arrow_dplyr_query <- function(x, ...) dplyr::collect(x, as_data_frame = FALSE) | ||
compute.ArrowTabular <- function(x, ...) x | ||
compute.Dataset <- compute.arrow_dplyr_query | ||
|
||
pull.arrow_dplyr_query <- function(.data, var = -1) { | ||
.data <- arrow_dplyr_query(.data) | ||
var <- vars_pull(names(.data), !!enquo(var)) | ||
.data$selected_columns <- set_names(.data$selected_columns[var], var) | ||
dplyr::collect(.data)[[1]] | ||
} | ||
pull.Dataset <- pull.ArrowTabular <- pull.arrow_dplyr_query |
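
The sort step in `collect.arrow_dplyr_query()` materializes the selected columns plus any columns needed only for ordering, sorts, and then subsets back to the selected columns so the temporary ones are dropped. The sketch below illustrates that pattern on a base-R `data.frame`, under stated assumptions: `collect_sketch` and its arguments are hypothetical stand-ins for the query fields, and `order()` replaces the Arrow `SortIndices()` call.

```r
# Hypothetical illustration of "sort on extra columns, then drop them".
collect_sketch <- function(df, selected, arrange_vars, arrange_desc) {
  # pull the selected columns plus any sort-only (temporary) columns
  needed <- union(selected, arrange_vars)
  tab <- df[, needed, drop = FALSE]
  if (length(arrange_vars) > 0) {
    # build one numeric sort key per variable, negated when descending
    keys <- Map(
      function(v, d) if (d) -xtfrm(tab[[v]]) else xtfrm(tab[[v]]),
      arrange_vars, arrange_desc
    )
    idx <- do.call(order, unname(keys))
    # subsetting to `selected` omits the temporary sort columns
    tab <- tab[idx, selected, drop = FALSE]
  } else {
    tab <- tab[, selected, drop = FALSE]
  }
  tab
}

df <- data.frame(x = c(3, 1, 2), y = c("c", "a", "b"))
res <- collect_sketch(df, selected = "y", arrange_vars = "x", arrange_desc = TRUE)
```

Here `x` is used only for ordering and does not appear in `res`, which mirrors how `x$temp_columns` is left out of the collected result.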