ARROW-16154: [R] Errors which pass through handle_csv_read_error() and handle_parquet_io_error() need better error tracing #12839

Closed
wants to merge 3 commits into from
6 changes: 4 additions & 2 deletions r/R/csv.R
@@ -200,8 +200,10 @@ read_delim_arrow <- function(file,

   tryCatch(
     tab <- reader$Read(),
-    error = function(e) {
-      handle_csv_read_error(e, schema)
+    # n = 4 because we want the error to show up as being from read_delim_arrow()
+    # and not handle_csv_read_error()
+    error = function(e, call = caller_env(n = 4)) {
Member

Is it always n = 4? Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)

Member Author @thisisnic, Apr 11, 2022

> Is it always n = 4?

It's always n = 4 here, though I deliberately chose to pass the call parameter into handle_csv_read_error() so the function could be used elsewhere in the code where we may want to pass in a different environment.

> Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)

I could call rlang::current_env() above the tryCatch block; I went for calling caller_env() here as it felt "cleaner" to keep that code within the block.

I suppose that if the tryCatch block were changed to have more functions wrapped around it, then the number would be wrong; however, if we call current_env() outside of the block, we're unnecessarily calling it every time we call the function, even if there's no error.

Not sure what's better - what do you think?

Member

Feels brittle but it's probably fine. I'd just leave in some comments explaining why n = 4, and that you could have called current_env() outside the tryCatch but this way is lazy and only does it if there's an error (aside: it's just calling parent.frame(), which on my machine takes in the hundreds of nanoseconds to run, so the cost of calling it every time is not something I'm concerned about).

We can revisit later if/when we want to chain together multiple error handlers. Also looks like rlang is growing some experimental tooling around here (https://rlang.r-lib.org/reference/try_fetch.html) so maybe that will mature and be ready whenever we revisit this.

In sum, seems like you've thought this through, so just leave a note explaining why this non-obvious thing is there and 👍 !

+      handle_csv_read_error(e, schema, call)
     }
   )
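
For comparison, the alternative raised in the review thread above (capturing the environment outside the tryCatch() instead of counting frames with caller_env(n = 4)) might look like the sketch below. It is not code from this PR: `handle_error()` and `read_thing()` are hypothetical stand-ins for handle_csv_read_error() and read_delim_arrow(), and it assumes rlang 1.0.0 or later, where abort() accepts a `call` argument.

```r
library(rlang)

# Hypothetical stand-in for handle_csv_read_error(): re-raises `e`,
# attributing the new error to the frame passed in `call`.
handle_error <- function(e, call) {
  abort(conditionMessage(e), call = call)
}

# Hypothetical user-facing function, standing in for read_delim_arrow().
read_thing <- function() {
  # Captured eagerly on every call, so no frame counting is needed.
  call <- current_env()
  tryCatch(
    stop("conversion error"),
    error = function(e) handle_error(e, call)
  )
}

# Catch the re-raised error and inspect which call it is attributed to.
err <- tryCatch(read_thing(), error = function(e) e)
conditionCall(err)
```

The trade-off is the one discussed in the thread: current_env() runs on every call rather than only on error, but the result does not depend on how many functions sit between the handler and the user-facing function.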

6 changes: 4 additions & 2 deletions r/R/dataset.R
@@ -217,8 +217,10 @@ open_dataset <- function(sources,
   tryCatch(
     # Default is _not_ to inspect/unify schemas
     factory$Finish(schema, isTRUE(unify_schemas)),
-    error = function(e) {
-      handle_parquet_io_error(e, format)
+    # n = 4 because we want the error to show up as being from open_dataset()
+    # and not handle_parquet_io_error()
+    error = function(e, call = caller_env(n = 4)) {
+      handle_parquet_io_error(e, format, call)
     }
   )
 }
6 changes: 4 additions & 2 deletions r/R/dplyr-collect.R
@@ -29,8 +29,10 @@ collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) {
   # See query-engine.R for ExecPlan/Nodes
   tryCatch(
     tab <- do_exec_plan(x),
-    error = function(e) {
-      handle_csv_read_error(e, x$.data$schema)
+    # n = 4 because we want the error to show up as being from collect()
+    # and not handle_csv_read_error()
+    error = function(e, call = caller_env(n = 4)) {
+      handle_csv_read_error(e, x$.data$schema, call)
     }
   )
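
The three call sites above repeat the same tryCatch() pattern, and the review thread mentions rlang's experimental try_fetch() as something to revisit later. As a rough illustration only, and not something this PR does, a handler written with try_fetch() could chain the original condition through `parent` instead of copying its message. `read_thing()` and the message text are hypothetical; try_fetch() is assumed to be available (rlang 1.0.0 or later).

```r
library(rlang)

# Hypothetical user-facing function; the failing stop() stands in for
# reader$Read() or do_exec_plan().
read_thing <- function() {
  try_fetch(
    stop("conversion error"),
    error = function(cnd) {
      # Chain the original condition rather than copying its message.
      abort("Failed to read.", parent = cnd)
    }
  )
}

# Catch the chained error and look at the original condition it wraps.
err <- tryCatch(read_thing(), error = function(e) e)
conditionMessage(err$parent)
```

Since try_fetch() is marked experimental in the rlang documentation linked in the thread, this is best read as a direction to evaluate rather than a drop-in replacement.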

17 changes: 8 additions & 9 deletions r/R/util.R
@@ -125,17 +125,17 @@ read_compressed_error <- function(e) {
   stop(e)
 }

-handle_parquet_io_error <- function(e, format) {
+handle_parquet_io_error <- function(e, format, call) {
   msg <- conditionMessage(e)
   if (grepl("Parquet magic bytes not found in footer", msg) && length(format) > 1 && is_character(format)) {
     # If length(format) > 1, that means it is (almost certainly) the default/not specified value
     # so let the user know that they should specify the actual (not parquet) format
-    abort(c(
+    msg <- c(
       msg,
       i = "Did you mean to specify a 'format' other than the default (parquet)?"
-    ))
+    )
   }
-  stop(e)
+  abort(msg, call = call)
 }

 is_writable_table <- function(x) {
@@ -198,19 +198,18 @@ repeat_value_as_array <- function(object, n) {
   return(Scalar$create(object)$as_array(n))
 }

-handle_csv_read_error <- function(e, schema) {
+handle_csv_read_error <- function(e, schema, call) {
   msg <- conditionMessage(e)

   if (grepl("conversion error", msg) && inherits(schema, "Schema")) {
-    abort(c(
+    msg <- c(
       msg,
       i = paste(
         "If you have supplied a schema and your data contains a header",
         "row, you should supply the argument `skip = 1` to prevent the",
         "header being read in as data."
       )
-    ))
+    )
   }

-  abort(msg)
+  abort(msg, call = call)
 }
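
To see what the refactored helper does without an Arrow build, it can be exercised on its own. The sketch below copies handle_csv_read_error() as it stands after this change; `fake_schema`, `fake_error` and the error text are hypothetical stand-ins made up for illustration (a bare object with class "Schema" is enough to satisfy the inherits() check), and `call = NULL` is passed so that no call is attached.

```r
library(rlang)

# Copy of the helper as it reads after this change.
handle_csv_read_error <- function(e, schema, call) {
  msg <- conditionMessage(e)

  if (grepl("conversion error", msg) && inherits(schema, "Schema")) {
    msg <- c(
      msg,
      i = paste(
        "If you have supplied a schema and your data contains a header",
        "row, you should supply the argument `skip = 1` to prevent the",
        "header being read in as data."
      )
    )
  }

  abort(msg, call = call)
}

# Hypothetical stand-ins, not real Arrow objects or real Arrow error text.
fake_schema <- structure(list(), class = "Schema")
fake_error <- simpleError("CSV conversion error to int32: invalid value 'x'")

# With a schema: the hint is appended as an "i" bullet.
with_hint <- tryCatch(
  handle_csv_read_error(fake_error, fake_schema, call = NULL),
  error = function(e) e
)

# Without a schema: the original message is re-raised unchanged.
without_hint <- tryCatch(
  handle_csv_read_error(fake_error, NULL, call = NULL),
  error = function(e) e
)

cat(conditionMessage(with_hint), "\n")
```

Building `msg` first and calling abort() once at the end means both branches go through the same `call = call` path, which is what lets the caller decide where the error appears to come from.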