ARROW-16154: [R] Errors which pass through handle_csv_read_error() and handle_parquet_io_error() need better error tracing #12839

Closed
wants to merge 3 commits into from

thisisnic
Member

As discussed on #12826

I'm not sure how (or whether) to write tests for this, but I ran it locally against the CSV directory set up in test-dataset-csv.R, both with and without this change. Without it, we get, e.g.:

open_dataset(csv_dir)
# Error in `handle_parquet_io_error()` at r/R/dataset.R:221:6:
# ! Invalid: Error creating dataset. Could not read schema from '/tmp/RtmpuTyOD8/file5049dcf581a5/5/file1.csv': Could not open Parquet input source '/tmp/RtmpuTyOD8/file5049dcf581a5/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
# /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:323  GetReader(source, scan_options). Is this a 'parquet' file?
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
# ℹ Did you mean to specify a 'format' other than the default (parquet)?

and then with it:

open_dataset(csv_dir)
# Error in `open_dataset()`:
# ! Invalid: Error creating dataset. Could not read schema from '/tmp/RtmpLbqZs6/file4e4ca14fb5795/5/file1.csv': Could not open Parquet input source '/tmp/RtmpLbqZs6/file4e4ca14fb5795/5/file1.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
# /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:323  GetReader(source, scan_options). Is this a 'parquet' file?
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:40  InspectSchemas(std::move(options))
# /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
# ℹ Did you mean to specify a 'format' other than the default (parquet)?

github-actions bot commented Apr 8, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Member

@nealrichardson nealrichardson left a comment

Awesome, thanks for doing this!

@@ -200,8 +200,8 @@ read_delim_arrow <- function(file,

   tryCatch(
     tab <- reader$Read(),
-    error = function(e) {
-      handle_csv_read_error(e, schema)
+    error = function(e, call = caller_env(n = 4)) {
Member

Is it always n = 4? Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)

Member Author

@thisisnic thisisnic Apr 11, 2022

> Is it always n = 4?

It's always n = 4 here, though I deliberately chose to pass the call parameter into handle_csv_read_error() so the function could be used elsewhere in the code where we may want to pass in a different environment.

> Is there a more certain way to capture this? (Like, if you define call_env outside of tryCatch, is it just this env?)

I could call rlang::current_env() above the tryCatch block - I went for calling caller_env() here as it felt "cleaner" to keep that code within this block here.

I suppose that if the tryCatch block was changed to have more functions wrapped round it, then the number would be wrong; however, if we call current_env() outside of the block, we're unnecessarily calling it every time we call the function, even if there's no error.

Not sure what's better - what do you think?

Member

Feels brittle but it's probably fine. I'd just leave in some comments explaining why n = 4, that you could have used caller_env() but this way is lazy/only does it if there's an error (aside: it's just calling parent.frame(), which on my machine takes in the hundreds of nanoseconds to run, so the cost of calling it every time is not something I'm concerned about).

We can revisit later if/when we want to chain together multiple error handlers. Also looks like rlang is growing some experimental tooling around here (https://rlang.r-lib.org/reference/try_fetch.html) so maybe that will mature and be ready whenever we revisit this.

In sum, seems like you've thought this through, so just leave a note explaining why this non-obvious thing is there and 👍 !
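
For readers following the thread, here is a minimal sketch of the two approaches being compared. It is not the code from this PR. It uses base R stand-ins (`parent.frame(n)` in place of `rlang::caller_env(n)`, and `environment()` in place of `rlang::current_env()`), and `failing_read()`, `read_lazy()`, `read_eager()` and `marker` are hypothetical names made up for illustration. The frame count of 4 assumes a single `error` handler and the current internal structure of `tryCatch()`.

```r
# Hypothetical stand-in for reader$Read(): always signals an error.
failing_read <- function() stop("simulated read failure")

# Lazy approach: the frame is only looked up if the handler actually runs.
# The handler is called from tryCatchOne(), which is called from
# tryCatchList(), which is called from tryCatch(), which is called from
# read_lazy(). That makes read_lazy()'s frame the fourth caller frame.
read_lazy <- function() {
  marker <- "read_lazy frame"
  tryCatch(
    failing_read(),
    error = function(e, call = parent.frame(4)) call
  )
}

# Eager approach: capture the environment before tryCatch(), on every call,
# whether or not an error occurs. No frame counting is needed.
read_eager <- function() {
  marker <- "read_eager frame"
  call <- environment()
  tryCatch(
    failing_read(),
    error = function(e) call
  )
}

# Both handlers return the environment of the function that called tryCatch(),
# so the local variable `marker` can be found in it without inheritance.
env_lazy <- read_lazy()
env_eager <- read_eager()
print(get("marker", envir = env_lazy, inherits = FALSE))
print(get("marker", envir = env_eager, inherits = FALSE))
```

The lazy version defers the lookup to the error path at the cost of a hard-coded frame count. The eager version pays a small cost on every call but does not depend on how many frames `tryCatch()` puts between the handler and the calling function.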

Member

@nealrichardson nealrichardson left a comment

One request to add a version of what you responded as a code comment, but otherwise LGTM, nice work!

@thisisnic thisisnic closed this in 5d5cceb Apr 13, 2022

ursabot commented Apr 14, 2022

Benchmark runs are scheduled for baseline = 681ede6 and contender = 5d5cceb. 5d5cceb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.67% ⬆️0.08%] test-mac-arm
[Failed ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.98% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/504| 5d5ccebe ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/490| 5d5ccebe test-mac-arm>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/488| 5d5ccebe ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/500| 5d5ccebe ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/503| 681ede6f ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/489| 681ede6f test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/490| 681ede6f ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/499| 681ede6f ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

3 participants