Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36026: [C++][ORC] Catch all ORC exceptions to avoid crash #40697

Merged
merged 6 commits into from
Mar 26, 2024

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Mar 21, 2024

Rationale for this change

When /usr/share/zoneinfo is unavailable and TZDIR env is unset, creating C++ ORC reader will crash on Windows. We need to eagerly check this and prevent followup crash.

What changes are included in this PR?

Eagerly check TZDB availability before creating ORC reader/writer.

Are these changes tested?

Yes, added a test case to make sure the check work as expected.

Are there any user-facing changes?

Users on Windows (or other cases when TZDB is not availble) will clearly see this error message instead of crash.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 21, 2024

I have filed https://issues.apache.org/jira/browse/ORC-1661 to follow up this issue. For now I'd propose to eagerly check TZDB availability. Please let me know what do you think. @jorisvandenbossche @pitrou @kou

@pitrou
Copy link
Member

pitrou commented Mar 21, 2024

I think it would be more useful to find why the exception isn't properly caught.

@pitrou
Copy link
Member

pitrou commented Mar 21, 2024

The answer might be that ORC_END_CATCH_NOT_OK does not catch all exceptions.

@wgtmac
Copy link
Member Author

wgtmac commented Mar 21, 2024

I think it would be more useful to find why the exception isn't properly caught.

Agreed. To my surprise that Windows behaves differently when opening a non-existent file: apache/orc#1856. I will debug it on my Windows PC after work.

@jorisvandenbossche
Copy link
Member

If my reading of the code in that PR is correct, it is testing that it raises a TimezoneError on the Orc side, but ORC_END_CATCH_NOT_OK only catches those three:

#define ORC_END_CATCH_NOT_OK \
} \
catch (const liborc::ParseError& e) { \
return Status::IOError(e.what()); \
} \
catch (const liborc::InvalidArgument& e) { \
return Status::Invalid(e.what()); \
} \
catch (const liborc::NotImplementedYet& e) { \
return Status::NotImplemented(e.what()); \
}

@wgtmac
Copy link
Member Author

wgtmac commented Mar 21, 2024

TimezoneError is added in ORC 2.0.0, but we are still on 1.9.2. @jorisvandenbossche

@jorisvandenbossche
Copy link
Member

I was looking at ORC 1.9.3 as well, which already has TimezoneError? https://github.com/apache/orc/blob/a765c316adfb80b47fb8c38831f8c0e0d7294cdb/c%2B%2B/src/Timezone.cc#L684-L685

@wgtmac
Copy link
Member Author

wgtmac commented Mar 21, 2024

@jorisvandenbossche Sorry for the confusion. I thought you were talking about this change: apache/orc#1587 which is ORC 2.0.0 only.

I tried to use ORC 1.9.2 (w/o the aforementioned path existence check) to open a reader on Windows w/o TZDIR set. It actually throws a ParseError exception here: https://github.com/apache/orc/blob/v1.9.2/c%2B%2B/src/OrcFile.cc#L62. This also aligns with the bug description here: apache/orc#1577 (comment)

@jorisvandenbossche
Copy link
Member

But that ParseError is still caught and re-raised as a TimezoneError in the snippet I linked? (https://github.com/apache/orc/blob/a765c316adfb80b47fb8c38831f8c0e0d7294cdb/c%2B%2B/src/Timezone.cc#L684-L685)

(and sorry for causing the confusion! because I was initially also talking about your PR for 2.0, but then afterwards also went checking the code of 1.9)

@wgtmac
Copy link
Member Author

wgtmac commented Mar 21, 2024

But that ParseError is still caught and re-raised as a TimezoneError in the snippet I linked? (https://github.com/apache/orc/blob/a765c316adfb80b47fb8c38831f8c0e0d7294cdb/c%2B%2B/src/Timezone.cc#L684-L685)

Yes, this is what I don't understand either. I am trying to reproduce this on the arrow side.

@pitrou
Copy link
Member

pitrou commented Mar 21, 2024

Instead of spending time reproducing it, perhaps ensure that all C++ exceptions are caught? This would be more robust anyway, since Orc C++ uses exceptions.

@jorisvandenbossche
Copy link
Member

Yes, this is what I don't understand either.

Well, it would perfectly explain why it is crashing. And the bug report you linked to (apache/orc#1577 (comment)) actually points to the line (704) that raises the TimezoneError. So that seems to all fit correctly together?

@wgtmac
Copy link
Member Author

wgtmac commented Mar 21, 2024

@jorisvandenbossche You're right. This issue does not relate to Windows and I can also reproduce this on my Mac with a wrong TZDIR env set. The ORC C++ library throws TimezoneError when it catches ParseError as you point out at line 704. However, TimezoneError is not in a publicly installed header: https://github.com/apache/orc/blob/main/c%2B%2B/src/Timezone.hh#L113. As @pitrou says, perhaps we should just catch everything to avoid crash.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 22, 2024
Comment on lines +81 to +82
catch (...) { \
return Status::UnknownError("ORC error"); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? (Is catch (const std::exception& e) enough?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually it is enough. But just in case?

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting committer review Awaiting committer review awaiting review Awaiting review awaiting changes Awaiting changes labels Mar 22, 2024
@@ -183,6 +183,22 @@ liborc::RowReaderOptions default_row_reader_options() {
return options;
}

// Proactively check the availability of timezone database.
// Remove it once https://issues.apache.org/jira/browse/ORC-1661 has been fixed.
Status check_timezone_database_availability() {
Copy link
Member

@pitrou pitrou Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this useful if we're correctly catching exceptions? I'm afraid this could get out of sync with ORC's own timezone loading code.

I'd rather not do this, and let ORC fix the issue by making the timezone file optional (what is it used for exactly?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac what would be the error message that you get when not doing the above now that you are catching the exceptions?

Because one reason to keep this, is to provide an informative error message (most Windows users will see this the first time, so I think it is important to give some guidance in the error message, and not just a "file not found").
(to avoid getting out of sync with ORC, we could add an ORC version check and only do this for ORC<2?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After catching all exceptions, the error message here without the check_timezone_database_availability() is Invalid: Can't open /usr/share/zoneinfo/XXX.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(to avoid getting out of sync with ORC, we could add an ORC version check and only do this for ORC<2?)

That sounds reasonable to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not do this, and let ORC fix the issue by making the timezone file optional (what is it used for exactly?).

ORC has two timestamp types: timestamp (namely timestamp_without_timezone) and timestamp_instant (namely timestamp_with_local_timezone).

  • timestamp: writer keeps the writer timezone in the stripe footer, and reader uses reader timezone to recover the same wall clock time by converting from writer timezone to reader timezone. that's why the reader try to call getLocalTimezone() on startup.
  • timestamp_with_local_timezone: use UTC everywhere so it does not have any conversion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • timestamp: writer keeps the writer timezone in the stripe footer, and reader uses reader timezone to recover the same wall clock time by converting from writer timezone to reader timezone. that's why the reader try to call getLocalTimezone() on startup.

If we convert, we should convert to UTC instead of converting to the local timezone. That's how Arrow timestamps work (but we still need a timezone DB for the conversion anyway):

arrow/format/Schema.fbs

Lines 283 to 288 in 5181791

/// Timestamps with a non-empty timezone
/// ------------------------------------
///
/// If a Timestamp column has a non-empty timezone value, its epoch is
/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone
/// (the Unix epoch), regardless of the Timestamp's own timezone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we convert, we should convert to UTC instead of converting to the local timezone.

I know that's a better idea. However, this was the design choice of ORC timestamp type which originates from Hive. So it would be better to use timestamp_instant type in favor of the old timestamp type in ORC at any time.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 22, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 25, 2024
@wgtmac wgtmac changed the title GH-36026: [C++][ORC] Check TZDB availability for ORC GH-36026: [C++][ORC] Catch all ORC exceptions to avoid crash Mar 25, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 25, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 25, 2024
std::getline(ss, str, '.');
versions.emplace_back(std::stoi(str));
}
if (versions.size() != 3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if a future ORC versions advertises e.g. "2.1.0.1", this will fail? Sounds inflexible while we're only concerned about the major number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense! It may also be 2.1.0-SNAPSHOT or whatever custom patched version. Let me change it to only care about major and minor version instead.

cpp/src/arrow/adapters/orc/adapter.cc Outdated Show resolved Hide resolved
cpp/src/arrow/adapters/orc/adapter.h Outdated Show resolved Hide resolved
const char* tzdir_env_key = "TZDIR";
const char* expect_str = "IANA time zone database is unavailable but required by ORC";
auto tzdir_env_backup = std::getenv(tzdir_env_key);
ARROW_EXPECT_OK(arrow::internal::SetEnvVar(tzdir_env_key, "/a/b/c/d/e"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use EnvVarGuard so that the value doesn't leak if this test fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice that. Thanks!

cpp/src/arrow/adapters/orc/adapter.cc Outdated Show resolved Hide resolved
@@ -173,7 +173,7 @@ class OrcStripeReader : public RecordBatchReader {
int64_t batch_size_;
};

liborc::RowReaderOptions default_row_reader_options() {
liborc::RowReaderOptions DefaultRowReaderOptions() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was added by me in the past and now I have changed its name to follow the same style.

@wgtmac wgtmac requested a review from pitrou March 26, 2024 13:46
@@ -19,6 +19,7 @@

#include <cstdint>
#include <memory>
#include <optional>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't required anymore, is it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Removed it.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, just a nit

@pitrou
Copy link
Member

pitrou commented Mar 26, 2024

@github-actions crossbow submit -g python

Copy link

Revision: 182e1d3

Submitted crossbow builds: ursacomputing/crossbow @ actions-4f707bcb76

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-python GitHub Actions
test-debian-12-python-3-amd64 Azure
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 GitHub Actions

@pitrou pitrou merged commit dbff1f4 into apache:main Mar 26, 2024
36 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Mar 26, 2024
@pitrou
Copy link
Member

pitrou commented Mar 26, 2024

Thanks a lot @wgtmac !

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit dbff1f4.

There were 2 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants