[Python] ORC Reader aborts when timezone file is missing #40633

Closed
WillAyd opened this issue Mar 18, 2024 · 12 comments

Comments

@WillAyd
Contributor

WillAyd commented Mar 18, 2024

Describe the bug, including details regarding any error messages, version, and platform.

This is an upstream report of pandas-dev/pandas#56292

I noticed when running the pandas test suite I was getting this error:

pandas/tests/io/test_orc.py::test_orc_reader_basic terminate called after throwing an instance of 'orc::TimezoneError'
  what():  Can't open /usr/share/zoneinfo/US/Pacific
Fatal Python error: Aborted

Current thread 0x00007eff1a912780 (most recent call first):

The workaround is to create that timezone file:

$ sudo mkdir -p /usr/share/zoneinfo/US
$ sudo ln -s /usr/share/zoneinfo/America/Los_Angeles /usr/share/zoneinfo/US/Pacific

Although I think the error should be handled more gracefully than via an abort.
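
For anyone who wants to check a machine before applying the workaround, here is a minimal diagnostic sketch. The helper name `zone_file_exists` and its `root` parameter are made up for illustration; it only tests whether a zone file is present on disk and says nothing about how the ORC reader itself performs the lookup.

```python
from pathlib import Path


def zone_file_exists(zone: str, root: str = "/usr/share/zoneinfo") -> bool:
    """Return True if `root/zone` resolves to an existing regular file.

    Hypothetical helper: `is_file()` follows symlinks, so a dangling
    US/Pacific link is reported as missing, just like an absent file.
    """
    return (Path(root) / zone).is_file()


if __name__ == "__main__":
    # Zones mentioned in this thread; extend the list as needed.
    for name in ("US/Pacific", "America/Los_Angeles"):
        status = "ok" if zone_file_exists(name) else "MISSING"
        print(f"{name}: {status}")
```

If `US/Pacific` is reported as missing while `America/Los_Angeles` is present, the symlink workaround above (or reinstalling the package that owns the file) is likely what is needed.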

Component(s)

Python

@kou
Member

kou commented Mar 19, 2024

@wgtmac will improve this.
See also:

@wgtmac wgtmac self-assigned this Mar 19, 2024
@wgtmac
Member

wgtmac commented Mar 19, 2024

This seems to be related to the installed version of the tz database on the test machine. I checked my laptop and the path /usr/share/zoneinfo/US/Pacific exists. Could you verify the version by checking the /usr/share/doc/tzdata/version file? @WillAyd

@WillAyd
Contributor Author

WillAyd commented Mar 19, 2024

That file does not exist for me. This is running Pop!_OS 22.04.

@kou
Member

kou commented Mar 20, 2024

Could you try installing the tzdata-legacy package?

@WillAyd
Contributor Author

WillAyd commented Mar 20, 2024

I don't see that package for 22.04 - I think it first appeared in 23.04?

@kou
Member

kou commented Mar 21, 2024

Oh, sorry. Could you install tzdata?

@WillAyd
Contributor Author

WillAyd commented Mar 21, 2024

It is already installed - tzdata is already the newest version (2024a-0ubuntu0.22.04).

@kou
Member

kou commented Mar 21, 2024

Hmm. tzdata should install /usr/share/zoneinfo/US/Pacific: https://packages.ubuntu.com/jammy/all/tzdata/filelist

@WillAyd
Contributor Author

WillAyd commented Mar 21, 2024

Ah OK - interesting indeed. That must have been deleted off of my system somehow, but I do see that in a recovery OS.

Happy to close this issue if we want to chalk it up to an unsupported system configuration

dongjoon-hyun pushed a commit to apache/orc that referenced this issue Mar 22, 2024
### What changes were proposed in this pull request?

Enable TestTimezone.testMissingTZDB unit test to run on Windows.

### Why are the changes needed?

When /usr/share/zoneinfo is unavailable and TZDIR env is unset, creating C++ ORC reader will crash on Windows. We need to better deal with this case. See context from the Apache Arrow community: apache/arrow#36026 and apache/arrow#40633

### How was this patch tested?

Make sure the test passes on Windows.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #1856 from wgtmac/win_tz_test.

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
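
The commit message above mentions two inputs: the `TZDIR` environment variable and the default `/usr/share/zoneinfo` directory. A rough sketch of that kind of lookup is shown below. The function name `resolve_tzdir` and the exact precedence (a non-empty `TZDIR` wins, otherwise the default) are assumptions made for illustration, not a transcription of the ORC C++ code.

```python
import os

# Default location named in the commit message above.
DEFAULT_TZDIR = "/usr/share/zoneinfo"


def resolve_tzdir(environ=None) -> str:
    """Pick the directory to search for timezone files.

    Assumed precedence (illustrative only): a non-empty TZDIR value
    overrides the default directory.
    """
    env = os.environ if environ is None else environ
    return env.get("TZDIR") or DEFAULT_TZDIR


if __name__ == "__main__":
    tzdir = resolve_tzdir()
    print(f"timezone directory: {tzdir} (exists: {os.path.isdir(tzdir)})")
```

On a machine where neither location exists, the script prints `exists: False`, which is the situation the commit message describes.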
@aureliobarbosa

> Could you try installing the tzdata-legacy package?

I also observed pyarrow breaking while processing ORC files because of nonexistent IANA keys. I first saw this while running the pandas test suite locally, but just trying to read some pre-existing ORC files completely broke python and ipython. My setup includes Ubuntu Mantic, Python 3.11 and tzdata version 2024.1.

At least in my case, installing tzdata-legacy system-wide was enough to get rid of those errors.

@rhshadrach
Copy link

I've been debugging this issue and independently found the same solution - installing tzdata-legacy. Just stashing my error message here in case it is helpful for others.

pyarrow.lib.ArrowInvalid: Cannot locate timezone 'US/Eastern': US/Eastern not found in timezone database
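
A quick way to see which legacy aliases fail to resolve is to try them with Python's standard `zoneinfo` module. This is only an approximation: `zoneinfo` can fall back to the `tzdata` package from PyPI when it is installed, so a zone that resolves here may still be missing from the system database that Arrow or ORC consults. The helper name `unresolvable_zones` is made up for this sketch.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def unresolvable_zones(names):
    """Return the subset of `names` that zoneinfo cannot load.

    Hypothetical helper: this uses Python's own search path, which is
    not necessarily the one the ORC reader uses.
    """
    missing = []
    for name in names:
        try:
            ZoneInfo(name)
        except (ZoneInfoNotFoundError, ValueError):
            # ValueError covers malformed keys such as absolute paths.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Legacy aliases mentioned in this thread.
    print(unresolvable_zones(["US/Pacific", "US/Eastern"]))
```

An empty list means both aliases resolved in the Python environment. Any name that is printed is a candidate for the tzdata-legacy fix described above.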

kou pushed a commit that referenced this issue Dec 18, 2024
…zone (#45051)

### Rationale for this change

If the timezone database is present on the system, but does not contain a timezone referenced in an ORC file, the ORC reader will crash with an uncaught C++ exception.

This can happen for example on Ubuntu 24.04, where some timezone aliases have been moved from the main `tzdata` package to a `tzdata-legacy` package. If `tzdata-legacy` is not installed, trying to read an ORC file that references e.g. the "US/Pacific" timezone would crash.

Here is a backtrace excerpt:
```
#12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116
#17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
```

### What changes are included in this PR?

Catch C++ exceptions when iterating ORC batches instead of letting them slip through.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #40633

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 19.0.0 milestone Dec 18, 2024
@kou
Member

kou commented Dec 18, 2024

Issue resolved by pull request 45051
#45051

@kou kou closed this as completed Dec 18, 2024
5 participants