[Parquet] Too many open files (os error 24) #47

alamb · 2021-04-26T11:23:53Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6154

Used [rust]parquet-read binary to read a deeply nested parquet file and see the below stack trace. Unfortunately won't be able to upload file.
{code:java}
stack backtrace:

0: std::panicking::default_hook::{{closure}}

1: std::panicking::default_hook

2: std::panicking::rust_panic_with_hook

3: std::panicking::continue_panic_fmt

4: rust_begin_unwind

5: core::panicking::panic_fmt

6: core::result::unwrap_failed

7: parquet::util::io::FileSource::new

8: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_page_reader

9: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_reader

10: parquet::record::reader::TreeBuilder::reader_tree

11: parquet::record::reader::TreeBuilder::reader_tree

12: parquet::record::reader::TreeBuilder::reader_tree

13: parquet::record::reader::TreeBuilder::reader_tree

14: parquet::record::reader::TreeBuilder::reader_tree

15: parquet::record::reader::TreeBuilder::build

16: <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next

17: parquet_read::main

18: std::rt::lang_start::{{closure}}

19: std::panicking::try::do_call

20: __rust_maybe_catch_panic

21: std::rt::lang_start_internal

22: main{code}

alamb · 2021-04-26T11:23:55Z

Comment from Chao Sun(csun) @ 2019-08-07T06:02:08.709+0000:

Thanks for reporting. Do you have rough idea how deep the nested data type is? is there any error message? would be great if we can reproduce this.

Comment from Yesh(madras) @ 2019-08-07T11:35:10.840+0000:

Thanks for ack. Below is the error message.  Additional data point is that it is able to dump schema via parquet-schema . 
{code:java}
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("underlying IO error: Too many open files (os error 24)")', src/libcore/result.rs:1084:5{code}

Comment from Ahmed Riza([email protected]) @ 2021-02-12T22:52:01.045+0000:

I've come across the same error. In my case it appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82.]  I have a Parquet file with 3000 columns (see attached example), and the `try_clone` calls here eventually fail as it ends up creating too many open file descriptors{color:#000000}.{color}

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can be reproduced by using the attached Parquet file.

One could increase the `ulimit -n` on Linux to get around this, but not really a solution, since the code path ends up just creating potentially a very large number of open file descriptors (one for each column in the Parquet file).

This is the initial stack trace when the footer is first read.  `FileSource::new` (in io.rs) gets called for every column subsequently as well when reading the columns (see {color:#cc844f}fn {color}{color:#8ec1ff}reader_tree {color}in `parquet/record/reader.rs`)

 
{code:java}
#0  parquet::util::io::FileSource::new (fd=0x7ffff7c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82

#1  0x00005555558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x7ffff7c3fafc, start=807191, length=65536)

    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59

#2  0x000055555590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x7ffff7c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57

#3  0x0000555555845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...)

    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134

#4  0x0000555555845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81

#5  0x0000555555845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7ffff0000d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90

#6  0x0000555555845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-00001-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")

    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98

#7  0x000055555577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103


 {code}

capkurmagati · 2021-08-18T14:54:23Z

I also encountered the issue when I using InfluxDB IOx and MinIo.

Aug 18 23:40:50.688 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aug 18 23:40:50.690 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.690 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.690 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Parquet reader thread terminated due to error: IoError(Os { code: 24, kind: Other, message: "Too many open files" })
Aug 18 23:40:50.691 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.691 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.694 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.694 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.695  INFO influxdb_iox::influxdb_ioxd::rpc::flight: err=Query { database_name: "local_database", source: DataFusionExecution { source: ArrowError(ExternalError(CreatingParquetReader { source: IoError(Os { code: 24, kind: Other, message: "Too many open files" }) })) } } msg="Error handling Flight gRPC request"

Additional info:
The directory structure follows. Each directory has 1-3 parquet files.

s3cmd ls s3://sensors/1/local_database/data/cpu/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 12:00:00/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 13:00:00/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 14:00:00/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 15:00:00/
20+ more dirs...

alamb · 2021-08-18T15:15:05Z

@capkurmagati I wonder if you might be able to work around the issue by raising the maximum number of open files. Something like

ulimit -n 20000

capkurmagati · 2021-08-18T16:03:47Z

@alamb Yes, it works. Actually I tweaked the value a bit before I posted here trying to reproduce the error. However I observed that the engine does not always return an error for a certain table scan. (I used a limit clause to try to control the files to read) . So I thought it might be a debug. On second thought, I guess the engine may cache the data so that it doesn't scan the same amount of the files. So my problem is unrelated here. Thanks.

alamb · 2021-08-19T10:40:26Z

I think something else that might be related is the fact that (currently) DataFusion execution tries to start all partitions concurrently. This means that depending on how fast IO comes in and the details of the Tokio scheduler, sometimes it will have far too many open files at once (it might end up opening 100 input parquet files, for example, even if there are only 8 cores available for processing) -- @andygrove has mentioned the Ballista scheduler is more sophisticated in this area and hopefully we can move some of those improvements down into the core DataFusion engine

andygrove · 2021-08-19T13:26:09Z

That's right. Ballista avoids this issue by limiting the number of concurrent tasks. However, Ballista has its own related issues where it will generate an excessive number of shuffle files and potentially run into inode limits, so neither solution is as scalable as we would like yet.

jorgecarleitao · 2021-08-22T06:36:59Z

I have not read the code base thoroughly, but I remember something like when I skimmed through it:

this may be related to the fact that AFAIK we currently clone the file reader for every new seek. Thus, even for a single file, we usually open that file multiple times, once per seek (which is roughly one per row group and per column chunk plus one, to read the metadata).

Dandandan · 2021-08-22T14:03:24Z

For this as well as #924: a good start might be to start limiting the number of maximum threads that are used for spawn_blocking tasks, by default there are max 512 concurrent threads for those:

See:

https://docs.rs/tokio/1.10.0/tokio/index.html#cpu-bound-tasks-and-blocking-code

XeCycle · 2021-09-06T09:37:08Z

I'm getting the same (I believe) error on files with many columns (>2k), and FWIW, can be work'd-around by a struct ArcFile(Arc<File>). Just impl parquet::file::reader::ChunkReader for ArcFile, using FileExt::read_at on cfg(unix) and FileExt::seek_read on cfg(windows).

I guess we can modify parquet::util::io::FileSource to use Arc<File>, because we take ownership of provided file objects anyway.

jinyius · 2022-09-08T04:35:58Z

any update here as it's been a year? i can provide some test parquet files that triggers this issue if that helps.

alamb · 2022-09-10T11:22:14Z

Hi @jinyius

There has been some non trivial work by @tustvold to support reading parquet files without having to clone filehandles -- e.g. https://docs.rs/parquet/22.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html now takes a ChunkReader which is implemented on Bytes.

https://docs.rs/parquet/22.0.0/parquet/file/reader/trait.ChunkReader.html

Thus, in order to read such a file, you can buffer it into Bytes https://docs.rs/parquet/22.0.0/parquet/file/reader/trait.ChunkReader.html#impl-ChunkReader-for-Bytes

Perhaps with something like this (untested):

let mut v = vec![];
let parquet_file: File = open_your_parquet_file();
// read parquet into memory (TODO error checking)
parquet_file.read_to_end(&mut v).unwrap();

// convert to Bytes so we can read the file 
let b: Bytes = v.into();
let reader = SerializedFileReader::new(b).unwrap();

any update here as it's been a year? i can provide some test parquet files that triggers this issue if that helps.

If you could provide an example file and the code you are using that shows the error, I would be happy to help try and apply the method above. If it works for you, I think we should update the documentation to explain this

alamb added the arrow Changes to the arrow crate label Apr 26, 2021

alamb added parquet Changes to the parquet crate and removed arrow Changes to the arrow crate labels Apr 26, 2021

alamb mentioned this issue Aug 22, 2021

Add a separate configuration setting for parallelism of scanning parquet files apache/datafusion#924

Closed

tustvold closed this as not planned Won't fix, can't repro, duplicate, stale May 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Parquet] Too many open files (os error 24) #47

[Parquet] Too many open files (os error 24) #47

alamb commented Apr 26, 2021

alamb commented Apr 26, 2021

capkurmagati commented Aug 18, 2021

alamb commented Aug 18, 2021

capkurmagati commented Aug 18, 2021

alamb commented Aug 19, 2021 •

edited

Loading

andygrove commented Aug 19, 2021

jorgecarleitao commented Aug 22, 2021

Dandandan commented Aug 22, 2021

XeCycle commented Sep 6, 2021

jinyius commented Sep 8, 2022

alamb commented Sep 10, 2022

[Parquet] Too many open files (os error 24) #47

[Parquet] Too many open files (os error 24) #47

Comments

alamb commented Apr 26, 2021

alamb commented Apr 26, 2021

capkurmagati commented Aug 18, 2021

alamb commented Aug 18, 2021

capkurmagati commented Aug 18, 2021

alamb commented Aug 19, 2021 • edited Loading

andygrove commented Aug 19, 2021

jorgecarleitao commented Aug 22, 2021

Dandandan commented Aug 22, 2021

XeCycle commented Sep 6, 2021

jinyius commented Sep 8, 2022

alamb commented Sep 10, 2022

alamb commented Aug 19, 2021 •

edited

Loading