Read only enough bytes to infer Arrow IPC file schema via stream #7962

Merged
merged 3 commits into from
Nov 2, 2023

Conversation

Jefffrey
Contributor

Which issue does this PR close?

Closes #6368

Rationale for this change

See issue

What changes are included in this PR?

Change schema inference for Arrow IPC files read from a stream so that the schema is taken from the first Schema message in the file rather than from the file footer. This means we only need to read as many bytes as are required to infer the schema. The old behaviour was to read the entire stream before inferring the schema.
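As a rough illustration of why a small fixed-size read is enough to locate the Schema message, here is a std-only sketch of parsing the start of an IPC file. It is not the code from this PR: `parse_ipc_file_prefix` is a hypothetical helper, and the layout it assumes (6-byte `ARROW1` magic, 2 padding bytes, an optional 4-byte `0xFFFFFFFF` continuation marker, then a 4-byte little-endian metadata length) is taken from the Arrow IPC format description and the `6 + 2 + 4 + 4 = 16` comment in the diff.

```rust
/// Magic bytes that open an Arrow IPC file.
const ARROW_MAGIC: [u8; 6] = *b"ARROW1";
/// Continuation marker that precedes the message length in newer files
/// (assumed here to be absent in older files).
const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Hypothetical helper: given at least the first 16 bytes of an IPC file,
/// return (metadata length of the first message, offset where it starts).
fn parse_ipc_file_prefix(bytes: &[u8]) -> Result<(usize, usize), String> {
    // 6 (magic) + 2 (padding) + 4 (continuation) + 4 (length) = 16 bytes.
    if bytes.len() < 16 {
        return Err(format!("need at least 16 bytes, got {}", bytes.len()));
    }
    if bytes[0..6] != ARROW_MAGIC {
        return Err("missing ARROW1 magic at start of file".to_string());
    }
    // Without a continuation marker the length sits directly after the padding.
    let (len_bytes, start) = if bytes[8..12] == CONTINUATION_MARKER {
        (&bytes[12..16], 16)
    } else {
        (&bytes[8..12], 12)
    };
    let meta_len =
        i32::from_le_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]);
    if meta_len < 0 {
        return Err(format!("negative metadata length {}", meta_len));
    }
    Ok((meta_len as usize, start))
}
```

Once the length is known, the reader only needs `start + meta_len` bytes in total to decode the first message, regardless of how large the rest of the file is.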

Are these changes tested?

Added new unit tests.

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Oct 28, 2023
// So in first read we need at least all known sized sections,
// which is 6 + 2 + 4 + 4 = 16 bytes.
let bytes = collect_at_least_n_bytes(&mut stream, 16, None).await?;
if bytes.len() < 16 {
Member

Shouldn't this error check happen within collect_at_least_n_bytes? I would expect that function to Err if it cannot read at least n bytes.

Contributor Author

Good point, fixed so that collect_at_least_n_bytes() now does the error checking.
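A minimal sketch of the behaviour agreed on in this thread, where the helper itself returns an error when the source ends early. This is not the PR's implementation: it swaps the async byte stream for a plain synchronous iterator of chunks so that it runs with the standard library alone, and the error type is a simple `String` chosen for illustration.

```rust
/// Pull chunks until `n` additional bytes have been buffered, or fail if the
/// source ends first. `chunks` is a synchronous stand-in for the async byte
/// stream; `extend_from` is an optional buffer to keep appending to.
fn collect_at_least_n_bytes<I>(
    chunks: &mut I,
    n: usize,
    extend_from: Option<Vec<u8>>,
) -> Result<Vec<u8>, String>
where
    I: Iterator<Item = Vec<u8>>,
{
    let mut buf = extend_from.unwrap_or_default();
    // When extending an existing buffer, require n bytes on top of it.
    let target = n + buf.len();
    // Chunk sizes are outside our control, so append until the target is met.
    while buf.len() < target {
        match chunks.next() {
            Some(chunk) => buf.extend_from_slice(&chunk),
            None => {
                return Err(format!(
                    "unexpected end of byte stream: wanted {} bytes, got {}",
                    target,
                    buf.len()
                ))
            }
        }
    }
    Ok(buf)
}
```

With the check inside the helper, callers can index into the returned buffer up to the requested length without repeating a length test at every call site.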

block_data.extend_from_slice(&bytes[rest_of_bytes_start_index..]);
let size_to_read = meta_len as usize - block_data.len();
let block_data =
collect_at_least_n_bytes(&mut stream, size_to_read, Some(block_data)).await?;
Member

Related to my previous comment, there is currently no check here that we actually did read at least n bytes.

Contributor Author

collect_at_least_n_bytes() should now do the length checking

@tustvold
Contributor

tustvold commented Nov 1, 2023

FWIW for consistency we might want to do something closer to what we do for parquet where:

  • We have an estimate of the size of the footer which we fetch
  • We read the actual footer size from the fetched data
  • We then fetch any extra data needed
  • Once decoded the footer provides information on the schema and where the data blocks are located

This PR instead appears to read the first RecordBatch. Whilst I think this should work (provided the file contains data), the more standard approach might be to read the footer.

Edit: I also filed apache/arrow-rs#5021 which outlines some APIs we could add upstream that might help here
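The footer-first steps listed above can be sketched for an Arrow IPC file as follows. This is only an illustration under stated assumptions: `footer_range` and `missing_range` are hypothetical helpers, not arrow-rs or DataFusion APIs, and the trailer layout assumed here (a 4-byte little-endian footer length followed by the 6-byte `ARROW1` magic at the very end of the file) comes from the Arrow IPC file format description.

```rust
use std::ops::Range;

const ARROW_MAGIC: [u8; 6] = *b"ARROW1";
/// Trailer = 4-byte little-endian footer length + 6-byte magic.
const TRAILER_LEN: u64 = 10;

/// Hypothetical helper: byte range of the footer within the file, given the
/// file size and a fetched suffix (the last `suffix.len()` bytes of the file).
fn footer_range(file_size: u64, suffix: &[u8]) -> Result<Range<u64>, String> {
    if (suffix.len() as u64) < TRAILER_LEN || (suffix.len() as u64) > file_size {
        return Err("suffix is shorter than the trailer or longer than the file".to_string());
    }
    let trailer = &suffix[suffix.len() - TRAILER_LEN as usize..];
    if trailer[4..] != ARROW_MAGIC {
        return Err("missing ARROW1 magic at end of file".to_string());
    }
    let footer_len = i32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if footer_len < 0 {
        return Err(format!("negative footer length {}", footer_len));
    }
    let end = file_size - TRAILER_LEN;
    let start = end
        .checked_sub(footer_len as u64)
        .ok_or_else(|| "footer length exceeds file size".to_string())?;
    Ok(start..end)
}

/// Extra bytes to fetch in front of the suffix when the size estimate was too
/// small to cover the whole footer; `None` if the suffix already covers it.
fn missing_range(file_size: u64, suffix_len: u64, footer: &Range<u64>) -> Option<Range<u64>> {
    let suffix_start = file_size.saturating_sub(suffix_len);
    if footer.start < suffix_start {
        Some(footer.start..suffix_start)
    } else {
        None
    }
}
```

For example, with a 1000-byte file, a 300-byte footer and a 64-byte estimated suffix, the footer occupies bytes 690..990 and one more ranged read of 690..936 completes it. This relies on the source supporting ranged reads, which a forward-only stream does not.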

Contributor

@alamb alamb left a comment

Thank you @Jefffrey -- this looks good and well tested to me. Thank you for the initial review @andygrove

cc @jonmmease

fn read_arrow_schema_from_reader<R: Read + Seek>(reader: R) -> Result<SchemaRef> {
let reader = FileReader::try_new(reader, None)?;
Ok(reader.schema())
const ARROW_MAGIC: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
Contributor

What do you think about moving this logic upstream into the arrow-rs reader?

https://github.com/apache/arrow-rs/blob/78735002d99eb0212166924948f95554c4ac2866/arrow-ipc/src/reader.rs#L560

If you agree, I can file an upstream ticket to do so.

Contributor

I think I would rather take the approach described in apache/arrow-rs#5021: reading the footer is more generally useful, as it provides information beyond just the schema.

Contributor Author

> What do you think about moving this logic upstream into the arrow-rs reader?

Yes, this logic was essentially lifted from the StreamReader and FileReader in arrow-ipc, adjusted to work with an async stream of bytes. We could move this logic to arrow-ipc, but we need to keep in mind that although we are getting a stream of bytes, it is a stream of bytes in the IPC file format, not the IPC streaming format. So an AsyncStreamReader might not exactly fit our use, whereas an AsyncFileReader could, but it might be limited when decoding the rest of the data if we don't read the footer.

> I think I would rather the approach described on apache/arrow-rs#5021, reading the footer is more generally useful, providing information beyond just the schema

It seems this ticket could be appropriate for that. Just to note that we can't exactly read the footer from a stream without reverting to the old method of reading the entire stream just to decode the schema.

Contributor

@tustvold tustvold Nov 1, 2023

> reverting to the old method of reading the entire stream just to decode the schema.

The idea would be to do something similar to what we do to read the parquet footer; I provided a few more details on the linked ticket. The trick is to perform ranged reads.

Contributor Author

Ah apologies, I took a look at the parquet code and I see what you mean now. That indeed would be a better approach, in line with existing behaviour for FileReader 👍

@Jefffrey
Contributor Author

Jefffrey commented Nov 1, 2023

> FWIW for consistency we might want to do something closer to what we do for parquet where:
>
> * We have an estimate of the size of the footer which we fetch
> * We read the actual footer size from the fetched data
> * We then fetch any extra data needed
> * Once decoded the footer provides information on the schema and where the data blocks are located
>
> This PR instead appears to read the first RecordBatch, whilst I think this should work (provided the file contains data), the more standard approach might be to read the footer.
>
> Edit: I also filed apache/arrow-rs#5021 which outlines some APIs we could add upstream that might help here

We could read the first chunk of the file's stream, similar to reading the last chunk of a parquet file, and hope it contains all the data needed to decode the schema. I chose the current approach because we don't control how many bytes each await yields from the stream, whereas when reading parquet we generally have more control over that, I believe.

This method shouldn't read any record batches. It simply reads the first flatbuffer message in the IPC file contents, which is expected to be a schema message: the specification states that the IPC streaming format has the schema message come first, and the IPC file format is simply an encapsulation of the IPC streaming format with some additional wrapping.

Contributor

@alamb alamb left a comment

Thanks again @Jefffrey and @tustvold

@alamb alamb merged commit 436a4fa into apache:main Nov 2, 2023
20 of 22 checks passed
@alamb
Contributor

alamb commented Nov 2, 2023

It appears I accidentally merged this PR without a passing CI run. I have filed a PR to fix: #8037

@Jefffrey Jefffrey deleted the arrow_infer_schema_stream branch November 3, 2023 08:01
@andygrove andygrove added the enhancement New feature or request label Nov 5, 2023
Dandandan added a commit to coralogix/arrow-datafusion that referenced this pull request Nov 9, 2023
* Cleanup logical optimizer rules.  (apache#7919)

* Initial commit

* Address todos

* Update comments

* Simplifications

* Minor simplifications

* Address reviews

* Add TableScan constructor

* Minor changes

* make try_new_with_schema method of Aggregate private

* Use projection try_new instead of try_new_schema

* Simplifications, add comment

* Review changes

* Improve comments

* Move get_wider_type to type_coercion module

* Clean up type coercion file

---------

Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Parallelize Serialization of Columns within Parquet RowGroups (apache#7655)

* merge main

* fixes and cmt

* review comments, tuning parameters, updating docs

* cargo fmt

* reduce default buffer size to 2 and update docs

* feat: Use bloom filter when reading parquet to skip row groups  (apache#7821)

* feat: implement read bloom filter support

* test: add unit test for read bloom filter

* Simplify bloom filter application

* test: add unit test for bloom filter with sql `in`

* fix: imrpove bloom filter match express

* fix: add more test for bloom filter

* ci: rollback dependences

* ci: merge main branch

* fix: unit tests for bloom filter

* ci: cargo clippy

* ci: cargo clippy

---------

Co-authored-by: Andrew Lamb <[email protected]>

* fix: don't push down volatile predicates in projection (apache#7909)

* fix: don't push down volatile predicates in projection

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Andrew Lamb <[email protected]>

* add suggestions

* fix

* fix doc

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Jonah Gao <[email protected]>

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Jonah Gao <[email protected]>

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Jonah Gao <[email protected]>

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Jonah Gao <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Jonah Gao <[email protected]>

* Add `parquet` feature flag, enabled by default, and make parquet conditional  (apache#7745)

* Make parquet an option by adding multiple cfg attributes without significant code changes.

* Extract parquet logic into submodule from execution::context

* Extract parquet logic into submodule from datafusion_core::dataframe

* Extract more logic into submodule from execution::context

* Move tests from execution::context

* Rename submodules

* [MINOR]: Simplify enforce_distribution, minor changes (apache#7924)

* Initial commit

* Simplifications

* Cleanup imports

* Review

---------

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Add simple window query to sqllogictest (apache#7928)

* ci: upgrade node to version 20 (apache#7918)

* Change input for `to_timestamp` function to be seconds rather than nanoseconds, add `to_timestamp_nanos` (apache#7844)

* Change input for `to_timestamp` function

* docs

* fix examples

* output `to_timestamp` signature as ns

* Minor: Document `parquet` crate feature (apache#7927)

* Minor: reduce some #cfg(feature = "parquet") (apache#7929)

* Minor: reduce use of cfg(parquet) in tests (apache#7930)

* Fix CI failures on `to_timestamp()` calls (apache#7941)

* Change input for `to_timestamp` function

* docs

* fix examples

* output `to_timestamp` signature as ns

* Fix CI `to_timestamp()` failed

* Update datafusion/expr/src/built_in_function.rs

Co-authored-by: Andrew Lamb <[email protected]>

* fix typo

* fix

---------

Co-authored-by: Andrew Lamb <[email protected]>

* minor: add a datatype casting for the updated value (apache#7922)

* minor: cast the updated value to the data type of target column

* Update datafusion/sqllogictest/test_files/update.slt

Co-authored-by: Alex Huang <[email protected]>

* Update datafusion/sqllogictest/test_files/update.slt

Co-authored-by: Alex Huang <[email protected]>

* Update datafusion/sqllogictest/test_files/update.slt

Co-authored-by: Alex Huang <[email protected]>

* fix tests

---------

Co-authored-by: Alex Huang <[email protected]>

* fix (apache#7946)

* Add simple exclude all columns test to sqllogictest (apache#7945)

* Add simple exclude all columns test to sqllogictest

* Add more exclude test cases

* Support Partitioning Data by Dictionary Encoded String Array Types (apache#7896)

* support dictionary encoded string columns for partition cols

* remove debug prints

* cargo fmt

* generic dictionary cast and dict encoded test

* updates from review

* force retry checks

* try checks again

* Minor: Remove array() in array_expression (apache#7961)

* remove array

Signed-off-by: jayzhan211 <[email protected]>

* cleanup others

Signed-off-by: jayzhan211 <[email protected]>

* clippy

Signed-off-by: jayzhan211 <[email protected]>

* cleanup cast

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* cleanup cast

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>

* Minor: simplify update code (apache#7943)

* Add some initial content about creating logical plans (apache#7952)

* Minor: Change from `&mut SessionContext` to `&SessionContext` in substrait (apache#7965)

* Lower &mut SessionContext in substrait

* rm mut ctx in tests

* Fix crate READMEs (apache#7964)

* Minor: Improve `HashJoinExec` documentation (apache#7953)

* Minor: Improve `HashJoinExec` documentation

* Apply suggestions from code review

Co-authored-by: Liang-Chi Hsieh <[email protected]>

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* chore: clean useless clone baesd on clippy (apache#7973)

* Add README.md to `core`, `execution` and `physical-plan` crates (apache#7970)

* Add README.md to `core`, `execution` and `physical-plan` crates

* prettier

* Update datafusion/physical-plan/README.md

* Update datafusion/wasmtest/README.md

---------

Co-authored-by: Daniël Heres <[email protected]>

* Move source repartitioning into `ExecutionPlan::repartition` (apache#7936)

* Move source repartitioning into ExecutionPlan::repartition

* cleanup

* update test

* update test

* refine docs

* fix merge

* minor: fix broken links in README.md (apache#7986)

* minor: fix broken links in README.md

* fix proto link

* Minor: Upate the `sqllogictest` crate README (apache#7971)

* Minor: Upate the sqllogictest crate README

* prettier

* Apply suggestions from code review

Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: jakevin <[email protected]>

---------

Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: jakevin <[email protected]>

* Improve MemoryCatalogProvider default impl block placement (apache#7975)

* Fix `ScalarValue` handling of NULL values for ListArray (apache#7969)

* Fix try_from_array data type for NULL value in ListArray

* Fix

* Explicitly assert the datatype

* For review

* Refactor of Ordering and Prunability Traversals and States (apache#7985)

* simplify ExprOrdering

* Comment improvements

* Move map/transform comment up

---------

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Keep output as scalar for scalar function if all inputs are scalar (apache#7967)

* Keep output as scalar for scalar function if all inputs are scalar

* Add end-to-end tests

* Fix crate READMEs for core, execution, physical-plan (apache#7990)

* Update sqlparser requirement from 0.38.0 to 0.39.0 (apache#7983)

* chore: Update sqlparser requirement from 0.38.0 to 0.39.0

* support FILTER Aggregates

* Fix panic in multiple distinct aggregates by fixing `ScalarValue::new_list` (apache#7989)

* Fix panic in multiple distinct aggregates by fixing ScalarValue::new_list

* Update datafusion/common/src/scalar.rs

Co-authored-by: Daniël Heres <[email protected]>

---------

Co-authored-by: Daniël Heres <[email protected]>

* MemoryReservation exposes MemoryConsumer (apache#8000)

... as a getter method.

* fix: generate logical plan for `UPDATE SET FROM` statement (apache#7984)

* Create temporary files for reading or writing (apache#8005)

* Create temporary files for reading or writing

* nit

* addr comment

---------

Co-authored-by: zhongjingxiong <[email protected]>

* doc: minor fix to SortExec::with_fetch comment (apache#8011)

* Fix: dataframe_subquery example Optimizer rule `common_sub_expression_eliminate` failed (apache#8016)

* Fix: Optimizer rule 'common_sub_expression_eliminate' failed

* nit

* nit

* nit

---------

Co-authored-by: zhongjingxiong <[email protected]>

* Percent Decode URL Paths (apache#8009) (apache#8012)

* Treat ListingTableUrl as URL-encoded (apache#8009)

* Update lockfile

* Review feedback

* Minor: Extract common deps into workspace (apache#7982)

* Improve datafusion-*

* More common crates

* Extract async-trait

* Extract more

* Fix cli

---------

Co-authored-by: Andrew Lamb <[email protected]>

* minor: change some plan_err to exec_err (apache#7996)

* minor: change some plan_err to exec_err

Signed-off-by: Ruihang Xia <[email protected]>

* change unreachable code to internal error

Signed-off-by: Ruihang Xia <[email protected]>

---------

Signed-off-by: Ruihang Xia <[email protected]>

* Minor: error on unsupported RESPECT NULLs syntax (apache#7998)

* Minor: error on unsupported RESPECT NULLs syntax

* fix clippy

* Update datafusion/sql/tests/sql_integration.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* GroupedHashAggregateStream breaks spill batch (apache#8004)

... into smaller chunks to decrease memory required for merging.

* Minor: Add implementation examples to ExecutionPlan::execute (apache#8013)

* Add implementation examples to ExecutionPlan::execute

* Review feedback

* address comment (apache#7993)

Signed-off-by: jayzhan211 <[email protected]>

* GroupedHashAggregateStream should register spillable consumer (apache#8002)

* fix: single_distinct_aggretation_to_group_by fail (apache#7997)

* fix: single_distinct_aggretation_to_group_by faile

* fix

* move test to groupby.slt

* Read only enough bytes to infer Arrow IPC file schema via stream (apache#7962)

* Read only enough bytes to infer Arrow IPC file schema via stream

* Error checking for collect bytes func

* Update datafusion/core/src/datasource/file_format/arrow.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Minor: remove a strange char (apache#8030)

* Minor: Improve documentation for Filter Pushdown (apache#8023)

* Minor: Improve documentation for Fulter Pushdown

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: jakevin <[email protected]>

* Apply suggestions from code review

* Update datafusion/optimizer/src/push_down_filter.rs

Co-authored-by: Alex Huang <[email protected]>

---------

Co-authored-by: jakevin <[email protected]>
Co-authored-by: Alex Huang <[email protected]>

* Minor: Improve `ExecutionPlan` documentation (apache#8019)

* Minor: Improve `ExecutionPlan` documentation

* Add link to Partitioning

* fix: clippy warnings from nightly rust 1.75 (apache#8025)

Signed-off-by: Ruihang Xia <[email protected]>

* Minor: Avoid recomputing compute_array_ndims in align_array_dimensions (apache#7963)

* Refactor align_array_dimensions

Signed-off-by: jayzhan211 <[email protected]>

* address comment

Signed-off-by: jayzhan211 <[email protected]>

* remove unwrap

Signed-off-by: jayzhan211 <[email protected]>

* address comment

Signed-off-by: jayzhan211 <[email protected]>

* fix rebase

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>

* Minor: fix doc check (apache#8037)

* Minor: remove uncessary #cfg test (apache#8036)

* Minor: remove uncessary #cfg test

* fmt

* Update datafusion/core/src/datasource/file_format/arrow.rs

Co-authored-by: Ruihang Xia <[email protected]>

---------

Co-authored-by: Daniël Heres <[email protected]>
Co-authored-by: Ruihang Xia <[email protected]>

* Minor: Improve documentation for  `PartitionStream` and `StreamingTableExec` (apache#8035)

* Minor: Improve documentation for  `PartitionStream` and `StreamingTableExec`

* fmt

* fmt

* Combine Equivalence and Ordering equivalence to simplify state (apache#8006)

* combine equivalence and ordering equivalence

* Remove EquivalenceProperties struct

* Minor changes

* all tests pass

* Refactor oeq

* Simplifications

* Resolve linter errors

* Minor changes

* Minor changes

* Add new tests

* Simplifications window mode selection

* Simplifications

* Use set_satisfy api

* Use utils for aggregate

* Minor changes

* Minor changes

* Minor changes

* All tests pass

* Simplifications

* Simplifications

* Minor changes

* Simplifications

* All tests pass, fix bug

* Remove unnecessary code

* Simplifications

* Minor changes

* Simplifications

* Move oeq join to methods

* Simplifications

* Remove redundant code

* Minor changes

* Minor changes

* Simplifications

* Simplifications

* Simplifications

* Move window to util from method, simplifications

* Simplifications

* Propagate meet in the union

* Simplifications

* Minor changes, rename

* Address berkay reviews

* Simplifications

* Add new buggy test

* Add data test for sort requirement

* Add experimental check

* Add random test

* Minor changes

* Random test gives error

* Fix missing test case

* Minor changes

* Minor changes

* Simplifications

* Minor changes

* Add new test case

* Minor changes

* Address reviews

* Minor changes

* Increase coverage of random tests

* Remove redundant code

* Simplifications

* Simplifications

* Refactor on tests

* Solving clippy errors

* prune_lex improvements

* Fix failing tests

* Update get_finer and get_meet

* Fix window lex ordering implementation

* Buggy state

* Do not use output ordering in the aggregate

* Add union test

* Update comment

* Fix bug, when batch_size is small

* Review Part 1

* Review Part 2

* Change union meet implementation

* Update comments

* Remove redundant check

* Simplify project out_expr function

* Remove Option<Vec<_>> API.

* Do not use project_out_expr

* Simplifications

* Review Part 3

* Review Part 4

* Review Part 5

* Review Part 6

* Review Part 7

* Review Part 8

* Update comments

* Add new unit tests, simplifications

* Resolve linter errors

* Simplify test codes

* Review Part 9

* Add unit tests for remove_redundant entries

* Simplifications

* Review Part 10

* Fix test

* Add new test case, fix implementation

* Review Part 11

* Review Part 12

* Update comments

* Review Part 13

* Review Part 14

* Review Part 15

* Review Part 16

* Review Part 17

* Review Part 18

* Review Part 19

* Review Part 20

* Review Part 21

* Review Part 22

* Review Part 23

* Review Part 24

* Do not construct idx and sort_expr unnecessarily, Update comments, Union meet single entry

* Review Part 25

* Review Part 26

* Name Changes, comment updates

* Review Part 27

* Add issue links

* Address reviews

* Fix failing test

* Update comments

* SortPreservingMerge, SortPreservingRepartition only preserves given expression ordering among input ordering equivalences

---------

Co-authored-by: metesynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Encapsulate `ProjectionMapping` as a struct (apache#8033)

* Minor: Fix bugs in docs for `to_timestamp`, `to_timestamp_seconds`, ... (apache#8040)

* Minor: Fix bugs in docs for `to_timestamp`, `to_timestamp_seconds`, etc

* prettier

* Update docs/source/user-guide/sql/scalar_functions.md

Co-authored-by: comphead <[email protected]>

* Update docs/source/user-guide/sql/scalar_functions.md

Co-authored-by: comphead <[email protected]>

---------

Co-authored-by: comphead <[email protected]>

* Improve comments for `PartitionSearchMode` struct (apache#8047)

* Improve comments

* Make comments partition/group agnostic

* General approach for Array replace (apache#8050)

* checkpoint

Signed-off-by: jayzhan211 <[email protected]>

* optimize non-list

Signed-off-by: jayzhan211 <[email protected]>

* replace list ver

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* rename

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>

* Minor: Remove the irrelevant note from the Expression API doc (apache#8053)

* Minor: Add more documentation about Partitioning (apache#8022)

* Minor: Add more documentation about Partitioning

* fix typo

* Apply suggestions from code review

Co-authored-by: comphead <[email protected]>

* Add more diagrams, improve text

* undo unintended changes

* undo unintended changes

* fix links

* Try and clarify

---------

Co-authored-by: comphead <[email protected]>

* Minor: improve documentation for IsNotNull, DISTINCT, etc (apache#8052)

* Minor: improve documentation for IsNotNull, DISTINCT, etc

* fix

* Prepare 33.0.0 Release (apache#8057)

* changelog

* update version

* update changelog

* Minor: improve error message by adding types to message (apache#8065)

* Minor: improve error message

* add test

* Minor: Remove redundant BuiltinScalarFunction::supports_zero_argument() (apache#8059)

* deprecate BuiltinScalarFunction::supports_zero_argument()

* unify old supports_zero_argument() impl

* Add example to ci (apache#8060)

* feat: add example to ci

* nit

* addr comments

---------

Co-authored-by: zhongjingxiong <[email protected]>

* Update substrait requirement from 0.18.0 to 0.19.0 (apache#8076)

Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](substrait-io/substrait-rs@v0.18.0...v0.19.0)

---
updated-dependencies:
- dependency-name: substrait
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix incorrect results in COUNT(*) queries with LIMIT (apache#8049)

Co-authored-by: Mark Sirek <[email protected]>

* feat: Support determining extensions from names like `foo.parquet.snappy` as well as `foo.parquet` (apache#7972)

* feat: read files based on the file extention

* fix: some the file extension might be started with . and some not

* fix: rename extention to extension

* chore: use exec_err

* chore: rename extention to extension

* chore: rename extention to extension

* chore: simplify the code

* fix: check table is empty

* ci: fix test

* fix: add err info

* refactor: extract the logic to infer_types

* fix: add tests for different extensions

* fix: ci clippy

* fix: add more tests

* fix: simplify the logic

* fix: ci

* Use FairSpillPool for TaskContext with spillable config (apache#8072)

* Minor: Improve HashJoinStream docstrings (apache#8070)

* Minor: Improve HashJoinStream docstrings

* fix comments

* Update datafusion/physical-plan/src/joins/hash_join.rs

Co-authored-by: comphead <[email protected]>

* Update datafusion/physical-plan/src/joins/hash_join.rs

Co-authored-by: comphead <[email protected]>

---------

Co-authored-by: Daniël Heres <[email protected]>
Co-authored-by: comphead <[email protected]>

* Fixing broken link (apache#8085)

* Fixing broken link

* Update docs/source/contributor-guide/index.md

Thanks for spotting this as well

Co-authored-by: Liang-Chi Hsieh <[email protected]>

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* fix: DataFusion suggests invalid functions (apache#8083)

* fix: DataFusion suggests invalid functions

* update test

* Add test for BuiltInWindowFunction

* Replace macro with function for  `array_repeat` (apache#8071)

* General array repeat

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* add test

Signed-off-by: jayzhan211 <[email protected]>

* add test

Signed-off-by: jayzhan211 <[email protected]>

* done

Signed-off-by: jayzhan211 <[email protected]>

* remove test

Signed-off-by: jayzhan211 <[email protected]>

* add comment

Signed-off-by: jayzhan211 <[email protected]>

* fm

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>

* Minor: remove unnecessary projection in `single_distinct_to_group_by` rule (apache#8061)

* Minor: remove unnecessary projection

* fix ci

* minor: Remove duplicate version numbers for arrow, object_store, and parquet dependencies (apache#8095)

* remove duplicate version numbers for arrow, object_store, and parquet dependencies

* cargo update

* use default features in parquet crate

* disable default parquet features in wasmtest

* fix: add match encode/decode  scalar function type (apache#8089)

* feat: Protobuf serde for Json file sink (apache#8062)

* Protobuf serde for Json file sink

* Fix tests

* Fix test

* Minor: use `Expr::alias` in a few places to make the code more concise (apache#8097)

* Minor: Cleanup BuiltinScalarFunction::return_type() (apache#8088)

* Expose metrics from FileSinkExec impl of ExecutionPlan

---------

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Mustafa Akur <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Co-authored-by: Devin D'Angelo <[email protected]>
Co-authored-by: Hengfei Yang <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Huaijin <[email protected]>
Co-authored-by: Jonah Gao <[email protected]>
Co-authored-by: Chih Wang <[email protected]>
Co-authored-by: Jeffrey <[email protected]>
Co-authored-by: Marco Neumann <[email protected]>
Co-authored-by: comphead <[email protected]>
Co-authored-by: Alex Huang <[email protected]>
Co-authored-by: Jay Zhan <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: yi wang <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: jakevin <[email protected]>
Co-authored-by: 张林伟 <[email protected]>
Co-authored-by: Berkay Şahin <[email protected]>
Co-authored-by: Marko Milenković <[email protected]>
Co-authored-by: jokercurry <[email protected]>
Co-authored-by: zhongjingxiong <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Ruihang Xia <[email protected]>
Co-authored-by: metesynnada <[email protected]>
Co-authored-by: Yongting You <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Edmondo Porcu <[email protected]>
Co-authored-by: Syleechan <[email protected]>
Co-authored-by: Dan Harris <[email protected]>
Labels
core Core DataFusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid reading entire stream to determine schema of arrow file
4 participants