Arrow2 02092022 #1795

Igosuki · 2022-02-09T11:41:39Z

Which issue does this PR close?

None

Closes #.

Rationale for this change

Keep the arrow2 branch up to date

What changes are included in this PR?

Uses the improved parquet IO code and switches to the blocking FileWriter for writing parquet, since File isn't AsyncWrite.

Are there any user-facing changes?

Nope

…ry consumers (apache#1691) * Memory manager no longer track consumers, update aggregatedMetricsSet * Easy memory tracking with metrics * use tracking metrics in SPMS * tests * fix * doc * Update datafusion/src/physical_plan/sorts/sort.rs Co-authored-by: Andrew Lamb <[email protected]> * make tracker AtomicUsize Co-authored-by: Andrew Lamb <[email protected]>

* Add TableProvider impl for DataFrameImpl * Add physical plan in * Clean up plan construction and names construction * Remove duplicate comments * Remove unused parameter * Add test * Remove duplicate limit comment * Use cloned instead of individual clone * Reduce the amount of code to get a schema Co-authored-by: Andrew Lamb <[email protected]> * Add comments to test * Fix plan comparison * Compare only the results of execution * Remove println * Refer to df_impl instead of table in test Co-authored-by: Andrew Lamb <[email protected]> * Fix the register_table test to use the correct result set for comparison * Consolidate group/agg exprs * Format * Remove outdated comment Co-authored-by: Andrew Lamb <[email protected]>

* feat: implement TDigest for approx quantile

  Adds a [TDigest] implementation providing approximate quantile estimations of large inputs using a small amount of (bounded) memory.

  A TDigest is most accurate near either "end" of the quantile range (that is, 0.1, 0.9, 0.95, etc) due to the use of a scaling function that increases resolution at the tails. The paper claims single digit part per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found accuracy to be more than acceptable for an approximate function across the entire quantile range.

  The implementation is a modified copy of https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++ implementation]. Both Facebook's implementation and MnO2's Rust port are Apache 2.0 licensed.

  [TDigest]: https://arxiv.org/abs/1902.04023
  [Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h

* feat: approx_quantile aggregation

  Adds the ApproxQuantile physical expression, plumbing & test cases.

  The function signature is:

      approx_quantile(column, quantile)

  Where column can be any numeric type (that can be cast to a float64) and quantile is a float64 literal between 0 and 1.

* feat: approx_quantile dataframe function

  Adds the approx_quantile() dataframe function, and exports it in the prelude.

* refactor: bastilla approx_quantile support

  Adds Ballista wire encoding for approx_quantile.

  Adding support for this required modifying the AggregateExprNode proto message to support propagating multiple LogicalExprNode aggregate arguments. All the existing aggregations take a single argument, so this wasn't needed before.

  This commit adds "repeated" to the expr field, which I believe is backwards compatible as described here:

      https://developers.google.com/protocol-buffers/docs/proto3#updating

  Specifically, adding "repeated" to an existing message field:

      "For ... message fields, optional is compatible with repeated"

  No existing tests needed fixing, and a new roundtrip test is included that covers the change to allow multiple expr.

* refactor: use input type as return type

  Casts the calculated quantile value to the same type as the input data.

* fixup! refactor: bastilla approx_quantile support

* refactor: rebase onto main

* refactor: validate quantile value

  Ensures the quantile value is between 0 and 1, emitting a plan error if not.

* refactor: rename to approx_percentile_cont

* refactor: clippy lints
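As a point of comparison for the approximate aggregate described above, the following is a minimal sketch of an exact continuous percentile with the same argument check (quantile between 0 and 1). It is not the TDigest code from this commit: the function name `exact_percentile_cont`, the `String` error type, and the choice of linear interpolation between the two closest ranks are all assumptions made for illustration. Unlike a TDigest, it sorts and holds every input value, which is the memory cost the approximate version avoids.

```rust
/// Hypothetical reference helper (not part of DataFusion): computes the exact
/// continuous percentile of `values` at quantile `q`, interpolating linearly
/// between the two closest ranks.
///
/// Returns an error when `q` is outside [0, 1] (including NaN), and
/// `Ok(None)` when there are no input values.
fn exact_percentile_cont(values: &[f64], q: f64) -> Result<Option<f64>, String> {
    // Reject out-of-range quantiles up front, similar in spirit to the
    // plan error mentioned in the commit message.
    if !(0.0..=1.0).contains(&q) {
        return Err(format!("quantile must be between 0 and 1, got {}", q));
    }
    if values.is_empty() {
        return Ok(None);
    }

    // Exact computation needs all values in sorted order.
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Fractional rank in the range [0, len - 1].
    let rank = q * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;

    // Linear interpolation between the neighbouring values.
    Ok(Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac))
}
```

For the input `[5.0, 1.0, 4.0, 2.0, 3.0]`, a quantile of `0.5` yields the median `3.0`, while a quantile of `1.5` is rejected with an error.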

Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version. - [Release notes](https://github.com/Amanieu/parking_lot/releases) - [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md) - [Commits](Amanieu/parking_lot@0.11.0...0.12.0) --- updated-dependencies: - dependency-name: parking_lot dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…1731) (apache#1732) * Prevent repartitioning of certain operator's direct children (apache#1731) * Update ballista tests * Don't repartition children of RepartitionExec * Revert partition restriction on Repartition and Projection * Review feedback * Lint

* create build-in scalar functions programatically Signed-off-by: remzi <[email protected]> * solve conflict Signed-off-by: remzi <[email protected]> * fix spelling mistake Signed-off-by: remzi <[email protected]> * rename to call_fn Signed-off-by: remzi <[email protected]>

Igosuki · 2022-02-12T08:31:05Z

Added a fix for avro projections

houqp · 2022-03-07T08:28:50Z

@Igosuki @alamb mind if I force update the arrow2 branch with the latest commit from this PR (7910765)? I am thinking it will be easier to manage long running branches using merge commits instead of squash commits, so we can keep the full history from master in the arrow2 branch. Right now all commits from the master branch get squashed into a single commit in the arrow2 branch every time we merge a PR.

Or do you all prefer to use squash commits instead?

Igosuki · 2022-03-07T10:08:52Z

Yep, I actually did a merge with latest here: https://github.com/Igosuki/Arrow-Datafusion/tree/arrow2. Should I make a new PR? I now understand why it was so hard to merge all the time 😅

alamb · 2022-03-07T12:09:43Z

mind if I force update the arrow2 branch with the latest commit from this PR

I do not mind

houqp · 2022-03-08T08:48:07Z

@Igosuki that's great. I tried to manually unsquash the previous commits, but it turned out to be too much work and not worth it. I took a quick look at your arrow2 branch and it looks clean without those two squashed commits. I think we can just force push your arrow2 branch into the one in datafusion's repo, then use merge commits going forward for future catch ups. If you are ok with this approach, please feel free to send a PR. Once it is reviewed by the community, I can help perform the force push to replace the current arrow2 branch.

https://git-scm.com/docs/gitfaq#long-running-squash-merge has guidance on how to best maintain long running branches, which should help explain why merge commits will help make future merges easier.
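To illustrate why merge commits make later catch-ups easier, here is a toy model of a commit graph in Rust. It is only a sketch and not how git is implemented: the `Graph` type, the commit names (`m0`..`m3` for master, `a1`/`a2` for arrow2), and the helper `base_after_catch_up` are all hypothetical. The point it shows is that a merge commit records the master commit as a second parent, so the next merge starts from that commit, while a squash commit records no such link and the merge base stays at the original fork point.

```rust
use std::collections::{HashMap, HashSet};

/// Toy commit graph: each commit id maps to the ids of its parents.
struct Graph {
    parents: HashMap<&'static str, Vec<&'static str>>,
}

impl Graph {
    fn new() -> Self {
        Graph { parents: HashMap::new() }
    }

    fn commit(&mut self, id: &'static str, parents: &[&'static str]) {
        self.parents.insert(id, parents.to_vec());
    }

    /// All commits reachable from `id`, including `id` itself.
    fn ancestors(&self, id: &'static str) -> HashSet<&'static str> {
        let mut seen = HashSet::new();
        let mut stack = vec![id];
        while let Some(c) = stack.pop() {
            if seen.insert(c) {
                if let Some(ps) = self.parents.get(c) {
                    stack.extend(ps.iter().copied());
                }
            }
        }
        seen
    }

    /// A merge base of `a` and `b`: a common ancestor that is not itself
    /// an ancestor of any other common ancestor.
    fn merge_base(&self, a: &'static str, b: &'static str) -> Option<&'static str> {
        let aa = self.ancestors(a);
        let bb = self.ancestors(b);
        let common: Vec<&'static str> = aa.intersection(&bb).copied().collect();
        common.iter().copied().find(|&c| {
            common
                .iter()
                .all(|&other| other == c || !self.ancestors(other).contains(c))
        })
    }
}

/// Builds a history where arrow2 forked from master at m0 and caught up with
/// master at m2, either with a squash commit or a merge commit, and returns
/// the merge base for the next catch-up against m3.
fn base_after_catch_up(squash: bool) -> Option<&'static str> {
    let mut g = Graph::new();
    // master: m0 <- m1 <- m2 <- m3
    g.commit("m0", &[]);
    g.commit("m1", &["m0"]);
    g.commit("m2", &["m1"]);
    g.commit("m3", &["m2"]);
    // arrow2 forks from m0
    g.commit("a1", &["m0"]);
    if squash {
        // Squash: the content of m1 and m2 is copied, but no parent link is kept.
        g.commit("a2", &["a1"]);
    } else {
        // Merge commit: m2 is recorded as a second parent.
        g.commit("a2", &["a1", "m2"]);
    }
    g.merge_base("a2", "m3")
}
```

With the merge commit, `base_after_catch_up(false)` gives `m2`, so only `m3` is new in the next catch-up. With the squash, `base_after_catch_up(true)` gives `m0`, so the changes from `m1` and `m2` have to be reconciled again even though their content is already on the branch.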

Once we have the arrow2 branch updated, I will get back to performance testing. I remember running into a performance regression around window queries, so that would be the next thing I will dig into.

xudong963 and others added 30 commits January 27, 2022 10:17

feat: add join type for logical plan display (apache#1674)

7b8d72c

(minor) Reduce memory manager and disk manager logs from info! to `…

18ced8d

…debug!` (apache#1689)

Move information_schema tests out of execution/context.rs to `sql_i…

ed1de63

…ntegration` tests (apache#1684) * Move tests from context.rs to information_schema.rs * Fix up tests to compile

Move timestamp related tests out of context.rs and into sql integrati…

ab145c8

…on test (apache#1696) * Move some tests out of context.rs and into sql * Move support test out of context.rs and into sql tests * Fixup tests and make them compile

refine test in repartition.rs & coalesce_batches.rs (apache#1707)

75c7578

Fuzz test for spillable sort (apache#1706)

a7f0156

Lazy TempDir creation in DiskManager (apache#1695)

fecce97

Incorporate dyn scalar kernels (apache#1685)

3494e9c

* Rebase * impl ToNumeric for ScalarValue * Update macro to be based on * Add floats * Cleanup * Newline

add annotation for select_to_plan (apache#1714)

2512608

Support create_physical_expr and ExecutionContextState or `Defaul…

1caf52a

…tPhysicalPlanner` for faster speed (apache#1700) * Change physical_expr creation API * Refactor API usage to avoid creating ExecutionContextState * Fixup ballista * clippy!

Fix can not load parquet table form spark in datafusion-cli. (apache#…

f849968

…1665) * fix can not load parquet table form spark * add Invalid file in log. * fix fmt

add upper bound for pub fn (apache#1713)

d01d8d5

Signed-off-by: remzi <[email protected]>

Create SchemaAdapter trait to map table schema to file schemas (apach…

7bec762

…e#1709) * Create SchemaAdapter trait to map table schema to file schemas * Linting fix * Remove commented code

suppport bitwise and as an example (apache#1653)

940d4eb

* suppport bitwise and as an example * Use $OP in macro rather than `&` * fix: change signature to &dyn Array * fmt Co-authored-by: Andrew Lamb <[email protected]>

fix: substr - correct behaivour with negative start pos (apache#1660)

b6ace16

minor: fix cargo run --release error (apache#1723)

bacf10d

Convert boolean case expressions to boolean logic (apache#1719)

b9a8f15

* Convert boolean case expressions to boolean logic * Review feedback

substitute parking_lot::Mutex for std::sync::Mutex (apache#1720)

46879f1

* Substitute parking_lot::Mutex for std::sync::Mutex * enable parking_lot feature in tokio

Add Expression Simplification API (apache#1717)

e4a056f

* Add Expression Simplification API * fmt

Add tests and CI for optional pyarrow module (apache#1711)

d1ebdbf

* Implement other side of conversion * Add test workflow * Add (failing) tests * Get unit tests passing * Use python -m pip * Debug LD_LIBRARY_PATH * Set LIBRARY_PATH * Update help with better info

API to get Expr's type and nullability without a DFSchema (apache#1726

b2eaee3

) * API to get Expr type and nullability without a `DFSchema` * Add test * publically export * Improve docs

Fix typos in crate documentation (apache#1739)

5124759

add cargo check --release to ci (apache#1737)

97a1b21

* remote test * Update .github/workflows/rust.yml Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

Move optimize test out of context.rs (apache#1742)

15cfcbc

* Move optimize test out of context.rs * Update

use clap 3 style args parsing for datafusion cli (apache#1749)

40df55f

* use clap 3 style args parsing for datafusion cli * upgrade cli version

HaoYang670 and others added 16 commits February 7, 2022 06:59

[split/1] split datafusion-common module (apache#1751)

fe46a1e

* split datafusion-common module * pyarrow * Update datafusion-common/README.md Co-authored-by: Andy Grove <[email protected]> * Update datafusion/Cargo.toml * include publishing Co-authored-by: Andy Grove <[email protected]>

fix: Case insensitive unquoted identifiers (apache#1747)

d014ff2

move dfschema and column (apache#1758)

2e535f9

add datafusion-expr module (apache#1759)

a39a223

move column, dfschema, etc. to common module (apache#1760)

2ec34cf

include window frames and operator into datafusion-expr (apache#1761)

09c67d5

move signature, type signature, and volatility to split module (apach…

3c39c72

…e#1763)

[split/10] split up expr for rewriting, visiting, and simplification …

86dcb09

…traits (apache#1774) * split up expr for rewriting, visiting, and simplification * add docs

move built-in scalar functions (apache#1764)

4b68273

split expr type and null info to be expr-schemable (apache#1784)

f2615af

rewrite predicates before pushing to union inputs (apache#1781)

e8c198b

move accumulator and columnar value (apache#1765)

ed9b049

move accumulator and columnar value (apache#1762)

014e5e9

merge latest datafusion on 02092022

d23c873

fix bad data type in test_try_cast_decimal_to_decimal

b2cfe2b

github-actions bot added the ballista, datafusion (Changes in the datafusion crate), and sql (SQL Planner) labels Feb 9, 2022

Igosuki changed the base branch from master to arrow2 February 9, 2022 11:43

added projections for avro columns

7910765

github-actions bot added the documentation (Improvements or additions to documentation) label Feb 12, 2022

alamb approved these changes Feb 15, 2022

alamb merged commit 14bf39d into apache:arrow2 Feb 15, 2022

houqp mentioned this pull request Mar 9, 2022

arrow2 branch: Pin to specific commit #1963

Closed
