Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow2 test fix #1733

Merged
merged 31 commits into from
Feb 8, 2022
Merged

Arrow2 test fix #1733

merged 31 commits into from
Feb 8, 2022

Conversation

Igosuki
Copy link
Contributor

@Igosuki Igosuki commented Feb 2, 2022

Which issue does this PR close?

None

Rationale for this change

Stop using git repo for arrow2 and use a stabilized version

What changes are included in this PR?

Use arrow2 0.9 instead of master, integrate latest datafusion and a few cosmetic changes to make future merges easier.

Are there any user-facing changes?

No

xudong963 and others added 28 commits January 27, 2022 10:17
…ntegration` tests (apache#1684)

* Move tests from context.rs to information_schema.rs

* Fix up tests to compile
…on test (apache#1696)

* Move some tests out of context.rs and into sql

* Move support test out of context.rs and into sql tests

* Fixup tests and make them compile
…ry consumers (apache#1691)

* Memory manager no longer track consumers, update aggregatedMetricsSet

* Easy memory tracking with metrics

* use tracking metrics in SPMS

* tests

* fix

* doc

* Update datafusion/src/physical_plan/sorts/sort.rs

Co-authored-by: Andrew Lamb <[email protected]>

* make tracker AtomicUsize

Co-authored-by: Andrew Lamb <[email protected]>
* Add TableProvider impl for DataFrameImpl

* Add physical plan in

* Clean up plan construction and names construction

* Remove duplicate comments

* Remove unused parameter

* Add test

* Remove duplicate limit comment

* Use cloned instead of individual clone

* Reduce the amount of code to get a schema

Co-authored-by: Andrew Lamb <[email protected]>

* Add comments to test

* Fix plan comparison

* Compare only the results of execution

* Remove println

* Refer to df_impl instead of table in test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix the register_table test to use the correct result set for comparison

* Consolidate group/agg exprs

* Format

* Remove outdated comment

Co-authored-by: Andrew Lamb <[email protected]>
* Rebase

* impl ToNumeric for ScalarValue

* Update macro to be based on

* Add floats

* Cleanup

* Newline
…tPhysicalPlanner` for faster speed (apache#1700)

* Change physical_expr creation API

* Refactor API usage to avoid creating ExecutionContextState

* Fixup ballista

* clippy!
…1665)

* fix can not load parquet table form spark

* add Invalid file in log.

* fix fmt
…e#1709)

* Create SchemaAdapter trait to map table schema to file schemas

* Linting fix

* Remove commented code
* feat: implement TDigest for approx quantile

Adds a [TDigest] implementation providing approximate quantile
estimations of large inputs using a small amount of (bounded) memory.

A TDigest is most accurate near either "end" of the quantile range (that
is, 0.1, 0.9, 0.95, etc) due to the use of a scalaing function that
increases resolution at the tails. The paper claims single digit part
per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and
in practice I have found accuracy to be more than acceptable for an
apprixmate function across the entire quantile range.

The implementation is a modified copy of
https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++
implementation]. Both Facebook's implementation, and Mn02's Rust port
are Apache 2.0 licensed.

[TDigest]: https://arxiv.org/abs/1902.04023
[Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h

* feat: approx_quantile aggregation

Adds the ApproxQuantile physical expression, plumbing & test cases.

The function signature is:

	approx_quantile(column, quantile)

Where column can be any numeric type (that can be cast to a float64) and
quantile is a float64 literal between 0 and 1.

* feat: approx_quantile dataframe function

Adds the approx_quantile() dataframe function, and exports it in the
prelude.

* refactor: bastilla approx_quantile support

Adds bastilla wire encoding for approx_quantile.

Adding support for this required modifying the AggregateExprNode proto
message to support propigating multiple LogicalExprNode aggregate
arguments - all the existing aggregations take a single argument, so
this wasn't needed before.

This commit adds "repeated" to the expr field, which I believe is
backwards compatible as described here:

	https://developers.google.com/protocol-buffers/docs/proto3#updating

Specifically, adding "repeated" to an existing message field:

	"For ... message fields, optional is compatible with repeated"

No existing tests needed fixing, and a new roundtrip test is included
that covers the change to allow multiple expr.

* refactor: use input type as return type

Casts the calculated quantile value to the same type as the input data.

* fixup! refactor: bastilla approx_quantile support

* refactor: rebase onto main

* refactor: validate quantile value

Ensures the quantile values is between 0 and 1, emitting a plan error if
not.

* refactor: rename to approx_percentile_cont

* refactor: clippy lints
* suppport bitwise and as an example

* Use $OP in macro rather than `&`

* fix: change signature to &dyn Array

* fmt

Co-authored-by: Andrew Lamb <[email protected]>
* Convert boolean case expressions to boolean logic

* Review feedback
* Substitute parking_lot::Mutex for std::sync::Mutex

* enable parking_lot feature in tokio
* Add Expression Simplification API

* fmt
@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate labels Feb 2, 2022
@github-actions github-actions bot added the sql SQL Planner label Feb 2, 2022
Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one minor comment, the rest looks good to me. Thanks @Igosuki for this work. Sorry that I wasn't able to help with the migration last weekend because something urgent came up :(

datafusion/src/physical_plan/expressions/cast.rs Outdated Show resolved Hide resolved
@Igosuki
Copy link
Contributor Author

Igosuki commented Feb 3, 2022

Fixed the parquet tests as well as some of the decimal tests. The rest of the failures are due to weird behavior when casting decimals...

@houqp
Copy link
Member

houqp commented Feb 6, 2022

Float to decimal failure is caused by upstream bug, I will send a PR to get it fixed shortly.

@alamb
Copy link
Contributor

alamb commented Feb 8, 2022

I don't see any reason not to merge this into the arrow2 branch! ✅

@alamb alamb merged commit 83f937a into apache:arrow2 Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.