Arrow2 02092022 (#1795)
* feat: add join type for logical plan display (#1674)

* (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` (#1689)

* Move `information_schema` tests out of execution/context.rs to `sql_integration` tests (#1684)

* Move tests from context.rs to information_schema.rs

* Fix up tests to compile

* Move timestamp related tests out of context.rs and into sql integration test (#1696)

* Move some tests out of context.rs and into sql

* Move support test out of context.rs and into sql tests

* Fixup tests and make them compile

* Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers (#1691)

* Memory manager no longer tracks consumers, update aggregatedMetricsSet

* Easy memory tracking with metrics

* use tracking metrics in SPMS

* tests

* fix

* doc

* Update datafusion/src/physical_plan/sorts/sort.rs

Co-authored-by: Andrew Lamb <[email protected]>

* make tracker AtomicUsize

Co-authored-by: Andrew Lamb <[email protected]>
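
As a rough illustration of the tracking idea above (a shared counter kept as an `AtomicUsize`), here is a minimal sketch; the type and method names are hypothetical and are not DataFusion's actual `MemTrackingMetrics` API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical tracker: a shared atomic counter of bytes currently in use.
#[derive(Clone, Default)]
struct MemTracker(Arc<AtomicUsize>);

impl MemTracker {
    fn grow(&self, bytes: usize) {
        self.0.fetch_add(bytes, Ordering::Relaxed);
    }
    fn shrink(&self, bytes: usize) {
        self.0.fetch_sub(bytes, Ordering::Relaxed);
    }
    fn current(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    let tracker = MemTracker::default();
    tracker.grow(4096);   // an operator allocated a batch
    tracker.shrink(1024); // part of it was released
    assert_eq!(tracker.current(), 3072);
}
```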

* Implement TableProvider for DataFrameImpl (#1699)

* Add TableProvider impl for DataFrameImpl

* Add physical plan in

* Clean up plan construction and names construction

* Remove duplicate comments

* Remove unused parameter

* Add test

* Remove duplicate limit comment

* Use cloned instead of individual clone

* Reduce the amount of code to get a schema

Co-authored-by: Andrew Lamb <[email protected]>

* Add comments to test

* Fix plan comparison

* Compare only the results of execution

* Remove println

* Refer to df_impl instead of table in test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix the register_table test to use the correct result set for comparison

* Consolidate group/agg exprs

* Format

* Remove outdated comment

Co-authored-by: Andrew Lamb <[email protected]>

* refine test in repartition.rs & coalesce_batches.rs (#1707)

* Fuzz test for spillable sort (#1706)

* Lazy TempDir creation in DiskManager (#1695)

* Incorporate dyn scalar kernels (#1685)

* Rebase

* impl ToNumeric for ScalarValue

* Update macro to be based on

* Add floats

* Cleanup

* Newline

* add annotation for select_to_plan (#1714)

* Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed (#1700)

* Change physical_expr creation API

* Refactor API usage to avoid creating ExecutionContextState

* Fixup ballista

* clippy!

* Fix cannot load parquet table from Spark in datafusion-cli. (#1665)

* fix cannot load parquet table from spark

* add "Invalid file" message to the log.

* fix fmt

* add upper bound for pub fn (#1713)

Signed-off-by: remzi <[email protected]>

* Create SchemaAdapter trait to map table schema to file schemas (#1709)

* Create SchemaAdapter trait to map table schema to file schemas

* Linting fix

* Remove commented code
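
A minimal sketch of what a schema-adapter abstraction can look like, using toy schema types and hypothetical method names rather than the exact trait added in #1709:

```rust
/// Illustrative stand-ins for Arrow schemas.
struct TableSchema { column_names: Vec<String> }
struct FileSchema  { column_names: Vec<String> }

trait SchemaAdapter {
    /// For each table column, find its index in the file, or None if the
    /// file does not contain that column (it can then be filled with nulls).
    fn map_projection(&self, table: &TableSchema, file: &FileSchema) -> Vec<Option<usize>>;
}

/// Simple adapter that matches columns by name, ignoring ordering.
struct ByNameAdapter;

impl SchemaAdapter for ByNameAdapter {
    fn map_projection(&self, table: &TableSchema, file: &FileSchema) -> Vec<Option<usize>> {
        table
            .column_names
            .iter()
            .map(|name| file.column_names.iter().position(|f| f == name))
            .collect()
    }
}

fn main() {
    let table = TableSchema { column_names: vec!["a".into(), "b".into(), "c".into()] };
    let file = FileSchema { column_names: vec!["c".into(), "a".into()] };
    // "a" is at index 1 in the file, "b" is missing, "c" is at index 0.
    assert_eq!(ByNameAdapter.map_projection(&table, &file), vec![Some(1), None, Some(0)]);
}
```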

* approx_quantile() aggregation function (#1539)

* feat: implement TDigest for approx quantile

Adds a [TDigest] implementation providing approximate quantile
estimations of large inputs using a small amount of (bounded) memory.

A TDigest is most accurate near either "end" of the quantile range (that
is, 0.1, 0.9, 0.95, etc.) due to the use of a scaling function that
increases resolution at the tails. The paper claims single-digit
parts-per-million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and
in practice I have found accuracy to be more than acceptable for an
approximate function across the entire quantile range.

The implementation is a modified copy of
https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++
implementation]. Both Facebook's implementation and MnO2's Rust port
are Apache 2.0 licensed.

[TDigest]: https://arxiv.org/abs/1902.04023
[Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h
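
For intuition about that scaling function, the toy sketch below evaluates one of the scale functions described in the TDigest paper (often called k1) and its slope at a few quantiles; the steep slope near q = 0 and q = 1 is what forces tail centroids to stay small. This is an illustration only, not the code added in this PR:

```rust
// k1 scale function from the TDigest paper: k(q) = (delta / 2*pi) * asin(2q - 1).
fn k1(q: f64, delta: f64) -> f64 {
    (delta / (2.0 * std::f64::consts::PI)) * (2.0 * q - 1.0).asin()
}

fn main() {
    let delta = 100.0; // compression parameter (~number of centroids)
    for q in [0.001, 0.01, 0.25, 0.5, 0.75, 0.99, 0.999] {
        // dk/dq is large near the tails, so a unit step in k covers only a
        // tiny slice of q there, i.e. tail centroids hold few points.
        let dq = 1e-6;
        let slope = (k1(q + dq, delta) - k1(q, delta)) / dq;
        println!("q = {q:>6}: k = {:+.3}, dk/dq = {:.1}", k1(q, delta), slope);
    }
}
```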

* feat: approx_quantile aggregation

Adds the ApproxQuantile physical expression, plumbing & test cases.

The function signature is:

	approx_quantile(column, quantile)

Where column can be any numeric type (that can be cast to a float64) and
quantile is a float64 literal between 0 and 1.

* feat: approx_quantile dataframe function

Adds the approx_quantile() dataframe function, and exports it in the
prelude.

* refactor: ballista approx_quantile support

Adds ballista wire encoding for approx_quantile.

Adding support for this required modifying the AggregateExprNode proto
message to support propagating multiple LogicalExprNode aggregate
arguments - all the existing aggregations take a single argument, so
this wasn't needed before.

This commit adds "repeated" to the expr field, which I believe is
backwards compatible as described here:

	https://developers.google.com/protocol-buffers/docs/proto3#updating

Specifically, adding "repeated" to an existing message field:

	"For ... message fields, optional is compatible with repeated"

No existing tests needed fixing, and a new roundtrip test is included
that covers the change to allow multiple expr.

* refactor: use input type as return type

Casts the calculated quantile value to the same type as the input data.

* fixup! refactor: ballista approx_quantile support

* refactor: rebase onto main

* refactor: validate quantile value

Ensures the quantile value is between 0 and 1, emitting a plan error if
not.

* refactor: rename to approx_percentile_cont

* refactor: clippy lints

* support bitwise AND as an example (#1653)

* support bitwise AND as an example

* Use $OP in macro rather than `&`

* fix: change signature to &dyn Array

* fmt

Co-authored-by: Andrew Lamb <[email protected]>

* fix: substr - correct behaviour with negative start pos (#1660)

* minor: fix cargo run --release error (#1723)

* Convert boolean case expressions to boolean logic (#1719)

* Convert boolean case expressions to boolean logic

* Review feedback
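
As a rough sketch of the kind of rewrite this refers to, on a toy expression type (not DataFusion's real `Expr`) and ignoring SQL's three-valued NULL semantics: `CASE WHEN x THEN true ELSE false END` collapses to `x`, and `CASE WHEN x THEN false ELSE true END` to `NOT x`:

```rust
// Toy expression type, only to illustrate the rewrite.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(bool),
    Not(Box<Expr>),
    Case { when: Box<Expr>, then: Box<Expr>, otherwise: Box<Expr> },
}

// CASE WHEN x THEN true ELSE false END  =>  x
// CASE WHEN x THEN false ELSE true END  =>  NOT x
fn simplify_boolean_case(expr: Expr) -> Expr {
    match expr {
        Expr::Case { when, then, otherwise } => match (*then, *otherwise) {
            (Expr::Literal(true), Expr::Literal(false)) => *when,
            (Expr::Literal(false), Expr::Literal(true)) => Expr::Not(when),
            (t, o) => Expr::Case { when, then: Box::new(t), otherwise: Box::new(o) },
        },
        other => other,
    }
}

fn main() {
    let case = Expr::Case {
        when: Box::new(Expr::Column("is_active".into())),
        then: Box::new(Expr::Literal(true)),
        otherwise: Box::new(Expr::Literal(false)),
    };
    assert_eq!(simplify_boolean_case(case), Expr::Column("is_active".into()));
}
```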

* substitute `parking_lot::Mutex` for `std::sync::Mutex` (#1720)

* Substitute parking_lot::Mutex for std::sync::Mutex

* enable parking_lot feature in tokio
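
A minimal before/after sketch of the substitution, assuming the `parking_lot` crate is available as a dependency: unlike `std::sync::Mutex`, `parking_lot::Mutex::lock()` has no poisoning and returns the guard directly, so call sites drop the `.unwrap()`:

```rust
fn main() {
    // std::sync::Mutex: lock() returns a Result because the mutex may be poisoned.
    let std_counter = std::sync::Mutex::new(0_u64);
    *std_counter.lock().unwrap() += 1;

    // parking_lot::Mutex: no poisoning, lock() returns the guard directly,
    // so the unwrap at every call site goes away.
    let pl_counter = parking_lot::Mutex::new(0_u64);
    *pl_counter.lock() += 1;

    assert_eq!(*std_counter.lock().unwrap(), 1);
    assert_eq!(*pl_counter.lock(), 1);
}
```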

* Add Expression Simplification API (#1717)

* Add Expression Simplification API

* fmt

* Add tests and CI for optional pyarrow module (#1711)

* Implement other side of conversion

* Add test workflow

* Add (failing) tests

* Get unit tests passing

* Use python -m pip

* Debug LD_LIBRARY_PATH

* Set LIBRARY_PATH

* Update help with better info

* Update parking_lot requirement from 0.11 to 0.12 (#1735)

Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version.
- [Release notes](https://github.com/Amanieu/parking_lot/releases)
- [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md)
- [Commits](Amanieu/parking_lot@0.11.0...0.12.0)

---
updated-dependencies:
- dependency-name: parking_lot
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Prevent repartitioning of certain operators' direct children (#1731) (#1732)

* Prevent repartitioning of certain operators' direct children (#1731)

* Update ballista tests

* Don't repartition children of RepartitionExec

* Revert partition restriction on Repartition and Projection

* Review feedback

* Lint

* API to get Expr's type and nullability without a `DFSchema` (#1726)

* API to get Expr type and nullability without a `DFSchema`

* Add test

* publicly export

* Improve docs
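
A hypothetical sketch of the idea (illustrative names only, not the exact API from #1726): instead of requiring a full `DFSchema`, the expression asks a small trait for just the type and nullability of the columns it references:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType { Int64, Float64 }

// Hypothetical: the minimal information an expression needs about its inputs.
trait ExprSchema {
    fn data_type(&self, column: &str) -> DataType;
    fn nullable(&self, column: &str) -> bool;
}

enum Expr {
    Column(String),
    Literal(f64),
}

impl Expr {
    fn get_type(&self, schema: &dyn ExprSchema) -> DataType {
        match self {
            Expr::Column(name) => schema.data_type(name),
            Expr::Literal(_) => DataType::Float64,
        }
    }
    fn nullable(&self, schema: &dyn ExprSchema) -> bool {
        match self {
            Expr::Column(name) => schema.nullable(name),
            Expr::Literal(_) => false,
        }
    }
}

// A toy schema backed by a map; anything that answers the two questions works.
struct MapSchema(HashMap<String, (DataType, bool)>);

impl ExprSchema for MapSchema {
    fn data_type(&self, column: &str) -> DataType { self.0[column].0 }
    fn nullable(&self, column: &str) -> bool { self.0[column].1 }
}

fn main() {
    let schema = MapSchema(HashMap::from([("a".to_string(), (DataType::Int64, true))]));
    let expr = Expr::Column("a".to_string());
    assert_eq!(expr.get_type(&schema), DataType::Int64);
    assert!(expr.nullable(&schema));
}
```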

* Fix typos in crate documentation (#1739)

* add `cargo check --release` to ci (#1737)

* remote test

* Update .github/workflows/rust.yml

Co-authored-by: Andrew Lamb <[email protected]>

Co-authored-by: Andrew Lamb <[email protected]>

* Move optimize test out of context.rs (#1742)

* Move optimize test out of context.rs

* Update

* use clap 3 style args parsing for datafusion cli (#1749)

* use clap 3 style args parsing for datafusion cli

* upgrade cli version

* Add partitioned_csv setup code to sql_integration test (#1743)

* use ordered-float 2.10 (#1756)

Signed-off-by: Andy Grove <[email protected]>

* #1768 Support TimeUnit::Second in hasher (#1769)

* Support TimeUnit::Second in hasher

* fix linter

* format (#1745)

* Create built-in scalar functions programmatically (#1734)

* create built-in scalar functions programmatically

Signed-off-by: remzi <[email protected]>

* solve conflict

Signed-off-by: remzi <[email protected]>

* fix spelling mistake

Signed-off-by: remzi <[email protected]>

* rename to call_fn

Signed-off-by: remzi <[email protected]>

* [split/1] split datafusion-common module (#1751)

* split datafusion-common module

* pyarrow

* Update datafusion-common/README.md

Co-authored-by: Andy Grove <[email protected]>

* Update datafusion/Cargo.toml

* include publishing

Co-authored-by: Andy Grove <[email protected]>

* fix: Case insensitive unquoted identifiers (#1747)

* move dfschema and column (#1758)

* add datafusion-expr module (#1759)

* move column, dfschema, etc. to common module (#1760)

* include window frames and operator into datafusion-expr (#1761)

* move signature, type signature, and volatility to split module (#1763)

* [split/10] split up expr for rewriting, visiting, and simplification traits (#1774)

* split up expr for rewriting, visiting, and simplification

* add docs

* move built-in scalar functions (#1764)

* split expr type and null info to be expr-schemable (#1784)

* rewrite predicates before pushing to union inputs (#1781)

* move accumulator and columnar value (#1765)

* move accumulator and columnar value (#1762)

* fix bad data type in test_try_cast_decimal_to_decimal

* added projections for avro columns

Co-authored-by: xudong.w <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Yijie Shen <[email protected]>
Co-authored-by: Phillip Cloud <[email protected]>
Co-authored-by: Matthew Turner <[email protected]>
Co-authored-by: Yang <[email protected]>
Co-authored-by: Remzi Yang <[email protected]>
Co-authored-by: Dan Harris <[email protected]>
Co-authored-by: Dom <[email protected]>
Co-authored-by: Kun Liu <[email protected]>
Co-authored-by: Dmitry Patsura <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: r.4ntix <[email protected]>
Co-authored-by: Jiayu Liu <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Rich <[email protected]>
Co-authored-by: Marko Mikulicic <[email protected]>
Co-authored-by: Eduard Karacharov <[email protected]>
1 parent 83f937a commit 14bf39d
Showing 102 changed files with 13,702 additions and 8,123 deletions.
59 changes: 57 additions & 2 deletions .github/workflows/rust.yml
@@ -58,12 +58,18 @@ jobs:
           rustup toolchain install ${{ matrix.rust }}
           rustup default ${{ matrix.rust }}
           rustup component add rustfmt
-      - name: Build Workspace
+      - name: Build workspace in debug mode
         run: |
           cargo build
         env:
           CARGO_HOME: "/github/home/.cargo"
-          CARGO_TARGET_DIR: "/github/home/target"
+          CARGO_TARGET_DIR: "/github/home/target/debug"
+      - name: Build workspace in release mode
+        run: |
+          cargo check --release
+        env:
+          CARGO_HOME: "/github/home/.cargo"
+          CARGO_TARGET_DIR: "/github/home/target/release"
       - name: Check DataFusion Build without default features
         run: |
           cargo check --no-default-features -p datafusion
@@ -230,6 +236,55 @@ jobs:
           # do not produce debug symbols to keep memory usage down
           RUSTFLAGS: "-C debuginfo=0"
 
+  test-datafusion-pyarrow:
+    needs: [linux-build-lib]
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        arch: [amd64]
+        rust: [stable]
+    container:
+      image: ${{ matrix.arch }}/rust
+      env:
+        # Disable full debug symbol generation to speed up CI build and keep memory down
+        # "1" means line tables only, which is useful for panic tracebacks.
+        RUSTFLAGS: "-C debuginfo=1"
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          submodules: true
+      - name: Cache Cargo
+        uses: actions/cache@v2
+        with:
+          path: /github/home/.cargo
+          # this key equals the ones on `linux-build-lib` for re-use
+          key: cargo-cache-
+      - name: Cache Rust dependencies
+        uses: actions/cache@v2
+        with:
+          path: /github/home/target
+          # this key equals the ones on `linux-build-lib` for re-use
+          key: ${{ runner.os }}-${{ matrix.arch }}-target-cache-${{ matrix.rust }}
+      - uses: actions/setup-python@v2
+        with:
+          python-version: "3.8"
+      - name: Install PyArrow
+        run: |
+          echo "LIBRARY_PATH=$LD_LIBRARY_PATH" >> $GITHUB_ENV
+          python -m pip install pyarrow
+      - name: Setup Rust toolchain
+        run: |
+          rustup toolchain install ${{ matrix.rust }}
+          rustup default ${{ matrix.rust }}
+          rustup component add rustfmt
+      - name: Run tests
+        run: |
+          cd datafusion
+          cargo test --features=pyarrow
+        env:
+          CARGO_HOME: "/github/home/.cargo"
+          CARGO_TARGET_DIR: "/github/home/target"
+
   lint:
     name: Lint
     runs-on: ubuntu-latest
6 changes: 4 additions & 2 deletions Cargo.toml
@@ -18,6 +18,8 @@
 [workspace]
 members = [
     "datafusion",
+    "datafusion-common",
+    "datafusion-expr",
     "datafusion-cli",
     "datafusion-examples",
     "benchmarks",
@@ -33,5 +35,5 @@ lto = true
 codegen-units = 1
 
 [patch.crates-io]
-#arrow2 = { git = "https://github.com/jorgecarleitao/arrow2.git", branch = "main" }
-#parquet2 = { git = "https://github.com/jorgecarleitao/parquet2.git", branch = "main" }
+arrow2 = { git = "https://github.com/jorgecarleitao/arrow2.git", branch = "main" }
+parquet2 = { git = "https://github.com/jorgecarleitao/parquet2.git", branch = "main" }
