Arrow2 02092022 (#1795) · apache/datafusion@14bf39d


Arrow2 02092022 (#1795)

* feat: add join type for logical plan display (#1674)

* (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` (#1689)

* Move `information_schema` tests out of execution/context.rs to `sql_integration` tests (#1684)

* Move tests from context.rs to information_schema.rs

* Fix up tests to compile

* Move timestamp related tests out of context.rs and into sql integration test (#1696)

* Move some tests out of context.rs and into sql

* Move support test out of context.rs and into sql tests

* Fixup tests and make them compile

* Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers (#1691)

* Memory manager no longer tracks consumers, update aggregatedMetricsSet

* Easy memory tracking with metrics

* use tracking metrics in SPMS

* tests

* fix

* doc

* Update datafusion/src/physical_plan/sorts/sort.rs

Co-authored-by: Andrew Lamb <[email protected]>

* make tracker AtomicUsize

Co-authored-by: Andrew Lamb <[email protected]>

* Implement TableProvider for DataFrameImpl (#1699)

* Add TableProvider impl for DataFrameImpl

* Add physical plan in

* Clean up plan construction and names construction

* Remove duplicate comments

* Remove unused parameter

* Add test

* Remove duplicate limit comment

* Use cloned instead of individual clone

* Reduce the amount of code to get a schema

Co-authored-by: Andrew Lamb <[email protected]>

* Add comments to test

* Fix plan comparison

* Compare only the results of execution

* Remove println

* Refer to df_impl instead of table in test

Co-authored-by: Andrew Lamb <[email protected]>

* Fix the register_table test to use the correct result set for comparison

* Consolidate group/agg exprs

* Format

* Remove outdated comment

Co-authored-by: Andrew Lamb <[email protected]>

* refine test in repartition.rs & coalesce_batches.rs (#1707)

* Fuzz test for spillable sort (#1706)

* Lazy TempDir creation in DiskManager (#1695)

* Incorporate dyn scalar kernels (#1685)

* Rebase

* impl ToNumeric for ScalarValue

* Update macro to be based on

* Add floats

* Cleanup

* Newline

* add annotation for select_to_plan (#1714)

* Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed (#1700)

* Change physical_expr creation API

* Refactor API usage to avoid creating ExecutionContextState

* Fixup ballista

* clippy!

* Fix cannot load parquet table from spark in datafusion-cli. (#1665)

* fix cannot load parquet table from spark

* add Invalid file in log.

* fix fmt

* add upper bound for pub fn (#1713)

Signed-off-by: remzi <[email protected]>

* Create SchemaAdapter trait to map table schema to file schemas (#1709)

* Create SchemaAdapter trait to map table schema to file schemas

* Linting fix

* Remove commented code

* approx_quantile() aggregation function (#1539)

* feat: implement TDigest for approx quantile

Adds a [TDigest] implementation providing approximate quantile
estimations of large inputs using a small amount of (bounded) memory.

A TDigest is most accurate near either "end" of the quantile range (that
is, 0.1, 0.9, 0.95, etc) due to the use of a scaling function that
increases resolution at the tails. The paper claims single digit part
per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and
in practice I have found accuracy to be more than acceptable for an
approximate function across the entire quantile range.

The implementation is a modified copy of
https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++
implementation]. Both Facebook's implementation and MnO2's Rust port
are Apache 2.0 licensed.

[TDigest]: https://arxiv.org/abs/1902.04023
[Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h
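
The effect of a scaling function on resolution can be sketched in a few lines. The block below is an illustration only: it uses the k1 scale function from the t-digest paper, which is an assumption, since the ported implementation may use a different one. The helper names `k1`, `k1_inv` and `centroid_width` are hypothetical and do not come from this commit.

```rust
use std::f64::consts::PI;

/// Scale function k1 from the t-digest paper: maps a quantile q in [0, 1]
/// to a position on the "k" axis. `delta` is the compression parameter.
/// (Assumption: the commit's implementation may use another scale function.)
fn k1(q: f64, delta: f64) -> f64 {
    delta / (2.0 * PI) * (2.0 * q - 1.0).asin()
}

/// Inverse of k1: maps a k-axis position back to a quantile.
fn k1_inv(k: f64, delta: f64) -> f64 {
    ((2.0 * PI * k / delta).sin() + 1.0) / 2.0
}

/// Width, in quantile space, covered by one unit step on the k axis
/// starting at quantile `q`. A smaller width means finer resolution.
fn centroid_width(q: f64, delta: f64) -> f64 {
    let k = k1(q, delta);
    k1_inv(k + 1.0, delta) - q
}

fn main() {
    let delta = 100.0;
    // Widths shrink as q approaches the tail, so tail quantiles are
    // summarised by centroids that each cover less of the distribution.
    for q in [0.001, 0.01, 0.1, 0.5] {
        println!("q = {q:<5} width = {:.6}", centroid_width(q, delta));
    }
}
```

Because a centroid may only span one unit of k, centroids near q = 0 or q = 1 hold far fewer points than those near the median, which is where the extra accuracy at the tails comes from.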

* feat: approx_quantile aggregation

Adds the ApproxQuantile physical expression, plumbing & test cases.

The function signature is:

	approx_quantile(column, quantile)

Where column can be any numeric type (that can be cast to a float64) and
quantile is a float64 literal between 0 and 1.

* feat: approx_quantile dataframe function

Adds the approx_quantile() dataframe function, and exports it in the
prelude.

* refactor: ballista approx_quantile support

Adds ballista wire encoding for approx_quantile.

Adding support for this required modifying the AggregateExprNode proto
message to support propagating multiple LogicalExprNode aggregate
arguments - all the existing aggregations take a single argument, so
this wasn't needed before.

This commit adds "repeated" to the expr field, which I believe is
backwards compatible as described here:

	https://developers.google.com/protocol-buffers/docs/proto3#updating

Specifically, adding "repeated" to an existing message field:

	"For ... message fields, optional is compatible with repeated"

No existing tests needed fixing, and a new roundtrip test is included
that covers the change to allow multiple expr.
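
Why the change is wire compatible can be seen at the byte level: a singular message field and a repeated message field use the same key and length-delimited encoding, and a repeated field is simply that pair written more than once. The following is a minimal hand-rolled sketch, not the code used by ballista (which relies on generated protobuf code). The field number 1, the placeholder payload bytes and the helper names `read_varint` and `read_repeated` are all assumptions made for illustration.

```rust
/// Reads a base-128 varint from `buf` starting at `*pos`, advancing `*pos`.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0;
    while *pos < buf.len() && shift < 64 {
        let byte = buf[*pos];
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
    None
}

/// Collects the payload of every length-delimited occurrence of `field`,
/// which is how a reader treats a `repeated` message field.
/// Returns None on malformed input or on wire types this sketch ignores.
fn read_repeated(buf: &[u8], field: u64) -> Option<Vec<Vec<u8>>> {
    let mut pos = 0;
    let mut out = Vec::new();
    while pos < buf.len() {
        let key = read_varint(buf, &mut pos)?;
        let (number, wire_type) = (key >> 3, key & 0x7);
        if wire_type != 2 {
            return None; // only length-delimited fields are handled here
        }
        let len = read_varint(buf, &mut pos)? as usize;
        let end = pos.checked_add(len)?;
        let payload = buf.get(pos..end)?;
        pos = end;
        if number == field {
            out.push(payload.to_vec());
        }
    }
    Some(out)
}

fn main() {
    // Hypothetical field 1 with wire type 2 gives the key byte 0x0a.
    // An "old" writer emits the field once; a "new" writer emits it twice.
    let old_writer = [0x0a, 0x02, 0xaa, 0xbb];
    let new_writer = [0x0a, 0x02, 0xaa, 0xbb, 0x0a, 0x01, 0xcc];
    println!("{:?}", read_repeated(&old_writer, 1));
    println!("{:?}", read_repeated(&new_writer, 1));
}
```

A reader expecting the repeated form sees an old single-argument message as a list of length one, so existing encoded plans keep decoding.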

* refactor: use input type as return type

Casts the calculated quantile value to the same type as the input data.

* fixup! refactor: ballista approx_quantile support

* refactor: rebase onto main

* refactor: validate quantile value

Ensures the quantile value is between 0 and 1, emitting a plan error if
not.
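
A range check of this kind might look like the sketch below. It is not the code from this commit: `PlanError` is a hypothetical stand-in for the real plan error type, `validate_quantile` is an invented name, and treating both bounds as inclusive is an assumption.

```rust
/// Hypothetical stand-in for a plan error; not the DataFusion error type.
#[derive(Debug, PartialEq)]
struct PlanError(String);

/// Accepts a quantile in [0, 1] (inclusive bounds are an assumption)
/// and rejects everything else, including NaN.
fn validate_quantile(q: f64) -> Result<f64, PlanError> {
    if (0.0..=1.0).contains(&q) {
        Ok(q)
    } else {
        Err(PlanError(format!(
            "quantile must be between 0 and 1, got {q}"
        )))
    }
}

fn main() {
    println!("{:?}", validate_quantile(0.95));
    println!("{:?}", validate_quantile(1.5));
}
```

Using a range `contains` check rather than two comparisons also rejects NaN, since NaN compares false against both bounds.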

* refactor: rename to approx_percentile_cont

* refactor: clippy lints

* support bitwise and as an example (#1653)

* support bitwise and as an example

* Use $OP in macro rather than `&`

* fix: change signature to &dyn Array

* fmt

Co-authored-by: Andrew Lamb <[email protected]>

* fix: substr - correct behaviour with negative start pos (#1660)

* minor: fix cargo run --release error (#1723)

* Convert boolean case expressions to boolean logic (#1719)

* Convert boolean case expressions to boolean logic

* Review feedback

* substitute `parking_lot::Mutex` for `std::sync::Mutex` (#1720)

* Substitute parking_lot::Mutex for std::sync::Mutex

* enable parking_lot feature in tokio

* Add Expression Simplification API (#1717)

* Add Expression Simplification API

* fmt

* Add tests and CI for optional pyarrow module (#1711)

* Implement other side of conversion

* Add test workflow

* Add (failing) tests

* Get unit tests passing

* Use python -m pip

* Debug LD_LIBRARY_PATH

* Set LIBRARY_PATH

* Update help with better info

* Update parking_lot requirement from 0.11 to 0.12 (#1735)

Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version.
- [Release notes](https://github.com/Amanieu/parking_lot/releases)
- [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md)
- [Commits](Amanieu/parking_lot@0.11.0...0.12.0)

---
updated-dependencies:
- dependency-name: parking_lot
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Prevent repartitioning of certain operator's direct children (#1731) (#1732)

* Prevent repartitioning of certain operator's direct children (#1731)

* Update ballista tests

* Don't repartition children of RepartitionExec

* Revert partition restriction on Repartition and Projection

* Review feedback

* Lint

* API to get Expr's type and nullability without a `DFSchema` (#1726)

* API to get Expr type and nullability without a `DFSchema`

* Add test

* publicly export

* Improve docs

* Fix typos in crate documentation (#1739)

* add `cargo check --release` to ci (#1737)

* remote test

* Update .github/workflows/rust.yml

Co-authored-by: Andrew Lamb <[email protected]>

Co-authored-by: Andrew Lamb <[email protected]>

* Move optimize test out of context.rs (#1742)

* Move optimize test out of context.rs

* Update

* use clap 3 style args parsing for datafusion cli (#1749)

* use clap 3 style args parsing for datafusion cli

* upgrade cli version

* Add partitioned_csv setup code to sql_integration test (#1743)

* use ordered-float 2.10 (#1756)

Signed-off-by: Andy Grove <[email protected]>

* #1768 Support TimeUnit::Second in hasher (#1769)

* Support TimeUnit::Second in hasher

* fix linter

* format (#1745)

* Create built-in scalar functions programmatically (#1734)

* create built-in scalar functions programmatically

Signed-off-by: remzi <[email protected]>

* solve conflict

Signed-off-by: remzi <[email protected]>

* fix spelling mistake

Signed-off-by: remzi <[email protected]>

* rename to call_fn

Signed-off-by: remzi <[email protected]>

* [split/1] split datafusion-common module (#1751)

* split datafusion-common module

* pyarrow

* Update datafusion-common/README.md

Co-authored-by: Andy Grove <[email protected]>

* Update datafusion/Cargo.toml

* include publishing

Co-authored-by: Andy Grove <[email protected]>

* fix: Case insensitive unquoted identifiers (#1747)

* move dfschema and column (#1758)

* add datafusion-expr module (#1759)

* move column, dfschema, etc. to common module (#1760)

* include window frames and operator into datafusion-expr (#1761)

* move signature, type signature, and volatility to split module (#1763)

* [split/10] split up expr for rewriting, visiting, and simplification traits (#1774)

* split up expr for rewriting, visiting, and simplification

* add docs

* move built-in scalar functions (#1764)

* split expr type and null info to be expr-schemable (#1784)

* rewrite predicates before pushing to union inputs (#1781)

* move accumulator and columnar value (#1765)

* move accumulator and columnar value (#1762)

* fix bad data type in test_try_cast_decimal_to_decimal

* added projections for avro columns

Co-authored-by: xudong.w <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Yijie Shen <[email protected]>
Co-authored-by: Phillip Cloud <[email protected]>
Co-authored-by: Matthew Turner <[email protected]>
Co-authored-by: Yang <[email protected]>
Co-authored-by: Remzi Yang <[email protected]>
Co-authored-by: Dan Harris <[email protected]>
Co-authored-by: Dom <[email protected]>
Co-authored-by: Kun Liu <[email protected]>
Co-authored-by: Dmitry Patsura <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: r.4ntix <[email protected]>
Co-authored-by: Jiayu Liu <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Rich <[email protected]>
Co-authored-by: Marko Mikulicic <[email protected]>
Co-authored-by: Eduard Karacharov <[email protected]>


21 people authored Feb 15, 2022

1 parent 83f937a commit 14bf39d

.github/workflows/rust.yml

@@ -58,12 +58,18 @@ jobs:
               rustup toolchain install ${{ matrix.rust }}
               rustup default ${{ matrix.rust }}
               rustup component add rustfmt
-          - name: Build Workspace
+          - name: Build workspace in debug mode
             run: |
               cargo build
             env:
               CARGO_HOME: "/github/home/.cargo"
-              CARGO_TARGET_DIR: "/github/home/target"
+              CARGO_TARGET_DIR: "/github/home/target/debug"
+          - name: Build workspace in release mode
+            run: |
+              cargo check --release
+            env:
+              CARGO_HOME: "/github/home/.cargo"
+              CARGO_TARGET_DIR: "/github/home/target/release"
           - name: Check DataFusion Build without default features
             run: |
               cargo check --no-default-features -p datafusion
@@ -230,6 +236,55 @@ jobs:
               # do not produce debug symbols to keep memory usage down
               RUSTFLAGS: "-C debuginfo=0"
+      test-datafusion-pyarrow:
+        needs: [linux-build-lib]
+        runs-on: ubuntu-latest
+        strategy:
+          matrix:
+            arch: [amd64]
+            rust: [stable]
+        container:
+          image: ${{ matrix.arch }}/rust
+          env:
+            # Disable full debug symbol generation to speed up CI build and keep memory down
+            # "1" means line tables only, which is useful for panic tracebacks.
+            RUSTFLAGS: "-C debuginfo=1"
+        steps:
+          - uses: actions/checkout@v2
+            with:
+              submodules: true
+          - name: Cache Cargo
+            uses: actions/cache@v2
+            with:
+              path: /github/home/.cargo
+              # this key equals the ones on `linux-build-lib` for re-use
+              key: cargo-cache-
+          - name: Cache Rust dependencies
+            uses: actions/cache@v2
+            with:
+              path: /github/home/target
+              # this key equals the ones on `linux-build-lib` for re-use
+              key: ${{ runner.os }}-${{ matrix.arch }}-target-cache-${{ matrix.rust }}
+          - uses: actions/setup-python@v2
+            with:
+              python-version: "3.8"
+          - name: Install PyArrow
+            run: |
+              echo "LIBRARY_PATH=$LD_LIBRARY_PATH" >> $GITHUB_ENV
+              python -m pip install pyarrow
+          - name: Setup Rust toolchain
+            run: |
+              rustup toolchain install ${{ matrix.rust }}
+              rustup default ${{ matrix.rust }}
+              rustup component add rustfmt
+          - name: Run tests
+            run: |
+              cd datafusion
+              cargo test --features=pyarrow
+            env:
+              CARGO_HOME: "/github/home/.cargo"
+              CARGO_TARGET_DIR: "/github/home/target"
       lint:
         name: Lint
         runs-on: ubuntu-latest

Cargo.toml

@@ -18,6 +18,8 @@
     [workspace]
     members = [
         "datafusion",
+        "datafusion-common",
+        "datafusion-expr",
         "datafusion-cli",
         "datafusion-examples",
         "benchmarks",
@@ -33,5 +35,5 @@ lto = true
     codegen-units = 1
     [patch.crates-io]
-    #arrow2 = { git = "https://github.com/jorgecarleitao/arrow2.git", branch = "main" }
-    #parquet2 = { git = "https://github.com/jorgecarleitao/parquet2.git", branch = "main" }
+    arrow2 = { git = "https://github.com/jorgecarleitao/arrow2.git", branch = "main" }
+    parquet2 = { git = "https://github.com/jorgecarleitao/parquet2.git", branch = "main" }
