This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-75]Support ColumnarHashAggregate in ColumnarWSCG #76

Merged 6 commits on Feb 4, 2021

xuechendi (Collaborator) commented Feb 1, 2021

In this PR, we reuse the original codegen hash aggregate path and turn it into a new codegen derived class.

  1. Only supports hashAggregate as the last operator in a WSCG stage
  2. Only supports a single hashAggregate
  3. Noticed a big performance improvement in Q24a/Q24b at sf1536
  4. Verified with all TPCDS queries at sf500
  5. Verified with all TPCH queries at sf500

Supported cases:
1. aggregation without group-by;
2. group-by aggregation with a single key or multiple keys;
3. group-by aggregation with a result expression;
4. hash join followed by group-by aggregation;
5. merge join followed by group-by aggregation.

Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
…bquery, it can't go WSCG for now

Signed-off-by: Chendi Xue <[email protected]>
xuechendi (Collaborator, Author) commented:

@zhouyuan, this PR is verified with TPCH and TPCDS sf500, ready to merge

}
}
}
}
Contributor review comment:

Some files seem to be missing an empty line at the end of the file.

zhouyuan (Collaborator) commented Feb 2, 2021

Ran into some non-Unicode bytes in the generated codegen file:

if (project_1_output_col_1^@^@<85>^@^@^@^@^@^@^@^@uto project_1_output_col_2 = typed_in_col_2;
aut_validity) {
RETURN_NOT_OK(aggr_action_list_3[2]->Evaluate(memo_index, (void*)&project_1_output_col_1^@^@<85>^@^@^@^@^@^@^@^@uto project_1_output_col_2 = typed_in_col_2;
aut));
} else {

I checked the code, and it looks like something goes wrong during expression projection:
https://github.com/xuechendi/native-sql-engine/blob/wip_aggr_wscg/cpp/src/codegen/arrow_compute/ext/hash_aggregate_kernel.cc#L287

For TPCDS queries, this may happen when the field name expected by the expression is in upper case while the names in the input list are in lower case.

Signed-off-by: Chendi Xue <[email protected]>
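
The upper-case versus lower-case field name mismatch described in the commit message above can be illustrated with a small lookup helper. This is a minimal sketch, not the code in hash_aggregate_kernel.cc: `ToLower` and `FindFieldIndex` are hypothetical names, and field names are assumed to be ASCII.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: lower-case an ASCII name so comparisons ignore case.
std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Returns the index of `name` in `input_fields`, ignoring case, or -1 if the
// field is absent. A caller should check for -1 rather than generate code
// that refers to an unresolved column.
int FindFieldIndex(const std::vector<std::string>& input_fields,
                   const std::string& name) {
  const std::string wanted = ToLower(name);
  for (size_t i = 0; i < input_fields.size(); ++i) {
    if (ToLower(input_fields[i]) == wanted) return static_cast<int>(i);
  }
  return -1;
}
```

With a case-sensitive comparison, looking up `SS_QUANTITY` in a list that holds `ss_quantity` would fail, which is the kind of unresolved reference that can end up as garbage in generated code.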
xuechendi (Collaborator, Author) commented:

@zhouyuan, fix commit submitted.

@zhouyuan zhouyuan merged commit 1699336 into oap-project:master Feb 4, 2021
HongW2019 pushed a commit to HongW2019/gazelle_plugin that referenced this pull request Sep 2, 2021
This commit implements the Native SQL Engine for OAP. 

The key components are:
- Use Apache Arrow as the column vector format for intermediate data exchanged between Spark operators.
- Enable Apache Arrow native readers for Parquet and other formats.
- Leverage Apache Arrow Gandiva/Compute to evaluate columnar expressions with SIMD optimizations.

OAP Native SQL Engine is verified by TPC-H workload as of this commit. Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <[email protected]>
Co-authored-by: Rong Ma <[email protected]>
Co-authored-by: Jiayi Chen <[email protected]>
Co-authored-by: Hongze Zhang <[email protected]>
Co-authored-by: Rui Mo <[email protected]>
Co-authored-by: Yuan Zhou <[email protected]>
Co-authored-by: Binwei Yang <[email protected]>

======================
* ProjectList prepare check and type change

Signed-off-by: Chendi Xue <[email protected]>

* Add new ReadWriteBench

Signed-off-by: Chendi Xue <[email protected]>

* Add ColumnarHashAggregate support

Framework is done and the code runs, but it produces incorrect results.

Signed-off-by: Chendi Xue <[email protected]>

* Add an optimization to skip unnecessary project work

Signed-off-by: Chendi Xue <[email protected]>

* [Bug fix] fixed multiple cols aggregation failing issue

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* Update ApacheArrowInstallation.md

* use arrow version property to record the right version

Signed-off-by: Yuan Zhou <[email protected]>

* integrate columnar shuffle operator and related UT

* Update README.md

* Update ApacheArrowInstallation.md

* Update ApacheArrowInstallation.md

* Apply coding format onto current project

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* [C++]Fix cpp code format

Signed-off-by: Chendi Xue <[email protected]>

* Add files via upload

* Add files via upload

* [C++]Refactoring current cpp codes and change return using vector<RecordBatch>

Signed-off-by: Chendi Xue <[email protected]>

* [C++]Using google-code-style for c++ codes

Signed-off-by: Chendi Xue <[email protected]>

* [JAVA]change java jniWrapper to return a ArrowRecordBatch array

Signed-off-by: Chendi Xue <[email protected]>

* [SCALA]Bug fixing: ColumnarShuffleExchangeExec didn't recursively pass child to next operator.

Signed-off-by: Chendi Xue <[email protected]>

* [C++]Remove Gandiva Protobuf in this project, added in arrow side

Signed-off-by: Chendi Xue <[email protected]>

* [DOC] Fix installation guide after we remove gandiva_protobuf

Signed-off-by: Chendi Xue <[email protected]>

* [C++]Add jni_common.h

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add splitArray function

Adds a function that splits one Array into multiple arrays, one per distinct key.
The code is done and runs with correct results; larger inputs will be tried next.

Reminder: currently only one array can be used as the splitter.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Bug fix when only splitting one array

Signed-off-by: Chendi Xue <[email protected]>

* [C++] refactoring codes to support a visitor chain

Signed-off-by: Chendi Xue <[email protected]>

* add CMakeLists.txt

* [C++] Change splitArray to only use one loop for all arrays

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Noticed a kernel_ext bug and fixed it here; also refined the code a bit

Signed-off-by: Chendi Xue <[email protected]>

* [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions.

Originally, Arrow initialized a hash_table inside the DictEncode and ValueCounts functions,
so multiple arrays could not be processed against the same hash_table. By changing the Arrow
code, we are now able to pass in a long-lived hash_table and get a unified index for all arrays.

Signed-off-by: Chendi Xue <[email protected]>
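
The benefit of a long-lived hash table can be sketched without Arrow. In the hypothetical `SharedDictEncoder` below (not an Arrow API), the table outlives each call, so equal values receive the same index regardless of which array they arrive in.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical encoder: one hash table shared across all encoded arrays.
class SharedDictEncoder {
 public:
  // Maps every value to a dictionary index, adding unseen values to the table.
  std::vector<int32_t> Encode(const std::vector<std::string>& array) {
    std::vector<int32_t> indices;
    indices.reserve(array.size());
    for (const auto& v : array) {
      // emplace leaves an existing entry untouched and returns it.
      auto it = table_.emplace(v, static_cast<int32_t>(table_.size())).first;
      indices.push_back(it->second);
    }
    return indices;
  }

  size_t NumDistinct() const { return table_.size(); }

 private:
  std::unordered_map<std::string, int32_t> table_;
};
```

If a fresh table were created per call, the second array would restart its indices at 0 and the group ids of the two arrays could not be compared.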

* [C++] Add new interface "finish"

This interface is for evaluations that span multiple recordBatches: the output is generated when finish is called.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] add a new appendArrayToBatch function

With this function, we can build a new recordBatch from multiple input recordBatches, which lets us produce a final aggregate result over all of them.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuilder

Signed-off-by: Chendi Xue <[email protected]>

* [c++] Only use dict when splitting array

Signed-off-by: Chendi Xue <[email protected]>

* Update ApacheArrowInstallation.md

* [C++] support groupby aggregate in cpp level

1. added EncodeArray kernel
2. added a finish function mechanism
3. added appendToCache functions
4. splitArrayList uses indices instead of cache the whole list

Signed-off-by: Chendi Xue <[email protected]>

* [Scala/Java] Added support for groupBy aggregate

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add jni support for finish function

Signed-off-by: Chendi Xue <[email protected]>

* Fix on groupby aggregate feature

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Move merge multiple groupby batch into one implementation to CPP

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Continue optimizing groupby hash aggregation by using action

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix DataType issue for HashAggregate

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Optimize GroupBy HashAggregate performance

1. Use the hash_table key column as the uniqueAction input, so uniqueAction does not need to recompute it each time
2. Set the max group id at the beginning of row evaluation, so each action only needs to do the real evaluation work.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Use AppendValues to build Array

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change Makefile using O3, which significantly improves performance

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add Spark Metrics for HashAggregate

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Using MinMax in SplitArrayListWith Action to get max group id

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add a new way to directly access data instead of using Array API
[C++] Using inline lambda instead of using function call in action

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Refactor arrow compute to simplify the code path for calling Eval()

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix bug, ColumnarBatch was not closed before

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Disable ColumnarShuffle, as it appears to cause OOM issues

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change to use cmake instead of makefile

Signed-off-by: Chendi Xue <[email protected]>

* [C++]Add GoogleTests

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add a new unittest and macroed add_test

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change to use groupby_aggregate.h Group func

Rebased intel arrow onto the latest arrow commit, reverted our changes to hash.h, and moved the group function to groupby_aggregate.h

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Sort support

Sorting requires two kernels: sortArraysToIndices produces an indices array, and the remaining arrays then use these sorted_indices to do a shuffle.

Signed-off-by: Chendi Xue <[email protected]>
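
The two-kernel sort in the commit above can be sketched as follows. This is an illustration on `std::vector`, not the Arrow-based kernels; `SortToIndices` and `Shuffle` are hypothetical stand-ins for sortArraysToIndices and the shuffle step, and ascending order without nulls is assumed.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Kernel 1: compute the permutation that sorts `keys` in ascending order.
// stable_sort keeps equal keys in their original relative order.
std::vector<int32_t> SortToIndices(const std::vector<int64_t>& keys) {
  std::vector<int32_t> indices(keys.size());
  std::iota(indices.begin(), indices.end(), 0);
  std::stable_sort(indices.begin(), indices.end(),
                   [&](int32_t a, int32_t b) { return keys[a] < keys[b]; });
  return indices;
}

// Kernel 2: reorder any other column of the batch with the same indices.
template <typename T>
std::vector<T> Shuffle(const std::vector<T>& column,
                       const std::vector<int32_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (int32_t i : indices) out.push_back(column[i]);
  return out;
}
```

Computing the indices once and reusing them means only the key column is compared; every other column is a plain gather.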

* [C++] Pass col type when make SplitArrayListWithAction Kernel

Signed-off-by: Chendi Xue <[email protected]>

* add AppendArrayKernel

This patch adds AppendArrayKernel support.

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] Bug fix, noticed we didn't use the java builder inside this project

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Fix bug when there is finish Func in expr

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Refine the way of extract hash aggregate input expression

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change unittest to use action_dono, so we can get multiple return_type

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Enable ColumnarSort

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add benchmark

Benchmark will read a parquet file from local and do evaluation upon this file

Signed-off-by: Chendi Xue <[email protected]>

* [C++] ShuffleArrayList performance optimization

Use builder directly instead of using array_builder_impl.h

Signed-off-by: Chendi Xue <[email protected]>

* [C++]Add big scale test, batch size is 5176

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add Iterator<RecordBatch> as Finish Return

Currently, we only support returning std::vector<RecordBatch>; this adds returning an iterator as well, to make it more extensible.

Signed-off-by: Chendi Xue <[email protected]>

* [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch>

Signed-off-by: Chendi Xue <[email protected]>

* [BUG FIX] Fix uninitialized row_id bug

Signed-off-by: Chendi Xue <[email protected]>

* [JAVA] adding missing BatchIterator file

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] allow to operate on Long and Double type

Both Gandiva and Arrow Compute support these two types now.

Signed-off-by: Yuan Zhou <[email protected]>

* adding vhashjoin support

This patch adds vhashjoin support w/ below major change:
- Allow to set member set for kernels
- Adding Take&NTake kernels
- Spark columnar plugin for ShuffledHashJoinExec(turned off now)

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] Bug fix in ColumnarAggregate when some columns are trimmed

Signed-off-by: Chendi Xue <[email protected]>

* Implement this feature with two methods:

1. Using utf8 to merge keys -> ConcatArrayKernel
2. Using gandiva to do hash + add -> HashAggrArrayKernel

For now we chose gandiva.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]Some fixing to support ColumnarAggregationWithTwoKeys

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add ColumnarBatchScan Support

With this, we can turn WSCG off when testing the columnar-based process

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a new ColumnarConditionProjector Operator

Signed-off-by: Chendi Xue <[email protected]>

* [CPP & Scala] Add descending and nulls-first support for ColumnarSort

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec

Signed-off-by: Chendi Xue <[email protected]>

* Add an alternative ColumnarJoin implementation (oap-project#71)

* [CPP]ShuffleArrayList kernel fix when null exists

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add Join Benchmark

We used the TPCH lineitem and orders tables to test join; they contain 800+ batches

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] adding a new method for ColumnarJoin

Adds a new kernel called probeArrays, which takes multiple arrays as input one by one and then probes their primary keys with another set of arrays.
shuffleArrayListKernel was also refined, so by combining the two we can join batches from two tables.

Signed-off-by: Chendi Xue <[email protected]>

* [JNI] Add jni support for using ResultIterator.Process

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Spark Columnar Support for ShuffledHashJoin

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add New Unittest and BenchmarkTest for InnerJoin

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Refactor current Join codes and support both right Join and InnerJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]ColumnarShuffledHashJoin Refine for InnerJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]ColumnarAggregate fix for Q4

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <[email protected]>

* fix cond projector without condition (oap-project#75)

should project with resultSchema

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix string support for columnar projection (oap-project#76)

* [Scala] fix string support for columnar projection

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix StringType convert

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] skip projector evaluate if filter has 0 row result

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix possible memory leak

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Add a new Action called CountLiterAction

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support CountLiteral

Signed-off-by: Chendi Xue <[email protected]>

* Wip avg support (oap-project#79)

* [CPP] Enabled groupby avg, AvgByCount and SumCount kernel

Signed-off-by: Chendi Xue <[email protected]>

* [JNI] Add a new interface called setReturnFields

This interface sets the result schema when some expressions return more than one field and the current gandiva expression cannot describe the schema.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] enable groupby avg

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Rewrite Unique Action and add String Support

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Remove Concat Kernel and Action and refine some code

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] String fix

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* Update ApacheArrowInstallation.md

* [CPP] Multiple Key Groupby fix and optimization

Groupby with multiple keys previously returned incorrect results; this commit fixes that.
Also, if the multiple keys are all strings, they are concatenated with gandiva and hashed first, and encodeArray is applied afterwards.
This is a little faster than hashing and adding directly.

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] SplitArray optimization

Move input array from lambda capture to class member, which will improve performance a lot.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support Aggregation with projection inside case (oap-project#86)

By this new fix, we are able to run unmodified TPCH Q1

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] adding support for starts_with & ends_with (oap-project#78)

* [Scala] adding support for starts_with & ends_with

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] adding support for like

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix string like support

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] support substring

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Support String in ColumnarJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] LeftSemi Join support

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Continue fix aggregate issue for Q3

Now Q3 is runnable

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Memory leak issue fixing

Signed-off-by: Chendi Xue <[email protected]>

* [CPP & Scala] Support multiple key join

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add a new interface to get holder current size

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Refine current ConditionProjector codes

1. Use iterator instead of map in ConditionProjector, so we can skip empty columnarBatch as return
2. Fix several bugs and made input schema for condition and project more clear

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a return column size in Columnar AggregateExpression

An aggregate such as avg takes one input column and is expected to return two columns in the partial phase, then takes two input columns and returns one in the final phase. This is also a fix for Q1.

Signed-off-by: Chendi Xue <[email protected]>
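
The change in column count between the two phases of avg can be seen in a small sketch. `SumCount`, `PartialAvg` and `FinalAvg` are hypothetical names; returning 0.0 for empty input is a simplification made here, whereas SQL would return NULL.

```cpp
#include <cstdint>
#include <vector>

// The two columns produced by the partial phase.
struct SumCount {
  double sum;
  int64_t count;
};

// Partial phase: one input column in, two values (sum, count) out.
SumCount PartialAvg(const std::vector<double>& values) {
  SumCount sc{0.0, 0};
  for (double v : values) {
    sc.sum += v;
    ++sc.count;
  }
  return sc;
}

// Final phase: two input columns (partial sums and counts) in, one value out.
double FinalAvg(const std::vector<SumCount>& partials) {
  double sum = 0.0;
  int64_t count = 0;
  for (const auto& p : partials) {
    sum += p.sum;
    count += p.count;
  }
  return count == 0 ? 0.0 : sum / static_cast<double>(count);
}
```

Averaging the partial averages directly would be wrong when partitions differ in size, which is why the partial phase has to carry both the sum and the count.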

* [Scala] ColumnarShuffleHashJoin with Knownfloating expr

Signed-off-by: Chendi Xue <[email protected]>

* [scala] support In (oap-project#91)

* [scala] support In

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix get ordinal for ColumnarIn

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix get ordinal in agg (oap-project#92)

a special fix for Q10
Spark normalizes float/double values when they are used as join keys

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] An attr fix in ColumnarAggregation

Signed-off-by: Chendi Xue <[email protected]>

* Revert "[Scala] fix get ordinal in agg (oap-project#92)"

This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4.

* [Scala] Fix for Q11

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a new expression that collects the subquery result and passes it as a literal to gandiva

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] adding support for extract_year (oap-project#88)

* [Scala] adding support for extract_year

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] cast utf8/int64 to date64 first

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] support DateType for Literal

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] add support for string Contains

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] use string based comparison for datetype

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] clean up

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] null key will be skipped in Groupby Case

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add native ResultIterator support for Groupby HashAggregate

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] ColumnarHashAggregation and ColumnarProjection Refactor

Extracted the current projection code from ColumnarAggregation into a standalone class,
so we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression.

Also added return-by-batch support in ColumnarAggregation, so we won't return too many rows at once,
which may result in a memory leak.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] ColumnarConditionProjection fix after Aggregation Refine

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] extractYear fix to use Int32

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add a new interface to pass selectionVector

1. add selection support to evaluator and resultIterator
2. add selectionVector support to ProbeArrays
3. fix result type issue for aggregate without groupby

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a new interface to pass selectionVector

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Using ConditionProjector to handle condition inside Join

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support condition inside ColumnarJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] A workaround to skip Condition when input doesn't contain this field

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Support multiple same primary key Join

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] shift groupby key hashed value then add to next one

Signed-off-by: Chendi Xue <[email protected]>
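
A minimal sketch of the shift-then-add idea named in the commit title above. The commit does not state the shift width, so the 1-bit shift here is an assumption, and `CombineKeyHashes` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Combines the per-key hashes of one row into a single group hash.
// The running value is shifted before the next hash is added, so the result
// depends on key order; plain addition would give (a, b) and (b, a) the same
// hash. Unsigned overflow wraps around, which is well defined.
uint64_t CombineKeyHashes(const std::vector<uint64_t>& key_hashes) {
  uint64_t combined = 0;
  for (uint64_t h : key_hashes) {
    combined = (combined << 1) + h;
  }
  return combined;
}
```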

* [Scala] support for Not (oap-project#80)

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala & CPP] Fix ColumnarAggregation ResultIterator bug

Originally we used a sliced array in native code; when this array was passed to Java, the slice information was lost, so we got incorrect results.
Now we build the array inside the ResultIterator Next function, and the result is correct.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] support case when (oap-project#100)

* [Scala] support case when

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix EqualTo

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix agg in case when

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] restore BinaryOperator

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] clean up

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] Cast dataType in BinaryOperator

Signed-off-by: Chendi Xue <[email protected]>

* [Scala & CPP] Support Outer Join

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix when aggregationExpression is empty

Signed-off-by: Chendi Xue <[email protected]>

* Wip condition join (oap-project#106)

* [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Move bindReference inside ColumnarConditionProjection

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add Native conditionedJoin

This PR aims to do runtime codegen so we can perform a conditioned join operation:
- Add a new ConditionedShuffleArrayList implementation
- Add a new ConditionedProbeArrays implementation
- Generate a signature for the codegen func, and use the signature to check whether the lib exists
- Add NoneCondition support
- Remove the ShuffleArrayList implementation and change to use ConditionShuffleArrayList
- Remove unused Kernels and Actions

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support new conditionedJoin

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Remove original probeArrays kernel

Signed-off-by: Chendi Xue <[email protected]>

* [scala] support function with in operator (oap-project#107)

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Use original shuffle codes here to improve performance

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Fix AvgByCount bug

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add In Support when doing codegen and forward unknown function

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix a small bug in ColumnarExpressionConverter for Like

Signed-off-by: Chendi Xue <[email protected]>

* Move SparkColumnarPlugin to oap-native-sql folder

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Small fixes (oap-project#1184)

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185)

Signed-off-by: Chendi Xue <[email protected]>

* [DO NOT MERGE]WIP Q2 fix (oap-project#1187)

* [CPP & Scala] Fixed some codes for ConditionedShuffle

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Q2_fix done

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] The last commit removed some code by mistake; fix here

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* [nativesql] fix compile against new arrow (oap-project#1189)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <[email protected]>

* Update ApacheArrowInstallation.md

* [nativesql]Wip spark rebase (oap-project#1202)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <[email protected]>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <[email protected]>

* [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203)

* Copied Arrow Hashing to our repo so new modifications won't break our builds

Signed-off-by: Chendi Xue <[email protected]>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Add protobuf inside native sql

Signed-off-by: Chendi Xue <[email protected]>

Co-authored-by: Yuan Zhou <[email protected]>

* [NativeSql]refactor native parquet reader/writer (oap-project#1205)

* Remove sortArraysToIndices

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql] Move Parquet Reader and Writer into nativeSql

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add a parquet reader and writer adapter

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql] Refactor and move spark side commits to nativeSql

1. move parquet reader logic to nativesql
2. move ArrowWritableColumnVector to nativesql
3. Use postRule to call RowToArrowColumnVector
4. move cpp so to jar
5. remove benchmark folder
6. update readme

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql][CPP] Use CMake to download and compile protobuf

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* ArrowDataSource for Spark (#1226)

* [oap-native-sql]Add Installation Notes (#1231)

* add InstallationNotes to README

* refine

* refine

* refine

* [NativeSql] ClassCastException if non-parquet data source is used (#1238)

* Move ArrowWritableColumnVector from org.apache to com.intel (#1243)

* [DataSource] Compilation error due to multiple source directories (#1244)

* [oap-native-sql]Wip refine protobuf install (#1230)

* [Building] refine protobuf dependency check

 - if not found, download protobuf and statically link to it
 - if found, reuse system level protobuf and dynamically link to it

Signed-off-by: Yuan Zhou <[email protected]>

* [Building] check for dynamic protobuf lib only

Signed-off-by: Yuan Zhou <[email protected]>

* [oap-native-sql][Scala] support date32 (#1225)

* [Scala] support date32

Signed-off-by: Yuan Zhou <[email protected]>

* [C++][Java] Support Date32 in RowToColumn

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] support date32 in unique action

Signed-off-by: Yuan Zhou <[email protected]>

* [Java] fix getUTF8String on Date32

Signed-off-by: Yuan Zhou <[email protected]>

* set C++ 2011 standard (#1236)

* [Scala] fix contain to use is_substr (#1235)

Signed-off-by: Yuan Zhou <[email protected]>

* [Java] fix date32 projection (#1250)

Signed-off-by: Yuan Zhou <[email protected]>

* [NativeSql][Scala] memory leak track and fixes (#1227)

* [NativeSql][Scala] memory leak track and fixes

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql][CPP] A base class should declare its destructor virtual when another class derives from it

Signed-off-by: Chendi Xue <[email protected]>
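
Why the virtual destructor matters for leak fixes can be shown with a small, self-contained sketch. `KernelBase`, `CountingKernel` and `DestroyThroughBase` are hypothetical names and not classes from this project.

```cpp
#include <memory>

struct KernelBase {
  // Virtual, so deleting through a KernelBase pointer also runs the
  // destructor of the derived class.
  virtual ~KernelBase() = default;
};

struct CountingKernel : KernelBase {
  explicit CountingKernel(int* destroyed) : destroyed_(destroyed) {}
  ~CountingKernel() override { ++*destroyed_; }  // stands in for releasing buffers
  int* destroyed_;
};

// Destroys a derived object through a base-class pointer and reports how
// many derived destructors ran.
int DestroyThroughBase() {
  int destroyed = 0;
  {
    std::unique_ptr<KernelBase> kernel(new CountingKernel(&destroyed));
  }  // kernel goes out of scope here
  return destroyed;
}
```

Without `virtual` on the base destructor, deleting through the base pointer is undefined behavior, and in practice the derived cleanup is typically skipped, so resources held by the derived class are not released.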

* [DataSource][Arrow] Suppress exceptions from unexpected types when pushdown filters (#1253)

* Update README googletest installation (#1251)

* [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262)

* [DataSource][Arrow] Output schema mismatch when scanning for zero data columns

* [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values

* [DataSource][Arrow] Update README.md (#1263)

* [DataSource][Arrow] Add assembly build (#1264)

* [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267)

* [oap-native-sql] Calling ColumnVectorUtils.populate(...) on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268)

* [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269)

* [DataSource][Arrow] Update README.md (#1276)

* [DataSource][Arrow] Update README.md (#1279)

* [Scala] adding IsNull support (#1256)

Signed-off-by: Yuan Zhou <[email protected]>

* [oap-native-sql] Add open permission parameter (#1266)

* add open O_CREAT permission mode

* [DataSource][Arrow] Prune pushed filters that access partition columns (#1285)

* [oap-native-sql][Scala]Adding abs support (#1273)

* support abs

* [Building] building with spark-sql from our maven repo (#1249)

Signed-off-by: Yuan Zhou <[email protected]>

* [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288)

* [DataSource][Arrow] File descriptor leak (#1295)

* inset (#1290)

* upper (#1301)

* [oap-native-sql][CI] update travis for native sql (#1294)

* [CI] update travis for native sql

Signed-off-by: Yuan Zhou <[email protected]>

* [CI] fix grammar, use openjdk8

Signed-off-by: Yuan Zhou <[email protected]>

* [CI] update to use python3 env

Signed-off-by: Yuan Zhou <[email protected]>

* [Doc] update readme (#1308)

Signed-off-by: Yuan Zhou <[email protected]>

* coalesce (#1306)

* [oap-native-sql][Scala]adding if support (#1307)

* add IfOperator

* add boolean type

* [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261)

* [NativeSql] ColumnarSort kernel

ColumnarSort is implemented with CodeGeneration method

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Fix compiling issue

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]support date32 in IN expression (#1303)

Signed-off-by: Yuan Zhou <[email protected]>

* adding ASF license (#1331)

Signed-off-by: Yuan Zhou <[email protected]>
HongW2019 pushed a commit to HongW2019/gazelle_plugin that referenced this pull request Sep 2, 2021
This patch implements below main features for Native SQL engine:

- ColumnarExchange support
- runtime codegen for ColumnarShuffledHashJoin/ColumnarSort
- Configurable batch size for Arrow Data Source
- Support more Functions from TPCDS queries

Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <[email protected]>
Co-authored-by: Rong Ma <[email protected]>
Co-authored-by: Jiayi Chen <[email protected]>
Co-authored-by: Hongze Zhang <[email protected]>
Co-authored-by: Rui Mo <[email protected]>
Co-authored-by: Yuan Zhou <[email protected]>
Co-authored-by: Binwei Yang <[email protected]>

=================
* [C++]Add jni_common.h

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add splitArray function

Adds a function that splits one Array into multiple arrays according to a distinguishing key.
The code is done and runs with correct results; larger inputs will be tried next.

Reminder: currently only one array can be used as the splitter.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Bug fix when only splitting one array

Signed-off-by: Chendi Xue <[email protected]>

* [C++] refactoring codes to support a visitor chain

Signed-off-by: Chendi Xue <[email protected]>

* add CMakeLists.txt

* [C++] Change splitArray to only use one loop for all arrays

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Noticed a kernel_ext bug and fixed it here; also refined the code a bit

Signed-off-by: Chendi Xue <[email protected]>

* [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions.

Originally, Arrow initialized a hash_table inside the DictEncode and ValueCounts functions,
so multiple arrays could not be processed against the same hash_table. By changing the Arrow
code, we can now pass in a long-lived hash_table and get a unified index for all arrays.

Signed-off-by: Chendi Xue <[email protected]>
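
The idea of a long-lived hash table that yields a unified index across arrays can be sketched as follows. This is a minimal illustration, not the patched Arrow code: `SharedEncoder` and its methods are hypothetical names, and a plain `std::unordered_map` stands in for Arrow's internal hash table.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a hash table that outlives a single array.
// It is not Arrow's API; it only mirrors the idea described in the commit.
class SharedEncoder {
 public:
  // Map each value to a group id. A value seen in an earlier array keeps
  // its id; an unseen value gets the next free id.
  std::vector<int32_t> Encode(const std::vector<std::string>& values) {
    std::vector<int32_t> ids;
    ids.reserve(values.size());
    for (const auto& v : values) {
      // emplace leaves the table unchanged if the key already exists.
      auto it = table_.emplace(v, static_cast<int32_t>(table_.size())).first;
      ids.push_back(it->second);
    }
    return ids;
  }

  std::size_t num_groups() const { return table_.size(); }

 private:
  std::unordered_map<std::string, int32_t> table_;
};
```

Encoding a second array through the same `SharedEncoder` instance reuses the ids assigned while encoding the first, which is what a per-call table cannot provide.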

* [C++] Add new interface "finish"

This interface is designed for evaluation across multiple recordBatches; the output is generated when finish is called.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] add a new appendArrayToBatch function

Using this function, we can build a new recordbatch based on multiple recordBatch input, then we are able to make a final aggregate result for all.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuilder

Signed-off-by: Chendi Xue <[email protected]>

* [c++] Only use dict when splitting array

Signed-off-by: Chendi Xue <[email protected]>

* Update ApacheArrowInstallation.md

* [C++] support groupby aggregate in cpp level

1. added EncodeArray kernel
2. added a finish function mechanism
3. added appendToCache functions
4. splitArrayList uses indices instead of cache the whole list

Signed-off-by: Chendi Xue <[email protected]>

* [Scala/Java] Added support for groupBy aggregate

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add jni support for finish function

Signed-off-by: Chendi Xue <[email protected]>

* Fix on groupby aggregate feature

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Move merge multiple groupby batch into one implementation to CPP

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Continue optimizing groupby hash aggregation by using action

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix DataType issue for HashAggregate

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Optimize GroupBy HashAggregate performance

1. Used hash_table key column as uniqueAction input, so uniqueAction won't need to calculate each time
2. Set max group id at the beginning of row evaluation, so each action evaluate only need to do the real evaluation work.

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Use AppendValues to build Array

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change Makefile using O3, which significantly improves performance

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add Spark Metrics for HashAggregate

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Using MinMax in SplitArrayListWith Action to get max group id

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add a new way to directly access data instead of using Array API
[C++] Using inline lambda instead of using function call in action

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Refactor arrow compute to simplify the work path of calling Eval()

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix bug, ColumnarBatch was not closed before

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Disable ColumnarShuffle; it appears to cause an OOM issue

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change to use cmake instead of makefile

Signed-off-by: Chendi Xue <[email protected]>

* [C++]Add GoogleTests

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add a new unittest and macroed add_test

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change to use groupby_aggregate.h Group func

Rebased intel arrow to the latest arrow commit, reverted our changes to hash.h, and moved the group function to groupby_aggregate.h

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Sort support

Sorting requires two kernels: sortArraysToIndices produces an indices array, and the remaining arrays then use these sorted_indices to do a shuffle.

Signed-off-by: Chendi Xue <[email protected]>
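
The two-step sort described in this commit (compute indices from the key column, then reorder the remaining columns with them) can be sketched roughly as below. The function names `SortToIndices` and `Shuffle` are illustrative only and do not reproduce the actual kernel signatures; plain `std::vector` columns stand in for Arrow arrays.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Step 1: compute the permutation that sorts `keys` in ascending order.
// stable_sort keeps rows with equal keys in their original order.
std::vector<std::size_t> SortToIndices(const std::vector<int>& keys) {
  std::vector<std::size_t> indices(keys.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::stable_sort(indices.begin(), indices.end(),
                   [&keys](std::size_t l, std::size_t r) {
                     return keys[l] < keys[r];
                   });
  return indices;
}

// Step 2: reorder any other column of the same length with those indices.
template <typename T>
std::vector<T> Shuffle(const std::vector<T>& column,
                       const std::vector<std::size_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (std::size_t i : indices) out.push_back(column[i]);
  return out;
}
```

Computing the permutation once and applying it to every payload column avoids comparing keys again for each column.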

* [C++] Pass col type when make SplitArrayListWithAction Kernel

Signed-off-by: Chendi Xue <[email protected]>

* add AppendArrayKernel

This patch adds AppendArrayKernel support.

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] Bug fix, noticed we didn't use the java builder inside this project

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Fix bug when there is finish Func in expr

Signed-off-by: Chendi Xue <[email protected]>

* [C++/Scala] Refine the way of extract hash aggregate input expression

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Change unittest to use action_dono, so we can get multiple return_type

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Enable ColumnarSort

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add benchmark

Benchmark will read a parquet file from local and do evaluation upon this file

Signed-off-by: Chendi Xue <[email protected]>

* [C++] ShuffleArrayList performance optimization

Use builder directly instead of using array_builder_impl.h

Signed-off-by: Chendi Xue <[email protected]>

* [C++]Add big scale test, batch size is 5176

Signed-off-by: Chendi Xue <[email protected]>

* [C++] Add Iterator<RecordBatch> as Finish Return

Currently, we only supported return as std::vector<RecordBatch>, and I am thinking to add a new way of returning as iterator, to make it more extensible

Signed-off-by: Chendi Xue <[email protected]>

* [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch>

Signed-off-by: Chendi Xue <[email protected]>

* [BUG FIX] Fix uninitialized row_id bug

Signed-off-by: Chendi Xue <[email protected]>

* [JAVA] adding missing BatchIterator file

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] allow to operate on Long and Double type

Both Gandiva and Arrow Compute support these two types now.

Signed-off-by: Yuan Zhou <[email protected]>

* adding vhashjoin support

This patch adds vhashjoin support w/ below major change:
- Allow to set member set for kernels
- Adding Take&NTake kernels
- Spark columnar plugin for ShuffledHashJoinExec(turned off now)

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] Bug fix in ColumnarAggregate when some columns are trimmed

Signed-off-by: Chendi Xue <[email protected]>

* Implement this feature with two method:

1. Using utf8 to merge keys -> ConcatArrayKernel
2. use gandiva to do hash + add -> HashAggrArrayKernel

Now we chose to use gandiva

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]Some fixing to support ColumnarAggregationWithTwoKeys

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add ColumnarBatchScan Support

With this, we can turn WSCG off when testing the columnar-based process

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a new ColumnarConditionProjector Operator

Signed-off-by: Chendi Xue <[email protected]>

* [CPP & Scala] Add descending and nulls-first support for ColumnarSort

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec

Signed-off-by: Chendi Xue <[email protected]>

* Add an alternative ColumnarJoin implementation (oap-project#71)

* [CPP]ShuffleArrayList kernel fix when null exists

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add Join Benchmark

We used tpch lineitem and order table to test join, which contains 800+ batches

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] adding a new method for ColumnarJoin

Add a new kernel called probeArrays, which takes multiple arrays as input one by one and then probes the primary key with another set of arrays.
Also refined shuffleArrayListKernel; by combining the two, we can join batches from two tables together.

Signed-off-by: Chendi Xue <[email protected]>

* [JNI] Add jni support for using ResultIterator.Process

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Spark Columnar Support for ShuffledHashJoin

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add New Unittest and BenchmarkTest for InnerJoin

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Refactor current Join codes and support both right Join and InnerJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]ColumnarShuffledHashJoin Refine for InnerJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]ColumnarAggregate fix for Q4

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <[email protected]>

* fix cond projector without condition (oap-project#75)

should project with resultSchema

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix string support for columnar projection (oap-project#76)

* [Scala] fix string support for columnar projection

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix StringType convert

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] skip projector evaluate if filter has 0 row result

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix possible memory leak

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Add a new Action call CountLiterAction

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support CountLiteral

Signed-off-by: Chendi Xue <[email protected]>

* Wip avg support (oap-project#79)

* [CPP] Enabled groupby avg, AvgByCount and SumCount kernel

Signed-off-by: Chendi Xue <[email protected]>

* [JNI] Add a new interface called setReturnFields

This interface is used to set result Schema when some of expressions return more than one fields and we can't use current gandiva expression to describe the schema.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] enable groupby avg

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Rewrite Unique Action and add String Support

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Remove Concat Kernel and Action, plus some code refinement

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] String fix

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* Update ApacheArrowInstallation.md

* [CPP] Multiple Key Groupby fix and optimization

Noticed that groupby with multiple keys previously returned incorrect results; this commit fixes that.
Also, if the multiple keys are all strings, they are concatenated with gandiva and hashed first, before encodeArray.
This is a little faster than hashing each key directly and adding.

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] SplitArray optimization

Move input array from lambda capture to class member, which will improve performance a lot.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support Aggregation with projection inside case (oap-project#86)

By this new fix, we are able to run unmodified TPCH Q1

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] adding support for starts_with & ends_with (oap-project#78)

* [Scala] adding support for starts_with & ends_with

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] adding support for like

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix string like support

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] support substring

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Support String in ColumnarJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] LeftSemi Join support

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Continue fix aggregate issue for Q3

Now Q3 is runnable

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Memory leak issue fixing

Signed-off-by: Chendi Xue <[email protected]>

* [CPP & Scala] Support multiple key join

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add a new interface to get holder current size

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Refine current ConditionProjector codes

1. Use iterator instead of map in ConditionProjector, so we can skip empty columnarBatch as return
2. Fix several bugs and made input schema for condition and project more clear

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a return column size in Columnar AggregatExpression

A scenario like avg takes one input column and expects two columns as the result in the partial phase, then takes two input columns and expects one in the final phase. This is also a fix for Q1.

Signed-off-by: Chendi Xue <[email protected]>
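
The column-count asymmetry of avg between the partial and final phases can be sketched as follows. This is a simplified illustration under stated assumptions: `SumCount`, `PartialAvg`, and `FinalAvg` are hypothetical names, there is no grouping, and an empty input yields 0.0 here, whereas a SQL engine would return null.

```cpp
#include <cstdint>
#include <vector>

// Intermediate state of avg: two columns (sum, count).
struct SumCount {
  double sum = 0.0;
  int64_t count = 0;
};

// Partial phase: one input column in, two result columns (sum, count) out.
SumCount PartialAvg(const std::vector<double>& values) {
  SumCount state;
  for (double v : values) {
    state.sum += v;
    ++state.count;
  }
  return state;
}

// Final phase: two input columns (sums, counts) in, one result column out.
// Assumption: returns 0.0 for empty input instead of a SQL null.
double FinalAvg(const std::vector<SumCount>& partials) {
  SumCount total;
  for (const auto& p : partials) {
    total.sum += p.sum;
    total.count += p.count;
  }
  return total.count == 0 ? 0.0 : total.sum / static_cast<double>(total.count);
}
```

Because the partial result carries both the sum and the count, partial averages from differently sized partitions merge into the correct overall average.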

* [Scala] ColumnarShuffleHashJoin with Knownfloating expr

Signed-off-by: Chendi Xue <[email protected]>

* [scala] support In (oap-project#91)

* [scala] support In

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix get ordinal for ColumnarIn

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix get ordinal in agg (oap-project#92)

a special fix for Q10
Spark will do normalization when a float/double type is used as the join key

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] A attr fix in ColumnarAggregation

Signed-off-by: Chendi Xue <[email protected]>

* Revert "[Scala] fix get ordinal in agg (oap-project#92)"

This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4.

* [Scala] Fix for Q11

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a new expression that collects the subquery result and passes it as a literal in gandiva

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] adding support for extract_year (oap-project#88)

* [Scala] adding support for extract_year

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] cast utf8/int64 to date64 first

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] support DateType for Literal

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] add support for string Contains

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] use string based comparison for datetype

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] clean up

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] null key will be skipped in Groupby Case

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add native ResultIterator support for Groupby HashAggregate

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] ColumnarHashAggregation and ColumnarProjection Refactor

Extracted current projection codes from ColumnarAggregation and made as a single class,
So we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression.

Also added return-by-batch support in ColumnarAggregation, so we won't return too many lines at once,
which may result in memory leak.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] ColumnarConditionProjection fix after Aggregation Refine

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] extractYear fix to use Int32

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add a new interface to pass selectionVector

1. add selection support to evaluator and resultIterator
2. add selectionVector support to ProbeArrays
3. fix wo/ groupby aggregate result type issue

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Add a new interface to pass selectionVector

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Using ConditionProjector to handler condition inside Join

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support condition inside ColumnarJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] A workaround to skip Condition when input doesn't contain this field

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Support multiple same primary key Join

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] shift groupby key hashed value then add to next one

Signed-off-by: Chendi Xue <[email protected]>
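
A rough sketch of why shifting before adding matters when combining per-key hashes. The shift amount of one bit and the function names are assumptions for illustration; the commit does not state the actual shift or show the code.

```cpp
#include <cstdint>
#include <vector>

// Plain addition is commutative, so the key tuples (a, b) and (b, a)
// always produce the same combined value.
uint64_t PlainAddCombine(const std::vector<uint64_t>& key_hashes) {
  uint64_t combined = 0;
  for (uint64_t h : key_hashes) combined += h;
  return combined;
}

// Shift the running value, then add the next key's hash, so that the
// position of each key affects the result.
// Assumption: a 1-bit shift; the real shift amount is not given in the commit.
uint64_t ShiftAddCombine(const std::vector<uint64_t>& key_hashes) {
  uint64_t combined = 0;
  for (uint64_t h : key_hashes) combined = (combined << 1) + h;
  return combined;
}
```

For the hashes 1 and 2, plain addition gives 3 in either order, while shift-then-add gives 4 for (1, 2) and 5 for (2, 1), so swapped key columns no longer land in the same group.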

* [Scala] support for Not (oap-project#80)

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala & CPP] Fix ColumnarAggregation ResultIterator bug

Originally we used a Slice array in native code, and when this array was passed to Java, the Slice configuration was lost, so we got incorrect results.
Now the array is built inside the ResultIterator Next function, and the result is correct.

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] support case when (oap-project#100)

* [Scala] support case when

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix EqualTo

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix agg in case when

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] restore BinaryOperator

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] clean up

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] Cast dataType in BinaryOperator

Signed-off-by: Chendi Xue <[email protected]>

* [Scala & CPP] Support Outer Join

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix when aggregationExpression is empty

Signed-off-by: Chendi Xue <[email protected]>

* Wip condition join (oap-project#106)

* [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Move bindReference inside ColumnarConditionProjection

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add Native conditionedJoin

This PR aims to do runtime codegen so that we can perform a conditioned join operation:
- Add a new ConditionedShuffleArrayList implementation
- Add a new ConditionedProbeArrays implementation
- Generate a signature for the codegen func, and use the signature to check if the lib exists
- Add NoneCondition support
- Remove the ShuffleArrayList implementation and change to use ConditionShuffleArrayList
- Remove kernels and actions that are no longer in use

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Support new conditionedJoin

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Remove original probeArrays kernel

Signed-off-by: Chendi Xue <[email protected]>

* [scala] support function with in operator (oap-project#107)

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Use original shuffle codes here to improve performance

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Fix AvgByCount bug

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add In Support when doing codegen and forward unknown function

Signed-off-by: Chendi Xue <[email protected]>

* [Scala] Fix a small bug in ColumnarExpressionConverter for Like

Signed-off-by: Chendi Xue <[email protected]>

* Move SparkColumnarPlugin to oap-native-sql folder

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Small fixes (oap-project#1184)

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185)

Signed-off-by: Chendi Xue <[email protected]>

* [DO NOT MERGE]WIP Q2 fix (oap-project#1187)

* [CPP & Scala] Fixed some codes for ConditionedShuffle

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Q2_fix done

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] The last commit removed some code by mistake; fix it here

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* [nativesql] fix compile against new arrow (oap-project#1189)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <[email protected]>

* Update ApacheArrowInstallation.md

* [nativesql]Wip spark rebase (oap-project#1202)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <[email protected]>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <[email protected]>

* [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203)

* Copied Arrow Hashing to our repo so new upstream modifications won't break our builds

Signed-off-by: Chendi Xue <[email protected]>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <[email protected]>

* [CPP] Add protobuf inside native sql

Signed-off-by: Chendi Xue <[email protected]>

Co-authored-by: Yuan Zhou <[email protected]>

* [NativeSql]refactor native parquet reader/writer (oap-project#1205)

* Remove sortArraysToIndices

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql] Move Parquet Reader and Writer into nativeSql

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install

Signed-off-by: Chendi Xue <[email protected]>

* [CPP] Add a parquet reader and writer adapter

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql] Refactor and move spark side commits to nativeSql

1. move parquet reader logic to nativesql
2. move ArrowWritableColumnVector to nativesql
3. Use postRule to call RowToArrowColumnVector
4. move cpp so to jar
5. remove benchmark folder
6. update readme

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql][CPP] Use CMake to download and compile protobuf

Signed-off-by: Chendi Xue <[email protected]>

* Update README.md

* ArrowDataSource for Spark (#1226)

* [oap-native-sql]Add Installation Notes (#1231)

* add InstallationNotes to README

* refine

* refine

* refine

* [NativeSql] ClassCastException if non-parquet data source is used (#1238)

* Move ArrowWritableColumnVector from org.apache to com.intel (#1243)

* [DataSource] Compilation error due to multiple source directories (#1244)

* [oap-native-sql]Wip refine protobuf install (#1230)

* [Building] refine protobuf dependency check

 - if not found, download protobuf and statically link to it
 - if found, reuse system level protobuf and dynamically link to it

Signed-off-by: Yuan Zhou <[email protected]>

* [Building] check for dynamic protobuf lib only

Signed-off-by: Yuan Zhou <[email protected]>

* [oap-native-sql][Scala] support date32 (#1225)

* [Scala] support date32

Signed-off-by: Yuan Zhou <[email protected]>

* [C++][Java] Support Date32 in RowToColumn

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] support date32 in unique action

Signed-off-by: Yuan Zhou <[email protected]>

* [Java] fix getUTF8String on Date32

Signed-off-by: Yuan Zhou <[email protected]>

* set C++ 2011 standard (#1236)

* [Scala] fix contain to use is_substr (#1235)

Signed-off-by: Yuan Zhou <[email protected]>

* [Java] fix date32 projection (#1250)

Signed-off-by: Yuan Zhou <[email protected]>

* [NativeSql][Scala] memory leak track and fixes (#1227)

* [NativeSql][Scala] memory leak track and fixes

Signed-off-by: Chendi Xue <[email protected]>

* [NativeSql][CPP] Another derived class should add virtual to its superclass destructor

Signed-off-by: Chendi Xue <[email protected]>

* [DataSource][Arrow] Suppress exceptions from unexpected types when pushdown filters (#1253)

* Update README googletest installation (#1251)

* [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262)

* [DataSource][Arrow] Output schema mismatch when scanning for zero data columns

* [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values

* [DataSource][Arrow] Update README.md (#1263)

* [DataSource][Arrow] Add assembly build (#1264)

* [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267)

* [oap-native-sql] Calling ColumnVectorUtils.populate(...) on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268)

* [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269)

* [DataSource][Arrow] Update README.md (#1276)

* [DataSource][Arrow] Update README.md (#1279)

* [Scala] adding IsNull support (#1256)

Signed-off-by: Yuan Zhou <[email protected]>

* [oap-native-sql] Add open permission parameter (#1266)

* add open O_CREAT permission mode

* [DataSource][Arrow] Prune pushed filters that access partition columns (#1285)

* [oap-native-sql][Scala]Adding abs support (#1273)

* support abs

* [Building] building with spark-sql from our maven repo (#1249)

Signed-off-by: Yuan Zhou <[email protected]>

* [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288)

* [DataSource][Arrow] File descriptor leak (#1295)

* inset (#1290)

* upper (#1301)

* [oap-native-sql][CI] update travis for native sql (#1294)

* [CI] update travis for native sql

Signed-off-by: Yuan Zhou <[email protected]>

* [CI] fix grammar, use openjdk8

Signed-off-by: Yuan Zhou <[email protected]>

* [CI] update to use python3 env

Signed-off-by: Yuan Zhou <[email protected]>

* [Doc] update readme (#1308)

Signed-off-by: Yuan Zhou <[email protected]>

* coalesce (#1306)

* [oap-native-sql][Scala]adding if support (#1307)

* add IfOperator

* add boolean type

* [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261)

* [NativeSql] ColumnarSort kernel

ColumnarSort is implemented with CodeGeneration method

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Fix compiling issue

Signed-off-by: Chendi Xue <[email protected]>

* [Scala]support date32 in IN expression (#1303)

Signed-off-by: Yuan Zhou <[email protected]>

* adding ASF license (#1331)

Signed-off-by: Yuan Zhou <[email protected]>

* [CI] update to use new oap-master branch (#1342)

Signed-off-by: Yuan Zhou <[email protected]>

* [oap-native-sql]Rewrite ColumnarShuffledHashJoin using CodeGeneration (#1324)

* [oap-native-sql] Rewrite ColumnarShuffledHashJoin using codegeneration

1. Remove unused files after we change to use codegen
2. Change to use SparseHashMap instead of arrow Hashing

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Use java.io.tmpdir or cmake build dir as codegen tmp dir

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Add copyright and change datatype in array_item_index to uint16_t

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Use add_definitions instead of add_compile_definitions

Signed-off-by: Chendi Xue <[email protected]>

* adding concat support (#1328)

* [oap-native-sql][Scala] refine coalesce (#1340)

* [oap-native-sql][Scala] fix null value exception for StringType and DoubleType (#1333)

* [oap-native-sql] Enable mvn package to build native libs (#1341)

* Enable mvn package to build native libs

* [oap-native-sql][Scala] fix attr errors (#1330)

* [oap-native-sql][Scala] adding round support (#1332)

* [oap-native-sql] columnar shuffle (oap-project#1212)

* [Scala/C++] columnar shuffle

* [Scala] sync with arrow-dataset

* [Scala] rebase to spark 3.1.0

* [Scala] fix & rebase to arrow 0.17

* [Java] serializer & typo

* [Scala] fix serializer & add data size SQLMetric

* [NativeSql][c++] Support date type

[NativeSql][Scala] support fall back row-based shuffle

* [NativeSql][Scala] columnar shuffle configurable

* [NativeSql][Scala] serializer reference transfer & fix decompress

[NativeSql][c++] update deprecated

* [NativeSql][Scala] fix writer write columnar batch of 0 rows

* [NativeSql][Scala] read batch num rows metrics

* [NativeSql][Scala] configurable native buffer size

* [NativeSql][c++] optimize

[Scala] ColumnarShuffleExchange filter empty batch

* [NativeSql][Scala] fix extra close

* [NativeSql][Scala] coalesce batch

* [NativeSql][c++] find boost

* [NativeSql][Scala] fallback to use parquet data source

* Revert "[NativeSql][Scala] coalesce batch"

This reverts commit 4b6929920f19769051cc899ed244761bdfb43d47.

* [NativeSql] update README.md

* [NativeSql] ci install boost

* [NativeSql] add missing ASF & reformat

* [NativeSql][Scala] remove WSCG=false

* [oap-native-sql] Add customized batch_size and tmp_dir support (#1362)

* [oap-native-sql] Add API to use customized batchSize through spark config to native

* [oap-native-sql] Initialize ColumnarPluginConfig in operators

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Add customized tmp dir through spark config

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql]Wip optimize sort (#1372)

* [oap-native-sql] Use inplace sort for single key no payload batch

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Add ska_sort for single column with payload and use std::sort in desc case

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Add third party ska-sort

Signed-off-by: Chendi Xue <[email protected]>

* [oap-native-sql] Columnar shuffle I/O Use Configured Disks (#1378)

* [NativeSql] shuffle I/O using spark configuration

* [NativeSql] some cleanup

* [C++] opt hash join (#1377)

Signed-off-by: Yuan Zhou <[email protected]>

* [NativeSql][Scala] compression workaround (#1381)

* [DataSource][Arrow] Support reading dictionary encoded parquet values (#1376)

* [DataSource][Arrow] Support reading dictionary encoded parquet values

* CI uses Intel-bigdata/arrow/native-sql-engine-clean

* [oap-native-sql]Add ColumnarBatch Combination on Shuffle Read Side (#1370)

* [NativeSql][Scala] coalesce batch

* [NativeSql][Scala] use nano metrics

[NativeSql][Scala] add split metric to collect native split + write time, change write time metric to collect concat shuffle temp file time

* [NativeSql][Java] license & indent

* [NativeSql] rebase

* [NativeSql][c++] compress use single thread (#1394)

* [oap-native-sql][C++] extract codegen headers to nativesql_include folder (#1395)

* [C++] extract codegen headers to nativesql_include folder

So this won't conflict with zstd-jni

Signed-off-by: Yuan Zhou <[email protected]>

* [C++] support additional location of libarrow

Signed-off-by: Yuan Zhou <[email protected]>

* [DataSource][Arrow] Reserve buffer bytes from Spark off-heap executio… (#1393)

* [DataSource][Arrow] Reserve buffer bytes from Spark off-heap execution memory pool

* typo

* wip

* [oap-native-sql][Doc] update docs  (#1392)

* [Doc]wip refine doc

Signed-off-by: Yuan Zhou <[email protected]>

* [Doc] refine wording and picture

Signed-off-by: Yuan Zhou <[email protected]>

* [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358)

* [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358)

* [Doc] fix wrong link to core arch picture (#1411)

Signed-off-by: Yuan Zhou <[email protected]>

* [oap-native-sql][Scala] adding cast support (#1312)

* [Scala] adding cast support

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] disable castBIGINT

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] remove cast hack

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] fix getResultAttr in Cast

Signed-off-by: Yuan Zhou <[email protected]>

* [Scala] disable castDECIMAL and cleanup

Signed-off-by: Yuan Zhou <[email protected]>

* [DataSource][Arrow] Error when reading parquet file whose path contains character '%' (#1420)

* [DataSource][Arrow] Follow-up: A test case should be marked ignore (#1422)

* [ArrowDataSource][Scala] allow to specify batch size from Spark (#1416)

Signed-off-by: Yuan Zhou <[email protected]>

Co-authored-by: Chendi.Xue <[email protected]>
Co-authored-by: rongma1997 <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Co-authored-by: JiayiChen785 <[email protected]>
Co-authored-by: Hongze Zhang <[email protected]>
Co-authored-by: Rui Mo <[email protected]>
Co-authored-by: Hongze Zhang <[email protected]>