[NSE-136] Upgrade to Arrow 3.0.0 #107

Note: this patch points the arrow-data-source repo at a private branch so that CI passes. It should be changed back to oap-project/arrow-data-source once the 3.0 branch is merged.
https://github.com/zhztheplayer/arrow-data-source/tree/arrow-3.0.0-rebase

This commit implements the Native SQL Engine for OAP. The key components are:
- Using Apache Arrow as column vector format as intermediate data among Spark operator.
- Enable Apache Arrow native readers for Parquet and other formats.
- Leverage Apache Arrow Gandiva/Compute to evaluate columnar expressions with SIMD optimizations

OAP Native SQL Engine is verified by TPC-H workload as of this commit. Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <[email protected]>
Co-authored-by: Rong Ma <[email protected]>
Co-authored-by: Jiayi Chen <[email protected]>
Co-authored-by: Hongze Zhang <[email protected]>
Co-authored-by: Rui Mo <[email protected]>
Co-authored-by: Yuan Zhou <[email protected]>
Co-authored-by: Binwei Yang <[email protected]>

======================

* ProjectList prepare check and type change Signed-off-by: Chendi Xue <[email protected]>
* Add new ReadWriteBench Signed-off-by: Chendi Xue <[email protected]>
* Add ColumnarHashAggregate support Framework done, Codes workable, saw fault result Signed-off-by: Chendi Xue <[email protected]>
* Add an optimization to skip unnecessary project work Signed-off-by: Chendi Xue <[email protected]>
* [Bug fix] fixed multiple cols aggregation failing issue Signed-off-by: Chendi Xue <[email protected]>
* Update README.md
* Update ApacheArrowInstallation.md
* use arrow version property to record the right version Signed-off-by: Yuan Zhou <[email protected]>
* integrate columnar shuffle operator and relative UT
* Update README.md
* Update ApacheArrowInstallation.md
* Update ApacheArrowInstallation.md
* Apply coding format onto current project Signed-off-by: Chendi Xue <[email protected]>
* Update README.md
* [C++]Fix cpp code format Signed-off-by: Chendi Xue <[email protected]>
* Add files via upload
* Add files via upload
* [C++]Refactoring current cpp codes and change return using vector<RecordBatch> Signed-off-by: Chendi Xue <[email protected]>
* [C++]Using google-code-style for c++ codes Signed-off-by: Chendi Xue <[email protected]>
* [JAVA]change java jniWrapper to return a ArrowRecordBatch array Signed-off-by: Chendi Xue <[email protected]>
* [SCALA]Bug fixing: ColumnarShuffleExchangeExec didn't recursively pass child to next operator. Signed-off-by: Chendi Xue <[email protected]>
* [C++]Remove Gandiva Protobuf in this project, added in arrow side Signed-off-by: Chendi Xue <[email protected]>
* [DOC] Fix installation guide after we remove gandiva_protobuf Signed-off-by: Chendi Xue <[email protected]>
* [C++]Add jni_common.h Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add splitArray function Aim to add a function to split one Array into multiple arrays with distinguish key. Codes are done, runable with correct result, will try bigger input. reminder: current we can only use one array as splitter. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Big fix when only splitting one array Signed-off-by: Chendi Xue <[email protected]>
* [C++] refactoring codes to support a visitor chain Signed-off-by: Chendi Xue <[email protected]>
* add CMakeLists.txt
* [C++] Change splitArray to only use one loop for all arrays Signed-off-by: Chendi Xue <[email protected]>
* [C++] Noticed a kernel_ext bug, fix here, also refined a bit codes Signed-off-by: Chendi Xue <[email protected]>
* [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions. Original, arrow will initialize a hash_table inside DictEncode function and ValueCounts function which leads to multiple array can't be processed based on same hash_table, and by changing arrow code, we now be able to pass a long live hash_table get an unified index for all arrays. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add new interface "finish" This interface is designed to evaluate based on multiple recordBatch will generate a output when calling finish Signed-off-by: Chendi Xue <[email protected]>
* [C++] add a new appendArrayToBatch function Using this function, we can build a new recordbatch based on multiple recordBatch input, then we are able to make a final aggregate result for all. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuider Signed-off-by: Chendi Xue <[email protected]>
* [c++] Only use dict when splitting array Signed-off-by: Chendi Xue <[email protected]>
* Update ApacheArrowInstallation.md
* [C++] support groupby aggregate in cpp level 1. added EncodeArray kernel 2. added a finish function mechanism 3. added appendToCache functions 4. splitArrayList uses indices instead of cache the whole list Signed-off-by: Chendi Xue <[email protected]>
* [Scala/Java] Added support for groupBy aggregate Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add jni support for finish function Signed-off-by: Chendi Xue <[email protected]>
* Fix on groupby aggregate feature Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Move merge multiple groupby batch into one implementation to CPP Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Continuelly optimize groupby hash aggregation by using action Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Fix DataType issue for HashAggregate Signed-off-by: Chendi Xue <[email protected]>
* [C++] Optimize GroupBy HashAggregate performance 1. Used hash_table key column as uniqueAction input, so uniqueAction won't need to calculate each time 2. Set max group id at the beginning of row evaluation, so each action evaluate only need to do the real evaluation work. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Use AppendValues to build Array Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change Makefile using O3, which significantly improves performance Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Add Spark Metrics for HashAggregate Signed-off-by: Chendi Xue <[email protected]>
* [C++] Using MinMax in SplitArrayListWith Action to get max group id Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add a new way to directly access data instead of using Array API [C++] Using inline lambda instead of using function call in action Signed-off-by: Chendi Xue <[email protected]>
* [C++] Refactor arrow compute to simpify the workpath of calling Eval() Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Fix bug, ColumnarBatch was not closed before Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Disable ColumnarShuffle, looks it will cause OOM issue Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change to use cmake instead of makefile Signed-off-by: Chendi Xue <[email protected]>
* [C++]Add GoogleTests Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add a new unittest and macroed add_test Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change to use groupby_aggregate.h Group func Rebased intel arrow to lastest arrow commit and revert our changes to hash.h, and move group function to groupby_aggregate.h Signed-off-by: Chendi Xue <[email protected]>
* [C++] Sort support To use sort, it requires two kernels, sortArraysToIndices will cook an indices array, then rest arrays can use this sorted_indices to do a shuffle. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Pass col type when make SplitArrayListWithAction Kernel Signed-off-by: Chendi Xue <[email protected]>
* add AppendArrayKernel This patch adds AppendArrayKernel support. Signed-off-by: Yuan Zhou <[email protected]>
* [C++] Bug fix, noticed we didn't use the java builder inside this project Signed-off-by: Chendi Xue <[email protected]>
* [C++] Fix bug when there is finish Func in expr Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Refine the way of extract hash aggregate input expression Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change unittest to use action_dono, so we can get multiple return_type Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Enable ColumnarSort Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add benchmark Benchmark will read a parquet file from local and do evaluation upon this file Signed-off-by: Chendi Xue <[email protected]>
* [C++] ShuffleArrayList performance optimization Use builder directly instead of using array_builder_impl.h Signed-off-by: Chendi Xue <[email protected]>
* [C++]Add big scale test, batch size is 5176 Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add Iterator<RecordBatch> as Finish Return Currently, we only supported return as std::vector<RecordBatch>, and I am thinking to add a new way of returning as iterator, to make it more extensible Signed-off-by: Chendi Xue <[email protected]>
* [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch> Signed-off-by: Chendi Xue <[email protected]>
* [BUG FIX] Fix uninitialized row_id bug Signed-off-by: Chendi Xue <[email protected]>
* [JAVA] adding missing BatchIterator file Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] allow to operate on Long and Double type Both Gandiva and Arrow Compute support these two types now. Signed-off-by: Yuan Zhou <[email protected]>
* adding vhashjoin support This patch adds vhashjoin support w/ below major change: - Allow to set member set for kernels - Adding Take&NTake kernels - Spark columnar plugin for ShuffledHashJoinExec(turned off now) Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] Bug fix in ColumnarAggregate when some column will be trimed Signed-off-by: Chendi Xue <[email protected]>
* Implement this feature with two method: 1. Using utf8 to merge keys -> ConcatArrayKernel 2. use gandiva to do hash + add -> HashAggrArrayKernel Now we chose to use gandiva Signed-off-by: Chendi Xue <[email protected]>
* [Scala]Some fixing to support ColumnarAggregationWithTwoKeys Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Add ColumnarBatchScan Support By using which, we can use WSCG off when testing columnarBased process Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Add a new ColumnarConditionProjector Operator Signed-off-by: Chendi Xue <[email protected]>
* [CPP & Scala] Add desend and null first support for ColumnarSort Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec Signed-off-by: Chendi Xue <[email protected]>
* Add an alternative ColumnarJoin implementation (oap-project#71)
* [CPP]ShuffleArrayList kernel fix when null exists Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add Join Benchmark We used tpch lineitem and order table to test join, which contains 800+ batches Signed-off-by: Chendi Xue <[email protected]>
* [CPP] adding a new method for ColumnarJoin Add a new kernel called probeArrays, which is used to input multiple arrays one by one, then probe primary key by another sets of arrays. And also refined shuffleArrayListKernel, so by combining this two, we can join batches from two table together. Signed-off-by: Chendi Xue <[email protected]>
* [JNI] Add jni support for using ResultIterator.Process Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Spark Columnar Support for ShuffledHashJoin Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add New Unittest and BenchmarkTest for InnerJoin Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Refactor current Join codes and support both right Join and InnerJoin Signed-off-by: Chendi Xue <[email protected]>
* [Scala]ColumnarShuffledHashJoin Refine for InnerJoin Signed-off-by: Chendi Xue <[email protected]>
* [Scala]ColumnarAggregate fix for Q4 Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <[email protected]>
* fix cond projector without condition (oap-project#75) should project with resultSchema Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix string support for columnar projection (oap-project#76)
* [Scala] fix string support for columnar projection Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix StringType convert Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] skip projector evaluate if filter has 0 row result Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix possible memory leak Signed-off-by: Yuan Zhou <[email protected]>
* [CPP] Add a new Action call CountLiterAction Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Support CountLiteral Signed-off-by: Chendi Xue <[email protected]>
* Wip avg support (oap-project#79)
* [CPP] Enabled groupby avg, AvgByCount and SumCount kernel Signed-off-by: Chendi Xue <[email protected]>
* [JNI] Add a new interface called setReturnFields This interface is used to set result Schema when some of expressions return more than one fields and we can't use current gandiva expression to describe the schema. Signed-off-by: Chendi Xue <[email protected]>
* [Scala] enable groupby avg Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Rewrite Unique Action and add String Support Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Remove Concat Kernal and Action and some codes refine Signed-off-by: Chendi Xue <[email protected]>
* [Scala] String fix Signed-off-by: Chendi Xue <[email protected]>
* Update README.md
* Update ApacheArrowInstallation.md
* [CPP] Multiple Key Groupby fix and optimization Noticed before groupby with multiple key returns incorrect result, and this commit will fix this Also if multiple keys are all string, I will concat them with gandiva and do a hash firstly then doing encodeArray. By doing which, will be a little faster then directly hash and add Signed-off-by: Chendi Xue <[email protected]>
* [CPP] SplitArray optimization Move input array from lambda capture to class member, which will improve performance a lot. Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Support Aggregation with projection inside case (oap-project#86) By this new fix, we are able to run unmodified TPCH Q1 Signed-off-by: Chendi Xue <[email protected]>
* [Scala] adding support for starts_with & ends_with (oap-project#78)
* [Scala] adding support for starts_with & ends_with Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] adding support for like Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix string like support Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] support substring Signed-off-by: Yuan Zhou <[email protected]>
* [CPP] Support String in ColumnarJoin Signed-off-by: Chendi Xue <[email protected]>
* [Scala] LeftSemi Join support Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Continue fix aggregate issue for Q3 Now Q3 is runable Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Memory leak issue fixing Signed-off-by: Chendi Xue <[email protected]>
* [CPP & Scala] Support multiple key join Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add a new interface to get holder current size Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Refine current ConditionProjector codes 1. Use iterator instead of map in ConditionProjector, so we can skip empty columnarBatch as return 2. Fix several bugs and made input schema for condition and project more clear Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Add a return column size in Columnar AggregatExpression Since we may have one scenario like avg, which inputs one col and expected two column as return in partial phase and input two cols and expect one at final phase. Which is also a fix for Q1 Signed-off-by: Chendi Xue <[email protected]>
* [Scala] ColumnarShuffleHashJoin with Knownfloating expr Signed-off-by: Chendi Xue <[email protected]>
* [scala] support In (oap-project#91)
* [scala] support In Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix get ordinal for ColumnarIn Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix get ordinal in agg (oap-project#92) a special fix for Q10 Spark will do normalization when float/doubt type as join key Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] A attr fix in ColumnarAggregation Signed-off-by: Chendi Xue <[email protected]>
* Revert "[Scala] fix get ordinal in agg (oap-project#92)" This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4.
* [Scala] Fix for Q11 Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Add a new expression who will collect subquery result and as literal in gandiva Signed-off-by: Chendi Xue <[email protected]>
* [Scala] adding support for extract_year (oap-project#88)
* [Scala] adding support for extract_year Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] cast utf8/int64 to date64 first Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] support DateType for Literal Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] add support for string Contains Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] use string based comparison for datetype Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] clean up Signed-off-by: Yuan Zhou <[email protected]>
* [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support Signed-off-by: Chendi Xue <[email protected]>
* [CPP] null key will be skipped in Groupby Case Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add native ResultIterator support for Groupby HashAggregate Signed-off-by: Chendi Xue <[email protected]>
* [Scala] ColumnarHashAggregation and ColumnarProjection Refactor Extracted current projection codes from ColumnarAggregation and made as a single class, So we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression. Also added return by batch support in ColumnarAggregation, so we won't return too much lines which may result in memory leak. Signed-off-by: Chendi Xue <[email protected]>
* [Scala] ColumnarConditionProjection fix after Aggregation Refine Signed-off-by: Chendi Xue <[email protected]>
* [Scala] extractYear fix to use Int32 Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add a new interface to pass selectionVector 1. add selection support to evaluator and resultIterator 2. add selectionVector support to ProbeArrays 3. fix wo/ groupby aggregate result type issue Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Add a new interface to pass selectionVector Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Using ConditionProjector to handler condition inside Join Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Support condition inside ColumnarJoin Signed-off-by: Chendi Xue <[email protected]>
* [Scala] A walkaround to skip Condition when input doesn't contain this field Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Support multiple same primary key Join Signed-off-by: Chendi Xue <[email protected]>
* [CPP] shift groupby key hashed value then add to next one Signed-off-by: Chendi Xue <[email protected]>
* [Scala] support for Not (oap-project#80) Signed-off-by: Yuan Zhou <[email protected]>
* [Scala & CPP] Fix ColumnarAggregation ResultIterator bug Original we used Slice array in native codes, and when we pass this array to Java, Slice configuration will be lost so we are getting incorrect result. Now we changed to build array inside ResultIterator Next function, and result is correct now. Signed-off-by: Chendi Xue <[email protected]>
* [Scala] support case when (oap-project#100)
* [Scala] support case when Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix EquealTo Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] fix agg in case when Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] restore BinaryOperator Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] clean up Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] Cast dataType in BinaryOperator Signed-off-by: Chendi Xue <[email protected]>
* [Scala & CPP] Support Outer Join Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Fix when aggregationExpression is empty Signed-off-by: Chendi Xue <[email protected]>
* Wip condition join (oap-project#106)
* [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Move bindReference inside ColumnarConditionProjection Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add Native conditionedJoin This PR is aim to do runtime codegen so we can perform a conditioned join operation, Add a new ConditionedShuffleArrayList implementation Add a new ConditionedProbeArrays implementation Generate signature for codegen func, and use signature to check if lib exists Add NoneCondition Support Remove ShuffleArrayList implementation and change to use ConditionShuffleArrayList Remove not in use Kernels and Actions Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Support new conditionedJoin Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Remove original probeArrays kernel Signed-off-by: Chendi Xue <[email protected]>
* [scala] support function with in operator (oap-project#107) Signed-off-by: Yuan Zhou <[email protected]>
* [CPP] Use original shuffle codes here to improve performance Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Fix AvgByCount bug Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add In Support when doing codegen and forward unknown function Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Fix a small bug in ColumnarExpressionConverter for Like Signed-off-by: Chendi Xue <[email protected]>
* Move SparkColumnarPlugin to oap-native-sql folder Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Small fixes (oap-project#1184) Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185) Signed-off-by: Chendi Xue <[email protected]>
* [DO NOT MERGE]WIP Q2 fix (oap-project#1187)
* [CPP & Scala] Fixed some codes for ConditionedShuffle Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Q2_fix done Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Last commit invoked some mis-remove, fix here Signed-off-by: Chendi Xue <[email protected]>
* Update README.md
* [nativesql] fix compile against new arrow (oap-project#1189)
* [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <[email protected]>
* [C++] fix compile warning Signed-off-by: Yuan Zhou <[email protected]>
* [C++] remove unused headers Signed-off-by: Yuan Zhou <[email protected]>
* Update ApacheArrowInstallation.md
* [nativesql]Wip spark rebase (oap-project#1202)
* [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <[email protected]>
* [C++] fix compile warning Signed-off-by: Yuan Zhou <[email protected]>
* [C++] remove unused headers Signed-off-by: Yuan Zhou <[email protected]>
* [scala] fix spark reabasing Signed-off-by: Yuan Zhou <[email protected]>
* [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203)
* Copied Arrow Hashing to our repo so newly modification won't break our builds Signed-off-by: Chendi Xue <[email protected]>
* [scala] fix spark reabasing Signed-off-by: Yuan Zhou <[email protected]>
* [CPP] Add protobuf inside native sql Signed-off-by: Chendi Xue <[email protected]> Co-authored-by: Yuan Zhou <[email protected]>
* [NativeSql]refactor native parquet reader/writer (oap-project#1205)
* Remove sortArraysToIndices Signed-off-by: Chendi Xue <[email protected]>
* [NativeSql] Move Parquet Reader and Writer into nativeSql Signed-off-by: Chendi Xue <[email protected]>
* [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install Signed-off-by: Chendi Xue <[email protected]>
* [CPP] Add a parquet reader and writer adapter Signed-off-by: Chendi Xue <[email protected]>
* [NativeSql] Refactor and move spark side commits to nativeSql 1. move parquet reader logic to nativesql 2. move ArrowWritableColumnVector to nativesql 3. Use postRule to call RowToArrowColumnVector 4. move cpp so to jar 5. remove benchmark folder 6. update readme Signed-off-by: Chendi Xue <[email protected]>
* [NativeSql][CPP] Use CMake to download and compile protobuf Signed-off-by: Chendi Xue <[email protected]>
* Update README.md
* ArrowDataSource for Spark (#1226)
* [oap-native-sql]Add Installation Notes (#1231)
* add InstallationNotes to README
* refine
* refine
* refine
* [NativeSql] ClassCastException if non-parquet data source is used (#1238)
* Move ArrowWritableColumnVector from org.apache to com.intel (#1243)
* [DataSource] Compilation error due to multiple source directories (#1244)
* [oap-native-sql]Wip refine protobuf install (#1230)
* [Building] refine protobuf dependency check - if not found, download protobuf and statically link to it - if found, reuse system level protobuf and dynamically link to it Signed-off-by: Yuan Zhou <[email protected]>
* [Building] check for dynamic protobuf lib only Signed-off-by: Yuan Zhou <[email protected]>
* [oap-native-sql][Scala] support date32 (#1225)
* [Scala] support date32 Signed-off-by: Yuan Zhou <[email protected]>
* [C++][Java] Support Date32 in RowToColumn Signed-off-by: Yuan Zhou <[email protected]>
* [C++] support date32 in unique action Signed-off-by: Yuan Zhou <[email protected]>
* [Java] fix getUTF8String on Date32 Signed-off-by: Yuan Zhou <[email protected]>
* set C++ 2011 standard (#1236)
* [Scala] fix contain to use is_substr (#1235) Signed-off-by: Yuan Zhou <[email protected]>
* [Java] fix date32 projection (#1250) Signed-off-by: Yuan Zhou <[email protected]>
* [NativeSql][Scala] memory leak track and fixes (#1227)
* [NativeSql][Scala] memory leak track and fixes Signed-off-by: Chendi Xue <[email protected]>
* [NativeSql][CPP] Another derived class should add virtual to its super destruction func Signed-off-by: Chendi Xue <[email protected]>
* [DataSource][Arrow] Supress exceptions from unexpected types when pushdown filters (#1253)
* Update README googletest installation (#1251)
* [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262)
* [DataSource][Arrow] Output schema mismatch when scanning for zero data columns
* [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values
* [DataSource][Arrow] Update README.md (#1263)
* [DataSource][Arrow] Add assembly build (#1264)
* [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267)
* [oap-native-sql] Calling ColumnVectorUtils.populate(...) on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268)
* [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269)
* [DataSource][Arrow] Update README.md (#1276)
* [DataSource][Arrow] Update README.md (#1279)
* [Scala] adding IsNull support (#1256) Signed-off-by: Yuan Zhou <[email protected]>
* [oap-native-sql] Add open permission parameter (#1266)
* add open O_CREAT permission mode
* [DataSource][Arrow] Prune pushed filters that access partition columns (#1285)
* [oap-native-sql][Scala]Adding abs support (#1273)
* support abs
* [Building] building with spark-sql from our maven repo (#1249) Signed-off-by: Yuan Zhou <[email protected]>
* [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288)
* [DataSource][Arrow] File descriptor leak (#1295)
* inset (#1290)
* upper (#1301)
* [oap-native-sql][CI] update travis for native sql (#1294)
* [CI] update travis for native sql Signed-off-by: Yuan Zhou <[email protected]>
* [CI] fix grammar, use openjdk8 Signed-off-by: Yuan Zhou <[email protected]>
* [CI] update to use python3 env Signed-off-by: Yuan Zhou <[email protected]>
* [Doc] update readme (#1308) Signed-off-by: Yuan Zhou <[email protected]>
* coalesce (#1306)
* [oap-native-sql][Scala]adding if support (#1307)
* add IfOperator
* add boolean type
* [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261)
* [NativeSql] ColumnarSort kernel ColumnarSort is implemented with CodeGeneration method Signed-off-by: Chendi Xue <[email protected]>
* [oap-native-sql] Fix compiling issue Signed-off-by: Chendi Xue <[email protected]>
* [Scala]support date32 in IN epxression (#1303) Signed-off-by: Yuan Zhou <[email protected]>
* adding ASF license (#1331) Signed-off-by: Yuan Zhou <[email protected]>

This patch implements below main features for Native SQL engine:
- ColumnarExchange support
- runtime codegen for ColumnarShuffledHashJoin/ColumnarSort
- Configurable batch size for Arrow Data Source
- Support more Functions from TPCDS queries

Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <[email protected]>
Co-authored-by: Rong Ma <[email protected]>
Co-authored-by: Jiayi Chen <[email protected]>
Co-authored-by: Hongze Zhang <[email protected]>
Co-authored-by: Rui Mo <[email protected]>
Co-authored-by: Yuan Zhou <[email protected]>
Co-authored-by: Binwei Yang <[email protected]>

=================

* [C++]Add jni_common.h Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add splitArray function Aim to add a function to split one Array into multiple arrays with distinguish key. Codes are done, runable with correct result, will try bigger input. reminder: current we can only use one array as splitter. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Big fix when only splitting one array Signed-off-by: Chendi Xue <[email protected]>
* [C++] refactoring codes to support a visitor chain Signed-off-by: Chendi Xue <[email protected]>
* add CMakeLists.txt
* [C++] Change splitArray to only use one loop for all arrays Signed-off-by: Chendi Xue <[email protected]>
* [C++] Noticed a kernel_ext bug, fix here, also refined a bit codes Signed-off-by: Chendi Xue <[email protected]>
* [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions. Original, arrow will initialize a hash_table inside DictEncode function and ValueCounts function which leads to multiple array can't be processed based on same hash_table, and by changing arrow code, we now be able to pass a long live hash_table get an unified index for all arrays. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add new interface "finish" This interface is designed to evaluate based on multiple recordBatch will generate a output when calling finish Signed-off-by: Chendi Xue <[email protected]>
* [C++] add a new appendArrayToBatch function Using this function, we can build a new recordbatch based on multiple recordBatch input, then we are able to make a final aggregate result for all. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuider Signed-off-by: Chendi Xue <[email protected]>
* [c++] Only use dict when splitting array Signed-off-by: Chendi Xue <[email protected]>
* Update ApacheArrowInstallation.md
* [C++] support groupby aggregate in cpp level 1. added EncodeArray kernel 2. added a finish function mechanism 3. added appendToCache functions 4. splitArrayList uses indices instead of cache the whole list Signed-off-by: Chendi Xue <[email protected]>
* [Scala/Java] Added support for groupBy aggregate Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add jni support for finish function Signed-off-by: Chendi Xue <[email protected]>
* Fix on groupby aggregate feature Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Move merge multiple groupby batch into one implementation to CPP Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Continuelly optimize groupby hash aggregation by using action Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Fix DataType issue for HashAggregate Signed-off-by: Chendi Xue <[email protected]>
* [C++] Optimize GroupBy HashAggregate performance 1. Used hash_table key column as uniqueAction input, so uniqueAction won't need to calculate each time 2. Set max group id at the beginning of row evaluation, so each action evaluate only need to do the real evaluation work. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Use AppendValues to build Array Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change Makefile using O3, which significantly improves performance Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Add Spark Metrics for HashAggregate Signed-off-by: Chendi Xue <[email protected]>
* [C++] Using MinMax in SplitArrayListWith Action to get max group id Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add a new way to directly access data instead of using Array API [C++] Using inline lambda instead of using function call in action Signed-off-by: Chendi Xue <[email protected]>
* [C++] Refactor arrow compute to simpify the workpath of calling Eval() Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Fix bug, ColumnarBatch was not closed before Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Disable ColumnarShuffle, looks it will cause OOM issue Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change to use cmake instead of makefile Signed-off-by: Chendi Xue <[email protected]>
* [C++]Add GoogleTests Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add a new unittest and macroed add_test Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change to use groupby_aggregate.h Group func Rebased intel arrow to lastest arrow commit and revert our changes to hash.h, and move group function to groupby_aggregate.h Signed-off-by: Chendi Xue <[email protected]>
* [C++] Sort support To use sort, it requires two kernels, sortArraysToIndices will cook an indices array, then rest arrays can use this sorted_indices to do a shuffle. Signed-off-by: Chendi Xue <[email protected]>
* [C++] Pass col type when make SplitArrayListWithAction Kernel Signed-off-by: Chendi Xue <[email protected]>
* add AppendArrayKernel This patch adds AppendArrayKernel support. Signed-off-by: Yuan Zhou <[email protected]>
* [C++] Bug fix, noticed we didn't use the java builder inside this project Signed-off-by: Chendi Xue <[email protected]>
* [C++] Fix bug when there is finish Func in expr Signed-off-by: Chendi Xue <[email protected]>
* [C++/Scala] Refine the way of extract hash aggregate input expression Signed-off-by: Chendi Xue <[email protected]>
* [C++] Change unittest to use action_dono, so we can get multiple return_type Signed-off-by: Chendi Xue <[email protected]>
* [Scala] Enable ColumnarSort Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add benchmark Benchmark will read a parquet file from local and do evaluation upon this file Signed-off-by: Chendi Xue <[email protected]>
* [C++] ShuffleArrayList performance optimization Use builder directly instead of using array_builder_impl.h Signed-off-by: Chendi Xue <[email protected]>
* [C++]Add big scale test, batch size is 5176 Signed-off-by: Chendi Xue <[email protected]>
* [C++] Add Iterator<RecordBatch> as Finish Return Currently, we only supported return as std::vector<RecordBatch>, and I am thinking to add a new way of returning as iterator, to make it more extensible Signed-off-by: Chendi Xue <[email protected]>
* [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch> Signed-off-by: Chendi Xue <[email protected]>
* [BUG FIX] Fix uninitialized row_id bug Signed-off-by: Chendi Xue <[email protected]>
* [JAVA] adding missing BatchIterator file Signed-off-by: Yuan Zhou <[email protected]>
* [Scala] allow to operate on Long and Double type Both Gandiva and Arrow Compute support these two types now.
Signed-off-by: Yuan Zhou <[email protected]> * adding vhashjoin support This patch adds vhashjoin support w/ below major change: - Allow to set member set for kernels - Adding Take&NTake kernels - Spark columnar plugin for ShuffledHashJoinExec(turned off now) Signed-off-by: Yuan Zhou <[email protected]> * [Scala] Bug fix in ColumnarAggregate when some column will be trimed Signed-off-by: Chendi Xue <[email protected]> * Implement this feature with two method: 1. Using utf8 to merge keys -> ConcatArrayKernel 2. use gandiva to do hash + add -> HashAggrArrayKernel Now we chose to use gandiva Signed-off-by: Chendi Xue <[email protected]> * [Scala]Some fixing to support ColumnarAggregationWithTwoKeys Signed-off-by: Chendi Xue <[email protected]> * [Scala] Add ColumnarBatchScan Support By using which, we can use WSCG off when testing columnarBased process Signed-off-by: Chendi Xue <[email protected]> * [Scala] Add a new ColumnarConditionProjector Operator Signed-off-by: Chendi Xue <[email protected]> * [CPP & Scala] Add desend and null first support for ColumnarSort Signed-off-by: Chendi Xue <[email protected]> * [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec Signed-off-by: Chendi Xue <[email protected]> * Add an alternative ColumnarJoin implementation (oap-project#71) * [CPP]ShuffleArrayList kernel fix when null exists Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add Join Benchmark We used tpch lineitem and order table to test join, which contains 800+ batches Signed-off-by: Chendi Xue <[email protected]> * [CPP] adding a new method for ColumnarJoin Add a new kernel called probeArrays, which is used to input multiple arrays one by one, then probe primary key by another sets of arrays. And also refined shuffleArrayListKernel, so by combining this two, we can join batches from two table together. 
Signed-off-by: Chendi Xue <[email protected]> * [JNI] Add jni support for using ResultIterator.Process Signed-off-by: Chendi Xue <[email protected]> * [Scala] Spark Columnar Support for ShuffledHashJoin Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add New Unittest and BenchmarkTest for InnerJoin Signed-off-by: Chendi Xue <[email protected]> * [CPP] Refactor current Join codes and support both right Join and InnerJoin Signed-off-by: Chendi Xue <[email protected]> * [Scala]ColumnarShuffledHashJoin Refine for InnerJoin Signed-off-by: Chendi Xue <[email protected]> * [Scala]ColumnarAggregate fix for Q4 Signed-off-by: Chendi Xue <[email protected]> * [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <[email protected]> * fix cond projector without condition (oap-project#75) should project with resultSchema Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix string support for columnar projection (oap-project#76) * [Scala] fix string support for columnar projection Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix StringType convert Signed-off-by: Yuan Zhou <[email protected]> * [Scala] skip projector evaluate if filter has 0 row result Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix possible memory leak Signed-off-by: Yuan Zhou <[email protected]> * [CPP] Add a new Action call CountLiterAction Signed-off-by: Chendi Xue <[email protected]> * [Scala] Support CountLiteral Signed-off-by: Chendi Xue <[email protected]> * Wip avg support (oap-project#79) * [CPP] Enabled groupby avg, AvgByCount and SumCount kernel Signed-off-by: Chendi Xue <[email protected]> * [JNI] Add a new interface called setReturnFields This interface is used to set result Schema when some of expressions return more than one fields and we can't use current gandiva expression to describe the schema. 
Signed-off-by: Chendi Xue <[email protected]> * [Scala] enable groupby avg Signed-off-by: Chendi Xue <[email protected]> * [CPP] Rewrite Unique Action and add String Support Signed-off-by: Chendi Xue <[email protected]> * [CPP] Remove Concat Kernal and Action and some codes refine Signed-off-by: Chendi Xue <[email protected]> * [Scala] String fix Signed-off-by: Chendi Xue <[email protected]> * Update README.md * Update ApacheArrowInstallation.md * [CPP] Multiple Key Groupby fix and optimization Noticed before groupby with multiple key returns incorrect result, and this commit will fix this Also if multiple keys are all string, I will concat them with gandiva and do a hash firstly then doing encodeArray. By doing which, will be a little faster then directly hash and add Signed-off-by: Chendi Xue <[email protected]> * [CPP] SplitArray optimization Move input array from lambda capture to class member, which will improve performance a lot. Signed-off-by: Chendi Xue <[email protected]> * [Scala] Support Aggregation with projection inside case (oap-project#86) By this new fix, we are able to run unmodified TPCH Q1 Signed-off-by: Chendi Xue <[email protected]> * [Scala] adding support for starts_with & ends_with (oap-project#78) * [Scala] adding support for starts_with & ends_with Signed-off-by: Yuan Zhou <[email protected]> * [Scala] adding support for like Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix string like support Signed-off-by: Yuan Zhou <[email protected]> * [Scala] support substring Signed-off-by: Yuan Zhou <[email protected]> * [CPP] Support String in ColumnarJoin Signed-off-by: Chendi Xue <[email protected]> * [Scala] LeftSemi Join support Signed-off-by: Chendi Xue <[email protected]> * [Scala] Continue fix aggregate issue for Q3 Now Q3 is runable Signed-off-by: Chendi Xue <[email protected]> * [Scala] Memory leak issue fixing Signed-off-by: Chendi Xue <[email protected]> * [CPP & Scala] Support multiple key join Signed-off-by: Chendi Xue 
<[email protected]> * [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add a new interface to get holder current size Signed-off-by: Chendi Xue <[email protected]> * [Scala] Refine current ConditionProjector codes 1. Use iterator instead of map in ConditionProjector, so we can skip empty columnarBatch as return 2. Fix several bugs and made input schema for condition and project more clear Signed-off-by: Chendi Xue <[email protected]> * [Scala] Add a return column size in Columnar AggregatExpression Since we may have one scenario like avg, which inputs one col and expected two column as return in partial phase and input two cols and expect one at final phase. Which is also a fix for Q1 Signed-off-by: Chendi Xue <[email protected]> * [Scala] ColumnarShuffleHashJoin with Knownfloating expr Signed-off-by: Chendi Xue <[email protected]> * [scala] support In (oap-project#91) * [scala] support In Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix get ordinal for ColumnarIn Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix get ordinal in agg (oap-project#92) a special fix for Q10 Spark will do normalization when float/doubt type as join key Signed-off-by: Yuan Zhou <[email protected]> * [Scala] A attr fix in ColumnarAggregation Signed-off-by: Chendi Xue <[email protected]> * Revert "[Scala] fix get ordinal in agg (oap-project#92)" This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4. 
* [Scala] Fix for Q11 Signed-off-by: Chendi Xue <[email protected]> * [Scala] Add a new expression who will collect subquery result and as literal in gandiva Signed-off-by: Chendi Xue <[email protected]> * [Scala] adding support for extract_year (oap-project#88) * [Scala] adding support for extract_year Signed-off-by: Yuan Zhou <[email protected]> * [Scala] cast utf8/int64 to date64 first Signed-off-by: Yuan Zhou <[email protected]> * [Scala] support DateType for Literal Signed-off-by: Yuan Zhou <[email protected]> * [Scala] add support for string Contains Signed-off-by: Yuan Zhou <[email protected]> * [Scala] use string based comparison for datetype Signed-off-by: Yuan Zhou <[email protected]> * [Scala] clean up Signed-off-by: Yuan Zhou <[email protected]> * [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support Signed-off-by: Chendi Xue <[email protected]> * [CPP] null key will be skipped in Groupby Case Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add native ResultIterator support for Groupby HashAggregate Signed-off-by: Chendi Xue <[email protected]> * [Scala] ColumnarHashAggregation and ColumnarProjection Refactor Extracted current projection codes from ColumnarAggregation and made as a single class, So we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression. Also added return by batch support in ColumnarAggregation, so we won't return too much lines which may result in memory leak. Signed-off-by: Chendi Xue <[email protected]> * [Scala] ColumnarConditionProjection fix after Aggregation Refine Signed-off-by: Chendi Xue <[email protected]> * [Scala] extractYear fix to use Int32 Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add a new interface to pass selectionVector 1. add selection support to evaluator and resultIterator 2. add selectionVector support to ProbeArrays 3. 
fix wo/ groupby aggregate result type issue Signed-off-by: Chendi Xue <[email protected]> * [Scala] Add a new interface to pass selectionVector Signed-off-by: Chendi Xue <[email protected]> * [Scala] Using ConditionProjector to handler condition inside Join Signed-off-by: Chendi Xue <[email protected]> * [Scala] Support condition inside ColumnarJoin Signed-off-by: Chendi Xue <[email protected]> * [Scala] A walkaround to skip Condition when input doesn't contain this field Signed-off-by: Chendi Xue <[email protected]> * [CPP] Support multiple same primary key Join Signed-off-by: Chendi Xue <[email protected]> * [CPP] shift groupby key hashed value then add to next one Signed-off-by: Chendi Xue <[email protected]> * [Scala] support for Not (oap-project#80) Signed-off-by: Yuan Zhou <[email protected]> * [Scala & CPP] Fix ColumnarAggregation ResultIterator bug Original we used Slice array in native codes, and when we pass this array to Java, Slice configuration will be lost so we are getting incorrect result. Now we changed to build array inside ResultIterator Next function, and result is correct now. 
Signed-off-by: Chendi Xue <[email protected]> * [Scala] support case when (oap-project#100) * [Scala] support case when Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix EquealTo Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix agg in case when Signed-off-by: Yuan Zhou <[email protected]> * [Scala] restore BinaryOperator Signed-off-by: Yuan Zhou <[email protected]> * [Scala] clean up Signed-off-by: Yuan Zhou <[email protected]> * [Scala] Cast dataType in BinaryOperator Signed-off-by: Chendi Xue <[email protected]> * [Scala & CPP] Support Outer Join Signed-off-by: Chendi Xue <[email protected]> * [Scala] Fix when aggregationExpression is empty Signed-off-by: Chendi Xue <[email protected]> * Wip condition join (oap-project#106) * [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <[email protected]> * [Scala] Move bindReference inside ColumnarConditionProjection Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add Native conditionedJoin This PR is aim to do runtime codegen so we can perform a conditioned join operation, Add a new ConditionedShuffleArrayList implementation Add a new ConditionedProbeArrays implementation Generate signature for codegen func, and use signature to check if lib exists Add NoneCondition Support Remove ShuffleArrayList implementation and change to use ConditionShuffleArrayList Remove not in use Kernels and Actions Signed-off-by: Chendi Xue <[email protected]> * [Scala] Support new conditionedJoin Signed-off-by: Chendi Xue <[email protected]> * [CPP] Remove original probeArrays kernel Signed-off-by: Chendi Xue <[email protected]> * [scala] support function with in operator (oap-project#107) Signed-off-by: Yuan Zhou <[email protected]> * [CPP] Use original shuffle codes here to improve performance Signed-off-by: Chendi Xue <[email protected]> * [CPP] Fix AvgByCount bug Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add In Support when doing codegen and forward unknown 
function Signed-off-by: Chendi Xue <[email protected]> * [Scala] Fix a small bug in ColumnarExpressionConverter for Like Signed-off-by: Chendi Xue <[email protected]> * Move SparkColumnarPlugin to oap-native-sql folder Signed-off-by: Chendi Xue <[email protected]> * [CPP] Small fixes (oap-project#1184) Signed-off-by: Chendi Xue <[email protected]> * [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185) Signed-off-by: Chendi Xue <[email protected]> * [DO NOT MERGE]WIP Q2 fix (oap-project#1187) * [CPP & Scala] Fixed some codes for ConditionedShuffle Signed-off-by: Chendi Xue <[email protected]> * [CPP] Q2_fix done Signed-off-by: Chendi Xue <[email protected]> * [CPP] Last commit invoked some mis-remove, fix here Signed-off-by: Chendi Xue <[email protected]> * Update README.md * [nativesql] fix compile against new arrow (oap-project#1189) * [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <[email protected]> * [C++] fix compile warning Signed-off-by: Yuan Zhou <[email protected]> * [C++] remove unused headers Signed-off-by: Yuan Zhou <[email protected]> * Update ApacheArrowInstallation.md * [nativesql]Wip spark rebase (oap-project#1202) * [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <[email protected]> * [C++] fix compile warning Signed-off-by: Yuan Zhou <[email protected]> * [C++] remove unused headers Signed-off-by: Yuan Zhou <[email protected]> * [scala] fix spark reabasing Signed-off-by: Yuan Zhou <[email protected]> * [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203) * Copied Arrow Hashing to our repo so newly modification won't break our builds Signed-off-by: Chendi Xue <[email protected]> * [scala] fix spark reabasing Signed-off-by: Yuan Zhou <[email protected]> * [CPP] Add protobuf inside native sql Signed-off-by: Chendi Xue <[email protected]> Co-authored-by: Yuan Zhou <[email protected]> * [NativeSql]refactor native parquet reader/writer (oap-project#1205) * Remove 
sortArraysToIndices Signed-off-by: Chendi Xue <[email protected]> * [NativeSql] Move Parquet Reader and Writer into nativeSql Signed-off-by: Chendi Xue <[email protected]> * [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install Signed-off-by: Chendi Xue <[email protected]> * [CPP] Add a parquet reader and writer adapter Signed-off-by: Chendi Xue <[email protected]> * [NativeSql] Refactor and move spark side commits to nativeSql 1. move parquet reader logic to nativesql 2. move ArrowWritableColumnVector to nativesql 3. Use postRule to call RowToArrowColumnVector 4. move cpp so to jar 5. remove benchmark folder 6. update readme Signed-off-by: Chendi Xue <[email protected]> * [NativeSql][CPP] Use CMake to download and compile protobuf Signed-off-by: Chendi Xue <[email protected]> * Update README.md * ArrowDataSource for Spark (#1226) * [oap-native-sql]Add Installation Notes (#1231) * add InstallationNotes to README * refine * refine * refine * [NativeSql] ClassCastException if non-parquet data source is used (#1238) * Move ArrowWritableColumnVector from org.apache to com.intel (#1243) * [DataSource] Compilation error due to multiple source directories (#1244) * [oap-native-sql]Wip refine protobuf install (#1230) * [Building] refine protobuf dependency check - if not found, download protobuf and statically link to it - if found, reuse system level protobuf and dynamically link to it Signed-off-by: Yuan Zhou <[email protected]> * [Building] check for dynamic protobuf lib only Signed-off-by: Yuan Zhou <[email protected]> * [oap-native-sql][Scala] support date32 (#1225) * [Scala] support date32 Signed-off-by: Yuan Zhou <[email protected]> * [C++][Java] Support Date32 in RowToColumn Signed-off-by: Yuan Zhou <[email protected]> * [C++] support date32 in unique action Signed-off-by: Yuan Zhou <[email protected]> * [Java] fix getUTF8String on Date32 Signed-off-by: Yuan Zhou <[email protected]> * set C++ 2011 standard (#1236) * 
[Scala] fix contain to use is_substr (#1235) Signed-off-by: Yuan Zhou <[email protected]> * [Java] fix date32 projection (#1250) Signed-off-by: Yuan Zhou <[email protected]> * [NativeSql][Scala] memory leak track and fixes (#1227) * [NativeSql][Scala] memory leak track and fixes Signed-off-by: Chendi Xue <[email protected]> * [NativeSql][CPP] Another derived class should add virtual to its super destruction func Signed-off-by: Chendi Xue <[email protected]> * [DataSource][Arrow] Supress exceptions from unexpected types when pushdown filters (#1253) * Update README googletest installation (#1251) * [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262) * [DataSource][Arrow] Output schema mismatch when scanning for zero data columns * [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values * [DataSource][Arrow] Update README.md (#1263) * [DataSource][Arrow] Add assembly build (#1264) * [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267) * [oap-native-sql] Calling ColumnVectorUtils.populate(...) 
on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268) * [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269) * [DataSource][Arrow] Update README.md (#1276) * [DataSource][Arrow] Update README.md (#1279) * [Scala] adding IsNull support (#1256) Signed-off-by: Yuan Zhou <[email protected]> * [oap-native-sql] Add open permission parameter (#1266) * add open O_CREAT permission mode * [DataSource][Arrow] Prune pushed filters that access partition columns (#1285) * [oap-native-sql][Scala]Adding abs support (#1273) * support abs * [Building] building with spark-sql from our maven repo (#1249) Signed-off-by: Yuan Zhou <[email protected]> * [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288) * [DataSource][Arrow] File descriptor leak (#1295) * inset (#1290) * upper (#1301) * [oap-native-sql][CI] update travis for native sql (#1294) * [CI] update travis for native sql Signed-off-by: Yuan Zhou <[email protected]> * [CI] fix grammar, use openjdk8 Signed-off-by: Yuan Zhou <[email protected]> * [CI] update to use python3 env Signed-off-by: Yuan Zhou <[email protected]> * [Doc] update readme (#1308) Signed-off-by: Yuan Zhou <[email protected]> * coalesce (#1306) * [oap-native-sql][Scala]adding if support (#1307) * add IfOperator * add boolean type * [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261) * [NativeSql] ColumnarSort kernel ColumnarSort is implemented with CodeGeneration method Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Fix compiling issue Signed-off-by: Chendi Xue <[email protected]> * [Scala]support date32 in IN epxression (#1303) Signed-off-by: Yuan Zhou <[email protected]> * adding ASF license (#1331) Signed-off-by: Yuan Zhou <[email protected]> * [CI] update to use new oap-master branch (#1342) Signed-off-by: Yuan Zhou <[email protected]> * [oap-native-sql]Rewrite ColumnarShuffledHashJoin using CodeGeneration (#1324) * [oap-native-sql] 
Rewrite ColumnarShuffledHashJoin using codegeneration 1. Remove unused files after we change to use codegen 2. Change to use SparseHashMap instead of arrow Hashing Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Use java.io.tmpdir or cmake build dir as codegen tmp dir Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Add copyright and change datatype in array_item_index to uint16_t Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Use add_definitions instead of add_compile_definitions Signed-off-by: Chendi Xue <[email protected]> * adding concat support (#1328) * [oap-native-sql][Scala] refine coalesce (#1340) * [oap-native-sql][Scala] fix null value exception for StringType and DoubleType (#1333) * [oap-native-sql] Enable mvn package to build native libs (#1341) * Enable mvn package to build native libs * [oap-native-sql][Scala] fix attr errors (#1330) * [oap-native-sql][Scala] adding round support (#1332) * [oap-native-sql] columnar shuffle (oap-project#1212) * [Scala/C++] columnar shuffle * [Scala] sync with arrow-dataset * [Scala] rebase to spark 3.1.0 * [Scala] fix & rebase to arrow 0.17 * [Java] serializer & typo * [Scala] fix serializer & add data size SQLMetric * [NativeSql][c++] Support date type [NativeSql][Scala] support fall back row-based shuffle * [NatvieSql][Scala] columnar shuffle configurable * [NativeSql][Scala] serializer reference transfer & fix decompress [NativeSql][c++] update deprecated * [NativeSql][Scala] fix writer write columnar batch of 0 rows * [NativeSql][Scala] read batch num rows metrics * [NativeSql][Scala] configurable native buffer size * [NativeSql][c++] optimize [Scala] ColumnarShuffleExchange filter empty batch * [NativeSql][Scala] fix extra close * [NativeSql][Scala] coalesce batch * [NativeSql][c++] find boost * [NativeSql][Scala] fallback to use parquet data source * Revert "[NativeSql][Scala] coalesce batch" This reverts commit 4b6929920f19769051cc899ed244761bdfb43d47. 
* [NativeSql] update README.md * [NativeSql] ci install boost * [NativeSql] add missing ASF & reformat * [NativeSql][Scala] remove WSCG=false * [oap-native-sql] Add customized batch_size and tmp_dir support (#1362) * [oap-native-sql] Add API to use customized batchSize through spark config to native * [oap-native-sql] Initialize ColumnarPluginConfig in operators Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Add cutomized tmp dir through spark config Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql]Wip optimize sort (#1372) * [oap-native-sql] Use inplace sort for single key no payload batch Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Add ska_sort for single column with payload and use std::sort in desc case Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Add third party ska-sort Signed-off-by: Chendi Xue <[email protected]> * [oap-native-sql] Columnar shuffle I/O Use Configured Disks (#1378) * [NativeSql] shuffle I/O using spark configuration * [NativeSql] some cleanup * [C++] opt hash join (#1377) Signed-off-by: Yuan Zhou <[email protected]> * [NativeSql][Scala] compression workaround (#1381) * [DataSource][Arrow] Support reading dictionary encoded parquet values (#1376) * [DataSource][Arrow] Support reading dictionary encoded parquet values * CI uses Intel-bigdata/arrow/native-sql-engine-clean * [oap-native-sql]Add ColumnarBatch Combination on Shuffle Read Side (#1370) * [NativeSql][Scala] coalesce batch * [NativeSql][Scala] use nano metrics [NativeSql][Scala] add split metric to collect native split + write time, change write time metric to collect concat shuffle temp file time * [NativeSql][Java] license & indent * [NativeSql] rebase * [NativeSql][c++] compress use single thread (#1394) * [oap-native-sql][C++] extract codegen headers to nativesql_include folder (#1395) * [C++] extract codegen headers to nativesql_include folder So this won't conflict with zstd-jni Signed-off-by: Yuan Zhou 
<[email protected]> * [C++] support additional location of libarrow Signed-off-by: Yuan Zhou <[email protected]> * [DataSource][Arrow] Reserve buffer bytes from Spark off-heap executio… (#1393) * [DataSource][Arrow] Reserve buffer bytes from Spark off-heap execution memory pool * typo * wip * [oap-native-sql][Doc] update docs (#1392) * [Doc]wip refine doc Signed-off-by: Yuan Zhou <[email protected]> * [Doc] refine wording and picture Signed-off-by: Yuan Zhou <[email protected]> * [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358) * [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358) * [Doc] fix wrong link to core arch picture (#1411) Signed-off-by: Yuan Zhou <[email protected]> * [oap-native-sql][Scala] adding cast support (#1312) * [Scala] adding cast support Signed-off-by: Yuan Zhou <[email protected]> * [Scala] disable castBIGINT Signed-off-by: Yuan Zhou <[email protected]> * [Scala] remove cast hack Signed-off-by: Yuan Zhou <[email protected]> * [Scala] fix getResultAttr in Cast Signed-off-by: Yuan Zhou <[email protected]> * [Scala] disable castDECIMAL and cleanup Signed-off-by: Yuan Zhou <[email protected]> * [DataSource][Arrow] Error when reading parquet file whose path contains character '%' (#1420) * [DataSource][Arrow] Follow-up: A test case should be marked ignore (#1422) * [ArrowDataSource][Scala] allow to specify batch size from Spark (#1416) Signed-off-by: Yuan Zhou <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Co-authored-by: rongma1997 <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Co-authored-by: JiayiChen785 <[email protected]> Co-authored-by: Hongze Zhang <[email protected]> Co-authored-by: Rui Mo <[email protected]> Co-authored-by: Hongze Zhang <[email protected]>
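The commit log above notes that avg takes one input column and returns two columns in the partial phase, then takes two columns and returns one in the final phase (the SumCount and AvgByCount kernels). A minimal sketch of that split follows, using only the C++ standard library. The names `SumCount`, `GroupState`, `PartialSumCount` and `FinalAvgByCount` are hypothetical stand-ins for illustration; they are not the project's kernels and do not use the Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-group partial state: a running sum and a row count.
struct SumCount {
  double sum = 0.0;
  int64_t count = 0;
};

// Group id -> partial state. Illustrative only; the real engine works on Arrow arrays.
using GroupState = std::unordered_map<int32_t, SumCount>;

// Partial phase: one value column in, two logical columns (sum, count) out per group.
GroupState PartialSumCount(const std::vector<int32_t>& keys,
                           const std::vector<double>& values) {
  assert(keys.size() == values.size());
  GroupState out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    SumCount& state = out[keys[i]];
    state.sum += values[i];
    state.count += 1;
  }
  return out;
}

// Final phase: merge the (sum, count) pairs from every partial result,
// then emit a single avg column per group.
std::unordered_map<int32_t, double> FinalAvgByCount(
    const std::vector<GroupState>& partials) {
  GroupState merged;
  for (const GroupState& partial : partials) {
    for (const auto& entry : partial) {
      merged[entry.first].sum += entry.second.sum;
      merged[entry.first].count += entry.second.count;
    }
  }
  std::unordered_map<int32_t, double> avg;
  for (const auto& entry : merged) {
    if (entry.second.count > 0) {
      avg[entry.first] = entry.second.sum / static_cast<double>(entry.second.count);
    }
  }
  return avg;
}
```

Carrying the count alongside the sum is what lets partial results from different batches be merged exactly; averaging per-batch averages would weight small batches too heavily.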

zhouyuan force-pushed the wip_rebase2 branch from 0148ebd to f33d96d March 1, 2021 09:01

zhouyuan changed the title ~~[DNM]Wip rebase arrow 3.0~~ [NSE-136]upgrade to arrow 3.0 Mar 1, 2021

zhouyuan and others added 14 commits March 1, 2021 17:12

initial try on rebase

bed9361

Signed-off-by: Yuan Zhou <[email protected]>

adding shuffle back

e75a9de

Signed-off-by: Yuan Zhou <[email protected]>

rebase more kernels

724527d

Signed-off-by: Yuan Zhou <[email protected]>

rebase compute related API

cbefcd5

Signed-off-by: Yuan Zhou <[email protected]>

rebase splitter.cc

7ba570e

adding shuffle rebase

e756e27

remove parquet reader/writer

95a75e2

Signed-off-by: Yuan Zhou <[email protected]>

fix unit tests

fce596a

Signed-off-by: Yuan Zhou <[email protected]>

rebase splitter.cc with latest native sql engine code

ad0c05f

enable shuffle tests

8f360e3

Signed-off-by: Yuan Zhou <[email protected]>

rebase scala code

dc1fb59

try to rebase scala

75aa10c

Signed-off-by: Yuan Zhou <[email protected]>

remove loadNextBatch method in ArrowCompressedStreamReader

1f294a8

fix sort kernel

f33d96d

Signed-off-by: Yuan Zhou <[email protected]>

zhouyuan changed the title ~~[NSE-136]upgrade to arrow 3.0~~ [NSE-136]upgrade to arrow 3.0.0 Mar 3, 2021

zhouyuan requested a review from weiting-chen March 3, 2021 07:25

zhouyuan and others added 2 commits March 3, 2021 16:25

Revert "remove loadNextBatch method in ArrowCompressedStreamReader"

2efa6c1

This reverts commit 1f294a8.

rebase the fastpfor codec and fix bug

cc8a0b9

fix CI system

0743d61

Signed-off-by: Yuan Zhou <[email protected]>

weiting-chen approved these changes Mar 3, 2021


zhouyuan and others added 4 commits March 3, 2021 21:33

fix

545787a

Signed-off-by: Yuan Zhou <[email protected]>

Add the comment in the changes when calling the arrow API

0ecd12a

fix ut

6f2a225

Signed-off-by: Yuan Zhou <[email protected]>

fix window sort

379bdb4

Signed-off-by: Yuan Zhou <[email protected]>

zhouyuan merged commit 74c35b2 into oap-project:master Mar 4, 2021


Conversation

zhouyuan commented Feb 18, 2021

What changes were proposed in this pull request?

How was this patch tested?

github-actions bot commented Mar 1, 2021

github-actions bot commented Mar 3, 2021

zhouyuan commented Mar 3, 2021

github-actions bot commented Mar 3, 2021

weiting-chen left a comment


github-actions bot commented Mar 4, 2021

zhouyuan commented Mar 4, 2021
