Change log

Generated on 2020-09-18

Release 0.2

Features


#696	[FEA] run integration tests against SPARK-3.0.1
#455	[FEA] Support UCX shuffle with optimized AQE
#510	[FEA] Investigate libcudf features needed to support struct schema pruning during loads
#541	[FEA] Scala UDF: Support for null Value operands
#542	[FEA] Scala UDF: Support for Date and Time
#499	[FEA] disable any kind of warnings about ExecutedCommandExec not being on the GPU
#540	[FEA] Scala UDF: Support for String replaceFirst()
#340	[FEA] widen the rendered Jekyll pages
#602	[FEA] don't release with any -SNAPSHOT dependencies
#579	[FEA] Auto-merge between branches
#515	[FEA] Write tests for AQE skewed join optimization
#452	[FEA] Update HashSortOptimizerSuite to work with AQE
#454	[FEA] Update GpuCoalesceBatchesSuite to work with AQE enabled
#354	[FEA] Spark 3.1 FileSourceScanExec adds parameter optionalNumCoalescedBuckets
#566	[FEA] Add support for StringSplit with an array index.
#524	[FEA] Add GPU specific metrics to GpuFileSourceScanExec
#494	[FEA] Add some AQE-specific tests to the PySpark test suite
#146	[FEA] Python tests should support running with Adaptive Query Execution enabled
#465	[FEA] Audit: Update script to audit multiple versions of Spark
#488	[FEA] Ability to limit total GPU memory used
#70	[FEA] Support StringSplit
#403	[FEA] Add in support for GetArrayItem
#493	[FEA] Implement shuffle optimization when AQE is enabled
#500	[FEA] Add maven profiles for testing with AQE on or off
#471	[FEA] create a formal process for updating the github-pages branch
#233	[FEA] Audit DataWritingCommandExec
#240	[FEA] Audit Api validation script follow on - Optimize StringToTypeTag
#388	[FEA] Audit WindowExec
#425	[FEA] Add tests for configs in BatchScan Readers
#453	[FEA] Update HashAggregatesSuite to work with AQE
#184	[FEA] Enable NoScalaDoc scalastyle rule
#438	[FEA] Enable StringLPad
#232	[FEA] Audit SortExec
#236	[FEA] Audit ShuffleExchangeExec
#355	[FEA] Support Multiple Spark versions in the same jar
#385	[FEA] Support RangeExec on the GPU
#317	[FEA] Write test wrapper to run SQL queries via pyspark
#235	[FEA] Audit BroadcastExchangeExec
#234	[FEA] Audit BatchScanExec
#238	[FEA] Audit ShuffledHashJoinExec
#237	[FEA] Audit BroadcastHashJoinExec
#316	[FEA] Add some basic Dataframe tests for CoalesceExec
#145	[FEA] Scala tests should support running with Adaptive Query Execution enabled
#231	[FEA] Audit ProjectExec
#229	[FEA] Audit FileSourceScanExec

Performance


#326	[DISCUSS] Shuffle read-side error handling
#601	[FEA] Optimize unnecessary sorts when replacing SortAggregate
#333	[FEA] Better handling of reading lots of small Parquet files
#511	[FEA] Connect shuffle table compression to shuffle exec metrics
#15	[FEA] Multiple threads sharing the same GPU
#272	[DOC] Getting started guide for UCX shuffle

Bugs Fixed


#780	[BUG] Inner Join dropping data with bucketed Table input
#569	[BUG] left_semi_join operation is abnormal and seriously time-consuming
#744	[BUG] TPC-DS query 6 now produces incorrect results.
#718	[BUG] GpuBroadcastHashJoinExec ArrayIndexOutOfBoundsException
#698	[BUG] batch coalesce can fail to appear between columnar shuffle and subsequent columnar operation
#658	[BUG] GpuCoalesceBatches collectTime metric can be underreported
#59	[BUG] enable tests for string literals in a select
#486	[BUG] GpuWindowExec does not implement requiredChildOrdering
#631	[BUG] Rows are dropped when AQE is enabled in some cases
#671	[BUG] Databricks hash_aggregate_test fails trying to canonicalize a WrappedAggFunction
#218	[BUG] Window function COUNT(x) includes null-values, when it shouldn't
#153	[BUG] Incorrect output from partial-only hash aggregates with multiple distincts and non-distinct functions
#656	[BUG] integration tests produce hive metadata files
#607	[BUG] Fix misleading "cannot run on GPU" warnings when AQE is enabled
#630	[BUG] GpuCustomShuffleReader metrics always show zero rows/batches output
#643	[BUG] race condition while registering a buffer and spilling at the same time
#606	[BUG] Multiple scans for same data source with TPC-DS query59 with delta format
#626	[BUG] parquet_test showing leaked memory buffer
#155	[BUG] Incorrect output from averages with filters in partial only mode
#277	[BUG] HashAggregateSuite failure when AQE is enabled
#276	[BUG] GpuCoalesceBatchSuite failure when AQE is enabled
#598	[BUG] Non-deterministic output from MapOutputTracker.getStatistics() with AQE on GPU
#192	[BUG] test_read_merge_schema fails on Databricks
#341	[BUG] Document compression formats for readers/writers
#587	[BUG] Spark3.1 changed FileScan which means our GpuScans need to be added to shim layer
#362	[BUG] Implement getReaderForRange in the RapidsShuffleManager
#528	[BUG] HashAggregateSuite "Avg Distinct with filter" no longer valid when testing against Spark 3.1.0
#416	[BUG] Fix Spark 3.1.0 integration tests
#556	[BUG] NPE when removing shuffle
#553	[BUG] GpuColumnVector build warnings from raw type access
#492	[BUG] Re-enable AQE integration tests
#275	[BUG] TpchLike query 2 fails when AQE is enabled
#508	[BUG] GpuUnion publishes metrics on the UI that are all 0
#269	Needed to add `--conf spark.driver.extraClassPath=`
#473	[BUG] PartMerge:countDistinct:sum fails sporadically
#531	[BUG] Temporary RMM workaround needs to be removed
#532	[BUG] NPE when enabling shuffle manager
#525	[BUG] GpuFilterExec reports incorrect nullability of output in some cases
#483	[BUG] Multiple scans for the same parquet data source
#382	[BUG] Spark3.1 StringFallbackSuite regexp_replace null cpu fall back test fails.
#489	[FEA] Fix Spark 3.1 GpuHashJoin since it now requires CodegenSupport
#441	[BUG] test_broadcast_nested_loop_join_special_case fails on databricks
#347	[BUG] Failed to read Parquet file generated by GPU-enabled Spark.
#433	`InSet` operator produces an error for Strings
#144	[BUG] spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored
#323	[BUG] GpuBroadcastNestedLoopJoinExec can fail if there are no columns
#356	[BUG] Integration cache test for BroadcastNestedLoopJoin failure
#280	[BUG] Full Outer Join does not work on nullable keys
#149	[BUG] Spark driver fails to load native libs when running on node without CUDA

PRs


#793	Update Jenkins scripts for release
#798	Fix shims provider override config not being seen by executors
#785	Make shuffle run on CPU if we do a join where we read from bucketed table
#765	Add config to override shims provider class
#759	Add CHANGELOG for release 0.2
#758	Skip the udf test that fails periodically.
#752	Fix snapshot plugin jar version in docs
#751	Correct the channel for cudf installation
#754	Filter nulls from joins where possible to improve performance
#732	Add a timeout for RapidsShuffleIterator to prevent jobs to hang infin…
#637	Documentation changes for 0.2 release
#747	Disable udf tests that fail periodically
#745	Revert Null Join Filter
#741	Fix issue with parquet partitioned reads
#733	Remove GPU Types from github
#720	Stop removing GpuCoalesceBatches from non-AQE queries when AQE is enabled
#729	Fix collect time metric in CoalesceBatches
#640	Support running Pandas UDFs on GPUs in Python processes.
#721	Add some more checks to databricks build scripts
#714	Move spark 3.0.1-shims out of snapshot-shims
#711	fix blossom checkout repo
#709	[BUG] fix unexpected indentation issue in blossom yml
#642	Init workflow for blossom-ci
#705	Enable configuration check for cast string to timestamp
#702	Update slack channel for Jenkins builds
#701	fix checkout-ref for automerge
#695	Fix spark-3.0.1 shim to be released
#668	refactor automerge to support merge for protected branch
#687	Include the UDF compiler in the dist jar
#689	Change shims dependency to spark-3.0.1
#677	Use multi-threaded parquet read with small files
#638	Add Parquet-based cache serializer
#613	Enable UCX + AQE
#684	Enable test for literal string values in a select
#686	Remove sorts when replacing sort aggregate if possible
#675	Added TimeAdd
#645	[window] Add GpuWindowExec requiredChildOrdering
#676	fixUpJoinConsistency rule now works when AQE is enabled
#683	Fix issues with canonicalization of WrappedAggFunction
#682	Fix path to start-slave.sh script in docs
#673	Increase build timeouts on nightly and premerge builds
#648	add signoff-check use github actions
#593	Add support for isNaN and datetime related instructions in UDF compiler
#666	[window] Disable GPU for COUNT(exp) queries
#655	Implement AQE unit test for InsertAdaptiveSparkPlan
#614	Fix for aggregation with multiple distinct and non distinct functions
#657	Fix verify build after integration tests are run
#660	Add in neverReplaceExec and several rules for it
#639	BooleanType test shouldn't xfail
#652	Mark UVM config as internal until supported
#653	Move to the cudf-0.15 release
#647	Improve warnings about AQE nodes not supported on GPU
#646	Stop reporting zero metrics for GpuCustomShuffleReader
#644	Small fix for race in catalog where a buffer could get spilled while …
#623	Fix issues with canonicalization
#599	[FEA] changelog generator
#563	cudf and spark version info in artifacts
#633	Fix leak if RebaseHelper throws during Parquet read
#632	Copy function isSearchableType from Spark because signature changed in 3.0.1
#583	Add udf compiler unit tests
#617	Documentation updates for branch 0.2
#616	Add config to reserve GPU memory
#612	[REVIEW] Fix incorrect output from averages with filters in partial only mode
#609	fix minor issues with instructions for building ucx
#611	Added in profile to enable shims for SNAPSHOT releases
#595	Parquet small file reading optimization
#582	fix #579 Auto-merge between branches
#536	Add test for skewed join optimization when AQE is enabled
#603	Fix data size metric always 0 when using RAPIDS shuffle
#600	Fix calculation of string data for compressed batches
#597	Remove the xfail for parquet test_read_merge_schema on Databricks
#591	Add ucx license in NOTICE-binary
#596	Add Spark 3.0.2 to Shim layer
#594	Filter nulls from joins where possible to improve performance.
#590	Move GpuParquetScan/GpuOrcScan into Shim
#588	xfail the tpch spark 3.1.0 tests that fail
#572	Update buffer store to return compressed batches directly, add compression NVTX ranges
#558	Fix unit tests when AQE is enabled
#580	xfail the Spark 3.1.0 integration tests that fail
#565	Minor improvements to TPC-DS benchmarking code
#567	Explicitly disable AQE in one test
#571	Fix Databricks shim layer for GpuFileSourceScanExec and GpuBroadcastExchangeExec
#564	Add GPU decode time metric to scans
#562	getCatalog can be called from the driver, and can return null
#555	Fix build warnings for ColumnViewAccess
#560	Fix databricks build for AQE support
#557	Fix tests failing on Spark 3.1
#547	Add GPU metrics to GpuFileSourceScanExec
#462	Implement optimized AQE support so that exchanges run on GPU where possible
#550	Document Parquet and ORC compression support
#539	Update script to audit multiple Spark versions
#543	Add metrics to GpuUnion operator
#549	Move spark shim properties to top level pom
#497	Add UDF compiler implementations
#487	Add framework for batch compression of shuffle partitions
#544	Add in driverExtraClassPath for standalone mode docs
#546	Fix Spark 3.1.0 shim build error in GpuHashJoin
#537	Use fresh SparkSession when capturing to avoid late capture of previous query
#538	Revert "Temporary workaround for RMM initial pool size bug (#530)"
#517	Add config to limit maximum RMM pool size
#527	Add support for split and getArrayIndex
#534	Fixes bugs around GpuShuffleEnv initialization
#529	[BUG] Degenerate table metas were not getting copied to the heap
#530	Temporary workaround for RMM initial pool size bug
#526	Fix bug with nullability reporting in GpuFilterExec
#521	Fix typo with databricks shim classname SparkShimServiceProvider
#522	Use SQLConf instead of SparkConf when looking up SQL configs
#518	Fix init order issue in GpuShuffleEnv when RAPIDS shuffle configured
#514	Added clarification of RegExpReplace, DateDiff, made descriptive text consistent
#506	Add in basic support for running tpcds like queries
#504	Add ability to ignore tests depending on spark shim version
#503	Remove unused async buffer spill support
#501	disable codegen in 3.1 shim for hash join
#466	Optimize and fix Api validation script
#481	Codeowners
#439	Check a PR has been committed using git signoff
#319	Update partitioning logic in ShuffledBatchRDD
#491	Temporarily ignore AQE integration tests
#490	Fix Spark 3.1.0 build for HashJoin changes
#482	Prevent bad practice in python tests
#485	Show plan in assertion message if test fails
#480	Fix link from README to getting-started.md
#448	Preliminary support for keeping broadcast exchanges on GPU when AQE is enabled
#478	Fall back to CPU for binary as string in parquet
#477	Fix special case joins in broadcast nested loop join
#469	Update HashAggregateSuite to work with AQE
#475	Udf compiler pom followup
#434	Add UDF compiler skeleton
#474	Re-enable noscaladoc check
#461	Fix comments style to pass scala style check
#468	fix broken link
#456	Add closeOnExcept to clean up code that closes resources only on exceptions
#464	Turn off noscaladoc rule until codebase is fixed
#449	Enforce NoScalaDoc rule in scalastyle checks
#450	Enable scalastyle for shuffle plugin
#451	Databricks remove unneeded files and fix build to not fail on rm when file missing
#442	Shim layer support for Spark 3.0.0 Databricks
#447	Add scalastyle plugin to shim module
#426	Update BufferMeta to support multiple codec buffers per table
#440	Run mortgage test both with AQE on and off
#445	Added in StringRPad and StringLPad
#422	Documentation updates
#437	Fix bug with InSet and Strings
#435	Add in checks for Parquet LEGACY date/time rebase
#432	Fix batch use-after-close in partitioning, shuffle env init
#423	Fix duplicates includes in assembly jar
#418	CI Add unit tests running for Spark 3.0.1
#421	Make it easier to run TPCxBB benchmarks from spark shell
#413	Fix download link
#414	Shim Layer to support multiple Spark versions
#406	Update cast handling to deal with new libcudf casting limitations
#405	Change slave->worker
#395	Databricks doc updates
#401	Extended the FAQ
#398	Add tests for GpuPartition
#352	Change spark tgz package name
#397	Fix small bug in ShuffleBufferCatalog.hasActiveShuffle
#286	[REVIEW] Updated join tests for cache
#393	Contributor license agreement
#389	Added in support for RangeExec
#390	Ucx getting started
#391	Hide slack channel in Jenkins scripts
#387	Remove the term whitelist
#365	[REVIEW] Timesub tests
#383	Test utility to compare SQL query results between CPU and GPU
#380	Fix databricks notebook link
#378	Added in FAQ and fixed spelling
#377	Update heading in configs.md
#373	Modifying branch name to conform with rapidsai branch name change
#376	Add our session extension correctly if there are other extensions configured
#374	Fix rat issue for notebooks
#364	Update Databricks patch for changes to GpuSortMergeJoin
#371	fix typo and use regional bucket per GCP's update
#359	Karthik changes
#353	Fix broadcast nested loop join for the no column case
#313	Additional tests for broadcast hash join
#342	Implement build-side rules for shuffle hash join
#349	Updated join code to treat null equality properly
#335	Integration tests on spark 3.0.1-SNAPSHOT & 3.1.0-SNAPSHOT
#346	Update the Title Header for Fine Tuning
#344	Fix small typo in readme
#331	Adds iterator and client unit tests, and prepares for more fetch failure handling
#337	Fix Scala compile phase to allow Java classes referencing Scala classes
#332	Match GPU overwritten functions with SQL functions from FunctionRegistry
#339	Fix databricks build
#338	Move GpuPartitioning to a separate file
#310	Update release Jenkinsfile for Databricks
#330	Hide private info in Jenkins scripts
#324	Add in basic support for GpuCartesianProductExec
#328	Enable slack notification for Databricks build
#321	update databricks patch for GpuBroadcastNestedLoopJoinExec
#322	Add oss.sonatype.org to download the cudf jar
#320	Don't mount passwd/group to the container
#258	Enable running TPCH tests with AQE enabled
#318	Build docker image with Dockerfile
#309	Update databricks patch to latest changes
#312	Trigger branch-0.2 integration test
#307	[Jenkins] Update the release script and Jenkinsfile
#304	[DOC][Minor] Fix typo in spark config name.
#303	Update compatibility doc for -0.0 issues
#301	Add info about branches in README.md
#296	Added in basic support for broadcast nested loop join
#297	Databricks CI improvements and support runtime env parameter to xfail certain tests
#292	Move artifacts version in version-def.sh
#254	Cleanup QA tests
#289	Clean up GpuCollectLimitMeta and add in metrics
#287	Add in support for right join and fix issues build right
#273	Added releases to the README.md
#285	modify run_pyspark_from_build.sh to be bash 3 friendly
#281	Add in support for Full Outer Join on non-null keys
#274	Add RapidsDiskStore tests
#259	Add RapidsHostMemoryStore tests
#282	Update Databricks patch for 0.2 branch
#261	Add conditional xfail test for DISTINCT aggregates with NaN
#263	More time ops
#256	Remove special cases for contains, startsWith, and endWith
#253	Remove GpuAttributeReference and GpuSortOrder
#271	Update the versions for 0.2.0 properly for the databricks build
#162	Integration tests for corner cases in window functions.
#264	Add a local mvn repo for nightly pipeline
#262	Refer to branch-0.2
#255	Revert change to make dependencies of shaded jar optional
#257	Fix link to RAPIDS cudf in index.md
#252	Update to 0.2.0-SNAPSHOT and cudf-0.15-SNAPSHOT

Release 0.1

Features


#74	[FEA] Support ToUnixTimestamp
#21	[FEA] NormalizeNansAndZeros
#105	[FEA] integration tests for equi-joins

Bugs Fixed


#116	[BUG] calling replace with a NULL throws an exception
#168	[BUG] GpuUnitTests Date tests leak column vectors
#209	[BUG] Developers section in pom need to be updated
#204	[BUG] Code coverage docs are out of date
#154	[BUG] Incorrect output from partial-only averages with nulls
#61	[BUG] Cannot disable Parquet, ORC, CSV reading when using FileSourceScanExec

PRs


#249	Compatability -> Compatibility
#247	Add index.md for default doc page, fix table formatting for configs
#241	Let default branch to master per the release rule
#177	Fixed leaks in unit test and use ColumnarBatch for testing
#243	Jenkins file for Databricks release
#225	Make internal project dependencies optional for shaded artifact
#242	Add site pages
#221	Databricks Build Support
#215	Remove CudfColumnVector
#213	Add RapidsDeviceMemoryStore tests
#214	[REVIEW] Test failure to pass Attribute as GpuAttribute
#211	Add project leads to pom developer list
#210	Updated coverage docs
#195	Support public release for plugin jar
#208	Remove unneeded comment from pom.xml
#191	WindowExec handle different spark distributions
#181	Remove INCOMPAT for NormalizeNanAndZero, KnownFloatingPointNormalized
#196	Update Spark dependency to the released 3.0.0 artifacts
#206	Change groupID to 'com.nvidia' in IT scripts
#202	Fixed issue for contains when searching for an empty string
#201	Fix name of scan
#200	Fix issue with GpuAttributeReference not overriding references
#197	Fix metrics for writes
#186	Fixed issue with nullability on concat
#193	Add RapidsBufferCatalog tests
#188	rebrand to com.nvidia instead of ai.rapids
#189	Handle AggregateExpression having resultIds parameter instead of a single resultId
#190	FileSourceScanExec can have logicalRelation parameter on some distributions
#185	Update type of parameter of GpuExpandExec to make it consistent
#172	Merge qa test to integration test
#180	Add MetaUtils unit tests
#171	Cleanup scaladoc warnings about missing links
#176	Updated join tests to cover more data.
#169	Remove dependency on shaded Spark artifact
#174	Added in fallback tests
#165	Move input metadata tests to pyspark
#173	Fix setting local mode for tests
#160	Integration tests for normalizing NaN/zeroes.
#163	Ignore the order locally for repartition tests
#157	Add partial and final only hash aggregate tests and fix nulls corner case for Average
#159	Add integration tests for joins
#158	Orc merge schema fallback and FileScan format configs
#164	Fix compiler warnings
#152	Moved cudf to 0.14 for CI
#151	Switch CICD pipelines to Github
