Skip to content

Commit

Permalink
Merge branch 'branch-21.08' into refactor/index_construction
Browse files Browse the repository at this point in the history
  • Loading branch information
vyasr authored Jun 15, 2021
2 parents be2e714 + 7c8d847 commit 0d730ab
Show file tree
Hide file tree
Showing 119 changed files with 3,386 additions and 670 deletions.
99 changes: 97 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,104 @@

Please see https://github.com/rapidsai/cudf/releases/tag/v21.08.00a for the latest changes to this development branch.

# cuDF 21.06.00 (Date TBD)
# cuDF 21.06.00 (9 Jun 2021)

Please see https://github.com/rapidsai/cudf/releases/tag/v21.06.00a for the latest changes to this development branch.
## 🚨 Breaking Changes

- Add support for `make_meta_obj` dispatch in `dask-cudf` ([#8342](https://github.com/rapidsai/cudf/pull/8342)) [@galipremsagar](https://github.com/galipremsagar)
- Add separator-on-null parameter to strings concatenate APIs ([#8282](https://github.com/rapidsai/cudf/pull/8282)) [@davidwendt](https://github.com/davidwendt)
- Introduce a common parent class for NumericalColumn and DecimalColumn ([#8278](https://github.com/rapidsai/cudf/pull/8278)) [@vyasr](https://github.com/vyasr)
- Update ORC statistics API to use C++17 standard library ([#8241](https://github.com/rapidsai/cudf/pull/8241)) [@vuule](https://github.com/vuule)
- Preserve column hierarchy when getting NULL row from `LIST` column ([#8206](https://github.com/rapidsai/cudf/pull/8206)) [@isVoid](https://github.com/isVoid)
- `Groupby.shift` c++ API refactor and python binding ([#8131](https://github.com/rapidsai/cudf/pull/8131)) [@isVoid](https://github.com/isVoid)

## 🐛 Bug Fixes

- Fix struct flattening to add a validity column only when the input column has null element ([#8374](https://github.com/rapidsai/cudf/pull/8374)) [@ttnghia](https://github.com/ttnghia)
- Compilation fix: Remove redefinition for `std::is_same_v()` ([#8369](https://github.com/rapidsai/cudf/pull/8369)) [@mythrocks](https://github.com/mythrocks)
- Add backward compatibility for `dask-cudf` to work with other versions of `dask` ([#8368](https://github.com/rapidsai/cudf/pull/8368)) [@galipremsagar](https://github.com/galipremsagar)
- Handle empty results with nested types in copy_if_else ([#8359](https://github.com/rapidsai/cudf/pull/8359)) [@nvdbaranec](https://github.com/nvdbaranec)
- Handle nested column types properly for empty parquet files. ([#8350](https://github.com/rapidsai/cudf/pull/8350)) [@nvdbaranec](https://github.com/nvdbaranec)
- Raise error when unsupported arguments are passed to `dask_cudf.DataFrame.sort_values` ([#8349](https://github.com/rapidsai/cudf/pull/8349)) [@galipremsagar](https://github.com/galipremsagar)
- Raise `NotImplementedError` for axis=1 in `rank` ([#8347](https://github.com/rapidsai/cudf/pull/8347)) [@galipremsagar](https://github.com/galipremsagar)
- Add support for `make_meta_obj` dispatch in `dask-cudf` ([#8342](https://github.com/rapidsai/cudf/pull/8342)) [@galipremsagar](https://github.com/galipremsagar)
- Update Java string concatenate test for single column ([#8330](https://github.com/rapidsai/cudf/pull/8330)) [@tgravescs](https://github.com/tgravescs)
- Use empty_like in scatter ([#8314](https://github.com/rapidsai/cudf/pull/8314)) [@revans2](https://github.com/revans2)
- Fix concatenate_lists_ignore_null on rows of all_nulls ([#8312](https://github.com/rapidsai/cudf/pull/8312)) [@sperlingxx](https://github.com/sperlingxx)
- Add separator-on-null parameter to strings concatenate APIs ([#8282](https://github.com/rapidsai/cudf/pull/8282)) [@davidwendt](https://github.com/davidwendt)
- COLLECT_LIST support returning empty output columns. ([#8279](https://github.com/rapidsai/cudf/pull/8279)) [@mythrocks](https://github.com/mythrocks)
- Update io util to convert path like object to string ([#8275](https://github.com/rapidsai/cudf/pull/8275)) [@ayushdg](https://github.com/ayushdg)
- Fix result column types for empty inputs to rolling window ([#8274](https://github.com/rapidsai/cudf/pull/8274)) [@mythrocks](https://github.com/mythrocks)
- Actually test equality in assert_groupby_results_equal ([#8272](https://github.com/rapidsai/cudf/pull/8272)) [@shwina](https://github.com/shwina)
- CMake always explicitly specify a source files extension ([#8270](https://github.com/rapidsai/cudf/pull/8270)) [@robertmaynard](https://github.com/robertmaynard)
- Fix struct binary search and struct flattening ([#8268](https://github.com/rapidsai/cudf/pull/8268)) [@ttnghia](https://github.com/ttnghia)
- Revert "patch thrust to fix intmax num elements limitation in scan_by_key" ([#8263](https://github.com/rapidsai/cudf/pull/8263)) [@cwharris](https://github.com/cwharris)
- upgrade dlpack to 0.5 ([#8262](https://github.com/rapidsai/cudf/pull/8262)) [@cwharris](https://github.com/cwharris)
- Fixes CSV-reader type inference for thousands separator and decimal point ([#8261](https://github.com/rapidsai/cudf/pull/8261)) [@elstehle](https://github.com/elstehle)
- Fix incorrect assertion in Java concat ([#8258](https://github.com/rapidsai/cudf/pull/8258)) [@sperlingxx](https://github.com/sperlingxx)
- Copy nested types upon construction ([#8244](https://github.com/rapidsai/cudf/pull/8244)) [@isVoid](https://github.com/isVoid)
- Preserve column hierarchy when getting NULL row from `LIST` column ([#8206](https://github.com/rapidsai/cudf/pull/8206)) [@isVoid](https://github.com/isVoid)
- Clip decimal binary op precision at max precision ([#8194](https://github.com/rapidsai/cudf/pull/8194)) [@ChrisJar](https://github.com/ChrisJar)

## 📖 Documentation

- Add docstring for `dask_cudf.read_csv` ([#8355](https://github.com/rapidsai/cudf/pull/8355)) [@galipremsagar](https://github.com/galipremsagar)
- Fix cudf release version in readme ([#8331](https://github.com/rapidsai/cudf/pull/8331)) [@galipremsagar](https://github.com/galipremsagar)
- Fix structs column description in dev docs ([#8318](https://github.com/rapidsai/cudf/pull/8318)) [@isVoid](https://github.com/isVoid)
- Update readme with correct CUDA versions ([#8315](https://github.com/rapidsai/cudf/pull/8315)) [@raydouglass](https://github.com/raydouglass)
- Add description of the cuIO GDS integration ([#8293](https://github.com/rapidsai/cudf/pull/8293)) [@vuule](https://github.com/vuule)
- Remove unused parameter from copy_partition kernel documentation ([#8283](https://github.com/rapidsai/cudf/pull/8283)) [@robertmaynard](https://github.com/robertmaynard)

## 🚀 New Features

- Add support merging b/w categorical data ([#8332](https://github.com/rapidsai/cudf/pull/8332)) [@galipremsagar](https://github.com/galipremsagar)
- Java: Support struct scalar ([#8327](https://github.com/rapidsai/cudf/pull/8327)) [@sperlingxx](https://github.com/sperlingxx)
- added _is_homogeneous property ([#8299](https://github.com/rapidsai/cudf/pull/8299)) [@shaneding](https://github.com/shaneding)
- Added decimal writing for CSV writer ([#8296](https://github.com/rapidsai/cudf/pull/8296)) [@kaatish](https://github.com/kaatish)
- Java: Support creating a scalar from utf8 string ([#8294](https://github.com/rapidsai/cudf/pull/8294)) [@firestarman](https://github.com/firestarman)
- Add Java API for Concatenate strings with separator ([#8289](https://github.com/rapidsai/cudf/pull/8289)) [@tgravescs](https://github.com/tgravescs)
- `strings::join_list_elements` options for empty list inputs ([#8285](https://github.com/rapidsai/cudf/pull/8285)) [@ttnghia](https://github.com/ttnghia)
- Return python lists for __getitem__ calls to list type series ([#8265](https://github.com/rapidsai/cudf/pull/8265)) [@brandon-b-miller](https://github.com/brandon-b-miller)
- add unit tests for lead/lag on list for row window ([#8259](https://github.com/rapidsai/cudf/pull/8259)) [@wbo4958](https://github.com/wbo4958)
- Create a String column from UTF8 String byte arrays ([#8257](https://github.com/rapidsai/cudf/pull/8257)) [@firestarman](https://github.com/firestarman)
- Support scattering `list_scalar` ([#8256](https://github.com/rapidsai/cudf/pull/8256)) [@isVoid](https://github.com/isVoid)
- Implement `lists::concatenate_list_elements` ([#8231](https://github.com/rapidsai/cudf/pull/8231)) [@ttnghia](https://github.com/ttnghia)
- Support for struct scalars. ([#8220](https://github.com/rapidsai/cudf/pull/8220)) [@nvdbaranec](https://github.com/nvdbaranec)
- Add support for decimal types in ORC writer ([#8198](https://github.com/rapidsai/cudf/pull/8198)) [@vuule](https://github.com/vuule)
- Support create lists column from a `list_scalar` ([#8185](https://github.com/rapidsai/cudf/pull/8185)) [@isVoid](https://github.com/isVoid)
- `Groupby.shift` c++ API refactor and python binding ([#8131](https://github.com/rapidsai/cudf/pull/8131)) [@isVoid](https://github.com/isVoid)
- Add `groupby::replace_nulls(replace_policy)` api ([#7118](https://github.com/rapidsai/cudf/pull/7118)) [@isVoid](https://github.com/isVoid)

## 🛠️ Improvements

- Support Dask + Distributed 2021.05.1 ([#8392](https://github.com/rapidsai/cudf/pull/8392)) [@jakirkham](https://github.com/jakirkham)
- Add aliases for string methods ([#8353](https://github.com/rapidsai/cudf/pull/8353)) [@shwina](https://github.com/shwina)
- Update environment variable used to determine `cuda_version` ([#8321](https://github.com/rapidsai/cudf/pull/8321)) [@ajschmidt8](https://github.com/ajschmidt8)
- JNI: Refactor the code of making column from scalar ([#8310](https://github.com/rapidsai/cudf/pull/8310)) [@firestarman](https://github.com/firestarman)
- Update `CHANGELOG.md` links for calver ([#8303](https://github.com/rapidsai/cudf/pull/8303)) [@ajschmidt8](https://github.com/ajschmidt8)
- Merge `branch-0.19` into `branch-21.06` ([#8302](https://github.com/rapidsai/cudf/pull/8302)) [@ajschmidt8](https://github.com/ajschmidt8)
- use address and length for GDS reads/writes ([#8301](https://github.com/rapidsai/cudf/pull/8301)) [@rongou](https://github.com/rongou)
- Update cudfjni version to 21.06.0 ([#8292](https://github.com/rapidsai/cudf/pull/8292)) [@pxLi](https://github.com/pxLi)
- Update docs build script ([#8284](https://github.com/rapidsai/cudf/pull/8284)) [@ajschmidt8](https://github.com/ajschmidt8)
- Make device_buffer streams explicit and enforce move construction ([#8280](https://github.com/rapidsai/cudf/pull/8280)) [@harrism](https://github.com/harrism)
- Introduce a common parent class for NumericalColumn and DecimalColumn ([#8278](https://github.com/rapidsai/cudf/pull/8278)) [@vyasr](https://github.com/vyasr)
- Do not add nulls to the hash table when null_equality::NOT_EQUAL is passed to left_semi_join and left_anti_join ([#8277](https://github.com/rapidsai/cudf/pull/8277)) [@nvdbaranec](https://github.com/nvdbaranec)
- Enable implicit casting when concatenating mixed types ([#8276](https://github.com/rapidsai/cudf/pull/8276)) [@ChrisJar](https://github.com/ChrisJar)
- Fix CMake FindPackage rmm, pin dev envs' dlpack to v0.3 ([#8271](https://github.com/rapidsai/cudf/pull/8271)) [@trxcllnt](https://github.com/trxcllnt)
- Update cudfjni version to 21.06 ([#8267](https://github.com/rapidsai/cudf/pull/8267)) [@pxLi](https://github.com/pxLi)
- support RMM aligned resource adapter in JNI ([#8266](https://github.com/rapidsai/cudf/pull/8266)) [@rongou](https://github.com/rongou)
- Pass compiler environment variables to conda python build ([#8260](https://github.com/rapidsai/cudf/pull/8260)) [@Ethyling](https://github.com/Ethyling)
- Remove abc inheritance from Serializable ([#8254](https://github.com/rapidsai/cudf/pull/8254)) [@vyasr](https://github.com/vyasr)
- Move more methods into SingleColumnFrame ([#8253](https://github.com/rapidsai/cudf/pull/8253)) [@vyasr](https://github.com/vyasr)
- Update ORC statistics API to use C++17 standard library ([#8241](https://github.com/rapidsai/cudf/pull/8241)) [@vuule](https://github.com/vuule)
- Correct unused parameter warnings in dictonary algorithms ([#8239](https://github.com/rapidsai/cudf/pull/8239)) [@robertmaynard](https://github.com/robertmaynard)
- Correct unused parameters in the copying algorithms ([#8232](https://github.com/rapidsai/cudf/pull/8232)) [@robertmaynard](https://github.com/robertmaynard)
- IO statistics cleanup ([#8191](https://github.com/rapidsai/cudf/pull/8191)) [@kaatish](https://github.com/kaatish)
- Refactor of rolling_window implementation. ([#8158](https://github.com/rapidsai/cudf/pull/8158)) [@nvdbaranec](https://github.com/nvdbaranec)
- Add a flag for allowing single quotes in JSON strings. ([#8144](https://github.com/rapidsai/cudf/pull/8144)) [@nvdbaranec](https://github.com/nvdbaranec)
- Column refactoring 2 ([#8130](https://github.com/rapidsai/cudf/pull/8130)) [@vyasr](https://github.com/vyasr)
- support space in workspace ([#7956](https://github.com/rapidsai/cudf/pull/7956)) [@jolorunyomi](https://github.com/jolorunyomi)
- Support collect_set on rolling window ([#7881](https://github.com/rapidsai/cudf/pull/7881)) [@sperlingxx](https://github.com/sperlingxx)

# cuDF 0.19.0 (21 Apr 2021)

Expand Down
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -87,7 +87,7 @@ and then check the syntax of your Python and Cython code by running:

```bash
flake8 python
flake8 --config=python/cudf/.flake8.cython
flake8 --config=python/.flake8.cython
```

Additionally, many editors have plugins that will apply `isort` and `Black` as
Expand Down
8 changes: 4 additions & 4 deletions ci/benchmark/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -75,10 +75,10 @@ conda install "rmm=$MINOR_VERSION.*" "cudatoolkit=$CUDA_REL" \
# conda install "your-pkg=1.0.0"

# Install the master version of dask, distributed, and streamz
logger "pip install git+https://github.com/dask/distributed.git@main --upgrade --no-deps"
pip install "git+https://github.com/dask/distributed.git@main" --upgrade --no-deps
logger "pip install git+https://github.com/dask/dask.git@main --upgrade --no-deps"
pip install "git+https://github.com/dask/dask.git@main" --upgrade --no-deps
logger "pip install git+https://github.com/dask/distributed.git@2021.06.0 --upgrade --no-deps"
pip install "git+https://github.com/dask/distributed.git@2021.06.0" --upgrade --no-deps
logger "pip install git+https://github.com/dask/dask.git@2021.06.0 --upgrade --no-deps"
pip install "git+https://github.com/dask/dask.git@2021.06.0" --upgrade --no-deps
logger "pip install git+https://github.com/python-streamz/streamz.git --upgrade --no-deps"
pip install "git+https://github.com/python-streamz/streamz.git" --upgrade --no-deps

Expand Down
4 changes: 2 additions & 2 deletions ci/gpu/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -101,8 +101,8 @@ function install_dask {
# Install the main version of dask, distributed, and streamz
gpuci_logger "Install the main version of dask, distributed, and streamz"
set -x
pip install "git+https://github.com/dask/distributed.git@main" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git@main" --upgrade --no-deps
pip install "git+https://github.com/dask/distributed.git@2021.06.0" --upgrade --no-deps
pip install "git+https://github.com/dask/dask.git@2021.06.0" --upgrade --no-deps
pip install "git+https://github.com/python-streamz/streamz.git" --upgrade --no-deps
set +x
}
Expand Down
4 changes: 2 additions & 2 deletions conda/environments/cudf_dev_cuda11.0.yml
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ dependencies:
- cachetools
- transformers
- pip:
- git+https://github.com/dask/dask.git@main
- git+https://github.com/dask/distributed.git@main
- git+https://github.com/dask/dask.git@2021.06.0
- git+https://github.com/dask/distributed.git@2021.06.0
- git+https://github.com/python-streamz/streamz.git
- pyorc
4 changes: 2 additions & 2 deletions conda/environments/cudf_dev_cuda11.2.yml
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ dependencies:
- cachetools
- transformers
- pip:
- git+https://github.com/dask/dask.git@main
- git+https://github.com/dask/distributed.git@main
- git+https://github.com/dask/dask.git@2021.06.0
- git+https://github.com/dask/distributed.git@2021.06.0
- git+https://github.com/python-streamz/streamz.git
- pyorc
8 changes: 4 additions & 4 deletions conda/recipes/dask-cudf/run_test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,11 @@ function logger() {
}

# Install the latest version of dask and distributed
logger "pip install git+https://github.com/dask/distributed.git@main --upgrade --no-deps"
pip install "git+https://github.com/dask/distributed.git@main" --upgrade --no-deps
logger "pip install git+https://github.com/dask/distributed.git@2021.06.0 --upgrade --no-deps"
pip install "git+https://github.com/dask/distributed.git@2021.06.0" --upgrade --no-deps

logger "pip install git+https://github.com/dask/dask.git@main --upgrade --no-deps"
pip install "git+https://github.com/dask/dask.git@main" --upgrade --no-deps
logger "pip install git+https://github.com/dask/dask.git@2021.06.0 --upgrade --no-deps"
pip install "git+https://github.com/dask/dask.git@2021.06.0" --upgrade --no-deps

logger "python -c 'import dask_cudf'"
python -c "import dask_cudf"
1 change: 1 addition & 0 deletions cpp/cmake/thirdparty/CUDF_GetArrow.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ function(find_and_configure_arrow VERSION BUILD_STATIC)
"ARROW_WITH_BACKTRACE ON"
"ARROW_CXXFLAGS -w"
"ARROW_JEMALLOC OFF"
"ARROW_S3 ON"
# Arrow modifies CMake's GLOBAL RULE_LAUNCH_COMPILE unless this is off
"ARROW_USE_CCACHE OFF"
"ARROW_ARMV8_ARCH ${ARROW_ARMV8_ARCH}"
Expand Down
7 changes: 7 additions & 0 deletions cpp/docs/DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -342,6 +342,7 @@ namespace detail{
} // namespace detail

void external_function(...){
CUDF_FUNC_RANGE(); // Auto generates NVTX range for lifetime of this function
detail::external_function(...);
}
```
Expand All @@ -355,6 +356,12 @@ asynchrony if and when we add an asynchronous API to libcudf.
**Note:** `cudaDeviceSynchronize()` should *never* be used.
This limits the ability to do any multi-stream/multi-threaded work with libcudf APIs.
### NVTX Ranges
In order to aid in performance optimization and debugging, all compute intensive libcudf functions should have a corresponding NVTX range.
In libcudf, we have a convenience macro `CUDF_FUNC_RANGE()` that will automatically annotate the lifetime of the enclosing function and use the functions name as the name of the NVTX range.
For more information about NVTX, see [here](https://github.com/NVIDIA/NVTX/tree/dev/cpp).
### Stream Creation
There may be times in implementing libcudf features where it would be advantageous to use streams
Expand Down
2 changes: 1 addition & 1 deletion cpp/include/cudf/detail/scatter.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -305,7 +305,7 @@ struct column_scatterer_impl<struct_view> {
[](auto const& col) { return col.nullable(); });
if (child_nullable) {
auto const gather_map =
scatter_to_gather(scatter_map_begin, scatter_map_end, source.size(), stream);
scatter_to_gather(scatter_map_begin, scatter_map_end, target.size(), stream);
gather_bitmask(cudf::table_view{std::vector<cudf::column_view>{structs_src.child_begin(),
structs_src.child_end()}},
gather_map.begin(),
Expand Down
33 changes: 33 additions & 0 deletions cpp/include/cudf/io/datasource.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,13 @@
#include <rmm/cuda_stream_view.hpp>

#include <arrow/buffer.h>
#include <arrow/filesystem/filesystem.h>
#include <arrow/filesystem/s3fs.h>
#include <arrow/io/file.h>
#include <arrow/io/interfaces.h>
#include <arrow/io/memory.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <memory>

Expand Down Expand Up @@ -302,6 +306,34 @@ class arrow_io_source : public datasource {
};

public:
/**
* @brief Constructs an object from an Apache Arrow Filesystem URI
*
* @param Apache Arrow Filesystem URI
*/
explicit arrow_io_source(std::string_view arrow_uri)
{
const std::string uri_start_delimiter = "//";
const std::string uri_end_delimiter = "?";

arrow::Result<std::shared_ptr<arrow::fs::FileSystem>> result =
arrow::fs::FileSystemFromUri(static_cast<std::string>(arrow_uri));
CUDF_EXPECTS(result.ok(), "Failed to generate Arrow Filesystem instance from URI.");
filesystem = result.ValueOrDie();

// Parse the path from the URI
size_t start = arrow_uri.find(uri_start_delimiter) == std::string::npos
? 0
: arrow_uri.find(uri_start_delimiter) + uri_start_delimiter.size();
size_t end = arrow_uri.find(uri_end_delimiter) - start;
std::string_view path = arrow_uri.substr(start, end);

arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> in_stream =
filesystem->OpenInputFile(static_cast<std::string>(path).c_str());
CUDF_EXPECTS(in_stream.ok(), "Failed to open Arrow RandomAccessFile");
arrow_file = in_stream.ValueOrDie();
}

/**
* @brief Constructs an object from an `arrow` source object.
*
Expand Down Expand Up @@ -340,6 +372,7 @@ class arrow_io_source : public datasource {
}

private:
std::shared_ptr<arrow::fs::FileSystem> filesystem;
std::shared_ptr<arrow::io::RandomAccessFile> arrow_file;
};

Expand Down
Loading

0 comments on commit 0d730ab

Please sign in to comment.