Merge branch 'branch-21.10' into remove-make-strings-children-with-null-mask
davidwendt committed Jul 28, 2021
2 parents 50d9124 + 904222b commit 9403ae8
Showing 15 changed files with 213 additions and 25 deletions.
4 changes: 2 additions & 2 deletions conda/environments/cudf_dev_cuda11.0.yml
@@ -40,8 +40,8 @@ dependencies:
   - mypy=0.782
   - typing_extensions
   - pre_commit
-  - dask>=2021.6.0
-  - distributed>=2021.6.0
+  - dask>=2021.6.0,<=2021.07.1
+  - distributed>=2021.6.0,<=2021.07.1
   - streamz
   - arrow-cpp=4.0.1
   - dlpack>=0.5,<0.6.0a0
4 changes: 2 additions & 2 deletions conda/environments/cudf_dev_cuda11.2.yml
@@ -40,8 +40,8 @@ dependencies:
   - mypy=0.782
   - typing_extensions
   - pre_commit
-  - dask>=2021.6.0
-  - distributed>=2021.6.0
+  - dask>=2021.6.0,<=2021.07.1
+  - distributed>=2021.6.0,<=2021.07.1
   - streamz
   - arrow-cpp=4.0.1
   - dlpack>=0.5,<0.6.0a0
4 changes: 2 additions & 2 deletions conda/recipes/custreamz/meta.yaml
@@ -31,8 +31,8 @@ requirements:
     - python
     - streamz
     - cudf {{ version }}
-    - dask>=2021.6.0
-    - distributed>=2021.6.0
+    - dask>=2021.6.0,<=2021.07.1
+    - distributed>=2021.6.0,<=2021.07.1
     - python-confluent-kafka
     - cudf_kafka {{ version }}

8 changes: 4 additions & 4 deletions conda/recipes/dask-cudf/meta.yaml
@@ -26,13 +26,13 @@ requirements:
   host:
     - python
     - cudf {{ version }}
-    - dask>=2021.6.0
-    - distributed>=2021.6.0
+    - dask>=2021.6.0,<=2021.07.1
+    - distributed>=2021.6.0,<=2021.07.1
   run:
     - python
     - cudf {{ version }}
-    - dask>=2021.6.0
-    - distributed>=2021.6.0
+    - dask>=2021.6.0,<=2021.07.1
+    - distributed>=2021.6.0,<=2021.07.1

 test:
   requires:
4 changes: 2 additions & 2 deletions cpp/src/io/orc/writer_impl.cu
@@ -575,8 +575,8 @@ orc_streams writer::impl::create_streams(host_span<orc_column_view> columns,
         break;
       }
       case TypeKind::TIMESTAMP:
-        add_RLE_stream(gpu::CI_DATA, DATA, TypeKind::INT);
-        add_RLE_stream(gpu::CI_DATA2, SECONDARY, TypeKind::INT);
+        add_RLE_stream(gpu::CI_DATA, DATA, TypeKind::LONG);
+        add_RLE_stream(gpu::CI_DATA2, SECONDARY, TypeKind::LONG);
         column.set_orc_encoding(DIRECT_V2);
         break;
       case TypeKind::DECIMAL:
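
Note on the hunk above: the commit does not say why the TIMESTAMP streams move from `TypeKind::INT` to `TypeKind::LONG`, and how `add_RLE_stream` uses the type kind is not visible in this diff. One plausible reading, stated here as an assumption, is that ORC timestamp seconds are stored as an offset from 2015-01-01, and that offset does not fit in 32 bits for far-future dates such as those suggested by the test file name `TestOrcFile.largeTimestamps.orc`. A minimal Python sketch of that arithmetic (the helper name `orc_seconds` and the year-2200 sample are illustrative, not taken from cuDF):

```python
from datetime import datetime, timezone

# Assumption: the ORC timestamp epoch is 2015-01-01 00:00:00 UTC.
ORC_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def orc_seconds(ts: datetime) -> int:
    """Whole seconds between `ts` and the assumed ORC epoch."""
    return int((ts - ORC_EPOCH).total_seconds())


# A far-future timestamp, chosen only for illustration.
far_future = datetime(2200, 1, 1, tzinfo=timezone.utc)
secs = orc_seconds(far_future)
fits_in_int32 = INT32_MIN <= secs <= INT32_MAX

print(secs, fits_in_int32)
```

Under these assumptions the offset needs more than 32 bits, so sizing the stream for 64-bit values is the conservative choice.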
22 changes: 17 additions & 5 deletions java/README.md
@@ -35,8 +35,8 @@ most modern cuda drivers.
 ```

 In some cases there may be a classifier to indicate the version of cuda required. See the
-Build From Source section below for more information about when this can happen. No official
-release of the jar will have a classifier on it.
+[Build From Source](#build-from-source) section below for more information about when this
+can happen. No official release of the jar will have a classifier on it.

 CUDA 11.0:
 ```xml
@@ -51,9 +51,9 @@ CUDA 11.0:
 ## Build From Source

 Build [libcudf](../cpp) first, and make sure the JDK is installed and available. Specify
-the cmake option `-DCUDF_USE_ARROW_STATIC=ON` when building so that Apache Arrow is linked
-statically to libcudf, as this will help create a jar that does not require Arrow and its
-dependencies to be available in the runtime environment.
+the cmake option `-DCUDF_USE_ARROW_STATIC=ON -DCUDF_ENABLE_ARROW_S3=OFF` when building so
+that Apache Arrow is linked statically to libcudf, as this will help create a jar that
+does not require Arrow and its dependencies to be available in the runtime environment.

 After building libcudf, the Java bindings can be built via Maven, e.g.:
 ```
@@ -63,6 +63,18 @@ mvn clean install
 If you have a compatible GPU on your build system the tests will use it. If not you will see a
 lot of skipped tests.

+### Using the Java CI Docker Image
+
+If you are interested in building a Java cudf jar that is similar to the official releases
+that can run on all modern Linux systems, see the [Java CI README](ci/README.md) for
+instructions on how to build within a Docker environment using devtoolset. Note that
+building the jar without the Docker setup and script will likely produce a jar that can
+only run in environments similar to that of the build machine.
+
+If you decide to build without Docker and the build script, examining the cmake and Maven
+settings in the [Java CI build script](ci/build-in-docker.sh) can be helpful if you are
+encountering difficulties during the build.
+
 ## Dynamically Linking Arrow

 Since libcudf builds by default with a dynamically linked Arrow dependency, it may be
64 changes: 64 additions & 0 deletions java/src/main/java/ai/rapids/cudf/ColumnView.java
@@ -2465,6 +2465,48 @@ public final ColumnVector stringReplace(Scalar target, Scalar replace) {
replace.getScalarHandle()));
}

/**
* For each string, replaces any character sequence matching the given pattern using the
* replacement string scalar.
*
* @param pattern The regular expression pattern to search within each string.
* @param repl The string scalar that replaces each pattern match.
* @return A new column vector containing the string results.
*/
public final ColumnVector replaceRegex(String pattern, Scalar repl) {
return replaceRegex(pattern, repl, -1);
}

/**
* For each string, replaces any character sequence matching the given pattern using the
* replacement string scalar.
*
* @param pattern The regular expression pattern to search within each string.
* @param repl The string scalar that replaces each pattern match.
* @param maxRepl The maximum number of times a replacement should occur within each string.
*                The two-argument overload passes -1, which replaces every match.
* @return A new column vector containing the string results.
*/
public final ColumnVector replaceRegex(String pattern, Scalar repl, int maxRepl) {
if (!repl.getType().equals(DType.STRING)) {
throw new IllegalArgumentException("Replacement must be a string scalar");
}
return new ColumnVector(replaceRegex(getNativeView(), pattern, repl.getScalarHandle(),
maxRepl));
}

/**
* For each string, replaces any character sequence matching any of the regular expression
* patterns with the corresponding replacement strings.
*
* @param patterns The regular expression patterns to search within each string.
* @param repls A strings column holding the replacement for each corresponding pattern.
* @return A new column vector containing the string results.
*/
public final ColumnVector replaceMultiRegex(String[] patterns, ColumnView repls) {
return new ColumnVector(replaceMultiRegex(getNativeView(), patterns,
repls.getNativeView()));
}

/**
* For each string, replaces any character sequence matching the given pattern
* using the replace template for back-references.
@@ -3241,6 +3283,28 @@ private static native long substringColumn(long columnView, long startColumn, lo
*/
private static native long stringReplace(long columnView, long target, long repl) throws CudfException;

/**
* Native method for replacing each regular expression pattern match with the specified
* replacement string.
* @param columnView native handle of the cudf::column_view being operated on.
* @param pattern The regular expression pattern to search within each string.
* @param repl native handle of the cudf::scalar containing the replacement string.
* @param maxRepl maximum number of times to replace the pattern within a string
* @return native handle of the resulting cudf column containing the string results.
*/
private static native long replaceRegex(long columnView, String pattern,
long repl, long maxRepl) throws CudfException;

/**
* Native method for multiple instance regular expression replacement.
* @param columnView native handle of the cudf::column_view being operated on.
* @param patterns The regular expression patterns to search within each string.
* @param repls native handle of the cudf::column_view containing the replacement strings.
* @return native handle of the resulting cudf column containing the string results.
*/
private static native long replaceMultiRegex(long columnView, String[] patterns,
long repls) throws CudfException;

/**
* Native method for replacing any character sequence matching the given pattern
* using the replace template for back-references.
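
Note on the `replaceRegex` overloads above: the `maxRepl` values can be read off the accompanying Java test, where the default of -1 replaces every match, 0 leaves the string unchanged, and 1 replaces only the first match. The following is a minimal CPU-side sketch of those semantics in plain Python, not the cuDF implementation; `replace_regex` is a hypothetical helper, and Python's `re` dialect is assumed to be close enough for the simple pattern used here.

```python
import re
from typing import Optional


def replace_regex(s: Optional[str], pattern: str, repl: str,
                  max_repl: int = -1) -> Optional[str]:
    """Mimic the observed maxRepl behaviour on a single (nullable) string."""
    if s is None:        # null rows stay null
        return None
    if max_repl == 0:    # zero replacements requested: return input unchanged
        return s
    # re.sub treats count=0 as "replace all", so map -1 onto 0.
    count = 0 if max_repl < 0 else max_repl
    # A callable replacement keeps `repl` literal (no back-reference expansion).
    return re.sub(pattern, lambda _m: repl, s, count=count)


rows = ["title and Title with title", "nothing", None, "Title"]
print([replace_regex(r, "[tT]itle", "Repl") for r in rows])
print([replace_regex(r, "[tT]itle", "Repl", 1) for r in rows])
```

Note the mismatch this sketch has to bridge: Python's `count=0` means "no limit", whereas `maxRepl` of 0 in the Java test means "replace nothing".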
8 changes: 8 additions & 0 deletions java/src/main/native/CMakeLists.txt
@@ -293,6 +293,14 @@ target_compile_definitions(cudfjni

if(USE_GDS)
add_library(cufilejni SHARED "src/CuFileJni.cpp")
SET_TARGET_PROPERTIES(cufilejni
PROPERTIES BUILD_RPATH "\$ORIGIN"
# set target compile options
CXX_STANDARD 17
CXX_STANDARD_REQUIRED ON
CUDA_STANDARD 17
CUDA_STANDARD_REQUIRED ON
)
target_include_directories(cufilejni PRIVATE "${cuFile_INCLUDE_DIRS}")
target_link_libraries(cufilejni PRIVATE cudfjni "${cuFile_LIBRARIES}")
endif(USE_GDS)
45 changes: 45 additions & 0 deletions java/src/main/native/src/ColumnViewJni.cpp
@@ -1213,6 +1213,51 @@ JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_mapContains(JNIEnv *env,
CATCH_STD(env, 0);
}

JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_replaceRegex(JNIEnv *env, jclass,
jlong j_column_view,
jstring j_pattern, jlong j_repl,
jlong j_maxrepl) {

JNI_NULL_CHECK(env, j_column_view, "column is null", 0);
JNI_NULL_CHECK(env, j_pattern, "pattern string is null", 0);
JNI_NULL_CHECK(env, j_repl, "replace scalar is null", 0);
try {
cudf::jni::auto_set_device(env);
auto cv = reinterpret_cast<cudf::column_view const *>(j_column_view);
cudf::strings_column_view scv(*cv);
cudf::jni::native_jstring pattern(env, j_pattern);
auto repl = reinterpret_cast<cudf::string_scalar const *>(j_repl);

std::unique_ptr<cudf::column> result =
cudf::strings::replace_re(scv, pattern.get(), *repl, j_maxrepl);
return reinterpret_cast<jlong>(result.release());
}
CATCH_STD(env, 0);
}

JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_replaceMultiRegex(JNIEnv *env, jclass,
jlong j_column_view,
jobjectArray j_patterns,
jlong j_repls) {

JNI_NULL_CHECK(env, j_column_view, "column is null", 0);
JNI_NULL_CHECK(env, j_patterns, "patterns is null", 0);
JNI_NULL_CHECK(env, j_repls, "repls is null", 0);
try {
cudf::jni::auto_set_device(env);
auto cv = reinterpret_cast<cudf::column_view const *>(j_column_view);
cudf::strings_column_view scv(*cv);
cudf::jni::native_jstringArray patterns(env, j_patterns);
auto repl_cv = reinterpret_cast<cudf::column_view const *>(j_repls);
cudf::strings_column_view repl_scv(*repl_cv);

std::unique_ptr<cudf::column> result =
cudf::strings::replace_re(scv, patterns.as_cpp_vector(), repl_scv);
return reinterpret_cast<jlong>(result.release());
}
CATCH_STD(env, 0);
}

JNIEXPORT jlong JNICALL Java_ai_rapids_cudf_ColumnView_stringReplaceWithBackrefs(
JNIEnv *env, jclass, jlong column_view, jstring patternObj, jstring replaceObj) {

40 changes: 40 additions & 0 deletions java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java
@@ -4479,6 +4479,46 @@ void teststringReplaceThrowsException() {
});
}

@Test
void testReplaceRegex() {
try (ColumnVector v =
ColumnVector.fromStrings("title and Title with title", "nothing", null, "Title");
Scalar repl = Scalar.fromString("Repl");
ColumnVector actual = v.replaceRegex("[tT]itle", repl);
ColumnVector expected =
ColumnVector.fromStrings("Repl and Repl with Repl", "nothing", null, "Repl")) {
assertColumnsAreEqual(expected, actual);
}

try (ColumnVector v =
ColumnVector.fromStrings("title and Title with title", "nothing", null, "Title");
Scalar repl = Scalar.fromString("Repl");
ColumnVector actual = v.replaceRegex("[tT]itle", repl, 0)) {
assertColumnsAreEqual(v, actual);
}

try (ColumnVector v =
ColumnVector.fromStrings("title and Title with title", "nothing", null, "Title");
Scalar repl = Scalar.fromString("Repl");
ColumnVector actual = v.replaceRegex("[tT]itle", repl, 1);
ColumnVector expected =
ColumnVector.fromStrings("Repl and Title with title", "nothing", null, "Repl")) {
assertColumnsAreEqual(expected, actual);
}
}

@Test
void testReplaceMultiRegex() {
try (ColumnVector v =
ColumnVector.fromStrings("title and Title with title", "nothing", null, "Title");
ColumnVector repls = ColumnVector.fromStrings("Repl", "**");
ColumnVector actual = v.replaceMultiRegex(new String[] { "[tT]itle", "and|th" }, repls);
ColumnVector expected =
ColumnVector.fromStrings("Repl ** Repl wi** Repl", "no**ing", null, "Repl")) {
assertColumnsAreEqual(expected, actual);
}
}

@Test
void testStringReplaceWithBackrefs() {

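
Note on `testReplaceMultiRegex` above: the expected output is consistent with a single left-to-right pass in which, at each character position, the first pattern that matches supplies the replacement. That matching order is an assumption about the underlying `cudf::strings::replace_re` overload and is not confirmed by this diff. A minimal plain-Python sketch under that assumption (`replace_multi_regex` is a hypothetical helper):

```python
import re
from typing import List, Optional


def replace_multi_regex(s: Optional[str], patterns: List[str],
                        repls: List[str]) -> Optional[str]:
    """Single pass: at each position, the first matching pattern wins."""
    if s is None:                      # null rows stay null
        return None
    compiled = [re.compile(p) for p in patterns]
    out = []
    pos = 0
    while pos < len(s):
        for prog, repl in zip(compiled, repls):
            m = prog.match(s, pos)     # anchored at the current position
            if m and m.end() > pos:    # ignore empty matches
                out.append(repl)
                pos = m.end()
                break
        else:
            out.append(s[pos])         # no pattern matched here: copy the character
            pos += 1
    return "".join(out)


rows = ["title and Title with title", "nothing", None, "Title"]
print([replace_multi_regex(r, ["[tT]itle", "and|th"], ["Repl", "**"]) for r in rows])
```

Because replaced text is never rescanned, a replacement string that happens to match a later pattern is left alone, which is where this differs from applying the patterns one after another.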
Binary file not shown.
19 changes: 19 additions & 0 deletions python/cudf/cudf/tests/test_orc.py
@@ -1152,3 +1152,22 @@ def test_chunked_orc_writer_lists():

    got = pa.orc.ORCFile(buffer).read().to_pandas()
    assert_eq(expect, got)


def test_writer_timestamp_stream_size(datadir, tmpdir):
    pdf_fname = datadir / "TestOrcFile.largeTimestamps.orc"
    gdf_fname = tmpdir.join("gdf.orc")

    try:
        orcfile = pa.orc.ORCFile(pdf_fname)
    except Exception as excpr:
        if type(excpr).__name__ == "ArrowIOError":
            pytest.skip(".orc file is not found")
        else:
            # Re-raise: otherwise `orcfile` would be undefined below.
            raise

    expect = orcfile.read().to_pandas()
    cudf.from_pandas(expect).to_orc(gdf_fname.strpath)
    got = pa.orc.ORCFile(gdf_fname).read().to_pandas()

    assert_eq(expect, got)
4 changes: 2 additions & 2 deletions python/custreamz/dev_requirements.txt
@@ -3,8 +3,8 @@
 flake8==3.8.3
 black==19.10b0
 isort==5.6.4
-dask>=2021.6.0
-distributed>=2021.6.0
+dask>=2021.6.0,<=2021.07.1
+distributed>=2021.6.0,<=2021.07.1
 streamz
 python-confluent-kafka
 pytest
4 changes: 2 additions & 2 deletions python/dask_cudf/dev_requirements.txt
@@ -1,7 +1,7 @@
 # Copyright (c) 2021, NVIDIA CORPORATION.

-dask>=2021.6.0
-distributed>=2021.6.0
+dask>=2021.6.0,<=2021.07.1
+distributed>=2021.6.0,<=2021.07.1
 fsspec>=0.6.0
 numba>=0.53.1
 numpy
8 changes: 4 additions & 4 deletions python/dask_cudf/setup.py
@@ -10,8 +10,8 @@

 install_requires = [
     "cudf",
-    "dask>=2021.6.0",
-    "distributed>=2021.6.0",
+    "dask>=2021.6.0,<=2021.07.1",
+    "distributed>=2021.6.0,<=2021.07.1",
     "fsspec>=0.6.0",
     "numpy",
     "pandas>=1.0,<1.3.0dev0",
@@ -23,8 +23,8 @@
         "pandas>=1.0,<1.3.0dev0",
         "pytest",
         "numba>=0.53.1",
-        "dask>=2021.6.0",
-        "distributed>=2021.6.0",
+        "dask>=2021.6.0,<=2021.07.1",
+        "distributed>=2021.6.0,<=2021.07.1",
     ]
 }
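
Note on the `dask`/`distributed` pins in this commit: the range `>=2021.6.0,<=2021.07.1` admits every release from 2021.6.0 through 2021.7.1 inclusive and excludes anything later. The sketch below is a deliberately simplified check that compares purely numeric dotted versions as integer tuples; it is not how pip or conda resolve versions (pre-release and local-version handling is ignored), and `parse` and `allowed` are hypothetical helper names.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Split a purely numeric dotted version into integers ("2021.07.1" -> (2021, 7, 1))."""
    return tuple(int(part) for part in version.split("."))


def allowed(version: str, lower: str = "2021.6.0", upper: str = "2021.07.1") -> bool:
    """True when lower <= version <= upper under numeric tuple comparison."""
    return parse(lower) <= parse(version) <= parse(upper)


for candidate in ["2021.5.1", "2021.6.0", "2021.7.1", "2021.07.1", "2021.7.2"]:
    print(candidate, allowed(candidate))
```

The zero-padded `07` in the upper bound compares equal to `7` here, on the assumption that the real resolvers also compare numeric components as integers.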
