Rewriting row/column conversions for Spark <-> cudf data conversions #8444

Merged
Changes from 82 commits
90 commits
cf58b89
working on row and column conversions
hyperbolic2346 Jun 7, 2021
6e869b6
fixing kernel launch and updating
hyperbolic2346 Jun 10, 2021
2703baf
Updates and bug fixing
hyperbolic2346 Jun 16, 2021
63a6636
Updating windows to be generated in a square way so we can have more …
hyperbolic2346 Jun 21, 2021
b444279
Adding row to column conversion code. Performance falls off a cliff, …
hyperbolic2346 Jul 8, 2021
fbe5dd5
updating to use make_device_uvector_async and bitmask functions per r…
hyperbolic2346 Jul 8, 2021
f8bc01f
updating conversion code. Found out bit operations are on 32-bit valu…
hyperbolic2346 Jul 13, 2021
636b235
working on row and column conversions
hyperbolic2346 Jun 7, 2021
3bff2aa
fixing kernel launch and updating
hyperbolic2346 Jun 10, 2021
8e52ba1
Updates and bug fixing
hyperbolic2346 Jun 16, 2021
fdfcb39
Updating windows to be generated in a square way so we can have more …
hyperbolic2346 Jun 21, 2021
5cf1cf1
Adding row to column conversion code. Performance falls off a cliff, …
hyperbolic2346 Jul 8, 2021
9af3258
updating to use make_device_uvector_async and bitmask functions per r…
hyperbolic2346 Jul 8, 2021
1d0245b
updating conversion code. Found out bit operations are on 32-bit valu…
hyperbolic2346 Jul 13, 2021
65490e0
updating for memcpy_async and validation in a different kernel
hyperbolic2346 Sep 13, 2021
2a57ce6
fixing validity alignment bugs
hyperbolic2346 Sep 21, 2021
ed5492e
Updates and bug fixes
hyperbolic2346 Sep 22, 2021
158a864
Merge remote-tracking branch 'upstream/branch-21.12' into mwilson/row…
hyperbolic2346 Oct 1, 2021
02cb81b
Fixing merge issue
hyperbolic2346 Oct 1, 2021
bae16f6
working on code to move block creation and batch creation to gpu
hyperbolic2346 Oct 1, 2021
3656848
pulling incomplete code for gpu building block data
hyperbolic2346 Oct 6, 2021
76e4099
Use the new row<->col method
razajafri Oct 6, 2021
966c34c
Fixing issue Raza found with 8-byte data
hyperbolic2346 Oct 7, 2021
aeaf28c
Merge branch 'mwilson/row_conversion' of github.com:hyperbolic2346/cu…
hyperbolic2346 Oct 7, 2021
6452e8e
fixing bug with float columns when 'enough' data was present. Updated…
hyperbolic2346 Oct 13, 2021
c0e9895
code cleanup and removed comments
razajafri Oct 15, 2021
b9df725
Fixing validity buffer alignment issue for row data
hyperbolic2346 Oct 21, 2021
8d00447
Cleaning up code for PR
hyperbolic2346 Oct 21, 2021
c4b0242
fixed typo
razajafri Oct 21, 2021
bc4ed9b
Merge branch 'mwilson/row_conversion' of github.com:hyperbolic2346/cu…
hyperbolic2346 Oct 22, 2021
e92989c
Updating for actual PR. Fixed a few last minute bugs, removed cudf-la…
hyperbolic2346 Oct 23, 2021
131ca58
removing unused header, suppressing shared warning for barrier, updat…
hyperbolic2346 Oct 23, 2021
a74c01c
Merge remote-tracking branch 'upstream/branch-21.12' into mwilson/row…
hyperbolic2346 Oct 27, 2021
d013e8b
updating code to build block infos with thrust on the gpu
hyperbolic2346 Oct 30, 2021
70e39cd
fixing overflow issues with large tables
hyperbolic2346 Nov 4, 2021
a10375c
Merge remote-tracking branch 'upstream/branch-21.12' into mwilson/row…
hyperbolic2346 Nov 8, 2021
64c8374
fixing includes for java
hyperbolic2346 Nov 9, 2021
f8ea2b1
addressed review concerns
razajafri Nov 15, 2021
c88472a
removed TODOs and added note to javadocs
razajafri Nov 15, 2021
00e58d7
working on row and column conversions
hyperbolic2346 Jun 7, 2021
b9f42cd
fixing kernel launch and updating
hyperbolic2346 Jun 10, 2021
6a267ab
Updates and bug fixing
hyperbolic2346 Jun 16, 2021
170a771
Updating windows to be generated in a square way so we can have more …
hyperbolic2346 Jun 21, 2021
a82cee8
Adding row to column conversion code. Performance falls off a cliff, …
hyperbolic2346 Jul 8, 2021
31f3b4a
updating to use make_device_uvector_async and bitmask functions per r…
hyperbolic2346 Jul 8, 2021
b044f8b
updating conversion code. Found out bit operations are on 32-bit valu…
hyperbolic2346 Jul 13, 2021
d2a33ed
working on row and column conversions
hyperbolic2346 Jun 7, 2021
7bcf41c
fixing kernel launch and updating
hyperbolic2346 Jun 10, 2021
17f1e5d
Updates and bug fixing
hyperbolic2346 Jun 16, 2021
dfda0f3
Updating windows to be generated in a square way so we can have more …
hyperbolic2346 Jun 21, 2021
5c0e52c
Adding row to column conversion code. Performance falls off a cliff, …
hyperbolic2346 Jul 8, 2021
cdd02d0
updating to use make_device_uvector_async and bitmask functions per r…
hyperbolic2346 Jul 8, 2021
7bb0496
updating conversion code. Found out bit operations are on 32-bit valu…
hyperbolic2346 Jul 13, 2021
2b069ca
updating for memcpy_async and validation in a different kernel
hyperbolic2346 Sep 13, 2021
92f52cd
fixing validity alignment bugs
hyperbolic2346 Sep 21, 2021
83118d2
Updates and bug fixes
hyperbolic2346 Sep 22, 2021
d563eaa
Fixing merge issue
hyperbolic2346 Oct 1, 2021
5b6688d
working on code to move block creation and batch creation to gpu
hyperbolic2346 Oct 1, 2021
53912ca
pulling incomplete code for gpu building block data
hyperbolic2346 Oct 6, 2021
698817a
Fixing issue Raza found with 8-byte data
hyperbolic2346 Oct 7, 2021
fb6dd51
Use the new row<->col method
razajafri Oct 6, 2021
b0173bf
fixing bug with float columns when 'enough' data was present. Updated…
hyperbolic2346 Oct 13, 2021
81cbaa6
code cleanup and removed comments
razajafri Oct 15, 2021
58eb43f
Fixing validity buffer alignment issue for row data
hyperbolic2346 Oct 21, 2021
06837f0
Cleaning up code for PR
hyperbolic2346 Oct 21, 2021
2c4e12f
fixed typo
razajafri Oct 21, 2021
fa4f0d3
Updating for actual PR. Fixed a few last minute bugs, removed cudf-la…
hyperbolic2346 Oct 23, 2021
e9938b9
removing unused header, suppressing shared warning for barrier, updat…
hyperbolic2346 Oct 23, 2021
3c6b1e5
updating code to build block infos with thrust on the gpu
hyperbolic2346 Oct 30, 2021
630222a
fixing overflow issues with large tables
hyperbolic2346 Nov 4, 2021
5e66d7c
fixing includes for java
hyperbolic2346 Nov 9, 2021
37feaa1
updating from review comments
hyperbolic2346 Nov 16, 2021
e09ab04
Merge branch 'mwilson/row_conversion' of github.com:hyperbolic2346/cu…
hyperbolic2346 Nov 16, 2021
27d44d9
Updating from review comments
hyperbolic2346 Nov 17, 2021
3a48844
removing odd size writing since destination is now padded
hyperbolic2346 Nov 18, 2021
2844240
Merge remote-tracking branch 'upstream/branch-22.02' into mwilson/row…
hyperbolic2346 Nov 22, 2021
9770def
Merge remote-tracking branch 'upstream/branch-22.02' into mwilson/row…
hyperbolic2346 Nov 24, 2021
7595eaf
performance improvements
hyperbolic2346 Dec 7, 2021
74afad7
Update java/src/main/native/src/row_conversion.cu
hyperbolic2346 Dec 15, 2021
652fc33
Merge remote-tracking branch 'upstream/branch-22.02' into mwilson/row…
hyperbolic2346 Dec 21, 2021
57a84e4
changes from review comments
hyperbolic2346 Dec 23, 2021
7fbe10d
removing commented out code
hyperbolic2346 Dec 23, 2021
d47360d
updating from review comments
hyperbolic2346 Jan 4, 2022
9b50271
Updating namespace
hyperbolic2346 Jan 4, 2022
fb7566c
updating namespace
hyperbolic2346 Jan 4, 2022
5e1cf97
Update java/src/main/native/src/row_conversion.cu
hyperbolic2346 Jan 6, 2022
a1e3545
moving to a constant iterator and other review cleanup
hyperbolic2346 Jan 7, 2022
8ef3bbe
Merge branch 'mwilson/row_conversion' of github.com:hyperbolic2346/cu…
hyperbolic2346 Jan 7, 2022
4c750a9
Removing magic numbers per review comments
hyperbolic2346 Jan 10, 2022
0d0015a
removing magic number 2
hyperbolic2346 Jan 10, 2022
9 changes: 9 additions & 0 deletions cpp/include/cudf/detail/utilities/integer_utils.hpp
@@ -62,6 +62,15 @@ inline S round_down_safe(S number_to_round, S modulus)
return rounded_down;
}

template <typename S>
constexpr inline S round_up_unsafe(S number_to_round, S modulus) noexcept
{
auto remainder = number_to_round % modulus;
if (remainder == 0) { return number_to_round; }
auto rounded_up = number_to_round - remainder + modulus;
return rounded_up;
}

/**
* Divides the left-hand-side by the right-hand-side, rounding up
* to an integral multiple of the right-hand-side, e.g. (9,5) -> 2 , (10,5) -> 2, (11,5) -> 3.
Expand Down
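The rounding arithmetic added above can be checked in isolation. The following is a minimal sketch, not the library header itself: the `sketch` namespace and the `padded_bytes` helper are hypothetical names introduced here, and the reading of "unsafe" as "no overflow check" is an assumption based on the function name and body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

// Same arithmetic as the round_up_unsafe added in this diff: round
// number_to_round up to the next multiple of modulus. No overflow check.
template <typename S>
constexpr S round_up_unsafe(S number_to_round, S modulus) noexcept
{
  auto remainder = number_to_round % modulus;
  if (remainder == 0) { return number_to_round; }
  return number_to_round - remainder + modulus;
}

// Already-aligned values are returned unchanged; everything else moves up.
static_assert(round_up_unsafe(0, 64) == 0, "zero stays zero");
static_assert(round_up_unsafe(1, 64) == 64, "rounds up to one multiple");
static_assert(round_up_unsafe(64, 64) == 64, "aligned value is unchanged");
static_assert(round_up_unsafe(65, 64) == 128, "rounds up to next multiple");

// Both arguments must deduce to the same S, so a std::size_t byte count
// needs a std::size_t modulus rather than a plain int literal.
// padded_bytes is a hypothetical helper for illustration only.
constexpr std::size_t padded_bytes(std::size_t bytes)
{
  return round_up_unsafe(bytes, std::size_t{64});
}
static_assert(padded_bytes(100) == 128, "100 bytes pad to 128");

// Assumed meaning of "unsafe": near the top of the range the result
// wraps for unsigned types instead of being reported as an error.
static_assert(round_up_unsafe<std::uint8_t>(250, 64) == 0, "wraps silently");

}  // namespace sketch
```

The single template parameter is likely why the call site in `contiguous_split.cu` passes `64ul` instead of `64`: a mixed `std::size_t`/`int` call would fail template argument deduction.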
9 changes: 1 addition & 8 deletions cpp/src/copying/contiguous_split.cu
@@ -40,13 +40,6 @@ namespace {
// align all column size allocations to this boundary so that all output column buffers
// start at that alignment.
static constexpr std::size_t split_align = 64;
-inline __device__ std::size_t _round_up_safe(std::size_t number_to_round, std::size_t modulus)
-{
-auto remainder = number_to_round % modulus;
-if (remainder == 0) { return number_to_round; }
-auto rounded_up = number_to_round - remainder + modulus;
-return rounded_up;
-}

/**
* @brief Struct which contains information on a source buffer.
@@ -960,7 +953,7 @@ std::vector<packed_table> contiguous_split(cudf::table_view const& input,
std::size_t const bytes =
static_cast<std::size_t>(num_elements) * static_cast<std::size_t>(element_size);

-return dst_buf_info{_round_up_safe(bytes, 64),
+return dst_buf_info{util::round_up_unsafe(bytes, 64ul),
num_elements,
element_size,
num_rows,
Expand Down
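To illustrate why each buffer size is rounded up to the 64-byte boundary, here is a small host-side sketch of the idea stated in the `split_align` comment: if every size is padded, every buffer packed back to back starts on that boundary. The `layout` struct and `pack` function are hypothetical names for this example; the real code computes this on the GPU and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

constexpr std::size_t split_align = 64;  // same value as in the diff

// Round n up to the next multiple of modulus (no overflow check).
constexpr std::size_t round_up(std::size_t n, std::size_t modulus)
{
  return (n % modulus == 0) ? n : n - (n % modulus) + modulus;
}

struct layout {
  std::vector<std::size_t> offsets;  // start of each buffer in the allocation
  std::size_t total_bytes;           // size of the whole packed allocation
};

// Hypothetical helper: exclusive prefix sum over the padded sizes, so
// each buffer begins at a multiple of split_align.
inline layout pack(std::vector<std::size_t> const& sizes)
{
  layout out{{}, 0};
  out.offsets.reserve(sizes.size());
  for (auto bytes : sizes) {
    out.offsets.push_back(out.total_bytes);
    out.total_bytes += round_up(bytes, split_align);
  }
  return out;
}

}  // namespace sketch
```

For sizes of 10, 64 and 65 bytes, the padded sizes are 64, 64 and 128, so the buffers start at offsets 0, 64 and 128 within a 256-byte allocation.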
4 changes: 2 additions & 2 deletions java/src/main/java/ai/rapids/cudf/HostMemoryBuffer.java
@@ -393,7 +393,7 @@ public final void setInts(long offset, int[] data, long srcOffset, long len) {
*/
public final long getLong(long offset) {
long requestedAddress = this.address + offset;
-addressOutOfBoundsCheck(requestedAddress, 8, "setLong");
+addressOutOfBoundsCheck(requestedAddress, 8, "getLong");
return UnsafeMemoryAccessor.getLong(requestedAddress);
}

@@ -404,7 +404,7 @@ public final long getLong(long offset) {
*/
public final void setLong(long offset, long value) {
long requestedAddress = this.address + offset;
-addressOutOfBoundsCheck(requestedAddress, 8, "getLong");
+addressOutOfBoundsCheck(requestedAddress, 8, "setLong");
UnsafeMemoryAccessor.setLong(requestedAddress, value);
}

Expand Down
60 changes: 56 additions & 4 deletions java/src/main/java/ai/rapids/cudf/Table.java
@@ -658,8 +658,12 @@ private static native long[] scatterScalars(long[] srcScalarHandles, long scatte

private static native long[] convertToRows(long nativeHandle);

private static native long[] convertToRowsFixedWidthOptimized(long nativeHandle);

private static native long[] convertFromRows(long nativeColumnView, int[] types, int[] scale);

private static native long[] convertFromRowsFixedWidthOptimized(long nativeColumnView, int[] types, int[] scale);

private static native long[] repeatStaticCount(long tableHandle, int count);

private static native long[] repeatColumnCount(long tableHandle,
@@ -2685,6 +2689,23 @@ public GatherMap conditionalLeftAntiJoinGatherMap(Table rightTable,
return buildSemiJoinGatherMap(gatherMapData);
}

/**
* For details about how this method functions refer to
* {@link #convertToRowsFixedWidthOptimized()}.
*
* The only thing different between this method and {@link #convertToRowsFixedWidthOptimized()}
* is that this can handle roughly 250M columns while {@link #convertToRowsFixedWidthOptimized()}
* can only handle fewer than 100 columns.
*/
public ColumnVector[] convertToRows() {
long[] ptrs = convertToRows(nativeHandle);
ColumnVector[] ret = new ColumnVector[ptrs.length];
for (int i = 0; i < ptrs.length; i++) {
ret[i] = new ColumnVector(ptrs[i]);
}
return ret;
}

/**
* Convert this table of columns into a row major format that is useful for interacting with other
* systems that do row major processing of the data. Currently only fixed-width column types are
@@ -2759,8 +2780,17 @@ public GatherMap conditionalLeftAntiJoinGatherMap(Table rightTable,
* There are some limits on the size of a single row. If the row is larger than 1KB this will
* throw an exception.
*/
-public ColumnVector[] convertToRows() {
-long[] ptrs = convertToRows(nativeHandle);
+public ColumnVector[] convertToRowsFixedWidthOptimized() {
+long[] ptrs = convertToRowsFixedWidthOptimized(nativeHandle);
+ColumnVector[] ret = new ColumnVector[ptrs.length];
+for (int i = 0; i < ptrs.length; i++) {
+ret[i] = new ColumnVector(ptrs[i]);
+}
+return ret;
+}
+
+public ColumnVector[] convertToRowsFixedWidthOptimized() {
Contributor

Can we have a javadoc header for this public function?

Contributor

Am I reading this incorrectly? Are these the same function?
Line#2752:

public ColumnVector[] convertToRowsFixedWidthOptimized() {

Line#2761:

public ColumnVector[] convertToRowsFixedWidthOptimized() {

Should the first one of these not be convertToRows()?

+long[] ptrs = convertToRowsFixedWidthOptimized(nativeHandle);
ColumnVector[] ret = new ColumnVector[ptrs.length];
for (int i = 0; i < ptrs.length; i++) {
ret[i] = new ColumnVector(ptrs[i]);
@@ -2771,13 +2801,14 @@ public ColumnVector[] convertToRows() {
/**
* Convert a column of list of bytes that is formatted like the output from `convertToRows`
* and convert it back to a table.
+*
+* NOTE: This method doesn't support nested types
+*
* @param vec the row data to process.
* @param schema the types of each column.
* @return the parsed table.
*/
public static Table convertFromRows(ColumnView vec, DType ... schema) {
-// TODO at some point we need a schema that support nesting so we can support nested types
-// TODO we will need scale at some point very soon too
int[] types = new int[schema.length];
int[] scale = new int[schema.length];
for (int i = 0; i < schema.length; i++) {
@@ -2788,6 +2819,27 @@ public static Table convertFromRows(ColumnView vec, DType ... schema) {
return new Table(convertFromRows(vec.getNativeView(), types, scale));
}

/**
* Convert a column of list of bytes that is formatted like the output from `convertToRows`
* and convert it back to a table.
*
* NOTE: This method doesn't support nested types
*
* @param vec the row data to process.
* @param schema the types of each column.
* @return the parsed table.
*/
public static Table convertFromRowsFixedWidthOptimized(ColumnView vec, DType ... schema) {
int[] types = new int[schema.length];
int[] scale = new int[schema.length];
for (int i = 0; i < schema.length; i++) {
types[i] = schema[i].typeId.nativeId;
scale[i] = schema[i].getScale();

}
return new Table(convertFromRowsFixedWidthOptimized(vec.getNativeView(), types, scale));
}

/**
* Construct a table from a packed representation.
* @param metadata host-based metadata for the table
Expand Down
41 changes: 41 additions & 0 deletions java/src/main/native/src/TableJni.cpp
@@ -2746,6 +2746,25 @@ JNIEXPORT jlongArray JNICALL Java_ai_rapids_cudf_Table_gather(JNIEnv *env, jclas
CATCH_STD(env, 0);
}

JNIEXPORT jlongArray JNICALL
Java_ai_rapids_cudf_Table_convertToRowsFixedWidthOptimized(JNIEnv *env, jclass, jlong input_table) {
JNI_NULL_CHECK(env, input_table, "input table is null", 0);

try {
cudf::jni::auto_set_device(env);
cudf::table_view *n_input_table = reinterpret_cast<cudf::table_view *>(input_table);
std::vector<std::unique_ptr<cudf::column>> cols =
cudf::java::convert_to_rows_fixed_width_optimized(*n_input_table);
int num_columns = cols.size();
cudf::jni::native_jlongArray outcol_handles(env, num_columns);
for (int i = 0; i < num_columns; i++) {
outcol_handles[i] = reinterpret_cast<jlong>(cols[i].release());
}
return outcol_handles.get_jArray();
}
CATCH_STD(env, 0);
}

JNIEXPORT jlongArray JNICALL Java_ai_rapids_cudf_Table_scatterTable(JNIEnv *env, jclass,
jlong j_input, jlong j_map,
jlong j_target,
@@ -2804,6 +2823,28 @@ JNIEXPORT jlongArray JNICALL Java_ai_rapids_cudf_Table_convertToRows(JNIEnv *env
CATCH_STD(env, 0);
}

JNIEXPORT jlongArray JNICALL Java_ai_rapids_cudf_Table_convertFromRowsFixedWidthOptimized(
JNIEnv *env, jclass, jlong input_column, jintArray types, jintArray scale) {
JNI_NULL_CHECK(env, input_column, "input column is null", 0);
JNI_NULL_CHECK(env, types, "types is null", 0);

try {
cudf::jni::auto_set_device(env);
cudf::column_view *input = reinterpret_cast<cudf::column_view *>(input_column);
cudf::lists_column_view list_input(*input);
cudf::jni::native_jintArray n_types(env, types);
cudf::jni::native_jintArray n_scale(env, scale);
std::vector<cudf::data_type> types_vec;
for (int i = 0; i < n_types.size(); i++) {
types_vec.emplace_back(cudf::jni::make_data_type(n_types[i], n_scale[i]));
}
std::unique_ptr<cudf::table> result =
cudf::java::convert_from_rows_fixed_width_optimized(list_input, types_vec);
return cudf::jni::convert_table_for_return(env, result);
}
CATCH_STD(env, 0);
}

JNIEXPORT jlongArray JNICALL Java_ai_rapids_cudf_Table_convertFromRows(JNIEnv *env, jclass,
jlong input_column,
jintArray types,
Expand Down
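The JNI functions above return columns to Java as `jlong` handles by calling `release()` on each `std::unique_ptr` and casting the raw pointer to an integer. Below is a minimal sketch of that ownership hand-off without any JNI or cudf dependency. `column_stub`, `to_handles` and `from_handle` are hypothetical names for illustration, and `std::int64_t` stands in for `jlong`.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for cudf::column, used only in this sketch.
struct column_stub {
  int id;
};

// Give up ownership of each column and hand back opaque integer handles,
// mirroring the reinterpret_cast<jlong>(cols[i].release()) loop above.
// After this call the caller of the handles is responsible for freeing them.
inline std::vector<std::int64_t> to_handles(std::vector<std::unique_ptr<column_stub>> cols)
{
  std::vector<std::int64_t> handles;
  handles.reserve(cols.size());
  for (auto& col : cols) {
    handles.push_back(reinterpret_cast<std::int64_t>(col.release()));
  }
  return handles;
}

// Re-adopt a handle so the object is destroyed exactly once.
inline std::unique_ptr<column_stub> from_handle(std::int64_t handle)
{
  return std::unique_ptr<column_stub>(reinterpret_cast<column_stub*>(handle));
}
```

In the Java code in this diff, each returned handle is wrapped in `new ColumnVector(ptrs[i])`, which presumably plays the role of `from_handle` here: whoever holds the handle owns the native object.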