-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Java support for casting of nested child columns [skip ci] #7417
Conversation
@revans2 I have addressed your concerns, PTAL |
*/ | ||
public ColumnView replaceChildrenWithViews(int[] indices, | ||
ColumnView[] views) { | ||
assert(type == DType.STRUCT || type == DType.LIST); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a isNestedType
in dtype that I think is more appropriate. We might also want to look at supporting this for strings at some point.
|
||
JNI_ARG_CHECK(env, m.empty(), "One or more invalid child indices passed to be replaced", 0); | ||
|
||
std::unique_ptr<cudf::column_view> n_new_nested(new cudf::column_view(n_col_view->type(), n_col_view->size(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general question, if you'll indulge me:
This idiom is used in several places in this file. Why do we allocate a unique_ptr
and immediately turn around and release()
it? Why not simply return the following?
return reinterpret_cast<jlong>(
new cudf::column_view(n_col_view->type(),
n_col_view->size(),
nullptr,
n_col_view->null_mask(),
n_col_view->null_count(),
n_col_view->offset(),
children));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simply did it to keep up with the conventions of our fathers and their fathers before them :D
I can change that if you like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the change as it makes no sense to do that
m[indices[i]] = children_to_replace[i]; | ||
} | ||
|
||
std::vector<cudf::column_view> children; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: new_children
?
I'm not thrilled with the concept of adding nested-type specific APIs that seem very specific to a casting scenario. I think we could let callers do these "cast within nesting" and other, future view shenanigans by exposing a couple of lower-level APIs that lets users build custom public ColumnView(DType type, long rows, Optional<Long> nullCount,
DeviceMemoryBuffer dataBuffer, DeviceMemoryBuffer validityBuffer,
DeviceMemoryBuffer offsetBuffer) {
...
}
public ColumnView(DType type, long rows, Optional<Long> nullCount,
DeviceMemoryBuffer validityBuffer, DeviceMemoryBuffer offsetBuffer,
ColumnView[] childViews) {
...
} Then for these casting cases, the caller can grab the validity buffer (and offset buffer if its a list) and use those to create a view with newly constructed child column views. This also lets us create views of non-nested types with a swapped-out validity buffer or data buffer if that becomes useful for some future case. Thoughts? |
I see the benefit of having this API. But in this case it won't help unless we make changes to our structure. I don't think it can happen with the way things are currently structured as this would mean changing the Can I also bring up why we have the |
Co-authored-by: MithunR <[email protected]>
Co-authored-by: MithunR <[email protected]>
Co-authored-by: MithunR <[email protected]>
Then use
Every vector can act as a view, and we didn't want to create a delegate method in the vector to every view method. |
I like the idea of exposing a simple constructor for ColumnView instead. It is more flexible too, so we could do things like append column view to a struct or insert them instead of just replacing them. It would also have allowed us to do the equivalent of bit_cast ourselves. My main concern with doing it right now is strings and lists. We want to eventually unify how they appear under the hood and if we expose APIs then it is going to be harder to cheange that without making breaking changes. I would prefer to have us start out with some factory methods for struct and list, but it is minor. |
Whats the verdict on this then? Should I go ahead and make the change that @jlowe recommends now or do we wait until we have unified the lists and strings? Also, can you please explain what you mean by unifying the Strings and Lists? I thought they are already very similar, i.e. a string col is a |
The only ColumnView constructor we have right now takes just an address, but the ColumnVector APIs expose all of the gory details so go ahead and just do what Jason suggested because we are going to have to modify ColumnVector anyways. |
@jlowe have I addressed all your concerns? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestions on comments and access scoping, but otherwise OK.
@@ -49,6 +48,57 @@ protected ColumnView(long address) { | |||
this.nullCount = ColumnView.getNativeNullCount(viewHandle); | |||
} | |||
|
|||
/** | |||
* Create a new column view based off of data already on the device. Ref count on the buffers | |||
* is not incremented and none of the underlying buffers are owned by this view. If ownership |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment needs to emphasize that the resulting instance is only valid as long as the underlying buffers remain valid. If someone closes those buffers while this view instance is still active and it is later accessed, the program as undefined behavior.
private static long initViewHandle(DType type, int rows, int nc, DeviceMemoryBuffer dataBuffer, | ||
DeviceMemoryBuffer validityBuffer, | ||
DeviceMemoryBuffer offsetBuffer, long[] childHandles) { | ||
protected static long initViewHandle(DType type, int rows, int nc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be package-private
} | ||
|
||
/** | ||
* Create a new column view based off of data already on the device. Ref count on the buffers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment on emphasizing lifetime requirements on input parameters relative to this instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're no longer doing this with Java_ai_rapids_cudf_ColumnView_replaceChildrenWithViews
.
LGTM.
@gpucibot merge |
This PR adds a couple of very specialized methods that help us cast columns inside nested types. Authors: - Raza Jafri (@razajafri) Approvers: - Robert (Bobby) Evans (@revans2) - Jason Lowe (@jlowe) - MithunR (@mythrocks) URL: rapidsai#7417
This PR adds a couple of very specialized methods that help us cast columns inside nested types.