Java support for casting of nested child columns [skip ci] #7417

Merged
merged 19 commits into from
Mar 8, 2021

razajafri
Contributor

@razajafri razajafri commented Feb 22, 2021

This PR adds a couple of very specialized methods that help us cast columns inside nested types.

@razajafri razajafri requested a review from a team as a code owner February 22, 2021 01:12
@github-actions github-actions bot added the Java Affects Java cuDF API. label Feb 22, 2021
@razajafri razajafri added the improvement Improvement / enhancement to an existing function label Feb 22, 2021
@razajafri
Contributor Author

@revans2 I have addressed your concerns, PTAL

*/
public ColumnView replaceChildrenWithViews(int[] indices,
ColumnView[] views) {
assert(type == DType.STRUCT || type == DType.LIST);
Contributor

There is an isNestedType in DType that I think is more appropriate. We might also want to look at supporting this for strings at some point.
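
As a rough illustration of the suggestion above, the sketch below swaps the hand-written STRUCT-or-LIST comparison for a single predicate. It does not use the real ai.rapids.cudf.DType: `SketchDType`, `NestedCheckSketch` and `requireNested` are hypothetical stand-ins, and only the names STRUCT, LIST and isNestedType come from the discussion.

```java
// Hypothetical stand-in for cuDF's DType; only STRUCT, LIST and the
// isNestedType name are taken from the review discussion.
enum SketchDType {
  INT32, STRING, LIST, STRUCT;

  // One place that knows which types carry child columns.
  boolean isNestedType() {
    return this == LIST || this == STRUCT;
  }
}

class NestedCheckSketch {
  // Replaces a check like `type == STRUCT || type == LIST` with the predicate.
  static void requireNested(SketchDType type) {
    if (!type.isNestedType()) {
      throw new IllegalArgumentException("expected a nested type, got " + type);
    }
  }

  public static void main(String[] args) {
    requireNested(SketchDType.STRUCT); // accepted
    requireNested(SketchDType.LIST);   // accepted
    try {
      requireNested(SketchDType.STRING); // rejected in this sketch
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

If strings were supported later, as the comment anticipates, only the predicate (or a sibling predicate) would need to change, not every call site.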

java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/native/src/ColumnViewJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/ColumnViewJni.cpp Outdated Show resolved Hide resolved
java/src/main/native/src/ColumnViewJni.cpp Outdated Show resolved Hide resolved
@razajafri razajafri added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Java) Reviewer labels Mar 2, 2021
java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/native/src/ColumnVectorJni.cpp Show resolved Hide resolved
java/src/main/native/src/ColumnViewJni.cpp Outdated Show resolved Hide resolved

JNI_ARG_CHECK(env, m.empty(), "One or more invalid child indices passed to be replaced", 0);

std::unique_ptr<cudf::column_view> n_new_nested(new cudf::column_view(n_col_view->type(), n_col_view->size(),
Contributor

A general question, if you'll indulge me:
This idiom is used in several places in this file. Why do we allocate a unique_ptr and immediately turn around and release() it? Why not simply return the following?

return reinterpret_cast<jlong>(
       new cudf::column_view(n_col_view->type(), 
                             n_col_view->size(),
                             nullptr, 
                             n_col_view->null_mask(), 
                             n_col_view->null_count(), 
                             n_col_view->offset(), 
                             children));

Contributor Author

I simply did it to keep up with the conventions of our fathers and their fathers before them :D

I can change that if you like

Contributor Author

I have made the change as it makes no sense to do that

m[indices[i]] = children_to_replace[i];
}

std::vector<cudf::column_view> children;
Contributor

Nitpick: new_children?

java/src/main/native/src/ColumnViewJni.cpp Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
@jlowe jlowe changed the title from "Java bindings for Decimal32 support [skip ci]" to "Java support for casting of nested child columns [skip ci]" Mar 2, 2021
@jlowe
Member

jlowe commented Mar 2, 2021

I'm not thrilled with the concept of adding nested-type specific APIs that seem very specific to a casting scenario. I think we could let callers do these "cast within nesting" and other, future view shenanigans by exposing a couple of lower-level APIs that let users build custom ColumnView instances, e.g.:

  public ColumnView(DType type, long rows, Optional<Long> nullCount,
                      DeviceMemoryBuffer dataBuffer, DeviceMemoryBuffer validityBuffer,
                      DeviceMemoryBuffer offsetBuffer) {
...
}

  public ColumnView(DType type, long rows, Optional<Long> nullCount,
                      DeviceMemoryBuffer validityBuffer, DeviceMemoryBuffer offsetBuffer,
                      ColumnView[] childViews) {
...
}

Then for these casting cases, the caller can grab the validity buffer (and offset buffer if it's a list) and use those to create a view with newly constructed child column views. This also lets us create views of non-nested types with a swapped-out validity buffer or data buffer if that becomes useful for some future case.

Thoughts?
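
To make the proposal concrete, here is a minimal sketch of the "cast within nesting" flow it describes: the caller reuses the parent's validity and offset buffers and rebuilds the parent view around a newly constructed child. Everything here is a hypothetical stand-in (`SketchBuffer`, `SketchView`, `CastWithinNestingSketch`, `withChild`); it touches no GPU memory and is not the cuDF API, only a model of the constructor shape shown above.

```java
import java.util.Optional;

// Hypothetical stand-in for a device buffer; holds no real GPU memory.
final class SketchBuffer {
  final String label;
  SketchBuffer(String label) { this.label = label; }
}

// Hypothetical stand-in for a non-owning view over (possibly nested) data.
final class SketchView {
  final String type;
  final long rows;
  final Optional<Long> nullCount;
  final SketchBuffer validity;  // null when the column has no validity mask
  final SketchBuffer offsets;   // only meaningful for list-like types
  final SketchView[] children;

  SketchView(String type, long rows, Optional<Long> nullCount,
             SketchBuffer validity, SketchBuffer offsets, SketchView[] children) {
    this.type = type;
    this.rows = rows;
    this.nullCount = nullCount;
    this.validity = validity;
    this.offsets = offsets;
    this.children = children.clone(); // defensive copy of the array only
  }
}

final class CastWithinNestingSketch {
  // Rebuilds the parent around one replaced child, reusing the parent's
  // validity and offset buffers instead of copying them.
  static SketchView withChild(SketchView parent, int index, SketchView newChild) {
    SketchView[] kids = parent.children.clone();
    kids[index] = newChild;
    return new SketchView(parent.type, parent.rows, parent.nullCount,
                          parent.validity, parent.offsets, kids);
  }

  public static void main(String[] args) {
    SketchView ints = new SketchView("INT32", 3, Optional.of(0L), null, null, new SketchView[0]);
    SketchView longs = new SketchView("INT64", 3, Optional.of(0L), null, null, new SketchView[0]);
    SketchView struct = new SketchView("STRUCT", 3, Optional.of(1L),
        new SketchBuffer("validity"), null, new SketchView[] {ints});
    SketchView rebuilt = withChild(struct, 0, longs);
    System.out.println(rebuilt.children[0].type);
  }
}
```

The same constructor shape would also cover appending or inserting children, since the caller controls the child array directly.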

@razajafri
Contributor Author

I'm not thrilled with the concept of adding nested-type specific APIs that seem very specific to a casting scenario. I think we could let callers do these "cast within nesting" and other, future view shenanigans by exposing a couple of lower-level APIs that let users build custom ColumnView instances, e.g.:

  public ColumnView(DType type, long rows, Optional<Long> nullCount,
                      DeviceMemoryBuffer dataBuffer, DeviceMemoryBuffer validityBuffer,
                      DeviceMemoryBuffer offsetBuffer) {
...
}

  public ColumnView(DType type, long rows, Optional<Long> nullCount,
                      DeviceMemoryBuffer validityBuffer, DeviceMemoryBuffer offsetBuffer,
                      ColumnView[] childViews) {
...
}

Then for these casting cases, the caller can grab the validity buffer (and offset buffer if it's a list) and use those to create a view with newly constructed child column views. This also lets us create views of non-nested types with a swapped-out validity buffer or data buffer if that becomes useful for some future case.

Thoughts?

I see the benefit of having this API. But in this case it won't help unless we make changes to our structure. I don't think it can happen with the way things are currently structured as this would mean changing the ColumnVector.getValidity to return a DeviceMemoryBuffer instead of DeviceMemoryBufferView.

Can I also bring up why we have the ColumnView as the parent of ColumnVector?

@jlowe
Member

jlowe commented Mar 3, 2021

I don't think it can happen with the way things are currently structured as this would mean changing the ColumnVector.getValidity to return a DeviceMemoryBuffer instead of DeviceMemoryBufferView.

Then use BaseDeviceMemoryBuffer in the new API so it can take buffer views or full buffers. The column view owns no memory, so BaseDeviceMemoryBuffer is more appropriate here.

Can I also bring up why we have the ColumnView as the parent of ColumnVector?

Every vector can act as a view, and we didn't want to create a delegate method in the vector to every view method.
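
The two answers above can be sketched as a small class hierarchy: a view constructor that accepts a common base buffer type takes either an owning buffer or a borrowed buffer view, and a vector that extends the view inherits every view method without delegates. All class names below are hypothetical stand-ins chosen for illustration, not the real cuDF classes.

```java
// Hypothetical stand-ins; none of these are the real cuDF classes.
abstract class SketchBaseBuffer {
  final long address;
  final long length;

  SketchBaseBuffer(long address, long length) {
    this.address = address;
    this.length = length;
  }
}

// Owns its allocation and is responsible for releasing it.
final class SketchOwnedBuffer extends SketchBaseBuffer implements AutoCloseable {
  private boolean closed = false;

  SketchOwnedBuffer(long address, long length) { super(address, length); }

  boolean isClosed() { return closed; }

  @Override
  public void close() { closed = true; }
}

// Borrows a range of someone else's allocation; nothing to release.
final class SketchBufferView extends SketchBaseBuffer {
  SketchBufferView(long address, long length) { super(address, length); }
}

// Accepting the base type lets callers pass either kind of buffer.
class SketchColumnView {
  final SketchBaseBuffer validity;

  SketchColumnView(SketchBaseBuffer validity) { this.validity = validity; }

  long validityAddress() { return validity.address; }
}

// A vector is-a view, so view methods are inherited instead of delegated.
final class SketchColumnVector extends SketchColumnView {
  SketchColumnVector(SketchOwnedBuffer validity) { super(validity); }
}

final class BufferHierarchySketch {
  public static void main(String[] args) {
    SketchColumnView fromView = new SketchColumnView(new SketchBufferView(64L, 8L));
    SketchColumnView fromVector = new SketchColumnVector(new SketchOwnedBuffer(128L, 8L));
    System.out.println(fromView.validityAddress());
    System.out.println(fromVector.validityAddress());
  }
}
```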

@revans2
Contributor

revans2 commented Mar 3, 2021

I like the idea of exposing a simple constructor for ColumnView instead. It is more flexible too, so we could do things like append column view to a struct or insert them instead of just replacing them. It would also have allowed us to do the equivalent of bit_cast ourselves.

My main concern with doing it right now is strings and lists. We want to eventually unify how they appear under the hood and if we expose APIs then it is going to be harder to change that without making breaking changes. I would prefer to have us start out with some factory methods for struct and list, but it is minor.

@razajafri
Contributor Author

razajafri commented Mar 3, 2021

I like the idea of exposing a simple constructor for ColumnView instead. It is more flexible too, so we could do things like append column view to a struct or insert them instead of just replacing them. It would also have allowed us to do the equivalent of bit_cast ourselves.

My main concern with doing it right now is strings and lists. We want to eventually unify how they appear under the hood and if we expose APIs then it is going to be harder to change that without making breaking changes. I would prefer to have us start out with some factory methods for struct and list, but it is minor.

What's the verdict on this then? Should I go ahead and make the change that @jlowe recommends now or do we wait until we have unified the lists and strings?

Also, can you please explain what you mean by unifying the Strings and Lists? I thought they are already very similar, i.e. a string col is a List<char> col.

@revans2
Contributor

revans2 commented Mar 3, 2021

Should I go ahead and make the change that @jlowe recommends now or do we wait until we have unified the lists and strings?

The only ColumnView constructor we have right now takes just an address, but the ColumnVector APIs expose all of the gory details so go ahead and just do what Jason suggested because we are going to have to modify ColumnVector anyways.

@razajafri
Contributor Author

@jlowe have I addressed all your concerns?

Member

@jlowe jlowe left a comment

Minor suggestions on comments and access scoping, but otherwise OK.

@@ -49,6 +48,57 @@ protected ColumnView(long address) {
this.nullCount = ColumnView.getNativeNullCount(viewHandle);
}

/**
* Create a new column view based off of data already on the device. Ref count on the buffers
* is not incremented and none of the underlying buffers are owned by this view. If ownership
Member

I think the comment needs to emphasize that the resulting instance is only valid as long as the underlying buffers remain valid. If someone closes those buffers while this view instance is still active and it is later accessed, the program has undefined behavior.
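
The lifetime requirement described above can be modeled in plain Java. In this sketch, which uses only hypothetical stand-in classes (`LifetimeBuffer`, `LifetimeView`, `LifetimeSketch`), a non-owning view is used strictly inside the scope of the buffer it borrows. Reading through the view after the buffer is closed throws an exception here purely to make the mistake visible; with real device memory no such check is implied, which is why the documentation has to state the contract.

```java
// Hypothetical stand-in for an owning buffer; uses heap memory, not the GPU.
final class LifetimeBuffer implements AutoCloseable {
  private long[] data;

  LifetimeBuffer(long... values) { this.data = values.clone(); }

  long read(int index) {
    // Stand-in for what would be undefined behavior on real device memory.
    if (data == null) throw new IllegalStateException("buffer already closed");
    return data[index];
  }

  @Override
  public void close() { data = null; }
}

// Non-owning view: borrows the buffer and never closes it.
final class LifetimeView {
  private final LifetimeBuffer borrowed;

  LifetimeView(LifetimeBuffer borrowed) { this.borrowed = borrowed; }

  long get(int row) { return borrowed.read(row); }
}

final class LifetimeSketch {
  // Correct usage: the view never outlives the buffer it borrows.
  static long sumWhileValid(long... values) {
    try (LifetimeBuffer buffer = new LifetimeBuffer(values)) {
      LifetimeView view = new LifetimeView(buffer);
      long sum = 0;
      for (int i = 0; i < values.length; i++) {
        sum += view.get(i);
      }
      return sum;
    }
  }

  public static void main(String[] args) {
    System.out.println(sumWhileValid(1, 2, 3));
  }
}
```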

private static long initViewHandle(DType type, int rows, int nc, DeviceMemoryBuffer dataBuffer,
DeviceMemoryBuffer validityBuffer,
DeviceMemoryBuffer offsetBuffer, long[] childHandles) {
protected static long initViewHandle(DType type, int rows, int nc,
Member

This should be package-private

}

/**
* Create a new column view based off of data already on the device. Ref count on the buffers
Member

Similar comment on emphasizing lifetime requirements on input parameters relative to this instance.

Contributor

@mythrocks mythrocks left a comment

Ah, you're no longer doing this with Java_ai_rapids_cudf_ColumnView_replaceChildrenWithViews.

LGTM.

@razajafri
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9017f22 into rapidsai:branch-0.19 Mar 8, 2021
@razajafri razajafri deleted the decimal32 branch March 8, 2021 17:22
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
This PR adds a couple of very specialized methods that help us cast columns inside nested types.

Authors:
  - Raza Jafri (@razajafri)

Approvers:
  - Robert (Bobby) Evans (@revans2)
  - Jason Lowe (@jlowe)
  - MithunR (@mythrocks)

URL: rapidsai#7417
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change
4 participants