Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coupling of Dictionary with Data causes issues for usage #3097

Closed
felipeblazing opened this issue Feb 6, 2019 · 8 comments
Closed

Coupling of Dictionary with Data causes issues for usage #3097

felipeblazing opened this issue Feb 6, 2019 · 8 comments
Labels
bug Something isn't working strings strings issues (C++ and Python)

Comments

@felipeblazing
Copy link

In many cases users are going to things like sort, filter group on indices from NVCategory without wanting to modify the dictionary itself. Because NVCategory contains both then it makes it difficult to do something like reordering a column without having to reconstitute a new NVCategory. Is there a way that we can separate the two? This way you could filter a column and you only have to generate new indices and know that the use the same dictionary. We could do this by modifying the current PR for NVCategory support to store an NVCategory::Dictionary or something of that nature instead of teh NVCategory itself. This woudl also probably require a utility function that could create an NVCategory from this data and this dictionary without requiring copies if we wanted to access its apis to do things like merge them.

@felipeblazing
Copy link
Author

To grant a bit more clarity. If the pointer in gdf_column is just to a dictionary then whatever we end up doing in cudf with the nvstrings_category type doesn't require any special code. So order_by woudl work out of the box if this was the case. As it stands I have to add special cose for the case that the column ordered was an GDF_NVSTRINGS_CATEGORY and make a new NVCategory with the new indices and teh values.

@davidwendt
Copy link
Contributor

davidwendt commented Feb 6, 2019

The NVCategory class only really has two member variables. The keys which is an array of strings (ordered and normally sorted). And the values which is an array of integers where each position contains the index of the key (string) it represented when the category was created. You could say NVCategory provides a convenient wrapper for these two pieces of data. If you don't want to use NVCategory, you can get the keys using the get_keys() method and the values using the get_values() method and then do whatever you want with them.
Seems you could filter, sort, order or whatever on the values (integers) and then just use the gather_strings() method at the end to get the resulting strings.

@felipeblazing
Copy link
Author

So can they share a dictionary? What I don't want to do is perform antoher gather. I have a list of indicies and I just want to say this is a new nvcategory a thin wrapper for the SAME dictionary and a NEW list of indices (e.g. say I had
{0 ,1 , 2, 2}
{"I", "love", "pies"}

I filter out values less than equal to love so i end upu with
{2, 2}
How can I make a new nvcategory that uses {2, 2} but doesnt require us to make a NEW dictionary for the new NVCategory but can rather use the old one.

@jrhemstad
Copy link
Contributor

jrhemstad commented Feb 6, 2019

We discussed having multiple instances of an NVCategory share the same dictionary, but that it would be difficult to do because modifications to one NVCategory instance then has the potential to affect any other instance that shares the NVCategory.

What we could do:

  • Store the keys as a std::shared_ptr
    • This way, multiple NVCategory instances can shared the same keys, and the last one to use them cleans it up
  • Implement a copy-on-write policy for the keys
    • It's fine for multiple NVCategory instances to share the same dictionary so long as none of them modify it. For performance, we could have it be shared so long as it's only being read from. But if there's an operation that would modify the dictionary, then the instance doing the modification would create it's own new dictionary independent from the others it was sharing with previously.

@felipeblazing
Copy link
Author

YUM YUM! that woud be amazing

@felipeblazing
Copy link
Author

So For now what I am going to do is make a wrapper function that basically performs these steps in a less efficient fashion for now making a whole new nvcategory, we can still assume its just an nvcategory and in the future we will have functions that can make its access for efficient. If we are going to be passing around shared_ptrs I will template these functions so we can change them easily when it comes to that

@davidwendt
Copy link
Contributor

Implementing the shared-ptr concept would be my preference but is not straight-forward with the current implementation. So for now, I'd like to look at just creating a new NVCategory from an existing set of keys NVStrings and values array of ints. And then work on optimizing this with shared-ptr/ref-count implementation in the future.

@mike-wendt mike-wendt transferred this issue from rapidsai/custrings Oct 15, 2019
@mike-wendt mike-wendt added the strings strings issues (C++ and Python) label Oct 15, 2019
@mike-wendt mike-wendt added the bug Something isn't working label Feb 13, 2020
@jrhemstad
Copy link
Contributor

Superseded by #3535

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working strings strings issues (C++ and Python)
Projects
None yet
Development

No branches or pull requests

4 participants