[FEA] Requirements for Dictionary Columns #3535
Comments
Do you want the values in the dictionary sorted, or just a flag to indicate that the dictionary is ordered such that ordered comparisons "make sense"? i.e. imagine having an "unsorted" dictionary column:
Would making that column "sorted" mean making the categories |
Python currently has basically all of the functionality in the dictionary requirements but we'd love to push it down into libcudf and do things more efficiently (not use joins + sorts all over). |
So think more generally, because categorical columns have the characteristic that you can map to a value, but it should be something like
sorted representation
A flag knowing that the dictionary is sorted avoids having to remap when this has already been done. |
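To make the sorted-versus-flag question concrete, here is a minimal sketch in plain Python (not libcudf code; `encode` and `indices_preserve_order` are hypothetical helper names). It assumes a dictionary column is just a list of unique keys plus a list of integer indices, and shows that comparing indices agrees with comparing values only when the keys are sorted.

```python
def encode(values, sort_keys):
    """Dictionary-encode `values` into (keys, indices)."""
    # dict.fromkeys keeps first-appearance order and removes duplicates.
    keys = list(dict.fromkeys(values))
    if sort_keys:
        keys = sorted(keys)
    position = {key: i for i, key in enumerate(keys)}
    indices = [position[v] for v in values]
    return keys, indices


def indices_preserve_order(values, indices):
    """True if comparing two indices always agrees with comparing the values."""
    n = len(values)
    return all(
        (indices[i] < indices[j]) == (values[i] < values[j])
        for i in range(n)
        for j in range(n)
    )


values = ["pear", "apple", "pear", "fig"]
unsorted_keys, unsorted_idx = encode(values, sort_keys=False)
sorted_keys, sorted_idx = encode(values, sort_keys=True)
```

With sorted keys, an ordered comparison can run on the integer indices directly. With unsorted keys, the indices would have to be remapped (or the keys looked up) first, which is the work a "sorted" flag lets an algorithm skip.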
Dictionary columns and variable width columns are two totally distinct issues and it's going to be messy to try and discuss them both in the same issue. |
The conversation is related for us because of how we want to work with string data. I can make a separate issue, but it's the combination of these two things that makes it feasible for us. |
Representation of a column of variable width elements is already said and done. See https://github.com/rapidsai/cudf/blob/branch-0.11/cpp/docs/TRANSITIONGUIDE.md#strings-support Your requirement is that you want a variable width element column to be able to act as a dictionary for a Dictionary column. This is fine as any column type should be able to act as the dictionary. |
So from my perspective, I'd want a logical flag to indicate that a given dictionary is |
Yep, already done.
Already done.
I agree that Dictionary columns should tell you if the dictionary is sorted. However, not all columns should have this.
Agreed.
Yep. Something like:
Yes, any column can be the dictionary of a Dictionary column. |
Will there be a way to indicate to an algorithm that you would like some of the outputs to be dictionary encoded or sort dictionary encoded? |
In general the output is always a function of the input. So if your input column is a dictionary, the output would be as well (possibly sharing the same dictionary?). |
Dictionary columns are mostly a solved issue in my mind, save for two questions:
|
Memory is pretty darn precious right now. When jobs need to spill out of GPU memory, the cost is pretty huge in terms of performance and simplicity. I do like the idea of indices being any size and signed or unsigned.
I should hope so. Is there any reason to require children to be unique as opposed to shared_ptr?
|
+1, Python allows specifying arbitrary integer type for the indices and puts in a best effort to use the minimal type. |
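As a rough illustration of the "minimal type" idea, the sketch below picks the narrowest integer width that can hold every index for a given number of keys. This is an assumption about how such a choice could be made, not the actual cudf logic, and `minimal_index_type` is a hypothetical name.

```python
def minimal_index_type(num_keys, signed=False):
    """Return the name of the narrowest integer type that can index `num_keys` keys.

    Valid indices are 0 .. num_keys - 1, so an unsigned type with `bits` bits
    can address 2**bits keys and a signed one 2**(bits - 1) keys.
    """
    for bits in (8, 16, 32, 64):
        capacity = 2 ** (bits - 1) if signed else 2 ** bits
        if num_keys <= capacity:
            return ("int" if signed else "uint") + str(bits)
    raise ValueError("too many keys for a 64-bit index")


def index_bytes(num_rows, type_name):
    """Bytes needed for the indices child at the given width."""
    bits = int(type_name.lstrip("uint"))
    return num_rows * bits // 8
```

For example, one million rows over 200 distinct keys would need about 1 MB of `uint8` indices versus 4 MB of `int32` indices, which is the memory argument made above. The cost is that every row-level operation has to dispatch on the index type.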
I get that it'd be nice to have arbitrary integer types for the indices, but I don't think y'all appreciate how significant of a complication that adds to the implementation. Especially for any row-level operations.
Yes, exclusive ownership. It's much easier to reason about if you know you have sole ownership over your children so that someone doesn't go and modify your child out from underneath you. |
Revisiting the point about orderedness, I'm actually now leaning towards requiring the dictionary to always be sorted (or ordered, whatever you want to call it) (Just like NVCategory). This will make life a lot easier and preclude a lot of code that looks like this:
Furthermore, synced with @kkraus14 and Python is fine with libcudf only supporting ordered Dictionaries. While Python supports both ordered/unordered Categoricals, Python can just "lie" to libcudf and pass in an unordered Dictionary when the ordering doesn't actually matter. Otherwise, they can guard at the Python layer against operations that are illegal on unordered dictionaries (like comparisons). |
Now, let's talk about naming. I hear the names Dictionary Column, and Categorical Column being interchanged. Which one are we going to use? |
My main requirement would be for these columns to be true first-class column types, meaning they always work with a libcudf operation where the expanded, non-dictionary form of the column works. For example, I assume concatenating two tables that have a dictionary column would seamlessly handle merging the dictionaries and remapping the indices in the final dictionary column. This avoids sprinkling "do we have any column types that are going to break?" checks throughout our code like we had for NVStrings/NVCategory. Second question is when dictionary columns will appear and disappear without explicitly asking for them. For example, will the cuio loaders start returning dictionary columns sometimes? Parquet/ORC may already have the data dictionary-encoded, and it could be easier/cheaper to return it directly. Could libcudf operations implicitly convert between dictionary-encoded and expanded column forms (e.g.: distinct/groupby on a dictionary column)? I assume not, but if the interop of the two column forms is seamless with respect to libcudf operations then it should not matter. |
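The concatenation case mentioned above can be sketched as follows. This is plain Python with hypothetical helper names, assuming each dictionary column is a `(keys, indices)` pair with sorted unique keys and no nulls; it is not how libcudf implements it.

```python
def decode(column):
    """Expand a (keys, indices) pair back into plain values."""
    keys, indices = column
    return [keys[i] for i in indices]


def concatenate(a, b):
    """Concatenate two dictionary columns, merging keys and remapping indices."""
    keys_a, idx_a = a
    keys_b, idx_b = b
    # Merged dictionary: sorted union of both key sets.
    merged_keys = sorted(set(keys_a) | set(keys_b))
    position = {key: i for i, key in enumerate(merged_keys)}

    def remap(keys, indices):
        # Translate each old index into its position in the merged keys.
        return [position[keys[i]] for i in indices]

    return merged_keys, remap(keys_a, idx_a) + remap(keys_b, idx_b)


a = (["apple", "pear"], [1, 0, 1])
b = (["fig", "pear"], [0, 1])
merged = concatenate(a, b)
```

The property a caller cares about is that decoding the result gives the same values as concatenating the expanded columns, so code above libcudf never needs to check which form it holds.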
They will be called "Dictionary" columns at the C++ layer. Python can call them Categorical.
Yes, that will be the intent. A Dictionary column should work transparently for any libcudf operation (eventually, maybe not at first as we add initial support).
I don't expect they'd ever appear without explicitly asking for them. I can't speak to the IO readers, but for libcudf functions, if the input column is a Dictionary column, then its corresponding output would be a Dictionary column. |
FWIW, building a sorted dictionary can be much slower than unsorted (hash vs sort) |
I don't think we would return the dictionaries from the io readers: unlike column-level dictionaries, they are often per-stripe or per-rowgroup with various restrictions (though would be faster for enum-type string columns where the dictionary consists of only a few unique strings). In parquet, the dictionary size limit also means that most large datasets have a mix of dictionary and non-dictionary pages. |
That doesn't sound very democratic. Can you at least give reasons? |
The naming discussion/decision happened months ago, but probably not in a public place, so for the record:
Edit: Actually, the conversation was public: #1072 |
There are actually 3 things to name here: (1). The unique keys that make up the category values. Example string keys All three are to be columns. (1) is column of any type, (2) is an integer column, (3) is the parent column (a new type) with no data and that manages (1) and (2) as children (or whatever). I'd rather call the new column type "Dictionary" or "Category" than "DictionaryArray". But if we use "Dictionary" for the column type (3), then we should not call the keys (1) "dictionary" too. |
So a Dictionary column contains a dictionary and a set of indices. |
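The three pieces named above (keys, indices, and a parent that owns both) could be modelled like this. It is only an illustrative Python sketch with hypothetical names; the real libcudf type is a C++ column with child columns.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class DictionaryColumn:
    """Parent column: holds no data itself, only the two children."""

    keys: List[Any]      # (1) unique key values, any element type
    indices: List[int]   # (2) one integer per row, pointing into `keys`

    def __post_init__(self):
        # Every index must refer to an existing key.
        if any(not 0 <= i < len(self.keys) for i in self.indices):
            raise IndexError("index out of range of keys")

    def __len__(self):
        # The row count is the number of indices, not the number of keys.
        return len(self.indices)

    def expand(self):
        """Return the non-dictionary form of the column."""
        return [self.keys[i] for i in self.indices]


column = DictionaryColumn(keys=["a", "b"], indices=[1, 0, 1])
```

Keeping the word "keys" for child (1) leaves "Dictionary" free to name the parent type, which is the naming concern raised in the comment above.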
How about DICTIONARY_STRING column: like a string column, but instead of having offsets + character data, it would have indices, dictionary_offsets and dictionary_data ? (With ideally any operation that works on string columns would also work on dictionary_string column) |
This is already supported by the dictionary column we're designing where any column can be the dictionary. |
Just referring to the lookup table part of the dictionary as "the dictionary" is problematic when you start to define APIs as @davidwendt has in #3577 . See my review comments there, but the result is you end up with a namespace
which has a name and parameter names that do not reflect its behavior, IMO. The function takes an existing "dictionary column view", and another column ( This is unclear, unintuitive, and confusing to me. Instead, I suggest referring to things as:
Then, the above function would be
See my other review comments for related suggestions. |
Moving to 0.13. |
So in conversation with @davidwendt and @harrism, we realized that it is going to be very difficult (and in some cases impossible) to support sharing dictionary keys between columns. It will be significantly easier to simply just copy these. Without going into the full details right away (it's kind of complicated), I wanted to again poll people to see how strongly people are attached to the requirement to allow keys to be shared. Keep in mind that these dictionary key columns should usually be pretty "small", i.e., usually only columns with low cardinality are worth the effort to dictionary encode. Which means the number of unique keys (and therefore the size of the dictionary keys column) is low and therefore copying isn't too expensive. |
Seems like the copies could also be done asynchronously to overlap other work in many cases, using an internal stream (user does not need to be aware of this asynchronicity). |
By copy here I take it you mean merging the dictionaries right? |
No. Think about performing a sort on a dictionary column. You have an input and an output dictionary column. The dictionary keys are identical between the input and output, only the indices are permuted. In theory, the returned output could share the keys column with the input. Or, the keys column could be simply copied from the input to output. This is the kind of sharing vs. copying we're talking about. |
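A small sketch of the sort example, in plain Python with a hypothetical `sort_dictionary_column` helper. It assumes the keys are already sorted, so sorting the indices sorts the values, and it shows the two options under discussion: the output either shares the input's keys object or gets its own copy.

```python
def sort_dictionary_column(keys, indices, share_keys=False):
    """Sort a dictionary column; only the indices are reordered."""
    # Valid only because `keys` is sorted: index order equals value order.
    out_indices = sorted(indices)
    # Sharing returns the same keys object; copying duplicates it.
    out_keys = keys if share_keys else list(keys)
    return out_keys, out_indices


keys = ["apple", "fig", "pear"]
indices = [2, 0, 2, 1]

copied_keys, copied_idx = sort_dictionary_column(keys, indices)
shared_keys, shared_idx = sort_dictionary_column(keys, indices, share_keys=True)
```

In both cases the keys have identical contents; the difference is ownership. With a copy, the output stays valid even if the input column is later modified or destroyed, at the cost of duplicating a keys column that is expected to be small.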
The dictionary implementation in libcudf is complete barring any changes required to support cython/python. So I think this issue can be closed in favor of requests for specific changes. |
This is just what we see as an ideal state. By no means are we saying we must have all these things to implement the new version of cudf. Stars on things that will cause us big headaches.
Dictionary Requirements:
General Requirements: