Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Requirements for Dictionary Columns #3535

Closed
felipeblazing opened this issue Dec 5, 2019 · 35 comments
Closed

[FEA] Requirements for Dictionary Columns #3535

felipeblazing opened this issue Dec 5, 2019 · 35 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@felipeblazing
Copy link

felipeblazing commented Dec 5, 2019

This is just what we see as an ideal state. By no means are we saying we must have all these things to implement the new version of cudf. Stars on things that will cause us big headaches.

Dictionary Requirements:

    • Index and dictionary data must be accessible as a column
    • Kernels need to have row level accessors in a header file so it can be compiled against other projects. This may necessitate that the views become header only so that people can write kernels that access this data. If this is not possible we might have to "copy" the functionality over and be very aware of changes to that code.
  1. We should know if the dictionary of a column is sorted with respect to its inputs. This would be useful in general for all columns. I imagine it could be something like three states (sorted unsorted unknown).
  2. Dictionary should NOT contain nulls. If it does it feels like you have TWO places to check for null values which becomes very complicated.
  3. A utlity function for converting a normal column to a dicitionary column with a sorted dictionary

General Requirements:

@felipeblazing felipeblazing added feature request New feature or request Needs Triage Need team to review and classify labels Dec 5, 2019
@datametrician
Copy link
Contributor

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Dec 5, 2019
@kkraus14
Copy link
Collaborator

kkraus14 commented Dec 5, 2019

We should know if the dictionary of a column is sorted with respect to its inputs. This would be useful in general for all columns. I imagine it could be something like three states (sorted unsorted unknown).

Do you want the values in the dictionary sorted, or just a flag to indicate that the dictionary is ordered such that ordered comparisons "make sense"? i.e. imagine having an "unsorted" dictionary column:

indices: [0, 2, 1, 1, 0, 2]
categories: ["small", "normal", "large"]

Would making that column "sorted" mean making the categories ["large", "normal", "small"] and updating the indices accordingly? Or would it mean setting a logical flag such that: ["small" < "normal" < "large"]?

@kkraus14
Copy link
Collaborator

kkraus14 commented Dec 5, 2019

Python currently has basically all of the functionality in the dictionary requirements but we'd love to push it down into libcudf and do things more efficiently (not use joins + sorts all over).

@felipeblazing
Copy link
Author

We should know if the dictionary of a column is sorted with respect to its inputs. This would be useful in general for all columns. I imagine it could be something like three states (sorted unsorted unknown).

Do you want the values in the dictionary sorted, or just a flag to indicate that the dictionary is ordered such that ordered comparisons "make sense"? i.e. imagine having an "unsorted" dictionary column:

indices: [0, 2, 1, 1, 0, 2]
categories: ["small", "normal", "large"]

Would making that column "sorted" mean making the categories ["large", "normal", "small"] and updating the indices accordingly? Or would it mean setting a logical flag such that: ["small" < "normal" < "large"]?

So think more generally because categorical colums have that characteristic that you can map to a value but it should be something like
normal dictionary

indices: [0, 2, 1, 1, 0, 2]
dictionary: [20,10,30]

sorted representation

indices: [1, 2, 0, 0, 1, 2] ==> must be remapped
dictionary: [10,20,30]  ==> must be remapped

A flag knowing that the dictionary is sorted avoids having to remap when this has already been done.

@jrhemstad
Copy link
Contributor

Dictionary columns and variable width columns are two totally distinct issues and it's going to be messy to try and discuss them both in the same issue.

@felipeblazing
Copy link
Author

Dictionary columns and variable width columns are two totally distinct issues and it's going to be messy to try and discuss them both in the same issue.

The conversation is related for us because of how we want to work with string data. I can make a seperate issue but its the combination of these two things that makes it feasible for us.

@felipeblazing felipeblazing changed the title [FEA] Blazing wishlist for Dictionary and Variable Encoded Data [FEA] Blazing wishlist for Dictionary Dec 5, 2019
@jrhemstad
Copy link
Contributor

Representation of a column of variable width elements is already said and done. See https://github.com/rapidsai/cudf/blob/branch-0.11/cpp/docs/TRANSITIONGUIDE.md#strings-support

Your requirement is that you want a variable width element column to be able to act as a dictionary for a Dictionary column. This is fine as any column type should be able to act as the dictionary.

@kkraus14
Copy link
Collaborator

kkraus14 commented Dec 5, 2019

So from my perspective, I'd want a logical flag to indicate that a given dictionary is ordered as opposed to a sorted dictionary. Then if you want a sorted dictionary it's your responsibility to call sort on the categories column and then call some type of set_categories API that updates the indices based on the new categories column. You'd indicate that it's ordered which would enable running ordered comparisons and other binary ops.

@jrhemstad
Copy link
Contributor

    • Index and dictionary data must be accessible as a column

Yep, already done.

    • Kernels need to have row level accessors in a header file so it can be compiled against other projects. This may necessitate that the views become header only so that people can write kernels that access this data. If this is not possible we might have to "copy" the functionality over and be very aware of changes to that code.

Already done.

  1. We should know if the dictionary of a column is sorted with respect to its inputs. This would be useful in general for all columns. I imagine it could be something like three states (sorted unsorted unknown).

I agree that Dictionary columns should tell you if the dictionary is sorted. However, not all columns should have this.

  1. Dictionary should NOT contain nulls. If it does it feels like you have TWO places to check for null values which becomes very complicated.

Agreed.

  1. A utlity function for converting a normal column to a dicitionary column with a sorted dictionary

Yep. Something like:

unique_ptr<column> make_dictionary(column_view c, bool ordered_dictionary);

General Requirements:

Yes, any column can be the dictionary of a Dictionary column.

@felipeblazing
Copy link
Author

Will there be a way to indicate to an algorithm that you would like some of the outputs to be dictionary encoded or sort dictionary encoded?

@jrhemstad
Copy link
Contributor

Will there be a way to indicate to an algorithm that you would like some of the outputs to be dictionary encoded or sort dictionary encoded?

In general the output is always a function of the input. So if your input column is a dictionary, the output would be as well (possibly sharing the same dictionary?).

@jrhemstad
Copy link
Contributor

jrhemstad commented Dec 5, 2019

Dictionary columns are mostly a solved issue in my mind, save for two questions:

  1. Should the type of the indices be fixed? (i.e., always int32_t like NVCategory)?
    • If it's fixed, then you lose out on the potential optimization for low cardinality columns of using a smaller type for the indices. I.e., if your column only has 30 unique values, you could use int8_t for the indices.
    • However, if it is not fixed, then we have to do a double type-dispatch for Dictionary columns (one to resolve the type of the indices, another to resolve the type of the dictionary).
    • I prefer a fixed-type for the indices. Doing that extra dispatch is going to be complicated and harmful to compile time.
  2. Can the Dictionary be shared among columns?
    • This would require adding a new member to cudf::column for the Dictionary since children are current stored as unique_ptr<column>. If shared, we'd want a shared_ptr<column> for the dictionary.

@felipeblazing
Copy link
Author

Dictionary columns are mostly a solved issue in my mind, save for two questions:

  1. Should the type of the indices be fixed? (i.e., always int32_t like NVCategory)?

Memory is pretty darn precious right now. The difference in performance when jobs need to spill out of gpu is pretty huge in terms of performance and simplicity. I do like the idea of indices being any size and signed or unsigned.

  • If it's fixed, then you lose out on the potential optimization for low cardinality columns of using a smaller type for the indices. I.e., if your column only has 30 unique values, you could use int8_t for the indices.
  • However, if it is not fixed, then we have to do a double type-dispatch for Dictionary columns (one to resolve the type of the indices, another to resolve the type of the dictionary).
  • I prefer a fixed-type for the indices. Doing that extra dispatch is going to be complicated and harmful to compile time.
  1. Can the Dictionary be shared among columns?

I should hope so. Is there any reason to requiring children to be unique as opposed to shared_ptr?

  • This would require adding a new member to cudf::column for the Dictionary since children are current stored as unique_ptr<column>. If shared, we'd want a shared_ptr<column> for the dictionary.

@kkraus14
Copy link
Collaborator

kkraus14 commented Dec 5, 2019

I do like the idea of indices being any size and signed or unsigned.

+1, Python allows specifying arbitrary integer type for the indices and puts in a best effort to use the minimal type.

@jrhemstad
Copy link
Contributor

jrhemstad commented Dec 5, 2019

I get that it'd be nice to have arbitrary integer types for the indices, but I don't think y'all appreciate how significant of a complication that adds to the implementation. Especially for any row-level operations.

Is there any reason to requiring children to be unique as opposed to shared_ptr?

Yes, exclusive ownership. It's much easier to reason about if you know you have sole ownership over your children so that someone doesn't go and modify your child out from underneath you.

@jrhemstad
Copy link
Contributor

jrhemstad commented Dec 5, 2019

Revisitng the point about orderedness, I'm actually now leaning towards requiring the dictionary to always be sorted (or ordered, whatever you want to call it) (Just like NVCategory). This will make life a lot easier and preclude a lot of code that looks like this:

if(col.sorted_dictionary)
...
else

Furthermore, synced with @kkraus14 and Python is fine with libcudf only supporting ordered Dictionaries. While Python supports both ordered/unordered Categoricals, Python can just "lie" to libcudf and pass in an unordered Dictionary when the ordering doesn't actually matter. Otherwise, they can guard at the Python layer against operations that are illegal on an unordered dictionaries (like comparisons).

@harrism
Copy link
Member

harrism commented Dec 6, 2019

Now, let's talk about naming. I hear the names Dictionary Column, and Categorical Column being interchanged. Which one are we going to use?

@harrism harrism changed the title [FEA] Blazing wishlist for Dictionary [FEA] Requirements for Dictionary Columns Dec 6, 2019
@harrism
Copy link
Member

harrism commented Dec 6, 2019

@jlowe @revans2 does your team have requirements for Dictionary/Categorical columns?

@jlowe
Copy link
Member

jlowe commented Dec 6, 2019

My main requirement would be for these columns to be true first-class column types, meaning they always work with a libcudf operation where the expanded, non-dictionary form of the column works. For example, I assume concatenating two tables that have a dictionary column would seamlessly handle merging the dictionaries and remapping the indices in the final dictionary column. This avoids sprinkling "do we have any column types that are going to break?" checks throughout our code like we had for NVStrings/NVCategory.

Second question is when dictionary columns will appear and disappear without explicitly asking for them. For example, will the cuio loaders start returning dictionary columns sometimes? Parquet/ORC may already have the data dictionary-encoded, and it could be easier/cheaper to return it directly. Could libcudf operations implicitly convert between dictionary-encoded and expanded column forms (e.g.: distinct/groupby on a dictionary column)? I assume not, but if the interop of the two column forms is seamless with respect to libcudf operations then it should not matter.

@jrhemstad
Copy link
Contributor

Now, let's talk about naming. I hear the names Dictionary Column, and Categorical Column being interchanged. Which one are we going to use?

They will be called "Dictionary" columns at the C++ layer. Python can call them Categorical.

My main requirement would be for these columns to be true first-class column types

Yes, that will be the intent. A Dictionary column should work transparently for any libcudf operation (eventually, maybe not at first as we add initial support).

Second question is when dictionary columns will appear and disappear without explicitly asking for them.

I don't expect they'd ever appear without explicitly asking for them. I can't speak to the IO readers, but for libcudf functions, if the input column is a Dictionary column, then it's corresponding output would be a Dictionary column.

@OlivierNV
Copy link
Contributor

FWIW, building a sorted dictionary can be much slower than unsorted (hash vs sort)

@OlivierNV
Copy link
Contributor

OlivierNV commented Dec 7, 2019

I don't think we would return the dictionaries from the io readers: unlike column-level dictionaries, they are often per-stripe or per-rowgroup with various restrictions (though would be faster for enum-type string columns where the dictionary consists of only a few unique strings). In parquet, the dictionary size limit also means that most large datasets have a mix of dictionary and non-dictionary pages.
[Edit] On the other hand, it certainly would be beneficial to go straight from an orc/parquet dictionary to a dictionary column whenever possible, so we'd probably want to start with an inefficient unsorted dictionary that may contain duplicates and optimize the dictionary as a separate step.

@harrism
Copy link
Member

harrism commented Dec 9, 2019

They will be called "Dictionary" columns at the C++ layer. Python can call them Categorical.

That doesn't sound very democratic. Can you at least give reasons?

@jrhemstad
Copy link
Contributor

jrhemstad commented Dec 9, 2019

They will be called "Dictionary" columns at the C++ layer. Python can call them Categorical.

That doesn't sound very democratic. Can you at least give reasons?

The naming discussion/decision happened months ago, but probably not a in a public place, so for the record:

  1. "Dictionary" is a more intuitive and self-describing name than "Category"
  2. "Dictionary" is the term Arrow uses

Edit: Actually, the conversation was public: #1072

@davidwendt
Copy link
Contributor

There are actually 3 things to name here:

(1). The unique keys that make up the category values. Example string keys ['foo', 'bar', 'baz']
(2). The integer values that are indices into the keys vector. Example [1,0,2,1,1]
(3). The container that includes 1 and 2.

All three are to be columns. (1) is column of any type, (2) is an integer column, (3) is the parent column (a new type) with no data and that manages (1) and (2) as children (or whatever).
From what I can tell, the Arrow page referenced above names (1) dictionary, (2) indices, and (3) DictionaryArray

I'd rather call the new column type "Dictionary" or "Category" than "DictionaryArray". But if we use "Dictionary" for the column type (3), then we should not call the keys (1) "dictionary" too.

@jrhemstad
Copy link
Contributor

There are actually 3 things to name here:

(1). The unique keys that make up the category values. Example string keys ['foo', 'bar', 'baz']
(2). The integer values that are indices into the keys vector. Example [1,0,2,1,1]
(3). The container that includes 1 and 2.

All three are to be columns. (1) is column of any type, (2) is an integer column, (3) is the parent column (a new type) with no data and that manages (1) and (2) as children (or whatever).
From what I can tell, the Arrow page referenced above names (1) dictionary, (2) indices, and (3) DictionaryArray

I'd rather call the new column type "Dictionary" or "Category" than "DictionaryArray". But if we use "Dictionary" for the column type (3), then we should not call the keys (1) "dictionary" too.

  1. The dictionary
  2. indices
  3. Dictionary Column, or a column whose type is DICTIONARY

So a Dictionary column contains a dictionary and a set of indices.

@OlivierNV
Copy link
Contributor

How about DICTIONARY_STRING column: like a string column, but instead of having offsets + character data, it would have indices, dictionary_offsets and dictionary_data ? (With ideally any operation that works on string columns would also work on dictionary_string column)

@jrhemstad
Copy link
Contributor

How about DICTIONARY_STRING column: like a string column, but instead of having offsets + character data, it would have indices, dictionary_offsets and dictionary_data ? (With ideally any operation that works on string columns would also work on dictionary_string column)

This is already supported by the dictionary column we're designing where any column can be the dictionary.

@harrism
Copy link
Member

harrism commented Dec 11, 2019

Just referring to the lookup table part of the dictionary as "the dictionary" is problematic when you start to define APIs as @davidwendt has in #3577 . See my review comments there, but the result is you end up with a namespace dictionary that has functions like

cudf::dictionary::add_dictionary( dictionary_column_view const& dictionary_column,
                                        column_view const& dictionary, ...)

which has a name and parameter names that do not reflect its behavior, IMO. The function takes an existing "dictionary column view", and another column (dictionary) of the same type as "the dictionary" inside dictionary_column. The function adds inserts all the values in dictionary to "the dictionary" inside dictionary_column.

This is unclear, unintuitive, and confusing to me.

Instead, I suggest referring to things as:

  1. keys.
  2. indices.
  3. Dictionary Column, or a column of type DICTIONARY.

Then, the above function would be

cudf::dictionary::add_keys(dictionary_column_view const& dictionary, column_view const& new_keys)

See my other review comments for related suggestions.

@kkraus14
Copy link
Collaborator

Moving to 0.13.

@jrhemstad
Copy link
Contributor

So in conversation with @davidwendt and @harrism, we realized that it is going to be very difficult (and in some cases impossible) to support sharing dictionary keys between columns. It will be significantly easier to simply just copy these.

Without going into the full details right away (it's kind of complicated), I wanted to again poll people to see how strongly people are attached to the requirement to allow keys to be shared.

Keep in mind that these dictionary key columns should usually be pretty "small", i.e., usually only columns with low cardinality are worth the effort to dictionary encode. Which means the number of unique keys (and therefore the size of the dictionary keys column) is low and therefore copying isn't too expensive.

@harrism
Copy link
Member

harrism commented Jan 29, 2020

Seems like the copies could also be done asynchronously to overlap other work in many cases, using an internal stream (user does not need to be aware of this asynchronicity).

@felipeblazing
Copy link
Author

By copy here I take it you mean merging the dictionaries right?

@jrhemstad
Copy link
Contributor

By copy here I take it you mean merging the dictionaries right?

No. Think about performing a sort on a dictionary column. You have an input and an output dictionary column. The dictionary keys are identical between the input and output, only the indices are permuted.

In theory, the returned output could share the keys column with the input. Or, the keys column could be simply copied from the input to output.

This is the kind of sharing vs. copying we're talking about.

@davidwendt
Copy link
Contributor

The dictionary implementation in libcudf is complete barring any changes required to support cython/python. So I think this issue can be closed in favor of requests for specific changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

8 participants