Handle id columns differently #585
Thanks @LeoGrin, great suggestion. As a matter of fact, I think this is a very common use case. You are right that ID columns should be treated differently, and I like the idea of dropping them, maybe with a warning alongside. The only challenge would be how to differentiate between an ID column and some other high-cardinality column (for instance, the populations of two countries are never exactly the same, but population is not an ID).
Great discussion.
To me the real challenge is: how do we come up with a heuristic that is simple enough and somewhat reliable? It needs to be simple so that users understand it.
It probably revolves around the number of distinct n-grams compared to the number of rows. Typically, on dirty categories, I expect the number of n-grams to scale roughly as the log of the number of rows (this is documented in Patricio Cerda's papers).
Thanks for the analysis! It seems to me we should drop this type of column, optionally, during fetching. I don't see how it would be possible to identify ID columns reliably, as Jovan suggested.
I agree this is a real challenge. We should definitely do this only on non-numerical columns. Weirder columns might be an issue (I'm thinking about the …)
Agree
Thanks! I'm going to experiment with this.
> It seems to me we should drop this type of column optionally during fetching.

That's going to solve the problem for our examples, but our users are likely to still face this problem.
Problem Description
Trying to understand better why the GapEncoder can be very slow (#342), I found that it is at its slowest when dealing with "id" columns, which contain a lot of different n-grams.
For instance, for `traffic_violations` (restricted to 5000 rows), here is the time to fit each column: [per-column timings not captured in this export]. (geolocation is also a weird column, because it's tuples of floats with a lot of decimals, but that's another topic.)
and for `drug_directory`: [per-column timings not captured in this export].

It is a shame, as they're probably the columns which are least useful to encode with the `GapEncoder`. Maybe we shouldn't count these columns like the other high-cardinality columns.
Feature Description
Detect these id columns
Different non-exclusive possibilities:
Deal with these columns
Several possibilities:
Use the `MinHashEncoder` instead of the `GapEncoder` for these columns, as interpretability is less important for them? Not super convinced.
What do you think? @LilianBoulard @jovan-stojanovic you may be interested.
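A sketch of how such routing could look, assuming a simple uniqueness-ratio check for ID-likeness (the function names, the encoder labels, and the threshold are hypothetical, not the skrub API):

```python
def choose_encoder(values, unique_ratio_threshold=0.9):
    """Pick an encoder label for one non-numeric column.

    Columns where almost every row is distinct are treated as ID-like
    and routed to a cheap hash-based encoder; other high-cardinality
    columns keep the topic-model encoder. Names are illustrative only.
    """
    unique_ratio = len(set(values)) / len(values)
    return "minhash" if unique_ratio > unique_ratio_threshold else "gap"

def plan_encoders(columns):
    """Map column name -> encoder choice for a dict of column values."""
    return {name: choose_encoder(vals) for name, vals in columns.items()}
```

The appeal of this split is that the expensive part of the `GapEncoder` (fitting a topic model) is exactly what is wasted on ID columns, while a hash-based encoder costs roughly the same regardless of cardinality.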
Alternative Solutions
No response
Additional Context
No response