Handle id columns differently #585

Open · LeoGrin opened this issue Jun 9, 2023 · 5 comments
Labels: enhancement (New feature or request)

LeoGrin (Contributor) commented Jun 9, 2023

Problem Description

While trying to better understand why the GapEncoder can be very slow (#342), I found that it is at its slowest when dealing with "id" columns, which contain a lot of different n-grams.
For instance, for traffic_violations (restricted to 5000 rows), here is the time (in seconds) to fit each column:

{'seqid': 17.375709772109985,
 'description': 4.1079487800598145,
 'location': 5.461570978164673,
 'search_reason_for_stop': 1.0070619583129883,
 'state': 0.1596517562866211,
 'make': 1.005993127822876,
 'model': 1.8683619499206543,
 'charge': 1.4155240058898926,
 'driver_city': 1.2626869678497314,
 'driver_state': 0.11691594123840332,
 'dl_state': 0.125899076461792,
 'geolocation': 8.70003890991211}

(geolocation is also a weird column because it contains tuples of floats with a lot of decimals, but that's another topic)
and for drug_directory:

{'PRODUCTID': 23.933102130889893,
 'PRODUCTNDC': 9.601576089859009,
 'PROPRIETARYNAME': 5.157470941543579,
 'PROPRIETARYNAMESUFFIX': 0.5035130977630615,
 'NONPROPRIETARYNAME': 5.09909725189209,
 'DOSAGEFORMNAME': 0.6999397277832031,
 'ROUTENAME': 0.47968101501464844,
 'APPLICATIONNUMBER': 2.047283887863159,
 'LABELERNAME': 0.847294807434082,
 'SUBSTANCENAME': 4.429254055023193,
 'ACTIVE_NUMERATOR_STRENGTH': 1.4205100536346436,
 'ACTIVE_INGRED_UNIT': 0.8683860301971436,
 'PHARM_CLASSES': 3.5505928993225098}
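For reference, a sketch of how these per-column timings could be measured (it assumes dirty_cat's fetch_traffic_violations returns the table as .X; the 5000-row restriction follows the text above):

```python
# Sketch: time GapEncoder.fit on each column separately.
import time

from dirty_cat import GapEncoder
from dirty_cat.datasets import fetch_traffic_violations

df = fetch_traffic_violations().X.head(5000)  # restrict to 5000 rows
times = {}
for col in df.columns:
    start = time.perf_counter()
    GapEncoder().fit(df[[col]].astype(str))  # GapEncoder expects string input
    times[col] = time.perf_counter() - start
print(times)
```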

This is a shame, as these id columns are probably the ones least useful to encode with the GapEncoder. Maybe we shouldn't treat them like the other high-cardinality columns.

Feature Description

Detect these id columns

Several non-exclusive possibilities:

  • filter for "id" in the column name
  • compute the number of different n-grams in each column with sklearn's CountVectorizer (possibly normalized by string length); on the two examples above, the id columns stand out. See the sketch after this list.
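A minimal sketch of the second heuristic (the function name, the 3-gram size, and normalizing by the number of rows are illustrative choices, not settled decisions):

```python
# Sketch: count distinct character n-grams per column with CountVectorizer.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def ngram_diversity(column: pd.Series, ngram_range=(3, 3)) -> float:
    """Number of distinct character n-grams, normalized by the number of rows."""
    vectorizer = CountVectorizer(analyzer="char", ngram_range=ngram_range)
    vectorizer.fit(column.astype(str))
    return len(vectorizer.vocabulary_) / len(column)

# Columns scoring far above the rest (e.g. 'seqid' or 'PRODUCTID' in the
# timings above) would be flagged as id-like:
# scores = {col: ngram_diversity(df[col]) for col in df.columns}
```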

Deal with these columns

Several possibilities:

  • Drop these columns. I think people rarely want to use id columns for prediction: if the ids are completely random, they're not useful. If there is some structure, they can be useful, but also very misleading (for instance if a date is contained in the id and the same date is in another column).
  • Default to MinHashEncoder instead of GapEncoder for these columns, as interpretability is less important for them? Not super convinced. See the sketch after this list.
  • A better idea?
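The second option could be expressed with plain scikit-learn composition rather than a new API (a sketch; the column lists here are hypothetical outputs of the detection step above):

```python
# Sketch: route detected id columns to MinHashEncoder, keep GapEncoder
# for the remaining high-cardinality columns.
from sklearn.compose import ColumnTransformer
from dirty_cat import GapEncoder, MinHashEncoder

# Hypothetical outputs of the detection step:
id_columns = ["seqid"]
other_high_card_columns = ["description", "location"]

encoder = ColumnTransformer(
    [
        ("id_like", MinHashEncoder(), id_columns),
        ("high_cardinality", GapEncoder(), other_high_card_columns),
    ],
    remainder="passthrough",
)
```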

What do you think? @LilianBoulard @jovan-stojanovic, you may be interested.

Alternative Solutions

No response

Additional Context

No response

LeoGrin added the enhancement (New feature or request) label on Jun 9, 2023
jovan-stojanovic (Member) commented

Thanks @LeoGrin, great suggestion.

As a matter of fact, I think this is a very common use case. You are right that ID columns should be treated differently, and I like the idea of dropping them.

Maybe adding a warning alongside would be good, for instance:
The 'id_name' column was identified as an ID column. Use column_specific_transformers if you still wish to include it.
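A minimal sketch of emitting such a warning (the function name and exact wording are illustrative; column_specific_transformers is the mechanism named above):

```python
# Sketch: warn when a column is treated as an ID column.
import warnings

def warn_id_column(column_name: str) -> None:
    warnings.warn(
        f"The {column_name!r} column was identified as an ID column. "
        "Use column_specific_transformers if you still wish to include it."
    )
```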

The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).

GaelVaroquaux (Member) commented Jun 12, 2023 via email

LilianBoulard (Member) commented

Thanks for the analysis!

It seems to me we should optionally drop this type of column during fetching.
This is something that should be implemented as part of #581.

I don't see how it would be possible to reliably identify ID columns, as Jovan suggested.

LeoGrin (Contributor, Author) commented Jun 12, 2023

> The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).

I agree this is a real challenge. We should definitely do this only on non-numerical columns. Weirder columns might be an issue (I'm thinking about the geolocation column in traffic_violations, which is a tuple of floats), but I'm not sure we want these columns treated as high-cardinality columns either.

> To me the real challenge is: how do we come up with a heuristic that is simple enough and somewhat reliable. It needs to be simple so that users understand it.

Agree

> It probably revolves around the number of different n-grams compared to the number of rows. Typically, on dirty categories, I expect the number of n-grams to scale roughly as the log of the number of rows (it's documented in Patricio Cerda's papers).

Thanks! I'm going to experiment with this.
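For instance, one such experiment could track how the distinct n-gram count grows with the number of rows and compare it to log(n) (a sketch; the 3-gram size and sample sizes are arbitrary assumptions):

```python
# Sketch: compare distinct n-gram growth to log(n_rows). On a dirty-category
# column the count should grow roughly like log(n); on an id column it
# should keep growing much faster.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def distinct_ngrams(values, ngram_range=(3, 3)) -> int:
    vec = CountVectorizer(analyzer="char", ngram_range=ngram_range)
    vec.fit([str(v) for v in values])
    return len(vec.vocabulary_)

# Hypothetical usage on a pandas Series `col`:
# for n in (500, 1000, 2000, 4000):
#     print(n, distinct_ngrams(col.head(n)), round(np.log(n), 1))
```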

GaelVaroquaux (Member) commented Jun 12, 2023 via email
