Handle id columns differently #585

Open · LeoGrin opened this issue Jun 9, 2023 · 5 comments
Labels: enhancement (New feature or request)

LeoGrin (Contributor) commented Jun 9, 2023

Problem Description

While trying to better understand why the GapEncoder can be very slow (#342), I found that it is at its slowest when dealing with "id" columns, which contain a lot of different n-grams.
For instance, for traffic_violations (restricted to 5000 rows), here is the time (in seconds) to fit each column:

{'seqid': 17.375709772109985,
 'description': 4.1079487800598145,
 'location': 5.461570978164673,
 'search_reason_for_stop': 1.0070619583129883,
 'state': 0.1596517562866211,
 'make': 1.005993127822876,
 'model': 1.8683619499206543,
 'charge': 1.4155240058898926,
 'driver_city': 1.2626869678497314,
 'driver_state': 0.11691594123840332,
 'dl_state': 0.125899076461792,
 'geolocation': 8.70003890991211}

(geolocation is also a weird column because it contains tuples of floats with a lot of decimals, but that's another topic)
and for drug_directory:

{'PRODUCTID': 23.933102130889893,
 'PRODUCTNDC': 9.601576089859009,
 'PROPRIETARYNAME': 5.157470941543579,
 'PROPRIETARYNAMESUFFIX': 0.5035130977630615,
 'NONPROPRIETARYNAME': 5.09909725189209,
 'DOSAGEFORMNAME': 0.6999397277832031,
 'ROUTENAME': 0.47968101501464844,
 'APPLICATIONNUMBER': 2.047283887863159,
 'LABELERNAME': 0.847294807434082,
 'SUBSTANCENAME': 4.429254055023193,
 'ACTIVE_NUMERATOR_STRENGTH': 1.4205100536346436,
 'ACTIVE_INGRED_UNIT': 0.8683860301971436,
 'PHARM_CLASSES': 3.5505928993225098}
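For reference, a sketch of how these per-column timings could be measured (it assumes dirty_cat's fetch_traffic_violations returns the table as .X; the 5000-row restriction follows the text above):

```python
# Sketch: time GapEncoder.fit on each column separately.
import time

from dirty_cat import GapEncoder
from dirty_cat.datasets import fetch_traffic_violations

df = fetch_traffic_violations().X.head(5000)  # restrict to 5000 rows
times = {}
for col in df.columns:
    start = time.perf_counter()
    GapEncoder().fit(df[[col]].astype(str))  # GapEncoder expects string input
    times[col] = time.perf_counter() - start
print(times)
```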

This is a shame, as these id columns are probably the ones least useful to encode with the GapEncoder. Maybe we shouldn't treat them like the other high-cardinality columns.

Feature Description

Detect these id columns

Several non-exclusive possibilities:

  • filter for "id" in the column name
  • compute the number of different n-grams in each column with sklearn's CountVectorizer (possibly normalized by string length); on the two examples above, the id columns stand out. See the sketch after this list.
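A minimal sketch of the second heuristic (the function name, the 3-gram size, and normalizing by the number of rows are illustrative choices, not settled decisions):

```python
# Sketch: count distinct character n-grams per column with CountVectorizer.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def ngram_diversity(column: pd.Series, ngram_range=(3, 3)) -> float:
    """Number of distinct character n-grams, normalized by the number of rows."""
    vectorizer = CountVectorizer(analyzer="char", ngram_range=ngram_range)
    vectorizer.fit(column.astype(str))
    return len(vectorizer.vocabulary_) / len(column)

# Columns scoring far above the rest (e.g. 'seqid' or 'PRODUCTID' in the
# timings above) would be flagged as id-like:
# scores = {col: ngram_diversity(df[col]) for col in df.columns}
```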

Deal with these columns

Several possibilities:

  • Drop these columns. I think people rarely want to use id columns for prediction: if the ids are completely random, they're not useful. If there is some structure, they can be useful, but also very misleading (for instance if a date is contained in the id and the same date is in another column).
  • Default to MinHashEncoder instead of GapEncoder for these columns, as interpretability is less important for them? Not super convinced. See the sketch after this list.
  • A better idea?
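The second option could be expressed with plain scikit-learn composition rather than a new API (a sketch; the column lists here are hypothetical outputs of the detection step above):

```python
# Sketch: route detected id columns to MinHashEncoder, keep GapEncoder
# for the remaining high-cardinality columns.
from sklearn.compose import ColumnTransformer
from dirty_cat import GapEncoder, MinHashEncoder

# Hypothetical outputs of the detection step:
id_columns = ["seqid"]
other_high_card_columns = ["description", "location"]

encoder = ColumnTransformer(
    [
        ("id_like", MinHashEncoder(), id_columns),
        ("high_cardinality", GapEncoder(), other_high_card_columns),
    ],
    remainder="passthrough",
)
```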

What do you think? @LilianBoulard @jovan-stojanovic, you may be interested.

Alternative Solutions

No response

Additional Context

No response

LeoGrin added the enhancement (New feature or request) label on Jun 9, 2023
jovan-stojanovic (Member) commented

Thanks @LeoGrin, great suggestion.

As a matter of fact, I think this is a very common use case. You are right that ID columns should be treated differently, and I like the idea of dropping them.

Maybe adding a warning alongside would be good, for instance:
The 'id_name' column was identified as an ID column. Use column_specific_transformers if you still wish to include it.
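A minimal sketch of emitting such a warning (the function name and exact wording are illustrative; column_specific_transformers is the mechanism named above):

```python
# Sketch: warn when a column is treated as an ID column.
import warnings

def warn_id_column(column_name: str) -> None:
    warnings.warn(
        f"The {column_name!r} column was identified as an ID column. "
        "Use column_specific_transformers if you still wish to include it."
    )
```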

The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).

GaelVaroquaux (Member) commented Jun 12, 2023 via email

LilianBoulard (Member) commented

Thanks for the analysis!

It seems to me we should optionally drop this type of column during fetching.
This is something that should be implemented as part of #581.

I don't see how it would be possible to reliably identify ID columns, as Jovan suggested.

LeoGrin (Contributor, Author) commented Jun 12, 2023

> The only challenge would be how to differentiate between an ID column and some other high cardinality column (for instance, the population of two countries is never exactly the same but is not an ID).

I agree this is a real challenge. We should definitely do this only on non-numerical columns. Weirder columns might be an issue (I'm thinking about the geolocation column in traffic_violations, which is a tuple of floats), but I'm not sure we want these columns treated as high-cardinality columns either.

> To me the real challenge is: how do we come up with a heuristic that is simple enough and somewhat reliable. It needs to be simple so that users understand it.

Agree

> It probably revolves around the number of different n-grams compared to the number of rows. Typically, on dirty categories, I expect the number of n-grams to scale roughly as the log of the number of rows (it's documented in Patricio Cerda's papers).

Thanks! I'm going to experiment with this.
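For instance, one such experiment could track how the distinct n-gram count grows with the number of rows and compare it to log(n) (a sketch; the 3-gram size and sample sizes are arbitrary assumptions):

```python
# Sketch: compare distinct n-gram growth to log(n_rows). On a dirty-category
# column the count should grow roughly like log(n); on an id column it
# should keep growing much faster.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def distinct_ngrams(values, ngram_range=(3, 3)) -> int:
    vec = CountVectorizer(analyzer="char", ngram_range=ngram_range)
    vec.fit([str(v) for v in values])
    return len(vec.vocabulary_)

# Hypothetical usage on a pandas Series `col`:
# for n in (500, 1000, 2000, 4000):
#     print(n, distinct_ngrams(col.head(n)), round(np.log(n), 1))
```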

GaelVaroquaux (Member) commented Jun 12, 2023 via email
