GapEncoder is slow #342

Closed
LilianBoulard opened this issue Sep 12, 2022 · 8 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@LilianBoulard
Member

LilianBoulard commented Sep 12, 2022

Following some experiments I did as part of my work on GAMA, I noticed the GapEncoder is very slow on medium to big datasets.

As discussed with @alexis-cvetkov, it's also something he noticed during his experiments.

A solution suggested by Gaël would be to early-stop the iterative process, which would make fitting finish sooner, at the cost of some accuracy.
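
As a rough illustration of the early-stopping idea (not the GapEncoder's actual code), the sketch below runs a generic iterative update and stops as soon as the relative change between two iterates drops below a tolerance, or once an iteration budget is exhausted. The names `iterate_with_early_stop`, `update`, `tol` and `max_iter` are hypothetical, and the toy update (Heron's square-root iteration) only stands in for an expensive fitting step.

```python
def iterate_with_early_stop(update, x0, tol=1e-4, max_iter=100):
    """Apply `update` repeatedly; stop early once the relative change is below `tol`.

    Returns (value, n_iterations, converged).
    """
    x = x0
    for n_iter in range(1, max_iter + 1):
        x_new = update(x)
        # Relative change between two successive iterates.
        change = abs(x_new - x) / max(abs(x), 1e-12)
        x = x_new
        if change < tol:
            return x, n_iter, True
    # The budget ran out before the tolerance was met.
    return x, max_iter, False


def heron_step(x):
    # Toy update (assumption for illustration): one Heron step towards sqrt(2).
    return 0.5 * (x + 2.0 / x)


loose = iterate_with_early_stop(heron_step, x0=10.0, tol=1e-2)
tight = iterate_with_early_stop(heron_step, x0=10.0, tol=1e-12)
print(loose)  # looser tolerance: fewer iterations, larger error
print(tight)  # tighter tolerance: more iterations, smaller error
```

The returned `converged` flag also makes it visible when a run only stopped because it hit `max_iter`, which is the situation described later in this thread.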

@LilianBoulard LilianBoulard added the bug Something isn't working label Sep 12, 2022
@LilianBoulard LilianBoulard added the help wanted Extra attention is needed label Sep 12, 2022
@GaelVaroquaux
Member

GaelVaroquaux commented Sep 12, 2022 via email

@LilianBoulard LilianBoulard moved this from Todo to In Progress in dirty_cat sprint 09/14/2022 Sep 14, 2022
@LilianBoulard
Member Author

LilianBoulard commented Sep 14, 2022

The issue might become apparent with traffic_violations. Code to reproduce:

from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_traffic_violations

ds = fetch_traffic_violations()
sv = SuperVectorizer()
sv.fit(ds.X)  # This will take a while...

print(sv.transformers)
# This should print the columns associated with the `GapEncoder`.

Later on, to reproduce without the (slight) overhead of the SuperVectorizer, simply instantiate a GapEncoder and fit it on the columns output by sv.transformers.

from dirty_cat import GapEncoder

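# `columns` is a placeholder: the data for the columns that sv.transformers assigns to the GapEncoder.
gap = GapEncoder()
gap.fit(columns)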
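
To narrow down which column is responsible for the slowdown, one option is to time the fit on each column separately. The helper below is only a sketch: `time_fit_per_column`, `fit_one`, `toy_fit` and the toy data are hypothetical names introduced for illustration. In practice `fit_one` would wrap the real encoder fit on a single column; that wiring is left out here.

```python
import time


def time_fit_per_column(fit_one, columns):
    """Call `fit_one(name, values)` for each column and record the elapsed seconds.

    `columns` maps a column name to its values.
    Returns a list of (name, seconds), slowest first.
    """
    timings = []
    for name, values in columns.items():
        start = time.perf_counter()
        fit_one(name, values)
        timings.append((name, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)


def toy_fit(name, values):
    # Toy stand-in for an encoder fit (assumption): sort the distinct values.
    return sorted(set(values))


toy_columns = {
    "state": ["MD", "VA", "DC"] * 1000,
    "description": [f"violation {i}" for i in range(3000)],
}
for name, seconds in time_fit_per_column(toy_fit, toy_columns):
    print(f"{name}: {seconds:.6f} s")
```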

@AlexandreAbraham

Can you point me to the column that takes long to encode? The script is taking forever on my machine :-/

@LilianBoulard
Member Author

Yeah that's the issue 😅
I've launched it on a server, I'll update you as soon as it's finished. Otherwise, you can try using a subset of the dataset's samples.
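
One possible way to pick such a subset reproducibly is to draw a fixed-seed sample of row positions, as sketched below. `sample_row_indices` and the row counts are hypothetical and only illustrate the idea; the resulting positions could then be used with a positional indexer such as pandas' `DataFrame.iloc`.

```python
import random


def sample_row_indices(n_rows, n_samples, seed=0):
    """Return `n_samples` distinct row positions in [0, n_rows), sorted."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_samples = min(n_samples, n_rows)  # never ask for more rows than exist
    return sorted(rng.sample(range(n_rows), n_samples))


# Hypothetical sizes, chosen only for the example.
indices = sample_row_indices(n_rows=1_000_000, n_samples=10_000)
print(len(indices), indices[:5])
```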

@LilianBoulard
Member Author

Here's the output of the above code! Sorry for the delay

[
    ("datetime", DatetimeEncoder(), ["date_of_stop", "time_of_stop"]),
    ("low_card_cat", OneHotEncoder(drop="if_binary"), ["agency", "subagency", "accident", "belts", "personal_injury", "property_damage", "fatal", "commercial_license", "hazmat", "commercial_vehicle", "alcohol", "work_zone", "search_conducted", "search_disposition", "search_outcome", "search_reason", "search_type", "search_arrest_reason", "vehicletype", "color", "article", "race", "gender", "arrest_type"]),
    ("high_card_cat", GapEncoder(n_components=30), ["seqid", "description", "location", "search_reason_for_stop", "state", "make", "model", "charge", "driver_city", "driver_state", "dl_state", "geolocation"])
]

@AlexandreAbraham

Sorry, I also forgot to report the conclusion of my experiments.
I did not find any major bottleneck in the encoder. In my experience, the GapEncoder is slow because it diverges and therefore always reaches the maximum number of iterations. Also, the longer we let it run, the worse the downstream model performs. This still needs to be checked on other datasets.

@LeoGrin
Contributor

LeoGrin commented May 17, 2023

> Sorry, I also forgot to report the conclusion of my experiments. I did not find any major bottleneck in the encoder. In my experience, the GapEncoder is slow because it diverges and therefore always reaches the maximum number of iterations. Also, the longer we let it run, the worse the downstream model performs. This still needs to be checked on other datasets.

Hi, do you remember on which dataset you saw the gap encoder diverge? I can't reproduce this on the traffic_violation dataset.

@jovan-stojanovic
Member

Fixed by #680
