GapEncoder is slow #342
Comments
We would need examples of datasets on which it was too slow, to do some empirical work.
|
The issue might become apparent with:

```python
from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_traffic_violations

ds = fetch_traffic_violations()
sv = SuperVectorizer()
sv.fit(ds.X)  # This will take a while...
print(sv.transformers)
```

This should print the columns associated with the `GapEncoder`. Later on, to reproduce without the (slight) overhead of the `SuperVectorizer`:

```python
from dirty_cat import GapEncoder

gap = GapEncoder()
gap.fit(columns)  # `columns`: the high-cardinality columns listed by the SuperVectorizer
```

|
Can you point me to the column that is slow to encode? Because the script is taking forever on my machine :-/

|
Yeah that's the issue 😅

|
Here's the output of the above code! Sorry for the delay:

```python
[
    ("datetime", DatetimeEncoder(), ["date_of_stop", "time_of_stop"]),
    ("low_card_cat", OneHotEncoder(drop="if_binary"), ["agency", "subagency", "accident", "belts", "personal_injury", "property_damage", "fatal", "commercial_license", "hazmat", "commercial_vehicle", "alcohol", "work_zone", "search_conducted", "search_disposition", "search_outcome", "search_reason", "search_type", "search_arrest_reason", "vehicletype", "color", "article", "race", "gender", "arrest_type"]),
    ("high_card_cat", GapEncoder(n_components=30), ["seqid", "description", "location", "search_reason_for_stop", "state", "make", "model", "charge", "driver_city", "driver_state", "dl_state", "geolocation"]),
]
```
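To put a number on the slowness, here is a minimal timing sketch, assuming `ds` from the snippet above and a `GapEncoder` version that accepts a DataFrame of string columns; `"description"` is one of the high-cardinality columns listed:

```python
import time

from dirty_cat import GapEncoder

gap = GapEncoder(n_components=30)

start = time.perf_counter()
gap.fit(ds.X[["description"]])  # free-text column, a likely bottleneck
print(f"GapEncoder fit took {time.perf_counter() - start:.1f}s")
```

|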
Sorry, I also forgot to report the conclusion of my experiments.

|
Hi, do you remember on which dataset you saw the `GapEncoder` diverge? I can't reproduce this on the traffic_violations dataset.

|
Fixed by #680 |
|

Following some experiments I did as part of my work on GAMA, I noticed that the `GapEncoder` is very slow on medium-to-large datasets. As discussed with @alexis-cvetkov, it's also something he noticed during his experiments.
A solution suggested by Gaël would be to early-stop the iterative process, which would make it quicker to converge, at the cost of some accuracy.
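For illustration, a sketch of what early stopping could look like from the user side, assuming the `GapEncoder` exposes `max_iter` and `tol` to bound the iterative fitting (these parameter names are an assumption, not a confirmed API):

```python
from dirty_cat import GapEncoder

# Sketch: trade some encoding accuracy for speed by stopping the
# iterative fitting earlier than the defaults would.
fast_gap = GapEncoder(
    n_components=30,
    max_iter=2,  # assumption: cap the number of passes over the data
    tol=1e-2,    # assumption: accept a looser convergence criterion
)
fast_gap.fit(ds.X[["description"]])  # `ds` as in the snippet above
```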