DOC Rework GapEncoder example #686
Thanks @LilianBoulard, this looks very good! A few comments.
Co-authored-by: Jovan Stojanovic <[email protected]>
LGTM, but I think that we also need to change the index because there is a link to the example which has been renamed, no?
Co-authored-by: Gael Varoquaux <[email protected]>
You should adapt the example to the fact that #581 was merged.
…/skrub into rework-gap-example
Thanks @LilianBoulard! As agreed, I will take over this PR so we move faster.
Great concision effort again (-100!). My main remark regards using "dirty" for variable names and explainers. This can be easily misinterpreted, and I'd rather have "high cardinality", "high entropy" or "noisy" instead. WDYT?
topic_labels = enc.get_feature_names_out(n_labels=3)
for k, labels in enumerate(topic_labels):
    print(f"Topic n°{k}: {labels}")
This part (and the subsequent visualization) is great! A real wow effect 👍
You have a point: looking back at the example, I realize that we introduce "dirty" without explaining what it means. I will correct this. The term "dirty" was used in the literature and in dirty-cat to describe data whose source of variation can be anything, meaningful or meaningless, as opposed to "noisy", for instance, which has a different connotation (often a synonym for meaningless) in some fields (e.g. econometrics, signal processing). As for alternatives, I like "high cardinality" (also used in the literature); "high entropy" may be less known among our users.
WDYT of this version? @Vincent-Maladiere
Yes, we used it in a paper, but it doesn't mean it conveys the right message 😅. External users might be surprised by this wording, like Cailean Osborne, who was in high spirits when he found out we call those "dirty"!
Well, I still prefer using "high cardinality" if you like. Let's bring this up during the next meeting.
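For readers less familiar with the two candidate terms, here is a minimal sketch of how "high cardinality" and "high entropy" could be measured on a categorical column. It is not taken from the PR or from skrub: the helper names `cardinality_ratio` and `shannon_entropy` and the two toy columns are hypothetical, made up for illustration only.

```python
import math
from collections import Counter


def cardinality_ratio(values):
    """Fraction of entries that are distinct (1.0 means every entry is unique)."""
    return len(set(values)) / len(values)


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of the entries."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical columns, made up for illustration only.
clean = ["Police Officer", "Police Officer", "Firefighter", "Firefighter"]
messy = ["Police Officer II", "police officer", "Fire Fighter", "Firefighter/Rescuer"]

for name, column in [("clean", clean), ("messy", messy)]:
    print(name, cardinality_ratio(column), shannon_entropy(column))
```

On these toy columns, the messy one scores higher on both measures, which is the property both terms are trying to name; neither measure says whether the variation is meaningful.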
LGTM!
This reverts commit c4b263b.
Extracted from #546.