Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC Rework GapEncoder example #686

Merged
merged 19 commits into from
Nov 21, 2023

Conversation

LilianBoulard
Copy link
Member

Extracted from #546.

@LilianBoulard LilianBoulard added documentation Add or improve the documentation no changelog needed labels Jul 31, 2023
@LilianBoulard LilianBoulard self-assigned this Jul 31, 2023
Copy link
Member

@jovan-stojanovic jovan-stojanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @LilianBoulard, this looks very good! A few comments.

examples/02_investigating_dirty_categories.py Outdated Show resolved Hide resolved
examples/02_investigating_dirty_categories.py Outdated Show resolved Hide resolved
examples/02_investigating_dirty_categories.py Outdated Show resolved Hide resolved
examples/02_investigating_dirty_categories.py Outdated Show resolved Hide resolved
examples/02_investigating_dirty_categories.py Outdated Show resolved Hide resolved
examples/02_investigating_dirty_categories.py Outdated Show resolved Hide resolved
Co-authored-by: Jovan Stojanovic <[email protected]>
Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think that we also need to change the index because there is a link to the example which has been renamed, no?

examples/02_feature_interpretation_with_gapencoder.py Outdated Show resolved Hide resolved
examples/02_feature_interpretation_with_gapencoder.py Outdated Show resolved Hide resolved
examples/02_feature_interpretation_with_gapencoder.py Outdated Show resolved Hide resolved
examples/02_feature_interpretation_with_gapencoder.py Outdated Show resolved Hide resolved
@jovan-stojanovic
Copy link
Member

You should adapt the example to the fact that #581 was merged

@jovan-stojanovic jovan-stojanovic changed the base branch from main to 0.4.X September 11, 2023 13:28
@jovan-stojanovic jovan-stojanovic changed the base branch from 0.4.X to main September 11, 2023 13:28
@jovan-stojanovic
Copy link
Member

jovan-stojanovic commented Sep 11, 2023

Thanks @LilianBoulard! As agreed, I will take over this PR so we move faster.

@jovan-stojanovic jovan-stojanovic changed the title Rework GapEncoder example [WIP] Rework GapEncoder example Sep 12, 2023
@jovan-stojanovic jovan-stojanovic changed the title [WIP] Rework GapEncoder example Rework GapEncoder example Sep 12, 2023
@jovan-stojanovic jovan-stojanovic changed the title Rework GapEncoder example DOC Rework GapEncoder example Sep 12, 2023
@jovan-stojanovic jovan-stojanovic self-assigned this Sep 14, 2023
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great concision effort again (-100!). My main remark regards using "dirty" for variable names and explainers. This can be easily misinterpreted, and I'd rather have "high cardinality, "high entropy" or "noisy" instead. WDYT?

Comment on lines +76 to +78
topic_labels = enc.get_feature_names_out(n_labels=3)
for k, labels in enumerate(topic_labels):
print(f"Topic n°{k}: {labels}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part (and the subsequent visualization) is great! A real wow effect 👍

@jovan-stojanovic
Copy link
Member

My main remark regards using "dirty" for variable names and explainers. This can be easily misinterpreted, and I'd rather have "high cardinality, "high entropy" or "noisy" instead. WDYT?

You have a point, looking back at the example I realize that we introduce "dirty" without explaining what it means. I will correct this.

This term "dirty" is something that was used in the literature and in dirty-cat to describe data whose source of variation can be anything, meaningful and meaningful as opposed to "noisy" for instance, which has different connotation (often synonym to meaningless) in some fields (e.g. econometrics, signal processing).

As for alternatives, I like "high cardinality" (also used in literature), "high entropy" may be less known among our users.

@jovan-stojanovic
Copy link
Member

WDYT of this version? @Vincent-Maladiere

@Vincent-Maladiere
Copy link
Member

This term "dirty" is something that was used in the literature and in dirty-cat to describe data whose source of variation can be anything

Yes, we used it in a paper, but it doesn't mean it conveys the right message 😅. External users might be surprised by this wording, like Cailean Osborne —he was in high spirits when he found out we call those "dirty"!

WDYT of this version? @Vincent-Maladiere

Well, I still prefer using "high cardinality" if you like. Let's bring this up during the next meeting.

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Vincent-Maladiere Vincent-Maladiere merged commit c4b263b into skrub-data:main Nov 21, 2023
22 checks passed
Vincent-Maladiere added a commit that referenced this pull request Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Add or improve the documentation no changelog needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants