Merge underfilled_job_title with employee_position_title in employee_salaries #581

Merged
merged 4 commits into from
Aug 3, 2023

Conversation

LilianBoulard
Member

@LilianBoulard LilianBoulard commented Jun 8, 2023

This PR adds a parameter to fetch_employee_salaries so that the main dirty column, employee_position_title, is overloaded with underfilled_job_title, a column that (from my understanding) adds some extra information.

@LilianBoulard LilianBoulard added the enhancement New feature or request label Jun 8, 2023
@LilianBoulard LilianBoulard self-assigned this Jun 8, 2023
@GaelVaroquaux

This comment was marked as resolved.

@GaelVaroquaux
Member

I just had a look at the PR. I do not think that I am in favor of adding the new, cleaner, column.

In terms of philosophy, I would like us to try as hard as we can to improve our tools to work on the data as it is, rather than to change the data. This means that we must stare at our examples and wonder what makes them ugly, and then see if we can provide functionality to make them less ugly.

With regard to making more use of the upstream scikit-learn code: yes, I'm a thousand times in favor of doing that.

@LilianBoulard LilianBoulard changed the title from "Improve fetching" to "Merge underfilled_job_title with employee_position_title in employee_salaries" Jul 24, 2023
@LilianBoulard LilianBoulard marked this pull request as ready for review July 24, 2023 12:04
…ve_fetching

# Conflicts:
#	skrub/datasets/_fetching.py
@LilianBoulard
Member Author

I agree that we should use appropriate tools as much as we can, but in this specific instance I think merging the two columns in advance is the best option. Of course, if we have a tool designed for this type of issue down the road, we can reintroduce the separate column, but currently this merge is something we do a lot in the new examples (#546), and doing it in the fetcher would simplify them quite a bit.

@jovan-stojanovic
Member

I might have missed something, but why do you need to overwrite the employee_position_title column for simplification? It seems to work well in the first example, and I think what you are doing in #546 might work as well. For instance, it works here for the Gap example without any preprocessing.

@LilianBoulard
Member Author

To me, underfilled_job_title is a column that gives more specific information about the job title. Let me demonstrate:

>>> from skrub.datasets import fetch_employee_salaries
>>> dataset = fetch_employee_salaries()
>>> X = dataset.X  # alias
>>> # Filter, keep only the jobs that contain "Fire"
>>> X = X[X["employee_position_title"].str.contains("Fire")]
>>> X[["employee_position_title", "underfilled_job_title", "date_first_hired"]].head(10)
        employee_position_title            underfilled_job_title date_first_hired
8       Firefighter/Rescuer III  Firefighter/Rescuer I (Recruit)       12/12/2016
42      Firefighter/Rescuer III                              NaN       10/09/2006
107     Firefighter/Rescuer III                              NaN       05/08/2011
128         Fire/Rescue Captain                              NaN       02/26/1990
132     Firefighter/Rescuer III           Firefighter/Rescuer II       03/10/2014
142     Firefighter/Rescuer III                              NaN       03/17/2008
152  Master Firefighter/Rescuer                              NaN       01/30/2006
157         Fire/Rescue Captain                              NaN       09/11/2000
158     Firefighter/Rescuer III                              NaN       03/17/2008
167     Firefighter/Rescuer III           Firefighter/Rescuer II       03/10/2014

When it has a value, underfilled_job_title seems to give a more specific description of the job.
So my proposal is to overload employee_position_title with the underfilled_job_title column.

Also, for reference: https://chat.openai.com/share/d4a00de6-d10b-4c5a-af19-43757fb795cf
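
The overload proposed above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `merge_job_titles` is a hypothetical helper, the flag name `overload_job_titles` is borrowed from the commit message in this thread, and dropping the `underfilled_job_title` column after the merge is an assumption. The four rows are copied from the output shown above so that the sketch runs without downloading the dataset.

```python
import pandas as pd


def merge_job_titles(X: pd.DataFrame, overload_job_titles: bool = True) -> pd.DataFrame:
    """Hypothetical helper illustrating the overload; not skrub's implementation."""
    if not overload_job_titles:
        return X
    X = X.copy()
    # Where underfilled_job_title has a value, it replaces the position title;
    # otherwise the original employee_position_title is kept.
    X["employee_position_title"] = X["underfilled_job_title"].fillna(
        X["employee_position_title"]
    )
    # Assumption: the source column is redundant after the merge, so drop it.
    return X.drop(columns=["underfilled_job_title"])


# Rows 8, 42, 128 and 132 from the output above.
X = pd.DataFrame(
    {
        "employee_position_title": [
            "Firefighter/Rescuer III",
            "Firefighter/Rescuer III",
            "Fire/Rescue Captain",
            "Firefighter/Rescuer III",
        ],
        "underfilled_job_title": [
            "Firefighter/Rescuer I (Recruit)",
            None,
            None,
            "Firefighter/Rescuer II",
        ],
    },
    index=[8, 42, 128, 132],
)

merged = merge_job_titles(X)
print(merged["employee_position_title"].tolist())
```

With these rows, the two entries that had an underfilled title take that more specific value, while the other two keep their original position title.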

Member

@jovan-stojanovic jovan-stojanovic left a comment

OK, I see the difference, though I don't feel it's crucial. But it's good to have it as an option.
I agree with merging this with the parameter set to False by default (you can still easily enable it in the examples). WDYT?

@LilianBoulard
Member Author

You're right, it's not crucial, but it removes some boilerplate from the examples, which I think is a big benefit.
On the True/False default: realistically, these fetching functions are mainly used in examples, so it makes more sense to me for it to be True by default (since we'll set it to True in pretty much all our examples). Maybe we should get a third opinion to settle this, wdyt @Vincent-Maladiere?

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hey! The point of our examples is to showcase our features, not to explain how to use pandas or this dataset specifically, IMHO.

I agree with @LilianBoulard that we should do this quick preprocessing by default to simplify the examples, even though having it in the example is neither a big deal nor ugly.

Member

@GaelVaroquaux GaelVaroquaux left a comment

I became convinced :).

Merging, thank you!

@GaelVaroquaux GaelVaroquaux merged commit 1a6f50f into skrub-data:main Aug 3, 2023
@LilianBoulard LilianBoulard deleted the improve_fetching branch August 4, 2023 11:30
LeoGrin pushed a commit to LeoGrin/skrub that referenced this pull request Aug 24, 2023
…oyee_salaries` (skrub-data#581)

* Add `overload_job_titles` parameter to `fetch_employee_salaries`

* Add changelog entry

* Fix path