Merge underfilled_job_title with employee_position_title in `employee_salaries` (skrub-data#581)

* Add `overload_job_titles` parameter to `fetch_employee_salaries`

* Add changelog entry

* Fix path
LilianBoulard authored and LeoGrin committed Aug 24, 2023
1 parent 32995dd commit 52b39d5
Showing 2 changed files with 20 additions and 1 deletion.
8 changes: 7 additions & 1 deletion CHANGES.rst
@@ -10,7 +10,7 @@ Ongoing development
=====================

Skrub has not been released yet. It is currently undergoing fast
-development and backward compatability is not ensured.
+development and backward compatibility is not ensured.

Major changes
-------------
@@ -120,6 +120,12 @@ Minor changes
* :class:`TableVectorizer` doesn't fail anymore if an infered type doesn't work during transform.
The new entries not matching the type are replaced by missing values. :pr:`666` by :user:`Leo Grinsztajn <LeoGrin>`

+- Dataset fetcher :func:`datasets.fetch_employee_salaries` now has a parameter
+  `overload_job_titles` to allow overloading the job titles
+  (`employee_position_title`) with the column `underfilled_job_title`,
+  which provides some more information about the job title.
+  :pr:`581` by :user:`Lilian Boulard <LilianBoulard>`

Before skrub: dirty_cat
========================

13 changes: 13 additions & 0 deletions skrub/datasets/_fetching.py
@@ -668,6 +668,7 @@ def fetch_employee_salaries(
     load_dataframe: bool = True,
     drop_linked: bool = True,
     drop_irrelevant: bool = True,
+    overload_job_titles: bool = True,
     data_directory: Path | str | None = None,
 ) -> DatasetAll | DatasetInfoOnly:
     """Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125
@@ -687,6 +688,11 @@ def fetch_employee_salaries(
         Drops column "full_name", which is usually irrelevant to the
         statistical analysis.
+    overload_job_titles : bool, default=True
+        Uses the column `underfilled_job_title` to enrich the
+        `employee_position_title` column, as it contains more detailed
+        information about the job title.
     data_directory: pathlib.Path or str, optional
         The directory where the dataset is stored.
@@ -718,6 +724,13 @@ def fetch_employee_salaries(
             )
         if drop_irrelevant:
             dataset.X.drop(["full_name"], axis=1, inplace=True)
+        if overload_job_titles:
+            dataset.X["employee_position_title"] = dataset.X[
+                "underfilled_job_title"
+            ].fillna(dataset.X["employee_position_title"])
+            dataset.X.drop(
+                labels=["underfilled_job_title"], axis="columns", inplace=True
+            )

     return dataset
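
The effect of the new `overload_job_titles` branch can be sketched on a toy DataFrame, without downloading the real data. This is only an illustration under stated assumptions: the helper name `merge_job_titles` and the sample rows are made up, and unlike the commit (which modifies `dataset.X` in place) the sketch works on a copy. The pandas calls are the same ones the commit uses: `Series.fillna` and `DataFrame.drop`.

```python
import pandas as pd

# Toy stand-in for dataset.X; the rows are invented for illustration.
X = pd.DataFrame(
    {
        "employee_position_title": [
            "Office Services Coordinator",
            "Police Officer III",
            "Bus Operator",
        ],
        "underfilled_job_title": [None, "Police Officer II", None],
    }
)


def merge_job_titles(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the `overload_job_titles` branch.

    Where `underfilled_job_title` is filled, it replaces the value in
    `employee_position_title`; where it is missing, the original
    position title is kept. The helper column is then dropped.
    """
    out = X.copy()  # the commit mutates in place; a copy is used here
    out["employee_position_title"] = out["underfilled_job_title"].fillna(
        out["employee_position_title"]
    )
    return out.drop(columns=["underfilled_job_title"])


result = merge_job_titles(X)
print(result)
```

Only the second row changes, because it is the only one with a non-missing `underfilled_job_title`. Note that `fillna` is called on the underfilled column, so that column takes precedence wherever it has a value.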
