Merge underfilled_job_title with employee_position_title in `employee_salaries` (skrub-data#581)

* Add `overload_job_titles` parameter to `fetch_employee_salaries`
* Add changelog entry
* Fix path
LeoGrin · Aug 24, 2023 · 52b39d5
1 parent 32995dd
Showing 2 changed files with 20 additions and 1 deletion.
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -10,7 +10,7 @@ Ongoing development
 =====================
 
 Skrub has not been released yet. It is currently undergoing fast
-development and backward compatability is not ensured.
+development and backward compatibility is not ensured.
 
 Major changes
 -------------
@@ -120,6 +120,12 @@ Minor changes
 * :class:`TableVectorizer` doesn't fail anymore if an infered type doesn't work during transform.
   The new entries not matching the type are replaced by missing values. :pr:`666` by :user:`Leo Grinsztajn <LeoGrin>`
 
+- Dataset fetcher :func:`datasets.fetch_employee_salaries` now has a parameter
+  `overload_job_titles` to allow overloading the job titles
+  (`employee_position_title`) with the column `underfilled_job_title`,
+  which provides some more information about the job title.
+  :pr:`581` by :user:`Lilian Boulard <LilianBoulard>`
+
 Before skrub: dirty_cat
 ========================
 

diff --git a/skrub/datasets/_fetching.py b/skrub/datasets/_fetching.py
@@ -668,6 +668,7 @@ def fetch_employee_salaries(
     load_dataframe: bool = True,
     drop_linked: bool = True,
     drop_irrelevant: bool = True,
+    overload_job_titles: bool = True,
     data_directory: Path | str | None = None,
 ) -> DatasetAll | DatasetInfoOnly:
     """Fetches the employee salaries dataset (regression), available at https://openml.org/d/42125
@@ -687,6 +688,11 @@ def fetch_employee_salaries(
         Drops column "full_name", which is usually irrelevant to the
         statistical analysis.
 
+    overload_job_titles : bool, default=True
+        Uses the column `underfilled_job_title` to enrich the
+        `employee_position_title` column, as it contains more detailed
+        information about the job title.
+
     data_directory: pathlib.Path or str, optional
         The directory where the dataset is stored.
 
@@ -718,6 +724,13 @@ def fetch_employee_salaries(
             )
         if drop_irrelevant:
             dataset.X.drop(["full_name"], axis=1, inplace=True)
+        if overload_job_titles:
+            dataset.X["employee_position_title"] = dataset.X[
+                "underfilled_job_title"
+            ].fillna(dataset.X["employee_position_title"])
+            dataset.X.drop(
+                labels=["underfilled_job_title"], axis="columns", inplace=True
+            )
 
     return dataset
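
The merge performed when `overload_job_titles` is true can be tried in isolation on a small frame. The following is a minimal sketch, not the library's code path: the helper name `overload_titles` and the sample rows are made up for illustration, and only the two column names come from the diff above. It uses the same `fillna` pattern as the hunk, so the underfilled title wins wherever it is present and the original position title is kept otherwise.

```python
import pandas as pd


def overload_titles(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the `overload_job_titles` branch.

    Returns a new frame instead of mutating in place (the diff uses
    inplace=True on the dataset's own frame).
    """
    out = X.copy()
    # Prefer the underfilled title; fall back to the position title
    # on rows where the underfilled title is missing.
    out["employee_position_title"] = out["underfilled_job_title"].fillna(
        out["employee_position_title"]
    )
    # The source column is redundant once merged, so drop it.
    return out.drop(columns=["underfilled_job_title"])


# Made-up sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "employee_position_title": [
            "Office Services Coordinator",
            "Master Police Officer",
            "Social Worker IV",
        ],
        "underfilled_job_title": [None, "Police Officer III", None],
    }
)

merged = overload_titles(sample)
print(merged["employee_position_title"].tolist())
```

Only the second row has a non-missing `underfilled_job_title`, so only that row's `employee_position_title` changes; the other two rows keep their original values.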