Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: OneHotEncoder.inverse_transform now maintains the column order from the original table #195

Merged
merged 24 commits into from
Apr 18, 2023

Conversation

Marsmaennchen221
Copy link
Contributor

@Marsmaennchen221 Marsmaennchen221 commented Apr 16, 2023

Closes #109.

Summary of Changes

OneHotEncoder.inverse_transform now maintains the column order from the original table (#109)
Fixed bug with OneHotEncoder.inverse_transform to not work if not all columns were fitted
New feature columns in OneHotEncoder will now be inserted where the combined columns were in the original table

…ll columns were fitted

`OneHotEncoder.inverse_transform` now maintains the column order from the original table; added test for this behaviour (#109)
@Marsmaennchen221 Marsmaennchen221 requested a review from a team as a code owner April 16, 2023 19:12
@Marsmaennchen221 Marsmaennchen221 linked an issue Apr 16, 2023 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Apr 16, 2023

Codecov Report

Merging #195 (b1856b1) into main (bea976a) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #195      +/-   ##
==========================================
+ Coverage   93.97%   94.01%   +0.04%     
==========================================
  Files          42       42              
  Lines        1476     1486      +10     
==========================================
+ Hits         1387     1397      +10     
  Misses         89       89              
Impacted Files Coverage Δ
...ds/data/tabular/transformation/_one_hot_encoder.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@lars-reimann
Copy link
Member

lars-reimann commented Apr 16, 2023

🦙 MegaLinter status: ✅ SUCCESS

Descriptor Linter Files Fixed Errors Elapsed time
✅ PYTHON black 2 0 0 0.63s
✅ PYTHON mypy 2 0 1.65s
✅ PYTHON ruff 2 0 0 0.04s
✅ REPOSITORY git_diff yes no 0.02s

See detailed report in MegaLinter reports
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff

MegaLinter is graciously provided by OX Security

Copy link
Member

@lars-reimann lars-reimann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solving this issue is a little more complicated than that: Say you call fit with this table and run the encoder only on column col1:

col1 col2
"a" 0
"b" 1

You store the schema of this table in self._original_schema. Now you call transform with the following table:

col2 col1
0 "a"
1 "b"

If you now call inverse_transform on the result, you will get

col1 col2
"a" 0
"b" 1

because the relative order of columns is now determined by the table the transformer was fitted with. This is not the table before transform, though.

A more thorough solution would be that transform already maintains the order of columns by replacing a column with its one-hot-encoded version instead of appending all new columns at the end. For example, calling transform on

col2 col1
0 "a"
1 "b"

should result in

col2 col1_a col1_b
0 1 0
1 0 1

and calling transform on

col1 col2
"a" 0
"b" 1

should result in

col1_a col1_b col2
1 0 0
0 1 1

Then the inverse_transform can simply maintain the column order.

Copy link
Member

@lars-reimann lars-reimann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added another test that should work but doesn't.

Since the column name and the value could contain underscores, it's not possible to figure out which original column a column in the transformed table corresponds to. For example, a_b_c could correspond to the column called a_b (and value c) or a column called a (and value b_c).

Because of this, we'll need to store the mapping from original columns to transformed columns during fit (dict[str, list[str]], with original column names as keys and names of corresponding transformed columns as values). This will also be useful for #190 later. During transform and inverse_transform you can then use this mapping to figure out how the columns need to be sorted instead of relying on the startswith check.

@lars-reimann
Copy link
Member

I've created another issue (#201) about conflicting column names when using a OneHotEncoder. That can be done in a separate PR.

@lars-reimann lars-reimann merged commit 3ec0041 into main Apr 18, 2023
@lars-reimann lars-reimann deleted the 109-onehotencoder-should-maintain-column-order branch April 18, 2023 09:01
lars-reimann pushed a commit that referenced this pull request Apr 21, 2023
## [0.11.0](v0.10.0...v0.11.0) (2023-04-21)

### Features

* `OneHotEncoder.inverse_transform` now maintains the column order from the original table ([#195](#195)) ([3ec0041](3ec0041)), closes [#109](#109) [#109](#109)
* add `plot_` prefix back to plotting methods ([#212](#212)) ([e50c3b0](e50c3b0)), closes [#211](#211)
* adjust `Column`, `Schema` and `Table` to changes in `Row` ([#216](#216)) ([ca3eebb](ca3eebb))
* back `Row` by a `polars.DataFrame` ([#214](#214)) ([62ca34d](62ca34d)), closes [#196](#196) [#149](#149)
* clean up `Row` class ([#215](#215)) ([b12fc68](b12fc68))
* convert between `Row` and `dict` ([#206](#206)) ([e98b653](e98b653)), closes [#204](#204)
* convert between a `dict` and a `Table` ([#198](#198)) ([2a5089e](2a5089e)), closes [#197](#197)
* create column types for `polars` data types ([#208](#208)) ([e18b362](e18b362)), closes [#196](#196)
* dataframe interchange protocol ([#200](#200)) ([bea976a](bea976a)), closes [#199](#199)
* move existing ML solutions into `safeds.ml.classical` package ([#213](#213)) ([655f07f](655f07f)), closes [#210](#210)

### Bug Fixes

* `table.keep_only_columns` now maps column names to correct data ([#194](#194)) ([459ab75](459ab75)), closes [#115](#115)
* typo in type hint ([#184](#184)) ([e79727d](e79727d)), closes [#180](#180)
@lars-reimann
Copy link
Member

🎉 This PR is included in version 0.11.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@lars-reimann lars-reimann added the released Included in a release label Apr 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
released Included in a release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OneHotEncoder should maintain column order
3 participants