Investigate whether we can speed up operations on tabular data by using another backend #196

Closed
3 tasks
lars-reimann opened this issue Apr 17, 2023 · 1 comment
Labels
performance 🏃 Speed things up wontfix This will not be worked on

Comments

@lars-reimann
Member

lars-reimann commented Apr 17, 2023

  • pandas v2.0.0 introduces pyarrow as a new backend, which is supposedly faster than numpy
  • We could also use pyarrow directly
  • We could also replace pandas by polars

Generally, the interface of polars seems to be nicer to work with for us than pandas. They also do not have an index, which is closer to our design, too.

Tasks

  • Back Table by polars.DataFrame
  • Back Row by polars.DataFrame
  • Back Column by polars.Series
@lars-reimann lars-reimann changed the title from "Investigate whether we can speed up operations on tabular data" to "Investigate whether we can speed up operations on tabular data by using another backend" Apr 17, 2023
@lars-reimann lars-reimann self-assigned this Apr 17, 2023
@lars-reimann lars-reimann added the performance 🏃 Speed things up label Apr 17, 2023
lars-reimann added a commit that referenced this issue Apr 18, 2023
Closes partially #196.

### Summary of Changes

* Add `polars`
* Create `ColumnType` for `polars` data type
* Create `Schema` for `polars` data frame

---------

Co-authored-by: megalinter-bot <[email protected]>
lars-reimann added a commit that referenced this issue Apr 19, 2023
Closes partially #196.
Closes #149.

### Summary of Changes

* `Row` now uses a `polars.DataFrame` instead of a `pandas.Series` to
store its data. The `DataFrame` can directly store the column names.
* Remove the `__hash__` method. A `Row` can no longer be used in a `set`
and as the key of a `dict`. If we find a use-case for this, we'll add it
back.
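
In Python, a class that defines `__eq__` without `__hash__` is unhashable, which is the practical effect of removing the method. The sketch below uses a hypothetical stand-in class `RowSketch` and a hypothetical helper `hashable_key`; neither is the actual `Row` implementation, and it assumes the row data can be exposed as a `dict` with hashable values. It shows the failure and one possible workaround:

```python
class RowSketch:
    """Hypothetical stand-in for a row; not the library's actual Row class."""

    def __init__(self, data: dict):
        self._data = dict(data)

    def __eq__(self, other):
        if not isinstance(other, RowSketch):
            return NotImplemented
        return self._data == other._data

    # No __hash__ is defined. Because __eq__ is defined, Python sets
    # __hash__ to None, so instances cannot go into a set or be dict keys.

    def to_dict(self) -> dict:
        return dict(self._data)


def hashable_key(row: RowSketch) -> tuple:
    """Build a hashable key from a row (assumes all values are hashable)."""
    return tuple(row.to_dict().items())


row = RowSketch({"a": 1, "b": "x"})

try:
    {row}  # raises TypeError: unhashable type
    used_in_set = True
except TypeError:
    used_in_set = False

# Workaround: store a tuple of (column, value) pairs instead of the row.
seen = {hashable_key(row)}
```

If hashing rows turns out to be needed, a key like this could serve until a proper `__hash__` is added back.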

---------

Co-authored-by: megalinter-bot <[email protected]>
lars-reimann pushed a commit that referenced this issue Apr 21, 2023
## [0.11.0](v0.10.0...v0.11.0) (2023-04-21)

### Features

* `OneHotEncoder.inverse_transform` now maintains the column order from the original table ([#195](#195)) ([3ec0041](3ec0041)), closes [#109](#109)
* add `plot_` prefix back to plotting methods ([#212](#212)) ([e50c3b0](e50c3b0)), closes [#211](#211)
* adjust `Column`, `Schema` and `Table` to changes in `Row` ([#216](#216)) ([ca3eebb](ca3eebb))
* back `Row` by a `polars.DataFrame` ([#214](#214)) ([62ca34d](62ca34d)), closes [#196](#196) [#149](#149)
* clean up `Row` class ([#215](#215)) ([b12fc68](b12fc68))
* convert between `Row` and `dict` ([#206](#206)) ([e98b653](e98b653)), closes [#204](#204)
* convert between a `dict` and a `Table` ([#198](#198)) ([2a5089e](2a5089e)), closes [#197](#197)
* create column types for `polars` data types ([#208](#208)) ([e18b362](e18b362)), closes [#196](#196)
* dataframe interchange protocol ([#200](#200)) ([bea976a](bea976a)), closes [#199](#199)
* move existing ML solutions into `safeds.ml.classical` package ([#213](#213)) ([655f07f](655f07f)), closes [#210](#210)

### Bug Fixes

* `table.keep_only_columns` now maps column names to correct data ([#194](#194)) ([459ab75](459ab75)), closes [#115](#115)
* typo in type hint ([#184](#184)) ([e79727d](e79727d)), closes [#180](#180)
@lars-reimann lars-reimann removed their assignment Apr 22, 2023
@lars-reimann lars-reimann added the wontfix This will not be worked on label Apr 22, 2023
@lars-reimann
Member Author

lars-reimann commented Apr 22, 2023

Overall, polars still needs some time to mature. For example, if a column contains mixed data and no data type is specified explicitly, it silently replaces values that don't match the inferred type with None:

import polars as pl

series = pl.Series("col", [1, "a", True, None])

for value in series:
    print(value)

# None
# a
# None
# None

This document mentions that an Object type exists for this, but it currently has limited support.

Likewise, other libraries need to support polars or the dataframe interchange protocol; otherwise we still need to depend on pandas anyway.
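
Whether an object takes part in the dataframe interchange protocol can be checked by looking for its `__dataframe__` entry point. The following is a minimal sketch; `supports_interchange` and `FakeFrame` are hypothetical names used only for illustration:

```python
def supports_interchange(obj) -> bool:
    """Return True if obj exposes the interchange protocol entry point."""
    return callable(getattr(obj, "__dataframe__", None))


class FakeFrame:
    """Hypothetical object that claims to support the protocol."""

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        # A real implementation would return an interchange object here.
        raise NotImplementedError("placeholder for illustration only")


frame_ok = supports_interchange(FakeFrame())
list_ok = supports_interchange([1, 2, 3])
```

A check like this only tells us that the producer side exists; the consuming library still has to accept such objects.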

@lars-reimann lars-reimann closed this as not planned Apr 22, 2023
lars-reimann added a commit that referenced this issue Apr 22, 2023
### Summary of Changes

In #214 we changed the implementation of `Row` so its data was stored in
a `polars.DataFrame`. As explained
[here](#196 (comment)),
`pandas` works better for us for now. We might undo this change in the
future if the type inference of `polars` gets improved (or we decide to
implement this ourselves).
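
As a rough idea of what implementing this ourselves might involve, the sketch below checks a column for mixed types before handing it to any backend and raises instead of silently dropping values. The function name `infer_column_type` and its behavior are assumptions for illustration, not an existing API:

```python
def infer_column_type(values):
    """Return the single Python type shared by all non-None values.

    Returns None if there are no non-None values and raises TypeError
    for mixed columns rather than silently replacing values.
    """
    types = {type(v) for v in values if v is not None}
    if not types:
        return None
    if len(types) > 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError(f"mixed column types: {names}")
    return types.pop()


uniform = infer_column_type([1, 2, None])

try:
    infer_column_type([1, "a", True, None])
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```

Note that this strict check treats `bool` and `int` as different types, so `[1, True]` counts as mixed; a real implementation would need to decide how to handle such cases.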

---------

Co-authored-by: megalinter-bot <[email protected]>