Investigate whether we can speed up operations on tabular data by using another backend #196

lars-reimann · 2023-04-17T09:26:19Z

* pandas v2.0.0 introduces pyarrow as a new backend, which is supposedly faster than numpy.
* We could also use pyarrow directly.
* We could also replace pandas with polars.

Generally, the polars interface seems nicer for us to work with than that of pandas. polars also has no index, which is closer to our design.

Tasks

* Back `Table` by `polars.DataFrame`
* Back `Row` by `polars.DataFrame`
* Back `Column` by `polars.Series`

Closes partially #196.

### Summary of Changes

* Add `polars`
* Create `ColumnType` for `polars` data type
* Create `Schema` for `polars` data frame

---------

Co-authored-by: megalinter-bot <[email protected]>

Closes partially #196. Closes #149.

### Summary of Changes

* `Row` now uses a `polars.DataFrame` instead of a `pandas.Series` to store its data. The `DataFrame` can directly store the column names.
* Remove the `__hash__` method. A `Row` can no longer be used in a `set` and as the key of a `dict`. If we find a use-case for this, we'll add it back.

---------

Co-authored-by: megalinter-bot <[email protected]>
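To illustrate what removing `__hash__` means for callers, here is a minimal sketch in plain Python. The `Row` class below is a hypothetical stand-in, not the library's actual implementation, and `to_key` is an invented helper that shows one possible workaround when a hashable key is needed.

```python
from typing import Any


class Row:
    """Hypothetical stand-in for illustration; not the library's actual Row class."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = dict(data)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Row):
            return NotImplemented
        return self._data == other._data

    # Defining __eq__ without __hash__ already makes instances unhashable;
    # setting it to None only spells that out.
    __hash__ = None  # type: ignore[assignment]

    def to_key(self) -> tuple[tuple[str, Any], ...]:
        """Hypothetical helper: a hashable snapshot of the row's items."""
        return tuple(self._data.items())


row = Row({"a": 1, "b": 2})

try:
    {row}  # building a set requires hash(row)
    hashable = True
except TypeError:
    hashable = False

# Workaround: store a hashable snapshot instead of the row itself.
seen = {row.to_key()}
```

The snapshot only works if every value in the row is itself hashable, so this is a stopgap and no substitute for a proper `__hash__`.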

## [0.11.0](v0.10.0...v0.11.0) (2023-04-21)

### Features

* `OneHotEncoder.inverse_transform` now maintains the column order from the original table ([#195](#195)) ([3ec0041](3ec0041)), closes [#109](#109) [#109](#109)
* add `plot_` prefix back to plotting methods ([#212](#212)) ([e50c3b0](e50c3b0)), closes [#211](#211)
* adjust `Column`, `Schema` and `Table` to changes in `Row` ([#216](#216)) ([ca3eebb](ca3eebb))
* back `Row` by a `polars.DataFrame` ([#214](#214)) ([62ca34d](62ca34d)), closes [#196](#196) [#149](#149)
* clean up `Row` class ([#215](#215)) ([b12fc68](b12fc68))
* convert between `Row` and `dict` ([#206](#206)) ([e98b653](e98b653)), closes [#204](#204)
* convert between a `dict` and a `Table` ([#198](#198)) ([2a5089e](2a5089e)), closes [#197](#197)
* create column types for `polars` data types ([#208](#208)) ([e18b362](e18b362)), closes [#196](#196)
* dataframe interchange protocol ([#200](#200)) ([bea976a](bea976a)), closes [#199](#199)
* move existing ML solutions into `safeds.ml.classical` package ([#213](#213)) ([655f07f](655f07f)), closes [#210](#210)

### Bug Fixes

* `table.keep_only_columns` now maps column names to correct data ([#194](#194)) ([459ab75](459ab75)), closes [#115](#115)
* typo in type hint ([#184](#184)) ([e79727d](e79727d)), closes [#180](#180)

lars-reimann · 2023-04-22T09:42:42Z

Overall, polars still needs some time to mature. For example, if a column contains mixed data and no data type is specified explicitly, polars silently replaces every value that doesn't match the inferred type with None:

```python
import polars as pl

series = pl.Series("col", [1, "a", True, None])

for value in series:
    print(value)

# None
# a
# None
# None
```

This document mentions that an Object type exists for this, but it currently has limited support.
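One way to guard against the silent replacement described above is to check the values before handing them to polars. The following is a minimal sketch in plain Python; `find_value_types` and `require_single_type` are hypothetical helper names, not part of polars or of this library.

```python
from typing import Any, Iterable


def find_value_types(values: Iterable[Any]) -> set[type]:
    """Return the distinct Python types among the non-None values.

    Uses type() rather than isinstance() so that bool and int are
    treated as different types (bool is a subclass of int).
    """
    return {type(value) for value in values if value is not None}


def require_single_type(values: list[Any]) -> list[Any]:
    """Return values unchanged, or raise if they mix several types."""
    types = find_value_types(values)
    if len(types) > 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError(f"Column contains mixed types: {names}")
    return values


# Assumption: callers run this check before constructing a series.
checked = require_single_type([1, 2, None])
```

With such a check, the mixed list `[1, "a", True, None]` from the example would raise a `TypeError` and its values would not be dropped silently.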

Likewise, other libraries would also need to support polars or the dataframe interchange protocol. Otherwise, we would still need to depend on pandas anyway.

### Summary of Changes

In #214 we changed the implementation of `Row` so its data was stored in a `polars.DataFrame`. As explained [here](#196 (comment)), `pandas` works better for us for now. We might undo this change in the future if the type inference of `polars` gets improved (or we decide to implement this ourselves).

---------

Co-authored-by: megalinter-bot <[email protected]>

lars-reimann changed the title ~~Investigate whether we can speed up operations on tabular data~~ Investigate whether we can speed up operations on tabular data by using another backend Apr 17, 2023

lars-reimann self-assigned this Apr 17, 2023

lars-reimann mentioned this issue Apr 17, 2023

Support the dataframe interchange protocol #199

Closed

lars-reimann added the performance 🏃 Speed things up label Apr 17, 2023

lars-reimann mentioned this issue Apr 18, 2023

feat: create column types for polars data types #208

Merged

lars-reimann mentioned this issue Apr 19, 2023

feat: back Row by a polars.DataFrame #214

Merged

lars-reimann removed their assignment Apr 22, 2023

lars-reimann added the wontfix This will not be worked on label Apr 22, 2023

lars-reimann closed this as not planned Apr 22, 2023

lars-reimann mentioned this issue Apr 22, 2023

refactor: use pandas to store row data #240

Merged
