Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance problem: various table operations #575

Closed
5 tasks
WinPlay02 opened this issue Mar 17, 2024 · 2 comments
Closed
5 tasks

performance problem: various table operations #575

WinPlay02 opened this issue Mar 17, 2024 · 2 comments
Assignees
Labels
performance 🏃 Speed things up

Comments

@WinPlay02
Copy link
Contributor

WinPlay02 commented Mar 17, 2024

Describe the bug

Some Table operations take multiple hours to complete on large tables, although they should be much faster. When using pandas loc functionality, these operations complete in a few seconds.
These operations include:

  • filter_rows
  • group_rows_by
  • add_rows(Table)
  • sort_rows (probably, untested)
  • transform_column (probably, untested)

The common problem of these operations are, that the first step is boxing the rows to a pandas Series and wrapping them in a Row object (which both leads to a huge overhead).
Most of the time, the single row objects are only needed to be passed to the lambda function that is provided to these operations.

Possible resolutions:

  • don't guarantee that a Row is passed to the lambda functions (for filter_rows, group_rows_by, sort_rows, transform_column). Instead only pass a general type that is indexable with Strings (column names), like Mapping[str, Any] to access the values. Maybe also pass the schema, as the other needed operations on a Row object can be handled with a schema.
  • add_rows could handle adding a whole table as a special case, so they are not split into Rows first, avoiding the problem without changing the interface

To Reproduce

  1. Load a huge dataset with Table.from_csv_file
  2. Run any of the previously mentioned operations
  3. Observe that nothing happens and after a few hours the pipeline still runs

Expected behavior

Faster execution of the listed operations

Screenshots (optional)

No response

Additional Context (optional)

No response

@WinPlay02
Copy link
Contributor Author

Concrete solution for Table.add_rows (insert at the top of the method):

        if isinstance(rows, Table):
            if rows.number_of_rows == 0:
                return self
            if self.number_of_columns == 0:
                return rows
            different_column_names = set()
            different_column_names.update(set(self.column_names) - set(rows.column_names))
            if len(different_column_names) > 0:
                raise UnknownColumnNameError(
                    sorted(
                        different_column_names,
                        key={val: ix for ix, val in enumerate(self.column_names)}.__getitem__,
                    ),
                )

            new_df = pd.concat([self._data, rows._data]).infer_objects()
            new_df.columns = self.column_names
            schema = Schema.merge_multiple_schemas([self.schema, rows.schema])
            result = Table._from_pandas_dataframe(new_df, schema)

            return result

@lars-reimann lars-reimann added the performance 🏃 Speed things up label Mar 18, 2024
lars-reimann added a commit that referenced this issue Apr 3, 2024
Fixes partially #575

### Summary of Changes

Add special case to `add_rows` when adding a `Table`. 
The table is no longer split into a list of rows, instead it is directly
added to the `DataFrame`.
In special cases, this may improve the performance up to 1000x.

---------

Co-authored-by: Lars Reimann <[email protected]>
lars-reimann pushed a commit that referenced this issue Apr 17, 2024
## [0.21.0](v0.20.0...v0.21.0) (2024-04-17)

### Features

* add ARIMA model ([#577](#577)) ([8b9c7a9](8b9c7a9)), closes [#570](#570)
* Add ImageList class ([#534](#534)) ([3cb74a2](3cb74a2)), closes [#528](#528) [#599](#599) [#600](#600)
* more hash, sizeof and eq implementations ([#609](#609)) ([2bc0b0a](2bc0b0a))

### Performance Improvements

* Add special case to `Table.add_rows` to increase performance ([#608](#608)) ([ffb8304](ffb8304)), closes [#606](#606)
* improve performance of model & forward layer ([#616](#616)) ([e856cd5](e856cd5)), closes [#610](#610)
* lazily import our modules and external libraries ([#624](#624)) ([20fc313](20fc313))
* treat Tables specially when calling add_rows ([#606](#606)) ([e555b85](e555b85)), closes [#575](#575)
@lars-reimann lars-reimann self-assigned this May 7, 2024
@lars-reimann lars-reimann moved this from Backlog to In Progress in Library May 7, 2024
@lars-reimann
Copy link
Member

Fixed in #734 and #738.

@github-project-automation github-project-automation bot moved this from In Progress to ✔️ Done in Library May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance 🏃 Speed things up
Projects
Archived in project
Development

No branches or pull requests

2 participants