performance problem: various table operations #575

WinPlay02 · 2024-03-17T18:22:48Z

Describe the bug

Some Table operations take multiple hours to complete on large tables, although they should be much faster. When using pandas loc functionality, these operations complete in a few seconds.
These operations include:

The common problem of these operations are, that the first step is boxing the rows to a pandas Series and wrapping them in a Row object (which both leads to a huge overhead).
Most of the time, the single row objects are only needed to be passed to the lambda function that is provided to these operations.

Possible resolutions:

don't guarantee that a Row is passed to the lambda functions (for filter_rows, group_rows_by, sort_rows, transform_column). Instead only pass a general type that is indexable with Strings (column names), like Mapping[str, Any] to access the values. Maybe also pass the schema, as the other needed operations on a Row object can be handled with a schema.
add_rows could handle adding a whole table as a special case, so they are not split into Rows first, avoiding the problem without changing the interface

To Reproduce

Load a huge dataset with Table.from_csv_file
Run any of the previously mentioned operations
Observe that nothing happens and after a few hours the pipeline still runs

Expected behavior

Faster execution of the listed operations

Screenshots (optional)

No response

Additional Context (optional)

No response

The text was updated successfully, but these errors were encountered:

WinPlay02 · 2024-03-17T18:25:53Z

Concrete solution for Table.add_rows (insert at the top of the method):

        if isinstance(rows, Table):
            if rows.number_of_rows == 0:
                return self
            if self.number_of_columns == 0:
                return rows
            different_column_names = set()
            different_column_names.update(set(self.column_names) - set(rows.column_names))
            if len(different_column_names) > 0:
                raise UnknownColumnNameError(
                    sorted(
                        different_column_names,
                        key={val: ix for ix, val in enumerate(self.column_names)}.__getitem__,
                    ),
                )

            new_df = pd.concat([self._data, rows._data]).infer_objects()
            new_df.columns = self.column_names
            schema = Schema.merge_multiple_schemas([self.schema, rows.schema])
            result = Table._from_pandas_dataframe(new_df, schema)

            return result

Fixes partially #575 ### Summary of Changes Add special case to `add_rows` when adding a `Table`. The table is no longer split into a list of rows, instead it is directly added to the `DataFrame`. In special cases, this may improve the performance up to 1000x. --------- Co-authored-by: Lars Reimann <[email protected]>

## [0.21.0](v0.20.0...v0.21.0) (2024-04-17) ### Features * add ARIMA model ([#577](#577)) ([8b9c7a9](8b9c7a9)), closes [#570](#570) * Add ImageList class ([#534](#534)) ([3cb74a2](3cb74a2)), closes [#528](#528) [#599](#599) [#600](#600) * more hash, sizeof and eq implementations ([#609](#609)) ([2bc0b0a](2bc0b0a)) ### Performance Improvements * Add special case to `Table.add_rows` to increase performance ([#608](#608)) ([ffb8304](ffb8304)), closes [#606](#606) * improve performance of model & forward layer ([#616](#616)) ([e856cd5](e856cd5)), closes [#610](#610) * lazily import our modules and external libraries ([#624](#624)) ([20fc313](20fc313)) * treat Tables specially when calling add_rows ([#606](#606)) ([e555b85](e555b85)), closes [#575](#575)

lars-reimann · 2024-05-07T19:49:47Z

Fixed in #734 and #738.

WinPlay02 added the bug 🪲 label Mar 17, 2024

lars-reimann added this to Library Mar 17, 2024

github-project-automation bot moved this to Backlog in Library Mar 17, 2024

lars-reimann added the performance 🏃 Speed things up label Mar 18, 2024

WinPlay02 mentioned this issue Apr 3, 2024

perf: treat Tables specially when calling add_rows #606

Merged

lars-reimann self-assigned this May 7, 2024

lars-reimann moved this from Backlog to In Progress in Library May 7, 2024

lars-reimann closed this as completed May 7, 2024

github-project-automation bot moved this from In Progress to ✔️ Done in Library May 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

performance problem: various table operations #575

performance problem: various table operations #575

WinPlay02 commented Mar 17, 2024 •

edited by lars-reimann

Loading

WinPlay02 commented Mar 17, 2024

lars-reimann commented May 7, 2024

performance problem: various table operations #575

performance problem: various table operations #575

Comments

WinPlay02 commented Mar 17, 2024 • edited by lars-reimann Loading

Describe the bug

Possible resolutions:

To Reproduce

Expected behavior

Screenshots (optional)

Additional Context (optional)

WinPlay02 commented Mar 17, 2024

lars-reimann commented May 7, 2024

WinPlay02 commented Mar 17, 2024 •

edited by lars-reimann

Loading