Add unit context for 0-sided views #1235

Merged
merged 5 commits into from
Oct 28, 2020


sc1f
Contributor

@sc1f sc1f commented Oct 26, 2020

This PR enhances the performance of certain 0-sided Views in Perspective by 2x-10x, depending on data size.

Each View is backed by a context object, which maintains its own traversal of the underlying master table. The traversal lets the user read data out in primary-key order, lets pivots walk the underlying dataset, lets sorts be applied to the subset of data in a context, and so on. In some cases the traversal is trivial: the order in which the context reads data out is the same as the order in which data is stored in the underlying table, and the context does not have to apply any sorts, filters, or computed columns. In those cases we can skip creating the traversal entirely and avoid the overhead of storing primary keys, sorting them, and converting row indices to primary keys.

The unit context is a context object that has no traversal and reads directly from the underlying master table of the gnode. Internally, it offers the same API as all other context types. All construction of unit contexts occurs in internal code and has no bearing on the public API.
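As a rough illustration of the difference, here is a minimal pure-Python model, not the actual C++ implementation. The names `MasterTable`, `TraversalContext`, `UnitContext`, `num_rows`, and `get_data` are hypothetical and chosen only for this sketch. It shows a context that stores and sorts primary keys next to one that reads rows straight from storage behind the same interface.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


class MasterTable:
    """Hypothetical stand-in for the gnode's master table (rows in storage order)."""

    def __init__(self, pkeys: Sequence[Any], rows: Sequence[Row]) -> None:
        if len(pkeys) != len(rows):
            raise ValueError("pkeys and rows must have the same length")
        self.pkeys: List[Any] = list(pkeys)
        self.rows: List[Row] = list(rows)


class TraversalContext:
    """Keeps its own traversal: primary keys are stored, sorted, and looked up."""

    def __init__(self, table: MasterTable) -> None:
        self._table = table
        # Overhead the unit context avoids: a pkey -> storage index map
        # plus a sorted copy of every primary key.
        self._pkey_to_row = {pk: i for i, pk in enumerate(table.pkeys)}
        self._traversal = sorted(table.pkeys)

    def num_rows(self) -> int:
        return len(self._traversal)

    def get_data(self, start: int, end: int) -> List[Row]:
        # Row index -> primary key -> storage index -> row.
        return [
            self._table.rows[self._pkey_to_row[pk]]
            for pk in self._traversal[start:end]
        ]


class UnitContext:
    """No traversal: row indices map directly onto storage order."""

    def __init__(self, table: MasterTable) -> None:
        self._table = table

    def num_rows(self) -> int:
        return len(self._table.rows)

    def get_data(self, start: int, end: int) -> List[Row]:
        return self._table.rows[start:end]
```

In this model the two contexts return identical rows when primary keys follow storage order (as with an implicit index), and diverge as soon as the keys are stored out of order.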

Externally, the unit context offers a massive performance improvement in a large use case: when the View has no pivots, sorts, filters, or computed columns, and the Table does not have a user-specified index. On a Table with a user-specified index, data must be read out in primary-key order, which differs from the stored order in the master table, so the unit context does not apply there. However, this PR will allow for future improvements to this behavior.
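The eligibility rule can be written as a small predicate. This is a sketch under assumptions: `ViewConfig`, its field names, and `can_use_unit_context` are hypothetical names for illustration and do not mirror Perspective's internal types. Column selection is left out on purpose, since the commit notes say any column order or number of columns is acceptable.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ViewConfig:
    """Hypothetical view configuration; every field defaults to empty."""

    row_pivots: List[str] = field(default_factory=list)
    column_pivots: List[str] = field(default_factory=list)
    sort: List[List[str]] = field(default_factory=list)
    filter: List[List[str]] = field(default_factory=list)
    computed_columns: List[dict] = field(default_factory=list)


def can_use_unit_context(config: ViewConfig, table_index: Optional[str]) -> bool:
    """True only for a 0-sided view with no sort/filter/computed columns
    on a table without a user-specified index (assumed rule from the PR text)."""
    has_configuration = bool(
        config.row_pivots
        or config.column_pivots
        or config.sort
        or config.filter
        or config.computed_columns
    )
    return table_index is None and not has_configuration
```

For example, `can_use_unit_context(ViewConfig(), None)` is true, while a config with any sort entry, or a table created with an index column, falls back to a regular context.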

Changelog

  • Add and implement unit context for JS and Python
  • Multitude of tests for unit context in different scenarios to ensure correct behavior

Benchmarks

JavaScript benchmarks show a massive improvement in View creation time, and slight improvements in serialization time and in the time to create a delta.

[Screenshot of benchmark results, taken 2020-10-16]

In Python, where I've benchmarked this PR against much larger datasets (5 million rows), the performance of view() is almost equivalent to that of open_view(), which simply provides a handle to an already-created view on the server. Over large datasets and multiple parallel clients, the unit context massively reduces the overhead of view(), resulting in a 5x-10x improvement in performance over a regular ctx0.

Add unit context - ctx0 with no configuration at all that reads straight from gstate

WIP: use unit ctx in JS, indexed updates/removes still broken

WIP: fix JS tests

WIP: get_pkeys no longer push_back
column order no longer matters for unit context as long as num_columns == table.num_columns

more tests, print inside traversal::step_end
unit context = no pivot/sort/filter/computed, any column order/num of columns

read m_delta_pkeys instead of get_delta_pkeys()

cleanup
Implement unit context in python

add more python tests, make get_row_expanded return bool

fix windows build
@sc1f sc1f added enhancement Feature requests or improvements C++ JS Python breaking labels Oct 26, 2020
@sc1f sc1f removed the breaking label Oct 26, 2020
Member

@texodus texodus left a comment

Looks good! Thanks for the PR!

I can independently confirm the benchmark results, too. Some improvements I'd like to look into in the future from review:

  • In Emscripten I believe there is quite a bit of code generation associated with these repeated Context APIs, which leads to larger client assets in JS and WASM. Is there? If so, does embind support virtual dispatch? If not, can we perform a switch within a single dispatch C++ function so we do not need embind to generate the entire context API for each of 4 (eventually 5) context types?
  • Contexts could use a cleanup, e.g. FMODE_SIMPLE_CLAUSES, combiner, etc.
  • I concur with e.g. size() -> num_rows(), and IMO this is worth just applying consistently across the board.

auto columns = view_config->get_columns();
auto filter_op = view_config->get_filter_op();
auto fterm = view_config->get_fterm();
auto computed_columns = view_config->get_computed_columns();
Member

Can we add a t_config which initializes these to the empty values we already know these to be?

// TODO: int/float/date/datetime pkeys are already sorted here, so if
// there was a way to assert that `psp_pkey` is a string typed column,
// we can conditional the sort on whether m_sortby.size() > 0 or if
// psp_pkey is a string column.
Member

I think this still needs to be re-sorted - this std::sort just guarantees overlapping indices will be contiguous.

*
* @return t_uindex
*/
t_uindex size() const;
t_uindex num_columns() const;
Member

We may as well go all the way and apply this change to Table.size() API in JS and Python!

@texodus texodus merged commit e4bc81f into master Oct 28, 2020
@texodus texodus deleted the unit-context branch October 28, 2020 05:59
Labels
C++ enhancement Feature requests or improvements JS Python

2 participants