Add unit context for 0-sided views #1235

sc1f · 2020-10-26T16:46:49Z

This PR enhances the performance of certain 0-sided Views in Perspective by 2x-10x, depending on data size.

Each View is backed by a context object, which maintains its own traversal of the underlying master table. The traversal lets the user read data out in primary-key order, lets pivots traverse the underlying dataset, lets sorts be applied to the subset of data in a context, and so on. When the traversal is trivial, meaning the order in which the context reads data out is the same as the order in which data is stored in the underlying table, and no sorts, filters, or computed columns need to be applied, we can skip creating a traversal entirely. This avoids the overhead of storing primary keys, sorting them, and converting row indices to primary keys.
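As a rough illustration of the overhead being skipped, here is a toy Python model. The class names (`MasterTable`, `TraversalContext`, `UnitContext`) and their methods are hypothetical and are not Perspective's C++ types; the sketch assumes an implicit index in which primary keys are simply row numbers.

```python
class MasterTable:
    """Toy stand-in for the gnode's master table: rows kept in storage order."""

    def __init__(self, rows):
        self.rows = list(rows)  # each row is a dict with a "pkey" field


class TraversalContext:
    """Hypothetical context that keeps its own ordered list of primary keys."""

    def __init__(self, table):
        self.table = table
        # Extra storage: a pkey -> storage index map and a sorted pkey list.
        self.index_of = {row["pkey"]: i for i, row in enumerate(table.rows)}
        self.pkeys = sorted(self.index_of)

    def get_rows(self, start, end):
        # traversal position -> primary key -> storage index -> row
        return [self.table.rows[self.index_of[pk]] for pk in self.pkeys[start:end]]


class UnitContext:
    """Hypothetical context with no traversal: reads storage order directly."""

    def __init__(self, table):
        self.table = table

    def get_rows(self, start, end):
        return self.table.rows[start:end]


# With an implicit index, both contexts return the same rows.
table = MasterTable([{"pkey": i, "x": i * 10} for i in range(5)])
same = TraversalContext(table).get_rows(1, 3) == UnitContext(table).get_rows(1, 3)
```

In this model the unit context allocates nothing per view and does no sorting, which is the kind of saving the paragraph above describes.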

The unit context is a context object that has no traversal and reads directly from the underlying master table of the gnode. Internally, it offers the same API as all other context types. All construction of unit contexts happens in internal code and has no bearing on the public API.

Externally, the unit context offers a massive performance improvement in a broad use case: when the View has no pivots, sorts, filters, or computed columns, and the Table does not have a user-specified index. On a Table with a user-specified index, data must be read out in primary-key order, which differs from the stored order in the master table, so the unit context does not apply. However, this PR will allow for future improvements to this behavior.
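The eligibility rule described above can be sketched as a small predicate. This is only an illustration: `ViewConfig`, its field names, and `can_use_unit_context` are hypothetical and do not mirror the actual C++ `t_view_config` or the binding code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ViewConfig:
    """Hypothetical, simplified view configuration."""
    row_pivots: List[str] = field(default_factory=list)
    column_pivots: List[str] = field(default_factory=list)
    sort: List[list] = field(default_factory=list)
    filter: List[list] = field(default_factory=list)
    computed_columns: List[dict] = field(default_factory=list)


def can_use_unit_context(config: ViewConfig, table_index: Optional[str]) -> bool:
    """Return True when the view can read straight from the master table."""
    if table_index is not None:
        # A user-specified index requires primary-key order, not storage order.
        return False
    return not (
        config.row_pivots
        or config.column_pivots
        or config.sort
        or config.filter
        or config.computed_columns
    )
```

For example, `can_use_unit_context(ViewConfig(), None)` is true, while adding a sort or passing a table index makes it false.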

Changelog

Add and implement unit context for JS and Python
Multitude of tests for unit context in different scenarios to ensure correct behavior

Benchmarks

JavaScript benchmarks show a massive improvement in View creation time, and slight improvements in serialization time and time to create a delta.

In Python, where I've benchmarked this PR against much larger datasets (5m rows), the performance of view() is almost equivalent to the performance of open_view(), which simply provides a handle to an already-created view on the server. Over large datasets and multiple, parallel clients, the unit context massively reduces the overhead of view(), resulting in a 5x-10x improvement in performance over a regular ctx0.

texodus

Looks good! Thanks for the PR!

I can independently confirm the benchmark results, too. Some improvements I'd like to look into in the future from review:

In Emscripten I believe there is quite a bit of code generation associated with these repeated Context APIs, which leads to larger client assets in JS and WASM. Is there? If so, does embind support virtual dispatch? If not, can we perform a switch within a single dispatch C++ function so we do not need embind to generate the entire context API for each of 4 (eventually 5) context types?
Contexts could use a cleanup, e.g. FMODE_SIMPLE_CLAUSES, combiner, etc.
I concur with e.g. size() -> num_rows(), and IMO this is worth just applying consistently across the board.

texodus · 2020-10-28T05:41:05Z

cpp/perspective/src/cpp/emscripten.cpp

+        auto columns = view_config->get_columns();
+        auto filter_op = view_config->get_filter_op();
+        auto fterm = view_config->get_fterm();
+        auto computed_columns = view_config->get_computed_columns();


Can we add a t_config which initializes these to the empty values we already know these to be?

texodus · 2020-10-28T05:46:03Z

cpp/perspective/src/cpp/flat_traversal.cpp

+    // TODO: int/float/date/datetime pkeys are already sorted here, so if
+    // there was a way to assert that `psp_pkey` is a string typed column,
+    // we can conditional the sort on whether m_sortby.size() > 0 or if
+    // psp_pkey is a string column.


I think this still needs to be re-sorted - this std::sort just guarantees overlapping indices will be contiguous.

texodus · 2020-10-28T05:49:42Z

cpp/perspective/src/include/perspective/gnode_state.h

     * 
     * @return t_uindex 
     */
-    t_uindex size() const;
+    t_uindex num_columns() const;


We may as well go all the way and apply this change to Table.size() API in JS and Python!

sc1f added 5 commits October 26, 2020 12:46

Add and implement unit context in JS

f7d65bf

Add unit context - ctx0 with no configuration at all that reads straight from gstate
WIP: use unit ctx in JS, indexed updates/removes still broken
WIP: fix JS tests
WIP: get_pkeys no longer push_back

Create unit context regardless of column order

957a970

column order no longer matters for unit context as long as num_columns == table.num_columns
more tests, print inside traversal::step_end

Unit context can be created with an arbitrary subset of columns

6ca1db1

unit context = no pivot/sort/filter/computed, any column order/num of columns
read m_delta_pkeys instead of get_delta_pkeys()
cleanup

Implement unit context in Python, test and fix Windows build

d7854f8

Implement unit context in python
add more python tests, make get_row_expanded return bool
fix windows build

fix windows build again

63ea58b

sc1f added the enhancement, C++, JS, Python, and breaking labels Oct 26, 2020

finos-cla-bot bot added the cla-present label Oct 26, 2020

sc1f removed the breaking label Oct 26, 2020

texodus approved these changes Oct 28, 2020

texodus merged commit e4bc81f into master Oct 28, 2020

texodus deleted the unit-context branch October 28, 2020 05:59

sc1f mentioned this pull request Oct 29, 2020

Remove host_view and open_view from public API #1240

Merged
