[SIP-43] Unified Chart Controls #9887

Closed
ktmud opened this issue May 22, 2020 · 5 comments
@ktmud
Member

ktmud commented May 22, 2020

Disclaimer: regardless of the details embedded, this document is a conceptual proposal intended to start conversations. We are not yet committed to implementing all or any of the proposed changes.

[SIP-43] Unified Chart Controls

Motivation

After obtaining tabular data from datasources, visualizations need to know how to map data columns to marks and channels. Sometimes the visualization requires, or users want to perform, additional transformations before plotting (compute moving average, pivot, aggregate, transpose, sample, window, etc.) that are not possible or convenient to do at the datasource/SQL query level.

The concept of post-processing is not new to Superset. Historically, we had some flavors of these known as advanced analytics in a few charts such as the line chart, but they lack clarity and consistency.

Consistency: Most of these transformations are implemented in viz.py, with each visualization having its own non-standardized post-processor. This Python module has grown out of control and become difficult to maintain. The coupling between visualization and query response makes it difficult to improve some important features such as chart slice caching, embeddable charts and visualization plugins.

We have been slowly migrating to a new visualization-independent query API (SIP-5, SIP-6, #6220). The idea is to decouple data querying from visualizations and move most visualization-specific transformations to the frontend. The new API will always return tabular data. It is up to each visualization plugin to take the output, optionally run it through a generalized post-processing API on the server side first (#9427), and pass the tabular data to the visualization code, which handles the rest of the visualization-specific transformation on the client side.

Clarity: Some post-processing can be implicit but straightforward, e.g., when users want to add moving averages to a line chart, we simply introduce a form control that computes a derived column. It is easy to infer how that column should be presented in the visualization. However, while there is room for abstraction, this approach requires developers to create custom controls for every visualization, each with its own transformProps logic. Like many basic chart controls in Superset, these controls affect both data manipulation and presentation, so they still unavoidably bind data querying to visualizations. It may be straightforward to use them in one simple chart, but it quickly becomes confusing when there are a lot of different controls across many visualizations: both developers and users sometimes have to guess what’s happening under the hood.

Flexibility: In addition to lacking clarity and consistency, the custom control approach also falls short of supporting more powerful visualizations. For example, the following table chart with mock data is common in top-line business reports. It compactly displays multiple metrics (bookings and revenue) across multiple dimensions (state, user type) and multiple time periods (point-in-time measurement and 7-day moving average):

[Screenshot: Snip20200521_5]

Suppose each metric and dimension is a column in the database, and each row holds their values on a given date. Currently, in order to create this output table, users have to write very complex SQL queries and use a virtual datasource. But it’s actually possible to build the same chart very quickly using a combination of Pandas post-processing operators, without writing complex database queries.
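As a rough illustration of that claim, the sketch below builds a small version of such a report with plain pandas calls (`pivot_table` and `rolling`). The column names (`ds`, `state`, `bookings`) and the mock values are assumptions made for this example; it does not use Superset's own operators.

```python
import pandas as pd

# Mock daily data: one row per (date, state). Column names are assumptions.
raw = pd.DataFrame({
    "ds": pd.to_datetime([
        "2020-05-01", "2020-05-01",
        "2020-05-02", "2020-05-02",
        "2020-05-03", "2020-05-03",
    ]),
    "state": ["CA", "NY"] * 3,
    "bookings": [10, 20, 30, 40, 50, 60],
})

# Step 1: pivot so that each state becomes a column, indexed by date.
wide = raw.pivot_table(index="ds", columns="state", values="bookings", aggfunc="sum")

# Step 2: 7-day moving average per state.
# min_periods=1 keeps the first rows from turning into NaN on this short sample.
rolling = wide.rolling(window=7, min_periods=1).mean()

# Step 3: place the latest point-in-time value next to its moving average.
report = pd.DataFrame({
    "latest": wide.iloc[-1],
    "7d_avg": rolling.iloc[-1],
})
print(report)
```

Each step is a single generic operation on tabular data, which is the property the proposal relies on: the same three steps would apply to revenue or to a different dimension without new SQL.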

Proposed Change

To solve the challenges above, we propose to (1) add a Transform section for server-side post-processing and (2) rearrange the Customize controls in the control panel.

The Transform Section

In the Transform section, users can specify stackable atomic transform operators mapped directly to the pandas_postprocessing API already implemented in the backend. For example, the screenshot below shows the controls popup for Rolling Window transformation:

[Screenshot: Snip20200520_9]

This one-to-one mapping between transform controls and post-processing operators makes documentation easy. We can just point users to the Python API spec for pandas_postprocessing.

After applying a transformation, users should be able to view the intermediate and final tabular data (the final results returned by the server will always be tabular). We can add a button to switch between data view and chart view:

[Screenshot: Snip20200520_12]

Or implement the split-view proposed in SIP-34.

Clicking on the accordion list items in Transform will switch between the intermediate transformed results.
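To make the idea of stackable operators with viewable intermediate results more concrete, here is a minimal sketch in plain Python. The operator functions, the `OPERATORS` registry, and `apply_transforms` are hypothetical stand-ins written for this example; they are not the pandas_postprocessing implementation, and the option names are assumptions.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]


def rolling_mean(rows: Table, column: str, window: int) -> Table:
    """Add '<column>_rolling' holding the mean of the last `window` values."""
    out = []
    for i, row in enumerate(rows):
        values = [r[column] for r in rows[max(0, i - window + 1): i + 1]]
        out.append({**row, f"{column}_rolling": sum(values) / len(values)})
    return out


def sort_rows(rows: Table, by: str, ascending: bool = True) -> Table:
    """Return the rows ordered by one column."""
    return sorted(rows, key=lambda r: r[by], reverse=not ascending)


# Hypothetical registry: operation name -> function taking (table, **options).
OPERATORS: Dict[str, Callable[..., Table]] = {
    "rolling": rolling_mean,
    "sort": sort_rows,
}


def apply_transforms(rows: Table, steps: List[Dict[str, Any]]) -> List[Table]:
    """Apply each step in order and keep the table produced after every step."""
    results = []
    current = rows
    for step in steps:
        operator = OPERATORS[step["operation"]]
        current = operator(current, **step.get("options", {}))
        results.append(current)
    return results


rows = [
    {"ds": 1, "bookings": 10},
    {"ds": 2, "bookings": 30},
    {"ds": 3, "bookings": 50},
]
steps = [
    {"operation": "rolling", "options": {"column": "bookings", "window": 2}},
    {"operation": "sort", "options": {"by": "ds", "ascending": False}},
]
intermediate = apply_transforms(rows, steps)
```

Because every step returns a complete table, a UI could show `intermediate[i]` when the i-th accordion item is selected, and the last entry is the result that would be handed to the chart.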

The Customize Section

Superset introduced the Data vs Customize tab to reduce clutter in the control panel. This is helpful for charts with many options. But it also makes the Customize options difficult to discover. A lot of users don’t even know they can add pagination to the table chart.

To simplify the user experience, we intend to move the Customize tab to a new Customize section under the main tab and add a Columns section before other chart rendering options.

The Columns section configures per-column metadata such as d3 format, tooltip template, suffix/prefix, and conditional formatting, corresponding to the final data output. Not all visualizations have customizable rendering options, but those with rendering options will always have the Columns section before other rendering options. We believe that with a refined UI hierarchy, it’s possible to resurface the customize controls without creating clutter.
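A minimal sketch of what per-column metadata could look like when applied to a cell value. It uses Python's format mini-language as a stand-in for d3 format (d3-format is modeled on it); `ColumnConfig` and `format_cell` are hypothetical names for this example, not part of Superset.

```python
from dataclasses import dataclass


@dataclass
class ColumnConfig:
    """Hypothetical per-column settings, loosely mirroring the Columns section."""
    number_format: str = ""  # Python format spec standing in for a d3 format
    prefix: str = ""
    suffix: str = ""


def format_cell(value: float, config: ColumnConfig) -> str:
    """Render one cell using the column's format, prefix, and suffix."""
    return f"{config.prefix}{format(value, config.number_format)}{config.suffix}"


revenue = ColumnConfig(number_format=",.2f", prefix="$")
growth = ColumnConfig(number_format=".1%")

formatted_revenue = format_cell(1234.5, revenue)  # "$1,234.50"
formatted_growth = format_cell(0.256, growth)     # "25.6%"
```

Keeping these settings keyed by output column, rather than by chart type, is what would let them carry over when the user switches visualizations.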

The full mockup can be found here.

Long-term Plan

In the future, all visualization controls will follow three simple steps: Query (datasource queries) → Transform (server side post processing) → Customize (chart rendering). This is akin to the visualization grammar used by Vega and Vega-lite. This separation of concerns and unification of control semantics make it super clear what each control is responsible for.

By moving column mapping and chart rendering logic to Customize, we can also remove control overrides in the Query section. Currently there are too many variants of metrics, columns, and groupby fields that query the same thing but are stored differently (apache-superset/superset-ui#485 provided a mask to help developers; end users may still be confused). This not only greatly simplifies the code, but also makes it easier to switch between visualization types, since all control values for Query, Transform, and even Columns can be easily retained.

In the future (beyond the scope of this SIP), if it’s too tiresome to edit all the transformations one by one, there are many ways to simplify the Transform section for users:

  1. We can hide complex operations behind custom operators that are either a preset of other operators, or an arbitrary Python function running in sandbox.
  2. We can add a switch between Simple and Advanced modes. The Advanced mode is what is described above. In the Simple mode, users specify transformations using fewer controls, similar to Advanced Analytics. Each control could potentially represent one or multiple transformations with reasonable defaults.

New or Changed Public Interfaces

  • New React components for the Transform controls and popup modals, which will interact with the post_processing field in the new API (/api/v1/chart/data).
  • New shared controls for the Columns section.
  • New design pattern for control panel to optimize the hierarchy of existing controls (data source, time, etc.).
  • Refactor control panel config registrations and chart control overrides.
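For orientation, the sketch below shows roughly what a request body carrying a `post_processing` list might look like. Only the `post_processing` field and the `/api/v1/chart/data` endpoint come from this proposal; the surrounding keys, the datasource identifiers, and the operator option names are assumptions for illustration and should be checked against the actual API spec.

```python
import json

# Assumed shape of a chart data request; only "post_processing" is named in the SIP.
payload = {
    "datasource": {"id": 1, "type": "table"},  # assumed identifiers
    "queries": [
        {
            "metrics": ["bookings"],           # assumed metric name
            "groupby": ["state"],              # assumed dimension name
            # One entry per Transform control, applied in order on the server.
            "post_processing": [
                {
                    "operation": "rolling",
                    "options": {               # option names are assumptions
                        "rolling_type": "mean",
                        "window": 7,
                        "columns": {"bookings": "bookings_7d_avg"},
                    },
                },
            ],
        }
    ],
}

# Serialized body that a client would send to /api/v1/chart/data.
body = json.dumps(payload)
operations = [step["operation"] for step in payload["queries"][0]["post_processing"]]
```

The point of the sketch is the mapping: each item in the Transform section would serialize to one entry in the list, so the control state and the request stay in one-to-one correspondence.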

New dependencies

No new NPM/Pip dependencies needed.

Migration Plan and Compatibility

This change has profound implications for the way we think of charting in Superset. In addition to exposing the post-processing API, it also proposes changes to the Query controls.

We will start with implementing the controls for one example visualization type (the table chart for example), then migrate others one by one. For each visualization type, we have to do the following:

  • Step 1: migrate to the new /chart/data API
  • Step 2: add Transform controls, implement the data vs chart split view
  • Step 3: move Customize controls to the main tab, add Columns controls
  • Step 4: rename and simplify Query controls, db migration may be needed

Rejected Alternatives

  • Add yet another custom control or visualization type for the advanced table chart: the use case is too specific, and the underlying transformation involves so many operations that it becomes too opaque to users.
  • Only add the transform section to charts which need it: this does not solve the long-standing problem of inconsistent and speculative chart controls, but still introduces a new pattern for charting.

References

@ktmud ktmud added the sip Superset Improvement Proposal label May 22, 2020
@ktmud ktmud changed the title [SIP] Proposal for Unified Chart Controls [SIP-43] Proposal for Unified Chart Controls May 22, 2020
@ktmud
Member Author

ktmud commented May 22, 2020

I feel the SIP template could use a "Technical Challenges and Risks" section.

@ktmud ktmud changed the title [SIP-43] Proposal for Unified Chart Controls [SIP-43] Expose Post-processing API and Unify Chart Controls May 22, 2020
@ktmud ktmud changed the title [SIP-43] Expose Post-processing API and Unify Chart Controls [SIP-43] Unify Chart Controls May 22, 2020
@ktmud ktmud changed the title [SIP-43] Unify Chart Controls [SIP-43] Unified Chart Controls May 22, 2020
@mistercrunch
Member

The data-centric approach (query + transform + map-data-to-viz-properties) as opposed to the visualization-property-centric approach is certainly interesting, but that last map phase is fairly challenging to design well so that it's intuitive to the user.

This was discussed at length in the sessions leading to SIP-34. Some of the challenges with the query-centric approach are:

  • doing good guesswork around mapping data to viz properties is challenging, and we need to allow the user to override it
  • forcing the right granularity: some visualizations require different things (a particular number of dimensions, a particular number of metrics). While the "Query" panel seems like a conceptually universal thing, the mechanics / restrictions applied to it are tricky (say if we need to force you to pick exactly 2 dimensions and a single metric, or if you're not allowed to pick a dimension [big number], or if you have to pick a metric and a secondary one is optional, ....).
  • for many users, if you're just trying to make a bar chart, or a big number, the whole query -> transform -> map mental gymnastics is a bit complicated.

We should dig out the best design we had for this during the design blitz. The design would show a query panel with the viz-property-mapping within the dropped "pill" in the droppable area of the query (the wireframe would be worth 10k words here...). That seemed somewhat intuitive but I feel like that design/model couldn't support another layer of transformation that we disregarded at that time.

One more thought: transformations are unevenly tricky, and we may want to categorize them:

  • in-place mutations - transforms that don't change the shape of the result set - are easy
  • metric derivative - a new computed metric generally works well with visualizations that support multiple metrics
  • pivots - where the shape of the data following the transformation is dynamic - are generally tricky; the visualization has to expect a certain shape here. Clearly not any pivot table can be visualized by any type of visualization...
  • grain changing: many visualizations care a lot about having the perfect "grain" and some transforms could mess with that
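The difference between the first and third categories can be seen in a small plain-Python sketch. `cumulative_sum`, `pivot`, and `columns_of` are hypothetical helpers written for this example: the first keeps the column set fixed, while the output columns of the second depend on the values found in the data.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def columns_of(rows: List[Row]) -> Set[str]:
    """Collect every column name that appears in the table."""
    return {key for row in rows for key in row}


def cumulative_sum(rows: List[Row], column: str) -> List[Row]:
    """In-place style mutation: same rows, same columns, new values."""
    total = 0
    out = []
    for row in rows:
        total += row[column]
        out.append({**row, column: total})
    return out


def pivot(rows: List[Row], index: str, columns: str, values: str) -> List[Row]:
    """Shape-changing: one output column per distinct value of `columns`."""
    out: Dict[Any, Row] = {}
    for row in rows:
        target = out.setdefault(row[index], {index: row[index]})
        target[str(row[columns])] = row[values]
    return list(out.values())


rows = [
    {"ds": 1, "state": "CA", "bookings": 10},
    {"ds": 1, "state": "NY", "bookings": 20},
    {"ds": 2, "state": "CA", "bookings": 30},
]

same_shape = cumulative_sum(rows, "bookings")
pivoted = pivot(rows, index="ds", columns="state", values="bookings")

print(columns_of(same_shape) == columns_of(rows))  # column set is unchanged
print(sorted(columns_of(pivoted)))                 # column set depends on the data
```

A visualization can know the columns of `same_shape` before the query runs, but it can only learn the columns of `pivoted` after seeing the data, which is why pivots are harder to map to chart properties.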

@ktmud
Member Author

ktmud commented May 27, 2020

Was this the query panel design you were looking for? (Thanks @graceguo-supercat for sharing!)

I like its simplicity and directness: the grouping of visualization types and unification of control UI are really slick. It does seem difficult to fit the query + transform + map model in this design, though.

There is definitely a tradeoff between simplicity and flexibility. To better understand the problem, I looked at what other BI tools are doing in this regard:

  1. Redash
    • Users always write SQL queries before visualization.
    • No post-processing transformation possible.
    • Property mapping mostly manual.
  2. Metabase
    • Similar to Redash, separate steps for query building and visualization
    • No post-processing transformations, except simple cumulative measurements (mixed with query-level aggregation functions).
    • Property mapping is manual or based on metric/dimension definitions at the query stage.
  3. Looker
    • A combined Explore view for query + visualization
    • Support "Table Calculation" (post-processing on query results) and "Custom Fields" (query level calculated columns), both have similar interactions
    • Data retrieval not tied to visualization
    • Post-processing and ad-hoc columns achieved via DSL
  4. Mode
    • Query with SQL
    • Apply advanced data transformations with R or Python notebook
    • Charting controls only operate on query results
    • Property mapping is totally manual
  5. Tableau
    • Visualization centric, no separate querying step for most cases
    • Powerful custom functions and on-chart controls to apply transformations
    • Custom functions could trigger either optimized queries or post-processing, but the whole process is fully opaque to users.
  6. Domo
  7. Power BI
    • Very similar to Tableau
    • Most transformations (including data source joins) probably happen locally (in the desktop app or on the Power BI server, as opposed to the datasource)

TL;DR: Tableau, Domo, Power BI use the visualization-centric approach; others data-centric.

In terms of functionality and philosophy, Looker is the closest to what's been proposed in this SIP; Domo is the closest to SIP-34. Either way seems to work from an end-user's point of view. For Superset, it's a matter of which mode is the best for our users and whether we can smoothly get there.

To answer some of your concerns:

  • data <-> viz property mapping: it's possible to keep track of derived metrics and use the same property type for source and derived metrics. Pivot tables also know which output columns are dimensions and which are metrics.
  • granularity: I'm not sure I understand this correctly. "Grain change" is the same as adding/removing a column in regular controls, isn't it? I don't think we have to have restrictions on query output. Like Redash/Mode and others, excess columns can simply be ignored: during chart configuration, you select only the columns you need for visualization. Visualizations should not run any implicit aggregations on the client side, and should throw errors if the final output has excess or missing dimensions or is not in the desired shape.
  • complexity: a simple bar chart will not need transforms. The transform controls could be part of an "Advanced" section hidden (but discoverable) from regular users. There are ways to make the UI intuitive and non-intrusive. Both Looker and Tableau, the most powerful of the tools above, use a DSL to hide the complexity of the transformation layer. We don't have to go there yet, but it could eventually be a choice.
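The "throw errors instead of aggregating implicitly" idea from the granularity point could be sketched as a shape check that a visualization runs on the columns selected for it. `ShapeRequirement`, `validate_shape`, and the role labels are hypothetical names for this example, not an existing Superset interface.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ShapeRequirement:
    """Hypothetical declaration of what a visualization expects."""
    dimensions: int
    min_metrics: int
    max_metrics: int


def validate_shape(column_roles: Dict[str, str], req: ShapeRequirement) -> None:
    """Raise ValueError if the selected columns do not match the expected shape."""
    dims = [c for c, role in column_roles.items() if role == "dimension"]
    metrics = [c for c, role in column_roles.items() if role == "metric"]
    if len(dims) != req.dimensions:
        raise ValueError(
            f"expected {req.dimensions} dimension(s), got {len(dims)}: {dims}"
        )
    if not req.min_metrics <= len(metrics) <= req.max_metrics:
        raise ValueError(
            f"expected {req.min_metrics} to {req.max_metrics} metric(s), "
            f"got {len(metrics)}: {metrics}"
        )


def is_valid(column_roles: Dict[str, str], req: ShapeRequirement) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        validate_shape(column_roles, req)
    except ValueError:
        return False
    return True


# A "big number" style chart: no dimensions, exactly one metric.
big_number = ShapeRequirement(dimensions=0, min_metrics=1, max_metrics=1)

ok = is_valid({"bookings": "metric"}, big_number)
too_many = is_valid({"state": "dimension", "bookings": "metric"}, big_number)
```

Under this design the query output itself is unrestricted; only the columns mapped to the chart are checked, and a mismatch surfaces as an explicit error rather than a silent client-side aggregation.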

In summary, the addition of a transform layer does involve a lot of work, but I'd argue it adds valuable flexibility for the users which would really differentiate Superset from its competitors.

@ktmud
Member Author

ktmud commented Apr 26, 2021

Per-column formatting has been added to table chart in #13758

A transformation layer adds significantly more complexity and would be very difficult to design well. Over the past year, Superset has seen many gradual improvements, and a lot of work for steps 1 - 3 in the Migration Plan of this SIP has been done and is still ongoing. It's better to keep the gradual approach and push for bigger changes when the codebase is more stable, thus closing this SIP without a vote.

@rusackas
Member

rusackas commented Dec 8, 2022

Since this issue is closed, we'll also consider the SIP/proposal process closed as well. If you want to rekindle this proposal, please re-open this Issue, and send a new [DISCUSS] thread to the dev@ mailing list. Thank you!

@rusackas rusackas moved this from INACTIVE DISCUSSION to DENIED / CLOSED in SIPs (Superset Improvement Proposals) Dec 8, 2022