New HEPData Explore
Weaknesses of the current HEPData Explore:
- The interface flashes and reloads each time the primary filters are modified, which makes interaction slow and uncomfortable.
- The filtering interface needs to be compacted.
- Flattening of data.
- Data is counted by points rather than by tables; counting tables may be more relevant.
- crossfilter is not being taken advantage of at all but has caused design decisions like the flattening of data.
- No custom charts.
- It's impossible to plot two independent variables against each other.
In the new HEPData Explore system, the remote database will index only publications and tables (instead of arbitrary table groups made of variable pairs as done before). This simplifies querying since most, if not all, filterable fields belong to the table metadata.
The publications will be indexed with these fields:
- Comment (actually the title of the publication)
- Inspire record
- Tables. For each one:
  - table_num (within the publication)
  - cmenergies_min
  - cmenergies_max
  - reactions (list of strings)
  - observables (list of strings)
  - phrases (list of strings)
  - dep vars (list of strings)
  - indep vars (list of strings)
  - data points
Data points will not necessarily be indexed by the server. They will be stored in such a way that each row of the original table is one row of data, maintaining all the relationships between all the variables.
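As a rough illustration, the indexed document could be described with types like the ones below. This is only a sketch: the type names, the numeric types chosen for cmenergies and the Inspire record, and the row representation (one object per table row, keyed by variable name) are assumptions, not a fixed mapping. The values in the example are placeholders, not real HEPData content.

```typescript
// Hypothetical type names; the field names follow the list above.
interface TableRecord {
  table_num: number; // position within the publication
  cmenergies_min: number; // numeric type is an assumption
  cmenergies_max: number;
  reactions: string[];
  observables: string[];
  phrases: string[];
  dep_vars: string[];
  indep_vars: string[];
  // One entry per row of the original table. Each row maps a variable
  // name to its value, so the relationships between variables survive.
  data_points: Array<Record<string, number>>;
}

interface PublicationRecord {
  comment: string; // actually the title of the publication
  inspire_record: number; // numeric type is an assumption
  tables: TableRecord[];
}

// Placeholder document, only to show the shape.
const example: PublicationRecord = {
  comment: "Placeholder title",
  inspire_record: 0,
  tables: [
    {
      table_num: 1,
      cmenergies_min: 0,
      cmenergies_max: 0,
      reactions: [],
      observables: [],
      phrases: [],
      dep_vars: ["y"],
      indep_vars: ["x"],
      data_points: [
        { x: 1, y: 10 },
        { x: 2, y: 20 },
      ],
    },
  ],
};
```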
Since the document level at the server side continues to be the publication, server queries will still retrieve entire publications.
On the client side, data will be indexed by table for plotting purposes, e.g. as an iterable list of all tables (currently it is indexed by data points).
When a plot is requested, a pair of variables will be specified, one for the X axis and one for the Y axis. They may be any combination of dep and indep vars. Then, tables will be filtered in order to get those that have both specified variables. For each row of each table, the chosen variables will be read and plotted.
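The filtering and row reading described above could look roughly like this. It is a minimal sketch, assuming each table keeps its rows as objects keyed by variable name; the names Table, PlotPoint, hasVariable and pointsForPlot are hypothetical.

```typescript
type Row = Record<string, number>;

interface Table {
  dep_vars: string[];
  indep_vars: string[];
  data_points: Row[];
}

interface PlotPoint {
  x: number;
  y: number;
}

// A variable qualifies whether it is a dep or an indep var.
function hasVariable(table: Table, name: string): boolean {
  return table.dep_vars.includes(name) || table.indep_vars.includes(name);
}

function pointsForPlot(tables: Table[], xVar: string, yVar: string): PlotPoint[] {
  const points: PlotPoint[] = [];
  for (const table of tables) {
    // Keep only the tables that have both requested variables.
    if (!hasVariable(table, xVar) || !hasVariable(table, yVar)) continue;
    for (const row of table.data_points) {
      const x = row[xVar];
      const y = row[yVar];
      // Skip rows where either value is missing.
      if (x !== undefined && y !== undefined) points.push({ x, y });
    }
  }
  return points;
}
```

Because every row keeps all of its variables together, the same function works for dep vs indep, dep vs dep, or indep vs indep plots.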
Improvement: Any number of dep/indep variables may be chosen for the Y axis. This can be used for instance to compare two distributions (e.g. Observed CL vs Expected CL).
Further (but complex) improvement: A third variable or expression may be added, linked to the size of the dot. A language capable of writing mathematical expressions and reading table variables would be needed for this to be feature complete. A humbler solution would be to provide a simple function with the following parameters: variable, min dot size, max dot size, (optional) value for min dot size, (optional) value for max dot size, linear or logarithmic size scale.
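The humbler solution might be sketched as follows. The names DotSizeOptions and makeDotSizer are hypothetical, and two behaviours are assumptions rather than decisions: when the optional values are omitted they default to the minimum and maximum of the variable, and values outside the range are clamped to the min or max dot size.

```typescript
interface DotSizeOptions {
  minSize: number; // dot size used at valueAtMinSize
  maxSize: number; // dot size used at valueAtMaxSize
  valueAtMinSize?: number; // assumed default: smallest value of the variable
  valueAtMaxSize?: number; // assumed default: largest value of the variable
  scale: "linear" | "log";
}

function makeDotSizer(values: number[], opts: DotSizeOptions): (value: number) => number {
  const lo = opts.valueAtMinSize ?? values.reduce((a, b) => Math.min(a, b), Infinity);
  const hi = opts.valueAtMaxSize ?? values.reduce((a, b) => Math.max(a, b), -Infinity);
  // The log scale only makes sense for positive values.
  const transform = opts.scale === "log" ? (v: number) => Math.log10(v) : (v: number) => v;
  const tLo = transform(lo);
  const tHi = transform(hi);
  return (value: number): number => {
    const raw = (transform(value) - tLo) / (tHi - tLo);
    // Degenerate ranges and non-positive log inputs fall back to the min size.
    const fraction = Number.isFinite(raw) ? Math.min(1, Math.max(0, raw)) : 0;
    return opts.minSize + fraction * (opts.maxSize - opts.minSize);
  };
}
```

For example, `makeDotSizer([0, 10], { minSize: 2, maxSize: 12, scale: "linear" })` returns a function that maps the midpoint value 5 to a dot size halfway between 2 and 12.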
The current system destroys the interface whenever server side filters are modified. This is uncomfortable and inconvenient for exploration.
In this regard, we will take advantage of the fact that server querying is coarse-grained, returning entire publications' data. The application will have an in memory set of publications returned from the last query. After new queries, the set will be updated with the difference from the past query.
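One possible way to keep the in-memory set up to date is sketched below. It assumes the Inspire record identifies a publication uniquely and that the difference is computed on the client from the full result of the new query; updatePublicationSet and UpdateResult are hypothetical names.

```typescript
interface Publication {
  inspire_record: number;
}

interface UpdateResult {
  added: number[]; // Inspire records that entered the set
  removed: number[]; // Inspire records that left the set
}

function updatePublicationSet<P extends Publication>(
  current: Map<number, P>,
  queryResult: P[],
): UpdateResult {
  const incoming = new Set(queryResult.map((p) => p.inspire_record));

  // Drop publications that the new query no longer returns.
  const removed: number[] = [];
  for (const id of Array.from(current.keys())) {
    if (!incoming.has(id)) {
      current.delete(id);
      removed.push(id);
    }
  }

  // Add publications that were not in the previous result.
  const added: number[] = [];
  for (const pub of queryResult) {
    if (!current.has(pub.inspire_record)) {
      current.set(pub.inspire_record, pub);
      added.push(pub.inspire_record);
    }
  }
  return { added, removed };
}
```

The returned lists tell the rest of the interface which tables appeared or went missing, so existing plots can be updated in place instead of being rebuilt.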
While data is being downloaded, a small notice will appear on screen to make explicit that the data shown is outdated.
Tables that went missing will have all their data removed from plots. Empty plots will disappear. New variable pairs will get a new plot if there are free plots. For this purpose, pairs with the greatest number of tables will be given the highest priority, but they will never displace existing plots as long as those still have data points.
A regenerate plots button may be added that would drop all the existing plots and replace them with the 8 variable pairs with the greatest number of tables under the currently active filters.
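The prioritisation could be sketched as below. This assumes that an automatic variable pair is one indep var plotted against one dep var of the same table; the names rankVariablePairs, pickNewPairs and pairKey are hypothetical.

```typescript
interface Table {
  dep_vars: string[];
  indep_vars: string[];
}

interface PairCount {
  x: string;
  y: string;
  tables: number; // how many tables contain this pair
}

function pairKey(x: string, y: string): string {
  return JSON.stringify([x, y]);
}

// Count, for every (indep, dep) pair, the number of tables that contain it,
// and return the pairs sorted by that count, largest first.
function rankVariablePairs(tables: Table[]): PairCount[] {
  const counts = new Map<string, PairCount>();
  for (const table of tables) {
    for (const x of Array.from(new Set(table.indep_vars))) {
      for (const y of Array.from(new Set(table.dep_vars))) {
        const key = pairKey(x, y);
        const entry = counts.get(key) ?? { x, y, tables: 0 };
        entry.tables += 1;
        counts.set(key, entry);
      }
    }
  }
  return Array.from(counts.values()).sort((a, b) => b.tables - a.tables);
}

// Fill only the free slots; pairs that already have a plot are never displaced.
function pickNewPairs(ranked: PairCount[], shown: Set<string>, freeSlots: number): PairCount[] {
  return ranked.filter((p) => !shown.has(pairKey(p.x, p.y))).slice(0, Math.max(0, freeSlots));
}
```

Under these assumptions, the regenerate plots button would simply take `rankVariablePairs(tables).slice(0, 8)` and discard whatever was shown before.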
A custom plot button will show a pop-up dialog asking for a specific variable pair. As explained before, both dep and indep vars will be allowed.
A preview of the plot will be shown in the dialog.
To make custom plot creation easier, once either X or Y is set, suggestions for the other will be sorted in such a way that variables that actually appear in pairs and therefore are capable of generating plots appear first. They may also appear in bold or in another color to hint they are good matches.
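A sketch of the suggestion ordering follows. It assumes that a good match is a variable that shares at least one table with the variable already chosen, and that ties are broken alphabetically; suggestOtherVariable and Suggestion are hypothetical names.

```typescript
interface Table {
  dep_vars: string[];
  indep_vars: string[];
}

interface Suggestion {
  name: string;
  goodMatch: boolean; // true if a plot with the chosen variable would have data
}

function suggestOtherVariable(tables: Table[], chosen: string): Suggestion[] {
  const all = new Set<string>();
  const coOccurring = new Set<string>();
  for (const table of tables) {
    const vars = [...table.indep_vars, ...table.dep_vars];
    const hasChosen = vars.includes(chosen);
    for (const v of vars) {
      all.add(v);
      if (hasChosen && v !== chosen) coOccurring.add(v);
    }
  }
  return Array.from(all)
    .filter((name) => name !== chosen)
    .map((name) => ({ name, goodMatch: coOccurring.has(name) }))
    .sort(
      (a, b) =>
        // Good matches first, then alphabetical order within each group.
        Number(b.goodMatch) - Number(a.goodMatch) || a.name.localeCompare(b.name),
    );
}
```

The goodMatch flag is what the dialog could use to render a suggestion in bold or in another color.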
Custom plots will be pinned by default, so they will not be deleted automatically if the filters are modified in such a way that the plots no longer contain data points.
Each plot will live in a visible box. The boxes could be reordered by dragging and dropping.
Each plot will have the following widgets:
- A scatterplot showing the data, the X and Y axes, their values and the names of the variables.
- An indication of the scale for each axis, log or lin. Clicking it will switch between them. Logarithmic scale will automatically be chosen if the minimum and maximum values of that axis differ by more than one order of magnitude (base 10).
- A close button to get rid of the plot.
- An indication of how many tables are being plotted, from how many publications.
- Data points of different tables within the same plot will show up in different colors.
- A download button that will allow retrieving the data from the plotted tables in a custom format.
- A view publications button that will show a pop-up with the same plot on the left and a list of publications and tables on the right. Each table will have a box showing what color represents it in the plot. Hovering over a table or a publication will highlight the data points belonging to it.
- A pin toggle button. Pinned graphs will never be deleted even if they become empty.
- An edit button that will show the custom plot dialog, even if the plot was automatically generated. From there, the user may change the variables. Editing an automatic plot turns it into a custom plot and pins it.
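The automatic choice between log and lin for an axis could be sketched as follows. Falling back to a linear scale when the axis contains zero or negative values is an assumption, since a logarithmic axis cannot show them; chooseAxisScale and toggleAxisScale are hypothetical names.

```typescript
type AxisScale = "lin" | "log";

function chooseAxisScale(values: number[]): AxisScale {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (!Number.isFinite(v)) continue; // ignore NaN and infinities
    if (v < min) min = v;
    if (v > max) max = v;
  }
  // No usable data, or values a log axis cannot represent (assumption).
  if (min === Infinity || min <= 0) return "lin";
  // More than one base-10 order of magnitude between min and max.
  return max / min > 10 ? "log" : "lin";
}

// Clicking the scale indicator switches between the two scales.
function toggleAxisScale(scale: AxisScale): AxisScale {
  return scale === "log" ? "lin" : "log";
}
```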
As part of the plots overhaul, their current internal structure will be modified. At the moment they're a single WebGL canvas.
In the new design they will consist of several 2D and WebGL canvases stacked on top of each other. This will allow for greater separation of concerns (each layer can be managed in a single class) and will make it easier to provide 2D fallbacks, should that be deemed necessary or desirable. Furthermore, should we go the canvas 2D way, this is actually the most performant way to do compositing.
In order to avoid context proliferation, the new plots will be pooled. Ten (or so) plots will be created at the start of the application, which will get populated with data and shown when needed and hidden (e.g. with display: none) when no longer needed.
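The pooling idea can be sketched without any DOM code, as below. The names PlotPool and PooledPlot are hypothetical; in the real application each pooled entry would own its stacked canvases, and the visible flag would correspond to toggling display: none on the plot's box.

```typescript
interface PooledPlot {
  id: number;
  visible: boolean; // would map to the CSS display property of the plot box
  pair: [string, string] | null; // the X and Y variables currently shown
}

class PlotPool {
  private readonly plots: PooledPlot[] = [];

  // All plots (and, in the real application, their contexts) are created once.
  constructor(size: number) {
    for (let id = 0; id < size; id++) {
      this.plots.push({ id, visible: false, pair: null });
    }
  }

  // Returns a hidden plot populated for the given pair, or null if none is free.
  acquire(x: string, y: string): PooledPlot | null {
    const free = this.plots.find((p) => !p.visible);
    if (free === undefined) return null;
    free.visible = true;
    free.pair = [x, y];
    return free;
  }

  // Hides the plot and makes it available again; nothing is destroyed.
  release(plot: PooledPlot): void {
    plot.visible = false;
    plot.pair = null;
  }

  get freeCount(): number {
    return this.plots.filter((p) => !p.visible).length;
  }
}
```

The number of free plots reported by the pool is what the automatic plot selection would use to decide how many new variable pairs can be shown.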
For now, the current filter user interface will be kept, since its current design is already very powerful. It supports full boolean composition, automatic suggestions, reordering and reparenting, and new filter types are easily added. So most effort will be put into the other features instead.
That said, there are some usability issues that could be sorted out. For instance, it should be easier to undo after picking a value in a choice filter. Also, completion should be shorter or be hidden in a contextual pop-up in order not to occupy too much screen real estate.
- Update the Python aggregator to conform to the new ElasticSearch mapping.
  - Phrase indexing will be added as part of the update.
- Remove all instances of crossfilter.js. It's no longer needed for anything in the new design.
- Update the search functions in the client code to conform to the new mapping and handle tables instead of data records.
- Create a new Plot component with all the stated requirements. This will be a hard one.
- Add a hook on the search function to add and remove unpinned plots automatically from search results.
- Implement the Customize plot window.
- Implement the View publications window.
- Implement the Download data button, updated for the new table schema.
- Implement global versions of View publications and Download data.
- Add the aggregation panel from the mockup.