A user visually explores a spatial dataset using a visualization dashboard, such as Tableau or ArcGIS. Each user interaction with the dashboard often involves an iteration between the dashboard and the underlying data system. In each iteration, the dashboard application first issues a query to extract the data of interest from the underlying data system (e.g., PostGIS or SparkSQL), and then runs the visual analysis task (e.g., a heat map or a statistical analysis) on the selected data. Based on the visualization result, the user may repeat these steps several times to visually explore various subsets of the database.
Tabula is a middleware that sits between the data system and the spatial data visualization dashboard to reduce the data-to-visualization time. It has the following features:
- Pre-materialized samples: Tabula adopts a sampling cube approach that stores pre-materialized samples for a set of potentially unforeseen queries (each represented by an OLAP cube cell).
- User-defined analysis tasks: Tabula allows data scientists to define their own accuracy loss function so that the produced samples can be used for various user-defined visual analysis tasks.
- Deterministic accuracy loss: Tabula ensures that the difference between the sample fed into the visualization dashboard and the raw query answer never exceeds the user-specified loss threshold, with a 100% confidence level.
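To make the last two features more concrete, here is a minimal Python sketch of what a user-defined accuracy loss function and its threshold check could look like. The names `mean_loss` and `within_threshold`, the relative-difference-of-means metric, and the toy fares are assumptions made for illustration only; they are not Tabula's API.

```python
from statistics import mean


def mean_loss(raw, sample):
    """Hypothetical loss: relative difference between the raw and sample means.

    Assumes the raw mean is non-zero.
    """
    raw_mean = mean(raw)
    return abs(raw_mean - mean(sample)) / abs(raw_mean)


def within_threshold(raw, sample, theta, loss=mean_loss):
    """Return True if the sample's loss against the raw answer is at most theta."""
    return loss(raw, sample) <= theta


raw_fares = [10.0, 12.0, 8.0, 10.0]  # toy raw query answer, mean 10.0
good_sample = [10.0, 12.0]           # mean 11.0, relative difference 0.10
bad_sample = [12.0]                  # mean 12.0, relative difference 0.20

print(within_threshold(raw_fares, good_sample, theta=0.15))
print(within_threshold(raw_fares, bad_sample, theta=0.15))
```

Because the check compares the sample against the raw data directly rather than estimating an error bound, it either passes or fails outright, which is one way to read "deterministic" here.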
The user passes the cubed attributes, the accuracy loss function, and the loss threshold θ to Tabula as follows:
CREATE TABLE [sampling cube name] AS
SELECT [cubed attributes], SAMPLING(*,[θ]) AS sample
FROM [table name]
GROUP BY CUBE([cubed attributes])
HAVING [loss function name]([attribute], Sam_global) > [θ]
Example:
CREATE TABLE SamplingCube AS
SELECT Trip_distance, Passenger_count, Payment_method, SAMPLING(*,10%) AS sample
FROM nyctaxi
GROUP BY CUBE(Trip_distance, Passenger_count, Payment_method)
HAVING loss(Fare_amount, Sam_global) > 10%
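The semantics of the statement above can be sketched in plain Python: group the rows by every combination of cubed attributes, and materialize a local sample only for cells whose loss against the global sample exceeds θ. This is a simplified illustration, not Tabula's implementation. The `loss` metric, the naive prefix-based `draw_sample`, the `build_cube` helper, and the toy rows are all assumptions.

```python
from itertools import combinations
from statistics import mean


def loss(raw, sample):
    # Hypothetical loss: relative difference of means (assumes a non-zero raw mean).
    raw_mean = mean(raw)
    return abs(raw_mean - mean(sample)) / abs(raw_mean)


def draw_sample(values, theta):
    # Naive stand-in sampler: the shortest prefix whose loss is at most theta.
    for k in range(1, len(values) + 1):
        if loss(values, values[:k]) <= theta:
            return values[:k]
    return list(values)


def build_cube(rows, cubed_attrs, measure, theta):
    global_sample = draw_sample([r[measure] for r in rows], theta)
    cells = {}
    # GROUP BY CUBE: one grouping per subset of the cubed attributes,
    # including the empty subset (the grand-total cell).
    for n in range(len(cubed_attrs) + 1):
        for attrs in combinations(cubed_attrs, n):
            for r in rows:
                key = tuple((a, r[a]) for a in attrs)
                cells.setdefault(key, []).append(r[measure])
    # HAVING loss(..., Sam_global) > theta: keep only cells that the
    # global sample represents poorly, and store a local sample for each.
    cube = {key: draw_sample(vals, theta)
            for key, vals in cells.items()
            if loss(vals, global_sample) > theta}
    return global_sample, cube


rows = [
    {"Passenger_count": 1, "Payment_method": "cash", "Fare_amount": 10.0},
    {"Passenger_count": 1, "Payment_method": "card", "Fare_amount": 10.0},
    {"Passenger_count": 2, "Payment_method": "cash", "Fare_amount": 10.0},
    {"Passenger_count": 2, "Payment_method": "card", "Fare_amount": 10.0},
    {"Passenger_count": 3, "Payment_method": "cash", "Fare_amount": 20.0},
]
global_sample, cube = build_cube(
    rows, ("Passenger_count", "Payment_method"), "Fare_amount", theta=0.2)
for cell, sample in cube.items():
    print(cell, sample)
```

With these toy rows, only the cells involving `Passenger_count = 3` or `Payment_method = 'cash'` alone are materialized; every other cell is already represented well enough by the global sample, which is why the cube is only partially materialized.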
Once the sampling cube in Tabula is initialized, the data scientist, via the analytics application, can issue SQL queries to Tabula, as follows:
SELECT sample
FROM [sampling cube name]
WHERE [conditions]
Example:
SELECT sample
FROM SamplingCube
WHERE Trip_distance = 1 AND Payment_method = 'cash'
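Conceptually, such a query is a lookup of one cube cell. The Python sketch below illustrates this under the assumption that a cell which was not materialized is answered from the global sample, as the HAVING clause in the cube definition suggests. The `search` helper, the dictionary layout, and the sample values are hypothetical and do not reflect Tabula's internal storage.

```python
CUBED_ATTRS = ("Trip_distance", "Passenger_count", "Payment_method")

# Toy state: a global sample plus two materialized cells (made-up values).
global_sample = [10.0]
cube = {
    (("Trip_distance", 1), ("Payment_method", "cash")): [10.0, 20.0],
    (("Passenger_count", 3),): [20.0],
}


def search(cube, global_sample, conditions):
    """Return the sample for the cell matching the equality conditions.

    Falls back to the global sample when the cell is not materialized.
    """
    # Build the cell key in a fixed attribute order so lookups are stable.
    key = tuple((a, conditions[a]) for a in CUBED_ATTRS if a in conditions)
    return cube.get(key, global_sample)


hit = search(cube, global_sample, {"Trip_distance": 1, "Payment_method": "cash"})
miss = search(cube, global_sample, {"Payment_method": "card"})
print(hit)
print(miss)
```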
We provide a driver program in this repo to illustrate the usage of Tabula.
To run the driver locally:
- Clone this repo.
- Open `pom.xml` and make sure the value of `env.package` is `compile`. It should look like `<env.package>compile</env.package>`.
- Open `Driver.scala` and make sure the value of SparkSession's attribute `master` is `local[*]`. It should look like `.master("local[*]")`.
- Run `Driver.scala` with one argument, `build` or `search`:
  - `build` creates a new Tabula partially materialized sampling cube.
  - `search` starts a query that searches the Tabula sampling cube.
To run the driver on a Spark cluster:
- Clone this repo.
- Open `pom.xml` and make sure the value of `env.package` is `provided`. It should look like `<env.package>provided</env.package>`.
- Open `Driver.scala` and make sure SparkSession's attribute `master` is disabled: the line `.master("local[*]")` should be commented out.
- Run `mvn clean install -DskipTests` in the terminal.
- Submit the compiled fat jar to the Spark cluster using the `./bin/spark-submit` command, with one argument, `build` or `search`.
Jia Yu and Mohamed Sarwat. Turbocharging Geospatial Visualization Dashboards via a Materialized Sampling Cube Approach. In Proceedings of the International Conference on Data Engineering, ICDE, page to appear, 2020 (PDF)
- Currently, Tabula is implemented on top of Apache SparkSQL 2.3. In the future, we will show how to extend Tabula to more data systems such as PostgreSQL.
- The current code contains a geospatial-visualization aware accuracy loss function as an example at this location: Tabula cube
- We will soon release more examples about how to write user-defined accuracy loss functions in Tabula. Stay tuned.
- Some compared approaches are provided here: Related work
- We also compare with SnappyData in this repository: SnappyData VS Tabula
Mohamed Sarwat ([email protected])
The Tabula middleware system is one of the projects of the Data Systems Lab (DataSys Lab) at Arizona State University. The mission of DataSys Lab is to design and develop experimental data management systems (e.g., database systems).