
[NSE-580] update doc on data source(DS V1/V2 usage) (#587)
* update doc on data source

Signed-off-by: Yuan Zhou <[email protected]>

* fix wording

Signed-off-by: Yuan Zhou <[email protected]>
zhouyuan authored Nov 28, 2021
1 parent dcdc459 commit a5eb496
Showing 2 changed files with 33 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -33,6 +33,7 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) it's possib
![Overview](./docs/image/dataset.png)

A native Parquet reader was developed to speed up data loading. It is based on Apache Arrow Dataset. For details, please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source).
Note that both data source V1 and V2 are supported. Please check the [example section](arrow-data-source/#run-a-query-with-arrowdatasource-scala) for Arrow data source usage.
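For instance, a read through the data source from the Scala shell looks roughly like this (a minimal sketch; the path is a placeholder):
```
// read Parquet files through the Arrow data source
val df = spark.read
  .format("arrow")
  .load("hdfs:///path/to/parquet")
df.show(10)
```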

### Apache Arrow Compute/Gandiva based operators

32 changes: 32 additions & 0 deletions arrow-data-source/README.md
@@ -205,6 +205,38 @@ val df = spark.read
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view LIMIT 10").show(10)
```

### Example on integrating with Hive metastore
Hive metastore is commonly used in Thrift server based setups, and Arrow data source supports this case as well. The APIs are slightly different from vanilla Spark's. Here's an example of how to create metadata tables for Parquet-based TPC-H tables:
```
// create a database first, otherwise the tables will be stored in the default database
spark.sql("create database testtpch;").show
spark.sql("use testtpch;").show
spark.catalog.createTable("lineitem", "hdfs:////user/root/date_tpchnp_1000/lineitem", "arrow")
spark.catalog.createTable("orders", "hdfs:////user/root/date_tpchnp_1000/orders", "arrow")
spark.catalog.createTable("customer", "hdfs:////user/root/date_tpchnp_1000/customer", "arrow")
spark.catalog.createTable("nation", "hdfs:////user/root/date_tpchnp_1000/nation", "arrow")
spark.catalog.createTable("region", "hdfs:////user/root/date_tpchnp_1000/region", "arrow")
spark.catalog.createTable("part", "hdfs:////user/root/date_tpchnp_1000/part", "arrow")
spark.catalog.createTable("partsupp", "hdfs:////user/root/date_tpchnp_1000/partsupp", "arrow")
spark.catalog.createTable("supplier", "hdfs:////user/root/date_tpchnp_1000/supplier", "arrow")
// need to recover the partitions if it's a partitioned table
spark.sql("ALTER TABLE lineitem RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE orders RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE customer RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE nation RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE region RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE partsupp RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE part RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE supplier RECOVER PARTITIONS").show;
```
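Once the tables are registered and the partitions recovered, they can be queried like regular Hive tables. A quick sketch against the tables created above:
```
// l_returnflag is a standard TPC-H lineitem column
spark.sql("select l_returnflag, count(*) from lineitem group by l_returnflag").show
```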
By default, the "arrow" format reads Parquet files with the Arrow data source. Note the createTable step only creates the metadata; the original files are not changed. Other file formats (ORC, CSV) are also supported. Here's an example of how to create a metadata table for ORC files:
```
spark.catalog.createTable("web_site", "arrow", Map("path" -> "hdfs:///root/tmp/TPCDS_ORC/web_site", "originalFormat" -> "orc"))
```
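The same `originalFormat` option should also work when reading files directly rather than registering a table; a sketch under that assumption:
```
// assumes DataFrameReader options are passed through to the data source unchanged
val ws = spark.read
  .format("arrow")
  .option("originalFormat", "orc")
  .load("hdfs:///root/tmp/TPCDS_ORC/web_site")
ws.show(10)
```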

### To validate if ArrowDataSource works properly

To validate that ArrowDataSource works, open the query's DAG in the Spark web UI and check whether ArrowScan was used for the example query above.
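Besides the web UI, the physical plan can also be printed from the shell; a quick check using the temp view from the earlier example:
```
// the plan should contain an Arrow-based scan node (e.g. ArrowScan) instead of a plain file scan
spark.sql("SELECT * FROM my_temp_view LIMIT 10").explain()
```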