
[NSE-580] update doc on data source(DS V1/V2 usage) (#587)
* update doc on data source

Signed-off-by: Yuan Zhou <[email protected]>

* fix wording

Signed-off-by: Yuan Zhou <[email protected]>
zhouyuan authored Nov 28, 2021
1 parent dcdc459 commit a5eb496
Showing 2 changed files with 33 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -33,6 +33,7 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) it's possib
![Overview](./docs/image/dataset.png)

A native Parquet reader was developed to speed up data loading. It is based on Apache Arrow Dataset. For details, please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source).
Note that both data source V1 and V2 are supported. Please check the [example section](arrow-data-source/#run-a-query-with-arrowdatasource-scala) for Arrow data source usage.
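For instance, a read through the data source from the Scala shell looks roughly like this (a minimal sketch; the path is a placeholder):
```
// read Parquet files through the Arrow data source
val df = spark.read
  .format("arrow")
  .load("hdfs:///path/to/parquet")
df.show(10)
```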

### Apache Arrow Compute/Gandiva based operators

32 changes: 32 additions & 0 deletions arrow-data-source/README.md
@@ -205,6 +205,38 @@ val df = spark.read
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view LIMIT 10").show(10)
```

### Example on integrating with Hive metastore
Hive metastore is commonly used in Thrift server based setups, and Arrow data source supports this case as well. The APIs are slightly different from vanilla Spark's. Here's an example of how to create metadata tables for Parquet-based TPC-H tables:
```
// create a database first, otherwise the tables will be stored in the default database
spark.sql("create database testtpch;").show
spark.sql("use testtpch;").show
spark.catalog.createTable("lineitem", "hdfs:////user/root/date_tpchnp_1000/lineitem", "arrow")
spark.catalog.createTable("orders", "hdfs:////user/root/date_tpchnp_1000/orders", "arrow")
spark.catalog.createTable("customer", "hdfs:////user/root/date_tpchnp_1000/customer", "arrow")
spark.catalog.createTable("nation", "hdfs:////user/root/date_tpchnp_1000/nation", "arrow")
spark.catalog.createTable("region", "hdfs:////user/root/date_tpchnp_1000/region", "arrow")
spark.catalog.createTable("part", "hdfs:////user/root/date_tpchnp_1000/part", "arrow")
spark.catalog.createTable("partsupp", "hdfs:////user/root/date_tpchnp_1000/partsupp", "arrow")
spark.catalog.createTable("supplier", "hdfs:////user/root/date_tpchnp_1000/supplier", "arrow")
// need to recover the partitions if it's a partitioned table
spark.sql("ALTER TABLE lineitem RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE orders RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE customer RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE nation RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE region RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE partsupp RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE part RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE supplier RECOVER PARTITIONS").show;
```
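Once the tables are registered and the partitions recovered, they can be queried like regular Hive tables. A quick sketch against the tables created above:
```
// l_returnflag is a standard TPC-H lineitem column
spark.sql("select l_returnflag, count(*) from lineitem group by l_returnflag").show
```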
By default, the "arrow" format reads Parquet files with the Arrow data source. Note the createTable step only creates the metadata; the original files are not changed. Other file formats (ORC, CSV) are also supported. Here's an example of how to create a metadata table for ORC files:
```
spark.catalog.createTable("web_site", "arrow", Map("path" -> "hdfs:///root/tmp/TPCDS_ORC/web_site", "originalFormat" -> "orc"))
```
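The same `originalFormat` option should also work when reading files directly rather than registering a table; a sketch under that assumption:
```
// assumes DataFrameReader options are passed through to the data source unchanged
val ws = spark.read
  .format("arrow")
  .option("originalFormat", "orc")
  .load("hdfs:///root/tmp/TPCDS_ORC/web_site")
ws.show(10)
```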

### To validate if ArrowDataSource works properly

To validate that ArrowDataSource works, open the query's DAG in the Spark web UI and check whether ArrowScan was used for the example query above.
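Besides the web UI, the physical plan can also be printed from the shell; a quick check using the temp view from the earlier example:
```
// the plan should contain an Arrow-based scan node (e.g. ArrowScan) instead of a plain file scan
spark.sql("SELECT * FROM my_temp_view LIMIT 10").explain()
```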