forked from vesoft-inc/nebula
Spark sstfile generator (vesoft-inc#420)
* [WIP] add spark sst file generator job to generate sst files per partition per worker
* add edge's from and to column reference in mapping file, do check those columns exist
* make sure vertex and its outbound edges are in the same partition
* add native client unit test
* manual boxing AnyVal to AnyRef in order to call NativeClient.encoded, for that scala has no autoboxing feature like java
* support hive table with date and other partition columns
* fix double free exception
* remove all rocksdbjni related dependency
* use repartitionAndSortWithinPartitions to avoid overlapping sst files key range, update dependency to hadoop 2.7.4
* add mapping file and command line reference, handle mapping load problem
* address comments
* remove duplicate cmake instruction to find JNI header
* fix doc inconsistency
* keep all edges to a single edgeType
* fix flaky UT
* add mapping json schema file and example mapping file
* use hdfs -copyFromLocal to put local sst files to HDFS
* create destination hdfs dir to put sst files before run hdfs -copyFromLocal
* refactor and fix bug when vertex table has only one primary key column but no other column
* edge_type encoded as a property, clean up local sst file dir and refactor key-value type name
* create parent dir first before creating local sst files
* set java.library.path env variable before run UT in maven surefire plugin
* files generated suffix with .sst
* COMPILE phase precedes PACKAGE phase in default maven lifecycle, so remove redundant COMPILE and enable test in the meantime
* fix build failure caused by incompatibility between maven 3.0.5 and surefire plugin 3.0.0-M2
* add some clarification about sst file name uniqueness in doc
1 parent 10b1012, commit f4dfb05
Showing 25 changed files with 1,718 additions and 121 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
sbt.version=1.2.8 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,3 @@ | ||
#!/bin/bash | ||
|
||
-mvn clean compile package -DskipTests | |
-mvn test | |
+mvn clean package -X |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
Generate sst files from a hive table data source, guided by a mapping file that maps hive tables to vertexes and edges.
Multiple vertexes or edges may map to a single hive table, in which case a partition column is used to distinguish the
different vertexes or edges.
The hive tables may be regenerated periodically by an upstream system to reflect the latest data, and may be
partitioned by a time column that indicates when the data was generated.
The *$HADOOP_HOME* environment variable needs to be set to run this job.
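
A minimal preflight sketch, assuming the job is launched from a bash wrapper script: it stops early when `HADOOP_HOME` is unset or does not point to a directory. The function name `require_hadoop_home` is illustrative and not part of this project.

```bash
#!/bin/bash
# Hypothetical helper (not part of this project): fail fast when HADOOP_HOME
# is missing, instead of letting the spark job fail later.
require_hadoop_home() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set" >&2
        return 1
    fi
    if [ ! -d "${HADOOP_HOME}" ]; then
        echo "HADOOP_HOME does not point to a directory: ${HADOOP_HOME}" >&2
        return 1
    fi
    return 0
}
```

A wrapper script could call `require_hadoop_home || exit 1` before invoking `spark-submit`.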

# Environment
component|version
---|---
os|CentOS 6.5 Final (kernel 2.6.32-431.el6.x86_64)
spark|1.6.2
hadoop|2.7.4
jdk|1.8+
scala|2.10.5
sbt|1.2.8

# Spark-submit command line reference
This is the command we use in our production environment:
```bash
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --queue fmprod \
  --conf spark.executor.instances=24 \
  --conf spark.executor.memory=90g \
  --conf spark.executor.cores=2 \
  --conf spark.executorEnv.LD_LIBRARY_PATH='/soft/server/nebula_native_client:/usr/local/lib:/usr/local/lib64' \
  --conf spark.driver.extraJavaOptions='-Djava.library.path=/soft/server/nebula_native_client/:/usr/local/lib64:/usr/local/lib' \
  --class com.vesoft.tools.SparkSstFileGenerator \
  --files mapping.json \
  nebula-spark-sstfile-generator.jar \
  -di "2019-05-13" -mi mapping.json -pi dt -so file:///home/hdp/nebula_output
```
The application options are described below.

# Spark application command line reference
Option names follow a convention: those with the suffix _i_ are INPUT options, while those with the suffix _o_ are OUTPUT options.

```bash
usage: nebula spark sst file generator
-ci,--default_column_mapping_policy <arg>  Policy to use when mapping columns to properties. If omitted, all columns except the primary_key column are mapped to tag properties with the same name
-di,--latest_date_input <arg>              Latest date to query, date format YYYY-MM-dd
-hi,--string_value_charset_input <arg>     Charset used to encode values of type String, defaults to UTF-8
-ho,--hdfs_sst_file_output <arg>           HDFS directory where the sst files will be put, should not start with file:///
-li,--limit_input <arg>                    Return at most this number of edges/vertexes, usually used in the POC stage. When omitted, fetch all data.
-mi,--mapping_file_input <arg>             Hive tables to nebula graph schema mapping file
-pi,--date_partition_input <arg>           A partition field of type String of the hive table, which represents a date in the format YYYY-MM-dd
-ri,--repartition_number_input <arg>       Repartition number. An optimization to improve generation speed and reduce data skew. Needs tuning to suit your data.
-so,--local_sst_file_output <arg>          Local directory where the generated sst files will be put, should start with file:///
-ti,--datasource_type_input <arg>          Data source type, must be one of [hive|hbase|csv] for now, default=hive
```
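
The path and date constraints stated in the option descriptions can be checked before the job is submitted. The sketch below is an illustration in bash, not the generator's own validation logic; all three function names are hypothetical.

```bash
#!/bin/bash
# Hypothetical pre-submit checks mirroring the option descriptions;
# the generator's own validation may differ.

# -so: the local output directory must start with file:///
is_valid_local_output() {
    case "$1" in
        file:///*) return 0 ;;
        *)         return 1 ;;
    esac
}

# -ho: the HDFS output directory must not start with file:///
is_valid_hdfs_output() {
    case "$1" in
        file:///*) return 1 ;;
        "")        return 1 ;;
        *)         return 0 ;;
    esac
}

# -di: four digits, two digits, two digits, separated by dashes.
# This checks the shape only; it does not verify that the date exists.
is_valid_date_input() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_valid_local_output "file://home/hdp/out"` fails because the value has only two slashes after `file:`, which is an easy mistake to make when typing the `-so` argument.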

# Mapping file schema

The mapping file is in JSON format. Its schema is provided as [mapping-schema.json](mapping-schema.json), following the [JSON Schema standard](http://json-schema.org). An example mapping file is provided as [mapping.json](mapping.json).

# FAQ
## How to use libnebula-native-client.so under CentOS 6.5 (2.6.32-431 x86-64)

1. Don't use the officially distributed librocksdbjni-linux64.so; build it natively on CentOS 6.5.

```bash
DEBUG_LEVEL=0 make shared_lib
DEBUG_LEVEL=0 make rocksdbjava
```
_Make sure to use the same DEBUG_LEVEL for both builds, or link errors such as `symbol not found` will occur._
2. Run `sbt assembly` to package this project into a spark job jar, which is named `nebula-spark-sstfile-generator.jar` by default.
3. Run `jar uvf nebula-spark-sstfile-generator.jar librocksdbjni-linux64.so libnebula_native_client.so` to replace the `*.so` files packaged inside the dependency org.rocksdb:rocksdbjni:5.17.2; otherwise an error like the following will occur on spark-submit:

```
*** glibc detected *** /soft/java/bin/java: free(): invalid pointer: 0x00007f7985b9f0a0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75f4e)[0x7f7c7d5e6f4e]
/lib64/libc.so.6(+0x78c5d)[0x7f7c7d5e9c5d]
/tmp/librocksdbjni3419235685305324910.so(_ZN7rocksdb10EnvOptionsC1Ev+0x578)[0x7f79431ff908]
/tmp/librocksdbjni3419235685305324910.so(Java_org_rocksdb_EnvOptions_newEnvOptions+0x1c)[0x7f7943044dbc]
[0x7f7c689c1747]
```
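
Step 3 can be wrapped in a small script that checks its inputs before touching the jar. This is a sketch under assumptions: `jar` from the JDK is on `PATH`, and the function is run from the directory that holds the natively built `.so` files, so that they are stored under their bare file names as in the command above. The function name `repack_native_libs` is hypothetical.

```bash
#!/bin/bash
# Hypothetical wrapper around step 3: update the assembled jar with natively
# built libraries, then list the jar to confirm they are present.
# Usage: repack_native_libs <jar> <lib.so>...
repack_native_libs() {
    local jar_file="$1"
    shift
    if [ ! -f "${jar_file}" ]; then
        echo "jar not found: ${jar_file}" >&2
        return 1
    fi
    if [ "$#" -eq 0 ]; then
        echo "no library files given" >&2
        return 1
    fi
    local lib
    for lib in "$@"; do
        if [ ! -f "${lib}" ]; then
            echo "library not found: ${lib}" >&2
            return 1
        fi
    done
    # 'jar uvf' adds the files to the jar, replacing entries of the same name.
    jar uvf "${jar_file}" "$@" || return 1
    # 'jar tf' lists the jar entries; print the .so entries for a visual check.
    jar tf "${jar_file}" | grep '\.so$'
}
```

A call matching the step above would be `repack_native_libs nebula-spark-sstfile-generator.jar librocksdbjni-linux64.so libnebula_native_client.so`.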

# TODO
1. Add a database_name property at both the graphspace level and the tag/edge level, where the latter overrides the former when both are provided
2. The order of schema column definitions is important; preserve it when parsing the mapping file and when encoding
3. Integrate the build with maven or cmake, so that this spark assembly is built after the nebula native client
4. Handle the following situation: different tables share a common tag, such as a tag with the properties (start_time, end_time)