Spark sstfile generator #420

Merged: 26 commits, Jun 27, 2019

Conversation

spacewalkman
Contributor

Reopening as a new PR now that this repo has changed from private to public. This replaces PR #208.

A Spark job that does the following:

* Parses an input mapping file that maps a Hive table to a tag/edge; the table's (logical) primary key should be identified in the mapping.
* Uses the Nebula native client to encode a tag's key and values.
* Defines a custom Hadoop OutputFormat and RecordWriter, which generate one sub-directory per partition per worker under the specified SST file output directory.
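
As a rough illustration of the third step, the output layout could be sketched as below. The class name `SstPathSketch`, both method names, and the `worker-<id>-<seq>.sst` naming scheme are hypothetical and not taken from this patch; only the idea of a per-partition sub-directory under the output directory comes from the description above.

```java
// Hypothetical sketch: builds the path an SST file might get under the output dir.
// The naming scheme here is an assumption for illustration, not the patch's actual layout.
final class SstPathSketch {
    private SstPathSketch() {}

    // One sub-directory per partition under the SST output directory.
    static String partitionDir(String outputDir, int partitionId) {
        String base = outputDir.endsWith("/")
                ? outputDir.substring(0, outputDir.length() - 1)
                : outputDir;
        return base + "/" + partitionId;
    }

    // Including the worker ID and a sequence number keeps file names from
    // different workers writing to the same partition from colliding.
    static String sstFilePath(String outputDir, int partitionId, int workerId, long sequence) {
        return partitionDir(outputDir, partitionId)
                + "/worker-" + workerId + "-" + sequence + ".sst";
    }
}
```

For example, `SstPathSketch.sstFilePath("/tmp/sst", 3, 7, 0)` yields `/tmp/sst/3/worker-7-0.sst` under this assumed scheme.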

@nebula-community-bot
Member

Can one of the admins verify this patch?

@sherman-the-tank
Member

jenkins go

@sherman-the-tank sherman-the-tank added the ready-for-testing PR: ready for the CI test label May 22, 2019
@nebula-community-bot
Member

Unit testing failed.

@spacewalkman
Contributor Author

The CI failure seems to be related to the JNI header. I have re-pushed; please let Jenkins go.

@dangleptr
Contributor

Jenkins go

@nebula-community-bot
Member

Unit testing failed.

@spacewalkman spacewalkman force-pushed the spark-sstfile-generator branch 3 times, most recently from bf1f3d8 to 8f663e9 Compare May 24, 2019 01:51
@dangleptr
Contributor

Jenkins go

@nebula-community-bot
Member

Unit testing failed.

@spacewalkman spacewalkman force-pushed the spark-sstfile-generator branch 2 times, most recently from 1b2041d to 3c6955a Compare May 24, 2019 07:25
@dangleptr
Contributor

Jenkins go

@nebula-community-bot
Member

Unit testing failed.

@dangleptr
Contributor

Is the PR ready now? @spacewalkman

@spacewalkman
Contributor Author

@dangleptr There is a specific data skew problem causing OOM; I need to analyze the input data.

@spacewalkman spacewalkman force-pushed the spark-sstfile-generator branch from 3c6955a to 6af86d4 Compare May 31, 2019 03:42
@dangleptr
Contributor

Is the PR ready now? @spacewalkman

@spacewalkman
Contributor Author

Yes. It's ready now.

@dangleptr
Contributor

Jenkins go

@nebula-community-bot
Member

Unit testing failed.

@dangleptr
Contributor

Jenkins go

@nebula-community-bot
Member

Unit testing failed.

@spacewalkman spacewalkman force-pushed the spark-sstfile-generator branch from 6af86d4 to 78cd2b3 Compare June 19, 2019 02:14
@spacewalkman spacewalkman dismissed stale reviews from dutor and dangleptr via a63abbf June 27, 2019 07:53
@spacewalkman spacewalkman force-pushed the spark-sstfile-generator branch from bc677f7 to a63abbf Compare June 27, 2019 07:53
@spacewalkman
Contributor Author

Jenkins, go

@nebula-community-bot
Member

Unit testing passed.

@nebula-community-bot
Member

Unit testing passed.

@nebula-community-bot
Member

Unit testing passed.

@dangleptr dangleptr merged commit 34eb36d into vesoft-inc:master Jun 27, 2019
tong-hao pushed a commit to tong-hao/nebula that referenced this pull request Jun 1, 2021
* [WIP] add Spark SST file generator job to generate SST files per partition per worker

* add edge's from and to column references in mapping file, and check that those columns exist

* make sure vertex and its outbound edges are in the same partition

* add native client unit test

* manually box AnyVal to AnyRef in order to call NativeCLient.encoded, since Scala has no autoboxing feature like Java

* support Hive tables with date and other partition columns

* fix double free exception

* remove all rocksdbjni-related dependencies

* use repartitionAndSortWithinPartitions to avoid overlapping key ranges across SST files, update dependency to Hadoop 2.7.4

* add mapping file and command line reference, handle mapping load problem

* address comments

* remove duplicate cmake instruction to find JNI header

* fix doc inconsistency

* keep all edges to a single edgeType

* fix flaky UT

* add mapping json schema file and example mapping file

* use hdfs -copyFromLocal to put local sst files to HDFS

* create destination hdfs dir to put sst files before run hdfs -copyFromLocal

* refactor and fix bug when vertex table has only one primary key column but no other column

* encode edge_type as a property, clean up local SST file dir, and refactor key-value type names

* create parent dir first before creating local sst files

* set java.library.path env variable before running UT in the Maven Surefire plugin

* files generated suffix with .sst

* COMPILE phase precedes PACKAGE phase in the default Maven lifecycle, so remove the redundant COMPILE and enable tests in the meantime

* fix build failure caused by incompatibility between Maven 3.0.5 and Surefire plugin 3.0.0-M2

* add some clarification about SST file name uniqueness in doc
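
Two of the commits above describe a partition-then-sort step: keeping a vertex and its outbound edges in the same partition, and sorting keys within each partition before they are written to SST files. A minimal plain-Java sketch of that idea, without Spark, might look like the following. The modulo partitioning rule, the `Row` type, the class name, and the string keys are all assumptions for illustration; the actual job encodes keys through the native client and may partition differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "partition, then sort within each partition".
final class PartitionSortSketch {
    private PartitionSortSketch() {}

    // One encoded key plus the vertex it belongs to.
    // For an edge, the owner is assumed to be its source vertex.
    static final class Row {
        final long ownerVertexId;
        final String encodedKey;

        Row(long ownerVertexId, String encodedKey) {
            this.ownerVertexId = ownerVertexId;
            this.encodedKey = encodedKey;
        }
    }

    // Assumed rule: partition by owner vertex ID, so a vertex and its
    // outbound edges always map to the same partition.
    static int partitionOf(long vertexId, int numPartitions) {
        return (int) Math.floorMod(vertexId, (long) numPartitions);
    }

    // Groups keys by partition and sorts each group in ascending order.
    // String ordering stands in for the byte-wise ordering of real encoded keys.
    static Map<Integer, List<String>> partitionAndSort(List<Row> rows, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (Row row : rows) {
            int partition = partitionOf(row.ownerVertexId, numPartitions);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(row.encodedKey);
        }
        for (List<String> keys : byPartition.values()) {
            Collections.sort(keys);
        }
        return byPartition;
    }
}
```

In the Spark job itself, `repartitionAndSortWithinPartitions` performs both steps in a single shuffle, which is what lets each worker stream already-sorted keys into its SST files.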
yixinglu pushed a commit to yixinglu/nebula that referenced this pull request Mar 21, 2022
Labels
ready-for-testing PR: ready for the CI test

7 participants