add exchange doc (#665)
* add exchange doc

* update

* Update ex-ug-para-import-command.md

* Update ex-ug-parameter.md
cooper-lzy authored Sep 1, 2021
1 parent 1c33dda commit 62943aa
Showing 26 changed files with 4,870 additions and 69 deletions.
41 changes: 41 additions & 0 deletions docs-2.0/nebula-exchange/about-exchange/ex-ug-limitations.md
@@ -0,0 +1,41 @@
# Limitations

This topic describes some of the limitations of using Exchange 2.x.

## Nebula Graph releases

The correspondence between the Nebula Exchange release (the JAR version) and the Nebula Graph release is as follows.

|Exchange client|Nebula Graph|
|:---|:---|
|{{exchange.release}}|{{nebula.release}}|
|2.1.0|2.0.0, 2.0.1|
|2.0-SNAPSHOT|v2-nightly|
|2.0.1|2.0.0, 2.0.1|
|2.0.0|2.0.0, 2.0.1|

JAR packages are available in two ways: [compile them yourself](../ex-ug-compile.md) or download them from the Maven repository.
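
For example, a release JAR can be fetched directly from the Maven repository. The following is a minimal sketch, assuming the 2.1.0 release and the standard Maven repository layout; adjust the version to match your Nebula Graph release according to the table above.

```bash
# Download the Exchange JAR for a specific release from the Maven repository.
# The exact path follows the standard Maven layout and is an assumption here.
wget https://repo1.maven.org/maven2/com/vesoft/nebula-exchange/2.1.0/nebula-exchange-2.1.0.jar
```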

If you are using Nebula Graph 1.x, use [Nebula Exchange 1.x](https://github.com/vesoft-inc/nebula-java/tree/v1.0/tools "Click to go to GitHub").

## Environment

Exchange 2.x supports the following operating systems:

- CentOS 7
- macOS

## Software dependencies

To ensure that Exchange runs properly, confirm that the following software has been installed on the machine:

- Apache Spark: 2.4.x

- Java: 1.8

- Scala: 2.10.7, 2.11.12, or 2.12.10
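
To quickly verify the environment, check the installed versions from the command line, for example:

```bash
# Verify the required software versions.
java -version            # expect 1.8.x
spark-submit --version   # expect 2.4.x
scala -version           # expect 2.10.7, 2.11.12, or 2.12.10
```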

Hadoop Distributed File System (HDFS) needs to be deployed in the following scenarios:

- Migrating HDFS data
- Generating SST files
11 changes: 11 additions & 0 deletions docs-2.0/nebula-exchange/about-exchange/ex-ug-terms.md
@@ -0,0 +1,11 @@
# Glossary

<!--
This topic describes the terms you may need to know when using Exchange.
- Nebula Exchange: Referred to in this manual as Exchange or Exchange 2.x, a Spark application based on Apache Spark&trade; for bulk data migration. It converts data files of various sources and formats into vertex and edge data that Nebula Graph can recognize, and then imports them into Nebula Graph concurrently.
- Apache Spark&trade;: A fast, general-purpose computing engine designed for large-scale data processing, and an open-source project of the Apache Software Foundation.
- Driver Program: Referred to in this manual as Driver, the program that runs the application's main function and creates a SparkContext instance.
-->
68 changes: 68 additions & 0 deletions docs-2.0/nebula-exchange/about-exchange/ex-ug-what-is-exchange.md
@@ -0,0 +1,68 @@
# What is Nebula Exchange

[Nebula Exchange](https://github.com/vesoft-inc/nebula-spark-utils/tree/{{exchange.branch}}/nebula-exchange) (Exchange) is an Apache Spark&trade; application for bulk migration of cluster data to Nebula Graph in a distributed environment, supporting batch and streaming data migration in a variety of formats.

Exchange consists of Reader, Processor, and Writer. Reader reads data from different sources and returns a DataFrame; Processor then iterates through each row of the DataFrame and obtains the corresponding values based on the mapping of `fields` in the configuration file. After iterating through the number of rows in the specified batch, Writer writes the collected data to Nebula Graph at once. The following figure illustrates how Exchange completes data conversion and migration.
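
For illustration, the `fields` mapping that Processor relies on is declared per tag or edge type in the configuration file. Below is a minimal, hypothetical sketch (the `player` tag and the column names `_c0`/`_c1` are placeholders, not part of any shipped configuration):

```text
tags: [
  {
    name: player                # Tag name in Nebula Graph
    fields: [_c0, _c1]          # Column names in the source DataFrame
    nebula.fields: [name, age]  # Corresponding property names in Nebula Graph
    vertex: {field: _c0}        # Column used as the vertex ID
    batch: 256                  # Rows written by Writer in one batch
  }
]
```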

![Nebula Graph&reg; Exchange consists of Reader, Processor, and Writer that can migrate data from a variety of formats and sources to Nebula Graph](../figs/ex-ug-003.png)

## Scenario

Exchange is applicable to the following scenarios:

- Streaming data, such as log files, online transactions, game player activities, social networking information, location-based services, or telemetry data from connected devices and instruments in a data center, needs to be read from platforms such as Kafka or Pulsar, converted into the vertex or edge data of a property graph, and imported into the Nebula Graph database.

- Batch data, such as data from a time period, needs to be read from a relational database (such as MySQL) or a distributed file system (such as HDFS), converted into vertex or edge data for a property graph, and imported into the Nebula Graph database.

- A large volume of data needs to be generated into SST files that Nebula Graph can recognize and then imported into the Nebula Graph database.

## Advantage

Exchange has the following advantages:

- Adaptable: Supports importing data into the Nebula Graph database in a variety of formats or from a variety of sources, making it easy to migrate data.

- SST import: Converts data from different sources into SST files for data import.

- Resumable data import: Interrupted imports can be resumed from a breakpoint, which saves time and improves import efficiency.

!!! note

    Resumable import is currently supported only when migrating Neo4j data.

- Asynchronous operation: Insert statements are generated from the source data and sent to the Graph service, which then performs the insert operations.

- Flexibility: Supports importing multiple tags and edge types at the same time. Different tags and edge types can come from different data sources and in different formats.

- Statistics: Use the accumulator in Apache Spark&trade; to count the number of successful and failed insert operations.

- Easy to use: Adopts the Human-Optimized Config Object Notation (HOCON) configuration file format, which has an object-oriented style and is easy to understand and operate.

## Data source

Exchange {{exchange.release}} supports converting data from the following formats or sources into vertices and edges that Nebula Graph can recognize, and then importing them into Nebula Graph in the form of **nGQL** statements:

- Data stored in HDFS or locally:
- [Apache Parquet](../use-exchange/ex-ug-import-from-parquet.md)
- [Apache ORC](../use-exchange/ex-ug-import-from-orc.md)
- [JSON](../use-exchange/ex-ug-import-from-json.md)
- [CSV](../use-exchange/ex-ug-import-from-csv.md)

- [Apache HBase&trade;](../use-exchange/ex-ug-import-from-hbase.md)

- Data warehouses:

- [Hive](../use-exchange/ex-ug-import-from-hive.md)
- [MaxCompute](../use-exchange/ex-ug-import-from-maxcompute.md)

- Graph database: [Neo4j](../use-exchange/ex-ug-import-from-neo4j.md) (Client version 2.4.5-M1)

- Relational database: [MySQL](../use-exchange/ex-ug-import-from-mysql.md)

- Columnar database: [ClickHouse](../use-exchange/ex-ug-import-from-clickhouse.md)

- Stream processing software platform: [Apache Kafka&reg;](../use-exchange/ex-ug-import-from-kafka.md)

- Publish/Subscribe messaging platform: [Apache Pulsar 2.4.5](../use-exchange/ex-ug-import-from-pulsar.md)

In addition to importing data as nGQL statements, Exchange supports generating **SST** files from data sources and then [importing SST](../use-exchange/ex-ug-import-from-sst.md) files via Console, as sketched below.
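
As a rough sketch of that flow, once the generated SST files are uploaded to HDFS, they are typically loaded in Console with statements like the following (the HDFS address and path are placeholders; see the linked topic for the authoritative procedure):

```ngql
nebula> DOWNLOAD HDFS "hdfs://<hdfs_host>:<hdfs_port>/<sst_file_path>";
nebula> INGEST;
```
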
109 changes: 109 additions & 0 deletions docs-2.0/nebula-exchange/ex-ug-FAQ.md
@@ -0,0 +1,109 @@
# Exchange FAQ

## Compilation

### Some packages not in the central repository fail to download, with the error `Could not resolve dependencies for project xxx`

Check the `mirror` section of the `libexec/conf/settings.xml` file in the Maven installation directory:

```text
<mirror>
<id>alimaven</id>
<mirrorOf>central</mirrorOf>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
</mirror>
```

Check whether the value of `mirrorOf` is configured to `*`. If it is, change it to `central` or `*,!SparkPackagesRepo,!bintray-streamnative-maven`.

**Reason**: There are two dependency packages in Exchange's `pom.xml` that are not in Maven's central repository, and `pom.xml` configures the repository addresses for them. If the `mirrorOf` value of the mirror configured in Maven is `*`, all dependencies are downloaded from the central repository instead, which causes the download to fail.
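
A corrected `mirror` block might look like the following sketch, which excludes the two special repositories from the mirror so that those dependencies are fetched from their own addresses:

```text
<mirror>
  <id>alimaven</id>
  <mirrorOf>*,!SparkPackagesRepo,!bintray-streamnative-maven</mirrorOf>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
</mirror>
```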

## Execution

### How to submit in Yarn-Cluster mode?

To submit a task in Yarn-Cluster mode, run the following command:

```bash
$SPARK_HOME/bin/spark-submit --class com.vesoft.nebula.exchange.Exchange \
--master yarn-cluster \
--files application.conf \
--conf spark.driver.extraClassPath=./ \
--conf spark.executor.extraClassPath=./ \
nebula-exchange-2.0.0.jar \
-c application.conf
```

### Error: `method name xxx not found`

Generally, the port configuration is incorrect. Check the port configuration of the Meta service, Graph service, and Storage service.

### Error: NoSuchMethod, MethodNotFound (`Exception in thread "main" java.lang.NoSuchMethodError`, etc.)

Such errors are usually caused by JAR package conflicts or version conflicts. Check whether the version of the service reporting the error is the same as the one used by Exchange, especially for Spark, Scala, and Hive.

### When Exchange imports Hive data, error: `Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found`

Check whether the `-h` parameter is omitted from the command that submits the Exchange task, check whether the table and database names are correct, and run the user-configured `exec` statement in spark-sql to verify its correctness.
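
For reference, a sketch of a submission command with the `-h` option included (the JAR name and file paths are placeholders):

```bash
# Submit an Exchange task that reads from Hive; note the trailing -h.
$SPARK_HOME/bin/spark-submit --class com.vesoft.nebula.exchange.Exchange \
    nebula-exchange-2.x.y.jar \
    -c application.conf -h
```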

### Run error: `com.facebook.thrift.protocol.TProtocolException: Expected protocol id xxx`

Check that the Nebula Graph service port is configured correctly.

- For source, RPM, or DEB installations, configure the port number corresponding to `--port` in the configuration file for each service.

- For Docker installations, check the port numbers mapped by Docker as follows:

Execute `docker-compose ps` in the `nebula-docker-compose` directory, for example:

```bash
$ docker-compose ps
Name Command State Ports
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nebula-docker-compose_graphd_1 /usr/local/nebula/bin/nebu ... Up (healthy) 0.0.0.0:33205->19669/tcp, 0.0.0.0:33204->19670/tcp, 0.0.0.0:9669->9669/tcp
nebula-docker-compose_metad0_1 ./bin/nebula-metad --flagf ... Up (healthy) 0.0.0.0:33165->19559/tcp, 0.0.0.0:33162->19560/tcp, 0.0.0.0:33167->9559/tcp, 9560/tcp
nebula-docker-compose_metad1_1 ./bin/nebula-metad --flagf ... Up (healthy) 0.0.0.0:33166->19559/tcp, 0.0.0.0:33163->19560/tcp, 0.0.0.0:33168->9559/tcp, 9560/tcp
nebula-docker-compose_metad2_1 ./bin/nebula-metad --flagf ... Up (healthy) 0.0.0.0:33161->19559/tcp, 0.0.0.0:33160->19560/tcp, 0.0.0.0:33164->9559/tcp, 9560/tcp
nebula-docker-compose_storaged0_1 ./bin/nebula-storaged --fl ... Up (healthy) 0.0.0.0:33180->19779/tcp, 0.0.0.0:33178->19780/tcp, 9777/tcp, 9778/tcp, 0.0.0.0:33183->9779/tcp, 9780/tcp
nebula-docker-compose_storaged1_1 ./bin/nebula-storaged --fl ... Up (healthy) 0.0.0.0:33175->19779/tcp, 0.0.0.0:33172->19780/tcp, 9777/tcp, 9778/tcp, 0.0.0.0:33177->9779/tcp, 9780/tcp
nebula-docker-compose_storaged2_1 ./bin/nebula-storaged --fl ... Up (healthy) 0.0.0.0:33184->19779/tcp, 0.0.0.0:33181->19780/tcp, 9777/tcp, 9778/tcp, 0.0.0.0:33185->9779/tcp, 9780/tcp
```

Check the `Ports` column to find the port numbers mapped by Docker. In this example:

- The port number available for the Graph service is 9669.

- The port numbers available for the Meta service are 33167, 33168, and 33164.

- The port numbers available for the Storage service are 33183, 33177, and 33185.
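
These mapped port numbers are what should appear in the Exchange configuration. A minimal sketch of the address section under this assumption (the host IP is illustrative):

```text
nebula: {
  address: {
    graph: ["127.0.0.1:9669"]
    meta: ["127.0.0.1:33167", "127.0.0.1:33168", "127.0.0.1:33164"]
  }
}
```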

## Configuration

### Which configuration items affect import performance?

- batch: The number of data records contained in each nGQL statement sent to the Nebula Graph service.

- partition: The number of Spark data partitions, which indicates the concurrency of the data import.

- nebula.rate: A token must be obtained from the token bucket before a request is sent to Nebula Graph.

- limit: The size of the token bucket.

- timeout: The timeout period for obtaining a token.

The values of these four parameters can be adjusted appropriately according to the machine performance. If the leader of the Storage service changes during the import process, you can adjust the values of these four parameters to reduce the import speed.
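
As a sketch, these parameters live in different parts of the configuration file: `batch` and `partition` are set per tag or edge type, while the token bucket is configured under the `nebula` block (the values below are illustrative, not tuned recommendations):

```text
nebula: {
  rate: {
    limit: 1024    # Size of the token bucket
    timeout: 1000  # Timeout (ms) for obtaining a token
  }
}

tags: [
  {
    # ...
    batch: 256     # Records per nGQL statement
    partition: 32  # Spark partitions, i.e., import concurrency
  }
]
```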

## Others

### Which versions of Nebula Graph are supported by Exchange?

See [Limitations](about-exchange/ex-ug-limitations.md).

### What is the relationship between Exchange and Spark Writer?

Exchange is a Spark application developed on the basis of Spark Writer. Both are suitable for bulk migration of cluster data to Nebula Graph in a distributed environment, but future maintenance will focus on Exchange. Compared with Spark Writer, Exchange has the following improvements:

- Supports a richer set of data sources, such as MySQL, Neo4j, Hive, HBase, Kafka, and Pulsar.

- Fixes some problems of Spark Writer. For example, when Spark reads data from HDFS, the source data is read as String by default, which may not match Nebula Graph's schema. Exchange therefore adds automatic data type matching and conversion: when a property type in Nebula Graph's schema is not String, Exchange converts the String source data to the corresponding type.
76 changes: 76 additions & 0 deletions docs-2.0/nebula-exchange/ex-ug-compile.md
@@ -0,0 +1,76 @@
# Compile Exchange

This topic describes how to compile Nebula Exchange. Users can also [download](https://repo1.maven.org/maven2/com/vesoft/nebula-exchange/) the compiled `.jar` file directly.

## Prerequisites

- Install [Maven](https://maven.apache.org/download.cgi).

<!-- The Maven library where Pulsar is located was officially closed on May 31st, and the migration location has not been found yet. You can delete it once you find it-->
- Download [pulsar-spark-connector_2.11](https://oss-cdn.nebula-graph.com.cn/jar-packages/pulsar-spark-connector_2.11.zip), and unzip it to the `io/streamnative/connectors` directory of the local Maven repository, as sketched below.
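
A sketch of this step, assuming the local Maven repository is at the default `~/.m2/repository` and the archive unpacks directly into the connector's artifact directories:

```bash
# Download the connector package and unzip it into the local Maven repository.
wget https://oss-cdn.nebula-graph.com.cn/jar-packages/pulsar-spark-connector_2.11.zip
unzip pulsar-spark-connector_2.11.zip -d ~/.m2/repository/io/streamnative/connectors
```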

## Steps

1. Clone the repository `nebula-spark-utils` in the `/` directory.

```bash
git clone -b {{exchange.branch}} https://github.com/vesoft-inc/nebula-spark-utils.git
```

2. Switch to the directory `nebula-exchange`.

```bash
cd nebula-spark-utils/nebula-exchange
```

3. Package Nebula Exchange.

```bash
mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true
```

After the compilation is successful, you can view a directory structure similar to the following in the current directory.

```text
.
├── README-CN.md
├── README.md
├── pom.xml
├── src
│   ├── main
│   └── test
└── target
├── classes
├── classes.timestamp
├── maven-archiver
├── nebula-exchange-2.x.y-javadoc.jar
├── nebula-exchange-2.x.y-sources.jar
├── nebula-exchange-2.x.y.jar
├── original-nebula-exchange-2.x.y.jar
└── site
```

In the `target` directory, users can find the `nebula-exchange-2.x.y.jar` file.

!!! note

The JAR file version changes with the release of the Nebula Java Client. Users can view the latest version on the [Releases page](https://github.com/vesoft-inc/nebula-java/releases).

When migrating data, you can refer to the configuration file [`target/classes/application.conf`](https://github.com/vesoft-inc/nebula-spark-utils/blob/master/nebula-exchange/src/main/resources/application.conf).

## Failed to download the dependency package

If downloading dependencies fails at compile time:

- Check the network settings and ensure that the network is working.

- Modify the `mirror` section of the `libexec/conf/settings.xml` file in the Maven installation directory:

```text
<mirror>
<id>alimaven</id>
<mirrorOf>central</mirrorOf>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
</mirror>
```
Binary file added docs-2.0/nebula-exchange/figs/ex-ug-002.png
Binary file added docs-2.0/nebula-exchange/figs/ex-ug-003.png
61 changes: 0 additions & 61 deletions docs-2.0/nebula-exchange/nebula-exchange.md

This file was deleted.

@@ -1,4 +1,4 @@
# Import Command Parameters
# Options for import

After editing the configuration file, run the following commands to import specified source data into the Nebula Graph database.

@@ -45,4 +45,5 @@ The following table lists command parameters.
| `-D` / `--dry` | No | `false` | Checks whether the format of the configuration file meets the requirements, but does not check whether the configuration items of `tags` and `edges` are correct. This parameter cannot be used when users import data. |
| `-r` / `--reload` | No | - | Specifies the path of the reload file that needs to be reloaded. |

For more Spark parameter configurations, see [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html#runtime-environment).
