exchange import mode (#2349)
Co-authored-by: Chris Chen <[email protected]>
cooper-lzy and ChrisChen2023 authored Nov 14, 2023
1 parent f4a7bc3 commit a5d7543
Showing 2 changed files with 48 additions and 44 deletions.

After editing the configuration file, run the following commands to import specified source data into the NebulaGraph database.

## Import data

```bash
<spark_install_path>/bin/spark-submit --master "spark://HOST:PORT" --class com.vesoft.nebula.exchange.Exchange <nebula-exchange-2.x.y.jar_path> -c <application.conf_path>
```

The following table lists command parameters.

| Parameter | Required | Default value | Description |
| :--- | :--- | :--- | :--- |
| `--class` | Yes | - | Specify the main class of the driver. |
| `--master` | Yes | - | Specify the URL of the master process in a Spark cluster. For more information, see [master-urls](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls). Optional values are:</br>`local`: Local mode. Run Spark applications on a single thread. Suitable for importing small data sets in a test environment.</br>`yarn`: Run Spark applications on a YARN cluster. Suitable for importing large data sets in a production environment.</br>`spark://HOST:PORT`: Connect to the specified Spark standalone cluster.</br>`mesos://HOST:PORT`: Connect to the specified Mesos cluster.</br>`k8s://HOST:PORT`: Connect to the specified Kubernetes cluster. |
| `-c`/`--config` | Yes | - | Specify the path of the configuration file. |
| `-h`/`--hive` | No | `false` | Specify whether to support importing data from Hive. |
| `-D`/`--dry` | No | `false` | Specify whether to check the format of the configuration file. This parameter only checks the format of the configuration file; it does not check the validity of the `tags` and `edges` configurations, and it does not import any data. Do not add this parameter when you need to import data. |
| `-r`/`--reload` | No | - | Specify the path of the reload file that needs to be reloaded. |

For more Spark parameter configurations, see [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html#runtime-environment).
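To make the placeholders concrete, the following is a hypothetical sketch: the installation paths, the JAR version (`nebula-exchange-2.6.0.jar`), and the master URL `spark://192.168.10.100:7077` are assumed values for illustration, not values taken from this document. The first command uses `-D` to check only the format of the configuration file; the second runs the actual import on a standalone Spark cluster.

```bash
# Hypothetical paths, version, and master URL; replace them with your own values.
# 1. Dry run: check the configuration file format only. No data is imported.
/opt/spark/bin/spark-submit --master "local" \
  --class com.vesoft.nebula.exchange.Exchange \
  /opt/nebula-exchange/nebula-exchange-2.6.0.jar \
  -c /opt/nebula-exchange/application.conf -D

# 2. Actual import: submit the job to a standalone Spark cluster.
/opt/spark/bin/spark-submit --master "spark://192.168.10.100:7077" \
  --class com.vesoft.nebula.exchange.Exchange \
  /opt/nebula-exchange/nebula-exchange-2.6.0.jar \
  -c /opt/nebula-exchange/application.conf
```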

!!! note

    - The version number of a JAR file is subject to the name of the JAR file that is actually compiled.

    - If users use the [yarn mode](https://spark-reference-doc-cn.readthedocs.io/zh_CN/latest/deploy-guide/running-on-yarn.html) to submit a job, see the following command, **especially the two `--conf` settings in the example**. `--files` distributes `application.conf` to the working directories of the driver and the executors, and the two `--conf` settings add that directory (`./`) to their classpaths so that Exchange can find the configuration file.

      ```bash
      $SPARK_HOME/bin/spark-submit --master yarn \
      --class com.vesoft.nebula.exchange.Exchange \
      --files application.conf \
      --conf spark.driver.extraClassPath=./ \
      --conf spark.executor.extraClassPath=./ \
      <nebula-exchange-2.x.y.jar_path> \
      -c application.conf
      ```

## Import the reload file

If some data fails to be imported, the failed data is stored in the reload file. Use the `-r` parameter to import the data in the reload file again.

```bash
<spark_install_path>/bin/spark-submit --master "spark://HOST:PORT" --class com.vesoft.nebula.exchange.Exchange <nebula-exchange-2.x.y.jar_path> -c <application.conf_path> -r "<reload_file_path>"
```
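The reload path depends on where the failed records were written. As a hedged example, many Exchange configuration templates direct failed data to the directory set by `error.output`, commonly `/tmp/errors`; assuming that path and the same hypothetical installation as above, the reload submission would look like this:

```bash
# Hypothetical sketch: assumes failed records were written to /tmp/errors
# and that the same application.conf from the original import is reused.
/opt/spark/bin/spark-submit --master "spark://192.168.10.100:7077" \
  --class com.vesoft.nebula.exchange.Exchange \
  /opt/nebula-exchange/nebula-exchange-2.6.0.jar \
  -c /opt/nebula-exchange/application.conf \
  -r "/tmp/errors"
```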

If the import still fails, ask for help on the [official forum](https://github.com/vesoft-inc/nebula/discussions).

After editing the configuration file, run the following commands to import data from the specified source into the {{nebula.name}} database.

## Import data

```bash
<spark_install_path>/bin/spark-submit --master "spark://HOST:PORT" --class com.vesoft.nebula.exchange.Exchange <nebula-exchange-2.x.y.jar_path> -c <application.conf_path>
```

The parameters are described as follows.

| Parameter | Required | Default value | Description |
| :--- | :--- | :--- | :--- |
| `--class` | Yes | - | Specify the main class of the driver. |
| `--master` | Yes | - | Specify the master URL of the Spark cluster. For more information, see [master-urls](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls). Optional values are:</br>`local`: Local mode. Run Spark applications on a single thread. Suitable for importing small data sets in a test environment.</br>`yarn`: Run Spark applications on a YARN cluster. Suitable for importing large data sets in a production environment.</br>`spark://HOST:PORT`: Connect to the specified Spark standalone cluster.</br>`mesos://HOST:PORT`: Connect to the specified Mesos cluster.</br>`k8s://HOST:PORT`: Connect to the specified Kubernetes cluster. |
| `-c`/`--config` | Yes | - | Specify the path of the configuration file. |
| `-h`/`--hive` | No | `false` | Add this parameter to support importing data from Hive. |
| `-D`/`--dry` | No | `false` | Specify whether to check the format of the configuration file. This parameter only checks the format of the configuration file; it does not check the validity of the `tags` and `edges` configurations, and it does not import any data. Do not add this parameter when you need to import data. |
| `-r`/`--reload` | No | - | Specify the path of the reload file that needs to be reloaded. |

For more Spark parameter configurations, see [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html#runtime-environment).

!!! note

    - The version number of a JAR file is subject to the name of the JAR file that is actually compiled.

    - If users use the [yarn mode](https://spark-reference-doc-cn.readthedocs.io/zh_CN/latest/deploy-guide/running-on-yarn.html) to submit a job, see the following example, **especially the two `--conf` settings in it**. `--files` distributes `application.conf` to the working directories of the driver and the executors, and the two `--conf` settings add that directory (`./`) to their classpaths so that Exchange can find the configuration file.

      ```bash
      $SPARK_HOME/bin/spark-submit --master yarn \
      --class com.vesoft.nebula.exchange.Exchange \
      --files application.conf \
      --conf spark.driver.extraClassPath=./ \
      --conf spark.executor.extraClassPath=./ \
      <nebula-exchange-2.x.y.jar_path> \
      -c application.conf
      ```

## Import the reload file

If some data fails to be imported, the failed data is stored in the reload file. Use the `-r` parameter to import the data in the reload file again.

```bash
<spark_install_path>/bin/spark-submit --master "spark://HOST:PORT" --class com.vesoft.nebula.exchange.Exchange <nebula-exchange-2.x.y.jar_path> -c <application.conf_path> -r "<reload_file_path>"
```

If the import still fails, ask for help on the [forum](https://discuss.nebula-graph.com.cn/).
