add kerberos in hdfs (#321) (#2484)
* add kerberos in hdfs

https://confluence.nebula-graph.io/pages/viewpage.action?pageId=97847199

* Update ex-ug-import-from-csv.md

* Update ex-ug-import-from-csv.md

* update

* update

* update

* update
cooper-lzy authored Feb 27, 2024
1 parent cd882ae commit 472456a
Showing 10 changed files with 280 additions and 0 deletions.
@@ -365,6 +365,34 @@ ${SPARK_HOME}/bin/spark-submit --master "local" --class com.vesoft.nebula.excha

You can search for `batchSuccess.<tag_name/edge_name>` in the command output to check the number of successes. For example, `batchSuccess.follow: 300`.
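
If the submission output is captured to a file, the success counters can also be pulled out with standard tools. The sketch below fabricates a placeholder log line to stay self-contained; in practice the log comes from the `spark-submit` command's output:

```shell
# Placeholder log content simulating Exchange output; in practice this
# comes from the stdout of the spark-submit command above.
printf 'batchSuccess.follow: 300\nbatchFailure.follow: 0\n' > exchange.log

# Extract only the success counters.
grep 'batchSuccess' exchange.log
```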

#### Access HDFS data with Kerberos authentication

When Kerberos is used for security authentication, you can access the HDFS data in either of the following ways.

- Specify the Kerberos configuration file in the command

Set the `--conf` and `--files` options in the `spark-submit` command, for example:

```bash
${SPARK_HOME}/bin/spark-submit --master xxx --num-executors 2 --executor-cores 2 --executor-memory 1g \
--conf "spark.driver.extraJavaOptions=-Djava.security.krb5.conf=./krb5.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.krb5.conf=./krb5.conf" \
--files /local/path/to/xxx.keytab,/local/path/to/krb5.conf \
--class com.vesoft.nebula.exchange.Exchange \
exchange.jar -c xx.conf
```

The file path in `--conf` can be specified in either of the following ways:

- As an absolute path. The file must exist at that same path on every YARN or Spark machine.
- As a relative path, for example `./krb5.conf` (recommended in YARN mode). Resource files uploaded via `--files` are placed in the working directory of the Java virtual machine or the JAR.

The files passed to `--files` must exist on the machine where the `spark-submit` command is executed.
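
Before submitting, you can check on that machine that the keytab and `krb5.conf` are usable, with the standard MIT Kerberos client tools. This is a sketch; the principal `user@EXAMPLE.COM` is a placeholder, not a value from this document:

```shell
# Inspect the principals stored in the keytab (placeholder path from above).
klist -kt /local/path/to/xxx.keytab

# Obtain a ticket non-interactively with the keytab, using the same
# krb5.conf that will be shipped to Spark via --files.
# "user@EXAMPLE.COM" is a placeholder; use a principal from the keytab.
KRB5_CONFIG=/local/path/to/krb5.conf kinit -kt /local/path/to/xxx.keytab user@EXAMPLE.COM

# Confirm that the credential cache now holds a valid ticket.
klist
```

If `kinit` succeeds here, later failures in the Spark job are more likely caused by path or upload issues than by the credentials themselves.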

- Without command-line options

Deploy Spark and the Kerberos-authenticated Hadoop in the same cluster so that they share HDFS and YARN, and then add `export HADOOP_HOME=<hadoop_home_path>` to `spark-env.sh` in Spark.
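
The shared-cluster setup can be sketched as the following `spark-env.sh` fragment. The paths are placeholders for your own installation, not values from this document:

```shell
# Hypothetical installation path; adjust to your environment.
export HADOOP_HOME=/opt/hadoop

# Optional but common: point Spark at the cluster's Hadoop configuration
# (core-site.xml, hdfs-site.xml) so it inherits the Kerberos settings.
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
```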

### Step 5: (optional) Validate data

To verify that the data has been imported, execute a query in a NebulaGraph client (such as NebulaGraph Studio). For example:
