From 84b0b0a5900dd9b101228cfa89a0742af1c48f36 Mon Sep 17 00:00:00 2001
From: Simon Cheung
Date: Tue, 26 Apr 2022 10:47:37 +0800
Subject: [PATCH 1/3] translate hugegraph-loader.md into english

---
 .../en/docs/quickstart/hugegraph-loader.md | 517 +++++++++---------
 1 file changed, 258 insertions(+), 259 deletions(-)

diff --git a/content/en/docs/quickstart/hugegraph-loader.md b/content/en/docs/quickstart/hugegraph-loader.md
index c095f28be..f58a02b8e 100644
--- a/content/en/docs/quickstart/hugegraph-loader.md
+++ b/content/en/docs/quickstart/hugegraph-loader.md
@@ -4,143 +4,142 @@ linkTitle: "Load data with HugeGraph-Loader"
 weight: 2
 ---

-### 1 HugeGraph-Loader概述
+### 1 HugeGraph-Loader overview

-HugeGraph-Loader 是 HugeGragh 的数据导入组件,能够将多种数据源的数据转化为图的顶点和边并批量导入到图数据库中。
+HugeGraph-Loader is the data import component of HugeGraph. It converts data from various data sources into graph vertices and edges and imports them into the graph database in batches.

-目前支持的数据源包括:
+Currently supported data sources include:
+- Local disk files or directories; TEXT, CSV and JSON files are supported, as are compressed files
+- HDFS files or directories; compressed files are supported
+- Mainstream relational databases, such as MySQL, PostgreSQL, Oracle and SQL Server

-- 本地磁盘文件或目录,支持 TEXT、CSV 和 JSON 格式的文件,支持压缩文件
-- HDFS 文件或目录,支持压缩文件
-- 主流关系型数据库,如 MySQL、PostgreSQL、Oracle、SQL Server
+Local disk files and HDFS files support resuming an interrupted import.

-本地磁盘文件和 HDFS 文件支持断点续传。
+This is explained in detail below.
-后面会具体说明。
+> Note: HugeGraph-Loader depends on the HugeGraph-Server service. To download and start the server, see [HugeGraph-Server Quick Start](/docs/quickstart/hugegraph-server)

-> 注意:使用 HugeGraph-Loader 需要依赖 HugeGraph Server 服务,下载和启动 Server 请参考 [HugeGraph-Server Quick Start](/docs/quickstart/hugegraph-server)
+### 2 Get HugeGraph-Loader

-### 2 获取 HugeGraph-Loader
+There are two ways to get HugeGraph-Loader:

-有两种方式可以获取 HugeGraph-Loader:
+- Download the compiled archive
+- Clone the source code, then compile and install it

-- 下载已编译的压缩包
-- 克隆源码编译安装
+#### 2.1 Download the compiled archive

-#### 2.1 下载已编译的压缩包
-
-下载最新版本的 HugeGraph-Loader release 包:
+Download the latest version of the HugeGraph-Loader release package:

 ```bash
 wget https://github.com/hugegraph/hugegraph-loader/releases/download/v${version}/hugegraph-loader-${version}.tar.gz
 tar zxvf hugegraph-loader-${version}.tar.gz
 ```

-#### 2.2 克隆源码编译安装
+#### 2.2 Clone the source code, then compile and install it

-克隆最新版本的 HugeGraph-Loader 源码包:
+Clone the latest HugeGraph-Loader source code:

 ```bash
 $ git clone https://github.com/hugegraph/hugegraph-loader.git
 ```

-由于Oracle ojdbc license的限制,需要手动安装ojdbc到本地maven仓库。
-访问[Oracle jdbc 下载](https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html) 页面。选择Oracle Database 12c Release 2 (12.2.0.1) drivers,如下图所示。
+Because of Oracle ojdbc license restrictions, you need to install ojdbc into the local Maven repository manually.
+Visit the [Oracle jdbc downloads](https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html) page. Select Oracle Database 12c Release 2 (12.2.0.1) drivers, as shown in the following figure.
image
-打开链接后,选择“ojdbc8.jar”, 如下图所示。 +After opening the link, select "ojdbc8.jar" as shown below.
image
- 把ojdbc8安装到本地maven仓库,进入``ojdbc8.jar``所在目录,执行以下命令。
+ Install ojdbc8 into the local Maven repository: go to the directory containing ``ojdbc8.jar`` and run the following command.
 ```
 mvn install:install-file -Dfile=./ojdbc8.jar -DgroupId=com.oracle -DartifactId=ojdbc8 -Dversion=12.2.0.1 -Dpackaging=jar
 ```

-编译生成 tar 包:
+Build the tar package:

 ```bash
 cd hugegraph-loader
 mvn clean package -DskipTests
 ```

-### 3 使用流程
+### 3 How to use

-使用 HugeGraph-Loader 的基本流程分为以下几步:
+The basic workflow for using HugeGraph-Loader consists of the following steps:

-- 编写图模型
-- 准备数据文件
-- 编写输入源映射文件
-- 执行命令导入
+- Write the graph model
+- Prepare the data files
+- Write the input source mapping file
+- Run the import command

-#### 3.1 编写图模型
+#### 3.1 Write the graph model

-这一步是建模的过程,用户需要对自己已有的数据和想要创建的图模型有一个清晰的构想,然后编写 schema 建立图模型。
+This step is the modeling process. Users need a clear picture of their existing data and of the graph model they want to create, and then write the schema that builds the graph model.

-比如想创建一个拥有两类顶点及两类边的图,顶点是"人"和"软件",边是"人认识人"和"人创造软件",并且这些顶点和边都带有一些属性,比如顶点"人"有:"姓名"、"年龄"等属性,
-"软件"有:"名字"、"售卖价格"等属性;边"认识"有: "日期"属性等。
+For example, suppose you want to create a graph with two types of vertices and two types of edges: the vertices are "person" and "software", and the edges are "person knows person" and "person created software". These vertices and edges carry attributes; for example, the "person" vertex has attributes such as "name" and "age",
+the "software" vertex has attributes such as "name" and "sale price", and the "knows" edge has attributes such as "date".
image -

示例图模型

+

Example graph model

-在设计好了图模型之后,我们可以用`groovy`编写出`schema`的定义,并保存至文件中,这里命名为`schema.groovy`。
+After designing the graph model, we can write the `schema` definition in `groovy` and save it to a file, named `schema.groovy` here.

 ```groovy
-// 创建一些属性
+// Create some properties
 schema.propertyKey("name").asText().ifNotExist().create();
 schema.propertyKey("age").asInt().ifNotExist().create();
 schema.propertyKey("city").asText().ifNotExist().create();
 schema.propertyKey("date").asText().ifNotExist().create();
 schema.propertyKey("price").asDouble().ifNotExist().create();

-// 创建 person 顶点类型,其拥有三个属性:name, age, city,主键是 name
+// Create the person vertex type, which has three properties: name, age and city; the primary key is name
 schema.vertexLabel("person").properties("name", "age", "city").primaryKeys("name").ifNotExist().create();
-// 创建 software 顶点类型,其拥有两个属性:name, price,主键是 name
+// Create the software vertex type, which has two properties: name and price; the primary key is name
 schema.vertexLabel("software").properties("name", "price").primaryKeys("name").ifNotExist().create();

-// 创建 knows 边类型,这类边是从 person 指向 person 的
+// Create the knows edge type, which points from person to person
 schema.edgeLabel("knows").sourceLabel("person").targetLabel("person").ifNotExist().create();
-// 创建 created 边类型,这类边是从 person 指向 software 的
+// Create the created edge type, which points from person to software
 schema.edgeLabel("created").sourceLabel("person").targetLabel("software").ifNotExist().create();
 ```

-> 关于 schema 的详细说明请参考 [hugegraph-client](/docs/clients/hugegraph-client) 中对应部分。
+> For a detailed description of the schema, see the corresponding section of [hugegraph-client](/docs/clients/hugegraph-client).
-#### 3.2 准备数据
+#### 3.2 Prepare data

-目前 HugeGraph-Loader 支持的数据源包括:
+The data sources currently supported by HugeGraph-Loader include:

-- 本地磁盘文件或目录
-- HDFS 文件或目录
-- 部分关系型数据库
+- Local disk files or directories
+- HDFS files or directories
+- Some relational databases

-##### 3.2.1 数据源结构
+##### 3.2.1 Data source structure

-###### 3.2.1.1 本地磁盘文件或目录
+###### 3.2.1.1 Local disk file or directory

-用户可以指定本地磁盘文件作为数据源,如果数据分散在多个文件中,也支持以某个目录作为数据源,但暂时不支持以多个目录作为数据源。
+Users can specify a local disk file as the data source. If the data is spread across multiple files, a directory can be used as the data source instead, but using multiple directories as the data source is not supported for now.

-比如:我的数据分散在多个文件中,part-0、part-1 ... part-n,要想执行导入,必须保证它们是放在一个目录下的。然后在 loader 的映射文件中,将`path`指定为该目录即可。
+For example, if the data is spread across multiple files part-0, part-1 ... part-n, they must all be placed in one directory before importing. Then, in the loader's mapping file, set `path` to that directory.

-支持的文件格式包括:
+Supported file formats include:

 - TEXT
 - CSV
 - JSON

-TEXT 是自定义分隔符的文本文件,第一行通常是标题,记录了每一列的名称,也允许没有标题行(在映射文件中指定)。其余的每行代表一条记录,会被转化为一个顶点/边;行的每一列对应一个字段,会被转化为顶点/边的 id、label 或属性;
+TEXT is a text file with a custom delimiter. The first line is usually a header that records the name of each column; a file without a header line is also allowed (in that case the header is specified in the mapping file). Each remaining line represents a record and is converted into a vertex/edge; each column of a line corresponds to a field and is converted into the id, label or attribute of the vertex/edge;

-示例如下:
+An example is as follows:

 ```
 id|name|lang|price|ISBN
@@ -148,70 +147,70 @@ id|name|lang|price|ISBN
 2|ripple|java|199|ISBN978-7-100-13678-5
 ```

-CSV 是分隔符为逗号`,`的 TEXT 文件,当列值本身包含逗号时,该列值需要用双引号包起来,如:
+CSV is a TEXT file that uses a comma `,` as the delimiter. When a column value itself contains a comma, the value must be enclosed in double quotes, for example:

 ```
 marko,29,Beijing
 "li,nary",26,"Wu,han"
 ```

-JSON 文件要求每一行都是一个 JSON 串,且每行的格式需保持一致。
+A JSON file requires every line to be a JSON string, and all lines must use the same format.

 ```json
 {"source_name": "marko", "target_name": "vadas", "date": "20160110", "weight": 0.5}
 {"source_name": "marko", "target_name": "josh", "date": "20130220", "weight": 1.0}
 ```

-###### 3.2.1.2 HDFS 文件或目录
+###### 3.2.1.2 HDFS file or directory

-用户也可以指定 HDFS 文件或目录作为数据源,上面关于`本地磁盘文件或目录`的要求全部适用于这里。除此之外,鉴于 HDFS 上通常存储的都是压缩文件,loader 也提供了对压缩文件的支持,并且`本地磁盘文件或目录`同样支持压缩文件。
+Users can also specify an HDFS file or directory as the data source; all of the requirements above for `local disk file or directory` apply here as well. In addition, because files stored on HDFS are usually compressed, the loader supports compressed files, and `local disk file or directory` supports them too.

-目前支持的压缩文件类型包括:GZIP、BZ2、XZ、LZMA、SNAPPY_RAW、SNAPPY_FRAMED、Z、DEFLATE、LZ4_BLOCK、LZ4_FRAMED、ORC 和 PARQUET。
+Currently supported compressed file types include: GZIP, BZ2, XZ, LZMA, SNAPPY_RAW, SNAPPY_FRAMED, Z, DEFLATE, LZ4_BLOCK, LZ4_FRAMED, ORC, and PARQUET.

-###### 3.2.1.3 主流关系型数据库
+###### 3.2.1.3 Mainstream relational databases

-loader 还支持以部分关系型数据库作为数据源,目前支持 MySQL、PostgreSQL、Oracle 和 SQL Server。
+The loader also supports some relational databases as data sources; MySQL, PostgreSQL, Oracle and SQL Server are currently supported.

-但目前对表结构要求较为严格,如果导入过程中需要做**关联查询**,这样的表结构是不允许的。关联查询的意思是:在读到表的某行后,发现某列的值不能直接使用(比如外键),需要再去做一次查询才能确定该列的真实值。
+However, the requirements on the table structure are currently strict: a table structure that requires a **join query** (associated query) during import is not supported. A join query here means that, after reading a row of the table, the value of some column cannot be used directly (a foreign key, for example), and another query is needed to determine the real value of that column.

-举个例子:假设有三张表,person、software 和 created
+For example, suppose there are three tables: person, software and created

 ```
-// person 表结构
-id | name | age | city
+// person table structure
+id | name | age | city
 ```

 ```
-// software 表结构
+// software table structure
 id | name | lang | price
 ```

 ```
-// created 表结构
+// created table structure
 id | p_id | s_id | date
 ```

-如果在建模(schema)时指定 person 或 software 的 id 策略是 PRIMARY_KEY,选择以 name 作为 primary keys(注意:这是 hugegraph 中 vertexlabel 的概念),在导入边数据时,由于需要拼接出源顶点和目标顶点的 id,必须拿着 p_id/s_id 去 person/software 表中查到对应的 name,这种需要做额外查询的表结构的情况,loader 暂时是不支持的。这时可以采用以下两种方式替代:
+If, when modeling (schema), the id strategy of person or software is set to PRIMARY_KEY with name as the primary key (note: this is the vertexlabel concept in hugegraph), then importing edge data requires assembling the ids of the source and target vertices, so the loader would have to look up the corresponding name in the person/software table using p_id/s_id. Table structures that need such extra queries are not supported by the loader for now. In this case, either of the following two alternatives can be used:

-1. 仍然指定 person 和 software 的 id 策略为 PRIMARY_KEY,但是以 person 表和 software 表的 id 列作为顶点的主键属性,这样导入边时直接使用 p_id 和 s_id 和顶点的 label 拼接就能生成 id 了;
-2. 指定 person 和 software 的 id 策略为 CUSTOMIZE,然后直接以 person 表和 software 表的 id 列作为顶点 id,这样导入边时直接使用 p_id 和 s_id 即可;
+1. Keep the id strategy of person and software as PRIMARY_KEY, but use the id column of the person and software tables as the primary key attribute of the vertex, so that the vertex id can be generated when importing edges by directly concatenating p_id and s_id with the vertex label;
+2. Set the id strategy of person and software to CUSTOMIZE, and use the id column of the person and software tables directly as the vertex id, so that p_id and s_id can be used directly when importing edges;

-关键点就是要让边能直接使用 p_id 和 s_id,不要再去查一次。
+The key point is to let the edges use p_id and s_id directly, without another lookup.

-##### 3.2.2 准备顶点和边数据
+##### 3.2.2 Prepare vertex and edge data

-###### 3.2.2.1 顶点数据
+###### 3.2.2.1 Vertex data

-顶点数据文件由一行一行的数据组成,一般每一行作为一个顶点,每一列会作为顶点属性。下面以 CSV 格式作为示例进行说明。
+A vertex data file consists of lines of data. Generally, each line becomes a vertex and each column becomes a vertex attribute. The CSV format is used as an example below.

-- person 顶点数据(数据本身不包含 header)
+- person vertex data (the data itself does not contain a header)

 ```csv
 Tom,48,Beijing
 Jerry,36,Shanghai
 ```

-- software 顶点数据(数据本身包含 header)
+- software vertex data (the data itself contains a header)

 ```csv
 name,price
@@ -219,17 +218,17 @@ Photoshop,999
 Office,388
 ```

-###### 3.2.2.2 边数据
+###### 3.2.2.2 Edge data

-边数据文件由一行一行的数据组成,一般每一行作为一条边,其中有部分列会作为源顶点和目标顶点的 id,其他列作为边属性。下面以 JSON 格式作为示例进行说明。
+An edge data file consists of lines of data. Generally, each line becomes an edge; some of the columns serve as the ids of the source and target vertices, and the other columns become edge attributes. The JSON format is used as an example below.
-- knows 边数据
+- knows edge data

 ```json
 {"source_name": "Tom", "target_name": "Jerry", "date": "2008-12-12"}
 ```

-- created 边数据
+- created edge data

 ```json
 {"source_name": "Tom", "target_name": "Photoshop"}
@@ -237,19 +236,19 @@ Office,388
 {"source_name": "Jerry", "target_name": "Office"}
 ```

-#### 3.3 编写数据源映射文件
+#### 3.3 Write the data source mapping file

-##### 3.3.1 映射文件概述
+##### 3.3.1 Mapping file overview

-输入源的映射文件用于描述如何将输入源数据与图的顶点类型/边类型建立映射关系,以`JSON`格式组织,由多个映射块组成,其中每一个映射块都负责将一个输入源映射为顶点和边。
+The mapping file of the input source describes how to map input source data to the vertex types/edge types of the graph. It is organized in `JSON` format and consists of multiple mapping blocks, each of which is responsible for mapping one input source to vertices and edges.

-具体而言,每个映射块包含**一个输入源**和多个**顶点映射**与**边映射**块,输入源块对应上面介绍的`本地磁盘文件或目录`、`HDFS 文件或目录`和`关系型数据库`,负责描述数据源的基本信息,比如数据在哪,是什么格式的,分隔符是什么等。顶点映射/边映射与该输入源绑定,可以选择输入源的哪些列,哪些列作为id、哪些列作为属性,以及每一列映射成什么属性,列的值映射成属性的什么值等等。
+Specifically, each mapping block contains **one input source** and multiple **vertex mapping** and **edge mapping** blocks. The input source block corresponds to the `local disk file or directory`, `HDFS file or directory` and `relational database` described above, and describes the basic information of the data source, such as where the data is, what format it is in and what the delimiter is. The vertex/edge mappings are bound to that input source; they select which columns of the input source to use, which columns serve as the id and which as attributes, which attribute each column is mapped to, which attribute value each column value is mapped to, and so on.
-以最通俗的话讲,每一个映射块描述了:要导入的文件在哪,文件的每一行要作为哪一类顶点/边,文件的哪些列是需要导入的,以及这些列对应顶点/边的什么属性等。
+In the simplest terms, each mapping block describes where the file to be imported is, which type of vertex/edge each line of the file becomes, which columns of the file need to be imported, and which vertex/edge attributes those columns correspond to.

-> 注意:0.11.0 版本以前的映射文件与 0.11.0 以后的格式变化较大,为表述方便,下面称 0.11.0 以前的映射文件(格式)为 1.0 版本,0.11.0 以后的为 2.0 版本。并且若无特殊说明,“映射文件”表示的是 2.0 版本的。
+> Note: The mapping file format used before version 0.11.0 differs substantially from the format used after 0.11.0. For convenience, the format before 0.11.0 is called version 1.0 below, and the format after 0.11.0 is called version 2.0. Unless otherwise specified, "mapping file" refers to version 2.0.

-2.0 版本的映射文件的框架为:
+The skeleton of a version 2.0 mapping file is:

 ```json
 {
@@ -272,9 +271,9 @@ Office,388
 }
 ```

-这里直接给出两个版本的映射文件(描述了上面图模型和数据文件)
+The mapping files for both versions are given directly below (they describe the graph model and data files above)

-2.0 版本的映射文件:
+Mapping file for version 2.0:

 ```json
 {
@@ -478,7 +477,7 @@ Office,388
 }
 ```

-1.0 版本的映射文件:
+Mapping file for version 1.0:

 ```json
 {
@@ -535,134 +534,134 @@ Office,388
 }
 ```

-映射文件 1.0 版本是以顶点和边为中心,设置输入源;而 2.0 版本是以输入源为中心,设置顶点和边映射。有些输入源(比如一个文件)既能生成顶点,也能生成边,如果用 1.0 版的格式写,就需要在 vertex 和 egde 映射块中各写一次 input 块,这两次的 input 块是完全一样的;而 2.0 版本只需要写一次 input。所以 2.0 版相比于 1.0 版,能省掉一些 input 的重复书写。
+Version 1.0 of the mapping file is centered on vertices and edges, with an input source set for each; version 2.0 is centered on the input source, with vertex and edge mappings set for it. Some input sources (such as a single file) can produce both vertices and edges. In the 1.0 format, the input block must be written once in the vertex mapping block and once in the edge mapping block, and the two input blocks are exactly the same; in the 2.0 format, the input only needs to be written once.
Therefore, compared with version 1.0, version 2.0 can save some repetitive writing of input. -在 hugegraph-loader-{version} 的 bin 目录下,有一个脚本工具 `mapping-convert.sh` 能直接将 1.0 版本的映射文件转换为 2.0 版本的,使用方式如下: +In the bin directory of hugegraph-loader-{version}, there is a script tool `mapping-convert.sh` that can directly convert the mapping file of version 1.0 to version 2.0. The usage is as follows: ```bash bin/mapping-convert.sh struct.json ``` -会在 struct.json 的同级目录下生成一个 struct-v2.json。 +A struct-v2.json will be generated in the same directory as struct.json. -##### 3.3.2 输入源 +##### 3.3.2 Input Source -输入源目前分为三类:FILE、HDFS、JDBC,由`type`节点区分,我们称为本地文件输入源、HDFS 输入源和 JDBC 输入源,下面分别介绍。 +Input sources are currently divided into three categories: FILE, HDFS, and JDBC, which are distinguished by the `type` node. We call them local file input sources, HDFS input sources, and JDBC input sources, which are described below. -###### 3.3.2.1 本地文件输入源 +###### 3.3.2.1 Local file input source -- id: 输入源的 id,该字段用于支持一些内部功能,非必填(未填时会自动生成),强烈建议写上,对于调试大有裨益; -- skip: 是否跳过该输入源,由于 JSON 文件无法添加注释,如果某次导入时不想导入某个输入源,但又不想删除该输入源的配置,则可以设置为 true 将其跳过,默认为 false,非必填; -- input: 输入源映射块,复合结构 - - type: 输入源类型,必须填 file 或 FILE; - - path: 本地文件或目录的路径,绝对路径或相对于映射文件的相对路径,建议使用绝对路径,必填; - - file_filter: 从`path`中筛选复合条件的文件,复合结构,目前只支持配置扩展名,用子节点`extensions`表示,默认为"*",表示保留所有文件; - - format: 本地文件的格式,可选值为 CSV、TEXT 及 JSON,必须大写,必填; - - header: 文件各列的列名,如不指定则会以数据文件第一行作为 header;当文件本身有标题且又指定了 header,文件的第一行会被当作普通的数据行;JSON 文件不需要指定 header,选填; - - delimiter: 文件行的列分隔符,默认以逗号`","`作为分隔符,`JSON`文件不需要指定,选填; - - charset: 文件的编码字符集,默认`UTF-8`,选填; - - date_format: 自定义的日期格式,默认值为 yyyy-MM-dd HH:mm:ss,选填;如果日期是以时间戳的形式呈现的,此项须写为`timestamp`(固定写法); - - time_zone: 设置日期数据是处于哪个时区的,默认值为`GMT+8`,选填; - - skipped_line: 想跳过的行,复合结构,目前只能配置要跳过的行的正则表达式,用子节点`regex`描述,默认不跳过任何行,选填; - - compression: 文件的压缩格式,可选值为 NONE、GZIP、BZ2、XZ、LZMA、SNAPPY_RAW、SNAPPY_FRAMED、Z、DEFLATE、LZ4_BLOCK、LZ4_FRAMED、ORC 和 PARQUET,默认为 NONE,表示非压缩文件,选填; - - list_format: 当文件(非 JSON )的某列是集合结构时(对应图中的 PropertyKey 的 
Cardinality 为 Set 或 List),可以用此项设置该列的起始符、分隔符、结束符,复合结构: - - start_symbol: 集合结构列的起始符 (默认值是 `[`, JSON 格式目前不支持指定) - - elem_delimiter: 集合结构列的分隔符 (默认值是 `|`, JSON 格式目前只支持原生`,`分隔) - - end_symbol: 集合结构列的结束符 (默认值是 `]`, JSON 格式目前不支持指定) +- id: The id of the input source. This field is used to support some internal functions. It is not required (it will be automatically generated if it is not filled in). It is strongly recommended to write it, which is very helpful for debugging; +- skip: whether to skip the input source, because the JSON file cannot add comments, if you do not want to import an input source during a certain import, but do not want to delete the configuration of the input source, you can set it to true to skip it, the default is false, not required; +- input: input source map block, composite structure + - type: input source type, file or FILE must be filled; + - path: the path of the local file or directory, the absolute path or the relative path relative to the mapping file, it is recommended to use the absolute path, required; + - file_filter: filter files with compound conditions from `path`, compound structure, currently only supports configuration extensions, represented by child node `extensions`, the default is "*", which means to keep all files; + - format: the format of the local file, the optional values ​​are CSV, TEXT and JSON, which must be uppercase and required; + - header: the column name of each column of the file, if not specified, the first line of the data file will be used as the header; when the file itself has a header and the header is specified, the first line of the file will be treated as a normal data line; JSON The file does not need to specify a header, optional; + - delimiter: The column delimiter of the file line, the default is comma `","` as the delimiter, the `JSON` file does not need to be specified, optional; + - charset: the encoded character set of the file, the default is `UTF-8`, optional; + - date_format: custom date 
format, the default value is yyyy-MM-dd HH:mm:ss, optional; if the date is presented in the form of a timestamp, this item must be written as `timestamp` (fixed writing); + - time_zone: Set which time zone the date data is in, the default value is `GMT+8`, optional; + - skipped_line: The line to be skipped, compound structure, currently only the regular expression of the line to be skipped can be configured, described by the child node `regex`, no line is skipped by default, optional; + - compression: The compression format of the file, the optional values ​​are NONE, GZIP, BZ2, XZ, LZMA, SNAPPY_RAW, SNAPPY_FRAMED, Z, DEFLATE, LZ4_BLOCK, LZ4_FRAMED, ORC and PARQUET, the default is NONE, which means a non-compressed file, optional; + - list_format: When a column of the file (non-JSON) is a collection structure (the Cardinality of the PropertyKey in the corresponding figure is Set or List), you can use this item to set the start character, separator, and end character of the column, compound structure : + - start_symbol: The start character of the collection structure column (the default value is `[`, JSON format currently does not support specification) + - elem_delimiter: the delimiter of the collection structure column (the default value is `|`, JSON format currently only supports native `,` delimiter) + - end_symbol: the end character of the collection structure column (the default value is `]`, the JSON format does not currently support specification) -###### 3.3.2.2 HDFS 输入源 +###### 3.3.2.2 HDFS input source -上述`本地文件输入源`的节点及含义这里基本都适用,下面仅列出 HDFS 输入源不一样的和特有的节点。 +The nodes and meanings of the above `local file input source` are basically applicable here. Only the different and unique nodes of the HDFS input source are listed below. 
-- type: 输入源类型,必须填 hdfs 或 HDFS,必填; -- path: HDFS 文件或目录的路径,必须是 HDFS 的绝对路径,必填; -- core_site_path: HDFS 集群的 core-site.xml 文件路径,重点要指明 namenode 的地址(fs.default.name),以及文件系统的实现(fs.hdfs.impl); +- type: input source type, must fill in hdfs or HDFS, required; +- path: the path of the HDFS file or directory, it must be the absolute path of HDFS, required; +- core_site_path: the path of the core-site.xml file of the HDFS cluster, the key point is to specify the address of the namenode (fs.default.name) and the implementation of the file system (fs.hdfs.impl); -###### 3.3.2.3 JDBC 输入源 +###### 3.3.2.3 JDBC input source -前面说到过支持多种关系型数据库,但由于它们的映射结构非常相似,故统称为 JDBC 输入源,然后用`vendor`节点区分不同的数据库。 +As mentioned above, it supports multiple relational databases, but because their mapping structures are very similar, they are collectively referred to as JDBC input sources, and then use the `vendor` node to distinguish different databases. -- type: 输入源类型,必须填 jdbc 或 JDBC,必填; -- vendor: 数据库类型,可选项为 [MySQL、PostgreSQL、Oracle、SQLServer],不区分大小写,必填; -- driver: jdbc 使用的 driver 类型,必填; -- url: jdbc 要连接的数据库的 url,必填; -- database: 要连接的数据库名,必填; -- schema: 要连接的 schema 名,不同的数据库要求不一样,下面详细说明; -- table: 要连接的表名,必填; -- username: 连接数据库的用户名,必填; -- password: 连接数据库的密码,必填; -- batch_size: 按页获取表数据时的一页的大小,默认为 500,选填; +- type: input source type, must fill in jdbc or JDBC, required; +- vendor: database type, optional options are [MySQL, PostgreSQL, Oracle, SQLServer], case-insensitive, required; +- driver: the type of driver used by jdbc, required; +- url: the url of the database that jdbc wants to connect to, required; +- database: the name of the database to be connected, required; +- schema: The name of the schema to be connected, different databases have different requirements, and the details are explained below; +- table: the name of the table to be connected, required; +- username: username to connect to the database, required; +- password: password for connecting to the database, required; +- batch_size: The size of 
one page when obtaining table data by page, the default is 500, optional; **MYSQL** -| 节点 | 固定值或常见值 | -| --- | --- | +| Node | Fixed value or common value | +| --- | --- | | vendor | MYSQL | | driver | com.mysql.cj.jdbc.Driver | | url | jdbc:mysql://127.0.0.1:3306 | -schema: 可空,若填写必须与database的值一样 +schema: nullable, if filled in, it must be the same as the value of database **POSTGRESQL** -| 节点 | 固定值或常见值 | -| --- | --- | +| Node | Fixed value or common value | +| --- | --- | | vendor | POSTGRESQL | | driver | org.postgresql.Driver | | url | jdbc:postgresql://127.0.0.1:5432 | -schema: 可空,默认值为“public” +schema: nullable, default is "public" **ORACLE** -| 节点 | 固定值或常见值 | -| --- | --- | +| Node | Fixed value or common value | +| --- | --- | | vendor | ORACLE | | driver | oracle.jdbc.driver.OracleDriver | | url | jdbc:oracle:thin:@127.0.0.1:1521 | -schema: 可空,默认值与用户名相同 +schema: nullable, the default value is the same as the username **SQLSERVER** -| 节点 | 固定值或常见值 | -| --- | --- | +| Node | Fixed value or common value | +| --- | --- | | vendor | SQLSERVER | | driver | com.microsoft.sqlserver.jdbc.SQLServerDriver | | url | jdbc:sqlserver://127.0.0.1:1433 | -schema: 必填 +schema: required -##### 3.3.1 顶点和边映射 +##### 3.3.1 Vertex and Edge Mapping -顶点和边映射的节点(JSON 文件中的一个 key)有很多相同的部分,下面先介绍相同部分,再分别介绍`顶点映射`和`边映射`的特有节点。 +The nodes of vertex and edge mapping (a key in the JSON file) have a lot of the same parts. The same parts are introduced first, and then the unique nodes of `vertex map` and `edge map` are introduced respectively. 
-**相同部分的节点** +**Nodes of the same section** -- label: 待导入的顶点/边数据所属的`label`,必填; -- field_mapping: 将输入源列的列名映射为顶点/边的属性名,选填; -- value_mapping: 将输入源的数据值映射为顶点/边的属性值,选填; -- selected: 选择某些列插入,其他未选中的不插入,不能与`ignored`同时存在,选填; -- ignored: 忽略某些列,使其不参与插入,不能与`selected`同时存在,选填; -- null_values: 可以指定一些字符串代表空值,比如"NULL",如果该列对应的顶点/边属性又是一个可空属性,那在构造顶点/边时不会设置该属性的值,选填; -- update_strategies: 如果数据需要按特定方式批量**更新**时可以对每个属性指定具体的更新策略 (具体见下),选填; -- unfold: 是否将列展开,展开的每一列都会与其他列一起组成一行,相当于是展开成了多行;比如文件的某一列(id 列)的值是`[1,2,3]`,其他列的值是`18,Beijing`,当设置了 unfold 之后,这一行就会变成 3 行,分别是:`1,18,Beijing`,`2,18,Beijing`和`3,18,Beijing`。需要注意的是此项只会展开被选作为 id 的列。默认 false,选填; +- label: `label` to which the vertex/edge data to be imported belongs, required; +- field_mapping: Map the column name of the input source column to the attribute name of the vertex/edge, optional; +- value_mapping: map the data value of the input source to the attribute value of the vertex/edge, optional; +- selected: select some columns to insert, other unselected ones are not inserted, cannot exist at the same time as `ignored`, optional; +- ignored: ignore some columns so that they do not participate in insertion, cannot exist at the same time as `selected`, optional; +- null_values: You can specify some strings to represent null values, such as "NULL". If the vertex/edge attribute corresponding to this column is also a nullable attribute, the value of this attribute will not be set when constructing the vertex/edge, optional ; +- update_strategies: If the data needs to be **updated** in batches in a specific way, you can specify a specific update strategy for each attribute (see below for details), optional; +- unfold: Whether to unfold the column, each unfolded column will form a row with other columns, which is equivalent to unfolding into multiple rows; for example, the value of a certain column (id column) of the file is `[1,2,3]`, The values ​​of other columns are `18,Beijing`. 
When unfold is set, this row will become 3 rows, namely: `1,18,Beijing`, `2,18,Beijing` and `3,18, Beijing`. Note that this will only expand the column selected as id. Default false, optional; -**更新策略**支持8种 : (需要全大写) +**Update strategy** supports 8 types: (requires all uppercase) -1. 数值累加 : `SUM` -2. 两个数字/日期取更大的: `BIGGER` -3. 两个数字/日期取更小: `SMALLER` -4. **Set**属性取并集: `UNION` -5. **Set**属性取交集: `INTERSECTION` -6. **List**属性追加元素: `APPEND` -7. **List/Set**属性删除元素: `ELIMINATE` -8. 覆盖已有属性: `OVERRIDE` +1. Value accumulation: `SUM` +2. Take the greater of the two numbers/dates: `BIGGER` +3. Take the smaller of two numbers/dates: `SMALLER` +4. **Set** property takes union: `UNION` +5. **Set** attribute intersection: `INTERSECTION` +6. **List** attribute append element: `APPEND` +7. **List/Set** attribute delete element: `ELIMINATE` +8. Override an existing property: `OVERRIDE` -**注意:** 如果新导入的属性值为空, 会采用已有的旧数据而不会采用空值, 效果可以参考如下示例 +**Note:** If the newly imported attribute value is empty, the existing old data will be used instead of the empty value. For the effect, please refer to the following example ```json -// JSON文件中以如下方式指定更新策略 +// The update strategy is specified in the JSON file as follows { "vertices": [ { @@ -681,120 +680,120 @@ schema: 必填 ] } -// 1.写入一行带OVERRIDE更新策略的数据 (这里null代表空) +// 1. Write a line of data with the OVERRIDE update strategy (null means empty here) 'a b null null' -// 2.再写一行 +// 2. Write another line 'null null c d' -// 3.最后可以得到 +// 3. 
Finally we can get
 'a b c d'

-// 如果没有更新策略, 则会得到
+// Without an update strategy, you would get
 'null null c d'
 ```

-> **注意** : 采用了批量更新的策略后, 磁盘读请求数会大幅上升, 导入速度相比纯写覆盖会慢数倍 (此时HDD磁盘[IOPS](https://en.wikipedia.org/wiki/IOPS)会成为瓶颈, 建议采用SSD以保证速度)
+> **Note**: With a batch update strategy, the number of disk read requests rises sharply, and the import is several times slower than a pure write-overwrite import (HDD [IOPS](https://en.wikipedia.org/wiki/IOPS) then becomes the bottleneck; SSDs are recommended to keep the speed up)

-**顶点映射的特有节点**
+**Nodes specific to vertex mappings**

-- id: 指定某一列作为顶点的 id 列,当顶点 id 策略为`CUSTOMIZE`时,必填;当 id 策略为`PRIMARY_KEY`时,必须为空;
+- id: Specify a column as the id column of the vertex. Required when the vertex id strategy is `CUSTOMIZE`; must be empty when the id strategy is `PRIMARY_KEY`;

-**边映射的特有节点**
+**Nodes specific to edge mappings**

-- source: 选择输入源某几列作为**源顶点**的 id 列,当源顶点的 id 策略为 `CUSTOMIZE`时,必须指定某一列作为顶点的 id 列;当源顶点的 id 策略为 `PRIMARY_KEY`时,必须指定一列或多列用于拼接生成顶点的 id,也就是说,不管是哪种 id 策略,此项必填;
-- target: 指定某几列作为**目标顶点**的 id 列,与 source 类似,不再赘述;
-- unfold_source: 是否展开文件的 source 列,效果与顶点映射中的类似,不再赘述;
-- unfold_target: 是否展开文件的 target 列,效果与顶点映射中的类似,不再赘述;
+- source: Select one or more columns of the input source as the id columns of the **source vertex**. When the id strategy of the source vertex is `CUSTOMIZE`, a column must be specified as the vertex id column; when it is `PRIMARY_KEY`, one or more columns must be specified, which are concatenated to generate the vertex id. In other words, this item is required regardless of the id strategy;
+- target: Select one or more columns as the id columns of the **target vertex**; it works the same way as source;
+- unfold_source: Whether to unfold the source column of the file; the effect is similar to unfold in the vertex mapping;
+- unfold_target: Whether to unfold the target column of the file; the effect is similar to unfold in the vertex mapping;

-#### 3.4 执行命令导入
+#### 3.4 Run the import command

-准备好图模型、数据文件以及输入源映射关系文件后,接下来就可以将数据文件导入到图数据库中。
+Once the graph model, data files and input source mapping file are ready, the data files can be imported into the graph database.

-导入过程由用户提交的命令控制,用户可以通过不同的参数控制执行的具体流程。
+The import process is controlled by the command the user submits; different parameters control the specific execution flow.
-##### 3.4.1 参数说明 +##### 3.4.1 Parameter description -参数 | 默认值 | 是否必传 | 描述信息 +Parameter | Default value | Required or not | Description ------------------- | ------------ | ------- | ----------------------- --f 或 --file | | Y | 配置脚本的路径 --g 或 --graph | | Y | 图数据库空间 --s 或 --schema | | Y | schema文件路径 --h 或 --host | localhost | | HugeGraphServer 的地址 --p 或 --port | 8080 | | HugeGraphServer 的端口号 ---username | null | | 当 HugeGraphServer 开启了权限认证时,当前图的 username ---token | null | | 当 HugeGraphServer 开启了权限认证时,当前图的 token ---protocol | http | | 向服务端发请求的协议,可选 http 或 https ---trust-store-file | | | 请求协议为 https 时,客户端的证书文件路径 ---trust-store-password | | | 请求协议为 https 时,客户端证书密码 ---clear-all-data | false | | 导入数据前是否清除服务端的原有数据 ---clear-timeout | 240 | | 导入数据前清除服务端的原有数据的超时时间 ---incremental-mode | false | | 是否使用断点续导模式,仅输入源为 FILE 和 HDFS 支持该模式,启用该模式能从上一次导入停止的地方开始导 ---failure-mode | false | | 失败模式为 true 时,会导入之前失败了的数据,一般来说失败数据文件需要在人工更正编辑好后,再次进行导入 ---batch-insert-threads | CPUs | | 批量插入线程池大小 (CPUs是当前OS可用**逻辑核**个数) ---single-insert-threads | 8 | | 单条插入线程池的大小 ---max-conn | 4 * CPUs | | HugeClient 与 HugeGraphServer 的最大 HTTP 连接数,**调整线程**的时候建议同时调整此项 ---max-conn-per-route| 2 * CPUs | | HugeClient 与 HugeGraphServer 每个路由的最大 HTTP 连接数,**调整线程**的时候建议同时调整此项 ---batch-size | 500 | | 导入数据时每个批次包含的数据条数 ---max-parse-errors | 1 | | 最多允许多少行数据解析错误,达到该值则程序退出 ---max-insert-errors | 500 | | 最多允许多少行数据插入错误,达到该值则程序退出 ---timeout | 60 | | 插入结果返回的超时时间(秒) ---shutdown-timeout | 10 | | 多线程停止的等待时间(秒) ---retry-times | 0 | | 发生特定异常时的重试次数 ---retry-interval | 10 | | 重试之前的间隔时间(秒) ---check-vertex | false | | 插入边时是否检查边所连接的顶点是否存在 ---print-progress | true | | 是否在控制台实时打印导入条数 ---dry-run | false | | 打开该模式,只解析不导入,通常用于测试 ---help | false | | 打印帮助信息 - -##### 3.4.2 断点续导模式 - -通常情况下,Loader 任务都需要较长时间执行,如果因为某些原因导致导入中断进程退出,而下次希望能从中断的点继续导,这就是使用断点续导的场景。 - -用户设置命令行参数 --incremental-mode 为 true 即打开了断点续导模式。断点续导的关键在于进度文件,导入进程退出的时候,会把退出时刻的导入进度 -记录到进度文件中,进度文件位于 `${struct}` 目录下,文件名形如 `load-progress ${date}` ,${struct} 为映射文件的前缀,${date} 为导入开始 -的时刻。比如:在 
`2019-10-10 12:30:30` 开始的一次导入任务,使用的映射文件为 `struct-example.json`,则进度文件的路径为与 struct-example.json -同级的 `struct-example/load-progress 2019-10-10 12:30:30`。 - -> 注意:进度文件的生成与 --incremental-mode 是否打开无关,每次导入结束都会生成一个进度文件。 - -如果数据文件格式都是合法的,是用户自己停止(CTRL + C 或 kill,kill -9 不支持)的导入任务,也就是说没有错误记录的情况下,下一次导入只需要设置 -为断点续导即可。 - -但如果是因为太多数据不合法或者网络异常,达到了 --max-parse-errors 或 --max-insert-errors 的限制,Loader 会把这些插入失败的原始行记录到 -失败文件中,用户对失败文件中的数据行修改后,设置 --reload-failure 为 true 即可把这些"失败文件"也当作输入源进行导入(不影响正常的文件的导入), -当然如果修改后的数据行仍然有问题,则会被再次记录到失败文件中(不用担心会有重复行)。 - -每个顶点映射或边映射有数据插入失败时都会产生自己的失败文件,失败文件又分为解析失败文件(后缀 .parse-error)和插入失败文件(后缀 .insert-error), -它们被保存在 `${struct}/current` 目录下。比如映射文件中有一个顶点映射 person 和边映射 knows,它们各有一些错误行,当 Loader 退出后,在 -`${struct}/current` 目录下会看到如下文件: - -- person-b4cd32ab.parse-error: 顶点映射 person 解析错误的数据 -- person-b4cd32ab.insert-error: 顶点映射 person 插入错误的数据 -- knows-eb6b2bac.parse-error: 边映射 knows 解析错误的数据 -- knows-eb6b2bac.insert-error: 边映射 knows 插入错误的数据 - -> .parse-error 和 .insert-error 并不总是一起存在的,只有存在解析出错的行才会有 .parse-error 文件,只有存在插入出错的行才会有 .insert-error 文件。 - -##### 3.4.3 logs 目录文件说明 - -程序执行过程中各日志及错误数据会写入 hugegraph-loader.log 文件中。 - -##### 3.4.4 执行命令 - -运行 bin/hugeloader 并传入参数 +-f or --file | | Y | Path to the configuration script +-g or --graph | | Y | Graph database space +-s or --schema | | Y | Schema file path +-h or --host | localhost | | Address of HugeGraphServer +-p or --port | 8080 | | Port number of HugeGraphServer +--username | null | | Username for the current graph when HugeGraphServer has authentication enabled +--token | null | | Token for the current graph when HugeGraphServer has authentication enabled +--protocol | http | | Protocol used to send requests to the server; either http or https +--trust-store-file | | | Path to the client certificate file when the request protocol is https +--trust-store-password | | | Password of the client certificate when the request protocol is https +--clear-all-data | false | | Whether to clear the original data on 
the server before importing data +--clear-timeout | 240 | | Timeout for clearing the original data on the server before importing data +--incremental-mode | false | | Whether to use breakpoint resume mode; only FILE and HDFS input sources support it. When enabled, the import continues from where the last import stopped +--failure-mode | false | | When true, the data that previously failed is imported. In general, the failure data files need to be corrected manually before they are imported again +--batch-insert-threads | CPUs | | Size of the batch insert thread pool (CPUs is the number of **logical cores** available to the current OS) +--single-insert-threads | 8 | | Size of the single insert thread pool +--max-conn | 4 * CPUs | | Maximum number of HTTP connections between HugeClient and HugeGraphServer; it is recommended to adjust this item when **adjusting the threads** +--max-conn-per-route| 2 * CPUs | | Maximum number of HTTP connections per route between HugeClient and HugeGraphServer; it is recommended to adjust this item when **adjusting the threads** +--batch-size | 500 | | Number of data items in each batch when importing data +--max-parse-errors | 1 | | Maximum number of rows with parsing errors allowed; the program exits when this value is reached +--max-insert-errors | 500 | | Maximum number of rows with insertion errors allowed; the program exits when this value is reached +--timeout | 60 | | Timeout (seconds) for the insert result to return +--shutdown-timeout | 10 | | Waiting time (seconds) for the worker threads to stop +--retry-times | 0 | | Number of retries when a specific exception occurs +--retry-interval | 10 | | Interval (seconds) before each retry +--check-vertex | false | | Whether to check that the vertices connected by an edge exist when inserting the edge +--print-progress | true | | Whether to print the number of imported items in the console 
in real time +--dry-run | false | | When enabled, data is only parsed and not imported; usually used for testing +--help | false | | Print help information + +##### 3.4.2 Breakpoint Continuation Mode + +Usually, a Loader task takes a long time to execute. If the import process is interrupted and exits for some reason, and you want the next run to continue from the point of interruption, that is the scenario for breakpoint continuation. + +The user sets the command line parameter --incremental-mode to true to turn on breakpoint resume mode. The key to breakpoint continuation is the progress file: when the import process exits, the import progress at the time of exit is +recorded in the progress file. The progress file is located in the `${struct}` directory and is named like `load-progress ${date}`, where ${struct} is the prefix of the mapping file and ${date} is the time at which the import +started. For example, for an import task started at `2019-10-10 12:30:30` with the mapping file `struct-example.json`, the path of the progress file is `struct-example/load-progress 2019-10-10 12:30:30`, at the same level as +struct-example.json. + +> Note: A progress file is generated at the end of every import, regardless of whether --incremental-mode is turned on. + +If all data files are well-formed and the import task was stopped by the user (CTRL + C or kill; kill -9 is not supported), that is, there are no error records, the next import only needs to +enable breakpoint resume mode. 
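As a rough illustration of the progress-file naming described above, the following shell sketch derives the path from the mapping file name and the import start time. The variable names are hypothetical and are not part of HugeGraph-Loader, which computes this path itself; the values are the ones from the example above.

```bash
#!/bin/sh
# Hypothetical variables for illustration only; Loader computes this path itself.
struct_file="struct-example.json"   # mapping file passed with -f
start_time="2019-10-10 12:30:30"    # time at which the import started

# ${struct} is the mapping file name without its .json suffix
struct_dir="${struct_file%.json}"

# The progress file sits in ${struct} and is named "load-progress ${date}"
progress_file="${struct_dir}/load-progress ${start_time}"

echo "${progress_file}"
```

Because the file name contains spaces, the path needs to be quoted whenever it is passed to other shell commands.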
+But if the import stopped because the --max-parse-errors or --max-insert-errors limit was reached, due to too much invalid data or a network error, Loader records the original rows that failed in +failure files. After correcting the data rows in the failure files, set --failure-mode to true to import these "failure files" as input sources as well (this does not affect the import of normal files). +Of course, if a corrected data row still has problems, it is recorded in the failure file again (there is no need to worry about duplicate rows). + +Each vertex mapping or edge mapping generates its own failure files when data fails to load. Failure files are divided into parse failure files (suffix .parse-error) and insert failure files (suffix .insert-error), +which are stored in the `${struct}/current` directory. For example, suppose the mapping file contains a vertex mapping person and an edge mapping knows, each with some erroneous rows. After Loader exits, +the following files appear in the `${struct}/current` directory: + +- person-b4cd32ab.parse-error: rows of vertex mapping person that failed to parse +- person-b4cd32ab.insert-error: rows of vertex mapping person that failed to insert +- knows-eb6b2bac.parse-error: rows of edge mapping knows that failed to parse +- knows-eb6b2bac.insert-error: rows of edge mapping knows that failed to insert + +> .parse-error and .insert-error files do not always exist together. A .parse-error file exists only if some rows failed to parse, and an .insert-error file exists only if some rows failed to insert. + +##### 3.4.3 logs directory file description + +The logs and error data produced during program execution are written to the hugegraph-loader.log file. 
+##### 3.4.4 Execute command +Run bin/hugegraph-loader and pass in the parameters ```bash bin/hugegraph-loader -g {GRAPH_NAME} -f ${INPUT_DESC_FILE} -s ${SCHEMA_FILE} -h {HOST} -p {PORT} ``` -### 4 完整示例 +### 4 Complete example -下面给出的是 hugegraph-loader 包中 example 目录下的例子。 +The following example is taken from the example directory of the hugegraph-loader package. -#### 4.1 准备数据 +#### 4.1 Prepare data -顶点文件:`example/file/vertex_person.csv` +Vertex file: `example/file/vertex_person.csv` ```csv marko,29,Beijing @@ -804,7 +803,7 @@ peter,35,Shanghai "li,nary",26,"Wu,han" ``` -顶点文件:`example/file/vertex_software.txt` +Vertex file: `example/file/vertex_software.txt` ```text name|lang|price @@ -812,14 +811,14 @@ lop|java|328 ripple|java|199 ``` -边文件:`example/file/edge_knows.json` +Edge file: `example/file/edge_knows.json` ``` {"source_name": "marko", "target_name": "vadas", "date": "20160110", "weight": 0.5} {"source_name": "marko", "target_name": "josh", "date": "20130220", "weight": 1.0} ``` -边文件:`example/file/edge_created.json` +Edge file: `example/file/edge_created.json` ``` {"aname": "marko", "bname": "lop", "date": "20171210", "weight": 0.4} @@ -828,9 +827,9 @@ ripple|java|199 {"aname": "peter", "bname": "lop", "date": "20170324", "weight": 0.2} ``` -#### 4.2 编写schema +#### 4.2 Write schema -schema文件:`example/file/schema.groovy` +schema file: `example/file/schema.groovy` ```groovy schema.propertyKey("name").asText().ifNotExist().create(); @@ -858,7 +857,7 @@ schema.indexLabel("createdByWeight").onE("created").by("weight").range().ifNotEx schema.indexLabel("knowsByWeight").onE("knows").by("weight").range().ifNotExist().create(); ``` -#### 4.3 编写输入源映射文件`example/file/struct.json` +#### 4.3 Write the input source mapping file `example/file/struct.json` ```json { @@ -922,13 +921,13 @@ schema.indexLabel("knowsByWeight").onE("knows").by("weight").range().ifNotExist( } ``` -#### 4.4 执行命令导入 +#### 4.4 Execute command import ```bash sh bin/hugegraph-loader.sh -g hugegraph -f 
example/file/struct.json -s example/file/schema.groovy ``` -导入结束后,会出现类似如下统计信息: +After the import is complete, statistics similar to the following will appear: ``` vertices/edges has been loaded this time : 8/6 From 2bc8de4fc7dabb80abe4770f7f6d7004fd495501 Mon Sep 17 00:00:00 2001 From: Simon Cheung Date: Tue, 26 Apr 2022 12:21:26 +0800 Subject: [PATCH 2/3] fix --- content/en/docs/quickstart/hugegraph-loader.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/content/en/docs/quickstart/hugegraph-loader.md b/content/en/docs/quickstart/hugegraph-loader.md index f58a02b8e..015d49c92 100644 --- a/content/en/docs/quickstart/hugegraph-loader.md +++ b/content/en/docs/quickstart/hugegraph-loader.md @@ -4,7 +4,7 @@ linkTitle: "Load data with HugeGraph-Loader" weight: 2 --- -### 1 HugeGraph-Loader overview +### 1 HugeGraph-Loader Overview HugeGraph-Loader is the data import component of HugeGragh, which can convert data from various data sources into graph vertices and edges and import them into the graph database in batches. @@ -43,7 +43,7 @@ Clone the latest version of HugeGraph-Loader source package: $ git clone https://github.com/hugegraph/hugegraph-loader.git ``` -Due to the limitation of the Oracle ojdbc license, you need to manually install ojdbc to the local maven repository. +Due to the license limitation of the `Oracle OJDBC`, you need to manually install ojdbc to the local maven repository. Visit the [Oracle jdbc downloads](https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html) page. Select Oracle Database 12c Release 2 (12.2.0.1) drivers, as shown in the following figure.
@@ -70,16 +70,14 @@ cd hugegraph-loader mvn clean package -DskipTests ``` -### 3 Use the process - +### 3 How to use The basic process of using HugeGraph-Loader is divided into the following steps: - - Write graph models - Prepare data files - Write input source map files - Execute command import -#### 3.1 Writing a graph model +#### 3.1 Construct graph schema This step is the modeling process. Users need to have a clear idea of ​​their existing data and the graph model they want to create, and then write the schema to build the graph model. @@ -95,7 +93,7 @@ For example, if you want to create a graph with two types of vertices and two ty After designing the graph model, we can use `groovy` to write the definition of `schema` and save it to a file, here named `schema.groovy`. ```groovy -// create some properties +// Create some properties schema.propertyKey("name").asText().ifNotExist().create(); schema.propertyKey("age").asInt().ifNotExist().create(); schema.propertyKey("city").asText().ifNotExist().create(); @@ -660,7 +658,7 @@ The nodes of vertex and edge mapping (a key in the JSON file) have a lot of the **Note:** If the newly imported attribute value is empty, the existing old data will be used instead of the empty value. 
For the effect, please refer to the following example -```json +```javascript // The update strategy is specified in the JSON file as follows { "vertices": [ @@ -921,7 +919,7 @@ schema.indexLabel("knowsByWeight").onE("knows").by("weight").range().ifNotExist( } ``` -#### 4.4 Execute command import +#### 4.4 Command to import ```bash sh bin/hugegraph-loader.sh -g hugegraph -f example/file/struct.json -s example/file/schema.groovy From a512c84e15c6401c989433a3c80bef704f43d647 Mon Sep 17 00:00:00 2001 From: Simon Cheung Date: Tue, 26 Apr 2022 14:37:19 +0800 Subject: [PATCH 3/3] fix --- content/en/docs/quickstart/hugegraph-loader.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/en/docs/quickstart/hugegraph-loader.md b/content/en/docs/quickstart/hugegraph-loader.md index 015d49c92..a0b32ec37 100644 --- a/content/en/docs/quickstart/hugegraph-loader.md +++ b/content/en/docs/quickstart/hugegraph-loader.md @@ -15,7 +15,7 @@ Currently supported data sources include: Local disk files and HDFS files support resumable uploads. -It will be explained in detail later. +It will be explained in detail below. > Note: HugeGraph-Loader requires HugeGraph Server service, please refer to [HugeGraph-Server Quick Start](/docs/quickstart/hugegraph-server) to download and start Server @@ -24,7 +24,7 @@ It will be explained in detail later. There are two ways to get HugeGraph-Loader: - Download the compiled tarball -- Clone source code to compile and install +- Clone source code then compile and install #### 2.1 Download the compiled archive @@ -72,7 +72,7 @@ mvn clean package -DskipTests ### 3 How to use The basic process of using HugeGraph-Loader is divided into the following steps: -- Write graph models +- Write graph schema - Prepare data files - Write input source map files - Execute command import