diff --git a/README b/README new file mode 100755 index 0000000000..e69de29bb2 diff --git a/README.md b/README.md new file mode 100644 index 0000000000..0b4cbbe7ff --- /dev/null +++ b/README.md @@ -0,0 +1,112 @@ +![Datax-logo](https://github.com/alibaba/DataX/blob/master/images/DataX-logo.jpg) + + + +# DataX + +DataX 是阿里巴巴集团内被广泛使用的离线数据同步工具/平台,实现包括 MySQL、Oracle、SqlServer、Postgre、HDFS、Hive、ADS、HBase、TableStore(OTS)、MaxCompute(ODPS)、DRDS 等各种异构数据源之间高效的数据同步功能。 + + + +# Features + +DataX本身作为数据同步框架,将不同数据源的同步抽象为从源头数据源读取数据的Reader插件,以及向目标端写入数据的Writer插件,理论上DataX框架可以支持任意数据源类型的数据同步工作。同时DataX插件体系作为一套生态系统, 每接入一套新数据源该新加入的数据源即可实现和现有的数据源互通。 + + + +# DataX详细介绍 + +##### 请参考:[DataX-Introduction](https://github.com/alibaba/DataX/wiki/DataX-Introduction) + + + +# Quick Start + +##### Download [DataX下载地址](http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz) + +##### 请点击:[Quick Start](https://github.com/alibaba/DataX/wiki/Quick-Start) +* [配置示例:从MySQL读取数据 写入ODPS](https://github.com/alibaba/DataX/wiki/Quick-Start) +* [配置定时任务](https://github.com/alibaba/DataX/wiki/%E9%85%8D%E7%BD%AE%E5%AE%9A%E6%97%B6%E4%BB%BB%E5%8A%A1%EF%BC%88Linux%E7%8E%AF%E5%A2%83%EF%BC%89) +* [动态传入参数](https://github.com/alibaba/DataX/wiki/%E5%8A%A8%E6%80%81%E4%BC%A0%E5%85%A5%E5%8F%82%E6%95%B0) + + + +# Support Data Channels + +DataX目前已经有了比较全面的插件体系,主流的RDBMS数据库、NOSQL、大数据计算系统都已经接入,目前支持数据如下图,详情请点击:[DataX数据源参考指南](https://github.com/alibaba/DataX/wiki/DataX-all-data-channels) + +| 类型 | 数据源 | Reader(读) | Writer(写) | +| ------------ | ---------- | :-------: | :-------: | +| RDBMS 关系型数据库 | Mysql | √ | √ | +| | Oracle | √ | √ | +| | SqlServer | √ | √ | +| | Postgresql | √ | √ | +| | DRDS | √ | √ | +| | 达梦 | √ | √ | +| 阿里云数仓数据存储 | ODPS | √ | √ | +| | ADS | | √ | +| | OSS | √ | √ | +| | OCS | √ | √ | +| NoSQL数据存储 | OTS | √ | √ | +| | Hbase0.94 | √ | √ | +| | Hbase1.1 | √ | √ | +| | MongoDB | √ | √ | +| 无结构化数据存储 | TxtFile | √ | √ | +| | FTP | √ | √ | +| | HDFS | √ | √ | + + +# 我要开发新的插件 +请点击:[DataX插件开发宝典](https://github.com/alibaba/DataX/wiki/DataX%E6%8F%92%E4%BB%B6%E5%BC%80%E5%8F%91%E5%AE%9D%E5%85%B8) + +# 项目成员 + +核心Contributions: 光戈、一斅、祁然、云时 + +感谢天烬、巴真、静行对DataX做出的贡献。 + +# License + +This software is free to use under the Apache License [Apache license](https://github.com/alibaba/DataX/blob/master/license.txt). + +# +请及时提出issue给我们。请前往:[DataxIssue](https://github.com/alibaba/DataX/issues) + +# 开源版DataX企业用户 + +![Datax-logo](https://github.com/alibaba/DataX/blob/master/images/datax-enterprise-users.jpg) + +``` +长期招聘 联系邮箱:hanfa.shf@alibaba-inc.com +【JAVA开发职位】 +职位名称:JAVA资深开发工程师/专家/高级专家 +工作年限 : 2年以上 +学历要求 : 本科(如果能力靠谱,这些都不是条件) +期望层级 : P6/P7/P8 + +岗位描述: + 1. 负责阿里云大数据平台(数加)的开发设计。 + 2. 负责面向政企客户的大数据相关产品开发; + 3. 利用大规模机器学习算法挖掘数据之间的联系,探索数据挖掘技术在实际场景中的产品应用 ; + 4. 一站式大数据开发平台 + 5. 大数据任务调度引擎 + 6. 任务执行引擎 + 7. 任务监控告警 + 8. 海量异构数据同步 + +岗位要求: + 1. 拥有3年以上JAVA Web开发经验; + 2. 熟悉Java的基础技术体系。包括JVM、类装载、线程、并发、IO资源管理、网络; + 3. 熟练使用常用Java技术框架、对新技术框架有敏锐感知能力;深刻理解面向对象、设计原则、封装抽象; + 4. 熟悉HTML/HTML5和JavaScript;熟悉SQL语言; + 5. 执行力强,具有优秀的团队合作精神、敬业精神; + 6. 深刻理解设计模式及应用场景者加分; + 7. 具有较强的问题分析和处理能力、比较强的动手能力,对技术有强烈追求者优先考虑; + 8. 对高并发、高稳定可用性、高性能、大数据处理有过实际项目及产品经验者优先考虑; + 9. 有大数据产品、云产品、中间件技术解决方案者优先考虑。 +```` +钉钉用户请扫描以下二维码进行讨论: + +![DataX-OpenSource-Dingding](https://raw.githubusercontent.com/alibaba/DataX/master/images/datax-opensource-dingding.png) + + diff --git a/adswriter/doc/adswriter.md b/adswriter/doc/adswriter.md new file mode 100644 index 0000000000..f80229bbc7 --- /dev/null +++ b/adswriter/doc/adswriter.md @@ -0,0 +1,314 @@ +# DataX ADS写入 + + +--- + + +## 1 快速介绍 + +
+ +欢迎ADS加入DataX生态圈!ADSWriter插件实现了其他数据源向ADS写入功能,现有DataX所有的数据源均可以无缝接入ADS,实现数据快速导入ADS。 + +ADS写入预计支持两种实现方式: + +* ADSWriter 支持向ODPS中转落地导入ADS方式,优点在于当数据量较大时(>1KW),可以以较快速度进行导入,缺点引入了ODPS作为落地中转,因此牵涉三方系统(DataX、ADS、ODPS)鉴权认证。 + +* ADSWriter 同时支持向ADS直接写入的方式,优点在于小批量数据写入能够较快完成(<1KW),缺点在于大数据导入较慢。 + + +注意: + +> 如果从ODPS导入数据到ADS,请用户提前在源ODPS的Project中授权ADS Build账号具有读取你源表ODPS的权限,同时,ODPS源表创建人和ADS写入属于同一个阿里云账号。 + +- + +> 如果从非ODPS导入数据到ADS,请用户提前在目的端ADS空间授权ADS Build账号具备Load data权限。 + +以上涉及ADS Build账号请联系ADS管理员提供。 + + +## 2 实现原理 + +ADS写入预计支持两种实现方式: + +### 2.1 Load模式 + +DataX 将数据导入ADS为当前导入任务分配的ADS项目表,随后DataX通知ADS完成数据加载。该类数据导入方式实际上是写ADS完成数据同步,由于ADS是分布式存储集群,因此该通道吞吐量较大,可以支持TB级别数据导入。 + +![中转导入](http://aligitlab.oss-cn-hangzhou-zmf.aliyuncs.com/uploads/cdp/cdp/f805dea46b/_____2015-04-10___12.06.21.png) + +1. CDP底层得到明文的 jdbc://host:port/dbname + username + password + table, 以此连接ADS, 执行show grants; 前置检查该用户是否有ADS中目标表的Load Data或者更高的权限。注意,此时ADSWriter使用用户填写的ADS用户名+密码信息完成登录鉴权工作。 + +2. 检查通过后,通过ADS中目标表的元数据反向生成ODPS DDL,在ODPS中间project中,以ADSWriter的账户建立ODPS表(非分区表,生命周期设为1-2Day), 并调用ODPSWriter把数据源的数据写入该ODPS表中。 + + 注意,这里需要使用中转ODPS的账号AK向中转ODPS写入数据。 + +3. 写入完成后,以中转ODPS账号连接ADS,发起Load Data From ‘odps://中转project/中转table/' [overwrite] into adsdb.adstable [partition (xx,xx=xx)]; 这个命令返回一个Job ID需要记录。 + + 注意,此时ADS使用自己的Build账号访问中转ODPS,因此需要中转ODPS对这个Build账号提前开放读取权限。 + +4. 连接ADS一分钟一次轮询执行 select state from information_schema.job_instances where job_id like ‘$Job ID’,查询状态,注意这个第一个一分钟可能查不到状态记录。 + +5. Success或者Fail后返回给用户,然后删除中转ODPS表,任务结束。 + +上述流程是从其他非ODPS数据源导入ADS流程,对于ODPS导入ADS流程使用如下流程: + +![直接导入](http://aligitlab.oss-cn-hangzhou-zmf.aliyuncs.com/uploads/cdp/cdp/b3a76459d1/_____2015-04-10___12.06.25.png) + +### 2.2 Insert模式 + +DataX 将数据直连ADS接口,利用ADS暴露的INSERT接口直写到ADS。该类数据导入方式写入吞吐量较小,不适合大批量数据写入。有如下注意点: + +* ADSWriter使用JDBC连接直连ADS,并只使用了JDBC Statement进行数据插入。ADS不支持PreparedStatement,故ADSWriter只能单行多线程进行写入。 + +* ADSWriter支持筛选部分列,列换序等功能,即用户可以填写列。 + +* 考虑到ADS负载问题,建议ADSWriter Insert模式建议用户使用TPS限流,最高在1W TPS。 + +* ADSWriter在所有Task完成写入任务后,Job Post单例执行flush工作,保证数据在ADS整体更新。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到ADS,使用Load模式进行导入的数据。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "adswriter", + "parameter": { + "odps": { + "accessId": "xxx", + "accessKey": "xxx", + "account": "xxx@aliyun.com", + "odpsServer": "xxx", + "tunnelServer": "xxx", + "accountType": "aliyun", + "project": "transfer_project" + }, + "writeMode": "load", + "url": "127.0.0.1:3306", + "schema": "schema", + "table": "table", + "username": "username", + "password": "password", + "partition": "", + "lifeCycle": 2, + "overWrite": true, + } + } + } + ] + } +} +``` + +* 这里使用一份从内存产生到ADS,使用Insert模式进行导入的数据。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "adswriter", + "parameter": { + "writeMode": "insert", + "url": "127.0.0.1:3306", + "schema": "schema", + "table": "table", + "column": ["*"], + "username": "username", + "password": "password", + "partition": "id,ds=2015" + } + } + } + ] + } +} +``` + + + +### 3.2 参数说明 (用户配置规格) + +* **url** + + * 
描述:ADS连接信息,格式为"ip:port"。 + + * 必选:是
+ + * 默认值:无
+ +* **schema** + + * 描述:ADS的schema名称。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:ADS对应的username,目前就是accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:ADS对应的password,目前就是accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。 + + * 必选:是
+ + * 默认值:无
+
+
+* **partition**
+
+	* 描述:目标表的分区名称,当目标表为分区表时,需要指定该字段。
+
+	* 必选:否
+ + * 默认值:无
+
+* **writeMode**
+
+	* 描述:写入模式,支持Load和Insert两种写入模式(配置取值分别为"load"、"insert"),对应2.1和2.2节的两种实现原理。
+
+	* 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表字段列表,可以为["*"],或者具体的字段列表,例如["a", "b", "c"] + + * 必选:是
+ + * 默认值:无
+
+* **overWrite**
+
+	* 描述:ADS写入是否覆盖当前写入的表:true为覆盖写入,false为不覆盖(追加)写入。当writeMode为Load时,该值才会生效。
+
+	* 必选:是
+ + * 默认值:无
+ + +* **lifeCycle** + + * 描述:ADS 临时表生命周期。当writeMode为Load时,该值才会生效。 + + * 必选:是
+ + * 默认值:无
+
+* **batchSize**
+
+	* 描述:ADS 提交数据写的批量条数,当writeMode为insert时,该值才会生效。
+
+	* 必选:writeMode为insert时才有用
+ + * 默认值:32
+
+* **bufferSize**
+
+	* 描述:DataX数据收集缓冲区大小。缓冲区用于先攒一批较大的数据:源头数据首先进入此buffer,按ADS分区列的分区规则进行排序(排序后的数据顺序对ADS服务端更友好,属于性能优化),排序完成后再按batchSize分批提交给ADS。如果要配置bufferSize,一般建议将其设置为batchSize的整数倍,配置示意见本节末尾。当writeMode为insert时,该值才会生效。
+
+	* 必选:writeMode为insert时才有用
+
+	* 默认值:无(不配置则不开启此缓冲排序功能)
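+
+下面给出一个insert模式下配置batchSize与bufferSize的writer配置片段,仅作示意:其中url、schema、table、账号、partition等均为示例取值,batchSize取文档默认值32,bufferSize示例性地取为batchSize的整数倍,实际请按环境与表结构调整。
+
+```
+{
+    "name": "adswriter",
+    "parameter": {
+        "writeMode": "insert",
+        "url": "127.0.0.1:3306",
+        "schema": "schema",
+        "table": "table",
+        "column": ["*"],
+        "username": "username",
+        "password": "password",
+        "partition": "id,ds=2015",
+        "batchSize": 32,
+        "bufferSize": 256
+    }
+}
+```
+
+按上述配置,DataX会先在bufferSize大小的缓冲区内按分区列排序,再以batchSize为单位分批提交到ADS;不配置bufferSize时则不启用该缓冲排序。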
+ + +### 3.3 类型转换 + +| DataX 内部类型| ADS 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, int, bigint| +| Double |float, double, decimal| +| String |varchar | +| Date |date | +| Boolean |bool | +| Bytes |无 | + + 注意: + +* multivalue ADS支持multivalue类型,DataX对于该类型支持待定? + + +## 4 插件约束 + +如果Reader为ODPS,且ADSWriter写入模式为Load模式时,ODPS的partition只支持如下三种配置方式(以两级分区为例): +``` +"partition":["pt=*,ds=*"] (读取test表所有分区的数据) +"partition":["pt=1,ds=*"] (读取test表下面,一级分区pt=1下面的所有二级分区) +"partition":["pt=1,ds=hangzhou"] (读取test表下面,一级分区pt=1下面,二级分区ds=hz的数据) +``` + +## 5 性能报告(线上环境实测) + +### 5.1 环境准备 + +### 5.2 测试报告 + +## 6 FAQ diff --git a/adswriter/pom.xml b/adswriter/pom.xml new file mode 100644 index 0000000000..de407dfeee --- /dev/null +++ b/adswriter/pom.xml @@ -0,0 +1,107 @@ + + + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + 4.0.0 + + adswriter + adswriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + mysql + mysql-connector-java + + + + + com.alibaba.datax + datax-core + ${datax-project-version} + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + org.slf4j + slf4j-api + + + org.apache.commons + commons-exec + 1.3 + + + com.alibaba.datax + odpswriter + ${datax-project-version} + + + ch.qos.logback + logback-classic + + + mysql + mysql-connector-java + 5.1.31 + + + commons-configuration + commons-configuration + 1.10 + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/adswriter/src/main/assembly/package.xml b/adswriter/src/main/assembly/package.xml new file mode 100644 index 0000000000..c1fb64bb84 --- /dev/null +++ b/adswriter/src/main/assembly/package.xml @@ -0,0 +1,36 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + config.properties + plugin_job_template.json + + plugin/writer/adswriter + + + target/ + + adswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/adswriter + + + + + + false + plugin/writer/adswriter/libs + runtime + + + diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsException.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsException.java new file mode 100644 index 0000000000..f0d6f92894 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsException.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.plugin.writer.adswriter; + +public class AdsException extends Exception { + + private static final long serialVersionUID = 1080618043484079794L; + + public final static int ADS_CONN_URL_NOT_SET = -100; + public final static int ADS_CONN_USERNAME_NOT_SET = -101; + public final static int ADS_CONN_PASSWORD_NOT_SET = -102; + public final static int ADS_CONN_SCHEMA_NOT_SET = -103; + + public final static int JOB_NOT_EXIST = -200; + public final static int JOB_FAILED = -201; + + public final static int ADS_LOADDATA_SCHEMA_NULL = -300; + public final static int ADS_LOADDATA_TABLE_NULL = -301; + public final static int ADS_LOADDATA_SOURCEPATH_NULL = -302; + public final static int ADS_LOADDATA_JOBID_NOT_AVAIL = -303; + public final static int ADS_LOADDATA_FAILED = -304; + + public final static int ADS_TABLEMETA_SCHEMA_NULL = -404; + public final static int ADS_TABLEMETA_TABLE_NULL = -405; + + public final static int OTHER = -999; + + private int code = OTHER; + private String message; + + public AdsException(int code, String 
message, Throwable e) { + super(message, e); + this.code = code; + this.message = message; + } + + @Override + public String getMessage() { + return "Code=" + this.code + " Message=" + this.message; + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriter.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriter.java new file mode 100644 index 0000000000..7e04c844a5 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriter.java @@ -0,0 +1,388 @@ +package com.alibaba.datax.plugin.writer.adswriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.util.WriterUtil; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnInfo; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import com.alibaba.datax.plugin.writer.adswriter.insert.AdsInsertProxy; +import com.alibaba.datax.plugin.writer.adswriter.insert.AdsInsertUtil; +import com.alibaba.datax.plugin.writer.adswriter.load.AdsHelper; +import com.alibaba.datax.plugin.writer.adswriter.load.TableMetaHelper; +import com.alibaba.datax.plugin.writer.adswriter.load.TransferProjectConf; +import com.alibaba.datax.plugin.writer.adswriter.odps.TableMeta; +import com.alibaba.datax.plugin.writer.adswriter.util.AdsUtil; +import com.alibaba.datax.plugin.writer.adswriter.util.Constant; +import com.alibaba.datax.plugin.writer.adswriter.util.Key; +import com.alibaba.datax.plugin.writer.odpswriter.OdpsWriter; +import com.aliyun.odps.Instance; +import com.aliyun.odps.Odps; +import com.aliyun.odps.OdpsException; +import com.aliyun.odps.account.Account; +import com.aliyun.odps.account.AliyunAccount; +import com.aliyun.odps.task.SQLTask; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; + +public class AdsWriter extends Writer { + + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Writer.Job.class); + public final static String ODPS_READER = "odpsreader"; + + private OdpsWriter.Job odpsWriterJobProxy = new OdpsWriter.Job(); + private Configuration originalConfig; + private Configuration readerConfig; + + /** + * 持有ads账号的ads helper + */ + private AdsHelper adsHelper; + /** + * 持有odps账号的ads helper + */ + private AdsHelper odpsAdsHelper; + /** + * 中转odps的配置,对应到writer配置的parameter.odps部分 + */ + private TransferProjectConf transProjConf; + private final int ODPSOVERTIME = 120000; + private String odpsTransTableName; + + private String writeMode; + private long startTime; + + @Override + public void init() { + startTime = System.currentTimeMillis(); + this.originalConfig = super.getPluginJobConf(); + this.writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if(null == this.writeMode) { + LOG.warn("您未指定[writeMode]参数, 默认采用load模式, load模式只能用于离线表"); + this.writeMode = Constant.LOADMODE; + this.originalConfig.set(Key.WRITE_MODE, "load"); + } + + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + AdsUtil.checkNecessaryConfig(this.originalConfig, 
this.writeMode); + loadModeInit(); + } else if(Constant.INSERTMODE.equalsIgnoreCase(this.writeMode) || Constant.STREAMMODE.equalsIgnoreCase(this.writeMode)) { + AdsUtil.checkNecessaryConfig(this.originalConfig, this.writeMode); + List allColumns = AdsInsertUtil.getAdsTableColumnNames(originalConfig); + AdsInsertUtil.dealColumnConf(originalConfig, allColumns); + + LOG.debug("After job init(), originalConfig now is:[\n{}\n]", + originalConfig.toJSON()); + } else { + throw DataXException.asDataXException(AdsWriterErrorCode.INVALID_CONFIG_VALUE, "writeMode 必须为 'load' 或者 'insert' 或者 'stream'"); + } + } + + private void loadModeInit() { + this.adsHelper = AdsUtil.createAdsHelper(this.originalConfig); + this.odpsAdsHelper = AdsUtil.createAdsHelperWithOdpsAccount(this.originalConfig); + this.transProjConf = TransferProjectConf.create(this.originalConfig); + // 打印权限申请流程到日志中 + LOG.info(String + .format("%s%n%s%n%s", + "如果您直接是odps->ads数据同步, 需要做2方面授权:", + "[1] ads官方账号至少需要有待同步表的describe和select权限, 因为ads系统需要获取odps待同步表的结构和数据信息", + "[2] 您配置的ads数据源访问账号ak, 需要有向指定的ads数据库发起load data的权限, 您可以在ads系统中添加授权")); + LOG.info(String + .format("%s%s%n%s%n%s", + "如果您直接是rds(或其它非odps数据源)->ads数据同步, 流程是先将数据装载如odps临时表,再从odps临时表->ads, ", + String.format("中转odps项目为%s,中转项目账号为%s, 权限方面:", + this.transProjConf.getProject(), + this.transProjConf.getAccount()), + "[1] ads官方账号至少需要有待同步表(这里是odps临时表)的describe和select权限, 因为ads系统需要获取odps待同步表的结构和数据信息,此部分部署时已经完成授权", + String.format("[2] 中转odps对应的账号%s, 需要有向指定的ads数据库发起load data的权限, 您可以在ads系统中添加授权", this.transProjConf.getAccount()))); + + /** + * 如果是从odps导入到ads,直接load data然后System.exit() + */ + if (super.getPeerPluginName().equals(ODPS_READER)) { + transferFromOdpsAndExit(); + } + Account odpsAccount; + odpsAccount = new AliyunAccount(transProjConf.getAccessId(), transProjConf.getAccessKey()); + + Odps odps = new Odps(odpsAccount); + odps.setEndpoint(transProjConf.getOdpsServer()); + odps.setDefaultProject(transProjConf.getProject()); + + TableMeta tableMeta; + try { + String adsTable = this.originalConfig.getString(Key.ADS_TABLE); + TableInfo tableInfo = adsHelper.getTableInfo(adsTable); + int lifeCycle = this.originalConfig.getInt(Key.Life_CYCLE); + tableMeta = TableMetaHelper.createTempODPSTable(tableInfo, lifeCycle); + this.odpsTransTableName = tableMeta.getTableName(); + String sql = tableMeta.toDDL(); + LOG.info("正在创建ODPS临时表: "+sql); + Instance instance = SQLTask.run(odps, transProjConf.getProject(), sql, null, null); + boolean terminated = false; + int time = 0; + while (!terminated && time < ODPSOVERTIME) { + Thread.sleep(1000); + terminated = instance.isTerminated(); + time += 1000; + } + LOG.info("正在创建ODPS临时表成功"); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED, e); + }catch (OdpsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED,e); + } catch (InterruptedException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED,e); + } + + Configuration newConf = AdsUtil.generateConf(this.originalConfig, this.odpsTransTableName, + tableMeta, this.transProjConf); + odpsWriterJobProxy.setPluginJobConf(newConf); + odpsWriterJobProxy.init(); + } + + /** + * 当reader是odps的时候,直接call ads的load接口,完成后退出。 + * 这种情况下,用户在odps reader里头填写的参数只有部分有效。 + * 其中accessId、accessKey是忽略掉iao的。 + */ + private void transferFromOdpsAndExit() { + this.readerConfig = super.getPeerPluginJobConf(); + String odpsTableName = this.readerConfig.getString(Key.ODPSTABLENAME); + List 
userConfiguredPartitions = this.readerConfig.getList(Key.PARTITION, String.class); + + if (userConfiguredPartitions == null) { + userConfiguredPartitions = Collections.emptyList(); + } + + if(userConfiguredPartitions.size() > 1) + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_PARTITION_FAILED, ""); + + if(userConfiguredPartitions.size() == 0) { + loadAdsData(adsHelper, odpsTableName,null); + }else { + loadAdsData(adsHelper, odpsTableName,userConfiguredPartitions.get(0)); + } + System.exit(0); + } + + // 一般来说,是需要推迟到 task 中进行pre 的执行(单表情况例外) + @Override + public void prepare() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + //导数据到odps表中 + this.odpsWriterJobProxy.prepare(); + } else { + // 实时表模式非分库分表 + String adsTable = this.originalConfig.getString(Key.ADS_TABLE); + List preSqls = this.originalConfig.getList(Key.PRE_SQL, + String.class); + List renderedPreSqls = WriterUtil.renderPreOrPostSqls( + preSqls, adsTable); + if (null != renderedPreSqls && !renderedPreSqls.isEmpty()) { + // 说明有 preSql 配置,则此处删除掉 + this.originalConfig.remove(Key.PRE_SQL); + Connection preConn = AdsUtil.getAdsConnect(this.originalConfig); + LOG.info("Begin to execute preSqls:[{}]. context info:{}.", + StringUtils.join(renderedPreSqls, ";"), + this.originalConfig.getString(Key.ADS_URL)); + WriterUtil.executeSqls(preConn, renderedPreSqls, + this.originalConfig.getString(Key.ADS_URL), + DataBaseType.ADS); + DBUtil.closeDBResources(null, null, preConn); + } + } + } + + @Override + public List split(int mandatoryNumber) { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + return this.odpsWriterJobProxy.split(mandatoryNumber); + } else { + List splitResult = new ArrayList(); + for(int i = 0; i < mandatoryNumber; i++) { + splitResult.add(this.originalConfig.clone()); + } + return splitResult; + } + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + @Override + public void post() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + loadAdsData(odpsAdsHelper, this.odpsTransTableName, null); + this.odpsWriterJobProxy.post(); + } else { + // 实时表模式非分库分表 + String adsTable = this.originalConfig.getString(Key.ADS_TABLE); + List postSqls = this.originalConfig.getList( + Key.POST_SQL, String.class); + List renderedPostSqls = WriterUtil.renderPreOrPostSqls( + postSqls, adsTable); + if (null != renderedPostSqls && !renderedPostSqls.isEmpty()) { + // 说明有 preSql 配置,则此处删除掉 + this.originalConfig.remove(Key.POST_SQL); + Connection postConn = AdsUtil.getAdsConnect(this.originalConfig); + LOG.info( + "Begin to execute postSqls:[{}]. 
context info:{}.", + StringUtils.join(renderedPostSqls, ";"), + this.originalConfig.getString(Key.ADS_URL)); + WriterUtil.executeSqls(postConn, renderedPostSqls, + this.originalConfig.getString(Key.ADS_URL), + DataBaseType.ADS); + DBUtil.closeDBResources(null, null, postConn); + } + } + } + + @Override + public void destroy() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + this.odpsWriterJobProxy.destroy(); + } else { + //insert mode do noting + } + } + + private void loadAdsData(AdsHelper helper, String odpsTableName, String odpsPartition) { + + String table = this.originalConfig.getString(Key.ADS_TABLE); + String project; + if (super.getPeerPluginName().equals(ODPS_READER)) { + project = this.readerConfig.getString(Key.PROJECT); + } else { + project = this.transProjConf.getProject(); + } + String partition = this.originalConfig.getString(Key.PARTITION); + String sourcePath = AdsUtil.generateSourcePath(project,odpsTableName,odpsPartition); + /** + * 因为之前检查过,所以不用担心unbox的时候NPE + */ + boolean overwrite = this.originalConfig.getBool(Key.OVER_WRITE); + try { + String id = helper.loadData(table,partition,sourcePath,overwrite); + LOG.info("ADS Load Data任务已经提交,job id: " + id); + boolean terminated = false; + int time = 0; + while(!terminated) { + Thread.sleep(120000); + terminated = helper.checkLoadDataJobStatus(id); + time += 2; + LOG.info("ADS 正在导数据中,整个过程需要20分钟以上,请耐心等待,目前已执行 "+ time+" 分钟"); + } + LOG.info("ADS 导数据已成功"); + } catch (AdsException e) { + if (super.getPeerPluginName().equals(ODPS_READER)) { + // TODO 使用云账号 + AdsWriterErrorCode.ADS_LOAD_ODPS_FAILED.setAdsAccount(helper.getUserName()); + throw DataXException.asDataXException(AdsWriterErrorCode.ADS_LOAD_ODPS_FAILED,e); + } else { + throw DataXException.asDataXException(AdsWriterErrorCode.ADS_LOAD_TEMP_ODPS_FAILED,e); + } + } catch (InterruptedException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_CREATETABLE_FAILED,e); + } + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Writer.Task.class); + private Configuration writerSliceConfig; + private OdpsWriter.Task odpsWriterTaskProxy = new OdpsWriter.Task(); + + + private String writeMode; + private String schema; + private String table; + private int columnNumber; + // warn: 只有在insert, stream模式才有, 对于load模式表明为odps临时表了 + private TableInfo tableInfo; + + @Override + public void init() { + writerSliceConfig = super.getPluginJobConf(); + this.writeMode = this.writerSliceConfig.getString(Key.WRITE_MODE); + this.schema = writerSliceConfig.getString(Key.SCHEMA); + this.table = writerSliceConfig.getString(Key.ADS_TABLE); + + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.setPluginJobConf(writerSliceConfig); + odpsWriterTaskProxy.init(); + } else if(Constant.INSERTMODE.equalsIgnoreCase(this.writeMode) || Constant.STREAMMODE.equalsIgnoreCase(this.writeMode)) { + try { + this.tableInfo = AdsUtil.createAdsHelper(this.writerSliceConfig).getTableInfo(this.table); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.CREATE_ADS_HELPER_FAILED, e); + } + List allColumns = new ArrayList(); + List columnInfo = this.tableInfo.getColumns(); + for (ColumnInfo eachColumn : columnInfo) { + allColumns.add(eachColumn.getName()); + } + LOG.info("table:[{}] all columns:[\n{}\n].", this.writerSliceConfig.get(Key.ADS_TABLE), StringUtils.join(allColumns, ",")); + AdsInsertUtil.dealColumnConf(writerSliceConfig, allColumns); + List userColumns = 
writerSliceConfig.getList(Key.COLUMN, String.class); + this.columnNumber = userColumns.size(); + } else { + throw DataXException.asDataXException(AdsWriterErrorCode.INVALID_CONFIG_VALUE, "writeMode 必须为 'load' 或者 'insert' 或者 'stream'"); + } + } + + @Override + public void prepare() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.prepare(); + } else { + //do nothing + } + } + + public void startWrite(RecordReceiver recordReceiver) { + // 这里的是非odps数据源->odps中转临时表数据同步, load操作在job post阶段完成 + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.setTaskPluginCollector(super.getTaskPluginCollector()); + odpsWriterTaskProxy.startWrite(recordReceiver); + } else { + // insert 模式 + List columns = writerSliceConfig.getList(Key.COLUMN, String.class); + Connection connection = AdsUtil.getAdsConnect(this.writerSliceConfig); + TaskPluginCollector taskPluginCollector = super.getTaskPluginCollector(); + AdsInsertProxy proxy = new AdsInsertProxy(schema + "." + table, columns, writerSliceConfig, taskPluginCollector, this.tableInfo); + proxy.startWriteWithConnection(recordReceiver, connection, columnNumber); + } + } + + @Override + public void post() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.post(); + } else { + //do noting until now + } + } + + @Override + public void destroy() { + if(Constant.LOADMODE.equalsIgnoreCase(this.writeMode)) { + odpsWriterTaskProxy.destroy(); + } else { + //do noting until now + } + } + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriterErrorCode.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriterErrorCode.java new file mode 100644 index 0000000000..a1ac3c107a --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/AdsWriterErrorCode.java @@ -0,0 +1,54 @@ +package com.alibaba.datax.plugin.writer.adswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum AdsWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("AdsWriter-00", "您缺失了必须填写的参数值."), + NO_ADS_TABLE("AdsWriter-01", "ADS表不存在."), + ODPS_CREATETABLE_FAILED("AdsWriter-02", "创建ODPS临时表失败,请联系ADS 技术支持"), + ADS_LOAD_TEMP_ODPS_FAILED("AdsWriter-03", "ADS从ODPS临时表导数据失败,请联系ADS 技术支持"), + TABLE_TRUNCATE_ERROR("AdsWriter-04", "清空 ODPS 目的表时出错."), + CREATE_ADS_HELPER_FAILED("AdsWriter-05", "创建ADSHelper对象出错,请联系ADS 技术支持"), + ODPS_PARTITION_FAILED("AdsWriter-06", "ODPS Reader不允许配置多个partition,目前只支持三种配置方式,\"partition\":[\"pt=*,ds=*\"](读取test表所有分区的数据); \n" + + "\"partition\":[\"pt=1,ds=*\"](读取test表下面,一级分区pt=1下面的所有二级分区); \n" + + "\"partition\":[\"pt=1,ds=hangzhou\"](读取test表下面,一级分区pt=1下面,二级分区ds=hz的数据)"), + ADS_LOAD_ODPS_FAILED("AdsWriter-07", "ADS从ODPS导数据失败,请联系ADS 技术支持,先检查ADS账号是否已加到该ODPS Project中。ADS账号为:"), + INVALID_CONFIG_VALUE("AdsWriter-08", "不合法的配置值."), + + GET_ADS_TABLE_MEATA_FAILED("AdsWriter-11", "获取ADS table原信息失败"); + + private final String code; + private final String description; + private String adsAccount; + + + private AdsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + public void setAdsAccount(String adsAccount) { + this.adsAccount = adsAccount; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + if (this.code.equals("AdsWriter-07")){ + return String.format("Code:[%s], Description:[%s][%s]. 
", this.code, + this.description,adsAccount); + }else{ + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnDataType.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnDataType.java new file mode 100644 index 0000000000..d719c318bf --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnDataType.java @@ -0,0 +1,406 @@ +package com.alibaba.datax.plugin.writer.adswriter.ads; + +import java.math.BigDecimal; +import java.sql.Date; +import java.sql.Time; +import java.sql.Timestamp; +import java.sql.Types; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; + +/** + * ADS column data type. + * + * @since 0.0.1 + */ +public class ColumnDataType { + + // public static final int NULL = 0; + public static final int BOOLEAN = 1; + public static final int BYTE = 2; + public static final int SHORT = 3; + public static final int INT = 4; + public static final int LONG = 5; + public static final int DECIMAL = 6; + public static final int DOUBLE = 7; + public static final int FLOAT = 8; + public static final int TIME = 9; + public static final int DATE = 10; + public static final int TIMESTAMP = 11; + public static final int STRING = 13; + // public static final int STRING_IGNORECASE = 14; + // public static final int STRING_FIXED = 21; + + public static final int MULTI_VALUE = 22; + + public static final int TYPE_COUNT = MULTI_VALUE + 1; + + /** + * The list of types. An ArrayList so that Tomcat doesn't set it to null when clearing references. + */ + private static final ArrayList TYPES = new ArrayList(); + private static final HashMap TYPES_BY_NAME = new HashMap(); + private static final ArrayList TYPES_BY_VALUE_TYPE = new ArrayList(); + + /** + * @param dataTypes + * @return + */ + public static String getNames(int[] dataTypes) { + List names = new ArrayList(dataTypes.length); + for (final int dataType : dataTypes) { + names.add(ColumnDataType.getDataType(dataType).name); + } + return names.toString(); + } + + public int type; + public String name; + public int sqlType; + public String jdbc; + + /** + * How closely the data type maps to the corresponding JDBC SQL type (low is best). 
+ */ + public int sqlTypePos; + + static { + for (int i = 0; i < TYPE_COUNT; i++) { + TYPES_BY_VALUE_TYPE.add(null); + } + // add(NULL, Types.NULL, "Null", new String[] { "NULL" }); + add(STRING, Types.VARCHAR, "String", new String[] { "VARCHAR", "VARCHAR2", "NVARCHAR", "NVARCHAR2", + "VARCHAR_CASESENSITIVE", "CHARACTER VARYING", "TID" }); + add(STRING, Types.LONGVARCHAR, "String", new String[] { "LONGVARCHAR", "LONGNVARCHAR" }); + // add(STRING_FIXED, Types.CHAR, "String", new String[] { "CHAR", "CHARACTER", "NCHAR" }); + // add(STRING_IGNORECASE, Types.VARCHAR, "String", new String[] { "VARCHAR_IGNORECASE" }); + add(BOOLEAN, Types.BOOLEAN, "Boolean", new String[] { "BOOLEAN", "BIT", "BOOL" }); + add(BYTE, Types.TINYINT, "Byte", new String[] { "TINYINT" }); + add(SHORT, Types.SMALLINT, "Short", new String[] { "SMALLINT", "YEAR", "INT2" }); + add(INT, Types.INTEGER, "Int", new String[] { "INTEGER", "INT", "MEDIUMINT", "INT4", "SIGNED" }); + add(INT, Types.INTEGER, "Int", new String[] { "SERIAL" }); + add(LONG, Types.BIGINT, "Long", new String[] { "BIGINT", "INT8", "LONG" }); + add(LONG, Types.BIGINT, "Long", new String[] { "IDENTITY", "BIGSERIAL" }); + add(DECIMAL, Types.DECIMAL, "BigDecimal", new String[] { "DECIMAL", "DEC" }); + add(DECIMAL, Types.NUMERIC, "BigDecimal", new String[] { "NUMERIC", "NUMBER" }); + add(FLOAT, Types.REAL, "Float", new String[] { "REAL", "FLOAT4" }); + add(DOUBLE, Types.DOUBLE, "Double", new String[] { "DOUBLE", "DOUBLE PRECISION" }); + add(DOUBLE, Types.FLOAT, "Double", new String[] { "FLOAT", "FLOAT8" }); + add(TIME, Types.TIME, "Time", new String[] { "TIME" }); + add(DATE, Types.DATE, "Date", new String[] { "DATE" }); + add(TIMESTAMP, Types.TIMESTAMP, "Timestamp", new String[] { "TIMESTAMP", "DATETIME", "SMALLDATETIME" }); + add(MULTI_VALUE, Types.VARCHAR, "String", new String[] { "MULTIVALUE" }); + } + + private static void add(int type, int sqlType, String jdbc, String[] names) { + for (int i = 0; i < names.length; i++) { + ColumnDataType dt = new ColumnDataType(); + dt.type = type; + dt.sqlType = sqlType; + dt.jdbc = jdbc; + dt.name = names[i]; + for (ColumnDataType t2 : TYPES) { + if (t2.sqlType == dt.sqlType) { + dt.sqlTypePos++; + } + } + TYPES_BY_NAME.put(dt.name, dt); + if (TYPES_BY_VALUE_TYPE.get(type) == null) { + TYPES_BY_VALUE_TYPE.set(type, dt); + } + TYPES.add(dt); + } + } + + /** + * Get the list of data types. + * + * @return the list + */ + public static ArrayList getTypes() { + return TYPES; + } + + /** + * Get the name of the Java class for the given value type. 
+ * + * @param type the value type + * @return the class name + */ + public static String getTypeClassName(int type) { + switch (type) { + case BOOLEAN: + // "java.lang.Boolean"; + return Boolean.class.getName(); + case BYTE: + // "java.lang.Byte"; + return Byte.class.getName(); + case SHORT: + // "java.lang.Short"; + return Short.class.getName(); + case INT: + // "java.lang.Integer"; + return Integer.class.getName(); + case LONG: + // "java.lang.Long"; + return Long.class.getName(); + case DECIMAL: + // "java.math.BigDecimal"; + return BigDecimal.class.getName(); + case TIME: + // "java.sql.Time"; + return Time.class.getName(); + case DATE: + // "java.sql.Date"; + return Date.class.getName(); + case TIMESTAMP: + // "java.sql.Timestamp"; + return Timestamp.class.getName(); + case STRING: + // case STRING_IGNORECASE: + // case STRING_FIXED: + case MULTI_VALUE: + // "java.lang.String"; + return String.class.getName(); + case DOUBLE: + // "java.lang.Double"; + return Double.class.getName(); + case FLOAT: + // "java.lang.Float"; + return Float.class.getName(); + // case NULL: + // return null; + default: + throw new IllegalArgumentException("type=" + type); + } + } + + /** + * Get the data type object for the given value type. + * + * @param type the value type + * @return the data type object + */ + public static ColumnDataType getDataType(int type) { + if (type < 0 || type >= TYPE_COUNT) { + throw new IllegalArgumentException("type=" + type); + } + ColumnDataType dt = TYPES_BY_VALUE_TYPE.get(type); + // if (dt == null) { + // dt = TYPES_BY_VALUE_TYPE.get(NULL); + // } + return dt; + } + + /** + * Convert a value type to a SQL type. + * + * @param type the value type + * @return the SQL type + */ + public static int convertTypeToSQLType(int type) { + return getDataType(type).sqlType; + } + + /** + * Convert a SQL type to a value type. + * + * @param sqlType the SQL type + * @return the value type + */ + public static int convertSQLTypeToValueType(int sqlType) { + switch (sqlType) { + // case Types.CHAR: + // case Types.NCHAR: + // return STRING_FIXED; + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + return STRING; + case Types.NUMERIC: + case Types.DECIMAL: + return DECIMAL; + case Types.BIT: + case Types.BOOLEAN: + return BOOLEAN; + case Types.INTEGER: + return INT; + case Types.SMALLINT: + return SHORT; + case Types.TINYINT: + return BYTE; + case Types.BIGINT: + return LONG; + case Types.REAL: + return FLOAT; + case Types.DOUBLE: + case Types.FLOAT: + return DOUBLE; + case Types.DATE: + return DATE; + case Types.TIME: + return TIME; + case Types.TIMESTAMP: + return TIMESTAMP; + // case Types.NULL: + // return NULL; + default: + throw new IllegalArgumentException("JDBC Type: " + sqlType); + } + } + + /** + * Get the value type for the given Java class. 
+ * + * @param x the Java class + * @return the value type + */ + public static int getTypeFromClass(Class x) { + // if (x == null || Void.TYPE == x) { + // return NULL; + // } + if (x.isPrimitive()) { + x = getNonPrimitiveClass(x); + } + if (String.class == x) { + return STRING; + } else if (Integer.class == x) { + return INT; + } else if (Long.class == x) { + return LONG; + } else if (Boolean.class == x) { + return BOOLEAN; + } else if (Double.class == x) { + return DOUBLE; + } else if (Byte.class == x) { + return BYTE; + } else if (Short.class == x) { + return SHORT; + } else if (Float.class == x) { + return FLOAT; + // } else if (Void.class == x) { + // return NULL; + } else if (BigDecimal.class.isAssignableFrom(x)) { + return DECIMAL; + } else if (Date.class.isAssignableFrom(x)) { + return DATE; + } else if (Time.class.isAssignableFrom(x)) { + return TIME; + } else if (Timestamp.class.isAssignableFrom(x)) { + return TIMESTAMP; + } else if (java.util.Date.class.isAssignableFrom(x)) { + return TIMESTAMP; + } else { + throw new IllegalArgumentException("class=" + x); + } + } + + /** + * Convert primitive class names to java.lang.* class names. + * + * @param clazz the class (for example: int) + * @return the non-primitive class (for example: java.lang.Integer) + */ + public static Class getNonPrimitiveClass(Class clazz) { + if (!clazz.isPrimitive()) { + return clazz; + } else if (clazz == boolean.class) { + return Boolean.class; + } else if (clazz == byte.class) { + return Byte.class; + } else if (clazz == char.class) { + return Character.class; + } else if (clazz == double.class) { + return Double.class; + } else if (clazz == float.class) { + return Float.class; + } else if (clazz == int.class) { + return Integer.class; + } else if (clazz == long.class) { + return Long.class; + } else if (clazz == short.class) { + return Short.class; + } else if (clazz == void.class) { + return Void.class; + } + return clazz; + } + + /** + * Get a data type object from a type name. + * + * @param s the type name + * @return the data type object + */ + public static ColumnDataType getTypeByName(String s) { + return TYPES_BY_NAME.get(s); + } + + /** + * Check if the given value type is a String (VARCHAR,...). + * + * @param type the value type + * @return true if the value type is a String type + */ + public static boolean isStringType(int type) { + if (type == STRING /* || type == STRING_FIXED || type == STRING_IGNORECASE */ + || type == MULTI_VALUE) { + return true; + } + return false; + } + + /** + * @return + */ + public boolean supportsAdd() { + return supportsAdd(type); + } + + /** + * Check if the given value type supports the add operation. + * + * @param type the value type + * @return true if add is supported + */ + public static boolean supportsAdd(int type) { + switch (type) { + case BYTE: + case DECIMAL: + case DOUBLE: + case FLOAT: + case INT: + case LONG: + case SHORT: + return true; + default: + return false; + } + } + + /** + * Get the data type that will not overflow when calling 'add' 2 billion times. 
+ * + * @param type the value type + * @return the data type that supports adding + */ + public static int getAddProofType(int type) { + switch (type) { + case BYTE: + return LONG; + case FLOAT: + return DOUBLE; + case INT: + return LONG; + case LONG: + return DECIMAL; + case SHORT: + return LONG; + default: + return type; + } + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnInfo.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnInfo.java new file mode 100644 index 0000000000..030ce35d10 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/ColumnInfo.java @@ -0,0 +1,72 @@ +package com.alibaba.datax.plugin.writer.adswriter.ads; + +/** + * ADS column meta.
+ *

+ * select ordinal_position,column_name,data_type,type_name,column_comment
+ * from information_schema.columns
+ * where table_schema='db_name' and table_name='table_name'
+ * and is_deleted=0
+ * order by ordinal_position limit 1000
+ *

+ * + * @since 0.0.1 + */ +public class ColumnInfo { + + private int ordinal; + private String name; + private ColumnDataType dataType; + private boolean isDeleted; + private String comment; + + public int getOrdinal() { + return ordinal; + } + + public void setOrdinal(int ordinal) { + this.ordinal = ordinal; + } + + public String getName() { + return name; + } + + public void setName(String name) { + this.name = name; + } + + public ColumnDataType getDataType() { + return dataType; + } + + public void setDataType(ColumnDataType dataType) { + this.dataType = dataType; + } + + public boolean isDeleted() { + return isDeleted; + } + + public void setDeleted(boolean isDeleted) { + this.isDeleted = isDeleted; + } + + public String getComment() { + return comment; + } + + public void setComment(String comment) { + this.comment = comment; + } + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("ColumnInfo [ordinal=").append(ordinal).append(", name=").append(name).append(", dataType=") + .append(dataType).append(", isDeleted=").append(isDeleted).append(", comment=").append(comment) + .append("]"); + return builder.toString(); + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/TableInfo.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/TableInfo.java new file mode 100644 index 0000000000..eac324d1fd --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/TableInfo.java @@ -0,0 +1,135 @@ +package com.alibaba.datax.plugin.writer.adswriter.ads; + +import java.util.ArrayList; +import java.util.List; + +/** + * ADS table meta.
+ *

+ * select table_schema, table_name,comments
+ * from information_schema.tables
+ * where table_schema='alimama' and table_name='click_af' limit 1
+ *

+ *

+ * select ordinal_position,column_name,data_type,type_name,column_comment
+ * from information_schema.columns
+ * where table_schema='db_name' and table_name='table_name'
+ * and is_deleted=0
+ * order by ordinal_position limit 1000
+ *

+ * + * @since 0.0.1 + */ +public class TableInfo { + + private String tableSchema; + private String tableName; + private List columns; + private String comments; + private String tableType; + + private String updateType; + private String partitionType; + private String partitionColumn; + private int partitionCount; + private List primaryKeyColumns; + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("TableInfo [tableSchema=").append(tableSchema).append(", tableName=").append(tableName) + .append(", columns=").append(columns).append(", comments=").append(comments).append(",updateType=").append(updateType) + .append(",partitionType=").append(partitionType).append(",partitionColumn=").append(partitionColumn).append(",partitionCount=").append(partitionCount) + .append(",primaryKeyColumns=").append(primaryKeyColumns).append("]"); + return builder.toString(); + } + + public String getTableSchema() { + return tableSchema; + } + + public void setTableSchema(String tableSchema) { + this.tableSchema = tableSchema; + } + + public String getTableName() { + return tableName; + } + + public void setTableName(String tableName) { + this.tableName = tableName; + } + + public List getColumns() { + return columns; + } + + public List getColumnsNames() { + List columnNames = new ArrayList(); + for (ColumnInfo column : this.getColumns()) { + columnNames.add(column.getName()); + } + return columnNames; + } + + public void setColumns(List columns) { + this.columns = columns; + } + + public String getComments() { + return comments; + } + + public void setComments(String comments) { + this.comments = comments; + } + + public String getTableType() { + return tableType; + } + + public void setTableType(String tableType) { + this.tableType = tableType; + } + + public String getUpdateType() { + return updateType; + } + + public void setUpdateType(String updateType) { + this.updateType = updateType; + } + + public String getPartitionType() { + return partitionType; + } + + public void setPartitionType(String partitionType) { + this.partitionType = partitionType; + } + + public String getPartitionColumn() { + return partitionColumn; + } + + public void setPartitionColumn(String partitionColumn) { + this.partitionColumn = partitionColumn; + } + + public int getPartitionCount() { + return partitionCount; + } + + public void setPartitionCount(int partitionCount) { + this.partitionCount = partitionCount; + } + + public List getPrimaryKeyColumns() { + return primaryKeyColumns; + } + + public void setPrimaryKeyColumns(List primaryKeyColumns) { + this.primaryKeyColumns = primaryKeyColumns; + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/package-info.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/package-info.java new file mode 100644 index 0000000000..b396c49ffa --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/ads/package-info.java @@ -0,0 +1,6 @@ +/** + * ADS meta and service. 
+ * + * @since 0.0.1 + */ +package com.alibaba.datax.plugin.writer.adswriter.ads; \ No newline at end of file diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertProxy.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertProxy.java new file mode 100644 index 0000000000..7211fb9755 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertProxy.java @@ -0,0 +1,631 @@ +package com.alibaba.datax.plugin.writer.adswriter.insert; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import com.alibaba.datax.plugin.writer.adswriter.util.AdsUtil; +import com.alibaba.datax.plugin.writer.adswriter.util.Constant; +import com.alibaba.datax.plugin.writer.adswriter.util.Key; +import com.mysql.jdbc.JDBC4PreparedStatement; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.*; +import java.util.ArrayList; +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; +import java.util.concurrent.Callable; +import java.util.zip.CRC32; +import java.util.zip.Checksum; + + +public class AdsInsertProxy { + + private static final Logger LOG = LoggerFactory + .getLogger(AdsInsertProxy.class); + private static final boolean IS_DEBUG_ENABLE = LOG.isDebugEnabled(); + private static final int MAX_EXCEPTION_CAUSE_ITER = 100; + + private String table; + private List columns; + private TaskPluginCollector taskPluginCollector; + private Configuration configuration; + private Boolean emptyAsNull; + + private String writeMode; + + private String insertSqlPrefix; + private String deleteSqlPrefix; + private int opColumnIndex; + private String lastDmlMode; + // columnName: + private Map> adsTableColumnsMetaData; + private Map> userConfigColumnsMetaData; + // columnName: index @ ads column + private Map primaryKeyNameIndexMap; + + private int retryTimeUpperLimit; + private Connection currentConnection; + + private String partitionColumn; + private int partitionColumnIndex = -1; + private int partitionCount; + + public AdsInsertProxy(String table, List columns, Configuration configuration, TaskPluginCollector taskPluginCollector, TableInfo tableInfo) { + this.table = table; + this.columns = columns; + this.configuration = configuration; + this.taskPluginCollector = taskPluginCollector; + this.emptyAsNull = configuration.getBool(Key.EMPTY_AS_NULL, false); + this.writeMode = configuration.getString(Key.WRITE_MODE); + this.insertSqlPrefix = String.format(Constant.INSERT_TEMPLATE, this.table, StringUtils.join(columns, ",")); + this.deleteSqlPrefix = String.format(Constant.DELETE_TEMPLATE, this.table); + this.opColumnIndex = configuration.getInt(Key.OPIndex, 0); + this.retryTimeUpperLimit = configuration.getInt( + Key.RETRY_CONNECTION_TIME, Constant.DEFAULT_RETRY_TIMES); + 
this.partitionCount = tableInfo.getPartitionCount(); + this.partitionColumn = tableInfo.getPartitionColumn(); + + //目前ads新建的表如果未插入数据不能通过select colums from table where 1=2,获取列信息,需要读取ads数据字典 + //not this: this.resultSetMetaData = DBUtil.getColumnMetaData(connection, this.table, StringUtils.join(this.columns, ",")); + //no retry here(fetch meta data) 注意实时表列换序的可能 + this.adsTableColumnsMetaData = AdsInsertUtil.getColumnMetaData(tableInfo, this.columns); + this.userConfigColumnsMetaData = new HashMap>(); + + List primaryKeyColumnName = tableInfo.getPrimaryKeyColumns(); + List adsColumnsNames = tableInfo.getColumnsNames(); + this.primaryKeyNameIndexMap = new HashMap(); + //warn: 要使用用户配置的column顺序, 不要使用从ads元数据获取的column顺序, 原来复用load列顺序其实有问题的 + for (int i = 0; i < this.columns.size(); i++) { + String oriEachColumn = this.columns.get(i); + String eachColumn = oriEachColumn; + // 防御性保留字 + if (eachColumn.startsWith(Constant.ADS_QUOTE_CHARACTER) && eachColumn.endsWith(Constant.ADS_QUOTE_CHARACTER)) { + eachColumn = eachColumn.substring(1, eachColumn.length() - 1); + } + for (String eachPrimary : primaryKeyColumnName) { + if (eachColumn.equalsIgnoreCase(eachPrimary)) { + this.primaryKeyNameIndexMap.put(oriEachColumn, i); + } + } + for (String eachAdsColumn : adsColumnsNames) { + if (eachColumn.equalsIgnoreCase(eachAdsColumn)) { + this.userConfigColumnsMetaData.put(oriEachColumn, this.adsTableColumnsMetaData.get(eachAdsColumn)); + } + } + + // 根据第几个column分区列排序,ads实时表只有一级分区、最多256个分区 + if (eachColumn.equalsIgnoreCase(this.partitionColumn)) { + this.partitionColumnIndex = i; + } + } + } + + public void startWriteWithConnection(RecordReceiver recordReceiver, + Connection connection, + int columnNumber) { + this.currentConnection = connection; + int batchSize = this.configuration.getInt(Key.BATCH_SIZE, Constant.DEFAULT_BATCH_SIZE); + // 默认情况下bufferSize需要和batchSize一致 + int bufferSize = this.configuration.getInt(Key.BUFFER_SIZE, batchSize); + // insert缓冲,多个分区排序后insert合并发送到ads + List writeBuffer = new ArrayList(bufferSize); + List deleteBuffer = null; + if (this.writeMode.equalsIgnoreCase(Constant.STREAMMODE)) { + // delete缓冲,多个分区排序后delete合并发送到ads + deleteBuffer = new ArrayList(bufferSize); + } + try { + Record record; + while ((record = recordReceiver.getFromReader()) != null) { + if (this.writeMode.equalsIgnoreCase(Constant.INSERTMODE)) { + if (record.getColumnNumber() != columnNumber) { + // 源头读取字段列数与目的表字段写入列数不相等,直接报错 + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "列配置信息有错误. 因为您配置的任务中,源头读取字段数:%s 与 目的表要写入的字段数:%s 不相等. 请检查您的配置并作出修改.", + record.getColumnNumber(), + columnNumber)); + } + writeBuffer.add(record); + if (writeBuffer.size() >= bufferSize) { + this.doBatchRecordWithPartitionSort(writeBuffer, Constant.INSERTMODE, bufferSize, batchSize); + writeBuffer.clear(); + } + } else { + if (record.getColumnNumber() != columnNumber + 1) { + // 源头读取字段列数需要为目的表字段写入列数+1, 直接报错, 源头多了一列OP + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "列配置信息有错误. 因为您配置的任务中,源头读取字段数:%s 与 目的表要写入的字段数:%s 不满足源头多1列操作类型列. 
请检查您的配置并作出修改.", + record.getColumnNumber(), + columnNumber)); + } + String optionColumnValue = record.getColumn(this.opColumnIndex).asString(); + OperationType operationType = OperationType.asOperationType(optionColumnValue); + if (operationType.isInsertTemplate()) { + writeBuffer.add(record); + if (this.lastDmlMode == null || this.lastDmlMode == Constant.INSERTMODE ) { + this.lastDmlMode = Constant.INSERTMODE; + if (writeBuffer.size() >= bufferSize) { + this.doBatchRecordWithPartitionSort(writeBuffer, Constant.INSERTMODE, bufferSize, batchSize); + writeBuffer.clear(); + } + } else { + this.lastDmlMode = Constant.INSERTMODE; + // 模式变换触发一次提交ads delete, 并进入insert模式 + this.doBatchRecordWithPartitionSort(deleteBuffer, Constant.DELETEMODE, bufferSize, batchSize); + deleteBuffer.clear(); + } + } else if (operationType.isDeleteTemplate()) { + deleteBuffer.add(record); + if (this.lastDmlMode == null || this.lastDmlMode == Constant.DELETEMODE ) { + this.lastDmlMode = Constant.DELETEMODE; + if (deleteBuffer.size() >= bufferSize) { + this.doBatchRecordWithPartitionSort(deleteBuffer, Constant.DELETEMODE, bufferSize, batchSize); + deleteBuffer.clear(); + } + } else { + this.lastDmlMode = Constant.DELETEMODE; + // 模式变换触发一次提交ads insert, 并进入delete模式 + this.doBatchRecordWithPartitionSort(writeBuffer, Constant.INSERTMODE, bufferSize, batchSize); + writeBuffer.clear(); + } + } else { + // 注意OP操作类型的脏数据, 这里不需要重试 + this.taskPluginCollector.collectDirtyRecord(record, String.format("不支持您的更新类型:%s", optionColumnValue)); + } + } + } + + if (!writeBuffer.isEmpty()) { + //doOneRecord(writeBuffer, Constant.INSERTMODE); + this.doBatchRecordWithPartitionSort(writeBuffer, Constant.INSERTMODE, bufferSize, batchSize); + writeBuffer.clear(); + } + // 2个缓冲最多一个不为空同时 + if (null!= deleteBuffer && !deleteBuffer.isEmpty()) { + //doOneRecord(deleteBuffer, Constant.DELETEMODE); + this.doBatchRecordWithPartitionSort(deleteBuffer, Constant.DELETEMODE, bufferSize, batchSize); + deleteBuffer.clear(); + } + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + writeBuffer.clear(); + DBUtil.closeDBResources(null, null, connection); + } + } + + /** + * @param bufferSize datax缓冲记录条数 + * @param batchSize datax向ads系统一次发送数据条数 + * @param buffer datax缓冲区 + * @param mode 实时表模式insert 或者 stream + * */ + private void doBatchRecordWithPartitionSort(List buffer, String mode, int bufferSize, int batchSize) throws SQLException{ + //warn: 排序会影响数据插入顺序, 如果源头没有数据约束, 排序可能造成数据不一致, 快速排序是一种不稳定的排序算法 + //warn: 不明确配置bufferSize或者小于batchSize的情况下,不要进行排序;如果缓冲区实际内容条数少于batchSize也不排序了,最后一次的余量 + int recordBufferedNumber = buffer.size(); + if (bufferSize > batchSize && recordBufferedNumber > batchSize && this.partitionColumnIndex >= 0) { + final int partitionColumnIndex = this.partitionColumnIndex; + final int partitionCount = this.partitionCount; + Collections.sort(buffer, new Comparator() { + @Override + public int compare(Record record1, Record record2) { + int hashPartition1 = AdsInsertProxy.getHashPartition(record1.getColumn(partitionColumnIndex).asString(), partitionCount); + int hashPartition2 = AdsInsertProxy.getHashPartition(record2.getColumn(partitionColumnIndex).asString(), partitionCount); + return hashPartition1 - hashPartition2; + } + }); + } + // 将缓冲区的Record输出到ads, 使用recordBufferedNumber哦 + for (int i = 0; i < recordBufferedNumber; i += batchSize) { + int toIndex = i + batchSize; + if (toIndex > recordBufferedNumber) { + toIndex = recordBufferedNumber; + } + 
this.doBatchRecord(buffer.subList(i, toIndex), mode); + } + } + + private void doBatchRecord(final List buffer, final String mode) throws SQLException { + List> retryExceptionClasss = new ArrayList>(); + retryExceptionClasss.add(com.mysql.jdbc.exceptions.jdbc4.CommunicationsException.class); + retryExceptionClasss.add(java.net.SocketException.class); + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Boolean call() throws Exception { + doBatchRecordDml(buffer, mode); + return true; + } + }, this.retryTimeUpperLimit, 2000L, true, retryExceptionClasss); + }catch (SQLException e) { + LOG.warn(String.format("after retry %s times, doBatchRecord meet a exception: ", this.retryTimeUpperLimit), e); + LOG.info("try to re execute for each record..."); + doOneRecord(buffer, mode); + // below is the old way + // for (Record eachRecord : buffer) { + // this.taskPluginCollector.collectDirtyRecord(eachRecord, e); + // } + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } + } + + //warn: ADS 无法支持事物roll back都是不管用 + @SuppressWarnings("resource") + private void doBatchRecordDml(List buffer, String mode) throws Exception { + Statement statement = null; + String sql = null; + try { + int bufferSize = buffer.size(); + if (buffer.isEmpty()) { + return; + } + StringBuilder sqlSb = new StringBuilder(); + // connection.setAutoCommit(true); + //mysql impl warn: if a database access error occurs or this method is called on a closed connection throw SQLException + statement = this.currentConnection.createStatement(); + sqlSb.append(this.generateDmlSql(this.currentConnection, buffer.get(0), mode)); + for (int i = 1; i < bufferSize; i++) { + Record record = buffer.get(i); + this.appendDmlSqlValues(this.currentConnection, record, sqlSb, mode); + } + sql = sqlSb.toString(); + if (IS_DEBUG_ENABLE) { + LOG.debug(sql); + } + @SuppressWarnings("unused") + int status = statement.executeUpdate(sql); + sql = null; + } catch (SQLException e) { + LOG.warn("doBatchRecordDml meet a exception: " + sql, e); + Exception eachException = e; + int maxIter = 0;// 避免死循环 + while (null != eachException && maxIter < AdsInsertProxy.MAX_EXCEPTION_CAUSE_ITER) { + if (this.isRetryable(eachException)) { + LOG.warn("doBatchRecordDml meet a retry exception: " + e.getMessage()); + this.currentConnection = AdsUtil.getAdsConnect(this.configuration); + throw eachException; + } else { + try { + Throwable causeThrowable = eachException.getCause(); + eachException = causeThrowable == null ? null : (Exception)causeThrowable; + } catch (Exception castException) { + LOG.warn("doBatchRecordDml meet a no! 
retry exception: " + e.getMessage()); + throw e; + } + } + maxIter++; + } + throw e; + } catch (Exception e) { + LOG.error("插入异常, sql: " + sql); + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(statement, null); + } + } + + private void doOneRecord(List buffer, final String mode) { + List> retryExceptionClasss = new ArrayList>(); + retryExceptionClasss.add(com.mysql.jdbc.exceptions.jdbc4.CommunicationsException.class); + retryExceptionClasss.add(java.net.SocketException.class); + for (final Record record : buffer) { + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Boolean call() throws Exception { + doOneRecordDml(record, mode); + return true; + } + }, this.retryTimeUpperLimit, 2000L, true, retryExceptionClasss); + } catch (Exception e) { + // 不能重试的一行,记录脏数据 + this.taskPluginCollector.collectDirtyRecord(record, e); + } + } + } + + @SuppressWarnings("resource") + private void doOneRecordDml(Record record, String mode) throws Exception { + Statement statement = null; + String sql = null; + try { + // connection.setAutoCommit(true); + statement = this.currentConnection.createStatement(); + sql = generateDmlSql(this.currentConnection, record, mode); + if (IS_DEBUG_ENABLE) { + LOG.debug(sql); + } + @SuppressWarnings("unused") + int status = statement.executeUpdate(sql); + sql = null; + } catch (SQLException e) { + LOG.error("doOneDml meet a exception: " + sql, e); + //need retry before record dirty data + //this.taskPluginCollector.collectDirtyRecord(record, e); + // 更新当前可用连接 + Exception eachException = e; + int maxIter = 0;// 避免死循环 + while (null != eachException && maxIter < AdsInsertProxy.MAX_EXCEPTION_CAUSE_ITER) { + if (this.isRetryable(eachException)) { + LOG.warn("doOneDml meet a retry exception: " + e.getMessage()); + this.currentConnection = AdsUtil.getAdsConnect(this.configuration); + throw eachException; + } else { + try { + Throwable causeThrowable = eachException.getCause(); + eachException = causeThrowable == null ? null : (Exception)causeThrowable; + } catch (Exception castException) { + LOG.warn("doOneDml meet a no! 
retry exception: " + e.getMessage()); + throw e; + } + } + maxIter++; + } + throw e; + } catch (Exception e) { + LOG.error("插入异常, sql: " + sql); + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(statement, null); + } + } + + private boolean isRetryable(Throwable e) { + Class meetExceptionClass = e.getClass(); + if (meetExceptionClass == com.mysql.jdbc.exceptions.jdbc4.CommunicationsException.class) { + return true; + } + if (meetExceptionClass == java.net.SocketException.class) { + return true; + } + return false; + } + + private String generateDmlSql(Connection connection, Record record, String mode) throws SQLException { + String sql = null; + StringBuilder sqlSb = new StringBuilder(); + if (mode.equalsIgnoreCase(Constant.INSERTMODE)) { + sqlSb.append(this.insertSqlPrefix); + sqlSb.append("("); + int columnsSize = this.columns.size(); + for (int i = 0; i < columnsSize; i++) { + if((i + 1) != columnsSize) { + sqlSb.append("?,"); + } else { + sqlSb.append("?"); + } + } + sqlSb.append(")"); + //mysql impl warn: if a database access error occurs or this method is called on a closed connection + PreparedStatement statement = connection.prepareStatement(sqlSb.toString()); + for (int i = 0; i < this.columns.size(); i++) { + int preparedParamsIndex = i; + if (Constant.STREAMMODE.equalsIgnoreCase(this.writeMode)) { + if (preparedParamsIndex >= this.opColumnIndex) { + preparedParamsIndex = i + 1; + } + } + String columnName = this.columns.get(i); + int columnSqltype = this.userConfigColumnsMetaData.get(columnName).getLeft(); + prepareColumnTypeValue(statement, columnSqltype, record.getColumn(preparedParamsIndex), i, columnName); + } + sql = ((JDBC4PreparedStatement) statement).asSql(); + DBUtil.closeDBResources(statement, null); + } else { + sqlSb.append(this.deleteSqlPrefix); + sqlSb.append("("); + Set> primaryEntrySet = this.primaryKeyNameIndexMap.entrySet(); + int entrySetSize = primaryEntrySet.size(); + int i = 0; + for (Entry eachEntry : primaryEntrySet) { + if((i + 1) != entrySetSize) { + sqlSb.append(String.format(" (%s = ?) and ", eachEntry.getKey())); + } else { + sqlSb.append(String.format(" (%s = ?) 
", eachEntry.getKey())); + } + i++; + } + sqlSb.append(")"); + //mysql impl warn: if a database access error occurs or this method is called on a closed connection + PreparedStatement statement = connection.prepareStatement(sqlSb.toString()); + i = 0; + //ads的real time表只能是1级分区、且分区列类型是long, 但是这里是需要主键删除的 + for (Entry each : primaryEntrySet) { + String columnName = each.getKey(); + int columnSqlType = this.userConfigColumnsMetaData.get(columnName).getLeft(); + int primaryKeyInUserConfigIndex = this.primaryKeyNameIndexMap.get(columnName); + if (primaryKeyInUserConfigIndex >= this.opColumnIndex) { + primaryKeyInUserConfigIndex ++; + } + prepareColumnTypeValue(statement, columnSqlType, record.getColumn(primaryKeyInUserConfigIndex), i, columnName); + i++; + } + sql = ((JDBC4PreparedStatement) statement).asSql(); + DBUtil.closeDBResources(statement, null); + } + return sql; + } + + private void appendDmlSqlValues(Connection connection, Record record, StringBuilder sqlSb, String mode) throws SQLException { + String sqlResult = this.generateDmlSql(connection, record, mode); + if (mode.equalsIgnoreCase(Constant.INSERTMODE)) { + sqlSb.append(","); + sqlSb.append(sqlResult.substring(this.insertSqlPrefix.length())); + } else { + // 之前已经充分增加过括号了 + sqlSb.append(" or "); + sqlSb.append(sqlResult.substring(this.deleteSqlPrefix.length())); + } + } + + private void prepareColumnTypeValue(PreparedStatement statement, int columnSqltype, Column column, int preparedPatamIndex, String columnName) throws SQLException { + java.util.Date utilDate; + switch (columnSqltype) { + case Types.CHAR: + case Types.NCHAR: + case Types.CLOB: + case Types.NCLOB: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + String strValue = column.asString(); + statement.setString(preparedPatamIndex + 1, strValue); + break; + + case Types.SMALLINT: + case Types.INTEGER: + case Types.BIGINT: + case Types.NUMERIC: + case Types.DECIMAL: + case Types.REAL: + String numValue = column.asString(); + if(emptyAsNull && "".equals(numValue) || numValue == null){ + //statement.setObject(preparedPatamIndex + 1, null); + statement.setNull(preparedPatamIndex + 1, Types.BIGINT); + } else{ + statement.setLong(preparedPatamIndex + 1, column.asLong()); + } + break; + + case Types.FLOAT: + case Types.DOUBLE: + String floatValue = column.asString(); + if(emptyAsNull && "".equals(floatValue) || floatValue == null){ + //statement.setObject(preparedPatamIndex + 1, null); + statement.setNull(preparedPatamIndex + 1, Types.DOUBLE); + } else{ + statement.setDouble(preparedPatamIndex + 1, column.asDouble()); + } + break; + + //tinyint is a little special in some database like mysql {boolean->tinyint(1)} + case Types.TINYINT: + Long longValue = column.asLong(); + if (null == longValue) { + statement.setNull(preparedPatamIndex + 1, Types.BIGINT); + } else { + statement.setLong(preparedPatamIndex + 1, longValue); + } + + break; + + case Types.DATE: + java.sql.Date sqlDate = null; + try { + if("".equals(column.getRawData())) { + utilDate = null; + } else { + utilDate = column.asDate(); + } + } catch (DataXException e) { + throw new SQLException(String.format( + "Date 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlDate = new java.sql.Date(utilDate.getTime()); + } + statement.setDate(preparedPatamIndex + 1, sqlDate); + break; + + case Types.TIME: + java.sql.Time sqlTime = null; + try { + if("".equals(column.getRawData())) { + utilDate = null; + } else { + utilDate = column.asDate(); + } + } catch (DataXException 
e) { + throw new SQLException(String.format( + "TIME 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlTime = new java.sql.Time(utilDate.getTime()); + } + statement.setTime(preparedPatamIndex + 1, sqlTime); + break; + + case Types.TIMESTAMP: + java.sql.Timestamp sqlTimestamp = null; + try { + if("".equals(column.getRawData())) { + utilDate = null; + } else { + utilDate = column.asDate(); + } + } catch (DataXException e) { + throw new SQLException(String.format( + "TIMESTAMP 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlTimestamp = new java.sql.Timestamp( + utilDate.getTime()); + } + statement.setTimestamp(preparedPatamIndex + 1, sqlTimestamp); + break; + + case Types.BOOLEAN: + //case Types.BIT: ads 没有bit + Boolean booleanValue = column.asBoolean(); + if (null == booleanValue) { + statement.setNull(preparedPatamIndex + 1, Types.BOOLEAN); + } else { + statement.setBoolean(preparedPatamIndex + 1, booleanValue); + } + + break; + default: + Pair columnMetaPair = this.userConfigColumnsMetaData.get(columnName); + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%s], 字段Java类型:[%s]. 请修改表中该字段的类型或者不同步该字段.", + columnName, columnMetaPair.getRight(), columnMetaPair.getLeft())); + } + } + + private static int getHashPartition(String value, int totalHashPartitionNum) { + long crc32 = (value == null ? getCRC32("-1") : getCRC32(value)); + return (int) (crc32 % totalHashPartitionNum); + } + + private static long getCRC32(String value) { + Checksum checksum = new CRC32(); + byte[] bytes = value.getBytes(); + checksum.update(bytes, 0, bytes.length); + return checksum.getValue(); + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertUtil.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertUtil.java new file mode 100644 index 0000000000..8e44e8c794 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/AdsInsertUtil.java @@ -0,0 +1,153 @@ +package com.alibaba.datax.plugin.writer.adswriter.insert; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.AdsException; +import com.alibaba.datax.plugin.writer.adswriter.AdsWriterErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnInfo; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import com.alibaba.datax.plugin.writer.adswriter.load.AdsHelper; +import com.alibaba.datax.plugin.writer.adswriter.util.AdsUtil; +import com.alibaba.datax.plugin.writer.adswriter.util.Constant; +import com.alibaba.datax.plugin.writer.adswriter.util.Key; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.ImmutablePair; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + + +public class AdsInsertUtil { + + private static final Logger LOG = LoggerFactory + .getLogger(AdsInsertUtil.class); + + public static TableInfo getAdsTableInfo(Configuration conf) { + AdsHelper adsHelper = AdsUtil.createAdsHelper(conf); + TableInfo tableInfo= null; + try { + tableInfo = 
adsHelper.getTableInfo(conf.getString(Key.ADS_TABLE)); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.GET_ADS_TABLE_MEATA_FAILED, e); + } + return tableInfo; + } + + /* + * 返回列顺序为ads建表列顺序 + * */ + public static List getAdsTableColumnNames(Configuration conf) { + List tableColumns = new ArrayList(); + AdsHelper adsHelper = AdsUtil.createAdsHelper(conf); + TableInfo tableInfo= null; + String adsTable = conf.getString(Key.ADS_TABLE); + try { + tableInfo = adsHelper.getTableInfo(adsTable); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.GET_ADS_TABLE_MEATA_FAILED, e); + } + + List columnInfos = tableInfo.getColumns(); + for(ColumnInfo columnInfo: columnInfos) { + tableColumns.add(columnInfo.getName()); + } + + LOG.info("table:[{}] all columns:[\n{}\n].", adsTable, StringUtils.join(tableColumns, ",")); + return tableColumns; + } + + public static Map> getColumnMetaData + (Configuration configuration, List userColumns) { + Map> columnMetaData = new HashMap>(); + List columnInfoList = getAdsTableColumns(configuration); + for(String column : userColumns) { + if (column.startsWith(Constant.ADS_QUOTE_CHARACTER) && column.endsWith(Constant.ADS_QUOTE_CHARACTER)) { + column = column.substring(1, column.length() - 1); + } + for (ColumnInfo columnInfo : columnInfoList) { + if(column.equalsIgnoreCase(columnInfo.getName())) { + Pair eachPair = new ImmutablePair(columnInfo.getDataType().sqlType, columnInfo.getDataType().name); + columnMetaData.put(columnInfo.getName(), eachPair); + } + } + } + return columnMetaData; + } + + public static Map> getColumnMetaData(TableInfo tableInfo, List userColumns){ + Map> columnMetaData = new HashMap>(); + List columnInfoList = tableInfo.getColumns(); + for(String column : userColumns) { + if (column.startsWith(Constant.ADS_QUOTE_CHARACTER) && column.endsWith(Constant.ADS_QUOTE_CHARACTER)) { + column = column.substring(1, column.length() - 1); + } + for (ColumnInfo columnInfo : columnInfoList) { + if(column.equalsIgnoreCase(columnInfo.getName())) { + Pair eachPair = new ImmutablePair(columnInfo.getDataType().sqlType, columnInfo.getDataType().name); + columnMetaData.put(columnInfo.getName(), eachPair); + } + } + } + return columnMetaData; + } + + /* + * 返回列顺序为ads建表列顺序 + * */ + public static List getAdsTableColumns(Configuration conf) { + AdsHelper adsHelper = AdsUtil.createAdsHelper(conf); + TableInfo tableInfo= null; + String adsTable = conf.getString(Key.ADS_TABLE); + try { + tableInfo = adsHelper.getTableInfo(adsTable); + } catch (AdsException e) { + throw DataXException.asDataXException(AdsWriterErrorCode.GET_ADS_TABLE_MEATA_FAILED, e); + } + + List columnInfos = tableInfo.getColumns(); + + return columnInfos; + } + + public static void dealColumnConf(Configuration originalConfig, List tableColumns) { + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, String.class); + if (null == userConfiguredColumns || userConfiguredColumns.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的配置文件中的列配置信息有误. 因为您未配置写入数据库表的列名称,DataX获取不到列信息. 请检查您的配置并作出修改."); + } else { + if (1 == userConfiguredColumns.size() && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("您的配置文件中的列配置信息存在风险. 
因为您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改."); + + // 回填其值,需要以 String 的方式转交后续处理 + originalConfig.set(Key.COLUMN, tableColumns); + } else if (userConfiguredColumns.size() > tableColumns.size()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + String.format("您的配置文件中的列配置信息有误. 因为您所配置的写入数据库表的字段个数:%s 大于目的表的总字段总个数:%s. 请检查您的配置并作出修改.", + userConfiguredColumns.size(), tableColumns.size())); + } else { + // 确保用户配置的 column 不重复 + ListUtil.makeSureNoValueDuplicate(userConfiguredColumns, false); + // 检查列是否都为数据库表中正确的列(通过执行一次 select column from table 进行判断) + // ListUtil.makeSureBInA(tableColumns, userConfiguredColumns, true); + // 支持关键字和保留字, ads列是不区分大小写的 + List removeQuotedColumns = new ArrayList(); + for (String each : userConfiguredColumns) { + if (each.startsWith(Constant.ADS_QUOTE_CHARACTER) && each.endsWith(Constant.ADS_QUOTE_CHARACTER)) { + removeQuotedColumns.add(each.substring(1, each.length() - 1)); + } else { + removeQuotedColumns.add(each); + } + } + ListUtil.makeSureBInA(tableColumns, removeQuotedColumns, false); + } + } + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/OperationType.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/OperationType.java new file mode 100644 index 0000000000..a689e70327 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/insert/OperationType.java @@ -0,0 +1,75 @@ +package com.alibaba.datax.plugin.writer.adswriter.insert; + +public enum OperationType { + // i: insert uo:before image uu:before image un: after image d: delete + // u:update + I("i"), UO("uo"), UU("uu"), UN("un"), D("d"), U("u"), UNKNOWN("unknown"), ; + private OperationType(String type) { + this.type = type; + } + + private String type; + + public String getType() { + return this.type; + } + + public static OperationType asOperationType(String type) { + if ("i".equalsIgnoreCase(type)) { + return I; + } else if ("uo".equalsIgnoreCase(type)) { + return UO; + } else if ("uu".equalsIgnoreCase(type)) { + return UU; + } else if ("un".equalsIgnoreCase(type)) { + return UN; + } else if ("d".equalsIgnoreCase(type)) { + return D; + } else if ("u".equalsIgnoreCase(type)) { + return U; + } else { + return UNKNOWN; + } + } + + public boolean isInsertTemplate() { + switch (this) { + // 建议merge 过后应该只有I和U两类 + case I: + case UO: + case UU: + case UN: + case U: + return true; + case D: + return false; + default: + return false; + } + } + + public boolean isDeleteTemplate() { + switch (this) { + // 建议merge 过后应该只有I和U两类 + case I: + case UO: + case UU: + case UN: + case U: + return false; + case D: + return true; + default: + return false; + } + } + + public boolean isLegal() { + return this.type != UNKNOWN.getType(); + } + + @Override + public String toString() { + return this.name(); + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/AdsHelper.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/AdsHelper.java new file mode 100644 index 0000000000..924f6fcb61 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/AdsHelper.java @@ -0,0 +1,429 @@ +/** + * + */ +package com.alibaba.datax.plugin.writer.adswriter.load; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.writer.adswriter.AdsException; +import 
com.alibaba.datax.plugin.writer.adswriter.AdsWriterErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnDataType; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnInfo; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; +import com.alibaba.datax.plugin.writer.adswriter.util.AdsUtil; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Properties; +import java.util.concurrent.Callable; + +public class AdsHelper { + private static final Logger LOG = LoggerFactory + .getLogger(AdsHelper.class); + + private String adsURL; + private String userName; + private String password; + private String schema; + private Long socketTimeout; + private String suffix; + + public AdsHelper(String adsUrl, String userName, String password, String schema, Long socketTimeout, String suffix) { + this.adsURL = adsUrl; + this.userName = userName; + this.password = password; + this.schema = schema; + this.socketTimeout = socketTimeout; + this.suffix = suffix; + } + + public String getAdsURL() { + return adsURL; + } + + public void setAdsURL(String adsURL) { + this.adsURL = adsURL; + } + + public String getUserName() { + return userName; + } + + public void setUserName(String userName) { + this.userName = userName; + } + + public String getPassword() { + return password; + } + + public void setPassword(String password) { + this.password = password; + } + + public String getSchema() { + return schema; + } + + public void setSchema(String schema) { + this.schema = schema; + } + + /** + * Obtain the table meta information. + * + * @param table The table + * @return The table meta information + * @throws com.alibaba.datax.plugin.writer.adswriter.AdsException + */ + public TableInfo getTableInfo(String table) throws AdsException { + + if (table == null) { + throw new AdsException(AdsException.ADS_TABLEMETA_TABLE_NULL, "Table is null.", null); + } + + if (adsURL == null) { + throw new AdsException(AdsException.ADS_CONN_URL_NOT_SET, "ADS JDBC connection URL was not set.", null); + } + + if (userName == null) { + throw new AdsException(AdsException.ADS_CONN_USERNAME_NOT_SET, + "ADS JDBC connection user name was not set.", null); + } + + if (password == null) { + throw new AdsException(AdsException.ADS_CONN_PASSWORD_NOT_SET, "ADS JDBC connection password was not set.", + null); + } + + if (schema == null) { + throw new AdsException(AdsException.ADS_CONN_SCHEMA_NOT_SET, "ADS JDBC connection schema was not set.", + null); + } + + Connection connection = null; + Statement statement = null; + ResultSet rs = null; + try { + Class.forName("com.mysql.jdbc.Driver"); + String url = AdsUtil.prepareJdbcUrl(this.adsURL, this.schema, this.socketTimeout, this.suffix); + + Properties connectionProps = new Properties(); + connectionProps.put("user", userName); + connectionProps.put("password", password); + connection = DriverManager.getConnection(url, connectionProps); + statement = connection.createStatement(); + // ads 表名、schema名不区分大小写, 提高用户易用性, 注意列顺序性 + String columnMetaSql = String.format("select ordinal_position,column_name,data_type,type_name,column_comment from information_schema.columns where lower(table_schema) = `'%s'` and lower(table_name) = `'%s'` order by ordinal_position", schema.toLowerCase(), table.toLowerCase()); + LOG.info(String.format("检查列信息sql语句:%s", columnMetaSql)); + rs = 
statement.executeQuery(columnMetaSql); + + TableInfo tableInfo = new TableInfo(); + List columnInfoList = new ArrayList(); + while (DBUtil.asyncResultSetNext(rs)) { + ColumnInfo columnInfo = new ColumnInfo(); + columnInfo.setOrdinal(rs.getInt(1)); + columnInfo.setName(rs.getString(2)); + //columnInfo.setDataType(ColumnDataType.getDataType(rs.getInt(3))); //for ads version < 0.7 + //columnInfo.setDataType(ColumnDataType.getTypeByName(rs.getString(3).toUpperCase())); //for ads version 0.8 + columnInfo.setDataType(ColumnDataType.getTypeByName(rs.getString(4).toUpperCase())); //for ads version 0.8 & 0.7 + columnInfo.setComment(rs.getString(5)); + columnInfoList.add(columnInfo); + } + if (columnInfoList.isEmpty()) { + throw DataXException.asDataXException(AdsWriterErrorCode.NO_ADS_TABLE, table + "不存在或者查询不到列信息. "); + } + tableInfo.setColumns(columnInfoList); + tableInfo.setTableSchema(schema); + tableInfo.setTableName(table); + DBUtil.closeDBResources(rs, statement, null); + + String tableMetaSql = String.format("select update_type, partition_type, partition_column, partition_count, primary_key_columns from information_schema.tables where lower(table_schema) = `'%s'` and lower(table_name) = `'%s'`", schema.toLowerCase(), table.toLowerCase()); + LOG.info(String.format("检查表信息sql语句:%s", tableMetaSql)); + statement = connection.createStatement(); + rs = statement.executeQuery(tableMetaSql); + while (DBUtil.asyncResultSetNext(rs)) { + tableInfo.setUpdateType(rs.getString(1)); + tableInfo.setPartitionType(rs.getString(2)); + tableInfo.setPartitionColumn(rs.getString(3)); + tableInfo.setPartitionCount(rs.getInt(4)); + //primary_key_columns ads主键是逗号分隔的,可以有多个 + String primaryKeyColumns = rs.getString(5); + if (StringUtils.isNotBlank(primaryKeyColumns)) { + tableInfo.setPrimaryKeyColumns(Arrays.asList(StringUtils.split(primaryKeyColumns, ","))); + } else { + tableInfo.setPrimaryKeyColumns(null); + } + break; + } + DBUtil.closeDBResources(rs, statement, null); + return tableInfo; + + } catch (ClassNotFoundException e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } catch (SQLException e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } catch ( DataXException e) { + throw e; + } catch (Exception e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } finally { + if (rs != null) { + try { + rs.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (statement != null) { + try { + statement.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (connection != null) { + try { + connection.close(); + } catch (SQLException e) { + // Ignore exception + } + } + } + + } + + /** + * Submit LOAD DATA command. 
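+ * 生成并执行的SQL形如(示例中的路径与库表名均为假设值): LOAD DATA FROM 'odps://tmp_project/tmp_table' OVERWRITE INTO TABLE target_schema.target_table PARTITION (ds=20150101)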
+ * + * @param table The target ADS table + * @param partition The partition option in the form of "(partition_name,...)" + * @param sourcePath The source path + * @param overwrite + * @return + * @throws AdsException + */ + public String loadData(String table, String partition, String sourcePath, boolean overwrite) throws AdsException { + + if (table == null) { + throw new AdsException(AdsException.ADS_LOADDATA_TABLE_NULL, "ADS LOAD DATA table is null.", null); + } + + if (sourcePath == null) { + throw new AdsException(AdsException.ADS_LOADDATA_SOURCEPATH_NULL, "ADS LOAD DATA source path is null.", + null); + } + + if (adsURL == null) { + throw new AdsException(AdsException.ADS_CONN_URL_NOT_SET, "ADS JDBC connection URL was not set.", null); + } + + if (userName == null) { + throw new AdsException(AdsException.ADS_CONN_USERNAME_NOT_SET, + "ADS JDBC connection user name was not set.", null); + } + + if (password == null) { + throw new AdsException(AdsException.ADS_CONN_PASSWORD_NOT_SET, "ADS JDBC connection password was not set.", + null); + } + + if (schema == null) { + throw new AdsException(AdsException.ADS_CONN_SCHEMA_NOT_SET, "ADS JDBC connection schema was not set.", + null); + } + + StringBuilder sb = new StringBuilder(); + sb.append("LOAD DATA FROM "); + if (sourcePath.startsWith("'") && sourcePath.endsWith("'")) { + sb.append(sourcePath); + } else { + sb.append("'" + sourcePath + "'"); + } + if (overwrite) { + sb.append(" OVERWRITE"); + } + sb.append(" INTO TABLE "); + sb.append(schema + "." + table); + if (partition != null && !partition.trim().equals("")) { + String partitionTrim = partition.trim(); + if(partitionTrim.startsWith("(") && partitionTrim.endsWith(")")) { + sb.append(" PARTITION " + partition); + } else { + sb.append(" PARTITION " + "(" + partition + ")"); + } + } + + Connection connection = null; + Statement statement = null; + ResultSet rs = null; + try { + Class.forName("com.mysql.jdbc.Driver"); + String url = AdsUtil.prepareJdbcUrl(this.adsURL, this.schema, this.socketTimeout, this.suffix); + Properties connectionProps = new Properties(); + connectionProps.put("user", userName); + connectionProps.put("password", password); + connection = DriverManager.getConnection(url, connectionProps); + statement = connection.createStatement(); + LOG.info("正在从ODPS数据库导数据到ADS中: "+sb.toString()); + LOG.info("由于ADS的限制,ADS导数据最少需要20分钟,请耐心等待"); + rs = statement.executeQuery(sb.toString()); + + String jobId = null; + while (DBUtil.asyncResultSetNext(rs)) { + jobId = rs.getString(1); + } + + if (jobId == null) { + throw new AdsException(AdsException.ADS_LOADDATA_JOBID_NOT_AVAIL, + "Job id is not available for the submitted LOAD DATA." + jobId, null); + } + + return jobId; + + } catch (ClassNotFoundException e) { + throw new AdsException(AdsException.ADS_LOADDATA_FAILED, e.getMessage(), e); + } catch (SQLException e) { + throw new AdsException(AdsException.ADS_LOADDATA_FAILED, e.getMessage(), e); + } catch (Exception e) { + throw new AdsException(AdsException.ADS_LOADDATA_FAILED, e.getMessage(), e); + } finally { + if (rs != null) { + try { + rs.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (statement != null) { + try { + statement.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (connection != null) { + try { + connection.close(); + } catch (SQLException e) { + // Ignore exception + } + } + } + + } + + /** + * Check the load data job status. 
+ * + * @param jobId The job id to + * @return true if load data job succeeded, false if load data job failed. + * @throws AdsException + */ + public boolean checkLoadDataJobStatus(String jobId) throws AdsException { + + if (adsURL == null) { + throw new AdsException(AdsException.ADS_CONN_URL_NOT_SET, "ADS JDBC connection URL was not set.", null); + } + + if (userName == null) { + throw new AdsException(AdsException.ADS_CONN_USERNAME_NOT_SET, + "ADS JDBC connection user name was not set.", null); + } + + if (password == null) { + throw new AdsException(AdsException.ADS_CONN_PASSWORD_NOT_SET, "ADS JDBC connection password was not set.", + null); + } + + if (schema == null) { + throw new AdsException(AdsException.ADS_CONN_SCHEMA_NOT_SET, "ADS JDBC connection schema was not set.", + null); + } + + try { + String state = this.checkLoadDataJobStatusWithRetry(jobId); + if (state == null) { + throw new AdsException(AdsException.JOB_NOT_EXIST, "Target job does not exist for id: " + jobId, null); + } + if (state.equals("SUCCEEDED")) { + return true; + } else if (state.equals("FAILED")) { + throw new AdsException(AdsException.JOB_FAILED, "Target job failed for id: " + jobId, null); + } else { + return false; + } + } catch (Exception e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } + } + + private String checkLoadDataJobStatusWithRetry(final String jobId) + throws AdsException { + try { + Class.forName("com.mysql.jdbc.Driver"); + final String finalAdsUrl = this.adsURL; + final String finalSchema = this.schema; + final Long finalSocketTimeout = this.socketTimeout; + final String suffix = this.suffix; + return RetryUtil.executeWithRetry(new Callable() { + @Override + public String call() throws Exception { + Connection connection = null; + Statement statement = null; + ResultSet rs = null; + try { + + String url = AdsUtil.prepareJdbcUrl(finalAdsUrl, finalSchema, finalSocketTimeout, suffix); + Properties connectionProps = new Properties(); + connectionProps.put("user", userName); + connectionProps.put("password", password); + connection = DriverManager.getConnection(url, + connectionProps); + statement = connection.createStatement(); + + String sql = "select state from information_schema.job_instances where job_id like '" + + jobId + "'"; + rs = statement.executeQuery(sql); + String state = null; + while (DBUtil.asyncResultSetNext(rs)) { + state = rs.getString(1); + } + return state; + } finally { + if (rs != null) { + try { + rs.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (statement != null) { + try { + statement.close(); + } catch (SQLException e) { + // Ignore exception + } + } + if (connection != null) { + try { + connection.close(); + } catch (SQLException e) { + // Ignore exception + } + } + } + } + }, 3, 1000L, true); + } catch (Exception e) { + throw new AdsException(AdsException.OTHER, e.getMessage(), e); + } + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TableMetaHelper.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TableMetaHelper.java new file mode 100644 index 0000000000..1ecad7561d --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TableMetaHelper.java @@ -0,0 +1,87 @@ +package com.alibaba.datax.plugin.writer.adswriter.load; + +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnDataType; +import com.alibaba.datax.plugin.writer.adswriter.ads.ColumnInfo; +import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo; 
+import com.alibaba.datax.plugin.writer.adswriter.odps.DataType; +import com.alibaba.datax.plugin.writer.adswriter.odps.FieldSchema; +import com.alibaba.datax.plugin.writer.adswriter.odps.TableMeta; + +import java.util.ArrayList; +import java.util.List; +import java.util.Random; + +/** + * Table meta helper for ADS writer. + * + * @since 0.0.1 + */ +public class TableMetaHelper { + + private TableMetaHelper() { + } + + /** + * Create temporary ODPS table. + * + * @param tableMeta table meta + * @param lifeCycle for temporary table + * @return ODPS temporary table meta + */ + public static TableMeta createTempODPSTable(TableInfo tableMeta, int lifeCycle) { + TableMeta tempTable = new TableMeta(); + tempTable.setComment(tableMeta.getComments()); + tempTable.setLifeCycle(lifeCycle); + String tableSchema = tableMeta.getTableSchema(); + String tableName = tableMeta.getTableName(); + tempTable.setTableName(generateTempTableName(tableSchema, tableName)); + List tempColumns = new ArrayList(); + List columns = tableMeta.getColumns(); + for (ColumnInfo column : columns) { + FieldSchema tempColumn = new FieldSchema(); + tempColumn.setName(column.getName()); + tempColumn.setType(toODPSDataType(column.getDataType())); + tempColumn.setComment(column.getComment()); + tempColumns.add(tempColumn); + } + tempTable.setCols(tempColumns); + tempTable.setPartitionKeys(null); + return tempTable; + } + + private static String toODPSDataType(ColumnDataType columnDataType) { + int type; + switch (columnDataType.type) { + case ColumnDataType.BOOLEAN: + type = DataType.STRING; + break; + case ColumnDataType.BYTE: + case ColumnDataType.SHORT: + case ColumnDataType.INT: + case ColumnDataType.LONG: + type = DataType.INTEGER; + break; + case ColumnDataType.DECIMAL: + case ColumnDataType.DOUBLE: + case ColumnDataType.FLOAT: + type = DataType.DOUBLE; + break; + case ColumnDataType.DATE: + case ColumnDataType.TIME: + case ColumnDataType.TIMESTAMP: + case ColumnDataType.STRING: + case ColumnDataType.MULTI_VALUE: + type = DataType.STRING; + break; + default: + throw new IllegalArgumentException("columnDataType=" + columnDataType); + } + return DataType.toString(type); + } + + private static String generateTempTableName(String tableSchema, String tableName) { + int randNum = 1000 + new Random(System.currentTimeMillis()).nextInt(1000); + return tableSchema + "__" + tableName + "_" + System.currentTimeMillis() + randNum; + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TransferProjectConf.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TransferProjectConf.java new file mode 100644 index 0000000000..bff4b7b900 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/load/TransferProjectConf.java @@ -0,0 +1,65 @@ +package com.alibaba.datax.plugin.writer.adswriter.load; + +import com.alibaba.datax.common.util.Configuration; + +/** + * Created by xiafei.qiuxf on 15/4/13. 
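+ * 中转ODPS项目的连接配置, 取自adswriter配置中的 odps.accessId、odps.accessKey、odps.account、odps.odpsServer、odps.tunnelServer、odps.accountType、odps.project 等配置项.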
+ */ +public class TransferProjectConf { + + public final static String KEY_ACCESS_ID = "odps.accessId"; + public final static String KEY_ACCESS_KEY = "odps.accessKey"; + public final static String KEY_ACCOUNT = "odps.account"; + public final static String KEY_ODPS_SERVER = "odps.odpsServer"; + public final static String KEY_ODPS_TUNNEL = "odps.tunnelServer"; + public final static String KEY_ACCOUNT_TYPE = "odps.accountType"; + public final static String KEY_PROJECT = "odps.project"; + + private String accessId; + private String accessKey; + private String account; + private String odpsServer; + private String odpsTunnel; + private String accountType; + private String project; + + public static TransferProjectConf create(Configuration adsWriterConf) { + TransferProjectConf res = new TransferProjectConf(); + res.accessId = adsWriterConf.getString(KEY_ACCESS_ID); + res.accessKey = adsWriterConf.getString(KEY_ACCESS_KEY); + res.account = adsWriterConf.getString(KEY_ACCOUNT); + res.odpsServer = adsWriterConf.getString(KEY_ODPS_SERVER); + res.odpsTunnel = adsWriterConf.getString(KEY_ODPS_TUNNEL); + res.accountType = adsWriterConf.getString(KEY_ACCOUNT_TYPE, "aliyun"); + res.project = adsWriterConf.getString(KEY_PROJECT); + return res; + } + + public String getAccessId() { + return accessId; + } + + public String getAccessKey() { + return accessKey; + } + + public String getAccount() { + return account; + } + + public String getOdpsServer() { + return odpsServer; + } + + public String getOdpsTunnel() { + return odpsTunnel; + } + + public String getAccountType() { + return accountType; + } + + public String getProject() { + return project; + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/DataType.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/DataType.java new file mode 100644 index 0000000000..595b1dfd26 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/DataType.java @@ -0,0 +1,77 @@ +package com.alibaba.datax.plugin.writer.adswriter.odps; + +/** + * ODPS 数据类型. + *

+ * 当前定义了如下类型:
+ *   - INTEGER
+ *   - DOUBLE
+ *   - BOOLEAN
+ *   - STRING
+ *   - DATETIME
+ * + * @since 0.0.1 + */ +public class DataType { + + public final static byte INTEGER = 0; + public final static byte DOUBLE = 1; + public final static byte BOOLEAN = 2; + public final static byte STRING = 3; + public final static byte DATETIME = 4; + + public static String toString(int type) { + switch (type) { + case INTEGER: + return "bigint"; + case DOUBLE: + return "double"; + case BOOLEAN: + return "boolean"; + case STRING: + return "string"; + case DATETIME: + return "datetime"; + default: + throw new IllegalArgumentException("type=" + type); + } + } + + /** + * 字符串的数据类型转换为byte常量定义的数据类型. + *

+ * 转换规则:
+ *   - tinyint, int, bigint, long - {@link #INTEGER}
+ *   - double, float - {@link #DOUBLE}
+ *   - string - {@link #STRING}
+ *   - boolean, bool - {@link #BOOLEAN}
+ *   - datetime - {@link #DATETIME}
+ *
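+ * 例如: convertToDataType("bigint") 返回 {@link #INTEGER}, convertToDataType("bool") 返回 {@link #BOOLEAN}; 入参会先做 trim 与小写处理.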

+ * + * @param type 字符串的数据类型 + * @return byte常量定义的数据类型 + * @throws IllegalArgumentException + */ + public static byte convertToDataType(String type) throws IllegalArgumentException { + type = type.toLowerCase().trim(); + if ("string".equals(type)) { + return STRING; + } else if ("bigint".equals(type) || "int".equals(type) || "tinyint".equals(type) || "long".equals(type)) { + return INTEGER; + } else if ("boolean".equals(type) || "bool".equals(type)) { + return BOOLEAN; + } else if ("double".equals(type) || "float".equals(type)) { + return DOUBLE; + } else if ("datetime".equals(type)) { + return DATETIME; + } else { + throw new IllegalArgumentException("unkown type: " + type); + } + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/FieldSchema.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/FieldSchema.java new file mode 100644 index 0000000000..701ee261cf --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/FieldSchema.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.plugin.writer.adswriter.odps; + +/** + * ODPS列属性,包含列名和类型 列名和类型与SQL的DESC表或分区显示的列名和类型一致 + * + * @since 0.0.1 + */ +public class FieldSchema { + + /** 列名 */ + private String name; + + /** 列类型,如:string, bigint, boolean, datetime等等 */ + private String type; + + private String comment; + + public String getName() { + return name; + } + + public void setName(String name) { + this.name = name; + } + + public String getType() { + return type; + } + + public void setType(String type) { + this.type = type; + } + + public String getComment() { + return comment; + } + + public void setComment(String comment) { + this.comment = comment; + } + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("FieldSchema [name=").append(name).append(", type=").append(type).append(", comment=") + .append(comment).append("]"); + return builder.toString(); + } + + /** + * @return "col_name data_type [COMMENT col_comment]" + */ + public String toDDL() { + StringBuilder builder = new StringBuilder(); + builder.append(name).append(" ").append(type); + String comment = this.comment; + if (comment != null && comment.length() > 0) { + builder.append(" ").append("COMMENT \"" + comment + "\""); + } + return builder.toString(); + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/TableMeta.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/TableMeta.java new file mode 100644 index 0000000000..d0adc4eae6 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/TableMeta.java @@ -0,0 +1,114 @@ +package com.alibaba.datax.plugin.writer.adswriter.odps; + +import java.util.Iterator; +import java.util.List; + +/** + * ODPS table meta. 
+ * + * @since 0.0.1 + */ +public class TableMeta { + + private String tableName; + + private List cols; + + private List partitionKeys; + + private int lifeCycle; + + private String comment; + + public String getTableName() { + return tableName; + } + + public void setTableName(String tableName) { + this.tableName = tableName; + } + + public List getCols() { + return cols; + } + + public void setCols(List cols) { + this.cols = cols; + } + + public List getPartitionKeys() { + return partitionKeys; + } + + public void setPartitionKeys(List partitionKeys) { + this.partitionKeys = partitionKeys; + } + + public int getLifeCycle() { + return lifeCycle; + } + + public void setLifeCycle(int lifeCycle) { + this.lifeCycle = lifeCycle; + } + + public String getComment() { + return comment; + } + + public void setComment(String comment) { + this.comment = comment; + } + + @Override + public String toString() { + StringBuilder builder = new StringBuilder(); + builder.append("TableMeta [tableName=").append(tableName).append(", cols=").append(cols) + .append(", partitionKeys=").append(partitionKeys).append(", lifeCycle=").append(lifeCycle) + .append(", comment=").append(comment).append("]"); + return builder.toString(); + } + + /** + * @return
+ * "CREATE TABLE [IF NOT EXISTS] table_name
+ * [(col_name data_type [COMMENT col_comment], ...)]
+ * [COMMENT table_comment]
+ * [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
+ * [LIFECYCLE days]
+ * [AS select_statement] "
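+ * 例如(表名与列定义为假设值): CREATE TABLE tmp_table (id bigint COMMENT "id主键", name string) COMMENT "临时中转表" LIFECYCLE 2 ;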
+ */ + public String toDDL() { + StringBuilder builder = new StringBuilder(); + builder.append("CREATE TABLE " + tableName).append(" "); + List cols = this.cols; + if (cols != null && cols.size() > 0) { + builder.append("(").append(toDDL(cols)).append(")").append(" "); + } + String comment = this.comment; + if (comment != null && comment.length() > 0) { + builder.append("COMMENT \"" + comment + "\" "); + } + List partitionKeys = this.partitionKeys; + if (partitionKeys != null && partitionKeys.size() > 0) { + builder.append("PARTITIONED BY "); + builder.append("(").append(toDDL(partitionKeys)).append(")").append(" "); + } + if (lifeCycle > 0) { + builder.append("LIFECYCLE " + lifeCycle).append(" "); + } + builder.append(";"); + return builder.toString(); + } + + private String toDDL(List cols) { + StringBuilder builder = new StringBuilder(); + Iterator iter = cols.iterator(); + builder.append(iter.next().toDDL()); + while (iter.hasNext()) { + builder.append(", ").append(iter.next().toDDL()); + } + return builder.toString(); + } + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/package-info.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/package-info.java new file mode 100644 index 0000000000..92dfd09da4 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/odps/package-info.java @@ -0,0 +1,6 @@ +/** + * ODPS meta. + * + * @since 0.0.1 + */ +package com.alibaba.datax.plugin.writer.adswriter.odps; \ No newline at end of file diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/package-info.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/package-info.java new file mode 100644 index 0000000000..139a39106a --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/package-info.java @@ -0,0 +1,6 @@ +/** + * ADS Writer. 
+ * + * @since 0.0.1 + */ +package com.alibaba.datax.plugin.writer.adswriter; \ No newline at end of file diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/AdsUtil.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/AdsUtil.java new file mode 100644 index 0000000000..4336d4773f --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/AdsUtil.java @@ -0,0 +1,175 @@ +package com.alibaba.datax.plugin.writer.adswriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.writer.adswriter.load.AdsHelper; +import com.alibaba.datax.plugin.writer.adswriter.AdsWriterErrorCode; +import com.alibaba.datax.plugin.writer.adswriter.load.TransferProjectConf; +import com.alibaba.datax.plugin.writer.adswriter.odps.FieldSchema; +import com.alibaba.datax.plugin.writer.adswriter.odps.TableMeta; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.util.ArrayList; +import java.util.List; + +public class AdsUtil { + private static final Logger LOG = LoggerFactory.getLogger(AdsUtil.class); + + /*检查配置文件中必填的配置项是否都已填 + * */ + public static void checkNecessaryConfig(Configuration originalConfig, String writeMode) { + //检查ADS必要参数 + originalConfig.getNecessaryValue(Key.ADS_URL, + AdsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.USERNAME, + AdsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.PASSWORD, + AdsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.SCHEMA, + AdsWriterErrorCode.REQUIRED_VALUE); + if(Constant.LOADMODE.equals(writeMode)) { + originalConfig.getNecessaryValue(Key.Life_CYCLE, + AdsWriterErrorCode.REQUIRED_VALUE); + Integer lifeCycle = originalConfig.getInt(Key.Life_CYCLE); + if (lifeCycle <= 0) { + throw DataXException.asDataXException(AdsWriterErrorCode.INVALID_CONFIG_VALUE, "配置项[lifeCycle]的值必须大于零."); + } + originalConfig.getNecessaryValue(Key.ADS_TABLE, + AdsWriterErrorCode.REQUIRED_VALUE); + Boolean overwrite = originalConfig.getBool(Key.OVER_WRITE); + if (overwrite == null) { + throw DataXException.asDataXException(AdsWriterErrorCode.REQUIRED_VALUE, "配置项[overWrite]是必填项."); + } + } + if (Constant.STREAMMODE.equalsIgnoreCase(writeMode)) { + originalConfig.getNecessaryValue(Key.OPIndex, AdsWriterErrorCode.REQUIRED_VALUE); + } + } + + /*生成AdsHelp实例 + * */ + public static AdsHelper createAdsHelper(Configuration originalConfig){ + //Get adsUrl,userName,password,schema等参数,创建AdsHelp实例 + String adsUrl = originalConfig.getString(Key.ADS_URL); + String userName = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + String schema = originalConfig.getString(Key.SCHEMA); + Long socketTimeout = originalConfig.getLong(Key.SOCKET_TIMEOUT, Constant.DEFAULT_SOCKET_TIMEOUT); + String suffix = originalConfig.getString(Key.JDBC_URL_SUFFIX, ""); + return new AdsHelper(adsUrl,userName,password,schema,socketTimeout,suffix); + } + + public static AdsHelper createAdsHelperWithOdpsAccount(Configuration originalConfig) { + String adsUrl = originalConfig.getString(Key.ADS_URL); + String userName = originalConfig.getString(TransferProjectConf.KEY_ACCESS_ID); + String password = 
originalConfig.getString(TransferProjectConf.KEY_ACCESS_KEY); + String schema = originalConfig.getString(Key.SCHEMA); + Long socketTimeout = originalConfig.getLong(Key.SOCKET_TIMEOUT, Constant.DEFAULT_SOCKET_TIMEOUT); + String suffix = originalConfig.getString(Key.JDBC_URL_SUFFIX, ""); + return new AdsHelper(adsUrl, userName, password, schema,socketTimeout,suffix); + } + + /*生成ODPSWriter Plugin所需要的配置文件 + * */ + public static Configuration generateConf(Configuration originalConfig, String odpsTableName, TableMeta tableMeta, TransferProjectConf transConf){ + Configuration newConfig = originalConfig.clone(); + newConfig.set(Key.ODPSTABLENAME, odpsTableName); + newConfig.set(Key.ODPS_SERVER, transConf.getOdpsServer()); + newConfig.set(Key.TUNNEL_SERVER,transConf.getOdpsTunnel()); + newConfig.set(Key.ACCESS_ID,transConf.getAccessId()); + newConfig.set(Key.ACCESS_KEY,transConf.getAccessKey()); + newConfig.set(Key.PROJECT,transConf.getProject()); + newConfig.set(Key.TRUNCATE, true); + newConfig.set(Key.PARTITION,null); +// newConfig.remove(Key.PARTITION); + List cols = tableMeta.getCols(); + List allColumns = new ArrayList(); + if(cols != null && !cols.isEmpty()){ + for(FieldSchema col:cols){ + allColumns.add(col.getName()); + } + } + newConfig.set(Key.COLUMN,allColumns); + return newConfig; + } + + /*生成ADS数据导入时的source_path + * */ + public static String generateSourcePath(String project, String tmpOdpsTableName, String odpsPartition){ + StringBuilder builder = new StringBuilder(); + String partition = transferOdpsPartitionToAds(odpsPartition); + builder.append("odps://").append(project).append("/").append(tmpOdpsTableName); + if(odpsPartition != null && !odpsPartition.isEmpty()){ + builder.append("/").append(partition); + } + return builder.toString(); + } + + public static String transferOdpsPartitionToAds(String odpsPartition){ + if(odpsPartition == null || odpsPartition.isEmpty()) + return null; + String adsPartition = formatPartition(odpsPartition);; + String[] partitions = adsPartition.split("/"); + for(int last = partitions.length; last > 0; last--){ + + String partitionPart = partitions[last-1]; + String newPart = partitionPart.replace(".*", "*").replace("*", ".*"); + if(newPart.split("=")[1].equals(".*")){ + adsPartition = adsPartition.substring(0,adsPartition.length()-partitionPart.length()); + }else{ + break; + } + if(adsPartition.endsWith("/")){ + adsPartition = adsPartition.substring(0,adsPartition.length()-1); + } + } + if (adsPartition.contains("*")) + throw DataXException.asDataXException(AdsWriterErrorCode.ODPS_PARTITION_FAILED, ""); + return adsPartition; + } + + public static String formatPartition(String partition) { + return partition.trim().replaceAll(" *= *", "=") + .replaceAll(" */ *", ",").replaceAll(" *, *", ",") + .replaceAll("'", "").replaceAll(",", "/"); + } + + public static String prepareJdbcUrl(Configuration conf) { + String adsURL = conf.getString(Key.ADS_URL); + String schema = conf.getString(Key.SCHEMA); + Long socketTimeout = conf.getLong(Key.SOCKET_TIMEOUT, + Constant.DEFAULT_SOCKET_TIMEOUT); + String suffix = conf.getString(Key.JDBC_URL_SUFFIX, ""); + return AdsUtil.prepareJdbcUrl(adsURL, schema, socketTimeout, suffix); + } + + public static String prepareJdbcUrl(String adsURL, String schema, + Long socketTimeout, String suffix) { + String jdbcUrl = null; + // like autoReconnect=true&failOverReadOnly=false&maxReconnects=10 + if (StringUtils.isNotBlank(suffix)) { + jdbcUrl = String + 
.format("jdbc:mysql://%s/%s?useUnicode=true&characterEncoding=UTF-8&socketTimeout=%s&%s", + adsURL, schema, socketTimeout, suffix); + } else { + jdbcUrl = String + .format("jdbc:mysql://%s/%s?useUnicode=true&characterEncoding=UTF-8&socketTimeout=%s", + adsURL, schema, socketTimeout); + } + return jdbcUrl; + } + + public static Connection getAdsConnect(Configuration conf) { + String userName = conf.getString(Key.USERNAME); + String passWord = conf.getString(Key.PASSWORD); + String jdbcUrl = AdsUtil.prepareJdbcUrl(conf); + Connection connection = DBUtil.getConnection(DataBaseType.ADS, jdbcUrl, userName, passWord); + return connection; + } +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Constant.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Constant.java new file mode 100644 index 0000000000..f0ab71ec18 --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Constant.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.writer.adswriter.util; + +public class Constant { + + public static final String LOADMODE = "load"; + + public static final String INSERTMODE = "insert"; + + public static final String DELETEMODE = "delete"; + + public static final String REPLACEMODE = "replace"; + + public static final String STREAMMODE = "stream"; + + public static final int DEFAULT_BATCH_SIZE = 32; + + public static final long DEFAULT_SOCKET_TIMEOUT = 3600000L; + + public static final int DEFAULT_RETRY_TIMES = 2; + + public static final String INSERT_TEMPLATE = "insert into %s ( %s ) values "; + + public static final String DELETE_TEMPLATE = "delete from %s where "; + + public static final String ADS_TABLE_INFO = "adsTableInfo"; + + public static final String ADS_QUOTE_CHARACTER = "`"; + +} diff --git a/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Key.java b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Key.java new file mode 100644 index 0000000000..3d31c8186f --- /dev/null +++ b/adswriter/src/main/java/com/alibaba/datax/plugin/writer/adswriter/util/Key.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.plugin.writer.adswriter.util; + + +public final class Key { + + public final static String ADS_URL = "url"; + + public final static String USERNAME = "username"; + + public final static String PASSWORD = "password"; + + public final static String SCHEMA = "schema"; + + public final static String ADS_TABLE = "table"; + + public final static String Life_CYCLE = "lifeCycle"; + + public final static String OVER_WRITE = "overWrite"; + + public final static String WRITE_MODE = "writeMode"; + + + public final static String COLUMN = "column"; + + public final static String OPIndex = "opIndex"; + + public final static String EMPTY_AS_NULL = "emptyAsNull"; + + public final static String BATCH_SIZE = "batchSize"; + + public final static String BUFFER_SIZE = "bufferSize"; + + public final static String PRE_SQL = "preSql"; + + public final static String POST_SQL = "postSql"; + + public final static String SOCKET_TIMEOUT = "socketTimeout"; + + public final static String RETRY_CONNECTION_TIME = "retryTimes"; + + public final static String JDBC_URL_SUFFIX = "urlSuffix"; + + /** + * 以下是odps writer的key + */ + public final static String PARTITION = "partition"; + + public final static String ODPSTABLENAME = "table"; + + public final static String ODPS_SERVER = "odpsServer"; + + public final static String TUNNEL_SERVER = "tunnelServer"; + + public final static String 
ACCESS_ID = "accessId"; + + public final static String ACCESS_KEY = "accessKey"; + + public final static String PROJECT = "project"; + + public final static String TRUNCATE = "truncate"; + +} \ No newline at end of file diff --git a/adswriter/src/main/resources/plugin.json b/adswriter/src/main/resources/plugin.json new file mode 100644 index 0000000000..a70fb36462 --- /dev/null +++ b/adswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "adswriter", + "class": "com.alibaba.datax.plugin.writer.adswriter.AdsWriter", + "description": "", + "developer": "alibaba" +} \ No newline at end of file diff --git a/adswriter/src/main/resources/plugin_job_template.json b/adswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..0753a226e8 --- /dev/null +++ b/adswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "adswriter", + "parameter": { + "url": "", + "username": "", + "password": "", + "schema": "", + "table": "", + "partition": "", + "overWrite": "", + "lifeCycle": 2 + } +} \ No newline at end of file diff --git a/common/pom.xml b/common/pom.xml new file mode 100755 index 0000000000..6cce789f23 --- /dev/null +++ b/common/pom.xml @@ -0,0 +1,75 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + datax-common + datax-common + jar + + + + org.apache.commons + commons-lang3 + + + com.alibaba + fastjson + + + commons-io + commons-io + + + + junit + junit + test + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + org.apache.httpcomponents + httpclient + 4.4 + test + + + org.apache.httpcomponents + fluent-hc + 4.4 + test + + + org.apache.commons + commons-math3 + 3.1.1 + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + diff --git a/common/src/main/java/com/alibaba/datax/common/base/BaseObject.java b/common/src/main/java/com/alibaba/datax/common/base/BaseObject.java new file mode 100755 index 0000000000..e7d06a9503 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/base/BaseObject.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.common.base; + +import org.apache.commons.lang3.builder.EqualsBuilder; +import org.apache.commons.lang3.builder.HashCodeBuilder; +import org.apache.commons.lang3.builder.ToStringBuilder; +import org.apache.commons.lang3.builder.ToStringStyle; + +public class BaseObject { + + @Override + public int hashCode() { + return HashCodeBuilder.reflectionHashCode(this, false); + } + + @Override + public boolean equals(Object object) { + return EqualsBuilder.reflectionEquals(this, object, false); + } + + @Override + public String toString() { + return ToStringBuilder.reflectionToString(this, + ToStringStyle.MULTI_LINE_STYLE); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/constant/CommonConstant.java b/common/src/main/java/com/alibaba/datax/common/constant/CommonConstant.java new file mode 100755 index 0000000000..423e16f926 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/constant/CommonConstant.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.common.constant; + +public final class CommonConstant { + /** + * 用于插件对自身 split 的每个 task 标识其使用的资源,以告知core 对 reader/writer split 之后的 task 进行拼接时需要根据资源标签进行更有意义的 shuffle 操作 + */ + public static String LOAD_BALANCE_RESOURCE_MARK = "loadBalanceResourceMark"; + +} diff --git a/common/src/main/java/com/alibaba/datax/common/constant/PluginType.java b/common/src/main/java/com/alibaba/datax/common/constant/PluginType.java new file mode 100755 index 
0000000000..ceee089e9e --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/constant/PluginType.java @@ -0,0 +1,20 @@ +package com.alibaba.datax.common.constant; + +/** + * Created by jingxing on 14-8-31. + */ +public enum PluginType { + //pluginType还代表了资源目录,很难扩展,或者说需要足够必要才扩展。先mark Handler(其实和transformer一样),再讨论 + READER("reader"), TRANSFORMER("transformer"), WRITER("writer"), HANDLER("handler"); + + private String pluginType; + + private PluginType(String pluginType) { + this.pluginType = pluginType; + } + + @Override + public String toString() { + return this.pluginType; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/BoolColumn.java b/common/src/main/java/com/alibaba/datax/common/element/BoolColumn.java new file mode 100755 index 0000000000..7699e152ae --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/BoolColumn.java @@ -0,0 +1,115 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + */ +public class BoolColumn extends Column { + + public BoolColumn(Boolean bool) { + super(bool, Column.Type.BOOL, 1); + } + + public BoolColumn(final String data) { + this(true); + this.validate(data); + if (null == data) { + this.setRawData(null); + this.setByteSize(0); + } else { + this.setRawData(Boolean.valueOf(data)); + this.setByteSize(1); + } + return; + } + + public BoolColumn() { + super(null, Column.Type.BOOL, 1); + } + + @Override + public Boolean asBoolean() { + if (null == super.getRawData()) { + return null; + } + + return (Boolean) super.getRawData(); + } + + @Override + public Long asLong() { + if (null == this.getRawData()) { + return null; + } + + return this.asBoolean() ? 1L : 0L; + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + return this.asBoolean() ? 1.0d : 0.0d; + } + + @Override + public String asString() { + if (null == super.getRawData()) { + return null; + } + + return this.asBoolean() ? 
"true" : "false"; + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + return BigInteger.valueOf(this.asLong()); + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + return BigDecimal.valueOf(this.asLong()); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bool类型不能转为Date ."); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Boolean类型不能转为Bytes ."); + } + + private void validate(final String data) { + if (null == data) { + return; + } + + if ("true".equalsIgnoreCase(data) || "false".equalsIgnoreCase(data)) { + return; + } + + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s]不能转为Bool .", data)); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/BytesColumn.java b/common/src/main/java/com/alibaba/datax/common/element/BytesColumn.java new file mode 100755 index 0000000000..d3cc599361 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/BytesColumn.java @@ -0,0 +1,84 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang3.ArrayUtils; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + */ +public class BytesColumn extends Column { + + public BytesColumn() { + this(null); + } + + public BytesColumn(byte[] bytes) { + super(ArrayUtils.clone(bytes), Column.Type.BYTES, null == bytes ? 
0 + : bytes.length); + } + + @Override + public byte[] asBytes() { + if (null == this.getRawData()) { + return null; + } + + return (byte[]) this.getRawData(); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + + try { + return ColumnCast.bytes2String(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("Bytes[%s]不能转为String .", this.toString())); + } + } + + @Override + public Long asLong() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Long ."); + } + + @Override + public BigDecimal asBigDecimal() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为BigDecimal ."); + } + + @Override + public BigInteger asBigInteger() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为BigInteger ."); + } + + @Override + public Double asDouble() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Long ."); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Date ."); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Bytes类型不能转为Boolean ."); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/Column.java b/common/src/main/java/com/alibaba/datax/common/element/Column.java new file mode 100755 index 0000000000..ed68e88d6b --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/Column.java @@ -0,0 +1,75 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.fastjson.JSON; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + *

+ */ +public abstract class Column { + + private Type type; + + private Object rawData; + + private int byteSize; + + public Column(final Object object, final Type type, int byteSize) { + this.rawData = object; + this.type = type; + this.byteSize = byteSize; + } + + public Object getRawData() { + return this.rawData; + } + + public Type getType() { + return this.type; + } + + public int getByteSize() { + return this.byteSize; + } + + protected void setType(Type type) { + this.type = type; + } + + protected void setRawData(Object rawData) { + this.rawData = rawData; + } + + protected void setByteSize(int byteSize) { + this.byteSize = byteSize; + } + + public abstract Long asLong(); + + public abstract Double asDouble(); + + public abstract String asString(); + + public abstract Date asDate(); + + public abstract byte[] asBytes(); + + public abstract Boolean asBoolean(); + + public abstract BigDecimal asBigDecimal(); + + public abstract BigInteger asBigInteger(); + + @Override + public String toString() { + return JSON.toJSONString(this); + } + + public enum Type { + BAD, NULL, INT, LONG, DOUBLE, STRING, BOOL, DATE, BYTES + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/ColumnCast.java b/common/src/main/java/com/alibaba/datax/common/element/ColumnCast.java new file mode 100755 index 0000000000..89d0a7c627 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/ColumnCast.java @@ -0,0 +1,199 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.time.DateFormatUtils; +import org.apache.commons.lang3.time.FastDateFormat; + +import java.io.UnsupportedEncodingException; +import java.text.ParseException; +import java.util.*; + +public final class ColumnCast { + + public static void bind(final Configuration configuration) { + StringCast.init(configuration); + DateCast.init(configuration); + BytesCast.init(configuration); + } + + public static Date string2Date(final StringColumn column) + throws ParseException { + return StringCast.asDate(column); + } + + public static byte[] string2Bytes(final StringColumn column) + throws UnsupportedEncodingException { + return StringCast.asBytes(column); + } + + public static String date2String(final DateColumn column) { + return DateCast.asString(column); + } + + public static String bytes2String(final BytesColumn column) + throws UnsupportedEncodingException { + return BytesCast.asString(column); + } +} + +class StringCast { + static String datetimeFormat = "yyyy-MM-dd HH:mm:ss"; + + static String dateFormat = "yyyy-MM-dd"; + + static String timeFormat = "HH:mm:ss"; + + static List extraFormats = Collections.emptyList(); + + static String timeZone = "GMT+8"; + + static FastDateFormat dateFormatter; + + static FastDateFormat timeFormatter; + + static FastDateFormat datetimeFormatter; + + static TimeZone timeZoner; + + static String encoding = "UTF-8"; + + static void init(final Configuration configuration) { + StringCast.datetimeFormat = configuration.getString( + "common.column.datetimeFormat", StringCast.datetimeFormat); + StringCast.dateFormat = configuration.getString( + "common.column.dateFormat", StringCast.dateFormat); + StringCast.timeFormat = configuration.getString( + "common.column.timeFormat", StringCast.timeFormat); + StringCast.extraFormats = configuration.getList( + "common.column.extraFormats", 
Collections.emptyList(), String.class); + + StringCast.timeZone = configuration.getString("common.column.timeZone", + StringCast.timeZone); + StringCast.timeZoner = TimeZone.getTimeZone(StringCast.timeZone); + + StringCast.datetimeFormatter = FastDateFormat.getInstance( + StringCast.datetimeFormat, StringCast.timeZoner); + StringCast.dateFormatter = FastDateFormat.getInstance( + StringCast.dateFormat, StringCast.timeZoner); + StringCast.timeFormatter = FastDateFormat.getInstance( + StringCast.timeFormat, StringCast.timeZoner); + + StringCast.encoding = configuration.getString("common.column.encoding", + StringCast.encoding); + } + + static Date asDate(final StringColumn column) throws ParseException { + if (null == column.asString()) { + return null; + } + + try { + return StringCast.datetimeFormatter.parse(column.asString()); + } catch (ParseException ignored) { + } + + try { + return StringCast.dateFormatter.parse(column.asString()); + } catch (ParseException ignored) { + } + + ParseException e; + try { + return StringCast.timeFormatter.parse(column.asString()); + } catch (ParseException ignored) { + e = ignored; + } + + for (String format : StringCast.extraFormats) { + try{ + return FastDateFormat.getInstance(format, StringCast.timeZoner).parse(column.asString()); + } catch (ParseException ignored){ + e = ignored; + } + } + throw e; + } + + static byte[] asBytes(final StringColumn column) + throws UnsupportedEncodingException { + if (null == column.asString()) { + return null; + } + + return column.asString().getBytes(StringCast.encoding); + } +} + +/** + * 后续为了可维护性,可以考虑直接使用 apache 的DateFormatUtils. + * + * 迟南已经修复了该问题,但是为了维护性,还是直接使用apache的内置函数 + */ +class DateCast { + + static String datetimeFormat = "yyyy-MM-dd HH:mm:ss"; + + static String dateFormat = "yyyy-MM-dd"; + + static String timeFormat = "HH:mm:ss"; + + static String timeZone = "GMT+8"; + + static TimeZone timeZoner = TimeZone.getTimeZone(DateCast.timeZone); + + static void init(final Configuration configuration) { + DateCast.datetimeFormat = configuration.getString( + "common.column.datetimeFormat", datetimeFormat); + DateCast.timeFormat = configuration.getString( + "common.column.timeFormat", timeFormat); + DateCast.dateFormat = configuration.getString( + "common.column.dateFormat", dateFormat); + DateCast.timeZone = configuration.getString("common.column.timeZone", + DateCast.timeZone); + DateCast.timeZoner = TimeZone.getTimeZone(DateCast.timeZone); + return; + } + + static String asString(final DateColumn column) { + if (null == column.asDate()) { + return null; + } + + switch (column.getSubType()) { + case DATE: + return DateFormatUtils.format(column.asDate(), DateCast.dateFormat, + DateCast.timeZoner); + case TIME: + return DateFormatUtils.format(column.asDate(), DateCast.timeFormat, + DateCast.timeZoner); + case DATETIME: + return DateFormatUtils.format(column.asDate(), + DateCast.datetimeFormat, DateCast.timeZoner); + default: + throw DataXException + .asDataXException(CommonErrorCode.CONVERT_NOT_SUPPORT, + "时间类型出现不支持类型,目前仅支持DATE/TIME/DATETIME。该类型属于编程错误,请反馈给DataX开发团队 ."); + } + } +} + +class BytesCast { + static String encoding = "utf-8"; + + static void init(final Configuration configuration) { + BytesCast.encoding = configuration.getString("common.column.encoding", + BytesCast.encoding); + return; + } + + static String asString(final BytesColumn column) + throws UnsupportedEncodingException { + if (null == column.asBytes()) { + return null; + } + + return new String(column.asBytes(), encoding); + } +} \ No newline at 
end of file diff --git a/common/src/main/java/com/alibaba/datax/common/element/DateColumn.java b/common/src/main/java/com/alibaba/datax/common/element/DateColumn.java new file mode 100755 index 0000000000..6626a6fbdd --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/DateColumn.java @@ -0,0 +1,130 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +/** + * Created by jingxing on 14-8-24. + */ +public class DateColumn extends Column { + + private DateType subType = DateType.DATETIME; + + public static enum DateType { + DATE, TIME, DATETIME + } + + /** + * 构建值为null的DateColumn,使用Date子类型为DATETIME + * */ + public DateColumn() { + this((Long)null); + } + + /** + * 构建值为stamp(Unix时间戳)的DateColumn,使用Date子类型为DATETIME + * 实际存储有date改为long的ms,节省存储 + * */ + public DateColumn(final Long stamp) { + super(stamp, Column.Type.DATE, (null == stamp ? 0 : 8)); + } + + /** + * 构建值为date(java.util.Date)的DateColumn,使用Date子类型为DATETIME + * */ + public DateColumn(final Date date) { + this(date == null ? null : date.getTime()); + } + + /** + * 构建值为date(java.sql.Date)的DateColumn,使用Date子类型为DATE,只有日期,没有时间 + * */ + public DateColumn(final java.sql.Date date) { + this(date == null ? null : date.getTime()); + this.setSubType(DateType.DATE); + } + + /** + * 构建值为time(java.sql.Time)的DateColumn,使用Date子类型为TIME,只有时间,没有日期 + * */ + public DateColumn(final java.sql.Time time) { + this(time == null ? null : time.getTime()); + this.setSubType(DateType.TIME); + } + + /** + * 构建值为ts(java.sql.Timestamp)的DateColumn,使用Date子类型为DATETIME + * */ + public DateColumn(final java.sql.Timestamp ts) { + this(ts == null ? 
null : ts.getTime()); + this.setSubType(DateType.DATETIME); + } + + @Override + public Long asLong() { + + return (Long)this.getRawData(); + } + + @Override + public String asString() { + try { + return ColumnCast.date2String(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("Date[%s]类型不能转为String .", this.toString())); + } + } + + @Override + public Date asDate() { + if (null == this.getRawData()) { + return null; + } + + return new Date((Long)this.getRawData()); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为Bytes ."); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为Boolean ."); + } + + @Override + public Double asDouble() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为Double ."); + } + + @Override + public BigInteger asBigInteger() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为BigInteger ."); + } + + @Override + public BigDecimal asBigDecimal() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Date类型不能转为BigDecimal ."); + } + + public DateType getSubType() { + return subType; + } + + public void setSubType(DateType subType) { + this.subType = subType; + } +} \ No newline at end of file diff --git a/common/src/main/java/com/alibaba/datax/common/element/DoubleColumn.java b/common/src/main/java/com/alibaba/datax/common/element/DoubleColumn.java new file mode 100755 index 0000000000..17170ea6c4 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/DoubleColumn.java @@ -0,0 +1,161 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +public class DoubleColumn extends Column { + + public DoubleColumn(final String data) { + this(data, null == data ? 0 : data.length()); + this.validate(data); + } + + public DoubleColumn(Long data) { + this(data == null ? (String) null : String.valueOf(data)); + } + + public DoubleColumn(Integer data) { + this(data == null ? (String) null : String.valueOf(data)); + } + + /** + * Double无法表示准确的小数数据,我们不推荐使用该方法保存Double数据,建议使用String作为构造入参 + * + * */ + public DoubleColumn(final Double data) { + this(data == null ? (String) null + : new BigDecimal(String.valueOf(data)).toPlainString()); + } + + /** + * Float无法表示准确的小数数据,我们不推荐使用该方法保存Float数据,建议使用String作为构造入参 + * + * */ + public DoubleColumn(final Float data) { + this(data == null ? (String) null + : new BigDecimal(String.valueOf(data)).toPlainString()); + } + + public DoubleColumn(final BigDecimal data) { + this(null == data ? (String) null : data.toPlainString()); + } + + public DoubleColumn(final BigInteger data) { + this(null == data ? 
(String) null : data.toString()); + } + + public DoubleColumn() { + this((String) null); + } + + private DoubleColumn(final String data, int byteSize) { + super(data, Column.Type.DOUBLE, byteSize); + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + try { + return new BigDecimal((String) this.getRawData()); + } catch (NumberFormatException e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s] 无法转换为Double类型 .", + (String) this.getRawData())); + } + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + String string = (String) this.getRawData(); + + boolean isDoubleSpecific = string.equals("NaN") + || string.equals("-Infinity") || string.equals("+Infinity"); + if (isDoubleSpecific) { + return Double.valueOf(string); + } + + BigDecimal result = this.asBigDecimal(); + OverFlowUtil.validateDoubleNotOverFlow(result); + + return result.doubleValue(); + } + + @Override + public Long asLong() { + if (null == this.getRawData()) { + return null; + } + + BigDecimal result = this.asBigDecimal(); + OverFlowUtil.validateLongNotOverFlow(result.toBigInteger()); + + return result.longValue(); + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + return this.asBigDecimal().toBigInteger(); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + return (String) this.getRawData(); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Double类型无法转为Bool ."); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Double类型无法转为Date类型 ."); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Double类型无法转为Bytes类型 ."); + } + + private void validate(final String data) { + if (null == data) { + return; + } + + if (data.equalsIgnoreCase("NaN") || data.equalsIgnoreCase("-Infinity") + || data.equalsIgnoreCase("Infinity")) { + return; + } + + try { + new BigDecimal(data); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s]无法转为Double类型 .", data)); + } + } + +} \ No newline at end of file diff --git a/common/src/main/java/com/alibaba/datax/common/element/LongColumn.java b/common/src/main/java/com/alibaba/datax/common/element/LongColumn.java new file mode 100755 index 0000000000..d8113f7c05 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/LongColumn.java @@ -0,0 +1,135 @@ +package com.alibaba.datax.common.element; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang3.math.NumberUtils; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +public class LongColumn extends Column { + + /** + * 从整形字符串表示转为LongColumn,支持Java科学计数法 + * + * NOTE:
+ * 如果data为浮点类型的字符串表示,数据将会失真,请使用DoubleColumn对接浮点字符串 + * + * */ + public LongColumn(final String data) { + super(null, Column.Type.LONG, 0); + if (null == data) { + return; + } + + try { + BigInteger rawData = NumberUtils.createBigDecimal(data) + .toBigInteger(); + super.setRawData(rawData); + + // 当 rawData 为[0-127]时,rawData.bitLength() < 8,导致其 byteSize = 0,简单起见,直接认为其长度为 data.length() + // super.setByteSize(rawData.bitLength() / 8); + super.setByteSize(data.length()); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[%s]不能转为Long .", data)); + } + } + + public LongColumn(Long data) { + this(null == data ? (BigInteger) null : BigInteger.valueOf(data)); + } + + public LongColumn(Integer data) { + this(null == data ? (BigInteger) null : BigInteger.valueOf(data)); + } + + public LongColumn(BigInteger data) { + this(data, null == data ? 0 : 8); + } + + private LongColumn(BigInteger data, int byteSize) { + super(data, Column.Type.LONG, byteSize); + } + + public LongColumn() { + this((BigInteger) null); + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + return (BigInteger) this.getRawData(); + } + + @Override + public Long asLong() { + BigInteger rawData = (BigInteger) this.getRawData(); + if (null == rawData) { + return null; + } + + OverFlowUtil.validateLongNotOverFlow(rawData); + + return rawData.longValue(); + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + BigDecimal decimal = this.asBigDecimal(); + OverFlowUtil.validateDoubleNotOverFlow(decimal); + + return decimal.doubleValue(); + } + + @Override + public Boolean asBoolean() { + if (null == this.getRawData()) { + return null; + } + + return this.asBigInteger().compareTo(BigInteger.ZERO) != 0 ? 
true + : false; + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + return new BigDecimal(this.asBigInteger()); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + return ((BigInteger) this.getRawData()).toString(); + } + + @Override + public Date asDate() { + if (null == this.getRawData()) { + return null; + } + return new Date(this.asLong()); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, "Long类型不能转为Bytes ."); + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/OverFlowUtil.java b/common/src/main/java/com/alibaba/datax/common/element/OverFlowUtil.java new file mode 100755 index 0000000000..39460c7ebc --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/OverFlowUtil.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.common.element; + +import java.math.BigDecimal; +import java.math.BigInteger; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +public final class OverFlowUtil { + public static final BigInteger MAX_LONG = BigInteger + .valueOf(Long.MAX_VALUE); + + public static final BigInteger MIN_LONG = BigInteger + .valueOf(Long.MIN_VALUE); + + public static final BigDecimal MIN_DOUBLE_POSITIVE = new BigDecimal( + String.valueOf(Double.MIN_VALUE)); + + public static final BigDecimal MAX_DOUBLE_POSITIVE = new BigDecimal( + String.valueOf(Double.MAX_VALUE)); + + public static boolean isLongOverflow(final BigInteger integer) { + return (integer.compareTo(OverFlowUtil.MAX_LONG) > 0 || integer + .compareTo(OverFlowUtil.MIN_LONG) < 0); + + } + + public static void validateLongNotOverFlow(final BigInteger integer) { + boolean isOverFlow = OverFlowUtil.isLongOverflow(integer); + + if (isOverFlow) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_OVER_FLOW, + String.format("[%s] 转为Long类型出现溢出 .", integer.toString())); + } + } + + public static boolean isDoubleOverFlow(final BigDecimal decimal) { + if (decimal.signum() == 0) { + return false; + } + + BigDecimal newDecimal = decimal; + boolean isPositive = decimal.signum() == 1; + if (!isPositive) { + newDecimal = decimal.negate(); + } + + return (newDecimal.compareTo(MIN_DOUBLE_POSITIVE) < 0 || newDecimal + .compareTo(MAX_DOUBLE_POSITIVE) > 0); + } + + public static void validateDoubleNotOverFlow(final BigDecimal decimal) { + boolean isOverFlow = OverFlowUtil.isDoubleOverFlow(decimal); + if (isOverFlow) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_OVER_FLOW, + String.format("[%s]转为Double类型出现溢出 .", + decimal.toPlainString())); + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/Record.java b/common/src/main/java/com/alibaba/datax/common/element/Record.java new file mode 100755 index 0000000000..d06d80aafb --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/Record.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.common.element; + +/** + * Created by jingxing on 14-8-24. 
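
The concrete Column types above (BoolColumn, BytesColumn, DateColumn, DoubleColumn, LongColumn) all funnel numeric range checks through OverFlowUtil and report unsupported casts as DataXException. A minimal sketch of that behaviour; the class name, wrapper `main`, and sample values are invented for illustration and are not part of this commit:

```java
import com.alibaba.datax.common.element.BoolColumn;
import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.DateColumn;
import com.alibaba.datax.common.element.DoubleColumn;
import com.alibaba.datax.common.element.LongColumn;
import com.alibaba.datax.common.exception.DataXException;

public class ColumnDemo {
    public static void main(String[] args) {
        // LongColumn keeps its raw value as a BigInteger; asLong() validates the range via OverFlowUtil.
        Column id = new LongColumn("12345");
        System.out.println(id.asLong());           // 12345
        System.out.println(id.asDouble());         // 12345.0

        // DoubleColumn stores the original string, so nothing is rounded before the conversion happens.
        Column price = new DoubleColumn("3.14");
        System.out.println(price.asBigDecimal());  // 3.14

        // BoolColumn maps to 1/0 for numeric conversions.
        Column flag = new BoolColumn("true");
        System.out.println(flag.asLong());         // 1

        // DateColumn stores the epoch millisecond internally, so asLong() returns it directly.
        Column when = new DateColumn(new java.util.Date());
        System.out.println(when.asLong());

        // Unsupported conversions surface as DataXException with CONVERT_NOT_SUPPORT.
        try {
            flag.asDate();
        } catch (DataXException e) {
            System.out.println(e.getMessage());
        }
    }
}
```
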
+ */ + +public interface Record { + + public void addColumn(Column column); + + public void setColumn(int i, final Column column); + + public Column getColumn(int i); + + public String toString(); + + public int getColumnNumber(); + + public int getByteSize(); + + public int getMemorySize(); + +} diff --git a/common/src/main/java/com/alibaba/datax/common/element/StringColumn.java b/common/src/main/java/com/alibaba/datax/common/element/StringColumn.java new file mode 100755 index 0000000000..11209f4688 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/element/StringColumn.java @@ -0,0 +1,163 @@ +package com.alibaba.datax.common.element; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.Date; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; + +/** + * Created by jingxing on 14-8-24. + */ + +public class StringColumn extends Column { + + public StringColumn() { + this((String) null); + } + + public StringColumn(final String rawData) { + super(rawData, Column.Type.STRING, (null == rawData ? 0 : rawData + .length())); + } + + @Override + public String asString() { + if (null == this.getRawData()) { + return null; + } + + return (String) this.getRawData(); + } + + private void validateDoubleSpecific(final String data) { + if ("NaN".equals(data) || "Infinity".equals(data) + || "-Infinity".equals(data)) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]属于Double特殊类型,不能转为其他类型 .", data)); + } + + return; + } + + @Override + public BigInteger asBigInteger() { + if (null == this.getRawData()) { + return null; + } + + this.validateDoubleSpecific((String) this.getRawData()); + + try { + return this.asBigDecimal().toBigInteger(); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, String.format( + "String[\"%s\"]不能转为BigInteger .", this.asString())); + } + } + + @Override + public Long asLong() { + if (null == this.getRawData()) { + return null; + } + + this.validateDoubleSpecific((String) this.getRawData()); + + try { + BigInteger integer = this.asBigInteger(); + OverFlowUtil.validateLongNotOverFlow(integer); + return integer.longValue(); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Long .", this.asString())); + } + } + + @Override + public BigDecimal asBigDecimal() { + if (null == this.getRawData()) { + return null; + } + + this.validateDoubleSpecific((String) this.getRawData()); + + try { + return new BigDecimal(this.asString()); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, String.format( + "String [\"%s\"] 不能转为BigDecimal .", this.asString())); + } + } + + @Override + public Double asDouble() { + if (null == this.getRawData()) { + return null; + } + + String data = (String) this.getRawData(); + if ("NaN".equals(data)) { + return Double.NaN; + } + + if ("Infinity".equals(data)) { + return Double.POSITIVE_INFINITY; + } + + if ("-Infinity".equals(data)) { + return Double.NEGATIVE_INFINITY; + } + + BigDecimal decimal = this.asBigDecimal(); + OverFlowUtil.validateDoubleNotOverFlow(decimal); + + return decimal.doubleValue(); + } + + @Override + public Boolean asBoolean() { + if (null == this.getRawData()) { + return null; + } + + if ("true".equalsIgnoreCase(this.asString())) { + return true; + } + + if 
("false".equalsIgnoreCase(this.asString())) { + return false; + } + + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Bool .", this.asString())); + } + + @Override + public Date asDate() { + try { + return ColumnCast.string2Date(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Date .", this.asString())); + } + } + + @Override + public byte[] asBytes() { + try { + return ColumnCast.string2Bytes(this); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONVERT_NOT_SUPPORT, + String.format("String[\"%s\"]不能转为Bytes .", this.asString())); + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/exception/CommonErrorCode.java b/common/src/main/java/com/alibaba/datax/common/exception/CommonErrorCode.java new file mode 100755 index 0000000000..8679ffb475 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/exception/CommonErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.common.exception; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * + */ +public enum CommonErrorCode implements ErrorCode { + + CONFIG_ERROR("Common-00", "您提供的配置文件存在错误信息,请检查您的作业配置 ."), + CONVERT_NOT_SUPPORT("Common-01", "同步数据出现业务脏数据情况,数据类型转换错误 ."), + CONVERT_OVER_FLOW("Common-02", "同步数据出现业务脏数据情况,数据类型转换溢出 ."), + RETRY_FAIL("Common-10", "方法调用多次仍旧失败 ."), + RUNTIME_ERROR("Common-11", "运行时内部调用错误 ."), + HOOK_INTERNAL_ERROR("Common-12", "Hook运行错误 ."), + SHUT_DOWN_TASK("Common-20", "Task收到了shutdown指令,为failover做准备"), + WAIT_TIME_EXCEED("Common-21", "等待时间超出范围"), + TASK_HUNG_EXPIRED("Common-22", "任务hung住,Expired"); + + private final String code; + + private final String describe; + + private CommonErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]", this.code, + this.describe); + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/exception/DataXException.java b/common/src/main/java/com/alibaba/datax/common/exception/DataXException.java new file mode 100755 index 0000000000..f360e69900 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/exception/DataXException.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.common.exception; + +import com.alibaba.datax.common.spi.ErrorCode; + +import java.io.PrintWriter; +import java.io.StringWriter; + +public class DataXException extends RuntimeException { + + private static final long serialVersionUID = 1L; + + private ErrorCode errorCode; + + public DataXException(ErrorCode errorCode, String errorMessage) { + super(errorCode.toString() + " - " + errorMessage); + this.errorCode = errorCode; + } + + private DataXException(ErrorCode errorCode, String errorMessage, Throwable cause) { + super(errorCode.toString() + " - " + getMessage(errorMessage) + " - " + getMessage(cause), cause); + + this.errorCode = errorCode; + } + + public static DataXException asDataXException(ErrorCode errorCode, String message) { + return new DataXException(errorCode, message); + } + + public static DataXException asDataXException(ErrorCode errorCode, String message, Throwable cause) { + if (cause instanceof DataXException) { + return (DataXException) cause; + } + return new 
DataXException(errorCode, message, cause); + } + + public static DataXException asDataXException(ErrorCode errorCode, Throwable cause) { + if (cause instanceof DataXException) { + return (DataXException) cause; + } + return new DataXException(errorCode, getMessage(cause), cause); + } + + public ErrorCode getErrorCode() { + return this.errorCode; + } + + private static String getMessage(Object obj) { + if (obj == null) { + return ""; + } + + if (obj instanceof Throwable) { + StringWriter str = new StringWriter(); + PrintWriter pw = new PrintWriter(str); + ((Throwable) obj).printStackTrace(pw); + return str.toString(); + // return ((Throwable) obj).getMessage(); + } else { + return obj.toString(); + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/exception/ExceptionTracker.java b/common/src/main/java/com/alibaba/datax/common/exception/ExceptionTracker.java new file mode 100644 index 0000000000..f6d3732e2a --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/exception/ExceptionTracker.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.common.exception; + +import java.io.PrintWriter; +import java.io.StringWriter; + +public final class ExceptionTracker { + public static final int STRING_BUFFER = 1024; + + public static String trace(Throwable ex) { + StringWriter sw = new StringWriter(STRING_BUFFER); + PrintWriter pw = new PrintWriter(sw); + ex.printStackTrace(pw); + return sw.toString(); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/AbstractJobPlugin.java b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractJobPlugin.java new file mode 100755 index 0000000000..946adfd0e4 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractJobPlugin.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.common.plugin; + +/** + * Created by jingxing on 14-8-24. 
+ */ +public abstract class AbstractJobPlugin extends AbstractPlugin { + /** + * @return the jobPluginCollector + */ + public JobPluginCollector getJobPluginCollector() { + return jobPluginCollector; + } + + /** + * @param jobPluginCollector + * the jobPluginCollector to set + */ + public void setJobPluginCollector( + JobPluginCollector jobPluginCollector) { + this.jobPluginCollector = jobPluginCollector; + } + + private JobPluginCollector jobPluginCollector; + +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/AbstractPlugin.java b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractPlugin.java new file mode 100755 index 0000000000..184ee89ece --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractPlugin.java @@ -0,0 +1,87 @@ +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.common.util.Configuration; + +public abstract class AbstractPlugin extends BaseObject implements Pluginable { + //作业的config + private Configuration pluginJobConf; + + //插件本身的plugin + private Configuration pluginConf; + + // by qiangsi.lq。 修改为对端的作业configuration + private Configuration peerPluginJobConf; + + private String peerPluginName; + + @Override + public String getPluginName() { + assert null != this.pluginConf; + return this.pluginConf.getString("name"); + } + + @Override + public String getDeveloper() { + assert null != this.pluginConf; + return this.pluginConf.getString("developer"); + } + + @Override + public String getDescription() { + assert null != this.pluginConf; + return this.pluginConf.getString("description"); + } + + @Override + public Configuration getPluginJobConf() { + return pluginJobConf; + } + + @Override + public void setPluginJobConf(Configuration pluginJobConf) { + this.pluginJobConf = pluginJobConf; + } + + @Override + public void setPluginConf(Configuration pluginConf) { + this.pluginConf = pluginConf; + } + + @Override + public Configuration getPeerPluginJobConf() { + return peerPluginJobConf; + } + + @Override + public void setPeerPluginJobConf(Configuration peerPluginJobConf) { + this.peerPluginJobConf = peerPluginJobConf; + } + + @Override + public String getPeerPluginName() { + return peerPluginName; + } + + @Override + public void setPeerPluginName(String peerPluginName) { + this.peerPluginName = peerPluginName; + } + + public void preCheck() { + } + + public void prepare() { + } + + public void post() { + } + + public void preHandler(Configuration jobConfiguration){ + + } + + public void postHandler(Configuration jobConfiguration){ + + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/AbstractTaskPlugin.java b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractTaskPlugin.java new file mode 100755 index 0000000000..39fbbe9b52 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/AbstractTaskPlugin.java @@ -0,0 +1,37 @@ +package com.alibaba.datax.common.plugin; + +/** + * Created by jingxing on 14-8-24. 
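
The base plugin classes above only wire configuration and collectors; concrete plugins hook into the lifecycle methods. A hedged sketch of a job-side plugin follows — the class name `DemoJob` and the configuration path `"table"` are hypothetical, not defined anywhere in this commit:

```java
import com.alibaba.datax.common.plugin.AbstractJobPlugin;
import com.alibaba.datax.common.util.Configuration;

public class DemoJob extends AbstractJobPlugin {

    private Configuration jobConf;
    private String targetTable;

    @Override
    public void init() {
        // pluginJobConf is this plugin's own parameter block, injected by the framework before init().
        this.jobConf = super.getPluginJobConf();
        this.targetTable = this.jobConf.getString("table");
    }

    @Override
    public void prepare() {
        // e.g. create or truncate the target table before any task starts writing
    }

    @Override
    public void post() {
        // e.g. run cleanup SQL, or read the custom messages the tasks collected
        // through getJobPluginCollector().getMessage(...)
    }

    @Override
    public void destroy() {
        // release any resources held by the job plugin
    }
}
```
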
+ */ +public abstract class AbstractTaskPlugin extends AbstractPlugin { + + //TaskPlugin 应该具备taskId + private int taskGroupId; + private int taskId; + private TaskPluginCollector taskPluginCollector; + + public TaskPluginCollector getTaskPluginCollector() { + return taskPluginCollector; + } + + public void setTaskPluginCollector( + TaskPluginCollector taskPluginCollector) { + this.taskPluginCollector = taskPluginCollector; + } + + public int getTaskId() { + return taskId; + } + + public void setTaskId(int taskId) { + this.taskId = taskId; + } + + public int getTaskGroupId() { + return taskGroupId; + } + + public void setTaskGroupId(int taskGroupId) { + this.taskGroupId = taskGroupId; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/JobPluginCollector.java b/common/src/main/java/com/alibaba/datax/common/plugin/JobPluginCollector.java new file mode 100755 index 0000000000..6eb02ab4e7 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/JobPluginCollector.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.common.plugin; + +import java.util.List; +import java.util.Map; + +/** + * Created by jingxing on 14-9-9. + */ +public interface JobPluginCollector extends PluginCollector { + + /** + * 从Task获取自定义收集信息 + * + * */ + Map> getMessage(); + + /** + * 从Task获取自定义收集信息 + * + * */ + List getMessage(String key); +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/PluginCollector.java b/common/src/main/java/com/alibaba/datax/common/plugin/PluginCollector.java new file mode 100755 index 0000000000..f2af398dd3 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/PluginCollector.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.common.plugin; + + +/** + * 这里只是一个标示类 + * */ +public interface PluginCollector { + +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/Pluginable.java b/common/src/main/java/com/alibaba/datax/common/plugin/Pluginable.java new file mode 100755 index 0000000000..ac28f6a294 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/Pluginable.java @@ -0,0 +1,30 @@ +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.util.Configuration; + +public interface Pluginable { + String getDeveloper(); + + String getDescription(); + + void setPluginConf(Configuration pluginConf); + + void init(); + + void destroy(); + + String getPluginName(); + + Configuration getPluginJobConf(); + + Configuration getPeerPluginJobConf(); + + public String getPeerPluginName(); + + void setPluginJobConf(Configuration jobConf); + + void setPeerPluginJobConf(Configuration peerPluginJobConf); + + public void setPeerPluginName(String peerPluginName); + +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/RecordReceiver.java b/common/src/main/java/com/alibaba/datax/common/plugin/RecordReceiver.java new file mode 100755 index 0000000000..74f236f371 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/RecordReceiver.java @@ -0,0 +1,26 @@ +/** + * (C) 2010-2013 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.element.Record; + +public interface RecordReceiver { + + public Record getFromReader(); + + public void shutdown(); +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/RecordSender.java b/common/src/main/java/com/alibaba/datax/common/plugin/RecordSender.java new file mode 100755 index 0000000000..0d6926098f --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/RecordSender.java @@ -0,0 +1,32 @@ +/** + * (C) 2010-2013 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.element.Record; + +public interface RecordSender { + + public Record createRecord(); + + public void sendToWriter(Record record); + + public void flush(); + + public void terminate(); + + public void shutdown(); +} diff --git a/common/src/main/java/com/alibaba/datax/common/plugin/TaskPluginCollector.java b/common/src/main/java/com/alibaba/datax/common/plugin/TaskPluginCollector.java new file mode 100755 index 0000000000..f0c85fe6ce --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/plugin/TaskPluginCollector.java @@ -0,0 +1,57 @@ +package com.alibaba.datax.common.plugin; + +import com.alibaba.datax.common.element.Record; + +/** + * + * 该接口提供给Task Plugin用来记录脏数据和自定义信息。
+ * + * 1. 脏数据记录,TaskPluginCollector提供多种脏数据记录的适配,包括本地输出、集中式汇报等等
+ * 2. 自定义信息,所有的task插件运行过程中可以通过TaskPluginCollector收集信息,
+ * Job的插件在POST过程中通过getMessage()接口获取信息 + */ +public abstract class TaskPluginCollector implements PluginCollector { + /** + * 收集脏数据 + * + * @param dirtyRecord + * 脏数据信息 + * @param t + * 异常信息 + * @param errorMessage + * 错误的提示信息 + */ + public abstract void collectDirtyRecord(final Record dirtyRecord, + final Throwable t, final String errorMessage); + + /** + * 收集脏数据 + * + * @param dirtyRecord + * 脏数据信息 + * @param errorMessage + * 错误的提示信息 + */ + public void collectDirtyRecord(final Record dirtyRecord, + final String errorMessage) { + this.collectDirtyRecord(dirtyRecord, null, errorMessage); + } + + /** + * 收集脏数据 + * + * @param dirtyRecord + * 脏数据信息 + * @param t + * 异常信息 + */ + public void collectDirtyRecord(final Record dirtyRecord, final Throwable t) { + this.collectDirtyRecord(dirtyRecord, t, ""); + } + + /** + * 收集自定义信息,Job插件可以通过getMessage获取该信息
+ * 如果多个key冲突,内部使用List记录同一个key,多个value情况。
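
To make the dirty-record and custom-message flow above concrete, here is a minimal sketch of how a writer-side task might drain records from the framework and report failures through TaskPluginCollector. The method names `writeAll`/`convertAndWrite` and the message key `"dirtyCount"` are invented for illustration:

```java
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.plugin.TaskPluginCollector;

public class DirtyRecordDemo {

    public void writeAll(RecordReceiver receiver, TaskPluginCollector collector) {
        Record record;
        long dirty = 0;
        // getFromReader() returns null once the reader side has finished sending records.
        while ((record = receiver.getFromReader()) != null) {
            try {
                convertAndWrite(record);
            } catch (Exception e) {
                // The bad record is not dropped silently: it is handed to the collector,
                // which may print it locally or report it centrally depending on the adapter.
                collector.collectDirtyRecord(record, e, "failed to write record");
                dirty++;
            }
        }
        // Custom key/value messages can later be fetched by the Job plugin in post()
        // via JobPluginCollector.getMessage(...).
        collector.collectMessage("dirtyCount", String.valueOf(dirty));
    }

    private void convertAndWrite(Record record) {
        // hypothetical: map each Column to the target type and send it to the destination
    }
}
```
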
+ * */ + public abstract void collectMessage(final String key, final String value); +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/ErrorCode.java b/common/src/main/java/com/alibaba/datax/common/spi/ErrorCode.java new file mode 100755 index 0000000000..053f99a479 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/ErrorCode.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.common.spi; + +/** + * 尤其注意:最好提供toString()实现。例如: + * + *

+ * 
+ * @Override
+ * public String toString() {
+ * 	return String.format("Code:[%s], Description:[%s]. ", this.code, this.describe);
+ * }
+ * 
+ * + */ +public interface ErrorCode { + // 错误码编号 + String getCode(); + + // 错误码描述 + String getDescription(); + + /** 必须提供toString的实现 + * + *
+	 * @Override
+	 * public String toString() {
+	 * 	return String.format("Code:[%s], Description:[%s]. ", this.code, this.describe);
+	 * }
+	 * 
+ * + */ + String toString(); +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/Hook.java b/common/src/main/java/com/alibaba/datax/common/spi/Hook.java new file mode 100755 index 0000000000..d510f57c18 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/Hook.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.common.spi; + +import com.alibaba.datax.common.util.Configuration; + +import java.util.Map; + +/** + * Created by xiafei.qiuxf on 14/12/17. + */ +public interface Hook { + + /** + * 返回名字 + * + * @return + */ + public String getName(); + + /** + * TODO 文档 + * + * @param jobConf + * @param msg + */ + public void invoke(Configuration jobConf, Map msg); + +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/Reader.java b/common/src/main/java/com/alibaba/datax/common/spi/Reader.java new file mode 100755 index 0000000000..fec41a9f03 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/Reader.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.common.spi; + +import java.util.List; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.plugin.RecordSender; + +/** + * 每个Reader插件在其内部内部实现Job、Task两个内部类。 + * + * + * */ +public abstract class Reader extends BaseObject { + + /** + * 每个Reader插件必须实现Job内部类。 + * + * */ + public static abstract class Job extends AbstractJobPlugin { + + /** + * 切分任务 + * + * @param adviceNumber + * + * 着重说明下,adviceNumber是框架建议插件切分的任务数,插件开发人员最好切分出来的任务数>= + * adviceNumber。
+ *
+ * 之所以采取这个建议是为了给用户最好的实现,例如框架根据计算认为用户数据存储可以支持100个并发连接, + * 并且用户认为需要100个并发。 此时,插件开发人员如果能够根据上述切分规则进行切分并做到>=100连接信息, + * DataX就可以同时启动100个Channel,这样给用户最好的吞吐量
+ * 例如用户同步一张Mysql单表,但是认为可以到10并发吞吐量,插件开发人员最好对该表进行切分,比如使用主键范围切分, + * 并且如果最终切分任务数到>=10,我们就可以提供给用户最大的吞吐量。
+ *
+ * 当然,我们这里只是提供一个建议值,Reader插件可以按照自己规则切分。但是我们更建议按照框架提供的建议值来切分。
+ *
+ * 对于ODPS写入OTS而言,如果存在预排序预切分问题,这样就可能只能按照分区信息切分,无法更细粒度切分, + * 这类情况只能按照源头物理信息切分规则切分。
+ *
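
To make the splitting advice above concrete, here is a minimal sketch of a Reader that slices a numeric primary-key range into adviceNumber query ranges. The plugin name, the `"querySql"` key, the id bounds, and the `Configuration.clone()`/`set()` helpers are assumptions for illustration and are not defined in this commit:

```java
import java.util.ArrayList;
import java.util.List;

import com.alibaba.datax.common.plugin.RecordSender;
import com.alibaba.datax.common.spi.Reader;
import com.alibaba.datax.common.util.Configuration;

public class RangeSplitReader extends Reader {

    public static class Job extends Reader.Job {

        private Configuration originalConfig;

        @Override
        public void init() {
            this.originalConfig = super.getPluginJobConf();
        }

        @Override
        public List<Configuration> split(int adviceNumber) {
            long minId = 0L;            // hypothetical: in practice read SELECT MIN(id)/MAX(id) first
            long maxId = 1000000L;
            long step = Math.max(1L, (maxId - minId + 1) / adviceNumber);

            List<Configuration> slices = new ArrayList<Configuration>();
            for (long lower = minId; lower <= maxId; lower += step) {
                long upper = Math.min(lower + step - 1, maxId);
                // each slice is an independent copy of the job configuration, narrowed to one id range
                Configuration slice = this.originalConfig.clone();
                slice.set("querySql",
                        String.format("select * from demo_table where id between %d and %d", lower, upper));
                slices.add(slice);
            }
            // returning >= adviceNumber slices lets the framework run that many channels in parallel
            return slices;
        }

        @Override
        public void destroy() {
        }
    }

    public static class Task extends Reader.Task {

        @Override
        public void startRead(RecordSender recordSender) {
            // hypothetical: execute this slice's querySql and, for each row,
            //   Record record = recordSender.createRecord();
            //   record.addColumn(...);
            //   recordSender.sendToWriter(record);
        }

        @Override
        public void init() {
        }

        @Override
        public void destroy() {
        }
    }
}
```

The writer side is symmetrical but stricter: as the mandatoryNumber parameter documented further below explains, Writer.Job.split must return exactly as many configurations as the reader produced, otherwise the framework reports an error.
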
+ * + * + * */ + public abstract List split(int adviceNumber); + } + + public static abstract class Task extends AbstractTaskPlugin { + public abstract void startRead(RecordSender recordSender); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/spi/Writer.java b/common/src/main/java/com/alibaba/datax/common/spi/Writer.java new file mode 100755 index 0000000000..457eb6860c --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/spi/Writer.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.common.spi; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.plugin.RecordReceiver; + +import java.util.List; + +/** + * 每个Writer插件需要实现Writer类,并在其内部实现Job、Task两个内部类。 + * + * + * */ +public abstract class Writer extends BaseObject { + /** + * 每个Writer插件必须实现Job内部类 + */ + public abstract static class Job extends AbstractJobPlugin { + /** + * 切分任务。
+ * + * @param mandatoryNumber + * 为了做到Reader、Writer任务数对等,这里要求Writer插件必须按照源端的切分数进行切分。否则框架报错! + * + * */ + public abstract List split(int mandatoryNumber); + } + + /** + * 每个Writer插件必须实现Task内部类 + */ + public abstract static class Task extends AbstractTaskPlugin { + + public abstract void startWrite(RecordReceiver lineReceiver); + + public boolean supportFailOver(){return false;} + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/statistics/PerfRecord.java b/common/src/main/java/com/alibaba/datax/common/statistics/PerfRecord.java new file mode 100644 index 0000000000..74b26eeb60 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/statistics/PerfRecord.java @@ -0,0 +1,258 @@ +package com.alibaba.datax.common.statistics; + +import com.alibaba.datax.common.util.HostUtils; +import org.apache.commons.lang3.time.DateFormatUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Date; + +/** + * Created by liqiang on 15/8/23. + */ +@SuppressWarnings("NullableProblems") +public class PerfRecord implements Comparable { + private static Logger perf = LoggerFactory.getLogger(PerfRecord.class); + private static String datetimeFormat = "yyyy-MM-dd HH:mm:ss"; + + + public enum PHASE { + /** + * task total运行的时间,前10为框架统计,后面为部分插件的个性统计 + */ + TASK_TOTAL(0), + + READ_TASK_INIT(1), + READ_TASK_PREPARE(2), + READ_TASK_DATA(3), + READ_TASK_POST(4), + READ_TASK_DESTROY(5), + + WRITE_TASK_INIT(6), + WRITE_TASK_PREPARE(7), + WRITE_TASK_DATA(8), + WRITE_TASK_POST(9), + WRITE_TASK_DESTROY(10), + + /** + * SQL_QUERY: sql query阶段, 部分reader的个性统计 + */ + SQL_QUERY(100), + /** + * 数据从sql全部读出来 + */ + RESULT_NEXT_ALL(101), + + /** + * only odps block close + */ + ODPS_BLOCK_CLOSE(102), + + WAIT_READ_TIME(103), + + WAIT_WRITE_TIME(104), + + TRANSFORMER_TIME(201); + + private int val; + + PHASE(int val) { + this.val = val; + } + + public int toInt(){ + return val; + } + } + + public enum ACTION{ + start, + end + } + + private final int taskGroupId; + private final int taskId; + private final PHASE phase; + private volatile ACTION action; + private volatile Date startTime; + private volatile long elapsedTimeInNs = -1; + private volatile long count = 0; + private volatile long size = 0; + + private volatile long startTimeInNs; + private volatile boolean isReport = false; + + public PerfRecord(int taskGroupId, int taskId, PHASE phase) { + this.taskGroupId = taskGroupId; + this.taskId = taskId; + this.phase = phase; + } + + public static void addPerfRecord(int taskGroupId, int taskId, PHASE phase, long startTime,long elapsedTimeInNs) { + if(PerfTrace.getInstance().isEnable()) { + PerfRecord perfRecord = new PerfRecord(taskGroupId, taskId, phase); + perfRecord.elapsedTimeInNs = elapsedTimeInNs; + perfRecord.action = ACTION.end; + perfRecord.startTime = new Date(startTime); + //在PerfTrace里注册 + PerfTrace.getInstance().tracePerfRecord(perfRecord); + perf.info(perfRecord.toString()); + } + } + + public void start() { + if(PerfTrace.getInstance().isEnable()) { + this.startTime = new Date(); + this.startTimeInNs = System.nanoTime(); + this.action = ACTION.start; + //在PerfTrace里注册 + PerfTrace.getInstance().tracePerfRecord(this); + perf.info(toString()); + } + } + + public void addCount(long count) { + this.count += count; + } + + public void addSize(long size) { + this.size += size; + } + + public void end() { + if(PerfTrace.getInstance().isEnable()) { + this.elapsedTimeInNs = System.nanoTime() - startTimeInNs; + this.action = ACTION.end; + 
PerfTrace.getInstance().tracePerfRecord(this); + perf.info(toString()); + } + } + + public void end(long elapsedTimeInNs) { + if(PerfTrace.getInstance().isEnable()) { + this.elapsedTimeInNs = elapsedTimeInNs; + this.action = ACTION.end; + PerfTrace.getInstance().tracePerfRecord(this); + perf.info(toString()); + } + } + + public String toString() { + return String.format("%s,%s,%s,%s,%s,%s,%s,%s,%s,%s" + , getInstId(), taskGroupId, taskId, phase, action, + DateFormatUtils.format(startTime, datetimeFormat), elapsedTimeInNs, count, size,getHostIP()); + } + + + @Override + public int compareTo(PerfRecord o) { + if (o == null) { + return 1; + } + return this.elapsedTimeInNs > o.elapsedTimeInNs ? 1 : this.elapsedTimeInNs == o.elapsedTimeInNs ? 0 : -1; + } + + @Override + public int hashCode() { + long jobId = getInstId(); + int result = (int) (jobId ^ (jobId >>> 32)); + result = 31 * result + taskGroupId; + result = 31 * result + taskId; + result = 31 * result + phase.toInt(); + result = 31 * result + (startTime != null ? startTime.hashCode() : 0); + return result; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if(!(o instanceof PerfRecord)){ + return false; + } + + PerfRecord dst = (PerfRecord)o; + + if (this.getInstId() != dst.getInstId()) return false; + if (this.taskGroupId != dst.taskGroupId) return false; + if (this.taskId != dst.taskId) return false; + if (phase != null ? !phase.equals(dst.phase) : dst.phase != null) return false; + if (startTime != null ? !startTime.equals(dst.startTime) : dst.startTime != null) return false; + return true; + } + + public PerfRecord copy() { + PerfRecord copy = new PerfRecord(this.taskGroupId, this.getTaskId(), this.phase); + copy.action = this.action; + copy.startTime = this.startTime; + copy.elapsedTimeInNs = this.elapsedTimeInNs; + copy.count = this.count; + copy.size = this.size; + return copy; + } + public int getTaskGroupId() { + return taskGroupId; + } + + public int getTaskId() { + return taskId; + } + + public PHASE getPhase() { + return phase; + } + + public ACTION getAction() { + return action; + } + + public long getElapsedTimeInNs() { + return elapsedTimeInNs; + } + + public long getCount() { + return count; + } + + public long getSize() { + return size; + } + + public long getInstId(){ + return PerfTrace.getInstance().getInstId(); + } + + public String getHostIP(){ + return HostUtils.IP; + } + + public String getHostName(){ + return HostUtils.HOSTNAME; + } + + public Date getStartTime() { + return startTime; + } + + public long getStartTimeInMs() { + return startTime.getTime(); + } + + public long getStartTimeInNs() { + return startTimeInNs; + } + + public String getDatetime(){ + if(startTime == null){ + return "null time"; + } + return DateFormatUtils.format(startTime, datetimeFormat); + } + + public boolean isReport() { + return isReport; + } + + public void setIsReport(boolean isReport) { + this.isReport = isReport; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/statistics/PerfTrace.java b/common/src/main/java/com/alibaba/datax/common/statistics/PerfTrace.java new file mode 100644 index 0000000000..ea9aa42110 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/statistics/PerfTrace.java @@ -0,0 +1,907 @@ +package com.alibaba.datax.common.statistics; + +import com.alibaba.datax.common.statistics.PerfRecord.PHASE; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.HostUtils; +import org.apache.commons.lang3.StringUtils; +import 
org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.ParseException; +import java.text.SimpleDateFormat; +import java.util.*; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.TimeUnit; + +/** + * PerfTrace 记录 job(local模式),taskGroup(distribute模式),因为这2种都是jvm,即一个jvm里只需要有1个PerfTrace。 + */ + +public class PerfTrace { + + private static Logger LOG = LoggerFactory.getLogger(PerfTrace.class); + private static PerfTrace instance; + private static final Object lock = new Object(); + private String perfTraceId; + private volatile boolean enable; + private volatile boolean isJob; + private long instId; + private long jobId; + private long jobVersion; + private int taskGroupId; + private int channelNumber; + + private int priority; + private int batchSize = 500; + private volatile boolean perfReportEnable = true; + + //jobid_jobversion,instanceid,taskid, src_mark, dst_mark, + private Map taskDetails = new ConcurrentHashMap(); + //PHASE => PerfRecord + private ConcurrentHashMap perfRecordMaps4print = new ConcurrentHashMap(); + // job_phase => SumPerf4Report + private SumPerf4Report sumPerf4Report = new SumPerf4Report(); + private SumPerf4Report sumPerf4Report4NotEnd; + private Configuration jobInfo; + private final Set needReportPool4NotEnd = new HashSet(); + private final List totalEndReport = new ArrayList(); + + /** + * 单实例 + * + * @param isJob + * @param jobId + * @param taskGroupId + * @return + */ + public static PerfTrace getInstance(boolean isJob, long jobId, int taskGroupId, int priority, boolean enable) { + + if (instance == null) { + synchronized (lock) { + if (instance == null) { + instance = new PerfTrace(isJob, jobId, taskGroupId, priority, enable); + } + } + } + return instance; + } + + /** + * 因为一个JVM只有一个,因此在getInstance(isJob,jobId,taskGroupId)调用完成实例化后,方便后续调用,直接返回该实例 + * + * @return + */ + public static PerfTrace getInstance() { + if (instance == null) { + LOG.error("PerfTrace instance not be init! must have some error! "); + synchronized (lock) { + if (instance == null) { + instance = new PerfTrace(false, -1111, -1111, 0, false); + } + } + } + return instance; + } + + private PerfTrace(boolean isJob, long jobId, int taskGroupId, int priority, boolean enable) { + try { + this.perfTraceId = isJob ? "job_" + jobId : String.format("taskGroup_%s_%s", jobId, taskGroupId); + this.enable = enable; + this.isJob = isJob; + this.taskGroupId = taskGroupId; + this.instId = jobId; + this.priority = priority; + LOG.info(String.format("PerfTrace traceId=%s, isEnable=%s, priority=%s", this.perfTraceId, this.enable, this.priority)); + + } catch (Exception e) { + // do nothing + this.enable = false; + } + } + + public void addTaskDetails(int taskId, String detail) { + if (enable) { + String before = ""; + int index = detail.indexOf("?"); + String current = detail.substring(0, index == -1 ? 
detail.length() : index); + if (current.indexOf("[") >= 0) { + current += "]"; + } + if (taskDetails.containsKey(taskId)) { + before = taskDetails.get(taskId).trim(); + } + if (StringUtils.isEmpty(before)) { + before = ""; + } else { + before += ","; + } + this.taskDetails.put(taskId, before + current); + } + } + + public void tracePerfRecord(PerfRecord perfRecord) { + try { + if (enable) { + long curNanoTime = System.nanoTime(); + //ArrayList非线程安全 + switch (perfRecord.getAction()) { + case end: + synchronized (totalEndReport) { + totalEndReport.add(perfRecord); + + if (totalEndReport.size() > batchSize * 10) { + sumPerf4EndPrint(totalEndReport); + } + } + + if (perfReportEnable && needReport(perfRecord)) { + synchronized (needReportPool4NotEnd) { + sumPerf4Report.add(curNanoTime,perfRecord); + needReportPool4NotEnd.remove(perfRecord); + } + } + + break; + case start: + if (perfReportEnable && needReport(perfRecord)) { + synchronized (needReportPool4NotEnd) { + needReportPool4NotEnd.add(perfRecord); + } + } + break; + } + } + } catch (Exception e) { + // do nothing + } + } + + private boolean needReport(PerfRecord perfRecord) { + switch (perfRecord.getPhase()) { + case TASK_TOTAL: + case SQL_QUERY: + case RESULT_NEXT_ALL: + case ODPS_BLOCK_CLOSE: + return true; + } + return false; + } + + public String summarizeNoException() { + String res; + try { + res = summarize(); + } catch (Exception e) { + res = "PerfTrace summarize has Exception " + e.getMessage(); + } + return res; + } + + //任务结束时,对当前的perf总汇总统计 + private synchronized String summarize() { + if (!enable) { + return "PerfTrace not enable!"; + } + + if (totalEndReport.size() > 0) { + sumPerf4EndPrint(totalEndReport); + } + + StringBuilder info = new StringBuilder(); + info.append("\n === total summarize info === \n"); + info.append("\n 1. all phase average time info and max time task info: \n\n"); + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %-100s\n", "PHASE", "AVERAGE USED TIME", "ALL TASK NUM", "MAX USED TIME", "MAX TASK ID", "MAX TASK INFO")); + + List keys = new ArrayList(perfRecordMaps4print.keySet()); + Collections.sort(keys, new Comparator() { + @Override + public int compare(PHASE o1, PHASE o2) { + return o1.toInt() - o2.toInt(); + } + }); + for (PHASE phase : keys) { + SumPerfRecord4Print sumPerfRecord = perfRecordMaps4print.get(phase); + if (sumPerfRecord == null) { + continue; + } + long averageTime = sumPerfRecord.getAverageTime(); + long maxTime = sumPerfRecord.getMaxTime(); + int maxTaskId = sumPerfRecord.maxTaskId; + int maxTaskGroupId = sumPerfRecord.getMaxTaskGroupId(); + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %-100s\n", + phase, unitTime(averageTime), sumPerfRecord.totalCount, unitTime(maxTime), jobId + "-" + maxTaskGroupId + "-" + maxTaskId, taskDetails.get(maxTaskId))); + } + + //SumPerfRecord4Print countSumPerf = Optional.fromNullable(perfRecordMaps4print.get(PHASE.READ_TASK_DATA)).or(new SumPerfRecord4Print()); + + SumPerfRecord4Print countSumPerf = perfRecordMaps4print.get(PHASE.READ_TASK_DATA); + if(countSumPerf == null){ + countSumPerf = new SumPerfRecord4Print(); + } + + long averageRecords = countSumPerf.getAverageRecords(); + long averageBytes = countSumPerf.getAverageBytes(); + long maxRecord = countSumPerf.getMaxRecord(); + long maxByte = countSumPerf.getMaxByte(); + int maxTaskId4Records = countSumPerf.getMaxTaskId4Records(); + int maxTGID4Records = countSumPerf.getMaxTGID4Records(); + + info.append("\n\n 2. 
record average count and max count task info :\n\n"); + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %18s | %-100s\n", "PHASE", "AVERAGE RECORDS", "AVERAGE BYTES", "MAX RECORDS", "MAX RECORD`S BYTES", "MAX TASK ID", "MAX TASK INFO")); + if (maxTaskId4Records > -1) { + info.append(String.format("%-20s | %18s | %18s | %18s | %18s | %18s | %-100s\n" + , PHASE.READ_TASK_DATA, averageRecords, unitSize(averageBytes), maxRecord, unitSize(maxByte), jobId + "-" + maxTGID4Records + "-" + maxTaskId4Records, taskDetails.get(maxTaskId4Records))); + + } + return info.toString(); + } + + //缺省传入的时间是nano + public static String unitTime(long time) { + return unitTime(time, TimeUnit.NANOSECONDS); + } + + public static String unitTime(long time, TimeUnit timeUnit) { + return String.format("%,.3fs", ((float) timeUnit.toNanos(time)) / 1000000000); + } + + public static String unitSize(long size) { + if (size > 1000000000) { + return String.format("%,.2fG", (float) size / 1000000000); + } else if (size > 1000000) { + return String.format("%,.2fM", (float) size / 1000000); + } else if (size > 1000) { + return String.format("%,.2fK", (float) size / 1000); + } else { + return size + "B"; + } + } + + + public synchronized ConcurrentHashMap getPerfRecordMaps4print() { + if (totalEndReport.size() > 0) { + sumPerf4EndPrint(totalEndReport); + } + return perfRecordMaps4print; + } + + public SumPerf4Report getSumPerf4Report() { + return sumPerf4Report; + } + + public Set getNeedReportPool4NotEnd() { + return needReportPool4NotEnd; + } + + public List getTotalEndReport() { + return totalEndReport; + } + + public Map getTaskDetails() { + return taskDetails; + } + + public boolean isEnable() { + return enable; + } + + public boolean isJob() { + return isJob; + } + + private String cluster; + private String jobDomain; + private String srcType; + private String dstType; + private String srcGuid; + private String dstGuid; + private Date windowStart; + private Date windowEnd; + private Date jobStartTime; + + public void setJobInfo(Configuration jobInfo, boolean perfReportEnable, int channelNumber) { + try { + this.jobInfo = jobInfo; + if (jobInfo != null && perfReportEnable) { + + cluster = jobInfo.getString("cluster"); + + String srcDomain = jobInfo.getString("srcDomain", "null"); + String dstDomain = jobInfo.getString("dstDomain", "null"); + jobDomain = srcDomain + "|" + dstDomain; + srcType = jobInfo.getString("srcType"); + dstType = jobInfo.getString("dstType"); + srcGuid = jobInfo.getString("srcGuid"); + dstGuid = jobInfo.getString("dstGuid"); + windowStart = getWindow(jobInfo.getString("windowStart"), true); + windowEnd = getWindow(jobInfo.getString("windowEnd"), false); + String jobIdStr = jobInfo.getString("jobId"); + jobId = StringUtils.isEmpty(jobIdStr) ? (long) -5 : Long.parseLong(jobIdStr); + String jobVersionStr = jobInfo.getString("jobVersion"); + jobVersion = StringUtils.isEmpty(jobVersionStr) ? 
(long) -4 : Long.parseLong(jobVersionStr); + jobStartTime = new Date(); + } + this.perfReportEnable = perfReportEnable; + this.channelNumber = channelNumber; + } catch (Exception e) { + this.perfReportEnable = false; + } + } + + private Date getWindow(String windowStr, boolean startWindow) { + SimpleDateFormat sdf1 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); + SimpleDateFormat sdf2 = new SimpleDateFormat("yyyy-MM-dd 00:00:00"); + if (StringUtils.isNotEmpty(windowStr)) { + try { + return sdf1.parse(windowStr); + } catch (ParseException e) { + // do nothing + } + } + + if (startWindow) { + try { + return sdf2.parse(sdf2.format(new Date())); + } catch (ParseException e1) { + //do nothing + } + } + + return null; + } + + public long getInstId() { + return instId; + } + + public Configuration getJobInfo() { + return jobInfo; + } + + public void setBatchSize(int batchSize) { + this.batchSize = batchSize; + } + + public synchronized JobStatisticsDto2 getReports(String mode) { + + try { + if (!enable || !perfReportEnable) { + return null; + } + + if (("job".equalsIgnoreCase(mode) && !isJob) || "tg".equalsIgnoreCase(mode) && isJob) { + return null; + } + + //每次将未完成的task的统计清空 + sumPerf4Report4NotEnd = new SumPerf4Report(); + Set needReportPool4NotEndTmp = null; + synchronized (needReportPool4NotEnd) { + needReportPool4NotEndTmp = new HashSet(needReportPool4NotEnd); + } + + long curNanoTime = System.nanoTime(); + for (PerfRecord perfRecord : needReportPool4NotEndTmp) { + sumPerf4Report4NotEnd.add(curNanoTime, perfRecord); + } + + JobStatisticsDto2 jdo = new JobStatisticsDto2(); + jdo.setInstId(this.instId); + if (isJob) { + jdo.setTaskGroupId(-6); + } else { + jdo.setTaskGroupId(this.taskGroupId); + } + jdo.setJobId(this.jobId); + jdo.setJobVersion(this.jobVersion); + jdo.setWindowStart(this.windowStart); + jdo.setWindowEnd(this.windowEnd); + jdo.setJobStartTime(jobStartTime); + jdo.setJobRunTimeMs(System.currentTimeMillis() - jobStartTime.getTime()); + jdo.setJobPriority(this.priority); + jdo.setChannelNum(this.channelNumber); + jdo.setCluster(this.cluster); + jdo.setJobDomain(this.jobDomain); + jdo.setSrcType(this.srcType); + jdo.setDstType(this.dstType); + jdo.setSrcGuid(this.srcGuid); + jdo.setDstGuid(this.dstGuid); + jdo.setHostAddress(HostUtils.IP); + + //sum + jdo.setTaskTotalTimeMs(sumPerf4Report4NotEnd.totalTaskRunTimeInMs + sumPerf4Report.totalTaskRunTimeInMs); + jdo.setOdpsBlockCloseTimeMs(sumPerf4Report4NotEnd.odpsCloseTimeInMs + sumPerf4Report.odpsCloseTimeInMs); + jdo.setSqlQueryTimeMs(sumPerf4Report4NotEnd.sqlQueryTimeInMs + sumPerf4Report.sqlQueryTimeInMs); + jdo.setResultNextTimeMs(sumPerf4Report4NotEnd.resultNextTimeInMs + sumPerf4Report.resultNextTimeInMs); + + return jdo; + } catch (Exception e) { + // do nothing + } + + return null; + } + + private void sumPerf4EndPrint(List totalEndReport) { + if (!enable || totalEndReport == null) { + return; + } + + for (PerfRecord perfRecord : totalEndReport) { + perfRecordMaps4print.putIfAbsent(perfRecord.getPhase(), new SumPerfRecord4Print()); + perfRecordMaps4print.get(perfRecord.getPhase()).add(perfRecord); + } + + totalEndReport.clear(); + } + + public void setChannelNumber(int needChannelNumber) { + this.channelNumber = needChannelNumber; + } + + + public static class SumPerf4Report { + long totalTaskRunTimeInMs = 0L; + long odpsCloseTimeInMs = 0L; + long sqlQueryTimeInMs = 0L; + long resultNextTimeInMs = 0L; + + public void add(long curNanoTime,PerfRecord perfRecord) { + try { + long runTimeEndInMs; + if 
(perfRecord.getElapsedTimeInNs() == -1) { + runTimeEndInMs = (curNanoTime - perfRecord.getStartTimeInNs()) / 1000000; + } else { + runTimeEndInMs = perfRecord.getElapsedTimeInNs() / 1000000; + } + switch (perfRecord.getPhase()) { + case TASK_TOTAL: + totalTaskRunTimeInMs += runTimeEndInMs; + break; + case SQL_QUERY: + sqlQueryTimeInMs += runTimeEndInMs; + break; + case RESULT_NEXT_ALL: + resultNextTimeInMs += runTimeEndInMs; + break; + case ODPS_BLOCK_CLOSE: + odpsCloseTimeInMs += runTimeEndInMs; + break; + } + }catch (Exception e){ + //do nothing + } + } + + public long getTotalTaskRunTimeInMs() { + return totalTaskRunTimeInMs; + } + + public long getOdpsCloseTimeInMs() { + return odpsCloseTimeInMs; + } + + public long getSqlQueryTimeInMs() { + return sqlQueryTimeInMs; + } + + public long getResultNextTimeInMs() { + return resultNextTimeInMs; + } + } + + public static class SumPerfRecord4Print { + private long perfTimeTotal = 0; + private long averageTime = 0; + private long maxTime = 0; + private int maxTaskId = -1; + private int maxTaskGroupId = -1; + private int totalCount = 0; + + private long recordsTotal = 0; + private long sizesTotal = 0; + private long averageRecords = 0; + private long averageBytes = 0; + private long maxRecord = 0; + private long maxByte = 0; + private int maxTaskId4Records = -1; + private int maxTGID4Records = -1; + + public void add(PerfRecord perfRecord) { + if (perfRecord == null) { + return; + } + perfTimeTotal += perfRecord.getElapsedTimeInNs(); + if (perfRecord.getElapsedTimeInNs() >= maxTime) { + maxTime = perfRecord.getElapsedTimeInNs(); + maxTaskId = perfRecord.getTaskId(); + maxTaskGroupId = perfRecord.getTaskGroupId(); + } + + recordsTotal += perfRecord.getCount(); + sizesTotal += perfRecord.getSize(); + if (perfRecord.getCount() >= maxRecord) { + maxRecord = perfRecord.getCount(); + maxByte = perfRecord.getSize(); + maxTaskId4Records = perfRecord.getTaskId(); + maxTGID4Records = perfRecord.getTaskGroupId(); + } + + totalCount++; + } + + public long getPerfTimeTotal() { + return perfTimeTotal; + } + + public long getAverageTime() { + if (totalCount > 0) { + averageTime = perfTimeTotal / totalCount; + } + return averageTime; + } + + public long getMaxTime() { + return maxTime; + } + + public int getMaxTaskId() { + return maxTaskId; + } + + public int getMaxTaskGroupId() { + return maxTaskGroupId; + } + + public long getRecordsTotal() { + return recordsTotal; + } + + public long getSizesTotal() { + return sizesTotal; + } + + public long getAverageRecords() { + if (totalCount > 0) { + averageRecords = recordsTotal / totalCount; + } + return averageRecords; + } + + public long getAverageBytes() { + if (totalCount > 0) { + averageBytes = sizesTotal / totalCount; + } + return averageBytes; + } + + public long getMaxRecord() { + return maxRecord; + } + + public long getMaxByte() { + return maxByte; + } + + public int getMaxTaskId4Records() { + return maxTaskId4Records; + } + + public int getMaxTGID4Records() { + return maxTGID4Records; + } + + public int getTotalCount() { + return totalCount; + } + } + class JobStatisticsDto2 { + + private Long id; + private Date gmtCreate; + private Date gmtModified; + private Long instId; + private Long jobId; + private Long jobVersion; + private Integer taskGroupId; + private Date windowStart; + private Date windowEnd; + private Date jobStartTime; + private Date jobEndTime; + private Long jobRunTimeMs; + private Integer jobPriority; + private Integer channelNum; + private String cluster; + private String jobDomain; + 
private String srcType; + private String dstType; + private String srcGuid; + private String dstGuid; + private Long records; + private Long bytes; + private Long speedRecord; + private Long speedByte; + private String stagePercent; + private Long errorRecord; + private Long errorBytes; + private Long waitReadTimeMs; + private Long waitWriteTimeMs; + private Long odpsBlockCloseTimeMs; + private Long sqlQueryTimeMs; + private Long resultNextTimeMs; + private Long taskTotalTimeMs; + private String hostAddress; + + public Long getId() { + return id; + } + + public Date getGmtCreate() { + return gmtCreate; + } + + public Date getGmtModified() { + return gmtModified; + } + + public Long getInstId() { + return instId; + } + + public Long getJobId() { + return jobId; + } + + public Long getJobVersion() { + return jobVersion; + } + + public Integer getTaskGroupId() { + return taskGroupId; + } + + public Date getWindowStart() { + return windowStart; + } + + public Date getWindowEnd() { + return windowEnd; + } + + public Date getJobStartTime() { + return jobStartTime; + } + + public Date getJobEndTime() { + return jobEndTime; + } + + public Long getJobRunTimeMs() { + return jobRunTimeMs; + } + + public Integer getJobPriority() { + return jobPriority; + } + + public Integer getChannelNum() { + return channelNum; + } + + public String getCluster() { + return cluster; + } + + public String getJobDomain() { + return jobDomain; + } + + public String getSrcType() { + return srcType; + } + + public String getDstType() { + return dstType; + } + + public String getSrcGuid() { + return srcGuid; + } + + public String getDstGuid() { + return dstGuid; + } + + public Long getRecords() { + return records; + } + + public Long getBytes() { + return bytes; + } + + public Long getSpeedRecord() { + return speedRecord; + } + + public Long getSpeedByte() { + return speedByte; + } + + public String getStagePercent() { + return stagePercent; + } + + public Long getErrorRecord() { + return errorRecord; + } + + public Long getErrorBytes() { + return errorBytes; + } + + public Long getWaitReadTimeMs() { + return waitReadTimeMs; + } + + public Long getWaitWriteTimeMs() { + return waitWriteTimeMs; + } + + public Long getOdpsBlockCloseTimeMs() { + return odpsBlockCloseTimeMs; + } + + public Long getSqlQueryTimeMs() { + return sqlQueryTimeMs; + } + + public Long getResultNextTimeMs() { + return resultNextTimeMs; + } + + public Long getTaskTotalTimeMs() { + return taskTotalTimeMs; + } + + public String getHostAddress() { + return hostAddress; + } + + public void setId(Long id) { + this.id = id; + } + + public void setGmtCreate(Date gmtCreate) { + this.gmtCreate = gmtCreate; + } + + public void setGmtModified(Date gmtModified) { + this.gmtModified = gmtModified; + } + + public void setInstId(Long instId) { + this.instId = instId; + } + + public void setJobId(Long jobId) { + this.jobId = jobId; + } + + public void setJobVersion(Long jobVersion) { + this.jobVersion = jobVersion; + } + + public void setTaskGroupId(Integer taskGroupId) { + this.taskGroupId = taskGroupId; + } + + public void setWindowStart(Date windowStart) { + this.windowStart = windowStart; + } + + public void setWindowEnd(Date windowEnd) { + this.windowEnd = windowEnd; + } + + public void setJobStartTime(Date jobStartTime) { + this.jobStartTime = jobStartTime; + } + + public void setJobEndTime(Date jobEndTime) { + this.jobEndTime = jobEndTime; + } + + public void setJobRunTimeMs(Long jobRunTimeMs) { + this.jobRunTimeMs = jobRunTimeMs; + } + + public void 
setJobPriority(Integer jobPriority) { + this.jobPriority = jobPriority; + } + + public void setChannelNum(Integer channelNum) { + this.channelNum = channelNum; + } + + public void setCluster(String cluster) { + this.cluster = cluster; + } + + public void setJobDomain(String jobDomain) { + this.jobDomain = jobDomain; + } + + public void setSrcType(String srcType) { + this.srcType = srcType; + } + + public void setDstType(String dstType) { + this.dstType = dstType; + } + + public void setSrcGuid(String srcGuid) { + this.srcGuid = srcGuid; + } + + public void setDstGuid(String dstGuid) { + this.dstGuid = dstGuid; + } + + public void setRecords(Long records) { + this.records = records; + } + + public void setBytes(Long bytes) { + this.bytes = bytes; + } + + public void setSpeedRecord(Long speedRecord) { + this.speedRecord = speedRecord; + } + + public void setSpeedByte(Long speedByte) { + this.speedByte = speedByte; + } + + public void setStagePercent(String stagePercent) { + this.stagePercent = stagePercent; + } + + public void setErrorRecord(Long errorRecord) { + this.errorRecord = errorRecord; + } + + public void setErrorBytes(Long errorBytes) { + this.errorBytes = errorBytes; + } + + public void setWaitReadTimeMs(Long waitReadTimeMs) { + this.waitReadTimeMs = waitReadTimeMs; + } + + public void setWaitWriteTimeMs(Long waitWriteTimeMs) { + this.waitWriteTimeMs = waitWriteTimeMs; + } + + public void setOdpsBlockCloseTimeMs(Long odpsBlockCloseTimeMs) { + this.odpsBlockCloseTimeMs = odpsBlockCloseTimeMs; + } + + public void setSqlQueryTimeMs(Long sqlQueryTimeMs) { + this.sqlQueryTimeMs = sqlQueryTimeMs; + } + + public void setResultNextTimeMs(Long resultNextTimeMs) { + this.resultNextTimeMs = resultNextTimeMs; + } + + public void setTaskTotalTimeMs(Long taskTotalTimeMs) { + this.taskTotalTimeMs = taskTotalTimeMs; + } + + public void setHostAddress(String hostAddress) { + this.hostAddress = hostAddress; + } + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/statistics/VMInfo.java b/common/src/main/java/com/alibaba/datax/common/statistics/VMInfo.java new file mode 100644 index 0000000000..cab42a4b94 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/statistics/VMInfo.java @@ -0,0 +1,412 @@ +package com.alibaba.datax.common.statistics; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.lang.management.GarbageCollectorMXBean; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.OperatingSystemMXBean; +import java.lang.management.RuntimeMXBean; +import java.lang.reflect.Method; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * Created by liqiang on 15/11/12. + */ +public class VMInfo { + private static final Logger LOG = LoggerFactory.getLogger(VMInfo.class); + static final long MB = 1024 * 1024; + static final long GB = 1024 * 1024 * 1024; + public static Object lock = new Object(); + private static VMInfo vmInfo; + + /** + * @return null or vmInfo. null is something error, job no care it. 
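+ *
+ * 调用方需要自行判空。示意用法(仅作说明,并非本类源码;其中LOG指调用方自己的Logger):
+ * VMInfo vmInfo = VMInfo.getVmInfo();
+ * if (vmInfo != null) {
+ *     LOG.info(vmInfo.toString());
+ * }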
+ */ + public static VMInfo getVmInfo() { + if (vmInfo == null) { + synchronized (lock) { + if (vmInfo == null) { + try { + vmInfo = new VMInfo(); + } catch (Exception e) { + LOG.warn("no need care, the fail is ignored : vmInfo init failed " + e.getMessage(), e); + } + } + } + + } + return vmInfo; + } + + // 数据的MxBean + private final OperatingSystemMXBean osMXBean; + private final RuntimeMXBean runtimeMXBean; + private final List garbageCollectorMXBeanList; + private final List memoryPoolMXBeanList; + /** + * 静态信息 + */ + private final String osInfo; + private final String jvmInfo; + + /** + * cpu个数 + */ + private final int totalProcessorCount; + + /** + * 机器的各个状态,用于中间打印和统计上报 + */ + private final PhyOSStatus startPhyOSStatus; + private final ProcessCpuStatus processCpuStatus = new ProcessCpuStatus(); + private final ProcessGCStatus processGCStatus = new ProcessGCStatus(); + private final ProcessMemoryStatus processMomoryStatus = new ProcessMemoryStatus(); + //ms + private long lastUpTime = 0; + //nano + private long lastProcessCpuTime = 0; + + + private VMInfo() { + //初始化静态信息 + osMXBean = java.lang.management.ManagementFactory.getOperatingSystemMXBean(); + runtimeMXBean = java.lang.management.ManagementFactory.getRuntimeMXBean(); + garbageCollectorMXBeanList = java.lang.management.ManagementFactory.getGarbageCollectorMXBeans(); + memoryPoolMXBeanList = java.lang.management.ManagementFactory.getMemoryPoolMXBeans(); + + osInfo = runtimeMXBean.getVmVendor() + " " + runtimeMXBean.getSpecVersion() + " " + runtimeMXBean.getVmVersion(); + jvmInfo = osMXBean.getName() + " " + osMXBean.getArch() + " " + osMXBean.getVersion(); + totalProcessorCount = osMXBean.getAvailableProcessors(); + + //构建startPhyOSStatus + startPhyOSStatus = new PhyOSStatus(); + LOG.info("VMInfo# operatingSystem class => " + osMXBean.getClass().getName()); + if (VMInfo.isSunOsMBean(osMXBean)) { + { + startPhyOSStatus.totalPhysicalMemory = VMInfo.getLongFromOperatingSystem(osMXBean, "getTotalPhysicalMemorySize"); + startPhyOSStatus.freePhysicalMemory = VMInfo.getLongFromOperatingSystem(osMXBean, "getFreePhysicalMemorySize"); + startPhyOSStatus.maxFileDescriptorCount = VMInfo.getLongFromOperatingSystem(osMXBean, "getMaxFileDescriptorCount"); + startPhyOSStatus.currentOpenFileDescriptorCount = VMInfo.getLongFromOperatingSystem(osMXBean, "getOpenFileDescriptorCount"); + } + } + + //初始化processGCStatus; + for (GarbageCollectorMXBean garbage : garbageCollectorMXBeanList) { + GCStatus gcStatus = new GCStatus(); + gcStatus.name = garbage.getName(); + processGCStatus.gcStatusMap.put(garbage.getName(), gcStatus); + } + + //初始化processMemoryStatus + if (memoryPoolMXBeanList != null && !memoryPoolMXBeanList.isEmpty()) { + for (MemoryPoolMXBean pool : memoryPoolMXBeanList) { + MemoryStatus memoryStatus = new MemoryStatus(); + memoryStatus.name = pool.getName(); + memoryStatus.initSize = pool.getUsage().getInit(); + memoryStatus.maxSize = pool.getUsage().getMax(); + processMomoryStatus.memoryStatusMap.put(pool.getName(), memoryStatus); + } + } + } + + public String toString() { + return "the machine info => \n\n" + + "\tosInfo:\t" + osInfo + "\n" + + "\tjvmInfo:\t" + jvmInfo + "\n" + + "\tcpu num:\t" + totalProcessorCount + "\n\n" + + startPhyOSStatus.toString() + "\n" + + processGCStatus.toString() + "\n" + + processMomoryStatus.toString() + "\n"; + } + + public String totalString() { + return (processCpuStatus.getTotalString() + processGCStatus.getTotalString()); + } + + public void getDelta() { + getDelta(true); + } + + public synchronized 
void getDelta(boolean print) { + + try { + if (VMInfo.isSunOsMBean(osMXBean)) { + long curUptime = runtimeMXBean.getUptime(); + long curProcessTime = getLongFromOperatingSystem(osMXBean, "getProcessCpuTime"); + //百分比, uptime是ms,processTime是nano + if ((curUptime > lastUpTime) && (curProcessTime >= lastProcessCpuTime)) { + float curDeltaCpu = (float) (curProcessTime - lastProcessCpuTime) / ((curUptime - lastUpTime) * totalProcessorCount * 10000); + processCpuStatus.setMaxMinCpu(curDeltaCpu); + processCpuStatus.averageCpu = (float) curProcessTime / (curUptime * totalProcessorCount * 10000); + + lastUpTime = curUptime; + lastProcessCpuTime = curProcessTime; + } + } + + for (GarbageCollectorMXBean garbage : garbageCollectorMXBeanList) { + + GCStatus gcStatus = processGCStatus.gcStatusMap.get(garbage.getName()); + if (gcStatus == null) { + gcStatus = new GCStatus(); + gcStatus.name = garbage.getName(); + processGCStatus.gcStatusMap.put(garbage.getName(), gcStatus); + } + + long curTotalGcCount = garbage.getCollectionCount(); + gcStatus.setCurTotalGcCount(curTotalGcCount); + + long curtotalGcTime = garbage.getCollectionTime(); + gcStatus.setCurTotalGcTime(curtotalGcTime); + } + + if (memoryPoolMXBeanList != null && !memoryPoolMXBeanList.isEmpty()) { + for (MemoryPoolMXBean pool : memoryPoolMXBeanList) { + + MemoryStatus memoryStatus = processMomoryStatus.memoryStatusMap.get(pool.getName()); + if (memoryStatus == null) { + memoryStatus = new MemoryStatus(); + memoryStatus.name = pool.getName(); + processMomoryStatus.memoryStatusMap.put(pool.getName(), memoryStatus); + } + memoryStatus.commitedSize = pool.getUsage().getCommitted(); + memoryStatus.setMaxMinUsedSize(pool.getUsage().getUsed()); + long maxMemory = memoryStatus.commitedSize > 0 ? memoryStatus.commitedSize : memoryStatus.maxSize; + memoryStatus.setMaxMinPercent(maxMemory > 0 ? 
(float) 100 * memoryStatus.usedSize / maxMemory : -1); + } + } + + if (print) { + LOG.info(processCpuStatus.getDeltaString() + processMomoryStatus.getDeltaString() + processGCStatus.getDeltaString()); + } + + } catch (Exception e) { + LOG.warn("no need care, the fail is ignored : vmInfo getDelta failed " + e.getMessage(), e); + } + } + + public static boolean isSunOsMBean(OperatingSystemMXBean operatingSystem) { + final String className = operatingSystem.getClass().getName(); + + return "com.sun.management.UnixOperatingSystem".equals(className); + } + + public static long getLongFromOperatingSystem(OperatingSystemMXBean operatingSystem, String methodName) { + try { + final Method method = operatingSystem.getClass().getMethod(methodName, (Class[]) null); + method.setAccessible(true); + return (Long) method.invoke(operatingSystem, (Object[]) null); + } catch (final Exception e) { + LOG.info(String.format("OperatingSystemMXBean %s failed, Exception = %s ", methodName, e.getMessage())); + } + + return -1; + } + + private class PhyOSStatus { + long totalPhysicalMemory = -1; + long freePhysicalMemory = -1; + long maxFileDescriptorCount = -1; + long currentOpenFileDescriptorCount = -1; + + public String toString() { + return String.format("\ttotalPhysicalMemory:\t%,.2fG\n" + + "\tfreePhysicalMemory:\t%,.2fG\n" + + "\tmaxFileDescriptorCount:\t%s\n" + + "\tcurrentOpenFileDescriptorCount:\t%s\n", + (float) totalPhysicalMemory / GB, (float) freePhysicalMemory / GB, maxFileDescriptorCount, currentOpenFileDescriptorCount); + } + } + + private class ProcessCpuStatus { + // 百分比的值 比如30.0 表示30.0% + float maxDeltaCpu = -1; + float minDeltaCpu = -1; + float curDeltaCpu = -1; + float averageCpu = -1; + + public void setMaxMinCpu(float curCpu) { + this.curDeltaCpu = curCpu; + if (maxDeltaCpu < curCpu) { + maxDeltaCpu = curCpu; + } + + if (minDeltaCpu == -1 || minDeltaCpu > curCpu) { + minDeltaCpu = curCpu; + } + } + + public String getDeltaString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [delta cpu info] => \n"); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s \n", "curDeltaCpu", "averageCpu", "maxDeltaCpu", "minDeltaCpu")); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s \n", + String.format("%,.2f%%", processCpuStatus.curDeltaCpu), + String.format("%,.2f%%", processCpuStatus.averageCpu), + String.format("%,.2f%%", processCpuStatus.maxDeltaCpu), + String.format("%,.2f%%\n", processCpuStatus.minDeltaCpu))); + + return sb.toString(); + } + + public String getTotalString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [total cpu info] => \n"); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", "averageCpu", "maxDeltaCpu", "minDeltaCpu")); + sb.append("\t\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", + String.format("%,.2f%%", processCpuStatus.averageCpu), + String.format("%,.2f%%", processCpuStatus.maxDeltaCpu), + String.format("%,.2f%%\n", processCpuStatus.minDeltaCpu))); + + return sb.toString(); + } + + } + + private class ProcessGCStatus { + final Map gcStatusMap = new HashMap(); + + public String toString() { + return "\tGC Names\t" + gcStatusMap.keySet() + "\n"; + } + + public String getDeltaString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [delta gc info] => \n"); + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", "NAME", "curDeltaGCCount", "totalGCCount", "maxDeltaGCCount", 
"minDeltaGCCount", "curDeltaGCTime", "totalGCTime", "maxDeltaGCTime", "minDeltaGCTime")); + for (GCStatus gc : gcStatusMap.values()) { + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", + gc.name, gc.curDeltaGCCount, gc.totalGCCount, gc.maxDeltaGCCount, gc.minDeltaGCCount, + String.format("%,.3fs",(float)gc.curDeltaGCTime/1000), + String.format("%,.3fs",(float)gc.totalGCTime/1000), + String.format("%,.3fs",(float)gc.maxDeltaGCTime/1000), + String.format("%,.3fs",(float)gc.minDeltaGCTime/1000))); + + } + return sb.toString(); + } + + public String getTotalString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [total gc info] => \n"); + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", "NAME", "totalGCCount", "maxDeltaGCCount", "minDeltaGCCount", "totalGCTime", "maxDeltaGCTime", "minDeltaGCTime")); + for (GCStatus gc : gcStatusMap.values()) { + sb.append("\t\t "); + sb.append(String.format("%-20s | %-18s | %-18s | %-18s | %-18s | %-18s | %-18s \n", + gc.name, gc.totalGCCount, gc.maxDeltaGCCount, gc.minDeltaGCCount, + String.format("%,.3fs",(float)gc.totalGCTime/1000), + String.format("%,.3fs",(float)gc.maxDeltaGCTime/1000), + String.format("%,.3fs",(float)gc.minDeltaGCTime/1000))); + + } + return sb.toString(); + } + } + + private class ProcessMemoryStatus { + final Map memoryStatusMap = new HashMap(); + + public String toString() { + StringBuilder sb = new StringBuilder(); + sb.append("\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", "MEMORY_NAME", "allocation_size", "init_size")); + for (MemoryStatus ms : memoryStatusMap.values()) { + sb.append("\t"); + sb.append(String.format("%-30s | %-30s | %-30s \n", + ms.name, String.format("%,.2fMB", (float) ms.maxSize / MB), String.format("%,.2fMB", (float) ms.initSize / MB))); + } + return sb.toString(); + } + + public String getDeltaString() { + StringBuilder sb = new StringBuilder(); + sb.append("\n\t [delta memory info] => \n"); + sb.append("\t\t "); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s | %-30s \n", "NAME", "used_size", "used_percent", "max_used_size", "max_percent")); + for (MemoryStatus ms : memoryStatusMap.values()) { + sb.append("\t\t "); + sb.append(String.format("%-30s | %-30s | %-30s | %-30s | %-30s \n", + ms.name, String.format("%,.2f", (float) ms.usedSize / MB) + "MB", + String.format("%,.2f", (float) ms.percent) + "%", + String.format("%,.2f", (float) ms.maxUsedSize / MB) + "MB", + String.format("%,.2f", (float) ms.maxpercent) + "%")); + + } + return sb.toString(); + } + } + + private class GCStatus { + String name; + long maxDeltaGCCount = -1; + long minDeltaGCCount = -1; + long curDeltaGCCount; + long totalGCCount = 0; + long maxDeltaGCTime = -1; + long minDeltaGCTime = -1; + long curDeltaGCTime; + long totalGCTime = 0; + + public void setCurTotalGcCount(long curTotalGcCount) { + this.curDeltaGCCount = curTotalGcCount - totalGCCount; + this.totalGCCount = curTotalGcCount; + + if (maxDeltaGCCount < curDeltaGCCount) { + maxDeltaGCCount = curDeltaGCCount; + } + + if (minDeltaGCCount == -1 || minDeltaGCCount > curDeltaGCCount) { + minDeltaGCCount = curDeltaGCCount; + } + } + + public void setCurTotalGcTime(long curTotalGcTime) { + this.curDeltaGCTime = curTotalGcTime - totalGCTime; + this.totalGCTime = curTotalGcTime; + + if (maxDeltaGCTime < curDeltaGCTime) { + maxDeltaGCTime = curDeltaGCTime; + } + + if (minDeltaGCTime == -1 || minDeltaGCTime > curDeltaGCTime) { 
+ minDeltaGCTime = curDeltaGCTime; + } + } + } + + private class MemoryStatus { + String name; + long initSize; + long maxSize; + long commitedSize; + long usedSize; + float percent; + long maxUsedSize = -1; + float maxpercent = 0; + + void setMaxMinUsedSize(long curUsedSize) { + if (maxUsedSize < curUsedSize) { + maxUsedSize = curUsedSize; + } + this.usedSize = curUsedSize; + } + + void setMaxMinPercent(float curPercent) { + if (maxpercent < curPercent) { + maxpercent = curPercent; + } + this.percent = curPercent; + } + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/Configuration.java b/common/src/main/java/com/alibaba/datax/common/util/Configuration.java new file mode 100755 index 0000000000..f570dd00c2 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/Configuration.java @@ -0,0 +1,1078 @@ +package com.alibaba.datax.common.util; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.ErrorCode; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.serializer.SerializerFeature; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.CharUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.builder.ToStringBuilder; + +import java.io.*; +import java.util.*; + +/** + * Configuration 提供多级JSON配置信息无损存储
+ *
+ * 实例代码:
+ *
+ * 获取job的配置信息
+ * Configuration configuration = Configuration.from(new File("Config.json"));
+ * String jobContainerClass = configuration.getString("core.container.job.class");
+ *
+ * 设置多级List
+ * configuration.set("job.reader.parameter.jdbcUrl", Arrays.asList(new String[] {"jdbc", "jdbc"}));
+ *
+ * 合并Configuration:
+ * configuration.merge(another);
+ *
+ * Configuration 存在两种较好的实现方式:
+ * 第一种是将JSON配置信息中所有的Key全部打平,用a.b.c的级联方式作为Map的Key,内部使用一个Map保存信息;
+ * 第二种是将JSON对象直接使用结构化的树形结构保存。
+ *
+ * 目前使用的是第二种实现方式,使用第一种方式的问题在于:
+ * 1. 插入新对象比较难处理。例如已有a.b.c="bazhen",此时若需要插入a="bazhen",根目录下第一层的所有内容都要废弃,改用"bazhen"作为value;第一种方式用字符串表示key,难以处理这类问题。
+ * 2. 返回树形结构比较难处理。例如 a.b.c.d = "bazhen",如果返回"a"下的所有元素,实际上是一个Map,需要合并处理。
+ * 3. 输出JSON,将上述对象转为JSON,要把上述Map的多级key转为树形结构,并输出为JSON
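+ *
+ * 下面是一个简化的使用示意(仅帮助理解树形寻址行为,并非源码的一部分,其中JSON串为假设的示例):
+ * Configuration conf = Configuration.from("{\"a\": {\"b\": {\"c\": [0,1,2,3]}}}");
+ * conf.get("a.b.c[1]");   // 沿树依次取 "a" -> "b" -> "c" -> 下标1,返回 1
+ * conf.set("a.x", "y");   // 在 "a" 对应的Map中新增键 "x"
+ * conf.getKeys();         // 返回全部叶子节点key,例如 a.b.c[0] ... a.b.c[3], a.x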
+ */ +public class Configuration { + + /** + * 对于加密的keyPath,需要记录下来 + * 为的是后面分布式情况下将该值加密后抛到DataXServer中 + */ + private Set secretKeyPathSet = + new HashSet(); + + private Object root = null; + + /** + * 初始化空白的Configuration + */ + public static Configuration newDefault() { + return Configuration.from("{}"); + } + + /** + * 从JSON字符串加载Configuration + */ + public static Configuration from(String json) { + json = StrUtil.replaceVariable(json); + checkJSON(json); + + try { + return new Configuration(json); + } catch (Exception e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + e); + } + + } + + /** + * 从包括json的File对象加载Configuration + */ + public static Configuration from(File file) { + try { + return Configuration.from(IOUtils + .toString(new FileInputStream(file))); + } catch (FileNotFoundException e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("配置信息错误,您提供的配置文件[%s]不存在. 请检查您的配置文件.", file.getAbsolutePath())); + } catch (IOException e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("配置信息错误. 您提供配置文件[%s]读取失败,错误原因: %s. 请检查您的配置文件的权限设置.", + file.getAbsolutePath(), e)); + } + } + + /** + * 从包括json的InputStream对象加载Configuration + */ + public static Configuration from(InputStream is) { + try { + return Configuration.from(IOUtils.toString(is)); + } catch (IOException e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("请检查您的配置文件. 您提供的配置文件读取失败,错误原因: %s. 请检查您的配置文件的权限设置.", e)); + } + } + + /** + * 从Map对象加载Configuration + */ + public static Configuration from(final Map object) { + return Configuration.from(Configuration.toJSONString(object)); + } + + /** + * 从List对象加载Configuration + */ + public static Configuration from(final List object) { + return Configuration.from(Configuration.toJSONString(object)); + } + + public String getNecessaryValue(String key, ErrorCode errorCode) { + String value = this.getString(key, null); + if (StringUtils.isBlank(value)) { + throw DataXException.asDataXException(errorCode, + String.format("您提供配置文件有误,[%s]是必填参数,不允许为空或者留白 .", key)); + } + + return value; + } + + public String getUnnecessaryValue(String key,String defaultValue,ErrorCode errorCode) { + String value = this.getString(key, defaultValue); + if (StringUtils.isBlank(value)) { + value = defaultValue; + } + return value; + } + + public Boolean getNecessaryBool(String key, ErrorCode errorCode) { + Boolean value = this.getBool(key); + if (value == null) { + throw DataXException.asDataXException(errorCode, + String.format("您提供配置文件有误,[%s]是必填参数,不允许为空或者留白 .", key)); + } + + return value; + } + + /** + * 根据用户提供的json path,寻址具体的对象。 + *

+ * NOTE: 目前仅支持Map以及List下标寻址, 例如:
+ *
+ * 对于如下JSON
+ * {"a": {"b": {"c": [0,1,2,3]}}}
+ *
+ * config.get("") 返回整个Map
+ * config.get("a") 返回a下属整个Map
+ * config.get("a.b.c") 返回c对应的数组List
+ * config.get("a.b.c[0]") 返回数字0 + * + * @return Java表示的JSON对象,如果path不存在或者对象不存在,均返回null。 + */ + public Object get(final String path) { + this.checkPath(path); + try { + return this.findObject(path); + } catch (Exception e) { + return null; + } + } + + /** + * 用户指定部分path,获取Configuration的子集 + *
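+ * 使用示意(仅作说明,其中的路径为假设的示例路径):
+ * Configuration readerConf = configuration.getConfiguration("job.content[0].reader");
+ *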

+ *
+ * 如果path获取的路径或者对象不存在,返回null + */ + public Configuration getConfiguration(final String path) { + Object object = this.get(path); + if (null == object) { + return null; + } + + return Configuration.from(Configuration.toJSONString(object)); + } + + /** + * 根据用户提供的json path,寻址String对象 + * + * @return String对象,如果path不存在或者String不存在,返回null + */ + public String getString(final String path) { + Object string = this.get(path); + if (null == string) { + return null; + } + return String.valueOf(string); + } + + /** + * 根据用户提供的json path,寻址String对象,如果对象不存在,返回默认字符串 + * + * @return String对象,如果path不存在或者String不存在,返回默认字符串 + */ + public String getString(final String path, final String defaultValue) { + String result = this.getString(path); + + if (null == result) { + return defaultValue; + } + + return result; + } + + /** + * 根据用户提供的json path,寻址Character对象 + * + * @return Character对象,如果path不存在或者Character不存在,返回null + */ + public Character getChar(final String path) { + String result = this.getString(path); + if (null == result) { + return null; + } + + try { + return CharUtils.toChar(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 因为配置文件路径[%s] 值非法,期望是字符类型: %s. 请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Boolean对象,如果对象不存在,返回默认Character对象 + * + * @return Character对象,如果path不存在或者Character不存在,返回默认Character对象 + */ + public Character getChar(final String path, char defaultValue) { + Character result = this.getChar(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Boolean对象 + * + * @return Boolean对象,如果path值非true,false ,将报错.特别注意:当 path 不存在时,会返回:null. + */ + public Boolean getBool(final String path) { + String result = this.getString(path); + + if (null == result) { + return null; + } else if ("true".equalsIgnoreCase(result)) { + return Boolean.TRUE; + } else if ("false".equalsIgnoreCase(result)) { + return Boolean.FALSE; + } else { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("您提供的配置信息有误,因为从[%s]获取的值[%s]无法转换为bool类型. 请检查源表的配置并且做出相应的修改.", + path, result)); + } + + } + + /** + * 根据用户提供的json path,寻址Boolean对象,如果对象不存在,返回默认Boolean对象 + * + * @return Boolean对象,如果path不存在或者Boolean不存在,返回默认Boolean对象 + */ + public Boolean getBool(final String path, boolean defaultValue) { + Boolean result = this.getBool(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Integer对象 + * + * @return Integer对象,如果path不存在或者Integer不存在,返回null + */ + public Integer getInt(final String path) { + String result = this.getString(path); + if (null == result) { + return null; + } + + try { + return Integer.valueOf(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 配置文件路径[%s] 值非法, 期望是整数类型: %s. 
请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Integer对象,如果对象不存在,返回默认Integer对象 + * + * @return Integer对象,如果path不存在或者Integer不存在,返回默认Integer对象 + */ + public Integer getInt(final String path, int defaultValue) { + Integer object = this.getInt(path); + if (null == object) { + return defaultValue; + } + return object; + } + + /** + * 根据用户提供的json path,寻址Long对象 + * + * @return Long对象,如果path不存在或者Long不存在,返回null + */ + public Long getLong(final String path) { + String result = this.getString(path); + if (StringUtils.isBlank(result)) { + return null; + } + + try { + return Long.valueOf(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 配置文件路径[%s] 值非法, 期望是整数类型: %s. 请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Long对象,如果对象不存在,返回默认Long对象 + * + * @return Long对象,如果path不存在或者Integer不存在,返回默认Long对象 + */ + public Long getLong(final String path, long defaultValue) { + Long result = this.getLong(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Double对象 + * + * @return Double对象,如果path不存在或者Double不存在,返回null + */ + public Double getDouble(final String path) { + String result = this.getString(path); + if (StringUtils.isBlank(result)) { + return null; + } + + try { + return Double.valueOf(result); + } catch (Exception e) { + throw DataXException.asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format("任务读取配置文件出错. 配置文件路径[%s] 值非法, 期望是浮点类型: %s. 请检查您的配置并作出修改.", path, + e.getMessage())); + } + } + + /** + * 根据用户提供的json path,寻址Double对象,如果对象不存在,返回默认Double对象 + * + * @return Double对象,如果path不存在或者Double不存在,返回默认Double对象 + */ + public Double getDouble(final String path, double defaultValue) { + Double result = this.getDouble(path); + if (null == result) { + return defaultValue; + } + return result; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回null + */ + @SuppressWarnings("unchecked") + public List getList(final String path) { + List list = this.get(path, List.class); + if (null == list) { + return null; + } + return list; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回null + */ + @SuppressWarnings("unchecked") + public List getList(final String path, Class t) { + Object object = this.get(path, List.class); + if (null == object) { + return null; + } + + List result = new ArrayList(); + + List origin = (List) object; + for (final Object each : origin) { + result.add((T) each); + } + + return result; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回默认List + */ + @SuppressWarnings("unchecked") + public List getList(final String path, + final List defaultList) { + Object object = this.getList(path); + if (null == object) { + return defaultList; + } + return (List) object; + } + + /** + * 根据用户提供的json path,寻址List对象,如果对象不存在,返回默认List + */ + public List getList(final String path, final List defaultList, + Class t) { + List list = this.getList(path, t); + if (null == list) { + return defaultList; + } + return list; + } + + /** + * 根据用户提供的json path,寻址包含Configuration的List,如果对象不存在,返回默认null + */ + public List getListConfiguration(final String path) { + List lists = getList(path); + if (lists == null) { + return null; + } + + List result = new ArrayList(); + for (final Object object : lists) { + result.add(Configuration.from(Configuration.toJSONString(object))); + } + return result; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回null + */ + @SuppressWarnings("unchecked") + public Map getMap(final String 
path) { + Map result = this.get(path, Map.class); + if (null == result) { + return null; + } + return result; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回null; + */ + @SuppressWarnings("unchecked") + public Map getMap(final String path, Class t) { + Map map = this.get(path, Map.class); + if (null == map) { + return null; + } + + Map result = new HashMap(); + for (final String key : map.keySet()) { + result.put(key, (T) map.get(key)); + } + + return result; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回默认map + */ + @SuppressWarnings("unchecked") + public Map getMap(final String path, + final Map defaultMap) { + Object object = this.getMap(path); + if (null == object) { + return defaultMap; + } + return (Map) object; + } + + /** + * 根据用户提供的json path,寻址Map对象,如果对象不存在,返回默认map + */ + public Map getMap(final String path, + final Map defaultMap, Class t) { + Map result = getMap(path, t); + if (null == result) { + return defaultMap; + } + return result; + } + + /** + * 根据用户提供的json path,寻址包含Configuration的Map,如果对象不存在,返回默认null + */ + @SuppressWarnings("unchecked") + public Map getMapConfiguration(final String path) { + Map map = this.get(path, Map.class); + if (null == map) { + return null; + } + + Map result = new HashMap(); + for (final String key : map.keySet()) { + result.put(key, Configuration.from(Configuration.toJSONString(map + .get(key)))); + } + + return result; + } + + /** + * 根据用户提供的json path,寻址具体的对象,并转为用户提供的类型 + *

+ * NOTE: 目前仅支持Map以及List下标寻址, 例如:
+ *
+ * 对于如下JSON
+ * {"a": {"b": {"c": [0,1,2,3]}}}
+ *
+ * config.get("") 返回整个Map
+ * config.get("a") 返回a下属整个Map
+ * config.get("a.b.c") 返回c对应的数组List
+ * config.get("a.b.c[0]") 返回数字0 + * + * @return Java表示的JSON对象,如果转型失败,将抛出异常 + */ + @SuppressWarnings("unchecked") + public T get(final String path, Class clazz) { + this.checkPath(path); + return (T) this.get(path); + } + + /** + * 格式化Configuration输出 + */ + public String beautify() { + return JSON.toJSONString(this.getInternal(), + SerializerFeature.PrettyFormat); + } + + /** + * 根据用户提供的json path,插入指定对象,并返回之前存在的对象(如果存在) + *

+ * 目前仅支持.以及数组下标寻址, 例如:
+ *
+ * config.set("a.b.c[3]", object); + *
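+ *
+ * 行为说明示意:当下标超过现有List长度时,例如对尚不存在的列表执行 config.set("a.b.c[3]", object),
+ * 中间缺失的位置会被填充为null(参见下文 setObjectRecursive 与 expand 的实现)。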

+ *
+ * 对于插入对象,Configuration不做任何限制,但是请务必保证该对象是简单对象(包括Map、List),不要使用自定义对象,否则后续对于JSON序列化等情况会出现未定义行为。 + * + * @param path + * JSON path对象 + * @param object + * 需要插入的对象 + * @return Java表示的JSON对象 + */ + public Object set(final String path, final Object object) { + checkPath(path); + + Object result = this.get(path); + + setObject(path, extractConfiguration(object)); + + return result; + } + + /** + * 获取Configuration下所有叶子节点的key + *

+ * 对于 {"a": {"b": {"c": [0,1,2,3]}}, "x": "y"}
+ * 下属的key包括: a.b.c[0],a.b.c[1],a.b.c[2],a.b.c[3],x + */ + public Set getKeys() { + Set collect = new HashSet(); + this.getKeysRecursive(this.getInternal(), "", collect); + return collect; + } + + /** + * 删除path对应的值,如果path不存在,将抛出异常。 + */ + public Object remove(final String path) { + final Object result = this.get(path); + if (null == result) { + throw DataXException.asDataXException( + CommonErrorCode.RUNTIME_ERROR, + String.format("配置文件对应Key[%s]并不存在,该情况是代码编程错误. 请联系DataX团队的同学.", path)); + } + + this.set(path, null); + return result; + } + + /** + * 合并其他Configuration,并修改两者冲突的KV配置 + * + * @param another + * 合并加入的第三方Configuration + * @param updateWhenConflict + * 当合并双方出现KV冲突时候,选择更新当前KV,或者忽略该KV + * @return 返回合并后对象 + */ + public Configuration merge(final Configuration another, + boolean updateWhenConflict) { + Set keys = another.getKeys(); + + for (final String key : keys) { + // 如果使用更新策略,凡是another存在的key,均需要更新 + if (updateWhenConflict) { + this.set(key, another.get(key)); + continue; + } + + // 使用忽略策略,只有another Configuration存在但是当前Configuration不存在的key,才需要更新 + boolean isCurrentExists = this.get(key) != null; + if (isCurrentExists) { + continue; + } + + this.set(key, another.get(key)); + } + return this; + } + + @Override + public String toString() { + return this.toJSON(); + } + + /** + * 将Configuration作为JSON输出 + */ + public String toJSON() { + return Configuration.toJSONString(this.getInternal()); + } + + /** + * 拷贝当前Configuration,注意,这里使用了深拷贝,避免冲突 + */ + public Configuration clone() { + Configuration config = Configuration + .from(Configuration.toJSONString(this.getInternal())); + config.addSecretKeyPath(this.secretKeyPathSet); + return config; + } + + /** + * 按照configuration要求格式的path + * 比如: + * a.b.c + * a.b[2].c + * @param path + */ + public void addSecretKeyPath(String path) { + if(StringUtils.isNotBlank(path)) { + this.secretKeyPathSet.add(path); + } + } + + public void addSecretKeyPath(Set pathSet) { + if(pathSet != null) { + this.secretKeyPathSet.addAll(pathSet); + } + } + + public void setSecretKeyPathSet(Set keyPathSet) { + if(keyPathSet != null) { + this.secretKeyPathSet = keyPathSet; + } + } + + public boolean isSecretPath(String path) { + return this.secretKeyPathSet.contains(path); + } + + @SuppressWarnings("unchecked") + void getKeysRecursive(final Object current, String path, Set collect) { + boolean isRegularElement = !(current instanceof Map || current instanceof List); + if (isRegularElement) { + collect.add(path); + return; + } + + boolean isMap = current instanceof Map; + if (isMap) { + Map mapping = ((Map) current); + for (final String key : mapping.keySet()) { + if (StringUtils.isBlank(path)) { + getKeysRecursive(mapping.get(key), key.trim(), collect); + } else { + getKeysRecursive(mapping.get(key), path + "." 
+ key.trim(), + collect); + } + } + return; + } + + boolean isList = current instanceof List; + if (isList) { + List lists = (List) current; + for (int i = 0; i < lists.size(); i++) { + getKeysRecursive(lists.get(i), path + String.format("[%d]", i), + collect); + } + return; + } + + return; + } + + public Object getInternal() { + return this.root; + } + + private void setObject(final String path, final Object object) { + Object newRoot = setObjectRecursive(this.root, split2List(path), 0, + object); + + if (isSuitForRoot(newRoot)) { + this.root = newRoot; + return; + } + + throw DataXException.asDataXException(CommonErrorCode.RUNTIME_ERROR, + String.format("值[%s]无法适配您提供[%s], 该异常代表系统编程错误, 请联系DataX开发团队!", + ToStringBuilder.reflectionToString(object), path)); + } + + @SuppressWarnings("unchecked") + private Object extractConfiguration(final Object object) { + if (object instanceof Configuration) { + return extractFromConfiguration(object); + } + + if (object instanceof List) { + List result = new ArrayList(); + for (final Object each : (List) object) { + result.add(extractFromConfiguration(each)); + } + return result; + } + + if (object instanceof Map) { + Map result = new HashMap(); + for (final String key : ((Map) object).keySet()) { + result.put(key, + extractFromConfiguration(((Map) object) + .get(key))); + } + return result; + } + + return object; + } + + private Object extractFromConfiguration(final Object object) { + if (object instanceof Configuration) { + return ((Configuration) object).getInternal(); + } + + return object; + } + + Object buildObject(final List paths, final Object object) { + if (null == paths) { + throw DataXException.asDataXException( + CommonErrorCode.RUNTIME_ERROR, + "Path不能为null,该异常代表系统编程错误, 请联系DataX开发团队 !"); + } + + if (1 == paths.size() && StringUtils.isBlank(paths.get(0))) { + return object; + } + + Object child = object; + for (int i = paths.size() - 1; i >= 0; i--) { + String path = paths.get(i); + + if (isPathMap(path)) { + Map mapping = new HashMap(); + mapping.put(path, child); + child = mapping; + continue; + } + + if (isPathList(path)) { + List lists = new ArrayList( + this.getIndex(path) + 1); + expand(lists, this.getIndex(path) + 1); + lists.set(this.getIndex(path), child); + child = lists; + continue; + } + + throw DataXException.asDataXException( + CommonErrorCode.RUNTIME_ERROR, String.format( + "路径[%s]出现非法值类型[%s],该异常代表系统编程错误, 请联系DataX开发团队! 
.", + StringUtils.join(paths, "."), path)); + } + + return child; + } + + @SuppressWarnings("unchecked") + Object setObjectRecursive(Object current, final List paths, + int index, final Object value) { + + // 如果是已经超出path,我们就返回value即可,作为最底层叶子节点 + boolean isLastIndex = index == paths.size(); + if (isLastIndex) { + return value; + } + + String path = paths.get(index).trim(); + boolean isNeedMap = isPathMap(path); + if (isNeedMap) { + Map mapping; + + // 当前不是map,因此全部替换为map,并返回新建的map对象 + boolean isCurrentMap = current instanceof Map; + if (!isCurrentMap) { + mapping = new HashMap(); + mapping.put( + path, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return mapping; + } + + // 当前是map,但是没有对应的key,也就是我们需要新建对象插入该map,并返回该map + mapping = ((Map) current); + boolean hasSameKey = mapping.containsKey(path); + if (!hasSameKey) { + mapping.put( + path, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return mapping; + } + + // 当前是map,而且还竟然存在这个值,好吧,继续递归遍历 + current = mapping.get(path); + mapping.put(path, + setObjectRecursive(current, paths, index + 1, value)); + return mapping; + } + + boolean isNeedList = isPathList(path); + if (isNeedList) { + List lists; + int listIndexer = getIndex(path); + + // 当前是list,直接新建并返回即可 + boolean isCurrentList = current instanceof List; + if (!isCurrentList) { + lists = expand(new ArrayList(), listIndexer + 1); + lists.set( + listIndexer, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return lists; + } + + // 当前是list,但是对应的indexer是没有具体的值,也就是我们新建对象然后插入到该list,并返回该List + lists = (List) current; + lists = expand(lists, listIndexer + 1); + + boolean hasSameIndex = lists.get(listIndexer) != null; + if (!hasSameIndex) { + lists.set( + listIndexer, + buildObject(paths.subList(index + 1, paths.size()), + value)); + return lists; + } + + // 当前是list,并且存在对应的index,没有办法继续递归寻找 + current = lists.get(listIndexer); + lists.set(listIndexer, + setObjectRecursive(current, paths, index + 1, value)); + return lists; + } + + throw DataXException.asDataXException(CommonErrorCode.RUNTIME_ERROR, + "该异常代表系统编程错误, 请联系DataX开发团队 !"); + } + + private Object findObject(final String path) { + boolean isRootQuery = StringUtils.isBlank(path); + if (isRootQuery) { + return this.root; + } + + Object target = this.root; + + for (final String each : split2List(path)) { + if (isPathMap(each)) { + target = findObjectInMap(target, each); + continue; + } else { + target = findObjectInList(target, each); + continue; + } + } + + return target; + } + + @SuppressWarnings("unchecked") + private Object findObjectInMap(final Object target, final String index) { + boolean isMap = (target instanceof Map); + if (!isMap) { + throw new IllegalArgumentException(String.format( + "您提供的配置文件有误. 路径[%s]需要配置Json格式的Map对象,但该节点发现实际类型是[%s]. 请检查您的配置并作出修改.", + index, target.getClass().toString())); + } + + Object result = ((Map) target).get(index); + if (null == result) { + throw new IllegalArgumentException(String.format( + "您提供的配置文件有误. 路径[%s]值为null,datax无法识别该配置. 请检查您的配置并作出修改.", index)); + } + + return result; + } + + @SuppressWarnings({ "unchecked" }) + private Object findObjectInList(final Object target, final String each) { + boolean isList = (target instanceof List); + if (!isList) { + throw new IllegalArgumentException(String.format( + "您提供的配置文件有误. 路径[%s]需要配置Json格式的Map对象,但该节点发现实际类型是[%s]. 
请检查您的配置并作出修改.", + each, target.getClass().toString())); + } + + String index = each.replace("[", "").replace("]", ""); + if (!StringUtils.isNumeric(index)) { + throw new IllegalArgumentException( + String.format( + "系统编程错误,列表下标必须为数字类型,但该节点发现实际类型是[%s] ,该异常代表系统编程错误, 请联系DataX开发团队 !", + index)); + } + + return ((List) target).get(Integer.valueOf(index)); + } + + private List expand(List list, int size) { + int expand = size - list.size(); + while (expand-- > 0) { + list.add(null); + } + return list; + } + + private boolean isPathList(final String path) { + return path.contains("[") && path.contains("]"); + } + + private boolean isPathMap(final String path) { + return StringUtils.isNotBlank(path) && !isPathList(path); + } + + private int getIndex(final String index) { + return Integer.valueOf(index.replace("[", "").replace("]", "")); + } + + private boolean isSuitForRoot(final Object object) { + if (null != object && (object instanceof List || object instanceof Map)) { + return true; + } + + return false; + } + + private String split(final String path) { + return StringUtils.replace(path, "[", ".["); + } + + private List split2List(final String path) { + return Arrays.asList(StringUtils.split(split(path), ".")); + } + + private void checkPath(final String path) { + if (null == path) { + throw new IllegalArgumentException( + "系统编程错误, 该异常代表系统编程错误, 请联系DataX开发团队!."); + } + + for (final String each : StringUtils.split(".")) { + if (StringUtils.isBlank(each)) { + throw new IllegalArgumentException(String.format( + "系统编程错误, 路径[%s]不合法, 路径层次之间不能出现空白字符 .", path)); + } + } + } + + @SuppressWarnings("unused") + private String toJSONPath(final String path) { + return (StringUtils.isBlank(path) ? "$" : "$." + path).replace("$.[", + "$["); + } + + private static void checkJSON(final String json) { + if (StringUtils.isBlank(json)) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + "配置信息错误. 因为您提供的配置信息不是合法的JSON格式, JSON不能为空白. 请按照标准json格式提供配置信息. "); + } + } + + private Configuration(final String json) { + try { + this.root = JSON.parse(json); + } catch (Exception e) { + throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + String.format("配置信息错误. 您提供的配置信息不是合法的JSON格式: %s . 请按照标准json格式提供配置信息. ", e.getMessage())); + } + } + + private static String toJSONString(final Object object) { + return JSON.toJSONString(object); + } + + public Set getSecretKeyPathSet() { + return secretKeyPathSet; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/FilterUtil.java b/common/src/main/java/com/alibaba/datax/common/util/FilterUtil.java new file mode 100755 index 0000000000..37b319a194 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/FilterUtil.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.common.util; + +import java.util.*; +import java.util.regex.Pattern; + +/** + * 提供从 List 中根据 regular 过滤的通用工具(返回值已经去重). 
使用场景,比如:odpsreader + * 的分区筛选,hdfsreader/txtfilereader的路径筛选等 + */ +public final class FilterUtil { + + //已经去重 + public static List filterByRegular(List allStrs, + String regular) { + List matchedValues = new ArrayList(); + + // 语法习惯上的兼容处理(pt=* 实际正则应该是:pt=.*) + String newReqular = regular.replace(".*", "*").replace("*", ".*"); + + Pattern p = Pattern.compile(newReqular); + + for (String partition : allStrs) { + if (p.matcher(partition).matches()) { + if (!matchedValues.contains(partition)) { + matchedValues.add(partition); + } + } + } + + return matchedValues; + } + + //已经去重 + public static List filterByRegulars(List allStrs, + List regulars) { + List matchedValues = new ArrayList(); + + List tempMatched = null; + for (String regular : regulars) { + tempMatched = filterByRegular(allStrs, regular); + if (null != tempMatched && !tempMatched.isEmpty()) { + for (String temp : tempMatched) { + if (!matchedValues.contains(temp)) { + matchedValues.add(temp); + } + } + } + } + + return matchedValues; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/HostUtils.java b/common/src/main/java/com/alibaba/datax/common/util/HostUtils.java new file mode 100644 index 0000000000..2ed8f1019c --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/HostUtils.java @@ -0,0 +1,49 @@ +package com.alibaba.datax.common.util; + +import org.apache.commons.io.IOUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.net.InetAddress; +import java.net.UnknownHostException; + +/** + * Created by liqiang on 15/8/25. + */ +public class HostUtils { + + public static final String IP; + public static final String HOSTNAME; + private static final Logger log = LoggerFactory.getLogger(HostUtils.class); + + static { + String ip; + String hostname; + try { + InetAddress addr = InetAddress.getLocalHost(); + ip = addr.getHostAddress(); + hostname = addr.getHostName(); + } catch (UnknownHostException e) { + log.error("Can't find out address: " + e.getMessage()); + ip = "UNKNOWN"; + hostname = "UNKNOWN"; + } + if (ip.equals("127.0.0.1") || ip.equals("::1") || ip.equals("UNKNOWN")) { + try { + Process process = Runtime.getRuntime().exec("hostname -i"); + if (process.waitFor() == 0) { + ip = new String(IOUtils.toByteArray(process.getInputStream()), "UTF8"); + } + process = Runtime.getRuntime().exec("hostname"); + if (process.waitFor() == 0) { + hostname = (new String(IOUtils.toByteArray(process.getInputStream()), "UTF8")).trim(); + } + } catch (Exception e) { + log.warn("get hostname failed {}", e.getMessage()); + } + } + IP = ip; + HOSTNAME = hostname; + log.info("IP {} HOSTNAME {}", IP, HOSTNAME); + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/ListUtil.java b/common/src/main/java/com/alibaba/datax/common/util/ListUtil.java new file mode 100755 index 0000000000..d7a5b76462 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/ListUtil.java @@ -0,0 +1,139 @@ +package com.alibaba.datax.common.util; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang3.StringUtils; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; + +/** + * 提供针对 DataX中使用的 List 较为常见的一些封装。 比如:checkIfValueDuplicate 可以用于检查用户配置的 writer + * 的列不能重复。makeSureNoValueDuplicate亦然,只是会严格报错。 + */ +public final class ListUtil { + + public static boolean checkIfValueDuplicate(List aList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty()) { + 
throw DataXException.asDataXException(CommonErrorCode.CONFIG_ERROR, + "您提供的作业配置有误,List不能为空."); + } + + try { + makeSureNoValueDuplicate(aList, caseSensitive); + } catch (Exception e) { + return true; + } + return false; + } + + public static void makeSureNoValueDuplicate(List aList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + if (1 == aList.size()) { + return; + } else { + List list = null; + if (!caseSensitive) { + list = valueToLowerCase(aList); + } else { + list = new ArrayList(aList); + } + + Collections.sort(list); + + for (int i = 0, len = list.size() - 1; i < len; i++) { + if (list.get(i).equals(list.get(i + 1))) { + throw DataXException + .asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format( + "您提供的作业配置信息有误, String:[%s] 不允许重复出现在列表中: [%s].", + list.get(i), + StringUtils.join(aList, ","))); + } + } + } + } + + public static boolean checkIfBInA(List aList, List bList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty() || null == bList + || bList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + try { + makeSureBInA(aList, bList, caseSensitive); + } catch (Exception e) { + return false; + } + return true; + } + + public static void makeSureBInA(List aList, List bList, + boolean caseSensitive) { + if (null == aList || aList.isEmpty() || null == bList + || bList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + List all = null; + List part = null; + + if (!caseSensitive) { + all = valueToLowerCase(aList); + part = valueToLowerCase(bList); + } else { + all = new ArrayList(aList); + part = new ArrayList(bList); + } + + for (String oneValue : part) { + if (!all.contains(oneValue)) { + throw DataXException + .asDataXException( + CommonErrorCode.CONFIG_ERROR, + String.format( + "您提供的作业配置信息有误, String:[%s] 不存在于列表中:[%s].", + oneValue, StringUtils.join(aList, ","))); + } + } + + } + + public static boolean checkIfValueSame(List aList) { + if (null == aList || aList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + + if (1 == aList.size()) { + return true; + } else { + Boolean firstValue = aList.get(0); + for (int i = 1, len = aList.size(); i < len; i++) { + if (firstValue.booleanValue() != aList.get(i).booleanValue()) { + return false; + } + } + return true; + } + } + + public static List valueToLowerCase(List aList) { + if (null == aList || aList.isEmpty()) { + throw new IllegalArgumentException("您提供的作业配置有误, List不能为空."); + } + List result = new ArrayList(aList.size()); + for (String oneValue : aList) { + result.add(null != oneValue ? oneValue.toLowerCase() : null); + } + + return result; + } +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/RangeSplitUtil.java b/common/src/main/java/com/alibaba/datax/common/util/RangeSplitUtil.java new file mode 100755 index 0000000000..791f9ea12c --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/RangeSplitUtil.java @@ -0,0 +1,209 @@ +package com.alibaba.datax.common.util; + +import org.apache.commons.lang3.tuple.ImmutablePair; +import org.apache.commons.lang3.tuple.Pair; + +import java.math.BigInteger; +import java.util.*; + +/** + * 提供通用的根据数字范围、字符串范围等进行切分的通用功能. 
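+ * <p>
+ * 用法示意(非源码自带示例,仅依据当前方法签名与实现推测):
+ * <pre>
+ *     // 把闭区间 [0, 100] 按期望份数 4 切分,返回 5 个分界点 {0, 25, 50, 75, 100}
+ *     long[] points = RangeSplitUtil.doLongSplit(0L, 100L, 4);
+ *
+ *     // 字符串切分仅支持 ASCII,返回数组的首尾元素即传入的 left、right
+ *     String[] ranges = RangeSplitUtil.doAsciiStringSplit("a", "z", 3);
+ * </pre>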
+ */ +public final class RangeSplitUtil { + + public static String[] doAsciiStringSplit(String left, String right, int expectSliceNumber) { + int radix = 128; + + BigInteger[] tempResult = doBigIntegerSplit(stringToBigInteger(left, radix), + stringToBigInteger(right, radix), expectSliceNumber); + String[] result = new String[tempResult.length]; + + //处理第一个字符串(因为:在转换为数字,再还原的时候,如果首字符刚好是 basic,则不知道应该添加多少个 basic) + result[0] = left; + result[tempResult.length - 1] = right; + + for (int i = 1, len = tempResult.length - 1; i < len; i++) { + result[i] = bigIntegerToString(tempResult[i], radix); + } + + return result; + } + + + public static long[] doLongSplit(long left, long right, int expectSliceNumber) { + BigInteger[] result = doBigIntegerSplit(BigInteger.valueOf(left), + BigInteger.valueOf(right), expectSliceNumber); + long[] returnResult = new long[result.length]; + for (int i = 0, len = result.length; i < len; i++) { + returnResult[i] = result[i].longValue(); + } + return returnResult; + } + + public static BigInteger[] doBigIntegerSplit(BigInteger left, BigInteger right, int expectSliceNumber) { + if (expectSliceNumber < 1) { + throw new IllegalArgumentException(String.format( + "切分份数不能小于1. 此处:expectSliceNumber=[%s].", expectSliceNumber)); + } + + if (null == left || null == right) { + throw new IllegalArgumentException(String.format( + "对 BigInteger 进行切分时,其左右区间不能为 null. 此处:left=[%s],right=[%s].", left, right)); + } + + if (left.compareTo(right) == 0) { + return new BigInteger[]{left, right}; + } else { + // 调整大小顺序,确保 left < right + if (left.compareTo(right) > 0) { + BigInteger temp = left; + left = right; + right = temp; + } + + //left < right + BigInteger endAndStartGap = right.subtract(left); + + BigInteger step = endAndStartGap.divide(BigInteger.valueOf(expectSliceNumber)); + BigInteger remainder = endAndStartGap.remainder(BigInteger.valueOf(expectSliceNumber)); + + //remainder 不可能超过expectSliceNumber,所以不需要检查remainder的 Integer 的范围 + + // 这里不能 step.intValue()==0,因为可能溢出 + if (step.compareTo(BigInteger.ZERO) == 0) { + expectSliceNumber = remainder.intValue(); + } + + BigInteger[] result = new BigInteger[expectSliceNumber + 1]; + result[0] = left; + result[expectSliceNumber] = right; + + BigInteger lowerBound; + BigInteger upperBound = left; + for (int i = 1; i < expectSliceNumber; i++) { + lowerBound = upperBound; + upperBound = lowerBound.add(step); + upperBound = upperBound.add((remainder.compareTo(BigInteger.valueOf(i)) >= 0) + ? 
BigInteger.ONE : BigInteger.ZERO); + result[i] = upperBound; + } + + return result; + } + } + + private static void checkIfBetweenRange(int value, int left, int right) { + if (value < left || value > right) { + throw new IllegalArgumentException(String.format("parameter can not <[%s] or >[%s].", + left, right)); + } + } + + /** + * 由于只支持 ascii 码对应字符,所以radix 范围为[1,128] + */ + public static BigInteger stringToBigInteger(String aString, int radix) { + if (null == aString) { + throw new IllegalArgumentException("参数 bigInteger 不能为空."); + } + + checkIfBetweenRange(radix, 1, 128); + + BigInteger result = BigInteger.ZERO; + BigInteger radixBigInteger = BigInteger.valueOf(radix); + + int tempChar; + int k = 0; + + for (int i = aString.length() - 1; i >= 0; i--) { + tempChar = aString.charAt(i); + if (tempChar >= 128) { + throw new IllegalArgumentException(String.format("根据字符串进行切分时仅支持 ASCII 字符串,而字符串:[%s]非 ASCII 字符串.", aString)); + } + result = result.add(BigInteger.valueOf(tempChar).multiply(radixBigInteger.pow(k))); + k++; + } + + return result; + } + + /** + * 把BigInteger 转换为 String.注意:radix 和 basic 范围都为[1,128], radix + basic 的范围也必须在[1,128]. + */ + private static String bigIntegerToString(BigInteger bigInteger, int radix) { + if (null == bigInteger) { + throw new IllegalArgumentException("参数 bigInteger 不能为空."); + } + + checkIfBetweenRange(radix, 1, 128); + + StringBuilder resultStringBuilder = new StringBuilder(); + + List list = new ArrayList(); + BigInteger radixBigInteger = BigInteger.valueOf(radix); + BigInteger currentValue = bigInteger; + + BigInteger quotient = currentValue.divide(radixBigInteger); + while (quotient.compareTo(BigInteger.ZERO) > 0) { + list.add(currentValue.remainder(radixBigInteger).intValue()); + currentValue = currentValue.divide(radixBigInteger); + quotient = currentValue; + } + Collections.reverse(list); + + if (list.isEmpty()) { + list.add(0, bigInteger.remainder(radixBigInteger).intValue()); + } + + Map map = new HashMap(); + for (int i = 0; i < radix; i++) { + map.put(i, (char) (i)); + } + +// String msg = String.format("%s 转为 %s 进制,结果为:%s", bigInteger.longValue(), radix, list); +// System.out.println(msg); + + for (Integer aList : list) { + resultStringBuilder.append(map.get(aList)); + } + + return resultStringBuilder.toString(); + } + + /** + * 获取字符串中的最小字符和最大字符(依据 ascii 进行判断).要求字符串必须非空,并且为 ascii 字符串. + * 返回的Pair,left=最小字符,right=最大字符. + */ + public static Pair getMinAndMaxCharacter(String aString) { + if (!isPureAscii(aString)) { + throw new IllegalArgumentException(String.format("根据字符串进行切分时仅支持 ASCII 字符串,而字符串:[%s]非 ASCII 字符串.", aString)); + } + + char min = aString.charAt(0); + char max = min; + + char temp; + for (int i = 1, len = aString.length(); i < len; i++) { + temp = aString.charAt(i); + min = min < temp ? min : temp; + max = max > temp ? 
max : temp; + } + + return new ImmutablePair(min, max); + } + + private static boolean isPureAscii(String aString) { + if (null == aString) { + return false; + } + + for (int i = 0, len = aString.length(); i < len; i++) { + char ch = aString.charAt(i); + if (ch >= 127 || ch < 0) { + return false; + } + } + return true; + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/RetryUtil.java b/common/src/main/java/com/alibaba/datax/common/util/RetryUtil.java new file mode 100755 index 0000000000..33c712874b --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/RetryUtil.java @@ -0,0 +1,208 @@ +package com.alibaba.datax.common.util; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; +import java.util.concurrent.*; + +public final class RetryUtil { + + private static final Logger LOG = LoggerFactory.getLogger(RetryUtil.class); + + private static final long MAX_SLEEP_MILLISECOND = 256 * 1000; + + /** + * 重试次数工具方法. + * + * @param callable 实际逻辑 + * @param retryTimes 最大重试次数(>1) + * @param sleepTimeInMilliSecond 运行失败后休眠对应时间再重试 + * @param exponential 休眠时间是否指数递增 + * @param 返回值类型 + * @return 经过重试的callable的执行结果 + */ + public static T executeWithRetry(Callable callable, + int retryTimes, + long sleepTimeInMilliSecond, + boolean exponential) throws Exception { + Retry retry = new Retry(); + return retry.doRetry(callable, retryTimes, sleepTimeInMilliSecond, exponential, null); + } + + /** + * 重试次数工具方法. + * + * @param callable 实际逻辑 + * @param retryTimes 最大重试次数(>1) + * @param sleepTimeInMilliSecond 运行失败后休眠对应时间再重试 + * @param exponential 休眠时间是否指数递增 + * @param 返回值类型 + * @param retryExceptionClasss 出现指定的异常类型时才进行重试 + * @return 经过重试的callable的执行结果 + */ + public static T executeWithRetry(Callable callable, + int retryTimes, + long sleepTimeInMilliSecond, + boolean exponential, + List> retryExceptionClasss) throws Exception { + Retry retry = new Retry(); + return retry.doRetry(callable, retryTimes, sleepTimeInMilliSecond, exponential, retryExceptionClasss); + } + + /** + * 在外部线程执行并且重试。每次执行需要在timeoutMs内执行完,不然视为失败。 + * 执行异步操作的线程池从外部传入,线程池的共享粒度由外部控制。比如,HttpClientUtil共享一个线程池。 + *

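+     * 调用示意(非源码自带示例,queryOnce 为假设的业务方法,仅用于说明参数含义):
+     * <pre>
+     *     Object ret = RetryUtil.asyncExecuteWithRetry(new Callable() {
+     *         public Object call() throws Exception {
+     *             return queryOnce();
+     *         }
+     *     }, 3, 1000L, true, 5000L, RetryUtil.createThreadPoolExecutor());
+     *     // 最多执行 3 次;失败后按 1s、2s 指数递增休眠;单次执行超时 5000ms
+     * </pre>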
+ * 限制条件:仅仅能够在阻塞的时候interrupt线程 + * + * @param callable 实际逻辑 + * @param retryTimes 最大重试次数(>1) + * @param sleepTimeInMilliSecond 运行失败后休眠对应时间再重试 + * @param exponential 休眠时间是否指数递增 + * @param timeoutMs callable执行超时时间,毫秒 + * @param executor 执行异步操作的线程池 + * @param 返回值类型 + * @return 经过重试的callable的执行结果 + */ + public static T asyncExecuteWithRetry(Callable callable, + int retryTimes, + long sleepTimeInMilliSecond, + boolean exponential, + long timeoutMs, + ThreadPoolExecutor executor) throws Exception { + Retry retry = new AsyncRetry(timeoutMs, executor); + return retry.doRetry(callable, retryTimes, sleepTimeInMilliSecond, exponential, null); + } + + /** + * 创建异步执行的线程池。特性如下: + * core大小为0,初始状态下无线程,无初始消耗。 + * max大小为5,最多五个线程。 + * 60秒超时时间,闲置超过60秒线程会被回收。 + * 使用SynchronousQueue,任务不会排队,必须要有可用线程才能提交成功,否则会RejectedExecutionException。 + * + * @return 线程池 + */ + public static ThreadPoolExecutor createThreadPoolExecutor() { + return new ThreadPoolExecutor(0, 5, + 60L, TimeUnit.SECONDS, + new SynchronousQueue()); + } + + + private static class Retry { + + public T doRetry(Callable callable, int retryTimes, long sleepTimeInMilliSecond, boolean exponential, List> retryExceptionClasss) + throws Exception { + + if (null == callable) { + throw new IllegalArgumentException("系统编程错误, 入参callable不能为空 ! "); + } + + if (retryTimes < 1) { + throw new IllegalArgumentException(String.format( + "系统编程错误, 入参retrytime[%d]不能小于1 !", retryTimes)); + } + + Exception saveException = null; + for (int i = 0; i < retryTimes; i++) { + try { + return call(callable); + } catch (Exception e) { + saveException = e; + if (i == 0) { + LOG.error(String.format("Exception when calling callable, 异常Msg:%s", saveException.getMessage()), saveException); + } + + if (null != retryExceptionClasss && !retryExceptionClasss.isEmpty()) { + boolean needRetry = false; + for (Class eachExceptionClass : retryExceptionClasss) { + if (eachExceptionClass == e.getClass()) { + needRetry = true; + break; + } + } + if (!needRetry) { + throw saveException; + } + } + + if (i + 1 < retryTimes && sleepTimeInMilliSecond > 0) { + long startTime = System.currentTimeMillis(); + + long timeToSleep; + if (exponential) { + timeToSleep = sleepTimeInMilliSecond * (long) Math.pow(2, i); + if(timeToSleep >= MAX_SLEEP_MILLISECOND) { + timeToSleep = MAX_SLEEP_MILLISECOND; + } + } else { + timeToSleep = sleepTimeInMilliSecond; + if(timeToSleep >= MAX_SLEEP_MILLISECOND) { + timeToSleep = MAX_SLEEP_MILLISECOND; + } + } + + try { + Thread.sleep(timeToSleep); + } catch (InterruptedException ignored) { + } + + long realTimeSleep = System.currentTimeMillis()-startTime; + + LOG.error(String.format("Exception when calling callable, 即将尝试执行第%s次重试.本次重试计划等待[%s]ms,实际等待[%s]ms, 异常Msg:[%s]", + i+1, timeToSleep,realTimeSleep, e.getMessage())); + + } + } + } + throw saveException; + } + + protected T call(Callable callable) throws Exception { + return callable.call(); + } + } + + private static class AsyncRetry extends Retry { + + private long timeoutMs; + private ThreadPoolExecutor executor; + + public AsyncRetry(long timeoutMs, ThreadPoolExecutor executor) { + this.timeoutMs = timeoutMs; + this.executor = executor; + } + + /** + * 使用传入的线程池异步执行任务,并且等待。 + *

+ * future.get()方法,等待指定的毫秒数。如果任务在超时时间内结束,则正常返回。 + * 如果抛异常(可能是执行超时、执行异常、被其他线程cancel或interrupt),都记录日志并且网上抛异常。 + * 正常和非正常的情况都会判断任务是否结束,如果没有结束,则cancel任务。cancel参数为true,表示即使 + * 任务正在执行,也会interrupt线程。 + * + * @param callable + * @param + * @return + * @throws Exception + */ + @Override + protected T call(Callable callable) throws Exception { + Future future = executor.submit(callable); + try { + return future.get(timeoutMs, TimeUnit.MILLISECONDS); + } catch (Exception e) { + LOG.warn("Try once failed", e); + throw e; + } finally { + if (!future.isDone()) { + future.cancel(true); + LOG.warn("Try once task not done, cancel it, active count: " + executor.getActiveCount()); + } + } + } + } + +} diff --git a/common/src/main/java/com/alibaba/datax/common/util/StrUtil.java b/common/src/main/java/com/alibaba/datax/common/util/StrUtil.java new file mode 100755 index 0000000000..82222b0d48 --- /dev/null +++ b/common/src/main/java/com/alibaba/datax/common/util/StrUtil.java @@ -0,0 +1,85 @@ +package com.alibaba.datax.common.util; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; + +import java.text.DecimalFormat; +import java.util.HashMap; +import java.util.Map; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class StrUtil { + + private final static long KB_IN_BYTES = 1024; + + private final static long MB_IN_BYTES = 1024 * KB_IN_BYTES; + + private final static long GB_IN_BYTES = 1024 * MB_IN_BYTES; + + private final static long TB_IN_BYTES = 1024 * GB_IN_BYTES; + + private final static DecimalFormat df = new DecimalFormat("0.00"); + + private static final Pattern VARIABLE_PATTERN = Pattern + .compile("(\\$)\\{?(\\w+)\\}?"); + + private static String SYSTEM_ENCODING = System.getProperty("file.encoding"); + + static { + if (SYSTEM_ENCODING == null) { + SYSTEM_ENCODING = "UTF-8"; + } + } + + private StrUtil() { + } + + public static String stringify(long byteNumber) { + if (byteNumber / TB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) TB_IN_BYTES) + "TB"; + } else if (byteNumber / GB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) GB_IN_BYTES) + "GB"; + } else if (byteNumber / MB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) MB_IN_BYTES) + "MB"; + } else if (byteNumber / KB_IN_BYTES > 0) { + return df.format((double) byteNumber / (double) KB_IN_BYTES) + "KB"; + } else { + return String.valueOf(byteNumber) + "B"; + } + } + + + public static String replaceVariable(final String param) { + Map mapping = new HashMap(); + + Matcher matcher = VARIABLE_PATTERN.matcher(param); + while (matcher.find()) { + String variable = matcher.group(2); + String value = System.getProperty(variable); + if (StringUtils.isBlank(value)) { + value = matcher.group(); + } + mapping.put(matcher.group(), value); + } + + String retString = param; + for (final String key : mapping.keySet()) { + retString = retString.replace(key, mapping.get(key)); + } + + return retString; + } + + public static String compressMiddle(String s, int headLength, int tailLength) { + Validate.notNull(s, "Input string must not be null"); + Validate.isTrue(headLength > 0, "Head length must be larger than 0"); + Validate.isTrue(tailLength > 0, "Tail length must be larger than 0"); + + if(headLength + tailLength >= s.length()) { + return s; + } + return s.substring(0, headLength) + "..." 
+ s.substring(s.length() - tailLength); + } + +} diff --git a/core/pom.xml b/core/pom.xml new file mode 100755 index 0000000000..5582d943d1 --- /dev/null +++ b/core/pom.xml @@ -0,0 +1,150 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + datax-core + datax-core + jar + + + + com.alibaba.datax + datax-transformer + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + commons-configuration + commons-configuration + ${commons-configuration-version} + + + commons-cli + commons-cli + ${commons-cli-version} + + + commons-beanutils + commons-beanutils + 1.9.2 + + + org.apache.httpcomponents + httpclient + 4.4 + + + org.apache.httpcomponents + fluent-hc + 4.4 + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + org.codehaus.janino + janino + 2.5.16 + + + + junit + junit + test + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + org.apache.commons + commons-lang3 + 3.3.2 + + + org.codehaus.groovy + groovy-all + 2.1.9 + + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + + com.alibaba.datax.core.Engine + + + + + + + maven-assembly-plugin + + + + com.alibaba.datax.core.Engine + + + datax + + src/main/assembly/package.xml + + + + + + package + + single + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + diff --git a/core/src/main/assembly/package.xml b/core/src/main/assembly/package.xml new file mode 100755 index 0000000000..7369f56351 --- /dev/null +++ b/core/src/main/assembly/package.xml @@ -0,0 +1,98 @@ + + + + dir + + false + + + + src/main/bin + + *.* + + + *.pyc + + 775 + /bin + + + + src/main/script + + *.* + + 775 + /script + + + + src/main/conf + + *.* + + /conf + + + + target/ + + datax-core-0.0.1-SNAPSHOT.jar + + /lib + + + + + + + + + + + + + + + + + + + + src/main/job/ + + *.json + + /job + + + + src/main/tools/ + + *.* + + /tools + + + + 777 + src/main/tmp + + *.* + + /tmp + + + + + + false + /lib + runtime + + + diff --git a/core/src/main/bin/datax.py b/core/src/main/bin/datax.py new file mode 100755 index 0000000000..1099ed3a08 --- /dev/null +++ b/core/src/main/bin/datax.py @@ -0,0 +1,227 @@ +#!/usr/bin/env python +# -*- coding:utf-8 -*- + +import sys +import os +import signal +import subprocess +import time +import re +import socket +import json +from optparse import OptionParser +from optparse import OptionGroup +from string import Template +import codecs +import platform + +def isWindows(): + return platform.system() == 'Windows' + +DATAX_HOME = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) + +DATAX_VERSION = 'DATAX-OPENSOURCE-3.0' +if isWindows(): + codecs.register(lambda name: name == 'cp65001' and codecs.lookup('utf-8') or None) + CLASS_PATH = ("%s/lib/*") % (DATAX_HOME) +else: + CLASS_PATH = ("%s/lib/*:.") % (DATAX_HOME) +LOGBACK_FILE = ("%s/conf/logback.xml") % (DATAX_HOME) +DEFAULT_JVM = "-Xms1g -Xmx1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME) +DEFAULT_PROPERTY_CONF = "-Dfile.encoding=UTF-8 -Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener -Djava.security.egd=file:///dev/urandom -Ddatax.home=%s -Dlogback.configurationFile=%s" % ( + DATAX_HOME, LOGBACK_FILE) +ENGINE_COMMAND = "java -server ${jvm} %s -classpath %s ${params} com.alibaba.datax.core.Engine -mode ${mode} -jobid ${jobid} -job ${job}" % ( + DEFAULT_PROPERTY_CONF, CLASS_PATH) +REMOTE_DEBUG_CONFIG = "-Xdebug 
-Xrunjdwp:transport=dt_socket,server=y,address=9999" + +RET_STATE = { + "KILL": 143, + "FAIL": -1, + "OK": 0, + "RUN": 1, + "RETRY": 2 +} + + +def getLocalIp(): + try: + return socket.gethostbyname(socket.getfqdn(socket.gethostname())) + except: + return "Unknown" + + +def suicide(signum, e): + global child_process + print >> sys.stderr, "[Error] DataX receive unexpected signal %d, starts to suicide." % (signum) + + if child_process: + child_process.send_signal(signal.SIGQUIT) + time.sleep(1) + child_process.kill() + print >> sys.stderr, "DataX Process was killed ! you did ?" + sys.exit(RET_STATE["KILL"]) + + +def register_signal(): + if not isWindows(): + global child_process + signal.signal(2, suicide) + signal.signal(3, suicide) + signal.signal(15, suicide) + + +def getOptionParser(): + usage = "usage: %prog [options] job-url-or-path" + parser = OptionParser(usage=usage) + + prodEnvOptionGroup = OptionGroup(parser, "Product Env Options", + "Normal user use these options to set jvm parameters, job runtime mode etc. " + "Make sure these options can be used in Product Env.") + prodEnvOptionGroup.add_option("-j", "--jvm", metavar="", dest="jvmParameters", action="store", + default=DEFAULT_JVM, help="Set jvm parameters if necessary.") + prodEnvOptionGroup.add_option("--jobid", metavar="", dest="jobid", action="store", default="-1", + help="Set job unique id when running by Distribute/Local Mode.") + prodEnvOptionGroup.add_option("-m", "--mode", metavar="", + action="store", default="standalone", + help="Set job runtime mode such as: standalone, local, distribute. " + "Default mode is standalone.") + prodEnvOptionGroup.add_option("-p", "--params", metavar="", + action="store", dest="params", + help='Set job parameter, eg: the source tableName you want to set it by command, ' + 'then you can use like this: -p"-DtableName=your-table-name", ' + 'if you have mutiple parameters: -p"-DtableName=your-table-name -DcolumnName=your-column-name".' 
+ 'Note: you should config in you job tableName with ${tableName}.') + prodEnvOptionGroup.add_option("-r", "--reader", metavar="", + action="store", dest="reader",type="string", + help='View job config[reader] template, eg: mysqlreader,streamreader') + prodEnvOptionGroup.add_option("-w", "--writer", metavar="", + action="store", dest="writer",type="string", + help='View job config[writer] template, eg: mysqlwriter,streamwriter') + parser.add_option_group(prodEnvOptionGroup) + + devEnvOptionGroup = OptionGroup(parser, "Develop/Debug Options", + "Developer use these options to trace more details of DataX.") + devEnvOptionGroup.add_option("-d", "--debug", dest="remoteDebug", action="store_true", + help="Set to remote debug mode.") + devEnvOptionGroup.add_option("--loglevel", metavar="", dest="loglevel", action="store", + default="info", help="Set log level such as: debug, info, all etc.") + parser.add_option_group(devEnvOptionGroup) + return parser + +def generateJobConfigTemplate(reader, writer): + readerRef = "Please refer to the %s document:\n https://github.com/alibaba/DataX/blob/master/%s/doc/%s.md \n" % (reader,reader,reader) + writerRef = "Please refer to the %s document:\n https://github.com/alibaba/DataX/blob/master/%s/doc/%s.md \n " % (writer,writer,writer) + print readerRef + print writerRef + jobGuid = 'Please save the following configuration as a json file and use\n python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json \nto run the job.\n' + print jobGuid + jobTemplate={ + "job": { + "setting": { + "speed": { + "channel": "" + } + }, + "content": [ + { + "reader": {}, + "writer": {} + } + ] + } + } + readerTemplatePath = "%s/plugin/reader/%s/plugin_job_template.json" % (DATAX_HOME,reader) + writerTemplatePath = "%s/plugin/writer/%s/plugin_job_template.json" % (DATAX_HOME,writer) + try: + readerPar = readPluginTemplate(readerTemplatePath); + except Exception, e: + print "Read reader[%s] template error: can\'t find file %s" % (reader,readerTemplatePath) + try: + writerPar = readPluginTemplate(writerTemplatePath); + except Exception, e: + print "Read writer[%s] template error: : can\'t find file %s" % (writer,writerTemplatePath) + jobTemplate['job']['content'][0]['reader'] = readerPar; + jobTemplate['job']['content'][0]['writer'] = writerPar; + print json.dumps(jobTemplate, indent=4, sort_keys=True) + +def readPluginTemplate(plugin): + with open(plugin, 'r') as f: + return json.load(f) + +def isUrl(path): + if not path: + return False + + assert (isinstance(path, str)) + m = re.match(r"^http[s]?://\S+\w*", path.lower()) + if m: + return True + else: + return False + + +def buildStartCommand(options, args): + commandMap = {} + tempJVMCommand = DEFAULT_JVM + if options.jvmParameters: + tempJVMCommand = tempJVMCommand + " " + options.jvmParameters + + if options.remoteDebug: + tempJVMCommand = tempJVMCommand + " " + REMOTE_DEBUG_CONFIG + print 'local ip: ', getLocalIp() + + if options.loglevel: + tempJVMCommand = tempJVMCommand + " " + ("-Dloglevel=%s" % (options.loglevel)) + + if options.mode: + commandMap["mode"] = options.mode + + # jobResource 可能是 URL,也可能是本地文件路径(相对,绝对) + jobResource = args[0] + if not isUrl(jobResource): + jobResource = os.path.abspath(jobResource) + if jobResource.lower().startswith("file://"): + jobResource = jobResource[len("file://"):] + + jobParams = ("-Dlog.file.name=%s") % (jobResource[-20:].replace('/', '_').replace('.', '_')) + if options.params: + jobParams = jobParams + " " + options.params + + if options.jobid: + commandMap["jobid"] = options.jobid + + 
commandMap["jvm"] = tempJVMCommand + commandMap["params"] = jobParams + commandMap["job"] = jobResource + + return Template(ENGINE_COMMAND).substitute(**commandMap) + + +def printCopyright(): + print ''' +DataX (%s), From Alibaba ! +Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. + +''' % DATAX_VERSION + sys.stdout.flush() + + +if __name__ == "__main__": + printCopyright() + parser = getOptionParser() + options, args = parser.parse_args(sys.argv[1:]) + if options.reader is not None and options.writer is not None: + generateJobConfigTemplate(options.reader,options.writer) + sys.exit(RET_STATE['OK']) + if len(args) != 1: + parser.print_help() + sys.exit(RET_STATE['FAIL']) + + startCommand = buildStartCommand(options, args) + # print startCommand + + child_process = subprocess.Popen(startCommand, shell=True) + register_signal() + (stdout, stderr) = child_process.communicate() + + sys.exit(child_process.returncode) diff --git a/core/src/main/bin/dxprof.py b/core/src/main/bin/dxprof.py new file mode 100644 index 0000000000..181bf90085 --- /dev/null +++ b/core/src/main/bin/dxprof.py @@ -0,0 +1,191 @@ +#! /usr/bin/env python +# vim: set expandtab tabstop=4 shiftwidth=4 foldmethod=marker nu: + +import re +import sys +import time + +REG_SQL_WAKE = re.compile(r'Begin\s+to\s+read\s+record\s+by\s+Sql', re.IGNORECASE) +REG_SQL_DONE = re.compile(r'Finished\s+read\s+record\s+by\s+Sql', re.IGNORECASE) +REG_SQL_PATH = re.compile(r'from\s+(\w+)(\s+where|\s*$)', re.IGNORECASE) +REG_SQL_JDBC = re.compile(r'jdbcUrl:\s*\[(.+?)\]', re.IGNORECASE) +REG_SQL_UUID = re.compile(r'(\d+\-)+reader') +REG_COMMIT_UUID = re.compile(r'(\d+\-)+writer') +REG_COMMIT_WAKE = re.compile(r'begin\s+to\s+commit\s+blocks', re.IGNORECASE) +REG_COMMIT_DONE = re.compile(r'commit\s+blocks\s+ok', re.IGNORECASE) + +# {{{ function parse_timestamp() # +def parse_timestamp(line): + try: + ts = int(time.mktime(time.strptime(line[0:19], '%Y-%m-%d %H:%M:%S'))) + except: + ts = 0 + + return ts + +# }}} # + +# {{{ function parse_query_host() # +def parse_query_host(line): + ori = REG_SQL_JDBC.search(line) + if (not ori): + return '' + + ori = ori.group(1).split('?')[0] + off = ori.find('@') + if (off > -1): + ori = ori[off+1:len(ori)] + else: + off = ori.find('//') + if (off > -1): + ori = ori[off+2:len(ori)] + + return ori.lower() +# }}} # + +# {{{ function parse_query_table() # +def parse_query_table(line): + ori = REG_SQL_PATH.search(line) + return (ori and ori.group(1).lower()) or '' +# }}} # + +# {{{ function parse_reader_task() # +def parse_task(fname): + global LAST_SQL_UUID + global LAST_COMMIT_UUID + global DATAX_JOBDICT + global DATAX_JOBDICT_COMMIT + global UNIXTIME + LAST_SQL_UUID = '' + DATAX_JOBDICT = {} + LAST_COMMIT_UUID = '' + DATAX_JOBDICT_COMMIT = {} + + UNIXTIME = int(time.time()) + with open(fname, 'r') as f: + for line in f.readlines(): + line = line.strip() + + if (LAST_SQL_UUID and (LAST_SQL_UUID in DATAX_JOBDICT)): + DATAX_JOBDICT[LAST_SQL_UUID]['host'] = parse_query_host(line) + LAST_SQL_UUID = '' + + if line.find('CommonRdbmsReader$Task') > 0: + parse_read_task(line) + elif line.find('commit blocks') > 0: + parse_write_task(line) + else: + continue +# }}} # + +# {{{ function parse_read_task() # +def parse_read_task(line): + ser = REG_SQL_UUID.search(line) + if not ser: + return + + LAST_SQL_UUID = ser.group() + if REG_SQL_WAKE.search(line): + DATAX_JOBDICT[LAST_SQL_UUID] = { + 'stat' : 'R', + 'wake' : parse_timestamp(line), + 'done' : UNIXTIME, + 'host' : parse_query_host(line), + 'path' : 
parse_query_table(line) + } + elif ((LAST_SQL_UUID in DATAX_JOBDICT) and REG_SQL_DONE.search(line)): + DATAX_JOBDICT[LAST_SQL_UUID]['stat'] = 'D' + DATAX_JOBDICT[LAST_SQL_UUID]['done'] = parse_timestamp(line) +# }}} # + +# {{{ function parse_write_task() # +def parse_write_task(line): + ser = REG_COMMIT_UUID.search(line) + if not ser: + return + + LAST_COMMIT_UUID = ser.group() + if REG_COMMIT_WAKE.search(line): + DATAX_JOBDICT_COMMIT[LAST_COMMIT_UUID] = { + 'stat' : 'R', + 'wake' : parse_timestamp(line), + 'done' : UNIXTIME, + } + elif ((LAST_COMMIT_UUID in DATAX_JOBDICT_COMMIT) and REG_COMMIT_DONE.search(line)): + DATAX_JOBDICT_COMMIT[LAST_COMMIT_UUID]['stat'] = 'D' + DATAX_JOBDICT_COMMIT[LAST_COMMIT_UUID]['done'] = parse_timestamp(line) +# }}} # + +# {{{ function result_analyse() # +def result_analyse(): + def compare(a, b): + return b['cost'] - a['cost'] + + tasklist = [] + hostsmap = {} + statvars = {'sum' : 0, 'cnt' : 0, 'svr' : 0, 'max' : 0, 'min' : int(time.time())} + tasklist_commit = [] + statvars_commit = {'sum' : 0, 'cnt' : 0} + + for idx in DATAX_JOBDICT: + item = DATAX_JOBDICT[idx] + item['uuid'] = idx; + item['cost'] = item['done'] - item['wake'] + tasklist.append(item); + + if (not (item['host'] in hostsmap)): + hostsmap[item['host']] = 1 + statvars['svr'] += 1 + + if (item['cost'] > -1 and item['cost'] < 864000): + statvars['sum'] += item['cost'] + statvars['cnt'] += 1 + statvars['max'] = max(statvars['max'], item['done']) + statvars['min'] = min(statvars['min'], item['wake']) + + for idx in DATAX_JOBDICT_COMMIT: + itemc = DATAX_JOBDICT_COMMIT[idx] + itemc['uuid'] = idx + itemc['cost'] = itemc['done'] - itemc['wake'] + tasklist_commit.append(itemc) + + if (itemc['cost'] > -1 and itemc['cost'] < 864000): + statvars_commit['sum'] += itemc['cost'] + statvars_commit['cnt'] += 1 + + ttl = (statvars['max'] - statvars['min']) or 1 + idx = float(statvars['cnt']) / (statvars['sum'] or ttl) + + tasklist.sort(compare) + for item in tasklist: + print '%s\t%s.%s\t%s\t%s\t% 4d\t% 2.1f%%\t% .2f' %(item['stat'], item['host'], item['path'], + time.strftime('%H:%M:%S', time.localtime(item['wake'])), + (('D' == item['stat']) and time.strftime('%H:%M:%S', time.localtime(item['done']))) or '--', + item['cost'], 100 * item['cost'] / ttl, idx * item['cost']) + + if (not len(tasklist) or not statvars['cnt']): + return + + print '\n--- DataX Profiling Statistics ---' + print '%d task(s) on %d server(s), Total elapsed %d second(s), %.2f second(s) per task in average' %(statvars['cnt'], + statvars['svr'], statvars['sum'], float(statvars['sum']) / statvars['cnt']) + print 'Actually cost %d second(s) (%s - %s), task concurrency: %.2f, tilt index: %.2f' %(ttl, + time.strftime('%H:%M:%S', time.localtime(statvars['min'])), + time.strftime('%H:%M:%S', time.localtime(statvars['max'])), + float(statvars['sum']) / ttl, idx * tasklist[0]['cost']) + + idx_commit = float(statvars_commit['cnt']) / (statvars_commit['sum'] or ttl) + tasklist_commit.sort(compare) + print '%d task(s) done odps comit, Total elapsed %d second(s), %.2f second(s) per task in average, tilt index: %.2f' % ( + statvars_commit['cnt'], + statvars_commit['sum'], float(statvars_commit['sum']) / statvars_commit['cnt'], + idx_commit * tasklist_commit[0]['cost']) + +# }}} # + +if (len(sys.argv) < 2): + print "Usage: %s filename" %(sys.argv[0]) + quit(1) +else: + parse_task(sys.argv[1]) + result_analyse() \ No newline at end of file diff --git a/core/src/main/bin/perftrace.py b/core/src/main/bin/perftrace.py new file mode 100755 index 
0000000000..41a1ecb305 --- /dev/null +++ b/core/src/main/bin/perftrace.py @@ -0,0 +1,400 @@ +#!/usr/bin/env python +# -*- coding:utf-8 -*- + + +""" + Life's short, Python more. +""" + +import re +import os +import sys +import json +import uuid +import signal +import time +import subprocess +from optparse import OptionParser +reload(sys) +sys.setdefaultencoding('utf8') + +##begin cli & help logic +def getOptionParser(): + usage = getUsage() + parser = OptionParser(usage = usage) + #rdbms reader and writer + parser.add_option('-r', '--reader', action='store', dest='reader', help='trace datasource read performance with specified !json! string') + parser.add_option('-w', '--writer', action='store', dest='writer', help='trace datasource write performance with specified !json! string') + + parser.add_option('-c', '--channel', action='store', dest='channel', default='1', help='the number of concurrent sync thread, the default is 1') + parser.add_option('-f', '--file', action='store', help='existing datax configuration file, include reader and writer params') + parser.add_option('-t', '--type', action='store', default='reader', help='trace which side\'s performance, cooperate with -f --file params, need to be reader or writer') + parser.add_option('-d', '--delete', action='store', default='true', help='delete temporary files, the default value is true') + #parser.add_option('-h', '--help', action='store', default='true', help='print usage information') + return parser + +def getUsage(): + return ''' +The following params are available for -r --reader: + [these params is for rdbms reader, used to trace rdbms read performance, it's like datax's key] + *datasourceType: datasource type, may be mysql|drds|oracle|ads|sqlserver|postgresql|db2 etc... + *jdbcUrl: datasource jdbc connection string, mysql as a example: jdbc:mysql://ip:port/database + *username: username for datasource + *password: password for datasource + *table: table name for read data + column: column to be read, the default value is ['*'] + splitPk: the splitPk column of rdbms table + where: limit the scope of the performance data set + fetchSize: how many rows to be fetched at each communicate + + [these params is for stream reader, used to trace rdbms write performance] + reader-sliceRecordCount: how man test data to mock(each channel), the default value is 10000 + reader-column : stream reader while generate test data(type supports: string|long|date|double|bool|bytes; support constant value and random function),demo: [{"type":"string","value":"abc"},{"type":"string","random":"10,20"}] + +The following params are available for -w --writer: + [these params is for rdbms writer, used to trace rdbms write performance, it's like datax's key] + datasourceType: datasource type, may be mysql|drds|oracle|ads|sqlserver|postgresql|db2|ads etc... 
+ *jdbcUrl: datasource jdbc connection string, mysql as a example: jdbc:mysql://ip:port/database + *username: username for datasource + *password: password for datasource + *table: table name for write data + column: column to be writed, the default value is ['*'] + batchSize: how many rows to be storeed at each communicate, the default value is 512 + preSql: prepare sql to be executed before write data, the default value is '' + postSql: post sql to be executed end of write data, the default value is '' + url: required for ads, pattern is ip:port + schme: required for ads, ads database name + + [these params is for stream writer, used to trace rdbms read performance] + writer-print: true means print data read from source datasource, the default value is false + +The following params are available global control: + -c --channel: the number of concurrent tasks, the default value is 1 + -f --file: existing completely dataX configuration file path + -t --type: test read or write performance for a datasource, couble be reader or writer, in collaboration with -f --file + -h --help: print help message + +some demo: +perftrace.py --channel=10 --reader='{"jdbcUrl":"jdbc:mysql://127.0.0.1:3306/database", "username":"", "password":"", "table": "", "where":"", "splitPk":"", "writer-print":"false"}' +perftrace.py --channel=10 --writer='{"jdbcUrl":"jdbc:mysql://127.0.0.1:3306/database", "username":"", "password":"", "table": "", "reader-sliceRecordCount": "10000", "reader-column": [{"type":"string","value":"abc"},{"type":"string","random":"10,20"}]}' +perftrace.py --file=/tmp/datax.job.json --type=reader --reader='{"writer-print": "false"}' +perftrace.py --file=/tmp/datax.job.json --type=writer --writer='{"reader-sliceRecordCount": "10000", "reader-column": [{"type":"string","value":"abc"},{"type":"string","random":"10,20"}]}' + +some example jdbc url pattern, may help: +jdbc:oracle:thin:@ip:port:database +jdbc:mysql://ip:port/database +jdbc:sqlserver://ip:port;DatabaseName=database +jdbc:postgresql://ip:port/database +warn: ads url pattern is ip:port +warn: test write performance will write data into your table, you can use a temporary table just for test. +''' + +def printCopyright(): + DATAX_VERSION = 'UNKNOWN_DATAX_VERSION' + print ''' +DataX Util Tools (%s), From Alibaba ! +Copyright (C) 2010-2016, Alibaba Group. All Rights Reserved.''' % DATAX_VERSION + sys.stdout.flush() + + +def yesNoChoice(): + yes = set(['yes','y', 'ye', '']) + no = set(['no','n']) + choice = raw_input().lower() + if choice in yes: + return True + elif choice in no: + return False + else: + sys.stdout.write("Please respond with 'yes' or 'no'") +##end cli & help logic + + +##begin process logic +def suicide(signum, e): + global childProcess + print >> sys.stderr, "[Error] Receive unexpected signal %d, starts to suicide." % (signum) + if childProcess: + childProcess.send_signal(signal.SIGQUIT) + time.sleep(1) + childProcess.kill() + print >> sys.stderr, "DataX Process was killed ! you did ?" 
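+    # 收到 2/3/15 等终止信号后,统一以 -1 退出,便于外部调度方识别本次 trace 异常结束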
+ sys.exit(-1) + + +def registerSignal(): + global childProcess + signal.signal(2, suicide) + signal.signal(3, suicide) + signal.signal(15, suicide) + + +def fork(command, isShell=False): + global childProcess + childProcess = subprocess.Popen(command, shell = isShell) + registerSignal() + (stdout, stderr) = childProcess.communicate() + #阻塞直到子进程结束 + childProcess.wait() + return childProcess.returncode +##end process logic + + +##begin datax json generate logic +#warn: if not '': -> true; if not None: -> true +def notNone(obj, context): + if not obj: + raise Exception("Configuration property [%s] could not be blank!" % (context)) + +def attributeNotNone(obj, attributes): + for key in attributes: + notNone(obj.get(key), key) + +def isBlank(value): + if value is None or len(value.strip()) == 0: + return True + return False + +def parsePluginName(jdbcUrl, pluginType): + import re + #warn: drds + name = 'pluginName' + mysqlRegex = re.compile('jdbc:(mysql)://.*') + if (mysqlRegex.match(jdbcUrl)): + name = 'mysql' + postgresqlRegex = re.compile('jdbc:(postgresql)://.*') + if (postgresqlRegex.match(jdbcUrl)): + name = 'postgresql' + oracleRegex = re.compile('jdbc:(oracle):.*') + if (oracleRegex.match(jdbcUrl)): + name = 'oracle' + sqlserverRegex = re.compile('jdbc:(sqlserver)://.*') + if (sqlserverRegex.match(jdbcUrl)): + name = 'sqlserver' + db2Regex = re.compile('jdbc:(db2)://.*') + if (db2Regex.match(jdbcUrl)): + name = 'db2' + return "%s%s" % (name, pluginType) + +def renderDataXJson(paramsDict, readerOrWriter = 'reader', channel = 1): + dataxTemplate = { + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "", + "parameter": { + "username": "", + "password": "", + "sliceRecordCount": "10000", + "column": [ + "*" + ], + "connection": [ + { + "table": [], + "jdbcUrl": [] + } + ] + } + }, + "writer": { + "name": "", + "parameter": { + "print": "false", + "connection": [ + { + "table": [], + "jdbcUrl": '' + } + ] + } + } + } + ] + } + } + dataxTemplate['job']['setting']['speed']['channel'] = channel + dataxTemplateContent = dataxTemplate['job']['content'][0] + + pluginName = '' + if paramsDict.get('datasourceType'): + pluginName = '%s%s' % (paramsDict['datasourceType'], readerOrWriter) + elif paramsDict.get('jdbcUrl'): + pluginName = parsePluginName(paramsDict['jdbcUrl'], readerOrWriter) + elif paramsDict.get('url'): + pluginName = 'adswriter' + + theOtherSide = 'writer' if readerOrWriter == 'reader' else 'reader' + dataxPluginParamsContent = dataxTemplateContent.get(readerOrWriter).get('parameter') + dataxPluginParamsContent.update(paramsDict) + + dataxPluginParamsContentOtherSide = dataxTemplateContent.get(theOtherSide).get('parameter') + + if readerOrWriter == 'reader': + dataxTemplateContent.get('reader')['name'] = pluginName + dataxTemplateContent.get('writer')['name'] = 'streamwriter' + if paramsDict.get('writer-print'): + dataxPluginParamsContentOtherSide['print'] = paramsDict['writer-print'] + del dataxPluginParamsContent['writer-print'] + del dataxPluginParamsContentOtherSide['connection'] + if readerOrWriter == 'writer': + dataxTemplateContent.get('reader')['name'] = 'streamreader' + dataxTemplateContent.get('writer')['name'] = pluginName + if paramsDict.get('reader-column'): + dataxPluginParamsContentOtherSide['column'] = paramsDict['reader-column'] + del dataxPluginParamsContent['reader-column'] + if paramsDict.get('reader-sliceRecordCount'): + dataxPluginParamsContentOtherSide['sliceRecordCount'] = 
paramsDict['reader-sliceRecordCount'] + del dataxPluginParamsContent['reader-sliceRecordCount'] + del dataxPluginParamsContentOtherSide['connection'] + + if paramsDict.get('jdbcUrl'): + if readerOrWriter == 'reader': + dataxPluginParamsContent['connection'][0]['jdbcUrl'].append(paramsDict['jdbcUrl']) + else: + dataxPluginParamsContent['connection'][0]['jdbcUrl'] = paramsDict['jdbcUrl'] + if paramsDict.get('table'): + dataxPluginParamsContent['connection'][0]['table'].append(paramsDict['table']) + + + traceJobJson = json.dumps(dataxTemplate, indent = 4) + return traceJobJson + +def isUrl(path): + if not path: + return False + if not isinstance(path, str): + raise Exception('Configuration file path required for the string, you configure is:%s' % path) + m = re.match(r"^http[s]?://\S+\w*", path.lower()) + if m: + return True + else: + return False + + +def readJobJsonFromLocal(jobConfigPath): + jobConfigContent = None + jobConfigPath = os.path.abspath(jobConfigPath) + file = open(jobConfigPath) + try: + jobConfigContent = file.read() + finally: + file.close() + if not jobConfigContent: + raise Exception("Your job configuration file read the result is empty, please check the configuration is legal, path: [%s]\nconfiguration:\n%s" % (jobConfigPath, str(jobConfigContent))) + return jobConfigContent + + +def readJobJsonFromRemote(jobConfigPath): + import urllib + conn = urllib.urlopen(jobConfigPath) + jobJson = conn.read() + return jobJson + +def parseJson(strConfig, context): + try: + return json.loads(strConfig) + except Exception, e: + import traceback + traceback.print_exc() + sys.stdout.flush() + print >> sys.stderr, '%s %s need in line with json syntax' % (context, strConfig) + sys.exit(-1) + +def convert(options, args): + traceJobJson = '' + if options.file: + if isUrl(options.file): + traceJobJson = readJobJsonFromRemote(options.file) + else: + traceJobJson = readJobJsonFromLocal(options.file) + traceJobDict = parseJson(traceJobJson, '%s content' % options.file) + attributeNotNone(traceJobDict, ['job']) + attributeNotNone(traceJobDict['job'], ['content']) + attributeNotNone(traceJobDict['job']['content'][0], ['reader', 'writer']) + attributeNotNone(traceJobDict['job']['content'][0]['reader'], ['name', 'parameter']) + attributeNotNone(traceJobDict['job']['content'][0]['writer'], ['name', 'parameter']) + if options.type == 'reader': + traceJobDict['job']['content'][0]['writer']['name'] = 'streamwriter' + if options.reader: + traceReaderDict = parseJson(options.reader, 'reader config') + if traceReaderDict.get('writer-print') is not None: + traceJobDict['job']['content'][0]['writer']['parameter']['print'] = traceReaderDict.get('writer-print') + else: + traceJobDict['job']['content'][0]['writer']['parameter']['print'] = 'false' + else: + traceJobDict['job']['content'][0]['writer']['parameter']['print'] = 'false' + elif options.type == 'writer': + traceJobDict['job']['content'][0]['reader']['name'] = 'streamreader' + if options.writer: + traceWriterDict = parseJson(options.writer, 'writer config') + if traceWriterDict.get('reader-column'): + traceJobDict['job']['content'][0]['reader']['parameter']['column'] = traceWriterDict['reader-column'] + if traceWriterDict.get('reader-sliceRecordCount'): + traceJobDict['job']['content'][0]['reader']['parameter']['sliceRecordCount'] = traceWriterDict['reader-sliceRecordCount'] + else: + columnSize = len(traceJobDict['job']['content'][0]['writer']['parameter']['column']) + streamReaderColumn = [] + for i in range(columnSize): + 
streamReaderColumn.append({"type": "long", "random": "2,10"}) + traceJobDict['job']['content'][0]['reader']['parameter']['column'] = streamReaderColumn + traceJobDict['job']['content'][0]['reader']['parameter']['sliceRecordCount'] = 10000 + else: + pass#do nothing + return json.dumps(traceJobDict, indent = 4) + elif options.reader: + traceReaderDict = parseJson(options.reader, 'reader config') + return renderDataXJson(traceReaderDict, 'reader', options.channel) + elif options.writer: + traceWriterDict = parseJson(options.writer, 'writer config') + return renderDataXJson(traceWriterDict, 'writer', options.channel) + else: + print getUsage() + sys.exit(-1) + #dataxParams = {} + #for opt, value in options.__dict__.items(): + # dataxParams[opt] = value +##end datax json generate logic + + +if __name__ == "__main__": + printCopyright() + parser = getOptionParser() + + options, args = parser.parse_args(sys.argv[1:]) + #print options, args + dataxTraceJobJson = convert(options, args) + + #由MAC地址、当前时间戳、随机数生成,可以保证全球范围内的唯一性 + dataxJobPath = os.path.join(os.getcwd(), "perftrace-" + str(uuid.uuid1())) + jobConfigOk = True + if os.path.exists(dataxJobPath): + print "file already exists, truncate and rewrite it? %s" % dataxJobPath + if yesNoChoice(): + jobConfigOk = True + else: + print "exit failed, because of file conflict" + sys.exit(-1) + fileWriter = open(dataxJobPath, 'w') + fileWriter.write(dataxTraceJobJson) + fileWriter.close() + + + print "trace environments:" + print "dataxJobPath: %s" % dataxJobPath + dataxHomePath = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) + print "dataxHomePath: %s" % dataxHomePath + + dataxCommand = "%s %s" % (os.path.join(dataxHomePath, "bin", "datax.py"), dataxJobPath) + print "dataxCommand: %s" % dataxCommand + + returncode = fork(dataxCommand, True) + if options.delete == 'true': + os.remove(dataxJobPath) + sys.exit(returncode) diff --git a/core/src/main/conf/.secret.properties b/core/src/main/conf/.secret.properties new file mode 100755 index 0000000000..b807f8ad63 --- /dev/null +++ b/core/src/main/conf/.secret.properties @@ -0,0 +1,9 @@ +#ds basicAuth config +auth.user= +auth.pass= +current.keyVersion= +current.publicKey= +current.privateKey= +current.service.username= +current.service.password= + diff --git a/core/src/main/conf/core.json b/core/src/main/conf/core.json new file mode 100755 index 0000000000..5aa855bc81 --- /dev/null +++ b/core/src/main/conf/core.json @@ -0,0 +1,61 @@ + +{ + "entry": { + "jvm": "-Xms1G -Xmx1G", + "environment": {} + }, + "common": { + "column": { + "datetimeFormat": "yyyy-MM-dd HH:mm:ss", + "timeFormat": "HH:mm:ss", + "dateFormat": "yyyy-MM-dd", + "extraFormats":["yyyyMMdd"], + "timeZone": "GMT+8", + "encoding": "utf-8" + } + }, + "core": { + "dataXServer": { + "address": "http://localhost:7001/api", + "timeout": 10000, + "reportDataxLog": false, + "reportPerfLog": false + }, + "transport": { + "channel": { + "class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel", + "speed": { + "byte": -1, + "record": -1 + }, + "flowControlInterval": 20, + "capacity": 512, + "byteCapacity": 67108864 + }, + "exchanger": { + "class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger", + "bufferSize": 32 + } + }, + "container": { + "job": { + "reportInterval": 10000 + }, + "taskGroup": { + "channel": 5 + }, + "trace": { + "enable": "false" + } + + }, + "statistics": { + "collector": { + "plugin": { + "taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector", + "maxDirtyNumber": 10 + } + 
} + } + } +} diff --git a/core/src/main/conf/logback.xml b/core/src/main/conf/logback.xml new file mode 100755 index 0000000000..15e4880336 --- /dev/null +++ b/core/src/main/conf/logback.xml @@ -0,0 +1,150 @@ + + + + + + + + + + UTF-8 + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + UTF-8 + ${log.dir}/${ymd}/${log.file.name}-${byMillionSecond}.log + false + + %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n + + + + + + UTF-8 + ${perf.dir}/${ymd}/${log.file.name}-${byMillionSecond}.log + false + + %msg%n + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/AbstractContainer.java b/core/src/main/java/com/alibaba/datax/core/AbstractContainer.java new file mode 100755 index 0000000000..c4e09b757e --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/AbstractContainer.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.core; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import org.apache.commons.lang.Validate; + +/** + * 执行容器的抽象类,持有该容器全局的配置 configuration + */ +public abstract class AbstractContainer { + protected Configuration configuration; + + protected AbstractContainerCommunicator containerCommunicator; + + public AbstractContainer(Configuration configuration) { + Validate.notNull(configuration, "Configuration can not be null."); + + this.configuration = configuration; + } + + public Configuration getConfiguration() { + return configuration; + } + + public AbstractContainerCommunicator getContainerCommunicator() { + return containerCommunicator; + } + + public void setContainerCommunicator(AbstractContainerCommunicator containerCommunicator) { + this.containerCommunicator = containerCommunicator; + } + + public abstract void start(); + +} diff --git a/core/src/main/java/com/alibaba/datax/core/Engine.java b/core/src/main/java/com/alibaba/datax/core/Engine.java new file mode 100755 index 0000000000..f80d792f3c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/Engine.java @@ -0,0 +1,223 @@ +package com.alibaba.datax.core; + +import com.alibaba.datax.common.element.ColumnCast; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.ErrorCode; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.JobContainer; +import com.alibaba.datax.core.taskgroup.TaskGroupContainer; +import com.alibaba.datax.core.util.ConfigParser; +import com.alibaba.datax.core.util.ConfigurationValidate; +import com.alibaba.datax.core.util.ExceptionTracker; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.LoadUtil; +import org.apache.commons.cli.BasicParser; +import org.apache.commons.cli.CommandLine; +import org.apache.commons.cli.Options; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Arrays; +import java.util.List; +import java.util.Set; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * Engine是DataX入口类,该类负责初始化Job或者Task的运行容器,并运行插件的Job或者Task逻辑 + */ +public class Engine { + private static final Logger LOG = LoggerFactory.getLogger(Engine.class); + + private static String 
RUNTIME_MODE; + + /* check job model (job/task) first */ + public void start(Configuration allConf) { + + // 绑定column转换信息 + ColumnCast.bind(allConf); + + /** + * 初始化PluginLoader,可以获取各种插件配置 + */ + LoadUtil.bind(allConf); + + boolean isJob = !("taskGroup".equalsIgnoreCase(allConf + .getString(CoreConstant.DATAX_CORE_CONTAINER_MODEL))); + //JobContainer会在schedule后再行进行设置和调整值 + int channelNumber =0; + AbstractContainer container; + long instanceId; + int taskGroupId = -1; + if (isJob) { + allConf.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_MODE, RUNTIME_MODE); + container = new JobContainer(allConf); + instanceId = allConf.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, 0); + + } else { + container = new TaskGroupContainer(allConf); + instanceId = allConf.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + taskGroupId = allConf.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + channelNumber = allConf.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL); + } + + //缺省打开perfTrace + boolean traceEnable = allConf.getBool(CoreConstant.DATAX_CORE_CONTAINER_TRACE_ENABLE, true); + boolean perfReportEnable = allConf.getBool(CoreConstant.DATAX_CORE_REPORT_DATAX_PERFLOG, true); + + //standlone模式的datax shell任务不进行汇报 + if(instanceId == -1){ + perfReportEnable = false; + } + + int priority = 0; + try { + priority = Integer.parseInt(System.getenv("SKYNET_PRIORITY")); + }catch (NumberFormatException e){ + LOG.warn("prioriy set to 0, because NumberFormatException, the value is: "+System.getProperty("PROIORY")); + } + + Configuration jobInfoConfig = allConf.getConfiguration(CoreConstant.DATAX_JOB_JOBINFO); + //初始化PerfTrace + PerfTrace perfTrace = PerfTrace.getInstance(isJob, instanceId, taskGroupId, priority, traceEnable); + perfTrace.setJobInfo(jobInfoConfig,perfReportEnable,channelNumber); + container.start(); + + } + + + // 注意屏蔽敏感信息 + public static String filterJobConfiguration(final Configuration configuration) { + Configuration jobConfWithSetting = configuration.getConfiguration("job").clone(); + + Configuration jobContent = jobConfWithSetting.getConfiguration("content"); + + filterSensitiveConfiguration(jobContent); + + jobConfWithSetting.set("content",jobContent); + + return jobConfWithSetting.beautify(); + } + + public static Configuration filterSensitiveConfiguration(Configuration configuration){ + Set keys = configuration.getKeys(); + for (final String key : keys) { + boolean isSensitive = StringUtils.endsWithIgnoreCase(key, "password") + || StringUtils.endsWithIgnoreCase(key, "accessKey"); + if (isSensitive && configuration.get(key) instanceof String) { + configuration.set(key, configuration.getString(key).replaceAll(".", "*")); + } + } + return configuration; + } + + public static void entry(final String[] args) throws Throwable { + Options options = new Options(); + options.addOption("job", true, "Job config."); + options.addOption("jobid", true, "Job unique id."); + options.addOption("mode", true, "Job runtime mode."); + + BasicParser parser = new BasicParser(); + CommandLine cl = parser.parse(options, args); + + String jobPath = cl.getOptionValue("job"); + + // 如果用户没有明确指定jobid, 则 datax.py 会指定 jobid 默认值为-1 + String jobIdString = cl.getOptionValue("jobid"); + RUNTIME_MODE = cl.getOptionValue("mode"); + + Configuration configuration = ConfigParser.parse(jobPath); + + long jobId; + if (!"-1".equalsIgnoreCase(jobIdString)) { + jobId = Long.parseLong(jobIdString); + } else { + // only for dsc & ds & datax 3 update + String dscJobUrlPatternString = 
"/instance/(\\d{1,})/config.xml"; + String dsJobUrlPatternString = "/inner/job/(\\d{1,})/config"; + String dsTaskGroupUrlPatternString = "/inner/job/(\\d{1,})/taskGroup/"; + List patternStringList = Arrays.asList(dscJobUrlPatternString, + dsJobUrlPatternString, dsTaskGroupUrlPatternString); + jobId = parseJobIdFromUrl(patternStringList, jobPath); + } + + boolean isStandAloneMode = "standalone".equalsIgnoreCase(RUNTIME_MODE); + if (!isStandAloneMode && jobId == -1) { + // 如果不是 standalone 模式,那么 jobId 一定不能为-1 + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "非 standalone 模式必须在 URL 中提供有效的 jobId."); + } + configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, jobId); + + //打印vmInfo + VMInfo vmInfo = VMInfo.getVmInfo(); + if (vmInfo != null) { + LOG.info(vmInfo.toString()); + } + + LOG.info("\n" + Engine.filterJobConfiguration(configuration) + "\n"); + + LOG.debug(configuration.toJSON()); + + ConfigurationValidate.doValidate(configuration); + Engine engine = new Engine(); + engine.start(configuration); + } + + + /** + * -1 表示未能解析到 jobId + * + * only for dsc & ds & datax 3 update + */ + private static long parseJobIdFromUrl(List patternStringList, String url) { + long result = -1; + for (String patternString : patternStringList) { + result = doParseJobIdFromUrl(patternString, url); + if (result != -1) { + return result; + } + } + return result; + } + + private static long doParseJobIdFromUrl(String patternString, String url) { + Pattern pattern = Pattern.compile(patternString); + Matcher matcher = pattern.matcher(url); + if (matcher.find()) { + return Long.parseLong(matcher.group(1)); + } + + return -1; + } + + public static void main(String[] args) throws Exception { + int exitCode = 0; + try { + Engine.entry(args); + } catch (Throwable e) { + exitCode = 1; + LOG.error("\n\n经DataX智能分析,该任务最可能的错误原因是:\n" + ExceptionTracker.trace(e)); + + if (e instanceof DataXException) { + DataXException tempException = (DataXException) e; + ErrorCode errorCode = tempException.getErrorCode(); + if (errorCode instanceof FrameworkErrorCode) { + FrameworkErrorCode tempErrorCode = (FrameworkErrorCode) errorCode; + exitCode = tempErrorCode.toExitValue(); + } + } + + System.exit(exitCode); + } + System.exit(exitCode); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/container/util/HookInvoker.java b/core/src/main/java/com/alibaba/datax/core/container/util/HookInvoker.java new file mode 100755 index 0000000000..6e0ef17825 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/container/util/HookInvoker.java @@ -0,0 +1,91 @@ +package com.alibaba.datax.core.container.util; + +/** + * Created by xiafei.qiuxf on 14/12/17. 
+ */ + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.Hook; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.JarLoader; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.FilenameFilter; +import java.util.HashMap; +import java.util.Iterator; +import java.util.Map; +import java.util.ServiceLoader; + +/** + * 扫描给定目录的所有一级子目录,每个子目录当作一个Hook的目录。 + * 对于每个子目录,必须符合ServiceLoader的标准目录格式,见http://docs.oracle.com/javase/6/docs/api/java/util/ServiceLoader.html。 + * 加载里头的jar,使用ServiceLoader机制调用。 + */ +public class HookInvoker { + + private static final Logger LOG = LoggerFactory.getLogger(HookInvoker.class); + private final Map msg; + private final Configuration conf; + + private File baseDir; + + public HookInvoker(String baseDirName, Configuration conf, Map msg) { + this.baseDir = new File(baseDirName); + this.conf = conf; + this.msg = msg; + } + + public void invokeAll() { + if (!baseDir.exists() || baseDir.isFile()) { + LOG.info("No hook invoked, because base dir not exists or is a file: " + baseDir.getAbsolutePath()); + return; + } + + String[] subDirs = baseDir.list(new FilenameFilter() { + @Override + public boolean accept(File dir, String name) { + return new File(dir, name).isDirectory(); + } + }); + + if (subDirs == null) { + throw DataXException.asDataXException(FrameworkErrorCode.HOOK_LOAD_ERROR, "获取HOOK子目录返回null"); + } + + for (String subDir : subDirs) { + doInvoke(new File(baseDir, subDir).getAbsolutePath()); + } + + } + + private void doInvoke(String path) { + ClassLoader oldClassLoader = Thread.currentThread().getContextClassLoader(); + try { + JarLoader jarLoader = new JarLoader(new String[]{path}); + Thread.currentThread().setContextClassLoader(jarLoader); + Iterator hookIt = ServiceLoader.load(Hook.class).iterator(); + if (!hookIt.hasNext()) { + LOG.warn("No hook defined under path: " + path); + } else { + Hook hook = hookIt.next(); + LOG.info("Invoke hook [{}], path: {}", hook.getName(), path); + hook.invoke(conf, msg); + } + } catch (Exception e) { + LOG.error("Exception when invoke hook", e); + throw DataXException.asDataXException( + CommonErrorCode.HOOK_INTERNAL_ERROR, "Exception when invoke hook", e); + } finally { + Thread.currentThread().setContextClassLoader(oldClassLoader); + } + } + + public static void main(String[] args) { + new HookInvoker("/Users/xiafei/workspace/datax3/target/datax/datax/hook", + null, new HashMap()).invokeAll(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/container/util/JobAssignUtil.java b/core/src/main/java/com/alibaba/datax/core/container/util/JobAssignUtil.java new file mode 100755 index 0000000000..31ba60a4dd --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/container/util/JobAssignUtil.java @@ -0,0 +1,177 @@ +package com.alibaba.datax.core.container.util; + +import com.alibaba.datax.common.constant.CommonConstant; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; +import org.apache.commons.lang3.StringUtils; + +import java.util.*; + +public final class JobAssignUtil { + private JobAssignUtil() { + } + + /** + * 公平的分配 task 到对应的 taskGroup 中。 + * 公平体现在:会考虑 task 中对资源负载作的 load 标识进行更均衡的作业分配操作。 + * TODO 具体文档举例说明 + */ + public static List assignFairly(Configuration 
configuration, int channelNumber, int channelsPerTaskGroup) { + Validate.isTrue(configuration != null, "框架获得的 Job 不能为 null."); + + List contentConfig = configuration.getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + Validate.isTrue(contentConfig.size() > 0, "框架获得的切分后的 Job 无内容."); + + Validate.isTrue(channelNumber > 0 && channelsPerTaskGroup > 0, + "每个channel的平均task数[averTaskPerChannel],channel数目[channelNumber],每个taskGroup的平均channel数[channelsPerTaskGroup]都应该为正数"); + + int taskGroupNumber = (int) Math.ceil(1.0 * channelNumber / channelsPerTaskGroup); + + Configuration aTaskConfig = contentConfig.get(0); + + String readerResourceMark = aTaskConfig.getString(CoreConstant.JOB_READER_PARAMETER + "." + + CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + String writerResourceMark = aTaskConfig.getString(CoreConstant.JOB_WRITER_PARAMETER + "." + + CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + + boolean hasLoadBalanceResourceMark = StringUtils.isNotBlank(readerResourceMark) || + StringUtils.isNotBlank(writerResourceMark); + + if (!hasLoadBalanceResourceMark) { + // fake 一个固定的 key 作为资源标识(在 reader 或者 writer 上均可,此处选择在 reader 上进行 fake) + for (Configuration conf : contentConfig) { + conf.set(CoreConstant.JOB_READER_PARAMETER + "." + + CommonConstant.LOAD_BALANCE_RESOURCE_MARK, "aFakeResourceMarkForLoadBalance"); + } + // 是为了避免某些插件没有设置 资源标识 而进行了一次随机打乱操作 + Collections.shuffle(contentConfig, new Random(System.currentTimeMillis())); + } + + LinkedHashMap> resourceMarkAndTaskIdMap = parseAndGetResourceMarkAndTaskIdMap(contentConfig); + List taskGroupConfig = doAssign(resourceMarkAndTaskIdMap, configuration, taskGroupNumber); + + // 调整 每个 taskGroup 对应的 Channel 个数(属于优化范畴) + adjustChannelNumPerTaskGroup(taskGroupConfig, channelNumber); + return taskGroupConfig; + } + + private static void adjustChannelNumPerTaskGroup(List taskGroupConfig, int channelNumber) { + int taskGroupNumber = taskGroupConfig.size(); + int avgChannelsPerTaskGroup = channelNumber / taskGroupNumber; + int remainderChannelCount = channelNumber % taskGroupNumber; + // 表示有 remainderChannelCount 个 taskGroup,其对应 Channel 个数应该为:avgChannelsPerTaskGroup + 1; + // (taskGroupNumber - remainderChannelCount)个 taskGroup,其对应 Channel 个数应该为:avgChannelsPerTaskGroup + + int i = 0; + for (; i < remainderChannelCount; i++) { + taskGroupConfig.get(i).set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, avgChannelsPerTaskGroup + 1); + } + + for (int j = 0; j < taskGroupNumber - remainderChannelCount; j++) { + taskGroupConfig.get(i + j).set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, avgChannelsPerTaskGroup); + } + } + + /** + * 根据task 配置,获取到: + * 资源名称 --> taskId(List) 的 map 映射关系 + */ + private static LinkedHashMap> parseAndGetResourceMarkAndTaskIdMap(List contentConfig) { + // key: resourceMark, value: taskId + LinkedHashMap> readerResourceMarkAndTaskIdMap = new LinkedHashMap>(); + LinkedHashMap> writerResourceMarkAndTaskIdMap = new LinkedHashMap>(); + + for (Configuration aTaskConfig : contentConfig) { + int taskId = aTaskConfig.getInt(CoreConstant.TASK_ID); + // 把 readerResourceMark 加到 readerResourceMarkAndTaskIdMap 中 + String readerResourceMark = aTaskConfig.getString(CoreConstant.JOB_READER_PARAMETER + "." 
+ CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + if (readerResourceMarkAndTaskIdMap.get(readerResourceMark) == null) { + readerResourceMarkAndTaskIdMap.put(readerResourceMark, new LinkedList()); + } + readerResourceMarkAndTaskIdMap.get(readerResourceMark).add(taskId); + + // 把 writerResourceMark 加到 writerResourceMarkAndTaskIdMap 中 + String writerResourceMark = aTaskConfig.getString(CoreConstant.JOB_WRITER_PARAMETER + "." + CommonConstant.LOAD_BALANCE_RESOURCE_MARK); + if (writerResourceMarkAndTaskIdMap.get(writerResourceMark) == null) { + writerResourceMarkAndTaskIdMap.put(writerResourceMark, new LinkedList()); + } + writerResourceMarkAndTaskIdMap.get(writerResourceMark).add(taskId); + } + + if (readerResourceMarkAndTaskIdMap.size() >= writerResourceMarkAndTaskIdMap.size()) { + // 采用 reader 对资源做的标记进行 shuffle + return readerResourceMarkAndTaskIdMap; + } else { + // 采用 writer 对资源做的标记进行 shuffle + return writerResourceMarkAndTaskIdMap; + } + } + + + /** + * /** + * 需要实现的效果通过例子来说是: + *

+     * a 库上有表:0, 1, 2
+     * b 库上有表:3, 4
+     * c 库上有表:5, 6, 7
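+     * (即共有 a、b、c 三个资源标识,合计 8 个 task)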
+     *
+     * 如果有 4个 taskGroup
+     * 则 assign 后的结果为:
+     * taskGroup-0: 0,  4,
+     * taskGroup-1: 3,  6,
+     * taskGroup-2: 5,  2,
+     * taskGroup-3: 1,  7
+     *
+     * 
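+     * 上述结果由 doAssign 的"发牌"式轮询得到:每一轮按资源标识的顺序,从各自的 task 列表头部
+     * 取出一个 task,放入 taskGroupIndex % taskGroupNumber 对应的组,并递增 taskGroupIndex,
+     * 直到所有 task 分配完毕。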
+ */ + private static List doAssign(LinkedHashMap> resourceMarkAndTaskIdMap, Configuration jobConfiguration, int taskGroupNumber) { + List contentConfig = jobConfiguration.getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + + Configuration taskGroupTemplate = jobConfiguration.clone(); + taskGroupTemplate.remove(CoreConstant.DATAX_JOB_CONTENT); + + List result = new LinkedList(); + + List> taskGroupConfigList = new ArrayList>(taskGroupNumber); + for (int i = 0; i < taskGroupNumber; i++) { + taskGroupConfigList.add(new LinkedList()); + } + + int mapValueMaxLength = -1; + + List resourceMarks = new ArrayList(); + for (Map.Entry> entry : resourceMarkAndTaskIdMap.entrySet()) { + resourceMarks.add(entry.getKey()); + if (entry.getValue().size() > mapValueMaxLength) { + mapValueMaxLength = entry.getValue().size(); + } + } + + int taskGroupIndex = 0; + for (int i = 0; i < mapValueMaxLength; i++) { + for (String resourceMark : resourceMarks) { + if (resourceMarkAndTaskIdMap.get(resourceMark).size() > 0) { + int taskId = resourceMarkAndTaskIdMap.get(resourceMark).get(0); + taskGroupConfigList.get(taskGroupIndex % taskGroupNumber).add(contentConfig.get(taskId)); + taskGroupIndex++; + + resourceMarkAndTaskIdMap.get(resourceMark).remove(0); + } + } + } + + Configuration tempTaskGroupConfig; + for (int i = 0; i < taskGroupNumber; i++) { + tempTaskGroupConfig = taskGroupTemplate.clone(); + tempTaskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, taskGroupConfigList.get(i)); + tempTaskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, i); + + result.add(tempTaskGroupConfig); + } + + return result; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/JobContainer.java b/core/src/main/java/com/alibaba/datax/core/job/JobContainer.java new file mode 100755 index 0000000000..50f1cf7b8d --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/JobContainer.java @@ -0,0 +1,976 @@ +package com.alibaba.datax.core.job; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.JobPluginCollector; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.StrUtil; +import com.alibaba.datax.core.AbstractContainer; +import com.alibaba.datax.core.Engine; +import com.alibaba.datax.core.container.util.HookInvoker; +import com.alibaba.datax.core.container.util.JobAssignUtil; +import com.alibaba.datax.core.job.scheduler.AbstractScheduler; +import com.alibaba.datax.core.job.scheduler.processinner.StandAloneScheduler; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.statistics.container.communicator.job.StandAloneJobContainerCommunicator; +import com.alibaba.datax.core.statistics.plugin.DefaultJobPluginCollector; +import com.alibaba.datax.core.util.ErrorRecordChecker; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.ClassLoaderSwapper; +import com.alibaba.datax.core.util.container.CoreConstant; +import 
com.alibaba.datax.core.util.container.LoadUtil; +import com.alibaba.datax.dataxservice.face.domain.enums.ExecuteMode; +import com.alibaba.fastjson.JSON; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.List; + +/** + * Created by jingxing on 14-8-24. + *

+ * job实例运行在jobContainer容器中,它是所有任务的master,负责初始化、拆分、调度、运行、回收、监控和汇报 + * 但它并不做实际的数据同步操作 + */ +public class JobContainer extends AbstractContainer { + private static final Logger LOG = LoggerFactory + .getLogger(JobContainer.class); + + private static final SimpleDateFormat dateFormat = new SimpleDateFormat( + "yyyy-MM-dd HH:mm:ss"); + + private ClassLoaderSwapper classLoaderSwapper = ClassLoaderSwapper + .newCurrentThreadClassLoaderSwapper(); + + private long jobId; + + private String readerPluginName; + + private String writerPluginName; + + /** + * reader和writer jobContainer的实例 + */ + private Reader.Job jobReader; + + private Writer.Job jobWriter; + + private Configuration userConf; + + private long startTimeStamp; + + private long endTimeStamp; + + private long startTransferTimeStamp; + + private long endTransferTimeStamp; + + private int needChannelNumber; + + private int totalStage = 1; + + private ErrorRecordChecker errorLimit; + + public JobContainer(Configuration configuration) { + super(configuration); + + errorLimit = new ErrorRecordChecker(configuration); + } + + /** + * jobContainer主要负责的工作全部在start()里面,包括init、prepare、split、scheduler、 + * post以及destroy和statistics + */ + @Override + public void start() { + LOG.info("DataX jobContainer starts job."); + + boolean hasException = false; + boolean isDryRun = false; + try { + this.startTimeStamp = System.currentTimeMillis(); + isDryRun = configuration.getBool(CoreConstant.DATAX_JOB_SETTING_DRYRUN, false); + if(isDryRun) { + LOG.info("jobContainer starts to do preCheck ..."); + this.preCheck(); + } else { + userConf = configuration.clone(); + LOG.debug("jobContainer starts to do preHandle ..."); + this.preHandle(); + + LOG.debug("jobContainer starts to do init ..."); + this.init(); + LOG.info("jobContainer starts to do prepare ..."); + this.prepare(); + LOG.info("jobContainer starts to do split ..."); + this.totalStage = this.split(); + LOG.info("jobContainer starts to do schedule ..."); + this.schedule(); + LOG.debug("jobContainer starts to do post ..."); + this.post(); + + LOG.debug("jobContainer starts to do postHandle ..."); + this.postHandle(); + LOG.info("DataX jobId [{}] completed successfully.", this.jobId); + + this.invokeHooks(); + } + } catch (Throwable e) { + LOG.error("Exception when job run", e); + + hasException = true; + + if (e instanceof OutOfMemoryError) { + this.destroy(); + System.gc(); + } + + + if (super.getContainerCommunicator() == null) { + // 由于 containerCollector 是在 scheduler() 中初始化的,所以当在 scheduler() 之前出现异常时,需要在此处对 containerCollector 进行初始化 + + AbstractContainerCommunicator tempContainerCollector; + // standalone + tempContainerCollector = new StandAloneJobContainerCommunicator(configuration); + + super.setContainerCommunicator(tempContainerCollector); + } + + Communication communication = super.getContainerCommunicator().collect(); + // 汇报前的状态,不需要手动进行设置 + // communication.setState(State.FAILED); + communication.setThrowable(e); + communication.setTimestamp(this.endTimeStamp); + + Communication tempComm = new Communication(); + tempComm.setTimestamp(this.startTransferTimeStamp); + + Communication reportCommunication = CommunicationTool.getReportCommunication(communication, tempComm, this.totalStage); + super.getContainerCommunicator().report(reportCommunication); + + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } finally { + if(!isDryRun) { + + this.destroy(); + this.endTimeStamp = System.currentTimeMillis(); + if (!hasException) { + //最后打印cpu的平均消耗,GC的统计 + VMInfo vmInfo = 
VMInfo.getVmInfo(); + if (vmInfo != null) { + vmInfo.getDelta(false); + LOG.info(vmInfo.totalString()); + } + + LOG.info(PerfTrace.getInstance().summarizeNoException()); + this.logStatistics(); + } + } + } + } + + private void preCheck() { + this.preCheckInit(); + this.adjustChannelNumber(); + + if (this.needChannelNumber <= 0) { + this.needChannelNumber = 1; + } + this.preCheckReader(); + this.preCheckWriter(); + LOG.info("PreCheck通过"); + } + + private void preCheckInit() { + this.jobId = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, -1); + + if (this.jobId < 0) { + LOG.info("Set jobId = 0"); + this.jobId = 0; + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, + this.jobId); + } + + Thread.currentThread().setName("job-" + this.jobId); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + this.jobReader = this.preCheckReaderInit(jobPluginCollector); + this.jobWriter = this.preCheckWriterInit(jobPluginCollector); + } + + private Reader.Job preCheckReaderInit(JobPluginCollector jobPluginCollector) { + this.readerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_READER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + + Reader.Job jobReader = (Reader.Job) LoadUtil.loadJobPlugin( + PluginType.READER, this.readerPluginName); + + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER + ".dryRun", true); + + // 设置reader的jobConfig + jobReader.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + // 设置reader的readerConfig + jobReader.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + jobReader.setJobPluginCollector(jobPluginCollector); + + classLoaderSwapper.restoreCurrentThreadClassLoader(); + return jobReader; + } + + + private Writer.Job preCheckWriterInit(JobPluginCollector jobPluginCollector) { + this.writerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + + Writer.Job jobWriter = (Writer.Job) LoadUtil.loadJobPlugin( + PluginType.WRITER, this.writerPluginName); + + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER + ".dryRun", true); + + // 设置writer的jobConfig + jobWriter.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER)); + // 设置reader的readerConfig + jobWriter.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + jobWriter.setPeerPluginName(this.readerPluginName); + jobWriter.setJobPluginCollector(jobPluginCollector); + + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + return jobWriter; + } + + private void preCheckReader() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + LOG.info(String.format("DataX Reader.Job [%s] do preCheck work .", + this.readerPluginName)); + this.jobReader.preCheck(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + private void preCheckWriter() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + LOG.info(String.format("DataX Writer.Job [%s] do preCheck work .", + 
this.writerPluginName)); + this.jobWriter.preCheck(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + /** + * reader和writer的初始化 + */ + private void init() { + this.jobId = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, -1); + + if (this.jobId < 0) { + LOG.info("Set jobId = 0"); + this.jobId = 0; + this.configuration.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID, + this.jobId); + } + + Thread.currentThread().setName("job-" + this.jobId); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + //必须先Reader ,后Writer + this.jobReader = this.initJobReader(jobPluginCollector); + this.jobWriter = this.initJobWriter(jobPluginCollector); + } + + private void prepare() { + this.prepareJobReader(); + this.prepareJobWriter(); + } + + private void preHandle() { + String handlerPluginTypeStr = this.configuration.getString( + CoreConstant.DATAX_JOB_PREHANDLER_PLUGINTYPE); + if(!StringUtils.isNotEmpty(handlerPluginTypeStr)){ + return; + } + PluginType handlerPluginType; + try { + handlerPluginType = PluginType.valueOf(handlerPluginTypeStr.toUpperCase()); + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + String.format("Job preHandler's pluginType(%s) set error, reason(%s)", handlerPluginTypeStr.toUpperCase(), e.getMessage())); + } + + String handlerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_PREHANDLER_PLUGINNAME); + + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + handlerPluginType, handlerPluginName)); + + AbstractJobPlugin handler = LoadUtil.loadJobPlugin( + handlerPluginType, handlerPluginName); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + handler.setJobPluginCollector(jobPluginCollector); + + //todo configuration的安全性,将来必须保证 + handler.preHandler(configuration); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + LOG.info("After PreHandler: \n" + Engine.filterJobConfiguration(configuration) + "\n"); + } + + private void postHandle() { + String handlerPluginTypeStr = this.configuration.getString( + CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINTYPE); + + if(!StringUtils.isNotEmpty(handlerPluginTypeStr)){ + return; + } + PluginType handlerPluginType; + try { + handlerPluginType = PluginType.valueOf(handlerPluginTypeStr.toUpperCase()); + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + String.format("Job postHandler's pluginType(%s) set error, reason(%s)", handlerPluginTypeStr.toUpperCase(), e.getMessage())); + } + + String handlerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINNAME); + + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + handlerPluginType, handlerPluginName)); + + AbstractJobPlugin handler = LoadUtil.loadJobPlugin( + handlerPluginType, handlerPluginName); + + JobPluginCollector jobPluginCollector = new DefaultJobPluginCollector( + this.getContainerCommunicator()); + handler.setJobPluginCollector(jobPluginCollector); + + handler.postHandler(configuration); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + + /** + * 执行reader和writer最细粒度的切分,需要注意的是,writer的切分结果要参照reader的切分结果, + * 达到切分后数目相等,才能满足1:1的通道模型,所以这里可以将reader和writer的配置整合到一起, + * 然后,为避免顺序给读写端带来长尾影响,将整合的结果shuffler掉 + */ + private int split() { + this.adjustChannelNumber(); + + if (this.needChannelNumber <= 0) { + 
this.needChannelNumber = 1; + } + + List readerTaskConfigs = this + .doReaderSplit(this.needChannelNumber); + int taskNumber = readerTaskConfigs.size(); + List writerTaskConfigs = this + .doWriterSplit(taskNumber); + + List transformerList = this.configuration.getListConfiguration(CoreConstant.DATAX_JOB_CONTENT_TRANSFORMER); + + LOG.debug("transformer configuration: "+ JSON.toJSONString(transformerList)); + /** + * 输入是reader和writer的parameter list,输出是content下面元素的list + */ + List contentConfig = mergeReaderAndWriterTaskConfigs( + readerTaskConfigs, writerTaskConfigs, transformerList); + + + LOG.debug("contentConfig configuration: "+ JSON.toJSONString(contentConfig)); + + this.configuration.set(CoreConstant.DATAX_JOB_CONTENT, contentConfig); + + return contentConfig.size(); + } + + private void adjustChannelNumber() { + int needChannelNumberByByte = Integer.MAX_VALUE; + int needChannelNumberByRecord = Integer.MAX_VALUE; + + boolean isByteLimit = (this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_BYTE, 0) > 0); + if (isByteLimit) { + long globalLimitedByteSpeed = this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_BYTE, 10 * 1024 * 1024); + + // 在byte流控情况下,单个Channel流量最大值必须设置,否则报错! + Long channelLimitedByteSpeed = this.configuration + .getLong(CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_BYTE); + if (channelLimitedByteSpeed == null || channelLimitedByteSpeed <= 0) { + DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + "在有总bps限速条件下,单个channel的bps值不能为空,也不能为非正数"); + } + + needChannelNumberByByte = + (int) (globalLimitedByteSpeed / channelLimitedByteSpeed); + needChannelNumberByByte = + needChannelNumberByByte > 0 ? needChannelNumberByByte : 1; + LOG.info("Job set Max-Byte-Speed to " + globalLimitedByteSpeed + " bytes."); + } + + boolean isRecordLimit = (this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_RECORD, 0)) > 0; + if (isRecordLimit) { + long globalLimitedRecordSpeed = this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_RECORD, 100000); + + Long channelLimitedRecordSpeed = this.configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_RECORD); + if (channelLimitedRecordSpeed == null || channelLimitedRecordSpeed <= 0) { + DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, + "在有总tps限速条件下,单个channel的tps值不能为空,也不能为非正数"); + } + + needChannelNumberByRecord = + (int) (globalLimitedRecordSpeed / channelLimitedRecordSpeed); + needChannelNumberByRecord = + needChannelNumberByRecord > 0 ? needChannelNumberByRecord : 1; + LOG.info("Job set Max-Record-Speed to " + globalLimitedRecordSpeed + " records."); + } + + // 取较小值 + this.needChannelNumber = needChannelNumberByByte < needChannelNumberByRecord ? 
+ needChannelNumberByByte : needChannelNumberByRecord; + + // 如果从byte或record上设置了needChannelNumber则退出 + if (this.needChannelNumber < Integer.MAX_VALUE) { + return; + } + + boolean isChannelLimit = (this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_CHANNEL, 0) > 0); + if (isChannelLimit) { + this.needChannelNumber = this.configuration.getInt( + CoreConstant.DATAX_JOB_SETTING_SPEED_CHANNEL); + + LOG.info("Job set Channel-Number to " + this.needChannelNumber + + " channels."); + + return; + } + + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, + "Job运行速度必须设置"); + } + + /** + * schedule首先完成的工作是把上一步reader和writer split的结果整合到具体taskGroupContainer中, + * 同时不同的执行模式调用不同的调度策略,将所有任务调度起来 + */ + private void schedule() { + /** + * 这里的全局speed和每个channel的速度设置为B/s + */ + int channelsPerTaskGroup = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, 5); + int taskNumber = this.configuration.getList( + CoreConstant.DATAX_JOB_CONTENT).size(); + + this.needChannelNumber = Math.min(this.needChannelNumber, taskNumber); + PerfTrace.getInstance().setChannelNumber(needChannelNumber); + + /** + * 通过获取配置信息得到每个taskGroup需要运行哪些tasks任务 + */ + + List taskGroupConfigs = JobAssignUtil.assignFairly(this.configuration, + this.needChannelNumber, channelsPerTaskGroup); + + LOG.info("Scheduler starts [{}] taskGroups.", taskGroupConfigs.size()); + + ExecuteMode executeMode = null; + AbstractScheduler scheduler; + try { + executeMode = ExecuteMode.STANDALONE; + scheduler = initStandaloneScheduler(this.configuration); + + //设置 executeMode + for (Configuration taskGroupConfig : taskGroupConfigs) { + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_JOB_MODE, executeMode.getValue()); + } + + if (executeMode == ExecuteMode.LOCAL || executeMode == ExecuteMode.DISTRIBUTE) { + if (this.jobId <= 0) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "在[ local | distribute ]模式下必须设置jobId,并且其值 > 0 ."); + } + } + + LOG.info("Running by {} Mode.", executeMode); + + this.startTransferTimeStamp = System.currentTimeMillis(); + + scheduler.schedule(taskGroupConfigs); + + this.endTransferTimeStamp = System.currentTimeMillis(); + } catch (Exception e) { + LOG.error("运行scheduler 模式[{}]出错.", executeMode); + this.endTransferTimeStamp = System.currentTimeMillis(); + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } + + /** + * 检查任务执行情况 + */ + this.checkLimit(); + } + + + private AbstractScheduler initStandaloneScheduler(Configuration configuration) { + AbstractContainerCommunicator containerCommunicator = new StandAloneJobContainerCommunicator(configuration); + super.setContainerCommunicator(containerCommunicator); + + return new StandAloneScheduler(containerCommunicator); + } + + private void post() { + this.postJobWriter(); + this.postJobReader(); + } + + private void destroy() { + if (this.jobWriter != null) { + this.jobWriter.destroy(); + this.jobWriter = null; + } + if (this.jobReader != null) { + this.jobReader.destroy(); + this.jobReader = null; + } + } + + private void logStatistics() { + long totalCosts = (this.endTimeStamp - this.startTimeStamp) / 1000; + long transferCosts = (this.endTransferTimeStamp - this.startTransferTimeStamp) / 1000; + if (0L == transferCosts) { + transferCosts = 1L; + } + + if (super.getContainerCommunicator() == null) { + return; + } + + Communication communication = super.getContainerCommunicator().collect(); + communication.setTimestamp(this.endTimeStamp); + + Communication tempComm = 
new Communication(); + tempComm.setTimestamp(this.startTransferTimeStamp); + + Communication reportCommunication = CommunicationTool.getReportCommunication(communication, tempComm, this.totalStage); + + // 字节速率 + long byteSpeedPerSecond = communication.getLongCounter(CommunicationTool.READ_SUCCEED_BYTES) + / transferCosts; + + long recordSpeedPerSecond = communication.getLongCounter(CommunicationTool.READ_SUCCEED_RECORDS) + / transferCosts; + + reportCommunication.setLongCounter(CommunicationTool.BYTE_SPEED, byteSpeedPerSecond); + reportCommunication.setLongCounter(CommunicationTool.RECORD_SPEED, recordSpeedPerSecond); + + super.getContainerCommunicator().report(reportCommunication); + + + LOG.info(String.format( + "\n" + "%-26s: %-18s\n" + "%-26s: %-18s\n" + "%-26s: %19s\n" + + "%-26s: %19s\n" + "%-26s: %19s\n" + "%-26s: %19s\n" + + "%-26s: %19s\n", + "任务启动时刻", + dateFormat.format(startTimeStamp), + + "任务结束时刻", + dateFormat.format(endTimeStamp), + + "任务总计耗时", + String.valueOf(totalCosts) + "s", + "任务平均流量", + StrUtil.stringify(byteSpeedPerSecond) + + "/s", + "记录写入速度", + String.valueOf(recordSpeedPerSecond) + + "rec/s", "读出记录总数", + String.valueOf(CommunicationTool.getTotalReadRecords(communication)), + "读写失败总数", + String.valueOf(CommunicationTool.getTotalErrorRecords(communication)) + )); + + if (communication.getLongCounter(CommunicationTool.TRANSFORMER_SUCCEED_RECORDS) > 0 + || communication.getLongCounter(CommunicationTool.TRANSFORMER_FAILED_RECORDS) > 0 + || communication.getLongCounter(CommunicationTool.TRANSFORMER_FILTER_RECORDS) > 0) { + LOG.info(String.format( + "\n" + "%-26s: %19s\n" + "%-26s: %19s\n" + "%-26s: %19s\n", + "Transformer成功记录总数", + communication.getLongCounter(CommunicationTool.TRANSFORMER_SUCCEED_RECORDS), + + "Transformer失败记录总数", + communication.getLongCounter(CommunicationTool.TRANSFORMER_FAILED_RECORDS), + + "Transformer过滤记录总数", + communication.getLongCounter(CommunicationTool.TRANSFORMER_FILTER_RECORDS) + )); + } + + + } + + /** + * reader job的初始化,返回Reader.Job + * + * @return + */ + private Reader.Job initJobReader( + JobPluginCollector jobPluginCollector) { + this.readerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_READER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + + Reader.Job jobReader = (Reader.Job) LoadUtil.loadJobPlugin( + PluginType.READER, this.readerPluginName); + + // 设置reader的jobConfig + jobReader.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + // 设置reader的readerConfig + jobReader.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER)); + + jobReader.setJobPluginCollector(jobPluginCollector); + jobReader.init(); + + classLoaderSwapper.restoreCurrentThreadClassLoader(); + return jobReader; + } + + /** + * writer job的初始化,返回Writer.Job + * + * @return + */ + private Writer.Job initJobWriter( + JobPluginCollector jobPluginCollector) { + this.writerPluginName = this.configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME); + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + + Writer.Job jobWriter = (Writer.Job) LoadUtil.loadJobPlugin( + PluginType.WRITER, this.writerPluginName); + + // 设置writer的jobConfig + jobWriter.setPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_WRITER_PARAMETER)); + + // 设置reader的readerConfig + 
jobWriter.setPeerPluginJobConf(this.configuration.getConfiguration( + CoreConstant.DATAX_JOB_CONTENT_READER_PARAMETER)); + + jobWriter.setPeerPluginName(this.readerPluginName); + jobWriter.setJobPluginCollector(jobPluginCollector); + jobWriter.init(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + return jobWriter; + } + + private void prepareJobReader() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + LOG.info(String.format("DataX Reader.Job [%s] do prepare work .", + this.readerPluginName)); + this.jobReader.prepare(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + private void prepareJobWriter() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + LOG.info(String.format("DataX Writer.Job [%s] do prepare work .", + this.writerPluginName)); + this.jobWriter.prepare(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + // TODO: 如果源头就是空数据 + private List doReaderSplit(int adviceNumber) { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + List readerSlicesConfigs = + this.jobReader.split(adviceNumber); + if (readerSlicesConfigs == null || readerSlicesConfigs.size() <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_SPLIT_ERROR, + "reader切分的task数目不能小于等于0"); + } + LOG.info("DataX Reader.Job [{}] splits to [{}] tasks.", + this.readerPluginName, readerSlicesConfigs.size()); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + return readerSlicesConfigs; + } + + private List doWriterSplit(int readerTaskNumber) { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + + List writerSlicesConfigs = this.jobWriter + .split(readerTaskNumber); + if (writerSlicesConfigs == null || writerSlicesConfigs.size() <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_SPLIT_ERROR, + "writer切分的task不能小于等于0"); + } + LOG.info("DataX Writer.Job [{}] splits to [{}] tasks.", + this.writerPluginName, writerSlicesConfigs.size()); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + + return writerSlicesConfigs; + } + + /** + * 按顺序整合reader和writer的配置,这里的顺序不能乱! 
输入是reader、writer级别的配置,输出是一个完整task的配置 + * + * @param readerTasksConfigs + * @param writerTasksConfigs + * @return + */ + private List mergeReaderAndWriterTaskConfigs( + List readerTasksConfigs, + List writerTasksConfigs) { + return mergeReaderAndWriterTaskConfigs(readerTasksConfigs, writerTasksConfigs, null); + } + + private List mergeReaderAndWriterTaskConfigs( + List readerTasksConfigs, + List writerTasksConfigs, + List transformerConfigs) { + if (readerTasksConfigs.size() != writerTasksConfigs.size()) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_SPLIT_ERROR, + String.format("reader切分的task数目[%d]不等于writer切分的task数目[%d].", + readerTasksConfigs.size(), writerTasksConfigs.size()) + ); + } + + List contentConfigs = new ArrayList(); + for (int i = 0; i < readerTasksConfigs.size(); i++) { + Configuration taskConfig = Configuration.newDefault(); + taskConfig.set(CoreConstant.JOB_READER_NAME, + this.readerPluginName); + taskConfig.set(CoreConstant.JOB_READER_PARAMETER, + readerTasksConfigs.get(i)); + taskConfig.set(CoreConstant.JOB_WRITER_NAME, + this.writerPluginName); + taskConfig.set(CoreConstant.JOB_WRITER_PARAMETER, + writerTasksConfigs.get(i)); + + if(transformerConfigs!=null && transformerConfigs.size()>0){ + taskConfig.set(CoreConstant.JOB_TRANSFORMER, transformerConfigs); + } + + taskConfig.set(CoreConstant.TASK_ID, i); + contentConfigs.add(taskConfig); + } + + return contentConfigs; + } + + /** + * 这里比较复杂,分两步整合 1. tasks到channel 2. channel到taskGroup + * 合起来考虑,其实就是把tasks整合到taskGroup中,需要满足计算出的channel数,同时不能多起channel + *

+ * example: + *

+     * 前提条件: 切分后是1024个分表,假设用户要求总速率是1000M/s,每个channel的速率为3M/s,
+     * 每个taskGroup负责运行7个channel

+     * 计算: 总channel数为:1000M/s / 3M/s =
+     * 333个,为平均分配,计算可知有308个每个channel有3个tasks,而有25个每个channel有4个tasks,
+     * 需要的taskGroup数为:333 / 7 =
+     * 47...4,也就是需要48个taskGroup,47个是每个负责7个channel,有1个负责4个channel

+     * 处理:我们先将这个负责4个channel的taskGroup处理掉,逻辑是:
+     * 先按每个channel平均3个task取出4个channel,设置taskGroupId为0,
+     * 接下来就像发牌一样轮询分配task到剩下的包含平均channel数的taskGroup中
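+     * 注:当前 schedule() 已改用 JobAssignUtil.assignFairly 进行任务分配,本方法在本类中未见调用,
+     * 仅保留作参考(见下方 TODO)。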

+ * TODO delete it + * + * @param averTaskPerChannel + * @param channelNumber + * @param channelsPerTaskGroup + * @return 每个taskGroup独立的全部配置 + */ + @SuppressWarnings("serial") + private List distributeTasksToTaskGroup( + int averTaskPerChannel, int channelNumber, + int channelsPerTaskGroup) { + Validate.isTrue(averTaskPerChannel > 0 && channelNumber > 0 + && channelsPerTaskGroup > 0, + "每个channel的平均task数[averTaskPerChannel],channel数目[channelNumber],每个taskGroup的平均channel数[channelsPerTaskGroup]都应该为正数"); + List taskConfigs = this.configuration + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + int taskGroupNumber = channelNumber / channelsPerTaskGroup; + int leftChannelNumber = channelNumber % channelsPerTaskGroup; + if (leftChannelNumber > 0) { + taskGroupNumber += 1; + } + + /** + * 如果只有一个taskGroup,直接打标返回 + */ + if (taskGroupNumber == 1) { + final Configuration taskGroupConfig = this.configuration.clone(); + /** + * configure的clone不能clone出 + */ + taskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, this.configuration + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT)); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + channelNumber); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, 0); + return new ArrayList() { + { + add(taskGroupConfig); + } + }; + } + + List taskGroupConfigs = new ArrayList(); + /** + * 将每个taskGroup中content的配置清空 + */ + for (int i = 0; i < taskGroupNumber; i++) { + Configuration taskGroupConfig = this.configuration.clone(); + List taskGroupJobContent = taskGroupConfig + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + taskGroupJobContent.clear(); + taskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, taskGroupJobContent); + + taskGroupConfigs.add(taskGroupConfig); + } + + int taskConfigIndex = 0; + int channelIndex = 0; + int taskGroupConfigIndex = 0; + + /** + * 先处理掉taskGroup包含channel数不是平均值的taskGroup + */ + if (leftChannelNumber > 0) { + Configuration taskGroupConfig = taskGroupConfigs.get(taskGroupConfigIndex); + for (; channelIndex < leftChannelNumber; channelIndex++) { + for (int i = 0; i < averTaskPerChannel; i++) { + List taskGroupJobContent = taskGroupConfig + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + taskGroupJobContent.add(taskConfigs.get(taskConfigIndex++)); + taskGroupConfig.set(CoreConstant.DATAX_JOB_CONTENT, + taskGroupJobContent); + } + } + + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + leftChannelNumber); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, + taskGroupConfigIndex++); + } + + /** + * 下面需要轮询分配,并打上channel数和taskGroupId标记 + */ + int equalDivisionStartIndex = taskGroupConfigIndex; + for (; taskConfigIndex < taskConfigs.size() + && equalDivisionStartIndex < taskGroupConfigs.size(); ) { + for (taskGroupConfigIndex = equalDivisionStartIndex; taskGroupConfigIndex < taskGroupConfigs + .size() && taskConfigIndex < taskConfigs.size(); taskGroupConfigIndex++) { + Configuration taskGroupConfig = taskGroupConfigs.get(taskGroupConfigIndex); + List taskGroupJobContent = taskGroupConfig + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + taskGroupJobContent.add(taskConfigs.get(taskConfigIndex++)); + taskGroupConfig.set( + CoreConstant.DATAX_JOB_CONTENT, taskGroupJobContent); + } + } + + for (taskGroupConfigIndex = equalDivisionStartIndex; + taskGroupConfigIndex < taskGroupConfigs.size(); ) { + Configuration taskGroupConfig = taskGroupConfigs.get(taskGroupConfigIndex); + 
taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL, + channelsPerTaskGroup); + taskGroupConfig.set(CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID, + taskGroupConfigIndex++); + } + + return taskGroupConfigs; + } + + private void postJobReader() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.readerPluginName)); + LOG.info("DataX Reader.Job [{}] do post work.", + this.readerPluginName); + this.jobReader.post(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + private void postJobWriter() { + classLoaderSwapper.setCurrentThreadClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.writerPluginName)); + LOG.info("DataX Writer.Job [{}] do post work.", + this.writerPluginName); + this.jobWriter.post(); + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + + /** + * 检查最终结果是否超出阈值,如果阈值设定小于1,则表示百分数阈值,大于1表示条数阈值。 + * + * @param + */ + private void checkLimit() { + Communication communication = super.getContainerCommunicator().collect(); + errorLimit.checkRecordLimit(communication); + errorLimit.checkPercentageLimit(communication); + } + + /** + * 调用外部hook + */ + private void invokeHooks() { + Communication comm = super.getContainerCommunicator().collect(); + HookInvoker invoker = new HookInvoker(CoreConstant.DATAX_HOME + "/hook", configuration, comm.getCounter()); + invoker.invokeAll(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/meta/ExecuteMode.java b/core/src/main/java/com/alibaba/datax/core/job/meta/ExecuteMode.java new file mode 100644 index 0000000000..956f9c4b2d --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/meta/ExecuteMode.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.core.job.meta; + +/** + * Created by liupeng on 15/12/21. + */ +public enum ExecuteMode { + STANDALONE("standalone"), ; + + String value; + + private ExecuteMode(String value) { + this.value = value; + } + + public String value() { + return this.value; + } + + public String getValue() { + return this.value; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/meta/State.java b/core/src/main/java/com/alibaba/datax/core/job/meta/State.java new file mode 100644 index 0000000000..2a1dd227e6 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/meta/State.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.core.job.meta; + +/** + * Created by liupeng on 15/12/21. 
+ */ +public enum State { + SUBMITTING(10), + WAITING(20), + RUNNING(30), + KILLING(40), + KILLED(50), + FAILED(60), + SUCCEEDED(70), ; + + int value; + + private State(int value) { + this.value = value; + } + + public int value() { + return this.value; + } + + public boolean isFinished() { + return this == KILLED || this == FAILED || this == SUCCEEDED; + } + + public boolean isRunning() { + return !this.isFinished(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/scheduler/AbstractScheduler.java b/core/src/main/java/com/alibaba/datax/core/job/scheduler/AbstractScheduler.java new file mode 100755 index 0000000000..ab2b5aa327 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/scheduler/AbstractScheduler.java @@ -0,0 +1,135 @@ +package com.alibaba.datax.core.job.scheduler; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.util.ErrorRecordChecker; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import org.apache.commons.lang.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public abstract class AbstractScheduler { + private static final Logger LOG = LoggerFactory + .getLogger(AbstractScheduler.class); + + private ErrorRecordChecker errorLimit; + + private AbstractContainerCommunicator containerCommunicator; + + private Long jobId; + + public Long getJobId() { + return jobId; + } + + public AbstractScheduler(AbstractContainerCommunicator containerCommunicator) { + this.containerCommunicator = containerCommunicator; + } + + public void schedule(List configurations) { + Validate.notNull(configurations, + "scheduler配置不能为空"); + int jobReportIntervalInMillSec = configurations.get(0).getInt( + CoreConstant.DATAX_CORE_CONTAINER_JOB_REPORTINTERVAL, 30000); + int jobSleepIntervalInMillSec = configurations.get(0).getInt( + CoreConstant.DATAX_CORE_CONTAINER_JOB_SLEEPINTERVAL, 10000); + + this.jobId = configurations.get(0).getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + + errorLimit = new ErrorRecordChecker(configurations.get(0)); + + /** + * 给 taskGroupContainer 的 Communication 注册 + */ + this.containerCommunicator.registerCommunication(configurations); + + int totalTasks = calculateTaskCount(configurations); + startAllTaskGroup(configurations); + + Communication lastJobContainerCommunication = new Communication(); + + long lastReportTimeStamp = System.currentTimeMillis(); + try { + while (true) { + /** + * step 1: collect job stat + * step 2: getReport info, then report it + * step 3: errorLimit do check + * step 4: dealSucceedStat(); + * step 5: dealKillingStat(); + * step 6: dealFailedStat(); + * step 7: refresh last job stat, and then sleep for next while + * + * above steps, some ones should report info to DS + * + */ + Communication nowJobContainerCommunication = this.containerCommunicator.collect(); + nowJobContainerCommunication.setTimestamp(System.currentTimeMillis()); + LOG.debug(nowJobContainerCommunication.toString()); + + //汇报周期 + long now = System.currentTimeMillis(); + if (now - lastReportTimeStamp > jobReportIntervalInMillSec) { + 
Communication reportCommunication = CommunicationTool + .getReportCommunication(nowJobContainerCommunication, lastJobContainerCommunication, totalTasks); + + this.containerCommunicator.report(reportCommunication); + lastReportTimeStamp = now; + lastJobContainerCommunication = nowJobContainerCommunication; + } + + errorLimit.checkRecordLimit(nowJobContainerCommunication); + + if (nowJobContainerCommunication.getState() == State.SUCCEEDED) { + LOG.info("Scheduler accomplished all tasks."); + break; + } + + if (isJobKilling(this.getJobId())) { + dealKillingStat(this.containerCommunicator, totalTasks); + } else if (nowJobContainerCommunication.getState() == State.FAILED) { + dealFailedStat(this.containerCommunicator, nowJobContainerCommunication.getThrowable()); + } + + Thread.sleep(jobSleepIntervalInMillSec); + } + } catch (InterruptedException e) { + // 以 failed 状态退出 + LOG.error("捕获到InterruptedException异常!", e); + + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } + + } + + protected abstract void startAllTaskGroup(List configurations); + + protected abstract void dealFailedStat(AbstractContainerCommunicator frameworkCollector, Throwable throwable); + + protected abstract void dealKillingStat(AbstractContainerCommunicator frameworkCollector, int totalTasks); + + private int calculateTaskCount(List configurations) { + int totalTasks = 0; + for (Configuration taskGroupConfiguration : configurations) { + totalTasks += taskGroupConfiguration.getListConfiguration( + CoreConstant.DATAX_JOB_CONTENT).size(); + } + return totalTasks; + } + +// private boolean isJobKilling(Long jobId) { +// Result jobInfo = DataxServiceUtil.getJobInfo(jobId); +// return jobInfo.getData() == State.KILLING.value(); +// } + + protected abstract boolean isJobKilling(Long jobId); +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/ProcessInnerScheduler.java b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/ProcessInnerScheduler.java new file mode 100755 index 0000000000..2bc6e64c97 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/ProcessInnerScheduler.java @@ -0,0 +1,60 @@ +package com.alibaba.datax.core.job.scheduler.processinner; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.job.scheduler.AbstractScheduler; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.taskgroup.TaskGroupContainer; +import com.alibaba.datax.core.taskgroup.runner.TaskGroupContainerRunner; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +import java.util.List; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; + +public abstract class ProcessInnerScheduler extends AbstractScheduler { + + private ExecutorService taskGroupContainerExecutorService; + + public ProcessInnerScheduler(AbstractContainerCommunicator containerCommunicator) { + super(containerCommunicator); + } + + @Override + public void startAllTaskGroup(List configurations) { + this.taskGroupContainerExecutorService = Executors + .newFixedThreadPool(configurations.size()); + + for (Configuration taskGroupConfiguration : configurations) { + TaskGroupContainerRunner taskGroupContainerRunner = newTaskGroupContainerRunner(taskGroupConfiguration); + this.taskGroupContainerExecutorService.execute(taskGroupContainerRunner); + } + + 
this.taskGroupContainerExecutorService.shutdown(); + } + + @Override + public void dealFailedStat(AbstractContainerCommunicator frameworkCollector, Throwable throwable) { + this.taskGroupContainerExecutorService.shutdownNow(); + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_RUNTIME_ERROR, throwable); + } + + + @Override + public void dealKillingStat(AbstractContainerCommunicator frameworkCollector, int totalTasks) { + //通过进程退出返回码标示状态 + this.taskGroupContainerExecutorService.shutdownNow(); + throw DataXException.asDataXException(FrameworkErrorCode.KILLED_EXIT_VALUE, + "job killed status"); + } + + + private TaskGroupContainerRunner newTaskGroupContainerRunner( + Configuration configuration) { + TaskGroupContainer taskGroupContainer = new TaskGroupContainer(configuration); + + return new TaskGroupContainerRunner(taskGroupContainer); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/StandAloneScheduler.java b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/StandAloneScheduler.java new file mode 100755 index 0000000000..d87421b7a2 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/job/scheduler/processinner/StandAloneScheduler.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.core.job.scheduler.processinner; + +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; + +/** + * Created by hongjiao.hj on 2014/12/22. + */ +public class StandAloneScheduler extends ProcessInnerScheduler{ + + public StandAloneScheduler(AbstractContainerCommunicator containerCommunicator) { + super(containerCommunicator); + } + + @Override + protected boolean isJobKilling(Long jobId) { + return false; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/communication/Communication.java b/core/src/main/java/com/alibaba/datax/core/statistics/communication/Communication.java new file mode 100755 index 0000000000..97867c951b --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/communication/Communication.java @@ -0,0 +1,281 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.common.base.BaseObject; +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang.Validate; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.concurrent.ConcurrentHashMap; + +/** + * DataX所有的状态及统计信息交互类,job、taskGroup、task等的消息汇报都走该类 + */ +public class Communication extends BaseObject implements Cloneable { + /** + * 所有的数值key-value对 * + */ + private Map counter; + + /** + * 运行状态 * + */ + private State state; + + /** + * 异常记录 * + */ + private Throwable throwable; + + /** + * 记录的timestamp * + */ + private long timestamp; + + /** + * task给job的信息 * + */ + Map> message; + + public Communication() { + this.init(); + } + + public synchronized void reset() { + this.init(); + } + + private void init() { + this.counter = new ConcurrentHashMap(); + this.state = State.RUNNING; + this.throwable = null; + this.message = new ConcurrentHashMap>(); + this.timestamp = System.currentTimeMillis(); + } + + public Map getCounter() { + return this.counter; + } + + public State getState() { + return this.state; + } + + public synchronized void setState(State state, boolean isForce) { + if (!isForce && this.state.equals(State.FAILED)) { + return; + } + + this.state = state; + } + + public synchronized void 
setState(State state) { + setState(state, false); + } + + public Throwable getThrowable() { + return this.throwable; + } + + public synchronized String getThrowableMessage() { + return this.throwable == null ? "" : this.throwable.getMessage(); + } + + public void setThrowable(Throwable throwable) { + setThrowable(throwable, false); + } + + public synchronized void setThrowable(Throwable throwable, boolean isForce) { + if (isForce) { + this.throwable = throwable; + } else { + this.throwable = this.throwable == null ? throwable : this.throwable; + } + } + + public long getTimestamp() { + return this.timestamp; + } + + public void setTimestamp(long timestamp) { + this.timestamp = timestamp; + } + + public Map> getMessage() { + return this.message; + } + + public List getMessage(final String key) { + return message.get(key); + } + + public synchronized void addMessage(final String key, final String value) { + Validate.isTrue(StringUtils.isNotBlank(key), "增加message的key不能为空"); + List valueList = this.message.get(key); + if (null == valueList) { + valueList = new ArrayList(); + this.message.put(key, valueList); + } + + valueList.add(value); + } + + public synchronized Long getLongCounter(final String key) { + Number value = this.counter.get(key); + + return value == null ? 0 : value.longValue(); + } + + public synchronized void setLongCounter(final String key, final long value) { + Validate.isTrue(StringUtils.isNotBlank(key), "设置counter的key不能为空"); + this.counter.put(key, value); + } + + public synchronized Double getDoubleCounter(final String key) { + Number value = this.counter.get(key); + + return value == null ? 0.0d : value.doubleValue(); + } + + public synchronized void setDoubleCounter(final String key, final double value) { + Validate.isTrue(StringUtils.isNotBlank(key), "设置counter的key不能为空"); + this.counter.put(key, value); + } + + public synchronized void increaseCounter(final String key, final long deltaValue) { + Validate.isTrue(StringUtils.isNotBlank(key), "增加counter的key不能为空"); + + long value = this.getLongCounter(key); + + this.counter.put(key, value + deltaValue); + } + + @Override + public Communication clone() { + Communication communication = new Communication(); + + /** + * clone counter + */ + if (this.counter != null) { + for (Map.Entry entry : this.counter.entrySet()) { + String key = entry.getKey(); + Number value = entry.getValue(); + if (value instanceof Long) { + communication.setLongCounter(key, (Long) value); + } else if (value instanceof Double) { + communication.setDoubleCounter(key, (Double) value); + } + } + } + + communication.setState(this.state, true); + communication.setThrowable(this.throwable, true); + communication.setTimestamp(this.timestamp); + + /** + * clone message + */ + if (this.message != null) { + for (final Map.Entry> entry : this.message.entrySet()) { + String key = entry.getKey(); + List value = new ArrayList() {{ + addAll(entry.getValue()); + }}; + communication.getMessage().put(key, value); + } + } + + return communication; + } + + public synchronized Communication mergeFrom(final Communication otherComm) { + if (otherComm == null) { + return this; + } + + /** + * counter的合并,将otherComm的值累加到this中,不存在的则创建 + * 同为long + */ + for (Entry entry : otherComm.getCounter().entrySet()) { + String key = entry.getKey(); + Number otherValue = entry.getValue(); + if (otherValue == null) { + continue; + } + + Number value = this.counter.get(key); + if (value == null) { + value = otherValue; + } else { + if (value instanceof Long && otherValue instanceof Long) { + 
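+                    // both counters are longs, so accumulate them as longs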
value = value.longValue() + otherValue.longValue(); + } else { + value = value.doubleValue() + value.doubleValue(); + } + } + + this.counter.put(key, value); + } + + // 合并state + mergeStateFrom(otherComm); + + /** + * 合并throwable,当this的throwable为空时, + * 才将otherComm的throwable合并进来 + */ + this.throwable = this.throwable == null ? otherComm.getThrowable() : this.throwable; + + /** + * timestamp是整个一次合并的时间戳,单独两两communication不作合并 + */ + + /** + * message的合并采取求并的方式,即全部累计在一起 + */ + for (Entry> entry : otherComm.getMessage().entrySet()) { + String key = entry.getKey(); + List valueList = this.message.get(key); + if (valueList == null) { + valueList = new ArrayList(); + this.message.put(key, valueList); + } + + valueList.addAll(entry.getValue()); + } + + return this; + } + + /** + * 合并state,优先级: (Failed | Killed) > Running > Success + * 这里不会出现 Killing 状态,killing 状态只在 Job 自身状态上才有. + */ + public synchronized State mergeStateFrom(final Communication otherComm) { + State retState = this.getState(); + if (otherComm == null) { + return retState; + } + + if (this.state == State.FAILED || otherComm.getState() == State.FAILED + || this.state == State.KILLED || otherComm.getState() == State.KILLED) { + retState = State.FAILED; + } else if (this.state.isRunning() || otherComm.state.isRunning()) { + retState = State.RUNNING; + } + + this.setState(retState); + return retState; + } + + public synchronized boolean isFinished(){ + return this.state == State.SUCCEEDED || this.state == State.FAILED + || this.state == State.KILLED; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/communication/CommunicationTool.java b/core/src/main/java/com/alibaba/datax/core/statistics/communication/CommunicationTool.java new file mode 100755 index 0000000000..51a601aeb6 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/communication/CommunicationTool.java @@ -0,0 +1,284 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.util.StrUtil; +import com.alibaba.fastjson.JSON; +import org.apache.commons.lang.Validate; + +import java.text.DecimalFormat; +import java.util.HashMap; +import java.util.Map; + +/** + * 这里主要是业务层面的处理 + */ +public final class CommunicationTool { + public static final String STAGE = "stage"; + public static final String BYTE_SPEED = "byteSpeed"; + public static final String RECORD_SPEED = "recordSpeed"; + public static final String PERCENTAGE = "percentage"; + + public static final String READ_SUCCEED_RECORDS = "readSucceedRecords"; + public static final String READ_SUCCEED_BYTES = "readSucceedBytes"; + + public static final String READ_FAILED_RECORDS = "readFailedRecords"; + public static final String READ_FAILED_BYTES = "readFailedBytes"; + + public static final String WRITE_RECEIVED_RECORDS = "writeReceivedRecords"; + public static final String WRITE_RECEIVED_BYTES = "writeReceivedBytes"; + + public static final String WRITE_FAILED_RECORDS = "writeFailedRecords"; + public static final String WRITE_FAILED_BYTES = "writeFailedBytes"; + + public static final String TOTAL_READ_RECORDS = "totalReadRecords"; + private static final String TOTAL_READ_BYTES = "totalReadBytes"; + + private static final String TOTAL_ERROR_RECORDS = "totalErrorRecords"; + private static final String TOTAL_ERROR_BYTES = "totalErrorBytes"; + + private static final String WRITE_SUCCEED_RECORDS = "writeSucceedRecords"; + private static final String WRITE_SUCCEED_BYTES = "writeSucceedBytes"; + + public 
static final String WAIT_WRITER_TIME = "waitWriterTime"; + + public static final String WAIT_READER_TIME = "waitReaderTime"; + + public static final String TRANSFORMER_USED_TIME = "totalTransformerUsedTime"; + public static final String TRANSFORMER_SUCCEED_RECORDS = "totalTransformerSuccessRecords"; + public static final String TRANSFORMER_FAILED_RECORDS = "totalTransformerFailedRecords"; + public static final String TRANSFORMER_FILTER_RECORDS = "totalTransformerFilterRecords"; + public static final String TRANSFORMER_NAME_PREFIX = "usedTimeByTransformer_"; + + public static Communication getReportCommunication(Communication now, Communication old, int totalStage) { + Validate.isTrue(now != null && old != null, + "为汇报准备的新旧metric不能为null"); + + long totalReadRecords = getTotalReadRecords(now); + long totalReadBytes = getTotalReadBytes(now); + now.setLongCounter(TOTAL_READ_RECORDS, totalReadRecords); + now.setLongCounter(TOTAL_READ_BYTES, totalReadBytes); + now.setLongCounter(TOTAL_ERROR_RECORDS, getTotalErrorRecords(now)); + now.setLongCounter(TOTAL_ERROR_BYTES, getTotalErrorBytes(now)); + now.setLongCounter(WRITE_SUCCEED_RECORDS, getWriteSucceedRecords(now)); + now.setLongCounter(WRITE_SUCCEED_BYTES, getWriteSucceedBytes(now)); + + long timeInterval = now.getTimestamp() - old.getTimestamp(); + long sec = timeInterval <= 1000 ? 1 : timeInterval / 1000; + long bytesSpeed = (totalReadBytes + - getTotalReadBytes(old)) / sec; + long recordsSpeed = (totalReadRecords + - getTotalReadRecords(old)) / sec; + + now.setLongCounter(BYTE_SPEED, bytesSpeed < 0 ? 0 : bytesSpeed); + now.setLongCounter(RECORD_SPEED, recordsSpeed < 0 ? 0 : recordsSpeed); + now.setDoubleCounter(PERCENTAGE, now.getLongCounter(STAGE) / (double) totalStage); + + if (old.getThrowable() != null) { + now.setThrowable(old.getThrowable()); + } + + return now; + } + + public static long getTotalReadRecords(final Communication communication) { + return communication.getLongCounter(READ_SUCCEED_RECORDS) + + communication.getLongCounter(READ_FAILED_RECORDS); + } + + public static long getTotalReadBytes(final Communication communication) { + return communication.getLongCounter(READ_SUCCEED_BYTES) + + communication.getLongCounter(READ_FAILED_BYTES); + } + + public static long getTotalErrorRecords(final Communication communication) { + return communication.getLongCounter(READ_FAILED_RECORDS) + + communication.getLongCounter(WRITE_FAILED_RECORDS); + } + + public static long getTotalErrorBytes(final Communication communication) { + return communication.getLongCounter(READ_FAILED_BYTES) + + communication.getLongCounter(WRITE_FAILED_BYTES); + } + + public static long getWriteSucceedRecords(final Communication communication) { + return communication.getLongCounter(WRITE_RECEIVED_RECORDS) - + communication.getLongCounter(WRITE_FAILED_RECORDS); + } + + public static long getWriteSucceedBytes(final Communication communication) { + return communication.getLongCounter(WRITE_RECEIVED_BYTES) - + communication.getLongCounter(WRITE_FAILED_BYTES); + } + + public static class Stringify { + private final static DecimalFormat df = new DecimalFormat("0.00"); + + public static String getSnapshot(final Communication communication) { + StringBuilder sb = new StringBuilder(); + sb.append("Total "); + sb.append(getTotal(communication)); + sb.append(" | "); + sb.append("Speed "); + sb.append(getSpeed(communication)); + sb.append(" | "); + sb.append("Error "); + sb.append(getError(communication)); + sb.append(" | "); + sb.append(" All Task WaitWriterTime "); + 
sb.append(PerfTrace.unitTime(communication.getLongCounter(WAIT_WRITER_TIME))); + sb.append(" | "); + sb.append(" All Task WaitReaderTime "); + sb.append(PerfTrace.unitTime(communication.getLongCounter(WAIT_READER_TIME))); + sb.append(" | "); + if (communication.getLongCounter(CommunicationTool.TRANSFORMER_USED_TIME) > 0 + || communication.getLongCounter(CommunicationTool.TRANSFORMER_SUCCEED_RECORDS) > 0 + ||communication.getLongCounter(CommunicationTool.TRANSFORMER_FAILED_RECORDS) > 0 + || communication.getLongCounter(CommunicationTool.TRANSFORMER_FILTER_RECORDS) > 0) { + sb.append("Transfermor Success "); + sb.append(String.format("%d records", communication.getLongCounter(CommunicationTool.TRANSFORMER_SUCCEED_RECORDS))); + sb.append(" | "); + sb.append("Transformer Error "); + sb.append(String.format("%d records", communication.getLongCounter(CommunicationTool.TRANSFORMER_FAILED_RECORDS))); + sb.append(" | "); + sb.append("Transformer Filter "); + sb.append(String.format("%d records", communication.getLongCounter(CommunicationTool.TRANSFORMER_FILTER_RECORDS))); + sb.append(" | "); + sb.append("Transformer usedTime "); + sb.append(PerfTrace.unitTime(communication.getLongCounter(CommunicationTool.TRANSFORMER_USED_TIME))); + sb.append(" | "); + } + sb.append("Percentage "); + sb.append(getPercentage(communication)); + return sb.toString(); + } + + private static String getTotal(final Communication communication) { + return String.format("%d records, %d bytes", + communication.getLongCounter(TOTAL_READ_RECORDS), + communication.getLongCounter(TOTAL_READ_BYTES)); + } + + private static String getSpeed(final Communication communication) { + return String.format("%s/s, %d records/s", + StrUtil.stringify(communication.getLongCounter(BYTE_SPEED)), + communication.getLongCounter(RECORD_SPEED)); + } + + private static String getError(final Communication communication) { + return String.format("%d records, %d bytes", + communication.getLongCounter(TOTAL_ERROR_RECORDS), + communication.getLongCounter(TOTAL_ERROR_BYTES)); + } + + private static String getPercentage(final Communication communication) { + return df.format(communication.getDoubleCounter(PERCENTAGE) * 100) + "%"; + } + } + + public static class Jsonify { + @SuppressWarnings("rawtypes") + public static String getSnapshot(Communication communication) { + Validate.notNull(communication); + + Map state = new HashMap(); + + Pair pair = getTotalBytes(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getTotalRecords(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getSpeedRecord(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getSpeedByte(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getStage(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getErrorRecords(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getErrorBytes(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getErrorMessage(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getPercentage(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getWaitReaderTime(communication); + state.put((String) pair.getKey(), pair.getValue()); + + pair = getWaitWriterTime(communication); + state.put((String) pair.getKey(), pair.getValue()); + + return JSON.toJSONString(state); + } + + private static Pair getTotalBytes(final 
Communication communication) { + return new Pair("totalBytes", communication.getLongCounter(TOTAL_READ_BYTES)); + } + + private static Pair getTotalRecords(final Communication communication) { + return new Pair("totalRecords", communication.getLongCounter(TOTAL_READ_RECORDS)); + } + + private static Pair getSpeedByte(final Communication communication) { + return new Pair("speedBytes", communication.getLongCounter(BYTE_SPEED)); + } + + private static Pair getSpeedRecord(final Communication communication) { + return new Pair("speedRecords", communication.getLongCounter(RECORD_SPEED)); + } + + private static Pair getErrorRecords(final Communication communication) { + return new Pair("errorRecords", communication.getLongCounter(TOTAL_ERROR_RECORDS)); + } + + private static Pair getErrorBytes(final Communication communication) { + return new Pair("errorBytes", communication.getLongCounter(TOTAL_ERROR_BYTES)); + } + + private static Pair getStage(final Communication communication) { + return new Pair("stage", communication.getLongCounter(STAGE)); + } + + private static Pair getPercentage(final Communication communication) { + return new Pair("percentage", communication.getDoubleCounter(PERCENTAGE)); + } + + private static Pair getErrorMessage(final Communication communication) { + return new Pair("errorMessage", communication.getThrowableMessage()); + } + + private static Pair getWaitReaderTime(final Communication communication) { + return new Pair("waitReaderTime", communication.getLongCounter(CommunicationTool.WAIT_READER_TIME)); + } + + private static Pair getWaitWriterTime(final Communication communication) { + return new Pair("waitWriterTime", communication.getLongCounter(CommunicationTool.WAIT_WRITER_TIME)); + } + + static class Pair { + public Pair(final K key, final V value) { + this.key = key; + this.value = value; + } + + public K getKey() { + return key; + } + + public V getValue() { + return value; + } + + private K key; + + private V value; + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/communication/LocalTGCommunicationManager.java b/core/src/main/java/com/alibaba/datax/core/statistics/communication/LocalTGCommunicationManager.java new file mode 100755 index 0000000000..0b0529f827 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/communication/LocalTGCommunicationManager.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.core.statistics.communication; + +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import org.apache.commons.lang3.Validate; + +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; + +public final class LocalTGCommunicationManager { + private static Map taskGroupCommunicationMap = + new ConcurrentHashMap(); + + public static void registerTaskGroupCommunication( + int taskGroupId, Communication communication) { + taskGroupCommunicationMap.put(taskGroupId, communication); + } + + public static Communication getJobCommunication() { + Communication communication = new Communication(); + communication.setState(State.SUCCEEDED); + + for (Communication taskGroupCommunication : + taskGroupCommunicationMap.values()) { + communication.mergeFrom(taskGroupCommunication); + } + + return communication; + } + + /** + * 采用获取taskGroupId后再获取对应communication的方式, + * 防止map遍历时修改,同时也防止对map key-value对的修改 + * + * @return + */ + public static Set getTaskGroupIdSet() { + return taskGroupCommunicationMap.keySet(); + } + + public static Communication getTaskGroupCommunication(int 
taskGroupId) { + Validate.isTrue(taskGroupId >= 0, "taskGroupId不能小于0"); + + return taskGroupCommunicationMap.get(taskGroupId); + } + + public static void updateTaskGroupCommunication(final int taskGroupId, + final Communication communication) { + Validate.isTrue(taskGroupCommunicationMap.containsKey( + taskGroupId), String.format("taskGroupCommunicationMap中没有注册taskGroupId[%d]的Communication," + + "无法更新该taskGroup的信息", taskGroupId)); + taskGroupCommunicationMap.put(taskGroupId, communication); + } + + public static void clear() { + taskGroupCommunicationMap.clear(); + } + + public static Map getTaskGroupCommunicationMap() { + return taskGroupCommunicationMap; + } +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/AbstractCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/AbstractCollector.java new file mode 100755 index 0000000000..45f631b721 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/AbstractCollector.java @@ -0,0 +1,68 @@ +package com.alibaba.datax.core.statistics.container.collector; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +public abstract class AbstractCollector { + private Map taskCommunicationMap = new ConcurrentHashMap(); + private Long jobId; + + public Map getTaskCommunicationMap() { + return taskCommunicationMap; + } + + public Long getJobId() { + return jobId; + } + + public void setJobId(Long jobId) { + this.jobId = jobId; + } + + public void registerTGCommunication(List taskGroupConfigurationList) { + for (Configuration config : taskGroupConfigurationList) { + int taskGroupId = config.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + LocalTGCommunicationManager.registerTaskGroupCommunication(taskGroupId, new Communication()); + } + } + + public void registerTaskCommunication(List taskConfigurationList) { + for (Configuration taskConfig : taskConfigurationList) { + int taskId = taskConfig.getInt(CoreConstant.TASK_ID); + this.taskCommunicationMap.put(taskId, new Communication()); + } + } + + public Communication collectFromTask() { + Communication communication = new Communication(); + communication.setState(State.SUCCEEDED); + + for (Communication taskCommunication : + this.taskCommunicationMap.values()) { + communication.mergeFrom(taskCommunication); + } + + return communication; + } + + public abstract Communication collectFromTaskGroup(); + + public Map getTGCommunicationMap() { + return LocalTGCommunicationManager.getTaskGroupCommunicationMap(); + } + + public Communication getTGCommunication(Integer taskGroupId) { + return LocalTGCommunicationManager.getTaskGroupCommunication(taskGroupId); + } + + public Communication getTaskCommunication(Integer taskId) { + return this.taskCommunicationMap.get(taskId); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/ProcessInnerCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/ProcessInnerCollector.java new file mode 100755 index 0000000000..530794b56a --- /dev/null +++ 
b/core/src/main/java/com/alibaba/datax/core/statistics/container/collector/ProcessInnerCollector.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.core.statistics.container.collector; + +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; + +public class ProcessInnerCollector extends AbstractCollector { + + public ProcessInnerCollector(Long jobId) { + super.setJobId(jobId); + } + + @Override + public Communication collectFromTaskGroup() { + return LocalTGCommunicationManager.getJobCommunication(); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/AbstractContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/AbstractContainerCommunicator.java new file mode 100755 index 0000000000..d9e2e63d25 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/AbstractContainerCommunicator.java @@ -0,0 +1,88 @@ +package com.alibaba.datax.core.statistics.container.communicator; + + +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.container.collector.AbstractCollector; +import com.alibaba.datax.core.statistics.container.report.AbstractReporter; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.dataxservice.face.domain.enums.State; + +import java.util.List; +import java.util.Map; + +public abstract class AbstractContainerCommunicator { + private Configuration configuration; + private AbstractCollector collector; + private AbstractReporter reporter; + + private Long jobId; + + private VMInfo vmInfo = VMInfo.getVmInfo(); + private long lastReportTime = System.currentTimeMillis(); + + + public Configuration getConfiguration() { + return this.configuration; + } + + public AbstractCollector getCollector() { + return collector; + } + + public AbstractReporter getReporter() { + return reporter; + } + + public void setCollector(AbstractCollector collector) { + this.collector = collector; + } + + public void setReporter(AbstractReporter reporter) { + this.reporter = reporter; + } + + public Long getJobId() { + return jobId; + } + + public AbstractContainerCommunicator(Configuration configuration) { + this.configuration = configuration; + this.jobId = configuration.getLong(CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + } + + + public abstract void registerCommunication(List configurationList); + + public abstract Communication collect(); + + public abstract void report(Communication communication); + + public abstract State collectState(); + + public abstract Communication getCommunication(Integer id); + + /** + * 当 实现是 TGContainerCommunicator 时,返回的 Map: key=taskId, value=Communication + * 当 实现是 JobContainerCommunicator 时,返回的 Map: key=taskGroupId, value=Communication + */ + public abstract Map getCommunicationMap(); + + public void resetCommunication(Integer id){ + Map map = getCommunicationMap(); + map.put(id, new Communication()); + } + + public void reportVmInfo(){ + long now = System.currentTimeMillis(); + //每5分钟打印一次 + if(now - lastReportTime >= 300000) { + //当前仅打印 + if (vmInfo != null) { + vmInfo.getDelta(true); + } + lastReportTime = now; + } + } +} \ No newline at end of file diff --git 
a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/job/StandAloneJobContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/job/StandAloneJobContainerCommunicator.java new file mode 100755 index 0000000000..7ace81180c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/job/StandAloneJobContainerCommunicator.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.core.statistics.container.communicator.job; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.collector.ProcessInnerCollector; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.statistics.container.report.ProcessInnerReporter; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; +import java.util.Map; + +public class StandAloneJobContainerCommunicator extends AbstractContainerCommunicator { + private static final Logger LOG = LoggerFactory + .getLogger(StandAloneJobContainerCommunicator.class); + + public StandAloneJobContainerCommunicator(Configuration configuration) { + super(configuration); + super.setCollector(new ProcessInnerCollector(configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID))); + super.setReporter(new ProcessInnerReporter()); + } + + @Override + public void registerCommunication(List configurationList) { + super.getCollector().registerTGCommunication(configurationList); + } + + @Override + public Communication collect() { + return super.getCollector().collectFromTaskGroup(); + } + + @Override + public State collectState() { + return this.collect().getState(); + } + + /** + * 和 DistributeJobContainerCollector 的 report 实现一样 + */ + @Override + public void report(Communication communication) { + super.getReporter().reportJobCommunication(super.getJobId(), communication); + + LOG.info(CommunicationTool.Stringify.getSnapshot(communication)); + reportVmInfo(); + } + + @Override + public Communication getCommunication(Integer taskGroupId) { + return super.getCollector().getTGCommunication(taskGroupId); + } + + @Override + public Map getCommunicationMap() { + return super.getCollector().getTGCommunicationMap(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/AbstractTGContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/AbstractTGContainerCommunicator.java new file mode 100755 index 0000000000..30ff2b045c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/AbstractTGContainerCommunicator.java @@ -0,0 +1,74 @@ +package com.alibaba.datax.core.statistics.container.communicator.taskgroup; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.container.collector.ProcessInnerCollector; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.util.container.CoreConstant; +import 
com.alibaba.datax.dataxservice.face.domain.enums.State; +import org.apache.commons.lang.Validate; + +import java.util.List; +import java.util.Map; + +/** + * 该类是用于处理 taskGroupContainer 的 communication 的收集汇报的父类 + * 主要是 taskCommunicationMap 记录了 taskExecutor 的 communication 属性 + */ +public abstract class AbstractTGContainerCommunicator extends AbstractContainerCommunicator { + + protected long jobId; + + /** + * 由于taskGroupContainer是进程内部调度 + * 其registerCommunication(),getCommunication(), + * getCommunications(),collect()等方法是一致的 + * 所有TG的Collector都是ProcessInnerCollector + */ + protected int taskGroupId; + + public AbstractTGContainerCommunicator(Configuration configuration) { + super(configuration); + this.jobId = configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + super.setCollector(new ProcessInnerCollector(this.jobId)); + this.taskGroupId = configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + } + + @Override + public void registerCommunication(List configurationList) { + super.getCollector().registerTaskCommunication(configurationList); + } + + @Override + public final Communication collect() { + return this.getCollector().collectFromTask(); + } + + @Override + public final State collectState() { + Communication communication = new Communication(); + communication.setState(State.SUCCEEDED); + + for (Communication taskCommunication : + super.getCollector().getTaskCommunicationMap().values()) { + communication.mergeStateFrom(taskCommunication); + } + + return communication.getState(); + } + + @Override + public final Communication getCommunication(Integer taskId) { + Validate.isTrue(taskId >= 0, "注册的taskId不能小于0"); + + return super.getCollector().getTaskCommunication(taskId); + } + + @Override + public final Map getCommunicationMap() { + return super.getCollector().getTaskCommunicationMap(); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/StandaloneTGContainerCommunicator.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/StandaloneTGContainerCommunicator.java new file mode 100755 index 0000000000..7852154df1 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/communicator/taskgroup/StandaloneTGContainerCommunicator.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.core.statistics.container.communicator.taskgroup; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.container.report.ProcessInnerReporter; +import com.alibaba.datax.core.statistics.communication.Communication; + +public class StandaloneTGContainerCommunicator extends AbstractTGContainerCommunicator { + + public StandaloneTGContainerCommunicator(Configuration configuration) { + super(configuration); + super.setReporter(new ProcessInnerReporter()); + } + + @Override + public void report(Communication communication) { + super.getReporter().reportTGCommunication(super.taskGroupId, communication); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/report/AbstractReporter.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/AbstractReporter.java new file mode 100755 index 0000000000..57f98587aa --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/AbstractReporter.java @@ -0,0 +1,11 @@ +package com.alibaba.datax.core.statistics.container.report; + +import com.alibaba.datax.core.statistics.communication.Communication; + +public 
abstract class AbstractReporter { + + public abstract void reportJobCommunication(Long jobId, Communication communication); + + public abstract void reportTGCommunication(Integer taskGroupId, Communication communication); + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/container/report/ProcessInnerReporter.java b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/ProcessInnerReporter.java new file mode 100755 index 0000000000..15cdccc984 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/container/report/ProcessInnerReporter.java @@ -0,0 +1,17 @@ +package com.alibaba.datax.core.statistics.container.report; + +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.LocalTGCommunicationManager; + +public class ProcessInnerReporter extends AbstractReporter { + + @Override + public void reportJobCommunication(Long jobId, Communication communication) { + // do nothing + } + + @Override + public void reportTGCommunication(Integer taskGroupId, Communication communication) { + LocalTGCommunicationManager.updateTaskGroupCommunication(taskGroupId, communication); + } +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/DefaultJobPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/DefaultJobPluginCollector.java new file mode 100755 index 0000000000..a9571bd44c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/DefaultJobPluginCollector.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.core.statistics.plugin; + +import com.alibaba.datax.common.plugin.JobPluginCollector; +import com.alibaba.datax.core.statistics.container.communicator.AbstractContainerCommunicator; +import com.alibaba.datax.core.statistics.communication.Communication; + +import java.util.List; +import java.util.Map; + +/** + * Created by jingxing on 14-9-9. 
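+ * Exposes the messages reported by all tasks: each call collects the current
+ * Communication from the container communicator and returns its message map.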
+ */ +public final class DefaultJobPluginCollector implements JobPluginCollector { + private AbstractContainerCommunicator jobCollector; + + public DefaultJobPluginCollector(AbstractContainerCommunicator containerCollector) { + this.jobCollector = containerCollector; + } + + @Override + public Map> getMessage() { + Communication totalCommunication = this.jobCollector.collect(); + return totalCommunication.getMessage(); + } + + @Override + public List getMessage(String key) { + Communication totalCommunication = this.jobCollector.collect(); + return totalCommunication.getMessage(key); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/AbstractTaskPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/AbstractTaskPluginCollector.java new file mode 100755 index 0000000000..ada9687f24 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/AbstractTaskPluginCollector.java @@ -0,0 +1,77 @@ +package com.alibaba.datax.core.statistics.plugin.task; + +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.FrameworkErrorCode; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Created by jingxing on 14-9-11. + */ +public abstract class AbstractTaskPluginCollector extends TaskPluginCollector { + private static final Logger LOG = LoggerFactory + .getLogger(AbstractTaskPluginCollector.class); + + private Communication communication; + + private Configuration configuration; + + private PluginType pluginType; + + public AbstractTaskPluginCollector(Configuration conf, Communication communication, + PluginType type) { + this.configuration = conf; + this.communication = communication; + this.pluginType = type; + } + + public Communication getCommunication() { + return communication; + } + + public Configuration getConfiguration() { + return configuration; + } + + public PluginType getPluginType() { + return pluginType; + } + + @Override + final public void collectMessage(String key, String value) { + this.communication.addMessage(key, value); + } + + @Override + public void collectDirtyRecord(Record dirtyRecord, Throwable t, + String errorMessage) { + + if (null == dirtyRecord) { + LOG.warn("脏数据record=null."); + return; + } + + if (this.pluginType.equals(PluginType.READER)) { + this.communication.increaseCounter( + CommunicationTool.READ_FAILED_RECORDS, 1); + this.communication.increaseCounter( + CommunicationTool.READ_FAILED_BYTES, dirtyRecord.getByteSize()); + } else if (this.pluginType.equals(PluginType.WRITER)) { + this.communication.increaseCounter( + CommunicationTool.WRITE_FAILED_RECORDS, 1); + this.communication.increaseCounter( + CommunicationTool.WRITE_FAILED_BYTES, dirtyRecord.getByteSize()); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format("不知道的插件类型[%s].", this.pluginType)); + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/HttpPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/HttpPluginCollector.java new file mode 100755 index 0000000000..e479fe2c1e --- 
/dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/HttpPluginCollector.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.core.statistics.plugin.task; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; + +/** + * Created by jingxing on 14-9-9. + */ +public class HttpPluginCollector extends AbstractTaskPluginCollector { + public HttpPluginCollector(Configuration configuration, Communication Communication, + PluginType type) { + super(configuration, Communication, type); + } + + @Override + public void collectDirtyRecord(Record dirtyRecord, Throwable t, + String errorMessage) { + super.collectDirtyRecord(dirtyRecord, t, errorMessage); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/StdoutPluginCollector.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/StdoutPluginCollector.java new file mode 100755 index 0000000000..8b2a837811 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/StdoutPluginCollector.java @@ -0,0 +1,74 @@ +package com.alibaba.datax.core.statistics.plugin.task; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.statistics.plugin.task.util.DirtyRecord; +import com.alibaba.fastjson.JSON; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.HashMap; +import java.util.Map; +import java.util.concurrent.atomic.AtomicInteger; + +/** + * Created by jingxing on 14-9-9. 
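+ * Logs dirty records (serialized as JSON) up to a configurable maximum, then
+ * delegates the dirty-record counting to AbstractTaskPluginCollector.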
+ */ +public class StdoutPluginCollector extends AbstractTaskPluginCollector { + private static final Logger LOG = LoggerFactory + .getLogger(StdoutPluginCollector.class); + + private static final int DEFAULT_MAX_DIRTYNUM = 128; + + private AtomicInteger maxLogNum = new AtomicInteger(0); + + private AtomicInteger currentLogNum = new AtomicInteger(0); + + public StdoutPluginCollector(Configuration configuration, Communication communication, + PluginType type) { + super(configuration, communication, type); + maxLogNum = new AtomicInteger( + configuration.getInt( + CoreConstant.DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_MAXDIRTYNUM, + DEFAULT_MAX_DIRTYNUM)); + } + + private String formatDirty(final Record dirty, final Throwable t, + final String msg) { + Map msgGroup = new HashMap(); + + msgGroup.put("type", super.getPluginType().toString()); + if (StringUtils.isNotBlank(msg)) { + msgGroup.put("message", msg); + } + if (null != t && StringUtils.isNotBlank(t.getMessage())) { + msgGroup.put("exception", t.getMessage()); + } + if (null != dirty) { + msgGroup.put("record", DirtyRecord.asDirtyRecord(dirty) + .getColumns()); + } + + return JSON.toJSONString(msgGroup); + } + + @Override + public void collectDirtyRecord(Record dirtyRecord, Throwable t, + String errorMessage) { + int logNum = currentLogNum.getAndIncrement(); + if(logNum==0 && t!=null){ + LOG.error("", t); + } + if (maxLogNum.intValue() < 0 || currentLogNum.intValue() < maxLogNum.intValue()) { + LOG.error("脏数据: \n" + + this.formatDirty(dirtyRecord, t, errorMessage)); + } + + super.collectDirtyRecord(dirtyRecord, t, errorMessage); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/util/DirtyRecord.java b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/util/DirtyRecord.java new file mode 100755 index 0000000000..fdc5d8215d --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/statistics/plugin/task/util/DirtyRecord.java @@ -0,0 +1,151 @@ +package com.alibaba.datax.core.statistics.plugin.task.util; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.fastjson.JSON; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.util.ArrayList; +import java.util.Date; +import java.util.List; + +public class DirtyRecord implements Record { + private List columns = new ArrayList(); + + public static DirtyRecord asDirtyRecord(final Record record) { + DirtyRecord result = new DirtyRecord(); + for (int i = 0; i < record.getColumnNumber(); i++) { + result.addColumn(record.getColumn(i)); + } + + return result; + } + + @Override + public void addColumn(Column column) { + this.columns.add( + DirtyColumn.asDirtyColumn(column, this.columns.size())); + } + + @Override + public String toString() { + return JSON.toJSONString(this.columns); + } + + @Override + public void setColumn(int i, Column column) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Column getColumn(int i) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public int getColumnNumber() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public int getByteSize() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + 
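+    // As with the accessors above, size is not meaningful for a logging-only copy of a record.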
@Override + public int getMemorySize() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + public List getColumns() { + return columns; + } + + public void setColumns(List columns) { + this.columns = columns; + } + +} + +class DirtyColumn extends Column { + private int index; + + public static Column asDirtyColumn(final Column column, int index) { + return new DirtyColumn(column, index); + } + + private DirtyColumn(Column column, int index) { + this(null == column ? null : column.getRawData(), + null == column ? Column.Type.NULL : column.getType(), + null == column ? 0 : column.getByteSize(), index); + } + + public int getIndex() { + return index; + } + + public void setIndex(int index) { + this.index = index; + } + + @Override + public Long asLong() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Double asDouble() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public String asString() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Date asDate() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public byte[] asBytes() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public Boolean asBoolean() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public BigDecimal asBigDecimal() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + @Override + public BigInteger asBigInteger() { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + "该方法不支持!"); + } + + private DirtyColumn(Object object, Type type, int byteSize, int index) { + super(object, type, byteSize); + this.setIndex(index); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskGroupContainer.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskGroupContainer.java new file mode 100755 index 0000000000..c30c94d9b3 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskGroupContainer.java @@ -0,0 +1,567 @@ +package com.alibaba.datax.core.taskgroup; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.statistics.VMInfo; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.AbstractContainer; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.statistics.container.communicator.taskgroup.StandaloneTGContainerCommunicator; +import com.alibaba.datax.core.statistics.plugin.task.AbstractTaskPluginCollector; +import com.alibaba.datax.core.taskgroup.runner.AbstractRunner; +import com.alibaba.datax.core.taskgroup.runner.ReaderRunner; +import com.alibaba.datax.core.taskgroup.runner.WriterRunner; +import com.alibaba.datax.core.transport.channel.Channel; +import 
com.alibaba.datax.core.transport.exchanger.BufferedRecordExchanger; +import com.alibaba.datax.core.transport.exchanger.BufferedRecordTransformerExchanger; +import com.alibaba.datax.core.transport.transformer.TransformerExecution; +import com.alibaba.datax.core.util.ClassUtil; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.TransformerUtil; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.LoadUtil; +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import com.alibaba.fastjson.JSON; +import org.apache.commons.lang3.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; + +public class TaskGroupContainer extends AbstractContainer { + private static final Logger LOG = LoggerFactory + .getLogger(TaskGroupContainer.class); + + /** + * 当前taskGroup所属jobId + */ + private long jobId; + + /** + * 当前taskGroupId + */ + private int taskGroupId; + + /** + * 使用的channel类 + */ + private String channelClazz; + + /** + * task收集器使用的类 + */ + private String taskCollectorClass; + + private TaskMonitor taskMonitor = TaskMonitor.getInstance(); + + public TaskGroupContainer(Configuration configuration) { + super(configuration); + + initCommunicator(configuration); + + this.jobId = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_JOB_ID); + this.taskGroupId = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + + this.channelClazz = this.configuration.getString( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CLASS); + this.taskCollectorClass = this.configuration.getString( + CoreConstant.DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_TASKCLASS); + } + + private void initCommunicator(Configuration configuration) { + super.setContainerCommunicator(new StandaloneTGContainerCommunicator(configuration)); + + } + + public long getJobId() { + return jobId; + } + + public int getTaskGroupId() { + return taskGroupId; + } + + @Override + public void start() { + try { + /** + * 状态check时间间隔,较短,可以把任务及时分发到对应channel中 + */ + int sleepIntervalInMillSec = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_SLEEPINTERVAL, 100); + /** + * 状态汇报时间间隔,稍长,避免大量汇报 + */ + long reportIntervalInMillSec = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_REPORTINTERVAL, + 10000); + /** + * 2分钟汇报一次性能统计 + */ + + // 获取channel数目 + int channelNumber = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL); + + int taskMaxRetryTimes = this.configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXRETRYTIMES, 1); + + long taskRetryIntervalInMsec = this.configuration.getLong( + CoreConstant.DATAX_CORE_CONTAINER_TASK_FAILOVER_RETRYINTERVALINMSEC, 10000); + + long taskMaxWaitInMsec = this.configuration.getLong(CoreConstant.DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXWAITINMSEC, 60000); + + List taskConfigs = this.configuration + .getListConfiguration(CoreConstant.DATAX_JOB_CONTENT); + + if(LOG.isDebugEnabled()) { + LOG.debug("taskGroup[{}]'s task configs[{}]", this.taskGroupId, + JSON.toJSONString(taskConfigs)); + } + + int taskCountInThisTaskGroup = taskConfigs.size(); + LOG.info(String.format( + "taskGroupId=[%d] start [%d] channels for [%d] tasks.", + this.taskGroupId, channelNumber, taskCountInThisTaskGroup)); + + this.containerCommunicator.registerCommunication(taskConfigs); + + Map taskConfigMap = buildTaskConfigMap(taskConfigs); //taskId与task配置 + List taskQueue = 
buildRemainTasks(taskConfigs); //待运行task列表 + Map taskFailedExecutorMap = new HashMap(); //taskId与上次失败实例 + List runTasks = new ArrayList(channelNumber); //正在运行task + Map taskStartTimeMap = new HashMap(); //任务开始时间 + + long lastReportTimeStamp = 0; + Communication lastTaskGroupContainerCommunication = new Communication(); + + while (true) { + //1.判断task状态 + boolean failedOrKilled = false; + Map communicationMap = containerCommunicator.getCommunicationMap(); + for(Map.Entry entry : communicationMap.entrySet()){ + Integer taskId = entry.getKey(); + Communication taskCommunication = entry.getValue(); + if(!taskCommunication.isFinished()){ + continue; + } + TaskExecutor taskExecutor = removeTask(runTasks, taskId); + + //上面从runTasks里移除了,因此对应在monitor里移除 + taskMonitor.removeTask(taskId); + + //失败,看task是否支持failover,重试次数未超过最大限制 + if(taskCommunication.getState() == State.FAILED){ + taskFailedExecutorMap.put(taskId, taskExecutor); + if(taskExecutor.supportFailOver() && taskExecutor.getAttemptCount() < taskMaxRetryTimes){ + taskExecutor.shutdown(); //关闭老的executor + containerCommunicator.resetCommunication(taskId); //将task的状态重置 + Configuration taskConfig = taskConfigMap.get(taskId); + taskQueue.add(taskConfig); //重新加入任务列表 + }else{ + failedOrKilled = true; + break; + } + }else if(taskCommunication.getState() == State.KILLED){ + failedOrKilled = true; + break; + }else if(taskCommunication.getState() == State.SUCCEEDED){ + Long taskStartTime = taskStartTimeMap.get(taskId); + if(taskStartTime != null){ + Long usedTime = System.currentTimeMillis() - taskStartTime; + LOG.info("taskGroup[{}] taskId[{}] is successed, used[{}]ms", + this.taskGroupId, taskId, usedTime); + //usedTime*1000*1000 转换成PerfRecord记录的ns,这里主要是简单登记,进行最长任务的打印。因此增加特定静态方法 + PerfRecord.addPerfRecord(taskGroupId, taskId, PerfRecord.PHASE.TASK_TOTAL,taskStartTime, usedTime * 1000L * 1000L); + taskStartTimeMap.remove(taskId); + taskConfigMap.remove(taskId); + } + } + } + + // 2.发现该taskGroup下taskExecutor的总状态失败则汇报错误 + if (failedOrKilled) { + lastTaskGroupContainerCommunication = reportTaskGroupCommunication( + lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_RUNTIME_ERROR, lastTaskGroupContainerCommunication.getThrowable()); + } + + //3.有任务未执行,且正在运行的任务数小于最大通道限制 + Iterator iterator = taskQueue.iterator(); + while(iterator.hasNext() && runTasks.size() < channelNumber){ + Configuration taskConfig = iterator.next(); + Integer taskId = taskConfig.getInt(CoreConstant.TASK_ID); + int attemptCount = 1; + TaskExecutor lastExecutor = taskFailedExecutorMap.get(taskId); + if(lastExecutor!=null){ + attemptCount = lastExecutor.getAttemptCount() + 1; + long now = System.currentTimeMillis(); + long failedTime = lastExecutor.getTimeStamp(); + if(now - failedTime < taskRetryIntervalInMsec){ //未到等待时间,继续留在队列 + continue; + } + if(!lastExecutor.isShutdown()){ //上次失败的task仍未结束 + if(now - failedTime > taskMaxWaitInMsec){ + markCommunicationFailed(taskId); + reportTaskGroupCommunication(lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + throw DataXException.asDataXException(CommonErrorCode.WAIT_TIME_EXCEED, "task failover等待超时"); + }else{ + lastExecutor.shutdown(); //再次尝试关闭 + continue; + } + }else{ + LOG.info("taskGroup[{}] taskId[{}] attemptCount[{}] has already shutdown", + this.taskGroupId, taskId, lastExecutor.getAttemptCount()); + } + } + Configuration taskConfigForRun = taskMaxRetryTimes > 1 ? 
taskConfig.clone() : taskConfig; + TaskExecutor taskExecutor = new TaskExecutor(taskConfigForRun, attemptCount); + taskStartTimeMap.put(taskId, System.currentTimeMillis()); + taskExecutor.doStart(); + + iterator.remove(); + runTasks.add(taskExecutor); + + //上面,增加task到runTasks列表,因此在monitor里注册。 + taskMonitor.registerTask(taskId, this.containerCommunicator.getCommunication(taskId)); + + taskFailedExecutorMap.remove(taskId); + LOG.info("taskGroup[{}] taskId[{}] attemptCount[{}] is started", + this.taskGroupId, taskId, attemptCount); + } + + //4.任务列表为空,executor已结束, 搜集状态为success--->成功 + if (taskQueue.isEmpty() && isAllTaskDone(runTasks) && containerCommunicator.collectState() == State.SUCCEEDED) { + // 成功的情况下,也需要汇报一次。否则在任务结束非常快的情况下,采集的信息将会不准确 + lastTaskGroupContainerCommunication = reportTaskGroupCommunication( + lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + LOG.info("taskGroup[{}] completed it's tasks.", this.taskGroupId); + break; + } + + // 5.如果当前时间已经超出汇报时间的interval,那么我们需要马上汇报 + long now = System.currentTimeMillis(); + if (now - lastReportTimeStamp > reportIntervalInMillSec) { + lastTaskGroupContainerCommunication = reportTaskGroupCommunication( + lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + lastReportTimeStamp = now; + + //taskMonitor对于正在运行的task,每reportIntervalInMillSec进行检查 + for(TaskExecutor taskExecutor:runTasks){ + taskMonitor.report(taskExecutor.getTaskId(),this.containerCommunicator.getCommunication(taskExecutor.getTaskId())); + } + + } + + Thread.sleep(sleepIntervalInMillSec); + } + + //6.最后还要汇报一次 + reportTaskGroupCommunication(lastTaskGroupContainerCommunication, taskCountInThisTaskGroup); + + + } catch (Throwable e) { + Communication nowTaskGroupContainerCommunication = this.containerCommunicator.collect(); + + if (nowTaskGroupContainerCommunication.getThrowable() == null) { + nowTaskGroupContainerCommunication.setThrowable(e); + } + nowTaskGroupContainerCommunication.setState(State.FAILED); + this.containerCommunicator.report(nowTaskGroupContainerCommunication); + + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + }finally { + if(!PerfTrace.getInstance().isJob()){ + //最后打印cpu的平均消耗,GC的统计 + VMInfo vmInfo = VMInfo.getVmInfo(); + if (vmInfo != null) { + vmInfo.getDelta(false); + LOG.info(vmInfo.totalString()); + } + + LOG.info(PerfTrace.getInstance().summarizeNoException()); + } + } + } + + private Map buildTaskConfigMap(List configurations){ + Map map = new HashMap(); + for(Configuration taskConfig : configurations){ + int taskId = taskConfig.getInt(CoreConstant.TASK_ID); + map.put(taskId, taskConfig); + } + return map; + } + + private List buildRemainTasks(List configurations){ + List remainTasks = new LinkedList(); + for(Configuration taskConfig : configurations){ + remainTasks.add(taskConfig); + } + return remainTasks; + } + + private TaskExecutor removeTask(List taskList, int taskId){ + Iterator iterator = taskList.iterator(); + while(iterator.hasNext()){ + TaskExecutor taskExecutor = iterator.next(); + if(taskExecutor.getTaskId() == taskId){ + iterator.remove(); + return taskExecutor; + } + } + return null; + } + + private boolean isAllTaskDone(List taskList){ + for(TaskExecutor taskExecutor : taskList){ + if(!taskExecutor.isTaskFinished()){ + return false; + } + } + return true; + } + + private Communication reportTaskGroupCommunication(Communication lastTaskGroupContainerCommunication, int taskCount){ + Communication nowTaskGroupContainerCommunication = this.containerCommunicator.collect(); + 
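+        // Refresh the timestamp before diffing against the last report so that
+        // CommunicationTool computes byte/record speeds over the elapsed interval.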
nowTaskGroupContainerCommunication.setTimestamp(System.currentTimeMillis()); + Communication reportCommunication = CommunicationTool.getReportCommunication(nowTaskGroupContainerCommunication, + lastTaskGroupContainerCommunication, taskCount); + this.containerCommunicator.report(reportCommunication); + return reportCommunication; + } + + private void markCommunicationFailed(Integer taskId){ + Communication communication = containerCommunicator.getCommunication(taskId); + communication.setState(State.FAILED); + } + + /** + * TaskExecutor是一个完整task的执行器 + * 其中包括1:1的reader和writer + */ + class TaskExecutor { + private Configuration taskConfig; + + private int taskId; + + private int attemptCount; + + private Channel channel; + + private Thread readerThread; + + private Thread writerThread; + + private ReaderRunner readerRunner; + + private WriterRunner writerRunner; + + /** + * 该处的taskCommunication在多处用到: + * 1. channel + * 2. readerRunner和writerRunner + * 3. reader和writer的taskPluginCollector + */ + private Communication taskCommunication; + + public TaskExecutor(Configuration taskConf, int attemptCount) { + // 获取该taskExecutor的配置 + this.taskConfig = taskConf; + Validate.isTrue(null != this.taskConfig.getConfiguration(CoreConstant.JOB_READER) + && null != this.taskConfig.getConfiguration(CoreConstant.JOB_WRITER), + "[reader|writer]的插件参数不能为空!"); + + // 得到taskId + this.taskId = this.taskConfig.getInt(CoreConstant.TASK_ID); + this.attemptCount = attemptCount; + + /** + * 由taskId得到该taskExecutor的Communication + * 要传给readerRunner和writerRunner,同时要传给channel作统计用 + */ + this.taskCommunication = containerCommunicator + .getCommunication(taskId); + Validate.notNull(this.taskCommunication, + String.format("taskId[%d]的Communication没有注册过", taskId)); + this.channel = ClassUtil.instantiate(channelClazz, + Channel.class, configuration); + this.channel.setCommunication(this.taskCommunication); + + /** + * 获取transformer的参数 + */ + + List transformerInfoExecs = TransformerUtil.buildTransformerInfo(taskConfig); + + /** + * 生成writerThread + */ + writerRunner = (WriterRunner) generateRunner(PluginType.WRITER); + this.writerThread = new Thread(writerRunner, + String.format("%d-%d-%d-writer", + jobId, taskGroupId, this.taskId)); + //通过设置thread的contextClassLoader,即可实现同步和主程序不通的加载器 + this.writerThread.setContextClassLoader(LoadUtil.getJarLoader( + PluginType.WRITER, this.taskConfig.getString( + CoreConstant.JOB_WRITER_NAME))); + + /** + * 生成readerThread + */ + readerRunner = (ReaderRunner) generateRunner(PluginType.READER,transformerInfoExecs); + this.readerThread = new Thread(readerRunner, + String.format("%d-%d-%d-reader", + jobId, taskGroupId, this.taskId)); + /** + * 通过设置thread的contextClassLoader,即可实现同步和主程序不通的加载器 + */ + this.readerThread.setContextClassLoader(LoadUtil.getJarLoader( + PluginType.READER, this.taskConfig.getString( + CoreConstant.JOB_READER_NAME))); + } + + public void doStart() { + this.writerThread.start(); + + // reader没有起来,writer不可能结束 + if (!this.writerThread.isAlive() || this.taskCommunication.getState() == State.FAILED) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + this.taskCommunication.getThrowable()); + } + + this.readerThread.start(); + + // 这里reader可能很快结束 + if (!this.readerThread.isAlive() && this.taskCommunication.getState() == State.FAILED) { + // 这里有可能出现Reader线上启动即挂情况 对于这类情况 需要立刻抛出异常 + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + this.taskCommunication.getThrowable()); + } + + } + + + private AbstractRunner 
generateRunner(PluginType pluginType) { + return generateRunner(pluginType, null); + } + + private AbstractRunner generateRunner(PluginType pluginType, List transformerInfoExecs) { + AbstractRunner newRunner = null; + TaskPluginCollector pluginCollector; + + switch (pluginType) { + case READER: + newRunner = LoadUtil.loadPluginRunner(pluginType, + this.taskConfig.getString(CoreConstant.JOB_READER_NAME)); + newRunner.setJobConf(this.taskConfig.getConfiguration( + CoreConstant.JOB_READER_PARAMETER)); + + pluginCollector = ClassUtil.instantiate( + taskCollectorClass, AbstractTaskPluginCollector.class, + configuration, this.taskCommunication, + PluginType.READER); + + RecordSender recordSender; + if (transformerInfoExecs != null && transformerInfoExecs.size() > 0) { + recordSender = new BufferedRecordTransformerExchanger(taskGroupId, this.taskId, this.channel,this.taskCommunication ,pluginCollector, transformerInfoExecs); + } else { + recordSender = new BufferedRecordExchanger(this.channel, pluginCollector); + } + + ((ReaderRunner) newRunner).setRecordSender(recordSender); + + /** + * 设置taskPlugin的collector,用来处理脏数据和job/task通信 + */ + newRunner.setTaskPluginCollector(pluginCollector); + break; + case WRITER: + newRunner = LoadUtil.loadPluginRunner(pluginType, + this.taskConfig.getString(CoreConstant.JOB_WRITER_NAME)); + newRunner.setJobConf(this.taskConfig + .getConfiguration(CoreConstant.JOB_WRITER_PARAMETER)); + + pluginCollector = ClassUtil.instantiate( + taskCollectorClass, AbstractTaskPluginCollector.class, + configuration, this.taskCommunication, + PluginType.WRITER); + ((WriterRunner) newRunner).setRecordReceiver(new BufferedRecordExchanger( + this.channel, pluginCollector)); + /** + * 设置taskPlugin的collector,用来处理脏数据和job/task通信 + */ + newRunner.setTaskPluginCollector(pluginCollector); + break; + default: + throw DataXException.asDataXException(FrameworkErrorCode.ARGUMENT_ERROR, "Cant generateRunner for:" + pluginType); + } + + newRunner.setTaskGroupId(taskGroupId); + newRunner.setTaskId(this.taskId); + newRunner.setRunnerCommunication(this.taskCommunication); + + return newRunner; + } + + // 检查任务是否结束 + private boolean isTaskFinished() { + // 如果reader 或 writer没有完成工作,那么直接返回工作没有完成 + if (readerThread.isAlive() || writerThread.isAlive()) { + return false; + } + + if(taskCommunication==null || !taskCommunication.isFinished()){ + return false; + } + + return true; + } + + private int getTaskId(){ + return taskId; + } + + private long getTimeStamp(){ + return taskCommunication.getTimestamp(); + } + + private int getAttemptCount(){ + return attemptCount; + } + + private boolean supportFailOver(){ + return writerRunner.supportFailOver(); + } + + private void shutdown(){ + writerRunner.shutdown(); + readerRunner.shutdown(); + if(writerThread.isAlive()){ + writerThread.interrupt(); + } + if(readerThread.isAlive()){ + readerThread.interrupt(); + } + } + + private boolean isShutdown(){ + return !readerThread.isAlive() && !writerThread.isAlive(); + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskMonitor.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskMonitor.java new file mode 100644 index 0000000000..6dd7e674e1 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/TaskMonitor.java @@ -0,0 +1,113 @@ +package com.alibaba.datax.core.taskgroup; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.statistics.communication.Communication; +import 
com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.concurrent.ConcurrentHashMap; + +/** + * Created by liqiang on 15/7/23. + */ +public class TaskMonitor { + + private static final Logger LOG = LoggerFactory.getLogger(TaskMonitor.class); + private static final TaskMonitor instance = new TaskMonitor(); + private static long EXPIRED_TIME = 172800 * 1000; + + private ConcurrentHashMap tasks = new ConcurrentHashMap(); + + private TaskMonitor() { + } + + public static TaskMonitor getInstance() { + return instance; + } + + public void registerTask(Integer taskid, Communication communication) { + //如果task已经finish,直接返回 + if (communication.isFinished()) { + return; + } + tasks.putIfAbsent(taskid, new TaskCommunication(taskid, communication)); + } + + public void removeTask(Integer taskid) { + tasks.remove(taskid); + } + + public void report(Integer taskid, Communication communication) { + //如果task已经finish,直接返回 + if (communication.isFinished()) { + return; + } + if (!tasks.containsKey(taskid)) { + LOG.warn("unexpected: taskid({}) missed.", taskid); + tasks.putIfAbsent(taskid, new TaskCommunication(taskid, communication)); + } else { + tasks.get(taskid).report(communication); + } + } + + public TaskCommunication getTaskCommunication(Integer taskid) { + return tasks.get(taskid); + } + + + public static class TaskCommunication { + private Integer taskid; + //记录最后更新的communication + private long lastAllReadRecords = -1; + //只有第一次,或者统计变更时才会更新TS + private long lastUpdateComunicationTS; + private long ttl; + + private TaskCommunication(Integer taskid, Communication communication) { + this.taskid = taskid; + lastAllReadRecords = CommunicationTool.getTotalReadRecords(communication); + ttl = System.currentTimeMillis(); + lastUpdateComunicationTS = ttl; + } + + public void report(Communication communication) { + + ttl = System.currentTimeMillis(); + //采集的数量增长,则变更当前记录, 优先判断这个条件,因为目的是不卡住,而不是expired + if (CommunicationTool.getTotalReadRecords(communication) > lastAllReadRecords) { + lastAllReadRecords = CommunicationTool.getTotalReadRecords(communication); + lastUpdateComunicationTS = ttl; + } else if (isExpired(lastUpdateComunicationTS)) { + communication.setState(State.FAILED); + communication.setTimestamp(ttl); + communication.setThrowable(DataXException.asDataXException(CommonErrorCode.TASK_HUNG_EXPIRED, + String.format("task(%s) hung expired [allReadRecord(%s), elased(%s)]", taskid, lastAllReadRecords, (ttl - lastUpdateComunicationTS)))); + } + + + } + + private boolean isExpired(long lastUpdateComunicationTS) { + return System.currentTimeMillis() - lastUpdateComunicationTS > EXPIRED_TIME; + } + + public Integer getTaskid() { + return taskid; + } + + public long getLastAllReadRecords() { + return lastAllReadRecords; + } + + public long getLastUpdateComunicationTS() { + return lastUpdateComunicationTS; + } + + public long getTtl() { + return ttl; + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/AbstractRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/AbstractRunner.java new file mode 100755 index 0000000000..fd8f605ca0 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/AbstractRunner.java @@ -0,0 +1,115 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import 
com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.dataxservice.face.domain.enums.State; +import org.apache.commons.lang.Validate; + +public abstract class AbstractRunner { + private AbstractTaskPlugin plugin; + + private Configuration jobConf; + + private Communication runnerCommunication; + + private int taskGroupId; + + private int taskId; + + public AbstractRunner(AbstractTaskPlugin taskPlugin) { + this.plugin = taskPlugin; + } + + public void destroy() { + if (this.plugin != null) { + this.plugin.destroy(); + } + } + + public State getRunnerState() { + return this.runnerCommunication.getState(); + } + + public AbstractTaskPlugin getPlugin() { + return plugin; + } + + public void setPlugin(AbstractTaskPlugin plugin) { + this.plugin = plugin; + } + + public Configuration getJobConf() { + return jobConf; + } + + public void setJobConf(Configuration jobConf) { + this.jobConf = jobConf; + this.plugin.setPluginJobConf(jobConf); + } + + public void setTaskPluginCollector(TaskPluginCollector pluginCollector) { + this.plugin.setTaskPluginCollector(pluginCollector); + } + + private void mark(State state) { + this.runnerCommunication.setState(state); + if (state == State.SUCCEEDED) { + // 对 stage + 1 + this.runnerCommunication.setLongCounter(CommunicationTool.STAGE, + this.runnerCommunication.getLongCounter(CommunicationTool.STAGE) + 1); + } + } + + public void markRun() { + mark(State.RUNNING); + } + + public void markSuccess() { + mark(State.SUCCEEDED); + } + + public void markFail(final Throwable throwable) { + mark(State.FAILED); + this.runnerCommunication.setTimestamp(System.currentTimeMillis()); + this.runnerCommunication.setThrowable(throwable); + } + + /** + * @param taskGroupId the taskGroupId to set + */ + public void setTaskGroupId(int taskGroupId) { + this.taskGroupId = taskGroupId; + this.plugin.setTaskGroupId(taskGroupId); + } + + /** + * @return the taskGroupId + */ + public int getTaskGroupId() { + return taskGroupId; + } + + public int getTaskId() { + return taskId; + } + + public void setTaskId(int taskId) { + this.taskId = taskId; + this.plugin.setTaskId(taskId); + } + + public void setRunnerCommunication(final Communication runnerCommunication) { + Validate.notNull(runnerCommunication, + "插件的Communication不能为空"); + this.runnerCommunication = runnerCommunication; + } + + public Communication getRunnerCommunication() { + return runnerCommunication; + } + + public abstract void shutdown(); +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/ReaderRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/ReaderRunner.java new file mode 100755 index 0000000000..91961d8dc6 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/ReaderRunner.java @@ -0,0 +1,95 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Created by jingxing on 14-9-1. + *
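+ * Drives the Reader side of one task slice: init -> prepare -> startRead -> post, each phase wrapped
+ * in a PerfRecord. It deliberately never calls markSuccess(); only the WriterRunner marks the task
+ * successful, so a fast-finishing reader cannot flag success before the writer has drained the channel.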

+ * 单个slice的reader执行调用 + */ +public class ReaderRunner extends AbstractRunner implements Runnable { + + private static final Logger LOG = LoggerFactory + .getLogger(ReaderRunner.class); + + private RecordSender recordSender; + + public void setRecordSender(RecordSender recordSender) { + this.recordSender = recordSender; + } + + public ReaderRunner(AbstractTaskPlugin abstractTaskPlugin) { + super(abstractTaskPlugin); + } + + @Override + public void run() { + assert null != this.recordSender; + + Reader.Task taskReader = (Reader.Task) this.getPlugin(); + + //统计waitWriterTime,并且在finally才end。 + PerfRecord channelWaitWrite = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WAIT_WRITE_TIME); + try { + channelWaitWrite.start(); + + LOG.debug("task reader starts to do init ..."); + PerfRecord initPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_INIT); + initPerfRecord.start(); + taskReader.init(); + initPerfRecord.end(); + + LOG.debug("task reader starts to do prepare ..."); + PerfRecord preparePerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_PREPARE); + preparePerfRecord.start(); + taskReader.prepare(); + preparePerfRecord.end(); + + LOG.debug("task reader starts to read ..."); + PerfRecord dataPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_DATA); + dataPerfRecord.start(); + taskReader.startRead(recordSender); + recordSender.terminate(); + + dataPerfRecord.addCount(CommunicationTool.getTotalReadRecords(super.getRunnerCommunication())); + dataPerfRecord.addSize(CommunicationTool.getTotalReadBytes(super.getRunnerCommunication())); + dataPerfRecord.end(); + + LOG.debug("task reader starts to do post ..."); + PerfRecord postPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_POST); + postPerfRecord.start(); + taskReader.post(); + postPerfRecord.end(); + // automatic flush + // super.markSuccess(); 这里不能标记为成功,成功的标志由 writerRunner 来标志(否则可能导致 reader 先结束,而 writer 还没有结束的严重 bug) + } catch (Throwable e) { + LOG.error("Reader runner Received Exceptions:", e); + super.markFail(e); + } finally { + LOG.debug("task reader starts to do destroy ..."); + PerfRecord desPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.READ_TASK_DESTROY); + desPerfRecord.start(); + super.destroy(); + desPerfRecord.end(); + + channelWaitWrite.end(super.getRunnerCommunication().getLongCounter(CommunicationTool.WAIT_WRITER_TIME)); + + long transformerUsedTime = super.getRunnerCommunication().getLongCounter(CommunicationTool.TRANSFORMER_USED_TIME); + if (transformerUsedTime > 0) { + PerfRecord transformerRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.TRANSFORMER_TIME); + transformerRecord.start(); + transformerRecord.end(transformerUsedTime); + } + } + } + + public void shutdown(){ + recordSender.shutdown(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/TaskGroupContainerRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/TaskGroupContainerRunner.java new file mode 100755 index 0000000000..8d8663b815 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/TaskGroupContainerRunner.java @@ -0,0 +1,44 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.taskgroup.TaskGroupContainer; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import 
com.alibaba.datax.dataxservice.face.domain.enums.State; + +public class TaskGroupContainerRunner implements Runnable { + + private TaskGroupContainer taskGroupContainer; + + private State state; + + public TaskGroupContainerRunner(TaskGroupContainer taskGroup) { + this.taskGroupContainer = taskGroup; + this.state = State.SUCCEEDED; + } + + @Override + public void run() { + try { + Thread.currentThread().setName( + String.format("taskGroup-%d", this.taskGroupContainer.getTaskGroupId())); + this.taskGroupContainer.start(); + this.state = State.SUCCEEDED; + } catch (Throwable e) { + this.state = State.FAILED; + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + + public TaskGroupContainer getTaskGroupContainer() { + return taskGroupContainer; + } + + public State getState() { + return state; + } + + public void setState(State state) { + this.state = state; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/WriterRunner.java b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/WriterRunner.java new file mode 100755 index 0000000000..8fa5d68be7 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/taskgroup/runner/WriterRunner.java @@ -0,0 +1,90 @@ +package com.alibaba.datax.core.taskgroup.runner; + +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import org.apache.commons.lang3.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Created by jingxing on 14-9-1. + *
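+ * Drives the Writer side of one task slice: init -> prepare -> startWrite -> post, each phase wrapped
+ * in a PerfRecord. Unlike ReaderRunner it calls markSuccess() after post(), which is what finally
+ * marks the whole task as finished successfully.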

+ * 单个slice的writer执行调用 + */ +public class WriterRunner extends AbstractRunner implements Runnable { + + private static final Logger LOG = LoggerFactory + .getLogger(WriterRunner.class); + + private RecordReceiver recordReceiver; + + public void setRecordReceiver(RecordReceiver receiver) { + this.recordReceiver = receiver; + } + + public WriterRunner(AbstractTaskPlugin abstractTaskPlugin) { + super(abstractTaskPlugin); + } + + @Override + public void run() { + Validate.isTrue(this.recordReceiver != null); + + Writer.Task taskWriter = (Writer.Task) this.getPlugin(); + //统计waitReadTime,并且在finally end + PerfRecord channelWaitRead = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WAIT_READ_TIME); + try { + channelWaitRead.start(); + LOG.debug("task writer starts to do init ..."); + PerfRecord initPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_INIT); + initPerfRecord.start(); + taskWriter.init(); + initPerfRecord.end(); + + LOG.debug("task writer starts to do prepare ..."); + PerfRecord preparePerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_PREPARE); + preparePerfRecord.start(); + taskWriter.prepare(); + preparePerfRecord.end(); + LOG.debug("task writer starts to write ..."); + + PerfRecord dataPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_DATA); + dataPerfRecord.start(); + taskWriter.startWrite(recordReceiver); + + dataPerfRecord.addCount(CommunicationTool.getTotalReadRecords(super.getRunnerCommunication())); + dataPerfRecord.addSize(CommunicationTool.getTotalReadBytes(super.getRunnerCommunication())); + dataPerfRecord.end(); + + LOG.debug("task writer starts to do post ..."); + PerfRecord postPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_POST); + postPerfRecord.start(); + taskWriter.post(); + postPerfRecord.end(); + + super.markSuccess(); + } catch (Throwable e) { + LOG.error("Writer Runner Received Exceptions:", e); + super.markFail(e); + } finally { + LOG.debug("task writer starts to do destroy ..."); + PerfRecord desPerfRecord = new PerfRecord(getTaskGroupId(), getTaskId(), PerfRecord.PHASE.WRITE_TASK_DESTROY); + desPerfRecord.start(); + super.destroy(); + desPerfRecord.end(); + channelWaitRead.end(super.getRunnerCommunication().getLongCounter(CommunicationTool.WAIT_READER_TIME)); + } + } + + public boolean supportFailOver(){ + Writer.Task taskWriter = (Writer.Task) this.getPlugin(); + return taskWriter.supportFailOver(); + } + + public void shutdown(){ + recordReceiver.shutdown(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/channel/Channel.java b/core/src/main/java/com/alibaba/datax/core/transport/channel/Channel.java new file mode 100755 index 0000000000..8d4f1f67de --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/channel/Channel.java @@ -0,0 +1,248 @@ +package com.alibaba.datax.core.transport.channel; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Collection; + +/** + * Created by jingxing on 14-8-25. + *
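+ * Both statistics and rate limiting live here: statPush() accumulates read counters and, when a byte
+ * or record speed limit is configured, sleeps just long enough per flowControlInterval to keep the
+ * measured throughput under byteSpeed / recordSpeed.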

+ * 统计和限速都在这里 + */ +public abstract class Channel { + + private static final Logger LOG = LoggerFactory.getLogger(Channel.class); + + protected int taskGroupId; + + protected int capacity; + + protected int byteCapacity; + + protected long byteSpeed; // bps: bytes/s + + protected long recordSpeed; // tps: records/s + + protected long flowControlInterval; + + protected volatile boolean isClosed = false; + + protected Configuration configuration = null; + + protected volatile long waitReaderTime = 0; + + protected volatile long waitWriterTime = 0; + + private static Boolean isFirstPrint = true; + + private Communication currentCommunication; + + private Communication lastCommunication = new Communication(); + + public Channel(final Configuration configuration) { + //channel的queue里默认record为1万条。原来为512条 + int capacity = configuration.getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY, 2048); + long byteSpeed = configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_BYTE, 1024 * 1024); + long recordSpeed = configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_SPEED_RECORD, 10000); + + if (capacity <= 0) { + throw new IllegalArgumentException(String.format( + "通道容量[%d]必须大于0.", capacity)); + } + + synchronized (isFirstPrint) { + if (isFirstPrint) { + Channel.LOG.info("Channel set byte_speed_limit to " + byteSpeed + + (byteSpeed <= 0 ? ", No bps activated." : ".")); + Channel.LOG.info("Channel set record_speed_limit to " + recordSpeed + + (recordSpeed <= 0 ? ", No tps activated." : ".")); + isFirstPrint = false; + } + } + + this.taskGroupId = configuration.getInt( + CoreConstant.DATAX_CORE_CONTAINER_TASKGROUP_ID); + this.capacity = capacity; + this.byteSpeed = byteSpeed; + this.recordSpeed = recordSpeed; + this.flowControlInterval = configuration.getLong( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_FLOWCONTROLINTERVAL, 1000); + //channel的queue默认大小为8M,原来为64M + this.byteCapacity = configuration.getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 8 * 1024 * 1024); + this.configuration = configuration; + } + + public void close() { + this.isClosed = true; + } + + public void open() { + this.isClosed = false; + } + + public boolean isClosed() { + return isClosed; + } + + public int getTaskGroupId() { + return this.taskGroupId; + } + + public int getCapacity() { + return capacity; + } + + public long getByteSpeed() { + return byteSpeed; + } + + public Configuration getConfiguration() { + return this.configuration; + } + + public void setCommunication(final Communication communication) { + this.currentCommunication = communication; + this.lastCommunication.reset(); + } + + public void push(final Record r) { + Validate.notNull(r, "record不能为空."); + this.doPush(r); + this.statPush(1L, r.getByteSize()); + } + + public void pushTerminate(final TerminateRecord r) { + Validate.notNull(r, "record不能为空."); + this.doPush(r); + +// // 对 stage + 1 +// currentCommunication.setLongCounter(CommunicationTool.STAGE, +// currentCommunication.getLongCounter(CommunicationTool.STAGE) + 1); + } + + public void pushAll(final Collection rs) { + Validate.notNull(rs); + Validate.noNullElements(rs); + this.doPushAll(rs); + this.statPush(rs.size(), this.getByteSize(rs)); + } + + public Record pull() { + Record record = this.doPull(); + this.statPull(1L, record.getByteSize()); + return record; + } + + public void pullAll(final Collection rs) { + Validate.notNull(rs); + this.doPullAll(rs); + this.statPull(rs.size(), this.getByteSize(rs)); + } + + protected abstract void 
doPush(Record r); + + protected abstract void doPushAll(Collection rs); + + protected abstract Record doPull(); + + protected abstract void doPullAll(Collection rs); + + public abstract int size(); + + public abstract boolean isEmpty(); + + public abstract void clear(); + + private long getByteSize(final Collection rs) { + long size = 0; + for (final Record each : rs) { + size += each.getByteSize(); + } + return size; + } + + private void statPush(long recordSize, long byteSize) { + currentCommunication.increaseCounter(CommunicationTool.READ_SUCCEED_RECORDS, + recordSize); + currentCommunication.increaseCounter(CommunicationTool.READ_SUCCEED_BYTES, + byteSize); + //在读的时候进行统计waitCounter即可,因为写(pull)的时候可能正在阻塞,但读的时候已经能读到这个阻塞的counter数 + + currentCommunication.setLongCounter(CommunicationTool.WAIT_READER_TIME, waitReaderTime); + currentCommunication.setLongCounter(CommunicationTool.WAIT_WRITER_TIME, waitWriterTime); + + boolean isChannelByteSpeedLimit = (this.byteSpeed > 0); + boolean isChannelRecordSpeedLimit = (this.recordSpeed > 0); + if (!isChannelByteSpeedLimit && !isChannelRecordSpeedLimit) { + return; + } + + long lastTimestamp = lastCommunication.getTimestamp(); + long nowTimestamp = System.currentTimeMillis(); + long interval = nowTimestamp - lastTimestamp; + if (interval - this.flowControlInterval >= 0) { + long byteLimitSleepTime = 0; + long recordLimitSleepTime = 0; + if (isChannelByteSpeedLimit) { + long currentByteSpeed = (CommunicationTool.getTotalReadBytes(currentCommunication) - + CommunicationTool.getTotalReadBytes(lastCommunication)) * 1000 / interval; + if (currentByteSpeed > this.byteSpeed) { + // 计算根据byteLimit得到的休眠时间 + byteLimitSleepTime = currentByteSpeed * interval / this.byteSpeed + - interval; + } + } + + if (isChannelRecordSpeedLimit) { + long currentRecordSpeed = (CommunicationTool.getTotalReadRecords(currentCommunication) - + CommunicationTool.getTotalReadRecords(lastCommunication)) * 1000 / interval; + if (currentRecordSpeed > this.recordSpeed) { + // 计算根据recordLimit得到的休眠时间 + recordLimitSleepTime = currentRecordSpeed * interval / this.recordSpeed + - interval; + } + } + + // 休眠时间取较大值 + long sleepTime = byteLimitSleepTime < recordLimitSleepTime ? 
+ recordLimitSleepTime : byteLimitSleepTime; + if (sleepTime > 0) { + try { + Thread.sleep(sleepTime); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + } + + lastCommunication.setLongCounter(CommunicationTool.READ_SUCCEED_BYTES, + currentCommunication.getLongCounter(CommunicationTool.READ_SUCCEED_BYTES)); + lastCommunication.setLongCounter(CommunicationTool.READ_FAILED_BYTES, + currentCommunication.getLongCounter(CommunicationTool.READ_FAILED_BYTES)); + lastCommunication.setLongCounter(CommunicationTool.READ_SUCCEED_RECORDS, + currentCommunication.getLongCounter(CommunicationTool.READ_SUCCEED_RECORDS)); + lastCommunication.setLongCounter(CommunicationTool.READ_FAILED_RECORDS, + currentCommunication.getLongCounter(CommunicationTool.READ_FAILED_RECORDS)); + lastCommunication.setTimestamp(nowTimestamp); + } + } + + private void statPull(long recordSize, long byteSize) { + currentCommunication.increaseCounter( + CommunicationTool.WRITE_RECEIVED_RECORDS, recordSize); + currentCommunication.increaseCounter( + CommunicationTool.WRITE_RECEIVED_BYTES, byteSize); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannel.java b/core/src/main/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannel.java new file mode 100755 index 0000000000..e49c7878c7 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/channel/memory/MemoryChannel.java @@ -0,0 +1,146 @@ +package com.alibaba.datax.core.transport.channel.memory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; + +import java.util.Collection; +import java.util.concurrent.ArrayBlockingQueue; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.locks.Condition; +import java.util.concurrent.locks.ReentrantLock; + +/** + * 内存Channel的具体实现,底层其实是一个ArrayBlockingQueue + * + */ +public class MemoryChannel extends Channel { + + private int bufferSize = 0; + + private AtomicInteger memoryBytes = new AtomicInteger(0); + + private ArrayBlockingQueue queue = null; + + private ReentrantLock lock; + + private Condition notInsufficient, notEmpty; + + public MemoryChannel(final Configuration configuration) { + super(configuration); + this.queue = new ArrayBlockingQueue(this.getCapacity()); + this.bufferSize = configuration.getInt(CoreConstant.DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE); + + lock = new ReentrantLock(); + notInsufficient = lock.newCondition(); + notEmpty = lock.newCondition(); + } + + @Override + public void close() { + super.close(); + try { + this.queue.put(TerminateRecord.get()); + } catch (InterruptedException ex) { + Thread.currentThread().interrupt(); + } + } + + @Override + public void clear(){ + this.queue.clear(); + } + + @Override + protected void doPush(Record r) { + try { + long startTime = System.nanoTime(); + this.queue.put(r); + waitWriterTime += System.nanoTime() - startTime; + memoryBytes.addAndGet(r.getMemorySize()); + } catch (InterruptedException ex) { + Thread.currentThread().interrupt(); + } + } + + @Override + protected void doPushAll(Collection rs) { + try { + long startTime = System.nanoTime(); + 
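+            // Back-pressure: hold the lock and wait until the whole batch fits both the byte budget
+            // (byteCapacity) and the queue's remaining slots, then add it and wake readers via notEmpty.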
lock.lockInterruptibly(); + int bytes = getRecordBytes(rs); + while (memoryBytes.get() + bytes > this.byteCapacity || rs.size() > this.queue.remainingCapacity()) { + notInsufficient.await(200L, TimeUnit.MILLISECONDS); + } + this.queue.addAll(rs); + waitWriterTime += System.nanoTime() - startTime; + memoryBytes.addAndGet(bytes); + notEmpty.signalAll(); + } catch (InterruptedException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } finally { + lock.unlock(); + } + } + + @Override + protected Record doPull() { + try { + long startTime = System.nanoTime(); + Record r = this.queue.take(); + waitReaderTime += System.nanoTime() - startTime; + memoryBytes.addAndGet(-r.getMemorySize()); + return r; + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new IllegalStateException(e); + } + } + + @Override + protected void doPullAll(Collection rs) { + assert rs != null; + rs.clear(); + try { + long startTime = System.nanoTime(); + lock.lockInterruptibly(); + while (this.queue.drainTo(rs, bufferSize) <= 0) { + notEmpty.await(200L, TimeUnit.MILLISECONDS); + } + waitReaderTime += System.nanoTime() - startTime; + int bytes = getRecordBytes(rs); + memoryBytes.addAndGet(-bytes); + notInsufficient.signalAll(); + } catch (InterruptedException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, e); + } finally { + lock.unlock(); + } + } + + private int getRecordBytes(Collection rs){ + int bytes = 0; + for(Record r : rs){ + bytes += r.getMemorySize(); + } + return bytes; + } + + @Override + public int size() { + return this.queue.size(); + } + + @Override + public boolean isEmpty() { + return this.queue.isEmpty(); + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordExchanger.java b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordExchanger.java new file mode 100755 index 0000000000..4ea4902dde --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordExchanger.java @@ -0,0 +1,156 @@ +package com.alibaba.datax.core.transport.exchanger; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; + +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.atomic.AtomicInteger; + +public class BufferedRecordExchanger implements RecordSender, RecordReceiver { + + private final Channel channel; + + private final Configuration configuration; + + private final List buffer; + + private int bufferSize ; + + protected final int byteCapacity; + + private final AtomicInteger memoryBytes = new AtomicInteger(0); + + private int bufferIndex = 0; + + private static Class RECORD_CLASS; + + private volatile boolean shutdown = false; + + private final TaskPluginCollector pluginCollector; + + @SuppressWarnings("unchecked") + public BufferedRecordExchanger(final Channel channel, final 
TaskPluginCollector pluginCollector) { + assert null != channel; + assert null != channel.getConfiguration(); + + this.channel = channel; + this.pluginCollector = pluginCollector; + this.configuration = channel.getConfiguration(); + + this.bufferSize = configuration + .getInt(CoreConstant.DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE); + this.buffer = new ArrayList(bufferSize); + + //channel的queue默认大小为8M,原来为64M + this.byteCapacity = configuration.getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 8 * 1024 * 1024); + + try { + BufferedRecordExchanger.RECORD_CLASS = ((Class) Class + .forName(configuration.getString( + CoreConstant.DATAX_CORE_TRANSPORT_RECORD_CLASS, + "com.alibaba.datax.core.transport.record.DefaultRecord"))); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public Record createRecord() { + try { + return BufferedRecordExchanger.RECORD_CLASS.newInstance(); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public void sendToWriter(Record record) { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + + Validate.notNull(record, "record不能为空."); + + if (record.getMemorySize() > this.byteCapacity) { + this.pluginCollector.collectDirtyRecord(record, new Exception(String.format("单条记录超过大小限制,当前限制为:%s", this.byteCapacity))); + return; + } + + boolean isFull = (this.bufferIndex >= this.bufferSize || this.memoryBytes.get() + record.getMemorySize() > this.byteCapacity); + if (isFull) { + flush(); + } + + this.buffer.add(record); + this.bufferIndex++; + memoryBytes.addAndGet(record.getMemorySize()); + } + + @Override + public void flush() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + this.channel.pushAll(this.buffer); + this.buffer.clear(); + this.bufferIndex = 0; + this.memoryBytes.set(0); + } + + @Override + public void terminate() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + flush(); + this.channel.pushTerminate(TerminateRecord.get()); + } + + @Override + public Record getFromReader() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + boolean isEmpty = (this.bufferIndex >= this.buffer.size()); + if (isEmpty) { + receive(); + } + + Record record = this.buffer.get(this.bufferIndex++); + if (record instanceof TerminateRecord) { + record = null; + } + return record; + } + + @Override + public void shutdown(){ + shutdown = true; + try{ + buffer.clear(); + channel.clear(); + }catch(Throwable t){ + t.printStackTrace(); + } + } + + private void receive() { + this.channel.pullAll(this.buffer); + this.bufferIndex = 0; + this.bufferSize = this.buffer.size(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordTransformerExchanger.java b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordTransformerExchanger.java new file mode 100755 index 0000000000..e9677395b1 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/BufferedRecordTransformerExchanger.java @@ -0,0 +1,168 @@ +package com.alibaba.datax.core.transport.exchanger; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import 
com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.transport.transformer.TransformerExecution; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang.Validate; + +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.atomic.AtomicInteger; + +public class BufferedRecordTransformerExchanger extends TransformerExchanger implements RecordSender, RecordReceiver { + + private final Channel channel; + + private final Configuration configuration; + + private final List buffer; + + private int bufferSize; + + protected final int byteCapacity; + + private final AtomicInteger memoryBytes = new AtomicInteger(0); + + private int bufferIndex = 0; + + private static Class RECORD_CLASS; + + private volatile boolean shutdown = false; + + + @SuppressWarnings("unchecked") + public BufferedRecordTransformerExchanger(final int taskGroupId, final int taskId, + final Channel channel, final Communication communication, + final TaskPluginCollector pluginCollector, + final List tInfoExecs) { + super(taskGroupId, taskId, communication, tInfoExecs, pluginCollector); + assert null != channel; + assert null != channel.getConfiguration(); + + this.channel = channel; + this.configuration = channel.getConfiguration(); + + this.bufferSize = configuration + .getInt(CoreConstant.DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE); + this.buffer = new ArrayList(bufferSize); + + //channel的queue默认大小为8M,原来为64M + this.byteCapacity = configuration.getInt( + CoreConstant.DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE, 8 * 1024 * 1024); + + try { + BufferedRecordTransformerExchanger.RECORD_CLASS = ((Class) Class + .forName(configuration.getString( + CoreConstant.DATAX_CORE_TRANSPORT_RECORD_CLASS, + "com.alibaba.datax.core.transport.record.DefaultRecord"))); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public Record createRecord() { + try { + return BufferedRecordTransformerExchanger.RECORD_CLASS.newInstance(); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public void sendToWriter(Record record) { + if (shutdown) { + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + + Validate.notNull(record, "record不能为空."); + + record = doTransformer(record); + + if(record == null){ + return; + } + + if (record.getMemorySize() > this.byteCapacity) { + this.pluginCollector.collectDirtyRecord(record, new Exception(String.format("单条记录超过大小限制,当前限制为:%s", this.byteCapacity))); + return; + } + + boolean isFull = (this.bufferIndex >= this.bufferSize || this.memoryBytes.get() + record.getMemorySize() > this.byteCapacity); + if (isFull) { + flush(); + } + + this.buffer.add(record); + this.bufferIndex++; + memoryBytes.addAndGet(record.getMemorySize()); + } + + @Override + public void flush() { + if (shutdown) { + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + this.channel.pushAll(this.buffer); + //和channel的统计保持同步 + doStat(); + 
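+        // The batch is now in the channel and the transformer counters are synced into the
+        // Communication, so reset the local buffer and its size/index counters for the next batch.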
this.buffer.clear(); + this.bufferIndex = 0; + this.memoryBytes.set(0); + } + + @Override + public void terminate() { + if (shutdown) { + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + flush(); + this.channel.pushTerminate(TerminateRecord.get()); + } + + @Override + public Record getFromReader() { + if (shutdown) { + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + boolean isEmpty = (this.bufferIndex >= this.buffer.size()); + if (isEmpty) { + receive(); + } + + Record record = this.buffer.get(this.bufferIndex++); + if (record instanceof TerminateRecord) { + record = null; + } + return record; + } + + @Override + public void shutdown() { + shutdown = true; + try { + buffer.clear(); + channel.clear(); + } catch (Throwable t) { + t.printStackTrace(); + } + } + + private void receive() { + this.channel.pullAll(this.buffer); + this.bufferIndex = 0; + this.bufferSize = this.buffer.size(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/exchanger/RecordExchanger.java b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/RecordExchanger.java new file mode 100755 index 0000000000..fd91ffe705 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/RecordExchanger.java @@ -0,0 +1,113 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.core.transport.exchanger; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.transport.channel.Channel; +import com.alibaba.datax.core.transport.record.TerminateRecord; +import com.alibaba.datax.core.transport.transformer.TransformerExecution; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.datax.core.util.container.CoreConstant; + +import java.util.List; + +public class RecordExchanger extends TransformerExchanger implements RecordSender, RecordReceiver { + + private Channel channel; + + private Configuration configuration; + + private static Class RECORD_CLASS; + + private volatile boolean shutdown = false; + + @SuppressWarnings("unchecked") + public RecordExchanger(final int taskGroupId, final int taskId,final Channel channel, final Communication communication,List transformerExecs, final TaskPluginCollector pluginCollector) { + super(taskGroupId,taskId,communication,transformerExecs, pluginCollector); + assert channel != null; + this.channel = channel; + this.configuration = channel.getConfiguration(); + try { + RecordExchanger.RECORD_CLASS = (Class) Class + .forName(configuration.getString( + CoreConstant.DATAX_CORE_TRANSPORT_RECORD_CLASS, + "com.alibaba.datax.core.transport.record.DefaultRecord")); + } catch (ClassNotFoundException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public Record getFromReader() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + Record record = this.channel.pull(); + return (record instanceof TerminateRecord ? 
null : record); + } + + @Override + public Record createRecord() { + try { + return RECORD_CLASS.newInstance(); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.CONFIG_ERROR, e); + } + } + + @Override + public void sendToWriter(Record record) { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + record = doTransformer(record); + if (record == null) { + return; + } + this.channel.push(record); + //和channel的统计保持同步 + doStat(); + } + + @Override + public void flush() { + } + + @Override + public void terminate() { + if(shutdown){ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + this.channel.pushTerminate(TerminateRecord.get()); + //和channel的统计保持同步 + doStat(); + } + + @Override + public void shutdown(){ + shutdown = true; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/exchanger/TransformerExchanger.java b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/TransformerExchanger.java new file mode 100644 index 0000000000..92a5b73203 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/exchanger/TransformerExchanger.java @@ -0,0 +1,147 @@ +package com.alibaba.datax.core.transport.exchanger; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.transport.transformer.TransformerErrorCode; +import com.alibaba.datax.core.transport.transformer.TransformerExecution; +import com.alibaba.datax.core.util.container.ClassLoaderSwapper; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +/** + * no comments. + * Created by liqiang on 16/3/9. 
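+ * Shared base for the record exchangers: doTransformer() runs the configured transformer chain on each
+ * record (dirty records are collected, null results are filtered out), and doStat() publishes the
+ * accumulated success/failed/filtered counts and time spent to the task Communication.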
+ */ +public abstract class TransformerExchanger { + + private static final Logger LOG = LoggerFactory.getLogger(TransformerExchanger.class); + protected final TaskPluginCollector pluginCollector; + + protected final int taskGroupId; + protected final int taskId; + protected final Communication currentCommunication; + + private long totalExaustedTime = 0; + private long totalFilterRecords = 0; + private long totalSuccessRecords = 0; + private long totalFailedRecords = 0; + + + private List transformerExecs; + + private ClassLoaderSwapper classLoaderSwapper = ClassLoaderSwapper + .newCurrentThreadClassLoaderSwapper(); + + + public TransformerExchanger(int taskGroupId, int taskId, Communication communication, + List transformerExecs, + final TaskPluginCollector pluginCollector) { + + this.transformerExecs = transformerExecs; + this.pluginCollector = pluginCollector; + this.taskGroupId = taskGroupId; + this.taskId = taskId; + this.currentCommunication = communication; + } + + + public Record doTransformer(Record record) { + if (transformerExecs == null || transformerExecs.size() == 0) { + return record; + } + + Record result = record; + + long diffExaustedTime = 0; + String errorMsg = null; + boolean failed = false; + for (TransformerExecution transformerInfoExec : transformerExecs) { + long startTs = System.nanoTime(); + + if (transformerInfoExec.getClassLoader() != null) { + classLoaderSwapper.setCurrentThreadClassLoader(transformerInfoExec.getClassLoader()); + } + + /** + * 延迟检查transformer参数的有效性,直接抛出异常,不作为脏数据 + * 不需要在插件中检查参数的有效性。但参数的个数等和插件相关的参数,在插件内部检查 + */ + if (!transformerInfoExec.isChecked()) { + + if (transformerInfoExec.getColumnIndex() != null && transformerInfoExec.getColumnIndex() >= record.getColumnNumber()) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, + String.format("columnIndex[%s] out of bound[%s]. name=%s", + transformerInfoExec.getColumnIndex(), record.getColumnNumber(), + transformerInfoExec.getTransformerName())); + } + transformerInfoExec.setIsChecked(true); + } + + try { + result = transformerInfoExec.getTransformer().evaluate(result, transformerInfoExec.gettContext(), transformerInfoExec.getFinalParas()); + } catch (Exception e) { + errorMsg = String.format("transformer(%s) has Exception(%s)", transformerInfoExec.getTransformerName(), + e.getMessage()); + failed = true; + //LOG.error(errorMsg, e); + // transformerInfoExec.addFailedRecords(1); + //脏数据不再进行后续transformer处理,按脏数据处理,并过滤该record。 + break; + + } finally { + if (transformerInfoExec.getClassLoader() != null) { + classLoaderSwapper.restoreCurrentThreadClassLoader(); + } + } + + if (result == null) { + /** + * 这个null不能传到writer,必须消化掉 + */ + totalFilterRecords++; + //transformerInfoExec.addFilterRecords(1); + break; + } + + long diff = System.nanoTime() - startTs; + //transformerInfoExec.addExaustedTime(diff); + diffExaustedTime += diff; + //transformerInfoExec.addSuccessRecords(1); + } + + totalExaustedTime += diffExaustedTime; + + if (failed) { + totalFailedRecords++; + this.pluginCollector.collectDirtyRecord(record, errorMsg); + return null; + } else { + totalSuccessRecords++; + return result; + } + } + + public void doStat() { + + /** + * todo 对于多个transformer时,各个transformer的单独统计进行显示。最后再汇总整个transformer的时间消耗. 
+ * 暂时不统计。 + */ +// if (transformers.size() > 1) { +// for (ransformerInfoExec transformerInfoExec : transformers) { +// currentCommunication.setLongCounter(CommunicationTool.TRANSFORMER_NAME_PREFIX + transformerInfoExec.getTransformerName(), transformerInfoExec.getExaustedTime()); +// } +// } + currentCommunication.setLongCounter(CommunicationTool.TRANSFORMER_SUCCEED_RECORDS, totalSuccessRecords); + currentCommunication.setLongCounter(CommunicationTool.TRANSFORMER_FAILED_RECORDS, totalFailedRecords); + currentCommunication.setLongCounter(CommunicationTool.TRANSFORMER_FILTER_RECORDS, totalFilterRecords); + currentCommunication.setLongCounter(CommunicationTool.TRANSFORMER_USED_TIME, totalExaustedTime); + } + + +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/record/DefaultRecord.java b/core/src/main/java/com/alibaba/datax/core/transport/record/DefaultRecord.java new file mode 100755 index 0000000000..2598bc8c80 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/record/DefaultRecord.java @@ -0,0 +1,119 @@ +package com.alibaba.datax.core.transport.record; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.util.ClassSize; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import com.alibaba.fastjson.JSON; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * Created by jingxing on 14-8-24. + */ + +public class DefaultRecord implements Record { + + private static final int RECORD_AVERGAE_COLUMN_NUMBER = 16; + + private List columns; + + private int byteSize; + + // 首先是Record本身需要的内存 + private int memorySize = ClassSize.DefaultRecordHead; + + public DefaultRecord() { + this.columns = new ArrayList(RECORD_AVERGAE_COLUMN_NUMBER); + } + + @Override + public void addColumn(Column column) { + columns.add(column); + incrByteSize(column); + } + + @Override + public Column getColumn(int i) { + if (i < 0 || i >= columns.size()) { + return null; + } + return columns.get(i); + } + + @Override + public void setColumn(int i, final Column column) { + if (i < 0) { + throw DataXException.asDataXException(FrameworkErrorCode.ARGUMENT_ERROR, + "不能给index小于0的column设置值"); + } + + if (i >= columns.size()) { + expandCapacity(i + 1); + } + + decrByteSize(getColumn(i)); + this.columns.set(i, column); + incrByteSize(getColumn(i)); + } + + @Override + public String toString() { + Map json = new HashMap(); + json.put("size", this.getColumnNumber()); + json.put("data", this.columns); + return JSON.toJSONString(json); + } + + @Override + public int getColumnNumber() { + return this.columns.size(); + } + + @Override + public int getByteSize() { + return byteSize; + } + + public int getMemorySize(){ + return memorySize; + } + + private void decrByteSize(final Column column) { + if (null == column) { + return; + } + + byteSize -= column.getByteSize(); + + //内存的占用是column对象的头 再加实际大小 + memorySize = memorySize - ClassSize.ColumnHead - column.getByteSize(); + } + + private void incrByteSize(final Column column) { + if (null == column) { + return; + } + + byteSize += column.getByteSize(); + + //内存的占用是column对象的头 再加实际大小 + memorySize = memorySize + ClassSize.ColumnHead + column.getByteSize(); + } + + private void expandCapacity(int totalSize) { + if (totalSize <= 0) { + return; + } + + int needToExpand = totalSize - columns.size(); + while (needToExpand-- > 0) { + this.columns.add(null); + } + } 
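+    // Note: byteSize tracks only the raw column payload (used for speed limiting), while memorySize also
+    // counts per-column object overhead (ClassSize.ColumnHead) so MemoryChannel can bound real heap usage.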
+ +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/record/TerminateRecord.java b/core/src/main/java/com/alibaba/datax/core/transport/record/TerminateRecord.java new file mode 100755 index 0000000000..928609abda --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/record/TerminateRecord.java @@ -0,0 +1,48 @@ +package com.alibaba.datax.core.transport.record; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; + +/** + * 作为标示 生产者已经完成生产的标志 + * + */ +public class TerminateRecord implements Record { + private final static TerminateRecord SINGLE = new TerminateRecord(); + + private TerminateRecord() { + } + + public static TerminateRecord get() { + return SINGLE; + } + + @Override + public void addColumn(Column column) { + } + + @Override + public Column getColumn(int i) { + return null; + } + + @Override + public int getColumnNumber() { + return 0; + } + + @Override + public int getByteSize() { + return 0; + } + + @Override + public int getMemorySize() { + return 0; + } + + @Override + public void setColumn(int i, Column column) { + return; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/ComplexTransformerProxy.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/ComplexTransformerProxy.java new file mode 100644 index 0000000000..a160e61df9 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/ComplexTransformerProxy.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.transformer.ComplexTransformer; +import com.alibaba.datax.transformer.Transformer; + +import java.util.Map; + +/** + * no comments. + * Created by liqiang on 16/3/8. + */ +public class ComplexTransformerProxy extends ComplexTransformer { + private Transformer realTransformer; + + public ComplexTransformerProxy(Transformer transformer) { + setTransformerName(transformer.getTransformerName()); + this.realTransformer = transformer; + } + + @Override + public Record evaluate(Record record, Map tContext, Object... paras) { + return this.realTransformer.evaluate(record, paras); + } + + public Transformer getRealTransformer() { + return realTransformer; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/FilterTransformer.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/FilterTransformer.java new file mode 100644 index 0000000000..8f6492fa11 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/FilterTransformer.java @@ -0,0 +1,311 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.transformer.Transformer; +import org.apache.commons.lang3.StringUtils; + +import java.util.Arrays; + +/** + * no comments. + * Created by liqiang on 16/3/4. + */ +public class FilterTransformer extends Transformer { + public FilterTransformer() { + setTransformerName("dx_filter"); + } + + @Override + public Record evaluate(Record record, Object... 
paras) { + + int columnIndex; + String code; + String value; + + try { + if (paras.length != 3) { + throw new RuntimeException("dx_filter paras must be 3"); + } + + columnIndex = (Integer) paras[0]; + code = (String) paras[1]; + value = (String) paras[2]; + + if (StringUtils.isEmpty(value)) { + throw new RuntimeException("dx_filter para 2 can't be null"); + } + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "paras:" + Arrays.asList(paras).toString() + " => " + e.getMessage()); + } + + + Column column = record.getColumn(columnIndex); + + try { + + if (code.equalsIgnoreCase("like")) { + return doLike(record, value, column); + } else if (code.equalsIgnoreCase("not like")) { + return doNotLike(record, value, column); + } else if (code.equalsIgnoreCase(">")) { + return doGreat(record, value, column, false); + } else if (code.equalsIgnoreCase("<")) { + return doLess(record, value, column, false); + } else if (code.equalsIgnoreCase("=") || code.equalsIgnoreCase("==")) { + return doEqual(record, value, column); + } else if (code.equalsIgnoreCase("!=")) { + return doNotEqual(record, value, column); + } else if (code.equalsIgnoreCase(">=")) { + return doGreat(record, value, column, true); + } else if (code.equalsIgnoreCase("<=")) { + return doLess(record, value, column, true); + } else { + throw new RuntimeException("dx_filter can't suport code:" + code); + } + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_RUN_EXCEPTION, e.getMessage(), e); + } + } + + + private Record doGreat(Record record, String value, Column column, boolean hasEqual) { + + //如果字段为空,直接不参与比较。即空也属于无穷小 + if(column.getRawData() == null){ + return record; + } + if (column instanceof DoubleColumn) { + Double ori = column.asDouble(); + double val = Double.parseDouble(value); + + if (hasEqual) { + if (ori >= val) { + return null; + } else { + return record; + } + } else { + if (ori > val) { + return null; + } else { + return record; + } + } + } else if (column instanceof LongColumn || column instanceof DateColumn) { + Long ori = column.asLong(); + long val = Long.parseLong(value); + + if (hasEqual) { + if (ori >= val) { + return null; + } else { + return record; + } + } else { + if (ori > val) { + return null; + } else { + return record; + } + } + } else if (column instanceof StringColumn || column instanceof BytesColumn || column instanceof BoolColumn) { + String ori = column.asString(); + if (hasEqual) { + if (ori.compareTo(value) >= 0) { + return null; + } else { + return record; + } + } else { + if (ori.compareTo(value) > 0) { + return null; + } else { + return record; + } + } + } else { + throw new RuntimeException(">=,> can't support this columnType:" + column.getClass().getSimpleName()); + } + } + + private Record doLess(Record record, String value, Column column, boolean hasEqual) { + + //如果字段为空,直接不参与比较。即空也属于无穷大 + if(column.getRawData() == null){ + return record; + } + + if (column instanceof DoubleColumn) { + Double ori = column.asDouble(); + double val = Double.parseDouble(value); + + if (hasEqual) { + if (ori <= val) { + return null; + } else { + return record; + } + } else { + if (ori < val) { + return null; + } else { + return record; + } + } + } else if (column instanceof LongColumn || column instanceof DateColumn) { + Long ori = column.asLong(); + long val = Long.parseLong(value); + + if (hasEqual) { + if (ori <= val) { + return null; + } else { + return record; + } + } else { + if (ori < val) { + return 
null; + } else { + return record; + } + } + } else if (column instanceof StringColumn || column instanceof BytesColumn || column instanceof BoolColumn) { + String ori = column.asString(); + if (hasEqual) { + if (ori.compareTo(value) <= 0) { + return null; + } else { + return record; + } + } else { + if (ori.compareTo(value) < 0) { + return null; + } else { + return record; + } + } + } else { + throw new RuntimeException("<=,< can't support this columnType:" + column.getClass().getSimpleName()); + } + + } + + /** + * DateColumn将比较long值,StringColumn,ByteColumn以及BooleanColumn比较其String值 + * + * @param record + * @param value + * @param column + * @return 如果相等,则过滤。 + */ + + private Record doEqual(Record record, String value, Column column) { + + //如果字段为空,只比较目标字段为"null",否则null字段均不过滤 + if(column.getRawData() == null){ + if(value.equalsIgnoreCase("null")){ + return null; + }else { + return record; + } + } + + if (column instanceof DoubleColumn) { + Double ori = column.asDouble(); + double val = Double.parseDouble(value); + + if (ori == val) { + return null; + } else { + return record; + } + } else if (column instanceof LongColumn || column instanceof DateColumn) { + Long ori = column.asLong(); + long val = Long.parseLong(value); + + if (ori == val) { + return null; + } else { + return record; + } + } else if (column instanceof StringColumn || column instanceof BytesColumn || column instanceof BoolColumn) { + String ori = column.asString(); + if (ori.compareTo(value) == 0) { + return null; + } else { + return record; + } + } else { + throw new RuntimeException("== can't support this columnType:" + column.getClass().getSimpleName()); + } + + } + + /** + * DateColumn将比较long值,StringColumn,ByteColumn以及BooleanColumn比较其String值 + * + * @param record + * @param value + * @param column + * @return 如果不相等,则过滤。 + */ + private Record doNotEqual(Record record, String value, Column column) { + + //如果字段为空,只比较目标字段为"null", 否则null字段均过滤。 + if(column.getRawData() == null){ + if(value.equalsIgnoreCase("null")){ + return record; + }else { + return null; + } + } + + if (column instanceof DoubleColumn) { + Double ori = column.asDouble(); + double val = Double.parseDouble(value); + + if (ori != val) { + return null; + } else { + return record; + } + } else if (column instanceof LongColumn || column instanceof DateColumn) { + Long ori = column.asLong(); + long val = Long.parseLong(value); + + if (ori != val) { + return null; + } else { + return record; + } + } else if (column instanceof StringColumn || column instanceof BytesColumn || column instanceof BoolColumn) { + String ori = column.asString(); + if (ori.compareTo(value) != 0) { + return null; + } else { + return record; + } + } else { + throw new RuntimeException("== can't support this columnType:" + column.getClass().getSimpleName()); + } + } + + private Record doLike(Record record, String value, Column column) { + String orivalue = column.asString(); + if (orivalue !=null && orivalue.matches(value)) { + return null; + } else { + return record; + } + } + + private Record doNotLike(Record record, String value, Column column) { + String orivalue = column.asString(); + if (orivalue !=null && orivalue.matches(value)) { + return record; + } else { + return null; + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/GroovyTransformer.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/GroovyTransformer.java new file mode 100644 index 0000000000..83d6691e26 --- /dev/null +++ 
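Editor's note: an illustrative sketch (not in the original patch) of how the dx_filter operators above decide whether a record is dropped; a null return means the record is filtered out. It assumes LongColumn from datax-common, and the demo class name is invented.

```
import com.alibaba.datax.common.element.LongColumn;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.core.transport.record.DefaultRecord;
import com.alibaba.datax.core.transport.transformer.FilterTransformer;

public class FilterTransformerDemo {
    public static void main(String[] args) {
        Record record = new DefaultRecord();
        record.addColumn(new LongColumn(15L));

        FilterTransformer filter = new FilterTransformer();
        // paras: column index, operator, comparison value
        Record kept = filter.evaluate(record, 0, ">", "100");    // 15 > 100 is false, record kept
        Record dropped = filter.evaluate(record, 0, "<", "100"); // 15 < 100 is true, record dropped

        System.out.println(kept != null);    // true
        System.out.println(dropped == null); // true
    }
}
```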
b/core/src/main/java/com/alibaba/datax/core/transport/transformer/GroovyTransformer.java @@ -0,0 +1,91 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.transformer.Transformer; +import groovy.lang.GroovyClassLoader; +import org.apache.commons.lang3.StringUtils; +import org.codehaus.groovy.control.CompilationFailedException; + +import java.util.Arrays; +import java.util.List; + +/** + * no comments. + * Created by liqiang on 16/3/4. + */ +public class GroovyTransformer extends Transformer { + public GroovyTransformer() { + setTransformerName("dx_groovy"); + } + + private Transformer groovyTransformer; + + @Override + public Record evaluate(Record record, Object... paras) { + + if (groovyTransformer == null) { + //全局唯一 + if (paras.length < 1 || paras.length > 2) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "dx_groovy paras must be 1 or 2 . now paras is: " + Arrays.asList(paras).toString()); + } + synchronized (this) { + + if (groovyTransformer == null) { + String code = (String) paras[0]; + @SuppressWarnings("unchecked") List extraPackage = paras.length == 2 ? (List) paras[1] : null; + initGroovyTransformer(code, extraPackage); + } + } + } + + return this.groovyTransformer.evaluate(record); + } + + private void initGroovyTransformer(String code, List extraPackage) { + GroovyClassLoader loader = new GroovyClassLoader(GroovyTransformer.class.getClassLoader()); + String groovyRule = getGroovyRule(code, extraPackage); + + Class groovyClass; + try { + groovyClass = loader.parseClass(groovyRule); + } catch (CompilationFailedException cfe) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_GROOVY_INIT_EXCEPTION, cfe); + } + + try { + Object t = groovyClass.newInstance(); + if (!(t instanceof Transformer)) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_GROOVY_INIT_EXCEPTION, "datax bug! contact askdatax"); + } + this.groovyTransformer = (Transformer) t; + } catch (Throwable ex) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_GROOVY_INIT_EXCEPTION, ex); + } + } + + + private String getGroovyRule(String expression, List extraPackagesStrList) { + StringBuffer sb = new StringBuffer(); + if(extraPackagesStrList!=null) { + for (String extraPackagesStr : extraPackagesStrList) { + if (StringUtils.isNotEmpty(extraPackagesStr)) { + sb.append(extraPackagesStr); + } + } + } + sb.append("import static com.alibaba.datax.core.transport.transformer.GroovyTransformerStaticUtil.*;"); + sb.append("import com.alibaba.datax.common.element.*;"); + sb.append("import com.alibaba.datax.common.exception.DataXException;"); + sb.append("import com.alibaba.datax.transformer.Transformer;"); + sb.append("import java.util.*;"); + sb.append("public class RULE extends Transformer").append("{"); + sb.append("public Record evaluate(Record record, Object... 
paras) {"); + sb.append(expression); + sb.append("}}"); + + return sb.toString(); + } + + +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/GroovyTransformerStaticUtil.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/GroovyTransformerStaticUtil.java new file mode 100644 index 0000000000..4c872993ab --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/GroovyTransformerStaticUtil.java @@ -0,0 +1,10 @@ +package com.alibaba.datax.core.transport.transformer; + +/** + * GroovyTransformer的帮助类,供groovy代码使用,必须全是static的方法 + * Created by liqiang on 16/3/4. + */ +public class GroovyTransformerStaticUtil { + + +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/PadTransformer.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/PadTransformer.java new file mode 100644 index 0000000000..359c51a8ab --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/PadTransformer.java @@ -0,0 +1,91 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.transformer.Transformer; + +import java.util.Arrays; + +/** + * no comments. + * Created by liqiang on 16/3/4. + */ +public class PadTransformer extends Transformer { + public PadTransformer() { + setTransformerName("dx_pad"); + } + + @Override + public Record evaluate(Record record, Object... paras) { + + int columnIndex; + String padType; + int length; + String padString; + + try { + if (paras.length != 4) { + throw new RuntimeException("dx_pad paras must be 4"); + } + + columnIndex = (Integer) paras[0]; + padType = (String) paras[1]; + length = Integer.valueOf((String) paras[2]); + padString = (String) paras[3]; + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "paras:" + Arrays.asList(paras).toString() + " => " + e.getMessage()); + } + + Column column = record.getColumn(columnIndex); + + try { + String oriValue = column.asString(); + + //如果字段为空,作为空字符串处理 + if(oriValue==null){ + oriValue = ""; + } + String newValue; + if (!padType.equalsIgnoreCase("r") && !padType.equalsIgnoreCase("l")) { + throw new RuntimeException(String.format("dx_pad first para(%s) support l or r", padType)); + } + if (length <= oriValue.length()) { + newValue = oriValue.substring(0, length); + } else { + + newValue = doPad(padType, oriValue, length, padString); + } + + record.setColumn(columnIndex, new StringColumn(newValue)); + + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_RUN_EXCEPTION, e.getMessage(),e); + } + return record; + } + + private String doPad(String padType, String oriValue, int length, String padString) { + + String finalPad = ""; + int NeedLength = length - oriValue.length(); + while (NeedLength > 0) { + + if (NeedLength >= padString.length()) { + finalPad += padString; + NeedLength -= padString.length(); + } else { + finalPad += padString.substring(0, NeedLength); + NeedLength = 0; + } + } + + if (padType.equalsIgnoreCase("l")) { + return finalPad + oriValue; + } else { + return oriValue + finalPad; + } + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/ReplaceTransformer.java 
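Editor's note: a small sketch (not part of the patch) of the dx_pad behavior implemented above; note that the target length is passed as a String, matching the parsing in PadTransformer, and the demo class name is invented.

```
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.core.transport.record.DefaultRecord;
import com.alibaba.datax.core.transport.transformer.PadTransformer;

public class PadTransformerDemo {
    public static void main(String[] args) {
        Record record = new DefaultRecord();
        record.addColumn(new StringColumn("42"));

        // paras: column index, pad side ("l" or "r"), target length (as String), pad string
        new PadTransformer().evaluate(record, 0, "l", "5", "0");

        System.out.println(record.getColumn(0).asString()); // "00042"
    }
}
```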
b/core/src/main/java/com/alibaba/datax/core/transport/transformer/ReplaceTransformer.java new file mode 100644 index 0000000000..bf5e36af63 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/ReplaceTransformer.java @@ -0,0 +1,66 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.transformer.Transformer; + +import java.util.Arrays; + +/** + * no comments. + * Created by liqiang on 16/3/4. + */ +public class ReplaceTransformer extends Transformer { + public ReplaceTransformer() { + setTransformerName("dx_replace"); + } + + @Override + public Record evaluate(Record record, Object... paras) { + + int columnIndex; + int startIndex; + int length; + String replaceString; + try { + if (paras.length != 4) { + throw new RuntimeException("dx_replace paras must be 4"); + } + + columnIndex = (Integer) paras[0]; + startIndex = Integer.valueOf((String) paras[1]); + length = Integer.valueOf((String) paras[2]); + replaceString = (String) paras[3]; + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "paras:" + Arrays.asList(paras).toString() + " => " + e.getMessage()); + } + + Column column = record.getColumn(columnIndex); + + try { + String oriValue = column.asString(); + + //如果字段为空,跳过replace处理 + if(oriValue == null){ + return record; + } + String newValue; + if (startIndex > oriValue.length()) { + throw new RuntimeException(String.format("dx_replace startIndex(%s) out of range(%s)", startIndex, oriValue.length())); + } + if (startIndex + length >= oriValue.length()) { + newValue = oriValue.substring(0, startIndex) + replaceString; + } else { + newValue = oriValue.substring(0, startIndex) + replaceString + oriValue.substring(startIndex + length, oriValue.length()); + } + + record.setColumn(columnIndex, new StringColumn(newValue)); + + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_RUN_EXCEPTION, e.getMessage(),e); + } + return record; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/SubstrTransformer.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/SubstrTransformer.java new file mode 100644 index 0000000000..4671df41a2 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/SubstrTransformer.java @@ -0,0 +1,65 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.transformer.Transformer; + +import java.util.Arrays; + +/** + * no comments. + * Created by liqiang on 16/3/4. + */ +public class SubstrTransformer extends Transformer { + public SubstrTransformer() { + setTransformerName("dx_substr"); + } + + @Override + public Record evaluate(Record record, Object... 
paras) { + + int columnIndex; + int startIndex; + int length; + + try { + if (paras.length != 3) { + throw new RuntimeException("dx_substr paras must be 3"); + } + + columnIndex = (Integer) paras[0]; + startIndex = Integer.valueOf((String) paras[1]); + length = Integer.valueOf((String) paras[2]); + + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "paras:" + Arrays.asList(paras).toString() + " => " + e.getMessage()); + } + + Column column = record.getColumn(columnIndex); + + try { + String oriValue = column.asString(); + //如果字段为空,跳过subStr处理 + if(oriValue == null){ + return record; + } + String newValue; + if (startIndex > oriValue.length()) { + throw new RuntimeException(String.format("dx_substr startIndex(%s) out of range(%s)", startIndex, oriValue.length())); + } + if (startIndex + length >= oriValue.length()) { + newValue = oriValue.substring(startIndex, oriValue.length()); + } else { + newValue = oriValue.substring(startIndex, startIndex + length); + } + + record.setColumn(columnIndex, new StringColumn(newValue)); + + } catch (Exception e) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_RUN_EXCEPTION, e.getMessage(),e); + } + return record; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerErrorCode.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerErrorCode.java new file mode 100755 index 0000000000..6088d204e9 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerErrorCode.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum TransformerErrorCode implements ErrorCode { + //重复命名 + TRANSFORMER_NAME_ERROR("TransformerErrorCode-01","Transformer name illegal"), + TRANSFORMER_DUPLICATE_ERROR("TransformerErrorCode-02","Transformer name has existed"), + TRANSFORMER_NOTFOUND_ERROR("TransformerErrorCode-03","Transformer name not found"), + TRANSFORMER_CONFIGURATION_ERROR("TransformerErrorCode-04","Transformer configuration error"), + TRANSFORMER_ILLEGAL_PARAMETER("TransformerErrorCode-05","Transformer parameter illegal"), + TRANSFORMER_RUN_EXCEPTION("TransformerErrorCode-06","Transformer run exception"), + TRANSFORMER_GROOVY_INIT_EXCEPTION("TransformerErrorCode-07","Transformer Groovy init exception"), + ; + + private final String code; + + private final String description; + + private TransformerErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerExecution.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerExecution.java new file mode 100644 index 0000000000..1f307a97c4 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerExecution.java @@ -0,0 +1,122 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.transformer.ComplexTransformer; + +import java.util.Map; + +/** + * 每个func对应一个实例. + * Created by liqiang on 16/3/16. 
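Editor's note: an illustrative sketch (not part of the patch) combining dx_substr and dx_replace as implemented above; start offsets and lengths are passed as Strings, and the demo class name is invented.

```
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.core.transport.record.DefaultRecord;
import com.alibaba.datax.core.transport.transformer.ReplaceTransformer;
import com.alibaba.datax.core.transport.transformer.SubstrTransformer;

public class StringTransformerDemo {
    public static void main(String[] args) {
        Record record = new DefaultRecord();
        record.addColumn(new StringColumn("2016-03-04 12:00:00"));

        // dx_substr: column index, start offset "0", length "10"
        new SubstrTransformer().evaluate(record, 0, "0", "10");
        System.out.println(record.getColumn(0).asString()); // "2016-03-04"

        // dx_replace: replace 2 characters starting at offset 5 with "01"
        new ReplaceTransformer().evaluate(record, 0, "5", "2", "01");
        System.out.println(record.getColumn(0).asString()); // "2016-01-04"
    }
}
```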
+ */ +public class TransformerExecution { + + private Object[] finalParas; + + private final TransformerExecutionParas transformerExecutionParas; + private final TransformerInfo transformerInfo; + + + public TransformerExecution(TransformerInfo transformerInfo ,TransformerExecutionParas transformerExecutionParas) { + this.transformerExecutionParas = transformerExecutionParas; + this.transformerInfo = transformerInfo; + } + + /** + * 以下是动态统计信息,暂时未用 + */ + private long exaustedTime = 0; + private long successRecords = 0; + private long failedRecords = 0; + private long filterRecords = 0; + + /** + * 参数采取延迟检查 + */ + + private boolean isChecked = false; + + public void genFinalParas() { + + /** + * groovy不支持传参 + */ + if (transformerInfo.getTransformer().getTransformerName().equals("dx_groovy")) { + finalParas = new Object[2]; + finalParas[0] = transformerExecutionParas.getCode(); + finalParas[1] = transformerExecutionParas.getExtraPackage(); + return; + } + /** + * 其他function,按照columnIndex和para的顺序,如果columnIndex为空,跳过conlumnIndex + */ + if (transformerExecutionParas.getColumnIndex() != null) { + if (transformerExecutionParas.getParas() != null) { + finalParas = new Object[transformerExecutionParas.getParas().length + 1]; + System.arraycopy(transformerExecutionParas.getParas(), 0, finalParas, 1, transformerExecutionParas.getParas().length); + } else { + finalParas = new Object[1]; + } + finalParas[0] = transformerExecutionParas.getColumnIndex(); + + } else { + if (transformerExecutionParas.getParas() != null) { + finalParas = transformerExecutionParas.getParas(); + } else { + finalParas = null; + } + + } + } + + + public Object[] getFinalParas() { + return finalParas; + } + + public long getExaustedTime() { + return exaustedTime; + } + + public long getSuccessRecords() { + return successRecords; + } + + public long getFailedRecords() { + return failedRecords; + } + + public long getFilterRecords() { + return filterRecords; + } + + public void setIsChecked(boolean isChecked) { + this.isChecked = isChecked; + } + + public boolean isChecked() { + return isChecked; + } + + /** + * 一些代理方法 + */ + public ClassLoader getClassLoader() { + return transformerInfo.getClassLoader(); + } + + public Integer getColumnIndex() { + return transformerExecutionParas.getColumnIndex(); + } + + public String getTransformerName() { + return transformerInfo.getTransformer().getTransformerName(); + } + + public ComplexTransformer getTransformer() { + return transformerInfo.getTransformer(); + } + + public Map gettContext() { + return transformerExecutionParas.gettContext(); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerExecutionParas.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerExecutionParas.java new file mode 100644 index 0000000000..7645c25445 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerExecutionParas.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.core.transport.transformer; + +import java.util.List; +import java.util.Map; + +/** + * no comments. + * Created by liqiang on 16/3/16. 
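Editor's note: a sketch (not part of the patch) of how genFinalParas above assembles the argument array; for a column-based function the column index is prepended to the user parameters, while dx_groovy instead receives the code and extra package list. The demo class name is invented.

```
import com.alibaba.datax.core.transport.transformer.ComplexTransformerProxy;
import com.alibaba.datax.core.transport.transformer.SubstrTransformer;
import com.alibaba.datax.core.transport.transformer.TransformerExecution;
import com.alibaba.datax.core.transport.transformer.TransformerExecutionParas;
import com.alibaba.datax.core.transport.transformer.TransformerInfo;

import java.util.Arrays;

public class GenFinalParasDemo {
    public static void main(String[] args) {
        TransformerExecutionParas paras = new TransformerExecutionParas();
        paras.setColumnIndex(0);
        paras.setParas(new String[]{"0", "10"});

        TransformerInfo info = new TransformerInfo();
        info.setTransformer(new ComplexTransformerProxy(new SubstrTransformer()));
        info.setIsNative(true);

        TransformerExecution exec = new TransformerExecution(info, paras);
        exec.genFinalParas();

        // the column index lands in slot 0, the user paras follow: [0, 0, 10]
        System.out.println(Arrays.toString(exec.getFinalParas()));
    }
}
```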
+ */ +public class TransformerExecutionParas { + + /** + * 以下是function参数 + */ + + private Integer columnIndex; + private String[] paras; + private Map tContext; + private String code; + private List extraPackage; + + + public Integer getColumnIndex() { + return columnIndex; + } + + public String[] getParas() { + return paras; + } + + public Map gettContext() { + return tContext; + } + + public String getCode() { + return code; + } + + public List getExtraPackage() { + return extraPackage; + } + + public void setColumnIndex(Integer columnIndex) { + this.columnIndex = columnIndex; + } + + public void setParas(String[] paras) { + this.paras = paras; + } + + public void settContext(Map tContext) { + this.tContext = tContext; + } + + public void setCode(String code) { + this.code = code; + } + + public void setExtraPackage(List extraPackage) { + this.extraPackage = extraPackage; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerInfo.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerInfo.java new file mode 100644 index 0000000000..7b2b3d74b8 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerInfo.java @@ -0,0 +1,42 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.transformer.ComplexTransformer; + +/** + * 单实例. + * Created by liqiang on 16/3/9. + */ +public class TransformerInfo { + + /** + * function基本信息 + */ + private ComplexTransformer transformer; + private ClassLoader classLoader; + private boolean isNative; + + + public ComplexTransformer getTransformer() { + return transformer; + } + + public ClassLoader getClassLoader() { + return classLoader; + } + + public boolean isNative() { + return isNative; + } + + public void setTransformer(ComplexTransformer transformer) { + this.transformer = transformer; + } + + public void setClassLoader(ClassLoader classLoader) { + this.classLoader = classLoader; + } + + public void setIsNative(boolean isNative) { + this.isNative = isNative; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerRegistry.java b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerRegistry.java new file mode 100644 index 0000000000..96a0d98845 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/transport/transformer/TransformerRegistry.java @@ -0,0 +1,177 @@ +package com.alibaba.datax.core.transport.transformer; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.container.CoreConstant; +import com.alibaba.datax.core.util.container.JarLoader; +import com.alibaba.datax.transformer.ComplexTransformer; +import com.alibaba.datax.transformer.Transformer; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * no comments. + * Created by liqiang on 16/3/3. + */ +public class TransformerRegistry { + + private static final Logger LOG = LoggerFactory.getLogger(TransformerRegistry.class); + private static Map registedTransformer = new HashMap(); + + static { + /** + * add native transformer + * local storage and from server will be delay load. 
+ */ + + registTransformer(new SubstrTransformer()); + registTransformer(new PadTransformer()); + registTransformer(new ReplaceTransformer()); + registTransformer(new FilterTransformer()); + registTransformer(new GroovyTransformer()); + } + + public static void loadTransformerFromLocalStorage() { + //add local_storage transformer + loadTransformerFromLocalStorage(null); + } + + + public static void loadTransformerFromLocalStorage(List transformers) { + + String[] paths = new File(CoreConstant.DATAX_STORAGE_TRANSFORMER_HOME).list(); + if (null == paths) { + return; + } + + for (final String each : paths) { + try { + if (transformers == null || transformers.contains(each)) { + loadTransformer(each); + } + } catch (Exception e) { + LOG.error(String.format("skip transformer(%s) loadTransformer has Exception(%s)", each, e.getMessage()), e); + } + + } + } + + public static void loadTransformer(String each) { + String transformerPath = CoreConstant.DATAX_STORAGE_TRANSFORMER_HOME + File.separator + each; + Configuration transformerConfiguration; + try { + transformerConfiguration = loadTransFormerConfig(transformerPath); + } catch (Exception e) { + LOG.error(String.format("skip transformer(%s),load transformer.json error, path = %s, ", each, transformerPath), e); + return; + } + + String className = transformerConfiguration.getString("class"); + if (StringUtils.isEmpty(className)) { + LOG.error(String.format("skip transformer(%s),class not config, path = %s, config = %s", each, transformerPath, transformerConfiguration.beautify())); + return; + } + + String funName = transformerConfiguration.getString("name"); + if (!each.equals(funName)) { + LOG.warn(String.format("transformer(%s) name not match transformer.json config name[%s], will ignore json's name, path = %s, config = %s", each, funName, transformerPath, transformerConfiguration.beautify())); + } + JarLoader jarLoader = new JarLoader(new String[]{transformerPath}); + + try { + Class transformerClass = jarLoader.loadClass(className); + Object transformer = transformerClass.newInstance(); + if (ComplexTransformer.class.isAssignableFrom(transformer.getClass())) { + ((ComplexTransformer) transformer).setTransformerName(each); + registComplexTransformer((ComplexTransformer) transformer, jarLoader, false); + } else if (Transformer.class.isAssignableFrom(transformer.getClass())) { + ((Transformer) transformer).setTransformerName(each); + registTransformer((Transformer) transformer, jarLoader, false); + } else { + LOG.error(String.format("load Transformer class(%s) error, path = %s", className, transformerPath)); + } + } catch (Exception e) { + //错误funciton跳过 + LOG.error(String.format("skip transformer(%s),load Transformer class error, path = %s ", each, transformerPath), e); + } + } + + private static Configuration loadTransFormerConfig(String transformerPath) { + return Configuration.from(new File(transformerPath + File.separator + "transformer.json")); + } + + public static TransformerInfo getTransformer(String transformerName) { + + TransformerInfo result = registedTransformer.get(transformerName); + + //if (result == null) { + //todo 再尝试从disk读取 + //} + + return result; + } + + public static synchronized void registTransformer(Transformer transformer) { + registTransformer(transformer, null, true); + } + + public static synchronized void registTransformer(Transformer transformer, ClassLoader classLoader, boolean isNative) { + + checkName(transformer.getTransformerName(), isNative); + + if 
(registedTransformer.containsKey(transformer.getTransformerName())) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_DUPLICATE_ERROR, " name=" + transformer.getTransformerName()); + } + + registedTransformer.put(transformer.getTransformerName(), buildTransformerInfo(new ComplexTransformerProxy(transformer), isNative, classLoader)); + + } + + public static synchronized void registComplexTransformer(ComplexTransformer complexTransformer, ClassLoader classLoader, boolean isNative) { + + checkName(complexTransformer.getTransformerName(), isNative); + + if (registedTransformer.containsKey(complexTransformer.getTransformerName())) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_DUPLICATE_ERROR, " name=" + complexTransformer.getTransformerName()); + } + + registedTransformer.put(complexTransformer.getTransformerName(), buildTransformerInfo(complexTransformer, isNative, classLoader)); + } + + private static void checkName(String functionName, boolean isNative) { + boolean checkResult = true; + if (isNative) { + if (!functionName.startsWith("dx_")) { + checkResult = false; + } + } else { + if (functionName.startsWith("dx_")) { + checkResult = false; + } + } + + if (!checkResult) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_NAME_ERROR, " name=" + functionName + ": isNative=" + isNative); + } + + } + + private static TransformerInfo buildTransformerInfo(ComplexTransformer complexTransformer, boolean isNative, ClassLoader classLoader) { + TransformerInfo transformerInfo = new TransformerInfo(); + transformerInfo.setClassLoader(classLoader); + transformerInfo.setIsNative(isNative); + transformerInfo.setTransformer(complexTransformer); + return transformerInfo; + } + + public static List getAllSuportTransformer() { + return new ArrayList(registedTransformer.keySet()); + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ClassSize.java b/core/src/main/java/com/alibaba/datax/core/util/ClassSize.java new file mode 100644 index 0000000000..1be49addf3 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ClassSize.java @@ -0,0 +1,42 @@ +package com.alibaba.datax.core.util; + +/** + * Created by liqiang on 15/12/12. 
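Editor's note: a sketch (not part of the patch) of the naming rule enforced by checkName above; built-in (native) transformers must use the dx_ prefix, while externally loaded ones must not. MyUpperTransformer and the demo class are invented for illustration and assume the Transformer base class only requires evaluate to be implemented.

```
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.core.transport.transformer.TransformerRegistry;
import com.alibaba.datax.transformer.Transformer;

public class RegistryNamingDemo {
    // hypothetical user-defined transformer, used only for this demo
    static class MyUpperTransformer extends Transformer {
        MyUpperTransformer() {
            setTransformerName("my_upper"); // no dx_ prefix
        }

        @Override
        public Record evaluate(Record record, Object... paras) {
            return record;
        }
    }

    public static void main(String[] args) {
        try {
            // the one-argument overload registers as a native transformer,
            // so a name without the dx_ prefix is rejected
            TransformerRegistry.registTransformer(new MyUpperTransformer());
        } catch (DataXException e) {
            System.out.println(e.getMessage()); // TRANSFORMER_NAME_ERROR
        }
    }
}
```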
+ */ +public class ClassSize { + + public static final int DefaultRecordHead; + public static final int ColumnHead; + + //objectHead的大小 + public static final int REFERENCE; + public static final int OBJECT; + public static final int ARRAY; + public static final int ARRAYLIST; + static { + //only 64位 + REFERENCE = 8; + + OBJECT = 2 * REFERENCE; + + ARRAY = align(3 * REFERENCE); + + // 16+8+24+16 + ARRAYLIST = align(OBJECT + align(REFERENCE) + align(ARRAY) + + (2 * Long.SIZE / Byte.SIZE)); + // 8+64+8 + DefaultRecordHead = align(align(REFERENCE) + ClassSize.ARRAYLIST + 2 * Integer.SIZE / Byte.SIZE); + //16+4 + ColumnHead = align(2 * REFERENCE + Integer.SIZE / Byte.SIZE); + } + + public static int align(int num) { + return (int)(align((long)num)); + } + + public static long align(long num) { + //The 7 comes from that the alignSize is 8 which is the number of bytes + //stored and sent together + return ((num + 7) >> 3) << 3; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ClassUtil.java b/core/src/main/java/com/alibaba/datax/core/util/ClassUtil.java new file mode 100755 index 0000000000..0cf0d5617c --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ClassUtil.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.core.util; + +import java.lang.reflect.Constructor; + +public final class ClassUtil { + + /** + * 通过反射构造类对象 + * + * @param className + * 反射的类名称 + * @param t + * 反射类的类型Class对象 + * @param args + * 构造参数 + * + * */ + @SuppressWarnings({ "rawtypes", "unchecked" }) + public static T instantiate(String className, Class t, + Object... args) { + try { + Constructor constructor = (Constructor) Class.forName(className) + .getConstructor(ClassUtil.toClassType(args)); + return (T) constructor.newInstance(args); + } catch (Exception e) { + throw new IllegalArgumentException(e); + } + } + + private static Class[] toClassType(Object[] args) { + Class[] clazzs = new Class[args.length]; + + for (int i = 0, length = args.length; i < length; i++) { + clazzs[i] = args[i].getClass(); + } + + return clazzs; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ConfigParser.java b/core/src/main/java/com/alibaba/datax/core/util/ConfigParser.java new file mode 100755 index 0000000000..20039864b8 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ConfigParser.java @@ -0,0 +1,197 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.io.FileUtils; +import org.apache.commons.lang.StringUtils; +import org.apache.http.client.methods.HttpGet; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.IOException; +import java.net.URL; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; + +public final class ConfigParser { + private static final Logger LOG = LoggerFactory.getLogger(ConfigParser.class); + /** + * 指定Job配置路径,ConfigParser会解析Job、Plugin、Core全部信息,并以Configuration返回 + */ + public static Configuration parse(final String jobPath) { + Configuration configuration = ConfigParser.parseJobConfig(jobPath); + + configuration.merge( + ConfigParser.parseCoreConfig(CoreConstant.DATAX_CONF_PATH), + false); + // todo config优化,只捕获需要的plugin + String readerPluginName = configuration.getString( + CoreConstant.DATAX_JOB_CONTENT_READER_NAME); + String writerPluginName = configuration.getString( + 
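Editor's note: a tiny sketch (not part of the patch) of the 8-byte alignment helper above, from which the record and column head constants are derived; the demo class name is invented.

```
import com.alibaba.datax.core.util.ClassSize;

public class ClassSizeDemo {
    public static void main(String[] args) {
        // align() rounds up to the next multiple of 8 bytes
        System.out.println(ClassSize.align(13)); // 16
        System.out.println(ClassSize.align(16)); // 16

        // the head constants used by DefaultRecord are precomputed with the same helper
        System.out.println(ClassSize.DefaultRecordHead + " / " + ClassSize.ColumnHead);
    }
}
```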
CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME); + + String preHandlerName = configuration.getString( + CoreConstant.DATAX_JOB_PREHANDLER_PLUGINNAME); + + String postHandlerName = configuration.getString( + CoreConstant.DATAX_JOB_POSTHANDLER_PLUGINNAME); + + Set pluginList = new HashSet(); + pluginList.add(readerPluginName); + pluginList.add(writerPluginName); + + if(StringUtils.isNotEmpty(preHandlerName)) { + pluginList.add(preHandlerName); + } + if(StringUtils.isNotEmpty(postHandlerName)) { + pluginList.add(postHandlerName); + } + try { + configuration.merge(parsePluginConfig(new ArrayList(pluginList)), false); + }catch (Exception e){ + //吞掉异常,保持log干净。这里message足够。 + LOG.warn(String.format("插件[%s,%s]加载失败,1s后重试... Exception:%s ", readerPluginName, writerPluginName, e.getMessage())); + try { + Thread.sleep(1000); + } catch (InterruptedException e1) { + // + } + configuration.merge(parsePluginConfig(new ArrayList(pluginList)), false); + } + + return configuration; + } + + private static Configuration parseCoreConfig(final String path) { + return Configuration.from(new File(path)); + } + + public static Configuration parseJobConfig(final String path) { + String jobContent = getJobContent(path); + Configuration config = Configuration.from(jobContent); + + return SecretUtil.decryptSecretKey(config); + } + + private static String getJobContent(String jobResource) { + String jobContent; + + boolean isJobResourceFromHttp = jobResource.trim().toLowerCase().startsWith("http"); + + + if (isJobResourceFromHttp) { + //设置httpclient的 HTTP_TIMEOUT_INMILLIONSECONDS + Configuration coreConfig = ConfigParser.parseCoreConfig(CoreConstant.DATAX_CONF_PATH); + int httpTimeOutInMillionSeconds = coreConfig.getInt( + CoreConstant.DATAX_CORE_DATAXSERVER_TIMEOUT, 5000); + HttpClientUtil.setHttpTimeoutInMillionSeconds(httpTimeOutInMillionSeconds); + + HttpClientUtil httpClientUtil = new HttpClientUtil(); + try { + URL url = new URL(jobResource); + HttpGet httpGet = HttpClientUtil.getGetRequest(); + httpGet.setURI(url.toURI()); + + jobContent = httpClientUtil.executeAndGetWithFailedRetry(httpGet, 1, 1000l); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "获取作业配置信息失败:" + jobResource, e); + } + } else { + // jobResource 是本地文件绝对路径 + try { + jobContent = FileUtils.readFileToString(new File(jobResource)); + } catch (IOException e) { + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "获取作业配置信息失败:" + jobResource, e); + } + } + + if (jobContent == null) { + throw DataXException.asDataXException(FrameworkErrorCode.CONFIG_ERROR, "获取作业配置信息失败:" + jobResource); + } + return jobContent; + } + + public static Configuration parsePluginConfig(List wantPluginNames) { + Configuration configuration = Configuration.newDefault(); + + Set replicaCheckPluginSet = new HashSet(); + int complete = 0; + for (final String each : ConfigParser + .getDirAsList(CoreConstant.DATAX_PLUGIN_READER_HOME)) { + Configuration eachReaderConfig = ConfigParser.parseOnePluginConfig(each, "reader", replicaCheckPluginSet, wantPluginNames); + if(eachReaderConfig!=null) { + configuration.merge(eachReaderConfig, true); + complete += 1; + } + } + + for (final String each : ConfigParser + .getDirAsList(CoreConstant.DATAX_PLUGIN_WRITER_HOME)) { + Configuration eachWriterConfig = ConfigParser.parseOnePluginConfig(each, "writer", replicaCheckPluginSet, wantPluginNames); + if(eachWriterConfig!=null) { + configuration.merge(eachWriterConfig, true); + complete += 1; + } + } + + if (wantPluginNames != null && 
wantPluginNames.size() > 0 && wantPluginNames.size() != complete) { + throw DataXException.asDataXException(FrameworkErrorCode.PLUGIN_INIT_ERROR, "插件加载失败,未完成指定插件加载:" + wantPluginNames); + } + + return configuration; + } + + + public static Configuration parseOnePluginConfig(final String path, + final String type, + Set pluginSet, List wantPluginNames) { + String filePath = path + File.separator + "plugin.json"; + Configuration configuration = Configuration.from(new File(filePath)); + + String pluginPath = configuration.getString("path"); + String pluginName = configuration.getString("name"); + if(!pluginSet.contains(pluginName)) { + pluginSet.add(pluginName); + } else { + throw DataXException.asDataXException(FrameworkErrorCode.PLUGIN_INIT_ERROR, "插件加载失败,存在重复插件:" + filePath); + } + + //不是想要的插件,返回null + if (wantPluginNames != null && wantPluginNames.size() > 0 && !wantPluginNames.contains(pluginName)) { + return null; + } + + boolean isDefaultPath = StringUtils.isBlank(pluginPath); + if (isDefaultPath) { + configuration.set("path", path); + } + + Configuration result = Configuration.newDefault(); + + result.set( + String.format("plugin.%s.%s", type, pluginName), + configuration.getInternal()); + + return result; + } + + private static List getDirAsList(String path) { + List result = new ArrayList(); + + String[] paths = new File(path).list(); + if (null == paths) { + return result; + } + + for (final String each : paths) { + result.add(path + File.separator + each); + } + + return result; + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ConfigurationValidate.java b/core/src/main/java/com/alibaba/datax/core/util/ConfigurationValidate.java new file mode 100755 index 0000000000..bc15bcf144 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ConfigurationValidate.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang.Validate; + +/** + * Created by jingxing on 14-9-16. + * + * 对配置文件做整体检查 + */ +public class ConfigurationValidate { + public static void doValidate(Configuration allConfig) { + Validate.isTrue(allConfig!=null, ""); + + coreValidate(allConfig); + + pluginValidate(allConfig); + + jobValidate(allConfig); + } + + private static void coreValidate(Configuration allconfig) { + return; + } + + private static void pluginValidate(Configuration allConfig) { + return; + } + + private static void jobValidate(Configuration allConfig) { + return; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ErrorRecordChecker.java b/core/src/main/java/com/alibaba/datax/core/util/ErrorRecordChecker.java new file mode 100755 index 0000000000..ad7f80f614 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ErrorRecordChecker.java @@ -0,0 +1,82 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.statistics.communication.Communication; +import com.alibaba.datax.core.statistics.communication.CommunicationTool; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang3.Validate; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * 检查任务是否到达错误记录限制。有检查条数(recordLimit)和百分比(percentageLimit)两种方式。 + * 1. errorRecord表示出错条数不能大于限制数,当超过时任务失败。比如errorRecord为0表示不容许任何脏数据。 + * 2. errorPercentage表示出错比例,在任务结束时校验。 + * 3. 
errorRecord优先级高于errorPercentage。 + */ +public final class ErrorRecordChecker { + private static final Logger LOG = LoggerFactory + .getLogger(ErrorRecordChecker.class); + + private Long recordLimit; + private Double percentageLimit; + + public ErrorRecordChecker(Configuration configuration) { + this(configuration.getLong(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_RECORD), + configuration.getDouble(CoreConstant.DATAX_JOB_SETTING_ERRORLIMIT_PERCENT)); + } + + public ErrorRecordChecker(Long rec, Double percentage) { + recordLimit = rec; + percentageLimit = percentage; + + if (percentageLimit != null) { + Validate.isTrue(0.0 <= percentageLimit && percentageLimit <= 1.0, + "脏数据百分比限制应该在[0.0, 1.0]之间"); + } + + if (recordLimit != null) { + Validate.isTrue(recordLimit >= 0, + "脏数据条数现在应该为非负整数"); + + // errorRecord优先级高于errorPercentage. + percentageLimit = null; + } + } + + public void checkRecordLimit(Communication communication) { + if (recordLimit == null) { + return; + } + + long errorNumber = CommunicationTool.getTotalErrorRecords(communication); + if (recordLimit < errorNumber) { + LOG.debug( + String.format("Error-limit set to %d, error count check.", + recordLimit)); + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_DIRTY_DATA_LIMIT_EXCEED, + String.format("脏数据条数检查不通过,限制是[%d]条,但实际上捕获了[%d]条.", + recordLimit, errorNumber)); + } + } + + public void checkPercentageLimit(Communication communication) { + if (percentageLimit == null) { + return; + } + LOG.debug(String.format( + "Error-limit set to %f, error percent check.", percentageLimit)); + + long total = CommunicationTool.getTotalReadRecords(communication); + long error = CommunicationTool.getTotalErrorRecords(communication); + + if (total > 0 && ((double) error / (double) total) > percentageLimit) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_DIRTY_DATA_LIMIT_EXCEED, + String.format("脏数据百分比检查不通过,限制是[%f],但实际上捕获到[%f].", + percentageLimit, ((double) error / (double) total))); + } + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/ExceptionTracker.java b/core/src/main/java/com/alibaba/datax/core/util/ExceptionTracker.java new file mode 100755 index 0000000000..d06f6798c0 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/ExceptionTracker.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.core.util; + +import java.io.PrintWriter; +import java.io.StringWriter; + +public class ExceptionTracker { + public static final int STRING_BUFFER = 4096; + + public static String trace(Throwable ex) { + StringWriter sw = new StringWriter(STRING_BUFFER); + PrintWriter pw = new PrintWriter(sw); + ex.printStackTrace(pw); + return sw.toString(); + } +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/core/util/FrameworkErrorCode.java b/core/src/main/java/com/alibaba/datax/core/util/FrameworkErrorCode.java new file mode 100755 index 0000000000..f50f793504 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/FrameworkErrorCode.java @@ -0,0 +1,68 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * TODO: 根据现有日志数据分析各类错误,进行细化。 + * + *

+ * <pre>
+ * 请不要格式化本类代码
+ * </pre>

+ */ +public enum FrameworkErrorCode implements ErrorCode { + + INSTALL_ERROR("Framework-00", "DataX引擎安装错误, 请联系您的运维解决 ."), + ARGUMENT_ERROR("Framework-01", "DataX引擎运行错误,该问题通常是由于内部编程错误引起,请联系DataX开发团队解决 ."), + RUNTIME_ERROR("Framework-02", "DataX引擎运行过程出错,具体原因请参看DataX运行结束时的错误诊断信息 ."), + CONFIG_ERROR("Framework-03", "DataX引擎配置错误,该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + SECRET_ERROR("Framework-04", "DataX引擎加解密出错,该问题通常是由于DataX密钥配置错误引起,请联系您的运维解决 ."), + HOOK_LOAD_ERROR("Framework-05", "加载外部Hook出现错误,通常是由于DataX安装引起的"), + HOOK_FAIL_ERROR("Framework-06", "执行外部Hook出现错误"), + + PLUGIN_INSTALL_ERROR("Framework-10", "DataX插件安装错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + PLUGIN_NOT_FOUND("Framework-11", "DataX插件配置错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + PLUGIN_INIT_ERROR("Framework-12", "DataX插件初始化错误, 该问题通常是由于DataX安装错误引起,请联系您的运维解决 ."), + PLUGIN_RUNTIME_ERROR("Framework-13", "DataX插件运行时出错, 具体原因请参看DataX运行结束时的错误诊断信息 ."), + PLUGIN_DIRTY_DATA_LIMIT_EXCEED("Framework-14", "DataX传输脏数据超过用户预期,该错误通常是由于源端数据存在较多业务脏数据导致,请仔细检查DataX汇报的脏数据日志信息, 或者您可以适当调大脏数据阈值 ."), + PLUGIN_SPLIT_ERROR("Framework-15", "DataX插件切分出错, 该问题通常是由于DataX各个插件编程错误引起,请联系DataX开发团队解决"), + KILL_JOB_TIMEOUT_ERROR("Framework-16", "kill 任务超时,请联系PE解决"), + START_TASKGROUP_ERROR("Framework-17", "taskGroup启动失败,请联系DataX开发团队解决"), + CALL_DATAX_SERVICE_FAILED("Framework-18", "请求 DataX Service 出错."), + CALL_REMOTE_FAILED("Framework-19", "远程调用失败"), + KILLED_EXIT_VALUE("Framework-143", "Job 收到了 Kill 命令."); + + private final String code; + + private final String description; + + private FrameworkErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } + + /** + * 通过 "Framework-143" 来标示 任务是 Killed 状态 + */ + public int toExitValue() { + if (this == FrameworkErrorCode.KILLED_EXIT_VALUE) { + return 143; + } else { + return 1; + } + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/HttpClientUtil.java b/core/src/main/java/com/alibaba/datax/core/util/HttpClientUtil.java new file mode 100755 index 0000000000..ea66f36725 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/HttpClientUtil.java @@ -0,0 +1,171 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.RetryUtil; +import org.apache.http.Consts; +import org.apache.http.HttpEntity; +import org.apache.http.HttpResponse; +import org.apache.http.HttpStatus; +import org.apache.http.auth.AuthScope; +import org.apache.http.auth.UsernamePasswordCredentials; +import org.apache.http.client.CredentialsProvider; +import org.apache.http.client.config.RequestConfig; +import org.apache.http.client.methods.*; +import org.apache.http.impl.client.BasicCredentialsProvider; +import org.apache.http.impl.client.CloseableHttpClient; +import org.apache.http.impl.client.HttpClientBuilder; +import org.apache.http.util.EntityUtils; + +import java.io.IOException; +import java.util.Properties; +import java.util.concurrent.Callable; +import java.util.concurrent.ThreadPoolExecutor; + + +public class HttpClientUtil { + + private static CredentialsProvider provider; + + private CloseableHttpClient httpClient; + + private volatile static HttpClientUtil clientUtil; + + //构建httpclient的时候一定要设置这两个参数。淘宝很多生产故障都由此引起 + private static int HTTP_TIMEOUT_INMILLIONSECONDS = 5000; + + private static final int POOL_SIZE = 20; + + private static ThreadPoolExecutor asyncExecutor = RetryUtil.createThreadPoolExecutor(); + + public static void setHttpTimeoutInMillionSeconds(int httpTimeoutInMillionSeconds) { + HTTP_TIMEOUT_INMILLIONSECONDS = httpTimeoutInMillionSeconds; + } + + public static synchronized HttpClientUtil getHttpClientUtil() { + if (null == clientUtil) { + synchronized (HttpClientUtil.class) { + if (null == clientUtil) { + clientUtil = new HttpClientUtil(); + } + } + } + return clientUtil; + } + + public HttpClientUtil() { + Properties prob = SecretUtil.getSecurityProperties(); + HttpClientUtil.setBasicAuth(prob.getProperty("auth.user"),prob.getProperty("auth.pass")); + initApacheHttpClient(); + } + + public void destroy() { + destroyApacheHttpClient(); + } + + public static void setBasicAuth(String username, String password) { + provider = new BasicCredentialsProvider(); + provider.setCredentials(AuthScope.ANY, + new UsernamePasswordCredentials(username,password)); + } + + // 创建包含connection pool与超时设置的client + private void initApacheHttpClient() { + RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(HTTP_TIMEOUT_INMILLIONSECONDS) + .setConnectTimeout(HTTP_TIMEOUT_INMILLIONSECONDS).setConnectionRequestTimeout(HTTP_TIMEOUT_INMILLIONSECONDS) + .setStaleConnectionCheckEnabled(true).build(); + + if(null == provider) { + httpClient = HttpClientBuilder.create().setMaxConnTotal(POOL_SIZE).setMaxConnPerRoute(POOL_SIZE) + .setDefaultRequestConfig(requestConfig).build(); + } else { + httpClient = HttpClientBuilder.create().setMaxConnTotal(POOL_SIZE).setMaxConnPerRoute(POOL_SIZE) + .setDefaultRequestConfig(requestConfig).setDefaultCredentialsProvider(provider).build(); + } + } + + private void destroyApacheHttpClient() { + try { + if (httpClient != null) { + 
httpClient.close(); + httpClient = null; + } + } catch (IOException e) { + e.printStackTrace(); + } + } + + public static HttpGet getGetRequest() { + return new HttpGet(); + } + + public static HttpPost getPostRequest() { + return new HttpPost(); + } + + public static HttpPut getPutRequest() { + return new HttpPut(); + } + + public static HttpDelete getDeleteRequest() { + return new HttpDelete(); + } + + public String executeAndGet(HttpRequestBase httpRequestBase) throws Exception { + HttpResponse response; + String entiStr = ""; + try { + response = httpClient.execute(httpRequestBase); + + if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) { + System.err.println("请求地址:" + httpRequestBase.getURI() + ", 请求方法:" + httpRequestBase.getMethod() + + ",STATUS CODE = " + response.getStatusLine().getStatusCode()); + if (httpRequestBase != null) { + httpRequestBase.abort(); + } + throw new Exception("Response Status Code : " + response.getStatusLine().getStatusCode()); + } else { + HttpEntity entity = response.getEntity(); + if (entity != null) { + entiStr = EntityUtils.toString(entity, Consts.UTF_8); + } else { + throw new Exception("Response Entity Is Null"); + } + } + } catch (Exception e) { + throw e; + } + + return entiStr; + } + + public String executeAndGetWithRetry(final HttpRequestBase httpRequestBase, final int retryTimes, final long retryInterval) { + try { + return RetryUtil.asyncExecuteWithRetry(new Callable() { + @Override + public String call() throws Exception { + return executeAndGet(httpRequestBase); + } + }, retryTimes, retryInterval, true, HTTP_TIMEOUT_INMILLIONSECONDS + 1000, asyncExecutor); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + + public String executeAndGetWithFailedRetry(final HttpRequestBase httpRequestBase, final int retryTimes, final long retryInterval){ + try { + return RetryUtil.asyncExecuteWithRetry(new Callable() { + @Override + public String call() throws Exception { + String result = executeAndGet(httpRequestBase); + if(result!=null && result.startsWith("{\"result\":-1")){ + throw DataXException.asDataXException(FrameworkErrorCode.CALL_REMOTE_FAILED, "远程接口返回-1,将重试"); + } + return result; + } + }, retryTimes, retryInterval, true, HTTP_TIMEOUT_INMILLIONSECONDS + 1000, asyncExecutor); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/SecretUtil.java b/core/src/main/java/com/alibaba/datax/core/util/SecretUtil.java new file mode 100755 index 0000000000..1a576aaa08 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/SecretUtil.java @@ -0,0 +1,440 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.util.container.CoreConstant; + +import org.apache.commons.codec.binary.Base64; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang3.tuple.ImmutableTriple; +import org.apache.commons.lang3.tuple.Triple; + +import javax.crypto.Cipher; +import javax.crypto.SecretKey; +import javax.crypto.spec.SecretKeySpec; + +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.IOException; +import java.io.InputStream; +import java.security.Key; +import java.security.KeyFactory; +import java.security.KeyPair; +import java.security.KeyPairGenerator; +import 
java.security.interfaces.RSAPrivateKey; +import java.security.interfaces.RSAPublicKey; +import java.security.spec.PKCS8EncodedKeySpec; +import java.security.spec.X509EncodedKeySpec; +import java.util.HashMap; +import java.util.Map; +import java.util.Properties; + +/** + * Created by jingxing on 14/12/15. + */ +public class SecretUtil { + private static Properties properties; + + //RSA Key:keyVersion value:left:privateKey, right:publicKey, middle: type + //DESede Key: keyVersion value:left:keyContent, right:keyContent, middle: type + private static Map> versionKeyMap; + + private static final String ENCODING = "UTF-8"; + + public static final String KEY_ALGORITHM_RSA = "RSA"; + + public static final String KEY_ALGORITHM_3DES = "DESede"; + + private static final String CIPHER_ALGORITHM_3DES = "DESede/ECB/PKCS5Padding"; + + private static final Base64 base64 = new Base64(); + + /** + * BASE64加密 + * + * @param plaintextBytes + * @return + * @throws Exception + */ + public static String encryptBASE64(byte[] plaintextBytes) throws Exception { + return new String(base64.encode(plaintextBytes), ENCODING); + } + + /** + * BASE64解密 + * + * @param cipherText + * @return + * @throws Exception + */ + public static byte[] decryptBASE64(String cipherText) { + return base64.decode(cipherText); + } + + /** + * 加密
+ * @param data 裸的原始数据 + * @param key 经过base64加密的公钥(RSA)或者裸密钥(3DES) + * */ + public static String encrypt(String data, String key, String method) { + if (SecretUtil.KEY_ALGORITHM_RSA.equals(method)) { + return SecretUtil.encryptRSA(data, key); + } else if (SecretUtil.KEY_ALGORITHM_3DES.equals(method)) { + return SecretUtil.encrypt3DES(data, key); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("系统编程错误,不支持的加密类型", method)); + } + } + + /** + * 解密
+ * @param data 已经经过base64加密的密文 + * @param key 已经经过base64加密私钥(RSA)或者裸密钥(3DES) + * */ + public static String decrypt(String data, String key, String method) { + if (SecretUtil.KEY_ALGORITHM_RSA.equals(method)) { + return SecretUtil.decryptRSA(data, key); + } else if (SecretUtil.KEY_ALGORITHM_3DES.equals(method)) { + return SecretUtil.decrypt3DES(data, key); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("系统编程错误,不支持的加密类型", method)); + } + } + + /** + * 加密
+ * 用公钥加密 encryptByPublicKey + * + * @param data 裸的原始数据 + * @param key 经过base64加密的公钥 + * @return 结果也采用base64加密 + * @throws Exception + */ + public static String encryptRSA(String data, String key) { + try { + // 对公钥解密,公钥被base64加密过 + byte[] keyBytes = decryptBASE64(key); + + // 取得公钥 + X509EncodedKeySpec x509KeySpec = new X509EncodedKeySpec(keyBytes); + KeyFactory keyFactory = KeyFactory.getInstance(KEY_ALGORITHM_RSA); + Key publicKey = keyFactory.generatePublic(x509KeySpec); + + // 对数据加密 + Cipher cipher = Cipher.getInstance(keyFactory.getAlgorithm()); + cipher.init(Cipher.ENCRYPT_MODE, publicKey); + + return encryptBASE64(cipher.doFinal(data.getBytes(ENCODING))); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "rsa加密出错", e); + } + } + + /** + * 解密
+ * 用私钥解密 + * + * @param data 已经经过base64加密的密文 + * @param key 已经经过base64加密私钥 + * @return + * @throws Exception + */ + public static String decryptRSA(String data, String key) { + try { + // 对密钥解密 + byte[] keyBytes = decryptBASE64(key); + + // 取得私钥 + PKCS8EncodedKeySpec pkcs8KeySpec = new PKCS8EncodedKeySpec(keyBytes); + KeyFactory keyFactory = KeyFactory.getInstance(KEY_ALGORITHM_RSA); + Key privateKey = keyFactory.generatePrivate(pkcs8KeySpec); + + // 对数据解密 + Cipher cipher = Cipher.getInstance(keyFactory.getAlgorithm()); + cipher.init(Cipher.DECRYPT_MODE, privateKey); + + return new String(cipher.doFinal(decryptBASE64(data)), ENCODING); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "rsa解密出错", e); + } + } + + /** + * 初始化密钥 for RSA ALGORITHM + * + * @return + * @throws Exception + */ + public static String[] initKey() throws Exception { + KeyPairGenerator keyPairGen = KeyPairGenerator + .getInstance(KEY_ALGORITHM_RSA); + keyPairGen.initialize(1024); + + KeyPair keyPair = keyPairGen.generateKeyPair(); + + // 公钥 + RSAPublicKey publicKey = (RSAPublicKey) keyPair.getPublic(); + + // 私钥 + RSAPrivateKey privateKey = (RSAPrivateKey) keyPair.getPrivate(); + + String[] publicAndPrivateKey = { + encryptBASE64(publicKey.getEncoded()), + encryptBASE64(privateKey.getEncoded())}; + + return publicAndPrivateKey; + } + + /** + * 加密 DESede
+ * 用密钥加密 + * + * @param data 裸的原始数据 + * @param key 加密的密钥 + * @return 结果也采用base64加密 + * @throws Exception + */ + public static String encrypt3DES(String data, String key) { + try { + // 生成密钥 + SecretKey desKey = new SecretKeySpec(build3DesKey(key), + KEY_ALGORITHM_3DES); + // 对数据加密 + Cipher cipher = Cipher.getInstance(CIPHER_ALGORITHM_3DES); + cipher.init(Cipher.ENCRYPT_MODE, desKey); + return encryptBASE64(cipher.doFinal(data.getBytes(ENCODING))); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "3重DES加密出错", e); + } + } + + /** + * 解密
+ * 用密钥解密 + * + * @param data 已经经过base64加密的密文 + * @param key 解密的密钥 + * @return + * @throws Exception + */ + public static String decrypt3DES(String data, String key) { + try { + // 生成密钥 + SecretKey desKey = new SecretKeySpec(build3DesKey(key), + KEY_ALGORITHM_3DES); + // 对数据解密 + Cipher cipher = Cipher.getInstance(CIPHER_ALGORITHM_3DES); + cipher.init(Cipher.DECRYPT_MODE, desKey); + return new String(cipher.doFinal(decryptBASE64(data)), ENCODING); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "rsa解密出错", e); + } + } + + /** + * 根据字符串生成密钥字节数组 + * + * @param keyStr + * 密钥字符串 + * @return key 符合DESede标准的24byte数组 + */ + private static byte[] build3DesKey(String keyStr) { + try { + // 声明一个24位的字节数组,默认里面都是0,warn: 字符串0(48)和数组默认值0不一样,统一字符串0(48) + byte[] key = "000000000000000000000000".getBytes(ENCODING); + byte[] temp = keyStr.getBytes(ENCODING); + if (key.length > temp.length) { + // 如果temp不够24位,则拷贝temp数组整个长度的内容到key数组中 + System.arraycopy(temp, 0, key, 0, temp.length); + } else { + // 如果temp大于24位,则拷贝temp数组24个长度的内容到key数组中 + System.arraycopy(temp, 0, key, 0, key.length); + } + return key; + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "构建三重DES密匙出错", e); + } + } + + public static synchronized Properties getSecurityProperties() { + if (properties == null) { + InputStream secretStream = null; + try { + secretStream = new FileInputStream( + CoreConstant.DATAX_SECRET_PATH); + } catch (FileNotFoundException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + "DataX配置要求加解密,但无法找到密钥的配置文件"); + } + + properties = new Properties(); + try { + properties.load(secretStream); + secretStream.close(); + } catch (IOException e) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "读取加解密配置文件出错", e); + } + } + + return properties; + } + + + public static Configuration encryptSecretKey(Configuration configuration) { + String keyVersion = configuration + .getString(CoreConstant.DATAX_JOB_SETTING_KEYVERSION); + // 没有设置keyVersion,表示不用解密 + if (StringUtils.isBlank(keyVersion)) { + return configuration; + } + + Map> versionKeyMap = getPrivateKeyMap(); + + if (null == versionKeyMap.get(keyVersion)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,任务密钥配置错误,不存在您配置的密钥版本", keyVersion)); + } + + String key = versionKeyMap.get(keyVersion).getRight(); + String method = versionKeyMap.get(keyVersion).getMiddle(); + // keyVersion要求的私钥没有配置 + if (StringUtils.isBlank(key)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,可能是任务密钥配置错误,也可能是系统维护问题", keyVersion)); + } + + String tempEncrptedData = null; + for (String path : configuration.getSecretKeyPathSet()) { + tempEncrptedData = SecretUtil.encrypt(configuration.getString(path), key, method); + int lastPathIndex = path.lastIndexOf(".") + 1; + String lastPathKey = path.substring(lastPathIndex); + + String newPath = path.substring(0, lastPathIndex) + "*" + + lastPathKey; + configuration.set(newPath, tempEncrptedData); + configuration.remove(path); + } + + return configuration; + } + + public static Configuration decryptSecretKey(Configuration config) { + String keyVersion = config + .getString(CoreConstant.DATAX_JOB_SETTING_KEYVERSION); + // 没有设置keyVersion,表示不用解密 + if (StringUtils.isBlank(keyVersion)) { + return config; + } + + Map> versionKeyMap = getPrivateKeyMap(); + if (null == 
versionKeyMap.get(keyVersion)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,任务密钥配置错误,不存在您配置的密钥版本", keyVersion)); + } + String decryptKey = versionKeyMap.get(keyVersion).getLeft(); + String method = versionKeyMap.get(keyVersion).getMiddle(); + // keyVersion要求的私钥没有配置 + if (StringUtils.isBlank(decryptKey)) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, + String.format("DataX配置的密钥版本为[%s],但在系统中没有配置,可能是任务密钥配置错误,也可能是系统维护问题", keyVersion)); + } + + // 对包含*号key解密处理 + for (String key : config.getKeys()) { + int lastPathIndex = key.lastIndexOf(".") + 1; + String lastPathKey = key.substring(lastPathIndex); + if (lastPathKey.length() > 1 && lastPathKey.charAt(0) == '*' + && lastPathKey.charAt(1) != '*') { + Object value = config.get(key); + if (value instanceof String) { + String newKey = key.substring(0, lastPathIndex) + + lastPathKey.substring(1); + config.set(newKey, + SecretUtil.decrypt((String) value, decryptKey, method)); + config.addSecretKeyPath(newKey); + config.remove(key); + } + } + } + + return config; + } + + private static synchronized Map> getPrivateKeyMap() { + if (versionKeyMap == null) { + versionKeyMap = new HashMap>(); + Properties properties = SecretUtil.getSecurityProperties(); + + String[] serviceUsernames = new String[] { + CoreConstant.LAST_SERVICE_USERNAME, + CoreConstant.CURRENT_SERVICE_USERNAME }; + String[] servicePasswords = new String[] { + CoreConstant.LAST_SERVICE_PASSWORD, + CoreConstant.CURRENT_SERVICE_PASSWORD }; + + for (int i = 0; i < serviceUsernames.length; i++) { + String serviceUsername = properties + .getProperty(serviceUsernames[i]); + if (StringUtils.isNotBlank(serviceUsername)) { + String servicePassword = properties + .getProperty(servicePasswords[i]); + if (StringUtils.isNotBlank(servicePassword)) { + versionKeyMap.put(serviceUsername, ImmutableTriple.of( + servicePassword, SecretUtil.KEY_ALGORITHM_3DES, + servicePassword)); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, String.format( + "DataX配置要求加解密,但配置的密钥版本[%s]存在密钥为空的情况", + serviceUsername)); + } + } + } + + String[] keyVersions = new String[] { CoreConstant.LAST_KEYVERSION, + CoreConstant.CURRENT_KEYVERSION }; + String[] privateKeys = new String[] { CoreConstant.LAST_PRIVATEKEY, + CoreConstant.CURRENT_PRIVATEKEY }; + String[] publicKeys = new String[] { CoreConstant.LAST_PUBLICKEY, + CoreConstant.CURRENT_PUBLICKEY }; + for (int i = 0; i < keyVersions.length; i++) { + String keyVersion = properties.getProperty(keyVersions[i]); + if (StringUtils.isNotBlank(keyVersion)) { + String privateKey = properties.getProperty(privateKeys[i]); + String publicKey = properties.getProperty(publicKeys[i]); + if (StringUtils.isNotBlank(privateKey) + && StringUtils.isNotBlank(publicKey)) { + versionKeyMap.put(keyVersion, ImmutableTriple.of( + privateKey, SecretUtil.KEY_ALGORITHM_RSA, + publicKey)); + } else { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, String.format( + "DataX配置要求加解密,但配置的公私钥对存在为空的情况,版本[%s]", + keyVersion)); + } + } + } + } + if (versionKeyMap.size() <= 0) { + throw DataXException.asDataXException( + FrameworkErrorCode.SECRET_ERROR, "DataX配置要求加解密,但无法找到加解密配置"); + } + return versionKeyMap; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/TransformerUtil.java b/core/src/main/java/com/alibaba/datax/core/util/TransformerUtil.java new file mode 100644 index 0000000000..1b46962341 --- /dev/null +++ 
b/core/src/main/java/com/alibaba/datax/core/util/TransformerUtil.java @@ -0,0 +1,107 @@ +package com.alibaba.datax.core.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.transport.transformer.*; +import com.alibaba.datax.core.util.container.CoreConstant; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * no comments. + * Created by liqiang on 16/3/9. + */ +public class TransformerUtil { + + private static final Logger LOG = LoggerFactory.getLogger(TransformerUtil.class); + + public static List buildTransformerInfo(Configuration taskConfig) { + List tfConfigs = taskConfig.getListConfiguration(CoreConstant.JOB_TRANSFORMER); + if (tfConfigs == null || tfConfigs.size() == 0) { + return null; + } + + List result = new ArrayList(); + + + List functionNames = new ArrayList(); + + + for (Configuration configuration : tfConfigs) { + String functionName = configuration.getString("name"); + if (StringUtils.isEmpty(functionName)) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_CONFIGURATION_ERROR, "config=" + configuration.toJSON()); + } + + if (functionName.equals("dx_groovy") && functionNames.contains("dx_groovy")) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_CONFIGURATION_ERROR, "dx_groovy can be invoke once only."); + } + functionNames.add(functionName); + } + + /** + * 延迟load 第三方插件的function,并按需load + */ + LOG.info(String.format(" user config tranformers [%s], loading...", functionNames)); + TransformerRegistry.loadTransformerFromLocalStorage(functionNames); + + int i = 0; + + for (Configuration configuration : tfConfigs) { + String functionName = configuration.getString("name"); + TransformerInfo transformerInfo = TransformerRegistry.getTransformer(functionName); + if (transformerInfo == null) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_NOTFOUND_ERROR, "name=" + functionName); + } + + /** + * 具体的UDF对应一个paras + */ + TransformerExecutionParas transformerExecutionParas = new TransformerExecutionParas(); + /** + * groovy function仅仅只有code + */ + if (!functionName.equals("dx_groovy") && !functionName.equals("dx_fackGroovy")) { + Integer columnIndex = configuration.getInt(CoreConstant.TRANSFORMER_PARAMETER_COLUMNINDEX); + + if (columnIndex == null) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "columnIndex must be set by UDF:name=" + functionName); + } + + transformerExecutionParas.setColumnIndex(columnIndex); + List paras = configuration.getList(CoreConstant.TRANSFORMER_PARAMETER_PARAS, String.class); + if (paras != null && paras.size() > 0) { + transformerExecutionParas.setParas(paras.toArray(new String[0])); + } + } else { + String code = configuration.getString(CoreConstant.TRANSFORMER_PARAMETER_CODE); + if (StringUtils.isEmpty(code)) { + throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "groovy code must be set by UDF:name=" + functionName); + } + transformerExecutionParas.setCode(code); + + List extraPackage = configuration.getList(CoreConstant.TRANSFORMER_PARAMETER_EXTRAPACKAGE, String.class); + if (extraPackage != null && extraPackage.size() > 0) { + transformerExecutionParas.setExtraPackage(extraPackage); + } + } + 
transformerExecutionParas.settContext(configuration.getMap(CoreConstant.TRANSFORMER_PARAMETER_CONTEXT)); + + TransformerExecution transformerExecution = new TransformerExecution(transformerInfo, transformerExecutionParas); + + transformerExecution.genFinalParas(); + result.add(transformerExecution); + i++; + LOG.info(String.format(" %s of transformer init success. name=%s, isNative=%s parameter = %s" + , i, transformerInfo.getTransformer().getTransformerName() + , transformerInfo.isNative(), configuration.getConfiguration("parameter"))); + } + + return result; + + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/ClassLoaderSwapper.java b/core/src/main/java/com/alibaba/datax/core/util/container/ClassLoaderSwapper.java new file mode 100755 index 0000000000..b878cf0905 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/container/ClassLoaderSwapper.java @@ -0,0 +1,41 @@ +package com.alibaba.datax.core.util.container; + +/** + * Created by jingxing on 14-8-29. + * + * 为避免jar冲突,比如hbase可能有多个版本的读写依赖jar包,JobContainer和TaskGroupContainer + * 就需要脱离当前classLoader去加载这些jar包,执行完成后,又退回到原来classLoader上继续执行接下来的代码 + */ +public final class ClassLoaderSwapper { + private ClassLoader storeClassLoader = null; + + private ClassLoaderSwapper() { + } + + public static ClassLoaderSwapper newCurrentThreadClassLoaderSwapper() { + return new ClassLoaderSwapper(); + } + + /** + * 保存当前classLoader,并将当前线程的classLoader设置为所给classLoader + * + * @param + * @return + */ + public ClassLoader setCurrentThreadClassLoader(ClassLoader classLoader) { + this.storeClassLoader = Thread.currentThread().getContextClassLoader(); + Thread.currentThread().setContextClassLoader(classLoader); + return this.storeClassLoader; + } + + /** + * 将当前线程的类加载器设置为保存的类加载 + * @return + */ + public ClassLoader restoreCurrentThreadClassLoader() { + ClassLoader classLoader = Thread.currentThread() + .getContextClassLoader(); + Thread.currentThread().setContextClassLoader(this.storeClassLoader); + return classLoader; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/CoreConstant.java b/core/src/main/java/com/alibaba/datax/core/util/container/CoreConstant.java new file mode 100755 index 0000000000..6a0b6205e2 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/container/CoreConstant.java @@ -0,0 +1,189 @@ +package com.alibaba.datax.core.util.container; + +import org.apache.commons.lang.StringUtils; + +import java.io.File; + +/** + * Created by jingxing on 14-8-25. 
+ */ +public class CoreConstant { + // --------------------------- 全局使用的变量(最好按照逻辑顺序,调整下成员变量顺序) + // -------------------------------- + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_CHANNEL = "core.container.taskGroup.channel"; + + public static final String DATAX_CORE_CONTAINER_MODEL = "core.container.model"; + + public static final String DATAX_CORE_CONTAINER_JOB_ID = "core.container.job.id"; + + public static final String DATAX_CORE_CONTAINER_TRACE_ENABLE = "core.container.trace.enable"; + + public static final String DATAX_CORE_CONTAINER_JOB_MODE = "core.container.job.mode"; + + public static final String DATAX_CORE_CONTAINER_JOB_REPORTINTERVAL = "core.container.job.reportInterval"; + + public static final String DATAX_CORE_CONTAINER_JOB_SLEEPINTERVAL = "core.container.job.sleepInterval"; + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_ID = "core.container.taskGroup.id"; + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_SLEEPINTERVAL = "core.container.taskGroup.sleepInterval"; + + public static final String DATAX_CORE_CONTAINER_TASKGROUP_REPORTINTERVAL = "core.container.taskGroup.reportInterval"; + + public static final String DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXRETRYTIMES = "core.container.task.failOver.maxRetryTimes"; + + public static final String DATAX_CORE_CONTAINER_TASK_FAILOVER_RETRYINTERVALINMSEC = "core.container.task.failOver.retryIntervalInMsec"; + + public static final String DATAX_CORE_CONTAINER_TASK_FAILOVER_MAXWAITINMSEC = "core.container.task.failOver.maxWaitInMsec"; + + public static final String DATAX_CORE_DATAXSERVER_ADDRESS = "core.dataXServer.address"; + + public static final String DATAX_CORE_DSC_ADDRESS = "core.dsc.address"; + + public static final String DATAX_CORE_DATAXSERVER_TIMEOUT = "core.dataXServer.timeout"; + + public static final String DATAX_CORE_REPORT_DATAX_LOG = "core.dataXServer.reportDataxLog"; + + public static final String DATAX_CORE_REPORT_DATAX_PERFLOG = "core.dataXServer.reportPerfLog"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_CLASS = "core.transport.channel.class"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY = "core.transport.channel.capacity"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_CAPACITY_BYTE = "core.transport.channel.byteCapacity"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_SPEED_BYTE = "core.transport.channel.speed.byte"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_SPEED_RECORD = "core.transport.channel.speed.record"; + + public static final String DATAX_CORE_TRANSPORT_CHANNEL_FLOWCONTROLINTERVAL = "core.transport.channel.flowControlInterval"; + + public static final String DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE = "core.transport.exchanger.bufferSize"; + + public static final String DATAX_CORE_TRANSPORT_RECORD_CLASS = "core.transport.record.class"; + + public static final String DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_TASKCLASS = "core.statistics.collector.plugin.taskClass"; + + public static final String DATAX_CORE_STATISTICS_COLLECTOR_PLUGIN_MAXDIRTYNUM = "core.statistics.collector.plugin.maxDirtyNumber"; + + public static final String DATAX_JOB_CONTENT_READER_NAME = "job.content[0].reader.name"; + + public static final String DATAX_JOB_CONTENT_READER_PARAMETER = "job.content[0].reader.parameter"; + + public static final String DATAX_JOB_CONTENT_WRITER_NAME = "job.content[0].writer.name"; + + public static final String DATAX_JOB_CONTENT_WRITER_PARAMETER = "job.content[0].writer.parameter"; + + 
public static final String DATAX_JOB_JOBINFO = "job.jobInfo"; + + public static final String DATAX_JOB_CONTENT = "job.content"; + + public static final String DATAX_JOB_CONTENT_TRANSFORMER = "job.content[0].transformer"; + + public static final String DATAX_JOB_SETTING_KEYVERSION = "job.setting.keyVersion"; + + public static final String DATAX_JOB_SETTING_SPEED_BYTE = "job.setting.speed.byte"; + + public static final String DATAX_JOB_SETTING_SPEED_RECORD = "job.setting.speed.record"; + + public static final String DATAX_JOB_SETTING_SPEED_CHANNEL = "job.setting.speed.channel"; + + public static final String DATAX_JOB_SETTING_ERRORLIMIT = "job.setting.errorLimit"; + + public static final String DATAX_JOB_SETTING_ERRORLIMIT_RECORD = "job.setting.errorLimit.record"; + + public static final String DATAX_JOB_SETTING_ERRORLIMIT_PERCENT = "job.setting.errorLimit.percentage"; + + public static final String DATAX_JOB_SETTING_DRYRUN = "job.setting.dryRun"; + + public static final String DATAX_JOB_PREHANDLER_PLUGINTYPE = "job.preHandler.pluginType"; + + public static final String DATAX_JOB_PREHANDLER_PLUGINNAME = "job.preHandler.pluginName"; + + public static final String DATAX_JOB_POSTHANDLER_PLUGINTYPE = "job.postHandler.pluginType"; + + public static final String DATAX_JOB_POSTHANDLER_PLUGINNAME = "job.postHandler.pluginName"; + // ----------------------------- 局部使用的变量 + public static final String JOB_WRITER = "reader"; + + public static final String JOB_READER = "reader"; + + public static final String JOB_TRANSFORMER = "transformer"; + + public static final String JOB_READER_NAME = "reader.name"; + + public static final String JOB_READER_PARAMETER = "reader.parameter"; + + public static final String JOB_WRITER_NAME = "writer.name"; + + public static final String JOB_WRITER_PARAMETER = "writer.parameter"; + + public static final String TRANSFORMER_PARAMETER_COLUMNINDEX = "parameter.columnIndex"; + public static final String TRANSFORMER_PARAMETER_PARAS = "parameter.paras"; + public static final String TRANSFORMER_PARAMETER_CONTEXT = "parameter.context"; + public static final String TRANSFORMER_PARAMETER_CODE = "parameter.code"; + public static final String TRANSFORMER_PARAMETER_EXTRAPACKAGE = "parameter.extraPackage"; + + public static final String TASK_ID = "taskId"; + + // ----------------------------- 安全模块变量 ------------------ + + public static final String LAST_KEYVERSION = "last.keyVersion"; + + public static final String LAST_PUBLICKEY = "last.publicKey"; + + public static final String LAST_PRIVATEKEY = "last.privateKey"; + + public static final String LAST_SERVICE_USERNAME = "last.service.username"; + + public static final String LAST_SERVICE_PASSWORD = "last.service.password"; + + public static final String CURRENT_KEYVERSION = "current.keyVersion"; + + public static final String CURRENT_PUBLICKEY = "current.publicKey"; + + public static final String CURRENT_PRIVATEKEY = "current.privateKey"; + + public static final String CURRENT_SERVICE_USERNAME = "current.service.username"; + + public static final String CURRENT_SERVICE_PASSWORD = "current.service.password"; + + // ----------------------------- 环境变量 --------------------------------- + + public static String DATAX_HOME = System.getProperty("datax.home"); + + public static String DATAX_CONF_PATH = StringUtils.join(new String[] { + DATAX_HOME, "conf", "core.json" }, File.separator); + + public static String DATAX_CONF_LOG_PATH = StringUtils.join(new String[] { + DATAX_HOME, "conf", "logback.xml" }, File.separator); + + public static String 
DATAX_SECRET_PATH = StringUtils.join(new String[] { + DATAX_HOME, "conf", ".secret.properties" }, File.separator); + + public static String DATAX_PLUGIN_HOME = StringUtils.join(new String[] { + DATAX_HOME, "plugin" }, File.separator); + + public static String DATAX_PLUGIN_READER_HOME = StringUtils.join( + new String[] { DATAX_HOME, "plugin", "reader" }, File.separator); + + public static String DATAX_PLUGIN_WRITER_HOME = StringUtils.join( + new String[] { DATAX_HOME, "plugin", "writer" }, File.separator); + + public static String DATAX_BIN_HOME = StringUtils.join(new String[] { + DATAX_HOME, "bin" }, File.separator); + + public static String DATAX_JOB_HOME = StringUtils.join(new String[] { + DATAX_HOME, "job" }, File.separator); + + public static String DATAX_STORAGE_TRANSFORMER_HOME = StringUtils.join( + new String[] { DATAX_HOME, "local_storage", "transformer" }, File.separator); + + public static String DATAX_STORAGE_PLUGIN_READ_HOME = StringUtils.join( + new String[] { DATAX_HOME, "local_storage", "plugin","reader" }, File.separator); + + public static String DATAX_STORAGE_PLUGIN_WRITER_HOME = StringUtils.join( + new String[] { DATAX_HOME, "local_storage", "plugin","writer" }, File.separator); + +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/JarLoader.java b/core/src/main/java/com/alibaba/datax/core/util/container/JarLoader.java new file mode 100755 index 0000000000..9fc113dc6a --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/container/JarLoader.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.core.util.container; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import org.apache.commons.lang.StringUtils; +import org.apache.commons.lang.Validate; + +import java.io.File; +import java.io.FileFilter; +import java.net.URL; +import java.net.URLClassLoader; +import java.util.ArrayList; +import java.util.List; + +/** + * 提供Jar隔离的加载机制,会把传入的路径、及其子路径、以及路径中的jar文件加入到class path。 + */ +public class JarLoader extends URLClassLoader { + public JarLoader(String[] paths) { + this(paths, JarLoader.class.getClassLoader()); + } + + public JarLoader(String[] paths, ClassLoader parent) { + super(getURLs(paths), parent); + } + + private static URL[] getURLs(String[] paths) { + Validate.isTrue(null != paths && 0 != paths.length, + "jar包路径不能为空."); + + List dirs = new ArrayList(); + for (String path : paths) { + dirs.add(path); + JarLoader.collectDirs(path, dirs); + } + + List urls = new ArrayList(); + for (String path : dirs) { + urls.addAll(doGetURLs(path)); + } + + return urls.toArray(new URL[0]); + } + + private static void collectDirs(String path, List collector) { + if (null == path || StringUtils.isBlank(path)) { + return; + } + + File current = new File(path); + if (!current.exists() || !current.isDirectory()) { + return; + } + + for (File child : current.listFiles()) { + if (!child.isDirectory()) { + continue; + } + + collector.add(child.getAbsolutePath()); + collectDirs(child.getAbsolutePath(), collector); + } + } + + private static List doGetURLs(final String path) { + Validate.isTrue(!StringUtils.isBlank(path), "jar包路径不能为空."); + + File jarPath = new File(path); + + Validate.isTrue(jarPath.exists() && jarPath.isDirectory(), + "jar包路径必须存在且为目录."); + + /* set filter */ + FileFilter jarFilter = new FileFilter() { + @Override + public boolean accept(File pathname) { + return pathname.getName().endsWith(".jar"); + } + }; + + /* iterate all jar */ + File[] allJars = new 
File(path).listFiles(jarFilter); + List jarURLs = new ArrayList(allJars.length); + + for (int i = 0; i < allJars.length; i++) { + try { + jarURLs.add(allJars[i].toURI().toURL()); + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_INIT_ERROR, + "系统加载jar包出错", e); + } + } + + return jarURLs; + } +} diff --git a/core/src/main/java/com/alibaba/datax/core/util/container/LoadUtil.java b/core/src/main/java/com/alibaba/datax/core/util/container/LoadUtil.java new file mode 100755 index 0000000000..30e926c385 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/core/util/container/LoadUtil.java @@ -0,0 +1,202 @@ +package com.alibaba.datax.core.util.container; + +import com.alibaba.datax.common.constant.PluginType; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.AbstractJobPlugin; +import com.alibaba.datax.common.plugin.AbstractPlugin; +import com.alibaba.datax.common.plugin.AbstractTaskPlugin; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.core.taskgroup.runner.AbstractRunner; +import com.alibaba.datax.core.taskgroup.runner.ReaderRunner; +import com.alibaba.datax.core.taskgroup.runner.WriterRunner; +import com.alibaba.datax.core.util.FrameworkErrorCode; +import org.apache.commons.lang3.StringUtils; + +import java.util.HashMap; +import java.util.Map; + +/** + * Created by jingxing on 14-8-24. + *

+ * 插件加载器,大体上分reader、transformer(还未实现)和writer三中插件类型, + * reader和writer在执行时又可能出现Job和Task两种运行时(加载的类不同) + */ +public class LoadUtil { + private static final String pluginTypeNameFormat = "plugin.%s.%s"; + + private LoadUtil() { + } + + private enum ContainerType { + Job("Job"), Task("Task"); + private String type; + + private ContainerType(String type) { + this.type = type; + } + + public String value() { + return type; + } + } + + /** + * 所有插件配置放置在pluginRegisterCenter中,为区别reader、transformer和writer,还能区别 + * 具体pluginName,故使用pluginType.pluginName作为key放置在该map中 + */ + private static Configuration pluginRegisterCenter; + + /** + * jarLoader的缓冲 + */ + private static Map jarLoaderCenter = new HashMap(); + + /** + * 设置pluginConfigs,方便后面插件来获取 + * + * @param pluginConfigs + */ + public static void bind(Configuration pluginConfigs) { + pluginRegisterCenter = pluginConfigs; + } + + private static String generatePluginKey(PluginType pluginType, + String pluginName) { + return String.format(pluginTypeNameFormat, pluginType.toString(), + pluginName); + } + + private static Configuration getPluginConf(PluginType pluginType, + String pluginName) { + Configuration pluginConf = pluginRegisterCenter + .getConfiguration(generatePluginKey(pluginType, pluginName)); + + if (null == pluginConf) { + throw DataXException.asDataXException( + FrameworkErrorCode.PLUGIN_INSTALL_ERROR, + String.format("DataX不能找到插件[%s]的配置.", + pluginName)); + } + + return pluginConf; + } + + /** + * 加载JobPlugin,reader、writer都可能要加载 + * + * @param pluginType + * @param pluginName + * @return + */ + public static AbstractJobPlugin loadJobPlugin(PluginType pluginType, + String pluginName) { + Class clazz = LoadUtil.loadPluginClass( + pluginType, pluginName, ContainerType.Job); + + try { + AbstractJobPlugin jobPlugin = (AbstractJobPlugin) clazz + .newInstance(); + jobPlugin.setPluginConf(getPluginConf(pluginType, pluginName)); + return jobPlugin; + } catch (Exception e) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format("DataX找到plugin[%s]的Job配置.", + pluginName), e); + } + } + + /** + * 加载taskPlugin,reader、writer都可能加载 + * + * @param pluginType + * @param pluginName + * @return + */ + public static AbstractTaskPlugin loadTaskPlugin(PluginType pluginType, + String pluginName) { + Class clazz = LoadUtil.loadPluginClass( + pluginType, pluginName, ContainerType.Task); + + try { + AbstractTaskPlugin taskPlugin = (AbstractTaskPlugin) clazz + .newInstance(); + taskPlugin.setPluginConf(getPluginConf(pluginType, pluginName)); + return taskPlugin; + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, + String.format("DataX不能找plugin[%s]的Task配置.", + pluginName), e); + } + } + + /** + * 根据插件类型、名字和执行时taskGroupId加载对应运行器 + * + * @param pluginType + * @param pluginName + * @return + */ + public static AbstractRunner loadPluginRunner(PluginType pluginType, String pluginName) { + AbstractTaskPlugin taskPlugin = LoadUtil.loadTaskPlugin(pluginType, + pluginName); + + switch (pluginType) { + case READER: + return new ReaderRunner(taskPlugin); + case WRITER: + return new WriterRunner(taskPlugin); + default: + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format("插件[%s]的类型必须是[reader]或[writer]!", + pluginName)); + } + } + + /** + * 反射出具体plugin实例 + * + * @param pluginType + * @param pluginName + * @param pluginRunType + * @return + */ + @SuppressWarnings("unchecked") + private static synchronized Class loadPluginClass( + PluginType pluginType, 
String pluginName, + ContainerType pluginRunType) { + Configuration pluginConf = getPluginConf(pluginType, pluginName); + JarLoader jarLoader = LoadUtil.getJarLoader(pluginType, pluginName); + try { + return (Class) jarLoader + .loadClass(pluginConf.getString("class") + "$" + + pluginRunType.value()); + } catch (Exception e) { + throw DataXException.asDataXException(FrameworkErrorCode.RUNTIME_ERROR, e); + } + } + + public static synchronized JarLoader getJarLoader(PluginType pluginType, + String pluginName) { + Configuration pluginConf = getPluginConf(pluginType, pluginName); + + JarLoader jarLoader = jarLoaderCenter.get(generatePluginKey(pluginType, + pluginName)); + if (null == jarLoader) { + String pluginPath = pluginConf.getString("path"); + if (StringUtils.isBlank(pluginPath)) { + throw DataXException.asDataXException( + FrameworkErrorCode.RUNTIME_ERROR, + String.format( + "%s插件[%s]路径非法!", + pluginType, pluginName)); + } + jarLoader = new JarLoader(new String[]{pluginPath}); + jarLoaderCenter.put(generatePluginKey(pluginType, pluginName), + jarLoader); + } + + return jarLoader; + } +} diff --git a/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/EnumStrVal.java b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/EnumStrVal.java new file mode 100644 index 0000000000..d23b7ec262 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/EnumStrVal.java @@ -0,0 +1,5 @@ +package com.alibaba.datax.dataxservice.face.domain.enums; + +public interface EnumStrVal { + public String value(); +} diff --git a/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/EnumVal.java b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/EnumVal.java new file mode 100644 index 0000000000..ad4af0bc00 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/EnumVal.java @@ -0,0 +1,5 @@ +package com.alibaba.datax.dataxservice.face.domain.enums; + +public interface EnumVal { + public int value(); +} \ No newline at end of file diff --git a/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/ExecuteMode.java b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/ExecuteMode.java new file mode 100644 index 0000000000..9243796469 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/ExecuteMode.java @@ -0,0 +1,49 @@ +package com.alibaba.datax.dataxservice.face.domain.enums; + +public enum ExecuteMode implements EnumStrVal { + + STANDALONE("standalone"), + LOCAL("local"), + DISTRIBUTE("distribute"); + + String value; + + ExecuteMode(String value) { + this.value = value; + } + + @Override + public String value() { + return value; + } + + public String getValue() { + return this.value; + } + + public static boolean isLocal(String mode) { + return equalsIgnoreCase(LOCAL.getValue(), mode); + } + + public static boolean isDistribute(String mode) { + return equalsIgnoreCase(DISTRIBUTE.getValue(), mode); + } + + public static ExecuteMode toExecuteMode(String modeName) { + for (ExecuteMode mode : ExecuteMode.values()) { + if (mode.value().equals(modeName)) { + return mode; + } + } + throw new RuntimeException("no such mode :" + modeName); + } + + private static boolean equalsIgnoreCase(String str1, String str2) { + return str1 == null ? 
str2 == null : str1.equalsIgnoreCase(str2); + } + + @Override + public String toString() { + return this.value; + } +} diff --git a/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/State.java b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/State.java new file mode 100644 index 0000000000..657fe5fc33 --- /dev/null +++ b/core/src/main/java/com/alibaba/datax/dataxservice/face/domain/enums/State.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.dataxservice.face.domain.enums; + +public enum State implements EnumVal { + + SUBMITTING(10), + WAITING(20), + RUNNING(30), + KILLING(40), + KILLED(50), + FAILED(60), + SUCCEEDED(70); + + + /* 一定会被初始化的 */ + int value; + + State(int value) { + this.value = value; + } + + @Override + public int value() { + return value; + } + + + public boolean isFinished() { + return this == KILLED || this == FAILED || this == SUCCEEDED; + } + + public boolean isRunning() { + return !isFinished(); + } + +} \ No newline at end of file diff --git a/core/src/main/job/job.json b/core/src/main/job/job.json new file mode 100755 index 0000000000..582065929a --- /dev/null +++ b/core/src/main/job/job.json @@ -0,0 +1,52 @@ +{ + "job": { + "setting": { + "speed": { + "byte":10485760 + }, + "errorLimit": { + "record": 0, + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19890604, + "type": "long" + }, + { + "value": "1989-06-04 00:00:00", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} diff --git a/core/src/main/log/datax.log b/core/src/main/log/datax.log new file mode 100755 index 0000000000..e69de29bb2 diff --git a/core/src/main/script/Readme.md b/core/src/main/script/Readme.md new file mode 100755 index 0000000000..341f3f88d8 --- /dev/null +++ b/core/src/main/script/Readme.md @@ -0,0 +1 @@ +some script here. \ No newline at end of file diff --git a/core/src/main/tmp/readme.txt b/core/src/main/tmp/readme.txt new file mode 100755 index 0000000000..74b233ce58 --- /dev/null +++ b/core/src/main/tmp/readme.txt @@ -0,0 +1,4 @@ +If you are developing DataX Plugin, In your Plugin you can use this directory to store temporary resources . + +NOTE: +Each time install DataX, this directory will be cleaned up ! 
\ No newline at end of file diff --git a/datax-opensource-dingding.png b/datax-opensource-dingding.png new file mode 100644 index 0000000000..fe8b8544ca Binary files /dev/null and b/datax-opensource-dingding.png differ diff --git a/drdsreader/doc/drdsreader.md b/drdsreader/doc/drdsreader.md new file mode 100644 index 0000000000..25df920029 --- /dev/null +++ b/drdsreader/doc/drdsreader.md @@ -0,0 +1,342 @@ + +# DrdsReader 插件文档 + + +___ + + +## 1 快速介绍 + +DrdsReader插件实现了从DRDS(分布式RDS)读取数据。在底层实现上,DrdsReader通过JDBC连接远程DRDS数据库,并执行相应的sql语句将数据从DRDS库中SELECT出来。 + +DRDS的插件目前DataX只适配了Mysql引擎的场景,DRDS对于DataX而言,就是一套分布式Mysql数据库,并且大部分通信协议遵守Mysql使用场景。 + +## 2 实现原理 + +简而言之,DrdsReader通过JDBC连接器连接到远程的DRDS数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程DRDS数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,DrdsReader将其拼接为SQL语句发送到DRDS数据库。不同于普通的Mysql数据库,DRDS作为分布式数据库系统,无法适配所有Mysql的协议,包括复杂的Join等语句,DRDS暂时无法支持。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从DRDS数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度,单位为byte/s,DataX运行会尽可能达到该速度但是不超过它. + "byte": 1048576 + } + //出错限制 + "errorLimit": { + //出错的record条数上限,当大于该值即报错。 + "record": 0, + //出错的record百分比上限 1.0表示100%,0.02表示2% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "drdsReader", + "parameter": { + // 数据库连接用户名 + "username": "root", + // 数据库连接密码 + "password": "root", + "column": [ + "id","name" + ], + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:mysql://127.0.0.1:3306/database" + ] + } + ] + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + //是否打印内容 + "parameter": { + "print":true, + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + }, + "content": [ + { + "reader": { + "name": "drdsreader", + "parameter": { + "username": "root", + "password": "root", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:drds://localhost:3306/database"] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述.注意,jdbcUrl必须包含在connection配置单元中。DRDSReader中关于jdbcUrl中JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照Mysql官方规范,并可以填写连接附件控制信息。具体请参看[mysql官方文档](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取需要抽取的表。注意,由于DRDS本身就是分布式数据源,因此填写多张表无意义。系统对多表不做校验。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用\*代表默认使用所有列配置,例如['\*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照Mysql SQL语法格式: + ["id", "\`table\`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"] + id为普通列名,\`table\`为包含保留字的列名,1为整型数字常量,'bazhen.csy'为字符串常量,null为空指针,to_char(a + 1)为表达式,2.3为浮点数,true为布尔值。 + + column必须用户显式指定同步的列集合,不允许为空! + + * 必选:是
+ + * 默认值:无
+ +* **where** + + * 描述:筛选条件,DrdsReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。
+ + where条件可以有效地进行业务增量同步。where条件不配置或者为空,视作全表同步数据。拼接后的完整抽取 SQL 可参考本节末尾的示意片段。 + + * 必选:否
+ + * 默认值:无
+ +* **querySql** + + * 描述:暂时不支持配置querySql模式
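+
+为便于理解 column、table、where 三个配置项之间的关系,下面给出一个示意性的抽取 SQL(其中表名、列名与时间条件均为假设的示例值;实际下发到 DRDS 的语句还会由 DrdsReader 按表拓扑在前面追加分库分表的路由 hint):
+
+```sql
+-- 假设 column 配置为 ["id","name"],table 配置为 tc_biz_vertical,
+-- where 配置为 "gmt_modified > '2016-06-01 00:00:00'",
+-- 则 DrdsReader 大致会拼接出如下 SELECT 语句:
+SELECT id, name
+FROM tc_biz_vertical
+WHERE gmt_modified > '2016-06-01 00:00:00';
+```
+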
+ + +### 3.3 类型转换 + +目前DrdsReader支持大部分DRDS类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出DrdsReader针对DRDS类型转换列表: + + +| DataX 内部类型| DRDS 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext | +| Date |date, datetime, timestamp, time, year | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 +* `类似Mysql,tinyint(1)视作整形`。 +* `类似Mysql,bit类型读取目前是未定义状态。` + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + CREATE TABLE `tc_biz_vertical_test_0000` ( + `biz_order_id` bigint(20) NOT NULL COMMENT 'id', + `key_value` varchar(4000) NOT NULL COMMENT 'Key-value的内容', + `gmt_create` datetime NOT NULL COMMENT '创建时间', + `gmt_modified` datetime NOT NULL COMMENT '修改时间', + `attribute_cc` int(11) DEFAULT NULL COMMENT '防止并发修改的标志', + `value_type` int(11) NOT NULL DEFAULT '0' COMMENT '类型', + `buyer_id` bigint(20) DEFAULT NULL COMMENT 'buyerid', + `seller_id` bigint(20) DEFAULT NULL COMMENT 'seller_id', + PRIMARY KEY (`biz_order_id`,`value_type`), + KEY `idx_biz_vertical_gmtmodified` (`gmt_modified`) + ) ENGINE=InnoDB DEFAULT CHARSET=gbk COMMENT='tc_biz_vertical' + + +单行记录类似于: + + biz_order_id: 888888888 + key_value: ;orderIds:20148888888,2014888888813800; + gmt_create: 2011-09-24 11:07:20 + gmt_modified: 2011-10-24 17:56:34 + attribute_cc: 1 + value_type: 3 + buyer_id: 8888888 + seller_id: 1 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz + 2. mem: 48GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* DRDS数据库机器参数为: + 1. cpu: 32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz + 2. mem: 256GB + 3. net: 千兆双网卡 + 4. disc: BTWL419303E2800RGN INTEL SSDSC2BB800G4 D2010370 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数| 是否按照主键切分| DataX速度(Rec/s)| DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------|--------| --------|--------|--------|--------|--------|--------| + + +说明: + +1. 这里的单表,主键类型为 bigint(20),范围为:190247559466810-570722244711460,从主键范围划分看,数据分布均匀。 +2. 
对单表如果没有安装主键切分,那么配置通道个数不会提升速度,效果与1个通道一样。 + + +#### 4.2.2 分表测试报告(2个分库,每个分库16张分表,共计32张分表) + + +| 通道数| DataX速度(Rec/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------| --------|--------|--------|--------|--------|--------| + + + +## 5 约束限制 + + +### 5.1 一致性视图问题 + +DRDS本身属于分布式数据库,对外无法提供一致性的多库多表视图,不同于Mysql等单库单表同步,DRDSReader无法抽取同一个时间切片的分库分表快照信息,也就是说DataX DrdsReader抽取底层不同的分表将获取不同的分表快照,无法保证强一致性。 + + +### 5.2 数据库编码问题 + +DRDS本身的编码设置非常灵活,包括指定编码到库、表、字段级别,甚至可以均不同编码。优先级从高到低为字段、表、库、实例。我们不推荐数据库用户设置如此混乱的编码,最好在库级别就统一到UTF-8。 + +DrdsReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此DrdsReader不需用户指定编码,可以自动获取编码并转码。 + +对于DRDS底层写入编码和其设定的编码不一致的混乱情况,DrdsReader对此无法识别,对此也无法提供解决方案,对于这类情况,`导出有可能为乱码`。 + +### 5.3 增量数据同步 + +DrdsReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,DrdsReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,DrdsReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,DrdsReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.4 Sql安全性 + +DrdsReader提供querySql语句交给用户自己实现SELECT抽取语句,DrdsReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + +*** + +**Q: DrdsReader同步报错,报错信息为XXX** + + A: 网络或者权限问题,请使用DRDS命令行测试: + + mysql -u -p -h -D -e "select * from <表名>" + +如果上述命令也报错,那可以证实是环境问题,请联系你的DBA。 + +*** + +**Q: 我想同步DRDS增量数据,怎么配置?** + + A: DrdsReader必须业务支持增量字段DataX才能同步增量,例如在淘宝大部分业务表中,通过gmt_modified字段表征这条记录的最新修改时间,那么DataX DrdsReader只需要配置where条件为 + +``` + "where": "Date(add_time) = '2014-06-01'" +``` + +*** + + + diff --git a/drdsreader/pom.xml b/drdsreader/pom.xml new file mode 100755 index 0000000000..2f890ecacd --- /dev/null +++ b/drdsreader/pom.xml @@ -0,0 +1,84 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + drdsreader + drdsreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + mysql + mysql-connector-java + 5.1.34 + + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/drdsreader/src/main/assembly/package.xml b/drdsreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..2d170236ac --- /dev/null +++ b/drdsreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/drdsreader + + + target/ + + drdsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/drdsreader + + + + + + false + plugin/reader/drdsreader/libs + runtime + + + diff --git a/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReader.java b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReader.java new file mode 100755 index 0000000000..0e6d330138 --- /dev/null +++ b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReader.java @@ -0,0 +1,150 @@ +package com.alibaba.datax.plugin.reader.drdsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; 
+import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.TableExpandUtil; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public class DrdsReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.MySql; + private static final Logger LOG = LoggerFactory.getLogger(DrdsReader.class); + + public static class Job extends Reader.Job { + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + int fetchSize = this.originalConfig.getInt(Constant.FETCH_SIZE, + Integer.MIN_VALUE); + this.originalConfig.set(Constant.FETCH_SIZE, fetchSize); + this.validateConfiguration(); + + this.commonRdbmsReaderJob = new CommonRdbmsReader.Job( + DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + } + + @Override + public List split(int adviceNumber) { + return DrdsReaderSplitUtil.doSplit(this.originalConfig, + adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + private void validateConfiguration() { + // do not splitPk + String splitPk = originalConfig.getString(Key.SPLIT_PK, null); + if (null != splitPk) { + LOG.warn("由于您读取数据库是drds, 所以您不需要配置 splitPk. 如果您不想看到这条提醒,请移除您源头表中配置的 splitPk."); + this.originalConfig.remove(Key.SPLIT_PK); + } + + List conns = this.originalConfig.getList( + Constant.CONN_MARK, Object.class); + if (null == conns || conns.size() != 1) { + throw DataXException.asDataXException( + DBUtilErrorCode.REQUIRED_VALUE, + "您未配置读取数据库jdbcUrl的信息. 正确的配置方式是给 jdbcUrl 配置上您需要读取的连接. 请检查您的配置并作出修改."); + } + Configuration connConf = Configuration + .from(conns.get(0).toString()); + connConf.getNecessaryValue(Key.JDBC_URL, + DBUtilErrorCode.REQUIRED_VALUE); + + // only one jdbcUrl + List jdbcUrls = connConf + .getList(Key.JDBC_URL, String.class); + if (null == jdbcUrls || jdbcUrls.size() != 1) { + throw DataXException.asDataXException( + DBUtilErrorCode.ILLEGAL_VALUE, + "您的jdbcUrl配置信息有误, 因为您配置读取数据库jdbcUrl的数量不正确. 正确的配置方式是配置且只配置 1 个目的 jdbcUrl. 请检查您的配置并作出修改."); + } + // if have table,only one + List tables = connConf.getList(Key.TABLE, String.class); + if (null != tables && tables.size() != 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的jdbcUrl配置信息有误. 由于您读取数据库是drds,配置读取源表数目错误. 正确的配置方式是配置且只配置 1 个目的 table. 请检查您的配置并作出修改."); + + } + if (null != tables && tables.size() == 1) { + List expandedTables = TableExpandUtil.expandTableConf( + DATABASE_TYPE, tables); + if (null == expandedTables || expandedTables.size() != 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的jdbcUrl配置信息有误. 由于您读取数据库是drds,配置读取源表数目错误. 正确的配置方式是配置且只配置 1 个目的 table. 请检查您的配置并作出修改."); + } + } + + // if have querySql,only one + List querySqls = connConf.getList(Key.QUERY_SQL, + String.class); + if (null != querySqls && querySqls.size() != 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的querySql配置信息有误. 由于您读取数据库是drds, 配置读取querySql数目错误. 正确的配置方式是配置且只配置 1 个 querySql. 
请检查您的配置并作出修改."); + } + + // warn:other checking about table,querySql in common + } + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task( + DATABASE_TYPE,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig.getInt(Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, + recordSender, super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderErrorCode.java b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderErrorCode.java new file mode 100755 index 0000000000..91b3afd49e --- /dev/null +++ b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.reader.drdsreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum DrdsReaderErrorCode implements ErrorCode { + GET_TOPOLOGY_FAILED("DrdsReader-01", "获取 drds 表的拓扑结构失败."),; + + private final String code; + private final String description; + + private DrdsReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } +} diff --git a/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderSplitUtil.java b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderSplitUtil.java new file mode 100755 index 0000000000..fefd698f76 --- /dev/null +++ b/drdsreader/src/main/java/com/alibaba/datax/plugin/reader/drdsreader/DrdsReaderSplitUtil.java @@ -0,0 +1,121 @@ +package com.alibaba.datax.plugin.reader.drdsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.reader.util.SingleTableSplitUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.ResultSet; +import java.util.*; + +public class DrdsReaderSplitUtil { + + private static final Logger LOG = LoggerFactory + .getLogger(DrdsReaderSplitUtil.class); + + public static List doSplit(Configuration originalSliceConfig, + int adviceNumber) { + boolean isTableMode = originalSliceConfig.getBool(Constant.IS_TABLE_MODE).booleanValue(); + int tableNumber = originalSliceConfig.getInt(Constant.TABLE_NUMBER_MARK); + + if (isTableMode && tableNumber == 1) { + //需要先把内层的 table,connection 先放到外层 + String table = originalSliceConfig.getString(String.format("%s[0].%s[0]", Constant.CONN_MARK, Key.TABLE)).trim(); + originalSliceConfig.set(Key.TABLE, table); + + //注意:这里的 jdbcUrl 不是从数组中获取的,因为之前的 master init 方法已经进行过预处理 + String jdbcUrl = originalSliceConfig.getString(String.format("%s[0].%s", Constant.CONN_MARK, Key.JDBC_URL)).trim(); + + originalSliceConfig.set(Key.JDBC_URL, DataBaseType.DRDS.appendJDBCSuffixForReader(jdbcUrl)); + + originalSliceConfig.remove(Constant.CONN_MARK); + return doDrdsReaderSplit(originalSliceConfig); + } else { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, "您的配置信息中的表(table)的配置有误. 因为Drdsreader 只需要读取一张逻辑表,后台会通过DRDS Proxy自动获取实际对应物理表的数据. 
请检查您的配置并作出修改."); + } + } + + private static List doDrdsReaderSplit(Configuration originalSliceConfig) { + List splittedConfigurations = new ArrayList(); + + Map> topology = getTopology(originalSliceConfig); + if (null == topology || topology.isEmpty()) { + throw DataXException.asDataXException(DrdsReaderErrorCode.GET_TOPOLOGY_FAILED, + "获取 drds 表拓扑结构失败, 拓扑结构不能为空."); + } else { + String table = originalSliceConfig.getString(Key.TABLE).trim(); + String column = originalSliceConfig.getString(Key.COLUMN).trim(); + String where = originalSliceConfig.getString(Key.WHERE, null); + // 不能带英语分号结尾 + String sql = SingleTableSplitUtil + .buildQuerySql(column, table, where); + // 根据拓扑拆分任务 + for (Map.Entry> entry : topology.entrySet()) { + String group = entry.getKey(); + StringBuilder sqlbuilder = new StringBuilder(); + sqlbuilder.append("/*+TDDL({'extra':{'MERGE_UNION':'false'},'type':'direct',"); + sqlbuilder.append("'vtab':'").append(table).append("',"); + sqlbuilder.append("'dbid':'").append(group).append("',"); + sqlbuilder.append("'realtabs':["); + Iterator it = entry.getValue().iterator(); + while (it.hasNext()) { + String realTable = it.next(); + sqlbuilder.append('\'').append(realTable).append('\''); + if (it.hasNext()) { + sqlbuilder.append(','); + } + } + sqlbuilder.append("]})*/"); + sqlbuilder.append(sql); + Configuration param = originalSliceConfig.clone(); + param.set(Key.QUERY_SQL, sqlbuilder.toString()); + splittedConfigurations.add(param); + } + + return splittedConfigurations; + } + } + + + private static Map> getTopology(Configuration configuration) { + Map> topology = new HashMap>(); + + String jdbcURL = configuration.getString(Key.JDBC_URL); + String username = configuration.getString(Key.USERNAME); + String password = configuration.getString(Key.PASSWORD); + String logicTable = configuration.getString(Key.TABLE).trim(); + + Connection conn = null; + ResultSet rs = null; + try { + conn = DBUtil.getConnection(DataBaseType.DRDS, jdbcURL, username, password); + rs = DBUtil.query(conn, "SHOW TOPOLOGY " + logicTable); + while (DBUtil.asyncResultSetNext(rs)) { + String groupName = rs.getString("GROUP_NAME"); + String tableName = rs.getString("TABLE_NAME"); + List tables = topology.get(groupName); + if (tables == null) { + tables = new ArrayList(); + topology.put(groupName, tables); + } + tables.add(tableName); + } + + return topology; + } catch (Exception e) { + throw DataXException.asDataXException(DrdsReaderErrorCode.GET_TOPOLOGY_FAILED, + String.format("获取 drds 表拓扑结构失败.根据您的配置, datax获取不到拓扑信息。相关上下文信息:表:%s, jdbcUrl:%s . 请联系 drds 管理员处理.", logicTable, jdbcURL), e); + } finally { + DBUtil.closeDBResources(rs, null, conn); + } + } + +} + diff --git a/drdsreader/src/main/resources/plugin.json b/drdsreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..eaa86d5a48 --- /dev/null +++ b/drdsreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "drdsreader", + "class": "com.alibaba.datax.plugin.reader.drdsreader.DrdsReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. 
warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/drdsreader/src/main/resources/plugin_job_template.json b/drdsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..cf008227df --- /dev/null +++ b/drdsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,11 @@ +{ + "name": "drdsreader", + "parameter": { + "jdbcUrl": "", + "username": "", + "password": "", + "table": "", + "column": [], + "where": "" + } +} \ No newline at end of file diff --git a/drdswriter/doc/drdswriter.md b/drdswriter/doc/drdswriter.md new file mode 100644 index 0000000000..ab4cd94c2a --- /dev/null +++ b/drdswriter/doc/drdswriter.md @@ -0,0 +1,226 @@ +# DataX DRDSWriter + + +--- + + +## 1 快速介绍 + +DRDSWriter 插件实现了写入数据到 DRDS 的目的表的功能。在底层实现上, DRDSWriter 通过 JDBC 连接远程 DRDS 数据库的 Proxy,并执行相应的 replace into ... 的 sql 语句将数据写入 DRDS,特别注意执行的 Sql 语句是 replace into,为了避免数据重复写入,需要你的表具备主键或者唯一性索引(Unique Key)。 + +DRDSWriter 面向ETL开发工程师,他们使用 DRDSWriter 从数仓导入数据到 DRDS。同时 DRDSWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +DRDSWriter 通过 DataX 框架获取 Reader 生成的协议数据,通过 `replace into...`(没有遇到主键/唯一性索引冲突时,与 insert into 行为一致,冲突时会用新行替换原有行所有字段) 的语句写入数据到 DRDS。DRDSWriter 累积一定数据,提交给 DRDS 的 Proxy,该 Proxy 内部决定数据是写入一张还是多张表以及多张表写入时如何路由数据。 +
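举一个纯属假设的小例子帮助理解上述 replace 写入语义(表名、字段、数值均为虚构):假设目的表 test 以 id 为主键,且已有记录 (id=1, name='a');当 DataX 写入 (id=1, name='b') 时,不会因主键冲突报错或产生重复行,而是按 MySQL 的 replace 语义先删除旧行再插入新行,效果上等于整行被新数据替换;当写入 (id=2, name='c') 这类无冲突记录时,行为与普通 insert 一致。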
+ + 注意:整个任务至少需要具备 replace into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 DRDS 导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { +                    "name": "drdswriter", + "parameter": { + "writeMode": "insert", + "username": "root", + "password": "root", + "column": [ + "id", + "name" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息。作业运行时,DataX 会在你提供的 jdbcUrl 后面追加如下属性:yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true + + 注意:1、在一个数据库上只能配置一个 jdbcUrl 值 + 2、一个DRDS 写入任务仅能配置一个 jdbcUrl + 3、jdbcUrl按照Mysql/DRDS官方规范,并可以填写连接附加控制信息,比如想指定连接编码为 gbk ,则在 jdbcUrl 后面追加属性 useUnicode=true&characterEncoding=gbk。具体请参看 Mysql/DRDS官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称,只能配置一个 DRDS 的表名称。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中,组织方式可参考本节末尾的示意片段 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + **column配置项必须指定,不能留空!** + + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
+ + * 默认值:无
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。比如你想在导入数据前清空数据表中的数据,那么可以配置为:`"preSql":["delete from yourTableName"]`
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
+ +* **writeMode** + + * 描述:控制写入数据到目标表所使用的 SQL 语句,默认为 replace,可以不配置;按本次提交的 DrdsWriter 代码实现,取值支持 replace 与 insert ignore 两种。
+ + * 必选:否
+ + * 默认值:replace
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与DRDS的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:
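为了更直观地展示 3.2 中各参数的组织方式(尤其是 table、jdbcUrl 必须嵌套在 connection 配置单元内,见前文 table 参数的说明),下面给出一个仅作示意的 parameter 片段:其中库名、表名、账号均为占位值,batchSize 的取值也只是示例假设,并非插件默认值:

```json
"parameter": {
    "writeMode": "replace",
    "username": "xxx",
    "password": "xxx",
    "column": ["id", "name", "age"],
    "preSql": ["delete from test"],
    "postSql": [],
    "batchSize": 1024,
    "connection": [
        {
            "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk",
            "table": ["test"]
        }
    ]
}
```

完整可运行的作业配置(含 reader 部分)可参考 3.1 节的配置样例。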
+ +### 3.3 类型转换 + +类似 MysqlWriter ,目前 DRDSWriter 支持大部分 Mysql 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 DRDSWriter 针对 Mysql 类型转换列表: + + +| DataX 内部类型| Mysql 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint, year| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext | +| Date |date, datetime, timestamp, time | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + + +## 4 性能报告 + + +## 5 约束限制 + + +## FAQ + +*** + +**Q: DRDSWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/drdswriter/pom.xml b/drdswriter/pom.xml new file mode 100755 index 0000000000..fc852ff8ca --- /dev/null +++ b/drdswriter/pom.xml @@ -0,0 +1,84 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + drdswriter + drdswriter + jar + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + mysql + mysql-connector-java + 5.1.34 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/drdswriter/src/main/assembly/package.xml b/drdswriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..f3c893ac30 --- /dev/null +++ b/drdswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/drdswriter + + + target/ + + drdswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/drdswriter + + + + + + false + plugin/writer/drdswriter/libs + runtime + + + diff --git a/drdswriter/src/main/java/com/alibaba/datax/plugin/writer/drdswriter/DrdsWriter.java b/drdswriter/src/main/java/com/alibaba/datax/plugin/writer/drdswriter/DrdsWriter.java new file mode 100755 index 0000000000..b2bf0ac4e4 --- /dev/null +++ b/drdswriter/src/main/java/com/alibaba/datax/plugin/writer/drdswriter/DrdsWriter.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.plugin.writer.drdswriter; + + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class DrdsWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.DRDS; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private String DEFAULT_WRITEMODE = "replace"; + private String INSERT_IGNORE_WRITEMODE = "insert ignore"; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + String writeMode = 
this.originalConfig.getString(Key.WRITE_MODE, DEFAULT_WRITEMODE); + if (!DEFAULT_WRITEMODE.equalsIgnoreCase(writeMode) && + !INSERT_IGNORE_WRITEMODE.equalsIgnoreCase(writeMode)) { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + String.format("写入模式(writeMode)配置错误. DRDSWriter只支持两种写入模式为:[%s, %s], 但是您配置的写入模式为:%s. 请检查您的配置并作出修改.", + DEFAULT_WRITEMODE, INSERT_IGNORE_WRITEMODE, writeMode)); + } + + this.originalConfig.set(Key.WRITE_MODE, writeMode); + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + // 对于 Drds 而言,只会暴露一张逻辑表,所以直接在 Master 做 pre,post 操作 + @Override + public void prepare() { + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, mandatoryNumber); + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task(DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + //TODO 改用连接池,确保每次获取的连接都是可用的(注意:连接可能需要每次都初始化其 session) + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, this.writerSliceConfig, + super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + } +} diff --git a/drdswriter/src/main/resources/plugin.json b/drdswriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..ad0db036db --- /dev/null +++ b/drdswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "drdswriter", + "class": "com.alibaba.datax.plugin.writer.drdswriter.DrdsWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. 
warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/drdswriter/src/main/resources/plugin_job_template.json b/drdswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..93fcd4efe3 --- /dev/null +++ b/drdswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "drdswriter", + "parameter": { + "jdbcUrl": "", + "username": "", + "password": "", + "table": "", + "column": [], + "writeMode": "", + "preSql": [], + "postSql": [] + } +} \ No newline at end of file diff --git a/elasticsearchwriter/README.md b/elasticsearchwriter/README.md new file mode 100644 index 0000000000..cfcd5efeb9 --- /dev/null +++ b/elasticsearchwriter/README.md @@ -0,0 +1,5 @@ +本插件仅在Elasticsearch 5.x上测试 + + + + diff --git a/elasticsearchwriter/build.sh b/elasticsearchwriter/build.sh new file mode 100644 index 0000000000..1c6e4acd86 --- /dev/null +++ b/elasticsearchwriter/build.sh @@ -0,0 +1,18 @@ +#!/bin/sh + +SCRIPT_HOME=$(cd $(dirname $0); pwd) +cd $SCRIPT_HOME/.. +mvn clean package -DskipTests assembly:assembly + +cd $SCRIPT_HOME/target/datax/plugin/writer/ + +if [ -d "eswriter" ]; then + tar -zcvf eswriter.tgz eswriter + cp eswriter.tgz $SCRIPT_HOME + cd $SCRIPT_HOME +ansible-playbook -i hosts main.yml -u vagrant -k +fi + + + + diff --git a/elasticsearchwriter/doc/elasticsearchwriter.md b/elasticsearchwriter/doc/elasticsearchwriter.md new file mode 100644 index 0000000000..9a22f13c22 --- /dev/null +++ b/elasticsearchwriter/doc/elasticsearchwriter.md @@ -0,0 +1,245 @@ +# DataX ElasticSearchWriter + + +--- + +## 1 快速介绍 + +数据导入elasticsearch的插件 + +## 2 实现原理 + +使用elasticsearch的rest api接口, 批量把从reader读入的数据写入elasticsearch + +## 3 功能说明 + +### 3.1 配置样例 + +#### job.json + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + ... 
+ }, + "writer": { + "name": "elasticsearchwriter", + "parameter": { + "endpoint": "http://xxx:9999", + "accessId": "xxxx", + "accessKey": "xxxx", + "index": "test-1", + "type": "default", + "cleanup": true, + "settings": {"index" :{"number_of_shards": 1, "number_of_replicas": 0}}, + "discovery": false, + "batchSize": 1000, + "splitter": ",", + "column": [ + {"name": "pk", "type": "id"}, + { "name": "col_ip","type": "ip" }, + { "name": "col_double","type": "double" }, + { "name": "col_long","type": "long" }, + { "name": "col_integer","type": "integer" }, + { "name": "col_keyword", "type": "keyword" }, + { "name": "col_text", "type": "text", "analyzer": "ik_max_word"}, + { "name": "col_geo_point", "type": "geo_point" }, + { "name": "col_date", "type": "date", "format": "yyyy-MM-dd HH:mm:ss"}, + { "name": "col_nested1", "type": "nested" }, + { "name": "col_nested2", "type": "nested" }, + { "name": "col_object1", "type": "object" }, + { "name": "col_object2", "type": "object" }, + { "name": "col_integer_array", "type":"integer", "array":true}, + { "name": "col_geo_shape", "type":"geo_shape", "tree": "quadtree", "precision": "10m"} + ] + } + } + } + ] + } +} +``` + +#### 3.2 参数说明 + +* endpoint + * 描述:ElasticSearch的连接地址 + * 必选:是 + * 默认值:无 + +* accessId + * 描述:http auth中的user + * 必选:否 + * 默认值:空 + +* accessKey + * 描述:http auth中的password + * 必选:否 + * 默认值:空 + +* index + * 描述:elasticsearch中的index名 + * 必选:是 + * 默认值:无 + +* type + * 描述:elasticsearch中index的type名 + * 必选:否 + * 默认值:index名 + +* cleanup + * 描述:是否删除原表 + * 必选:否 + * 默认值:false + +* batchSize + * 描述:每次批量数据的条数 + * 必选:否 + * 默认值:1000 + +* trySize + * 描述:失败后重试的次数 + * 必选:否 + * 默认值:30 + +* timeout + * 描述:客户端超时时间 + * 必选:否 + * 默认值:600000 + +* discovery + * 描述:启用节点发现将(轮询)并定期更新客户机中的服务器列表。 + * 必选:否 + * 默认值:false + +* compression + * 描述:http请求,开启压缩 + * 必选:否 + * 默认值:true + +* multiThread + * 描述:http请求,是否有多线程 + * 必选:否 + * 默认值:true + +* ignoreWriteError + * 描述:忽略写入错误,不重试,继续写入 + * 必选:否 + * 默认值:false + +* ignoreParseError + * 描述:忽略解析数据格式错误,继续写入 + * 必选:否 + * 默认值:true + +* alias + * 描述:数据导入完成后写入别名 + * 必选:否 + * 默认值:无 + +* aliasMode + * 描述:数据导入完成后增加别名的模式,append(增加模式), exclusive(只留这一个) + * 必选:否 + * 默认值:append + +* settings + * 描述:创建index时候的settings, 与elasticsearch官方相同 + * 必选:否 + * 默认值:无 + +* splitter + * 描述:如果插入数据是array,就使用指定分隔符 + * 必选:否 + * 默认值:-,- + +* column + * 描述:elasticsearch所支持的字段类型,样例中包含了全部 + * 必选:是 + +* dynamic + * 描述: 不使用datax的mappings,使用es自己的自动mappings + * 必选: 否 + * 默认值: false + + + +## 4 性能报告 + +### 4.1 环境准备 + +* 总数据量 1kw条数据, 每条0.1kb +* 1个shard, 0个replica +* 不加id,这样默认是append_only模式,不检查版本,插入速度会有20%左右的提升 + +#### 4.1.1 输入数据类型(streamreader) + +``` +{"value": "1.1.1.1", "type": "string"}, +{"value": 19890604.0, "type": "double"}, +{"value": 19890604, "type": "long"}, +{"value": 19890604, "type": "long"}, +{"value": "hello world", "type": "string"}, +{"value": "hello world", "type": "string"}, +{"value": "41.12,-71.34", "type": "string"}, +{"value": "2017-05-25", "type": "string"}, +``` + +#### 4.1.2 输出数据类型(eswriter) + +``` +{ "name": "col_ip","type": "ip" }, +{ "name": "col_double","type": "double" }, +{ "name": "col_long","type": "long" }, +{ "name": "col_integer","type": "integer" }, +{ "name": "col_keyword", "type": "keyword" }, +{ "name": "col_text", "type": "text"}, +{ "name": "col_geo_point", "type": "geo_point" }, +{ "name": "col_date", "type": "date"} +``` + +#### 4.1.2 机器参数 + +1. cpu: 32 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz +2. mem: 128G +3. 
net: 千兆双网卡 + +#### 4.1.3 DataX jvm 参数 + +-Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + +### 4.2 测试报告 + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| +|--------|--------| --------|--------| +| 4| 256| 11013| 0.828| +| 4| 1024| 19417| 1.43| +| 4| 4096| 23923| 1.76| +| 4| 8172| 24449| 1.80| +| 8| 256| 21459| 1.58| +| 8| 1024| 37037| 2.72| +| 8| 4096| 45454| 3.34| +| 8| 8172| 45871| 3.37| +| 16| 1024| 67567| 4.96| +| 16| 4096| 78125| 5.74| +| 16| 8172| 77519| 5.69| +| 32| 1024| 94339| 6.93| +| 32| 4096| 96153| 7.06| +| 64| 1024| 91743| 6.74| + +### 4.3 测试总结 + +* 最好的结果是32通道,每次传4096,如果单条数据很大, 请适当减少批量数,防止oom +* 当然这个很容易水平扩展,而且es也是分布式的,多设置几个shard也可以水平扩展 + +## 5 约束限制 + +* 如果导入id,这样数据导入失败也会重试,重新导入也仅仅是覆盖,保证数据一致性 +* 如果不导入id,就是append_only模式,elasticsearch自动生成id,速度会提升20%左右,但数据无法修复,适合日志型数据(对数据精度要求不高的) \ No newline at end of file diff --git a/elasticsearchwriter/pom.xml b/elasticsearchwriter/pom.xml new file mode 100644 index 0000000000..39ee97e0c5 --- /dev/null +++ b/elasticsearchwriter/pom.xml @@ -0,0 +1,89 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + elasticsearchwriter + + com.alibaba.datax + 0.0.1-SNAPSHOT + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + io.searchbox + jest-common + 2.4.0 + + + io.searchbox + jest + 2.4.0 + + + joda-time + joda-time + 2.9.7 + + + junit + junit + 4.11 + test + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/elasticsearchwriter/src/main/assembly/package.xml b/elasticsearchwriter/src/main/assembly/package.xml new file mode 100644 index 0000000000..92b9162549 --- /dev/null +++ b/elasticsearchwriter/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + + plugin/writer/elasticsearchwriter + + + target/ + + elasticsearchwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/elasticsearchwriter + + + + + + false + plugin/writer/elasticsearchwriter/libs + runtime + + + diff --git a/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESClient.java b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESClient.java new file mode 100644 index 0000000000..34bb7e5420 --- /dev/null +++ b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESClient.java @@ -0,0 +1,236 @@ +package com.alibaba.datax.plugin.writer.elasticsearchwriter; + +import com.google.gson.Gson; +import com.google.gson.JsonElement; +import com.google.gson.JsonObject; +import com.google.gson.JsonParser; +import io.searchbox.action.Action; +import io.searchbox.client.JestClient; +import io.searchbox.client.JestClientFactory; +import io.searchbox.client.JestResult; +import io.searchbox.client.config.HttpClientConfig; +import io.searchbox.client.config.HttpClientConfig.Builder; +import io.searchbox.core.Bulk; +import io.searchbox.indices.CreateIndex; +import io.searchbox.indices.DeleteIndex; +import io.searchbox.indices.IndicesExists; +import io.searchbox.indices.aliases.*; +import io.searchbox.indices.mapping.PutMapping; +import org.apache.http.HttpHost; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import 
java.util.concurrent.TimeUnit; + +/** + * Created by xiongfeng.bxf on 17/2/8. + */ +public class ESClient { + private static final Logger log = LoggerFactory.getLogger(ESClient.class); + + private JestClient jestClient; + + public JestClient getClient() { + return jestClient; + } + + public void createClient(String endpoint, + String user, + String passwd, + boolean multiThread, + int readTimeout, + boolean compression, + boolean discovery) { + + JestClientFactory factory = new JestClientFactory(); + Builder httpClientConfig = new HttpClientConfig + .Builder(endpoint) + .setPreemptiveAuth(new HttpHost(endpoint)) + .multiThreaded(multiThread) + .connTimeout(30000) + .readTimeout(readTimeout) + .maxTotalConnection(200) + .requestCompressionEnabled(compression) + .discoveryEnabled(discovery) + .discoveryFrequency(5l, TimeUnit.MINUTES); + + if (!("".equals(user) || "".equals(passwd))) { + httpClientConfig.defaultCredentials(user, passwd); + } + + factory.setHttpClientConfig(httpClientConfig.build()); + + jestClient = factory.getObject(); + } + + public boolean indicesExists(String indexName) throws Exception { + boolean isIndicesExists = false; + JestResult rst = jestClient.execute(new IndicesExists.Builder(indexName).build()); + if (rst.isSucceeded()) { + isIndicesExists = true; + } else { + switch (rst.getResponseCode()) { + case 404: + isIndicesExists = false; + break; + case 401: + // 无权访问 + default: + log.warn(rst.getErrorMessage()); + break; + } + } + return isIndicesExists; + } + + public boolean deleteIndex(String indexName) throws Exception { + log.info("delete index " + indexName); + if (indicesExists(indexName)) { + JestResult rst = execute(new DeleteIndex.Builder(indexName).build()); + if (!rst.isSucceeded()) { + return false; + } + } else { + log.info("index cannot found, skip delete " + indexName); + } + return true; + } + + public boolean createIndex(String indexName, String typeName, + Object mappings, String settings, boolean dynamic) throws Exception { + JestResult rst = null; + if (!indicesExists(indexName)) { + log.info("create index " + indexName); + rst = jestClient.execute( + new CreateIndex.Builder(indexName) + .settings(settings) + .setParameter("master_timeout", "5m") + .build() + ); + //index_already_exists_exception + if (!rst.isSucceeded()) { + if (getStatus(rst) == 400) { + log.info(String.format("index [%s] already exists", indexName)); + return true; + } else { + log.error(rst.getErrorMessage()); + return false; + } + } else { + log.info(String.format("create [%s] index success", indexName)); + } + } + + int idx = 0; + while (idx < 5) { + if (indicesExists(indexName)) { + break; + } + Thread.sleep(2000); + idx ++; + } + if (idx >= 5) { + return false; + } + + if (dynamic) { + log.info("ignore mappings"); + return true; + } + log.info("create mappings for " + indexName + " " + mappings); + rst = jestClient.execute(new PutMapping.Builder(indexName, typeName, mappings) + .setParameter("master_timeout", "5m").build()); + if (!rst.isSucceeded()) { + if (getStatus(rst) == 400) { + log.info(String.format("index [%s] mappings already exists", indexName)); + } else { + log.error(rst.getErrorMessage()); + return false; + } + } else { + log.info(String.format("index [%s] put mappings success", indexName)); + } + return true; + } + + public JestResult execute(Action clientRequest) throws Exception { + JestResult rst = null; + rst = jestClient.execute(clientRequest); + if (!rst.isSucceeded()) { + //log.warn(rst.getErrorMessage()); + } + return rst; + } + + public Integer 
getStatus(JestResult rst) { + JsonObject jsonObject = rst.getJsonObject(); + if (jsonObject.has("status")) { + return jsonObject.get("status").getAsInt(); + } + return 600; + } + + public boolean isBulkResult(JestResult rst) { + JsonObject jsonObject = rst.getJsonObject(); + return jsonObject.has("items"); + } + + + public boolean alias(String indexname, String aliasname, boolean needClean) throws IOException { + GetAliases getAliases = new GetAliases.Builder().addIndex(aliasname).build(); + AliasMapping addAliasMapping = new AddAliasMapping.Builder(indexname, aliasname).build(); + JestResult rst = jestClient.execute(getAliases); + log.info(rst.getJsonString()); + List list = new ArrayList(); + if (rst.isSucceeded()) { + JsonParser jp = new JsonParser(); + JsonObject jo = (JsonObject)jp.parse(rst.getJsonString()); + for(Map.Entry entry : jo.entrySet()){ + String tindex = entry.getKey(); + if (indexname.equals(tindex)) { + continue; + } + AliasMapping m = new RemoveAliasMapping.Builder(tindex, aliasname).build(); + String s = new Gson().toJson(m.getData()); + log.info(s); + if (needClean) { + list.add(m); + } + } + } + + ModifyAliases modifyAliases = new ModifyAliases.Builder(addAliasMapping).addAlias(list).setParameter("master_timeout", "5m").build(); + rst = jestClient.execute(modifyAliases); + if (!rst.isSucceeded()) { + log.error(rst.getErrorMessage()); + return false; + } + return true; + } + + public JestResult bulkInsert(Bulk.Builder bulk, int trySize) throws Exception { + // es_rejected_execution_exception + // illegal_argument_exception + // cluster_block_exception + JestResult rst = null; + rst = jestClient.execute(bulk.build()); + if (!rst.isSucceeded()) { + log.warn(rst.getErrorMessage()); + } + return rst; + } + + /** + * 关闭JestClient客户端 + * + */ + public void closeJestClient() { + if (jestClient != null) { + jestClient.shutdownClient(); + } + } +} diff --git a/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESColumn.java b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESColumn.java new file mode 100644 index 0000000000..8990d77c21 --- /dev/null +++ b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESColumn.java @@ -0,0 +1,65 @@ +package com.alibaba.datax.plugin.writer.elasticsearchwriter; + +/** + * Created by xiongfeng.bxf on 17/3/2. 
+ */ +public class ESColumn { + + private String name;//: "appkey", + + private String type;//": "TEXT", + + private String timezone; + + private String format; + + private Boolean array; + + public void setName(String name) { + this.name = name; + } + + public void setType(String type) { + this.type = type; + } + + public void setTimeZone(String timezone) { + this.timezone = timezone; + } + + public void setFormat(String format) { + this.format = format; + } + + public String getName() { + return name; + } + + public String getType() { + return type; + } + + public String getTimezone() { + return timezone; + } + + public String getFormat() { + return format; + } + + public void setTimezone(String timezone) { + this.timezone = timezone; + } + + public Boolean isArray() { + return array; + } + + public void setArray(Boolean array) { + this.array = array; + } + + public Boolean getArray() { + return array; + } +} diff --git a/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESFieldType.java b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESFieldType.java new file mode 100644 index 0000000000..14b096891a --- /dev/null +++ b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESFieldType.java @@ -0,0 +1,47 @@ +package com.alibaba.datax.plugin.writer.elasticsearchwriter; + +/** + * Created by xiongfeng.bxf on 17/3/1. + */ +public enum ESFieldType { + ID, + STRING, + TEXT, + KEYWORD, + LONG, + INTEGER, + SHORT, + BYTE, + DOUBLE, + FLOAT, + DATE, + BOOLEAN, + BINARY, + INTEGER_RANGE, + FLOAT_RANGE, + LONG_RANGE, + DOUBLE_RANGE, + DATE_RANGE, + GEO_POINT, + GEO_SHAPE, + + IP, + COMPLETION, + TOKEN_COUNT, + + ARRAY, + OBJECT, + NESTED; + + public static ESFieldType getESFieldType(String type) { + if (type == null) { + return null; + } + for (ESFieldType f : ESFieldType.values()) { + if (f.name().compareTo(type.toUpperCase()) == 0) { + return f; + } + } + return null; + } +} diff --git a/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESWriter.java b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESWriter.java new file mode 100644 index 0000000000..eb0e9a8137 --- /dev/null +++ b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESWriter.java @@ -0,0 +1,460 @@ +package com.alibaba.datax.plugin.writer.elasticsearchwriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONObject; +import com.alibaba.fastjson.TypeReference; +import io.searchbox.client.JestResult; +import io.searchbox.core.Bulk; +import io.searchbox.core.BulkResult; +import io.searchbox.core.Index; +import org.joda.time.DateTime; +import org.joda.time.DateTimeZone; +import org.joda.time.format.DateTimeFormat; +import org.joda.time.format.DateTimeFormatter; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.net.URLEncoder; +import java.util.*; +import java.util.concurrent.Callable; + +public class ESWriter extends Writer { + private final static String WRITE_COLUMNS = "write_columns"; + + public 
static class Job extends Writer.Job { + private static final Logger log = LoggerFactory.getLogger(Job.class); + + private Configuration conf = null; + + @Override + public void init() { + this.conf = super.getPluginJobConf(); + } + + @Override + public void prepare() { + /** + * 注意:此方法仅执行一次。 + * 最佳实践:如果 Job 中有需要进行数据同步之前的处理,可以在此处完成,如果没有必要则可以直接去掉。 + */ + ESClient esClient = new ESClient(); + esClient.createClient(Key.getEndpoint(conf), + Key.getAccessID(conf), + Key.getAccessKey(conf), + false, + 300000, + false, + false); + + String indexName = Key.getIndexName(conf); + String typeName = Key.getTypeName(conf); + boolean dynamic = Key.getDynamic(conf); + String mappings = genMappings(typeName); + String settings = JSONObject.toJSONString( + Key.getSettings(conf) + ); + log.info(String.format("index:[%s], type:[%s], mappings:[%s]", indexName, typeName, mappings)); + + try { + boolean isIndicesExists = esClient.indicesExists(indexName); + if (Key.isCleanup(this.conf) && isIndicesExists) { + esClient.deleteIndex(indexName); + } + // 强制创建,内部自动忽略已存在的情况 + if (!esClient.createIndex(indexName, typeName, mappings, settings, dynamic)) { + throw new IOException("create index or mapping failed"); + } + } catch (Exception ex) { + throw DataXException.asDataXException(ESWriterErrorCode.ES_MAPPINGS, ex.toString()); + } + esClient.closeJestClient(); + } + + private String genMappings(String typeName) { + String mappings = null; + Map propMap = new HashMap(); + List columnList = new ArrayList(); + + List column = conf.getList("column"); + if (column != null) { + for (Object col : column) { + JSONObject jo = JSONObject.parseObject(col.toString()); + String colName = jo.getString("name"); + String colTypeStr = jo.getString("type"); + if (colTypeStr == null) { + throw DataXException.asDataXException(ESWriterErrorCode.BAD_CONFIG_VALUE, col.toString() + " column must have type"); + } + ESFieldType colType = ESFieldType.getESFieldType(colTypeStr); + if (colType == null) { + throw DataXException.asDataXException(ESWriterErrorCode.BAD_CONFIG_VALUE, col.toString() + " unsupported type"); + } + + ESColumn columnItem = new ESColumn(); + + if (colName.equals(Key.PRIMARY_KEY_COLUMN_NAME)) { + // 兼容已有版本 + colType = ESFieldType.ID; + colTypeStr = "id"; + } + + columnItem.setName(colName); + columnItem.setType(colTypeStr); + + if (colType == ESFieldType.ID) { + columnList.add(columnItem); + // 如果是id,则properties为空 + continue; + } + + Boolean array = jo.getBoolean("array"); + if (array != null) { + columnItem.setArray(array); + } + Map field = new HashMap(); + field.put("type", colTypeStr); + //https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking_50_mapping_changes.html#_literal_index_literal_property + // https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_deep_dive_on_doc_values.html#_disabling_doc_values + field.put("doc_values", jo.getBoolean("doc_values")); + field.put("ignore_above", jo.getInteger("ignore_above")); + field.put("index", jo.getBoolean("index")); + + switch (colType) { + case STRING: + // 兼容string类型,ES5之前版本 + break; + case KEYWORD: + // https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html#_warm_up_global_ordinals + field.put("eager_global_ordinals", jo.getBoolean("eager_global_ordinals")); + case TEXT: + field.put("analyzer", jo.getString("analyzer")); + // 优化disk使用,也同步会提高index性能 + // https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html + field.put("norms", jo.getBoolean("norms")); + field.put("index_options", 
jo.getBoolean("index_options")); + break; + case DATE: + columnItem.setTimeZone(jo.getString("timezone")); + columnItem.setFormat(jo.getString("format")); + // 后面时间会处理为带时区的标准时间,所以不需要给ES指定格式 + /* + if (jo.getString("format") != null) { + field.put("format", jo.getString("format")); + } else { + //field.put("format", "strict_date_optional_time||epoch_millis||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"); + } + */ + break; + case GEO_SHAPE: + field.put("tree", jo.getString("tree")); + field.put("precision", jo.getString("precision")); + default: + break; + } + propMap.put(colName, field); + columnList.add(columnItem); + } + } + + conf.set(WRITE_COLUMNS, JSON.toJSONString(columnList)); + + log.info(JSON.toJSONString(columnList)); + + Map rootMappings = new HashMap(); + Map typeMappings = new HashMap(); + typeMappings.put("properties", propMap); + rootMappings.put(typeName, typeMappings); + + mappings = JSON.toJSONString(rootMappings); + + if (mappings == null || "".equals(mappings)) { + throw DataXException.asDataXException(ESWriterErrorCode.BAD_CONFIG_VALUE, "must have mappings"); + } + + return mappings; + } + + @Override + public List split(int mandatoryNumber) { + List configurations = new ArrayList(mandatoryNumber); + for (int i = 0; i < mandatoryNumber; i++) { + configurations.add(conf); + } + return configurations; + } + + @Override + public void post() { + ESClient esClient = new ESClient(); + esClient.createClient(Key.getEndpoint(conf), + Key.getAccessID(conf), + Key.getAccessKey(conf), + false, + 300000, + false, + false); + String alias = Key.getAlias(conf); + if (!"".equals(alias)) { + log.info(String.format("alias [%s] to [%s]", alias, Key.getIndexName(conf))); + try { + esClient.alias(Key.getIndexName(conf), alias, Key.isNeedCleanAlias(conf)); + } catch (IOException e) { + throw DataXException.asDataXException(ESWriterErrorCode.ES_ALIAS_MODIFY, e); + } + } + } + + @Override + public void destroy() { + + } + } + + public static class Task extends Writer.Task { + + private static final Logger log = LoggerFactory.getLogger(Job.class); + + private Configuration conf; + + + ESClient esClient = null; + private List typeList; + private List columnList; + + private int trySize; + private int batchSize; + private String index; + private String type; + private String splitter; + + @Override + public void init() { + this.conf = super.getPluginJobConf(); + index = Key.getIndexName(conf); + type = Key.getTypeName(conf); + + trySize = Key.getTrySize(conf); + batchSize = Key.getBatchSize(conf); + splitter = Key.getSplitter(conf); + columnList = JSON.parseObject(this.conf.getString(WRITE_COLUMNS), new TypeReference>() { + }); + + typeList = new ArrayList(); + + for (ESColumn col : columnList) { + typeList.add(ESFieldType.getESFieldType(col.getType())); + } + + esClient = new ESClient(); + } + + @Override + public void prepare() { + esClient.createClient(Key.getEndpoint(conf), + Key.getAccessID(conf), + Key.getAccessKey(conf), + Key.isMultiThread(conf), + Key.getTimeout(conf), + Key.isCompression(conf), + Key.isDiscovery(conf)); + } + + @Override + public void startWrite(RecordReceiver recordReceiver) { + List writerBuffer = new ArrayList(this.batchSize); + Record record = null; + long total = 0; + while ((record = recordReceiver.getFromReader()) != null) { + writerBuffer.add(record); + if (writerBuffer.size() >= this.batchSize) { + total += doBatchInsert(writerBuffer); + writerBuffer.clear(); + } + } + + if (!writerBuffer.isEmpty()) { + total += doBatchInsert(writerBuffer); + writerBuffer.clear(); + } + + 
String msg = String.format("task end, write size :%d", total); + getTaskPluginCollector().collectMessage("writesize", String.valueOf(total)); + log.info(msg); + esClient.closeJestClient(); + } + + private String getDateStr(ESColumn esColumn, Column column) { + DateTime date = null; + DateTimeZone dtz = DateTimeZone.getDefault(); + if (esColumn.getTimezone() != null) { + // 所有时区参考 http://www.joda.org/joda-time/timezones.html + dtz = DateTimeZone.forID(esColumn.getTimezone()); + } + if (column.getType() != Column.Type.DATE && esColumn.getFormat() != null) { + DateTimeFormatter formatter = DateTimeFormat.forPattern(esColumn.getFormat()); + date = formatter.withZone(dtz).parseDateTime(column.asString()); + return date.toString(); + } else if (column.getType() == Column.Type.DATE) { + date = new DateTime(column.asLong(), dtz); + return date.toString(); + } else { + return column.asString(); + } + } + + private long doBatchInsert(final List writerBuffer) { + Map data = null; + final Bulk.Builder bulkaction = new Bulk.Builder().defaultIndex(this.index).defaultType(this.type); + for (Record record : writerBuffer) { + data = new HashMap(); + String id = null; + for (int i = 0; i < record.getColumnNumber(); i++) { + Column column = record.getColumn(i); + String columnName = columnList.get(i).getName(); + ESFieldType columnType = typeList.get(i); + //如果是数组类型,那它传入的必是字符串类型 + if (columnList.get(i).isArray() != null && columnList.get(i).isArray()) { + String[] dataList = column.asString().split(splitter); + if (!columnType.equals(ESFieldType.DATE)) { + data.put(columnName, dataList); + } else { + for (int pos = 0; pos < dataList.length; pos++) { + dataList[pos] = getDateStr(columnList.get(i), column); + } + data.put(columnName, dataList); + } + } else { + switch (columnType) { + case ID: + if (id != null) { + id += record.getColumn(i).asString(); + } else { + id = record.getColumn(i).asString(); + } + break; + case DATE: + try { + String dateStr = getDateStr(columnList.get(i), column); + data.put(columnName, dateStr); + } catch (Exception e) { + getTaskPluginCollector().collectDirtyRecord(record, String.format("时间类型解析失败 [%s:%s] exception: %s", columnName, column.toString(), e.toString())); + } + break; + case KEYWORD: + case STRING: + case TEXT: + case IP: + case GEO_POINT: + data.put(columnName, column.asString()); + break; + case BOOLEAN: + data.put(columnName, column.asBoolean()); + break; + case BYTE: + case BINARY: + data.put(columnName, column.asBytes()); + break; + case LONG: + data.put(columnName, column.asLong()); + break; + case INTEGER: + data.put(columnName, column.asBigInteger()); + break; + case SHORT: + data.put(columnName, column.asBigInteger()); + break; + case FLOAT: + case DOUBLE: + data.put(columnName, column.asDouble()); + break; + case NESTED: + case OBJECT: + case GEO_SHAPE: + data.put(columnName, JSON.parse(column.asString())); + break; + default: + getTaskPluginCollector().collectDirtyRecord(record, "类型错误:不支持的类型:" + columnType + " " + columnName); + } + } + } + + if (id == null) { + //id = UUID.randomUUID().toString(); + bulkaction.addAction(new Index.Builder(data).build()); + } else { + bulkaction.addAction(new Index.Builder(data).id(id).build()); + } + } + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public Integer call() throws Exception { + JestResult jestResult = esClient.bulkInsert(bulkaction, 1); + if (jestResult.isSucceeded()) { + return writerBuffer.size(); + } + + String msg = String.format("response code: [%d] error :[%s]", 
jestResult.getResponseCode(), jestResult.getErrorMessage()); + log.warn(msg); + if (esClient.isBulkResult(jestResult)) { + BulkResult brst = (BulkResult) jestResult; + List failedItems = brst.getFailedItems(); + for (BulkResult.BulkResultItem item : failedItems) { + if (item.status != 400) { + // 400 BAD_REQUEST 如果非数据异常,请求异常,则不允许忽略 + throw DataXException.asDataXException(ESWriterErrorCode.ES_INDEX_INSERT, String.format("status:[%d], error: %s", item.status, item.error)); + } else { + // 如果用户选择不忽略解析错误,则抛异常,默认为忽略 + if (!Key.isIgnoreParseError(conf)) { + throw DataXException.asDataXException(ESWriterErrorCode.ES_INDEX_INSERT, String.format("status:[%d], error: %s, config not ignoreParseError so throw this error", item.status, item.error)); + } + } + } + + List items = brst.getItems(); + for (int idx = 0; idx < items.size(); ++idx) { + BulkResult.BulkResultItem item = items.get(idx); + if (item.error != null && !"".equals(item.error)) { + getTaskPluginCollector().collectDirtyRecord(writerBuffer.get(idx), String.format("status:[%d], error: %s", item.status, item.error)); + } + } + return writerBuffer.size() - brst.getFailedItems().size(); + } else { + Integer status = esClient.getStatus(jestResult); + switch (status) { + case 429: //TOO_MANY_REQUESTS + log.warn("server response too many requests, so auto reduce speed"); + break; + } + throw DataXException.asDataXException(ESWriterErrorCode.ES_INDEX_INSERT, jestResult.getErrorMessage()); + } + } + }, trySize, 60000L, true); + } catch (Exception e) { + if (Key.isIgnoreWriteError(this.conf)) { + log.warn(String.format("重试[%d]次写入失败,忽略该错误,继续写入!", trySize)); + } else { + throw DataXException.asDataXException(ESWriterErrorCode.ES_INDEX_INSERT, e); + } + } + return 0; + } + + @Override + public void post() { + } + + @Override + public void destroy() { + esClient.closeJestClient(); + } + } +} diff --git a/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESWriterErrorCode.java b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESWriterErrorCode.java new file mode 100644 index 0000000000..59dcbd0ae1 --- /dev/null +++ b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/ESWriterErrorCode.java @@ -0,0 +1,37 @@ +package com.alibaba.datax.plugin.writer.elasticsearchwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum ESWriterErrorCode implements ErrorCode { + BAD_CONFIG_VALUE("ESWriter-00", "您配置的值不合法."), + ES_INDEX_DELETE("ESWriter-01", "删除index错误."), + ES_INDEX_CREATE("ESWriter-02", "创建index错误."), + ES_MAPPINGS("ESWriter-03", "mappings错误."), + ES_INDEX_INSERT("ESWriter-04", "插入数据错误."), + ES_ALIAS_MODIFY("ESWriter-05", "别名修改错误."), + ; + + private final String code; + private final String description; + + ESWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } +} \ No newline at end of file diff --git a/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/Key.java b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/Key.java new file mode 100644 index 0000000000..0f2d3f5c20 --- /dev/null +++ b/elasticsearchwriter/src/main/java/com/alibaba/datax/plugin/writer/elasticsearchwriter/Key.java @@ -0,0 +1,131 @@ +package com.alibaba.datax.plugin.writer.elasticsearchwriter; + +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; + +import java.util.HashMap; +import java.util.Map; + +public final class Key { + // ---------------------------------------- + // 类型定义 主键字段定义 + // ---------------------------------------- + public static final String PRIMARY_KEY_COLUMN_NAME = "pk"; + + public static enum ActionType { + UNKONW, + INDEX, + CREATE, + DELETE, + UPDATE + } + + public static ActionType getActionType(Configuration conf) { + String actionType = conf.getString("actionType", "index"); + if ("index".equals(actionType)) { + return ActionType.INDEX; + } else if ("create".equals(actionType)) { + return ActionType.CREATE; + } else if ("delete".equals(actionType)) { + return ActionType.DELETE; + } else if ("update".equals(actionType)) { + return ActionType.UPDATE; + } else { + return ActionType.UNKONW; + } + } + + + public static String getEndpoint(Configuration conf) { + return conf.getNecessaryValue("endpoint", ESWriterErrorCode.BAD_CONFIG_VALUE); + } + + public static String getAccessID(Configuration conf) { + return conf.getString("accessId", ""); + } + + public static String getAccessKey(Configuration conf) { + return conf.getString("accessKey", ""); + } + + public static int getBatchSize(Configuration conf) { + return conf.getInt("batchSize", 1000); + } + + public static int getTrySize(Configuration conf) { + return conf.getInt("trySize", 30); + } + + public static int getTimeout(Configuration conf) { + return conf.getInt("timeout", 600000); + } + + public static boolean isCleanup(Configuration conf) { + return conf.getBool("cleanup", false); + } + + public static boolean isDiscovery(Configuration conf) { + return conf.getBool("discovery", false); + } + + public static boolean isCompression(Configuration conf) { + return conf.getBool("compression", true); + } + + public static boolean isMultiThread(Configuration conf) { + return conf.getBool("multiThread", true); + } + + public static String getIndexName(Configuration conf) { + return conf.getNecessaryValue("index", ESWriterErrorCode.BAD_CONFIG_VALUE); + } + + public static String getTypeName(Configuration conf) { + String indexType = conf.getString("indexType"); + if(StringUtils.isBlank(indexType)){ + indexType = conf.getString("type", getIndexName(conf)); + } + return indexType; + } + + + public static boolean isIgnoreWriteError(Configuration conf) { + return conf.getBool("ignoreWriteError", false); + } + + public static boolean isIgnoreParseError(Configuration conf) { + return conf.getBool("ignoreParseError", true); + } + + + public static boolean isHighSpeedMode(Configuration conf) { + if ("highspeed".equals(conf.getString("mode", ""))) { + return true; + } + return false; + } + + public static String getAlias(Configuration conf) { + return conf.getString("alias", ""); + } + + public static boolean isNeedCleanAlias(Configuration conf) { + String mode = conf.getString("aliasMode", "append"); + if ("exclusive".equals(mode)) { + return true; + } + return 
false; + } + + public static Map getSettings(Configuration conf) { + return conf.getMap("settings", new HashMap()); + } + + public static String getSplitter(Configuration conf) { + return conf.getString("splitter", "-,-"); + } + + public static boolean getDynamic(Configuration conf) { + return conf.getBool("dynamic", false); + } +} diff --git a/elasticsearchwriter/src/main/resources/plugin.json b/elasticsearchwriter/src/main/resources/plugin.json new file mode 100644 index 0000000000..b6e6384bce --- /dev/null +++ b/elasticsearchwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "elasticsearchwriter", + "class": "com.alibaba.datax.plugin.writer.elasticsearchwriter.ESWriter", + "description": "适用于: 生产环境. 原理: TODO", + "developer": "alibaba" +} \ No newline at end of file diff --git a/ftpreader/doc/ftpreader.md b/ftpreader/doc/ftpreader.md new file mode 100644 index 0000000000..770c6a9c96 --- /dev/null +++ b/ftpreader/doc/ftpreader.md @@ -0,0 +1,329 @@ +# DataX FtpReader 说明 + + +------------ + +## 1 快速介绍 + +FtpReader提供了读取远程FTP文件系统数据存储的能力。在底层实现上,FtpReader获取远程FTP文件数据,并转换为DataX传输协议传递给Writer。 + +**本地文件内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +## 2 功能与限制 + +FtpReader实现了从远程FTP文件读取数据并转为DataX协议的功能,远程FTP文件本身是无结构化数据存储,对于DataX而言,FtpReader实现上类比TxtFileReader,有诸多相似之处。目前FtpReader支持功能如下: + +1. 支持且仅支持读取TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +4. 支持递归读取、支持文件名过滤。 + +5. 支持文本压缩,现有压缩格式为zip、gzip、bzip2。 + +6. 多个File可以支持并发读取。 + +我们暂时不能做到: + +1. 单个File支持多线程并发读取,这里涉及到单个File内部切分算法。二期考虑支持。 + +2. 单个File在压缩情况下,从技术上无法支持多线程并发读取。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "ftpreader", + "parameter": { + "protocol": "sftp", + "host": "127.0.0.1", + "port": 22, + "username": "xx", + "password": "xxx", + "path": [ + "/home/hanfa.shf/ftpReaderTest/data" + ], + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "encoding": "UTF-8", + "fieldDelimiter": "," + } + }, + "writer": { + "name": "ftpWriter", + "parameter": { + "path": "/home/hanfa.shf/ftpReaderTest/result", + "fileName": "shihf", + "writeMode": "truncate", + "format": "yyyy-MM-dd" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **protocol** + + * 描述:ftp服务器协议,目前支持传输协议有ftp和sftp。
+ + * 必选:是
+ + * 默认值:无
+ +* **host** + + * 描述:ftp服务器地址。
+ + * 必选:是
+ + * 默认值:无
+ +* **port** + + * 描述:ftp服务器端口。
+ + * 必选:否
+ + * 默认值:若传输协议是 sftp 协议,默认为 22;若传输协议是标准 ftp 协议,默认为 21
+ +* **timeout** + + * 描述:连接ftp服务器连接超时时间,单位毫秒。
+ + * 必选:否
+ + * 默认值:60000(1分钟)
+* **connectPattern** + + * 描述:连接模式(主动模式或被动模式)。该参数只在传输协议为标准 ftp 协议时使用,取值只能为 PORT(主动)或 PASV(被动)。两种模式的区别在于数据连接由哪一方发起:PORT 模式下,客户端在本地打开一个端口,等待服务器连接过来建立数据连接;PASV 模式下,服务器打开一个端口,等待客户端去建立数据连接。
+ + * 必选:否
+ + * 默认值:PASV
+ +* **username** + + * 描述:ftp服务器访问用户名。
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:ftp服务器访问密码。
+ + * 必选:是
+ + * 默认值:无
+ +* **path** + + * 描述:远程FTP文件系统的路径信息,注意这里可以支持填写多个路径。
+ + 当指定单个远程FTP文件时,FtpReader暂时只能使用单线程进行数据抽取,二期考虑在非压缩文件情况下针对单个File进行多线程并发读取。 + + 当指定多个远程FTP文件时,FtpReader支持使用多线程进行数据抽取,线程并发数通过通道数指定。 + + 当指定通配符时,FtpReader尝试遍历出多个文件信息。例如:指定/*代表读取/目录下所有的文件,指定/bazhen/\*代表读取bazhen目录下所有的文件。**FtpReader目前只支持\*作为文件通配符。** + + **特别需要注意的是,DataX会将一个作业下同步的所有Text File视作同一张数据表,用户必须自己保证所有的File能够适配同一套schema信息,内容为类CSV格式,并且提供给DataX权限可读。** + + **特别需要注意的是,如果Path指定的路径下没有符合匹配的文件可供抽取,DataX将报错。** + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:读取字段列表,type指定源数据的类型,index指定当前列来自于文本第几列(以0开始),value指定当前类型为常量,不从源头文件读取数据,而是根据value值自动生成对应的列。
+ + 默认情况下,用户可以全部按照String类型读取数据,配置如下: + + ```json + "column": ["*"] + ``` + + 用户可以指定Column字段信息,配置如下: + + ```json + { + "type": "long", + "index": 0 //从远程FTP文件文本第一列获取int字段 + }, + { + "type": "string", + "value": "alibaba" //从FtpReader内部生成alibaba的字符串字段作为当前字段 + } + ``` + + 对于用户指定Column信息,type必须填写,index/value必须选择其一。 + + * 必选:是
+ + * 默认值:全部按照string类型读取
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + * 必选:是
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为zip、gzip、bzip2。
+ + * 必选:否
+ + * 默认值:没有压缩
+ +* **encoding** + + * 描述:读取文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ +* **skipHeader** + + * 描述:类CSV格式文件可能存在表头(标题行),将该项配置为 true 可以跳过表头。默认不跳过。
+ + * 必选:否
+ + * 默认值:false
+ +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如用户配置 nullFormat:"\N",那么当源头数据是"\N"时,DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
+ +* **maxTraversalLevel** + + * 描述:允许遍历文件夹的最大层数。
+ + * 必选:否
+ + * 默认值:100
+ + +* **csvReaderConfig** + + * 描述:读取CSV类型文件的参数配置,Map类型。读取CSV类型文件时使用CsvReader进行读取,其包含很多配置项,不配置则使用默认值。
+ + * 必选:否
+ + * 默认值:无
+ + +常见配置: + +```json +"csvReaderConfig":{ + "safetySwitch": false, + "skipEmptyRecords": false, + "useTextQualifier": false +} +``` + +所有配置项及默认值,配置时 csvReaderConfig 的map中请**严格按照以下字段名字进行配置**: + +``` +boolean caseSensitive = true; +char textQualifier = 34; +boolean trimWhitespace = true; +boolean useTextQualifier = true;//是否使用csv转义字符 +char delimiter = 44;//分隔符 +char recordDelimiter = 0; +char comment = 35; +boolean useComments = false; +int escapeMode = 1; +boolean safetySwitch = true;//单列长度是否限制100000字符 +boolean skipEmptyRecords = true;//是否跳过空行 +boolean captureRawRecord = true; +``` + + +### 3.3 类型转换 + +远程FTP文件本身不提供数据类型,该类型是DataX FtpReader定义: + +| DataX 内部类型| 远程FTP文件 数据类型 | +| -------- | ----- | +| +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* 远程FTP文件 Long是指远程FTP文件文本中使用整形的字符串表示形式,例如"19901219"。 +* 远程FTP文件 Double是指远程FTP文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* 远程FTP文件 Boolean是指远程FTP文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* 远程FTP文件 Date是指远程FTP文件文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 4 性能报告 + + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/ftpreader/pom.xml b/ftpreader/pom.xml new file mode 100755 index 0000000000..c06c654275 --- /dev/null +++ b/ftpreader/pom.xml @@ -0,0 +1,92 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + ftpreader + ftpreader + FtpReader提供了读取指定ftp服务器文件功能,并可以根据用户配置的类型进行类型转换,建议开发、测试环境使用。 + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + com.jcraft + jsch + 0.1.51 + + + commons-net + commons-net + 3.3 + + + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/ftpreader/src/main/assembly/package.xml b/ftpreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..f94fc5bd5b --- /dev/null +++ b/ftpreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/ftpreader + + + target/ + + ftpreader-0.0.1-SNAPSHOT.jar + + plugin/reader/ftpreader + + + + + + false + plugin/reader/ftpreader/libs + runtime + + + diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Constant.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Constant.java new file mode 100755 index 0000000000..15019fdb50 --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Constant.java @@ -0,0 +1,14 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + + +public class Constant { + public static final String SOURCE_FILES = "sourceFiles"; + + public static final int DEFAULT_FTP_PORT = 21; + public static final int DEFAULT_SFTP_PORT = 22; + public static final int DEFAULT_TIMEOUT = 60000; + public static final int DEFAULT_MAX_TRAVERSAL_LEVEL = 100; + public static final String DEFAULT_FTP_CONNECT_PATTERN = "PASV"; + + +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpHelper.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpHelper.java new file mode 100644 index 0000000000..f8b3f56f21 --- /dev/null +++ 
b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpHelper.java @@ -0,0 +1,107 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import java.io.InputStream; +import java.util.HashSet; +import java.util.List; + +public abstract class FtpHelper { + /** + * + * @Title: LoginFtpServer + * @Description: 与ftp服务器建立连接 + * @param @param host + * @param @param username + * @param @param password + * @param @param port + * @param @param timeout + * @param @param connectMode + * @return void + * @throws + */ + public abstract void loginFtpServer(String host, String username, String password, int port, int timeout,String connectMode) ; + /** + * + * @Title: LogoutFtpServer + * todo 方法名首字母 + * @Description: 断开与ftp服务器的连接 + * @param + * @return void + * @throws + */ + public abstract void logoutFtpServer(); + /** + * + * @Title: isDirExist + * @Description: 判断指定路径是否是目录 + * @param @param directoryPath + * @param @return + * @return boolean + * @throws + */ + public abstract boolean isDirExist(String directoryPath); + /** + * + * @Title: isFileExist + * @Description: 判断指定路径是否是文件 + * @param @param filePath + * @param @return + * @return boolean + * @throws + */ + public abstract boolean isFileExist(String filePath); + /** + * + * @Title: isSymbolicLink + * @Description: 判断指定路径是否是软链接 + * @param @param filePath + * @param @return + * @return boolean + * @throws + */ + public abstract boolean isSymbolicLink(String filePath); + /** + * + * @Title: getListFiles + * @Description: 递归获取指定路径下符合条件的所有文件绝对路径 + * @param @param directoryPath + * @param @param parentLevel 父目录的递归层数(首次为0) + * @param @param maxTraversalLevel 允许的最大递归层数 + * @param @return + * @return HashSet + * @throws + */ + public abstract HashSet getListFiles(String directoryPath, int parentLevel, int maxTraversalLevel); + + /** + * + * @Title: getInputStream + * @Description: 获取指定路径的输入流 + * @param @param filePath + * @param @return + * @return InputStream + * @throws + */ + public abstract InputStream getInputStream(String filePath); + + /** + * + * @Title: getAllFiles + * @Description: 获取指定路径列表下符合条件的所有文件的绝对路径 + * @param @param srcPaths 路径列表 + * @param @param parentLevel 父目录的递归层数(首次为0) + * @param @param maxTraversalLevel 允许的最大递归层数 + * @param @return + * @return HashSet + * @throws + */ + public HashSet getAllFiles(List srcPaths, int parentLevel, int maxTraversalLevel){ + HashSet sourceAllFiles = new HashSet(); + if (!srcPaths.isEmpty()) { + for (String eachPath : srcPaths) { + sourceAllFiles.addAll(getListFiles(eachPath, parentLevel, maxTraversalLevel)); + } + } + return sourceAllFiles; + } + +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReader.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReader.java new file mode 100644 index 0000000000..c1f20dfd7f --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReader.java @@ -0,0 +1,253 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import java.io.InputStream; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; + +public class FtpReader extends Reader { + public static class Job extends 
Reader.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration originConfig = null; + + private List path = null; + + private HashSet sourceFiles; + + // ftp链接参数 + private String protocol; + private String host; + private int port; + private String username; + private String password; + private int timeout; + private String connectPattern; + private int maxTraversalLevel; + + private FtpHelper ftpHelper = null; + + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + this.sourceFiles = new HashSet(); + + this.validateParameter(); + UnstructuredStorageReaderUtil.validateParameter(this.originConfig); + + if ("sftp".equals(protocol)) { + //sftp协议 + this.port = originConfig.getInt(Key.PORT, Constant.DEFAULT_SFTP_PORT); + this.ftpHelper = new SftpHelper(); + } else if ("ftp".equals(protocol)) { + // ftp 协议 + this.port = originConfig.getInt(Key.PORT, Constant.DEFAULT_FTP_PORT); + this.ftpHelper = new StandardFtpHelper(); + } + ftpHelper.loginFtpServer(host, username, password, port, timeout, connectPattern); + + } + + private void validateParameter() { + //todo 常量 + this.protocol = this.originConfig.getNecessaryValue(Key.PROTOCOL, FtpReaderErrorCode.REQUIRED_VALUE); + boolean ptrotocolTag = "ftp".equals(this.protocol) || "sftp".equals(this.protocol); + if (!ptrotocolTag) { + throw DataXException.asDataXException(FtpReaderErrorCode.ILLEGAL_VALUE, + String.format("仅支持 ftp和sftp 传输协议 , 不支持您配置的传输协议: [%s]", protocol)); + } + this.host = this.originConfig.getNecessaryValue(Key.HOST, FtpReaderErrorCode.REQUIRED_VALUE); + this.username = this.originConfig.getNecessaryValue(Key.USERNAME, FtpReaderErrorCode.REQUIRED_VALUE); + this.password = this.originConfig.getNecessaryValue(Key.PASSWORD, FtpReaderErrorCode.REQUIRED_VALUE); + this.timeout = originConfig.getInt(Key.TIMEOUT, Constant.DEFAULT_TIMEOUT); + this.maxTraversalLevel = originConfig.getInt(Key.MAXTRAVERSALLEVEL, Constant.DEFAULT_MAX_TRAVERSAL_LEVEL); + + // only support connect pattern + this.connectPattern = this.originConfig.getUnnecessaryValue(Key.CONNECTPATTERN, Constant.DEFAULT_FTP_CONNECT_PATTERN, null); + boolean connectPatternTag = "PORT".equals(connectPattern) || "PASV".equals(connectPattern); + if (!connectPatternTag) { + throw DataXException.asDataXException(FtpReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的ftp传输模式: [%s]", connectPattern)); + }else{ + this.originConfig.set(Key.CONNECTPATTERN, connectPattern); + } + + //path check + String pathInString = this.originConfig.getNecessaryValue(Key.PATH, FtpReaderErrorCode.REQUIRED_VALUE); + if (!pathInString.startsWith("[") && !pathInString.endsWith("]")) { + path = new ArrayList(); + path.add(pathInString); + } else { + path = this.originConfig.getList(Key.PATH, String.class); + if (null == path || path.size() == 0) { + throw DataXException.asDataXException(FtpReaderErrorCode.REQUIRED_VALUE, "您需要指定待读取的源目录或文件"); + } + for (String eachPath : path) { + if(!eachPath.startsWith("/")){ + String message = String.format("请检查参数path:[%s],需要配置为绝对路径", eachPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.ILLEGAL_VALUE, message); + } + } + } + + } + + @Override + public void prepare() { + LOG.debug("prepare() begin..."); + + this.sourceFiles = ftpHelper.getAllFiles(path, 0, maxTraversalLevel); + + LOG.info(String.format("您即将读取的文件数为: [%s]", this.sourceFiles.size())); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + try { + 
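+                // 任务收尾:断开与ftp服务器的连接;即使断开失败也只记录错误日志,不影响已完成的数据读取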
this.ftpHelper.logoutFtpServer(); + } catch (Exception e) { + String message = String.format( + "关闭与ftp服务器连接失败: [%s] host=%s, username=%s, port=%s", + e.getMessage(), host, username, port); + LOG.error(message, e); + } + } + + // warn: 如果源目录为空会报错,拖空目录意图=>空文件显示指定此意图 + @Override + public List split(int adviceNumber) { + LOG.debug("split() begin..."); + List readerSplitConfigs = new ArrayList(); + + // warn:每个slice拖且仅拖一个文件, + // int splitNumber = adviceNumber; + int splitNumber = this.sourceFiles.size(); + if (0 == splitNumber) { + throw DataXException.asDataXException(FtpReaderErrorCode.EMPTY_DIR_EXCEPTION, + String.format("未能找到待读取的文件,请确认您的配置项path: %s", this.originConfig.getString(Key.PATH))); + } + + List> splitedSourceFiles = this.splitSourceFiles(new ArrayList(this.sourceFiles), splitNumber); + for (List files : splitedSourceFiles) { + Configuration splitedConfig = this.originConfig.clone(); + splitedConfig.set(Constant.SOURCE_FILES, files); + readerSplitConfigs.add(splitedConfig); + } + LOG.debug("split() ok and end..."); + return readerSplitConfigs; + } + + private List> splitSourceFiles(final List sourceList, int adviceNumber) { + List> splitedList = new ArrayList>(); + int averageLength = sourceList.size() / adviceNumber; + averageLength = averageLength == 0 ? 1 : averageLength; + + for (int begin = 0, end = 0; begin < sourceList.size(); begin = end) { + end = begin + averageLength; + if (end > sourceList.size()) { + end = sourceList.size(); + } + splitedList.add(sourceList.subList(begin, end)); + } + return splitedList; + } + + } + + public static class Task extends Reader.Task { + private static Logger LOG = LoggerFactory.getLogger(Task.class); + + private String host; + private int port; + private String username; + private String password; + private String protocol; + private int timeout; + private String connectPattern; + + private Configuration readerSliceConfig; + private List sourceFiles; + + private FtpHelper ftpHelper = null; + + @Override + public void init() {//连接重试 + /* for ftp connection */ + this.readerSliceConfig = this.getPluginJobConf(); + this.host = readerSliceConfig.getString(Key.HOST); + this.protocol = readerSliceConfig.getString(Key.PROTOCOL); + this.username = readerSliceConfig.getString(Key.USERNAME); + this.password = readerSliceConfig.getString(Key.PASSWORD); + this.timeout = readerSliceConfig.getInt(Key.TIMEOUT, Constant.DEFAULT_TIMEOUT); + + this.sourceFiles = this.readerSliceConfig.getList(Constant.SOURCE_FILES, String.class); + + if ("sftp".equals(protocol)) { + //sftp协议 + this.port = readerSliceConfig.getInt(Key.PORT, Constant.DEFAULT_SFTP_PORT); + this.ftpHelper = new SftpHelper(); + } else if ("ftp".equals(protocol)) { + // ftp 协议 + this.port = readerSliceConfig.getInt(Key.PORT, Constant.DEFAULT_FTP_PORT); + this.connectPattern = readerSliceConfig.getString(Key.CONNECTPATTERN, Constant.DEFAULT_FTP_CONNECT_PATTERN);// 默认为被动模式 + this.ftpHelper = new StandardFtpHelper(); + } + ftpHelper.loginFtpServer(host, username, password, port, timeout, connectPattern); + + } + + @Override + public void prepare() { + + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + try { + this.ftpHelper.logoutFtpServer(); + } catch (Exception e) { + String message = String.format( + "关闭与ftp服务器连接失败: [%s] host=%s, username=%s, port=%s", + e.getMessage(), host, username, port); + LOG.error(message, e); + } + } + + @Override + public void startRead(RecordSender recordSender) { + LOG.debug("start read source files..."); + for (String fileName : 
this.sourceFiles) { + LOG.info(String.format("reading file : [%s]", fileName)); + InputStream inputStream = null; + + inputStream = ftpHelper.getInputStream(fileName); + + UnstructuredStorageReaderUtil.readFromStream(inputStream, fileName, this.readerSliceConfig, + recordSender, this.getTaskPluginCollector()); + recordSender.flush(); + } + + LOG.debug("end read source files..."); + } + + } +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReaderErrorCode.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReaderErrorCode.java new file mode 100755 index 0000000000..3883ee6e81 --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/FtpReaderErrorCode.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-20. + */ +public enum FtpReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("FtpReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("FtpReader-01", "您填写的参数值不合法."), + MIXED_INDEX_VALUE("FtpReader-02", "您的列信息配置同时包含了index,value."), + NO_INDEX_VALUE("FtpReader-03","您明确的配置列信息,但未填写相应的index,value."), + + FILE_NOT_EXISTS("FtpReader-04", "您配置的目录文件路径不存在或者没有权限读取."), + OPEN_FILE_WITH_CHARSET_ERROR("FtpReader-05", "您配置的文件编码和实际文件编码不符合."), + OPEN_FILE_ERROR("FtpReader-06", "您配置的文件在打开时异常."), + READ_FILE_IO_ERROR("FtpReader-07", "您配置的文件在读取时出现IO异常."), + SECURITY_NOT_ENOUGH("FtpReader-08", "您缺少权限执行相应的文件操作."), + CONFIG_INVALID_EXCEPTION("FtpReader-09", "您的参数配置错误."), + RUNTIME_EXCEPTION("FtpReader-10", "出现运行时异常, 请联系我们"), + EMPTY_DIR_EXCEPTION("FtpReader-11", "您尝试读取的文件目录为空."), + + FAIL_LOGIN("FtpReader-12", "登录失败,无法与ftp服务器建立连接."), + FAIL_DISCONNECT("FtpReader-13", "关闭ftp连接失败,无法与ftp服务器断开连接."), + COMMAND_FTP_IO_EXCEPTION("FtpReader-14", "与ftp服务器连接异常."), + OUT_MAX_DIRECTORY_LEVEL("FtpReader-15", "超出允许的最大目录层数."), + LINK_FILE("FtpReader-16", "您尝试读取的文件为链接文件."),; + + private final String code; + private final String description; + + private FtpReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Key.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Key.java new file mode 100755 index 0000000000..cdbd043cd6 --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/Key.java @@ -0,0 +1,13 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +public class Key { + public static final String PROTOCOL = "protocol"; + public static final String HOST = "host"; + public static final String USERNAME = "username"; + public static final String PASSWORD = "password"; + public static final String PORT = "port"; + public static final String TIMEOUT = "timeout"; + public static final String CONNECTPATTERN = "connectPattern"; + public static final String PATH = "path"; + public static final String MAXTRAVERSALLEVEL = "maxTraversalLevel"; +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/SftpHelper.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/SftpHelper.java new file mode 100644 index 0000000000..d25b040c41 --- /dev/null 
+++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/SftpHelper.java @@ -0,0 +1,246 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import java.io.InputStream; +import java.util.HashSet; +import java.util.Properties; +import java.util.Vector; + +import org.apache.commons.io.IOUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import com.jcraft.jsch.ChannelSftp; +import com.jcraft.jsch.JSch; +import com.jcraft.jsch.JSchException; +import com.jcraft.jsch.Session; +import com.jcraft.jsch.SftpATTRS; +import com.jcraft.jsch.SftpException; +import com.jcraft.jsch.ChannelSftp.LsEntry; + +public class SftpHelper extends FtpHelper { + private static final Logger LOG = LoggerFactory.getLogger(SftpHelper.class); + + Session session = null; + ChannelSftp channelSftp = null; + @Override + public void loginFtpServer(String host, String username, String password, int port, int timeout, + String connectMode) { + JSch jsch = new JSch(); // 创建JSch对象 + try { + session = jsch.getSession(username, host, port); + // 根据用户名,主机ip,端口获取一个Session对象 + // 如果服务器连接不上,则抛出异常 + if (session == null) { + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, + "session is null,无法通过sftp与服务器建立链接,请检查主机名和用户名是否正确."); + } + + session.setPassword(password); // 设置密码 + Properties config = new Properties(); + config.put("StrictHostKeyChecking", "no"); + session.setConfig(config); // 为Session对象设置properties + session.setTimeout(timeout); // 设置timeout时间 + session.connect(); // 通过Session建立链接 + + channelSftp = (ChannelSftp) session.openChannel("sftp"); // 打开SFTP通道 + channelSftp.connect(); // 建立SFTP通道的连接 + + //设置命令传输编码 + //String fileEncoding = System.getProperty("file.encoding"); + //channelSftp.setFilenameEncoding(fileEncoding); + } catch (JSchException e) { + if(null != e.getCause()){ + String cause = e.getCause().toString(); + String unknownHostException = "java.net.UnknownHostException: " + host; + String illegalArgumentException = "java.lang.IllegalArgumentException: port out of range:" + port; + String wrongPort = "java.net.ConnectException: Connection refused"; + if (unknownHostException.equals(cause)) { + String message = String.format("请确认ftp服务器地址是否正确,无法连接到地址为: [%s] 的ftp服务器", host); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } else if (illegalArgumentException.equals(cause) || wrongPort.equals(cause) ) { + String message = String.format("请确认连接ftp服务器端口是否正确,错误的端口: [%s] ", port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } + }else { + if("Auth fail".equals(e.getMessage())){ + String message = String.format("与ftp服务器建立连接失败,请检查用户名和密码是否正确: [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message); + }else{ + String message = String.format("与ftp服务器建立连接失败 : [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } + } + } + + } + + @Override + public void logoutFtpServer() { + if (channelSftp != null) { + channelSftp.disconnect(); + } + if (session != null) { + session.disconnect(); + } + } + + @Override + public boolean isDirExist(String 
directoryPath) { + try { + SftpATTRS sftpATTRS = channelSftp.lstat(directoryPath); + return sftpATTRS.isDir(); + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + String message = String.format("进入目录:[%s]时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + @Override + public boolean isFileExist(String filePath) { + boolean isExitFlag = false; + try { + SftpATTRS sftpATTRS = channelSftp.lstat(filePath); + if(sftpATTRS.getSize() >= 0){ + isExitFlag = true; + } + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } else { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + return isExitFlag; + } + + @Override + public boolean isSymbolicLink(String filePath) { + try { + SftpATTRS sftpATTRS = channelSftp.lstat(filePath); + return sftpATTRS.isLink(); + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } else { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + } + + HashSet sourceFiles = new HashSet(); + @Override + public HashSet getListFiles(String directoryPath, int parentLevel, int maxTraversalLevel) { + if(parentLevel < maxTraversalLevel){ + String parentPath = null;// 父级目录,以'/'结尾 + int pathLen = directoryPath.length(); + if (directoryPath.contains("*") || directoryPath.contains("?")) {//*和?的限制 + // path是正则表达式 + String subPath = UnstructuredStorageReaderUtil.getRegexPathParentPath(directoryPath); + if (isDirExist(subPath)) { + parentPath = subPath; + } else { + String message = String.format("不能进入目录:[%s]," + "请确认您的配置项path:[%s]存在,且配置的用户有权限进入", subPath, + directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + } else if (isDirExist(directoryPath)) { + // path是目录 + if (directoryPath.charAt(pathLen - 1) == IOUtils.DIR_SEPARATOR) { + parentPath = directoryPath; + } else { + parentPath = directoryPath + IOUtils.DIR_SEPARATOR; + } + } else if(isSymbolicLink(directoryPath)){ + //path是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else if (isFileExist(directoryPath)) { + // path指向具体文件 + sourceFiles.add(directoryPath); + return sourceFiles; + } else { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + try { + Vector vector = 
channelSftp.ls(directoryPath); + for (int i = 0; i < vector.size(); i++) { + LsEntry le = (LsEntry) vector.get(i); + String strName = le.getFilename(); + String filePath = parentPath + strName; + + if (isDirExist(filePath)) { + // 是子目录 + if (!(strName.equals(".") || strName.equals(".."))) { + //递归处理 + getListFiles(filePath, parentLevel+1, maxTraversalLevel); + } + } else if(isSymbolicLink(filePath)){ + //是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else if (isFileExist(filePath)) { + // 是文件 + sourceFiles.add(filePath); + } else { + String message = String.format("请确认path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + } // end for vector + } catch (SftpException e) { + String message = String.format("获取path:[%s] 下文件列表时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + + return sourceFiles; + }else{ + //超出最大递归层数 + String message = String.format("获取path:[%s] 下文件列表时超出最大层数,请确认路径[%s]下不存在软连接文件", directoryPath, directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OUT_MAX_DIRECTORY_LEVEL, message); + } + } + + @Override + public InputStream getInputStream(String filePath) { + try { + return channelSftp.get(filePath); + } catch (SftpException e) { + String message = String.format("读取文件 : [%s] 时出错,请确认文件:[%s]存在且配置的用户有权限读取", filePath, filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OPEN_FILE_ERROR, message); + } + } + +} diff --git a/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/StandardFtpHelper.java b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/StandardFtpHelper.java new file mode 100644 index 0000000000..79b23f8bff --- /dev/null +++ b/ftpreader/src/main/java/com/alibaba/datax/plugin/reader/ftpreader/StandardFtpHelper.java @@ -0,0 +1,229 @@ +package com.alibaba.datax.plugin.reader.ftpreader; + +import java.io.IOException; +import java.io.InputStream; +import java.net.UnknownHostException; +import java.util.HashSet; + +import org.apache.commons.io.IOUtils; +import org.apache.commons.net.ftp.FTP; +import org.apache.commons.net.ftp.FTPClient; +import org.apache.commons.net.ftp.FTPClientConfig; +import org.apache.commons.net.ftp.FTPFile; +import org.apache.commons.net.ftp.FTPReply; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; + +public class StandardFtpHelper extends FtpHelper { + private static final Logger LOG = LoggerFactory.getLogger(StandardFtpHelper.class); + FTPClient ftpClient = null; + + @Override + public void loginFtpServer(String host, String username, String password, int port, int timeout, + String connectMode) { + ftpClient = new FTPClient(); + try { + // 连接 + ftpClient.connect(host, port); + // 登录 + ftpClient.login(username, password); + // 不需要写死ftp server的OS TYPE,FTPClient getSystemType()方法会自动识别 + // ftpClient.configure(new FTPClientConfig(FTPClientConfig.SYST_UNIX)); + ftpClient.setConnectTimeout(timeout); + ftpClient.setDataTimeout(timeout); + if ("PASV".equals(connectMode)) { + ftpClient.enterRemotePassiveMode(); + 
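+                // PASV(被动)模式下数据连接由客户端主动发起,适用于客户端位于防火墙或NAT之后的场景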
ftpClient.enterLocalPassiveMode(); + } else if ("PORT".equals(connectMode)) { + ftpClient.enterLocalActiveMode(); + // ftpClient.enterRemoteActiveMode(host, port); + } + int reply = ftpClient.getReplyCode(); + if (!FTPReply.isPositiveCompletion(reply)) { + ftpClient.disconnect(); + String message = String.format("与ftp服务器建立连接失败,请检查用户名和密码是否正确: [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message); + } + //设置命令传输编码 + String fileEncoding = System.getProperty("file.encoding"); + ftpClient.setControlEncoding(fileEncoding); + } catch (UnknownHostException e) { + String message = String.format("请确认ftp服务器地址是否正确,无法连接到地址为: [%s] 的ftp服务器", host); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } catch (IllegalArgumentException e) { + String message = String.format("请确认连接ftp服务器端口是否正确,错误的端口: [%s] ", port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } catch (Exception e) { + String message = String.format("与ftp服务器建立连接失败 : [%s]", + "message:host =" + host + ",username = " + username + ",port =" + port); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_LOGIN, message, e); + } + + } + + @Override + public void logoutFtpServer() { + if (ftpClient.isConnected()) { + try { + //todo ftpClient.completePendingCommand();//打开流操作之后必须,原因还需要深究 + ftpClient.logout(); + } catch (IOException e) { + String message = "与ftp服务器断开连接失败"; + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_DISCONNECT, message, e); + }finally { + if(ftpClient.isConnected()){ + try { + ftpClient.disconnect(); + } catch (IOException e) { + String message = "与ftp服务器断开连接失败"; + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FAIL_DISCONNECT, message, e); + } + } + + } + } + } + + @Override + public boolean isDirExist(String directoryPath) { + try { + return ftpClient.changeWorkingDirectory(new String(directoryPath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + } catch (IOException e) { + String message = String.format("进入目录:[%s]时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + @Override + public boolean isFileExist(String filePath) { + boolean isExitFlag = false; + try { + FTPFile[] ftpFiles = ftpClient.listFiles(new String(filePath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + if (ftpFiles.length == 1 && ftpFiles[0].isFile()) { + isExitFlag = true; + } + } catch (IOException e) { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return isExitFlag; + } + + @Override + public boolean isSymbolicLink(String filePath) { + boolean isExitFlag = false; + try { + FTPFile[] ftpFiles = ftpClient.listFiles(new String(filePath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + if (ftpFiles.length == 1 && ftpFiles[0].isSymbolicLink()) { + isExitFlag = true; + } + } catch (IOException e) { + String message = String.format("获取文件:[%s] 属性时发生I/O异常,请确认与ftp服务器的连接正常", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return isExitFlag; + } + + HashSet sourceFiles = 
new HashSet(); + @Override + public HashSet getListFiles(String directoryPath, int parentLevel, int maxTraversalLevel) { + if(parentLevel < maxTraversalLevel){ + String parentPath = null;// 父级目录,以'/'结尾 + int pathLen = directoryPath.length(); + if (directoryPath.contains("*") || directoryPath.contains("?")) { + // path是正则表达式 + String subPath = UnstructuredStorageReaderUtil.getRegexPathParentPath(directoryPath); + if (isDirExist(subPath)) { + parentPath = subPath; + } else { + String message = String.format("不能进入目录:[%s]," + "请确认您的配置项path:[%s]存在,且配置的用户有权限进入", subPath, + directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + } else if (isDirExist(directoryPath)) { + // path是目录 + if (directoryPath.charAt(pathLen - 1) == IOUtils.DIR_SEPARATOR) { + parentPath = directoryPath; + } else { + parentPath = directoryPath + IOUtils.DIR_SEPARATOR; + } + } else if (isFileExist(directoryPath)) { + // path指向具体文件 + sourceFiles.add(directoryPath); + return sourceFiles; + } else if(isSymbolicLink(directoryPath)){ + //path是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else { + String message = String.format("请确认您的配置项path:[%s]存在,且配置的用户有权限读取", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + + try { + FTPFile[] fs = ftpClient.listFiles(new String(directoryPath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + for (FTPFile ff : fs) { + String strName = ff.getName(); + String filePath = parentPath + strName; + if (ff.isDirectory()) { + if (!(strName.equals(".") || strName.equals(".."))) { + //递归处理 + getListFiles(filePath, parentLevel+1, maxTraversalLevel); + } + } else if (ff.isFile()) { + // 是文件 + sourceFiles.add(filePath); + } else if(ff.isSymbolicLink()){ + //是链接文件 + String message = String.format("文件:[%s]是链接文件,当前不支持链接文件的读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.LINK_FILE, message); + }else { + String message = String.format("请确认path:[%s]存在,且配置的用户有权限读取", filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.FILE_NOT_EXISTS, message); + } + } // end for FTPFile + } catch (IOException e) { + String message = String.format("获取path:[%s] 下文件列表时发生I/O异常,请确认与ftp服务器的连接正常", directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return sourceFiles; + + } else{ + //超出最大递归层数 + String message = String.format("获取path:[%s] 下文件列表时超出最大层数,请确认路径[%s]下不存在软连接文件", directoryPath, directoryPath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OUT_MAX_DIRECTORY_LEVEL, message); + } + } + + @Override + public InputStream getInputStream(String filePath) { + try { + return ftpClient.retrieveFileStream(new String(filePath.getBytes(),FTP.DEFAULT_CONTROL_ENCODING)); + } catch (IOException e) { + String message = String.format("读取文件 : [%s] 时出错,请确认文件:[%s]存在且配置的用户有权限读取", filePath, filePath); + LOG.error(message); + throw DataXException.asDataXException(FtpReaderErrorCode.OPEN_FILE_ERROR, message); + } + } + +} diff --git a/ftpreader/src/main/resources/plugin-template.json b/ftpreader/src/main/resources/plugin-template.json new file mode 100755 index 0000000000..9680aec671 --- /dev/null +++ b/ftpreader/src/main/resources/plugin-template.json @@ -0,0 
+1,38 @@ +{ + "name": "ftpreader", + "parameter": { + "host": "", + "port": "", + "username": "", + "password": "", + "protocol": "", + "path": [ + "" + ], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "fieldDelimiter": "," + } +} diff --git a/ftpreader/src/main/resources/plugin.json b/ftpreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..ce5ce26b9a --- /dev/null +++ b/ftpreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "ftpreader", + "class": "com.alibaba.datax.plugin.reader.ftpreader.FtpReader", + "description": "useScene: test. mechanism: use datax framework to transport data from txt file. warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} + diff --git a/ftpreader/src/main/resources/plugin_job_template.json b/ftpreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..88f429f3e9 --- /dev/null +++ b/ftpreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,19 @@ +{ + "name": "ftpreader", + "parameter": { + "host": "", + "protocol": "sftp", + "port":"", + "username": "", + "password": "", + "path": [], + "column": [ + { + "index": 0, + "type": "" + } + ], + "fieldDelimiter": ",", + "encoding": "UTF-8" + } +} \ No newline at end of file diff --git a/ftpwriter/doc/.gitkeep b/ftpwriter/doc/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/ftpwriter/doc/ftpwriter.md b/ftpwriter/doc/ftpwriter.md new file mode 100644 index 0000000000..bf2e726fa1 --- /dev/null +++ b/ftpwriter/doc/ftpwriter.md @@ -0,0 +1,244 @@ +# DataX FtpWriter 说明 + + +------------ + +## 1 快速介绍 + +FtpWriter提供了向远程FTP文件写入CSV格式的一个或者多个文件,在底层实现上,FtpWriter将DataX传输协议下的数据转换为csv格式,并使用FTP相关的网络协议写出到远程FTP服务器。 + +**写入FTP文件内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +## 2 功能与限制 + +FtpWriter实现了从DataX协议转为FTP文件功能,FTP文件本身是无结构化数据存储,FtpWriter如下几个方面约定: + +1. 支持且仅支持写入文本类型(不支持BLOB如视频数据)的文件,且要求文本中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 写出时不支持文本压缩。 + +6. 支持多线程写入,每个线程写入不同子文件。 + +我们不能做到: + +1. 单个文件不能支持并发写入。 + + +## 3 功能说明 + + +### 3.1 配置样例 + + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": {}, + "writer": { + "name": "ftpwriter", + "parameter": { + "protocol": "sftp", + "host": "***", + "port": 22, + "username": "xxx", + "password": "xxx", + "timeout": "60000", + "connectPattern": "PASV", + "path": "/tmp/data/", + "fileName": "yixiao", + "writeMode": "truncate|append|nonConflict", + "fieldDelimiter": ",", + "encoding": "UTF-8", + "nullFormat": "null", + "dateFormat": "yyyy-MM-dd", + "fileFormat": "csv", + "header": [] + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **protocol** + + * 描述:ftp服务器协议,目前支持传输协议有ftp和sftp。
+ + * 必选:是
+ + * 默认值:无
+ +* **host** + + * 描述:ftp服务器地址。
+ + * 必选:是
+ + * 默认值:无
+ +* **port** + + * 描述:ftp服务器端口。
+ + * 必选:否
+ + * 默认值:若传输协议是sftp协议,默认值是22;若传输协议是标准ftp协议,默认值是21
+
+* **timeout**
+
+    * 描述:连接ftp服务器的超时时间,单位毫秒。
+ + * 必选:否
+ + * 默认值:60000(1分钟)
+ +* **username** + + * 描述:ftp服务器访问用户名。
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:ftp服务器访问密码。
+ + * 必选:是
+ + * 默认值:无
+
+* **path**
+
+    * 描述:FTP文件系统的路径信息,FtpWriter会将文件写入path目录下,通常会写出多个文件。
+ + * 必选:是
+ + * 默认值:无
+
+* **fileName**
+
+    * 描述:FtpWriter写入的文件名前缀,实际写出时会在其后添加随机后缀,作为每个线程真正写入的文件名。
+ + * 必选:是
+ + * 默认值:无
+ +* **writeMode** + + * 描述:FtpWriter写入前数据清理处理模式:
+
+    * truncate,写入前清理目录下以fileName为前缀的所有文件。
+    * append,写入前不做任何处理,DataX FtpWriter直接使用fileName写入,并保证文件名不冲突。
+    * nonConflict,如果目录下已有以fileName为前缀的文件,直接报错。三种模式的差异可参考下方的配置示意。
+
+    * 必选:是
+ + * 默认值:无
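+
+    下面给出一个仅作示意的 writer 配置片段(其中 host、username、password、path、fileName 等取值均为举例,并非实际环境),用于说明 truncate 模式的清理范围:假设 /tmp/data/ 目录下已存在以 datax_result 为前缀的旧文件,写入前这些文件会被全部删除,随后各线程以 datax_result 加随机后缀写出新文件;将 writeMode 改为 append 则不清理旧文件,直接以不冲突的文件名写出,改为 nonConflict 则在发现同前缀文件时直接报错。
+
+    ```json
+    {
+        "name": "ftpwriter",
+        "parameter": {
+            "protocol": "sftp",
+            "host": "***",
+            "port": 22,
+            "username": "xxx",
+            "password": "xxx",
+            "path": "/tmp/data/",
+            "fileName": "datax_result",
+            "writeMode": "truncate",
+            "fieldDelimiter": ","
+        }
+    }
+    ```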
+
+* **fieldDelimiter**
+
+    * 描述:写入数据时使用的字段分隔符。
+ + * 必选:否
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,暂时不支持。
+ + * 必选:否
+ + * 默认值:无压缩
+
+* **encoding**
+
+    * 描述:写出文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+
+
+* **nullFormat**
+
+    * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义null值在写出文件中用哪个字符串表示。
+
+     例如用户配置: nullFormat:"\N",那么当源头数据为null时,该字段写出到文件中即为"\N"。
+
+    * 必选:否
+ + * 默认值:\N
+ +* **dateFormat** + + * 描述:日期类型的数据序列化到文件中时的格式,例如 "dateFormat": "yyyy-MM-dd"。
+ + * 必选:否
+ + * 默认值:无
+
+* **fileFormat**
+
+    * 描述:文件写出的格式,包括csv (http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) 和text两种。csv是严格的csv格式,如果待写数据包含列分隔符,则会按照csv的转义语法转义,转义符号为双引号(");text格式是用列分隔符简单拼接待写数据,待写数据包含列分隔符时不做转义。例如某字段值为 a,b:csv 格式会写出 "a,b",而 text 格式会原样写出 a,b,可能造成下游解析时列错位。
+ + * 必选:否
+ + * 默认值:text
+ +* **header** + + * 描述:txt写出时的表头,示例['id', 'name', 'age']。
+ + * 必选:否
+ + * 默认值:无
+ +### 3.3 类型转换 + + +FTP文件本身不提供数据类型,该类型是DataX FtpWriter定义: + +| DataX 内部类型| FTP文件 数据类型 | +| -------- | ----- | +| +| Long |Long -> 字符串序列化表示| +| Double |Double -> 字符串序列化表示| +| String |String -> 字符串序列化表示| +| Boolean |Boolean -> 字符串序列化表示| +| Date |Date -> 字符串序列化表示| + +其中: + +* FTP文件 Long是指FTP文件文本中使用整形的字符串表示形式,例如"19901219"。 +* FTP文件 Double是指FTP文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* FTP文件 Boolean是指FTP文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* FTP文件 Date是指FTP文件文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 4 性能报告 + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/ftpwriter/pom.xml b/ftpwriter/pom.xml new file mode 100644 index 0000000000..1fd766c9c0 --- /dev/null +++ b/ftpwriter/pom.xml @@ -0,0 +1,93 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + ftpwriter + ftpwriter + FtpWriter提供了写数据到指定ftp服务器文件功能。 + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + com.jcraft + jsch + 0.1.51 + + + commons-net + commons-net + 3.3 + + + junit + junit + test + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/ftpwriter/src/main/assembly/package.xml b/ftpwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..84ea21b5cb --- /dev/null +++ b/ftpwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/ftpwriter + + + target/ + + ftpwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/ftpwriter + + + + + + false + plugin/writer/ftpwriter/libs + runtime + + + diff --git a/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/FtpWriter.java b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/FtpWriter.java new file mode 100755 index 0000000000..eda603fc73 --- /dev/null +++ b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/FtpWriter.java @@ -0,0 +1,301 @@ +package com.alibaba.datax.plugin.writer.ftpwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.unstructuredstorage.writer.UnstructuredStorageWriterUtil; +import com.alibaba.datax.plugin.writer.ftpwriter.util.Constant; +import com.alibaba.datax.plugin.writer.ftpwriter.util.IFtpHelper; +import com.alibaba.datax.plugin.writer.ftpwriter.util.SftpHelperImpl; +import com.alibaba.datax.plugin.writer.ftpwriter.util.StandardFtpHelperImpl; + +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.OutputStream; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.concurrent.Callable; + +public class FtpWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration writerSliceConfig = null; + private Set allFileExists = null; + + private String 
protocol; + private String host; + private int port; + private String username; + private String password; + private int timeout; + + private IFtpHelper ftpHelper = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.validateParameter(); + UnstructuredStorageWriterUtil + .validateParameter(this.writerSliceConfig); + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Void call() throws Exception { + ftpHelper.loginFtpServer(host, username, password, + port, timeout); + return null; + } + }, 3, 4000, true); + } catch (Exception e) { + String message = String + .format("与ftp服务器建立连接失败, host:%s, username:%s, port:%s, errorMessage:%s", + host, username, port, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message, e); + } + } + + private void validateParameter() { + this.writerSliceConfig + .getNecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME, + FtpWriterErrorCode.REQUIRED_VALUE); + String path = this.writerSliceConfig.getNecessaryValue(Key.PATH, + FtpWriterErrorCode.REQUIRED_VALUE); + if (!path.startsWith("/")) { + String message = String.format("请检查参数path:%s,需要配置为绝对路径", path); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.ILLEGAL_VALUE, message); + } + + this.host = this.writerSliceConfig.getNecessaryValue(Key.HOST, + FtpWriterErrorCode.REQUIRED_VALUE); + this.username = this.writerSliceConfig.getNecessaryValue( + Key.USERNAME, FtpWriterErrorCode.REQUIRED_VALUE); + this.password = this.writerSliceConfig.getNecessaryValue( + Key.PASSWORD, FtpWriterErrorCode.REQUIRED_VALUE); + this.timeout = this.writerSliceConfig.getInt(Key.TIMEOUT, + Constant.DEFAULT_TIMEOUT); + + this.protocol = this.writerSliceConfig.getNecessaryValue( + Key.PROTOCOL, FtpWriterErrorCode.REQUIRED_VALUE); + if ("sftp".equalsIgnoreCase(this.protocol)) { + this.port = this.writerSliceConfig.getInt(Key.PORT, + Constant.DEFAULT_SFTP_PORT); + this.ftpHelper = new SftpHelperImpl(); + } else if ("ftp".equalsIgnoreCase(this.protocol)) { + this.port = this.writerSliceConfig.getInt(Key.PORT, + Constant.DEFAULT_FTP_PORT); + this.ftpHelper = new StandardFtpHelperImpl(); + } else { + throw DataXException.asDataXException( + FtpWriterErrorCode.ILLEGAL_VALUE, String.format( + "仅支持 ftp和sftp 传输协议 , 不支持您配置的传输协议: [%s]", + protocol)); + } + this.writerSliceConfig.set(Key.PORT, this.port); + } + + @Override + public void prepare() { + String path = this.writerSliceConfig.getString(Key.PATH); + // warn: 这里用户需要配一个目录 + this.ftpHelper.mkDirRecursive(path); + + String fileName = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + String writeMode = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.WRITE_MODE); + + Set allFileExists = this.ftpHelper.getAllFilesInDir(path, + fileName); + this.allFileExists = allFileExists; + + // truncate option handler + if ("truncate".equals(writeMode)) { + LOG.info(String.format( + "由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的内容", + path, fileName)); + Set fullFileNameToDelete = new HashSet(); + for (String each : allFileExists) { + fullFileNameToDelete.add(UnstructuredStorageWriterUtil + .buildFilePath(path, each, null)); + } + LOG.info(String.format( + "删除目录path:[%s] 下指定前缀fileName:[%s] 文件列表如下: [%s]", path, + fileName, + StringUtils.join(fullFileNameToDelete.iterator(), ", "))); + + 
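+                // truncate 模式:把目录下以 fileName 为前缀的已有文件全部删除后,再由各 Task 写出新文件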
this.ftpHelper.deleteFiles(fullFileNameToDelete); + } else if ("append".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode append, 写入前不做清理工作, [%s] 目录下写入相应文件名前缀 [%s] 的文件", + path, fileName)); + LOG.info(String.format( + "目录path:[%s] 下已经存在的指定前缀fileName:[%s] 文件列表如下: [%s]", + path, fileName, + StringUtils.join(allFileExists.iterator(), ", "))); + } else if ("nonConflict".equals(writeMode)) { + LOG.info(String.format( + "由于您配置了writeMode nonConflict, 开始检查 [%s] 下面的内容", path)); + if (!allFileExists.isEmpty()) { + LOG.info(String.format( + "目录path:[%s] 下指定前缀fileName:[%s] 冲突文件列表如下: [%s]", + path, fileName, + StringUtils.join(allFileExists.iterator(), ", "))); + throw DataXException + .asDataXException( + FtpWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 目录不为空, 下面存在其他文件或文件夹.", + path)); + } + } else { + throw DataXException + .asDataXException( + FtpWriterErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 truncate, append, nonConflict 三种模式, 不支持您配置的 writeMode 模式 : [%s]", + writeMode)); + } + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + try { + this.ftpHelper.logoutFtpServer(); + } catch (Exception e) { + String message = String + .format("关闭与ftp服务器连接失败, host:%s, username:%s, port:%s, errorMessage:%s", + host, username, port, e.getMessage()); + LOG.error(message, e); + } + } + + @Override + public List split(int mandatoryNumber) { + return UnstructuredStorageWriterUtil.split(this.writerSliceConfig, + this.allFileExists, mandatoryNumber); + } + + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + + private Configuration writerSliceConfig; + + private String path; + private String fileName; + private String suffix; + + private String protocol; + private String host; + private int port; + private String username; + private String password; + private int timeout; + + private IFtpHelper ftpHelper = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.path = this.writerSliceConfig.getString(Key.PATH); + this.fileName = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + this.suffix = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.SUFFIX); + + this.host = this.writerSliceConfig.getString(Key.HOST); + this.port = this.writerSliceConfig.getInt(Key.PORT); + this.username = this.writerSliceConfig.getString(Key.USERNAME); + this.password = this.writerSliceConfig.getString(Key.PASSWORD); + this.timeout = this.writerSliceConfig.getInt(Key.TIMEOUT, + Constant.DEFAULT_TIMEOUT); + this.protocol = this.writerSliceConfig.getString(Key.PROTOCOL); + + if ("sftp".equalsIgnoreCase(this.protocol)) { + this.ftpHelper = new SftpHelperImpl(); + } else if ("ftp".equalsIgnoreCase(this.protocol)) { + this.ftpHelper = new StandardFtpHelperImpl(); + } + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Void call() throws Exception { + ftpHelper.loginFtpServer(host, username, password, + port, timeout); + return null; + } + }, 3, 4000, true); + } catch (Exception e) { + String message = String + .format("与ftp服务器建立连接失败, host:%s, username:%s, port:%s, errorMessage:%s", + host, username, port, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message, e); + } + } + + @Override + public void prepare() { + + } + + @Override + public void 
startWrite(RecordReceiver lineReceiver) { + LOG.info("begin do write..."); + String fileFullPath = UnstructuredStorageWriterUtil.buildFilePath( + this.path, this.fileName, this.suffix); + LOG.info(String.format("write to file : [%s]", fileFullPath)); + + OutputStream outputStream = null; + try { + outputStream = this.ftpHelper.getOutputStream(fileFullPath); + UnstructuredStorageWriterUtil.writeToStream(lineReceiver, + outputStream, this.writerSliceConfig, this.fileName, + this.getTaskPluginCollector()); + } catch (Exception e) { + throw DataXException.asDataXException( + FtpWriterErrorCode.WRITE_FILE_IO_ERROR, + String.format("无法创建待写文件 : [%s]", this.fileName), e); + } finally { + IOUtils.closeQuietly(outputStream); + } + LOG.info("end do write"); + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + try { + this.ftpHelper.logoutFtpServer(); + } catch (Exception e) { + String message = String + .format("关闭与ftp服务器连接失败, host:%s, username:%s, port:%s, errorMessage:%s", + host, username, port, e.getMessage()); + LOG.error(message, e); + } + } + } +} diff --git a/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/FtpWriterErrorCode.java b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/FtpWriterErrorCode.java new file mode 100755 index 0000000000..1ee2d23d7a --- /dev/null +++ b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/FtpWriterErrorCode.java @@ -0,0 +1,54 @@ +package com.alibaba.datax.plugin.writer.ftpwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum FtpWriterErrorCode implements ErrorCode { + + REQUIRED_VALUE("FtpWriter-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("FtpWriter-01", "您填写的参数值不合法."), + MIXED_INDEX_VALUE("FtpWriter-02", "您的列信息配置同时包含了index,value."), + NO_INDEX_VALUE("FtpWriter-03","您明确的配置列信息,但未填写相应的index,value."), + + FILE_NOT_EXISTS("FtpWriter-04", "您配置的目录文件路径不存在或者没有权限读取."), + OPEN_FILE_WITH_CHARSET_ERROR("FtpWriter-05", "您配置的文件编码和实际文件编码不符合."), + OPEN_FILE_ERROR("FtpWriter-06", "您配置的文件在打开时异常."), + WRITE_FILE_IO_ERROR("FtpWriter-07", "您配置的文件在读取时出现IO异常."), + SECURITY_NOT_ENOUGH("FtpWriter-08", "您缺少权限执行相应的文件操作."), + CONFIG_INVALID_EXCEPTION("FtpWriter-09", "您的参数配置错误."), + RUNTIME_EXCEPTION("FtpWriter-10", "出现运行时异常, 请联系我们"), + EMPTY_DIR_EXCEPTION("FtpWriter-11", "您尝试读取的文件目录为空."), + + FAIL_LOGIN("FtpWriter-12", "登录失败,无法与ftp服务器建立连接."), + FAIL_DISCONNECT("FtpWriter-13", "关闭ftp连接失败,无法与ftp服务器断开连接."), + COMMAND_FTP_IO_EXCEPTION("FtpWriter-14", "与ftp服务器连接异常."), + OUT_MAX_DIRECTORY_LEVEL("FtpWriter-15", "超出允许的最大目录层数."), + LINK_FILE("FtpWriter-16", "您尝试读取的文件为链接文件."), + COMMAND_FTP_ENCODING_EXCEPTION("FtpWriter-17", "与ftp服务器连接,使用指定编码异常."), + FAIL_LOGOUT("FtpWriter-18", "登出失败,关闭与ftp服务器建立连接失败,但这不影响任务同步."),; + + + private final String code; + private final String description; + + private FtpWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } + +} diff --git a/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/Key.java b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/Key.java new file mode 100755 index 0000000000..1cf4812ab9 --- /dev/null +++ b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/Key.java @@ -0,0 
+1,19 @@ +package com.alibaba.datax.plugin.writer.ftpwriter; + +public class Key { + public static final String PROTOCOL = "protocol"; + + public static final String HOST = "host"; + + public static final String USERNAME = "username"; + + public static final String PASSWORD = "password"; + + public static final String PORT = "port"; + + public static final String TIMEOUT = "timeout"; + + public static final String CONNECTPATTERN = "connectPattern"; + + public static final String PATH = "path"; +} diff --git a/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/Constant.java b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/Constant.java new file mode 100755 index 0000000000..0a632f65d6 --- /dev/null +++ b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/Constant.java @@ -0,0 +1,16 @@ +package com.alibaba.datax.plugin.writer.ftpwriter.util; + + +public class Constant { + public static final int DEFAULT_FTP_PORT = 21; + + public static final int DEFAULT_SFTP_PORT = 22; + + public static final int DEFAULT_TIMEOUT = 60000; + + public static final int DEFAULT_MAX_TRAVERSAL_LEVEL = 100; + + public static final String DEFAULT_FTP_CONNECT_PATTERN = "PASV"; + + public static final String CONTROL_ENCODING = "utf8"; +} diff --git a/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/IFtpHelper.java b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/IFtpHelper.java new file mode 100644 index 0000000000..2e503f7f7b --- /dev/null +++ b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/IFtpHelper.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.plugin.writer.ftpwriter.util; + +import java.io.OutputStream; +import java.util.Set; + +public interface IFtpHelper { + + //使用被动方式 + public void loginFtpServer(String host, String username, String password, int port, int timeout); + + public void logoutFtpServer(); + + /** + * warn: 不支持递归创建, 比如 mkdir -p + * */ + public void mkdir(String directoryPath); + + /** + * 支持目录递归创建 + */ + public void mkDirRecursive(String directoryPath); + + public OutputStream getOutputStream(String filePath); + + public String getRemoteFileContent(String filePath); + + public Set getAllFilesInDir(String dir, String prefixFileName); + + /** + * warn: 不支持文件夹删除, 比如 rm -rf + * */ + public void deleteFiles(Set filesToDelete); + + public void completePendingCommand(); + +} diff --git a/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/SftpHelperImpl.java b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/SftpHelperImpl.java new file mode 100644 index 0000000000..e6d786298e --- /dev/null +++ b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/SftpHelperImpl.java @@ -0,0 +1,307 @@ +package com.alibaba.datax.plugin.writer.ftpwriter.util; + +import java.io.ByteArrayOutputStream; +import java.io.OutputStream; +import java.util.HashSet; +import java.util.Properties; +import java.util.Set; +import java.util.Vector; + +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.writer.ftpwriter.FtpWriterErrorCode; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.serializer.SerializerFeature; +import com.jcraft.jsch.ChannelSftp; +import com.jcraft.jsch.JSch; +import com.jcraft.jsch.JSchException; +import 
com.jcraft.jsch.Session; +import com.jcraft.jsch.SftpATTRS; +import com.jcraft.jsch.SftpException; +import com.jcraft.jsch.ChannelSftp.LsEntry; + +public class SftpHelperImpl implements IFtpHelper { + private static final Logger LOG = LoggerFactory + .getLogger(SftpHelperImpl.class); + + private Session session = null; + private ChannelSftp channelSftp = null; + + @Override + public void loginFtpServer(String host, String username, String password, + int port, int timeout) { + JSch jsch = new JSch(); + try { + this.session = jsch.getSession(username, host, port); + if (this.session == null) { + throw DataXException + .asDataXException(FtpWriterErrorCode.FAIL_LOGIN, + "创建ftp连接this.session失败,无法通过sftp与服务器建立链接,请检查主机名和用户名是否正确."); + } + + this.session.setPassword(password); + Properties config = new Properties(); + config.put("StrictHostKeyChecking", "no"); + // config.put("PreferredAuthentications", "password"); + this.session.setConfig(config); + this.session.setTimeout(timeout); + this.session.connect(); + + this.channelSftp = (ChannelSftp) this.session.openChannel("sftp"); + this.channelSftp.connect(); + } catch (JSchException e) { + if (null != e.getCause()) { + String cause = e.getCause().toString(); + String unknownHostException = "java.net.UnknownHostException: " + + host; + String illegalArgumentException = "java.lang.IllegalArgumentException: port out of range:" + + port; + String wrongPort = "java.net.ConnectException: Connection refused"; + if (unknownHostException.equals(cause)) { + String message = String + .format("请确认ftp服务器地址是否正确,无法连接到地址为: [%s] 的ftp服务器, errorMessage:%s", + host, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message, e); + } else if (illegalArgumentException.equals(cause) + || wrongPort.equals(cause)) { + String message = String.format( + "请确认连接ftp服务器端口是否正确,错误的端口: [%s], errorMessage:%s", + port, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message, e); + } + } else { + String message = String + .format("与ftp服务器建立连接失败,请检查主机、用户名、密码是否正确, host:%s, port:%s, username:%s, errorMessage:%s", + host, port, username, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message); + } + } + + } + + @Override + public void logoutFtpServer() { + if (this.channelSftp != null) { + this.channelSftp.disconnect(); + this.channelSftp = null; + } + if (this.session != null) { + this.session.disconnect(); + this.session = null; + } + } + + @Override + public void mkdir(String directoryPath) { + boolean isDirExist = false; + try { + this.printWorkingDirectory(); + SftpATTRS sftpATTRS = this.channelSftp.lstat(directoryPath); + isDirExist = sftpATTRS.isDir(); + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + LOG.warn(String.format( + "您的配置项path:[%s]不存在,将尝试进行目录创建, errorMessage:%s", + directoryPath, e.getMessage()), e); + isDirExist = false; + } + } + if (!isDirExist) { + try { + // warn 检查mkdir -p + this.channelSftp.mkdir(directoryPath); + } catch (SftpException e) { + String message = String + .format("创建目录:%s时发生I/O异常,请确认与ftp服务器的连接正常,拥有目录创建权限, errorMessage:%s", + directoryPath, e.getMessage()); + LOG.error(message, e); + throw DataXException + .asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, + message, e); + } + } + } + + @Override + public void mkDirRecursive(String directoryPath){ + boolean isDirExist = false; + try { + 
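+            // 先通过 lstat 判断目标目录是否已存在;sftp 不支持一次性递归建目录,不存在时需在下方按层级逐级创建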
this.printWorkingDirectory(); + SftpATTRS sftpATTRS = this.channelSftp.lstat(directoryPath); + isDirExist = sftpATTRS.isDir(); + } catch (SftpException e) { + if (e.getMessage().toLowerCase().equals("no such file")) { + LOG.warn(String.format( + "您的配置项path:[%s]不存在,将尝试进行目录创建, errorMessage:%s", + directoryPath, e.getMessage()), e); + isDirExist = false; + } + } + if (!isDirExist) { + StringBuilder dirPath = new StringBuilder(); + dirPath.append(IOUtils.DIR_SEPARATOR_UNIX); + String[] dirSplit = StringUtils.split(directoryPath,IOUtils.DIR_SEPARATOR_UNIX); + try { + // ftp server不支持递归创建目录,只能一级一级创建 + for(String dirName : dirSplit){ + dirPath.append(dirName); + mkDirSingleHierarchy(dirPath.toString()); + dirPath.append(IOUtils.DIR_SEPARATOR_UNIX); + } + } catch (SftpException e) { + String message = String + .format("创建目录:%s时发生I/O异常,请确认与ftp服务器的连接正常,拥有目录创建权限, errorMessage:%s", + directoryPath, e.getMessage()); + LOG.error(message, e); + throw DataXException + .asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, + message, e); + } + } + } + + public boolean mkDirSingleHierarchy(String directoryPath) throws SftpException { + boolean isDirExist = false; + try { + SftpATTRS sftpATTRS = this.channelSftp.lstat(directoryPath); + isDirExist = sftpATTRS.isDir(); + } catch (SftpException e) { + if(!isDirExist){ + LOG.info(String.format("正在逐级创建目录 [%s]",directoryPath)); + this.channelSftp.mkdir(directoryPath); + return true; + } + } + if(!isDirExist){ + LOG.info(String.format("正在逐级创建目录 [%s]",directoryPath)); + this.channelSftp.mkdir(directoryPath); + } + return true; + } + + @Override + public OutputStream getOutputStream(String filePath) { + try { + this.printWorkingDirectory(); + String parentDir = filePath.substring(0, + StringUtils.lastIndexOf(filePath, IOUtils.DIR_SEPARATOR)); + this.channelSftp.cd(parentDir); + this.printWorkingDirectory(); + OutputStream writeOutputStream = this.channelSftp.put(filePath, + ChannelSftp.APPEND); + String message = String.format( + "打开FTP文件[%s]获取写出流时出错,请确认文件%s有权限创建,有权限写出等", filePath, + filePath); + if (null == writeOutputStream) { + throw DataXException.asDataXException( + FtpWriterErrorCode.OPEN_FILE_ERROR, message); + } + return writeOutputStream; + } catch (SftpException e) { + String message = String.format( + "写出文件[%s] 时出错,请确认文件%s有权限写出, errorMessage:%s", filePath, + filePath, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.OPEN_FILE_ERROR, message); + } + } + + @Override + public String getRemoteFileContent(String filePath) { + try { + this.completePendingCommand(); + this.printWorkingDirectory(); + String parentDir = filePath.substring(0, + StringUtils.lastIndexOf(filePath, IOUtils.DIR_SEPARATOR)); + this.channelSftp.cd(parentDir); + this.printWorkingDirectory(); + ByteArrayOutputStream outputStream = new ByteArrayOutputStream(22); + this.channelSftp.get(filePath, outputStream); + String result = outputStream.toString(); + IOUtils.closeQuietly(outputStream); + return result; + } catch (SftpException e) { + String message = String.format( + "写出文件[%s] 时出错,请确认文件%s有权限写出, errorMessage:%s", filePath, + filePath, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.OPEN_FILE_ERROR, message); + } + } + + @Override + public Set getAllFilesInDir(String dir, String prefixFileName) { + Set allFilesWithPointedPrefix = new HashSet(); + try { + this.printWorkingDirectory(); + @SuppressWarnings("rawtypes") + Vector allFiles = this.channelSftp.ls(dir); + 
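+            // 记录 ls 到的全部条目便于排查问题;下面的循环只保留以 prefixFileName 开头的文件名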
LOG.debug(String.format("ls: %s", JSON.toJSONString(allFiles, + SerializerFeature.UseSingleQuotes))); + for (int i = 0; i < allFiles.size(); i++) { + LsEntry le = (LsEntry) allFiles.get(i); + String strName = le.getFilename(); + if (strName.startsWith(prefixFileName)) { + allFilesWithPointedPrefix.add(strName); + } + } + } catch (SftpException e) { + String message = String + .format("获取path:[%s] 下文件列表时发生I/O异常,请确认与ftp服务器的连接正常,拥有目录ls权限, errorMessage:%s", + dir, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return allFilesWithPointedPrefix; + } + + @Override + public void deleteFiles(Set filesToDelete) { + String eachFile = null; + try { + this.printWorkingDirectory(); + for (String each : filesToDelete) { + LOG.info(String.format("delete file [%s].", each)); + eachFile = each; + this.channelSftp.rm(each); + } + } catch (SftpException e) { + String message = String.format( + "删除文件:[%s] 时发生异常,请确认指定文件有删除权限,以及网络交互正常, errorMessage:%s", + eachFile, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + private void printWorkingDirectory() { + try { + LOG.info(String.format("current working directory:%s", + this.channelSftp.pwd())); + } catch (Exception e) { + LOG.warn(String.format("printWorkingDirectory error:%s", + e.getMessage())); + } + } + + @Override + public void completePendingCommand() { + } + +} diff --git a/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/StandardFtpHelperImpl.java b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/StandardFtpHelperImpl.java new file mode 100644 index 0000000000..8999b0a85a --- /dev/null +++ b/ftpwriter/src/main/java/com/alibaba/datax/plugin/writer/ftpwriter/util/StandardFtpHelperImpl.java @@ -0,0 +1,327 @@ +package com.alibaba.datax.plugin.writer.ftpwriter.util; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.io.OutputStream; +import java.net.UnknownHostException; +import java.util.HashSet; +import java.util.Set; + +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.net.ftp.FTPClient; +import org.apache.commons.net.ftp.FTPClientConfig; +import org.apache.commons.net.ftp.FTPFile; +import org.apache.commons.net.ftp.FTPReply; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.writer.ftpwriter.FtpWriterErrorCode; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.serializer.SerializerFeature; + +public class StandardFtpHelperImpl implements IFtpHelper { + private static final Logger LOG = LoggerFactory + .getLogger(StandardFtpHelperImpl.class); + FTPClient ftpClient = null; + + @Override + public void loginFtpServer(String host, String username, String password, + int port, int timeout) { + this.ftpClient = new FTPClient(); + try { + this.ftpClient.setControlEncoding("UTF-8"); + // 不需要写死ftp server的OS TYPE,FTPClient getSystemType()方法会自动识别 + // this.ftpClient.configure(new FTPClientConfig(FTPClientConfig.SYST_UNIX)); + this.ftpClient.setDefaultTimeout(timeout); + this.ftpClient.setConnectTimeout(timeout); + this.ftpClient.setDataTimeout(timeout); + + // 连接登录 + this.ftpClient.connect(host, port); + this.ftpClient.login(username, password); + + this.ftpClient.enterRemotePassiveMode(); + 
this.ftpClient.enterLocalPassiveMode(); + int reply = this.ftpClient.getReplyCode(); + if (!FTPReply.isPositiveCompletion(reply)) { + this.ftpClient.disconnect(); + String message = String + .format("与ftp服务器建立连接失败,host:%s, port:%s, username:%s, replyCode:%s", + host, port, username, reply); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message); + } + } catch (UnknownHostException e) { + String message = String.format( + "请确认ftp服务器地址是否正确,无法连接到地址为: [%s] 的ftp服务器, errorMessage:%s", + host, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message, e); + } catch (IllegalArgumentException e) { + String message = String.format( + "请确认连接ftp服务器端口是否正确,错误的端口: [%s], errorMessage:%s", port, + e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message, e); + } catch (Exception e) { + String message = String + .format("与ftp服务器建立连接失败,host:%s, port:%s, username:%s, errorMessage:%s", + host, port, username, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_LOGIN, message, e); + } + + } + + @Override + public void logoutFtpServer() { + if (this.ftpClient.isConnected()) { + try { + this.ftpClient.logout(); + } catch (IOException e) { + String message = String.format( + "与ftp服务器断开连接失败, errorMessage:%s", e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_DISCONNECT, message, e); + } finally { + if (this.ftpClient.isConnected()) { + try { + this.ftpClient.disconnect(); + } catch (IOException e) { + String message = String.format( + "与ftp服务器断开连接失败, errorMessage:%s", + e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.FAIL_DISCONNECT, message, e); + } + } + this.ftpClient = null; + } + } + } + + @Override + public void mkdir(String directoryPath) { + String message = String.format("创建目录:%s时发生异常,请确认与ftp服务器的连接正常,拥有目录创建权限", + directoryPath); + try { + this.printWorkingDirectory(); + boolean isDirExist = this.ftpClient + .changeWorkingDirectory(directoryPath); + if (!isDirExist) { + int replayCode = this.ftpClient.mkd(directoryPath); + message = String + .format("%s,replayCode:%s", message, replayCode); + if (replayCode != FTPReply.COMMAND_OK + && replayCode != FTPReply.PATHNAME_CREATED) { + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, + message); + } + } + } catch (IOException e) { + message = String.format("%s, errorMessage:%s", message, + e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + @Override + public void mkDirRecursive(String directoryPath){ + StringBuilder dirPath = new StringBuilder(); + dirPath.append(IOUtils.DIR_SEPARATOR_UNIX); + String[] dirSplit = StringUtils.split(directoryPath,IOUtils.DIR_SEPARATOR_UNIX); + String message = String.format("创建目录:%s时发生异常,请确认与ftp服务器的连接正常,拥有目录创建权限", directoryPath); + try { + // ftp server不支持递归创建目录,只能一级一级创建 + for(String dirName : dirSplit){ + dirPath.append(dirName); + boolean mkdirSuccess = mkDirSingleHierarchy(dirPath.toString()); + dirPath.append(IOUtils.DIR_SEPARATOR_UNIX); + if(!mkdirSuccess){ + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, + message); + } + } + } catch (IOException e) { + message = String.format("%s, errorMessage:%s", message, + 
e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + public boolean mkDirSingleHierarchy(String directoryPath) throws IOException { + boolean isDirExist = this.ftpClient + .changeWorkingDirectory(directoryPath); + // 如果directoryPath目录不存在,则创建 + if (!isDirExist) { + int replayCode = this.ftpClient.mkd(directoryPath); + if (replayCode != FTPReply.COMMAND_OK + && replayCode != FTPReply.PATHNAME_CREATED) { + return false; + } + } + return true; + } + + @Override + public OutputStream getOutputStream(String filePath) { + try { + this.printWorkingDirectory(); + String parentDir = filePath.substring(0, + StringUtils.lastIndexOf(filePath, IOUtils.DIR_SEPARATOR)); + this.ftpClient.changeWorkingDirectory(parentDir); + this.printWorkingDirectory(); + OutputStream writeOutputStream = this.ftpClient + .appendFileStream(filePath); + String message = String.format( + "打开FTP文件[%s]获取写出流时出错,请确认文件%s有权限创建,有权限写出等", filePath, + filePath); + if (null == writeOutputStream) { + throw DataXException.asDataXException( + FtpWriterErrorCode.OPEN_FILE_ERROR, message); + } + + return writeOutputStream; + } catch (IOException e) { + String message = String.format( + "写出文件 : [%s] 时出错,请确认文件:[%s]存在且配置的用户有权限写, errorMessage:%s", + filePath, filePath, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.OPEN_FILE_ERROR, message); + } + } + + @Override + public String getRemoteFileContent(String filePath) { + try { + this.completePendingCommand(); + this.printWorkingDirectory(); + String parentDir = filePath.substring(0, + StringUtils.lastIndexOf(filePath, IOUtils.DIR_SEPARATOR)); + this.ftpClient.changeWorkingDirectory(parentDir); + this.printWorkingDirectory(); + ByteArrayOutputStream outputStream = new ByteArrayOutputStream(22); + this.ftpClient.retrieveFile(filePath, outputStream); + String result = outputStream.toString(); + IOUtils.closeQuietly(outputStream); + return result; + } catch (IOException e) { + String message = String.format( + "读取文件 : [%s] 时出错,请确认文件:[%s]存在且配置的用户有权限读取, errorMessage:%s", + filePath, filePath, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.OPEN_FILE_ERROR, message); + } + } + + @Override + public Set getAllFilesInDir(String dir, String prefixFileName) { + Set allFilesWithPointedPrefix = new HashSet(); + try { + boolean isDirExist = this.ftpClient.changeWorkingDirectory(dir); + if (!isDirExist) { + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, + String.format("进入目录[%s]失败", dir)); + } + this.printWorkingDirectory(); + FTPFile[] fs = this.ftpClient.listFiles(dir); + // LOG.debug(JSON.toJSONString(this.ftpClient.listNames(dir))); + LOG.debug(String.format("ls: %s", + JSON.toJSONString(fs, SerializerFeature.UseSingleQuotes))); + for (FTPFile ff : fs) { + String strName = ff.getName(); + if (strName.startsWith(prefixFileName)) { + allFilesWithPointedPrefix.add(strName); + } + } + } catch (IOException e) { + String message = String + .format("获取path:[%s] 下文件列表时发生I/O异常,请确认与ftp服务器的连接正常,拥有目录ls权限, errorMessage:%s", + dir, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + return allFilesWithPointedPrefix; + } + + @Override + public void deleteFiles(Set filesToDelete) { + String eachFile = null; + boolean deleteOk = false; + try { + this.printWorkingDirectory(); + for (String 
each : filesToDelete) { + LOG.info(String.format("delete file [%s].", each)); + eachFile = each; + deleteOk = this.ftpClient.deleteFile(each); + if (!deleteOk) { + String message = String.format( + "删除文件:[%s] 时失败,请确认指定文件有删除权限", eachFile); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, + message); + } + } + } catch (IOException e) { + String message = String.format( + "删除文件:[%s] 时发生异常,请确认指定文件有删除权限,以及网络交互正常, errorMessage:%s", + eachFile, e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } + + private void printWorkingDirectory() { + try { + LOG.info(String.format("current working directory:%s", + this.ftpClient.printWorkingDirectory())); + } catch (Exception e) { + LOG.warn(String.format("printWorkingDirectory error:%s", + e.getMessage())); + } + } + + @Override + public void completePendingCommand() { + /* + * Q:After I perform a file transfer to the server, + * printWorkingDirectory() returns null. A:You need to call + * completePendingCommand() after transferring the file. wiki: + * http://wiki.apache.org/commons/Net/FrequentlyAskedQuestions + */ + try { + boolean isOk = this.ftpClient.completePendingCommand(); + if (!isOk) { + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, + "完成ftp completePendingCommand操作发生异常"); + } + } catch (IOException e) { + String message = String.format( + "完成ftp completePendingCommand操作发生异常, errorMessage:%s", + e.getMessage()); + LOG.error(message); + throw DataXException.asDataXException( + FtpWriterErrorCode.COMMAND_FTP_IO_EXCEPTION, message, e); + } + } +} diff --git a/ftpwriter/src/main/resources/plugin.json b/ftpwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..55979b0f32 --- /dev/null +++ b/ftpwriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "ftpwriter", + "class": "com.alibaba.datax.plugin.writer.ftpwriter.FtpWriter", + "description": "useScene: test. mechanism: use datax framework to transport data from ftp txt file. 
warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} + diff --git a/ftpwriter/src/main/resources/plugin_job_template.json b/ftpwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..1a116c4912 --- /dev/null +++ b/ftpwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,24 @@ +{ + "name": "ftpwriter", + "parameter": { + "name": "ftpwriter", + "parameter": { + "protocol": "", + "host": "", + "port": "", + "username": "", + "password": "", + "timeout": "", + "connectPattern": "", + "path": "", + "fileName": "", + "writeMode": "", + "fieldDelimiter": "", + "encoding": "", + "nullFormat": "", + "dateFormat": "", + "fileFormat": "", + "header": [] + } + } +} \ No newline at end of file diff --git a/hbase094xreader/doc/.gitkeep b/hbase094xreader/doc/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/hbase094xreader/doc/hbase094xreader.md b/hbase094xreader/doc/hbase094xreader.md new file mode 100644 index 0000000000..0076b8f114 --- /dev/null +++ b/hbase094xreader/doc/hbase094xreader.md @@ -0,0 +1,454 @@ +# Hbase094XReader & Hbase11XReader 插件文档 + +___ + +## 1 快速介绍 + +HbaseReader 插件实现了从 Hbase中读取数据。在底层实现上,HbaseReader 通过 HBase 的 Java 客户端连接远程 HBase 服务,并通过 Scan 方式读取你指定 rowkey 范围内的数据,并将读取的数据使用 DataX 自定义的数据类型拼装为抽象的数据集,并传递给下游 Writer 处理。 + + +### 1.1支持的功能 + +1、目前HbaseReader支持的Hbase版本有:Hbase0.94.x和Hbase1.1.x。 + +* 若您的hbase版本为Hbase0.94.x,reader端的插件请选择:hbase094xreader,即: + + ``` + "reader": { + "name": "hbase094xreader" + } + ``` + +* 若您的hbase版本为Hbase1.1.x,reader端的插件请选择:hbase11xreader,即: + + ``` + "reader": { + "name": "hbase11xreader" + } + ``` + +2、目前HbaseReader支持两模式读取:normal 模式、multiVersionFixedColumn模式; + +* normal 模式:把HBase中的表,当成普通二维表(横表)进行读取,读取最新版本数据。如: + + ``` +hbase(main):017:0> scan 'users' +ROW COLUMN+CELL + lisi column=address:city, timestamp=1457101972764, value=beijing + lisi column=address:contry, timestamp=1457102773908, value=china + lisi column=address:province, timestamp=1457101972736, value=beijing + lisi column=info:age, timestamp=1457101972548, value=27 + lisi column=info:birthday, timestamp=1457101972604, value=1987-06-17 + lisi column=info:company, timestamp=1457101972653, value=baidu + xiaoming column=address:city, timestamp=1457082196082, value=hangzhou + xiaoming column=address:contry, timestamp=1457082195729, value=china + xiaoming column=address:province, timestamp=1457082195773, value=zhejiang + xiaoming column=info:age, timestamp=1457082218735, value=29 + xiaoming column=info:birthday, timestamp=1457082186830, value=1987-06-17 + xiaoming column=info:company, timestamp=1457082189826, value=alibaba +2 row(s) in 0.0580 seconds +``` +读取后数据 + + | rowKey | addres:city | address:contry | address:province | info:age| info:birthday | info:company | + | --------| ---------------- |----- |----- |--------| ---------------- |----- | +| lisi | beijing| china| beijing |27 | 1987-06-17 | baidu| +| xiaoming | hangzhou| china | zhejiang|29 | 1987-06-17 | alibaba| + + + +* multiVersionFixedColumn模式:把HBase中的表,当成竖表进行读取。读出的每条记录一定是四列形式,依次为:rowKey,family:qualifier,timestamp,value。读取时需要明确指定要读取的列,把每一个 cell 中的值,作为一条记录(record),若有多个版本就有多条记录(record)。如: + + ``` +hbase(main):018:0> scan 'users',{VERSIONS=>5} +ROW COLUMN+CELL + lisi column=address:city, timestamp=1457101972764, value=beijing + lisi column=address:contry, timestamp=1457102773908, value=china + lisi column=address:province, timestamp=1457101972736, value=beijing + lisi column=info:age, timestamp=1457101972548, value=27 + lisi 
column=info:birthday, timestamp=1457101972604, value=1987-06-17 + lisi column=info:company, timestamp=1457101972653, value=baidu + xiaoming column=address:city, timestamp=1457082196082, value=hangzhou + xiaoming column=address:contry, timestamp=1457082195729, value=china + xiaoming column=address:province, timestamp=1457082195773, value=zhejiang + xiaoming column=info:age, timestamp=1457082218735, value=29 + xiaoming column=info:age, timestamp=1457082178630, value=24 + xiaoming column=info:birthday, timestamp=1457082186830, value=1987-06-17 + xiaoming column=info:company, timestamp=1457082189826, value=alibaba +2 row(s) in 0.0260 seconds +``` +读取后数据(4列) + + | rowKey | column:qualifier| timestamp | value | +| --------| ---------------- |----- |----- | +| lisi | address:city| 1457101972764 | beijing | +| lisi | address:contry| 1457102773908 | china | +| lisi | address:province| 1457101972736 | beijing | +| lisi | info:age| 1457101972548 | 27 | +| lisi | info:birthday| 1457101972604 | 1987-06-17 | +| lisi | info:company| 1457101972653 | beijing | +| xiaoming | address:city| 1457082196082 | hangzhou | +| xiaoming | address:contry| 1457082195729 | china | +| xiaoming | address:province| 1457082195773 | zhejiang | +| xiaoming | info:age| 1457082218735 | 29 | +| xiaoming | info:age| 1457082178630 | 24 | +| xiaoming | info:birthday| 1457082186830 | 1987-06-17 | +| xiaoming | info:company| 1457082189826 | alibaba | + + +3、HbaseReader中有一个必填配置项是:hbaseConfig,需要你联系 HBase PE,将hbase-site.xml 中与连接 HBase 相关的配置项提取出来,以 json 格式填入,同时可以补充更多HBase client的配置,如:设置scan的cache(hbase.client.scanner.caching)、batch来优化与服务器的交互。 + + +如:hbase-site.xml的配置内容如下 + +``` + + + hbase.rootdir + hdfs://ip:9000/hbase + + + hbase.cluster.distributed + true + + + hbase.zookeeper.quorum + *** + + +``` +转换后的json为: + +``` +"hbaseConfig": { + "hbase.rootdir": "hdfs: //ip:9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "***" + } +``` + +### 1.2 限制 + +1、目前不支持动态列的读取。考虑网络传输流量(支持动态列,需要先将hbase所有列的数据读取出来,再按规则进行过滤),现支持的两种读取模式中需要用户明确指定要读取的列。 + +2、关于同步作业的切分:目前的切分方式是根据用户hbase表数据的region分布进行切分。即:在用户填写的[startrowkey,endrowkey]范围内,一个region会切分成一个task,单个region不进行切分。 + +3、multiVersionFixedColumn模式下不支持增加常量列 + + +## 2 实现原理 + +简而言之,HbaseReader 通过 HBase 的 Java 客户端,通过 HTable, Scan, ResultScanner 等 API,读取你指定 rowkey 范围内的数据,并将读取的数据使用 DataX 自定义的数据类型拼装为抽象的数据集,并传递给下游 Writer 处理。hbase11xreader与hbase094xreader的主要不同在于API的调用不同,Hbase1.1.x废弃了很多Hbase0.94.x的api。 + + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从 HBase 抽取数据到本地的作业:(normal 模式) + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "hbase11xreader", + "parameter": { + "hbaseConfig": { + "hbase.rootdir": "hdfs: //xxx: 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "xxx" + }, + "table": "users", + "encoding": "utf-8", + "mode": "normal", + "column": [ + { + "name": "rowkey", + "type": "string" + }, + { + "name": "info: age", + "type": "string" + }, + { + "name": "info: birthday", + "type": "date", + "format":"yyyy-MM-dd" + }, + { + "name": "info: company", + "type": "string" + }, + { + "name": "address: contry", + "type": "string" + }, + { + "name": "address: province", + "type": "string" + }, + { + "name": "address: city", + "type": "string" + } + ], + "range": { + "startRowkey": "", + "endRowkey": "", + "isBinaryRowkey": true + } + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/Users/shf/workplace/datax_test/hbase11xreader/result", + "fileName": "qiran", + "writeMode": 
"truncate" + } + } + } + ] + } +} +``` + +* 配置一个从 HBase 抽取数据到本地的作业:( multiVersionFixedColumn 模式) + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "hbase11xreader", + "parameter": { + "hbaseConfig": { + "hbase.rootdir": "hdfs: //xxx: 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "xxx" + }, + "table": "users", + "encoding": "utf-8", + "mode": "multiVersionFixedColumn", + "maxVersion": "-1", + "column": [ + { + "name": "rowkey", + "type": "string" + }, + { + "name": "info: age", + "type": "string" + }, + { + "name": "info: birthday", + "type": "date", + "format":"yyyy-MM-dd" + }, + { + "name": "info: company", + "type": "string" + }, + { + "name": "address: contry", + "type": "string" + }, + { + "name": "address: province", + "type": "string" + }, + { + "name": "address: city", + "type": "string" + } + ], + "range": { + "startRowkey": "", + "endRowkey": "" + } + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/Users/shf/workplace/datax_test/hbase11xreader/result", + "fileName": "qiran", + "writeMode": "truncate" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **hbaseConfig** + + * 描述:每个HBase集群提供给DataX客户端连接的配置信息存放在hbase-site.xml,请联系你的HBase PE提供配置信息,并转换为JSON格式。同时可以补充更多HBase client的配置,如:设置scan的cache、batch来优化与服务器的交互。 + + * 必选:是
+ + * 默认值:无
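补充一个示意性的最小 Java 片段(类名 HbaseConfigDemo 与其中的 hbaseConfig 取值均为假设示例,非插件源码),演示这类 JSON 键值对大致如何被逐项写入 HBase 客户端的 Configuration,思路与后文 Hbase094xHelper.getHbaseConf 的处理一致:

```
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;

public class HbaseConfigDemo {
    public static void main(String[] args) {
        // 假设的 hbaseConfig 内容,对应 hbase-site.xml 中与连接相关的配置项
        String hbaseConfig = "{\"hbase.rootdir\":\"hdfs://ip:9000/hbase\","
                + "\"hbase.cluster.distributed\":\"true\","
                + "\"hbase.zookeeper.quorum\":\"zk1,zk2,zk3\"}";

        // 解析为 Map 后逐项 set 到 HBase 客户端使用的 Configuration 中
        Map<String, String> map = JSON.parseObject(hbaseConfig,
                new TypeReference<Map<String, String>>() {});
        Configuration conf = new Configuration();
        for (Map.Entry<String, String> entry : map.entrySet()) {
            conf.set(entry.getKey(), entry.getValue());
        }

        // 此时 conf 已带上连接 HBase 所需的 zookeeper 等信息
        System.out.println(conf.get("hbase.zookeeper.quorum"));
    }
}
```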
+ +* **mode** + + * 描述:读取hbase的模式,支持normal 模式、multiVersionFixedColumn模式,即:normal/multiVersionFixedColumn
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:要读取的 hbase 表名(大小写敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **encoding** + + * 描述:编码方式,UTF-8 或是 GBK,用于对二进制存储的 HBase byte[] 转为 String 时的编码
+ + * 必选:否
+ + * 默认值:UTF-8
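下面给出一个示意性的小例子(类名 EncodingDemo 为假设,仅作说明),演示 encoding 为什么必须与数据写入 HBase 时使用的编码一致:按 string 类型读取时,插件大致等价于执行 new String(byteArray, encoding),编码不匹配就会得到乱码:

```
import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 假设写入 HBase 的字符串当初是按 GBK 编码成字节的
        byte[] raw = "杭州".getBytes("GBK");

        // 用一致的编码还原,得到正确结果
        System.out.println(new String(raw, "GBK"));
        // 错误地按 UTF-8 解码,会得到乱码
        System.out.println(new String(raw, "UTF-8"));
    }
}
```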
+ + +* **column** + + * 描述:要读取的hbase字段,normal 模式与multiVersionFixedColumn 模式下必填项。 + (1)、normal 模式下:name指定读取的hbase列,除了rowkey外,必须为 列族:列名 的格式,type指定源数据的类型,format指定日期类型的格式,value指定当前类型为常量,不从hbase读取数据,而是根据value值自动生成对应的列。配置格式如下: + + ``` + "column": +[ + { + "name": "rowkey", + "type": "string" + }, + { + "value": "test", + "type": "string" + } +] + + ``` + normal 模式下,对于用户指定Column信息,type必须填写,name/value必须选择其一。 + + (2)、multiVersionFixedColumn 模式下:name指定读取的hbase列,除了rowkey外,必须为 列族:列名 的格式,type指定源数据的类型,format指定日期类型的格式 。multiVersionFixedColumn模式下不支持常量列。配置格式如下: + + ``` + "column": +[ + { + "name": "rowkey", + "type": "string" + }, + { + "name": "info: age", + "type": "string" + } +] + ``` + + * 必选:是
+ + * 默认值:无
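补充一个示意性片段(类名 ColumnNameDemo 为假设),演示 `列族:列名` 形式的 name 大致如何拆分为列族与列名两部分字节,供 Scan 按列读取使用,与后文 HbaseColumnCell 中的校验和拆分思路一致:

```
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnNameDemo {
    public static void main(String[] args) {
        // 除 rowkey 外,name 必须是 "列族:列名" 的格式
        String columnName = "info:age";

        String[] cfAndQualifier = columnName.split(":");
        if (cfAndQualifier.length != 2) {
            throw new IllegalArgumentException(
                    "column 的列配置格式应该是 列族:列名,错误的配置:" + columnName);
        }

        // 拆出列族与列名并转为字节,之后可用于 Scan.addColumn / Result.getValue
        byte[] family = Bytes.toBytes(cfAndQualifier[0].trim());
        byte[] qualifier = Bytes.toBytes(cfAndQualifier[1].trim());
        System.out.println(Bytes.toString(family) + " / " + Bytes.toString(qualifier));
    }
}
```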
+ + +* **maxVersion** + + * 描述:指定在多版本模式下的hbasereader读取的版本数,取值只能为-1或者大于1的数字,-1表示读取所有版本
+ + * 必选:multiVersionFixedColumn 模式下必填项
+ + * 默认值:无
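下面是一个示意性片段(类名与方法名均为假设),演示 maxVersion 的取值大致如何作用到 Scan 的版本数设置上:-1 表示读取全部版本,大于 1 表示只读取最新的对应个数,0 和 1 会被拒绝,思路与后文 MultiVersionTask 中 setMaxVersions 的处理一致:

```
import org.apache.hadoop.hbase.client.Scan;

public class MaxVersionDemo {
    // 按 maxVersion 配置设置 Scan 读取的版本数
    static void applyMaxVersion(Scan scan, int maxVersion) {
        if (maxVersion == -1 || maxVersion == Integer.MAX_VALUE) {
            // -1:读取全部版本
            scan.setMaxVersions();
        } else if (maxVersion > 1) {
            // 大于 1:只读取最新的 maxVersion 个版本
            scan.setMaxVersions(maxVersion);
        } else {
            throw new IllegalArgumentException(
                    "maxVersion 只能为 -1 或大于 1 的数字,0 或 1 请改用 normal 模式");
        }
    }

    public static void main(String[] args) {
        Scan scan = new Scan();
        applyMaxVersion(scan, 3);
        System.out.println(scan.getMaxVersions()); // 3
    }
}
```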
+ +* **range** + + * 描述:指定hbasereader读取的rowkey范围。
+ startRowkey:指定开始rowkey;
endRowkey:指定结束rowkey;
+ isBinaryRowkey:指定配置的startRowkey和endRowkey转换为byte[]时的方式,默认值为false,若为true,则调用Bytes.toBytesBinary(rowkey)方法进行转换;若为false:则调用Bytes.toBytes(rowkey)
+ 配置格式如下: + + ``` + "range": { + "startRowkey": "aaa", + "endRowkey": "ccc", + "isBinaryRowkey":false +} + ``` +
+ + * 必选:否
+ + * 默认值:无
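补充一个示意性片段(类名 RowkeyRangeDemo 为假设),演示 isBinaryRowkey 如何影响 startRowkey/endRowkey 字符串到 byte[] 的转换,对应上面描述中 Bytes.toBytesBinary 与 Bytes.toBytes 两种调用方式:

```
import org.apache.hadoop.hbase.util.Bytes;

public class RowkeyRangeDemo {
    // isBinaryRowkey 决定 rowkey 字符串转 byte[] 的方式
    static byte[] toRowkeyBytes(String rowkey, boolean isBinaryRowkey) {
        if (isBinaryRowkey) {
            // 按二进制转义格式解析,例如 "\x00\x01" 会被还原成两个字节
            return Bytes.toBytesBinary(rowkey);
        }
        // 按普通字符串处理,直接取其字节
        return Bytes.toBytes(rowkey);
    }

    public static void main(String[] args) {
        byte[] plain = toRowkeyBytes("aaa", false);
        byte[] binary = toRowkeyBytes("\\x00\\x01", true);
        System.out.println(plain.length);  // 3
        System.out.println(binary.length); // 2
    }
}
```

另外,后文 Hbase094xHelper.split 会校验 startRowkey 不得大于 endRowkey,配置时需注意二者转换成字节后的大小关系。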
+ +* **scanCacheSize** + + * 描述:Hbase client每次rpc从服务器端读取的行数
+ + * 必选:否
+ + * 默认值:256
+ +* **scanBatchSize** + + * 描述:Hbase client每次rpc从服务器端读取的列数
+ + * 必选:否
+ + * 默认值:100
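针对 scanCacheSize 与 scanBatchSize,补充一个示意性片段(类名为假设,数值取上面的默认值),演示这两个参数大致如何落到 Scan 的 caching 与 batch 设置上,与后文 HbaseAbstractTask.prepare 中的做法一致:

```
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTuningDemo {
    public static void main(String[] args) {
        int scanCacheSize = 256; // 每次 RPC 预取的行数
        int scanBatchSize = 100; // 每次返回的列数上限

        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("aaa"));
        scan.setStopRow(Bytes.toBytes("ccc"));
        // caching 调大可减少客户端与服务端的往返次数,但会增加客户端内存占用
        scan.setCaching(scanCacheSize);
        // batch 用于限制单行很宽时一次传输的列数
        scan.setBatch(scanBatchSize);
        // 全量同步读取的多为非热点数据,关闭块缓存避免冲刷 RegionServer 的 BlockCache
        scan.setCacheBlocks(false);

        System.out.println(scan.getCaching() + ", " + scan.getBatch());
    }
}
```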
+ + +### 3.3 类型转换 + + +下面列出支持的读取HBase数据类型,HbaseReader 针对 HBase 类型转换列表: + +| DataX 内部类型| HBase 数据类型 | +| -------- | ----- | +| Long |int, short ,long| +| Double |float, double| +| String |string,binarystring | +| Date |date | +| Boolean |boolean | + + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 + +## 4 性能报告 + +略 + +## 5 约束限制 + +略 + + +## 6 FAQ + +*** diff --git a/hbase094xreader/pom.xml b/hbase094xreader/pom.xml new file mode 100644 index 0000000000..9170536c76 --- /dev/null +++ b/hbase094xreader/pom.xml @@ -0,0 +1,97 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hbase094xreader + hbase094xreader + 0.0.1-SNAPSHOT + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.apache.hbase + hbase + 0.94.27 + + + org.apache.hadoop + hadoop-core + 0.20.205.0 + + + org.apache.zookeeper + zookeeper + 3.3.2 + + + + + com.alibaba.datax + datax-core + ${datax-project-version} + test + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + + diff --git a/hbase094xreader/src/main/assembly/package.xml b/hbase094xreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..78460d3685 --- /dev/null +++ b/hbase094xreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/hbase094xreader + + + target/ + + hbase094xreader-0.0.1-SNAPSHOT.jar + + plugin/reader/hbase094xreader + + + + + + false + plugin/reader/hbase094xreader/libs + runtime + + + diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/ColumnType.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/ColumnType.java new file mode 100755 index 0000000000..4044e092bd --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/ColumnType.java @@ -0,0 +1,48 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang.StringUtils; + +import java.util.Arrays; + +/** + * 只对 normal 模式读取时有用,多版本读取时,不存在列类型的 + */ +public enum ColumnType { + BOOLEAN("boolean"), + SHORT("short"), + INT("int"), + LONG("long"), + FLOAT("float"), + DOUBLE("double"), + DATE("date"), + STRING("string"), + BINARY_STRING("binarystring") + ; + + private String typeName; + + ColumnType(String typeName) { + this.typeName = typeName; + } + + public static ColumnType getByTypeName(String typeName) { + if(StringUtils.isBlank(typeName)){ + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, + String.format("Hbasereader 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + for (ColumnType columnType : values()) { + if (StringUtils.equalsIgnoreCase(columnType.typeName, typeName.trim())) { + return columnType; + } + } + + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, + String.format("Hbasereader 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + + @Override + public String toString() { + return this.typeName; + } +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Constant.java 
b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Constant.java new file mode 100755 index 0000000000..65825dc0cd --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Constant.java @@ -0,0 +1,16 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +public final class Constant { + public static final String RANGE = "range"; + + public static final String ROWKEY_FLAG = "rowkey"; + + public static final String DEFAULT_ENCODING = "UTF-8"; + + public static final String DEFAULT_DATA_FORMAT = "yyyy-MM-dd HH:mm:ss"; + + public static final int DEFAULT_SCAN_CACHE_SIZE = 256; + + public static final int DEFAULT_SCAN_BATCH_SIZE = 100; + +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xHelper.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xHelper.java new file mode 100644 index 0000000000..c3e2a21224 --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xHelper.java @@ -0,0 +1,441 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.TypeReference; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hbase.HConstants; +import org.apache.hadoop.hbase.client.HBaseAdmin; +import org.apache.hadoop.hbase.client.HTable; +import org.apache.hadoop.hbase.client.ResultScanner; +import org.apache.hadoop.hbase.util.Bytes; +import org.apache.hadoop.hbase.util.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + + +/** + * 工具类 + * Created by shf on 16/3/7. 
+ */ +public class Hbase094xHelper { + + private static final Logger LOG = LoggerFactory.getLogger(Hbase094xHelper.class); + + public static org.apache.hadoop.conf.Configuration getHbaseConf(String hbaseConf) { + if (StringUtils.isBlank(hbaseConf)) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.REQUIRED_VALUE, "读 Hbase 时需要配置 hbaseConfig,其内容为 Hbase 连接信息,请联系 Hbase PE 获取该信息."); + } + org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration(); + + try { + Map map = JSON.parseObject(hbaseConf, new TypeReference>() {}); + // 用户配置的 key-value 对 来表示 hbaseConf + Validate.isTrue(map != null && map.size() !=0, "hbaseConfig 不能为空 Map 结构!"); + for (Map.Entry entry : map.entrySet()) { + conf.set(entry.getKey(), entry.getValue()); + } + } catch (Exception e) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.GET_HBASE_CONFIGURATION_ERROR, e); + } + return conf; + } + + /** + * 每次都获取一个新的HTable 注意:HTable 本身是线程不安全的 + */ + public static HTable getTable(com.alibaba.datax.common.util.Configuration configuration) { + String hbaseConnConf = configuration.getString(Key.HBASE_CONFIG); + String tableName = configuration.getString(Key.TABLE); + HBaseAdmin admin = null; + try { + org.apache.hadoop.conf.Configuration hbaseConf = Hbase094xHelper.getHbaseConf(hbaseConnConf); + HTable htable = new HTable(hbaseConf, tableName); + admin = new HBaseAdmin(hbaseConf); + checkHbaseTable(admin, htable); + + return htable; + } catch (Exception e) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.GET_HBASE_TABLE_ERROR, e); + } finally { + Hbase094xHelper.closeAdmin(admin); + } + } + + private static void checkHbaseTable(HBaseAdmin admin, HTable htable) throws DataXException, IOException { + if (!admin.isMasterRunning()) { + throw new IllegalStateException("HBase master 没有运行, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if (!admin.tableExists(htable.getTableName())) { + throw new IllegalStateException("HBase源头表" + Bytes.toString(htable.getTableName()) + + "不存在, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if (!admin.isTableAvailable(htable.getTableName()) || !admin.isTableEnabled(htable.getTableName())) { + throw new IllegalStateException("HBase源头表" + Bytes.toString(htable.getTableName()) + + " 不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(admin.isTableDisabled(htable.getTableName())){ + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, "HBase源头表" + Bytes.toString(htable.getTableName()) + + "is disabled, 请检查您的配置 或者 联系 Hbase 管理员."); + } + } + + + public static void closeAdmin(HBaseAdmin admin){ + try { + if(null != admin) + admin.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.CLOSE_HBASE_ADMIN_ERROR, e); + } + } + + public static void closeTable(HTable table){ + try { + if(null != table) + table.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.CLOSE_HBASE_TABLE_ERROR, e); + } + } + + public static void closeResultScanner(ResultScanner resultScanner){ + if(null != resultScanner) { + resultScanner.close(); + } + } + + + public static byte[] convertUserStartRowkey(Configuration configuration) { + String startRowkey = configuration.getString(Key.START_ROWKEY); + if (StringUtils.isBlank(startRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } else { + boolean isBinaryRowkey = configuration.getBool(Key.IS_BINARY_ROWKEY); + return Hbase094xHelper.stringToBytes(startRowkey, isBinaryRowkey); + } + } + + public static byte[] 
convertUserEndRowkey(Configuration configuration) { + String endRowkey = configuration.getString(Key.END_ROWKEY); + if (StringUtils.isBlank(endRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } else { + boolean isBinaryRowkey = configuration.getBool(Key.IS_BINARY_ROWKEY); + return Hbase094xHelper.stringToBytes(endRowkey, isBinaryRowkey); + } + } + + /** + * 注意:convertUserStartRowkey 和 convertInnerStartRowkey,前者会受到 isBinaryRowkey 的影响,只用于第一次对用户配置的 String 类型的 rowkey 转为二进制时使用。而后者约定:切分时得到的二进制的 rowkey 回填到配置中时采用 + */ + public static byte[] convertInnerStartRowkey(Configuration configuration) { + String startRowkey = configuration.getString(Key.START_ROWKEY); + if (StringUtils.isBlank(startRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } + return Bytes.toBytesBinary(startRowkey); + } + + public static byte[] convertInnerEndRowkey(Configuration configuration) { + String endRowkey = configuration.getString(Key.END_ROWKEY); + if (StringUtils.isBlank(endRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } + return Bytes.toBytesBinary(endRowkey); + } + + + private static byte[] stringToBytes(String rowkey, boolean isBinaryRowkey) { + if (isBinaryRowkey) { + return Bytes.toBytesBinary(rowkey); + } else { + return Bytes.toBytes(rowkey); + } + } + + + public static boolean isRowkeyColumn(String columnName) { + return Constant.ROWKEY_FLAG.equalsIgnoreCase(columnName); + } + + + /** + * 用于解析 Normal 模式下的列配置 + */ + public static List parseColumnOfNormalMode(List column) { + List hbaseColumnCells = new ArrayList(); + + HbaseColumnCell oneColumnCell; + + for (Map aColumn : column) { + ColumnType type = ColumnType.getByTypeName(aColumn.get(Key.TYPE)); + String columnName = aColumn.get(Key.NAME); + String columnValue = aColumn.get(Key.VALUE); + String dateformat = aColumn.get(Key.FORMAT); + + if (type == ColumnType.DATE) { + + if(dateformat == null){ + dateformat = Constant.DEFAULT_DATA_FORMAT; + } + Validate.isTrue(StringUtils.isNotBlank(columnName) || StringUtils.isNotBlank(columnValue), "Hbasereader 在 normal 方式读取时则要么是 type + name + format 的组合,要么是type + value + format 的组合. 而您的配置非这两种组合,请检查并修改."); + + oneColumnCell = new HbaseColumnCell + .Builder(type) + .columnName(columnName) + .columnValue(columnValue) + .dateformat(dateformat) + .build(); + } else { + Validate.isTrue(StringUtils.isNotBlank(columnName) || StringUtils.isNotBlank(columnValue), "Hbasereader 在 normal 方式读取时,其列配置中,如果类型不是时间,则要么是 type + name 的组合,要么是type + value 的组合. 而您的配置非这两种组合,请检查并修改."); + oneColumnCell = new HbaseColumnCell.Builder(type) + .columnName(columnName) + .columnValue(columnValue) + .build(); + } + + hbaseColumnCells.add(oneColumnCell); + } + + return hbaseColumnCells; + } + + //将多竖表column变成>形式 + public static HashMap> parseColumnOfMultiversionMode(List column){ + + HashMap> familyQualifierMap = new HashMap>(); + for (Map aColumn : column) { + String type = aColumn.get(Key.TYPE); + String columnName = aColumn.get(Key.NAME); + String dateformat = aColumn.get(Key.FORMAT); + + ColumnType.getByTypeName(type); + Validate.isTrue(StringUtils.isNotBlank(columnName), "Hbasereader 中,column 需要配置列名称name,格式为 列族:列名,您的配置为空,请检查并修改."); + + String familyQualifier; + if( !Hbase094xHelper.isRowkeyColumn(columnName)){ + String[] cfAndQualifier = columnName.split(":"); + if ( cfAndQualifier.length != 2) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 中,column 的列配置格式应该是:列族:列名. 
您配置的列错误:" + columnName); + } + familyQualifier = StringUtils.join(cfAndQualifier[0].trim(),":",cfAndQualifier[1].trim()); + }else{ + familyQualifier = columnName.trim(); + } + + HashMap typeAndFormat = new HashMap(); + typeAndFormat.put(Key.TYPE,type); + typeAndFormat.put(Key.FORMAT,dateformat); + familyQualifierMap.put(familyQualifier,typeAndFormat); + } + return familyQualifierMap; + } + + + public static List split(Configuration configuration) { + byte[] startRowkeyByte = Hbase094xHelper.convertUserStartRowkey(configuration); + byte[] endRowkeyByte = Hbase094xHelper.convertUserEndRowkey(configuration); + + /* 如果用户配置了 startRowkey 和 endRowkey,需要确保:startRowkey <= endRowkey */ + if (startRowkeyByte.length != 0 && endRowkeyByte.length != 0 + && Bytes.compareTo(startRowkeyByte, endRowkeyByte) > 0) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 中 startRowkey 不得大于 endRowkey."); + } + + HTable htable = Hbase094xHelper.getTable(configuration); + + List resultConfigurations; + + try { + Pair regionRanges = htable.getStartEndKeys(); + if (null == regionRanges) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.SPLIT_ERROR, "获取源头 Hbase 表的 rowkey 范围失败."); + } + resultConfigurations = Hbase094xHelper.doSplit(configuration, startRowkeyByte, endRowkeyByte, + regionRanges); + LOG.info("HBaseReader split job into {} tasks.", resultConfigurations.size()); + return resultConfigurations; + } catch (Exception e) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.SPLIT_ERROR, "切分源头 Hbase 表失败.", e); + } finally { + Hbase094xHelper.closeTable(htable); + } + } + + + private static List doSplit(Configuration config, byte[] startRowkeyByte, + byte[] endRowkeyByte, Pair regionRanges) { + + List configurations = new ArrayList(); + + for (int i = 0; i < regionRanges.getFirst().length; i++) { + + byte[] regionStartKey = regionRanges.getFirst()[i]; + byte[] regionEndKey = regionRanges.getSecond()[i]; + + // 当前的region为最后一个region + // 如果最后一个region的start Key大于用户指定的userEndKey,则最后一个region,应该不包含在内 + // 注意如果用户指定userEndKey为"",则此判断应该不成立。userEndKey为""表示取得最大的region + if (Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) == 0 + && (endRowkeyByte.length != 0 && (Bytes.compareTo( + regionStartKey, endRowkeyByte) > 0))) { + continue; + } + + // 如果当前的region不是最后一个region, + // 用户配置的userStartKey大于等于region的endkey,则这个region不应该含在内 + if ((Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) != 0) + && (Bytes.compareTo(startRowkeyByte, regionEndKey) >= 0)) { + continue; + } + + // 如果用户配置的userEndKey小于等于 region的startkey,则这个region不应该含在内 + // 注意如果用户指定的userEndKey为"",则次判断应该不成立。userEndKey为""表示取得最大的region + if (endRowkeyByte.length != 0 + && (Bytes.compareTo(endRowkeyByte, regionStartKey) <= 0)) { + continue; + } + + Configuration p = config.clone(); + + String thisStartKey = getStartKey(startRowkeyByte, regionStartKey); + + String thisEndKey = getEndKey(endRowkeyByte, regionEndKey); + + p.set(Key.START_ROWKEY, thisStartKey); + p.set(Key.END_ROWKEY, thisEndKey); + + LOG.debug("startRowkey:[{}], endRowkey:[{}] .", thisStartKey, thisEndKey); + + configurations.add(p); + } + + return configurations; + } + + private static String getEndKey(byte[] endRowkeyByte, byte[] regionEndKey) { + if (endRowkeyByte == null) {// 由于之前处理过,所以传入的userStartKey不可能为null + throw new IllegalArgumentException("userEndKey should not be null!"); + } + + byte[] tempEndRowkeyByte; + + if (endRowkeyByte.length == 0) { + tempEndRowkeyByte = regionEndKey; + } else if 
(Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) == 0) { + // 为最后一个region + tempEndRowkeyByte = endRowkeyByte; + } else { + if (Bytes.compareTo(endRowkeyByte, regionEndKey) > 0) { + tempEndRowkeyByte = regionEndKey; + } else { + tempEndRowkeyByte = endRowkeyByte; + } + } + + return Bytes.toStringBinary(tempEndRowkeyByte); + } + + private static String getStartKey(byte[] startRowkeyByte, byte[] regionStarKey) { + if (startRowkeyByte == null) {// 由于之前处理过,所以传入的userStartKey不可能为null + throw new IllegalArgumentException( + "userStartKey should not be null!"); + } + + byte[] tempStartRowkeyByte; + + if (Bytes.compareTo(startRowkeyByte, regionStarKey) < 0) { + tempStartRowkeyByte = regionStarKey; + } else { + tempStartRowkeyByte = startRowkeyByte; + } + + return Bytes.toStringBinary(tempStartRowkeyByte); + } + + + public static void validateParameter(Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.HBASE_CONFIG, Hbase094xReaderErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, Hbase094xReaderErrorCode.REQUIRED_VALUE); + + Hbase094xHelper.validateMode(originalConfig); + + //非必选参数处理 + String encoding = originalConfig.getString(Key.ENCODING, Constant.DEFAULT_ENCODING); + if (!Charset.isSupported(encoding)) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, String.format("Hbasereader 不支持您所配置的编码:[%s]", encoding)); + } + originalConfig.set(Key.ENCODING, encoding); + // 处理 range 的配置 + String startRowkey = originalConfig.getString(Constant.RANGE + "." + Key.START_ROWKEY); + + //此处判断需要谨慎:如果有 key range.startRowkey 但是没有值,得到的 startRowkey 是空字符串,而不是 null + if (startRowkey != null && startRowkey.length() != 0) { + originalConfig.set(Key.START_ROWKEY, startRowkey); + } + + String endRowkey = originalConfig.getString(Constant.RANGE + "." + Key.END_ROWKEY); + //此处判断需要谨慎:如果有 key range.endRowkey 但是没有值,得到的 endRowkey 是空字符串,而不是 null + if (endRowkey != null && endRowkey.length() != 0) { + originalConfig.set(Key.END_ROWKEY, endRowkey); + } + Boolean isBinaryRowkey = originalConfig.getBool(Constant.RANGE + "." 
+ Key.IS_BINARY_ROWKEY,false); + originalConfig.set(Key.IS_BINARY_ROWKEY, isBinaryRowkey); + + //scan cache + int scanCacheSize = originalConfig.getInt(Key.SCAN_CACHE_SIZE,Constant.DEFAULT_SCAN_CACHE_SIZE); + originalConfig.set(Key.SCAN_CACHE_SIZE,scanCacheSize); + + int scanBatchSize = originalConfig.getInt(Key.SCAN_BATCH_SIZE,Constant.DEFAULT_SCAN_BATCH_SIZE); + originalConfig.set(Key.SCAN_BATCH_SIZE,scanBatchSize); + } + + private static String validateMode(Configuration originalConfig) { + String mode = originalConfig.getNecessaryValue(Key.MODE,Hbase094xReaderErrorCode.REQUIRED_VALUE); + List column = originalConfig.getList(Key.COLUMN, Map.class); + if (column == null || column.isEmpty()) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.REQUIRED_VALUE, "您配置的column为空,Hbase必须配置 column,其形式为:column:[{\"name\": \"cf0:column0\",\"type\": \"string\"},{\"name\": \"cf1:column1\",\"type\": \"long\"}]"); + } + ModeType modeType = ModeType.getByTypeName(mode); + switch (modeType) { + case Normal: { + // normal 模式不需要配置 maxVersion,需要配置 column,并且 column 格式为 Map 风格 + String maxVersion = originalConfig.getString(Key.MAX_VERSION); + Validate.isTrue(maxVersion == null, "您配置的是 normal 模式读取 hbase 中的数据,所以不能配置无关项:maxVersion"); + // 通过 parse 进行 column 格式的进一步检查 + Hbase094xHelper.parseColumnOfNormalMode(column); + break; + } + case MultiVersionFixedColumn:{ + // multiVersionFixedColumn 模式需要配置 maxVersion + checkMaxVersion(originalConfig, mode); + + Hbase094xHelper.parseColumnOfMultiversionMode(column); + break; + } + default: + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, + String.format("Hbase11xReader不支持该 mode 类型:%s", mode)); + } + return mode; + } + + // 检查 maxVersion 是否存在,并且值是否合法 + private static void checkMaxVersion(Configuration configuration, String mode) { + Integer maxVersion = configuration.getInt(Key.MAX_VERSION); + Validate.notNull(maxVersion, String.format("您配置的是 %s 模式读取 hbase 中的数据,所以必须配置:maxVersion", mode)); + boolean isMaxVersionValid = maxVersion == -1 || maxVersion > 1; + Validate.isTrue(isMaxVersionValid, String.format("您配置的是 %s 模式读取 hbase 中的数据,但是配置的 maxVersion 值错误. maxVersion规定:-1为读取全部版本,不能配置为0或者1(因为0或者1,我们认为用户是想用 normal 模式读取数据,而非 %s 模式读取,二者差别大),大于1则表示读取最新的对应个数的版本", mode, mode)); + } +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xReader.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xReader.java new file mode 100755 index 0000000000..8d2f7a277d --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xReader.java @@ -0,0 +1,107 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +/** + * Hbase094xReader + * Created by shf on 16/3/7. 
+ */ +public class Hbase094xReader extends Reader { + public static class Job extends Reader.Job { + private Configuration originConfig = null; + + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + Hbase094xHelper.validateParameter(this.originConfig); + } + + @Override + public List split(int adviceNumber) { + return Hbase094xHelper.split(this.originConfig); + } + + + @Override + public void destroy() { + + } + + } + public static class Task extends Reader.Task { + private Configuration taskConfig; + private static Logger LOG = LoggerFactory.getLogger(Task.class); + private HbaseAbstractTask hbaseTaskProxy; + @Override + public void init() { + this.taskConfig = super.getPluginJobConf(); + String mode = this.taskConfig.getString(Key.MODE); + ModeType modeType = ModeType.getByTypeName(mode); + + switch (modeType) { + case Normal: + this.hbaseTaskProxy = new NormalTask(this.taskConfig); + break; + case MultiVersionFixedColumn: + this.hbaseTaskProxy = new MultiVersionFixedColumnTask(this.taskConfig); + break; + default: + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持此类模式:" + modeType); + } + } + + @Override + public void prepare() { + try { + this.hbaseTaskProxy.prepare(); + } catch (Exception e) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.PREPAR_READ_ERROR, e); + } + } + + @Override + public void startRead(RecordSender recordSender) { + Record record = recordSender.createRecord(); + boolean fetchOK; + while (true) { + try { + fetchOK = this.hbaseTaskProxy.fetchLine(record); + } catch (Exception e) { + LOG.info("Exception", e); + super.getTaskPluginCollector().collectDirtyRecord(record, e); + record = recordSender.createRecord(); + continue; + } + if (fetchOK) { + recordSender.sendToWriter(record); + record = recordSender.createRecord(); + } else { + break; + } + } + recordSender.flush(); + } + + @Override + public void post() { + super.post(); + } + + @Override + public void destroy() { + if (this.hbaseTaskProxy != null) { + this.hbaseTaskProxy.close(); + } + } + } + +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xReaderErrorCode.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xReaderErrorCode.java new file mode 100755 index 0000000000..7b97fbd4d1 --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Hbase094xReaderErrorCode.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum Hbase094xReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("Hbase094xReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("Hbase094xReader-01", "您配置的值不合法."), + PREPAR_READ_ERROR("Hbase094xReader-02", "准备读取 Hbase 时出错."), + SPLIT_ERROR("Hbase094xReader-03", "切分 Hbase 表时出错."), + GET_HBASE_CONFIGURATION_ERROR("HbaseReader-04", "解析hbase configuration时出错."), + INIT_TABLE_ERROR("Hbase094xReader-04", "初始化 Hbase 抽取表时出错."), + GET_HBASE_TABLE_ERROR("HbaseReader-05", "初始化 Hbase 抽取表时出错."), + CLOSE_HBASE_TABLE_ERROR("HbaseReader-06", "关闭Hbase 抽取表时出错."), + CLOSE_HBASE_ADMIN_ERROR("HbaseReader-07", "关闭 Hbase admin时出错.") + ; + + private final String code; + private final String description; + + private Hbase094xReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + 
public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/HbaseAbstractTask.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/HbaseAbstractTask.java new file mode 100755 index 0000000000..8934793837 --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/HbaseAbstractTask.java @@ -0,0 +1,153 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang.ArrayUtils; +import org.apache.commons.lang3.time.DateUtils; +import org.apache.hadoop.hbase.client.HTable; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.ResultScanner; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; + +public abstract class HbaseAbstractTask { + private final static Logger LOG = LoggerFactory.getLogger(HbaseAbstractTask.class); + + private byte[] startKey = null; + private byte[] endKey = null; + + protected HTable htable; + protected String encoding; + protected int scanCacheSize; + protected int scanBatchSize; + + protected Result lastResult = null; + protected Scan scan; + protected ResultScanner resultScanner; + + public HbaseAbstractTask(com.alibaba.datax.common.util.Configuration configuration) { + + this.htable = Hbase094xHelper.getTable(configuration); + + this.encoding = configuration.getString(Key.ENCODING,Constant.DEFAULT_ENCODING); + this.startKey = Hbase094xHelper.convertInnerStartRowkey(configuration); + this.endKey = Hbase094xHelper.convertInnerEndRowkey(configuration); + this.scanCacheSize = configuration.getInt(Key.SCAN_CACHE_SIZE,Constant.DEFAULT_SCAN_CACHE_SIZE); + this.scanBatchSize = configuration.getInt(Key.SCAN_BATCH_SIZE,Constant.DEFAULT_SCAN_BATCH_SIZE); + } + + public abstract boolean fetchLine(Record record) throws Exception; + + //不同模式设置不同,如多版本模式需要设置版本 + public abstract void initScan(Scan scan); + + + public void prepare() throws Exception { + this.scan = new Scan(); + this.scan.setSmall(false); + this.scan.setStartRow(startKey); + this.scan.setStopRow(endKey); + LOG.info("The task set startRowkey=[{}], endRowkey=[{}].", Bytes.toStringBinary(this.startKey), Bytes.toStringBinary(this.endKey)); + //scan的Caching Batch全部留在hconfig中每次从服务器端读取的行数,设置默认值未256 + this.scan.setCaching(this.scanCacheSize); + //设置获取记录的列个数,hbase默认无限制,也就是返回所有的列,这里默认是100 + this.scan.setBatch(this.scanBatchSize); + //为是否缓存块,hbase默认缓存,同步全部数据时非热点数据,因此不需要缓存 + this.scan.setCacheBlocks(false); + initScan(this.scan); + + this.resultScanner = this.htable.getScanner(this.scan); + } + + public void close() { + Hbase094xHelper.closeResultScanner(this.resultScanner); + Hbase094xHelper.closeTable(this.htable); + } + + protected Result getNextHbaseRow() throws IOException { + Result result; + try { + result = resultScanner.next(); + } catch (IOException e) { + if (lastResult != null) { + this.scan.setStartRow(lastResult.getRow()); + } + resultScanner = this.htable.getScanner(scan); + result = resultScanner.next(); + if (lastResult != null && Bytes.equals(lastResult.getRow(), result.getRow())) { + result = resultScanner.next(); + } + } + 
lastResult = result; + // may be null + return result; + } + + public Column convertBytesToAssignType(ColumnType columnType, byte[] byteArray,String dateformat) throws Exception { + Column column; + switch (columnType) { + case BOOLEAN: + column = new BoolColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toBoolean(byteArray)); + break; + case SHORT: + column = new LongColumn(ArrayUtils.isEmpty(byteArray) ? null : String.valueOf(Bytes.toShort(byteArray))); + break; + case INT: + column = new LongColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toInt(byteArray)); + break; + case LONG: + column = new LongColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toLong(byteArray)); + break; + case FLOAT: + column = new DoubleColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toFloat(byteArray)); + break; + case DOUBLE: + column = new DoubleColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toDouble(byteArray)); + break; + case STRING: + column = new StringColumn(ArrayUtils.isEmpty(byteArray) ? null : new String(byteArray, encoding)); + break; + case BINARY_STRING: + column = new StringColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toStringBinary(byteArray)); + break; + case DATE: + String dateValue = Bytes.toStringBinary(byteArray); + column = new DateColumn(ArrayUtils.isEmpty(byteArray) ? null : DateUtils.parseDate(dateValue, new String[]{dateformat})); + break; + default: + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持您配置的列类型:" + columnType); + } + return column; + } + + public Column convertValueToAssignType(ColumnType columnType, String constantValue,String dateformat) throws Exception { + Column column; + switch (columnType) { + case BOOLEAN: + column = new BoolColumn(constantValue); + break; + case SHORT: + case INT: + case LONG: + column = new LongColumn(constantValue); + break; + case FLOAT: + case DOUBLE: + column = new DoubleColumn(constantValue); + break; + case STRING: + column = new StringColumn(constantValue); + break; + case DATE: + column = new DateColumn(DateUtils.parseDate(constantValue, new String[]{dateformat})); + break; + default: + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 常量列不支持您配置的列类型:" + columnType); + } + return column; + } +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/HbaseColumnCell.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/HbaseColumnCell.java new file mode 100755 index 0000000000..2d8638f0b0 --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/HbaseColumnCell.java @@ -0,0 +1,122 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.base.BaseObject; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.hbase.util.Bytes; + +/** + * 描述 hbasereader 插件中,column 配置中的一个单元项实体 + */ +public class HbaseColumnCell extends BaseObject { + private ColumnType columnType; + + // columnName 格式为:列族:列名 + private String columnName; + + private byte[] columnFamily; + private byte[] qualifier; + + //对于常量类型,其常量值放到 columnValue 里 + private String columnValue; + + //当配置了 columnValue 时,isConstant=true(这个成员变量是用于方便使用本类的地方判断是否是常量类型字段) + private boolean isConstant; + + // 只在类型是时间类型时,才会设置该值,无默认值。形式如:yyyy-MM-dd HH:mm:ss + private String dateformat; + + private HbaseColumnCell(Builder builder) { + this.columnType = builder.columnType; + + //columnName 和 
columnValue 必须有一个为 null + Validate.isTrue(builder.columnName == null || builder.columnValue == null, "Hbasereader 中,column 不能同时配置 列名称 和 列值,二者选其一."); + + //columnName 和 columnValue 不能都为 null + Validate.isTrue(builder.columnName != null || builder.columnValue != null, "Hbasereader 中,column 需要配置 列名称 或者 列值, 二者选其一."); + + if (builder.columnName != null) { + this.isConstant = false; + this.columnName = builder.columnName; + // 如果 columnName 不是 rowkey,则必须配置为:列族:列名 格式 + if (!Hbase094xHelper.isRowkeyColumn(this.columnName)) { + + String promptInfo = "Hbasereader 中,column 的列配置格式应该是:列族:列名. 您配置的列错误:" + this.columnName; + String[] cfAndQualifier = this.columnName.split(":"); + Validate.isTrue(cfAndQualifier != null && cfAndQualifier.length == 2 + && StringUtils.isNotBlank(cfAndQualifier[0]) + && StringUtils.isNotBlank(cfAndQualifier[1]), promptInfo); + + this.columnFamily = Bytes.toBytes(cfAndQualifier[0].trim()); + this.qualifier = Bytes.toBytes(cfAndQualifier[1].trim()); + } + } else { + this.isConstant = true; + this.columnValue = builder.columnValue; + } + + if (builder.dateformat != null) { + this.dateformat = builder.dateformat; + } + } + + public ColumnType getColumnType() { + return columnType; + } + + public String getColumnName() { + return columnName; + } + + public byte[] getColumnFamily() { + return columnFamily; + } + + public byte[] getQualifier() { + return qualifier; + } + + public String getDateformat() { + return dateformat; + } + + public String getColumnValue() { + return columnValue; + } + + public boolean isConstant() { + return isConstant; + } + + // 内部 builder 类 + public static class Builder { + private ColumnType columnType; + private String columnName; + private String columnValue; + + private String dateformat; + + public Builder(ColumnType columnType) { + this.columnType = columnType; + } + + public Builder columnName(String columnName) { + this.columnName = columnName; + return this; + } + + public Builder columnValue(String columnValue) { + this.columnValue = columnValue; + return this; + } + + public Builder dateformat(String dateformat) { + this.dateformat = dateformat; + return this; + } + + public HbaseColumnCell build() { + return new HbaseColumnCell(this); + } + } +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Key.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Key.java new file mode 100755 index 0000000000..6256f923fa --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/Key.java @@ -0,0 +1,50 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +public final class Key { + + public final static String HBASE_CONFIG = "hbaseConfig"; + + public final static String TABLE = "table"; + + /** + * mode 可以取 normal 或者 multiVersionFixedColumn 或者 multiVersionDynamicColumn 三个值,无默认值。 + *

+ * normal 配合 column(Map 结构的)使用 + */ + public final static String MODE = "mode"; + + /** + * 配合 mode = multiVersion 时使用,指明需要读取的版本个数。无默认值 + * -1 表示去读全部版本 + * 不能为0,1 + * >1 表示最多读取对应个数的版本数(不能超过 Integer 的最大值) + */ + public final static String MAX_VERSION = "maxVersion"; + + /** + * 默认为 utf8 + */ + public final static String ENCODING = "encoding"; + + public final static String COLUMN = "column"; + + public final static String COLUMN_FAMILY = "columnFamily"; + + public static final String NAME = "name"; + + public static final String TYPE = "type"; + + public static final String FORMAT = "format"; + + public static final String VALUE = "value"; + + public final static String START_ROWKEY = "startRowkey"; + + public final static String END_ROWKEY = "endRowkey"; + + public final static String IS_BINARY_ROWKEY = "isBinaryRowkey"; + + public final static String SCAN_CACHE_SIZE = "scanCacheSize"; + + public final static String SCAN_BATCH_SIZE = "scanBatchSize"; +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/ModeType.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/ModeType.java new file mode 100644 index 0000000000..788b17190c --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/ModeType.java @@ -0,0 +1,28 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum ModeType { + Normal("normal"), + MultiVersionFixedColumn("multiVersionFixedColumn") + ; + + private String mode; + + ModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public static ModeType getByTypeName(String modeName) { + for (ModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + + throw DataXException.asDataXException(Hbase094xReaderErrorCode.ILLEGAL_VALUE, + String.format("HbaseReader 不支持该 mode 类型:%s, 目前支持的 mode 类型是:%s", modeName, Arrays.asList(values()))); + } +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/MultiVersionFixedColumnTask.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/MultiVersionFixedColumnTask.java new file mode 100644 index 0000000000..bfd0eb0a87 --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/MultiVersionFixedColumnTask.java @@ -0,0 +1,26 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.Map; + +public class MultiVersionFixedColumnTask extends MultiVersionTask { + + public MultiVersionFixedColumnTask(Configuration configuration) { + super(configuration); + } + + @Override + public void initScan(Scan scan) { + for (Map aColumn : column) { + String columnName = aColumn.get(Key.NAME); + if(!Hbase094xHelper.isRowkeyColumn(columnName)){ + String[] cfAndQualifier = columnName.split(":"); + scan.addColumn(Bytes.toBytes(cfAndQualifier[0].trim()), Bytes.toBytes(cfAndQualifier[1].trim())); + } + } + super.setMaxVersions(scan); + } +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/MultiVersionTask.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/MultiVersionTask.java new file mode 100755 index 0000000000..7075b0ccd4 --- /dev/null +++ 
b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/MultiVersionTask.java @@ -0,0 +1,100 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.hbase.KeyValue; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.io.UnsupportedEncodingException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +public abstract class MultiVersionTask extends HbaseAbstractTask { + private static byte[] COLON_BYTE; + + private int maxVersion; + private List kvList = new ArrayList(); + + private int currentReadPosition = 0; + public List column; + private HashMap> familyQualifierMap = null; + + public MultiVersionTask(Configuration configuration) { + super(configuration); + this.maxVersion = configuration.getInt(Key.MAX_VERSION); + this.column = configuration.getList(Key.COLUMN, Map.class); + this.familyQualifierMap = Hbase094xHelper.parseColumnOfMultiversionMode(this.column); + + try { + MultiVersionTask.COLON_BYTE = ":".getBytes("utf8"); + } catch (UnsupportedEncodingException e) { + throw DataXException.asDataXException(Hbase094xReaderErrorCode.PREPAR_READ_ERROR, "系统内部获取 列族与列名冒号分隔符的二进制时失败.", e); + } + } + + @Override + public boolean fetchLine(Record record) throws Exception { + Result result; + if (this.kvList == null || this.kvList.size() == this.currentReadPosition) { + result = super.getNextHbaseRow(); + if (result == null) { + return false; + } + super.lastResult = result; + + this.kvList = result.list(); + if (this.kvList == null) { + return false; + } + this.currentReadPosition = 0; + } + try { + KeyValue keyValue = this.kvList.get(this.currentReadPosition); + + convertCellToLine(keyValue, record); + + } catch (Exception e) { + throw e; + } finally { + this.currentReadPosition++; + } + return true; + } + + private void convertCellToLine(KeyValue keyValue, Record record) throws Exception { + byte[] rawRowkey = keyValue.getRow(); + long timestamp = keyValue.getTimestamp(); + byte[] cfAndQualifierName = Bytes.add(keyValue.getFamily(), MultiVersionTask.COLON_BYTE, keyValue.getQualifier()); + byte[] columnValue = keyValue.getValue(); + + ColumnType rawRowkeyType = ColumnType.getByTypeName(familyQualifierMap.get(Constant.ROWKEY_FLAG).get(Key.TYPE)); + String familyQualifier = new String(cfAndQualifierName, Constant.DEFAULT_ENCODING); + ColumnType columnValueType = ColumnType.getByTypeName(familyQualifierMap.get(familyQualifier).get(Key.TYPE)); + String columnValueFormat = familyQualifierMap.get(familyQualifier).get(Key.FORMAT); + if(StringUtils.isBlank(columnValueFormat)){ + columnValueFormat = Constant.DEFAULT_DATA_FORMAT; + } + + record.addColumn(convertBytesToAssignType(rawRowkeyType, rawRowkey, columnValueFormat)); + record.addColumn(convertBytesToAssignType(ColumnType.STRING, cfAndQualifierName, columnValueFormat)); + // 直接忽略了用户配置的 timestamp 的类型 + record.addColumn(new LongColumn(timestamp)); + record.addColumn(convertBytesToAssignType(columnValueType, columnValue, columnValueFormat)); + } + + public void setMaxVersions(Scan scan) { + if (this.maxVersion == -1 || this.maxVersion == Integer.MAX_VALUE) { + scan.setMaxVersions(); + } else 
{ + scan.setMaxVersions(this.maxVersion); + } + } + +} diff --git a/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/NormalTask.java b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/NormalTask.java new file mode 100755 index 0000000000..5255cb8a5d --- /dev/null +++ b/hbase094xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase094xreader/NormalTask.java @@ -0,0 +1,88 @@ +package com.alibaba.datax.plugin.reader.hbase094xreader; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.List; +import java.util.Map; + +public class NormalTask extends HbaseAbstractTask { + private List column; + private List hbaseColumnCells; + + public NormalTask(Configuration configuration) { + super(configuration); + this.column = configuration.getList(Key.COLUMN, Map.class); + this.hbaseColumnCells = Hbase094xHelper.parseColumnOfNormalMode(this.column); + } + + /** + * normal模式下将用户配置的column 设置到scan中 + */ + @Override + public void initScan(Scan scan) { + boolean isConstant; + boolean isRowkeyColumn; + for (HbaseColumnCell cell : this.hbaseColumnCells) { + isConstant = cell.isConstant(); + isRowkeyColumn = Hbase094xHelper.isRowkeyColumn(cell.getColumnName()); + if (!isConstant && !isRowkeyColumn) { + this.scan.addColumn(cell.getColumnFamily(), cell.getQualifier()); + } + } + } + + + @Override + public boolean fetchLine(Record record) throws Exception { + Result result = super.getNextHbaseRow(); + + if (null == result) { + return false; + } + super.lastResult = result; + + try { + byte[] hbaseColumnValue; + String columnName; + ColumnType columnType; + + byte[] columnFamily; + byte[] qualifier; + + for (HbaseColumnCell cell : this.hbaseColumnCells) { + columnType = cell.getColumnType(); + if (cell.isConstant()) { + // 对常量字段的处理 + String constantValue = cell.getColumnValue(); + + Column constantColumn = super.convertValueToAssignType(columnType,constantValue,cell.getDateformat()); + record.addColumn(constantColumn); + } else { + // 根据列名称获取值 + columnName = cell.getColumnName(); + if (Hbase094xHelper.isRowkeyColumn(columnName)) { + hbaseColumnValue = result.getRow(); + } else { + columnFamily = cell.getColumnFamily(); + qualifier = cell.getQualifier(); + hbaseColumnValue = result.getValue(columnFamily, qualifier); + } + + Column hbaseColumn = super.convertBytesToAssignType(columnType,hbaseColumnValue,cell.getDateformat()); + record.addColumn(hbaseColumn); + } + } + } catch (Exception e) { + // 注意,这里catch的异常,期望是byte数组转换失败的情况。而实际上,string的byte数组,转成整数类型是不容易报错的。但是转成double类型容易报错。 + record.setColumn(0, new StringColumn(Bytes.toStringBinary(result.getRow()))); + throw e; + } + return true; + } +} diff --git a/hbase094xreader/src/main/resources/plugin.json b/hbase094xreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..a40407ec88 --- /dev/null +++ b/hbase094xreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "hbase094xreader", + "class": "com.alibaba.datax.plugin.reader.hbase094xreader.Hbase094xReader", + "description": "useScene: prod. 
mechanism: Scan to read data.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/hbase094xreader/src/main/resources/plugin_job_template.json b/hbase094xreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..14acc174cd --- /dev/null +++ b/hbase094xreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "hbase094xreader", + "parameter": { + "hbaseConfig": {}, + "table": "", + "encoding": "", + "mode": "", + "column": [], + "range": { + "startRowkey": "", + "endRowkey": "", + "isBinaryRowkey": true + } + } +} \ No newline at end of file diff --git a/hbase094xwriter/doc/.gitkeep b/hbase094xwriter/doc/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/hbase094xwriter/doc/hbase094xwriter.md b/hbase094xwriter/doc/hbase094xwriter.md new file mode 100644 index 0000000000..cec8144d08 --- /dev/null +++ b/hbase094xwriter/doc/hbase094xwriter.md @@ -0,0 +1,356 @@ +# Hbase094XWriter & Hbase11XWriter 插件文档 + +___ + +## 1 快速介绍 + +HbaseWriter 插件实现了从向Hbase中写取数据。在底层实现上,HbaseWriter 通过 HBase 的 Java 客户端连接远程 HBase 服务,并通过 put 方式写入Hbase。 + + +### 1.1支持功能 + +1、目前HbaseWriter支持的Hbase版本有:Hbase0.94.x和Hbase1.1.x。 + +* 若您的hbase版本为Hbase0.94.x,writer端的插件请选择:hbase094xwriter,即: + + ``` + "writer": { + "name": "hbase094xwriter" + } + ``` + +* 若您的hbase版本为Hbase1.1.x,writer端的插件请选择:hbase11xwriter,即: + + ``` + "writer": { + "name": "hbase11xwriter" + } + ``` + +2、目前HbaseWriter支持源端多个字段拼接作为hbase 表的 rowkey,具体配置参考:rowkeyColumn配置; + +3、写入hbase的时间戳(版本)支持:用当前时间作为版本,指定源端列作为版本,指定一个时间 三种方式作为版本; + +4、HbaseWriter中有一个必填配置项是:hbaseConfig,需要你联系 HBase PE,将hbase-site.xml 中与连接 HBase 相关的配置项提取出来,以 json 格式填入,同时可以补充更多HBase client的配置来优化与服务器的交互。 + + +如:hbase-site.xml的配置内容如下 + +``` + + + hbase.rootdir + hdfs://ip:9000/hbase + + + hbase.cluster.distributed + true + + + hbase.zookeeper.quorum + *** + + +``` +转换后的json为: + +``` +"hbaseConfig": { + "hbase.rootdir": "hdfs: //ip: 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "***" + } +``` + +### 1.2 限制 + +1、目前只支持源端为横表写入,不支持竖表(源端读出的为四元组: rowKey,family:qualifier,timestamp,value)模式的数据写入;本期目标主要是替换DataX2中的habsewriter,下次迭代考虑支持。 + +2、目前不支持写入hbase前清空表数据,若需要清空数据请联系HBase PE + +## 2 实现原理 + +简而言之,HbaseWriter 通过 HBase 的 Java 客户端,通过 HTable, Put等 API,将从上游Reader读取的数据写入HBase你hbase11xwriter与hbase094xwriter的主要不同在于API的调用不同,Hbase1.1.x废弃了很多Hbase0.94.x的api。 + + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从本地写入hbase1.1.x的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 5 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": "/Users/shf/workplace/datax_test/hbase11xwriter/txt/normal.txt", + "charset": "UTF-8", + "column": [ + { + "index": 0, + "type": "String" + }, + { + "index": 1, + "type": "string" + }, + { + "index": 2, + "type": "string" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "string" + }, + { + "index": 5, + "type": "string" + }, + { + "index": 6, + "type": "string" + } + + ], + "fieldDelimiter": "," + } + }, + "writer": { + "name": "hbase11xwriter", + "parameter": { + "hbaseConfig": { + "hbase.rootdir": "hdfs: //ip: 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "***" + }, + "table": "writer", + "mode": "normal", + "rowkeyColumn": [ + { + "index":0, + "type":"string" + }, + { + "index":-1, + "type":"string", + "value":"_" + } + ], + "column": [ + { + "index":1, + "name": "cf1:q1", + "type": "string" + }, + { + "index":2, + "name": "cf1:q2", + "type": "string" + }, + { + 
"index":3, + "name": "cf1:q3", + "type": "string" + }, + { + "index":4, + "name": "cf2:q1", + "type": "string" + }, + { + "index":5, + "name": "cf2:q2", + "type": "string" + }, + { + "index":6, + "name": "cf2:q3", + "type": "string" + } + ], + "versionColumn":{ + "index": -1, + "value":"123456789" + }, + "encoding": "utf-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **hbaseConfig** + + * 描述:每个HBase集群提供给DataX客户端连接的配置信息存放在hbase-site.xml,请联系你的HBase PE提供配置信息,并转换为JSON格式。同时可以补充更多HBase client的配置,如:设置scan的cache、batch来优化与服务器的交互。 + + * 必选:是
+ + * 默认值:无
+ +* **mode** + + * 描述:写hbase的模式,目前只支持normal 模式,后续考虑动态列模式
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:要写的 hbase 表名(大小写敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **encoding** + + * 描述:编码方式,UTF-8 或是 GBK,用于 String 转 HBase byte[] 时的编码
+ + * 必选:否
+ + * 默认值:UTF-8
+ + +* **column** + + * 描述:要写入的hbase字段。index:指定该列对应reader端column的索引,从0开始;name:指定hbase表中的列,必须为 列族:列名 的格式;type:指定写入数据类型,用于转换HBase byte[]。配置格式如下: + + ``` +"column": [ + { + "index":1, + "name": "cf1:q1", + "type": "string" + }, + { + "index":2, + "name": "cf1:q2", + "type": "string" + } + ] + + ``` + + * 必选:是
+ + * 默认值:无
+ +* **rowkeyColumn** + + * 描述:要写入的hbase的rowkey列。index:指定该列对应reader端column的索引,从0开始,若为常量index为-1;type:指定写入数据类型,用于转换HBase byte[];value:配置常量,常作为多个字段的拼接符。hbasewriter会将rowkeyColumn中所有列按照配置顺序进行拼接作为写入hbase的rowkey,不能全为常量。配置格式如下: + + ``` +"rowkeyColumn": [ + { + "index":0, + "type":"string" + }, + { + "index":-1, + "type":"string", + "value":"_" + } + ] + + ``` + + * 必选:是
+ + * 默认值:无
+ +* **versionColumn** + + * 描述:指定写入hbase的时间戳。支持:当前时间、指定时间列,指定时间,三者选一。若不配置表示用当前时间。index:指定对应reader端column的索引,从0开始,需保证能转换为long,若是Date类型,会尝试用yyyy-MM-dd HH:mm:ss和yyyy-MM-dd HH:mm:ss SSS去解析;若为指定时间index为-1;value:指定时间的值,long值。配置格式如下: + + ``` +"versionColumn":{ + "index":1 +} + + ``` + + 或者 + + ``` +"versionColumn":{ + "index":-1, + "value":123456789 +} + + ``` + + * 必选:否
+ + * 默认值:无
+ + +* **nullMode** + + * 描述:读取到null值时如何处理。支持两种方式:(1)skip:表示不向hbase写这列;(2)empty:写入HConstants.EMPTY_BYTE_ARRAY,即new byte[0]
+ + * 必选:否
+ + * 默认值:skip
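+
+ * 配置示意:下面是一个仅作示意的片段(省略了 hbaseConfig、table、column 等其余必填项,取值仅为示例),演示在 writer 的 parameter 中显式指定 nullMode:
+
+ ```
+ "parameter": {
+     "nullMode": "empty"
+ }
+ ```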
+ +* **walFlag** + + * 描述:在HBase client向集群中的RegionServer提交数据时(Put/Delete操作),会先写WAL(Write Ahead Log)日志(即HLog,一个RegionServer上的所有Region共享一个HLog),只有当写WAL日志成功后,才会接着写MemStore,然后客户端被通知提交数据成功;如果写WAL日志失败,客户端则被通知提交失败。设置为false表示关闭WAL,不再写WAL日志,从而提高数据写入的性能。
+ + * 必选:否
+ + * 默认值:false
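+
+ * 配置示意:仅为示意片段(省略其余必填项)。若业务要求写入可靠性优先、可接受一定的性能损失,可以显式打开 WAL:
+
+ ```
+ "parameter": {
+     "walFlag": true
+ }
+ ```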
+ +* **writeBufferSize** + + * 描述:设置HBase client的写buffer大小,单位为字节,需配合autoflush使用。autoflush开启(true)表示HBase client每有一条put就向服务端执行一次更新;关闭(false)表示HBase client只有当put填满客户端写缓存时,才实际向HBase服务端发起写请求。
+ + * 必选:否
+ + * 默认值:8M
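+
+ * 配置示意:仅为示意片段(16777216 即 16MB,为假设的示例值,请结合单条记录大小与客户端内存情况调整):
+
+ ```
+ "parameter": {
+     "writeBufferSize": 16777216
+ }
+ ```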
+ +### 3.3 HBase支持的列类型 +* BOOLEAN +* SHORT +* INT +* LONG +* FLOAT +* DOUBLE +* STRING + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 + +## 4 性能报告 + +略 + +## 5 约束限制 + +略 + +## 6 FAQ + +*** diff --git a/hbase094xwriter/pom.xml b/hbase094xwriter/pom.xml new file mode 100644 index 0000000000..f3f403c10d --- /dev/null +++ b/hbase094xwriter/pom.xml @@ -0,0 +1,110 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hbase094xwriter + hbase094xwriter + 0.0.1-SNAPSHOT + + + 1.8 + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.apache.hbase + hbase + 0.94.27 + + + org.apache.hadoop + hadoop-core + 0.20.205.0 + + + org.apache.zookeeper + zookeeper + 3.3.2 + + + commons-codec + commons-codec + ${commons-codec.version} + + + + + com.alibaba.datax + datax-core + ${datax-project-version} + test + + + com.alibaba.datax + datax-common + 0.0.1-SNAPSHOT + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + + diff --git a/hbase094xwriter/src/main/assembly/package.xml b/hbase094xwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..172711cfb7 --- /dev/null +++ b/hbase094xwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/hbase094xwriter + + + target/ + + hbase094xwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/hbase094xwriter + + + + + + false + plugin/writer/hbase094xwriter/libs + runtime + + + diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/ColumnType.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/ColumnType.java new file mode 100755 index 0000000000..e54b108c86 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/ColumnType.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang.StringUtils; + +import java.util.Arrays; + +/** + * 只对 normal 模式读取时有用,多版本读取时,不存在列类型的 + */ +public enum ColumnType { + STRING("string"), + BOOLEAN("boolean"), + SHORT("short"), + INT("int"), + LONG("long"), + FLOAT("float"), + DOUBLE("double"); + + private String typeName; + + ColumnType(String typeName) { + this.typeName = typeName; + } + + public static ColumnType getByTypeName(String typeName) { + if(StringUtils.isBlank(typeName)){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + for (ColumnType columnType : values()) { + if (StringUtils.equalsIgnoreCase(columnType.typeName, typeName.trim())) { + return columnType; + } + } + + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + + @Override + public String toString() { + return this.typeName; + } +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Constant.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Constant.java new file mode 100755 index 0000000000..5899424dff --- /dev/null +++ 
b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Constant.java @@ -0,0 +1,8 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +public final class Constant { + public static final String DEFAULT_ENCODING = "UTF-8"; + public static final String DEFAULT_DATA_FORMAT = "yyyy-MM-dd HH:mm:ss"; + public static final String DEFAULT_NULL_MODE = "skip"; + public static final long DEFAULT_WRITE_BUFFER_SIZE = 8 * 1024 * 1024; +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xHelper.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xHelper.java new file mode 100644 index 0000000000..f671d31d52 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xHelper.java @@ -0,0 +1,263 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.TypeReference; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hbase.HBaseConfiguration; +import org.apache.hadoop.hbase.HTableDescriptor; +import org.apache.hadoop.hbase.client.*; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.List; +import java.util.Map; + + +/** + * Created by shf on 16/3/7. + */ +public class Hbase094xHelper { + + private static final Logger LOG = LoggerFactory.getLogger(Hbase094xHelper.class); + + /** + * + * @param hbaseConfig + * @return + */ + public static org.apache.hadoop.conf.Configuration getHbaseConfiguration(String hbaseConfig) { + if (StringUtils.isBlank(hbaseConfig)) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.REQUIRED_VALUE, "读 Hbase 时需要配置hbaseConfig,其内容为 Hbase 连接信息,请联系 Hbase PE 获取该信息."); + } + org.apache.hadoop.conf.Configuration hConfiguration = HBaseConfiguration.create(); + try { + Map hbaseConfigMap = JSON.parseObject(hbaseConfig, new TypeReference>() {}); + // 用户配置的 key-value 对 来表示 hbaseConfig + Validate.isTrue(hbaseConfigMap != null, "hbaseConfig不能为空Map结构!"); + for (Map.Entry entry : hbaseConfigMap.entrySet()) { + hConfiguration.set(entry.getKey(), entry.getValue()); + } + } catch (Exception e) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.GET_HBASE_CONFIG_ERROR, e); + } + return hConfiguration; + } + + + public static HTable getTable(com.alibaba.datax.common.util.Configuration configuration){ + String hbaseConfig = configuration.getString(Key.HBASE_CONFIG); + String userTable = configuration.getString(Key.TABLE); + org.apache.hadoop.conf.Configuration hConfiguration = Hbase094xHelper.getHbaseConfiguration(hbaseConfig); + Boolean autoFlush = configuration.getBool(Key.AUTO_FLUSH, false); + long writeBufferSize = configuration.getLong(Key.WRITE_BUFFER_SIZE, Constant.DEFAULT_WRITE_BUFFER_SIZE); + + HTable htable = null; + HBaseAdmin admin = null; + try { + htable = new HTable(hConfiguration, userTable); + admin = new HBaseAdmin(hConfiguration); + Hbase094xHelper.checkHbaseTable(admin,htable); + //本期设置autoflush 一定为flase,通过hbase writeBufferSize来控制每次flush大小 + htable.setAutoFlush(false); + htable.setWriteBufferSize(writeBufferSize); + return htable; + } catch (Exception e) { + 
Hbase094xHelper.closeTable(htable); + throw DataXException.asDataXException(Hbase094xWriterErrorCode.GET_HBASE_TABLE_ERROR, e); + }finally { + Hbase094xHelper.closeAdmin(admin); + } + } + + public static void deleteTable(com.alibaba.datax.common.util.Configuration configuration) { + String userTable = configuration.getString(Key.TABLE); + LOG.info(String.format("由于您配置了deleteType delete,HBasWriter begins to delete table %s .", userTable)); + Scan scan = new Scan(); + HTable hTable =Hbase094xHelper.getTable(configuration); + ResultScanner scanner = null; + try { + scanner = hTable.getScanner(scan); + for (Result rr = scanner.next(); rr != null; rr = scanner.next()) { + hTable.delete(new Delete(rr.getRow())); + } + } catch (Exception e) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.DELETE_HBASE_ERROR, e); + }finally { + if(scanner != null){ + scanner.close(); + } + Hbase094xHelper.closeTable(hTable); + } + } + + public static void truncateTable(com.alibaba.datax.common.util.Configuration configuration) { + + String hbaseConfig = configuration.getString(Key.HBASE_CONFIG); + String userTable = configuration.getString(Key.TABLE); + org.apache.hadoop.conf.Configuration hConfiguration = Hbase094xHelper.getHbaseConfiguration(hbaseConfig); + + HTable htable = null; + HBaseAdmin admin = null; + LOG.info(String.format("由于您配置了deleteType truncate,HBasWriter begins to truncate table %s .", userTable)); + try{ + htable = new HTable(hConfiguration, userTable); + admin = new HBaseAdmin(hConfiguration); + HTableDescriptor descriptor = htable.getTableDescriptor(); + Hbase094xHelper.checkHbaseTable(admin,htable); + admin.disableTable(htable.getTableName()); + admin.deleteTable(htable.getTableName()); + admin.createTable(descriptor); + }catch (Exception e) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.TRUNCATE_HBASE_ERROR, e); + }finally { + Hbase094xHelper.closeAdmin(admin); + Hbase094xHelper.closeTable(htable); + } + } + + + + public static void closeAdmin(HBaseAdmin admin){ + try { + if(null != admin) + admin.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.CLOSE_HBASE_AMIN_ERROR, e); + } + } + + public static void closeTable(HTable table){ + try { + if(null != table) + table.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.CLOSE_HBASE_TABLE_ERROR, e); + } + } + + + public static void checkHbaseTable(HBaseAdmin admin, HTable hTable) throws IOException { + if (!admin.isMasterRunning()) { + throw new IllegalStateException("HBase master 没有运行, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if (!admin.tableExists(hTable.getTableName())) { + throw new IllegalStateException("HBase源头表" + Bytes.toString(hTable.getTableName()) + + "不存在, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if (!admin.isTableAvailable(hTable.getTableName()) || !admin.isTableEnabled(hTable.getTableName())) { + throw new IllegalStateException("HBase源头表" + Bytes.toString(hTable.getTableName()) + + " 不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(admin.isTableDisabled(hTable.getTableName())){ + throw new IllegalStateException("HBase源头表" + Bytes.toString(hTable.getTableName()) + + " 不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + } + + + public static void validateParameter(com.alibaba.datax.common.util.Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.HBASE_CONFIG, Hbase094xWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, Hbase094xWriterErrorCode.REQUIRED_VALUE); + + 
Hbase094xHelper.validateMode(originalConfig); + + String encoding = originalConfig.getString(Key.ENCODING, Constant.DEFAULT_ENCODING); + if (!Charset.isSupported(encoding)) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, String.format("Hbasewriter 不支持您所配置的编码:[%s]", encoding)); + } + originalConfig.set(Key.ENCODING, encoding); + Boolean autoFlush = originalConfig.getBool(Key.AUTO_FLUSH, false); + //本期设置autoflush 一定为flase,通过hbase writeBufferSize来控制每次flush大小 + originalConfig.set(Key.AUTO_FLUSH,false); + Boolean walFlag = originalConfig.getBool(Key.WAL_FLAG, false); + originalConfig.set(Key.WAL_FLAG, walFlag); + long writeBufferSize = originalConfig.getLong(Key.WRITE_BUFFER_SIZE,Constant.DEFAULT_WRITE_BUFFER_SIZE); + originalConfig.set(Key.WRITE_BUFFER_SIZE, writeBufferSize); + } + + + + + public static void validateMode(com.alibaba.datax.common.util.Configuration originalConfig){ + String mode = originalConfig.getNecessaryValue(Key.MODE, Hbase094xWriterErrorCode.REQUIRED_VALUE); + ModeType modeType = ModeType.getByTypeName(mode); + switch (modeType) { + case Normal: { + validateRowkeyColumn(originalConfig); + validateColumn(originalConfig); + validateVersionColumn(originalConfig); + break; + } + default: + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbase11xWriter不支持该 mode 类型:%s", mode)); + } + } + + public static void validateColumn(com.alibaba.datax.common.util.Configuration originalConfig){ + List columns = originalConfig.getListConfiguration(Key.COLUMN); + if (columns == null || columns.isEmpty()) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.REQUIRED_VALUE, "column为必填项,其形式为:column:[{\"index\": 0,\"name\": \"cf0:column0\",\"type\": \"string\"},{\"index\": 1,\"name\": \"cf1:column1\",\"type\": \"long\"}]"); + } + for (Configuration aColumn : columns) { + Integer index = aColumn.getInt(Key.INDEX); + String type = aColumn.getNecessaryValue(Key.TYPE, Hbase094xWriterErrorCode.REQUIRED_VALUE); + String name = aColumn.getNecessaryValue(Key.NAME, Hbase094xWriterErrorCode.REQUIRED_VALUE); + ColumnType.getByTypeName(type); + if(name.split(":").length != 2){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, String.format("您column配置项中name配置的列格式[%s]不正确,name应该配置为 列族:列名 的形式, 如 {\"index\": 1,\"name\": \"cf1:q1\",\"type\": \"long\"}", name)); + } + if(index == null || index < 0){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, "您的column配置项不正确,配置项中中index为必填项,且为非负数,请检查并修改."); + } + } + } + + public static void validateRowkeyColumn(com.alibaba.datax.common.util.Configuration originalConfig){ + List rowkeyColumn = originalConfig.getListConfiguration(Key.ROWKEY_COLUMN); + if (rowkeyColumn == null || rowkeyColumn.isEmpty()) { + throw DataXException.asDataXException(Hbase094xWriterErrorCode.REQUIRED_VALUE, "rowkeyColumn为必填项,其形式为:rowkeyColumn:[{\"index\": 0,\"type\": \"string\"},{\"index\": -1,\"type\": \"string\",\"value\": \"_\"}]"); + } + int rowkeyColumnSize = rowkeyColumn.size(); + //包含{"index":0,"type":"string"} 或者 {"index":-1,"type":"string","value":"_"} + for (Configuration aRowkeyColumn : rowkeyColumn) { + Integer index = aRowkeyColumn.getInt(Key.INDEX); + String type = aRowkeyColumn.getNecessaryValue(Key.TYPE, Hbase094xWriterErrorCode.REQUIRED_VALUE); + ColumnType.getByTypeName(type); + if(index == null ){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.REQUIRED_VALUE, "rowkeyColumn配置项中index为必填项"); + } + 
//不能只有-1列,即rowkey连接串 + if(rowkeyColumnSize ==1 && index == -1){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, "rowkeyColumn配置项不能全为常量列,至少指定一个rowkey列"); + } + if(index == -1){ + aRowkeyColumn.getNecessaryValue(Key.VALUE, Hbase094xWriterErrorCode.REQUIRED_VALUE); + } + } + } + + public static void validateVersionColumn(com.alibaba.datax.common.util.Configuration originalConfig){ + Configuration versionColumn = originalConfig.getConfiguration(Key.VERSION_COLUMN); + //为null,表示用当前时间;指定列,需要index + if(versionColumn != null){ + Integer index = versionColumn.getInt(Key.INDEX); + if(index == null ){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.REQUIRED_VALUE, "versionColumn配置项中index为必填项"); + } + if(index == -1){ + //指定时间,需要index=-1,value + versionColumn.getNecessaryValue(Key.VALUE, Hbase094xWriterErrorCode.REQUIRED_VALUE); + }else if(index < 0){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, "您versionColumn配置项中index配置不正确,只能取-1或者非负数"); + } + } + } +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xWriter.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xWriter.java new file mode 100644 index 0000000000..0092b132b3 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xWriter.java @@ -0,0 +1,82 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +/** + * Created by shf on 16/3/17. 
+ */ +public class Hbase094xWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + private Configuration originConfig = null; + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + Hbase094xHelper.validateParameter(this.originConfig); + } + + @Override + public void prepare(){ + Boolean truncate = originConfig.getBool(Key.TRUNCATE,false); + if(truncate){ + Hbase094xHelper.truncateTable(this.originConfig); + } + } + + @Override + public List split(int mandatoryNumber) { + List splitResultConfigs = new ArrayList(); + for (int j = 0; j < mandatoryNumber; j++) { + splitResultConfigs.add(originConfig.clone()); + } + return splitResultConfigs; + } + + @Override + public void destroy() { + + } + } + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private Configuration taskConfig; + private HbaseAbstractTask hbaseTaskProxy; + + @Override + public void init() { + this.taskConfig = super.getPluginJobConf(); + String mode = this.taskConfig.getString(Key.MODE); + ModeType modeType = ModeType.getByTypeName(mode); + + switch (modeType) { + case Normal: + this.hbaseTaskProxy = new NormalTask(this.taskConfig); + break; + default: + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, "Hbasewriter 不支持此类模式:" + modeType); + } + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + this.hbaseTaskProxy.startWriter(lineReceiver,super.getTaskPluginCollector()); + } + + + @Override + public void destroy() { + if (this.hbaseTaskProxy != null) { + this.hbaseTaskProxy.close(); + } + } + } +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xWriterErrorCode.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xWriterErrorCode.java new file mode 100644 index 0000000000..5752705110 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Hbase094xWriterErrorCode.java @@ -0,0 +1,44 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by shf on 16/3/8. 
+ */ +public enum Hbase094xWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("Hbasewriter-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("Hbasewriter-01", "您填写的参数值不合法."), + GET_HBASE_CONFIG_ERROR("Hbasewriter-02", "获取Hbase config时出错."), + GET_HBASE_TABLE_ERROR("Hbasewriter-03", "初始化 Hbase 抽取表时出错."), + CLOSE_HBASE_AMIN_ERROR("Hbasewriter-05", "关闭Hbase admin时出错."), + CLOSE_HBASE_TABLE_ERROR("Hbasewriter-06", "关闭Hbase table时时出错."), + PUT_HBASE_ERROR("Hbasewriter-07", "写入hbase时发生IO异常."), + DELETE_HBASE_ERROR("Hbasewriter-08", "delete hbase表时发生异常."), + TRUNCATE_HBASE_ERROR("Hbasewriter-09", "truncate hbase表时发生异常"), + CONSTRUCT_ROWKEY_ERROR("Hbasewriter-10", "构建rowkey时发生异常."), + CONSTRUCT_VERSION_ERROR("Hbasewriter-11", "构建version时发生异常.") + ; + private final String code; + private final String description; + + private Hbase094xWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/HbaseAbstractTask.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/HbaseAbstractTask.java new file mode 100755 index 0000000000..555e85aed3 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/HbaseAbstractTask.java @@ -0,0 +1,158 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.HConstants; +import org.apache.hadoop.hbase.client.HTable; +import org.apache.hadoop.hbase.client.Put; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.List; + +public abstract class HbaseAbstractTask { + private final static Logger LOG = LoggerFactory.getLogger(HbaseAbstractTask.class); + + public NullModeType nullMode = null; + + public List columns; + public List rowkeyColumn; + public Configuration versionColumn; + + + public HTable htable; + public String encoding; + public Boolean walFlag; + + + public HbaseAbstractTask(com.alibaba.datax.common.util.Configuration configuration) { + this.htable = Hbase094xHelper.getTable(configuration); + this.columns = configuration.getListConfiguration(Key.COLUMN); + this.rowkeyColumn = configuration.getListConfiguration(Key.ROWKEY_COLUMN); + this.versionColumn = configuration.getConfiguration(Key.VERSION_COLUMN); + this.encoding = configuration.getString(Key.ENCODING,Constant.DEFAULT_ENCODING); + this.nullMode = NullModeType.getByTypeName(configuration.getString(Key.NULL_MODE,Constant.DEFAULT_NULL_MODE)); + this.walFlag = configuration.getBool(Key.WAL_FLAG, false); + } + + public void startWriter(RecordReceiver lineReceiver,TaskPluginCollector taskPluginCollector){ + Record record; + try { + while ((record = lineReceiver.getFromReader()) != null) { + Put put; + try { + put = convertRecordToPut(record); + } 
catch (Exception e) { + taskPluginCollector.collectDirtyRecord(record, e); + continue; + } + try { + this.htable.put(put); + } catch (IllegalArgumentException e) { + if(e.getMessage().equals("No columns to insert") && nullMode.equals(NullModeType.Skip)){ + LOG.info(String.format("record is empty, 您配置nullMode为[skip],将会忽略这条记录,record[%s]", record.toString())); + continue; + }else { + taskPluginCollector.collectDirtyRecord(record, e); + continue; + } + } + } + }catch (IOException e){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.PUT_HBASE_ERROR,e); + }finally { + Hbase094xHelper.closeTable(this.htable); + } + } + + + public abstract Put convertRecordToPut(Record record); + + public void close() { + Hbase094xHelper.closeTable(this.htable); + } + + public byte[] getColumnByte(ColumnType columnType, Column column){ + byte[] bytes; + if(column.getRawData() != null){ + switch (columnType) { + case INT: + bytes = Bytes.toBytes(column.asLong().intValue()); + break; + case LONG: + bytes = Bytes.toBytes(column.asLong()); + break; + case DOUBLE: + bytes = Bytes.toBytes(column.asDouble()); + break; + case FLOAT: + bytes = Bytes.toBytes(column.asDouble().floatValue()); + break; + case SHORT: + bytes = Bytes.toBytes(column.asLong().shortValue()); + break; + case BOOLEAN: + bytes = Bytes.toBytes(column.asBoolean()); + break; + case STRING: + bytes = this.getValueByte(columnType,column.asString()); + break; + default: + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, "HbaseWriter列不支持您配置的列类型:" + columnType); + } + }else{ + switch (nullMode){ + case Skip: + bytes = null; + break; + case Empty: + bytes = HConstants.EMPTY_BYTE_ARRAY; + break; + default: + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, "HbaseWriter nullMode不支持您配置的类型,只支持skip或者empty"); + } + } + return bytes; + } + + public byte[] getValueByte(ColumnType columnType, String value){ + byte[] bytes; + if(value != null){ + switch (columnType) { + case INT: + bytes = Bytes.toBytes(Integer.parseInt(value)); + break; + case LONG: + bytes = Bytes.toBytes(Long.parseLong(value)); + break; + case DOUBLE: + bytes = Bytes.toBytes(Double.parseDouble(value)); + break; + case FLOAT: + bytes = Bytes.toBytes(Float.parseFloat(value)); + break; + case SHORT: + bytes = Bytes.toBytes(Short.parseShort(value)); + break; + case BOOLEAN: + bytes = Bytes.toBytes(Boolean.parseBoolean(value)); + break; + case STRING: + bytes = value.getBytes(Charset.forName(encoding)); + break; + default: + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, "HbaseWriter列不支持您配置的列类型:" + columnType); + } + }else{ + bytes = HConstants.EMPTY_BYTE_ARRAY; + } + return bytes; + } +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Key.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Key.java new file mode 100755 index 0000000000..545052f670 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/Key.java @@ -0,0 +1,52 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +public final class Key { + + public final static String HBASE_CONFIG = "hbaseConfig"; + + public final static String TABLE = "table"; + + /** + * mode 可以取 normal 或者 multiVersionFixedColumn 或者 multiVersionDynamicColumn 三个值,无默认值。 + *

+ * normal 配合 column(Map 结构的)使用 + *

+ * multiVersion + */ + public final static String MODE = "mode"; + + + public final static String ROWKEY_COLUMN = "rowkeyColumn"; + + public final static String VERSION_COLUMN = "versionColumn"; + + /** + * 默认为 utf8 + */ + public final static String ENCODING = "encoding"; + + public final static String COLUMN = "column"; + + public static final String INDEX = "index"; + + public static final String NAME = "name"; + + public static final String TYPE = "type"; + + public static final String VALUE = "value"; + + public static final String FORMAT = "format"; + + /** + * 默认为 EMPTY_BYTES + */ + public static final String NULL_MODE = "nullMode"; + + public static final String TRUNCATE = "truncate"; + + public static final String AUTO_FLUSH = "autoFlush"; + + public static final String WAL_FLAG = "walFlag"; + + public static final String WRITE_BUFFER_SIZE = "writeBufferSize"; +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/ModeType.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/ModeType.java new file mode 100644 index 0000000000..f9ae86e36b --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/ModeType.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum ModeType { + Normal("normal"), + MultiVersion("multiVersion") + ; + + private String mode; + + + ModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public String getMode() { + return mode; + } + + public static ModeType getByTypeName(String modeName) { + for (ModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该 mode 类型:%s, 目前支持的 mode 类型是:%s", modeName, Arrays.asList(values()))); + } +} diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/NormalTask.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/NormalTask.java new file mode 100755 index 0000000000..c86b4fb429 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/NormalTask.java @@ -0,0 +1,124 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.commons.lang3.time.DateUtils; +import org.apache.hadoop.hbase.KeyValue; +import org.apache.hadoop.hbase.client.Put; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Timestamp; +import java.text.ParseException; +import java.text.SimpleDateFormat; +import java.util.Date; + +public class NormalTask extends HbaseAbstractTask { + private static final Logger LOG = LoggerFactory.getLogger(NormalTask.class); + public NormalTask(Configuration configuration) { + super(configuration); + } + + @Override + public Put convertRecordToPut(Record record){ + byte[] rowkey = getRowkey(record); + Put put = null; + if(this.versionColumn == null){ + put = new Put(rowkey); + 
put.setWriteToWAL(super.walFlag); + }else { + long timestamp = getVersion(record); + put = new Put(rowkey,timestamp); + } + for (Configuration aColumn : columns) { + Integer index = aColumn.getInt(Key.INDEX); + String type = aColumn.getString(Key.TYPE); + ColumnType columnType = ColumnType.getByTypeName(type); + String name = aColumn.getString(Key.NAME); + String promptInfo = "Hbasewriter 中,column 的列配置格式应该是:列族:列名. 您配置的列错误:" + name; + String[] cfAndQualifier = name.split(":"); + Validate.isTrue(cfAndQualifier != null && cfAndQualifier.length == 2 + && StringUtils.isNotBlank(cfAndQualifier[0]) + && StringUtils.isNotBlank(cfAndQualifier[1]), promptInfo); + if(index >= record.getColumnNumber()){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, String.format("您的column配置项中中index值超出范围,根据reader端配置,index的值小于%s,而您配置的值为%s,请检查并修改.",record.getColumnNumber(),index)); + } + byte[] columnBytes = getColumnByte(columnType,record.getColumn(index)); + //columnBytes 为null忽略这列 + if(null != columnBytes){ + put.add(Bytes.toBytes( + cfAndQualifier[0]), + Bytes.toBytes(cfAndQualifier[1]), + columnBytes); + }else{ + continue; + } + } + return put; + } + + public byte[] getRowkey(Record record){ + byte[] rowkeyBuffer = {}; + for (Configuration aRowkeyColumn : rowkeyColumn) { + Integer index = aRowkeyColumn.getInt(Key.INDEX); + String type = aRowkeyColumn.getString(Key.TYPE); + ColumnType columnType = ColumnType.getByTypeName(type); + if(index == -1){ + String value = aRowkeyColumn.getString(Key.VALUE); + rowkeyBuffer = Bytes.add(rowkeyBuffer,getValueByte(columnType,value)); + }else{ + if(index >= record.getColumnNumber()){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.CONSTRUCT_ROWKEY_ERROR, String.format("您的rowkeyColumn配置项中中index值超出范围,根据reader端配置,index的值小于%s,而您配置的值为%s,请检查并修改.",record.getColumnNumber(),index)); + } + byte[] value = getColumnByte(columnType,record.getColumn(index)); + rowkeyBuffer = Bytes.add(rowkeyBuffer, value); + } + } + return rowkeyBuffer; + } + + public long getVersion(Record record){ + int index = versionColumn.getInt(Key.INDEX); + long timestamp; + if(index == -1){ + //指定时间作为版本 + timestamp = versionColumn.getLong(Key.VALUE); + if(timestamp < 0){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.CONSTRUCT_VERSION_ERROR, "您指定的版本非法!"); + } + }else{ + //指定列作为版本,long/doubleColumn直接record.aslong, 其它类型尝试用yyyy-MM-dd HH:mm:ss,yyyy-MM-dd HH:mm:ss SSS去format + if(index >= record.getColumnNumber()){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.CONSTRUCT_VERSION_ERROR, String.format("您的versionColumn配置项中中index值超出范围,根据reader端配置,index的值小于%s,而您配置的值为%s,请检查并修改.",record.getColumnNumber(),index)); + } + if(record.getColumn(index).getRawData() == null){ + throw DataXException.asDataXException(Hbase094xWriterErrorCode.CONSTRUCT_VERSION_ERROR, "您指定的版本为空!"); + } + SimpleDateFormat df_senconds = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); + SimpleDateFormat df_ms = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS"); + if(record.getColumn(index) instanceof LongColumn || record.getColumn(index) instanceof DoubleColumn){ + timestamp = record.getColumn(index).asLong(); + }else { + Date date; + try{ + date = df_ms.parse(record.getColumn(index).asString()); + }catch (ParseException e){ + try { + date = df_senconds.parse(record.getColumn(index).asString()); + } catch (ParseException e1) { + LOG.info(String.format("您指定第[%s]列作为hbase写入版本,但在尝试用yyyy-MM-dd HH:mm:ss 和 yyyy-MM-dd HH:mm:ss SSS 去解析为Date时均出错,请检查并修改",index)); + throw 
DataXException.asDataXException(Hbase094xWriterErrorCode.CONSTRUCT_VERSION_ERROR, e1); + } + } + timestamp = date.getTime(); + } + } + return timestamp; + } +} \ No newline at end of file diff --git a/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/NullModeType.java b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/NullModeType.java new file mode 100644 index 0000000000..edec0d2346 --- /dev/null +++ b/hbase094xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase094xwriter/NullModeType.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.writer.hbase094xwriter; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum NullModeType { + Skip("skip"), + Empty("empty") + ; + + private String mode; + + + NullModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public String getMode() { + return mode; + } + + public static NullModeType getByTypeName(String modeName) { + for (NullModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + throw DataXException.asDataXException(Hbase094xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该 nullMode 类型:%s, 目前支持的 nullMode 类型是:%s", modeName, Arrays.asList(values()))); + } +} diff --git a/hbase094xwriter/src/main/resources/plugin.json b/hbase094xwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..abc051c2ba --- /dev/null +++ b/hbase094xwriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "hbase094xwriter", + "class": "com.alibaba.datax.plugin.writer.hbase094xwriter.Hbase094xWriter", + "description": "use put: prod. mechanism: use hbase java api put data.", + "developer": "alibaba" +} + diff --git a/hbase094xwriter/src/main/resources/plugin_job_template.json b/hbase094xwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..e6b0653138 --- /dev/null +++ b/hbase094xwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "hbase094xwriter", + "parameter": { + "hbaseConfig": {}, + "table": "", + "mode": "", + "rowkeyColumn": [ + ], + "column": [ + ], + "versionColumn":{ + "index": "", + "value":"" + }, + "encoding": "" + } +} \ No newline at end of file diff --git a/hbase11xreader/doc/.gitkeep b/hbase11xreader/doc/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/hbase11xreader/doc/hbase11xreader.md b/hbase11xreader/doc/hbase11xreader.md new file mode 100644 index 0000000000..21afae8f2e --- /dev/null +++ b/hbase11xreader/doc/hbase11xreader.md @@ -0,0 +1,454 @@ +# Hbase094XReader & Hbase11XReader 插件文档 + +___ + +## 1 快速介绍 + +HbaseReader 插件实现了从 Hbase中读取数据。在底层实现上,HbaseReader 通过 HBase 的 Java 客户端连接远程 HBase 服务,并通过 Scan 方式读取你指定 rowkey 范围内的数据,并将读取的数据使用 DataX 自定义的数据类型拼装为抽象的数据集,并传递给下游 Writer 处理。 + + +### 1.1支持的功能 + +1、目前HbaseReader支持的Hbase版本有:Hbase0.94.x和Hbase1.1.x。 + +* 若您的hbase版本为Hbase0.94.x,reader端的插件请选择:hbase094xreader,即: + + ``` + "reader": { + "name": "hbase094xreader" + } + ``` + +* 若您的hbase版本为Hbase1.1.x,reader端的插件请选择:hbase11xreader,即: + + ``` + "reader": { + "name": "hbase11xreader" + } + ``` + +2、目前HbaseReader支持两模式读取:normal 模式、multiVersionFixedColumn模式; + +* normal 模式:把HBase中的表,当成普通二维表(横表)进行读取,读取最新版本数据。如: + + ``` +hbase(main):017:0> scan 'users' +ROW COLUMN+CELL + lisi column=address:city, timestamp=1457101972764, value=beijing + lisi column=address:contry, timestamp=1457102773908, value=china + lisi column=address:province, timestamp=1457101972736, 
value=beijing + lisi column=info:age, timestamp=1457101972548, value=27 + lisi column=info:birthday, timestamp=1457101972604, value=1987-06-17 + lisi column=info:company, timestamp=1457101972653, value=baidu + xiaoming column=address:city, timestamp=1457082196082, value=hangzhou + xiaoming column=address:contry, timestamp=1457082195729, value=china + xiaoming column=address:province, timestamp=1457082195773, value=zhejiang + xiaoming column=info:age, timestamp=1457082218735, value=29 + xiaoming column=info:birthday, timestamp=1457082186830, value=1987-06-17 + xiaoming column=info:company, timestamp=1457082189826, value=alibaba +2 row(s) in 0.0580 seconds +``` +读取后数据 + + | rowKey | addres:city | address:contry | address:province | info:age| info:birthday | info:company | + | --------| ---------------- |----- |----- |--------| ---------------- |----- | +| lisi | beijing| china| beijing |27 | 1987-06-17 | baidu| +| xiaoming | hangzhou| china | zhejiang|29 | 1987-06-17 | alibaba| + + + +* multiVersionFixedColumn模式:把HBase中的表,当成竖表进行读取。读出的每条记录一定是四列形式,依次为:rowKey,family:qualifier,timestamp,value。读取时需要明确指定要读取的列,把每一个 cell 中的值,作为一条记录(record),若有多个版本就有多条记录(record)。如: + + ``` +hbase(main):018:0> scan 'users',{VERSIONS=>5} +ROW COLUMN+CELL + lisi column=address:city, timestamp=1457101972764, value=beijing + lisi column=address:contry, timestamp=1457102773908, value=china + lisi column=address:province, timestamp=1457101972736, value=beijing + lisi column=info:age, timestamp=1457101972548, value=27 + lisi column=info:birthday, timestamp=1457101972604, value=1987-06-17 + lisi column=info:company, timestamp=1457101972653, value=baidu + xiaoming column=address:city, timestamp=1457082196082, value=hangzhou + xiaoming column=address:contry, timestamp=1457082195729, value=china + xiaoming column=address:province, timestamp=1457082195773, value=zhejiang + xiaoming column=info:age, timestamp=1457082218735, value=29 + xiaoming column=info:age, timestamp=1457082178630, value=24 + xiaoming column=info:birthday, timestamp=1457082186830, value=1987-06-17 + xiaoming column=info:company, timestamp=1457082189826, value=alibaba +2 row(s) in 0.0260 seconds +``` +读取后数据(4列) + + | rowKey | column:qualifier| timestamp | value | +| --------| ---------------- |----- |----- | +| lisi | address:city| 1457101972764 | beijing | +| lisi | address:contry| 1457102773908 | china | +| lisi | address:province| 1457101972736 | beijing | +| lisi | info:age| 1457101972548 | 27 | +| lisi | info:birthday| 1457101972604 | 1987-06-17 | +| lisi | info:company| 1457101972653 | beijing | +| xiaoming | address:city| 1457082196082 | hangzhou | +| xiaoming | address:contry| 1457082195729 | china | +| xiaoming | address:province| 1457082195773 | zhejiang | +| xiaoming | info:age| 1457082218735 | 29 | +| xiaoming | info:age| 1457082178630 | 24 | +| xiaoming | info:birthday| 1457082186830 | 1987-06-17 | +| xiaoming | info:company| 1457082189826 | alibaba | + + +3、HbaseReader中有一个必填配置项是:hbaseConfig,需要你联系 HBase PE,将hbase-site.xml 中与连接 HBase 相关的配置项提取出来,以 json 格式填入,同时可以补充更多HBase client的配置,如:设置scan的cache(hbase.client.scanner.caching)、batch来优化与服务器的交互。 + + +如:hbase-site.xml的配置内容如下 + +``` + + + hbase.rootdir + hdfs://ip:9000/hbase + + + hbase.cluster.distributed + true + + + hbase.zookeeper.quorum + *** + + +``` +转换后的json为: + +``` +"hbaseConfig": { + "hbase.rootdir": "hdfs: //ip:9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "***" + } +``` + +### 1.2 限制 + 
+1、目前不支持动态列的读取。考虑网络传输流量(支持动态列,需要先将hbase所有列的数据读取出来,再按规则进行过滤),现支持的两种读取模式中需要用户明确指定要读取的列。 + +2、关于同步作业的切分:目前的切分方式是根据用户hbase表数据的region分布进行切分。即:在用户填写的[startrowkey,endrowkey]范围内,一个region会切分成一个task,单个region不进行切分。 + +3、multiVersionFixedColumn模式下不支持增加常量列 + + +## 2 实现原理 + +简而言之,HbaseReader 通过 HBase 的 Java 客户端,通过 HTable, Scan, ResultScanner 等 API,读取你指定 rowkey 范围内的数据,并将读取的数据使用 DataX 自定义的数据类型拼装为抽象的数据集,并传递给下游 Writer 处理。hbase11xreader与hbase094xreader的主要不同在于API的调用不同,Hbase1.1.x废弃了很多Hbase0.94.x的api。 + + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从 HBase 抽取数据到本地的作业:(normal 模式) + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "hbase11xreader", + "parameter": { + "hbaseConfig": { + "hbase.rootdir": "hdfs: //xxxx: 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "xxxf" + }, + "table": "users", + "encoding": "utf-8", + "mode": "normal", + "column": [ + { + "name": "rowkey", + "type": "string" + }, + { + "name": "info: age", + "type": "string" + }, + { + "name": "info: birthday", + "type": "date", + "format":"yyyy-MM-dd" + }, + { + "name": "info: company", + "type": "string" + }, + { + "name": "address: contry", + "type": "string" + }, + { + "name": "address: province", + "type": "string" + }, + { + "name": "address: city", + "type": "string" + } + ], + "range": { + "startRowkey": "", + "endRowkey": "", + "isBinaryRowkey": true + } + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/Users/shf/workplace/datax_test/hbase11xreader/result", + "fileName": "qiran", + "writeMode": "truncate" + } + } + } + ] + } +} +``` + +* 配置一个从 HBase 抽取数据到本地的作业:( multiVersionFixedColumn 模式) + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "hbase11xreader", + "parameter": { + "hbaseConfig": { + "hbase.rootdir": "hdfs: //xxx 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "xxx" + }, + "table": "users", + "encoding": "utf-8", + "mode": "multiVersionFixedColumn", + "maxVersion": "-1", + "column": [ + { + "name": "rowkey", + "type": "string" + }, + { + "name": "info: age", + "type": "string" + }, + { + "name": "info: birthday", + "type": "date", + "format":"yyyy-MM-dd" + }, + { + "name": "info: company", + "type": "string" + }, + { + "name": "address: contry", + "type": "string" + }, + { + "name": "address: province", + "type": "string" + }, + { + "name": "address: city", + "type": "string" + } + ], + "range": { + "startRowkey": "", + "endRowkey": "" + } + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/Users/shf/workplace/datax_test/hbase11xreader/result", + "fileName": "qiran", + "writeMode": "truncate" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **hbaseConfig** + + * 描述:每个HBase集群提供给DataX客户端连接的配置信息存放在hbase-site.xml,请联系你的HBase PE提供配置信息,并转换为JSON格式。同时可以补充更多HBase client的配置,如:设置scan的cache、batch来优化与服务器的交互。 + + * 必选:是
+ + * 默认值:无
+ +* **mode** + + * 描述:读取hbase的模式,支持normal 模式、multiVersionFixedColumn模式,即:normal/multiVersionFixedColumn
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:要读取的 hbase 表名(大小写敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **encoding** + + * 描述:编码方式,UTF-8 或是 GBK,用于对二进制存储的 HBase byte[] 转为 String 时的编码
+ + * 必选:否
+ + * 默认值:UTF-8
+ + +* **column** + + * 描述:要读取的hbase字段,normal 模式与multiVersionFixedColumn 模式下必填项。 + (1)、normal 模式下:name指定读取的hbase列,除了rowkey外,必须为 列族:列名 的格式,type指定源数据的类型,format指定日期类型的格式,value指定当前类型为常量,不从hbase读取数据,而是根据value值自动生成对应的列。配置格式如下: + + ``` + "column": +[ + { + "name": "rowkey", + "type": "string" + }, + { + "value": "test", + "type": "string" + } +] + + ``` + normal 模式下,对于用户指定Column信息,type必须填写,name/value必须选择其一。 + + (2)、multiVersionFixedColumn 模式下:name指定读取的hbase列,除了rowkey外,必须为 列族:列名 的格式,type指定源数据的类型,format指定日期类型的格式 。multiVersionFixedColumn模式下不支持常量列。配置格式如下: + + ``` + "column": +[ + { + "name": "rowkey", + "type": "string" + }, + { + "name": "info: age", + "type": "string" + } +] + ``` + + * 必选:是
+ + * 默认值:无
+ + +* **maxVersion** + + * 描述:指定在多版本模式下的hbasereader读取的版本数,取值只能为-1或者大于1的数字,-1表示读取所有版本
+ + * 必选:multiVersionFixedColumn 模式下必填项
+ + * 默认值:无
+ +* **range** + + * 描述:指定hbasereader读取的rowkey范围。
+ startRowkey:指定开始rowkey;
+ endRowkey指定结束rowkey;
+ isBinaryRowkey:指定配置的startRowkey和endRowkey转换为byte[]时的方式,默认值为false,若为true,则调用Bytes.toBytesBinary(rowkey)方法进行转换;若为false:则调用Bytes.toBytes(rowkey)
+ 配置格式如下: + + ``` + "range": { + "startRowkey": "aaa", + "endRowkey": "ccc", + "isBinaryRowkey":false +} + ``` +
+ + * 必选:否
+ + * 默认值:无
+ +* **scanCacheSize** + + * 描述:Hbase client每次rpc从服务器端读取的行数
+ + * 必选:否
+ + * 默认值:256
+ +* **scanBatchSize** + + * 描述:Hbase client每次rpc从服务器端读取的列数
+ + * 必选:否
+ + * 默认值:100
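+
+Neither sample job above sets `range`, `scanCacheSize` or `scanBatchSize` explicitly; the sketch below (illustrative values only, reusing the rowkey examples and the defaults documented in this section) shows where they sit inside the reader `parameter` block:
+
+```
+"parameter": {
+    "hbaseConfig": {"hbase.zookeeper.quorum": "***"},
+    "table": "users",
+    "mode": "normal",
+    "column": [
+        {"name": "rowkey", "type": "string"},
+        {"name": "info: age", "type": "string"}
+    ],
+    "range": {
+        "startRowkey": "aaa",
+        "endRowkey": "ccc",
+        "isBinaryRowkey": false
+    },
+    "scanCacheSize": 256,
+    "scanBatchSize": 100
+}
+```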
+ + +### 3.3 类型转换 + + +下面列出支持的读取HBase数据类型,HbaseReader 针对 HBase 类型转换列表: + +| DataX 内部类型| HBase 数据类型 | +| -------- | ----- | +| Long |int, short ,long| +| Double |float, double| +| String |string,binarystring | +| Date |date | +| Boolean |boolean | + + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 + +## 4 性能报告 + +略 + +## 5 约束限制 + +略 + + +## 6 FAQ + +*** diff --git a/hbase11xreader/pom.xml b/hbase11xreader/pom.xml new file mode 100644 index 0000000000..e6923580fe --- /dev/null +++ b/hbase11xreader/pom.xml @@ -0,0 +1,102 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + hbase11xreader + hbase11xreader + 0.0.1-SNAPSHOT + jar + + + 1.1.3 + 2.5.0 + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.apache.hadoop + hadoop-hdfs + ${hadoop.version} + + + org.apache.hbase + hbase-client + ${hbase.version} + + + org.apache.hbase + hbase-common + ${hbase.version} + + + com.google.guava + guava + 12.0.1 + + + junit + junit + test + + + org.mockito + mockito-core + 2.0.44-beta + test + + + com.alibaba.datax + datax-core + ${datax-project-version} + test + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/hbase11xreader/src/main/assembly/package.xml b/hbase11xreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..f48acef6d3 --- /dev/null +++ b/hbase11xreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/hbase11xreader + + + target/ + + hbase11xreader-0.0.1-SNAPSHOT.jar + + plugin/reader/hbase11xreader + + + + + + false + plugin/reader/hbase11xreader/libs + runtime + + + diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/ColumnType.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/ColumnType.java new file mode 100755 index 0000000000..0efec09a91 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/ColumnType.java @@ -0,0 +1,48 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang.StringUtils; + +import java.util.Arrays; + +/** + * 只对 normal 模式读取时有用,多版本读取时,不存在列类型的 + */ +public enum ColumnType { + BOOLEAN("boolean"), + SHORT("short"), + INT("int"), + LONG("long"), + FLOAT("float"), + DOUBLE("double"), + DATE("date"), + STRING("string"), + BINARY_STRING("binarystring") + ; + + private String typeName; + + ColumnType(String typeName) { + this.typeName = typeName; + } + + public static ColumnType getByTypeName(String typeName) { + if(StringUtils.isBlank(typeName)){ + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, + String.format("Hbasereader 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + for (ColumnType columnType : values()) { + if (StringUtils.equalsIgnoreCase(columnType.typeName, typeName.trim())) { + return columnType; + } + } + + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, + String.format("Hbasereader 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + + @Override + public String toString() { + return this.typeName; + } +} diff --git 
a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Constant.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Constant.java new file mode 100755 index 0000000000..af2e7e8a01 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Constant.java @@ -0,0 +1,16 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +public final class Constant { + public static final String RANGE = "range"; + + public static final String ROWKEY_FLAG = "rowkey"; + + public static final String DEFAULT_DATA_FORMAT = "yyyy-MM-dd HH:mm:ss"; + + public static final String DEFAULT_ENCODING = "UTF-8"; + + public static final int DEFAULT_SCAN_CACHE_SIZE = 256; + + public static final int DEFAULT_SCAN_BATCH_SIZE = 100; + +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xHelper.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xHelper.java new file mode 100644 index 0000000000..643072a92e --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xHelper.java @@ -0,0 +1,482 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.TypeReference; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.hbase.HBaseConfiguration; +import org.apache.hadoop.hbase.HConstants; +import org.apache.hadoop.hbase.TableName; +import org.apache.hadoop.hbase.client.*; +import org.apache.hadoop.hbase.util.Bytes; +import org.apache.hadoop.hbase.util.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + + +/** + * 工具类 + * Created by shf on 16/3/7. 
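+ * Helpers for the reader: HBase connection/table/RegionLocator lifecycle, rowkey conversion, column parsing and region-based job splitting.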
+ */ +public class Hbase11xHelper { + + private static final Logger LOG = LoggerFactory.getLogger(Hbase11xHelper.class); + + public static org.apache.hadoop.hbase.client.Connection getHbaseConnection(String hbaseConfig) { + if (StringUtils.isBlank(hbaseConfig)) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.REQUIRED_VALUE, "读 Hbase 时需要配置hbaseConfig,其内容为 Hbase 连接信息,请联系 Hbase PE 获取该信息."); + } + org.apache.hadoop.conf.Configuration hConfiguration = HBaseConfiguration.create(); + try { + Map hbaseConfigMap = JSON.parseObject(hbaseConfig, new TypeReference>() {}); + // 用户配置的 key-value 对 来表示 hbaseConfig + Validate.isTrue(hbaseConfigMap != null && hbaseConfigMap.size() !=0, "hbaseConfig不能为空Map结构!"); + for (Map.Entry entry : hbaseConfigMap.entrySet()) { + hConfiguration.set(entry.getKey(), entry.getValue()); + } + } catch (Exception e) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.GET_HBASE_CONNECTION_ERROR, e); + } + org.apache.hadoop.hbase.client.Connection hConnection = null; + try { + hConnection = ConnectionFactory.createConnection(hConfiguration); + + } catch (Exception e) { + Hbase11xHelper.closeConnection(hConnection); + throw DataXException.asDataXException(Hbase11xReaderErrorCode.GET_HBASE_CONNECTION_ERROR, e); + } + return hConnection; + } + + + public static Table getTable(com.alibaba.datax.common.util.Configuration configuration){ + String hbaseConfig = configuration.getString(Key.HBASE_CONFIG); + String userTable = configuration.getString(Key.TABLE); + org.apache.hadoop.hbase.client.Connection hConnection = Hbase11xHelper.getHbaseConnection(hbaseConfig); + TableName hTableName = TableName.valueOf(userTable); + org.apache.hadoop.hbase.client.Admin admin = null; + org.apache.hadoop.hbase.client.Table hTable = null; + try { + admin = hConnection.getAdmin(); + Hbase11xHelper.checkHbaseTable(admin,hTableName); + hTable = hConnection.getTable(hTableName); + + } catch (Exception e) { + Hbase11xHelper.closeTable(hTable); + Hbase11xHelper.closeAdmin(admin); + Hbase11xHelper.closeConnection(hConnection); + throw DataXException.asDataXException(Hbase11xReaderErrorCode.GET_HBASE_TABLE_ERROR, e); + } + return hTable; + } + + public static RegionLocator getRegionLocator(com.alibaba.datax.common.util.Configuration configuration){ + String hbaseConfig = configuration.getString(Key.HBASE_CONFIG); + String userTable = configuration.getString(Key.TABLE); + org.apache.hadoop.hbase.client.Connection hConnection = Hbase11xHelper.getHbaseConnection(hbaseConfig); + TableName hTableName = TableName.valueOf(userTable); + org.apache.hadoop.hbase.client.Admin admin = null; + RegionLocator regionLocator = null; + try { + admin = hConnection.getAdmin(); + Hbase11xHelper.checkHbaseTable(admin,hTableName); + regionLocator = hConnection.getRegionLocator(hTableName); + } catch (Exception e) { + Hbase11xHelper.closeRegionLocator(regionLocator); + Hbase11xHelper.closeAdmin(admin); + Hbase11xHelper.closeConnection(hConnection); + throw DataXException.asDataXException(Hbase11xReaderErrorCode.GET_HBASE_REGINLOCTOR_ERROR, e); + } + return regionLocator; + + } + + public static void closeConnection(Connection hConnection){ + try { + if(null != hConnection) + hConnection.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.CLOSE_HBASE_CONNECTION_ERROR, e); + } + } + + public static void closeAdmin(Admin admin){ + try { + if(null != admin) + admin.close(); + } catch (IOException e) { + throw 
DataXException.asDataXException(Hbase11xReaderErrorCode.CLOSE_HBASE_ADMIN_ERROR, e); + } + } + + public static void closeTable(Table table){ + try { + if(null != table) + table.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.CLOSE_HBASE_TABLE_ERROR, e); + } + } + + public static void closeResultScanner(ResultScanner resultScanner){ + if(null != resultScanner) { + resultScanner.close(); + } + } + + public static void closeRegionLocator(RegionLocator regionLocator){ + try { + if(null != regionLocator) + regionLocator.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.CLOSE_HBASE_REGINLOCTOR_ERROR, e); + } + } + + + public static void checkHbaseTable(Admin admin, TableName hTableName) throws IOException { + if(!admin.tableExists(hTableName)){ + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "HBase源头表" + hTableName.toString() + + "不存在, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(!admin.isTableAvailable(hTableName)){ + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "HBase源头表" +hTableName.toString() + + " 不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(admin.isTableDisabled(hTableName)){ + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "HBase源头表" +hTableName.toString() + + "is disabled, 请检查您的配置 或者 联系 Hbase 管理员."); + } + } + + + public static byte[] convertUserStartRowkey(com.alibaba.datax.common.util.Configuration configuration) { + String startRowkey = configuration.getString(Key.START_ROWKEY); + if (StringUtils.isBlank(startRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } else { + boolean isBinaryRowkey = configuration.getBool(Key.IS_BINARY_ROWKEY); + return Hbase11xHelper.stringToBytes(startRowkey, isBinaryRowkey); + } + } + + public static byte[] convertUserEndRowkey(com.alibaba.datax.common.util.Configuration configuration) { + String endRowkey = configuration.getString(Key.END_ROWKEY); + if (StringUtils.isBlank(endRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } else { + boolean isBinaryRowkey = configuration.getBool(Key.IS_BINARY_ROWKEY); + return Hbase11xHelper.stringToBytes(endRowkey, isBinaryRowkey); + } + } + + /** + * 注意:convertUserStartRowkey 和 convertInnerStartRowkey,前者会受到 isBinaryRowkey 的影响,只用于第一次对用户配置的 String 类型的 rowkey 转为二进制时使用。而后者约定:切分时得到的二进制的 rowkey 回填到配置中时采用 + */ + public static byte[] convertInnerStartRowkey(Configuration configuration) { + String startRowkey = configuration.getString(Key.START_ROWKEY); + if (StringUtils.isBlank(startRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } + + return Bytes.toBytesBinary(startRowkey); + } + + public static byte[] convertInnerEndRowkey(Configuration configuration) { + String endRowkey = configuration.getString(Key.END_ROWKEY); + if (StringUtils.isBlank(endRowkey)) { + return HConstants.EMPTY_BYTE_ARRAY; + } + + return Bytes.toBytesBinary(endRowkey); + } + + + private static byte[] stringToBytes(String rowkey, boolean isBinaryRowkey) { + if (isBinaryRowkey) { + return Bytes.toBytesBinary(rowkey); + } else { + return Bytes.toBytes(rowkey); + } + } + + + public static boolean isRowkeyColumn(String columnName) { + return Constant.ROWKEY_FLAG.equalsIgnoreCase(columnName); + } + + + /** + * 用于解析 Normal 模式下的列配置 + */ + public static List parseColumnOfNormalMode(List column) { + List hbaseColumnCells = new ArrayList(); + + HbaseColumnCell oneColumnCell; + + for (Map aColumn : column) { + ColumnType type = ColumnType.getByTypeName(aColumn.get(Key.TYPE)); + 
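+            // Each column entry carries a type plus either a name (rowkey or columnFamily:qualifier) or a constant value; format is only consulted for date columns.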
String columnName = aColumn.get(Key.NAME); + String columnValue = aColumn.get(Key.VALUE); + String dateformat = aColumn.get(Key.FORMAT); + + if (type == ColumnType.DATE) { + + if(dateformat == null){ + dateformat = Constant.DEFAULT_DATA_FORMAT; + } + Validate.isTrue(StringUtils.isNotBlank(columnName) || StringUtils.isNotBlank(columnValue), "Hbasereader 在 normal 方式读取时则要么是 type + name + format 的组合,要么是type + value + format 的组合. 而您的配置非这两种组合,请检查并修改."); + + oneColumnCell = new HbaseColumnCell + .Builder(type) + .columnName(columnName) + .columnValue(columnValue) + .dateformat(dateformat) + .build(); + } else { + Validate.isTrue(StringUtils.isNotBlank(columnName) || StringUtils.isNotBlank(columnValue), "Hbasereader 在 normal 方式读取时,其列配置中,如果类型不是时间,则要么是 type + name 的组合,要么是type + value 的组合. 而您的配置非这两种组合,请检查并修改."); + oneColumnCell = new HbaseColumnCell.Builder(type) + .columnName(columnName) + .columnValue(columnValue) + .build(); + } + + hbaseColumnCells.add(oneColumnCell); + } + + return hbaseColumnCells; + } + + //将多竖表column变成>形式 + public static HashMap> parseColumnOfMultiversionMode(List column){ + + HashMap> familyQualifierMap = new HashMap>(); + for (Map aColumn : column) { + String type = aColumn.get(Key.TYPE); + String columnName = aColumn.get(Key.NAME); + String dateformat = aColumn.get(Key.FORMAT); + + ColumnType.getByTypeName(type); + Validate.isTrue(StringUtils.isNotBlank(columnName), "Hbasereader 中,column 需要配置列名称name,格式为 列族:列名,您的配置为空,请检查并修改."); + + String familyQualifier; + if( !Hbase11xHelper.isRowkeyColumn(columnName)){ + String[] cfAndQualifier = columnName.split(":"); + if ( cfAndQualifier.length != 2) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 中,column 的列配置格式应该是:列族:列名. 您配置的列错误:" + columnName); + } + familyQualifier = StringUtils.join(cfAndQualifier[0].trim(),":",cfAndQualifier[1].trim()); + }else{ + familyQualifier = columnName.trim(); + } + + HashMap typeAndFormat = new HashMap(); + typeAndFormat.put(Key.TYPE,type); + typeAndFormat.put(Key.FORMAT,dateformat); + familyQualifierMap.put(familyQualifier,typeAndFormat); + } + return familyQualifierMap; + } + + public static List split(Configuration configuration) { + byte[] startRowkeyByte = Hbase11xHelper.convertUserStartRowkey(configuration); + byte[] endRowkeyByte = Hbase11xHelper.convertUserEndRowkey(configuration); + + /* 如果用户配置了 startRowkey 和 endRowkey,需要确保:startRowkey <= endRowkey */ + if (startRowkeyByte.length != 0 && endRowkeyByte.length != 0 + && Bytes.compareTo(startRowkeyByte, endRowkeyByte) > 0) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 中 startRowkey 不得大于 endRowkey."); + } + RegionLocator regionLocator = Hbase11xHelper.getRegionLocator(configuration); + List resultConfigurations ; + try { + Pair regionRanges = regionLocator.getStartEndKeys(); + if (null == regionRanges) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.SPLIT_ERROR, "获取源头 Hbase 表的 rowkey 范围失败."); + } + resultConfigurations = Hbase11xHelper.doSplit(configuration, startRowkeyByte, endRowkeyByte, + regionRanges); + + LOG.info("HBaseReader split job into {} tasks.", resultConfigurations.size()); + return resultConfigurations; + } catch (Exception e) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.SPLIT_ERROR, "切分源头 Hbase 表失败.", e); + }finally { + Hbase11xHelper.closeRegionLocator(regionLocator); + } + } + + + private static List doSplit(Configuration config, byte[] startRowkeyByte, + byte[] endRowkeyByte, Pair regionRanges) { + + 
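+        // Iterate the region [startKey, endKey) pairs: regions outside the user-supplied range are skipped, and each remaining task range is clamped to the overlap via getStartKey/getEndKey below.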
List configurations = new ArrayList(); + + for (int i = 0; i < regionRanges.getFirst().length; i++) { + + byte[] regionStartKey = regionRanges.getFirst()[i]; + byte[] regionEndKey = regionRanges.getSecond()[i]; + + // 当前的region为最后一个region + // 如果最后一个region的start Key大于用户指定的userEndKey,则最后一个region,应该不包含在内 + // 注意如果用户指定userEndKey为"",则此判断应该不成立。userEndKey为""表示取得最大的region + if (Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) == 0 + && (endRowkeyByte.length != 0 && (Bytes.compareTo( + regionStartKey, endRowkeyByte) > 0))) { + continue; + } + + // 如果当前的region不是最后一个region, + // 用户配置的userStartKey大于等于region的endkey,则这个region不应该含在内 + if ((Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) != 0) + && (Bytes.compareTo(startRowkeyByte, regionEndKey) >= 0)) { + continue; + } + + // 如果用户配置的userEndKey小于等于 region的startkey,则这个region不应该含在内 + // 注意如果用户指定的userEndKey为"",则次判断应该不成立。userEndKey为""表示取得最大的region + if (endRowkeyByte.length != 0 + && (Bytes.compareTo(endRowkeyByte, regionStartKey) <= 0)) { + continue; + } + + Configuration p = config.clone(); + + String thisStartKey = getStartKey(startRowkeyByte, regionStartKey); + + String thisEndKey = getEndKey(endRowkeyByte, regionEndKey); + + p.set(Key.START_ROWKEY, thisStartKey); + p.set(Key.END_ROWKEY, thisEndKey); + + LOG.debug("startRowkey:[{}], endRowkey:[{}] .", thisStartKey, thisEndKey); + + configurations.add(p); + } + + return configurations; + } + + private static String getEndKey(byte[] endRowkeyByte, byte[] regionEndKey) { + if (endRowkeyByte == null) {// 由于之前处理过,所以传入的userStartKey不可能为null + throw new IllegalArgumentException("userEndKey should not be null!"); + } + + byte[] tempEndRowkeyByte; + + if (endRowkeyByte.length == 0) { + tempEndRowkeyByte = regionEndKey; + } else if (Bytes.compareTo(regionEndKey, HConstants.EMPTY_BYTE_ARRAY) == 0) { + // 为最后一个region + tempEndRowkeyByte = endRowkeyByte; + } else { + if (Bytes.compareTo(endRowkeyByte, regionEndKey) > 0) { + tempEndRowkeyByte = regionEndKey; + } else { + tempEndRowkeyByte = endRowkeyByte; + } + } + + return Bytes.toStringBinary(tempEndRowkeyByte); + } + + private static String getStartKey(byte[] startRowkeyByte, byte[] regionStarKey) { + if (startRowkeyByte == null) {// 由于之前处理过,所以传入的userStartKey不可能为null + throw new IllegalArgumentException( + "userStartKey should not be null!"); + } + + byte[] tempStartRowkeyByte; + + if (Bytes.compareTo(startRowkeyByte, regionStarKey) < 0) { + tempStartRowkeyByte = regionStarKey; + } else { + tempStartRowkeyByte = startRowkeyByte; + } + return Bytes.toStringBinary(tempStartRowkeyByte); + } + + + public static void validateParameter(com.alibaba.datax.common.util.Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.HBASE_CONFIG, Hbase11xReaderErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, Hbase11xReaderErrorCode.REQUIRED_VALUE); + + Hbase11xHelper.validateMode(originalConfig); + + //非必选参数处理 + String encoding = originalConfig.getString(Key.ENCODING, Constant.DEFAULT_ENCODING); + if (!Charset.isSupported(encoding)) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, String.format("Hbasereader 不支持您所配置的编码:[%s]", encoding)); + } + originalConfig.set(Key.ENCODING, encoding); + // 处理 range 的配置 + String startRowkey = originalConfig.getString(Constant.RANGE + "." 
+ Key.START_ROWKEY); + + //此处判断需要谨慎:如果有 key range.startRowkey 但是没有值,得到的 startRowkey 是空字符串,而不是 null + if (startRowkey != null && startRowkey.length() != 0) { + originalConfig.set(Key.START_ROWKEY, startRowkey); + } + + String endRowkey = originalConfig.getString(Constant.RANGE + "." + Key.END_ROWKEY); + //此处判断需要谨慎:如果有 key range.endRowkey 但是没有值,得到的 endRowkey 是空字符串,而不是 null + if (endRowkey != null && endRowkey.length() != 0) { + originalConfig.set(Key.END_ROWKEY, endRowkey); + } + Boolean isBinaryRowkey = originalConfig.getBool(Constant.RANGE + "." + Key.IS_BINARY_ROWKEY,false); + originalConfig.set(Key.IS_BINARY_ROWKEY, isBinaryRowkey); + + //scan cache + int scanCacheSize = originalConfig.getInt(Key.SCAN_CACHE_SIZE,Constant.DEFAULT_SCAN_CACHE_SIZE); + originalConfig.set(Key.SCAN_CACHE_SIZE,scanCacheSize); + + int scanBatchSize = originalConfig.getInt(Key.SCAN_BATCH_SIZE,Constant.DEFAULT_SCAN_BATCH_SIZE); + originalConfig.set(Key.SCAN_BATCH_SIZE,scanBatchSize); + } + + private static String validateMode(com.alibaba.datax.common.util.Configuration originalConfig) { + String mode = originalConfig.getNecessaryValue(Key.MODE,Hbase11xReaderErrorCode.REQUIRED_VALUE); + List column = originalConfig.getList(Key.COLUMN, Map.class); + if (column == null || column.isEmpty()) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.REQUIRED_VALUE, "您配置的column为空,Hbase必须配置 column,其形式为:column:[{\"name\": \"cf0:column0\",\"type\": \"string\"},{\"name\": \"cf1:column1\",\"type\": \"long\"}]"); + } + ModeType modeType = ModeType.getByTypeName(mode); + switch (modeType) { + case Normal: { + // normal 模式不需要配置 maxVersion,需要配置 column,并且 column 格式为 Map 风格 + String maxVersion = originalConfig.getString(Key.MAX_VERSION); + Validate.isTrue(maxVersion == null, "您配置的是 normal 模式读取 hbase 中的数据,所以不能配置无关项:maxVersion"); + // 通过 parse 进行 column 格式的进一步检查 + Hbase11xHelper.parseColumnOfNormalMode(column); + break; + } + case MultiVersionFixedColumn:{ + // multiVersionFixedColumn 模式需要配置 maxVersion + checkMaxVersion(originalConfig, mode); + + Hbase11xHelper.parseColumnOfMultiversionMode(column); + break; + } + default: + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, + String.format("HbaseReader不支持该 mode 类型:%s", mode)); + } + return mode; + } + + // 检查 maxVersion 是否存在,并且值是否合法 + private static void checkMaxVersion(Configuration configuration, String mode) { + Integer maxVersion = configuration.getInt(Key.MAX_VERSION); + Validate.notNull(maxVersion, String.format("您配置的是 %s 模式读取 hbase 中的数据,所以必须配置:maxVersion", mode)); + boolean isMaxVersionValid = maxVersion == -1 || maxVersion > 1; + Validate.isTrue(isMaxVersionValid, String.format("您配置的是 %s 模式读取 hbase 中的数据,但是配置的 maxVersion 值错误. 
maxVersion规定:-1为读取全部版本,不能配置为0或者1(因为0或者1,我们认为用户是想用 normal 模式读取数据,而非 %s 模式读取,二者差别大),大于1则表示读取最新的对应个数的版本", mode, mode)); + } +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xReader.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xReader.java new file mode 100644 index 0000000000..b57478a184 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xReader.java @@ -0,0 +1,107 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +/** + * Hbase11xReader + * Created by shf on 16/3/7. + */ +public class Hbase11xReader extends Reader { + public static class Job extends Reader.Job { + private Configuration originConfig = null; + + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + Hbase11xHelper.validateParameter(this.originConfig); + } + + @Override + public List split(int adviceNumber) { + return Hbase11xHelper.split(this.originConfig); + } + + + @Override + public void destroy() { + + } + + } + public static class Task extends Reader.Task { + private Configuration taskConfig; + private static Logger LOG = LoggerFactory.getLogger(Task.class); + private HbaseAbstractTask hbaseTaskProxy; + @Override + public void init() { + this.taskConfig = super.getPluginJobConf(); + String mode = this.taskConfig.getString(Key.MODE); + ModeType modeType = ModeType.getByTypeName(mode); + + switch (modeType) { + case Normal: + this.hbaseTaskProxy = new NormalTask(this.taskConfig); + break; + case MultiVersionFixedColumn: + this.hbaseTaskProxy = new MultiVersionFixedColumnTask(this.taskConfig); + break; + default: + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持此类模式:" + modeType); + } + } + + @Override + public void prepare() { + try { + this.hbaseTaskProxy.prepare(); + } catch (Exception e) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.PREPAR_READ_ERROR, e); + } + } + + @Override + public void startRead(RecordSender recordSender) { + Record record = recordSender.createRecord(); + boolean fetchOK; + while (true) { + try { + fetchOK = this.hbaseTaskProxy.fetchLine(record); + } catch (Exception e) { + LOG.info("Exception", e); + super.getTaskPluginCollector().collectDirtyRecord(record, e); + record = recordSender.createRecord(); + continue; + } + if (fetchOK) { + recordSender.sendToWriter(record); + record = recordSender.createRecord(); + } else { + break; + } + } + recordSender.flush(); + } + + @Override + public void post() { + super.post(); + } + + @Override + public void destroy() { + if (this.hbaseTaskProxy != null) { + this.hbaseTaskProxy.close(); + } + } + } + +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xReaderErrorCode.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xReaderErrorCode.java new file mode 100644 index 0000000000..609b6b8439 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Hbase11xReaderErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + 
+import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by shf on 16/3/8. + */ +public enum Hbase11xReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("Hbase11xReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("Hbase11xReader-01", "您填写的参数值不合法."), + PREPAR_READ_ERROR("HbaseReader-02", "准备读取 Hbase 时出错."), + SPLIT_ERROR("HbaseReader-03", "切分 Hbase 表时出错."), + GET_HBASE_CONNECTION_ERROR("HbaseReader-04", "获取Hbase连接时出错."), + GET_HBASE_TABLE_ERROR("HbaseReader-05", "初始化 Hbase 抽取表时出错."), + GET_HBASE_REGINLOCTOR_ERROR("HbaseReader-06", "获取 Hbase RegionLocator时出错."), + CLOSE_HBASE_CONNECTION_ERROR("HbaseReader-07", "关闭Hbase连接时出错."), + CLOSE_HBASE_TABLE_ERROR("HbaseReader-08", "关闭Hbase 抽取表时出错."), + CLOSE_HBASE_REGINLOCTOR_ERROR("HbaseReader-09", "关闭 Hbase RegionLocator时出错."), + CLOSE_HBASE_ADMIN_ERROR("HbaseReader-10", "关闭 Hbase admin时出错.") + ; + + private final String code; + private final String description; + + private Hbase11xReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/HbaseAbstractTask.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/HbaseAbstractTask.java new file mode 100755 index 0000000000..c32343a935 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/HbaseAbstractTask.java @@ -0,0 +1,154 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang.ArrayUtils; +import org.apache.commons.lang3.time.DateUtils; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.ResultScanner; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.client.Table; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; + +public abstract class HbaseAbstractTask { + private final static Logger LOG = LoggerFactory.getLogger(HbaseAbstractTask.class); + + private byte[] startKey = null; + private byte[] endKey = null; + + protected Table htable; + protected String encoding; + protected int scanCacheSize; + protected int scanBatchSize; + + protected Result lastResult = null; + protected Scan scan; + protected ResultScanner resultScanner; + + + public HbaseAbstractTask(com.alibaba.datax.common.util.Configuration configuration) { + + this.htable = Hbase11xHelper.getTable(configuration); + + this.encoding = configuration.getString(Key.ENCODING,Constant.DEFAULT_ENCODING); + this.startKey = Hbase11xHelper.convertInnerStartRowkey(configuration); + this.endKey = Hbase11xHelper.convertInnerEndRowkey(configuration); + this.scanCacheSize = configuration.getInt(Key.SCAN_CACHE_SIZE,Constant.DEFAULT_SCAN_CACHE_SIZE); + this.scanBatchSize = configuration.getInt(Key.SCAN_BATCH_SIZE,Constant.DEFAULT_SCAN_BATCH_SIZE); + } + + public abstract boolean fetchLine(Record record) throws Exception; + + //不同模式设置不同,如多版本模式需要设置版本 + public abstract void initScan(Scan scan); + + + public void prepare() throws Exception { + this.scan = new Scan(); + 
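+        // One Scan per task, bounded by this task's [startKey, endKey); caching, batch size and block-cache usage are tuned below before the scanner is opened.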
this.scan.setSmall(false); + this.scan.setStartRow(startKey); + this.scan.setStopRow(endKey); + LOG.info("The task set startRowkey=[{}], endRowkey=[{}].", Bytes.toStringBinary(this.startKey), Bytes.toStringBinary(this.endKey)); + //scan的Caching Batch全部留在hconfig中每次从服务器端读取的行数,设置默认值未256 + this.scan.setCaching(this.scanCacheSize); + //设置获取记录的列个数,hbase默认无限制,也就是返回所有的列,这里默认是100 + this.scan.setBatch(this.scanBatchSize); + //为是否缓存块,hbase默认缓存,同步全部数据时非热点数据,因此不需要缓存 + this.scan.setCacheBlocks(false); + initScan(this.scan); + + this.resultScanner = this.htable.getScanner(this.scan); + } + + public void close() { + Hbase11xHelper.closeResultScanner(this.resultScanner); + Hbase11xHelper.closeTable(this.htable); + } + + protected Result getNextHbaseRow() throws IOException { + Result result; + try { + result = resultScanner.next(); + } catch (IOException e) { + if (lastResult != null) { + this.scan.setStartRow(lastResult.getRow()); + } + resultScanner = this.htable.getScanner(scan); + result = resultScanner.next(); + if (lastResult != null && Bytes.equals(lastResult.getRow(), result.getRow())) { + result = resultScanner.next(); + } + } + lastResult = result; + // may be null + return result; + } + + public Column convertBytesToAssignType(ColumnType columnType, byte[] byteArray,String dateformat) throws Exception { + Column column; + switch (columnType) { + case BOOLEAN: + column = new BoolColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toBoolean(byteArray)); + break; + case SHORT: + column = new LongColumn(ArrayUtils.isEmpty(byteArray) ? null : String.valueOf(Bytes.toShort(byteArray))); + break; + case INT: + column = new LongColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toInt(byteArray)); + break; + case LONG: + column = new LongColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toLong(byteArray)); + break; + case FLOAT: + column = new DoubleColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toFloat(byteArray)); + break; + case DOUBLE: + column = new DoubleColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toDouble(byteArray)); + break; + case STRING: + column = new StringColumn(ArrayUtils.isEmpty(byteArray) ? null : new String(byteArray, encoding)); + break; + case BINARY_STRING: + column = new StringColumn(ArrayUtils.isEmpty(byteArray) ? null : Bytes.toStringBinary(byteArray)); + break; + case DATE: + String dateValue = Bytes.toStringBinary(byteArray); + column = new DateColumn(ArrayUtils.isEmpty(byteArray) ? 
null : DateUtils.parseDate(dateValue, new String[]{dateformat})); + break; + default: + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持您配置的列类型:" + columnType); + } + return column; + } + + public Column convertValueToAssignType(ColumnType columnType, String constantValue,String dateformat) throws Exception { + Column column; + switch (columnType) { + case BOOLEAN: + column = new BoolColumn(constantValue); + break; + case SHORT: + case INT: + case LONG: + column = new LongColumn(constantValue); + break; + case FLOAT: + case DOUBLE: + column = new DoubleColumn(constantValue); + break; + case STRING: + column = new StringColumn(constantValue); + break; + case DATE: + column = new DateColumn(DateUtils.parseDate(constantValue, new String[]{dateformat})); + break; + default: + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, "Hbasereader 常量列不支持您配置的列类型:" + columnType); + } + return column; + } +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/HbaseColumnCell.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/HbaseColumnCell.java new file mode 100755 index 0000000000..aba1c6fd3b --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/HbaseColumnCell.java @@ -0,0 +1,122 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.base.BaseObject; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.hbase.util.Bytes; + +/** + * 描述 hbasereader 插件中,column 配置中的一个单元项实体 + */ +public class HbaseColumnCell extends BaseObject { + private ColumnType columnType; + + // columnName 格式为:列族:列名 + private String columnName; + + private byte[] columnFamily; + private byte[] qualifier; + + //对于常量类型,其常量值放到 columnValue 里 + private String columnValue; + + //当配置了 columnValue 时,isConstant=true(这个成员变量是用于方便使用本类的地方判断是否是常量类型字段) + private boolean isConstant; + + // 只在类型是时间类型时,才会设置该值,无默认值。形式如:yyyy-MM-dd HH:mm:ss + private String dateformat; + + private HbaseColumnCell(Builder builder) { + this.columnType = builder.columnType; + + //columnName 和 columnValue 必须有一个为 null + Validate.isTrue(builder.columnName == null || builder.columnValue == null, "Hbasereader 中,column 不能同时配置 列名称 和 列值,二者选其一."); + + //columnName 和 columnValue 不能都为 null + Validate.isTrue(builder.columnName != null || builder.columnValue != null, "Hbasereader 中,column 需要配置 列名称 或者 列值, 二者选其一."); + + if (builder.columnName != null) { + this.isConstant = false; + this.columnName = builder.columnName; + // 如果 columnName 不是 rowkey,则必须配置为:列族:列名 格式 + if (!Hbase11xHelper.isRowkeyColumn(this.columnName)) { + + String promptInfo = "Hbasereader 中,column 的列配置格式应该是:列族:列名. 
您配置的列错误:" + this.columnName; + String[] cfAndQualifier = this.columnName.split(":"); + Validate.isTrue(cfAndQualifier != null && cfAndQualifier.length == 2 + && StringUtils.isNotBlank(cfAndQualifier[0]) + && StringUtils.isNotBlank(cfAndQualifier[1]), promptInfo); + + this.columnFamily = Bytes.toBytes(cfAndQualifier[0].trim()); + this.qualifier = Bytes.toBytes(cfAndQualifier[1].trim()); + } + } else { + this.isConstant = true; + this.columnValue = builder.columnValue; + } + + if (builder.dateformat != null) { + this.dateformat = builder.dateformat; + } + } + + public ColumnType getColumnType() { + return columnType; + } + + public String getColumnName() { + return columnName; + } + + public byte[] getColumnFamily() { + return columnFamily; + } + + public byte[] getQualifier() { + return qualifier; + } + + public String getDateformat() { + return dateformat; + } + + public String getColumnValue() { + return columnValue; + } + + public boolean isConstant() { + return isConstant; + } + + // 内部 builder 类 + public static class Builder { + private ColumnType columnType; + private String columnName; + private String columnValue; + + private String dateformat; + + public Builder(ColumnType columnType) { + this.columnType = columnType; + } + + public Builder columnName(String columnName) { + this.columnName = columnName; + return this; + } + + public Builder columnValue(String columnValue) { + this.columnValue = columnValue; + return this; + } + + public Builder dateformat(String dateformat) { + this.dateformat = dateformat; + return this; + } + + public HbaseColumnCell build() { + return new HbaseColumnCell(this); + } + } +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Key.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Key.java new file mode 100755 index 0000000000..800ab6564c --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/Key.java @@ -0,0 +1,51 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +public final class Key { + + public final static String HBASE_CONFIG = "hbaseConfig"; + + public final static String TABLE = "table"; + + /** + * mode 可以取 normal 或者 multiVersionFixedColumn 或者 multiVersionDynamicColumn 三个值,无默认值。 + *

+ * normal 配合 column(Map 结构的)使用 + */ + public final static String MODE = "mode"; + + /** + * 配合 mode = multiVersion 时使用,指明需要读取的版本个数。无默认值 + * -1 表示去读全部版本 + * 不能为0,1 + * >1 表示最多读取对应个数的版本数(不能超过 Integer 的最大值) + */ + public final static String MAX_VERSION = "maxVersion"; + + /** + * 默认为 utf8 + */ + public final static String ENCODING = "encoding"; + + public final static String COLUMN = "column"; + + public final static String COLUMN_FAMILY = "columnFamily"; + + public static final String NAME = "name"; + + public static final String TYPE = "type"; + + public static final String FORMAT = "format"; + + public static final String VALUE = "value"; + + public final static String START_ROWKEY = "startRowkey"; + + public final static String END_ROWKEY = "endRowkey"; + + public final static String IS_BINARY_ROWKEY = "isBinaryRowkey"; + + public final static String SCAN_CACHE_SIZE = "scanCacheSize"; + + public final static String SCAN_BATCH_SIZE = "scanBatchSize"; + +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/ModeType.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/ModeType.java new file mode 100644 index 0000000000..bdfb5c040c --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/ModeType.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum ModeType { + Normal("normal"), + MultiVersionFixedColumn("multiVersionFixedColumn") + ; + + private String mode; + + + ModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public String getMode() { + return mode; + } + + public static ModeType getByTypeName(String modeName) { + for (ModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + throw DataXException.asDataXException(Hbase11xReaderErrorCode.ILLEGAL_VALUE, + String.format("HbaseReader 不支持该 mode 类型:%s, 目前支持的 mode 类型是:%s", modeName, Arrays.asList(values()))); + } +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionDynamicColumnTask.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionDynamicColumnTask.java new file mode 100644 index 0000000000..1d824a06f7 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionDynamicColumnTask.java @@ -0,0 +1,26 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.List; + +public class MultiVersionDynamicColumnTask extends MultiVersionTask { + private List columnFamilies = null; + + public MultiVersionDynamicColumnTask(Configuration configuration){ + super(configuration); + + this.columnFamilies = configuration.getList(Key.COLUMN_FAMILY, String.class); + } + + @Override + public void initScan(Scan scan) { + for (String columnFamily : columnFamilies) { + scan.addFamily(Bytes.toBytes(columnFamily.trim())); + } + + super.setMaxVersions(scan); + } +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionFixedColumnTask.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionFixedColumnTask.java new file mode 100644 index 0000000000..084aedfa8d --- /dev/null +++ 
b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionFixedColumnTask.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.List; +import java.util.Map; + +public class MultiVersionFixedColumnTask extends MultiVersionTask { + + public MultiVersionFixedColumnTask(Configuration configuration) { + super(configuration); + } + + @Override + public void initScan(Scan scan) { + for (Map aColumn : column) { + String columnName = aColumn.get(Key.NAME); + if(!Hbase11xHelper.isRowkeyColumn(columnName)){ + String[] cfAndQualifier = columnName.split(":"); + scan.addColumn(Bytes.toBytes(cfAndQualifier[0].trim()), Bytes.toBytes(cfAndQualifier[1].trim())); + } + } + super.setMaxVersions(scan); + } +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionTask.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionTask.java new file mode 100755 index 0000000000..85fe432a83 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/MultiVersionTask.java @@ -0,0 +1,99 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.hbase.Cell; +import org.apache.hadoop.hbase.CellUtil; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.io.UnsupportedEncodingException; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +public abstract class MultiVersionTask extends HbaseAbstractTask { + private static byte[] COLON_BYTE; + + private int maxVersion; + private Cell cellArr[] = null; + + private int currentReadPosition = 0; + public List column; + private HashMap> familyQualifierMap = null; + + public MultiVersionTask(Configuration configuration) { + super(configuration); + this.maxVersion = configuration.getInt(Key.MAX_VERSION); + this.column = configuration.getList(Key.COLUMN, Map.class); + this.familyQualifierMap = Hbase11xHelper.parseColumnOfMultiversionMode(this.column); + + try { + MultiVersionTask.COLON_BYTE = ":".getBytes("utf8"); + } catch (UnsupportedEncodingException e) { + throw DataXException.asDataXException(Hbase11xReaderErrorCode.PREPAR_READ_ERROR, "系统内部获取 列族与列名冒号分隔符的二进制时失败.", e); + } + } + + @Override + public boolean fetchLine(Record record) throws Exception { + Result result; + if (this.cellArr == null || this.cellArr.length == this.currentReadPosition) { + result = super.getNextHbaseRow(); + if (result == null) { + return false; + } + super.lastResult = result; + + this.cellArr = result.rawCells(); + if(this.cellArr == null || this.cellArr.length ==0){ + return false; + } + this.currentReadPosition = 0; + } + try { + Cell cell = this.cellArr[this.currentReadPosition]; + + convertCellToLine(cell, record); + + } catch (Exception e) { + throw e; + } finally { + this.currentReadPosition++; + } + return true; + } + + private void convertCellToLine(Cell cell, Record record) throws Exception { + byte[] rawRowkey = CellUtil.cloneRow(cell); + long timestamp = cell.getTimestamp(); + byte[] cfAndQualifierName = 
Bytes.add(CellUtil.cloneFamily(cell), MultiVersionTask.COLON_BYTE, CellUtil.cloneQualifier(cell)); + byte[] columnValue = CellUtil.cloneValue(cell); + + ColumnType rawRowkeyType = ColumnType.getByTypeName(familyQualifierMap.get(Constant.ROWKEY_FLAG).get(Key.TYPE)); + String familyQualifier = new String(cfAndQualifierName, Constant.DEFAULT_ENCODING); + ColumnType columnValueType = ColumnType.getByTypeName(familyQualifierMap.get(familyQualifier).get(Key.TYPE)); + String columnValueFormat = familyQualifierMap.get(familyQualifier).get(Key.FORMAT); + if(StringUtils.isBlank(columnValueFormat)){ + columnValueFormat = Constant.DEFAULT_DATA_FORMAT; + } + + record.addColumn(convertBytesToAssignType(rawRowkeyType, rawRowkey, columnValueFormat)); + record.addColumn(convertBytesToAssignType(ColumnType.STRING, cfAndQualifierName, columnValueFormat)); + // 直接忽略了用户配置的 timestamp 的类型 + record.addColumn(new LongColumn(timestamp)); + record.addColumn(convertBytesToAssignType(columnValueType, columnValue, columnValueFormat)); + } + + public void setMaxVersions(Scan scan) { + if (this.maxVersion == -1 || this.maxVersion == Integer.MAX_VALUE) { + scan.setMaxVersions(); + } else { + scan.setMaxVersions(this.maxVersion); + } + } + +} diff --git a/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/NormalTask.java b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/NormalTask.java new file mode 100755 index 0000000000..ccb5c5f293 --- /dev/null +++ b/hbase11xreader/src/main/java/com/alibaba/datax/plugin/reader/hbase11xreader/NormalTask.java @@ -0,0 +1,88 @@ +package com.alibaba.datax.plugin.reader.hbase11xreader; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.client.Result; +import org.apache.hadoop.hbase.client.Scan; +import org.apache.hadoop.hbase.util.Bytes; + +import java.util.List; +import java.util.Map; + +public class NormalTask extends HbaseAbstractTask { + private List column; + private List hbaseColumnCells; + + public NormalTask(Configuration configuration) { + super(configuration); + this.column = configuration.getList(Key.COLUMN, Map.class); + this.hbaseColumnCells = Hbase11xHelper.parseColumnOfNormalMode(this.column); + } + + /** + * normal模式下将用户配置的column 设置到scan中 + */ + @Override + public void initScan(Scan scan) { + boolean isConstant; + boolean isRowkeyColumn; + for (HbaseColumnCell cell : this.hbaseColumnCells) { + isConstant = cell.isConstant(); + isRowkeyColumn = Hbase11xHelper.isRowkeyColumn(cell.getColumnName()); + if (!isConstant && !isRowkeyColumn) { + this.scan.addColumn(cell.getColumnFamily(), cell.getQualifier()); + } + } + } + + + @Override + public boolean fetchLine(Record record) throws Exception { + Result result = super.getNextHbaseRow(); + + if (null == result) { + return false; + } + super.lastResult = result; + + try { + byte[] hbaseColumnValue; + String columnName; + ColumnType columnType; + + byte[] columnFamily; + byte[] qualifier; + + for (HbaseColumnCell cell : this.hbaseColumnCells) { + columnType = cell.getColumnType(); + if (cell.isConstant()) { + // 对常量字段的处理 + String constantValue = cell.getColumnValue(); + + Column constantColumn = super.convertValueToAssignType(columnType,constantValue,cell.getDateformat()); + record.addColumn(constantColumn); + } else { + // 根据列名称获取值 + columnName = cell.getColumnName(); + if 
(Hbase11xHelper.isRowkeyColumn(columnName)) { + hbaseColumnValue = result.getRow(); + } else { + columnFamily = cell.getColumnFamily(); + qualifier = cell.getQualifier(); + hbaseColumnValue = result.getValue(columnFamily, qualifier); + } + + Column hbaseColumn = super.convertBytesToAssignType(columnType,hbaseColumnValue,cell.getDateformat()); + record.addColumn(hbaseColumn); + } + } + } catch (Exception e) { + // 注意,这里catch的异常,期望是byte数组转换失败的情况。而实际上,string的byte数组,转成整数类型是不容易报错的。但是转成double类型容易报错。 + record.setColumn(0, new StringColumn(Bytes.toStringBinary(result.getRow()))); + throw e; + } + return true; + } +} diff --git a/hbase11xreader/src/main/resources/plugin.json b/hbase11xreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..12d2f303ab --- /dev/null +++ b/hbase11xreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "hbase11xreader", + "class": "com.alibaba.datax.plugin.reader.hbase11xreader.Hbase11xReader", + "description": "useScene: prod. mechanism: Scan to read data.", + "developer": "alibaba" +} + diff --git a/hbase11xreader/src/main/resources/plugin_job_template.json b/hbase11xreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..1735ccc3ea --- /dev/null +++ b/hbase11xreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "hbase11xreader", + "parameter": { + "hbaseConfig": {}, + "table": "", + "encoding": "", + "mode": "", + "column": [], + "range": { + "startRowkey": "", + "endRowkey": "", + "isBinaryRowkey": true + } + } +} \ No newline at end of file diff --git a/hbase11xsqlwriter/doc/hbase11xsqlwriter.md b/hbase11xsqlwriter/doc/hbase11xsqlwriter.md new file mode 100644 index 0000000000..ab386d2812 --- /dev/null +++ b/hbase11xsqlwriter/doc/hbase11xsqlwriter.md @@ -0,0 +1,159 @@ +# HBase11xsqlwriter插件文档 + +## 1. 快速介绍 + +HBase11xsqlwriter实现了向hbase中的SQL表(phoenix)批量导入数据的功能。Phoenix因为对rowkey做了数据编码,所以,直接使用HBaseAPI进行写入会面临手工数据转换的问题,麻烦且易错。本插件提供了单间的SQL表的数据导入方式。 + +在底层实现上,通过Phoenix的JDBC驱动,执行UPSERT语句向hbase写入数据。 + +### 1.1 支持的功能 + +* 支持带索引的表的数据导入,可以同步更新所有的索引表 + + +### 1.2 限制 + +* 仅支持1.x系列的hbase +* 仅支持通过phoenix创建的表,不支持原生HBase表 +* 不支持带时间戳的数据导入 + +## 2. 实现原理 + +通过Phoenix的JDBC驱动,执行UPSERT语句向表中批量写入数据。因为使用上层接口,所以,可以同步更新索引表。 + +## 3. 
配置说明 + +### 3.1 配置样例 + +```json +{ + "job": { + "entry": { + "jvm": "-Xms2048m -Xmx2048m" + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": "/Users/shf/workplace/datax_test/hbase11xsqlwriter/txt/normal.txt", + "charset": "UTF-8", + "column": [ + { + "index": 0, + "type": "String" + }, + { + "index": 1, + "type": "string" + }, + { + "index": 2, + "type": "string" + }, + { + "index": 3, + "type": "string" + } + ], + "fieldDelimiter": "," + } + }, + "writer": { + "name": "hbase11xsqlwriter", + "parameter": { + "batchSize": "256", + "column": [ + "UID", + "TS", + "EVENTID", + "CONTENT" + ], + "hbaseConfig": { + "hbase.zookeeper.quorum": "目标hbase集群的ZK服务器地址,向PE咨询", + "zookeeper.znode.parent": "目标hbase集群的znode,向PE咨询" + }, + "nullMode": "skip", + "table": "目标hbase表名,大小写有关" + } + } + } + ], + "setting": { + "speed": { + "channel": 5 + } + } + } +} +``` + + +### 3.2 参数说明 + +* **name** + + * 描述:插件名字,必须是`hbase11xsqlwriter` + * 必选:是 + * 默认值:无 + +* **table** + + * 描述:要导入的表名,大小写敏感,通常phoenix表都是**大写**表名 + * 必选:是 + * 默认值:无 + +* **column** + + * 描述:列名,大小写敏感,通常phoenix的列名都是**大写**。 + * 需要注意列的顺序,必须与reader输出的列的顺序一一对应。 + * 不需要填写数据类型,会自动从phoenix获取列的元数据 + * 必选:是 + * 默认值:无 + +* **hbaseConfig** + + * 描述:hbase集群地址,zk为必填项,格式:ip1,ip2,ip3,注意,多个IP之间使用英文的逗号分隔。znode是可选的,默认值是/hbase + * 必选:是 + * 默认值:无 + +* **batchSize** + + * 描述:批量写入的最大行数 + * 必选:否 + * 默认值:256 + +* **nullMode** + + * 描述:读取到的列值为null时,如何处理。目前有两种方式: + * skip:跳过这一列,即不插入这一列(如果该行的这一列之前已经存在,则会被删除) + * empty:插入空值,值类型的空值是0,varchar的空值是空字符串 + * 必选:否 + * 默认值:skip + +## 4. 性能报告 + +无 + +## 5. 约束限制 + +writer中的列的定义顺序必须与reader的列顺序匹配。reader中的列顺序定义了输出的每一行中,列的组织顺序。而writer的列顺序,定义的是在收到的数据中,writer期待的列的顺序。例如: + +reader的列顺序是: c1, c2, c3, c4 + +writer的列顺序是: x1, x2, x3, x4 + +则reader输出的列c1就会赋值给writer的列x1。如果writer的列顺序是x1, x2, x4, x3,则c3会赋值给x4,c4会赋值给x3. + + +## 6. FAQ + +1. 并发开多少合适?速度慢时增加并发有用吗? + 数据导入进程默认JVM的堆大小是2GB,并发(channel数)是通过多线程实现的,开过多的线程有时并不能提高导入速度,反而可能因为过于频繁的GC导致性能下降。一般建议并发数(channel)为5-10. + +2. batchSize设置多少比较合适? 
+默认是256,但应根据每行的大小来计算最合适的batchSize。通常一次操作的数据量在2MB-4MB左右,用这个值除以行大小,即可得到batchSize。 + + + + diff --git a/hbase11xsqlwriter/pom.xml b/hbase11xsqlwriter/pom.xml new file mode 100644 index 0000000000..0b8a2d51c4 --- /dev/null +++ b/hbase11xsqlwriter/pom.xml @@ -0,0 +1,127 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hbase11xsqlwriter + hbase11xsqlwriter + 0.0.1-SNAPSHOT + jar + + + 4.11.0-HBase-1.1 + 2.7.1 + 1.8 + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.apache.hadoop + hadoop-hdfs + ${hadoop.version} + + + org.apache.hadoop + hadoop-common + ${hadoop.version} + + + org.apache.phoenix + phoenix-core + ${phoenix.version} + + + com.google.guava + guava + 12.0.1 + + + commons-codec + commons-codec + ${commons-codec.version} + + + + + junit + junit + test + + + com.alibaba.datax + datax-core + ${datax-project-version} + + + com.alibaba.datax + datax-service-face + + + test + + + org.mockito + mockito-all + 1.9.5 + test + + + + + + + src/main/java + + **/*.properties + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/hbase11xsqlwriter/src/main/assembly/package.xml b/hbase11xsqlwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..e8780b9b01 --- /dev/null +++ b/hbase11xsqlwriter/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + + plugin/writer/hbase11xsqlwriter + + + target/ + + hbase11xsqlwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/hbase11xsqlwriter + + + + + + false + plugin/writer/hbase11xsqlwriter/libs + runtime + + + diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/Constant.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/Constant.java new file mode 100755 index 0000000000..d45d30e1c8 --- /dev/null +++ b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/Constant.java @@ -0,0 +1,21 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +public final class Constant { + public static final String DEFAULT_ENCODING = "UTF-8"; + public static final String DEFAULT_DATA_FORMAT = "yyyy-MM-dd HH:mm:ss"; + public static final String DEFAULT_NULL_MODE = "skip"; + public static final String DEFAULT_ZNODE = "/hbase"; + public static final boolean DEFAULT_LAST_COLUMN_IS_VERSION = false; // 默认最后一列不是version列 + public static final int DEFAULT_BATCH_ROW_COUNT = 256; // 默认一次写256行 + public static final boolean DEFAULT_TRUNCATE = false; // 默认开始的时候不清空表 + + public static final int TYPE_UNSIGNED_TINYINT = 11; + public static final int TYPE_UNSIGNED_SMALLINT = 13; + public static final int TYPE_UNSIGNED_INTEGER = 9; + public static final int TYPE_UNSIGNED_LONG = 10; + public static final int TYPE_UNSIGNED_FLOAT = 14; + public static final int TYPE_UNSIGNED_DOUBLE = 15; + public static final int TYPE_UNSIGNED_DATE = 19; + public static final int TYPE_UNSIGNED_TIME = 18; + public static final int TYPE_UNSIGNED_TIMESTAMP = 20; +} diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLHelper.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLHelper.java new file mode 100644 index 0000000000..6146ac8d8d --- /dev/null +++ 
b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLHelper.java @@ -0,0 +1,198 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.TypeReference; +import org.apache.hadoop.hbase.TableName; +import org.apache.hadoop.hbase.client.Admin; +import org.apache.hadoop.hbase.util.Pair; +import org.apache.phoenix.jdbc.PhoenixConnection; +import org.apache.phoenix.schema.ColumnNotFoundException; +import org.apache.phoenix.schema.MetaDataClient; +import org.apache.phoenix.schema.PTable; +import org.apache.phoenix.util.SchemaUtil; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.sql.Connection; +import java.sql.DriverManager; +import java.sql.SQLException; +import java.util.List; +import java.util.Map; + +/** + * @author yanghan.y + */ +public class HbaseSQLHelper { + private static final Logger LOG = LoggerFactory.getLogger(HbaseSQLHelper.class); + + /** + * 将datax的配置解析成sql writer的配置 + */ + public static HbaseSQLWriterConfig parseConfig(Configuration cfg) { + return HbaseSQLWriterConfig.parse(cfg); + } + + /** + * 将hbase config字符串解析成zk quorum和znode。 + * 因为hbase使用的配置名称 xxx.xxxx.xxx会被{@link Configuration#from(String)}识别成json路径, + * 而不是一个完整的配置项,所以,hbase的配置必须通过直接调用json API进行解析。 + * @param hbaseCfgString 配置中{@link Key#HBASE_CONFIG}的值 + * @return 返回2个string,第一个是zk quorum,第二个是znode + */ + public static Pair getHbaseConfig(String hbaseCfgString) { + assert hbaseCfgString != null; + Map hbaseConfigMap = JSON.parseObject(hbaseCfgString, new TypeReference>() {}); + String zkQuorum = hbaseConfigMap.get(Key.HBASE_ZK_QUORUM); + String znode = hbaseConfigMap.get(Key.HBASE_ZNODE_PARENT); + if (znode == null || znode.isEmpty()) { + znode = Constant.DEFAULT_ZNODE; + } + return new Pair(zkQuorum, znode); + } + + /** + * 校验配置 + */ + public static void validateConfig(HbaseSQLWriterConfig cfg) { + // 校验集群地址:尝试连接,连不上就说明有问题,抛错退出 + Connection conn = getJdbcConnection(cfg); + + // 检查表:存在,可用 + checkTable(conn, cfg.getTableName()); + + // 校验元数据:配置中给出的列必须是目的表中已经存在的列 + PTable schema = null; + try { + schema = getTableSchema(conn, cfg.getTableName()); + } catch (SQLException e) { + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.GET_HBASE_CONNECTION_ERROR, + "无法获取目的表" + cfg.getTableName() + "的元数据信息,表可能不是SQL表或表名配置错误,请检查您的配置 或者 联系 HBase 管理员.", e); + } + + try { + List columnNames = cfg.getColumns(); + for (String colName : columnNames) { + schema.getColumnForColumnName(colName); + } + } catch (ColumnNotFoundException e) { + // 用户配置的列名在元数据中不存在 + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "您配置的列" + e.getColumnName() + "在目的表" + cfg.getTableName() + "的元数据中不存在,请检查您的配置 或者 联系 HBase 管理员.", e); + } catch (SQLException e) { + // 列名有二义性或者其他问题 + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "目的表" + cfg.getTableName() + "的列信息校验失败,请检查您的配置 或者 联系 HBase 管理员.", e); + } + } + + /** + * 获取JDBC连接,轻量级连接,使用完后必须显式close + */ + public static Connection getJdbcConnection(HbaseSQLWriterConfig cfg) { + String connStr = cfg.getConnectionString(); + LOG.debug("Connecting to HBase cluster [" + connStr + "] ..."); + Connection conn; + try { + Class.forName("org.apache.phoenix.jdbc.PhoenixDriver"); + conn = DriverManager.getConnection(connStr); + conn.setAutoCommit(false); + } catch (Throwable e) { + throw 
DataXException.asDataXException(HbaseSQLWriterErrorCode.GET_HBASE_CONNECTION_ERROR, + "无法连接hbase集群,配置不正确或目标集群不可用,请检查配置和集群状态 或者 联系 HBase 管理员.", e); + } + LOG.debug("Connected to HBase cluster successfully."); + return conn; + } + + /** + * 获取一张表的元数据信息 + * @param conn hbsae sql的jdbc连接 + * @param fullTableName 目标表的完整表名 + * @return 表的元数据 + */ + public static PTable getTableSchema(Connection conn, String fullTableName) throws SQLException { + PhoenixConnection hconn = conn.unwrap(PhoenixConnection.class); + MetaDataClient mdc = new MetaDataClient(hconn); + String schemaName = SchemaUtil.getSchemaNameFromFullName(fullTableName); + String tableName = SchemaUtil.getTableNameFromFullName(fullTableName); + return mdc.updateCache(schemaName, tableName).getTable(); + } + + /** + * 清空表 + */ + public static void truncateTable(Connection conn, String tableName) { + PhoenixConnection sqlConn = null; + Admin admin = null; + try { + sqlConn = conn.unwrap(PhoenixConnection.class); + admin = sqlConn.getQueryServices().getAdmin(); + TableName hTableName = TableName.valueOf(tableName); + // 确保表存在、可用 + checkTable(admin, hTableName); + // 清空表 + admin.disableTable(hTableName); + admin.truncateTable(hTableName, true); + LOG.debug("Table " + tableName + " has been truncated."); + } catch (Throwable t) { + // 清空表失败 + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.TRUNCATE_HBASE_ERROR, + "清空目的表" + tableName + "失败,请联系 HBase 管理员.", t); + } finally { + if (admin != null) { + closeAdmin(admin); + } + } + } + + /** + * 检查表:表要存在,enabled + */ + public static void checkTable(Connection conn, String tableName) throws DataXException { + PhoenixConnection sqlConn = null; + Admin admin = null; + try { + sqlConn = conn.unwrap(PhoenixConnection.class); + admin = sqlConn.getQueryServices().getAdmin(); + TableName hTableName = TableName.valueOf(tableName); + checkTable(admin, hTableName); + } catch (SQLException t) { + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.TRUNCATE_HBASE_ERROR, + "表" + tableName + "状态检查未通过,请检查您的集群和表状态 或者 联系 Hbase 管理员.", t); + } catch (IOException t) { + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.TRUNCATE_HBASE_ERROR, + "表" + tableName + "状态检查未通过,请检查您的集群和表状态 或者 联系 Hbase 管理员.", t); + } finally { + if (admin != null) { + closeAdmin(admin); + } + } + } + + private static void checkTable(Admin admin, TableName tableName) throws IOException { + if(!admin.tableExists(tableName)){ + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "HBase目的表" + tableName.toString() + "不存在, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(!admin.isTableAvailable(tableName)){ + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "HBase目的表" + tableName.toString() + "不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(admin.isTableDisabled(tableName)){ + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "HBase目的表" + tableName.toString() + "不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + } + + private static void closeAdmin(Admin admin){ + try { + if(null != admin) + admin.close(); + } catch (IOException e) { + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.CLOSE_HBASE_AMIN_ERROR, e); + } + } +} diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriter.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriter.java new file mode 100644 index 0000000000..7091154c26 --- /dev/null +++ 
b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriter.java @@ -0,0 +1,72 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; + +import java.sql.Connection; +import java.util.ArrayList; +import java.util.List; + +/** + * @author yanghan.y + */ +public class HbaseSQLWriter extends Writer { + public static class Job extends Writer.Job { + private HbaseSQLWriterConfig config; + + @Override + public void init() { + // 解析配置 + config = HbaseSQLHelper.parseConfig(this.getPluginJobConf()); + + // 校验配置,会访问集群来检查表 + HbaseSQLHelper.validateConfig(config); + } + + @Override + public void prepare() { + // 写之前是否要清空目标表,默认不清空 + if(config.truncate()) { + Connection conn = HbaseSQLHelper.getJdbcConnection(config); + HbaseSQLHelper.truncateTable(conn, config.getTableName()); + } + } + + @Override + public List split(int mandatoryNumber) { + List splitResultConfigs = new ArrayList(); + for (int j = 0; j < mandatoryNumber; j++) { + splitResultConfigs.add(config.getOriginalConfig().clone()); + } + return splitResultConfigs; + } + + @Override + public void destroy() { + // NOOP + } + } + + public static class Task extends Writer.Task { + private Configuration taskConfig; + private HbaseSQLWriterTask hbaseSQLWriterTask; + + @Override + public void init() { + this.taskConfig = super.getPluginJobConf(); + this.hbaseSQLWriterTask = new HbaseSQLWriterTask(this.taskConfig); + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + this.hbaseSQLWriterTask.startWriter(lineReceiver, super.getTaskPluginCollector()); + } + + + @Override + public void destroy() { + // hbaseSQLTask不需要close + } + } +} diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterConfig.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterConfig.java new file mode 100644 index 0000000000..ce8561fe50 --- /dev/null +++ b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterConfig.java @@ -0,0 +1,209 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.hbase.TableName; +import org.apache.hadoop.hbase.util.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +/** + * HBase SQL writer config + * + * @author yanghan.y + */ +public class HbaseSQLWriterConfig { + private final static Logger LOG = LoggerFactory.getLogger(HbaseSQLWriterConfig.class); + private Configuration originalConfig; // 原始的配置数据 + + // 集群配置 + private String connectionString; + + // 表配置 + private String tableName; + private List columns; // 目的表的所有列的列名,包括主键和非主键,不包括时间列 + + // 其他配置 + private NullModeType nullMode; + private int batchSize; // 一次批量写入多少行 + private boolean truncate; // 导入开始前是否要清空目的表 + + /** + * @return 获取原始的datax配置 + */ + public Configuration getOriginalConfig() { + return originalConfig; + } + + /** + * @return 获取连接字符串,使用ZK模式 + */ + public String getConnectionString() { + return connectionString; + } + + /** + * @return 获取表名 + */ + public String getTableName() { + return tableName; + } + + /** + * @return 返回所有的列,包括主键列和非主键列,但不包括version列 + */ + public List getColumns() { + return 
columns; + } + + /** + * + * @return + */ + public NullModeType getNullMode() { + return nullMode; + } + + /** + * @return 批量写入的最大行数 + */ + public int getBatchSize() { + return batchSize; + } + + /** + * @return 在writer初始化的时候是否要清空目标表 + */ + public boolean truncate() { + return truncate; + } + + /** + * @param dataxCfg + * @return + */ + public static HbaseSQLWriterConfig parse(Configuration dataxCfg) { + assert dataxCfg != null; + HbaseSQLWriterConfig cfg = new HbaseSQLWriterConfig(); + cfg.originalConfig = dataxCfg; + + // 1. 解析集群配置 + parseClusterConfig(cfg, dataxCfg); + + // 2. 解析列配置 + parseTableConfig(cfg, dataxCfg); + + // 3. 解析其他配置 + cfg.nullMode = NullModeType.getByTypeName(dataxCfg.getString(Key.NULL_MODE, Constant.DEFAULT_NULL_MODE)); + cfg.batchSize = dataxCfg.getInt(Key.BATCH_SIZE, Constant.DEFAULT_BATCH_ROW_COUNT); + cfg.truncate = dataxCfg.getBool(Key.TRUNCATE, Constant.DEFAULT_TRUNCATE); + + // 4. 打印解析出来的配置 + LOG.info("HBase SQL writer config parsed:" + cfg.toString()); + + return cfg; + } + + private static void parseClusterConfig(HbaseSQLWriterConfig cfg, Configuration dataxCfg) { + // 获取hbase集群的连接信息字符串 + String hbaseCfg = dataxCfg.getString(Key.HBASE_CONFIG); + if (StringUtils.isBlank(hbaseCfg)) { + // 集群配置必须存在且不为空 + throw DataXException.asDataXException( + HbaseSQLWriterErrorCode.REQUIRED_VALUE, + "读 Hbase 时需要配置hbaseConfig,其内容为 Hbase 连接信息,请联系 Hbase PE 获取该信息."); + } + + // 解析zk服务器和znode信息 + Pair zkCfg; + try { + zkCfg = HbaseSQLHelper.getHbaseConfig(hbaseCfg); + } catch (Throwable t) { + // 解析hbase配置错误 + throw DataXException.asDataXException( + HbaseSQLWriterErrorCode.REQUIRED_VALUE, + "解析hbaseConfig出错,请确认您配置的hbaseConfig为合法的json数据格式,内容正确."); + } + String zkQuorum = zkCfg.getFirst(); + String znode = zkCfg.getSecond(); + if (zkQuorum == null || zkQuorum.isEmpty()) { + throw DataXException.asDataXException( + HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "HBase的hbase.zookeeper.quorum配置不能为空,请联系HBase PE获取该信息."); + } + if (znode == null || znode.isEmpty()) { + throw DataXException.asDataXException( + HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "HBase的zookeeper.znode.parent配置不能为空,请联系HBase PE获取该信息."); + } + + // 生成sql使用的连接字符串, 格式: jdbc:phoenix:zk_quorum:2181:/znode_parent + cfg.connectionString = "jdbc:phoenix:" + zkQuorum + ":2181:" + znode; + } + + private static void parseTableConfig(HbaseSQLWriterConfig cfg, Configuration dataxCfg) { + // 解析并检查表名 + cfg.tableName = dataxCfg.getString(Key.TABLE); + if (cfg.tableName == null || cfg.tableName.isEmpty()) { + throw DataXException.asDataXException( + HbaseSQLWriterErrorCode.ILLEGAL_VALUE, "HBase的tableName配置不能为空,请检查并修改配置."); + } + try { + TableName tn = TableName.valueOf(cfg.tableName); + } catch (Exception e) { + throw DataXException.asDataXException( + HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "您配置的tableName(" + cfg.tableName + ")含有非法字符,请检查您的配置 或者 联系 Hbase 管理员."); + } + + // 解析列配置 + cfg.columns = dataxCfg.getList(Key.COLUMN, String.class); + if (cfg.columns == null || cfg.columns.isEmpty()) { + throw DataXException.asDataXException( + HbaseSQLWriterErrorCode.ILLEGAL_VALUE, "HBase的columns配置不能为空,请添加目标表的列名配置."); + } + } + + @Override + public String toString() { + StringBuilder ret = new StringBuilder(); + // 集群配置 + ret.append("\n[jdbc]"); + ret.append(connectionString); + ret.append("\n"); + + // 表配置 + ret.append("[tableName]"); + ret.append(tableName); + ret.append("\n"); + ret.append("[column]"); + for (String col : columns) { + ret.append(col); + ret.append(","); + } + ret.setLength(ret.length() - 1); + ret.append("\n"); + + // 其他配置 + 
ret.append("[nullMode]"); + ret.append(nullMode); + ret.append("\n"); + ret.append("[batchSize]"); + ret.append(batchSize); + ret.append("\n"); + ret.append("[truncate]"); + ret.append(truncate); + ret.append("\n"); + + return ret.toString(); + } + + /** + * 禁止直接实例化本类,必须调用{@link #parse}接口来初始化 + */ + private HbaseSQLWriterConfig() { + } +} diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterErrorCode.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterErrorCode.java new file mode 100644 index 0000000000..81f9bc0c3d --- /dev/null +++ b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterErrorCode.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum HbaseSQLWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("Hbasewriter-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("Hbasewriter-01", "您填写的参数值不合法."), + GET_HBASE_CONNECTION_ERROR("Hbasewriter-02", "获取Hbase连接时出错."), + GET_HBASE_TABLE_ERROR("Hbasewriter-03", "获取 Hbase table时出错."), + CLOSE_HBASE_CONNECTION_ERROR("Hbasewriter-04", "关闭Hbase连接时出错."), + CLOSE_HBASE_AMIN_ERROR("Hbasewriter-05", "关闭Hbase admin时出错."), + CLOSE_HBASE_TABLE_ERROR("Hbasewriter-06", "关闭Hbase table时时出错."), + PUT_HBASE_ERROR("Hbasewriter-07", "写入hbase时发生IO异常."), + DELETE_HBASE_ERROR("Hbasewriter-08", "delete hbase表时发生异常."), + TRUNCATE_HBASE_ERROR("Hbasewriter-09", "truncate hbase表时发生异常."), + ; + + private final String code; + private final String description; + + private HbaseSQLWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, this.description); + } +} diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterTask.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterTask.java new file mode 100644 index 0000000000..1b00ea3f45 --- /dev/null +++ b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/HbaseSQLWriterTask.java @@ -0,0 +1,366 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.google.common.collect.Lists; +import org.apache.phoenix.schema.PTable; +import org.apache.phoenix.schema.types.PDataType; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.math.BigDecimal; +import java.sql.Connection; +import java.sql.PreparedStatement; +import java.sql.SQLException; +import java.sql.Types; +import java.util.Arrays; +import java.util.List; + +/** + * @author yanghan.y + */ +public class HbaseSQLWriterTask { + private final static Logger LOG = LoggerFactory.getLogger(HbaseSQLWriterTask.class); + + private TaskPluginCollector taskPluginCollector; + private HbaseSQLWriterConfig cfg; + private Connection connection = null; + private 
PreparedStatement ps = null; + // 需要向hbsae写入的列的数量,即用户配置的column参数中列的个数。时间戳不包含在内 + private int numberOfColumnsToWrite; + // 期待从源头表的Record中拿到多少列 + private int numberOfColumnsToRead; + private boolean needExplicitVersion = false; + private int[] columnTypes; + + public HbaseSQLWriterTask(Configuration configuration) { + // 这里仅解析配置,不访问远端集群,配置的合法性检查在writer的init过程中进行 + cfg = HbaseSQLHelper.parseConfig(configuration); + } + + public void startWriter(RecordReceiver lineReceiver, TaskPluginCollector taskPluginCollector) { + this.taskPluginCollector = taskPluginCollector; + Record record; + try { + // 准备阶段 + prepare(); + + List buffer = Lists.newArrayListWithExpectedSize(cfg.getBatchSize()); + while ((record = lineReceiver.getFromReader()) != null) { + // 校验列数量是否符合预期 + if (record.getColumnNumber() != numberOfColumnsToRead) { + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "数据源给出的列数量[" + record.getColumnNumber() + "]与您配置中的列数量[" + numberOfColumnsToRead + + "]不同, 请检查您的配置 或者 联系 Hbase 管理员."); + } + + buffer.add(record); + if (buffer.size() > cfg.getBatchSize()) { + doBatchUpsert(buffer); + buffer.clear(); + } + } // end while loop + + // 处理剩余的record + if (!buffer.isEmpty()) { + doBatchUpsert(buffer); + buffer.clear(); + } + } catch (Throwable t) { + // 确保所有异常都转化为DataXException + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.PUT_HBASE_ERROR, t); + } finally { + close(); + } + } + + private void prepare() throws SQLException { + if (connection == null) { + connection = HbaseSQLHelper.getJdbcConnection(cfg); + connection.setAutoCommit(false); // 批量提交 + } + + if (ps == null) { + // 一个Task的生命周期中只使用一个PreparedStatement对象,所以,在 + ps = createPreparedStatement(); + columnTypes = getColumnSqlType(cfg.getColumns()); + } + } + + private void close() { + if (ps != null) { + try { + ps.close(); + } catch (SQLException e) { + // 不会出错 + LOG.error("Failed closing PreparedStatement", e); + } + } + if (connection != null) { + try { + connection.close(); + } catch (SQLException e) { + // 不会出错 + LOG.error("Failed closing Connection", e); + } + } + } + + /** + * 批量提交一组数据,如果失败,则尝试一行行提交,如果仍然失败,抛错给用户 + */ + private void doBatchUpsert(List records) throws SQLException { + try { + // 将所有record提交到connection缓存 + for (Record r : records) { + setupStatement(r); + ps.executeUpdate(); + } + + // 将缓存的数据提交到hbase + connection.commit(); + } catch (SQLException e) { + LOG.error("Failed batch committing " + records.size() + " records", e); + + // 批量提交失败,则一行行重试,以确定那一行出错 + connection.rollback(); + doSingleUpsert(records); + } catch (Exception e) { + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.PUT_HBASE_ERROR, e); + } + } + + /** + * 单行提交,将出错的行记录到脏数据中。由脏数据收集模块判断任务是否继续 + */ + private void doSingleUpsert(List records) throws SQLException { + for (Record r : records) { + try { + setupStatement(r); + ps.executeUpdate(); + connection.commit(); + } catch (SQLException e) { + //出错了,记录脏数据 + LOG.error("Failed writing hbase", e); + this.taskPluginCollector.collectDirtyRecord(r, e); + } + } + } + + /** + * 生成sql模板,并根据模板创建PreparedStatement + */ + private PreparedStatement createPreparedStatement() throws SQLException { + // 生成列名集合,列之间用逗号分隔: col1,col2,col3,... 
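        // 补充示意(非原代码的一部分,仅帮助理解):假设配置的 table 为 US_POPULATION,
        // column 为 ["STATE","CITY"](均为假设的示例名),则本方法最终拼出的SQL模板大致为:
        //   upsert into "US_POPULATION" ("STATE","CITY" ) values (?,?)
        // 表名、列名都用双引号括起,以保留配置中的大小写,与 Phoenix 元数据保持一致;
        // values 中问号占位符的个数与列数相同,后续由 setupStatement 按列类型逐个绑定。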
+ StringBuilder columnNamesBuilder = new StringBuilder(); + for (String col : cfg.getColumns()) { + // 列名使用双引号,则不自动转换为全大写,而是保留用户配置的大小写 + columnNamesBuilder.append("\""); + columnNamesBuilder.append(col); + columnNamesBuilder.append("\""); + columnNamesBuilder.append(","); + } + columnNamesBuilder.setLength(columnNamesBuilder.length() - 1); // 移除末尾多余的逗号 + String columnNames = columnNamesBuilder.toString(); + numberOfColumnsToWrite = cfg.getColumns().size(); + numberOfColumnsToRead = numberOfColumnsToWrite; // 开始的时候,要读的列数娱要写的列数相等 + + // 生成UPSERT模板 + String tableName = cfg.getTableName(); + // 表名使用双引号,则不自动转换为全大写,而是保留用户配置的大小写 + StringBuilder upsertBuilder = + new StringBuilder("upsert into \"" + tableName + "\" (" + columnNames + " ) values ("); + for (int i = 0; i < cfg.getColumns().size(); i++) { + upsertBuilder.append("?,"); + } + upsertBuilder.setLength(upsertBuilder.length() - 1); // 移除末尾多余的逗号 + upsertBuilder.append(")"); + + String sql = upsertBuilder.toString(); + PreparedStatement ps = connection.prepareStatement(sql); + LOG.debug("SQL template generated: " + sql); + return ps; + } + + /** + * 根据列名来从数据库元数据中获取这一列对应的SQL类型 + */ + private int[] getColumnSqlType(List columnNames) throws SQLException { + int[] types = new int[numberOfColumnsToWrite]; + PTable ptable = HbaseSQLHelper.getTableSchema(connection, cfg.getTableName()); + + for (int i = 0; i < columnNames.size(); i++) { + String name = columnNames.get(i); + PDataType type = ptable.getColumnForColumnName(name).getDataType(); + types[i] = type.getSqlType(); + LOG.debug("Column name : " + name + ", sql type = " + type.getSqlType() + " " + type.getSqlTypeName()); + } + return types; + } + + private void setupStatement(Record record) throws SQLException { + // 一开始的时候就已经校验过record中的列数量与ps中需要的值数量相等 + for (int i = 0; i < numberOfColumnsToWrite; i++) { + Column col = record.getColumn(i); + int sqlType = columnTypes[i]; + // PreparedStatement中的索引从1开始,所以用i+1 + setupColumn(i + 1, sqlType, col); + } + } + + private void setupColumn(int pos, int sqlType, Column col) throws SQLException { + if (col.getRawData() != null) { + switch (sqlType) { + case Types.CHAR: + case Types.VARCHAR: + ps.setString(pos, col.asString()); + break; + + case Types.BINARY: + case Types.VARBINARY: + ps.setBytes(pos, col.asBytes()); + break; + + case Types.BOOLEAN: + ps.setBoolean(pos, col.asBoolean()); + break; + + case Types.TINYINT: + case Constant.TYPE_UNSIGNED_TINYINT: + ps.setByte(pos, col.asLong().byteValue()); + break; + + case Types.SMALLINT: + case Constant.TYPE_UNSIGNED_SMALLINT: + ps.setShort(pos, col.asLong().shortValue()); + break; + + case Types.INTEGER: + case Constant.TYPE_UNSIGNED_INTEGER: + ps.setInt(pos, col.asLong().intValue()); + break; + + case Types.BIGINT: + case Constant.TYPE_UNSIGNED_LONG: + ps.setLong(pos, col.asLong()); + break; + + case Types.FLOAT: + ps.setFloat(pos, col.asDouble().floatValue()); + break; + + case Types.DOUBLE: + ps.setDouble(pos, col.asDouble()); + break; + + case Types.DECIMAL: + ps.setBigDecimal(pos, col.asBigDecimal()); + break; + + case Types.DATE: + case Constant.TYPE_UNSIGNED_DATE: + ps.setDate(pos, new java.sql.Date(col.asDate().getTime())); + break; + + case Types.TIME: + case Constant.TYPE_UNSIGNED_TIME: + ps.setTime(pos, new java.sql.Time(col.asDate().getTime())); + break; + + case Types.TIMESTAMP: + case Constant.TYPE_UNSIGNED_TIMESTAMP: + ps.setTimestamp(pos, new java.sql.Timestamp(col.asDate().getTime())); + break; + + default: + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + 
"不支持您配置的列类型:" + sqlType + ", 请检查您的配置 或者 联系 Hbase 管理员."); + + } // end switch + } else { + // 没有值,按空值的配置情况处理 + switch (cfg.getNullMode()){ + case Skip: + // 跳过空值,则不插入该列, + ps.setNull(pos, sqlType); + break; + + case Empty: + // 插入"空值",请注意不同类型的空值不同 + // 另外,对SQL来说,空值本身是有值的,这与直接操作HBASE Native API时的空值完全不同 + ps.setObject(pos, getEmptyValue(sqlType)); + break; + + default: + // nullMode的合法性在初始化配置的时候已经校验过,这里一定不会出错 + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "Hbasewriter 不支持该 nullMode 类型: " + cfg.getNullMode() + + ", 目前支持的 nullMode 类型是:" + Arrays.asList(NullModeType.values())); + } + } + } + + /** + * 根据类型获取"空值" + * 值类型的空值都是0,bool是false,String是空字符串 + * @param sqlType sql数据类型,定义于{@link Types} + */ + private Object getEmptyValue(int sqlType) { + switch (sqlType) { + case Types.VARCHAR: + return ""; + + case Types.BOOLEAN: + return false; + + case Types.TINYINT: + case Constant.TYPE_UNSIGNED_TINYINT: + return (byte) 0; + + case Types.SMALLINT: + case Constant.TYPE_UNSIGNED_SMALLINT: + return (short) 0; + + case Types.INTEGER: + case Constant.TYPE_UNSIGNED_INTEGER: + return (int) 0; + + case Types.BIGINT: + case Constant.TYPE_UNSIGNED_LONG: + return (long) 0; + + case Types.FLOAT: + return (float) 0.0; + + case Types.DOUBLE: + return (double) 0.0; + + case Types.DECIMAL: + return new BigDecimal(0); + + case Types.DATE: + case Constant.TYPE_UNSIGNED_DATE: + return new java.sql.Date(0); + + case Types.TIME: + case Constant.TYPE_UNSIGNED_TIME: + return new java.sql.Time(0); + + case Types.TIMESTAMP: + case Constant.TYPE_UNSIGNED_TIMESTAMP: + return new java.sql.Timestamp(0); + + case Types.BINARY: + case Types.VARBINARY: + return new byte[0]; + + default: + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "不支持您配置的列类型:" + sqlType + ", 请检查您的配置 或者 联系 Hbase 管理员."); + } + } +} diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/Key.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/Key.java new file mode 100755 index 0000000000..1b4f3816b6 --- /dev/null +++ b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/Key.java @@ -0,0 +1,44 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +import org.apache.hadoop.hbase.HConstants; + +public final class Key { + + /** + * 【必选】hbase集群配置,连接一个hbase集群需要的最小配置只有两个:zk和znode + */ + public final static String HBASE_CONFIG = "hbaseConfig"; + public final static String HBASE_ZK_QUORUM = HConstants.ZOOKEEPER_QUORUM; + public final static String HBASE_ZNODE_PARENT = HConstants.ZOOKEEPER_ZNODE_PARENT; + + /** + * 【必选】writer要写入的表的表名 + */ + public final static String TABLE = "table"; + + /** + * 【必选】列配置 + */ + public final static String COLUMN = "column"; + public static final String NAME = "name"; + + /** + * 【可选】遇到空值默认跳过 + */ + public static final String NULL_MODE = "nullMode"; + + /** + * 【可选】 + * 在writer初始化的时候,是否清空目的表 + * 如果全局启动多个writer,则必须确保所有的writer都prepare之后,再开始导数据。 + */ + public static final String TRUNCATE = "truncate"; + + /** + * 【可选】批量写入的最大行数,默认100行 + */ + public static final String BATCH_SIZE = "batchSize"; + + + +} diff --git a/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/NullModeType.java b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/NullModeType.java new file mode 100644 index 0000000000..2e2d034c48 --- /dev/null +++ 
b/hbase11xsqlwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xsqlwriter/NullModeType.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.writer.hbase11xsqlwriter; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum NullModeType { + Skip("skip"), + Empty("empty") + ; + + private String mode; + + + NullModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public String getMode() { + return mode; + } + + public static NullModeType getByTypeName(String modeName) { + for (NullModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + throw DataXException.asDataXException(HbaseSQLWriterErrorCode.ILLEGAL_VALUE, + "Hbasewriter 不支持该 nullMode 类型:" + modeName + ", 目前支持的 nullMode 类型是:" + Arrays.asList(values())); + } +} diff --git a/hbase11xsqlwriter/src/main/resources/plugin.json b/hbase11xsqlwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..9ebf770008 --- /dev/null +++ b/hbase11xsqlwriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "hbase11xsqlwriter", + "class": "com.alibaba.datax.plugin.writer.hbase11xsqlwriter.HbaseSQLWriter", + "description": "useScene: prod. mechanism: use hbase sql UPSERT to put data, index tables will be updated too.", + "developer": "alibaba" +} + diff --git a/hbase11xwriter/doc/.gitkeep b/hbase11xwriter/doc/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/hbase11xwriter/doc/hbase11xwriter.md b/hbase11xwriter/doc/hbase11xwriter.md new file mode 100644 index 0000000000..cec8144d08 --- /dev/null +++ b/hbase11xwriter/doc/hbase11xwriter.md @@ -0,0 +1,356 @@ +# Hbase094XWriter & Hbase11XWriter 插件文档 + +___ + +## 1 快速介绍 + +HbaseWriter 插件实现了从向Hbase中写取数据。在底层实现上,HbaseWriter 通过 HBase 的 Java 客户端连接远程 HBase 服务,并通过 put 方式写入Hbase。 + + +### 1.1支持功能 + +1、目前HbaseWriter支持的Hbase版本有:Hbase0.94.x和Hbase1.1.x。 + +* 若您的hbase版本为Hbase0.94.x,writer端的插件请选择:hbase094xwriter,即: + + ``` + "writer": { + "name": "hbase094xwriter" + } + ``` + +* 若您的hbase版本为Hbase1.1.x,writer端的插件请选择:hbase11xwriter,即: + + ``` + "writer": { + "name": "hbase11xwriter" + } + ``` + +2、目前HbaseWriter支持源端多个字段拼接作为hbase 表的 rowkey,具体配置参考:rowkeyColumn配置; + +3、写入hbase的时间戳(版本)支持:用当前时间作为版本,指定源端列作为版本,指定一个时间 三种方式作为版本; + +4、HbaseWriter中有一个必填配置项是:hbaseConfig,需要你联系 HBase PE,将hbase-site.xml 中与连接 HBase 相关的配置项提取出来,以 json 格式填入,同时可以补充更多HBase client的配置来优化与服务器的交互。 + + +如:hbase-site.xml的配置内容如下 + +``` + + + hbase.rootdir + hdfs://ip:9000/hbase + + + hbase.cluster.distributed + true + + + hbase.zookeeper.quorum + *** + + +``` +转换后的json为: + +``` +"hbaseConfig": { + "hbase.rootdir": "hdfs: //ip: 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "***" + } +``` + +### 1.2 限制 + +1、目前只支持源端为横表写入,不支持竖表(源端读出的为四元组: rowKey,family:qualifier,timestamp,value)模式的数据写入;本期目标主要是替换DataX2中的habsewriter,下次迭代考虑支持。 + +2、目前不支持写入hbase前清空表数据,若需要清空数据请联系HBase PE + +## 2 实现原理 + +简而言之,HbaseWriter 通过 HBase 的 Java 客户端,通过 HTable, Put等 API,将从上游Reader读取的数据写入HBase你hbase11xwriter与hbase094xwriter的主要不同在于API的调用不同,Hbase1.1.x废弃了很多Hbase0.94.x的api。 + + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从本地写入hbase1.1.x的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 5 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": "/Users/shf/workplace/datax_test/hbase11xwriter/txt/normal.txt", + "charset": "UTF-8", + "column": [ + { + "index": 0, + "type": "String" + }, + { + "index": 1, + "type": "string" + }, + { + "index": 2, + "type": 
"string" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "string" + }, + { + "index": 5, + "type": "string" + }, + { + "index": 6, + "type": "string" + } + + ], + "fieldDelimiter": "," + } + }, + "writer": { + "name": "hbase11xwriter", + "parameter": { + "hbaseConfig": { + "hbase.rootdir": "hdfs: //ip: 9000/hbase", + "hbase.cluster.distributed": "true", + "hbase.zookeeper.quorum": "***" + }, + "table": "writer", + "mode": "normal", + "rowkeyColumn": [ + { + "index":0, + "type":"string" + }, + { + "index":-1, + "type":"string", + "value":"_" + } + ], + "column": [ + { + "index":1, + "name": "cf1:q1", + "type": "string" + }, + { + "index":2, + "name": "cf1:q2", + "type": "string" + }, + { + "index":3, + "name": "cf1:q3", + "type": "string" + }, + { + "index":4, + "name": "cf2:q1", + "type": "string" + }, + { + "index":5, + "name": "cf2:q2", + "type": "string" + }, + { + "index":6, + "name": "cf2:q3", + "type": "string" + } + ], + "versionColumn":{ + "index": -1, + "value":"123456789" + }, + "encoding": "utf-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **hbaseConfig** + + * 描述:每个HBase集群提供给DataX客户端连接的配置信息存放在hbase-site.xml,请联系你的HBase PE提供配置信息,并转换为JSON格式。同时可以补充更多HBase client的配置,如:设置scan的cache、batch来优化与服务器的交互。 + + * 必选:是
+ + * 默认值:无
+ +* **mode** + + * 描述:写入hbase的模式,目前只支持normal模式,后续考虑支持动态列模式
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:要写的 hbase 表名(大小写敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **encoding** + + * 描述:编码方式,UTF-8 或是 GBK,用于 String 转 HBase byte[]时的编码
+ + * 必选:否
+ + * 默认值:UTF-8
+ + +* **column** + + * 描述:要写入的hbase字段。index:指定该列对应reader端column的索引,从0开始;name:指定hbase表中的列,必须为 列族:列名 的格式;type:指定写入数据类型,用于转换HBase byte[]。配置格式如下: + + ``` +"column": [ + { + "index":1, + "name": "cf1:q1", + "type": "string" + }, + { + "index":2, + "name": "cf1:q2", + "type": "string" + } + ] + + ``` + + * 必选:是
+ + * 默认值:无
+ +* **rowkeyColumn** + + * 描述:要写入的hbase的rowkey列。index:指定该列对应reader端column的索引,从0开始,若为常量index为-1;type:指定写入数据类型,用于转换HBase byte[];value:配置常量,常作为多个字段的拼接符。hbasewriter会将rowkeyColumn中所有列按照配置顺序进行拼接作为写入hbase的rowkey,不能全为常量。配置格式如下: + + ``` +"rowkeyColumn": [ + { + "index":0, + "type":"string" + }, + { + "index":-1, + "type":"string", + "value":"_" + } + ] + + ``` + + * 必选:是
+ + * 默认值:无
+ +* **versionColumn** + + * 描述:指定写入hbase的时间戳。支持:当前时间、指定时间列,指定时间,三者选一。若不配置表示用当前时间。index:指定对应reader端column的索引,从0开始,需保证能转换为long,若是Date类型,会尝试用yyyy-MM-dd HH:mm:ss和yyyy-MM-dd HH:mm:ss SSS去解析;若为指定时间index为-1;value:指定时间的值,long值。配置格式如下: + + ``` +"versionColumn":{ + "index":1 +} + + ``` + + 或者 + + ``` +"versionColumn":{ + "index":-1, + "value":123456789 +} + + ``` + + * 必选:否
+ + * 默认值:无
+ + +* **nullMode** + + * 描述:读取到null值时如何处理。支持两种方式:(1)skip:表示不向hbase写这一列;(2)empty:写入HConstants.EMPTY_BYTE_ARRAY,即new byte[0]
+ + * 必选:否
+ + * 默认值:skip
+ +* **walFlag** + + * 描述:在HBase client向集群中的RegionServer提交数据时(Put/Delete操作),会先写WAL(Write Ahead Log)日志(即HLog,一个RegionServer上的所有Region共享一个HLog),只有当WAL日志写成功后,才接着写MemStore,随后客户端才会被通知提交成功;如果写WAL日志失败,客户端则会被通知提交失败。关闭(false)表示放弃写WAL日志,从而提高数据写入的性能(参见参数列表之后的代码示意)。
+ + * 必选:否
+ + * 默认值:false
+ +* **writeBufferSize** + + * 描述:设置HBase client的写buffer大小,单位字节,与autoflush配合使用:autoflush开启(true)时,HBase client每收到一条put就立即向服务端提交一次;关闭(false)时,只有当put填满客户端写缓存,才实际向HBase服务端发起写请求(参见参数列表之后的代码示意)。
+ + * 必选:否
+ + * 默认值:8M
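为便于理解 walFlag 与 writeBufferSize 这两个参数,下面给出一个基于 HBase 1.1 客户端 API 的最小示意。其中表名 demo_table、列族 cf1、ZK 地址等均为假设值,仅用于说明参数含义,并非本插件的实际实现代码:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalAndBufferDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // 假设的ZK地址,对应 hbaseConfig 中的 hbase.zookeeper.quorum
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             // writeBufferSize:客户端写缓存大小,缓存写满后才向RegionServer发起一次批量写请求
             BufferedMutator mutator = conn.getBufferedMutator(
                     new BufferedMutatorParams(TableName.valueOf("demo_table"))
                             .writeBufferSize(8 * 1024 * 1024))) {   // 8M,与默认值一致

            Put put = new Put(Bytes.toBytes("rowkey_demo"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("value"));
            // walFlag=false 大致相当于跳过WAL:写入更快,但RegionServer宕机时可能丢失尚未落盘的数据
            put.setDurability(Durability.SKIP_WAL);
            mutator.mutate(put);   // 先进入客户端写缓存
        }                          // try-with-resources 关闭时会flush剩余数据
    }
}
```

可以看到,writeBufferSize 对应的是客户端 BufferedMutator 的写缓存;而跳过 WAL 虽然能提升写入吞吐,但在 RegionServer 异常时有丢数据的风险,请按业务的容忍度选择。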
+ +### 3.3 HBase支持的列类型 +* BOOLEAN +* SHORT +* INT +* LONG +* FLOAT +* DOUBLE +* STRING + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 + +## 4 性能报告 + +略 + +## 5 约束限制 + +略 + +## 6 FAQ + +*** diff --git a/hbase11xwriter/pom.xml b/hbase11xwriter/pom.xml new file mode 100644 index 0000000000..21ea104bf0 --- /dev/null +++ b/hbase11xwriter/pom.xml @@ -0,0 +1,111 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hbase11xwriter + hbase11xwriter + 0.0.1-SNAPSHOT + jar + + + 1.1.3 + 2.5.0 + 1.8 + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.apache.hadoop + hadoop-hdfs + ${hadoop.version} + + + org.apache.hbase + hbase-client + ${hbase.version} + + + org.apache.hbase + hbase-common + ${hbase.version} + + + com.google.guava + guava + 12.0.1 + + + commons-codec + commons-codec + ${commons-codec.version} + + + junit + junit + test + + + com.alibaba.datax + datax-core + ${datax-project-version} + test + + + org.mockito + mockito-all + 1.9.5 + test + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + \ No newline at end of file diff --git a/hbase11xwriter/src/main/assembly/package.xml b/hbase11xwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..657d464834 --- /dev/null +++ b/hbase11xwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/hbase11xwriter + + + target/ + + hbase11xwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/hbase11xwriter + + + + + + false + plugin/writer/hbase11xwriter/libs + runtime + + + diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/ColumnType.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/ColumnType.java new file mode 100755 index 0000000000..081d10105d --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/ColumnType.java @@ -0,0 +1,46 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.exception.DataXException; +import org.apache.commons.lang.StringUtils; + +import java.util.Arrays; + +/** + * 只对 normal 模式读取时有用,多版本读取时,不存在列类型的 + */ +public enum ColumnType { + STRING("string"), + BOOLEAN("boolean"), + SHORT("short"), + INT("int"), + LONG("long"), + FLOAT("float"), + DOUBLE("double") + ; + + private String typeName; + + ColumnType(String typeName) { + this.typeName = typeName; + } + + public static ColumnType getByTypeName(String typeName) { + if(StringUtils.isBlank(typeName)){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + for (ColumnType columnType : values()) { + if (StringUtils.equalsIgnoreCase(columnType.typeName, typeName.trim())) { + return columnType; + } + } + + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该类型:%s, 目前支持的类型是:%s", typeName, Arrays.asList(values()))); + } + + @Override + public String toString() { + return this.typeName; + } +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Constant.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Constant.java new file mode 100755 index 
0000000000..5e69205303 --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Constant.java @@ -0,0 +1,8 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +public final class Constant { + public static final String DEFAULT_ENCODING = "UTF-8"; + public static final String DEFAULT_DATA_FORMAT = "yyyy-MM-dd HH:mm:ss"; + public static final String DEFAULT_NULL_MODE = "skip"; + public static final long DEFAULT_WRITE_BUFFER_SIZE = 8 * 1024 * 1024; +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xHelper.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xHelper.java new file mode 100644 index 0000000000..94b13b60c9 --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xHelper.java @@ -0,0 +1,297 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.TypeReference; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.hadoop.hbase.HBaseConfiguration; +import org.apache.hadoop.hbase.TableName; +import org.apache.hadoop.hbase.client.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.List; +import java.util.Map; + + +public class Hbase11xHelper { + + private static final Logger LOG = LoggerFactory.getLogger(Hbase11xHelper.class); + + public static org.apache.hadoop.conf.Configuration getHbaseConfiguration(String hbaseConfig) { + if (StringUtils.isBlank(hbaseConfig)) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.REQUIRED_VALUE, "读 Hbase 时需要配置hbaseConfig,其内容为 Hbase 连接信息,请联系 Hbase PE 获取该信息."); + } + org.apache.hadoop.conf.Configuration hConfiguration = HBaseConfiguration.create(); + try { + Map hbaseConfigMap = JSON.parseObject(hbaseConfig, new TypeReference>() {}); + // 用户配置的 key-value 对 来表示 hbaseConfig + Validate.isTrue(hbaseConfigMap != null, "hbaseConfig不能为空Map结构!"); + for (Map.Entry entry : hbaseConfigMap.entrySet()) { + hConfiguration.set(entry.getKey(), entry.getValue()); + } + } catch (Exception e) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.GET_HBASE_CONNECTION_ERROR, e); + } + return hConfiguration; + } + + public static org.apache.hadoop.hbase.client.Connection getHbaseConnection(String hbaseConfig) { + org.apache.hadoop.conf.Configuration hConfiguration = Hbase11xHelper.getHbaseConfiguration(hbaseConfig); + + org.apache.hadoop.hbase.client.Connection hConnection = null; + try { + hConnection = ConnectionFactory.createConnection(hConfiguration); + + } catch (Exception e) { + Hbase11xHelper.closeConnection(hConnection); + throw DataXException.asDataXException(Hbase11xWriterErrorCode.GET_HBASE_CONNECTION_ERROR, e); + } + return hConnection; + } + + + public static Table getTable(com.alibaba.datax.common.util.Configuration configuration){ + String hbaseConfig = configuration.getString(Key.HBASE_CONFIG); + String userTable = configuration.getString(Key.TABLE); + long writeBufferSize = configuration.getLong(Key.WRITE_BUFFER_SIZE, Constant.DEFAULT_WRITE_BUFFER_SIZE); + org.apache.hadoop.hbase.client.Connection hConnection = Hbase11xHelper.getHbaseConnection(hbaseConfig); + TableName hTableName = TableName.valueOf(userTable); + 
org.apache.hadoop.hbase.client.Admin admin = null; + org.apache.hadoop.hbase.client.Table hTable = null; + try { + admin = hConnection.getAdmin(); + Hbase11xHelper.checkHbaseTable(admin,hTableName); + hTable = hConnection.getTable(hTableName); + BufferedMutatorParams bufferedMutatorParams = new BufferedMutatorParams(hTableName); + bufferedMutatorParams.writeBufferSize(writeBufferSize); + } catch (Exception e) { + Hbase11xHelper.closeTable(hTable); + Hbase11xHelper.closeAdmin(admin); + Hbase11xHelper.closeConnection(hConnection); + throw DataXException.asDataXException(Hbase11xWriterErrorCode.GET_HBASE_TABLE_ERROR, e); + } + return hTable; + } + + public static BufferedMutator getBufferedMutator(com.alibaba.datax.common.util.Configuration configuration){ + String hbaseConfig = configuration.getString(Key.HBASE_CONFIG); + String userTable = configuration.getString(Key.TABLE); + long writeBufferSize = configuration.getLong(Key.WRITE_BUFFER_SIZE, Constant.DEFAULT_WRITE_BUFFER_SIZE); + org.apache.hadoop.conf.Configuration hConfiguration = Hbase11xHelper.getHbaseConfiguration(hbaseConfig); + org.apache.hadoop.hbase.client.Connection hConnection = Hbase11xHelper.getHbaseConnection(hbaseConfig); + TableName hTableName = TableName.valueOf(userTable); + org.apache.hadoop.hbase.client.Admin admin = null; + BufferedMutator bufferedMutator = null; + try { + admin = hConnection.getAdmin(); + Hbase11xHelper.checkHbaseTable(admin,hTableName); + //参考HTable getBufferedMutator() + bufferedMutator = hConnection.getBufferedMutator( + new BufferedMutatorParams(hTableName) + .pool(HTable.getDefaultExecutor(hConfiguration)) + .writeBufferSize(writeBufferSize)); + } catch (Exception e) { + Hbase11xHelper.closeBufferedMutator(bufferedMutator); + Hbase11xHelper.closeAdmin(admin); + Hbase11xHelper.closeConnection(hConnection); + throw DataXException.asDataXException(Hbase11xWriterErrorCode.GET_HBASE_BUFFEREDMUTATOR_ERROR, e); + } + return bufferedMutator; + } + + public static void deleteTable(com.alibaba.datax.common.util.Configuration configuration) { + String userTable = configuration.getString(Key.TABLE); + LOG.info(String.format("由于您配置了deleteType delete,HBasWriter begins to delete table %s .", userTable)); + Scan scan = new Scan(); + org.apache.hadoop.hbase.client.Table hTable =Hbase11xHelper.getTable(configuration); + ResultScanner scanner = null; + try { + scanner = hTable.getScanner(scan); + for (Result rr = scanner.next(); rr != null; rr = scanner.next()) { + hTable.delete(new Delete(rr.getRow())); + } + } catch (Exception e) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.DELETE_HBASE_ERROR, e); + }finally { + if(scanner != null){ + scanner.close(); + } + Hbase11xHelper.closeTable(hTable); + } + } + + public static void truncateTable(com.alibaba.datax.common.util.Configuration configuration) { + String hbaseConfig = configuration.getString(Key.HBASE_CONFIG); + String userTable = configuration.getString(Key.TABLE); + LOG.info(String.format("由于您配置了 truncate 为true,HBasWriter begins to truncate table %s .", userTable)); + TableName hTableName = TableName.valueOf(userTable); + org.apache.hadoop.hbase.client.Connection hConnection = Hbase11xHelper.getHbaseConnection(hbaseConfig); + org.apache.hadoop.hbase.client.Admin admin = null; + try{ + admin = hConnection.getAdmin(); + Hbase11xHelper.checkHbaseTable(admin,hTableName); + admin.disableTable(hTableName); + admin.truncateTable(hTableName,true); + }catch (Exception e) { + throw 
DataXException.asDataXException(Hbase11xWriterErrorCode.TRUNCATE_HBASE_ERROR, e); + }finally { + Hbase11xHelper.closeAdmin(admin); + Hbase11xHelper.closeConnection(hConnection); + } + } + + public static void closeConnection(Connection hConnection){ + try { + if(null != hConnection) + hConnection.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CLOSE_HBASE_CONNECTION_ERROR, e); + } + } + + public static void closeAdmin(Admin admin){ + try { + if(null != admin) + admin.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CLOSE_HBASE_AMIN_ERROR, e); + } + } + + public static void closeBufferedMutator(BufferedMutator bufferedMutator){ + try { + if(null != bufferedMutator) + bufferedMutator.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CLOSE_HBASE_BUFFEREDMUTATOR_ERROR, e); + } + } + + public static void closeTable(Table table){ + try { + if(null != table) + table.close(); + } catch (IOException e) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CLOSE_HBASE_TABLE_ERROR, e); + } + } + + + private static void checkHbaseTable(Admin admin, TableName hTableName) throws IOException { + if(!admin.tableExists(hTableName)){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "HBase源头表" + hTableName.toString() + + "不存在, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(!admin.isTableAvailable(hTableName)){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "HBase源头表" +hTableName.toString() + + " 不可用, 请检查您的配置 或者 联系 Hbase 管理员."); + } + if(admin.isTableDisabled(hTableName)){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "HBase源头表" +hTableName.toString() + + "is disabled, 请检查您的配置 或者 联系 Hbase 管理员."); + } + } + + + public static void validateParameter(com.alibaba.datax.common.util.Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.HBASE_CONFIG, Hbase11xWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, Hbase11xWriterErrorCode.REQUIRED_VALUE); + + Hbase11xHelper.validateMode(originalConfig); + + String encoding = originalConfig.getString(Key.ENCODING, Constant.DEFAULT_ENCODING); + if (!Charset.isSupported(encoding)) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, String.format("Hbasewriter 不支持您所配置的编码:[%s]", encoding)); + } + originalConfig.set(Key.ENCODING, encoding); + + Boolean walFlag = originalConfig.getBool(Key.WAL_FLAG, false); + originalConfig.set(Key.WAL_FLAG, walFlag); + long writeBufferSize = originalConfig.getLong(Key.WRITE_BUFFER_SIZE,Constant.DEFAULT_WRITE_BUFFER_SIZE); + originalConfig.set(Key.WRITE_BUFFER_SIZE, writeBufferSize); + } + + + + + private static void validateMode(com.alibaba.datax.common.util.Configuration originalConfig){ + String mode = originalConfig.getNecessaryValue(Key.MODE,Hbase11xWriterErrorCode.REQUIRED_VALUE); + ModeType modeType = ModeType.getByTypeName(mode); + switch (modeType) { + case Normal: { + validateRowkeyColumn(originalConfig); + validateColumn(originalConfig); + validateVersionColumn(originalConfig); + break; + } + default: + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbase11xWriter不支持该 mode 类型:%s", mode)); + } + } + + private static void validateColumn(com.alibaba.datax.common.util.Configuration originalConfig){ + List columns = originalConfig.getListConfiguration(Key.COLUMN); + if 
(columns == null || columns.isEmpty()) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.REQUIRED_VALUE, "column为必填项,其形式为:column:[{\"index\": 0,\"name\": \"cf0:column0\",\"type\": \"string\"},{\"index\": 1,\"name\": \"cf1:column1\",\"type\": \"long\"}]"); + } + for (Configuration aColumn : columns) { + Integer index = aColumn.getInt(Key.INDEX); + String type = aColumn.getNecessaryValue(Key.TYPE,Hbase11xWriterErrorCode.REQUIRED_VALUE); + String name = aColumn.getNecessaryValue(Key.NAME,Hbase11xWriterErrorCode.REQUIRED_VALUE); + ColumnType.getByTypeName(type); + if(name.split(":").length != 2){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, String.format("您column配置项中name配置的列格式[%s]不正确,name应该配置为 列族:列名 的形式, 如 {\"index\": 1,\"name\": \"cf1:q1\",\"type\": \"long\"}", name)); + } + if(index == null || index < 0){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "您的column配置项不正确,配置项中中index为必填项,且为非负数,请检查并修改."); + } + } + } + + private static void validateRowkeyColumn(com.alibaba.datax.common.util.Configuration originalConfig){ + List rowkeyColumn = originalConfig.getListConfiguration(Key.ROWKEY_COLUMN); + if (rowkeyColumn == null || rowkeyColumn.isEmpty()) { + throw DataXException.asDataXException(Hbase11xWriterErrorCode.REQUIRED_VALUE, "rowkeyColumn为必填项,其形式为:rowkeyColumn:[{\"index\": 0,\"type\": \"string\"},{\"index\": -1,\"type\": \"string\",\"value\": \"_\"}]"); + } + int rowkeyColumnSize = rowkeyColumn.size(); + //包含{"index":0,"type":"string"} 或者 {"index":-1,"type":"string","value":"_"} + for (Configuration aRowkeyColumn : rowkeyColumn) { + Integer index = aRowkeyColumn.getInt(Key.INDEX); + String type = aRowkeyColumn.getNecessaryValue(Key.TYPE,Hbase11xWriterErrorCode.REQUIRED_VALUE); + ColumnType.getByTypeName(type); + if(index == null ){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.REQUIRED_VALUE, "rowkeyColumn配置项中index为必填项"); + } + //不能只有-1列,即rowkey连接串 + if(rowkeyColumnSize ==1 && index == -1){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "rowkeyColumn配置项不能全为常量列,至少指定一个rowkey列"); + } + if(index == -1){ + aRowkeyColumn.getNecessaryValue(Key.VALUE,Hbase11xWriterErrorCode.REQUIRED_VALUE); + } + } + } + + private static void validateVersionColumn(com.alibaba.datax.common.util.Configuration originalConfig){ + Configuration versionColumn = originalConfig.getConfiguration(Key.VERSION_COLUMN); + //为null,表示用当前时间;指定列,需要index + if(versionColumn != null){ + Integer index = versionColumn.getInt(Key.INDEX); + if(index == null ){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.REQUIRED_VALUE, "versionColumn配置项中index为必填项"); + } + if(index == -1){ + //指定时间,需要index=-1,value + versionColumn.getNecessaryValue(Key.VALUE,Hbase11xWriterErrorCode.REQUIRED_VALUE); + }else if(index < 0){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "您versionColumn配置项中index配置不正确,只能取-1或者非负数"); + } + } + } +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xWriter.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xWriter.java new file mode 100644 index 0000000000..babbed0531 --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xWriter.java @@ -0,0 +1,78 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; 
+import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; + +import java.util.ArrayList; +import java.util.List; + +/** + * Hbase11xWriter + * Created by shf on 16/3/17. + */ +public class Hbase11xWriter extends Writer { + public static class Job extends Writer.Job { + private Configuration originConfig = null; + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + Hbase11xHelper.validateParameter(this.originConfig); + } + + @Override + public void prepare(){ + Boolean truncate = originConfig.getBool(Key.TRUNCATE,false); + if(truncate){ + Hbase11xHelper.truncateTable(this.originConfig); + } + } + @Override + public List split(int mandatoryNumber) { + List splitResultConfigs = new ArrayList(); + for (int j = 0; j < mandatoryNumber; j++) { + splitResultConfigs.add(originConfig.clone()); + } + return splitResultConfigs; + } + + @Override + public void destroy() { + + } + } + public static class Task extends Writer.Task { + private Configuration taskConfig; + private HbaseAbstractTask hbaseTaskProxy; + + @Override + public void init() { + this.taskConfig = super.getPluginJobConf(); + String mode = this.taskConfig.getString(Key.MODE); + ModeType modeType = ModeType.getByTypeName(mode); + + switch (modeType) { + case Normal: + this.hbaseTaskProxy = new NormalTask(this.taskConfig); + break; + default: + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "Hbasereader 不支持此类模式:" + modeType); + } + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + this.hbaseTaskProxy.startWriter(lineReceiver,super.getTaskPluginCollector()); + } + + + @Override + public void destroy() { + if (this.hbaseTaskProxy != null) { + this.hbaseTaskProxy.close(); + } + } + } +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xWriterErrorCode.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xWriterErrorCode.java new file mode 100644 index 0000000000..3434a49e1d --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Hbase11xWriterErrorCode.java @@ -0,0 +1,48 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Hbase11xWriterErrorCode + * Created by shf on 16/3/8. 
+ */ +public enum Hbase11xWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("Hbasewriter-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("Hbasewriter-01", "您填写的参数值不合法."), + GET_HBASE_CONNECTION_ERROR("Hbasewriter-02", "获取Hbase连接时出错."), + GET_HBASE_TABLE_ERROR("Hbasewriter-03", "获取 Hbase table时出错."), + CLOSE_HBASE_CONNECTION_ERROR("Hbasewriter-04", "关闭Hbase连接时出错."), + CLOSE_HBASE_AMIN_ERROR("Hbasewriter-05", "关闭Hbase admin时出错."), + CLOSE_HBASE_TABLE_ERROR("Hbasewriter-06", "关闭Hbase table时时出错."), + PUT_HBASE_ERROR("Hbasewriter-07", "写入hbase时发生IO异常."), + DELETE_HBASE_ERROR("Hbasewriter-08", "delete hbase表时发生异常."), + TRUNCATE_HBASE_ERROR("Hbasewriter-09", "truncate hbase表时发生异常."), + CONSTRUCT_ROWKEY_ERROR("Hbasewriter-10", "构建rowkey时发生异常."), + CONSTRUCT_VERSION_ERROR("Hbasewriter-11", "构建version时发生异常."), + GET_HBASE_BUFFEREDMUTATOR_ERROR("Hbasewriter-12", "获取hbase BufferedMutator 时出错."), + CLOSE_HBASE_BUFFEREDMUTATOR_ERROR("Hbasewriter-13", "关闭 Hbase BufferedMutator时出错."), + ; + private final String code; + private final String description; + + private Hbase11xWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/HbaseAbstractTask.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/HbaseAbstractTask.java new file mode 100755 index 0000000000..22e6144c36 --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/HbaseAbstractTask.java @@ -0,0 +1,164 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.HConstants; +import org.apache.hadoop.hbase.client.BufferedMutator; +import org.apache.hadoop.hbase.client.Put; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.charset.Charset; +import java.util.List; + +public abstract class HbaseAbstractTask { + private final static Logger LOG = LoggerFactory.getLogger(HbaseAbstractTask.class); + + public NullModeType nullMode = null; + + public List columns; + public List rowkeyColumn; + public Configuration versionColumn; + + + //public Table htable; + public String encoding; + public Boolean walFlag; + public BufferedMutator bufferedMutator; + + + public HbaseAbstractTask(com.alibaba.datax.common.util.Configuration configuration) { + //this.htable = Hbase11xHelper.getTable(configuration); + this.bufferedMutator = Hbase11xHelper.getBufferedMutator(configuration); + this.columns = configuration.getListConfiguration(Key.COLUMN); + this.rowkeyColumn = configuration.getListConfiguration(Key.ROWKEY_COLUMN); + this.versionColumn = configuration.getConfiguration(Key.VERSION_COLUMN); + this.encoding = configuration.getString(Key.ENCODING,Constant.DEFAULT_ENCODING); + this.nullMode = 
NullModeType.getByTypeName(configuration.getString(Key.NULL_MODE,Constant.DEFAULT_NULL_MODE)); + this.walFlag = configuration.getBool(Key.WAL_FLAG, false); + } + + public void startWriter(RecordReceiver lineReceiver, TaskPluginCollector taskPluginCollector){ + Record record; + try { + while ((record = lineReceiver.getFromReader()) != null) { + Put put; + try { + put = convertRecordToPut(record); + } catch (Exception e) { + taskPluginCollector.collectDirtyRecord(record, e); + continue; + } + try { + //this.htable.put(put); + this.bufferedMutator.mutate(put); + } catch (IllegalArgumentException e) { + if(e.getMessage().equals("No columns to insert") && nullMode.equals(NullModeType.Skip)){ + LOG.info(String.format("record is empty, 您配置nullMode为[skip],将会忽略这条记录,record[%s]", record.toString())); + continue; + }else { + taskPluginCollector.collectDirtyRecord(record, e); + continue; + } + } + } + }catch (IOException e){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.PUT_HBASE_ERROR,e); + }finally { + //Hbase11xHelper.closeTable(this.htable); + Hbase11xHelper.closeBufferedMutator(this.bufferedMutator); + } + } + + + public abstract Put convertRecordToPut(Record record); + + public void close() { + //Hbase11xHelper.closeTable(this); + Hbase11xHelper.closeBufferedMutator(this.bufferedMutator); + } + + + public byte[] getColumnByte(ColumnType columnType, Column column){ + byte[] bytes; + if(column.getRawData() != null){ + switch (columnType) { + case INT: + bytes = Bytes.toBytes(column.asLong().intValue()); + break; + case LONG: + bytes = Bytes.toBytes(column.asLong()); + break; + case DOUBLE: + bytes = Bytes.toBytes(column.asDouble()); + break; + case FLOAT: + bytes = Bytes.toBytes(column.asDouble().floatValue()); + break; + case SHORT: + bytes = Bytes.toBytes(column.asLong().shortValue()); + break; + case BOOLEAN: + bytes = Bytes.toBytes(column.asBoolean()); + break; + case STRING: + bytes = this.getValueByte(columnType,column.asString()); + break; + default: + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "HbaseWriter列不支持您配置的列类型:" + columnType); + } + }else{ + switch (nullMode){ + case Skip: + bytes = null; + break; + case Empty: + bytes = HConstants.EMPTY_BYTE_ARRAY; + break; + default: + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "HbaseWriter nullMode不支持您配置的类型,只支持skip或者empty"); + } + } + return bytes; + } + + public byte[] getValueByte(ColumnType columnType, String value){ + byte[] bytes; + if(value != null){ + switch (columnType) { + case INT: + bytes = Bytes.toBytes(Integer.parseInt(value)); + break; + case LONG: + bytes = Bytes.toBytes(Long.parseLong(value)); + break; + case DOUBLE: + bytes = Bytes.toBytes(Double.parseDouble(value)); + break; + case FLOAT: + bytes = Bytes.toBytes(Float.parseFloat(value)); + break; + case SHORT: + bytes = Bytes.toBytes(Short.parseShort(value)); + break; + case BOOLEAN: + bytes = Bytes.toBytes(Boolean.parseBoolean(value)); + break; + case STRING: + bytes = value.getBytes(Charset.forName(encoding)); + break; + default: + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "HbaseWriter列不支持您配置的列类型:" + columnType); + } + }else{ + bytes = HConstants.EMPTY_BYTE_ARRAY; + } + return bytes; + } +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Key.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Key.java new file mode 100755 index 0000000000..0b8e8d4b97 --- /dev/null +++ 
b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/Key.java @@ -0,0 +1,53 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +public final class Key { + + public final static String HBASE_CONFIG = "hbaseConfig"; + + public final static String TABLE = "table"; + + /** + * mode 可以取 normal 或者 multiVersionFixedColumn 或者 multiVersionDynamicColumn 三个值,无默认值。 + *

+ * normal 配合 column(Map 结构的)使用 + *

+ * multiVersion + */ + public final static String MODE = "mode"; + + + public final static String ROWKEY_COLUMN = "rowkeyColumn"; + + public final static String VERSION_COLUMN = "versionColumn"; + + /** + * 默认为 utf8 + */ + public final static String ENCODING = "encoding"; + + public final static String COLUMN = "column"; + + public static final String INDEX = "index"; + + public static final String NAME = "name"; + + public static final String TYPE = "type"; + + public static final String VALUE = "value"; + + public static final String FORMAT = "format"; + + /** + * 默认为 EMPTY_BYTES + */ + public static final String NULL_MODE = "nullMode"; + + public static final String TRUNCATE = "truncate"; + + public static final String AUTO_FLUSH = "autoFlush"; + + public static final String WAL_FLAG = "walFlag"; + + public static final String WRITE_BUFFER_SIZE = "writeBufferSize"; + +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/ModeType.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/ModeType.java new file mode 100644 index 0000000000..6871aa0c62 --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/ModeType.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum ModeType { + Normal("normal"), + MultiVersion("multiVersion") + ; + + private String mode; + + + ModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public String getMode() { + return mode; + } + + public static ModeType getByTypeName(String modeName) { + for (ModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该 mode 类型:%s, 目前支持的 mode 类型是:%s", modeName, Arrays.asList(values()))); + } +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/MultiVersionTask.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/MultiVersionTask.java new file mode 100755 index 0000000000..95352ba1ab --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/MultiVersionTask.java @@ -0,0 +1,62 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.hadoop.hbase.client.Put; + +public class MultiVersionTask extends HbaseAbstractTask { + + public MultiVersionTask(Configuration configuration) { + super(configuration); + } + + @Override + public Put convertRecordToPut(Record record) { + if (record.getColumnNumber() != 4 ) { + // multversion 模式下源头读取字段列数为4元组(rowkey,column,timestamp,value),目的端需告诉[] + throw DataXException + .asDataXException( + Hbase11xWriterErrorCode.ILLEGAL_VALUE, + String.format( + "HbaseWriter multversion模式下列配置信息有错误.源头应该为四元组,实际源头读取字段数:%s,请检查您的配置并作出修改.", + record.getColumnNumber())); + } + Put put = null; + //rowkey +// ColumnType rowkeyType = ColumnType.getByTypeName(String.valueOf(columnList.get(0).get(Key.TYPE))); +// if(record.getColumn(0).getRawData() == null){ +// throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, "HbaseWriter的rowkey不能为空,请选择合适的rowkey列"); +// } +// //timestamp +// 
if(record.getColumn(2).getRawData()!= null){ +// put = new Put(getColumnByte(rowkeyType,record.getColumn(0)),record.getColumn(2).asLong()); +// }else{ +// put = new Put(getColumnByte(rowkeyType,record.getColumn(0))); +// } +// //column family,qualifie +// Map userColumn = columnList.get(1); +// ColumnType columnType = ColumnType.getByTypeName(userColumn.get(Key.TYPE)); +// String columnName = userColumn.get(Key.NAME); +// String promptInfo = "Hbasewriter 中,column 的列配置格式应该是:列族:列名. 您配置的列错误:" + columnName; +// String[] cfAndQualifier = columnName.split(":"); +// Validate.isTrue(cfAndQualifier != null && cfAndQualifier.length == 2 +// && StringUtils.isNotBlank(cfAndQualifier[0]) +// && StringUtils.isNotBlank(cfAndQualifier[1]), promptInfo); +// +// if(!columnName.equals(record.getColumn(1).asString())){ +// throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, +// String.format("您的配置中源端和目的端列名不一致,源端为[%s],目的端为[%s],请检查您的配置并作出修改.",record.getColumn(1).asString(),columnName)); +// +// } +// //value +// Column column = record.getColumn(3); +// put.addColumn(Bytes.toBytes( +// cfAndQualifier[0]), +// Bytes.toBytes(cfAndQualifier[1]), +// getColumnByte(columnType,column) +// ); + return put; + } + +} diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/NormalTask.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/NormalTask.java new file mode 100755 index 0000000000..0ff0aea583 --- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/NormalTask.java @@ -0,0 +1,129 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.Validate; +import org.apache.commons.lang3.time.DateUtils; +import org.apache.commons.net.ntp.TimeStamp; +import org.apache.hadoop.hbase.client.Durability; +import org.apache.hadoop.hbase.client.Put; +import org.apache.hadoop.hbase.util.Bytes; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Timestamp; +import java.text.ParseException; +import java.text.SimpleDateFormat; +import java.util.Date; +import java.util.Map; + +public class NormalTask extends HbaseAbstractTask { + private static final Logger LOG = LoggerFactory.getLogger(NormalTask.class); + public NormalTask(Configuration configuration) { + super(configuration); + } + + @Override + public Put convertRecordToPut(Record record){ + byte[] rowkey = getRowkey(record); + Put put = null; + if(this.versionColumn == null){ + put = new Put(rowkey); + if(!super.walFlag){ + //等价与0.94 put.setWriteToWAL(super.walFlag); + put.setDurability(Durability.SKIP_WAL); + } + }else { + long timestamp = getVersion(record); + put = new Put(rowkey,timestamp); + } + for (Configuration aColumn : columns) { + Integer index = aColumn.getInt(Key.INDEX); + String type = aColumn.getString(Key.TYPE); + ColumnType columnType = ColumnType.getByTypeName(type); + String name = aColumn.getString(Key.NAME); + String promptInfo = "Hbasewriter 中,column 的列配置格式应该是:列族:列名. 
您配置的列错误:" + name; + String[] cfAndQualifier = name.split(":"); + Validate.isTrue(cfAndQualifier != null && cfAndQualifier.length == 2 + && StringUtils.isNotBlank(cfAndQualifier[0]) + && StringUtils.isNotBlank(cfAndQualifier[1]), promptInfo); + if(index >= record.getColumnNumber()){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, String.format("您的column配置项中中index值超出范围,根据reader端配置,index的值小于%s,而您配置的值为%s,请检查并修改.",record.getColumnNumber(),index)); + } + byte[] columnBytes = getColumnByte(columnType,record.getColumn(index)); + //columnBytes 为null忽略这列 + if(null != columnBytes){ + put.addColumn(Bytes.toBytes( + cfAndQualifier[0]), + Bytes.toBytes(cfAndQualifier[1]), + columnBytes); + }else{ + continue; + } + } + return put; + } + + public byte[] getRowkey(Record record){ + byte[] rowkeyBuffer = {}; + for (Configuration aRowkeyColumn : rowkeyColumn) { + Integer index = aRowkeyColumn.getInt(Key.INDEX); + String type = aRowkeyColumn.getString(Key.TYPE); + ColumnType columnType = ColumnType.getByTypeName(type); + if(index == -1){ + String value = aRowkeyColumn.getString(Key.VALUE); + rowkeyBuffer = Bytes.add(rowkeyBuffer,getValueByte(columnType,value)); + }else{ + if(index >= record.getColumnNumber()){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CONSTRUCT_ROWKEY_ERROR, String.format("您的rowkeyColumn配置项中中index值超出范围,根据reader端配置,index的值小于%s,而您配置的值为%s,请检查并修改.",record.getColumnNumber(),index)); + } + byte[] value = getColumnByte(columnType,record.getColumn(index)); + rowkeyBuffer = Bytes.add(rowkeyBuffer, value); + } + } + return rowkeyBuffer; + } + + public long getVersion(Record record){ + int index = versionColumn.getInt(Key.INDEX); + long timestamp; + if(index == -1){ + //指定时间作为版本 + timestamp = versionColumn.getLong(Key.VALUE); + if(timestamp < 0){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CONSTRUCT_VERSION_ERROR, "您指定的版本非法!"); + } + }else{ + //指定列作为版本,long/doubleColumn直接record.aslong, 其它类型尝试用yyyy-MM-dd HH:mm:ss,yyyy-MM-dd HH:mm:ss SSS去format + if(index >= record.getColumnNumber()){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CONSTRUCT_VERSION_ERROR, String.format("您的versionColumn配置项中中index值超出范围,根据reader端配置,index的值小于%s,而您配置的值为%s,请检查并修改.",record.getColumnNumber(),index)); + } + if(record.getColumn(index).getRawData() == null){ + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CONSTRUCT_VERSION_ERROR, "您指定的版本为空!"); + } + SimpleDateFormat df_senconds = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"); + SimpleDateFormat df_ms = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS"); + if(record.getColumn(index) instanceof LongColumn || record.getColumn(index) instanceof DoubleColumn){ + timestamp = record.getColumn(index).asLong(); + }else { + Date date; + try{ + date = df_ms.parse(record.getColumn(index).asString()); + }catch (ParseException e){ + try { + date = df_senconds.parse(record.getColumn(index).asString()); + } catch (ParseException e1) { + LOG.info(String.format("您指定第[%s]列作为hbase写入版本,但在尝试用yyyy-MM-dd HH:mm:ss 和 yyyy-MM-dd HH:mm:ss SSS 去解析为Date时均出错,请检查并修改",index)); + throw DataXException.asDataXException(Hbase11xWriterErrorCode.CONSTRUCT_VERSION_ERROR, e1); + } + } + timestamp = date.getTime(); + } + } + return timestamp; + } +} \ No newline at end of file diff --git a/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/NullModeType.java b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/NullModeType.java new file mode 100644 index 0000000000..d77dbbd79c 
--- /dev/null +++ b/hbase11xwriter/src/main/java/com/alibaba/datax/plugin/writer/hbase11xwriter/NullModeType.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.writer.hbase11xwriter; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.Arrays; + +public enum NullModeType { + Skip("skip"), + Empty("empty") + ; + + private String mode; + + + NullModeType(String mode) { + this.mode = mode.toLowerCase(); + } + + public String getMode() { + return mode; + } + + public static NullModeType getByTypeName(String modeName) { + for (NullModeType modeType : values()) { + if (modeType.mode.equalsIgnoreCase(modeName)) { + return modeType; + } + } + throw DataXException.asDataXException(Hbase11xWriterErrorCode.ILLEGAL_VALUE, + String.format("Hbasewriter 不支持该 nullMode 类型:%s, 目前支持的 nullMode 类型是:%s", modeName, Arrays.asList(values()))); + } +} diff --git a/hbase11xwriter/src/main/resources/plugin.json b/hbase11xwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..8db36f59af --- /dev/null +++ b/hbase11xwriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "hbase11xwriter", + "class": "com.alibaba.datax.plugin.writer.hbase11xwriter.Hbase11xWriter", + "description": "use put: prod. mechanism: use hbase java api put data.", + "developer": "alibaba" +} + diff --git a/hbase11xwriter/src/main/resources/plugin_job_template.json b/hbase11xwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..c3d2dd685f --- /dev/null +++ b/hbase11xwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,21 @@ +{ + "name": "hbase11xwriter", + "parameter": { + "hbaseConfig": { + "hbase.rootdir": "", + "hbase.cluster.distributed": "", + "hbase.zookeeper.quorum": "" + }, + "table": "", + "mode": "", + "rowkeyColumn": [ + ], + "column": [ + ], + "versionColumn":{ + "index": "", + "value":"" + }, + "encoding": "" + } +} \ No newline at end of file diff --git a/hdfsreader/doc/hdfsreader.md b/hdfsreader/doc/hdfsreader.md new file mode 100644 index 0000000000..cd83c530e7 --- /dev/null +++ b/hdfsreader/doc/hdfsreader.md @@ -0,0 +1,369 @@ +# DataX HdfsReader 插件文档 + + +------------ + +## 1 快速介绍 + +HdfsReader提供了读取分布式文件系统数据存储的能力。在底层实现上,HdfsReader获取分布式文件系统上文件的数据,并转换为DataX传输协议传递给Writer。 + +**目前HdfsReader支持的文件格式有textfile(text)、orcfile(orc)、rcfile(rc)、sequence file(seq)和普通逻辑二维表(csv)类型格式的文件,且文件内容存放的必须是一张逻辑意义上的二维表。** + +**HdfsReader需要Jdk1.7及以上版本的支持。** + + +## 2 功能与限制 + +HdfsReader实现了从Hadoop分布式文件系统Hdfs中读取文件数据并转为DataX协议的功能。textfile是Hive建表时默认使用的存储格式,数据不做压缩,本质上textfile就是以文本的形式将数据存放在hdfs中,对于DataX而言,HdfsReader实现上类比TxtFileReader,有诸多相似之处。orcfile,它的全名是Optimized Row Columnar file,是对RCFile做了优化。据官方文档介绍,这种文件格式可以提供一种高效的方法来存储Hive数据。HdfsReader利用Hive提供的OrcSerde类,读取解析orcfile文件的数据。目前HdfsReader支持的功能如下: + +1. 支持textfile、orcfile、rcfile、sequence file和csv格式的文件,且要求文件内容存放的是一张逻辑意义上的二维表。 + +2. 支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +3. 支持递归读取、支持正则表达式("*"和"?")。 + +4. 支持orcfile数据压缩,目前支持SNAPPY,ZLIB两种压缩方式。 + +5. 多个File可以支持并发读取。 + +6. 支持sequence file数据压缩,目前支持lzo压缩方式。 + +7. csv类型支持压缩格式有:gzip、bz2、zip、lzo、lzo_deflate、snappy。 + +8. 目前插件中Hive版本为1.1.1,Hadoop版本为2.7.1(Apache[为适配JDK1.7],在Hadoop 2.5.0, Hadoop 2.6.0 和Hive 1.2.0测试环境中写入正常;其它版本需后期进一步测试; + +9. 支持kerberos认证(注意:如果用户需要进行kerberos认证,那么用户使用的Hadoop集群版本需要和hdfsreader的Hadoop版本保持一致,如果高于hdfsreader的Hadoop版本,不保证kerberos认证有效) + +我们暂时不能做到: + +1. 单个File支持多线程并发读取,这里涉及到单个File内部切分算法。二期考虑支持。 +2. 
目前还不支持hdfs HA; + + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 3 + } + }, + "content": [ + { + "reader": { + "name": "hdfsreader", + "parameter": { + "path": "/user/hive/warehouse/mytable01/*", + "defaultFS": "hdfs://xxx:port", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "type": "string", + "value": "hello" + }, + { + "index": 2, + "type": "double" + } + ], + "fileType": "orc", + "encoding": "UTF-8", + "fieldDelimiter": "," + } + + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": true + } + } + } + ] + } +} +``` + +### 3.2 参数说明(各个配置项值前后不允许有空格) + +* **path** + + * 描述:要读取的文件路径,如果要读取多个文件,可以使用正则表达式"*",注意这里可以支持填写多个路径。。
+ + 当指定单个Hdfs文件,HdfsReader暂时只能使用单线程进行数据抽取。二期考虑在非压缩文件情况下针对单个File可以进行多线程并发读取。 + + 当指定多个Hdfs文件,HdfsReader支持使用多线程进行数据抽取。线程并发数通过通道数指定。 + + 当指定通配符,HdfsReader尝试遍历出多个文件信息。例如: 指定/*代表读取/目录下所有的文件,指定/bazhen/\*代表读取bazhen目录下游所有的文件。HdfsReader目前只支持"*"和"?"作为文件通配符。 + + **特别需要注意的是,DataX会将一个作业下同步的所有的文件视作同一张数据表。用户必须自己保证所有的File能够适配同一套schema信息。并且提供给DataX权限可读。** + + + * 必选:是
+ + * 默认值:无
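+
+下面是 path 的一个配置示意片段,展示以数组形式同时指定多个目录(目录名均为假设的示例值,请替换为实际路径):
+
+```json
+"path": [
+    "/user/hive/warehouse/mytable01/*",
+    "/user/hive/warehouse/mytable02/20150820/*"
+]
+```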
+ +* **defaultFS** + + * 描述:Hadoop hdfs文件系统namenode节点地址。
+ + + **目前HdfsReader已经支持Kerberos认证,如果需要权限认证,则需要用户配置kerberos参数,见下面** + + + * 必选:是
+ + * 默认值:无
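+
+示意片段如下(主机名与端口均为假设值,请替换为实际的 namenode 地址):
+
+```json
+"defaultFS": "hdfs://namenode-host:8020"
+```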
+ +* **fileType** + + * 描述:文件的类型,目前只支持用户配置为"text"、"orc"、"rc"、"seq"、"csv"。
+ + text表示textfile文件格式 + + orc表示orcfile文件格式 + + rc表示rcfile文件格式 + + seq表示sequence file文件格式 + + csv表示普通hdfs文件格式(逻辑二维表) + + **特别需要注意的是,HdfsReader能够自动识别文件是orcfile、textfile或者还是其它类型的文件,但该项是必填项,HdfsReader则会只读取用户配置的类型的文件,忽略路径下其他格式的文件** + + **另外需要注意的是,由于textfile和orcfile是两种完全不同的文件格式,所以HdfsReader对这两种文件的解析方式也存在差异,这种差异导致hive支持的复杂复合类型(比如map,array,struct,union)在转换为DataX支持的String类型时,转换的结果格式略有差异,比如以map类型为例:** + + orcfile map类型经hdfsreader解析转换成datax支持的string类型后,结果为"{job=80, team=60, person=70}" + + textfile map类型经hdfsreader解析转换成datax支持的string类型后,结果为"job:80,team:60,person:70" + + 从上面的转换结果可以看出,数据本身没有变化,但是表示的格式略有差异,所以如果用户配置的文件路径中要同步的字段在Hive中是复合类型的话,建议配置统一的文件格式。 + + **如果需要统一复合类型解析出来的格式,我们建议用户在hive客户端将textfile格式的表导成orcfile格式的表** + + * 必选:是
+ + * 默认值:无
+ + +* **column** + + * 描述:读取字段列表,type指定源数据的类型,index指定当前列来自于文本第几列(以0开始),value指定当前类型为常量,不从源头文件读取数据,而是根据value值自动生成对应的列。
+ + 默认情况下,用户可以全部按照String类型读取数据,配置如下: + + ```json + "column": ["*"] + ``` + + 用户可以指定Column字段信息,配置如下: + + ```json +{ + "type": "long", + "index": 0 //从本地文件文本第一列获取int字段 +}, +{ + "type": "string", + "value": "alibaba" //HdfsReader内部生成alibaba的字符串字段作为当前字段 +} + ``` + + 对于用户指定Column信息,type必须填写,index/value必须选择其一。 + + * 必选:是
+ + * 默认值:全部按照string类型读取
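+
+下面是一个同时包含 index 列与常量列的 column 配置示意(类型与常量值仅作举例):
+
+```json
+"column": [
+    { "index": 0, "type": "long" },
+    { "index": 1, "type": "string" },
+    { "value": "hello", "type": "string" }
+]
+```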
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + **另外需要注意的是,HdfsReader在读取textfile数据时,需要指定字段分割符,如果不指定默认为',',HdfsReader在读取orcfile时,用户无需指定字段分割符** + + * 必选:否
+ + * 默认值:,
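+
+例如读取以制表符分隔的 textfile,可参考如下示意片段(分隔符取值仅作举例):
+
+```json
+"fileType": "text",
+"fieldDelimiter": "\t"
+```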
+ + +* **encoding** + + * 描述:读取文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
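+
+示意片段如下(此处假设源文件为 GBK 编码,仅作举例):
+
+```json
+"encoding": "GBK"
+```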
+ + +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat:"\\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:无
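+
+对应上文 \N 的例子,配置片段示意如下:
+
+```json
+"nullFormat": "\\N"
+```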
+ +* **haveKerberos** + + * 描述:是否有Kerberos认证,默认false
+ + 如果用户配置为true,则配置项kerberosKeytabFilePath,kerberosPrincipal为必填。 + + * 必选:否
+ + * 默认值:false
+ +* **kerberosKeytabFilePath** + + * 描述:Kerberos认证 keytab文件路径,绝对路径
+ + * 必选:haveKerberos 为true必选
+ + * 默认值:无
+ +* **kerberosPrincipal** + + * 描述:Kerberos认证Principal名,如xxxx/hadoopclient@xxx.xxx
+ + * 必选:haveKerberos 为true必选
+ + * 默认值:无
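+
+开启 Kerberos 认证时,上述三个配置项通常一起填写,示意片段如下(keytab 路径与 principal 均为假设值):
+
+```json
+"haveKerberos": true,
+"kerberosKeytabFilePath": "/etc/security/keytabs/datax.keytab",
+"kerberosPrincipal": "datax/hadoopclient@EXAMPLE.COM"
+```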
+ +* **compress** + + * 描述:当fileType(文件类型)为csv下的文件压缩方式,目前仅支持 gzip、bz2、zip、lzo、lzo_deflate、hadoop-snappy、framing-snappy压缩;**值得注意的是,lzo存在两种压缩格式:lzo和lzo_deflate,用户在配置的时候需要留心,不要配错了;另外,由于snappy目前没有统一的stream format,datax目前只支持最主流的两种:hadoop-snappy(hadoop上的snappy stream format)和framing-snappy(google建议的snappy stream format)**;orc文件类型下无需填写。
+ + * 必选:否
+ + * 默认值:无
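+
+例如读取 gzip 压缩的 csv 文件,可参考如下示意片段:
+
+```json
+"fileType": "csv",
+"compress": "gzip"
+```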
+ +* **hadoopConfig** + + * 描述:hadoopConfig里可以配置与Hadoop相关的一些高级参数,比如HA的配置。
+ + ```json + "hadoopConfig":{ + "dfs.nameservices": "testDfs", + "dfs.ha.namenodes.testDfs": "namenode1,namenode2", +        "dfs.namenode.rpc-address.aliDfs.namenode1": "", + "dfs.namenode.rpc-address.aliDfs.namenode2": "", + "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" + } + ``` + + * 必选:否
+ + * 默认值:无
+ +* **csvReaderConfig** + + * 描述:读取CSV类型文件参数配置,Map类型。读取CSV类型文件使用的CsvReader进行读取,会有很多配置,不配置则使用默认值。
+ + * 必选:否
+ + * 默认值:无
+ + +常见配置: + +```json +"csvReaderConfig":{ + "safetySwitch": false, + "skipEmptyRecords": false, + "useTextQualifier": false +} +``` + +所有配置项及默认值,配置时 csvReaderConfig 的map中请**严格按照以下字段名字进行配置**: + +``` +boolean caseSensitive = true; +char textQualifier = 34; +boolean trimWhitespace = true; +boolean useTextQualifier = true;//是否使用csv转义字符 +char delimiter = 44;//分隔符 +char recordDelimiter = 0; +char comment = 35; +boolean useComments = false; +int escapeMode = 1; +boolean safetySwitch = true;//单列长度是否限制100000字符 +boolean skipEmptyRecords = true;//是否跳过空行 +boolean captureRawRecord = true; +``` + +### 3.3 类型转换 + +由于textfile和orcfile文件表的元数据信息由Hive维护并存放在Hive自己维护的数据库(如mysql)中,目前HdfsReader不支持对Hive元数 + +据数据库进行访问查询,因此用户在进行类型转换的时候,必须指定数据类型,如果用户配置的column为"*",则所有column默认转换为 + +string类型。HdfsReader提供了类型转换的建议表如下: + +| DataX 内部类型| Hive表 数据类型 | +| -------- | ----- | +| Long |TINYINT,SMALLINT,INT,BIGINT| +| Double |FLOAT,DOUBLE| +| String |String,CHAR,VARCHAR,STRUCT,MAP,ARRAY,UNION,BINARY| +| Boolean |BOOLEAN| +| Date |Date,TIMESTAMP| + +其中: + +* Long是指Hdfs文件文本中使用整形的字符串表示形式,例如"123456789"。 +* Double是指Hdfs文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* Boolean是指Hdfs文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* Date是指Hdfs文件文本中使用Date的字符串表示形式,例如"2014-12-31"。 + +特别提醒: + +* Hive支持的数据类型TIMESTAMP可以精确到纳秒级别,所以textfile、orcfile中TIMESTAMP存放的数据类似于"2015-08-21 22:40:47.397898389",如果转换的类型配置为DataX的Date,转换之后会导致纳秒部分丢失,所以如果需要保留纳秒部分的数据,请配置转换类型为DataX的String类型。 + + +### 3.4 按分区读取 + +Hive在建表的时候,可以指定分区partition,例如创建分区partition(day="20150820",hour="09"),对应的hdfs文件系统中,相应的表的目录下则会多出/20150820和/09两个目录,且/20150820是/09的父目录。了解了分区都会列成相应的目录结构,在按照某个分区读取某个表所有数据时,则只需配置好json中path的值即可。 + +比如需要读取表名叫mytable01下分区day为20150820这一天的所有数据,则配置如下: + +```json +"path": "/user/hive/warehouse/mytable01/20150820/*" +``` + + +## 4 性能报告 + + + +## 5 约束限制 + +略 + +## 6 FAQ + +1. 
如果报java.io.IOException: Maximum column length of 100,000 exceeded in column...异常信息,说明数据源column字段长度超过了100000字符。 + + 需要在json的reader里增加如下配置 + ```json + "csvReaderConfig":{ + "safetySwitch": false, + "skipEmptyRecords": false, + "useTextQualifier": false + } + ``` + safetySwitch = false;//单列长度不限制100000字符 + diff --git a/hdfsreader/pom.xml b/hdfsreader/pom.xml new file mode 100644 index 0000000000..b3c91a5ec4 --- /dev/null +++ b/hdfsreader/pom.xml @@ -0,0 +1,126 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hdfsreader + com.alibaba.datax + 0.0.1-SNAPSHOT + jar + + 1.1.1 + 2.7.1 + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.apache.hadoop + hadoop-hdfs + ${hadoop.version} + + + org.apache.hadoop + hadoop-common + ${hadoop.version} + + + org.apache.hadoop + hadoop-yarn-common + ${hadoop.version} + + + org.apache.hadoop + hadoop-mapreduce-client-core + ${hadoop.version} + + + + org.apache.hive + hive-exec + ${hive.version} + + + org.apache.hive + hive-serde + ${hive.version} + + + org.apache.hive + hive-service + ${hive.version} + + + org.apache.hive + hive-common + ${hive.version} + + + org.apache.hive.hcatalog + hive-hcatalog-core + ${hive.version} + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/hdfsreader/src/main/assembly/package.xml b/hdfsreader/src/main/assembly/package.xml new file mode 100644 index 0000000000..3f1393b764 --- /dev/null +++ b/hdfsreader/src/main/assembly/package.xml @@ -0,0 +1,49 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/hdfsreader + + + target/ + + hdfsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/hdfsreader + + + + + + + + + + + + + + + + + + + + false + plugin/reader/hdfsreader/libs + runtime + + + diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Constant.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Constant.java new file mode 100644 index 0000000000..6bfb9bf7e5 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Constant.java @@ -0,0 +1,13 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +/** + * Created by mingya.wmy on 2015/8/14. 
+ */ +public class Constant { + public static final String SOURCE_FILES = "sourceFiles"; + public static final String TEXT = "TEXT"; + public static final String ORC = "ORC"; + public static final String CSV = "CSV"; + public static final String SEQ = "SEQ"; + public static final String RC = "RC"; +} diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/DFSUtil.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/DFSUtil.java new file mode 100644 index 0000000000..364dfeadf9 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/DFSUtil.java @@ -0,0 +1,697 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.ColumnEntry; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderErrorCode; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONObject; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hive.ql.io.RCFile; +import org.apache.hadoop.hive.ql.io.RCFileRecordReader; +import org.apache.hadoop.hive.ql.io.orc.OrcFile; +import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat; +import org.apache.hadoop.hive.ql.io.orc.OrcSerde; +import org.apache.hadoop.hive.ql.io.orc.Reader; +import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable; +import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable; +import org.apache.hadoop.hive.serde2.objectinspector.StructField; +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; +import org.apache.hadoop.io.*; +import org.apache.hadoop.mapred.*; +import org.apache.hadoop.security.UserGroupInformation; +import org.apache.hadoop.util.ReflectionUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.io.InputStream; +import java.nio.ByteBuffer; +import java.text.SimpleDateFormat; +import java.util.*; + +/** + * Created by mingya.wmy on 2015/8/12. 
+ */ +public class DFSUtil { + private static final Logger LOG = LoggerFactory.getLogger(HdfsReader.Job.class); + + private org.apache.hadoop.conf.Configuration hadoopConf = null; + private String specifiedFileType = null; + private Boolean haveKerberos = false; + private String kerberosKeytabFilePath; + private String kerberosPrincipal; + + + private static final int DIRECTORY_SIZE_GUESS = 16 * 1024; + + public static final String HDFS_DEFAULTFS_KEY = "fs.defaultFS"; + public static final String HADOOP_SECURITY_AUTHENTICATION_KEY = "hadoop.security.authentication"; + + + public DFSUtil(Configuration taskConfig) { + hadoopConf = new org.apache.hadoop.conf.Configuration(); + //io.file.buffer.size 性能参数 + //http://blog.csdn.net/yangjl38/article/details/7583374 + Configuration hadoopSiteParams = taskConfig.getConfiguration(Key.HADOOP_CONFIG); + JSONObject hadoopSiteParamsAsJsonObject = JSON.parseObject(taskConfig.getString(Key.HADOOP_CONFIG)); + if (null != hadoopSiteParams) { + Set paramKeys = hadoopSiteParams.getKeys(); + for (String each : paramKeys) { + hadoopConf.set(each, hadoopSiteParamsAsJsonObject.getString(each)); + } + } + hadoopConf.set(HDFS_DEFAULTFS_KEY, taskConfig.getString(Key.DEFAULT_FS)); + + //是否有Kerberos认证 + this.haveKerberos = taskConfig.getBool(Key.HAVE_KERBEROS, false); + if (haveKerberos) { + this.kerberosKeytabFilePath = taskConfig.getString(Key.KERBEROS_KEYTAB_FILE_PATH); + this.kerberosPrincipal = taskConfig.getString(Key.KERBEROS_PRINCIPAL); + this.hadoopConf.set(HADOOP_SECURITY_AUTHENTICATION_KEY, "kerberos"); + } + this.kerberosAuthentication(this.kerberosPrincipal, this.kerberosKeytabFilePath); + + LOG.info(String.format("hadoopConfig details:%s", JSON.toJSONString(this.hadoopConf))); + } + + private void kerberosAuthentication(String kerberosPrincipal, String kerberosKeytabFilePath) { + if (haveKerberos && StringUtils.isNotBlank(this.kerberosPrincipal) && StringUtils.isNotBlank(this.kerberosKeytabFilePath)) { + UserGroupInformation.setConfiguration(this.hadoopConf); + try { + UserGroupInformation.loginUserFromKeytab(kerberosPrincipal, kerberosKeytabFilePath); + } catch (Exception e) { + String message = String.format("kerberos认证失败,请确定kerberosKeytabFilePath[%s]和kerberosPrincipal[%s]填写正确", + kerberosKeytabFilePath, kerberosPrincipal); + throw DataXException.asDataXException(HdfsReaderErrorCode.KERBEROS_LOGIN_ERROR, message, e); + } + } + } + + /** + * 获取指定路径列表下符合条件的所有文件的绝对路径 + * + * @param srcPaths 路径列表 + * @param specifiedFileType 指定文件类型 + */ + public HashSet getAllFiles(List srcPaths, String specifiedFileType) { + + this.specifiedFileType = specifiedFileType; + + if (!srcPaths.isEmpty()) { + for (String eachPath : srcPaths) { + LOG.info(String.format("get HDFS all files in path = [%s]", eachPath)); + getHDFSAllFiles(eachPath); + } + } + return sourceHDFSAllFilesList; + } + + private HashSet sourceHDFSAllFilesList = new HashSet(); + + public HashSet getHDFSAllFiles(String hdfsPath) { + + try { + FileSystem hdfs = FileSystem.get(hadoopConf); + //判断hdfsPath是否包含正则符号 + if (hdfsPath.contains("*") || hdfsPath.contains("?")) { + Path path = new Path(hdfsPath); + FileStatus stats[] = hdfs.globStatus(path); + for (FileStatus f : stats) { + if (f.isFile()) { + if (f.getLen() == 0) { + String message = String.format("文件[%s]长度为0,将会跳过不作处理!", hdfsPath); + LOG.warn(message); + } else { + addSourceFileByType(f.getPath().toString()); + } + } else if (f.isDirectory()) { + getHDFSAllFilesNORegex(f.getPath().toString(), hdfs); + } + } + } else { + getHDFSAllFilesNORegex(hdfsPath, 
hdfs); + } + + return sourceHDFSAllFilesList; + + } catch (IOException e) { + String message = String.format("无法读取路径[%s]下的所有文件,请确认您的配置项fs.defaultFS, path的值是否正确," + + "是否有读写权限,网络是否已断开!", hdfsPath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.PATH_CONFIG_ERROR, e); + } + } + + private HashSet getHDFSAllFilesNORegex(String path, FileSystem hdfs) throws IOException { + + // 获取要读取的文件的根目录 + Path listFiles = new Path(path); + + // If the network disconnected, this method will retry 45 times + // each time the retry interval for 20 seconds + // 获取要读取的文件的根目录的所有二级子文件目录 + FileStatus stats[] = hdfs.listStatus(listFiles); + + for (FileStatus f : stats) { + // 判断是不是目录,如果是目录,递归调用 + if (f.isDirectory()) { + LOG.info(String.format("[%s] 是目录, 递归获取该目录下的文件", f.getPath().toString())); + getHDFSAllFilesNORegex(f.getPath().toString(), hdfs); + } else if (f.isFile()) { + + addSourceFileByType(f.getPath().toString()); + } else { + String message = String.format("该路径[%s]文件类型既不是目录也不是文件,插件自动忽略。", + f.getPath().toString()); + LOG.info(message); + } + } + return sourceHDFSAllFilesList; + } + + // 根据用户指定的文件类型,将指定的文件类型的路径加入sourceHDFSAllFilesList + private void addSourceFileByType(String filePath) { + // 检查file的类型和用户配置的fileType类型是否一致 + boolean isMatchedFileType = checkHdfsFileType(filePath, this.specifiedFileType); + + if (isMatchedFileType) { + LOG.info(String.format("[%s]是[%s]类型的文件, 将该文件加入source files列表", filePath, this.specifiedFileType)); + sourceHDFSAllFilesList.add(filePath); + } else { + String message = String.format("文件[%s]的类型与用户配置的fileType类型不一致," + + "请确认您配置的目录下面所有文件的类型均为[%s]" + , filePath, this.specifiedFileType); + LOG.error(message); + throw DataXException.asDataXException( + HdfsReaderErrorCode.FILE_TYPE_UNSUPPORT, message); + } + } + + public InputStream getInputStream(String filepath) { + InputStream inputStream; + Path path = new Path(filepath); + try { + FileSystem fs = FileSystem.get(hadoopConf); + //If the network disconnected, this method will retry 45 times + //each time the retry interval for 20 seconds + inputStream = fs.open(path); + return inputStream; + } catch (IOException e) { + String message = String.format("读取文件 : [%s] 时出错,请确认文件:[%s]存在且配置的用户有权限读取", filepath, filepath); + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message, e); + } + } + + public void sequenceFileStartRead(String sourceSequenceFilePath, Configuration readerSliceConfig, + RecordSender recordSender, TaskPluginCollector taskPluginCollector) { + LOG.info(String.format("Start Read sequence file [%s].", sourceSequenceFilePath)); + + Path seqFilePath = new Path(sourceSequenceFilePath); + SequenceFile.Reader reader = null; + try { + //获取SequenceFile.Reader实例 + reader = new SequenceFile.Reader(this.hadoopConf, + SequenceFile.Reader.file(seqFilePath)); + //获取key 与 value + Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), this.hadoopConf); + Text value = new Text(); + while (reader.next(key, value)) { + if (StringUtils.isNotBlank(value.toString())) { + UnstructuredStorageReaderUtil.transportOneRecord(recordSender, + readerSliceConfig, taskPluginCollector, value.toString()); + } + } + } catch (Exception e) { + String message = String.format("SequenceFile.Reader读取文件[%s]时出错", sourceSequenceFilePath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_SEQUENCEFILE_ERROR, message, e); + } finally { + IOUtils.closeStream(reader); + LOG.info("Finally, Close stream SequenceFile.Reader."); + } + + } + + public void 
rcFileStartRead(String sourceRcFilePath, Configuration readerSliceConfig, + RecordSender recordSender, TaskPluginCollector taskPluginCollector) { + LOG.info(String.format("Start Read rcfile [%s].", sourceRcFilePath)); + List column = UnstructuredStorageReaderUtil + .getListColumnEntry(readerSliceConfig, com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + // warn: no default value '\N' + String nullFormat = readerSliceConfig.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.NULL_FORMAT); + + Path rcFilePath = new Path(sourceRcFilePath); + FileSystem fs = null; + RCFileRecordReader recordReader = null; + try { + fs = FileSystem.get(rcFilePath.toUri(), hadoopConf); + long fileLen = fs.getFileStatus(rcFilePath).getLen(); + FileSplit split = new FileSplit(rcFilePath, 0, fileLen, (String[]) null); + recordReader = new RCFileRecordReader(hadoopConf, split); + LongWritable key = new LongWritable(); + BytesRefArrayWritable value = new BytesRefArrayWritable(); + Text txt = new Text(); + while (recordReader.next(key, value)) { + String[] sourceLine = new String[value.size()]; + txt.clear(); + for (int i = 0; i < value.size(); i++) { + BytesRefWritable v = value.get(i); + txt.set(v.getData(), v.getStart(), v.getLength()); + sourceLine[i] = txt.toString(); + } + UnstructuredStorageReaderUtil.transportOneRecord(recordSender, + column, sourceLine, nullFormat, taskPluginCollector); + } + + } catch (IOException e) { + String message = String.format("读取文件[%s]时出错", sourceRcFilePath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_RCFILE_ERROR, message, e); + } finally { + try { + if (recordReader != null) { + recordReader.close(); + LOG.info("Finally, Close RCFileRecordReader."); + } + } catch (IOException e) { + LOG.warn(String.format("finally: 关闭RCFileRecordReader失败, %s", e.getMessage())); + } + } + + } + + public void orcFileStartRead(String sourceOrcFilePath, Configuration readerSliceConfig, + RecordSender recordSender, TaskPluginCollector taskPluginCollector) { + LOG.info(String.format("Start Read orcfile [%s].", sourceOrcFilePath)); + List column = UnstructuredStorageReaderUtil + .getListColumnEntry(readerSliceConfig, com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + String nullFormat = readerSliceConfig.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.NULL_FORMAT); + StringBuilder allColumns = new StringBuilder(); + StringBuilder allColumnTypes = new StringBuilder(); + boolean isReadAllColumns = false; + int columnIndexMax = -1; + // 判断是否读取所有列 + if (null == column || column.size() == 0) { + int allColumnsCount = getAllColumnsCount(sourceOrcFilePath); + columnIndexMax = allColumnsCount - 1; + isReadAllColumns = true; + } else { + columnIndexMax = getMaxIndex(column); + } + for (int i = 0; i <= columnIndexMax; i++) { + allColumns.append("col"); + allColumnTypes.append("string"); + if (i != columnIndexMax) { + allColumns.append(","); + allColumnTypes.append(":"); + } + } + if (columnIndexMax >= 0) { + JobConf conf = new JobConf(hadoopConf); + Path orcFilePath = new Path(sourceOrcFilePath); + Properties p = new Properties(); + p.setProperty("columns", allColumns.toString()); + p.setProperty("columns.types", allColumnTypes.toString()); + try { + OrcSerde serde = new OrcSerde(); + serde.initialize(conf, p); + StructObjectInspector inspector = (StructObjectInspector) serde.getObjectInspector(); + InputFormat in = new OrcInputFormat(); + FileInputFormat.setInputPaths(conf, orcFilePath.toString()); + + //If 
the network disconnected, will retry 45 times, each time the retry interval for 20 seconds + //Each file as a split + //TODO multy threads + InputSplit[] splits = in.getSplits(conf, 1); + + RecordReader reader = in.getRecordReader(splits[0], conf, Reporter.NULL); + Object key = reader.createKey(); + Object value = reader.createValue(); + // 获取列信息 + List fields = inspector.getAllStructFieldRefs(); + + List recordFields; + while (reader.next(key, value)) { + recordFields = new ArrayList(); + + for (int i = 0; i <= columnIndexMax; i++) { + Object field = inspector.getStructFieldData(value, fields.get(i)); + recordFields.add(field); + } + transportOneRecord(column, recordFields, recordSender, + taskPluginCollector, isReadAllColumns, nullFormat); + } + reader.close(); + } catch (Exception e) { + String message = String.format("从orcfile文件路径[%s]中读取数据发生异常,请联系系统管理员。" + , sourceOrcFilePath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message); + } + } else { + String message = String.format("请确认您所读取的列配置正确!columnIndexMax 小于0,column:%s", JSON.toJSONString(column)); + throw DataXException.asDataXException(HdfsReaderErrorCode.BAD_CONFIG_VALUE, message); + } + } + + private Record transportOneRecord(List columnConfigs, List recordFields + , RecordSender recordSender, TaskPluginCollector taskPluginCollector, boolean isReadAllColumns, String nullFormat) { + Record record = recordSender.createRecord(); + Column columnGenerated; + try { + if (isReadAllColumns) { + // 读取所有列,创建都为String类型的column + for (Object recordField : recordFields) { + String columnValue = null; + if (recordField != null) { + columnValue = recordField.toString(); + } + columnGenerated = new StringColumn(columnValue); + record.addColumn(columnGenerated); + } + } else { + for (ColumnEntry columnConfig : columnConfigs) { + String columnType = columnConfig.getType(); + Integer columnIndex = columnConfig.getIndex(); + String columnConst = columnConfig.getValue(); + + String columnValue = null; + + if (null != columnIndex) { + if (null != recordFields.get(columnIndex)) + columnValue = recordFields.get(columnIndex).toString(); + } else { + columnValue = columnConst; + } + Type type = Type.valueOf(columnType.toUpperCase()); + // it's all ok if nullFormat is null + if (StringUtils.equals(columnValue, nullFormat)) { + columnValue = null; + } + switch (type) { + case STRING: + columnGenerated = new StringColumn(columnValue); + break; + case LONG: + try { + columnGenerated = new LongColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "LONG")); + } + break; + case DOUBLE: + try { + columnGenerated = new DoubleColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DOUBLE")); + } + break; + case BOOLEAN: + try { + columnGenerated = new BoolColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "BOOLEAN")); + } + + break; + case DATE: + try { + if (columnValue == null) { + columnGenerated = new DateColumn((Date) null); + } else { + String formatString = columnConfig.getFormat(); + if (StringUtils.isNotBlank(formatString)) { + // 用户自己配置的格式转换 + SimpleDateFormat format = new SimpleDateFormat( + formatString); + columnGenerated = new DateColumn( + format.parse(columnValue)); + } else { + // 框架尝试转换 + columnGenerated = new DateColumn( + new 
StringColumn(columnValue) + .asDate()); + } + } + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DATE")); + } + break; + default: + String errorMessage = String.format( + "您配置的列类型暂不支持 : [%s]", columnType); + LOG.error(errorMessage); + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.NOT_SUPPORT_TYPE, + errorMessage); + } + + record.addColumn(columnGenerated); + } + } + recordSender.sendToWriter(record); + } catch (IllegalArgumentException iae) { + taskPluginCollector + .collectDirtyRecord(record, iae.getMessage()); + } catch (IndexOutOfBoundsException ioe) { + taskPluginCollector + .collectDirtyRecord(record, ioe.getMessage()); + } catch (Exception e) { + if (e instanceof DataXException) { + throw (DataXException) e; + } + // 每一种转换失败都是脏数据处理,包括数字格式 & 日期格式 + taskPluginCollector.collectDirtyRecord(record, e.getMessage()); + } + + return record; + } + + private int getAllColumnsCount(String filePath) { + int columnsCount; + final String colFinal = "_col"; + Path path = new Path(filePath); + try { + Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(hadoopConf)); + String type_struct = reader.getObjectInspector().getTypeName(); + columnsCount = (type_struct.length() - type_struct.replace(colFinal, "").length()) + / colFinal.length(); + return columnsCount; + } catch (IOException e) { + String message = "读取orcfile column列数失败,请联系系统管理员"; + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message); + } + } + + private int getMaxIndex(List columnConfigs) { + int maxIndex = -1; + for (ColumnEntry columnConfig : columnConfigs) { + Integer columnIndex = columnConfig.getIndex(); + if (columnIndex != null && columnIndex < 0) { + String message = String.format("您column中配置的index不能小于0,请修改为正确的index,column配置:%s", + JSON.toJSONString(columnConfigs)); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.CONFIG_INVALID_EXCEPTION, message); + } else if (columnIndex != null && columnIndex > maxIndex) { + maxIndex = columnIndex; + } + } + return maxIndex; + } + + private enum Type { + STRING, LONG, BOOLEAN, DOUBLE, DATE, + } + + public boolean checkHdfsFileType(String filepath, String specifiedFileType) { + + Path file = new Path(filepath); + + try { + FileSystem fs = FileSystem.get(hadoopConf); + FSDataInputStream in = fs.open(file); + + if (StringUtils.equalsIgnoreCase(specifiedFileType, Constant.CSV) + || StringUtils.equalsIgnoreCase(specifiedFileType, Constant.TEXT)) { + + boolean isORC = isORCFile(file, fs, in);// 判断是否是 ORC File + if (isORC) { + return false; + } + boolean isRC = isRCFile(filepath, in);// 判断是否是 RC File + if (isRC) { + return false; + } + boolean isSEQ = isSequenceFile(filepath, in);// 判断是否是 Sequence File + if (isSEQ) { + return false; + } + // 如果不是ORC,RC和SEQ,则默认为是TEXT或CSV类型 + return !isORC && !isRC && !isSEQ; + + } else if (StringUtils.equalsIgnoreCase(specifiedFileType, Constant.ORC)) { + + return isORCFile(file, fs, in); + } else if (StringUtils.equalsIgnoreCase(specifiedFileType, Constant.RC)) { + + return isRCFile(filepath, in); + } else if (StringUtils.equalsIgnoreCase(specifiedFileType, Constant.SEQ)) { + + return isSequenceFile(filepath, in); + } + + } catch (Exception e) { + String message = String.format("检查文件[%s]类型失败,目前支持ORC,SEQUENCE,RCFile,TEXT,CSV五种格式的文件," + + "请检查您文件类型和文件是否正确。", filepath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.READ_FILE_ERROR, message, e); + } + 
return false; + } + + // 判断file是否是ORC File + private boolean isORCFile(Path file, FileSystem fs, FSDataInputStream in) { + try { + // figure out the size of the file using the option or filesystem + long size = fs.getFileStatus(file).getLen(); + + //read last bytes into buffer to get PostScript + int readSize = (int) Math.min(size, DIRECTORY_SIZE_GUESS); + in.seek(size - readSize); + ByteBuffer buffer = ByteBuffer.allocate(readSize); + in.readFully(buffer.array(), buffer.arrayOffset() + buffer.position(), + buffer.remaining()); + + //read the PostScript + //get length of PostScript + int psLen = buffer.get(readSize - 1) & 0xff; + int len = OrcFile.MAGIC.length(); + if (psLen < len + 1) { + return false; + } + int offset = buffer.arrayOffset() + buffer.position() + buffer.limit() - 1 + - len; + byte[] array = buffer.array(); + // now look for the magic string at the end of the postscript. + if (Text.decode(array, offset, len).equals(OrcFile.MAGIC)) { + return true; + } else { + // If it isn't there, this may be the 0.11.0 version of ORC. + // Read the first 3 bytes of the file to check for the header + in.seek(0); + byte[] header = new byte[len]; + in.readFully(header, 0, len); + // if it isn't there, this isn't an ORC file + if (Text.decode(header, 0, len).equals(OrcFile.MAGIC)) { + return true; + } + } + } catch (IOException e) { + LOG.info(String.format("检查文件类型: [%s] 不是ORC File.", file.toString())); + } + return false; + } + + // 判断file是否是RC file + private boolean isRCFile(String filepath, FSDataInputStream in) { + + // The first version of RCFile used the sequence file header. + final byte[] ORIGINAL_MAGIC = new byte[]{(byte) 'S', (byte) 'E', (byte) 'Q'}; + // The 'magic' bytes at the beginning of the RCFile + final byte[] RC_MAGIC = new byte[]{(byte) 'R', (byte) 'C', (byte) 'F'}; + // the version that was included with the original magic, which is mapped + // into ORIGINAL_VERSION + final byte ORIGINAL_MAGIC_VERSION_WITH_METADATA = 6; + // All of the versions should be place in this list. + final int ORIGINAL_VERSION = 0; // version with SEQ + final int NEW_MAGIC_VERSION = 1; // version with RCF + final int CURRENT_VERSION = NEW_MAGIC_VERSION; + byte version; + + byte[] magic = new byte[RC_MAGIC.length]; + try { + in.seek(0); + in.readFully(magic); + + if (Arrays.equals(magic, ORIGINAL_MAGIC)) { + byte vers = in.readByte(); + if (vers != ORIGINAL_MAGIC_VERSION_WITH_METADATA) { + return false; + } + version = ORIGINAL_VERSION; + } else { + if (!Arrays.equals(magic, RC_MAGIC)) { + return false; + } + + // Set 'version' + version = in.readByte(); + if (version > CURRENT_VERSION) { + return false; + } + } + + if (version == ORIGINAL_VERSION) { + try { + Class keyCls = hadoopConf.getClassByName(Text.readString(in)); + Class valCls = hadoopConf.getClassByName(Text.readString(in)); + if (!keyCls.equals(RCFile.KeyBuffer.class) + || !valCls.equals(RCFile.ValueBuffer.class)) { + return false; + } + } catch (ClassNotFoundException e) { + return false; + } + } + boolean decompress = in.readBoolean(); // is compressed? + if (version == ORIGINAL_VERSION) { + // is block-compressed? it should be always false. 
+ boolean blkCompressed = in.readBoolean(); + if (blkCompressed) { + return false; + } + } + return true; + } catch (IOException e) { + LOG.info(String.format("检查文件类型: [%s] 不是RC File.", filepath)); + } + return false; + } + + // 判断file是否是Sequence file + private boolean isSequenceFile(String filepath, FSDataInputStream in) { + byte[] SEQ_MAGIC = new byte[]{(byte) 'S', (byte) 'E', (byte) 'Q'}; + byte[] magic = new byte[SEQ_MAGIC.length]; + try { + in.seek(0); + in.readFully(magic); + if (Arrays.equals(magic, SEQ_MAGIC)) { + return true; + } else { + return false; + } + } catch (IOException e) { + LOG.info(String.format("检查文件类型: [%s] 不是Sequence File.", filepath)); + } + return false; + } + +} diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsFileType.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsFileType.java new file mode 100644 index 0000000000..43d13dffaa --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsFileType.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +/** + * Created by mingya.wmy on 2015/8/22. + * + */ +public enum HdfsFileType { + ORC, SEQ, RC, CSV, TEXT, +} diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReader.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReader.java new file mode 100644 index 0000000000..c953ef162e --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReader.java @@ -0,0 +1,303 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import org.apache.commons.io.Charsets; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.InputStream; +import java.nio.charset.UnsupportedCharsetException; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; + +public class HdfsReader extends Reader { + + /** + * Job 中的方法仅执行一次,Task 中方法会由框架启动多个 Task 线程并行执行。 + *

+ * 整个 Reader 执行流程是: + *

+     * Job类init-->prepare-->split
+     *
+     * Task类init-->prepare-->startRead-->post-->destroy
+     * Task类init-->prepare-->startRead-->post-->destroy
+     *
+     * Job类post-->destroy
+     * 
+ */ + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private Configuration readerOriginConfig = null; + private String encoding = null; + private HashSet sourceFiles; + private String specifiedFileType = null; + private DFSUtil dfsUtil = null; + private List path = null; + + @Override + public void init() { + + LOG.info("init() begin..."); + this.readerOriginConfig = super.getPluginJobConf(); + this.validate(); + dfsUtil = new DFSUtil(this.readerOriginConfig); + LOG.info("init() ok and end..."); + + } + + public void validate(){ + this.readerOriginConfig.getNecessaryValue(Key.DEFAULT_FS, + HdfsReaderErrorCode.DEFAULT_FS_NOT_FIND_ERROR); + + // path check + String pathInString = this.readerOriginConfig.getNecessaryValue(Key.PATH, HdfsReaderErrorCode.REQUIRED_VALUE); + if (!pathInString.startsWith("[") && !pathInString.endsWith("]")) { + path = new ArrayList(); + path.add(pathInString); + } else { + path = this.readerOriginConfig.getList(Key.PATH, String.class); + if (null == path || path.size() == 0) { + throw DataXException.asDataXException(HdfsReaderErrorCode.REQUIRED_VALUE, "您需要指定待读取的源目录或文件"); + } + for (String eachPath : path) { + if(!eachPath.startsWith("/")){ + String message = String.format("请检查参数path:[%s],需要配置为绝对路径", eachPath); + LOG.error(message); + throw DataXException.asDataXException(HdfsReaderErrorCode.ILLEGAL_VALUE, message); + } + } + } + + specifiedFileType = this.readerOriginConfig.getNecessaryValue(Key.FILETYPE, HdfsReaderErrorCode.REQUIRED_VALUE); + if( !specifiedFileType.equalsIgnoreCase(Constant.ORC) && + !specifiedFileType.equalsIgnoreCase(Constant.TEXT) && + !specifiedFileType.equalsIgnoreCase(Constant.CSV) && + !specifiedFileType.equalsIgnoreCase(Constant.SEQ) && + !specifiedFileType.equalsIgnoreCase(Constant.RC)){ + String message = "HdfsReader插件目前支持ORC, TEXT, CSV, SEQUENCE, RC五种格式的文件," + + "请将fileType选项的值配置为ORC, TEXT, CSV, SEQUENCE 或者 RC"; + throw DataXException.asDataXException(HdfsReaderErrorCode.FILE_TYPE_ERROR, message); + } + + encoding = this.readerOriginConfig.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, "UTF-8"); + + try { + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持的编码格式 : [%s]", encoding), uce); + } catch (Exception e) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.ILLEGAL_VALUE, + String.format("运行配置异常 : %s", e.getMessage()), e); + } + //check Kerberos + Boolean haveKerberos = this.readerOriginConfig.getBool(Key.HAVE_KERBEROS, false); + if(haveKerberos) { + this.readerOriginConfig.getNecessaryValue(Key.KERBEROS_KEYTAB_FILE_PATH, HdfsReaderErrorCode.REQUIRED_VALUE); + this.readerOriginConfig.getNecessaryValue(Key.KERBEROS_PRINCIPAL, HdfsReaderErrorCode.REQUIRED_VALUE); + } + + // validate the Columns + validateColumns(); + + if(this.specifiedFileType.equalsIgnoreCase(Constant.CSV)){ + //compress校验 + UnstructuredStorageReaderUtil.validateCompress(this.readerOriginConfig); + UnstructuredStorageReaderUtil.validateCsvReaderConfig(this.readerOriginConfig); + } + + } + + private void validateColumns(){ + + // 检测是column 是否为 ["*"] 若是则填为空 + List column = this.readerOriginConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + if (null != column + && 1 == column.size() + && ("\"*\"".equals(column.get(0).toString()) || "'*'" + .equals(column.get(0).toString()))) { + 
readerOriginConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN, new ArrayList()); + } else { + // column: 1. index type 2.value type 3.when type is Data, may have format + List columns = this.readerOriginConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 columns"); + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf.getNecessaryValue(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.TYPE, HdfsReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf.getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.INDEX); + String columnValue = eachColumnConf.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException( + HdfsReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + + } + } + } + } + + @Override + public void prepare() { + LOG.info("prepare(), start to getAllFiles..."); + this.sourceFiles = dfsUtil.getAllFiles(path, specifiedFileType); + LOG.info(String.format("您即将读取的文件数为: [%s], 列表为: [%s]", + this.sourceFiles.size(), + StringUtils.join(this.sourceFiles, ","))); + } + + @Override + public List split(int adviceNumber) { + + LOG.info("split() begin..."); + List readerSplitConfigs = new ArrayList(); + // warn:每个slice拖且仅拖一个文件, + // int splitNumber = adviceNumber; + int splitNumber = this.sourceFiles.size(); + if (0 == splitNumber) { + throw DataXException.asDataXException(HdfsReaderErrorCode.EMPTY_DIR_EXCEPTION, + String.format("未能找到待读取的文件,请确认您的配置项path: %s", this.readerOriginConfig.getString(Key.PATH))); + } + + List> splitedSourceFiles = this.splitSourceFiles(new ArrayList(this.sourceFiles), splitNumber); + for (List files : splitedSourceFiles) { + Configuration splitedConfig = this.readerOriginConfig.clone(); + splitedConfig.set(Constant.SOURCE_FILES, files); + readerSplitConfigs.add(splitedConfig); + } + + return readerSplitConfigs; + } + + + private List> splitSourceFiles(final List sourceList, int adviceNumber) { + List> splitedList = new ArrayList>(); + int averageLength = sourceList.size() / adviceNumber; + averageLength = averageLength == 0 ? 
1 : averageLength; + + for (int begin = 0, end = 0; begin < sourceList.size(); begin = end) { + end = begin + averageLength; + if (end > sourceList.size()) { + end = sourceList.size(); + } + splitedList.add(sourceList.subList(begin, end)); + } + return splitedList; + } + + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + } + + public static class Task extends Reader.Task { + + private static Logger LOG = LoggerFactory.getLogger(Reader.Task.class); + private Configuration taskConfig; + private List sourceFiles; + private String specifiedFileType; + private String encoding; + private DFSUtil dfsUtil = null; + private int bufferSize; + + @Override + public void init() { + + this.taskConfig = super.getPluginJobConf(); + this.sourceFiles = this.taskConfig.getList(Constant.SOURCE_FILES, String.class); + this.specifiedFileType = this.taskConfig.getNecessaryValue(Key.FILETYPE, HdfsReaderErrorCode.REQUIRED_VALUE); + this.encoding = this.taskConfig.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, "UTF-8"); + this.dfsUtil = new DFSUtil(this.taskConfig); + this.bufferSize = this.taskConfig.getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.BUFFER_SIZE, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_BUFFER_SIZE); + } + + @Override + public void prepare() { + + } + + @Override + public void startRead(RecordSender recordSender) { + + LOG.info("read start"); + for (String sourceFile : this.sourceFiles) { + LOG.info(String.format("reading file : [%s]", sourceFile)); + + if(specifiedFileType.equalsIgnoreCase(Constant.TEXT) + || specifiedFileType.equalsIgnoreCase(Constant.CSV)) { + + InputStream inputStream = dfsUtil.getInputStream(sourceFile); + UnstructuredStorageReaderUtil.readFromStream(inputStream, sourceFile, this.taskConfig, + recordSender, this.getTaskPluginCollector()); + }else if(specifiedFileType.equalsIgnoreCase(Constant.ORC)){ + + dfsUtil.orcFileStartRead(sourceFile, this.taskConfig, recordSender, this.getTaskPluginCollector()); + }else if(specifiedFileType.equalsIgnoreCase(Constant.SEQ)){ + + dfsUtil.sequenceFileStartRead(sourceFile, this.taskConfig, recordSender, this.getTaskPluginCollector()); + }else if(specifiedFileType.equalsIgnoreCase(Constant.RC)){ + + dfsUtil.rcFileStartRead(sourceFile, this.taskConfig, recordSender, this.getTaskPluginCollector()); + }else { + + String message = "HdfsReader插件目前支持ORC, TEXT, CSV, SEQUENCE, RC五种格式的文件," + + "请将fileType选项的值配置为ORC, TEXT, CSV, SEQUENCE 或者 RC"; + throw DataXException.asDataXException(HdfsReaderErrorCode.FILE_TYPE_UNSUPPORT, message); + } + + if(recordSender != null){ + recordSender.flush(); + } + } + + LOG.info("end read source files..."); + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + } + +} \ No newline at end of file diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReaderErrorCode.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReaderErrorCode.java new file mode 100644 index 0000000000..8dd3f37095 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/HdfsReaderErrorCode.java @@ -0,0 +1,47 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum HdfsReaderErrorCode implements ErrorCode { + BAD_CONFIG_VALUE("HdfsReader-00", "您配置的值不合法."), + PATH_NOT_FIND_ERROR("HdfsReader-01", "您未配置path值"), + 
DEFAULT_FS_NOT_FIND_ERROR("HdfsReader-02", "您未配置defaultFS值"), + ILLEGAL_VALUE("HdfsReader-03", "值错误"), + CONFIG_INVALID_EXCEPTION("HdfsReader-04", "参数配置错误"), + REQUIRED_VALUE("HdfsReader-05", "您缺失了必须填写的参数值."), + NO_INDEX_VALUE("HdfsReader-06","没有 Index" ), + MIXED_INDEX_VALUE("HdfsReader-07","index 和 value 混合" ), + EMPTY_DIR_EXCEPTION("HdfsReader-08", "您尝试读取的文件目录为空."), + PATH_CONFIG_ERROR("HdfsReader-09", "您配置的path格式有误"), + READ_FILE_ERROR("HdfsReader-10", "读取文件出错"), + MALFORMED_ORC_ERROR("HdfsReader-10", "ORCFILE格式异常"), + FILE_TYPE_ERROR("HdfsReader-11", "文件类型配置错误"), + FILE_TYPE_UNSUPPORT("HdfsReader-12", "文件类型目前不支持"), + KERBEROS_LOGIN_ERROR("HdfsReader-13", "KERBEROS认证失败"), + READ_SEQUENCEFILE_ERROR("HdfsReader-14", "读取SequenceFile文件出错"), + READ_RCFILE_ERROR("HdfsReader-15", "读取RCFile文件出错"),; + + private final String code; + private final String description; + + private HdfsReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} \ No newline at end of file diff --git a/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Key.java b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Key.java new file mode 100644 index 0000000000..7b985a8832 --- /dev/null +++ b/hdfsreader/src/main/java/com/alibaba/datax/plugin/reader/hdfsreader/Key.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.plugin.reader.hdfsreader; + +public final class Key { + + /** + * 此处声明插件用到的需要插件使用者提供的配置项 + */ + public final static String PATH = "path"; + public final static String DEFAULT_FS = "defaultFS"; + public static final String FILETYPE = "fileType"; + public static final String HADOOP_CONFIG = "hadoopConfig"; + public static final String HAVE_KERBEROS = "haveKerberos"; + public static final String KERBEROS_KEYTAB_FILE_PATH = "kerberosKeytabFilePath"; + public static final String KERBEROS_PRINCIPAL = "kerberosPrincipal"; +} diff --git a/hdfsreader/src/main/resources/plugin.json b/hdfsreader/src/main/resources/plugin.json new file mode 100644 index 0000000000..f3f5c7277c --- /dev/null +++ b/hdfsreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "hdfsreader", + "class": "com.alibaba.datax.plugin.reader.hdfsreader.HdfsReader", + "description": "useScene: test. mechanism: use datax framework to transport data from hdfs. 
warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/hdfsreader/src/main/resources/plugin_job_template.json b/hdfsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..d73427d303 --- /dev/null +++ b/hdfsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,11 @@ +{ + "name": "hdfsreader", + "parameter": { + "path": "", + "defaultFS": "", + "column": [], + "fileType": "orc", + "encoding": "UTF-8", + "fieldDelimiter": "," + } +} \ No newline at end of file diff --git a/hdfswriter/doc/hdfswriter.md b/hdfswriter/doc/hdfswriter.md new file mode 100644 index 0000000000..028a544e63 --- /dev/null +++ b/hdfswriter/doc/hdfswriter.md @@ -0,0 +1,393 @@ +# DataX HdfsWriter 插件文档 + + +------------ + +## 1 快速介绍 + +HdfsWriter提供向HDFS文件系统指定路径中写入TEXTFile文件和ORCFile文件,文件内容可与hive中表关联。 + + +## 2 功能与限制 + +* (1)、目前HdfsWriter仅支持textfile和orcfile两种格式的文件,且文件内容存放的必须是一张逻辑意义上的二维表; +* (2)、由于HDFS是文件系统,不存在schema的概念,因此不支持对部分列写入; +* (3)、目前仅支持与以下Hive数据类型: +数值型:TINYINT,SMALLINT,INT,BIGINT,FLOAT,DOUBLE +字符串类型:STRING,VARCHAR,CHAR +布尔类型:BOOLEAN +时间类型:DATE,TIMESTAMP +**目前不支持:decimal、binary、arrays、maps、structs、union类型**; +* (4)、对于Hive分区表目前仅支持一次写入单个分区; +* (5)、对于textfile需用户保证写入hdfs文件的分隔符**与在Hive上创建表时的分隔符一致**,从而实现写入hdfs数据与Hive表字段关联; +* (6)、HdfsWriter实现过程是:首先根据用户指定的path,创建一个hdfs文件系统上不存在的临时目录,创建规则:path_随机;然后将读取的文件写入这个临时目录;全部写入后再将这个临时目录下的文件移动到用户指定目录(在创建文件时保证文件名不重复); 最后删除临时目录。如果在中间过程发生网络中断等情况造成无法与hdfs建立连接,需要用户手动删除已经写入的文件和临时目录。 +* (7)、目前插件中Hive版本为1.1.1,Hadoop版本为2.7.1(Apache[为适配JDK1.7],在Hadoop 2.5.0, Hadoop 2.6.0 和Hive 1.2.0测试环境中写入正常;其它版本需后期进一步测试; +* (8)、目前HdfsWriter支持Kerberos认证(注意:如果用户需要进行kerberos认证,那么用户使用的Hadoop集群版本需要和hdfsreader的Hadoop版本保持一致,如果高于hdfsreader的Hadoop版本,不保证kerberos认证有效) + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": ["/Users/shf/workplace/txtWorkplace/job/dataorcfull.txt"], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "long" + }, + { + "index": 2, + "type": "long" + }, + { + "index": 3, + "type": "long" + }, + { + "index": 4, + "type": "DOUBLE" + }, + { + "index": 5, + "type": "DOUBLE" + }, + { + "index": 6, + "type": "STRING" + }, + { + "index": 7, + "type": "STRING" + }, + { + "index": 8, + "type": "STRING" + }, + { + "index": 9, + "type": "BOOLEAN" + }, + { + "index": 10, + "type": "date" + }, + { + "index": 11, + "type": "date" + } + ], + "fieldDelimiter": "\t" + } + }, + "writer": { + "name": "hdfswriter", + "parameter": { + "defaultFS": "hdfs://xxx:port", + "fileType": "orc", + "path": "/user/hive/warehouse/writerorc.db/orcfull", + "fileName": "xxxx", + "column": [ + { + "name": "col1", + "type": "TINYINT" + }, + { + "name": "col2", + "type": "SMALLINT" + }, + { + "name": "col3", + "type": "INT" + }, + { + "name": "col4", + "type": "BIGINT" + }, + { + "name": "col5", + "type": "FLOAT" + }, + { + "name": "col6", + "type": "DOUBLE" + }, + { + "name": "col7", + "type": "STRING" + }, + { + "name": "col8", + "type": "VARCHAR" + }, + { + "name": "col9", + "type": "CHAR" + }, + { + "name": "col10", + "type": "BOOLEAN" + }, + { + "name": "col11", + "type": "date" + }, + { + "name": "col12", + "type": "TIMESTAMP" + } + ], + "writeMode": "append", + "fieldDelimiter": "\t", + "compress":"NONE" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **defaultFS** + + * 描述:Hadoop 
hdfs文件系统namenode节点地址。格式:hdfs://ip:端口;例如:hdfs://127.0.0.1:9000
+
+    * 必选:是
+
+    * 默认值:无
+
+* **fileType**
+
+    * 描述:文件的类型,目前只支持用户配置为"text"或"orc"。
+
+    text表示textfile文件格式
+
+    orc表示orcfile文件格式
+
+    * 必选:是
+
+    * 默认值:无
+
+* **path**
+
+    * 描述:存储到Hadoop hdfs文件系统的路径信息,HdfsWriter会根据并发配置在path目录下写入多个文件。为与Hive表关联,请填写Hive表在hdfs上的存储路径。例:Hive上设置的数据仓库存储路径为/user/hive/warehouse/,已建立数据库test、表hello,则对应的存储路径为:/user/hive/warehouse/test.db/hello
+
+    * 必选:是
+
+    * 默认值:无
+
+* **fileName**
+
+    * 描述:HdfsWriter写入时的文件名前缀,实际执行时会在该文件名后添加随机后缀,作为每个线程实际写入的文件名。
+
+    * 必选:是
+
+    * 默认值:无
+
+* **column**
+
+    * 描述:写入数据的字段,不支持对部分列写入。为与Hive中的表关联,需要指定表中所有字段名和字段类型,其中name指定字段名,type指定字段类型。
+
+    用户可以指定column字段信息,配置如下:
+
+    ```json
+    "column":
+    [
+        {
+            "name": "userName",
+            "type": "string"
+        },
+        {
+            "name": "age",
+            "type": "long"
+        }
+    ]
+    ```
+
+    * 必选:是
+
+    * 默认值:无
+
+* **writeMode**
+
+    * 描述:hdfswriter写入前数据清理处理模式:
+
+    * append,写入前不做任何处理,DataX hdfswriter直接使用fileName写入,并保证文件名不冲突。
+    * nonConflict,如果目录下有fileName前缀的文件,直接报错。
+
+    * 必选:是
+
+    * 默认值:无
+
+* **fieldDelimiter**
+
+    * 描述:hdfswriter写入时的字段分隔符,**需要用户保证与创建的Hive表的字段分隔符一致,否则无法在Hive表中查到数据**。
+
+    * 必选:是
+
+    * 默认值:无
+
+* **compress**
+
+    * 描述:hdfs文件压缩类型,默认不填写意味着没有压缩。其中:text类型文件支持的压缩类型有gzip、bzip2;orc类型文件支持的压缩类型有NONE、SNAPPY(需要用户安装SnappyCodec)。
+
+    * 必选:否
+
+    * 默认值:无压缩
+
+* **hadoopConfig**
+
+    * 描述:hadoopConfig里可以配置与Hadoop相关的一些高级参数,比如HA的配置。注意rpc-address等属性名中的nameservice需与dfs.nameservices保持一致:
+
+    ```json
+    "hadoopConfig":{
+        "dfs.nameservices": "testDfs",
+        "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
+        "dfs.namenode.rpc-address.testDfs.namenode1": "",
+        "dfs.namenode.rpc-address.testDfs.namenode2": "",
+        "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
+    }
+    ```
+
+    * 必选:否
+
+    * 默认值:无
+
+* **encoding**
+
+    * 描述:写文件的编码配置。
+
+    * 必选:否
+
+    * 默认值:utf-8,**慎重修改**
+
+* **haveKerberos**
+
+    * 描述:是否开启Kerberos认证,默认为false。如果配置为true,则kerberosKeytabFilePath和kerberosPrincipal为必填项。
+
+    * 必选:否
+
+    * 默认值:false
+
+* **kerberosKeytabFilePath**
+
+    * 描述:Kerberos认证keytab文件路径,须为绝对路径。
+
+    * 必选:haveKerberos 为true时必选
+
+    * 默认值:无
+
+* **kerberosPrincipal**
+
+    * 描述:Kerberos认证Principal名,如xxxx/hadoopclient@xxx.xxx
+
+    * 必选:haveKerberos 为true时必选
+
+    * 默认值:无
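+
+下面给出一个将上述参数组合使用的hdfswriter配置片段,仅作示意:其中testDfs、path、fileName、keytab路径与principal等取值均为假设值,请按实际环境修改;以HA的nameservice作为defaultFS时需同时提供对应的hadoopConfig。
+
+```json
+"writer": {
+    "name": "hdfswriter",
+    "parameter": {
+        "defaultFS": "hdfs://testDfs",
+        "hadoopConfig": {
+            "dfs.nameservices": "testDfs",
+            "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
+            "dfs.namenode.rpc-address.testDfs.namenode1": "xxx:8020",
+            "dfs.namenode.rpc-address.testDfs.namenode2": "xxx:8020",
+            "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
+        },
+        "fileType": "text",
+        "path": "/user/hive/warehouse/test.db/hello",
+        "fileName": "hello",
+        "column": [
+            {"name": "user_name", "type": "string"},
+            {"name": "age", "type": "bigint"}
+        ],
+        "writeMode": "append",
+        "fieldDelimiter": "\t",
+        "compress": "gzip",
+        "haveKerberos": true,
+        "kerberosKeytabFilePath": "/etc/security/keytabs/datax.keytab",
+        "kerberosPrincipal": "datax/hadoopclient@EXAMPLE.COM"
+    }
+}
+```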
+ + +### 3.3 类型转换 + +目前 HdfsWriter 支持大部分 Hive 类型,请注意检查你的类型。 + +下面列出 HdfsWriter 针对 Hive 数据类型转换列表: + +| DataX 内部类型| HIVE 数据类型 | +| -------- | ----- | +| Long |TINYINT,SMALLINT,INT,BIGINT | +| Double |FLOAT,DOUBLE | +| String |STRING,VARCHAR,CHAR | +| Boolean |BOOLEAN | +| Date |DATE,TIMESTAMP | + + +## 4 配置步骤 +* 步骤一、在Hive中创建数据库、表 +Hive数据库在HDFS上存储配置,在hive安装目录下 conf/hive-site.xml文件中配置,默认值为:/user/hive/warehouse +如下所示: + +```xml + + hive.metastore.warehouse.dir + /user/hive/warehouse + location of default database for the warehouse + +``` +Hive建库/建表语法 参考 [Hive操作手册]( https://cwiki.apache.org/confluence/display/Hive/LanguageManual) + +例: +(1)建立存储为textfile文件类型的表 +```json +create database IF NOT EXISTS hdfswriter; +use hdfswriter; +create table text_table( +col1 TINYINT, +col2 SMALLINT, +col3 INT, +col4 BIGINT, +col5 FLOAT, +col6 DOUBLE, +col7 STRING, +col8 VARCHAR(10), +col9 CHAR(10), +col10 BOOLEAN, +col11 date, +col12 TIMESTAMP +) +row format delimited +fields terminated by "\t" +STORED AS TEXTFILE; +``` +text_table在hdfs上存储路径为:/user/hive/warehouse/hdfswriter.db/text_table/ + +(2)建立存储为orcfile文件类型的表 +```json +create database IF NOT EXISTS hdfswriter; +use hdfswriter; +create table orc_table( +col1 TINYINT, +col2 SMALLINT, +col3 INT, +col4 BIGINT, +col5 FLOAT, +col6 DOUBLE, +col7 STRING, +col8 VARCHAR(10), +col9 CHAR(10), +col10 BOOLEAN, +col11 date, +col12 TIMESTAMP +) +ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' +STORED AS ORC; +``` +orc_table在hdfs上存储路径为:/user/hive/warehouse/hdfswriter.db/orc_table/ + +* 步骤二、根据步骤一的配置信息配置HdfsWriter作业 + +## 5 约束限制 + +略 + +## 6 FAQ + +略 diff --git a/hdfswriter/pom.xml b/hdfswriter/pom.xml new file mode 100644 index 0000000000..574c23b6f2 --- /dev/null +++ b/hdfswriter/pom.xml @@ -0,0 +1,135 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + hdfswriter + hdfswriter + HdfsWriter提供了写入HDFS功能。 + jar + + 1.1.1 + 2.7.1 + + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.apache.hadoop + hadoop-hdfs + ${hadoop.version} + + + org.apache.hadoop + hadoop-common + ${hadoop.version} + + + org.apache.hadoop + hadoop-yarn-common + ${hadoop.version} + + + org.apache.hadoop + hadoop-mapreduce-client-core + ${hadoop.version} + + + + org.apache.hive + hive-exec + ${hive.version} + + + org.apache.hive + hive-serde + ${hive.version} + + + org.apache.hive + hive-service + ${hive.version} + + + org.apache.hive + hive-common + ${hive.version} + + + org.apache.hive.hcatalog + hive-hcatalog-core + ${hive.version} + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + + junit + junit + test + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/hdfswriter/src/main/assembly/package.xml b/hdfswriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..0356c4a251 --- /dev/null +++ b/hdfswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/hdfswriter + + + target/ + + hdfswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/hdfswriter + + + + + + false + plugin/writer/hdfswriter/libs + runtime + + + diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Constant.java 
b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Constant.java new file mode 100755 index 0000000000..3e4fa52f97 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +public class Constant { + + public static final String DEFAULT_ENCODING = "UTF-8"; + public static final String DEFAULT_NULL_FORMAT = "\\N"; +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsHelper.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsHelper.java new file mode 100644 index 0000000000..c8bfa50b6c --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsHelper.java @@ -0,0 +1,559 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONObject; +import com.google.common.collect.Lists; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.apache.hadoop.fs.*; +import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat; +import org.apache.hadoop.hive.ql.io.orc.OrcSerde; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory; +import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.mapred.*; +import org.apache.hadoop.security.UserGroupInformation; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import java.io.IOException; +import java.text.SimpleDateFormat; +import java.util.*; + +public class HdfsHelper { + public static final Logger LOG = LoggerFactory.getLogger(HdfsWriter.Job.class); + public FileSystem fileSystem = null; + public JobConf conf = null; + public org.apache.hadoop.conf.Configuration hadoopConf = null; + public static final String HADOOP_SECURITY_AUTHENTICATION_KEY = "hadoop.security.authentication"; + public static final String HDFS_DEFAULTFS_KEY = "fs.defaultFS"; + + // Kerberos + private Boolean haveKerberos = false; + private String kerberosKeytabFilePath; + private String kerberosPrincipal; + + public void getFileSystem(String defaultFS, Configuration taskConfig){ + hadoopConf = new org.apache.hadoop.conf.Configuration(); + + Configuration hadoopSiteParams = taskConfig.getConfiguration(Key.HADOOP_CONFIG); + JSONObject hadoopSiteParamsAsJsonObject = JSON.parseObject(taskConfig.getString(Key.HADOOP_CONFIG)); + if (null != hadoopSiteParams) { + Set paramKeys = hadoopSiteParams.getKeys(); + for (String each : paramKeys) { + hadoopConf.set(each, hadoopSiteParamsAsJsonObject.getString(each)); + } + } + hadoopConf.set(HDFS_DEFAULTFS_KEY, defaultFS); + + //是否有Kerberos认证 + this.haveKerberos = taskConfig.getBool(Key.HAVE_KERBEROS, false); + if(haveKerberos){ + this.kerberosKeytabFilePath = taskConfig.getString(Key.KERBEROS_KEYTAB_FILE_PATH); + this.kerberosPrincipal = taskConfig.getString(Key.KERBEROS_PRINCIPAL); + 
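// 开启Kerberos时,这里显式将hadoop.security.authentication设置为kerberos,真正的keytab登录在下面的kerberosAuthentication()中完成 +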
hadoopConf.set(HADOOP_SECURITY_AUTHENTICATION_KEY, "kerberos"); + } + this.kerberosAuthentication(this.kerberosPrincipal, this.kerberosKeytabFilePath); + conf = new JobConf(hadoopConf); + try { + fileSystem = FileSystem.get(conf); + } catch (IOException e) { + String message = String.format("获取FileSystem时发生网络IO异常,请检查您的网络是否正常!HDFS地址:[%s]", + "message:defaultFS =" + defaultFS); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + }catch (Exception e) { + String message = String.format("获取FileSystem失败,请检查HDFS地址是否正确: [%s]", + "message:defaultFS =" + defaultFS); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + } + + if(null == fileSystem || null == conf){ + String message = String.format("获取FileSystem失败,请检查HDFS地址是否正确: [%s]", + "message:defaultFS =" + defaultFS); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, message); + } + } + + private void kerberosAuthentication(String kerberosPrincipal, String kerberosKeytabFilePath){ + if(haveKerberos && StringUtils.isNotBlank(this.kerberosPrincipal) && StringUtils.isNotBlank(this.kerberosKeytabFilePath)){ + UserGroupInformation.setConfiguration(this.hadoopConf); + try { + UserGroupInformation.loginUserFromKeytab(kerberosPrincipal, kerberosKeytabFilePath); + } catch (Exception e) { + String message = String.format("kerberos认证失败,请确定kerberosKeytabFilePath[%s]和kerberosPrincipal[%s]填写正确", + kerberosKeytabFilePath, kerberosPrincipal); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.KERBEROS_LOGIN_ERROR, e); + } + } + } + + /** + *获取指定目录先的文件列表 + * @param dir + * @return + * 拿到的是文件全路径, + * eg:hdfs://10.101.204.12:9000/user/hive/warehouse/writer.db/text/test.textfile + */ + public String[] hdfsDirList(String dir){ + Path path = new Path(dir); + String[] files = null; + try { + FileStatus[] status = fileSystem.listStatus(path); + files = new String[status.length]; + for(int i=0;i tmpFiles, HashSet endFiles){ + Path tmpFilesParent = null; + if(tmpFiles.size() != endFiles.size()){ + String message = String.format("临时目录下文件名个数与目标文件名个数不一致!"); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.HDFS_RENAME_FILE_ERROR, message); + }else{ + try{ + for (Iterator it1=tmpFiles.iterator(),it2=endFiles.iterator();it1.hasNext()&&it2.hasNext();){ + String srcFile = it1.next().toString(); + String dstFile = it2.next().toString(); + Path srcFilePah = new Path(srcFile); + Path dstFilePah = new Path(dstFile); + if(tmpFilesParent == null){ + tmpFilesParent = srcFilePah.getParent(); + } + LOG.info(String.format("start rename file [%s] to file [%s].", srcFile,dstFile)); + boolean renameTag = false; + long fileLen = fileSystem.getFileStatus(srcFilePah).getLen(); + if(fileLen>0){ + renameTag = fileSystem.rename(srcFilePah,dstFilePah); + if(!renameTag){ + String message = String.format("重命名文件[%s]失败,请检查您的网络是否正常!", srcFile); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.HDFS_RENAME_FILE_ERROR, message); + } + LOG.info(String.format("finish rename file [%s] to file [%s].", srcFile,dstFile)); + }else{ + LOG.info(String.format("文件[%s]内容为空,请检查写入是否正常!", srcFile)); + } + } + }catch (Exception e) { + String message = String.format("重命名文件时发生异常,请检查您的网络是否正常!"); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + }finally { + deleteDir(tmpFilesParent); + } + } + } + + 
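// closeFileSystem由Job在destroy()阶段调用,用于释放与HDFS的连接 +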
//关闭FileSystem + public void closeFileSystem(){ + try { + fileSystem.close(); + } catch (IOException e) { + String message = String.format("关闭FileSystem时发生IO异常,请检查您的网络是否正常!"); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e); + } + } + + + //textfile格式文件 + public FSDataOutputStream getOutputStream(String path){ + Path storePath = new Path(path); + FSDataOutputStream fSDataOutputStream = null; + try { + fSDataOutputStream = fileSystem.create(storePath); + } catch (IOException e) { + String message = String.format("Create an FSDataOutputStream at the indicated Path[%s] failed: [%s]", + "message:path =" + path); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e); + } + return fSDataOutputStream; + } + + /** + * 写textfile类型文件 + * @param lineReceiver + * @param config + * @param fileName + * @param taskPluginCollector + */ + public void textFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName, + TaskPluginCollector taskPluginCollector){ + char fieldDelimiter = config.getChar(Key.FIELD_DELIMITER); + List columns = config.getListConfiguration(Key.COLUMN); + String compress = config.getString(Key.COMPRESS,null); + + SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddHHmm"); + String attempt = "attempt_"+dateFormat.format(new Date())+"_0001_m_000000_0"; + Path outputPath = new Path(fileName); + //todo 需要进一步确定TASK_ATTEMPT_ID + conf.set(JobContext.TASK_ATTEMPT_ID, attempt); + FileOutputFormat outFormat = new TextOutputFormat(); + outFormat.setOutputPath(conf, outputPath); + outFormat.setWorkOutputPath(conf, outputPath); + if(null != compress) { + Class codecClass = getCompressCodec(compress); + if (null != codecClass) { + outFormat.setOutputCompressorClass(conf, codecClass); + } + } + try { + RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, outputPath.toString(), Reporter.NULL); + Record record = null; + while ((record = lineReceiver.getFromReader()) != null) { + MutablePair transportResult = transportOneRecord(record, fieldDelimiter, columns, taskPluginCollector); + if (!transportResult.getRight()) { + writer.write(NullWritable.get(),transportResult.getLeft()); + } + } + writer.close(Reporter.NULL); + } catch (Exception e) { + String message = String.format("写文件文件[%s]时发生IO异常,请检查您的网络是否正常!", fileName); + LOG.error(message); + Path path = new Path(fileName); + deleteDir(path.getParent()); + throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e); + } + } + + public static MutablePair transportOneRecord( + Record record, char fieldDelimiter, List columnsConfiguration, TaskPluginCollector taskPluginCollector) { + MutablePair, Boolean> transportResultList = transportOneRecord(record,columnsConfiguration,taskPluginCollector); + //保存<转换后的数据,是否是脏数据> + MutablePair transportResult = new MutablePair(); + transportResult.setRight(false); + if(null != transportResultList){ + Text recordResult = new Text(StringUtils.join(transportResultList.getLeft(), fieldDelimiter)); + transportResult.setRight(transportResultList.getRight()); + transportResult.setLeft(recordResult); + } + return transportResult; + } + + public Class getCompressCodec(String compress){ + Class codecClass = null; + if(null == compress){ + codecClass = null; + }else if("GZIP".equalsIgnoreCase(compress)){ + codecClass = org.apache.hadoop.io.compress.GzipCodec.class; + }else if ("BZIP2".equalsIgnoreCase(compress)) { + codecClass = 
org.apache.hadoop.io.compress.BZip2Codec.class; + }else if("SNAPPY".equalsIgnoreCase(compress)){ + //todo 等需求明确后支持 需要用户安装SnappyCodec + codecClass = org.apache.hadoop.io.compress.SnappyCodec.class; + // org.apache.hadoop.hive.ql.io.orc.ZlibCodec.class not public + //codecClass = org.apache.hadoop.hive.ql.io.orc.ZlibCodec.class; + }else { + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("目前不支持您配置的 compress 模式 : [%s]", compress)); + } + return codecClass; + } + + /** + * 写orcfile类型文件 + * @param lineReceiver + * @param config + * @param fileName + * @param taskPluginCollector + */ + public void orcFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName, + TaskPluginCollector taskPluginCollector){ + List columns = config.getListConfiguration(Key.COLUMN); + String compress = config.getString(Key.COMPRESS, null); + List columnNames = getColumnNames(columns); + List columnTypeInspectors = getColumnTypeInspectors(columns); + StructObjectInspector inspector = (StructObjectInspector)ObjectInspectorFactory + .getStandardStructObjectInspector(columnNames, columnTypeInspectors); + + OrcSerde orcSerde = new OrcSerde(); + + FileOutputFormat outFormat = new OrcOutputFormat(); + if(!"NONE".equalsIgnoreCase(compress) && null != compress ) { + Class codecClass = getCompressCodec(compress); + if (null != codecClass) { + outFormat.setOutputCompressorClass(conf, codecClass); + } + } + try { + RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, fileName, Reporter.NULL); + Record record = null; + while ((record = lineReceiver.getFromReader()) != null) { + MutablePair, Boolean> transportResult = transportOneRecord(record,columns,taskPluginCollector); + if (!transportResult.getRight()) { + writer.write(NullWritable.get(), orcSerde.serialize(transportResult.getLeft(), inspector)); + } + } + writer.close(Reporter.NULL); + } catch (Exception e) { + String message = String.format("写文件文件[%s]时发生IO异常,请检查您的网络是否正常!", fileName); + LOG.error(message); + Path path = new Path(fileName); + deleteDir(path.getParent()); + throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e); + } + } + + public List getColumnNames(List columns){ + List columnNames = Lists.newArrayList(); + for (Configuration eachColumnConf : columns) { + columnNames.add(eachColumnConf.getString(Key.NAME)); + } + return columnNames; + } + + /** + * 根据writer配置的字段类型,构建inspector + * @param columns + * @return + */ + public List getColumnTypeInspectors(List columns){ + List columnTypeInspectors = Lists.newArrayList(); + for (Configuration eachColumnConf : columns) { + SupportHiveDataType columnType = SupportHiveDataType.valueOf(eachColumnConf.getString(Key.TYPE).toUpperCase()); + ObjectInspector objectInspector = null; + switch (columnType) { + case TINYINT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Byte.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case SMALLINT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Short.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case INT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case BIGINT: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Long.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case FLOAT: + objectInspector = 
ObjectInspectorFactory.getReflectionObjectInspector(Float.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case DOUBLE: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Double.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case TIMESTAMP: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(java.sql.Timestamp.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case DATE: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(java.sql.Date.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case STRING: + case VARCHAR: + case CHAR: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + case BOOLEAN: + objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Boolean.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA); + break; + default: + throw DataXException + .asDataXException( + HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d]. 请修改表中该字段的类型或者不同步该字段.", + eachColumnConf.getString(Key.NAME), + eachColumnConf.getString(Key.TYPE))); + } + + columnTypeInspectors.add(objectInspector); + } + return columnTypeInspectors; + } + + public OrcSerde getOrcSerde(Configuration config){ + String fieldDelimiter = config.getString(Key.FIELD_DELIMITER); + String compress = config.getString(Key.COMPRESS); + String encoding = config.getString(Key.ENCODING); + + OrcSerde orcSerde = new OrcSerde(); + Properties properties = new Properties(); + properties.setProperty("orc.bloom.filter.columns", fieldDelimiter); + properties.setProperty("orc.compress", compress); + properties.setProperty("orc.encoding.strategy", encoding); + + orcSerde.initialize(conf, properties); + return orcSerde; + } + + public static MutablePair, Boolean> transportOneRecord( + Record record,List columnsConfiguration, + TaskPluginCollector taskPluginCollector){ + + MutablePair, Boolean> transportResult = new MutablePair, Boolean>(); + transportResult.setRight(false); + List recordList = Lists.newArrayList(); + int recordLength = record.getColumnNumber(); + if (0 != recordLength) { + Column column; + for (int i = 0; i < recordLength; i++) { + column = record.getColumn(i); + //todo as method + if (null != column.getRawData()) { + String rowData = column.getRawData().toString(); + SupportHiveDataType columnType = SupportHiveDataType.valueOf( + columnsConfiguration.get(i).getString(Key.TYPE).toUpperCase()); + //根据writer端类型配置做类型转换 + try { + switch (columnType) { + case TINYINT: + recordList.add(Byte.valueOf(rowData)); + break; + case SMALLINT: + recordList.add(Short.valueOf(rowData)); + break; + case INT: + recordList.add(Integer.valueOf(rowData)); + break; + case BIGINT: + recordList.add(column.asLong()); + break; + case FLOAT: + recordList.add(Float.valueOf(rowData)); + break; + case DOUBLE: + recordList.add(column.asDouble()); + break; + case STRING: + case VARCHAR: + case CHAR: + recordList.add(column.asString()); + break; + case BOOLEAN: + recordList.add(column.asBoolean()); + break; + case DATE: + recordList.add(new java.sql.Date(column.asDate().getTime())); + break; + case TIMESTAMP: + recordList.add(new java.sql.Timestamp(column.asDate().getTime())); + break; + default: + throw DataXException + .asDataXException( + HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 
字段名:[%s], 字段类型:[%d]. 请修改表中该字段的类型或者不同步该字段.", + columnsConfiguration.get(i).getString(Key.NAME), + columnsConfiguration.get(i).getString(Key.TYPE))); + } + } catch (Exception e) { + // warn: 此处认为脏数据 + String message = String.format( + "字段类型转换错误:你目标字段为[%s]类型,实际字段值为[%s].", + columnsConfiguration.get(i).getString(Key.TYPE), column.getRawData().toString()); + taskPluginCollector.collectDirtyRecord(record, message); + transportResult.setRight(true); + break; + } + }else { + // warn: it's all ok if nullFormat is null + recordList.add(null); + } + } + } + transportResult.setLeft(recordList); + return transportResult; + } +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriter.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriter.java new file mode 100644 index 0000000000..0119be2b56 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriter.java @@ -0,0 +1,381 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.writer.Constant; +import com.google.common.collect.Sets; +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.fs.Path; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; + + +public class HdfsWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration writerSliceConfig = null; + + private String defaultFS; + private String path; + private String fileType; + private String fileName; + private List columns; + private String writeMode; + private String fieldDelimiter; + private String compress; + private String encoding; + private HashSet tmpFiles = new HashSet();//临时文件全路径 + private HashSet endFiles = new HashSet();//最终文件全路径 + + private HdfsHelper hdfsHelper = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.validateParameter(); + + //创建textfile存储 + hdfsHelper = new HdfsHelper(); + + hdfsHelper.getFileSystem(defaultFS, this.writerSliceConfig); + } + + private void validateParameter() { + this.defaultFS = this.writerSliceConfig.getNecessaryValue(Key.DEFAULT_FS, HdfsWriterErrorCode.REQUIRED_VALUE); + //fileType check + this.fileType = this.writerSliceConfig.getNecessaryValue(Key.FILE_TYPE, HdfsWriterErrorCode.REQUIRED_VALUE); + if( !fileType.equalsIgnoreCase("ORC") && !fileType.equalsIgnoreCase("TEXT")){ + String message = "HdfsWriter插件目前只支持ORC和TEXT两种格式的文件,请将filetype选项的值配置为ORC或者TEXT"; + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message); + } + //path + this.path = this.writerSliceConfig.getNecessaryValue(Key.PATH, HdfsWriterErrorCode.REQUIRED_VALUE); + if(!path.startsWith("/")){ + String message = String.format("请检查参数path:[%s],需要配置为绝对路径", path); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message); + }else if(path.contains("*") || path.contains("?")){ + String message = String.format("请检查参数path:[%s],不能包含*,?等特殊字符", path); + LOG.error(message); + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message); + } + //fileName + 
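// fileName仅作为文件名前缀,split()阶段会为每个Task追加随机后缀并确保与目录下已有文件不重名 +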
this.fileName = this.writerSliceConfig.getNecessaryValue(Key.FILE_NAME, HdfsWriterErrorCode.REQUIRED_VALUE); + //columns check + this.columns = this.writerSliceConfig.getListConfiguration(Key.COLUMN); + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException(HdfsWriterErrorCode.REQUIRED_VALUE, "您需要指定 columns"); + }else{ + for (Configuration eachColumnConf : columns) { + eachColumnConf.getNecessaryValue(Key.NAME, HdfsWriterErrorCode.COLUMN_REQUIRED_VALUE); + eachColumnConf.getNecessaryValue(Key.TYPE, HdfsWriterErrorCode.COLUMN_REQUIRED_VALUE); + } + } + //writeMode check + this.writeMode = this.writerSliceConfig.getNecessaryValue(Key.WRITE_MODE, HdfsWriterErrorCode.REQUIRED_VALUE); + writeMode = writeMode.toLowerCase().trim(); + Set supportedWriteModes = Sets.newHashSet("append", "nonconflict"); + if (!supportedWriteModes.contains(writeMode)) { + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("仅支持append, nonConflict两种模式, 不支持您配置的 writeMode 模式 : [%s]", + writeMode)); + } + this.writerSliceConfig.set(Key.WRITE_MODE, writeMode); + //fieldDelimiter check + this.fieldDelimiter = this.writerSliceConfig.getString(Key.FIELD_DELIMITER,null); + if(null == fieldDelimiter){ + throw DataXException.asDataXException(HdfsWriterErrorCode.REQUIRED_VALUE, + String.format("您提供配置文件有误,[%s]是必填参数.", Key.FIELD_DELIMITER)); + }else if(1 != fieldDelimiter.length()){ + // warn: if have, length must be one + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", fieldDelimiter)); + } + //compress check + this.compress = this.writerSliceConfig.getString(Key.COMPRESS,null); + if(fileType.equalsIgnoreCase("TEXT")){ + Set textSupportedCompress = Sets.newHashSet("GZIP", "BZIP2"); + //用户可能配置的是compress:"",空字符串,需要将compress设置为null + if(StringUtils.isBlank(compress) ){ + this.writerSliceConfig.set(Key.COMPRESS, null); + }else { + compress = compress.toUpperCase().trim(); + if(!textSupportedCompress.contains(compress) ){ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("目前TEXT FILE仅支持GZIP、BZIP2 两种压缩, 不支持您配置的 compress 模式 : [%s]", + compress)); + } + } + }else if(fileType.equalsIgnoreCase("ORC")){ + Set orcSupportedCompress = Sets.newHashSet("NONE", "SNAPPY"); + if(null == compress){ + this.writerSliceConfig.set(Key.COMPRESS, "NONE"); + }else { + compress = compress.toUpperCase().trim(); + if(!orcSupportedCompress.contains(compress)){ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("目前ORC FILE仅支持SNAPPY压缩, 不支持您配置的 compress 模式 : [%s]", + compress)); + } + } + + } + //Kerberos check + Boolean haveKerberos = this.writerSliceConfig.getBool(Key.HAVE_KERBEROS, false); + if(haveKerberos) { + this.writerSliceConfig.getNecessaryValue(Key.KERBEROS_KEYTAB_FILE_PATH, HdfsWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.KERBEROS_PRINCIPAL, HdfsWriterErrorCode.REQUIRED_VALUE); + } + // encoding check + this.encoding = this.writerSliceConfig.getString(Key.ENCODING,Constant.DEFAULT_ENCODING); + try { + encoding = encoding.trim(); + this.writerSliceConfig.set(Key.ENCODING, encoding); + Charsets.toCharset(encoding); + } catch (Exception e) { + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式:[%s]", encoding), e); + } + } + + @Override + public void prepare() { + //若路径已经存在,检查path是否是目录 + if(hdfsHelper.isPathexists(path)){ + 
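// path已存在时必须是目录;下面根据writeMode决定直接追加写入(append)或在目录非空时报错(nonConflict) +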
if(!hdfsHelper.isPathDir(path)){ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + //根据writeMode对目录下文件进行处理 + Path[] existFilePaths = hdfsHelper.hdfsDirList(path,fileName); + boolean isExistFile = false; + if(existFilePaths.length > 0){ + isExistFile = true; + } + /** + if ("truncate".equals(writeMode) && isExistFile ) { + LOG.info(String.format("由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的内容", + path, fileName)); + hdfsHelper.deleteFiles(existFilePaths); + } else + */ + if ("append".equalsIgnoreCase(writeMode)) { + LOG.info(String.format("由于您配置了writeMode append, 写入前不做清理工作, [%s] 目录下写入相应文件名前缀 [%s] 的文件", + path, fileName)); + } else if ("nonconflict".equalsIgnoreCase(writeMode) && isExistFile) { + LOG.info(String.format("由于您配置了writeMode nonConflict, 开始检查 [%s] 下面的内容", path)); + List allFiles = new ArrayList(); + for (Path eachFile : existFilePaths) { + allFiles.add(eachFile.toString()); + } + LOG.error(String.format("冲突文件列表为: [%s]", StringUtils.join(allFiles, ","))); + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("由于您配置了writeMode nonConflict,但您配置的path: [%s] 目录不为空, 下面存在其他文件或文件夹.", path)); + } + }else{ + throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, + String.format("您配置的path: [%s] 不存在, 请先在hive端创建对应的数据库和表.", path)); + } + } + + @Override + public void post() { + hdfsHelper.renameFile(tmpFiles, endFiles); + } + + @Override + public void destroy() { + hdfsHelper.closeFileSystem(); + } + + @Override + public List split(int mandatoryNumber) { + LOG.info("begin do split..."); + List writerSplitConfigs = new ArrayList(); + String filePrefix = fileName; + + Set allFiles = new HashSet(); + + //获取该路径下的所有已有文件列表 + if(hdfsHelper.isPathexists(path)){ + allFiles.addAll(Arrays.asList(hdfsHelper.hdfsDirList(path))); + } + + String fileSuffix; + //临时存放路径 + String storePath = buildTmpFilePath(this.path); + //最终存放路径 + String endStorePath = buildFilePath(); + this.path = endStorePath; + for (int i = 0; i < mandatoryNumber; i++) { + // handle same file name + + Configuration splitedTaskConfig = this.writerSliceConfig.clone(); + String fullFileName = null; + String endFullFileName = null; + + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + + fullFileName = String.format("%s%s%s__%s", defaultFS, storePath, filePrefix, fileSuffix); + endFullFileName = String.format("%s%s%s__%s", defaultFS, endStorePath, filePrefix, fileSuffix); + + while (allFiles.contains(endFullFileName)) { + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s%s%s__%s", defaultFS, storePath, filePrefix, fileSuffix); + endFullFileName = String.format("%s%s%s__%s", defaultFS, endStorePath, filePrefix, fileSuffix); + } + allFiles.add(endFullFileName); + + //设置临时文件全路径和最终文件全路径 + if("GZIP".equalsIgnoreCase(this.compress)){ + this.tmpFiles.add(fullFileName + ".gz"); + this.endFiles.add(endFullFileName + ".gz"); + }else if("BZIP2".equalsIgnoreCase(compress)){ + this.tmpFiles.add(fullFileName + ".bz2"); + this.endFiles.add(endFullFileName + ".bz2"); + }else{ + this.tmpFiles.add(fullFileName); + this.endFiles.add(endFullFileName); + } + + splitedTaskConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME, + fullFileName); + + LOG.info(String.format("splited write file name:[%s]", + fullFileName)); + + writerSplitConfigs.add(splitedTaskConfig); + } + LOG.info("end do split."); + return 
writerSplitConfigs; + } + + private String buildFilePath() { + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + if (!isEndWithSeparator) { + this.path = this.path + IOUtils.DIR_SEPARATOR; + } + return this.path; + } + + /** + * 创建临时目录 + * @param userPath + * @return + */ + private String buildTmpFilePath(String userPath) { + String tmpFilePath; + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = userPath.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = userPath.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + String tmpSuffix; + tmpSuffix = UUID.randomUUID().toString().replace('-', '_'); + if (!isEndWithSeparator) { + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else if("/".equals(userPath)){ + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else{ + tmpFilePath = String.format("%s__%s%s", userPath.substring(0,userPath.length()-1), tmpSuffix, IOUtils.DIR_SEPARATOR); + } + while(hdfsHelper.isPathexists(tmpFilePath)){ + tmpSuffix = UUID.randomUUID().toString().replace('-', '_'); + if (!isEndWithSeparator) { + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else if("/".equals(userPath)){ + tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR); + }else{ + tmpFilePath = String.format("%s__%s%s", userPath.substring(0,userPath.length()-1), tmpSuffix, IOUtils.DIR_SEPARATOR); + } + } + return tmpFilePath; + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + + private Configuration writerSliceConfig; + + private String defaultFS; + private String fileType; + private String fileName; + + private HdfsHelper hdfsHelper = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + + this.defaultFS = this.writerSliceConfig.getString(Key.DEFAULT_FS); + this.fileType = this.writerSliceConfig.getString(Key.FILE_TYPE); + //得当的已经是绝对路径,eg:hdfs://10.101.204.12:9000/user/hive/warehouse/writer.db/text/test.textfile + this.fileName = this.writerSliceConfig.getString(Key.FILE_NAME); + + hdfsHelper = new HdfsHelper(); + hdfsHelper.getFileSystem(defaultFS, writerSliceConfig); + } + + @Override + public void prepare() { + + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + LOG.info("begin do write..."); + LOG.info(String.format("write to file : [%s]", this.fileName)); + if(fileType.equalsIgnoreCase("TEXT")){ + //写TEXT FILE + hdfsHelper.textFileStartWrite(lineReceiver,this.writerSliceConfig, this.fileName, + this.getTaskPluginCollector()); + }else if(fileType.equalsIgnoreCase("ORC")){ + //写ORC FILE + hdfsHelper.orcFileStartWrite(lineReceiver,this.writerSliceConfig, this.fileName, + this.getTaskPluginCollector()); + } + + LOG.info("end do write"); + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + } +} diff --git 
a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriterErrorCode.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriterErrorCode.java new file mode 100644 index 0000000000..a9e1cb30e6 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/HdfsWriterErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by shf on 15/10/8. + */ +public enum HdfsWriterErrorCode implements ErrorCode { + + CONFIG_INVALID_EXCEPTION("HdfsWriter-00", "您的参数配置错误."), + REQUIRED_VALUE("HdfsWriter-01", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("HdfsWriter-02", "您填写的参数值不合法."), + WRITER_FILE_WITH_CHARSET_ERROR("HdfsWriter-03", "您配置的编码未能正常写入."), + Write_FILE_IO_ERROR("HdfsWriter-04", "您配置的文件在写入时出现IO异常."), + WRITER_RUNTIME_EXCEPTION("HdfsWriter-05", "出现运行时异常, 请联系我们."), + CONNECT_HDFS_IO_ERROR("HdfsWriter-06", "与HDFS建立连接时出现IO异常."), + COLUMN_REQUIRED_VALUE("HdfsWriter-07", "您column配置中缺失了必须填写的参数值."), + HDFS_RENAME_FILE_ERROR("HdfsWriter-08", "将文件移动到配置路径失败."), + KERBEROS_LOGIN_ERROR("HdfsWriter-09", "KERBEROS认证失败"); + + private final String code; + private final String description; + + private HdfsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } + +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Key.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Key.java new file mode 100644 index 0000000000..f1f6309689 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/Key.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +/** + * Created by shf on 15/10/8. 
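+ * HdfsWriter各配置项key的常量定义。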
+ */ +public class Key { + // must have + public static final String PATH = "path"; + //must have + public final static String DEFAULT_FS = "defaultFS"; + //must have + public final static String FILE_TYPE = "fileType"; + // must have + public static final String FILE_NAME = "fileName"; + // must have for column + public static final String COLUMN = "column"; + public static final String NAME = "name"; + public static final String TYPE = "type"; + public static final String DATE_FORMAT = "dateFormat"; + // must have + public static final String WRITE_MODE = "writeMode"; + // must have + public static final String FIELD_DELIMITER = "fieldDelimiter"; + // not must, default UTF-8 + public static final String ENCODING = "encoding"; + // not must, default no compress + public static final String COMPRESS = "compress"; + // not must, not default \N + public static final String NULL_FORMAT = "nullFormat"; + // Kerberos + public static final String HAVE_KERBEROS = "haveKerberos"; + public static final String KERBEROS_KEYTAB_FILE_PATH = "kerberosKeytabFilePath"; + public static final String KERBEROS_PRINCIPAL = "kerberosPrincipal"; + // hadoop config + public static final String HADOOP_CONFIG = "hadoopConfig"; +} diff --git a/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/SupportHiveDataType.java b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/SupportHiveDataType.java new file mode 100644 index 0000000000..b7949302c8 --- /dev/null +++ b/hdfswriter/src/main/java/com/alibaba/datax/plugin/writer/hdfswriter/SupportHiveDataType.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.plugin.writer.hdfswriter; + +public enum SupportHiveDataType { + TINYINT, + SMALLINT, + INT, + BIGINT, + FLOAT, + DOUBLE, + + TIMESTAMP, + DATE, + + STRING, + VARCHAR, + CHAR, + + BOOLEAN +} diff --git a/hdfswriter/src/main/resources/plugin.json b/hdfswriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..98c8e94a60 --- /dev/null +++ b/hdfswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "hdfswriter", + "class": "com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter", + "description": "useScene: prod. mechanism: via FileSystem connect HDFS write data concurrent.", + "developer": "alibaba" +} diff --git a/hdfswriter/src/main/resources/plugin_job_template.json b/hdfswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..08b4ab62ec --- /dev/null +++ b/hdfswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "hdfswriter", + "parameter": { + "defaultFS": "", + "fileType": "", + "path": "", + "fileName": "", + "column": [], + "writeMode": "", + "fieldDelimiter": "", + "compress":"" + } +} \ No newline at end of file diff --git a/images/DataX-logo.jpg b/images/DataX-logo.jpg new file mode 100644 index 0000000000..ce8b9e9940 Binary files /dev/null and b/images/DataX-logo.jpg differ diff --git a/images/datax-enterprise-users.jpg b/images/datax-enterprise-users.jpg new file mode 100644 index 0000000000..5ddd82f564 Binary files /dev/null and b/images/datax-enterprise-users.jpg differ diff --git a/images/datax-opensource-dingding.png b/images/datax-opensource-dingding.png new file mode 100644 index 0000000000..fe8b8544ca Binary files /dev/null and b/images/datax-opensource-dingding.png differ diff --git a/license.txt b/license.txt new file mode 100644 index 0000000000..00b845b43b --- /dev/null +++ b/license.txt @@ -0,0 +1,13 @@ +Copyright 1999-2017 Alibaba Group Holding Ltd. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. diff --git a/mongodbreader/doc/mongodbreader.md b/mongodbreader/doc/mongodbreader.md new file mode 100644 index 0000000000..3535d5b7fd --- /dev/null +++ b/mongodbreader/doc/mongodbreader.md @@ -0,0 +1,149 @@ +### Datax MongoDBReader +#### 1 快速介绍 + +MongoDBReader 插件利用 MongoDB 的java客户端MongoClient进行MongoDB的读操作。最新版本的Mongo已经将DB锁的粒度从DB级别降低到document级别,配合上MongoDB强大的索引功能,基本可以达到高性能的读取MongoDB的需求。 + +#### 2 实现原理 + +MongoDBReader通过Datax框架从MongoDB并行的读取数据,通过主控的JOB程序按照指定的规则对MongoDB中的数据进行分片,并行读取,然后将MongoDB支持的类型通过逐一判断转换成Datax支持的类型。 + +#### 3 功能说明 +* 该示例从ODPS读一份数据到MongoDB。 + + { + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "mongodbreader", + "parameter": { + "address": ["127.0.0.1:27017"], + "userName": "", + "userPassword": "", + "dbName": "tag_per_data", + "collectionName": "tag_data12", + "column": [ + { + "name": "unique_id", + "type": "string" + }, + { + "name": "sid", + "type": "string" + }, + { + "name": "user_id", + "type": "string" + }, + { + "name": "auction_id", + "type": "string" + }, + { + "name": "content_type", + "type": "string" + }, + { + "name": "pool_type", + "type": "string" + }, + { + "name": "frontcat_id", + "type": "Array", + "spliter": "" + }, + { + "name": "categoryid", + "type": "Array", + "spliter": "" + }, + { + "name": "gmt_create", + "type": "string" + }, + { + "name": "taglist", + "type": "Array", + "spliter": " " + }, + { + "name": "property", + "type": "string" + }, + { + "name": "scorea", + "type": "int" + }, + { + "name": "scoreb", + "type": "int" + }, + { + "name": "scorec", + "type": "int" + } + ] + } + }, + "writer": { + "name": "odpswriter", + "parameter": { + "project": "tb_ai_recommendation", + "table": "jianying_tag_datax_read_test01", + "column": [ + "unique_id", + "sid", + "user_id", + "auction_id", + "content_type", + "pool_type", + "frontcat_id", + "categoryid", + "gmt_create", + "taglist", + "property", + "scorea", + "scoreb" + ], + "accessId": "**************", + "accessKey": "********************", + "truncate": true, + "odpsServer": "xxx/api", + "tunnelServer": "xxx", + "accountType": "aliyun" + } + } + } + ] + } + } +#### 4 参数说明 + +* address: MongoDB的数据地址信息,因为MonogDB可能是个集群,则ip端口信息需要以Json数组的形式给出。【必填】 +* userName:MongoDB的用户名。【选填】 +* userPassword: MongoDB的密码。【选填】 +* collectionName: MonogoDB的集合名。【必填】 +* column:MongoDB的文档列名。【必填】 +* name:Column的名字。【必填】 +* type:Column的类型。【选填】 +* splitter:因为MongoDB支持数组类型,但是Datax框架本身不支持数组类型,所以mongoDB读出来的数组类型要通过这个分隔符合并成字符串。【选填】 + +#### 5 类型转换 + +| DataX 内部类型| MongoDB 数据类型 | +| -------- | ----- | +| Long | int, Long | +| Double | double | +| String | string, array | +| Date | date | +| Boolean | boolean | +| Bytes | bytes | + + +#### 6 性能报告 +#### 7 测试报告 \ No newline at end of file diff --git a/mongodbreader/pom.xml b/mongodbreader/pom.xml new file mode 100644 index 0000000000..fec3bfd85f --- /dev/null +++ b/mongodbreader/pom.xml @@ -0,0 +1,84 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + mongodbreader + + + + com.alibaba.datax 
+ datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.mongodb + mongo-java-driver + 3.2.2 + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/mongodbreader/src/main/assembly/package.xml b/mongodbreader/src/main/assembly/package.xml new file mode 100644 index 0000000000..a7e967f90b --- /dev/null +++ b/mongodbreader/src/main/assembly/package.xml @@ -0,0 +1,36 @@ + + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/mongodbreader + + + target/ + + mongodbreader-0.0.1-SNAPSHOT.jar + + plugin/reader/mongodbreader + + + + + + false + plugin/reader/mongodbreader/libs + runtime + + + diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/KeyConstant.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/KeyConstant.java new file mode 100644 index 0000000000..fbc83d51ea --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/KeyConstant.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.plugin.reader.mongodbreader; + +/** + * Created by jianying.wcj on 2015/3/17 0017. + */ +public class KeyConstant { + /** + * 数组类型 + */ + public static final String ARRAY_TYPE = "array"; + /** + * 嵌入文档数组类型 + */ + public static final String DOCUMENT_ARRAY_TYPE = "document.array"; + /** + * 嵌入文档类型 + */ + public static final String DOCUMENT_TYPE = "document"; + /** + * mongodb 的 host 地址 + */ + public static final String MONGO_ADDRESS = "address"; + /** + * mongodb 的用户名 + */ + public static final String MONGO_USER_NAME = "userName"; + public static final String MONGO_USERNAME = "username"; + /** + * mongodb 密码 + */ + public static final String MONGO_USER_PASSWORD = "userPassword"; + public static final String MONGO_PASSWORD = "password"; + /** + * mongodb 数据库名 + */ + public static final String MONGO_DB_NAME = "dbName"; + public static final String MONGO_DATABASE = "database"; + public static final String MONGO_AUTHDB = "authDb"; + /** + * mongodb 集合名 + */ + public static final String MONGO_COLLECTION_NAME = "collectionName"; + /** + * mongodb 查询条件 + */ + public static final String MONGO_QUERY = "query"; + /** + * mongodb 的列 + */ + public static final String MONGO_COLUMN = "column"; + /** + * 每个列的名字 + */ + public static final String COLUMN_NAME = "name"; + /** + * 每个列的类型 + */ + public static final String COLUMN_TYPE = "type"; + /** + * 列分隔符 + */ + public static final String COLUMN_SPLITTER = "splitter"; + /** + * 跳过的列数 + */ + public static final String SKIP_COUNT = "skipCount"; + + + public static final String LOWER_BOUND = "lowerBound"; + public static final String UPPER_BOUND = "upperBound"; + public static final String IS_OBJECTID = "isObjectId"; + /** + * 批量获取的记录数 + */ + public static final String BATCH_SIZE = "batchSize"; + /** + * MongoDB的_id + */ + public static final String MONGO_PRIMARY_ID = "_id"; + /** + * MongoDB的错误码 + */ + public static final int MONGO_UNAUTHORIZED_ERR_CODE = 13; + public static final int MONGO_ILLEGALOP_ERR_CODE = 20; + /** + * 判断是否为数组类型 + * @param type 数据类型 + * @return + */ + public static boolean isArrayType(String type) { + return 
ARRAY_TYPE.equals(type) || DOCUMENT_ARRAY_TYPE.equals(type); + } + + public static boolean isDocumentType(String type) { + return type.startsWith(DOCUMENT_TYPE); + } +} diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReader.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReader.java new file mode 100644 index 0000000000..ba7f07f43e --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReader.java @@ -0,0 +1,212 @@ +package com.alibaba.datax.plugin.reader.mongodbreader; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Date; +import java.util.Iterator; +import java.util.List; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.DateColumn; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.mongodbreader.util.CollectionSplitUtil; +import com.alibaba.datax.plugin.reader.mongodbreader.util.MongoUtil; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONArray; +import com.alibaba.fastjson.JSONObject; + +import com.google.common.base.Joiner; +import com.google.common.base.Strings; +import com.mongodb.MongoClient; +import com.mongodb.client.MongoCollection; +import com.mongodb.client.MongoCursor; +import com.mongodb.client.MongoDatabase; +import org.bson.Document; +import org.bson.types.ObjectId; + +/** + * Created by jianying.wcj on 2015/3/19 0019. + * Modified by mingyan.zc on 2016/6/13. + * Modified by mingyan.zc on 2017/7/5. 
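+ * MongoDBReader:Job 阶段按 _id 将集合切分为若干 [lowerBound, upperBound) 区间,Task 阶段按区间并发读取文档,并逐列转换为 DataX 的 Record。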
+ */ +public class MongoDBReader extends Reader { + + public static class Job extends Reader.Job { + + private Configuration originalConfig = null; + + private MongoClient mongoClient; + + private String userName = null; + private String password = null; + + @Override + public List split(int adviceNumber) { + return CollectionSplitUtil.doSplit(originalConfig,adviceNumber,mongoClient); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + this.userName = originalConfig.getString(KeyConstant.MONGO_USER_NAME, originalConfig.getString(KeyConstant.MONGO_USERNAME)); + this.password = originalConfig.getString(KeyConstant.MONGO_USER_PASSWORD, originalConfig.getString(KeyConstant.MONGO_PASSWORD)); + String database = originalConfig.getString(KeyConstant.MONGO_DB_NAME, originalConfig.getString(KeyConstant.MONGO_DATABASE)); + String authDb = originalConfig.getString(KeyConstant.MONGO_AUTHDB, database); + if(!Strings.isNullOrEmpty(this.userName) && !Strings.isNullOrEmpty(this.password)) { + this.mongoClient = MongoUtil.initCredentialMongoClient(originalConfig,userName,password,authDb); + } else { + this.mongoClient = MongoUtil.initMongoClient(originalConfig); + } + } + + @Override + public void destroy() { + + } + } + + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + + private MongoClient mongoClient; + + private String userName = null; + private String password = null; + + private String authDb = null; + private String database = null; + private String collection = null; + + private String query = null; + + private JSONArray mongodbColumnMeta = null; + private Object lowerBound = null; + private Object upperBound = null; + private boolean isObjectId = true; + + @Override + public void startRead(RecordSender recordSender) { + + if(lowerBound== null || upperBound == null || + mongoClient == null || database == null || + collection == null || mongodbColumnMeta == null) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE, + MongoDBReaderErrorCode.ILLEGAL_VALUE.getDescription()); + } + MongoDatabase db = mongoClient.getDatabase(database); + MongoCollection col = db.getCollection(this.collection); + + MongoCursor dbCursor = null; + Document filter = new Document(); + if (lowerBound.equals("min")) { + if (!upperBound.equals("max")) { + filter.append(KeyConstant.MONGO_PRIMARY_ID, new Document("$lt", isObjectId ? new ObjectId(upperBound.toString()) : upperBound)); + } + } else if (upperBound.equals("max")) { + filter.append(KeyConstant.MONGO_PRIMARY_ID, new Document("$gte", isObjectId ? new ObjectId(lowerBound.toString()) : lowerBound)); + } else { + filter.append(KeyConstant.MONGO_PRIMARY_ID, new Document("$gte", isObjectId ? new ObjectId(lowerBound.toString()) : lowerBound).append("$lt", isObjectId ? 
new ObjectId(upperBound.toString()) : upperBound)); + } + if(!Strings.isNullOrEmpty(query)) { + Document queryFilter = Document.parse(query); + filter = new Document("$and", Arrays.asList(filter, queryFilter)); + } + dbCursor = col.find(filter).iterator(); + while (dbCursor.hasNext()) { + Document item = dbCursor.next(); + Record record = recordSender.createRecord(); + Iterator columnItera = mongodbColumnMeta.iterator(); + while (columnItera.hasNext()) { + JSONObject column = (JSONObject)columnItera.next(); + Object tempCol = item.get(column.getString(KeyConstant.COLUMN_NAME)); + if (tempCol == null) { + if (KeyConstant.isDocumentType(column.getString(KeyConstant.COLUMN_TYPE))) { + String[] name = column.getString(KeyConstant.COLUMN_NAME).split("\\."); + if (name.length > 1) { + Object obj; + Document nestedDocument = item; + for (String str : name) { + obj = nestedDocument.get(str); + if (obj instanceof Document) { + nestedDocument = (Document) obj; + } + } + + if (null != nestedDocument) { + Document doc = nestedDocument; + tempCol = doc.get(name[name.length - 1]); + } + } + } + } + if (tempCol == null) { + //continue; 这个不能直接continue会导致record到目的端错位 + record.addColumn(new StringColumn(null)); + }else if (tempCol instanceof Double) { + //TODO deal with Double.isNaN() + record.addColumn(new DoubleColumn((Double) tempCol)); + } else if (tempCol instanceof Boolean) { + record.addColumn(new BoolColumn((Boolean) tempCol)); + } else if (tempCol instanceof Date) { + record.addColumn(new DateColumn((Date) tempCol)); + } else if (tempCol instanceof Integer) { + record.addColumn(new LongColumn((Integer) tempCol)); + }else if (tempCol instanceof Long) { + record.addColumn(new LongColumn((Long) tempCol)); + } else { + if(KeyConstant.isArrayType(column.getString(KeyConstant.COLUMN_TYPE))) { + String splitter = column.getString(KeyConstant.COLUMN_SPLITTER); + if(Strings.isNullOrEmpty(splitter)) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE, + MongoDBReaderErrorCode.ILLEGAL_VALUE.getDescription()); + } else { + ArrayList array = (ArrayList)tempCol; + String tempArrayStr = Joiner.on(splitter).join(array); + record.addColumn(new StringColumn(tempArrayStr)); + } + } else { + record.addColumn(new StringColumn(tempCol.toString())); + } + } + } + recordSender.sendToWriter(record); + } + } + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.userName = readerSliceConfig.getString(KeyConstant.MONGO_USER_NAME, readerSliceConfig.getString(KeyConstant.MONGO_USERNAME)); + this.password = readerSliceConfig.getString(KeyConstant.MONGO_USER_PASSWORD, readerSliceConfig.getString(KeyConstant.MONGO_PASSWORD)); + this.database = readerSliceConfig.getString(KeyConstant.MONGO_DB_NAME, readerSliceConfig.getString(KeyConstant.MONGO_DATABASE)); + this.authDb = readerSliceConfig.getString(KeyConstant.MONGO_AUTHDB, this.database); + if(!Strings.isNullOrEmpty(userName) && !Strings.isNullOrEmpty(password)) { + mongoClient = MongoUtil.initCredentialMongoClient(readerSliceConfig,userName,password,authDb); + } else { + mongoClient = MongoUtil.initMongoClient(readerSliceConfig); + } + + this.collection = readerSliceConfig.getString(KeyConstant.MONGO_COLLECTION_NAME); + this.query = readerSliceConfig.getString(KeyConstant.MONGO_QUERY); + this.mongodbColumnMeta = JSON.parseArray(readerSliceConfig.getString(KeyConstant.MONGO_COLUMN)); + this.lowerBound = readerSliceConfig.get(KeyConstant.LOWER_BOUND); + this.upperBound = 
readerSliceConfig.get(KeyConstant.UPPER_BOUND); + this.isObjectId = readerSliceConfig.getBool(KeyConstant.IS_OBJECTID); + } + + @Override + public void destroy() { + + } + + } +} diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReaderErrorCode.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReaderErrorCode.java new file mode 100644 index 0000000000..4b3780c26b --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/MongoDBReaderErrorCode.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.plugin.reader.mongodbreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by jianying.wcj on 2015/3/19 0019. + */ +public enum MongoDBReaderErrorCode implements ErrorCode { + + ILLEGAL_VALUE("ILLEGAL_PARAMETER_VALUE","参数不合法"), + ILLEGAL_ADDRESS("ILLEGAL_ADDRESS","不合法的Mongo地址"), + UNEXCEPT_EXCEPTION("UNEXCEPT_EXCEPTION","未知异常"); + + private final String code; + + private final String description; + + private MongoDBReaderErrorCode(String code,String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return code; + } + + @Override + public String getDescription() { + return description; + } +} + diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/CollectionSplitUtil.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/CollectionSplitUtil.java new file mode 100644 index 0000000000..a66578f8da --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/CollectionSplitUtil.java @@ -0,0 +1,173 @@ +package com.alibaba.datax.plugin.reader.mongodbreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.mongodbreader.KeyConstant; +import com.alibaba.datax.plugin.reader.mongodbreader.MongoDBReaderErrorCode; +import com.google.common.base.Strings; +import com.mongodb.MongoClient; +import com.mongodb.MongoCommandException; +import com.mongodb.client.MongoCollection; +import com.mongodb.client.MongoDatabase; +import org.bson.Document; +import org.bson.types.ObjectId; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +/** + * Created by jianying.wcj on 2015/3/19 0019. + * Modified by mingyan.zc on 2016/6/13. + * Modified by mingyan.zc on 2017/7/5. 
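+ * 按 _id 将集合切分为若干区间:优先调用 splitVector 命令计算切分点,无相应权限时退化为 skip/limit 采样取点。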
+ */ +public class CollectionSplitUtil { + + public static List doSplit( + Configuration originalSliceConfig, int adviceNumber, MongoClient mongoClient) { + + List confList = new ArrayList(); + + String dbName = originalSliceConfig.getString(KeyConstant.MONGO_DB_NAME, originalSliceConfig.getString(KeyConstant.MONGO_DATABASE)); + + String collName = originalSliceConfig.getString(KeyConstant.MONGO_COLLECTION_NAME); + + if(Strings.isNullOrEmpty(dbName) || Strings.isNullOrEmpty(collName) || mongoClient == null) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE, + MongoDBReaderErrorCode.ILLEGAL_VALUE.getDescription()); + } + + boolean isObjectId = isPrimaryIdObjectId(mongoClient, dbName, collName); + + List rangeList = doSplitCollection(adviceNumber, mongoClient, dbName, collName, isObjectId); + for(Range range : rangeList) { + Configuration conf = originalSliceConfig.clone(); + conf.set(KeyConstant.LOWER_BOUND, range.lowerBound); + conf.set(KeyConstant.UPPER_BOUND, range.upperBound); + conf.set(KeyConstant.IS_OBJECTID, isObjectId); + confList.add(conf); + } + return confList; + } + + + private static boolean isPrimaryIdObjectId(MongoClient mongoClient, String dbName, String collName) { + MongoDatabase database = mongoClient.getDatabase(dbName); + MongoCollection col = database.getCollection(collName); + Document doc = col.find().limit(1).first(); + Object id = doc.get(KeyConstant.MONGO_PRIMARY_ID); + if (id instanceof ObjectId) { + return true; + } + return false; + } + + // split the collection into multiple chunks, each chunk specifies a range + private static List doSplitCollection(int adviceNumber, MongoClient mongoClient, + String dbName, String collName, boolean isObjectId) { + + MongoDatabase database = mongoClient.getDatabase(dbName); + List rangeList = new ArrayList(); + if (adviceNumber == 1) { + Range range = new Range(); + range.lowerBound = "min"; + range.upperBound = "max"; + return Arrays.asList(range); + } + + Document result = database.runCommand(new Document("collStats", collName)); + int docCount = result.getInteger("count"); + if (docCount == 0) { + return rangeList; + } + int avgObjSize = 1; + Object avgObjSizeObj = result.get("avgObjSize"); + if (avgObjSizeObj instanceof Integer) { + avgObjSize = ((Integer) avgObjSizeObj).intValue(); + } else if (avgObjSizeObj instanceof Double) { + avgObjSize = ((Double) avgObjSizeObj).intValue(); + } + int splitPointCount = adviceNumber - 1; + int chunkDocCount = docCount / adviceNumber; + ArrayList splitPoints = new ArrayList(); + + // test if user has splitVector role(clusterManager) + boolean supportSplitVector = true; + try { + database.runCommand(new Document("splitVector", dbName + "." + collName) + .append("keyPattern", new Document(KeyConstant.MONGO_PRIMARY_ID, 1)) + .append("force", true)); + } catch (MongoCommandException e) { + if (e.getErrorCode() == KeyConstant.MONGO_UNAUTHORIZED_ERR_CODE || + e.getErrorCode() == KeyConstant.MONGO_ILLEGALOP_ERR_CODE) { + supportSplitVector = false; + } + } + + if (supportSplitVector) { + boolean forceMedianSplit = false; + int maxChunkSize = (docCount / splitPointCount - 1) * 2 * avgObjSize / (1024 * 1024); + //int maxChunkSize = (chunkDocCount - 1) * 2 * avgObjSize / (1024 * 1024); + if (maxChunkSize < 1) { + forceMedianSplit = true; + } + if (!forceMedianSplit) { + result = database.runCommand(new Document("splitVector", dbName + "." 
+ collName) + .append("keyPattern", new Document(KeyConstant.MONGO_PRIMARY_ID, 1)) + .append("maxChunkSize", maxChunkSize) + .append("maxSplitPoints", adviceNumber - 1)); + } else { + result = database.runCommand(new Document("splitVector", dbName + "." + collName) + .append("keyPattern", new Document(KeyConstant.MONGO_PRIMARY_ID, 1)) + .append("force", true)); + } + ArrayList splitKeys = result.get("splitKeys", ArrayList.class); + + for (int i = 0; i < splitKeys.size(); i++) { + Document splitKey = splitKeys.get(i); + Object id = splitKey.get(KeyConstant.MONGO_PRIMARY_ID); + if (isObjectId) { + ObjectId oid = (ObjectId)id; + splitPoints.add(oid.toHexString()); + } else { + splitPoints.add(id); + } + } + } else { + int skipCount = chunkDocCount; + MongoCollection col = database.getCollection(collName); + + for (int i = 0; i < splitPointCount; i++) { + Document doc = col.find().skip(skipCount).limit(chunkDocCount).first(); + Object id = doc.get(KeyConstant.MONGO_PRIMARY_ID); + if (isObjectId) { + ObjectId oid = (ObjectId)id; + splitPoints.add(oid.toHexString()); + } else { + splitPoints.add(id); + } + skipCount += chunkDocCount; + } + } + + Object lastObjectId = "min"; + for (Object splitPoint : splitPoints) { + Range range = new Range(); + range.lowerBound = lastObjectId; + lastObjectId = splitPoint; + range.upperBound = lastObjectId; + rangeList.add(range); + } + Range range = new Range(); + range.lowerBound = lastObjectId; + range.upperBound = "max"; + rangeList.add(range); + + return rangeList; + } +} + +class Range { + Object lowerBound; + Object upperBound; +} diff --git a/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/MongoUtil.java b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/MongoUtil.java new file mode 100644 index 0000000000..ae7a2dd3c9 --- /dev/null +++ b/mongodbreader/src/main/java/com/alibaba/datax/plugin/reader/mongodbreader/util/MongoUtil.java @@ -0,0 +1,90 @@ +package com.alibaba.datax.plugin.reader.mongodbreader.util; + +import java.net.UnknownHostException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.mongodbreader.KeyConstant; +import com.alibaba.datax.plugin.reader.mongodbreader.MongoDBReaderErrorCode; + +import com.mongodb.MongoClient; +import com.mongodb.MongoCredential; +import com.mongodb.ServerAddress; + +/** + * Created by jianying.wcj on 2015/3/17 0017. + * Modified by mingyan.zc on 2016/6/13. 
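+ * 负责根据 address 配置构造 MongoClient:支持匿名连接与用户名/密码鉴权两种方式,并完成 host:port 地址的解析与校验。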
+ */ +public class MongoUtil { + + public static MongoClient initMongoClient(Configuration conf) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(addressList == null || addressList.size() <= 0) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + return new MongoClient(parseServerAddress(addressList)); + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + + public static MongoClient initCredentialMongoClient(Configuration conf, String userName, String password, String database) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(!isHostPortPattern(addressList)) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + MongoCredential credential = MongoCredential.createCredential(userName, database, password.toCharArray()); + return new MongoClient(parseServerAddress(addressList), Arrays.asList(credential)); + + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBReaderErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + /** + * 判断地址类型是否符合要求 + * @param addressList + * @return + */ + private static boolean isHostPortPattern(List addressList) { + for(Object address : addressList) { + String regex = "(\\S+):([0-9]+)"; + if(!((String)address).matches(regex)) { + return false; + } + } + return true; + } + /** + * 转换为mongo地址协议 + * @param rawAddressList + * @return + */ + private static List parseServerAddress(List rawAddressList) throws UnknownHostException{ + List addressList = new ArrayList(); + for(Object address : rawAddressList) { + String[] tempAddress = ((String)address).split(":"); + try { + ServerAddress sa = new ServerAddress(tempAddress[0],Integer.valueOf(tempAddress[1])); + addressList.add(sa); + } catch (Exception e) { + throw new UnknownHostException(); + } + } + return addressList; + } +} diff --git a/mongodbreader/src/main/resources/plugin.json b/mongodbreader/src/main/resources/plugin.json new file mode 100644 index 0000000000..dc270008bf --- /dev/null +++ b/mongodbreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mongodbreader", + "class": "com.alibaba.datax.plugin.reader.mongodbreader.MongoDBReader", + "description": "useScene: prod. 
mechanism: via mongoclient connect mongodb reader data concurrent.", + "developer": "alibaba" +} diff --git a/mongodbreader/src/main/resources/plugin_job_template.json b/mongodbreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..3679361391 --- /dev/null +++ b/mongodbreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,11 @@ +{ + "name": "mongodbreader", + "parameter": { + "address": [], + "userName": "", + "userPassword": "", + "dbName": "", + "collectionName": "", + "column": [] + } +} \ No newline at end of file diff --git a/mongodbwriter/doc/mongodbwriter.md b/mongodbwriter/doc/mongodbwriter.md new file mode 100644 index 0000000000..e30008dbd8 --- /dev/null +++ b/mongodbwriter/doc/mongodbwriter.md @@ -0,0 +1,157 @@ +### Datax MongoDBWriter +#### 1 快速介绍 + +MongoDBWriter 插件利用 MongoDB 的java客户端MongoClient进行MongoDB的写操作。最新版本的Mongo已经将DB锁的粒度从DB级别降低到document级别,配合上MongoDB强大的索引功能,基本可以满足数据源向MongoDB写入数据的需求,针对数据更新的需求,通过配置业务主键的方式也可以实现。 + +#### 2 实现原理 + +MongoDBWriter通过Datax框架获取Reader生成的数据,然后将Datax支持的类型通过逐一判断转换成MongoDB支持的类型。其中一个值得指出的点就是Datax本身不支持数组类型,但是MongoDB支持数组类型,并且数组类型的索引还是蛮强大的。为了使用MongoDB的数组类型,则可以通过参数的特殊配置,将字符串可以转换成MongoDB中的数组。类型转换之后,就可以依托于Datax框架并行的写入MongoDB。 + +#### 3 功能说明 +* 该示例从ODPS读一份数据到MongoDB。 + + { + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "********", + "accessKey": "*********", + "project": "tb_ai_recommendation", + "table": "jianying_tag_datax_test", + "column": [ + "unique_id", + "sid", + "user_id", + "auction_id", + "content_type", + "pool_type", + "frontcat_id", + "categoryid", + "gmt_create", + "taglist", + "property", + "scorea", + "scoreb" + ], + "splitMode": "record", + "odpsServer": "http://xxx/api" + } + }, + "writer": { + "name": "mongodbwriter", + "parameter": { + "address": [ + "127.0.0.1:27017" + ], + "userName": "", + "userPassword": "", + "dbName": "tag_per_data", + "collectionName": "tag_data", + "column": [ + { + "name": "unique_id", + "type": "string" + }, + { + "name": "sid", + "type": "string" + }, + { + "name": "user_id", + "type": "string" + }, + { + "name": "auction_id", + "type": "string" + }, + { + "name": "content_type", + "type": "string" + }, + { + "name": "pool_type", + "type": "string" + }, + { + "name": "frontcat_id", + "type": "Array", + "splitter": " " + }, + { + "name": "categoryid", + "type": "Array", + "splitter": " " + }, + { + "name": "gmt_create", + "type": "string" + }, + { + "name": "taglist", + "type": "Array", + "splitter": " " + }, + { + "name": "property", + "type": "string" + }, + { + "name": "scorea", + "type": "int" + }, + { + "name": "scoreb", + "type": "int" + }, + { + "name": "scorec", + "type": "int" + } + ], + "upsertInfo": { + "isUpsert": "true", + "upsertKey": "unique_id" + } + } + } + } + ] + } + } + +#### 4 参数说明 + +* address: MongoDB的数据地址信息,因为MonogDB可能是个集群,则ip端口信息需要以Json数组的形式给出。【必填】 +* userName:MongoDB的用户名。【选填】 +* userPassword: MongoDB的密码。【选填】 +* collectionName: MonogoDB的集合名。【必填】 +* column:MongoDB的文档列名。【必填】 +* name:Column的名字。【必填】 +* type:Column的类型。【选填】 +* splitter:特殊分隔符,当且仅当要处理的字符串要用分隔符分隔为字符数组时,才使用这个参数,通过这个参数指定的分隔符,将字符串分隔存储到MongoDB的数组中。【选填】 +* upsertInfo:指定了传输数据时更新的信息。【选填】 +* isUpsert:当设置为true时,表示针对相同的upsertKey做更新操作。【选填】 +* upsertKey:upsertKey指定了没行记录的业务主键。用来做更新时使用。【选填】 + +#### 5 类型转换 + +| DataX 内部类型| MongoDB 数据类型 | +| -------- | ----- | +| Long | int, Long | +| Double | double | +| String | string, array | +| Date | date | +| Boolean | boolean | +| Bytes | bytes | + + +#### 6 
性能报告 +#### 7 测试报告 \ No newline at end of file diff --git a/mongodbwriter/pom.xml b/mongodbwriter/pom.xml new file mode 100644 index 0000000000..0360db2a15 --- /dev/null +++ b/mongodbwriter/pom.xml @@ -0,0 +1,88 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + mongodbwriter + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + org.mongodb + mongo-java-driver + 3.2.2 + + + com.google.guava + guava + 16.0.1 + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/mongodbwriter/src/main/assembly/package.xml b/mongodbwriter/src/main/assembly/package.xml new file mode 100644 index 0000000000..9225be35e9 --- /dev/null +++ b/mongodbwriter/src/main/assembly/package.xml @@ -0,0 +1,36 @@ + + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/mongodbwriter + + + target/ + + mongodbwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/mongodbwriter + + + + + + false + plugin/writer/mongodbwriter/libs + runtime + + + diff --git a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/KeyConstant.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/KeyConstant.java new file mode 100644 index 0000000000..40de3124ed --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/KeyConstant.java @@ -0,0 +1,88 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter; + +public class KeyConstant { + /** + * mongodb 的 host 地址 + */ + public static final String MONGO_ADDRESS = "address"; + /** + * 数组类型 + */ + public static final String ARRAY_TYPE = "array"; + /** + * ObjectId类型 + */ + public static final String OBJECT_ID_TYPE = "objectid"; + /** + * mongodb 的用户名 + */ + public static final String MONGO_USER_NAME = "userName"; + /** + * mongodb 密码 + */ + public static final String MONGO_USER_PASSWORD = "userPassword"; + /** + * mongodb 数据库名 + */ + public static final String MONGO_DB_NAME = "dbName"; + /** + * mongodb 集合名 + */ + public static final String MONGO_COLLECTION_NAME = "collectionName"; + /** + * mongodb 的列 + */ + public static final String MONGO_COLUMN = "column"; + /** + * 每个列的名字 + */ + public static final String COLUMN_NAME = "name"; + /** + * 每个列的类型 + */ + public static final String COLUMN_TYPE = "type"; + /** + * 数组中每个元素的类型 + */ + public static final String ITEM_TYPE = "itemtype"; + /** + * 列分隔符 + */ + public static final String COLUMN_SPLITTER = "splitter"; + /** + * 数据更新列信息 + */ + public static final String WRITE_MODE = "writeMode"; + /** + * 有相同的记录是否覆盖,默认为false + */ + public static final String IS_REPLACE = "isReplace"; + /** + * 指定用来判断是否覆盖的 业务主键 + */ + public static final String UNIQUE_KEY = "replaceKey"; + /** + * 判断是否为数组类型 + * @param type 数据类型 + * @return + */ + public static boolean isArrayType(String type) { + return ARRAY_TYPE.equals(type); + } + /** + * 判断是否为ObjectId类型 + * @param type 数据类型 + * @return + */ + public static boolean isObjectIdType(String type) { + return OBJECT_ID_TYPE.equals(type); + } + /** + * 判断一个值是否为true + * @param value + * @return + */ + public static boolean isValueTrue(String value){ + 
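+        // 与字符串 "true" 做等值比较;value 为 null 时安全地返回 false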
return "true".equals(value); + } +} diff --git a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriter.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriter.java new file mode 100644 index 0000000000..66c75078d3 --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriter.java @@ -0,0 +1,340 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.writer.Key; +import com.alibaba.datax.plugin.writer.mongodbwriter.util.MongoUtil; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONArray; +import com.alibaba.fastjson.JSONObject; +import com.google.common.base.Strings; +import com.mongodb.*; +import com.mongodb.client.MongoCollection; +import com.mongodb.client.MongoDatabase; +import com.mongodb.client.model.BulkWriteOptions; +import com.mongodb.client.model.ReplaceOneModel; +import com.mongodb.client.model.UpdateOptions; +import org.bson.types.ObjectId; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +public class MongoDBWriter extends Writer{ + + public static class Job extends Writer.Job { + + private Configuration originalConfig = null; + + @Override + public List split(int mandatoryNumber) { + List configList = new ArrayList(); + for(int i = 0; i < mandatoryNumber; i++) { + configList.add(this.originalConfig.clone()); + } + return configList; + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + } + + @Override + public void prepare() { + super.prepare(); + } + + @Override + public void destroy() { + + } + } + + public static class Task extends Writer.Task { + + private static final Logger logger = LoggerFactory.getLogger(Task.class); + private Configuration writerSliceConfig; + + private MongoClient mongoClient; + + private String userName = null; + private String password = null; + + private String database = null; + private String collection = null; + private Integer batchSize = null; + private JSONArray mongodbColumnMeta = null; + private JSONObject writeMode = null; + private static int BATCH_SIZE = 1000; + + @Override + public void prepare() { + super.prepare(); + //获取presql配置,并执行 + String preSql = writerSliceConfig.getString(Key.PRE_SQL); + if(Strings.isNullOrEmpty(preSql)) { + return; + } + Configuration conConf = Configuration.from(preSql); + if(Strings.isNullOrEmpty(database) || Strings.isNullOrEmpty(collection) + || mongoClient == null || mongodbColumnMeta == null || batchSize == null) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + MongoDatabase db = mongoClient.getDatabase(database); + MongoCollection col = db.getCollection(this.collection); + String type = conConf.getString("type"); + if (Strings.isNullOrEmpty(type)){ + return; + } + if (type.equals("drop")){ + col.drop(); + } else if (type.equals("remove")){ + String json = conConf.getString("json"); + BasicDBObject query; + if (Strings.isNullOrEmpty(json)) { + query = new BasicDBObject(); + List items = conConf.getList("item", Object.class); + for (Object con : items) { + Configuration _conf = 
Configuration.from(con.toString()); + if (Strings.isNullOrEmpty(_conf.getString("condition"))) { + query.put(_conf.getString("name"), _conf.get("value")); + } else { + query.put(_conf.getString("name"), + new BasicDBObject(_conf.getString("condition"), _conf.get("value"))); + } + } +// and { "pv" : { "$gt" : 200 , "$lt" : 3000} , "pid" : { "$ne" : "xxx"}} +// or { "$or" : [ { "age" : { "$gt" : 27}} , { "age" : { "$lt" : 15}}]} + } else { + query = (BasicDBObject) com.mongodb.util.JSON.parse(json); + } + col.deleteMany(query); + } + if(logger.isDebugEnabled()) { + logger.debug("After job prepare(), originalConfig now is:[\n{}\n]", writerSliceConfig.toJSON()); + } + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + if(Strings.isNullOrEmpty(database) || Strings.isNullOrEmpty(collection) + || mongoClient == null || mongodbColumnMeta == null || batchSize == null) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + MongoDatabase db = mongoClient.getDatabase(database); + MongoCollection col = db.getCollection(this.collection, BasicDBObject.class); + List writerBuffer = new ArrayList(this.batchSize); + Record record = null; + while((record = lineReceiver.getFromReader()) != null) { + writerBuffer.add(record); + if(writerBuffer.size() >= this.batchSize) { + doBatchInsert(col,writerBuffer,mongodbColumnMeta); + writerBuffer.clear(); + } + } + if(!writerBuffer.isEmpty()) { + doBatchInsert(col,writerBuffer,mongodbColumnMeta); + writerBuffer.clear(); + } + } + + private void doBatchInsert(MongoCollection collection, List writerBuffer, JSONArray columnMeta) { + + List dataList = new ArrayList(); + + for(Record record : writerBuffer) { + + BasicDBObject data = new BasicDBObject(); + + for(int i = 0; i < record.getColumnNumber(); i++) { + + String type = columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_TYPE); + //空记录处理 + if (Strings.isNullOrEmpty(record.getColumn(i).asString())) { + if (KeyConstant.isArrayType(type.toLowerCase())) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), new Object[0]); + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString()); + } + continue; + } + if (Column.Type.INT.name().equalsIgnoreCase(type)) { + //int是特殊类型, 其他类型按照保存时Column的类型进行处理 + try { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), + Integer.parseInt( + String.valueOf(record.getColumn(i).getRawData()))); + } catch (Exception e) { + super.getTaskPluginCollector().collectDirtyRecord(record, e); + } + } else if(record.getColumn(i) instanceof StringColumn){ + //处理ObjectId和数组类型 + try { + if (KeyConstant.isObjectIdType(type.toLowerCase())) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), + new ObjectId(record.getColumn(i).asString())); + } else if (KeyConstant.isArrayType(type.toLowerCase())) { + String splitter = columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_SPLITTER); + if (Strings.isNullOrEmpty(splitter)) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + String itemType = columnMeta.getJSONObject(i).getString(KeyConstant.ITEM_TYPE); + if (itemType != null && !itemType.isEmpty()) { + //如果数组指定类型不为空,将其转换为指定类型 + String[] item = record.getColumn(i).asString().split(splitter); + if (itemType.equalsIgnoreCase(Column.Type.DOUBLE.name())) { + ArrayList list = new 
ArrayList(); + for (String s : item) { + list.add(Double.parseDouble(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Double[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.INT.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Integer.parseInt(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Integer[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.LONG.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Long.parseLong(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Long[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.BOOL.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Boolean.parseBoolean(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Boolean[0])); + } else if (itemType.equalsIgnoreCase(Column.Type.BYTES.name())) { + ArrayList list = new ArrayList(); + for (String s : item) { + list.add(Byte.parseByte(s)); + } + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), list.toArray(new Byte[0])); + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString().split(splitter)); + } + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString().split(splitter)); + } + } else if(type.toLowerCase().equalsIgnoreCase("json")) { + //如果是json类型,将其进行转换 + Object mode = com.mongodb.util.JSON.parse(record.getColumn(i).asString()); + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),JSON.toJSON(mode)); + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), record.getColumn(i).asString()); + } + } catch (Exception e) { + super.getTaskPluginCollector().collectDirtyRecord(record, e); + } + } else if(record.getColumn(i) instanceof LongColumn) { + + if (Column.Type.LONG.name().equalsIgnoreCase(type)) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asLong()); + } else { + super.getTaskPluginCollector().collectDirtyRecord(record, "record's [" + i + "] column's type should be: " + type); + } + + } else if(record.getColumn(i) instanceof DateColumn) { + + if (Column.Type.DATE.name().equalsIgnoreCase(type)) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), + record.getColumn(i).asDate()); + } else { + super.getTaskPluginCollector().collectDirtyRecord(record, "record's [" + i + "] column's type should be: " + type); + } + + } else if(record.getColumn(i) instanceof DoubleColumn) { + + if (Column.Type.DOUBLE.name().equalsIgnoreCase(type)) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), + record.getColumn(i).asDouble()); + } else { + super.getTaskPluginCollector().collectDirtyRecord(record, "record's [" + i + "] column's type should be: " + type); + } + + } else if(record.getColumn(i) instanceof BoolColumn) { + + if (Column.Type.BOOL.name().equalsIgnoreCase(type)) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), + record.getColumn(i).asBoolean()); + } else { + super.getTaskPluginCollector().collectDirtyRecord(record, "record's [" + i + "] column's type should be: " + type); + } + + } else if(record.getColumn(i) instanceof BytesColumn) { + + if 
(Column.Type.BYTES.name().equalsIgnoreCase(type)) { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME), + record.getColumn(i).asBytes()); + } else { + super.getTaskPluginCollector().collectDirtyRecord(record, "record's [" + i + "] column's type should be: " + type); + } + + } else { + data.put(columnMeta.getJSONObject(i).getString(KeyConstant.COLUMN_NAME),record.getColumn(i).asString()); + } + } + dataList.add(data); + } + /** + * 如果存在重复的值覆盖 + */ + if(this.writeMode != null && + this.writeMode.getString(KeyConstant.IS_REPLACE) != null && + KeyConstant.isValueTrue(this.writeMode.getString(KeyConstant.IS_REPLACE))) { + String uniqueKey = this.writeMode.getString(KeyConstant.UNIQUE_KEY); + if(!Strings.isNullOrEmpty(uniqueKey)) { + List> replaceOneModelList = new ArrayList>(); + for(BasicDBObject data : dataList) { + BasicDBObject query = new BasicDBObject(); + if(uniqueKey != null) { + query.put(uniqueKey,data.get(uniqueKey)); + } + ReplaceOneModel replaceOneModel = new ReplaceOneModel(query, data, new UpdateOptions().upsert(true)); + replaceOneModelList.add(replaceOneModel); + } + collection.bulkWrite(replaceOneModelList, new BulkWriteOptions().ordered(false)); + } else { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE, + MongoDBWriterErrorCode.ILLEGAL_VALUE.getDescription()); + } + } else { + collection.insertMany(dataList); + } + } + + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.userName = writerSliceConfig.getString(KeyConstant.MONGO_USER_NAME); + this.password = writerSliceConfig.getString(KeyConstant.MONGO_USER_PASSWORD); + this.database = writerSliceConfig.getString(KeyConstant.MONGO_DB_NAME); + if(!Strings.isNullOrEmpty(userName) && !Strings.isNullOrEmpty(password)) { + this.mongoClient = MongoUtil.initCredentialMongoClient(this.writerSliceConfig,userName,password,database); + } else { + this.mongoClient = MongoUtil.initMongoClient(this.writerSliceConfig); + } + this.collection = writerSliceConfig.getString(KeyConstant.MONGO_COLLECTION_NAME); + this.batchSize = BATCH_SIZE; + this.mongodbColumnMeta = JSON.parseArray(writerSliceConfig.getString(KeyConstant.MONGO_COLUMN)); + this.writeMode = JSON.parseObject(writerSliceConfig.getString(KeyConstant.WRITE_MODE)); + } + + @Override + public void destroy() { + mongoClient.close(); + } + } + +} diff --git a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriterErrorCode.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriterErrorCode.java new file mode 100644 index 0000000000..bff743c4e4 --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/MongoDBWriterErrorCode.java @@ -0,0 +1,30 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum MongoDBWriterErrorCode implements ErrorCode { + + ILLEGAL_VALUE("ILLEGAL_PARAMETER_VALUE","参数不合法"), + ILLEGAL_ADDRESS("ILLEGAL_ADDRESS","不合法的Mongo地址"), + JSONCAST_EXCEPTION("JSONCAST_EXCEPTION","json类型转换异常"), + UNEXCEPT_EXCEPTION("UNEXCEPT_EXCEPTION","未知异常"); + + private final String code; + + private final String description; + + private MongoDBWriterErrorCode(String code,String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return code; + } + + @Override + public String getDescription() { + return description; + } +} diff --git 
a/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/util/MongoUtil.java b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/util/MongoUtil.java new file mode 100644 index 0000000000..17334be403 --- /dev/null +++ b/mongodbwriter/src/main/java/com/alibaba/datax/plugin/writer/mongodbwriter/util/MongoUtil.java @@ -0,0 +1,95 @@ +package com.alibaba.datax.plugin.writer.mongodbwriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.mongodbwriter.KeyConstant; +import com.alibaba.datax.plugin.writer.mongodbwriter.MongoDBWriterErrorCode; +import com.mongodb.MongoClient; +import com.mongodb.MongoCredential; +import com.mongodb.ServerAddress; + +import java.net.UnknownHostException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +public class MongoUtil { + + public static MongoClient initMongoClient(Configuration conf) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(addressList == null || addressList.size() <= 0) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + return new MongoClient(parseServerAddress(addressList)); + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + + public static MongoClient initCredentialMongoClient(Configuration conf,String userName,String password,String database) { + + List addressList = conf.getList(KeyConstant.MONGO_ADDRESS); + if(!isHostPortPattern(addressList)) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } + try { + MongoCredential credential = MongoCredential.createCredential(userName, database, password.toCharArray()); + return new MongoClient(parseServerAddress(addressList), Arrays.asList(credential)); + + } catch (UnknownHostException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_ADDRESS,"不合法的地址"); + } catch (NumberFormatException e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.ILLEGAL_VALUE,"不合法参数"); + } catch (Exception e) { + throw DataXException.asDataXException(MongoDBWriterErrorCode.UNEXCEPT_EXCEPTION,"未知异常"); + } + } + /** + * 判断地址类型是否符合要求 + * @param addressList + * @return + */ + private static boolean isHostPortPattern(List addressList) { + for(Object address : addressList) { + String regex = "(\\S+):([0-9]+)"; + if(!((String)address).matches(regex)) { + return false; + } + } + return true; + } + /** + * 转换为mongo地址协议 + * @param rawAddressList + * @return + */ + private static List parseServerAddress(List rawAddressList) throws UnknownHostException{ + List addressList = new ArrayList(); + for(Object address : rawAddressList) { + String[] tempAddress = ((String)address).split(":"); + try { + ServerAddress sa = new ServerAddress(tempAddress[0],Integer.valueOf(tempAddress[1])); + addressList.add(sa); + } catch (Exception e) { + throw new UnknownHostException(); + } + } + return addressList; + } + + public static void main(String[] args) { + try { + ArrayList hostAddress = new ArrayList(); + hostAddress.add("127.0.0.1:27017"); + 
System.out.println(MongoUtil.isHostPortPattern(hostAddress)); + } catch (Exception e) { + e.printStackTrace(); + } + } +} diff --git a/mongodbwriter/src/main/resources/plugin.json b/mongodbwriter/src/main/resources/plugin.json new file mode 100644 index 0000000000..9d830b6fab --- /dev/null +++ b/mongodbwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mongodbwriter", + "class": "com.alibaba.datax.plugin.writer.mongodbwriter.MongoDBWriter", + "description": "useScene: prod. mechanism: via mongoclient connect mongodb write data concurrent.", + "developer": "alibaba" +} diff --git a/mongodbwriter/src/main/resources/plugin_job_template.json b/mongodbwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..d4ba4bf1fc --- /dev/null +++ b/mongodbwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "mongodbwriter", + "parameter": { + "address": [], + "userName": "", + "userPassword": "", + "dbName": "", + "collectionName": "", + "column": [], + "upsertInfo": { + "isUpsert": "", + "upsertKey": "" + } + } +} \ No newline at end of file diff --git a/mysqlreader/doc/mysqlreader.md b/mysqlreader/doc/mysqlreader.md new file mode 100644 index 0000000000..3ae52afbf2 --- /dev/null +++ b/mysqlreader/doc/mysqlreader.md @@ -0,0 +1,368 @@ + +# MysqlReader 插件文档 + + +___ + + + +## 1 快速介绍 + +MysqlReader插件实现了从Mysql读取数据。在底层实现上,MysqlReader通过JDBC连接远程Mysql数据库,并执行相应的sql语句将数据从mysql库中SELECT出来。 + +**不同于其他关系型数据库,MysqlReader不支持FetchSize.** + +## 2 实现原理 + +简而言之,MysqlReader通过JDBC连接器连接到远程的Mysql数据库,并根据用户配置的信息生成查询SELECT SQL语句,然后发送到远程Mysql数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,MysqlReader将其拼接为SQL语句发送到Mysql数据库;对于用户配置querySql信息,MysqlReader直接将其发送到Mysql数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从Mysql数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 3 + }, + "errorLimit": { + "record": 0, + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "mysqlreader", + "parameter": { + "username": "root", + "password": "root", + "column": [ + "id", + "name" + ], + "splitPk": "db_id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:mysql://127.0.0.1:3306/database" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print":true + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "channel":1 + } + }, + "content": [ + { + "reader": { + "name": "mysqlreader", + "parameter": { + "username": "root", + "password": "root", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:mysql://bad_ip:3306/database", + "jdbc:mysql://127.0.0.1:bad_port/database", + "jdbc:mysql://127.0.0.1:3306/database" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,MysqlReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,MysqlReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照Mysql官方规范,并可以填写连接附件控制信息。具体请参看[Mysql官方文档](http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,MysqlReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用\*代表默认使用所有列配置,例如['\*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照Mysql SQL语法格式: + ["id", "\`table\`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"] + id为普通列名,\`table\`为包含保留字的列名,1为整型数字常量,'bazhen.csy'为字符串常量,null为空指针,to_char(a + 1)为表达式,2.3为浮点数,true为布尔值。 + + * 必选:是
+ + * 默认值:无
+ +* **splitPk** + + * 描述:MysqlReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提高数据同步的效能。 + + 推荐用户使用表主键作为splitPk,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整型数据切分,`不支持浮点、字符串、日期等其他类型`。如果用户指定其他非支持类型,MysqlReader将报错! + + 如果splitPk不填写,包括不提供splitPk或者splitPk值为空,DataX视作使用单通道同步该表数据。splitPk与where组合使用的配置片段,可参考本节末尾的示例。 + + * 必选:否
+ + * 默认值:空
+ +* **where** + + * 描述:筛选条件,MysqlReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。注意:不可以将where条件指定为limit 10,limit不是SQL的合法where子句。
+ + where条件可以有效地进行业务增量同步。如果不填写where语句,包括不提供where的key或者value,DataX均视作同步全量数据。 + + * 必选:否
+ + * 默认值:无
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置项来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table、column这些配置项,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据时,可使用select a,b from table_a join table_b on table_a.id = table_b.id。
+ + `当用户配置querySql时,MysqlReader直接忽略table、column、where条件的配置`,querySql优先级大于table、column、where选项。 + + * 必选:否
+ + * 默认值:无
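+
+下面给出一个仅作示意的 reader 配置片段,演示 splitPk 与 where 搭配进行并发抽取和增量筛选的写法(其中库名 database、表名 orders、列名 id/name/gmt_modified 以及时间条件均为虚构示例,请按实际库表替换):
+
+```json
+{
+    "reader": {
+        "name": "mysqlreader",
+        "parameter": {
+            "username": "root",
+            "password": "root",
+            "column": ["id", "name", "gmt_modified"],
+            "splitPk": "id",
+            "where": "gmt_modified >= '2015-01-01 00:00:00'",
+            "connection": [
+                {
+                    "table": ["orders"],
+                    "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/database"]
+                }
+            ]
+        }
+    }
+}
+```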
+ + +### 3.3 类型转换 + +目前MysqlReader支持大部分Mysql类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出MysqlReader针对Mysql类型转换列表: + + +| DataX 内部类型| Mysql 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext, year | +| Date |date, datetime, timestamp, time | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 +* `tinyint(1) DataX视作为整形`。 +* `year DataX视作为字符串类型` +* `bit DataX属于未定义行为`。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + CREATE TABLE `tc_biz_vertical_test_0000` ( + `biz_order_id` bigint(20) NOT NULL COMMENT 'id', + `key_value` varchar(4000) NOT NULL COMMENT 'Key-value的内容', + `gmt_create` datetime NOT NULL COMMENT '创建时间', + `gmt_modified` datetime NOT NULL COMMENT '修改时间', + `attribute_cc` int(11) DEFAULT NULL COMMENT '防止并发修改的标志', + `value_type` int(11) NOT NULL DEFAULT '0' COMMENT '类型', + `buyer_id` bigint(20) DEFAULT NULL COMMENT 'buyerid', + `seller_id` bigint(20) DEFAULT NULL COMMENT 'seller_id', + PRIMARY KEY (`biz_order_id`,`value_type`), + KEY `idx_biz_vertical_gmtmodified` (`gmt_modified`) + ) ENGINE=InnoDB DEFAULT CHARSET=gbk COMMENT='tc_biz_vertical' + + +单行记录类似于: + + biz_order_id: 888888888 + key_value: ;orderIds:20148888888,2014888888813800; + gmt_create: 2011-09-24 11:07:20 + gmt_modified: 2011-10-24 17:56:34 + attribute_cc: 1 + value_type: 3 + buyer_id: 8888888 + seller_id: 1 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz + 2. mem: 48GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* Mysql数据库机器参数为: + 1. cpu: 32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz + 2. mem: 256GB + 3. net: 千兆双网卡 + 4. disc: BTWL419303E2800RGN INTEL SSDSC2BB800G4 D2010370 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数| 是否按照主键切分| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡进入流量(MB/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------|--------| --------|--------|--------|--------|--------|--------| +|1| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|1| 是 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|4| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|4| 是 | 329733 | 32.60 | 58| 0.8 | 60| 0.76 | +|8| 否 | 183185 | 18.11 | 29| 0.6 | 31| 0.6 | +|8| 是 | 549556 | 54.33 | 115| 1.46 | 120| 0.78 | + +说明: + +1. 这里的单表,主键类型为 bigint(20),范围为:190247559466810-570722244711460,从主键范围划分看,数据分布均匀。 +2. 
对单表如果没有安装主键切分,那么配置通道个数不会提升速度,效果与1个通道一样。 + + +#### 4.2.2 分表测试报告(2个分库,每个分库16张分表,共计32张分表) + + +| 通道数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡进入流量(MB/s)|DataX机器运行负载|DB网卡流出流量(MB/s)|DB运行负载| +|--------| --------|--------|--------|--------|--------|--------| +|1| 202241 | 20.06 | 31.5| 1.0 | 32 | 1.1 | +|4| 726358 | 72.04 | 123.9 | 3.1 | 132 | 3.6 | +|8|1074405 | 106.56| 197 | 5.5 | 205| 5.1| +|16| 1227892 | 121.79 | 229.2 | 8.1 | 233 | 7.3 | + +## 5 约束限制 + +### 5.1 主备同步数据恢复问题 + +主备同步问题指Mysql使用主从灾备,备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差,特别在于某些特定情况,例如网络延迟等问题,导致备库同步恢复的数据与主库有较大差别,导致从备库同步的数据不是一份当前时间的完整镜像。 + +针对这个问题,我们提供了preSql功能,该功能待补充。 + +### 5.2 一致性约束 + +Mysql在数据存储划分中属于RDBMS系统,对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中,当该库存在其他数据写入方写入数据时,MysqlReader完全不会获取到写入更新数据,这是由于数据库本身的快照特性决定的。关于数据库快照特性,请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) + +上述是在MysqlReader单线程模型下数据同步一致性的特性,由于MysqlReader可以根据用户配置信息使用了并发数据抽取,因此不能严格保证数据一致性:当MysqlReader根据splitPk进行数据切分后,会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务,同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。 + +针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择: + +1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。 + +2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。 + +### 5.3 数据库编码问题 + +Mysql本身的编码设置非常灵活,包括指定编码到库、表、字段级别,甚至可以均不同编码。优先级从高到低为字段、表、库、实例。我们不推荐数据库用户设置如此混乱的编码,最好在库级别就统一到UTF-8。 + +MysqlReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此MysqlReader不需用户指定编码,可以自动获取编码并转码。 + +对于Mysql底层写入编码和其设定的编码不一致的混乱情况,MysqlReader对此无法识别,对此也无法提供解决方案,对于这类情况,`导出有可能为乱码`。 + +### 5.4 增量数据同步 + +MysqlReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,MysqlReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,MysqlReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,MysqlReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.5 Sql安全性 + +MysqlReader提供querySql语句交给用户自己实现SELECT抽取语句,MysqlReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + +*** + +**Q: MysqlReader同步报错,报错信息为XXX** + + A: 网络或者权限问题,请使用mysql命令行测试: + + mysql -u -p -h -D -e "select * from <表名>" + +如果上述命令也报错,那可以证实是环境问题,请联系你的DBA。 + + diff --git a/mysqlreader/pom.xml b/mysqlreader/pom.xml new file mode 100755 index 0000000000..aa45912240 --- /dev/null +++ b/mysqlreader/pom.xml @@ -0,0 +1,81 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + mysqlreader + mysqlreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + mysql + mysql-connector-java + 5.1.34 + + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/mysqlreader/src/main/assembly/package.xml b/mysqlreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..3b35d9381f --- /dev/null +++ b/mysqlreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/mysqlreader + + + target/ + + mysqlreader-0.0.1-SNAPSHOT.jar + + plugin/reader/mysqlreader + + + + + + false + plugin/reader/mysqlreader/libs + runtime + + + diff --git a/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReader.java b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReader.java new file 
mode 100755 index 0000000000..9dfff9c181 --- /dev/null +++ b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReader.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.plugin.reader.mysqlreader; + +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public class MysqlReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.MySql; + + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + Integer userConfigedFetchSize = this.originalConfig.getInt(Constant.FETCH_SIZE); + if (userConfigedFetchSize != null) { + LOG.warn("对 mysqlreader 不需要配置 fetchSize, mysqlreader 将会忽略这项配置. 如果您不想再看到此警告,请去除fetchSize 配置."); + } + + this.originalConfig.set(Constant.FETCH_SIZE, Integer.MIN_VALUE); + + this.commonRdbmsReaderJob = new CommonRdbmsReader.Job(DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + } + + @Override + public void preCheck(){ + init(); + this.commonRdbmsReaderJob.preCheck(this.originalConfig,DATABASE_TYPE); + + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderJob.split(this.originalConfig, adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task(DATABASE_TYPE,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig.getInt(Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, recordSender, + super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReaderErrorCode.java b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReaderErrorCode.java new file mode 100755 index 0000000000..de9525e9d6 --- /dev/null +++ b/mysqlreader/src/main/java/com/alibaba/datax/plugin/reader/mysqlreader/MysqlReaderErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.reader.mysqlreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum MysqlReaderErrorCode implements ErrorCode { + ; + + private final String code; + private final String description; + + private MysqlReaderErrorCode(String code, String 
description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/mysqlreader/src/main/resources/plugin.json b/mysqlreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..6a8227b8ec --- /dev/null +++ b/mysqlreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mysqlreader", + "class": "com.alibaba.datax.plugin.reader.mysqlreader.MysqlReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/mysqlreader/src/main/resources/plugin_job_template.json b/mysqlreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..153ae5b3a1 --- /dev/null +++ b/mysqlreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "mysqlreader", + "parameter": { + "username": "", + "password": "", + "column": [], + "connection": [ + { + "jdbcUrl": [], + "table": [] + } + ], + "where": "" + } +} \ No newline at end of file diff --git a/mysqlwriter/doc/mysqlwriter.md b/mysqlwriter/doc/mysqlwriter.md new file mode 100644 index 0000000000..f6abd242d8 --- /dev/null +++ b/mysqlwriter/doc/mysqlwriter.md @@ -0,0 +1,361 @@ +# DataX MysqlWriter + + +--- + + +## 1 快速介绍 + +MysqlWriter 插件实现了写入数据到 Mysql 主库的目的表的功能。在底层实现上, MysqlWriter 通过 JDBC 连接远程 Mysql 数据库,并执行相应的 insert into ... 或者 ( replace into ...) 的 sql 语句将数据写入 Mysql,内部会分批次提交入库,需要数据库本身采用 innodb 引擎。 + +MysqlWriter 面向ETL开发工程师,他们使用 MysqlWriter 从数仓导入数据到 Mysql。同时 MysqlWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +MysqlWriter 通过 DataX 框架获取 Reader 生成的协议数据,根据你配置的 `writeMode` 生成 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +##### 或者 + +* `replace into...`(没有遇到主键/唯一性索引冲突时,与 insert into 行为一致,冲突时会用新行替换原有行所有字段) 的语句写入数据到 Mysql。出于性能考虑,采用了 `PreparedStatement + Batch`,并且设置了:`rewriteBatchedStatements=true`,将数据缓冲到线程上下文 Buffer 中,当 Buffer 累计到预定阈值时,才发起写入请求。 + +
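
上述两类语句分别对应任务配置中 writeMode 的 insert 与 replace 取值。例如,希望在主键/唯一索引冲突时用新行整体替换旧行,可以在 writer 参数中这样配置(示意片段,仅列出相关参数):

```json
{
  "writeMode": "replace"
}
```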
+ + 注意:目的表所在数据库必须是主库才能写入数据;整个任务至少需要具备 insert/replace into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 Mysql 导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "mysqlwriter", + "parameter": { + "writeMode": "insert", + "username": "root", + "password": "root", + "column": [ + "id", + "name" + ], + "session": [ + "set session sql_mode='ANSI'" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息。作业运行时,DataX 会在你提供的 jdbcUrl 后面追加如下属性:yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true + + 注意:1、在一个数据库上只能配置一个 jdbcUrl 值。这与 MysqlReader 支持多个备库探测不同,因为此处不支持同一个数据库存在多个主库的情况(双主导入数据情况) + 2、jdbcUrl按照Mysql官方规范,并可以填写连接附加控制信息,比如想指定连接编码为 gbk ,则在 jdbcUrl 后面追加属性 useUnicode=true&characterEncoding=gbk。具体请参看 Mysql官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
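
例如,要把连接编码指定为 gbk,可以直接在 jdbcUrl 上追加属性(示意配置,主机、库名与 3.1 节样例一致):

```json
{
  "connection": [
    {
      "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax?useUnicode=true&characterEncoding=gbk",
      "table": ["test"]
    }
  ]
}
```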
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
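
当需要同时写入多张同构分表时,可以在同一个 connection 配置单元的 table 数组中列出全部表名(示意配置,表名为假设,所有表结构必须保持一致):

```json
{
  "connection": [
    {
      "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax",
      "table": ["datax_00", "datax_01", "datax_02"]
    }
  ]
}
```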
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"]。 + + **column配置项必须指定,不能留空!** + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、 column 不能配置任何常量值 + + * 必选:是
+ + * 默认值:无
+ +* **session** + + * 描述: DataX在获取Mysql连接时,执行session指定的SQL语句,修改当前connection session属性 + + * 必须: 否 + + * 默认值: 空 + +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。比如你的任务是要写入到目的端的100个同构分表(表名称为:datax_00,datax01, ... datax_98,datax_99),并且你希望导入数据前,先对表中数据进行删除操作,那么你可以这样配置:`"preSql":["delete from 表名"]`,效果是:在执行到每个表写入数据前,会先执行对应的 delete from 对应表名称
+ + * 必选:否
+ + * 默认值:无
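
结合上面 preSql 的说明,下面是一个写入前先清空每张分表的配置片段(示意配置,分表名为假设,`@table` 在执行时会被替换为实际表名):

```json
{
  "preSql": ["delete from @table"],
  "connection": [
    {
      "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/datax",
      "table": ["datax_00", "datax_01"]
    }
  ]
}
```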
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
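
postSql 同样支持 `@table` 占位符,例如写入完成后对每张目的表刷新一次统计信息(示意配置,具体语句请按业务需要调整):

```json
{
  "postSql": ["analyze table @table"]
}
```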
+ +* **writeMode** + + * 描述:控制写入数据到目标表采用 `insert into` 或者 `replace into` 或者 `ON DUPLICATE KEY UPDATE` 语句
+ + * 必选:是
+ + * 所有选项:insert/replace/update
+ + * 默认值:insert
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与Mysql的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
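
例如,将每批提交行数调大为 2048(示意片段,仅列出相关参数,实际取值请结合目标库负载及下文 4.2 的测试数据确定):

```json
{
  "writeMode": "insert",
  "batchSize": 2048
}
```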
+ + +### 3.3 类型转换 + +类似 MysqlReader ,目前 MysqlWriter 支持大部分 Mysql 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 MysqlWriter 针对 Mysql 类型转换列表: + + +| DataX 内部类型| Mysql 数据类型 | +| -------- | ----- | +| Long |int, tinyint, smallint, mediumint, int, bigint, year| +| Double |float, double, decimal| +| String |varchar, char, tinytext, text, mediumtext, longtext | +| Date |date, datetime, timestamp, time | +| Boolean |bit, bool | +| Bytes |tinyblob, mediumblob, blob, longblob, varbinary | + + * `bit类型目前是未定义类型转换` + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + CREATE TABLE `datax_mysqlwriter_perf_00` ( + `biz_order_id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'id', + `key_value` varchar(4000) NOT NULL COMMENT 'Key-value的内容', + `gmt_create` datetime NOT NULL COMMENT '创建时间', + `gmt_modified` datetime NOT NULL COMMENT '修改时间', + `attribute_cc` int(11) DEFAULT NULL COMMENT '防止并发修改的标志', + `value_type` int(11) NOT NULL DEFAULT '0' COMMENT '类型', + `buyer_id` bigint(20) DEFAULT NULL COMMENT 'buyerid', + `seller_id` bigint(20) DEFAULT NULL COMMENT 'seller_id', + PRIMARY KEY (`biz_order_id`,`value_type`), + KEY `idx_biz_vertical_gmtmodified` (`gmt_modified`) + ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='datax perf test' + + +单行记录类似于: + + key_value: ;orderIds:20148888888,2014888888813800; + gmt_create: 2011-09-24 11:07:20 + gmt_modified: 2011-10-24 17:56:34 + attribute_cc: 1 + value_type: 3 + buyer_id: 8888888 + seller_id: 1 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 24核 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz + 2. mem: 48GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* Mysql数据库机器参数为: + 1. cpu: 32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz + 2. mem: 256GB + 3. net: 千兆双网卡 + 4. disc: BTWL419303E2800RGN INTEL SSDSC2BB800G4 D2010370 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载|DB TPS| +|--------|--------| --------|--------|--------|--------|--------|--------|--------| +|1| 128 | 5319 | 0.260 | 0.580 | 0.05 | 0.620| 0.5 | 50 | +|1| 512 | 14285 | 0.697 | 1.6 | 0.12 | 1.6 | 0.6 | 28 | +|1| 1024 | 17241 | 0.842 | 1.9 | 0.20 | 1.9 | 0.6 | 16| +|1| 2048 | 31250 | 1.49 | 2.8 | 0.15 | 3.0| 0.8 | 15 | +|1| 4096 | 31250 | 1.49 | 3.5 | 0.20 | 3.6| 0.8 | 8 | +|4| 128 | 11764 | 0.574 | 1.5 | 0.21 | 1.6| 0.8 | 112 | +|4| 512 | 30769 | 1.47 | 3.5 | 0.3 | 3.6 | 0.9 | 88 | +|4| 1024 | 50000 | 2.38 | 5.4 | 0.3 | 5.5 | 1.0 | 66 | +|4| 2048 | 66666 | 3.18 | 7.0 | 0.3 | 7.1| 1.37 | 46 | +|4| 4096 | 80000 | 3.81 | 7.3| 0.5 | 7.3| 1.40 | 26 | +|8| 128 | 17777 | 0.868 | 2.9 | 0.28 | 2.9| 0.8 | 200 | +|8| 512 | 57142 | 2.72 | 8.5 | 0.5 | 8.5| 0.70 | 159 | +|8| 1024 | 88888 | 4.24 | 12.2 | 0.9 | 12.4 | 1.0 | 108 | +|8| 2048 | 133333 | 6.36 | 14.7 | 0.9 | 14.7 | 1.0 | 81 | +|8| 4096 | 166666 | 7.95 | 19.5 | 0.9 | 19.5 | 3.0 | 45 | +|16| 128 | 32000 | 1.53 | 3.3 | 0.6 | 3.4 | 0.88 | 401 | +|16| 512 | 106666 | 5.09 | 16.1| 0.9 | 16.2 | 2.16 | 260 | +|16| 1024 | 173913 | 8.29 | 22.1| 1.5 | 22.2 | 4.5 | 200 | +|16| 2048 | 228571 | 10.90 | 28.6 | 1.61 | 28.7 | 4.60 | 128 | +|16| 4096 | 246153 | 11.74 | 31.1| 1.65 | 31.2| 4.66 | 57 | +|32| 1024 | 246153 | 11.74 | 30.5| 3.17 | 30.7 | 12.10 | 270 | + + +说明: + +1. 这里的单表,主键类型为 bigint(20),自增。 +2. batchSize 和 通道个数,对性能影响较大。 +3. 
16通道,4096批量提交时,出现 full gc 2次。 + + +#### 4.2.2 分表测试报告(2个分库,每个分库4张分表,共计8张分表) + + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载|DB TPS| +|--------|--------| --------|--------|--------|--------|--------|--------|--------| +|8| 128 | 26764 | 1.28 | 2.9 | 0.5 | 3.0| 0.8 | 209 | +|8| 512 | 95180 | 4.54 | 10.5 | 0.7 | 10.9 | 0.8 | 188 | +|8| 1024 | 94117 | 4.49 | 12.3 | 0.6 | 12.4 | 1.09 | 120 | +|8| 2048 | 133333 | 6.36 | 19.4 | 0.9 | 19.5| 1.35 | 85 | +|8| 4096 | 191692 | 9.14 | 22.1 | 1.0 | 22.2| 1.45 | 45 | + + +#### 4.2.3 分表测试报告(2个分库,每个分库8张分表,共计16张分表) + + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载|DB TPS| +|--------|--------| --------|--------|--------|--------|--------|--------|--------| +|16| 128 | 50124 | 2.39 | 5.6 | 0.40 | 6.0| 2.42 | 378 | +|16| 512 | 155084 | 7.40 | 18.6 | 1.30 | 18.9| 2.82 | 325 | +|16| 1024 | 177777 | 8.48 | 24.1 | 1.43 | 25.5| 3.5 | 233 | +|16| 2048 | 289382 | 13.8 | 33.1 | 2.5 | 33.5| 4.5 | 150 | +|16| 4096 | 326451 | 15.52 | 33.7 | 1.5 | 33.9| 4.3 | 80 | + +#### 4.2.4 性能测试小结 +1. 批量提交行数(batchSize)对性能影响很大,当 `batchSize>=512` 之后,单线程写入速度能达到每秒写入一万行 +2. 在 `batchSize>=512` 的基础上,随着通道数的增加(通道数<32),速度呈线性比增加。 +3. `通常不建议写入数据库时,通道个数 >32` + + +## 5 约束限制 + + + + +## FAQ + +*** + +**Q: MysqlWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/mysqlwriter/pom.xml b/mysqlwriter/pom.xml new file mode 100755 index 0000000000..58b93179cd --- /dev/null +++ b/mysqlwriter/pom.xml @@ -0,0 +1,79 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + mysqlwriter + mysqlwriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + mysql + mysql-connector-java + 5.1.34 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/mysqlwriter/src/main/assembly/package.xml b/mysqlwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..03883c7be4 --- /dev/null +++ b/mysqlwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/mysqlwriter + + + target/ + + mysqlwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/mysqlwriter + + + + + + false + plugin/writer/mysqlwriter/libs + runtime + + + diff --git a/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriter.java b/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriter.java new file mode 100755 index 0000000000..9d2c82ee7c --- /dev/null +++ b/mysqlwriter/src/main/java/com/alibaba/datax/plugin/writer/mysqlwriter/MysqlWriter.java @@ -0,0 +1,101 @@ +package com.alibaba.datax.plugin.writer.mysqlwriter; + +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import 
com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + + +//TODO writeProxy +public class MysqlWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.MySql; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + @Override + public void preCheck(){ + this.init(); + this.commonRdbmsWriterJob.writerPreCheck(this.originalConfig, DATABASE_TYPE); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + // 一般来说,是需要推迟到 task 中进行pre 的执行(单表情况例外) + @Override + public void prepare() { + //实跑先不支持 权限 检验 + //this.commonRdbmsWriterJob.privilegeValid(this.originalConfig, DATABASE_TYPE); + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, mandatoryNumber); + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task(DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + //TODO 改用连接池,确保每次获取的连接都是可用的(注意:连接可能需要每次都初始化其 session) + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, this.writerSliceConfig, + super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + @Override + public boolean supportFailOver(){ + String writeMode = writerSliceConfig.getString(Key.WRITE_MODE); + return "replace".equalsIgnoreCase(writeMode); + } + + } + + +} diff --git a/mysqlwriter/src/main/resources/plugin.json b/mysqlwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..e2b62538a1 --- /dev/null +++ b/mysqlwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "mysqlwriter", + "class": "com.alibaba.datax.plugin.writer.mysqlwriter.MysqlWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql. 
warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/mysqlwriter/src/main/resources/plugin_job_template.json b/mysqlwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..1edb500486 --- /dev/null +++ b/mysqlwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "mysqlwriter", + "parameter": { + "username": "", + "password": "", + "writeMode": "", + "column": [], + "session": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ] + } +} \ No newline at end of file diff --git a/ocswriter/doc/ocswriter.md b/ocswriter/doc/ocswriter.md new file mode 100644 index 0000000000..d6064367cb --- /dev/null +++ b/ocswriter/doc/ocswriter.md @@ -0,0 +1,168 @@ +# DataX OCSWriter 适用memcached客户端写入ocs +--- +## 1 快速介绍 +### 1.1 OCS简介 +开放缓存服务( Open Cache Service,简称OCS)是基于内存的缓存服务,支持海量小数据的高速访问。OCS可以极大缓解对后端存储的压力,提高网站或应用的响应速度。OCS支持Key-Value的数据结构,兼容Memcached协议的客户端都可与OCS通信。
+ +OCS 支持即开即用的方式快速部署;对于动态Web、APP应用,可通过缓存服务减轻对数据库的压力,从而提高网站整体的响应速度。
+ +与本地MemCache相同之处在于OCS兼容Memcached协议,与用户环境兼容,可直接用于OCS服务 不同之处在于硬件和数据部署在云端,有完善的基础设施、网络安全保障、系统维护服务。所有的这些服务,都不需要投资,只需根据使用量进行付费即可。 +### 1.2 OCSWriter简介 +OCSWriter是DataX实现的,基于Memcached协议的数据写入OCS通道。 +## 2 功能说明 +### 2.1 配置样例 +* 这里使用一份从内存产生的数据导入到OCS。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "ocswriter", + "parameter": { + "proxy": "xxxx", + "port": "11211", + "userName": "user", + "password": "******", + "writeMode": "set|add|replace|append|prepend", + "writeFormat": "text|binary", + "fieldDelimiter": "\u0001", + "expireTime": 1000, + "indexes": "0,2", + "batchSize": 1000 + } + } + } + ] + } +} +``` + +### 2.2 参数说明 + +* **proxy** + + * 描述:OCS机器的ip或host。 + * 必选:是 + +* **port** + + * 描述:OCS的连接域名,默认为11211 + * 必选:否 + * 默认值:11211 + +* **username** + + * 描述:OCS连接的访问账号。 + * 必选:是 + +* **password** + + * 描述:OCS连接的访问密码 + * 必选:是 + +* **writeMode** + + * 描述: OCSWriter写入方式,具体为: + * set: 存储这个数据,如果已经存在则覆盖 + * add: 存储这个数据,当且仅当这个key不存在的时候 + * replace: 存储这个数据,当且仅当这个key存在 + * append: 将数据存放在已存在的key对应的内容的后面,忽略exptime + * prepend: 将数据存放在已存在的key对应的内容的前面,忽略exptime + * 必选:是 + +* **writeFormat** + + * 描述: OCSWriter写出数据格式,目前支持两类数据写入方式: + * text: 将源端数据序列化为文本格式,其中第一个字段作为OCS写入的KEY,后续所有字段序列化为STRING类型,使用用户指定的fieldDelimiter作为间隔符,将文本拼接为完整的字符串再写入OCS。 + * binary: 将源端数据作为二进制直接写入,这类场景为未来做扩展使用,目前不支持。如果填写binary将会报错! + * 必选:否 + * 默认值:text + +* **expireTime** + + * 描述: OCS值缓存失效时间,目前MemCache支持两类过期时间, + + * Unix时间(自1970.1.1开始到现在的秒数),该时间指定了到未来某个时刻数据失效。 + * 相对当前时间的秒数,该时间指定了从现在开始多长时间后数据失效。 + **注意:如果过期时间的秒数大于60*60*24*30(即30天),则服务端认为是Unix时间。** + * 单位:秒 + * 必选:否 + * 默认值:0【0表示永久有效】 + +* **indexes** + + * 描述: 用数据的第几列当做ocs的key + * 必选:否 + * 默认值:0 + +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与OCS的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况[memcached版本暂不支持批量写]。 + * 必选:否 + * 默认值:256 + +* **fieldDelimiter** + * 描述:写入ocs的key和value分隔符。比如:key=tom\u0001boston, value=28\u0001lawer\u0001male\u0001married + * 必选:否 + * 默认值:\u0001 + +## 3 性能报告 +### 3.1 datax机器配置 +``` +CPU:16核、内存:24GB、网卡:单网卡1000mbps +``` +### 3.2 任务资源配置 +``` +-Xms8g -Xmx8g -XX:+HeapDumpOnOutOfMemoryError +``` +### 3.3 测试报告 +| 单条数据大小 | 通道并发数 | TPS | 通道流量 | 出口流量 | 备注 | +| :--------: | :--------:| :--: | :--: | :--: | :--: | +| 1KB | 1 | 579 tps | 583.31KB/s | 648.63KB/s | 无 | +| 1KB | 10 | 6006 tps | 5.87MB/s | 6.73MB/s | 无 | +| 1KB | 100 | 49916 tps | 48.56MB/s | 55.55MB/s | 无 | +| 10KB | 1 | 438 tps | 4.62MB/s | 5.07MB/s | 无 | +| 10KB | 10 | 4313 tps | 45.57MB/s | 49.51MB/s | 无 | +| 10KB | 100 | 10713 tps | 112.80MB/s | 123.01MB/s | 无 | +| 100KB | 1 | 275 tps | 26.09MB/s | 144.90KB/s | 无。数据冗余大,压缩比高。 | +| 100KB | 10 | 2492 tps | 236.33MB/s | 1.30MB/s | 无 | +| 100KB | 100 | 3187 tps | 302.17MB/s | 1.77MB/s | 无 | + +### 3.4 性能测试小结 +1. 单条数据小于10KB时建议开启100并发。 +2. 
不建议10KB以上的数据写入ocs。 diff --git a/ocswriter/pom.xml b/ocswriter/pom.xml new file mode 100644 index 0000000000..34e49cab17 --- /dev/null +++ b/ocswriter/pom.xml @@ -0,0 +1,88 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + ocswriter + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + com.alibaba.datax + datax-core + ${datax-project-version} + + + org.slf4j + slf4j-api + + + org.testng + testng + 6.8.8 + test + + + org.easymock + easymock + 3.3.1 + test + + + com.google.code.simple-spring-memcached + spymemcached + 2.8.1 + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + 3.2 + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + + diff --git a/ocswriter/src/main/assembly/package.xml b/ocswriter/src/main/assembly/package.xml new file mode 100644 index 0000000000..804456e6f5 --- /dev/null +++ b/ocswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/ocswriter + + + target/ + + ocswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/ocswriter + + + + + + false + plugin/writer/ocswriter/libs + runtime + + + diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/Key.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/Key.java new file mode 100644 index 0000000000..8942bfacea --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/Key.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.plugin.writer.ocswriter; + +public final class Key { + public final static String USER = "username"; + + public final static String PASSWORD = "password"; + + public final static String PROXY = "proxy"; + + public final static String PORT = "port"; + + public final static String WRITE_MODE = "writeMode"; + + public final static String WRITE_FORMAT = "writeFormat"; + + public final static String FIELD_DELIMITER = "fieldDelimiter"; + + public final static String EXPIRE_TIME = "expireTime"; + + public final static String BATCH_SIZE = "batchSize"; + + public final static String INDEXES = "indexes"; +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/OcsWriter.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/OcsWriter.java new file mode 100644 index 0000000000..fa7686fd01 --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/OcsWriter.java @@ -0,0 +1,315 @@ +package com.alibaba.datax.plugin.writer.ocswriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.writer.ocswriter.utils.ConfigurationChecker; +import com.alibaba.datax.plugin.writer.ocswriter.utils.OcsWriterErrorCode; +import com.google.common.annotations.VisibleForTesting; +import net.spy.memcached.AddrUtil; +import net.spy.memcached.ConnectionFactoryBuilder; +import net.spy.memcached.MemcachedClient; +import net.spy.memcached.auth.AuthDescriptor; +import net.spy.memcached.auth.PlainCallbackHandler; +import 
net.spy.memcached.internal.OperationFuture; +import org.apache.commons.lang3.StringUtils; + +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.concurrent.Callable; +import java.util.concurrent.TimeUnit; + +public class OcsWriter extends Writer { + + public static class Job extends Writer.Job { + private Configuration configuration; + + @Override + public void init() { + this.configuration = super.getPluginJobConf(); + //参数有效性检查 + ConfigurationChecker.check(this.configuration); + } + + @Override + public void prepare() { + super.prepare(); + } + + @Override + public List split(int mandatoryNumber) { + ArrayList configList = new ArrayList(); + for (int i = 0; i < mandatoryNumber; i++) { + configList.add(this.configuration.clone()); + } + return configList; + } + + @Override + public void destroy() { + } + } + + public static class Task extends Writer.Task { + + private Configuration configuration; + private MemcachedClient client; + private Set indexesFromUser = new HashSet(); + private String delimiter; + private int expireTime; + //private int batchSize; + private ConfigurationChecker.WRITE_MODE writeMode; + private TaskPluginCollector taskPluginCollector; + + @Override + public void init() { + this.configuration = this.getPluginJobConf(); + this.taskPluginCollector = super.getTaskPluginCollector(); + } + + @Override + public void prepare() { + super.prepare(); + + //如果用户不配置,默认为第0列 + String indexStr = this.configuration.getString(Key.INDEXES, "0"); + for (String index : indexStr.split(",")) { + indexesFromUser.add(Integer.parseInt(index)); + } + + //如果用户不配置,默认为\u0001 + delimiter = this.configuration.getString(Key.FIELD_DELIMITER, "\u0001"); + expireTime = this.configuration.getInt(Key.EXPIRE_TIME, 0); + //todo 此版本不支持批量提交,待ocswriter发布新版本client后支持。batchSize = this.configuration.getInt(Key.BATCH_SIZE, 100); + writeMode = ConfigurationChecker.WRITE_MODE.valueOf(this.configuration.getString(Key.WRITE_MODE)); + + String proxy = this.configuration.getString(Key.PROXY); + //默认端口为11211 + String port = this.configuration.getString(Key.PORT, "11211"); + String username = this.configuration.getString(Key.USER); + String password = this.configuration.getString(Key.PASSWORD); + AuthDescriptor ad = new AuthDescriptor(new String[]{"PLAIN"}, new PlainCallbackHandler(username, password)); + + try { + client = getMemcachedConn(proxy, port, ad); + } catch (Exception e) { + //异常不能吃掉,直接抛出,便于定位 + throw DataXException.asDataXException(OcsWriterErrorCode.OCS_INIT_ERROR, String.format("初始化ocs客户端失败"), e); + } + } + + /** + * 建立ocs客户端连接 + * 重试9次,间隔时间指数增长 + */ + private MemcachedClient getMemcachedConn(final String proxy, final String port, final AuthDescriptor ad) throws Exception { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public MemcachedClient call() throws Exception { + return new MemcachedClient( + new ConnectionFactoryBuilder().setProtocol(ConnectionFactoryBuilder.Protocol.BINARY) + .setAuthDescriptor(ad) + .build(), + AddrUtil.getAddresses(proxy + ":" + port)); + } + }, 9, 1000L, true); + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + Record record; + String key; + String value; + while ((record = lineReceiver.getFromReader()) != null) { + try { + key = buildKey(record); + value = buildValue(record); + switch (writeMode) { + case set: + case replace: + case add: + commitWithRetry(key, value); + break; + case append: + case prepend: + commit(key, value); + break; + default: + 
//没有default,因为参数检查的时候已经判断,不可能出现5中模式之外的模式 + } + } catch (Exception e) { + this.taskPluginCollector.collectDirtyRecord(record, e); + } + } + } + + /** + * 没有重试的commit + */ + private void commit(final String key, final String value) { + OperationFuture future; + switch (writeMode) { + case set: + future = client.set(key, expireTime, value); + break; + case add: + //幂等原则:相同的输入得到相同的输出,不管调用多少次。 + //所以add和replace是幂等的。 + future = client.add(key, expireTime, value); + break; + case replace: + future = client.replace(key, expireTime, value); + break; + //todo 【注意】append和prepend重跑任务不能支持幂等,使用需谨慎,不需要重试 + case append: + future = client.append(0L, key, value); + break; + case prepend: + future = client.prepend(0L, key, value); + break; + default: + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不支持的写入模式%s", writeMode.toString())); + //因为前面参数校验的时候已经判断,不可能存在5中操作之外的类型。 + } + //【注意】getStatus()返回为null有可能是因为get()超时导致,此种情况当做脏数据处理。但有可能数据已经成功写入ocs。 + if (future == null || future.getStatus() == null || !future.getStatus().isSuccess()) { + throw DataXException.asDataXException(OcsWriterErrorCode.COMMIT_FAILED, "提交数据到ocs失败"); + } + } + + /** + * 提交数据到ocs,有重试机制 + */ + private void commitWithRetry(final String key, final String value) throws Exception { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Object call() throws Exception { + commit(key, value); + return null; + } + }, 3, 1000L, false); + } + + /** + * 构建value + * 如果有二进制字段当做脏数据处理 + * 如果col为null,当做脏数据处理 + */ + private String buildValue(Record record) { + ArrayList valueList = new ArrayList(); + int colNum = record.getColumnNumber(); + for (int i = 0; i < colNum; i++) { + Column col = record.getColumn(i); + if (col != null) { + String value; + Column.Type type = col.getType(); + switch (type) { + case STRING: + case BOOL: + case DOUBLE: + case LONG: + case DATE: + value = col.asString(); + //【注意】value字段中如果有分隔符,当做脏数据处理 + if (value != null && value.contains(delimiter)) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("数据中包含分隔符:%s", value)); + } + break; + default: + //目前不支持二进制,如果遇到二进制,则当做脏数据处理 + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不支持的数据格式:%s", type.toString())); + } + valueList.add(value); + } else { + //如果取到的列为null,需要当做脏数据处理 + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("record中不存在第%s个字段", i)); + } + } + return StringUtils.join(valueList, delimiter); + } + + /** + * 构建key + * 构建数据为空时当做脏数据处理 + */ + private String buildKey(Record record) { + ArrayList keyList = new ArrayList(); + for (int index : indexesFromUser) { + Column col = record.getColumn(index); + if (col == null) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不存在第%s列", index)); + } + Column.Type type = col.getType(); + String value; + switch (type) { + case STRING: + case BOOL: + case DOUBLE: + case LONG: + case DATE: + value = col.asString(); + if (value != null && value.contains(delimiter)) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("主键中包含分隔符:%s", value)); + } + keyList.add(value); + break; + default: + //目前不支持二进制,如果遇到二进制,则当做脏数据处理 + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, String.format("不支持的数据格式:%s", type.toString())); + } + } + String rtn = StringUtils.join(keyList, delimiter); + if (StringUtils.isBlank(rtn)) { + throw DataXException.asDataXException(OcsWriterErrorCode.DIRTY_RECORD, 
String.format("构建主键为空,请检查indexes的配置")); + } + return rtn; + } + + /** + * shutdown中会有数据异步提交,需要重试。 + */ + @Override + public void destroy() { + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Object call() throws Exception { + if (client == null || client.shutdown(10000L, TimeUnit.MILLISECONDS)) { + return null; + } else { + throw DataXException.asDataXException(OcsWriterErrorCode.SHUTDOWN_FAILED, "关闭ocsClient失败"); + } + } + }, 8, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OcsWriterErrorCode.SHUTDOWN_FAILED, "关闭ocsClient失败", e); + } + } + + /** + * 以下为测试使用 + */ + @VisibleForTesting + public String buildValue_test(Record record) { + return this.buildValue(record); + } + + @VisibleForTesting + public String buildKey_test(Record record) { + return this.buildKey(record); + } + + @VisibleForTesting + public void setIndexesFromUser(HashSet indexesFromUser) { + this.indexesFromUser = indexesFromUser; + } + + } +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/CommonUtils.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/CommonUtils.java new file mode 100644 index 0000000000..47335df3f1 --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/CommonUtils.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.plugin.writer.ocswriter.utils; + +public class CommonUtils { + + public static void sleepInMs(long time) { + try{ + Thread.sleep(time); + } catch (InterruptedException e) { + // + } + } +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/ConfigurationChecker.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/ConfigurationChecker.java new file mode 100644 index 0000000000..3ce0a47a50 --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/ConfigurationChecker.java @@ -0,0 +1,144 @@ +package com.alibaba.datax.plugin.writer.ocswriter.utils; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.ocswriter.Key; +import com.google.common.annotations.VisibleForTesting; + +import net.spy.memcached.AddrUtil; +import net.spy.memcached.ConnectionFactoryBuilder; +import net.spy.memcached.MemcachedClient; +import net.spy.memcached.auth.AuthDescriptor; +import net.spy.memcached.auth.PlainCallbackHandler; + +import org.apache.commons.lang3.EnumUtils; +import org.apache.commons.lang3.StringUtils; + + +public class ConfigurationChecker { + + public static void check(Configuration config) { + paramCheck(config); + hostReachableCheck(config); + } + + public enum WRITE_MODE { + set, + add, + replace, + append, + prepend + } + + private enum WRITE_FORMAT { + text + } + + /** + * 参数有效性基本检查 + */ + private static void paramCheck(Configuration config) { + String proxy = config.getString(Key.PROXY); + if (StringUtils.isBlank(proxy)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("ocs服务地址%s不能设置为空", Key.PROXY)); + } + String user = config.getString(Key.USER); + if (StringUtils.isBlank(user)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("访问ocs的用户%s不能设置为空", Key.USER)); + } + String password = config.getString(Key.PASSWORD); + if (StringUtils.isBlank(password)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("访问ocs的用户%s不能设置为空", Key.PASSWORD)); 
+ } + + String port = config.getString(Key.PORT, "11211"); + if (StringUtils.isBlank(port)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("ocs端口%s不能设置为空", Key.PORT)); + } + + String indexes = config.getString(Key.INDEXES, "0"); + if (StringUtils.isBlank(indexes)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("当做key的列编号%s不能为空", Key.INDEXES)); + } + for (String index : indexes.split(",")) { + try { + if (Integer.parseInt(index) < 0) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("列编号%s必须为逗号分隔的非负整数", Key.INDEXES)); + } + } catch (NumberFormatException e) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("列编号%s必须为逗号分隔的非负整数", Key.INDEXES)); + } + } + + String writerMode = config.getString(Key.WRITE_MODE); + if (StringUtils.isBlank(writerMode)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("操作方式%s不能为空", Key.WRITE_MODE)); + } + if (!EnumUtils.isValidEnum(WRITE_MODE.class, writerMode.toLowerCase())) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("不支持操作方式%s,仅支持%s", writerMode, StringUtils.join(WRITE_MODE.values(), ","))); + } + + String writerFormat = config.getString(Key.WRITE_FORMAT, "text"); + if (StringUtils.isBlank(writerFormat)) { + throw DataXException.asDataXException(OcsWriterErrorCode.REQUIRED_VALUE, String.format("写入格式%s不能为空", Key.WRITE_FORMAT)); + } + if (!EnumUtils.isValidEnum(WRITE_FORMAT.class, writerFormat.toLowerCase())) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("不支持写入格式%s,仅支持%s", writerFormat, StringUtils.join(WRITE_FORMAT.values(), ","))); + } + + int expireTime = config.getInt(Key.EXPIRE_TIME, 0); + if (expireTime < 0) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("数据过期时间设置%s不能小于0", Key.EXPIRE_TIME)); + } + + int batchSiz = config.getInt(Key.BATCH_SIZE, 100); + if (batchSiz <= 0) { + throw DataXException.asDataXException(OcsWriterErrorCode.ILLEGAL_PARAM_VALUE, String.format("批量写入大小设置%s必须大于0", Key.BATCH_SIZE)); + } + //fieldDelimiter不需要检查,默认为\u0001 + } + + /** + * 检查ocs服务器网络是否可达 + */ + private static void hostReachableCheck(Configuration config) { + String proxy = config.getString(Key.PROXY); + String port = config.getString(Key.PORT); + String username = config.getString(Key.USER); + String password = config.getString(Key.PASSWORD); + AuthDescriptor ad = new AuthDescriptor(new String[] { "PLAIN" }, + new PlainCallbackHandler(username, password)); + try { + MemcachedClient client = new MemcachedClient( + new ConnectionFactoryBuilder() + .setProtocol( + ConnectionFactoryBuilder.Protocol.BINARY) + .setAuthDescriptor(ad).build(), + AddrUtil.getAddresses(proxy + ":" + port)); + client.get("for_check_connectivity"); + client.getVersions(); + if (client.getAvailableServers().isEmpty()) { + throw new RuntimeException( + "没有可用的Servers: getAvailableServers() -> is empty"); + } + client.shutdown(); + } catch (Exception e) { + throw DataXException.asDataXException( + OcsWriterErrorCode.HOST_UNREACHABLE, + String.format("OCS[%s]服务不可用", proxy), e); + } + } + + /** + * 以下为测试使用 + */ + @VisibleForTesting + public static void paramCheck_test(Configuration configuration) { + paramCheck(configuration); + } + + @VisibleForTesting + public static void hostReachableCheck_test(Configuration configuration) { + 
hostReachableCheck(configuration); + } +} diff --git a/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/OcsWriterErrorCode.java b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/OcsWriterErrorCode.java new file mode 100644 index 0000000000..a92bd2e62a --- /dev/null +++ b/ocswriter/src/main/java/com/alibaba/datax/plugin/writer/ocswriter/utils/OcsWriterErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.writer.ocswriter.utils; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum OcsWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("OcsWriterErrorCode-000", "参数不能为空"), + ILLEGAL_PARAM_VALUE("OcsWriterErrorCode-001", "参数不合法"), + HOST_UNREACHABLE("OcsWriterErrorCode-002", "服务不可用"), + OCS_INIT_ERROR("OcsWriterErrorCode-003", "初始化ocs client失败"), + DIRTY_RECORD("OcsWriterErrorCode-004", "脏数据"), + SHUTDOWN_FAILED("OcsWriterErrorCode-005", "关闭ocs client失败"), + COMMIT_FAILED("OcsWriterErrorCode-006", "提交数据到ocs失败"); + + private final String code; + private final String description; + + private OcsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return null; + } + + @Override + public String getDescription() { + return null; + } +} diff --git a/ocswriter/src/main/resources/plugin.json b/ocswriter/src/main/resources/plugin.json new file mode 100644 index 0000000000..4874911a41 --- /dev/null +++ b/ocswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "ocswriter", + "class": "com.alibaba.datax.plugin.writer.ocswriter.OcsWriter", + "description": "set|add|replace|append|prepend record into ocs.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/ocswriter/src/main/resources/plugin_job_template.json b/ocswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..d62f3c96c5 --- /dev/null +++ b/ocswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "ocswriter", + "parameter": { + "proxy": "", + "port": "", + "userName": "", + "password": "", + "writeMode": "", + "writeFormat": "", + "fieldDelimiter": "", + "expireTime": "", + "indexes": "", + "batchSize": "" + } +} \ No newline at end of file diff --git a/odpsreader/doc/odpsreader.md b/odpsreader/doc/odpsreader.md new file mode 100644 index 0000000000..0ae528943c --- /dev/null +++ b/odpsreader/doc/odpsreader.md @@ -0,0 +1,349 @@ +# DataX ODPSReader + + +--- + + +## 1 快速介绍 +ODPSReader 实现了从 ODPS读取数据的功能,有关ODPS请参看(https://help.aliyun.com/document_detail/27800.html?spm=5176.doc27803.6.101.NxCIgY)。 在底层实现上,ODPSReader 根据你配置的 源头项目 / 表 / 分区 / 表字段 等信息,通过 `Tunnel` 从 ODPS 系统中读取数据。 + +
+ + 注意 1、如果你需要使用ODPSReader/Writer插件,由于 AccessId/AccessKey 解密的需要,请务必使用 JDK 1.6.32 及以上版本。JDK 安装事项,请联系 PE 处理 + 2、ODPSReader 不是通过 ODPS SQL (select ... from ... where ... )来抽取数据的 + 3、注意区分你要读取的表是线上环境还是线下环境 + 4、目前 DataX3 依赖的 SDK 版本是: + + com.aliyun.odps + odps-sdk-core-internal + 0.13.2 + + + +## 2 实现原理 +ODPSReader 支持读取分区表、非分区表,不支持读取虚拟视图。当要读取分区表时,需要指定出具体的分区配置,比如读取 t0 表,其分区为 pt=1,ds=hangzhou 那么你需要在配置中配置该值。当要读取非分区表时,你不能提供分区配置。表字段可以依序指定全部列,也可以指定部分列,或者调整列顺序,或者指定常量字段,但是表字段中不能指定分区列(分区列不是表字段)。 + + 注意:要特别注意 odpsServer、project、table、accessId、accessKey 的配置,因为直接影响到是否能够加载到你需要读取数据的表。很多权限问题都出现在这里。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份读出 ODPS 数据然后打印到屏幕的配置样板。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "accessId", + "accessKey": "accessKey", + "project": "targetProjectName", + "table": "tableName", + "partition": [ + "pt=1,ds=hangzhou" + ], + "column": [ + "customer_id", + "nickname" + ], + "packageAuthorizedProject": "yourCurrentProjectName", + "splitMode": "record", + "odpsServer": "http://xxx/api" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "fieldDelimiter": "\t", + "print": "true" + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +## 参数 + +* **accessId** + * 描述:ODPS系统登录ID
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + * 描述:ODPS系统登录Key
+ + * 必选:是
+ + * 默认值:无
+ +* **project** + + * 描述:读取数据表所在的 ODPS 项目名称(大小写不敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:读取数据表的表名称(大小写不敏感)
+ + * 必选:是
+ + * 默认值:无
+ +* **partition** + + * 描述:读取数据所在的分区信息,支持linux shell通配符,包括 * 表示0个或多个字符,?代表任意一个字符。例如现在有分区表 test,其存在 pt=1,ds=hangzhou pt=1,ds=shanghai pt=2,ds=hangzhou pt=2,ds=beijing 四个分区,如果你想读取 pt=1,ds=shanghai 这个分区的数据,那么你应该配置为: `"partition":["pt=1,ds=shanghai"]`; 如果你想读取 pt=1下的所有分区,那么你应该配置为: `"partition":["pt=1,ds=* "]`;如果你想读取整个 test 表的所有分区的数据,那么你应该配置为: `"partition":["pt=*,ds=*"]`
+ + * 必选:如果表为分区表,则必填。如果表为非分区表,则不能填写
+ + * 默认值:无
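
partition 为数组形式,可以同时列出多个要读取的分区表达式(示意配置,分区值沿用上文 test 表的例子;注意不要在数组中配置多个代表全表的 *):

```json
{
  "partition": [
    "pt=1,ds=hangzhou",
    "pt=2,ds=*"
  ]
}
```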
+ +* **column** + + * 描述:读取 odps 源头表的列信息。例如现在有表 test,其字段为:id,name,age 如果你想依次读取 id,name,age 那么你应该配置为: `"column":["id","name","age"]` 或者配置为:`"column":["*"]` 这里 * 表示依次读取表的每个字段,但是我们不推荐你配置抽取字段为 * ,因为当你的表字段顺序调整、类型变更或者个数增减,你的任务就会存在源头表列和目的表列不能对齐的风险,会直接导致你的任务运行结果不正确甚至运行失败。如果你想依次读取 name,id 那么你应该配置为: `"column":["name","id"]` 如果你想在源头抽取的字段中添加常量字段(以适配目标表的字段顺序),比如你想抽取的每一行数据值为 age 列对应的值,name列对应的值,常量日期值1988-08-08 08:08:08,id 列对应的值 那么你应该配置为:`"column":["age","name","'1988-08-08 08:08:08'","id"]` 即常量列首尾用符号`'` 包住即可,我们内部实现上识别常量是通过检查你配置的每一个字段,如果发现有字段首尾都有`'`,则认为其是常量字段,其实际值为去除`'` 之后的值(可参考下方的示例片段)。 + + 注意:ODPSReader 抽取数据表不是通过 ODPS 的 Select SQL 语句,所以不能在字段上指定函数,也不能指定分区字段名称(分区字段不属于表字段) + + * 必选:是
+ + * 默认值:无
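
结合上面的说明,一个带常量字段的 column 配置片段如下(示意,字段与常量值沿用上文的例子):

```json
{
  "column": ["age", "name", "'1988-08-08 08:08:08'", "id"]
}
```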
+ +* **odpsServer** + + * 描述:源头表 所在 ODPS 系统的server 地址
+ + * 必选:是
+ + * 默认值:无
+ +* **tunnelServer** + + * 描述:源头表 所在 ODPS 系统的tunnel 地址
+ + * 必选:是
+ + * 默认值:无
+ +* **splitMode** + + * 描述:读取源头表时切分所需要的模式。默认值为 record,可不填,表示根据切分份数,按照记录数进行切分。如果你的任务目的端为 Mysql,并且是 Mysql 的多个表,那么根据现在 DataX 结构,你的源头表必须是分区表,并且每个分区依次对应目的端 Mysql 的多个分表,则此时应该配置为`"splitMode":"partition"`
+ + * 必选:否
+ + * 默认值:record
+ +* **accountProvider** [待定] + + * 描述:读取时使用的 ODPS 账号类型。目前支持 aliyun/taobao 两种类型。默认为 aliyun,可不填
+ + * 必选:否
+ + * 默认值:aliyun
+ +* **packageAuthorizedProject** + + * 描述:被package授权的project,即用户当前所在project
+ + * 必选:否
+ + * 默认值:无
+ +* **isCompress** + + * 描述:是否压缩读取,bool类型: "true"表示压缩, "false"标示不压缩
+ + * 必选:否
+ + * 默认值:"false" : 不压缩
+ +### 3.3 类型转换 + +下面列出 ODPSReader 读出类型与 DataX 内部类型的转换关系: + + +| ODPS 数据类型| DataX 内部类型 | +| -------- | ----- | +| BIGINT | Long | +| DOUBLE | Double | +| STRING | String | +| DATETIME | Date | +| Boolean | Bool | + + +## 4 性能报告(线上环境实测) + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +建表语句: + + use cdo_datasync; + create table datax3_odpswriter_perf_10column_1kb_00( + s_0 string, + bool_1 boolean, + bi_2 bigint, + dt_3 datetime, + db_4 double, + s_5 string, + s_6 string, + s_7 string, + s_8 string, + s_9 string + )PARTITIONED by (pt string,year string); + +单行记录类似于: + + s_0 : 485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&* + bool_1 : true + bi_2 : 1696248667889 + dt_3 : 2013-07-0600: 00: 00 + db_4 : 3.141592653578 + s_5 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + s_6 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209 + s_7 : 100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209 + s_8 : 100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209 + s_9 : 12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu : 24 Core Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz cache 15.36MB + 2. mem : 50GB + 3. net : 千兆双网卡 + 4. jvm : -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + 5. disc: DataX 数据不落磁盘,不统计此项 + +* 任务配置为: +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "******************************", + "accessKey": "*****************************", + "column": [ + "*" + ], + "partition": [ + "pt=20141010000000,year=2014" + ], + "odpsServer": "http://xxx/api", + "project": "cdo_datasync", + "table": "datax3_odpswriter_perf_10column_1kb_00", + "tunnelServer": "http://xxx" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "column": [ + { + "value": "485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*" + }, + { + "value": "true", + "type": "bool" + }, + { + "value": "1696248667889", + "type": "long" + }, + { + "type": "date", + "value": "2013-07-06 00:00:00", + "dateFormat": "yyyy-mm-dd hh:mm:ss" + }, + { + "value": "3.141592653578", + "type": "double" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209" + }, + { + "value": "100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209" + }, + { + "value": "12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + } + ] + } + } + } + ] + } +} +``` + +### 4.2 测试报告 + + +| 并发任务数| DataX速度(Rec/s)|DataX流量(MB/S)|网卡流量(MB/S)|DataX运行负载| +|--------| --------|--------|--------|--------| +|1|117507|50.20|53.7|0.62| +|2|232976|99.54|108.1|0.99| +|4|387382|165.51|181.3|1.98| +|5|426054|182.03|202.2|2.35| +|6|434793|185.76|204.7|2.77| +|8|495904|211.87|230.2|2.86| +|16|501596|214.31|234.7|2.84| +|32|501577|214.30|234.7|2.99| +|64|501625|214.32|234.7|3.22| + +说明: + +1. OdpsReader 影响速度最主要的是channel数目,这里到达8时已经打满网卡,过多调大反而会影响系统性能。 +2. 
channel数目的选择,可以考虑odps表文件组织,可尝试合并小文件再进行同步调优。 + + +## 5 约束限制 + + + + +## FAQ(待补充) + +*** + +**Q: 你来问** + +A: 我来答。 + +*** + diff --git a/odpsreader/pom.xml b/odpsreader/pom.xml new file mode 100755 index 0000000000..9204d908bf --- /dev/null +++ b/odpsreader/pom.xml @@ -0,0 +1,144 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + odpsreader + odpsreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + org.bouncycastle + bcprov-jdk15on + 1.52 + system + ${basedir}/src/main/libs/bcprov-jdk15on-1.52.jar + + + com.aliyun.odps + odps-sdk-core + 0.19.3-public + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/odpsreader/src/main/assembly/package.xml b/odpsreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..9ec3309e6e --- /dev/null +++ b/odpsreader/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/odpsreader + + + target/ + + odpsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/odpsreader + + + src/main/libs + + *.* + + plugin/reader/odpsreader/libs + + + + + + false + plugin/reader/odpsreader/libs + runtime + + + diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ColumnType.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ColumnType.java new file mode 100644 index 0000000000..eb674a7f67 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ColumnType.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +public enum ColumnType { + PARTITION, NORMAL, CONSTANT, UNKNOWN, ; + + @Override + public String toString() { + switch (this) { + case PARTITION: + return "partition"; + case NORMAL: + return "normal"; + case CONSTANT: + return "constant"; + default: + return "unknown"; + } + } + + public static ColumnType asColumnType(String columnTypeString) { + if ("partition".equals(columnTypeString)) { + return PARTITION; + } else if ("normal".equals(columnTypeString)) { + return NORMAL; + } else if ("constant".equals(columnTypeString)) { + return CONSTANT; + } else { + return UNKNOWN; + } + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Constant.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Constant.java new file mode 100755 index 0000000000..c3c674ddd1 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Constant.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +public class Constant { + + public final static String START_INDEX = "startIndex"; + + public final static String 
STEP_COUNT = "stepCount"; + + public final static String SESSION_ID = "sessionId"; + + public final static String IS_PARTITIONED_TABLE = "isPartitionedTable"; + + public static final String DEFAULT_SPLIT_MODE = "record"; + + public static final String PARTITION_SPLIT_MODE = "partition"; + + public static final String DEFAULT_ACCOUNT_TYPE = "aliyun"; + + public static final String TAOBAO_ACCOUNT_TYPE = "taobao"; + + // 常量字段用COLUMN_CONSTANT_FLAG 首尾包住即可 + public final static String COLUMN_CONSTANT_FLAG = "'"; + + /** + * 以下是获取accesskey id 需要用到的常量值 + */ + public static final String SKYNET_ACCESSID = "SKYNET_ACCESSID"; + + public static final String SKYNET_ACCESSKEY = "SKYNET_ACCESSKEY"; + + public static final String PARTITION_COLUMNS = "partitionColumns"; + + public static final String PARSED_COLUMNS = "parsedColumns"; + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Key.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Key.java new file mode 100755 index 0000000000..9537cb9397 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/Key.java @@ -0,0 +1,34 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +public class Key { + + public final static String ACCESS_ID = "accessId"; + + public final static String ACCESS_KEY = "accessKey"; + + public static final String PROJECT = "project"; + + public final static String TABLE = "table"; + + public final static String PARTITION = "partition"; + + public final static String ODPS_SERVER = "odpsServer"; + + // 线上环境不需要填写,线下环境必填 + public final static String TUNNEL_SERVER = "tunnelServer"; + + public final static String COLUMN = "column"; + + // 当值为:partition 则只切分到分区;当值为:record,则当按照分区切分后达不到adviceNum时,继续按照record切分 + public final static String SPLIT_MODE = "splitMode"; + + // 账号类型,默认为aliyun,也可能为taobao等其他类型 + public final static String ACCOUNT_TYPE = "accountType"; + + public final static String PACKAGE_AUTHORIZED_PROJECT = "packageAuthorizedProject"; + + public final static String IS_COMPRESS = "isCompress"; + + public final static String MAX_RETRY_TIME = "maxRetryTime"; + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReader.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReader.java new file mode 100755 index 0000000000..f5cf10ca28 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReader.java @@ -0,0 +1,390 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.FilterUtil; +import com.alibaba.datax.plugin.reader.odpsreader.util.IdAndKeyUtil; +import com.alibaba.datax.plugin.reader.odpsreader.util.OdpsSplitUtil; +import com.alibaba.datax.plugin.reader.odpsreader.util.OdpsUtil; +import com.aliyun.odps.*; +import com.aliyun.odps.tunnel.TableTunnel.DownloadSession; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; + +public class OdpsReader extends Reader { + public static class Job extends 
Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private static boolean IS_DEBUG = LOG.isDebugEnabled(); + + private Configuration originalConfig; + private Odps odps; + private Table table; + + public void preCheck() { + this.init(); + } + + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + //如果用户没有配置accessId/accessKey,尝试从环境变量获取 + String accountType = originalConfig.getString(Key.ACCOUNT_TYPE, Constant.DEFAULT_ACCOUNT_TYPE); + if (Constant.DEFAULT_ACCOUNT_TYPE.equalsIgnoreCase(accountType)) { + this.originalConfig = IdAndKeyUtil.parseAccessIdAndKey(this.originalConfig); + } + + //检查必要的参数配置 + OdpsUtil.checkNecessaryConfig(this.originalConfig); + //重试次数的配置检查 + OdpsUtil.dealMaxRetryTime(this.originalConfig); + + //确定切分模式 + dealSplitMode(this.originalConfig); + + this.odps = OdpsUtil.initOdps(this.originalConfig); + String tableName = this.originalConfig.getString(Key.TABLE); + String projectName = this.originalConfig.getString(Key.PROJECT); + + this.table = OdpsUtil.getTable(this.odps, projectName, tableName); + this.originalConfig.set(Constant.IS_PARTITIONED_TABLE, + OdpsUtil.isPartitionedTable(table)); + + boolean isVirtualView = this.table.isVirtualView(); + if (isVirtualView) { + throw DataXException.asDataXException(OdpsReaderErrorCode.VIRTUAL_VIEW_NOT_SUPPORT, + String.format("源头表:%s 是虚拟视图,DataX 不支持读取虚拟视图.", tableName)); + } + + this.dealPartition(this.table); + this.dealColumn(this.table); + } + + private void dealSplitMode(Configuration originalConfig) { + String splitMode = originalConfig.getString(Key.SPLIT_MODE, Constant.DEFAULT_SPLIT_MODE).trim(); + if (splitMode.equalsIgnoreCase(Constant.DEFAULT_SPLIT_MODE) || + splitMode.equalsIgnoreCase(Constant.PARTITION_SPLIT_MODE)) { + originalConfig.set(Key.SPLIT_MODE, splitMode); + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.SPLIT_MODE_ERROR, + String.format("您所配置的 splitMode:%s 不正确. splitMode 仅允许配置为 record 或者 partition.", splitMode)); + } + } + + /** + * 对分区的配置处理。最终效果是所有正则配置,完全展开成实际对应的分区配置。正则规则如下: + *

+ * 1. 如果是分区表,则必须配置分区:可以配置为*,表示整表读取;也可以配置为分别列出要读取的叶子分区. TODO 未来会支持一些常用的分区正则筛选配置. 分区配置中,不能在分区所表示的数组中配置多个*,因为那样就是多次读取全表,无意义.
+ * 2. 如果是非分区表,则不能配置分区值.
+ */ + private void dealPartition(Table table) { + List userConfiguredPartitions = this.originalConfig.getList( + Key.PARTITION, String.class); + + boolean isPartitionedTable = this.originalConfig.getBool(Constant.IS_PARTITIONED_TABLE); + List partitionColumns = new ArrayList(); + + if (isPartitionedTable) { + // 分区表,需要配置分区 + if (null == userConfiguredPartitions || userConfiguredPartitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区信息没有配置.由于源头表:%s 为分区表, 所以您需要配置其抽取的表的分区信息. 格式形如:pt=hello,ds=hangzhou,请您参考此格式修改该配置项.", + table.getName())); + } else { + List allPartitions = OdpsUtil.getTableAllPartitions(table); + + if (null == allPartitions || allPartitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区信息配置错误.源头表:%s 虽然为分区表, 但其实际分区值并不存在. 请确认源头表已经生成该分区,再进行数据抽取.", + table.getName())); + } + + List parsedPartitions = expandUserConfiguredPartition( + allPartitions, userConfiguredPartitions); + + if (null == parsedPartitions || parsedPartitions.isEmpty()) { + throw DataXException.asDataXException( + OdpsReaderErrorCode.PARTITION_ERROR, + String.format( + "分区配置错误,根据您所配置的分区没有匹配到源头表中的分区. 源头表所有分区是:[\n%s\n], 您配置的分区是:[\n%s\n]. 请您根据实际情况在作出修改. ", + StringUtils.join(allPartitions, "\n"), + StringUtils.join(userConfiguredPartitions, "\n"))); + } + this.originalConfig.set(Key.PARTITION, parsedPartitions); + + for (Column column : table.getSchema() + .getPartitionColumns()) { + partitionColumns.add(column.getName()); + } + } + } else { + // 非分区表,则不能配置分区 + if (null != userConfiguredPartitions + && !userConfiguredPartitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区配置错误,源头表:%s 为非分区表, 您不能配置分区. 请您删除该配置项. ", table.getName())); + } + } + + this.originalConfig.set(Constant.PARTITION_COLUMNS, partitionColumns); + if (isPartitionedTable) { + LOG.info("{源头表:{} 的所有分区列是:[{}]}", table.getName(), + StringUtils.join(partitionColumns, ",")); + } + } + + private List expandUserConfiguredPartition( + List allPartitions, List userConfiguredPartitions) { + // 对odps 本身的所有分区进行特殊字符的处理 + List allStandardPartitions = OdpsUtil + .formatPartitions(allPartitions); + + // 对用户自身配置的所有分区进行特殊字符的处理 + List allStandardUserConfiguredPartitions = OdpsUtil + .formatPartitions(userConfiguredPartitions); + + /** + * 对配置的分区级数(深度)进行检查 + * (1)先检查用户配置的分区级数,自身级数是否相等 + * (2)检查用户配置的分区级数是否与源头表的的分区级数一样 + */ + String firstPartition = allStandardUserConfiguredPartitions.get(0); + int firstPartitionDepth = firstPartition.split(",").length; + + String comparedPartition = null; + int comparedPartitionDepth = -1; + for (int i = 1, len = allStandardUserConfiguredPartitions.size(); i < len; i++) { + comparedPartition = allStandardUserConfiguredPartitions.get(i); + comparedPartitionDepth = comparedPartition.split(",").length; + if (comparedPartitionDepth != firstPartitionDepth) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区配置错误, 您所配置的分区级数和该表的实际情况不一致, 比如分区:[%s] 是 %s 级分区, 而分区:[%s] 是 %s 级分区. DataX 是通过英文逗号判断您所配置的分区级数的. 正确的格式形如\"pt=${bizdate}, type=0\" ,请您参考示例修改该配置项. 
", + firstPartition, firstPartitionDepth, comparedPartition, comparedPartitionDepth)); + } + } + + int tableOriginalPartitionDepth = allStandardPartitions.get(0).split(",").length; + if (firstPartitionDepth != tableOriginalPartitionDepth) { + throw DataXException.asDataXException(OdpsReaderErrorCode.PARTITION_ERROR, + String.format("分区配置错误, 您所配置的分区:%s 的级数:%s 与您要读取的 ODPS 源头表的分区级数:%s 不相等. DataX 是通过英文逗号判断您所配置的分区级数的.正确的格式形如\"pt=${bizdate}, type=0\" ,请您参考示例修改该配置项.", + firstPartition, firstPartitionDepth, tableOriginalPartitionDepth)); + } + + List retPartitions = FilterUtil.filterByRegulars(allStandardPartitions, + allStandardUserConfiguredPartitions); + + return retPartitions; + } + + private void dealColumn(Table table) { + // 用户配置的 column 之前已经确保其不为空 + List userConfiguredColumns = this.originalConfig.getList( + Key.COLUMN, String.class); + + List allColumns = OdpsUtil.getTableAllColumns(table); + List allNormalColumns = OdpsUtil + .getTableOriginalColumnNameList(allColumns); + + StringBuilder columnMeta = new StringBuilder(); + for (Column column : allColumns) { + columnMeta.append(column.getName()).append(":").append(column.getType()).append(","); + } + columnMeta.setLength(columnMeta.length() - 1); + + LOG.info("源头表:{} 的所有字段是:[{}]", table.getName(), columnMeta.toString()); + + if (1 == userConfiguredColumns.size() + && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("这是一条警告信息,您配置的 ODPS 读取的列为*,这是不推荐的行为,因为当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错. 建议您把所有需要抽取的列都配置上. "); + this.originalConfig.set(Key.COLUMN, allNormalColumns); + } + + userConfiguredColumns = this.originalConfig.getList( + Key.COLUMN, String.class); + + /** + * warn: 字符串常量需要与表原生字段tableOriginalColumnNameList 分开存放 demo: + * ["id","'id'","name"] + */ + List allPartitionColumns = this.originalConfig.getList( + Constant.PARTITION_COLUMNS, String.class); + List> parsedColumns = OdpsUtil + .parseColumns(allNormalColumns, allPartitionColumns, + userConfiguredColumns); + + this.originalConfig.set(Constant.PARSED_COLUMNS, parsedColumns); + + StringBuilder sb = new StringBuilder(); + sb.append("[ "); + for (int i = 0, len = parsedColumns.size(); i < len; i++) { + Pair pair = parsedColumns.get(i); + sb.append(String.format(" %s : %s", pair.getLeft(), + pair.getRight())); + if (i != len - 1) { + sb.append(","); + } + } + sb.append(" ]"); + LOG.info("parsed column details: {} .", sb.toString()); + } + + + @Override + public void prepare() { + } + + @Override + public List split(int adviceNumber) { + return OdpsSplitUtil.doSplit(this.originalConfig, this.odps, adviceNumber); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + } + + public static class Task extends Reader.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private Configuration readerSliceConf; + + private String tunnelServer; + private Odps odps = null; + private Table table = null; + private String projectName = null; + private String tableName = null; + private boolean isPartitionedTable; + private String sessionId; + private boolean isCompress; + + @Override + public void init() { + this.readerSliceConf = super.getPluginJobConf(); + this.tunnelServer = this.readerSliceConf.getString( + Key.TUNNEL_SERVER, null); + + this.odps = OdpsUtil.initOdps(this.readerSliceConf); + this.projectName = this.readerSliceConf.getString(Key.PROJECT); + this.tableName = this.readerSliceConf.getString(Key.TABLE); + this.table = OdpsUtil.getTable(this.odps, projectName, tableName); + this.isPartitionedTable = this.readerSliceConf 
+ .getBool(Constant.IS_PARTITIONED_TABLE); + this.sessionId = this.readerSliceConf.getString(Constant.SESSION_ID, null); + + + + this.isCompress = this.readerSliceConf.getBool(Key.IS_COMPRESS, false); + + // sessionId 为空的情况是:切分级别只到 partition 的情况 + if (StringUtils.isBlank(this.sessionId)) { + DownloadSession session = OdpsUtil.createMasterSessionForPartitionedTable(odps, + tunnelServer, projectName, tableName, this.readerSliceConf.getString(Key.PARTITION)); + this.sessionId = session.getId(); + } + + LOG.info("sessionId:{}", this.sessionId); + } + + @Override + public void prepare() { + } + + @Override + public void startRead(RecordSender recordSender) { + DownloadSession downloadSession = null; + String partition = this.readerSliceConf.getString(Key.PARTITION); + + if (this.isPartitionedTable) { + downloadSession = OdpsUtil.getSlaveSessionForPartitionedTable(this.odps, this.sessionId, + this.tunnelServer, this.projectName, this.tableName, partition); + } else { + downloadSession = OdpsUtil.getSlaveSessionForNonPartitionedTable(this.odps, this.sessionId, + this.tunnelServer, this.projectName, this.tableName); + } + + long start = this.readerSliceConf.getLong(Constant.START_INDEX, 0); + long count = this.readerSliceConf.getLong(Constant.STEP_COUNT, + downloadSession.getRecordCount()); + + if (count > 0) { + LOG.info(String.format( + "Begin to read ODPS table:%s, partition:%s, startIndex:%s, count:%s.", + this.tableName, partition, start, count)); + } else if (count == 0) { + LOG.warn(String.format("源头表:%s 的分区:%s 没有内容可抽取, 请您知晓.", + this.tableName, partition)); + return; + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.READ_DATA_FAIL, + String.format("源头表:%s 的分区:%s 读取行数为负数, 请联系 ODPS 管理员查看表状态!", + this.tableName, partition)); + } + + TableSchema tableSchema = this.table.getSchema(); + Set allColumns = new HashSet(); + allColumns.addAll(tableSchema.getColumns()); + allColumns.addAll(tableSchema.getPartitionColumns()); + + Map columnTypeMap = new HashMap(); + for (Column column : allColumns) { + columnTypeMap.put(column.getName(), column.getType()); + } + + try { + List parsedColumnsTmp = this.readerSliceConf + .getListConfiguration(Constant.PARSED_COLUMNS); + List> parsedColumns = new ArrayList>(); + for (int i = 0; i < parsedColumnsTmp.size(); i++) { + Configuration eachColumnConfig = parsedColumnsTmp.get(i); + String columnName = eachColumnConfig.getString("left"); + ColumnType columnType = ColumnType + .asColumnType(eachColumnConfig.getString("right")); + parsedColumns.add(new MutablePair( + columnName, columnType)); + + } + ReaderProxy readerProxy = new ReaderProxy(recordSender, downloadSession, + columnTypeMap, parsedColumns, partition, this.isPartitionedTable, + start, count, this.isCompress); + + readerProxy.doRead(); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.READ_DATA_FAIL, + String.format("源头表:%s 的分区:%s 读取失败, 请联系 ODPS 管理员查看错误详情.", this.tableName, partition), e); + } + + } + + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReaderErrorCode.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReaderErrorCode.java new file mode 100755 index 0000000000..cdda6ac862 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/OdpsReaderErrorCode.java @@ -0,0 +1,60 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +import 
com.alibaba.datax.common.spi.ErrorCode; + +public enum OdpsReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("OdpsReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("OdpsReader-01", "您配置的值不合法."), + CREATE_DOWNLOADSESSION_FAIL("OdpsReader-03", "创建 ODPS 的 downloadSession 失败."), + GET_DOWNLOADSESSION_FAIL("OdpsReader-04", "获取 ODPS 的 downloadSession 失败."), + READ_DATA_FAIL("OdpsReader-05", "读取 ODPS 源头表失败."), + GET_ID_KEY_FAIL("OdpsReader-06", "获取 accessId/accessKey 失败."), + + ODPS_READ_EXCEPTION("OdpsReader-07", "读取 odps 异常"), + OPEN_RECORD_READER_FAILED("OdpsReader-08", "打开 recordReader 失败."), + + ODPS_PROJECT_NOT_FOUNT("OdpsReader-10", "您配置的值不合法, odps project 不存在."), //ODPS-0420111: Project not found + + ODPS_TABLE_NOT_FOUNT("OdpsReader-12", "您配置的值不合法, odps table 不存在."), // ODPS-0130131:Table not found + + ODPS_ACCESS_KEY_ID_NOT_FOUND("OdpsReader-13", "您配置的值不合法, odps accessId,accessKey 不存在."), //ODPS-0410051:Invalid credentials - accessKeyId not found + + ODPS_ACCESS_KEY_INVALID("OdpsReader-14", "您配置的值不合法, odps accessKey 错误."), //ODPS-0410042:Invalid signature value - User signature dose not match + + ODPS_ACCESS_DENY("OdpsReader-15", "拒绝访问, 您不在 您配置的 project 中."), //ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project + + + + SPLIT_MODE_ERROR("OdpsReader-30", "splitMode配置错误."), + + ACCOUNT_TYPE_ERROR("OdpsReader-31", "odps 账号类型错误."), + + VIRTUAL_VIEW_NOT_SUPPORT("OdpsReader-32", "Datax 不支持 读取虚拟视图."), + + PARTITION_ERROR("OdpsReader-33", "分区配置错误."), + + ; + private final String code; + private final String description; + + private OdpsReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ReaderProxy.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ReaderProxy.java new file mode 100755 index 0000000000..8e069ef568 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/ReaderProxy.java @@ -0,0 +1,281 @@ +package com.alibaba.datax.plugin.reader.odpsreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.plugin.reader.odpsreader.util.OdpsUtil; +import com.aliyun.odps.OdpsType; +import com.aliyun.odps.data.Record; +import com.aliyun.odps.data.RecordReader; +import com.aliyun.odps.tunnel.TableTunnel; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.ParseException; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +public class ReaderProxy { + private static final Logger LOG = LoggerFactory + .getLogger(ReaderProxy.class); + private static boolean IS_DEBUG = LOG.isDebugEnabled(); + + private RecordSender recordSender; + private TableTunnel.DownloadSession downloadSession; + private Map columnTypeMap; + private List> parsedColumns; + private String partition; + private boolean isPartitionTable; + + private long start; + private long count; + private boolean isCompress; + + public ReaderProxy(RecordSender recordSender, TableTunnel.DownloadSession downloadSession, + Map columnTypeMap, + List> parsedColumns, String partition, + boolean isPartitionTable, long start, long count, boolean isCompress) { + this.recordSender = recordSender; + this.downloadSession = downloadSession; + this.columnTypeMap = columnTypeMap; + this.parsedColumns = parsedColumns; + this.partition = partition; + this.isPartitionTable = isPartitionTable; + this.start = start; + this.count = count; + this.isCompress = isCompress; + } + + // warn: odps 分区列和正常列不能重名, 所有列都不不区分大小写 + public void doRead() { + try { + LOG.info("start={}, count={}",start, count); + //RecordReader recordReader = downloadSession.openRecordReader(start, count, isCompress); + RecordReader recordReader = OdpsUtil.getRecordReader(downloadSession, start, count, isCompress); + + Record odpsRecord; + Map partitionMap = this + .parseCurrentPartitionValue(); + + int retryTimes = 1; + while (true) { + try { + odpsRecord = recordReader.read(); + } catch(Exception e) { + //odps read 异常后重试10次 + LOG.warn("warn : odps read exception: {}", e.getMessage()); + if(retryTimes < 10) { + try { + Thread.sleep(2000); + } catch (InterruptedException ignored) { + } + recordReader = downloadSession.openRecordReader(start, count, isCompress); + LOG.warn("odps-read-exception, 重试第{}次", retryTimes); + retryTimes++; + continue; + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_READ_EXCEPTION, e); + } + } + //记录已经读取的点 + start++; + count--; + + if (odpsRecord != null) { + + com.alibaba.datax.common.element.Record dataXRecord = recordSender + .createRecord(); + // warn: for PARTITION||NORMAL columnTypeMap's key + // sets(columnName) is big than parsedColumns's left + // sets(columnName), always contain + for (Pair pair : this.parsedColumns) { + String columnName = pair.getLeft(); + switch (pair.getRight()) { + case PARTITION: + String partitionColumnValue = this + .getPartitionColumnValue(partitionMap, + columnName); + 
this.odpsColumnToDataXField(odpsRecord, dataXRecord, + this.columnTypeMap.get(columnName), + partitionColumnValue, true); + break; + case NORMAL: + this.odpsColumnToDataXField(odpsRecord, dataXRecord, + this.columnTypeMap.get(columnName), columnName, + false); + break; + case CONSTANT: + dataXRecord.addColumn(new StringColumn(columnName)); + break; + default: + break; + } + } + recordSender.sendToWriter(dataXRecord); + } else { + break; + } + } + //fixed, 避免recordReader.close失败,跟鸣天确认过,可以不用关闭RecordReader + try { + recordReader.close(); + } catch (Exception e) { + LOG.warn("recordReader close exception", e); + } + } catch (DataXException e) { + throw e; + } catch (Exception e) { + // warn: if dirty + throw DataXException.asDataXException( + OdpsReaderErrorCode.READ_DATA_FAIL, e); + } + } + + private Map parseCurrentPartitionValue() { + Map partitionMap = new HashMap(); + if (this.isPartitionTable) { + String[] splitedPartition = this.partition.split(","); + for (String eachPartition : splitedPartition) { + String[] partitionDetail = eachPartition.split("="); + // warn: check partition like partition=1 + if (2 != partitionDetail.length) { + throw DataXException + .asDataXException( + OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format( + "您的分区 [%s] 解析出现错误,解析后正确的配置方式类似为 [ pt=1,dt=1 ].", + eachPartition)); + } + // warn: translate to lower case, it's more comfortable to + // compare whit user's input columns + String partitionName = partitionDetail[0].toLowerCase(); + String partitionValue = partitionDetail[1]; + partitionMap.put(partitionName, partitionValue); + } + } + if (IS_DEBUG) { + LOG.debug(String.format("partition value details: %s", + com.alibaba.fastjson.JSON.toJSONString(partitionMap))); + } + return partitionMap; + } + + private String getPartitionColumnValue(Map partitionMap, + String partitionColumnName) { + // warn: to lower case + partitionColumnName = partitionColumnName.toLowerCase(); + // it's will never happen, but add this checking + if (!partitionMap.containsKey(partitionColumnName)) { + String errorMessage = String.format( + "表所有分区信息为: %s 其中找不到 [%s] 对应的分区值.", + com.alibaba.fastjson.JSON.toJSONString(partitionMap), + partitionColumnName); + throw DataXException.asDataXException( + OdpsReaderErrorCode.READ_DATA_FAIL, errorMessage); + } + return partitionMap.get(partitionColumnName); + } + + /** + * TODO warn: odpsRecord 的 String 可能获取出来的是 binary + * + * warn: there is no dirty data in reader plugin, so do not handle dirty + * data with TaskPluginCollector + * + * warn: odps only support BIGINT && String partition column actually + * + * @param odpsRecord + * every line record of odps table + * @param dataXRecord + * every datax record, to be send to writer. 
method getXXX() case sensitive + * @param type + * odps column type + * @param columnNameValue + * for partition column it's column value, for normal column it's + * column name + * @param isPartitionColumn + * true means partition column and false means normal column + * */ + private void odpsColumnToDataXField(Record odpsRecord, + com.alibaba.datax.common.element.Record dataXRecord, OdpsType type, + String columnNameValue, boolean isPartitionColumn) { + switch (type) { + case BIGINT: { + if (isPartitionColumn) { + dataXRecord.addColumn(new LongColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new LongColumn(odpsRecord + .getBigint(columnNameValue))); + } + break; + } + case BOOLEAN: { + if (isPartitionColumn) { + dataXRecord.addColumn(new BoolColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new BoolColumn(odpsRecord + .getBoolean(columnNameValue))); + } + break; + } + case DATETIME: { + if (isPartitionColumn) { + try { + dataXRecord.addColumn(new DateColumn(ColumnCast + .string2Date(new StringColumn(columnNameValue)))); + } catch (ParseException e) { + LOG.error(String.format("", this.partition)); + String errMessage = String.format( + "您读取分区 [%s] 出现日期转换异常, 日期的字符串表示为 [%s].", + this.partition, columnNameValue); + LOG.error(errMessage); + throw DataXException.asDataXException( + OdpsReaderErrorCode.READ_DATA_FAIL, errMessage, e); + } + } else { + dataXRecord.addColumn(new DateColumn(odpsRecord + .getDatetime(columnNameValue))); + } + + break; + } + case DOUBLE: { + if (isPartitionColumn) { + dataXRecord.addColumn(new DoubleColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new DoubleColumn(odpsRecord + .getDouble(columnNameValue))); + } + break; + } + case DECIMAL: { + if(isPartitionColumn) { + dataXRecord.addColumn(new DoubleColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new DoubleColumn(odpsRecord.getDecimal(columnNameValue))); + } + break; + } + case STRING: { + if (isPartitionColumn) { + dataXRecord.addColumn(new StringColumn(columnNameValue)); + } else { + dataXRecord.addColumn(new StringColumn(odpsRecord + .getString(columnNameValue))); + } + break; + } + default: + throw DataXException + .asDataXException( + OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format( + "DataX 抽取 ODPS 数据不支持字段类型为:[%s]. 目前支持抽取的字段类型有:bigint, boolean, datetime, double, decimal, string. " + + "您可以选择不抽取 DataX 不支持的字段或者联系 ODPS 管理员寻求帮助.", + type)); + } + } + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/DESCipher.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/DESCipher.java new file mode 100644 index 0000000000..dad82d501d --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/DESCipher.java @@ -0,0 +1,355 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import javax.crypto.Cipher; +import javax.crypto.SecretKey; +import javax.crypto.SecretKeyFactory; +import javax.crypto.spec.DESKeySpec; +import java.security.SecureRandom; + +/** + *   * DES加解密,支持与delphi交互(字符串编码需统一为UTF-8) + * + *   * + * + *   * @author wym + * + *    + */ + +public class DESCipher { + + /** + *   * 密钥 + * + *    + */ + + public static final String KEY = "u4Gqu4Z8"; + + private final static String DES = "DES"; + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成加密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.ENCRYPT_MODE, securekey, sr); + + // 现在,获取数据并加密 + + // 正式执行加密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建一个DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec对象转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成解密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.DECRYPT_MODE, securekey, sr); + + // 现在,获取数据并解密 + + // 正式执行解密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src) throws Exception { + + return encrypt(src, KEY.getBytes()); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src) throws Exception { + + return decrypt(src, KEY.getBytes()); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字符串) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public final static String encrypt(String src) { + + try { + + return byte2hex(encrypt(src.getBytes(), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字符串) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * + *    + */ + + public final static String decrypt(String src) { + try { + + return new String(decrypt(hex2byte(src.getBytes()), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public static String 
encryptToString(byte[] src) throws Exception { + + return encrypt(new String(src)); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * + *    + */ + + public static String decryptToString(byte[] src) throws Exception { + + return decrypt(new String(src)); + + } + + public static String byte2hex(byte[] b) { + + String hs = ""; + + String stmp = ""; + + for (int n = 0; n < b.length; n++) { + + stmp = (Integer.toHexString(b[n] & 0XFF)); + + if (stmp.length() == 1) + + hs = hs + "0" + stmp; + + else + + hs = hs + stmp; + + } + + return hs.toUpperCase(); + + } + + public static byte[] hex2byte(byte[] b) { + + if ((b.length % 2) != 0) + + throw new IllegalArgumentException("长度不是偶数"); + + byte[] b2 = new byte[b.length / 2]; + + for (int n = 0; n < b.length; n += 2) { + + String item = new String(b, n, 2); + + b2[n / 2] = (byte) Integer.parseInt(item, 16); + + } + return b2; + + } + + /* + * public static void main(String[] args) { try { String src = "cheetah"; + * String crypto = DESCipher.encrypt(src); System.out.println("密文[" + src + + * "]:" + crypto); System.out.println("解密后:" + DESCipher.decrypt(crypto)); } + * catch (Exception e) { e.printStackTrace(); } } + */ +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/IdAndKeyUtil.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/IdAndKeyUtil.java new file mode 100644 index 0000000000..faa90a987d --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/IdAndKeyUtil.java @@ -0,0 +1,85 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.odpsreader.Constant; +import com.alibaba.datax.plugin.reader.odpsreader.Key; +import com.alibaba.datax.plugin.reader.odpsreader.OdpsReaderErrorCode; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; + +public class IdAndKeyUtil { + private static Logger LOG = LoggerFactory.getLogger(IdAndKeyUtil.class); + + public static Configuration parseAccessIdAndKey(Configuration originalConfig) { + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + + // 只要 accessId,accessKey 二者配置了一个,就理解为是用户本意是要直接手动配置其 accessid/accessKey + if (StringUtils.isNotBlank(accessId) || StringUtils.isNotBlank(accessKey)) { + LOG.info("Try to get accessId/accessKey from your config."); + //通过如下语句,进行检查是否确实配置了 + accessId = originalConfig.getNecessaryValue(Key.ACCESS_ID, OdpsReaderErrorCode.REQUIRED_VALUE); + accessKey = originalConfig.getNecessaryValue(Key.ACCESS_KEY, OdpsReaderErrorCode.REQUIRED_VALUE); + //检查完毕,返回即可 + return originalConfig; + } else { + Map envProp = System.getenv(); + return getAccessIdAndKeyFromEnv(originalConfig, envProp); + } + } + + private static Configuration getAccessIdAndKeyFromEnv(Configuration originalConfig, + Map envProp) { + String accessId = null; + String accessKey = null; + + String skynetAccessID = envProp.get(Constant.SKYNET_ACCESSID); + String skynetAccessKey = envProp.get(Constant.SKYNET_ACCESSKEY); + + if (StringUtils.isNotBlank(skynetAccessID) + || StringUtils.isNotBlank(skynetAccessKey)) { + /** + * 环境变量中,如果存在SKYNET_ACCESSID/SKYNET_ACCESSKEy(只要有其中一个变量,则认为一定是两个都存在的!), + * 则使用其值作为odps的accessId/accessKey(会解密) + */ + + LOG.info("Try to get accessId/accessKey from environment."); + accessId = skynetAccessID; + accessKey = DESCipher.decrypt(skynetAccessKey); + if (StringUtils.isNotBlank(accessKey)) { + originalConfig.set(Key.ACCESS_ID, accessId); + originalConfig.set(Key.ACCESS_KEY, accessKey); + LOG.info("Get accessId/accessKey from environment variables successfully."); + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_ID_KEY_FAIL, + String.format("从环境变量中获取accessId/accessKey 失败, accessId=[%s]", accessId)); + } + } else { + // 无处获取(既没有配置在作业中,也没用在环境变量中) + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_ID_KEY_FAIL, + "无法获取到accessId/accessKey. 它们既不存在于您的配置中,也不存在于环境变量中."); + } + + return originalConfig; + } +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsExceptionMsg.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsExceptionMsg.java new file mode 100644 index 0000000000..35ac822146 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsExceptionMsg.java @@ -0,0 +1,18 @@ +package com.alibaba.datax.plugin.reader.odpsreader.util; + +/** + * Created by hongjiao.hj on 2015/6/9. 
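+ *
+ * 这些常量是 ODPS 服务端典型错误信息的片段,OdpsUtil.throwDataXExceptionWhenReloadTable 用它们
+ * 把 table.reload() 抛出的底层异常映射为更明确的 DataX 错误码(project/table 不存在、鉴权失败、无权限等).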
+ */ +public class OdpsExceptionMsg { + + public static final String ODPS_PROJECT_NOT_FOUNT = "ODPS-0420111: Project not found"; + + public static final String ODPS_TABLE_NOT_FOUNT = "ODPS-0130131:Table not found"; + + public static final String ODPS_ACCESS_KEY_ID_NOT_FOUND = "ODPS-0410051:Invalid credentials - accessKeyId not found"; + + public static final String ODPS_ACCESS_KEY_INVALID = "ODPS-0410042:Invalid signature value - User signature dose not match"; + + public static final String ODPS_ACCESS_DENY = "ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project"; + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsSplitUtil.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsSplitUtil.java new file mode 100755 index 0000000000..b7f4f1aaf3 --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsSplitUtil.java @@ -0,0 +1,168 @@ +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RangeSplitUtil; +import com.alibaba.datax.plugin.reader.odpsreader.Constant; +import com.alibaba.datax.plugin.reader.odpsreader.Key; +import com.alibaba.datax.plugin.reader.odpsreader.OdpsReaderErrorCode; +import com.aliyun.odps.Odps; +import com.aliyun.odps.tunnel.TableTunnel.DownloadSession; +import org.apache.commons.lang3.tuple.ImmutablePair; +import org.apache.commons.lang3.tuple.Pair; + +import java.util.ArrayList; +import java.util.List; + +public final class OdpsSplitUtil { + + public static List doSplit(Configuration originalConfig, Odps odps, + int adviceNum) { + boolean isPartitionedTable = originalConfig.getBool(Constant.IS_PARTITIONED_TABLE); + if (isPartitionedTable) { + // 分区表 + return splitPartitionedTable(odps, originalConfig, adviceNum); + } else { + // 非分区表 + return splitForNonPartitionedTable(odps, adviceNum, originalConfig); + } + + } + + private static List splitPartitionedTable(Odps odps, Configuration originalConfig, + int adviceNum) { + List splittedConfigs = new ArrayList(); + + List partitions = originalConfig.getList(Key.PARTITION, + String.class); + + if (null == partitions || partitions.isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, + "您所配置的分区不能为空白."); + } + + //splitMode 默认为 record + String splitMode = originalConfig.getString(Key.SPLIT_MODE); + Configuration tempConfig = null; + if (partitions.size() > adviceNum || Constant.PARTITION_SPLIT_MODE.equals(splitMode)) { + // 此时不管 splitMode 是什么,都不需要再进行切分了 + // 注意:此处没有把 sessionId 设置到 config 中去,所以后续在 task 中获取 sessionId 时,需要针对这种情况重新创建 sessionId + for (String onePartition : partitions) { + tempConfig = originalConfig.clone(); + tempConfig.set(Key.PARTITION, onePartition); + splittedConfigs.add(tempConfig); + } + + return splittedConfigs; + } else { + // 还需要计算对每个分区,切分份数等信息 + int eachPartitionShouldSplittedNumber = calculateEachPartitionShouldSplittedNumber( + adviceNum, partitions.size()); + + for (String onePartition : partitions) { + List configs = splitOnePartition(odps, + onePartition, eachPartitionShouldSplittedNumber, + originalConfig); + splittedConfigs.addAll(configs); + } + + return splittedConfigs; + } + } + + private static int calculateEachPartitionShouldSplittedNumber( + int adviceNumber, int partitionNumber) { + double tempNum = 1.0 * adviceNumber / partitionNumber; + + return (int) 
Math.ceil(tempNum); + } + + private static List splitForNonPartitionedTable(Odps odps, + int adviceNum, Configuration sliceConfig) { + List params = new ArrayList(); + + String tunnelServer = sliceConfig.getString(Key.TUNNEL_SERVER); + String tableName = sliceConfig.getString(Key.TABLE); + + String projectName = sliceConfig.getString(Key.PROJECT); + + DownloadSession session = OdpsUtil.createMasterSessionForNonPartitionedTable(odps, + tunnelServer, projectName, tableName); + + String id = session.getId(); + long count = session.getRecordCount(); + + List> splitResult = splitRecordCount(count, adviceNum); + + for (Pair pair : splitResult) { + Configuration iParam = sliceConfig.clone(); + iParam.set(Constant.SESSION_ID, id); + iParam.set(Constant.START_INDEX, pair.getLeft().longValue()); + iParam.set(Constant.STEP_COUNT, pair.getRight().longValue()); + + params.add(iParam); + } + + return params; + } + + private static List splitOnePartition(Odps odps, + String onePartition, int adviceNum, Configuration sliceConfig) { + List params = new ArrayList(); + + String tunnelServer = sliceConfig.getString(Key.TUNNEL_SERVER); + String tableName = sliceConfig.getString(Key.TABLE); + + String projectName = sliceConfig.getString(Key.PROJECT); + + DownloadSession session = OdpsUtil.createMasterSessionForPartitionedTable(odps, + tunnelServer, projectName, tableName, onePartition); + + String id = session.getId(); + long count = session.getRecordCount(); + + List> splitResult = splitRecordCount(count, adviceNum); + + for (Pair pair : splitResult) { + Configuration iParam = sliceConfig.clone(); + iParam.set(Key.PARTITION, onePartition); + iParam.set(Constant.SESSION_ID, id); + iParam.set(Constant.START_INDEX, pair.getLeft().longValue()); + iParam.set(Constant.STEP_COUNT, pair.getRight().longValue()); + + params.add(iParam); + } + + return params; + } + + /** + * Pair left: startIndex, right: stepCount + */ + private static List> splitRecordCount(long recordCount, int adviceNum) { + if(recordCount<0){ + throw new IllegalArgumentException("切分的 recordCount 不能为负数.recordCount=" + recordCount); + } + + if(adviceNum<1){ + throw new IllegalArgumentException("切分的 adviceNum 不能为负数.adviceNum=" + adviceNum); + } + + List> result = new ArrayList>(); + // 为了适配 RangeSplitUtil 的处理逻辑,起始值从0开始计算 + if (recordCount == 0) { + result.add(ImmutablePair.of(0L, 0L)); + return result; + } + + long[] tempResult = RangeSplitUtil.doLongSplit(0L, recordCount - 1, adviceNum); + + tempResult[tempResult.length - 1]++; + + for (int i = 0; i < tempResult.length - 1; i++) { + result.add(ImmutablePair.of(tempResult[i], (tempResult[i + 1] - tempResult[i]))); + } + return result; + } + +} diff --git a/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsUtil.java b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsUtil.java new file mode 100755 index 0000000000..2aa3f66e4a --- /dev/null +++ b/odpsreader/src/main/java/com/alibaba/datax/plugin/reader/odpsreader/util/OdpsUtil.java @@ -0,0 +1,378 @@ +package com.alibaba.datax.plugin.reader.odpsreader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.reader.odpsreader.ColumnType; +import com.alibaba.datax.plugin.reader.odpsreader.Constant; +import com.alibaba.datax.plugin.reader.odpsreader.Key; +import com.alibaba.datax.plugin.reader.odpsreader.OdpsReaderErrorCode; +import com.aliyun.odps.*; +import 
com.aliyun.odps.account.Account; +import com.aliyun.odps.account.AliyunAccount; +import com.aliyun.odps.data.RecordReader; +import com.aliyun.odps.tunnel.TableTunnel; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.MutablePair; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; +import java.util.concurrent.Callable; + +public final class OdpsUtil { + private static final Logger LOG = LoggerFactory.getLogger(OdpsUtil.class); + + public static int MAX_RETRY_TIME = 10; + + public static void checkNecessaryConfig(Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.ODPS_SERVER, OdpsReaderErrorCode.REQUIRED_VALUE); + + originalConfig.getNecessaryValue(Key.PROJECT, OdpsReaderErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, OdpsReaderErrorCode.REQUIRED_VALUE); + + if (null == originalConfig.getList(Key.COLUMN) || + originalConfig.getList(Key.COLUMN, String.class).isEmpty()) { + throw DataXException.asDataXException(OdpsReaderErrorCode.REQUIRED_VALUE, "datax获取不到源表的列信息, 由于您未配置读取源头表的列信息. datax无法知道该抽取表的哪些字段的数据 " + + "正确的配置方式是给 column 配置上您需要读取的列名称,用英文逗号分隔."); + } + + } + + public static void dealMaxRetryTime(Configuration originalConfig) { + int maxRetryTime = originalConfig.getInt(Key.MAX_RETRY_TIME, + OdpsUtil.MAX_RETRY_TIME); + if (maxRetryTime < 1 || maxRetryTime > OdpsUtil.MAX_RETRY_TIME) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, "您所配置的maxRetryTime 值错误. 该值不能小于1, 且不能大于 " + OdpsUtil.MAX_RETRY_TIME + + ". 推荐的配置方式是给maxRetryTime 配置1-11之间的某个值. 请您检查配置并做出相应修改."); + } + MAX_RETRY_TIME = maxRetryTime; + } + + public static Odps initOdps(Configuration originalConfig) { + String odpsServer = originalConfig.getString(Key.ODPS_SERVER); + + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + String project = originalConfig.getString(Key.PROJECT); + + String packageAuthorizedProject = originalConfig.getString(Key.PACKAGE_AUTHORIZED_PROJECT); + + String defaultProject; + if(StringUtils.isBlank(packageAuthorizedProject)) { + defaultProject = project; + } else { + defaultProject = packageAuthorizedProject; + } + + String accountType = originalConfig.getString(Key.ACCOUNT_TYPE, + Constant.DEFAULT_ACCOUNT_TYPE); + + Account account = null; + if (accountType.equalsIgnoreCase(Constant.DEFAULT_ACCOUNT_TYPE)) { + account = new AliyunAccount(accessId, accessKey); + } else { + throw DataXException.asDataXException(OdpsReaderErrorCode.ACCOUNT_TYPE_ERROR, + String.format("不支持的账号类型:[%s]. 
账号类型目前仅支持aliyun, taobao.", accountType)); + } + + Odps odps = new Odps(account); + boolean isPreCheck = originalConfig.getBool("dryRun", false); + if(isPreCheck) { + odps.getRestClient().setConnectTimeout(3); + odps.getRestClient().setReadTimeout(3); + odps.getRestClient().setRetryTimes(2); + } + odps.setDefaultProject(defaultProject); + odps.setEndpoint(odpsServer); + + return odps; + } + + public static Table getTable(Odps odps, String projectName, String tableName) { + final Table table = odps.tables().get(projectName, tableName); + try { + //通过这种方式检查表是否存在,失败重试。重试策略:每秒钟重试一次,最大重试3次 + return RetryUtil.executeWithRetry(new Callable() { + @Override + public Table call() throws Exception { + table.reload(); + return table; + } + }, 3, 1000, false); + } catch (Exception e) { + throwDataXExceptionWhenReloadTable(e, tableName); + } + return table; + } + + public static boolean isPartitionedTable(Table table) { + return getPartitionDepth(table) > 0; + } + + public static int getPartitionDepth(Table table) { + TableSchema tableSchema = table.getSchema(); + + return tableSchema.getPartitionColumns().size(); + } + + public static List getTableAllPartitions(Table table) { + List tableAllPartitions = table.getPartitions(); + + List retPartitions = new ArrayList(); + + if (null != tableAllPartitions) { + for (Partition partition : tableAllPartitions) { + retPartitions.add(partition.getPartitionSpec().toString()); + } + } + + return retPartitions; + } + + public static List getTableAllColumns(Table table) { + TableSchema tableSchema = table.getSchema(); + return tableSchema.getColumns(); + } + + + public static List getTableOriginalColumnNameList( + List columns) { + List tableOriginalColumnNameList = new ArrayList(); + + for (Column column : columns) { + tableOriginalColumnNameList.add(column.getName()); + } + + return tableOriginalColumnNameList; + } + + public static String formatPartition(String partition) { + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, + "您所配置的分区不能为空白."); + } else { + return partition.trim().replaceAll(" *= *", "=") + .replaceAll(" */ *", ",").replaceAll(" *, *", ",") + .replaceAll("'", ""); + } + } + + public static List formatPartitions(List partitions) { + if (null == partitions || partitions.isEmpty()) { + return Collections.emptyList(); + } else { + List formattedPartitions = new ArrayList(); + for (String partition : partitions) { + formattedPartitions.add(formatPartition(partition)); + + } + return formattedPartitions; + } + } + + public static List> parseColumns( + List allNormalColumns, List allPartitionColumns, + List userConfiguredColumns) { + List> parsededColumns = new ArrayList>(); + // warn: upper & lower case + for (String column : userConfiguredColumns) { + MutablePair pair = new MutablePair(); + + // if constant column + if (OdpsUtil.checkIfConstantColumn(column)) { + // remove first and last ' + pair.setLeft(column.substring(1, column.length() - 1)); + pair.setRight(ColumnType.CONSTANT); + parsededColumns.add(pair); + continue; + } + + // if normal column, warn: in o d p s normal columns can not + // repeated in partitioning columns + int index = OdpsUtil.indexOfIgnoreCase(allNormalColumns, column); + if (0 <= index) { + pair.setLeft(allNormalColumns.get(index)); + pair.setRight(ColumnType.NORMAL); + parsededColumns.add(pair); + continue; + } + + // if partition column + index = OdpsUtil.indexOfIgnoreCase(allPartitionColumns, column); + if (0 <= index) { + 
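+                // 命中分区列:左值记录表中原始大小写的分区列名,类型标记为 PARTITION,读取时其值从分区串中解析(见 ReaderProxy)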
pair.setLeft(allPartitionColumns.get(index)); + pair.setRight(ColumnType.PARTITION); + parsededColumns.add(pair); + continue; + } + // not exist column + throw DataXException.asDataXException( + OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format("源头表的列配置错误. 您所配置的列 [%s] 不存在.", column)); + + } + return parsededColumns; + } + + private static int indexOfIgnoreCase(List columnCollection, + String column) { + int index = -1; + for (int i = 0; i < columnCollection.size(); i++) { + if (columnCollection.get(i).equalsIgnoreCase(column)) { + index = i; + break; + } + } + return index; + } + + public static boolean checkIfConstantColumn(String column) { + if (column.length() >= 2 && column.startsWith(Constant.COLUMN_CONSTANT_FLAG) && + column.endsWith(Constant.COLUMN_CONSTANT_FLAG)) { + return true; + } else { + return false; + } + } + + public static TableTunnel.DownloadSession createMasterSessionForNonPartitionedTable(Odps odps, String tunnelServer, + final String projectName, final String tableName) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() throws Exception { + return tunnel.createDownloadSession( + projectName, tableName); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.CREATE_DOWNLOADSESSION_FAIL, e); + } + } + + public static TableTunnel.DownloadSession getSlaveSessionForNonPartitionedTable(Odps odps, final String sessionId, + String tunnelServer, final String projectName, final String tableName) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() throws Exception { + return tunnel.getDownloadSession( + projectName, tableName, sessionId); + } + }, MAX_RETRY_TIME ,1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_DOWNLOADSESSION_FAIL, e); + } + } + + public static TableTunnel.DownloadSession createMasterSessionForPartitionedTable(Odps odps, String tunnelServer, + final String projectName, final String tableName, String partition) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + final PartitionSpec partitionSpec = new PartitionSpec(partition); + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() throws Exception { + return tunnel.createDownloadSession( + projectName, tableName, partitionSpec); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.CREATE_DOWNLOADSESSION_FAIL, e); + } + } + + public static TableTunnel.DownloadSession getSlaveSessionForPartitionedTable(Odps odps, final String sessionId, + String tunnelServer, final String projectName, final String tableName, String partition) { + + final TableTunnel tunnel = new TableTunnel(odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tunnel.setEndpoint(tunnelServer); + } + + final PartitionSpec partitionSpec = new PartitionSpec(partition); + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.DownloadSession call() 
throws Exception { + return tunnel.getDownloadSession( + projectName, tableName, partitionSpec, sessionId); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.GET_DOWNLOADSESSION_FAIL, e); + } + } + + + + public static RecordReader getRecordReader(final TableTunnel.DownloadSession downloadSession, final long start, final long count, + final boolean isCompress) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public RecordReader call() throws Exception { + return downloadSession.openRecordReader(start, count, isCompress); + } + }, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsReaderErrorCode.OPEN_RECORD_READER_FAILED, + "open RecordReader失败. 请联系 ODPS 管理员处理.", e); + } + } + + /** + * table.reload() 方法抛出的 odps 异常 转化为更清晰的 datax 异常 抛出 + */ + public static void throwDataXExceptionWhenReloadTable(Exception e, String tableName) { + if(e.getMessage() != null) { + if(e.getMessage().contains(OdpsExceptionMsg.ODPS_PROJECT_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_PROJECT_NOT_FOUNT, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [project] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_TABLE_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_TABLE_NOT_FOUNT, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [table] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_ID_NOT_FOUND)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_ACCESS_KEY_ID_NOT_FOUND, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [accessId] [accessKey]是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_INVALID)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_ACCESS_KEY_INVALID, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [accessKey] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_DENY)) { + throw DataXException.asDataXException(OdpsReaderErrorCode.ODPS_ACCESS_DENY, + String.format("加载 ODPS 源头表:%s 失败. " + + "请检查您配置的 ODPS 源头表的 [accessId] [accessKey] [project]是否匹配.", tableName), e); + } + } + throw DataXException.asDataXException(OdpsReaderErrorCode.ILLEGAL_VALUE, + String.format("加载 ODPS 源头表:%s 失败. 
" + + "请检查您配置的 ODPS 源头表的 project,table,accessId,accessKey,odpsServer等值.", tableName), e); + } +} diff --git a/odpsreader/src/main/libs/bcprov-jdk15on-1.52.jar b/odpsreader/src/main/libs/bcprov-jdk15on-1.52.jar new file mode 100644 index 0000000000..6c54dd901c Binary files /dev/null and b/odpsreader/src/main/libs/bcprov-jdk15on-1.52.jar differ diff --git a/odpsreader/src/main/resources/plugin.json b/odpsreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..2d441acf6a --- /dev/null +++ b/odpsreader/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "odpsreader", + "class": "com.alibaba.datax.plugin.reader.odpsreader.OdpsReader", + "description": { + "useScene": "prod.", + "mechanism": "TODO", + "warn": "TODO" + }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/odpsreader/src/main/resources/plugin_job_template.json b/odpsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..6eddf0cd2f --- /dev/null +++ b/odpsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "odpsreader", + "parameter": { + "accessId": "", + "accessKey": "", + "project": "", + "table": "", + "partition": [], + "column": [], + "packageAuthorizedProject": "", + "splitMode": "", + "odpsServer": "" + } +} \ No newline at end of file diff --git a/odpswriter/doc/odpswriter.md b/odpswriter/doc/odpswriter.md new file mode 100644 index 0000000000..053a77c2fa --- /dev/null +++ b/odpswriter/doc/odpswriter.md @@ -0,0 +1,338 @@ +# DataX ODPS写入 + + +--- + + +## 1 快速介绍 + +ODPSWriter插件用于实现往ODPS插入或者更新数据,主要提供给etl开发同学将业务数据导入odps,适合于TB,GB数量级的数据传输,如果需要传输PB量级的数据,请选择dt task工具 ; + + + +## 2 实现原理 + +在底层实现上,ODPSWriter是通过DT Tunnel写入ODPS系统的,有关ODPS的更多技术细节请参看 ODPS主站 https://data.aliyun.com/product/odps 和ODPS产品文档 https://help.aliyun.com/product/27797.html + +目前 DataX3 依赖的 SDK 版本是: + + + com.aliyun.odps + odps-sdk-core-internal + 0.13.2 + + + +注意: **如果你需要使用ODPSReader/Writer插件,请务必使用JDK 1.6-32及以上版本** +使用java -version查看Java版本号 + +## 3 功能说明 + +### 3.1 配置样例 +* 这里使用一份从内存产生到ODPS导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": {"byte": 1048576} + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "odpswriter", + "parameter": { + "project": "chinan_test", + "table": "odps_write_test00_partitioned", + "partition":"school=SiChuan-School,class=1", + "column": ["id","name"], + "accessId": "xxx", + "accessKey": "xxxx", + "truncate": true, + "odpsServer": "http://sxxx/api", + "tunnelServer": "http://xxx", + "accountType": "aliyun" + } + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + + +* **accessId** + * 描述:ODPS系统登录ID
+ * 必选:是
+ * 默认值:无
+ +* **accessKey** + * 描述:ODPS系统登录Key
+ * 必选:是
+ * 默认值:无
+ +* **project** + + * 描述:ODPS表所属的project,注意:Project只能是字母+数字组合,请填写英文名称。在云端等用户看到的ODPS项目中文名只是显示名,请务必填写底层真实的Project英文标识名。
+ * 必选:是
+ * 默认值:无
+ +* **table** + + * 描述:写入数据的表名,不能填写多张表,因为DataX不支持同时导入多张表。
+ * 必选:是
+ * 默认值:无
+ +* **partition** + + * 描述:需要写入数据表的分区信息,必须指定到最后一级分区。把数据写入一个三级分区表,必须配置到最后一级分区,例如pt=20150101/type=1/biz=2。 +
+ * 必选:**如果是分区表,该选项必填,如果非分区表,该选项不可填写。** + * 默认值:空
+ +* **column** + + * 描述:需要导入的字段列表,当导入全部字段时,可以配置为"column": ["*"], 当需要插入部分odps列填写部分列,例如"column": ["id", "name"]。ODPSWriter支持列筛选、列换序,例如表有a,b,c三个字段,用户只同步c,b两个字段。可以配置成["c","b"], 在导入过程中,字段a自动补空,设置为null。
+ * 必选:否
+ * 默认值:无
+ +* **truncate** + * 描述:ODPSWriter通过配置"truncate": true,保证写入的幂等性,即当出现写入失败再次运行时,ODPSWriter将清理前述数据,并导入新数据,这样可以保证每次重跑之后的数据都保持一致。
+ + **truncate选项不是原子操作!ODPS SQL无法做到原子性。因此当多个任务同时向一个Table/Partition清理分区时候,可能出现并发时序问题,请务必注意!**针对这类问题,我们建议尽量不要多个作业DDL同时操作同一份分区,或者在多个并发作业启动前,提前创建分区。 + + * 必选:是
+ * 默认值:无
+ +* **odpsServer** + + * 描述:ODPS的server地址,线上地址为 http://service.odps.aliyun.com/api
+ * 必选:是
+ * 默认值:无
+ +* **tunnelServer** + + * 描述:ODPS的tunnelserver地址,线上地址为 http://dt.odps.aliyun.com
+ * 必选:是
+ * 默认值:无
+ + +### 3.3 类型转换 + +类似ODPSReader,目前ODPSWriter支持大部分ODPS类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出ODPSWriter针对ODPS类型转换列表: + + +| DataX 内部类型| ODPS 数据类型 | +| -------- | ----- | +| Long |bigint | +| Double |double | +| String |string | +| Date |datetime | +| Boolean |bool | + + + + +## 4 插件特点 + +### 4.1 关于列筛选的问题 + +* ODPS本身不支持列筛选、重排序、补空等等,但是DataX ODPSWriter完成了上述需求,支持列筛选、重排序、补空。例如需要导入的字段列表,当导入全部字段时,可以配置为"column": ["*"],odps表有a,b,c三个字段,用户只同步c,b两个字段,在列配置中可以写成"column": ["c","b"],表示会把reader的第一列和第二列导入odps的c字段和b字段,而odps表中新插入记录的a字段会被置为null. + +### 4.2 列配置错误的处理 + +* 为了保证写入数据的可靠性,避免多余列数据丢失造成数据质量故障。对于写入多余的列,ODPSWriter将报错。例如ODPS表字段为a,b,c,但是ODPSWriter写入的字段为多于3列的话ODPSWriter将报错。 + +### 4.3 分区配置注意事项 + +* ODPSWriter只提供 **写入到最后一级分区** 功能,不支持写入按照某个字段进行分区路由等功能。假设表一共有3级分区,那么在分区配置中就必须指明写入到某个三级分区,例如把数据写入一个表的第三级分区,可以配置为 pt=20150101/type=1/biz=2,但是不能配置为pt=20150101/type=1或者pt=20150101。 + +### 4.4 任务重跑和failover +* ODPSWriter通过配置"truncate": true,保证写入的幂等性,即当出现写入失败再次运行时,ODPSWriter将清理前述数据,并导入新数据,这样可以保证每次重跑之后的数据都保持一致。如果在运行过程中因为其他的异常导致了任务中断,是不能保证数据的原子性的,数据不会回滚也不会自动重跑,需要用户利用幂等性这一特点重跑去确保保证数据的完整性。**truncate为true的情况下,会将指定分区\表的数据全部清理,请谨慎使用!** + + + +## 5 性能报告(线上环境实测) + +### 5.1 环境准备 + +#### 5.1.1 数据特征 + +建表语句: + + use cdo_datasync; + create table datax3_odpswriter_perf_10column_1kb_00( + s_0 string, + bool_1 boolean, + bi_2 bigint, + dt_3 datetime, + db_4 double, + s_5 string, + s_6 string, + s_7 string, + s_8 string, + s_9 string + )PARTITIONED by (pt string,year string); + +单行记录类似于: + + s_0 : 485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&* + bool_1 : true + bi_2 : 1696248667889 + dt_3 : 2013-07-0600: 00: 00 + db_4 : 3.141592653578 + s_5 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + s_6 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209 + s_7 : 100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209 + s_8 : 100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209 + s_9 : 12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 + +#### 5.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu : 24 Core Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz cache 15.36MB + 2. mem : 50GB + 3. net : 千兆双网卡 + 4. jvm : -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + 5. 
disc: DataX 数据不落磁盘,不统计此项 + +* 任务配置为: +``` +{ + "job": { + "setting": { + "speed": { + "channel": "1,2,4,5,6,8,16,32,64" + } + }, + "content": [ + { + "reader": { + "name": "odpsreader", + "parameter": { + "accessId": "******************************", + "accessKey": "*****************************", + "column": [ + "*" + ], + "partition": [ + "pt=20141010000000,year=2014" + ], + "odpsServer": "http://service.odps.aliyun.com/api", + "project": "cdo_datasync", + "table": "datax3_odpswriter_perf_10column_1kb_00", + "tunnelServer": "http://dt.odps.aliyun.com" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "column": [ + { + "value": "485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*" + }, + { + "value": "true", + "type": "bool" + }, + { + "value": "1696248667889", + "type": "long" + }, + { + "type": "date", + "value": "2013-07-06 00:00:00", + "dateFormat": "yyyy-mm-dd hh:mm:ss" + }, + { + "value": "3.141592653578", + "type": "double" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209" + }, + { + "value": "100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209" + }, + { + "value": "12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + } + ] + } + } + } + ] + } +} +``` + +### 5.2 测试报告 + + +| 并发任务数|blockSizeInMB| DataX速度(Rec/s)|DataX流量(MB/S)|网卡流量(MB/S)|DataX运行负载| +|--------| --------|--------|--------|--------|--------| +|1|32|30303|13.03|14.5|0.12| +|1|64|38461|16.54|16.5|0.44| +|1|128|46454|20.55|26.7|0.47| +|1|256|52631|22.64|26.7|0.47| +|1|512|58823|25.30|28.7|0.44| +|4|32|114816|49.38|55.3|0.75| +|4|64|147577|63.47|71.3|0.82| +|4|128|177744|76.45|83.2|0.97| +|4|256|173913|74.80|80.1|1.01| +|4|512|200000|86.02|95.1|1.41| +|8|32|204480|87.95|92.7|1.16| +|8|64|294224|126.55|135.3|1.65| +|8|128|365475|157.19|163.7|2.89| +|8|256|394713|169.83|176.7|2.72| +|8|512|241691|103.95|125.7|2.29| +|16|32|420838|181.01|198.0|2.56| +|16|64|458144|197.05|217.4|2.85| +|16|128|443219|190.63|210.5|3.29| +|16|256|315235|135.58|140.0|0.95| +|16|512|OOM||||| + +说明: + +1. OdpsWriter 影响速度的是channel 和 blockSizeInMB。blockSizeInMB 取`32` 和 `64`时,速度比较稳定,过分大的 blockSizeInMB 可能造成速度波动以及内存OOM。 +2. channel 和 blockSizeInMB 对速度的影响都很明显,建议综合考虑配合选择。 +3. channel 数目的选择,可以综合考虑源端数据特征进行选择,对于StreamReader,在16个channel时将网卡打满。 + + +## 6 FAQ +#### 1 导数据到 odps 的日志中有以下报错,该怎么处理呢?"ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn‘t exist in project example_dev“ + +解决办法 :找ODPS Prject 的 owner给用户的云账号授权,授权语句: +grant Describe,Select,Alter,Update on table [tableName] to user XXX + +#### 2 可以导入数据到odps的视图吗? 
+目前不支持通过视图到数据到odps,视图是ODPS非实体化数据存储对象,技术上无法向视图导入数据。 diff --git a/odpswriter/pom.xml b/odpswriter/pom.xml new file mode 100755 index 0000000000..38672cff8f --- /dev/null +++ b/odpswriter/pom.xml @@ -0,0 +1,107 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + odpswriter + odpswriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + org.bouncycastle + bcprov-jdk15on + 1.52 + system + ${basedir}/src/main/libs/bcprov-jdk15on-1.52.jar + + + com.aliyun.odps + odps-sdk-core + 0.19.3-public + + + + + commons-httpclient + commons-httpclient + 3.1 + + + + + org.mockito + mockito-core + 1.8.5 + test + + + org.powermock + powermock-api-mockito + 1.4.10 + test + + + org.powermock + powermock-module-junit4 + 1.4.10 + test + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/odpswriter/src/main/assembly/package.xml b/odpswriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..7d3c91b51b --- /dev/null +++ b/odpswriter/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/odpswriter + + + target/ + + odpswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/odpswriter + + + src/main/libs + + *.* + + plugin/writer/odpswriter/libs + + + + + + false + plugin/writer/odpswriter/libs + runtime + + + diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Constant.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Constant.java new file mode 100755 index 0000000000..22bcc16cb3 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Constant.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + + +public class Constant { + public static final String SKYNET_ACCESSID = "SKYNET_ACCESSID"; + + public static final String SKYNET_ACCESSKEY = "SKYNET_ACCESSKEY"; + + public static final String DEFAULT_ACCOUNT_TYPE = "aliyun"; + + public static final String TAOBAO_ACCOUNT_TYPE = "taobao"; + + public static final String COLUMN_POSITION = "columnPosition"; + +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Key.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Key.java new file mode 100755 index 0000000000..f578d72d9a --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/Key.java @@ -0,0 +1,34 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + + +public final class Key { + + public final static String ODPS_SERVER = "odpsServer"; + + public final static String TUNNEL_SERVER = "tunnelServer"; + + public final static String ACCESS_ID = "accessId"; + + public final static String ACCESS_KEY = "accessKey"; + + public final static String PROJECT = "project"; + + public final static String TABLE = "table"; + + public final static String PARTITION = "partition"; + + public final static String COLUMN = "column"; + + public final static String TRUNCATE = "truncate"; + + public final static String MAX_RETRY_TIME = "maxRetryTime"; + + public final static String BLOCK_SIZE_IN_MB = "blockSizeInMB"; + + //boolean 类型,default:false + public final static String EMPTY_AS_NULL = "emptyAsNull"; + + public final static String 
ACCOUNT_TYPE = "accountType"; + + public final static String IS_COMPRESS = "isCompress"; +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriter.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriter.java new file mode 100755 index 0000000000..60deb5dd30 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriter.java @@ -0,0 +1,356 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + +import com.alibaba.datax.common.exception.CommonErrorCode; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.writer.odpswriter.util.IdAndKeyUtil; +import com.alibaba.datax.plugin.writer.odpswriter.util.OdpsUtil; + +import com.aliyun.odps.Odps; +import com.aliyun.odps.Table; +import com.aliyun.odps.TableSchema; +import com.aliyun.odps.tunnel.TableTunnel; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.atomic.AtomicLong; + +/** + * 已修改为:每个 task 各自创建自己的 upload,拥有自己的 uploadId,并在 task 中完成对对应 block 的提交。 + */ +public class OdpsWriter extends Writer { + + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + + private Configuration originalConfig; + private Odps odps; + private Table table; + + private String projectName; + private String tableName; + private String tunnelServer; + private String partition; + private String accountType; + private boolean truncate; + private String uploadId; + private TableTunnel.UploadSession masterUpload; + private int blockSizeInMB; + + public void preCheck() { + this.init(); + this.doPreCheck(); + } + + public void doPreCheck() { + //检查accessId,accessKey配置 + if (Constant.DEFAULT_ACCOUNT_TYPE + .equalsIgnoreCase(this.accountType)) { + this.originalConfig = IdAndKeyUtil.parseAccessIdAndKey(this.originalConfig); + String accessId = this.originalConfig.getString(Key.ACCESS_ID); + String accessKey = this.originalConfig.getString(Key.ACCESS_KEY); + if (IS_DEBUG) { + LOG.debug("accessId:[{}], accessKey:[{}] .", accessId, + accessKey); + } + LOG.info("accessId:[{}] .", accessId); + } + // init odps config + this.odps = OdpsUtil.initOdpsProject(this.originalConfig); + + //检查表等配置是否正确 + this.table = OdpsUtil.getTable(odps,this.projectName,this.tableName); + + //检查列信息是否正确 + List allColumns = OdpsUtil.getAllColumns(this.table.getSchema()); + LOG.info("allColumnList: {} .", StringUtils.join(allColumns, ',')); + dealColumn(this.originalConfig, allColumns); + + //检查分区信息是否正确 + OdpsUtil.preCheckPartition(this.odps, this.table, this.partition, this.truncate); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + OdpsUtil.checkNecessaryConfig(this.originalConfig); + OdpsUtil.dealMaxRetryTime(this.originalConfig); + + this.projectName = this.originalConfig.getString(Key.PROJECT); + this.tableName = this.originalConfig.getString(Key.TABLE); + this.tunnelServer = this.originalConfig.getString(Key.TUNNEL_SERVER, null); + + 
//check isCompress + this.originalConfig.getBool(Key.IS_COMPRESS, false); + + this.partition = OdpsUtil.formatPartition(this.originalConfig + .getString(Key.PARTITION, "")); + this.originalConfig.set(Key.PARTITION, this.partition); + + this.accountType = this.originalConfig.getString(Key.ACCOUNT_TYPE, + Constant.DEFAULT_ACCOUNT_TYPE); + if (!Constant.DEFAULT_ACCOUNT_TYPE.equalsIgnoreCase(this.accountType) && + !Constant.TAOBAO_ACCOUNT_TYPE.equalsIgnoreCase(this.accountType)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ACCOUNT_TYPE_ERROR, + String.format("账号类型错误,因为你的账号 [%s] 不是datax目前支持的账号类型,目前仅支持aliyun, taobao账号,请修改您的账号信息.", accountType)); + } + this.originalConfig.set(Key.ACCOUNT_TYPE, this.accountType); + + this.truncate = this.originalConfig.getBool(Key.TRUNCATE); + + boolean emptyAsNull = this.originalConfig.getBool(Key.EMPTY_AS_NULL, false); + this.originalConfig.set(Key.EMPTY_AS_NULL, emptyAsNull); + if (emptyAsNull) { + LOG.warn("这是一条需要注意的信息 由于您的作业配置了写入 ODPS 的目的表时emptyAsNull=true, 所以 DataX将会把长度为0的空字符串作为 java 的 null 写入 ODPS."); + } + + this.blockSizeInMB = this.originalConfig.getInt(Key.BLOCK_SIZE_IN_MB, 64); + if(this.blockSizeInMB < 8) { + this.blockSizeInMB = 8; + } + this.originalConfig.set(Key.BLOCK_SIZE_IN_MB, this.blockSizeInMB); + LOG.info("blockSizeInMB={}.", this.blockSizeInMB); + + if (IS_DEBUG) { + LOG.debug("After master init(), job config now is: [\n{}\n] .", + this.originalConfig.toJSON()); + } + } + + @Override + public void prepare() { + String accessId = null; + String accessKey = null; + if (Constant.DEFAULT_ACCOUNT_TYPE + .equalsIgnoreCase(this.accountType)) { + this.originalConfig = IdAndKeyUtil.parseAccessIdAndKey(this.originalConfig); + accessId = this.originalConfig.getString(Key.ACCESS_ID); + accessKey = this.originalConfig.getString(Key.ACCESS_KEY); + if (IS_DEBUG) { + LOG.debug("accessId:[{}], accessKey:[{}] .", accessId, + accessKey); + } + LOG.info("accessId:[{}] .", accessId); + } + + // init odps config + this.odps = OdpsUtil.initOdpsProject(this.originalConfig); + + //检查表等配置是否正确 + this.table = OdpsUtil.getTable(odps,this.projectName,this.tableName); + + OdpsUtil.dealTruncate(this.odps, this.table, this.partition, this.truncate); + } + + /** + * 此处主要是对 uploadId进行设置,以及对 blockId 的开始值进行设置。 + *

+ * 对 blockId 需要同时设置开始值以及下一个 blockId 的步长值(INTERVAL_STEP)。 + */ + @Override + public List split(int mandatoryNumber) { + List configurations = new ArrayList(); + + // 此处获取到 masterUpload 只是为了拿到 RecordSchema,以完成对 column 的处理 + TableTunnel tableTunnel = new TableTunnel(this.odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tableTunnel.setEndpoint(tunnelServer); + } + + this.masterUpload = OdpsUtil.createMasterTunnelUpload( + tableTunnel, this.projectName, this.tableName, this.partition); + this.uploadId = this.masterUpload.getId(); + LOG.info("Master uploadId:[{}].", this.uploadId); + + TableSchema schema = this.masterUpload.getSchema(); + List allColumns = OdpsUtil.getAllColumns(schema); + LOG.info("allColumnList: {} .", StringUtils.join(allColumns, ',')); + + dealColumn(this.originalConfig, allColumns); + + for (int i = 0; i < mandatoryNumber; i++) { + Configuration tempConfig = this.originalConfig.clone(); + + configurations.add(tempConfig); + } + + if (IS_DEBUG) { + LOG.debug("After master split, the job config now is:[\n{}\n].", this.originalConfig); + } + + this.masterUpload = null; + + return configurations; + } + + private void dealColumn(Configuration originalConfig, List allColumns) { + //之前已经检查了userConfiguredColumns 一定不为空 + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, String.class); + if (1 == userConfiguredColumns.size() && "*".equals(userConfiguredColumns.get(0))) { + userConfiguredColumns = allColumns; + originalConfig.set(Key.COLUMN, allColumns); + } else { + //检查列是否重复,大小写不敏感(所有写入,都是不允许写入段的列重复的) + ListUtil.makeSureNoValueDuplicate(userConfiguredColumns, false); + + //检查列是否存在,大小写不敏感 + ListUtil.makeSureBInA(allColumns, userConfiguredColumns, false); + } + + List columnPositions = OdpsUtil.parsePosition(allColumns, userConfiguredColumns); + originalConfig.set(Constant.COLUMN_POSITION, columnPositions); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + } + + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory + .getLogger(Task.class); + + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + + private Configuration sliceConfig; + private Odps odps; + + private String projectName; + private String tableName; + private String tunnelServer; + private String partition; + private boolean emptyAsNull; + private boolean isCompress; + + private TableTunnel.UploadSession managerUpload; + private TableTunnel.UploadSession workerUpload; + + private String uploadId = null; + private List blocks; + private int blockSizeInMB; + + private Integer failoverState = 0; //0 未failover 1准备failover 2已提交,不能failover + private byte[] lock = new byte[0]; + + @Override + public void init() { + this.sliceConfig = super.getPluginJobConf(); + + this.projectName = this.sliceConfig.getString(Key.PROJECT); + this.tableName = this.sliceConfig.getString(Key.TABLE); + this.tunnelServer = this.sliceConfig.getString(Key.TUNNEL_SERVER, null); + this.partition = OdpsUtil.formatPartition(this.sliceConfig + .getString(Key.PARTITION, "")); + this.sliceConfig.set(Key.PARTITION, this.partition); + + this.emptyAsNull = this.sliceConfig.getBool(Key.EMPTY_AS_NULL); + this.blockSizeInMB = this.sliceConfig.getInt(Key.BLOCK_SIZE_IN_MB); + this.isCompress = this.sliceConfig.getBool(Key.IS_COMPRESS, false); + if (this.blockSizeInMB < 1 || this.blockSizeInMB > 512) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format("您配置的blockSizeInMB:%s 参数错误. 正确的配置是[1-512]之间的整数. 
请修改此参数的值为该区间内的数值", this.blockSizeInMB)); + } + + if (IS_DEBUG) { + LOG.debug("After init in task, sliceConfig now is:[\n{}\n].", this.sliceConfig); + } + + } + + @Override + public void prepare() { + this.odps = OdpsUtil.initOdpsProject(this.sliceConfig); + + TableTunnel tableTunnel = new TableTunnel(this.odps); + if (StringUtils.isNoneBlank(tunnelServer)) { + tableTunnel.setEndpoint(tunnelServer); + } + + this.managerUpload = OdpsUtil.createMasterTunnelUpload(tableTunnel, this.projectName, + this.tableName, this.partition); + this.uploadId = this.managerUpload.getId(); + LOG.info("task uploadId:[{}].", this.uploadId); + + this.workerUpload = OdpsUtil.getSlaveTunnelUpload(tableTunnel, this.projectName, + this.tableName, this.partition, uploadId); + } + + @Override + public void startWrite(RecordReceiver recordReceiver) { + blocks = new ArrayList(); + + AtomicLong blockId = new AtomicLong(0); + + List columnPositions = this.sliceConfig.getList(Constant.COLUMN_POSITION, + Integer.class); + + try { + TaskPluginCollector taskPluginCollector = super.getTaskPluginCollector(); + + OdpsWriterProxy proxy = new OdpsWriterProxy(this.workerUpload, this.blockSizeInMB, blockId, + columnPositions, taskPluginCollector, this.emptyAsNull, this.isCompress); + + com.alibaba.datax.common.element.Record dataXRecord = null; + + PerfRecord blockClose = new PerfRecord(super.getTaskGroupId(),super.getTaskId(), PerfRecord.PHASE.ODPS_BLOCK_CLOSE); + blockClose.start(); + long blockCloseUsedTime = 0; + while ((dataXRecord = recordReceiver.getFromReader()) != null) { + blockCloseUsedTime += proxy.writeOneRecord(dataXRecord, blocks); + } + + blockCloseUsedTime += proxy.writeRemainingRecord(blocks); + blockClose.end(blockCloseUsedTime); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.WRITER_RECORD_FAIL, "写入 ODPS 目的表失败. 
请联系 ODPS 管理员处理.", e); + } + } + + @Override + public void post() { + synchronized (lock){ + if(failoverState==0){ + failoverState = 2; + LOG.info("Slave which uploadId=[{}] begin to commit blocks:[\n{}\n].", this.uploadId, + StringUtils.join(blocks, ",")); + OdpsUtil.masterCompleteBlocks(this.managerUpload, blocks.toArray(new Long[0])); + LOG.info("Slave which uploadId=[{}] commit blocks ok.", this.uploadId); + }else{ + throw DataXException.asDataXException(CommonErrorCode.SHUT_DOWN_TASK, ""); + } + } + } + + @Override + public void destroy() { + } + + @Override + public boolean supportFailOver(){ + synchronized (lock){ + if(failoverState==0){ + failoverState = 1; + return true; + } + return false; + } + } + } +} \ No newline at end of file diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterErrorCode.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterErrorCode.java new file mode 100755 index 0000000000..02020c046e --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterErrorCode.java @@ -0,0 +1,66 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum OdpsWriterErrorCode implements ErrorCode { + REQUIRED_VALUE("OdpsWriter-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("OdpsWriter-01", "您配置的值不合法."), + UNSUPPORTED_COLUMN_TYPE("OdpsWriter-02", "DataX 不支持写入 ODPS 的目的表的此种数据类型."), + + TABLE_TRUNCATE_ERROR("OdpsWriter-03", "清空 ODPS 目的表时出错."), + CREATE_MASTER_UPLOAD_FAIL("OdpsWriter-04", "创建 ODPS 的 uploadSession 失败."), + GET_SLAVE_UPLOAD_FAIL("OdpsWriter-05", "获取 ODPS 的 uploadSession 失败."), + GET_ID_KEY_FAIL("OdpsWriter-06", "获取 accessId/accessKey 失败."), + GET_PARTITION_FAIL("OdpsWriter-07", "获取 ODPS 目的表的所有分区失败."), + + ADD_PARTITION_FAILED("OdpsWriter-08", "添加分区到 ODPS 目的表失败."), + WRITER_RECORD_FAIL("OdpsWriter-09", "写入数据到 ODPS 目的表失败."), + + COMMIT_BLOCK_FAIL("OdpsWriter-10", "提交 block 到 ODPS 目的表失败."), + RUN_SQL_FAILED("OdpsWriter-11", "执行 ODPS Sql 失败."), + CHECK_IF_PARTITIONED_TABLE_FAILED("OdpsWriter-12", "检查 ODPS 目的表:%s 是否为分区表失败."), + + RUN_SQL_ODPS_EXCEPTION("OdpsWriter-13", "执行 ODPS Sql 时抛出异常, 可重试"), + + ACCOUNT_TYPE_ERROR("OdpsWriter-30", "账号类型错误."), + + PARTITION_ERROR("OdpsWriter-31", "分区配置错误."), + + COLUMN_NOT_EXIST("OdpsWriter-32", "用户配置的列不存在."), + + ODPS_PROJECT_NOT_FOUNT("OdpsWriter-100", "您配置的值不合法, odps project 不存在."), //ODPS-0420111: Project not found + + ODPS_TABLE_NOT_FOUNT("OdpsWriter-101", "您配置的值不合法, odps table 不存在"), // ODPS-0130131:Table not found + + ODPS_ACCESS_KEY_ID_NOT_FOUND("OdpsWriter-102", "您配置的值不合法, odps accessId,accessKey 不存在"), //ODPS-0410051:Invalid credentials - accessKeyId not found + + ODPS_ACCESS_KEY_INVALID("OdpsWriter-103", "您配置的值不合法, odps accessKey 错误"), //ODPS-0410042:Invalid signature value - User signature dose not match; + + ODPS_ACCESS_DENY("OdpsWriter-104", "拒绝访问, 您不在 您配置的 project 中") //ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project + + ; + + private final String code; + private final String description; + + private OdpsWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. 
", this.code, + this.description); + } +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterProxy.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterProxy.java new file mode 100755 index 0000000000..9833616c5d --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/OdpsWriterProxy.java @@ -0,0 +1,190 @@ +package com.alibaba.datax.plugin.writer.odpswriter; + +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.writer.odpswriter.util.OdpsUtil; + +import com.alibaba.fastjson.JSON; +import com.aliyun.odps.OdpsType; +import com.aliyun.odps.TableSchema; + +import com.aliyun.odps.data.Record; + +import com.aliyun.odps.tunnel.TableTunnel; + +import com.aliyun.odps.tunnel.TunnelException; +import com.aliyun.odps.tunnel.io.ProtobufRecordPack; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.List; +import java.util.concurrent.atomic.AtomicLong; + +public class OdpsWriterProxy { + private static final Logger LOG = LoggerFactory + .getLogger(OdpsWriterProxy.class); + + private volatile boolean printColumnLess;// 是否打印对于源头字段数小于 ODPS 目的表的行的日志 + + private TaskPluginCollector taskPluginCollector; + + private TableTunnel.UploadSession slaveUpload; + private TableSchema schema; + private int maxBufferSize; + private ProtobufRecordPack protobufRecordPack; + private int protobufCapacity; + private AtomicLong blockId; + + private List columnPositions; + private List tableOriginalColumnTypeList; + private boolean emptyAsNull; + private boolean isCompress; + + public OdpsWriterProxy(TableTunnel.UploadSession slaveUpload, int blockSizeInMB, + AtomicLong blockId, List columnPositions, + TaskPluginCollector taskPluginCollector, boolean emptyAsNull, boolean isCompress) + throws IOException, TunnelException { + this.slaveUpload = slaveUpload; + this.schema = this.slaveUpload.getSchema(); + this.tableOriginalColumnTypeList = OdpsUtil + .getTableOriginalColumnTypeList(this.schema); + + this.blockId = blockId; + this.columnPositions = columnPositions; + this.taskPluginCollector = taskPluginCollector; + this.emptyAsNull = emptyAsNull; + this.isCompress = isCompress; + + // 初始化与 buffer 区相关的值 + this.maxBufferSize = (blockSizeInMB - 4) * 1024 * 1024; + this.protobufCapacity = blockSizeInMB * 1024 * 1024; + this.protobufRecordPack = new ProtobufRecordPack(this.schema, null, this.protobufCapacity); + printColumnLess = true; + + } + + public long writeOneRecord( + com.alibaba.datax.common.element.Record dataXRecord, + List blocks) throws Exception { + + Record record = dataxRecordToOdpsRecord(dataXRecord); + + if (null == record) { + return 0; + } + protobufRecordPack.append(record); + + if (protobufRecordPack.getTotalBytes() >= maxBufferSize) { + long startTimeInNs = System.nanoTime(); + OdpsUtil.slaveWriteOneBlock(this.slaveUpload, + protobufRecordPack, blockId.get(), this.isCompress); + LOG.info("write block {} ok.", blockId.get()); + blocks.add(blockId.get()); + protobufRecordPack.reset(); + this.blockId.incrementAndGet(); + return System.nanoTime() - startTimeInNs; + } + return 0; + } + + public long writeRemainingRecord(List blocks) throws Exception { + // complete protobuf stream, then write to http + if (protobufRecordPack.getTotalBytes() != 0) { + long startTimeInNs = System.nanoTime(); + 
OdpsUtil.slaveWriteOneBlock(this.slaveUpload, + protobufRecordPack, blockId.get(), this.isCompress); + LOG.info("write block {} ok.", blockId.get()); + + blocks.add(blockId.get()); + // reset the buffer for next block + protobufRecordPack.reset(); + return System.nanoTime() - startTimeInNs; + } + return 0; + } + + public Record dataxRecordToOdpsRecord( + com.alibaba.datax.common.element.Record dataXRecord) throws Exception { + int sourceColumnCount = dataXRecord.getColumnNumber(); + Record odpsRecord = slaveUpload.newRecord(); + + int userConfiguredColumnNumber = this.columnPositions.size(); +//todo + if (sourceColumnCount > userConfiguredColumnNumber) { + throw DataXException + .asDataXException( + OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format( + "亲,配置中的源表的列个数和目的端表不一致,源表中您配置的列数是:%s 大于目的端的列数是:%s , 这样会导致源头数据无法正确导入目的端, 请检查您的配置并修改.", + sourceColumnCount, + userConfiguredColumnNumber)); + } else if (sourceColumnCount < userConfiguredColumnNumber) { + if (printColumnLess) { + LOG.warn( + "源表的列个数小于目的表的列个数,源表列数是:{} 目的表列数是:{} , 数目不匹配. DataX 会把目的端多出的列的值设置为空值. 如果这个默认配置不符合您的期望,请保持源表和目的表配置的列数目保持一致.", + sourceColumnCount, userConfiguredColumnNumber); + } + printColumnLess = false; + } + + int currentIndex; + int sourceIndex = 0; + try { + com.alibaba.datax.common.element.Column columnValue; + + for (; sourceIndex < sourceColumnCount; sourceIndex++) { + currentIndex = columnPositions.get(sourceIndex); + OdpsType type = this.tableOriginalColumnTypeList + .get(currentIndex); + columnValue = dataXRecord.getColumn(sourceIndex); + + if (columnValue == null) { + continue; + } + // for compatible dt lib, "" as null + if(this.emptyAsNull && columnValue instanceof StringColumn && "".equals(columnValue.asString())){ + continue; + } + + switch (type) { + case STRING: + odpsRecord.setString(currentIndex, columnValue.asString()); + break; + case BIGINT: + odpsRecord.setBigint(currentIndex, columnValue.asLong()); + break; + case BOOLEAN: + odpsRecord.setBoolean(currentIndex, columnValue.asBoolean()); + break; + case DATETIME: + odpsRecord.setDatetime(currentIndex, columnValue.asDate()); + break; + case DOUBLE: + odpsRecord.setDouble(currentIndex, columnValue.asDouble()); + break; + case DECIMAL: + odpsRecord.setDecimal(currentIndex, columnValue.asBigDecimal()); + String columnStr = columnValue.asString(); + if(columnStr != null && columnStr.indexOf(".") >= 36) { + throw new Exception("Odps decimal 类型的整数位个数不能超过35"); + } + default: + break; + } + } + + return odpsRecord; + } catch (Exception e) { + String message = String.format( + "写入 ODPS 目的表时遇到了脏数据: 第[%s]个字段的数据出现错误,请检查该数据并作出修改 或者您可以增大阀值,忽略这条记录.", sourceIndex); + this.taskPluginCollector.collectDirtyRecord(dataXRecord, e, + message); + + return null; + } + + } +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/DESCipher.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/DESCipher.java new file mode 100755 index 0000000000..bf7f5a8832 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/DESCipher.java @@ -0,0 +1,355 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.plugin.writer.odpswriter.util; + +import javax.crypto.Cipher; +import javax.crypto.SecretKey; +import javax.crypto.SecretKeyFactory; +import javax.crypto.spec.DESKeySpec; +import java.security.SecureRandom; + +/** + *   * DES加解密,支持与delphi交互(字符串编码需统一为UTF-8) + * + *   * + * + *   * @author wym + * + *    + */ + +public class DESCipher { + + /** + *   * 密钥 + * + *    + */ + + public static final String KEY = "u4Gqu4Z8"; + + private final static String DES = "DES"; + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成加密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.ENCRYPT_MODE, securekey, sr); + + // 现在,获取数据并加密 + + // 正式执行加密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @param key + * + *   * 密钥,长度必须是8的倍数 + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src, byte[] key) throws Exception { + + // DES算法要求有一个可信任的随机数源 + + SecureRandom sr = new SecureRandom(); + + // 从原始密匙数据创建一个DESKeySpec对象 + + DESKeySpec dks = new DESKeySpec(key); + + // 创建一个密匙工厂,然后用它把DESKeySpec对象转换成 + + // 一个SecretKey对象 + + SecretKeyFactory keyFactory = SecretKeyFactory.getInstance(DES); + + SecretKey securekey = keyFactory.generateSecret(dks); + + // Cipher对象实际完成解密操作 + + Cipher cipher = Cipher.getInstance(DES); + + // 用密匙初始化Cipher对象 + + cipher.init(Cipher.DECRYPT_MODE, securekey, sr); + + // 现在,获取数据并解密 + + // 正式执行解密操作 + + return cipher.doFinal(src); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] encrypt(byte[] src) throws Exception { + + return encrypt(src, KEY.getBytes()); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字节) + * + *   * @throws Exception + * + *    + */ + + public static byte[] decrypt(byte[] src) throws Exception { + + return decrypt(src, KEY.getBytes()); + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字符串) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public final static String encrypt(String src) { + + try { + + return byte2hex(encrypt(src.getBytes(), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字符串) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * 
+ *    + */ + + public final static String decrypt(String src) { + try { + + return new String(decrypt(hex2byte(src.getBytes()), KEY.getBytes())); + + } catch (Exception e) { + + e.printStackTrace(); + + } + + return null; + + } + + /** + *   * 加密 + * + *   * + * + *   * @param src + * + *   * 明文(字节) + * + *   * @return 密文(16进制字符串) + * + *   * @throws Exception + * + *    + */ + + public static String encryptToString(byte[] src) throws Exception { + + return encrypt(new String(src)); + + } + + /** + *   * 解密 + * + *   * + * + *   * @param src + * + *   * 密文(字节) + * + *   * @return 明文(字符串) + * + *   * @throws Exception + * + *    + */ + + public static String decryptToString(byte[] src) throws Exception { + + return decrypt(new String(src)); + + } + + public static String byte2hex(byte[] b) { + + String hs = ""; + + String stmp = ""; + + for (int n = 0; n < b.length; n++) { + + stmp = (Integer.toHexString(b[n] & 0XFF)); + + if (stmp.length() == 1) + + hs = hs + "0" + stmp; + + else + + hs = hs + stmp; + + } + + return hs.toUpperCase(); + + } + + public static byte[] hex2byte(byte[] b) { + + if ((b.length % 2) != 0) + + throw new IllegalArgumentException("长度不是偶数"); + + byte[] b2 = new byte[b.length / 2]; + + for (int n = 0; n < b.length; n += 2) { + + String item = new String(b, n, 2); + + b2[n / 2] = (byte) Integer.parseInt(item, 16); + + } + return b2; + + } + + /* + * public static void main(String[] args) { try { String src = "cheetah"; + * String crypto = DESCipher.encrypt(src); System.out.println("密文[" + src + + * "]:" + crypto); System.out.println("解密后:" + DESCipher.decrypt(crypto)); } + * catch (Exception e) { e.printStackTrace(); } } + */ +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/IdAndKeyUtil.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/IdAndKeyUtil.java new file mode 100755 index 0000000000..95e4b56b54 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/IdAndKeyUtil.java @@ -0,0 +1,85 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.plugin.writer.odpswriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.odpswriter.Constant; +import com.alibaba.datax.plugin.writer.odpswriter.Key; +import com.alibaba.datax.plugin.writer.odpswriter.OdpsWriterErrorCode; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; + +public class IdAndKeyUtil { + private static Logger LOG = LoggerFactory.getLogger(IdAndKeyUtil.class); + + public static Configuration parseAccessIdAndKey(Configuration originalConfig) { + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + + // 只要 accessId,accessKey 二者配置了一个,就理解为是用户本意是要直接手动配置其 accessid/accessKey + if (StringUtils.isNotBlank(accessId) || StringUtils.isNotBlank(accessKey)) { + LOG.info("Try to get accessId/accessKey from your config."); + //通过如下语句,进行检查是否确实配置了 + accessId = originalConfig.getNecessaryValue(Key.ACCESS_ID, OdpsWriterErrorCode.REQUIRED_VALUE); + accessKey = originalConfig.getNecessaryValue(Key.ACCESS_KEY, OdpsWriterErrorCode.REQUIRED_VALUE); + //检查完毕,返回即可 + return originalConfig; + } else { + Map envProp = System.getenv(); + return getAccessIdAndKeyFromEnv(originalConfig, envProp); + } + } + + private static Configuration getAccessIdAndKeyFromEnv(Configuration originalConfig, + Map envProp) { + String accessId = null; + String accessKey = null; + + String skynetAccessID = envProp.get(Constant.SKYNET_ACCESSID); + String skynetAccessKey = envProp.get(Constant.SKYNET_ACCESSKEY); + + if (StringUtils.isNotBlank(skynetAccessID) + || StringUtils.isNotBlank(skynetAccessKey)) { + /** + * 环境变量中,如果存在SKYNET_ACCESSID/SKYNET_ACCESSKEy(只要有其中一个变量,则认为一定是两个都存在的!), + * 则使用其值作为odps的accessId/accessKey(会解密) + */ + + LOG.info("Try to get accessId/accessKey from environment."); + accessId = skynetAccessID; + accessKey = DESCipher.decrypt(skynetAccessKey); + if (StringUtils.isNotBlank(accessKey)) { + originalConfig.set(Key.ACCESS_ID, accessId); + originalConfig.set(Key.ACCESS_KEY, accessKey); + LOG.info("Get accessId/accessKey from environment variables successfully."); + } else { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_ID_KEY_FAIL, + String.format("从环境变量中获取accessId/accessKey 失败, accessId=[%s]", accessId)); + } + } else { + // 无处获取(既没有配置在作业中,也没用在环境变量中) + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_ID_KEY_FAIL, + "无法获取到accessId/accessKey. 它们既不存在于您的配置中,也不存在于环境变量中."); + } + + return originalConfig; + } +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsExceptionMsg.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsExceptionMsg.java new file mode 100644 index 0000000000..d613eefda9 --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsExceptionMsg.java @@ -0,0 +1,18 @@ +package com.alibaba.datax.plugin.writer.odpswriter.util; + +/** + * Created by hongjiao.hj on 2015/6/9. 
+ */ +public class OdpsExceptionMsg { + + public static final String ODPS_PROJECT_NOT_FOUNT = "ODPS-0420111: Project not found"; + + public static final String ODPS_TABLE_NOT_FOUNT = "ODPS-0130131:Table not found"; + + public static final String ODPS_ACCESS_KEY_ID_NOT_FOUND = "ODPS-0410051:Invalid credentials - accessKeyId not found"; + + public static final String ODPS_ACCESS_KEY_INVALID = "ODPS-0410042:Invalid signature value - User signature dose not match"; + + public static final String ODPS_ACCESS_DENY = "ODPS-0420095: Access Denied - Authorization Failed [4002], You doesn't exist in project"; + +} diff --git a/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsUtil.java b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsUtil.java new file mode 100755 index 0000000000..2a401b696c --- /dev/null +++ b/odpswriter/src/main/java/com/alibaba/datax/plugin/writer/odpswriter/util/OdpsUtil.java @@ -0,0 +1,586 @@ +package com.alibaba.datax.plugin.writer.odpswriter.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.writer.odpswriter.Constant; +import com.alibaba.datax.plugin.writer.odpswriter.Key; + +import com.alibaba.datax.plugin.writer.odpswriter.OdpsWriterErrorCode; +import com.aliyun.odps.*; +import com.aliyun.odps.account.Account; +import com.aliyun.odps.account.AliyunAccount; +import com.aliyun.odps.task.SQLTask; +import com.aliyun.odps.tunnel.TableTunnel; + +import com.aliyun.odps.tunnel.io.ProtobufRecordPack; +import com.aliyun.odps.tunnel.io.TunnelRecordWriter; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; +import java.util.concurrent.Callable; + +public class OdpsUtil { + private static final Logger LOG = LoggerFactory.getLogger(OdpsUtil.class); + + public static int MAX_RETRY_TIME = 10; + + public static void checkNecessaryConfig(Configuration originalConfig) { + originalConfig.getNecessaryValue(Key.ODPS_SERVER, + OdpsWriterErrorCode.REQUIRED_VALUE); + + originalConfig.getNecessaryValue(Key.PROJECT, + OdpsWriterErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.TABLE, + OdpsWriterErrorCode.REQUIRED_VALUE); + + if (null == originalConfig.getList(Key.COLUMN) || + originalConfig.getList(Key.COLUMN, String.class).isEmpty()) { + throw DataXException.asDataXException(OdpsWriterErrorCode.REQUIRED_VALUE, "您未配置写入 ODPS 目的表的列信息. " + + "正确的配置方式是给datax的 column 项配置上您需要读取的列名称,用英文逗号分隔 例如: \"column\": [\"id\",\"name\"]."); + } + + // getBool 内部要求,值只能为 true,false 的字符串(大小写不敏感),其他一律报错,不再有默认配置 + Boolean truncate = originalConfig.getBool(Key.TRUNCATE); + if (null == truncate) { + throw DataXException.asDataXException(OdpsWriterErrorCode.REQUIRED_VALUE, "[truncate]是必填配置项, 意思是写入 ODPS 目的表前是否清空表/分区. " + + "请您增加 truncate 的配置,根据业务需要选择上true 或者 false."); + } + } + + public static void dealMaxRetryTime(Configuration originalConfig) { + int maxRetryTime = originalConfig.getInt(Key.MAX_RETRY_TIME, + OdpsUtil.MAX_RETRY_TIME); + if (maxRetryTime < 1 || maxRetryTime > OdpsUtil.MAX_RETRY_TIME) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, "您所配置的maxRetryTime 值错误. 该值不能小于1, 且不能大于 " + OdpsUtil.MAX_RETRY_TIME + + ". 推荐的配置方式是给maxRetryTime 配置1-11之间的某个值. 
请您检查配置并做出相应修改."); + } + MAX_RETRY_TIME = maxRetryTime; + } + + public static String formatPartition(String partitionString) { + if (null == partitionString) { + return null; + } + + return partitionString.trim().replaceAll(" *= *", "=").replaceAll(" */ *", ",") + .replaceAll(" *, *", ",").replaceAll("'", ""); + } + + + public static Odps initOdpsProject(Configuration originalConfig) { + String accountType = originalConfig.getString(Key.ACCOUNT_TYPE); + String accessId = originalConfig.getString(Key.ACCESS_ID); + String accessKey = originalConfig.getString(Key.ACCESS_KEY); + + String odpsServer = originalConfig.getString(Key.ODPS_SERVER); + String project = originalConfig.getString(Key.PROJECT); + + Account account; + if (accountType.equalsIgnoreCase(Constant.DEFAULT_ACCOUNT_TYPE)) { + account = new AliyunAccount(accessId, accessKey); + } else { + throw DataXException.asDataXException(OdpsWriterErrorCode.ACCOUNT_TYPE_ERROR, + String.format("不支持的账号类型:[%s]. 账号类型目前仅支持aliyun, taobao.", accountType)); + } + + Odps odps = new Odps(account); + boolean isPreCheck = originalConfig.getBool("dryRun", false); + if(isPreCheck) { + odps.getRestClient().setConnectTimeout(3); + odps.getRestClient().setReadTimeout(3); + odps.getRestClient().setRetryTimes(2); + } + odps.setDefaultProject(project); + odps.setEndpoint(odpsServer); + + return odps; + } + + public static Table getTable(Odps odps, String projectName, String tableName) { + final Table table = odps.tables().get(projectName, tableName); + try { + //通过这种方式检查表是否存在,失败重试。重试策略:每秒钟重试一次,最大重试3次 + return RetryUtil.executeWithRetry(new Callable

() { + @Override + public Table call() throws Exception { + table.reload(); + return table; + } + }, 3, 1000, false); + } catch (Exception e) { + throwDataXExceptionWhenReloadTable(e, tableName); + } + return table; + } + + public static List listOdpsPartitions(Table table) { + List parts = new ArrayList(); + try { + List partitions = table.getPartitions(); + for(Partition partition : partitions) { + parts.add(partition.getPartitionSpec().toString()); + } + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_PARTITION_FAIL, String.format("获取 ODPS 目的表:%s 的所有分区失败. 请联系 ODPS 管理员处理.", + table.getName()), e); + } + return parts; + } + + public static boolean isPartitionedTable(Table table) { + //必须要是非分区表才能 truncate 整个表 + List partitionKeys; + try { + partitionKeys = table.getSchema().getPartitionColumns(); + if (null != partitionKeys && !partitionKeys.isEmpty()) { + return true; + } + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.CHECK_IF_PARTITIONED_TABLE_FAILED, + String.format("检查 ODPS 目的表:%s 是否为分区表失败, 请联系 ODPS 管理员处理.", table.getName()), e); + } + return false; + } + + + public static void truncateNonPartitionedTable(Odps odps, Table tab) { + String truncateNonPartitionedTableSql = "truncate table " + tab.getName() + ";"; + + try { + runSqlTaskWithRetry(odps, truncateNonPartitionedTableSql, MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.TABLE_TRUNCATE_ERROR, + String.format(" 清空 ODPS 目的表:%s 失败, 请联系 ODPS 管理员处理.", tab.getName()), e); + } + } + + public static void truncatePartition(Odps odps, Table table, String partition) { + if (isPartitionExist(table, partition)) { + dropPart(odps, table, partition); + } + addPart(odps, table, partition); + } + + private static boolean isPartitionExist(Table table, String partition) { + // check if exist partition 返回值不为 null + List odpsParts = OdpsUtil.listOdpsPartitions(table); + + int j = 0; + for (; j < odpsParts.size(); j++) { + if (odpsParts.get(j).replaceAll("'", "").equals(partition)) { + break; + } + } + + return j != odpsParts.size(); + } + + public static void addPart(Odps odps, Table table, String partition) { + String partSpec = getPartSpec(partition); + // add if not exists partition + StringBuilder addPart = new StringBuilder(); + addPart.append("alter table ").append(table.getName()).append(" add IF NOT EXISTS partition(") + .append(partSpec).append(");"); + try { + runSqlTaskWithRetry(odps, addPart.toString(), MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ADD_PARTITION_FAILED, + String.format("添加 ODPS 目的表的分区失败. 错误发生在添加 ODPS 的项目:%s 的表:%s 的分区:%s. 请联系 ODPS 管理员处理.", + table.getProject(), table.getName(), partition), e); + } + } + + + public static TableTunnel.UploadSession createMasterTunnelUpload(final TableTunnel tunnel, final String projectName, + final String tableName, final String partition) { + if(StringUtils.isBlank(partition)) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.createUploadSession(projectName, tableName); + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.CREATE_MASTER_UPLOAD_FAIL, + "创建TunnelUpload失败. 
请联系 ODPS 管理员处理.", e); + } + } else { + final PartitionSpec partitionSpec = new PartitionSpec(partition); + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.createUploadSession(projectName, tableName, partitionSpec); + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.CREATE_MASTER_UPLOAD_FAIL, + "创建TunnelUpload失败. 请联系 ODPS 管理员处理.", e); + } + } + } + + public static TableTunnel.UploadSession getSlaveTunnelUpload(final TableTunnel tunnel, final String projectName, final String tableName, + final String partition, final String uploadId) { + + if(StringUtils.isBlank(partition)) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.getUploadSession(projectName, tableName, uploadId); + } + }, MAX_RETRY_TIME, 1000L, true); + + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_SLAVE_UPLOAD_FAIL, + "获取TunnelUpload失败. 请联系 ODPS 管理员处理.", e); + } + } else { + final PartitionSpec partitionSpec = new PartitionSpec(partition); + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public TableTunnel.UploadSession call() throws Exception { + return tunnel.getUploadSession(projectName, tableName, partitionSpec, uploadId); + } + }, MAX_RETRY_TIME, 1000L, true); + + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.GET_SLAVE_UPLOAD_FAIL, + "获取TunnelUpload失败. 请联系 ODPS 管理员处理.", e); + } + } + } + + + private static void dropPart(Odps odps, Table table, String partition) { + String partSpec = getPartSpec(partition); + StringBuilder dropPart = new StringBuilder(); + dropPart.append("alter table ").append(table.getName()) + .append(" drop IF EXISTS partition(").append(partSpec) + .append(");"); + try { + runSqlTaskWithRetry(odps, dropPart.toString(), MAX_RETRY_TIME, 1000, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ADD_PARTITION_FAILED, + String.format("Drop ODPS 目的表分区失败. 错误发生在项目:%s 的表:%s 的分区:%s .请联系 ODPS 管理员处理.", + table.getProject(), table.getName(), partition), e); + } + } + + private static String getPartSpec(String partition) { + StringBuilder partSpec = new StringBuilder(); + String[] parts = partition.split(","); + for (int i = 0; i < parts.length; i++) { + String part = parts[i]; + String[] kv = part.split("="); + if (kv.length != 2) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format("ODPS 目的表自身的 partition:%s 格式不对. 
正确的格式形如: pt=1,ds=hangzhou", partition)); + } + partSpec.append(kv[0]).append("="); + partSpec.append("'").append(kv[1].replace("'", "")).append("'"); + if (i != parts.length - 1) { + partSpec.append(","); + } + } + return partSpec.toString(); + } + + /** + * 该方法只有在 sql 为幂等的才可以使用,且odps抛出异常时候才会进行重试 + * + * @param odps odps + * @param query 执行sql + * @throws Exception + */ + public static void runSqlTaskWithRetry(final Odps odps, final String query, int retryTimes, + long sleepTimeInMilliSecond, boolean exponential) throws Exception { + for(int i = 0; i < retryTimes; i++) { + try { + runSqlTask(odps, query); + return; + } catch (DataXException e) { + if (OdpsWriterErrorCode.RUN_SQL_ODPS_EXCEPTION.equals(e.getErrorCode())) { + LOG.debug("Exception when calling callable", e); + if (i + 1 < retryTimes && sleepTimeInMilliSecond > 0) { + LOG.warn(String.format("will do [%s] times retry, current exception=%s", i + 1, e.getMessage())); + long timeToSleep; + if (exponential) { + timeToSleep = sleepTimeInMilliSecond * (long) Math.pow(2, i); + if(timeToSleep >= 128 * 1000) { + timeToSleep = 128 * 1000; + } + } else { + timeToSleep = sleepTimeInMilliSecond; + if(timeToSleep >= 128 * 1000) { + timeToSleep = 128 * 1000; + } + } + + try { + Thread.sleep(timeToSleep); + } catch (InterruptedException ignored) { + } + } else { + throw e; + } + } else { + throw e; + } + } catch (Exception e) { + throw e; + } + } + } + + public static void runSqlTask(Odps odps, String query) { + if (StringUtils.isBlank(query)) { + return; + } + + String taskName = "datax_odpswriter_trunacte_" + UUID.randomUUID().toString().replace('-', '_'); + + LOG.info("Try to start sqlTask:[{}] to run odps sql:[\n{}\n] .", taskName, query); + + //todo:biz_id set (目前ddl先不做) + Instance instance; + Instance.TaskStatus status; + try { + instance = SQLTask.run(odps, odps.getDefaultProject(), query, taskName, null, null); + instance.waitForSuccess(); + status = instance.getTaskStatus().get(taskName); + if (!Instance.TaskStatus.Status.SUCCESS.equals(status.getStatus())) { + throw DataXException.asDataXException(OdpsWriterErrorCode.RUN_SQL_FAILED, + String.format("ODPS 目的表在运行 ODPS SQL失败, 返回值为:%s. 请联系 ODPS 管理员处理. SQL 内容为:[\n%s\n].", instance.getTaskResults().get(taskName), + query)); + } + } catch (DataXException e) { + throw e; + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.RUN_SQL_ODPS_EXCEPTION, + String.format("ODPS 目的表在运行 ODPS SQL 时抛出异常, 请联系 ODPS 管理员处理. SQL 内容为:[\n%s\n].", query), e); + } + } + + public static void masterCompleteBlocks(final TableTunnel.UploadSession masterUpload, final Long[] blocks) { + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Void call() throws Exception { + masterUpload.commit(blocks); + return null; + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.COMMIT_BLOCK_FAIL, + String.format("ODPS 目的表在提交 block:[\n%s\n] 时失败, uploadId=[%s]. 
请联系 ODPS 管理员处理.", StringUtils.join(blocks, ","), masterUpload.getId()), e); + } + } + + public static void slaveWriteOneBlock(final TableTunnel.UploadSession slaveUpload, final ProtobufRecordPack protobufRecordPack, + final long blockId, final boolean isCompress) { + try { + RetryUtil.executeWithRetry(new Callable() { + @Override + public Void call() throws Exception { + TunnelRecordWriter tunnelRecordWriter = (TunnelRecordWriter)slaveUpload.openRecordWriter(blockId, isCompress); + tunnelRecordWriter.write(protobufRecordPack); + tunnelRecordWriter.close(); + return null; + } + }, MAX_RETRY_TIME, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(OdpsWriterErrorCode.WRITER_RECORD_FAIL, + String.format("ODPS 目的表写 block:%s 失败, uploadId=[%s]. 请联系 ODPS 管理员处理.", blockId, slaveUpload.getId()), e); + } + + } + + public static List parsePosition(List allColumnList, + List userConfiguredColumns) { + List retList = new ArrayList(); + + boolean hasColumn; + for (String col : userConfiguredColumns) { + hasColumn = false; + for (int i = 0, len = allColumnList.size(); i < len; i++) { + if (allColumnList.get(i).equalsIgnoreCase(col)) { + retList.add(i); + hasColumn = true; + break; + } + } + if (!hasColumn) { + throw DataXException.asDataXException(OdpsWriterErrorCode.COLUMN_NOT_EXIST, + String.format("ODPS 目的表的列配置错误. 由于您所配置的列:%s 不存在,会导致datax无法正常插入数据,请检查该列是否存在,如果存在请检查大小写等配置.", col)); + } + } + return retList; + } + + public static List getAllColumns(TableSchema schema) { + if (null == schema) { + throw new IllegalArgumentException("parameter schema can not be null."); + } + + List allColumns = new ArrayList(); + + List columns = schema.getColumns(); + OdpsType type; + for(Column column: columns) { + allColumns.add(column.getName()); + type = column.getType(); + if (type == OdpsType.ARRAY || type == OdpsType.MAP) { + throw DataXException.asDataXException(OdpsWriterErrorCode.UNSUPPORTED_COLUMN_TYPE, + String.format("DataX 写入 ODPS 表不支持该字段类型:[%s]. 目前支持抽取的字段类型有:bigint, boolean, datetime, double, string. " + + "您可以选择不抽取 DataX 不支持的字段或者联系 ODPS 管理员寻求帮助.", + type)); + } + } + return allColumns; + } + + public static List getTableOriginalColumnTypeList(TableSchema schema) { + List tableOriginalColumnTypeList = new ArrayList(); + + List columns = schema.getColumns(); + for (Column column : columns) { + tableOriginalColumnTypeList.add(column.getType()); + } + + return tableOriginalColumnTypeList; + } + + public static void dealTruncate(Odps odps, Table table, String partition, boolean truncate) { + boolean isPartitionedTable = OdpsUtil.isPartitionedTable(table); + + if (truncate) { + //需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("您没有配置分区信息,因为你配置的表是分区表:%s 如果需要进行 truncate 操作,必须指定需要清空的具体分区. 请修改分区配置,格式形如 pt=${bizdate} .", + table.getName())); + } else { + LOG.info("Try to truncate partition=[{}] in table=[{}].", partition, table.getName()); + OdpsUtil.truncatePartition(odps, table, partition); + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("分区信息配置错误,你的ODPS表是非分区表:%s 进行 truncate 操作时不需要指定具体分区值. 
请检查您的分区配置,删除该配置项的值.", + table.getName())); + } else { + LOG.info("Try to truncate table:[{}].", table.getName()); + OdpsUtil.truncateNonPartitionedTable(odps, table); + } + } + } else { + //不需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是分区表,写入分区表:%s 时必须指定具体分区值. 请修改您的分区配置信息,格式形如 格式形如 pt=${bizdate}.", table.getName())); + } else { + boolean isPartitionExists = OdpsUtil.isPartitionExist(table, partition); + if (!isPartitionExists) { + LOG.info("Try to add partition:[{}] in table:[{}].", partition, + table.getName()); + OdpsUtil.addPart(odps, table, partition); + } + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是非分区表,写入非分区表:%s 时不需要指定具体分区值. 请删除分区配置信息", table.getName())); + } + } + } + } + + + /** + * 检查odpswriter 插件的分区信息 + * + * @param odps + * @param table + * @param partition + * @param truncate + */ + public static void preCheckPartition(Odps odps, Table table, String partition, boolean truncate) { + boolean isPartitionedTable = OdpsUtil.isPartitionedTable(table); + + if (truncate) { + //需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("您没有配置分区信息,因为你配置的表是分区表:%s 如果需要进行 truncate 操作,必须指定需要清空的具体分区. 请修改分区配置,格式形如 pt=${bizdate} .", + table.getName())); + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, String.format("分区信息配置错误,你的ODPS表是非分区表:%s 进行 truncate 操作时不需要指定具体分区值. 请检查您的分区配置,删除该配置项的值.", + table.getName())); + } + } + } else { + //不需要 truncate + if (isPartitionedTable) { + //分区表 + if (StringUtils.isBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是分区表,写入分区表:%s 时必须指定具体分区值. 请修改您的分区配置信息,格式形如 格式形如 pt=${bizdate}.", table.getName())); + } + } else { + //非分区表 + if (StringUtils.isNotBlank(partition)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.PARTITION_ERROR, + String.format("您的目的表是非分区表,写入非分区表:%s 时不需要指定具体分区值. 请删除分区配置信息", table.getName())); + } + } + } + } + + /** + * table.reload() 方法抛出的 odps 异常 转化为更清晰的 datax 异常 抛出 + */ + public static void throwDataXExceptionWhenReloadTable(Exception e, String tableName) { + if(e.getMessage() != null) { + if(e.getMessage().contains(OdpsExceptionMsg.ODPS_PROJECT_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_PROJECT_NOT_FOUNT, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [project] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_TABLE_NOT_FOUNT)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_TABLE_NOT_FOUNT, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [table] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_ID_NOT_FOUND)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_ACCESS_KEY_ID_NOT_FOUND, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [accessId] [accessKey]是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_KEY_INVALID)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_ACCESS_KEY_INVALID, + String.format("加载 ODPS 目的表:%s 失败. 
" + + "请检查您配置的 ODPS 目的表的 [accessKey] 是否正确.", tableName), e); + } else if(e.getMessage().contains(OdpsExceptionMsg.ODPS_ACCESS_DENY)) { + throw DataXException.asDataXException(OdpsWriterErrorCode.ODPS_ACCESS_DENY, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 [accessId] [accessKey] [project]是否匹配.", tableName), e); + } + } + throw DataXException.asDataXException(OdpsWriterErrorCode.ILLEGAL_VALUE, + String.format("加载 ODPS 目的表:%s 失败. " + + "请检查您配置的 ODPS 目的表的 project,table,accessId,accessKey,odpsServer等值.", tableName), e); + } + +} diff --git a/odpswriter/src/main/libs/bcprov-jdk15on-1.52.jar b/odpswriter/src/main/libs/bcprov-jdk15on-1.52.jar new file mode 100644 index 0000000000..6c54dd901c Binary files /dev/null and b/odpswriter/src/main/libs/bcprov-jdk15on-1.52.jar differ diff --git a/odpswriter/src/main/resources/plugin.json b/odpswriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..d867129e88 --- /dev/null +++ b/odpswriter/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "odpswriter", + "class": "com.alibaba.datax.plugin.writer.odpswriter.OdpsWriter", + "description": { + "useScene": "prod.", + "mechanism": "TODO", + "warn": "TODO" + }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/odpswriter/src/main/resources/plugin_job_template.json b/odpswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..3570f9eba8 --- /dev/null +++ b/odpswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "odpswriter", + "parameter": { + "project": "", + "table": "", + "partition":"", + "column": [], + "accessId": "", + "accessKey": "", + "truncate": true, + "odpsServer": "", + "tunnelServer": "" + } +} \ No newline at end of file diff --git a/oraclereader/doc/oraclereader.md b/oraclereader/doc/oraclereader.md new file mode 100644 index 0000000000..250527aee0 --- /dev/null +++ b/oraclereader/doc/oraclereader.md @@ -0,0 +1,350 @@ + +# OracleReader 插件文档 + + +___ + + +## 1 快速介绍 + +OracleReader插件实现了从Oracle读取数据。在底层实现上,OracleReader通过JDBC连接远程Oracle数据库,并执行相应的sql语句将数据从Oracle库中SELECT出来。 + +## 2 实现原理 + +简而言之,OracleReader通过JDBC连接器连接到远程的Oracle数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程Oracle数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,OracleReader将其拼接为SQL语句发送到Oracle数据库;对于用户配置querySql信息,Oracle直接将其发送到Oracle数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从Oracle数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度 byte/s 尽量逼近这个速度但是不高于它. 
+ // channel 表示通道数量,byte表示通道速度,如果单通道速度1MB,配置byte为1048576表示一个channel + "byte": 1048576 + }, + //出错限制 + "errorLimit": { + //先选择record + "record": 0, + //百分比 1表示100% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "oraclereader", + "parameter": { + // 数据库连接用户名 + "username": "root", + // 数据库连接密码 + "password": "root", + "column": [ + "id","name" + ], + //切分主键 + "splitPk": "db_id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]" + ] + } + ] + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + // 是否打印内容 + "parameter": { + "print": true + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 5 + } + }, + "content": [ + { + "reader": { + "name": "oraclereader", + "parameter": { + "username": "root", + "password": "root", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "visible": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,OracleReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,OracleReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照Oracle官方规范,并可以填写连接附件控制信息。具体请参看[Oracle官方文档](http://www.oracle.com/technetwork/database/enterprise-edition/documentation/index.html)。 + + * 必选:是
+ + * 默认值:无
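+
+    下面给出一个只包含 connection 部分的配置片段(示意,主机、端口与 SID 均为假设值),演示在 jdbcUrl 数组中填写多个候选地址时,OracleReader 会按顺序探测,直到找到一个可连通的地址为止:
+
+```
+"connection": [
+    {
+        "table": ["table"],
+        "jdbcUrl": [
+            "jdbc:oracle:thin:@192.168.0.10:1521:orcl",
+            "jdbc:oracle:thin:@192.168.0.11:1521:orcl"
+        ]
+    }
+]
+```
+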
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,OracleReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
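+
+    下面是一个在 connection 中配置多张表的示意片段(表名为假设值),仅用于说明 table 是 JSON 数组,且多张表必须保持同一 schema 结构:
+
+```
+"connection": [
+    {
+        "table": ["ord_2015_01", "ord_2015_02", "ord_2015_03"],
+        "jdbcUrl": ["jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]"]
+    }
+]
+```
+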
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用\*代表默认使用所有列配置,例如['\*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照JSON格式: + ["id", "`table`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"] + id为普通列名,\`table\`为包含保留字的列名,1为整型数字常量,'bazhen.csy'为字符串常量,null为空指针,to_char(a + 1)为表达式,2.3为浮点数,true为布尔值。 + + Column必须显式填写,不允许为空! + + * 必选:是
+ + * 默认值:无
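+
+    结合上文的说明,下面是一个 column 配置的示意片段(字段名沿用上文示例,均为假设),演示列裁剪、列换序以及常量与表达式列的写法:
+
+```
+"column": [
+    "name",
+    "id",
+    "'bazhen.csy'",
+    "to_char(a + 1)",
+    "1"
+]
+```
+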
+ +* **splitPk** + + * 描述:OracleReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提高数据同步的效能。 + + 推荐用户使用表主键作为splitPk,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整型、字符串型数据切分,`不支持浮点、日期等其他类型`。如果用户指定了其他不支持的类型,OracleReader将报错! + + splitPk如果不填写,将视作用户不对单表进行切分,OracleReader使用单通道同步全量数据。 + + * 必选:否
+ + * 默认值:无
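+
+	A minimal sketch of splitPk in use (the table name `orders` and the integer primary-key column `id` are hypothetical): combined with a multi-channel `speed` setting in the job, the fragment below lets DataX split the table into ranges on `id` and start concurrent tasks to read them.
+
+```
+"reader": {
+    "name": "oraclereader",
+    "parameter": {
+        "username": "root",
+        "password": "root",
+        "column": ["id", "name", "gmt_create"],
+        "splitPk": "id",
+        "connection": [
+            {
+                "table": ["orders"],
+                "jdbcUrl": ["jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]"]
+            }
+        ]
+    }
+}
+```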
+ +* **where** + + * 描述:筛选条件,MysqlReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。注意:不可以将where条件指定为limit 10,limit不是SQL的合法where子句。
+ + where条件可以有效地进行业务增量同步。 + + * 必选:否
+ + * 默认值:无
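+
+	To make the incremental pattern concrete, the hedged fragment below syncs one day of data; the column `gmt_create`, the table name and the `yyyymmdd` format mask are assumptions, and `$bizdate` is expected to be supplied through DataX's dynamic-parameter mechanism.
+
+```
+"parameter": {
+    "username": "root",
+    "password": "root",
+    "column": ["id", "name", "gmt_create"],
+    "where": "gmt_create >= to_date('$bizdate','yyyymmdd') and gmt_create < to_date('$bizdate','yyyymmdd') + 1",
+    "connection": [
+        {
+            "table": ["orders"],
+            "jdbcUrl": ["jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]"]
+        }
+    ]
+}
+```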
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置型来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table,column这些配置型,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,OracleReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
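+
+	As a sketch of the join case mentioned above (all table and column names are hypothetical), the statement goes into `connection.querySql`; `table` and `column` are then ignored:
+
+```
+"connection": [
+    {
+        "querySql": [
+            "select a.id, a.name, b.amount from table_a a join table_b b on a.id = b.id"
+        ],
+        "jdbcUrl": [
+            "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]"
+        ]
+    }
+]
+```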
+ +* **fetchSize** + + * 描述:该配置项定义了插件和数据库服务器端每次批量数据获取条数,该值决定了DataX和服务器端的网络交互次数,能够较大的提升数据抽取性能。
+ + `注意,该值过大(>2048)可能造成DataX进程OOM。`。 + + * 必选:否
+ + * 默认值:1024
+ +* **session** + + * 描述:控制写入数据的时间格式,时区等的配置,如果表中有时间字段,配置该值以明确告知写入 oracle 的时间格式。通常配置的参数为:NLS_DATE_FORMAT,NLS_TIME_FORMAT。其配置的值为 json 格式,例如: +``` +"session": [ + "alter session set NLS_DATE_FORMAT='yyyy-mm-dd hh24:mi:ss'", + "alter session set NLS_TIMESTAMP_FORMAT='yyyy-mm-dd hh24:mi:ss'", + "alter session set NLS_TIMESTAMP_TZ_FORMAT='yyyy-mm-dd hh24:mi:ss'", + "alter session set TIME_ZONE='US/Pacific'" + ] +``` + `(注意"是 " 的转义字符串)`。 + + * 必选:否
+ + * 默认值:无
+ + +### 3.3 类型转换 + +目前OracleReader支持大部分Oracle类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出OracleReader针对Oracle类型转换列表: + + +| DataX 内部类型| Oracle 数据类型 | +| -------- | ----- | +| Long |NUMBER,INTEGER,INT,SMALLINT| +| Double |NUMERIC,DECIMAL,FLOAT,DOUBLE PRECISION,REAL| +| String |LONG,CHAR,NCHAR,VARCHAR,VARCHAR2,NVARCHAR2,CLOB,NCLOB,CHARACTER,CHARACTER VARYING,CHAR VARYING,NATIONAL CHARACTER,NATIONAL CHAR,NATIONAL CHARACTER VARYING,NATIONAL CHAR VARYING,NCHAR VARYING | +| Date |TIMESTAMP,DATE | +| Boolean |bit, bool | +| Bytes |BLOB,BFILE,RAW,LONG RAW | + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 + + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +为了模拟线上真实数据,我们设计两个Oracle数据表,分别为: + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + +* Oracle数据库机器参数为: + +### 4.2 测试报告 + +#### 4.2.1 表1测试报告 + + +| 并发任务数| DataX速度(Rec/s)|DataX流量|网卡流量|DataX运行负载|DB运行负载| +|--------| --------|--------|--------|--------|--------| +|1| DataX 统计速度(Rec/s)|DataX统计流量|网卡流量|DataX运行负载|DB运行负载| + +## 5 约束限制 + +### 5.1 主备同步数据恢复问题 + +主备同步问题指Oracle使用主从灾备,备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差,特别在于某些特定情况,例如网络延迟等问题,导致备库同步恢复的数据与主库有较大差别,导致从备库同步的数据不是一份当前时间的完整镜像。 + +针对这个问题,我们提供了preSql功能,该功能待补充。 + +### 5.2 一致性约束 + +Oracle在数据存储划分中属于RDBMS系统,对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中,当该库存在其他数据写入方写入数据时,OracleReader完全不会获取到写入更新数据,这是由于数据库本身的快照特性决定的。关于数据库快照特性,请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) + +上述是在OracleReader单线程模型下数据同步一致性的特性,由于OracleReader可以根据用户配置信息使用了并发数据抽取,因此不能严格保证数据一致性:当OracleReader根据splitPk进行数据切分后,会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务,同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。 + +针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择: + +1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。 + +2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。 + +### 5.3 数据库编码问题 + + +OracleReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此OracleReader不需用户指定编码,可以自动获取编码并转码。 + +对于Oracle底层写入编码和其设定的编码不一致的混乱情况,OracleReader对此无法识别,对此也无法提供解决方案,对于这类情况,`导出有可能为乱码`。 + +### 5.4 增量数据同步 + +OracleReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,OracleReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,OracleReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,OracleReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.5 Sql安全性 + +OracleReader提供querySql语句交给用户自己实现SELECT抽取语句,OracleReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + +*** + +**Q: OracleReader同步报错,报错信息为XXX** + + A: 网络或者权限问题,请使用Oracle命令行测试: + sqlplus username/password@//host:port/sid + + +如果上述命令也报错,那可以证实是环境问题,请联系你的DBA。 + + +**Q: OracleReader抽取速度很慢怎么办?** + + A: 影响抽取时间的原因大概有如下几个:(来自专业 DBA 卫绾) + 1. 由于SQL的plan异常,导致的抽取时间长; 在抽取时,尽可能使用全表扫描代替索引扫描; + 2. 合理sql的并发度,减少抽取时间;根据表的大小, + <50G可以不用并发, + <100G添加如下hint: parallel(a,2), + >100G添加如下hint : parallel(a,4); + 3. 
抽取sql要简单,尽量不用replace等函数,这个非常消耗cpu,会严重影响抽取速度; diff --git a/oraclereader/pom.xml b/oraclereader/pom.xml new file mode 100755 index 0000000000..fc91ca2932 --- /dev/null +++ b/oraclereader/pom.xml @@ -0,0 +1,86 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + oraclereader + oraclereader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + com.oracle + ojdbc6 + 11.2.0.3 + system + ${basedir}/src/main/lib/ojdbc6-11.2.0.3.jar + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/oraclereader/src/main/assembly/package.xml b/oraclereader/src/main/assembly/package.xml new file mode 100755 index 0000000000..a0c9fd1c70 --- /dev/null +++ b/oraclereader/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/oraclereader + + + src/main/lib + + ojdbc6-11.2.0.3.jar + + plugin/reader/oraclereader/libs + + + target/ + + oraclereader-0.0.1-SNAPSHOT.jar + + plugin/reader/oraclereader + + + + + + false + plugin/reader/oraclereader/libs + runtime + + + diff --git a/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/Constant.java b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/Constant.java new file mode 100755 index 0000000000..8006b1a6c7 --- /dev/null +++ b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.reader.oraclereader; + +public class Constant { + + public static final int DEFAULT_FETCH_SIZE = 1024; + +} diff --git a/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReader.java b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReader.java new file mode 100755 index 0000000000..403b30e9bd --- /dev/null +++ b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReader.java @@ -0,0 +1,126 @@ +package com.alibaba.datax.plugin.reader.oraclereader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.reader.util.HintUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; + +public class OracleReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.Oracle; + + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(OracleReader.Job.class); + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + dealFetchSize(this.originalConfig); + + this.commonRdbmsReaderJob = new 
CommonRdbmsReader.Job( + DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + + // 注意:要在 this.commonRdbmsReaderJob.init(this.originalConfig); 之后执行,这样可以直接快速判断是否是querySql 模式 + dealHint(this.originalConfig); + } + + @Override + public void preCheck(){ + init(); + this.commonRdbmsReaderJob.preCheck(this.originalConfig,DATABASE_TYPE); + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderJob.split(this.originalConfig, + adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + private void dealFetchSize(Configuration originalConfig) { + int fetchSize = originalConfig.getInt( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + Constant.DEFAULT_FETCH_SIZE); + if (fetchSize < 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + String.format("您配置的 fetchSize 有误,fetchSize:[%d] 值不能小于 1.", + fetchSize)); + } + originalConfig.set( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + fetchSize); + } + + private void dealHint(Configuration originalConfig) { + String hint = originalConfig.getString(Key.HINT); + if (StringUtils.isNotBlank(hint)) { + boolean isTableMode = originalConfig.getBool(com.alibaba.datax.plugin.rdbms.reader.Constant.IS_TABLE_MODE).booleanValue(); + if(!isTableMode){ + throw DataXException.asDataXException(OracleReaderErrorCode.HINT_ERROR, "当且仅当非 querySql 模式读取 oracle 时才能配置 HINT."); + } + HintUtil.initHintConf(DATABASE_TYPE, originalConfig); + } + } + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task( + DATABASE_TYPE ,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig + .getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, + recordSender, super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReaderErrorCode.java b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReaderErrorCode.java new file mode 100755 index 0000000000..05ee8604a8 --- /dev/null +++ b/oraclereader/src/main/java/com/alibaba/datax/plugin/reader/oraclereader/OracleReaderErrorCode.java @@ -0,0 +1,33 @@ +package com.alibaba.datax.plugin.reader.oraclereader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum OracleReaderErrorCode implements ErrorCode { + HINT_ERROR("Oraclereader-00", "您的 Hint 配置出错."), + + ; + + private final String code; + private final String description; + + private OracleReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; 
+ } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/oraclereader/src/main/lib/ojdbc6-11.2.0.3.jar b/oraclereader/src/main/lib/ojdbc6-11.2.0.3.jar new file mode 100644 index 0000000000..01da074d5a Binary files /dev/null and b/oraclereader/src/main/lib/ojdbc6-11.2.0.3.jar differ diff --git a/oraclereader/src/main/resources/plugin.json b/oraclereader/src/main/resources/plugin.json new file mode 100755 index 0000000000..f1ed98aec5 --- /dev/null +++ b/oraclereader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "oraclereader", + "class": "com.alibaba.datax.plugin.reader.oraclereader.OracleReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/oraclereader/src/main/resources/plugin_job_template.json b/oraclereader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..beae2552ec --- /dev/null +++ b/oraclereader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "oraclereader", + "parameter": { + "username": "", + "password": "", + "column": [], + "connection": [ + { + "table": [], + "jdbcUrl": [] + } + ] + } +} \ No newline at end of file diff --git a/oraclewriter/doc/oraclewriter.md b/oraclewriter/doc/oraclewriter.md new file mode 100644 index 0000000000..6dee50494e --- /dev/null +++ b/oraclewriter/doc/oraclewriter.md @@ -0,0 +1,416 @@ +# DataX OracleWriter + + +--- + + +## 1 快速介绍 + +OracleWriter 插件实现了写入数据到 Oracle 主库的目的表的功能。在底层实现上, OracleWriter 通过 JDBC 连接远程 Oracle 数据库,并执行相应的 insert into ... sql 语句将数据写入 Oracle,内部会分批次提交入库。 + +OracleWriter 面向ETL开发工程师,他们使用 OracleWriter 从数仓导入数据到 Oracle。同时 OracleWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +OracleWriter 通过 DataX 框架获取 Reader 生成的协议数据,根据你配置生成相应的SQL语句 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +
+ + 注意: + 1. 目的表所在数据库必须是主库才能写入数据;整个任务至少需具备 insert into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + 2.OracleWriter和MysqlWriter不同,不支持配置writeMode参数。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 Oracle 导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "oraclewriter", + "parameter": { + "username": "root", + "password": "root", + "column": [ + "id", + "name" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息 ,jdbcUrl必须包含在connection配置单元中。 + + 注意:1、在一个数据库上只能配置一个值。这与 OracleReader 支持多个备库探测不同,因为此处不支持同一个数据库存在多个主库的情况(双主导入数据情况) + 2、jdbcUrl按照Oracle官方规范,并可以填写连接附加参数信息。具体请参看 Oracle官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + **column配置项必须指定,不能留空!** + + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
+
+	* 默认值:无
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。比如你的任务是要写入到目的端的100个同构分表(表名称为:datax_00,datax01, ... datax_98,datax_99),并且你希望导入数据前,先对表中数据进行删除操作,那么你可以这样配置:`"preSql":["delete from @table"]`,效果是:在执行到每个表写入数据前,会先执行对应的 delete from 对应表名称
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
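+
+	For illustration of the `@table` substitution described under preSql/postSql (the table names and the bookkeeping table `datax_sync_log` are made up): with two target tables configured, each statement runs once per table, with `@table` replaced by the actual table name.
+
+```json
+"parameter": {
+    "username": "username",
+    "password": "password",
+    "column": ["id", "name"],
+    "preSql": [
+        "delete from @table"
+    ],
+    "postSql": [
+        "insert into datax_sync_log(table_name, finish_time) values('@table', sysdate)"
+    ],
+    "connection": [
+        {
+            "jdbcUrl": "jdbc:oracle:thin:@[HOST_NAME]:PORT:[DATABASE_NAME]",
+            "table": ["datax_00", "datax_01"]
+        }
+    ]
+}
+```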
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与Oracle的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
+ +* **session** + + * 描述:设置oracle连接时的session信息,格式示例如下:
+ + ``` + "session":[ + "alter session set nls_date_format = 'dd.mm.yyyy hh24:mi:ss';" + "alter session set NLS_LANG = 'AMERICAN';" + ] + + ``` + + * 必选:否
+ + * 默认值:无
+ +### 3.3 类型转换 + +类似 OracleReader ,目前 OracleWriter 支持大部分 Oracle 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 OracleWriter 针对 Oracle 类型转换列表: + + +| DataX 内部类型| Oracle 数据类型 | +| -------- | ----- | +| Long |NUMBER,INTEGER,INT,SMALLINT| +| Double |NUMERIC,DECIMAL,FLOAT,DOUBLE PRECISION,REAL| +| String |LONG,CHAR,NCHAR,VARCHAR,VARCHAR2,NVARCHAR2,CLOB,NCLOB,CHARACTER,CHARACTER VARYING,CHAR VARYING,NATIONAL CHARACTER,NATIONAL CHAR,NATIONAL CHARACTER VARYING,NATIONAL CHAR VARYING,NCHAR VARYING | +| Date |TIMESTAMP,DATE | +| Boolean |bit, bool | +| Bytes |BLOB,BFILE,RAW,LONG RAW | + + + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: +``` +--DROP TABLE PERF_ORACLE_WRITER; +CREATE TABLE PERF_ORACLE_WRITER ( +COL1 VARCHAR2(255 BYTE) NULL , +COL2 NUMBER(32) NULL , +COL3 NUMBER(32) NULL , +COL4 DATE NULL , +COL5 FLOAT NULL , +COL6 VARCHAR2(255 BYTE) NULL , +COL7 VARCHAR2(255 BYTE) NULL , +COL8 VARCHAR2(255 BYTE) NULL , +COL9 VARCHAR2(255 BYTE) NULL , +COL10 VARCHAR2(255 BYTE) NULL +) +LOGGING +NOCOMPRESS +NOCACHE; +``` +单行记录类似于: +``` +col1:485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&* +co12:1 +co13:1696248667889 +co14:2013-01-06 00:00:00 +co15:3.141592653578 +co16:100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 +co17:100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209 +co18:100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209 +co19:100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209 +co110:12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209 +``` +#### 4.1.2 机器参数 + +* 执行 DataX 的机器参数为: + 1. cpu: 24 Core Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz + 2. mem: 94GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* Oracle 数据库机器参数为: + 1. cpu: 4 Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz + 2. 
mem: 7GB + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + +#### 4.1.4 性能测试作业配置 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 4 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "sliceRecordCount": 1000000000, + "column": [ + { + "value": "485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*" + }, + { + "value": 1, + "type": "long" + }, + { + "value": "1696248667889", + "type": "long" + }, + { + "type": "date", + "value": "2013-07-06 00:00:00", + "dateFormat": "yyyy-mm-dd hh:mm:ss" + }, + { + "value": "3.141592653578", + "type": "double" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209" + }, + { + "value": "100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209" + }, + { + "value": "100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209" + }, + { + "value": "12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209" + } + ] + } + }, + "writer": { + "name": "oraclewriter", + "parameter": { + "username": "username", + "password": "password", + "truncate": "true", + "batchSize": "512", + "column": [ + "col1", + "col2", + "col3", + "col4", + "col5", + "col6", + "col7", + "col8", + "col9", + "col10" + ], + "connection": [ + { + "table": [ + "PERF_ORACLE_WRITER" + ], + "jdbcUrl": "jdbc:oracle:thin:@ip:port:dataplat" + } + ] + } + } + } + ] + } +} + +``` + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + +| 通道数| 批量提交行数| DataX速度(Rec/s)|DataX流量(MB/s)| DataX机器网卡流出流量(MB/s)|DataX机器运行负载|DB网卡进入流量(MB/s)|DB运行负载| +|--------|--------| --------|--------|--------|--------|--------|--------| +|1|128|15564|6.51|7.5|0.02|7.4|1.08| +|1|512|29491|10.90|12.6|0.05|12.4|1.55| +|1|1024|31529|11.87|13.5|0.22|13.3|1.58| +|1|2048|33469|12.57|14.3|0.17|14.3|1.53| +|1|4096|31363|12.48|13.4|0.10|10.0|1.72| +|4|10|9440|4.05|5.6|0.01|5.0|3.75| +|4|128|42832|16.48|18.3|0.07|18.5|2.89| +|4|512|46643|20.02|22.7|0.35|21.1|3.31| +|4|1024|39116|16.79|18.7|0.10|18.1|3.05| +|4|2048|39526|16.96|18.5|0.32|17.1|2.86| +|4|4096|37683|16.17|17.2|0.23|15.5|2.26| +|8|128|38336|16.45|17.5|0.13|16.2|3.87| +|8|512|31078|13.34|14.9|0.11|13.4|2.09| +|8|1024|37888|16.26|18.5|0.20|18.5|3.14| +|8|2048|38502|16.52|18.5|0.18|18.5|2.96| +|8|4096|38092|16.35|18.3|0.10|17.8|3.19| +|16|128|35366|15.18|16.9|0.13|15.6|3.49| +|16|512|35584|15.27|16.8|0.23|17.4|3.05| +|16|1024|38297|16.44|17.5|0.20|17.0|3.42| +|16|2048|28467|12.22|13.8|0.10|12.4|3.38| +|16|4096|27852|11.95|12.3|0.11|12.3|3.86| +|32|1024|34406|14.77|15.4|0.09|15.4|3.55| + + +1. `batchSize 和 通道个数,对性能影响较大` +2. 
`通常不建议写入数据库时,通道个数 >32` + + + +## 5 约束限制 + + + + +## FAQ + +*** + +**Q: OracleWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/oraclewriter/pom.xml b/oraclewriter/pom.xml new file mode 100755 index 0000000000..90104049f9 --- /dev/null +++ b/oraclewriter/pom.xml @@ -0,0 +1,82 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + oraclewriter + oraclewriter + jar + writer data into oracle database + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + com.oracle + ojdbc6 + 11.2.0.3 + system + ${basedir}/src/main/lib/ojdbc6-11.2.0.3.jar + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/oraclewriter/src/main/assembly/package.xml b/oraclewriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..09a25d1a2e --- /dev/null +++ b/oraclewriter/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/oraclewriter + + + src/main/lib + + ojdbc6-11.2.0.3.jar + + plugin/writer/oraclewriter/libs + + + target/ + + oraclewriter-0.0.1-SNAPSHOT.jar + + plugin/writer/oraclewriter + + + + + + false + plugin/writer/oraclewriter/libs + runtime + + + diff --git a/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriter.java b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriter.java new file mode 100755 index 0000000000..73a9ad6a37 --- /dev/null +++ b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriter.java @@ -0,0 +1,104 @@ +package com.alibaba.datax.plugin.writer.oraclewriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class OracleWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.Oracle; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + public void preCheck() { + this.init(); + this.commonRdbmsWriterJob.writerPreCheck(this.originalConfig, DATABASE_TYPE); + } + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + // warn:not like mysql, oracle only support insert mode, don't use + String writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if (null != writeMode) { + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "写入模式(writeMode)配置错误. 
因为Oracle不支持配置项 writeMode: %s, Oracle只能使用insert sql 插入数据. 请检查您的配置并作出修改", + writeMode)); + } + + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job( + DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + @Override + public void prepare() { + //oracle实跑先不做权限检查 + //this.commonRdbmsWriterJob.privilegeValid(this.originalConfig, DATABASE_TYPE); + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, + mandatoryNumber); + } + + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task(DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, + this.writerSliceConfig, super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + } + +} diff --git a/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriterErrorCode.java b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriterErrorCode.java new file mode 100755 index 0000000000..06f0cfa260 --- /dev/null +++ b/oraclewriter/src/main/java/com/alibaba/datax/plugin/writer/oraclewriter/OracleWriterErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.writer.oraclewriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum OracleWriterErrorCode implements ErrorCode { + ; + + private final String code; + private final String describe; + + private OracleWriterErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]. ", this.code, + this.describe); + } +} diff --git a/oraclewriter/src/main/lib/ojdbc6-11.2.0.3.jar b/oraclewriter/src/main/lib/ojdbc6-11.2.0.3.jar new file mode 100644 index 0000000000..01da074d5a Binary files /dev/null and b/oraclewriter/src/main/lib/ojdbc6-11.2.0.3.jar differ diff --git a/oraclewriter/src/main/resources/plugin.json b/oraclewriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..54df0a8903 --- /dev/null +++ b/oraclewriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "oraclewriter", + "class": "com.alibaba.datax.plugin.writer.oraclewriter.OracleWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql. 
warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/oraclewriter/src/main/resources/plugin_job_template.json b/oraclewriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..0ef68e9ed5 --- /dev/null +++ b/oraclewriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "oraclewriter", + "parameter": { + "username": "", + "password": "", + "column": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ] + } +} \ No newline at end of file diff --git a/ossreader/doc/ossreader.md b/ossreader/doc/ossreader.md new file mode 100644 index 0000000000..e0259a2a58 --- /dev/null +++ b/ossreader/doc/ossreader.md @@ -0,0 +1,281 @@ +# DataX OSSReader 说明 + + +------------ + +## 1 快速介绍 + +OSSReader提供了读取OSS数据存储的能力。在底层实现上,OSSReader使用OSS官方Java SDK获取OSS数据,并转换为DataX传输协议传递给Writer。 + +* OSS 产品介绍, 参看[[阿里云OSS Portal](http://www.aliyun.com/product/oss)] +* OSS Java SDK, 参看[[阿里云OSS Java SDK](http://oss.aliyuncs.com/aliyun_portal_storage/help/oss/OSS_Java_SDK_Dev_Guide_20141113.pdf)] + +## 2 功能与限制 + +OSSReader实现了从OSS读取数据并转为DataX协议的功能,OSS本身是无结构化数据存储,对于DataX而言,OSSReader实现上类比TxtFileReader,有诸多相似之处。目前OSSReader支持功能如下: + +1. 支持且仅支持读取TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +4. 支持递归读取、支持文件名过滤。 + +5. 支持文本压缩,现有压缩格式为zip、gzip、bzip2。注意,一个压缩包不允许多文件打包压缩。 + +6. 多个object可以支持并发读取。 + +我们暂时不能做到: + +1. 单个Object(File)支持多线程并发读取,这里涉及到单个Object内部切分算法。二期考虑支持。 + +2. 单个Object在压缩情况下,从技术上无法支持多线程并发读取。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "job": { + "setting": {}, + "content": [ + { + "reader": { + "name": "ossreader", + "parameter": { + "endpoint": "http://oss.aliyuncs.com", + "accessId": "", + "accessKey": "", + "bucket": "myBucket", + "object": [ + "bazhen/*" + ], + "column": [ + { + "type": "long", + "index": 0 + }, + { + "type": "string", + "value": "alibaba" + }, + { + "type": "date", + "index": 1, + "format": "yyyy-MM-dd" + } + ], + "encoding": "UTF-8", + "fieldDelimiter": "\t", + "compress": "gzip" + } + }, + "writer": {} + } + ] + } +} +``` + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OSS Server的EndPoint地址,例如http://oss.aliyuncs.com。 + + * 必选:是
+ + * 默认值:无
+ +* **accessId** + + * 描述:OSS的accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + + * 描述:OSS的accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **bucket** + + * 描述:OSS的bucket
+ + * 必选:是
+ + * 默认值:无
+ +* **object** + + * 描述:OSS的object信息,注意这里可以支持填写多个Object。
+ + 当指定单个OSS Object,OSSReader暂时只能使用单线程进行数据抽取。二期考虑在非压缩文件情况下针对单个Object可以进行多线程并发读取。 + + 当指定多个OSS Object,OSSReader支持使用多线程进行数据抽取。线程并发数通过通道数指定。 + + 当指定通配符,OSSReader尝试遍历出多个Object信息。例如: 指定/*代表读取bucket下游所有的Object,指定/bazhen/\*代表读取bazhen目录下游所有的Object。 + + **特别需要注意的是,DataX会将一个作业下同步的所有Object视作同一张数据表。用户必须自己保证所有的Object能够适配同一套schema信息。** + + * 必选:是
+ + * 默认值:无
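+
+	For illustration, assuming the bucket stores daily files under a layout such as bazhen/20150101/part-000.txt (all paths here are hypothetical), any of the entries below is valid; every matched object must still share the same schema:
+
+```json
+"object": [
+    "bazhen/20150101/part-000.txt",
+    "bazhen/20150101/*",
+    "bazhen/2015*"
+]
+```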
+ +* **column** + + * 描述:读取字段列表,type指定源数据的类型,index指定当前列来自于文本第几列(以0开始),value指定当前类型为常量,不从源头文件读取数据,而是根据value值自动生成对应的列。
+ + 默认情况下,用户可以全部按照String类型读取数据,配置如下: + + ```json + "column": ["*"] + ``` + + 用户可以指定Column字段信息,配置如下: + + ```json + { + "type": "long", + "index": 0 //从OSS文本第一列获取int字段 + }, + { + "type": "string", + "value": "alibaba" //从OSSReader内部生成alibaba的字符串字段作为当前字段 + } + ``` + + 对于用户指定Column信息,type必须填写,index/value必须选择其一。 + + * 必选:是
+ + * 默认值:全部按照string类型读取
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + * 必选:是
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为zip、gzip、bzip2。
+ + * 必选:否
+ + * 默认值:不压缩
+ +* **encoding** + + * 描述:读取文件的编码配置,目前只支持utf-8/gbk。
+ + * 必选:否
+ + * 默认值:utf-8
+ +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat="\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
+ +* **skipHeader** + + * 描述:类CSV格式文件可能存在表头为标题情况,需要跳过。默认不跳过。
+ + * 必选:否
+ + * 默认值:false
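+
+	Putting several of the optional settings above together, a hypothetical reader `parameter` fragment for gzip-compressed, tab-separated files whose first line is a header row might look like this (endpoint, bucket and object values are placeholders):
+
+```json
+"parameter": {
+    "endpoint": "http://oss.aliyuncs.com",
+    "accessId": "",
+    "accessKey": "",
+    "bucket": "myBucket",
+    "object": ["bazhen/*"],
+    "column": ["*"],
+    "fieldDelimiter": "\t",
+    "encoding": "UTF-8",
+    "compress": "gzip",
+    "skipHeader": true
+}
+```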
+ + +* **csvReaderConfig** + + * 描述:读取CSV类型文件参数配置,Map类型。读取CSV类型文件使用的CsvReader进行读取,会有很多配置,不配置则使用默认值。
+ + * 必选:否
+ + * 默认值:无
+ + +常见配置: + +```json +"csvReaderConfig":{ + "safetySwitch": false, + "skipEmptyRecords": false, + "useTextQualifier": false +} +``` + +所有配置项及默认值,配置时 csvReaderConfig 的map中请**严格按照以下字段名字进行配置**: + +``` +boolean caseSensitive = true; +char textQualifier = 34; +boolean trimWhitespace = true; +boolean useTextQualifier = true;//是否使用csv转义字符 +char delimiter = 44;//分隔符 +char recordDelimiter = 0; +char comment = 35; +boolean useComments = false; +int escapeMode = 1; +boolean safetySwitch = true;//单列长度是否限制100000字符 +boolean skipEmptyRecords = true;//是否跳过空行 +boolean captureRawRecord = true; +``` + + +### 3.3 类型转换 + + +OSS本身不提供数据类型,该类型是DataX OSSReader定义: + +| DataX 内部类型| OSS 数据类型 | +| -------- | ----- | +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* OSS Long是指OSS文本中使用整形的字符串表示形式,例如"19901219"。 +* OSS Double是指OSS文本中使用Double的字符串表示形式,例如"3.1415"。 +* OSS Boolean是指OSS文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* OSS Date是指OSS文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + +## 4 性能报告 + +|并发数|DataX 流量|Datax 记录数| +|--------|--------| --------| +|1| 971.40KB/s |10047rec/s | +|2| 1.81MB/s | 19181rec/s | +|4| 3.46MB/s| 36695rec/s | +|8| 6.57MB/s | 69289 records/s | +|16|7.92MB/s| 83920 records/s| +|32|7.87MB/s| 83350 records/s| + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/ossreader/pom.xml b/ossreader/pom.xml new file mode 100755 index 0000000000..de6fe3add9 --- /dev/null +++ b/ossreader/pom.xml @@ -0,0 +1,87 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + ossreader + ossreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + com.aliyun.oss + aliyun-sdk-oss + 2.2.3 + + + junit + junit + test + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/ossreader/src/main/assembly/package.xml b/ossreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..e6f7257dc9 --- /dev/null +++ b/ossreader/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/ossreader + + + target/ + + ossreader-0.0.1-SNAPSHOT.jar + + plugin/reader/ossreader + + + + + + false + plugin/reader/ossreader/libs + runtime + + + diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Constant.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Constant.java new file mode 100755 index 0000000000..e3429445a0 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Constant.java @@ -0,0 +1,10 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +/** + * Created by mengxin.liumx on 2014/12/7. 
+ */ +public class Constant { + + public static final String OBJECT = "object"; + public static final int SOCKETTIMEOUT = 5000000; +} diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Key.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Key.java new file mode 100755 index 0000000000..e836fbbd09 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/Key.java @@ -0,0 +1,21 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +/** + * Created by mengxin.liumx on 2014/12/7. + */ +public class Key { + public static final String ENDPOINT = "endpoint"; + + public static final String ACCESSID = "accessId"; + + public static final String ACCESSKEY = "accessKey"; + + public static final String ENCODING = "encoding"; + + public static final String BUCKET = "bucket"; + + public static final String OBJECT = "object"; + + public static final String CNAME = "cname"; + +} diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReader.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReader.java new file mode 100755 index 0000000000..ce4f0875b4 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReader.java @@ -0,0 +1,318 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.ossreader.util.OssUtil; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import com.aliyun.oss.ClientException; +import com.aliyun.oss.OSSClient; +import com.aliyun.oss.OSSException; +import com.aliyun.oss.model.ListObjectsRequest; +import com.aliyun.oss.model.OSSObject; +import com.aliyun.oss.model.OSSObjectSummary; +import com.aliyun.oss.model.ObjectListing; +import com.google.common.collect.Sets; + +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.InputStream; +import java.nio.charset.UnsupportedCharsetException; +import java.util.ArrayList; +import java.util.List; +import java.util.Set; +import java.util.regex.Pattern; + +/** + * Created by mengxin.liumx on 2014/12/7. 
+ */ +public class OssReader extends Reader { + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory + .getLogger(OssReader.Job.class); + + private Configuration readerOriginConfig = null; + + @Override + public void init() { + LOG.debug("init() begin..."); + this.readerOriginConfig = this.getPluginJobConf(); + this.validate(); + LOG.debug("init() ok and end..."); + } + + private void validate() { + String endpoint = this.readerOriginConfig.getString(Key.ENDPOINT); + if (StringUtils.isBlank(endpoint)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 endpoint"); + } + + String accessId = this.readerOriginConfig.getString(Key.ACCESSID); + if (StringUtils.isBlank(accessId)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 accessId"); + } + + String accessKey = this.readerOriginConfig.getString(Key.ACCESSKEY); + if (StringUtils.isBlank(accessKey)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 accessKey"); + } + + String bucket = this.readerOriginConfig.getString(Key.BUCKET); + if (StringUtils.isBlank(bucket)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 endpoint"); + } + + String object = this.readerOriginConfig.getString(Key.OBJECT); + if (StringUtils.isBlank(object)) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 object"); + } + + String fieldDelimiter = this.readerOriginConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER); + // warn: need length 1 + if (null == fieldDelimiter || fieldDelimiter.length() == 0) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 fieldDelimiter"); + } + + String encoding = this.readerOriginConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_ENCODING); + try { + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException( + OssReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持的编码格式 : [%s]", encoding), uce); + } catch (Exception e) { + throw DataXException.asDataXException( + OssReaderErrorCode.ILLEGAL_VALUE, + String.format("运行配置异常 : %s", e.getMessage()), e); + } + + // 检测是column 是否为 ["*"] 若是则填为空 + List column = this.readerOriginConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + if (null != column + && 1 == column.size() + && ("\"*\"".equals(column.get(0).toString()) || "'*'" + .equals(column.get(0).toString()))) { + readerOriginConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN, + new ArrayList()); + } else { + // column: 1. 
index type 2.value type 3.when type is Data, may + // have + // format + List columns = this.readerOriginConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException( + OssReaderErrorCode.CONFIG_INVALID_EXCEPTION, + "您需要指定 columns"); + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf + .getNecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.TYPE, + OssReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf + .getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.INDEX); + String columnValue = eachColumnConf + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException( + OssReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException( + OssReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + + } + } + } + + // only support compress: gzip,bzip2,zip + String compress = this.readerOriginConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS); + if (StringUtils.isBlank(compress)) { + this.readerOriginConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + null); + } else { + Set supportedCompress = Sets + .newHashSet("gzip", "bzip2", "zip"); + compress = compress.toLowerCase().trim(); + if (!supportedCompress.contains(compress)) { + throw DataXException + .asDataXException( + OssReaderErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 gzip, bzip2, zip 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + compress)); + } + this.readerOriginConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + compress); + } + } + + @Override + public void prepare() { + LOG.debug("prepare()"); + } + + @Override + public void post() { + LOG.debug("post()"); + } + + @Override + public void destroy() { + LOG.debug("destroy()"); + } + + @Override + public List split(int adviceNumber) { + LOG.debug("split() begin..."); + List readerSplitConfigs = new ArrayList(); + + // 将每个单独的 object 作为一个 slice + List objects = parseOriginObjects(readerOriginConfig + .getList(Constant.OBJECT, String.class)); + if (0 == objects.size()) { + throw DataXException.asDataXException( + OssReaderErrorCode.EMPTY_BUCKET_EXCEPTION, + String.format( + "未能找到待读取的Object,请确认您的配置项bucket: %s object: %s", + this.readerOriginConfig.get(Key.BUCKET), + this.readerOriginConfig.get(Key.OBJECT))); + } + + for (String object : objects) { + Configuration splitedConfig = this.readerOriginConfig.clone(); + splitedConfig.set(Constant.OBJECT, object); + readerSplitConfigs.add(splitedConfig); + LOG.info(String.format("OSS object to be read:%s", object)); + } + LOG.debug("split() ok and end..."); + return readerSplitConfigs; + } + + private List parseOriginObjects(List originObjects) { + List parsedObjects = new ArrayList(); + + for (String object : originObjects) { + int firstMetaChar = (object.indexOf('*') > object.indexOf('?')) ? 
object + .indexOf('*') : object.indexOf('?'); + + if (firstMetaChar != -1) { + int lastDirSeparator = object.lastIndexOf( + IOUtils.DIR_SEPARATOR, firstMetaChar); + String parentDir = object + .substring(0, lastDirSeparator + 1); + List remoteObjects = getRemoteObjects(parentDir); + Pattern pattern = Pattern.compile(object.replace("*", ".*") + .replace("?", ".?")); + + for (String remoteObject : remoteObjects) { + if (pattern.matcher(remoteObject).matches()) { + parsedObjects.add(remoteObject); + } + } + } else { + parsedObjects.add(object); + } + } + return parsedObjects; + } + + private List getRemoteObjects(String parentDir) + throws OSSException, ClientException { + + LOG.debug(String.format("父文件夹 : %s", parentDir)); + List remoteObjects = new ArrayList(); + OSSClient client = OssUtil.initOssClient(readerOriginConfig); + try { + ListObjectsRequest listObjectsRequest = new ListObjectsRequest( + readerOriginConfig.getString(Key.BUCKET)); + listObjectsRequest.setPrefix(parentDir); + ObjectListing objectList; + do { + objectList = client.listObjects(listObjectsRequest); + for (OSSObjectSummary objectSummary : objectList + .getObjectSummaries()) { + LOG.debug(String.format("找到文件 : %s", + objectSummary.getKey())); + remoteObjects.add(objectSummary.getKey()); + } + listObjectsRequest.setMarker(objectList.getNextMarker()); + LOG.debug(listObjectsRequest.getMarker()); + LOG.debug(String.valueOf(objectList.isTruncated())); + + } while (objectList.isTruncated()); + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + OssReaderErrorCode.OSS_EXCEPTION, e.getMessage()); + } + + return remoteObjects; + } + } + + public static class Task extends Reader.Task { + private static Logger LOG = LoggerFactory.getLogger(Reader.Task.class); + + private Configuration readerSliceConfig; + + @Override + public void startRead(RecordSender recordSender) { + LOG.debug("read start"); + String object = readerSliceConfig.getString(Key.OBJECT); + OSSClient client = OssUtil.initOssClient(readerSliceConfig); + + OSSObject ossObject = client.getObject( + readerSliceConfig.getString(Key.BUCKET), object); + InputStream objectStream = ossObject.getObjectContent(); + UnstructuredStorageReaderUtil.readFromStream(objectStream, object, + this.readerSliceConfig, recordSender, + this.getTaskPluginCollector()); + recordSender.flush(); + } + + @Override + public void init() { + this.readerSliceConfig = this.getPluginJobConf(); + } + + @Override + public void destroy() { + + } + } +} diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReaderErrorCode.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReaderErrorCode.java new file mode 100755 index 0000000000..aa33c7582a --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/OssReaderErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.reader.ossreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by mengxin.liumx on 2014/12/7. 
+ */ +public enum OssReaderErrorCode implements ErrorCode { + // TODO: 修改错误码类型 + RUNTIME_EXCEPTION("OssReader-00", "运行时异常"), + OSS_EXCEPTION("OssFileReader-01", "OSS配置异常"), + CONFIG_INVALID_EXCEPTION("OssFileReader-02", "参数配置错误"), + NOT_SUPPORT_TYPE("OssReader-03", "不支持的类型"), + CAST_VALUE_TYPE_ERROR("OssFileReader-04", "无法完成指定类型的转换"), + SECURITY_EXCEPTION("OssReader-05", "缺少权限"), + ILLEGAL_VALUE("OssReader-06", "值错误"), + REQUIRED_VALUE("OssReader-07", "必选项"), + NO_INDEX_VALUE("OssReader-08","没有 Index" ), + MIXED_INDEX_VALUE("OssReader-09","index 和 value 混合" ), + EMPTY_BUCKET_EXCEPTION("OssReader-10", "您尝试读取的Bucket为空"); + + private final String code; + private final String description; + + private OssReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} \ No newline at end of file diff --git a/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/util/OssUtil.java b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/util/OssUtil.java new file mode 100755 index 0000000000..6aa1c48de0 --- /dev/null +++ b/ossreader/src/main/java/com/alibaba/datax/plugin/reader/ossreader/util/OssUtil.java @@ -0,0 +1,46 @@ +package com.alibaba.datax.plugin.reader.ossreader.util; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.commons.lang3.StringUtils; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.ossreader.Constant; +import com.alibaba.datax.plugin.reader.ossreader.Key; +import com.alibaba.datax.plugin.reader.ossreader.OssReaderErrorCode; +import com.aliyun.oss.ClientConfiguration; +import com.aliyun.oss.OSSClient; + +/** + * Created by mengxin.liumx on 2014/12/8. 
+ */ +public class OssUtil { + public static OSSClient initOssClient(Configuration conf) { + String endpoint = conf.getString(Key.ENDPOINT); + String accessId = conf.getString(Key.ACCESSID); + String accessKey = conf.getString(Key.ACCESSKEY); + ClientConfiguration ossConf = new ClientConfiguration(); + ossConf.setSocketTimeout(Constant.SOCKETTIMEOUT); + + // .aliyun.com, if you are .aliyun.ga you need config this + String cname = conf.getString(Key.CNAME); + if (StringUtils.isNotBlank(cname)) { + List cnameExcludeList = new ArrayList(); + cnameExcludeList.add(cname); + ossConf.setCnameExcludeList(cnameExcludeList); + } + + OSSClient client = null; + try { + client = new OSSClient(endpoint, accessId, accessKey, ossConf); + + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + OssReaderErrorCode.ILLEGAL_VALUE, e.getMessage()); + } + + return client; + } +} diff --git a/ossreader/src/main/resources/plugin.json b/ossreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..bf1cf5be0a --- /dev/null +++ b/ossreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "ossreader", + "class": "com.alibaba.datax.plugin.reader.ossreader.OssReader", + "description": "", + "developer": "alibaba" +} + diff --git a/ossreader/src/main/resources/plugin_job_template.json b/ossreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..41b5e21957 --- /dev/null +++ b/ossreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "ossreader", + "parameter": { + "endpoint": "", + "accessId": "", + "accessKey": "", + "bucket": "", + "object": [], + "column": [], + "encoding": "", + "fieldDelimiter": "", + "compress": "" + } +} \ No newline at end of file diff --git a/osswriter/doc/osswriter.md b/osswriter/doc/osswriter.md new file mode 100644 index 0000000000..cf7180e1d6 --- /dev/null +++ b/osswriter/doc/osswriter.md @@ -0,0 +1,214 @@ +# DataX OSSWriter 说明 + + +------------ + +## 1 快速介绍 + +OSSWriter提供了向OSS写入类CSV格式的一个或者多个表文件。 + +**写入OSS内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +* OSS 产品介绍, 参看[[阿里云OSS Portal](http://www.aliyun.com/product/oss)] +* OSS Java SDK, 参看[[阿里云OSS Java SDK](http://oss.aliyuncs.com/aliyun_portal_storage/help/oss/OSS_Java_SDK_Dev_Guide_20141113.pdf)] + + +## 2 功能与限制 + +OSSWriter实现了从DataX协议转为OSS中的TXT文件功能,OSS本身是无结构化数据存储,OSSWriter需要在如下几个方面增加: + +1. 支持且仅支持写入 TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 暂时不支持文本压缩。 + +6. 支持多线程写入,每个线程写入不同子文件。 + +7. 文件支持滚动,当文件大于某个size值或者行数值,文件需要切换。 [暂不支持] + +我们不能做到: + +1. 单个文件不能支持并发写入。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "job": { + "setting": {}, + "content": [ + { + "reader": { + + }, + "writer": { + "parameter": { + "endpoint": "http://oss.aliyuncs.com", + "accessId": "", + "accessKey": "", + "bucket": "myBucket", + "object": "/cdo/datax", + "encoding": "UTF-8", + "fieldDelimiter": ",", + "writeMode": "truncate|append|nonConflict" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OSS Server的EndPoint地址,例如http://oss.aliyuncs.com。 + + * 必选:是
+ + * 默认值:无
+ +* **accessId** + + * 描述:OSS的accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + + * 描述:OSS的accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **bucket** + + * 描述:OSS的bucket
+ + * 必选:是
+ + * 默认值:无
+ +* **object** + + * 描述:OSSWriter写入的文件名,OSS使用文件名模拟目录的实现。
+ + 使用"object": "datax",写入object以datax开头,后缀添加随机字符串。 + 使用"object": "/cdo/datax",写入的object以/cdo/datax开头,后缀随机添加字符串,/作为OSS模拟目录的分隔符。 + + * 必选:是
+ + * 默认值:无
+ +* **writeMode** + + * 描述:OSSWriter写入前数据清理处理:
+ + * truncate,写入前清理object名称前缀匹配的所有object。例如: "object": "abc",将清理所有abc开头的object。 + * append,写入前不做任何处理,DataX OSSWriter直接使用object名称写入,并使用随机UUID的后缀名来保证文件名不冲突。例如用户指定的object名为datax,实际写入为datax_xxxxxx_xxxx_xxxx + * nonConflict,如果指定路径出现前缀匹配的object,直接报错。例如: "object": "abc",如果存在abc123的object,将直接报错。 + + * 必选:是
+ + * 默认值:无
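+
+	As an illustration (bucket and object names are hypothetical): with the fragment below, truncate first deletes every existing object whose name starts with datax/target/orders, append keeps them and only adds new randomly-suffixed objects, and nonConflict fails the job if any object with that prefix already exists.
+
+```json
+"parameter": {
+    "endpoint": "http://oss.aliyuncs.com",
+    "accessId": "",
+    "accessKey": "",
+    "bucket": "myBucket",
+    "object": "datax/target/orders",
+    "writeMode": "truncate",
+    "fieldDelimiter": ","
+}
+```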
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + * 必选:否
+ + * 默认值:,
+ + +* **encoding** + + * 描述:写出文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ + +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat="\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
+* **dateFormat** + + * 描述:日期类型的数据序列化到object中时的格式,例如 "dateFormat": "yyyy-MM-dd"。
+ + + * 必选:否
+ + * 默认值:无
+ +* **fileFormat** + + * 描述:文件写出的格式,包括csv (http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) 和text两种,csv是严格的csv格式,如果待写数据包括列分隔符,则会按照csv的转义语法转义,转义符号为双引号";text格式是用列分隔符简单分割待写数据,对于待写数据包括列分隔符情况下不做转义。
+ + * 必选:否
+ + * 默认值:text
+ +* **header** + + * 描述:Oss写出时的表头,示例['id', 'name', 'age']。
+ + * 必选:否
+ + * 默认值:无
+ +* **maxFileSize** + + * 描述:Oss写出时单个Object文件的最大大小,默认为10000*10MB,类似log4j日志打印时根据日志文件大小轮转。OSS分块上传时,每个分块大小为10MB,每个OSS InitiateMultipartUploadRequest支持的分块最大数量为10000。轮转发生时,object名字规则是:在原有object前缀加UUID随机数的基础上,拼接_1,_2,_3等后缀。
+ + * 必选:否
+ + * 默认值:100000MB
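+
+A sketch that combines the formatting-related options above (the values are illustrative, not recommendations, and assume maxFileSize is interpreted in MB, as the 100000MB default suggests): strict CSV output with a header row, dates serialized as yyyy-MM-dd, and rotation to a new object after roughly 1024MB.
+
+```json
+"parameter": {
+    "endpoint": "http://oss.aliyuncs.com",
+    "accessId": "",
+    "accessKey": "",
+    "bucket": "myBucket",
+    "object": "datax/csv/orders",
+    "writeMode": "append",
+    "fileFormat": "csv",
+    "fieldDelimiter": ",",
+    "header": ["id", "name", "gmt_create"],
+    "dateFormat": "yyyy-MM-dd",
+    "maxFileSize": 1024
+}
+```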
+ +### 3.3 类型转换 + +## 4 性能报告 + +OSS本身不提供数据类型,该类型是DataX OSSWriter定义: + +| DataX 内部类型| OSS 数据类型 | +| -------- | ----- | +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* OSS Long是指OSS文本中使用整形的字符串表示形式,例如"19901219"。 +* OSS Double是指OSS文本中使用Double的字符串表示形式,例如"3.1415"。 +* OSS Boolean是指OSS文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* OSS Date是指OSS文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + diff --git a/osswriter/pom.xml b/osswriter/pom.xml new file mode 100644 index 0000000000..a5cb76cf73 --- /dev/null +++ b/osswriter/pom.xml @@ -0,0 +1,80 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + osswriter + osswriter + jar + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + com.aliyun.oss + aliyun-sdk-oss + 2.2.3 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + \ No newline at end of file diff --git a/osswriter/src/main/assembly/package.xml b/osswriter/src/main/assembly/package.xml new file mode 100644 index 0000000000..aa40643dee --- /dev/null +++ b/osswriter/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/osswriter + + + target/ + + osswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/osswriter + + + + + + false + plugin/writer/osswriter/libs + runtime + + + diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Constant.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Constant.java new file mode 100644 index 0000000000..5bf2eb46e3 --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Constant.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +/** + * Created by haiwei.luo on 15-02-09. + */ +public class Constant { + public static final String OBJECT = "object"; + public static final int SOCKETTIMEOUT = 5000000; +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Key.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Key.java new file mode 100644 index 0000000000..b922f59c0c --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/Key.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +/** + * Created by haiwei.luo on 15-02-09. 
+ */ +public class Key { + public static final String ENDPOINT = "endpoint"; + + public static final String ACCESSID = "accessId"; + + public static final String ACCESSKEY = "accessKey"; + + public static final String BUCKET = "bucket"; + + public static final String OBJECT = "object"; + + public static final String CNAME = "cname"; + +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriter.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriter.java new file mode 100644 index 0000000000..90a34ad7bf --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriter.java @@ -0,0 +1,494 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +import java.io.ByteArrayInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.io.StringWriter; +import java.text.DateFormat; +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.UUID; +import java.util.concurrent.Callable; + +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.unstructuredstorage.writer.TextCsvWriterManager; +import com.alibaba.datax.plugin.unstructuredstorage.writer.UnstructuredStorageWriterUtil; +import com.alibaba.datax.plugin.unstructuredstorage.writer.UnstructuredWriter; +import com.alibaba.datax.plugin.writer.osswriter.util.OssUtil; +import com.aliyun.oss.ClientException; +import com.aliyun.oss.OSSClient; +import com.aliyun.oss.OSSException; +import com.aliyun.oss.model.CompleteMultipartUploadRequest; +import com.aliyun.oss.model.CompleteMultipartUploadResult; +import com.aliyun.oss.model.InitiateMultipartUploadRequest; +import com.aliyun.oss.model.InitiateMultipartUploadResult; +import com.aliyun.oss.model.OSSObjectSummary; +import com.aliyun.oss.model.ObjectListing; +import com.aliyun.oss.model.PartETag; +import com.aliyun.oss.model.UploadPartRequest; +import com.aliyun.oss.model.UploadPartResult; + +/** + * Created by haiwei.luo on 15-02-09. + */ +public class OssWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration writerSliceConfig = null; + private OSSClient ossClient = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.validateParameter(); + this.ossClient = OssUtil.initOssClient(this.writerSliceConfig); + } + + private void validateParameter() { + this.writerSliceConfig.getNecessaryValue(Key.ENDPOINT, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.ACCESSID, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.ACCESSKEY, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.BUCKET, + OssWriterErrorCode.REQUIRED_VALUE); + this.writerSliceConfig.getNecessaryValue(Key.OBJECT, + OssWriterErrorCode.REQUIRED_VALUE); + // warn: do not support compress!! 
+ String compress = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.COMPRESS); + if (StringUtils.isNotBlank(compress)) { + String errorMessage = String.format( + "OSS写暂时不支持压缩, 该压缩配置项[%s]不起效用", compress); + LOG.error(errorMessage); + throw DataXException.asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, errorMessage); + + } + UnstructuredStorageWriterUtil + .validateParameter(this.writerSliceConfig); + + } + + @Override + public void prepare() { + LOG.info("begin do prepare..."); + String bucket = this.writerSliceConfig.getString(Key.BUCKET); + String object = this.writerSliceConfig.getString(Key.OBJECT); + String writeMode = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.WRITE_MODE); + // warn: bucket is not exists, create it + try { + // warn: do not create bucket for user + if (!this.ossClient.doesBucketExist(bucket)) { + // this.ossClient.createBucket(bucket); + String errorMessage = String.format( + "您配置的bucket [%s] 不存在, 请您确认您的配置项.", bucket); + LOG.error(errorMessage); + throw DataXException.asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, errorMessage); + } + LOG.info(String.format("access control details [%s].", + this.ossClient.getBucketAcl(bucket).toString())); + + // truncate option handler + if ("truncate".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的Object", + bucket, object)); + // warn: 默认情况下,如果Bucket中的Object数量大于100,则只会返回100个Object + while (true) { + ObjectListing listing = null; + LOG.info("list objects with listObject(bucket, object)"); + listing = this.ossClient.listObjects(bucket, object); + List objectSummarys = listing + .getObjectSummaries(); + for (OSSObjectSummary objectSummary : objectSummarys) { + LOG.info(String.format("delete oss object [%s].", + objectSummary.getKey())); + this.ossClient.deleteObject(bucket, + objectSummary.getKey()); + } + if (objectSummarys.isEmpty()) { + break; + } + } + } else if ("append".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode append, 写入前不做清理工作, 数据写入Bucket [%s] 下, 写入相应Object的前缀为 [%s]", + bucket, object)); + } else if ("nonConflict".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode nonConflict, 开始检查Bucket [%s] 下面以 [%s] 命名开头的Object", + bucket, object)); + ObjectListing listing = this.ossClient.listObjects(bucket, + object); + if (0 < listing.getObjectSummaries().size()) { + StringBuilder objectKeys = new StringBuilder(); + objectKeys.append("[ "); + for (OSSObjectSummary ossObjectSummary : listing + .getObjectSummaries()) { + objectKeys.append(ossObjectSummary.getKey() + " ,"); + } + objectKeys.append(" ]"); + LOG.info(String.format( + "object with prefix [%s] details: %s", object, + objectKeys.toString())); + throw DataXException + .asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的Bucket: [%s] 下面存在其Object有前缀 [%s].", + bucket, object)); + } + } + } catch (OSSException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } catch (ClientException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + @Override + public List split(int mandatoryNumber) { + LOG.info("begin do split..."); + List writerSplitConfigs = new ArrayList(); + String object = this.writerSliceConfig.getString(Key.OBJECT); + String bucket = 
this.writerSliceConfig.getString(Key.BUCKET); + + Set allObjects = new HashSet(); + try { + List ossObjectlisting = this.ossClient + .listObjects(bucket).getObjectSummaries(); + for (OSSObjectSummary objectSummary : ossObjectlisting) { + allObjects.add(objectSummary.getKey()); + } + } catch (OSSException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } catch (ClientException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.OSS_COMM_ERROR, e.getMessage()); + } + + String objectSuffix; + for (int i = 0; i < mandatoryNumber; i++) { + // handle same object name + Configuration splitedTaskConfig = this.writerSliceConfig + .clone(); + + String fullObjectName = null; + objectSuffix = StringUtils.replace( + UUID.randomUUID().toString(), "-", ""); + fullObjectName = String.format("%s__%s", object, objectSuffix); + while (allObjects.contains(fullObjectName)) { + objectSuffix = StringUtils.replace(UUID.randomUUID() + .toString(), "-", ""); + fullObjectName = String.format("%s__%s", object, + objectSuffix); + } + allObjects.add(fullObjectName); + + splitedTaskConfig.set(Key.OBJECT, fullObjectName); + + LOG.info(String.format("splited write object name:[%s]", + fullObjectName)); + + writerSplitConfigs.add(splitedTaskConfig); + } + LOG.info("end do split."); + return writerSplitConfigs; + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + + private OSSClient ossClient; + private Configuration writerSliceConfig; + private String bucket; + private String object; + private String nullFormat; + private String encoding; + private char fieldDelimiter; + private String dateFormat; + private DateFormat dateParse; + private String fileFormat; + private List header; + private Long maxFileSize;// MB + private String suffix; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.ossClient = OssUtil.initOssClient(this.writerSliceConfig); + this.bucket = this.writerSliceConfig.getString(Key.BUCKET); + this.object = this.writerSliceConfig.getString(Key.OBJECT); + this.nullFormat = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.NULL_FORMAT); + this.dateFormat = this.writerSliceConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.DATE_FORMAT, + null); + if (StringUtils.isNotBlank(this.dateFormat)) { + this.dateParse = new SimpleDateFormat(dateFormat); + } + this.encoding = this.writerSliceConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.DEFAULT_ENCODING); + this.fieldDelimiter = this.writerSliceConfig + .getChar( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FIELD_DELIMITER, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.DEFAULT_FIELD_DELIMITER); + this.fileFormat = this.writerSliceConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_FORMAT, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.FILE_FORMAT_TEXT); + this.header = this.writerSliceConfig + .getList( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.HEADER, + null, String.class); + this.maxFileSize = this.writerSliceConfig + .getLong( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.MAX_FILE_SIZE, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.MAX_FILE_SIZE); + this.suffix = 
this.writerSliceConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.SUFFIX, + com.alibaba.datax.plugin.unstructuredstorage.writer.Constant.DEFAULT_SUFFIX); + this.suffix = this.suffix.trim();// warn: need trim + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + // 设置每块字符串长度 + final long partSize = 1024 * 1024 * 10L; + long numberCacul = (this.maxFileSize * 1024 * 1024L) / partSize; + final long maxPartNumber = numberCacul >= 1 ? numberCacul : 1; + int objectRollingNumber = 0; + //warn: may be StringBuffer->StringBuilder + StringWriter sw = new StringWriter(); + StringBuffer sb = sw.getBuffer(); + UnstructuredWriter unstructuredWriter = TextCsvWriterManager + .produceUnstructuredWriter(this.fileFormat, + this.fieldDelimiter, sw); + Record record = null; + + LOG.info(String.format( + "begin do write, each object maxFileSize: [%s]MB...", + maxPartNumber * 10)); + String currentObject = this.object; + InitiateMultipartUploadRequest currentInitiateMultipartUploadRequest = null; + InitiateMultipartUploadResult currentInitiateMultipartUploadResult = null; + boolean gotData = false; + List currentPartETags = null; + // to do: + // 可以根据currentPartNumber做分块级别的重试,InitiateMultipartUploadRequest多次一个currentPartNumber会覆盖原有 + int currentPartNumber = 1; + try { + // warn + boolean needInitMultipartTransform = true; + while ((record = lineReceiver.getFromReader()) != null) { + gotData = true; + // init:begin new multipart upload + if (needInitMultipartTransform) { + if (objectRollingNumber == 0) { + if (StringUtils.isBlank(this.suffix)) { + currentObject = this.object; + } else { + currentObject = String.format("%s%s", + this.object, this.suffix); + } + } else { + // currentObject is like(no suffix) + // myfile__9b886b70fbef11e59a3600163e00068c_1 + if (StringUtils.isBlank(this.suffix)) { + currentObject = String.format("%s_%s", + this.object, objectRollingNumber); + } else { + // or with suffix + // myfile__9b886b70fbef11e59a3600163e00068c_1.csv + currentObject = String.format("%s_%s%s", + this.object, objectRollingNumber, + this.suffix); + } + } + objectRollingNumber++; + currentInitiateMultipartUploadRequest = new InitiateMultipartUploadRequest( + this.bucket, currentObject); + currentInitiateMultipartUploadResult = this.ossClient + .initiateMultipartUpload(currentInitiateMultipartUploadRequest); + currentPartETags = new ArrayList(); + LOG.info(String + .format("write to bucket: [%s] object: [%s] with oss uploadId: [%s]", + this.bucket, currentObject, + currentInitiateMultipartUploadResult + .getUploadId())); + + // each object's header + if (null != this.header && !this.header.isEmpty()) { + unstructuredWriter.writeOneRecord(this.header); + } + // warn + needInitMultipartTransform = false; + currentPartNumber = 1; + } + + // write: upload data to current object + UnstructuredStorageWriterUtil.transportOneRecord(record, + this.nullFormat, this.dateParse, + this.getTaskPluginCollector(), unstructuredWriter); + + if (sb.length() >= partSize) { + this.uploadOnePart(sw, currentPartNumber, + currentInitiateMultipartUploadResult, + currentPartETags, currentObject); + currentPartNumber++; + sb.setLength(0); + } + + // save: end current multipart upload + if (currentPartNumber > maxPartNumber) { + LOG.info(String + .format("current object [%s] size > %s, complete current multipart upload and begin new one", + currentObject, currentPartNumber + * partSize)); + CompleteMultipartUploadRequest currentCompleteMultipartUploadRequest = new 
CompleteMultipartUploadRequest( + this.bucket, currentObject, + currentInitiateMultipartUploadResult + .getUploadId(), currentPartETags); + CompleteMultipartUploadResult currentCompleteMultipartUploadResult = this.ossClient + .completeMultipartUpload(currentCompleteMultipartUploadRequest); + LOG.info(String.format( + "final object [%s] etag is:[%s]", + currentObject, + currentCompleteMultipartUploadResult.getETag())); + // warn + needInitMultipartTransform = true; + } + } + + if (!gotData) { + LOG.info("Receive no data from the source."); + currentInitiateMultipartUploadRequest = new InitiateMultipartUploadRequest( + this.bucket, currentObject); + currentInitiateMultipartUploadResult = this.ossClient + .initiateMultipartUpload(currentInitiateMultipartUploadRequest); + currentPartETags = new ArrayList(); + // each object's header + if (null != this.header && !this.header.isEmpty()) { + unstructuredWriter.writeOneRecord(this.header); + } + } + // warn: may be some data stall in sb + if (0 < sb.length()) { + this.uploadOnePart(sw, currentPartNumber, + currentInitiateMultipartUploadResult, + currentPartETags, currentObject); + } + CompleteMultipartUploadRequest completeMultipartUploadRequest = new CompleteMultipartUploadRequest( + this.bucket, currentObject, + currentInitiateMultipartUploadResult.getUploadId(), + currentPartETags); + CompleteMultipartUploadResult completeMultipartUploadResult = this.ossClient + .completeMultipartUpload(completeMultipartUploadRequest); + LOG.info(String.format("final object etag is:[%s]", + completeMultipartUploadResult.getETag())); + } catch (IOException e) { + // 脏数据UnstructuredStorageWriterUtil.transportOneRecord已经记录,header + // 都是字符串不认为有脏数据 + throw DataXException.asDataXException( + OssWriterErrorCode.Write_OBJECT_ERROR, e.getMessage()); + } catch (Exception e) { + throw DataXException.asDataXException( + OssWriterErrorCode.Write_OBJECT_ERROR, e.getMessage()); + } + LOG.info("end do write"); + } + + /** + * 对于同一个UploadID,该号码不但唯一标识这一块数据,也标识了这块数据在整个文件内的相对位置。 + * 如果你用同一个part号码,上传了新的数据,那么OSS上已有的这个号码的Part数据将被覆盖。 + * + * @throws Exception + * */ + private void uploadOnePart( + final StringWriter sw, + final int partNumber, + final InitiateMultipartUploadResult initiateMultipartUploadResult, + final List partETags, final String currentObject) + throws Exception { + final String encoding = this.encoding; + final String bucket = this.bucket; + final OSSClient ossClient = this.ossClient; + RetryUtil.executeWithRetry(new Callable() { + @Override + public Boolean call() throws Exception { + byte[] byteArray = sw.toString().getBytes(encoding); + InputStream inputStream = new ByteArrayInputStream( + byteArray); + // 创建UploadPartRequest,上传分块 + UploadPartRequest uploadPartRequest = new UploadPartRequest(); + uploadPartRequest.setBucketName(bucket); + uploadPartRequest.setKey(currentObject); + uploadPartRequest.setUploadId(initiateMultipartUploadResult + .getUploadId()); + uploadPartRequest.setInputStream(inputStream); + uploadPartRequest.setPartSize(byteArray.length); + uploadPartRequest.setPartNumber(partNumber); + UploadPartResult uploadPartResult = ossClient + .uploadPart(uploadPartRequest); + partETags.add(uploadPartResult.getPartETag()); + LOG.info(String + .format("upload part [%s] size [%s] Byte has been completed.", + partNumber, byteArray.length)); + IOUtils.closeQuietly(inputStream); + return true; + } + }, 3, 1000L, false); + } + + @Override + public void prepare() { + + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + 
} +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriterErrorCode.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriterErrorCode.java new file mode 100644 index 0000000000..c258460625 --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/OssWriterErrorCode.java @@ -0,0 +1,41 @@ +package com.alibaba.datax.plugin.writer.osswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-17. + */ +public enum OssWriterErrorCode implements ErrorCode { + + CONFIG_INVALID_EXCEPTION("OssWriter-00", "您的参数配置错误."), + REQUIRED_VALUE("OssWriter-01", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("OssWriter-02", "您填写的参数值不合法."), + Write_OBJECT_ERROR("OssWriter-03", "您配置的目标Object在写入时异常."), + OSS_COMM_ERROR("OssWriter-05", "执行相应的OSS操作异常."), + ; + + private final String code; + private final String description; + + private OssWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } + +} diff --git a/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/util/OssUtil.java b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/util/OssUtil.java new file mode 100644 index 0000000000..ea63a5a636 --- /dev/null +++ b/osswriter/src/main/java/com/alibaba/datax/plugin/writer/osswriter/util/OssUtil.java @@ -0,0 +1,43 @@ +package com.alibaba.datax.plugin.writer.osswriter.util; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.commons.lang3.StringUtils; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.osswriter.Constant; +import com.alibaba.datax.plugin.writer.osswriter.Key; +import com.alibaba.datax.plugin.writer.osswriter.OssWriterErrorCode; +import com.aliyun.oss.ClientConfiguration; +import com.aliyun.oss.OSSClient; + +public class OssUtil { + public static OSSClient initOssClient(Configuration conf) { + String endpoint = conf.getString(Key.ENDPOINT); + String accessId = conf.getString(Key.ACCESSID); + String accessKey = conf.getString(Key.ACCESSKEY); + ClientConfiguration ossConf = new ClientConfiguration(); + ossConf.setSocketTimeout(Constant.SOCKETTIMEOUT); + + // .aliyun.com, if you are .aliyun.ga you need config this + String cname = conf.getString(Key.CNAME); + if (StringUtils.isNotBlank(cname)) { + List cnameExcludeList = new ArrayList(); + cnameExcludeList.add(cname); + ossConf.setCnameExcludeList(cnameExcludeList); + } + + OSSClient client = null; + try { + client = new OSSClient(endpoint, accessId, accessKey, ossConf); + + } catch (IllegalArgumentException e) { + throw DataXException.asDataXException( + OssWriterErrorCode.ILLEGAL_VALUE, e.getMessage()); + } + + return client; + } +} diff --git a/osswriter/src/main/resources/plugin.json b/osswriter/src/main/resources/plugin.json new file mode 100644 index 0000000000..d7d99960b4 --- /dev/null +++ b/osswriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "osswriter", + "class": "com.alibaba.datax.plugin.writer.osswriter.OssWriter", + "description": "", + "developer": "alibaba" +} + diff --git a/osswriter/src/main/resources/plugin_job_template.json 
b/osswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..0692b19165 --- /dev/null +++ b/osswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "osswriter", + "parameter": { + "endpoint": "", + "accessId": "", + "accessKey": "", + "bucket": "", + "object": "", + "encoding": "", + "fieldDelimiter": "", + "writeMode": "" + } +} \ No newline at end of file diff --git a/otsreader/doc/otsreader.md b/otsreader/doc/otsreader.md new file mode 100644 index 0000000000..1297dbd69e --- /dev/null +++ b/otsreader/doc/otsreader.md @@ -0,0 +1,340 @@ + +# OTSReader 插件文档 + + +___ + + +## 1 快速介绍 + +OTSReader插件实现了从OTS读取数据,并可以通过用户指定抽取数据范围可方便的实现数据增量抽取的需求。目前支持三种抽取方式: + +* 全表抽取 +* 范围抽取 +* 指定分片抽取 + +OTS是构建在阿里云飞天分布式系统之上的 NoSQL数据库服务,提供海量结构化数据的存储和实时访问。OTS 以实例和表的形式组织数据,通过数据分片和负载均衡技术,实现规模上的无缝扩展。 + +## 2 实现原理 + +简而言之,OTSReader通过OTS官方Java SDK连接到OTS服务端,获取并按照DataX官方协议标准转为DataX字段信息传递给下游Writer端。 + +OTSReader会根据OTS的表范围,按照Datax并发的数目N,将范围等分为N份Task。每个Task都会有一个OTSReader线程来执行。 + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从OTS全表同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + }, + "content": [ + { + "reader": { + "name": "otsreader", + "parameter": { + /* ----------- 必填 --------------*/ + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + + // 导出数据表的表名 + "table":"", + + // 需要导出的列名,支持重复列和常量列,区分大小写 + // 常量列:类型支持STRING,INT,DOUBLE,BOOL和BINARY + // 备注:BINARY需要通过Base64转换为对应的字符串传入插件 + "column":[ + {"name":"col1"}, // 普通列 + {"name":"col2"}, // 普通列 + {"name":"col3"}, // 普通列 + {"type":"STRING", "value" : "bazhen"}, // 常量列(字符串) + {"type":"INT", "value" : ""}, // 常量列(整形) + {"type":"DOUBLE", "value" : ""}, // 常量列(浮点) + {"type":"BOOL", "value" : ""}, // 常量列(布尔) + {"type":"BINARY", "value" : "Base64(bin)"} // 常量列(二进制),使用Base64编码完成 + ], + "range":{ + // 导出数据的起始范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "begin":[ + {"type":"INF_MIN"}, + ], + // 导出数据的结束范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "end":[ + {"type":"INF_MAX"}, + ] + } + } + }, + "writer": {} + } + ] + } +} +``` + +* 配置一个定义抽取范围的OTSReader: + +``` +{ + "job": { + "setting": { + "speed": { + "byte":10485760 + }, + "errorLimit":0.0 + }, + "content": [ + { + "reader": { + "name": "otsreader", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + + // 导出数据表的表名 + "table":"", + + // 需要导出的列名,支持重复类和常量列,区分大小写 + // 常量列:类型支持STRING,INT,DOUBLE,BOOL和BINARY + // 备注:BINARY需要通过Base64转换为对应的字符串传入插件 + "column":[ + {"name":"col1"}, // 普通列 + {"name":"col2"}, // 普通列 + {"name":"col3"}, // 普通列 + {"type":"STRING","value" : ""}, // 常量列(字符串) + {"type":"INT","value" : ""}, // 常量列(整形) + {"type":"DOUBLE","value" : ""}, // 常量列(浮点) + {"type":"BOOL","value" : ""}, // 常量列(布尔) + {"type":"BINARY","value" : "Base64(bin)"} // 常量列(二进制) + ], + "range":{ + // 导出数据的起始范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "begin":[ + {"type":"INF_MIN"}, + {"type":"INF_MAX"}, + {"type":"STRING", "value":"hello"}, + {"type":"INT", "value":"2999"}, + ], + // 导出数据的结束范围 + // 支持INF_MIN, INF_MAX, STRING, INT + "end":[ + {"type":"INF_MAX"}, + {"type":"INF_MIN"}, + {"type":"STRING", "value":"hello"}, + {"type":"INT", "value":"2999"}, + ] + } + } + }, + "writer": {} + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OTS Server的EndPoint地址,例如http://bazhen.cn−hangzhou.ots.aliyuncs.com。 + + * 必选:是
+
+	* 默认值:无
+
+* **accessId**
+
+	* 描述:OTS的accessId
+
+	* 必选:是
+
+	* 默认值:无
+
+* **accessKey**
+
+	* 描述:OTS的accessKey
+
+	* 必选:是
+
+	* 默认值:无
+
+* **instanceName**
+
+	* 描述:OTS的实例名称,实例是用户使用和管理 OTS 服务的实体,用户在开通 OTS 服务之后,需要通过管理控制台来创建实例,然后在实例内进行表的创建和管理。实例是 OTS 资源管理的基础单元,OTS 对应用程序的访问控制和资源计量都在实例级别完成。
+
+	* 必选:是
+
+	* 默认值:无
+
+
+* **table**
+
+	* 描述:所选取的需要抽取的表名称,这里有且只能填写一张表。OTS中不存在多表同步的需求。
+
+	* 必选:是
+
+	* 默认值:无
+
+* **column**
+
+	* 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。由于OTS本身是NoSQL系统,在OTSReader抽取数据过程中,必须指定相应的字段名称。
+
+	支持普通的列读取,例如: {"name":"col1"}
+
+	支持部分列读取,如用户不配置该列,则OTSReader不予读取。
+
+	支持常量列读取,例如: {"type":"STRING", "value" : "DataX"}。使用type描述常量类型,目前支持STRING、INT、DOUBLE、BOOL、BINARY(用户使用Base64编码填写)、INF_MIN(OTS的系统限定最小值,使用该值用户不能填写value属性,否则报错)、INF_MAX(OTS的系统限定最大值,使用该值用户不能填写value属性,否则报错)。
+
+	不支持函数或者自定义表达式,由于OTS本身不提供类似SQL的函数或者表达式功能,OTSReader也不能提供函数或表达式列功能。
+
+	* 必选:是
+
+	* 默认值:无
+
+* **begin/end**
+
+	* 描述:该配置项必须配对使用,用于支持OTS表范围抽取。begin/end中描述的是OTS **PrimaryKey**的区间分布状态,而且必须保证区间覆盖到所有的PrimaryKey,**需要指定该表下所有的PrimaryKey范围,不能遗漏任意一个PrimaryKey**,对于无限大小的区间,可以使用{"type":"INF_MIN"},{"type":"INF_MAX"}指代。例如对一张主键为 [DeviceID, SellerID] 的OTS表进行抽取任务,begin/end可以配置为:
+
+	```json
+	"range": {
+	    "begin": [
+	        {"type":"INF_MIN"},             //指定DeviceID的起始位置为最小值
+	        {"type":"INT", "value":"0"}     //指定SellerID的起始位置为0
+	    ],
+	    "end": [
+	        {"type":"INF_MAX"},             //指定DeviceID抽取的最大值
+	        {"type":"INT", "value":"9999"}  //指定SellerID抽取的最大值
+	    ]
+	}
+	```
+
+	如果要对上述表抽取全表,可以使用如下配置:
+
+	```
+	"range": {
+	    "begin": [
+	        {"type":"INF_MIN"},  //指定DeviceID最小值
+	        {"type":"INF_MIN"}   //指定SellerID最小值
+	    ],
+	    "end": [
+	        {"type":"INF_MAX"},  //指定DeviceID抽取最大值
+	        {"type":"INF_MAX"}   //指定SellerID抽取最大值
+	    ]
+	}
+	```
+
+	* 必选:是
+
+	* 默认值:空
+
+* **split**
+
+	* 描述:该配置项属于高级配置项,是用户自定义的切分配置信息,普通情况下不建议使用。适用场景通常是OTS数据存储发生热点、OTSReader自动切分的策略不能生效的情况下,使用用户自定义的切分规则。split指定的是在Begin、End区间内的切分点,且只能是partitionKey的切分点信息,即split中仅配置partitionKey,而不需要指定全部的PrimaryKey。(切分点与Task的对应关系,参见本节末尾的示意说明。)
+
+	例如对一张主键为 [DeviceID, SellerID] 的OTS表进行抽取任务,可以配置为:
+
+	```json
+	"range": {
+	    "begin": [
+	        {"type":"INF_MIN"},  //指定DeviceID最小值
+	        {"type":"INF_MIN"}   //指定SellerID最小值
+	    ],
+	    "end": [
+	        {"type":"INF_MAX"},  //指定DeviceID抽取最大值
+	        {"type":"INF_MAX"}   //指定SellerID抽取最大值
+	    ],
+	    // 用户指定的切分点,如果指定了切分点,Job将按照begin、end和split进行Task的切分,
+	    // 切分的列只能是Partition Key(PrimaryKey的第一列)
+	    // 支持INF_MIN, INF_MAX, STRING, INT
+	    "split":[
+	        {"type":"STRING", "value":"1"},
+	        {"type":"STRING", "value":"2"},
+	        {"type":"STRING", "value":"3"},
+	        {"type":"STRING", "value":"4"},
+	        {"type":"STRING", "value":"5"}
+	    ]
+	}
+	```
+
+	* 必选:否
+
+	* 默认值:无
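+
+	补充示意(仅为帮助理解的草图,切分点取值"a"、"m"、"t"为假设值,实际切分结果以插件运行日志为准):当split中配置了N个切分点时,Job通常会把begin/end覆盖的范围按Partition Key切成N+1个左闭右开的区间,每个区间对应一个Task(GetRange按InclusiveStart/ExclusiveEnd语义读取)。例如只配置3个切分点时,切分结果大致如下:
+
+	```
+	"split": [
+	    {"type":"STRING", "value":"a"},
+	    {"type":"STRING", "value":"m"},
+	    {"type":"STRING", "value":"t"}
+	]
+
+	Task1: [INF_MIN, "a")
+	Task2: ["a", "m")
+	Task3: ["m", "t")
+	Task4: ["t", INF_MAX)
+	```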
+ + +### 3.3 类型转换 + +目前OTSReader支持所有OTS类型,下面列出OTSReader针对OTS类型转换列表: + + +| DataX 内部类型| OTS 数据类型 | +| -------- | ----- | +| Long |Integer | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Bytes |Binary | + + +* 注意,OTS本身不支持日期型类型。应用层一般使用Long报错时间的Unix TimeStamp。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +15列String(10 Byte), 2两列Integer(8 Byte),总计168Byte/r。 + +#### 4.1.2 机器参数 + +OTS端:3台前端机,5台后端机 + +DataX运行端: 24核CPU, 98GB内存 + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + +|并发数|DataX CPU|OTS 流量|DATAX流量 | 前端QPS| 前端延时| +|--------|--------| --------|--------|--------|------| +|2| 36% |6.3M/s |12739 rec/s | 4.7 | 308ms | +|11| 155% | 32M/s |60732 rec/s | 23.9 | 412ms | +|50| 377% | 73M/s |145139 rec/s | 54 | 874ms | +|100| 448% | 82M/s | 156262 rec/s |60 | 1570ms | + + + +## 5 约束限制 + +### 5.1 一致性约束 + +OTS是类BigTable的存储系统,OTS本身能够保证单行写事务性,无法提供跨行级别的事务。对于OTSReader而言也无法提供全表的一致性视图。例如对于OTSReader在0点启动的数据同步任务,在整个表数据同步过程中,OTSReader同样会抽取到后续更新的数据,无法提供准确的0点时刻该表一致性视图。 + +### 5.2 增量数据同步 + +OTS本质上KV存储,目前只能针对PK进行范围查询,暂不支持按照字段范围抽取数据。因此只能对于增量查询,如果PK能够表示范围信息,例如自增ID,或者时间戳。 + +自增ID,OTSReader可以通过记录上次最大的ID信息,通过指定Range范围进行增量抽取。这样使用的前提是OTS中的PrimaryKey必须包含主键自增列(自增主键需要使用OTS应用方生成。) + +时间戳, OTSReader可以通过PK过滤时间戳,通过制定Range范围进行增量抽取。这样使用的前提是OTS中的PrimaryKey必须包含主键时间列(时间主键需要使用OTS应用方生成。) + +## 6 FAQ + diff --git a/otsreader/pom.xml b/otsreader/pom.xml new file mode 100644 index 0000000000..f6f7673b1f --- /dev/null +++ b/otsreader/pom.xml @@ -0,0 +1,93 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + otsreader + otsreader + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.aliyun.openservices + ots-public + 2.2.4 + + + com.google.code.gson + gson + 2.2.4 + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + org.apache.maven.plugins + maven-surefire-plugin + 2.5 + + all + 10 + true + -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=. + + **/unittest/*.java + **/functiontest/*.java + + + + + + + diff --git a/otsreader/src/main/assembly/package.xml b/otsreader/src/main/assembly/package.xml new file mode 100644 index 0000000000..7ee305d14a --- /dev/null +++ b/otsreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/otsreader + + + target/ + + otsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/otsreader + + + + + + false + plugin/reader/otsreader/libs + runtime + + + diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/Key.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/Key.java new file mode 100644 index 0000000000..da6d4a5f78 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/Key.java @@ -0,0 +1,50 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.alibaba.datax.plugin.reader.otsreader; + +public final class Key { + /* ots account configuration */ + public final static String OTS_ENDPOINT = "endpoint"; + + public final static String OTS_ACCESSID = "accessId"; + + public final static String OTS_ACCESSKEY = "accessKey"; + + public final static String OTS_INSTANCE_NAME = "instanceName"; + + public final static String TABLE_NAME = "table"; + + public final static String COLUMN = "column"; + + //====================================================== + // 注意:如果range-begin大于range-end,那么系统将逆序导出所有数据 + //====================================================== + // Range的组织格式 + // "range":{ + // "begin":[], + // "end":[], + // "split":[] + // } + public final static String RANGE = "range"; + + public final static String RANGE_BEGIN = "begin"; + + public final static String RANGE_END = "end"; + + public final static String RANGE_SPLIT = "split"; + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReader.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReader.java new file mode 100644 index 0000000000..8880c07eda --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReader.java @@ -0,0 +1,124 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.utils.Common; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; + +public class OtsReader extends Reader { + + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + private OtsReaderMasterProxy proxy = new OtsReaderMasterProxy(); + @Override + public void init() { + LOG.info("init() begin ..."); + try { + this.proxy.init(getPluginJobConf()); + } catch (OTSException e) { + LOG.error("OTSException. ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException. ErrorCode:{}, ErrorMsg:{}", + new Object[]{e.getErrorCode(), e.getMessage()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. 
ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("init() end ..."); + } + + @Override + public void destroy() { + this.proxy.close(); + } + + @Override + public List split(int adviceNumber) { + LOG.info("split() begin ..."); + + if (adviceNumber <= 0) { + throw DataXException.asDataXException(OtsReaderError.ERROR, "Datax input adviceNumber <= 0."); + } + + List confs = null; + + try { + confs = this.proxy.split(adviceNumber); + } catch (OTSException e) { + LOG.error("OTSException. ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException. ErrorCode:{}, ErrorMsg:{}", + new Object[]{e.getErrorCode(), e.getMessage()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.ERROR, Common.getDetailMessage(e), e); + } + + LOG.info("split() end ..."); + return confs; + } + } + + public static class Task extends Reader.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private OtsReaderSlaveProxy proxy = new OtsReaderSlaveProxy(); + + @Override + public void init() { + } + + @Override + public void destroy() { + } + + @Override + public void startRead(RecordSender recordSender) { + LOG.info("startRead() begin ..."); + try { + this.proxy.read(recordSender,getPluginJobConf()); + } catch (OTSException e) { + LOG.error("OTSException. ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException. ErrorCode:{}, ErrorMsg:{}", + new Object[]{e.getErrorCode(), e.getMessage()}); + LOG.error("Stack", e); + throw DataXException.asDataXException(new OtsReaderError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. 
ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsReaderError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("startRead() end ..."); + } + + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderError.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderError.java new file mode 100644 index 0000000000..05a13c1a72 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderError.java @@ -0,0 +1,42 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public class OtsReaderError implements ErrorCode { + + private String code; + + private String description; + + // TODO + // 这一块需要DATAX来统一定义分类, OTS基于这些分类在细化 + // 所以暂定两个基础的Error Code,其他错误统一使用OTS的错误码和错误消息 + + public final static OtsReaderError ERROR = new OtsReaderError( + "OtsReaderError", + "该错误表示插件的内部错误,表示系统没有处理到的异常"); + public final static OtsReaderError INVALID_PARAM = new OtsReaderError( + "OtsReaderInvalidParameter", + "该错误表示参数错误,表示用户输入了错误的参数格式等"); + + public OtsReaderError (String code) { + this.code = code; + this.description = code; + } + + public OtsReaderError (String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderMasterProxy.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderMasterProxy.java new file mode 100644 index 0000000000..2b758f0683 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderMasterProxy.java @@ -0,0 +1,221 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.callable.GetFirstRowPrimaryKeyCallable; +import com.alibaba.datax.plugin.reader.otsreader.callable.GetTableMetaCallable; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConf; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConst; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.alibaba.datax.plugin.reader.otsreader.utils.ParamChecker; +import com.alibaba.datax.plugin.reader.otsreader.utils.Common; +import com.alibaba.datax.plugin.reader.otsreader.utils.GsonParser; +import com.alibaba.datax.plugin.reader.otsreader.utils.ReaderModelParser; +import com.alibaba.datax.plugin.reader.otsreader.utils.RangeSplit; +import com.alibaba.datax.plugin.reader.otsreader.utils.RetryHelper; +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class OtsReaderMasterProxy { + + private OTSConf conf = new OTSConf(); + + private OTSRange range = null; + + private OTSClient ots = null; + + private TableMeta meta = null; + + private Direction direction = null; + + private static final Logger LOG = LoggerFactory.getLogger(OtsReaderMasterProxy.class); + + /** + * 1.检查参数是否为 + * 
null,endpoint,accessid,accesskey,instance-name,table,column,range-begin,range-end,range-split + * 2.检查参数是否为空字符串 + * endpoint,accessid,accesskey,instance-name,table + * 3.检查是否为空数组 + * column + * 4.检查Range的类型个个数是否和PrimaryKey匹配 + * column,range-begin,range-end + * 5.检查Range Split 顺序和类型是否Range一致,类型是否于PartitionKey一致 + * column-split + * @param param + * @throws Exception + */ + public void init(Configuration param) throws Exception { + // 默认参数 + // 每次重试的时间都是上一次的一倍,当sleep时间大于30秒时,Sleep重试时间不在增长。18次能覆盖OTS的Failover时间5分钟 + conf.setRetry(param.getInt(OTSConst.RETRY, 18)); + conf.setSleepInMilliSecond(param.getInt(OTSConst.SLEEP_IN_MILLI_SECOND, 100)); + + // 必选参数 + conf.setEndpoint(ParamChecker.checkStringAndGet(param, Key.OTS_ENDPOINT)); + conf.setAccessId(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSID)); + conf.setAccesskey(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSKEY)); + conf.setInstanceName(ParamChecker.checkStringAndGet(param, Key.OTS_INSTANCE_NAME)); + conf.setTableName(ParamChecker.checkStringAndGet(param, Key.TABLE_NAME)); + + ots = new OTSClient( + this.conf.getEndpoint(), + this.conf.getAccessId(), + this.conf.getAccesskey(), + this.conf.getInstanceName()); + + meta = getTableMeta(ots, conf.getTableName()); + LOG.info("Table Meta : {}", GsonParser.metaToJson(meta)); + + conf.setColumns(ReaderModelParser.parseOTSColumnList(ParamChecker.checkListAndGet(param, Key.COLUMN, true))); + + Map rangeMap = ParamChecker.checkMapAndGet(param, Key.RANGE, true); + conf.setRangeBegin(ReaderModelParser.parsePrimaryKey(ParamChecker.checkListAndGet(rangeMap, Key.RANGE_BEGIN, false))); + conf.setRangeEnd(ReaderModelParser.parsePrimaryKey(ParamChecker.checkListAndGet(rangeMap, Key.RANGE_END, false))); + + range = ParamChecker.checkRangeAndGet(meta, this.conf.getRangeBegin(), this.conf.getRangeEnd()); + + direction = ParamChecker.checkDirectionAndEnd(meta, range.getBegin(), range.getEnd()); + LOG.info("Direction : {}", direction); + + List points = ReaderModelParser.parsePrimaryKey(ParamChecker.checkListAndGet(rangeMap, Key.RANGE_SPLIT)); + ParamChecker.checkInputSplitPoints(meta, range, direction, points); + conf.setRangeSplit(points); + } + + public List split(int num) throws Exception { + LOG.info("Expect split num : " + num); + + List configurations = new ArrayList(); + + List ranges = null; + + if (this.conf.getRangeSplit() != null) { // 用户显示指定了拆分范围 + LOG.info("Begin userDefinedRangeSplit"); + ranges = userDefinedRangeSplit(meta, range, this.conf.getRangeSplit()); + LOG.info("End userDefinedRangeSplit"); + } else { // 采用默认的切分算法 + LOG.info("Begin defaultRangeSplit"); + ranges = defaultRangeSplit(ots, meta, range, num); + LOG.info("End defaultRangeSplit"); + } + + // 解决大量的Split Point序列化消耗内存的问题 + // 因为slave中不会使用这个配置,所以置为空 + this.conf.setRangeSplit(null); + + for (OTSRange item : ranges) { + Configuration configuration = Configuration.newDefault(); + configuration.set(OTSConst.OTS_CONF, GsonParser.confToJson(this.conf)); + configuration.set(OTSConst.OTS_RANGE, GsonParser.rangeToJson(item)); + configuration.set(OTSConst.OTS_DIRECTION, GsonParser.directionToJson(direction)); + configurations.add(configuration); + } + + LOG.info("Configuration list count : " + configurations.size()); + + return configurations; + } + + public OTSConf getConf() { + return conf; + } + + public void close() { + ots.shutdown(); + } + + // private function + + private TableMeta getTableMeta(OTSClient ots, String tableName) throws Exception { + return RetryHelper.executeWithRetry( + new GetTableMetaCallable(ots, 
tableName), + conf.getRetry(), + conf.getSleepInMilliSecond() + ); + } + + private RowPrimaryKey getPKOfFirstRow( + OTSRange range , Direction direction) throws Exception { + + RangeRowQueryCriteria cur = new RangeRowQueryCriteria(this.conf.getTableName()); + cur.setInclusiveStartPrimaryKey(range.getBegin()); + cur.setExclusiveEndPrimaryKey(range.getEnd()); + cur.setLimit(1); + cur.setColumnsToGet(Common.getPrimaryKeyNameList(meta)); + cur.setDirection(direction); + + return RetryHelper.executeWithRetry( + new GetFirstRowPrimaryKeyCallable(ots, meta, cur), + conf.getRetry(), + conf.getSleepInMilliSecond() + ); + } + + private List defaultRangeSplit(OTSClient ots, TableMeta meta, OTSRange range, int num) throws Exception { + if (num == 1) { + List ranges = new ArrayList(); + ranges.add(range); + return ranges; + } + + OTSRange reverseRange = new OTSRange(); + reverseRange.setBegin(range.getEnd()); + reverseRange.setEnd(range.getBegin()); + + Direction reverseDirection = (direction == Direction.FORWARD ? Direction.BACKWARD : Direction.FORWARD); + + RowPrimaryKey realBegin = getPKOfFirstRow(range, direction); + RowPrimaryKey realEnd = getPKOfFirstRow(reverseRange, reverseDirection); + + // 因为如果其中一行为空,表示这个范围内至多有一行数据 + // 所以不再细分,直接使用用户定义的范围 + if (realBegin == null || realEnd == null) { + List ranges = new ArrayList(); + ranges.add(range); + return ranges; + } + + // 如果出现realBegin,realEnd的方向和direction不一致的情况,直接返回range + int cmp = Common.compareRangeBeginAndEnd(meta, realBegin, realEnd); + Direction realDirection = cmp > 0 ? Direction.BACKWARD : Direction.FORWARD; + if (realDirection != direction) { + LOG.warn("Expect '" + direction + "', but direction of realBegin and readlEnd is '" + realDirection + "'"); + List ranges = new ArrayList(); + ranges.add(range); + return ranges; + } + + List ranges = RangeSplit.rangeSplitByCount(meta, realBegin, realEnd, num); + + if (ranges.isEmpty()) { // 当PartitionKey相等时,工具内部不会切分Range + ranges.add(range); + } else { + // replace first and last + OTSRange first = ranges.get(0); + OTSRange last = ranges.get(ranges.size() - 1); + + first.setBegin(range.getBegin()); + last.setEnd(range.getEnd()); + } + + return ranges; + } + + private List userDefinedRangeSplit(TableMeta meta, OTSRange range, List points) { + List ranges = RangeSplit.rangeSplitByPoint(meta, range.getBegin(), range.getEnd(), points); + if (ranges.isEmpty()) { // 当PartitionKey相等时,工具内部不会切分Range + ranges.add(range); + } + return ranges; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderSlaveProxy.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderSlaveProxy.java new file mode 100644 index 0000000000..e64b4e7e24 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/OtsReaderSlaveProxy.java @@ -0,0 +1,135 @@ +package com.alibaba.datax.plugin.reader.otsreader; + +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.callable.GetRangeCallable; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConf; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConst; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.alibaba.datax.plugin.reader.otsreader.utils.Common; 
+import com.alibaba.datax.plugin.reader.otsreader.utils.GsonParser; +import com.alibaba.datax.plugin.reader.otsreader.utils.DefaultNoRetry; +import com.alibaba.datax.plugin.reader.otsreader.utils.RetryHelper; +import com.aliyun.openservices.ots.OTSClientAsync; +import com.aliyun.openservices.ots.OTSServiceConfiguration; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.GetRangeRequest; +import com.aliyun.openservices.ots.model.GetRangeResult; +import com.aliyun.openservices.ots.model.OTSFuture; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; +import com.aliyun.openservices.ots.model.Row; +import com.aliyun.openservices.ots.model.RowPrimaryKey; + +public class OtsReaderSlaveProxy { + + class RequestItem { + private RangeRowQueryCriteria criteria; + private OTSFuture future; + + RequestItem(RangeRowQueryCriteria criteria, OTSFuture future) { + this.criteria = criteria; + this.future = future; + } + + public RangeRowQueryCriteria getCriteria() { + return criteria; + } + + public OTSFuture getFuture() { + return future; + } + } + + private static final Logger LOG = LoggerFactory.getLogger(OtsReaderSlaveProxy.class); + + private void rowsToSender(List rows, RecordSender sender, List columns) { + for (Row row : rows) { + Record line = sender.createRecord(); + line = Common.parseRowToLine(row, columns, line); + + LOG.debug("Reader send record : {}", line.toString()); + + sender.sendToWriter(line); + } + } + + private RangeRowQueryCriteria generateRangeRowQueryCriteria(String tableName, RowPrimaryKey begin, RowPrimaryKey end, Direction direction, List columns) { + RangeRowQueryCriteria criteria = new RangeRowQueryCriteria(tableName); + criteria.setInclusiveStartPrimaryKey(begin); + criteria.setDirection(direction); + criteria.setColumnsToGet(columns); + criteria.setLimit(-1); + criteria.setExclusiveEndPrimaryKey(end); + return criteria; + } + + private RequestItem generateRequestItem( + OTSClientAsync ots, + OTSConf conf, + RowPrimaryKey begin, + RowPrimaryKey end, + Direction direction, + List columns) throws Exception { + RangeRowQueryCriteria criteria = generateRangeRowQueryCriteria(conf.getTableName(), begin, end, direction, columns); + + GetRangeRequest request = new GetRangeRequest(); + request.setRangeRowQueryCriteria(criteria); + OTSFuture future = ots.getRange(request); + + return new RequestItem(criteria, future); + } + + public void read(RecordSender sender, Configuration configuration) throws Exception { + LOG.info("read begin."); + + OTSConf conf = GsonParser.jsonToConf(configuration.getString(OTSConst.OTS_CONF)); + OTSRange range = GsonParser.jsonToRange(configuration.getString(OTSConst.OTS_RANGE)); + Direction direction = GsonParser.jsonToDirection(configuration.getString(OTSConst.OTS_DIRECTION)); + + OTSServiceConfiguration configure = new OTSServiceConfiguration(); + configure.setRetryStrategy(new DefaultNoRetry()); + + OTSClientAsync ots = new OTSClientAsync( + conf.getEndpoint(), + conf.getAccessId(), + conf.getAccesskey(), + conf.getInstanceName(), + null, + configure, + null); + + RowPrimaryKey token = range.getBegin(); + List columns = Common.getNormalColumnNameList(conf.getColumns()); + + RequestItem request = null; + + do { + LOG.debug("Next token : {}", GsonParser.rowPrimaryKeyToJson(token)); + if (request == null) { + request = generateRequestItem(ots, conf, token, range.getEnd(), direction, columns); + } else { + RequestItem req = request; + + GetRangeResult result = RetryHelper.executeWithRetry( + new 
GetRangeCallable(ots, req.getCriteria(), req.getFuture()), + conf.getRetry(), + conf.getSleepInMilliSecond() + ); + if ((token = result.getNextStartPrimaryKey()) != null) { + request = generateRequestItem(ots, conf, token, range.getEnd(), direction, columns); + } + + rowsToSender(result.getRows(), sender, conf.getColumns()); + } + } while (token != null); + ots.shutdown(); + LOG.info("read end."); + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/OTSColumnAdaptor.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/OTSColumnAdaptor.java new file mode 100644 index 0000000000..25f9b682c2 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/OTSColumnAdaptor.java @@ -0,0 +1,117 @@ +package com.alibaba.datax.plugin.reader.otsreader.adaptor; + +import java.lang.reflect.Type; + +import org.apache.commons.codec.binary.Base64; + +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.aliyun.openservices.ots.model.ColumnType; +import com.google.gson.JsonDeserializationContext; +import com.google.gson.JsonDeserializer; +import com.google.gson.JsonElement; +import com.google.gson.JsonObject; +import com.google.gson.JsonParseException; +import com.google.gson.JsonPrimitive; +import com.google.gson.JsonSerializationContext; +import com.google.gson.JsonSerializer; + +public class OTSColumnAdaptor implements JsonDeserializer, JsonSerializer{ + private final static String NAME = "name"; + private final static String COLUMN_TYPE = "column_type"; + private final static String VALUE_TYPE = "value_type"; + private final static String VALUE = "value"; + + private void serializeConstColumn(JsonObject json, OTSColumn obj) { + switch (obj.getValueType()) { + case STRING : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.STRING.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asString())); + break; + case INTEGER : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.INTEGER.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asLong())); + break; + case DOUBLE : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.DOUBLE.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asDouble())); + break; + case BOOLEAN : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.BOOLEAN.toString())); + json.add(VALUE, new JsonPrimitive(obj.getValue().asBoolean())); + break; + case BINARY : + json.add(VALUE_TYPE, new JsonPrimitive(ColumnType.BINARY.toString())); + json.add(VALUE, new JsonPrimitive(Base64.encodeBase64String(obj.getValue().asBytes()))); + break; + default: + throw new IllegalArgumentException("Unsupport serialize the type : " + obj.getValueType() + ""); + } + } + + private OTSColumn deserializeConstColumn(JsonObject obj) { + String strType = obj.getAsJsonPrimitive(VALUE_TYPE).getAsString(); + ColumnType type = ColumnType.valueOf(strType); + + JsonPrimitive jsonValue = obj.getAsJsonPrimitive(VALUE); + + switch (type) { + case STRING : + return OTSColumn.fromConstStringColumn(jsonValue.getAsString()); + case INTEGER : + return OTSColumn.fromConstIntegerColumn(jsonValue.getAsLong()); + case DOUBLE : + return OTSColumn.fromConstDoubleColumn(jsonValue.getAsDouble()); + case BOOLEAN : + return OTSColumn.fromConstBoolColumn(jsonValue.getAsBoolean()); + case BINARY : + return OTSColumn.fromConstBytesColumn(Base64.decodeBase64(jsonValue.getAsString())); + default: + throw new IllegalArgumentException("Unsupport deserialize the type : " + type + ""); + 
} + } + + private void serializeNormalColumn(JsonObject json, OTSColumn obj) { + json.add(NAME, new JsonPrimitive(obj.getName())); + } + + private OTSColumn deserializeNormarlColumn(JsonObject obj) { + return OTSColumn.fromNormalColumn(obj.getAsJsonPrimitive(NAME).getAsString()); + } + + @Override + public JsonElement serialize(OTSColumn obj, Type t, + JsonSerializationContext c) { + JsonObject json = new JsonObject(); + + switch (obj.getColumnType()) { + case CONST: + json.add(COLUMN_TYPE, new JsonPrimitive(OTSColumn.OTSColumnType.CONST.toString())); + serializeConstColumn(json, obj); + break; + case NORMAL: + json.add(COLUMN_TYPE, new JsonPrimitive(OTSColumn.OTSColumnType.NORMAL.toString())); + serializeNormalColumn(json, obj); + break; + default: + throw new IllegalArgumentException("Unsupport serialize the type : " + obj.getColumnType() + ""); + } + return json; + } + + @Override + public OTSColumn deserialize(JsonElement ele, Type t, + JsonDeserializationContext c) throws JsonParseException { + JsonObject obj = ele.getAsJsonObject(); + String strColumnType = obj.getAsJsonPrimitive(COLUMN_TYPE).getAsString(); + OTSColumn.OTSColumnType columnType = OTSColumn.OTSColumnType.valueOf(strColumnType); + + switch(columnType) { + case CONST: + return deserializeConstColumn(obj); + case NORMAL: + return deserializeNormarlColumn(obj); + default: + throw new IllegalArgumentException("Unsupport deserialize the type : " + columnType + ""); + } + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/PrimaryKeyValueAdaptor.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/PrimaryKeyValueAdaptor.java new file mode 100644 index 0000000000..1a49ea476f --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/adaptor/PrimaryKeyValueAdaptor.java @@ -0,0 +1,91 @@ +package com.alibaba.datax.plugin.reader.otsreader.adaptor; + +import java.lang.reflect.Type; + +import com.aliyun.openservices.ots.model.ColumnType; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.google.gson.JsonDeserializationContext; +import com.google.gson.JsonDeserializer; +import com.google.gson.JsonElement; +import com.google.gson.JsonObject; +import com.google.gson.JsonParseException; +import com.google.gson.JsonPrimitive; +import com.google.gson.JsonSerializationContext; +import com.google.gson.JsonSerializer; + +/** + * {"type":"INF_MIN", "value":""} + * {"type":"INF_MAX", "value":""} + * {"type":"STRING", "value":"hello"} + * {"type":"INTEGER", "value":"1222"} + */ +public class PrimaryKeyValueAdaptor implements JsonDeserializer, JsonSerializer{ + private final static String TYPE = "type"; + private final static String VALUE = "value"; + private final static String INF_MIN = "INF_MIN"; + private final static String INF_MAX = "INF_MAX"; + + @Override + public JsonElement serialize(PrimaryKeyValue obj, Type t, + JsonSerializationContext c) { + JsonObject json = new JsonObject(); + + if (obj == PrimaryKeyValue.INF_MIN) { + json.add(TYPE, new JsonPrimitive(INF_MIN)); + json.add(VALUE, new JsonPrimitive("")); + return json; + } + + if (obj == PrimaryKeyValue.INF_MAX) { + json.add(TYPE, new JsonPrimitive(INF_MAX)); + json.add(VALUE, new JsonPrimitive("")); + return json; + } + + switch (obj.getType()) { + case STRING : + json.add(TYPE, new JsonPrimitive(ColumnType.STRING.toString())); + json.add(VALUE, new JsonPrimitive(obj.asString())); + break; + case INTEGER : + 
json.add(TYPE, new JsonPrimitive(ColumnType.INTEGER.toString())); + json.add(VALUE, new JsonPrimitive(obj.asLong())); + break; + default: + throw new IllegalArgumentException("Unsupport serialize the type : " + obj.getType() + ""); + } + return json; + } + + @Override + public PrimaryKeyValue deserialize(JsonElement ele, Type t, + JsonDeserializationContext c) throws JsonParseException { + + JsonObject obj = ele.getAsJsonObject(); + String strType = obj.getAsJsonPrimitive(TYPE).getAsString(); + JsonPrimitive jsonValue = obj.getAsJsonPrimitive(VALUE); + + if (strType.equals(INF_MIN)) { + return PrimaryKeyValue.INF_MIN; + } + + if (strType.equals(INF_MAX)) { + return PrimaryKeyValue.INF_MAX; + } + + PrimaryKeyValue value = null; + PrimaryKeyType type = PrimaryKeyType.valueOf(strType); + switch(type) { + case STRING : + value = PrimaryKeyValue.fromString(jsonValue.getAsString()); + break; + case INTEGER : + value = PrimaryKeyValue.fromLong(jsonValue.getAsLong()); + break; + default: + throw new IllegalArgumentException("Unsupport deserialize the type : " + type + ""); + } + return value; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetFirstRowPrimaryKeyCallable.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetFirstRowPrimaryKeyCallable.java new file mode 100644 index 0000000000..f004c0ff6f --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetFirstRowPrimaryKeyCallable.java @@ -0,0 +1,55 @@ +package com.alibaba.datax.plugin.reader.otsreader.callable; + +import java.util.List; +import java.util.Map; +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.ColumnType; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.GetRangeRequest; +import com.aliyun.openservices.ots.model.GetRangeResult; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; +import com.aliyun.openservices.ots.model.Row; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class GetFirstRowPrimaryKeyCallable implements Callable{ + + private OTSClient ots = null; + private TableMeta meta = null; + private RangeRowQueryCriteria criteria = null; + + public GetFirstRowPrimaryKeyCallable(OTSClient ots, TableMeta meta, RangeRowQueryCriteria criteria) { + this.ots = ots; + this.meta = meta; + this.criteria = criteria; + } + + @Override + public RowPrimaryKey call() throws Exception { + RowPrimaryKey ret = new RowPrimaryKey(); + GetRangeRequest request = new GetRangeRequest(); + request.setRangeRowQueryCriteria(criteria); + GetRangeResult result = ots.getRange(request); + List rows = result.getRows(); + if(rows.isEmpty()) { + return null;// no data + } + Row row = rows.get(0); + + Map pk = meta.getPrimaryKey(); + for (String key:pk.keySet()) { + ColumnValue v = row.getColumns().get(key); + if (v.getType() == ColumnType.INTEGER) { + ret.addPrimaryKeyColumn(key, PrimaryKeyValue.fromLong(v.asLong())); + } else { + ret.addPrimaryKeyColumn(key, PrimaryKeyValue.fromString(v.asString())); + } + } + return ret; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetRangeCallable.java 
b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetRangeCallable.java new file mode 100644 index 0000000000..2cd1398a66 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetRangeCallable.java @@ -0,0 +1,35 @@ +package com.alibaba.datax.plugin.reader.otsreader.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClientAsync; +import com.aliyun.openservices.ots.model.GetRangeRequest; +import com.aliyun.openservices.ots.model.GetRangeResult; +import com.aliyun.openservices.ots.model.OTSFuture; +import com.aliyun.openservices.ots.model.RangeRowQueryCriteria; + +public class GetRangeCallable implements Callable { + + private OTSClientAsync ots; + private RangeRowQueryCriteria criteria; + private OTSFuture future; + + public GetRangeCallable(OTSClientAsync ots, RangeRowQueryCriteria criteria, OTSFuture future) { + this.ots = ots; + this.criteria = criteria; + this.future = future; + } + + @Override + public GetRangeResult call() throws Exception { + try { + return future.get(); + } catch (Exception e) { + GetRangeRequest request = new GetRangeRequest(); + request.setRangeRowQueryCriteria(criteria); + future = ots.getRange(request); + throw e; + } + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetTableMetaCallable.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetTableMetaCallable.java new file mode 100644 index 0000000000..2884e12b14 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/callable/GetTableMetaCallable.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.reader.otsreader.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.DescribeTableRequest; +import com.aliyun.openservices.ots.model.DescribeTableResult; +import com.aliyun.openservices.ots.model.TableMeta; + +public class GetTableMetaCallable implements Callable{ + + private OTSClient ots = null; + private String tableName = null; + + public GetTableMetaCallable(OTSClient ots, String tableName) { + this.ots = ots; + this.tableName = tableName; + } + + @Override + public TableMeta call() throws Exception { + DescribeTableRequest describeTableRequest = new DescribeTableRequest(); + describeTableRequest.setTableName(tableName); + DescribeTableResult result = ots.describeTable(describeTableRequest); + TableMeta tableMeta = result.getTableMeta(); + return tableMeta; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSColumn.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSColumn.java new file mode 100644 index 0000000000..129ccd2fd2 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSColumn.java @@ -0,0 +1,76 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.BytesColumn; +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.StringColumn; +import com.aliyun.openservices.ots.model.ColumnType; + +public class OTSColumn { + private String name; + private Column value; + private OTSColumnType columnType; + private ColumnType valueType; + + public static enum 
OTSColumnType { + NORMAL, // 普通列 + CONST // 常量列 + } + + private OTSColumn(String name) { + this.name = name; + this.columnType = OTSColumnType.NORMAL; + } + + private OTSColumn(Column value, ColumnType type) { + this.value = value; + this.columnType = OTSColumnType.CONST; + this.valueType = type; + } + + public static OTSColumn fromNormalColumn(String name) { + if (name.isEmpty()) { + throw new IllegalArgumentException("The column name is empty."); + } + + return new OTSColumn(name); + } + + public static OTSColumn fromConstStringColumn(String value) { + return new OTSColumn(new StringColumn(value), ColumnType.STRING); + } + + public static OTSColumn fromConstIntegerColumn(long value) { + return new OTSColumn(new LongColumn(value), ColumnType.INTEGER); + } + + public static OTSColumn fromConstDoubleColumn(double value) { + return new OTSColumn(new DoubleColumn(value), ColumnType.DOUBLE); + } + + public static OTSColumn fromConstBoolColumn(boolean value) { + return new OTSColumn(new BoolColumn(value), ColumnType.BOOLEAN); + } + + public static OTSColumn fromConstBytesColumn(byte[] value) { + return new OTSColumn(new BytesColumn(value), ColumnType.BINARY); + } + + public Column getValue() { + return value; + } + + public OTSColumnType getColumnType() { + return columnType; + } + + public ColumnType getValueType() { + return valueType; + } + + public String getName() { + return name; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConf.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConf.java new file mode 100644 index 0000000000..8b109a39e9 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConf.java @@ -0,0 +1,90 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import java.util.List; + +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + +public class OTSConf { + private String endpoint= null; + private String accessId = null; + private String accesskey = null; + private String instanceName = null; + private String tableName = null; + + private List rangeBegin = null; + private List rangeEnd = null; + private List rangeSplit = null; + + private List columns = null; + + private int retry; + private int sleepInMilliSecond; + + public String getEndpoint() { + return endpoint; + } + public void setEndpoint(String endpoint) { + this.endpoint = endpoint; + } + public String getAccessId() { + return accessId; + } + public void setAccessId(String accessId) { + this.accessId = accessId; + } + public String getAccesskey() { + return accesskey; + } + public void setAccesskey(String accesskey) { + this.accesskey = accesskey; + } + public String getInstanceName() { + return instanceName; + } + public void setInstanceName(String instanceName) { + this.instanceName = instanceName; + } + public String getTableName() { + return tableName; + } + public void setTableName(String tableName) { + this.tableName = tableName; + } + + public List getColumns() { + return columns; + } + public void setColumns(List columns) { + this.columns = columns; + } + public int getRetry() { + return retry; + } + public void setRetry(int retry) { + this.retry = retry; + } + public int getSleepInMilliSecond() { + return sleepInMilliSecond; + } + public void setSleepInMilliSecond(int sleepInMilliSecond) { + this.sleepInMilliSecond = sleepInMilliSecond; + } + public List getRangeBegin() { + return rangeBegin; + } + public void setRangeBegin(List rangeBegin) { + this.rangeBegin = 
rangeBegin; + } + public List getRangeEnd() { + return rangeEnd; + } + public void setRangeEnd(List rangeEnd) { + this.rangeEnd = rangeEnd; + } + public List getRangeSplit() { + return rangeSplit; + } + public void setRangeSplit(List rangeSplit) { + this.rangeSplit = rangeSplit; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConst.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConst.java new file mode 100644 index 0000000000..30177193b2 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSConst.java @@ -0,0 +1,25 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +public class OTSConst { + // Reader support type + public final static String TYPE_STRING = "STRING"; + public final static String TYPE_INTEGER = "INT"; + public final static String TYPE_DOUBLE = "DOUBLE"; + public final static String TYPE_BOOLEAN = "BOOL"; + public final static String TYPE_BINARY = "BINARY"; + public final static String TYPE_INF_MIN = "INF_MIN"; + public final static String TYPE_INF_MAX = "INF_MAX"; + + // Column + public final static String NAME = "name"; + public final static String TYPE = "type"; + public final static String VALUE = "value"; + + public final static String OTS_CONF = "OTS_CONF"; + public final static String OTS_RANGE = "OTS_RANGE"; + public final static String OTS_DIRECTION = "OTS_DIRECTION"; + + // options + public final static String RETRY = "maxRetryTime"; + public final static String SLEEP_IN_MILLI_SECOND = "retrySleepInMillionSecond"; +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSPrimaryKeyColumn.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSPrimaryKeyColumn.java new file mode 100644 index 0000000000..eaec50ce5b --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSPrimaryKeyColumn.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import com.aliyun.openservices.ots.model.PrimaryKeyType; + +public class OTSPrimaryKeyColumn { + private String name; + private PrimaryKeyType type; + + public String getName() { + return name; + } + public void setName(String name) { + this.name = name; + } + public PrimaryKeyType getType() { + return type; + } + public void setType(PrimaryKeyType type) { + this.type = type; + } + +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSRange.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSRange.java new file mode 100644 index 0000000000..8ebfcf7ea3 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/model/OTSRange.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.reader.otsreader.model; + +import com.aliyun.openservices.ots.model.RowPrimaryKey; + +public class OTSRange { + + private RowPrimaryKey begin = null; + private RowPrimaryKey end = null; + + public OTSRange() {} + + public OTSRange(RowPrimaryKey begin, RowPrimaryKey end) { + this.begin = begin; + this.end = end; + } + + public RowPrimaryKey getBegin() { + return begin; + } + public void setBegin(RowPrimaryKey begin) { + this.begin = begin; + } + public RowPrimaryKey getEnd() { + return end; + } + public void setEnd(RowPrimaryKey end) { + this.end = end; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/Common.java 
b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/Common.java new file mode 100644 index 0000000000..7bb3f52ea1 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/Common.java @@ -0,0 +1,161 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.BytesColumn; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSPrimaryKeyColumn; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.Row; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class Common { + + public static int primaryKeyValueCmp(PrimaryKeyValue v1, PrimaryKeyValue v2) { + if (v1.getType() != null && v2.getType() != null) { + if (v1.getType() != v2.getType()) { + throw new IllegalArgumentException( + "Not same column type, column1:" + v1.getType() + ", column2:" + v2.getType()); + } + switch (v1.getType()) { + case INTEGER: + Long l1 = Long.valueOf(v1.asLong()); + Long l2 = Long.valueOf(v2.asLong()); + return l1.compareTo(l2); + case STRING: + return v1.asString().compareTo(v2.asString()); + default: + throw new IllegalArgumentException("Unsuporrt compare the type: " + v1.getType() + "."); + } + } else { + if (v1 == v2) { + return 0; + } else { + if (v1 == PrimaryKeyValue.INF_MIN) { + return -1; + } else if (v1 == PrimaryKeyValue.INF_MAX) { + return 1; + } + + if (v2 == PrimaryKeyValue.INF_MAX) { + return -1; + } else if (v2 == PrimaryKeyValue.INF_MIN) { + return 1; + } + } + } + return 0; + } + + public static OTSPrimaryKeyColumn getPartitionKey(TableMeta meta) { + List keys = new ArrayList(); + keys.addAll(meta.getPrimaryKey().keySet()); + + String key = keys.get(0); + + OTSPrimaryKeyColumn col = new OTSPrimaryKeyColumn(); + col.setName(key); + col.setType(meta.getPrimaryKey().get(key)); + return col; + } + + public static List getPrimaryKeyNameList(TableMeta meta) { + List names = new ArrayList(); + names.addAll(meta.getPrimaryKey().keySet()); + return names; + } + + public static int compareRangeBeginAndEnd(TableMeta meta, RowPrimaryKey begin, RowPrimaryKey end) { + if (begin.getPrimaryKey().size() != end.getPrimaryKey().size()) { + throw new IllegalArgumentException("Input size of begin not equal size of end, begin size : " + begin.getPrimaryKey().size() + + ", end size : " + end.getPrimaryKey().size() + "."); + } + for (String key : meta.getPrimaryKey().keySet()) { + PrimaryKeyValue v1 = begin.getPrimaryKey().get(key); + PrimaryKeyValue v2 = end.getPrimaryKey().get(key); + int cmp = primaryKeyValueCmp(v1, v2); + if (cmp != 0) { + return cmp; + } + } + return 0; + } + + public static List getNormalColumnNameList(List columns) { + List normalColumns = new ArrayList(); + for (OTSColumn col : columns) { + if (col.getColumnType() == OTSColumn.OTSColumnType.NORMAL) { + normalColumns.add(col.getName()); + } + } + return normalColumns; + } + + 
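    /**
     * Converts one OTS Row into a DataX Record: CONST columns emit their configured
     * constant value, NORMAL columns are looked up in the row by name, and a missing
     * cell is filled with a null StringColumn so the record keeps a stable column count.
     * Illustrative example: with columns = [const "2015", normal "name", normal "age"]
     * and a row that only contains name="foo", the emitted record is ["2015", "foo", null].
     */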
public static Record parseRowToLine(Row row, List columns, Record line) { + Map values = row.getColumns(); + for (OTSColumn col : columns) { + if (col.getColumnType() == OTSColumn.OTSColumnType.CONST) { + line.addColumn(col.getValue()); + } else { + ColumnValue v = values.get(col.getName()); + if (v == null) { + line.addColumn(new StringColumn(null)); + } else { + switch(v.getType()) { + case STRING: line.addColumn(new StringColumn(v.asString())); break; + case INTEGER: line.addColumn(new LongColumn(v.asLong())); break; + case DOUBLE: line.addColumn(new DoubleColumn(v.asDouble())); break; + case BOOLEAN: line.addColumn(new BoolColumn(v.asBoolean())); break; + case BINARY: line.addColumn(new BytesColumn(v.asBinary())); break; + default: + throw new IllegalArgumentException("Unsuporrt tranform the type: " + col.getValue().getType() + "."); + } + } + } + } + return line; + } + + public static String getDetailMessage(Exception exception) { + if (exception instanceof OTSException) { + OTSException e = (OTSException) exception; + return "OTSException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + ", RequestId:" + e.getRequestId() + "]"; + } else if (exception instanceof ClientException) { + ClientException e = (ClientException) exception; + return "ClientException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + "]"; + } else if (exception instanceof IllegalArgumentException) { + IllegalArgumentException e = (IllegalArgumentException) exception; + return "IllegalArgumentException[ErrorMessage:" + e.getMessage() + "]"; + } else { + return "Exception[ErrorMessage:" + exception.getMessage() + "]"; + } + } + + public static long getDelaySendMillinSeconds(int hadRetryTimes, int initSleepInMilliSecond) { + + if (hadRetryTimes <= 0) { + return 0; + } + + int sleepTime = initSleepInMilliSecond; + for (int i = 1; i < hadRetryTimes; i++) { + sleepTime += sleepTime; + if (sleepTime > 30000) { + sleepTime = 30000; + break; + } + } + return sleepTime; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/DefaultNoRetry.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/DefaultNoRetry.java new file mode 100644 index 0000000000..87264dab76 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/DefaultNoRetry.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import com.aliyun.openservices.ots.internal.OTSDefaultRetryStrategy; + +public class DefaultNoRetry extends OTSDefaultRetryStrategy { + + @Override + public boolean shouldRetry(String action, Exception ex, int retries) { + return false; + } + +} \ No newline at end of file diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/GsonParser.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/GsonParser.java new file mode 100644 index 0000000000..a82f335006 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/GsonParser.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import com.alibaba.datax.plugin.reader.otsreader.adaptor.OTSColumnAdaptor; +import com.alibaba.datax.plugin.reader.otsreader.adaptor.PrimaryKeyValueAdaptor; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConf; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import 
com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; +import com.google.gson.Gson; +import com.google.gson.GsonBuilder; + +public class GsonParser { + + private static Gson gsonBuilder() { + return new GsonBuilder() + .registerTypeAdapter(OTSColumn.class, new OTSColumnAdaptor()) + .registerTypeAdapter(PrimaryKeyValue.class, new PrimaryKeyValueAdaptor()) + .create(); + } + + public static String rangeToJson (OTSRange range) { + Gson g = gsonBuilder(); + return g.toJson(range); + } + + public static OTSRange jsonToRange (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, OTSRange.class); + } + + public static String confToJson (OTSConf conf) { + Gson g = gsonBuilder(); + return g.toJson(conf); + } + + public static OTSConf jsonToConf (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, OTSConf.class); + } + + public static String directionToJson (Direction direction) { + Gson g = gsonBuilder(); + return g.toJson(direction); + } + + public static Direction jsonToDirection (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, Direction.class); + } + + public static String metaToJson (TableMeta meta) { + Gson g = gsonBuilder(); + return g.toJson(meta); + } + + public static String rowPrimaryKeyToJson (RowPrimaryKey row) { + Gson g = gsonBuilder(); + return g.toJson(row); + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ParamChecker.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ParamChecker.java new file mode 100644 index 0000000000..fbcdc9722e --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ParamChecker.java @@ -0,0 +1,245 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSPrimaryKeyColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +public class ParamChecker { + + private static void throwNotExistException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is not exist."); + } + + private static void throwStringLengthZeroException(String key) { + throw new IllegalArgumentException("The param length of '" + key + "' is zero."); + } + + private static void throwEmptyException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is empty."); + } + + private static void throwNotListException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is not a json array."); + } + + private static void throwNotMapException(String key) { + throw new IllegalArgumentException("The param '" + key + "' is not a json map."); + } + + public static String checkStringAndGet(Configuration param, String key) { + String value = param.getString(key); + if (null == value) { + throwNotExistException(key); + } else if (value.length() == 0) { + throwStringLengthZeroException(key); + } + return value; + } + + public 
static List checkListAndGet(Configuration param, String key, boolean isCheckEmpty) { + List value = null; + try { + value = param.getList(key); + } catch (ClassCastException e) { + throwNotListException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyException(key); + } + return value; + } + + public static List checkListAndGet(Map range, String key) { + Object obj = range.get(key); + if (null == obj) { + return null; + } + return checkListAndGet(range, key, false); + } + + public static List checkListAndGet(Map range, String key, boolean isCheckEmpty) { + Object obj = range.get(key); + if (null == obj) { + throwNotExistException(key); + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + if (isCheckEmpty && value.isEmpty()) { + throwEmptyException(key); + } + return value; + } else { + throw new IllegalArgumentException("Can not parse list of '" + key + "' from map."); + } + } + + public static List checkListAndGet(Map range, String key, List defaultList) { + Object obj = range.get(key); + if (null == obj) { + return defaultList; + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + return value; + } else { + throw new IllegalArgumentException("Can not parse list of '" + key + "' from map."); + } + } + + public static Map checkMapAndGet(Configuration param, String key, boolean isCheckEmpty) { + Map value = null; + try { + value = param.getMap(key); + } catch (ClassCastException e) { + throwNotMapException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyException(key); + } + return value; + } + + public static RowPrimaryKey checkInputPrimaryKeyAndGet(TableMeta meta, List range) { + if (meta.getPrimaryKey().size() != range.size()) { + throw new IllegalArgumentException(String.format( + "Input size of values not equal size of primary key. input size:%d, primary key size:%d .", + range.size(), meta.getPrimaryKey().size())); + } + RowPrimaryKey pk = new RowPrimaryKey(); + int i = 0; + for (Entry e: meta.getPrimaryKey().entrySet()) { + PrimaryKeyValue value = range.get(i); + if (e.getValue() != value.getType() && value != PrimaryKeyValue.INF_MIN && value != PrimaryKeyValue.INF_MAX) { + throw new IllegalArgumentException( + "Input range type not match primary key. 
Input type:" + value.getType() + ", Primary Key Type:"+ e.getValue() +", Index:" + i + ); + } else { + pk.addPrimaryKeyColumn(e.getKey(), value); + } + i++; + } + return pk; + } + + public static OTSRange checkRangeAndGet(TableMeta meta, List begin, List end) { + OTSRange range = new OTSRange(); + if (begin.size() == 0 && end.size() == 0) { + RowPrimaryKey beginRow = new RowPrimaryKey(); + RowPrimaryKey endRow = new RowPrimaryKey(); + for (String name : meta.getPrimaryKey().keySet()) { + beginRow.addPrimaryKeyColumn(name, PrimaryKeyValue.INF_MIN); + endRow.addPrimaryKeyColumn(name, PrimaryKeyValue.INF_MAX); + } + range.setBegin(beginRow); + range.setEnd(endRow); + } else { + RowPrimaryKey beginRow = checkInputPrimaryKeyAndGet(meta, begin); + RowPrimaryKey endRow = checkInputPrimaryKeyAndGet(meta, end); + range.setBegin(beginRow); + range.setEnd(endRow); + } + return range; + } + + public static Direction checkDirectionAndEnd(TableMeta meta, RowPrimaryKey begin, RowPrimaryKey end) { + Direction direction = null; + int cmp = Common.compareRangeBeginAndEnd(meta, begin, end) ; + + if (cmp > 0) { + direction = Direction.BACKWARD; + } else if (cmp < 0) { + direction = Direction.FORWARD; + } else { + throw new IllegalArgumentException("Value of 'range-begin' equal value of 'range-end'."); + } + return direction; + } + + /** + * 检查类型是否一致,是否重复,方向是否一致 + * @param direction + * @param before + * @param after + */ + private static void checkDirection(Direction direction, PrimaryKeyValue before, PrimaryKeyValue after) { + int cmp = Common.primaryKeyValueCmp(before, after); + if (cmp > 0) { // 反向 + if (direction == Direction.FORWARD) { + throw new IllegalArgumentException("Input direction of 'range-split' is FORWARD, but direction of 'range' is BACKWARD."); + } + } else if (cmp < 0) { // 正向 + if (direction == Direction.BACKWARD) { + throw new IllegalArgumentException("Input direction of 'range-split' is BACKWARD, but direction of 'range' is FORWARD."); + } + } else { // 重复列 + throw new IllegalArgumentException("Multi same column in 'range-split'."); + } + } + + /** + * 检查 points中的所有点是否是在Begin和end之间 + * @param begin + * @param end + * @param points + */ + private static void checkPointsRange(Direction direction, PrimaryKeyValue begin, PrimaryKeyValue end, List points) { + if (direction == Direction.FORWARD) { + if (!(Common.primaryKeyValueCmp(begin, points.get(0)) < 0 && Common.primaryKeyValueCmp(end, points.get(points.size() - 1)) > 0)) { + throw new IllegalArgumentException("The item of 'range-split' is not within scope of 'range-begin' and 'range-end'."); + } + } else { + if (!(Common.primaryKeyValueCmp(begin, points.get(0)) > 0 && Common.primaryKeyValueCmp(end, points.get(points.size() - 1)) < 0)) { + throw new IllegalArgumentException("The item of 'range-split' is not within scope of 'range-begin' and 'range-end'."); + } + } + } + + /** + * 1.检测用户的输入类型是否和PartitionKey一致 + * 2.顺序是否和Range一致 + * 3.是否有重复列 + * 4.检查points的范围是否在range内 + * @param meta + * @param points + */ + public static void checkInputSplitPoints(TableMeta meta, OTSRange range, Direction direction, List points) { + if (null == points || points.isEmpty()) { + return; + } + + OTSPrimaryKeyColumn part = Common.getPartitionKey(meta); + + // 处理第一个 + PrimaryKeyValue item = points.get(0); + if ( item.getType() != part.getType()) { + throw new IllegalArgumentException("Input type of 'range-split' not match partition key. 
" + + "Item of 'range-split' type:" + item.getType()+ ", Partition type:" + part.getType()); + } + + for (int i = 0 ; i < points.size() - 1; i++) { + PrimaryKeyValue before = points.get(i); + PrimaryKeyValue after = points.get(i + 1); + checkDirection(direction, before, after); + } + + PrimaryKeyValue begin = range.getBegin().getPrimaryKey().get(part.getName()); + PrimaryKeyValue end = range.getEnd().getPrimaryKey().get(part.getName()); + + checkPointsRange(direction, begin, end, points); + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RangeSplit.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RangeSplit.java new file mode 100644 index 0000000000..74caac3f7a --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RangeSplit.java @@ -0,0 +1,379 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.math.BigInteger; +import java.util.ArrayList; +import java.util.Collections; +import java.util.Comparator; +import java.util.List; + +import com.alibaba.datax.plugin.reader.otsreader.model.OTSPrimaryKeyColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSRange; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; + +/** + * 主要提供对范围的解析 + */ +public class RangeSplit { + + private static String bigIntegerToString(BigInteger baseValue, + BigInteger bValue, BigInteger multi, int lenOfString) { + BigInteger tmp = bValue; + StringBuilder sb = new StringBuilder(); + for (int tmpLength = 0; tmpLength < lenOfString; tmpLength++) { + sb.insert(0, + (char) (baseValue.add(tmp.remainder(multi)).intValue())); + tmp = tmp.divide(multi); + } + return sb.toString(); + } + + /** + * 切分String的Unicode Unit + * + * 注意:该方法只支持begin小于end + * + * @param beginStr + * @param endStr + * @param count + * @return + */ + private static List splitCodePoint(int begin, int end, int count) { + if (begin >= end) { + throw new IllegalArgumentException("Only support begin < end."); + } + + List results = new ArrayList(); + BigInteger beginBig = BigInteger.valueOf(begin); + BigInteger endBig = BigInteger.valueOf(end); + BigInteger countBig = BigInteger.valueOf(count); + BigInteger multi = endBig.subtract(beginBig).add(BigInteger.ONE); + BigInteger range = endBig.subtract(beginBig); + BigInteger interval = BigInteger.ZERO; + int length = 1; + + BigInteger tmpBegin = BigInteger.ZERO; + BigInteger tmpEnd = endBig.subtract(beginBig); + + // 扩大之后的数值 + BigInteger realBegin = tmpBegin; + BigInteger realEnd = tmpEnd; + + while (range.compareTo(countBig) < 0) { // 不够切分 + realEnd = realEnd.multiply(multi).add(tmpEnd); + range = realEnd.subtract(realBegin); + length++; + } + + interval = range.divide(countBig); + + BigInteger cur = realBegin; + + for (int i = 0; i < (count - 1); i++) { + results.add(bigIntegerToString(beginBig, cur, multi, length)); + cur = cur.add(interval); + } + results.add(bigIntegerToString(beginBig, realEnd, multi, length)); + return results; + } + + /** + * 注意: 当begin和end相等时,函数将返回空的List + * + * @param begin + * @param end + * @param count + * @return + */ + public static List splitStringRange(String begin, String end, int count) { + + if (count <= 1) { + throw new IllegalArgumentException("Input count <= 1 ."); + } + + List results = new ArrayList(); + + int beginValue = 0; + if (!begin.isEmpty()) { + 
beginValue = begin.codePointAt(0); + } + int endValue = 0; + if (!end.isEmpty()) { + endValue = end.codePointAt(0); + } + + int cmp = beginValue - endValue; + + if (cmp == 0) { + return results; + } + + results.add(begin); + + Comparator comparator = new Comparator(){ + public int compare(String arg0, String arg1) { + return arg0.compareTo(arg1); + } + }; + + List tmp = null; + + if (cmp > 0) { // 如果是逆序,则 reverse Comparator + comparator = Collections.reverseOrder(comparator); + tmp = splitCodePoint(endValue, beginValue, count); + } else { // 正序 + tmp = splitCodePoint(beginValue, endValue, count); + } + + Collections.sort(tmp, comparator); + + for (String value : tmp) { + if (comparator.compare(value, begin) > 0 && comparator.compare(value, end) < 0) { + results.add(value); + } + } + + results.add(end); + + return results; + } + + /** + * begin 一定要小于 end + * @param begin + * @param end + * @param count + * @return + */ + private static List splitIntegerRange(BigInteger bigBegin, BigInteger bigEnd, BigInteger bigCount) { + List is = new ArrayList(); + + BigInteger interval = (bigEnd.subtract(bigBegin)).divide(bigCount); + BigInteger cur = bigBegin; + BigInteger i = BigInteger.ZERO; + while (cur.compareTo(bigEnd) < 0 && i.compareTo(bigCount) < 0) { + is.add(cur.longValue()); + cur = cur.add(interval); + i = i.add(BigInteger.ONE); + } + is.add(bigEnd.longValue()); + return is; + } + + /** + * 切分数值类型 注意: 当begin和end相等时,函数将返回空的List + * + * @param begin + * @param end + * @param count + * @return + */ + public static List splitIntegerRange(long begin, long end, int count) { + + if (count <= 1) { + throw new IllegalArgumentException("Input count <= 1 ."); + } + List is = new ArrayList(); + + BigInteger bigBegin = BigInteger.valueOf(begin); + BigInteger bigEnd = BigInteger.valueOf(end); + BigInteger bigCount = BigInteger.valueOf(count); + + BigInteger abs = (bigEnd.subtract(bigBegin)).abs(); + + if (abs.compareTo(BigInteger.ZERO) == 0) { // partition key 相等的情况 + return is; + } + + if (bigCount.compareTo(abs) > 0) { + bigCount = abs; + } + + if (bigEnd.subtract(bigBegin).compareTo(BigInteger.ZERO) > 0) { // 正向 + return splitIntegerRange(bigBegin, bigEnd, bigCount); + } else { // 逆向 + List tmp = splitIntegerRange(bigEnd, bigBegin, bigCount); + + Comparator comparator = new Comparator(){ + public int compare(Long arg0, Long arg1) { + return arg0.compareTo(arg1); + } + }; + + Collections.sort(tmp,Collections.reverseOrder(comparator)); + return tmp; + } + } + + public static List splitRangeByPrimaryKeyType( + PrimaryKeyType type, PrimaryKeyValue begin, PrimaryKeyValue end, + int count) { + List result = new ArrayList(); + if (type == PrimaryKeyType.STRING) { + List points = splitStringRange(begin.asString(), + end.asString(), count); + for (String s : points) { + result.add(PrimaryKeyValue.fromString(s)); + } + } else { + List points = splitIntegerRange(begin.asLong(), end.asLong(), + count); + for (Long l : points) { + result.add(PrimaryKeyValue.fromLong(l)); + } + } + return result; + } + + public static List rangeSplitByCount(TableMeta meta, + RowPrimaryKey begin, RowPrimaryKey end, int count) { + List results = new ArrayList(); + + OTSPrimaryKeyColumn partitionKey = Common.getPartitionKey(meta); + + PrimaryKeyValue beginPartitionKey = begin.getPrimaryKey().get( + partitionKey.getName()); + PrimaryKeyValue endPartitionKey = end.getPrimaryKey().get( + partitionKey.getName()); + + // 第一,先对PartitionKey列进行拆分 + + List ranges = RangeSplit.splitRangeByPrimaryKeyType( + partitionKey.getType(), 
beginPartitionKey, endPartitionKey, + count); + + if (ranges.isEmpty()) { + return results; + } + + int size = ranges.size(); + for (int i = 0; i < size - 1; i++) { + RowPrimaryKey bPk = new RowPrimaryKey(); + RowPrimaryKey ePk = new RowPrimaryKey(); + + bPk.addPrimaryKeyColumn(partitionKey.getName(), ranges.get(i)); + ePk.addPrimaryKeyColumn(partitionKey.getName(), ranges.get(i + 1)); + + results.add(new OTSRange(bPk, ePk)); + } + + // 第二,填充非PartitionKey的ParimaryKey列 + // 注意:在填充过程中,需要使用用户给定的Begin和End来替换切分出来的第一个Range + // 的Begin和最后一个Range的End + + List keys = new ArrayList(meta.getPrimaryKey().size()); + keys.addAll(meta.getPrimaryKey().keySet()); + + for (int i = 0; i < results.size(); i++) { + for (int j = 1; j < keys.size(); j++) { + OTSRange c = results.get(i); + RowPrimaryKey beginPK = c.getBegin(); + RowPrimaryKey endPK = c.getEnd(); + String key = keys.get(j); + if (i == 0) { // 第一行 + beginPK.addPrimaryKeyColumn(key, + begin.getPrimaryKey().get(key)); + endPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + } else if (i == results.size() - 1) {// 最后一行 + beginPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + endPK.addPrimaryKeyColumn(key, end.getPrimaryKey().get(key)); + } else { + beginPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + endPK.addPrimaryKeyColumn(key, PrimaryKeyValue.INF_MIN); + } + } + } + return results; + } + + private static List getCompletePK(int num, + PrimaryKeyValue value) { + List values = new ArrayList(); + for (int j = 0; j < num; j++) { + if (j == 0) { + values.add(value); + } else { + // 这里在填充PK时,系统需要选择特定的值填充于此 + // 系统默认填充INF_MIN + values.add(PrimaryKeyValue.INF_MIN); + } + } + return values; + } + + /** + * 根据输入的范围begin和end,从target中取得对应的point + * @param begin + * @param end + * @param target + * @return + */ + public static List getSplitPoint(PrimaryKeyValue begin, PrimaryKeyValue end, List target) { + List result = new ArrayList(); + + int cmp = Common.primaryKeyValueCmp(begin, end); + + if (cmp == 0) { + return result; + } + + result.add(begin); + + Comparator comparator = new Comparator(){ + public int compare(PrimaryKeyValue arg0, PrimaryKeyValue arg1) { + return Common.primaryKeyValueCmp(arg0, arg1); + } + }; + + if (cmp > 0) { // 如果是逆序,则 reverse Comparator + comparator = Collections.reverseOrder(comparator); + } + + Collections.sort(target, comparator); + + for (PrimaryKeyValue value:target) { + if (comparator.compare(value, begin) > 0 && comparator.compare(value, end) < 0) { + result.add(value); + } + } + result.add(end); + + return result; + } + + public static List rangeSplitByPoint(TableMeta meta, RowPrimaryKey beginPK, RowPrimaryKey endPK, + List splits) { + + List results = new ArrayList(); + + int pkCount = meta.getPrimaryKey().size(); + + String partName = Common.getPartitionKey(meta).getName(); + PrimaryKeyValue begin = beginPK.getPrimaryKey().get(partName); + PrimaryKeyValue end = endPK.getPrimaryKey().get(partName); + + List newSplits = getSplitPoint(begin, end, splits); + + if (newSplits.isEmpty()) { + return results; + } + + for (int i = 0; i < newSplits.size() - 1; i++) { + OTSRange item = new OTSRange( + ParamChecker.checkInputPrimaryKeyAndGet(meta, + getCompletePK(pkCount, newSplits.get(i))), + ParamChecker.checkInputPrimaryKeyAndGet(meta, + getCompletePK(pkCount, newSplits.get(i + 1)))); + results.add(item); + } + // replace first and last + OTSRange first = results.get(0); + OTSRange last = results.get(results.size() - 1); + + first.setBegin(beginPK); + last.setEnd(endPK); + return results; + } +} diff --git 
a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ReaderModelParser.java b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ReaderModelParser.java new file mode 100644 index 0000000000..8e1dfd4159 --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/ReaderModelParser.java @@ -0,0 +1,175 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import org.apache.commons.codec.binary.Base64; + +import com.alibaba.datax.plugin.reader.otsreader.model.OTSColumn; +import com.alibaba.datax.plugin.reader.otsreader.model.OTSConst; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + +/** + * 主要对OTS PrimaryKey,OTSColumn的解析 + */ +public class ReaderModelParser { + + private static long getLongValue(String value) { + try { + return Long.parseLong(value); + } catch (NumberFormatException e) { + throw new IllegalArgumentException("Can not parse the value '"+ value +"' to Int."); + } + } + + private static double getDoubleValue(String value) { + try { + return Double.parseDouble(value); + } catch (NumberFormatException e) { + throw new IllegalArgumentException("Can not parse the value '"+ value +"' to Double."); + } + } + + private static boolean getBoolValue(String value) { + if (!(value.equalsIgnoreCase("true") || value.equalsIgnoreCase("false"))) { + throw new IllegalArgumentException("Can not parse the value '"+ value +"' to Bool."); + } + return Boolean.parseBoolean(value); + } + + public static OTSColumn parseConstColumn(String type, String value) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return OTSColumn.fromConstStringColumn(value); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return OTSColumn.fromConstIntegerColumn(getLongValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_DOUBLE)) { + return OTSColumn.fromConstDoubleColumn(getDoubleValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BOOLEAN)) { + return OTSColumn.fromConstBoolColumn(getBoolValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BINARY)) { + return OTSColumn.fromConstBytesColumn(Base64.decodeBase64(value)); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse map to 'OTSColumn', input type:" + type + ", value:" + value + "."); + } + } + + public static OTSColumn parseOTSColumn(Map item) { + if (item.containsKey(OTSConst.NAME) && item.size() == 1) { + Object name = item.get(OTSConst.NAME); + if (name instanceof String) { + String nameStr = (String) name; + return OTSColumn.fromNormalColumn(nameStr); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse map to 'OTSColumn', the value is not a string."); + } + } else if (item.containsKey(OTSConst.TYPE) && item.containsKey(OTSConst.VALUE) && item.size() == 2) { + Object type = item.get(OTSConst.TYPE); + Object value = item.get(OTSConst.VALUE); + if (type instanceof String && value instanceof String) { + String typeStr = (String) type; + String valueStr = (String) value; + return parseConstColumn(typeStr, valueStr); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse map to 'OTSColumn', the value is not a string."); + } + } else { + throw new IllegalArgumentException( + "Invalid 'column', Can not parse map to 'OTSColumn', valid format: '{\"name\":\"\"}' or '{\"type\":\"\", \"value\":\"\"}'."); + } + } + + private static void checkIsAllConstColumn(List columns) { + 
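        // at least one NORMAL column is required: a column list made up only of
        // constant columns would produce records without reading anything from the OTS table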
for (OTSColumn c : columns) { + if (c.getColumnType() == OTSColumn.OTSColumnType.NORMAL) { + return ; + } + } + throw new IllegalArgumentException("Invalid 'column', 'column' should include at least one or more Normal Column."); + } + + public static List parseOTSColumnList(List input) { + if (input.isEmpty()) { + throw new IllegalArgumentException("Input count of 'column' is zero."); + } + + List columns = new ArrayList(input.size()); + + for (Object item:input) { + if (item instanceof Map){ + @SuppressWarnings("unchecked") + Map column = (Map) item; + columns.add(parseOTSColumn(column)); + } else { + throw new IllegalArgumentException("Invalid 'column', Can not parse Object to 'OTSColumn', item of list is not a map."); + } + } + checkIsAllConstColumn(columns); + return columns; + } + + public static PrimaryKeyValue parsePrimaryKeyValue(String type, String value) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return PrimaryKeyValue.fromString(value); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return PrimaryKeyValue.fromLong(getLongValue(value)); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MIN)) { + throw new IllegalArgumentException("Format error, the " + OTSConst.TYPE_INF_MIN + " only support {\"type\":\"" + OTSConst.TYPE_INF_MIN + "\"}."); + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MAX)) { + throw new IllegalArgumentException("Format error, the " + OTSConst.TYPE_INF_MAX + " only support {\"type\":\"" + OTSConst.TYPE_INF_MAX + "\"}."); + } else { + throw new IllegalArgumentException("Not supprot parsing type: "+ type +" for PrimaryKeyValue."); + } + } + + public static PrimaryKeyValue parsePrimaryKeyValue(String type) { + if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MIN)) { + return PrimaryKeyValue.INF_MIN; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INF_MAX)) { + return PrimaryKeyValue.INF_MAX; + } else { + throw new IllegalArgumentException("Not supprot parsing type: "+ type +" for PrimaryKeyValue."); + } + } + + public static PrimaryKeyValue parsePrimaryKeyValue(Map item) { + if (item.containsKey(OTSConst.TYPE) && item.containsKey(OTSConst.VALUE) && item.size() == 2) { + Object type = item.get(OTSConst.TYPE); + Object value = item.get(OTSConst.VALUE); + if (type instanceof String && value instanceof String) { + String typeStr = (String) type; + String valueStr = (String) value; + return parsePrimaryKeyValue(typeStr, valueStr); + } else { + throw new IllegalArgumentException("The 'type' and 'value‘ only support string."); + } + } else if (item.containsKey(OTSConst.TYPE) && item.size() == 1) { + Object type = item.get(OTSConst.TYPE); + if (type instanceof String) { + String typeStr = (String) type; + return parsePrimaryKeyValue(typeStr); + } else { + throw new IllegalArgumentException("The 'type' only support string."); + } + } else { + throw new IllegalArgumentException("The map must consist of 'type' and 'value'."); + } + } + + public static List parsePrimaryKey(List input) { + if (null == input) { + return null; + } + List columns = new ArrayList(input.size()); + for (Object item:input) { + if (item instanceof Map) { + @SuppressWarnings("unchecked") + Map column = (Map) item; + columns.add(parsePrimaryKeyValue(column)); + } else { + throw new IllegalArgumentException("Can not parse Object to 'PrimaryKeyValue', item of list is not a map."); + } + } + return columns; + } +} diff --git a/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RetryHelper.java 
b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RetryHelper.java new file mode 100644 index 0000000000..8ed412670c --- /dev/null +++ b/otsreader/src/main/java/com/alibaba/datax/plugin/reader/otsreader/utils/RetryHelper.java @@ -0,0 +1,83 @@ +package com.alibaba.datax.plugin.reader.otsreader.utils; + +import java.util.HashSet; +import java.util.Set; +import java.util.concurrent.Callable; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSErrorCode; +import com.aliyun.openservices.ots.OTSException; + +public class RetryHelper { + + private static final Logger LOG = LoggerFactory.getLogger(RetryHelper.class); + private static final Set noRetryErrorCode = prepareNoRetryErrorCode(); + + public static V executeWithRetry(Callable callable, int maxRetryTimes, int sleepInMilliSecond) throws Exception { + int retryTimes = 0; + while (true){ + Thread.sleep(Common.getDelaySendMillinSeconds(retryTimes, sleepInMilliSecond)); + try { + return callable.call(); + } catch (Exception e) { + LOG.warn("Call callable fail, {}", e.getMessage()); + if (!canRetry(e)){ + LOG.error("Can not retry for Exception.", e); + throw e; + } else if (retryTimes >= maxRetryTimes) { + LOG.error("Retry times more than limition. maxRetryTimes : {}", maxRetryTimes); + throw e; + } + retryTimes++; + LOG.warn("Retry time : {}", retryTimes); + } + } + } + + private static Set prepareNoRetryErrorCode() { + Set pool = new HashSet(); + pool.add(OTSErrorCode.AUTHORIZATION_FAILURE); + pool.add(OTSErrorCode.INVALID_PARAMETER); + pool.add(OTSErrorCode.REQUEST_TOO_LARGE); + pool.add(OTSErrorCode.OBJECT_NOT_EXIST); + pool.add(OTSErrorCode.OBJECT_ALREADY_EXIST); + pool.add(OTSErrorCode.INVALID_PK); + pool.add(OTSErrorCode.OUT_OF_COLUMN_COUNT_LIMIT); + pool.add(OTSErrorCode.OUT_OF_ROW_SIZE_LIMIT); + pool.add(OTSErrorCode.CONDITION_CHECK_FAIL); + return pool; + } + + public static boolean canRetry(String otsErrorCode) { + if (noRetryErrorCode.contains(otsErrorCode)) { + return false; + } else { + return true; + } + } + + public static boolean canRetry(Exception exception) { + OTSException e = null; + if (exception instanceof OTSException) { + e = (OTSException) exception; + LOG.warn( + "OTSException:ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()} + ); + return canRetry(e.getErrorCode()); + + } else if (exception instanceof ClientException) { + ClientException ce = (ClientException) exception; + LOG.warn( + "ClientException:{}, ErrorMsg:{}", + new Object[]{ce.getErrorCode(), ce.getMessage()} + ); + return true; + } else { + return false; + } + } +} diff --git a/otsreader/src/main/resources/plugin.json b/otsreader/src/main/resources/plugin.json new file mode 100644 index 0000000000..bfd956273a --- /dev/null +++ b/otsreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "otsreader", + "class": "com.alibaba.datax.plugin.reader.otsreader.OtsReader", + "description": "", + "developer": "alibaba" +} \ No newline at end of file diff --git a/otsreader/src/main/resources/plugin_job_template.json b/otsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..7d4d0dbc60 --- /dev/null +++ b/otsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,14 @@ +{ + "name": "otsreader", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + "column":[], + "range":{ + "begin":[], + "end":[] + 
} + } +} \ No newline at end of file diff --git a/otsstreamreader/README.md b/otsstreamreader/README.md new file mode 100644 index 0000000000..c861a737ba --- /dev/null +++ b/otsstreamreader/README.md @@ -0,0 +1,127 @@ +## TableStore增量数据导出通道:TableStoreStreamReader + +### 快速介绍 + +TableStoreStreamReader插件主要用于TableStore的增量数据导出,增量数据可以看作操作日志,除了数据本身外还附有操作信息。 + +与全量导出插件不同,增量导出插件只有多版本模式,同时不支持指定列。这是与增量导出的原理有关的,导出的格式下面有详细介绍。 + +使用插件前必须确保表上已经开启Stream功能,可以在建表的时候指定开启,或者使用SDK的UpdateTable接口开启。 + + 开启Stream的方法: + SyncClient client = new SyncClient("", "", "", ""); + 1. 建表的时候开启: + CreateTableRequest createTableRequest = new CreateTableRequest(tableMeta); + createTableRequest.setStreamSpecification(new StreamSpecification(true, 24)); // 24代表增量数据保留24小时 + client.createTable(createTableRequest); + + 2. 如果建表时未开启,可以通过UpdateTable开启: + UpdateTableRequest updateTableRequest = new UpdateTableRequest("tableName"); + updateTableRequest.setStreamSpecification(new StreamSpecification(true, 24)); + client.updateTable(updateTableRequest); + +### 实现原理 + +首先用户使用SDK的UpdateTable功能,指定开启Stream并设置过期时间,即开启了增量功能。 + +开启后,TableStore服务端就会将用户的操作日志额外保存起来, +每个分区有一个有序的操作日志队列,每条操作日志会在一定时间后被垃圾回收,这个时间即用户指定的过期时间。 + +TableStore的SDK提供了几个Stream相关的API用于将这部分操作日志读取出来,增量插件也是通过TableStore SDK的接口获取到增量数据的,并将 +增量数据转化为多个6元组的形式(pk, colName, version, colValue, opType, sequenceInfo)导入到ODPS中。 + +### Reader的配置模版: + + "reader": { + "name" : "otsstreamreader", + "parameter" : { + "endpoint" : "", + "accessId" : "", + "accessKey" : "", + "instanceName" : "", + //dataTable即需要导出数据的表。 + "dataTable" : "", + //statusTable是Reader用于保存状态的表,若该表不存在,Reader会自动创建该表。 + //一次离线导出任务完成后,用户不应删除该表,该表中记录的状态可用于下次导出任务中。 + "statusTable" : "TableStoreStreamReaderStatusTable", + //增量数据的时间范围(左闭右开)的左边界。 + "startTimestampMillis" : "", + //增量数据的时间范围(左闭右开)的右边界。 + "endTimestampMillis" : "", + //采云间调度只支持天级别,所以提供该配置,作用与startTimestampMillis和endTimestampMillis类似。 + "date": "", + //是否导出时序信息。 + "isExportSequenceInfo": true, + //从TableStore中读增量数据时,每次请求的最大重试次数,默认为30。 + "maxRetries" : 30 + } + } + +### 参数说明 + +| 名称 | 说明 | 类型 | 必选 | +| ---- | ---- | ---- | ---- | +| endpoint | TableStoreServer的Endpoint地址。| String | 是 | +| accessId | 用于访问TableStore服务的accessId。| String | 是 | +| accessKey | 用于访问TableStore服务的accessKey。 | String | 是 | +| instanceName | TableStore的实例名称。 | String | 是 | +| dataTable | 需要导出增量数据的表的名称。该表需要开启Stream,可以在建表时开启,或者使用UpdateTable接口开启。 | String | 是 | +| statusTable | Reader插件用于记录状态的表的名称,这些状态可用于减少对非目标范围内的数据的扫描,从而加快导出速度。
1. 用户不需要预先创建该表,只需给出一个表名。Reader插件会在用户的instance下检查该表:若不存在则自动创建,若已存在则校验其Meta是否与期望一致,不一致时会抛出异常。
2. 在一次导出完成之后,用户不应删除该表,该表的状态可用于下次导出任务。
3. 该表会开启TTL,数据自动过期,因此可认为其数据量很小。
4. 针对同一个instance下的多个不同的dataTable的Reader配置,可以使用同一个statusTable,记录的状态信息互不影响。
综上,用户配置一个类似TableStoreStreamReaderStatusTable之类的名称即可,注意不要与业务相关的表重名。| String | 是 | +| startTimestampMillis | 增量数据的时间范围(左闭右开)的左边界,单位毫秒。
1. Reader插件会从statusTable中查找对应startTimestampMillis的位点,从该点开始读取并导出数据。
2. 若statusTable中找不到对应的位点,则从系统保留的增量数据的第一条开始读取,并跳过写入时间小于startTimestampMillis的数据。| Long | 否 | +| endTimestampMillis | 增量数据的时间范围(左闭右开)的右边界,单位毫秒。
1. Reader插件从startTimestampMillis位点开始导出数据,当遇到第一条时间戳大于等于endTimestampMillis的数据时,即结束本次导出。
2. 当读取完当前全部的增量数据时,结束读取,即使未达到endTimestampMillis。 | Long | 否 | +| date | 日期格式为yyyyMMdd,如20151111,表示导出该日的数据。
若没有指定date,则必须指定startTimestampMillis和endTimestampMillis,反之也成立。 | String | 否 | +| isExportSequenceInfo | 是否导出时序信息,时序信息包含了数据的写入时间等。默认该值为false,即不导出。 | Boolean | 否 | +| maxRetries | 从TableStore中读增量数据时,每次请求的最大重试次数,默认为30,重试之间有间隔,30次重试总时间约为5分钟,一般无需更改。| Int | 否 | + +### 导出的数据格式 +首先,在TableStore多版本模型下,表中的数据组织为“行-列-版本”三级的模式, +一行可以有任意列,列名也并非固定的,每一列可以含有多个版本,每个版本都有一个特定的时间戳(版本号)。 + +用户可以通过TableStore的API进行一系列读写操作, +TableStore通过记录用户最近对表的一系列写操作(或称为数据更改操作)来实现记录增量数据的目的, +所以也可以把增量数据看作一批操作记录。 + +TableStore有三类数据更改操作:PutRow、UpdateRow、DeleteRow。 + + + PutRow的语义是写入一行,若该行已存在即覆盖该行。 + + + UpdateRow的语义是更新一行,对原行其他数据不做更改, + 更新可能包括新增或覆盖(若对应列的对应版本已存在)一些列值、删除某一列的全部版本、删除某一列的某个版本。 + + + DeleteRow的语义是删除一行。 + +TableStore会根据每种操作生成对应的增量数据记录,Reader插件会读出这些记录,并导出成Datax的数据格式。 + +同时,由于TableStore具有动态列、多版本的特性,所以Reader插件导出的一行不对应TableStore中的一行,而是对应TableStore中的一列的一个版本。 +即TableStore中的一行可能会导出很多行,每行包含主键值、该列的列名、该列下该版本的时间戳(版本号)、该版本的值、操作类型。若设置isExportSequenceInfo为true,还会包括时序信息。 + +转换为Datax的数据格式后,我们定义了四种操作类型,分别为: + + + U(UPDATE): 写入一列的一个版本 + + + DO(DELETE_ONE_VERSION): 删除某一列的某个版本 + + + DA(DELETE_ALL_VERSION): 删除某一列的全部版本,此时需要根据主键和列名,将对应列的全部版本删除 + + + DR(DELETE_ROW): 删除某一行,此时需要根据主键,将该行数据全部删除 + + +举例如下,假设该表有两个主键列,主键列名分别为pkName1, pkName2: + +| pkName1 | pkName2 | columnName | timestamp | columnValue | opType | +| ------- | ------- | ---------- | --------- | ----------- | ------ | +| pk1_V1 | pk2_V1 | col_a | 1441803688001 | col_val1 | U | +| pk1_V1 | pk2_V1 | col_a | 1441803688002 | col_val2 | U | +| pk1_V1 | pk2_V1 | col_b | 1441803688003 | col_val3 | U | +| pk1_V2 | pk2_V2 | col_a | 1441803688000 | | DO | +| pk1_V2 | pk2_V2 | col_b | | | DA | +| pk1_V3 | pk2_V3 | | | | DR | +| pk1_V3 | pk2_V3 | col_a | 1441803688005 | col_val1 | U | + +假设导出的数据如上,共7行,对应TableStore表内的3行,主键分别是(pk1_V1,pk2_V1), (pk1_V2, pk2_V2), (pk1_V3, pk2_V3)。 + +对于主键为(pk1_V1, pk2_V1)的一行,包含三个操作,分别是写入col_a列的两个版本和col_b列的一个版本。 + +对于主键为(pk1_V2, pk2_V2)的一行,包含两个操作,分别是删除col_a列的一个版本、删除col_b列的全部版本。 + +对于主键为(pk1_V3, pk2_V3)的一行,包含两个操作,分别是删除整行、写入col_a列的一个版本。 diff --git a/otsstreamreader/pom.xml b/otsstreamreader/pom.xml new file mode 100644 index 0000000000..dca4de238f --- /dev/null +++ b/otsstreamreader/pom.xml @@ -0,0 +1,90 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + com.alibaba.datax + otsstreamreader + 0.0.1-SNAPSHOT + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + logback-classic + ch.qos.logback + + + + + + com.aliyun.openservices + tablestore-streamclient + 1.0.0-SNAPSHOT + + + com.google.code.gson + gson + 2.2.4 + + + com.google.guava + guava + 18.0 + test + + + + + + + src/main/java + + **/*.properties + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/otsstreamreader/src/main/assembly/package.xml b/otsstreamreader/src/main/assembly/package.xml new file mode 100644 index 0000000000..424a8cc332 --- /dev/null +++ b/otsstreamreader/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + + plugin/reader/otsstreamreader + + + target/ + + otsstreamreader-0.0.1-SNAPSHOT.jar + + plugin/reader/otsstreamreader + + + + + + false + plugin/reader/otsstreamreader/libs + runtime + + + diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSReaderError.java 
b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSReaderError.java new file mode 100644 index 0000000000..1bf1784de7 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSReaderError.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal; + +import com.alibaba.datax.common.spi.ErrorCode; + +public class OTSReaderError implements ErrorCode { + + private String code; + + private String description; + + public final static OTSReaderError ERROR = new OTSReaderError("OTSStreamReaderError", "OTS Stream Reader Error"); + + public final static OTSReaderError INVALID_PARAM = new OTSReaderError( + "OTSStreamReaderInvalidParameter", "OTS Stream Reader Invalid Parameter"); + + public OTSReaderError(String code, String description) { + this.code = code; + this.description = description; + } + + public String getCode() { + return this.code; + } + + public String getDescription() { + return this.description; + } + + public String toString() { + return "[ code:" + this.code + ", message" + this.description + "]"; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReader.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReader.java new file mode 100644 index 0000000000..6731346701 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReader.java @@ -0,0 +1,75 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConfig; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConstants; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.StreamJob; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.GsonParser; +import com.alicloud.openservices.tablestore.TableStoreException; +import com.alicloud.openservices.tablestore.model.StreamShard; + +import java.util.HashSet; +import java.util.List; +import java.util.concurrent.ConcurrentSkipListSet; + +public class OTSStreamReader { + + public static class Job extends Reader.Job { + + private OTSStreamReaderMasterProxy proxy = new OTSStreamReaderMasterProxy(); + @Override + public List split(int adviceNumber) { + return proxy.split(adviceNumber); + } + + public void init() { + try { + OTSStreamReaderConfig config = OTSStreamReaderConfig.load(getPluginJobConf()); + proxy.init(config); + } catch (TableStoreException ex) { + throw DataXException.asDataXException(new OTSReaderError(ex.getErrorCode(), "OTS ERROR"), ex.toString(), ex); + } catch (Exception ex) { + throw DataXException.asDataXException(OTSReaderError.ERROR, ex.toString(), ex); + } + } + + public void destroy() { + this.proxy.close(); + } + } + + public static class Task extends Reader.Task { + + private OTSStreamReaderSlaveProxy proxy = new OTSStreamReaderSlaveProxy(); + + @Override + public void startRead(RecordSender recordSender) { + proxy.startRead(recordSender); + } + + public void init() { + try { + OTSStreamReaderConfig config = GsonParser.jsonToConfig( + (String) 
this.getPluginJobConf().get(OTSStreamReaderConstants.CONF)); + StreamJob streamJob = StreamJob.fromJson( + (String) this.getPluginJobConf().get(OTSStreamReaderConstants.STREAM_JOB)); + List ownedShards = GsonParser.jsonToList( + (String) this.getPluginJobConf().get(OTSStreamReaderConstants.OWNED_SHARDS)); + List allShards = GsonParser.fromJson( + (String) this.getPluginJobConf().get(OTSStreamReaderConstants.ALL_SHARDS)); + proxy.init(config, streamJob, allShards, new HashSet(ownedShards)); + } catch (TableStoreException ex) { + throw DataXException.asDataXException(new OTSReaderError(ex.getErrorCode(), "OTS ERROR"), ex.toString(), ex); + } catch (Exception ex) { + throw DataXException.asDataXException(OTSReaderError.ERROR, ex.toString(), ex); + } + } + + public void destroy() { + proxy.close(); + } + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderException.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderException.java new file mode 100644 index 0000000000..c112656bd8 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderException.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal; + +public class OTSStreamReaderException extends RuntimeException { + + public OTSStreamReaderException(String message) { + super(message); + } + + public OTSStreamReaderException(String message, Exception cause) { + super(message, cause); + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderMasterProxy.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderMasterProxy.java new file mode 100644 index 0000000000..473e2c8132 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderMasterProxy.java @@ -0,0 +1,112 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConfig; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConstants; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.core.CheckpointTimeTracker; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.core.OTSStreamReaderChecker; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.StreamJob; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.GsonParser; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.OTSHelper; +import com.alicloud.openservices.tablestore.*; +import com.alicloud.openservices.tablestore.model.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; + +public class OTSStreamReaderMasterProxy { + + private OTSStreamReaderConfig conf = null; + private SyncClientInterface ots = null; + + private StreamJob streamJob; + private List allShards; + + private static final Logger LOG = LoggerFactory.getLogger(OTSStreamReaderConfig.class); + + public void init(OTSStreamReaderConfig config) throws Exception { + this.conf = config; + + // Init ots + ots = OTSHelper.getOTSInstance(conf); + + // 创建Checker + OTSStreamReaderChecker checker = new OTSStreamReaderChecker(ots, conf); + + // 检查Stream是否开启,选取的时间范围是否可以导出。 + 
checker.checkStreamEnabledAndTimeRangeOK(); + + // 检查StatusTable是否存在,若不存在则创建StatusTable。 + checker.checkAndCreateStatusTableIfNotExist(); + + // 删除StatusTable记录的对应EndTime时刻的Checkpoint信息。防止本次任务受到之前导出任务的影响。 + String streamId = OTSHelper.getStreamDetails(ots, config.getDataTable()).getStreamId(); + CheckpointTimeTracker checkpointInfoTracker = new CheckpointTimeTracker(ots, config.getStatusTable(), streamId); + checkpointInfoTracker.clearAllCheckpoints(config.getEndTimestampMillis()); + + SyncClientInterface ots = OTSHelper.getOTSInstance(config); + + allShards = OTSHelper.getOrderedShardList(ots, streamId); + List shardIds = new ArrayList(); + for (StreamShard shard : allShards) { + shardIds.add(shard.getShardId()); + } + + String version = "" + System.currentTimeMillis() + "-" + UUID.randomUUID(); + + streamJob = new StreamJob(conf.getDataTable(), streamId, version, new HashSet(shardIds), + conf.getStartTimestampMillis(), conf.getEndTimestampMillis()); + checkpointInfoTracker.writeStreamJob(streamJob); + + LOG.info("Start stream job: {}.", streamJob.toJson()); + } + + /** + * For testing purpose. + * + * @param streamJob + */ + void setStreamJob(StreamJob streamJob) { + this.streamJob = streamJob; + } + + public StreamJob getStreamJob() { + return streamJob; + } + + public List split(int adviceNumber) { + int shardCount = streamJob.getShardIds().size(); + int splitNumber = Math.min(adviceNumber, shardCount); + int splitSize = shardCount / splitNumber; + List configurations = new ArrayList(); + + List shardIds = new ArrayList(streamJob.getShardIds()); + Collections.shuffle(shardIds); + int start = 0; + int end = 0; + int remain = shardCount % splitNumber; + for (int i = 0; i < splitNumber; i++) { + start = end; + end = start + splitSize; + + if (remain > 0) { + end += 1; + remain -= 1; + } + + Configuration configuration = Configuration.newDefault(); + configuration.set(OTSStreamReaderConstants.CONF, GsonParser.configToJson(conf)); + configuration.set(OTSStreamReaderConstants.STREAM_JOB, streamJob.toJson()); + configuration.set(OTSStreamReaderConstants.ALL_SHARDS, GsonParser.toJson(allShards)); + configuration.set(OTSStreamReaderConstants.OWNED_SHARDS, GsonParser.listToJson(shardIds.subList(start, end))); + configurations.add(configuration); + } + LOG.info("Master split to {} slave, with advice number {}.", configurations.size(), adviceNumber); + return configurations; + } + + public void close(){ + ots.shutdown(); + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderSlaveProxy.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderSlaveProxy.java new file mode 100644 index 0000000000..22035851b2 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/OTSStreamReaderSlaveProxy.java @@ -0,0 +1,290 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConfig; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConstants; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.core.*; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.ShardCheckpoint; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.StreamJob; +import 
com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.OTSHelper; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.TimeUtils; +import com.alicloud.openservices.tablestore.*; +import com.alicloud.openservices.tablestore.model.*; +import com.aliyun.openservices.ots.internal.streamclient.model.CheckpointPosition; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; +import java.util.concurrent.*; +import java.util.concurrent.atomic.AtomicInteger; + +public class OTSStreamReaderSlaveProxy { + private static final Logger LOG = LoggerFactory.getLogger(OTSStreamReaderSlaveProxy.class); + private static AtomicInteger slaveNumber = new AtomicInteger(0); + + private OTSStreamReaderConfig config; + private SyncClientInterface ots; + private Map shardToCheckpointMap = new ConcurrentHashMap(); + private CheckpointTimeTracker checkpointInfoTracker; + private OTSStreamReaderChecker checker; + + private StreamJob streamJob; + private Map allShardsMap; // all shards from job master + private Map ownedShards; // shards to read arranged by job master + private boolean findCheckpoints; // whether find checkpoint for last job, if so, we should read from checkpoint and skip nothing. + private String slaveId = UUID.randomUUID().toString(); + private StreamDetails streamDetails; + + public void init(final OTSStreamReaderConfig otsStreamReaderConfig, StreamJob streamJob, List allShards, Set ownedShardIds) { + slaveNumber.getAndIncrement(); + this.config = otsStreamReaderConfig; + this.ots = OTSHelper.getOTSInstance(config); + this.streamJob = streamJob; + this.streamDetails = OTSHelper.getStreamDetails(ots, this.streamJob.getTableName()); + this.checkpointInfoTracker = new CheckpointTimeTracker(ots, config.getStatusTable(), this.streamJob.getStreamId()); + this.checker = new OTSStreamReaderChecker(ots, config); + this.allShardsMap = OTSHelper.toShardMap(allShards); + + LOG.info("SlaveId: {}, ShardIds: {}, OwnedShards: {}.", slaveId, allShards, ownedShardIds); + this.ownedShards = new HashMap(); + for (String ownedShardId : ownedShardIds) { + ownedShards.put(ownedShardId, allShardsMap.get(ownedShardId)); + } + + for (String shardId : this.streamJob.getShardIds()) { + shardToCheckpointMap.put(shardId, new ShardCheckpoint(shardId, this.streamJob.getVersion(), CheckpointPosition.TRIM_HORIZON, 0)); + } + + findCheckpoints = checker.checkAndSetCheckpoints(checkpointInfoTracker, allShardsMap, streamJob, shardToCheckpointMap); + if (!findCheckpoints) { + LOG.info("Checkpoint for stream '{}' in timestamp '{}' is not found.", streamJob.getStreamId(), streamJob.getStartTimeInMillis()); + setWithNearestCheckpoint(); + } + + LOG.info("Find checkpoints: {}.", findCheckpoints); + for (Map.Entry shard : ownedShards.entrySet()) { + LOG.info("Shard to process, ShardInfo: [{}], StartCheckpoint: [{}].", shard.getValue(), shardToCheckpointMap.get(shard.getKey())); + } + LOG.info("Count of owned shards: {}. 
ShardIds: {}.", ownedShardIds.size(), ownedShardIds); + } + + public boolean isFindCheckpoints() { + return findCheckpoints; + } + + public Map getAllShardsMap() { + return allShardsMap; + } + + public Map getOwnedShards() { + return ownedShards; + } + + public Map getShardToCheckpointMap() { + return shardToCheckpointMap; + } + + /** + * 没有找到上一次任务的checkpoint,需要重新从头开始读。 + * 为了减少扫描的数据量,尝试查找里startTime最近的一次checkpoint。 + */ + private void setWithNearestCheckpoint() { + long expirationTime = (streamDetails.getExpirationTime() - 1) * TimeUtils.HOUR_IN_MILLIS; + long timeRangeBegin = System.currentTimeMillis() - expirationTime; + long timeRangeEnd = this.config.getStartTimestampMillis() - 1; + if (timeRangeBegin < timeRangeEnd) { + for (String shardId : ownedShards.keySet()) { + LOG.info("Try find nearest checkpoint for shard {}, startTime: {}.", shardId, config.getStartTimestampMillis()); + String checkpoint = this.checkpointInfoTracker.getShardLargestCheckpointInTimeRange(shardId, timeRangeBegin, timeRangeEnd); + if (checkpoint != null) { + LOG.info("Found checkpoint for shard {}, checkpoint: {}.", shardId, checkpoint); + shardToCheckpointMap.put(shardId, new ShardCheckpoint(shardId, streamJob.getVersion(), checkpoint, 0)); + } + } + } + } + + private int calcThreadPoolSize() { + int threadNum = 0; + // 如果配置了thread num,则计算平均每个slave所启动的thread的个数 + if (config.getThreadNum() > 0) { + threadNum = config.getThreadNum() / slaveNumber.get(); + } else { + threadNum = Runtime.getRuntime().availableProcessors() * 4 / slaveNumber.get(); + } + + if (threadNum == 0) { + threadNum = 1; + } + LOG.info("ThreadNum: {}.", threadNum); + return threadNum; + } + + private Map filterShardsReachEnd(Map ownedShards, Map allCheckpoints) { + Map allShardToProcess = new HashMap(); + for (Map.Entry shard : ownedShards.entrySet()) { + String shardId = shard.getKey(); + if (allCheckpoints.get(shardId).getCheckpoint().equals(CheckpointPosition.SHARD_END)) { + LOG.info("Shard has reach end, no need to process. 
ShardId: {}.", shardId); + // but we need to set checkpoint for this job + checkpointInfoTracker.writeCheckpoint(streamJob.getEndTimeInMillis(), + new ShardCheckpoint(shardId, streamJob.getVersion(), CheckpointPosition.SHARD_END, 0), 0); + } else { + allShardToProcess.put(shard.getKey(), shard.getValue()); + } + } + return allShardToProcess; + } + + public void startRead(RecordSender recordSender) { + int threadPoolSize = calcThreadPoolSize(); + ExecutorService executorService = new ThreadPoolExecutor( + 0, threadPoolSize, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue(ownedShards.size())); + LOG.info("Start thread pool with size: {}, ShardsCount: {}, SlaveCount: {}.", threadPoolSize, ownedShards.size(), slaveNumber.get()); + try { + Map allShardToProcess = filterShardsReachEnd(ownedShards, shardToCheckpointMap); + Map shardProcessingState = new HashMap(); + for (String shardId : allShardToProcess.keySet()) { + shardProcessingState.put(shardId, ShardStatusChecker.ProcessState.BLOCK); + } + + List processors = new ArrayList(); + + // 获取当前所有shard的checkpoint状态,对当前的owned shard执行对应的任务。 + long lastLogTime = System.currentTimeMillis(); + while (!allShardToProcess.isEmpty()) { + Map checkpointMap = checkpointInfoTracker.getAllCheckpoints(streamJob.getEndTimeInMillis()); + + // 检查当前job的checkpoint,排查是否有其他job误入或者出现不明的shard。 + checkCheckpoint(checkpointMap, streamJob); + + // 找到需要处理的shard以及确定不需要被处理的shard + List shardToProcess = new ArrayList(); + List shardNoNeedProcess = new ArrayList(); + List shardBlocked = new ArrayList(); + ShardStatusChecker.findShardToProcess(allShardToProcess, allShardsMap, checkpointMap, shardToProcess, shardNoNeedProcess, shardBlocked); + + // 将不需要处理的shard,设置checkpoint,代表本轮处理完毕,且checkpoint为TRIM_HORIZON + for (StreamShard shard : shardNoNeedProcess) { + LOG.info("Skip shard: {}.", shard.getShardId()); + ShardCheckpoint checkpoint = new ShardCheckpoint(shard.getShardId(), streamJob.getVersion(), CheckpointPosition.TRIM_HORIZON, 0); + checkpointInfoTracker.writeCheckpoint(config.getEndTimestampMillis(), checkpoint, 0); + shardProcessingState.put(shard.getShardId(), ShardStatusChecker.ProcessState.SKIP); + } + + for (StreamShard shard : shardToProcess) { + RecordProcessor processor = new RecordProcessor(ots, config, streamJob, shard, + shardToCheckpointMap.get(shard.getShardId()), !findCheckpoints, checkpointInfoTracker, recordSender); + processor.initialize(); + executorService.submit(processor); + processors.add(processor); + shardProcessingState.put(shard.getShardId(), ShardStatusChecker.ProcessState.READY); + } + + // 等待所有任务执行完毕,并且检查每个任务的状态,检查是否发生hang或长时间没有数据 + checkProcessorRunningStatus(processors); + + if (!allShardToProcess.isEmpty()) { + TimeUtils.sleepMillis(config.getSlaveLoopInterval()); + } + + long now = System.currentTimeMillis(); + if (now - lastLogTime > config.getSlaveLoggingStatusInterval()) { + logShardProcessingState(shardProcessingState); + LOG.info("AllCheckpoints: {}", checkpointMap); + lastLogTime = now; + } + } + + LOG.info("All shard is processing."); + logShardProcessingState(shardProcessingState); + // 等待当前分配的shard的读取任务执行完毕后退出。 + while (true) { + boolean finished = true; + checkProcessorRunningStatus(processors); + for (RecordProcessor processor : processors) { + RecordProcessor.State state = processor.getState(); + if (state != RecordProcessor.State.SUCCEED) { + LOG.info("Shard is processing, shardId: {}, status: {}.", processor.getShard().getShardId(), state); + finished = false; + } + } + + if (finished) { + LOG.info("All record processor 
finished."); + break; + } + + TimeUtils.sleepMillis(config.getSlaveLoopInterval()); + } + + } catch (TableStoreException ex) { + throw DataXException.asDataXException(new OTSReaderError(ex.getErrorCode(), "SyncClientInterface Error"), ex.toString(), ex); + } catch (OTSStreamReaderException ex) { + LOG.error("SlaveId: {}, OwnedShards: {}.", slaveId, ownedShards, ex); + throw DataXException.asDataXException(OTSReaderError.ERROR, ex.toString(), ex); + } catch (Exception ex) { + LOG.error("SlaveId: {}, OwnedShards: {}.", slaveId, ownedShards, ex); + throw DataXException.asDataXException(OTSReaderError.ERROR, ex.toString(), ex); + } finally { + try { + executorService.shutdownNow(); + executorService.awaitTermination(1, TimeUnit.MINUTES); + } catch (Exception e) { + LOG.error("Shutdown encounter exception.", e); + } + } + } + + private void logShardProcessingState(Map shardProcessingState) { + StringBuilder sb = new StringBuilder(); + sb.append("Shard running status: \n"); + for (Map.Entry entry : shardProcessingState.entrySet()) { + sb.append("ShardId:").append(entry.getKey()). + append(", ProcessingState: ").append(entry.getValue()).append("\n"); + } + LOG.info("Version: {}, Reader status: {}", streamJob.getVersion(), sb.toString()); + } + + private void checkProcessorRunningStatus(List processors) { + long now = System.currentTimeMillis(); + for (RecordProcessor processor : processors) { + RecordProcessor.State state = processor.getState(); + StreamShard shard = processor.getShard(); + if (state == RecordProcessor.State.READY || state == RecordProcessor.State.SUCCEED) { + continue; + } else if (state == RecordProcessor.State.INTERRUPTED || state == RecordProcessor.State.FAILED) { + throw new OTSStreamReaderException("Read task for shard '" + shard.getShardId() + "' has failed."); + } else { // status = RUNNING + long lastProcessTime = processor.getLastProcessTime(); + if (now - lastProcessTime > OTSStreamReaderConstants.MAX_ONCE_PROCESS_TIME_MILLIS) { + throw new OTSStreamReaderException("Process shard timeout, ShardId:" + shard.getShardId() + ", LastProcessTime:" + + lastProcessTime + ", MaxProcessTime:" + OTSStreamReaderConstants.MAX_ONCE_PROCESS_TIME_MILLIS + ", Now:" + now + "."); + } + } + } + } + + void checkCheckpoint(Map checkpointMap, StreamJob streamJob) { + for (Map.Entry entry : checkpointMap.entrySet()) { + String shardId = entry.getKey(); + String version = entry.getValue().getVersion(); + if (!streamJob.getShardIds().contains(shardId)) { + LOG.info("Shard '{}' is not found in job. Job: {}.", entry.getKey(), streamJob.getShardIds()); + throw DataXException.asDataXException(OTSReaderError.ERROR, "Some shard from checkpoint is not belong to this job: " + shardId); + } + + if (!version.equals(streamJob.getVersion())) { + LOG.info("Version of shard '{}' in checkpoint is not equal with version of this job. 
" + + "Checkpoint version: {}, job version: {}.", shardId, version, streamJob.getVersion()); + throw DataXException.asDataXException(OTSReaderError.ERROR, "Version of checkpoint is not equal with version of this job."); + } + } + } + + public void close() { + ots.shutdown(); + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/Mode.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/Mode.java new file mode 100644 index 0000000000..af394b29f2 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/Mode.java @@ -0,0 +1,8 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.config; + +public enum Mode { + + MULTI_VERSION, + + SINGLE_VERSION_AND_UPDATE_ONLY +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSRetryStrategyForStreamReader.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSRetryStrategyForStreamReader.java new file mode 100644 index 0000000000..a7e8acbfb7 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSRetryStrategyForStreamReader.java @@ -0,0 +1,81 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.config; + +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.OTSErrorCode; +import com.alicloud.openservices.tablestore.*; +import com.alicloud.openservices.tablestore.model.RetryStrategy; + +import java.util.Arrays; +import java.util.List; + +public class OTSRetryStrategyForStreamReader implements RetryStrategy { + + private int maxRetries = 30; + private static long retryPauseScaleTimeMillis = 100; + private static long maxPauseTimeMillis = 10 * 1000; + private int retries = 0; + + private static List noRetryErrorCode = Arrays.asList( + OTSErrorCode.AUTHORIZATION_FAILURE, + OTSErrorCode.CONDITION_CHECK_FAIL, + OTSErrorCode.INVALID_PARAMETER, + OTSErrorCode.INVALID_PK, + OTSErrorCode.OBJECT_ALREADY_EXIST, + OTSErrorCode.OBJECT_NOT_EXIST, + OTSErrorCode.OUT_OF_COLUMN_COUNT_LIMIT, + OTSErrorCode.OUT_OF_ROW_SIZE_LIMIT, + OTSErrorCode.REQUEST_TOO_LARGE, + OTSErrorCode.TRIMMED_DATA_ACCESS + ); + + private boolean canRetry(Exception ex) { + if (ex instanceof TableStoreException) { + if (noRetryErrorCode.contains(((TableStoreException) ex).getErrorCode())) { + return false; + } + return true; + } else if (ex instanceof ClientException) { + return true; + } else { + return false; + } + } + + public boolean shouldRetry(String action, Exception ex, int retries) { + if (retries > maxRetries) { + return false; + } + if (canRetry(ex)) { + return true; + } + return false; + } + + public void setMaxRetries(int maxRetries) { + this.maxRetries = maxRetries; + } + + public int getMaxRetries() { + return this.maxRetries; + } + + @Override + public RetryStrategy clone() { + return new OTSRetryStrategyForStreamReader(); + } + + @Override + public int getRetries() { + return retries; + } + + @Override + public long nextPause(String action, Exception ex) { + if (!shouldRetry(action, ex, retries)) { + return 0; + } + + long pause = Math.min((int)Math.pow(2, retries) * retryPauseScaleTimeMillis, maxPauseTimeMillis); + ++retries; + return pause; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSStreamReaderConfig.java 
b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSStreamReaderConfig.java new file mode 100644 index 0000000000..c89d7a3777 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSStreamReaderConfig.java @@ -0,0 +1,325 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.config; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSStreamReaderException; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.ParamChecker; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.TimeUtils; +import com.alicloud.openservices.tablestore.SyncClientInterface; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.ParseException; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +public class OTSStreamReaderConfig { + + private static final Logger LOG = LoggerFactory.getLogger(OTSStreamReaderConfig.class); + + private static final String KEY_OTS_ENDPOINT = "endpoint"; + private static final String KEY_OTS_ACCESSID = "accessId"; + private static final String KEY_OTS_ACCESSKEY = "accessKey"; + private static final String KEY_OTS_INSTANCE_NAME = "instanceName"; + private static final String KEY_DATA_TABLE_NAME = "dataTable"; + private static final String KEY_STATUS_TABLE_NAME = "statusTable"; + private static final String KEY_START_TIMESTAMP_MILLIS = "startTimestampMillis"; + private static final String KEY_END_TIMESTAMP_MILLIS = "endTimestampMillis"; + private static final String KEY_START_TIME_STRING = "startTimeString"; + private static final String KEY_END_TIME_STRING = "endTimeString"; + private static final String KEY_IS_EXPORT_SEQUENCE_INFO = "isExportSequenceInfo"; + private static final String KEY_DATE = "date"; + private static final String KEY_MAX_RETRIES = "maxRetries"; + private static final String KEY_MODE = "mode"; + private static final String KEY_COLUMN = "column"; + private static final String KEY_THREAD_NUM = "threadNum"; + + private static final int DEFAULT_MAX_RETRIES = 30; + private static final long DEFAULT_SLAVE_LOOP_INTERVAL = 10 * TimeUtils.SECOND_IN_MILLIS; + private static final long DEFAULT_SLAVE_LOGGING_STATUS_INTERVAL = 60 * TimeUtils.SECOND_IN_MILLIS; + + private String endpoint; + private String accessId; + private String accessKey; + private String instanceName; + private String dataTable; + private String statusTable; + private long startTimestampMillis; + private long endTimestampMillis; + private boolean isExportSequenceInfo; + private int maxRetries = DEFAULT_MAX_RETRIES; + private int threadNum = 32; + private long slaveLoopInterval = DEFAULT_SLAVE_LOOP_INTERVAL; + private long slaveLoggingStatusInterval = DEFAULT_SLAVE_LOGGING_STATUS_INTERVAL; + + private Mode mode; + private List columns; + + private transient SyncClientInterface otsForTest; + + public String getEndpoint() { + return endpoint; + } + + public void setEndpoint(String endpoint) { + this.endpoint = endpoint; + } + + public String getAccessId() { + return accessId; + } + + public void setAccessId(String accessId) { + this.accessId = accessId; + } + + public String getAccessKey() { + return accessKey; + } + + public void setAccessKey(String accessKey) { + this.accessKey = accessKey; + } + + public String getInstanceName() { + return instanceName; + } + + public void setInstanceName(String instanceName) { + this.instanceName = 
instanceName; + } + + public String getDataTable() { + return dataTable; + } + + public void setDataTable(String dataTable) { + this.dataTable = dataTable; + } + + public String getStatusTable() { + return statusTable; + } + + public void setStatusTable(String statusTable) { + this.statusTable = statusTable; + } + + public long getStartTimestampMillis() { + return startTimestampMillis; + } + + public void setStartTimestampMillis(long startTimestampMillis) { + this.startTimestampMillis = startTimestampMillis; + } + + public long getEndTimestampMillis() { + return endTimestampMillis; + } + + public void setEndTimestampMillis(long endTimestampMillis) { + this.endTimestampMillis = endTimestampMillis; + } + + public boolean isExportSequenceInfo() { + return isExportSequenceInfo; + } + + public void setIsExportSequenceInfo(boolean isExportSequenceInfo) { + this.isExportSequenceInfo = isExportSequenceInfo; + } + + public Mode getMode() { + return mode; + } + + public void setMode(Mode mode) { + this.mode = mode; + } + + public List getColumns() { + return columns; + } + + public void setColumns(List columns) { + this.columns = columns; + } + + private static void parseConfigForSingleVersionAndUpdateOnlyMode(OTSStreamReaderConfig config, Configuration param) { + try { + List values = param.getList(KEY_COLUMN); + if (values == null) { + config.setColumns(new ArrayList()); + return; + } + + List columns = new ArrayList(); + for (Object item : values) { + if (item instanceof Map) { + String columnName = (String) ((Map) item).get("name"); + columns.add(columnName); + } else { + throw new IllegalArgumentException("The item of column must be map object, please check your input."); + } + } + config.setColumns(columns); + } catch (RuntimeException ex) { + throw new OTSStreamReaderException("Parse column fail, please check your config.", ex); + } + } + + public static OTSStreamReaderConfig load(Configuration param) { + OTSStreamReaderConfig config = new OTSStreamReaderConfig(); + + config.setEndpoint(ParamChecker.checkStringAndGet(param, KEY_OTS_ENDPOINT, true)); + config.setAccessId(ParamChecker.checkStringAndGet(param, KEY_OTS_ACCESSID, true)); + config.setAccessKey(ParamChecker.checkStringAndGet(param, KEY_OTS_ACCESSKEY, true)); + config.setInstanceName(ParamChecker.checkStringAndGet(param, KEY_OTS_INSTANCE_NAME, true)); + config.setDataTable(ParamChecker.checkStringAndGet(param, KEY_DATA_TABLE_NAME, true)); + config.setStatusTable(ParamChecker.checkStringAndGet(param, KEY_STATUS_TABLE_NAME, true)); + config.setIsExportSequenceInfo(param.getBool(KEY_IS_EXPORT_SEQUENCE_INFO, false)); + + if (param.getInt(KEY_THREAD_NUM) != null) { + config.setThreadNum(param.getInt(KEY_THREAD_NUM)); + } + + if (param.getString(KEY_DATE) == null && + (param.getLong(KEY_START_TIMESTAMP_MILLIS) == null || param.getLong(KEY_END_TIMESTAMP_MILLIS) == null) && + (param.getLong(KEY_START_TIME_STRING) == null || param.getLong(KEY_END_TIME_STRING) == null)) { + throw new OTSStreamReaderException("Must set date or time range millis or time range string, please check your config."); + } + + if (param.get(KEY_DATE) != null && + (param.getLong(KEY_START_TIMESTAMP_MILLIS) != null || param.getLong(KEY_END_TIMESTAMP_MILLIS) != null) && + (param.getLong(KEY_START_TIME_STRING) != null || param.getLong(KEY_END_TIME_STRING) != null)) { + throw new OTSStreamReaderException("Can't set date and time range millis and time range string, please check your config."); + } + + if (param.get(KEY_DATE) != null && + 
(param.getLong(KEY_START_TIMESTAMP_MILLIS) != null || param.getLong(KEY_END_TIMESTAMP_MILLIS) != null)) { + throw new OTSStreamReaderException("Can't set date and time range both, please check your config."); + } + + if (param.get(KEY_DATE) != null && + (param.getLong(KEY_START_TIME_STRING) != null || param.getLong(KEY_END_TIME_STRING) != null)) { + throw new OTSStreamReaderException("Can't set date and time range string both, please check your config."); + } + + if ((param.getLong(KEY_START_TIMESTAMP_MILLIS) != null || param.getLong(KEY_END_TIMESTAMP_MILLIS) != null)&& + (param.getLong(KEY_START_TIME_STRING) != null || param.getLong(KEY_END_TIME_STRING) != null)) { + throw new OTSStreamReaderException("Can't set time range millis and time range string both, please check your config."); + } + + if (param.getString(KEY_START_TIME_STRING) != null && + param.getString(KEY_END_TIME_STRING) != null) { + String startTime=ParamChecker.checkStringAndGet(param, KEY_START_TIME_STRING, true); + String endTime=ParamChecker.checkStringAndGet(param, KEY_END_TIME_STRING, true); + try { + long startTimestampMillis = TimeUtils.parseTimeStringToTimestampMillis(startTime); + config.setStartTimestampMillis(startTimestampMillis); + } catch (Exception ex) { + throw new OTSStreamReaderException("Can't parse startTimeString: " + startTime); + } + try { + long endTimestampMillis = TimeUtils.parseTimeStringToTimestampMillis(endTime); + config.setEndTimestampMillis(endTimestampMillis); + } catch (Exception ex) { + throw new OTSStreamReaderException("Can't parse startTimeString: " + startTime); + } + + }else if (param.getString(KEY_DATE) == null) { + config.setStartTimestampMillis(param.getLong(KEY_START_TIMESTAMP_MILLIS)); + config.setEndTimestampMillis(param.getLong(KEY_END_TIMESTAMP_MILLIS)); + } else { + String date = ParamChecker.checkStringAndGet(param, KEY_DATE, true); + try { + long startTimestampMillis = TimeUtils.parseDateToTimestampMillis(date); + config.setStartTimestampMillis(startTimestampMillis); + config.setEndTimestampMillis(startTimestampMillis + TimeUtils.DAY_IN_MILLIS); + } catch (ParseException ex) { + throw new OTSStreamReaderException("Can't parse date: " + date); + } + } + + + + + if (config.getStartTimestampMillis() >= config.getEndTimestampMillis()) { + throw new OTSStreamReaderException("EndTimestamp must be larger than startTimestamp."); + } + + config.setMaxRetries(param.getInt(KEY_MAX_RETRIES, DEFAULT_MAX_RETRIES)); + + String mode = param.getString(KEY_MODE); + if (mode != null) { + if (mode.equalsIgnoreCase(Mode.SINGLE_VERSION_AND_UPDATE_ONLY.name())) { + config.setMode(Mode.SINGLE_VERSION_AND_UPDATE_ONLY); + parseConfigForSingleVersionAndUpdateOnlyMode(config, param); + } else { + throw new OTSStreamReaderException("Unsupported Mode: " + mode + ", please check your config."); + } + } else { + config.setMode(Mode.MULTI_VERSION); + List values = param.getList(KEY_COLUMN); + if (values != null) { + throw new OTSStreamReaderException("The multi version mode doesn't support setting columns."); + } + } + + LOG.info("endpoint: {}, accessId: {}, accessKey: {}, instanceName: {}, dataTableName: {}, statusTableName: {}," + + " isExportSequenceInfo: {}, startTimestampMillis: {}, endTimestampMillis:{}, maxRetries:{}.", config.getEndpoint(), + config.getAccessId(), config.getAccessKey(), config.getInstanceName(), config.getDataTable(), + config.getStatusTable(), config.isExportSequenceInfo(), config.getStartTimestampMillis(), + config.getEndTimestampMillis(), config.getMaxRetries()); + + return 
config; + } + + /** + * test use + * @return + */ + public SyncClientInterface getOtsForTest() { + return otsForTest; + } + + /** + * test use + * @param otsForTest + */ + public void setOtsForTest(SyncClientInterface otsForTest) { + this.otsForTest = otsForTest; + } + + public int getMaxRetries() { + return maxRetries; + } + + public void setMaxRetries(int maxRetries) { + this.maxRetries = maxRetries; + } + + public int getThreadNum() { + return threadNum; + } + + public void setSlaveLoopInterval(long slaveLoopInterval) { + this.slaveLoopInterval = slaveLoopInterval; + } + + public void setSlaveLoggingStatusInterval(long slaveLoggingStatusInterval) { + this.slaveLoggingStatusInterval = slaveLoggingStatusInterval; + } + + public long getSlaveLoopInterval() { + return slaveLoopInterval; + } + + public long getSlaveLoggingStatusInterval() { + return slaveLoggingStatusInterval; + } + + public void setThreadNum(int threadNum) { + this.threadNum = threadNum; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSStreamReaderConstants.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSStreamReaderConstants.java new file mode 100644 index 0000000000..19db148a71 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/OTSStreamReaderConstants.java @@ -0,0 +1,37 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.config; + +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.TimeUtils; + +public class OTSStreamReaderConstants { + + public static long BEFORE_OFFSET_TIME_MILLIS = 10 * TimeUtils.MINUTE_IN_MILLIS; + + public static long AFTER_OFFSET_TIME_MILLIS = 5 * TimeUtils.MINUTE_IN_MILLIS; + + public static final int STATUS_TABLE_TTL = 30 * TimeUtils.DAY_IN_SEC; + + public static final long MAX_WAIT_TABLE_READY_TIME_MILLIS = 2 * TimeUtils.MINUTE_IN_MILLIS; + + public static final long MAX_OTS_UNAVAILABLE_TIME = 30 * TimeUtils.MINUTE_IN_MILLIS; + + public static final long MAX_ONCE_PROCESS_TIME_MILLIS = MAX_OTS_UNAVAILABLE_TIME; + + public static final String CONF = "conf"; + + public static final String STREAM_JOB = "STREAM_JOB"; + public static final String OWNED_SHARDS = "OWNED_SHARDS"; + public static final String ALL_SHARDS = "ALL_SHARDS"; + + + static { + String beforeOffsetMillis = System.getProperty("BEFORE_OFFSET_TIME_MILLIS"); + if (beforeOffsetMillis != null) { + BEFORE_OFFSET_TIME_MILLIS = Long.valueOf(beforeOffsetMillis); + } + + String afterOffsetMillis = System.getProperty("AFTER_OFFSET_TIME_MILLIS"); + if (afterOffsetMillis != null) { + AFTER_OFFSET_TIME_MILLIS = Long.valueOf(afterOffsetMillis); + } + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/StatusTableConstants.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/StatusTableConstants.java new file mode 100644 index 0000000000..344847071c --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/config/StatusTableConstants.java @@ -0,0 +1,67 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.config; + +import com.alicloud.openservices.tablestore.model.PrimaryKeySchema; +import com.alicloud.openservices.tablestore.model.PrimaryKeyType; + +import java.util.Arrays; +import java.util.List; + +public class StatusTableConstants { + // status table's 
schema
+ public static String PK1_STREAM_ID = "StreamId";
+ public static String PK2_STATUS_TYPE = "StatusType";
+ public static String PK3_STATUS_VALUE = "StatusValue";
+
+ public static List STATUS_TABLE_PK_SCHEMA = Arrays.asList(
+ new PrimaryKeySchema(PK1_STREAM_ID, PrimaryKeyType.STRING),
+ new PrimaryKeySchema(PK2_STATUS_TYPE, PrimaryKeyType.STRING),
+ new PrimaryKeySchema(PK3_STATUS_VALUE, PrimaryKeyType.STRING));
+
+ /**
+ * Records the checkpoints of all shards at a given time.
+ * The layout is as follows:
+ *
+ * PK1 : StreamId : "dataTable_131231"
+ * PK2 : StatusType : "CheckpointForDataxReader"
+ *
+ * Checkpoint rows:
+ * PK3 : StatusValue : "1444357620415 shard1" (Time + \t + ShardId)
+ * Column : Checkpoint : "checkpoint"
+ * ShardCount row:
+ * PK3 : StatusValue : "1444357620415" (Time)
+ * Column : ShardCount : 3
+ *
+ */
+ public static String STATUS_TYPE_CHECKPOINT = "CheckpointForDataxReader";
+
+ // Records the run information of each DataX job, including the shard list, stream id, version, etc.
+ public static String STATUS_TYPE_JOB_DESC = "DataxJobDesc";
+
+ /**
+ * Records the checkpoint of a single shard at a given time.
+ * PK1: StreamId : "dataTable_131231"
+ * PK2: StatusType: "ShardTimeCheckpointForDataxReader"
+ * PK3: StatusValue: "shard1 1444357620415" (ShardId + \t + Time)
+ * Column: Checkpoint : "checkpoint"
+ */
+ public static String STATUS_TYPE_SHARD_CHECKPOINT = "ShardTimeCheckpointForDataxReader";
+
+ public static String TIME_SHARD_SEPARATOR = "\t";
+ public static String LARGEST_SHARD_ID = String.valueOf((char)127); // Used to bound the range of GetRange scans.
+
+ // Attribute columns of the checkpoint rows
+ public static String CHECKPOINT_COLUMN_NAME = "Checkpoint";
+ public static String VERSION_COLUMN_NAME = "Version";
+ public static String SKIP_COUNT_COLUMN_NAME = "SkipCount";
+ public static String SHARDCOUNT_COLUMN_NAME = "ShardCount";
+
+ // Attribute columns of the job-description rows
+ public static final int COLUMN_MAX_SIZE = 64 * 1024;
+ public static final String JOB_SHARD_LIST_PREFIX_COLUMN_NAME = "ShardIds_";
+ public static final String JOB_VERSION_COLUMN_NAME = "Version";
+ public static final String JOB_TABLE_NAME_COLUMN_NAME = "TableName";
+ public static final String JOB_STREAM_ID_COLUMN_NAME = "JobStreamId";
+ public static final String JOB_START_TIME_COLUMN_NAME = "StartTime";
+ public static final String JOB_END_TIME_COLUMN_NAME = "EndTime";
+
+}
diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/CheckpointTimeTracker.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/CheckpointTimeTracker.java
new file mode 100644
index 0000000000..8f98cbf20c
--- /dev/null
+++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/CheckpointTimeTracker.java
@@ -0,0 +1,357 @@
+package com.alibaba.datax.plugin.reader.otsstreamreader.internal.core;
+
+import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.StatusTableConstants;
+import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.ShardCheckpoint;
+import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.StreamJob;
+import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.GsonParser;
+import com.alicloud.openservices.tablestore.*;
+import com.alicloud.openservices.tablestore.core.protocol.OtsInternalApi;
+import com.alicloud.openservices.tablestore.model.*;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.*;
+import java.util.concurrent.ConcurrentHashMap;
+
+public class CheckpointTimeTracker {
+
+ private static final Logger LOG = 
LoggerFactory.getLogger(CheckpointTimeTracker.class); + + private final SyncClientInterface client; + private final String statusTable; + private final String streamId; + + public CheckpointTimeTracker(SyncClientInterface client, String statusTable, String streamId) { + this.client = client; + this.statusTable = statusTable; + this.streamId = streamId; + } + + /** + * 返回timestamp时刻记录了checkpoint的shard的个数,用于检查checkpoints是否完整。 + * + * @param timestamp + * @return 如果status表中未记录shardCount信息,返回-1 + */ + public int getShardCountForCheck(long timestamp) { + PrimaryKey primaryKey = getPrimaryKeyForShardCount(timestamp); + GetRowRequest getRowRequest = getOTSRequestForGet(primaryKey); + Row row = client.getRow(getRowRequest).getRow(); + if (row == null) { + return -1; + } + int shardCount = (int) row.getColumn(StatusTableConstants.SHARDCOUNT_COLUMN_NAME).get(0).getValue().asLong(); + LOG.info("GetShardCount: timestamp: {}, shardCount: {}.", timestamp, shardCount); + return shardCount; + } + + + /** + * 从状态表中读取所有的checkpoint。 + * + * @param timestamp + * @return + */ + public Map getAllCheckpoints(long timestamp) { + Iterator rowIter = getRangeIteratorForGetAllCheckpoints(client, timestamp); + List rows = readAllRows(rowIter); + + Map checkpointMap = new HashMap(); + for (Row row : rows) { + String pk3 = row.getPrimaryKey().getPrimaryKeyColumn(StatusTableConstants.PK3_STATUS_VALUE).getValue().asString(); + String shardId = pk3.split(StatusTableConstants.TIME_SHARD_SEPARATOR)[1]; + + ShardCheckpoint checkpoint = ShardCheckpoint.fromRow(shardId, row); + checkpointMap.put(shardId, checkpoint); + } + + if (LOG.isDebugEnabled()) { + StringBuilder stringBuilder = new StringBuilder(); + stringBuilder.append("GetAllCheckpoints: size: " + checkpointMap.size()); + for (String shardId : checkpointMap.keySet()) { + stringBuilder.append(", [shardId: "); + stringBuilder.append(shardId); + stringBuilder.append(", checkpoint: "); + stringBuilder.append(checkpointMap.get(shardId)); + stringBuilder.append("]"); + } + LOG.debug(stringBuilder.toString()); + } + return checkpointMap; + } + + private List readAllRows(Iterator rowIter) { + List rows = new ArrayList(); + while (rowIter.hasNext()) { + rows.add(rowIter.next()); + } + return rows; + } + + /** + * 设置某个分片某个时间的checkpoint, 用于寻找某个分片在一定区间内较大的checkpoint, 减少扫描的数据量. + * + * @param shardId + * @param timestamp + * @param checkpointValue + */ + public void setShardTimeCheckpoint(String shardId, long timestamp, String checkpointValue) { + PutRowRequest putRowRequest = getOTSRequestForSetShardTimeCheckpoint(shardId, timestamp, checkpointValue); + client.putRow(putRowRequest); + LOG.info("SetShardTimeCheckpoint: timestamp: {}, shardId: {}, checkpointValue: {}.", timestamp, shardId, checkpointValue); + } + + /** + * 获取某个分片在某个时间范围内最大的checkpoint, 用于寻找某个分片在一定区间内较大的checkpoint, 减少扫描的数据量. 
+ * 查询的范围为左开右闭。 + * + * @param shardId + * @param startTimestamp + * @param endTimestamp + * @return + */ + public String getShardLargestCheckpointInTimeRange(String shardId, long startTimestamp, long endTimestamp) { + PrimaryKey startPk = getPrimaryKeyForShardTimeCheckpoint(shardId, endTimestamp); + PrimaryKey endPk = getPrimaryKeyForShardTimeCheckpoint(shardId, startTimestamp); + RangeRowQueryCriteria rangeRowQueryCriteria = new RangeRowQueryCriteria(statusTable); + rangeRowQueryCriteria.setMaxVersions(1); + rangeRowQueryCriteria.setDirection(Direction.BACKWARD); + rangeRowQueryCriteria.setLimit(1); + rangeRowQueryCriteria.setInclusiveStartPrimaryKey(startPk); + rangeRowQueryCriteria.setExclusiveEndPrimaryKey(endPk); + GetRangeRequest getRangeRequest = new GetRangeRequest(rangeRowQueryCriteria); + + GetRangeResponse result = client.getRange(getRangeRequest); + if (result.getRows().isEmpty()) { + return null; + } else { + try { + String checkpoint = result.getRows().get(0).getLatestColumn(StatusTableConstants.CHECKPOINT_COLUMN_NAME).getValue().asString(); + String time = result.getRows().get(0).getPrimaryKey().getPrimaryKeyColumn(2).getValue().asString().split(StatusTableConstants.TIME_SHARD_SEPARATOR)[1]; + LOG.info("find checkpoint for shard {} in time {}.", shardId, time); + return checkpoint; + } catch (Exception ex) { + LOG.error("Error when get shard time checkpoint.", ex); + return null; + } + } + } + + public void clearAllCheckpoints(long timestamp) { + Iterator rowIter = getRangeIteratorForGetAllCheckpoints(client, timestamp); + List rows = readAllRows(rowIter); + + for (Row row : rows) { + DeleteRowRequest deleteRowRequest = getOTSRequestForDelete(row.getPrimaryKey()); + client.deleteRow(deleteRowRequest); + } + + LOG.info("ClearAllCheckpoints: timestamp: {}.", timestamp); + } + + private PrimaryKey getPrimaryKeyForCheckpoint(long timestamp, String shardId) { + String statusValue = String.format("%16d", timestamp) + StatusTableConstants.TIME_SHARD_SEPARATOR + shardId; + + List pkCols = new ArrayList(); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK1_STREAM_ID, PrimaryKeyValue.fromString(streamId))); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK2_STATUS_TYPE, PrimaryKeyValue.fromString(StatusTableConstants.STATUS_TYPE_CHECKPOINT))); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK3_STATUS_VALUE, PrimaryKeyValue.fromString(statusValue))); + + PrimaryKey primaryKey = new PrimaryKey(pkCols); + return primaryKey; + } + + private PrimaryKey getPrimaryKeyForJobDesc(long timestamp) { + String statusValue = String.format("%16d", timestamp); + + List pkCols = new ArrayList(); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK1_STREAM_ID, PrimaryKeyValue.fromString(streamId))); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK2_STATUS_TYPE, PrimaryKeyValue.fromString(StatusTableConstants.STATUS_TYPE_JOB_DESC))); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK3_STATUS_VALUE, PrimaryKeyValue.fromString(statusValue))); + + PrimaryKey primaryKey = new PrimaryKey(pkCols); + return primaryKey; + } + + public PrimaryKey getPrimaryKeyForShardCount(long timestamp) { + String statusValue = String.format("%16d", timestamp); + + List pkCols = new ArrayList(); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK1_STREAM_ID, PrimaryKeyValue.fromString(streamId))); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK2_STATUS_TYPE, PrimaryKeyValue.fromString(StatusTableConstants.STATUS_TYPE_CHECKPOINT))); + pkCols.add(new 
PrimaryKeyColumn(StatusTableConstants.PK3_STATUS_VALUE, PrimaryKeyValue.fromString(statusValue))); + + PrimaryKey primaryKey = new PrimaryKey(pkCols); + return primaryKey; + } + + private PrimaryKey getPrimaryKeyForShardTimeCheckpoint(String shardId, long timestamp) { + String statusValue = shardId + StatusTableConstants.TIME_SHARD_SEPARATOR + String.format("%16d", timestamp); + + List pkCols = new ArrayList(); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK1_STREAM_ID, PrimaryKeyValue.fromString(streamId))); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK2_STATUS_TYPE, PrimaryKeyValue.fromString(StatusTableConstants.STATUS_TYPE_SHARD_CHECKPOINT))); + pkCols.add(new PrimaryKeyColumn(StatusTableConstants.PK3_STATUS_VALUE, PrimaryKeyValue.fromString(statusValue))); + + PrimaryKey primaryKey = new PrimaryKey(pkCols); + return primaryKey; + } + + private PutRowRequest getOTSRequestForSetShardTimeCheckpoint(String shardId, long timestamp, String checkpointValue) { + PrimaryKey primaryKey = getPrimaryKeyForShardTimeCheckpoint(shardId, timestamp); + + RowPutChange rowPutChange = new RowPutChange(statusTable, primaryKey); + rowPutChange.addColumn(StatusTableConstants.CHECKPOINT_COLUMN_NAME, ColumnValue.fromString(checkpointValue)); + + PutRowRequest putRowRequest = new PutRowRequest(rowPutChange); + return putRowRequest; + } + + private GetRowRequest getOTSRequestForGet(PrimaryKey primaryKey) { + SingleRowQueryCriteria rowQueryCriteria = new SingleRowQueryCriteria(statusTable, primaryKey); + rowQueryCriteria.setMaxVersions(1); + + GetRowRequest getRowRequest = new GetRowRequest(rowQueryCriteria); + return getRowRequest; + } + + private Iterator getRangeIteratorForGetAllCheckpoints(SyncClientInterface client, long timestamp) { + RangeIteratorParameter param = new RangeIteratorParameter(statusTable); + + PrimaryKey startPk = getPrimaryKeyForCheckpoint(timestamp, ""); + PrimaryKey endPk = getPrimaryKeyForCheckpoint(timestamp, StatusTableConstants.LARGEST_SHARD_ID); + param.setMaxVersions(1); + param.setInclusiveStartPrimaryKey(startPk); + param.setExclusiveEndPrimaryKey(endPk); + + return client.createRangeIterator(param); + } + + private DeleteRowRequest getOTSRequestForDelete(PrimaryKey primaryKey) { + RowDeleteChange rowDeleteChange = new RowDeleteChange(statusTable, primaryKey); + DeleteRowRequest deleteRowRequest = new DeleteRowRequest(rowDeleteChange); + return deleteRowRequest; + } + + public void writeCheckpoint(long timestamp, ShardCheckpoint checkpoint) { + writeCheckpoint(timestamp, checkpoint, 0); + } + + public void writeCheckpoint(long timestamp, ShardCheckpoint checkpoint, long sendRecordCount) { + LOG.info("Write checkpoint of time '{}' of shard '{}'.", timestamp, checkpoint.getShardId()); + PrimaryKey primaryKey = getPrimaryKeyForCheckpoint(timestamp, checkpoint.getShardId()); + + RowPutChange rowChange = new RowPutChange(statusTable, primaryKey); + checkpoint.serializeColumn(rowChange); + + if (sendRecordCount > 0) { + rowChange.addColumn("SendRecordCount", ColumnValue.fromLong(sendRecordCount)); + } + + PutRowRequest request = new PutRowRequest(); + request.setRowChange(rowChange); + client.putRow(request); + } + + public ShardCheckpoint readCheckpoint(String shardId, long timestamp) { + PrimaryKey primaryKey = getPrimaryKeyForCheckpoint(timestamp, shardId); + GetRowRequest getRowRequest = getOTSRequestForGet(primaryKey); + Row row = client.getRow(getRowRequest).getRow(); + if (row == null) { + return null; + } + + return ShardCheckpoint.fromRow(shardId, row); + 
} + + public void writeStreamJob(StreamJob streamJob) { + PrimaryKey primaryKey = getPrimaryKeyForJobDesc(streamJob.getEndTimeInMillis()); + + RowPutChange rowChange = new RowPutChange(statusTable); + rowChange.setPrimaryKey(primaryKey); + streamJob.serializeColumn(rowChange); + + PutRowRequest request = new PutRowRequest(); + request.setRowChange(rowChange); + client.putRow(request); + } + + public StreamJob readStreamJob(long timestamp) { + PrimaryKey primaryKey = getPrimaryKeyForJobDesc(timestamp); + GetRowRequest request = getOTSRequestForGet(primaryKey); + + GetRowResponse response = client.getRow(request); + return StreamJob.fromRow(response.getRow()); + } + + /** + * 获取指定timestamp对应的Job的checkpoint,并检查checkpoint是否完整。 + * 若是老版本的Job,则只检查shardCount是否一致。 + * 若是新版本的Job,则除了检查shard id列表完全一致,还需要检查每个shard的checkpoint的version是否与job描述内的一致。 + * + * @param timestamp + * @param streamId + * @param allCheckpoints + * @return 若成功获取上一次Job完整的checkpoint,则返回true,否则返回false + */ + public boolean getAndCheckAllCheckpoints(long timestamp, String streamId, Map allCheckpoints) { + allCheckpoints.clear(); + Map allCheckpointsInTable = getAllCheckpoints(timestamp); + + long shardCount = -1; + boolean checkShardCountOnly = false; + StreamJob streamJob = readStreamJob(timestamp); + if (streamJob == null) { + LOG.info("Stream job is not exist, timestamp: {}.", timestamp); + + // 如果streamJob不存在,则有可能是老版本的Job,尝试读取shardCount + shardCount = getShardCountForCheck(timestamp); + if (shardCount == -1) { + LOG.info("Shard count not found, timestamp: {}.", timestamp); + return false; + } + + checkShardCountOnly = true; + } + + if (checkShardCountOnly) { + if (shardCount != allCheckpointsInTable.size()) { + LOG.info("Shard count not equal, shardCount: {}, checkpointCount: {}.", shardCount, allCheckpoints.size()); + return false; + } + } else { + // 检查streamJob内的信息是否与checkpoint一致 + if (!streamJob.getStreamId().equals(streamId)) { + LOG.info("Stream id of the checkpoint is not equal with current job. StreamIdInCheckpoint: {}, StreamId: {}.", + streamJob.getStreamId(), streamId); + return false; + } + + if (streamJob.getShardIds().size() != allCheckpointsInTable.size()) { + LOG.info( + "Shards in stream job is not equal with checkpoint count. " + + "StreamJob shard count: {}, checkpoint count: {}.", + streamJob.getShardIds().size(), allCheckpointsInTable.size()); + return false; + } + + for (String shardId : streamJob.getShardIds()) { + ShardCheckpoint checkpoint = allCheckpointsInTable.get(shardId); + if (checkpoint == null) { + LOG.info("Checkpoint of shard in job is not found. ShardId: {}.", shardId); + return false; + } + + if (!checkpoint.getVersion().equals(streamJob.getVersion())) { + LOG.info("Version is different. 
Checkpoint: {}, StreamJob: {}.", checkpoint, streamJob); + return false; + } + } + } + + for (Map.Entry entry : allCheckpointsInTable.entrySet()) { + allCheckpoints.put(entry.getKey(), entry.getValue()); + } + return true; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/IStreamRecordSender.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/IStreamRecordSender.java new file mode 100644 index 0000000000..c9d053cb38 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/IStreamRecordSender.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.core; + +import com.alicloud.openservices.tablestore.model.StreamRecord; + +public interface IStreamRecordSender { + + void sendToDatax(StreamRecord streamRecord); + +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/MultiVerModeRecordSender.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/MultiVerModeRecordSender.java new file mode 100644 index 0000000000..36ad99bbd1 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/MultiVerModeRecordSender.java @@ -0,0 +1,142 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.core; + +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.ColumnValueTransformHelper; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSStreamReaderException; +import com.alicloud.openservices.tablestore.model.*; + +/** + * 输出完整的增量变化信息,每一行为一个cell的变更记录,输出样例如下: + * | pk1 | pk2 | column_name | timestamp | column_value | op_type | seq_id | + * | --- | --- | ----------- | --------- | ------------ | ------- | ------ | + * | a | b | col1 | 10928121 | null | DO | 001 | 删除某一列某个特定版本 + * | a | b | col2 | null | null | DA | 002 | 删除某一列所有版本 + * | a | b | null | null | null | DR | 003 | 删除整行 + * | a | b | col1 | 1928821 | abc | U | 004 | 插入一列 + * + */ +public class MultiVerModeRecordSender implements IStreamRecordSender { + + enum OpType { + U, // update + DO, // delete one version + DA, // delete all version + DR // delete row + } + + private final RecordSender dataxRecordSender; + private String shardId; + private final boolean isExportSequenceInfo; + + public MultiVerModeRecordSender(RecordSender dataxRecordSender, String shardId, boolean isExportSequenceInfo) { + this.dataxRecordSender = dataxRecordSender; + this.shardId = shardId; + this.isExportSequenceInfo = isExportSequenceInfo; + } + + @Override + public void sendToDatax(StreamRecord streamRecord) { + int colIdx = 0; + switch (streamRecord.getRecordType()) { + case PUT: + sendToDatax(streamRecord.getPrimaryKey(), OpType.DR, null, + getSequenceInfo(streamRecord, colIdx++)); + for (RecordColumn recordColumn : streamRecord.getColumns()) { + String sequenceInfo = getSequenceInfo(streamRecord, colIdx++); + sendToDatax(streamRecord.getPrimaryKey(), recordColumn, sequenceInfo); + } + break; + case UPDATE: + for (RecordColumn recordColumn : streamRecord.getColumns()) { + String sequenceInfo = getSequenceInfo(streamRecord, colIdx++); + sendToDatax(streamRecord.getPrimaryKey(), 
recordColumn, sequenceInfo); + } + break; + case DELETE: + sendToDatax(streamRecord.getPrimaryKey(), OpType.DR, null, + getSequenceInfo(streamRecord, colIdx++)); + break; + default: + throw new OTSStreamReaderException("Unknown stream record type: " + streamRecord.getRecordType() + "."); + } + } + + private void sendToDatax(PrimaryKey primaryKey, RecordColumn column, String sequenceInfo) { + switch (column.getColumnType()) { + case PUT: + sendToDatax(primaryKey, OpType.U, column.getColumn(), sequenceInfo); + break; + case DELETE_ONE_VERSION: + sendToDatax(primaryKey, OpType.DO, column.getColumn(), sequenceInfo); + break; + case DELETE_ALL_VERSION: + sendToDatax(primaryKey, OpType.DA, column.getColumn(), sequenceInfo); + break; + default: + throw new OTSStreamReaderException("Unknown record column type: " + column.getColumnType() + "."); + } + } + + private void sendToDatax(PrimaryKey primaryKey, OpType opType, Column column, String sequenceInfo) { + Record line = dataxRecordSender.createRecord(); + + for (PrimaryKeyColumn pkCol : primaryKey.getPrimaryKeyColumns()) { + line.addColumn(ColumnValueTransformHelper.otsPrimaryKeyValueToDataxColumn(pkCol.getValue())); + } + + switch (opType) { + case U: + line.addColumn(new StringColumn(column.getName())); + line.addColumn(new LongColumn(column.getTimestamp())); + line.addColumn(ColumnValueTransformHelper.otsColumnValueToDataxColumn(column.getValue())); + line.addColumn(new StringColumn("" + opType)); + if (isExportSequenceInfo) { + line.addColumn(new StringColumn(sequenceInfo)); + } + break; + case DO: + line.addColumn(new StringColumn(column.getName())); + line.addColumn(new LongColumn(column.getTimestamp())); + line.addColumn(new StringColumn(null)); + line.addColumn(new StringColumn("" + opType)); + if (isExportSequenceInfo) { + line.addColumn(new StringColumn(sequenceInfo)); + } + break; + case DA: + line.addColumn(new StringColumn(column.getName())); + line.addColumn(new StringColumn(null)); + line.addColumn(new StringColumn(null)); + line.addColumn(new StringColumn("" + opType)); + if (isExportSequenceInfo) { + line.addColumn(new StringColumn(sequenceInfo)); + } + break; + case DR: + line.addColumn(new StringColumn(null)); + line.addColumn(new StringColumn(null)); + line.addColumn(new StringColumn(null)); + line.addColumn(new StringColumn("" + OpType.DR)); + if (isExportSequenceInfo) { + line.addColumn(new StringColumn(sequenceInfo)); + } + break; + default: + throw new OTSStreamReaderException("Unknown operation type: " + opType + "."); + } + synchronized (dataxRecordSender) { + dataxRecordSender.sendToWriter(line); + } + } + + private String getSequenceInfo(StreamRecord streamRecord, int colIdx) { + int epoch = streamRecord.getSequenceInfo().getEpoch(); + long timestamp = streamRecord.getSequenceInfo().getTimestamp(); + int rowIdx = streamRecord.getSequenceInfo().getRowIndex(); + String sequenceId = String.format("%010d_%020d_%010d_%s:%010d", epoch, timestamp, rowIdx, shardId, colIdx); + return sequenceId; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/OTSStreamReaderChecker.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/OTSStreamReaderChecker.java new file mode 100644 index 0000000000..086d0159a0 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/OTSStreamReaderChecker.java @@ -0,0 +1,157 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.core; 
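+
+// OTSStreamReaderChecker performs the pre-flight checks of an incremental export before any record is read:
+// checkStreamEnabledAndTimeRangeOK() verifies that Stream is enabled on the data table and that the requested
+// startTimestampMillis/endTimestampMillis window lies inside the readable range, roughly
+// [now - expirationTime + BEFORE_OFFSET_TIME_MILLIS, now - AFTER_OFFSET_TIME_MILLIS];
+// checkTableMetaOfStatusTable() verifies that an existing status table matches STATUS_TABLE_PK_SCHEMA, and
+// checkAndCreateStatusTableIfNotExist() (invoked by the master proxy) creates the status table when it is missing.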
+ +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConstants; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConfig; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSStreamReaderException; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.StatusTableConstants; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.ShardCheckpoint; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.StreamJob; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.OTSHelper; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.TimeUtils; +import com.alicloud.openservices.tablestore.*; +import com.alicloud.openservices.tablestore.model.*; +import com.aliyun.openservices.ots.internal.streamclient.Worker; +import com.aliyun.openservices.ots.internal.streamclient.model.CheckpointPosition; +import com.aliyun.openservices.ots.internal.streamclient.model.WorkerStatus; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Date; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +public class OTSStreamReaderChecker { + private static final Logger LOG = LoggerFactory.getLogger(OTSStreamReaderChecker.class); + + + private final SyncClientInterface ots; + private final OTSStreamReaderConfig config; + + public OTSStreamReaderChecker(SyncClientInterface ots, OTSStreamReaderConfig config) { + this.ots = ots; + this.config = config; + } + + /** + * 1. 检查dataTable是否开启了stream。 + * 2. 检查要导出的时间范围是否合理: + * 最大可导出的时间范围为: [now - expirationTime, now] + * 为了避免时间误差影响,允许导出的范围为: [now - expirationTime + beforeOffset, now - afterOffset] + */ + public void checkStreamEnabledAndTimeRangeOK() { + boolean exists = OTSHelper.checkTableExists(ots, config.getDataTable()); + if (!exists) { + throw new OTSStreamReaderException("The data table is not exist."); + } + StreamDetails streamDetails = OTSHelper.getStreamDetails(ots, config.getDataTable()); + if (streamDetails == null || !streamDetails.isEnableStream()) { + throw new OTSStreamReaderException("The stream of data table is not enabled."); + } + long now = System.currentTimeMillis(); + long startTime = config.getStartTimestampMillis(); + long endTime = config.getEndTimestampMillis(); + long beforeOffset = OTSStreamReaderConstants.BEFORE_OFFSET_TIME_MILLIS; + long afterOffset = OTSStreamReaderConstants.AFTER_OFFSET_TIME_MILLIS; + long expirationTime = streamDetails.getExpirationTime() * TimeUtils.HOUR_IN_MILLIS; + + if (startTime < now - expirationTime + beforeOffset) { + throw new OTSStreamReaderException("As expiration time is " + expirationTime + ", so the start timestamp must greater than " + + TimeUtils.getTimeInISO8601(new Date(now - expirationTime + beforeOffset)) + "(" + (now - expirationTime + beforeOffset )+ ")"); + } + + if (endTime > now - afterOffset) { + throw new OTSStreamReaderException("To avoid timing error between different machines, the end timestamp must smaller" + + " than " + TimeUtils.getTimeInISO8601(new Date(now - afterOffset)) + "(" + (now - afterOffset) + ")"); + } + } + + /** + * 检查statusTable的tableMeta + * @param tableMeta + */ + private void checkTableMetaOfStatusTable(TableMeta tableMeta) { + List pkSchema = tableMeta.getPrimaryKeyList(); + if (!pkSchema.equals(StatusTableConstants.STATUS_TABLE_PK_SCHEMA)) { + throw new OTSStreamReaderException("Unexpected table meta in status table, please check your config."); + } + } + + /** 
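+ * Ensures the status table exists: an existing table has its primary-key schema validated,
+ * while a missing table is created with the configured TTL and polled until it becomes ready.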
+ * 检查statusTable是否存在,如果不存在就创建statusTable,并等待表ready。 + */ + public void checkAndCreateStatusTableIfNotExist() { + boolean tableExist = OTSHelper.checkTableExists(ots, config.getStatusTable()); + if (tableExist) { + DescribeTableResponse describeTableResult = OTSHelper.describeTable(ots, config.getStatusTable()); + checkTableMetaOfStatusTable(describeTableResult.getTableMeta()); + } else { + TableMeta tableMeta = new TableMeta(config.getStatusTable()); + tableMeta.addPrimaryKeyColumns(StatusTableConstants.STATUS_TABLE_PK_SCHEMA); + TableOptions tableOptions = new TableOptions(OTSStreamReaderConstants.STATUS_TABLE_TTL, 1); + OTSHelper.createTable(ots, tableMeta, tableOptions); + boolean tableReady = OTSHelper.waitUntilTableReady(ots, config.getStatusTable(), + OTSStreamReaderConstants.MAX_WAIT_TABLE_READY_TIME_MILLIS); + if (!tableReady) { + throw new OTSStreamReaderException("Check table ready timeout, MaxWaitTableReadyTimeMillis:" + + OTSStreamReaderConstants.MAX_WAIT_TABLE_READY_TIME_MILLIS + "."); + } + } + } + + /** + * 尝试从状态表中恢复上一次Job执行结束后的checkpoint。 + * 若恢复成功,则返回true,否则返回false。 + * + * @param checkpointTimeTracker + * @param allShardsMap + *@param streamJob + * @param currentShardCheckpointMap @return + */ + public boolean checkAndSetCheckpoints( + CheckpointTimeTracker checkpointTimeTracker, + Map allShardsMap, + StreamJob streamJob, + Map currentShardCheckpointMap) { + long timestamp = config.getStartTimestampMillis(); + Map allCheckpoints = new HashMap(); + boolean gotCheckpoint = checkpointTimeTracker.getAndCheckAllCheckpoints(timestamp, streamJob.getStreamId(), allCheckpoints); + if (!gotCheckpoint) { + return false; + } + + for (Map.Entry entry : allCheckpoints.entrySet()) { + String shardId = entry.getKey(); + ShardCheckpoint checkpoint = entry.getValue(); + if (!currentShardCheckpointMap.containsKey(shardId)) { + // 发现未读完的shard,并且该shard还不在此次任务列表中 + if (!checkpoint.getCheckpoint().equals(CheckpointPosition.SHARD_END)) { + throw new OTSStreamReaderException("Shard does not exist now, ShardId:" + + shardId + ", Checkpoint:" + checkpoint); + } + } else { + currentShardCheckpointMap.put(shardId, new ShardCheckpoint(shardId, streamJob.getVersion(), + checkpoint.getCheckpoint(), checkpoint.getSkipCount())); + } + } + + // 检查是否有丢失的shard + for (Map.Entry entry : allShardsMap.entrySet()) { + StreamShard shard = entry.getValue(); + String parentId = shard.getParentId(); + // shard不在本次任务中,且shard也不在上一次任务中 + if (parentId != null && !allShardsMap.containsKey(parentId) && !allCheckpoints.containsKey(parentId)) { + LOG.error("Shard is lost: {}.", shard); + throw new OTSStreamReaderException("Can't find checkpoint for shard: " + parentId); + } + + parentId = shard.getParentSiblingId(); + if (parentId != null && !allShardsMap.containsKey(parentId) && !allCheckpoints.containsKey(parentId)) { + LOG.error("Shard is lost: {}.", shard); + throw new OTSStreamReaderException("Can't find checkpoint for shard: " + parentId); + } + } + + return true; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/RecordProcessor.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/RecordProcessor.java new file mode 100644 index 0000000000..ba17bd9cc1 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/RecordProcessor.java @@ -0,0 +1,267 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.core; + +import 
com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.Mode; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConfig; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSStreamReaderException; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.ShardCheckpoint; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.StreamJob; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.TimeUtils; +import com.alicloud.openservices.tablestore.*; +import com.alicloud.openservices.tablestore.model.*; +import com.aliyun.openservices.ots.internal.streamclient.model.*; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.List; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.concurrent.atomic.AtomicLong; + +public class RecordProcessor implements Runnable { + + private static final Logger LOG = LoggerFactory.getLogger(RecordProcessor.class); + private static final long RECORD_CHECKPOINT_INTERVAL = 10 * TimeUtils.MINUTE_IN_MILLIS; + + private final SyncClientInterface ots; + private final long startTimestampMillis; + private final long endTimestampMillis; + private final OTSStreamReaderConfig readerConfig; + private boolean shouldSkip; + private final CheckpointTimeTracker checkpointTimeTracker; + private final RecordSender recordSender; + private final boolean isExportSequenceInfo; + private IStreamRecordSender otsStreamRecordSender; + private long lastRecordCheckpointTime; + + private StreamJob stream; + private StreamShard shard; + private ShardCheckpoint startCheckpoint; + + // read state + private String lastShardIterator; + private String nextShardIterator; + private long skipCount; + + // running state + private long startTime; + private long lastProcessTime; + private AtomicBoolean stop; + private AtomicLong sendRecordCount; + + public enum State { + READY, // initialized but not start + RUNNING, // start to read and process records + SUCCEED, // succeed to process all records + FAILED, // encounter exception and failed + INTERRUPTED // not finish but been interrupted + } + + private State state; + + public RecordProcessor(SyncClientInterface ots, + OTSStreamReaderConfig config, + StreamJob stream, + StreamShard shardToProcess, + ShardCheckpoint startCheckpoint, + boolean shouldSkip, + CheckpointTimeTracker checkpointTimeTracker, + RecordSender recordSender) { + this.ots = ots; + this.readerConfig = config; + this.stream = stream; + this.shard = shardToProcess; + this.startCheckpoint = startCheckpoint; + this.startTimestampMillis = stream.getStartTimeInMillis(); + this.endTimestampMillis = stream.getEndTimeInMillis(); + this.shouldSkip = shouldSkip; + this.checkpointTimeTracker = checkpointTimeTracker; + this.recordSender = recordSender; + this.isExportSequenceInfo = config.isExportSequenceInfo(); + this.lastRecordCheckpointTime = 0; + + // set init state + startTime = 0; + lastProcessTime = 0; + state = State.READY; + stop = new AtomicBoolean(true); + sendRecordCount = new AtomicLong(0); + } + + public StreamShard getShard() { + return shard; + } + + public State getState() { + return state; + } + + public long getStartTime() { + return startTime; + } + + public long getLastProcessTime() { + return lastProcessTime; + } + + public void initialize() { + if (readerConfig.getMode().equals(Mode.MULTI_VERSION)) { + this.otsStreamRecordSender = new MultiVerModeRecordSender(recordSender, 
shard.getShardId(), isExportSequenceInfo); + } else if (readerConfig.getMode().equals(Mode.SINGLE_VERSION_AND_UPDATE_ONLY)) { + this.otsStreamRecordSender = new SingleVerAndUpOnlyModeRecordSender(recordSender, shard.getShardId(), isExportSequenceInfo, readerConfig.getColumns()); + } else { + throw new OTSStreamReaderException("Internal Error. Unhandled Mode: " + readerConfig.getMode()); + } + + if (startCheckpoint.getCheckpoint().equals(CheckpointPosition.TRIM_HORIZON)) { + lastShardIterator = null; + nextShardIterator = ots.getShardIterator(new GetShardIteratorRequest(stream.getStreamId(), shard.getShardId())).getShardIterator(); + skipCount = startCheckpoint.getSkipCount(); + } else { + lastShardIterator = null; + nextShardIterator = startCheckpoint.getCheckpoint(); + skipCount = startCheckpoint.getSkipCount(); + } + LOG.info("Initialize record processor. Mode: {}, StartCheckpoint: [{}], ShardId: {}, ShardIterator: {}, SkipCount: {}.", + readerConfig.getMode(), startCheckpoint, shard.getShardId(), nextShardIterator, skipCount); + } + + private long getTimestamp(StreamRecord record) { + return record.getSequenceInfo().getTimestamp() / 1000; + } + + void sendRecord(StreamRecord record) { + sendRecordCount.incrementAndGet(); + otsStreamRecordSender.sendToDatax(record); + } + + @Override + public void run() { + LOG.info("Start process records with startTime: {}, endTime: {}, nextShardIterator: {}, skipCount: {}.", + startTimestampMillis, endTimestampMillis, nextShardIterator, skipCount); + try { + startTime = System.currentTimeMillis(); + lastProcessTime = startTime; + boolean finished = false; + + stop.set(false); + state = State.RUNNING; + while (!stop.get()) { + finished = readAndProcessRecords(); + lastProcessTime = System.currentTimeMillis(); + if (finished) { + break; + } + + if (Thread.currentThread().isInterrupted()) { + state = State.INTERRUPTED; + break; + } + } + + if (finished) { + state = State.SUCCEED; + } else { + state = State.INTERRUPTED; + } + } catch (Exception e) { + LOG.error("Some fatal error has happened, shardId: {}, LastShardIterator: {}, NextShartIterator: {}.", + shard.getShardId(), lastShardIterator, nextShardIterator, e); + state = State.FAILED; + } + LOG.info("Finished process records. 
ShardId: {}, RecordSent: {}.", shard.getShardId(), sendRecordCount.get()); + } + + public void stop() { + stop.set(true); + } + + /** + * 处理所有记录。 + * 当发现已经获取得到完整的时间范围内的数据,则返回true,否则返回false。 + * + * @param records + * @param nextShardIterator + * @return + */ + boolean process(List records, String nextShardIterator) { + if (records.isEmpty() && nextShardIterator != null) { + LOG.info("ProcessFinished: No more data in shard, shardId: {}.", shard.getShardId()); + ShardCheckpoint checkpoint = new ShardCheckpoint(shard.getShardId(), stream.getVersion(), nextShardIterator, 0); + checkpointTimeTracker.writeCheckpoint(endTimestampMillis, checkpoint, sendRecordCount.get()); + checkpointTimeTracker.setShardTimeCheckpoint(shard.getShardId(), endTimestampMillis, nextShardIterator); + return true; + } + + int size = records.size(); + + // 只记录每次Iterator的第一个record作为checkpoint,因为checkpoint只记录shardIterator,而不记录skipCount。 + if (!records.isEmpty()) { + long firstRecordTimestamp = getTimestamp(records.get(0)); + if (firstRecordTimestamp >= lastRecordCheckpointTime + RECORD_CHECKPOINT_INTERVAL) { + lastRecordCheckpointTime = firstRecordTimestamp; + checkpointTimeTracker.setShardTimeCheckpoint(shard.getShardId(), firstRecordTimestamp, lastShardIterator); + } + } + + for (int i = 0; i < size; i++) { + long timestamp = getTimestamp(records.get(i)); + LOG.debug("Process record with timestamp: {}.", timestamp); + if (timestamp < endTimestampMillis) { + if (shouldSkip && (timestamp < startTimestampMillis)) { + LOG.debug("Skip record out of start time: {}, startTime: {}.", timestamp, startTimestampMillis); + continue; + } + shouldSkip = false; + if (skipCount > 0) { + LOG.debug("Skip record. Timestamp: {}, SkipCount: {}.", timestamp, skipCount); + skipCount -= 1; + continue; + } + + LOG.debug("Send record. Timestamp: {}.", timestamp); + sendRecord(records.get(i)); + } else { + LOG.info("ProcessFinished: Record in shard reach boundary of endTime, shardId: {}. Timestamp: {}, EndTime: {}", shard.getShardId(), timestamp, endTimestampMillis); + ShardCheckpoint checkpoint = new ShardCheckpoint(shard.getShardId(), stream.getVersion(), lastShardIterator, i); + checkpointTimeTracker.writeCheckpoint(endTimestampMillis, checkpoint, sendRecordCount.get()); + return true; + } + } + + if (nextShardIterator == null) { + LOG.info("ProcessFinished: Shard has reach to end, shardId: {}.", shard.getShardId()); + ShardCheckpoint checkpoint = new ShardCheckpoint(shard.getShardId(), stream.getVersion(), CheckpointPosition.SHARD_END, 0); + checkpointTimeTracker.writeCheckpoint(endTimestampMillis, checkpoint, sendRecordCount.get()); + return true; + } + + return false; + } + + private boolean readAndProcessRecords() { + LOG.debug("Read and process records. 
ShardId: {}, ShardIterator: {}.", shard.getShardId(), nextShardIterator); + GetStreamRecordRequest request = new GetStreamRecordRequest(nextShardIterator); + GetStreamRecordResponse response = ots.getStreamRecord(request); + lastShardIterator = nextShardIterator; + nextShardIterator = response.getNextShardIterator(); + return processRecords(response.getRecords(), nextShardIterator); + } + + public boolean processRecords(List records, String nextShardIterator) { + long startTime = System.currentTimeMillis(); + + if (records.isEmpty()) { + LOG.info("StartProcessRecords: size: {}.", records.size()); + } else { + LOG.debug("StartProcessRecords: size: {}, recordTime: {}.", records.size(), getTimestamp(records.get(0))); + } + + if (process(records, nextShardIterator)) { + return true; + } + + LOG.debug("ProcessRecords, ProcessShard:{}, ProcessTime: {}, Size:{}, NextShardIterator:{}", + shard.getShardId(), System.currentTimeMillis() - startTime, records.size(), nextShardIterator); + return false; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/ShardStatusChecker.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/ShardStatusChecker.java new file mode 100644 index 0000000000..7d2e2014ae --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/ShardStatusChecker.java @@ -0,0 +1,131 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.core; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSReaderError; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.model.ShardCheckpoint; +import com.alicloud.openservices.tablestore.model.StreamShard; +import com.aliyun.openservices.ots.internal.streamclient.model.CheckpointPosition; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; + +public class ShardStatusChecker { + + private static final Logger LOG = LoggerFactory.getLogger(ShardStatusChecker.class); + + public enum ProcessState { + READY, // shard is ready to process records and is not done + DONE_NOT_END, // shard is done but not reach end of shard + DONE_REACH_END, // shard is done and reach end of shard + BLOCK, // shard is block on its parents + SKIP // shard is skipped + } + + /** + * 1. 若shard没有parent shard,或者其parent shard均已达到END,则该shard需要被处理 + * 2. 若shard有parent shard,其已经被处理完毕,且其checkpoint不为END,则该shard不需要再被处理 + *

+ * 所有确认需要被处理和不需要被处理的shard,都会从allShardToProcess列表中移除 + * + * @param allShardToProcess + * @param allShardsMap + * @param checkpointMap + * @return + */ + public static void findShardToProcess( + Map allShardToProcess, + Map allShardsMap, + Map checkpointMap, + List shardToProcess, + List shardNoNeedToProcess, + List shardBlocked) { + Map shardStates = new HashMap(); + for (Map.Entry entry : allShardToProcess.entrySet()) { + determineShardState(entry.getValue().getShardId(), allShardsMap, checkpointMap, shardStates); + } + + for (Map.Entry entry : shardStates.entrySet()) { + String shardId = entry.getKey(); + if (allShardToProcess.containsKey(shardId)) { + StreamShard shard = allShardToProcess.get(shardId); + switch (entry.getValue()) { + case READY: + shardToProcess.add(shard); + allShardToProcess.remove(shardId); + break; + case BLOCK: + shardBlocked.add(shard); + break; + case SKIP: + shardNoNeedToProcess.add(shard); + allShardToProcess.remove(shardId); + break; + default: + LOG.error("Unexpected state '{}' for shard '{}'.", entry.getValue(), shard); + throw DataXException.asDataXException(OTSReaderError.ERROR, "Unexpected state '" + entry.getValue() + "' for shard '" + shard + "'."); + } + } + } + } + + public static ProcessState determineShardState( + String shardId, + Map allShards, + Map allCheckpoints, + Map shardStates) { + StreamShard shard = allShards.get(shardId); + if (shard == null) { + // 若发现shard已经不存在,则我们认为shard已经被处理完毕。 + // 做出这种判断的前提是: + // 若此次任务是延续上次任务的checkpoint,则该shard一定是在上一次任务中checkpoint达到了SHARD_END(在slave初始化时做检查)。 + // 若此次任务不是延续上次任务,则对于全新的任务,不存在的shard我们可以认为是处理完毕的,即不需要处理。 + LOG.warn("Shard is not found: {}.", shardId); + return ProcessState.DONE_REACH_END; + } + + if (shardStates.containsKey(shardId)) { + return shardStates.get(shardId); + } + + ProcessState finalState; + + if (allCheckpoints.containsKey(shardId)) { + ShardCheckpoint checkpoint = allCheckpoints.get(shardId); + if (checkpoint == null || checkpoint.getCheckpoint() == null) { + finalState = ProcessState.READY; + } else if (checkpoint.getCheckpoint().equals(CheckpointPosition.SHARD_END)){ + finalState = ProcessState.DONE_REACH_END; + } else { + finalState = ProcessState.DONE_NOT_END; + } + } else { + ProcessState stateOfParent = ProcessState.DONE_REACH_END; + String parentId = shard.getParentId(); + if (parentId != null) { + stateOfParent = determineShardState(parentId, allShards, allCheckpoints, shardStates); + } + + ProcessState stateOfParentSibling = ProcessState.DONE_REACH_END; + String parentSiblingId = shard.getParentSiblingId(); + if (parentSiblingId != null) { + stateOfParentSibling = determineShardState(parentSiblingId, allShards, allCheckpoints, shardStates); + } + + if (stateOfParent == ProcessState.SKIP || stateOfParentSibling == ProcessState.SKIP) { + finalState = ProcessState.SKIP; + } else if (stateOfParent == ProcessState.DONE_NOT_END || stateOfParentSibling == ProcessState.DONE_NOT_END) { + finalState = ProcessState.SKIP; + } else if (stateOfParent == ProcessState.BLOCK || stateOfParentSibling == ProcessState.BLOCK) { + finalState = ProcessState.BLOCK; + } else if (stateOfParent == ProcessState.READY || stateOfParentSibling == ProcessState.READY){ + finalState = ProcessState.BLOCK; + } else { // stateOfParent == ProcessState.DONE_REACH_END && stateOfParentSibling == ProcessState.DONE_REACH_END + finalState = ProcessState.READY; + } + } + + shardStates.put(shard.getShardId(), finalState); + return finalState; + } +} diff --git 
a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/SingleVerAndUpOnlyModeRecordSender.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/SingleVerAndUpOnlyModeRecordSender.java new file mode 100644 index 0000000000..1cc32bad08 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/core/SingleVerAndUpOnlyModeRecordSender.java @@ -0,0 +1,101 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.core; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSStreamReaderException; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.ColumnValueTransformHelper; +import com.alicloud.openservices.tablestore.model.*; + +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * 该输出模式假设用户对数据只有Put和Update操作,无Delete操作,且没有使用多版本。 + * 在该种模式下,会整行输出数据,用户必须指定需要导出的列的列名,输出的数据样例如下: + * | pk1 | pk2 | col1 | col2 | col3 | sequence id | + * | --- | --- | ---- | ---- | ---- | ----------- | + * | a | b | c1 | null | null | 001 | + * + * 注意:删除整行,删除某列(某个版本或所有),这些增量信息都会被忽略。 + */ +public class SingleVerAndUpOnlyModeRecordSender implements IStreamRecordSender { + + private final RecordSender dataxRecordSender; + private String shardId; + private final boolean isExportSequenceInfo; + private List columnNames; + + public SingleVerAndUpOnlyModeRecordSender(RecordSender dataxRecordSender, String shardId, boolean isExportSequenceInfo, List columnNames) { + this.dataxRecordSender = dataxRecordSender; + this.shardId = shardId; + this.isExportSequenceInfo = isExportSequenceInfo; + this.columnNames = columnNames; + } + + @Override + public void sendToDatax(StreamRecord streamRecord) { + String sequenceInfo = getSequenceInfo(streamRecord); + switch (streamRecord.getRecordType()) { + case PUT: + case UPDATE: + sendToDatax(streamRecord.getPrimaryKey(), streamRecord.getColumns(), sequenceInfo); + break; + case DELETE: + break; + default: + throw new OTSStreamReaderException("Unknown stream record type: " + streamRecord.getRecordType() + "."); + } + } + + private void sendToDatax(PrimaryKey primaryKey, List columns, String sequenceInfo) { + Record line = dataxRecordSender.createRecord(); + + Map map = new HashMap(); + for (PrimaryKeyColumn pkCol : primaryKey.getPrimaryKeyColumns()) { + map.put(pkCol.getName(), pkCol.getValue()); + } + + for (RecordColumn recordColumn : columns) { + if (recordColumn.getColumnType().equals(RecordColumn.ColumnType.PUT)) { + map.put(recordColumn.getColumn().getName(), recordColumn.getColumn().getValue()); + } + } + + boolean findColumn = false; + + for (String colName : columnNames) { + Object value = map.get(colName); + if (value != null) { + findColumn = true; + if (value instanceof ColumnValue) { + line.addColumn(ColumnValueTransformHelper.otsColumnValueToDataxColumn((ColumnValue) value)); + } else { + line.addColumn(ColumnValueTransformHelper.otsPrimaryKeyValueToDataxColumn((PrimaryKeyValue) value)); + } + } else { + line.addColumn(new StringColumn(null)); + } + } + + if (!findColumn) { + return; + } + + if (isExportSequenceInfo) { + line.addColumn(new StringColumn(sequenceInfo)); + } + synchronized (dataxRecordSender) { + dataxRecordSender.sendToWriter(line); + } + } + + private String getSequenceInfo(StreamRecord streamRecord) { + 
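// Builds the optional sequence id in the form <epoch>_<timestamp>_<rowIndex>_<shardId>;
+ // every numeric field is zero-padded to a fixed width so the strings sort in record order. + 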
int epoch = streamRecord.getSequenceInfo().getEpoch(); + long timestamp = streamRecord.getSequenceInfo().getTimestamp(); + int rowIdx = streamRecord.getSequenceInfo().getRowIndex(); + String sequenceId = String.format("%010d_%020d_%010d_%s", epoch, timestamp, rowIdx, shardId); + return sequenceId; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/OTSErrorCode.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/OTSErrorCode.java new file mode 100644 index 0000000000..ab60f8661c --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/OTSErrorCode.java @@ -0,0 +1,114 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.model; + +/** + * 表示来自开放结构化数据服务(Open Table Service,OTS)的错误代码。 + * + */ +public class OTSErrorCode { + /** + * 用户身份验证失败。 + */ + public static final String AUTHORIZATION_FAILURE = "OTSAuthFailed"; + + /** + * 服务器内部错误。 + */ + public static final String INTERNAL_SERVER_ERROR = "OTSInternalServerError"; + + /** + * 参数错误。 + */ + public static final String INVALID_PARAMETER = "OTSParameterInvalid"; + + /** + * 整个请求过大。 + */ + public static final String REQUEST_TOO_LARGE = "OTSRequestBodyTooLarge"; + + /** + * 客户端请求超时。 + */ + public static final String REQUEST_TIMEOUT = "OTSRequestTimeout"; + + /** + * 用户的配额已经用满。 + */ + public static final String QUOTA_EXHAUSTED = "OTSQuotaExhausted"; + + /** + * 内部服务器发生failover,导致表的部分分区不可服务。 + */ + public static final String PARTITION_UNAVAILABLE = "OTSPartitionUnavailable"; + + /** + * 表刚被创建还无法立马提供服务。 + */ + public static final String TABLE_NOT_READY = "OTSTableNotReady"; + + /** + * 请求的表不存在。 + */ + public static final String OBJECT_NOT_EXIST = "OTSObjectNotExist"; + + /** + * 请求创建的表已经存在。 + */ + public static final String OBJECT_ALREADY_EXIST = "OTSObjectAlreadyExist"; + + /** + * 多个并发的请求写同一行数据,导致冲突。 + */ + public static final String ROW_OPEARTION_CONFLICT = "OTSRowOperationConflict"; + + /** + * 主键不匹配。 + */ + public static final String INVALID_PK = "OTSInvalidPK"; + + /** + * 读写能力调整过于频繁。 + */ + public static final String TOO_FREQUENT_RESERVED_THROUGHPUT_ADJUSTMENT = "OTSTooFrequentReservedThroughputAdjustment"; + + /** + * 该行总列数超出限制。 + */ + public static final String OUT_OF_COLUMN_COUNT_LIMIT = "OTSOutOfColumnCountLimit"; + + /** + * 该行所有列数据大小总和超出限制。 + */ + public static final String OUT_OF_ROW_SIZE_LIMIT = "OTSOutOfRowSizeLimit"; + + /** + * 剩余预留读写能力不足。 + */ + public static final String NOT_ENOUGH_CAPACITY_UNIT = "OTSNotEnoughCapacityUnit"; + + /** + * 预查条件检查失败。 + */ + public static final String CONDITION_CHECK_FAIL = "OTSConditionCheckFail"; + + /** + * 在OTS内部操作超时。 + */ + public static final String STORAGE_TIMEOUT = "OTSTimeout"; + + /** + * 在OTS内部有服务器不可访问。 + */ + public static final String SERVER_UNAVAILABLE = "OTSServerUnavailable"; + + /** + * OTS内部服务器繁忙。 + */ + public static final String SERVER_BUSY = "OTSServerBusy"; + + + /** + * 流数据已经过期 + */ + public static final String TRIMMED_DATA_ACCESS = "OTSTrimmedDataAccess"; + +} \ No newline at end of file diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/ShardCheckpoint.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/ShardCheckpoint.java new file mode 100644 index 0000000000..fe43225008 --- /dev/null +++ 
b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/ShardCheckpoint.java @@ -0,0 +1,118 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.model; + +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.StatusTableConstants; +import com.alicloud.openservices.tablestore.model.ColumnValue; +import com.alicloud.openservices.tablestore.model.Row; +import com.alicloud.openservices.tablestore.model.RowPutChange; + +public class ShardCheckpoint { + private String shardId; + private String version; + private String checkpoint; + private long skipCount; + + public ShardCheckpoint(String shardId, String version, String shardIterator, long skipCount) { + this.shardId = shardId; + this.version = version; + this.checkpoint = shardIterator; + this.skipCount = skipCount; + } + + public String getShardId() { + return shardId; + } + + public void setShardId(String shardId) { + this.shardId = shardId; + } + + public String getVersion() { + return version; + } + + public void setVersion(String version) { + this.version = version; + } + + public String getCheckpoint() { + return checkpoint; + } + + public void setCheckpoint(String checkpoint) { + this.checkpoint = checkpoint; + } + + public long getSkipCount() { + return skipCount; + } + + public void setSkipCount(long skipCount) { + this.skipCount = skipCount; + } + + public static ShardCheckpoint fromRow(String shardId, Row row) { + String shardIterator = row.getLatestColumn(StatusTableConstants.CHECKPOINT_COLUMN_NAME).getValue().asString(); + + long skipCount = 0; + // compatible with old stream reader + if (row.contains(StatusTableConstants.SKIP_COUNT_COLUMN_NAME)) { + skipCount = row.getLatestColumn(StatusTableConstants.SKIP_COUNT_COLUMN_NAME).getValue().asLong(); + } + + // compatible with old stream reader + String version = ""; + if (row.contains(StatusTableConstants.VERSION_COLUMN_NAME)) { + version = row.getLatestColumn(StatusTableConstants.VERSION_COLUMN_NAME).getValue().asString(); + } + + return new ShardCheckpoint(shardId, version, shardIterator, skipCount); + } + + public void serializeColumn(RowPutChange rowChange) { + rowChange.addColumn(StatusTableConstants.VERSION_COLUMN_NAME, ColumnValue.fromString(version)); + rowChange.addColumn(StatusTableConstants.CHECKPOINT_COLUMN_NAME, ColumnValue.fromString(checkpoint)); + rowChange.addColumn(StatusTableConstants.SKIP_COUNT_COLUMN_NAME, ColumnValue.fromLong(skipCount)); + } + + @Override + public int hashCode() { + int result = 31; + result = result ^ this.shardId.hashCode(); + result = result ^ this.version.hashCode(); + result = result ^ this.checkpoint.hashCode(); + result = result ^ (int)this.skipCount; + return result; + } + + @Override + public boolean equals(Object obj) { + if (this == obj) { + return true; + } + + if (obj == null) { + return false; + } + + if (!(obj instanceof ShardCheckpoint)) { + return false; + } + + ShardCheckpoint other = (ShardCheckpoint)obj; + + return this.shardId.equals(other.shardId) && + this.version.equals(other.version) && + this.checkpoint.equals(other.checkpoint) && + this.skipCount == other.skipCount; + } + + @Override + public String toString() { + StringBuilder sb = new StringBuilder(); + sb.append("ShardId: ").append(shardId) + .append(", Version: ").append(version) + .append(", Checkpoint: ").append(checkpoint) + .append(", SkipCount: ").append(skipCount); + return sb.toString(); + } +} diff --git 
a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/StreamJob.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/StreamJob.java new file mode 100644 index 0000000000..e147c42fd1 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/model/StreamJob.java @@ -0,0 +1,184 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.model; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSReaderError; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.StatusTableConstants; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils.GsonParser; +import com.alicloud.openservices.tablestore.core.utils.CompressUtil; +import com.alicloud.openservices.tablestore.model.Column; +import com.alicloud.openservices.tablestore.model.ColumnValue; +import com.alicloud.openservices.tablestore.model.Row; +import com.alicloud.openservices.tablestore.model.RowPutChange; +import com.google.gson.Gson; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.io.UnsupportedEncodingException; +import java.util.*; +import java.util.zip.DataFormatException; +import java.util.zip.Deflater; +import java.util.zip.Inflater; + +public class StreamJob { + private String tableName; + private String streamId; + private String version; + private Set shardIds; + private long startTimeInMillis; + private long endTimeInMillis; + + public StreamJob(String tableName, String streamId, String version, + Set shardIds, long startTimestampMillis, long endTimestampMillis) { + this.tableName = tableName; + this.streamId = streamId; + this.version = version; + this.shardIds = shardIds; + this.startTimeInMillis = startTimestampMillis; + this.endTimeInMillis = endTimestampMillis; + } + + public String getTableName() { + return tableName; + } + + public void setTableName(String tableName) { + this.tableName = tableName; + } + + public String getStreamId() { + return streamId; + } + + public void setStreamId(String streamId) { + this.streamId = streamId; + } + + public String getVersion() { + return version; + } + + public void setVersion(String version) { + this.version = version; + } + + public Set getShardIds() { + return shardIds; + } + + public void setShardIds(Set shardIds) { + this.shardIds = shardIds; + } + + public long getStartTimeInMillis() { + return startTimeInMillis; + } + + public void setStartTimeInMillis(long startTimeInMillis) { + this.startTimeInMillis = startTimeInMillis; + } + + public long getEndTimeInMillis() { + return endTimeInMillis; + } + + public void setEndTimeInMillis(long endTimeInMillis) { + this.endTimeInMillis = endTimeInMillis; + } + + public void serializeShardIdList(RowPutChange rowChange, Set shardIds) { + try { + String json = GsonParser.listToJson(new ArrayList(shardIds)); + byte[] content = CompressUtil.compress(new ByteArrayInputStream(json.getBytes("utf-8")), new Deflater()); + List columns = new ArrayList(); + int index = 0; + while (index < content.length) { + int endIndex = index + StatusTableConstants.COLUMN_MAX_SIZE; + if (endIndex > content.length) { + endIndex = content.length; + } + + columns.add(ColumnValue.fromBinary(Arrays.copyOfRange(content, index, endIndex))); + + index = endIndex; + } + + for (int id = 0; id < columns.size(); id++) { + 
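// Each chunk of the compressed shard-id list (at most COLUMN_MAX_SIZE bytes) goes into its
+ // own attribute column, named with the shard-list prefix plus a running index;
+ // deserializeShardIdList() below reads the chunks back in the same order. + 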
rowChange.addColumn(StatusTableConstants.JOB_SHARD_LIST_PREFIX_COLUMN_NAME + id, columns.get(id)); + } + } catch (UnsupportedEncodingException e) { + throw DataXException.asDataXException(OTSReaderError.ERROR, e); + } catch (IOException e) { + throw DataXException.asDataXException(OTSReaderError.ERROR, e); + } + } + + public static Set deserializeShardIdList(Row row) { + ByteArrayOutputStream output = new ByteArrayOutputStream(); + + try { + int id = 0; + while (true) { + String columnName = StatusTableConstants.JOB_SHARD_LIST_PREFIX_COLUMN_NAME + id; + Column column = row.getLatestColumn(columnName); + if (column != null) { + output.write(column.getValue().asBinary()); + id++; + } else { + break; + } + } + + byte[] content = output.toByteArray(); + + byte[] realContent = CompressUtil.decompress(new ByteArrayInputStream(content), 1024, new Inflater()); + String json = new String(realContent, "utf-8"); + return new HashSet(GsonParser.jsonToList(json)); + } catch (UnsupportedEncodingException e) { + throw DataXException.asDataXException(OTSReaderError.ERROR, e); + } catch (IOException e) { + throw DataXException.asDataXException(OTSReaderError.ERROR, e); + } catch (DataFormatException e) { + throw DataXException.asDataXException(OTSReaderError.ERROR, e); + } + } + + public void serializeColumn(RowPutChange rowChange) { + serializeShardIdList(rowChange, shardIds); + rowChange.addColumn(StatusTableConstants.JOB_VERSION_COLUMN_NAME, ColumnValue.fromString(version)); + rowChange.addColumn(StatusTableConstants.JOB_TABLE_NAME_COLUMN_NAME, ColumnValue.fromString(tableName)); + rowChange.addColumn(StatusTableConstants.JOB_STREAM_ID_COLUMN_NAME, ColumnValue.fromString(streamId)); + rowChange.addColumn(StatusTableConstants.JOB_START_TIME_COLUMN_NAME, ColumnValue.fromLong(startTimeInMillis)); + rowChange.addColumn(StatusTableConstants.JOB_END_TIME_COLUMN_NAME, ColumnValue.fromLong(endTimeInMillis)); + } + + public String toJson() { + Gson gson = new Gson(); + return gson.toJson(this); + } + + @Override + public String toString() { + return toJson(); + } + + public static StreamJob fromJson(String json) { + Gson gson = new Gson(); + return gson.fromJson(json, StreamJob.class); + } + + public static StreamJob fromRow(Row row) { + if (row == null) { + return null; + } + + Set shardIds = deserializeShardIdList(row); + String version = row.getLatestColumn(StatusTableConstants.JOB_VERSION_COLUMN_NAME).getValue().asString(); + String tableName = row.getLatestColumn(StatusTableConstants.JOB_TABLE_NAME_COLUMN_NAME).getValue().asString(); + String streamId = row.getLatestColumn(StatusTableConstants.JOB_STREAM_ID_COLUMN_NAME).getValue().asString(); + long startTime = row.getLatestColumn(StatusTableConstants.JOB_START_TIME_COLUMN_NAME).getValue().asLong(); + long endTime = row.getLatestColumn(StatusTableConstants.JOB_END_TIME_COLUMN_NAME).getValue().asLong(); + + return new StreamJob(tableName, streamId, version, shardIds, startTime, endTime); + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/ColumnValueTransformHelper.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/ColumnValueTransformHelper.java new file mode 100644 index 0000000000..80032e89cd --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/ColumnValueTransformHelper.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils; + +import 
com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.element.Column; +import com.alicloud.openservices.tablestore.model.*; + +public class ColumnValueTransformHelper { + public static Column otsPrimaryKeyValueToDataxColumn(PrimaryKeyValue pkValue) { + switch (pkValue.getType()) { + case STRING:return new StringColumn(pkValue.asString()); + case INTEGER:return new LongColumn(pkValue.asLong()); + case BINARY:return new BytesColumn(pkValue.asBinary()); + default: + throw new IllegalArgumentException("Unknown primary key type: " + pkValue.getType() + "."); + } + } + + public static Column otsColumnValueToDataxColumn(ColumnValue columnValue) { + switch (columnValue.getType()) { + case STRING:return new StringColumn(columnValue.asString()); + case INTEGER:return new LongColumn(columnValue.asLong()); + case BINARY:return new BytesColumn(columnValue.asBinary()); + case BOOLEAN:return new BoolColumn(columnValue.asBoolean()); + case DOUBLE:return new DoubleColumn(columnValue.asDouble()); + default: + throw new IllegalArgumentException("Unknown column type: " + columnValue.getType() + "."); + } + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/GsonParser.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/GsonParser.java new file mode 100644 index 0000000000..d8ac827292 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/GsonParser.java @@ -0,0 +1,37 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils; + +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConfig; +import com.alicloud.openservices.tablestore.model.StreamShard; +import com.google.gson.GsonBuilder; +import com.google.gson.reflect.TypeToken; + +import java.lang.reflect.Type; +import java.util.ArrayList; +import java.util.List; + +public class GsonParser { + + public static String configToJson(OTSStreamReaderConfig config) { + return new GsonBuilder().create().toJson(config); + } + + public static OTSStreamReaderConfig jsonToConfig(String jsonStr) { + return new GsonBuilder().create().fromJson(jsonStr, OTSStreamReaderConfig.class); + } + + public static String listToJson(List list) { + return new GsonBuilder().create().toJson(list); + } + + public static List jsonToList(String jsonStr) { + return new GsonBuilder().create().fromJson(jsonStr, new TypeToken>(){}.getType()); + } + + public static Object toJson(List allShards) { + return new GsonBuilder().create().toJson(allShards); + } + + public static List fromJson(String jsonStr) { + return new GsonBuilder().create().fromJson(jsonStr, new TypeToken>(){}.getType()); + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/OTSHelper.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/OTSHelper.java new file mode 100644 index 0000000000..79b6c1d700 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/OTSHelper.java @@ -0,0 +1,119 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils; + +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSRetryStrategyForStreamReader; +import com.alibaba.datax.plugin.reader.otsstreamreader.internal.config.OTSStreamReaderConfig; +import com.alicloud.openservices.tablestore.model.*; +import 
com.alicloud.openservices.tablestore.*; +import com.aliyun.openservices.ots.internal.streamclient.utils.TimeUtils; + +import java.util.*; + +public class OTSHelper { + + private static final String TABLE_NOT_READY = "OTSTableNotReady"; + private static final String OTS_PARTITION_UNAVAILABLE = "OTSPartitionUnavailable"; + private static final String OBJECT_NOT_EXIST = "OTSObjectNotExist"; + private static final int CREATE_TABLE_READ_CU = 0; + private static final int CREATE_TABLE_WRITE_CU = 0; + private static final long CHECK_TABLE_READY_INTERNAL_MILLIS = 100; + + public static SyncClientInterface getOTSInstance(OTSStreamReaderConfig config) { + if (config.getOtsForTest() != null) { + return config.getOtsForTest(); // for test + } + + ClientConfiguration clientConfig = new ClientConfiguration(); + OTSRetryStrategyForStreamReader retryStrategy = new OTSRetryStrategyForStreamReader(); + retryStrategy.setMaxRetries(config.getMaxRetries()); + clientConfig.setRetryStrategy(retryStrategy); + clientConfig.setConnectionTimeoutInMillisecond(50 * 1000); + clientConfig.setSocketTimeoutInMillisecond(50 * 1000); + clientConfig.setIoThreadCount(4); + clientConfig.setMaxConnections(30); + SyncClientInterface ots = new SyncClient(config.getEndpoint(), config.getAccessId(), + config.getAccessKey(), config.getInstanceName(), clientConfig); + return ots; + } + + public static StreamDetails getStreamDetails(SyncClientInterface ots, String tableName) { + DescribeTableRequest describeTableRequest = new DescribeTableRequest(tableName); + DescribeTableResponse result = ots.describeTable(describeTableRequest); + return result.getStreamDetails(); + } + + public static List getOrderedShardList(SyncClientInterface ots, String streamId) { + DescribeStreamRequest describeStreamRequest = new DescribeStreamRequest(streamId); + DescribeStreamResponse describeStreamResult = ots.describeStream(describeStreamRequest); + List shardList = new ArrayList(); + shardList.addAll(describeStreamResult.getShards()); + while (describeStreamResult.getNextShardId() != null) { + describeStreamRequest.setInclusiveStartShardId(describeStreamResult.getNextShardId()); + describeStreamResult = ots.describeStream(describeStreamRequest); + shardList.addAll(describeStreamResult.getShards()); + } + return shardList; + } + + public static boolean checkTableExists(SyncClientInterface ots, String tableName) { + boolean exist = false; + try { + describeTable(ots, tableName); + exist = true; + } catch (TableStoreException ex) { + if (!ex.getErrorCode().equals(OBJECT_NOT_EXIST)) { + throw ex; + } + } + return exist; + } + + public static DescribeTableResponse describeTable(SyncClientInterface ots, String tableName) { + return ots.describeTable(new DescribeTableRequest(tableName)); + } + + public static void createTable(SyncClientInterface ots, TableMeta tableMeta, TableOptions tableOptions) { + CreateTableRequest request = new CreateTableRequest(tableMeta, tableOptions, + new ReservedThroughput(CREATE_TABLE_READ_CU, CREATE_TABLE_WRITE_CU)); + ots.createTable(request); + } + + public static boolean waitUntilTableReady(SyncClientInterface ots, String tableName, long maxWaitTimeMillis) { + TableMeta tableMeta = describeTable(ots, tableName).getTableMeta(); + List startPkCols = new ArrayList(); + List endPkCols = new ArrayList(); + for (PrimaryKeySchema pkSchema : tableMeta.getPrimaryKeyList()) { + startPkCols.add(new PrimaryKeyColumn(pkSchema.getName(), PrimaryKeyValue.INF_MIN)); + endPkCols.add(new PrimaryKeyColumn(pkSchema.getName(), 
PrimaryKeyValue.INF_MAX)); + } + RangeRowQueryCriteria rangeRowQueryCriteria = new RangeRowQueryCriteria(tableName); + rangeRowQueryCriteria.setInclusiveStartPrimaryKey(new PrimaryKey(startPkCols)); + rangeRowQueryCriteria.setExclusiveEndPrimaryKey(new PrimaryKey(endPkCols)); + rangeRowQueryCriteria.setLimit(1); + rangeRowQueryCriteria.setMaxVersions(1); + + long startTime = System.currentTimeMillis(); + + while (System.currentTimeMillis() - startTime < maxWaitTimeMillis) { + try { + GetRangeRequest getRangeRequest = new GetRangeRequest(rangeRowQueryCriteria); + ots.getRange(getRangeRequest); + return true; + } catch (TableStoreException ex) { + if (!ex.getErrorCode().equals(OTS_PARTITION_UNAVAILABLE) && + !ex.getErrorCode().equals(TABLE_NOT_READY)) { + throw ex; + } + } + TimeUtils.sleepMillis(CHECK_TABLE_READY_INTERNAL_MILLIS); + } + return false; + } + + public static Map toShardMap(List orderedShardList) { + Map shardsMap = new HashMap(); + for (StreamShard shard : orderedShardList) { + shardsMap.put(shard.getShardId(), shard); + } + return shardsMap; + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/ParamChecker.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/ParamChecker.java new file mode 100644 index 0000000000..9f19700029 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/ParamChecker.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils; + +import com.alibaba.datax.common.util.Configuration; + +public class ParamChecker { + + private static void throwNotExistException() { + throw new IllegalArgumentException("missing the key."); + } + + private static void throwStringLengthZeroException() { + throw new IllegalArgumentException("input the key is empty string."); + } + + public static String checkStringAndGet(Configuration param, String key, boolean isTrim) { + try { + String value = param.getString(key); + if (isTrim) { + value = value != null ? 
value.trim() : null; + } + if (null == value) { + throwNotExistException(); + } else if (value.length() == 0) { + throwStringLengthZeroException(); + } + return value; + } catch(RuntimeException e) { + throw e; + } + } +} diff --git a/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/TimeUtils.java b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/TimeUtils.java new file mode 100644 index 0000000000..8d11ac2663 --- /dev/null +++ b/otsstreamreader/src/main/java/com/alibaba/datax/plugin/reader/otsstreamreader/internal/utils/TimeUtils.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.plugin.reader.otsstreamreader.internal.utils; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.DateFormat; +import java.text.ParseException; +import java.text.SimpleDateFormat; +import java.util.Date; +import java.util.TimeZone; + +public class TimeUtils { + + public static final long SECOND_IN_MILLIS = 1000; + + public static final long MINUTE_IN_MILLIS = 60 * 1000; + + public static final int DAY_IN_SEC = 24 * 60 * 60; + + public static final long DAY_IN_MILLIS = DAY_IN_SEC * 1000; + + public static final long HOUR_IN_MILLIS = 60 * MINUTE_IN_MILLIS; + + private static final Logger LOG = LoggerFactory.getLogger(TimeUtils.class); + + public static long sleepMillis(long timeToSleepMillis) { + if(timeToSleepMillis <= 0L) { + return 0L; + } else { + long startTime = System.currentTimeMillis(); + + try { + Thread.sleep(timeToSleepMillis); + } catch (InterruptedException var5) { + Thread.interrupted(); + LOG.warn("Interrupted while sleeping"); + } + + return System.currentTimeMillis() - startTime; + } + } + + public static long parseDateToTimestampMillis(String dateStr) throws ParseException { + SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd"); + Date date = format.parse(dateStr); + return date.getTime(); + } + + public static long parseTimeStringToTimestampMillis(String dateStr) throws ParseException { + SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss"); + Date date = format.parse(dateStr); + return date.getTime(); + } + + + public static String getTimeInISO8601(Date date) { + TimeZone tz = TimeZone.getTimeZone("UTC"); + DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm'Z'"); + df.setTimeZone(tz); + String nowAsISO = df.format(date); + return nowAsISO; + } +} diff --git a/otsstreamreader/src/main/resources/log4j2.xml b/otsstreamreader/src/main/resources/log4j2.xml new file mode 100644 index 0000000000..60a4f3b4cd --- /dev/null +++ b/otsstreamreader/src/main/resources/log4j2.xml @@ -0,0 +1,37 @@ + + + + + + %d %p %c{1.} [%t] %m%n + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/otsstreamreader/src/main/resources/plugin.json b/otsstreamreader/src/main/resources/plugin.json new file mode 100644 index 0000000000..9a70a47a46 --- /dev/null +++ b/otsstreamreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "otsstreamreader", + "class": "com.alibaba.datax.plugin.reader.otsstreamreader.internal.OTSStreamReader", + "description": "", + "developer": "zhaofeng.zhou@alibaba-inc.com" +} diff --git a/otsstreamreader/tools/config.json b/otsstreamreader/tools/config.json new file mode 100644 index 0000000000..ba019ec76f --- /dev/null +++ b/otsstreamreader/tools/config.json @@ -0,0 +1,7 @@ +{ + "endpoint" : "", + "accessId" : "", + "accessKey" : "", + "instanceName" : "", + "statusTable" : "" +} diff --git 
a/otsstreamreader/tools/tablestore_streamreader_console.py b/otsstreamreader/tools/tablestore_streamreader_console.py new file mode 100644 index 0000000000..f9379d727c --- /dev/null +++ b/otsstreamreader/tools/tablestore_streamreader_console.py @@ -0,0 +1,179 @@ +#!/bin/usr/env python +#-*- coding: utf-8 -*- + +from optparse import OptionParser +import sys +import json +import tabulate +import zlib +from ots2 import * + +class ConsoleConfig: + def __init__(self, config_file): + f = open(config_file, 'r') + config = json.loads(f.read()) + self.endpoint = str(config['endpoint']) + self.accessid = str(config['accessId']) + self.accesskey = str(config['accessKey']) + self.instance_name = str(config['instanceName']) + self.status_table = str(config['statusTable']) + + self.ots = OTSClient(self.endpoint, self.accessid, self.accesskey, self.instance_name) + +def describe_job(config, options): + ''' + 1. get job's description + 2. get all job's checkpoints and check if it is done + ''' + if not options.stream_id: + print "Error: Should set the stream id using '-s' or '--streamid'." + sys.exit(-1) + + if not options.timestamp: + print "Error: Should set the timestamp using '-t' or '--timestamp'." + sys.exit(-1) + + pk = [('StreamId', options.stream_id), ('StatusType', 'DataxJobDesc'), ('StatusValue', '%16d' % int(options.timestamp))] + consumed, pk, attrs, next_token = config.ots.get_row(config.status_table, pk, [], None, 1) + if not attrs: + print 'Stream job is not found.' + sys.exit(-1) + + job_detail = parse_job_detail(attrs) + print '----------JobDescriptions----------' + print json.dumps(job_detail, indent=2) + print '-----------------------------------' + + stream_checkpoints = _list_checkpoints(config, options.stream_id, int(options.timestamp)) + + cps_headers = ['ShardId', 'SendRecordCount', 'Checkpoint', 'SkipCount', 'Version'] + table_content = [] + for cp in stream_checkpoints: + table_content.append([cp['ShardId'], cp['SendRecordCount'], cp['Checkpoint'], cp['SkipCount'], cp['Version']]) + + print tabulate.tabulate(table_content, headers=cps_headers) + + # check if stream job has finished + finished = True + if len(job_detail['ShardIds']) != len(stream_checkpoints): + finished = False + + for cp in stream_checkpoints: + if cp['Version'] != job_detail['Version']: + finished = False + + print '----------JobSummary----------' + print 'ShardsCount:', len(job_detail['ShardIds']) + print 'CheckPointsCount:', len(stream_checkpoints) + print 'JobStatus:', 'Finished' if finished else 'NotFinished' + print '------------------------------' + +def _list_checkpoints(config, stream_id, timestamp): + start_pk = [('StreamId', stream_id), ('StatusType', 'CheckpointForDataxReader'), ('StatusValue', '%16d' % timestamp)] + end_pk = [('StreamId', stream_id), ('StatusType', 'CheckpointForDataxReader'), ('StatusValue', '%16d' % (timestamp + 1))] + + consumed_counter = CapacityUnit(0, 0) + columns_to_get = [] + checkpoints = [] + range_iter = config.ots.xget_range( + config.status_table, Direction.FORWARD, + start_pk, end_pk, + consumed_counter, columns_to_get, 100, + column_filter=None, max_version=1 + ) + + rows = [] + for (primary_key, attrs) in range_iter: + checkpoint = {} + for attr in attrs: + checkpoint[attr[0]] = attr[1] + + if not checkpoint.has_key('SendRecordCount'): + checkpoint['SendRecordCount'] = 0 + checkpoint['ShardId'] = primary_key[2][1].split('\t')[1] + checkpoints.append(checkpoint) + + return checkpoints + +def list_job(config, options): + ''' + Two options: + 1. 
list all jobs of stream + 2. list all jobs and all streams + ''' + consumed_counter = CapacityUnit(0, 0) + + if options.stream_id: + start_pk = [('StreamId', options.stream_id), ('StatusType', INF_MIN), ('StatusValue', INF_MIN)] + end_pk = [('StreamId', options.stream_id), ('StatusType', INF_MAX), ('StatusValue', INF_MAX)] + else: + start_pk = [('StreamId', INF_MIN), ('StatusType', INF_MIN), ('StatusValue', INF_MIN)] + end_pk = [('StreamId', INF_MAX), ('StatusType', INF_MAX), ('StatusValue', INF_MAX)] + + columns_to_get = [] + range_iter = config.ots.xget_range( + config.status_table, Direction.FORWARD, + start_pk, end_pk, + consumed_counter, columns_to_get, None, + column_filter=None, max_version=1 + ) + + rows = [] + for (primary_key, attrs) in range_iter: + if primary_key[1][1] == 'DataxJobDesc': + job_detail = parse_job_detail(attrs) + rows.append([job_detail['TableName'], job_detail['JobStreamId'], job_detail['EndTime'], job_detail['StartTime'], job_detail['EndTime'], job_detail['Version']]) + + headers = ['TableName', 'JobStreamId', 'Timestamp', 'StartTime', 'EndTime', 'Version'] + print tabulate.tabulate(rows, headers=headers) + +def parse_job_detail(attrs): + job_details = {} + shard_ids_content = '' + for attr in attrs: + if attr[0].startswith('ShardIds_'): + shard_ids_content += attr[1] + else: + job_details[attr[0]] = attr[1] + + shard_ids = json.loads(zlib.decompress(shard_ids_content)) + + if not job_details.has_key('Version'): + job_details['Version'] = '' + + if not job_details.has_key('SkipCount'): + job_details['SkipCount'] = 0 + job_details['ShardIds'] = shard_ids + + return job_details + +def parse_time(value): + try: + return int(value) + except Exception,e: + return int(time.mktime(time.strptime(value, '%Y-%m-%d %H:%M:%S'))) + +if __name__ == '__main__': + parser = OptionParser() + parser.add_option('-c', '--config', dest='config_file', help='path of config file', metavar='tablestore_streamreader_config.json') + parser.add_option('-a', '--action', dest='action', help='the action to do', choices = ['describe_job', 'list_job'], metavar='') + parser.add_option('-t', '--timestamp', dest='timestamp', help='the timestamp', metavar='') + parser.add_option('-s', '--streamid', dest='stream_id', help='the id of stream', metavar='') + parser.add_option('-d', '--shardid', dest='shard_id', help='the id of shard', metavar='') + + options, args = parser.parse_args() + + if not options.config_file: + print "Error: Should set the path of config file using '-c' or '--config'." + sys.exit(-1) + + if not options.action: + print "Error: Should set the action using '-a' or '--action'." 
+ sys.exit(-1) + + console_config = ConsoleConfig(options.config_file) + if options.action == 'list_job': + list_job(console_config, options) + elif options.action == 'describe_job': + describe_job(console_config, options) + diff --git a/otsstreamreader/tools/tabulate.py b/otsstreamreader/tools/tabulate.py new file mode 100644 index 0000000000..2444dcbfc7 --- /dev/null +++ b/otsstreamreader/tools/tabulate.py @@ -0,0 +1,1237 @@ +# -*- coding: utf-8 -*- + +"""Pretty-print tabular data.""" + +from __future__ import print_function +from __future__ import unicode_literals +from collections import namedtuple, Iterable +from platform import python_version_tuple +import re + + +if python_version_tuple()[0] < "3": + from itertools import izip_longest + from functools import partial + _none_type = type(None) + _int_type = int + _long_type = long + _float_type = float + _text_type = unicode + _binary_type = str + + def _is_file(f): + return isinstance(f, file) + +else: + from itertools import zip_longest as izip_longest + from functools import reduce, partial + _none_type = type(None) + _int_type = int + _long_type = int + _float_type = float + _text_type = str + _binary_type = bytes + + import io + def _is_file(f): + return isinstance(f, io.IOBase) + +try: + import wcwidth # optional wide-character (CJK) support +except ImportError: + wcwidth = None + + +__all__ = ["tabulate", "tabulate_formats", "simple_separated_format"] +__version__ = "0.7.6-dev" + + +# minimum extra space in headers +MIN_PADDING = 2 + +# if True, enable wide-character (CJK) support +WIDE_CHARS_MODE = wcwidth is not None + + +Line = namedtuple("Line", ["begin", "hline", "sep", "end"]) + + +DataRow = namedtuple("DataRow", ["begin", "sep", "end"]) + + +# A table structure is suppposed to be: +# +# --- lineabove --------- +# headerrow +# --- linebelowheader --- +# datarow +# --- linebewteenrows --- +# ... (more datarows) ... +# --- linebewteenrows --- +# last datarow +# --- linebelow --------- +# +# TableFormat's line* elements can be +# +# - either None, if the element is not used, +# - or a Line tuple, +# - or a function: [col_widths], [col_alignments] -> string. +# +# TableFormat's *row elements can be +# +# - either None, if the element is not used, +# - or a DataRow tuple, +# - or a function: [cell_values], [col_widths], [col_alignments] -> string. +# +# padding (an integer) is the amount of white space around data values. +# +# with_header_hide: +# +# - either None, to display all table elements unconditionally, +# - or a list of elements not to be displayed if the table has column headers. 
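+# An editorial sketch (not part of the upstream tabulate source): the
+# hypothetical `my_fmt` below only illustrates how the Line, DataRow and
+# TableFormat pieces described above combine into a usable custom format;
+# the library's real built-in formats are defined in `_table_formats`
+# further down.
+#
+#   my_fmt = TableFormat(lineabove=None,
+#                        linebelowheader=Line("", "-", "  ", ""),
+#                        linebetweenrows=None, linebelow=None,
+#                        headerrow=DataRow("", "  ", ""),
+#                        datarow=DataRow("", "  ", ""),
+#                        padding=0, with_header_hide=None)
+#   print(tabulate([["spam", 42], ["eggs", 451]],
+#                  headers=["item", "qty"], tablefmt=my_fmt))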
+# +TableFormat = namedtuple("TableFormat", ["lineabove", "linebelowheader", + "linebetweenrows", "linebelow", + "headerrow", "datarow", + "padding", "with_header_hide"]) + + +def _pipe_segment_with_colons(align, colwidth): + """Return a segment of a horizontal line with optional colons which + indicate column's alignment (as in `pipe` output format).""" + w = colwidth + if align in ["right", "decimal"]: + return ('-' * (w - 1)) + ":" + elif align == "center": + return ":" + ('-' * (w - 2)) + ":" + elif align == "left": + return ":" + ('-' * (w - 1)) + else: + return '-' * w + + +def _pipe_line_with_colons(colwidths, colaligns): + """Return a horizontal line with optional colons to indicate column's + alignment (as in `pipe` output format).""" + segments = [_pipe_segment_with_colons(a, w) for a, w in zip(colaligns, colwidths)] + return "|" + "|".join(segments) + "|" + + +def _mediawiki_row_with_attrs(separator, cell_values, colwidths, colaligns): + alignment = { "left": '', + "right": 'align="right"| ', + "center": 'align="center"| ', + "decimal": 'align="right"| ' } + # hard-coded padding _around_ align attribute and value together + # rather than padding parameter which affects only the value + values_with_attrs = [' ' + alignment.get(a, '') + c + ' ' + for c, a in zip(cell_values, colaligns)] + colsep = separator*2 + return (separator + colsep.join(values_with_attrs)).rstrip() + + +def _textile_row_with_attrs(cell_values, colwidths, colaligns): + cell_values[0] += ' ' + alignment = { "left": "<.", "right": ">.", "center": "=.", "decimal": ">." } + values = (alignment.get(a, '') + v for a, v in zip(colaligns, cell_values)) + return '|' + '|'.join(values) + '|' + + +def _html_begin_table_without_header(colwidths_ignore, colaligns_ignore): + # this table header will be suppressed if there is a header row + return "\n".join(["

", ""]) + + +def _html_row_with_attrs(celltag, cell_values, colwidths, colaligns): + alignment = { "left": '', + "right": ' style="text-align: right;"', + "center": ' style="text-align: center;"', + "decimal": ' style="text-align: right;"' } + values_with_attrs = ["<{0}{1}>{2}".format(celltag, alignment.get(a, ''), c) + for c, a in zip(cell_values, colaligns)] + rowhtml = "" + "".join(values_with_attrs).rstrip() + "" + if celltag == "th": # it's a header row, create a new table header + rowhtml = "\n".join(["
", + "", + rowhtml, + "", + ""]) + return rowhtml + +def _moin_row_with_attrs(celltag, cell_values, colwidths, colaligns, header=''): + alignment = { "left": '', + "right": '', + "center": '', + "decimal": '' } + values_with_attrs = ["{0}{1} {2} ".format(celltag, + alignment.get(a, ''), + header+c+header) + for c, a in zip(cell_values, colaligns)] + return "".join(values_with_attrs)+"||" + +def _latex_line_begin_tabular(colwidths, colaligns, booktabs=False): + alignment = { "left": "l", "right": "r", "center": "c", "decimal": "r" } + tabular_columns_fmt = "".join([alignment.get(a, "l") for a in colaligns]) + return "\n".join(["\\begin{tabular}{" + tabular_columns_fmt + "}", + "\\toprule" if booktabs else "\hline"]) + +LATEX_ESCAPE_RULES = {r"&": r"\&", r"%": r"\%", r"$": r"\$", r"#": r"\#", + r"_": r"\_", r"^": r"\^{}", r"{": r"\{", r"}": r"\}", + r"~": r"\textasciitilde{}", "\\": r"\textbackslash{}", + r"<": r"\ensuremath{<}", r">": r"\ensuremath{>}"} + + +def _latex_row(cell_values, colwidths, colaligns): + def escape_char(c): + return LATEX_ESCAPE_RULES.get(c, c) + escaped_values = ["".join(map(escape_char, cell)) for cell in cell_values] + rowfmt = DataRow("", "&", "\\\\") + return _build_simple_row(escaped_values, rowfmt) + + +_table_formats = {"simple": + TableFormat(lineabove=Line("", "-", " ", ""), + linebelowheader=Line("", "-", " ", ""), + linebetweenrows=None, + linebelow=Line("", "-", " ", ""), + headerrow=DataRow("", " ", ""), + datarow=DataRow("", " ", ""), + padding=0, + with_header_hide=["lineabove", "linebelow"]), + "plain": + TableFormat(lineabove=None, linebelowheader=None, + linebetweenrows=None, linebelow=None, + headerrow=DataRow("", " ", ""), + datarow=DataRow("", " ", ""), + padding=0, with_header_hide=None), + "grid": + TableFormat(lineabove=Line("+", "-", "+", "+"), + linebelowheader=Line("+", "=", "+", "+"), + linebetweenrows=Line("+", "-", "+", "+"), + linebelow=Line("+", "-", "+", "+"), + headerrow=DataRow("|", "|", "|"), + datarow=DataRow("|", "|", "|"), + padding=1, with_header_hide=None), + "fancy_grid": + TableFormat(lineabove=Line("╒", "═", "╤", "╕"), + linebelowheader=Line("╞", "═", "╪", "╡"), + linebetweenrows=Line("├", "─", "┼", "┤"), + linebelow=Line("╘", "═", "╧", "╛"), + headerrow=DataRow("│", "│", "│"), + datarow=DataRow("│", "│", "│"), + padding=1, with_header_hide=None), + "pipe": + TableFormat(lineabove=_pipe_line_with_colons, + linebelowheader=_pipe_line_with_colons, + linebetweenrows=None, + linebelow=None, + headerrow=DataRow("|", "|", "|"), + datarow=DataRow("|", "|", "|"), + padding=1, + with_header_hide=["lineabove"]), + "orgtbl": + TableFormat(lineabove=None, + linebelowheader=Line("|", "-", "+", "|"), + linebetweenrows=None, + linebelow=None, + headerrow=DataRow("|", "|", "|"), + datarow=DataRow("|", "|", "|"), + padding=1, with_header_hide=None), + "jira": + TableFormat(lineabove=None, + linebelowheader=None, + linebetweenrows=None, + linebelow=None, + headerrow=DataRow("||", "||", "||"), + datarow=DataRow("|", "|", "|"), + padding=1, with_header_hide=None), + "psql": + TableFormat(lineabove=Line("+", "-", "+", "+"), + linebelowheader=Line("|", "-", "+", "|"), + linebetweenrows=None, + linebelow=Line("+", "-", "+", "+"), + headerrow=DataRow("|", "|", "|"), + datarow=DataRow("|", "|", "|"), + padding=1, with_header_hide=None), + "rst": + TableFormat(lineabove=Line("", "=", " ", ""), + linebelowheader=Line("", "=", " ", ""), + linebetweenrows=None, + linebelow=Line("", "=", " ", ""), + headerrow=DataRow("", " ", ""), + datarow=DataRow("", 
" ", ""), + padding=0, with_header_hide=None), + "mediawiki": + TableFormat(lineabove=Line("{| class=\"wikitable\" style=\"text-align: left;\"", + "", "", "\n|+ \n|-"), + linebelowheader=Line("|-", "", "", ""), + linebetweenrows=Line("|-", "", "", ""), + linebelow=Line("|}", "", "", ""), + headerrow=partial(_mediawiki_row_with_attrs, "!"), + datarow=partial(_mediawiki_row_with_attrs, "|"), + padding=0, with_header_hide=None), + "moinmoin": + TableFormat(lineabove=None, + linebelowheader=None, + linebetweenrows=None, + linebelow=None, + headerrow=partial(_moin_row_with_attrs,"||",header="'''"), + datarow=partial(_moin_row_with_attrs,"||"), + padding=1, with_header_hide=None), + "html": + TableFormat(lineabove=_html_begin_table_without_header, + linebelowheader="", + linebetweenrows=None, + linebelow=Line("\n
", "", "", ""), + headerrow=partial(_html_row_with_attrs, "th"), + datarow=partial(_html_row_with_attrs, "td"), + padding=0, with_header_hide=["lineabove"]), + "latex": + TableFormat(lineabove=_latex_line_begin_tabular, + linebelowheader=Line("\\hline", "", "", ""), + linebetweenrows=None, + linebelow=Line("\\hline\n\\end{tabular}", "", "", ""), + headerrow=_latex_row, + datarow=_latex_row, + padding=1, with_header_hide=None), + "latex_booktabs": + TableFormat(lineabove=partial(_latex_line_begin_tabular, booktabs=True), + linebelowheader=Line("\\midrule", "", "", ""), + linebetweenrows=None, + linebelow=Line("\\bottomrule\n\\end{tabular}", "", "", ""), + headerrow=_latex_row, + datarow=_latex_row, + padding=1, with_header_hide=None), + "tsv": + TableFormat(lineabove=None, linebelowheader=None, + linebetweenrows=None, linebelow=None, + headerrow=DataRow("", "\t", ""), + datarow=DataRow("", "\t", ""), + padding=0, with_header_hide=None), + "textile": + TableFormat(lineabove=None, linebelowheader=None, + linebetweenrows=None, linebelow=None, + headerrow=DataRow("|_. ", "|_.", "|"), + datarow=_textile_row_with_attrs, + padding=1, with_header_hide=None)} + + +tabulate_formats = list(sorted(_table_formats.keys())) + + +_invisible_codes = re.compile(r"\x1b\[\d*m|\x1b\[\d*\;\d*\;\d*m") # ANSI color codes +_invisible_codes_bytes = re.compile(b"\x1b\[\d*m|\x1b\[\d*\;\d*\;\d*m") # ANSI color codes + + +def simple_separated_format(separator): + """Construct a simple TableFormat with columns separated by a separator. + + >>> tsv = simple_separated_format("\\t") ; \ + tabulate([["foo", 1], ["spam", 23]], tablefmt=tsv) == 'foo \\t 1\\nspam\\t23' + True + + """ + return TableFormat(None, None, None, None, + headerrow=DataRow('', separator, ''), + datarow=DataRow('', separator, ''), + padding=0, with_header_hide=None) + + +def _isconvertible(conv, string): + try: + n = conv(string) + return True + except (ValueError, TypeError): + return False + + +def _isnumber(string): + """ + >>> _isnumber("123.45") + True + >>> _isnumber("123") + True + >>> _isnumber("spam") + False + """ + return _isconvertible(float, string) + + +def _isint(string, inttype=int): + """ + >>> _isint("123") + True + >>> _isint("123.45") + False + """ + return type(string) is inttype or\ + (isinstance(string, _binary_type) or isinstance(string, _text_type))\ + and\ + _isconvertible(inttype, string) + + +def _type(string, has_invisible=True): + """The least generic type (type(None), int, float, str, unicode). + + >>> _type(None) is type(None) + True + >>> _type("foo") is type("") + True + >>> _type("1") is type(1) + True + >>> _type('\x1b[31m42\x1b[0m') is type(42) + True + >>> _type('\x1b[31m42\x1b[0m') is type(42) + True + + """ + + if has_invisible and \ + (isinstance(string, _text_type) or isinstance(string, _binary_type)): + string = _strip_invisible(string) + + if string is None: + return _none_type + elif hasattr(string, "isoformat"): # datetime.datetime, date, and time + return _text_type + elif _isint(string): + return int + elif _isint(string, _long_type): + return int + elif _isnumber(string): + return float + elif isinstance(string, _binary_type): + return _binary_type + else: + return _text_type + + +def _afterpoint(string): + """Symbols after a decimal point, -1 if the string lacks the decimal point. 
+ + >>> _afterpoint("123.45") + 2 + >>> _afterpoint("1001") + -1 + >>> _afterpoint("eggs") + -1 + >>> _afterpoint("123e45") + 2 + + """ + if _isnumber(string): + if _isint(string): + return -1 + else: + pos = string.rfind(".") + pos = string.lower().rfind("e") if pos < 0 else pos + if pos >= 0: + return len(string) - pos - 1 + else: + return -1 # no point + else: + return -1 # not a number + + +def _padleft(width, s): + """Flush right. + + >>> _padleft(6, '\u044f\u0439\u0446\u0430') == ' \u044f\u0439\u0446\u0430' + True + + """ + fmt = "{0:>%ds}" % width + return fmt.format(s) + + +def _padright(width, s): + """Flush left. + + >>> _padright(6, '\u044f\u0439\u0446\u0430') == '\u044f\u0439\u0446\u0430 ' + True + + """ + fmt = "{0:<%ds}" % width + return fmt.format(s) + + +def _padboth(width, s): + """Center string. + + >>> _padboth(6, '\u044f\u0439\u0446\u0430') == ' \u044f\u0439\u0446\u0430 ' + True + + """ + fmt = "{0:^%ds}" % width + return fmt.format(s) + + +def _strip_invisible(s): + "Remove invisible ANSI color codes." + if isinstance(s, _text_type): + return re.sub(_invisible_codes, "", s) + else: # a bytestring + return re.sub(_invisible_codes_bytes, "", s) + + +def _visible_width(s): + """Visible width of a printed string. ANSI color codes are removed. + + >>> _visible_width('\x1b[31mhello\x1b[0m'), _visible_width("world") + (5, 5) + + """ + # optional wide-character support + if wcwidth is not None and WIDE_CHARS_MODE: + len_fn = wcwidth.wcswidth + else: + len_fn = len + if isinstance(s, _text_type) or isinstance(s, _binary_type): + return len_fn(_strip_invisible(s)) + else: + return len_fn(_text_type(s)) + + +def _align_column(strings, alignment, minwidth=0, has_invisible=True): + """[string] -> [padded_string] + + >>> list(map(str,_align_column(["12.345", "-1234.5", "1.23", "1234.5", "1e+234", "1.0e234"], "decimal"))) + [' 12.345 ', '-1234.5 ', ' 1.23 ', ' 1234.5 ', ' 1e+234 ', ' 1.0e234'] + + >>> list(map(str,_align_column(['123.4', '56.7890'], None))) + ['123.4', '56.7890'] + + """ + if alignment == "right": + strings = [s.strip() for s in strings] + padfn = _padleft + elif alignment == "center": + strings = [s.strip() for s in strings] + padfn = _padboth + elif alignment == "decimal": + if has_invisible: + decimals = [_afterpoint(_strip_invisible(s)) for s in strings] + else: + decimals = [_afterpoint(s) for s in strings] + maxdecimals = max(decimals) + strings = [s + (maxdecimals - decs) * " " + for s, decs in zip(strings, decimals)] + padfn = _padleft + elif not alignment: + return strings + else: + strings = [s.strip() for s in strings] + padfn = _padright + + enable_widechars = wcwidth is not None and WIDE_CHARS_MODE + if has_invisible: + width_fn = _visible_width + elif enable_widechars: # optional wide-character support if available + width_fn = wcwidth.wcswidth + else: + width_fn = len + + s_lens = list(map(len, strings)) + s_widths = list(map(width_fn, strings)) + maxwidth = max(max(s_widths), minwidth) + if not enable_widechars and not has_invisible: + padded_strings = [padfn(maxwidth, s) for s in strings] + else: + # enable wide-character width corrections + visible_widths = [maxwidth - (w - l) for w, l in zip(s_widths, s_lens)] + # wcswidth and _visible_width don't count invisible characters; + # padfn doesn't need to apply another correction + padded_strings = [padfn(w, s) for s, w in zip(strings, visible_widths)] + return padded_strings + + +def _more_generic(type1, type2): + types = { _none_type: 0, int: 1, float: 2, _binary_type: 3, _text_type: 4 } + invtypes = 
{ 4: _text_type, 3: _binary_type, 2: float, 1: int, 0: _none_type } + moregeneric = max(types.get(type1, 4), types.get(type2, 4)) + return invtypes[moregeneric] + + +def _column_type(strings, has_invisible=True): + """The least generic type all column values are convertible to. + + >>> _column_type(["1", "2"]) is _int_type + True + >>> _column_type(["1", "2.3"]) is _float_type + True + >>> _column_type(["1", "2.3", "four"]) is _text_type + True + >>> _column_type(["four", '\u043f\u044f\u0442\u044c']) is _text_type + True + >>> _column_type([None, "brux"]) is _text_type + True + >>> _column_type([1, 2, None]) is _int_type + True + >>> import datetime as dt + >>> _column_type([dt.datetime(1991,2,19), dt.time(17,35)]) is _text_type + True + + """ + types = [_type(s, has_invisible) for s in strings ] + return reduce(_more_generic, types, int) + + +def _format(val, valtype, floatfmt, missingval="", has_invisible=True): + """Format a value accoding to its type. + + Unicode is supported: + + >>> hrow = ['\u0431\u0443\u043a\u0432\u0430', '\u0446\u0438\u0444\u0440\u0430'] ; \ + tbl = [['\u0430\u0437', 2], ['\u0431\u0443\u043a\u0438', 4]] ; \ + good_result = '\\u0431\\u0443\\u043a\\u0432\\u0430 \\u0446\\u0438\\u0444\\u0440\\u0430\\n------- -------\\n\\u0430\\u0437 2\\n\\u0431\\u0443\\u043a\\u0438 4' ; \ + tabulate(tbl, headers=hrow) == good_result + True + + """ + if val is None: + return missingval + + if valtype in [int, _text_type]: + return "{0}".format(val) + elif valtype is _binary_type: + try: + return _text_type(val, "ascii") + except TypeError: + return _text_type(val) + elif valtype is float: + is_a_colored_number = has_invisible and isinstance(val, (_text_type, _binary_type)) + if is_a_colored_number: + raw_val = _strip_invisible(val) + formatted_val = format(float(raw_val), floatfmt) + return val.replace(raw_val, formatted_val) + else: + return format(float(val), floatfmt) + else: + return "{0}".format(val) + + +def _align_header(header, alignment, width, visible_width): + "Pad string header to width chars given known visible_width of the header." + width += len(header) - visible_width + if alignment == "left": + return _padright(width, header) + elif alignment == "center": + return _padboth(width, header) + elif not alignment: + return "{0}".format(header) + else: + return _padleft(width, header) + + +def _prepend_row_index(rows, index): + """Add a left-most index column.""" + if index is None or index is False: + return rows + if len(index) != len(rows): + print('index=', index) + print('rows=', rows) + raise ValueError('index must be as long as the number of data rows') + rows = [[v]+list(row) for v,row in zip(index, rows)] + return rows + + +def _bool(val): + "A wrapper around standard bool() which doesn't throw on NumPy arrays" + try: + return bool(val) + except ValueError: # val is likely to be a numpy array with many elements + return False + + +def _normalize_tabular_data(tabular_data, headers, showindex="default"): + """Transform a supported data type to a list of lists, and a list of headers. 
+ + Supported tabular data types: + + * list-of-lists or another iterable of iterables + + * list of named tuples (usually used with headers="keys") + + * list of dicts (usually used with headers="keys") + + * list of OrderedDicts (usually used with headers="keys") + + * 2D NumPy arrays + + * NumPy record arrays (usually used with headers="keys") + + * dict of iterables (usually used with headers="keys") + + * pandas.DataFrame (usually used with headers="keys") + + The first row can be used as headers if headers="firstrow", + column indices can be used as headers if headers="keys". + + If showindex="default", show row indices of the pandas.DataFrame. + If showindex="always", show row indices for all types of data. + If showindex="never", don't show row indices for all types of data. + If showindex is an iterable, show its values as row indices. + + """ + + try: + bool(headers) + is_headers2bool_broken = False + except ValueError: # numpy.ndarray, pandas.core.index.Index, ... + is_headers2bool_broken = True + headers = list(headers) + + index = None + if hasattr(tabular_data, "keys") and hasattr(tabular_data, "values"): + # dict-like and pandas.DataFrame? + if hasattr(tabular_data.values, "__call__"): + # likely a conventional dict + keys = tabular_data.keys() + rows = list(izip_longest(*tabular_data.values())) # columns have to be transposed + elif hasattr(tabular_data, "index"): + # values is a property, has .index => it's likely a pandas.DataFrame (pandas 0.11.0) + keys = tabular_data.keys() + vals = tabular_data.values # values matrix doesn't need to be transposed + # for DataFrames add an index per default + index = list(tabular_data.index) + rows = [list(row) for row in vals] + else: + raise ValueError("tabular data doesn't appear to be a dict or a DataFrame") + + if headers == "keys": + headers = list(map(_text_type,keys)) # headers should be strings + + else: # it's a usual an iterable of iterables, or a NumPy array + rows = list(tabular_data) + + if (headers == "keys" and + hasattr(tabular_data, "dtype") and + getattr(tabular_data.dtype, "names")): + # numpy record array + headers = tabular_data.dtype.names + elif (headers == "keys" + and len(rows) > 0 + and isinstance(rows[0], tuple) + and hasattr(rows[0], "_fields")): + # namedtuple + headers = list(map(_text_type, rows[0]._fields)) + elif (len(rows) > 0 + and isinstance(rows[0], dict)): + # dict or OrderedDict + uniq_keys = set() # implements hashed lookup + keys = [] # storage for set + if headers == "firstrow": + firstdict = rows[0] if len(rows) > 0 else {} + keys.extend(firstdict.keys()) + uniq_keys.update(keys) + rows = rows[1:] + for row in rows: + for k in row.keys(): + #Save unique items in input order + if k not in uniq_keys: + keys.append(k) + uniq_keys.add(k) + if headers == 'keys': + headers = keys + elif isinstance(headers, dict): + # a dict of headers for a list of dicts + headers = [headers.get(k, k) for k in keys] + headers = list(map(_text_type, headers)) + elif headers == "firstrow": + if len(rows) > 0: + headers = [firstdict.get(k, k) for k in keys] + headers = list(map(_text_type, headers)) + else: + headers = [] + elif headers: + raise ValueError('headers for a list of dicts is not a dict or a keyword') + rows = [[row.get(k) for k in keys] for row in rows] + + elif headers == "keys" and len(rows) > 0: + # keys are column indices + headers = list(map(_text_type, range(len(rows[0])))) + + # take headers from the first row if necessary + if headers == "firstrow" and len(rows) > 0: + if index is not None: + 
headers = [index[0]] + list(rows[0]) + index = index[1:] + else: + headers = rows[0] + headers = list(map(_text_type, headers)) # headers should be strings + rows = rows[1:] + + headers = list(map(_text_type,headers)) + rows = list(map(list,rows)) + + # add or remove an index column + showindex_is_a_str = type(showindex) in [_text_type, _binary_type] + if showindex == "default" and index is not None: + rows = _prepend_row_index(rows, index) + elif isinstance(showindex, Iterable) and not showindex_is_a_str: + rows = _prepend_row_index(rows, list(showindex)) + elif showindex == "always" or (_bool(showindex) and not showindex_is_a_str): + if index is None: + index = list(range(len(rows))) + rows = _prepend_row_index(rows, index) + elif showindex == "never" or (not _bool(showindex) and not showindex_is_a_str): + pass + + # pad with empty headers for initial columns if necessary + if headers and len(rows) > 0: + nhs = len(headers) + ncols = len(rows[0]) + if nhs < ncols: + headers = [""]*(ncols - nhs) + headers + + return rows, headers + + +def tabulate(tabular_data, headers=(), tablefmt="simple", + floatfmt="g", numalign="decimal", stralign="left", + missingval="", showindex="default"): + """Format a fixed width table for pretty printing. + + >>> print(tabulate([[1, 2.34], [-56, "8.999"], ["2", "10001"]])) + --- --------- + 1 2.34 + -56 8.999 + 2 10001 + --- --------- + + The first required argument (`tabular_data`) can be a + list-of-lists (or another iterable of iterables), a list of named + tuples, a dictionary of iterables, an iterable of dictionaries, + a two-dimensional NumPy array, NumPy record array, or a Pandas' + dataframe. + + + Table headers + ------------- + + To print nice column headers, supply the second argument (`headers`): + + - `headers` can be an explicit list of column headers + - if `headers="firstrow"`, then the first row of data is used + - if `headers="keys"`, then dictionary keys or column indices are used + + Otherwise a headerless table is produced. + + If the number of headers is less than the number of columns, they + are supposed to be names of the last columns. This is consistent + with the plain-text format of R and Pandas' dataframes. + + >>> print(tabulate([["sex","age"],["Alice","F",24],["Bob","M",19]], + ... headers="firstrow")) + sex age + ----- ----- ----- + Alice F 24 + Bob M 19 + + By default, pandas.DataFrame data have an additional column called + row index. To add a similar column to all other types of data, + use `showindex="always"` or `showindex=True`. To suppress row indices + for all types of data, pass `showindex="never" or `showindex=False`. + To add a custom row index column, pass `showindex=some_iterable`. + + >>> print(tabulate([["F",24],["M",19]], showindex="always")) + - - -- + 0 F 24 + 1 M 19 + - - -- + + + Column alignment + ---------------- + + `tabulate` tries to detect column types automatically, and aligns + the values properly. By default it aligns decimal points of the + numbers (or flushes integer numbers to the right), and flushes + everything else to the left. Possible column alignments + (`numalign`, `stralign`) are: "right", "center", "left", "decimal" + (only for `numalign`), and None (to disable alignment). + + + Table formats + ------------- + + `floatfmt` is a format specification used for columns which + contain numeric data with a decimal point. + + `None` values are replaced with a `missingval` string: + + >>> print(tabulate([["spam", 1, None], + ... ["eggs", 42, 3.14], + ... 
["other", None, 2.7]], missingval="?")) + ----- -- ---- + spam 1 ? + eggs 42 3.14 + other ? 2.7 + ----- -- ---- + + Various plain-text table formats (`tablefmt`) are supported: + 'plain', 'simple', 'grid', 'pipe', 'orgtbl', 'rst', 'mediawiki', + 'latex', and 'latex_booktabs'. Variable `tabulate_formats` contains the list of + currently supported formats. + + "plain" format doesn't use any pseudographics to draw tables, + it separates columns with a double space: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], + ... ["strings", "numbers"], "plain")) + strings numbers + spam 41.9999 + eggs 451 + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="plain")) + spam 41.9999 + eggs 451 + + "simple" format is like Pandoc simple_tables: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], + ... ["strings", "numbers"], "simple")) + strings numbers + --------- --------- + spam 41.9999 + eggs 451 + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="simple")) + ---- -------- + spam 41.9999 + eggs 451 + ---- -------- + + "grid" is similar to tables produced by Emacs table.el package or + Pandoc grid_tables: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], + ... ["strings", "numbers"], "grid")) + +-----------+-----------+ + | strings | numbers | + +===========+===========+ + | spam | 41.9999 | + +-----------+-----------+ + | eggs | 451 | + +-----------+-----------+ + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="grid")) + +------+----------+ + | spam | 41.9999 | + +------+----------+ + | eggs | 451 | + +------+----------+ + + "fancy_grid" draws a grid using box-drawing characters: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], + ... ["strings", "numbers"], "fancy_grid")) + ╒═══════════╤═══════════╕ + │ strings │ numbers │ + ╞═══════════╪═══════════╡ + │ spam │ 41.9999 │ + ├───────────┼───────────┤ + │ eggs │ 451 │ + ╘═══════════╧═══════════╛ + + "pipe" is like tables in PHP Markdown Extra extension or Pandoc + pipe_tables: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], + ... ["strings", "numbers"], "pipe")) + | strings | numbers | + |:----------|----------:| + | spam | 41.9999 | + | eggs | 451 | + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="pipe")) + |:-----|---------:| + | spam | 41.9999 | + | eggs | 451 | + + "orgtbl" is like tables in Emacs org-mode and orgtbl-mode. They + are slightly different from "pipe" format by not using colons to + define column alignment, and using a "+" sign to indicate line + intersections: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], + ... ["strings", "numbers"], "orgtbl")) + | strings | numbers | + |-----------+-----------| + | spam | 41.9999 | + | eggs | 451 | + + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="orgtbl")) + | spam | 41.9999 | + | eggs | 451 | + + "rst" is like a simple table format from reStructuredText; please + note that reStructuredText accepts also "grid" tables: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], + ... 
["strings", "numbers"], "rst")) + ========= ========= + strings numbers + ========= ========= + spam 41.9999 + eggs 451 + ========= ========= + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="rst")) + ==== ======== + spam 41.9999 + eggs 451 + ==== ======== + + "mediawiki" produces a table markup used in Wikipedia and on other + MediaWiki-based sites: + + >>> print(tabulate([["strings", "numbers"], ["spam", 41.9999], ["eggs", "451.0"]], + ... headers="firstrow", tablefmt="mediawiki")) + {| class="wikitable" style="text-align: left;" + |+ + |- + ! strings !! align="right"| numbers + |- + | spam || align="right"| 41.9999 + |- + | eggs || align="right"| 451 + |} + + "html" produces HTML markup: + + >>> print(tabulate([["strings", "numbers"], ["spam", 41.9999], ["eggs", "451.0"]], + ... headers="firstrow", tablefmt="html")) + + + + + + + + +
+    <table>
+    <thead>
+    <tr><th>strings  </th><th style="text-align: right;">  numbers</th></tr>
+    </thead>
+    <tbody>
+    <tr><td>spam     </td><td style="text-align: right;">  41.9999</td></tr>
+    <tr><td>eggs     </td><td style="text-align: right;"> 451     </td></tr>
+    </tbody>
+    </table>
+ + "latex" produces a tabular environment of LaTeX document markup: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="latex")) + \\begin{tabular}{lr} + \\hline + spam & 41.9999 \\\\ + eggs & 451 \\\\ + \\hline + \\end{tabular} + + "latex_booktabs" produces a tabular environment of LaTeX document markup + using the booktabs.sty package: + + >>> print(tabulate([["spam", 41.9999], ["eggs", "451.0"]], tablefmt="latex_booktabs")) + \\begin{tabular}{lr} + \\toprule + spam & 41.9999 \\\\ + eggs & 451 \\\\ + \\bottomrule + \end{tabular} + """ + if tabular_data is None: + tabular_data = [] + list_of_lists, headers = _normalize_tabular_data( + tabular_data, headers, showindex=showindex) + + # optimization: look for ANSI control codes once, + # enable smart width functions only if a control code is found + plain_text = '\n'.join(['\t'.join(map(_text_type, headers))] + \ + ['\t'.join(map(_text_type, row)) for row in list_of_lists]) + + has_invisible = re.search(_invisible_codes, plain_text) + enable_widechars = wcwidth is not None and WIDE_CHARS_MODE + if has_invisible: + width_fn = _visible_width + elif enable_widechars: # optional wide-character support if available + width_fn = wcwidth.wcswidth + else: + width_fn = len + + # format rows and columns, convert numeric values to strings + cols = list(zip(*list_of_lists)) + coltypes = list(map(_column_type, cols)) + cols = [[_format(v, ct, floatfmt, missingval, has_invisible) for v in c] + for c,ct in zip(cols, coltypes)] + + # align columns + aligns = [numalign if ct in [int,float] else stralign for ct in coltypes] + minwidths = [width_fn(h) + MIN_PADDING for h in headers] if headers else [0]*len(cols) + cols = [_align_column(c, a, minw, has_invisible) + for c, a, minw in zip(cols, aligns, minwidths)] + + if headers: + # align headers and add headers + t_cols = cols or [['']] * len(headers) + t_aligns = aligns or [stralign] * len(headers) + minwidths = [max(minw, width_fn(c[0])) for minw, c in zip(minwidths, t_cols)] + headers = [_align_header(h, a, minw, width_fn(h)) + for h, a, minw in zip(headers, t_aligns, minwidths)] + rows = list(zip(*cols)) + else: + minwidths = [width_fn(c[0]) for c in cols] + rows = list(zip(*cols)) + + if not isinstance(tablefmt, TableFormat): + tablefmt = _table_formats.get(tablefmt, _table_formats["simple"]) + + return _format_table(tablefmt, headers, rows, minwidths, aligns) + + +def _build_simple_row(padded_cells, rowfmt): + "Format row according to DataRow format without padding." + begin, sep, end = rowfmt + return (begin + sep.join(padded_cells) + end).rstrip() + + +def _build_row(padded_cells, colwidths, colaligns, rowfmt): + "Return a string which represents a row of data cells." + if not rowfmt: + return None + if hasattr(rowfmt, "__call__"): + return rowfmt(padded_cells, colwidths, colaligns) + else: + return _build_simple_row(padded_cells, rowfmt) + + +def _build_line(colwidths, colaligns, linefmt): + "Return a string which represents a horizontal line." 
+ if not linefmt: + return None + if hasattr(linefmt, "__call__"): + return linefmt(colwidths, colaligns) + else: + begin, fill, sep, end = linefmt + cells = [fill*w for w in colwidths] + return _build_simple_row(cells, (begin, sep, end)) + + +def _pad_row(cells, padding): + if cells: + pad = " "*padding + padded_cells = [pad + cell + pad for cell in cells] + return padded_cells + else: + return cells + + +def _format_table(fmt, headers, rows, colwidths, colaligns): + """Produce a plain-text representation of the table.""" + lines = [] + hidden = fmt.with_header_hide if (headers and fmt.with_header_hide) else [] + pad = fmt.padding + headerrow = fmt.headerrow + + padded_widths = [(w + 2*pad) for w in colwidths] + padded_headers = _pad_row(headers, pad) + padded_rows = [_pad_row(row, pad) for row in rows] + + if fmt.lineabove and "lineabove" not in hidden: + lines.append(_build_line(padded_widths, colaligns, fmt.lineabove)) + + if padded_headers: + lines.append(_build_row(padded_headers, padded_widths, colaligns, headerrow)) + if fmt.linebelowheader and "linebelowheader" not in hidden: + lines.append(_build_line(padded_widths, colaligns, fmt.linebelowheader)) + + if padded_rows and fmt.linebetweenrows and "linebetweenrows" not in hidden: + # initial rows with a line below + for row in padded_rows[:-1]: + lines.append(_build_row(row, padded_widths, colaligns, fmt.datarow)) + lines.append(_build_line(padded_widths, colaligns, fmt.linebetweenrows)) + # the last row without a line below + lines.append(_build_row(padded_rows[-1], padded_widths, colaligns, fmt.datarow)) + else: + for row in padded_rows: + lines.append(_build_row(row, padded_widths, colaligns, fmt.datarow)) + + if fmt.linebelow and "linebelow" not in hidden: + lines.append(_build_line(padded_widths, colaligns, fmt.linebelow)) + + return "\n".join(lines) + + +def _main(): + """\ + Usage: tabulate [options] [FILE ...] + + Pretty-print tabular data. + See also https://bitbucket.org/astanin/python-tabulate + + FILE a filename of the file with tabular data; + if "-" or missing, read data from stdin. 
+ + Options: + + -h, --help show this message + -1, --header use the first row of data as a table header + -o FILE, --output FILE print table to FILE (default: stdout) + -s REGEXP, --sep REGEXP use a custom column separator (default: whitespace) + -F FPFMT, --float FPFMT floating point number format (default: g) + -f FMT, --format FMT set output table format; supported formats: + plain, simple, grid, fancy_grid, pipe, orgtbl, + rst, mediawiki, html, latex, latex_booktabs, tsv + (default: simple) + """ + import getopt + import sys + import textwrap + usage = textwrap.dedent(_main.__doc__) + try: + opts, args = getopt.getopt(sys.argv[1:], + "h1o:s:F:f:", + ["help", "header", "output", "sep=", "float=", "format="]) + except getopt.GetoptError as e: + print(e) + print(usage) + sys.exit(2) + headers = [] + floatfmt = "g" + tablefmt = "simple" + sep = r"\s+" + outfile = "-" + for opt, value in opts: + if opt in ["-1", "--header"]: + headers = "firstrow" + elif opt in ["-o", "--output"]: + outfile = value + elif opt in ["-F", "--float"]: + floatfmt = value + elif opt in ["-f", "--format"]: + if value not in tabulate_formats: + print("%s is not a supported table format" % value) + print(usage) + sys.exit(3) + tablefmt = value + elif opt in ["-s", "--sep"]: + sep = value + elif opt in ["-h", "--help"]: + print(usage) + sys.exit(0) + files = [sys.stdin] if not args else args + with (sys.stdout if outfile == "-" else open(outfile, "w")) as out: + for f in files: + if f == "-": + f = sys.stdin + if _is_file(f): + _pprint_file(f, headers=headers, tablefmt=tablefmt, + sep=sep, floatfmt=floatfmt, file=out) + else: + with open(f) as fobj: + _pprint_file(fobj, headers=headers, tablefmt=tablefmt, + sep=sep, floatfmt=floatfmt, file=out) + + +def _pprint_file(fobject, headers, tablefmt, sep, floatfmt, file): + rows = fobject.readlines() + table = [re.split(sep, r.rstrip()) for r in rows if r.strip()] + print(tabulate(table, headers, tablefmt, floatfmt=floatfmt), file=file) + + +if __name__ == "__main__": + _main() \ No newline at end of file diff --git a/otswriter/doc/otswriter.md b/otswriter/doc/otswriter.md new file mode 100644 index 0000000000..cbfaf2a865 --- /dev/null +++ b/otswriter/doc/otswriter.md @@ -0,0 +1,239 @@ + +# OTSWriter 插件文档 + + +___ + + +## 1 快速介绍 + +OTSWriter插件实现了向OTS写入数据,目前支持三种写入方式: + +* PutRow,对应于OTS API PutRow,插入数据到指定的行,如果该行不存在,则新增一行;若该行存在,则覆盖原有行。 + +* UpdateRow,对应于OTS API UpdateRow,更新指定行的数据,如果该行不存在,则新增一行;若该行存在,则根据请求的内容在这一行中新增、修改或者删除指定列的值。 + +* DeleteRow,对应于OTS API DeleteRow,删除指定行的数据。 + +OTS是构建在阿里云飞天分布式系统之上的 NoSQL数据库服务,提供海量结构化数据的存储和实时访问。OTS 以实例和表的形式组织数据,通过数据分片和负载均衡技术,实现规模上的无缝扩展。 + +## 2 实现原理 + +简而言之,OTSWriter通过OTS官方Java SDK连接到OTS服务端,并通过SDK写入OTS服务端。OTSWriter本身对于写入过程做了很多优化,包括写入超时重试、异常写入重试、批量提交等Feature。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个写入OTS作业: + +``` +{ + "job": { + "setting": { + }, + "content": [ + { + "reader": {}, + "writer": { + "name": "otswriter", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + // 导出数据表的表名 + "table":"", + + // Writer支持不同类型之间进行相互转换 + // 如下类型转换不支持: + // ================================ + // int -> binary + // double -> bool, binary + // bool -> binary + // bytes -> int, double, bool + // ================================ + + // 需要导入的PK列名,区分大小写 + // 类型支持:STRING,INT + // 1. 支持类型转换,注意类型转换时的精度丢失 + // 2. 
顺序不要求和表的Meta一致 + "primaryKey" : [ + {"name":"pk1", "type":"string"}, + {"name":"pk2", "type":"int"} + ], + + // 需要导入的列名,区分大小写 + // 类型支持STRING,INT,DOUBLE,BOOL和BINARY + "column" : [ + {"name":"col2", "type":"INT"}, + {"name":"col3", "type":"STRING"}, + {"name":"col4", "type":"STRING"}, + {"name":"col5", "type":"BINARY"}, + {"name":"col6", "type":"DOUBLE"} + ], + + // 写入OTS的方式 + // PutRow : 等同于OTS API中PutRow操作,检查条件是ignore + // UpdateRow : 等同于OTS API中UpdateRow操作,检查条件是ignore + // DeleteRow: 等同于OTS API中DeleteRow操作,检查条件是ignore + "writeMode" : "PutRow" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **endpoint** + + * 描述:OTS Server的EndPoint(服务地址),例如http://bazhen.cn−hangzhou.ots.aliyuncs.com。 + + * 必选:是
+ + * 默认值:无
+ +* **accessId** + + * 描述:OTS的accessId
+ + * 必选:是
+ + * 默认值:无
+ +* **accessKey** + + * 描述:OTS的accessKey
+ + * 必选:是
+ + * 默认值:无
+ +* **instanceName** + + * 描述:OTS的实例名称,实例是用户使用和管理 OTS 服务的实体,用户在开通 OTS 服务之后,需要通过管理控制台来创建实例,然后在实例内进行表的创建和管理。实例是 OTS 资源管理的基础单元,OTS 对应用程序的访问控制和资源计量都在实例级别完成。
+ + * 必选:是
+ + * 默认值:无
+ + +* **table** + + * 描述:所选取的需要抽取的表名称,这里有且只能填写一张表。在OTS不存在多表同步的需求。
+ + * 必选:是
+ + * 默认值:无
+ +* **primaryKey** + + * 描述: OTS的主键信息,使用JSON的数组描述字段信息。OTS本身是NoSQL系统,在OTSWriter导入数据过程中,必须指定相应地字段名称。 + + OTS的PrimaryKey只能支持STRING,INT两种类型,因此OTSWriter本身也限定填写上述两种类型。 + + DataX本身支持类型转换的,因此对于源头数据非String/Int,OTSWriter会进行数据类型转换。 + + 配置实例: + + ```json + "primaryKey" : [ + {"name":"pk1", "type":"string"}, + {"name":"pk2", "type":"int"} + ], + ``` + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。使用格式为 + + ```json + {"name":"col2", "type":"INT"}, + ``` + + 其中的name指定写入的OTS列名,type指定写入的类型。OTS类型支持STRING,INT,DOUBLE,BOOL和BINARY几种类型 。 + + 写入过程不支持常量、函数或者自定义表达式。 + + * 必选:是
+ + * 默认值:无
+ +* **writeMode** + + * 描述:写入模式,目前支持两种模式, + + * PutRow,对应于OTS API PutRow,插入数据到指定的行,如果该行不存在,则新增一行;若该行存在,则覆盖原有行。 + + * UpdateRow,对应于OTS API UpdateRow,更新指定行的数据,如果该行不存在,则新增一行;若该行存在,则根据请求的内容在这一行中新增、修改或者删除指定列的值。 + + * DeleteRow,对应于OTS API DeleteRow,删除指定行的数据。 + + * 必选:是
+ + * 默认值:无
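+下面给出一个仅作示意的最小 writer 配置片段(writeMode 取 UpdateRow;其中 endpoint、实例名、表名、主键列与属性列均为虚构示例,请按实际实例与表结构替换),用于说明 3.2 中各参数如何组合:
+
+```json
+"writer": {
+    "name": "otswriter",
+    "parameter": {
+        "endpoint": "http://your-instance.cn-hangzhou.ots.aliyuncs.com",
+        "accessId": "xxx",
+        "accessKey": "xxx",
+        "instanceName": "your-instance",
+        "table": "target_table",
+        "primaryKey": [
+            {"name": "pk1", "type": "string"},
+            {"name": "pk2", "type": "int"}
+        ],
+        "column": [
+            {"name": "col1", "type": "STRING"},
+            {"name": "col2", "type": "INT"}
+        ],
+        "writeMode": "UpdateRow"
+    }
+}
+```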
+ + +### 3.3 类型转换 + +目前OTSWriter支持所有OTS类型,下面列出OTSWriter针对OTS类型转换列表: + + +| DataX 内部类型| OTS 数据类型 | +| -------- | ----- | +| Long |Integer | +| Double |Double| +| String |String| +| Boolean |Boolean| +| Bytes |Binary | + +* 注意,OTS本身不支持日期型类型。应用层一般使用Long报错时间的Unix TimeStamp。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 + +2列PK(10 + 8),15列String(10 Byte), 2两列Integer(8 Byte),算上Column Name每行大概327Byte,每次BatchWriteRow写入100行数据,所以当个请求的数据大小是32KB。 + +#### 4.1.2 机器参数 + +OTS端:3台前端机,5台后端机 + +DataX运行端: 24核CPU, 98GB内存 + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + +|并发数|DataX CPU|DATAX流量 |OTS 流量 | BatchWrite前端QPS| BatchWriteRow前端延时| +|--------|--------| --------|--------|--------|------| +|40| 1027% |Speed 22.13MB/s, 112640 records/s|65.8M/s |42|153ms | +|50| 1218% |Speed 24.11MB/s, 122700 records/s|73.5M/s |47|174ms| +|60| 1355% |Speed 25.31MB/s, 128854 records/s|78.1M/s |50|190ms| +|70| 1578% |Speed 26.35MB/s, 134121 records/s|80.8M/s |52|210ms| +|80| 1771% |Speed 26.55MB/s, 135161 records/s|82.7M/s |53|230ms| + + + + +## 5 约束限制 + +### 5.1 写入幂等性 + +OTS写入本身是支持幂等性的,也就是使用OTS SDK同一条数据写入OTS系统,一次和多次请求的结果可以理解为一致的。因此对于OTSWriter多次尝试写入同一条数据与写入一条数据结果是等同的。 + +### 5.2 单任务FailOver + +由于OTS写入本身是幂等性的,因此可以支持单任务FailOver。即一旦写入Fail,DataX会重新启动相关子任务进行重试。 + +## 6 FAQ diff --git a/otswriter/pom.xml b/otswriter/pom.xml new file mode 100644 index 0000000000..018e011fe6 --- /dev/null +++ b/otswriter/pom.xml @@ -0,0 +1,88 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + otswriter + otswriter + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.aliyun.openservices + ots-public + 2.2.4 + + + com.google.code.gson + gson + 2.2.4 + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + org.apache.maven.plugins + maven-surefire-plugin + 2.5 + + + **/unittest/*.java + **/functiontest/*.java + + + + + + diff --git a/otswriter/src/main/assembly/package.xml b/otswriter/src/main/assembly/package.xml new file mode 100644 index 0000000000..5ae7a01511 --- /dev/null +++ b/otswriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/otswriter + + + target/ + + otswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/otswriter + + + + + + false + plugin/writer/otswriter/libs + runtime + + + diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/Key.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/Key.java new file mode 100644 index 0000000000..0724b9cf6f --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/Key.java @@ -0,0 +1,36 @@ +/** + * (C) 2010-2014 Alibaba Group Holding Limited. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.alibaba.datax.plugin.writer.otswriter; + +public final class Key { + + public final static String OTS_ENDPOINT = "endpoint"; + + public final static String OTS_ACCESSID = "accessId"; + + public final static String OTS_ACCESSKEY = "accessKey"; + + public final static String OTS_INSTANCE_NAME = "instanceName"; + + public final static String TABLE_NAME = "table"; + + public final static String PRIMARY_KEY = "primaryKey"; + + public final static String COLUMN = "column"; + + public final static String WRITE_MODE = "writeMode"; +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriter.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriter.java new file mode 100644 index 0000000000..4d2ed17b3f --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriter.java @@ -0,0 +1,92 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.utils.Common; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; + +public class OtsWriter { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + private OtsWriterMasterProxy proxy = new OtsWriterMasterProxy(); + + @Override + public void init() { + LOG.info("init() begin ..."); + try { + this.proxy.init(getPluginJobConf()); + } catch (OTSException e) { + LOG.error("OTSException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("init() end ..."); + } + + @Override + public void destroy() { + this.proxy.close(); + } + + @Override + public List split(int mandatoryNumber) { + try { + return this.proxy.split(mandatoryNumber); + } catch (Exception e) { + LOG.error("Exception. 
ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.ERROR, Common.getDetailMessage(e), e); + } + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private OtsWriterSlaveProxy proxy = new OtsWriterSlaveProxy(); + + @Override + public void init() {} + + @Override + public void destroy() { + this.proxy.close(); + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + LOG.info("startWrite() begin ..."); + try { + this.proxy.init(this.getPluginJobConf()); + this.proxy.write(lineReceiver, this.getTaskPluginCollector()); + } catch (OTSException e) { + LOG.error("OTSException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (ClientException e) { + LOG.error("ClientException: {}", e.getMessage(), e); + throw DataXException.asDataXException(new OtsWriterError(e.getErrorCode(), "OTS端的错误"), Common.getDetailMessage(e), e); + } catch (IllegalArgumentException e) { + LOG.error("IllegalArgumentException. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.INVALID_PARAM, Common.getDetailMessage(e), e); + } catch (Exception e) { + LOG.error("Exception. ErrorMsg:{}", e.getMessage(), e); + throw DataXException.asDataXException(OtsWriterError.ERROR, Common.getDetailMessage(e), e); + } + LOG.info("startWrite() end ..."); + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterError.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterError.java new file mode 100644 index 0000000000..67d1ee2b77 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterError.java @@ -0,0 +1,46 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public class OtsWriterError implements ErrorCode { + + private String code; + + private String description; + + // TODO + // 这一块需要DATAX来统一定义分类, OTS基于这些分类在细化 + // 所以暂定两个基础的Error Code,其他错误统一使用OTS的错误码和错误消息 + + public final static OtsWriterError ERROR = new OtsWriterError( + "OtsWriterError", + "该错误表示插件的内部错误,表示系统没有处理到的异常"); + public final static OtsWriterError INVALID_PARAM = new OtsWriterError( + "OtsWriterInvalidParameter", + "该错误表示参数错误,表示用户输入了错误的参数格式等"); + + public OtsWriterError (String code) { + this.code = code; + this.description = code; + } + + public OtsWriterError (String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return this.code; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterMasterProxy.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterMasterProxy.java new file mode 100644 index 0000000000..91cf9b120f --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterMasterProxy.java @@ -0,0 +1,110 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import java.util.ArrayList; +import java.util.List; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.callable.GetTableMetaCallable; 
+import com.alibaba.datax.plugin.writer.otswriter.model.OTSConf; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConf.RestrictConf; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConst; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSOpType; +import com.alibaba.datax.plugin.writer.otswriter.utils.GsonParser; +import com.alibaba.datax.plugin.writer.otswriter.utils.ParamChecker; +import com.alibaba.datax.plugin.writer.otswriter.utils.RetryHelper; +import com.alibaba.datax.plugin.writer.otswriter.utils.WriterModelParser; +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.TableMeta; + +public class OtsWriterMasterProxy { + + private OTSConf conf = new OTSConf(); + + private OTSClient ots = null; + + private TableMeta meta = null; + + private static final Logger LOG = LoggerFactory.getLogger(OtsWriterMasterProxy.class); + + /** + * @param param + * @throws Exception + */ + public void init(Configuration param) throws Exception { + + // 默认参数 + conf.setRetry(param.getInt(OTSConst.RETRY, 18)); + conf.setSleepInMillisecond(param.getInt(OTSConst.SLEEP_IN_MILLISECOND, 100)); + conf.setBatchWriteCount(param.getInt(OTSConst.BATCH_WRITE_COUNT, 100)); + conf.setConcurrencyWrite(param.getInt(OTSConst.CONCURRENCY_WRITE, 5)); + conf.setIoThreadCount(param.getInt(OTSConst.IO_THREAD_COUNT, 1)); + conf.setSocketTimeout(param.getInt(OTSConst.SOCKET_TIMEOUT, 20000)); + conf.setConnectTimeout(param.getInt(OTSConst.CONNECT_TIMEOUT, 10000)); + conf.setBufferSize(param.getInt(OTSConst.BUFFER_SIZE, 1024)); + + RestrictConf restrictConf = conf.new RestrictConf(); + restrictConf.setRequestTotalSizeLimition(param.getInt(OTSConst.REQUEST_TOTAL_SIZE_LIMITATION, 1024 * 1024)); + restrictConf.setAttributeColumnSize(param.getInt(OTSConst.ATTRIBUTE_COLUMN_SIZE_LIMITATION, 2 * 1024 * 1024)); + restrictConf.setPrimaryKeyColumnSize(param.getInt(OTSConst.PRIMARY_KEY_COLUMN_SIZE_LIMITATION, 1024)); + restrictConf.setMaxColumnsCount(param.getInt(OTSConst.ATTRIBUTE_COLUMN_MAX_COUNT, 1024)); + conf.setRestrictConf(restrictConf); + + // 必选参数 + conf.setEndpoint(ParamChecker.checkStringAndGet(param, Key.OTS_ENDPOINT)); + conf.setAccessId(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSID)); + conf.setAccessKey(ParamChecker.checkStringAndGet(param, Key.OTS_ACCESSKEY)); + conf.setInstanceName(ParamChecker.checkStringAndGet(param, Key.OTS_INSTANCE_NAME)); + conf.setTableName(ParamChecker.checkStringAndGet(param, Key.TABLE_NAME)); + + conf.setOperation(WriterModelParser.parseOTSOpType(ParamChecker.checkStringAndGet(param, Key.WRITE_MODE))); + + ots = new OTSClient( + this.conf.getEndpoint(), + this.conf.getAccessId(), + this.conf.getAccessKey(), + this.conf.getInstanceName()); + + meta = getTableMeta(ots, conf.getTableName()); + LOG.info("Table Meta : {}", GsonParser.metaToJson(meta)); + + conf.setPrimaryKeyColumn(WriterModelParser.parseOTSPKColumnList(ParamChecker.checkListAndGet(param, Key.PRIMARY_KEY, true))); + ParamChecker.checkPrimaryKey(meta, conf.getPrimaryKeyColumn()); + + conf.setAttributeColumn(WriterModelParser.parseOTSAttrColumnList(ParamChecker.checkListAndGet(param, Key.COLUMN, conf.getOperation() == OTSOpType.UPDATE_ROW ? 
true : false))); + ParamChecker.checkAttribute(conf.getAttributeColumn()); + } + + public List split(int mandatoryNumber){ + LOG.info("Begin split and MandatoryNumber : {}", mandatoryNumber); + List configurations = new ArrayList(); + for (int i = 0; i < mandatoryNumber; i++) { + Configuration configuration = Configuration.newDefault(); + configuration.set(OTSConst.OTS_CONF, GsonParser.confToJson(this.conf)); + configurations.add(configuration); + } + LOG.info("End split."); + assert(mandatoryNumber == configurations.size()); + return configurations; + } + + public void close() { + ots.shutdown(); + } + + public OTSConf getOTSConf() { + return conf; + } + + // private function + + private TableMeta getTableMeta(OTSClient ots, String tableName) throws Exception { + return RetryHelper.executeWithRetry( + new GetTableMetaCallable(ots, tableName), + conf.getRetry(), + conf.getSleepInMillisecond() + ); + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterSlaveProxy.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterSlaveProxy.java new file mode 100644 index 0000000000..762edfb4d6 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/OtsWriterSlaveProxy.java @@ -0,0 +1,126 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import com.alibaba.datax.plugin.writer.otswriter.model.*; +import com.alibaba.datax.plugin.writer.otswriter.utils.Common; +import com.aliyun.openservices.ots.*; +import com.aliyun.openservices.ots.internal.OTSCallback; +import com.aliyun.openservices.ots.internal.writer.WriterConfig; +import com.aliyun.openservices.ots.model.*; +import org.apache.commons.math3.util.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.utils.GsonParser; + +import java.util.List; +import java.util.concurrent.Executors; + + +public class OtsWriterSlaveProxy { + + private static final Logger LOG = LoggerFactory.getLogger(OtsWriterSlaveProxy.class); + private OTSConf conf; + private OTSAsync otsAsync; + private OTSWriter otsWriter; + + private class WriterCallback implements OTSCallback { + + private TaskPluginCollector collector; + public WriterCallback(TaskPluginCollector collector) { + this.collector = collector; + } + + @Override + public void onCompleted(OTSContext otsContext) { + LOG.debug("Write row succeed. 
PrimaryKey: {}.", otsContext.getOTSRequest().getRowPrimaryKey()); + } + + @Override + public void onFailed(OTSContext otsContext, OTSException ex) { + LOG.error("Write row failed.", ex); + WithRecord withRecord = (WithRecord)otsContext.getOTSRequest(); + collector.collectDirtyRecord(withRecord.getRecord(), ex); + } + + @Override + public void onFailed(OTSContext otsContext, ClientException ex) { + LOG.error("Write row failed.", ex); + WithRecord withRecord = (WithRecord)otsContext.getOTSRequest(); + collector.collectDirtyRecord(withRecord.getRecord(), ex); + } + } + + public void init(Configuration configuration) { + conf = GsonParser.jsonToConf(configuration.getString(OTSConst.OTS_CONF)); + + ClientConfiguration clientConfigure = new ClientConfiguration(); + clientConfigure.setIoThreadCount(conf.getIoThreadCount()); + clientConfigure.setMaxConnections(conf.getConcurrencyWrite()); + clientConfigure.setSocketTimeoutInMillisecond(conf.getSocketTimeout()); + clientConfigure.setConnectionTimeoutInMillisecond(conf.getConnectTimeout()); + + OTSServiceConfiguration otsConfigure = new OTSServiceConfiguration(); + otsConfigure.setRetryStrategy(new WriterRetryPolicy(conf)); + + otsAsync = new OTSClientAsync( + conf.getEndpoint(), + conf.getAccessId(), + conf.getAccessKey(), + conf.getInstanceName(), + clientConfigure, + otsConfigure); + } + + public void close() { + otsAsync.shutdown(); + } + + public void write(RecordReceiver recordReceiver, TaskPluginCollector collector) throws Exception { + LOG.info("Writer slave started."); + + WriterConfig writerConfig = new WriterConfig(); + writerConfig.setConcurrency(conf.getConcurrencyWrite()); + writerConfig.setMaxBatchRowsCount(conf.getBatchWriteCount()); + writerConfig.setMaxBatchSize(conf.getRestrictConf().getRequestTotalSizeLimition()); + writerConfig.setBufferSize(conf.getBufferSize()); + writerConfig.setMaxAttrColumnSize(conf.getRestrictConf().getAttributeColumnSize()); + writerConfig.setMaxColumnsCount(conf.getRestrictConf().getMaxColumnsCount()); + writerConfig.setMaxPKColumnSize(conf.getRestrictConf().getPrimaryKeyColumnSize()); + otsWriter = new DefaultOTSWriter(otsAsync, conf.getTableName(), writerConfig, new WriterCallback(collector), Executors.newFixedThreadPool(3)); + + int expectColumnCount = conf.getPrimaryKeyColumn().size() + conf.getAttributeColumn().size(); + Record record; + while ((record = recordReceiver.getFromReader()) != null) { + LOG.debug("Record Raw: {}", record.toString()); + + int columnCount = record.getColumnNumber(); + if (columnCount != expectColumnCount) { + // 如果Column的个数和预期的个数不一致时,认为是系统故障或者用户配置Column错误,异常退出 + throw new IllegalArgumentException(String.format(OTSErrorMessage.RECORD_AND_COLUMN_SIZE_ERROR, columnCount, expectColumnCount)); + } + + // 类型转换 + try { + RowPrimaryKey primaryKey = Common.getPKFromRecord(conf.getPrimaryKeyColumn(), record); + List> attributes = Common.getAttrFromRecord(conf.getPrimaryKeyColumn().size(), conf.getAttributeColumn(), record); + RowChange rowChange = Common.columnValuesToRowChange(conf.getTableName(), conf.getOperation(), primaryKey, attributes); + WithRecord withRecord = (WithRecord)rowChange; + withRecord.setRecord(record); + otsWriter.addRowChange(rowChange); + } catch (IllegalArgumentException e) { + LOG.warn("Found dirty data.", e); + collector.collectDirtyRecord(record, e.getMessage()); + } catch (ClientException e) { + LOG.warn("Found dirty data.", e); + collector.collectDirtyRecord(record, e.getMessage()); + } + } + + otsWriter.close(); + LOG.info("Writer slave finished."); + } 
+} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/WriterRetryPolicy.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/WriterRetryPolicy.java new file mode 100644 index 0000000000..3aa61a6834 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/WriterRetryPolicy.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.plugin.writer.otswriter; + +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConf; +import com.aliyun.openservices.ots.internal.OTSRetryStrategy; + +public class WriterRetryPolicy implements OTSRetryStrategy { + OTSConf conf; + + public WriterRetryPolicy(OTSConf conf) { + this.conf = conf; + } + + @Override + public boolean shouldRetry(String action, Exception ex, int retries) { + return retries <= conf.getRetry(); + } + + @Override + public long getPauseDelay(String action, Exception ex, int retries) { + if (retries <= 0) { + return 0; + } + + int sleepTime = conf.getSleepInMillisecond() * retries; + return sleepTime > 30000 ? 30000 : sleepTime; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/GetTableMetaCallable.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/GetTableMetaCallable.java new file mode 100644 index 0000000000..d4128e14ce --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/callable/GetTableMetaCallable.java @@ -0,0 +1,29 @@ +package com.alibaba.datax.plugin.writer.otswriter.callable; + +import java.util.concurrent.Callable; + +import com.aliyun.openservices.ots.OTSClient; +import com.aliyun.openservices.ots.model.DescribeTableRequest; +import com.aliyun.openservices.ots.model.DescribeTableResult; +import com.aliyun.openservices.ots.model.TableMeta; + +public class GetTableMetaCallable implements Callable{ + + private OTSClient ots = null; + private String tableName = null; + + public GetTableMetaCallable(OTSClient ots, String tableName) { + this.ots = ots; + this.tableName = tableName; + } + + @Override + public TableMeta call() throws Exception { + DescribeTableRequest describeTableRequest = new DescribeTableRequest(); + describeTableRequest.setTableName(tableName); + DescribeTableResult result = ots.describeTable(describeTableRequest); + TableMeta tableMeta = result.getTableMeta(); + return tableMeta; + } + +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/LogExceptionManager.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/LogExceptionManager.java new file mode 100644 index 0000000000..93175ddb18 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/LogExceptionManager.java @@ -0,0 +1,58 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.aliyun.openservices.ots.OTSErrorCode; +import com.aliyun.openservices.ots.OTSException; + +/** + * 添加这个类的主要目的是为了解决当用户遇到CU不够时,打印大量的日志 + * @author redchen + * + */ +public class LogExceptionManager { + + private long count = 0; + private long updateTimestamp = 0; + + private static final Logger LOG = LoggerFactory.getLogger(LogExceptionManager.class); + + private synchronized void countAndReset() { + count++; + long cur = System.currentTimeMillis(); + long interval = cur - updateTimestamp; + if (interval >= 10000) { + LOG.warn("Call callable fail, OTSNotEnoughCapacityUnit, total times:"+ count +", time range:"+ (interval/1000) +"s, times per 
second:" + ((float)count / (interval/1000))); + count = 0; + updateTimestamp = cur; + } + } + + public synchronized void addException(Exception exception) { + if (exception instanceof OTSException) { + OTSException e = (OTSException)exception; + if (e.getErrorCode().equals(OTSErrorCode.NOT_ENOUGH_CAPACITY_UNIT)) { + countAndReset(); + } else { + LOG.warn( + "Call callable fail, OTSException:ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{e.getErrorCode(), e.getMessage(), e.getRequestId()} + ); + } + } else { + LOG.warn("Call callable fail, {}", exception.getMessage()); + } + } + + public synchronized void addException(com.aliyun.openservices.ots.model.Error error, String requestId) { + if (error.getCode().equals(OTSErrorCode.NOT_ENOUGH_CAPACITY_UNIT)) { + countAndReset(); + } else { + LOG.warn( + "OTSException:ErrorCode:{}, ErrorMsg:{}, RequestId:{}", + new Object[]{error.getCode(), error.getMessage(), requestId} + ); + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSAttrColumn.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSAttrColumn.java new file mode 100644 index 0000000000..d37960e000 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSAttrColumn.java @@ -0,0 +1,21 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.aliyun.openservices.ots.model.ColumnType; + +public class OTSAttrColumn { + private String name; + private ColumnType type; + + public OTSAttrColumn(String name, ColumnType type) { + this.name = name; + this.type = type; + } + + public String getName() { + return name; + } + + public ColumnType getType() { + return type; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConf.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConf.java new file mode 100644 index 0000000000..bd7eccc5a4 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConf.java @@ -0,0 +1,172 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.List; + +public class OTSConf { + private String endpoint; + private String accessId; + private String accessKey; + private String instanceName; + private String tableName; + + private List primaryKeyColumn; + private List attributeColumn; + + private int bufferSize = 1024; + private int retry = 18; + private int sleepInMillisecond = 100; + private int batchWriteCount = 10; + private int concurrencyWrite = 5; + private int ioThreadCount = 1; + private int socketTimeout = 20000; + private int connectTimeout = 10000; + + private OTSOpType operation; + private RestrictConf restrictConf; + + //限制项 + public class RestrictConf { + private int requestTotalSizeLimition = 1024 * 1024; + private int primaryKeyColumnSize = 1024; + private int attributeColumnSize = 2 * 1024 * 1024; + private int maxColumnsCount = 1024; + + public int getRequestTotalSizeLimition() { + return requestTotalSizeLimition; + } + public void setRequestTotalSizeLimition(int requestTotalSizeLimition) { + this.requestTotalSizeLimition = requestTotalSizeLimition; + } + + public void setPrimaryKeyColumnSize(int primaryKeyColumnSize) { + this.primaryKeyColumnSize = primaryKeyColumnSize; + } + + public void setAttributeColumnSize(int attributeColumnSize) { + this.attributeColumnSize = attributeColumnSize; + } + + public void setMaxColumnsCount(int maxColumnsCount) { + this.maxColumnsCount = maxColumnsCount; + 
} + + public int getAttributeColumnSize() { + return attributeColumnSize; + } + + public int getMaxColumnsCount() { + return maxColumnsCount; + } + + public int getPrimaryKeyColumnSize() { + return primaryKeyColumnSize; + } + } + + public RestrictConf getRestrictConf() { + return restrictConf; + } + public void setRestrictConf(RestrictConf restrictConf) { + this.restrictConf = restrictConf; + } + public OTSOpType getOperation() { + return operation; + } + public void setOperation(OTSOpType operation) { + this.operation = operation; + } + public List getPrimaryKeyColumn() { + return primaryKeyColumn; + } + public void setPrimaryKeyColumn(List primaryKeyColumn) { + this.primaryKeyColumn = primaryKeyColumn; + } + + public int getConcurrencyWrite() { + return concurrencyWrite; + } + public void setConcurrencyWrite(int concurrencyWrite) { + this.concurrencyWrite = concurrencyWrite; + } + public int getBatchWriteCount() { + return batchWriteCount; + } + public void setBatchWriteCount(int batchWriteCount) { + this.batchWriteCount = batchWriteCount; + } + public String getEndpoint() { + return endpoint; + } + public void setEndpoint(String endpoint) { + this.endpoint = endpoint; + } + public String getAccessId() { + return accessId; + } + public void setAccessId(String accessId) { + this.accessId = accessId; + } + public String getAccessKey() { + return accessKey; + } + public void setAccessKey(String accessKey) { + this.accessKey = accessKey; + } + public String getInstanceName() { + return instanceName; + } + public void setInstanceName(String instanceName) { + this.instanceName = instanceName; + } + public String getTableName() { + return tableName; + } + public void setTableName(String tableName) { + this.tableName = tableName; + } + public List getAttributeColumn() { + return attributeColumn; + } + public void setAttributeColumn(List attributeColumn) { + this.attributeColumn = attributeColumn; + } + public int getRetry() { + return retry; + } + public void setRetry(int retry) { + this.retry = retry; + } + public int getSleepInMillisecond() { + return sleepInMillisecond; + } + public void setSleepInMillisecond(int sleepInMillisecond) { + this.sleepInMillisecond = sleepInMillisecond; + } + public int getIoThreadCount() { + return ioThreadCount; + } + public void setIoThreadCount(int ioThreadCount) { + this.ioThreadCount = ioThreadCount; + } + public int getSocketTimeout() { + return socketTimeout; + } + public void setSocketTimeout(int socketTimeout) { + this.socketTimeout = socketTimeout; + } + public int getConnectTimeout() { + return connectTimeout; + } + + public int getBufferSize() { + return bufferSize; + } + + public void setBufferSize(int bufferSize) { + this.bufferSize = bufferSize; + } + + public void setConnectTimeout(int connectTimeout) { + this.connectTimeout = connectTimeout; + } +} \ No newline at end of file diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConst.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConst.java new file mode 100644 index 0000000000..1b8f805374 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSConst.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +public class OTSConst { + // Reader support type + public final static String TYPE_STRING = "STRING"; + public final static String TYPE_INTEGER = "INT"; + public final static String TYPE_DOUBLE = "DOUBLE"; + public final static String TYPE_BOOLEAN = "BOOL"; + 
public final static String TYPE_BINARY = "BINARY"; + + // Column + public final static String NAME = "name"; + public final static String TYPE = "type"; + + public final static String OTS_CONF = "OTS_CONF"; + + public final static String OTS_OP_TYPE_PUT = "PutRow"; + public final static String OTS_OP_TYPE_UPDATE = "UpdateRow"; + public final static String OTS_OP_TYPE_DELETE = "DeleteRow"; + + // options + public final static String RETRY = "maxRetryTime"; + public final static String SLEEP_IN_MILLISECOND = "retrySleepInMillisecond"; + public final static String BATCH_WRITE_COUNT = "batchWriteCount"; + public final static String CONCURRENCY_WRITE = "concurrencyWrite"; + public final static String IO_THREAD_COUNT = "ioThreadCount"; + public final static String SOCKET_TIMEOUT = "socketTimeoutInMillisecond"; + public final static String CONNECT_TIMEOUT = "connectTimeoutInMillisecond"; + public final static String BUFFER_SIZE = "bufferSize"; + + // 限制项 + public final static String REQUEST_TOTAL_SIZE_LIMITATION = "requestTotalSizeLimitation"; + public final static String ATTRIBUTE_COLUMN_SIZE_LIMITATION = "attributeColumnSizeLimitation"; + public final static String PRIMARY_KEY_COLUMN_SIZE_LIMITATION = "primaryKeyColumnSizeLimitation"; + public final static String ATTRIBUTE_COLUMN_MAX_COUNT = "attributeColumnMaxCount"; +} \ No newline at end of file diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSErrorMessage.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSErrorMessage.java new file mode 100644 index 0000000000..9523342fa4 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSErrorMessage.java @@ -0,0 +1,66 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +public class OTSErrorMessage { + + public static final String OPERATION_PARSE_ERROR = "The 'writeMode' only support 'PutRow', 'UpdateRow' or 'DeleteRow', not '%s'."; + + public static final String UNSUPPORT_PARSE = "Unsupport parse '%s' to '%s'."; + + public static final String RECORD_AND_COLUMN_SIZE_ERROR = "Size of record not equal size of config column. 
record size : %d, config column size : %d.";
+
+    public static final String PK_TYPE_ERROR = "The primary key type only supports 'string' and 'int', not '%s'.";
+
+    public static final String ATTR_TYPE_ERROR = "The column type only supports 'string', 'int', 'double', 'bool' and 'binary', not '%s'.";
+
+    public static final String PK_COLUMN_MISSING_ERROR = "The column '%s' is missing in 'primaryKey'.";
+
+    public static final String INPUT_PK_COUNT_NOT_EQUAL_META_ERROR = "The count of 'primaryKey' does not match the table meta, input count : %d, primary key count in meta : %d.";
+
+    public static final String INPUT_PK_TYPE_NOT_MATCH_META_ERROR = "The type of 'primaryKey' does not match the table meta, column name : %s, input type : %s, primary key type in meta : %s.";
+
+    public static final String ATTR_REPEAT_COLUMN_ERROR = "Repeated column '%s' in 'column'.";
+
+    public static final String MISSING_PARAMTER_ERROR = "The param '%s' does not exist.";
+
+    public static final String PARAMTER_STRING_IS_EMPTY_ERROR = "The length of param '%s' is zero.";
+
+    public static final String PARAMETER_LIST_IS_EMPTY_ERROR = "The param '%s' is an empty json array.";
+
+    public static final String PARAMETER_IS_NOT_ARRAY_ERROR = "The param '%s' is not a json array.";
+
+    public static final String PARAMETER_IS_NOT_MAP_ERROR = "The param '%s' is not a json map.";
+
+    public static final String PARSE_TO_LIST_ERROR = "Can not parse '%s' to a list.";
+
+    public static final String PK_MAP_NAME_TYPE_ERROR = "The 'name' and 'type' only support string values in the json map of 'primaryKey'.";
+
+    public static final String ATTR_MAP_NAME_TYPE_ERROR = "The 'name' and 'type' only support string values in the json map of 'column'.";
+
+    public static final String PK_MAP_INCLUDE_NAME_TYPE_ERROR = "Only the 'name' and 'type' fields are supported in the json map of 'primaryKey'.";
+
+    public static final String ATTR_MAP_INCLUDE_NAME_TYPE_ERROR = "Only the 'name' and 'type' fields are supported in the json map of 'column'.";
+
+    public static final String PK_ITEM_IS_NOT_MAP_ERROR = "The item in 'primaryKey' is not a map.";
+
+    public static final String ATTR_ITEM_IS_NOT_MAP_ERROR = "The item in 'column' is not a map.";
+
+    public static final String PK_COLUMN_NAME_IS_EMPTY_ERROR = "The name of an item in 'primaryKey' can not be an empty string.";
+
+    public static final String ATTR_COLUMN_NAME_IS_EMPTY_ERROR = "The name of an item in 'column' can not be an empty string.";
+
+    public static final String MULTI_ATTR_COLUMN_ERROR = "Duplicate item in 'column', column name : %s .";
+
+    public static final String COLUMN_CONVERSION_ERROR = "Column conversion error, src type : %s, src value : %s, expect type : %s .";
+
+    public static final String PK_COLUMN_VALUE_IS_NULL_ERROR = "The column of the record is NULL, primary key name : %s .";
+
+    public static final String PK_STRONG_LENGTH_ERROR = "The length of the pk string value exceeds the configured limit, conf : %d, input : %d .";
+
+    public static final String ATTR_STRING_LENGTH_ERROR = "The length of the attr string value exceeds the configured limit, conf : %d, input : %d .";
+
+    public static final String BINARY_LENGTH_ERROR = "The length of the binary value exceeds the configured limit, conf : %d, input : %d .";
+
+    public static final String LINE_LENGTH_ERROR = "The length of the row exceeds the configured request size limit, conf : %d, row : %d .";
+
+    public static final String INSERT_TASK_ERROR = "Can not execute the task, because the ExecutorService is shut down.";
+}
diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSOpType.java
b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSOpType.java new file mode 100644 index 0000000000..17b650331e --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSOpType.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +public enum OTSOpType { + PUT_ROW, + UPDATE_ROW, + DELETE_ROW +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSPKColumn.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSPKColumn.java new file mode 100644 index 0000000000..c873cb9637 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSPKColumn.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.aliyun.openservices.ots.model.PrimaryKeyType; + +public class OTSPKColumn { + private String name; + private PrimaryKeyType type; + + public OTSPKColumn(String name, PrimaryKeyType type) { + this.name = name; + this.type = type; + } + + public PrimaryKeyType getType() { + return type; + } + + public String getName() { + return name; + } + +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSRowPrimaryKey.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSRowPrimaryKey.java new file mode 100644 index 0000000000..d89d501779 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/OTSRowPrimaryKey.java @@ -0,0 +1,61 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import java.util.Map; +import java.util.Map.Entry; + +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + +public class OTSRowPrimaryKey { + + private Map columns; + + public OTSRowPrimaryKey(Map columns) { + if (null == columns) { + throw new IllegalArgumentException("Input columns can not be null."); + } + this.columns = columns; + } + + public Map getColumns() { + return columns; + } + + @Override + public int hashCode() { + int result = 31; + for (Entry entry : columns.entrySet()) { + result = result ^ entry.getKey().hashCode() ^ entry.getValue().hashCode(); + } + return result; + } + + @Override + public boolean equals(Object obj) { + if (this == obj) { + return true; + } + if (obj == null) { + return false; + } + if (!(obj instanceof OTSRowPrimaryKey)) { + return false; + } + OTSRowPrimaryKey other = (OTSRowPrimaryKey) obj; + + if (columns.size() != other.columns.size()) { + return false; + } + + for (Entry entry : columns.entrySet()) { + PrimaryKeyValue otherValue = other.columns.get(entry.getKey()); + + if (otherValue == null) { + return false; + } + if (!otherValue.equals(entry.getValue())) { + return false; + } + } + return true; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowDeleteChangeWithRecord.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowDeleteChangeWithRecord.java new file mode 100644 index 0000000000..5d77ad8792 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowDeleteChangeWithRecord.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.alibaba.datax.common.element.Record; + +public class RowDeleteChangeWithRecord extends com.aliyun.openservices.ots.model.RowDeleteChange implements WithRecord { + + private Record record; + + public RowDeleteChangeWithRecord(String tableName) { + super(tableName); + } + + 
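+    // The *WithRecord change classes let the original DataX Record travel with the OTS RowChange:
+    // when a write fails, WriterCallback casts the request back to WithRecord and reports the
+    // attached Record to TaskPluginCollector as dirty data.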
@Override + public Record getRecord() { + return record; + } + + @Override + public void setRecord(Record record) { + this.record = record; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowPutChangeWithRecord.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowPutChangeWithRecord.java new file mode 100644 index 0000000000..e97a7d63c0 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowPutChangeWithRecord.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.alibaba.datax.common.element.Record; + +public class RowPutChangeWithRecord extends com.aliyun.openservices.ots.model.RowPutChange implements WithRecord { + + private Record record; + + public RowPutChangeWithRecord(String tableName) { + super(tableName); + } + + @Override + public Record getRecord() { + return record; + } + + @Override + public void setRecord(Record record) { + this.record = record; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowUpdateChangeWithRecord.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowUpdateChangeWithRecord.java new file mode 100644 index 0000000000..f47ca1d294 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/RowUpdateChangeWithRecord.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.alibaba.datax.common.element.Record; + +public class RowUpdateChangeWithRecord extends com.aliyun.openservices.ots.model.RowUpdateChange implements WithRecord { + + private Record record; + + public RowUpdateChangeWithRecord(String tableName) { + super(tableName); + } + + @Override + public Record getRecord() { + return record; + } + + @Override + public void setRecord(Record record) { + this.record = record; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/WithRecord.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/WithRecord.java new file mode 100644 index 0000000000..2e1672a7d3 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/model/WithRecord.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.writer.otswriter.model; + +import com.alibaba.datax.common.element.Record; + +public interface WithRecord { + Record getRecord(); + + void setRecord(Record record); +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ColumnConversion.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ColumnConversion.java new file mode 100644 index 0000000000..51162b8452 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ColumnConversion.java @@ -0,0 +1,61 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSAttrColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSPKColumn; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; + + +/** + * 备注:datax提供的转换机制有如下限制,如下规则是不能转换的 + * 1. bool -> binary + * 2. binary -> long, double, bool + * 3. double -> bool, binary + * 4. 
long -> binary + */ +public class ColumnConversion { + public static PrimaryKeyValue columnToPrimaryKeyValue(Column c, OTSPKColumn col) { + try { + switch (col.getType()) { + case STRING: + return PrimaryKeyValue.fromString(c.asString()); + case INTEGER: + return PrimaryKeyValue.fromLong(c.asLong()); + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, col.getType(), "PrimaryKeyValue")); + } + } catch (DataXException e) { + throw new IllegalArgumentException(String.format( + OTSErrorMessage.COLUMN_CONVERSION_ERROR, + c.getType(), c.asString(), col.getType().toString() + )); + } + } + + public static ColumnValue columnToColumnValue(Column c, OTSAttrColumn col) { + try { + switch (col.getType()) { + case STRING: + return ColumnValue.fromString(c.asString()); + case INTEGER: + return ColumnValue.fromLong(c.asLong()); + case BOOLEAN: + return ColumnValue.fromBoolean(c.asBoolean()); + case DOUBLE: + return ColumnValue.fromDouble(c.asDouble()); + case BINARY: + return ColumnValue.fromBinary(c.asBytes()); + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, col.getType(), "ColumnValue")); + } + } catch (DataXException e) { + throw new IllegalArgumentException(String.format( + OTSErrorMessage.COLUMN_CONVERSION_ERROR, + c.getType(), c.asString(), col.getType().toString() + )); + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/Common.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/Common.java new file mode 100644 index 0000000000..26eb9329d6 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/Common.java @@ -0,0 +1,121 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.ArrayList; +import java.util.List; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.writer.otswriter.model.*; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSException; +import com.aliyun.openservices.ots.model.ColumnValue; +import com.aliyun.openservices.ots.model.PrimaryKeyValue; +import com.aliyun.openservices.ots.model.RowChange; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.RowPutChange; +import com.aliyun.openservices.ots.model.RowUpdateChange; +import org.apache.commons.math3.util.Pair; + +public class Common { + + public static String getDetailMessage(Exception exception) { + if (exception instanceof OTSException) { + OTSException e = (OTSException) exception; + return "OTSException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + ", RequestId:" + e.getRequestId() + "]"; + } else if (exception instanceof ClientException) { + ClientException e = (ClientException) exception; + return "ClientException[ErrorCode:" + e.getErrorCode() + ", ErrorMessage:" + e.getMessage() + "]"; + } else if (exception instanceof IllegalArgumentException) { + IllegalArgumentException e = (IllegalArgumentException) exception; + return "IllegalArgumentException[ErrorMessage:" + e.getMessage() + "]"; + } else { + return "Exception[ErrorMessage:" + exception.getMessage() + "]"; + } + } + + public static RowPrimaryKey getPKFromRecord(List pkColumns, Record r) { + RowPrimaryKey primaryKey = new RowPrimaryKey(); + int pkCount = pkColumns.size(); + for (int i = 0; i < 
pkCount; i++) { + Column col = r.getColumn(i); + OTSPKColumn expect = pkColumns.get(i); + + if (col.getRawData() == null) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PK_COLUMN_VALUE_IS_NULL_ERROR, expect.getName())); + } + + PrimaryKeyValue pk = ColumnConversion.columnToPrimaryKeyValue(col, expect); + primaryKey.addPrimaryKeyColumn(expect.getName(), pk); + } + return primaryKey; + } + + public static List> getAttrFromRecord(int pkCount, List attrColumns, Record r) { + List> attr = new ArrayList>(r.getColumnNumber()); + for (int i = 0; i < attrColumns.size(); i++) { + Column col = r.getColumn(i + pkCount); + OTSAttrColumn expect = attrColumns.get(i); + + if (col.getRawData() == null) { + attr.add(new Pair(expect.getName(), null)); + continue; + } + + ColumnValue cv = ColumnConversion.columnToColumnValue(col, expect); + attr.add(new Pair(expect.getName(), cv)); + } + return attr; + } + + public static RowChange columnValuesToRowChange(String tableName, OTSOpType type, RowPrimaryKey pk, List> values) { + switch (type) { + case PUT_ROW: + RowPutChangeWithRecord rowPutChange = new RowPutChangeWithRecord(tableName); + rowPutChange.setPrimaryKey(pk); + + for (Pair en : values) { + if (en.getValue() != null) { + rowPutChange.addAttributeColumn(en.getKey(), en.getValue()); + } + } + + return rowPutChange; + case UPDATE_ROW: + RowUpdateChangeWithRecord rowUpdateChange = new RowUpdateChangeWithRecord(tableName); + rowUpdateChange.setPrimaryKey(pk); + + for (Pair en : values) { + if (en.getValue() != null) { + rowUpdateChange.addAttributeColumn(en.getKey(), en.getValue()); + } else { + rowUpdateChange.deleteAttributeColumn(en.getKey()); + } + } + return rowUpdateChange; + case DELETE_ROW: + RowDeleteChangeWithRecord rowDeleteChange = new RowDeleteChangeWithRecord(tableName); + rowDeleteChange.setPrimaryKey(pk); + return rowDeleteChange; + default: + throw new IllegalArgumentException(String.format(OTSErrorMessage.UNSUPPORT_PARSE, type, "RowChange")); + } + } + + public static long getDelaySendMilliseconds(int hadRetryTimes, int initSleepInMilliSecond) { + + if (hadRetryTimes <= 0) { + return 0; + } + + int sleepTime = initSleepInMilliSecond; + for (int i = 1; i < hadRetryTimes; i++) { + sleepTime += sleepTime; + if (sleepTime > 30000) { + sleepTime = 30000; + break; + } + } + return sleepTime; + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/GsonParser.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/GsonParser.java new file mode 100644 index 0000000000..0cae91f2b4 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/GsonParser.java @@ -0,0 +1,46 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConf; +import com.aliyun.openservices.ots.model.Direction; +import com.aliyun.openservices.ots.model.RowPrimaryKey; +import com.aliyun.openservices.ots.model.TableMeta; +import com.google.gson.Gson; +import com.google.gson.GsonBuilder; + +public class GsonParser { + + private static Gson gsonBuilder() { + return new GsonBuilder() + .create(); + } + + public static String confToJson (OTSConf conf) { + Gson g = gsonBuilder(); + return g.toJson(conf); + } + + public static OTSConf jsonToConf (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, OTSConf.class); + } + + public static String directionToJson (Direction direction) { + Gson g = gsonBuilder(); + return g.toJson(direction); + } + 
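+    // OTSConf round-trips through Gson here: jsonToConf(configuration.getString(OTSConst.OTS_CONF))
+    // restores the writer configuration on the task side (see init() earlier in this diff);
+    // confToJson is presumably its job-side counterpart that stores the conf under the same key.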
+ public static Direction jsonToDirection (String jsonStr) { + Gson g = gsonBuilder(); + return g.fromJson(jsonStr, Direction.class); + } + + public static String metaToJson (TableMeta meta) { + Gson g = gsonBuilder(); + return g.toJson(meta); + } + + public static String rowPrimaryKeyToJson (RowPrimaryKey row) { + Gson g = gsonBuilder(); + return g.toJson(row); + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ParamChecker.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ParamChecker.java new file mode 100644 index 0000000000..f9e17af5f1 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/ParamChecker.java @@ -0,0 +1,153 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSAttrColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSPKColumn; +import com.aliyun.openservices.ots.model.PrimaryKeyType; +import com.aliyun.openservices.ots.model.TableMeta; + +public class ParamChecker { + + private static void throwNotExistException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.MISSING_PARAMTER_ERROR, key)); + } + + private static void throwStringLengthZeroException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMTER_STRING_IS_EMPTY_ERROR, key)); + } + + private static void throwEmptyListException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMETER_LIST_IS_EMPTY_ERROR, key)); + } + + private static void throwNotListException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMETER_IS_NOT_ARRAY_ERROR, key)); + } + + private static void throwNotMapException(String key) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARAMETER_IS_NOT_MAP_ERROR, key)); + } + + public static String checkStringAndGet(Configuration param, String key) { + String value = param.getString(key); + if (null == value) { + throwNotExistException(key); + } else if (value.length() == 0) { + throwStringLengthZeroException(key); + } + return value; + } + + public static List checkListAndGet(Configuration param, String key, boolean isCheckEmpty) { + List value = null; + try { + value = param.getList(key); + } catch (ClassCastException e) { + throwNotListException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyListException(key); + } + return value; + } + + public static List checkListAndGet(Map range, String key) { + Object obj = range.get(key); + if (null == obj) { + return null; + } + return checkListAndGet(range, key, false); + } + + public static List checkListAndGet(Map range, String key, boolean isCheckEmpty) { + Object obj = range.get(key); + if (null == obj) { + throwNotExistException(key); + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + if (isCheckEmpty && value.isEmpty()) { + throwEmptyListException(key); + } + return value; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARSE_TO_LIST_ERROR, key)); + } + } + + public static List 
checkListAndGet(Map range, String key, List defaultList) { + Object obj = range.get(key); + if (null == obj) { + return defaultList; + } + if (obj instanceof List) { + @SuppressWarnings("unchecked") + List value = (List)obj; + return value; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PARSE_TO_LIST_ERROR, key)); + } + } + + public static Map checkMapAndGet(Configuration param, String key, boolean isCheckEmpty) { + Map value = null; + try { + value = param.getMap(key); + } catch (ClassCastException e) { + throwNotMapException(key); + } + if (null == value) { + throwNotExistException(key); + } else if (isCheckEmpty && value.isEmpty()) { + throwEmptyListException(key); + } + return value; + } + + public static void checkPrimaryKey(TableMeta meta, List pk) { + Map types = meta.getPrimaryKey(); + // 个数是否相等 + if (types.size() != pk.size()) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.INPUT_PK_COUNT_NOT_EQUAL_META_ERROR, pk.size(), types.size())); + } + + // 名字类型是否相等 + Map inputTypes = new HashMap(); + for (OTSPKColumn col : pk) { + inputTypes.put(col.getName(), col.getType()); + } + + for (Entry e : types.entrySet()) { + if (!inputTypes.containsKey(e.getKey())) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PK_COLUMN_MISSING_ERROR, e.getKey())); + } + PrimaryKeyType type = inputTypes.get(e.getKey()); + if (type != e.getValue()) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.INPUT_PK_TYPE_NOT_MATCH_META_ERROR, e.getKey(), type, e.getValue())); + } + } + } + + public static void checkAttribute(List attr) { + // 检查重复列 + Set names = new HashSet(); + for (OTSAttrColumn col : attr) { + if (names.contains(col.getName())) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.ATTR_REPEAT_COLUMN_ERROR, col.getName())); + } else { + names.add(col.getName()); + } + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/RetryHelper.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/RetryHelper.java new file mode 100644 index 0000000000..a863b908ed --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/RetryHelper.java @@ -0,0 +1,76 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.HashSet; +import java.util.Set; +import java.util.concurrent.Callable; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.plugin.writer.otswriter.model.LogExceptionManager; +import com.aliyun.openservices.ots.ClientException; +import com.aliyun.openservices.ots.OTSErrorCode; +import com.aliyun.openservices.ots.OTSException; + +public class RetryHelper { + + private static final Logger LOG = LoggerFactory.getLogger(RetryHelper.class); + private static final Set noRetryErrorCode = prepareNoRetryErrorCode(); + + public static LogExceptionManager logManager = new LogExceptionManager(); + + public static V executeWithRetry(Callable callable, int maxRetryTimes, int sleepInMilliSecond) throws Exception { + int retryTimes = 0; + while (true){ + Thread.sleep(Common.getDelaySendMilliseconds(retryTimes, sleepInMilliSecond)); + try { + return callable.call(); + } catch (Exception e) { + logManager.addException(e); + if (!canRetry(e)){ + LOG.error("Can not retry for Exception.", e); + throw e; + } else if (retryTimes >= maxRetryTimes) { + LOG.error("Retry times more than limition. 
maxRetryTimes : {}", maxRetryTimes); + throw e; + } + retryTimes++; + LOG.warn("Retry time : {}", retryTimes); + } + } + } + + private static Set prepareNoRetryErrorCode() { + Set pool = new HashSet(); + pool.add(OTSErrorCode.AUTHORIZATION_FAILURE); + pool.add(OTSErrorCode.INVALID_PARAMETER); + pool.add(OTSErrorCode.REQUEST_TOO_LARGE); + pool.add(OTSErrorCode.OBJECT_NOT_EXIST); + pool.add(OTSErrorCode.OBJECT_ALREADY_EXIST); + pool.add(OTSErrorCode.INVALID_PK); + pool.add(OTSErrorCode.OUT_OF_COLUMN_COUNT_LIMIT); + pool.add(OTSErrorCode.OUT_OF_ROW_SIZE_LIMIT); + pool.add(OTSErrorCode.CONDITION_CHECK_FAIL); + return pool; + } + + public static boolean canRetry(String otsErrorCode) { + if (noRetryErrorCode.contains(otsErrorCode)) { + return false; + } else { + return true; + } + } + + public static boolean canRetry(Exception exception) { + OTSException e = null; + if (exception instanceof OTSException) { + e = (OTSException) exception; + return canRetry(e.getErrorCode()); + } else if (exception instanceof ClientException) { + return true; + } else { + return false; + } + } +} diff --git a/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/WriterModelParser.java b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/WriterModelParser.java new file mode 100644 index 0000000000..c81587b685 --- /dev/null +++ b/otswriter/src/main/java/com/alibaba/datax/plugin/writer/otswriter/utils/WriterModelParser.java @@ -0,0 +1,139 @@ +package com.alibaba.datax.plugin.writer.otswriter.utils; + +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; + +import com.alibaba.datax.plugin.writer.otswriter.model.OTSAttrColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSPKColumn; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSConst; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSErrorMessage; +import com.alibaba.datax.plugin.writer.otswriter.model.OTSOpType; +import com.aliyun.openservices.ots.model.ColumnType; +import com.aliyun.openservices.ots.model.PrimaryKeyType; + +/** + * 解析配置中参数 + * @author redchen + * + */ +public class WriterModelParser { + + public static PrimaryKeyType parsePrimaryKeyType(String type) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return PrimaryKeyType.STRING; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return PrimaryKeyType.INTEGER; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.PK_TYPE_ERROR, type)); + } + } + + public static OTSPKColumn parseOTSPKColumn(Map column) { + if (column.containsKey(OTSConst.NAME) && column.containsKey(OTSConst.TYPE) && column.size() == 2) { + Object type = column.get(OTSConst.TYPE); + Object name = column.get(OTSConst.NAME); + if (type instanceof String && name instanceof String) { + String typeStr = (String) type; + String nameStr = (String) name; + if (nameStr.isEmpty()) { + throw new IllegalArgumentException(OTSErrorMessage.PK_COLUMN_NAME_IS_EMPTY_ERROR); + } + return new OTSPKColumn(nameStr, parsePrimaryKeyType(typeStr)); + } else { + throw new IllegalArgumentException(OTSErrorMessage.PK_MAP_NAME_TYPE_ERROR); + } + } else { + throw new IllegalArgumentException(OTSErrorMessage.PK_MAP_INCLUDE_NAME_TYPE_ERROR); + } + } + + public static List parseOTSPKColumnList(List values) { + List pks = new ArrayList(); + for (Object obj : values) { + if (obj instanceof Map) { + @SuppressWarnings("unchecked") + Map column = (Map) obj; + 
pks.add(parseOTSPKColumn(column)); + } else { + throw new IllegalArgumentException(OTSErrorMessage.PK_ITEM_IS_NOT_MAP_ERROR); + } + } + return pks; + } + + public static ColumnType parseColumnType(String type) { + if (type.equalsIgnoreCase(OTSConst.TYPE_STRING)) { + return ColumnType.STRING; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_INTEGER)) { + return ColumnType.INTEGER; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BOOLEAN)) { + return ColumnType.BOOLEAN; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_DOUBLE)) { + return ColumnType.DOUBLE; + } else if (type.equalsIgnoreCase(OTSConst.TYPE_BINARY)) { + return ColumnType.BINARY; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.ATTR_TYPE_ERROR, type)); + } + } + + public static OTSAttrColumn parseOTSAttrColumn(Map column) { + if (column.containsKey(OTSConst.NAME) && column.containsKey(OTSConst.TYPE) && column.size() == 2) { + Object type = column.get(OTSConst.TYPE); + Object name = column.get(OTSConst.NAME); + if (type instanceof String && name instanceof String) { + String typeStr = (String) type; + String nameStr = (String) name; + if (nameStr.isEmpty()) { + throw new IllegalArgumentException(OTSErrorMessage.ATTR_COLUMN_NAME_IS_EMPTY_ERROR); + } + return new OTSAttrColumn(nameStr, parseColumnType(typeStr)); + } else { + throw new IllegalArgumentException(OTSErrorMessage.ATTR_MAP_NAME_TYPE_ERROR); + } + } else { + throw new IllegalArgumentException(OTSErrorMessage.ATTR_MAP_INCLUDE_NAME_TYPE_ERROR); + } + } + + private static void checkMultiAttrColumn(List attrs) { + Set pool = new HashSet(); + for (OTSAttrColumn col : attrs) { + if (pool.contains(col.getName())) { + throw new IllegalArgumentException(String.format(OTSErrorMessage.MULTI_ATTR_COLUMN_ERROR, col.getName())); + } else { + pool.add(col.getName()); + } + } + } + + public static List parseOTSAttrColumnList(List values) { + List attrs = new ArrayList(); + for (Object obj : values) { + if (obj instanceof Map) { + @SuppressWarnings("unchecked") + Map column = (Map) obj; + attrs.add(parseOTSAttrColumn(column)); + } else { + throw new IllegalArgumentException(OTSErrorMessage.ATTR_ITEM_IS_NOT_MAP_ERROR); + } + } + checkMultiAttrColumn(attrs); + return attrs; + } + + public static OTSOpType parseOTSOpType(String value) { + if (value.equalsIgnoreCase(OTSConst.OTS_OP_TYPE_PUT)) { + return OTSOpType.PUT_ROW; + } else if (value.equalsIgnoreCase(OTSConst.OTS_OP_TYPE_UPDATE)) { + return OTSOpType.UPDATE_ROW; + } else if (value.equalsIgnoreCase(OTSConst.OTS_OP_TYPE_DELETE)) { + return OTSOpType.DELETE_ROW; + } else { + throw new IllegalArgumentException(String.format(OTSErrorMessage.OPERATION_PARSE_ERROR, value)); + } + } +} diff --git a/otswriter/src/main/resources/plugin.json b/otswriter/src/main/resources/plugin.json new file mode 100644 index 0000000000..315e96cc3c --- /dev/null +++ b/otswriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "otswriter", + "class": "com.alibaba.datax.plugin.writer.otswriter.OtsWriter", + "description": "", + "developer": "alibaba" +} \ No newline at end of file diff --git a/otswriter/src/main/resources/plugin_job_template.json b/otswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..572a9b2542 --- /dev/null +++ b/otswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "otswriter", + "parameter": { + "endpoint":"", + "accessId":"", + "accessKey":"", + "instanceName":"", + "table":"", + "primaryKey" : [], + "column" : [], + "writeMode" : "" + 
} +} \ No newline at end of file diff --git a/package.xml b/package.xml new file mode 100755 index 0000000000..11e1da5ede --- /dev/null +++ b/package.xml @@ -0,0 +1,298 @@ + + + + tar.gz + dir + + false + + + transformer/target/datax/ + + **/*.* + + datax + + + core/target/datax/ + + **/*.* + + datax + + + + + mysqlreader/target/datax/ + + **/*.* + + datax + + + oceanbasereader/target/datax/ + + **/*.* + + datax + + + drdsreader/target/datax/ + + **/*.* + + datax + + + oraclereader/target/datax/ + + **/*.* + + datax + + + sqlserverreader/target/datax/ + + **/*.* + + datax + + + db2reader/target/datax/ + + **/*.* + + datax + + + postgresqlreader/target/datax/ + + **/*.* + + datax + + + rdbmsreader/target/datax/ + + **/*.* + + datax + + + + odpsreader/target/datax/ + + **/*.* + + datax + + + otsreader/target/datax/ + + **/*.* + + datax + + + otsstreamreader/target/datax/ + + **/*.* + + datax + + + txtfilereader/target/datax/ + + **/*.* + + datax + + + ossreader/target/datax/ + + **/*.* + + datax + + + mongodbreader/target/datax/ + + **/*.* + + datax + + + streamreader/target/datax/ + + **/*.* + + datax + + + ftpreader/target/datax/ + + **/*.* + + datax + + + hdfsreader/target/datax/ + + **/*.* + + datax + + + hbase11xreader/target/datax/ + + **/*.* + + datax + + + hbase094xreader/target/datax/ + + **/*.* + + datax + + + + + mysqlwriter/target/datax/ + + **/*.* + + datax + + + drdswriter/target/datax/ + + **/*.* + + datax + + + odpswriter/target/datax/ + + **/*.* + + datax + + + txtfilewriter/target/datax/ + + **/*.* + + datax + + + ftpwriter/target/datax/ + + **/*.* + + datax + + + osswriter/target/datax/ + + **/*.* + + datax + + + adswriter/target/datax/ + + **/*.* + + datax + + + streamwriter/target/datax/ + + **/*.* + + datax + + + otswriter/target/datax/ + + **/*.* + + datax + + + mongodbwriter/target/datax/ + + **/*.* + + datax + + + oraclewriter/target/datax/ + + **/*.* + + datax + + + sqlserverwriter/target/datax/ + + **/*.* + + datax + + + postgresqlwriter/target/datax/ + + **/*.* + + datax + + + rdbmswriter/target/datax/ + + **/*.* + + datax + + + ocswriter/target/datax/ + + **/*.* + + datax + + + hdfswriter/target/datax/ + + **/*.* + + datax + + + hbase11xwriter/target/datax/ + + **/*.* + + datax + + + hbase094xwriter/target/datax/ + + **/*.* + + datax + + + hbase11xsqlwriter/target/datax/ + + **/*.* + + datax + + + diff --git a/plugin-rdbms-util/pom.xml b/plugin-rdbms-util/pom.xml new file mode 100755 index 0000000000..1001a37c58 --- /dev/null +++ b/plugin-rdbms-util/pom.xml @@ -0,0 +1,67 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + plugin-rdbms-util + plugin-rdbms-util + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + commons-collections + commons-collections + 3.0 + + + mysql + mysql-connector-java + 5.1.34 + test + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + com.alibaba + druid + 1.0.15 + + + junit + junit + test + + + org.mockito + mockito-all + 1.9.5 + test + + + com.google.guava + guava + r05 + + + diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/CommonRdbmsReader.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/CommonRdbmsReader.java new file mode 100755 index 0000000000..f31804025e --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/CommonRdbmsReader.java @@ -0,0 +1,353 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +import 
com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.BytesColumn; +import com.alibaba.datax.common.element.DateColumn; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.statistics.PerfRecord; +import com.alibaba.datax.common.statistics.PerfTrace; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.util.OriginalConfPretreatmentUtil; +import com.alibaba.datax.plugin.rdbms.reader.util.PreCheckTask; +import com.alibaba.datax.plugin.rdbms.reader.util.ReaderSplitUtil; +import com.alibaba.datax.plugin.rdbms.reader.util.SingleTableSplitUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.RdbmsException; +import com.google.common.collect.Lists; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.ResultSet; +import java.sql.ResultSetMetaData; +import java.sql.Types; +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.Future; + +public class CommonRdbmsReader { + + public static class Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + public Job(DataBaseType dataBaseType) { + OriginalConfPretreatmentUtil.DATABASE_TYPE = dataBaseType; + SingleTableSplitUtil.DATABASE_TYPE = dataBaseType; + } + + public void init(Configuration originalConfig) { + + OriginalConfPretreatmentUtil.doPretreatment(originalConfig); + + LOG.debug("After job init(), job config now is:[\n{}\n]", + originalConfig.toJSON()); + } + + public void preCheck(Configuration originalConfig,DataBaseType dataBaseType) { + /*检查每个表是否有读权限,以及querySql跟splik Key是否正确*/ + Configuration queryConf = ReaderSplitUtil.doPreCheckSplit(originalConfig); + String splitPK = queryConf.getString(Key.SPLIT_PK); + List connList = queryConf.getList(Constant.CONN_MARK, Object.class); + String username = queryConf.getString(Key.USERNAME); + String password = queryConf.getString(Key.PASSWORD); + ExecutorService exec; + if (connList.size() < 10){ + exec = Executors.newFixedThreadPool(connList.size()); + }else{ + exec = Executors.newFixedThreadPool(10); + } + Collection taskList = new ArrayList(); + for (int i = 0, len = connList.size(); i < len; i++){ + Configuration connConf = Configuration.from(connList.get(i).toString()); + PreCheckTask t = new PreCheckTask(username,password,connConf,dataBaseType,splitPK); + taskList.add(t); + } + List> results = Lists.newArrayList(); + try { + results = exec.invokeAll(taskList); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + + for (Future result : results){ + try { + result.get(); + } catch (ExecutionException e) { + DataXException de = (DataXException) e.getCause(); + throw de; + }catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + } + 
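+            // all pre-check futures have been drained (any failure is rethrown as DataXException),
+            // so the temporary thread pool can be shut down immediately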
exec.shutdownNow(); + } + + + public List split(Configuration originalConfig, + int adviceNumber) { + return ReaderSplitUtil.doSplit(originalConfig, adviceNumber); + } + + public void post(Configuration originalConfig) { + // do nothing + } + + public void destroy(Configuration originalConfig) { + // do nothing + } + + } + + public static class Task { + private static final Logger LOG = LoggerFactory + .getLogger(Task.class); + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + protected final byte[] EMPTY_CHAR_ARRAY = new byte[0]; + + private DataBaseType dataBaseType; + private int taskGroupId = -1; + private int taskId=-1; + + private String username; + private String password; + private String jdbcUrl; + private String mandatoryEncoding; + + // 作为日志显示信息时,需要附带的通用信息。比如信息所对应的数据库连接等信息,针对哪个表做的操作 + private String basicMsg; + + public Task(DataBaseType dataBaseType) { + this(dataBaseType, -1, -1); + } + + public Task(DataBaseType dataBaseType,int taskGropuId, int taskId) { + this.dataBaseType = dataBaseType; + this.taskGroupId = taskGropuId; + this.taskId = taskId; + } + + public void init(Configuration readerSliceConfig) { + + /* for database connection */ + + this.username = readerSliceConfig.getString(Key.USERNAME); + this.password = readerSliceConfig.getString(Key.PASSWORD); + this.jdbcUrl = readerSliceConfig.getString(Key.JDBC_URL); + + //ob10的处理 + if (this.jdbcUrl.startsWith(com.alibaba.datax.plugin.rdbms.writer.Constant.OB10_SPLIT_STRING) && this.dataBaseType == DataBaseType.MySql) { + String[] ss = this.jdbcUrl.split(com.alibaba.datax.plugin.rdbms.writer.Constant.OB10_SPLIT_STRING_PATTERN); + if (ss.length != 3) { + throw DataXException + .asDataXException( + DBUtilErrorCode.JDBC_OB10_ADDRESS_ERROR, "JDBC OB10格式错误,请联系askdatax"); + } + LOG.info("this is ob1_0 jdbc url."); + this.username = ss[1].trim() +":"+this.username; + this.jdbcUrl = ss[2]; + LOG.info("this is ob1_0 jdbc url. 
user=" + this.username + " :url=" + this.jdbcUrl); + } + + this.mandatoryEncoding = readerSliceConfig.getString(Key.MANDATORY_ENCODING, ""); + + basicMsg = String.format("jdbcUrl:[%s]", this.jdbcUrl); + + } + + public void startRead(Configuration readerSliceConfig, + RecordSender recordSender, + TaskPluginCollector taskPluginCollector, int fetchSize) { + String querySql = readerSliceConfig.getString(Key.QUERY_SQL); + String table = readerSliceConfig.getString(Key.TABLE); + + PerfTrace.getInstance().addTaskDetails(taskId, table + "," + basicMsg); + + LOG.info("Begin to read record by Sql: [{}\n] {}.", + querySql, basicMsg); + PerfRecord queryPerfRecord = new PerfRecord(taskGroupId,taskId, PerfRecord.PHASE.SQL_QUERY); + queryPerfRecord.start(); + + Connection conn = DBUtil.getConnection(this.dataBaseType, jdbcUrl, + username, password); + + // session config .etc related + DBUtil.dealWithSessionConfig(conn, readerSliceConfig, + this.dataBaseType, basicMsg); + + int columnNumber = 0; + ResultSet rs = null; + try { + rs = DBUtil.query(conn, querySql, fetchSize); + queryPerfRecord.end(); + + ResultSetMetaData metaData = rs.getMetaData(); + columnNumber = metaData.getColumnCount(); + + //这个统计干净的result_Next时间 + PerfRecord allResultPerfRecord = new PerfRecord(taskGroupId, taskId, PerfRecord.PHASE.RESULT_NEXT_ALL); + allResultPerfRecord.start(); + + long rsNextUsedTime = 0; + long lastTime = System.nanoTime(); + while (rs.next()) { + rsNextUsedTime += (System.nanoTime() - lastTime); + this.transportOneRecord(recordSender, rs, + metaData, columnNumber, mandatoryEncoding, taskPluginCollector); + lastTime = System.nanoTime(); + } + + allResultPerfRecord.end(rsNextUsedTime); + //目前大盘是依赖这个打印,而之前这个Finish read record是包含了sql查询和result next的全部时间 + LOG.info("Finished read record by Sql: [{}\n] {}.", + querySql, basicMsg); + + }catch (Exception e) { + throw RdbmsException.asQueryException(this.dataBaseType, e, querySql, table, username); + } finally { + DBUtil.closeDBResources(null, conn); + } + } + + public void post(Configuration originalConfig) { + // do nothing + } + + public void destroy(Configuration originalConfig) { + // do nothing + } + + protected Record transportOneRecord(RecordSender recordSender, ResultSet rs, + ResultSetMetaData metaData, int columnNumber, String mandatoryEncoding, + TaskPluginCollector taskPluginCollector) { + Record record = buildRecord(recordSender,rs,metaData,columnNumber,mandatoryEncoding,taskPluginCollector); + recordSender.sendToWriter(record); + return record; + } + protected Record buildRecord(RecordSender recordSender,ResultSet rs, ResultSetMetaData metaData, int columnNumber, String mandatoryEncoding, + TaskPluginCollector taskPluginCollector) { + Record record = recordSender.createRecord(); + + try { + for (int i = 1; i <= columnNumber; i++) { + switch (metaData.getColumnType(i)) { + + case Types.CHAR: + case Types.NCHAR: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + String rawData; + if(StringUtils.isBlank(mandatoryEncoding)){ + rawData = rs.getString(i); + }else{ + rawData = new String((rs.getBytes(i) == null ? 
EMPTY_CHAR_ARRAY : + rs.getBytes(i)), mandatoryEncoding); + } + record.addColumn(new StringColumn(rawData)); + break; + + case Types.CLOB: + case Types.NCLOB: + record.addColumn(new StringColumn(rs.getString(i))); + break; + + case Types.SMALLINT: + case Types.TINYINT: + case Types.INTEGER: + case Types.BIGINT: + record.addColumn(new LongColumn(rs.getString(i))); + break; + + case Types.NUMERIC: + case Types.DECIMAL: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.TIME: + record.addColumn(new DateColumn(rs.getTime(i))); + break; + + // for mysql bug, see http://bugs.mysql.com/bug.php?id=35115 + case Types.DATE: + if (metaData.getColumnTypeName(i).equalsIgnoreCase("year")) { + record.addColumn(new LongColumn(rs.getInt(i))); + } else { + record.addColumn(new DateColumn(rs.getDate(i))); + } + break; + + case Types.TIMESTAMP: + record.addColumn(new DateColumn(rs.getTimestamp(i))); + break; + + case Types.BINARY: + case Types.VARBINARY: + case Types.BLOB: + case Types.LONGVARBINARY: + record.addColumn(new BytesColumn(rs.getBytes(i))); + break; + + // warn: bit(1) -> Types.BIT 可使用BoolColumn + // warn: bit(>1) -> Types.VARBINARY 可使用BytesColumn + case Types.BOOLEAN: + case Types.BIT: + record.addColumn(new BoolColumn(rs.getBoolean(i))); + break; + + case Types.NULL: + String stringData = null; + if(rs.getObject(i) != null) { + stringData = rs.getObject(i).toString(); + } + record.addColumn(new StringColumn(stringData)); + break; + + default: + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库读取这种字段类型. 字段名:[%s], 字段名称:[%s], 字段Java类型:[%s]. 请尝试使用数据库函数将其转换datax支持的类型 或者不同步该字段 .", + metaData.getColumnName(i), + metaData.getColumnType(i), + metaData.getColumnClassName(i))); + } + } + } catch (Exception e) { + if (IS_DEBUG) { + LOG.debug("read data " + record.toString() + + " occur exception:", e); + } + //TODO 这里识别为脏数据靠谱吗? 
+ taskPluginCollector.collectDirtyRecord(record, e); + if (e instanceof DataXException) { + throw (DataXException) e; + } + } + return record; + } + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Constant.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Constant.java new file mode 100755 index 0000000000..729d71acb7 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Constant.java @@ -0,0 +1,28 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +public final class Constant { + public static final String PK_TYPE = "pkType"; + + public static final Object PK_TYPE_STRING = "pkTypeString"; + + public static final Object PK_TYPE_LONG = "pkTypeLong"; + + public static final Object PK_TYPE_MONTECARLO = "pkTypeMonteCarlo"; + + public static final String SPLIT_MODE_RANDOMSAMPLE = "randomSampling"; + + public static String CONN_MARK = "connection"; + + public static String TABLE_NUMBER_MARK = "tableNumber"; + + public static String IS_TABLE_MODE = "isTableMode"; + + public final static String FETCH_SIZE = "fetchSize"; + + public static String QUERY_SQL_TEMPLATE_WITHOUT_WHERE = "select %s from %s "; + + public static String QUERY_SQL_TEMPLATE = "select %s from %s where (%s)"; + + public static String TABLE_NAME_PLACEHOLDER = "@table"; + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Key.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Key.java new file mode 100755 index 0000000000..63f8dde013 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/Key.java @@ -0,0 +1,50 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +/** + * 编码,时区等配置,暂未定. + */ +public final class Key { + public final static String JDBC_URL = "jdbcUrl"; + + public final static String USERNAME = "username"; + + public final static String PASSWORD = "password"; + + public final static String TABLE = "table"; + + public final static String MANDATORY_ENCODING = "mandatoryEncoding"; + + // 是数组配置 + public final static String COLUMN = "column"; + + public final static String COLUMN_LIST = "columnList"; + + public final static String WHERE = "where"; + + public final static String HINT = "hint"; + + public final static String SPLIT_PK = "splitPk"; + + public final static String SPLIT_MODE = "splitMode"; + + public final static String SAMPLE_PERCENTAGE = "samplePercentage"; + + public final static String QUERY_SQL = "querySql"; + + public final static String SPLIT_PK_SQL = "splitPkSql"; + + + public final static String PRE_SQL = "preSql"; + + public final static String POST_SQL = "postSql"; + + public final static String CHECK_SLAVE = "checkSlave"; + + public final static String SESSION = "session"; + + public final static String DBNAME = "dbName"; + + public final static String DRYRUN = "dryRun"; + + +} \ No newline at end of file diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/ResultSetReadProxy.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/ResultSetReadProxy.java new file mode 100755 index 0000000000..9fe765c6cb --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/ResultSetReadProxy.java @@ -0,0 +1,139 @@ +package com.alibaba.datax.plugin.rdbms.reader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import 
com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.ResultSet; +import java.sql.ResultSetMetaData; +import java.sql.Types; + +public class ResultSetReadProxy { + private static final Logger LOG = LoggerFactory + .getLogger(ResultSetReadProxy.class); + + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + private static final byte[] EMPTY_CHAR_ARRAY = new byte[0]; + + //TODO + public static void transportOneRecord(RecordSender recordSender, ResultSet rs, + ResultSetMetaData metaData, int columnNumber, String mandatoryEncoding, + TaskPluginCollector taskPluginCollector) { + Record record = recordSender.createRecord(); + + try { + for (int i = 1; i <= columnNumber; i++) { + switch (metaData.getColumnType(i)) { + + case Types.CHAR: + case Types.NCHAR: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + String rawData; + if(StringUtils.isBlank(mandatoryEncoding)){ + rawData = rs.getString(i); + }else{ + rawData = new String((rs.getBytes(i) == null ? EMPTY_CHAR_ARRAY : + rs.getBytes(i)), mandatoryEncoding); + } + record.addColumn(new StringColumn(rawData)); + break; + + case Types.CLOB: + case Types.NCLOB: + record.addColumn(new StringColumn(rs.getString(i))); + break; + + case Types.SMALLINT: + case Types.TINYINT: + case Types.INTEGER: + case Types.BIGINT: + record.addColumn(new LongColumn(rs.getString(i))); + break; + + case Types.NUMERIC: + case Types.DECIMAL: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.TIME: + record.addColumn(new DateColumn(rs.getTime(i))); + break; + + // for mysql bug, see http://bugs.mysql.com/bug.php?id=35115 + case Types.DATE: + if (metaData.getColumnTypeName(i).equalsIgnoreCase("year")) { + record.addColumn(new LongColumn(rs.getInt(i))); + } else { + record.addColumn(new DateColumn(rs.getDate(i))); + } + break; + + case Types.TIMESTAMP: + record.addColumn(new DateColumn(rs.getTimestamp(i))); + break; + + case Types.BINARY: + case Types.VARBINARY: + case Types.BLOB: + case Types.LONGVARBINARY: + record.addColumn(new BytesColumn(rs.getBytes(i))); + break; + + // warn: bit(1) -> Types.BIT 可使用BoolColumn + // warn: bit(>1) -> Types.VARBINARY 可使用BytesColumn + case Types.BOOLEAN: + case Types.BIT: + record.addColumn(new BoolColumn(rs.getBoolean(i))); + break; + + case Types.NULL: + String stringData = null; + if(rs.getObject(i) != null) { + stringData = rs.getObject(i).toString(); + } + record.addColumn(new StringColumn(stringData)); + break; + + // TODO 添加BASIC_MESSAGE + default: + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库读取这种字段类型. 字段名:[%s], 字段名称:[%s], 字段Java类型:[%s]. 请尝试使用数据库函数将其转换datax支持的类型 或者不同步该字段 .", + metaData.getColumnName(i), + metaData.getColumnType(i), + metaData.getColumnClassName(i))); + } + } + } catch (Exception e) { + if (IS_DEBUG) { + LOG.debug("read data " + record.toString() + + " occur exception:", e); + } + + //TODO 这里识别为脏数据靠谱吗? 
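                    // Mirrors the dirty-record handling in CommonRdbmsReader.Task#buildRecord:
                    // the partially built record is handed to the collector as dirty data, and a
                    // DataXException (unsupported column type) is still rethrown so the task
                    // fails instead of silently dropping the column.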
+ taskPluginCollector.collectDirtyRecord(record, e); + if (e instanceof DataXException) { + throw (DataXException) e; + } + } + + recordSender.sendToWriter(record); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/HintUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/HintUtil.java new file mode 100644 index 0000000000..4e6827cfc1 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/HintUtil.java @@ -0,0 +1,67 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * Created by liuyi on 15/9/18. + */ +public class HintUtil { + private static final Logger LOG = LoggerFactory.getLogger(ReaderSplitUtil.class); + + private static DataBaseType dataBaseType; + private static String username; + private static String password; + private static Pattern tablePattern; + private static String hintExpression; + + public static void initHintConf(DataBaseType type, Configuration configuration){ + dataBaseType = type; + username = configuration.getString(Key.USERNAME); + password = configuration.getString(Key.PASSWORD); + String hint = configuration.getString(Key.HINT); + if(StringUtils.isNotBlank(hint)){ + String[] tablePatternAndHint = hint.split("#"); + if(tablePatternAndHint.length==1){ + tablePattern = Pattern.compile(".*"); + hintExpression = tablePatternAndHint[0]; + }else{ + tablePattern = Pattern.compile(tablePatternAndHint[0]); + hintExpression = tablePatternAndHint[1]; + } + } + } + + public static String buildQueryColumn(String jdbcUrl, String table, String column){ + try{ + if(tablePattern != null && DataBaseType.Oracle.equals(dataBaseType)) { + Matcher m = tablePattern.matcher(table); + if(m.find()){ + String[] tableStr = table.split("\\."); + String tableWithoutSchema = tableStr[tableStr.length-1]; + String finalHint = hintExpression.replaceAll(Constant.TABLE_NAME_PLACEHOLDER, tableWithoutSchema); + //主库不并发读取 + if(finalHint.indexOf("parallel") > 0 && DBUtil.isOracleMaster(jdbcUrl, username, password)){ + LOG.info("master:{} will not use hint:{}", jdbcUrl, finalHint); + }else{ + LOG.info("table:{} use hint:{}.", table, finalHint); + return finalHint + column; + } + } + } + } catch (Exception e){ + LOG.warn("match hint exception, will not use hint", e); + } + return column; + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/OriginalConfPretreatmentUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/OriginalConfPretreatmentUtil.java new file mode 100755 index 0000000000..3ac5f2af7d --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/OriginalConfPretreatmentUtil.java @@ -0,0 +1,272 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; 
+import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.TableExpandUtil; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +public final class OriginalConfPretreatmentUtil { + private static final Logger LOG = LoggerFactory + .getLogger(OriginalConfPretreatmentUtil.class); + + public static DataBaseType DATABASE_TYPE; + + public static void doPretreatment(Configuration originalConfig) { + // 检查 username/password 配置(必填) + originalConfig.getNecessaryValue(Key.USERNAME, + DBUtilErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.PASSWORD, + DBUtilErrorCode.REQUIRED_VALUE); + dealWhere(originalConfig); + + simplifyConf(originalConfig); + } + + public static void dealWhere(Configuration originalConfig) { + String where = originalConfig.getString(Key.WHERE, null); + if(StringUtils.isNotBlank(where)) { + String whereImprove = where.trim(); + if(whereImprove.endsWith(";") || whereImprove.endsWith(";")) { + whereImprove = whereImprove.substring(0,whereImprove.length()-1); + } + originalConfig.set(Key.WHERE, whereImprove); + } + } + + /** + * 对配置进行初步处理: + *
+     * 1. 处理同一个数据库配置了多个jdbcUrl的情况
+     * 2. 识别并标记是采用querySql 模式还是 table 模式
+     * 3. 对 table 模式,确定分表个数,并处理 column 转 *事项
+ */ + private static void simplifyConf(Configuration originalConfig) { + boolean isTableMode = recognizeTableOrQuerySqlMode(originalConfig); + originalConfig.set(Constant.IS_TABLE_MODE, isTableMode); + + dealJdbcAndTable(originalConfig); + + dealColumnConf(originalConfig); + } + + private static void dealJdbcAndTable(Configuration originalConfig) { + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + boolean checkSlave = originalConfig.getBool(Key.CHECK_SLAVE, false); + boolean isTableMode = originalConfig.getBool(Constant.IS_TABLE_MODE); + boolean isPreCheck = originalConfig.getBool(Key.DRYRUN,false); + + List conns = originalConfig.getList(Constant.CONN_MARK, + Object.class); + List preSql = originalConfig.getList(Key.PRE_SQL, String.class); + + int tableNum = 0; + + for (int i = 0, len = conns.size(); i < len; i++) { + Configuration connConf = Configuration + .from(conns.get(i).toString()); + + connConf.getNecessaryValue(Key.JDBC_URL, + DBUtilErrorCode.REQUIRED_VALUE); + + List jdbcUrls = connConf + .getList(Key.JDBC_URL, String.class); + + String jdbcUrl; + if (isPreCheck) { + jdbcUrl = DBUtil.chooseJdbcUrlWithoutRetry(DATABASE_TYPE, jdbcUrls, + username, password, preSql, checkSlave); + } else { + jdbcUrl = DBUtil.chooseJdbcUrl(DATABASE_TYPE, jdbcUrls, + username, password, preSql, checkSlave); + } + + jdbcUrl = DATABASE_TYPE.appendJDBCSuffixForReader(jdbcUrl); + + // 回写到connection[i].jdbcUrl + originalConfig.set(String.format("%s[%d].%s", Constant.CONN_MARK, + i, Key.JDBC_URL), jdbcUrl); + + LOG.info("Available jdbcUrl:{}.",jdbcUrl); + + if (isTableMode) { + // table 方式 + // 对每一个connection 上配置的table 项进行解析(已对表名称进行了 ` 处理的) + List tables = connConf.getList(Key.TABLE, String.class); + + List expandedTables = TableExpandUtil.expandTableConf( + DATABASE_TYPE, tables); + + if (null == expandedTables || expandedTables.isEmpty()) { + throw DataXException.asDataXException( + DBUtilErrorCode.ILLEGAL_VALUE, String.format("您所配置的读取数据库表:%s 不正确. 因为DataX根据您的配置找不到这张表. 请检查您的配置并作出修改." + + "请先了解 DataX 配置.", StringUtils.join(tables, ","))); + } + + tableNum += expandedTables.size(); + + originalConfig.set(String.format("%s[%d].%s", + Constant.CONN_MARK, i, Key.TABLE), expandedTables); + } else { + // 说明是配置的 querySql 方式,不做处理. + } + } + + originalConfig.set(Constant.TABLE_NUMBER_MARK, tableNum); + } + + private static void dealColumnConf(Configuration originalConfig) { + boolean isTableMode = originalConfig.getBool(Constant.IS_TABLE_MODE); + + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, + String.class); + + if (isTableMode) { + if (null == userConfiguredColumns + || userConfiguredColumns.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, "您未配置读取数据库表的列信息. " + + "正确的配置方式是给 column 配置上您需要读取的列名称,用英文逗号分隔. 例如: \"column\": [\"id\", \"name\"],请参考上述配置并作出修改."); + } else { + String splitPk = originalConfig.getString(Key.SPLIT_PK, null); + + if (1 == userConfiguredColumns.size() + && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("您的配置文件中的列配置存在一定的风险. 
因为您未配置读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改."); + // 回填其值,需要以 String 的方式转交后续处理 + originalConfig.set(Key.COLUMN, "*"); + } else { + String jdbcUrl = originalConfig.getString(String.format( + "%s[0].%s", Constant.CONN_MARK, Key.JDBC_URL)); + + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + + String tableName = originalConfig.getString(String.format( + "%s[0].%s[0]", Constant.CONN_MARK, Key.TABLE)); + + List allColumns = DBUtil.getTableColumns( + DATABASE_TYPE, jdbcUrl, username, password, + tableName); + LOG.info("table:[{}] has columns:[{}].", + tableName, StringUtils.join(allColumns, ",")); + // warn:注意mysql表名区分大小写 + allColumns = ListUtil.valueToLowerCase(allColumns); + List quotedColumns = new ArrayList(); + + for (String column : userConfiguredColumns) { + if ("*".equals(column)) { + throw DataXException.asDataXException( + DBUtilErrorCode.ILLEGAL_VALUE, + "您的配置文件中的列配置信息有误. 因为根据您的配置,数据库表的列中存在多个*. 请检查您的配置并作出修改. "); + } + + quotedColumns.add(column); + //以下判断没有任何意义 +// if (null == column) { +// quotedColumns.add(null); +// } else { +// if (allColumns.contains(column.toLowerCase())) { +// quotedColumns.add(column); +// } else { +// // 可能是由于用户填写为函数,或者自己对字段进行了`处理或者常量 +// quotedColumns.add(column); +// } +// } + } + + originalConfig.set(Key.COLUMN_LIST, quotedColumns); + originalConfig.set(Key.COLUMN, + StringUtils.join(quotedColumns, ",")); + if (StringUtils.isNotBlank(splitPk)) { + if (!allColumns.contains(splitPk.toLowerCase())) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + String.format("您的配置文件中的列配置信息有误. 因为根据您的配置,您读取的数据库表:%s 中没有主键名为:%s. 请检查您的配置并作出修改.", tableName, splitPk)); + } + } + + } + } + } else { + // querySql模式,不希望配制 column,那样是混淆不清晰的 + if (null != userConfiguredColumns + && userConfiguredColumns.size() > 0) { + LOG.warn("您的配置有误. 由于您读取数据库表采用了querySql的方式, 所以您不需要再配置 column. 如果您不想看到这条提醒,请移除您源头表中配置中的 column."); + originalConfig.remove(Key.COLUMN); + } + + // querySql模式,不希望配制 where,那样是混淆不清晰的 + String where = originalConfig.getString(Key.WHERE, null); + if (StringUtils.isNotBlank(where)) { + LOG.warn("您的配置有误. 由于您读取数据库表采用了querySql的方式, 所以您不需要再配置 where. 如果您不想看到这条提醒,请移除您源头表中配置中的 where."); + originalConfig.remove(Key.WHERE); + } + + // querySql模式,不希望配制 splitPk,那样是混淆不清晰的 + String splitPk = originalConfig.getString(Key.SPLIT_PK, null); + if (StringUtils.isNotBlank(splitPk)) { + LOG.warn("您的配置有误. 由于您读取数据库表采用了querySql的方式, 所以您不需要再配置 splitPk. 如果您不想看到这条提醒,请移除您源头表中配置中的 splitPk."); + originalConfig.remove(Key.SPLIT_PK); + } + } + + } + + private static boolean recognizeTableOrQuerySqlMode( + Configuration originalConfig) { + List conns = originalConfig.getList(Constant.CONN_MARK, + Object.class); + + List tableModeFlags = new ArrayList(); + List querySqlModeFlags = new ArrayList(); + + String table = null; + String querySql = null; + + boolean isTableMode = false; + boolean isQuerySqlMode = false; + for (int i = 0, len = conns.size(); i < len; i++) { + Configuration connConf = Configuration + .from(conns.get(i).toString()); + table = connConf.getString(Key.TABLE, null); + querySql = connConf.getString(Key.QUERY_SQL, null); + + isTableMode = StringUtils.isNotBlank(table); + tableModeFlags.add(isTableMode); + + isQuerySqlMode = StringUtils.isNotBlank(querySql); + querySqlModeFlags.add(isQuerySqlMode); + + if (false == isTableMode && false == isQuerySqlMode) { + // table 和 querySql 二者均未配制 + throw DataXException.asDataXException( + DBUtilErrorCode.TABLE_QUERYSQL_MISSING, "您的配置有误. 
因为table和querySql应该配置并且只能配置一个. 请检查您的配置并作出修改."); + } else if (true == isTableMode && true == isQuerySqlMode) { + // table 和 querySql 二者均配置 + throw DataXException.asDataXException(DBUtilErrorCode.TABLE_QUERYSQL_MIXED, + "您的配置凌乱了. 因为datax不能同时既配置table又配置querySql.请检查您的配置并作出修改."); + } + } + + // 混合配制 table 和 querySql + if (!ListUtil.checkIfValueSame(tableModeFlags) + || !ListUtil.checkIfValueSame(tableModeFlags)) { + throw DataXException.asDataXException(DBUtilErrorCode.TABLE_QUERYSQL_MIXED, + "您配置凌乱了. 不能同时既配置table又配置querySql. 请检查您的配置并作出修改."); + } + + return tableModeFlags.get(0); + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/PreCheckTask.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/PreCheckTask.java new file mode 100644 index 0000000000..36e96732f5 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/PreCheckTask.java @@ -0,0 +1,100 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.RdbmsException; +import com.alibaba.druid.sql.parser.ParserException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.ResultSet; +import java.util.List; +import java.util.concurrent.Callable; + +/** + * Created by judy.lt on 2015/6/4. + */ +public class PreCheckTask implements Callable{ + private static final Logger LOG = LoggerFactory.getLogger(PreCheckTask.class); + private String userName; + private String password; + private String splitPkId; + private Configuration connection; + private DataBaseType dataBaseType; + + public PreCheckTask(String userName, + String password, + Configuration connection, + DataBaseType dataBaseType, + String splitPkId){ + this.connection = connection; + this.userName=userName; + this.password=password; + this.dataBaseType = dataBaseType; + this.splitPkId = splitPkId; + } + + @Override + public Boolean call() throws DataXException { + String jdbcUrl = this.connection.getString(Key.JDBC_URL); + List querySqls = this.connection.getList(Key.QUERY_SQL, Object.class); + List splitPkSqls = this.connection.getList(Key.SPLIT_PK_SQL, Object.class); + List tables = this.connection.getList(Key.TABLE,Object.class); + Connection conn = DBUtil.getConnectionWithoutRetry(this.dataBaseType, jdbcUrl, + this.userName, password); + int fetchSize = 1; + if(DataBaseType.MySql.equals(dataBaseType) || DataBaseType.DRDS.equals(dataBaseType)) { + fetchSize = Integer.MIN_VALUE; + } + try{ + for (int i=0;i doSplit( + Configuration originalSliceConfig, int adviceNumber) { + boolean isTableMode = originalSliceConfig.getBool(Constant.IS_TABLE_MODE).booleanValue(); + int eachTableShouldSplittedNumber = -1; + if (isTableMode) { + // adviceNumber这里是channel数量大小, 即datax并发task数量 + // eachTableShouldSplittedNumber是单表应该切分的份数, 向上取整可能和adviceNumber没有比例关系了已经 + eachTableShouldSplittedNumber = calculateEachTableShouldSplittedNumber( + adviceNumber, originalSliceConfig.getInt(Constant.TABLE_NUMBER_MARK)); + } + + String column = originalSliceConfig.getString(Key.COLUMN); + String where = originalSliceConfig.getString(Key.WHERE, null); + + List conns = originalSliceConfig.getList(Constant.CONN_MARK, Object.class); + + List 
splittedConfigs = new ArrayList(); + + for (int i = 0, len = conns.size(); i < len; i++) { + Configuration sliceConfig = originalSliceConfig.clone(); + + Configuration connConf = Configuration.from(conns.get(i).toString()); + String jdbcUrl = connConf.getString(Key.JDBC_URL); + sliceConfig.set(Key.JDBC_URL, jdbcUrl); + + // 抽取 jdbcUrl 中的 ip/port 进行资源使用的打标,以提供给 core 做有意义的 shuffle 操作 + sliceConfig.set(CommonConstant.LOAD_BALANCE_RESOURCE_MARK, DataBaseType.parseIpFromJdbcUrl(jdbcUrl)); + + sliceConfig.remove(Constant.CONN_MARK); + + Configuration tempSlice; + + // 说明是配置的 table 方式 + if (isTableMode) { + // 已在之前进行了扩展和`处理,可以直接使用 + List tables = connConf.getList(Key.TABLE, String.class); + + Validate.isTrue(null != tables && !tables.isEmpty(), "您读取数据库表配置错误."); + + String splitPk = originalSliceConfig.getString(Key.SPLIT_PK, null); + + //最终切分份数不一定等于 eachTableShouldSplittedNumber + boolean needSplitTable = eachTableShouldSplittedNumber > 1 + && StringUtils.isNotBlank(splitPk); + if (needSplitTable) { + if (tables.size() == 1) { + //原来:如果是单表的,主键切分num=num*2+1 + // splitPk is null这类的情况的数据量本身就比真实数据量少很多, 和channel大小比率关系时,不建议考虑 + //eachTableShouldSplittedNumber = eachTableShouldSplittedNumber * 2 + 1;// 不应该加1导致长尾 + + //考虑其他比率数字?(splitPk is null, 忽略此长尾) + eachTableShouldSplittedNumber = eachTableShouldSplittedNumber * 5; + } + // 尝试对每个表,切分为eachTableShouldSplittedNumber 份 + for (String table : tables) { + tempSlice = sliceConfig.clone(); + tempSlice.set(Key.TABLE, table); + + List splittedSlices = SingleTableSplitUtil + .splitSingleTable(tempSlice, eachTableShouldSplittedNumber); + + splittedConfigs.addAll(splittedSlices); + } + } else { + for (String table : tables) { + tempSlice = sliceConfig.clone(); + tempSlice.set(Key.TABLE, table); + String queryColumn = HintUtil.buildQueryColumn(jdbcUrl, table, column); + tempSlice.set(Key.QUERY_SQL, SingleTableSplitUtil.buildQuerySql(queryColumn, table, where)); + splittedConfigs.add(tempSlice); + } + } + } else { + // 说明是配置的 querySql 方式 + List sqls = connConf.getList(Key.QUERY_SQL, String.class); + + // TODO 是否check 配置为多条语句?? 
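                // Every querySql entry becomes exactly one task slice below; nothing verifies
                // that an entry holds a single statement. Illustrative reader connection block
                // this branch consumes (keys match Key.QUERY_SQL / Key.JDBC_URL, values are
                // placeholders only):
                //   "connection": [{
                //       "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/db"],
                //       "querySql": ["select id, name from t_order where ds = '20150401'"]
                //   }]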
+ for (String querySql : sqls) { + tempSlice = sliceConfig.clone(); + tempSlice.set(Key.QUERY_SQL, querySql); + splittedConfigs.add(tempSlice); + } + } + + } + + return splittedConfigs; + } + + public static Configuration doPreCheckSplit(Configuration originalSliceConfig) { + Configuration queryConfig = originalSliceConfig.clone(); + boolean isTableMode = originalSliceConfig.getBool(Constant.IS_TABLE_MODE).booleanValue(); + + String splitPK = originalSliceConfig.getString(Key.SPLIT_PK); + String column = originalSliceConfig.getString(Key.COLUMN); + String where = originalSliceConfig.getString(Key.WHERE, null); + + List conns = queryConfig.getList(Constant.CONN_MARK, Object.class); + + for (int i = 0, len = conns.size(); i < len; i++){ + Configuration connConf = Configuration.from(conns.get(i).toString()); + List querys = new ArrayList(); + List splitPkQuerys = new ArrayList(); + String connPath = String.format("connection[%d]",i); + // 说明是配置的 table 方式 + if (isTableMode) { + // 已在之前进行了扩展和`处理,可以直接使用 + List tables = connConf.getList(Key.TABLE, String.class); + Validate.isTrue(null != tables && !tables.isEmpty(), "您读取数据库表配置错误."); + for (String table : tables) { + querys.add(SingleTableSplitUtil.buildQuerySql(column,table,where)); + if (splitPK != null && !splitPK.isEmpty()){ + splitPkQuerys.add(SingleTableSplitUtil.genPKSql(splitPK.trim(),table,where)); + } + } + if (!splitPkQuerys.isEmpty()){ + connConf.set(Key.SPLIT_PK_SQL,splitPkQuerys); + } + connConf.set(Key.QUERY_SQL,querys); + queryConfig.set(connPath,connConf); + } else { + // 说明是配置的 querySql 方式 + List sqls = connConf.getList(Key.QUERY_SQL, + String.class); + for (String querySql : sqls) { + querys.add(querySql); + } + connConf.set(Key.QUERY_SQL,querys); + queryConfig.set(connPath,connConf); + } + } + return queryConfig; + } + + private static int calculateEachTableShouldSplittedNumber(int adviceNumber, + int tableNumber) { + double tempNum = 1.0 * adviceNumber / tableNumber; + + return (int) Math.ceil(tempNum); + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/SingleTableSplitUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/SingleTableSplitUtil.java new file mode 100755 index 0000000000..d9846b3939 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/reader/util/SingleTableSplitUtil.java @@ -0,0 +1,390 @@ +package com.alibaba.datax.plugin.rdbms.reader.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.Constant; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.datax.plugin.rdbms.util.*; +import com.alibaba.fastjson.JSON; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.ImmutablePair; +import org.apache.commons.lang3.tuple.Pair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.math.BigInteger; +import java.sql.Connection; +import java.sql.ResultSet; +import java.sql.ResultSetMetaData; +import java.sql.Types; +import java.util.ArrayList; +import java.util.List; + +public class SingleTableSplitUtil { + private static final Logger LOG = LoggerFactory + .getLogger(SingleTableSplitUtil.class); + + public static DataBaseType DATABASE_TYPE; + + private SingleTableSplitUtil() { + } + + public static List splitSingleTable( + Configuration configuration, int adviceNum) { + List pluginParams = new ArrayList(); + List rangeList; + String 
splitPkName = configuration.getString(Key.SPLIT_PK); + String column = configuration.getString(Key.COLUMN); + String table = configuration.getString(Key.TABLE); + String where = configuration.getString(Key.WHERE, null); + boolean hasWhere = StringUtils.isNotBlank(where); + + //String splitMode = configuration.getString(Key.SPLIT_MODE, ""); + //if (Constant.SPLIT_MODE_RANDOMSAMPLE.equals(splitMode) && DATABASE_TYPE == DataBaseType.Oracle) { + if (DATABASE_TYPE == DataBaseType.Oracle) { + rangeList = genSplitSqlForOracle(splitPkName, table, where, + configuration, adviceNum); + // warn: mysql etc to be added... + } else { + Pair minMaxPK = getPkRange(configuration); + if (null == minMaxPK) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "根据切分主键切分表失败. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + + configuration.set(Key.QUERY_SQL, buildQuerySql(column, table, where)); + if (null == minMaxPK.getLeft() || null == minMaxPK.getRight()) { + // 切分后获取到的start/end 有 Null 的情况 + pluginParams.add(configuration); + return pluginParams; + } + + boolean isStringType = Constant.PK_TYPE_STRING.equals(configuration + .getString(Constant.PK_TYPE)); + boolean isLongType = Constant.PK_TYPE_LONG.equals(configuration + .getString(Constant.PK_TYPE)); + + + if (isStringType) { + rangeList = RdbmsRangeSplitWrap.splitAndWrap( + String.valueOf(minMaxPK.getLeft()), + String.valueOf(minMaxPK.getRight()), adviceNum, + splitPkName, "'", DATABASE_TYPE); + } else if (isLongType) { + rangeList = RdbmsRangeSplitWrap.splitAndWrap( + new BigInteger(minMaxPK.getLeft().toString()), + new BigInteger(minMaxPK.getRight().toString()), + adviceNum, splitPkName); + } else { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } + String tempQuerySql; + List allQuerySql = new ArrayList(); + + if (null != rangeList && !rangeList.isEmpty()) { + for (String range : rangeList) { + Configuration tempConfig = configuration.clone(); + + tempQuerySql = buildQuerySql(column, table, where) + + (hasWhere ? " and " : " where ") + range; + + allQuerySql.add(tempQuerySql); + tempConfig.set(Key.QUERY_SQL, tempQuerySql); + pluginParams.add(tempConfig); + } + } else { + //pluginParams.add(configuration); // this is wrong for new & old split + Configuration tempConfig = configuration.clone(); + tempQuerySql = buildQuerySql(column, table, where) + + (hasWhere ? " and " : " where ") + + String.format(" %s IS NOT NULL", splitPkName); + allQuerySql.add(tempQuerySql); + tempConfig.set(Key.QUERY_SQL, tempQuerySql); + pluginParams.add(tempConfig); + } + + // deal pk is null + Configuration tempConfig = configuration.clone(); + tempQuerySql = buildQuerySql(column, table, where) + + (hasWhere ? 
" and " : " where ") + + String.format(" %s IS NULL", splitPkName); + + allQuerySql.add(tempQuerySql); + + LOG.info("After split(), allQuerySql=[\n{}\n].", + StringUtils.join(allQuerySql, "\n")); + + tempConfig.set(Key.QUERY_SQL, tempQuerySql); + pluginParams.add(tempConfig); + + return pluginParams; + } + + public static String buildQuerySql(String column, String table, + String where) { + String querySql; + + if (StringUtils.isBlank(where)) { + querySql = String.format(Constant.QUERY_SQL_TEMPLATE_WITHOUT_WHERE, + column, table); + } else { + querySql = String.format(Constant.QUERY_SQL_TEMPLATE, column, + table, where); + } + + return querySql; + } + + @SuppressWarnings("resource") + private static Pair getPkRange(Configuration configuration) { + String pkRangeSQL = genPKRangeSQL(configuration); + + int fetchSize = configuration.getInt(Constant.FETCH_SIZE); + String jdbcURL = configuration.getString(Key.JDBC_URL); + String username = configuration.getString(Key.USERNAME); + String password = configuration.getString(Key.PASSWORD); + String table = configuration.getString(Key.TABLE); + + Connection conn = DBUtil.getConnection(DATABASE_TYPE, jdbcURL, username, password); + Pair minMaxPK = checkSplitPk(conn, pkRangeSQL, fetchSize, table, username, configuration); + DBUtil.closeDBResources(null, null, conn); + return minMaxPK; + } + + public static void precheckSplitPk(Connection conn, String pkRangeSQL, int fetchSize, + String table, String username) { + Pair minMaxPK = checkSplitPk(conn, pkRangeSQL, fetchSize, table, username, null); + if (null == minMaxPK) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "根据切分主键切分表失败. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } + + /** + * 检测splitPk的配置是否正确。 + * configuration为null, 是precheck的逻辑,不需要回写PK_TYPE到configuration中 + * + */ + private static Pair checkSplitPk(Connection conn, String pkRangeSQL, int fetchSize, String table, + String username, Configuration configuration) { + LOG.info("split pk [sql={}] is running... ", pkRangeSQL); + ResultSet rs = null; + Pair minMaxPK = null; + try { + try { + rs = DBUtil.query(conn, pkRangeSQL, fetchSize); + }catch (Exception e) { + throw RdbmsException.asQueryException(DATABASE_TYPE, e, pkRangeSQL,table,username); + } + ResultSetMetaData rsMetaData = rs.getMetaData(); + if (isPKTypeValid(rsMetaData)) { + if (isStringType(rsMetaData.getColumnType(1))) { + if(configuration != null) { + configuration + .set(Constant.PK_TYPE, Constant.PK_TYPE_STRING); + } + while (DBUtil.asyncResultSetNext(rs)) { + minMaxPK = new ImmutablePair( + rs.getString(1), rs.getString(2)); + } + } else if (isLongType(rsMetaData.getColumnType(1))) { + if(configuration != null) { + configuration.set(Constant.PK_TYPE, Constant.PK_TYPE_LONG); + } + + while (DBUtil.asyncResultSetNext(rs)) { + minMaxPK = new ImmutablePair( + rs.getString(1), rs.getString(2)); + + // check: string shouldn't contain '.', for oracle + String minMax = rs.getString(1) + rs.getString(2); + if (StringUtils.contains(minMax, '.')) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的DataX切分主键(splitPk)有误. 因为您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } + } else { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的DataX切分主键(splitPk)有误. 因为您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 
请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } else { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的DataX切分主键(splitPk)有误. 因为您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } catch(DataXException e) { + throw e; + } catch (Exception e) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, "DataX尝试切分表发生错误. 请检查您的配置并作出修改.", e); + } finally { + DBUtil.closeDBResources(rs, null, null); + } + + return minMaxPK; + } + + private static boolean isPKTypeValid(ResultSetMetaData rsMetaData) { + boolean ret = false; + try { + int minType = rsMetaData.getColumnType(1); + int maxType = rsMetaData.getColumnType(2); + + boolean isNumberType = isLongType(minType); + + boolean isStringType = isStringType(minType); + + if (minType == maxType && (isNumberType || isStringType)) { + ret = true; + } + } catch (Exception e) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "DataX获取切分主键(splitPk)字段类型失败. 该错误通常是系统底层异常导致. 请联系旺旺:askdatax或者DBA处理."); + } + return ret; + } + + // warn: Types.NUMERIC is used for oracle! because oracle use NUMBER to + // store INT, SMALLINT, INTEGER etc, and only oracle need to concern + // Types.NUMERIC + private static boolean isLongType(int type) { + boolean isValidLongType = type == Types.BIGINT || type == Types.INTEGER + || type == Types.SMALLINT || type == Types.TINYINT; + + switch (SingleTableSplitUtil.DATABASE_TYPE) { + case Oracle: + isValidLongType |= type == Types.NUMERIC; + break; + default: + break; + } + return isValidLongType; + } + + private static boolean isStringType(int type) { + return type == Types.CHAR || type == Types.NCHAR + || type == Types.VARCHAR || type == Types.LONGVARCHAR + || type == Types.NVARCHAR; + } + + private static String genPKRangeSQL(Configuration configuration) { + + String splitPK = configuration.getString(Key.SPLIT_PK).trim(); + String table = configuration.getString(Key.TABLE).trim(); + String where = configuration.getString(Key.WHERE, null); + return genPKSql(splitPK,table,where); + } + + public static String genPKSql(String splitPK, String table, String where){ + + String minMaxTemplate = "SELECT MIN(%s),MAX(%s) FROM %s"; + String pkRangeSQL = String.format(minMaxTemplate, splitPK, splitPK, + table); + if (StringUtils.isNotBlank(where)) { + pkRangeSQL = String.format("%s WHERE (%s AND %s IS NOT NULL)", + pkRangeSQL, where, splitPK); + } + return pkRangeSQL; + } + + /** + * support Number and String split + * */ + public static List genSplitSqlForOracle(String splitPK, + String table, String where, Configuration configuration, + int adviceNum) { + if (adviceNum < 1) { + throw new IllegalArgumentException(String.format( + "切分份数不能小于1. 
此处:adviceNum=[%s].", adviceNum)); + } else if (adviceNum == 1) { + return null; + } + String whereSql = String.format("%s IS NOT NULL", splitPK); + if (StringUtils.isNotBlank(where)) { + whereSql = String.format(" WHERE (%s) AND (%s) ", whereSql, where); + } else { + whereSql = String.format(" WHERE (%s) ", whereSql); + } + Double percentage = configuration.getDouble(Key.SAMPLE_PERCENTAGE, 0.1); + String sampleSqlTemplate = "SELECT * FROM ( SELECT %s FROM %s SAMPLE (%s) %s ORDER BY DBMS_RANDOM.VALUE) WHERE ROWNUM <= %s ORDER by %s ASC"; + String splitSql = String.format(sampleSqlTemplate, splitPK, table, + percentage, whereSql, adviceNum, splitPK); + + int fetchSize = configuration.getInt(Constant.FETCH_SIZE, 32); + String jdbcURL = configuration.getString(Key.JDBC_URL); + String username = configuration.getString(Key.USERNAME); + String password = configuration.getString(Key.PASSWORD); + Connection conn = DBUtil.getConnection(DATABASE_TYPE, jdbcURL, + username, password); + LOG.info("split pk [sql={}] is running... ", splitSql); + ResultSet rs = null; + List> splitedRange = new ArrayList>(); + try { + try { + rs = DBUtil.query(conn, splitSql, fetchSize); + } catch (Exception e) { + throw RdbmsException.asQueryException(DATABASE_TYPE, e, + splitSql, table, username); + } + if (configuration != null) { + configuration + .set(Constant.PK_TYPE, Constant.PK_TYPE_MONTECARLO); + } + ResultSetMetaData rsMetaData = rs.getMetaData(); + while (DBUtil.asyncResultSetNext(rs)) { + ImmutablePair eachPoint = new ImmutablePair( + rs.getObject(1), rsMetaData.getColumnType(1)); + splitedRange.add(eachPoint); + } + } catch (DataXException e) { + throw e; + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "DataX尝试切分表发生错误. 请检查您的配置并作出修改.", e); + } finally { + DBUtil.closeDBResources(rs, null, null); + } + LOG.debug(JSON.toJSONString(splitedRange)); + List rangeSql = new ArrayList(); + int splitedRangeSize = splitedRange.size(); + // warn: splitedRangeSize may be 0 or 1,切分规则为IS NULL以及 IS NOT NULL + // demo: Parameter rangeResult can not be null and its length can not <2. detail:rangeResult=[24999930]. + if (splitedRangeSize >= 2) { + // warn: oracle Number is long type here + if (isLongType(splitedRange.get(0).getRight())) { + BigInteger[] integerPoints = new BigInteger[splitedRange.size()]; + for (int i = 0; i < splitedRangeSize; i++) { + integerPoints[i] = new BigInteger(splitedRange.get(i) + .getLeft().toString()); + } + rangeSql.addAll(RdbmsRangeSplitWrap.wrapRange(integerPoints, + splitPK)); + // its ok if splitedRangeSize is 1 + rangeSql.add(RdbmsRangeSplitWrap.wrapFirstLastPoint( + integerPoints[0], integerPoints[splitedRangeSize - 1], + splitPK)); + } else if (isStringType(splitedRange.get(0).getRight())) { + // warn: treated as string type + String[] stringPoints = new String[splitedRange.size()]; + for (int i = 0; i < splitedRangeSize; i++) { + stringPoints[i] = new String(splitedRange.get(i).getLeft() + .toString()); + } + rangeSql.addAll(RdbmsRangeSplitWrap.wrapRange(stringPoints, + splitPK, "'", DATABASE_TYPE)); + // its ok if splitedRangeSize is 1 + rangeSql.add(RdbmsRangeSplitWrap.wrapFirstLastPoint( + stringPoints[0], stringPoints[splitedRangeSize - 1], + splitPK, "'", DATABASE_TYPE)); + } else { + throw DataXException + .asDataXException( + DBUtilErrorCode.ILLEGAL_SPLIT_PK, + "您配置的DataX切分主键(splitPk)有误. 因为您配置的切分主键(splitPk) 类型 DataX 不支持. DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型. 
请尝试使用其他的切分主键或者联系 DBA 进行处理."); + } + } + return rangeSql; + } +} \ No newline at end of file diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/ConnectionFactory.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/ConnectionFactory.java new file mode 100644 index 0000000000..3aef46b355 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/ConnectionFactory.java @@ -0,0 +1,16 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import java.sql.Connection; + +/** + * Date: 15/3/16 下午2:17 + */ +public interface ConnectionFactory { + + public Connection getConnecttion(); + + public Connection getConnecttionWithoutRetry(); + + public String getConnectionInfo(); + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/Constant.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/Constant.java new file mode 100755 index 0000000000..68ec400003 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/Constant.java @@ -0,0 +1,27 @@ +package com.alibaba.datax.plugin.rdbms.util; + +public final class Constant { + static final int TIMEOUT_SECONDS = 15; + static final int MAX_TRY_TIMES = 4; + static final int SOCKET_TIMEOUT_INSECOND = 172800; + + public static final String MYSQL_DATABASE = "Unknown database"; + public static final String MYSQL_CONNEXP = "Communications link failure"; + public static final String MYSQL_ACCDENIED = "Access denied"; + public static final String MYSQL_TABLE_NAME_ERR1 = "Table"; + public static final String MYSQL_TABLE_NAME_ERR2 = "doesn't exist"; + public static final String MYSQL_SELECT_PRI = "SELECT command denied to user"; + public static final String MYSQL_COLUMN1 = "Unknown column"; + public static final String MYSQL_COLUMN2 = "field list"; + public static final String MYSQL_WHERE = "where clause"; + + public static final String ORACLE_DATABASE = "ORA-12505"; + public static final String ORACLE_CONNEXP = "The Network Adapter could not establish the connection"; + public static final String ORACLE_ACCDENIED = "ORA-01017"; + public static final String ORACLE_TABLE_NAME = "table or view does not exist"; + public static final String ORACLE_SELECT_PRI = "insufficient privileges"; + public static final String ORACLE_SQL = "invalid identifier"; + + + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtil.java new file mode 100755 index 0000000000..63d1621b34 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtil.java @@ -0,0 +1,803 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.RetryUtil; +import com.alibaba.datax.plugin.rdbms.reader.Key; +import com.alibaba.druid.sql.parser.SQLParserUtils; +import com.alibaba.druid.sql.parser.SQLStatementParser; +import com.google.common.util.concurrent.ThreadFactoryBuilder; + +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.ImmutableTriple; +import org.apache.commons.lang3.tuple.Triple; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.sql.*; +import java.util.*; +import java.util.concurrent.*; + +public final class DBUtil { + private static final Logger LOG = 
LoggerFactory.getLogger(DBUtil.class); + + private static final ThreadLocal rsExecutors = new ThreadLocal() { + @Override + protected ExecutorService initialValue() { + return Executors.newFixedThreadPool(1, new ThreadFactoryBuilder() + .setNameFormat("rsExecutors-%d") + .setDaemon(true) + .build()); + } + }; + + private DBUtil() { + } + + public static String chooseJdbcUrl(final DataBaseType dataBaseType, + final List jdbcUrls, final String username, + final String password, final List preSql, + final boolean checkSlave) { + + if (null == jdbcUrls || jdbcUrls.isEmpty()) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format("您的jdbcUrl的配置信息有错, 因为jdbcUrl[%s]不能为空. 请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ","))); + } + + try { + return RetryUtil.executeWithRetry(new Callable() { + + @Override + public String call() throws Exception { + boolean connOK = false; + for (String url : jdbcUrls) { + if (StringUtils.isNotBlank(url)) { + url = url.trim(); + if (null != preSql && !preSql.isEmpty()) { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, preSql); + } else { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, checkSlave); + } + if (connOK) { + return url; + } + } + } + throw new Exception("DataX无法连接对应的数据库,可能原因是:1) 配置的ip/port/database/jdbc错误,无法连接。2) 配置的username/password错误,鉴权失败。请和DBA确认该数据库的连接信息是否正确。"); +// throw new Exception(DBUtilErrorCode.JDBC_NULL.toString()); + } + }, 7, 1000L, true); + //warn: 7 means 2 minutes + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息,无法从:%s 中找到可连接的jdbcUrl. 请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ",")), e); + } + } + + public static String chooseJdbcUrlWithoutRetry(final DataBaseType dataBaseType, + final List jdbcUrls, final String username, + final String password, final List preSql, + final boolean checkSlave) throws DataXException { + + if (null == jdbcUrls || jdbcUrls.isEmpty()) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format("您的jdbcUrl的配置信息有错, 因为jdbcUrl[%s]不能为空. 请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ","))); + } + + boolean connOK = false; + for (String url : jdbcUrls) { + if (StringUtils.isNotBlank(url)) { + url = url.trim(); + if (null != preSql && !preSql.isEmpty()) { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, preSql); + } else { + try { + connOK = testConnWithoutRetry(dataBaseType, + url, username, password, checkSlave); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息,无法从:%s 中找到可连接的jdbcUrl. 请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ",")), e); + } + } + if (connOK) { + return url; + } + } + } + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息,无法从:%s 中找到可连接的jdbcUrl. 
请检查您的配置并作出修改.", + StringUtils.join(jdbcUrls, ","))); + } + + /** + * 检查slave的库中的数据是否已到凌晨00:00 + * 如果slave同步的数据还未到00:00返回false + * 否则范围true + * + * @author ZiChi + * @version 1.0 2014-12-01 + */ + private static boolean isSlaveBehind(Connection conn) { + try { + ResultSet rs = query(conn, "SHOW VARIABLES LIKE 'read_only'"); + if (DBUtil.asyncResultSetNext(rs)) { + String readOnly = rs.getString("Value"); + if ("ON".equalsIgnoreCase(readOnly)) { //备库 + ResultSet rs1 = query(conn, "SHOW SLAVE STATUS"); + if (DBUtil.asyncResultSetNext(rs1)) { + String ioRunning = rs1.getString("Slave_IO_Running"); + String sqlRunning = rs1.getString("Slave_SQL_Running"); + long secondsBehindMaster = rs1.getLong("Seconds_Behind_Master"); + if ("Yes".equalsIgnoreCase(ioRunning) && "Yes".equalsIgnoreCase(sqlRunning)) { + ResultSet rs2 = query(conn, "SELECT TIMESTAMPDIFF(SECOND, CURDATE(), NOW())"); + DBUtil.asyncResultSetNext(rs2); + long secondsOfDay = rs2.getLong(1); + return secondsBehindMaster > secondsOfDay; + } else { + return true; + } + } else { + LOG.warn("SHOW SLAVE STATUS has no result"); + } + } + } else { + LOG.warn("SHOW VARIABLES like 'read_only' has no result"); + } + } catch (Exception e) { + LOG.warn("checkSlave failed, errorMessage:[{}].", e.getMessage()); + } + return false; + } + + /** + * 检查表是否具有insert 权限 + * insert on *.* 或者 insert on database.* 时验证通过 + * 当insert on database.tableName时,确保tableList中的所有table有insert 权限,验证通过 + * 其它验证都不通过 + * + * @author ZiChi + * @version 1.0 2015-01-28 + */ + public static boolean hasInsertPrivilege(DataBaseType dataBaseType, String jdbcURL, String userName, String password, List tableList) { + /*准备参数*/ + + String[] urls = jdbcURL.split("/"); + String dbName; + if (urls != null && urls.length != 0) { + dbName = urls[3]; + }else{ + return false; + } + + String dbPattern = "`" + dbName + "`.*"; + Collection tableNames = new HashSet(tableList.size()); + tableNames.addAll(tableList); + + Connection connection = connect(dataBaseType, jdbcURL, userName, password); + try { + ResultSet rs = query(connection, "SHOW GRANTS FOR " + userName); + while (DBUtil.asyncResultSetNext(rs)) { + String grantRecord = rs.getString("Grants for " + userName + "@%"); + String[] params = grantRecord.split("\\`"); + if (params != null && params.length >= 3) { + String tableName = params[3]; + if (params[0].contains("INSERT") && !tableName.equals("*") && tableNames.contains(tableName)) + tableNames.remove(tableName); + } else { + if (grantRecord.contains("INSERT") ||grantRecord.contains("ALL PRIVILEGES")) { + if (grantRecord.contains("*.*")) + return true; + else if (grantRecord.contains(dbPattern)) { + return true; + } + } + } + } + } catch (Exception e) { + LOG.warn("Check the database has the Insert Privilege failed, errorMessage:[{}]", e.getMessage()); + } + if (tableNames.isEmpty()) + return true; + return false; + } + + + public static boolean checkInsertPrivilege(DataBaseType dataBaseType, String jdbcURL, String userName, String password, List tableList) { + Connection connection = connect(dataBaseType, jdbcURL, userName, password); + String insertTemplate = "insert into %s(select * from %s where 1 = 2)"; + + boolean hasInsertPrivilege = true; + Statement insertStmt = null; + for(String tableName : tableList) { + String checkInsertPrivilegeSql = String.format(insertTemplate, tableName, tableName); + try { + insertStmt = connection.createStatement(); + executeSqlWithoutResultSet(insertStmt, checkInsertPrivilegeSql); + } catch (Exception e) { + 
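                // The handling below is narrowed only for Oracle: the probe INSERT counts as a
                // missing privilege solely when the driver message contains "insufficient
                // privileges" (typically ORA-01031); for every other database any exception
                // from the probe statement marks the INSERT-privilege check as failed.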
if(DataBaseType.Oracle.equals(dataBaseType)) { + if(e.getMessage() != null && e.getMessage().contains("insufficient privileges")) { + hasInsertPrivilege = false; + LOG.warn("User [" + userName +"] has no 'insert' privilege on table[" + tableName + "], errorMessage:[{}]", e.getMessage()); + } + } else { + hasInsertPrivilege = false; + LOG.warn("User [" + userName + "] has no 'insert' privilege on table[" + tableName + "], errorMessage:[{}]", e.getMessage()); + } + } + } + try { + connection.close(); + } catch (SQLException e) { + LOG.warn("connection close failed, " + e.getMessage()); + } + return hasInsertPrivilege; + } + + public static boolean checkDeletePrivilege(DataBaseType dataBaseType,String jdbcURL, String userName, String password, List tableList) { + Connection connection = connect(dataBaseType, jdbcURL, userName, password); + String deleteTemplate = "delete from %s WHERE 1 = 2"; + + boolean hasInsertPrivilege = true; + Statement deleteStmt = null; + for(String tableName : tableList) { + String checkDeletePrivilegeSQL = String.format(deleteTemplate, tableName); + try { + deleteStmt = connection.createStatement(); + executeSqlWithoutResultSet(deleteStmt, checkDeletePrivilegeSQL); + } catch (Exception e) { + hasInsertPrivilege = false; + LOG.warn("User [" + userName +"] has no 'delete' privilege on table[" + tableName + "], errorMessage:[{}]", e.getMessage()); + } + } + try { + connection.close(); + } catch (SQLException e) { + LOG.warn("connection close failed, " + e.getMessage()); + } + return hasInsertPrivilege; + } + + public static boolean needCheckDeletePrivilege(Configuration originalConfig) { + List allSqls =new ArrayList(); + List preSQLs = originalConfig.getList(Key.PRE_SQL, String.class); + List postSQLs = originalConfig.getList(Key.POST_SQL, String.class); + if (preSQLs != null && !preSQLs.isEmpty()){ + allSqls.addAll(preSQLs); + } + if (postSQLs != null && !postSQLs.isEmpty()){ + allSqls.addAll(postSQLs); + } + for(String sql : allSqls) { + if(StringUtils.isNotBlank(sql)) { + if (sql.trim().toUpperCase().startsWith("DELETE")) { + return true; + } + } + } + return false; + } + + /** + * Get direct JDBC connection + *

+ * if connecting failed, try to connect for MAX_TRY_TIMES times + *

+ * NOTE: In DataX, we don't need connection pool in fact + */ + public static Connection getConnection(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password) { + + return getConnection(dataBaseType, jdbcUrl, username, password, String.valueOf(Constant.SOCKET_TIMEOUT_INSECOND * 1000)); + } + + /** + * + * @param dataBaseType + * @param jdbcUrl + * @param username + * @param password + * @param socketTimeout 设置socketTimeout,单位ms,String类型 + * @return + */ + public static Connection getConnection(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password, final String socketTimeout) { + + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public Connection call() throws Exception { + return DBUtil.connect(dataBaseType, jdbcUrl, username, + password, socketTimeout); + } + }, 9, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.CONN_DB_ERROR, + String.format("数据库连接失败. 因为根据您配置的连接信息:%s获取数据库连接失败. 请检查您的配置并作出修改.", jdbcUrl), e); + } + } + + /** + * Get direct JDBC connection + *

+ * if connecting failed, try to connect for MAX_TRY_TIMES times + *

+ * NOTE: In DataX, we don't need connection pool in fact + */ + public static Connection getConnectionWithoutRetry(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password) { + return getConnectionWithoutRetry(dataBaseType, jdbcUrl, username, + password, String.valueOf(Constant.SOCKET_TIMEOUT_INSECOND * 1000)); + } + + public static Connection getConnectionWithoutRetry(final DataBaseType dataBaseType, + final String jdbcUrl, final String username, final String password, String socketTimeout) { + return DBUtil.connect(dataBaseType, jdbcUrl, username, + password, socketTimeout); + } + + private static synchronized Connection connect(DataBaseType dataBaseType, + String url, String user, String pass) { + return connect(dataBaseType, url, user, pass, String.valueOf(Constant.SOCKET_TIMEOUT_INSECOND * 1000)); + } + + private static synchronized Connection connect(DataBaseType dataBaseType, + String url, String user, String pass, String socketTimeout) { + + //ob10的处理 + if (url.startsWith(com.alibaba.datax.plugin.rdbms.writer.Constant.OB10_SPLIT_STRING) && dataBaseType == DataBaseType.MySql) { + String[] ss = url.split(com.alibaba.datax.plugin.rdbms.writer.Constant.OB10_SPLIT_STRING_PATTERN); + if (ss.length != 3) { + throw DataXException + .asDataXException( + DBUtilErrorCode.JDBC_OB10_ADDRESS_ERROR, "JDBC OB10格式错误,请联系askdatax"); + } + LOG.info("this is ob1_0 jdbc url."); + user = ss[1].trim() +":"+user; + url = ss[2]; + LOG.info("this is ob1_0 jdbc url. user="+user+" :url="+url); + } + + Properties prop = new Properties(); + prop.put("user", user); + prop.put("password", pass); + + if (dataBaseType == DataBaseType.Oracle) { + //oracle.net.READ_TIMEOUT for jdbc versions < 10.1.0.5 oracle.jdbc.ReadTimeout for jdbc versions >=10.1.0.5 + // unit ms + prop.put("oracle.jdbc.ReadTimeout", socketTimeout); + } + + return connect(dataBaseType, url, prop); + } + + private static synchronized Connection connect(DataBaseType dataBaseType, + String url, Properties prop) { + try { + Class.forName(dataBaseType.getDriverClassName()); + DriverManager.setLoginTimeout(Constant.TIMEOUT_SECONDS); + return DriverManager.getConnection(url, prop); + } catch (Exception e) { + throw RdbmsException.asConnException(dataBaseType, e, prop.getProperty("user"), null); + } + } + + /** + * a wrapped method to execute select-like sql statement . + * + * @param conn Database connection . + * @param sql sql statement to be executed + * @return a {@link ResultSet} + * @throws SQLException if occurs SQLException. + */ + public static ResultSet query(Connection conn, String sql, int fetchSize) + throws SQLException { + // 默认3600 s 的query Timeout + return query(conn, sql, fetchSize, Constant.SOCKET_TIMEOUT_INSECOND); + } + + /** + * a wrapped method to execute select-like sql statement . + * + * @param conn Database connection . + * @param sql sql statement to be executed + * @param fetchSize + * @param queryTimeout unit:second + * @return + * @throws SQLException + */ + public static ResultSet query(Connection conn, String sql, int fetchSize, int queryTimeout) + throws SQLException { + // make sure autocommit is off + conn.setAutoCommit(false); + Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, + ResultSet.CONCUR_READ_ONLY); + stmt.setFetchSize(fetchSize); + stmt.setQueryTimeout(queryTimeout); + return query(stmt, sql); + } + + /** + * a wrapped method to execute select-like sql statement . 
+ * + * @param stmt {@link Statement} + * @param sql sql statement to be executed + * @return a {@link ResultSet} + * @throws SQLException if occurs SQLException. + */ + public static ResultSet query(Statement stmt, String sql) + throws SQLException { + return stmt.executeQuery(sql); + } + + public static void executeSqlWithoutResultSet(Statement stmt, String sql) + throws SQLException { + stmt.execute(sql); + } + + /** + * Close {@link ResultSet}, {@link Statement} referenced by this + * {@link ResultSet} + * + * @param rs {@link ResultSet} to be closed + * @throws IllegalArgumentException + */ + public static void closeResultSet(ResultSet rs) { + try { + if (null != rs) { + Statement stmt = rs.getStatement(); + if (null != stmt) { + stmt.close(); + stmt = null; + } + rs.close(); + } + rs = null; + } catch (SQLException e) { + throw new IllegalStateException(e); + } + } + + public static void closeDBResources(ResultSet rs, Statement stmt, + Connection conn) { + if (null != rs) { + try { + rs.close(); + } catch (SQLException unused) { + } + } + + if (null != stmt) { + try { + stmt.close(); + } catch (SQLException unused) { + } + } + + if (null != conn) { + try { + conn.close(); + } catch (SQLException unused) { + } + } + } + + public static void closeDBResources(Statement stmt, Connection conn) { + closeDBResources(null, stmt, conn); + } + + public static List getTableColumns(DataBaseType dataBaseType, + String jdbcUrl, String user, String pass, String tableName) { + Connection conn = getConnection(dataBaseType, jdbcUrl, user, pass); + return getTableColumnsByConn(dataBaseType, conn, tableName, "jdbcUrl:"+jdbcUrl); + } + + public static List getTableColumnsByConn(DataBaseType dataBaseType, Connection conn, String tableName, String basicMsg) { + List columns = new ArrayList(); + Statement statement = null; + ResultSet rs = null; + String queryColumnSql = null; + try { + statement = conn.createStatement(); + queryColumnSql = String.format("select * from %s where 1=2", + tableName); + rs = statement.executeQuery(queryColumnSql); + ResultSetMetaData rsMetaData = rs.getMetaData(); + for (int i = 0, len = rsMetaData.getColumnCount(); i < len; i++) { + columns.add(rsMetaData.getColumnName(i + 1)); + } + + } catch (SQLException e) { + throw RdbmsException.asQueryException(dataBaseType,e,queryColumnSql,tableName,null); + } finally { + DBUtil.closeDBResources(rs, statement, conn); + } + + return columns; + } + + /** + * @return Left:ColumnName Middle:ColumnType Right:ColumnTypeName + */ + public static Triple, List, List> getColumnMetaData( + DataBaseType dataBaseType, String jdbcUrl, String user, + String pass, String tableName, String column) { + Connection conn = null; + try { + conn = getConnection(dataBaseType, jdbcUrl, user, pass); + return getColumnMetaData(conn, tableName, column); + } finally { + DBUtil.closeDBResources(null, null, conn); + } + } + + /** + * @return Left:ColumnName Middle:ColumnType Right:ColumnTypeName + */ + public static Triple, List, List> getColumnMetaData( + Connection conn, String tableName, String column) { + Statement statement = null; + ResultSet rs = null; + + Triple, List, List> columnMetaData = new ImmutableTriple, List, List>( + new ArrayList(), new ArrayList(), + new ArrayList()); + try { + statement = conn.createStatement(); + String queryColumnSql = "select " + column + " from " + tableName + + " where 1=2"; + + rs = statement.executeQuery(queryColumnSql); + ResultSetMetaData rsMetaData = rs.getMetaData(); + for (int i = 0, len = 
rsMetaData.getColumnCount(); i < len; i++) { + + columnMetaData.getLeft().add(rsMetaData.getColumnName(i + 1)); + columnMetaData.getMiddle().add(rsMetaData.getColumnType(i + 1)); + columnMetaData.getRight().add( + rsMetaData.getColumnTypeName(i + 1)); + } + return columnMetaData; + + } catch (SQLException e) { + throw DataXException + .asDataXException(DBUtilErrorCode.GET_COLUMN_INFO_FAILED, + String.format("获取表:%s 的字段的元信息时失败. 请联系 DBA 核查该库、表信息.", tableName), e); + } finally { + DBUtil.closeDBResources(rs, statement, null); + } + } + + public static boolean testConnWithoutRetry(DataBaseType dataBaseType, + String url, String user, String pass, boolean checkSlave){ + Connection connection = null; + + try { + connection = connect(dataBaseType, url, user, pass); + if (connection != null) { + if (dataBaseType.equals(dataBaseType.MySql) && checkSlave) { + //dataBaseType.MySql + boolean connOk = !isSlaveBehind(connection); + return connOk; + } else { + return true; + } + } + } catch (Exception e) { + LOG.warn("test connection of [{}] failed, for {}.", url, + e.getMessage()); + } finally { + DBUtil.closeDBResources(null, connection); + } + return false; + } + + public static boolean testConnWithoutRetry(DataBaseType dataBaseType, + String url, String user, String pass, List preSql) { + Connection connection = null; + try { + connection = connect(dataBaseType, url, user, pass); + if (null != connection) { + for (String pre : preSql) { + if (doPreCheck(connection, pre) == false) { + LOG.warn("doPreCheck failed."); + return false; + } + } + return true; + } + } catch (Exception e) { + LOG.warn("test connection of [{}] failed, for {}.", url, + e.getMessage()); + } finally { + DBUtil.closeDBResources(null, connection); + } + + return false; + } + + public static boolean isOracleMaster(final String url, final String user, final String pass) { + try { + return RetryUtil.executeWithRetry(new Callable() { + @Override + public Boolean call() throws Exception { + Connection conn = null; + try { + conn = connect(DataBaseType.Oracle, url, user, pass); + ResultSet rs = query(conn, "select DATABASE_ROLE from V$DATABASE"); + if (DBUtil.asyncResultSetNext(rs, 5)) { + String role = rs.getString("DATABASE_ROLE"); + return "PRIMARY".equalsIgnoreCase(role); + } + throw DataXException.asDataXException(DBUtilErrorCode.RS_ASYNC_ERROR, + String.format("select DATABASE_ROLE from V$DATABASE failed,请检查您的jdbcUrl:%s.", url)); + } finally { + DBUtil.closeDBResources(null, conn); + } + } + }, 3, 1000L, true); + } catch (Exception e) { + throw DataXException.asDataXException(DBUtilErrorCode.CONN_DB_ERROR, + String.format("select DATABASE_ROLE from V$DATABASE failed, url: %s", url), e); + } + } + + public static ResultSet query(Connection conn, String sql) + throws SQLException { + Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, + ResultSet.CONCUR_READ_ONLY); + //默认3600 seconds + stmt.setQueryTimeout(Constant.SOCKET_TIMEOUT_INSECOND); + return query(stmt, sql); + } + + private static boolean doPreCheck(Connection conn, String pre) { + ResultSet rs = null; + try { + rs = query(conn, pre); + + int checkResult = -1; + if (DBUtil.asyncResultSetNext(rs)) { + checkResult = rs.getInt(1); + if (DBUtil.asyncResultSetNext(rs)) { + LOG.warn( + "pre check failed. It should return one result:0, pre:[{}].", + pre); + return false; + } + + } + + if (0 == checkResult) { + return true; + } + + LOG.warn( + "pre check failed. It should return one result:0, pre:[{}].", + pre); + } catch (Exception e) { + LOG.warn("pre check failed. 
pre:[{}], errorMessage:[{}].", pre, + e.getMessage()); + } finally { + DBUtil.closeResultSet(rs); + } + return false; + } + + // warn:until now, only oracle need to handle session config. + public static void dealWithSessionConfig(Connection conn, + Configuration config, DataBaseType databaseType, String message) { + List sessionConfig = null; + switch (databaseType) { + case Oracle: + sessionConfig = config.getList(Key.SESSION, + new ArrayList(), String.class); + DBUtil.doDealWithSessionConfig(conn, sessionConfig, message); + break; + case DRDS: + // 用于关闭 drds 的分布式事务开关 + sessionConfig = new ArrayList(); + sessionConfig.add("set transaction policy 4"); + DBUtil.doDealWithSessionConfig(conn, sessionConfig, message); + break; + case MySql: + sessionConfig = config.getList(Key.SESSION, + new ArrayList(), String.class); + DBUtil.doDealWithSessionConfig(conn, sessionConfig, message); + break; + default: + break; + } + } + + private static void doDealWithSessionConfig(Connection conn, + List sessions, String message) { + if (null == sessions || sessions.isEmpty()) { + return; + } + + Statement stmt; + try { + stmt = conn.createStatement(); + } catch (SQLException e) { + throw DataXException + .asDataXException(DBUtilErrorCode.SET_SESSION_ERROR, String + .format("session配置有误. 因为根据您的配置执行 session 设置失败. 上下文信息是:[%s]. 请检查您的配置并作出修改.", message), + e); + } + + for (String sessionSql : sessions) { + LOG.info("execute sql:[{}]", sessionSql); + try { + DBUtil.executeSqlWithoutResultSet(stmt, sessionSql); + } catch (SQLException e) { + throw DataXException.asDataXException( + DBUtilErrorCode.SET_SESSION_ERROR, String.format( + "session配置有误. 因为根据您的配置执行 session 设置失败. 上下文信息是:[%s]. 请检查您的配置并作出修改.", message), e); + } + } + DBUtil.closeDBResources(stmt, null); + } + + public static void sqlValid(String sql, DataBaseType dataBaseType){ + SQLStatementParser statementParser = SQLParserUtils.createSQLStatementParser(sql,dataBaseType.getTypeName()); + statementParser.parseStatementList(); + } + + /** + * 异步获取resultSet的next(),注意,千万不能应用在数据的读取中。只能用在meta的获取 + * @param resultSet + * @return + */ + public static boolean asyncResultSetNext(final ResultSet resultSet) { + return asyncResultSetNext(resultSet, 3600); + } + + public static boolean asyncResultSetNext(final ResultSet resultSet, int timeout) { + Future future = rsExecutors.get().submit(new Callable() { + @Override + public Boolean call() throws Exception { + return resultSet.next(); + } + }); + try { + return future.get(timeout, TimeUnit.SECONDS); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.RS_ASYNC_ERROR, "异步获取ResultSet失败", e); + } + } + + public static void loadDriverClass(String pluginType, String pluginName) { + try { + String pluginJsonPath = StringUtils.join( + new String[] { System.getProperty("datax.home"), "plugin", + pluginType, + String.format("%s%s", pluginName, pluginType), + "plugin.json" }, File.separator); + Configuration configuration = Configuration.from(new File( + pluginJsonPath)); + List drivers = configuration.getList("drivers", + String.class); + for (String driver : drivers) { + Class.forName(driver); + } + } catch (ClassNotFoundException e) { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + "数据库驱动加载错误, 请确认libs目录有驱动jar包且plugin.json中drivers配置驱动类正确!", + e); + } + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtilErrorCode.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtilErrorCode.java new file mode 100755 index 
0000000000..fb01d446cf --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DBUtilErrorCode.java @@ -0,0 +1,96 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.spi.ErrorCode; + +//TODO +public enum DBUtilErrorCode implements ErrorCode { + //连接错误 + MYSQL_CONN_USERPWD_ERROR("MYSQLErrCode-01","数据库用户名或者密码错误,请检查填写的账号密码或者联系DBA确认账号和密码是否正确"), + MYSQL_CONN_IPPORT_ERROR("MYSQLErrCode-02","数据库服务的IP地址或者Port错误,请检查填写的IP地址和Port或者联系DBA确认IP地址和Port是否正确。如果是同步中心用户请联系DBA确认idb上录入的IP和PORT信息和数据库的当前实际信息是一致的"), + MYSQL_CONN_DB_ERROR("MYSQLErrCode-03","数据库名称错误,请检查数据库实例名称或者联系DBA确认该实例是否存在并且在正常服务"), + + ORACLE_CONN_USERPWD_ERROR("ORACLEErrCode-01","数据库用户名或者密码错误,请检查填写的账号密码或者联系DBA确认账号和密码是否正确"), + ORACLE_CONN_IPPORT_ERROR("ORACLEErrCode-02","数据库服务的IP地址或者Port错误,请检查填写的IP地址和Port或者联系DBA确认IP地址和Port是否正确。如果是同步中心用户请联系DBA确认idb上录入的IP和PORT信息和数据库的当前实际信息是一致的"), + ORACLE_CONN_DB_ERROR("ORACLEErrCode-03","数据库名称错误,请检查数据库实例名称或者联系DBA确认该实例是否存在并且在正常服务"), + + //execute query错误 + MYSQL_QUERY_TABLE_NAME_ERROR("MYSQLErrCode-04","表不存在,请检查表名或者联系DBA确认该表是否存在"), + MYSQL_QUERY_SQL_ERROR("MYSQLErrCode-05","SQL语句执行出错,请检查Where条件是否存在拼写或语法错误"), + MYSQL_QUERY_COLUMN_ERROR("MYSQLErrCode-06","Column信息错误,请检查该列是否存在,如果是常量或者变量,请使用英文单引号’包起来"), + MYSQL_QUERY_SELECT_PRI_ERROR("MYSQLErrCode-07","读表数据出错,因为账号没有读表的权限,请联系DBA确认该账号的权限并授权"), + + ORACLE_QUERY_TABLE_NAME_ERROR("ORACLEErrCode-04","表不存在,请检查表名或者联系DBA确认该表是否存在"), + ORACLE_QUERY_SQL_ERROR("ORACLEErrCode-05","SQL语句执行出错,原因可能是你填写的列不存在或者where条件不符合要求,1,请检查该列是否存在,如果是常量或者变量,请使用英文单引号’包起来; 2,请检查Where条件是否存在拼写或语法错误"), + ORACLE_QUERY_SELECT_PRI_ERROR("ORACLEErrCode-06","读表数据出错,因为账号没有读表的权限,请联系DBA确认该账号的权限并授权"), + ORACLE_QUERY_SQL_PARSER_ERROR("ORACLEErrCode-07","SQL语法出错,请检查Where条件是否存在拼写或语法错误"), + + //PreSql,Post Sql错误 + MYSQL_PRE_SQL_ERROR("MYSQLErrCode-08","PreSQL语法错误,请检查"), + MYSQL_POST_SQL_ERROR("MYSQLErrCode-09","PostSql语法错误,请检查"), + MYSQL_QUERY_SQL_PARSER_ERROR("MYSQLErrCode-10","SQL语法出错,请检查Where条件是否存在拼写或语法错误"), + + ORACLE_PRE_SQL_ERROR("ORACLEErrCode-08", "PreSQL语法错误,请检查"), + ORACLE_POST_SQL_ERROR("ORACLEErrCode-09", "PostSql语法错误,请检查"), + + //SplitPK 错误 + MYSQL_SPLIT_PK_ERROR("MYSQLErrCode-11","SplitPK错误,请检查"), + ORACLE_SPLIT_PK_ERROR("ORACLEErrCode-10","SplitPK错误,请检查"), + + //Insert,Delete 权限错误 + MYSQL_INSERT_ERROR("MYSQLErrCode-12","数据库没有写权限,请联系DBA"), + MYSQL_DELETE_ERROR("MYSQLErrCode-13","数据库没有Delete权限,请联系DBA"), + ORACLE_INSERT_ERROR("ORACLEErrCode-11","数据库没有写权限,请联系DBA"), + ORACLE_DELETE_ERROR("ORACLEErrCode-12","数据库没有Delete权限,请联系DBA"), + + JDBC_NULL("DBUtilErrorCode-20","JDBC URL为空,请检查配置"), + JDBC_OB10_ADDRESS_ERROR("DBUtilErrorCode-OB10-01","JDBC OB10格式错误,请联系askdatax"), + CONF_ERROR("DBUtilErrorCode-00", "您的配置错误."), + CONN_DB_ERROR("DBUtilErrorCode-10", "连接数据库失败. 请检查您的 账号、密码、数据库名称、IP、Port或者向 DBA 寻求帮助(注意网络环境)."), + GET_COLUMN_INFO_FAILED("DBUtilErrorCode-01", "获取表字段相关信息失败."), + UNSUPPORTED_TYPE("DBUtilErrorCode-12", "不支持的数据库类型. 请注意查看 DataX 已经支持的数据库类型以及数据库版本."), + COLUMN_SPLIT_ERROR("DBUtilErrorCode-13", "根据主键进行切分失败."), + SET_SESSION_ERROR("DBUtilErrorCode-14", "设置 session 失败."), + RS_ASYNC_ERROR("DBUtilErrorCode-15", "异步获取ResultSet next失败."), + + REQUIRED_VALUE("DBUtilErrorCode-03", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("DBUtilErrorCode-02", "您填写的参数值不合法."), + ILLEGAL_SPLIT_PK("DBUtilErrorCode-04", "您填写的主键列不合法, DataX 仅支持切分主键为一个,并且类型为整数或者字符串类型."), + SPLIT_FAILED_ILLEGAL_SQL("DBUtilErrorCode-15", "DataX尝试切分表时, 执行数据库 Sql 失败. 
请检查您的配置 table/splitPk/where 并作出修改."), + SQL_EXECUTE_FAIL("DBUtilErrorCode-06", "执行数据库 Sql 失败, 请检查您的配置的 column/table/where/querySql或者向 DBA 寻求帮助."), + + // only for reader + READ_RECORD_FAIL("DBUtilErrorCode-07", "读取数据库数据失败. 请检查您的配置的 column/table/where/querySql或者向 DBA 寻求帮助."), + TABLE_QUERYSQL_MIXED("DBUtilErrorCode-08", "您配置凌乱了. 不能同时既配置table又配置querySql"), + TABLE_QUERYSQL_MISSING("DBUtilErrorCode-09", "您配置错误. table和querySql 应该并且只能配置一个."), + + // only for writer + WRITE_DATA_ERROR("DBUtilErrorCode-05", "往您配置的写入表中写入数据时失败."), + NO_INSERT_PRIVILEGE("DBUtilErrorCode-11", "数据库没有写权限,请联系DBA"), + NO_DELETE_PRIVILEGE("DBUtilErrorCode-16", "数据库没有DELETE权限,请联系DBA"), + ; + + private final String code; + + private final String description; + + private DBUtilErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java new file mode 100755 index 0000000000..55d9e47b09 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/DataBaseType.java @@ -0,0 +1,198 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.exception.DataXException; + +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +/** + * refer:http://blog.csdn.net/ring0hx/article/details/6152528 + *
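+ * each constant pairs a database type name with its JDBC driver class; see getTypeName() and getDriverClassName().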

+ */ +public enum DataBaseType { + MySql("mysql", "com.mysql.jdbc.Driver"), + Tddl("mysql", "com.mysql.jdbc.Driver"), + DRDS("drds", "com.mysql.jdbc.Driver"), + Oracle("oracle", "oracle.jdbc.OracleDriver"), + SQLServer("sqlserver", "com.microsoft.sqlserver.jdbc.SQLServerDriver"), + PostgreSQL("postgresql", "org.postgresql.Driver"), + RDBMS("rdbms", "com.alibaba.datax.plugin.rdbms.util.DataBaseType"), + DB2("db2", "com.ibm.db2.jcc.DB2Driver"), + ADS("ads","com.mysql.jdbc.Driver"); + + + private String typeName; + private String driverClassName; + + DataBaseType(String typeName, String driverClassName) { + this.typeName = typeName; + this.driverClassName = driverClassName; + } + + public String getDriverClassName() { + return this.driverClassName; + } + + public String appendJDBCSuffixForReader(String jdbc) { + String result = jdbc; + String suffix = null; + switch (this) { + case MySql: + case DRDS: + suffix = "yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true"; + if (jdbc.contains("?")) { + result = jdbc + "&" + suffix; + } else { + result = jdbc + "?" + suffix; + } + break; + case Oracle: + break; + case SQLServer: + break; + case DB2: + break; + case PostgreSQL: + break; + case RDBMS: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type."); + } + + return result; + } + + public String appendJDBCSuffixForWriter(String jdbc) { + String result = jdbc; + String suffix = null; + switch (this) { + case MySql: + suffix = "yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true&tinyInt1isBit=false"; + if (jdbc.contains("?")) { + result = jdbc + "&" + suffix; + } else { + result = jdbc + "?" + suffix; + } + break; + case DRDS: + suffix = "yearIsDateType=false&zeroDateTimeBehavior=convertToNull"; + if (jdbc.contains("?")) { + result = jdbc + "&" + suffix; + } else { + result = jdbc + "?" 
+ suffix; + } + break; + case Oracle: + break; + case SQLServer: + break; + case DB2: + break; + case PostgreSQL: + break; + case RDBMS: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type."); + } + + return result; + } + + public String formatPk(String splitPk) { + String result = splitPk; + + switch (this) { + case MySql: + case Oracle: + if (splitPk.length() >= 2 && splitPk.startsWith("`") && splitPk.endsWith("`")) { + result = splitPk.substring(1, splitPk.length() - 1).toLowerCase(); + } + break; + case SQLServer: + if (splitPk.length() >= 2 && splitPk.startsWith("[") && splitPk.endsWith("]")) { + result = splitPk.substring(1, splitPk.length() - 1).toLowerCase(); + } + break; + case DB2: + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type."); + } + + return result; + } + + + public String quoteColumnName(String columnName) { + String result = columnName; + + switch (this) { + case MySql: + result = "`" + columnName.replace("`", "``") + "`"; + break; + case Oracle: + break; + case SQLServer: + result = "[" + columnName + "]"; + break; + case DB2: + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type"); + } + + return result; + } + + public String quoteTableName(String tableName) { + String result = tableName; + + switch (this) { + case MySql: + result = "`" + tableName.replace("`", "``") + "`"; + break; + case Oracle: + break; + case SQLServer: + break; + case DB2: + break; + case PostgreSQL: + break; + default: + throw DataXException.asDataXException(DBUtilErrorCode.UNSUPPORTED_TYPE, "unsupported database type"); + } + + return result; + } + + private static Pattern mysqlPattern = Pattern.compile("jdbc:mysql://(.+):\\d+/.+"); + private static Pattern oraclePattern = Pattern.compile("jdbc:oracle:thin:@(.+):\\d+:.+"); + + /** + * 注意:目前只实现了从 mysql/oracle 中识别出ip 信息.未识别到则返回 null. 
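+ * (Only MySQL and Oracle JDBC URLs are matched; e.g. parseIpFromJdbcUrl("jdbc:mysql://127.0.0.1:3306/db") returns "127.0.0.1", all other URLs yield null.)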
+ */ + public static String parseIpFromJdbcUrl(String jdbcUrl) { + Matcher mysql = mysqlPattern.matcher(jdbcUrl); + if (mysql.matches()) { + return mysql.group(1); + } + Matcher oracle = oraclePattern.matcher(jdbcUrl); + if (oracle.matches()) { + return oracle.group(1); + } + return null; + } + public String getTypeName() { + return typeName; + } + + public void setTypeName(String typeName) { + this.typeName = typeName; + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/JdbcConnectionFactory.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/JdbcConnectionFactory.java new file mode 100644 index 0000000000..2fe3108ece --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/JdbcConnectionFactory.java @@ -0,0 +1,39 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import java.sql.Connection; + +/** + * Date: 15/3/16 下午3:12 + */ +public class JdbcConnectionFactory implements ConnectionFactory { + + private DataBaseType dataBaseType; + + private String jdbcUrl; + + private String userName; + + private String password; + + public JdbcConnectionFactory(DataBaseType dataBaseType, String jdbcUrl, String userName, String password) { + this.dataBaseType = dataBaseType; + this.jdbcUrl = jdbcUrl; + this.userName = userName; + this.password = password; + } + + @Override + public Connection getConnecttion() { + return DBUtil.getConnection(dataBaseType, jdbcUrl, userName, password); + } + + @Override + public Connection getConnecttionWithoutRetry() { + return DBUtil.getConnectionWithoutRetry(dataBaseType, jdbcUrl, userName, password); + } + + @Override + public String getConnectionInfo() { + return "jdbcUrl:" + jdbcUrl; + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsException.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsException.java new file mode 100644 index 0000000000..4b6601adb9 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsException.java @@ -0,0 +1,190 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by judy.lt on 2015/6/5. 
+ */ +public class RdbmsException extends DataXException{ + public RdbmsException(ErrorCode errorCode, String message){ + super(errorCode,message); + } + + public static DataXException asConnException(DataBaseType dataBaseType,Exception e,String userName,String dbName){ + if (dataBaseType.equals(DataBaseType.MySql)){ + DBUtilErrorCode dbUtilErrorCode = mySqlConnectionErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_CONN_DB_ERROR && dbName !=null ){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库名称为:"+dbName+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_CONN_USERPWD_ERROR ){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库用户名为:"+userName+" 具体错误信息为:"+e); + } + return DataXException.asDataXException(dbUtilErrorCode," 具体错误信息为:"+e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + DBUtilErrorCode dbUtilErrorCode = oracleConnectionErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_CONN_DB_ERROR && dbName != null){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库名称为:"+dbName+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_CONN_USERPWD_ERROR ){ + return DataXException.asDataXException(dbUtilErrorCode,"该数据库用户名为:"+userName+" 具体错误信息为:"+e); + } + return DataXException.asDataXException(dbUtilErrorCode," 具体错误信息为:"+e); + } + return DataXException.asDataXException(DBUtilErrorCode.CONN_DB_ERROR," 具体错误信息为:"+e); + } + + public static DBUtilErrorCode mySqlConnectionErrorAna(String e){ + if (e.contains(Constant.MYSQL_DATABASE)){ + return DBUtilErrorCode.MYSQL_CONN_DB_ERROR; + } + + if (e.contains(Constant.MYSQL_CONNEXP)){ + return DBUtilErrorCode.MYSQL_CONN_IPPORT_ERROR; + } + + if (e.contains(Constant.MYSQL_ACCDENIED)){ + return DBUtilErrorCode.MYSQL_CONN_USERPWD_ERROR; + } + + return DBUtilErrorCode.CONN_DB_ERROR; + } + + public static DBUtilErrorCode oracleConnectionErrorAna(String e){ + if (e.contains(Constant.ORACLE_DATABASE)){ + return DBUtilErrorCode.ORACLE_CONN_DB_ERROR; + } + + if (e.contains(Constant.ORACLE_CONNEXP)){ + return DBUtilErrorCode.ORACLE_CONN_IPPORT_ERROR; + } + + if (e.contains(Constant.ORACLE_ACCDENIED)){ + return DBUtilErrorCode.ORACLE_CONN_USERPWD_ERROR; + } + + return DBUtilErrorCode.CONN_DB_ERROR; + } + + public static DataXException asQueryException(DataBaseType dataBaseType, Exception e,String querySql,String table,String userName){ + if (dataBaseType.equals(DataBaseType.MySql)){ + DBUtilErrorCode dbUtilErrorCode = mySqlQueryErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_QUERY_TABLE_NAME_ERROR && table != null){ + return DataXException.asDataXException(dbUtilErrorCode,"表名为:"+table+" 执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.MYSQL_QUERY_SELECT_PRI_ERROR && userName != null){ + return DataXException.asDataXException(dbUtilErrorCode,"用户名为:"+userName+" 具体错误信息为:"+e); + } + + return DataXException.asDataXException(dbUtilErrorCode,"执行的SQL为: "+querySql+" 具体错误信息为:"+e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + DBUtilErrorCode dbUtilErrorCode = oracleQueryErrorAna(e.getMessage()); + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_QUERY_TABLE_NAME_ERROR && table != null){ + return DataXException.asDataXException(dbUtilErrorCode,"表名为:"+table+" 执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + if (dbUtilErrorCode == DBUtilErrorCode.ORACLE_QUERY_SELECT_PRI_ERROR){ + return DataXException.asDataXException(dbUtilErrorCode,"用户名为:"+userName+" 具体错误信息为:"+e); + } + + return 
DataXException.asDataXException(dbUtilErrorCode,"执行的SQL为: "+querySql+" 具体错误信息为:"+e); + + } + + return DataXException.asDataXException(DBUtilErrorCode.SQL_EXECUTE_FAIL, "执行的SQL为: "+querySql+" 具体错误信息为:"+e); + } + + public static DBUtilErrorCode mySqlQueryErrorAna(String e){ + if (e.contains(Constant.MYSQL_TABLE_NAME_ERR1) && e.contains(Constant.MYSQL_TABLE_NAME_ERR2)){ + return DBUtilErrorCode.MYSQL_QUERY_TABLE_NAME_ERROR; + }else if (e.contains(Constant.MYSQL_SELECT_PRI)){ + return DBUtilErrorCode.MYSQL_QUERY_SELECT_PRI_ERROR; + }else if (e.contains(Constant.MYSQL_COLUMN1) && e.contains(Constant.MYSQL_COLUMN2)){ + return DBUtilErrorCode.MYSQL_QUERY_COLUMN_ERROR; + }else if (e.contains(Constant.MYSQL_WHERE)){ + return DBUtilErrorCode.MYSQL_QUERY_SQL_ERROR; + } + return DBUtilErrorCode.READ_RECORD_FAIL; + } + + public static DBUtilErrorCode oracleQueryErrorAna(String e){ + if (e.contains(Constant.ORACLE_TABLE_NAME)){ + return DBUtilErrorCode.ORACLE_QUERY_TABLE_NAME_ERROR; + }else if (e.contains(Constant.ORACLE_SQL)){ + return DBUtilErrorCode.ORACLE_QUERY_SQL_ERROR; + }else if (e.contains(Constant.ORACLE_SELECT_PRI)){ + return DBUtilErrorCode.ORACLE_QUERY_SELECT_PRI_ERROR; + } + return DBUtilErrorCode.READ_RECORD_FAIL; + } + + public static DataXException asSqlParserException(DataBaseType dataBaseType, Exception e,String querySql){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_QUERY_SQL_PARSER_ERROR, "执行的SQL为:"+querySql+" 具体错误信息为:" + e); + } + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_QUERY_SQL_PARSER_ERROR,"执行的SQL为:"+querySql+" 具体错误信息为:" +e); + } + throw DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,"执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + + public static DataXException asPreSQLParserException(DataBaseType dataBaseType, Exception e,String querySql){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_PRE_SQL_ERROR, "执行的SQL为:"+querySql+" 具体错误信息为:" + e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_PRE_SQL_ERROR,"执行的SQL为:"+querySql+" 具体错误信息为:" +e); + } + throw DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,"执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + + public static DataXException asPostSQLParserException(DataBaseType dataBaseType, Exception e,String querySql){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_POST_SQL_ERROR, "执行的SQL为:"+querySql+" 具体错误信息为:" + e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_POST_SQL_ERROR,"执行的SQL为:"+querySql+" 具体错误信息为:" +e); + } + throw DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,"执行的SQL为:"+querySql+" 具体错误信息为:"+e); + } + + public static DataXException asInsertPriException(DataBaseType dataBaseType, String userName,String jdbcUrl){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_INSERT_ERROR, "用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_INSERT_ERROR,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + throw DataXException.asDataXException(DBUtilErrorCode.NO_INSERT_PRIVILEGE,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + public static DataXException 
asDeletePriException(DataBaseType dataBaseType, String userName,String jdbcUrl){ + if (dataBaseType.equals(DataBaseType.MySql)){ + throw DataXException.asDataXException(DBUtilErrorCode.MYSQL_DELETE_ERROR, "用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + throw DataXException.asDataXException(DBUtilErrorCode.ORACLE_DELETE_ERROR,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + throw DataXException.asDataXException(DBUtilErrorCode.NO_DELETE_PRIVILEGE,"用户名为:"+userName+" jdbcURL为:"+jdbcUrl); + } + + public static DataXException asSplitPKException(DataBaseType dataBaseType, Exception e,String splitSql,String splitPkID){ + if (dataBaseType.equals(DataBaseType.MySql)){ + + return DataXException.asDataXException(DBUtilErrorCode.MYSQL_SPLIT_PK_ERROR,"配置的SplitPK为: "+splitPkID+", 执行的SQL为: "+splitSql+" 具体错误信息为:"+e); + } + + if (dataBaseType.equals(DataBaseType.Oracle)){ + return DataXException.asDataXException(DBUtilErrorCode.ORACLE_SPLIT_PK_ERROR,"配置的SplitPK为: "+splitPkID+", 执行的SQL为: "+splitSql+" 具体错误信息为:"+e); + } + + return DataXException.asDataXException(DBUtilErrorCode.READ_RECORD_FAIL,splitSql+e); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsRangeSplitWrap.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsRangeSplitWrap.java new file mode 100755 index 0000000000..71248ae931 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/RdbmsRangeSplitWrap.java @@ -0,0 +1,101 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import com.alibaba.datax.common.util.RangeSplitUtil; +import org.apache.commons.lang3.StringUtils; + +import java.math.BigInteger; +import java.util.ArrayList; +import java.util.List; + +public final class RdbmsRangeSplitWrap { + + public static List splitAndWrap(String left, String right, int expectSliceNumber, + String columnName, String quote, DataBaseType dataBaseType) { + String[] tempResult = RangeSplitUtil.doAsciiStringSplit(left, right, expectSliceNumber); + return RdbmsRangeSplitWrap.wrapRange(tempResult, columnName, quote, dataBaseType); + } + + // warn: do not use this method long->BigInteger + public static List splitAndWrap(long left, long right, int expectSliceNumber, String columnName) { + long[] tempResult = RangeSplitUtil.doLongSplit(left, right, expectSliceNumber); + return RdbmsRangeSplitWrap.wrapRange(tempResult, columnName); + } + + public static List splitAndWrap(BigInteger left, BigInteger right, int expectSliceNumber, String columnName) { + BigInteger[] tempResult = RangeSplitUtil.doBigIntegerSplit(left, right, expectSliceNumber); + return RdbmsRangeSplitWrap.wrapRange(tempResult, columnName); + } + + public static List wrapRange(long[] rangeResult, String columnName) { + String[] rangeStr = new String[rangeResult.length]; + for (int i = 0, len = rangeResult.length; i < len; i++) { + rangeStr[i] = String.valueOf(rangeResult[i]); + } + return wrapRange(rangeStr, columnName, "", null); + } + + public static List wrapRange(BigInteger[] rangeResult, String columnName) { + String[] rangeStr = new String[rangeResult.length]; + for (int i = 0, len = rangeResult.length; i < len; i++) { + rangeStr[i] = rangeResult[i].toString(); + } + return wrapRange(rangeStr, columnName, "", null); + } + + public static List wrapRange(String[] rangeResult, String columnName, + String quote, DataBaseType dataBaseType) { + if (null == rangeResult || rangeResult.length < 2) { + throw new IllegalArgumentException(String.format( + 
"Parameter rangeResult can not be null and its length can not <2. detail:rangeResult=[%s].", + StringUtils.join(rangeResult, ","))); + } + + List result = new ArrayList(); + + //TODO change to stringbuilder.append(..) + if (2 == rangeResult.length) { + result.add(String.format(" (%s%s%s <= %s AND %s <= %s%s%s) ", quote, quoteConstantValue(rangeResult[0], dataBaseType), + quote, columnName, columnName, quote, quoteConstantValue(rangeResult[1], dataBaseType), quote)); + return result; + } else { + for (int i = 0, len = rangeResult.length - 2; i < len; i++) { + result.add(String.format(" (%s%s%s <= %s AND %s < %s%s%s) ", quote, quoteConstantValue(rangeResult[i], dataBaseType), + quote, columnName, columnName, quote, quoteConstantValue(rangeResult[i + 1], dataBaseType), quote)); + } + + result.add(String.format(" (%s%s%s <= %s AND %s <= %s%s%s) ", quote, quoteConstantValue(rangeResult[rangeResult.length - 2], dataBaseType), + quote, columnName, columnName, quote, quoteConstantValue(rangeResult[rangeResult.length - 1], dataBaseType), quote)); + return result; + } + } + + public static String wrapFirstLastPoint(String firstPoint, String lastPoint, String columnName, + String quote, DataBaseType dataBaseType) { + return String.format(" ((%s < %s%s%s) OR (%s%s%s < %s)) ", columnName, quote, quoteConstantValue(firstPoint, dataBaseType), + quote, quote, quoteConstantValue(lastPoint, dataBaseType), quote, columnName); + } + + public static String wrapFirstLastPoint(Long firstPoint, Long lastPoint, String columnName) { + return wrapFirstLastPoint(firstPoint.toString(), lastPoint.toString(), columnName, "", null); + } + + public static String wrapFirstLastPoint(BigInteger firstPoint, BigInteger lastPoint, String columnName) { + return wrapFirstLastPoint(firstPoint.toString(), lastPoint.toString(), columnName, "", null); + } + + + private static String quoteConstantValue(String aString, DataBaseType dataBaseType) { + if (null == dataBaseType) { + return aString; + } + + if (dataBaseType.equals(DataBaseType.MySql)) { + return aString.replace("'", "''").replace("\\", "\\\\"); + } else if (dataBaseType.equals(DataBaseType.Oracle) || dataBaseType.equals(DataBaseType.SQLServer)) { + return aString.replace("'", "''"); + } else { + //TODO other type supported + return aString; + } + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/SqlFormatUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/SqlFormatUtil.java new file mode 100755 index 0000000000..76137d3117 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/SqlFormatUtil.java @@ -0,0 +1,359 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import java.util.HashSet; +import java.util.LinkedList; +import java.util.Set; +import java.util.StringTokenizer; + +// TODO delete it +public class SqlFormatUtil { + + private static final Set BEGIN_CLAUSES = new HashSet(); + private static final Set END_CLAUSES = new HashSet(); + private static final Set LOGICAL = new HashSet(); + private static final Set QUANTIFIERS = new HashSet(); + private static final Set DML = new HashSet(); + private static final Set MISC = new HashSet(); + + private static final String WHITESPACE = " \n\r\f\t"; + + static { + BEGIN_CLAUSES.add("left"); + BEGIN_CLAUSES.add("right"); + BEGIN_CLAUSES.add("inner"); + BEGIN_CLAUSES.add("outer"); + BEGIN_CLAUSES.add("group"); + BEGIN_CLAUSES.add("order"); + + END_CLAUSES.add("where"); + END_CLAUSES.add("set"); + END_CLAUSES.add("having"); + 
END_CLAUSES.add("join"); + END_CLAUSES.add("from"); + END_CLAUSES.add("by"); + END_CLAUSES.add("join"); + END_CLAUSES.add("into"); + END_CLAUSES.add("union"); + + LOGICAL.add("and"); + LOGICAL.add("or"); + LOGICAL.add("when"); + LOGICAL.add("else"); + LOGICAL.add("end"); + + QUANTIFIERS.add("in"); + QUANTIFIERS.add("all"); + QUANTIFIERS.add("exists"); + QUANTIFIERS.add("some"); + QUANTIFIERS.add("any"); + + DML.add("insert"); + DML.add("update"); + DML.add("delete"); + + MISC.add("select"); + MISC.add("on"); + } + + static final String indentString = " "; + static final String initial = "\n "; + + public static String format(String source) { + return new FormatProcess(source).perform(); + } + + private static class FormatProcess { + boolean beginLine = true; + boolean afterBeginBeforeEnd = false; + boolean afterByOrSetOrFromOrSelect = false; + boolean afterValues = false; + boolean afterOn = false; + boolean afterBetween = false; + boolean afterInsert = false; + int inFunction = 0; + int parensSinceSelect = 0; + private LinkedList parenCounts = new LinkedList(); + private LinkedList afterByOrFromOrSelects = new LinkedList(); + + int indent = 1; + + StringBuilder result = new StringBuilder(); + StringTokenizer tokens; + String lastToken; + String token; + String lcToken; + + public FormatProcess(String sql) { + tokens = new StringTokenizer(sql, "()+*/-=<>'`\"[]," + WHITESPACE, + true); + } + + public String perform() { + + result.append(initial); + + while (tokens.hasMoreTokens()) { + token = tokens.nextToken(); + lcToken = token.toLowerCase(); + + if ("'".equals(token)) { + String t; + do { + t = tokens.nextToken(); + token += t; + } while (!"'".equals(t) && tokens.hasMoreTokens()); // cannot + // handle + // single + // quotes + } else if ("\"".equals(token)) { + String t; + do { + t = tokens.nextToken(); + token += t; + } while (!"\"".equals(t)); + } + + if (afterByOrSetOrFromOrSelect && ",".equals(token)) { + commaAfterByOrFromOrSelect(); + } else if (afterOn && ",".equals(token)) { + commaAfterOn(); + } + + else if ("(".equals(token)) { + openParen(); + } else if (")".equals(token)) { + closeParen(); + } + + else if (BEGIN_CLAUSES.contains(lcToken)) { + beginNewClause(); + } + + else if (END_CLAUSES.contains(lcToken)) { + endNewClause(); + } + + else if ("select".equals(lcToken)) { + select(); + } + + else if (DML.contains(lcToken)) { + updateOrInsertOrDelete(); + } + + else if ("values".equals(lcToken)) { + values(); + } + + else if ("on".equals(lcToken)) { + on(); + } + + else if (afterBetween && lcToken.equals("and")) { + misc(); + afterBetween = false; + } + + else if (LOGICAL.contains(lcToken)) { + logical(); + } + + else if (isWhitespace(token)) { + white(); + } + + else { + misc(); + } + + if (!isWhitespace(token)) { + lastToken = lcToken; + } + + } + return result.toString(); + } + + private void commaAfterOn() { + out(); + indent--; + newline(); + afterOn = false; + afterByOrSetOrFromOrSelect = true; + } + + private void commaAfterByOrFromOrSelect() { + out(); + newline(); + } + + private void logical() { + if ("end".equals(lcToken)) { + indent--; + } + newline(); + out(); + beginLine = false; + } + + private void on() { + indent++; + afterOn = true; + newline(); + out(); + beginLine = false; + } + + private void misc() { + out(); + if ("between".equals(lcToken)) { + afterBetween = true; + } + if (afterInsert) { + newline(); + afterInsert = false; + } else { + beginLine = false; + if ("case".equals(lcToken)) { + indent++; + } + } + } + + private void white() { + if 
(!beginLine) { + result.append(" "); + } + } + + private void updateOrInsertOrDelete() { + out(); + indent++; + beginLine = false; + if ("update".equals(lcToken)) { + newline(); + } + if ("insert".equals(lcToken)) { + afterInsert = true; + } + } + + private void select() { + out(); + indent++; + newline(); + parenCounts.addLast(Integer.valueOf(parensSinceSelect)); + afterByOrFromOrSelects.addLast(Boolean + .valueOf(afterByOrSetOrFromOrSelect)); + parensSinceSelect = 0; + afterByOrSetOrFromOrSelect = true; + } + + private void out() { + result.append(token); + } + + private void endNewClause() { + if (!afterBeginBeforeEnd) { + indent--; + if (afterOn) { + indent--; + afterOn = false; + } + newline(); + } + out(); + if (!"union".equals(lcToken)) { + indent++; + } + newline(); + afterBeginBeforeEnd = false; + afterByOrSetOrFromOrSelect = "by".equals(lcToken) + || "set".equals(lcToken) || "from".equals(lcToken); + } + + private void beginNewClause() { + if (!afterBeginBeforeEnd) { + if (afterOn) { + indent--; + afterOn = false; + } + indent--; + newline(); + } + out(); + beginLine = false; + afterBeginBeforeEnd = true; + } + + private void values() { + indent--; + newline(); + out(); + indent++; + newline(); + afterValues = true; + } + + private void closeParen() { + parensSinceSelect--; + if (parensSinceSelect < 0) { + indent--; + parensSinceSelect = parenCounts.removeLast().intValue(); + afterByOrSetOrFromOrSelect = afterByOrFromOrSelects + .removeLast().booleanValue(); + } + if (inFunction > 0) { + inFunction--; + out(); + } else { + if (!afterByOrSetOrFromOrSelect) { + indent--; + newline(); + } + out(); + } + beginLine = false; + } + + private void openParen() { + if (isFunctionName(lastToken) || inFunction > 0) { + inFunction++; + } + beginLine = false; + if (inFunction > 0) { + out(); + } else { + out(); + if (!afterByOrSetOrFromOrSelect) { + indent++; + newline(); + beginLine = true; + } + } + parensSinceSelect++; + } + + private static boolean isFunctionName(String tok) { + final char begin = tok.charAt(0); + final boolean isIdentifier = Character.isJavaIdentifierStart(begin) + || '"' == begin; + return isIdentifier && !LOGICAL.contains(tok) + && !END_CLAUSES.contains(tok) && !QUANTIFIERS.contains(tok) + && !DML.contains(tok) && !MISC.contains(tok); + } + + private static boolean isWhitespace(String token) { + return WHITESPACE.indexOf(token) >= 0; + } + + private void newline() { + result.append("\n"); + for (int i = 0; i < indent; i++) { + result.append(indentString); + } + beginLine = true; + } + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/TableExpandUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/TableExpandUtil.java new file mode 100755 index 0000000000..8d28ed4f09 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/util/TableExpandUtil.java @@ -0,0 +1,83 @@ +package com.alibaba.datax.plugin.rdbms.util; + +import java.util.ArrayList; +import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public final class TableExpandUtil { + + // schema.table[0-2]more + // 1 2 3 4 5 + public static Pattern pattern = Pattern + .compile("(\\w+\\.)?(\\w+)\\[(\\d+)-(\\d+)\\](.*)"); + + private TableExpandUtil() { + } + + /** + * Split the table string(Usually contains names of some tables) to a List + * that is formated. example: table[0-32] will be splitted into `table0`, + * `table1`, `table2`, ... 
,`table32` in {@link List} + * + * @param tables + * a string containing one or more table names. + * @return the expanded list of table names. + *
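+ * leading zeros in the range are preserved, e.g. table[00-02] expands to table00, table01, table02.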

+ * TODO 删除参数 DataBaseType + */ + public static List splitTables(DataBaseType dataBaseType, + String tables) { + List splittedTables = new ArrayList(); + + String[] tableArrays = tables.split(","); + + String tableName = null; + for (String tableArray : tableArrays) { + Matcher matcher = pattern.matcher(tableArray.trim()); + if (!matcher.matches()) { + tableName = tableArray.trim(); + splittedTables.add(tableName); + } else { + String start = matcher.group(3).trim(); + String end = matcher.group(4).trim(); + String tmp = ""; + if (Integer.valueOf(start) > Integer.valueOf(end)) { + tmp = start; + start = end; + end = tmp; + } + int len = start.length(); + String schema = null; + for (int k = Integer.valueOf(start); k <= Integer.valueOf(end); k++) { + schema = (null == matcher.group(1)) ? "" : matcher.group(1) + .trim(); + if (start.startsWith("0")) { + tableName = schema + matcher.group(2).trim() + + String.format("%0" + len + "d", k) + + matcher.group(5).trim(); + splittedTables.add(tableName); + } else { + tableName = schema + matcher.group(2).trim() + + String.format("%d", k) + + matcher.group(5).trim(); + splittedTables.add(tableName); + } + } + } + } + return splittedTables; + } + + public static List expandTableConf(DataBaseType dataBaseType, + List tables) { + List parsedTables = new ArrayList(); + for (String table : tables) { + List splittedTables = splitTables(dataBaseType, table); + parsedTables.addAll(splittedTables); + } + + return parsedTables; + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/CommonRdbmsWriter.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/CommonRdbmsWriter.java new file mode 100755 index 0000000000..440aac2ad8 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/CommonRdbmsWriter.java @@ -0,0 +1,568 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.RdbmsException; +import com.alibaba.datax.plugin.rdbms.writer.util.OriginalConfPretreatmentUtil; +import com.alibaba.datax.plugin.rdbms.writer.util.WriterUtil; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.lang3.tuple.Triple; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.PreparedStatement; +import java.sql.SQLException; +import java.sql.Types; +import java.util.ArrayList; +import java.util.List; + +public class CommonRdbmsWriter { + + public static class Job { + private DataBaseType dataBaseType; + + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + public Job(DataBaseType dataBaseType) { + this.dataBaseType = dataBaseType; + OriginalConfPretreatmentUtil.DATABASE_TYPE = this.dataBaseType; + } + + public void init(Configuration originalConfig) { + OriginalConfPretreatmentUtil.doPretreatment(originalConfig, this.dataBaseType); + + LOG.debug("After job init(), originalConfig now is:[\n{}\n]", + originalConfig.toJSON()); + } + + /*目前只支持MySQL Writer跟Oracle 
Writer;检查PreSQL跟PostSQL语法以及insert,delete权限*/ + public void writerPreCheck(Configuration originalConfig, DataBaseType dataBaseType) { + /*检查PreSql跟PostSql语句*/ + prePostSqlValid(originalConfig, dataBaseType); + /*检查insert 跟delete权限*/ + privilegeValid(originalConfig, dataBaseType); + } + + public void prePostSqlValid(Configuration originalConfig, DataBaseType dataBaseType) { + /*检查PreSql跟PostSql语句*/ + WriterUtil.preCheckPrePareSQL(originalConfig, dataBaseType); + WriterUtil.preCheckPostSQL(originalConfig, dataBaseType); + } + + public void privilegeValid(Configuration originalConfig, DataBaseType dataBaseType) { + /*检查insert 跟delete权限*/ + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + List connections = originalConfig.getList(Constant.CONN_MARK, + Object.class); + + for (int i = 0, len = connections.size(); i < len; i++) { + Configuration connConf = Configuration.from(connections.get(i).toString()); + String jdbcUrl = connConf.getString(Key.JDBC_URL); + List expandedTables = connConf.getList(Key.TABLE, String.class); + boolean hasInsertPri = DBUtil.checkInsertPrivilege(dataBaseType, jdbcUrl, username, password, expandedTables); + + if (!hasInsertPri) { + throw RdbmsException.asInsertPriException(dataBaseType, originalConfig.getString(Key.USERNAME), jdbcUrl); + } + + if (DBUtil.needCheckDeletePrivilege(originalConfig)) { + boolean hasDeletePri = DBUtil.checkDeletePrivilege(dataBaseType, jdbcUrl, username, password, expandedTables); + if (!hasDeletePri) { + throw RdbmsException.asDeletePriException(dataBaseType, originalConfig.getString(Key.USERNAME), jdbcUrl); + } + } + } + } + + // 一般来说,是需要推迟到 task 中进行pre 的执行(单表情况例外) + public void prepare(Configuration originalConfig) { + int tableNumber = originalConfig.getInt(Constant.TABLE_NUMBER_MARK); + if (tableNumber == 1) { + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + + List conns = originalConfig.getList(Constant.CONN_MARK, + Object.class); + Configuration connConf = Configuration.from(conns.get(0) + .toString()); + + // 这里的 jdbcUrl 已经 append 了合适后缀参数 + String jdbcUrl = connConf.getString(Key.JDBC_URL); + originalConfig.set(Key.JDBC_URL, jdbcUrl); + + String table = connConf.getList(Key.TABLE, String.class).get(0); + originalConfig.set(Key.TABLE, table); + + List preSqls = originalConfig.getList(Key.PRE_SQL, + String.class); + List renderedPreSqls = WriterUtil.renderPreOrPostSqls( + preSqls, table); + + originalConfig.remove(Constant.CONN_MARK); + if (null != renderedPreSqls && !renderedPreSqls.isEmpty()) { + // 说明有 preSql 配置,则此处删除掉 + originalConfig.remove(Key.PRE_SQL); + + Connection conn = DBUtil.getConnection(dataBaseType, + jdbcUrl, username, password); + LOG.info("Begin to execute preSqls:[{}]. 
context info:{}.", + StringUtils.join(renderedPreSqls, ";"), jdbcUrl); + + WriterUtil.executeSqls(conn, renderedPreSqls, jdbcUrl, dataBaseType); + DBUtil.closeDBResources(null, null, conn); + } + } + + LOG.debug("After job prepare(), originalConfig now is:[\n{}\n]", + originalConfig.toJSON()); + } + + public List split(Configuration originalConfig, + int mandatoryNumber) { + return WriterUtil.doSplit(originalConfig, mandatoryNumber); + } + + // 一般来说,是需要推迟到 task 中进行post 的执行(单表情况例外) + public void post(Configuration originalConfig) { + int tableNumber = originalConfig.getInt(Constant.TABLE_NUMBER_MARK); + if (tableNumber == 1) { + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + + // 已经由 prepare 进行了appendJDBCSuffix处理 + String jdbcUrl = originalConfig.getString(Key.JDBC_URL); + + String table = originalConfig.getString(Key.TABLE); + + List postSqls = originalConfig.getList(Key.POST_SQL, + String.class); + List renderedPostSqls = WriterUtil.renderPreOrPostSqls( + postSqls, table); + + if (null != renderedPostSqls && !renderedPostSqls.isEmpty()) { + // 说明有 postSql 配置,则此处删除掉 + originalConfig.remove(Key.POST_SQL); + + Connection conn = DBUtil.getConnection(this.dataBaseType, + jdbcUrl, username, password); + + LOG.info( + "Begin to execute postSqls:[{}]. context info:{}.", + StringUtils.join(renderedPostSqls, ";"), jdbcUrl); + WriterUtil.executeSqls(conn, renderedPostSqls, jdbcUrl, dataBaseType); + DBUtil.closeDBResources(null, null, conn); + } + } + } + + public void destroy(Configuration originalConfig) { + } + + } + + public static class Task { + protected static final Logger LOG = LoggerFactory + .getLogger(Task.class); + + protected DataBaseType dataBaseType; + private static final String VALUE_HOLDER = "?"; + + protected String username; + protected String password; + protected String jdbcUrl; + protected String table; + protected List columns; + protected List preSqls; + protected List postSqls; + protected int batchSize; + protected int batchByteSize; + protected int columnNumber = 0; + protected TaskPluginCollector taskPluginCollector; + + // 作为日志显示信息时,需要附带的通用信息。比如信息所对应的数据库连接等信息,针对哪个表做的操作 + protected static String BASIC_MESSAGE; + + protected static String INSERT_OR_REPLACE_TEMPLATE; + + protected String writeRecordSql; + protected String writeMode; + protected boolean emptyAsNull; + protected Triple, List, List> resultSetMetaData; + + public Task(DataBaseType dataBaseType) { + this.dataBaseType = dataBaseType; + } + + public void init(Configuration writerSliceConfig) { + this.username = writerSliceConfig.getString(Key.USERNAME); + this.password = writerSliceConfig.getString(Key.PASSWORD); + this.jdbcUrl = writerSliceConfig.getString(Key.JDBC_URL); + + //ob10的处理 + if (this.jdbcUrl.startsWith(Constant.OB10_SPLIT_STRING) && this.dataBaseType == DataBaseType.MySql) { + String[] ss = this.jdbcUrl.split(Constant.OB10_SPLIT_STRING_PATTERN); + if (ss.length != 3) { + throw DataXException + .asDataXException( + DBUtilErrorCode.JDBC_OB10_ADDRESS_ERROR, "JDBC OB10格式错误,请联系askdatax"); + } + LOG.info("this is ob1_0 jdbc url."); + this.username = ss[1].trim() + ":" + this.username; + this.jdbcUrl = ss[2]; + LOG.info("this is ob1_0 jdbc url. 
user=" + this.username + " :url=" + this.jdbcUrl); + } + + this.table = writerSliceConfig.getString(Key.TABLE); + + this.columns = writerSliceConfig.getList(Key.COLUMN, String.class); + this.columnNumber = this.columns.size(); + + this.preSqls = writerSliceConfig.getList(Key.PRE_SQL, String.class); + this.postSqls = writerSliceConfig.getList(Key.POST_SQL, String.class); + this.batchSize = writerSliceConfig.getInt(Key.BATCH_SIZE, Constant.DEFAULT_BATCH_SIZE); + this.batchByteSize = writerSliceConfig.getInt(Key.BATCH_BYTE_SIZE, Constant.DEFAULT_BATCH_BYTE_SIZE); + + writeMode = writerSliceConfig.getString(Key.WRITE_MODE, "INSERT"); + emptyAsNull = writerSliceConfig.getBool(Key.EMPTY_AS_NULL, true); + INSERT_OR_REPLACE_TEMPLATE = writerSliceConfig.getString(Constant.INSERT_OR_REPLACE_TEMPLATE_MARK); + this.writeRecordSql = String.format(INSERT_OR_REPLACE_TEMPLATE, this.table); + + BASIC_MESSAGE = String.format("jdbcUrl:[%s], table:[%s]", + this.jdbcUrl, this.table); + } + + public void prepare(Configuration writerSliceConfig) { + Connection connection = DBUtil.getConnection(this.dataBaseType, + this.jdbcUrl, username, password); + + DBUtil.dealWithSessionConfig(connection, writerSliceConfig, + this.dataBaseType, BASIC_MESSAGE); + + int tableNumber = writerSliceConfig.getInt( + Constant.TABLE_NUMBER_MARK); + if (tableNumber != 1) { + LOG.info("Begin to execute preSqls:[{}]. context info:{}.", + StringUtils.join(this.preSqls, ";"), BASIC_MESSAGE); + WriterUtil.executeSqls(connection, this.preSqls, BASIC_MESSAGE, dataBaseType); + } + + DBUtil.closeDBResources(null, null, connection); + } + + public void startWriteWithConnection(RecordReceiver recordReceiver, TaskPluginCollector taskPluginCollector, Connection connection) { + this.taskPluginCollector = taskPluginCollector; + + // 用于写入数据的时候的类型根据目的表字段类型转换 + this.resultSetMetaData = DBUtil.getColumnMetaData(connection, + this.table, StringUtils.join(this.columns, ",")); + // 写数据库的SQL语句 + calcWriteRecordSql(); + + List writeBuffer = new ArrayList(this.batchSize); + int bufferBytes = 0; + try { + Record record; + while ((record = recordReceiver.getFromReader()) != null) { + if (record.getColumnNumber() != this.columnNumber) { + // 源头读取字段列数与目的表字段写入列数不相等,直接报错 + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "列配置信息有错误. 因为您配置的任务中,源头读取字段数:%s 与 目的表要写入的字段数:%s 不相等. 
请检查您的配置并作出修改.", + record.getColumnNumber(), + this.columnNumber)); + } + + writeBuffer.add(record); + bufferBytes += record.getMemorySize(); + + if (writeBuffer.size() >= batchSize || bufferBytes >= batchByteSize) { + doBatchInsert(connection, writeBuffer); + writeBuffer.clear(); + bufferBytes = 0; + } + } + if (!writeBuffer.isEmpty()) { + doBatchInsert(connection, writeBuffer); + writeBuffer.clear(); + bufferBytes = 0; + } + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + writeBuffer.clear(); + bufferBytes = 0; + DBUtil.closeDBResources(null, null, connection); + } + } + + // TODO 改用连接池,确保每次获取的连接都是可用的(注意:连接可能需要每次都初始化其 session) + public void startWrite(RecordReceiver recordReceiver, + Configuration writerSliceConfig, + TaskPluginCollector taskPluginCollector) { + Connection connection = DBUtil.getConnection(this.dataBaseType, + this.jdbcUrl, username, password); + DBUtil.dealWithSessionConfig(connection, writerSliceConfig, + this.dataBaseType, BASIC_MESSAGE); + startWriteWithConnection(recordReceiver, taskPluginCollector, connection); + } + + + public void post(Configuration writerSliceConfig) { + int tableNumber = writerSliceConfig.getInt( + Constant.TABLE_NUMBER_MARK); + + boolean hasPostSql = (this.postSqls != null && this.postSqls.size() > 0); + if (tableNumber == 1 || !hasPostSql) { + return; + } + + Connection connection = DBUtil.getConnection(this.dataBaseType, + this.jdbcUrl, username, password); + + LOG.info("Begin to execute postSqls:[{}]. context info:{}.", + StringUtils.join(this.postSqls, ";"), BASIC_MESSAGE); + WriterUtil.executeSqls(connection, this.postSqls, BASIC_MESSAGE, dataBaseType); + DBUtil.closeDBResources(null, null, connection); + } + + public void destroy(Configuration writerSliceConfig) { + } + + protected void doBatchInsert(Connection connection, List buffer) + throws SQLException { + PreparedStatement preparedStatement = null; + try { + connection.setAutoCommit(false); + preparedStatement = connection + .prepareStatement(this.writeRecordSql); + + for (Record record : buffer) { + preparedStatement = fillPreparedStatement( + preparedStatement, record); + preparedStatement.addBatch(); + } + preparedStatement.executeBatch(); + connection.commit(); + } catch (SQLException e) { + LOG.warn("回滚此次写入, 采用每次写入一行方式提交. 
因为:" + e.getMessage()); + connection.rollback(); + doOneInsert(connection, buffer); + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(preparedStatement, null); + } + } + + protected void doOneInsert(Connection connection, List buffer) { + PreparedStatement preparedStatement = null; + try { + connection.setAutoCommit(true); + preparedStatement = connection + .prepareStatement(this.writeRecordSql); + + for (Record record : buffer) { + try { + preparedStatement = fillPreparedStatement( + preparedStatement, record); + preparedStatement.execute(); + } catch (SQLException e) { + LOG.debug(e.toString()); + + this.taskPluginCollector.collectDirtyRecord(record, e); + } finally { + // 最后不要忘了关闭 preparedStatement + preparedStatement.clearParameters(); + } + } + } catch (Exception e) { + throw DataXException.asDataXException( + DBUtilErrorCode.WRITE_DATA_ERROR, e); + } finally { + DBUtil.closeDBResources(preparedStatement, null); + } + } + + // 直接使用了两个类变量:columnNumber,resultSetMetaData + protected PreparedStatement fillPreparedStatement(PreparedStatement preparedStatement, Record record) + throws SQLException { + for (int i = 0; i < this.columnNumber; i++) { + int columnSqltype = this.resultSetMetaData.getMiddle().get(i); + preparedStatement = fillPreparedStatementColumnType(preparedStatement, i, columnSqltype, record.getColumn(i)); + } + + return preparedStatement; + } + + protected PreparedStatement fillPreparedStatementColumnType(PreparedStatement preparedStatement, int columnIndex, int columnSqltype, Column column) throws SQLException { + java.util.Date utilDate; + switch (columnSqltype) { + case Types.CHAR: + case Types.NCHAR: + case Types.CLOB: + case Types.NCLOB: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + preparedStatement.setString(columnIndex + 1, column + .asString()); + break; + + case Types.SMALLINT: + case Types.INTEGER: + case Types.BIGINT: + case Types.NUMERIC: + case Types.DECIMAL: + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + String strValue = column.asString(); + if (emptyAsNull && "".equals(strValue)) { + preparedStatement.setString(columnIndex + 1, null); + } else { + preparedStatement.setString(columnIndex + 1, strValue); + } + break; + + //tinyint is a little special in some database like mysql {boolean->tinyint(1)} + case Types.TINYINT: + Long longValue = column.asLong(); + if (null == longValue) { + preparedStatement.setString(columnIndex + 1, null); + } else { + preparedStatement.setString(columnIndex + 1, longValue.toString()); + } + break; + + // for mysql bug, see http://bugs.mysql.com/bug.php?id=35115 + case Types.DATE: + if (this.resultSetMetaData.getRight().get(columnIndex) + .equalsIgnoreCase("year")) { + if (column.asBigInteger() == null) { + preparedStatement.setString(columnIndex + 1, null); + } else { + preparedStatement.setInt(columnIndex + 1, column.asBigInteger().intValue()); + } + } else { + java.sql.Date sqlDate = null; + try { + utilDate = column.asDate(); + } catch (DataXException e) { + throw new SQLException(String.format( + "Date 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlDate = new java.sql.Date(utilDate.getTime()); + } + preparedStatement.setDate(columnIndex + 1, sqlDate); + } + break; + + case Types.TIME: + java.sql.Time sqlTime = null; + try { + utilDate = column.asDate(); + } catch (DataXException e) { + throw new SQLException(String.format( + "TIME 类型转换错误:[%s]", 
column)); + } + + if (null != utilDate) { + sqlTime = new java.sql.Time(utilDate.getTime()); + } + preparedStatement.setTime(columnIndex + 1, sqlTime); + break; + + case Types.TIMESTAMP: + java.sql.Timestamp sqlTimestamp = null; + try { + utilDate = column.asDate(); + } catch (DataXException e) { + throw new SQLException(String.format( + "TIMESTAMP 类型转换错误:[%s]", column)); + } + + if (null != utilDate) { + sqlTimestamp = new java.sql.Timestamp( + utilDate.getTime()); + } + preparedStatement.setTimestamp(columnIndex + 1, sqlTimestamp); + break; + + case Types.BINARY: + case Types.VARBINARY: + case Types.BLOB: + case Types.LONGVARBINARY: + preparedStatement.setBytes(columnIndex + 1, column + .asBytes()); + break; + + case Types.BOOLEAN: + preparedStatement.setString(columnIndex + 1, column.asString()); + break; + + // warn: bit(1) -> Types.BIT 可使用setBoolean + // warn: bit(>1) -> Types.VARBINARY 可使用setBytes + case Types.BIT: + if (this.dataBaseType == DataBaseType.MySql) { + preparedStatement.setBoolean(columnIndex + 1, column.asBoolean()); + } else { + preparedStatement.setString(columnIndex + 1, column.asString()); + } + break; + default: + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d], 字段Java类型:[%s]. 请修改表中该字段的类型或者不同步该字段.", + this.resultSetMetaData.getLeft() + .get(columnIndex), + this.resultSetMetaData.getMiddle() + .get(columnIndex), + this.resultSetMetaData.getRight() + .get(columnIndex))); + } + return preparedStatement; + } + + private void calcWriteRecordSql() { + if (!VALUE_HOLDER.equals(calcValueHolder(""))) { + List valueHolders = new ArrayList(columnNumber); + for (int i = 0; i < columns.size(); i++) { + String type = resultSetMetaData.getRight().get(i); + valueHolders.add(calcValueHolder(type)); + } + + boolean forceUseUpdate = false; + //ob10的处理 + if (dataBaseType != null && dataBaseType == DataBaseType.MySql && OriginalConfPretreatmentUtil.isOB10(jdbcUrl)) { + forceUseUpdate = true; + } + + INSERT_OR_REPLACE_TEMPLATE = WriterUtil.getWriteTemplate(columns, valueHolders, writeMode, dataBaseType, forceUseUpdate); + writeRecordSql = String.format(INSERT_OR_REPLACE_TEMPLATE, this.table); + } + } + + protected String calcValueHolder(String columnType) { + return VALUE_HOLDER; + } + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Constant.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Constant.java new file mode 100755 index 0000000000..0e4692e2c8 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Constant.java @@ -0,0 +1,22 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +/** + * 用于插件解析用户配置时,需要进行标识(MARK)的常量的声明. 
+ */ +public final class Constant { + public static final int DEFAULT_BATCH_SIZE = 2048; + + public static final int DEFAULT_BATCH_BYTE_SIZE = 32 * 1024 * 1024; + + public static String TABLE_NAME_PLACEHOLDER = "@table"; + + public static String CONN_MARK = "connection"; + + public static String TABLE_NUMBER_MARK = "tableNumber"; + + public static String INSERT_OR_REPLACE_TEMPLATE_MARK = "insertOrReplaceTemplate"; + + public static final String OB10_SPLIT_STRING = "||_dsc_ob10_dsc_||"; + public static final String OB10_SPLIT_STRING_PATTERN = "\\|\\|_dsc_ob10_dsc_\\|\\|"; + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Key.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Key.java new file mode 100755 index 0000000000..25a2ab52f8 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/Key.java @@ -0,0 +1,40 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +public final class Key { + public final static String JDBC_URL = "jdbcUrl"; + + public final static String USERNAME = "username"; + + public final static String PASSWORD = "password"; + + public final static String TABLE = "table"; + + public final static String COLUMN = "column"; + + //可选值为:insert,replace,默认为 insert (mysql 支持,oracle 没用 replace 机制,只能 insert,oracle 可以不暴露这个参数) + public final static String WRITE_MODE = "writeMode"; + + public final static String PRE_SQL = "preSql"; + + public final static String POST_SQL = "postSql"; + + public final static String TDDL_APP_NAME = "appName"; + + //默认值:256 + public final static String BATCH_SIZE = "batchSize"; + + //默认值:32m + public final static String BATCH_BYTE_SIZE = "batchByteSize"; + + public final static String EMPTY_AS_NULL = "emptyAsNull"; + + public final static String DB_NAME_PATTERN = "dbNamePattern"; + + public final static String DB_RULE = "dbRule"; + + public final static String TABLE_NAME_PATTERN = "tableNamePattern"; + + public final static String TABLE_RULE = "tableRule"; + + public final static String DRYRUN = "dryRun"; +} \ No newline at end of file diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/MysqlWriterErrorCode.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/MysqlWriterErrorCode.java new file mode 100755 index 0000000000..523292ad02 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/MysqlWriterErrorCode.java @@ -0,0 +1,32 @@ +package com.alibaba.datax.plugin.rdbms.writer; + +import com.alibaba.datax.common.spi.ErrorCode; + +//TODO 后续考虑与 util 包种的 DBUTilErrorCode 做合并.(区分读和写的错误码) +public enum MysqlWriterErrorCode implements ErrorCode { + ; + + private final String code; + private final String describe; + + private MysqlWriterErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]. 
", this.code, + this.describe); + } +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java new file mode 100755 index 0000000000..c42dd3eac4 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java @@ -0,0 +1,184 @@ +package com.alibaba.datax.plugin.rdbms.writer.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.common.util.ListUtil; +import com.alibaba.datax.plugin.rdbms.util.*; +import com.alibaba.datax.plugin.rdbms.writer.Constant; +import com.alibaba.datax.plugin.rdbms.writer.Key; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.ArrayList; +import java.util.List; + +public final class OriginalConfPretreatmentUtil { + private static final Logger LOG = LoggerFactory + .getLogger(OriginalConfPretreatmentUtil.class); + + public static DataBaseType DATABASE_TYPE; + +// public static void doPretreatment(Configuration originalConfig) { +// doPretreatment(originalConfig,null); +// } + + public static void doPretreatment(Configuration originalConfig, DataBaseType dataBaseType) { + // 检查 username/password 配置(必填) + originalConfig.getNecessaryValue(Key.USERNAME, DBUtilErrorCode.REQUIRED_VALUE); + originalConfig.getNecessaryValue(Key.PASSWORD, DBUtilErrorCode.REQUIRED_VALUE); + + doCheckBatchSize(originalConfig); + + simplifyConf(originalConfig); + + dealColumnConf(originalConfig); + dealWriteMode(originalConfig, dataBaseType); + } + + public static void doCheckBatchSize(Configuration originalConfig) { + // 检查batchSize 配置(选填,如果未填写,则设置为默认值) + int batchSize = originalConfig.getInt(Key.BATCH_SIZE, Constant.DEFAULT_BATCH_SIZE); + if (batchSize < 1) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, String.format( + "您的batchSize配置有误. 您所配置的写入数据库表的 batchSize:%s 不能小于1. 推荐配置范围为:[100-1000], 该值越大, 内存溢出可能性越大. 请检查您的配置并作出修改.", + batchSize)); + } + + originalConfig.set(Key.BATCH_SIZE, batchSize); + } + + public static void simplifyConf(Configuration originalConfig) { + List connections = originalConfig.getList(Constant.CONN_MARK, + Object.class); + + int tableNum = 0; + + for (int i = 0, len = connections.size(); i < len; i++) { + Configuration connConf = Configuration.from(connections.get(i).toString()); + + String jdbcUrl = connConf.getString(Key.JDBC_URL); + if (StringUtils.isBlank(jdbcUrl)) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, "您未配置的写入数据库表的 jdbcUrl."); + } + + jdbcUrl = DATABASE_TYPE.appendJDBCSuffixForReader(jdbcUrl); + originalConfig.set(String.format("%s[%d].%s", Constant.CONN_MARK, i, Key.JDBC_URL), + jdbcUrl); + + List tables = connConf.getList(Key.TABLE, String.class); + + if (null == tables || tables.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + "您未配置写入数据库表的表名称. 根据配置DataX找不到您配置的表. 请检查您的配置并作出修改."); + } + + // 对每一个connection 上配置的table 项进行解析 + List expandedTables = TableExpandUtil + .expandTableConf(DATABASE_TYPE, tables); + + if (null == expandedTables || expandedTables.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + "您配置的写入数据库表名称错误. 
DataX找不到您配置的表,请检查您的配置并作出修改."); + } + + tableNum += expandedTables.size(); + + originalConfig.set(String.format("%s[%d].%s", Constant.CONN_MARK, + i, Key.TABLE), expandedTables); + } + + originalConfig.set(Constant.TABLE_NUMBER_MARK, tableNum); + } + + public static void dealColumnConf(Configuration originalConfig, ConnectionFactory connectionFactory, String oneTable) { + List userConfiguredColumns = originalConfig.getList(Key.COLUMN, String.class); + if (null == userConfiguredColumns || userConfiguredColumns.isEmpty()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + "您的配置文件中的列配置信息有误. 因为您未配置写入数据库表的列名称,DataX获取不到列信息. 请检查您的配置并作出修改."); + } else { + boolean isPreCheck = originalConfig.getBool(Key.DRYRUN, false); + List allColumns; + if (isPreCheck){ + allColumns = DBUtil.getTableColumnsByConn(DATABASE_TYPE,connectionFactory.getConnecttionWithoutRetry(), oneTable, connectionFactory.getConnectionInfo()); + }else{ + allColumns = DBUtil.getTableColumnsByConn(DATABASE_TYPE,connectionFactory.getConnecttion(), oneTable, connectionFactory.getConnectionInfo()); + } + + LOG.info("table:[{}] all columns:[\n{}\n].", oneTable, + StringUtils.join(allColumns, ",")); + + if (1 == userConfiguredColumns.size() && "*".equals(userConfiguredColumns.get(0))) { + LOG.warn("您的配置文件中的列配置信息存在风险. 因为您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改."); + + // 回填其值,需要以 String 的方式转交后续处理 + originalConfig.set(Key.COLUMN, allColumns); + } else if (userConfiguredColumns.size() > allColumns.size()) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + String.format("您的配置文件中的列配置信息有误. 因为您所配置的写入数据库表的字段个数:%s 大于目的表的总字段总个数:%s. 请检查您的配置并作出修改.", + userConfiguredColumns.size(), allColumns.size())); + } else { + // 确保用户配置的 column 不重复 + ListUtil.makeSureNoValueDuplicate(userConfiguredColumns, false); + + // 检查列是否都为数据库表中正确的列(通过执行一次 select column from table 进行判断) + DBUtil.getColumnMetaData(connectionFactory.getConnecttion(), oneTable,StringUtils.join(userConfiguredColumns, ",")); + } + } + } + + public static void dealColumnConf(Configuration originalConfig) { + String jdbcUrl = originalConfig.getString(String.format("%s[0].%s", + Constant.CONN_MARK, Key.JDBC_URL)); + + String username = originalConfig.getString(Key.USERNAME); + String password = originalConfig.getString(Key.PASSWORD); + String oneTable = originalConfig.getString(String.format( + "%s[0].%s[0]", Constant.CONN_MARK, Key.TABLE)); + + JdbcConnectionFactory jdbcConnectionFactory = new JdbcConnectionFactory(DATABASE_TYPE, jdbcUrl, username, password); + dealColumnConf(originalConfig, jdbcConnectionFactory, oneTable); + } + + public static void dealWriteMode(Configuration originalConfig, DataBaseType dataBaseType) { + List columns = originalConfig.getList(Key.COLUMN, String.class); + + String jdbcUrl = originalConfig.getString(String.format("%s[0].%s", + Constant.CONN_MARK, Key.JDBC_URL, String.class)); + + // 默认为:insert 方式 + String writeMode = originalConfig.getString(Key.WRITE_MODE, "INSERT"); + + List valueHolders = new ArrayList(columns.size()); + for (int i = 0; i < columns.size(); i++) { + valueHolders.add("?"); + } + + boolean forceUseUpdate = false; + //ob10的处理 + if (dataBaseType == DataBaseType.MySql && isOB10(jdbcUrl)) { + forceUseUpdate = true; + } + + String writeDataSqlTemplate = WriterUtil.getWriteTemplate(columns, valueHolders, writeMode,dataBaseType, forceUseUpdate); + + LOG.info("Write data [\n{}\n], which jdbcUrl like:[{}]", writeDataSqlTemplate, jdbcUrl); + + 
originalConfig.set(Constant.INSERT_OR_REPLACE_TEMPLATE_MARK, writeDataSqlTemplate); + } + + public static boolean isOB10(String jdbcUrl) { + //ob10的处理 + if (jdbcUrl.startsWith(com.alibaba.datax.plugin.rdbms.writer.Constant.OB10_SPLIT_STRING)) { + String[] ss = jdbcUrl.split(com.alibaba.datax.plugin.rdbms.writer.Constant.OB10_SPLIT_STRING_PATTERN); + if (ss.length != 3) { + throw DataXException + .asDataXException( + DBUtilErrorCode.JDBC_OB10_ADDRESS_ERROR, "JDBC OB10格式错误,请联系askdatax"); + } + return true; + } + return false; + } + +} diff --git a/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/WriterUtil.java b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/WriterUtil.java new file mode 100755 index 0000000000..5f5f0d5114 --- /dev/null +++ b/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/WriterUtil.java @@ -0,0 +1,218 @@ +package com.alibaba.datax.plugin.rdbms.writer.util; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.util.RdbmsException; +import com.alibaba.datax.plugin.rdbms.writer.Constant; +import com.alibaba.datax.plugin.rdbms.writer.Key; +import com.alibaba.druid.sql.parser.ParserException; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.sql.Connection; +import java.sql.Statement; +import java.util.*; + +public final class WriterUtil { + private static final Logger LOG = LoggerFactory.getLogger(WriterUtil.class); + + //TODO 切分报错 + public static List doSplit(Configuration simplifiedConf, + int adviceNumber) { + + List splitResultConfigs = new ArrayList(); + + int tableNumber = simplifiedConf.getInt(Constant.TABLE_NUMBER_MARK); + + //处理单表的情况 + if (tableNumber == 1) { + //由于在之前的 master prepare 中已经把 table,jdbcUrl 提取出来,所以这里处理十分简单 + for (int j = 0; j < adviceNumber; j++) { + splitResultConfigs.add(simplifiedConf.clone()); + } + + return splitResultConfigs; + } + + if (tableNumber != adviceNumber) { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + String.format("您的配置文件中的列配置信息有误. 您要写入的目的端的表个数是:%s , 但是根据系统建议需要切分的份数是:%s. 
请检查您的配置并作出修改.", + tableNumber, adviceNumber)); + } + + String jdbcUrl; + List preSqls = simplifiedConf.getList(Key.PRE_SQL, String.class); + List postSqls = simplifiedConf.getList(Key.POST_SQL, String.class); + + List conns = simplifiedConf.getList(Constant.CONN_MARK, + Object.class); + + for (Object conn : conns) { + Configuration sliceConfig = simplifiedConf.clone(); + + Configuration connConf = Configuration.from(conn.toString()); + jdbcUrl = connConf.getString(Key.JDBC_URL); + sliceConfig.set(Key.JDBC_URL, jdbcUrl); + + sliceConfig.remove(Constant.CONN_MARK); + + List tables = connConf.getList(Key.TABLE, String.class); + + for (String table : tables) { + Configuration tempSlice = sliceConfig.clone(); + tempSlice.set(Key.TABLE, table); + tempSlice.set(Key.PRE_SQL, renderPreOrPostSqls(preSqls, table)); + tempSlice.set(Key.POST_SQL, renderPreOrPostSqls(postSqls, table)); + + splitResultConfigs.add(tempSlice); + } + + } + + return splitResultConfigs; + } + + public static List renderPreOrPostSqls(List preOrPostSqls, String tableName) { + if (null == preOrPostSqls) { + return Collections.emptyList(); + } + + List renderedSqls = new ArrayList(); + for (String sql : preOrPostSqls) { + //preSql为空时,不加入执行队列 + if (StringUtils.isNotBlank(sql)) { + renderedSqls.add(sql.replace(Constant.TABLE_NAME_PLACEHOLDER, tableName)); + } + } + + return renderedSqls; + } + + public static void executeSqls(Connection conn, List sqls, String basicMessage,DataBaseType dataBaseType) { + Statement stmt = null; + String currentSql = null; + try { + stmt = conn.createStatement(); + for (String sql : sqls) { + currentSql = sql; + DBUtil.executeSqlWithoutResultSet(stmt, sql); + } + } catch (Exception e) { + throw RdbmsException.asQueryException(dataBaseType,e,currentSql,null,null); + } finally { + DBUtil.closeDBResources(null, stmt, null); + } + } + + public static String getWriteTemplate(List columnHolders, List valueHolders, String writeMode, DataBaseType dataBaseType, boolean forceUseUpdate) { + boolean isWriteModeLegal = writeMode.trim().toLowerCase().startsWith("insert") + || writeMode.trim().toLowerCase().startsWith("replace") + || writeMode.trim().toLowerCase().startsWith("update"); + + if (!isWriteModeLegal) { + throw DataXException.asDataXException(DBUtilErrorCode.ILLEGAL_VALUE, + String.format("您所配置的 writeMode:%s 错误. 因为DataX 目前仅支持replace,update 或 insert 方式. 
请检查您的配置并作出修改.", writeMode)); + } + // && writeMode.trim().toLowerCase().startsWith("replace") + String writeDataSqlTemplate; + if (forceUseUpdate || + ((dataBaseType == DataBaseType.MySql || dataBaseType == DataBaseType.Tddl) && writeMode.trim().toLowerCase().startsWith("update")) + ) { + //update只在mysql下使用 + + writeDataSqlTemplate = new StringBuilder() + .append("INSERT INTO %s (").append(StringUtils.join(columnHolders, ",")) + .append(") VALUES(").append(StringUtils.join(valueHolders, ",")) + .append(")") + .append(onDuplicateKeyUpdateString(columnHolders)) + .toString(); + } else { + + //这里是保护,如果其他错误的使用了update,需要更换为replace + if (writeMode.trim().toLowerCase().startsWith("update")) { + writeMode = "replace"; + } + writeDataSqlTemplate = new StringBuilder().append(writeMode) + .append(" INTO %s (").append(StringUtils.join(columnHolders, ",")) + .append(") VALUES(").append(StringUtils.join(valueHolders, ",")) + .append(")").toString(); + } + + return writeDataSqlTemplate; + } + + public static String onDuplicateKeyUpdateString(List columnHolders){ + if (columnHolders == null || columnHolders.size() < 1) { + return ""; + } + StringBuilder sb = new StringBuilder(); + sb.append(" ON DUPLICATE KEY UPDATE "); + boolean first = true; + for(String column:columnHolders){ + if(!first){ + sb.append(","); + }else{ + first = false; + } + sb.append(column); + sb.append("=VALUES("); + sb.append(column); + sb.append(")"); + } + + return sb.toString(); + } + + public static void preCheckPrePareSQL(Configuration originalConfig, DataBaseType type) { + List conns = originalConfig.getList(Constant.CONN_MARK, Object.class); + Configuration connConf = Configuration.from(conns.get(0).toString()); + String table = connConf.getList(Key.TABLE, String.class).get(0); + + List preSqls = originalConfig.getList(Key.PRE_SQL, + String.class); + List renderedPreSqls = WriterUtil.renderPreOrPostSqls( + preSqls, table); + + if (null != renderedPreSqls && !renderedPreSqls.isEmpty()) { + LOG.info("Begin to preCheck preSqls:[{}].", + StringUtils.join(renderedPreSqls, ";")); + for(String sql : renderedPreSqls) { + try{ + DBUtil.sqlValid(sql, type); + }catch(ParserException e) { + throw RdbmsException.asPreSQLParserException(type,e,sql); + } + } + } + } + + public static void preCheckPostSQL(Configuration originalConfig, DataBaseType type) { + List conns = originalConfig.getList(Constant.CONN_MARK, Object.class); + Configuration connConf = Configuration.from(conns.get(0).toString()); + String table = connConf.getList(Key.TABLE, String.class).get(0); + + List postSqls = originalConfig.getList(Key.POST_SQL, + String.class); + List renderedPostSqls = WriterUtil.renderPreOrPostSqls( + postSqls, table); + if (null != renderedPostSqls && !renderedPostSqls.isEmpty()) { + + LOG.info("Begin to preCheck postSqls:[{}].", + StringUtils.join(renderedPostSqls, ";")); + for(String sql : renderedPostSqls) { + try{ + DBUtil.sqlValid(sql, type); + }catch(ParserException e){ + throw RdbmsException.asPostSQLParserException(type,e,sql); + } + + } + } + } + + +} diff --git a/plugin-unstructured-storage-util/pom.xml b/plugin-unstructured-storage-util/pom.xml new file mode 100755 index 0000000000..e344e8a24a --- /dev/null +++ b/plugin-unstructured-storage-util/pom.xml @@ -0,0 +1,97 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + plugin-unstructured-storage-util + plugin-unstructured-storage-util + plugin-unstructured-storage-util通用的文件类型的读取写入方法, + 供TxtFileReader/Writer, OSSReader/Writer ,FtpReader/Writer, HdfsReader/Writer使用。 + jar + 
+ 2.7.1 + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + net.sourceforge.javacsv + javacsv + 2.0 + + + org.apache.commons + commons-compress + 1.9 + + + org.anarres.lzo + lzo-core + 1.0.5 + + + com.aliyun.oss + aliyun-sdk-oss + 2.0.2 + test + + + io.airlift + aircompressor + 0.3 + + + com.facebook.presto.hadoop + hadoop-apache2 + 0.3 + provided + + + junit + junit + test + + + commons-beanutils + commons-beanutils + 1.9.2 + + + + org.apache.hadoop + hadoop-common + ${hadoop.version} + + + org.apache.commons + commons-compress + + + + + \ No newline at end of file diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ColumnEntry.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ColumnEntry.java new file mode 100644 index 0000000000..ee3af81601 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ColumnEntry.java @@ -0,0 +1,63 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +import java.text.DateFormat; +import java.text.SimpleDateFormat; + +import org.apache.commons.lang3.StringUtils; + +import com.alibaba.fastjson.JSON; + +public class ColumnEntry { + private Integer index; + private String type; + private String value; + private String format; + private DateFormat dateParse; + + public Integer getIndex() { + return index; + } + + public void setIndex(Integer index) { + this.index = index; + } + + public String getType() { + return type; + } + + public void setType(String type) { + this.type = type; + } + + public String getValue() { + return value; + } + + public void setValue(String value) { + this.value = value; + } + + public String getFormat() { + return format; + } + + public void setFormat(String format) { + this.format = format; + if (StringUtils.isNotBlank(this.format)) { + this.dateParse = new SimpleDateFormat(this.format); + } + } + + public DateFormat getDateFormat() { + return this.dateParse; + } + + public String toJSONString() { + return ColumnEntry.toJSONString(this); + } + + public static String toJSONString(ColumnEntry columnEntry) { + return JSON.toJSONString(columnEntry); + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Constant.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Constant.java new file mode 100755 index 0000000000..7c6bc13956 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Constant.java @@ -0,0 +1,13 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +public class Constant { + public static final String DEFAULT_ENCODING = "UTF-8"; + + public static final char DEFAULT_FIELD_DELIMITER = ','; + + public static final boolean DEFAULT_SKIP_HEADER = false; + + public static final String DEFAULT_NULL_FORMAT = "\\N"; + + public static final Integer DEFAULT_BUFFER_SIZE = 8192; +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ExpandLzopInputStream.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ExpandLzopInputStream.java new file mode 100644 index 0000000000..66b1b0c730 --- 
/dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ExpandLzopInputStream.java @@ -0,0 +1,150 @@ +/* + * description: + * + * 使用了shevek在github上开源的lzo解压缩代码(https://github.com/shevek/lzo-java) + * + * 继承LzopInputStream的原因是因为开源版本代码中LZO_LIBRARY_VERSION是这样定义的: + * public static final short LZO_LIBRARY_VERSION = 0x2050; + * 而很多lzo文件LZO_LIBRARY_VERSION是0x2060,要解压这种version的lzo文件,必须要更改 + * LZO_LIBRARY_VERSION的值,才不会抛异常,而LZO_LIBRARY_VERSION是final类型的,无法更改 + * 其值,于是继承了LzopInputStream的类,重新定义了LZO_LIBRARY_VERSION的值。 + * + */ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +import org.anarres.lzo.LzoVersion; +import org.anarres.lzo.LzopConstants; +import org.anarres.lzo.LzopInputStream; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + +import javax.annotation.Nonnegative; +import javax.annotation.Nonnull; +import java.io.IOException; +import java.io.InputStream; +import java.util.Arrays; +import java.util.zip.Adler32; +import java.util.zip.CRC32; + +/** + * Created by mingya.wmy on 16/8/26. + */ +public class ExpandLzopInputStream extends LzopInputStream { + + + public ExpandLzopInputStream(@Nonnull InputStream in) throws IOException { + super(in); + } + + /** + * Read and verify an lzo header, setting relevant block checksum options + * and ignoring most everything else. + */ + @Override + protected int readHeader() throws IOException { + short LZO_LIBRARY_VERSION = 0x2060; + Log LOG = LogFactory.getLog(LzopInputStream.class); + byte[] LZOP_MAGIC = new byte[]{ + -119, 'L', 'Z', 'O', 0, '\r', '\n', '\032', '\n'}; + byte[] buf = new byte[9]; + readBytes(buf, 0, 9); + if (!Arrays.equals(buf, LZOP_MAGIC)) + throw new IOException("Invalid LZO header"); + Arrays.fill(buf, (byte) 0); + Adler32 adler = new Adler32(); + CRC32 crc32 = new CRC32(); + int hitem = readHeaderItem(buf, 2, adler, crc32); // lzop version + if (hitem > LzopConstants.LZOP_VERSION) { + LOG.debug("Compressed with later version of lzop: " + + Integer.toHexString(hitem) + " (expected 0x" + + Integer.toHexString(LzopConstants.LZOP_VERSION) + ")"); + } + hitem = readHeaderItem(buf, 2, adler, crc32); // lzo library version + if (hitem > LZO_LIBRARY_VERSION) { + throw new IOException("Compressed with incompatible lzo version: 0x" + + Integer.toHexString(hitem) + " (expected 0x" + + Integer.toHexString(LzoVersion.LZO_LIBRARY_VERSION) + ")"); + } + hitem = readHeaderItem(buf, 2, adler, crc32); // lzop extract version + if (hitem > LzopConstants.LZOP_VERSION) { + throw new IOException("Compressed with incompatible lzop version: 0x" + + Integer.toHexString(hitem) + " (expected 0x" + + Integer.toHexString(LzopConstants.LZOP_VERSION) + ")"); + } + hitem = readHeaderItem(buf, 1, adler, crc32); // method + switch (hitem) { + case LzopConstants.M_LZO1X_1: + case LzopConstants.M_LZO1X_1_15: + case LzopConstants.M_LZO1X_999: + break; + default: + throw new IOException("Invalid strategy " + Integer.toHexString(hitem)); + } + readHeaderItem(buf, 1, adler, crc32); // ignore level + + // flags + int flags = readHeaderItem(buf, 4, adler, crc32); + boolean useCRC32 = (flags & LzopConstants.F_H_CRC32) != 0; + boolean extraField = (flags & LzopConstants.F_H_EXTRA_FIELD) != 0; + if ((flags & LzopConstants.F_MULTIPART) != 0) + throw new IOException("Multipart lzop not supported"); + if ((flags & LzopConstants.F_H_FILTER) != 0) + throw new IOException("lzop filter not supported"); + if ((flags & LzopConstants.F_RESERVED) != 0) + throw new 
IOException("Unknown flags in header"); + // known !F_H_FILTER, so no optional block + + readHeaderItem(buf, 4, adler, crc32); // ignore mode + readHeaderItem(buf, 4, adler, crc32); // ignore mtime + readHeaderItem(buf, 4, adler, crc32); // ignore gmtdiff + hitem = readHeaderItem(buf, 1, adler, crc32); // fn len + if (hitem > 0) { + byte[] tmp = (hitem > buf.length) ? new byte[hitem] : buf; + readHeaderItem(tmp, hitem, adler, crc32); // skip filename + } + int checksum = (int) (useCRC32 ? crc32.getValue() : adler.getValue()); + hitem = readHeaderItem(buf, 4, adler, crc32); // read checksum + if (hitem != checksum) { + throw new IOException("Invalid header checksum: " + + Long.toHexString(checksum) + " (expected 0x" + + Integer.toHexString(hitem) + ")"); + } + if (extraField) { // lzop 1.08 ultimately ignores this + LOG.debug("Extra header field not processed"); + adler.reset(); + crc32.reset(); + hitem = readHeaderItem(buf, 4, adler, crc32); + readHeaderItem(new byte[hitem], hitem, adler, crc32); + checksum = (int) (useCRC32 ? crc32.getValue() : adler.getValue()); + if (checksum != readHeaderItem(buf, 4, adler, crc32)) { + throw new IOException("Invalid checksum for extra header field"); + } + } + + return flags; + } + + private int readHeaderItem(@Nonnull byte[] buf, @Nonnegative int len, @Nonnull Adler32 adler, @Nonnull CRC32 crc32) throws IOException { + int ret = readInt(buf, len); + adler.update(buf, 0, len); + crc32.update(buf, 0, len); + Arrays.fill(buf, (byte) 0); + return ret; + } + + /** + * Read len bytes into buf, st LSB of int returned is the last byte of the + * first word read. + */ + // @Nonnegative ? + private int readInt(@Nonnull byte[] buf, @Nonnegative int len) + throws IOException { + readBytes(buf, 0, len); + int ret = (0xFF & buf[0]) << 24; + ret |= (0xFF & buf[1]) << 16; + ret |= (0xFF & buf[2]) << 8; + ret |= (0xFF & buf[3]); + return (len > 3) ? ret : (ret >>> (8 * (4 - len))); + } + +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Key.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Key.java new file mode 100755 index 0000000000..bb5bf59fee --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/Key.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +/** + * Created by haiwei.luo on 14-12-5. 
+ */ +public class Key { + public static final String COLUMN = "column"; + + public static final String ENCODING = "encoding"; + + public static final String FIELD_DELIMITER = "fieldDelimiter"; + + public static final String SKIP_HEADER = "skipHeader"; + + public static final String TYPE = "type"; + + public static final String FORMAT = "format"; + + public static final String INDEX = "index"; + + public static final String VALUE = "value"; + + public static final String COMPRESS = "compress"; + + public static final String NULL_FORMAT = "nullFormat"; + + public static final String BUFFER_SIZE = "bufferSize"; + + public static final String CSV_READER_CONFIG = "csvReaderConfig"; + +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderErrorCode.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderErrorCode.java new file mode 100755 index 0000000000..911bf3457e --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-20. + */ +public enum UnstructuredStorageReaderErrorCode implements ErrorCode { + CONFIG_INVALID_EXCEPTION("UnstructuredStorageReader-00", "您的参数配置错误."), + NOT_SUPPORT_TYPE("UnstructuredStorageReader-01","您配置的列类型暂不支持."), + REQUIRED_VALUE("UnstructuredStorageReader-02", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("UnstructuredStorageReader-03", "您填写的参数值不合法."), + MIXED_INDEX_VALUE("UnstructuredStorageReader-04", "您的列信息配置同时包含了index,value."), + NO_INDEX_VALUE("UnstructuredStorageReader-05","您明确的配置列信息,但未填写相应的index,value."), + FILE_NOT_EXISTS("UnstructuredStorageReader-06", "您配置的源路径不存在."), + OPEN_FILE_WITH_CHARSET_ERROR("UnstructuredStorageReader-07", "您配置的编码和实际存储编码不符合."), + OPEN_FILE_ERROR("UnstructuredStorageReader-08", "您配置的源在打开时异常,建议您检查源源是否有隐藏实体,管道文件等特殊文件."), + READ_FILE_IO_ERROR("UnstructuredStorageReader-09", "您配置的文件在读取时出现IO异常."), + SECURITY_NOT_ENOUGH("UnstructuredStorageReader-10", "您缺少权限执行相应的文件读取操作."), + RUNTIME_EXCEPTION("UnstructuredStorageReader-11", "出现运行时异常, 请联系我们"); + + private final String code; + private final String description; + + private UnstructuredStorageReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderUtil.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderUtil.java new file mode 100755 index 0000000000..423f66db99 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/UnstructuredStorageReaderUtil.java @@ -0,0 +1,698 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import 
com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSON; +import com.alibaba.fastjson.JSONObject; +import com.alibaba.fastjson.TypeReference; +import com.csvreader.CsvReader; +import org.apache.commons.beanutils.BeanUtils; +import io.airlift.compress.snappy.SnappyCodec; +import io.airlift.compress.snappy.SnappyFramedInputStream; +import org.anarres.lzo.*; +import org.apache.commons.compress.compressors.CompressorInputStream; +import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream; +import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream; +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.*; +import java.nio.charset.UnsupportedCharsetException; +import java.text.DateFormat; +import java.util.ArrayList; +import java.util.Date; +import java.util.HashMap; +import java.util.List; + +public class UnstructuredStorageReaderUtil { + private static final Logger LOG = LoggerFactory + .getLogger(UnstructuredStorageReaderUtil.class); + public static HashMap csvReaderConfigMap; + + private UnstructuredStorageReaderUtil() { + + } + + /** + * @param inputLine + * 输入待分隔字符串 + * @param delimiter + * 字符串分割符 + * @return 分隔符分隔后的字符串数组,出现异常时返回为null 支持转义,即数据中可包含分隔符 + * */ + public static String[] splitOneLine(String inputLine, char delimiter) { + String[] splitedResult = null; + if (null != inputLine) { + try { + CsvReader csvReader = new CsvReader(new StringReader(inputLine)); + csvReader.setDelimiter(delimiter); + + setCsvReaderConfig(csvReader); + + if (csvReader.readRecord()) { + splitedResult = csvReader.getValues(); + } + } catch (IOException e) { + // nothing to do + } + } + return splitedResult; + } + + public static String[] splitBufferedReader(CsvReader csvReader) + throws IOException { + String[] splitedResult = null; + if (csvReader.readRecord()) { + splitedResult = csvReader.getValues(); + } + return splitedResult; + } + + /** + * 不支持转义 + * + * @return 分隔符分隔后的字符串数, + * */ + public static String[] splitOneLine(String inputLine, String delimiter) { + String[] splitedResult = StringUtils.split(inputLine, delimiter); + return splitedResult; + } + + public static void readFromStream(InputStream inputStream, String context, + Configuration readerSliceConfig, RecordSender recordSender, + TaskPluginCollector taskPluginCollector) { + String compress = readerSliceConfig.getString(Key.COMPRESS, null); + if (StringUtils.isBlank(compress)) { + compress = null; + } + String encoding = readerSliceConfig.getString(Key.ENCODING, + Constant.DEFAULT_ENCODING); + // handle blank encoding + if (StringUtils.isBlank(encoding)) { + encoding = Constant.DEFAULT_ENCODING; + LOG.warn(String.format("您配置的encoding为[%s], 使用默认值[%s]", encoding, + Constant.DEFAULT_ENCODING)); + } + + List column = readerSliceConfig + .getListConfiguration(Key.COLUMN); + // handle ["*"] -> [], null + if (null != column && 1 == column.size() + && "\"*\"".equals(column.get(0).toString())) { + readerSliceConfig.set(Key.COLUMN, null); + column = null; + } + + BufferedReader reader = null; + int bufferSize = readerSliceConfig.getInt(Key.BUFFER_SIZE, + Constant.DEFAULT_BUFFER_SIZE); + + // compress logic + try { + if (null == compress) { + reader = new BufferedReader(new 
InputStreamReader(inputStream, + encoding), bufferSize); + } else { + // TODO compress + if ("lzo_deflate".equalsIgnoreCase(compress)) { + LzoInputStream lzoInputStream = new LzoInputStream( + inputStream, new LzoDecompressor1x_safe()); + reader = new BufferedReader(new InputStreamReader( + lzoInputStream, encoding)); + } else if ("lzo".equalsIgnoreCase(compress)) { + LzoInputStream lzopInputStream = new ExpandLzopInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + lzopInputStream, encoding)); + } else if ("gzip".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new GzipCompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding), bufferSize); + } else if ("bzip2".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new BZip2CompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding), bufferSize); + } else if ("hadoop-snappy".equalsIgnoreCase(compress)) { + CompressionCodec snappyCodec = new SnappyCodec(); + InputStream snappyInputStream = snappyCodec.createInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + snappyInputStream, encoding)); + } else if ("framing-snappy".equalsIgnoreCase(compress)) { + InputStream snappyInputStream = new SnappyFramedInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + snappyInputStream, encoding)); + }/* else if ("lzma".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new LZMACompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } *//*else if ("pack200".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new Pack200CompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } *//*else if ("xz".equalsIgnoreCase(compress)) { + CompressorInputStream compressorInputStream = new XZCompressorInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + compressorInputStream, encoding)); + } else if ("ar".equalsIgnoreCase(compress)) { + ArArchiveInputStream arArchiveInputStream = new ArArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + arArchiveInputStream, encoding)); + } else if ("arj".equalsIgnoreCase(compress)) { + ArjArchiveInputStream arjArchiveInputStream = new ArjArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + arjArchiveInputStream, encoding)); + } else if ("cpio".equalsIgnoreCase(compress)) { + CpioArchiveInputStream cpioArchiveInputStream = new CpioArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + cpioArchiveInputStream, encoding)); + } else if ("dump".equalsIgnoreCase(compress)) { + DumpArchiveInputStream dumpArchiveInputStream = new DumpArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + dumpArchiveInputStream, encoding)); + } else if ("jar".equalsIgnoreCase(compress)) { + JarArchiveInputStream jarArchiveInputStream = new JarArchiveInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + jarArchiveInputStream, encoding)); + } else if ("tar".equalsIgnoreCase(compress)) { + TarArchiveInputStream tarArchiveInputStream = new TarArchiveInputStream( + inputStream); + reader = 
new BufferedReader(new InputStreamReader( + tarArchiveInputStream, encoding)); + }*/ + else if ("zip".equalsIgnoreCase(compress)) { + ZipCycleInputStream zipCycleInputStream = new ZipCycleInputStream( + inputStream); + reader = new BufferedReader(new InputStreamReader( + zipCycleInputStream, encoding), bufferSize); + } else { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅支持 gzip, bzip2, zip, lzo, lzo_deflate, hadoop-snappy, framing-snappy" + + "文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", compress)); + } + } + UnstructuredStorageReaderUtil.doReadFromStream(reader, context, + readerSliceConfig, recordSender, taskPluginCollector); + } catch (UnsupportedEncodingException uee) { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.OPEN_FILE_WITH_CHARSET_ERROR, + String.format("不支持的编码格式 : [%s]", encoding), uee); + } catch (NullPointerException e) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.RUNTIME_EXCEPTION, + "运行时错误, 请联系我们", e); + }/* catch (ArchiveException e) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.READ_FILE_IO_ERROR, + String.format("压缩文件流读取错误 : [%s]", context), e); + } */catch (IOException e) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.READ_FILE_IO_ERROR, + String.format("流读取错误 : [%s]", context), e); + } finally { + IOUtils.closeQuietly(reader); + } + + } + + public static void doReadFromStream(BufferedReader reader, String context, + Configuration readerSliceConfig, RecordSender recordSender, + TaskPluginCollector taskPluginCollector) { + String encoding = readerSliceConfig.getString(Key.ENCODING, + Constant.DEFAULT_ENCODING); + Character fieldDelimiter = null; + String delimiterInStr = readerSliceConfig + .getString(Key.FIELD_DELIMITER); + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + if (null == delimiterInStr) { + LOG.warn(String.format("您没有配置列分隔符, 使用默认值[%s]", + Constant.DEFAULT_FIELD_DELIMITER)); + } + + // warn: default value ',', fieldDelimiter could be \n(lineDelimiter) + // for no fieldDelimiter + fieldDelimiter = readerSliceConfig.getChar(Key.FIELD_DELIMITER, + Constant.DEFAULT_FIELD_DELIMITER); + Boolean skipHeader = readerSliceConfig.getBool(Key.SKIP_HEADER, + Constant.DEFAULT_SKIP_HEADER); + // warn: no default value '\N' + String nullFormat = readerSliceConfig.getString(Key.NULL_FORMAT); + + // warn: Configuration -> List for performance + // List column = readerSliceConfig + // .getListConfiguration(Key.COLUMN); + List column = UnstructuredStorageReaderUtil + .getListColumnEntry(readerSliceConfig, Key.COLUMN); + CsvReader csvReader = null; + + // every line logic + try { + // TODO lineDelimiter + if (skipHeader) { + String fetchLine = reader.readLine(); + LOG.info(String.format("Header line %s has been skiped.", + fetchLine)); + } + csvReader = new CsvReader(reader); + csvReader.setDelimiter(fieldDelimiter); + + setCsvReaderConfig(csvReader); + + String[] parseRows; + while ((parseRows = UnstructuredStorageReaderUtil + .splitBufferedReader(csvReader)) != null) { + UnstructuredStorageReaderUtil.transportOneRecord(recordSender, + column, parseRows, nullFormat, taskPluginCollector); + } + } catch (UnsupportedEncodingException uee) { + throw DataXException + .asDataXException( + 
UnstructuredStorageReaderErrorCode.OPEN_FILE_WITH_CHARSET_ERROR, + String.format("不支持的编码格式 : [%s]", encoding), uee); + } catch (FileNotFoundException fnfe) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.FILE_NOT_EXISTS, + String.format("无法找到文件 : [%s]", context), fnfe); + } catch (IOException ioe) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.READ_FILE_IO_ERROR, + String.format("读取文件错误 : [%s]", context), ioe); + } catch (Exception e) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.RUNTIME_EXCEPTION, + String.format("运行时异常 : %s", e.getMessage()), e); + } finally { + csvReader.close(); + IOUtils.closeQuietly(reader); + } + } + + public static Record transportOneRecord(RecordSender recordSender, + Configuration configuration, + TaskPluginCollector taskPluginCollector, + String line){ + List column = UnstructuredStorageReaderUtil + .getListColumnEntry(configuration, Key.COLUMN); + // 注意: nullFormat 没有默认值 + String nullFormat = configuration.getString(Key.NULL_FORMAT); + String delimiterInStr = configuration.getString(Key.FIELD_DELIMITER); + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + if (null == delimiterInStr) { + LOG.warn(String.format("您没有配置列分隔符, 使用默认值[%s]", + Constant.DEFAULT_FIELD_DELIMITER)); + } + // warn: default value ',', fieldDelimiter could be \n(lineDelimiter) + // for no fieldDelimiter + Character fieldDelimiter = configuration.getChar(Key.FIELD_DELIMITER, + Constant.DEFAULT_FIELD_DELIMITER); + + String[] sourceLine = StringUtils.split(line, fieldDelimiter); + + return transportOneRecord(recordSender, column, sourceLine, nullFormat, taskPluginCollector); + } + + public static Record transportOneRecord(RecordSender recordSender, + List columnConfigs, String[] sourceLine, + String nullFormat, TaskPluginCollector taskPluginCollector) { + Record record = recordSender.createRecord(); + Column columnGenerated = null; + + // 创建都为String类型column的record + if (null == columnConfigs || columnConfigs.size() == 0) { + for (String columnValue : sourceLine) { + // not equalsIgnoreCase, it's all ok if nullFormat is null + if (columnValue.equals(nullFormat)) { + columnGenerated = new StringColumn(null); + } else { + columnGenerated = new StringColumn(columnValue); + } + record.addColumn(columnGenerated); + } + recordSender.sendToWriter(record); + } else { + try { + for (ColumnEntry columnConfig : columnConfigs) { + String columnType = columnConfig.getType(); + Integer columnIndex = columnConfig.getIndex(); + String columnConst = columnConfig.getValue(); + + String columnValue = null; + + if (null == columnIndex && null == columnConst) { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnConst) { + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + + if (null != columnIndex) { + if (columnIndex >= sourceLine.length) { + String message = String + .format("您尝试读取的列越界,源文件该行有 [%s] 列,您尝试读取第 [%s] 列, 数据详情[%s]", + sourceLine.length, columnIndex + 1, + StringUtils.join(sourceLine, ",")); + LOG.warn(message); + throw new IndexOutOfBoundsException(message); + } + + columnValue = sourceLine[columnIndex]; + } else { 
+ columnValue = columnConst; + } + Type type = Type.valueOf(columnType.toUpperCase()); + // it's all ok if nullFormat is null + if (columnValue.equals(nullFormat)) { + columnValue = null; + } + switch (type) { + case STRING: + columnGenerated = new StringColumn(columnValue); + break; + case LONG: + try { + columnGenerated = new LongColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "LONG")); + } + break; + case DOUBLE: + try { + columnGenerated = new DoubleColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DOUBLE")); + } + break; + case BOOLEAN: + try { + columnGenerated = new BoolColumn(columnValue); + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "BOOLEAN")); + } + + break; + case DATE: + try { + if (columnValue == null) { + Date date = null; + columnGenerated = new DateColumn(date); + } else { + String formatString = columnConfig.getFormat(); + //if (null != formatString) { + if (StringUtils.isNotBlank(formatString)) { + // 用户自己配置的格式转换, 脏数据行为出现变化 + DateFormat format = columnConfig + .getDateFormat(); + columnGenerated = new DateColumn( + format.parse(columnValue)); + } else { + // 框架尝试转换 + columnGenerated = new DateColumn( + new StringColumn(columnValue) + .asDate()); + } + } + } catch (Exception e) { + throw new IllegalArgumentException(String.format( + "类型转换错误, 无法将[%s] 转换为[%s]", columnValue, + "DATE")); + } + break; + default: + String errorMessage = String.format( + "您配置的列类型暂不支持 : [%s]", columnType); + LOG.error(errorMessage); + throw DataXException + .asDataXException( + UnstructuredStorageReaderErrorCode.NOT_SUPPORT_TYPE, + errorMessage); + } + + record.addColumn(columnGenerated); + + } + recordSender.sendToWriter(record); + } catch (IllegalArgumentException iae) { + taskPluginCollector + .collectDirtyRecord(record, iae.getMessage()); + } catch (IndexOutOfBoundsException ioe) { + taskPluginCollector + .collectDirtyRecord(record, ioe.getMessage()); + } catch (Exception e) { + if (e instanceof DataXException) { + throw (DataXException) e; + } + // 每一种转换失败都是脏数据处理,包括数字格式 & 日期格式 + taskPluginCollector.collectDirtyRecord(record, e.getMessage()); + } + } + + return record; + } + + public static List getListColumnEntry( + Configuration configuration, final String path) { + List lists = configuration.getList(path, JSONObject.class); + if (lists == null) { + return null; + } + List result = new ArrayList(); + for (final JSONObject object : lists) { + result.add(JSON.parseObject(object.toJSONString(), + ColumnEntry.class)); + } + return result; + } + + private enum Type { + STRING, LONG, BOOLEAN, DOUBLE, DATE, ; + } + + /** + * check parameter:encoding, compress, filedDelimiter + * */ + public static void validateParameter(Configuration readerConfiguration) { + + // encoding check + validateEncoding(readerConfiguration); + + //only support compress types + validateCompress(readerConfiguration); + + //fieldDelimiter check + validateFieldDelimiter(readerConfiguration); + + // column: 1. 
index type 2.value type 3.when type is Date, may have format + validateColumn(readerConfiguration); + + } + + public static void validateEncoding(Configuration readerConfiguration) { + // encoding check + String encoding = readerConfiguration + .getString( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_ENCODING); + try { + encoding = encoding.trim(); + readerConfiguration.set(Key.ENCODING, encoding); + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式 : [%s]", encoding), uce); + } catch (Exception e) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("编码配置异常, 请联系我们: %s", e.getMessage()), e); + } + } + + public static void validateCompress(Configuration readerConfiguration) { + String compress =readerConfiguration + .getUnnecessaryValue(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS,null,null); + if(StringUtils.isNotBlank(compress)){ + compress = compress.toLowerCase().trim(); + boolean compressTag = "gzip".equals(compress) || "bzip2".equals(compress) || "zip".equals(compress) + || "lzo".equals(compress) || "lzo_deflate".equals(compress) || "hadoop-snappy".equals(compress) + || "framing-snappy".equals(compress); + if (!compressTag) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅支持 gzip, bzip2, zip, lzo, lzo_deflate, hadoop-snappy, framing-snappy " + + "文件压缩格式, 不支持您配置的文件压缩格式: [%s]", compress)); + } + }else{ + // 用户可能配置的是 compress:"",空字符串,需要将compress设置为null + compress = null; + } + readerConfiguration.set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, compress); + + } + + public static void validateFieldDelimiter(Configuration readerConfiguration) { + //fieldDelimiter check + String delimiterInStr = readerConfiguration.getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER,null); + if(null == delimiterInStr){ + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.REQUIRED_VALUE, + String.format("您提供配置文件有误,[%s]是必填参数.", + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER)); + }else if(1 != delimiterInStr.length()){ + // warn: if have, length must be one + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + } + + public static void validateColumn(Configuration readerConfiguration) { + // column: 1. 
index type 2.value type 3.when type is Date, may have + // format + List columns = readerConfiguration + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + if (null == columns || columns.size() == 0) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.REQUIRED_VALUE, "您需要指定 columns"); + } + // handle ["*"] + if (null != columns && 1 == columns.size()) { + String columnsInStr = columns.get(0).toString(); + if ("\"*\"".equals(columnsInStr) || "'*'".equals(columnsInStr)) { + readerConfiguration.set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN, null); + columns = null; + } + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf.getNecessaryValue(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.TYPE, + UnstructuredStorageReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf + .getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.INDEX); + String columnValue = eachColumnConf + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + if (null != columnIndex && columnIndex < 0) { + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("index需要大于等于0, 您配置的index为[%s]", columnIndex)); + } + } + } + } + + public static void validateCsvReaderConfig(Configuration readerConfiguration) { + String csvReaderConfig = readerConfiguration.getString(Key.CSV_READER_CONFIG); + if(StringUtils.isNotBlank(csvReaderConfig)){ + try{ + UnstructuredStorageReaderUtil.csvReaderConfigMap = JSON.parseObject(csvReaderConfig, new TypeReference>() {}); + }catch (Exception e) { + LOG.info(String.format("WARN!!!!忽略csvReaderConfig配置! 配置错误,值只能为空或者为Map结构,您配置的值为: %s", csvReaderConfig)); + } + } + } + + /** + * + * @Title: getRegexPathParent + * @Description: 获取正则表达式目录的父目录 + * @param @param regexPath + * @param @return + * @return String + * @throws + */ + public static String getRegexPathParent(String regexPath){ + int endMark; + for (endMark = 0; endMark < regexPath.length(); endMark++) { + if ('*' != regexPath.charAt(endMark) && '?' != regexPath.charAt(endMark)) { + continue; + } else { + break; + } + } + int lastDirSeparator = regexPath.substring(0, endMark).lastIndexOf(IOUtils.DIR_SEPARATOR); + String parentPath = regexPath.substring(0,lastDirSeparator + 1); + + return parentPath; + } + /** + * + * @Title: getRegexPathParentPath + * @Description: 获取含有通配符路径的父目录,目前只支持在最后一级目录使用通配符*或者?. 
+ * (API jcraft.jsch.ChannelSftp.ls(String path)函数限制) http://epaul.github.io/jsch-documentation/javadoc/ + * @param @param regexPath + * @param @return + * @return String + * @throws + */ + public static String getRegexPathParentPath(String regexPath){ + int lastDirSeparator = regexPath.lastIndexOf(IOUtils.DIR_SEPARATOR); + String parentPath = ""; + parentPath = regexPath.substring(0,lastDirSeparator + 1); + if(parentPath.contains("*") || parentPath.contains("?")){ + throw DataXException.asDataXException(UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("配置项目path中:[%s]不合法,目前只支持在最后一级目录使用通配符*或者?", regexPath)); + } + return parentPath; + } + + public static void setCsvReaderConfig(CsvReader csvReader){ + if(null != UnstructuredStorageReaderUtil.csvReaderConfigMap && !UnstructuredStorageReaderUtil.csvReaderConfigMap.isEmpty()){ + try { + BeanUtils.populate(csvReader,UnstructuredStorageReaderUtil.csvReaderConfigMap); + LOG.info(String.format("csvReaderConfig设置成功,设置后CsvReader:%s", JSON.toJSONString(csvReader))); + } catch (Exception e) { + LOG.info(String.format("WARN!!!!忽略csvReaderConfig配置!通过BeanUtils.populate配置您的csvReaderConfig发生异常,您配置的值为: %s;请检查您的配置!CsvReader使用默认值[%s]", + JSON.toJSONString(UnstructuredStorageReaderUtil.csvReaderConfigMap),JSON.toJSONString(csvReader))); + } + }else { + //默认关闭安全模式, 放开10W字节的限制 + csvReader.setSafetySwitch(false); + LOG.info(String.format("CsvReader使用默认值[%s],csvReaderConfig值为[%s]",JSON.toJSONString(csvReader),JSON.toJSONString(UnstructuredStorageReaderUtil.csvReaderConfigMap))); + } + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ZipCycleInputStream.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ZipCycleInputStream.java new file mode 100644 index 0000000000..328856779d --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/reader/ZipCycleInputStream.java @@ -0,0 +1,59 @@ +package com.alibaba.datax.plugin.unstructuredstorage.reader; + +import java.io.IOException; +import java.io.InputStream; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class ZipCycleInputStream extends InputStream { + private static final Logger LOG = LoggerFactory + .getLogger(ZipCycleInputStream.class); + + private ZipInputStream zipInputStream; + private ZipEntry currentZipEntry; + + public ZipCycleInputStream(InputStream in) { + this.zipInputStream = new ZipInputStream(in); + } + + @Override + public int read() throws IOException { + // 定位一个Entry数据流的开头 + if (null == this.currentZipEntry) { + this.currentZipEntry = this.zipInputStream.getNextEntry(); + if (null == this.currentZipEntry) { + return -1; + } else { + LOG.info(String.format("Validate zipEntry with name: %s", + this.currentZipEntry.getName())); + } + } + + // 不支持zip下的嵌套, 对于目录跳过 + if (this.currentZipEntry.isDirectory()) { + LOG.warn(String.format("meet a directory %s, ignore...", + this.currentZipEntry.getName())); + this.currentZipEntry = null; + return this.read(); + } + + // 读取一个Entry数据流 + int result = this.zipInputStream.read(); + + // 当前Entry数据流结束了, 需要尝试下一个Entry + if (-1 == result) { + this.currentZipEntry = null; + return this.read(); + } else { + return result; + } + } + + @Override + public void close() throws IOException { + this.zipInputStream.close(); + } +} diff --git 
a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Constant.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Constant.java new file mode 100755 index 0000000000..93b4baa978 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Constant.java @@ -0,0 +1,19 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +public class Constant { + + public static final String DEFAULT_ENCODING = "UTF-8"; + + public static final char DEFAULT_FIELD_DELIMITER = ','; + + public static final String DEFAULT_NULL_FORMAT = "\\N"; + + public static final String FILE_FORMAT_CSV = "csv"; + + public static final String FILE_FORMAT_TEXT = "text"; + + //每个分块10MB,最大10000个分块 + public static final Long MAX_FILE_SIZE = 1024 * 1024 * 10 * 10000L; + + public static final String DEFAULT_SUFFIX = ""; +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Key.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Key.java new file mode 100755 index 0000000000..2e7fe079f3 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/Key.java @@ -0,0 +1,38 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +public class Key { + // must have + public static final String FILE_NAME = "fileName"; + + // must have + public static final String WRITE_MODE = "writeMode"; + + // not must , not default , + public static final String FIELD_DELIMITER = "fieldDelimiter"; + + // not must, default UTF-8 + public static final String ENCODING = "encoding"; + + // not must, default no compress + public static final String COMPRESS = "compress"; + + // not must, not default \N + public static final String NULL_FORMAT = "nullFormat"; + + // not must, date format old style, do not use this + public static final String FORMAT = "format"; + // for writers ' data format + public static final String DATE_FORMAT = "dateFormat"; + + // csv or plain text + public static final String FILE_FORMAT = "fileFormat"; + + // writer headers + public static final String HEADER = "header"; + + // writer maxFileSize + public static final String MAX_FILE_SIZE = "maxFileSize"; + + // writer file type suffix, like .txt .csv + public static final String SUFFIX = "suffix"; +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/TextCsvWriterManager.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/TextCsvWriterManager.java new file mode 100644 index 0000000000..1ea8275963 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/TextCsvWriterManager.java @@ -0,0 +1,95 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +import java.io.IOException; +import java.io.Writer; +import java.util.List; + +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.csvreader.CsvWriter; + +public class TextCsvWriterManager { + public static UnstructuredWriter produceUnstructuredWriter( + String fileFormat, char fieldDelimiter, Writer writer) { + // warn: false means plain text(old way), true means strict csv format + if 
(Constant.FILE_FORMAT_TEXT.equals(fileFormat)) { + return new TextWriterImpl(writer, fieldDelimiter); + } else { + return new CsvWriterImpl(writer, fieldDelimiter); + } + } +} + +class CsvWriterImpl implements UnstructuredWriter { + private static final Logger LOG = LoggerFactory + .getLogger(CsvWriterImpl.class); + // csv 严格符合csv语法, 有标准的转义等处理 + private char fieldDelimiter; + private CsvWriter csvWriter; + + public CsvWriterImpl(Writer writer, char fieldDelimiter) { + this.fieldDelimiter = fieldDelimiter; + this.csvWriter = new CsvWriter(writer, this.fieldDelimiter); + this.csvWriter.setTextQualifier('"'); + this.csvWriter.setUseTextQualifier(true); + // warn: in linux is \n , in windows is \r\n + this.csvWriter.setRecordDelimiter(IOUtils.LINE_SEPARATOR.charAt(0)); + } + + @Override + public void writeOneRecord(List splitedRows) throws IOException { + if (splitedRows.isEmpty()) { + LOG.info("Found one record line which is empty."); + } + this.csvWriter.writeRecord((String[]) splitedRows + .toArray(new String[0])); + } + + @Override + public void flush() throws IOException { + this.csvWriter.flush(); + } + + @Override + public void close() throws IOException { + this.csvWriter.close(); + } + +} + +class TextWriterImpl implements UnstructuredWriter { + private static final Logger LOG = LoggerFactory + .getLogger(TextWriterImpl.class); + // text StringUtils的join方式, 简单的字符串拼接 + private char fieldDelimiter; + private Writer textWriter; + + public TextWriterImpl(Writer writer, char fieldDelimiter) { + this.fieldDelimiter = fieldDelimiter; + this.textWriter = writer; + } + + @Override + public void writeOneRecord(List splitedRows) throws IOException { + if (splitedRows.isEmpty()) { + LOG.info("Found one record line which is empty."); + } + this.textWriter.write(String.format("%s%s", + StringUtils.join(splitedRows, this.fieldDelimiter), + IOUtils.LINE_SEPARATOR)); + } + + @Override + public void flush() throws IOException { + this.textWriter.flush(); + } + + @Override + public void close() throws IOException { + this.textWriter.close(); + } + +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterErrorCode.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterErrorCode.java new file mode 100755 index 0000000000..0f780ebdd1 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterErrorCode.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +import com.alibaba.datax.common.spi.ErrorCode; + + +public enum UnstructuredStorageWriterErrorCode implements ErrorCode { + ILLEGAL_VALUE("UnstructuredStorageWriter-00", "您填写的参数值不合法."), + Write_FILE_WITH_CHARSET_ERROR("UnstructuredStorageWriter-01", "您配置的编码未能正常写入."), + Write_FILE_IO_ERROR("UnstructuredStorageWriter-02", "您配置的文件在写入时出现IO异常."), + RUNTIME_EXCEPTION("UnstructuredStorageWriter-03", "出现运行时异常, 请联系我们"), + REQUIRED_VALUE("UnstructuredStorageWriter-04", "您缺失了必须填写的参数值."),; + + private final String code; + private final String description; + + private UnstructuredStorageWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return 
String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterUtil.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterUtil.java new file mode 100755 index 0000000000..b1927ce79b --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredStorageWriterUtil.java @@ -0,0 +1,333 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +import java.io.BufferedWriter; +import java.io.IOException; +import java.io.OutputStream; +import java.io.OutputStreamWriter; +import java.io.UnsupportedEncodingException; +import java.text.DateFormat; +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.UUID; + +import org.apache.commons.compress.compressors.CompressorOutputStream; +import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream; +import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream; +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.DateColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.common.util.Configuration; +import com.google.common.collect.Sets; + +public class UnstructuredStorageWriterUtil { + private UnstructuredStorageWriterUtil() { + + } + + private static final Logger LOG = LoggerFactory + .getLogger(UnstructuredStorageWriterUtil.class); + + /** + * check parameter: writeMode, encoding, compress, filedDelimiter + * */ + public static void validateParameter(Configuration writerConfiguration) { + // writeMode check + String writeMode = writerConfiguration.getNecessaryValue( + Key.WRITE_MODE, + UnstructuredStorageWriterErrorCode.REQUIRED_VALUE); + writeMode = writeMode.trim(); + Set supportedWriteModes = Sets.newHashSet("truncate", "append", + "nonConflict"); + if (!supportedWriteModes.contains(writeMode)) { + throw DataXException + .asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 truncate, append, nonConflict 三种模式, 不支持您配置的 writeMode 模式 : [%s]", + writeMode)); + } + writerConfiguration.set(Key.WRITE_MODE, writeMode); + + // encoding check + String encoding = writerConfiguration.getString(Key.ENCODING); + if (StringUtils.isBlank(encoding)) { + // like " ", null + LOG.warn(String.format("您的encoding配置为空, 将使用默认值[%s]", + Constant.DEFAULT_ENCODING)); + writerConfiguration.set(Key.ENCODING, Constant.DEFAULT_ENCODING); + } else { + try { + encoding = encoding.trim(); + writerConfiguration.set(Key.ENCODING, encoding); + Charsets.toCharset(encoding); + } catch (Exception e) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式:[%s]", encoding), e); + } + } + + // only support compress types + String compress = writerConfiguration.getString(Key.COMPRESS); + if (StringUtils.isBlank(compress)) { + 
writerConfiguration.set(Key.COMPRESS, null); + } else { + Set supportedCompress = Sets.newHashSet("gzip", "bzip2"); + if (!supportedCompress.contains(compress.toLowerCase().trim())) { + String message = String.format( + "仅支持 [%s] 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + StringUtils.join(supportedCompress, ","), compress); + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format(message, compress)); + } + } + + // fieldDelimiter check + String delimiterInStr = writerConfiguration + .getString(Key.FIELD_DELIMITER); + // warn: if have, length must be one + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + if (null == delimiterInStr) { + LOG.warn(String.format("您没有配置列分隔符, 使用默认值[%s]", + Constant.DEFAULT_FIELD_DELIMITER)); + writerConfiguration.set(Key.FIELD_DELIMITER, + Constant.DEFAULT_FIELD_DELIMITER); + } + + // fileFormat check + String fileFormat = writerConfiguration.getString(Key.FILE_FORMAT, + Constant.FILE_FORMAT_TEXT); + if (!Constant.FILE_FORMAT_CSV.equals(fileFormat) + && !Constant.FILE_FORMAT_TEXT.equals(fileFormat)) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, String + .format("您配置的fileFormat [%s]错误, 支持csv, text两种.", + fileFormat)); + } + } + + public static List split(Configuration writerSliceConfig, + Set originAllFileExists, int mandatoryNumber) { + LOG.info("begin do split..."); + Set allFileExists = new HashSet(); + allFileExists.addAll(originAllFileExists); + List writerSplitConfigs = new ArrayList(); + String filePrefix = writerSliceConfig.getString(Key.FILE_NAME); + + String fileSuffix; + for (int i = 0; i < mandatoryNumber; i++) { + // handle same file name + Configuration splitedTaskConfig = writerSliceConfig.clone(); + String fullFileName = null; + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s__%s", filePrefix, fileSuffix); + while (allFileExists.contains(fullFileName)) { + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s__%s", filePrefix, fileSuffix); + } + allFileExists.add(fullFileName); + splitedTaskConfig.set(Key.FILE_NAME, fullFileName); + LOG.info(String + .format("splited write file name:[%s]", fullFileName)); + writerSplitConfigs.add(splitedTaskConfig); + } + LOG.info("end do split."); + return writerSplitConfigs; + } + + public static String buildFilePath(String path, String fileName, + String suffix) { + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + if (!isEndWithSeparator) { + path = path + IOUtils.DIR_SEPARATOR; + } + if (null == suffix) { + suffix = ""; + } else { + suffix = suffix.trim(); + } + return String.format("%s%s%s", path, fileName, suffix); + } + + public static void writeToStream(RecordReceiver lineReceiver, + OutputStream outputStream, Configuration config, String context, + TaskPluginCollector taskPluginCollector) { + String encoding = config.getString(Key.ENCODING, + Constant.DEFAULT_ENCODING); + // handle blank encoding + if (StringUtils.isBlank(encoding)) { + 
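			// 说明: encoding 为空白时不视为配置错误, 仅告警并回退为默认编码 UTF-8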
LOG.warn(String.format("您配置的encoding为[%s], 使用默认值[%s]", encoding, + Constant.DEFAULT_ENCODING)); + encoding = Constant.DEFAULT_ENCODING; + } + String compress = config.getString(Key.COMPRESS); + + BufferedWriter writer = null; + // compress logic + try { + if (null == compress) { + writer = new BufferedWriter(new OutputStreamWriter( + outputStream, encoding)); + } else { + // TODO more compress + if ("gzip".equalsIgnoreCase(compress)) { + CompressorOutputStream compressorOutputStream = new GzipCompressorOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + compressorOutputStream, encoding)); + } else if ("bzip2".equalsIgnoreCase(compress)) { + CompressorOutputStream compressorOutputStream = new BZip2CompressorOutputStream( + outputStream); + writer = new BufferedWriter(new OutputStreamWriter( + compressorOutputStream, encoding)); + } else { + throw DataXException + .asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 gzip, bzip2 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + compress)); + } + } + UnstructuredStorageWriterUtil.doWriteToStream(lineReceiver, writer, + context, config, taskPluginCollector); + } catch (UnsupportedEncodingException uee) { + throw DataXException + .asDataXException( + UnstructuredStorageWriterErrorCode.Write_FILE_WITH_CHARSET_ERROR, + String.format("不支持的编码格式 : [%s]", encoding), uee); + } catch (NullPointerException e) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.RUNTIME_EXCEPTION, + "运行时错误, 请联系我们", e); + } catch (IOException e) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.Write_FILE_IO_ERROR, + String.format("流写入错误 : [%s]", context), e); + } finally { + IOUtils.closeQuietly(writer); + } + } + + private static void doWriteToStream(RecordReceiver lineReceiver, + BufferedWriter writer, String contex, Configuration config, + TaskPluginCollector taskPluginCollector) throws IOException { + + String nullFormat = config.getString(Key.NULL_FORMAT); + + // 兼容format & dataFormat + String dateFormat = config.getString(Key.DATE_FORMAT); + DateFormat dateParse = null; // warn: 可能不兼容 + if (StringUtils.isNotBlank(dateFormat)) { + dateParse = new SimpleDateFormat(dateFormat); + } + + // warn: default false + String fileFormat = config.getString(Key.FILE_FORMAT, + Constant.FILE_FORMAT_TEXT); + + String delimiterInStr = config.getString(Key.FIELD_DELIMITER); + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageWriterErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", delimiterInStr)); + } + if (null == delimiterInStr) { + LOG.warn(String.format("您没有配置列分隔符, 使用默认值[%s]", + Constant.DEFAULT_FIELD_DELIMITER)); + } + + // warn: fieldDelimiter could not be '' for no fieldDelimiter + char fieldDelimiter = config.getChar(Key.FIELD_DELIMITER, + Constant.DEFAULT_FIELD_DELIMITER); + + UnstructuredWriter unstructuredWriter = TextCsvWriterManager + .produceUnstructuredWriter(fileFormat, fieldDelimiter, writer); + + List headers = config.getList(Key.HEADER, String.class); + if (null != headers && !headers.isEmpty()) { + unstructuredWriter.writeOneRecord(headers); + } + + Record record = null; + while ((record = lineReceiver.getFromReader()) != null) { + UnstructuredStorageWriterUtil.transportOneRecord(record, + nullFormat, dateParse, taskPluginCollector, + unstructuredWriter); + } + + // warn:由调用方控制流的关闭 + // IOUtils.closeQuietly(unstructuredWriter); + } + + /** + * 异常表示脏数据 + * */ 
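	// 补充说明: transportOneRecord 将一条 Record 的各列转为字符串后写出一行:
	//   1) 列值为 null 时以 nullFormat 占位(调用方未传 nullFormat 时退化为字符串 "null");
	//   2) DateColumn 在传入 dateParse 时按该格式输出, 否则使用 asString();
	//   3) 转换或写出过程中的任何异常都交给 taskPluginCollector 记为脏数据, 不中断整个任务。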
+ public static void transportOneRecord(Record record, String nullFormat, + DateFormat dateParse, TaskPluginCollector taskPluginCollector, + UnstructuredWriter unstructuredWriter) { + // warn: default is null + if (null == nullFormat) { + nullFormat = "null"; + } + try { + List splitedRows = new ArrayList(); + int recordLength = record.getColumnNumber(); + if (0 != recordLength) { + Column column; + for (int i = 0; i < recordLength; i++) { + column = record.getColumn(i); + if (null != column.getRawData()) { + boolean isDateColumn = column instanceof DateColumn; + if (!isDateColumn) { + splitedRows.add(column.asString()); + } else { + if (null != dateParse) { + splitedRows.add(dateParse.format(column + .asDate())); + } else { + splitedRows.add(column.asString()); + } + } + } else { + // warn: it's all ok if nullFormat is null + splitedRows.add(nullFormat); + } + } + } + unstructuredWriter.writeOneRecord(splitedRows); + } catch (Exception e) { + // warn: dirty data + taskPluginCollector.collectDirtyRecord(record, e); + } + } +} diff --git a/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredWriter.java b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredWriter.java new file mode 100644 index 0000000000..b3b5329bb7 --- /dev/null +++ b/plugin-unstructured-storage-util/src/main/java/com/alibaba/datax/plugin/unstructuredstorage/writer/UnstructuredWriter.java @@ -0,0 +1,15 @@ +package com.alibaba.datax.plugin.unstructuredstorage.writer; + +import java.io.Closeable; +import java.io.IOException; +import java.util.List; + +public interface UnstructuredWriter extends Closeable { + + public void writeOneRecord(List splitedRows) throws IOException; + + public void flush() throws IOException; + + public void close() throws IOException; + +} diff --git a/pom.xml b/pom.xml new file mode 100755 index 0000000000..c86cc76fd1 --- /dev/null +++ b/pom.xml @@ -0,0 +1,197 @@ + + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + org.hamcrest + hamcrest-core + 1.3 + + + + datax-all + pom + + + 0.0.1-SNAPSHOT + 3.3.2 + 1.10 + 1.2 + 1.1.46.sec01 + 16.0.1 + 3.7.2.1-SNAPSHOT + + + 1.7.10 + 1.0.13 + 2.4 + 4.11 + 5.1.22-1 + 1.0.0 + + UTF-8 + UTF-8 + UTF-8 + UTF-8 + + + + common + core + transformer + + + mysqlreader + drdsreader + sqlserverreader + postgresqlreader + oraclereader + odpsreader + otsreader + otsstreamreader + txtfilereader + hdfsreader + streamreader + ossreader + ftpreader + mongodbreader + rdbmsreader + hbase11xreader + hbase094xreader + + + mysqlwriter + drdswriter + odpswriter + txtfilewriter + ftpwriter + hdfswriter + streamwriter + otswriter + oraclewriter + sqlserverwriter + postgresqlwriter + osswriter + mongodbwriter + adswriter + ocswriter + rdbmswriter + hbase11xwriter + hbase094xwriter + hbase11xsqlwriter + + + plugin-rdbms-util + plugin-unstructured-storage-util + + + + + + org.apache.commons + commons-lang3 + ${commons-lang3-version} + + + com.alibaba + fastjson + ${fastjson-version} + + + + commons-io + commons-io + ${commons-io-version} + + + org.slf4j + slf4j-api + ${slf4j-api-version} + + + ch.qos.logback + logback-classic + ${logback-classic-version} + + + + com.taobao.tddl + tddl-client + ${tddl.version} + + + com.google.guava + guava + + + com.taobao.diamond + diamond-client + + + + + + com.taobao.diamond + diamond-client + ${diamond.version} + + + + com.alibaba.search.swift + swift_client + ${swift-version} + + + + junit + junit + ${junit-version} + + 
+ + org.mockito + mockito-all + 1.9.5 + test + + + + + + + + maven-assembly-plugin + + datax + + package.xml + + + + + make-assembly + package + + + + + org.apache.maven.plugins + maven-compiler-plugin + 2.3.2 + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + diff --git a/postgresqlreader/doc/postgresqlreader.md b/postgresqlreader/doc/postgresqlreader.md new file mode 100644 index 0000000000..fed2c7e97c --- /dev/null +++ b/postgresqlreader/doc/postgresqlreader.md @@ -0,0 +1,297 @@ + +# PostgresqlReader 插件文档 + + +___ + + +## 1 快速介绍 + +PostgresqlReader插件实现了从PostgreSQL读取数据。在底层实现上,PostgresqlReader通过JDBC连接远程PostgreSQL数据库,并执行相应的sql语句将数据从PostgreSQL库中SELECT出来。 + +## 2 实现原理 + +简而言之,PostgresqlReader通过JDBC连接器连接到远程的PostgreSQL数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程PostgreSQL数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,PostgresqlReader将其拼接为SQL语句发送到PostgreSQL数据库;对于用户配置querySql信息,PostgresqlReader直接将其发送到PostgreSQL数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从PostgreSQL数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + //设置传输速度,单位为byte/s,DataX运行会尽可能达到该速度但是不超过它. + "byte": 1048576 + }, + //出错限制 + "errorLimit": { + //出错的record条数上限,当大于该值即报错。 + "record": 0, + //出错的record百分比上限 1.0表示100%,0.02表示2% + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "postgresqlreader", + "parameter": { + // 数据库连接用户名 + "username": "xx", + // 数据库连接密码 + "password": "xx", + "column": [ + "id","name" + ], + //切分主键 + "splitPk": "id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:postgresql://host:port/database" + ] + } + ] + } + }, + "writer": { + //writer类型 + "name": "streamwriter", + //是否打印内容 + "parameter": { + "print":true, + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": 1048576 + }, + "content": [ + { + "reader": { + "name": "postgresqlreader", + "parameter": { + "username": "xx", + "password": "xx", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:postgresql://host:port/database", "jdbc:postgresql://host:port/database" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,PostgresqlReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,PostgresqlReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照PostgreSQL官方规范,并可以填写连接附件控制信息。具体请参看[PostgreSQL官方文档](http://jdbc.postgresql.org/documentation/93/connect.html)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,PostgresqlReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用\*代表默认使用所有列配置,例如['\*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照PostgreSQL语法格式: + ["id", "'hello'::varchar", "true", "2.5::real", "power(2,3)"] + id为普通列名,'hello'::varchar为字符串常量,true为布尔值,2.5为浮点数, power(2,3)为函数。 + + **column必须用户显示指定同步的列集合,不允许为空!** + + * 必选:是
+ + * 默认值:无
+ +* **splitPk** + + * 描述:PostgresqlReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提供数据同步的效能。 + + 推荐splitPk用户使用表主键,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整形数据切分,`不支持浮点、字符串型、日期等其他类型`。如果用户指定其他非支持类型,PostgresqlReader将报错! + + splitPk设置为空,底层将视作用户不允许对单表进行切分,因此使用单通道进行抽取。 + + * 必选:否
+ + * 默认值:空
+ +* **where** + + * 描述:筛选条件,MysqlReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。注意:不可以将where条件指定为limit 10,limit不是SQL的合法where子句。
+ + where条件可以有效地进行业务增量同步。 where条件不配置或者为空,视作全表同步数据。 + + * 必选:否
+ + * 默认值:无
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置型来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table,column这些配置型,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,PostgresqlReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
+ +* **fetchSize** + + * 描述:该配置项定义了插件和数据库服务器端每次批量数据获取条数,该值决定了DataX和服务器端的网络交互次数,能够较大的提升数据抽取性能。
+ + `注意,该值过大(>2048)可能造成DataX进程OOM。`。 + + * 必选:否
+ + * 默认值:1024
+ + +### 3.3 类型转换 + +目前PostgresqlReader支持大部分PostgreSQL类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出PostgresqlReader针对PostgreSQL类型转换列表: + + +| DataX 内部类型| PostgreSQL 数据类型 | +| -------- | ----- | +| Long |bigint, bigserial, integer, smallint, serial | +| Double |double precision, money, numeric, real | +| String |varchar, char, text, bit, inet| +| Date |date, time, timestamp | +| Boolean |bool| +| Bytes |bytea| + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持; money,inet,bit需用户使用a_inet::varchar类似的语法转换`。 + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + +create table pref_test( + id serial, + a_bigint bigint, + a_bit bit(10), + a_boolean boolean, + a_char character(5), + a_date date, + a_double double precision, + a_integer integer, + a_money money, + a_num numeric(10,2), + a_real real, + a_smallint smallint, + a_text text, + a_time time, + a_timestamp timestamp +) + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 16核 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz + 2. mem: MemTotal: 24676836kB MemFree: 6365080kB + 3. net: 百兆双网卡 + +* PostgreSQL数据库机器参数为: + D12 24逻辑核 192G内存 12*480G SSD 阵列 + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + + +| 通道数 | 是否按照主键切分 | DataX速度(Rec/s) | DataX流量(MB/s) | DataX机器运行负载 | +|--------|--------| --------|--------|--------| +|1| 否 | 10211 | 0.63 | 0.2 | +|1| 是 | 10211 | 0.63 | 0.2 | +|4| 否 | 10211 | 0.63 | 0.2 | +|4| 是 | 40000 | 2.48 | 0.5 | +|8| 否 | 10211 | 0.63 | 0.2 | +|8| 是 | 78048 | 4.84 | 0.8 | + + +说明: + +1. 这里的单表,主键类型为 serial,数据分布均匀。 +2. 对单表如果没有按照主键切分,那么配置通道个数不会提升速度,效果与1个通道一样。 diff --git a/postgresqlreader/pom.xml b/postgresqlreader/pom.xml new file mode 100755 index 0000000000..48a3c61525 --- /dev/null +++ b/postgresqlreader/pom.xml @@ -0,0 +1,86 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + postgresqlreader + postgresqlreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + org.postgresql + postgresql + 9.3-1102-jdbc4 + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/postgresqlreader/src/main/assembly/package.xml b/postgresqlreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..5860c05768 --- /dev/null +++ b/postgresqlreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/postgresqlreader + + + target/ + + postgresqlreader-0.0.1-SNAPSHOT.jar + + plugin/reader/postgresqlreader + + + + + + false + plugin/reader/postgresqlreader/libs + runtime + + + diff --git a/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/Constant.java b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/Constant.java new file mode 100755 index 0000000000..9b9b46789f --- /dev/null +++ b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.reader.postgresqlreader; + +public class Constant { + + public static final int DEFAULT_FETCH_SIZE = 1000; + +} diff --git a/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/PostgresqlReader.java 
b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/PostgresqlReader.java new file mode 100755 index 0000000000..59d2825fe4 --- /dev/null +++ b/postgresqlreader/src/main/java/com/alibaba/datax/plugin/reader/postgresqlreader/PostgresqlReader.java @@ -0,0 +1,86 @@ +package com.alibaba.datax.plugin.reader.postgresqlreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; + +import java.util.List; + +public class PostgresqlReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.PostgreSQL; + + public static class Job extends Reader.Job { + + private Configuration originalConfig; + private CommonRdbmsReader.Job commonRdbmsReaderMaster; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + int fetchSize = this.originalConfig.getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + Constant.DEFAULT_FETCH_SIZE); + if (fetchSize < 1) { + throw DataXException.asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + String.format("您配置的fetchSize有误,根据DataX的设计,fetchSize : [%d] 设置值不能小于 1.", fetchSize)); + } + this.originalConfig.set(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, fetchSize); + + this.commonRdbmsReaderMaster = new CommonRdbmsReader.Job(DATABASE_TYPE); + this.commonRdbmsReaderMaster.init(this.originalConfig); + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderMaster.split(this.originalConfig, adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderMaster.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderMaster.destroy(this.originalConfig); + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderSlave; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderSlave = new CommonRdbmsReader.Task(DATABASE_TYPE,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderSlave.init(this.readerSliceConfig); + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig.getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE); + + this.commonRdbmsReaderSlave.startRead(this.readerSliceConfig, recordSender, + super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderSlave.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderSlave.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/postgresqlreader/src/main/resources/plugin.json b/postgresqlreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..152f8b7b06 --- /dev/null +++ b/postgresqlreader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "postgresqlreader", + "class": "com.alibaba.datax.plugin.reader.postgresqlreader.PostgresqlReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. 
warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/postgresqlreader/src/main/resources/plugin_job_template.json b/postgresqlreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..21970520f2 --- /dev/null +++ b/postgresqlreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "postgresqlreader", + "parameter": { + "username": "", + "password": "", + "connection": [ + { + "table": [], + "jdbcUrl": [] + } + ] + } +} \ No newline at end of file diff --git a/postgresqlwriter/doc/postgresqlwriter.md b/postgresqlwriter/doc/postgresqlwriter.md new file mode 100644 index 0000000000..662da2e4f2 --- /dev/null +++ b/postgresqlwriter/doc/postgresqlwriter.md @@ -0,0 +1,267 @@ +# DataX PostgresqlWriter + + +--- + + +## 1 快速介绍 + +PostgresqlWriter插件实现了写入数据到 PostgreSQL主库目的表的功能。在底层实现上,PostgresqlWriter通过JDBC连接远程 PostgreSQL 数据库,并执行相应的 insert into ... sql 语句将数据写入 PostgreSQL,内部会分批次提交入库。 + +PostgresqlWriter面向ETL开发工程师,他们使用PostgresqlWriter从数仓导入数据到PostgreSQL。同时 PostgresqlWriter亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +PostgresqlWriter通过 DataX 框架获取 Reader 生成的协议数据,根据你配置生成相应的SQL插入语句 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +
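下面用一段极简的 JDBC 示意代码帮助理解"拼接 insert into ... 并分批次提交入库"的写入方式。其中的连接串、表 test 以及列 id、name 与下文 3.1 配置样例保持一致,batchSize 取 1024 仅为示例取值;这段代码只是原理示意,并非 PostgresqlWriter 的实际实现。

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PostgresqlInsertSketch {
    public static void main(String[] args) throws Exception {
        // 连接信息与表结构均为示例, 实际由任务配置中的 jdbcUrl/username/password/table/column 决定
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://127.0.0.1:3002/datax", "xx", "xx")) {
            conn.setAutoCommit(false);
            String sql = "INSERT INTO test (id, name) VALUES (?, ?)";
            int batchSize = 1024; // 对应 batchSize 参数, 此处取其默认值示意
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 10000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "name_" + i);
                    ps.addBatch();
                    if ((i + 1) % batchSize == 0) {
                        ps.executeBatch(); // 攒满一批再执行, 减少与数据库的网络交互
                        conn.commit();
                    }
                }
                ps.executeBatch(); // 刷出最后不足一批的数据
                conn.commit();
            }
        }
    }
}
```

实际导入时,批次大小由下文 3.2 参数说明中的 batchSize 控制。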
+ + 注意: + 1. 目的表所在数据库必须是主库才能写入数据;整个任务至少需具备 insert into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + 2. PostgresqlWriter和MysqlWriter不同,不支持配置writeMode参数。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 PostgresqlWriter导入的数据。 + +```json +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column" : [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "postgresqlwriter", + "parameter": { + "username": "xx", + "password": "xx", + "column": [ + "id", + "name" + ], + "preSql": [ + "delete from test" + ], + "connection": [ + { + "jdbcUrl": "jdbc:postgresql://127.0.0.1:3002/datax", + "table": [ + "test" + ] + } + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息 ,jdbcUrl必须包含在connection配置单元中。 + + 注意:1、在一个数据库上只能配置一个值。 + 2、jdbcUrl按照PostgreSQL官方规范,并可以填写连接附加参数信息。具体请参看PostgreSQL官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
+ + * 默认值:否
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。比如你的任务是要写入到目的端的100个同构分表(表名称为:datax_00,datax01, ... datax_98,datax_99),并且你希望导入数据前,先对表中数据进行删除操作,那么你可以这样配置:`"preSql":["delete from @table"]`,效果是:在执行到每个表写入数据前,会先执行对应的 delete from 对应表名称
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与PostgreSql的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
+ +### 3.3 类型转换 + +目前 PostgresqlWriter支持大部分 PostgreSQL类型,但也存在部分没有支持的情况,请注意检查你的类型。 + +下面列出 PostgresqlWriter针对 PostgreSQL类型转换列表: + +| DataX 内部类型| PostgreSQL 数据类型 | +| -------- | ----- | +| Long |bigint, bigserial, integer, smallint, serial | +| Double |double precision, money, numeric, real | +| String |varchar, char, text, bit| +| Date |date, time, timestamp | +| Boolean |bool| +| Bytes |bytea| + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: + + create table pref_test( + id serial, + a_bigint bigint, + a_bit bit(10), + a_boolean boolean, + a_char character(5), + a_date date, + a_double double precision, + a_integer integer, + a_money money, + a_num numeric(10,2), + a_real real, + a_smallint smallint, + a_text text, + a_time time, + a_timestamp timestamp +) + +#### 4.1.2 机器参数 + +* 执行DataX的机器参数为: + 1. cpu: 16核 Intel(R) Xeon(R) CPU E5620 @ 2.40GHz + 2. mem: MemTotal: 24676836kB MemFree: 6365080kB + 3. net: 百兆双网卡 + +* PostgreSQL数据库机器参数为: + D12 24逻辑核 192G内存 12*480G SSD 阵列 + + +### 4.2 测试报告 + +#### 4.2.1 单表测试报告 + +| 通道数| 批量提交batchSize | DataX速度(Rec/s)| DataX流量(M/s) | DataX机器运行负载 +|--------|--------| --------|--------|--------|--------| +|1| 128 | 9259 | 0.55 | 0.3 +|1| 512 | 10869 | 0.653 | 0.3 +|1| 2048 | 9803 | 0.589 | 0.8 +|4| 128 | 30303 | 1.82 | 1 +|4| 512 | 36363 | 2.18 | 1 +|4| 2048 | 36363 | 2.18 | 1 +|8| 128 | 57142 | 3.43 | 2 +|8| 512 | 66666 | 4.01 | 1.5 +|8| 2048 | 66666 | 4.01 | 1.1 +|16| 128 | 88888 | 5.34 | 1.8 +|16| 2048 | 94117 | 5.65 | 2.5 +|32| 512 | 76190 | 4.58 | 3 + +#### 4.2.2 性能测试小结 +1. `channel数对性能影响很大` +2. `通常不建议写入数据库时,通道个数 > 32` + + +## FAQ + +*** + +**Q: PostgresqlWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。 +第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** diff --git a/postgresqlwriter/pom.xml b/postgresqlwriter/pom.xml new file mode 100755 index 0000000000..b90cf307c8 --- /dev/null +++ b/postgresqlwriter/pom.xml @@ -0,0 +1,82 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + postgresqlwriter + postgresqlwriter + jar + writer data into postgresql database + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + org.postgresql + postgresql + 9.3-1102-jdbc4 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/postgresqlwriter/src/main/assembly/package.xml b/postgresqlwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..20bfe6226c --- /dev/null +++ b/postgresqlwriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/postgresqlwriter + + + target/ + + postgresqlwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/postgresqlwriter + + + + + + false + plugin/writer/postgresqlwriter/libs + runtime + + + diff --git a/postgresqlwriter/src/main/java/com/alibaba/datax/plugin/writer/postgresqlwriter/PostgresqlWriter.java b/postgresqlwriter/src/main/java/com/alibaba/datax/plugin/writer/postgresqlwriter/PostgresqlWriter.java new file mode 100755 index 
0000000000..22dc0c1e6d --- /dev/null +++ b/postgresqlwriter/src/main/java/com/alibaba/datax/plugin/writer/postgresqlwriter/PostgresqlWriter.java @@ -0,0 +1,100 @@ +package com.alibaba.datax.plugin.writer.postgresqlwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class PostgresqlWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.PostgreSQL; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterMaster; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + // warn:not like mysql, PostgreSQL only support insert mode, don't use + String writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if (null != writeMode) { + throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, + String.format("写入模式(writeMode)配置有误. 因为PostgreSQL不支持配置参数项 writeMode: %s, PostgreSQL仅使用insert sql 插入数据. 请检查您的配置并作出修改.", writeMode)); + } + + this.commonRdbmsWriterMaster = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterMaster.init(this.originalConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterMaster.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterMaster.split(this.originalConfig, mandatoryNumber); + } + + @Override + public void post() { + this.commonRdbmsWriterMaster.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterMaster.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterSlave; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterSlave = new CommonRdbmsWriter.Task(DATABASE_TYPE){ + @Override + public String calcValueHolder(String columnType){ + if("serial".equalsIgnoreCase(columnType)){ + return "?::int"; + }else if("bit".equalsIgnoreCase(columnType)){ + return "?::bit varying"; + } + return "?::" + columnType; + } + }; + this.commonRdbmsWriterSlave.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterSlave.prepare(this.writerSliceConfig); + } + + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterSlave.startWrite(recordReceiver, this.writerSliceConfig, super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterSlave.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterSlave.destroy(this.writerSliceConfig); + } + + } + +} diff --git a/postgresqlwriter/src/main/resources/plugin.json b/postgresqlwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..b61b28886d --- /dev/null +++ b/postgresqlwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "postgresqlwriter", + "class": "com.alibaba.datax.plugin.writer.postgresqlwriter.PostgresqlWriter", + "description": "useScene: 
prod. mechanism: Jdbc connection using the database, execute insert sql. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/postgresqlwriter/src/main/resources/plugin_job_template.json b/postgresqlwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..c1e781f16d --- /dev/null +++ b/postgresqlwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "postgresqlwriter", + "parameter": { + "username": "", + "password": "", + "column": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ], + "preSql": [], + "postSql": [] + } +} \ No newline at end of file diff --git a/rdbmsreader/doc/rdbmsreader.md b/rdbmsreader/doc/rdbmsreader.md new file mode 100644 index 0000000000..dd3039e9a2 --- /dev/null +++ b/rdbmsreader/doc/rdbmsreader.md @@ -0,0 +1,284 @@ +# RDBMSReader 插件文档 + + +___ + + +## 1 快速介绍 + +RDBMSReader插件实现了从RDBMS读取数据。在底层实现上,RDBMSReader通过JDBC连接远程RDBMS数据库,并执行相应的sql语句将数据从RDBMS库中SELECT出来。目前支持达梦、db2、PPAS、Sybase数据库的读取。RDBMSReader是一个通用的关系数据库读插件,您可以通过注册数据库驱动等方式增加任意多样的关系数据库读支持。 + + +## 2 实现原理 + +简而言之,RDBMSReader通过JDBC连接器连接到远程的RDBMS数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程RDBMS数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,RDBMSReader将其拼接为SQL语句发送到RDBMS数据库;对于用户配置querySql信息,RDBMS直接将其发送到RDBMS数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从RDBMS数据库同步抽取数据作业: + +``` +{ + "job": { + "setting": { + "speed": { + "byte": 1048576 + }, + "errorLimit": { + "record": 0, + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "rdbmsreader", + "parameter": { + "username": "xxx", + "password": "xxx", + "column": [ + "id", + "name" + ], + "splitPk": "pk", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:dm://ip:port/database" + ] + } + ], + "fetchSize": 1024, + "where": "1 = 1" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": true + } + } + } + ] + } +} + +``` + +* 配置一个自定义SQL的数据库同步任务到ODPS的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "byte": 1048576 + }, + "errorLimit": { + "record": 0, + "percentage": 0.02 + } + }, + "content": [ + { + "reader": { + "name": "rdbmsreader", + "parameter": { + "username": "xxx", + "password": "xxx", + "column": [ + "id", + "name" + ], + "splitPk": "pk", + "connection": [ + { + "querySql": [ + "SELECT * from dual" + ], + "jdbcUrl": [ + "jdbc:dm://ip:port/database" + ] + } + ], + "fetchSize": 1024, + "where": "1 = 1" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": true + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,jdbcUrl按照RDBMS官方规范,并可以填写连接附件控制信息。请注意不同的数据库jdbc的格式是不同的,DataX会根据具体jdbc的格式选择合适的数据库驱动完成数据读取。 + + - 达梦 jdbc:dm://ip:port/database + - db2格式 jdbc:db2://ip:port/database + - PPAS格式 jdbc:edb://ip:port/database + + **rdbmswriter如何增加新的数据库支持:** + + - 进入rdbmsreader对应目录,这里${DATAX_HOME}为DataX主目录,即: ${DATAX_HOME}/plugin/reader/rdbmswriter + - 在rdbmsreader插件目录下有plugin.json配置文件,在此文件中注册您具体的数据库驱动,具体放在drivers数组中。rdbmsreader插件在任务执行时会动态选择合适的数据库驱动连接数据库。 + + + ``` + { + "name": "rdbmsreader", + "class": "com.alibaba.datax.plugin.reader.rdbmsreader.RdbmsReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. 
warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba", + "drivers": [ + "dm.jdbc.driver.DmDriver", + "com.ibm.db2.jcc.DB2Driver", + "com.sybase.jdbc3.jdbc.SybDriver", + "com.edb.Driver" + ] + } + ``` + + - 在rdbmsreader插件目录下有libs子目录,您需要将您具体的数据库驱动放到libs目录下。 + + + ``` + $tree + . + |-- libs + | |-- Dm7JdbcDriver16.jar + | |-- commons-collections-3.0.jar + | |-- commons-io-2.4.jar + | |-- commons-lang3-3.3.2.jar + | |-- commons-math3-3.1.1.jar + | |-- datax-common-0.0.1-SNAPSHOT.jar + | |-- datax-service-face-1.0.23-20160120.024328-1.jar + | |-- db2jcc4.jar + | |-- druid-1.0.15.jar + | |-- edb-jdbc16.jar + | |-- fastjson-1.1.46.sec01.jar + | |-- guava-r05.jar + | |-- hamcrest-core-1.3.jar + | |-- jconn3-1.0.0-SNAPSHOT.jar + | |-- logback-classic-1.0.13.jar + | |-- logback-core-1.0.13.jar + | |-- plugin-rdbms-util-0.0.1-SNAPSHOT.jar + | `-- slf4j-api-1.7.10.jar + |-- plugin.json + |-- plugin_job_template.json + `-- rdbmsreader-0.0.1-SNAPSHOT.jar + ``` + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名。
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码。
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表名。
+ + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用*代表默认使用所有列配置,例如['*']。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照JSON格式: + ["id", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"] + id为普通列名,1为整形数字常量,'bazhen.csy'为字符串常量,null为空指针,to_char(a + 1)为表达式,2.3为浮点数,true为布尔值。 + + Column必须显示填写,不允许为空! + + * 必选:是
+ + * 默认值:无
+ +* **splitPk** + + * 描述:RDBMSReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提供数据同步的效能。 + + 推荐splitPk用户使用表主键,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整形数据切分,`不支持浮点、字符串型、日期等其他类型`。如果用户指定其他非支持类型,RDBMSReader将报错! + + splitPk如果不填写,将视作用户不对单表进行切分,RDBMSReader使用单通道同步全量数据。 + + * 必选:否
+ + * 默认值:空
+ +* **where** + + * 描述:筛选条件,RDBMSReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。例如在做测试时,可以将where条件指定为limit 10;在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。
。 + + where条件可以有效地进行业务增量同步。where条件不配置或者为空,视作全表同步数据。 + + * 必选:否
+ + * 默认值:无
+ +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置型来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table,column这些配置型,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,RDBMSReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
+ +* **fetchSize** + + * 描述:该配置项定义了插件和数据库服务器端每次批量数据获取条数,该值决定了DataX和服务器端的网络交互次数,能够较大的提升数据抽取性能。
+ + `注意,该值过大(>2048)可能造成DataX进程OOM。`。 + + * 必选:否
+ + * 默认值:1024
+ + +### 3.3 类型转换 + +目前RDBMSReader支持大部分通用得关系数据库类型如数字、字符等,但也存在部分个别类型没有支持的情况,请注意检查你的类型,根据具体的数据库做选择。 diff --git a/rdbmsreader/pom.xml b/rdbmsreader/pom.xml new file mode 100755 index 0000000000..34fbbada8e --- /dev/null +++ b/rdbmsreader/pom.xml @@ -0,0 +1,109 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + rdbmsreader + rdbmsreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + + com.dm + dm + system + ${basedir}/src/main/libs/Dm7JdbcDriver16.jar + + + com.sybase + jconn3 + 1.0.0-SNAPSHOT + system + ${basedir}/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar + + + ppas + ppas + 16 + system + ${basedir}/src/main/libs/edb-jdbc16.jar + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + + + + com.dm + dm + 16 + + + + diff --git a/rdbmsreader/src/main/assembly/package.xml b/rdbmsreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..3a50bc8138 --- /dev/null +++ b/rdbmsreader/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/rdbmsreader + + + target/ + + rdbmsreader-0.0.1-SNAPSHOT.jar + + plugin/reader/rdbmsreader + + + src/main/libs + + *.* + + plugin/reader/rdbmsreader/libs + + + + + + false + plugin/reader/rdbmsreader/libs + runtime + + + diff --git a/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/Constant.java b/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/Constant.java new file mode 100755 index 0000000000..aa1ac5709b --- /dev/null +++ b/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.reader.rdbmsreader; + +public class Constant { + + public static final int DEFAULT_FETCH_SIZE = 1000; + +} diff --git a/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/RdbmsReader.java b/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/RdbmsReader.java new file mode 100755 index 0000000000..3153e114b2 --- /dev/null +++ b/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/RdbmsReader.java @@ -0,0 +1,94 @@ +package com.alibaba.datax.plugin.reader.rdbmsreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; + +import java.util.List; + +public class RdbmsReader extends Reader { + private static final DataBaseType DATABASE_TYPE = DataBaseType.RDBMS; + + public static class Job extends Reader.Job { + + private Configuration originalConfig; + private CommonRdbmsReader.Job commonRdbmsReaderMaster; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + int fetchSize = this.originalConfig.getInt( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + Constant.DEFAULT_FETCH_SIZE); + if (fetchSize < 1) { + throw DataXException + 
.asDataXException( + DBUtilErrorCode.REQUIRED_VALUE, + String.format( + "您配置的fetchSize有误,根据DataX的设计,fetchSize : [%d] 设置值不能小于 1.", + fetchSize)); + } + this.originalConfig.set( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + fetchSize); + + this.commonRdbmsReaderMaster = new SubCommonRdbmsReader.Job( + DATABASE_TYPE); + this.commonRdbmsReaderMaster.init(this.originalConfig); + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderMaster.split(this.originalConfig, + adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderMaster.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderMaster.destroy(this.originalConfig); + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderSlave; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderSlave = new SubCommonRdbmsReader.Task( + DATABASE_TYPE); + this.commonRdbmsReaderSlave.init(this.readerSliceConfig); + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig + .getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE); + + this.commonRdbmsReaderSlave.startRead(this.readerSliceConfig, + recordSender, super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderSlave.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderSlave.destroy(this.readerSliceConfig); + } + } +} diff --git a/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/SubCommonRdbmsReader.java b/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/SubCommonRdbmsReader.java new file mode 100755 index 0000000000..b94021bfef --- /dev/null +++ b/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/SubCommonRdbmsReader.java @@ -0,0 +1,169 @@ +package com.alibaba.datax.plugin.reader.rdbmsreader; + +import com.alibaba.datax.common.element.BoolColumn; +import com.alibaba.datax.common.element.BytesColumn; +import com.alibaba.datax.common.element.DateColumn; +import com.alibaba.datax.common.element.DoubleColumn; +import com.alibaba.datax.common.element.LongColumn; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.element.StringColumn; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.plugin.TaskPluginCollector; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; + +import java.sql.ResultSet; +import java.sql.ResultSetMetaData; +import java.sql.Types; + +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class SubCommonRdbmsReader extends CommonRdbmsReader { + static { + DBUtil.loadDriverClass("reader", "rdbms"); + } + + public static class Job extends CommonRdbmsReader.Job { + public Job(DataBaseType dataBaseType) { + super(dataBaseType); + } + } + + public static class Task extends CommonRdbmsReader.Task { + + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + private static final boolean IS_DEBUG = LOG.isDebugEnabled(); + + public 
Task(DataBaseType dataBaseType) { + super(dataBaseType); + } + + @Override + protected Record transportOneRecord(RecordSender recordSender, + ResultSet rs, ResultSetMetaData metaData, int columnNumber, + String mandatoryEncoding, + TaskPluginCollector taskPluginCollector) { + Record record = recordSender.createRecord(); + + try { + for (int i = 1; i <= columnNumber; i++) { + switch (metaData.getColumnType(i)) { + + case Types.CHAR: + case Types.NCHAR: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + String rawData; + if (StringUtils.isBlank(mandatoryEncoding)) { + rawData = rs.getString(i); + } else { + rawData = new String( + (rs.getBytes(i) == null ? EMPTY_CHAR_ARRAY + : rs.getBytes(i)), + mandatoryEncoding); + } + record.addColumn(new StringColumn(rawData)); + break; + + case Types.CLOB: + case Types.NCLOB: + record.addColumn(new StringColumn(rs.getString(i))); + break; + + case Types.SMALLINT: + case Types.TINYINT: + case Types.INTEGER: + case Types.BIGINT: + record.addColumn(new LongColumn(rs.getString(i))); + break; + + case Types.NUMERIC: + case Types.DECIMAL: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + record.addColumn(new DoubleColumn(rs.getString(i))); + break; + + case Types.TIME: + record.addColumn(new DateColumn(rs.getTime(i))); + break; + + // for mysql bug, see http://bugs.mysql.com/bug.php?id=35115 + case Types.DATE: + if (metaData.getColumnTypeName(i).equalsIgnoreCase( + "year")) { + record.addColumn(new LongColumn(rs.getInt(i))); + } else { + record.addColumn(new DateColumn(rs.getDate(i))); + } + break; + + case Types.TIMESTAMP: + record.addColumn(new DateColumn(rs.getTimestamp(i))); + break; + + case Types.BINARY: + case Types.VARBINARY: + case Types.BLOB: + case Types.LONGVARBINARY: + record.addColumn(new BytesColumn(rs.getBytes(i))); + break; + + // warn: bit(1) -> Types.BIT 可使用BoolColumn + // warn: bit(>1) -> Types.VARBINARY 可使用BytesColumn + case Types.BOOLEAN: + case Types.BIT: + record.addColumn(new BoolColumn(rs.getBoolean(i))); + break; + + case Types.NULL: + String stringData = null; + if (rs.getObject(i) != null) { + stringData = rs.getObject(i).toString(); + } + record.addColumn(new StringColumn(stringData)); + break; + //case Types.TIME_WITH_TIMEZONE: + //case Types.TIMESTAMP_WITH_TIMEZONE: + // record.addColumn(new StringColumn(rs.getString(i))); + // break; + + default: + // warn:not support INTERVAL etc: Types.JAVA_OBJECT + throw DataXException + .asDataXException( + DBUtilErrorCode.UNSUPPORTED_TYPE, + String.format( + "您的配置文件中的列配置信息有误. 因为DataX 不支持数据库读取这种字段类型. 字段名:[%s], 字段名称:[%s], 字段Java类型:[%s]. 请尝试使用数据库函数将其转换datax支持的类型 或者不同步该字段 .", + metaData.getColumnName(i), + metaData.getColumnType(i), + metaData.getColumnClassName(i))); + } + } + } catch (Exception e) { + if (IS_DEBUG) { + LOG.debug("read data " + record.toString() + + " occur exception:", e); + } + // TODO 这里识别为脏数据靠谱吗? 
+ taskPluginCollector.collectDirtyRecord(record, e); + if (e instanceof DataXException) { + throw (DataXException) e; + } + } + recordSender.sendToWriter(record); + return record; + } + } +} diff --git a/rdbmsreader/src/main/libs/Dm7JdbcDriver16.jar b/rdbmsreader/src/main/libs/Dm7JdbcDriver16.jar new file mode 100755 index 0000000000..30740dcd2c Binary files /dev/null and b/rdbmsreader/src/main/libs/Dm7JdbcDriver16.jar differ diff --git a/rdbmsreader/src/main/libs/db2jcc4.jar b/rdbmsreader/src/main/libs/db2jcc4.jar new file mode 100755 index 0000000000..fc53cfd94b Binary files /dev/null and b/rdbmsreader/src/main/libs/db2jcc4.jar differ diff --git a/rdbmsreader/src/main/libs/edb-jdbc16.jar b/rdbmsreader/src/main/libs/edb-jdbc16.jar new file mode 100644 index 0000000000..255e64794d Binary files /dev/null and b/rdbmsreader/src/main/libs/edb-jdbc16.jar differ diff --git a/rdbmsreader/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar b/rdbmsreader/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar new file mode 100755 index 0000000000..df6e78bbc4 Binary files /dev/null and b/rdbmsreader/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar differ diff --git a/rdbmsreader/src/main/resources/plugin.json b/rdbmsreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..d344dd8602 --- /dev/null +++ b/rdbmsreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "rdbmsreader", + "class": "com.alibaba.datax.plugin.reader.rdbmsreader.RdbmsReader", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba", + "drivers":["dm.jdbc.driver.DmDriver", "com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver"] +} diff --git a/rdbmsreader/src/main/resources/plugin_job_template.json b/rdbmsreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..bd08e9d44d --- /dev/null +++ b/rdbmsreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,15 @@ +{ + "name": "rdbmsreader", + "parameter": { + "username": "", + "password": "", + "column": [], + "connection": [ + { + "jdbcUrl": [], + "table": [] + } + ], + "where": "" + } +} \ No newline at end of file diff --git a/rdbmswriter/doc/rdbmswriter.md b/rdbmswriter/doc/rdbmswriter.md new file mode 100644 index 0000000000..4135d93bcf --- /dev/null +++ b/rdbmswriter/doc/rdbmswriter.md @@ -0,0 +1,200 @@ +# RDBMSWriter 插件文档 + +--- + +## 1 快速介绍 + +RDBMSWriter 插件实现了写入数据到 RDBMS 主库的目的表的功能。在底层实现上, RDBMSWriter 通过 JDBC 连接远程 RDBMS 数据库,并执行相应的 insert into ... 的 sql 语句将数据写入 RDBMS。 RDBMSWriter是一个通用的关系数据库写插件,您可以通过注册数据库驱动等方式增加任意多样的关系数据库写支持。 + +RDBMSWriter 面向ETL开发工程师,他们使用 RDBMSWriter 从数仓导入数据到 RDBMS。同时 RDBMSWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +RDBMSWriter 通过 DataX 框架获取 Reader 生成的协议数据,RDBMSWriter 通过 JDBC 连接远程 RDBMS 数据库,并执行相应的 insert into ... 
的 sql 语句将数据写入 RDBMS。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个写入RDBMS的作业。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19880808, + "type": "long" + }, + { + "value": "1988-08-08 08:08:08", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 1000 + } + }, + "writer": { + "name": "rdbmswriter", + "parameter": { + "connection": [ + { + "jdbcUrl": "jdbc:dm://ip:port/database", + "table": [ + "table" + ] + } + ], + "username": "username", + "password": "password", + "table": "table", + "column": [ + "*" + ], + "preSql": [ + "delete from XXX;" + ] + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + + +* **jdbcUrl** + + * 描述:描述的是到对端数据库的JDBC连接信息,jdbcUrl按照RDBMS官方规范,并可以填写连接附件控制信息。请注意不同的数据库jdbc的格式是不同的,DataX会根据具体jdbc的格式选择合适的数据库驱动完成数据读取。 + + - 达梦 jdbc:dm://ip:port/database + - db2格式 jdbc:db2://ip:port/database + - PPAS格式 jdbc:edb://ip:port/database + + **rdbmswriter如何增加新的数据库支持:** + + - 进入rdbmswriter对应目录,这里${DATAX_HOME}为DataX主目录,即: ${DATAX_HOME}/plugin/writer/rdbmswriter + - 在rdbmswriter插件目录下有plugin.json配置文件,在此文件中注册您具体的数据库驱动,具体放在drivers数组中。rdbmswriter插件在任务执行时会动态选择合适的数据库驱动连接数据库。 + + ```json + { + "name": "rdbmswriter", + "class": "com.alibaba.datax.plugin.reader.rdbmswriter.RdbmsWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba", + "drivers": [ + "dm.jdbc.driver.DmDriver", + "com.ibm.db2.jcc.DB2Driver", + "com.sybase.jdbc3.jdbc.SybDriver", + "com.edb.Driver" + ] + } + ``` + - 在rdbmswriter插件目录下有libs子目录,您需要将您具体的数据库驱动放到libs目录下。 + + ``` + $tree + . + |-- libs + | |-- Dm7JdbcDriver16.jar + | |-- commons-collections-3.0.jar + | |-- commons-io-2.4.jar + | |-- commons-lang3-3.3.2.jar + | |-- commons-math3-3.1.1.jar + | |-- datax-common-0.0.1-SNAPSHOT.jar + | |-- datax-service-face-1.0.23-20160120.024328-1.jar + | |-- db2jcc4.jar + | |-- druid-1.0.15.jar + | |-- edb-jdbc16.jar + | |-- fastjson-1.1.46.sec01.jar + | |-- guava-r05.jar + | |-- hamcrest-core-1.3.jar + | |-- jconn3-1.0.0-SNAPSHOT.jar + | |-- logback-classic-1.0.13.jar + | |-- logback-core-1.0.13.jar + | |-- plugin-rdbms-util-0.0.1-SNAPSHOT.jar + | `-- slf4j-api-1.7.10.jar + |-- plugin.json + |-- plugin_job_template.json + `-- rdbmswriter-0.0.1-SNAPSHOT.jar + ``` + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ * 必选:是
+ * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ * 必选:是
+ * 默认值:无
+ +* **table** + + * 描述:目标表名称,如果表的schema信息和上述配置username不一致,请使用schema.table的格式填写table信息。
+ * 必选:是
+ * 默认值:无
 + +* **column** + + * 描述:所配置的表中需要同步的列名集合,以英文逗号(,)进行分隔;也可以使用 ["*"] 表示全部列,`但我们强烈不推荐使用这种默认全列的配置方式`
+ + * 必选:是
+ * 默认值:无
+ +* **preSql** + + * 描述:执行数据同步任务之前率先执行的sql语句,目前只允许执行一条SQL语句,例如清除旧数据。
+ * 必选:否
+ * 默认值:无
+ +* **postSql** + + * 描述:执行数据同步任务之后执行的sql语句,目前只允许执行一条SQL语句,例如加上某一个时间戳。
+ * 必选:否
+ * 默认值:无
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与RDBMS的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
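结合上面的 preSql 与 batchSize 参数,下面给出一个示意性的 rdbmswriter 配置片段(连接串、表名沿用 3.1 配置样例中的占位值,batchSize 仅演示如何显式配置):

```json
{
    "writer": {
        "name": "rdbmswriter",
        "parameter": {
            "username": "username",
            "password": "password",
            "column": ["*"],
            "connection": [
                {
                    "jdbcUrl": "jdbc:dm://ip:port/database",
                    "table": ["table"]
                }
            ],
            "preSql": ["delete from XXX;"],
            "batchSize": 1024
        }
    }
}
```

postSql 的写法与 preSql 相同,只是执行时机在同步完成之后。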
+ +### 3.3 类型转换 + +目前RDBMSReader支持大部分通用得关系数据库类型如数字、字符等,但也存在部分个别类型没有支持的情况,请注意检查你的类型,根据具体的数据库做选择。 diff --git a/rdbmswriter/pom.xml b/rdbmswriter/pom.xml new file mode 100755 index 0000000000..bb20a74c58 --- /dev/null +++ b/rdbmswriter/pom.xml @@ -0,0 +1,101 @@ + + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + rdbmswriter + rdbmswriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + + com.dm + dm + 16 + system + ${basedir}/src/main/libs/Dm7JdbcDriver16.jar + + + com.sybase + jconn3 + 1.0.0-SNAPSHOT + system + ${basedir}/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar + + + ppas + ppas + 16 + system + ${basedir}/src/main/libs/edb-jdbc16.jar + + + + org.slf4j + slf4j-api + + + + ch.qos.logback + logback-classic + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/rdbmswriter/src/main/assembly/package.xml b/rdbmswriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..5d06bcecb9 --- /dev/null +++ b/rdbmswriter/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/rdbmswriter + + + target/ + + rdbmswriter-0.0.1-SNAPSHOT.jar + + plugin/writer/rdbmswriter + + + src/main/libs + + *.* + + plugin/writer/rdbmswriter/libs + + + + + + false + plugin/writer/rdbmswriter/libs + runtime + + + diff --git a/rdbmswriter/src/main/java/com/alibaba/datax/plugin/reader/rdbmswriter/RdbmsWriter.java b/rdbmswriter/src/main/java/com/alibaba/datax/plugin/reader/rdbmswriter/RdbmsWriter.java new file mode 100755 index 0000000000..49ef387734 --- /dev/null +++ b/rdbmswriter/src/main/java/com/alibaba/datax/plugin/reader/rdbmswriter/RdbmsWriter.java @@ -0,0 +1,98 @@ +package com.alibaba.datax.plugin.reader.rdbmswriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class RdbmsWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.RDBMS; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterMaster; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + // warn:not like mysql, only support insert mode, don't use + String writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if (null != writeMode) { + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "写入模式(writeMode)配置有误. 因为不支持配置参数项 writeMode: %s, 仅使用insert sql 插入数据. 
请检查您的配置并作出修改.", + writeMode)); + } + + this.commonRdbmsWriterMaster = new SubCommonRdbmsWriter.Job( + DATABASE_TYPE); + this.commonRdbmsWriterMaster.init(this.originalConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterMaster.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterMaster.split(this.originalConfig, + mandatoryNumber); + } + + @Override + public void post() { + this.commonRdbmsWriterMaster.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterMaster.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterSlave; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterSlave = new SubCommonRdbmsWriter.Task( + DATABASE_TYPE); + this.commonRdbmsWriterSlave.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterSlave.prepare(this.writerSliceConfig); + } + + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterSlave.startWrite(recordReceiver, + this.writerSliceConfig, super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterSlave.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterSlave.destroy(this.writerSliceConfig); + } + + } + +} \ No newline at end of file diff --git a/rdbmswriter/src/main/java/com/alibaba/datax/plugin/reader/rdbmswriter/SubCommonRdbmsWriter.java b/rdbmswriter/src/main/java/com/alibaba/datax/plugin/reader/rdbmswriter/SubCommonRdbmsWriter.java new file mode 100755 index 0000000000..f1fbc552ef --- /dev/null +++ b/rdbmswriter/src/main/java/com/alibaba/datax/plugin/reader/rdbmswriter/SubCommonRdbmsWriter.java @@ -0,0 +1,169 @@ +package com.alibaba.datax.plugin.reader.rdbmswriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.plugin.rdbms.util.DBUtil; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; + +import java.sql.PreparedStatement; +import java.sql.SQLException; +import java.sql.Types; + +public class SubCommonRdbmsWriter extends CommonRdbmsWriter { + static { + DBUtil.loadDriverClass("writer", "rdbms"); + } + + public static class Job extends CommonRdbmsWriter.Job { + public Job(DataBaseType dataBaseType) { + super(dataBaseType); + } + } + + public static class Task extends CommonRdbmsWriter.Task { + public Task(DataBaseType dataBaseType) { + super(dataBaseType); + } + + @Override + protected PreparedStatement fillPreparedStatementColumnType( + PreparedStatement preparedStatement, int columnIndex, + int columnSqltype, Column column) throws SQLException { + java.util.Date utilDate; + try { + switch (columnSqltype) { + case Types.CHAR: + case Types.NCHAR: + case Types.CLOB: + case Types.NCLOB: + case Types.VARCHAR: + case Types.LONGVARCHAR: + case Types.NVARCHAR: + case Types.LONGNVARCHAR: + if (null == column.getRawData()) { + preparedStatement.setObject(columnIndex + 1, null); + } else { + preparedStatement.setString(columnIndex + 1, + column.asString()); + } + break; + + case Types.SMALLINT: + case Types.INTEGER: + case Types.BIGINT: + case Types.TINYINT: + String strLongValue = column.asString(); + if (emptyAsNull && 
"".equals(strLongValue)) { + preparedStatement.setObject(columnIndex + 1, null); + } else if (null == column.getRawData()) { + preparedStatement.setObject(columnIndex + 1, null); + } else { + preparedStatement.setLong(columnIndex + 1, + column.asLong()); + } + break; + case Types.NUMERIC: + case Types.DECIMAL: + case Types.FLOAT: + case Types.REAL: + case Types.DOUBLE: + String strValue = column.asString(); + if (emptyAsNull && "".equals(strValue)) { + preparedStatement.setObject(columnIndex + 1, null); + } else if (null == column.getRawData()) { + preparedStatement.setObject(columnIndex + 1, null); + } else { + preparedStatement.setDouble(columnIndex + 1, + column.asDouble()); + } + break; + + case Types.DATE: + java.sql.Date sqlDate = null; + utilDate = column.asDate(); + if (null != utilDate) { + sqlDate = new java.sql.Date(utilDate.getTime()); + preparedStatement.setDate(columnIndex + 1, sqlDate); + } else { + preparedStatement.setNull(columnIndex + 1, Types.DATE); + } + break; + + case Types.TIME: + java.sql.Time sqlTime = null; + utilDate = column.asDate(); + if (null != utilDate) { + sqlTime = new java.sql.Time(utilDate.getTime()); + preparedStatement.setTime(columnIndex + 1, sqlTime); + } else { + preparedStatement.setNull(columnIndex + 1, Types.TIME); + } + break; + + case Types.TIMESTAMP: + java.sql.Timestamp sqlTimestamp = null; + utilDate = column.asDate(); + if (null != utilDate) { + sqlTimestamp = new java.sql.Timestamp( + utilDate.getTime()); + preparedStatement.setTimestamp(columnIndex + 1, + sqlTimestamp); + } else { + preparedStatement.setNull(columnIndex + 1, + Types.TIMESTAMP); + } + break; + + case Types.BINARY: + case Types.VARBINARY: + case Types.BLOB: + case Types.LONGVARBINARY: + if (null == column.getRawData()) { + preparedStatement.setObject(columnIndex + 1, null); + } else { + preparedStatement.setBytes(columnIndex + 1, + column.asBytes()); + } + break; + + case Types.BOOLEAN: + if (null == column.getRawData()) { + preparedStatement.setNull(columnIndex + 1, + Types.BOOLEAN); + } else { + preparedStatement.setBoolean(columnIndex + 1, + column.asBoolean()); + } + break; + + // warn: bit(1) -> Types.BIT 可使用setBoolean + // warn: bit(>1) -> Types.VARBINARY 可使用setBytes + case Types.BIT: + if (null == column.getRawData()) { + preparedStatement.setObject(columnIndex + 1, null); + } else if (this.dataBaseType == DataBaseType.MySql) { + preparedStatement.setBoolean(columnIndex + 1, + column.asBoolean()); + } else { + preparedStatement.setString(columnIndex + 1, + column.asString()); + } + break; + default: + preparedStatement.setObject(columnIndex + 1, + column.getRawData()); + break; + } + } catch (DataXException e) { + throw new SQLException(String.format( + "类型转换错误:[%s] 字段名:[%s], 字段类型:[%d], 字段Java类型:[%s].", + column, + this.resultSetMetaData.getLeft().get(columnIndex), + this.resultSetMetaData.getMiddle().get(columnIndex), + this.resultSetMetaData.getRight().get(columnIndex))); + } + return preparedStatement; + } + } +} diff --git a/rdbmswriter/src/main/libs/Dm7JdbcDriver16.jar b/rdbmswriter/src/main/libs/Dm7JdbcDriver16.jar new file mode 100755 index 0000000000..30740dcd2c Binary files /dev/null and b/rdbmswriter/src/main/libs/Dm7JdbcDriver16.jar differ diff --git a/rdbmswriter/src/main/libs/db2jcc4.jar b/rdbmswriter/src/main/libs/db2jcc4.jar new file mode 100755 index 0000000000..fc53cfd94b Binary files /dev/null and b/rdbmswriter/src/main/libs/db2jcc4.jar differ diff --git a/rdbmswriter/src/main/libs/edb-jdbc16.jar b/rdbmswriter/src/main/libs/edb-jdbc16.jar new file 
mode 100644 index 0000000000..255e64794d Binary files /dev/null and b/rdbmswriter/src/main/libs/edb-jdbc16.jar differ diff --git a/rdbmswriter/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar b/rdbmswriter/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar new file mode 100755 index 0000000000..df6e78bbc4 Binary files /dev/null and b/rdbmswriter/src/main/libs/jconn3-1.0.0-SNAPSHOT.jar differ diff --git a/rdbmswriter/src/main/resources/plugin.json b/rdbmswriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..fa771af294 --- /dev/null +++ b/rdbmswriter/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "rdbmswriter", + "class": "com.alibaba.datax.plugin.reader.rdbmswriter.RdbmsWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba", + "drivers":["dm.jdbc.driver.DmDriver", "com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver"] +} diff --git a/rdbmswriter/src/main/resources/plugin_job_template.json b/rdbmswriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..66102f0a33 --- /dev/null +++ b/rdbmswriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "rdbmswriter", + "parameter": { + "username": "", + "password": "", + "writeMode": "", + "column": [], + "session": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ] + } +} \ No newline at end of file diff --git a/rpm/t_dp_dw_datax_3_core_all-build.sh b/rpm/t_dp_dw_datax_3_core_all-build.sh new file mode 100755 index 0000000000..2a8d6f9b43 --- /dev/null +++ b/rpm/t_dp_dw_datax_3_core_all-build.sh @@ -0,0 +1,12 @@ +#!/bin/bash +export PATH=/home/tops/bin/:${PATH} +export temppath=$1 +cd $temppath/rpm +sed -i "s/^Release:.*$/Release: "$4"/" $2.spec +sed -i "s/^Version:.*$/Version: "$3"/" $2.spec +sed -i "s/UNKNOWN_DATAX_VERSION/$3-$4/g" ../core/src/main/bin/datax.py +sed -i "s/UNKNOWN_DATAX_VERSION/$3-$4/g" ../core/src/main/bin/perftrace.py +export TAGS=TAG:`svn info|grep "URL"|cut -d ":" -f 2-|sed "s/^ //g"|awk -F "trunk|tags|branche" '{print $1}'`tags/$2_A_`echo $3|tr "." "_"`_$4 +sed -i "s#%description#%description \n $TAGS#g" $2.spec +/usr/local/bin/rpm_create -p /home/admin -v $3 -r $4 $2.spec -k +mv `find . -name $2-$3-$4*rpm` . 
diff --git a/rpm/t_dp_dw_datax_3_core_all.spec b/rpm/t_dp_dw_datax_3_core_all.spec new file mode 100755 index 0000000000..6a5b77c214 --- /dev/null +++ b/rpm/t_dp_dw_datax_3_core_all.spec @@ -0,0 +1,123 @@ +Name: t_dp_dw_datax_3_core_all +Packager:xiafei.qiuxf +Version:201607221827 +Release: %(echo $RELEASE)%{?dist} + +Summary: datax 3 core +URL: http://gitlab.alibaba-inc.com/datax/datax +Group: t_dp +License: Commercial +BuildArch: noarch + + +%define __os_install_post %{nil} + +%description +CodeUrl: http://gitlab.alibaba-inc.com/datax/datax +datax core +%{_svn_path} +%{_svn_revision} + +%define _prefix /home/admin/datax3 +%define _plugin6 /home/admin/datax3/plugin_%{version}_%{release} +%define _lib6 /home/admin/datax3/lib_%{version}_%{release} + +%prep +export LANG=zh_CN.UTF-8 + +%pre +grep -q "^cug-tbdp:" /etc/group &>/dev/null || groupadd -g 508 cug-tbdp &>/dev/null || true +grep -q "^taobao:" /etc/passwd &>/dev/null || useradd -u 503 -g cug-tbdp taobao &>/dev/null || true +if [ -d %{_prefix}/log ]; then + find %{_prefix}/log -type f -mtime +7 -exec rm -rf {} \; + find %{_prefix}/log -type d -empty -mtime +7 -exec rm -rf {} \; + find %{_prefix}/log_perf -type f -mtime +7 -exec rm -rf {} \; + find %{_prefix}/log_perf -type d -empty -mtime +7 -exec rm -rf {} \; +fi + +mkdir -p %{_plugin6} +mkdir -p %{_lib6} + +%build +cd ${OLDPWD}/../ + +export MAVEN_OPTS="-Xms256m -Xmx1024m -XX:MaxPermSize=128m" +#/home/ads/tools/apache-maven-3.0.3/bin/ +mvn clean package -DskipTests assembly:assembly + +%install + +mkdir -p .%{_plugin6} +mkdir -p .%{_lib6} +cp -rf $OLDPWD/../target/datax/datax/bin .%{_prefix}/. +cp -rf $OLDPWD/../target/datax/datax/conf .%{_prefix}/. +cp -rf $OLDPWD/../target/datax/datax/job .%{_prefix}/. +cp -rf $OLDPWD/../target/datax/datax/script .%{_prefix}/. +cp -rf $OLDPWD/../target/datax/datax/lib/* .%{_lib6}/. +cp -rf $OLDPWD/../target/datax/datax/plugin/* .%{_plugin6}/. 
+ +# make dir for hook +mkdir -p .%{_prefix}/hook +mkdir -p .%{_prefix}/tmp +mkdir -p .%{_prefix}/log +mkdir -p .%{_prefix}/log_perf +mkdir -p .%{_prefix}/local_storage + +%post +chmod -R 0755 %{_prefix}/bin +chmod -R 0755 %{_prefix}/conf +chmod -R 0755 %{_prefix}/job +chmod -R 0755 %{_prefix}/script +chmod -R 0755 %{_prefix}/hook +chmod -R 0777 %{_prefix}/tmp +chmod -R 0755 %{_prefix}/log +chmod -R 0755 %{_prefix}/log_perf +chmod -R 0755 %{_prefix}/local_storage +chmod -R 0700 %{_prefix}/conf/.secret.properties + + + +# 指定新目录 +# 如果datax3 plugin是软连接,直接删除,并创建新的软链接 +if [ -L %{_prefix}/plugin ]; then + oldplugin=$(readlink %{_prefix}/plugin) + rm -rf %{_prefix}/plugin + ln -s %{_plugin6} %{_prefix}/plugin + + oldlib=`readlink %{_prefix}/lib` + rm -rf %{_prefix}/lib + ln -s %{_lib6} %{_prefix}/lib + + ## 解决--force + if [ "${oldplugin}" != "%{_plugin6}" ];then + rm -rf ${oldplugin} + rm -rf ${oldlib} + fi + +elif [ -d %{_prefix}/plugin ]; then + mv %{_prefix}/plugin %{_prefix}/plugin_bak_rpm + mv %{_prefix}/lib %{_prefix}/lib_bak_rpm + + ln -s %{_plugin6} %{_prefix}/plugin + ln -s %{_lib6} %{_prefix}/lib + + rm -rf %{_prefix}/plugin_bak_rpm + rm -rf %{_prefix}/lib_bak_rpm +else + ln -s %{_lib6} %{_prefix}/lib + ln -s %{_plugin6} %{_prefix}/plugin +fi + +chown -h admin %{_prefix}/plugin +chown -h admin %{_prefix}/lib + +chgrp -h cug-tbdp %{_prefix}/plugin +chgrp -h cug-tbdp %{_prefix}/lib + +%files +%defattr(755,admin,cug-tbdp) +%config(noreplace) %{_prefix}/conf/core.json +%config(noreplace) %{_prefix}/conf/logback.xml +%config(noreplace) %{_prefix}/conf/.secret.properties + +%{_prefix} diff --git a/rpm/t_dp_dw_datax_3_hook_dqc-build.sh b/rpm/t_dp_dw_datax_3_hook_dqc-build.sh new file mode 100755 index 0000000000..3087eb726e --- /dev/null +++ b/rpm/t_dp_dw_datax_3_hook_dqc-build.sh @@ -0,0 +1,10 @@ +#!/bin/bash +export PATH=/home/tops/bin/:${PATH} +export temppath=$1 +cd $temppath/rpm +sed -i "s/^Release:.*$/Release: "$4"/" $2.spec +sed -i "s/^Version:.*$/Version: "$3"/" $2.spec +export TAGS=TAG:`svn info|grep "URL"|cut -d ":" -f 2-|sed "s/^ //g"|awk -F "trunk|tags|branche" '{print $1}'`tags/$2_A_`echo $3|tr "." "_"`_$4 +sed -i "s#%description#%description \n $TAGS#g" $2.spec +/usr/local/bin/rpm_create -p /home/admin -v $3 -r $4 $2.spec -k +mv `find . -name $2-$3-$4*rpm` . 
diff --git a/rpm/t_dp_dw_datax_3_hook_dqc.spec b/rpm/t_dp_dw_datax_3_hook_dqc.spec new file mode 100755 index 0000000000..6a452012f1 --- /dev/null +++ b/rpm/t_dp_dw_datax_3_hook_dqc.spec @@ -0,0 +1,57 @@ +Name: t_dp_dw_datax_3_hook_dqc +Packager:xiafei.qiuxf +Version:2014122220.3 +Release: 1 + +Summary: datax 3 dqc hook +URL: http://gitlab.alibaba-inc.com/datax/datax +Group: t_dp +License: Commercial +BuildArch: noarch + + +%define __os_install_post %{nil} + +%description +CodeUrl: http://gitlab.alibaba-inc.com/datax/datax +datax dqc hook +%{_svn_path} +%{_svn_revision} + +%define _prefix /home/admin/datax3/hook/dqc + +%prep +export LANG=zh_CN.UTF-8 + +%pre +grep -q "^cug-tbdp:" /etc/group &>/dev/null || groupadd -g 508 cug-tbdp &>/dev/null || true +grep -q "^taobao:" /etc/passwd &>/dev/null || useradd -u 503 -g cug-tbdp taobao &>/dev/null || true + + +%build +BASE_DIR="${OLDPWD}/../" + +cd ${BASE_DIR}/ + +#/home/ads/tools/apache-maven-3.0.3/bin/ +mvn install -N +#/home/ads/tools/apache-maven-3.0.3/bin/ +mvn install -pl common -DskipTests +cd ${BASE_DIR}/dqchook +#/home/ads/tools/apache-maven-3.0.3/bin/ +mvn clean package -DskipTests assembly:assembly +cd ${BASE_DIR} + +%install +BASE_DIR="${OLDPWD}/../" +mkdir -p .%{_prefix} +cp -r ${BASE_DIR}/dqchook/target/datax/hook/dqc/* .%{_prefix}/ + +%post +chmod -R 0755 %{_prefix} + + +%files +%defattr(755,admin,cug-tbdp) +%config(noreplace) %{_prefix}/dqc.properties +%{_prefix} diff --git a/sqlserverreader/doc/sqlserverreader.md b/sqlserverreader/doc/sqlserverreader.md new file mode 100644 index 0000000000..8822bf391d --- /dev/null +++ b/sqlserverreader/doc/sqlserverreader.md @@ -0,0 +1,279 @@ + +# SqlServerReader 插件文档 + +___ + + +## 1 快速介绍 + +SqlServerReader插件实现了从SqlServer读取数据。在底层实现上,SqlServerReader通过JDBC连接远程SqlServer数据库,并执行相应的sql语句将数据从SqlServer库中SELECT出来。 + +## 2 实现原理 + +简而言之,SqlServerReader通过JDBC连接器连接到远程的SqlServer数据库,并根据用户配置的信息生成查询SELECT SQL语句并发送到远程SqlServer数据库,并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集,并传递给下游Writer处理。 + +对于用户配置Table、Column、Where的信息,SqlServerReader将其拼接为SQL语句发送到SqlServer数据库;对于用户配置querySql信息,SqlServer直接将其发送到SqlServer数据库。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 配置一个从SqlServer数据库同步抽取数据到本地的作业: + +``` +{ + "job": { + "setting": { + "speed": { + "byte": 1048576 + } + }, + "content": [ + { + "reader": { + "name": "sqlserverreader", + "parameter": { + // 数据库连接用户名 + "username": "root", + // 数据库连接密码 + "password": "root", + "column": [ + "id" + ], + "splitPk": "db_id", + "connection": [ + { + "table": [ + "table" + ], + "jdbcUrl": [ + "jdbc:sqlserver://localhost:3433;DatabaseName=dbname" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": true, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + +* 配置一个自定义SQL的数据库同步任务到本地内容的作业: + +``` +{ + "job": { + "setting": { + "speed": 1048576 + }, + "content": [ + { + "reader": { + "name": "sqlserverreader", + "parameter": { + "username": "root", + "password": "root", + "where": "", + "connection": [ + { + "querySql": [ + "select db_id,on_line_flag from db_info where db_id < 10;" + ], + "jdbcUrl": [ + "jdbc:sqlserver://bad_ip:3433;DatabaseName=dbname", + "jdbc:sqlserver://127.0.0.1:bad_port;DatabaseName=dbname", + "jdbc:sqlserver://127.0.0.1:3306;DatabaseName=dbname" + ] + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "visible": false, + "encoding": "UTF-8" + } + } + } + ] + } +} +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 
描述:描述的是到对端数据库的JDBC连接信息,使用JSON的数组描述,并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息,是因为阿里集团内部支持多个IP探测,如果配置了多个,SqlServerReader可以依次探测ip的可连接性,直到选择一个合法的IP。如果全部连接失败,SqlServerReader报错。 注意,jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况,JSON数组填写一个JDBC连接即可。 + + jdbcUrl按照SqlServer官方规范,并可以填写连接附加控制信息。具体请参看[SqlServer官方文档](http://technet.microsoft.com/zh-cn/library/ms378749(v=SQL.110).aspx)。 + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:数据源的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:数据源指定用户名的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:所选取的需要同步的表。使用JSON的数组描述,因此支持多张表同时抽取。当配置为多张表时,用户自己需保证多张表是同一schema结构,SqlServerReader不予检查表是否同一逻辑表。注意,table必须包含在connection配置单元中。
+ + * 必选:是
+ + * 默认值:无
 + +* **column** + + * 描述:所配置的表中需要同步的列名集合,使用JSON的数组描述字段信息。用户使用\*代表默认使用所有列配置,例如["\*"]。 + + 支持列裁剪,即列可以挑选部分列进行导出。 + + 支持列换序,即列可以不按照表schema信息进行导出。 + + 支持常量配置,用户需要按照JSON格式: + ["id", "[table]", "1", "'bazhen.csy'", "null", "COUNT(*)", "2.3" , "true"] + id为普通列名,[table]为包含保留字的列名,1为整型数字常量,'bazhen.csy'为字符串常量,null为空指针,COUNT(*)为表达式,2.3为浮点数,true为布尔值。 + + column必须用户显式指定同步的列集合,不允许为空! + + * 必选:是
+ + * 默认值:无
 + +* **splitPk** + + * 描述:SqlServerReader进行数据抽取时,如果指定splitPk,表示用户希望使用splitPk代表的字段进行数据分片,DataX因此会启动并发任务进行数据同步,这样可以大大提高数据同步的效能。 + + 推荐用户使用表主键作为splitPk,因为表主键通常情况下比较均匀,因此切分出来的分片也不容易出现数据热点。 + + 目前splitPk仅支持整型数据切分,`不支持浮点、字符串、日期等其他类型`。如果用户指定其他非支持类型,SqlServerReader将报错! + + splitPk设置为空,底层将视作用户不允许对单表进行切分,因此使用单通道进行抽取。 + + * 必选:否
+ + * 默认值:无
 + +* **where** + + * 描述:筛选条件,SqlServerReader根据指定的column、table、where条件拼接SQL,并根据这个SQL进行数据抽取。在实际业务场景中,往往会选择当天的数据进行同步,可以将where条件指定为gmt_create > $bizdate 。注意:不可以将where条件指定为limit 10,limit不是SQL的合法where子句。
+ + where条件可以有效地进行业务增量同步。如果该值为空,代表同步全表所有的信息。 + + * 必选:否
+ + * 默认值:无
 + +* **querySql** + + * 描述:在有些业务场景下,where这一配置项不足以描述所筛选的条件,用户可以通过该配置项来自定义筛选SQL。当用户配置了这一项之后,DataX系统就会忽略table,column这些配置项,直接使用这个配置项的内容对数据进行筛选,例如需要进行多表join后同步数据,使用select a,b from table_a join table_b on table_a.id = table_b.id
+ + `当用户配置querySql时,SqlServerReader直接忽略table、column、where条件的配置`。 + + * 必选:否
+ + * 默认值:无
 + +* **fetchSize** + + * 描述:该配置项定义了插件和数据库服务器端每次批量数据获取条数,该值决定了DataX和服务器端的网络交互次数,能够较大地提升数据抽取性能。
 + + `注意,该值过大(>2048)可能造成DataX进程OOM`。 + + * 必选:否
+ + * 默认值:1024
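下面把 splitPk、where、fetchSize 放到同一个 reader 配置里作为示意片段(jdbcUrl、表名、splitPk、账号均沿用 3.1 配置样例中的示例值,$bizdate 表示由外部传入的业务日期变量):

```json
{
    "reader": {
        "name": "sqlserverreader",
        "parameter": {
            "username": "root",
            "password": "root",
            "column": ["id"],
            "splitPk": "db_id",
            "where": "gmt_create > $bizdate",
            "fetchSize": 1024,
            "connection": [
                {
                    "table": ["table"],
                    "jdbcUrl": ["jdbc:sqlserver://localhost:3433;DatabaseName=dbname"]
                }
            ]
        }
    }
}
```

若改用 querySql(参考 3.1 中第二个样例),则 table、column、where 的配置会被忽略。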
+ + +### 3.3 类型转换 + +目前SqlServerReader支持大部分SqlServer类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出SqlServerReader针对SqlServer类型转换列表: + + +| DataX 内部类型| SqlServer 数据类型 | +| -------- | ----- | +| Long |bigint, int, smallint, tinyint| +| Double |float, decimal, real, numeric| +|String |char,nchar,ntext,nvarchar,text,varchar,nvarchar(MAX),varchar(MAX)| +| Date |date, datetime, time | +| Boolean |bit| +| Bytes |binary,varbinary,varbinary(MAX),timestamp| + + + +请注意: + +* `除上述罗列字段类型外,其他类型均不支持`。 +* `timestamp类型作为二进制类型`。 + +## 4 性能报告 + +暂无 + +## 5 约束限制 + +### 5.1 主备同步数据恢复问题 + +主备同步问题指SqlServer使用主从灾备,备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差,特别在于某些特定情况,例如网络延迟等问题,导致备库同步恢复的数据与主库有较大差别,导致从备库同步的数据不是一份当前时间的完整镜像。 + +针对这个问题,我们提供了preSql功能,该功能待补充。 + +### 5.2 一致性约束 + +SqlServer在数据存储划分中属于RDBMS系统,对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中,当该库存在其他数据写入方写入数据时,SqlServerReader完全不会获取到写入更新数据,这是由于数据库本身的快照特性决定的。关于数据库快照特性,请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control) + +上述是在SqlServerReader单线程模型下数据同步一致性的特性,由于SqlServerReader可以根据用户配置信息使用了并发数据抽取,因此不能严格保证数据一致性:当SqlServerReader根据splitPk进行数据切分后,会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务,同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。 + +针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择: + +1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。 + +2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。 + +### 5.3 数据库编码问题 + +SqlServerReader底层使用JDBC进行数据抽取,JDBC天然适配各类编码,并在底层进行了编码转换。因此SqlServerReader不需用户指定编码,可以自动识别编码并转码。 + +### 5.4 增量数据同步 + +SqlServerReader使用JDBC SELECT语句完成数据抽取工作,因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种: + +* 数据库在线应用写入数据库时,填充modify字段为更改时间戳,包括新增、更新、删除(逻辑删)。对于这类应用,SqlServerReader只需要WHERE条件跟上一同步阶段时间戳即可。 +* 对于新增流水型数据,SqlServerReader可以WHERE条件后跟上一阶段最大自增ID即可。 + +对于业务上无字段区分新增、修改数据情况,SqlServerReader也无法进行增量数据同步,只能同步全量数据。 + +### 5.5 Sql安全性 + +SqlServerReader提供querySql语句交给用户自己实现SELECT抽取语句,SqlServerReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。 + +## 6 FAQ + + diff --git a/sqlserverreader/pom.xml b/sqlserverreader/pom.xml new file mode 100755 index 0000000000..0ec2a60992 --- /dev/null +++ b/sqlserverreader/pom.xml @@ -0,0 +1,78 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + sqlserverreader + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.microsoft.sqlserver + sqljdbc4 + 4.0 + system + ${basedir}/src/main/lib/sqljdbc4-4.0.jar + + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + + diff --git a/sqlserverreader/src/main/assembly/package.xml b/sqlserverreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..55fbdc0b9c --- /dev/null +++ b/sqlserverreader/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/sqlserverreader + + + src/main/lib + + sqljdbc4-4.0.jar + + plugin/reader/sqlserverreader/libs + + + target/ + + sqlserverreader-0.0.1-SNAPSHOT.jar + + plugin/reader/sqlserverreader + + + + + + false + plugin/reader/sqlserverreader/libs + runtime + + + diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Constant.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Constant.java new file mode 100755 index 
0000000000..1b6a14d281 --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Constant.java @@ -0,0 +1,7 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +public class Constant { + + public static final int DEFAULT_FETCH_SIZE = 1024; + +} diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Key.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Key.java new file mode 100755 index 0000000000..c1a083107a --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/Key.java @@ -0,0 +1,6 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +public class Key { + + public static final String FETCH_SIZE = "fetchSize"; +} diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReader.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReader.java new file mode 100755 index 0000000000..fbb7bfa7fb --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReader.java @@ -0,0 +1,95 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; + +import java.util.List; + +public class SqlServerReader extends Reader { + + private static final DataBaseType DATABASE_TYPE = DataBaseType.SQLServer; + + public static class Job extends Reader.Job { + + private Configuration originalConfig = null; + private CommonRdbmsReader.Job commonRdbmsReaderJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + int fetchSize = this.originalConfig.getInt( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + Constant.DEFAULT_FETCH_SIZE); + if (fetchSize < 1) { + throw DataXException + .asDataXException(DBUtilErrorCode.REQUIRED_VALUE, + String.format("您配置的fetchSize有误,根据DataX的设计,fetchSize : [%d] 设置值不能小于 1.", + fetchSize)); + } + this.originalConfig.set( + com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, + fetchSize); + + this.commonRdbmsReaderJob = new CommonRdbmsReader.Job( + DATABASE_TYPE); + this.commonRdbmsReaderJob.init(this.originalConfig); + } + + @Override + public List split(int adviceNumber) { + return this.commonRdbmsReaderJob.split(this.originalConfig, + adviceNumber); + } + + @Override + public void post() { + this.commonRdbmsReaderJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + private CommonRdbmsReader.Task commonRdbmsReaderTask; + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsReaderTask = new CommonRdbmsReader.Task( + DATABASE_TYPE ,super.getTaskGroupId(), super.getTaskId()); + this.commonRdbmsReaderTask.init(this.readerSliceConfig); + } + + @Override + public void startRead(RecordSender recordSender) { + int fetchSize = this.readerSliceConfig + 
.getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE); + + this.commonRdbmsReaderTask.startRead(this.readerSliceConfig, + recordSender, super.getTaskPluginCollector(), fetchSize); + } + + @Override + public void post() { + this.commonRdbmsReaderTask.post(this.readerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsReaderTask.destroy(this.readerSliceConfig); + } + + } + +} diff --git a/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReaderErrorCode.java b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReaderErrorCode.java new file mode 100755 index 0000000000..6f24a17999 --- /dev/null +++ b/sqlserverreader/src/main/java/com/alibaba/datax/plugin/reader/sqlserverreader/SqlServerReaderErrorCode.java @@ -0,0 +1,26 @@ +package com.alibaba.datax.plugin.reader.sqlserverreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum SqlServerReaderErrorCode implements ErrorCode { + ; + + private String code; + private String description; + + private SqlServerReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + +} diff --git a/sqlserverreader/src/main/lib/sqljdbc4-4.0.jar b/sqlserverreader/src/main/lib/sqljdbc4-4.0.jar new file mode 100644 index 0000000000..d6b7f6daf4 Binary files /dev/null and b/sqlserverreader/src/main/lib/sqljdbc4-4.0.jar differ diff --git a/sqlserverreader/src/main/resources/plugin.json b/sqlserverreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..5b9d497098 --- /dev/null +++ b/sqlserverreader/src/main/resources/plugin.json @@ -0,0 +1,7 @@ +{ + "name": "sqlserverreader", + "class": "com.alibaba.datax.plugin.reader.sqlserverreader.SqlServerReader", + "description": "useScene: test. mechanism: use datax framework to transport data from SQL Server. warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} + diff --git a/sqlserverreader/src/main/resources/plugin_job_template.json b/sqlserverreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..002ebca575 --- /dev/null +++ b/sqlserverreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,13 @@ +{ + "name": "sqlserverreader", + "parameter": { + "username": "", + "password": "", + "connection": [ + { + "table": [], + "jdbcUrl": [] + } + ] + } +} \ No newline at end of file diff --git a/sqlserverwriter/doc/sqlserverwriter.md b/sqlserverwriter/doc/sqlserverwriter.md new file mode 100644 index 0000000000..255834c65b --- /dev/null +++ b/sqlserverwriter/doc/sqlserverwriter.md @@ -0,0 +1,248 @@ +# DataX SqlServerWriter + + +--- + + +## 1 快速介绍 + +SqlServerWriter 插件实现了写入数据到 SqlServer 库的目的表的功能。在底层实现上, SqlServerWriter 通过 JDBC 连接远程 SqlServer 数据库,并执行相应的 insert into ... sql 语句将数据写入 SqlServer,内部会分批次提交入库。 + +SqlServerWriter 面向ETL开发工程师,他们使用 SqlServerWriter 从数仓导入数据到 SqlServer。同时 SqlServerWriter 亦可以作为数据迁移工具为DBA等用户提供服务。 + + +## 2 实现原理 + +SqlServerWriter 通过 DataX 框架获取 Reader 生成的协议数据,根据你配置生成相应的SQL语句 + + +* `insert into...`(当主键/唯一性索引冲突时会写不进去冲突的行) + +
+ + 注意: + 1. 目的表所在数据库必须是主库才能写入数据;整个任务至少需具备 insert into...的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。 + 2.SqlServerWriter和MysqlWriter不同,不支持配置writeMode参数。 + + +## 3 功能说明 + +### 3.1 配置样例 + +* 这里使用一份从内存产生到 SqlServer 导入的数据。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 5 + } + }, + "content": [ + { + "reader": {}, + "writer": { + "name": "sqlserverwriter", + "parameter": { + "username": "root", + "password": "root", + "column": [ + "db_id", + "db_type", + "db_ip", + "db_port", + "db_role", + "db_name", + "db_username", + "db_password", + "db_modify_time", + "db_modify_user", + "db_description", + "db_tddl_info" + ], + "connection": [ + { + "table": [ + "db_info_for_writer" + ], + "jdbcUrl": "jdbc:sqlserver://[HOST_NAME]:PORT;DatabaseName=[DATABASE_NAME]" + } + ], + "preSql": [ + "delete from @table where db_id = -1;" + ], + "postSql": [ + "update @table set db_modify_time = now() where db_id = 1;" + ] + } + } + } + ] + } +} + +``` + + +### 3.2 参数说明 + +* **jdbcUrl** + + * 描述:目的数据库的 JDBC 连接信息 ,jdbcUrl必须包含在connection配置单元中。 + + 注意:1、在一个数据库上只能配置一个值。这与 SqlServerReader 支持多个备库探测不同,因为此处不支持同一个数据库存在多个主库的情况(双主导入数据情况) + 2、jdbcUrl按照SqlServer官方规范,并可以填写连接附加参数信息。具体请参看 SqlServer官方文档或者咨询对应 DBA。 + + + * 必选:是
+ + * 默认值:无
+ +* **username** + + * 描述:目的数据库的用户名
+ + * 必选:是
+ + * 默认值:无
+ +* **password** + + * 描述:目的数据库的密码
+ + * 必选:是
+ + * 默认值:无
+ +* **table** + + * 描述:目的表的表名称。支持写入一个或者多个表。当配置为多张表时,必须确保所有表结构保持一致。 + + 注意:table 和 jdbcUrl 必须包含在 connection 配置单元中 + + * 必选:是
+ + * 默认值:无
+ +* **column** + + * 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id","name","age"]。如果要依次写入全部列,使用*表示, 例如: "column": ["*"] + + **column配置项必须指定,不能留空!** + + + 注意:1、我们强烈不推荐你这样配置,因为当你目的表字段个数、类型等有改动时,你的任务可能运行不正确或者失败 + 2、此处 column 不能配置任何常量值 + + * 必选:是
 + + * 默认值:无
+ +* **preSql** + + * 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 Sql 语句时,会对变量按照实际表名称进行替换。
+ + * 必选:否
+ + * 默认值:无
+ +* **postSql** + + * 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql )
+ + * 必选:否
+ + * 默认值:无
+ +* **batchSize** + + * 描述:一次性批量提交的记录数大小,该值可以极大减少DataX与SqlServer的网络交互次数,并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。
+ + * 必选:否
+ + * 默认值:1024
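下面的 writer 配置片段演示了 preSql 中 `@table` 占位符与 batchSize 的用法(jdbcUrl、表名、列名沿用 3.1 配置样例中的占位值,batchSize 取值仅作演示):

```json
{
    "writer": {
        "name": "sqlserverwriter",
        "parameter": {
            "username": "root",
            "password": "root",
            "column": ["db_id", "db_type", "db_ip"],
            "connection": [
                {
                    "table": ["db_info_for_writer"],
                    "jdbcUrl": "jdbc:sqlserver://[HOST_NAME]:PORT;DatabaseName=[DATABASE_NAME]"
                }
            ],
            "preSql": ["delete from @table where db_id = -1;"],
            "batchSize": 1024
        }
    }
}
```

任务执行时,`@table` 会被替换为 connection 中实际配置的表名;postSql 的写法与 preSql 相同,只是执行时机在写入完成之后。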
+ + + +### 3.3 类型转换 + +类似 SqlServerReader ,目前 SqlServerWriter 支持大部分 SqlServer 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。 + +下面列出 SqlServerWriter 针对 SqlServer 类型转换列表: + + +| DataX 内部类型| SqlServer 数据类型 | +| -------- | ----- | +| Long || +| Double || +| String || +| Date || +| Boolean || +| Bytes || + + + +## 4 性能报告 + +### 4.1 环境准备 + +#### 4.1.1 数据特征 +建表语句: +``` + +``` +单行记录类似于: +``` +``` +#### 4.1.2 机器参数 + +* 执行 DataX 的机器参数为: + 1. cpu: 24 Core Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz + 2. mem: 94GB + 3. net: 千兆双网卡 + 4. disc: DataX 数据不落磁盘,不统计此项 + +* SqlServer 数据库机器参数为: + 1. cpu: 4 Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz + 2. mem: 7GB + +#### 4.1.3 DataX jvm 参数 + + -Xms1024m -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError + +#### 4.1.4 性能测试作业配置 + +``` + +``` + +### 4.2 测试报告 + +#### 4.2.1 测试报告 + + +## 5 约束限制 + + + + +## FAQ + +*** + +**Q: SqlServerWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?** + +A: DataX 导入过程存在三块逻辑,pre 操作、导入操作、post 操作,其中任意一环报错,DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。 + +*** + +**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?** + +A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。 + +*** + +**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?** + +A: 可以配置临时表导入 diff --git a/sqlserverwriter/pom.xml b/sqlserverwriter/pom.xml new file mode 100644 index 0000000000..f4879561ff --- /dev/null +++ b/sqlserverwriter/pom.xml @@ -0,0 +1,79 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + sqlserverwriter + sqlserverwriter + jar + writer data into sqlserver database + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.microsoft.sqlserver + sqljdbc4 + 4.0 + system + ${basedir}/src/main/lib/sqljdbc4-4.0.jar + + + com.alibaba.datax + plugin-rdbms-util + ${datax-project-version} + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/sqlserverwriter/src/main/assembly/package.xml b/sqlserverwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..761dffcd12 --- /dev/null +++ b/sqlserverwriter/src/main/assembly/package.xml @@ -0,0 +1,42 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/sqlserverwriter + + + src/main/lib + + sqljdbc4-4.0.jar + + plugin/writer/sqlserverwriter/libs + + + target/ + + sqlserverwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/sqlserverwriter + + + + + + false + plugin/writer/sqlserverwriter/libs + runtime + + + diff --git a/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriter.java b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriter.java new file mode 100644 index 0000000000..6c81971915 --- /dev/null +++ b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriter.java @@ -0,0 +1,97 @@ +package com.alibaba.datax.plugin.writer.sqlserverwriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode; +import com.alibaba.datax.plugin.rdbms.util.DataBaseType; +import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter; +import 
com.alibaba.datax.plugin.rdbms.writer.Key; + +import java.util.List; + +public class SqlServerWriter extends Writer { + private static final DataBaseType DATABASE_TYPE = DataBaseType.SQLServer; + + public static class Job extends Writer.Job { + private Configuration originalConfig = null; + private CommonRdbmsWriter.Job commonRdbmsWriterJob; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + // warn:not like mysql, sqlserver only support insert mode + String writeMode = this.originalConfig.getString(Key.WRITE_MODE); + if (null != writeMode) { + throw DataXException + .asDataXException( + DBUtilErrorCode.CONF_ERROR, + String.format( + "写入模式(writeMode)配置错误. 因为sqlserver不支持配置项 writeMode: %s, sqlserver只能使用insert sql 插入数据. 请检查您的配置并作出修改", + writeMode)); + } + + this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job(DATABASE_TYPE); + this.commonRdbmsWriterJob.init(this.originalConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterJob.prepare(this.originalConfig); + } + + @Override + public List split(int mandatoryNumber) { + return this.commonRdbmsWriterJob.split(this.originalConfig, + mandatoryNumber); + } + + @Override + public void post() { + this.commonRdbmsWriterJob.post(this.originalConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterJob.destroy(this.originalConfig); + } + + } + + public static class Task extends Writer.Task { + private Configuration writerSliceConfig; + private CommonRdbmsWriter.Task commonRdbmsWriterTask; + + @Override + public void init() { + this.writerSliceConfig = super.getPluginJobConf(); + this.commonRdbmsWriterTask = new CommonRdbmsWriter.Task( + DATABASE_TYPE); + this.commonRdbmsWriterTask.init(this.writerSliceConfig); + } + + @Override + public void prepare() { + this.commonRdbmsWriterTask.prepare(this.writerSliceConfig); + } + + public void startWrite(RecordReceiver recordReceiver) { + this.commonRdbmsWriterTask.startWrite(recordReceiver, + this.writerSliceConfig, super.getTaskPluginCollector()); + } + + @Override + public void post() { + this.commonRdbmsWriterTask.post(this.writerSliceConfig); + } + + @Override + public void destroy() { + this.commonRdbmsWriterTask.destroy(this.writerSliceConfig); + } + + } + +} diff --git a/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriterErrorCode.java b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriterErrorCode.java new file mode 100644 index 0000000000..26f526a0f7 --- /dev/null +++ b/sqlserverwriter/src/main/java/com/alibaba/datax/plugin/writer/sqlserverwriter/SqlServerWriterErrorCode.java @@ -0,0 +1,31 @@ +package com.alibaba.datax.plugin.writer.sqlserverwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum SqlServerWriterErrorCode implements ErrorCode { + ; + + private final String code; + private final String describe; + + private SqlServerWriterErrorCode(String code, String describe) { + this.code = code; + this.describe = describe; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.describe; + } + + @Override + public String toString() { + return String.format("Code:[%s], Describe:[%s]. 
", this.code, + this.describe); + } +} diff --git a/sqlserverwriter/src/main/lib/sqljdbc4-4.0.jar b/sqlserverwriter/src/main/lib/sqljdbc4-4.0.jar new file mode 100644 index 0000000000..d6b7f6daf4 Binary files /dev/null and b/sqlserverwriter/src/main/lib/sqljdbc4-4.0.jar differ diff --git a/sqlserverwriter/src/main/resources/plugin.json b/sqlserverwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..a92d0c69b8 --- /dev/null +++ b/sqlserverwriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "sqlserverwriter", + "class": "com.alibaba.datax.plugin.writer.sqlserverwriter.SqlServerWriter", + "description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql. warn: The more you know about the database, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/sqlserverwriter/src/main/resources/plugin_job_template.json b/sqlserverwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..b22c7dff47 --- /dev/null +++ b/sqlserverwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,17 @@ +{ + "name": "sqlserverwriter", + "parameter": { + "username": "", + "password": "", + "column": [], + "preSql": [], + "connection": [ + { + "jdbcUrl": "", + "table": [] + } + ], + "preSql": [], + "postSql": [] + } +} \ No newline at end of file diff --git a/streamreader/pom.xml b/streamreader/pom.xml new file mode 100755 index 0000000000..f7b12d501c --- /dev/null +++ b/streamreader/pom.xml @@ -0,0 +1,74 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + streamreader + streamreader + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/streamreader/src/main/assembly/package.xml b/streamreader/src/main/assembly/package.xml new file mode 100755 index 0000000000..5db1e0b756 --- /dev/null +++ b/streamreader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/streamreader + + + target/ + + streamreader-0.0.1-SNAPSHOT.jar + + plugin/reader/streamreader + + + + + + false + plugin/reader/streamreader/libs + runtime + + + diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Constant.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Constant.java new file mode 100755 index 0000000000..a3584cae86 --- /dev/null +++ b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Constant.java @@ -0,0 +1,23 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +public class Constant { + + public static final String TYPE = "type"; + + public static final String VALUE = "value"; + + public static final String RANDOM = "random"; + + + + public static final String DATE_FORMAT_MARK = "dateFormat"; + + public static final String DEFAULT_DATE_FORMAT = "yyyy-MM-dd HH:mm:ss"; + + public static final String HAVE_MIXUP_FUNCTION = "hasMixupFunction"; + public static final String MIXUP_FUNCTION_PATTERN = "\\s*(.*)\\s*,\\s*(.*)\\s*"; + public static final String MIXUP_FUNCTION_PARAM1 = "mixupParam1"; + public static 
final String MIXUP_FUNCTION_PARAM2 = "mixupParam2"; + + +} diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Key.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Key.java new file mode 100755 index 0000000000..6542f4b709 --- /dev/null +++ b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/Key.java @@ -0,0 +1,12 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +public class Key { + + /** + * should look like:[{"value":"123","type":"int"},{"value":"hello","type":"string"}] + */ + public static final String COLUMN = "column"; + + public static final String SLICE_RECORD_COUNT = "sliceRecordCount"; + +} diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReader.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReader.java new file mode 100755 index 0000000000..e3b866596a --- /dev/null +++ b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReader.java @@ -0,0 +1,349 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +import com.alibaba.datax.common.element.*; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.fastjson.JSONObject; + +import org.apache.commons.lang3.RandomStringUtils; +import org.apache.commons.lang3.RandomUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.ParseException; +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.Date; +import java.util.List; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class StreamReader extends Reader { + + public static class Job extends Reader.Job { + + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + private Pattern mixupFunctionPattern; + private Configuration originalConfig; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + // warn: 忽略大小写 + this.mixupFunctionPattern = Pattern.compile(Constant.MIXUP_FUNCTION_PATTERN, Pattern.CASE_INSENSITIVE); + dealColumn(this.originalConfig); + + Long sliceRecordCount = this.originalConfig + .getLong(Key.SLICE_RECORD_COUNT); + if (null == sliceRecordCount) { + throw DataXException.asDataXException(StreamReaderErrorCode.REQUIRED_VALUE, + "没有设置参数[sliceRecordCount]."); + } else if (sliceRecordCount < 1) { + throw DataXException.asDataXException(StreamReaderErrorCode.ILLEGAL_VALUE, + "参数[sliceRecordCount]不能小于1."); + } + + } + + private void dealColumn(Configuration originalConfig) { + List columns = originalConfig.getList(Key.COLUMN, + JSONObject.class); + if (null == columns || columns.isEmpty()) { + throw DataXException.asDataXException(StreamReaderErrorCode.REQUIRED_VALUE, + "没有设置参数[column]."); + } + + List dealedColumns = new ArrayList(); + for (JSONObject eachColumn : columns) { + Configuration eachColumnConfig = Configuration.from(eachColumn); + try { + this.parseMixupFunctions(eachColumnConfig); + } catch (Exception e) { + throw DataXException.asDataXException(StreamReaderErrorCode.NOT_SUPPORT_TYPE, + String.format("解析混淆函数失败[%s]", e.getMessage()), e); + } + + String typeName = eachColumnConfig.getString(Constant.TYPE); + if (StringUtils.isBlank(typeName)) { + // empty typeName will be set to default type: string + 
eachColumnConfig.set(Constant.TYPE, Type.STRING); + } else { + if (Type.DATE.name().equalsIgnoreCase(typeName)) { + boolean notAssignDateFormat = StringUtils + .isBlank(eachColumnConfig + .getString(Constant.DATE_FORMAT_MARK)); + if (notAssignDateFormat) { + eachColumnConfig.set(Constant.DATE_FORMAT_MARK, + Constant.DEFAULT_DATE_FORMAT); + } + } + if (!Type.isTypeIllegal(typeName)) { + throw DataXException.asDataXException( + StreamReaderErrorCode.NOT_SUPPORT_TYPE, + String.format("不支持类型[%s]", typeName)); + } + } + dealedColumns.add(eachColumnConfig.toJSON()); + } + + originalConfig.set(Key.COLUMN, dealedColumns); + } + + private void parseMixupFunctions(Configuration eachColumnConfig) throws Exception{ + // 支持随机函数, demo如下: + // LONG: random 0, 10 0到10之间的随机数字 + // STRING: random 0, 10 0到10长度之间的随机字符串 + // BOOL: random 0, 10 false 和 true出现的比率 + // DOUBLE: random 0, 10 0到10之间的随机浮点数 + // DATE: random 2014-07-07 00:00:00, 2016-07-07 00:00:00 开始时间->结束时间之间的随机时间,日期格式默认(不支持逗号)yyyy-MM-dd HH:mm:ss + // BYTES: random 0, 10 0到10长度之间的随机字符串获取其UTF-8编码的二进制串 + // 配置了混淆函数后,可不配置value + // 2者都没有配置 + String columnValue = eachColumnConfig.getString(Constant.VALUE); + String columnMixup = eachColumnConfig.getString(Constant.RANDOM); + if (StringUtils.isBlank(columnMixup)) { + eachColumnConfig.getNecessaryValue(Constant.VALUE, + StreamReaderErrorCode.REQUIRED_VALUE); + } + // 2者都有配置 + if (StringUtils.isNotBlank(columnMixup) && StringUtils.isNotBlank(columnValue)) { + LOG.warn(String.format("您配置了streamreader常量列(value:%s)和随机混淆列(random:%s), 常量列优先", columnValue, columnMixup)); + eachColumnConfig.remove(Constant.RANDOM); + } + if (StringUtils.isNotBlank(columnMixup)) { + Matcher matcher= this.mixupFunctionPattern.matcher(columnMixup); + if (matcher.matches()) { + String param1 = matcher.group(1); + long param1Int = 0; + String param2 = matcher.group(2); + long param2Int = 0; + if (StringUtils.isBlank(param1) && StringUtils.isBlank(param2)) { + throw DataXException.asDataXException( + StreamReaderErrorCode.ILLEGAL_VALUE, + String.format("random混淆函数不合法[%s], 混淆函数random的参数不能为空:%s, %s", columnMixup, param1, param2)); + } + String typeName = eachColumnConfig.getString(Constant.TYPE); + if (Type.DATE.name().equalsIgnoreCase(typeName)) { + String dateFormat = eachColumnConfig.getString(Constant.DATE_FORMAT_MARK, Constant.DEFAULT_DATE_FORMAT); + try{ + SimpleDateFormat format = new SimpleDateFormat( + eachColumnConfig.getString(Constant.DATE_FORMAT_MARK, Constant.DEFAULT_DATE_FORMAT)); + //warn: do no concern int -> long + param1Int = format.parse(param1).getTime();//milliseconds + param2Int = format.parse(param2).getTime();//milliseconds + }catch (ParseException e) { + throw DataXException.asDataXException( + StreamReaderErrorCode.ILLEGAL_VALUE, + String.format("dateFormat参数[%s]和混淆函数random的参数不匹配,解析错误:%s, %s", dateFormat, param1, param2), e); + } + } else { + param1Int = Integer.parseInt(param1); + param2Int = Integer.parseInt(param2); + } + if (param1Int < 0 || param2Int < 0) { + throw DataXException.asDataXException( + StreamReaderErrorCode.ILLEGAL_VALUE, + String.format("random混淆函数不合法[%s], 混淆函数random的参数不能为负数:%s, %s", columnMixup, param1, param2)); + } + if (!Type.BOOL.name().equalsIgnoreCase(typeName)) { + if (param1Int > param2Int) { + throw DataXException.asDataXException( + StreamReaderErrorCode.ILLEGAL_VALUE, + String.format("random混淆函数不合法[%s], 混淆函数random的参数需要第一个小于等于第二个:%s, %s", columnMixup, param1, param2)); + } + } + eachColumnConfig.set(Constant.MIXUP_FUNCTION_PARAM1, param1Int); + 
eachColumnConfig.set(Constant.MIXUP_FUNCTION_PARAM2, param2Int); + } else { + throw DataXException.asDataXException( + StreamReaderErrorCode.ILLEGAL_VALUE, + String.format("random混淆函数不合法[%s], 需要为param1, param2形式", columnMixup)); + } + this.originalConfig.set(Constant.HAVE_MIXUP_FUNCTION, true); + } + } + + @Override + public void prepare() { + } + + @Override + public List split(int adviceNumber) { + List configurations = new ArrayList(); + + for (int i = 0; i < adviceNumber; i++) { + configurations.add(this.originalConfig.clone()); + } + return configurations; + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + } + + public static class Task extends Reader.Task { + + private Configuration readerSliceConfig; + + private List columns; + + private long sliceRecordCount; + + private boolean haveMixupFunction; + + + @Override + public void init() { + this.readerSliceConfig = super.getPluginJobConf(); + this.columns = this.readerSliceConfig.getList(Key.COLUMN, + String.class); + + this.sliceRecordCount = this.readerSliceConfig + .getLong(Key.SLICE_RECORD_COUNT); + this.haveMixupFunction = this.readerSliceConfig.getBool( + Constant.HAVE_MIXUP_FUNCTION, false); + } + + @Override + public void prepare() { + } + + @Override + public void startRead(RecordSender recordSender) { + Record oneRecord = buildOneRecord(recordSender, this.columns); + while (this.sliceRecordCount > 0) { + if (this.haveMixupFunction) { + oneRecord = buildOneRecord(recordSender, this.columns); + } + recordSender.sendToWriter(oneRecord); + this.sliceRecordCount--; + } + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + private Column buildOneColumn(Configuration eachColumnConfig) throws Exception { + String columnValue = eachColumnConfig + .getString(Constant.VALUE); + Type columnType = Type.valueOf(eachColumnConfig.getString( + Constant.TYPE).toUpperCase()); + String columnMixup = eachColumnConfig.getString(Constant.RANDOM); + long param1Int = eachColumnConfig.getLong(Constant.MIXUP_FUNCTION_PARAM1, 0L); + long param2Int = eachColumnConfig.getLong(Constant.MIXUP_FUNCTION_PARAM2, 1L); + boolean isColumnMixup = StringUtils.isNotBlank(columnMixup); + + switch (columnType) { + case STRING: + if (isColumnMixup) { + return new StringColumn(RandomStringUtils.randomAlphanumeric((int)RandomUtils.nextLong(param1Int, param2Int + 1))); + } else { + return new StringColumn(columnValue); + } + case LONG: + if (isColumnMixup) { + return new LongColumn(RandomUtils.nextLong(param1Int, param2Int + 1)); + } else { + return new LongColumn(columnValue); + } + case DOUBLE: + if (isColumnMixup) { + return new DoubleColumn(RandomUtils.nextDouble(param1Int, param2Int + 1)); + } else { + return new DoubleColumn(columnValue); + } + case DATE: + SimpleDateFormat format = new SimpleDateFormat( + eachColumnConfig.getString(Constant.DATE_FORMAT_MARK, Constant.DEFAULT_DATE_FORMAT)); + if (isColumnMixup) { + return new DateColumn(new Date(RandomUtils.nextLong(param1Int, param2Int + 1))); + } else { + return new DateColumn(format.parse(columnValue)); + } + case BOOL: + if (isColumnMixup) { + // warn: no concern -10 etc..., how about (0, 0)(0, 1)(1,2) + if (param1Int == param2Int) { + param1Int = 0; + param2Int = 1; + } + if (param1Int == 0) { + return new BoolColumn(true); + } else if (param2Int == 0) { + return new BoolColumn(false); + } else { + long randomInt = RandomUtils.nextLong(0, param1Int + param2Int + 1); + return new BoolColumn(randomInt <= param1Int ? 
false : true); + } + } else { + return new BoolColumn("true".equalsIgnoreCase(columnValue) ? true : false); + } + case BYTES: + if (isColumnMixup) { + return new BytesColumn(RandomStringUtils.randomAlphanumeric((int)RandomUtils.nextLong(param1Int, param2Int + 1)).getBytes()); + } else { + return new BytesColumn(columnValue.getBytes()); + } + default: + // in fact,never to be here + throw new Exception(String.format("不支持类型[%s]", + columnType.name())); + } + } + + private Record buildOneRecord(RecordSender recordSender, + List columns) { + if (null == recordSender) { + throw new IllegalArgumentException( + "参数[recordSender]不能为空."); + } + + if (null == columns || columns.isEmpty()) { + throw new IllegalArgumentException( + "参数[column]不能为空."); + } + + Record record = recordSender.createRecord(); + try { + for (String eachColumn : columns) { + Configuration eachColumnConfig = Configuration.from(eachColumn); + record.addColumn(this.buildOneColumn(eachColumnConfig)); + } + } catch (Exception e) { + throw DataXException.asDataXException(StreamReaderErrorCode.ILLEGAL_VALUE, + "构造一个record失败.", e); + } + return record; + } + } + + private enum Type { + STRING, LONG, BOOL, DOUBLE, DATE, BYTES, ; + + private static boolean isTypeIllegal(String typeString) { + try { + Type.valueOf(typeString.toUpperCase()); + } catch (Exception e) { + return false; + } + + return true; + } + } + +} diff --git a/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReaderErrorCode.java b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReaderErrorCode.java new file mode 100755 index 0000000000..ae3f2b8804 --- /dev/null +++ b/streamreader/src/main/java/com/alibaba/datax/plugin/reader/streamreader/StreamReaderErrorCode.java @@ -0,0 +1,34 @@ +package com.alibaba.datax.plugin.reader.streamreader; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum StreamReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("StreamReader-00", "缺失必要的值"), + ILLEGAL_VALUE("StreamReader-01", "值非法"), + NOT_SUPPORT_TYPE("StreamReader-02", "不支持的column类型"),; + + + private final String code; + private final String description; + + private StreamReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/streamreader/src/main/resources/plugin.json b/streamreader/src/main/resources/plugin.json new file mode 100755 index 0000000000..4c0b3edf9d --- /dev/null +++ b/streamreader/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "streamreader", + "class": "com.alibaba.datax.plugin.reader.streamreader.StreamReader", + "description": { + "useScene": "only for developer test.", + "mechanism": "use datax framework to transport data from stream.", + "warn": "Never use it in your real job." 
+ }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/streamreader/src/main/resources/plugin_job_template.json b/streamreader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..4dced63625 --- /dev/null +++ b/streamreader/src/main/resources/plugin_job_template.json @@ -0,0 +1,7 @@ +{ + "name": "streamreader", + "parameter": { + "sliceRecordCount": "", + "column": [] + } +} \ No newline at end of file diff --git a/streamwriter/pom.xml b/streamwriter/pom.xml new file mode 100755 index 0000000000..58b2947125 --- /dev/null +++ b/streamwriter/pom.xml @@ -0,0 +1,68 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + streamwriter + streamwriter + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/streamwriter/src/main/assembly/package.xml b/streamwriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..6564e05bab --- /dev/null +++ b/streamwriter/src/main/assembly/package.xml @@ -0,0 +1,34 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/streamwriter + + + target/ + + streamwriter-0.0.1-SNAPSHOT.jar + + plugin/writer/streamwriter + + + + + + false + plugin/writer/streamwriter/libs + runtime + + + diff --git a/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/Key.java b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/Key.java new file mode 100755 index 0000000000..b716ea21c2 --- /dev/null +++ b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/Key.java @@ -0,0 +1,16 @@ +package com.alibaba.datax.plugin.writer.streamwriter; + +public class Key { + public static final String FIELD_DELIMITER = "fieldDelimiter"; + + public static final String PRINT = "print"; + + public static final String PATH = "path"; + + public static final String FILE_NAME = "fileName"; + + public static final String RECORD_NUM_BEFORE_SLEEP = "recordNumBeforeSleep"; + + public static final String SLEEP_TIME = "sleepTime"; + +} diff --git a/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriter.java b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriter.java new file mode 100755 index 0000000000..888c6ad777 --- /dev/null +++ b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriter.java @@ -0,0 +1,255 @@ + +package com.alibaba.datax.plugin.writer.streamwriter; + +import com.alibaba.datax.common.element.Column; +import com.alibaba.datax.common.element.Record; +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import org.apache.commons.io.FileUtils; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.*; +import java.util.ArrayList; +import java.util.List; + +public class StreamWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory + .getLogger(Job.class); + + 
private Configuration originalConfig; + + @Override + public void init() { + this.originalConfig = super.getPluginJobConf(); + + String path = this.originalConfig.getString(Key.PATH, null); + String fileName = this.originalConfig.getString(Key.FILE_NAME, null); + + if(StringUtils.isNoneBlank(path) && StringUtils.isNoneBlank(fileName)) { + validateParameter(path, fileName); + } + } + + private void validateParameter(String path, String fileName) { + try { + // warn: 这里用户需要配一个目录 + File dir = new File(path); + if (dir.isFile()) { + throw DataXException + .asDataXException( + StreamWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + if (!dir.exists()) { + boolean createdOk = dir.mkdirs(); + if (!createdOk) { + throw DataXException + .asDataXException( + StreamWriterErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("您指定的文件路径 : [%s] 创建失败.", + path)); + } + } + + String fileFullPath = buildFilePath(path, fileName); + File newFile = new File(fileFullPath); + if(newFile.exists()) { + try { + FileUtils.forceDelete(newFile); + } catch (IOException e) { + throw DataXException.asDataXException( + StreamWriterErrorCode.RUNTIME_EXCEPTION, + String.format("删除文件失败 : [%s] ", fileFullPath), e); + } + } + } catch (SecurityException se) { + throw DataXException.asDataXException( + StreamWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限创建文件路径 : [%s] ", path), se); + } + } + + @Override + public void prepare() { + } + + @Override + public List split(int mandatoryNumber) { + List writerSplitConfigs = new ArrayList(); + for (int i = 0; i < mandatoryNumber; i++) { + writerSplitConfigs.add(this.originalConfig); + } + + return writerSplitConfigs; + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory + .getLogger(Task.class); + + private static final String NEWLINE_FLAG = System.getProperty("line.separator", "\n"); + + private Configuration writerSliceConfig; + + private String fieldDelimiter; + private boolean print; + + private String path; + private String fileName; + + private long recordNumBeforSleep; + private long sleepTime; + + + + @Override + public void init() { + this.writerSliceConfig = getPluginJobConf(); + + this.fieldDelimiter = this.writerSliceConfig.getString( + Key.FIELD_DELIMITER, "\t"); + this.print = this.writerSliceConfig.getBool(Key.PRINT, true); + + this.path = this.writerSliceConfig.getString(Key.PATH, null); + this.fileName = this.writerSliceConfig.getString(Key.FILE_NAME, null); + this.recordNumBeforSleep = this.writerSliceConfig.getLong(Key.RECORD_NUM_BEFORE_SLEEP, 0); + this.sleepTime = this.writerSliceConfig.getLong(Key.SLEEP_TIME, 0); + if(recordNumBeforSleep < 0) { + throw DataXException.asDataXException(StreamWriterErrorCode.CONFIG_INVALID_EXCEPTION, "recordNumber 不能为负值"); + } + if(sleepTime <0) { + throw DataXException.asDataXException(StreamWriterErrorCode.CONFIG_INVALID_EXCEPTION, "sleep 不能为负值"); + } + + } + + @Override + public void prepare() { + } + + @Override + public void startWrite(RecordReceiver recordReceiver) { + + + if(StringUtils.isNoneBlank(path) && StringUtils.isNoneBlank(fileName)) { + writeToFile(recordReceiver,path, fileName, recordNumBeforSleep, sleepTime); + } else { + try { + BufferedWriter writer = new BufferedWriter( + new OutputStreamWriter(System.out, "UTF-8")); + + Record record; + while ((record = recordReceiver.getFromReader()) != null) { + if 
(this.print) { + writer.write(recordToString(record)); + } else { + /* do nothing */ + } + } + writer.flush(); + + } catch (Exception e) { + throw DataXException.asDataXException(StreamWriterErrorCode.RUNTIME_EXCEPTION, e); + } + } + } + + private void writeToFile(RecordReceiver recordReceiver, String path, String fileName, + long recordNumBeforSleep, long sleepTime) { + + LOG.info("begin do write..."); + String fileFullPath = buildFilePath(path, fileName); + LOG.info(String.format("write to file : [%s]", fileFullPath)); + BufferedWriter writer = null; + try { + File newFile = new File(fileFullPath); + newFile.createNewFile(); + + writer = new BufferedWriter( + new OutputStreamWriter(new FileOutputStream(newFile, true), "UTF-8")); + + Record record; + int count =0; + while ((record = recordReceiver.getFromReader()) != null) { + if(recordNumBeforSleep > 0 && sleepTime >0 &&count == recordNumBeforSleep) { + LOG.info("StreamWriter start to sleep ... recordNumBeforSleep={},sleepTime={}",recordNumBeforSleep,sleepTime); + try { + Thread.sleep(sleepTime * 1000l); + } catch (InterruptedException e) { + } + } + writer.write(recordToString(record)); + count++; + } + writer.flush(); + } catch (Exception e) { + throw DataXException.asDataXException(StreamWriterErrorCode.RUNTIME_EXCEPTION, e); + } finally { + IOUtils.closeQuietly(writer); + } + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + private String recordToString(Record record) { + int recordLength = record.getColumnNumber(); + if (0 == recordLength) { + return NEWLINE_FLAG; + } + + Column column; + StringBuilder sb = new StringBuilder(); + for (int i = 0; i < recordLength; i++) { + column = record.getColumn(i); + sb.append(column.asString()).append(fieldDelimiter); + } + sb.setLength(sb.length() - 1); + sb.append(NEWLINE_FLAG); + + return sb.toString(); + } + } + + private static String buildFilePath(String path, String fileName) { + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + if (!isEndWithSeparator) { + path = path + IOUtils.DIR_SEPARATOR; + } + return String.format("%s%s", path, fileName); + } +} diff --git a/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriterErrorCode.java b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriterErrorCode.java new file mode 100755 index 0000000000..1762482a24 --- /dev/null +++ b/streamwriter/src/main/java/com/alibaba/datax/plugin/writer/streamwriter/StreamWriterErrorCode.java @@ -0,0 +1,36 @@ +package com.alibaba.datax.plugin.writer.streamwriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +public enum StreamWriterErrorCode implements ErrorCode { + RUNTIME_EXCEPTION("StreamWriter-00", "运行时异常"), + ILLEGAL_VALUE("StreamWriter-01", "您填写的参数值不合法."), + CONFIG_INVALID_EXCEPTION("StreamWriter-02", "您的参数配置错误."), + SECURITY_NOT_ENOUGH("TxtFileWriter-03", "您缺少权限执行相应的文件写入操作."); + + + + private final String code; + private final String description; + + private StreamWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return 
this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s]. ", this.code, + this.description); + } +} diff --git a/streamwriter/src/main/resources/plugin.json b/streamwriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..6eed86e3c5 --- /dev/null +++ b/streamwriter/src/main/resources/plugin.json @@ -0,0 +1,10 @@ +{ + "name": "streamwriter", + "class": "com.alibaba.datax.plugin.writer.streamwriter.StreamWriter", + "description": { + "useScene": "only for developer test.", + "mechanism": "use datax framework to transport data to stream.", + "warn": "Never use it in your real job." + }, + "developer": "alibaba" +} \ No newline at end of file diff --git a/streamwriter/src/main/resources/plugin_job_template.json b/streamwriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..66e1f5e396 --- /dev/null +++ b/streamwriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,7 @@ +{ + "name": "streamwriter", + "parameter": { + "encoding": "", + "print": true + } +} \ No newline at end of file diff --git a/transformer/doc/.gitkeep b/transformer/doc/.gitkeep new file mode 100644 index 0000000000..e69de29bb2 diff --git a/transformer/doc/transformer.md b/transformer/doc/transformer.md new file mode 100644 index 0000000000..84fab96a52 --- /dev/null +++ b/transformer/doc/transformer.md @@ -0,0 +1,232 @@ +# DataX Transformer + +## Transformer定义 + +在数据同步、传输过程中,存在用户对于数据传输进行特殊定制化的需求场景,包括裁剪列、转换列等工作,可以借助ETL的T过程实现(Transformer)。DataX包含了完成的E(Extract)、T(Transformer)、L(Load)支持。 + +## 运行模型 + +![image](http://git.cn-hangzhou.oss.aliyun-inc.com/uploads/datax/datax/b5652c0492c394684958272219ce327c/image.png) + +## UDF手册 +1. dx_substr + * 参数:3个 + * 第一个参数:字段编号,对应record中第几个字段。 + * 第二个参数:字段值的开始位置。 + * 第三个参数:目标字段长度。 + * 返回: 从字符串的指定位置(包含)截取指定长度的字符串。如果开始位置非法抛出异常。如果字段为空值,直接返回(即不参与本transformer) + * 举例: +``` +dx_substr(1,"2","5") column 1的value为“dataxTest”=>"taxTe" +dx_substr(1,"5","10") column 1的value为“dataxTest”=>"Test" +``` +2. dx_pad + * 参数:4个 + * 第一个参数:字段编号,对应record中第几个字段。 + * 第二个参数:"l","r", 指示是在头进行pad,还是尾进行pad。 + * 第三个参数:目标字段长度。 + * 第四个参数:需要pad的字符。 + * 返回: 如果源字符串长度小于目标字段长度,按照位置添加pad字符后返回。如果长于,直接截断(都截右边)。如果字段为空值,转换为空字符串进行pad,即最后的字符串全是需要pad的字符 + * 举例: +``` + dx_pad(1,"l","4","A"), 如果column 1 的值为 xyz=> Axyz, 值为 xyzzzzz => xyzz + dx_pad(1,"r","4","A"), 如果column 1 的值为 xyz=> xyzA, 值为 xyzzzzz => xyzz +``` +3. dx_replace + * 参数:4个 + * 第一个参数:字段编号,对应record中第几个字段。 + * 第二个参数:字段值的开始位置。 + * 第三个参数:需要替换的字段长度。 + * 第四个参数:需要替换的字符串。 + * 返回: 从字符串的指定位置(包含)替换指定长度的字符串。如果开始位置非法抛出异常。如果字段为空值,直接返回(即不参与本transformer) + * 举例: +``` +dx_replace(1,"2","4","****") column 1的value为“dataxTest”=>"da****est" +dx_replace(1,"5","10","****") column 1的value为“dataxTest”=>"data****" +``` +4. dx_filter (关联filter暂不支持,即多个字段的联合判断,函参太过复杂,用户难以使用。) + * 参数: + * 第一个参数:字段编号,对应record中第几个字段。 + * 第二个参数:运算符,支持一下运算符:like, not like, >, =, <, >=, !=, <= + * 第三个参数:正则表达式(java正则表达式)、值。 + * 返回: + * 如果匹配正则表达式,返回Null,表示过滤该行。不匹配表达式时,表示保留该行。(注意是该行)。对于>=<都是对字段直接compare的结果. + * like , not like是将字段转换成String,然后和目标正则表达式进行全匹配。 + * >, =, <, >=, !=, <= 对于DoubleColumn比较double值,对于LongColumn和DateColumn比较long值,其他StringColumn,BooleanColumn以及ByteColumn均比较的是StringColumn值。 + * 如果目标colunn为空(null),对于 = null的过滤条件,将满足条件,被过滤。!=null的过滤条件,null不满足过滤条件,不被过滤。 like,字段为null不满足条件,不被过滤,和not like,字段为null满足条件,被过滤。 + * 举例: +``` +dx_filter(1,"like","dataTest") +dx_filter(1,">=","10") +``` +5. 
dx_groovy + * 参数。 + * 第一个参数: groovy code + * 第二个参数(列表或者为空):extraPackage + * 备注: + * dx_groovy只能调用一次。不能多次调用。 + * groovy code中支持java.lang, java.util的包,可直接引用的对象有record,以及element下的各种column(BoolColumn.class,BytesColumn.class,DateColumn.class,DoubleColumn.class,LongColumn.class,StringColumn.class)。不支持其他包,如果用户有需要用到其他包,可设置extraPackage,注意extraPackage不支持第三方jar包。 + * groovy code中,返回更新过的Record(比如record.setColumn(columnIndex, new StringColumn(newValue));),或者null。返回null表示过滤此行。 + * 用户可以直接调用静态的Util方式(GroovyTransformerStaticUtil),目前GroovyTransformerStaticUtil的方法列表 (按需补充): + * 举例: +``` +groovy 实现的subStr: + String code = "Column column = record.getColumn(1);\n" + + " String oriValue = column.asString();\n" + + " String newValue = oriValue.substring(0, 3);\n" + + " record.setColumn(1, new StringColumn(newValue));\n" + + " return record;"; + dx_groovy(record); +``` +``` +groovy 实现的Replace +String code2 = "Column column = record.getColumn(1);\n" + + " String oriValue = column.asString();\n" + + " String newValue = \"****\" + oriValue.substring(3, oriValue.length());\n" + + " record.setColumn(1, new StringColumn(newValue));\n" + + " return record;"; +``` +``` +groovy 实现的Pad +String code3 = "Column column = record.getColumn(1);\n" + + " String oriValue = column.asString();\n" + + " String padString = \"12345\";\n" + + " String finalPad = \"\";\n" + + " int NeedLength = 8 - oriValue.length();\n" + + " while (NeedLength > 0) {\n" + + "\n" + + " if (NeedLength >= padString.length()) {\n" + + " finalPad += padString;\n" + + " NeedLength -= padString.length();\n" + + " } else {\n" + + " finalPad += padString.substring(0, NeedLength);\n" + + " NeedLength = 0;\n" + + " }\n" + + " }\n" + + " String newValue= finalPad + oriValue;\n" + + " record.setColumn(1, new StringColumn(newValue));\n" + + " return record;"; +``` + +## Job定义 +* 本例中,配置3个UDF。 + +``` +{ + "job": { + "setting": { + "speed": { + "channel": 1 + }, + "errorLimit": { + "record": 0 + } + }, + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [ + { + "value": "DataX", + "type": "string" + }, + { + "value": 19890604, + "type": "long" + }, + { + "value": "1989-06-04 00:00:00", + "type": "date" + }, + { + "value": true, + "type": "bool" + }, + { + "value": "test", + "type": "bytes" + } + ], + "sliceRecordCount": 100000 + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "print": false, + "encoding": "UTF-8" + } + }, + "transformer": [ + { + "name": "dx_substr", + "parameter": + { + "columnIndex":5, + "paras":["1","3"] + } + }, + { + "name": "dx_replace", + "parameter": + { + "columnIndex":4, + "paras":["3","4","****"] + } + }, + { + "name": "dx_groovy", + "parameter": + { + "code": "//groovy code//", + "extraPackage":[ + "import somePackage1;", + "import somePackage2;" + ] + } + } + ] + } + ] + } +} + +``` + + + +## 计量和脏数据 + +Transform过程涉及到数据的转换,可能造成数据的增加或减少,因此更加需要精确度量,包括: + +* Transform的入参Record条数、字节数。 +* Transform的出参Record条数、字节数。 +* Transform的脏数据Record条数、字节数。 +* 如果是多个Transform,某一个发生脏数据,将不会再进行后面的transform,直接统计为脏数据。 +* 目前只提供了所有Transform的计量(成功,失败,过滤的count,以及transform的消耗时间)。 + +涉及到运行过程的计量数据展现定义如下: + +``` +Total 1000000 records, 22000000 bytes | Transform 100000 records(in), 10000 records(out) | Speed 2.10MB/s, 100000 records/s | Error 0 records, 0 bytes | Percentage 100.00% +``` + +**注意,这里主要记录转换的输入输出,需要检测数据输入输出的记录数量变化。** + +涉及到最终作业的计量数据展现定义如下: + +``` +任务启动时刻 : 2015-03-10 17:34:21 +任务结束时刻 : 2015-03-10 17:34:31 +任务总计耗时 : 10s +任务平均流量 : 2.10MB/s +记录写入速度 : 100000rec/s +转换输入总数 : 1000000 +转换输出总数 : 1000000 +读出记录总数 : 
1000000 +同步失败总数 : 0 +``` + +**注意,这里主要记录转换的输入输出,需要检测数据输入输出的记录数量变化。** diff --git a/transformer/pom.xml b/transformer/pom.xml new file mode 100644 index 0000000000..8f3b7aeed2 --- /dev/null +++ b/transformer/pom.xml @@ -0,0 +1,67 @@ + + + datax-all + com.alibaba.datax + 0.0.1-SNAPSHOT + + 4.0.0 + + datax-transformer + jar + + datax-transformer + + + UTF-8 + 0.0.1-SNAPSHOT + + + + + com.alibaba.datax + datax-common + ${datax-version} + + + slf4j-log4j12 + org.slf4j + + + + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/transformer/src/main/assembly/package.xml b/transformer/src/main/assembly/package.xml new file mode 100755 index 0000000000..62fa7b02ac --- /dev/null +++ b/transformer/src/main/assembly/package.xml @@ -0,0 +1,27 @@ + + + + dir + + false + + + target/ + + datax-transformer-0.0.1-SNAPSHOT.jar + + /lib + + + + + + false + /lib + runtime + + + diff --git a/transformer/src/main/java/com/alibaba/datax/transformer/ComplexTransformer.java b/transformer/src/main/java/com/alibaba/datax/transformer/ComplexTransformer.java new file mode 100644 index 0000000000..2a820aeae5 --- /dev/null +++ b/transformer/src/main/java/com/alibaba/datax/transformer/ComplexTransformer.java @@ -0,0 +1,30 @@ +package com.alibaba.datax.transformer; + +import com.alibaba.datax.common.element.Record; + +import java.util.Map; + +/** + * no comments. + * Created by liqiang on 16/3/3. + */ +public abstract class ComplexTransformer { + //transformerName的唯一性在datax中检查,或者提交到插件中心检查。 + private String transformerName; + + + public String getTransformerName() { + return transformerName; + } + + public void setTransformerName(String transformerName) { + this.transformerName = transformerName; + } + + /** + * @param record 行记录,UDF进行record的处理后,更新相应的record + * @param tContext transformer运行的配置项 + * @param paras transformer函数参数 + */ + abstract public Record evaluate(Record record, Map tContext, Object... paras); +} diff --git a/transformer/src/main/java/com/alibaba/datax/transformer/Transformer.java b/transformer/src/main/java/com/alibaba/datax/transformer/Transformer.java new file mode 100644 index 0000000000..37f947da51 --- /dev/null +++ b/transformer/src/main/java/com/alibaba/datax/transformer/Transformer.java @@ -0,0 +1,28 @@ +package com.alibaba.datax.transformer; + +import com.alibaba.datax.common.element.Record; + + +/** + * no comments. + * Created by liqiang on 16/3/3. + */ +public abstract class Transformer { + //transformerName的唯一性在datax中检查,或者提交到插件中心检查。 + private String transformerName; + + + public String getTransformerName() { + return transformerName; + } + + public void setTransformerName(String transformerName) { + this.transformerName = transformerName; + } + + /** + * @param record 行记录,UDF进行record的处理后,更新相应的record + * @param paras transformer函数参数 + */ + abstract public Record evaluate(Record record, Object... 
paras); +} diff --git a/txtfilereader/doc/txtfilereader.md b/txtfilereader/doc/txtfilereader.md new file mode 100644 index 0000000000..ae91b35b98 --- /dev/null +++ b/txtfilereader/doc/txtfilereader.md @@ -0,0 +1,258 @@ +# DataX TxtFileReader 说明 + + +------------ + +## 1 快速介绍 + +TxtFileReader提供了读取本地文件系统数据存储的能力。在底层实现上,TxtFileReader获取本地文件数据,并转换为DataX传输协议传递给Writer。 + +**本地文件内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +## 2 功能与限制 + +TxtFileReader实现了从本地文件读取数据并转为DataX协议的功能,本地文件本身是无结构化数据存储,对于DataX而言,TxtFileReader实现上类比OSSReader,有诸多相似之处。目前TxtFileReader支持功能如下: + +1. 支持且仅支持读取TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 支持多种类型数据读取(使用String表示),支持列裁剪,支持列常量 + +4. 支持递归读取、支持文件名过滤。 + +5. 支持文本压缩,现有压缩格式为zip、gzip、bzip2。 + +6. 多个File可以支持并发读取。 + +我们暂时不能做到: + +1. 单个File支持多线程并发读取,这里涉及到单个File内部切分算法。二期考虑支持。 + +2. 单个File在压缩情况下,从技术上无法支持多线程并发读取。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": ["/home/haiwei.luo/case00/data"], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "fieldDelimiter": "," + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/home/haiwei.luo/case00/result", + "fileName": "luohw", + "writeMode": "truncate", + "format": "yyyy-MM-dd" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **path** + + * 描述:本地文件系统的路径信息,注意这里可以支持填写多个路径。
+ + 当指定单个本地文件,TxtFileReader暂时只能使用单线程进行数据抽取。二期考虑在非压缩文件情况下针对单个File可以进行多线程并发读取。 + + 当指定多个本地文件,TxtFileReader支持使用多线程进行数据抽取。线程并发数通过通道数指定。 + + 当指定通配符,TxtFileReader尝试遍历出多个文件信息。例如: 指定/*代表读取/目录下所有的文件,指定/bazhen/\*代表读取bazhen目录下游所有的文件。**TxtFileReader目前只支持\*作为文件通配符。** + + **特别需要注意的是,DataX会将一个作业下同步的所有Text File视作同一张数据表。用户必须自己保证所有的File能够适配同一套schema信息。读取文件用户必须保证为类CSV格式,并且提供给DataX权限可读。** + + **特别需要注意的是,如果Path指定的路径下没有符合匹配的文件抽取,DataX将报错。** + + * 必选:是
+ + * 默认值:无
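As a minimal sketch of the forms described above, the fragment below mixes a plain directory with a wildcard pattern; both paths are hypothetical and only illustrate the notation the reader accepts.

```json
"path": [
    "/home/admin/case00/data/",
    "/home/admin/case00/archive/*"
]
```

As stressed above, every file matched this way is treated as one logical table, so all of them must share the same schema and be readable by DataX.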
+ +* **column** + + * Description: the list of fields to read. type specifies the source data type; index specifies which column of the text file the field comes from (counting from 0); value marks the field as a constant column that is not read from the source file but generated from the configured value.<br />
+ + 默认情况下,用户可以全部按照String类型读取数据,配置如下: + + ```json + "column": ["*"] + ``` + + 用户可以指定Column字段信息,配置如下: + + ```json + { + "type": "long", + "index": 0 //从本地文件文本第一列获取int字段 + }, + { + "type": "string", + "value": "alibaba" //从TxtFileReader内部生成alibaba的字符串字段作为当前字段 + } + ``` + + 对于用户指定Column信息,type必须填写,index/value必须选择其一。 + + * 必选:是
+ + * 默认值:全部按照string类型读取
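A sketch combining the two column forms documented above: two fields are taken from the file by index (one of them a date parsed with an explicit format, as in the sample job), and one field is a constant generated by the reader; the concrete indexes and the constant string are assumptions for illustration only.

```json
"column": [
    { "index": 0, "type": "long" },
    { "index": 3, "type": "date", "format": "yyyy.MM.dd" },
    { "value": "alibaba", "type": "string" }
]
```

Each entry carries type plus exactly one of index or value, in line with the rule stated above.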
+ +* **fieldDelimiter** + + * 描述:读取的字段分隔符
+ + * 必选:是
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为zip、gzip、bzip2。
+ + * 必选:否
+ + * 默认值:没有压缩
+ +* **encoding** + + * 描述:读取文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ +* **skipHeader** + + * 描述:类CSV格式文件可能存在表头为标题情况,需要跳过。默认不跳过。
+ + * 必选:否
+ + * 默认值:false
+ +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+ + 例如如果用户配置: nullFormat:"\N",那么如果源头数据是"\N",DataX视作null字段。 + + * 必选:否
+ + * 默认值:\N
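The three optional switches above can be combined; the following fragment is only a sketch with assumed values, and note that the backslash in \N has to be escaped inside JSON.

```json
"compress": "gzip",
"skipHeader": true,
"nullFormat": "\\N"
```

With this configuration the reader would be expected to decompress each gzip file, skip its first (header) line, and turn the literal string \N into a null field.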
+ +* **csvReaderConfig** + + * 描述:读取CSV类型文件参数配置,Map类型。读取CSV类型文件使用的CsvReader进行读取,会有很多配置,不配置则使用默认值。
+ + * 必选:否
+ + * 默认值:无
+ + +常见配置: + +```json +"csvReaderConfig":{ + "safetySwitch": false, + "skipEmptyRecords": false, + "useTextQualifier": false +} +``` + +所有配置项及默认值,配置时 csvReaderConfig 的map中请**严格按照以下字段名字进行配置**: + +``` +boolean caseSensitive = true; +char textQualifier = 34; +boolean trimWhitespace = true; +boolean useTextQualifier = true;//是否使用csv转义字符 +char delimiter = 44;//分隔符 +char recordDelimiter = 0; +char comment = 35; +boolean useComments = false; +int escapeMode = 1; +boolean safetySwitch = true;//单列长度是否限制100000字符 +boolean skipEmptyRecords = true;//是否跳过空行 +boolean captureRawRecord = true; +``` + +### 3.3 类型转换 + +本地文件本身不提供数据类型,该类型是DataX TxtFileReader定义: + +| DataX 内部类型| 本地文件 数据类型 | +| -------- | ----- | +| +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* 本地文件 Long是指本地文件文本中使用整形的字符串表示形式,例如"19901219"。 +* 本地文件 Double是指本地文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* 本地文件 Boolean是指本地文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* 本地文件 Date是指本地文件文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 4 性能报告 + + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + + diff --git a/txtfilereader/pom.xml b/txtfilereader/pom.xml new file mode 100755 index 0000000000..f1c79db70f --- /dev/null +++ b/txtfilereader/pom.xml @@ -0,0 +1,78 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + txtfilereader + txtfilereader + TxtFileReader提供了本地读取TEXT功能,并可以根据用户配置的类型进行类型转换,建议开发、测试环境使用。 + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + + diff --git a/txtfilereader/src/main/assembly/package.xml b/txtfilereader/src/main/assembly/package.xml new file mode 100755 index 0000000000..895737b32e --- /dev/null +++ b/txtfilereader/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/reader/txtfilereader + + + target/ + + txtfilereader-0.0.1-SNAPSHOT.jar + + plugin/reader/txtfilereader + + + + + + false + plugin/reader/txtfilereader/libs + runtime + + + diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Constant.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Constant.java new file mode 100755 index 0000000000..7b7a46fa27 --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Constant.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +/** + * Created by haiwei.luo on 14-9-20. + */ +public class Constant { + public static final String SOURCE_FILES = "sourceFiles"; + +} diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Key.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Key.java new file mode 100755 index 0000000000..4f6ddb016e --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/Key.java @@ -0,0 +1,8 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +/** + * Created by haiwei.luo on 14-9-20. 
+ */ +public class Key { + public static final String PATH = "path"; +} diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReader.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReader.java new file mode 100755 index 0000000000..914305c69c --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReader.java @@ -0,0 +1,420 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordSender; +import com.alibaba.datax.common.spi.Reader; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderErrorCode; +import com.alibaba.datax.plugin.unstructuredstorage.reader.UnstructuredStorageReaderUtil; +import com.google.common.collect.Sets; + +import org.apache.commons.io.Charsets; +import org.apache.commons.io.IOUtils; +import org.apache.commons.lang3.BooleanUtils; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.InputStream; +import java.nio.charset.UnsupportedCharsetException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.regex.Pattern; + +/** + * Created by haiwei.luo on 14-9-20. + */ +public class TxtFileReader extends Reader { + public static class Job extends Reader.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration originConfig = null; + + private List path = null; + + private List sourceFiles; + + private Map pattern; + + private Map isRegexPath; + + @Override + public void init() { + this.originConfig = this.getPluginJobConf(); + this.pattern = new HashMap(); + this.isRegexPath = new HashMap(); + this.validateParameter(); + } + + private void validateParameter() { + // Compatible with the old version, path is a string before + String pathInString = this.originConfig.getNecessaryValue(Key.PATH, + TxtFileReaderErrorCode.REQUIRED_VALUE); + if (StringUtils.isBlank(pathInString)) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.REQUIRED_VALUE, + "您需要指定待读取的源目录或文件"); + } + if (!pathInString.startsWith("[") && !pathInString.endsWith("]")) { + path = new ArrayList(); + path.add(pathInString); + } else { + path = this.originConfig.getList(Key.PATH, String.class); + if (null == path || path.size() == 0) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.REQUIRED_VALUE, + "您需要指定待读取的源目录或文件"); + } + } + + String encoding = this.originConfig + .getString( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_ENCODING); + if (StringUtils.isBlank(encoding)) { + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + com.alibaba.datax.plugin.unstructuredstorage.reader.Constant.DEFAULT_ENCODING); + } else { + try { + encoding = encoding.trim(); + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.ENCODING, + encoding); + Charsets.toCharset(encoding); + } catch (UnsupportedCharsetException uce) { + throw DataXException.asDataXException( + 
TxtFileReaderErrorCode.ILLEGAL_VALUE, + String.format("不支持您配置的编码格式 : [%s]", encoding), uce); + } catch (Exception e) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("编码配置异常, 请联系我们: %s", e.getMessage()), + e); + } + } + + // column: 1. index type 2.value type 3.when type is Date, may have + // format + List columns = this.originConfig + .getListConfiguration(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN); + // handle ["*"] + if (null != columns && 1 == columns.size()) { + String columnsInStr = columns.get(0).toString(); + if ("\"*\"".equals(columnsInStr) || "'*'".equals(columnsInStr)) { + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COLUMN, + null); + columns = null; + } + } + + if (null != columns && columns.size() != 0) { + for (Configuration eachColumnConf : columns) { + eachColumnConf + .getNecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.reader.Key.TYPE, + TxtFileReaderErrorCode.REQUIRED_VALUE); + Integer columnIndex = eachColumnConf + .getInt(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.INDEX); + String columnValue = eachColumnConf + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.VALUE); + + if (null == columnIndex && null == columnValue) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.NO_INDEX_VALUE, + "由于您配置了type, 则至少需要配置 index 或 value"); + } + + if (null != columnIndex && null != columnValue) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.MIXED_INDEX_VALUE, + "您混合配置了index, value, 每一列同时仅能选择其中一种"); + } + if (null != columnIndex && columnIndex < 0) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.ILLEGAL_VALUE, String + .format("index需要大于等于0, 您配置的index为[%s]", + columnIndex)); + } + } + } + + // only support compress types + String compress = this.originConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS); + if (StringUtils.isBlank(compress)) { + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + null); + } else { + Set supportedCompress = Sets + .newHashSet("gzip", "bzip2", "zip"); + compress = compress.toLowerCase().trim(); + if (!supportedCompress.contains(compress)) { + throw DataXException + .asDataXException( + TxtFileReaderErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 gzip, bzip2, zip 文件压缩格式 , 不支持您配置的文件压缩格式: [%s]", + compress)); + } + this.originConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.COMPRESS, + compress); + } + + String delimiterInStr = this.originConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.reader.Key.FIELD_DELIMITER); + // warn: if have, length must be one + if (null != delimiterInStr && 1 != delimiterInStr.length()) { + throw DataXException.asDataXException( + UnstructuredStorageReaderErrorCode.ILLEGAL_VALUE, + String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", + delimiterInStr)); + } + + } + + @Override + public void prepare() { + LOG.debug("prepare() begin..."); + // warn:make sure this regex string + // warn:no need trim + for (String eachPath : this.path) { + String regexString = eachPath.replace("*", ".*").replace("?", + ".?"); + Pattern patt = Pattern.compile(regexString); + this.pattern.put(eachPath, patt); + this.sourceFiles = this.buildSourceTargets(); + } + + LOG.info(String.format("您即将读取的文件数为: [%s]", this.sourceFiles.size())); + } + + @Override + public void post() { + } + + @Override + public void destroy() { + } + + // warn: 
如果源目录为空会报错,拖空目录意图=>空文件显示指定此意图 + @Override + public List split(int adviceNumber) { + LOG.debug("split() begin..."); + List readerSplitConfigs = new ArrayList(); + + // warn:每个slice拖且仅拖一个文件, + // int splitNumber = adviceNumber; + int splitNumber = this.sourceFiles.size(); + if (0 == splitNumber) { + throw DataXException.asDataXException( + TxtFileReaderErrorCode.EMPTY_DIR_EXCEPTION, String + .format("未能找到待读取的文件,请确认您的配置项path: %s", + this.originConfig.getString(Key.PATH))); + } + + List> splitedSourceFiles = this.splitSourceFiles( + this.sourceFiles, splitNumber); + for (List files : splitedSourceFiles) { + Configuration splitedConfig = this.originConfig.clone(); + splitedConfig.set(Constant.SOURCE_FILES, files); + readerSplitConfigs.add(splitedConfig); + } + LOG.debug("split() ok and end..."); + return readerSplitConfigs; + } + + // validate the path, path must be a absolute path + private List buildSourceTargets() { + // for eath path + Set toBeReadFiles = new HashSet(); + for (String eachPath : this.path) { + int endMark; + for (endMark = 0; endMark < eachPath.length(); endMark++) { + if ('*' != eachPath.charAt(endMark) + && '?' != eachPath.charAt(endMark)) { + continue; + } else { + this.isRegexPath.put(eachPath, true); + break; + } + } + + String parentDirectory; + if (BooleanUtils.isTrue(this.isRegexPath.get(eachPath))) { + int lastDirSeparator = eachPath.substring(0, endMark) + .lastIndexOf(IOUtils.DIR_SEPARATOR); + parentDirectory = eachPath.substring(0, + lastDirSeparator + 1); + } else { + this.isRegexPath.put(eachPath, false); + parentDirectory = eachPath; + } + this.buildSourceTargetsEathPath(eachPath, parentDirectory, + toBeReadFiles); + } + return Arrays.asList(toBeReadFiles.toArray(new String[0])); + } + + private void buildSourceTargetsEathPath(String regexPath, + String parentDirectory, Set toBeReadFiles) { + // 检测目录是否存在,错误情况更明确 + try { + File dir = new File(parentDirectory); + boolean isExists = dir.exists(); + if (!isExists) { + String message = String.format("您设定的目录不存在 : [%s]", + parentDirectory); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.FILE_NOT_EXISTS, message); + } + } catch (SecurityException se) { + String message = String.format("您没有权限查看目录 : [%s]", + parentDirectory); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.SECURITY_NOT_ENOUGH, message); + } + + directoryRover(regexPath, parentDirectory, toBeReadFiles); + } + + private void directoryRover(String regexPath, String parentDirectory, + Set toBeReadFiles) { + File directory = new File(parentDirectory); + // is a normal file + if (!directory.isDirectory()) { + if (this.isTargetFile(regexPath, directory.getAbsolutePath())) { + toBeReadFiles.add(parentDirectory); + LOG.info(String.format( + "add file [%s] as a candidate to be read.", + parentDirectory)); + + } + } else { + // 是目录 + try { + // warn:对于没有权限的目录,listFiles 返回null,而不是抛出SecurityException + File[] files = directory.listFiles(); + if (null != files) { + for (File subFileNames : files) { + directoryRover(regexPath, + subFileNames.getAbsolutePath(), + toBeReadFiles); + } + } else { + // warn: 对于没有权限的文件,是直接throw DataXException + String message = String.format("您没有权限查看目录 : [%s]", + directory); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.SECURITY_NOT_ENOUGH, + message); + } + + } catch (SecurityException e) { + String message = String.format("您没有权限查看目录 : [%s]", + directory); + LOG.error(message); + throw DataXException.asDataXException( 
+ TxtFileReaderErrorCode.SECURITY_NOT_ENOUGH, + message, e); + } + } + } + + // 正则过滤 + private boolean isTargetFile(String regexPath, String absoluteFilePath) { + if (this.isRegexPath.get(regexPath)) { + return this.pattern.get(regexPath).matcher(absoluteFilePath) + .matches(); + } else { + return true; + } + + } + + private List> splitSourceFiles(final List sourceList, + int adviceNumber) { + List> splitedList = new ArrayList>(); + int averageLength = sourceList.size() / adviceNumber; + averageLength = averageLength == 0 ? 1 : averageLength; + + for (int begin = 0, end = 0; begin < sourceList.size(); begin = end) { + end = begin + averageLength; + if (end > sourceList.size()) { + end = sourceList.size(); + } + splitedList.add(sourceList.subList(begin, end)); + } + return splitedList; + } + + } + + public static class Task extends Reader.Task { + private static Logger LOG = LoggerFactory.getLogger(Task.class); + + private Configuration readerSliceConfig; + private List sourceFiles; + + @Override + public void init() { + this.readerSliceConfig = this.getPluginJobConf(); + this.sourceFiles = this.readerSliceConfig.getList( + Constant.SOURCE_FILES, String.class); + } + + @Override + public void prepare() { + + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + @Override + public void startRead(RecordSender recordSender) { + LOG.debug("start read source files..."); + for (String fileName : this.sourceFiles) { + LOG.info(String.format("reading file : [%s]", fileName)); + InputStream inputStream; + try { + inputStream = new FileInputStream(fileName); + UnstructuredStorageReaderUtil.readFromStream(inputStream, + fileName, this.readerSliceConfig, recordSender, + this.getTaskPluginCollector()); + recordSender.flush(); + } catch (FileNotFoundException e) { + // warn: sock 文件无法read,能影响所有文件的传输,需要用户自己保证 + String message = String + .format("找不到待读取的文件 : [%s]", fileName); + LOG.error(message); + throw DataXException.asDataXException( + TxtFileReaderErrorCode.OPEN_FILE_ERROR, message); + } + } + LOG.debug("end read source files..."); + } + + } +} diff --git a/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReaderErrorCode.java b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReaderErrorCode.java new file mode 100755 index 0000000000..4a37dadc98 --- /dev/null +++ b/txtfilereader/src/main/java/com/alibaba/datax/plugin/reader/txtfilereader/TxtFileReaderErrorCode.java @@ -0,0 +1,45 @@ +package com.alibaba.datax.plugin.reader.txtfilereader; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-20. 
+ */ +public enum TxtFileReaderErrorCode implements ErrorCode { + REQUIRED_VALUE("TxtFileReader-00", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("TxtFileReader-01", "您填写的参数值不合法."), + MIXED_INDEX_VALUE("TxtFileReader-02", "您的列信息配置同时包含了index,value."), + NO_INDEX_VALUE("TxtFileReader-03","您明确的配置列信息,但未填写相应的index,value."), + FILE_NOT_EXISTS("TxtFileReader-04", "您配置的目录文件路径不存在."), + OPEN_FILE_WITH_CHARSET_ERROR("TxtFileReader-05", "您配置的文件编码和实际文件编码不符合."), + OPEN_FILE_ERROR("TxtFileReader-06", "您配置的文件在打开时异常,建议您检查源目录是否有隐藏文件,管道文件等特殊文件."), + READ_FILE_IO_ERROR("TxtFileReader-07", "您配置的文件在读取时出现IO异常."), + SECURITY_NOT_ENOUGH("TxtFileReader-08", "您缺少权限执行相应的文件操作."), + CONFIG_INVALID_EXCEPTION("TxtFileReader-09", "您的参数配置错误."), + RUNTIME_EXCEPTION("TxtFileReader-10", "出现运行时异常, 请联系我们"), + EMPTY_DIR_EXCEPTION("TxtFileReader-11", "您尝试读取的文件目录为空."),; + + private final String code; + private final String description; + + private TxtFileReaderErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } +} diff --git a/txtfilereader/src/main/resources/plugin.json b/txtfilereader/src/main/resources/plugin.json new file mode 100755 index 0000000000..3a42196214 --- /dev/null +++ b/txtfilereader/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "txtfilereader", + "class": "com.alibaba.datax.plugin.reader.txtfilereader.TxtFileReader", + "description": "useScene: test. mechanism: use datax framework to transport data from txt file. warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/txtfilereader/src/main/resources/plugin_job_template.json b/txtfilereader/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..744d6f54bd --- /dev/null +++ b/txtfilereader/src/main/resources/plugin_job_template.json @@ -0,0 +1,9 @@ +{ + "name": "txtfilereader", + "parameter": { + "path": [], + "encoding": "", + "column": [], + "fieldDelimiter": "" + } +} \ No newline at end of file diff --git a/txtfilewriter/doc/txtfilewriter.md b/txtfilewriter/doc/txtfilewriter.md new file mode 100644 index 0000000000..e8daab739e --- /dev/null +++ b/txtfilewriter/doc/txtfilewriter.md @@ -0,0 +1,216 @@ +# DataX TxtFileWriter 说明 + + +------------ + +## 1 快速介绍 + +TxtFileWriter提供了向本地文件写入类CSV格式的一个或者多个表文件。TxtFileWriter服务的用户主要在于DataX开发、测试同学。 + +**写入本地文件内容存放的是一张逻辑意义上的二维表,例如CSV格式的文本信息。** + + +## 2 功能与限制 + +TxtFileWriter实现了从DataX协议转为本地TXT文件功能,本地文件本身是无结构化数据存储,TxtFileWriter如下几个方面约定: + +1. 支持且仅支持写入 TXT的文件,且要求TXT中shema为一张二维表。 + +2. 支持类CSV格式文件,自定义分隔符。 + +3. 支持文本压缩,现有压缩格式为gzip、bzip2。 + +6. 支持多线程写入,每个线程写入不同子文件。 + +7. 文件支持滚动,当文件大于某个size值或者行数值,文件需要切换。 [暂不支持] + +我们不能做到: + +1. 
单个文件不能支持并发写入。 + + +## 3 功能说明 + + +### 3.1 配置样例 + +```json +{ + "setting": {}, + "job": { + "setting": { + "speed": { + "channel": 2 + } + }, + "content": [ + { + "reader": { + "name": "txtfilereader", + "parameter": { + "path": ["/home/haiwei.luo/case00/data"], + "encoding": "UTF-8", + "column": [ + { + "index": 0, + "type": "long" + }, + { + "index": 1, + "type": "boolean" + }, + { + "index": 2, + "type": "double" + }, + { + "index": 3, + "type": "string" + }, + { + "index": 4, + "type": "date", + "format": "yyyy.MM.dd" + } + ], + "fieldDelimiter": "," + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/home/haiwei.luo/case00/result", + "fileName": "luohw", + "writeMode": "truncate", + "dateFormat": "yyyy-MM-dd" + } + } + } + ] + } +} +``` + +### 3.2 参数说明 + +* **path** + + * 描述:本地文件系统的路径信息,TxtFileWriter会写入Path目录下属多个文件。
+ + * 必选:是
+ + * 默认值:无
+
+* **fileName**
+
+	* 描述:TxtFileWriter写入的文件名,该文件名会添加随机后缀,作为每个线程实际写入的文件名。
+ + * 必选:是
+ + * 默认值:无
+ +* **writeMode** + + * 描述:TxtFileWriter写入前数据清理处理模式:
+
+		* truncate,写入前清理目录下以fileName为前缀的所有文件。
+		* append,写入前不做任何处理,DataX TxtFileWriter直接使用fileName写入,并保证文件名不冲突。
+		* nonConflict,如果目录下已有以fileName为前缀的文件,直接报错。(三种模式与其他可选参数的组合配置,可参考本节参数说明末尾的示例。)
+
+	* 必选:是
+ + * 默认值:无
+
+* **fieldDelimiter**
+
+	* 描述:写出数据时使用的字段分隔符。
+ + * 必选:否
+ + * 默认值:,
+ +* **compress** + + * 描述:文本压缩类型,默认不填写意味着没有压缩。支持压缩类型为zip、lzo、lzop、tgz、bzip2。
+ + * 必选:否
+ + * 默认值:无压缩
+
+* **encoding**
+
+	* 描述:写出文件的编码配置。
+ + * 必选:否
+ + * 默认值:utf-8
+ + +* **nullFormat** + + * 描述:文本文件中无法使用标准字符串定义null(空指针),DataX提供nullFormat定义哪些字符串可以表示为null。
+
+		 例如,用户配置 nullFormat="\N" 时,若某字段值为null,TxtFileWriter会在文件中写出字符串"\N"。
+
+	* 必选:否
+ + * 默认值:\N
+ +* **dateFormat** + + * 描述:日期类型的数据序列化到文件中时的格式,例如 "dateFormat": "yyyy-MM-dd"。
+ + * 必选:否
+ + * 默认值:无
+
+* **fileFormat**
+
+	* 描述:文件写出的格式,包括csv (http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) 和text两种:csv是严格的csv格式,如果待写数据包含列分隔符,会按照csv的转义语法转义,转义符号为双引号(");text格式则用列分隔符简单拼接待写数据,即使其中包含列分隔符也不做转义。
+ + * 必选:否
+ + * 默认值:text
+ +* **header** + + * 描述:txt写出时的表头,示例['id', 'name', 'age']。
+ + * 必选:否
+ + * 默认值:无
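+
+下面给出一个仅作示意的writer配置片段,演示上述可选参数的组合方式。其中path、fileName取值沿用3.1配置样例,其余取值均为假设,compress、fileFormat等可选项请按实际需要增删;完整作业仍需像3.1样例那样与reader一起组装:
+
+```json
+{
+    "name": "txtfilewriter",
+    "parameter": {
+        "path": "/home/haiwei.luo/case00/result",
+        "fileName": "luohw",
+        "writeMode": "truncate",
+        "fieldDelimiter": ",",
+        "encoding": "UTF-8",
+        "compress": "bzip2",
+        "nullFormat": "\\N",
+        "dateFormat": "yyyy-MM-dd",
+        "fileFormat": "csv",
+        "header": ["id", "name", "age"]
+    }
+}
+```
+
+注意:JSON中反斜杠需要转义,因此希望null值写出为"\N"时,nullFormat应配置为"\\N"。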
+ +### 3.3 类型转换 + + +本地文件本身不提供数据类型,该类型是DataX TxtFileWriter定义: + +| DataX 内部类型| 本地文件 数据类型 | +| -------- | ----- | +| +| Long |Long | +| Double |Double| +| String |String| +| Boolean |Boolean | +| Date |Date | + +其中: + +* 本地文件 Long是指本地文件文本中使用整形的字符串表示形式,例如"19901219"。 +* 本地文件 Double是指本地文件文本中使用Double的字符串表示形式,例如"3.1415"。 +* 本地文件 Boolean是指本地文件文本中使用Boolean的字符串表示形式,例如"true"、"false"。不区分大小写。 +* 本地文件 Date是指本地文件文本中使用Date的字符串表示形式,例如"2014-12-31",Date可以指定format格式。 + + +## 4 性能报告 + + +## 5 约束限制 + +略 + +## 6 FAQ + +略 + + diff --git a/txtfilewriter/pom.xml b/txtfilewriter/pom.xml new file mode 100755 index 0000000000..7d6489eee7 --- /dev/null +++ b/txtfilewriter/pom.xml @@ -0,0 +1,78 @@ + + 4.0.0 + + com.alibaba.datax + datax-all + 0.0.1-SNAPSHOT + + + txtfilewriter + txtfilewriter + TxtFileWriter提供了本地写入TEXT功能,建议开发、测试环境使用。 + jar + + + + com.alibaba.datax + datax-common + ${datax-project-version} + + + slf4j-log4j12 + org.slf4j + + + + + com.alibaba.datax + plugin-unstructured-storage-util + ${datax-project-version} + + + org.slf4j + slf4j-api + + + ch.qos.logback + logback-classic + + + com.google.guava + guava + 16.0.1 + + + + + + + + maven-compiler-plugin + + 1.6 + 1.6 + ${project-sourceEncoding} + + + + maven-assembly-plugin + + + src/main/assembly/package.xml + + datax + + + + dwzip + package + + single + + + + + + + diff --git a/txtfilewriter/src/main/assembly/package.xml b/txtfilewriter/src/main/assembly/package.xml new file mode 100755 index 0000000000..3b6371c9a3 --- /dev/null +++ b/txtfilewriter/src/main/assembly/package.xml @@ -0,0 +1,35 @@ + + + + dir + + false + + + src/main/resources + + plugin.json + plugin_job_template.json + + plugin/writer/txtfilewriter + + + target/ + + txtfilewriter-0.0.1-SNAPSHOT.jar + + plugin/writer/txtfilewriter + + + + + + false + plugin/writer/txtfilewriter/libs + runtime + + + diff --git a/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/Key.java b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/Key.java new file mode 100755 index 0000000000..70739dcc57 --- /dev/null +++ b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/Key.java @@ -0,0 +1,9 @@ +package com.alibaba.datax.plugin.writer.txtfilewriter; + +/** + * Created by haiwei.luo on 14-9-17. 
+ */ +public class Key { + // must have + public static final String PATH = "path"; +} diff --git a/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriter.java b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriter.java new file mode 100755 index 0000000000..04dba29798 --- /dev/null +++ b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriter.java @@ -0,0 +1,342 @@ +package com.alibaba.datax.plugin.writer.txtfilewriter; + +import com.alibaba.datax.common.exception.DataXException; +import com.alibaba.datax.common.plugin.RecordReceiver; +import com.alibaba.datax.common.spi.Writer; +import com.alibaba.datax.common.util.Configuration; +import com.alibaba.datax.plugin.unstructuredstorage.writer.UnstructuredStorageWriterUtil; + +import org.apache.commons.io.FileUtils; +import org.apache.commons.io.IOUtils; +import org.apache.commons.io.filefilter.PrefixFileFilter; +import org.apache.commons.lang3.StringUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.File; +import java.io.FileOutputStream; +import java.io.FilenameFilter; +import java.io.IOException; +import java.io.OutputStream; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.UUID; + +/** + * Created by haiwei.luo on 14-9-17. + */ +public class TxtFileWriter extends Writer { + public static class Job extends Writer.Job { + private static final Logger LOG = LoggerFactory.getLogger(Job.class); + + private Configuration writerSliceConfig = null; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.validateParameter(); + String dateFormatOld = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FORMAT); + String dateFormatNew = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.DATE_FORMAT); + if (null == dateFormatNew) { + this.writerSliceConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.DATE_FORMAT, + dateFormatOld); + } + if (null != dateFormatOld) { + LOG.warn("您使用format配置日期格式化, 这是不推荐的行为, 请优先使用dateFormat配置项, 两项同时存在则使用dateFormat."); + } + UnstructuredStorageWriterUtil + .validateParameter(this.writerSliceConfig); + } + + private void validateParameter() { + this.writerSliceConfig + .getNecessaryValue( + com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME, + TxtFileWriterErrorCode.REQUIRED_VALUE); + + String path = this.writerSliceConfig.getNecessaryValue(Key.PATH, + TxtFileWriterErrorCode.REQUIRED_VALUE); + + try { + // warn: 这里用户需要配一个目录 + File dir = new File(path); + if (dir.isFile()) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + if (!dir.exists()) { + boolean createdOk = dir.mkdirs(); + if (!createdOk) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.CONFIG_INVALID_EXCEPTION, + String.format("您指定的文件路径 : [%s] 创建失败.", + path)); + } + } + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限创建文件路径 : [%s] ", path), se); + } + } + + @Override + public void prepare() { + String path = this.writerSliceConfig.getString(Key.PATH); + String fileName = this.writerSliceConfig + 
.getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + String writeMode = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.WRITE_MODE); + // truncate option handler + if ("truncate".equals(writeMode)) { + LOG.info(String.format( + "由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的内容", + path, fileName)); + File dir = new File(path); + // warn:需要判断文件是否存在,不存在时,不能删除 + try { + if (dir.exists()) { + // warn:不要使用FileUtils.deleteQuietly(dir); + FilenameFilter filter = new PrefixFileFilter(fileName); + File[] filesWithFileNamePrefix = dir.listFiles(filter); + for (File eachFile : filesWithFileNamePrefix) { + LOG.info(String.format("delete file [%s].", + eachFile.getName())); + FileUtils.forceDelete(eachFile); + } + // FileUtils.cleanDirectory(dir); + } + } catch (NullPointerException npe) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.Write_FILE_ERROR, + String.format("您配置的目录清空时出现空指针异常 : [%s]", + path), npe); + } catch (IllegalArgumentException iae) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您配置的目录参数异常 : [%s]", path)); + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限查看目录 : [%s]", path)); + } catch (IOException e) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.Write_FILE_ERROR, + String.format("无法清空目录 : [%s]", path), e); + } + } else if ("append".equals(writeMode)) { + LOG.info(String + .format("由于您配置了writeMode append, 写入前不做清理工作, [%s] 目录下写入相应文件名前缀 [%s] 的文件", + path, fileName)); + } else if ("nonConflict".equals(writeMode)) { + LOG.info(String.format( + "由于您配置了writeMode nonConflict, 开始检查 [%s] 下面的内容", path)); + // warn: check two times about exists, mkdirs + File dir = new File(path); + try { + if (dir.exists()) { + if (dir.isFile()) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.", + path)); + } + // fileName is not null + FilenameFilter filter = new PrefixFileFilter(fileName); + File[] filesWithFileNamePrefix = dir.listFiles(filter); + if (filesWithFileNamePrefix.length > 0) { + List allFiles = new ArrayList(); + for (File eachFile : filesWithFileNamePrefix) { + allFiles.add(eachFile.getName()); + } + LOG.error(String.format("冲突文件列表为: [%s]", + StringUtils.join(allFiles, ","))); + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "您配置的path: [%s] 目录不为空, 下面存在其他文件或文件夹.", + path)); + } + } else { + boolean createdOk = dir.mkdirs(); + if (!createdOk) { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.CONFIG_INVALID_EXCEPTION, + String.format( + "您指定的文件路径 : [%s] 创建失败.", + path)); + } + } + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限查看目录 : [%s]", path)); + } + } else { + throw DataXException + .asDataXException( + TxtFileWriterErrorCode.ILLEGAL_VALUE, + String.format( + "仅支持 truncate, append, nonConflict 三种模式, 不支持您配置的 writeMode 模式 : [%s]", + writeMode)); + } + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + + @Override + public List split(int mandatoryNumber) { + LOG.info("begin do split..."); + List writerSplitConfigs = new ArrayList(); + String filePrefix = this.writerSliceConfig + 
.getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + + Set allFiles = new HashSet(); + String path = null; + try { + path = this.writerSliceConfig.getString(Key.PATH); + File dir = new File(path); + allFiles.addAll(Arrays.asList(dir.list())); + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限查看目录 : [%s]", path)); + } + + String fileSuffix; + for (int i = 0; i < mandatoryNumber; i++) { + // handle same file name + + Configuration splitedTaskConfig = this.writerSliceConfig + .clone(); + + String fullFileName = null; + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s__%s", filePrefix, fileSuffix); + while (allFiles.contains(fullFileName)) { + fileSuffix = UUID.randomUUID().toString().replace('-', '_'); + fullFileName = String.format("%s__%s", filePrefix, + fileSuffix); + } + allFiles.add(fullFileName); + + splitedTaskConfig + .set(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME, + fullFileName); + + LOG.info(String.format("splited write file name:[%s]", + fullFileName)); + + writerSplitConfigs.add(splitedTaskConfig); + } + LOG.info("end do split."); + return writerSplitConfigs; + } + + } + + public static class Task extends Writer.Task { + private static final Logger LOG = LoggerFactory.getLogger(Task.class); + + private Configuration writerSliceConfig; + + private String path; + + private String fileName; + + @Override + public void init() { + this.writerSliceConfig = this.getPluginJobConf(); + this.path = this.writerSliceConfig.getString(Key.PATH); + this.fileName = this.writerSliceConfig + .getString(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME); + } + + @Override + public void prepare() { + + } + + @Override + public void startWrite(RecordReceiver lineReceiver) { + LOG.info("begin do write..."); + String fileFullPath = this.buildFilePath(); + LOG.info(String.format("write to file : [%s]", fileFullPath)); + + OutputStream outputStream = null; + try { + File newFile = new File(fileFullPath); + newFile.createNewFile(); + outputStream = new FileOutputStream(newFile); + UnstructuredStorageWriterUtil.writeToStream(lineReceiver, + outputStream, this.writerSliceConfig, this.fileName, + this.getTaskPluginCollector()); + } catch (SecurityException se) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.SECURITY_NOT_ENOUGH, + String.format("您没有权限创建文件 : [%s]", this.fileName)); + } catch (IOException ioe) { + throw DataXException.asDataXException( + TxtFileWriterErrorCode.Write_FILE_IO_ERROR, + String.format("无法创建待写文件 : [%s]", this.fileName), ioe); + } finally { + IOUtils.closeQuietly(outputStream); + } + LOG.info("end do write"); + } + + private String buildFilePath() { + boolean isEndWithSeparator = false; + switch (IOUtils.DIR_SEPARATOR) { + case IOUtils.DIR_SEPARATOR_UNIX: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR)); + break; + case IOUtils.DIR_SEPARATOR_WINDOWS: + isEndWithSeparator = this.path.endsWith(String + .valueOf(IOUtils.DIR_SEPARATOR_WINDOWS)); + break; + default: + break; + } + if (!isEndWithSeparator) { + this.path = this.path + IOUtils.DIR_SEPARATOR; + } + return String.format("%s%s", this.path, this.fileName); + } + + @Override + public void post() { + + } + + @Override + public void destroy() { + + } + } +} diff --git a/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriterErrorCode.java 
b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriterErrorCode.java new file mode 100755 index 0000000000..0e3a6fdd5d --- /dev/null +++ b/txtfilewriter/src/main/java/com/alibaba/datax/plugin/writer/txtfilewriter/TxtFileWriterErrorCode.java @@ -0,0 +1,41 @@ +package com.alibaba.datax.plugin.writer.txtfilewriter; + +import com.alibaba.datax.common.spi.ErrorCode; + +/** + * Created by haiwei.luo on 14-9-17. + */ +public enum TxtFileWriterErrorCode implements ErrorCode { + + CONFIG_INVALID_EXCEPTION("TxtFileWriter-00", "您的参数配置错误."), + REQUIRED_VALUE("TxtFileWriter-01", "您缺失了必须填写的参数值."), + ILLEGAL_VALUE("TxtFileWriter-02", "您填写的参数值不合法."), + Write_FILE_ERROR("TxtFileWriter-03", "您配置的目标文件在写入时异常."), + Write_FILE_IO_ERROR("TxtFileWriter-04", "您配置的文件在写入时出现IO异常."), + SECURITY_NOT_ENOUGH("TxtFileWriter-05", "您缺少权限执行相应的文件写入操作."); + + private final String code; + private final String description; + + private TxtFileWriterErrorCode(String code, String description) { + this.code = code; + this.description = description; + } + + @Override + public String getCode() { + return this.code; + } + + @Override + public String getDescription() { + return this.description; + } + + @Override + public String toString() { + return String.format("Code:[%s], Description:[%s].", this.code, + this.description); + } + +} diff --git a/txtfilewriter/src/main/resources/plugin.json b/txtfilewriter/src/main/resources/plugin.json new file mode 100755 index 0000000000..cf4ca024c9 --- /dev/null +++ b/txtfilewriter/src/main/resources/plugin.json @@ -0,0 +1,6 @@ +{ + "name": "txtfilewriter", + "class": "com.alibaba.datax.plugin.writer.txtfilewriter.TxtFileWriter", + "description": "useScene: test. mechanism: use datax framework to transport data to txt file. 
warn: The more you know about the data, the less problems you encounter.", + "developer": "alibaba" +} \ No newline at end of file diff --git a/txtfilewriter/src/main/resources/plugin_job_template.json b/txtfilewriter/src/main/resources/plugin_job_template.json new file mode 100644 index 0000000000..62d075bbd3 --- /dev/null +++ b/txtfilewriter/src/main/resources/plugin_job_template.json @@ -0,0 +1,10 @@ +{ + "name": "txtfilewriter", + "parameter": { + "path": "", + "fileName": "", + "writeMode": "", + "fieldDelimiter":"", + "dateFormat": "" + } +} \ No newline at end of file diff --git a/userGuid.md b/userGuid.md new file mode 100644 index 0000000000..5b17e06882 --- /dev/null +++ b/userGuid.md @@ -0,0 +1,183 @@ +# DataX + +DataX 是阿里巴巴集团内被广泛使用的离线数据同步工具/平台,实现包括 MySQL、SQL Server、Oracle、PostgreSQL、HDFS、Hive、HBase、OTS、ODPS 等各种异构数据源之间高效的数据同步功能。 + +# Features + +DataX本身作为数据同步框架,将不同数据源的同步抽象为从源头数据源读取数据的Reader插件,以及向目标端写入数据的Writer插件,理论上DataX框架可以支持任意数据源类型的数据同步工作。同时DataX插件体系作为一套生态系统, 每接入一套新数据源该新加入的数据源即可实现和现有的数据源互通。 + +# System Requirements + +- Linux +- [JDK(1.6以上,推荐1.6) ](http://www.oracle.com/technetwork/cn/java/javase/downloads/index.html) +- [Python(推荐Python2.6.X) ](https://www.python.org/downloads/) +- [Apache Maven 3.x](https://maven.apache.org/download.cgi) (Compile DataX) + +# Quick Start + +* 工具部署 + + * 方法一、直接下载DataX工具包:[DataX](https://github.com/alibaba/DataX) + + 下载后解压至本地某个目录,进入bin目录,即可运行同步作业: + + ``` shell + $ cd {YOUR_DATAX_HOME}/bin + $ python datax.py {YOUR_JOB.json} + ``` + + * 方法二、下载DataX源码,自己编译:[DataX源码](https://github.com/alibaba/DataX) + + (1)、下载DataX源码: + + ``` shell + $ git clone git@github.com:alibaba/DataX.git + ``` + + (2)、通过maven打包: + + ``` shell + $ cd {DataX_source_code_home} + $ mvn -U clean package assembly:assembly -Dmaven.test.skip=true + ``` + + 打包成功,日志显示如下: + + ``` + [INFO] BUILD SUCCESS + [INFO] ----------------------------------------------------------------- + [INFO] Total time: 08:12 min + [INFO] Finished at: 2015-12-13T16:26:48+08:00 + [INFO] Final Memory: 133M/960M + [INFO] ----------------------------------------------------------------- + ``` + + 打包成功后的DataX包位于 {DataX_source_code_home}/target/datax/datax/ ,结构如下: + + ``` shell + $ cd {DataX_source_code_home} + $ ls ./target/datax/datax/ + bin conf job lib log log_perf plugin script tmp + ``` + + +* 配置示例:从stream读取数据并打印到控制台 + + * 第一步、创建创业的配置文件(json格式) + + 可以通过命令查看配置模板: python datax.py -r {YOUR_READER} -w {YOUR_WRITER} + + ``` shell + $ cd {YOUR_DATAX_HOME}/bin + $ python datax.py -r streamreader -w streamwriter + DataX (UNKNOWN_DATAX_VERSION), From Alibaba ! + Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved. + Please refer to the streamreader document: + https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md + + Please refer to the streamwriter document: + https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md + + Please save the following configuration as a json file and use + python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json + to run the job. 
+ + { + "job": { + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "column": [], + "sliceRecordCount": "" + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "encoding": "", + "print": true + } + } + } + ], + "setting": { + "speed": { + "channel": "" + } + } + } + } + ``` + + 根据模板配置json如下: + + ``` json + #stream2stream.json + { + "job": { + "content": [ + { + "reader": { + "name": "streamreader", + "parameter": { + "sliceRecordCount": 10, + "column": [ + { + "type": "long", + "value": "10" + }, + { + "type": "string", + "value": "hello,你好,世界-DataX" + } + ] + } + }, + "writer": { + "name": "streamwriter", + "parameter": { + "encoding": "UTF-8", + "print": true + } + } + } + ], + "setting": { + "speed": { + "channel": 5 + } + } + } + } + ``` + + * 第二步:启动DataX + + ``` shell + $ cd {YOUR_DATAX_DIR_BIN} + $ python datax.py ./stream2stream.json + ``` + + 同步结束,显示日志如下: + + ``` shell + ... + 2015-12-17 11:20:25.263 [job-0] INFO JobContainer - + 任务启动时刻 : 2015-12-17 11:20:15 + 任务结束时刻 : 2015-12-17 11:20:25 + 任务总计耗时 : 10s + 任务平均流量 : 205B/s + 记录写入速度 : 5rec/s + 读出记录总数 : 50 + 读写失败总数 : 0 + ``` + +# Contact us + +Google Groups: [DataX-user](https://github.com/alibaba/DataX) + + + +