Apache Griffin

Apache Griffin is a model-driven data quality solution for modern data systems. It provides a standard process to define, execute, and report data quality measures, along with a unified dashboard across multiple data systems.

Getting Started

You can try Griffin in Docker by following the Docker guide.

To run Griffin locally, follow the instructions below.

Prerequisites

You need to install the following items:

JDK (1.8 or later).
MySQL.
PostgreSQL.
npm (version 6.0.0 or later).
Hadoop (2.6.0 or later); you can get some help here.
Spark (version 1.6.x; Griffin does not currently support 2.0.x). If you want to install a pseudo-distributed/single-node cluster, you can get some help here.
Hive (version 1.2.1 or later); you can get some help here. Make sure that your Spark cluster can access your HiveContext.
Livy; you can get some help here. Griffin needs to schedule Spark jobs from the server, and it uses Livy to submit them. To work around some Livy issues with HiveContext, download the following 3 files and put them into HDFS.
```
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
```
Elasticsearch. Elasticsearch works as the metrics collector: Griffin writes metrics to it, and the default UI reads metrics from it. You can use your own approach as well.
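
Before going further, it can help to see which of the command-line prerequisites are already on your PATH. The sketch below is only a convenience check: the command names (java, mysql, psql, npm, hadoop, spark-submit, hive) are assumptions about how each tool is usually invoked, not names taken from the Griffin documentation, and the check does not verify versions.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed command names for the CLI-based prerequisites.
# Livy and Elasticsearch run as services and are not covered by this check.
missing_tools java mysql psql npm hadoop spark-submit hive
```

An empty output means every listed command was found; any name printed still needs to be installed or added to PATH.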

Configuration

Create the database 'quartz' in MySQL:

mysql -u username -e "create database quartz" -p

Initialize the quartz tables in MySQL with service/src/main/resources/Init_quartz.sql:

mysql -u username -p quartz < service/src/main/resources/Init_quartz.sql

You should also adjust the following Griffin configuration files for your environment.

service/src/main/resources/application.properties

# jpa
spring.datasource.url = jdbc:postgresql://<your IP>:5432/quartz?autoReconnect=true&useSSL=false
spring.datasource.username = <user name>
spring.datasource.password = <password>
spring.jpa.generate-ddl=true
spring.datasource.driverClassName = org.postgresql.Driver
spring.jpa.show-sql = true

# hive metastore
hive.metastore.uris = thrift://<your IP>:9083
# hive database name, default is "default"
hive.metastore.dbname = <hive database name>

# external properties directory location, ignore it if not required
external.config.location =

# login strategy, default is "default"
login.strategy = <default or ldap>

# ldap properties, ignore them if ldap is not enabled
ldap.url = ldap://hostname:port
ldap.email = @example.com
ldap.searchBase = DC=org,DC=example
ldap.searchPattern = (sAMAccountName={0})

# hdfs, ignore it if you do not need predicate job
fs.defaultFS = hdfs://<hdfs-default-name>

# elasticsearch
elasticsearch.host = <your IP>
elasticsearch.port = <your elasticsearch rest port>
# authentication properties, uncomment if basic authentication is enabled
# elasticsearch.user = user
# elasticsearch.password = password
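
Since the values above are written as <...> placeholders, one easy mistake is to leave one of them unreplaced. A minimal sketch of a check for that, assuming that every placeholder keeps the angle-bracket form used here; the function name and the sample values are made up for illustration:

```shell
#!/bin/sh
# List the lines of a file that still contain a <placeholder> token.
# Exit status follows grep: 0 if any placeholder is found, 1 if none.
unfilled_placeholders() {
  grep -n '<[^>]*>' "$1"
}

# Demonstration on a throwaway file with made-up values.
sample=$(mktemp)
cat > "$sample" <<'EOF'
spring.datasource.username = griffin
spring.datasource.password = <password>
EOF
unfilled_placeholders "$sample"   # prints: 2:spring.datasource.password = <password>
rm -f "$sample"
```

Running the function against service/src/main/resources/application.properties after editing shows any lines that still need attention; the same check can be applied to any other configuration file you edit.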

measure/src/main/resources/env.json

"persist": [
    ...
    {
        "type": "http",
        "config": {
            "method": "post",
            "api": "http://<your ES IP>:<ES rest port>/griffin/accuracy"
        }
    }
]

Put the modified env.json file into HDFS.
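
The upload step can be sketched as follows. The HDFS directory /griffin/conf is a hypothetical value for the location where env.json is stored, and the function name is made up; the sketch prints the hdfs commands rather than running them, so you can review the paths first.

```shell
#!/bin/sh
# Hypothetical defaults; replace them with your own paths.
ENV_JSON="${ENV_JSON:-measure/src/main/resources/env.json}"   # local file edited above
GRIFFIN_ENV_PATH="${GRIFFIN_ENV_PATH:-/griffin/conf}"         # assumed HDFS directory

# Print the upload commands instead of running them, so they can be reviewed first.
env_upload_cmds() {
  printf 'hdfs dfs -mkdir -p %s\n' "$GRIFFIN_ENV_PATH"
  printf 'hdfs dfs -put -f %s %s/env.json\n' "$ENV_JSON" "$GRIFFIN_ENV_PATH"
}

env_upload_cmds
# To run the printed commands: env_upload_cmds | sh
```

Here -mkdir -p creates the target directory if it is missing and -put -f overwrites an existing copy, which is convenient when you re-upload after changing the file.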

service/src/main/resources/sparkJob.properties

sparkJob.file = hdfs://<griffin measure path>/griffin-measure.jar
sparkJob.args_1 = hdfs://<griffin env path>/env.json

sparkJob.jars = hdfs://<datanucleus path>/spark-avro_2.11-2.0.1.jar\
    hdfs://<datanucleus path>/datanucleus-api-jdo-3.2.6.jar\
    hdfs://<datanucleus path>/datanucleus-core-3.2.10.jar\
    hdfs://<datanucleus path>/datanucleus-rdbms-3.2.9.jar
    
spark.yarn.dist.files = hdfs:///<spark conf path>/hive-site.xml

livy.uri = http://<your IP>:8998/batches
spark.uri = http://<your IP>:8088

<griffin measure path> is the location where you should put the jar file of the measure module.
<griffin env path> is the location where you should put the env.json file.
<datanucleus path> is the location where you should put the 3 jar files needed for Livy, and the Spark Avro jar file if you need it.
<spark conf path> is the location of the Spark conf directory.

Build and Run

Build the whole project and deploy it (npm must be installed).

mvn clean install

Put the jar file of the measure module into <griffin measure path> in HDFS:

cp measure/target/measure-<version>-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
hdfs dfs -put measure/target/griffin-measure.jar <griffin measure path>/

After all environment services have started, start the server:

java -jar service/target/service.jar

After a few seconds, you can visit the default Griffin UI (Spring Boot listens on port 8080 by default).

http://<your IP>:8080
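
If you start the server from a script, "a few seconds" can be replaced with a small polling loop. The sketch below assumes curl is installed; the function name and the attempt count in the usage comment are illustrative choices, not part of Griffin.

```shell
#!/bin/sh
# Poll a URL once per second until it answers or the attempts run out.
# $1 = URL, $2 = number of attempts. Returns 0 on success, 1 otherwise.
wait_for_http() {
  wfh_url=$1
  wfh_attempts=$2
  while [ "$wfh_attempts" -gt 0 ]; do
    # -s: silent, -f: treat HTTP errors (>= 400) as failures
    if curl -sf -o /dev/null --max-time 2 "$wfh_url"; then
      return 0
    fi
    wfh_attempts=$((wfh_attempts - 1))
    sleep 1
  done
  return 1
}

# Usage (not run here): wait up to about 30 seconds for the UI on localhost.
# wait_for_http "http://localhost:8080" 30 && echo "Griffin UI is up"
```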

You can use the UI by following the steps here.

Note: The front-end UI is still under development; only some basic features are currently available.

Build and Debug

If you want to develop Griffin, please follow this document to skip the complex environment setup.

Community

You can contact us via email: [email protected]

You can also subscribe to this mailing list by sending an email here.

You can access our JIRA issues page here.

Contributing

See the Contributing Guide for details on how to contribute code, documentation, etc.

References

Home Page
Wiki
Documents:
- Measure
- Service
- UI
- Docker usage
- Postman API

