
Commit

refactoring
issue #9
rsoika committed Sep 18, 2017
1 parent c44ef48 commit c287c0c
Showing 8 changed files with 134 additions and 101 deletions.
40 changes: 12 additions & 28 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,40 +1,24 @@
# imixs-archive
# Imixs-Archive

Imixs-Archive is a sub project of Imixs-Workflow providing a solution for long-term archiving of business data.
Imixs-Archive can be combined with the Workflow Suite Imixs-Office-Workflow as well as with individual business applications based on the Imixs-Workflow engine. The archive data is transferred to a Hadoop cluster. The Imixs-Archive project provides various functions for the exchange of data with a Hadoop cluster.
Imixs-Archive is an open source project designed to provide a transparent and sustaining solution for long-term archiving of business data. In this context, business data means not only documents but also the comprehensible documentation of business processes.
Imixs-Archive is a sub-project of the Human-Centric Workflow engine [Imixs-Workflow](http://www.imixs.org), which provides a powerful platform for the description and execution of business processes.

The goal of this project is to provide an open and transparent technology for long-term archiving of business data based on the Imixs-Workflow project.
Imixs-Archive provides an API for a transparent data exchange with any kind of archive system. One of these systems supported by Imixs-Archive is [Apache Hadoop](http://hadoop.apache.org/).


## Hadoop

Imixs-Archive is based on the [Hadoop technology](http://hadoop.apache.org/) and provides submodules to plug Hadoop into the Imixs-Workflow engine.

* [Imixs-Archive-Hadoop-JCA](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-hadoop-jca)
* [Imixs-Archive-Hadoop-Client](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-hadoop-client)


## Docker

The [Imixs-Docker/hadoop project](https://github.com/imixs/imixs-docker/tree/master/hadoop) provides a Docker image to run Hadoop in a Docker container. This container can be used to test Hadoop in combination with Imixs-Archive. **NOTE:** The Imixs-Docker/hadoop container is for test purposes only. The container should only run in a system environment protected from external access.

## The API

* [Imixs-Archive-Test](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-test)
The sub-module Imixs-Archive-API provides the core functionality and interfaces to generate, store and retrieve business data in an archive system. This API is platform independent and based on the Imixs-Workflow API.

## Hadoop

The sub-module Imixs-Archive-Hadoop provides an adapter for the [Apache Hadoop Filesystem (HDFS)](http://hadoop.apache.org/). The adapter is based on HttpFS, which can be used to transfer data between different versions of Hadoop clusters. HttpFS allows access to data in a clustered HDFS behind a firewall, which enables a restricted and secured archive architecture.
As HttpFS is based on REST, this component does not require any additional Hadoop libraries. In addition, HttpFS has built-in security supporting Hadoop pseudo authentication, HTTP SPNEGO Kerberos and other pluggable authentication mechanisms, to be used depending on the target architecture.
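As a rough illustration of this REST-based approach, the following sketch builds a WebHDFS-style URL and uploads a document with a plain HTTP `PUT`. The host, port, user name and the single-request `data=true` handling are assumptions for a simple HttpFS setup — this is not the actual Imixs-Archive adapter code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch of an HttpFS/WebHDFS client call.
 * Host, port and user name are placeholders.
 */
class HttpFSClient {

    private final String baseUrl; // e.g. http://my-hadoop-cluster.local:14000
    private final String userName;

    HttpFSClient(String baseUrl, String userName) {
        this.baseUrl = baseUrl;
        this.userName = userName;
    }

    /** Builds the REST URL for a WebHDFS operation such as CREATE or OPEN. */
    String buildUrl(String hdfsPath, String operation) {
        return baseUrl + "/webhdfs/v1" + hdfsPath
                + "?op=" + operation + "&user.name=" + userName;
    }

    /** Uploads the given bytes via HTTP PUT (op=CREATE). */
    int create(String hdfsPath, byte[] data) throws IOException {
        URL url = new URL(buildUrl(hdfsPath, "CREATE") + "&data=true");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("PUT");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream out = con.getOutputStream()) {
            out.write(data);
        }
        return con.getResponseCode();
    }
}
```

Because the protocol is plain HTTP, no Hadoop client libraries are needed on the workflow side.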


## Docker

# Concepts

Imixs-Archive is mainly based on a 'Workflow Push' strategy, where the archive process is directly coupled to the workflow process. This means that the archive process can be controlled by the workflow model. The Imixs-Archive plug-in communicates with the Hadoop cluster via the Hadoop REST API. During the archive process, the checksum computed by Hadoop is immediately stored into the source workitem. This is a tightly coupled archive strategy which guarantees a transactionally secure archive process.


## Data Consistency

Imixs-Archive guarantees the consistency of the stored data by calculating an MD5 checksum for each document stored into the archive. The checksum is part of the access-URL returned by the archive system after a document was stored. If the access-URL specified later by the client to read the data does not match, an error code is returned.
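A minimal sketch of such a checksum verification, assuming the MD5 checksum forms the last path segment of the access-URL (the actual URL layout of the archive system may differ):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: compute an MD5 checksum and verify it against the
 *  checksum part of a (hypothetical) access-URL. */
class ChecksumUtil {

    /** Returns the MD5 digest as a 32-character lower-case hex string. */
    static String md5(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    /** True if the checksum embedded in the access-URL matches the data. */
    static boolean verify(String accessUrl, byte[] content) {
        // assumption: the checksum is the last path segment of the access-URL
        String expected = accessUrl.substring(accessUrl.lastIndexOf('/') + 1);
        return expected.equalsIgnoreCase(md5(content));
    }
}
```

A client reading a document back can recompute the digest and compare it to the URL segment before trusting the content.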

The [Imixs-Docker/hadoop project](https://github.com/imixs/imixs-docker/tree/master/hadoop) provides a Docker image to run Hadoop in a Docker container. This container can be used to test Imixs-Archive in combination with a Hadoop single-node cluster.
**NOTE:** The Imixs-Docker/hadoop container is for test purposes only. The container should only run in a system environment protected from external access.

## Access Control
The access to data written into the Imixs-Archive should ideally be managed completely by the [Imixs-Workflow](http://www.imixs.org) engine. Imixs-Workflow supports a multi-level security model that offers great flexibility in controlling the access to all parts of a workitem.

43 changes: 43 additions & 0 deletions imixs-archive-api/README.md
@@ -0,0 +1,43 @@
# Imixs-Archive API

The sub-module Imixs-Archive-API provides the core functionality and interfaces to generate, store and retrieve business data in an archive system. This API is platform independent and based on the Imixs-Workflow API.


## Concepts

Imixs-Archive is mainly based on a 'Workflow Push' strategy, where the archive process is directly coupled to the workflow process. This means that the archive process can be controlled by the workflow model.


### The Snapshot-Architecture

Imixs-Workflow provides a built-in snapshot mechanism to archive the content of a workitem into a snapshot-workitem.
A snapshot workitem is a copy of the current workitem (origin-workitem) including all the file content of attached files. The origin-workitem only holds a reference ($snapshotID) to the snapshot-workitem to load attached file data.
See the Snapshot-Concept for further details.


Attached files will be linked from the snapshot-workitem to the origin-workitem.

The snapshot process includes the following stages:

1. create a copy of the origin workitem instance
2. compute a snapshot $uniqueId based on the origin workitem's $uniqueId, suffixed with a timestamp
3. change the type of the snapshot-workitem by adding the prefix 'archive-'
4. if an old snapshot already exists, compare its files with the current $file list and, if necessary, apply the stored file content to the new snapshot
5. remove the file content from the origin-workitem
6. store the snapshot $uniqueId into the origin-workitem as a reference ($snapshotID)
7. remove deprecated snapshots

A snapshot-workitem holds a reference to the origin-workitem through its own $uniqueId, which is
always the $uniqueId of the origin-workitem suffixed with a timestamp.
During snapshot creation, the snapshot $uniqueId is stored into the origin-workitem.
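The $uniqueId scheme described above can be sketched as a small helper. The method names are illustrative, not the actual Imixs-Archive API:

```java
/** Sketch of the snapshot $uniqueId scheme: the origin workitem's
 *  $uniqueId suffixed with a timestamp. */
class SnapshotIdUtil {

    /** e.g. "8337f0a2-1b2c" + "-" + 1505721600000 */
    static String computeSnapshotId(String originUniqueId, long timestampMillis) {
        return originUniqueId + "-" + timestampMillis;
    }

    /** Recovers the origin $uniqueId by stripping the timestamp suffix. */
    static String originIdOf(String snapshotId) {
        return snapshotId.substring(0, snapshotId.lastIndexOf('-'));
    }
}
```

Because the timestamp suffix is strictly increasing, snapshots of the same workitem sort chronologically by id, and the origin reference is always recoverable.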

The ArchiveLocalPlugin implements the ObserverPlugin interface and is tied to the transaction context of the Imixs-Workflow engine. The process of creating a new snapshot-workitem is transparently aware of the current transaction and will automatically roll back any snapshot-workitems in case of an EJB exception.



### The Access Control
The access to archive data written into the Imixs-Archive is controlled completely by the [Imixs-Workflow engine](http://www.imixs.org). Imixs-Workflow supports a multi-level security model that offers great flexibility in controlling the access to all parts of a workitem.




47 changes: 47 additions & 0 deletions imixs-archive-api/pom.xml
@@ -0,0 +1,47 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.imixs.workflow</groupId>
		<artifactId>imixs-archive</artifactId>
		<version>0.0.2-SNAPSHOT</version>
	</parent>
	<artifactId>imixs-archive-api</artifactId>
	<name>Imixs-Archive API</name>

	<dependencies>
		<dependency>
			<groupId>org.imixs.workflow</groupId>
			<artifactId>imixs-workflow-core</artifactId>
		</dependency>
		<dependency>
			<groupId>org.imixs.workflow</groupId>
			<artifactId>imixs-workflow-engine</artifactId>
		</dependency>

		<!-- Java EE dependencies -->
		<dependency>
			<groupId>javax</groupId>
			<artifactId>javaee-api</artifactId>
			<version>7.0</version>
			<scope>provided</scope>
		</dependency>

		<!-- JSON Parser -->
		<dependency>
			<groupId>javax.json</groupId>
			<artifactId>javax.json-api</artifactId>
			<version>1.1</version>
			<scope>provided</scope>
		</dependency>

		<!-- JUnit Tests -->
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.8.1</version>
			<scope>test</scope>
		</dependency>
	</dependencies>

</project>
40 changes: 12 additions & 28 deletions imixs-archive-hadoop/README.md
@@ -1,50 +1,34 @@
# Imixs-Archive-Hadoop

The Imixs-Archive-Hadoop project provides an API to store workitems into a Hadoop cluster. Imixs-Archive-Hadoop uses the [Imixs-JCA-Hadoop Connector](https://github.com/imixs/imixs-jca/tree/master/imixs-jca-hadoop)
The Imixs-Archive-Hadoop project provides an API to ingest Imixs-Workflow data into a Hadoop Cluster. The Imixs-Hadoop Scheduler Service automatically transfers the so-called snapshot-workitems into a Hadoop cluster.
A snapshot-workitem is a core concept of the Imixs-Workflow engine and can be configured through the Imixs-Workflow Plug-In API.

Imixs-Archive-Hadoop communicates with a Hadoop cluster via the [WebHDFS REST API](https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html).

## Synchronous Mode Push

This implementation follows the architecture of a synchronous push mode. With this strategy, the archive process is directly coupled to the workflow process. This means that the archive process can be controlled by the workflow model. The implementation is realized by an Imixs plug-in which is directly controlled by the engine. The plug-in accesses the Hadoop cluster via the Hadoop REST API. In this scenario the plug-in can store archive data, like the checksum, immediately into the workitem. This is a tightly coupled archive strategy.

### Pros

* The archive process can be directly controlled by the workflow engine (via a plug-in)
* The data between Hadoop and Imixs-Workflow is in sync at any time
* A workitem can store archive information synchronously (e.g. a checksum)

### Cons

* The process is time consuming and slows down the overall performance of the workflow engine
* The process is memory consuming
* The process has to be embedded into the running transaction, which increases the complexity
* Hadoop must be accessible via the internet, and additional security must be implemented on both sides

## HDFS Schema Design
Hadoop's 'Schema-on-Read' model does not impose any requirements when loading data into Hadoop. Nevertheless, Imixs-Archive-Hadoop provides a structured and organized data repository. All workflow data ingested into Hadoop is partitioned by the creation YEAR and MONTH of the process instance into a directory hierarchy:

    /data/[workflow-instance]/YEAR/MONTH/[SNAPSHOT-UNIQUEID].xml

See the following example:

    /data/company/2017/01/333444.1111.2222-21555122.xml

All data is stored in a semistructured XML format based on the Imixs-XML schema.
With this schema design it is
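The YEAR/MONTH partitioning scheme above can be sketched as a small helper (illustrative only, not the actual Imixs-Archive-Hadoop code):

```java
import java.time.LocalDate;

/** Sketch of the HDFS partitioning scheme:
 *  /data/[workflow-instance]/YEAR/MONTH/[SNAPSHOT-UNIQUEID].xml */
class PartitionPath {

    /** Builds the HDFS target path for a snapshot created on the given date. */
    static String pathFor(String instance, LocalDate created, String snapshotId) {
        return "/data/" + instance
                + "/" + created.getYear()
                + "/" + String.format("%02d", created.getMonthValue()) // zero-padded month
                + "/" + snapshotId + ".xml";
    }
}
```

Partitioning by creation date keeps directory sizes bounded and lets batch jobs scan a single month without touching the rest of the archive.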


# Implementation

The service is implemented as a stateful session EJB together with a plug-in. The stateful session EJB synchronizes the transaction and decides in the afterCommit(boolean) method whether to commit or roll back the changes in Hadoop. This approach is a little complex, time and memory consuming, but has the advantage that the workitem is always in sync with the data in the Hadoop cluster.
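The commit/rollback decision can be illustrated with a plain-Java sketch. In the real service this logic would live in a stateful session EJB — typically via the `javax.ejb.SessionSynchronization` callbacks — but here the container callback is simulated so the decision is visible without a container; class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the transaction-aware archive logic:
 *  writes are buffered during the transaction and only pushed
 *  to Hadoop when the transaction commits. */
class TransactionalArchiver {

    private final List<String> pendingWrites = new ArrayList<>();
    private final List<String> archived = new ArrayList<>();

    /** Called during the workflow transaction: buffer the snapshot only. */
    void write(String snapshotId) {
        pendingWrites.add(snapshotId);
    }

    /** Simulated container callback: flush on commit, discard on rollback. */
    void afterCompletion(boolean committed) {
        if (committed) {
            archived.addAll(pendingWrites); // push the buffered snapshots to Hadoop
        }
        pendingWrites.clear(); // rolled-back snapshots are simply dropped
    }

    List<String> archivedSnapshots() {
        return archived;
    }
}
```

Buffering until after completion is what keeps the workitem and the Hadoop cluster consistent: a failed workflow transaction leaves no orphaned archive entries.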
## Hadoop REST API
The ingestion mechanism is based on the Hadoop REST API 'HttpFS'.

## CDI Support

The HadoopService and the Archive plug-in support CDI. A beans.xml is located in the META-INF folder. Make sure that the client library is visible to your EJB modules. See the section 'Using shared libraries' in the [Imixs Deployment Guide](http://www.imixs.org/doc/deployment/deployment_guide.html).



## HDFSWebClient

The HDFSWebClient code is based on the work of [zxs/webhdfs-java-client](https://github.com/zxs/webhdfs-java-client).

## JUnit Tests

The library can be tested with a single-node Hadoop cluster.
For all integration tests, just start the Docker hadoop container with the following command:

docker run --name="hadoop" -d -h my-hadoop-cluster.local -p 50070:50070 -p 50075:50075 imixs/hadoop

Make sure that the hostname 'my-hadoop-cluster.local' is resolvable in your local test environment (e.g. via your hosts file).

See the [Imixs-Docker hadoop project](https://github.com/imixs/imixs-docker/tree/master/hadoop) for more details.


@@ -27,21 +27,26 @@
* The snapshot process includes the following stages:
*
* <ol>
* <li>create a copy of the current workitem
* <li>compute a snapshot $uniqueId containing a timestamp
* <li>change the type with the prefix 'archive-'
* <li>create a copy of the origin workitem instance
* <li>compute a snapshot $uniqueId based on the origin workitem suffixed with a timestamp
* <li>change the type of the snapshot-workitem with the prefix 'archive-'
* <li>if an old snapshot already exists, its files are compared with the current
* $file list and, if necessary, the stored file content is applied to the new snapshot
* <li>remove file content from the origin-workitem
* <li>store the snapshot uniqueId into the origin-workitem ($snapshotID)
* <li>remove the file content from the origin-workitem
* <li>store the snapshot uniqueId into the origin-workitem as a reference ($snapshotID)
* <li>remove deprecated snapshots
* </ol>
*
* The Plugin implements the ObserverPlugin interface
* A snapshot workitem holds a reference to the origin workitem by its own $uniqueId which is
* always the $uniqueId from the origin workitem suffixed with a timestamp.
* During the snapshot creation the snapshot $uniqueId is stored into the origin workitem.
*
* The ArchiveLocalPlugin implements the ObserverPlugin interface
*
* <p>
* Note: The ArchiveLocalPlugin replaces the DMSPlugin from the imixs-marty
* project and provides a migration mechanism for old BlobWorkitems. The old
* Note: The ArchiveLocalPlugin replaces the BlobWorkitems mechanism which was earlier
* part of the DMSPlugin from the imixs-marty project. The plugin provides a
* migration mechanism for old BlobWorkitems. The old
* BlobWorkitems will not be deleted.
*
* @version 1.0
5 changes: 4 additions & 1 deletion imixs-archive-ocr/README.md
@@ -1,4 +1,4 @@
# imixs-archive OCR
# Imixs-Archive-OCR

Imixs-Archive-OCR is a sub-project of Imixs-Archive providing a solution for OCR scans of documents.

@@ -7,3 +7,6 @@ The project provides a shell script to perform a OCR scan on documents managed b
This script is based on the tesseract library. The script automatically converts PDF files into TIF format, so it
can be used for images as well as for PDF files. The text result is stored into a file ${FILENAME}.txt

## Fulltext Index

Imixs-Archive-OCR includes a fulltext search based on [Apache Lucene](http://lucene.apache.org/). This module can be combined with any Imixs-Workflow business application as well as with standalone applications.
36 changes: 1 addition & 35 deletions imixs-archive-test/README.md
@@ -1,4 +1,4 @@
# imixs-archive Test Environment
# Imixs-Archive-Test Environment

Imixs-Archive-Test provides a Docker-based test environment for Imixs-Archive. The test environment consists of the following Docker containers:

@@ -28,40 +28,6 @@ To build the docker image run
We use the standalone-full.xml configuration profile to activate JMS!


### JCA Hadoop Connector

Imixs-Archive uses a JCA connector to communicate with the Hadoop cluster. To install the connector, follow the installation guide at
[Imixs-JCA-Hadoop](https://github.com/imixs/imixs-jca/tree/master/imixs-jca-hadoop)

The configuration for the connector is part of the Wildfly standalone.xml:

    <subsystem xmlns="urn:jboss:domain:resource-adapters:4.0">
        <resource-adapters>
            <resource-adapter id="imixs-jca-hadoop">
                <archive>imixs-jca-hadoop.rar</archive>
                <transaction-support>LocalTransaction</transaction-support>
                <connection-definitions>
                    <connection-definition class-name="org.imixs.workflow.hadoop.jca.store.GenericManagedConnectionFactory" jndi-name="java:/jca/org.imixs.workflow.hadoop" enabled="true" use-java-context="true" pool-name="hadoop" use-ccm="true">
                        <config-property name="rootDirectory">./store/</config-property>
                        <pool>
                            <min-pool-size>0</min-pool-size>
                            <max-pool-size>10</max-pool-size>
                            <prefill>false</prefill>
                            <use-strict-min>false</use-strict-min>
                            <flush-strategy>FailingConnectionOnly</flush-strategy>
                        </pool>
                        <security>
                            <application/>
                        </security>
                    </connection-definition>
                </connection-definitions>
            </resource-adapter>
        </resource-adapters>
    </subsystem>


## Workflow Models

The folder /workflow/ contains BPMN models for testing.
3 changes: 2 additions & 1 deletion pom.xml
@@ -6,6 +6,7 @@
<packaging>pom</packaging>

<modules>
<module>imixs-archive-api</module>
<module>imixs-archive-ocr</module>
<module>imixs-archive-hadoop</module>
<module>imixs-archive-test</module>
@@ -78,7 +79,7 @@

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<org.imixs.workflow.version>4.1.5-SNAPSHOT</org.imixs.workflow.version>
<org.imixs.workflow.version>4.1.6-SNAPSHOT</org.imixs.workflow.version>
</properties>



