
Commit

refactoring
issue #9
rsoika committed Sep 18, 2017
1 parent c44ef48 commit c287c0c
Showing 8 changed files with 134 additions and 101 deletions.
40 changes: 12 additions & 28 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,40 +1,24 @@
# imixs-archive
# Imixs-Archive

Imixs-Archive is a sub project of Imixs-Workflow providing a solution for long-term archiving of business data.
Imixs-Archive can be combined with the Workflow Suite Imixs-Office-Workflow as well as with individual business applications based on the Imixs-Workflow engine. The archive data is transferred to a Hadoop cluster. The Imixs-Archive project provides various functions for the exchange of data with a Hadoop cluster.
Imixs-Archive is an open source project designed to provide a transparent and sustaining solution for long-term archiving of business data. In this context, business data means not only documents but also the comprehensible documentation of business processes.
Imixs-Archive is a sub-project of the Human-Centric Workflow engine [Imixs-Workflow](http://www.imixs.org), which provides a powerful platform for the description and execution of business processes.

The goal of this project is to provide an open and transparent technology for long-term archiving of business data based on the Imixs-Workflow project.
Imixs-Archive provides an API for a transparent data exchange with any kind of archive system. One of these systems supported by Imixs-Archive is [Apache Hadoop](http://hadoop.apache.org/).


## Hadoop

Imixs-Archive is based on the [Hadoop technology](http://hadoop.apache.org/) and provides submodules to plug Hadoop into the Imixs-Workflow engine.

* [Imixs-Archive-Hadoop-JCA](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-hadoop-jca)
* [Imixs-Archive-Hadoop-Client](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-hadoop-client)


## Docker

The [Imixs-Docker/hadoop project](https://github.com/imixs/imixs-docker/tree/master/hadoop) provides a Docker image to run Hadoop in a Docker container. This container can be used to test Hadoop in combination with Imixs-Archive. **NOTE:** The Imixs-Docker/hadoop container is for test purposes only. The container should only run in a system environment protected from external access.

## The API

* [Imixs-Archive-Test](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-test)
The sub-module Imixs-Archive-API provides the core functionality and interfaces to generate, store and retrieve business data in an archive system. This API is platform independent and based on the Imixs-Workflow API.

## Hadoop

The sub-module Imixs-Archive-Hadoop provides an adapter for the [Apache Hadoop Filesystem (HDFS)](http://hadoop.apache.org/). The adapter is based on HttpFS, which can be used to transfer data between different versions of Hadoop clusters. HttpFS allows access to data in a clustered HDFS behind a firewall, which enables a restricted and secured archive architecture.
As HttpFS is based on REST, this component does not require any additional Hadoop libraries. In addition, HttpFS has built-in security supporting Hadoop pseudo authentication, HTTP SPNEGO Kerberos and other pluggable authentication mechanisms, to be used depending on the target architecture.
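As a rough illustration of this REST-based approach, the following sketch builds a WebHDFS-style URL and uploads a document with a plain HTTP `PUT`. The host, port, user name and the single-request `data=true` handling are assumptions for a simple HttpFS setup — this is not the actual Imixs-Archive adapter code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch of an HttpFS/WebHDFS client call.
 * Host, port and user name are placeholders.
 */
class HttpFSClient {

    private final String baseUrl; // e.g. http://my-hadoop-cluster.local:14000
    private final String userName;

    HttpFSClient(String baseUrl, String userName) {
        this.baseUrl = baseUrl;
        this.userName = userName;
    }

    /** Builds the REST URL for a WebHDFS operation such as CREATE or OPEN. */
    String buildUrl(String hdfsPath, String operation) {
        return baseUrl + "/webhdfs/v1" + hdfsPath
                + "?op=" + operation + "&user.name=" + userName;
    }

    /** Uploads the given bytes via HTTP PUT (op=CREATE). */
    int create(String hdfsPath, byte[] data) throws IOException {
        URL url = new URL(buildUrl(hdfsPath, "CREATE") + "&data=true");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("PUT");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream out = con.getOutputStream()) {
            out.write(data);
        }
        return con.getResponseCode();
    }
}
```

Because the protocol is plain HTTP, no Hadoop client libraries are needed on the workflow side.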


## Docker

# Concepts

Imixs-Archive is mainly based on a 'Workflow Push' strategy, where the archive process is directly coupled to the workflow process. This means that the archive process can be controlled by the workflow model. The Imixs-Archive plug-in communicates with the Hadoop cluster via the Hadoop REST API. During the archive process, the checksum computed by Hadoop is immediately stored into the source workitem. This is a tightly coupled archive strategy which guarantees a transactionally secure archive process.


## Data Consistency

Imixs-Archive guarantees the consistency of the stored data by calculating an MD5 checksum for each document stored into the archive. The checksum is part of the access-URL returned by the archive system after a document was stored. If the access-URL specified later by the client to read the data does not match, an error code is returned.
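A minimal sketch of such a checksum verification, assuming the MD5 checksum forms the last path segment of the access-URL (the actual URL layout of the archive system may differ):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: compute an MD5 checksum and verify it against the
 *  checksum part of a (hypothetical) access-URL. */
class ChecksumUtil {

    /** Returns the MD5 digest as a 32-character lower-case hex string. */
    static String md5(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    /** True if the checksum embedded in the access-URL matches the data. */
    static boolean verify(String accessUrl, byte[] content) {
        // assumption: the checksum is the last path segment of the access-URL
        String expected = accessUrl.substring(accessUrl.lastIndexOf('/') + 1);
        return expected.equalsIgnoreCase(md5(content));
    }
}
```

A client reading a document back can recompute the digest and compare it to the URL segment before trusting the content.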

The [Imixs-Docker/hadoop project](https://github.com/imixs/imixs-docker/tree/master/hadoop) provides a Docker image to run Hadoop in a Docker container. This container can be used to test Imixs-Archive in combination with a Hadoop single-node cluster.
**NOTE:** The Imixs-Docker/hadoop container is for test purposes only. The container should only run in a system environment protected from external access.

## Access Control
The access to data written into the Imixs-Archive should ideally be managed completely by the [Imixs-Workflow](http://www.imixs.org) engine. Imixs-Workflow supports a multi-level security model that offers great flexibility in controlling the access to all parts of a workitem.

43 changes: 43 additions & 0 deletions imixs-archive-api/README.md
@@ -0,0 +1,43 @@
# Imixs-Archive API

The sub-module Imixs-Archive-API provides the core functionality and interfaces to generate, store and retrieve business data in an archive system. This API is platform independent and based on the Imixs-Workflow API.


## Concepts

Imixs-Archive is mainly based on a 'Workflow Push' strategy, where the archive process is directly coupled to the workflow process. This means that the archive process can be controlled by the workflow model.


### The Snapshot-Architecture

Imixs-Workflow provides a built-in snapshot mechanism to archive the content of a workitem into a snapshot-workitem.
A snapshot workitem is a copy of the current workitem (origin-workitem) including all the file content of attached files. The origin-workitem only holds a reference ($snapshotID) to the snapshot-workitem to load attached file data.
See the Snapshot-Concept for further details.


Attached files will be linked from the snapshot-workitem to the origin-workitem.

The snapshot process includes the following stages:

1. create a copy of the origin workitem instance
2. compute a snapshot $uniqueId based on the origin workitem's $uniqueId, suffixed with a timestamp
3. change the type of the snapshot-workitem by adding the prefix 'archive-'
4. if an old snapshot already exists, compare its files with the current $file list and, if necessary, apply the stored file content to the new snapshot
5. remove the file content from the origin-workitem
6. store the snapshot $uniqueId into the origin-workitem as a reference ($snapshotID)
7. remove deprecated snapshots

A snapshot-workitem holds a reference to the origin-workitem through its own $uniqueId, which is
always the $uniqueId of the origin-workitem suffixed with a timestamp.
During snapshot creation, the snapshot $uniqueId is stored into the origin-workitem.
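The $uniqueId scheme described above can be sketched as a small helper. The method names are illustrative, not the actual Imixs-Archive API:

```java
/** Sketch of the snapshot $uniqueId scheme: the origin workitem's
 *  $uniqueId suffixed with a timestamp. */
class SnapshotIdUtil {

    /** e.g. "8337f0a2-1b2c" + "-" + 1505721600000 */
    static String computeSnapshotId(String originUniqueId, long timestampMillis) {
        return originUniqueId + "-" + timestampMillis;
    }

    /** Recovers the origin $uniqueId by stripping the timestamp suffix. */
    static String originIdOf(String snapshotId) {
        return snapshotId.substring(0, snapshotId.lastIndexOf('-'));
    }
}
```

Because the timestamp suffix is strictly increasing, snapshots of the same workitem sort chronologically by id, and the origin reference is always recoverable.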

The ArchiveLocalPlugin implements the ObserverPlugin interface and is tied to the transaction context of the Imixs-Workflow engine. The process of creating a new snapshot-workitem is transparently aware of the current transaction and will automatically roll back any snapshot-workitems in case of an EJB exception.



### The Access Control
The access to archive data written into the Imixs-Archive is controlled completely by the [Imixs-Workflow engine](http://www.imixs.org). Imixs-Workflow supports a multi-level security model that offers great flexibility in controlling the access to all parts of a workitem.




47 changes: 47 additions & 0 deletions imixs-archive-api/pom.xml
@@ -0,0 +1,47 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.imixs.workflow</groupId>
		<artifactId>imixs-archive</artifactId>
		<version>0.0.2-SNAPSHOT</version>
	</parent>
	<artifactId>imixs-archive-api</artifactId>
	<name>Imixs-Archive API</name>

	<dependencies>
		<dependency>
			<groupId>org.imixs.workflow</groupId>
			<artifactId>imixs-workflow-core</artifactId>
		</dependency>
		<dependency>
			<groupId>org.imixs.workflow</groupId>
			<artifactId>imixs-workflow-engine</artifactId>
		</dependency>

		<!-- Java EE dependencies -->
		<dependency>
			<groupId>javax</groupId>
			<artifactId>javaee-api</artifactId>
			<version>7.0</version>
			<scope>provided</scope>
		</dependency>

		<!-- JSON Parser -->
		<dependency>
			<groupId>javax.json</groupId>
			<artifactId>javax.json-api</artifactId>
			<version>1.1</version>
			<scope>provided</scope>
		</dependency>

		<!-- JUnit Tests -->
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.8.1</version>
			<scope>test</scope>
		</dependency>
	</dependencies>

</project>
40 changes: 12 additions & 28 deletions imixs-archive-hadoop/README.md
@@ -1,50 +1,34 @@
# Imixs-Archive-Hadoop

The Imixs-Archive-Hadoop project provides an API to store workitems into a Hadoop cluster. Imixs-Archive-Hadoop uses the [Imixs-JCA-Hadoop Connector](https://github.com/imixs/imixs-jca/tree/master/imixs-jca-hadoop)
The Imixs-Archive-Hadoop project provides an API to ingest Imixs-Workflow data into a Hadoop Cluster. The Imixs-Hadoop Scheduler Service automatically transfers the so-called snapshot-workitems into a Hadoop cluster.
A snapshot-workitem is a core concept of the Imixs-Workflow engine and can be configured through the Imixs-Workflow Plug-In API.

Imixs-Archive-Hadoop communicates with a Hadoop cluster via the [WebHDFS REST API](https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html).

## Synchronous Mode Push

This implementation follows the architecture of a synchronous push mode. With this strategy, the archive process is directly coupled to the workflow process. This means that the archive process can be controlled by the workflow model. The implementation is realized by an Imixs plug-in which is directly controlled by the engine. The plug-in accesses the Hadoop cluster via the Hadoop REST API. In this scenario the plug-in can store archive data, like the checksum, immediately into the workitem. This is a tightly coupled archive strategy.

### Pros

* The archive process can be directly controlled by the workflow engine (via a plug-in)
* The data between Hadoop and Imixs-Workflow is in sync at any time
* A workitem can store archive information synchronously (e.g. a checksum)

### Cons

* The process is time consuming and slows down the overall performance of the workflow engine
* The process is memory consuming
* The process has to be embedded into the running transaction, which increases the complexity
* Hadoop must be accessible via the internet, and additional security must be implemented on both sides

## HDFS Schema Design
Hadoop's 'Schema-on-Read' model does not impose any requirements when loading data into Hadoop. Nevertheless, Imixs-Archive-Hadoop provides a structured and organized data repository. All workflow data ingested into Hadoop is partitioned by the creation YEAR and MONTH of the process instance into a directory hierarchy:

    /data/[workflow-instance]/YEAR/MONTH/[SNAPSHOT-UNIQUEID].xml

See the following example:

    /data/company/2017/01/333444.1111.2222-21555122.xml

All data is stored in a semistructured XML format based on the Imixs-XML schema.
With this schema design it is
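The YEAR/MONTH partitioning scheme above can be sketched as a small helper (illustrative only, not the actual Imixs-Archive-Hadoop code):

```java
import java.time.LocalDate;

/** Sketch of the HDFS partitioning scheme:
 *  /data/[workflow-instance]/YEAR/MONTH/[SNAPSHOT-UNIQUEID].xml */
class PartitionPath {

    /** Builds the HDFS target path for a snapshot created on the given date. */
    static String pathFor(String instance, LocalDate created, String snapshotId) {
        return "/data/" + instance
                + "/" + created.getYear()
                + "/" + String.format("%02d", created.getMonthValue()) // zero-padded month
                + "/" + snapshotId + ".xml";
    }
}
```

Partitioning by creation date keeps directory sizes bounded and lets batch jobs scan a single month without touching the rest of the archive.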


# Implementation

The service is implemented as a stateful session EJB together with a plug-in. The stateful session EJB synchronizes the transaction and decides in the afterCommit(boolean) method whether to commit or roll back the changes in Hadoop. This approach is a little complex, time and memory consuming, but has the advantage that the workitem is always in sync with the data in the Hadoop cluster.
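The commit/rollback decision can be illustrated with a plain-Java sketch. In the real service this logic would live in a stateful session EJB — typically via the `javax.ejb.SessionSynchronization` callbacks — but here the container callback is simulated so the decision is visible without a container; class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the transaction-aware archive logic:
 *  writes are buffered during the transaction and only pushed
 *  to Hadoop when the transaction commits. */
class TransactionalArchiver {

    private final List<String> pendingWrites = new ArrayList<>();
    private final List<String> archived = new ArrayList<>();

    /** Called during the workflow transaction: buffer the snapshot only. */
    void write(String snapshotId) {
        pendingWrites.add(snapshotId);
    }

    /** Simulated container callback: flush on commit, discard on rollback. */
    void afterCompletion(boolean committed) {
        if (committed) {
            archived.addAll(pendingWrites); // push the buffered snapshots to Hadoop
        }
        pendingWrites.clear(); // rolled-back snapshots are simply dropped
    }

    List<String> archivedSnapshots() {
        return archived;
    }
}
```

Buffering until after completion is what keeps the workitem and the Hadoop cluster consistent: a failed workflow transaction leaves no orphaned archive entries.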
## Hadoop REST API
The ingestion mechanism is based on the Hadoop REST API 'HttpFS'.

## CDI Support

The HadoopService and the Archive plug-in support CDI. A beans.xml is located in the META-INF folder. Make sure that the client library is visible to your EJB modules. See the section 'Using shared libraries' in the [Imixs Deployment Guide](http://www.imixs.org/doc/deployment/deployment_guide.html).



## HDFSWebClient

The HDFSWebClient code is based on the work of [zxs/webhdfs-java-client](https://github.com/zxs/webhdfs-java-client).

## JUnit Tests

The library can be tested with a single-node Hadoop cluster.
For all integration tests, just start the Docker hadoop container with the following command:

docker run --name="hadoop" -d -h my-hadoop-cluster.local -p 50070:50070 -p 50075:50075 imixs/hadoop

Make sure that the hostname 'my-hadoop-cluster.local' is resolvable in your local test environment (e.g. via your hosts file).

See the [Imixs-Docker hadoop project](https://github.com/imixs/imixs-docker/tree/master/hadoop) for more details.


@@ -27,21 +27,26 @@
* The snapshot process includes the following stages:
*
* <ol>
* <li>create a copy of the current workitem
* <li>compute a snapshot $uniqueId containing a timestamp
* <li>change the type with the prefix 'archive-'
* <li>create a copy of the origin workitem instance
* <li>compute a snapshot $uniqueId based on the origin workitem suffixed with a timestamp
* <li>change the type of the snapshot-workitem with the prefix 'archive-'
* <li>if an old snapshot already exists, its files are compared with the current
* $file list and, if necessary, the stored file content is applied to the new snapshot
* <li>remove file content from the origin-workitem
* <li>store the snapshot uniqueId into the origin-workitem ($snapshotID)
* <li>remove the file content from the origin-workitem
* <li>store the snapshot uniqueId into the origin-workitem as a reference ($snapshotID)
* <li>remove deprecated snapshots
* </ol>
*
* The Plugin implements the ObserverPlugin interface
* A snapshot workitem holds a reference to the origin workitem by its own $uniqueId which is
* always the $uniqueId from the origin workitem suffixed with a timestamp.
* During the snapshot creation the snapshot $uniqueId is stored into the origin workitem.
*
* The ArchiveLocalPlugin implements the ObserverPlugin interface
*
* <p>
* Note: The ArchiveLocalPlugin replaces the DMSPlugin from the imixs-marty
* project and provides a migration mechanism for old BlobWorkitems. The old
* Note: The ArchiveLocalPlugin replaces the BlobWorkitems mechanism which was earlier
* part of the DMSPlugin from the imixs-marty project. The plugin provides a
* migration mechanism for old BlobWorkitems. The old
* BlobWorkitems will not be deleted.
*
* @version 1.0
5 changes: 4 additions & 1 deletion imixs-archive-ocr/README.md
@@ -1,4 +1,4 @@
# imixs-archive OCR
# Imixs-Archive-OCR

Imixs-Archive-OCR is a sub-project of Imixs-Archive providing a solution for OCR scans of documents.

@@ -7,3 +7,6 @@ The project provides a shell script to perform a OCR scan on documents managed b
This script is based on the tesseract library. The script automatically converts PDF files into TIF format, so it
can be used for images as well as for PDF files. The text result is stored into a file ${FILENAME}.txt

## Fulltext Index

Imixs-Archive-OCR includes a fulltext search based on [Apache Lucene](http://lucene.apache.org/). This module can be combined with any Imixs-Workflow business application as well as with standalone applications.
36 changes: 1 addition & 35 deletions imixs-archive-test/README.md
@@ -1,4 +1,4 @@
# imixs-archive Test Environment
# Imixs-Archive-Test Environment

Imixs-Archive-Test provides a Docker-based test environment for Imixs-Archive. The test environment consists of the following Docker containers:

@@ -28,40 +28,6 @@ To build the docker image run
We use the standalone-full.xml configuration profile to activate JMS!


### JCA Hadoop Connector

Imixs-Archive uses a JCA connector to communicate with the Hadoop cluster. To install the connector, follow the installation guide at
[Imixs-JCA-Hadoop](https://github.com/imixs/imixs-jca/tree/master/imixs-jca-hadoop)

The configuration for the connector is part of the Wildfly standalone.xml:

    <subsystem xmlns="urn:jboss:domain:resource-adapters:4.0">
        <resource-adapters>
            <resource-adapter id="imixs-jca-hadoop">
                <archive>imixs-jca-hadoop.rar</archive>
                <transaction-support>LocalTransaction</transaction-support>
                <connection-definitions>
                    <connection-definition class-name="org.imixs.workflow.hadoop.jca.store.GenericManagedConnectionFactory" jndi-name="java:/jca/org.imixs.workflow.hadoop" enabled="true" use-java-context="true" pool-name="hadoop" use-ccm="true">
                        <config-property name="rootDirectory">./store/</config-property>
                        <pool>
                            <min-pool-size>0</min-pool-size>
                            <max-pool-size>10</max-pool-size>
                            <prefill>false</prefill>
                            <use-strict-min>false</use-strict-min>
                            <flush-strategy>FailingConnectionOnly</flush-strategy>
                        </pool>
                        <security>
                            <application/>
                        </security>
                    </connection-definition>
                </connection-definitions>
            </resource-adapter>
        </resource-adapters>
    </subsystem>


## Workflow Models

The folder /workflow/ contains BPMN models for testing.
3 changes: 2 additions & 1 deletion pom.xml
@@ -6,6 +6,7 @@
<packaging>pom</packaging>

<modules>
<module>imixs-archive-api</module>
<module>imixs-archive-ocr</module>
<module>imixs-archive-hadoop</module>
<module>imixs-archive-test</module>
@@ -78,7 +79,7 @@

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<org.imixs.workflow.version>4.1.5-SNAPSHOT</org.imixs.workflow.version>
<org.imixs.workflow.version>4.1.6-SNAPSHOT</org.imixs.workflow.version>
</properties>



