#5 📝 Tranche 3 of documentation migration
d-ryan-ashcraft committed Apr 30, 2024
1 parent e4396d2 commit 0373f45
Showing 10 changed files with 1,732 additions and 1 deletion.
3 changes: 2 additions & 1 deletion DRAFT_RELEASE_NOTES.md
@@ -1,6 +1,7 @@
# Major Additions

* Python modules were renamed to reflect aiSSEMBLE. These include the following.

| Old Python Module | New Python Module |
|--------------------------------------------|----------------------------------------------------------------|
| foundation-core-python | aissemble-core-python |
Binary file modified docs/modules/ROOT/images/aissemble-solution-architecture.png
364 changes: 364 additions & 0 deletions docs/modules/ROOT/pages/containers.adoc

Large diffs are not rendered by default.

173 changes: 173 additions & 0 deletions docs/modules/ROOT/pages/data-delivery-pipeline-overview.adoc
@@ -0,0 +1,173 @@
[#_data_delivery_pipeline_overview]
= Data Delivery Pipelines Overview
== Overview
Data Delivery covers activities including ingestion, storage, transformation, augmentation, and delivery of data. Data
and its cleanliness are essential for machine learning because they ensure accurate, reliable, and unbiased input,
leading to better model performance and trustworthy results. Data Delivery can also be applied in any system that needs
a consistent mechanism to handle and prepare data for any use. The following sections will help you understand Data
Delivery and determine where to modify and customize elements to suit your specific implementation.
== What are Data Delivery Pipelines?
The purpose of a data delivery pipeline is to process data into derivative forms. To simplify the incorporation of
commonly used pipeline components, pre-constructed patterns have been developed and can be included in your project.
If you are adding a data delivery pipeline to your project, there are only a few steps necessary to incorporate generated
code for the major data actions, such as ingest, transform, and enrich. At the abstract level, Data Delivery consists of
two primary concepts:
. Coordinating Data Flow Steps: coordinating any two or more Data Actions together to achieve some data result.
. Performing a Data Action: any action on data (e.g., enrich, validate, transform).
There are two data delivery implementations that aiSSEMBLE supports out of the box: Java Spark and PySpark.
== What Gets Generated
=== Docker Image
The generation process will create a new submodule under the `<project-name>-docker` module, called
`<project-name>-spark-worker-docker`. This submodule is responsible for creating a Docker image for Data Delivery activities.
|===
|Generated file | Description
|`<project-name>/<project-name>-docker/<project-name>-spark-worker-docker/src/main/resources/docker/Dockerfile`
|Dockerfile that builds the Spark worker image used to run Data Delivery activities.
|===
=== SparkApplication YAML
aiSSEMBLE packages a https://helm.sh/[Helm,role=external,window=_blank] chart to support the execution of data-delivery
pipelines as SparkApplication deployments through the Spark Operator. Read the
xref:guides/guides-spark-job.adoc#_configuring_and_executing_spark_jobs[Configuring and Executing Spark Jobs]
guide for more information on configuring the execution for different environments.
|===
|Generated file | Description
|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-base-values.yaml`
|Base configuration common to all execution environments.
|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-ci-values.yaml`
|Configuration for executing the pipeline in a continuous integration environment.
|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-debug-values.yaml`
|Configures the pipeline for local debugging.
|===
=== Invocation Service
The Invocation Service is an additional option for executing pipelines. It is generated into
`<project-name>-deploy/src/main/resources/apps/pipeline-invocation-service`. The deployment sets up a service that
submits Spark applications to Kubernetes when given the name of the pipeline to execute. The service
can be accessed through HTTP POST requests or through the xref:/messaging-details.adoc#_messaging_details[messaging module].
To execute a pipeline, send a request to the Invocation Service's
`start-spark-operator-job` endpoint, like the following:
=== POST /invoke-pipeline/start-spark-operator-job
Creates and submits a SparkApplication invocation for a specific pipeline to the Kubernetes cluster by leveraging the
pipeline’s values files to configure the
https://github.com/boozallen/aissemble/tree/dev/extensions/extensions-helm/aissemble-spark-application-chart[aissemble-spark-application,role=external,window=_blank]
Helm chart.
[%collapsible]
====
****
*Parameters*
|===
|*Name* | *Description*
|applicationName
|The name of the pipeline to invoke, in lower kebab case (e.g., `my-pipeline`).
|profile
|Specifies the execution profile, indicating the values files to layer together. One of "prod", "ci", or "dev".
|overrideValues
|Additional, individual values to layer on top of the profile's values files, corresponding to the `--values` Helm
command line option.
|===
.Sample data input:
[source,JSON]
----
{
  "applicationName": "my-pipeline",
  "profile": "ci",
  "overrideValues": {
    "metadata.name": "testapp"
  }
}
----
.Sample data output:
[source,text]
----
Submitted my-pipeline
----
****
====
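For reference, here is a minimal sketch of calling this endpoint from Python. The service host and port, as well as the
use of the `requests` library, are assumptions for illustration; the endpoint path and body fields come from the table above.

[source,python]
----
import requests

# Assumed host/port for the pipeline invocation service; adjust for your deployment.
INVOCATION_SERVICE_URL = "http://localhost:8080"

response = requests.post(
    f"{INVOCATION_SERVICE_URL}/invoke-pipeline/start-spark-operator-job",
    json={
        "applicationName": "my-pipeline",
        "profile": "ci",
        "overrideValues": {"metadata.name": "testapp"},
    },
)
# A successful submission returns a confirmation such as "Submitted my-pipeline".
print(response.status_code, response.text)
----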
=== GET /invoke-pipeline/healthcheck
A health check endpoint that returns HTTP `200` if the invocation service is up and running.
[%collapsible]
====
****
*Parameters*
|===
|*Name* | *Description*
|n/a
|
|===
****
====
Alternatively, you may emit a message through your message broker to the `pipeline-invocation` topic using the same body
format shown above.
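For example, a minimal sketch of publishing the invocation request with the `kafka-python` client (the client library and
broker address are assumptions; the topic name and body format come from above):

[source,python]
----
import json
from kafka import KafkaProducer

# Assumed broker address; point this at your project's message broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

body = {"applicationName": "my-pipeline", "profile": "ci"}

# The invocation service listens on the pipeline-invocation topic.
producer.send("pipeline-invocation", json.dumps(body).encode("utf-8"))
producer.flush()
----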
==== Configuration
The pipeline invocation service can be configured with the application options in the table below. See the
https://github.com/boozallen/aissemble/tree/dev/extensions/extensions-helm/extensions-helm-pipeline-invocation/aissemble-pipeline-invocation-app-chart#readme[Pipeline Invocation Chart documentation,role=external,window=_blank]
README for information on how to provide this configuration.
=== Application options
|===
|Value | Description | Default | Valid Options
|`service.pipelineInvocation.failureStrategy.global`
|Sets the global failure strategy in the event of a processing error.
|`LOG`
|`SILENT`, `LOG`, `EXCEPTIONAL`
|`service.pipelineInvocation.failureStrategy.[pipeline-name]`
|Override the global failure strategy for a specific pipeline.
|None
|`SILENT`, `LOG`, `EXCEPTIONAL`
|`service.pipelineInvocation.execution.profile`
|Default execution profile.
|`dev`
|`dev`, `ci`, `prod`
|===
See the https://github.com/boozallen/aissemble/blob/dev/extensions/extensions-helm/extensions-helm-pipeline-invocation/aissemble-pipeline-invocation-app-chart/README.md[Helm Chart documentation,role=external,window=_blank]
README for guidance on application configurations for the service.
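For instance, a values override supplying these options might look like the following sketch. The nesting simply mirrors
the dot-notation keys above, and the pipeline name is illustrative; consult the chart README for the authoritative structure.

[source,yaml]
----
service:
  pipelineInvocation:
    execution:
      profile: ci
    failureStrategy:
      global: LOG
      # Hypothetical per-pipeline override.
      my-pipeline: EXCEPTIONAL
----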
== Configuring your Pipeline
=== Persistence
=== Connection to RDBMS
IMPORTANT: The following instructions are applicable where the persist type specified for a pipeline is RDBMS.
If the persist type specified for a pipeline is RDBMS (e.g., PostgreSQL, SQLite, MySQL), then a method is automatically
added to your pipeline step's base class (e.g., `ingest_base.py`) to retrieve the database connection.
The primary use case is calling this generated method from your step implementation to obtain a connection and perform
inserts, updates, and selects.
NOTE: It is recommended to implement a secure configuration to prevent committing sensitive connection values directly
into your code. You may want to use https://pypi.org/project/krausening/[Krausening,role=external,window=_blank] to
externalize these values.
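As a sketch, using the generated connection method from a step implementation might look like the following. The method
name `get_data_source_connection()`, the DB-API style connection, and the table name are illustrative assumptions; check
your step's generated base class for the exact signature.

[source,python]
----
def save_records(self, records):
    # Hypothetical name for the connection accessor generated on the step base class.
    connection = self.get_data_source_connection()
    try:
        with connection.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO ingested_records (id, value) VALUES (%s, %s)",
                records,
            )
        connection.commit()
    finally:
        connection.close()
----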
144 changes: 144 additions & 0 deletions docs/modules/ROOT/pages/data-lineage.adoc
@@ -0,0 +1,144 @@
= Data Lineage

== Overview
Data lineage represents the origins, movements, transformations, and flows of data. It gives data scientists and
engineers visibility into processing, simplifying tracking of the full data lifecycle. Its granularity can vary and be
tailored to the needs of a particular project at a particular time.

== Leveraging Data Lineage
In order to integrate data lineage into your aiSSEMBLE(TM)-based project, you only need to opt in through your pipeline
model definition, as described in the xref:pipeline-metamodel.adoc[detailed pipeline options]. aiSSEMBLE's factory
will create your pipeline with the necessary dependencies, as well as convenient stubs!

Data lineage is a rapidly evolving industry topic, with new and improved standards and formats emerging frequently. We
provide generic types, capable of representing any lineage metadata, which can self-convert to and from other formats,
such as the https://openlineage.io[OpenLineage] standard. We also provide functions for conveying your data lineage
throughout your aiSSEMBLE-powered ecosystem, such as via Kafka messaging.

=== Data Lineage Types & Objects

Our data lineage modules are modeled after the OpenLineage format. We provide the following basic representation types:

* `Facet`
* `Dataset`, `InputDataset`, and `OutputDataset`
* `Job` and `Run`

==== Facets
A `Facet` describes some element of a data lineage action or entity. `Runs`, `Jobs`, and `Datasets` can all have facets,
though not all facets would make sense on each. For instance, one could construct a `Facet` to represent the source of
a given `Dataset`.

The `Facet` type provided by aiSSEMBLE is intentionally generic. It can be extended within your project to construct
your own customized `Facets`, or it can be created directly from OpenLineage facets using the `from_open_lineage_facet`
function. Custom `Facets` should override the appropriate conversion methods, such as `get_open_lineage_facet`, to address
specific conversion needs.

Facets are described by OpenLineage-format JSON schemas and expect a URI string representing the path to that schema
as their only required argument.

Example construction of `Facet` objects:

```python
from aissemble_data_lineage import Facet, from_open_lineage_facet
from openlineage.client.facet import DataSourceDatasetFacet

my_facet = Facet("http://my_schema_url.org/my_schema.json")
converted_facet = from_open_lineage_facet(DataSourceDatasetFacet(name="dataset_name", uri="http://my_datasource"))
```

```java
import io.openlineage.client.OpenLineage;

OpenLineage openLineage = new OpenLineage(producer);
OpenLineage.RunFacets runFacets = openLineage.newRunFacetsBuilder().build();
```

To define your own custom `Facet`, you will need to create a new class that extends OpenLineage's `BaseFacet` and define
the attributes that the `Facet` will track. You can then use the `from_open_lineage_facet` function to set the values
of the custom `Facet`.

Example construction of a customized `Facet` object:
```python
from aissemble_data_lineage import from_open_lineage_facet
from openlineage.client.facet import BaseFacet
import attr

@attr.s
class MyCustomFacet(BaseFacet):
    custom_attr: str = attr.ib()
    _additional_skip_redact: ["custom_attr"]

    def __init__(self, custom_attr):
        super().__init__()
        self.custom_attr = custom_attr


custom_facet = from_open_lineage_facet(MyCustomFacet(custom_attr="Custom Value"))
```

==== Datasets
`Dataset` objects represent just that: a known collection of data following a common path through the processing system.
For cases where inputs and outputs are distinct from processed data, we provide the `InputDataset` and `OutputDataset`
objects.

Example construction of `Dataset` objects:

```python
from aissemble_data_lineage import Dataset, InputDataset, OutputDataset

my_dataset = Dataset(name="dataset_1", facets={"facet_1": my_facet})
my_input_dataset = InputDataset(name="dataset_2", facets={}, input_facets={})
my_output_dataset = OutputDataset(name="dataset_3", facets={}, output_facets={})
```

```java
import io.openlineage.client.OpenLineage.InputDataset;

ArrayList<InputDataset> olInputs = new ArrayList<>();
if (inputs != null) {
    inputs.forEach(input -> {
        olInputs.add((InputDataset) input.getOpenLineageDataset());
    });
}
```

==== Runs
A `Run` is composed of one to many `Jobs`, and encompasses exactly one full processing pass over your data.

Example construction of `Run` objects:
```python
from aissemble_data_lineage import Run
from uuid import uuid4

my_run = Run(uuid4(), facets={})
```

```java
import com.boozallen.aiops.data.lineage.Run;

Run olRun = run.getOpenLineageRun();
```

==== Jobs
A `Job` represents a single discrete step in the data processing pipeline, such as ingestion, a specific transformation,
or any other action of note.

Example construction of `Job` objects:
```python
from aissemble_data_lineage import Job

my_job = Job("my_job", facets={})
```

```java
import com.boozallen.aiops.data.lineage.Job;

Job olJob = job.getOpenLineageJob();
```

== Additional Resources
The full aiSSEMBLE data lineage source code can be reviewed on
https://github.com/boozallen/aissemble/tree/dev/foundation/foundation-lineage/foundation-data-lineage[GitHub].

The Python methods detailing the generation of minimal instances of `Run`, `Job`, and `RunEvent` for emission are described
in `src/<pipeline-name>/generated/step/abstract_data_action.py`. Correspondingly, the Java methods outlining the same
are detailed in `src/generated/java/<package-name>/AbstractPipelineStep.java`.
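For reference, a minimal sketch of emitting a run event directly with the OpenLineage Python client is shown below. The
transport URL, namespace, and producer values are assumptions for illustration; in practice the generated aiSSEMBLE code
referenced above performs the emission for you.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Assumed HTTP endpoint for an OpenLineage-compatible collector.
client = OpenLineageClient(url="http://localhost:5000")

event = RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="my-namespace", name="my_job"),
    producer="https://example.org/my-producer",
)
client.emit(event)
```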