#5 📝 Tranche 3 of documentation migration
d-ryan-ashcraft committed Apr 30, 2024
1 parent e4396d2 commit 0373f45
Showing 10 changed files with 1,732 additions and 1 deletion.
3 changes: 2 additions & 1 deletion DRAFT_RELEASE_NOTES.md
@@ -1,6 +1,7 @@
# Major Additions

* Python modules were renamed to reflect aiSSEMBLE. These include the following.

| Old Python Module | New Python Module |
|--------------------------------------------|----------------------------------------------------------------|
| foundation-core-python | aissemble-core-python |
Binary file modified docs/modules/ROOT/images/aissemble-solution-architecture.png
364 changes: 364 additions & 0 deletions docs/modules/ROOT/pages/containers.adoc

Large diffs are not rendered by default.

173 changes: 173 additions & 0 deletions docs/modules/ROOT/pages/data-delivery-pipeline-overview.adoc
@@ -0,0 +1,173 @@
[#_data_delivery_pipeline_overview]
= Data Delivery Pipelines Overview
== Overview
Data Delivery covers activities including ingestion, storage, transformation, augmentation, and delivery of data. Data
and its cleanliness are essential for machine learning because they ensure accurate, reliable, and unbiased input,
leading to better model performance and trustworthy results. Data Delivery can also be applied in any system that needs
a consistent mechanism to handle and prepare data for any use. The following sections will help you understand Data
Delivery and determine where to modify and customize elements to suit your specific implementation.
== What are Data Delivery Pipelines?
The purpose of a data delivery pipeline is to process data into derivative forms. To simplify the incorporation of
commonly used pipeline components, pre-constructed patterns have been developed and can be included in your project.
If you are adding a data delivery pipeline to your project, there are only a few steps necessary to incorporate generated
code for the major data actions, such as ingest, transform, and enrich. At the abstract level, Data Delivery consists of
two primary concepts:
. Coordinating Data Flow Steps: coordinating any two or more Data Actions together to achieve some data result.
. Performing a Data Action: any action on data (e.g., enrich, validate, transform).
There are two data delivery implementations that aiSSEMBLE supports out of the box: Java Spark and PySpark.
== What Gets Generated
=== Docker Image
The generation process will create a new submodule under the `<project-name>-docker` module, called
`<project-name>-spark-worker-docker`. This submodule is responsible for creating a Docker image for Data Delivery activities.
|===
|Generated file | Description
|`<project-name>/<project-name>-docker/<project-name>-spark-worker-docker/src/main/resources/docker/Dockerfile`
|Dockerfile that builds the Spark worker image used to run Data Delivery activities.
|===
=== SparkApplication YAML
aiSSEMBLE packages a https://helm.sh/[Helm,role=external,window=_blank] chart to support the execution of data-delivery
pipelines as SparkApplication deployments through the Spark Operator. Read the
xref:guides/guides-spark-job.adoc#_configuring_and_executing_spark_jobs[Configuring and Executing Spark Jobs]
guide for more information on configuring the execution for different environments.
|===
|Generated file | Description
|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-base-values.yaml`
|Base configuration common to all execution environments.
|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-ci-values.yaml`
|Configuration for executing the pipeline in a continuous integration environment.
|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-debug-values.yaml`
|Configures the pipeline for local debugging.
|===
=== Invocation Service
The Invocation Service is an additional option for executing pipelines. It is generated into
`<project-name>-deploy/src/main/resources/apps/pipeline-invocation-service`. The deployment sets up a service that
submits Spark applications to Kubernetes when given the name of the pipeline to execute. The service
can be accessed through HTTP POST requests or through the xref:/messaging-details.adoc#_messaging_details[messaging module].
To execute a pipeline, send a request to the Invocation Service's
`start-spark-operator-job` endpoint, like the following:
=== POST /invoke-pipeline/start-spark-operator-job
Creates and submits a SparkApplication invocation for a specific pipeline to the Kubernetes cluster by leveraging the
pipeline’s values files to configure the
https://github.com/boozallen/aissemble/tree/dev/extensions/extensions-helm/aissemble-spark-application-chart[aissemble-spark-application,role=external,window=_blank]
Helm chart.
[%collapsible]
====
****
*Parameters*
|===
|*Name* | *Description*
|applicationName
|The name of the pipeline to invoke, in lower kebab case (e.g., `my-pipeline`).
|profile
|Specifies the execution profile, indicating the values files to layer together. One of "prod", "ci", or "dev".
|overrideValues
|Additional, individual values to layer on top of the profile's values files, corresponding to the `--values` Helm
command line option.
|===
.Sample data input:
[source,JSON]
----
{
  "applicationName": "my-pipeline",
  "profile": "ci",
  "overrideValues": {
    "metadata.name": "testapp"
  }
}
----
.Sample data output:
[source,text]
----
Submitted my-pipeline
----
****
====
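For reference, here is a minimal sketch of calling this endpoint from Python. The service host and port, as well as the
use of the `requests` library, are assumptions for illustration; the endpoint path and body fields come from the table above.

[source,python]
----
import requests

# Assumed host/port for the pipeline invocation service; adjust for your deployment.
INVOCATION_SERVICE_URL = "http://localhost:8080"

response = requests.post(
    f"{INVOCATION_SERVICE_URL}/invoke-pipeline/start-spark-operator-job",
    json={
        "applicationName": "my-pipeline",
        "profile": "ci",
        "overrideValues": {"metadata.name": "testapp"},
    },
)
# A successful submission returns a confirmation such as "Submitted my-pipeline".
print(response.status_code, response.text)
----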
=== GET /invoke-pipeline/healthcheck
A health check endpoint that returns HTTP `200` if the invocation service is up and running.
[%collapsible]
====
****
*Parameters*
|===
|*Name* | *Description*
|n/a
|
|===
****
====
Alternatively, you may emit a message through your message broker to the `pipeline-invocation` topic using the same body
format shown above.
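For example, a minimal sketch of publishing the invocation request with the `kafka-python` client (the client library and
broker address are assumptions; the topic name and body format come from above):

[source,python]
----
import json
from kafka import KafkaProducer

# Assumed broker address; point this at your project's message broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

body = {"applicationName": "my-pipeline", "profile": "ci"}

# The invocation service listens on the pipeline-invocation topic.
producer.send("pipeline-invocation", json.dumps(body).encode("utf-8"))
producer.flush()
----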
==== Configuration
The pipeline invocation service can be configured with the application options in the table below. See the
https://github.com/boozallen/aissemble/tree/dev/extensions/extensions-helm/extensions-helm-pipeline-invocation/aissemble-pipeline-invocation-app-chart#readme[Pipeline Invocation Chart documentation,role=external,window=_blank]
README for information on how to provide this configuration.
=== Application options
|===
|Value | Description | Default | Valid Options
|`service.pipelineInvocation.failureStrategy.global`
|Sets the global failure strategy in the event of a processing error.
|`LOG`
|`SILENT`, `LOG`, `EXCEPTIONAL`
|`service.pipelineInvocation.failureStrategy.[pipeline-name]`
|Override the global failure strategy for a specific pipeline.
|None
|`SILENT`, `LOG`, `EXCEPTIONAL`
|`service.pipelineInvocation.execution.profile`
|Default execution profile.
|`dev`
|`dev`, `ci`, `prod`
|===
See the https://github.com/boozallen/aissemble/blob/dev/extensions/extensions-helm/extensions-helm-pipeline-invocation/aissemble-pipeline-invocation-app-chart/README.md[Helm Chart documentation,role=external,window=_blank]
README for guidance on application configurations for the service.
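For instance, a values override supplying these options might look like the following sketch. The nesting simply mirrors
the dot-notation keys above, and the pipeline name is illustrative; consult the chart README for the authoritative structure.

[source,yaml]
----
service:
  pipelineInvocation:
    execution:
      profile: ci
    failureStrategy:
      global: LOG
      # Hypothetical per-pipeline override.
      my-pipeline: EXCEPTIONAL
----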
== Configuring your Pipeline
=== Persistence
=== Connection to RDBMS
IMPORTANT: The following instructions are applicable where the persist type specified for a pipeline is RDBMS.
If the persist type specified for a pipeline is RDBMS (e.g., PostgreSQL, SQLite, MySQL), then a method is automatically
added to your pipeline step's base class (e.g., `ingest_base.py`) to retrieve the database connection.
The primary use case is calling this generated method from your step implementation to obtain a connection and perform
inserts, updates, and selects.
NOTE: It is recommended to implement a secure configuration to prevent committing sensitive connection values directly
into your code. You may want to use https://pypi.org/project/krausening/[Krausening,role=external,window=_blank] to
externalize these values.
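As a sketch, using the generated connection method from a step implementation might look like the following. The method
name `get_data_source_connection()`, the DB-API style connection, and the table name are illustrative assumptions; check
your step's generated base class for the exact signature.

[source,python]
----
def save_records(self, records):
    # Hypothetical name for the connection accessor generated on the step base class.
    connection = self.get_data_source_connection()
    try:
        with connection.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO ingested_records (id, value) VALUES (%s, %s)",
                records,
            )
        connection.commit()
    finally:
        connection.close()
----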
144 changes: 144 additions & 0 deletions docs/modules/ROOT/pages/data-lineage.adoc
@@ -0,0 +1,144 @@
= Data Lineage

== Overview
Data lineage represents the origins, movements, transformations, and flows of data. It gives data scientists and
engineers visibility into processing, simplifying tracking of the full data lifecycle. Its granularity can vary and be
tailored to the needs of a particular project at a particular time.

== Leveraging Data Lineage
In order to integrate data lineage into your aiSSEMBLE(TM)-based project, you only need to opt in through your pipeline
model definition, as described in the xref:pipeline-metamodel.adoc[detailed pipeline options]. aiSSEMBLE's factory
will create your pipeline with the necessary dependencies, as well as convenient stubs!

Data lineage is a rapidly evolving industry topic, with new and improved standards and formats emerging frequently. We
provide generic types, capable of representing any lineage metadata, which can self-convert to and from other formats,
such as the https://openlineage.io[OpenLineage] standard. We also provide functions for conveying your data lineage
throughout your aiSSEMBLE-powered ecosystem, such as via Kafka messaging.

=== Data Lineage Types & Objects

Our data lineage modules are modeled after the OpenLineage format. We provide the following basic representation types:

* `Facet`
* `Dataset`, `InputDataset`, and `OutputDataset`
* `Job` and `Run`

==== Facets
A `Facet` describes some element of a data lineage action or entity. `Runs`, `Jobs`, and `Datasets` can all have facets,
though not all facets would make sense on each. For instance, one could construct a `Facet` to represent the source of
a given `Dataset`.

The `Facet` type provided by aiSSEMBLE is intentionally generic. It can be extended within your project to construct
your own customized `Facets`, or it can be created directly from OpenLineage facets using the `from_open_lineage_facet`
function. Custom `Facets` should override the appropriate conversion methods, such as `get_open_lineage_facet`, to address
specific conversion needs.

Facets are described by OpenLineage-format JSON schemas and expect a URI string representing the path to that schema
as their only required argument.

Example construction of `Facet` objects:

```python
from aissemble_data_lineage import Facet, from_open_lineage_facet
from openlineage.client.facet import DataSourceDatasetFacet

my_facet = Facet("http://my_schema_url.org/my_schema.json")
converted_facet = from_open_lineage_facet(DataSourceDatasetFacet(name="dataset_name", uri="http://my_datasource"))
```

```java
import io.openlineage.client.OpenLineage;

OpenLineage openLineage = new OpenLineage(producer);
OpenLineage.RunFacets runFacets = openLineage.newRunFacetsBuilder().build();
```

To define your own custom `Facet`, you will need to create a new class that extends OpenLineage's `BaseFacet` and define
the attributes that the `Facet` will track. You can then use the `from_open_lineage_facet` function to set the values
of the custom `Facet`.

Example construction of a customized `Facet` object:
```python
from aissemble_data_lineage import from_open_lineage_facet
from openlineage.client.facet import BaseFacet
import attr

@attr.s
class MyCustomFacet(BaseFacet):
    custom_attr: str = attr.ib()
    _additional_skip_redact: ["custom_attr"]

    def __init__(self, custom_attr):
        super().__init__()
        self.custom_attr = custom_attr


custom_facet = from_open_lineage_facet(MyCustomFacet(custom_attr="Custom Value"))
```

==== Datasets
`Dataset` objects represent just that: a known collection of data following a common path through the processing system.
For cases where inputs and outputs are distinct from processed data, we provide the `InputDataset` and `OutputDataset`
objects.

Example construction of `Dataset` objects:

```python
from aissemble_data_lineage import Dataset, InputDataset, OutputDataset

my_dataset = Dataset(name="dataset_1", facets={"facet_1": my_facet})
my_input_dataset = InputDataset(name="dataset_2", facets={}, input_facets={})
my_output_dataset = OutputDataset(name="dataset_3", facets={}, output_facets={})
```

```java
import io.openlineage.client.OpenLineage.InputDataset;

ArrayList<InputDataset> olInputs = new ArrayList<>();
if (inputs != null) {
    inputs.forEach(input -> {
        olInputs.add((InputDataset) input.getOpenLineageDataset());
    });
}
```

==== Runs
A `Run` is composed of one to many `Jobs`, and encompasses exactly one full processing pass over your data.

Example construction of `Run` objects:
```python
from aissemble_data_lineage import Run
from uuid import uuid4

my_run = Run(uuid4(), facets={})
```

```java
import com.boozallen.aiops.data.lineage.Run;

Run olRun = run.getOpenLineageRun();
```

==== Jobs
A `Job` represents a single discrete step in the data processing pipeline, such as ingestion, a specific transformation,
or any other action of note.

Example construction of `Job` objects:
```python
from aissemble_data_lineage import Job

my_job = Job("my_job", facets={})
```

```java
import com.boozallen.aiops.data.lineage.Job;

Job olJob = job.getOpenLineageJob();
```

== Additional Resources
The full aiSSEMBLE data lineage source code can be reviewed on
https://github.com/boozallen/aissemble/tree/dev/foundation/foundation-lineage/foundation-data-lineage[GitHub].

The Python methods detailing the generation of minimal instances of `Run`, `Job`, and `RunEvent` for emission are described
in `src/<pipeline-name>/generated/step/abstract_data_action.py`. Correspondingly, the Java methods outlining the same
are detailed in `src/generated/java/<package-name>/AbstractPipelineStep.java`.
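For reference, a minimal sketch of emitting a run event directly with the OpenLineage Python client is shown below. The
transport URL, namespace, and producer values are assumptions for illustration; in practice the generated aiSSEMBLE code
referenced above performs the emission for you.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Assumed HTTP endpoint for an OpenLineage-compatible collector.
client = OpenLineageClient(url="http://localhost:5000")

event = RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="my-namespace", name="my_job"),
    producer="https://example.org/my-producer",
)
client.emit(event)
```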