#5 📝 Tranche 3 of documentation migration
Commit 0373f45 (1 parent: e4396d2). Showing 10 changed files with 1,732 additions and 1 deletion.
docs/modules/ROOT/pages/data-delivery-pipeline-overview.adoc (173 additions, 0 deletions)
[#_data_delivery_pipeline_overview]
= Data Delivery Pipelines Overview

== Overview

Data Delivery covers activities including the ingestion, storage, transformation, augmentation, and delivery of data. Clean
data is essential for machine learning because it provides accurate, reliable, and unbiased input, leading to better model
performance and trustworthy results. Data Delivery can also be applied in any system that needs a consistent mechanism for
handling and preparing data. The following sections will help you understand the pipeline components and determine where
to modify and customize elements to suit your specific implementation.
== What Are Data Delivery Pipelines?

The purpose of a data delivery pipeline is to process data into derivative forms. To simplify the incorporation of
commonly used pipeline components, pre-constructed patterns have been developed and can be included in your project.
If you are adding a data delivery pipeline to your project, only a few steps are necessary to incorporate generated
code for the major data actions, such as ingest, transform, and enrich. At the abstract level, Data Delivery consists of
two primary concepts:

. Coordinating Data Flow Steps: coordinating two or more Data Actions to achieve some data result.
. Performing a Data Action: any action on data (e.g., enrich, validate, transform).

aiSSEMBLE supports two data delivery implementations out of the box: Java Spark and PySpark.
== What Gets Generated

=== Docker Image

The generation process creates a new submodule under the `<project-name>-docker` module, called
`<project-name>-spark-worker-docker`, which is responsible for creating a Docker image for Data Delivery activities.

|===
|Generated file | Description

|`<project-name>/<project-name>-docker/<project-name>-spark-worker-docker/src/main/resources/docker/Dockerfile`
|Builds the Spark worker Docker image used to run Data Delivery activities.
|===
=== SparkApplication YAML

aiSSEMBLE packages a https://helm.sh/[Helm,role=external,window=_blank] chart to support the execution of data delivery
pipelines as SparkApplication deployments through the Spark Operator. Read the
xref:guides/guides-spark-job.adoc#_configuring_and_executing_spark_jobs[Configuring and Executing Spark Jobs]
guide for more information on configuring execution for different environments.

|===
|Generated file | Description

|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-base-values.yaml`
|Base configuration common to all execution environments.

|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-ci-values.yaml`
|Configuration for executing the pipeline in a continuous integration environment.

|`<project-name>-pipelines/<pipeline-name>/src/main/resources/apps/<pipeline-name>-debug-values.yaml`
|Configuration for local debugging of the pipeline.
|===
=== Invocation Service

The Invocation Service is an additional option for executing pipelines. It is generated into
`<project-name>-deploy/src/main/resources/apps/pipeline-invocation-service`. The deployment sets up the service to
submit Spark applications to Kubernetes when it is sent the name of the pipeline to execute. The service
can be accessed through HTTP POST requests or through the xref:/messaging-details.adoc#_messaging_details[messaging module].
To use the service to execute a pipeline, simply send a request to the Invocation Service's
`start-spark-operator-job` endpoint, as described below:

=== POST /invoke-pipeline/start-spark-operator-job

Creates and submits a SparkApplication invocation for a specific pipeline to the Kubernetes cluster by leveraging the
pipeline's values files to configure the
https://github.com/boozallen/aissemble/tree/dev/extensions/extensions-helm/aissemble-spark-application-chart[aissemble-spark-application,role=external,window=_blank]
Helm chart.
[%collapsible]
====
****
*Parameters*

|===
|*Name* | *Description*

|applicationName
|The name of the pipeline to invoke, in lower kebab case (e.g., `my-pipeline`).

|profile
|Specifies the execution profile, indicating the values files to layer together. One of "prod", "ci", or "dev".

|overrideValues
|Additional, individual values to layer on top of the profile's values files, corresponding to the `--values` Helm
command line option.
|===

.Sample data input:
[source,JSON]
----
{
  "applicationName": "my-pipeline",
  "profile": "ci",
  "overrideValues": {
    "metadata.name": "testapp"
  }
}
----

.Sample data output:
[source,text]
----
Submitted my-pipeline
----
****
====
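For illustration, the endpoint can be called from any HTTP client. The sketch below uses Python's `requests` library; the
service hostname and port (`pipeline-invocation-service:8080`) are placeholders, so substitute the address at which the
service is exposed in your environment.

[source,python]
----
# Illustrative only: the service host and port below are placeholders for
# wherever the pipeline invocation service is exposed in your environment.
import requests

body = {
    "applicationName": "my-pipeline",
    "profile": "ci",
    "overrideValues": {"metadata.name": "testapp"},
}

response = requests.post(
    "http://pipeline-invocation-service:8080/invoke-pipeline/start-spark-operator-job",
    json=body,
)
print(response.status_code, response.text)  # e.g., 200 Submitted my-pipeline
----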
=== GET /invoke-pipeline/healthcheck

A health check endpoint that returns HTTP `200` if the invocation service is up and running.

[%collapsible]
====
****
*Parameters*

|===
|*Name* | *Description*

|n/a
|
|===
****
====

Alternatively, you may emit a message through your message broker to the `pipeline-invocation` topic using the same body
format shown above.
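As a rough sketch, the message can be published with any Kafka-compatible client. The example below assumes the
`kafka-python` package and a broker at `localhost:9092`; swap both for whatever your project's messaging module is
configured to use.

[source,python]
----
# Illustrative sketch: assumes the kafka-python client and a broker reachable
# at localhost:9092; adjust both for your project's messaging configuration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Same body format as the start-spark-operator-job endpoint
body = {
    "applicationName": "my-pipeline",
    "profile": "ci",
    "overrideValues": {"metadata.name": "testapp"},
}

producer.send("pipeline-invocation", json.dumps(body).encode("utf-8"))
producer.flush()
----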
==== Configuration

The pipeline invocation service can be configured with the application options in the table below. See the
https://github.com/boozallen/aissemble/tree/dev/extensions/extensions-helm/extensions-helm-pipeline-invocation/aissemble-pipeline-invocation-app-chart#readme[Pipeline Invocation Chart documentation,role=external,window=_blank]
README for information on how to provide this configuration.

=== Application options

|===
|Value | Description | Default | Valid Options

|`service.pipelineInvocation.failureStrategy.global`
|Sets the global failure strategy in the event of a processing error.
|`LOG`
|`SILENT`, `LOG`, `EXCEPTIONAL`

|`service.pipelineInvocation.failureStrategy.[pipeline-name]`
|Overrides the global failure strategy for a specific pipeline.
|None
|`SILENT`, `LOG`, `EXCEPTIONAL`

|`service.pipelineInvocation.execution.profile`
|Default execution profile.
|`dev`
|`dev`, `ci`, `prod`
|===

See the https://github.com/boozallen/aissemble/blob/dev/extensions/extensions-helm/extensions-helm-pipeline-invocation/aissemble-pipeline-invocation-app-chart/README.md[Helm Chart documentation,role=external,window=_blank]
README for guidance on application configurations for the service.
== Configuring your Pipeline

=== Persistence

=== Connection to RDBMS

IMPORTANT: The following instructions apply only when the persist type specified for a pipeline is RDBMS.

If the persist type specified for a pipeline is RDBMS (e.g., PostgreSQL, SQLite, MySQL), then a method is automatically
added to your pipeline step's base class (e.g., `ingest_base.py`) to retrieve the database connection.
The primary use case is calling this generated function on the step base class to open a connection to the database and
perform insert, update, or select operations.
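As a rough sketch, a step implementation might use the generated connection accessor as shown below. The class and method
names (`Ingest`, `IngestBase`, `get_db_connection`) are placeholders; check your pipeline's generated base class (e.g.,
`ingest_base.py`) for the actual names produced for your pipeline.

[source,python]
----
# Illustrative sketch only: Ingest, IngestBase, and get_db_connection are
# placeholder names; use the class and accessor generated for your pipeline.
from ingest_base import IngestBase


class Ingest(IngestBase):
    def save_record(self, name: str, value: float) -> None:
        # Retrieve the database connection via the generated base-class method
        connection = self.get_db_connection()
        try:
            with connection.cursor() as cursor:
                cursor.execute(
                    "INSERT INTO my_table (name, value) VALUES (%s, %s)",
                    (name, value),
                )
            connection.commit()
        finally:
            connection.close()
----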
NOTE: To avoid committing sensitive connection values directly into your code, it is recommended that you keep them in a
secure configuration, for example with https://pypi.org/project/krausening/[Krausening,role=external,window=_blank].
= Data Lineage

== Overview

Data lineage represents the origins, movements, transformations, and flows of data. It gives data scientists and engineers
clear visibility into processing, simplifying tracking of the full data lifecycle. Its granularity can vary and can be
tailored to the needs of a particular project at a particular time.
== Leveraging Data Lineage

To integrate data lineage into your aiSSEMBLE(TM)-based project, you only need to opt in through your pipeline
model definition, as described in the xref:pipeline-metamodel.adoc[detailed pipeline options]. aiSSEMBLE's factory
will create your pipeline with the necessary dependencies, as well as convenient stubs!

Data lineage is a rapidly evolving industry topic, with new and improved standards and formats emerging frequently. We
provide generic types, capable of representing any lineage metadata, which can self-convert to and from other formats,
such as the https://openlineage.io[OpenLineage] standard. We also provide functions for conveying your data lineage
throughout your aiSSEMBLE-powered ecosystem, such as via Kafka messaging.
=== Data Lineage Types & Objects

Our data lineage modules are modeled after the OpenLineage format. We provide the following basic representation types:

* `Facet`
* `Dataset`, `InputDataset`, and `OutputDataset`
* `Job` and `Run`

==== Facets

A `Facet` describes some element of a data lineage action or entity. `Runs`, `Jobs`, and `Datasets` can all have facets,
though not all facets would make sense on each. For instance, one could construct a `Facet` to represent the source of
a given `Dataset`.
The `Facet` type provided by aiSSEMBLE is intentionally generic. It can be extended within your project to construct
your own customized `Facets`, or it can be created directly from OpenLineage facets using the `from_open_lineage_facet`
function. Custom `Facets` should override the appropriate conversion methods, such as `get_open_lineage_facet`, to address
specific conversion needs.

Facets are described by OpenLineage-format JSON schemas, and expect a URI string representing the path to that schema
as their only required argument.
Example construction of `Facet` objects:

```python
from aissemble_data_lineage import Facet, from_open_lineage_facet
from openlineage.client.facet import DataSourceDatasetFacet

my_facet = Facet("http://my_schema_url.org/my_schema.json")
converted_facet = from_open_lineage_facet(DataSourceDatasetFacet(name="dataset_name", uri="http://my_datasource"))
```

```java
import io.openlineage.client.OpenLineage;

OpenLineage openLineage = new OpenLineage(producer);
OpenLineage.RunFacets runFacets = openLineage.newRunFacetsBuilder().build();
```
To define your own custom `Facet`, you will need to create a new class that extends OpenLineage's `BaseFacet` and define
the attributes that the `Facet` will track. You can then use the `from_open_lineage_facet` function to set the values
of the custom `Facet`.

Example construction of a customized `Facet` object:

```python
from aissemble_data_lineage import from_open_lineage_facet
from openlineage.client.facet import BaseFacet
import attr


@attr.s(init=False)
class MyCustomFacet(BaseFacet):
    custom_attr: str = attr.ib()
    # Attributes listed here are skipped by the OpenLineage client's value redaction
    _additional_skip_redact = ["custom_attr"]

    def __init__(self, custom_attr):
        super().__init__()
        self.custom_attr = custom_attr


custom_facet = from_open_lineage_facet(MyCustomFacet(custom_attr="Custom Value"))
```
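Alternatively, you can extend aiSSEMBLE's generic `Facet` type directly and override `get_open_lineage_facet` to handle the
conversion yourself. The subclass below is an illustrative sketch rather than a provided class, and the schema URI shown is
only an example:

```python
from aissemble_data_lineage import Facet
from openlineage.client.facet import DataSourceDatasetFacet


class DataSourceFacet(Facet):
    """Illustrative custom facet recording where a dataset came from."""

    def __init__(self, name: str, uri: str):
        # The generic Facet takes the schema URI as its only required argument
        super().__init__("https://openlineage.io/spec/facets/1-0-0/DataSourceDatasetFacet.json")
        self.name = name
        self.uri = uri

    def get_open_lineage_facet(self) -> DataSourceDatasetFacet:
        # Convert to the equivalent OpenLineage facet when lineage is emitted
        return DataSourceDatasetFacet(name=self.name, uri=self.uri)
```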
==== Datasets

`Dataset` objects represent just that: a known collection of data following a common path through the processing system.
For cases where inputs and outputs are distinct from processed data, we provide the `InputDataset` and `OutputDataset`
objects.

Example construction of `Dataset` objects:

```python
from aissemble_data_lineage import Dataset, InputDataset, OutputDataset

my_dataset = Dataset(name="dataset_1", facets={"facet_1": my_facet})
my_input_dataset = InputDataset(name="dataset_2", facets={}, input_facets={})
my_output_dataset = OutputDataset(name="dataset_3", facets={}, output_facets={})
```

```java
import io.openlineage.client.OpenLineage.InputDataset;

ArrayList<InputDataset> olInputs = new ArrayList<>();
if (inputs != null) {
    inputs.forEach(input -> {
        olInputs.add((InputDataset) input.getOpenLineageDataset());
    });
}
```
==== Runs

A `Run` is composed of one to many `Jobs`, and encompasses exactly one full processing pass over your data.

Example construction of `Run` objects:

```python
from aissemble_data_lineage import Run
from uuid import uuid4

my_run = Run(uuid4(), facets={})
```

```java
import com.boozallen.aiops.data.lineage.Run;

Run olRun = run.getOpenLineageRun();
```
==== Jobs

A `Job` represents a single discrete step in the data processing pipeline, such as ingestion, a specific transformation,
or any other action of note.

Example construction of `Job` objects:

```python
from aissemble_data_lineage import Job

my_job = Job("my_job", facets={})
```

```java
import com.boozallen.aiops.data.lineage.Job;

Job olJob = job.getOpenLineageJob();
```
== Additional Resources

The full aiSSEMBLE data lineage source code can be reviewed on
https://github.com/boozallen/aissemble/tree/dev/foundation/foundation-lineage/foundation-data-lineage[GitHub].

The Python methods detailing the generation of minimal instances of `Run`, `Job`, and `RunEvent` for emission are described
in `src/<pipeline-name>/generated/step/abstract_data_action.py`. Correspondingly, the Java methods outlining the same
are detailed in `src/generated/java/<package-name>/AbstractPipelineStep.java`.