Merge pull request #36 from boozallen/5-migrate-documentation-tranche-6
#5 📝 Tranche 6 of documentation migration
Showing 19 changed files with 709 additions and 0 deletions.

@@ -0,0 +1,84 @@
= Databricks Support
:source-highlighter: rouge

//todo can we outsource most of these steps and just call out the aiSSEMBLE-specific stuff?
There are multiple ways you can deploy a data delivery pipeline. One way is to leverage the
https://databricks.com/product/data-lakehouse[Databricks,role=external,window=_blank] managed environment.

== Creating a cluster
To deploy a Spark job on Azure Databricks, we first need to define a cluster. The steps for creating a cluster that
is capable of running your pipeline are listed below. Start by selecting `Compute` on the far left and then select
`Create Cluster`.

image::azure-new-cluster.png[]

1. Name your cluster.
2. The runtime needs to support Spark 3.0.1 and Scala 2.12, so choose `Runtime: 7.3 LTS`.
3. Select a Worker Type with enough resources to run your data delivery pipeline. You can always go back and change
this setting, so there's no harm in starting small and increasing it at a later time.
4. The Driver Type can be any type of server or the same as the Worker Type.
5. Expand the Advanced Options.
6. In the Spark Config box, you can add Java options. If your project uses Krausening to configure your properties,
you can set the following parameters:

  spark.driver.extraJavaOptions
  -DKRAUSENING_BASE=/dbfs/FileStore/shared_uploads/project-name/krausening/base
  -DKRAUSENING_EXTENSIONS=/dbfs/FileStore/shared_uploads/project-name/krausening/databricks
  -DKRAUSENING_PASSWORD=3uQ2j_=wmP5A2q8b

7. When you are done configuring, select `Create Cluster`.

== Creating a job

Your data delivery project is executed in Databricks through a `Job`. In the following steps, we will define this job.

image::azure-create-job.png[]

Click on the `Jobs` menu item on the far left, then click `Create Job`.

image::azure-job-details.png[Job Detail, 500]

1. Give your task a name.
2. Select the `Jar` type.
3. Enter your Spark job's fully qualified main class name.
4. Click `Add` to add your jar file.
5. Select the cluster we created in the previous section.
6. Click `Create`.

== Initialize and configure environment

With the cluster created and the Spark job defined, we now need to import the project's property files and initialize
any tables in the database. First, let's create a shared folder.

image::azure-create-folder.png[Create Folder, 500]

Click on the `Workspace` menu item on the far left, then right-click in the folder area. Then select `Create` >
`Folder`. Give the folder a name like `data_delivery_shared`.

image::azure-create-notebook.png[Create Notebook, 500]

To run SQL commands, we need a notebook. Creating a new notebook in our shared folder is easy: just click on the
options triangle next to the shared folder we just created, then select `Create` > `Notebook`.

image::azure-notebook-details.png[Notebook Details, 500]

1. Give your notebook a name.
2. Change the default language to `SQL`.
3. Make sure your cluster is selected.
4. Click `Create`.

In this notebook you can write any SQL (DDL) to create the tables necessary to support your pipeline.
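
If you prefer to keep this setup alongside Python code, the same DDL can also be issued from a Python cell (or a
`%python` cell in the SQL notebook) via `spark.sql`. The snippet below is a minimal sketch only; the database, table,
columns, and storage location are hypothetical and should be replaced with the DDL your pipeline actually requires.

[source,python]
----
# Illustrative only: create a table that the pipeline expects to write to.
# The database, table, columns, and location are placeholders.
spark.sql("CREATE DATABASE IF NOT EXISTS my_pipeline_db")

spark.sql("""
    CREATE TABLE IF NOT EXISTS my_pipeline_db.ingest_output (
        id STRING,
        payload STRING,
        ingest_timestamp TIMESTAMP
    )
    USING DELTA
    LOCATION '/FileStore/shared_uploads/project-name/tables/ingest_output'
""")
----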

Next, we need to import the project's property files. To do this, open the SQL notebook you just created (double-click
on the notebook name) and find the `File` menu item. Click on it and select `Upload Data`. Then, in the upload dialog
box, select your shared folder, then drag and drop your property files to upload them.

image::azure-upload-data.png[Upload Property Files, 500]

By default, Databricks will rename uploaded files to fit its syntax requirements. Often this means you will have to
rename your uploaded files back to `*.properties`. To do this, you can create a Python notebook and run the following command:

[source,python]
----
dbutils.fs.mv("/FileStore/shared_folder/path/to/your/files/my_file_properties", "/FileStore/shared_folder/path/to/your/krausening/files/hive-metadata.properties")
----
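
If Databricks renamed several uploaded files, a short loop can restore the `.properties` extension in one pass. This
is a minimal sketch under the assumption that each renamed file ends in `_properties`; adjust the folder path and the
matching rule to your workspace.

[source,python]
----
# Illustrative only: restore the ".properties" extension on uploaded files
# whose names were rewritten by Databricks during upload.
folder = "/FileStore/shared_folder/path/to/your/krausening/files"

for file_info in dbutils.fs.ls(folder):
    if file_info.name.endswith("_properties"):
        fixed_name = file_info.name[: -len("_properties")] + ".properties"
        dbutils.fs.mv(file_info.path, folder + "/" + fixed_name)
----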

Now that your tables are generated and your property files are loaded, you can launch the job by clicking on the
`Run Now` action (the play icon) on the `Jobs` tab.

image::azure-run-job.png[]

@@ -0,0 +1,115 @@
[#_messaging_details]
= Messaging

== Overview
Messaging is used to provide decoupled orchestration of events and data between pipeline steps and between the
pipeline and external systems. By leveraging messaging, the pipeline model can be used to define the flow of data
through the pipeline, instead of manually controlling the flow via the pipeline driver. Pipeline messaging utilizes
an implementation of the Eclipse MicroProfile Reactive Messaging specification. (See the <<Advanced-Details>> section for more info.)

=== Basic Concepts
The following messaging concepts are useful in understanding this document.

[cols="1,5"]
|===
| *Publisher*
| A unit of code that produces messages upon request. Sometimes called a source.

| *Subscriber*
| A unit of code that consumes messages. Sometimes called a sink.

| *Processor*
| A unit of code that is simultaneously a Publisher and a Subscriber. It consumes an incoming message and uses its
data to produce an outgoing message.

| *Channel*
| The means by which messages flow. All sources and processors push created messages onto channels, and all sinks
and processors pull messages from channels.
|===

In aiSSEMBLE, a channel is backed by the message broker service in the form of a queue or topic. The sources and
sinks of a pipeline can reside within the pipeline itself or in external systems connected via the message broker service.

image::pipeline-messaging-basic.svg[Messaging architecture]
//todo our What Gets Generated sections are mildly inconsistent. Worth unifying

== What Gets Generated

=== _microprofile-config.properties_
This is the standard configuration file per the reactive messaging specification. Any reactive messaging
configuration that does not pertain to pipeline messaging should be placed here.

=== _org.eclipse.microprofile.config.spi.ConfigSource_
This file specifies other custom reactive messaging configuration sources. By default, two custom sources are
created and registered in this file:

`com.boozallen.aiops.data.delivery.messaging.PipelineMessagingConfig`

The `PipelineMessagingConfig` source exposes fine-grained reactive messaging configuration that backs pipeline
messaging for advanced use cases. See the <<Customization>> section for more details.

`<pipeline-package-and-name>DefaultConfig`

The pipeline default configuration provides sensible defaults to support messaging within the pipeline. These
defaults can be overridden by utilizing the `PipelineMessagingConfig` mentioned above.

[#Customization]
== Customization

[#Advanced-Details]
=== Advanced Details
Pipeline messaging leverages the https://smallrye.io/smallrye-reactive-messaging[SmallRye implementation,role=external,window=_blank]
of the https://download.eclipse.org/microprofile/microprofile-reactive-messaging-1.0/microprofile-reactive-messaging-spec.html[Reactive
Messaging specification,role=external,window=_blank]. In order to fully understand how to customize messaging and what
can be customized, it's important to understand how reactive messaging is leveraged to achieve pipeline messaging.

For the most part, the concepts of pipeline messaging parallel the concepts of reactive messaging. The primary
difference is how channels operate. As discussed in <<Basic Concepts>>, a pipeline channel is backed by a message
broker service and flows all messages through the message broker. In contrast, a reactive messaging channel is
completely internal to the process that is using reactive messaging (e.g. the pipeline). In order to connect to
external systems, reactive messaging uses *Connectors* to attach a topic or queue in a message broker to either the
incoming or outgoing side of a channel. Therefore, a pipeline channel is better represented by the following model:

image::pipeline-messaging-channel.svg[Pipeline channel]

Using this expanded representation, we can redraw the previous pipeline messaging diagram as the following:

image::pipeline-messaging-adv.svg[Messaging implementation]

=== Configuration
aiSSEMBLE provides for advanced customization of the reactive channels and connectors that back pipeline messaging
via the `pipeline-messaging.properties` Krausening file.

All configuration properties outlined by the
https://download.eclipse.org/microprofile/microprofile-reactive-messaging-1.0/microprofile-reactive-messaging-spec.html#_configuration[Reactive
Messaging specification,role=external,window=_blank] and the
https://smallrye.io/smallrye-reactive-messaging/latest/concepts/connectors/#configuring-connectors[SmallRye
documentation,role=external,window=_blank] are available but must be translated to reference a pipeline step instead
of directly referencing a reactive channel by name. Instead of
`mp.messaging.[incoming|outgoing].[channel-name].[attribute]=[value]`, the configuration pattern becomes
`[step-name].[in|out].[attribute]=[value]`.

Consider the following example configuration from the SmallRye documentation:

[source,properties]
----
mp.messaging.incoming.health.topic=neo
mp.messaging.incoming.health.connector=smallrye-mqtt
mp.messaging.incoming.health.host=localhost
mp.messaging.outgoing.data.connector=smallrye-kafka
mp.messaging.outgoing.data.bootstrap.servers=localhost:9092
mp.messaging.outgoing.data.key.serializer=org.apache.kafka.common.serialization.StringSerializer
mp.messaging.outgoing.data.value.serializer=io.vertx.kafka.client.serialization.JsonObjectSerializer
mp.messaging.outgoing.data.acks=1
----

If this configuration were translated to a `pipeline-messaging.properties` configuration for a step named IngestData,
it would become the following:

[source,properties]
----
IngestData.in.topic=neo
IngestData.in.connector=smallrye-mqtt
IngestData.in.host=localhost
IngestData.out.connector=smallrye-kafka
IngestData.out.bootstrap.servers=localhost:9092
IngestData.out.key.serializer=org.apache.kafka.common.serialization.StringSerializer
IngestData.out.value.serializer=io.vertx.kafka.client.serialization.JsonObjectSerializer
IngestData.out.acks=1
----

@@ -0,0 +1,31 @@
= Path to Production

Aside from testing scaffolding and the Maven project structure, aiSSEMBLE(TM) generates several artifacts to help drive
consistency, repeatability, and quality of delivery. These artifacts are designed as starting points for a mature
DevOps-centric approach to delivering high-quality AI systems and are not intended as complete solutions in and of
themselves.

== Mapping to aiSSEMBLE Concepts
[#img-you-are-here-path-to-production]
.xref:solution-baseline-process.adoc[You Are Here]
image::you-are-here-path-to-production.png[You Are Here,200,100,role="thumb right"]

_Path to Production_: Many AI systems across the industry are developed much like a school project: a Data Engineer or
Data Scientist exports local code via a container and pushes that into production. This practice was once rampant in
software development as well. Over time, it has become an industry best practice to leverage a repeatable, consistent
_path to production_ to ensure quality and enable higher-velocity delivery. It's critical to recognize that, while
slower for an initial deployment, the speed and repeatability of this process are amongst the most important enablers
of moving faster over time (an attribute most clients revere).

== Containers

When generating a Docker module, aiSSEMBLE generates not only the Dockerfile but also the relevant Kubernetes manifest
files. See xref:containers.adoc[Container Support] for more details.

== CI/CD

Every project is incepted with a `devops` folder that contains configurations for the
https://plugins.jenkins.io/templating-engine/[Jenkins Templating Engine,role=external,window=_blank]. The generated
templates leverage libraries from the https://boozallen.github.io/sdp-docs/sdp-libraries/index.html[Solutions Delivery
Platform,role=external,window=_blank] that build the project and send notifications to a configured Slack channel.
See xref:ci-cd.adoc[CI/CD Support] for more details.

@@ -0,0 +1,140 @@
= Post-Training Actions

== Overview
Post-training actions can be specified in a machine-learning pipeline to apply additional post-training processing.
When one or more post-training actions are specified in a training step, scaffolding code is generated into the
training pipeline, and each post-training action is applied during the training run after the model is trained.
This page is intended to assist in understanding the generated components that are included when post-training
actions are specified.

== What Gets Generated

For details on how to specify post-training actions, please see the
xref:pipeline-metamodel.adoc#_pipeline_step_post_actions_element_options[Pipeline Step Post Actions Element Options].

=== Model Conversion Post-Training Action

A model-conversion post-training action can be used to convert the trained model to another model format. The
following sections describe the scaffolding code that is generated for a model-conversion post-training action.

==== ONNX

The following example post-training action will be used to describe what gets generated for ONNX model conversion:

.Example ONNX Model Conversion Post-Training Action
[source,json]
----
{
  "name": "ConvertModelToOnnxFormat",
  "type": "model-conversion",
  "modelTarget": "onnx",
  "modelSource": "sklearn"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/onnx_sklearn_model_conversion_base.py`
| Base class containing core code for leveraging ONNX to convert sklearn models. This class is regenerated with
every build, and therefore cannot be modified. A sketch of the underlying conversion follows this table.

The following methods are generated:

* `_convert` - performs the ONNX conversion.
* `_save` - saves the converted ONNX model.

In addition, ONNX conversion properties are generated with default values. These can be overridden in the
post-training action implementation class to specify custom values to pass to the ONNX conversion.

| `src/${python_package_name}/generated/post_action/convert_model_to_onnx_format_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - applies the ONNX conversion by calling the `_convert` and `_save` methods from the above class.

| `src/${python_package_name}/post_action/convert_model_to_onnx_format.py`
| Post-training action implementation class. This class is where properties and methods from the above base classes
can be overridden, if desired.

If the ONNX conversion has any required parameters, they will be generated here for manual implementation.

|===
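
Under the hood, the generated `_convert` and `_save` methods listed above amount to a call into the `skl2onnx`
library followed by serializing the resulting model. The sketch below is illustrative only; the model, feature
signature, and output path are assumptions, and in a generated pipeline you would rely on the base class rather than
hand-rolling this conversion.

[source,python]
----
# Illustrative only: a standalone sklearn-to-ONNX conversion with skl2onnx.
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Train a tiny placeholder model (2 features, 2 classes).
model = LogisticRegression().fit([[0.0, 1.0], [1.0, 0.0]], [0, 1])

# Declare the input signature expected by the ONNX graph.
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 2]))])

# Persist the converted model.
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
----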

==== Custom

The following example post-training action will be used to describe what gets generated for a custom model conversion:

.Example Custom Model Conversion Post-Training Action
[source,json]
----
{
  "name": "ConvertModelToCustomFormat",
  "type": "model-conversion",
  "modelTarget": "custom",
  "modelSource": "sklearn"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/custom_model_conversion_base.py`
| Base class containing core code for implementing a custom model conversion. This class is regenerated with every
build, and therefore cannot be modified.

The following methods are generated:

* `_convert` - abstract method to implement the custom conversion. This should be implemented in the post-training
action implementation class.
* `_save` - abstract method to implement the saving of the converted model. This should be implemented in the
post-training action implementation class.

| `src/${python_package_name}/generated/post_action/convert_model_to_custom_format_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - applies the custom conversion by calling the `_convert` and `_save` methods from the above class.

| `src/${python_package_name}/post_action/convert_model_to_custom_format.py`
| Post-training action implementation class. This class is where the `_convert` and `_save` methods should be
implemented (see the sketch following this table).

|===
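
As a rough illustration of the implementation class referenced above, the following sketch shows one way `_convert`
and `_save` could be filled in. The import path, method signatures, and custom format are placeholders; follow the
signatures in your generated `convert_model_to_custom_format_base.py`.

[source,python]
----
# Illustrative only: the import path, signatures, and "custom" format below are
# placeholders; mirror the generated convert_model_to_custom_format_base.py.
import pickle

from ..generated.post_action.convert_model_to_custom_format_base import (
    ConvertModelToCustomFormatBase,
)


class ConvertModelToCustomFormat(ConvertModelToCustomFormatBase):
    """Converts the trained model into a hypothetical custom format."""

    def _convert(self, model):
        # Wrap the trained model in whatever structure the target format requires.
        return {"format": "custom", "payload": model}

    def _save(self, converted_model):
        # Persist the converted model wherever the pipeline expects to find it.
        with open("model.custom.pkl", "wb") as f:
            pickle.dump(converted_model, f)
----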

=== Freeform Post-Training Action

A freeform post-training action can be used to apply any custom post-training processing. The following example
post-training action will be used to describe what gets generated for a freeform post-training action:

.Example Freeform Post-Training Action
[source,json]
----
{
  "name": "AdditionalProcessing",
  "type": "freeform"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/additional_processing_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - abstract method to implement the custom processing. This should be implemented in the post-training
action implementation class.

| `src/${python_package_name}/post_action/additional_processing.py`
| Post-training action implementation class. This class is where the `apply` method should be implemented (see the
sketch following this table).

|===
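
For the freeform case, the implementation class only needs to supply the `apply` method. The sketch below is
illustrative only; the import path and method signature are placeholders, so follow the signatures in your generated
`additional_processing_base.py`.

[source,python]
----
# Illustrative only: the import path and signature below are placeholders;
# mirror the generated additional_processing_base.py.
import logging

from ..generated.post_action.additional_processing_base import AdditionalProcessingBase

logger = logging.getLogger(__name__)


class AdditionalProcessing(AdditionalProcessingBase):
    """Applies custom processing after the model has been trained."""

    def apply(self):
        # Any custom post-training logic can go here, e.g. publishing metrics,
        # registering the model, or notifying a downstream system.
        logger.info("Running additional post-training processing")
----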