Merge pull request #36 from boozallen/5-migrate-documentation-tranche-6
#5 📝 Tranche 6 of documentation migration
d-ryan-ashcraft authored May 2, 2024
2 parents a43eacf + e13d35c commit e8eeed8
Showing 19 changed files with 709 additions and 0 deletions.
Binary file added docs/modules/ROOT/images/azure-create-folder.png
Binary file added docs/modules/ROOT/images/azure-create-job.png
Binary file added docs/modules/ROOT/images/azure-job-details.png
Binary file added docs/modules/ROOT/images/azure-new-cluster.png
Binary file added docs/modules/ROOT/images/azure-run-job.png
Binary file added docs/modules/ROOT/images/azure-upload-data.png
3 changes: 3 additions & 0 deletions docs/modules/ROOT/images/pipeline-messaging-adv.svg
3 changes: 3 additions & 0 deletions docs/modules/ROOT/images/pipeline-messaging-basic.svg
3 changes: 3 additions & 0 deletions docs/modules/ROOT/images/pipeline-messaging-channel.svg
84 changes: 84 additions & 0 deletions docs/modules/ROOT/pages/databricks.adoc
@@ -0,0 +1,84 @@
= Databricks Support
:source-highlighter: rouge

//todo can we outsource most of these steps and just call out the aiSSEMBLE-specific stuff?
There are multiple ways you can deploy a data delivery pipeline. One way is to leverage the
https://databricks.com/product/data-lakehouse[Databricks,role=external,window=_blank] managed environment.

== Creating a cluster

To deploy a Spark job on Azure Databricks, we first need to define a cluster. The steps for creating a cluster that
is capable of running your pipeline are listed below. Start by selecting `Compute` on the far left and then select
`Create Cluster`.

image::azure-new-cluster.png[]

1. Name your cluster
2. The runtime needs to support Spark 3.0.1 and Scala 2.12, so choose `Runtime: 7.3 LTS`
3. Select a Worker Type with enough resources to run your data delivery pipeline. You can always go back and change
this setting, so there's no harm in starting small and increasing it later.
4. The Driver Type can be any type of server or the same as the Worker Type.
5. Expand the Advanced Options
6. In the Spark Config box, you can add Java options. If your project uses Krausening to configure your properties,
you can set the following parameters:
+
----
spark.driver.extraJavaOptions
-DKRAUSENING_BASE=/dbfs/FileStore/shared_uploads/project-name/krausening/base
-DKRAUSENING_EXTENSIONS=/dbfs/FileStore/shared_uploads/project-name/krausening/databricks
-DKRAUSENING_PASSWORD=3uQ2j_=wmP5A2q8b
----

7. When you are done configuring, select `Create Cluster`.

== Creating a job

Your data delivery project is executed in Databricks through a `Job`. In the following steps we will define this job.

image::azure-create-job.png[]

Click on the `Jobs` menu item on the far left, then click `Create Job`.

image::azure-job-details.png[Job Detail, 500]

1. Give your task a name
2. Select `Jar` Type
3. Enter your Spark job's fully qualified main class name
4. Click `Add` to add your jar file
5. Select the cluster we created in the previous section
6. Click `Create`

== Initialize and configure environment

Now, with the cluster created and the Spark job defined, we need to import the project's property files and initialize
any tables in the database. First, let's create a shared folder.

image::azure-create-folder.png[Create Folder, 500]

Click on the `Workspace` menu item on the far left, then right-click in the folder area and select `Create` >
`Folder`. Give the folder a name like `data_delivery_shared`.

image::azure-create-notebook.png[Create Notebook, 500]

To run SQL commands, we need a notebook. To create a new notebook in our shared folder, click on the options
triangle next to the shared folder we just created, then select `Create` > `Notebook`.

image::azure-notebook-details.png[Notebook Details, 500]

1. Give your notebook a name
2. Change the default language to `SQL`
3. Make sure your cluster is selected
4. Click `Create`

In this notebook, you can write any SQL (DDL) needed to create the tables that support your pipeline.
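
For illustration only, the following is a hypothetical DDL statement of the kind you might run in this notebook. The
table name and columns are assumptions; in the SQL notebook you would simply run the raw `CREATE TABLE` statement,
while the `spark.sql()` wrapper shown here also makes the snippet valid in a Python cell.

[source,python]
----
# Hypothetical example -- replace the table name and columns with the schema
# your pipeline actually expects. `spark` is provided by the Databricks
# notebook runtime.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_pipeline_table (
        id STRING,
        payload STRING,
        ingest_timestamp TIMESTAMP
    )
""")
----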

Next, we need to import the project's property files. To do this, open the SQL notebook you just created (double-click
on the notebook name) and find the `File` menu item. Click on it and select `Upload Data`. Then, in the upload dialog
box, select your shared folder, drag and drop your property files, and upload them.

image::azure-upload-data.png[Upload Property Files, 500]

By default, Databricks will rename uploaded files to fit its syntax requirements. Often this means you will have to
rename your uploaded files back to `*.properties`. To do this, you can create a Python notebook and run the following
command:
[source,python]
----
dbutils.fs.mv("/FileStore/shared_folder/path/to/your/files/my_file_properties", "/FileStore/shared_folder/path/to/your/krausening/files/hive-metadata.properties")
----
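
If you uploaded several property files, you can rename them all at once instead of moving each file individually. The
following is a minimal, hypothetical sketch of a Python notebook cell that does this; the folder path and the
`_properties` naming pattern are assumptions based on the example above, so adjust them to match your own upload.

[source,python]
----
# Hypothetical helper for a Databricks Python notebook (`dbutils` is provided
# by the notebook runtime). Renames every uploaded file ending in
# "_properties" back to a ".properties" extension.
folder = "/FileStore/shared_uploads/project-name/krausening/base"  # assumed path

for file_info in dbutils.fs.ls(folder):
    if file_info.name.endswith("_properties"):
        new_name = file_info.name[: -len("_properties")] + ".properties"
        dbutils.fs.mv(file_info.path, folder + "/" + new_name)
----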

Now that your tables are generated and your property files are loaded, you can launch the job by clicking on the
`Run Now` action (the play icon) on the `Jobs` tab.

image::azure-run-job.png[]
115 changes: 115 additions & 0 deletions docs/modules/ROOT/pages/messaging-details.adoc
@@ -0,0 +1,115 @@
[#_messaging_details]
= Messaging

== Overview

Messaging is used to provide a decoupled orchestration of events and data between pipeline steps and between the
pipeline and external systems. By leveraging messaging, the pipeline model can be used to define the flow of data
through the pipeline, instead of manually controlling the flow via the pipeline driver. Pipeline messaging utilizes
an implementation of the Eclipse MicroProfile Reactive Messaging specification. (See the <<Advanced-Details>> section for more information.)

=== Basic Concepts

The following messaging concepts are useful in understanding this document.

[cols="1,5"]
|===
| *Publisher*
| A unit of code that produces messages upon request. Sometimes called a source.
| *Subscriber*
| A unit of code that consumes messages. Sometimes called a sink.
| *Processor*
| A unit of code that is simultaneously a Publisher and Subscriber. It consumes an incoming message and uses the
data to produce an outgoing message.
| *Channel*
| The means by which messages flow. All sources and processors push created messages onto channels, and all sinks
and processors pull messages from channels.
|===

In aiSSEMBLE, a channel is backed by the message broker service in the form of a queue or topic. The sources and
sinks of a pipeline can reside within the pipeline itself or in external systems connected via the message broker service.

image::pipeline-messaging-basic.svg[Messaging architecture]

//todo our What Gets Generated sections are mildly inconsistent. Worth unifying
== What Gets Generated

=== _microprofile-config.properties_

This is the standard configuration file per the reactive messaging specification. Any reactive messaging
configuration that does not pertain to pipeline messaging should be placed here.

=== _org.eclipse.microprofile.config.spi.ConfigSource_

This file specifies other custom reactive messaging configuration sources. By default, two custom sources are
created and registered in this file:

`com.boozallen.aiops.data.delivery.messaging.PipelineMessagingConfig`

The `PipelineMessagingConfig` source exposes fine-grained reactive messaging configuration that backs pipeline
messaging for advanced use cases. See the <<Customization>> section for more details.

`<pipeline-package-and-name>DefaultConfig`

The pipeline default configuration provides sensible defaults to support messaging within the pipeline. These
defaults can be overridden by utilizing the `PipelineMessagingConfig` mentioned above.

[#Customization]
== Customization

[#Advanced-Details]
=== Advanced Details

Pipeline messaging leverages the https://smallrye.io/smallrye-reactive-messaging[SmallRye implementation,role=external,window=_blank]
of the https://download.eclipse.org/microprofile/microprofile-reactive-messaging-1.0/microprofile-reactive-messaging-spec.html[Reactive
Messaging specification,role=external,window=_blank]. In order to fully understand how to customize messaging and what
can be customized, it's important to understand how reactive messaging is leveraged to achieve pipeline messaging.

For the most part, the concepts of pipeline messaging parallel the concepts of reactive messaging. The primary
difference is how channels operate. As discussed in <<Basic Concepts>>, a pipeline channel is backed by a message
broker service and flows all messages through the message broker. In contrast, a reactive messaging channel is
completely internal to the process that is using reactive messaging (e.g. the pipeline). In order to connect to
external systems, reactive messaging uses *Connectors* to attach a topic or queue in a message broker to either the
incoming or outgoing side of a channel. Therefore, a pipeline channel is better represented by the following model:

image::pipeline-messaging-channel.svg[Pipeline channel]

Using this expanded representation, we can redraw the previous pipeline messaging diagram as the following:

image::pipeline-messaging-adv.svg[Messaging implementation]

=== Configuration

aiSSEMBLE provides for advanced customization of the reactive channels and connectors that back pipeline messaging
via the `pipeline-messaging.properties` Krausening file.

All configuration properties outlined by the
https://download.eclipse.org/microprofile/microprofile-reactive-messaging-1.0/microprofile-reactive-messaging-spec.html#_configuration[Reactive
Messaging specification,role=external,window=_blank] and the
https://smallrye.io/smallrye-reactive-messaging/latest/concepts/connectors/#configuring-connectors[SmallRye
documentation,role=external,window=_blank] are available but must be translated to reference a pipeline step instead
of directly referencing a reactive channel by name. Instead of
`mp.messaging.[incoming|outgoing].[channel-name].[attribute]=[value]`, the configuration pattern becomes
`[step-name].[in|out].[attribute]=[value]`.

Consider the following example configuration from the SmallRye documentation:

[source,properties]
----
mp.messaging.incoming.health.topic=neo
mp.messaging.incoming.health.connector=smallrye-mqtt
mp.messaging.incoming.health.host=localhost
mp.messaging.outgoing.data.connector=smallrye-kafka
mp.messaging.outgoing.data.bootstrap.servers=localhost:9092
mp.messaging.outgoing.data.key.serializer=org.apache.kafka.common.serialization.StringSerializer
mp.messaging.outgoing.data.value.serializer=io.vertx.kafka.client.serialization.JsonObjectSerializer
mp.messaging.outgoing.data.acks=1
----

If this configuration were translated to a `pipeline-messaging.properties` configuration for a step named `IngestData`,
it would become the following:

[source,properties]
----
IngestData.in.topic=neo
IngestData.in.connector=smallrye-mqtt
IngestData.in.host=localhost
IngestData.out.connector=smallrye-kafka
IngestData.out.bootstrap.servers=localhost:9092
IngestData.out.key.serializer=org.apache.kafka.common.serialization.StringSerializer
IngestData.out.value.serializer=io.vertx.kafka.client.serialization.JsonObjectSerializer
IngestData.out.acks=1
----
31 changes: 31 additions & 0 deletions docs/modules/ROOT/pages/path-to-production.adoc
@@ -0,0 +1,31 @@
= Path to Production

Aside from testing scaffolding and the Maven project structure, aiSSEMBLE(TM) generates several artifacts to help drive
consistency, repeatability, and quality of delivery. These artifacts are designed as starting points for a mature
DevOps-centric approach for delivering high-quality AI systems and are not intended as complete solutions in and of
themselves.

== Mapping to aiSSEMBLE Concepts
[#img-you-are-here-path-to-production]
.xref:solution-baseline-process.adoc[You Are Here]
image::you-are-here-path-to-production.png[You Are Here,200,100,role="thumb right"]

_Path to Production_: Across the industry, many AI systems are developed much like a school project: a Data Engineer or
Data Scientist exports local code via a container and pushes it into production. This practice was once rampant in
software development as well. Over time, it has become an industry best practice to leverage a repeatable, consistent
_path to production_ to ensure quality and enable higher-velocity delivery. It's critical to recognize that, while this
process is slower for an initial deployment, its speed and repeatability are amongst the most important enablers of
moving faster over time (an attribute most clients revere).

== Containers

When generating a Docker module, aiSSEMBLE generates not only the Dockerfile but also the relevant Kubernetes manifest
files. See xref:containers.adoc[Container Support] for more details.

== CI/CD

Every project is created with a `devops` folder that contains configurations for the
https://plugins.jenkins.io/templating-engine/[Jenkins Templating Engine,role=external,window=_blank]. The generated
templates leverage libraries from the https://boozallen.github.io/sdp-docs/sdp-libraries/index.html[Solutions Delivery
Platform,role=external,window=_blank] that build the project and send notifications to a configured Slack channel.
See xref:ci-cd.adoc[CI/CD Support] for more details.
140 changes: 140 additions & 0 deletions docs/modules/ROOT/pages/post-actions.adoc
@@ -0,0 +1,140 @@
= Post-Training Actions

== Overview
Post-training actions can be specified in a machine-learning pipeline to apply additional post-training processing.
When one or more post-training actions are specified in a training step, scaffolding code is generated into the
training pipeline, and each post-training action is applied during the training run after the model is trained.
This page is intended to assist in understanding the generated components that are included when post-training
actions are specified.

== What Gets Generated

For details on how to specify post-training actions, please see the
xref:pipeline-metamodel.adoc#_pipeline_step_post_actions_element_options[Pipeline Step Post Actions Element Options].

=== Model Conversion Post-Training Action

A model-conversion post-training action can be used to convert the trained model to another model format. The
following sections describe the scaffolding code that is generated for a model-conversion post-training action.

==== ONNX

The following example post-training action will be used to describe what gets generated for ONNX model conversion:

.Example ONNX Model Conversion Post-Training Action
[source,json]
----
{
"name": "ConvertModelToOnnxFormat",
"type": "model-conversion",
"modelTarget": "onnx",
"modelSource": "sklearn"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/onnx_sklearn_model_conversion_base.py`
| Base class containing core code for leveraging ONNX to convert sklearn models. This class is regenerated with
every build, and therefore cannot be modified.

The following methods are generated:

* `_convert` - performs the ONNX conversion.
* `_save` - saves the converted ONNX model.

In addition, ONNX conversion properties are generated with default values. These can be overridden in the
post-training action implementation class to specify custom values to pass to the ONNX conversion.

| `src/${python_package_name}/generated/post_action/convert_model_to_onnx_format_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - applies the ONNX conversion by calling the `_convert` and `_save` methods from the above class.

| `src/${python_package_name}/post_action/convert_model_to_onnx_format.py`
| Post-training action implementation class. This class is where properties and methods from the above base classes
can be overridden, if desired; a hypothetical sketch follows this table.

If the ONNX conversion has any required parameters, they will be generated here for manual implementation.

|===
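
To make the relationship between these generated files concrete, below is a minimal, hypothetical sketch of what the
implementation class in `convert_model_to_onnx_format.py` might look like when overriding one of the generated ONNX
conversion properties. The base class name, the relative import path, and the `initial_types` property are assumptions
for illustration; consult the generated base classes for the actual names, signatures, and default values.

[source,python]
----
# Hypothetical sketch of the post-training action implementation class.
# The base class/module names and the overridden property are assumptions;
# the generated convert_model_to_onnx_format_base.py defines the real ones.
from skl2onnx.common.data_types import FloatTensorType

from ..generated.post_action.convert_model_to_onnx_format_base import (
    ConvertModelToOnnxFormatBase,  # assumed class name
)


class ConvertModelToOnnxFormat(ConvertModelToOnnxFormatBase):
    """Overrides generated defaults for the sklearn-to-ONNX conversion."""

    @property
    def initial_types(self):
        # Assumed conversion property: declare the model's input signature
        # (here, a single float tensor with four features) for skl2onnx.
        return [("input", FloatTensorType([None, 4]))]
----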


==== Custom

The following example post-training action will be used to describe what gets generated for a custom model conversion:

.Example Custom Model Conversion Post-Training Action
[source,json]
----
{
"name": "ConvertModelToCustomFormat",
"type": "model-conversion",
"modelTarget": "custom",
"modelSource": "sklearn"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/custom_model_conversion_base.py`
| Base class containing core code for implementing a custom model conversion. This class is regenerated with every
build, and therefore cannot be modified.

The following methods are generated:

* `_convert` - abstract method to implement the custom conversion. This should be implemented in the post-training
action implementation class.
* `_save` - abstract method to implement the saving of the converted model. This should be implemented in the
post-training action implementation class.

| `src/${python_package_name}/generated/post_action/convert_model_to_custom_format_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - applies the custom conversion by calling the `_convert` and `_save` methods from the above class.

| `src/${python_package_name}/post_action/convert_model_to_custom_format.py`
| Post-training action implementation class. This class is where the `_convert` and `_save` methods should be implemented; a hypothetical sketch follows this table.

|===
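
For illustration, the following is a minimal, hypothetical sketch of the implementation class in
`convert_model_to_custom_format.py`, with `_convert` serializing the trained model via `pickle` and `_save` writing it
to disk. The base class name, import path, method signatures, and output path are assumptions; the generated base
classes define the actual contracts.

[source,python]
----
# Hypothetical sketch only -- the base class/module names and the method
# signatures are assumptions; see convert_model_to_custom_format_base.py
# for the generated contract.
import pickle

from ..generated.post_action.convert_model_to_custom_format_base import (
    ConvertModelToCustomFormatBase,  # assumed class name
)


class ConvertModelToCustomFormat(ConvertModelToCustomFormatBase):
    """Converts the trained sklearn model to a custom (pickled) format."""

    def _convert(self, model):
        # Assumed signature: receive the trained model, return the converted artifact.
        return pickle.dumps(model)

    def _save(self, converted_model):
        # Assumed signature: persist the converted artifact.
        with open("model.custom.pkl", "wb") as output_file:
            output_file.write(converted_model)
----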

=== Freeform Post-Training Action

A freeform post-training action can be used to apply any custom post-training processing. The following example post-training action will be used to describe what gets generated for a freeform post-training action:

.Example Freeform Post-Training Action
[source,json]
----
{
"name": "AdditionalProcessing",
"type": "freeform"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/additional_processing_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - abstract method to implement the custom processing. This should be implemented in the post-training
action implementation class.

| `src/${python_package_name}/post_action/additional_processing.py`
| Post-training action implementation class. This class is where the `apply` method should be implemented; a hypothetical sketch follows this table.

|===
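
As a final illustration, here is a minimal, hypothetical sketch of the implementation class in
`additional_processing.py`. The base class name, import path, and `apply` signature are assumptions; the generated
`additional_processing_base.py` defines the actual contract.

[source,python]
----
# Hypothetical sketch only -- the base class/module names and the apply()
# signature are assumptions; see the generated additional_processing_base.py.
import logging

from ..generated.post_action.additional_processing_base import (
    AdditionalProcessingBase,  # assumed class name
)

logger = logging.getLogger(__name__)


class AdditionalProcessing(AdditionalProcessingBase):
    """Applies custom processing after the model has been trained."""

    def apply(self):
        # Any custom post-training logic goes here, e.g. recording metrics
        # or notifying a downstream system.
        logger.info("Applying additional post-training processing")
----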