Merge pull request #36 from boozallen/5-migrate-documentation-tranche-6
#5 📝 Tranche 6 of documentation migration
d-ryan-ashcraft authored May 2, 2024
2 parents a43eacf + e13d35c commit e8eeed8
Showing 19 changed files with 709 additions and 0 deletions.
Binary file added docs/modules/ROOT/images/azure-create-folder.png
Binary file added docs/modules/ROOT/images/azure-create-job.png
Binary file added docs/modules/ROOT/images/azure-job-details.png
Binary file added docs/modules/ROOT/images/azure-new-cluster.png
Binary file added docs/modules/ROOT/images/azure-run-job.png
Binary file added docs/modules/ROOT/images/azure-upload-data.png
3 changes: 3 additions & 0 deletions docs/modules/ROOT/images/pipeline-messaging-adv.svg
3 changes: 3 additions & 0 deletions docs/modules/ROOT/images/pipeline-messaging-basic.svg
3 changes: 3 additions & 0 deletions docs/modules/ROOT/images/pipeline-messaging-channel.svg
84 changes: 84 additions & 0 deletions docs/modules/ROOT/pages/databricks.adoc
@@ -0,0 +1,84 @@
= Databricks Support
:source-highlighter: rouge

//todo can we outsource most of these steps and just call out the aiSSEMBLE-specific stuff?
There are multiple ways you can deploy a data delivery pipeline. One way is to leverage the
https://databricks.com/product/data-lakehouse[Databricks,role=external,window=_blank] managed environment.

== Creating a cluster

To deploy a Spark job on Azure Databricks, we first need to define a cluster. The steps for creating a cluster that
is capable of running your pipeline are listed below. Start by selecting `Compute` on the far left and then select
`Create Cluster`.

image::azure-new-cluster.png[]

1. Name your cluster
2. The runtime needs to support Spark 3.0.1 and Scala 2.12, so choose `Runtime: 7.3 LTS`
3. Select a Worker Type with enough resources to run your data delivery pipeline. You can always go back and change
this setting, so there's no harm in starting small and increasing it later.
4. The Driver Type can be any type of server or the same as the Worker Type.
5. Expand the Advanced Options
6. In the Spark Config box, you can add Java options. If your project uses Krausening to configure your properties,
you can set the following parameters:
+
----
spark.driver.extraJavaOptions
-DKRAUSENING_BASE=/dbfs/FileStore/shared_uploads/project-name/krausening/base
-DKRAUSENING_EXTENSIONS=/dbfs/FileStore/shared_uploads/project-name/krausening/databricks
-DKRAUSENING_PASSWORD=3uQ2j_=wmP5A2q8b
----

7. When you are done configuring, select `Create Cluster`.

== Creating a job

Your data delivery project is executed in Databricks through a `Job`. In the following steps we will define this job.

image::azure-create-job.png[]

Click on the `Jobs` menu item on the far left, then click `Create Job`.

image::azure-job-details.png[Job Detail, 500]

1. Give your task a name
2. Select `Jar` Type
3. Enter your Spark job's fully qualified main class name
4. Click `Add` to add your jar file
5. Select the cluster we created in the previous section
6. Click `Create`

== Initialize and configure environment

Now, with the cluster created and the Spark job defined, we need to import the project's property files and initialize
any tables in the database. First, let's create a shared folder.

image::azure-create-folder.png[Create Folder, 500]

Click on the `Workspace` menu item on the far left, then right-click in the folder area and select `Create` >
`Folder`. Give the folder a name like `data_delivery_shared`.

image::azure-create-notebook.png[Create Notebook, 500]

To run SQL commands, we need a notebook. To create a new notebook in our shared folder, click on the options
triangle next to the shared folder we just created, then select `Create` > `Notebook`.

image::azure-notebook-details.png[Notebook Details, 500]

1. Give your notebook a name
2. Change the default language to `SQL`
3. Make sure your cluster is selected
4. Click `Create`

In this notebook, you can write any SQL (DDL) needed to create the tables that support your pipeline.
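
For illustration only, the following is a hypothetical DDL statement of the kind you might run in this notebook. The
table name and columns are assumptions; in the SQL notebook you would simply run the raw `CREATE TABLE` statement,
while the `spark.sql()` wrapper shown here also makes the snippet valid in a Python cell.

[source,python]
----
# Hypothetical example -- replace the table name and columns with the schema
# your pipeline actually expects. `spark` is provided by the Databricks
# notebook runtime.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_pipeline_table (
        id STRING,
        payload STRING,
        ingest_timestamp TIMESTAMP
    )
""")
----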

Next, we need to import the project's property files. To do this, open the SQL notebook you just created (double-click
on the notebook name) and find the `File` menu item. Click on it and select `Upload Data`. Then, in the upload dialog
box, select your shared folder, drag and drop your property files, and upload them.

image::azure-upload-data.png[Upload Property Files, 500]

By default, Databricks will rename uploaded files to fit its syntax requirements. Often this means you will have to
rename your uploaded files back to `*.properties`. To do this, you can create a Python notebook and run the following
command:
[source,python]
----
dbutils.fs.mv("/FileStore/shared_folder/path/to/your/files/my_file_properties", "/FileStore/shared_folder/path/to/your/krausening/files/hive-metadata.properties")
----
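
If you uploaded several property files, you can rename them all at once instead of moving each file individually. The
following is a minimal, hypothetical sketch of a Python notebook cell that does this; the folder path and the
`_properties` naming pattern are assumptions based on the example above, so adjust them to match your own upload.

[source,python]
----
# Hypothetical helper for a Databricks Python notebook (`dbutils` is provided
# by the notebook runtime). Renames every uploaded file ending in
# "_properties" back to a ".properties" extension.
folder = "/FileStore/shared_uploads/project-name/krausening/base"  # assumed path

for file_info in dbutils.fs.ls(folder):
    if file_info.name.endswith("_properties"):
        new_name = file_info.name[: -len("_properties")] + ".properties"
        dbutils.fs.mv(file_info.path, folder + "/" + new_name)
----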

Now that your tables are generated and your property files are loaded, you can launch the job by clicking on the
`Run Now` action (the play icon) on the `Jobs` tab.

image::azure-run-job.png[]
115 changes: 115 additions & 0 deletions docs/modules/ROOT/pages/messaging-details.adoc
@@ -0,0 +1,115 @@
[#_messaging_details]
= Messaging

== Overview

Messaging is used to provide a decoupled orchestration of events and data between pipeline steps and between the
pipeline and external systems. By leveraging messaging, the pipeline model can be used to define the flow of data
through the pipeline, instead of manually controlling the flow via the pipeline driver. Pipeline messaging utilizes
an implementation of the Eclipse MicroProfile Reactive Messaging specification. (See the <<Advanced-Details>> section for more information.)

=== Basic Concepts

The following messaging concepts are useful in understanding this document.

[cols="1,5"]
|===
| *Publisher*
| A unit of code that produces messages upon request. Sometimes called a source.
| *Subscriber*
| A unit of code that consumes messages. Sometimes called a sink.
| *Processor*
| A unit of code that is simultaneously a Publisher and Subscriber. It consumes an incoming message and uses the
data to produce an outgoing message.
| *Channel*
| The means by which messages flow. All sources and processors push created messages onto channels, and all sinks
and processors pull messages from channels.
|===

In aiSSEMBLE, a channel is backed by the message broker service in the form of a queue or topic. The sources and
sinks of a pipeline can reside within the pipeline itself or in external systems connected via the message broker service.

image::pipeline-messaging-basic.svg[Messaging architecture]

//todo our What Gets Generated sections are mildly inconsistent. Worth unifying
== What Gets Generated

=== _microprofile-config.properties_

This is the standard configuration file per the reactive messaging specification. Any reactive messaging
configuration that does not pertain to pipeline messaging should be placed here.

=== _org.eclipse.microprofile.config.spi.ConfigSource_

This file specifies other custom reactive messaging configuration sources. By default, two custom sources are
created and registered in this file:

`com.boozallen.aiops.data.delivery.messaging.PipelineMessagingConfig`

The `PipelineMessagingConfig` source exposes fine-grained reactive messaging configuration that backs pipeline
messaging for advanced use cases. See the <<Customization>> section for more details.

`<pipeline-package-and-name>DefaultConfig`

The pipeline default configuration provides sensible defaults to support messaging within the pipeline. These
defaults can be overridden by utilizing the `PipelineMessagingConfig` mentioned above.

[#Customization]
== Customization

[#Advanced-Details]
=== Advanced Details

Pipeline messaging leverages the https://smallrye.io/smallrye-reactive-messaging[SmallRye implementation,role=external,window=_blank]
of the https://download.eclipse.org/microprofile/microprofile-reactive-messaging-1.0/microprofile-reactive-messaging-spec.html[Reactive
Messaging specification,role=external,window=_blank]. In order to fully understand how to customize messaging and what
can be customized, it's important to understand how reactive messaging is leveraged to achieve pipeline messaging.

For the most part, the concepts of pipeline messaging parallel the concepts of reactive messaging. The primary
difference is how channels operate. As discussed in <<Basic Concepts>>, a pipeline channel is backed by a message
broker service and flows all messages through the message broker. In contrast, a reactive messaging channel is
completely internal to the process that is using reactive messaging (e.g. the pipeline). In order to connect to
external systems, reactive messaging uses *Connectors* to attach a topic or queue in a message broker to either the
incoming or outgoing side of a channel. Therefore, a pipeline channel is better represented by the following model:

image::pipeline-messaging-channel.svg[Pipeline channel]

Using this expanded representation, we can redraw the previous pipeline messaging diagram as the following:

image::pipeline-messaging-adv.svg[Messaging implementation]

=== Configuration

aiSSEMBLE provides for advanced customization of the reactive channels and connectors that back pipeline messaging
via the `pipeline-messaging.properties` Krausening file.

All configuration properties outlined by the
https://download.eclipse.org/microprofile/microprofile-reactive-messaging-1.0/microprofile-reactive-messaging-spec.html#_configuration[Reactive
Messaging specification,role=external,window=_blank] and the
https://smallrye.io/smallrye-reactive-messaging/latest/concepts/connectors/#configuring-connectors[SmallRye
documentation,role=external,window=_blank] are available but must be translated to reference a pipeline step instead
of directly referencing a reactive channel by name. Instead of
`mp.messaging.[incoming|outgoing].[channel-name].[attribute]=[value]`, the configuration pattern becomes
`[step-name].[in|out].[attribute]=[value]`.

Consider the following example configuration from the SmallRye documentation:

[source,properties]
----
mp.messaging.incoming.health.topic=neo
mp.messaging.incoming.health.connector=smallrye-mqtt
mp.messaging.incoming.health.host=localhost
mp.messaging.outgoing.data.connector=smallrye-kafka
mp.messaging.outgoing.data.bootstrap.servers=localhost:9092
mp.messaging.outgoing.data.key.serializer=org.apache.kafka.common.serialization.StringSerializer
mp.messaging.outgoing.data.value.serializer=io.vertx.kafka.client.serialization.JsonObjectSerializer
mp.messaging.outgoing.data.acks=1
----

If this configuration were translated to a `pipeline-messaging.properties` configuration for a step named `IngestData`,
it would become the following:

[source,properties]
----
IngestData.in.topic=neo
IngestData.in.connector=smallrye-mqtt
IngestData.in.host=localhost
IngestData.out.connector=smallrye-kafka
IngestData.out.bootstrap.servers=localhost:9092
IngestData.out.key.serializer=org.apache.kafka.common.serialization.StringSerializer
IngestData.out.value.serializer=io.vertx.kafka.client.serialization.JsonObjectSerializer
IngestData.out.acks=1
----
31 changes: 31 additions & 0 deletions docs/modules/ROOT/pages/path-to-production.adoc
@@ -0,0 +1,31 @@
= Path to Production

Aside from testing scaffolding and the Maven project structure, aiSSEMBLE(TM) generates several artifacts to help drive
consistency, repeatability, and quality of delivery. These artifacts are designed as starting points for a mature
DevOps-centric approach for delivering high-quality AI systems and are not intended as complete solutions in and of
themselves.

== Mapping to aiSSEMBLE Concepts
[#img-you-are-here-path-to-production]
.xref:solution-baseline-process.adoc[You Are Here]
image::you-are-here-path-to-production.png[You Are Here,200,100,role="thumb right"]

_Path to Production_: Across the industry, many AI systems are developed much like a school project: a Data Engineer or
Data Scientist exports local code via a container and pushes it into production. This practice was once rampant in
software development as well. Over time, it has become an industry best practice to leverage a repeatable, consistent
_path to production_ to ensure quality and enable higher-velocity delivery. It's critical to recognize that, while this
process is slower for an initial deployment, its speed and repeatability are amongst the most important enablers of
moving faster over time (an attribute most clients revere).

== Containers

When generating a Docker module, aiSSEMBLE generates not only the Dockerfile but also the relevant Kubernetes manifest
files. See xref:containers.adoc[Container Support] for more details.

== CI/CD

Every project is created with a `devops` folder that contains configurations for the
https://plugins.jenkins.io/templating-engine/[Jenkins Templating Engine,role=external,window=_blank]. The generated
templates leverage libraries from the https://boozallen.github.io/sdp-docs/sdp-libraries/index.html[Solutions Delivery
Platform,role=external,window=_blank] that build the project and send notifications to a configured Slack channel.
See xref:ci-cd.adoc[CI/CD Support] for more details.
140 changes: 140 additions & 0 deletions docs/modules/ROOT/pages/post-actions.adoc
@@ -0,0 +1,140 @@
= Post-Training Actions

== Overview
Post-training actions can be specified in a machine-learning pipeline to apply additional post-training processing.
When one or more post-training actions are specified in a training step, scaffolding code is generated into the
training pipeline, and each post-training action is applied during the training run after the model is trained.
This page is intended to assist in understanding the generated components that are included when post-training
actions are specified.

== What Gets Generated

For details on how to specify post-training actions, please see the
xref:pipeline-metamodel.adoc#_pipeline_step_post_actions_element_options[Pipeline Step Post Actions Element Options].

=== Model Conversion Post-Training Action

A model-conversion post-training action can be used to convert the trained model to another model format. The
following sections describe the scaffolding code that is generated for a model-conversion post-training action.

==== ONNX

The following example post-training action will be used to describe what gets generated for ONNX model conversion:

.Example ONNX Model Conversion Post-Training Action
[source,json]
----
{
"name": "ConvertModelToOnnxFormat",
"type": "model-conversion",
"modelTarget": "onnx",
"modelSource": "sklearn"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/onnx_sklearn_model_conversion_base.py`
| Base class containing core code for leveraging ONNX to convert sklearn models. This class is regenerated with
every build, and therefore cannot be modified.

The following methods are generated:

* `_convert` - performs the ONNX conversion.
* `_save` - saves the converted ONNX model.

In addition, ONNX conversion properties are generated with default values. These can be overridden in the
post-training action implementation class to specify custom values to pass to the ONNX conversion.

| `src/${python_package_name}/generated/post_action/convert_model_to_onnx_format_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - applies the ONNX conversion by calling the `_convert` and `_save` methods from the above class.

| `src/${python_package_name}/post_action/convert_model_to_onnx_format.py`
| Post-training action implementation class. This class is where properties and methods from the above base classes
can be overridden, if desired; a hypothetical sketch follows this table.

If the ONNX conversion has any required parameters, they will be generated here for manual implementation.

|===
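
To make the relationship between these generated files concrete, below is a minimal, hypothetical sketch of what the
implementation class in `convert_model_to_onnx_format.py` might look like when overriding one of the generated ONNX
conversion properties. The base class name, the relative import path, and the `initial_types` property are assumptions
for illustration; consult the generated base classes for the actual names, signatures, and default values.

[source,python]
----
# Hypothetical sketch of the post-training action implementation class.
# The base class/module names and the overridden property are assumptions;
# the generated convert_model_to_onnx_format_base.py defines the real ones.
from skl2onnx.common.data_types import FloatTensorType

from ..generated.post_action.convert_model_to_onnx_format_base import (
    ConvertModelToOnnxFormatBase,  # assumed class name
)


class ConvertModelToOnnxFormat(ConvertModelToOnnxFormatBase):
    """Overrides generated defaults for the sklearn-to-ONNX conversion."""

    @property
    def initial_types(self):
        # Assumed conversion property: declare the model's input signature
        # (here, a single float tensor with four features) for skl2onnx.
        return [("input", FloatTensorType([None, 4]))]
----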


==== Custom

The following example post-training action will be used to describe what gets generated for a custom model conversion:

.Example Custom Model Conversion Post-Training Action
[source,json]
----
{
"name": "ConvertModelToCustomFormat",
"type": "model-conversion",
"modelTarget": "custom",
"modelSource": "sklearn"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/custom_model_conversion_base.py`
| Base class containing core code for implementing a custom model conversion. This class is regenerated with every
build, and therefore cannot be modified.

The following methods are generated:

* `_convert` - abstract method to implement the custom conversion. This should be implemented in the post-training
action implementation class.
* `_save` - abstract method to implement the saving of the converted model. This should be implemented in the
post-training action implementation class.

| `src/${python_package_name}/generated/post_action/convert_model_to_custom_format_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - applies the custom conversion by calling the `_convert` and `_save` methods from the above class.

| `src/${python_package_name}/post_action/convert_model_to_custom_format.py`
| Post-training action implementation class. This class is where the `_convert` and `_save` methods should be implemented; a hypothetical sketch follows this table.

|===
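
For illustration, the following is a minimal, hypothetical sketch of the implementation class in
`convert_model_to_custom_format.py`, with `_convert` serializing the trained model via `pickle` and `_save` writing it
to disk. The base class name, import path, method signatures, and output path are assumptions; the generated base
classes define the actual contracts.

[source,python]
----
# Hypothetical sketch only -- the base class/module names and the method
# signatures are assumptions; see convert_model_to_custom_format_base.py
# for the generated contract.
import pickle

from ..generated.post_action.convert_model_to_custom_format_base import (
    ConvertModelToCustomFormatBase,  # assumed class name
)


class ConvertModelToCustomFormat(ConvertModelToCustomFormatBase):
    """Converts the trained sklearn model to a custom (pickled) format."""

    def _convert(self, model):
        # Assumed signature: receive the trained model, return the converted artifact.
        return pickle.dumps(model)

    def _save(self, converted_model):
        # Assumed signature: persist the converted artifact.
        with open("model.custom.pkl", "wb") as output_file:
            output_file.write(converted_model)
----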

=== Freeform Post-Training Action

A freeform post-training action can be used to apply any custom post-training processing. The following example post-training action will be used to describe what gets generated for a freeform post-training action:

.Example Freeform Post-Training Action
[source,json]
----
{
"name": "AdditionalProcessing",
"type": "freeform"
}
----

[cols="2,4a"]
|===
|File|Description

| `src/${python_package_name}/generated/post_action/additional_processing_base.py`
| Post-training action base class containing core code for applying the post-training action. This class is
regenerated with every build, and therefore cannot be modified.

The following method is generated:

* `apply` - abstract method to implement the custom processing. This should be implemented in the post-training
action implementation class.

| `src/${python_package_name}/post_action/additional_processing.py`
| Post-training action implementation class. This class is where the `apply` method should be implemented; a hypothetical sketch follows this table.

|===
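
As a final illustration, here is a minimal, hypothetical sketch of the implementation class in
`additional_processing.py`. The base class name, import path, and `apply` signature are assumptions; the generated
`additional_processing_base.py` defines the actual contract.

[source,python]
----
# Hypothetical sketch only -- the base class/module names and the apply()
# signature are assumptions; see the generated additional_processing_base.py.
import logging

from ..generated.post_action.additional_processing_base import (
    AdditionalProcessingBase,  # assumed class name
)

logger = logging.getLogger(__name__)


class AdditionalProcessing(AdditionalProcessingBase):
    """Applies custom processing after the model has been trained."""

    def apply(self):
        # Any custom post-training logic goes here, e.g. recording metrics
        # or notifying a downstream system.
        logger.info("Applying additional post-training processing")
----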