diff --git a/README.md b/README.md index ceba403cd..3d9da589c 100644 --- a/README.md +++ b/README.md @@ -43,7 +43,7 @@ processes adjusted and honed into the public GitHub ## Configurations -For details on using the configuration tool, please consult our [Configuring Your Environment guidance (LINK COMING SOON)](https://boozallen.github.io/aissemble/current-dev/configurations.html). +For details on using the configuration tool, please consult our [Configuring Your Environment guidance](https://boozallen.github.io/aissemble/aissemble/current/configurations.html). ## Build @@ -76,7 +76,7 @@ following profiles are often useful when first starting with aiSSEMBLE: build does not build the Docker images directly. The images are built within the Kubernetes cluster to speed up development builds and save disk space.** -## Use a Maven Archetype to Create a New aiSSEMBLE-Based Project +## Use a Maven Archetype to Create a New aiSSEMBLE-Based Project The first step in creating a new project is to leverage Maven’s archetype functionality to incept a new Maven project that will contain all of your aiSSEMBLE component implementations - Data Delivery and Machine Learning pipelines as @@ -97,7 +97,7 @@ This command will trigger an interactive questionnaire giving you the opportunit 5. projectGitUrl 6. projectName -* For details on these fields refer to (LINK COMING SOON) https://boozallen.github.io/aissemble/current-dev/archetype.html#_use_a_maven_archetype_to_create_a_new_project +* For details on these fields refer to https://boozallen.github.io/aissemble/aissemble/current/archetype.html * For detailed instructions on adding a pipeline refer to (LINK COMING SOON) https://boozallen.github.io/aissemble/current-dev/add-pipelines-to-build.html diff --git a/docs/modules/ROOT/images/aissemble-landscape.png b/docs/modules/ROOT/images/aissemble-landscape.png new file mode 100644 index 000000000..327609e89 Binary files /dev/null and b/docs/modules/ROOT/images/aissemble-landscape.png differ diff --git a/docs/modules/ROOT/images/aissemble-solution-architecture.png b/docs/modules/ROOT/images/aissemble-solution-architecture.png new file mode 100644 index 000000000..62f5dda61 Binary files /dev/null and b/docs/modules/ROOT/images/aissemble-solution-architecture.png differ diff --git a/docs/modules/ROOT/pages/add-pipelines-to-build.adoc b/docs/modules/ROOT/pages/add-pipelines-to-build.adoc new file mode 100644 index 000000000..db6bd2552 --- /dev/null +++ b/docs/modules/ROOT/pages/add-pipelines-to-build.adoc @@ -0,0 +1,62 @@ +[#_adding_a_pipeline] += Adding a Pipeline + +Once you have generated your new project, it is time to add a pipeline. Pipelines are the core of most projects and are +responsible for major data delivery and machine learning tasks. The following content walks through the process of +standing up a pipeline at a very high level. + +=== Step 1: Creating the pipeline model file +aiSSEMBLE(TM) uses Model Driven Architecture (MDA) to accelerate development. Pipeline models are JSON files used to +drive the generation of multiple aspects in your project - including the pipeline code module and deployment modules. + +. Create a new JSON file in your project's `pipeline-model` directory. +** Sample path: `test-project/test-project-pipeline-models/src/main/resources/pipelines` + +. Create the model pipeline data within the newly added JSON file. For detailed options, see the pipeline +xref:pipeline-metamodel.adoc[metamodel documentation]. +.. 
*Note:* Ensure the name of the JSON file matches the `"name"` of the pipeline.
+
+*Example:* _Shown below is the SimpleDataDeliveryExample pipeline._
+
+._View SimpleDataDeliveryExample.json_
+[%collapsible]
+====
+[source,json]
+----
+{
+  "name":"SimpleDataDeliveryExample",
+  "package":"com.boozallen.aissemble.documentation",
+  "type":{
+    "name":"data-flow",
+    "implementation":"data-delivery-spark"
+  },
+  "steps":[
+    {
+      "name":"IngestData",
+      "type":"synchronous",
+      "dataProfiling":{
+        "enabled":false
+      }
+    }
+  ]
+}
+----
+====
+
+=== Step 2: Generating the pipeline code
+After creating your pipeline model, execute the build to trigger the creation of the Maven modules that accompany it.
+
+. Run the Maven build to execute the MDA generation engine.
+.. `mvn clean install`
+
+. The MDA generator will take several build iterations to fully generate your project and requires that you modify
+certain files to enable this generation. These *Manual Actions* are meant to guide you through that process and will
+only need to be performed after changes to your pipeline model(s).
+[source]
+----
+***********************************************************************
+*** MANUAL ACTION NEEDED! ***
+***********************************************************************
+----
+
+[start=3]
+. Re-run the build and address each manual action until none remain.
\ No newline at end of file
diff --git a/docs/modules/ROOT/pages/aissemble-approach.adoc b/docs/modules/ROOT/pages/aissemble-approach.adoc
new file mode 100644
index 000000000..ae86eb570
--- /dev/null
+++ b/docs/modules/ROOT/pages/aissemble-approach.adoc
@@ -0,0 +1,38 @@
+= aiSSEMBLE(TM) Approach
+
+aiSSEMBLE is a lean manufacturing approach for holistically designing, developing, and fielding AI. The aiSSEMBLE Baseline
+is a framework that builds on software and machine learning engineering best practices and lessons learned,
+providing and maintaining reusable components rather than a one-size-fits-all platform. In turn, various members of
+your team, including solution architects, DevOps engineers, data/software engineers, and AI/ML practitioners, can easily
+leverage these components to suit your business needs and integrate with any other tools that meet the needs of your
+project.
+
+== Baseline Fabrications
+The Baseline represents the Factory component in the diagram below. aiSSEMBLE leverages
+xref:aissemble-approach.adoc#_configurable_tooling[configurable tooling] in the fabrication process to generate key
+scaffolding and other components that are tailored to your project. aiSSEMBLE also establishes a build process that
+produces deployment-ready artifacts.
+
+image::aissemble-solution-architecture.png[align="left",width=1366,height=768]
+
+
+[#_configurable_tooling]
+=== Configurable Tooling
+With the fast-moving landscape of tools and techniques within AI, a process that can change at the speed of AI
+is needed. It is also crucial that the nuances of existing customer environments and preferences can be
+incorporated into AI projects as an integral concept. The Baseline’s Configurable Tooling is realized by using
+https://www.omg.org/mda/[Model Driven Architecture,role=external,window=_blank] to describe your AI system with simple
+JSON files. These files are then ingested to generate Python, Java, Scala, or any other type of file. Configurable
+tooling exists for a vast array of components leveraging popular implementation choices, which can be used as-is or
+tailored to specific needs.
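+
+For illustration, a minimal pipeline model is sketched below. The element names follow the
+xref:pipeline-metamodel.adoc[pipeline metamodel]; the pipeline name and package shown here are placeholders rather than
+values required by aiSSEMBLE.
+
+[source,json]
+----
+{
+  "name": "ExamplePipeline",
+  "package": "com.example.pipelines",
+  "type": {
+    "name": "data-flow",
+    "implementation": "data-delivery-spark"
+  },
+  "steps": [
+    {
+      "name": "IngestData",
+      "type": "synchronous"
+    }
+  ]
+}
+----
+From a model like this, the build generates the corresponding pipeline code module and deployment modules for the
+project.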
+ +=== Reusable Components +Reusable components represent the "ready-to-deploy" artifacts within aiSSEMBLE. These reusable components can be used in +a project directly. These include, but are not limited to, data lineage, AI model versioning, and bias detection. + +== aiSSEMBLE Landscape +aiSSEMBLE is not a platform, but rather a framework that integrates with various technologies. The graphic below +demonstrates the representative set of technologies aiSSEMBLE integrates with. This set is always evolving to +keep pace with the rapidly changing AI space. + +image::aissemble-landscape.png[align="left",width=1366,height=768] \ No newline at end of file diff --git a/docs/modules/ROOT/pages/aissemble-versions.adoc b/docs/modules/ROOT/pages/aissemble-versions.adoc index 156aa86ac..670fe45c5 100644 --- a/docs/modules/ROOT/pages/aissemble-versions.adoc +++ b/docs/modules/ROOT/pages/aissemble-versions.adoc @@ -7,5 +7,5 @@ https://github.com/boozallen/aissemble/releases[Releases page,role=external,wind ## Preview of Next Release Given the fragile nature of GitHub Release Notes (no version control), we track our in-flight changes in a vesioned -document within the codebase. Check out https://github.com/boozallen/aissemble/releases[DRAFT Release Notes,role=external,window=_blank] -to see what's coming in the next release. +document within the codebase. Check out https://github.com/boozallen/aissemble/blob/dev/DRAFT_RELEASE_NOTES.md[DRAFT +Release Notes,role=external,window=_blank] to see what's coming in the next release. diff --git a/docs/modules/ROOT/pages/index.adoc b/docs/modules/ROOT/pages/index.adoc index 10f5c9e84..b4d6c56f8 100644 --- a/docs/modules/ROOT/pages/index.adoc +++ b/docs/modules/ROOT/pages/index.adoc @@ -9,24 +9,24 @@ GitHub issue,role=external,window=_blank] for more information. * **aiSSEMBLE** – (pronounced assemble) is Booz Allen’s lean, manufacturing-inspired approach for holistically designing, developing, and fielding AI solutions across the engineering lifecycle: from data processing (DataOps) to model building, tuning, and training (ModelOps, MLOps), to secure operational deployment (DevSecOps). -* **https://github.com/boozallen/aissemble[Solution Baseline,role=external,window=_blank]** - An evolving, +* **https://github.com/boozallen/aissemble[aiSSEMBLE Baseline,role=external,window=_blank]** - An evolving, versioned, core set of capabilities that can serve as an implementation's foundation, enabling the rapid standup of an operational data delivery, data preparation, and/or machine learning (ML) pipelines. == aiSSEMBLE Overview [.lead] -The aiSSEMBLE Solution Baseline is an accelerant that helps AI projects by leveraging pre-packaged and tailored +The aiSSEMBLE Baseline is an accelerant that helps AI projects by leveraging pre-packaged and tailored components, allowing a shift in focus towards value-added tasks, rather than labor-intensive, boilerplate tasks. -The Solution Baseline also lowers the barrier for including features that are otherwise treated as nice-to-have items, +The Baseline also lowers the barrier for including features that are otherwise treated as nice-to-have items, as teams often do not have time to incorporate them (e.g., drift detection, bias detection, provenance capture). 
-The Solution Baseline is the key enabler for the aiSSEMBLE manufacturing process: a modern take on a holistic, +The Baseline is the key enabler for the aiSSEMBLE manufacturing process: a modern take on a holistic, integrated lean software manufacturing process that can deliver a variety of AI products at high velocity. The graphic below demonstrates the steps to kick off an aiSSEMBLE project. // .aiSSEMBLE Notional Architecture -image::solution-baseline-process-overview.png[align="center",alt="Solution Baseline Overview",width=1366,height=768] +image::solution-baseline-process-overview.png[align="center",alt="Baseline Overview",width=1366,height=768] == New project with aiSSEMBLE diff --git a/docs/modules/ROOT/pages/pipeline-metamodel.adoc b/docs/modules/ROOT/pages/pipeline-metamodel.adoc new file mode 100644 index 000000000..3cdef87c6 --- /dev/null +++ b/docs/modules/ROOT/pages/pipeline-metamodel.adoc @@ -0,0 +1,900 @@ +[#_pipeline_metamodel] += Pipeline Metamodel + +The pipeline metamodel enables data engineers, machine learning engineers, and DevSecOps engineers to specify key +attributes that describe a data delivery, data preparation, or machine learning pipeline. Specifically, this +metamodel allows common pipeline patterns to be quickly instantiated with one or more steps that each detail inbound and +outbound data mapping, persistence definition, provenance, and alerting features. This metamodel is meant to be a +reference and not a complete guide to creating your pipeline. To view a more specific example, please check out the +core components page that best aligns to your project’s needs: + +* xref:data-delivery-pipeline-overview.adoc[Data Delivery Pipelines] + +* xref:machine-learning-pipeline-details.adoc[Machine Learning Pipelines] + +== Pipeline Metamodel Specifications + +Each metadata instance should be placed in a file with the same name as your pipeline that lives within the following +directory structure (to initially create this structure, please see xref:archetype.adoc[]): + +`/-pipeline-models/src/main/resources/pipelines` + +For example: + +`test-project/test-project-pipeline-models/src/main/resources/pipelines/TaxPayerPipeline.json` + +=== Pipeline Root Element Options +The following options are available on the root pipeline element: + +.Pipeline Root Location +[source,json] +---- +{ + "..." +} +---- + +.Pipeline Root Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `name` +| Yes +| None +| Specifies the name of your pipeline. It should represent the functional purpose of your pipeline and must use +UpperCamelCase/PascalCase notation (e.g., `TaxPayerPipeline.json`). + +| `package` +| Yes +| None +| Provides a namespace that will be leveraged to differentiate between instances and provide language-specific +namespacing (e.g. package in Java, namespace in XSD). The archetype process will default this to your Maven `groupId`, +which is a recommended best practice to maintain across your pipelines. + +| `description` +| No +| None +| A description of the purpose of the pipeline being specified. + +| `type` (xref:#_pipeline_type_options[details]) +| Yes +| None +| The set of elements that describes the type of pipeline you are modeling (e.g., Data Delivery via Spark, Data Delivery +via Nifi, Machine Learning via MLFlow). + +| `dataLineage` +| No +| None +| A flag indicating whether the pipeline should include OpenLineage metadata capture as the +xref:data-lineage.adoc[data lineage] tool. 
+ +| `fileStores` (xref:#_pipeline_file_store_options[details]) +| No +| None +| The xref:file-storage-details.adoc[file stores] used by the pipeline. + +| `steps` (xref:#_pipeline_step_options[details]) +| Yes +| None +| The various steps or stages in your pipeline. At least one step is required within steps. Within a Data Delivery +pipeline, each step is a Data Action. Within a Machine Learning pipeline, each step represents part of the ML Workflow. + +|=== + +[#_pipeline_type_options] +=== Pipeline Type Options +The following options are available on the `type` pipeline element: + +.Type Location +[source,json] +---- +{ + "type": { + "..." + } +} +---- +.Type Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `type/name` +| Yes +| None +| The name of the pipeline type that is being modeled. Supported types are: + +* `data-flow` - should be used for Data Delivery or Data Preparation pipelines +* `machine-learning` - should be used for Machine Learning pipelines + +| `type/implementation` +| Yes +| None +| The specific implementation of the pipeline type you want to use. Currently supported types are: + +*`data-flow` implementations:* + +* xref:spark-data-delivery-pipeline.adoc[`data-delivery-spark`] +* xref:pyspark-data-delivery-pipeline-details.adoc[`data-delivery-pyspark`] + +*`machine-learning` implementation:* + +* xref:machine-learning-pipeline-details.adoc[`machine-learning-mlflow`] + +| `type/platforms` (xref:#_pipeline_type_platform_options[details]) +| No +| None +| Any additional platforms to include with the pipeline. + +| `type/versioning` (xref:#_pipeline_type_versioning_options[details]) +| Yes +| Situational +| By default, `versioning` is not enabled. Specifying this element allows for versioning to be disabled or enabled, if +desired. + +| `type/executionHelpers` +| No +| None +| Setting the value to ["airflow"] will bring airflow into the project. Other helpers may be added in the future. + +|=== + + +[#_pipeline_type_platform_options] +=== Pipeline Type Platform Options +The following options are available on the `platform` pipeline element: + +.Type Platform Location +[source,json] +---- +{ + "type": { + "platforms":[ + { + "..." + } + ] + } +} +---- +.Type Platform Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `type/platforms/platform/name` +| Yes +| None +| The name of an additional platform to include with the pipeline. The following platforms are currently supported: + +* `sedona` - adds https://sedona.apache.org/[Apache Sedona,role=external,window=_blank] support for geospatial data +delivery purposes. Note that this platform is only applicable to the `data-delivery-spark` and `data-delivery-pyspark` +pipeline implementations. + +|=== + + +[#_pipeline_type_versioning_options] +=== Pipeline Type Versioning Options +The following options are available on the `versioning` pipeline element: + +.Type Versioning Location +[source,json] +---- +{ + "type": { + "versioning":{ + "..." + } + } +} +---- +.Type Versioning Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `type/versioning/enabled` +| No +| None +| Setting `enabled` to false will disable versioning for *machine-learning implementations only*. + +|=== + + +[#_pipeline_file_store_options] +=== Pipeline File Store Options +The following options are available on the `fileStore` pipeline element: + +.File Store Location +[source,json] +---- +{ + "fileStores": [ + { + "..." 
+    }
+  ]
+}
+----
+.File Store Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use

+| `fileStores/fileStore/name`
+| Yes
+| None
+| Specifies the name of your file store. It should represent the functional purpose of the file store and must use
+UpperCamelCase/PascalCase notation (e.g., `PublishedModels`). This name is used as the prefix in the pipeline's
+xref:file-storage-details.adoc[file stores] configuration.
+|===
+
+[#_pipeline_step_options]
+=== Pipeline Step Options
+The following options are available on the `step` pipeline element:
+
+.Step Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "..."
+    }
+  ]
+}
+----
+.Step Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/name`
+| Yes
+| None
+| Specifies the name of your step. It should represent the functional purpose of your step and must use
+UpperCamelCase/PascalCase notation (e.g., `IngestNetflowData`).
+
+| `steps/step/type`
+| Yes
+| None
+| Defines the type of step you want to create. There are different options for data-flow and machine-learning pipelines.
+
+*`data-flow` implementations:*
+
+* `synchronous` - the step will accept an input and then return an output in a single execution
+* `asynchronous` - the step will listen for messages off a queue and then combine them into batches for subsequent
+processing
+
+*`machine-learning` implementations:*
+
+* `generic` - the step will provide minimal scaffolding for custom logic to be executed in the training/inference steps
+* `training` - the step will train an ML model
+* `inference` - the step will expose a trained model for inference
+* `sagemakertraining` - the step will train an ML model using Amazon SageMaker
+
+| `steps/step/inbound` (xref:#_pipeline_step_inbound_options[details])
+| No
+| None
+| Defines the type of inbound data for the step. If not specified, then there will be no inbound type. Conceptually,
+inbound maps to the parameter type that will be passed to a method/function.
+
+| `steps/step/outbound` (xref:#_pipeline_step_outbound_options[details])
+| No
+| None
+| Defines the type of outbound data for the step. If not specified, then there will be no outbound type. Conceptually,
+outbound maps to the parameter type that will be returned from a method/function.
+
+| `steps/step/persist` (xref:#_pipeline_step_persist_options[details])
+| No
+| None
+| Allows specification of the type of persistence that should be performed in this step. If not specified, no persist
+logic will be created.
+
+| `steps/step/provenance` (xref:#_pipeline_step_provenance_options[details])
+| No
+| Yes
+| By default, provenance will be tracked for every step. Specifying this element allows for provenance to be disabled or
+enabled, if desired.
+
+| `steps/step/alerting` (xref:#_pipeline_step_alerting_options[details])
+| No
+| Yes
+| By default, xref:alerting-details.adoc[alerting] is triggered for every step. Specifying this element allows for alerting
+to be disabled or enabled, if desired.
+
+| `steps/step/postActions` (xref:#_pipeline_step_post_actions_options[details])
+| No
+| None
+| Allows specification of one or more xref:post-actions.adoc[post-training actions] to apply on a `machine-learning`
+`training` step *only*. If not specified or specified on a non-training step, no post-training action logic will be
+created.
+
+| `steps/step/fileStores`
+| No
+| None
+| A list of xref:file-storage-details.adoc[file store] names utilized by this step. They must be defined in the
+`pipeline/fileStores` element.
+ +| `steps/step/configuration` (xref:#_pipeline_step_configuration_options[details]) +| No +| None +| Allows the specification of arbitrary list of key-value pairs for implementation-specific configuration. + +|=== + +[#_pipeline_step_inbound_options] +=== Pipeline Step Inbound Options +The following options are available on the `step` pipeline element: + +.Step Inbound Location +[source,json] +---- +{ + "steps": [ + { + "inbound": { + "..." + } + } + ] +} +---- +.Step Inbound Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `steps/step/inbound/type` +| Yes +| None +| Allows specification of how the step will be invoked. There are currently three options: + +* xref:messaging-details.adoc[`messaging`] - invocation will occur when a message is received. This message can contain +content, a pointer to content, or just a signal that some process should start. +* `native` - invocation will occur when the step is invoked directly by some caller. +* `void` - no inbound is specified. All other inbound options should be removed if using void or, preferably, the +entire input block should be eliminated from your step. + +| `steps/step/inbound/channelName` +| Situational +| None +| Required if using xref:messaging-details.adoc[messaging] as the `inbound` type, `channelName` specifies the messaging +channel from which input should be received. + +| `steps/step/inbound/nativeCollectionType` (xref:#_pipeline_step_inbound_native_collection_type_options[details]) +| No +| Yes +| If using native as the `inbound` type, `nativeCollectionType` allows the implementation of the collection object being +passed into the step to be customized to any valid xref:type-metamodel.adoc[Type Manager] type. If not +specified, it will default to dataset (which in turn is defaulted to `org.apache.spark.sql.Dataset` for a Spark-based +implementation). + +| `steps/step/inbound/recordType` (xref:#_pipeline_step_inbound_record_type_options[details]) +| No +| Yes +| Allows the type of an individual record being processed in a step to be defined to any valid +xref:type-metamodel.adoc[Type Manager] type. If not specified, it will default to row (which in turn +is defaulted to `org.apache.spark.sql.Row` for a Spark-based implementation). + +|=== + +[#_pipeline_step_inbound_native_collection_type_options] +=== Pipeline Step Inbound Native Collection Type Options +The following options are available on the `nativeCollectionType` pipeline element: + +.Step Inbound Native Collection Type Location +[source,json] +---- +{ + "steps": [ + { + "inbound": { + "nativeCollectionType":{ + "..." + } + } + } + ] +} +---- +.Step Inbound Native Collection Type Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `steps/step/inbound/nativeCollectionType/name` +| Yes +| None +| This is the name of the xref:dictionary-metamodel.adoc[dictionary] type to be used as the inbound native collection type. + +| `steps/step/inbound/nativeCollectionType/package` +| No +| Yes +| This is the package for the xref:dictionary-metamodel.adoc[dictionary] to look up the inbound native collection type. +If not specified, it will default to the base package. +|=== + +[#_pipeline_step_inbound_record_type_options] +=== Pipeline Step Inbound Record Type Options +The following options are available on the `recordType` pipeline element: + +.Step Inbound Record Type Location +[source,json] +---- +{ + "steps": [ + { + "inbound": { + "recordType":{ + "..." 
+        }
+      }
+    }
+  ]
+}
+----
+.Step Inbound Record Type Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/inbound/recordType/name`
+| Yes
+| None
+| This is the name of the `record` to be used as the inbound data type.
+
+| `steps/step/inbound/recordType/package`
+| No
+| Yes
+| This is the package in which the `record` to be used as the inbound data type resides.
+If not specified, it will default to the base package.
+|===
+
+
+[#_pipeline_step_outbound_options]
+=== Pipeline Step Outbound Options
+The following options are available on the `outbound` pipeline element:
+
+.Step Outbound Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "outbound": {
+        "..."
+      }
+    }
+  ]
+}
+----
+.Step Outbound Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/outbound/type`
+| Yes
+| None
+| Allows specification of how the step returns results. There are currently three options:
+
+* xref:messaging-details.adoc[`messaging`] - a message will be emitted from the step. This message can contain content,
+a pointer to content, or just a signal.
+* `native` - results will be directly returned from the step to be handled by the caller of the step. As such, messaging
+inbound cannot be combined with a native outbound since no caller will exist.
+* `void` - no outbound result is specified. All other options should be removed if using void or, preferably, the entire
+outbound block should be eliminated from your step.
+
+| `steps/step/outbound/channelName`
+| Situational
+| None
+| Required if using xref:messaging-details.adoc[messaging] as the `outbound` type, `channelName` specifies the messaging
+channel to which output should be sent.
+
+| `steps/step/outbound/nativeCollectionType` (xref:#_pipeline_step_outbound_native_collection_type_options[details])
+| No
+| Yes
+| If using native as the `outbound` type, `nativeCollectionType` allows the implementation of the collection object being
+returned from the step to be customized to any valid xref:type-metamodel.adoc[Type Manager] type. If
+not specified, it will default to dataset (which in turn is defaulted to `org.apache.spark.sql.Dataset` for a
+Spark-based implementation).
+
+Note that this element must now be defined as a valid xref:dictionary-metamodel.adoc[dictionary] type.
+
+| `steps/step/outbound/recordType` (xref:#_pipeline_step_outbound_record_type_options[details])
+| No
+| Yes
+| Allows the type of an individual record being returned from a step to be defined to any valid
+xref:type-metamodel.adoc[Type Manager] type. If not specified, it will default to row (which in turn
+is defaulted to `org.apache.spark.sql.Row` for a Spark-based implementation).
+
+Note that this element must now be defined as a valid `record` type.
+|===
+
+[#_pipeline_step_outbound_native_collection_type_options]
+=== Pipeline Step Outbound Native Collection Type Options
+The following options are available on the `nativeCollectionType` pipeline element:
+
+.Step Outbound Native Collection Type Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "outbound": {
+        "nativeCollectionType":{
+          "..."
+        }
+      }
+    }
+  ]
+}
+----
+.Step Outbound Native Collection Type Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/outbound/nativeCollectionType/name`
+| Yes
+| None
+| This is the name of the xref:dictionary-metamodel.adoc[dictionary] type to be used as the outbound native collection
+type.
+ +| `steps/step/outbound/nativeCollectionType/package` +| No +| Yes +| This is the package for the xref:dictionary-metamodel.adoc[dictionary] to look up the outbound native collection type. +If not specified, it will default to the base package. +|=== + +[#_pipeline_step_outbound_record_type_options] +=== Pipeline Step Outbound Record Type Options +The following options are available on the `recordType` pipeline element: + +.Step Outbound Record Type Location +[source,json] +---- +{ + "steps": [ + { + "outbound": { + "recordType":{ + "..." + } + } + } + ] +} +---- +.Step Outbound Record Type Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `steps/step/outbound/recordType/name` +| Yes +| None +| This is the name of the `record` to be used as the outbound data type. + +| `steps/step/outbound/recordType/package` +| No +| Yes +| This is the package in which the `record` to be used as the outbound data type resides. +If not specified, it will default to the base package. +|=== + +[#_pipeline_step_persist_options] +=== Pipeline Step Persist Options +The following options are available on the `persist` pipeline element: + +.Step Persist Location +[source,json] +---- +{ + "steps": [ + { + "persist": { + "..." + } + } + ] +} +---- +.Step Persist Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `steps/step/persist/type` +| Yes +| None +| Allows the specification of the storage system that you want to use to persist data in your step. There are currently +five options: + +* `delta-lake` - leverage https://delta.io/[Delta Lake,role=external,window=_blank] to save data; this is the preferred, +general purpose data store that is well suited for intermediate storage that will consumed by subsequent steps within +Spark implementations. +* `hive` - leverage https://hive.apache.org/[Apache Hive,role=external,window=_blank] to save data; this is often a +good choice when you want to expose data for remote consumption. +* `rdbms` - leverage https://www.sqlalchemy.org/[Rdbms,role=external,window=_blank] to save data. +* `elasticsearch` - leverage https://www.elastic.co/[Elasticsearch,role=external,window=_blank] to save data. +* `neo4j` - leverage https://neo4j.com/[Neo4j,role=external,window=_blank] to save data. + +| `steps/step/persist/mode` +| No +| Yes +| Allows the specification of how you want to persist data in your step. There are currently four options: + +* `append` +* `error` +* `ignore` +* `overwrite` + +If not specified, it will default to `append`. Please see +https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes[documentation on Spark save +modes] for details on the options. + +| `steps/step/persist/collectionType` (xref:_pipeline_step_persist_collection_type_options[details]) +| No +| Yes +| Allows the implementation of the collection object being persisted from the step to be customized to +any valid xref:type-metamodel.adoc[Type Manager] type. If not specified, it will default to dataset +(which in turn is defaulted to `org.apache.spark.sql.Dataset` for a Spark-based implementation). + +| `steps/step/persist/recordType` (xref:_pipeline_step_persist_record_type_options[details]) +| No +| Yes +| Allows the type of an individual record that will be persisted from a step to be defined to any valid +xref:type-metamodel.adoc[Type Manager] type. 
If not specified, it will default to row (which in turn
+is defaulted to `org.apache.spark.sql.Row` for a Spark-based implementation).
+
+|===
+
+
+[#_pipeline_step_persist_collection_type_options]
+=== Pipeline Step Persist Collection Type Options
+The following options are available on the `collectionType` pipeline element:
+
+.Step Persist Collection Type Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "persist": {
+        "collectionType":{
+          "..."
+        }
+      }
+    }
+  ]
+}
+----
+.Step Persist Collection Type Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/persist/collectionType/name`
+| Yes
+| None
+| This is the name of the xref:dictionary-metamodel.adoc[dictionary] type to be used as the persist collection type.
+
+| `steps/step/persist/collectionType/package`
+| No
+| Yes
+| This is the package for the xref:dictionary-metamodel.adoc[dictionary] to look up the persist collection type. If not
+specified, it will default to the base package.
+|===
+
+[#_pipeline_step_persist_record_type_options]
+=== Pipeline Step Persist Record Type Options
+The following options are available on the `recordType` pipeline element:
+
+.Step Persist Record Type Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "persist": {
+        "recordType":{
+          "..."
+        }
+      }
+    }
+  ]
+}
+----
+.Step Persist Record Type Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/persist/recordType/name`
+| Yes
+| None
+| This is the name of the `record` to be used as the persist data type.
+
+| `steps/step/persist/recordType/package`
+| No
+| Yes
+| This is the package in which the `record` to be used as the persist data type resides.
+If not specified, it will default to the base package.
+
+|===
+
+[#_pipeline_step_provenance_options]
+=== Pipeline Step Provenance Options
+The following options are available on the `provenance` pipeline element:
+
+.Step Provenance Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "provenance": {
+        "..."
+      }
+    }
+  ]
+}
+----
+.Step Provenance Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/provenance/enabled`
+| Yes
+| None
+| Setting `enabled` to false will disable provenance creation.
+
+| `steps/step/provenance/resource`
+| No
+| None
+| The name for the resource being modified in the step.
+
+| `steps/step/provenance/subject`
+| No
+| None
+| The name of the subject modifying the resource in the step.
+
+| `steps/step/provenance/action`
+| No
+| None
+| The name of the action being taken on the resource in the step.
+
+|===
+
+[#_pipeline_step_alerting_options]
+=== Pipeline Step Alerting Options
+The following options are available on the xref:alerting-details.adoc[`alerting`] pipeline element:
+
+.Step Alerting Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "alerting": {
+        "..."
+      }
+    }
+  ]
+}
+----
+.Step Alerting Metamodel Options
+[cols="2a,1a,1a,4a"]
+|===
+| Element Name | Required? | Default | Use
+
+| `steps/step/alerting/enabled`
+| Yes
+| None
+| Setting `enabled` to false will disable alerting.
+
+|===
+
+[#_pipeline_step_post_actions_options]
+=== Pipeline Step Post Actions Options
+[IMPORTANT]
+The `postActions` pipeline step element is only applicable to a machine-learning training step!
+
+The following options are available on the xref:post-actions.adoc[`postActions`] pipeline step element:
+
+.Step Post Actions Location
+[source,json]
+----
+{
+  "steps": [
+    {
+      "postActions": [
+        {
+          "..."
+ } + ] + } + ] +} +---- +.Step Post Actions Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `steps/step/postActions/name` +| Yes +| None +| Specifies the name of the xref:post-actions.adoc[post-training action]. It should represent the functional purpose of +your post action and use UpperCamelCase/PascalCase notation (e.g., `ConvertModel`). + +| `steps/step/postActions/type` +| Yes +| None +| Specifies the type of the xref:post-actions.adoc[post-training action]. The following types are currently supported: + +* `model-conversion` - to convert the trained model to another model format. +* `freeform` - to implement a custom post-training process. + +| `steps/step/postActions/modelTarget` +| Situational +| None +| Required when post action type is `model-conversion`. Specifies the format to convert the trained model to. The +following model targets are currently supported: + +* `onnx` - to convert a model to ONNX format. Please see https://github.com/onnx/onnxmltools[ONNX +documentation,role=external,window=_blank] for more information. +* `custom` - to implement a custom model conversion. + +| `steps/step/postActions/modelSource` +| Situational +| None +| Required when post action type is `model-conversion`. Specifies the format of the trained model that will be converted. + +For `onnx` model conversion, the following model sources are currently supported: + +* `sklearn` +* `keras` + +|=== + +[#_pipeline_step_configuration_options] +=== Pipeline Step Configuration Options +The following options are available on the `configuration` pipeline element: + +.Step Configuration Location +[source,json] +---- +{ + "steps": [ + { + "configuration": [ + { + "..." + } + ] + } + ] +} +---- +.Step Configuration Metamodel Options +[cols="2a,1a,1a,4a"] +|=== +| Element Name | Required? | Default | Use + +| `steps/step/configuration/key` +| No +| None +| The name of the configuration key-value pair by which the value can be retrieved. + +| `steps/step/configuration/value` +| No +| None +| The configuration value. + +|=== \ No newline at end of file diff --git a/docs/modules/ROOT/pages/solution-baseline-process.adoc b/docs/modules/ROOT/pages/solution-baseline-process.adoc new file mode 100644 index 000000000..5836185aa --- /dev/null +++ b/docs/modules/ROOT/pages/solution-baseline-process.adoc @@ -0,0 +1,9 @@ += aiSSEMBLE(TM) Baseline Process + +The following pages will walk through the process of creating your aiSSEMBLE-powered project, creating foundational +components, then layering in more advanced concepts. + +Each page will provide a visual clue as to how it fits into the process graphic below to help map activities. + +.aiSSEMBLE Process +image::solution-baseline-process-overview.png[align="center"]