#5 📝 Tranche 5 of documentation migration #32

Merged 1 commit on May 1, 2024
165 changes: 165 additions & 0 deletions docs/modules/ROOT/pages/data-encryption.adoc
@@ -0,0 +1,165 @@
[#_data_encryption]
= Data Encryption

An important security service provided by the fabrication process is data encryption. Data encryption is a method
of encoding and obscuring information so that it is unreadable by unauthorized entities.

== Encryption Method

=== AES encryption
AES encryption is a symmetric-key algorithm, which means the same key is used to encrypt and decrypt data.
To encrypt data using this algorithm, first generate a 128-bit (16-character) key and save it in the
`encrypt.properties` file, then follow the examples below to encrypt data with the AES algorithm.
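
For illustration, a random 16-character key could be generated with a short Python snippet like the one below
(any source of 16 random ASCII characters works):

.Generating a key (illustrative)
[source,python]
----
import secrets
import string

# Generate a random 16-character key (16 ASCII characters = 128 bits of key material).
alphabet = string.ascii_letters + string.digits
key = "".join(secrets.choice(alphabet) for _ in range(16))
print(key)
----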

.encrypt.properties
[source,properties]
----
encrypt.aes.secret.key.spec=<SOME_AES_ENCRYPTED_VALUE_HERE>
----

.Java
[source,java]
----
import com.boozallen.aissemble.data.encryption.AiopsEncrypt; // interface type (assumed to share the package)
import com.boozallen.aissemble.data.encryption.SimpleAesEncrypt;

AiopsEncrypt simpleAiopsEncrypt = new SimpleAesEncrypt();

//Encrypt
String encryptedData = simpleAiopsEncrypt.encryptValue("Plain text string data");

//Decrypt
String decryptedData = simpleAiopsEncrypt.decryptValue(encryptedData);
----

.Python
[source,python]
----
from aiops_encrypt.aes_encrypt import AESEncrypt

aes_encrypt = AESEncrypt()

# Encrypt
encrypted_value = aes_encrypt.encrypt('Plain text string data')

# Decrypt
decrypted_value = aes_encrypt.decrypt(encrypted_value)
----

=== Vault encryption
Another encryption service is provided through HashiCorp Vault. This is a well-known provider of security services,
and the fabrication process generates the client methods for calling it and encrypting data. In order to encrypt
data using Vault, the following properties need to be set:

.encrypt.properties
[source,properties]
----
secrets.host.url=[Some URL]
# Example: http://vault:8217
secrets.unseal.keys=[Comma-delimited set of keys]
# Example (no quotes or whitespace): key1,key2,key3,key4,key5
secrets.root.key=[The root key]
# Example: s.EXAMPLE
----

The keys in the example property file above can be retrieved from the Vault console log; they are printed at
the top of the log when the server starts. For example:

.Vault console log
[source,text]
----
vault | ROOT KEY
vault | s.EXAMPLE
vault | UNSEAL KEYS
vault | ["key1", "key2", "key3", "key4", "key5"]
vault | TRANSIT TOKEN
vault | {"request_id": "29d26f42-7be2-9b06-c4ce-1ecc94114393", "lease_id": "", "renewable": false, "lease_duration": 0, "data": null, "wrap_info": null, "warnings": null, "auth": {"client_token": "s.TOKEN", "accessor": "zFcMdiOHhtXUyRTUigkePpzS", "policies": ["app-aiops", "default"], "token_policies": ["app-aiops", "default"], "metadata": null, "lease_duration": 2764800, "renewable": true, "entity_id": "", "token_type": "service", "orphan": false}}
----

To add Vault encryption to your code, follow the example below. Note that the Vault Docker container must be
running for this example to work.

.Java
[source,java]
----
import com.boozallen.aissemble.data.encryption.AiopsEncrypt; // interface type (assumed to share the package)
import com.boozallen.aissemble.data.encryption.VaultEncrypt;

AiopsEncrypt vaultAiopsEncrypt = new VaultEncrypt();

//Encrypt
String encryptedData = vaultAiopsEncrypt.encryptValue("Plain text string data");

//Decrypt
String decryptedVaultData = vaultAiopsEncrypt.decryptValue(encryptedData);
----

.Python
[source,python]
----
from aiops_encrypt.vault_remote_encryption_strategy import VaultRemoteEncryptionStrategy
from aiops_encrypt.vault_local_encryption_strategy import VaultLocalEncryptionStrategy

vault_client = VaultRemoteEncryptionStrategy()

# Optionally, you can use local Vault encryption (vault_client = VaultLocalEncryptionStrategy()),
# which downloads the encryption key once and performs encryption locally without a round trip
# to the Vault server. This is useful for encrypting large data objects and for high-volume encryption tasks.
# NOTE: If you are encrypting your data through a user-defined function (UDF) in PySpark, you need to use
# VaultLocalEncryptionStrategy; the remote version currently causes threading issues. This issue will
# likely be resolved in a future update to the HashiCorp Vault client.

# Encrypt
encrypted_value = vault_client.encrypt('Plain text string data')

# Decrypt
decrypted_value = vault_client.decrypt(encrypted_value)
----


== Encryption by policy

The fabrication process generates built-in encryption code that can be activated through a policy file.
When an encryption policy is configured, the pipeline applies encryption to the fields specified in the policy.
The following example illustrates how to encrypt a field named "ssn" during the ingest step.

.example-encrypt-policy.json (can be named anything as long as it's in the correct policy directory)
[source,json]
----
[
    {
        "identifier": "encryptSSN",
        "rules": [
            {
                "className": "EncryptRule",
                "configurations": {
                    "description": "Apply encryption policy"
                }
            }
        ],
        "encryptPhase": "ingest",
        "encryptFields": [
            "ssn"
        ],
        "encryptAlgorithm": "AES_ENCRYPT"
    }
]
----

This file should be placed in a directory specified by the user in the `policy-configuration.properties`
file (see example below).

`encryptPhase` - The step in the pipeline where encryption takes place. Typically, this will happen in the first step.

`encryptFields` - An array of field names that will be encrypted.

`encryptAlgorithm` - The algorithm that will be used to encrypt the data. Currently, the options are `AES_ENCRYPT`
and `VAULT_ENCRYPT`. More can be added through customization.

.policy-configuration.properties
[source,properties]
----
policies-location=policies
----

This configuration defines which folder the encryption policy resides in. In the example above, the policies are
in the `policies` directory (relative to the working directory). An absolute path can also be used.
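
Assuming the relative path above, the project layout might look like the following (file names are illustrative):

.Example directory layout
[source,text]
----
project-root/
├── policy-configuration.properties
└── policies/
    └── example-encrypt-policy.json
----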
7 changes: 7 additions & 0 deletions docs/modules/ROOT/pages/data-profiling-details.adoc
@@ -0,0 +1,7 @@
= Data Profiling

WARNING: Data profiling is currently in an incubating state and is undergoing significant changes. Stay
tuned for information on our upcoming improved offerings!
//Machine learning pipelines should not have data profiling defined in the metamodel


125 changes: 125 additions & 0 deletions docs/modules/ROOT/pages/data-validation.adoc
@@ -0,0 +1,125 @@
= Data Validation

The Data Validation component ensures the quality, feasibility, and accuracy of data before it is used for machine
learning. Validation boosts machine learning confidence by ensuring data is cleansed, standardized, and ready for use.
By leveraging data validation from the xref:semantic-data.adoc#_semantic_data[semantic data model], consistent data
validation rules are applied throughout the entire project. This page will walk you through how to leverage data
validation.

== What Gets Generated
For each record metamodel, a handful of methods are generated (outlined below) that can be leveraged in the
implementation logic of your pipeline steps to apply data validation. These methods have default logic implemented in
generated base classes but can be customized by overriding them in the corresponding implementation class.

For the following method documentation, assume the record name is `TaxPayer` and the dictionary type is `Ssn`.

=== Java Implementation
****
.TaxPayerSchema.java
[source,java]
----
/**
* Spark-optimized application of the record validation. This should be the preferred method for validating your
* dataset in a Spark pipeline. Override this method to customize how data should be validated for the dataset.
*/
public Dataset<Row> validateDataFrame(Dataset<Row> data)
----
_Parameters:_

* `data` – dataset to be filtered

_Return:_ `validData` – dataset with invalid records removed
****

****
.Ssn.java
[source,java]
----
/**
* Applies validation in the dictionary metamodels to a specific field.
* Override this method to customize how the dictionary type should be validated.
*/
public void validate()
----

_Parameters:_ None

_Return:_ None

_Throws:_

* `ValidationException` - if the field fails to meet the validation rules
****

****
.TaxPayer.java
[source,java]
----
/**
* Applies validation described in the record metamodel and applies validation in the dictionary to any relevant fields.
 * Override this method to customize how the record should be validated.
*/
public void validate()
----

_Parameters:_ None

_Return:_ None

_Throws:_

* `ValidationException` - if any field fails to meet the validation rules
****

=== Python Implementation
****
.tax_payer_schema.py
[source,python]
----
# Spark-optimized application of the record validation. This should be the preferred method for validating your dataset in a PySpark pipeline.
# Override this method to customize how data should be validated for the dataset.
def validate_dataset(ingest_dataset: DataFrame)
----

_Parameters:_

* `ingest_dataset` – dataset to be filtered

_Return:_ `valid_data` – dataset with invalid records removed
****
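
As a hypothetical usage sketch (the module path and instantiation are assumptions, not guarantees about the
generated API), the Spark-optimized validation could be applied in a PySpark step as follows:

.Example usage (illustrative)
[source,python]
----
from pyspark.sql import SparkSession

# Hypothetical module path for the generated schema class; adjust to your project.
from tax_payer_schema import TaxPayerSchema

spark = SparkSession.builder.getOrCreate()
ingest_dataset = spark.createDataFrame([("123-45-6789",), ("not-an-ssn",)], ["ssn"])

schema = TaxPayerSchema()
valid_data = schema.validate_dataset(ingest_dataset)  # invalid records are dropped
valid_data.show()
----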

****
.ssn.py
[source,python]
----
# Applies validation in the dictionary metamodels to a specific field.
# Override this method to customize how the dictionary type should be validated.
def validate()
----

_Parameters:_ None

_Return:_ None

_Throws:_

* `ValueError` - raised when the value does not match any valid format
****

****
.tax_payer.py
[source,python]
----
# Applies validation described in the record metamodel and applies validation in the dictionary to any relevant fields.
# Override this method to customize how the record should be validated.
def validate()
----

_Parameters:_ None

_Return:_ None

_Throws:_

* `ValueError` - raised when the value does not match any valid format.
****
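
As a purely illustrative sketch of customization (the base-class name and `ssn` field are assumptions), the
generated `validate` hook can be overridden in the implementation class:

.Overriding validation (illustrative)
[source,python]
----
# Hypothetical sketch: TaxPayerBase stands in for the generated base class.
class TaxPayerBase:
    def __init__(self, ssn=None):
        self.ssn = ssn

    def validate(self):
        pass  # the generated record and dictionary validation would run here


class TaxPayer(TaxPayerBase):
    def validate(self):
        super().validate()  # keep the generated rules
        if self.ssn is None:
            raise ValueError("ssn is required")
----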
9 changes: 9 additions & 0 deletions docs/modules/ROOT/pages/drift-detection.adoc
@@ -0,0 +1,9 @@
[#_drift_detection]
= Drift Detection

The conceptual idea of drift is that as deployed artificial intelligence (AI) systems adapt to evolving data streams,
their predictive power may degrade; their inferences may “drift” away from the intended targets. When using
semantic data models, it is possible to ensure that data are consistently monitored for drift, thus maintaining the
performance of AI systems.

Please contact the team for more information.