diff --git a/docs/readthedocs/source/doc/PPML/Overview/azure_ppml.md b/docs/readthedocs/source/doc/PPML/Overview/azure_ppml.md index 7ed72acd494..ce2808c75b1 100644 --- a/docs/readthedocs/source/doc/PPML/Overview/azure_ppml.md +++ b/docs/readthedocs/source/doc/PPML/Overview/azure_ppml.md @@ -3,7 +3,7 @@ ## 1. Introduction Protecting privacy and confidentiality is critical for large-scale data analysis and machine learning. BigDL ***PPML*** combines various low-level hardware and software security technologies (e.g., [IntelĀ® Software Guard Extensions (IntelĀ® SGX)](https://www.intel.com/content/www/us/en/architecture-and-technology/software-guard-extensions.html), [Library Operating System (LibOS)](https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Library-OS-is-the-New-Container-Why-is-Library-OS-A-Better-Option-for-Compatibility-and-Sandboxing-Chia-Che-Tsai-UC-Berkeley.pdf) such as [Graphene](https://github.com/gramineproject/graphene) and [Occlum](https://github.com/occlum/occlum), [Federated Learning](https://en.wikipedia.org/wiki/Federated_learning), etc.), so that users can continue to apply standard Big Data and AI technologies (such as Apache Spark, Apache Flink, Tensorflow, PyTorch, etc.) without sacrificing privacy. -BigDL PPML on Azure solution integrate BigDL ***PPML*** technology with Azure Services(Azure Kubernetes Service, Azure Storage Account, Azure Key Vault, etc.) to faciliate Azure customer to create Big Data and AI applications while getting high privacy and confidentiality protection. +BigDL PPML on Azure solution integrate BigDL ***PPML*** technology with Azure Services(Azure Kubernetes Service, Azure Storage Account, Azure Key Vault, etc.) to facilitate Azure customer to create Big Data and AI applications while getting high privacy and confidentiality protection. ### Overall Architecture ![](../images/ppml_azure_latest.png) @@ -15,7 +15,7 @@ BigDL PPML on Azure solution integrate BigDL ***PPML*** technology with Azure Se ### 2.1 Install Azure CLI Before you setup your environment, please install Azure CLI on your machine according to [Azure CLI guide](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli). -Then run `az login` to login to Azure system before you run following Azure commands. +Then run `az login` to login to Azure system before you run the following Azure commands. ### 2.2 Create Azure VM for hosting BigDL PPML image #### 2.2.1 Create Resource Group @@ -33,10 +33,10 @@ For size of the VM, please choose DC-V3 Series VM with more than 4 vCPU cores. #### 2.2.3 Pull BigDL PPML image and run on Linux client * Go to Azure Marketplace, search "BigDL PPML" and find `BigDL PPML: Secure Big Data AI on Intel SGX` product. Click "Create" button which will lead you to `Subscribe` page. -On `Subscribe` page, input your subscription, your Azure container registry, your resource group, location. Then click `Subscribe` to subscribe BigDL PPML to your container registry. +On `Subscribe` page, input your subscription, your Azure container registry, your resource group and your location. Then click `Subscribe` to subscribe BigDL PPML to your container registry. * Go to your Azure container regsitry, check `Repostirories`, and find `intel_corporation/bigdl-ppml-trusted-big-data-ml-python-graphene` -* Login to the created VM. Then login to your Azure container registry, pull BigDL PPML image using such command: +* Login to the created VM. 
Then login to your Azure container registry, pull BigDL PPML image using this command: ```bash docker pull myContainerRegistry/intel_corporation/bigdl-ppml-trusted-big-data-ml-python-graphene ``` @@ -67,7 +67,7 @@ Create AKS or use existing AKS with Intel SGX support. In your BigDL PPML container, you can run `/ppml/trusted-big-data-ml/azure/create-aks.sh` to create AKS with confidential computing support. -Note: Please use same VNet information of your client to create AKS. And use DC-Series VM size(i.e.Standard_DC8ds_v3) to create AKS. +Note: Please use the same VNet information of your client to create AKS. And use DC-Series VM size(i.e.Standard_DC8ds_v3) to create AKS. ```bash /ppml/trusted-big-data-ml/azure/create-aks.sh \ --resource-group myResourceGroup \ @@ -79,13 +79,13 @@ Note: Please use same VNet information of your client to create AKS. And use DC- --node-count myAKSInitNodeCount ``` -You can check the information by run: +You can check the information by running: ```bash /ppml/trusted-big-data-ml/azure/create-aks.sh --help ``` ## 2.4 Create Azure Data Lake Store Gen 2 -### 2.4.1 Create Data Lake Storage account or use existing one. +### 2.4.1 Create Data Lake Storage account or use an existing one. The example command to create Data Lake store is as below: ```bash az dls account create --account myDataLakeAccount --location myLocation --resource-group myResourceGroup @@ -113,16 +113,16 @@ az storage fs directory upload -f myFS --account-name myDataLakeAccount -s "path You can access Data Lake Storage in Hadoop filesytem by such URI: ```abfs[s]://file_system@account_name.dfs.core.windows.net///``` #### Authentication The ABFS driver supports two forms of authentication so that the Hadoop application may securely access resources contained within a Data Lake Storage Gen2 capable account. -- Shared Key: This permits users access to ALL resources in the account. The key is encrypted and stored in Hadoop configuration. +- Shared Key: This permits users to access to ALL resources in the account. The key is encrypted and stored in Hadoop configuration. - Azure Active Directory OAuth Bearer Token: Azure AD bearer tokens are acquired and refreshed by the driver using either the identity of the end user or a configured Service Principal. Using this authentication model, all access is authorized on a per-call basis using the identity associated with the supplied token and evaluated against the assigned POSIX Access Control List (ACL). By default, in our solution, we use shared key authentication. -- Get Access key list of storage account: +- Get Access key list of the storage account: ```bash az storage account keys list -g MyResourceGroup -n myDataLakeAccount ``` -Use one of the keys in authentication. +Use one of the keys for authentication. ## 2.5 Create Azure Key Vault ### 2.5.1 Create or use an existing Azure Key Vault @@ -245,16 +245,16 @@ Run such scripts to generate keys: When entering the passphrase or password, you could input the same password by yourself; and these passwords could also be used for the next step of generating other passwords. Password should be longer than 6 bits and contain numbers and letters, and one sample password is "3456abcd". These passwords would be used for future remote attestations and to start SGX enclaves more securely. 
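If you prefer not to invent a passphrase by hand, one possible way (not required by this guide) to generate a random alphanumeric password that meets these requirements is sketched below; it assumes `openssl` is available on your machine, and the length of 16 characters is an arbitrary choice:

```bash
# Sketch only: generate a random 16-character alphanumeric password.
# Record it securely -- the same value is reused in the following steps.
openssl rand -base64 48 | tr -dc 'a-zA-Z0-9' | head -c 16; echo
```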
### 3.3 Generate password -Run such script to save password to Azure Key Vault +Run such script to save the password to Azure Key Vault ```bash /ppml/trusted-big-data-ml/azure/generate-password-az.sh myKeyVault used_password_when_generate_keys ``` -### 3.4 Save kube config to secret +### 3.4 Save kubeconfig to secret Login to AKS use such command: ```bash az aks get-credentials --resource-group myResourceGroup --name myAKSCluster ``` -Run such script to save kube config to secret +Run such script to save kubeconfig to secret ```bash /ppml/trusted-big-data-ml/azure/kubeconfig-secret.sh ``` @@ -351,19 +351,21 @@ export TF_MKL_ALLOC_MAX_BYTES=10737418240 && \ ``` ## 4. Run TPC-H example -TPC-H queries implemented in Spark using the DataFrames API running with BigDL PPML. +TPC-H queries are implemented using Spark DataFrames API running with BigDL PPML. ### 4.1 Generating tables Go to [TPC Download](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) site, choose `TPC-H` source code, then download the TPC-H toolkits. -After you download the tpc-h tools zip and uncompressed the zip file. Go to `dbgen` directory, and create a makefile based on `makefile.suite`, and run `make`. +After you download the TPC-h tools zip and uncompressed the zip file. Go to `dbgen` directory and create a makefile based on `makefile.suite`, and run `make`. + +This should generate an executable called `dbgen`. -This should generate an executable called `dbgen` ``` ./dbgen -h ``` -gives you the various options for generating the tables. The simplest case is running: +`dbgen` gives you various options for generating the tables. The simplest case is running: + ``` ./dbgen ``` @@ -376,7 +378,7 @@ will generate roughly 10GB of input data. ### 4.2 Generate primary key and data key Generate primary key and data key, then save to file system. -The example code of generate primary key and data key is like below: +The example code for generating the primary key and data key is like below: ``` java -cp '/ppml/trusted-big-data-ml/work/bigdl-2.1.0-SNAPSHOT/jars/*:/ppml/trusted-big-data-ml/work/spark-3.1.2/conf/:/ppml/trusted-big-data-ml/work/spark-3.1.2/jars/* \ -Xmx10g \ @@ -390,7 +392,7 @@ java -cp '/ppml/trusted-big-data-ml/work/bigdl-2.1.0-SNAPSHOT/jars/*:/ppml/trust ### 4.3 Encrypt Data Encrypt data with specified BigDL `AzureKeyManagementService` -The example code of encrypt data is like below: +The example code of encrypting data is like below: ``` java -cp '/ppml/trusted-big-data-ml/work/bigdl-2.1.0-SNAPSHOT/jars/*:/ppml/trusted-big-data-ml/work/spark-3.1.2/conf/:/ppml/trusted-big-data-ml/work/spark-3.1.2/jars/* \ -Xmx10g \ @@ -406,13 +408,14 @@ java -cp '/ppml/trusted-big-data-ml/work/bigdl-2.1.0-SNAPSHOT/jars/*:/ppml/trust After encryption, you may upload encrypted data to Azure Data Lake store. The example script is like below: + ```bash az storage fs directory upload -f myFS --account-name myDataLakeAccount -s xxx/dbgen-encrypted -d myDirectory --recursive ``` ### 4.4 Running Make sure you set the INPUT_DIR and OUTPUT_DIR in `TpchQuery` class before compiling to point to the -location the of the input data and where the output should be saved. +location of the input data and where the output should be saved. The example script to run a query is like: @@ -499,7 +502,7 @@ export TF_MKL_ALLOC_MAX_BYTES=10737418240 && \ $INPUT_DIR $OUTPUT_DIR aes_cbc_pkcs5padding plain_text [QUERY] ``` -INPUT_DIR is the tpch's data dir. +INPUT_DIR is the TPC-H's data dir. 
OUTPUT_DIR is the dir to write the query result. The optional parameter [QUERY] is the number of the query to run e.g 1, 2, ..., 22 diff --git a/docs/readthedocs/source/doc/PPML/Overview/ppml.md b/docs/readthedocs/source/doc/PPML/Overview/ppml.md index 739d6279d84..5c41141de93 100644 --- a/docs/readthedocs/source/doc/PPML/Overview/ppml.md +++ b/docs/readthedocs/source/doc/PPML/Overview/ppml.md @@ -42,7 +42,7 @@ cd BigDL/ppml/ 2. Generate the signing key for SGX Enclaves - Generate the enclave key using the command below, keep it safely for future remote attestations and to start SGX Enclaves more securely. It will generate a file `enclave-key.pem` in the current working directory, which will be the enclave key. To store the key elsewhere, modify the output file path. + Generate the enclave key using the command below, keep it safely for future remote attestations and to start SGX Enclaves more securely. It will generate a file `enclave-key.pem` in the current working directory, which will be the enclave key. To store the key elsewhere, modify the output file path. ```bash cd scripts/ @@ -128,9 +128,9 @@ Enter `BigDL/ppml/trusted-big-data-ml/python/docker-graphene` dir. **KEYS_PATH** means the absolute path to the keys you just created and copied to. According to the above commands, the path would be like "BigDL/ppml/trusted-big-data-ml/python/docker-graphene/keys"
**LOCAL_IP** means your local IP address.
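As a quick illustration, these two variables might be exported like this before running the start script (both values below are placeholders to adapt to your own environment):

```bash
# Placeholder values -- replace with your actual keys path and local IP.
export KEYS_PATH=/path/to/BigDL/ppml/trusted-big-data-ml/python/docker-graphene/keys
export LOCAL_IP=192.168.0.112
```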
-##### 2.2.2.2 Run Your Spark Program with BigDL PPML on SGX +##### 2.2.2.2 Run Your Spark Applications with BigDL PPML on SGX -To run your pyspark program, you need to prepare your own pyspark program and put it under the trusted directory in SGX `/ppml/trusted-big-data-ml/work`. Then run with `bigdl-ppml-submit.sh` using the command: +To run your PySpark application, you need to prepare your PySpark application and put it under the trusted directory in SGX `/ppml/trusted-big-data-ml/work`. Then run with `bigdl-ppml-submit.sh` using the command: ```bash ./bigdl-ppml-submit.sh work/YOUR_PROMGRAM.py | tee YOUR_PROGRAM-sgx.log @@ -164,7 +164,7 @@ The result should look something like this: This example shows how to run trusted Spark SQL (e.g., TPC-H queries). -First, download and install sbt from [here](https://www.scala-sbt.org/download.html) and deploy an Hadoop Distributed File System(HDFS) from [here](https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-common/ClusterSetup.html) for the Transaction Processing Performance Council Benchmark H (TPC-H) dataset and output, then build the source codes with SBT and generate the TPC-H dataset according to the TPC-H example from [here](https://github.com/intel-analytics/zoo-tutorials/tree/master/tpch-spark). After that, check if there is `spark-tpc-h-queries_2.11-1.0.jar` under `tpch-spark/target/scala-2.11`; if so, we have successfully packaged the project. +First, download and install sbt from [here](https://www.scala-sbt.org/download.html) and deploy a Hadoop Distributed File System(HDFS) from [here](https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-common/ClusterSetup.html) for the Transaction Processing Performance Council Benchmark H (TPC-H) dataset and output, then build the source codes with SBT and generate the TPC-H dataset according to the TPC-H example from [here](https://github.com/intel-analytics/zoo-tutorials/tree/master/tpch-spark). After that, check if there is `spark-tpc-h-queries_2.11-1.0.jar` under `tpch-spark/target/scala-2.11`; if so, we have successfully packaged the project. Copy the TPC-H package to the container: @@ -224,11 +224,11 @@ The result should look like this: WARNING: If you want spark standalone mode, please refer to [standalone/README.md](https://github.com/intel-analytics/BigDL/blob/main/ppml/trusted-big-data-ml/python/docker-graphene/standalone/README.md). But it is not recommended. -Follow the guide below to run Spark on Kubernetes manually. Alternatively, you can also use Helm to set everything up automatically. See [kubernetes/README.md](https://github.com/intel-analytics/BigDL/blob/main/ppml/trusted-big-data-ml/python/docker-graphene/kubernetes/README.md). +Follow the guide below to run Spark on Kubernetes manually. Alternatively, you can also use Helm to set everything up automatically. See [Kubernetes/README.md](https://github.com/intel-analytics/BigDL/blob/main/ppml/trusted-big-data-ml/python/docker-graphene/kubernetes/README.md). ##### 2.2.3.1 Configure the Environment -1. Enter `BigDL/ppml/trusted-big-data-ml/python/docker-graphene` dir. Refer to the previous section about [preparing data, key and password](#2221-start-ppml-container). Then run the following commands to generate your enclave key and add it to your Kubernetes cluster as a secret. +1. Enter `BigDL/ppml/trusted-big-data-ml/python/docker-graphene` dir. Refer to the previous section about [preparing data, keys and passwords](#2221-start-ppml-container). 
Then run the following commands to generate your enclave key and add it to your Kubernetes cluster as a secret. ```bash kubectl apply -f keys/keys.yaml @@ -243,12 +243,12 @@ kubectl create serviceaccount spark kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default ``` -3. Generate k8s config file, modify `YOUR_DIR` to the location you want to store the config: +3. Generate K8s config file, modify `YOUR_DIR` to the location you want to store the config: ```bash kubectl config view --flatten --minify > /YOUR_DIR/kubeconfig ``` -4. Create k8s secret, the secret created `YOUR_SECRET` should be the same as the password you specified in step 1: +4. Create K8s secret, the secret created `YOUR_SECRET` should be the same as the password you specified in step 1: ```bash kubectl create secret generic spark-secret --from-literal secret=YOUR_SECRET @@ -256,7 +256,7 @@ kubectl create secret generic spark-secret --from-literal secret=YOUR_SECRET ##### 2.2.3.2 Start the client container -Configure the environment variables in the following script before running it. Check [Bigdl ppml SGX related configurations](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#1-bigdl-ppml-sgx-related-configurations) for detailed memory configurations. Modify `YOUR_DIR` to the location you specify in section 2.2.3.1. Modify `$LOCAL_IP` to the IP address of your machine. +Configure the environment variables in the following script before running it. Check [BigDL PPML SGX related configurations](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml/python/docker-graphene#1-bigdl-ppml-sgx-related-configurations) for detailed memory configurations. Modify `YOUR_DIR` to the location you specify in section 2.2.3.1. Modify `$LOCAL_IP` to the IP address of your machine. ```bash export K8S_MASTER=k8s://$( sudo kubectl cluster-info | grep 'https.*' -o -m 1 ) @@ -305,20 +305,20 @@ sudo docker run -itd \ $DOCKER_IMAGE bash ``` -##### 2.2.3.3 Init the client and run Spark applications on k8s +##### 2.2.3.3 Init the client and run Spark applications on K8s -1. Run `docker exec -it spark-local-k8s-client bash` to entry the container. Then run the following command to init the Spark local k8s client. +1. Run `docker exec -it spark-local-k8s-client bash` to enter the container. Then run the following command to init the Spark local K8s client. ```bash ./init.sh ``` -2. We assume you have a working Network File System (NFS) configured for your Kubernetes cluster. Configure the `nfsvolumeclaim` on the last line to the name of the Persistent Volume Claim (PVC) of your NFS.Please prepare the following and put them in your NFS directory: +2. We assume you have a working Network File System (NFS) configured for your Kubernetes cluster. Configure the `nfsvolumeclaim` on the last line to the name of the Persistent Volume Claim (PVC) of your NFS. Please prepare the following and put them in your NFS directory: - The data (in a directory called `data`) - The kubeconfig file. -3. Run the following command to start Spark-Pi example. When the appliction runs in `cluster` mode, you can run ` kubectl get pod ` to get the name and status of your k8s pod(e.g. driver-xxxx). Then you can run ` kubectl logs -f driver-xxxx ` to get the output of your appliction. +3. Run the following command to start Spark-Pi example. 
When the application runs in `cluster` mode, you can run ` kubectl get pod ` to get the name and status of your K8s pod(e.g., driver-xxxx). Then you can run ` kubectl logs -f driver-xxxx ` to get the output of your application. ```bash #!/bin/bash @@ -379,7 +379,7 @@ export TF_MKL_ALLOC_MAX_BYTES=10737418240 && \ local:///ppml/trusted-big-data-ml/work/spark-3.1.2/examples/jars/spark-examples_2.12-3.1.2.jar 100 2>&1 | tee spark-pi-sgx-$SPARK_MODE.log ``` -You can run your own Spark Appliction after changing `--class` and jar path. +You can run your own Spark application after changing `--class` and jar path. 1. `local:///ppml/trusted-big-data-ml/work/spark-3.1.2/examples/jars/spark-examples_2.12-3.1.2.jar` => `your_jar_path` 2. `--class org.apache.spark.examples.SparkPi` => `--class your_class_path` @@ -425,9 +425,9 @@ Enter `BigDL/ppml/trusted-big-data-ml/python/docker-graphene` directory. ./init.sh ``` -##### 2.3.2.2 Run Your Pyspark Program with BigDL PPML on SGX +##### 2.3.2.2 Run Your PySpark Applications with BigDL PPML on SGX -To run your pyspark program, you need to prepare your own pyspark program and put it under the trusted directory in SGX `/ppml/trusted-big-data-ml/work`. Then run with `bigdl-ppml-submit.sh` using the command: +To run your PySpark application, you need to prepare your PySpark application and put it under the trusted directory in SGX `/ppml/trusted-big-data-ml/work`. Then run with `bigdl-ppml-submit.sh` using the command: ```bash ./bigdl-ppml-submit.sh work/YOUR_PROMGRAM.py | tee YOUR_PROGRAM-sgx.log @@ -435,7 +435,7 @@ To run your pyspark program, you need to prepare your own pyspark program and pu When the program finishes, check the results with the log `YOUR_PROGRAM-sgx.log`. -##### 2.3.2.3 Run Python and Pyspark Examples with BigDL PPML on SGX +##### 2.3.2.3 Run Python and PySpark Examples with BigDL PPML on SGX ##### 2.3.2.3.1 Run Trusted Python Helloworld @@ -649,11 +649,11 @@ The result should contain the content look like this: > >Stopping orca context -##### 2.3.2.3.8 Run Trusted Spark Orca Learn Tensorflow Basic Text Classification +##### 2.3.2.3.8 Run Trusted Spark Orca Tensorflow Text Classification -This example shows how to run Trusted Spark Orca learn Tensorflow basic text classification. +This example shows how to run Trusted Spark Orca Tensorflow text classification. -Run the script to run Trusted Spark Orca learn Tensorflow basic text classification and it would take some time to show the final results. To run this example in standalone mode, replace `-e SGX_MEM_SIZE=32G \` with `-e SGX_MEM_SIZE=64G \` in `start-distributed-spark-driver.sh` +Run the script to run Trusted Spark Orca Tensorflow text classification and it would take some time to show the final results. To run this example in standalone mode, replace `-e SGX_MEM_SIZE=32G \` with `-e SGX_MEM_SIZE=64G \` in `start-distributed-spark-driver.sh` ```bash bash start-spark-local-orca-tf-text.sh @@ -673,7 +673,7 @@ The result should be similar to: ##### 2.3.3.1 Configure the Environment -Prerequisite: passwordless ssh login to all the nodes needs to be properly set up first. +Prerequisite: [no password ssh login](http://www.linuxproblem.org/art_9.html) to all the nodes needs to be properly set up first. 
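For reference, one common way to set up passwordless SSH from the master node to each worker node is sketched below (the user and host names are placeholders, and this is not the only possible method):

```bash
# Generate a key pair once on the master node (accept the defaults),
# then copy the public key to every worker node.
ssh-keygen -t rsa -b 4096
ssh-copy-id user@worker-node-1
ssh-copy-id user@worker-node-2
```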
```bash nano environments.sh @@ -681,7 +681,7 @@ nano environments.sh ##### 2.3.3.2 Start Distributed Big Data and ML Platform -First run the following command to start the service: +First, run the following command to start the service: ```bash ./deploy-distributed-standalone-spark.sh @@ -801,6 +801,7 @@ The result should look like this: (bodkin,1) (bourn,1) ``` + #### 3.3.4 Run Trusted Cluster Serving Start Cluster Serving as follows: @@ -809,7 +810,7 @@ Start Cluster Serving as follows: ./start-local-cluster-serving.sh ``` -After all services are ready, you can directly push inference requests int queue with [Restful API](https://analytics-zoo.github.io/master/#ClusterServingGuide/ProgrammingGuide/#restful-api). Also, you can push image/input into queue with Python API +After all cluster serving services are ready, you can directly push inference requests into the queue with [Restful API](https://analytics-zoo.github.io/master/#ClusterServingGuide/ProgrammingGuide/#restful-api). Also, you can push image/input into the queue with Python API ```python from bigdl.serving.client import InputQueue @@ -817,7 +818,7 @@ input_api = InputQueue() input_api.enqueue('my-image1', user_define_key={"path": 'path/to/image1'}) ``` -Cluster Serving service is a long running service in container, you can stop it as follows: +Cluster Serving service is a long-running service in containers, you can stop it as follows: ```bash docker stop trusted-cluster-serving-local diff --git a/docs/readthedocs/source/doc/PPML/Overview/trusted_big_data_analytics_and_ml.md b/docs/readthedocs/source/doc/PPML/Overview/trusted_big_data_analytics_and_ml.md index 37e3883ca78..4a443a7798b 100644 --- a/docs/readthedocs/source/doc/PPML/Overview/trusted_big_data_analytics_and_ml.md +++ b/docs/readthedocs/source/doc/PPML/Overview/trusted_big_data_analytics_and_ml.md @@ -9,11 +9,11 @@ BigDL helps to build PPML applications (including big data analytics, machine le ## [1. Trusted Big Data ML](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-big-data-ml) -With the trusted Big Data analytics and ML/DL support, users can run standard Spark data analysis (such as Spark SQL, Dataframe, MLlib, etc.) and distributed deep learning (using BigDL) in a secure and trusted fashion. +With trusted Big Data analytics and ML/DL support, users can run standard Spark data analysis (such as Spark SQL, Dataframe, MLlib, etc.) and distributed deep learning (using BigDL) in a secure and trusted fashion. ## [2. Trusted Real Time ML](https://github.com/intel-analytics/BigDL/tree/main/ppml/trusted-realtime-ml/scala) -With the trusted realtime compute and ML/DL support, users can run standard Flink stream processing and distributed DL model inference (using Cluster Serving) in a secure and trusted fashion. +With the trusted real time compute and ML/DL support, users can run standard Flink stream processing and distributed DL model inference (using Cluster Serving) in a secure and trusted fashion. ## 3. 
Intel SGX and LibOS
diff --git a/docs/readthedocs/source/doc/PPML/Overview/trusted_fl.md b/docs/readthedocs/source/doc/PPML/Overview/trusted_fl.md
index 31df934ba62..ce3491a02f3 100644
--- a/docs/readthedocs/source/doc/PPML/Overview/trusted_fl.md
+++ b/docs/readthedocs/source/doc/PPML/Overview/trusted_fl.md
@@ -1,17 +1,17 @@
 # Trusted FL (Federated Learning)
-[Federated Learning](https://en.wikipedia.org/wiki/Federated_learning) is a new tool in PPML (Privacy Preserving Machine Learning), which empowers multi-parities to build united model across different parties without compromising privacy, even if these parities have different datasets or features. In FL training stage, sensitive data will be kept locally, only temp gradients or weights will be safely aggregated by a trusted third-parity. In our design, this trusted third-parity is fully protected by Intel SGX.
+[Federated Learning](https://en.wikipedia.org/wiki/Federated_learning) is a new tool in PPML (Privacy Preserving Machine Learning), which empowers multiple parties to build a united model without compromising privacy, even if these parties have different datasets or features. In the FL training stage, sensitive data will be kept locally, and only temporary gradients or weights will be safely aggregated by a trusted third party. In our design, this trusted third party is fully protected by Intel SGX.
-A number of FL tools or frameworks have been proposed to enable FL in different areas, i.e., OpenFL, TensorFlow Federated, FATE, Flower and PySyft etc. However, none of them is designed for Big Data scenario. To enable FL in big data ecosystem, BigDL PPML provides a SGX-based End-to-end Trusted FL platform. With this platform, data scientist and developers can easily setup FL applications upon distributed large scale datasets with a few clicks. To achieve this goal, we provides following features:
+A number of FL tools or frameworks have been proposed to enable FL in different areas, e.g., OpenFL, TensorFlow Federated, FATE, Flower, PySyft, etc. However, none of them is designed for Big Data scenarios. To enable FL in the big data ecosystem, BigDL PPML provides an SGX-based end-to-end Trusted FL platform. With this platform, data scientists and developers can easily set up FL applications upon distributed large-scale datasets with a few clicks. To achieve this goal, we provide the following features:
- * ID & feature align: figure out portions of local data that will participate in training stage
- * Horizontal FL: training across multi-parties with same features and different entities
- * Vertical FL: training across multi-parties with same entries and different features.
+ * ID & feature alignment: figure out the portions of local data that will participate in the training stage
+ * Horizontal FL: training across multiple parties with the same features and different entities
+ * Vertical FL: training across multiple parties with the same entries and different features.
-To ensure sensitive data are fully protected in training and inference stages, we make sure:
+To ensure sensitive data are fully protected in the training and inference stages, we make sure:
- * Sensitive data and weights are kept local, only temp gradients or weights will be safely aggregated by a trusted third-parity
- * Trusted third-parity, i.e., FL Server, is protected by SGX Enclaves
+ * Sensitive data and weights are kept local; only temporary gradients or weights will be safely aggregated by a trusted third party
+ * The trusted third party, i.e., the FL Server, is protected by SGX Enclaves
 * Local training environment is protected by SGX Enclaves (recommended but not enforced)
 * Network communication and Storage (e.g., data and model) protected by encryption and [Transport Layer Security (TLS)](https://en.wikipedia.org/wiki/Transport_Layer_Security)
@@ -25,7 +25,7 @@ Please ensure SGX is properly enabled, and SGX driver is installed. If not, plea
 1. Generate the signing key for SGX Enclaves
-   Generate the enclave key using the command below, keep it safely for future remote attestations and to start SGX Enclaves more securely. It will generate a file `enclave-key.pem` in the current working directory, which will be the enclave key. To store the key elsewhere, modify the output file path.
+   Generate the enclave key using the command below and keep it safe for future remote attestations and for starting SGX Enclaves more securely. It will generate a file `enclave-key.pem` in the current working directory, which will be the enclave key. To store the key elsewhere, modify the output file path.
 ```bash
 cd scripts/
@@ -74,7 +74,7 @@ If Dockerhub is not accessible, you can build docker image. Modify your `http_pr
 ## Start FLServer
-Before starting any local training client or worker, we need to start a Trusted third-parity, i.e., FL Server, for secure aggregation. In current design, this FL Server is running in SGX with help of Graphene or Occlum. Local workers/Clients can verify its integrity with SGX Remote Attestation.
+Before starting any local training client or worker, we need to start a trusted third party, i.e., the FL Server, for secure aggregation. In our design, this FL Server runs in SGX with the help of Graphene or Occlum. Local workers/clients can verify its integrity with SGX Remote Attestation.
 Running this command will start a docker container and initialize the SGX environment.
 In container, run:
 The fl-server will start and listen on 8980 port. Both horizontal fl-demo and vertical fl-demo need two clients. You can change the listening port and client number by editing `BigDL/scala/ppml/demo/ppml-conf.yaml`'s `serverPort` and `clientNum`.
-Note that we skip ID & Feature for simplify demo. In practice, before we start Federated Learning, we need to align ID & Feature, and figure out portions of local data that will participate in later training stage. In horizontal FL, feature align is required to ensure each party is training on the same features.
+Note that we skip ID & feature alignment to simplify the demo. In practice, before we start Federated Learning, we need to align ID & feature, and figure out the portions of local data that will participate in the later training stages. In horizontal FL, feature alignment is required to ensure each party is training on the same features.
In vertical FL, both ID and feature alignment are required to ensure each party is training on different features of the same record.
 ## HFL Logistic Regression
diff --git a/docs/readthedocs/source/doc/PPML/QuickStart/secure_your_services.md b/docs/readthedocs/source/doc/PPML/QuickStart/secure_your_services.md
index 84e358a6de3..558d129ea88 100644
--- a/docs/readthedocs/source/doc/PPML/QuickStart/secure_your_services.md
+++ b/docs/readthedocs/source/doc/PPML/QuickStart/secure_your_services.md
@@ -1,19 +1,19 @@
 # Secure Your Services
-This document is a gentle reminder for enabling security & privacy features for your services. To avoid privacy & security issues during deployment, we recommend Developer/Admin to go through this document, which suits users/customers who want to apply BigDL into their production environment (not just for PPML).
+This document is a gentle reminder for enabling security & privacy features for your services. To avoid privacy & security issues during deployment, we recommend Developers/Admins go through this document, which suits users/customers who want to apply BigDL in their production environment (not just for PPML).
 ## Security in data lifecycle
-Almost all Big Data & AI applications are built upon large scale dataset, we can simply go through security key steps in the data lifecycle. That is data protection in transit, in storage, and in use.
+Almost all Big Data & AI applications are built upon large-scale datasets, so we can simply go through the key security steps in the data lifecycle: data protection in transit, in storage, and in use.
 ### Secure Network (in transit)
 Big Data & AI applications are mainly distributed applications, which means we need to use lots of nodes to run our applications and get jobs done. During that period, not just control flows (command used to control applications running on different nodes), data partitions (a division of data) may also go through different nodes. So, we need to ensure all network traffic is fully protected.
-Talking about secure data transit, TLS is commonly used. The server would provide a private key and certificate chain. To make sure it is fully secured, a complete certificate chain is needed (with two or more certificates built). In addition, a SSL/TLS protocol and secure cipher tools would be used. It is also recommended to use forward secrecy and strong key exchange. However, it is general that secure approaches would bring some performance problems. To mitigate these problems, a series of approaches are available, including session resumption, cache, etc. For the details of this section, please see [SSL-and-TLS-Deployment-Best-Practices](https://github.com/ssllabs/research/wiki/SSL-and-TLS-Deployment-Best-Practices).
+Talking about securing data in transit, TLS is commonly used. The server provides a private key and certificate chain. To make sure it is fully secured, a complete certificate chain is needed (with two or more certificates). In addition, an up-to-date SSL/TLS protocol version and secure cipher suites should be used. It is also recommended to use forward secrecy and strong key exchange. However, such secure approaches generally bring some performance overhead. To mitigate this, a series of approaches are available, including session resumption, caching, etc. For the details of this section, please see [SSL-and-TLS-Deployment-Best-Practices](https://github.com/ssllabs/research/wiki/SSL-and-TLS-Deployment-Best-Practices).
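For local testing, a private key and a self-signed certificate can be created with OpenSSL as sketched below (the file names and the CN are placeholders); production services should instead use a certificate chain issued by a trusted CA, as described in the best-practices guide above:

```bash
# Testing only: create a 4096-bit RSA key and a self-signed certificate
# valid for one year. Production deployments need a CA-issued chain.
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout server.key -out server.crt -subj "/CN=my-service.example.com"
```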
### Secure Storage (in storage) -Beside network traffic, we also need to ensure data is safely stored in a hard disk. In Big Data & AI applications, data is mainly stored in distributed storage or cloud storage, e.g., HDFS, Ceph and AWS S3 etc. This makes storage security a bit different. We need to ensure each storage node is secured by correct settings, meanwhile we need to ensure the whole storage system is secured (network, access control, authentication etc). +Besides network traffic, we also need to ensure data is safely stored in storage. In Big Data & AI applications, data is mainly stored in distributed storage or cloud storage, e.g., HDFS, Ceph and AWS S3 etc. This makes storage security a bit different. We need to ensure each storage node is secured by correct settings, meanwhile we need to ensure the whole storage system is secured (network, access control, authentication etc). ### Secure Computation (in use) @@ -25,7 +25,7 @@ WARNING: This example lists minimum security features that should be enabled for ### Prepare & Manage Your keys -Ensure you are generating, using & managing your keys in right way. Check with your admin or security reviewer about that. Using Key Management Service (KMS) in your deployment environment is recommended. It will reduce a lot of effort and potential issues. +Ensure you are generating, using & managing your keys in the right way. Check with your admin or security reviewer about that. Using Key Management Service (KMS) in your deployment environment is recommended. It will reduce a lot of effort and potential issues. Back to our example, please prepare SSL & TLS keys based on [SSL & TLS Private Key and Certificate](https://github.com/ssllabs/research/wiki/SSL-and-TLS-Deployment-Best-Practices#1-private-key-and-certificate). Ensure these keys are correctly configured and stored. @@ -47,13 +47,13 @@ Enable [Local Storage Encryption](https://spark.apache.org/docs/latest/security. Enable [SSL](https://spark.apache.org/docs/latest/security.html#ssl-configuration) to secure Spark Webui. -You can enable [Kerberos related settings](https://spark.apache.org/docs/latest/security.html#kerberos) if your have Kerberos service. +You can enable [Kerberos related settings](https://spark.apache.org/docs/latest/security.html#kerberos) if you have Kerberos service. ### [Kubernetes Security](https://kubernetes.io/docs/concepts/security/) As a huge resource management service, Kubernetes has lots of security features. -Enable [RBAC](https://kubernetes.io/docs/concepts/security/rbac-good-practices/) to ensure that cluster users and workloads have only the access to resources required to execute their roles. +Enable [RBAC](https://kubernetes.io/docs/concepts/security/rbac-good-practices/) to ensure that cluster users and workloads have only access to resources required to execute their roles. Enable [Encrypting Secret Data at Rest](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/) to protect data in rest API. When mounting key & sensitive configurations into pods, use [Kubernetes Secret](https://kubernetes.io/docs/concepts/configuration/secret/).
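As an illustration, a TLS key and certificate (e.g., `server.key` and `server.crt` from the earlier sketch) could be stored as a Kubernetes Secret and then mounted into your pods; the secret name below is a placeholder:

```bash
# Store the certificate and key as a TLS-type Secret;
# pods can then mount it as a volume instead of baking keys into images.
kubectl create secret tls my-service-tls --cert=server.crt --key=server.key
```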