diff --git a/README.md b/README.md
index 69db713e..e109cd69 100644
--- a/README.md
+++ b/README.md
@@ -218,7 +218,7 @@ Before we do so, first we need to setup a Python interpreter/environmen, this ca
 ``` bash
 python3 -m venv venv
 source venv/bin/activate
-pip3 install -r requirements.txt
+pip3 install -r requirements-cpu.txt
 ```
 - Then we will download the datasets using a Python script in the same
@@ -337,3 +337,4 @@ Which will collect and run all the tests in the repository, and show in `verbose
 * Currently, there is no GPU support in the Docker containers, for this the `Dockerfile` will need to be updated to accommodate for this.
+* Currently, only the `GLOO` backend is used; when using CUDA-capable devices, the `NCCL` backend is recommended.
diff --git a/jupyter/terraform_notebook.ipynb b/jupyter/terraform_notebook.ipynb
index 553de9eb..e59110cb 100644
--- a/jupyter/terraform_notebook.ipynb
+++ b/jupyter/terraform_notebook.ipynb
@@ -8,15 +8,40 @@
     }
    },
    "source": [
-    "# Pre-requisites\n",
+    "# README (Ignore if you are running on macOS/Linux)\n",
+    "\n",
+    "If you are running on Windows, make sure you have started the Jupyter Notebook in a Bash shell.\n",
+    "Moreover, all the requirements below must be installed in this Bash (compatible) shell.\n",
+    "\n",
+    "This can be achieved as follows:\n",
+    "\n",
+    "1. Enable and install WSL(2) for Windows 10/11, see the [official documentation](https://docs.microsoft.com/en-us/windows/wsl/install).\n",
+    "   * On newer builds of W10/11 you can install WSL by running the following command in an *administrator* PowerShell terminal, which by default installs an Ubuntu instance of WSL.\n",
+    "   ```bash\n",
+    "   wsl --install\n",
+    "   ```\n",
+    "2. Start the Ubuntu Bash shell by searching for `Bash` under Start, or by running `bash` in a (normal) PowerShell terminal.\n",
+    "\n",
+    "Using a Bash terminal as started under step 2 above, you can install the requirements described below as if you were running Linux or Ubuntu/Debian."
+   ],
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Requirements\n",
+    "These requirements may also be installed on Windows; however, development has only been tested on Linux/macOS.\n",
     "\n",
     "Before we get started, first make sure to install all the required tools. We provide two lists below, one needed for setting up the testbed. And one for developing code to use with the testbed. Feel free to skip the installation of the second list, and return at a later point in time.\n",
     "\n",
     "\n",
+    "### Deployment\n",
+    "\n",
+    " > ⚠️ All dependencies must be installed in a Bash-compatible shell. Windows users, also see the [README above](#read-me).\n",
     "Make sure to install a recent version of each of the dependencies.\n",
     "\n",
     "\n",
-    " * (Windows only) It is strongly recommended to install every dependency in a Windows Subsystem for Linux shell. For installation refer to [here](https://docs.microsoft.com/en-us/windows/wsl/install).\n",
+    " * (Windows only) Install every dependency in a Windows Subsystem for Linux (WSL) Bash shell (see also the README above).\n",
     " * GCloud SDK\n",
     "   - Follow the installation instructions [here](https://cloud.google.com/sdk/docs/install)\n",
     "   - Initialize the SDK with `gcloud init`, if prompted you may ignore to set/create a default/first project.\n",
@@ -34,14 +59,21 @@
     "python3 -m bash_kernel.install\n",
     "```\n",
     "\n",
+    "### Development\n",
     "For development, the following tools are needed/recommended:\n",
     "\n",
     " * Docker (>= 18.09).\n",
     "   - If you don't have experience with using Docker, we recommend following [this](https://docs.docker.com/get-started/) tutorial.\n",
     " * Python3.9\n",
     " * pip3\n",
-    " * JetBrains PyCharm\n"
-   ]
+    " * JetBrains PyCharm"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
   },
   {
    "cell_type": "markdown",
@@ -59,9 +91,9 @@
     "2. Redeem your academic coupon on GCP, see Brightspace for information on obtaining the \\\\$50 academic coupon, or use the free \\\\$300 credits for new users provided by Google.\n",
     "\n",
     "\n",
-    "3. (Non unix systems) Make sure to use the `Bash` kernel, not a Python or other kernel. For those on windows machines, make sure to launch the `jupyter notebook` server from a bash-compliant commandline, we recommend Windows Subsystem for Linux.\n",
+    "3. Make sure to use the `Bash` kernel, not a Python or other kernel. For those on Windows machines, make sure to launch the `jupyter notebook` server from a Bash-compatible command line; we recommend Windows Subsystem for Linux.\n",
     "\n",
-    "⚠️ Make sure to run this Notebook within a cloned repository, not standalone/downloaded from Github."
+    "⚠️ Make sure to run this Notebook within a cloned repository, not standalone/downloaded from GitHub.\n"
    ]
   },
   {
@@ -78,13 +110,15 @@
     "\n",
     "## Getting started\n",
     "\n",
-    "First, we will set a few variables used **throughout** the project. We set them in this notebook for convenience, but they are also set as defaults in configuration files for the project. If you change any of these, make sure to change the corresponding variables as well in;\n",
+    "First, we will set a few variables used **throughout** the project. We set them in this notebook for convenience, but they are also set to some example default values in configuration files for the project. If you change any of these, make sure to change the corresponding variables as well in:\n",
+    "\n",
+    "* [`../terraform/terraform-gke/variables.tf`](../terraform/terraform-gke/variables.tf)\n",
+    "* [`../terraform/terraform-dependencies/variables.tf`](../terraform/terraform-dependencies/variables.tf)\n",
     "\n",
-    "* [`terraform-gke/variables.tf`](terraform-gke/variables.tf)\n",
-    "* [`terraform-dependencies/variables.tf`](terraform-dependencies/variables.tf)\n",
     "\n",
+    "> ⚠️ Once you have changed the `PROJECT_ID` parameter to a unique project name, also change the `project_id` variable in the files listed above. This allows you to run `terraform apply` without having to override the default value for the project.\n",
     "\n",
-    "⚠️ Change the `PROJECT_ID` parameter to a unique project name, remember to update the paramter in the variables files!"
+    "> ℹ️ Any variable changed here can also be provided to `terraform` using the `-var` flag, e.g. `-var terraform_variable=$BASH_VARIABLE`. An example for setting the `project_id` variable is also provided later."
    ]
   },
   {
@@ -97,9 +131,11 @@
    },
    "outputs": [],
    "source": [
+    "# VARIABLES THAT NEED TO BE SET\n",
+    "PROJECT_ID=\"test-bed-fltk\" # CHANGE ME!\n",
+    "\n",
+    "# DEFAULT VARIABLES\n",
     "ACCOUNT_ID=\"terraform-iam-service-account\"\n",
-    "# CHANGE ME!\n",
-    "PROJECT_ID=\"test-bed-fltk\"\n",
     "PRIVILEGED_ACCOUNT_ID=\"${ACCOUNT_ID}@${PROJECT_ID}.iam.gserviceaccount.com\"\n",
     "CLUSTER_NAME=\"fltk-testbed-cluster\"\n",
     "REGION=\"us-central1-c\""
    ]
   },
@@ -115,7 +151,9 @@
    "source": [
     "## Project creation\n",
     "\n",
-    "Next, we create a project using the `PROJECT_ID` variable, and get all the billing account information."
+    "Next, we create a project using the `PROJECT_ID` variable and get all the billing account information.\n",
+    "\n",
+    "⁉️ (Ignore if using a pre-existing GCP Project) If the command below does not complete successfully, make sure to change the `PROJECT_ID` variable in the previous cell and re-run it."
    ]
   },
@@ -153,7 +191,7 @@
    },
    "outputs": [],
    "source": [
-    "BILLING_ACCOUNT=\"015594-41687F-092941\""
+    "BILLING_ACCOUNT=\"015594-41687F-092941\" # CHANGE ME!"
    ]
   },
@@ -299,133 +337,143 @@
    },
    "source": [
     "## Creating a Google managed cluster (GKE)\n",
-    "To create the cluster, first change the active directory to the `terraform-gke` directory."
+    "To create the cluster, first change the active directory to the `terraform-gke` directory.\n",
+    "\n",
+    "⚠️ Creating a cluster will incur billing costs on your project. By default, the cluster will be small to minimize costs during this tutorial. Forgetting to `destroy` or scale down the cluster may result in quickly spending your academic coupon."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "pycharm": {
-     "name": "#%%\n"
-    }
-   },
    "outputs": [],
    "source": [
     "cd ../terraform/terraform-gke\n",
     "echo $PWD"
-   ]
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
   },
   {
    "cell_type": "markdown",
+   "source": [
+    "Initialize the Terraform module in this directory."
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%% md\n"
     }
-   },
-   "source": [
-    "Init the directory, to initialize the Terraform module."
-   ]
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "outputs": [],
+   "source": [
+    "terraform init"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%%\n"
     }
-   },
-   "outputs": [],
-   "source": [
-    "terraform init "
-   ]
+   }
   },
   {
    "cell_type": "markdown",
+   "source": [
+    "Next, we can check whether we can create a cluster. No warnings or errors should occur during this process. It may take a while to complete.\n",
+    "\n",
+    "> ⚠️ We provide the `project_id` variable to Terraform manually here; alternatively, you can change its default value in `terraform/terraform-gke/variables.tf`.\n",
+    "\n",
+    "⁉️ If the command below does not complete successfully, e.g. after raising a `403` error, make sure that you have successfully created the project with `gcloud` earlier.\n"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%% md\n"
     }
-   },
-   "source": [
-    "Next, we can check whether we can create a cluster. No warnings or errors should occur during this process. It may take a while to complete."
-   ]
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "outputs": [],
+   "source": [
+    "terraform plan -var project_id=$PROJECT_ID"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%%\n"
     }
-   },
-   "outputs": [],
-   "source": [
-    "terraform plan -var project_id=$PROJECT_ID"
-   ]
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   },
    "source": [
     "When the previous command completes successfully, we can start the deployment. Depending on any changes you may have done, this might take a while.\n",
     "\n",
-    "By default, this will create a private zonal cluster consisting of two node-pools.\n",
+    "By default, this will create a private zonal cluster consisting of two node pools.\n",
     "\n",
-    "⚠️ Any changes to create a regional cluster, an additional free of \\\\$ 0.10 /hour will be billed with minute increments. However, only the **first** zonal cluster is free of this cost.\n",
+    "> ⚠️ A regional cluster (multi-zonal) will incur an additional fee of \\\\$0.10/hour per managed (GKE) cluster. The **first** zonal cluster is free of this charge.\n",
     "\n",
-    "⚠️ By default spot/preemtible nodes are disabled, as such no discounts will be given for the deployment. You can experiment by setting `spot` to true in the `tf` files. Note, however, that the default implementations provided in the test-bed do not allow for recovering from getting rescheduled.\n"
+    "> ⚠️ By default, spot/preemptible nodes are disabled. You can experiment by setting `spot` to true in the `tf` files. Note, however, that the default implementations provided in the testbed do not allow for recovery from getting spun down and rescheduled. Moreover, this may result in poor availability during busy hours in the region in which you deploy your cluster.\n"
-   ]
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "outputs": [],
+   "source": [
+    "terraform apply -auto-approve -var project_id=$PROJECT_ID"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%%\n"
     }
-   },
-   "outputs": [],
-   "source": [
-    "terraform apply -auto-approve -var project_id=$PROJECT_ID"
-   ]
+   }
   },
   {
    "cell_type": "markdown",
+   "source": [
+    "Next, we add cluster credentials (so you can interact with the cluster through `kubectl` and `helm`)."
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%% md\n"
     }
-   },
-   "source": [
-    "Next, we add cluster credentials (so you can interact with the cluster through `kubectl` an `helm`)."
-   ]
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "pycharm": {
-     "name": "#%%\n"
-    }
-   },
    "outputs": [],
    "source": [
     "# Add credentials for interacting with cluster via kubectl\n",
     "gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID"
-   ]
-  },
-  {
-   "cell_type": "markdown",
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
-     "name": "#%% md\n"
+     "name": "#%%\n"
     }
-   },
+    }
+   },
+   {
+    "cell_type": "markdown",
    "source": [
     "### Changing deployment\n",
     "\n",
@@ -443,105 +491,121 @@
     "```\n",
     "\n",
     "Depending on the number of changes, this may take some time."
-   ]
-  },
-  {
-   "cell_type": "markdown",
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%% md\n"
     }
-   },
+    }
+   },
+   {
+    "cell_type": "markdown",
    "source": [
     "## Installing dependencies\n",
     "Lastly, we need to install the dependencies on our cluster. First change the directories, and then run the `init`, `plan` and `apply` commands as we did for creating the GKE cluster."
-   ]
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "outputs": [],
+   "source": [
+    "cd ../terraform-dependencies"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%%\n"
     }
-   },
-   "outputs": [],
-   "source": [
-    "cd ../terraform-dependencies"
-   ]
+   }
   },
   {
    "cell_type": "markdown",
+   "source": [
+    "Initialize the Terraform module in this directory."
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%% md\n"
     }
-   },
-   "source": [
-    "Init the directory, to initialize the Terraform module."
-   ]
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "outputs": [],
+   "source": [
+    "terraform init -reconfigure"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%%\n"
     }
-   },
-   "outputs": [],
-   "source": [
-    "terraform init -reconfigure"
-   ]
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "pycharm": {
-     "name": "#%% md\n"
-    }
-   },
    "source": [
     "Check to see if we can plan the deployment. This will setup the following:\n",
     "\n",
     "* Kubeflow training operator (used to deploy and manage PyTorchTrainJobs programmatically)\n",
     "* NFS-provisioner (used to enable logging on a persistent `ReadWriteMany` PVC in the cluster)\n"
-   ]
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "outputs": [],
+   "source": [
+    "terraform plan -var project_id=$PROJECT_ID"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%%\n"
     }
-   },
-   "outputs": [],
-   "source": [
-    "terraform plan -var project_id=$PROJECT_ID"
-   ]
+   }
   },
   {
    "cell_type": "markdown",
+   "source": [
+    "When the previous command completes successfully, we can start the deployment. This will install the NFS provisioner and Kubeflow Training Operator dependencies.\n"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%% md\n"
     }
-   },
-   "source": []
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
+   "outputs": [],
+   "source": [
+    "terraform apply -auto-approve -var project_id=$PROJECT_ID"
+   ],
    "metadata": {
+    "collapsed": false,
     "pycharm": {
      "name": "#%%\n"
     }
-   },
-   "outputs": [],
-   "source": [
-    "terraform apply -auto-approve -var project_id=$PROJECT_ID"
-   ]
+   }
   },
   {
    "cell_type": "markdown",
@@ -557,7 +621,7 @@
    "\n",
    "This will create a simple deployment using a Kubeflow pytorch example job.\n",
    "\n",
-   "This will create a small (1 master, 1 client) training job on mnist on your cluster"
+   "This will create a small (1 master, 1 client) training job on MNIST on your cluster. You can follow the deployment by navigating to your cluster on [cloud.google.com](https://cloud.google.com)."
   ]
  },
  {
@@ -570,7 +634,7 @@
    },
    "outputs": [],
    "source": [
-    "kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml"
+    "kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml\n"
    ]
   },
@@ -583,11 +647,15 @@
    "source": [
     "# Cleaning up\n",
     "\n",
-    "To clean up/remove the cluster, or clean up the deploymen we have created, use the `terraform destroy` command.\n",
+    "> ⚠️ THIS WILL REMOVE YOUR CLUSTER AND DATA STORED ON IT. For this tutorial's purposes, destroying your cluster is not an issue. For testing/developing, we recommend manually scaling your cluster up and down instead.\n",
+    "\n",
+    "\n",
+    "To clean up/remove the cluster, we will use the `terraform destroy` command.\n",
     "\n",
-    "This will remove everything defined in the Terraform configuration.\n",
+    " * Running it in `terraform-dependencies` WILL REMOVE the Kubeflow Training-Operator from your cluster.\n",
+    " * Running it in `terraform-gke` WILL REMOVE YOUR ENTIRE CLUSTER.\n",
     "\n",
-    "You can uncomment the commands below to remove the cluster, or run the command in a terminal in the [`../terraform/terraform-gke`](../terraform/terraform-gke) directory."
+    "You can run the commands below to remove the cluster, or run them in a terminal in the [`../terraform/terraform-gke`](../terraform/terraform-gke) directory.\n"
    ]
   },
@@ -600,9 +668,9 @@
    },
    "outputs": [],
    "source": [
-    "# cd ../terraform-gke\n",
+    "cd ../terraform-gke\n",
     "\n",
-    "# terraform destroy #-auto-approve"
+    "terraform destroy -auto-approve"
    ]
   }
  ],
@@ -621,4 +689,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 1
-}
+}
\ No newline at end of file