Update README for unified setup #10

Open
wants to merge 5 commits into
base: gh-pages
157 changes: 81 additions & 76 deletions README.md
@@ -1,10 +1,6 @@
# MMC.AI Setup Guide
# Memory Machine AI Setup Guide

## Installation prerequisites

NVIDIA’s DeepOps project uses Ansible to deploy Kubernetes onto host machines. Ansible is an automation tool that allows system administrators to run commands on multiple machines, while interacting with only one host, called the “provisioning machine.”

#### Setting up user accounts
## Setting up user accounts for Ansible

A user with `sudo` permissions is needed on each host where Kubernetes will be installed.

@@ -22,7 +18,7 @@ sudo usermod -aG sudo mmai-admin
echo "mmai-admin ALL=(ALL:ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/mmai-admin > /dev/null
```

#### Enabling private-key SSH
## Enabling private-key SSH

To allow Ansible to connect to remote hosts without prompting for a password, private-key SSH connections must be enabled. From the provisioning machine, follow these steps:
```bash
@@ -38,25 +34,22 @@ ssh-copy-id <username>@<host>

These instructions come from [NVIDIA’s guide on Ansible](https://github.com/NVIDIA/deepops/blob/master/docs/deepops/ansible.md#passwordless-configuration-using-ssh-keys), which contains more information.
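
Before handing the hosts to Ansible, it can help to confirm that key-based login really works without a prompt. The following is a minimal sketch, not part of DeepOps or `mmc.ai-setup`: `check_ssh_key_login` is a hypothetical helper name, and the five-second timeout is an arbitrary choice.

```bash
# Sketch: confirm that key-based SSH works without a password prompt.
# check_ssh_key_login is a hypothetical helper, not part of DeepOps or mmc.ai-setup.
check_ssh_key_login() {
  local user="$1" host="$2"
  # BatchMode=yes makes ssh fail instead of prompting for a password.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "${user}@${host}" true; then
    echo "OK: ${user}@${host} accepts key-based login"
  else
    echo "FAIL: ${user}@${host} needs a password or is unreachable" >&2
    return 1
  fi
}

# Example (placeholders as in the steps above):
# check_ssh_key_login <username> <host>
```

Running the check once per host from the provisioning machine should surface any host where `ssh-copy-id` was skipped.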

## Ansible Installation with DeepOps
## [OPTIONAL] Installing Kubernetes via DeepOps

The following set of commands will install Ansible on the provisioning machine. They must be run as a regular user.
```bash
git clone https://github.com/NVIDIA/deepops.git
cd ./deepops
cd deepops
git checkout 23.08
./scripts/setup.sh
```

## Editing Ansible Configurations
### Ansible configuration

Once Ansible installation is complete, `deepops/config/inventory` must be configured by the system admin.

#### `deepops/config/inventory`

This file defines which hosts will be used for Kubernetes installation.

Within there are four relevant headers:
Within, there are four relevant host groups:

- **`[all]`**
A list of the hosts that will participate in the Kubernetes cluster.
@@ -76,57 +69,26 @@ Within there are four relevant headers:
- **`[kube-node]`**
Should contain the cluster's "worker nodes" -- that is, nodes that do not appear in `[kube-master]`, but are expected to run workloads.
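
A quick way to sanity-check an edited inventory is to list the group headers it actually defines. This is a minimal sketch that assumes the INI-style layout described above; `list_inventory_groups` is a hypothetical helper, not something DeepOps ships.

```bash
# Sketch: print each [group] header name found in an INI-style inventory file.
# list_inventory_groups is a hypothetical helper, not part of DeepOps.
list_inventory_groups() {
  # Keep only lines of the form "[name]" and print the name between the brackets.
  sed -n 's/^\[\([^]]*\)][[:space:]]*$/\1/p' "$1"
}

# Example:
# list_inventory_groups deepops/config/inventory
```

If a group such as `kube-node` is missing from the output, the header is probably commented out or misspelled.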

## Installing Kubernetes
### Kubernetes installation script

Once Ansible configuration is complete, copy these commands into your terminal to install Kubernetes:
```bash
wget -O logging.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/logging.sh
wget -O deepops-setup.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/deepops-setup.sh
chmod +x deepops-setup.sh
git clone https://github.com/MemVerge/mmc.ai-setup
cd mmc.ai-setup
./deepops-setup.sh
```

## Installing Kubeflow

Download and run `kubeflow-setup.sh` on a node with kubectl and kustomize:
```bash
wget -O logging.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/logging.sh
wget -O git-clone.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/git-clone.sh
wget -O kubeflow-setup.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/kubeflow-setup.sh
chmod +x kubeflow-setup.sh
./kubeflow-setup.sh
```

The following command prints the port for the Kubeflow Central Dashboard:
```bash
echo $(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}')
```

Using this port, the URL `http://<node-ip>:<port>` will fetch the Kubeflow Central Dashboard, where `<node-ip>` is the IPv4 address of any node on the cluster.


## Installing NVIDIA GPU Operator

Download and run `nvidia-gpu-operator-setup.sh` on the node used to manage Helm installations:
```bash
wget -O logging.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/logging.sh
wget -O gpu-operator-values.yaml https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/gpu-operator-values.yaml
wget -O nvidia-gpu-operator-setup.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/nvidia-gpu-operator-setup.sh
chmod +x nvidia-gpu-operator-setup.sh
./nvidia-gpu-operator-setup.sh
```

## Installing MMC.AI
## Installing Memory Machine AI

> **Important:**
> The following prerequisites are necessary if you did not follow the instructions above:
> 1. Kubernetes set up.
> 2. [Default StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/#default-storageclass) set up in cluster.
> 3. [Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/) installed in cluster.
> 4. NVIDIA GPU Operator installed via Helm in cluster with overrides from `gpu-operator-values.yaml`.
> 5. Node(s) in cluster with [Helm](https://helm.sh/docs/intro/quickstart/) installed.
> 1. User accounts for Ansible set up.
> 2. Private-key SSH enabled.
> 3. Kubernetes cluster set up.
> 4. [Default StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/#default-storageclass) in Kubernetes cluster set up.
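
The default StorageClass prerequisite can be checked from any node with `kubectl` access. The sketch below relies on `kubectl get storageclass` marking the default class with `(default)` in its table output; `default_storageclass` is a hypothetical helper name.

```bash
# Sketch: print the name of the default StorageClass, or fail if there is none.
# default_storageclass is a hypothetical helper, not part of mmc.ai-setup.
default_storageclass() {
  local name
  # kubectl appends "(default)" after the name of the default class.
  name="$(kubectl get storageclass --no-headers | awk '/\(default\)/ { print $1; exit }')"
  if [ -z "$name" ]; then
    echo "No default StorageClass found" >&2
    return 1
  fi
  echo "$name"
}

# Example:
# default_storageclass || echo "Set a default StorageClass before installing" >&2
```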

### [INTERNAL] Helm login secrets

#### (Internal) Helm Login Secrets
In order to download the pre-release packages, MemVerge team members must authenticate with the GitHub container registry.

First, create a personal access token on this GitHub page: https://github.com/settings/tokens
@@ -145,37 +107,81 @@ helm registry login ghcr.io/memverge/charts
# Password: <personal-access-token>
```

### Image Pull Secrets
### Image pull secrets

Copy the `mmcai-ghcr-secret.yaml` file provided by MemVerge to the node with `kubectl` access (i.e., the "control plane node"). Then, deploy its image pull credentials to the cluster like so:
```bash
kubectl apply -f mmcai-ghcr-secret.yaml
```

### Cluster Components
### Ansible configuration

In an inventory file (which can be named anything), configure two host groups:
- **`[all]`**
> **Note:**
> The [all] group in this section should be identical to the one in `deepops/config/inventory` if you installed Kubernetes via DeepOps.

A list of the hosts that will participate in the Kubernetes cluster.
Review comment from a collaborator:

For [all], maybe add an indented section like

Note:
The [all] group of this file will be identical to the one in deepops/config/inventory when installing Kubernetes via DeepOps.
If DeepOps was not used to install Kubernetes, follow these instructions:

Also, nitpicks,
configuraiton -> configuration
configure the two groups -> configure two groups

For example:
```
[all]
<host-1-name> ansible_host=<host-1-ip-address>
<host-2-name> ansible_host=<host-2-ip-address>
# The following will configure the local machine as a target:
# host-1 ansible_host=localhost
```
To make the Kubernetes node names match the names of the servers in the cluster, set `<host-N-name>` to the short hostname of the remote host. You can find it by running the `hostname` command on each machine (without the optional `-f` flag, which prints the fully qualified domain name instead).
- **`[mmai_database]`**
The single node that hosts the Memory Machine AI MySQL database.
For example:
```
[mmai_database]
<host-name> ansible_host=<host-ip-address>
```

This file will be used by the Memory Machine AI installation script.
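
Since `[mmai_database]` is described above as a single node, it may be worth counting the hosts in that group before running the installer. This is a minimal sketch under that assumption; `count_group_hosts` and the file name `inventory` are hypothetical, and the parsing only handles the simple one-host-per-line layout shown in the examples.

```bash
# Sketch: count the hosts listed under one group of an INI-style inventory.
# count_group_hosts is a hypothetical helper, not part of mmc.ai-setup.
count_group_hosts() {
  local group="$1" file="$2"
  awk -v g="[${group}]" '
    /^\[/ { in_group = ($0 == g); next }          # any header line opens or closes the group
    in_group && NF && $0 !~ /^[ \t]*[#;]/ { n++ } # count non-blank, non-comment lines
    END { print n + 0 }
  ' "$file"
}

# Example:
# [ "$(count_group_hosts mmai_database inventory)" -eq 1 ] || echo "expected exactly one database host" >&2
```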

### Memory Machine AI installation script

Download and run the interactive `mmcai-setup.sh` script on the control plane node.

You will have a chance to confirm your changes after making your selections:

#### Billing Database
Download and run `mysql-pre-setup.sh` on the control plane node:
```bash
wget -O logging.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/logging.sh
wget -O mysql-pre-setup.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/mysql-pre-setup.sh
chmod +x mysql-pre-setup.sh
./mysql-pre-setup.sh
git clone https://github.com/MemVerge/mmc.ai-setup
cd mmc.ai-setup
./mmcai-setup.sh
```

#### MMC.AI Cluster and Management Planes
Download and run `mmcai-setup.sh` on the control plane node:
``` bash
wget -O logging.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/logging.sh
wget -O mmcai-setup.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/mmcai-setup.sh
chmod +x mmcai-setup.sh
./mmcai-setup.sh
# Answer prompts as needed.
If MMC.AI Manager is installed, the MMC.AI dashboard should be accessible at `http://<control-plane-ip>:32323`.

If Kubeflow is installed, the following command should print the port for the Kubeflow Central Dashboard:

```bash
echo $(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}')
```

Once deployed, the MMC.AI dashboard should be accessible at `http://<control-plane-ip>:32323`.
Using this port, the URL `http://<node-ip>:<port>` will fetch the Kubeflow Central Dashboard, where `<node-ip>` is the IPv4 address of any node on the cluster.
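
The port lookup and the URL can be combined in one step. The following sketch wraps the same `kubectl` query in a function; `kubeflow_dashboard_url` is a hypothetical helper name, and the node IP is supplied by the caller.

```bash
# Sketch: build the Kubeflow Central Dashboard URL for a given node IP.
# kubeflow_dashboard_url is a hypothetical helper, not part of mmc.ai-setup.
kubeflow_dashboard_url() {
  local node_ip="$1" port
  # Same query as above: the NodePort that maps to port 80 of the Istio ingress gateway.
  port="$(kubectl get svc istio-ingressgateway -n istio-system \
    -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}')" || return 1
  if [ -z "$port" ]; then
    echo "No NodePort found for port 80 on istio-ingressgateway" >&2
    return 1
  fi
  printf 'http://%s:%s\n' "$node_ip" "$port"
}

# Example:
# kubeflow_dashboard_url <node-ip>
```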

# Memory Machine AI Teardown Guide

## Uninstalling Memory Machine AI

### Ansible configuration

In an inventory file (which can be named anything), configure the host group:
- **`[mmai_database]`**
One or more nodes that host Memory Machine AI MySQL databases. The databases on these nodes will be removed.
For example:
```
[mmai_database]
<host-1-name> ansible_host=<host-1-ip-address>
<host-2-name> ansible_host=<host-2-ip-address>
```

This file will be used by the Memory Machine AI uninstallation script.

# MMC.AI Teardown Guide
### Memory Machine AI uninstallation script

Download and run the interactive `mmcai-teardown.sh` script on the control plane node.

@@ -184,8 +190,7 @@ If you have nothing else installed in the cluster and want to remove everything,

You will have a chance to confirm your changes after making your selections:
```bash
wget -O logging.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/logging.sh
wget -O mmcai-teardown.sh https://raw.githubusercontent.com/MemVerge/mmc.ai-setup/main/mmcai-teardown.sh
chmod +x mmcai-teardown.sh
git clone https://github.com/MemVerge/mmc.ai-setup
cd mmc.ai-setup
./mmcai-teardown.sh
```