diff --git a/10-simple.md b/10-simple.md
new file mode 100644
index 0000000..969fc8b
--- /dev/null
+++ b/10-simple.md
@@ -0,0 +1,112 @@
+# Simple simulations on a remote computer
+
+The simplest thing you can do remotely is to run simulations that you generate with Makita.
+This should be a very straightforward process, assuming that you have permission to install what you need.
+
+## SSH into your remote computer
+
+The first thing you have to do is log in to your remote computer.
+Find its IP address and your user name.
+You might also need to perform additional steps, such as configuring SSH keys.
+These depend on which cloud provider you use (SURF, Digital Ocean, AWS, Azure, etc.), so make sure you follow the provider's instructions.
+
+Connect to the remote computer using SSH.
+If you are using Windows, use [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html) or some other SSH client.
+On Linux, macOS, and WSL you should be able to use the terminal:
+
+```bash
+ssh USER@IPADDRESS
+```
+
+It is normal to receive a message such as "The authenticity ... can't be established" on the first connection.
+Type "yes" and press "Enter".
+
+For more information on using SSH, check your provider's documentation or an SSH tutorial.
+
+## Copying data
+
+To copy the `data` folder and any other files you need, use the `scp` command:
+
+```bash
+scp -r data USER@IPADDRESS:/location/data
+```
+
+The `-r` flag is only necessary for folders.
+
+## Install and run tmux
+
+The main issue with running things remotely is that when you close the SSH connection, the commands that you are running will be killed.
+To avoid that, we run [tmux](https://github.com/tmux/tmux/wiki), which works like a virtual terminal that we can detach from (it is more than that; see the linked page).
+
+Install `tmux` through whatever means your remote computer allows.
+E.g., for Ubuntu you can run
+
+```bash
+apt install tmux
+```
+
+To start tmux, just enter
+
+```bash
+tmux
+```
+
+## Run your simulations
+
+This is where you run your usual workflow.
+For example, let's assume that we want to run Makita with an arfi template on one or more files in our `data` folder.
+For that, we will install `makita` in a Python environment, install the required packages from pip, and run the `jobs.sh` file.
+**This is exactly what we would do on a local machine.**
+
+```bash
+apt install python3-venv
+python3 -m venv env
+. env/bin/activate
+pip3 install --upgrade pip setuptools
+pip3 install asreview asreview-makita asreview-insights asreview-wordcloud asreview-datatools
+asreview makita template arfi
+bash jobs.sh
+```
+
+Now, the remote computer will be running the simulations.
+To leave it running and come back later, follow the steps below.
+
+## Detach and attach tmux and close the SSH session
+
+Since your simulations are running inside tmux, you have to *detach* from it by pressing CTRL+b and then d (hold the CTRL key, press b, release both, press d).
+
+You will be back in the terminal, with a `[detached (from session 0)]` or similar message.
+
+**Your simulations are still running inside tmux.**
+
+To go back to them, reattach using
+
+```bash
+tmux attach
+```
+
+It will be as if you never left.
+
+Most importantly, you can now exit your SSH session and come back later, and the tmux session will still be reachable.
+
+To close an SSH session, simply enter `exit` in the terminal.
+
+### Test the persistence of your simulation run
+
+To make sure that things work as expected before you leave your remote computer unattended, do the following:
+
+- Connect through SSH, open tmux.
+- Run some simulation that takes a few minutes (see the example below).
+- Detach, exit the SSH session.
+- Connect back, attach tmux.
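+
+If you don't have a simulation at hand for this test, any long-running command will do.
+The loop below is a minimal stand-in (not a real simulation) that prints a counter for about five minutes:
+
+```bash
+for i in $(seq 1 300); do echo "step $i"; sleep 1; done
+```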
+
+The simulation should still be running, and it should have made some progress.
+To make sure that it is actually progressing, you can repeat the test and wait longer before reconnecting.
+
+## Copy things back to your local machine
+
+We use `scp` again, this time to copy from the remote machine back to the local machine:
+
+```bash
+scp -r USER@IPADDRESS:/location/output ./
+```
diff --git a/20-parallel.md b/20-parallel.md
new file mode 100644
index 0000000..b56ac28
--- /dev/null
+++ b/20-parallel.md
@@ -0,0 +1,79 @@
+# Running the jobs.sh file with GNU parallel
+
+These steps can be run locally or on a remote computer.
+However, given the nature of this parallelization, there is no limit on memory usage, so your local computer can run out of memory.
+
+If you run this method on a remote computer, follow the [guide on running simulations remotely](10-simple.md) first.
+When that guide tells you to run your simulations, stop and come back here.
+
+## Install GNU parallel
+
+Install the package [GNU parallel](https://www.gnu.org/software/parallel/) following the instructions on the website.
+We recommend installing the package via a package manager if you have one (such as `apt-get`, `homebrew`, or `chocolatey`).
+
+> **Note**
+>
+> For SURF, that would be `sudo apt-get install parallel`.
+
+In case you do not have one, you can follow the steps below:
+
+- If you are using a UNIX-based system (Linux or macOS), you are going to need `wget`.
+
+Run the following commands; `parallel-NUMBER` will be the extracted folder, with the version number in place of `NUMBER`:
+
+```bash
+wget https://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
+tar -xjf parallel-latest.tar.bz2
+cd parallel-NUMBER
+./configure
+make
+sudo make install
+```
+
+Check that the package is installed with
+
+```bash
+parallel --version
+```
+
+## Running jobs.sh in parallel
+
+To parallelize your `jobs.sh` file, we need to split it into blocks that can be parallelized.
+To do that, we need the `split-file.py` script included in this repo.
+
+To download it directly from the internet, you can issue the following command:
+
+```bash
+wget https://raw.githubusercontent.com/abelsiqueira/asreview-cloud/main/split-file.py
+```
+
+Now run the following to split the jobs.sh file into three files:
+
+```bash
+python3 split-file.py jobs.sh
+```
+
+This will generate the files `jobs.sh.part1`, 2, and 3.
+The first part contains all lines with "mkdir" and "describe" in them.
+The second part contains all lines with "simulate" in them.
+The rest of the useful lines (non-empty and not comments) constitute the third part.
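+
+For illustration, consider the hypothetical three-line `jobs.sh` below (a real Makita-generated file is much longer); the comments indicate where each line ends up:
+
+```bash
+mkdir output/simulation/data_1                  # -> jobs.sh.part1
+asreview simulate data/data_1.csv ...           # -> jobs.sh.part2
+asreview metrics output/simulation/data_1 ...   # -> jobs.sh.part3
+```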
+
+Each part must finish before the next is run, and the first part must be run sequentially.
+The other two parts can be run using `parallel`.
+
+To simplify your usage, we have created the script `parallel_run.sh`.
+Download it by issuing
+
+```bash
+wget https://raw.githubusercontent.com/abelsiqueira/asreview-cloud/main/parallel_run.sh
+```
+
+Then you can just run the script below, specifying the number of cores as an argument.
+
+> **Warning**
+>
+> We recommend not using all of your CPU cores at once.
+> Leave at least one or two free to allow your machine to process other tasks.
+> Notice that there is no limit on memory usage per task, so for models that use a lot of memory, there might be some competition for resources.
+
+```bash
+bash parallel_run.sh NUMBER_OF_CORES
+```
diff --git a/30-many-jobs.md b/30-many-jobs.md
new file mode 100644
index 0000000..3fd3b48
--- /dev/null
+++ b/30-many-jobs.md
@@ -0,0 +1,62 @@
+# Running many jobs.sh files one after the other
+
+A more advanced situation is running many simulations with changing parameters.
+For instance, simulating different combinations of models.
+
+Ideally, you would want to parallelize your execution even further, but this guide assumes that you can't do that, for instance, because you don't have access to more computers.
+
+In that case, the guidance we can provide is to write a loop over the arguments and properly save the output of each run.
+
+Let's assume that you want to run `asreview makita CONSTANT VARIABLE`, where `CONSTANT` is a fixed part that is the same for all runs and `VARIABLE` is what you are varying.
+
+## Arguments
+
+Open a file `makita-args.txt` and write the arguments that you want to run.
+For instance, we could write
+
+```plaintext
+-m logistic -e tfidf
+-m nb -e tfidf
+```
+
+## Execution script
+
+Now, download the file `many-jobs.sh`:
+
+```bash
+wget https://raw.githubusercontent.com/abelsiqueira/asreview-cloud/main/many-jobs.sh
+```
+
+This file should contain something like
+
+```bash
+CONSTANT="template arfi" # Edit here to your liking
+num_cores=$1
+
+# Shortened for readability
+
+while read -r arg
+do
+    # Answering "A" overwrites existing files; $CONSTANT and $arg are intentionally unquoted so they expand into multiple arguments
+    echo "A" | asreview makita $CONSTANT $arg
+    # Edit to your liking from here
+    python3 split-file.py jobs.sh
+    bash parallel_run.sh "$num_cores"
+    mv output "output-args_$arg"
+    # to here
+done < makita-args.txt
+```
+
+Edit this file to reflect your usage:
+
+1. The `CONSTANT` variable defines that we will run `template arfi` for every `asreview makita` call. If you use a custom template, change it here.
+
+2. After running `asreview makita`, we chose to use the [parallelization strategy](20-parallel.md). If you prefer, you can use just `bash jobs.sh` instead of the two lines that split the file and run it in parallel. The last line renames the output, so it is important, but you can do something else that you find more relevant instead, such as uploading the results.
+
+## Running
+
+After you change everything that needs changing, simply run
+
+```bash
+bash many-jobs.sh NUMBER_OF_CORES
+```
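+
+With the example `makita-args.txt` above, after the loop finishes you end up with one renamed output folder per argument line.
+The listing below is only an illustration; note that the folder names contain the argument strings verbatim, including spaces and dashes:
+
+```bash
+ls -d output-args_*
+# output-args_-m logistic -e tfidf
+# output-args_-m nb -e tfidf
+```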
diff --git a/40-kubernetes.md b/40-kubernetes.md
new file mode 100644
index 0000000..d9e4c39
--- /dev/null
+++ b/40-kubernetes.md
@@ -0,0 +1,254 @@
+# Running very large simulations using Kubernetes
+
+**Warning**: This parallelization strategy requires time, patience, and probably some Kubernetes troubleshooting that is not covered in this guide.
+
+---
+
+The idea behind using Kubernetes is to allow scaling the parallelization across many computers while containerizing the different tasks.
+This means that if you have more money and need results faster, you can run more tasks in parallel.
+
+The basic strategy of this Kubernetes setup is to extend the parallelization in ["Running the jobs.sh file with GNU parallel"](20-parallel.md) to multiple computers.
+
+However, even if you have several computers lying around, setting up a Kubernetes cluster is not a trivial task.
+For that reason, we will consider two situations:
+
+1. You have a single computer on which you will run _minikube_; or
+2. You are using a Kubernetes cluster from a cloud provider.
+
+The first case is useful for testing things out, or when your computer is powerful enough but you need more control over CPU and memory usage.
+The second case is the expected use case.
+
+## How it works
+
+The basic idea behind this Kubernetes implementation is to have a **Tasker** pod and as many **Worker** pods as we can.
+The Tasker will send the individual simulate commands (and other less time-consuming commands) to the Workers.
+
+The Tasker and Workers communicate using [RabbitMQ](https://www.rabbitmq.com).
+The Tasker sends every command as a message, and the Worker sends a confirmation when the command is completed.
+
+The Tasker and Workers share the same volume, where both the input and output data will be stored.
+The Workers can, optionally and if provided, upload the output to S3.
+
+Each of these pods runs a Docker image that installs the necessary packages.
+Each of these images runs a bash file.
+Each bash file uses a Python script and the Python package [pika](https://pika.readthedocs.io/en/stable/) to send and receive messages.
+
+In the Worker case, the [Worker Dockerfile](worker.Dockerfile) must have the packages to run the models that you need, in addition to some basic things, and it runs the Worker bash file.
+The [Worker bash file](worker.sh) just runs the [Worker receiver](worker-receiver.py) file.
+The Worker receiver keeps the Worker alive waiting for messages; runs received messages as commands; tells the Tasker that it is done with a message; and sends files to S3, if configured to do so.
+
+In the Tasker case, the [Tasker Dockerfile](tasker.Dockerfile) only needs the basic things, and it runs the Tasker bash file.
+The [Tasker bash file](tasker.sh) is responsible for very important tasks.
+It starts by cleaning up the volume and moving files to the correct location inside it.
+Then, it runs whatever you need it to run, and this is the part you have to edit to do what you need.
+
+In the default case, the Tasker bash file runs makita once, then splits the file using the [split-file.py](split-file.py) script that we mentioned before.
+Then it runs the first part itself (which can't be parallelized), and sends the next two parts to the Workers using the script `tasker-send.py`.
+
+The `tasker-send.py` script sends each line of the input file to the Workers as messages, and then waits for all messages to be completed.
+This ensures that part 2 is completed before part 3 starts being executed.
+
+We have created a visual representation below:
+
+![Workflow representation](workflow.jpg)
+
+## Guide
+
+As we said in the beginning, we will consider two situations.
+Either you have a single computer on which you will install `minikube`, or you have a Kubernetes cluster already set up, probably from a cloud provider.
+
+The "single computer" strategy can also be followed to test your scripts before using the real cluster, although there are many small differences that need to be addressed.
+
+In both cases, start by cloning this repo, as you will need the configuration files provided here:
+
+```bash
+git clone https://github.com/abelsiqueira/asreview-cloud
+cd asreview-cloud
+```
+
+First, follow the specific guide to set up your local computer or cluster:
+
+- [Single computer](41-kubernetes-single-computer.md)
+- [Kubernetes cluster](42-kubernetes-cloud-provider.md)
+
+## Install RabbitMQ
+
+We need to install and run RabbitMQ on Kubernetes.
+Run the following command, taken from the [RabbitMQ Cluster Operator quickstart](https://www.rabbitmq.com/kubernetes/operator/quickstart-operator.html); the `rabbitmq.yml` service is applied in a later step below.
+
+```bash
+kubectl apply -f "https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml"
+```
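+
+Before moving on, you can check that the operator was installed correctly.
+The namespace below is an assumption based on the default manifest, which creates the operator in `rabbitmq-system`; adjust it if your version differs:
+
+```bash
+kubectl get pods -n rabbitmq-system
+```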
+
+## Create a namespace for asreview things
+
+The configuration files use the namespace `asreview-cloud` by default, so if you want to change it, you need to change it in the file below and in all other places that have `# namespace: asreview-cloud`.
+
+```bash
+kubectl apply -f asreview-cloud-namespace.yml
+```
+
+## Start RabbitMQ configuration
+
+Run
+
+```bash
+kubectl apply -f rabbitmq.yml
+```
+
+## S3 storage (_Optional step_)
+
+You might want to set up S3 storage for some files after running the simulation.
+You have to find your own S3 service, e.g., AWS S3 or Scaleway - it looks like you can use [Scaleway](https://scaleway.com) for free under some limitations, but do that at your own risk.
+
+After setting up S3 storage, edit the `s3-secret.yml` file with the relevant values.
+The file must store the base64-encoded strings, not the raw strings.
+To encode, use
+
+```bash
+echo -n 'WHATEVER' | base64
+```
+
+Copy that value and paste it in the appropriate field of the file.
+
+Finally, apply the secret:
+
+```bash
+kubectl apply -f s3-secret.yml
+```
+
+Edit the `worker.yml` file and uncomment the lines related to S3.
+
+By default, only the metrics files are uploaded to S3.
+Edit `worker-receiver.py` to change that.
+
+By default, the prefix of the folder on S3 is the date and time.
+To change that, edit `tasker.sh`.
+
+## Prepare the tasker script and Docker image
+
+The `tasker.sh` file defines everything that will be executed by the tasker, and indirectly by the workers.
+The `tasker.Dockerfile` will create the image that will be executed in the tasker pod.
+You can modify these as you see fit.
+After you are done, build and push the image:
+
+> **Warning**
+>
+> The default tasker assumes that a `data` folder exists with your data.
+> Make sure to either provide the data or change the tasker and Dockerfile.
+
+```bash
+docker build -t YOURUSER/tasker -f tasker.Dockerfile .
+docker push YOURUSER/tasker
+```
+
+> **Note**
+>
+> This will push the image to Docker Hub. You will need to create an account and log in in your terminal with `docker login`.
+
+## Prepare the worker script and Docker image
+
+The `worker.sh` file defines a very short list of tasks: running `worker-receiver.py`.
+You can do other things before that, but tasks that are meant to be run before **all** workers start working should go in `tasker.sh`.
+The `worker-receiver.py` script runs continuously, waiting for new tasks from the tasker.
+
+```bash
+docker build -t YOURUSER/worker -f worker.Dockerfile .
+docker push YOURUSER/worker
+```
+
+## Running the workers
+
+The file `worker.yml` contains the configuration for the deployment of the workers.
+Change the `image` to reflect the path to the image that you pushed.
+You can set the number of `replicas` to change the number of workers.
+Pay attention to the resource limits, and change them as you see fit.
+
+Run with
+
+```bash
+kubectl apply -f worker.yml
+```
+
+Check that the workers are running with the following:
+
+```bash
+kubectl get pods
+```
+
+You should see some `asreview-worker-FULL-NAME` pods with "Running" status after a while.
+Follow the logs of a single pod with
+
+```bash
+kubectl logs asreview-worker-FULL-NAME -f
+```
+
+You should see something like
+
+```plaintext
+Logging as ...
+[*] Waiting for messages. CTRL+C to exit
+```
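+
+To follow the logs of several workers at once, you can use a label selector instead of a pod name.
+The label below is an assumption; check `worker.yml` for the actual label used by the deployment:
+
+```bash
+kubectl -n asreview-cloud logs -l app=asreview-worker -f --prefix
+```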
+
+## Running the tasker
+
+Similarly, the `tasker.yml` file allows you to run the tasker as a Kubernetes job.
+Change the `image`, and optionally add a `ttlSecondsAfterFinished` to auto-delete the task - I prefer to keep it until I review the log.
+Run
+
+```bash
+kubectl apply -f tasker.yml
+```
+
+Similarly, you should see a `tasker` pod, and you can follow its log.
+
+## Deleting and restarting
+
+If you plan to make modifications to the tasker or the worker, the corresponding pods have to be deleted first.
+
+The workers keep running after the tasker is done.
+They don't know when to stop.
+To stop and delete them, run
+
+```bash
+kubectl delete -f worker.yml
+```
+
+If you did not set a `ttlSecondsAfterFinished` for the tasker, it will keep existing, although not running.
+You can delete it the same way as you did the workers, but using `tasker.yml`.
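+
+For example:
+
+```bash
+kubectl delete -f tasker.yml
+```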
+
+You can then delete the volume (`storage-local.yml` or `storage-nfs.yml`) and the `rabbitmq.yml` resources, but if you are running new tests, you don't need to.
+
+Since the volume is mounted separately, you don't lose the data.
+You will lose the execution log, though.
+
+Running everything again is simply a matter of using `kubectl apply` again.
+Of course, if you modify the `.sh` or `.py` files, you have to build the corresponding Docker image again.
+
+> **Warning**
+>
+> The default **tasker** deletes the whole workdir folder to make sure that it is clean when it starts.
+> If you don't want this behaviour, look for the "rm -rf" line and comment it out or remove it.
+> However, if you run into a "Project already exists" error, this is why.
+
+## Troubleshooting and FAQ
+
+### After running the tasker, the workers are in CrashLoopBackOff/Error
+
+Probably some command in the tasker resulted in a worker failure, and now the queue is populated and the workers keep trying and failing.
+Looking at the logs of the workers should give insight into the real issue.
+
+To verify whether you have a queue issue, run
+
+```bash
+kubectl -n asreview-cloud exec rabbitmq-server-0 -- rabbitmqctl list_queues
+```
+
+If any of the queues has more than 0 messages, then this confirms the issue.
+Delete the queue with messages:
+
+```bash
+kubectl -n asreview-cloud exec rabbitmq-server-0 -- rabbitmqctl delete_queue asreview_queue
+```
+
+You should see the workers go back to the "Running" state.
diff --git a/41-kubernetes-single-computer.md b/41-kubernetes-single-computer.md
new file mode 100644
index 0000000..ffaba52
--- /dev/null
+++ b/41-kubernetes-single-computer.md
@@ -0,0 +1,126 @@
+# Using Kubernetes with a single computer
+
+Note: If you don't have a Kubernetes cloud provider, running this on a single computer is less efficient than the other suggested parallelization strategy, because we will use some of the available CPU and memory for other tasks.
+
+If you have very many cores, though, it might still make sense to do it on a single computer, because the relative cost will decrease, and you keep control over your CPU and memory resources.
+
+## Installing minikube and required packages
+
+Install `minikube` from your package provider or look into the [official documentation](https://minikube.sigs.k8s.io/docs/start/).
+You should have the `kubectl` command as well.
+
+Below, we have a more detailed explanation of the installation, which can be skipped if you want to follow a different strategy, e.g., installing with your Linux package manager.
+
+> **Note**
+>
+> We tested this installation procedure on SURF using a workspace with "Ubuntu 20.04 (SUDO enabled)".
+
+Install Docker following [the official documentation](https://docs.docker.com/engine/install/ubuntu/).
+At the time of writing, the commands are:
+
+```bash
+sudo apt-get update
+sudo apt-get install ca-certificates curl gnupg
+sudo install -m 0755 -d /etc/apt/keyrings
+curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
+sudo chmod a+r /etc/apt/keyrings/docker.gpg
+echo \
+  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
+  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
+  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+sudo apt-get update
+sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
+```
+
+Then, add your user to the docker group:
+
+```bash
+sudo usermod -aG docker $USER
+```
+
+Log out and log in again. Test that you can run `docker run hello-world`.
+
+Download minikube and install it.
+Following the [official documentation](https://minikube.sigs.k8s.io/docs/start/) at the time of writing, you can run:
+
+```bash
+curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
+sudo dpkg -i minikube_latest_amd64.deb
+```
+
+Install `kubectl` following the [official documentation](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#install-using-native-package-management), **however, fix the curl command** following [this issue](https://github.com/kubernetes/release/issues/2862):
+
+```bash
+sudo apt-get update
+sudo apt-get install -y ca-certificates curl
+# sudo curl -fsSLo /etc/apt/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
+sudo curl -fsSLo /etc/apt/keyrings/kubernetes-archive-keyring.gpg https://dl.k8s.io/apt/doc/apt-key.gpg
+echo "deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
+sudo apt-get update
+sudo apt-get install -y kubectl
+```
+
+You can install bash completions using
+
+```bash
+kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
+```
+
+Log out and log in after installing bash completions.
+
+## Start minikube
+
+Run
+
+```bash
+minikube start --cpus CPU_NUMBER --memory HOW_MUCH_MEMORY
+```
+
+The `CPU_NUMBER` argument is the number of CPUs you want to dedicate to `minikube`.
+The `HOW_MUCH_MEMORY` argument is how much memory to dedicate.
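+
+For example, on a machine with 16 cores and 32 GB of memory you could dedicate most of it to minikube, leaving some resources for the host system (the values below are only an illustration):
+
+```bash
+minikube start --cpus 14 --memory 28g
+```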
+
+## Create a volume
+
+To share data between the tasker and the workers, and to keep that data after using it, we need to create a volume.
+The volume is necessary to hold the `data`, the `scripts`, and the `output`, for instance.
+
+We show how to configure a local volume, but you are free to use other volumes as well, as long as they accept `ReadWriteMany`.
+Please notice that this assumes that you use a single node.
+
+Below we have the command for `minikube`:
+
+```bash
+minikube ssh -- sudo mkdir -p /mnt/asreview-storage
+```
+
+Then, run
+
+```bash
+kubectl apply -f storage-local.yml
+```
+
+The `storage-local.yml` file contains a `StorageClass`, a `PersistentVolume`, and a `PersistentVolumeClaim`.
+It uses local storage inside `minikube`, and it assumes that **2 GB** are sufficient for the project.
+Change as necessary.
+
+Then, uncomment the relevant part of the `volumes` section at the end of `worker.yml` and `tasker.yml`.
+For this case, it should look like
+
+```yml
+volumes:
+  - name: asreview-storage
+    persistentVolumeClaim:
+      claimName: asreview-storage
+```
+
+### Retrieving the output
+
+You can copy the `output` folder from the volume with
+
+```bash
+kubectl cp asreview-worker-FULL-NAME:/app/workdir/output ./output
+```
+
+Also, check the `/app/workdir/issues` folder.
+It stores errors that occurred while running the simulate commands, so it should be empty.
+If it is not empty, the offending lines will be listed there.
diff --git a/42-kubernetes-cloud-provider.md b/42-kubernetes-cloud-provider.md
new file mode 100644
index 0000000..fb619d3
--- /dev/null
+++ b/42-kubernetes-cloud-provider.md
@@ -0,0 +1,78 @@
+# Kubernetes with a cloud provider
+
+> **Warning**
+>
+> This strategy has not been tested on an actual cluster yet, so it is highly experimental at this point.
+
+If you run `kubectl` from your own computer to manage the Kubernetes cluster, you will need to install it.
+You can check the guide for a [single computer](41-kubernetes-single-computer.md) and ignore the minikube installation.
+
+You have to configure access to the cluster, and since that depends on the cloud provider, I will leave that to you.
+Please remember that all commands will assume that you are connecting to the cluster, which might involve additional flags to pass your credentials.
+
+## Create a volume
+
+To share data between the tasker and the workers, and to keep that data after using it, we need to create a volume.
+The volume is necessary to hold the `data`, the `scripts`, and the `output`, for instance.
+
+If you have some volume provider that accepts `ReadWriteMany`, use that.
+Otherwise, we show below how to set up an NFS server using Kubernetes resources, and then how to use that server as a volume for your pods.
+
+The file `storage-nfs.yml` will run an NFS server inside one of the nodes.
+Simply run
+
+```bash
+kubectl apply -f storage-nfs.yml
+```
+
+Then, run
+
+```bash
+kubectl -n asreview-cloud get services
+```
+
+You should see something like
+
+```plaintext
+NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
+...
+nfs-service   ClusterIP   NFS_SERVICE_IP                 2049/TCP,20048/TCP,111/TCP   82m
+...
+```
+
+Copy the `NFS_SERVICE_IP`.
+Then, uncomment the relevant part of the `volumes` section at the end of `worker.yml` and `tasker.yml`.
+For this case, it should look like
+
+```yml
+volumes:
+  - name: asreview-storage
+    nfs:
+      server: NFS_SERVICE_IP
+      path: "/"
+```
+
+### Retrieving the output
+
+The easiest way to manipulate the output when you have an NFS server is to mount it locally.
+Run the following command in a terminal:
+
+```bash
+kubectl -n asreview-cloud port-forward nfs-server-FULL-NAME 2049
+```
+
+In another terminal, run
+
+```bash
+mkdir asreview-storage
+sudo mount -v -o vers=4,loud localhost:/ asreview-storage
+```
+
+Copy things out as necessary.
+When you're done, run
+
+```bash
+sudo umount asreview-storage
+```
+
+And hit CTRL-C on the running `kubectl port-forward` command.
diff --git a/README.md b/README.md
index 1625523..abdf2ae 100644
--- a/README.md
+++ b/README.md
@@ -1,459 +1,17 @@
-# asreview-cloud
-
-In this repository, we keep files used to run (very) large simulations in parallel in the cloud.
-The approach we use here is to run a Kubernetes cluster, and send individual simulation commands to different workers.
-This assumes that you already know how to run simulations with [Makita](https://github.com/asreview/asreview-makita).
-
-This documentation should help you get Kubernetes installed locally on a Linux machine or on SURF, and to run some examples of simulations.
-For more advanced usage, for instance, using an existing Kubernetes cluster, we provide no official support, but the community might have some tips.
-
-## Explanation
-
-The basic explanation of how this work is: one core of the machine reads the `jobs.sh` Makita file and sends each line to a different core of the machine.
-
-The more convoluted explanation is below:
-
-- We have various _worker_ pods.
-- We have a _tasker_ pod.
-- The tasker runs a **shell file** (`tasker.sh`) that prepares the ground for the workers and then sends work for the workers.
-  - This file can be heavilly modified by the user, to handle specific use cases.
-- One possible script is `python tasker-send.py FILE`, which send each line of the `FILE` to the workers as message through RabbitMQ.
-  - If you don't use `tasker-send.py`, there is no parallel execution of the tasks.
-- The worker receives the message and runs it as a shell command.
-- When the worker completes the command, it sends a message back to the `tasker-send.py`, so it can keep track of what was executed.
-- `tasker-send.py` will block the execution of further commands until `FILE` is completed.
-- Another possible command is `python split-file.py FILE`, which reads the `FILE` and creates three new files:
-  - `FILE.part1` contains every command with "mkdir" and "describe", and everything before the first `simulate` line.
-  - `FILE.part2` contains every `simulate` line.
-  - `FILE.part3` contains every other command.
-- The most basic workflow is to take the Makita `jobs.sh`, split it into three, run the first part directly with the tasker (to create folders), then send the second part with `tasker-send.py`, and finally send the third part as well.
-
-The visual representation (which is not very helpful by itself) is below:
-
-![Workflow representation](workflow.jpg)
-
-## Installing locally
-
-Install `minikube` from your package provider or look into the [official documentation](https://minikube.sigs.k8s.io/docs/start/).
-
-You should have the `kubectl` command as well.
-
-## Install on SURF
-
-Create a workspace with "Ubuntu 20.04 (SUDO enabled)".
-
-Install docker following [the official documentation](https://docs.docker.com/engine/install/ubuntu/).
-At the time of writing, the commands are:
-
-```bash
-sudo apt-get update
-sudo apt-get install ca-certificates curl gnupg
-sudo install -m 0755 -d /etc/apt/keyrings
-curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
-sudo chmod a+r /etc/apt/keyrings/docker.gpg
-echo \
-  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
-  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
-  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
-sudo apt-get update
-sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
-```
-
-Then, add your user to the docker group:
-
-```bash
-sudo usermod -aG docker $USER
-```
-
-Log out and log in again. Test that you can run `docker run hello-world`.
-
-Download minikube and install it.
-Following the [official documentation](https://minikube.sigs.k8s.io/docs/start/) at the time of writing, you can run:
-
-```bash
-curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
-sudo dpkg -i minikube_latest_amd64.deb
-```
-
-Install `kubectl` following the [official documentation](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/#install-using-native-package-management), **however, fix the curl command** following [this issue](https://github.com/kubernetes/release/issues/2862):
-
-```bash
-sudo apt-get update
-sudo apt-get install -y ca-certificates curl
-# sudo curl -fsSLo /etc/apt/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
-sudo curl -fsSLo /etc/apt/keyrings/kubernetes-archive-keyring.gpg https://dl.k8s.io/apt/doc/apt-key.gpg
-echo "deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
-sudo apt-get update
-sudo apt-get install -y kubectl
-```
-
-You can install bash completions using
-
-```bash
-kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
-```
-
-Log out and log in after installing bash completions.
-
-## Start minikube and install RabbitMQ
-
-We need to install and run RabbitMQ on Kubernetes.
-Run the following commands takes from [RabbitMQ Cluster Operator](https://www.rabbitmq.com/kubernetes/operator/quickstart-operator.html), and then the `rabbitmq.yml` service.
-
-```bash
-minikube start --cpus CPU_NUMBER --memory HOW_MUCH_MEMORY
-kubectl apply -f "https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml"
-```
-
-The `CPU_NUMBER` argument is the number of CPUs you want to dedicate to `minikube`.
-The `HOW_MUCH_MEMORY` argument is how much memory.
-
-> **Note**
->
-> If you are on SURF, you found these values when creating the machine.
-
-## Clone this repo
-
-If you haven't already, clone this repo and enter it's folder:
-
-```bash
-git clone https://github.com/abelsiqueira/asreview-cloud
-cd asreview-cloud
-```
-
-From here on, we will need files inside the `asreview-cloud` repo.
-
-## Create a namespace for asreview things
-
-The configuration files use the namespace `asreview-cloud` by default, so if you want to change it, you need to change in the file below and all other places that have `# namespace: asreview-cloud`.
-
-```bash
-kubectl apply -f asreview-cloud-namespace.yml
-```
-
-## Start RabbitMQ configuration
-
-Run
-
-```bash
-kubectl apply -f rabbitmq.yml
-```
-
-## Create a volume
-
-To share data between the worker and taskers, and to keep that data after using it, we need to create a volume.
-The volume is necessary to hold the `data`, `scripts`, and the `output`, for instance.
-
-If you are using a single node (e.g., on SURF), you can use the local configuration.
-Otherwise, we show below how to set up a NFS server using Kubernetes resources.
-You can, naturally, use other Kubernetes volume options, as long as they accept `ReadWriteMany`, however we won't show how to configure that.
-
-### Local
-
-It you have a single node, you need to create the storage folder on the node.
-Below we have the command for `minikube`.
-
-```bash
-minikube ssh -- sudo mkdir -p /mnt/asreview-storage
-```
-
-Then, run
-
-```bash
-kubectl apply -f storage-local.yml
-```
-
-The `storage-local.yml` file contains a `StorageClass`, a `PersistentVolume`, and a `PersistentVolumeClaim`.
-It uses a local storage inside `minikube`, and it assumes that **2 GB** are sufficient for the project.
-Change as necessary.
-
-Then, uncomment the `worker.yml` and `tasker.yml` relevant part at the `volumes` section in the end.
-For this case, it should look like
-
-```yml
-volumes:
-  - name: asreview-storage
-    persistentVolumeClaim:
-      claimName: asreview-storage
-```
-
-### NFS
-
-The file `storage-nfs.yml` will run an NFS server inside one of the nodes.
-Simple run
-
-```bash
-kubectl apply -f storage-nfs.yml
-```
-
-Then, run
-
-```bash
-kubectl -n asreview-cloud get services
-```
-
-You should see something like
-
-```plaintext
-NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
-...
-nfs-service   ClusterIP   NFS_SERVICE_IP                 2049/TCP,20048/TCP,111/TCP   82m
-...
-```
-
-Copy the `NFS_SERVICE_IP`.
-Then, uncomment the `worker.yml` and `tasker.yml` relevant part at the `volumes` section in the end.
-For this case, it should look like
-
-```yml
-volumes:
-  - name: asreview-storage
-    nfs:
-      server: NFS_SERVICE_IP
-      path: "/"
-```
-
-## S3 storage (_Optional step_)
-
-You might want to setup S3 storage for some files after running the simulation.
-You have to find your own S3 service, e.g. AWS S3 or Scaleway - looks like you can use [Scaleway](https://scaleway.com) for free under some limitations, but do that under your own risk.
-
-After setting up S3 storage, edit the `s3-secret.yml` file with the relevant values.
-The file must store the base 64 encoded strings, not the raw strings.
-To encode, use
-
-```bash
-echo -n 'WHATEVER' | base64
-```
-
-Copy that value and paste in the appropriate field of the file.
-
-Finally, run the secret:
-
-```bash
-kubectl apply -f s3-secret.yml
-```
-
-Edit the `worker.yml` file and uncomment the lines related to S3.
-
-By default, only the metrics file are uploaded to S3.
-Edit `worker-receiver.py` to change that.
-
-By default, the prefix of the folder on S3 is the date and time.
-To change that, edit `tasker.sh`.
-
-## Prepare the tasker script and Docker image
-
-The `tasker.sh` defines everything that will be executed by the tasker, and indirectly by the workers.
-The `tasker.Dockerfile` will create the image that will be executed in the tasker pod.
-You can modify these as you see fit.
-After you are done, compile and push the image:
-
-> **Warning**
->
-> The default tasker assumes that a data folder exists with your data.
-> Make sure to either provide the data or change the tasker and Dockerfile.
-
-```bash
-docker build -t YOURUSER/tasker -f tasker.Dockerfile .
-docker push YOURUSER/tasker
-```
-
-> **Note**
->
-> This will push the image to Docker. You will need to create an account an login in your terminal with `docker login`.
-
-## Prepare the worker script and Docker image
-
-The `worker.sh` defines a very short list of tasks: running `worker-receiver.py`.
-You can do other things before that, but tasks that are meant to be run before **all** workers start working should go on `tasker.sh`.
-The `worker-receiver.py` runs continuously, waiting for new tasks from the tasker.
-
-```bash
-docker build -t YOURUSER/worker -f worker.Dockerfile .
-docker push YOURUSER/worker
-```
-
-## Running the workers
-
-The file `worker.yml` contains the configuration of the deployment of the workers.
-Change the `image` to reflect the path to the image that you pushed.
-You can select the number of `replicas` to change the number of workers.
-Pay attention to the resource limits, and change as you see fit.
-
-Run with
-
-```bash
-kubectl apply -f worker.yml
-```
-
-Check that the workers are running with the following:
-
-```bash
-kubectl get pods
-```
-
-You should see some `asreview-worker-FULL-NAME` pods with "Running" status after a while.
-Follow the logs of a single pod with
-
-```bash
-kubectl logs asreview-worker-FULL-NAME -f
-```
-
-You should see something like
-
-```plaintext
-Logging as ...
-[*] Waiting for messages. CTRL+C to exit
-```
-
-## Running the tasker
-
-Similarly, the `tasker.yml` allows you to run the tasker as a Kubernetes job.
-Change the `image`, and optionally add a `ttlSecondsAfterFinished` to auto delete the task - I prefer to keep it until I review the log.
-Run
-
-```bash
-kubectl apply -f tasker.yml
-```
-
-Similarly, you should see a `tasker` pod, and you can follow its log.
-
-## Retrieving the output
-
-### Local
-
-If you used a local volume, you can copy the `output` folder from the volume with
-
-```bash
-kubectl cp asreview-worker-FULL-NAME:/app/workdir/output ./output
-```
-
-Also, check the `/app/workdir/issues` folder.
-It should be empty, because it contains errors while running the simulate code.
-If it is not empty, the infringing lines will be shown.
-
-### NFS
-
-The easiest way to manipulate the output when you have an NFS server is to mount the NFS server.
-Run the following command in a terminal:
-
-```bash
-kubectl -n asreview-cloud port-forward nfs-server-FULL-NAME 2049
-```
-
-In another terminal, run
-
-```bash
-mkdir asreview-storage
-sudo mount -v -o vers=4,loud localhost:/ asreview-storage
-```
-
-Copy things out as necessary.
-When you're done, run
-
-```bash
-sudo umount asreview-storage
-```
-
-And hit CTRL-C on the running `kubectl port-forward` command.
-
-## Deleting and restarting
-
-If you plan to make modifications to the tasker or the worker, they have to be deleted, respectivelly.
-
-The workers keep running after the tasker is done.
-They don't know when to stop.
-To stop and delete them, run
-
-```bash
-kubectl delete -f worker.yml
-```
-
-If you did not set a `ttlSecondsAfterFinished` for the tasker, it will keep existing, although not running.
-You can delete it the same way as you did the workers, but using `tasker.yml`.
-
-You can then delete the `volume.yml` and the `rabbit.yml`, but if you are running new tests, you don't need to.
-
-Since the volume is mounted separately, you don't lose the data.
-You will lose the execution log, though.
-
-Running everything again is simply a matter of using `kubectl apply` again.
-Of course, if you modify the `.sh` or `.py` files, you have to build the corresponding docker image again.
-
-> **Warning**
->
-> The default **tasker** deletes the whole workdir folder to make sure that it is clean when it starts.
-> If you don't want this behaviour, look for the "rm -rf" line and comment it out or remove it.
-> However, if you run into a "Project already exists" error, this is why.
-
-## Lightweight usage
-
-If you want to use parallelization, but can't, or don't want to, use Kubernetes, it is possible to use [GNU Project package](https://www.gnu.org/software/parallel) to parallelize the execution of the script locally.
-For such usage follow these steps:
-
-Install the package [GNU parallel](https://www.gnu.org/software/parallel/) following the instructions on the website.
-We recommend installing the package via package managers if you have one (such as `apt-get`, `homebrew` or `chocolatey`).
-
-> **Note**
->
-> For SURF, that would be `sudo apt-get install parallel`.
-
-In case you do not have one, you can follow the steps below:
-
-- If you are using UNIX based system(Linux or MacOS),you are going to need `wget`.
-
-Run those commands `parallel-*` will be a downloaded file with a number instead of '*':
-
-```bash
-wget https://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
-tar -xjf parallel-latest.tar.bz2
-cd parallel-NUMBER
-./configure
-make
-sudo make install
-```
-
-Check that the package is installed with
-
-```bash
-parallel --version
-```
-
-To parallelize your `jobs.sh` file, we need to split it into blocks that can be parallelized.
-
-```bash
-python split-file.py
-```
-
-Then you can just run the script below, specifying the number of cores as an argument.
-> **Warning**
-> We recommend not using all of your CPU cores at once.
-> Leave at least one or two to allow your machine to process other tasks.
-> Notice that there is no limitation on memory usage per task, so for models that use a lot of memory, there might be some competition for resources.
-
-```bash
-bash parallel_run.sh
-```
-
-## Troubleshooting and FAQ
-
-### After running the tasker, the workers are in CrashLoopBackOff/Error
-
-Probably some command in the tasker resulted in the worker failure, and now the queue is populated and the worker keep trying and failing.
-Looking at the logs of the worker should give insight in the real issue.
-
-To verify if you have a queue issue, run
-
-```bash
-kubectl -n asreview-cloud exec rabbitmq-server-0 -- rabbitmqctl list_queues
-```
-
-If any of the queues has more than 0 messages, then this confirms the issue.
-Delete the queue with messages:
-
-```bash
-kubectl -n asreview-cloud exec rabbitmq-server-0 -- rabbitmqctl delete_queue asreview_queue
-```
-
-You should see the workers go back to "Running" state.
+# asreview-cloud (Running simulations on the cloud and/or in parallel)
+
+This repository houses some files to help run simulations on the cloud, i.e., outside your computer, possibly in parallel.
+We assume that you know how to run simulations on your computer using [Makita](https://github.com/asreview/asreview-makita).
+The information for running simulations on the cloud is separated into the following use cases:
+
+- [Running a "short" simulation on SURF, Digital Ocean, AWS, Azure, etc.](10-simple.md)
+  - Use this guide if your local computer is not powerful enough, or if you need it available while the simulations run.
+- [Running simulations in parallel](20-parallel.md)
+  - Use this when you have a computer (local or remote) with a good amount of cores and memory, and you want to speed things up.
+- [Running many jobs.sh files one after the other](30-many-jobs.md)
+  - Use if you need to run many simulations with changing parameters, but you only have one computer.
+  - You can still parallelize the individual `jobs.sh` executions.
+- [Running large simulations using Kubernetes](40-kubernetes.md)
+  - Use if your simulation would take a very long time.
+  - Alternatively, use it if you have a powerful enough computer and need to control the CPU and memory usage.
+  - This is very complicated, and it usually requires a lot of time to set up and money to run on a cluster.
diff --git a/examples/custom_arfi_synergy/worker.Dockerfile b/examples/custom_arfi_synergy/worker.Dockerfile
index 3fe9ff8..cac2787 100644
--- a/examples/custom_arfi_synergy/worker.Dockerfile
+++ b/examples/custom_arfi_synergy/worker.Dockerfile
@@ -18,7 +18,7 @@ RUN pip install gensim~=4.2.0
 # RUN pip install https://github.com/jteijema/asreview-reusable-fe/archive/main.zip
 # RUN pip install https://github.com/jteijema/asreview-XGBoost/archive/main.zip
 
-# For neural netowrk
+# For neural network
 # RUN pip install tensorflow~=2.9.1
 
 #### Don't modify below this line
diff --git a/many-jobs.sh b/many-jobs.sh
new file mode 100644
index 0000000..b33584e
--- /dev/null
+++ b/many-jobs.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+CONSTANT="template arfi" # Edit here to your liking
+num_cores=$1
+
+if [ -z "$num_cores" ]; then
+    echo "ERROR: Missing number of cores"
+    exit 1
+fi
+
+if [ ! -f makita-args.txt ]; then
+    echo "ERROR: Create a file makita-args.txt before running this"
+    exit 1
+fi
+
+while read -r arg
+do
+    # Answering "A" overwrites existing files; $CONSTANT and $arg are intentionally unquoted so they expand into multiple arguments
+    echo "A" | asreview makita $CONSTANT $arg
+    # Edit to your liking from here
+    python3 split-file.py jobs.sh
+    bash parallel_run.sh "$num_cores"
+    mv output "output-args_$arg"
+    # to here
+done < makita-args.txt
diff --git a/parallel_run.sh b/parallel_run.sh
index 8add085..bbdb83b 100644
--- a/parallel_run.sh
+++ b/parallel_run.sh
@@ -1,8 +1,33 @@
 #!/bin/bash
+
+function usage {
+    echo "Usage:"
+    echo ""
+    echo "Generate your jobs.sh file, obtain split-file.py and run"
+    echo ""
+    echo "    python3 split-file.py jobs.sh"
+    echo "    bash parallel_run.sh NUMBER_OF_CORES"
+}
+
 # Record the start time
 start_time=$(date +%s)
 
 num_cores=$1
+
+if [ -z "$num_cores" ]; then
+    echo "ERROR: Missing number of cores"
+    usage
+    exit 1
+fi
+for i in 1 2 3
+do
+    if [ ! -f "jobs.sh.part$i" ]; then
+        echo "ERROR: File jobs.sh.part$i not found. Did you run split-file.py?"
+        usage
+        exit 1
+    fi
+done
+
 # Utilize the GNU package for parallelization
 bash jobs.sh.part1
 parallel -j "$num_cores" < jobs.sh.part2
diff --git a/tasker.Dockerfile b/tasker.Dockerfile
index 8310775..2bac083 100644
--- a/tasker.Dockerfile
+++ b/tasker.Dockerfile
@@ -14,7 +14,7 @@ RUN apt-get update && \
     apt-get install -y git \
     --no-install-recommends \
     && rm -rf /var/lib/apt/lists/* \
-    && pip install git+https://github.com/abelsiqueira/asreview-makita@29-fix-broken-comment-line
+    && pip install asreview-makita
 
 #### Don't modify below this line
 COPY ./split-file.py /app/split-file.py