
Commit

Add code for training and evaluating DMR and LLaMA (including README for instructions)
xhluca committed Feb 14, 2024
1 parent 96aa84f commit 1623163
Showing 19 changed files with 1,769 additions and 2 deletions.
15 changes: 13 additions & 2 deletions README.md
@@ -40,8 +40,19 @@ Check out our [documentation](https://mcgill-nlp.github.io/weblinx/docs) for more

### Modeling

Our modeling code is separate from the `weblinx` library, but requires it as a dependency. You can install the modeling code by running:

```bash
# First, install the base package
pip install weblinx

# Then, clone this repo
git clone https://github.com/McGill-NLP/weblinx
cd weblinx/modeling
```

For the rest of the instructions, please take a look at the [modeling README](./modeling/README.md).

### Evaluation

Coming soon!
136 changes: 136 additions & 0 deletions modeling/README.md
@@ -0,0 +1,136 @@
The following instructions assume you are running them from the `modeling/` directory (you may need to `cd` into it first).

### Download Candidates

First, you need to download the `train.jsonl` file of candidates selected by `McGill-NLP/MiniLM-L6-DMR`:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="McGill-NLP/WebLINX-full",
    repo_type="dataset",
    allow_patterns="candidates/*.jsonl",
    local_dir="./"
)
```
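To sanity-check the download, you can peek at the first record of the candidates file. A minimal sketch; the exact fields in each record depend on the dataset, so it only prints the keys:

```python
import json

# Inspect the first record of the downloaded candidates file
with open("candidates/train.jsonl") as f:
    first = json.loads(f.readline())

print(sorted(first.keys()))
```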

You can also download the entire dataset (the configs expect it under `./wl_data/`):

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data/")

# If you only want the splits.json file, you can just run:
snapshot_download(
    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", allow_patterns="splits.json", local_dir="./wl_data/"
)

# If you only want candidates:
snapshot_download(
    repo_id="McGill-NLP/WebLINX-full",
    repo_type="dataset",
    allow_patterns="candidates/*.jsonl",
    local_dir="./wl_data/"
)
```

The default configs (`config.yaml`) assume that `train.jsonl` is located at `./candidates/train.jsonl`. If you want to change the path, you need to modify the config accordingly.
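If you do change it, the relevant entry in the config might look like the following. This is a hypothetical sketch, and the actual key name may differ, so verify it against the config file before editing:

```yaml
# Hypothetical key name — check the actual config.yaml for the real one
candidates:
  train_path: /absolute/path/to/candidates/train.jsonl
```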

### Set `WEBLINX_PROJECT_DIR`

You need to set the `WEBLINX_PROJECT_DIR` environment variable to the root directory of the WebLINX project. For example:

```bash
export WEBLINX_PROJECT_DIR=/path/to/the/modeling/directory/

# For example, if you are in the modeling directory, you can run:
export WEBLINX_PROJECT_DIR=$(pwd)
```
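This variable is consumed by the hydra configs through the `oc.env` resolver (e.g. `project_dir: ${oc.env:WEBLINX_PROJECT_DIR}` in `dmr/conf/config.yaml`), so data, checkpoint, result, and log paths all resolve relative to it.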

### Install Dependencies

You need to install the dependencies by running the following command:

```bash
pip install -r requirements.txt
```

### Action Model: LLaMA

#### Train LLaMA

You can train the model by running the following command (it will automatically use the hydra config from `conf/`):

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# Finetune 1.3b variant
python -m llama.train +variant="ft_1.3b"

# Finetune 2.7b variant
python -m llama.train +variant="ft_2.7b"

# For 7b, you will need to use FSDP with accelerate to train on 4 GPUs with 48GB VRAM
export CUDA_VISIBLE_DEVICES="0,1,2,3"
accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_7b.yaml -m llama.train +variant="ft_7b"

# For 13b, you need 6 GPUs with 48GB VRAM
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"
accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_13b.yaml -m llama.train +variant="ft_13b"
```

Results will be saved in `./results` and checkpoints in `./checkpoints`.


#### Evaluate LLaMA

You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `valid` split, you can run the following command:

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# On just one split
python -m llama.eval +variant="ft_1.3b" eval.split=valid

# On multiple splits (e.g. test_iid, test_vis)
python -m llama.eval -m +variant="ft_2.7b" eval.split=test_iid,test_web,test_geo,test_cat,test_vis
```
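The `-m` flag launches a hydra multirun, so the comma-separated `eval.split` values are evaluated one after another as separate runs.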

### Dense Markup Ranking (DMR)

#### Train DMR

You can train the model by running the following command (it will automatically use the hydra config from `conf/`):

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# Finetune MiniLM-L6-DMR (Default)
python -m dmr.train

# Finetune variant gte or bge
python -m dmr.train +variant=gte
python -m dmr.train +variant=bge
```
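The default values come from `dmr/conf/config.yaml` (shown below); passing `+variant=gte` or `+variant=bge` merges the corresponding file from `dmr/conf/variant/` over the base config, which swaps in a different `model.name`.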

Results will be saved in `./results` and checkpoints in `./checkpoints`.

#### Evaluate DMR

You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `valid` split, you can run the following command:

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# On just one split
python -m dmr.eval eval.split=valid

# On multiple splits (e.g. test_iid, test_vis), using hydra multirun (-m)
python -m dmr.eval -m eval.split=test_iid,test_web,test_geo,test_cat,test_vis

# Or for the bge and gte variants
python -m dmr.eval -m +variant=gte eval.split=test_iid,test_web,test_geo,test_cat,test_vis
python -m dmr.eval -m +variant=bge eval.split=test_iid,test_web,test_geo,test_cat,test_vis
```
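After training, the DMR checkpoint is saved under `model.save_dir` from the config. A minimal retrieval sketch, assuming the checkpoint is in the standard sentence-transformers format at the default save path, with hypothetical query and candidate strings:

```python
from sentence_transformers import SentenceTransformer, util

# Default save_dir from dmr/conf/config.yaml: checkpoints/dmr/<model.name>
model = SentenceTransformer("./checkpoints/dmr/sentence-transformers/all-MiniLM-L6-v2")

query = "Click the search button"                        # hypothetical instruction turn
candidates = ["<button>Search</button>", "<a>Home</a>"]  # hypothetical DOM elements

q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)

# The config uses cosine similarity (model.similarity: cos_sim)
print(util.cos_sim(q_emb, c_emb))
```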
43 changes: 43 additions & 0 deletions modeling/dmr/conf/config.yaml
@@ -0,0 +1,43 @@
project_dir: ${oc.env:WEBLINX_PROJECT_DIR}
seed: 123
project_name: dmr

data:
  split_path: ${project_dir}/wl_data/splits.json

model:
  name: sentence-transformers/all-MiniLM-L6-v2
  max_seq_length: 512
  use_bf16: True
  similarity: cos_sim
  save_dir: ${project_dir}/checkpoints/${project_name}/${model.name}

train:
  split: train
  num_epochs: 10
  max_neg_per_turn: 9
  batch_size_per_device: 64
  dataloader_num_workers: 8
  optim: adamw
  gradient_checkpointing: True
  learning_rate: 0.00003
  warmup_steps: 500
  # Available schedulers:
  # constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
  scheduler: warmuplinear

eval:
  split: dev
  mrr_k: 50
  batch_size_per_device: 64
  result_dir: ${project_dir}/results/${project_name}/${model.name}/${eval.split}

hydra:
  run:
    dir: ${project_dir}/logs/${project_name}/${hydra.job.name}/${now:%Y-%m-%d-%H:%M:%S}
  # Use the same for sweep's subdir
  sweep:
    dir: ${hydra.run.dir}
  job:
    chdir: False
  verbose: INFO
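Because this is a hydra config, any of these values can also be overridden from the command line with dotted syntax instead of editing the file, for example:

```bash
# Override training hyperparameters defined in dmr/conf/config.yaml
python -m dmr.train train.learning_rate=1e-5 train.num_epochs=5
```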
3 changes: 3 additions & 0 deletions modeling/dmr/conf/variant/bge.yaml
@@ -0,0 +1,3 @@
# @package _global_
model:
  name: BAAI/bge-small-en-v1.5
6 changes: 6 additions & 0 deletions modeling/dmr/conf/variant/gte.yaml
@@ -0,0 +1,6 @@
# @package _global_
model:
  name: thenlper/gte-base

train:
  gradient_checkpointing: True
