
Commit

Add code for training and evaluating DMR and LLaMA (including README for instructions)
xhluca committed Feb 14, 2024
1 parent 96aa84f commit 1623163
Showing 19 changed files with 1,769 additions and 2 deletions.
15 changes: 13 additions & 2 deletions README.md
@@ -40,8 +40,19 @@ Check out our [documentation](https://mcgill-nlp.github.io/weblinx/docs) for more

### Modeling

Our modeling code is separate from the `weblinx` library, but requires it as a dependency. You can install the modeling code by running:

```bash
# First, install the base package
pip install weblinx

# Then, clone this repo
git clone https://github.com/McGill-NLP/weblinx
cd weblinx/modeling
```

For the rest of the instructions, please take a look at the [modeling README](./modeling/README.md).

### Evaluation

Coming soon!
136 changes: 136 additions & 0 deletions modeling/README.md
@@ -0,0 +1,136 @@
The following instructions assume you are running them from the `modeling/` directory (you may need to `cd` into it first).

### Download Candidates

First, you need to download the `train.jsonl` file of candidates selected by `McGill-NLP/MiniLM-L6-DMR`:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="McGill-NLP/WebLINX-full",
    repo_type="dataset",
    allow_patterns="candidates/*.jsonl",
    local_dir="./"
)
```
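To sanity-check the download, you can peek at the first record of the candidates file. A minimal sketch; the exact fields in each record depend on the dataset, so it only prints the keys:

```python
import json

# Inspect the first record of the downloaded candidates file
with open("candidates/train.jsonl") as f:
    first = json.loads(f.readline())

print(sorted(first.keys()))
```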

You can also download the entire dataset (the configs expect it under `./wl_data/`):

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data/")

# If you only want the splits.json file, you can just run:
snapshot_download(
    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", allow_patterns="splits.json", local_dir="./wl_data/"
)

# If you only want candidates:
snapshot_download(
    repo_id="McGill-NLP/WebLINX-full",
    repo_type="dataset",
    allow_patterns="candidates/*.jsonl",
    local_dir="./wl_data/"
)
```

The default configs (`config.yaml`) assume that `train.jsonl` is located at `./candidates/train.jsonl`. If you want to change the path, you need to modify the config accordingly.
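If you do change it, the relevant entry in the config might look like the following. This is a hypothetical sketch, and the actual key name may differ, so verify it against the config file before editing:

```yaml
# Hypothetical key name — check the actual config.yaml for the real one
candidates:
  train_path: /absolute/path/to/candidates/train.jsonl
```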

### Set `WEBLINX_PROJECT_DIR`

You need to set the `WEBLINX_PROJECT_DIR` environment variable to the root directory of the WebLINX project. For example:

```bash
export WEBLINX_PROJECT_DIR=/path/to/the/modeling/directory/

# For example, if you are in the modeling directory, you can run:
export WEBLINX_PROJECT_DIR=$(pwd)
```
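This variable is consumed by the hydra configs through the `oc.env` resolver (e.g. `project_dir: ${oc.env:WEBLINX_PROJECT_DIR}` in `dmr/conf/config.yaml`), so data, checkpoint, result, and log paths all resolve relative to it.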

### Install Dependencies

You need to install the dependencies by running the following command:

```bash
pip install -r requirements.txt
```

### Action Model: LLaMA

#### Train LLaMA

You can train the model by running the following command (it will automatically use the hydra config from `conf/`):

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# Finetune 1.3b variant
python -m llama.train +variant="ft_1.3b"

# Finetune 2.7b variant
python -m llama.train +variant="ft_2.7b"

# For 7b, you will need to use FSDP with accelerate to train on 4 GPUs with 48GB VRAM
export CUDA_VISIBLE_DEVICES="0,1,2,3"
accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_7b.yaml -m llama.train +variant="ft_7b"

# For 13b, you need 6 GPUs with 48GB VRAM
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"
accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_13b.yaml -m llama.train +variant="ft_13b"
```

Results will be saved in `./results` and checkpoints in `./checkpoints`.


#### Evaluate LLaMA

You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `valid` split, you can run the following command:

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# On just one split
python -m llama.eval +variant="ft_1.3b" eval.split=valid

# On multiple splits (e.g. test_iid, test_vis)
python -m llama.eval -m +variant="ft_2.7b" eval.split=test_iid,test_web,test_geo,test_cat,test_vis
```
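The `-m` flag launches a hydra multirun, so the comma-separated `eval.split` values are evaluated one after another as separate runs.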

### Dense Markup Ranking (DMR)

#### Train DMR

You can train the model by running the following command (it will automatically use the hydra config from `conf/`):

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# Finetune MiniLM-L6-DMR (Default)
python -m dmr.train

# Finetune variant gte or bge
python -m dmr.train +variant=gte
python -m dmr.train +variant=bge
```
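The default values come from `dmr/conf/config.yaml` (shown below); passing `+variant=gte` or `+variant=bge` merges the corresponding file from `dmr/conf/variant/` over the base config, which swaps in a different `model.name`.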

Results will be saved in `./results` and checkpoints in `./checkpoints`.

#### Evaluate DMR

You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `valid` split, you can run the following command:

```bash
export CUDA_VISIBLE_DEVICES="0" # Set the GPU device you want to use

# On just one split
python -m dmr.eval eval.split=valid

# On multiple splits (e.g. test_iid, test_vis), using hydra multirun (-m)
python -m dmr.eval -m eval.split=test_iid,test_web,test_geo,test_cat,test_vis

# Or for the bge and gte variants
python -m dmr.eval -m +variant=gte eval.split=test_iid,test_web,test_geo,test_cat,test_vis
python -m dmr.eval -m +variant=bge eval.split=test_iid,test_web,test_geo,test_cat,test_vis
```
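After training, the DMR checkpoint is saved under `model.save_dir` from the config. A minimal retrieval sketch, assuming the checkpoint is in the standard sentence-transformers format at the default save path, with hypothetical query and candidate strings:

```python
from sentence_transformers import SentenceTransformer, util

# Default save_dir from dmr/conf/config.yaml: checkpoints/dmr/<model.name>
model = SentenceTransformer("./checkpoints/dmr/sentence-transformers/all-MiniLM-L6-v2")

query = "Click the search button"                        # hypothetical instruction turn
candidates = ["<button>Search</button>", "<a>Home</a>"]  # hypothetical DOM elements

q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)

# The config uses cosine similarity (model.similarity: cos_sim)
print(util.cos_sim(q_emb, c_emb))
```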
43 changes: 43 additions & 0 deletions modeling/dmr/conf/config.yaml
@@ -0,0 +1,43 @@
project_dir: ${oc.env:WEBLINX_PROJECT_DIR}
seed: 123
project_name: dmr

data:
  split_path: ${project_dir}/wl_data/splits.json

model:
  name: sentence-transformers/all-MiniLM-L6-v2
  max_seq_length: 512
  use_bf16: True
  similarity: cos_sim
  save_dir: ${project_dir}/checkpoints/${project_name}/${model.name}

train:
  split: train
  num_epochs: 10
  max_neg_per_turn: 9
  batch_size_per_device: 64
  dataloader_num_workers: 8
  optim: adamw
  gradient_checkpointing: True
  learning_rate: 0.00003
  warmup_steps: 500
  # Available schedulers:
  # constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
  scheduler: warmuplinear

eval:
  split: dev
  mrr_k: 50
  batch_size_per_device: 64
  result_dir: ${project_dir}/results/${project_name}/${model.name}/${eval.split}

hydra:
  run:
    dir: ${project_dir}/logs/${project_name}/${hydra.job.name}/${now:%Y-%m-%d-%H:%M:%S}
  # Use the same for sweep's subdir
  sweep:
    dir: ${hydra.run.dir}
  job:
    chdir: False
  verbose: INFO
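Because this is a hydra config, any of these values can also be overridden from the command line with dotted syntax instead of editing the file, for example:

```bash
# Override training hyperparameters defined in dmr/conf/config.yaml
python -m dmr.train train.learning_rate=1e-5 train.num_epochs=5
```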
3 changes: 3 additions & 0 deletions modeling/dmr/conf/variant/bge.yaml
@@ -0,0 +1,3 @@
# @package _global_
model:
  name: BAAI/bge-small-en-v1.5
6 changes: 6 additions & 0 deletions modeling/dmr/conf/variant/gte.yaml
@@ -0,0 +1,6 @@
# @package _global_
model:
  name: thenlper/gte-base

train:
  gradient_checkpointing: True
