# Add code for training and evaluating DMR and LLaMA (including README for instructions)
The following instructions assume you are running from this directory (you may need to `cd` to this directory).

### Download Candidates

First, you need to download the `train.jsonl` candidates file selected by `McGill-NLP/MiniLM-L6-DMR`:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="McGill-NLP/WebLINX-full",
    repo_type="dataset",
    allow_patterns="candidates/*.jsonl",
    local_dir="./"
)
```
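Once downloaded, you can sanity-check the candidates file. The helper below is a minimal sketch (the `peek_jsonl` name is ours, and it only assumes each line is a standalone JSON object, which is the JSON Lines convention):

```python
import json

def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            records.append(json.loads(line))
    return records

# Default path assumed by config.yml:
# print(peek_jsonl("candidates/train.jsonl"))
```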
Alternatively, you can download the entire dataset:
```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", local_dir="./wl_data/")

# If you only want the splits.json file, you can just run:
snapshot_download(
    repo_id="McGill-NLP/WebLINX-full", repo_type="dataset", allow_patterns="splits.json", local_dir="./wl_data/"
)

# If you only want candidates:
snapshot_download(
    repo_id="McGill-NLP/WebLINX-full",
    repo_type="dataset",
    allow_patterns="candidates/*.jsonl",
    local_dir="./wl_data/"
)
```
The default configs (`config.yml`) assume that the `train.jsonl` is located at `./candidates/train.jsonl`. If you want to change the path, you need to modify the `config.yml` accordingly.
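If you fetched `splits.json` into `./wl_data/`, you can list the available splits. This sketch assumes the file is a JSON object keyed by split name (e.g. `train`, `valid`, `test_iid`); the helper name is ours:

```python
import json

def list_splits(path="wl_data/splits.json"):
    """Load splits.json and return its split names, assuming a dict keyed by split."""
    with open(path, encoding="utf-8") as f:
        return sorted(json.load(f))

# e.g. print(list_splits())
```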
### Set `WEBLINX_PROJECT_DIR`

You need to set the `WEBLINX_PROJECT_DIR` environment variable to the root directory of the WebLINX project. For example:
```bash
export WEBLINX_PROJECT_DIR=/path/to/the/modeling/directory/

# For example, if you are in the modeling directory, you can run:
export WEBLINX_PROJECT_DIR=$(pwd)
```
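The configs resolve all paths against this variable (via `${oc.env:WEBLINX_PROJECT_DIR}` in `config.yml`). A small sketch of the equivalent lookup in Python, with a friendlier error when the variable is unset (the `project_path` helper is ours, not part of the codebase):

```python
import os
from pathlib import Path

def project_path(*parts):
    """Resolve a path relative to WEBLINX_PROJECT_DIR, failing loudly if unset."""
    root = os.environ.get("WEBLINX_PROJECT_DIR")
    if root is None:
        raise RuntimeError("WEBLINX_PROJECT_DIR is not set; run `export WEBLINX_PROJECT_DIR=...` first")
    return Path(root).joinpath(*parts)

# e.g. project_path("wl_data", "splits.json")
```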
### Install Dependencies

You need to install the dependencies by running the following command:

```bash
pip install -r requirements.txt
```
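To verify the install, you can check that the core packages used by the commands in this README are importable. The exact module list below is an assumption based on the tools referenced here (hydra for configs, huggingface_hub for downloads); adjust it to match `requirements.txt`:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that cannot be found by the import system."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed core modules, not an authoritative list:
# print(missing_packages(["hydra", "huggingface_hub", "torch"]))
```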
### Action Model: LLaMA

#### Train LLaMA

You can train the model by running the following command (it will automatically use the hydra config from `conf/`):
```bash
export CUDA_VISIBLE_DEVICES="0"  # Set the GPU device you want to use

# Finetune 1.3b variant
python -m llama.train +variant="ft_1.3b"

# Finetune 2.7b variant
python -m llama.train +variant="ft_2.7b"

# For 7b, you will need to use fsdp in accelerate to train on 4 GPUs with 48GB VRAM
export CUDA_VISIBLE_DEVICES="0,1,2,3"
accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_7b.yaml -m llama.train +variant="ft_7b"

# For 13b, you need 6 GPUs with 48GB VRAM
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"
accelerate launch --use_fsdp --config_file llama/accelerate/fsdp_13b.yaml -m llama.train +variant="ft_13b"
```
Results will be saved in `./results` and checkpoints in `./checkpoints`.
#### Evaluate LLaMA

You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `valid` split, you can run the following command:
```bash
export CUDA_VISIBLE_DEVICES="0"  # Set the GPU device you want to use

# On just one split
python -m llama.eval +variant="ft_1.3b" eval.split=valid

# On multiple splits (e.g. test_iid, test_vis)
python -m llama.eval -m +variant="ft_2.7b" eval.split=test_iid,test_web,test_geo,test_cat,test_vis
```
### Dense Markup Ranking (DMR)

#### Train DMR

You can train the model by running the following command (it will automatically use the hydra config from `conf/`):
```bash
export CUDA_VISIBLE_DEVICES="0"  # Set the GPU device you want to use

# Finetune MiniLM-L6-DMR (Default)
python -m dmr.train

# Finetune variant gte or bge
python -m dmr.train +variant=gte
python -m dmr.train +variant=bge
```
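The `+variant=...` flag tells hydra to merge a variant config (such as the `bge`/`gte` files shown below, marked `# @package _global_`) on top of the base config, overriding only the keys the variant sets. Conceptually this is a recursive dict merge; the sketch below illustrates the idea and is not the actual hydra implementation:

```python
def deep_merge(base, override):
    """Recursively merge `override` into `base`, overriding leaf values (illustrative only)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"model": {"name": "sentence-transformers/all-MiniLM-L6-v2", "max_seq_length": 512}}
variant = {"model": {"name": "BAAI/bge-small-en-v1.5"}}
merged = deep_merge(base, variant)
print(merged["model"]["name"])  # BAAI/bge-small-en-v1.5
```

Note that untouched keys (like `max_seq_length`) survive the merge, which is why the variant files only need to list what they change.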
Results will be saved in `./results` and checkpoints in `./checkpoints`.
#### Evaluate DMR

You need to specify which `eval.split` you want to evaluate on. For example, to evaluate on the `valid` split, you can run the following command:
```bash
export CUDA_VISIBLE_DEVICES="0"  # Set the GPU device you want to use

# On just one split
python -m dmr.eval eval.split=valid

# On multiple splits (e.g. test_iid, test_vis)
python -m dmr.eval eval.split=test_iid,test_web,test_geo,test_cat,test_vis

# Or for bge, gte
python -m dmr.eval +variant=gte eval.split=test_iid,test_web,test_geo,test_cat,test_vis
python -m dmr.eval +variant=bge eval.split=test_iid,test_web,test_geo,test_cat,test_vis
```
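The DMR config below scores candidates with cosine similarity (`similarity: cos_sim`) and reports MRR@k (`mrr_k: 50`). Here is a self-contained sketch of those two quantities on toy data; the function names and the toy embeddings are ours, not from the codebase:

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mrr_at_k(ranked_relevance, k):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, relevant in enumerate(ranked_relevance[:k], start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

# Toy example: rank 3 candidate embeddings against a query embedding
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
scores = [cos_sim(query, c) for c in candidates]
order = sorted(range(len(candidates)), key=lambda i: -scores[i])
# Suppose candidate 2 is the gold element:
relevance = [i == 2 for i in order]
print(mrr_at_k(relevance, k=50))  # 1.0 — the gold candidate is ranked first
```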
Default DMR config (`config.yml`):

```yaml
project_dir: ${oc.env:WEBLINX_PROJECT_DIR}
seed: 123
project_name: dmr

data:
  split_path: ${project_dir}/wl_data/splits.json

model:
  name: sentence-transformers/all-MiniLM-L6-v2
  max_seq_length: 512
  use_bf16: True
  similarity: cos_sim
  save_dir: ${project_dir}/checkpoints/${project_name}/${model.name}

train:
  split: train
  num_epochs: 10
  max_neg_per_turn: 9
  batch_size_per_device: 64
  dataloader_num_workers: 8
  optim: adamw
  gradient_checkpointing: True
  learning_rate: 0.00003
  warmup_steps: 500
  # Available schedulers:
  # constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
  scheduler: warmuplinear

eval:
  split: dev
  mrr_k: 50
  batch_size_per_device: 64
  result_dir: ${project_dir}/results/${project_name}/${model.name}/${eval.split}

hydra:
  run:
    dir: ${project_dir}/logs/${project_name}/${hydra.job.name}/${now:%Y-%m-%d-%H:%M:%S}
  # Use the same for sweep's subdir
  sweep:
    dir: ${hydra.run.dir}
  job:
    chdir: False
  verbose: INFO
```
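Values like `${project_dir}` and `${oc.env:WEBLINX_PROJECT_DIR}` are OmegaConf-style interpolations that hydra resolves when the config is loaded. The stdlib-only sketch below illustrates flat `${...}` substitution; it is not the real resolver and ignores nested keys and the `oc.env` resolver:

```python
import re

def resolve(value, context):
    """Substitute ${key} references in `value` using a flat `context` dict (illustrative only)."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: str(context[m.group(1)]), value)

context = {"project_dir": "/home/user/weblinx", "project_name": "dmr"}
print(resolve("${project_dir}/checkpoints/${project_name}", context))
# /home/user/weblinx/checkpoints/dmr
```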
The `bge` variant override (used via `+variant=bge`):

```yaml
# @package _global_
model:
  name: BAAI/bge-small-en-v1.5
```
The `gte` variant override (used via `+variant=gte`):

```yaml
# @package _global_
model:
  name: thenlper/gte-base

train:
  gradient_checkpointing: True
```