Distinguishing instructions for RAR-b #1066
Have been discussing with @gowitheflow-1998 how to distinguish results w/ & w/o instructions for RAR-b in the result files & the LB. Two ideas:

1. Have a `GritLM-7B` & a `GritLM-7B-noinstruct` folder in the results repo. For models that are no-instruct by default, we'd have to add an instruct folder, e.g. `text-embedding-3-large` & `text-embedding-3-large-instruct`. One can already rename the `results_folder` via the kwargs, e.g. to `GritLM-7B-noinstruct`. In the LB we could then add a `Model types` filter that says `instructions` to filter for models that do/do not use instructions. Maybe that way we don't need two tabs anymore for `RAR-b` & `RAR-b No instructions` but can just put all of them in the same `RAR-b` tab, and then people can filter for instruction usage similar to how one can filter out Cross Encoders for the FollowIR tab.

2. Have `RARbMath.json` for instructions & `RARbMathNoInstruct.json` for no instructions. One could produce these by running with a different `results_folder` specified, but then you still need to rename the files and move them back into the same folder, which makes it more complex than 1. It also creates result files named after tasks that don't exist (there is no task `RARbMathNoInstruct`, only `RARbMath`). Also does not work out of the box with our `Model types`-based filtering and would not allow showing e.g. GritLM-7B no-instruction results in the main mteb tab.

So I suggest we go with 1. but lmk if you disagree! Curious about your thoughts @orionw @gowitheflow-1998! 😊
Thanks for kicking off the discussion @Muennighoff @gowitheflow-1998! I agree we should start updating the leaderboard! I think the core issue here again is the centrality of the instructions. If the instructions can be changed, i.e. if they are model-specific, then (1) is a much better choice IMO. However, for FollowIR/InstructIR/other follow-up works, the instruction is fixed; it's part of the data. You can't run FollowIR without instructions, since the instruction is what defines the query relevance, if that makes sense. For RAR-b, from what I've understood from looking at the data, the instructions are not part of the data but are part of the model approach (and it is desirable for follow-up work to show that better instructions can lead to better models). This falls in the same bucket as SciFact and others, where the goal is to adapt to something using an instruction, and that instruction can be model-designed and is not part of the data. Correct me if I'm wrong @gowitheflow-1998!
If I'm understanding correctly, the issue is comparing models that use instructions on retrieval datasets vs those that don't (so GritLM style vs Contriever style). Given that the instructions are model-specific for all the current Retrieval tasks (and I think it's a fair point that, depending on the instruction, it often leaks dataset info, so it can be an unfair comparison to models without instructions), I would probably vote for (1) as well -- splitting the models into categories: models using instructions, models not using instructions. Because FollowIR and InstructIR cannot be evaluated without instructions, they would only show up in leaderboards where "Models with Instructions" is selected. This would be a fairly large change -- all GritLM evaluations would be moved to "with instructions", since I assume every value on the leaderboard includes instructions. I think that's a reasonable move, but something to consider. I think such a leaderboard would also help show how important the instructions are to a model: if you see a 10-point drop in performance when switching off instructions, then you can see how robust it is.
Not super on topic, but I think it would be nice if we could save the instructions in the run, both for research purposes and because they have a pretty large impact on performance!
Thanks for the detailed discussions @Muennighoff @orionw! I think both suggestions are great! The only concern for differentiating on the […]. Otherwise I think it makes more sense in the long run to do […].
BGE is definitely the outlier on instructions; they are really vague. FWIW, my guess is that the "one instruction for all of retrieval" style is not going to be very popular -- I don't even think BGE-M3 has it, although I may have missed it. I think we could pick either way for the early BGE models and it wouldn't make too much of a difference.
A minor comment here (otherwise it seems like most aspects are discussed): if (1), then I would probably make sure to also update the `model_meta.json` file to include this information.
@Muennighoff re the recent PR into the […]: from the above, I think @gowitheflow-1998 and I understood it as: if models use instructions, they get the "instructions" naming convention. That's why I said above: […]

Hence the […].

I think this would work for the model side, but wouldn't work for the task side -- what you suggested as (1) at the beginning of this issue. We could do […]. So it seems we need to have one consistent tag that applies to all models. I'm fine using either […].

Thus, it has the instruct version attached to the name because it used instructions for RAR-b, and so we needed a model category for bge-base that used instructions. Please let me know if this doesn't make sense or if you have alternative suggestions!
Makes sense, thanks for the detailed explanation!

Why wouldn't it work for the task side? The way I understand it is that users will have to rename the folder, no? So if they run with the expected setting for the model (i.e. instructions for GritLM etc.; no instructions for all-MiniLM etc.), then they don't rename the folder. If they run with the non-default setting, they add […].

But I'm also fine with renaming the default folders to include the suffix if you prefer 👍
I think when we talk about instructions, we are talking about dataset-specific or example-specific instructions? As models like […]. The opposite applies to models like […]. For models like BGE, which by default has the same instruction for one big task, @orionw and I agree they are not aligned with what RAR-b and FollowIR are doing with instructions, so it makes more sense that their default is defined as […]. What do you think @Muennighoff? I actually agree newer models should have their original names (without emphasizing […]).
Bumping up a level of abstraction, I see three reasons why we'd want to distinguish instructions overall: […]

The TL;DR is that I think it's easier to add a consistent tag than one that uses an ambiguous "default" that goes in both directions -- whether that's a consistent "-instruct" for models using instructions, or we go very explicit and every model gets a "-instruct" or "non-instruct" tag to clear up ambiguity. For models not in the results folder, I think we'll have to make a mapping in the `model_meta` that is […].
Ah, perhaps this is where the confusion is @gowitheflow-1998 @Muennighoff. For the leaderboard name I think we can easily strip off the suffixes: they can just be there in the results folder, and we can use that when creating the leaderboard. I agree we don't want to confuse people by naming models that don't exist on HF! :)

Let's say we go with default and non-default. Then we have […]. My other concern with "default" as the naming convention is that it gets a bit weird when models can do both: which one is the default? There are a few models like this (I think some BGE ones, I would have to double-check). My guess is that in the future this will also be more common. Explicitly marking results as instruct or non-instruct usage gets around this.
Yeah, I think this is where the confusion is! Indeed, having both in the folder names and stripping the suffixes off in the leaderboard will work as well.