Distinguishing instructions for RAR-b #1066
Have been discussing with @gowitheflow-1998 how to distinguish results w/ & w/o instructions for RAR-b in the result files & the LB. Two ideas:

1. Have a `GritLM-7B` & a `GritLM-7B-noinstruct` folder in the results repo. For models that are no-instruct by default, we'd have to add an instruct folder, e.g. `text-embedding-3-large` & `text-embedding-3-large-instruct`. One can already rename the `results_folder` via the kwargs, e.g. to `GritLM-7B-noinstruct`. In the LB we could then add a `Model types` filter that says `instructions` to filter for models that do/do not use instructions. Maybe that way we don't need two tabs anymore for `RAR-b` & `RAR-b No instructions` but can just put all of them in the same `RAR-b` tab, and then people can filter for instruction usage similar to how one can filter out Cross Encoders for the FollowIR tab.

2. Have `RARbMath.json` for instructions & `RARbMathNoInstruct.json` for no instructions. One could produce these by running with a different `results_folder` specified, but then you still need to rename the files and move them back into the same folder, which makes it more complex than 1. It also creates result files named after tasks that don't exist (there is no task `RARbMathNoInstruct`, only `RARbMath`). Also does not work out of the box with our `Model types`-based filtering and would not allow showing e.g. GritLM-7B no-instruction results in the main mteb tab.

So I suggest we go with 1. but lmk if you disagree! Curious about your thoughts @orionw @gowitheflow-1998! 😊
Thanks for kicking off the discussion @Muennighoff @gowitheflow-1998! I agree we should start updating the leaderboard! I think the core issue here again is the centrality of the instructions. If the instructions can be changed, i.e. if they are model-specific, then (1) is a much better choice IMO. However, for FollowIR/InstructIR/other follow-up works, the instruction is fixed; it's part of the data. You can't run FollowIR without instructions, since the instruction is what defines the query relevance, if that makes sense. For RAR-b, from what I've understood from looking at the data, the instructions are not part of the data but are part of the model approach (and it is desirable for follow-up work to show that better instructions can lead to better models). This falls in the same bucket as SciFact and others, where the goal is to adapt to something using an instruction, and that instruction can be model-designed and is not part of the data. Correct me if I'm wrong @gowitheflow-1998!
If I'm understanding correctly, the issue is comparing models that use instructions on retrieval datasets vs those that don't (so GritLM style vs Contriever style). Given that the instructions are model-specific for all the current Retrieval tasks (and I think it's a fair point that, depending on the instruction, it often leaks dataset info, so it can be an unfair comparison to models without instructions), I would probably vote for (1) as well -- splitting the models into categories: models using instructions, models not using instructions. Because FollowIR and InstructIR cannot be evaluated without instructions, they would only show up in leaderboards where "Models with Instructions" is selected. This would be a fairly large change -- all GritLM evaluations would be moved to "with instructions", since I assume every value on the leaderboard includes instructions. I think that's a reasonable move, but something to consider. I think such a leaderboard would also help show how important the instructions are to a model: if you see a 10-point drop in performance when switching off instructions, then you can see how robust it is.
Not super on topic, but I think it would be nice if we could save the instructions in the run, both for research purposes and because they have a pretty large impact on performance!
Thanks for the detailed discussions @Muennighoff @orionw! I think both suggestions are great! The only concern for differentiating on the […]. Otherwise I think it makes more sense in the long run to do […].
BGE is definitely the outlier on instructions; they are really vague. FWIW, my guess is that the "one instruction for all of retrieval" style is not going to be very popular -- I don't even think BGE-M3 has it, although I may have missed it. I think we could pick either way for the early BGE models and it wouldn't make too much of a difference.
A minor comment here (otherwise it seems like most aspects are discussed): if (1), then I would probably make sure to also update the `model_meta.json` file to include this information.
@Muennighoff re the recent PR into the […]: from the above, I think @gowitheflow-1998 and I understood it as: if models use instructions, they get the "instructions" naming convention. That's why I said above: […]

Hence the […].

I think this would work for the model side, but wouldn't work for the task side -- what you suggested as (1) at the beginning of this issue. We could do […]. So it seems we need to have one consistent tag that applies to all models. I'm fine using either […].

Thus, it has the instruct version attached to the name because it used instructions for RAR-b, and so we needed a model category for bge-base that used instructions. Please let me know if this doesn't make sense or if you have alternative suggestions!
Makes sense, thanks for the detailed explanation!

Why wouldn't it work for the task side? The way I understand it is that users will have to rename the folder, no? So if they run with the expected setting for the model (i.e. instructions for GritLM etc.; no instructions for all-MiniLM etc.), then they don't rename the folder. If they run with the non-default setting, they add […].

But I'm also fine with renaming the default folders to include the suffix if you prefer 👍
I think when we talk about instructions, we are talking about dataset-specific or example-specific instructions? As models like […]. The opposite applies to models like […]. For models like BGE, which by default has the same instruction for one big task, @orionw and I agree they are not aligned with what RAR-b and FollowIR are doing with instructions, so it makes more sense that their default is defined as […]. What do you think @Muennighoff? I actually agree newer models should have their original names (without emphasizing […]).
Bumping up a level of abstraction, I see three reasons why we'd want to distinguish instructions overall: […]

The TL;DR is that I think it's easier to add a consistent tag than one that uses an ambiguous "default" that goes in both directions -- whether that's a consistent "-instruct" for models using instructions, or we go very explicit and every model gets a "-instruct" or "non-instruct" tag to clear up ambiguity. For models not in the results folder, I think we'll have to make a mapping in the `model_meta` that is […].
Ah, perhaps this is where the confusion is @gowitheflow-1998 @Muennighoff. For the leaderboard name I think we can easily strip off the suffixes: they can just be there in the results folder, and we can use that when creating the leaderboard. I agree we don't want to confuse people by naming models that don't exist on HF! :)

Let's say we go with default and non-default. Then we have […]. My other concern with "default" as the naming convention is that it gets a bit weird when models can do both: which one is the default? There are a few models like this (I think some BGE ones, I would have to double-check). My guess is that in the future this will also be more common. Explicitly marking results as instruct or non-instruct usage gets around this.
Yeah, I think this is where the confusion is! Indeed, having both in the folder names and stripping the suffixes off in the leaderboard will work as well.