MedHelm: Implement medcalc bench scenario, metrics and specs #3207
base: med-helm
Conversation
Thanks for your pull request! This is taking me longer than usual to review due to the size, but I will get back to you by this week.
"lower_limit": example["Lower Limit"], | ||
"upper_limit": example["Upper Limit"], | ||
"calculator_id": example["Calculator ID"], | ||
"ground_truth": example["Ground Truth Answer"], |
Delete `ground_truth` and use the answer in the reference instead.
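For concreteness, a minimal sketch of that change, assuming the standard HELM scenario dataclasses (`Reference`, `Output`, `CORRECT_TAG` from `helm.benchmark.scenarios.scenario`) and the same `example` row as in the diff:

```python
from helm.benchmark.scenarios.scenario import CORRECT_TAG, Output, Reference

# Carry the ground-truth answer as a CORRECT_TAG reference instead of an
# extra_data field, so metrics can read it from instance.references.
references = [
    Reference(Output(text=str(example["Ground Truth Answer"])), tags=[CORRECT_TAG]),
]
```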
split=helm_split_name,
extra_data={
"id": example["Row Number"],
"relevant_entities": example["Relevant Entities"],
Delete `relevant_entities` since it doesn't seem to be used.
],
split=helm_split_name,
extra_data={
"id": example["Row Number"],
Delete `id`; instead just set `id="id" + example["Row Number"]` on the `Instance` itself.
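Roughly, as a sketch only (`prompt_text`, `references`, and `helm_split_name` stand in for the values this scenario already builds):

```python
from helm.benchmark.scenarios.scenario import Input, Instance

# Drop extra_data entirely and put the row number on the Instance id itself.
instance = Instance(
    input=Input(text=prompt_text),
    references=references,
    split=helm_split_name,
    id="id" + str(example["Row Number"]),
)
```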
# TODO: Add a base url
DATASET_DOWNLOAD_BASE_URL: str = ""
Looks unused; remove?
"set their adjusted body weight to the minimum of the ideal body and actual weight. If the " | ||
"patient is underweight, please set their adjusted body weight to their actual body weight." | ||
), | ||
calculator_id="2", |
There are a couple of ways to get this to work:
- Write a custom adapter that looks up the calculator id and prepends the correct example.
- Move this logic into the `Scenario`, i.e. add a `one_shot` argument to the scenario, and when it is `True`, look up the corresponding example in the JSON file and prepend it to `text` inside `Input` inside `Instance`.

I think the second method is easier and more straightforward, so I'd recommend doing that.
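A rough sketch of the second option, assuming the one-shot JSON has already been loaded into a dict keyed by calculator id (column names follow the MedCalc-Bench dataset; `build_prompt_text` is an illustrative helper, not something in this PR):

```python
from typing import Any, Dict, Optional


def build_prompt_text(example: Dict[str, Any], shots: Optional[Dict[str, str]] = None) -> str:
    """Build the prompt text for one dataset row, optionally prepending a one-shot example."""
    text = f'{example["Patient Note"]}\n\n{example["Question"]}'
    if shots is not None:
        # One-shot mode: look up the worked example that matches this
        # question's calculator and prepend it to the prompt.
        text = f'{shots[example["Calculator ID"]]}\n\n{text}'
    return text
```

The scenario would then call this when building each `Instance`, e.g. `Input(text=build_prompt_text(example, shots if self.one_shot else None))`, so no information needs to flow from the `Scenario` to the `RunSpec` at runtime.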
)

def _get_zero_shot_cot_instructions() -> str:
What's the rationale for using CoT here compared to the original scenario? Is it because we expect models to use CoT for calculations in realistic settings?
For us there's no practical difference, as we only implement the CoT prompt anyway. I kept the naming from the original benchmark repo so we can easily track future changes.
From their paper, they compare performance of three approaches: direct prompting, CoT, and code generation.
I didn't implement the code-generation testing because it seemed to add another focus to the evaluation (not only medical knowledge, but also programming knowledge). And I opted for CoT instead of direct prompting because it is fairly well established that CoT helps models make correct calculations, even though it deviates from the original goal of testing models rather than prompting techniques.
The original expect that for each sample, we collect the calculator ID and the question for building the one-shot instructions.
"""
examples: Dict = {}
with open(ONE_SHOT_EXAMPLES_URL, "r") as f:
If you move this inside the scenario as I suggested above, you should also replace this raw `open()` with `ensure_file_downloaded()`.
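Something along these lines, assuming the scenario's `get_instances(output_path)` directory is available and the target file name is a placeholder:

```python
import json
import os

from helm.common.general import ensure_file_downloaded

# Fetch the one-shot JSON into the scenario's output directory once, then read
# it from local disk instead of opening the URL directly.
target_path = os.path.join(output_path, "one_shot_examples.json")
ensure_file_downloaded(source_url=ONE_SHOT_EXAMPLES_URL, target_path=target_path)
with open(target_path, "r") as f:
    examples = json.load(f)
```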
Regarding outputting JSON, I'm okay with changing this to not output JSON. Regarding truncation, ideally I would prefer to avoid this issue altogether by using models with longer context lengths (GPT-2 only has 1k tokens). If you're using recent models only, almost all recent models have context lengths of at least 8k, so truncation might not be needed. But if you need truncation, then you would need to implement a custom adapter.
This PR adds support for MedCalc-Bench benchmarking.
As much as possible, I kept the implementation as done originally. For example, data is downloaded directly from Hugging Face, and the evaluation metric is essentially a copy of the one in the original benchmark repo. In some cases, however, changes were necessary, as discussed in the following paragraphs.
The original benchmark repo expects the model to answer in JSON format. Since we do not necessarily want to evaluate models' ability to output JSON, the prompts were adapted so the answer is given in natural language. It would be great if someone could review the prompts and make sure they align with other HELM scenarios.
Originally, one-shot examples come from a pre-defined JSON file, and a specific example is used for each question type (defined by the `Calculator ID` field in the dataset). I couldn't figure out how to pass information at runtime from the `Scenario` to the `RunSpec`; namely, I did not find a way to extract the `Calculator ID` from the `Instance` being used and use it when building the inference prompt. Is this supported by the current implementation of HELM?

The original implementation has special truncation logic for the one-shot method: it truncates the patient note and the step-by-step explanation of the output depending on the model. This is currently not implemented, but already during local tests (with gpt2) I had issues with input length. Any suggestions on how to solve this?