MedHelm: Implement medcalc bench scenario, metrics and specs #3207

Open · wants to merge 14 commits into med-helm from feat/medcalc_bench_scenario
Conversation

@sashimono-san commented Dec 11, 2024

This PR adds support for MedCalc-Bench benchmarking.

As much as possible, I kept the implementation close to the original. For example, data is downloaded directly from Hugging Face, and the evaluation metric is essentially a copy of the one in the original benchmark repo. In some cases, however, changes were necessary, as discussed in the following paragraphs.

The original benchmark repo expects the model to answer in JSON format. Since we do not necessarily want to evaluate models' ability to output JSON, the prompts were adapted so that the output is given in natural language. It would be great if someone could review the prompts and make sure they align with other HELM scenarios.
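Since the model now answers in natural language instead of JSON, the metric has to pull the final value out of free text before comparing it against the allowed range from the dataset. A minimal sketch of that idea (illustrative only; the function names and the extraction regex below are assumptions, not the code in this PR):

```python
import re
from typing import Optional


def extract_final_number(completion: str) -> Optional[float]:
    """Illustrative helper: take the last number mentioned in the model's answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(value: float, lower_limit: float, upper_limit: float) -> bool:
    """MedCalc-Bench-style check: an answer counts as correct if it falls
    inside the per-question tolerance range."""
    return lower_limit <= value <= upper_limit
```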

Originally, one-shot examples come from a pre-defined JSON file, and a specific example is used for each question type (defined by the Calculator ID field in the dataset). I couldn't figure out how to pass information at runtime from the Scenario to the RunSpec; specifically, I did not find a way to extract the Calculator ID from the Instance being processed and use it when building the inference prompt. Is this supported by the current implementation of HELM?

The original implementation has special truncation logic for the one-shot method: it truncates the patient note and the step-by-step explanation of the output depending on the model. This is currently not implemented, but even during local tests (with gpt2) I ran into issues with input length. Any suggestions on how to solve this?

@sashimono-san reopened this on Dec 17, 2024
@sashimono-san changed the title from "feat: implement medcalc bench scenario, metrics and specs" to "Implement medcalc bench scenario, metrics and specs" on Dec 17, 2024
@sashimono-san force-pushed the feat/medcalc_bench_scenario branch from 98d3f83 to 792fb4f on December 17, 2024 15:56
@sashimono-san changed the title from "Implement medcalc bench scenario, metrics and specs" to "MedHelm: Implement medcalc bench scenario, metrics and specs" on Dec 24, 2024
@yifanmai (Collaborator) commented Jan 8, 2025

Thanks for your pull request! This is taking me longer than usual to review due to the size, but I will get back to you this week.

"lower_limit": example["Lower Limit"],
"upper_limit": example["Upper Limit"],
"calculator_id": example["Calculator ID"],
"ground_truth": example["Ground Truth Answer"],
Collaborator:

Delete ground_truth and use the answer in the reference instead.
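A hedged sketch of that change (class names follow HELM's scenario API as I understand it; prompt_text and helm_split_name stand in for the variables already used in this scenario):

```python
from helm.benchmark.scenarios.scenario import CORRECT_TAG, Input, Instance, Output, Reference

instance = Instance(
    input=Input(text=prompt_text),  # the patient note plus question, as built elsewhere
    references=[
        # The ground-truth answer lives in a Reference rather than in extra_data.
        Reference(Output(text=example["Ground Truth Answer"]), tags=[CORRECT_TAG]),
    ],
    split=helm_split_name,
    extra_data={
        "lower_limit": example["Lower Limit"],
        "upper_limit": example["Upper Limit"],
        "calculator_id": example["Calculator ID"],
    },
)
```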

split=helm_split_name,
extra_data={
"id": example["Row Number"],
"relevant_entities": example["Relevant Entities"],
Collaborator:

Delete relevant_entities since it doesn't seem to be used.

],
split=helm_split_name,
extra_data={
"id": example["Row Number"],
Collaborator:

Delete id - instead just set id="id" + example["Row Number"] on the Instance itself.
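For example (a small sketch; casting Row Number to str is an assumption in case the dataset stores it as an integer):

```python
instance = Instance(
    input=Input(text=prompt_text),
    references=references,
    split=helm_split_name,
    id="id" + str(example["Row Number"]),  # row number carried on the Instance itself
)
```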

Comment on lines +46 to +47
# TODO: Add a base url
DATASET_DOWNLOAD_BASE_URL: str = ""
Collaborator:

Looks unused; remove?

"set their adjusted body weight to the minimum of the ideal body and actual weight. If the "
"patient is underweight, please set their adjusted body weight to their actual body weight."
),
calculator_id="2",
Collaborator:

There are a couple of ways to get this to work:

  1. Write a custom adapter that looks up the calculator ID and prepends the correct example.
  2. Move this logic into the Scenario, i.e. add a one_shot argument to the scenario, and when it is True, look up the corresponding example in the JSON file and prepend it to the text of the Input inside each Instance.

I think the second method is easier and more straightforward, so I'd recommend doing that.
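A rough sketch of the second option (assumptions throughout: the one-shot JSON is keyed by calculator ID, and the class name, the one_shot flag, the _load_* helpers, and helm_split_name are illustrative placeholders, not the code in this PR):

```python
from typing import Dict, List

from helm.benchmark.scenarios.scenario import CORRECT_TAG, Input, Instance, Output, Reference, Scenario


class MedCalcBenchScenario(Scenario):
    def __init__(self, one_shot: bool = False):
        super().__init__()
        self.one_shot = one_shot

    def get_instances(self, output_path: str) -> List[Instance]:
        dataset = self._load_dataset(output_path)  # the Hugging Face rows, loaded as before
        shots: Dict[str, str] = self._load_one_shot_examples(output_path) if self.one_shot else {}
        instances: List[Instance] = []
        for example in dataset:
            text = example["Patient Note"] + "\n\n" + example["Question"]
            if self.one_shot:
                # Look up the worked example for this calculator and prepend it to the input.
                text = shots[example["Calculator ID"]] + "\n\n" + text
            instances.append(
                Instance(
                    input=Input(text=text),
                    references=[Reference(Output(text=example["Ground Truth Answer"]), tags=[CORRECT_TAG])],
                    split=helm_split_name,  # split handling as in the existing scenario
                )
            )
        return instances
```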

)


def _get_zero_shot_cot_instructions() -> str:
Collaborator:

What's the rationale for using CoT here compared to the original scenario? Is it because we expect models to use CoT for calculations in realistic settings?

Author:

For us there's no practical difference, since we only implement the CoT prompt anyway. I kept the naming from the original benchmark repo so we can track future changes more easily.

In their paper, they compare the performance of three approaches: direct prompting, CoT, and code generation.

I didn't implement the code-generation setting because it seemed to add another focus to the evaluation (not only medical knowledge, but also programming ability). And I opted for CoT instead of direct prompting because it is fairly well established that CoT helps models make correct calculations, even though this deviates from the original goal of testing models rather than prompting techniques.

The original expects that, for each sample, we collect the calculator ID and the question to build the one-shot instructions.
"""
examples: Dict = {}
with open(ONE_SHOT_EXAMPLES_URL, "r") as f:
Collaborator:

If you move this inside the scenario as I suggested above, you should also replace this raw open() with ensure_file_downloaded() instead.
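A sketch of that swap (using HELM's ensure_file_downloaded from helm.common.general; the target filename is an arbitrary choice here):

```python
import json
import os

from helm.common.general import ensure_file_downloaded

# Download the one-shot examples into the scenario's output directory, then load them.
examples_path = os.path.join(output_path, "one_shot_examples.json")
ensure_file_downloaded(source_url=ONE_SHOT_EXAMPLES_URL, target_path=examples_path)
with open(examples_path, "r") as f:
    examples = json.load(f)
```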

@yifanmai (Collaborator) commented Jan 8, 2025

Regarding outputting JSON, I'm okay with changing this to not output JSON.

Regarding truncation, ideally I would prefer to avoid this issue altogether by using models with longer context lengths (GPT-2 only has 1k tokens). If you're only using recent models, almost all of them have context lengths of at least 8k, so truncation might not be needed. But if you do need truncation, then you would have to implement a custom adapter.
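If truncation does become necessary, one generic fallback (a sketch only, independent of HELM's adapter machinery, with caller-supplied tokenize/detokenize functions) is to clip the patient note to a token budget before the prompt is assembled:

```python
from typing import Callable, List


def truncate_to_token_budget(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], List[str]],
    detokenize: Callable[[List[str]], str],
) -> str:
    """Keep at most max_tokens tokens from the start of the text."""
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text
    return detokenize(tokens[:max_tokens])
```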

@sashimono-san changed the base branch from main to med-helm on January 14, 2025 08:52