MedHelm: Implement medcalc bench scenario, metrics and specs #3207

Open · wants to merge 14 commits into med-helm from feat/medcalc_bench_scenario
Conversation

@sashimono-san commented Dec 11, 2024

This PR adds support for MedCalc-Bench benchmarking.

As much as possible, I kept the implementation close to the original. For example, data is downloaded directly from Hugging Face, and the evaluation metric is essentially a copy of the one in the original benchmark repo. In some cases, however, changes were necessary, as discussed in the following paragraphs.

The original benchmark repo expects the model to answer in JSON format. Since we do not necessarily want to evaluate models' ability to output JSON, the prompts were adapted so that the output is given in natural language. It would be great if someone could review the prompts and make sure they align with other HELM scenarios.
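Since the model now answers in natural language instead of JSON, the metric has to pull the final value out of free text before comparing it against the allowed range from the dataset. A minimal sketch of that idea (illustrative only; the function names and the extraction regex below are assumptions, not the code in this PR):

```python
import re
from typing import Optional


def extract_final_number(completion: str) -> Optional[float]:
    """Illustrative helper: take the last number mentioned in the model's answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(value: float, lower_limit: float, upper_limit: float) -> bool:
    """MedCalc-Bench-style check: an answer counts as correct if it falls
    inside the per-question tolerance range."""
    return lower_limit <= value <= upper_limit
```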

Originally, one-shot examples come from a pre-defined JSON file, and a specific example is used for each question type (defined by the Calculator ID field in the dataset). I couldn't figure out how to pass information at runtime from the Scenario to the RunSpec; specifically, I did not find a way to extract the Calculator ID from the Instance being processed and use it when building the inference prompt. Is this supported by the current implementation of HELM?

The original implementation has special truncation logic for the one-shot method: it truncates the patient note and the step-by-step explanation of the output depending on the model. This is currently not implemented, but even during local tests (with gpt2) I ran into issues with input length. Any suggestions on how to solve this?

@sashimono-san reopened this on Dec 17, 2024
@sashimono-san changed the title from "feat: implement medcalc bench scenario, metrics and specs" to "Implement medcalc bench scenario, metrics and specs" on Dec 17, 2024
@sashimono-san force-pushed the feat/medcalc_bench_scenario branch from 98d3f83 to 792fb4f on December 17, 2024 15:56
@sashimono-san changed the title from "Implement medcalc bench scenario, metrics and specs" to "MedHelm: Implement medcalc bench scenario, metrics and specs" on Dec 24, 2024
@yifanmai (Collaborator) commented Jan 8, 2025

Thanks for your pull request! This is taking me longer than usual to review due to the size, but I will get back to you this week.

"lower_limit": example["Lower Limit"],
"upper_limit": example["Upper Limit"],
"calculator_id": example["Calculator ID"],
"ground_truth": example["Ground Truth Answer"],
Collaborator:

Delete ground_truth and use the answer in the reference instead.
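A hedged sketch of that change (class names follow HELM's scenario API as I understand it; prompt_text and helm_split_name stand in for the variables already used in this scenario):

```python
from helm.benchmark.scenarios.scenario import CORRECT_TAG, Input, Instance, Output, Reference

instance = Instance(
    input=Input(text=prompt_text),  # the patient note plus question, as built elsewhere
    references=[
        # The ground-truth answer lives in a Reference rather than in extra_data.
        Reference(Output(text=example["Ground Truth Answer"]), tags=[CORRECT_TAG]),
    ],
    split=helm_split_name,
    extra_data={
        "lower_limit": example["Lower Limit"],
        "upper_limit": example["Upper Limit"],
        "calculator_id": example["Calculator ID"],
    },
)
```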

split=helm_split_name,
extra_data={
"id": example["Row Number"],
"relevant_entities": example["Relevant Entities"],
Collaborator:

Delete relevant_entities since it doesn't seem to be used.

],
split=helm_split_name,
extra_data={
"id": example["Row Number"],
Collaborator:

Delete id - instead just set id="id" + example["Row Number"] on the Instance itself.
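For example (a small sketch; casting Row Number to str is an assumption in case the dataset stores it as an integer):

```python
instance = Instance(
    input=Input(text=prompt_text),
    references=references,
    split=helm_split_name,
    id="id" + str(example["Row Number"]),  # row number carried on the Instance itself
)
```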

Comment on lines +46 to +47
# TODO: Add a base url
DATASET_DOWNLOAD_BASE_URL: str = ""
Collaborator:

Looks unused; remove?

"set their adjusted body weight to the minimum of the ideal body and actual weight. If the "
"patient is underweight, please set their adjusted body weight to their actual body weight."
),
calculator_id="2",
Collaborator:

There are a couple of ways to get this to work:

  1. Write a custom adapter that looks up the calculator ID and prepends the correct example.
  2. Move this logic into the Scenario, i.e. add a one_shot argument to the scenario, and when it is True, look up the corresponding example in the JSON file and prepend it to the text of the Input inside each Instance.

I think the second method is easier and more straightforward, so I'd recommend doing that.
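A rough sketch of the second option (assumptions throughout: the one-shot JSON is keyed by calculator ID, and the class name, the one_shot flag, the _load_* helpers, and helm_split_name are illustrative placeholders, not the code in this PR):

```python
from typing import Dict, List

from helm.benchmark.scenarios.scenario import CORRECT_TAG, Input, Instance, Output, Reference, Scenario


class MedCalcBenchScenario(Scenario):
    def __init__(self, one_shot: bool = False):
        super().__init__()
        self.one_shot = one_shot

    def get_instances(self, output_path: str) -> List[Instance]:
        dataset = self._load_dataset(output_path)  # the Hugging Face rows, loaded as before
        shots: Dict[str, str] = self._load_one_shot_examples(output_path) if self.one_shot else {}
        instances: List[Instance] = []
        for example in dataset:
            text = example["Patient Note"] + "\n\n" + example["Question"]
            if self.one_shot:
                # Look up the worked example for this calculator and prepend it to the input.
                text = shots[example["Calculator ID"]] + "\n\n" + text
            instances.append(
                Instance(
                    input=Input(text=text),
                    references=[Reference(Output(text=example["Ground Truth Answer"]), tags=[CORRECT_TAG])],
                    split=helm_split_name,  # split handling as in the existing scenario
                )
            )
        return instances
```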

)


def _get_zero_shot_cot_instructions() -> str:
Collaborator:

What's the rationale for using CoT here compared to the original scenario? Is it because we expect models to use CoT for calculations in realistic settings?

Author:

For us there's no practical difference, since we only implement the CoT prompt anyway. I kept the naming from the original benchmark repo so we can track future changes more easily.

In their paper, they compare the performance of three approaches: direct prompting, CoT, and code generation.

I didn't implement the code-generation setting because it seemed to add another focus to the evaluation (not only medical knowledge, but also programming ability). And I opted for CoT instead of direct prompting because it is fairly well established that CoT helps models make correct calculations, even though this deviates from the original goal of testing models rather than prompting techniques.

The original expects that, for each sample, we collect the calculator ID and the question to build the one-shot instructions.
"""
examples: Dict = {}
with open(ONE_SHOT_EXAMPLES_URL, "r") as f:
Collaborator:

If you move this inside the scenario as I suggested above, you should also replace this raw open() with ensure_file_downloaded() instead.
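A sketch of that swap (using HELM's ensure_file_downloaded from helm.common.general; the target filename is an arbitrary choice here):

```python
import json
import os

from helm.common.general import ensure_file_downloaded

# Download the one-shot examples into the scenario's output directory, then load them.
examples_path = os.path.join(output_path, "one_shot_examples.json")
ensure_file_downloaded(source_url=ONE_SHOT_EXAMPLES_URL, target_path=examples_path)
with open(examples_path, "r") as f:
    examples = json.load(f)
```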

@yifanmai (Collaborator) commented Jan 8, 2025

Regarding outputting JSON, I'm okay with changing this to not output JSON.

Regarding truncation, ideally I would prefer to avoid this issue altogether by using models with longer context lengths (GPT-2 only has 1k tokens). If you're only using recent models, almost all of them have context lengths of at least 8k, so truncation might not be needed. But if you do need truncation, then you would have to implement a custom adapter.
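If truncation does become necessary, one generic fallback (a sketch only, independent of HELM's adapter machinery, with caller-supplied tokenize/detokenize functions) is to clip the patient note to a token budget before the prompt is assembled:

```python
from typing import Callable, List


def truncate_to_token_budget(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], List[str]],
    detokenize: Callable[[List[str]], str],
) -> str:
    """Keep at most max_tokens tokens from the start of the text."""
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text
    return detokenize(tokens[:max_tokens])
```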

@sashimono-san changed the base branch from main to med-helm on January 14, 2025 08:52