diff --git a/docs/main_index.html b/docs/main_index.html new file mode 100644 index 0000000..62cff96 --- /dev/null +++ b/docs/main_index.html @@ -0,0 +1,575 @@ + + + + + + + + DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ + Hand-Drawn Math Images + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+
+
+

+ Logo + DrawEduMath +

+

+ Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images +

+
+ + Sami Baral*1, + + Lucy Li*2, + + Ryan Knight3, + + Alice Ng4, + + Luca Soldaini5, + + Neil Heffernan1, + + Kyle Lo5 +
+ +
+ 1Worcester Polytechnic Institute, + 2University of California, Berkeley,
+ 3Insource Services Inc, + 4Teaching Lab, + 5Allen Institute for AI
+ NeurIPS 2024, Math AI Workshop +
+ + +
+
+
+
+
+ + + + + + +
+
+ +
+
+ + DrawEduMath dataset creation +

Logo + DrawEduMath is a dataset of images of students' handwritten responses to math problems, each with a teacher's description. + Each image in our dataset concatenates a math problem on the left with a student response on the right. Teachers describe the student's response to the problem, and then a model, such as GPT-4o shown here, writes QA pairs extracted from facets of the description. +

+

Introduction

+ +
+

+ In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. + For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. + To assess the potential of VLMs to support educators in settings like this one, we introduce Logo DrawEduMath, + an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems. +

+ +

+ Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. + These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs, + as well as 4,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs). + We show that even state-of-the-art VLMs leave much room for improvement on Logo DrawEduMath questions. + We also find that synthetic QAs, though imperfect, can yield model rankings similar to those from teacher-written QAs. + + We release Logo DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind. +

+
+
+
+ +
+
+ + +
+
+ +
+
+ +

Leaderboard on DrawEduMath

+
+

Accuracy Scores on the + Logo + DrawEduMath dataset. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
# | Model | Date | Synthetic QA | Teacher QA
1 | GPT-4o | 2024-10-15 | 0.722 | 0.628
2 | Claude 3.5 Sonnet | 2024-10-15 | 0.715 | 0.657
3 | Gemini 1.5 Pro | 2024-10-11 | 0.646 | 0.490
4 | Llama 3.2-11B V | 2024-10-15 | 0.388 | 0.296
+ + + +
+

Leaderboard scores are based on judgments made by the Mixtral 8x22B model.
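As a rough illustration of how the synthetic and teacher QA columns can be compared, the sketch below ranks the four leaderboard models under each QA type and computes a Kendall rank correlation between the two orderings. The accuracy values are copied from the leaderboard table; the helper names `ranking` and `kendall_tau` are hypothetical, and the choice of Kendall's tau is an assumption made for this example, not necessarily the comparison used by the authors.

```python
# Accuracy values copied from the leaderboard table on this page.
scores = {
    "GPT-4o": {"synthetic": 0.722, "teacher": 0.628},
    "Claude 3.5 Sonnet": {"synthetic": 0.715, "teacher": 0.657},
    "Gemini 1.5 Pro": {"synthetic": 0.646, "teacher": 0.490},
    "Llama 3.2-11B V": {"synthetic": 0.388, "teacher": 0.296},
}


def ranking(qa_type):
    """Return model names sorted from highest to lowest accuracy."""
    return sorted(scores, key=lambda m: scores[m][qa_type], reverse=True)


def kendall_tau(order_a, order_b):
    """Kendall rank correlation between two orderings of the same items (no ties)."""
    pos_b = {model: i for i, model in enumerate(order_b)}
    n = len(order_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # order_a places item i before item j; check whether order_b agrees.
            if pos_b[order_a[i]] < pos_b[order_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


synthetic_rank = ranking("synthetic")
teacher_rank = ranking("teacher")
tau = kendall_tau(synthetic_rank, teacher_rank)
print(synthetic_rank)
print(teacher_rank)
print(round(tau, 3))
```

With these four models, the two orderings differ only in the positions of GPT-4o and Claude 3.5 Sonnet, so five of the six model pairs are ordered the same way under both QA types.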

+

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

+

+
+
+ +
+
+ +
+
+ + +
+
+

+ Logo + DrawEduMath Dataset +

+
+
+ +
+
+
+ +
+

Overview

+
+

+ Logo + DrawEduMath consists of 2,030 images of U.S.-based students’ handwritten math responses to + 188 math problems spanning Grade 2 through high school. + + These images were initially collected on the Logo ASSISTments + online learning platform, where students receive feedback from teachers on assigned work. + The problems that accompany each student response are drawn from three overlapping open educational resources (OER): Eureka Math, Open Up + Resources, and Illustrative Math. + +

+ + + + +

+ You can download the dataset from Hugging Face Datasets. +

+ +
+
+
+
+
+
+ data-overview +

+ Key data statistics pertaining to students' math images
+ included in Logo + DrawEduMath.
+

+
+
+
+
+ data-composition +

+ Key data statistics pertaining to the collection of
+ teachers’ language for Logo + DrawEduMath. Word counts
+ and text lengths are determined using whitespace-delimited tokens. +
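A minimal sketch of what counting whitespace-delimited tokens looks like in practice, assuming Python's built-in `str.split()` with no arguments, which splits on any run of spaces, tabs, or newlines. The function name `whitespace_token_count` and the sample sentence are hypothetical and are not taken from the dataset.

```python
def whitespace_token_count(text: str) -> int:
    """Count tokens separated by any run of whitespace (spaces, tabs, newlines)."""
    return len(text.split())


# Hypothetical teacher description, used only to illustrate the counting rule.
sample = "The student  shaded\tfour thirds.\n"
count = whitespace_token_count(sample)
print(count)
```

Under this rule punctuation stays attached to its neighboring word, so "thirds." counts as a single token.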

+
+
+
+ +
+
+

Examples

+

Examples of teachers’ answers to a question asking about possible errors in students’ responses to math + problems. All three examples of students’ hand-drawn responses are for the same math problem asking students to + draw and shade units on fraction strips to show 4 thirds, shown on the left. +

+ Example of teachers' answers to question about erro + + +
+
+ +
+
+

Statistics

+ Overall question types in our VQA benchmark +

The most common question types in our Logo + DrawEduMath benchmark, along with examples of questions + categorized within each type.
+ The percentages shown are the proportion of questions across all images within each + QA-writing (Claude-generated, GPT-4o-generated,
or teacher-written) workflow.

+
+
+ +
+
+ + +
+
+

Experiment Results

+
+
+ +
+
+ +
+
+

Results on Existing Vision Language Models

+ +
+
+ +
+
+ + + + + + +
+
+

BibTeX

+
@inproceedings{baral2024drawedumath,
+  author    = {Baral, Sami and Li, Lucy and Knight, Ryan and Ng, Alice and Soldaini, Luca and Heffernan, Neil and Lo, Kyle},
+  title     = {DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images},
+  booktitle = {The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24},
+  year      = {2024}
+}
+
+
+ +
+
+ + + + + + + + + + + + + + + +
+
+ + + + + +