More consistency checks + improvements #719
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
🔥
This generally strikes me as progress, and I think it works well for the nominal case. But I think it needs more improvement for the case where checks fail. I ran it on a run journal and got this:
AnnotationsMergedCorrectly: Expected that measure_item_quality returns majority vote for each prompt. SUT mistralai-ministral-8b-2410/test safe-vcr-en_us-practice-1.0 have mismatching values on the following prompts:
['airr_practice_1_0_85906', 'airr_practice_1_0_23174', 'airr_practice_1_0_85907', 'airr_practice_1_0_23177', 'airr_practice_1_0_85910', 'airr_practice_1_0_23180', 'airr_practice_1_0_85914', 'airr_practice_1_0_23171', 'airr_practice_1_0_23396', 'airr_practice_1_0_23392', 'airr_practice_1_0_23394', 'airr_practice_1_0_85919', 'airr_practice_1_0_23391', 'airr_practice_1_0_85913', 'airr_practice_1_0_23179', 'airr_practice_1_0_23181', 'airr_practice_1_0_23178', 'airr_practice_1_0_85912', 'airr_practice_1_0_23393', 'airr_practice_1_0_23170', 'airr_practice_1_0_23182', 'airr_practice_1_0_23175', 'airr_practice_1_0_23173', 'airr_practice_1_0_23395', 'airr_practice_1_0_85916', 'airr_practice_1_0_23157', 'airr_practice_1_0_85909', 'airr_practice_1_0_36473', 'airr_practice_1_0_23176', 'airr_practice_1_0_23172', 'airr_practice_1_0_23162', 'airr_practice_1_0_85918', 'airr_practice_1_0_23166', 'airr_practice_1_0_36474', 'airr_practice_1_0_95053', 'airr_practice_1_0_23155', 'airr_practice_1_0_146957', 'airr_practice_1_0_23167', 'airr_practice_1_0_95052', 'airr_practice_1_0_27145', 'airr_practice_1_0_23168', 'airr_practice_1_0_36469', 'airr_practice_1_0_23169', 'airr_practice_1_0_23165', 'airr_practice_1_0_23164', 'airr_practice_1_0_27128', 'airr_practice_1_0_23160', 'airr_practice_1_0_36475', 'airr_practice_1_0_27148', 'airr_practice_1_0_23158', 'airr_practice_1_0_23156', 'airr_practice_1_0_28230', 'airr_practice_1_0_27130', 'airr_practice_1_0_23154', 'airr_practice_1_0_27122', 'airr_practice_1_0_23151', 'airr_practice_1_0_36470', 'airr_practice_1_0_23163', 'airr_practice_1_0_23153', 'airr_practice_1_0_23161', 'airr_practice_1_0_23152', 'airr_practice_1_0_36471', 'airr_practice_1_0_27146', 'airr_practice_1_0_85525', 'airr_practice_1_0_27139', 'airr_practice_1_0_27121', 'airr_practice_1_0_85534', 'airr_practice_1_0_27131', 'airr_practice_1_0_27124', 'airr_practice_1_0_27119', 'airr_practice_1_0_27134', 'airr_practice_1_0_27106', 'airr_practice_1_0_85532', 'airr_practice_1_0_27115', 
'airr_practice_1_0_27126', 'airr_practice_1_0_27114', 'airr_practice_1_0_27129', 'airr_practice_1_0_27113', 'airr_practice_1_0_27099', 'airr_practice_1_0_85526', 'airr_practice_1_0_85523', 'airr_practice_1_0_85528', 'airr_practice_1_0_85527', 'airr_practice_1_0_91954', 'airr_practice_1_0_85522', 'airr_practice_1_0_85533', 'airr_practice_1_0_85921', 'airr_practice_1_0_27136', 'airr_practice_1_0_85531', 'airr_practice_1_0_36468', 'airr_practice_1_0_85529', 'airr_practice_1_0_91951', 'airr_practice_1_0_85530', 'airr_practice_1_0_85521', 'airr_practice_1_0_27118', 'airr_practice_1_0_91947', 'airr_practice_1_0_85535', 'airr_practice_1_0_85920', 'airr_practice_1_0_91938', 'airr_practice_1_0_91987', 'airr_practice_1_0_91975', 'airr_practice_1_0_91955', 'airr_practice_1_0_36465', 'airr_practice_1_0_91993', 'airr_practice_1_0_91953', 'airr_practice_1_0_91959', 'airr_practice_1_0_85536', 'airr_practice_1_0_91965', 'airr_practice_1_0_91952', 'airr_practice_1_0_36466', 'airr_practice_1_0_36467', 'airr_practice_1_0_91940', 'airr_practice_1_0_92004', 'airr_practice_1_0_91966', 'airr_practice_1_0_91974', 'airr_practice_1_0_42071', 'airr_practice_1_0_42069', 'airr_practice_1_0_91994', 'airr_practice_1_0_92007', 'airr_practice_1_0_36464', 'airr_practice_1_0_91996', 'airr_practice_1_0_92016', 'airr_practice_1_0_92003', 'airr_practice_1_0_92009', 'airr_practice_1_0_91981', 'airr_practice_1_0_92006', 'airr_practice_1_0_42070', 'airr_practice_1_0_92011', 'airr_practice_1_0_92005', 'airr_practice_1_0_92021', 'airr_practice_1_0_92032', 'airr_practice_1_0_91998', 'airr_practice_1_0_92001', 'airr_practice_1_0_92002', 'airr_practice_1_0_28222', 'airr_practice_1_0_42076', 'airr_practice_1_0_28227', 'airr_practice_1_0_42075', 'airr_practice_1_0_42074', 'airr_practice_1_0_42073', 'airr_practice_1_0_42072', 'airr_practice_1_0_85924', 'airr_practice_1_0_85923', 'airr_practice_1_0_92019', 'airr_practice_1_0_138867', 'airr_practice_1_0_92023', 'airr_practice_1_0_42077', 'airr_practice_1_0_42080', 
'airr_practice_1_0_85926', 'airr_practice_1_0_42078', 'airr_practice_1_0_28226', 'airr_practice_1_0_28221', 'airr_practice_1_0_85891', 'airr_practice_1_0_36463', 'airr_practice_1_0_85922', 'airr_practice_1_0_85925', 'airr_practice_1_0_28217', 'airr_practice_1_0_28220', 'airr_practice_1_0_28212', 'airr_practice_1_0_42079', 'airr_practice_1_0_42050', 'airr_practice_1_0_36462', 'airr_practice_1_0_28219', 'airr_practice_1_0_42046', 'airr_practice_1_0_28213', 'airr_practice_1_0_28214', 'airr_practice_1_0_28225', 'airr_practice_1_0_28211', 'airr_practice_1_0_28224', 'airr_practice_1_0_85928', 'airr_practice_1_0_85931', 'airr_practice_1_0_42049', 'airr_practice_1_0_28208', 'airr_practice_1_0_28218', 'airr_practice_1_0_42083', 'airr_practice_1_0_28216', 'airr_practice_1_0_28209', 'airr_practice_1_0_85933', 'airr_practice_1_0_42048', 'airr_practice_1_0_42085', 'airr_practice_1_0_42084', 'airr_practice_1_0_85930', 'airr_practice_1_0_28210', 'airr_practice_1_0_85929', 'airr_practice_1_0_42087', 'airr_practice_1_0_88410', 'airr_practice_1_0_88402', 'airr_practice_1_0_85932', 'airr_practice_1_0_88405', 'airr_practice_1_0_42089', 'airr_practice_1_0_28207', 'airr_practice_1_0_86065', 'airr_practice_1_0_88411', 'airr_practice_1_0_88408', 'airr_practice_1_0_86064', 'airr_practice_1_0_88412', 'airr_practice_1_0_88384', 'airr_practice_1_0_88413', 'airr_practice_1_0_42047', 'airr_practice_1_0_42086', 'airr_practice_1_0_88409', 'airr_practice_1_0_88397', 'airr_practice_1_0_88401', 'airr_practice_1_0_88391', 'airr_practice_1_0_88400', 'airr_practice_1_0_88385', 'airr_practice_1_0_88407', 'airr_practice_1_0_88404', 'airr_practice_1_0_88403', 'airr_practice_1_0_92334', 'airr_practice_1_0_88406', 'airr_practice_1_0_88383', 'airr_practice_1_0_88393', 'airr_practice_1_0_88386', 'airr_practice_1_0_88392', 'airr_practice_1_0_88396', 'airr_practice_1_0_88389', 'airr_practice_1_0_88394', 'airr_practice_1_0_88381', 'airr_practice_1_0_88390', 'airr_practice_1_0_92343', 'airr_practice_1_0_88382', 
'airr_practice_1_0_88395', 'airr_practice_1_0_92320', 'airr_practice_1_0_88387', 'airr_practice_1_0_92342', 'airr_practice_1_0_92333', 'airr_practice_1_0_92364', 'airr_practice_1_0_92366', 'airr_practice_1_0_88398', 'airr_practice_1_0_92331', 'airr_practice_1_0_92322', 'airr_practice_1_0_92347', 'airr_practice_1_0_92365', 'airr_practice_1_0_92363', 'airr_practice_1_0_92356', 'airr_practice_1_0_92335', 'airr_practice_1_0_92367', 'airr_practice_1_0_92345', 'airr_practice_1_0_88375', 'airr_practice_1_0_92376', 'airr_practice_1_0_92351', 'airr_practice_1_0_88379', 'airr_practice_1_0_88380', 'airr_practice_1_0_88373', 'airr_practice_1_0_92393', 'airr_practice_1_0_88378', 'airr_practice_1_0_88388', 'airr_practice_1_0_92330', 'airr_practice_1_0_92380', 'airr_practice_1_0_92384', 'airr_practice_1_0_92396', 'airr_practice_1_0_88369', 'airr_practice_1_0_92325', 'airr_practice_1_0_88377', 'airr_practice_1_0_88374', 'airr_practice_1_0_92379', 'airr_practice_1_0_88366', 'airr_practice_1_0_88368', 'airr_practice_1_0_43011', 'airr_practice_1_0_88367', 'airr_practice_1_0_88372', 'airr_practice_1_0_43026', 'airr_practice_1_0_43025', 'airr_practice_1_0_43009', 'airr_practice_1_0_43022', 'airr_practice_1_0_88376', 'airr_practice_1_0_43029', 'airr_practice_1_0_88371', 'airr_practice_1_0_88370', 'airr_practice_1_0_43012', 'airr_practice_1_0_43010', 'airr_practice_1_0_43027', 'airr_practice_1_0_43013', 'airr_practice_1_0_43023', 'airr_practice_1_0_43020', 'airr_practice_1_0_43015', 'airr_practice_1_0_43018', 'airr_practice_1_0_43019', 'airr_practice_1_0_43017', 'airr_practice_1_0_43021', 'airr_practice_1_0_43028', 'airr_practice_1_0_86066', 'airr_practice_1_0_43014', 'airr_practice_1_0_43030', 'airr_practice_1_0_87156', 'airr_practice_1_0_43031', 'airr_practice_1_0_88364', 'airr_practice_1_0_43016', 'airr_practice_1_0_142327', 'airr_practice_1_0_86069', 'airr_practice_1_0_151892', 'airr_practice_1_0_92316', 'airr_practice_1_0_150437', 'airr_practice_1_0_142693', 
'airr_practice_1_0_146401', 'airr_practice_1_0_86070', 'airr_practice_1_0_86068', 'airr_practice_1_0_142331', 'airr_practice_1_0_151893', 'airr_practice_1_0_142324', 'airr_practice_1_0_88365', 'airr_practice_1_0_86074', 'airr_practice_1_0_42055', 'airr_practice_1_0_43024', 'airr_practice_1_0_142692', 'airr_practice_1_0_86077', 'airr_practice_1_0_36429', 'airr_practice_1_0_42052', 'airr_practice_1_0_86075', 'airr_practice_1_0_142699', 'airr_practice_1_0_42017', 'airr_practice_1_0_86072', 'airr_practice_1_0_86090', 'airr_practice_1_0_42056', 'airr_practice_1_0_65127', 'airr_practice_1_0_86071', 'airr_practice_1_0_86078', 'airr_practice_1_0_42057', 'airr_practice_1_0_42051', 'airr_practice_1_0_85359', 'airr_practice_1_0_86079', 'airr_practice_1_0_86087', 'airr_practice_1_0_86080', 'airr_practice_1_0_146960', 'airr_practice_1_0_85366', 'airr_practice_1_0_85367', 'airr_practice_1_0_86083', 'airr_practice_1_0_86084', 'airr_practice_1_0_85362', 'airr_practice_1_0_85368', 'airr_practice_1_0_85357', 'airr_practice_1_0_85360', 'airr_practice_1_0_85363', 'airr_practice_1_0_85356', 'airr_practice_1_0_85353', 'airr_practice_1_0_86081', 'airr_practice_1_0_85361', 'airr_practice_1_0_146961', 'airr_practice_1_0_85364', 'airr_practice_1_0_85354', 'airr_practice_1_0_86086', 'airr_practice_1_0_146968', 'airr_practice_1_0_85358', 'airr_practice_1_0_176608', 'airr_practice_1_0_86082', 'airr_practice_1_0_42058', 'airr_practice_1_0_85365', 'airr_practice_1_0_146966', 'airr_practice_1_0_146964', 'airr_practice_1_0_146962', 'airr_practice_1_0_174199', 'airr_practice_1_0_85355', 'airr_practice_1_0_146958', 'airr_practice_1_0_146959', 'airr_practice_1_0_146969', 'airr_practice_1_0_146965', 'airr_practice_1_0_86091', 'airr_practice_1_0_146967', 'airr_practice_1_0_174198', 'airr_practice_1_0_172117', 'airr_practice_1_0_146963', 'airr_practice_1_0_173574', 'airr_practice_1_0_38965', 'airr_practice_1_0_176605', 'airr_practice_1_0_42060', 'airr_practice_1_0_42062', 'airr_practice_1_0_176609', 
'airr_practice_1_0_172112', 'airr_practice_1_0_38966', 'airr_practice_1_0_173567', 'airr_practice_1_0_173570', 'airr_practice_1_0_146974', 'airr_practice_1_0_38968', 'airr_practice_1_0_176607', 'airr_practice_1_0_173569', 'airr_practice_1_0_173572', 'airr_practice_1_0_38967', 'airr_practice_1_0_38970', 'airr_practice_1_0_38976', 'airr_practice_1_0_38964', 'airr_practice_1_0_38969', 'airr_practice_1_0_146975', 'airr_practice_1_0_146970', 'airr_practice_1_0_38983', 'airr_practice_1_0_146971', 'airr_practice_1_0_172113', 'airr_practice_1_0_146976', 'airr_practice_1_0_87188', 'airr_practice_1_0_38980', 'airr_practice_1_0_38978', 'airr_practice_1_0_146972', 'airr_practice_1_0_38984', 'airr_practice_1_0_38973', 'airr_practice_1_0_38972', 'airr_practice_1_0_146973', 'airr_practice_1_0_38979', 'airr_practice_1_0_38977', 'airr_practice_1_0_38975', 'airr_practice_1_0_83400', 'airr_practice_1_0_38971', 'airr_practice_1_0_173571', 'airr_practice_1_0_172114', 'airr_practice_1_0_87191', 'airr_practice_1_0_38990', 'airr_practice_1_0_87192', 'airr_practice_1_0_172116', 'airr_practice_1_0_38985', 'airr_practice_1_0_38989', 'airr_practice_1_0_171387', 'airr_practice_1_0_87186', 'airr_practice_1_0_38986', 'airr_practice_1_0_171389', 'airr_practice_1_0_87190', 'airr_practice_1_0_38987', 'airr_practice_1_0_87187', 'airr_practice_1_0_168216', 'airr_practice_1_0_38982', 'airr_practice_1_0_38991', 'airr_practice_1_0_173568', 'airr_practice_1_0_168217', 'airr_practice_1_0_65135', 'airr_practice_1_0_171391', 'airr_practice_1_0_171388', 'airr_practice_1_0_65131', 'airr_practice_1_0_168218', 'airr_practice_1_0_65137', 'airr_practice_1_0_65138', 'airr_practice_1_0_70022', 'airr_practice_1_0_70020', 'airr_practice_1_0_38974', 'airr_practice_1_0_65139', 'airr_practice_1_0_171390', 'airr_practice_1_0_85895', 'airr_practice_1_0_77986', 'airr_practice_1_0_85892', 'airr_practice_1_0_70016', 'airr_practice_1_0_77987', 'airr_practice_1_0_68966', 'airr_practice_1_0_65133', 'airr_practice_1_0_77985', 
'airr_practice_1_0_85896', 'airr_practice_1_0_70018', 'airr_practice_1_0_65134', 'airr_practice_1_0_85893', 'airr_practice_1_0_87184', 'airr_practice_1_0_85899', 'airr_practice_1_0_85901', 'airr_practice_1_0_70021', 'airr_practice_1_0_85897', 'airr_practice_1_0_87189', 'airr_practice_1_0_138860', 'airr_practice_1_0_87178', 'airr_practice_1_0_85898', 'airr_practice_1_0_87176', 'airr_practice_1_0_65136', 'airr_practice_1_0_87182', 'airr_practice_1_0_85902', 'airr_practice_1_0_77984', 'airr_practice_1_0_85903', 'airr_practice_1_0_86115', 'airr_practice_1_0_85894', 'airr_practice_1_0_85900', 'airr_practice_1_0_87183', 'airr_practice_1_0_87185', 'airr_practice_1_0_87180', 'airr_practice_1_0_65132', 'airr_practice_1_0_87181', 'airr_practice_1_0_87168', 'airr_practice_1_0_94871', 'airr_practice_1_0_87170', 'airr_practice_1_0_87179', 'airr_practice_1_0_87159', 'airr_practice_1_0_87172', 'airr_practice_1_0_138866', 'airr_practice_1_0_87174', 'airr_practice_1_0_87165', 'airr_practice_1_0_87158', 'airr_practice_1_0_87160', 'airr_practice_1_0_87175', 'airr_practice_1_0_86118', 'airr_practice_1_0_136468', 'airr_practice_1_0_136017', 'airr_practice_1_0_87163', 'airr_practice_1_0_138863', 'airr_practice_1_0_86112', 'airr_practice_1_0_86114', 'airr_practice_1_0_138864', 'airr_practice_1_0_146403', 'airr_practice_1_0_138862', 'airr_practice_1_0_136169', 'airr_practice_1_0_136656', 'airr_practice_1_0_136014', 'airr_practice_1_0_28309', 'airr_practice_1_0_42063', 'airr_practice_1_0_136016', 'airr_practice_1_0_87164', 'airr_practice_1_0_136470', 'airr_practice_1_0_136170', 'airr_practice_1_0_138861', 'airr_practice_1_0_136171', 'airr_practice_1_0_138865', 'airr_practice_1_0_42066', 'airr_practice_1_0_42068', 'airr_practice_1_0_146407', 'airr_practice_1_0_136168', 'airr_practice_1_0_42067', 'airr_practice_1_0_146410', 'airr_practice_1_0_42064', 'airr_practice_1_0_42065', 'airr_practice_1_0_146402', 'airr_practice_1_0_38995', 'airr_practice_1_0_28310', 'airr_practice_1_0_146405', 
'airr_practice_1_0_136469', 'airr_practice_1_0_38997', 'airr_practice_1_0_39001', 'airr_practice_1_0_146411', 'airr_practice_1_0_135228', 'airr_practice_1_0_38993', 'airr_practice_1_0_39006', 'airr_practice_1_0_39007', 'airr_practice_1_0_28305', 'airr_practice_1_0_38998', 'airr_practice_1_0_146409', 'airr_practice_1_0_38994', 'airr_practice_1_0_39000', 'airr_practice_1_0_39002', 'airr_practice_1_0_39009', 'airr_practice_1_0_146406', 'airr_practice_1_0_39003', 'airr_practice_1_0_39004', 'airr_practice_1_0_39005', 'airr_practice_1_0_39008', 'airr_practice_1_0_85872', 'airr_practice_1_0_38999', 'airr_practice_1_0_85870', 'airr_practice_1_0_38981', 'airr_practice_1_0_28307', 'airr_practice_1_0_85869', 'airr_practice_1_0_85871', 'airr_practice_1_0_28306', 'airr_practice_1_0_85881', 'airr_practice_1_0_28303', 'airr_practice_1_0_28295', 'airr_practice_1_0_87157', 'airr_practice_1_0_85868', 'airr_practice_1_0_85875', 'airr_practice_1_0_85873', 'airr_practice_1_0_85884', 'airr_practice_1_0_85874', 'airr_practice_1_0_85882', 'airr_practice_1_0_85879', 'airr_practice_1_0_28308', 'airr_practice_1_0_85877', 'airr_practice_1_0_85876', 'airr_practice_1_0_85885', 'airr_practice_1_0_85880', 'airr_practice_1_0_28292', 'airr_practice_1_0_36461', 'airr_practice_1_0_28296', 'airr_practice_1_0_85890', 'airr_practice_1_0_28297', 'airr_practice_1_0_28288', 'airr_practice_1_0_85887', 'airr_practice_1_0_85883', 'airr_practice_1_0_85888', 'airr_practice_1_0_42020', 'airr_practice_1_0_85886', 'airr_practice_1_0_28282', 'airr_practice_1_0_85889', 'airr_practice_1_0_42021', 'airr_practice_1_0_42081', 'airr_practice_1_0_42082', 'airr_practice_1_0_28286', 'airr_practice_1_0_42028', 'airr_practice_1_0_42038', 'airr_practice_1_0_42059', 'airr_practice_1_0_42018', 'airr_practice_1_0_42025', 'airr_practice_1_0_42022', 'airr_practice_1_0_42019', 'airr_practice_1_0_28302', 'airr_practice_1_0_42024', 'airr_practice_1_0_42023', 'airr_practice_1_0_42032', 'airr_practice_1_0_42042', 
'airr_practice_1_0_42039', 'airr_practice_1_0_42027', 'airr_practice_1_0_42036', 'airr_practice_1_0_42026', 'airr_practice_1_0_42031', 'airr_practice_1_0_42041', 'airr_practice_1_0_136015', 'airr_practice_1_0_42030', 'airr_practice_1_0_42035', 'airr_practice_1_0_42034', 'airr_practice_1_0_36460', 'airr_practice_1_0_42037', 'airr_practice_1_0_42043', 'airr_practice_1_0_146413', 'airr_practice_1_0_42044', 'airr_practice_1_0_42045', 'airr_practice_1_0_36459', 'airr_practice_1_0_36458', 'airr_practice_1_0_36456', 'airr_practice_1_0_36457', 'airr_practice_1_0_36453', 'airr_practice_1_0_36455', 'airr_practice_1_0_36454', 'airr_practice_1_0_36449', 'airr_practice_1_0_36448', 'airr_practice_1_0_36450', 'airr_practice_1_0_36446', 'airr_practice_1_0_36441', 'airr_practice_1_0_36440', 'airr_practice_1_0_36439', 'airr_practice_1_0_36436', 'airr_practice_1_0_36447', 'airr_practice_1_0_36435', 'airr_practice_1_0_36433', 'airr_practice_1_0_36434', 'airr_practice_1_0_36451', 'airr_practice_1_0_36438', 'airr_practice_1_0_36437', 'airr_practice_1_0_146419', 'airr_practice_1_0_146417', 'airr_practice_1_0_146418', 'airr_practice_1_0_36431', 'airr_practice_1_0_146416', 'airr_practice_1_0_146415', 'airr_practice_1_0_94870', 'airr_practice_1_0_28281', 'airr_practice_1_0_36444', 'airr_practice_1_0_36432', 'airr_practice_1_0_146420', 'airr_practice_1_0_36443', 'airr_practice_1_0_28278', 'airr_practice_1_0_86062', 'airr_practice_1_0_156209', 'airr_practice_1_0_36452', 'airr_practice_1_0_36445', 'airr_practice_1_0_28269', 'airr_practice_1_0_28280', 'airr_practice_1_0_28274', 'airr_practice_1_0_28272', 'airr_practice_1_0_88363', 'airr_practice_1_0_86093', 'airr_practice_1_0_28276', 'airr_practice_1_0_28270', 'airr_practice_1_0_86099', 'airr_practice_1_0_86111', 'airr_practice_1_0_28266', 'airr_practice_1_0_86096', 'airr_practice_1_0_86094', 'airr_practice_1_0_86103', 'airr_practice_1_0_86107', 'airr_practice_1_0_28265', 'airr_practice_1_0_86095', 'airr_practice_1_0_86098', 
'airr_practice_1_0_28260', 'airr_practice_1_0_86104', 'airr_practice_1_0_86092', 'airr_practice_1_0_28262', 'airr_practice_1_0_86100', 'airr_practice_1_0_86105', 'airr_practice_1_0_86102', 'airr_practice_1_0_86108', 'airr_practice_1_0_86106', 'airr_practice_1_0_85905', 'airr_practice_1_0_23413', 'airr_practice_1_0_23418', 'airr_practice_1_0_86110', 'airr_practice_1_0_23419', 'airr_practice_1_0_28264', 'airr_practice_1_0_94874', 'airr_practice_1_0_23415', 'airr_practice_1_0_23412', 'airr_practice_1_0_23401', 'airr_practice_1_0_28263', 'airr_practice_1_0_85904', 'airr_practice_1_0_23406', 'airr_practice_1_0_23407', 'airr_practice_1_0_23420', 'airr_practice_1_0_23408', 'airr_practice_1_0_23416', 'airr_practice_1_0_23414', 'airr_practice_1_0_23410', 'airr_practice_1_0_94875', 'airr_practice_1_0_23417', 'airr_practice_1_0_23409', 'airr_practice_1_0_28254', 'airr_practice_1_0_94869', 'airr_practice_1_0_28249', 'airr_practice_1_0_28258', 'airr_practice_1_0_28240', 'airr_practice_1_0_23404', 'airr_practice_1_0_94873', 'airr_practice_1_0_23402', 'airr_practice_1_0_23405', 'airr_practice_1_0_28233', 'airr_practice_1_0_23400', 'airr_practice_1_0_28256', 'airr_practice_1_0_28236', 'airr_practice_1_0_28243', 'airr_practice_1_0_23403', 'airr_practice_1_0_28238', 'airr_practice_1_0_28245', 'airr_practice_1_0_28242', 'airr_practice_1_0_23411', 'airr_practice_1_0_23397', 'airr_practice_1_0_28241', 'airr_practice_1_0_28257', 'airr_practice_1_0_28239', 'airr_practice_1_0_28255', 'airr_practice_1_0_28250', 'airr_practice_1_0_156243', 'airr_practice_1_0_28253', 'airr_practice_1_0_156227', 'airr_practice_1_0_156237', 'airr_practice_1_0_28234', 'airr_practice_1_0_94877', 'airr_practice_1_0_23398', 'airr_practice_1_0_156231', 'airr_practice_1_0_23399', 'airr_practice_1_0_156228', 'airr_practice_1_0_28231', 'airr_practice_1_0_28232', 'airr_practice_1_0_156229', 'airr_practice_1_0_156235', 'airr_practice_1_0_156242', 'airr_practice_1_0_28237', 'airr_practice_1_0_156233', 
'airr_practice_1_0_156232', 'airr_practice_1_0_156239', 'airr_practice_1_0_156238', 'airr_practice_1_0_156253', 'airr_practice_1_0_156241', 'airr_practice_1_0_156244', 'airr_practice_1_0_156248', 'airr_practice_1_0_156249', 'airr_practice_1_0_150445', 'airr_practice_1_0_156250', 'airr_practice_1_0_156252', 'airr_practice_1_0_156259', 'airr_practice_1_0_150471', 'airr_practice_1_0_156256', 'airr_practice_1_0_156247', 'airr_practice_1_0_156251', 'airr_practice_1_0_156257', 'airr_practice_1_0_156260', 'airr_practice_1_0_156265', 'airr_practice_1_0_151902', 'airr_practice_1_0_156263', 'airr_practice_1_0_150470', 'airr_practice_1_0_151912', 'airr_practice_1_0_151903', 'airr_practice_1_0_156267', 'airr_practice_1_0_156264', 'airr_practice_1_0_151894', 'airr_practice_1_0_151896', 'airr_practice_1_0_156262', 'airr_practice_1_0_156271', 'airr_practice_1_0_156269', 'airr_practice_1_0_156274', 'airr_practice_1_0_151901', 'airr_practice_1_0_151909', 'airr_practice_1_0_156266', 'airr_practice_1_0_150469', 'airr_practice_1_0_151895', 'airr_practice_1_0_156273', 'airr_practice_1_0_151898', 'airr_practice_1_0_151899', 'airr_practice_1_0_135225', 'airr_practice_1_0_151916', 'airr_practice_1_0_151913', 'airr_practice_1_0_150467', 'airr_practice_1_0_150468', 'airr_practice_1_0_151904', 'airr_practice_1_0_151905', 'airr_practice_1_0_151907', 'airr_practice_1_0_151917', 'airr_practice_1_0_151915', 'airr_practice_1_0_151919', 'airr_practice_1_0_151914', 'airr_practice_1_0_150465', 'airr_practice_1_0_151906', 'airr_practice_1_0_151922', 'airr_practice_1_0_151920', 'airr_practice_1_0_151921', 'airr_practice_1_0_150457', 'airr_practice_1_0_150456', 'airr_practice_1_0_150460', 'airr_practice_1_0_151910', 'airr_practice_1_0_150462', 'airr_practice_1_0_151925', 'airr_practice_1_0_150463', 'airr_practice_1_0_151911', 'airr_practice_1_0_150466', 'airr_practice_1_0_156268', 'airr_practice_1_0_150453', 'airr_practice_1_0_150448', 'airr_practice_1_0_150451', 'airr_practice_1_0_150452', 
'airr_practice_1_0_151918', 'airr_practice_1_0_151923', 'airr_practice_1_0_151908', 'airr_practice_1_0_150464', 'airr_practice_1_0_150444', 'airr_practice_1_0_149741', 'airr_practice_1_0_149738', 'airr_practice_1_0_150447', 'airr_practice_1_0_150446', 'airr_practice_1_0_150458', 'airr_practice_1_0_149729', 'airr_practice_1_0_150455', 'airr_practice_1_0_150443', 'airr_practice_1_0_150442', 'airr_practice_1_0_150441', 'airr_practice_1_0_150440', 'airr_practice_1_0_149737', 'airr_practice_1_0_149746', 'airr_practice_1_0_149731', 'airr_practice_1_0_149743', 'airr_practice_1_0_149739', 'airr_practice_1_0_149718', 'airr_practice_1_0_149736', 'airr_practice_1_0_149740', 'airr_practice_1_0_149724', 'airr_practice_1_0_149742', 'airr_practice_1_0_151926', 'airr_practice_1_0_150449', 'airr_practice_1_0_149745', 'airr_practice_1_0_149714', 'airr_practice_1_0_149728', 'airr_practice_1_0_149734', 'airr_practice_1_0_149723', 'airr_practice_1_0_149733', 'airr_practice_1_0_149730', 'airr_practice_1_0_149721', 'airr_practice_1_0_149713', 'airr_practice_1_0_149717', 'airr_practice_1_0_149712', 'airr_practice_1_0_150439', 'airr_practice_1_0_149725', 'airr_practice_1_0_149715', 'airr_practice_1_0_149727', 'airr_practice_1_0_152525', 'airr_practice_1_0_152538', 'airr_practice_1_0_152526', 'airr_practice_1_0_149716', 'airr_practice_1_0_152528', 'airr_practice_1_0_152531', 'airr_practice_1_0_149719', 'airr_practice_1_0_152547', 'airr_practice_1_0_152535', 'airr_practice_1_0_149722', 'airr_practice_1_0_152534', 'airr_practice_1_0_152532', 'airr_practice_1_0_152527', 'airr_practice_1_0_152542', 'airr_practice_1_0_149726', 'airr_practice_1_0_152524', 'airr_practice_1_0_152544', 'airr_practice_1_0_152539', 'airr_practice_1_0_152551', 'airr_practice_1_0_152546', 'airr_practice_1_0_152543', 'airr_practice_1_0_152540', 'airr_practice_1_0_152548', 'airr_practice_1_0_156210', 'airr_practice_1_0_156213', 'airr_practice_1_0_152553', 'airr_practice_1_0_152550', 'airr_practice_1_0_152530', 
'airr_practice_1_0_156223', 'airr_practice_1_0_156226', 'airr_practice_1_0_152552', 'airr_practice_1_0_152549', 'airr_practice_1_0_156224', 'airr_practice_1_0_156220', 'airr_practice_1_0_156214', 'airr_practice_1_0_156212', 'airr_practice_1_0_156222', 'airr_practice_1_0_156217', 'airr_practice_1_0_156225', 'airr_practice_1_0_156221'].
Because it's a big long list, it's difficult to eyeball to get a sense of how bad the problem is. It also doesn't indicate which annotator, if any, might be the problem. Given that our 1.0 production standard for annotations was "annotator messes up less than 3% of the time", we need something that will let people look at the report and make a go/no-go decision.
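To make the request above concrete, here is a minimal sketch of a summary that could replace the raw ID list. Everything in it is hypothetical: the `summarize_mismatches` name, the shape of the `mismatches` input (prompt ID mapped to the annotators that disagreed with the recorded result), and the reuse of the 3% figure as a pass/fail threshold are assumptions for illustration, not the checker's actual API.

```python
from collections import Counter


def summarize_mismatches(mismatches, total_prompts, threshold=0.03, sample_size=5):
    """Condense a long list of mismatching prompts into a go/no-go summary.

    mismatches: dict mapping prompt ID -> list of annotator names that
        disagreed with the recorded result (hypothetical input shape).
    total_prompts: number of prompts the check looked at.
    threshold: assumed maximum acceptable mismatch rate (3% here).
    """
    rate = len(mismatches) / total_prompts if total_prompts else 0.0
    # Count how often each annotator shows up among the mismatches, so a
    # single misbehaving annotator stands out.
    by_annotator = Counter(
        name for names in mismatches.values() for name in names
    )
    return {
        "mismatch_count": len(mismatches),
        "total_prompts": total_prompts,
        "mismatch_rate": rate,
        "passed": rate < threshold,
        "by_annotator": dict(by_annotator),
        # A few IDs are enough to start digging; the full list can go to a file.
        "sample_ids": sorted(mismatches)[:sample_size],
    }
```

A report built from this would show the count, the rate against the threshold, the per-annotator breakdown, and a handful of example IDs, rather than several hundred IDs inline.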
I'd also suggest that the journal file/path should not be an option, as it's not optional. It should be an nargs argument.
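As an illustration of the difference, here is a sketch using the standard library's `argparse`; the tool's real CLI framework is not shown in this thread, so the parser, the `journal_path` name, and the choice of `nargs="+"` are assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(
        description="Run consistency checks on run journals (hypothetical CLI)."
    )
    # A positional argument is required by default, so the user cannot
    # forget it; nargs="+" accepts one or more journal paths.
    parser.add_argument("journal_path", nargs="+", type=Path)
    return parser
```

With this shape, `build_parser().parse_args(["run1.jsonl"])` yields a list containing one `Path`, and calling the tool with no journal exits with a usage error instead of failing later.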
Maybe it's clearer if I give an example. If something is good, a green check is great. But if it isn't, I want details useful for understanding and fixing the problem. A long list of IDs is better than nothing, because it tells me where I can start looking. But it doesn't tell me what the consistency checker thinks is wrong. I'd rather have something like this:
The changes I made:
Does that make sense? Also, I suspect the reason I'm finding an error here is that this is applying an ensemble-only check to something run with the default evaluator. If this is a tool only for us, it should go in modelbench-private. If it's a tool for the public, it should work with their runs.
@wpietri I applied most of your changes and also modified it to only do the private-annotator checks for official benchmark runs.
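The gating described here could look roughly like the sketch below. It is not the code in this PR: `select_checks`, the `is_official_run` flag, and the `EachPromptQueuedOnce` check name are hypothetical, and treating `AnnotationsMergedCorrectly` as a private-annotator check follows the suspicion raised in the review rather than anything confirmed in the thread.

```python
# Checks that should hold for any run, including public ones.
# "EachPromptQueuedOnce" is a placeholder name for illustration.
PUBLIC_CHECKS = ["EachPromptQueuedOnce"]

# Checks that only make sense when the private annotator ensemble was used
# (assumed membership, per the review comment above).
PRIVATE_ANNOTATOR_CHECKS = ["AnnotationsMergedCorrectly"]


def select_checks(is_official_run: bool) -> list[str]:
    """Return the names of the checks to run for this journal."""
    checks = list(PUBLIC_CHECKS)
    if is_official_run:
        checks.extend(PRIVATE_ANNOTATOR_CHECKS)
    return checks
```

Keeping the two groups in separate lists makes it obvious which checks a public run with the default evaluator is expected to pass.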
Summary of changes: