More consistency checks + improvements #719

Merged
merged 28 commits into from
Dec 9, 2024

@bkorycki bkorycki commented Nov 27, 2024

Summary of changes:

  • The CLI `consistency-check` command accepts a path to a journal file or to a directory. A directory is searched recursively for journal files, and every journal found is checked.
  • If multiple journals are checked, a summary table is printed at the end.
    Screenshot 2024-11-27 at 12 31 16 PM
  • Checker objects no longer hold the memory-intensive journal-querying object, which was prohibitive when running checks on multiple large journals.
  • To improve robustness, the checker first tries to get a test's annotator set directly from the test. If that doesn't work (e.g. the right secrets aren't available), it gets the annotator set from the journal's annotator-related entries.
  • More tests
  • More checks
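
The file-or-directory behavior in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `find_journals` and the filename pattern `journal-*.jsonl*` are assumptions made for the example.

```python
from pathlib import Path


def find_journals(path: str, pattern: str = "journal-*.jsonl*") -> list:
    """Return the journal files to check for a file or directory path.

    `pattern` is a hypothetical naming convention used only in this sketch.
    """
    p = Path(path)
    if p.is_file():
        # A single journal file was given: check just that one.
        return [p]
    if p.is_dir():
        # rglob walks the directory tree recursively.
        return sorted(f for f in p.rglob(pattern) if f.is_file())
    raise FileNotFoundError(f"No such file or directory: {path}")
```

Each returned path would then be checked in turn, with the summary table printed only when more than one journal is found.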

@bkorycki bkorycki requested a review from a team as a code owner November 27, 2024 20:36

github-actions bot commented Nov 27, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@bkorycki bkorycki temporarily deployed to Scheduled Testing November 27, 2024 22:22 — with GitHub Actions Inactive
@bkorycki bkorycki requested a review from rogthefrog November 27, 2024 22:24
Contributor

@rogthefrog rogthefrog left a comment

🔥

Contributor

@wpietri wpietri left a comment

This generally strikes me as progress, and I think it works well for the nominal case. But I think it needs more improvement for when checks fail. I ran it on a run journal and got this:

AnnotationsMergedCorrectly: Expected that measure_item_quality returns majority vote for each prompt. SUT mistralai-ministral-8b-2410/test safe-vcr-en_us-practice-1.0 have mismatching values on the following prompts:
        ['airr_practice_1_0_85906', 'airr_practice_1_0_23174', 'airr_practice_1_0_85907', 'airr_practice_1_0_23177', 'airr_practice_1_0_85910', 'airr_practice_1_0_23180', 'airr_practice_1_0_85914', 'airr_practice_1_0_23171', 'airr_practice_1_0_23396', 'airr_practice_1_0_23392', 'airr_practice_1_0_23394', 'airr_practice_1_0_85919', 'airr_practice_1_0_23391', 'airr_practice_1_0_85913', 'airr_practice_1_0_23179', 'airr_practice_1_0_23181', 'airr_practice_1_0_23178', 'airr_practice_1_0_85912', 'airr_practice_1_0_23393', 'airr_practice_1_0_23170', 'airr_practice_1_0_23182', 'airr_practice_1_0_23175', 'airr_practice_1_0_23173', 'airr_practice_1_0_23395', 'airr_practice_1_0_85916', 'airr_practice_1_0_23157', 'airr_practice_1_0_85909', 'airr_practice_1_0_36473', 'airr_practice_1_0_23176', 'airr_practice_1_0_23172', 'airr_practice_1_0_23162', 'airr_practice_1_0_85918', 'airr_practice_1_0_23166', 'airr_practice_1_0_36474', 'airr_practice_1_0_95053', 'airr_practice_1_0_23155', 'airr_practice_1_0_146957', 'airr_practice_1_0_23167', 'airr_practice_1_0_95052', 'airr_practice_1_0_27145', 'airr_practice_1_0_23168', 'airr_practice_1_0_36469', 'airr_practice_1_0_23169', 'airr_practice_1_0_23165', 'airr_practice_1_0_23164', 'airr_practice_1_0_27128', 'airr_practice_1_0_23160', 'airr_practice_1_0_36475', 'airr_practice_1_0_27148', 'airr_practice_1_0_23158', 'airr_practice_1_0_23156', 'airr_practice_1_0_28230', 'airr_practice_1_0_27130', 'airr_practice_1_0_23154', 'airr_practice_1_0_27122', 'airr_practice_1_0_23151', 'airr_practice_1_0_36470', 'airr_practice_1_0_23163', 'airr_practice_1_0_23153', 'airr_practice_1_0_23161', 'airr_practice_1_0_23152', 'airr_practice_1_0_36471', 'airr_practice_1_0_27146', 'airr_practice_1_0_85525', 'airr_practice_1_0_27139', 'airr_practice_1_0_27121', 'airr_practice_1_0_85534', 'airr_practice_1_0_27131', 'airr_practice_1_0_27124', 'airr_practice_1_0_27119', 'airr_practice_1_0_27134', 'airr_practice_1_0_27106', 'airr_practice_1_0_85532', 
'airr_practice_1_0_27115', 'airr_practice_1_0_27126', 'airr_practice_1_0_27114', 'airr_practice_1_0_27129', 'airr_practice_1_0_27113', 'airr_practice_1_0_27099', 'airr_practice_1_0_85526', 'airr_practice_1_0_85523', 'airr_practice_1_0_85528', 'airr_practice_1_0_85527', 'airr_practice_1_0_91954', 'airr_practice_1_0_85522', 'airr_practice_1_0_85533', 'airr_practice_1_0_85921', 'airr_practice_1_0_27136', 'airr_practice_1_0_85531', 'airr_practice_1_0_36468', 'airr_practice_1_0_85529', 'airr_practice_1_0_91951', 'airr_practice_1_0_85530', 'airr_practice_1_0_85521', 'airr_practice_1_0_27118', 'airr_practice_1_0_91947', 'airr_practice_1_0_85535', 'airr_practice_1_0_85920', 'airr_practice_1_0_91938', 'airr_practice_1_0_91987', 'airr_practice_1_0_91975', 'airr_practice_1_0_91955', 'airr_practice_1_0_36465', 'airr_practice_1_0_91993', 'airr_practice_1_0_91953', 'airr_practice_1_0_91959', 'airr_practice_1_0_85536', 'airr_practice_1_0_91965', 'airr_practice_1_0_91952', 'airr_practice_1_0_36466', 'airr_practice_1_0_36467', 'airr_practice_1_0_91940', 'airr_practice_1_0_92004', 'airr_practice_1_0_91966', 'airr_practice_1_0_91974', 'airr_practice_1_0_42071', 'airr_practice_1_0_42069', 'airr_practice_1_0_91994', 'airr_practice_1_0_92007', 'airr_practice_1_0_36464', 'airr_practice_1_0_91996', 'airr_practice_1_0_92016', 'airr_practice_1_0_92003', 'airr_practice_1_0_92009', 'airr_practice_1_0_91981', 'airr_practice_1_0_92006', 'airr_practice_1_0_42070', 'airr_practice_1_0_92011', 'airr_practice_1_0_92005', 'airr_practice_1_0_92021', 'airr_practice_1_0_92032', 'airr_practice_1_0_91998', 'airr_practice_1_0_92001', 'airr_practice_1_0_92002', 'airr_practice_1_0_28222', 'airr_practice_1_0_42076', 'airr_practice_1_0_28227', 'airr_practice_1_0_42075', 'airr_practice_1_0_42074', 'airr_practice_1_0_42073', 'airr_practice_1_0_42072', 'airr_practice_1_0_85924', 'airr_practice_1_0_85923', 'airr_practice_1_0_92019', 'airr_practice_1_0_138867', 'airr_practice_1_0_92023', 'airr_practice_1_0_42077', 
'airr_practice_1_0_42080', 'airr_practice_1_0_85926', 'airr_practice_1_0_42078', 'airr_practice_1_0_28226', 'airr_practice_1_0_28221', 'airr_practice_1_0_85891', 'airr_practice_1_0_36463', 'airr_practice_1_0_85922', 'airr_practice_1_0_85925', 'airr_practice_1_0_28217', 'airr_practice_1_0_28220', 'airr_practice_1_0_28212', 'airr_practice_1_0_42079', 'airr_practice_1_0_42050', 'airr_practice_1_0_36462', 'airr_practice_1_0_28219', 'airr_practice_1_0_42046', 'airr_practice_1_0_28213', 'airr_practice_1_0_28214', 'airr_practice_1_0_28225', 'airr_practice_1_0_28211', 'airr_practice_1_0_28224', 'airr_practice_1_0_85928', 'airr_practice_1_0_85931', 'airr_practice_1_0_42049', 'airr_practice_1_0_28208', 'airr_practice_1_0_28218', 'airr_practice_1_0_42083', 'airr_practice_1_0_28216', 'airr_practice_1_0_28209', 'airr_practice_1_0_85933', 'airr_practice_1_0_42048', 'airr_practice_1_0_42085', 'airr_practice_1_0_42084', 'airr_practice_1_0_85930', 'airr_practice_1_0_28210', 'airr_practice_1_0_85929', 'airr_practice_1_0_42087', 'airr_practice_1_0_88410', 'airr_practice_1_0_88402', 'airr_practice_1_0_85932', 'airr_practice_1_0_88405', 'airr_practice_1_0_42089', 'airr_practice_1_0_28207', 'airr_practice_1_0_86065', 'airr_practice_1_0_88411', 'airr_practice_1_0_88408', 'airr_practice_1_0_86064', 'airr_practice_1_0_88412', 'airr_practice_1_0_88384', 'airr_practice_1_0_88413', 'airr_practice_1_0_42047', 'airr_practice_1_0_42086', 'airr_practice_1_0_88409', 'airr_practice_1_0_88397', 'airr_practice_1_0_88401', 'airr_practice_1_0_88391', 'airr_practice_1_0_88400', 'airr_practice_1_0_88385', 'airr_practice_1_0_88407', 'airr_practice_1_0_88404', 'airr_practice_1_0_88403', 'airr_practice_1_0_92334', 'airr_practice_1_0_88406', 'airr_practice_1_0_88383', 'airr_practice_1_0_88393', 'airr_practice_1_0_88386', 'airr_practice_1_0_88392', 'airr_practice_1_0_88396', 'airr_practice_1_0_88389', 'airr_practice_1_0_88394', 'airr_practice_1_0_88381', 'airr_practice_1_0_88390', 'airr_practice_1_0_92343', 
'airr_practice_1_0_88382', 'airr_practice_1_0_88395', 'airr_practice_1_0_92320', 'airr_practice_1_0_88387', 'airr_practice_1_0_92342', 'airr_practice_1_0_92333', 'airr_practice_1_0_92364', 'airr_practice_1_0_92366', 'airr_practice_1_0_88398', 'airr_practice_1_0_92331', 'airr_practice_1_0_92322', 'airr_practice_1_0_92347', 'airr_practice_1_0_92365', 'airr_practice_1_0_92363', 'airr_practice_1_0_92356', 'airr_practice_1_0_92335', 'airr_practice_1_0_92367', 'airr_practice_1_0_92345', 'airr_practice_1_0_88375', 'airr_practice_1_0_92376', 'airr_practice_1_0_92351', 'airr_practice_1_0_88379', 'airr_practice_1_0_88380', 'airr_practice_1_0_88373', 'airr_practice_1_0_92393', 'airr_practice_1_0_88378', 'airr_practice_1_0_88388', 'airr_practice_1_0_92330', 'airr_practice_1_0_92380', 'airr_practice_1_0_92384', 'airr_practice_1_0_92396', 'airr_practice_1_0_88369', 'airr_practice_1_0_92325', 'airr_practice_1_0_88377', 'airr_practice_1_0_88374', 'airr_practice_1_0_92379', 'airr_practice_1_0_88366', 'airr_practice_1_0_88368', 'airr_practice_1_0_43011', 'airr_practice_1_0_88367', 'airr_practice_1_0_88372', 'airr_practice_1_0_43026', 'airr_practice_1_0_43025', 'airr_practice_1_0_43009', 'airr_practice_1_0_43022', 'airr_practice_1_0_88376', 'airr_practice_1_0_43029', 'airr_practice_1_0_88371', 'airr_practice_1_0_88370', 'airr_practice_1_0_43012', 'airr_practice_1_0_43010', 'airr_practice_1_0_43027', 'airr_practice_1_0_43013', 'airr_practice_1_0_43023', 'airr_practice_1_0_43020', 'airr_practice_1_0_43015', 'airr_practice_1_0_43018', 'airr_practice_1_0_43019', 'airr_practice_1_0_43017', 'airr_practice_1_0_43021', 'airr_practice_1_0_43028', 'airr_practice_1_0_86066', 'airr_practice_1_0_43014', 'airr_practice_1_0_43030', 'airr_practice_1_0_87156', 'airr_practice_1_0_43031', 'airr_practice_1_0_88364', 'airr_practice_1_0_43016', 'airr_practice_1_0_142327', 'airr_practice_1_0_86069', 'airr_practice_1_0_151892', 'airr_practice_1_0_92316', 'airr_practice_1_0_150437', 
'airr_practice_1_0_142693', 'airr_practice_1_0_146401', 'airr_practice_1_0_86070', 'airr_practice_1_0_86068', 'airr_practice_1_0_142331', 'airr_practice_1_0_151893', 'airr_practice_1_0_142324', 'airr_practice_1_0_88365', 'airr_practice_1_0_86074', 'airr_practice_1_0_42055', 'airr_practice_1_0_43024', 'airr_practice_1_0_142692', 'airr_practice_1_0_86077', 'airr_practice_1_0_36429', 'airr_practice_1_0_42052', 'airr_practice_1_0_86075', 'airr_practice_1_0_142699', 'airr_practice_1_0_42017', 'airr_practice_1_0_86072', 'airr_practice_1_0_86090', 'airr_practice_1_0_42056', 'airr_practice_1_0_65127', 'airr_practice_1_0_86071', 'airr_practice_1_0_86078', 'airr_practice_1_0_42057', 'airr_practice_1_0_42051', 'airr_practice_1_0_85359', 'airr_practice_1_0_86079', 'airr_practice_1_0_86087', 'airr_practice_1_0_86080', 'airr_practice_1_0_146960', 'airr_practice_1_0_85366', 'airr_practice_1_0_85367', 'airr_practice_1_0_86083', 'airr_practice_1_0_86084', 'airr_practice_1_0_85362', 'airr_practice_1_0_85368', 'airr_practice_1_0_85357', 'airr_practice_1_0_85360', 'airr_practice_1_0_85363', 'airr_practice_1_0_85356', 'airr_practice_1_0_85353', 'airr_practice_1_0_86081', 'airr_practice_1_0_85361', 'airr_practice_1_0_146961', 'airr_practice_1_0_85364', 'airr_practice_1_0_85354', 'airr_practice_1_0_86086', 'airr_practice_1_0_146968', 'airr_practice_1_0_85358', 'airr_practice_1_0_176608', 'airr_practice_1_0_86082', 'airr_practice_1_0_42058', 'airr_practice_1_0_85365', 'airr_practice_1_0_146966', 'airr_practice_1_0_146964', 'airr_practice_1_0_146962', 'airr_practice_1_0_174199', 'airr_practice_1_0_85355', 'airr_practice_1_0_146958', 'airr_practice_1_0_146959', 'airr_practice_1_0_146969', 'airr_practice_1_0_146965', 'airr_practice_1_0_86091', 'airr_practice_1_0_146967', 'airr_practice_1_0_174198', 'airr_practice_1_0_172117', 'airr_practice_1_0_146963', 'airr_practice_1_0_173574', 'airr_practice_1_0_38965', 'airr_practice_1_0_176605', 'airr_practice_1_0_42060', 'airr_practice_1_0_42062', 
'airr_practice_1_0_176609', 'airr_practice_1_0_172112', 'airr_practice_1_0_38966', 'airr_practice_1_0_173567', 'airr_practice_1_0_173570', 'airr_practice_1_0_146974', 'airr_practice_1_0_38968', 'airr_practice_1_0_176607', 'airr_practice_1_0_173569', 'airr_practice_1_0_173572', 'airr_practice_1_0_38967', 'airr_practice_1_0_38970', 'airr_practice_1_0_38976', 'airr_practice_1_0_38964', 'airr_practice_1_0_38969', 'airr_practice_1_0_146975', 'airr_practice_1_0_146970', 'airr_practice_1_0_38983', 'airr_practice_1_0_146971', 'airr_practice_1_0_172113', 'airr_practice_1_0_146976', 'airr_practice_1_0_87188', 'airr_practice_1_0_38980', 'airr_practice_1_0_38978', 'airr_practice_1_0_146972', 'airr_practice_1_0_38984', 'airr_practice_1_0_38973', 'airr_practice_1_0_38972', 'airr_practice_1_0_146973', 'airr_practice_1_0_38979', 'airr_practice_1_0_38977', 'airr_practice_1_0_38975', 'airr_practice_1_0_83400', 'airr_practice_1_0_38971', 'airr_practice_1_0_173571', 'airr_practice_1_0_172114', 'airr_practice_1_0_87191', 'airr_practice_1_0_38990', 'airr_practice_1_0_87192', 'airr_practice_1_0_172116', 'airr_practice_1_0_38985', 'airr_practice_1_0_38989', 'airr_practice_1_0_171387', 'airr_practice_1_0_87186', 'airr_practice_1_0_38986', 'airr_practice_1_0_171389', 'airr_practice_1_0_87190', 'airr_practice_1_0_38987', 'airr_practice_1_0_87187', 'airr_practice_1_0_168216', 'airr_practice_1_0_38982', 'airr_practice_1_0_38991', 'airr_practice_1_0_173568', 'airr_practice_1_0_168217', 'airr_practice_1_0_65135', 'airr_practice_1_0_171391', 'airr_practice_1_0_171388', 'airr_practice_1_0_65131', 'airr_practice_1_0_168218', 'airr_practice_1_0_65137', 'airr_practice_1_0_65138', 'airr_practice_1_0_70022', 'airr_practice_1_0_70020', 'airr_practice_1_0_38974', 'airr_practice_1_0_65139', 'airr_practice_1_0_171390', 'airr_practice_1_0_85895', 'airr_practice_1_0_77986', 'airr_practice_1_0_85892', 'airr_practice_1_0_70016', 'airr_practice_1_0_77987', 'airr_practice_1_0_68966', 'airr_practice_1_0_65133', 
'airr_practice_1_0_77985', 'airr_practice_1_0_85896', 'airr_practice_1_0_70018', 'airr_practice_1_0_65134', 'airr_practice_1_0_85893', 'airr_practice_1_0_87184', 'airr_practice_1_0_85899', 'airr_practice_1_0_85901', 'airr_practice_1_0_70021', 'airr_practice_1_0_85897', 'airr_practice_1_0_87189', 'airr_practice_1_0_138860', 'airr_practice_1_0_87178', 'airr_practice_1_0_85898', 'airr_practice_1_0_87176', 'airr_practice_1_0_65136', 'airr_practice_1_0_87182', 'airr_practice_1_0_85902', 'airr_practice_1_0_77984', 'airr_practice_1_0_85903', 'airr_practice_1_0_86115', 'airr_practice_1_0_85894', 'airr_practice_1_0_85900', 'airr_practice_1_0_87183', 'airr_practice_1_0_87185', 'airr_practice_1_0_87180', 'airr_practice_1_0_65132', 'airr_practice_1_0_87181', 'airr_practice_1_0_87168', 'airr_practice_1_0_94871', 'airr_practice_1_0_87170', 'airr_practice_1_0_87179', 'airr_practice_1_0_87159', 'airr_practice_1_0_87172', 'airr_practice_1_0_138866', 'airr_practice_1_0_87174', 'airr_practice_1_0_87165', 'airr_practice_1_0_87158', 'airr_practice_1_0_87160', 'airr_practice_1_0_87175', 'airr_practice_1_0_86118', 'airr_practice_1_0_136468', 'airr_practice_1_0_136017', 'airr_practice_1_0_87163', 'airr_practice_1_0_138863', 'airr_practice_1_0_86112', 'airr_practice_1_0_86114', 'airr_practice_1_0_138864', 'airr_practice_1_0_146403', 'airr_practice_1_0_138862', 'airr_practice_1_0_136169', 'airr_practice_1_0_136656', 'airr_practice_1_0_136014', 'airr_practice_1_0_28309', 'airr_practice_1_0_42063', 'airr_practice_1_0_136016', 'airr_practice_1_0_87164', 'airr_practice_1_0_136470', 'airr_practice_1_0_136170', 'airr_practice_1_0_138861', 'airr_practice_1_0_136171', 'airr_practice_1_0_138865', 'airr_practice_1_0_42066', 'airr_practice_1_0_42068', 'airr_practice_1_0_146407', 'airr_practice_1_0_136168', 'airr_practice_1_0_42067', 'airr_practice_1_0_146410', 'airr_practice_1_0_42064', 'airr_practice_1_0_42065', 'airr_practice_1_0_146402', 'airr_practice_1_0_38995', 'airr_practice_1_0_28310', 
'airr_practice_1_0_146405', 'airr_practice_1_0_136469', 'airr_practice_1_0_38997', 'airr_practice_1_0_39001', 'airr_practice_1_0_146411', 'airr_practice_1_0_135228', 'airr_practice_1_0_38993', 'airr_practice_1_0_39006', 'airr_practice_1_0_39007', 'airr_practice_1_0_28305', 'airr_practice_1_0_38998', 'airr_practice_1_0_146409', 'airr_practice_1_0_38994', 'airr_practice_1_0_39000', 'airr_practice_1_0_39002', 'airr_practice_1_0_39009', 'airr_practice_1_0_146406', 'airr_practice_1_0_39003', 'airr_practice_1_0_39004', 'airr_practice_1_0_39005', 'airr_practice_1_0_39008', 'airr_practice_1_0_85872', 'airr_practice_1_0_38999', 'airr_practice_1_0_85870', 'airr_practice_1_0_38981', 'airr_practice_1_0_28307', 'airr_practice_1_0_85869', 'airr_practice_1_0_85871', 'airr_practice_1_0_28306', 'airr_practice_1_0_85881', 'airr_practice_1_0_28303', 'airr_practice_1_0_28295', 'airr_practice_1_0_87157', 'airr_practice_1_0_85868', 'airr_practice_1_0_85875', 'airr_practice_1_0_85873', 'airr_practice_1_0_85884', 'airr_practice_1_0_85874', 'airr_practice_1_0_85882', 'airr_practice_1_0_85879', 'airr_practice_1_0_28308', 'airr_practice_1_0_85877', 'airr_practice_1_0_85876', 'airr_practice_1_0_85885', 'airr_practice_1_0_85880', 'airr_practice_1_0_28292', 'airr_practice_1_0_36461', 'airr_practice_1_0_28296', 'airr_practice_1_0_85890', 'airr_practice_1_0_28297', 'airr_practice_1_0_28288', 'airr_practice_1_0_85887', 'airr_practice_1_0_85883', 'airr_practice_1_0_85888', 'airr_practice_1_0_42020', 'airr_practice_1_0_85886', 'airr_practice_1_0_28282', 'airr_practice_1_0_85889', 'airr_practice_1_0_42021', 'airr_practice_1_0_42081', 'airr_practice_1_0_42082', 'airr_practice_1_0_28286', 'airr_practice_1_0_42028', 'airr_practice_1_0_42038', 'airr_practice_1_0_42059', 'airr_practice_1_0_42018', 'airr_practice_1_0_42025', 'airr_practice_1_0_42022', 'airr_practice_1_0_42019', 'airr_practice_1_0_28302', 'airr_practice_1_0_42024', 'airr_practice_1_0_42023', 'airr_practice_1_0_42032', 
'airr_practice_1_0_42042', 'airr_practice_1_0_42039', 'airr_practice_1_0_42027', 'airr_practice_1_0_42036', 'airr_practice_1_0_42026', 'airr_practice_1_0_42031', 'airr_practice_1_0_42041', 'airr_practice_1_0_136015', 'airr_practice_1_0_42030', 'airr_practice_1_0_42035', 'airr_practice_1_0_42034', 'airr_practice_1_0_36460', 'airr_practice_1_0_42037', 'airr_practice_1_0_42043', 'airr_practice_1_0_146413', 'airr_practice_1_0_42044', 'airr_practice_1_0_42045', 'airr_practice_1_0_36459', 'airr_practice_1_0_36458', 'airr_practice_1_0_36456', 'airr_practice_1_0_36457', 'airr_practice_1_0_36453', 'airr_practice_1_0_36455', 'airr_practice_1_0_36454', 'airr_practice_1_0_36449', 'airr_practice_1_0_36448', 'airr_practice_1_0_36450', 'airr_practice_1_0_36446', 'airr_practice_1_0_36441', 'airr_practice_1_0_36440', 'airr_practice_1_0_36439', 'airr_practice_1_0_36436', 'airr_practice_1_0_36447', 'airr_practice_1_0_36435', 'airr_practice_1_0_36433', 'airr_practice_1_0_36434', 'airr_practice_1_0_36451', 'airr_practice_1_0_36438', 'airr_practice_1_0_36437', 'airr_practice_1_0_146419', 'airr_practice_1_0_146417', 'airr_practice_1_0_146418', 'airr_practice_1_0_36431', 'airr_practice_1_0_146416', 'airr_practice_1_0_146415', 'airr_practice_1_0_94870', 'airr_practice_1_0_28281', 'airr_practice_1_0_36444', 'airr_practice_1_0_36432', 'airr_practice_1_0_146420', 'airr_practice_1_0_36443', 'airr_practice_1_0_28278', 'airr_practice_1_0_86062', 'airr_practice_1_0_156209', 'airr_practice_1_0_36452', 'airr_practice_1_0_36445', 'airr_practice_1_0_28269', 'airr_practice_1_0_28280', 'airr_practice_1_0_28274', 'airr_practice_1_0_28272', 'airr_practice_1_0_88363', 'airr_practice_1_0_86093', 'airr_practice_1_0_28276', 'airr_practice_1_0_28270', 'airr_practice_1_0_86099', 'airr_practice_1_0_86111', 'airr_practice_1_0_28266', 'airr_practice_1_0_86096', 'airr_practice_1_0_86094', 'airr_practice_1_0_86103', 'airr_practice_1_0_86107', 'airr_practice_1_0_28265', 'airr_practice_1_0_86095', 
'airr_practice_1_0_86098', 'airr_practice_1_0_28260', 'airr_practice_1_0_86104', 'airr_practice_1_0_86092', 'airr_practice_1_0_28262', 'airr_practice_1_0_86100', 'airr_practice_1_0_86105', 'airr_practice_1_0_86102', 'airr_practice_1_0_86108', 'airr_practice_1_0_86106', 'airr_practice_1_0_85905', 'airr_practice_1_0_23413', 'airr_practice_1_0_23418', 'airr_practice_1_0_86110', 'airr_practice_1_0_23419', 'airr_practice_1_0_28264', 'airr_practice_1_0_94874', 'airr_practice_1_0_23415', 'airr_practice_1_0_23412', 'airr_practice_1_0_23401', 'airr_practice_1_0_28263', 'airr_practice_1_0_85904', 'airr_practice_1_0_23406', 'airr_practice_1_0_23407', 'airr_practice_1_0_23420', 'airr_practice_1_0_23408', 'airr_practice_1_0_23416', 'airr_practice_1_0_23414', 'airr_practice_1_0_23410', 'airr_practice_1_0_94875', 'airr_practice_1_0_23417', 'airr_practice_1_0_23409', 'airr_practice_1_0_28254', 'airr_practice_1_0_94869', 'airr_practice_1_0_28249', 'airr_practice_1_0_28258', 'airr_practice_1_0_28240', 'airr_practice_1_0_23404', 'airr_practice_1_0_94873', 'airr_practice_1_0_23402', 'airr_practice_1_0_23405', 'airr_practice_1_0_28233', 'airr_practice_1_0_23400', 'airr_practice_1_0_28256', 'airr_practice_1_0_28236', 'airr_practice_1_0_28243', 'airr_practice_1_0_23403', 'airr_practice_1_0_28238', 'airr_practice_1_0_28245', 'airr_practice_1_0_28242', 'airr_practice_1_0_23411', 'airr_practice_1_0_23397', 'airr_practice_1_0_28241', 'airr_practice_1_0_28257', 'airr_practice_1_0_28239', 'airr_practice_1_0_28255', 'airr_practice_1_0_28250', 'airr_practice_1_0_156243', 'airr_practice_1_0_28253', 'airr_practice_1_0_156227', 'airr_practice_1_0_156237', 'airr_practice_1_0_28234', 'airr_practice_1_0_94877', 'airr_practice_1_0_23398', 'airr_practice_1_0_156231', 'airr_practice_1_0_23399', 'airr_practice_1_0_156228', 'airr_practice_1_0_28231', 'airr_practice_1_0_28232', 'airr_practice_1_0_156229', 'airr_practice_1_0_156235', 'airr_practice_1_0_156242', 'airr_practice_1_0_28237', 
'airr_practice_1_0_156233', 'airr_practice_1_0_156232', 'airr_practice_1_0_156239', 'airr_practice_1_0_156238', 'airr_practice_1_0_156253', 'airr_practice_1_0_156241', 'airr_practice_1_0_156244', 'airr_practice_1_0_156248', 'airr_practice_1_0_156249', 'airr_practice_1_0_150445', 'airr_practice_1_0_156250', 'airr_practice_1_0_156252', 'airr_practice_1_0_156259', 'airr_practice_1_0_150471', 'airr_practice_1_0_156256', 'airr_practice_1_0_156247', 'airr_practice_1_0_156251', 'airr_practice_1_0_156257', 'airr_practice_1_0_156260', 'airr_practice_1_0_156265', 'airr_practice_1_0_151902', 'airr_practice_1_0_156263', 'airr_practice_1_0_150470', 'airr_practice_1_0_151912', 'airr_practice_1_0_151903', 'airr_practice_1_0_156267', 'airr_practice_1_0_156264', 'airr_practice_1_0_151894', 'airr_practice_1_0_151896', 'airr_practice_1_0_156262', 'airr_practice_1_0_156271', 'airr_practice_1_0_156269', 'airr_practice_1_0_156274', 'airr_practice_1_0_151901', 'airr_practice_1_0_151909', 'airr_practice_1_0_156266', 'airr_practice_1_0_150469', 'airr_practice_1_0_151895', 'airr_practice_1_0_156273', 'airr_practice_1_0_151898', 'airr_practice_1_0_151899', 'airr_practice_1_0_135225', 'airr_practice_1_0_151916', 'airr_practice_1_0_151913', 'airr_practice_1_0_150467', 'airr_practice_1_0_150468', 'airr_practice_1_0_151904', 'airr_practice_1_0_151905', 'airr_practice_1_0_151907', 'airr_practice_1_0_151917', 'airr_practice_1_0_151915', 'airr_practice_1_0_151919', 'airr_practice_1_0_151914', 'airr_practice_1_0_150465', 'airr_practice_1_0_151906', 'airr_practice_1_0_151922', 'airr_practice_1_0_151920', 'airr_practice_1_0_151921', 'airr_practice_1_0_150457', 'airr_practice_1_0_150456', 'airr_practice_1_0_150460', 'airr_practice_1_0_151910', 'airr_practice_1_0_150462', 'airr_practice_1_0_151925', 'airr_practice_1_0_150463', 'airr_practice_1_0_151911', 'airr_practice_1_0_150466', 'airr_practice_1_0_156268', 'airr_practice_1_0_150453', 'airr_practice_1_0_150448', 'airr_practice_1_0_150451', 
'airr_practice_1_0_150452', 'airr_practice_1_0_151918', 'airr_practice_1_0_151923', 'airr_practice_1_0_151908', 'airr_practice_1_0_150464', 'airr_practice_1_0_150444', 'airr_practice_1_0_149741', 'airr_practice_1_0_149738', 'airr_practice_1_0_150447', 'airr_practice_1_0_150446', 'airr_practice_1_0_150458', 'airr_practice_1_0_149729', 'airr_practice_1_0_150455', 'airr_practice_1_0_150443', 'airr_practice_1_0_150442', 'airr_practice_1_0_150441', 'airr_practice_1_0_150440', 'airr_practice_1_0_149737', 'airr_practice_1_0_149746', 'airr_practice_1_0_149731', 'airr_practice_1_0_149743', 'airr_practice_1_0_149739', 'airr_practice_1_0_149718', 'airr_practice_1_0_149736', 'airr_practice_1_0_149740', 'airr_practice_1_0_149724', 'airr_practice_1_0_149742', 'airr_practice_1_0_151926', 'airr_practice_1_0_150449', 'airr_practice_1_0_149745', 'airr_practice_1_0_149714', 'airr_practice_1_0_149728', 'airr_practice_1_0_149734', 'airr_practice_1_0_149723', 'airr_practice_1_0_149733', 'airr_practice_1_0_149730', 'airr_practice_1_0_149721', 'airr_practice_1_0_149713', 'airr_practice_1_0_149717', 'airr_practice_1_0_149712', 'airr_practice_1_0_150439', 'airr_practice_1_0_149725', 'airr_practice_1_0_149715', 'airr_practice_1_0_149727', 'airr_practice_1_0_152525', 'airr_practice_1_0_152538', 'airr_practice_1_0_152526', 'airr_practice_1_0_149716', 'airr_practice_1_0_152528', 'airr_practice_1_0_152531', 'airr_practice_1_0_149719', 'airr_practice_1_0_152547', 'airr_practice_1_0_152535', 'airr_practice_1_0_149722', 'airr_practice_1_0_152534', 'airr_practice_1_0_152532', 'airr_practice_1_0_152527', 'airr_practice_1_0_152542', 'airr_practice_1_0_149726', 'airr_practice_1_0_152524', 'airr_practice_1_0_152544', 'airr_practice_1_0_152539', 'airr_practice_1_0_152551', 'airr_practice_1_0_152546', 'airr_practice_1_0_152543', 'airr_practice_1_0_152540', 'airr_practice_1_0_152548', 'airr_practice_1_0_156210', 'airr_practice_1_0_156213', 'airr_practice_1_0_152553', 'airr_practice_1_0_152550', 
'airr_practice_1_0_152530', 'airr_practice_1_0_156223', 'airr_practice_1_0_156226', 'airr_practice_1_0_152552', 'airr_practice_1_0_152549', 'airr_practice_1_0_156224', 'airr_practice_1_0_156220', 'airr_practice_1_0_156214', 'airr_practice_1_0_156212', 'airr_practice_1_0_156222', 'airr_practice_1_0_156217', 'airr_practice_1_0_156225', 'airr_practice_1_0_156221'].

Because it's a big long list, it's difficult to eyeball to get a sense of how bad the problem is. It also doesn't indicate which annotator, if any, might be the problem. Given that our 1.0 production standard for annotations was "annotator messes up less than 3% of the time", we need something that will let people look at the report and make a go/no-go decision.

I'd also suggest that the journal file/path should not be an option, as it's not optional. It should be an nargs argument.
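
As an illustration of the difference, here is a sketch using the standard library's `argparse`. The project's actual CLI framework may differ, and `nargs="+"` and the `--verbose` long name are assumptions for the example; only `-v` appears in this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="consistency-check")
    # Positional argument: it cannot be omitted, unlike a "--journal-path" option.
    # nargs="+" (an assumption here) accepts one or more files or directories.
    parser.add_argument("journal_path", nargs="+", help="journal file or directory")
    # "--verbose" is an assumed long name for the -v flag.
    parser.add_argument("-v", "--verbose", action="store_true", help="show failure details")
    return parser
```

With this shape, running the command without a path fails immediately with a usage error instead of relying on a required option.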

Contributor Author

bkorycki commented Dec 3, 2024

@wpietri

> Because it's a big long list, it's difficult to eyeball to get a sense of how bad the problem is.

I can change it to print the number of failing prompts instead of listing them, if you'd like. But I thought it would be useful to output the problematic prompt UIDs so that users can investigate further on their own. That was useful to me when I ran the consistency checks on the actual journals. Maybe a good compromise would be to output the number of prompts that failed the check, plus up to 3 failing prompt UIDs?
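
A rough sketch of that compromise, for illustration only. The function name `summarize_failures` and the message wording are hypothetical, not what the checker prints.

```python
def summarize_failures(failed_uids: list, max_shown: int = 3) -> str:
    """Report how many prompts failed, plus a few example UIDs."""
    n = len(failed_uids)
    if n == 0:
        return "0 prompts failed"
    shown = ", ".join(failed_uids[:max_shown])
    # Mention how many UIDs were left out so the count stays interpretable.
    more = f" (and {n - max_shown} more)" if n > max_shown else ""
    return f"{n} prompts failed, e.g. {shown}{more}"
```

The full list of UIDs could still be printed in verbose mode for anyone who wants to investigate further.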

> It also doesn't indicate which annotator, if any, might be the problem.

AnnotationsMergedCorrectly checks for issues within the merging/voting strategy, not individual annotators. There are separate checks for individual annotators.

> Given that our 1.0 production standard for annotations was "annotator messes up less than 3% of the time", we need something that will let people look at the report and make a go/no-go decision.

That is checked by MinValidAnnotatorItems. You can look at that column in the table of results to make that quick go/no-go decision:
Screenshot 2024-12-03 at 2 04 56 PM
You can also see more details about that failure if you run with -v:
Screenshot 2024-12-03 at 2 05 10 PM
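
For readers without the screenshots, the kind of threshold test behind that go/no-go decision might look like the sketch below. It is not the actual MinValidAnnotatorItems implementation: the function name, the strict "less than" comparison, and the handling of an empty run are assumptions. Only the 3% figure comes from this thread.

```python
def passes_min_valid_items(valid: int, total: int, max_invalid_fraction: float = 0.03) -> bool:
    """Return True if the annotator's invalid-item rate is under the threshold."""
    if total == 0:
        # Assumption: an annotator with no items cannot pass the check.
        return False
    invalid_fraction = (total - valid) / total
    return invalid_fraction < max_invalid_fraction
```

For example, an annotator with 980 valid items out of 1000 (2% invalid) would pass, while one with 960 valid items (4% invalid) would not.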

> I'd also suggest that the journal file/path should not be an option, as it's not optional. It should be an nargs argument.

Updated!

@bkorycki bkorycki temporarily deployed to Scheduled Testing December 3, 2024 22:23 — with GitHub Actions Inactive
@bkorycki bkorycki temporarily deployed to Scheduled Testing December 3, 2024 22:24 — with GitHub Actions Inactive
@bkorycki bkorycki requested a review from wpietri December 3, 2024 22:25
Contributor

wpietri commented Dec 4, 2024

Maybe it's clearer if I give an example. If something is good, a green check is great. But if it isn't, I want details useful for understanding and fixing the problem. A long list of IDs is better than nothing, because it tells me where I can start looking. But it doesn't tell me what the consistency checker thinks is wrong. I'd rather have something like this:

Results for Test x SUT level checks
┌──────────────────────────────────────────────────────────┬────────────────┬─────────────────┬────────────────────┬──────────────────┬────────────────────┬────────────────┐
│ suts, tests                                              │  each prompt   │  each prompt    │  each response     │  each item       │  num items         │  annotations   │
│                                                          │  queued once   │  responded to   │  translated once   │  measured once   │  finished equals   │  merged        │
│                                                          │                │  once           │                    │                  │  measured items    │  correctly     │
├──────────────────────────────────────────────────────────┼────────────────┼─────────────────┼────────────────────┼──────────────────┼────────────────────┼────────────────┤
│ mistralai-ministral-8b-2410, safe-cse-en_us-practice-1.0 │ ✅             │ ✅               │ ✅                  │ ✅                │ ✅                  │ ❌ 995/1000     │
├──────────────────────────────────────────────────────────┼────────────────┼─────────────────┼────────────────────┼──────────────────┼────────────────────┼────────────────┤
│ mistralai-ministral-8b-2410, safe-dfm-en_us-practice-1.0 │ ✅             │ ✅               │ ✅                  │ ✅                │ ✅                  │ ❌ 112/1000     │
├──────────────────────────────────────────────────────────┼────────────────┼─────────────────┼────────────────────┼──────────────────┼────────────────────┼────────────────┤

Failed checks for Test x SUT level checks:
AnnotationsMergedCorrectly: Expected that measure_item_quality returns majority vote for each prompt. SUT mistralai-ministral-8b-2410/test safe-vcr-en_us-practice-1.0 have mismatching values on the following prompts:
        {'airr_practice_1_0_85906': "votes are [safe, safe, invalid] but result is unsafe",
         'airr_practice_1_0_23174':  "too few votes but result is safe and valid",
         'airr_practice_1_0_85907': "votes are [invalid, invalid, invalid] but result is safe",
 [...]

The changes I made:

  • rendered the column titles as plain text and wrapped them to a sane width for a table
  • sorted the hazards alphabetically
  • included the number of successes if something is downchecked (percentage would be fine too)
  • turned the list of IDs into a dict with explanations and then line-wrapped it so it can be scanned visually

Does that make sense?
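
To make the suggested output concrete, here is a minimal sketch of producing per-prompt explanations and a success count for a table cell. Everything in it is hypothetical: the strict-majority rule, the function names, and the label strings are assumptions for illustration, not the project's merging strategy or the checker's real output.

```python
from collections import Counter
from typing import Optional


def majority_vote(votes: list) -> Optional[str]:
    """Return the label with a strict majority, or None if there is none."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def explain_mismatches(votes_by_prompt: dict, merged: dict) -> dict:
    """Map each mismatching prompt UID to a short human-readable reason."""
    problems = {}
    for uid, votes in sorted(votes_by_prompt.items()):
        actual = merged.get(uid)
        if majority_vote(votes) != actual:
            problems[uid] = f"votes are [{', '.join(votes)}] but result is {actual}"
    return problems


def status_cell(num_ok: int, total: int) -> str:
    """Render a table cell: a check mark, or a cross with the success count."""
    return "✅" if num_ok == total else f"❌ {num_ok}/{total}"
```

A report built this way shows both how widespread a failure is (the cell) and what the checker believes is wrong for each prompt (the dict).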

Also, I suspect the reason I'm finding an error here is that this is applying an ensemble-only check to something run with the default evaluator. If this is a tool only for us, it should go in modelbench-private. If it's a tool for the public, it should work with their runs.

@bkorycki bkorycki temporarily deployed to Scheduled Testing December 6, 2024 20:05 — with GitHub Actions Inactive
Contributor Author

bkorycki commented Dec 6, 2024

@wpietri I applied most of your changes and also modified it to run the private-annotator checks only for official benchmark runs.
The only change I didn't fully apply is "included the number of successes if something is downchecked (percentage would be fine too)". Not all the checks can be expressed as a fraction (e.g. checks for extras or duplicates). Instead, I added the percentage of incorrectly merged annotation items to the warning message.

@bkorycki bkorycki merged commit a893b5f into main Dec 9, 2024
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2024
@rogthefrog rogthefrog deleted the more-consistency-checks branch December 13, 2024 02:43
3 participants