
Updated badge criteria, and completed for Allen. Detailed uncertainties in logbook. Plus a few other minor changes.
amyheather committed Jun 4, 2024
1 parent ba875be commit eaf2f53
Showing 4 changed files with 102 additions and 53 deletions.
107 changes: 55 additions & 52 deletions evaluation/badges.qmd
@@ -16,95 +16,95 @@ To use this template:
Although this script uses Python, it is applicable regardless of the language used by the study you are evaluating.
-->

This page evaluates the extent to which Monks et al. 2016 meets the criteria of badges related to reproducibility from various organisations and journals.
This page evaluates the extent to which the author-published research artefacts meet the criteria of badges related to reproducibility from various organisations and journals.

```{python}
import numpy as np
import pandas as pd
criteria = {
'archive': 'Code is stored in a permanent archive that is publicly and openly accessible',
'id': 'It has a persistent identifier (e.g. DOI)',
'license': 'It has an open license',
'complete_open': 'Complete set of materials shared (as would be needed to fully reproduce article)',
'meta': 'Metadata describes data/code sufficiently to enable reproduction (e.g. package versions)',
'statement': 'Manuscript has data availability statement',
# DUPE with complete_open above (to reflect on)
'complete_review': 'Complete set of materials shared (as would be needed to fully reproduce article)',
# DUPE with meta above (to reflect on)
'describe_minimal': 'There is a minimal but sufficient description of artefacts',
'describe_careful': 'There is a more detailed, careful documentation of artefacts',
'artefacts_structure': 'Artefacts are well structured to facilitate reuse, adhering to norms and standards of research community',
'archive': 'Stored in a permanent archive that is publicly and openly accessible',
'id': 'Has a persistent identifier',
'license': 'Includes an open license',
'complete': 'Complete set of materials shared (as would be needed to fully reproduce article)',
'documentation_sufficient': 'Artefacts are sufficiently documented (e.g. to understand how it works, to enable it to be run, including package versions)',
'documentation_careful': 'Artefacts are carefully documented (more than sufficient - e.g. to the extent that reuse is facilitated)',
'relevant': '''Artefacts are relevant to and contribute to the article's results''',
'execute': 'Scripts can be successfully executed',
'structure': 'Artefacts are well structured/organised (e.g. to the extent that reuse is facilitated, adhering to norms and standards of research community)',
'regenerated': '''Independent party regenerated results using the author's research artefacts''',
'hour': 'Reproduced within approximately one hour (excluding compute time)',
# DUPE with artefacts_structure and meta/describe
'reproduce_organise': 'Requires data and scripts to be well-organised, clearly documented and with a README file with step-by-step instructions on how to reproduce results in the manuscript'
# This criterion is kept separate from documentation_careful, as it specifically requires a README file
'documentation_readme': 'Artefacts are clearly documented and accompanied by a README file with step-by-step instructions on how to reproduce results in the manuscript'
}
badge_names = {
# ISSUE: need to make separate criteria for code and data (unless just do code?)
'open_niso': 'NISO "Open Research Objects"',
# Open objects
'open_niso': 'NISO "Open Research Objects (ORO)"',
'open_niso_all': 'NISO "Open Research Objects - All (ORO-A)"',
'open_acm': 'ACM "Artifacts Available"',
'open_cos_data': 'COS "Open Data"',
'open_cos_materials': 'COS "Open Materials"',
'open_cos_code': 'COS "Open Code"',
# ISSUE: need to make separate criteria for code and data (unless just do code?)
'open_ieee_code': 'IEEE "Code Available"',
'open_ieee_data': 'IEEE "Datasets Available"',
'open_springer': 'Springer Nature "Badge for Open Data"',
'open_cos': 'COS "Open Code"',
'open_ieee': 'IEEE "Code Available"',
# Object review
'review_acm_functional': 'ACM "Artifacts Evaluated - Functional"',
'review_acm_reusable': 'ACM "Artifacts Evaluated - Reusable"',
# ISSUE: need to make separate criteria for code and data (unless just do code?)
'review_ieee_code': 'IEEE "Code Reviewed"',
'review_ieee_data': 'IEEE "Datasets Reviewed"',
'reproduce_niso': 'NISO "Results Reproduced"',
'review_ieee': 'IEEE "Code Reviewed"',
# Results reproduced
'reproduce_niso': 'NISO "Results Reproduced (ROR-R)"',
'reproduce_acm': 'ACM "Results Reproduced"',
# ISSUE: need to make separate criteria for code and data (unless just do code?)
'reproduce_ieee_code': 'IEEE "Code Reproducible"',
'reproduce_ieee_data': 'IEEE "Dataset Reproducible"',
'reproduce_ieee': 'IEEE "Code Reproducible"',
'reproduce_psy': 'Psychological Science "Computational Reproducibility"'
}
open_cos = ['archive', 'id', 'license', 'complete_open', 'meta']
badges = {
'open_niso': ['archive', 'id', 'license', 'complete_open'],
# Open objects
'open_niso': ['archive', 'id', 'license'],
'open_niso_all': ['archive', 'id', 'license', 'complete'],
'open_acm': ['archive', 'id'],
'open_cos_data': open_cos,
'open_cos_materials': open_cos,
'open_cos_code': open_cos,
'open_ieee_code': ['complete_open'],
# etc. etc.
'open_cos': ['archive', 'id', 'license', 'complete', 'documentation_sufficient'],
'open_ieee': ['complete'],
# Object review
'review_acm_functional': ['documentation_sufficient', 'relevant', 'complete', 'execute'],
'review_acm_reusable': ['documentation_sufficient', 'documentation_careful', 'relevant', 'complete', 'execute', 'structure'],
'review_ieee': ['complete', 'execute'],
# Results reproduced
'reproduce_niso': ['regenerated'],
'reproduce_acm': ['regenerated'],
'reproduce_ieee': ['regenerated'],
'reproduce_psy': ['regenerated', 'hour', 'structure', 'documentation_readme'],
}
```

TO DO: Change to full list of criteria

TO DO: Change to full list of badges

TO DO: Add a table (perhaps minimisable) that summarises the criteria of each badge, so we can see how/why each badge was or was not met.
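
A minimal sketch of what that summary table might look like - an assumption, not yet implemented, relying on the `criteria`, `badge_names` and `badges` dictionaries above:

```{python}
# Hypothetical sketch: one row per badge, one column per criterion,
# with True marking the criteria that badge requires.
summary = pd.DataFrame(
    [[criterion in required for criterion in criteria]
     for required in badges.values()],
    index=[badge_names[badge] for badge in badges],
    columns=list(criteria)
)
summary
```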

```{python}
# Example (not done for Monks et al. 2016 yet)
# Example (not done for Allen et al. 2020 yet)
# Based on the criteria dictionary above, populate with 1 and 0
eval = pd.Series({
'archive': 1,
'id': 1,
'license': 0,
'complete_open': 1,
'meta': 1,
'statement': 1
# etc. etc.
'license': 1,
'complete': 1,
'documentation_sufficient': 1,
'documentation_careful': 0,
'relevant': 1,
'execute': 1,
'structure': 1,
'regenerated': 1,
'hour': 0,
'documentation_readme': 0,
})
```

✅ = Meets criteria ⬜ = Does not meet criteria 📝 = Not yet evaluated

```{python}
# Print compliance to each criteria
for key, value in eval.items():
if value == 1:
icon = '✅'
else:
elif value == 0:
icon = '⬜'
else:
icon = '📝'
print(f'{icon} {criteria[key]}')
```

@@ -121,4 +121,7 @@ for key, value in award.items():
else:
icon = '⬜'
print(f'{icon} {badge_names[key]}')
award_list = list(award.values())
print(f'Met criteria for {sum(award_list)} of {len(award_list)} badges ({round(sum(award_list)/len(award_list)*100, 1)}%)')
```
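
The code that builds `award` is collapsed out of this hunk. A plausible sketch, assuming a badge is awarded only when every one of its criteria is met (an assumption about the hidden lines, not a copy of them):

```{python}
# Hypothetical sketch of the collapsed code: award a badge only if
# every criterion in its list was met in `eval`.
award = {
    badge: all(eval[criterion] == 1 for criterion in criteria_list)
    for badge, criteria_list in badges.items()
}
```

Note that `eval` here is the pandas Series defined earlier (which shadows the Python builtin, as in the repository's own code).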
4 changes: 4 additions & 0 deletions evaluation/posts/2024_05_23/index.qmd
@@ -157,6 +157,10 @@ Found that:

This was then successful, and built quickly (within 30s). A quick glance over the environment confirmed that it looked to have the correct Python version and packages.

::: {.callout-tip}
This was fixed in a later version of the repository, but I used an earlier one that matched the publication date. Tom suggested a learning point here: use a CHANGELOG to clearly show changes in future versions.
:::

### 13.26-13.52 Reproduction

* Copied sim/ and sim_replicate.py into reproduction/.
4 changes: 3 additions & 1 deletion evaluation/posts/2024_06_03/index.qmd
@@ -7,7 +7,7 @@ categories: [reproduce]

::: {.callout-note}

Found seed that produces results fairly visually similar to paper. Total time used: 7h 6m (17.8%). Reproduction stage complete.
Chose a seed that produced results fairly visually similar to the paper, and decided that each figure had now been successfully reproduced. Total time used: 7h 6m (17.8%). Reproduction stage complete.

:::

@@ -115,6 +115,8 @@ Simplified the reproduction notebook to just use the base number 2700, deleting

Created a reproduction success page, presenting those figures alongside the figures from the original study.

Not archiving on Zenodo, as this is the test run.

## Timings

```{python}
40 changes: 40 additions & 0 deletions evaluation/posts/2024_06_04/index.qmd
@@ -0,0 +1,40 @@
---
title: "Day 5"
author: "Amy Heather"
date: "2024-06-04"
categories: [guidelines]
---

::: {.callout-note}

Evaluated the study against badges.

:::

## Work log

### Badges

Evaluating the artefacts from https://zenodo.org/records/3760626 (as copied into `original_study/`).

Felt uncertain about these criteria:

* `documentation_sufficient` - the environment was missing one package - but it had an environment file, package versions, and a README explaining how to set up the environment and referring to a notebook which runs the code that produces the output - hence, despite the missing package, I felt it met this criterion
* `documentation_careful` - I feel the documentation is minimal, but it was still sufficient for running the model - so I presume it might meet this criterion? Unless "reuse" refers to being able to use and change the model - in which case, there is not much guidance on how to change the parameters, only on how to run it as in the paper - and so I don't think it meets this criterion. I think that interpretation also makes sense, as it then distinguishes it from "sufficient"
  * The criterion this relates to is from @association_for_computing_machinery_acm_artifact_2020: "The artifacts associated with the paper are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated – Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing is facilitated. In particular, norms and standards of the research community for artifacts of this type are strictly adhered to"
* `execute` - yes, but with one change to the environment. Do they allow changes? I've selected yes, assuming that minor troubleshooting is allowed, and that the distinction between this and reproduction is that this is just about running scripts, whilst reproduction is about getting sufficiently similar results. Execution is required in the ACM and IEEE criteria:
  * From @association_for_computing_machinery_acm_artifact_2020: "Included scripts and/or software used to generate the results in the associated paper can be successfully executed, and included data can be accessed and appropriately manipulated", with no mention of whether minor troubleshooting is allowed
  * From @institute_of_electrical_and_electronics_engineers_ieee_about_nodate: "runs to produce the outputs desired", with no mention of whether minor troubleshooting is allowed
* `regenerated` - as above, unclear whether modification is allowed. For simplicity, I have assumed the definition we use - that you can troubleshoot - but this might not align with the journals'. Is this an issue?
* `hour` - failed this, but we weren't trying to do it within an hour. If I had been asked to do that (rather than spend time reading and thinking beforehand), I anticipate I could've run it (without going on to then add seeds etc.). Hence, does it pass this?
  * Psychological Science (@hardwicke_transparency_2023, @association_for_psychological_science_aps_psychological_2023) require reproduction within an hour, which I think implies some minor troubleshooting would be allowed?
  * Others don't specify either way
* `documentation_readme` - I wouldn't say it explicitly meets this criterion, although the repository was simple enough that results could be reproduced without one

## Suggested changes for protocol/template

✅ = Made the change.

Protocol:

* Suggest that uncertainties over whether a badge criterion is met could be discussed within the STARS team
