Curating ranked papers for relevance #1351

Merged: bgyori merged 13 commits into biopragmatics:main on Jan 14, 2025

Conversation

nalikapalayoor
Contributor

This pull request adds the curations for papers that I have found to be irrelevant to the Bioregistry. These papers came from the potentially relevant paper ranking table. I will curate relevant entries in separate PRs.

@bgyori
Contributor

bgyori commented Jan 11, 2025

Thanks @nalikapalayoor! One minor question, also for @nagutm: I think most of the irrelevant entries are irrelevant fundamentally because the paper is not about an identifiers resource (or about a provider for an identifiers resource). Are we overusing the irrelevant_other tag, given that we also have two more specific tags, not_identifiers_resource and no_website?

(Image from https://biopragmatics.github.io/bioregistry/curation/literature)

@bgyori
Contributor

bgyori commented Jan 11, 2025

Also, after we clarify the curation tags, we should lint this file to make sure it's sorted, as the test suggests.
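
As a rough illustration of the sort check mentioned above, here is a minimal Python sketch. It assumes the file is a tab-separated table with one header row and that rows are ordered by the first column, compared as strings. The function names and the column names in the usage note are hypothetical, and the project's actual lint step and test may work differently.

```python
import csv
import io


def _read_rows(text: str) -> list[list[str]]:
    """Parse TSV text into rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def is_sorted(text: str, key_column: int = 0) -> bool:
    """Return True if the data rows (everything after the header) are in order."""
    keys = [row[key_column] for row in _read_rows(text)[1:]]
    return keys == sorted(keys)


def sort_tsv_rows(text: str, key_column: int = 0) -> str:
    """Return TSV text with the header kept first and data rows sorted."""
    rows = _read_rows(text)
    if not rows:
        return text
    header, body = rows[0], rows[1:]
    # Keys are compared as strings, which is an assumption of this sketch.
    body.sort(key=lambda row: row[key_column])
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(body)
    return out.getvalue()
```

A check like `is_sorted(text)` could run in a test, while `sort_tsv_rows(text)` could be the lint step that rewrites the file when the check fails.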

@nagutm
Collaborator

nagutm commented Jan 11, 2025

I think that the not_identifier_resource definition might be too restrictive and could be expanded to include papers describing software repositories, data visualization tools, and any other externally linked 'resource' along these lines. Under the current definitions, these types of papers are classified as irrelevant_other.

I think that the no_website tag can be modified to non_resource_paper. This would describe any papers that are self-contained and don't describe/link to any externally developed resources.

The irrelevant_other tag can be retained as a catch-all that would cover anything else.

I think these revised definitions and tags would make the distribution of tags for irrelevant papers more balanced and less one-sided than it currently is.

@bgyori
Contributor

bgyori commented Jan 11, 2025

@nagutm is it straightforward to make these changes to the table overall (i.e., past irrelevant_other tags that might fall under the definition of some of the more specific categories)? If so, we could do that.

@nagutm
Collaborator

nagutm commented Jan 12, 2025

Changing the tags for all previous papers tagged as irrelevant_other would require us to manually check each of the papers and re-classify them. It's doable but might take a little time. I'll update the guide documentation and CurationRelevance vocabulary files before updating the curated_papers.tsv file.

@nalikapalayoor
Contributor Author

Hi, thank you for looking at this! I can go in and reclassify the papers associated with this PR based on those redefined tags. I will also relint the file and push the changes later today!

@nalikapalayoor
Contributor Author

I have updated the tags for the papers determined to be irrelevant based on the new definitions discussed:

not_identifiers_resource: papers with software repos, data visualization tools, externally linked resources, as well as databases not related to defining new identifiers

non_resource_papers: self-contained papers that don't link to external resources

irrelevant_other: all other irrelevant papers

I also linted the curated_papers.tsv file after updating these tags.
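
For illustration only, the three tags listed above could be modeled as a small enumeration, with a helper that reports how the tags are distributed across irrelevant papers. This is a sketch and not the project's CurationRelevance vocabulary: the class name `IrrelevantTag`, the `DESCRIPTIONS` mapping, and `tag_distribution` are hypothetical, and the tag spellings follow this comment (the thread also uses slightly different spellings).

```python
from collections import Counter
from enum import Enum


class IrrelevantTag(str, Enum):
    """Hypothetical mirror of the three tags described in this comment."""

    not_identifiers_resource = "not_identifiers_resource"
    non_resource_papers = "non_resource_papers"
    irrelevant_other = "irrelevant_other"


# Short definitions paraphrased from the comment above.
DESCRIPTIONS = {
    IrrelevantTag.not_identifiers_resource: (
        "software repos, data visualization tools, externally linked resources, "
        "and databases not related to defining new identifiers"
    ),
    IrrelevantTag.non_resource_papers: (
        "self-contained papers that don't link to external resources"
    ),
    IrrelevantTag.irrelevant_other: "all other irrelevant papers",
}


def tag_distribution(tags: list[str]) -> dict[str, float]:
    """Return the share of each tag; raises ValueError on an unknown tag."""
    counts = Counter(IrrelevantTag(tag) for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag.value: counts[tag] / total for tag in IrrelevantTag}
```

Running `tag_distribution` over the tag column before and after reclassification would show whether irrelevant_other still dominates.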

@bgyori
Contributor

bgyori commented Jan 13, 2025

This looks good. We just need to update every place that refers to the no_website tag so that it uses the new non_resource_papers name, and update the definitions wherever they changed.
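
One way to approach the rename is a whole-token search and replace over each file's text. The sketch below is only an assumption about how that could be done, not the change that was actually made: `rename_tag` is a hypothetical helper, and the new name is taken from this comment.

```python
import re

OLD_TAG = "no_website"
NEW_TAG = "non_resource_papers"  # spelling as used in this comment

# Match the old tag only as a whole token, so longer identifiers that merely
# contain it (for example "no_website_extra") are left untouched.
_PATTERN = re.compile(
    rf"(?<![A-Za-z0-9_]){re.escape(OLD_TAG)}(?![A-Za-z0-9_])"
)


def rename_tag(text: str) -> tuple[str, int]:
    """Return the updated text and the number of replacements made."""
    return _PATTERN.subn(NEW_TAG, text)
```

The replacement count makes it easy to list which files still mention the old tag before committing the rename.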

@nagutm
Collaborator

nagutm commented Jan 13, 2025

See #1359 for these updated changes

bgyori pushed a commit that referenced this pull request Jan 14, 2025
This pull request updates the Curation Relevance vocabulary to:
- Expand the definition of `not_identifier_resource`
- Replace the `no_website` tag with `non_resource_paper`

See #1351 (comment) for a full explanation of what these tags mean and why they were implemented.

codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.58%. Comparing base (8950e70) to head (94cf20b).
Report is 259 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1351      +/-   ##
==========================================
+ Coverage   42.51%   46.58%   +4.07%     
==========================================
  Files         117      118       +1     
  Lines        8327     8297      -30     
  Branches     1963     1364     -599     
==========================================
+ Hits         3540     3865     +325     
+ Misses       4582     4245     -337     
+ Partials      205      187      -18     


@bgyori bgyori merged commit 5c2cbd4 into biopragmatics:main Jan 14, 2025
14 checks passed
@nalikapalayoor nalikapalayoor deleted the irrelevant branch January 14, 2025 17:07
3 participants