don't use UMLS as primary identifier for taxonomy #71

Open
balhoff opened this issue Jun 30, 2023 · 2 comments · May be fixed by TranslatorSRI/Babel#167

Comments


balhoff commented Jun 30, 2023

I would rather receive an NCBI taxonomy identifier in most cases. However, there are many species that aren't in NCBI, so some other source might be needed for those (GBIF or Catalogue of Life?). One problematic example: searching for "american goldfinch", I get this result:

[
  {
    "curie": "UMLS:C0326959",
    "label": "Carduelis tristis",
    "synonyms": [
      "Spinus tristis",
      "Fringilla tristis",
      "Carduelis tristis",
      "American goldfinch",
      "Astragalinus tristis",
      "Carduelis tristis (organism)"
    ],
    "types": [
      "biolink:OrganismTaxon",
      "biolink:NamedThing",
      "biolink:Entity"
    ]
  }
]

However, the taxonomically valid name for this species is "Spinus tristis" (listed only as a synonym here). "Carduelis tristis", which is returned as the label, is a taxonomic synonym. See https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=54773&lvl=3&lin=f&keep=1&srchmode=1&unlock and https://verifier.globalnames.org/?capitalize=on&format=html&names=Carduelis+tristis
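To make the request concrete, here is a minimal sketch of choosing a primary identifier by prefix priority. The priority list, the `pick_primary` function, and the two-member clique are all hypothetical and written only for illustration; this is not how Babel actually selects identifiers. The NCBITaxon ID is the one from the NCBI link above.

```python
# Hypothetical prefix ranking: earlier entries are preferred as the primary ID.
PREFIX_PRIORITY = ["NCBITaxon", "UMLS"]


def pick_primary(curies, priority=PREFIX_PRIORITY):
    """Return the CURIE whose prefix ranks highest in `priority`.

    Prefixes not listed in `priority` rank last; ties keep input order.
    """
    def rank(curie):
        prefix = curie.split(":", 1)[0]
        return priority.index(prefix) if prefix in priority else len(priority)

    return min(curies, key=rank)


# Illustrative clique of equivalent identifiers for the American goldfinch.
clique = ["UMLS:C0326959", "NCBITaxon:54773"]
print(pick_primary(clique))  # NCBITaxon:54773
```

A taxon that has no NCBITaxon identifier would still fall back to whatever identifier it does have, which matches the "some other source might be needed" case.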

Contributor

gaurav commented Jul 1, 2023

This situation is slightly worse in Babel 2023jun29, where UMLS:C0326959 is entirely missing, because its semantic type, T012, is no longer mapped correctly into the Biolink model.

The NCBITaxon situation should be an easier fix: it looks like we're only importing "scientific name" and "synonym" (meaning taxonomic synonym, not alternate name) and ignoring "common name" and "genbank common name", which is where the common names live.

Contributor

gaurav commented Jul 3, 2023

The list of possible name_class values we can use, as of the May 1 release of NCBITaxon (I think), is:

     25 	genbank acronym	
    230 	blast name	
    667 	in-part	
   2086 	acronym	
  14641 	common name	
  30328 	genbank common name	
  56575 	equivalent name	
  75081 	includes	
 220185 	type material	
 245827 	synonym	
 670412 	authority	
2503930 	scientific name	
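For anyone who wants to reproduce counts like these, here is a rough sketch that tallies the name class column of the NCBI taxonomy dump's `names.dmp` file, whose fields are separated by a tab, pipe, tab sequence. The function names and the three sample rows are made up for illustration (including the name class assigned to each sample name), and this is not necessarily how the numbers above were produced.

```python
from collections import Counter


def parse_names_line(line):
    """Split one names.dmp row into (tax_id, name_txt, unique_name, name_class)."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    return tuple(fields[:4])


def count_name_classes(lines):
    """Count how often each name class occurs, skipping blank lines."""
    return Counter(parse_names_line(line)[3] for line in lines if line.strip())


# Illustrative rows in names.dmp layout; not copied from a real release.
SAMPLE = [
    "54773\t|\tSpinus tristis\t|\t\t|\tscientific name\t|\n",
    "54773\t|\tCarduelis tristis\t|\t\t|\tsynonym\t|\n",
    "54773\t|\tAmerican goldfinch\t|\t\t|\tcommon name\t|\n",
]

counts = count_name_classes(SAMPLE)
for name_class, n in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{n:8d}\t{name_class}")
```

With a real dump, passing an open file handle for `names.dmp` instead of `SAMPLE` should work the same way, since the function only iterates over lines.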

So we definitely want to add common name and genbank common name so that organism common names will work. We might also want to bring in equivalent name, and we should keep synonym so that taxonomic synonyms still resolve (e.g. Pinus abies is a synonym of the currently accepted name, Picea abies, so we would expect a search for either one to bring back the same taxon). I will need to double-check the rest to make sure we don't need them. I am very surprised but pleased to see the 220,185 references to type material in here!
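As a sketch of the proposed import, the snippet below keeps only the name classes discussed above and uses the scientific name as the label, with everything else becoming a synonym. The `WANTED` set, the function names, the record shape, and the sample rows are assumptions for illustration, not Babel's actual code or real dump contents.

```python
# Name classes proposed for import; "equivalent name" is still a maybe.
WANTED = {
    "scientific name",
    "synonym",
    "common name",
    "genbank common name",
    "equivalent name",
}


def iter_names(lines, wanted=WANTED):
    """Yield (curie, name, name_class) for names.dmp rows in a wanted class."""
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # blank or malformed row
        tax_id, name_txt, _unique_name, name_class = fields[:4]
        if name_class in wanted:
            yield f"NCBITaxon:{tax_id}", name_txt, name_class


def build_records(lines):
    """Group names by taxon: scientific name is the label, the rest are synonyms."""
    records = {}
    for curie, name, name_class in iter_names(lines):
        record = records.setdefault(
            curie, {"curie": curie, "label": None, "synonyms": []}
        )
        if name_class == "scientific name":
            record["label"] = name
        else:
            record["synonyms"].append(name)
    return records


# Illustrative rows in names.dmp layout; the authority row should be dropped.
SAMPLE = [
    "54773\t|\tSpinus tristis\t|\t\t|\tscientific name\t|\n",
    "54773\t|\tCarduelis tristis\t|\t\t|\tsynonym\t|\n",
    "54773\t|\tAmerican goldfinch\t|\t\t|\tcommon name\t|\n",
    "54773\t|\tSpinus tristis (Linnaeus, 1758)\t|\t\t|\tauthority\t|\n",
]

records = build_records(SAMPLE)
print(records["NCBITaxon:54773"])
```

Under these assumptions a search for "American goldfinch" or "Carduelis tristis" could match the synonyms while the returned label stays "Spinus tristis".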
