[bio] Update biomed property embeddings. #4721

Merged 9 commits on Nov 11, 2024
2 changes: 1 addition & 1 deletion deploy/nl/catalog.yaml
@@ -78,7 +78,7 @@ indexes:
bio_ft:
store_type: MEMORY
source_path: ../../tools/nl/embeddings/input/bio
- embeddings_path: gs://datcom-nl-models/bio_ft_2024_11_05_09_59_39/embeddings.csv
+ embeddings_path: gs://datcom-nl-models/bio_ft_2024_11_06_16_50_25/embeddings.csv
model: ft-final-v20230717230459-all-MiniLM-L6-v2
healthcheck_query: "Gene"
base_uae_lance:
@@ -75,24 +75,23 @@
0.31467,
0.308395,
0.307941,
0.30463,
0.282709,
0.281875,
0.280828,
0.280453,
0.274298,
0.271505,
0.270718,
0.269547,
0.259901,
0.259901,
0.248279,
0.247354,
0.24662,
0.246321,
0.246163,
0.242122,
0.231355,
0.229624
0.230288,
0.229624,
0.227736,
0.221732,
0.21856,
0.216932
],
"PROP": [
"phylum",
@@ -106,24 +105,23 @@
"ensemblID",
"simplifiedMolecularInputLineEntrySystem",
"geneticVariantFunctionalCategory",
"imageUrl",
"genomicCoordinates",
"hasGenomicCoordinates",
"chromosomeSize",
"observedAllele",
"hg38GenomicPosition",
"hg19GenomicPosition",
"virusHost",
"strandOrientation",
"ncbiDNASequenceName",
"<-geneID{typeOf:GeneGeneticVariantAssociation}->variantID",
"<-variantID{typeOf:GeneGeneticVariantAssociation}->geneID",
"ncbiProteinAccessionNumber",
"hgncID",
"alleleType",
"hg38GenomicLocation",
"ofVirusSpecies",
"hg19GenomicLocation",
"ncbiTaxonID",
"antigenType"
"ncbiTaxId",
"umlsConceptUniqueID",
"antigenType",
"alleleOrigin",
"<-diseaseID{typeOf:DiseaseGeneAssociation}->geneID",
"antibodyType",
"inChIKey"
]
},
"query_detection_debug_logs": {
@@ -83,17 +83,15 @@
0.583271,
0.581079,
0.576904,
0.543242,
0.539973,
0.532047,
0.530959,
0.51427,
0.540855,
0.512528,
0.46106,
0.444108,
0.398229,
0.394443,
0.39229
0.39229,
0.384802,
0.341795,
0.324578,
0.32201
],
"PROP": [
"typeOfGene",
@@ -106,22 +104,20 @@
"<-diseaseID{typeOf:DiseaseGeneAssociation}->geneID",
"<-compoundID{typeOf:ChemicalCompoundGeneticVariantAssociation}->variantID",
"<-geneID{typeOf:DiseaseGeneAssociation}->diseaseID",
"genomicCoordinates",
"hasGenomicCoordinates",
"antigenType",
"<-variantID{typeOf:ChemicalCompoundGeneticVariantAssociation}->compoundID",
"alleleType",
"<-geneticVariantID{typeOf:DiseaseGeneticVariantAssociation}->diseaseID",
"hg19GenomicPosition",
"hasRNATranscript",
"hg19GenomicLocation",
"hg38GenomicPosition",
"hg38GenomicLocation",
"hgncID",
"antibodyType",
"chromosomeSize",
"referenceAlleleNCBI",
"observedAllele",
"ncbiDNASequenceName",
"alleleOrigin"
"referenceAllele",
"alleleOrigin",
"specializationOf",
"subClassificationOf",
"virusGenus",
"virusHost"
]
},
"query_detection_debug_logs": {
Contributor Author

@chejennifer I'm not sure why we're seeing this diff. I'm removing the hasRNATranscript property which is related to the query, but I don't know why it's affecting entity recognition.

Contributor

Ah, the reason is that we used to read the entity from the context (from the previous query), but we only read from context if at least one of entity or property is detected in the current query. Since no property is detected now, we don't use the context and no entity gets returned. We should actually remove this query from the tests, since it isn't really testing anything anymore.
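
The rule described in this comment can be sketched roughly as follows. This is only an illustration of the stated behavior, not the actual detection code: `Detection`, `resolve_entities`, and `NO_ENTITY_MSG` are hypothetical names, and the message strings are copied from the test golden file in this diff.

```python
from dataclasses import dataclass, field


@dataclass
class Detection:
    """Hypothetical container for what was detected in a single query."""
    entities: list = field(default_factory=list)
    properties: list = field(default_factory=list)


# Failure text as it appears in the updated golden file.
NO_ENTITY_MSG = "Sorry, could not complete your request. No entity found in the query."


def resolve_entities(current: Detection, previous: Detection):
    """Return (entities, user_message) under the rule stated in the comment."""
    # An entity in the current query is used directly.
    if current.entities:
        return list(current.entities), ""
    # Context is consulted only if the current query detected something,
    # which at this point can only be a property.
    if current.properties and previous.entities:
        names = ", ".join(previous.entities)
        message = f"See relevant information for {names} based on the previous query."
        return list(previous.entities), message
    # Nothing detected in the current query: the context is ignored.
    return [], NO_ENTITY_MSG
```

Under this sketch, a follow-up query that used to match `hasRNATranscript` falls back to the previous entity, while the same query with no matching property returns the failure message, which is the change seen in the golden file.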

@@ -1,53 +1,15 @@
{
"client": "test_detect-and-fulfill",
"config": {
"categories": [
{
"blocks": [
{
"columns": [
{
"tiles": [
{
"entities": [
"bio/FGFR1"
],
"type": "ENTITY_OVERVIEW"
}
]
}
]
}
]
}
]
},
"config": {},
"context": {},
"debug": {},
"entities": [
{
"dcid": "bio/FGFR1",
"name": "FGFR1",
"type": ""
}
],
"pastSourceContext": "",
"place": {},
"placeFallback": {},
"placeSource": "UNKNOWN",
"places": [],
"relatedThings": {
"childPlaces": {},
"childTopics": [],
"exploreMore": {},
"mainTopics": [],
"parentPlaces": [],
"parentTopics": [],
"peerPlaces": [],
"peerTopics": []
"failure": "Sorry, could not complete your request. No entity found in the query.",
"place": {
"dcid": "",
"name": "",
"place_type": ""
},
"svSource": "UNKNOWN",
"userMessages": [
"See relevant information for FGFR1 based on the previous query."
"Sorry, could not complete your request. No entity found in the query."
]
}
52 changes: 18 additions & 34 deletions tools/nl/embeddings/input/bio/_preindex.csv
@@ -1,5 +1,4 @@
sentence,dcid
"""UMLS CUI""",unifiedMedicalLanguageSystemConceptUniqueIdentifier
A specific organism or taxonomic group of organisms that are susceptible to be infected by a virus,virusHost
A unique ID for a Medical Subject Heading Descriptor record,medicalSubjectHeadingDescriptorID
A unique ID for a Medical Subject Heading supplementary record,medicalSubjectHeadingSupplementaryRecordID
@@ -17,51 +16,43 @@ Gene associated with a disease,<-diseaseID{typeOf:DiseaseGeneAssociation}->geneI
Gene associated with a genetic variant,<-variantID{typeOf:GeneGeneticVariantAssociation}->geneID
Genetic variant associated with a disease,<-diseaseID{typeOf:DiseaseGeneticVariantAssociation}->geneticVariantID
GeneticVariantGeneAssociation,<-geneID{typeOf:GeneGeneticVariantAssociation}->variantID;<-variantID{typeOf:GeneGeneticVariantAssociation}->geneID
HUGO Gene Nomenclature Committee identifier,hgncID
InChIKey,inChIKey
International Chemical Identifier (InChI) Key,inChIKey
MOA,mechanismOfAction
MeSH descriptor record ID,medicalSubjectHeadingDescriptorID
MeSH supplementary record ID,medicalSubjectHeadingSupplementaryRecordID
NCBI Taxonomy database identifier,ncbiTaxonID
NCBI defined segment of DNA sequence name,ncbiDNASequenceName
NCBI Taxonomy database identifier,ncbiTaxId
NCBI protein accession number,ncbiProteinAccessionNumber
Name used by NIH NCBI to refer to a segment of DNA sequence,ncbiDNASequenceName
OMIM database identifier,omimID
Origin of variant allele,alleleOrigin
RNA transcript that a gene has,hasRNATranscript
Recorded transcript,hasRNATranscript
Reference genomic sequence from dbSNP,referenceAlleleNCBI
Reference genomic sequence from dbSNP,referenceAllele
Simplified Molecular Input Line Entry System (SMILE),simplifiedMolecularInputLineEntrySystem
Size of chromosome,chromosomeSize
Systematiized Nomenclature of Medicine (SNOMED) clinical terms (CT) code,snomedCT
The allele of a genetic variant observed within a population,alleleType
"The disease diagnosis code for version 10 of the International Classification of Diseases (ICD), Clinical Modification",icd10CMCode
The genomic location of a genetic variant using the hg19 assembly,hg19GenomicLocation
The genomic location of a genetic variant using the hg38 assembly,hg38GenomicLocation
The genomic position of a genetic variant using the hg19 assembly,hg19GenomicPosition
The genomic position of a genetic variant using the hg38 assembly,hg38GenomicPosition
The method by which a drug is administered,administrationRoute
The name of the disease,diseaseName
The orientation of the strand on which an annotation is located,strandOrientation
The sequences of the observed alleles from rs-fasta files.,observedAllele
The species of a virus isolate,ofVirusSpecies
The strand on which a given annotation is located,strandOrientation
The type of gene,typeOfGene
Type of allele,alleleType
Unified Medical Language System (UMLS) Concept Unique Identifier (CUI),unifiedMedicalLanguageSystemConceptUniqueIdentifier
UMLS CUI,umlsConceptUniqueID
Unified Medical Language System (UMLS) Concept Unique Identifier (CUI),umlsConceptUniqueID
Variant allele origin,alleleOrigin
activeIngredient,activeIngredient
administrationRoute,administrationRoute
alleleOrigin,alleleOrigin
alleleType,alleleType
antibodyType,antibodyType
antigenType,antigenType
availableStrength,availableStrength
chemblID,chemblID
chemical compound associated with a genetic variant,<-variantID{typeOf:ChemicalCompoundGeneticVariantAssociation}->compoundID
chromosomeSize,chromosomeSize
class,class
"component that provides pharmacological activity or other direct effect in the diagnosis, cure, mitigation, treatment, or prevention of disease, or to affect the structure or any function of the body of man or animals",activeIngredient
diseaseName,diseaseName
dosageForm,dosageForm
dose approved for a drug,availableStrength
ensemblID,ensemblID
full name of the gene,fullName
fullName,fullName
@@ -70,43 +61,36 @@ geneID,geneID
genetic variant associated with a chemical compound,<-compoundID{typeOf:ChemicalCompoundGeneticVariantAssociation}->variantID
genetic variant associated with a gene,<-geneID{typeOf:GeneGeneticVariantAssociation}->variantID
geneticVariantFunctionalCategory,geneticVariantFunctionalCategory
genomic coordinates,genomicCoordinates
genomicCoordinates,genomicCoordinates
genomic coordinates,hasGenomicCoordinates
genomicCoordinates,hasGenomicCoordinates
genus of a virus species,virusGenus
hasRNATranscript,hasRNATranscript
hg19GenomicLocation,hg19GenomicLocation
hg19GenomicPosition,hg19GenomicPosition
hg38GenomicLocation,hg38GenomicLocation
hg38GenomicPosition,hg38GenomicPosition
hgncID,hgncID
host of a virus,virusHost
icd10CMCode,icd10CMCode
imageUrl,imageUrl
mechanismOfAction,mechanismOfAction
medicalSubjectHeadingDescriptorID,medicalSubjectHeadingDescriptorID
medicalSubjectHeadingSupplementaryRecordID,medicalSubjectHeadingSupplementaryRecordID
ncbiDNASequenceName,ncbiDNASequenceName
ncbiProteinAccessionNumber,ncbiProteinAccessionNumber
ncbiTaxonID,ncbiTaxonID
ncbiTaxID,ncbiTaxId
number of nucleotides in a chromosome,chromosomeSize
observedAllele,observedAllele
ofVirusSpecies,ofVirusSpecies
omimID,omimID
phylum,phylum
physical form in which a drug is produced and dispensed,dosageForm
preferred disease name for the concept specified by disease identifiers,diseaseName
reference allele,referenceAlleleNCBI
referenceAlleleNCBI,referenceAlleleNCBI
reference allele,referenceAllele
referenceAllele,referenceAllele
referenceAlleleNCBI,referenceAllele
simplifiedMolecularInputLineEntrySystem,simplifiedMolecularInputLineEntrySystem
snomedCT,snomedCT
specialization of,specializationOf
specializationOf,specializationOf
strandOrientation,strandOrientation
subClassificationOf,subClassificationOf
subclassification of,subClassificationOf
the biochemcial interaction through which a drug produces a pharmacological effect,mechanismOfAction
type of antibody,antibodyType
type of antigen,antigenType
typeOfGene,typeOfGene
unifiedMedicalLanguageSystemConceptUniqueIdentifier,unifiedMedicalLanguageSystemConceptUniqueIdentifier
url to an image of what the biological specimen looks like,imageUrl
umlsConceptUniqueID,umlsConceptUniqueID
virusGenus,virusGenus
virusHost,virusHost
what the entity looks like,imageUrl
21 changes: 7 additions & 14 deletions tools/nl/embeddings/input/bio/sheets_svs.csv
@@ -1,30 +1,26 @@
dcid,sentence
ofVirusSpecies,ofVirusSpecies;The species of a virus isolate
virusHost,virusHost;A specific organism or taxonomic group of organisms that are susceptible to be infected by a virus;host of a virus
ncbiTaxonID,ncbiTaxonID;NCBI Taxonomy database identifier
diseaseName,diseaseName;preferred disease name for the concept specified by disease identifiers;The name of the disease
observedAllele,observedAllele;The sequences of the observed alleles from rs-fasta files.
referenceAlleleNCBI,referenceAlleleNCBI;Reference genomic sequence from dbSNP;reference allele
ncbiTaxId,ncbiTaxID;NCBI Taxonomy database identifier
referenceAllele,referenceAllele;referenceAlleleNCBI;Reference genomic sequence from dbSNP;reference allele
class,class
phylum,phylum
geneticVariantFunctionalCategory,geneticVariantFunctionalCategory;Functional category of the genetic variant
hg19GenomicPosition,hg19GenomicPosition;The genomic position of a genetic variant using the hg19 assembly
hg19GenomicLocation,hg19GenomicLocation;The genomic location of a genetic variant using the hg19 assembly
hg38GenomicPosition,hg38GenomicPosition;The genomic position of a genetic variant using the hg38 assembly
hg38GenomicLocation,hg38GenomicLocation;The genomic location of a genetic variant using the hg38 assembly
hasRNATranscript,hasRNATranscript;Recorded transcript;RNA transcript that a gene has
hgncID,hgncID;HUGO Gene Nomenclature Committee identifier
inChIKey,InChIKey;International Chemical Identifier (InChI) Key
strandOrientation,strandOrientation;The strand on which a given annotation is located;The orientation of the strand on which an annotation is located
typeOfGene,typeOfGene;The type of gene
omimID,omimID;OMIM database identifier
icd10CMCode,"icd10CMCode;The disease diagnosis code for version 10 of the International Classification of Diseases (ICD), Clinical Modification"
subClassificationOf,subClassificationOf;subclassification of
snomedCT,snomedCT;Systematiized Nomenclature of Medicine (SNOMED) clinical terms (CT) code
unifiedMedicalLanguageSystemConceptUniqueIdentifier,"unifiedMedicalLanguageSystemConceptUniqueIdentifier;Unified Medical Language System (UMLS) Concept Unique Identifier (CUI);""UMLS CUI"""
umlsConceptUniqueID,umlsConceptUniqueID;Unified Medical Language System (UMLS) Concept Unique Identifier (CUI);UMLS CUI
specializationOf,specializationOf;specialization of
chemblID,chemblID;ChEMBL identifier
simplifiedMolecularInputLineEntrySystem,simplifiedMolecularInputLineEntrySystem;Simplified Molecular Input Line Entry System (SMILE)
medicalSubjectHeadingSupplementaryRecordID,medicalSubjectHeadingSupplementaryRecordID;A unique ID for a Medical Subject Heading supplementary record;An ID for a Medical Subject Heading supplementary record;MeSH supplementary record ID
medicalSubjectHeadingDescriptorID,medicalSubjectHeadingDescriptorID;A unique ID for a Medical Subject Heading Descriptor record;An ID for a Medical Subject Heading descriptor record;MeSH descriptor record ID
mechanismOfAction,mechanismOfAction;MOA;the biochemcial interaction through which a drug produces a pharmacological effect
activeIngredient,"activeIngredient;component that provides pharmacological activity or other direct effect in the diagnosis, cure, mitigation, treatment, or prevention of disease, or to affect the structure or any function of the body of man or animals"
administrationRoute,administrationRoute;The method by which a drug is administered
dosageForm,dosageForm;physical form in which a drug is produced and dispensed
@@ -37,10 +33,7 @@ geneID,geneID;gene id
ncbiProteinAccessionNumber,ncbiProteinAccessionNumber;NCBI protein accession number
alleleOrigin,alleleOrigin;Variant allele origin;Origin of variant allele
alleleType,alleleType;The allele of a genetic variant observed within a population;Type of allele
ncbiDNASequenceName,ncbiDNASequenceName;NCBI defined segment of DNA sequence name;Name used by NIH NCBI to refer to a segment of DNA sequence
imageUrl,imageUrl;url to an image of what the biological specimen looks like;what the entity looks like
genomicCoordinates,genomicCoordinates;genomic coordinates
availableStrength,availableStrength;dose approved for a drug
hasGenomicCoordinates,genomicCoordinates;genomic coordinates
<-variantID{typeOf:GeneGeneticVariantAssociation}->geneID,GeneticVariantGeneAssociation;Gene associated with a genetic variant
<-geneID{typeOf:GeneGeneticVariantAssociation}->variantID,GeneticVariantGeneAssociation;genetic variant associated with a gene
<-diseaseID{typeOf:DiseaseGeneAssociation}->geneID,DiseaseGeneAssociation;Gene associated with a disease
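
The two CSVs in this diff appear to be related: each `dcid,sentence` row in sheets_svs.csv lists semicolon-separated sentences, and _preindex.csv holds one sorted `sentence,dcid` row per sentence, joining dcids with `;` when a sentence maps to more than one. A minimal sketch of that expansion, assuming this is roughly what the build step does; `build_preindex` is a hypothetical name and not the repo's actual tool:

```python
import csv
import io
from collections import defaultdict


def build_preindex(sheet_csv: str) -> list:
    """Expand `dcid,sentence` rows into sorted (sentence, dcids) pairs.

    Assumption: sentences within a cell are separated by ';' and a sentence
    shared by several dcids gets those dcids joined with ';'.
    """
    sentence_to_dcids = defaultdict(set)
    for row in csv.DictReader(io.StringIO(sheet_csv)):
        dcid = row["dcid"].strip()
        for sentence in row["sentence"].split(";"):
            sentence = sentence.strip()
            if sentence:
                sentence_to_dcids[sentence].add(dcid)
    # Plain string sort, so uppercase sentences come before lowercase ones.
    return [
        (sentence, ";".join(sorted(dcids)))
        for sentence, dcids in sorted(sentence_to_dcids.items())
    ]


if __name__ == "__main__":
    sheet = (
        "dcid,sentence\n"
        "umlsConceptUniqueID,umlsConceptUniqueID;UMLS CUI\n"
    )
    for sentence, dcid in build_preindex(sheet):
        print(f"{sentence},{dcid}")
```

Grouping by sentence rather than by dcid is what would produce rows like the `GeneticVariantGeneAssociation` one, where a single sentence points at two dcids.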