Since the data model is used in AI workflows (as discussed with the MIT group and seen in other development and testing), one easy way to improve AI performance is to provide better information through the data model.
For the MIT examples, the aim is to improve sample_output. Since the description is provided as context, these descriptions could carry better hints:
assay: This is filled out inconsistently, so the assay description could state that the same term should be used throughout a dataset.
dataSubtype: This is also filled out inconsistently, whereas the AI does a good job on fileFormat for obvious reasons. The description could therefore give an example, such as that fastq usually means "raw".
dataType: GenomicVariants vs. AlignedReads vs. GenomicFeatures can be confusing, since the AI currently doesn't look these terms up in the EDAM ontology. The description might have to say that AlignedReads refers to bam files; that GenomicVariants is a broader term covering simple or structural variants, so files can be either maf or vcf ...; and that for GenomicFeatures we expect something like bed or some other formats.
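To make these hints concrete, here is a minimal sketch of how the same expectations could be checked against filled-in rows such as those in sample_output. The column names (`dataset`, `assay`, `fileFormat`, `dataType`) and the helper names are assumptions for illustration, and the format mapping only covers the formats mentioned in this issue.

```python
from collections import defaultdict

# Hypothetical mapping distilled from the hints in this issue.
# It is not exhaustive: other formats are simply not checked.
EXPECTED_DATATYPE = {
    "bam": "AlignedReads",
    "maf": "GenomicVariants",
    "vcf": "GenomicVariants",
    "bed": "GenomicFeatures",
}


def inconsistent_assays(rows):
    """Return {dataset: sorted assay terms} for datasets that use more than one assay term."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["dataset"]].add(row["assay"])
    return {ds: sorted(terms) for ds, terms in seen.items() if len(terms) > 1}


def datatype_mismatches(rows):
    """Return rows whose dataType disagrees with the format-based expectation."""
    mismatched = []
    for row in rows:
        expected = EXPECTED_DATATYPE.get(row["fileFormat"].lower())
        if expected is not None and row["dataType"] != expected:
            mismatched.append(row)
    return mismatched


# Made-up example rows (assumed column names, invented values).
rows = [
    {"dataset": "ds1", "assay": "RNA-seq", "fileFormat": "bam", "dataType": "AlignedReads"},
    {"dataset": "ds1", "assay": "rnaSeq", "fileFormat": "vcf", "dataType": "AlignedReads"},
]
print(inconsistent_assays(rows))
print(datatype_mismatches(rows))
```

A check like this does not replace better descriptions, but it could flag where the hints are not being followed.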
In an ideal workflow, instead of getting hints through descriptions, the generative AI could use relations encoded more formally*:
as ontology axioms or a set of rules (again, if fileFormat is fastq, dataSubtype will always be "raw")
or as probabilities (for tumor type A, the most probable tissue sample types are {T1, T2}, though for a metastatic tumor the sample could come from a wider variety of tissues {T1, T2, T3, T4, T5, ...}, and some cancers also have known tropism)
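As a rough illustration of the two encodings, the sketch below uses a plain rule table and a lookup of prior probabilities. All names (`RULES`, `apply_rules`, `TISSUE_PRIOR`, `rank_tissues`) are hypothetical, and the probability values are invented placeholders, not real data. An actual implementation might use ontology axioms rather than Python tuples.

```python
# Hypothetical rule table:
# (condition field, condition value, target field, implied value).
RULES = [
    ("fileFormat", "fastq", "dataSubtype", "raw"),
]


def apply_rules(record, rules=RULES):
    """Return a copy of record with rule-implied fields filled in or overwritten."""
    out = dict(record)
    for cond_field, cond_value, target_field, target_value in rules:
        if out.get(cond_field) == cond_value:
            out[target_field] = target_value
    return out


# Placeholder priors keyed by (tumor type, is metastatic).
# The numbers are invented purely to show the shape of the data.
TISSUE_PRIOR = {
    ("A", False): {"T1": 0.6, "T2": 0.4},
    ("A", True): {"T1": 0.3, "T2": 0.25, "T3": 0.2, "T4": 0.15, "T5": 0.1},
}


def rank_tissues(tumor_type, metastatic):
    """Return candidate tissue types, most probable first; empty if nothing is known."""
    prior = TISSUE_PRIOR.get((tumor_type, metastatic), {})
    return sorted(prior, key=prior.get, reverse=True)


print(apply_rules({"fileFormat": "fastq"}))
print(rank_tissues("A", metastatic=True))
```

Rules suit relations that always hold, while the ranked priors leave room for the wider range of tissues seen in metastatic cases.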
*This might be a future iteration.
Also attached: sample_output.csv