Update descriptions and provide more explicit values to improve AI-assisted workflow (MIT collab) #560

Open
anngvu opened this issue Dec 10, 2024 · 0 comments

Collaborator

anngvu commented Dec 10, 2024

Since the data model is used in AI workflows (as discussed with the MIT group and seen in other development/testing), one easy way to improve AI performance is to provide better information through the data model itself.

For the MIT examples, the aim is to improve sample_output. Since the description is provided to the AI as context, better hints could be provided for:

  • assay: We note that this is inconsistently filled out, so the assay description could state that the same term should be used throughout a dataset.
  • dataSubtype: This is also inconsistently filled out, but the AI does a good job on fileFormat for obvious reasons, so the description could include an example, such as that fastq usually means "raw".
  • dataType: GenomicVariants vs AlignedReads vs GenomicFeatures. Since the AI currently doesn't look up these terms in the EDAM ontology, this can be confusing. The descriptions might have to state that AlignedReads refers to bam files; that GenomicVariants is a broader term covering simple variants or structural variants, so files can be either maf or vcf ...; and that for GenomicFeatures we expect something like bed or some other formats.
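
To make the hints above concrete, here is a minimal sketch of how they could be checked against annotation rows such as those in sample_output. It is an illustration under assumptions, not existing tooling: the column name `datasetId`, both function names, and the `EXPECTED_FORMATS` mapping are hypothetical, and the mapping only covers the formats named above, so it is not exhaustive.

```python
from collections import defaultdict

# Hypothetical, non-exhaustive mapping built only from the hints above;
# the data model may well allow other formats for each dataType.
EXPECTED_FORMATS = {
    "AlignedReads": {"bam"},
    "GenomicVariants": {"maf", "vcf"},
    "GenomicFeatures": {"bed"},
}


def inconsistent_assays(rows, dataset_key="datasetId"):
    """Return {dataset: sorted assay terms} for datasets using more than one assay term."""
    assays = defaultdict(set)
    for row in rows:
        assays[row[dataset_key]].add(row["assay"])
    return {ds: sorted(terms) for ds, terms in assays.items() if len(terms) > 1}


def unexpected_data_types(rows):
    """Return rows whose fileFormat is not among the formats hinted for their dataType."""
    flagged = []
    for row in rows:
        expected = EXPECTED_FORMATS.get(row["dataType"])
        # Unknown dataType values are skipped rather than flagged.
        if expected is not None and row["fileFormat"] not in expected:
            flagged.append(row)
    return flagged
```

Rows flagged this way would only be candidates for review, since the format lists above are explicitly open-ended.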

In an ideal workflow, instead of getting hints through descriptions, the generative AI could use relations encoded more formally*:

  • as ontology axioms or a set of rules (again, if fileFormat is fastq, dataSubtype will always be "raw")
  • or as probabilities (for tumor type A, the most probable tissue sample types are {T1, T2}, though for a metastatic tumor the tissue sample could come from a wider variety of tissues {T1,T2,T3,T4,T5,...}, and there is also known tropism for some cancers)

*This might be a future iteration.
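
As a rough sketch of what the more formal encoding could look like, the rule and the probabilities can each be written as plain data plus a small lookup. Everything here is hypothetical: the names `RULES`, `TISSUE_PRIORS`, `fill_implied`, and `rank_tissues` are made up for illustration, and the probability values are invented placeholders rather than real estimates. Only the fastq rule and the T1/T2/... labels come from the text above.

```python
# Each rule: (if_attribute, if_value, then_attribute, then_value).
# The single rule below is the one stated above; others would need curation.
RULES = [
    ("fileFormat", "fastq", "dataSubtype", "raw"),
]

# Placeholder priors keyed by (tumor type, is_metastatic).
# The numbers are invented for illustration only.
TISSUE_PRIORS = {
    ("A", False): {"T1": 0.6, "T2": 0.4},
    ("A", True): {"T1": 0.3, "T2": 0.25, "T3": 0.2, "T4": 0.15, "T5": 0.1},
}


def fill_implied(record, rules=RULES):
    """Fill in values implied by the rules; report conflicts instead of overwriting."""
    filled = dict(record)
    conflicts = []
    for if_attr, if_val, then_attr, then_val in rules:
        if filled.get(if_attr) != if_val:
            continue
        current = filled.get(then_attr)
        if current in (None, ""):
            filled[then_attr] = then_val
        elif current != then_val:
            conflicts.append((then_attr, current, then_val))
    return filled, conflicts


def rank_tissues(tumor_type, metastatic, priors=TISSUE_PRIORS):
    """Return tissue types ordered from most to least probable, or [] if unknown."""
    dist = priors.get((tumor_type, metastatic), {})
    return sorted(dist, key=dist.get, reverse=True)
```

For example, `fill_implied({"fileFormat": "fastq"})` would add `dataSubtype: "raw"`, while a record that already carries a different dataSubtype is left unchanged and reported as a conflict for a curator to resolve.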

Also attached: sample_output.csv
