JupyterHub Developing a UseCase: Variant Calling Workflow for Large-Scale Genomic Datasets #5

Open
viktoriaas opened this issue Nov 3, 2024 · 0 comments

Why?

Jupyter Notebook is an application for creating and sharing computational documents. JupyterHub serves notebooks to multiple users. The benefit is that users gain easy, interactive access to computational resources without needing to install anything.

The GA4GH TES (Task Execution Service) API is a standardized schema and API for describing and executing batch tasks on any underlying computational backend. The full TES specification defines its capabilities.
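To make "describing a batch task" more concrete, here is a minimal sketch of a TES task document built as a plain Python dict. The field names (`name`, `inputs`, `outputs`, `resources`, `executors`, `image`, `command`, `cpu_cores`, `ram_gb`) follow the TES task schema; the helper `make_tes_task`, the container image, and the command are hypothetical and only serve as illustration.

```python
import json


def make_tes_task(name, image, command, cpu_cores=1, ram_gb=2.0,
                  inputs=None, outputs=None):
    """Build a TES task document as a plain dict (hypothetical helper)."""
    return {
        "name": name,
        # Files the backend should stage in before the executor runs.
        "inputs": inputs or [],
        # Files the backend should upload after the executor finishes.
        "outputs": outputs or [],
        # Requested resources; the values here are illustrative only.
        "resources": {"cpu_cores": cpu_cores, "ram_gb": ram_gb},
        # Each executor is one container invocation.
        "executors": [{"image": image, "command": command}],
    }


task = make_tes_task(
    name="hello-tes",
    image="alpine:3.20",                 # assumed image, not from the issue
    command=["echo", "hello from TES"],  # assumed command
)
print(json.dumps(task, indent=2))
```

The same dict can later be serialized to JSON and sent to a TES endpoint; a real variant-calling step would replace the image and command with the actual tool invocation.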

The goal of this issue is to develop a use case for a JupyterHub instance. A sample use case could be variant calling on large-scale genomic datasets.

Objective: Develop a workflow in JupyterHub to perform variant calling on genomic data from multiple cohorts, utilizing federated computing through GA4GH TES.

Scope: The workflow could include data pre-processing, alignment, and variant calling, leveraging TES to offload compute-intensive tasks to appropriate resources. Visualizations could show variant distributions, and results could be exported for further analysis.
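One possible way to make this scope concrete inside a notebook is to describe each stage together with the properties that matter when deciding whether to offload it. The sketch below is only an illustration: the `Stage` class, its fields, the resource numbers, and the `local_cores` threshold are all assumptions, not part of this issue or of the TES specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the workflow (hypothetical descriptor)."""
    name: str
    needs_staged_input: bool      # must data be available before it starts?
    writes_output: bool           # does it persist results somewhere?
    cpu_cores: int                # illustrative resource estimate
    ram_gb: float                 # illustrative resource estimate
    touches_sensitive_data: bool  # could it handle sensitive data?

    def offload_candidate(self, local_cores: int = 2) -> bool:
        """Treat a stage as an offload candidate if it needs more cores
        than the notebook server is assumed to provide."""
        return self.cpu_cores > local_cores


# Resource figures below are made-up placeholders, not measurements.
PIPELINE = [
    Stage("pre-processing", True, True, 2, 4.0, True),
    Stage("alignment", True, True, 16, 32.0, True),
    Stage("variant-calling", True, True, 8, 16.0, True),
    Stage("visualization", False, False, 1, 2.0, False),
]

candidates = [s.name for s in PIPELINE if s.offload_candidate()]
print(candidates)  # ['alignment', 'variant-calling']
```

A descriptor like this keeps the offloading decision separate from the analysis code, so the same notebook can run fully locally while the distribution features are still under development.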

More useful information and link: document online

How?

The full functionality described here (distributing parts of the workflow) depends on other issues. It is still important to create a sample workflow that covers every step of a data analysis pipeline and is logically divided into sections that could, in principle, be offloaded to appropriate resources. You can use any existing TES instance to offload at least one of these sections.

  1. Create a Jupyter Notebook with a sample workflow (any bioinformatics workflow) that includes all steps of a data analysis pipeline.
  2. Identify the parts that could be offloaded and define their requirements: Do they need data in advance? Do they need to save output somewhere? Does the computation require any special resources? Could the computation handle sensitive data?
  3. Try to offload at least one part of the computation to any TES instance. Remember that TES instances might require an authentication token, so don't forget to add it!
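Step 3 can be sketched as follows, using only the Python standard library. This is a rough outline under several assumptions: the base URL `https://tes.example.org/ga4gh/tes/v1` is a placeholder (the path prefix varies between deployments), the token is assumed to be accepted as a bearer token (some instances use a different scheme), and the helper names `build_submit_request` and `submit_task` are hypothetical. Creating a task is a `POST` to the `tasks` endpoint, and the response carries the new task's `id`.

```python
import json
import urllib.request


def build_submit_request(base_url, task, token=None):
    """Prepare the POST request that creates a TES task (hypothetical helper)."""
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if token:
        # Assumption: the instance accepts the token as a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tasks",
        data=json.dumps(task).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def submit_task(base_url, task, token=None, timeout=30):
    """Send the task and return the ID assigned by the TES instance."""
    req = build_submit_request(base_url, task, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["id"]


# Illustrative task; image and command are placeholders.
task = {
    "name": "offloaded-step",
    "executors": [{"image": "alpine:3.20", "command": ["echo", "offloaded"]}],
}

# Build (but do not send) the request to inspect what would go over the wire.
req = build_submit_request(
    "https://tes.example.org/ga4gh/tes/v1", task, token="MY_TOKEN"
)
print(req.get_method(), req.full_url)
```

In a notebook, calling `submit_task(...)` with a real base URL and token would return the task ID, which can then be polled with a `GET` on `tasks/{id}` to follow the task state. Avoid hard-coding the token in the notebook; reading it from an environment variable is one common option.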

If you want to work on this issue:

  • Assign yourself to the issue (if someone else is already assigned, first ask whether they would welcome help on the issue - or pick another one)
  • Once assigned, move your issue to the "In progress" column on the project board
  • Start working 🚀