Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gene class should have a method to get its sequence #270

Open
antonkulaga opened this issue Oct 6, 2022 · 4 comments
Open

gene class should have a method to get its sequence #270

antonkulaga opened this issue Oct 6, 2022 · 4 comments
Assignees

Comments

@antonkulaga
Copy link

I see that transcripts have the method to get sequences while the gene class does not. Would be nice to be able to get genomic DNA of gene as well.

@iskandr
Copy link
Contributor

iskandr commented Feb 13, 2023

PyEnsembl doesn't currently use a full genomic FASTA -- it's possible to point it at one but the sequences would differ from the transcript sequences (even at the same coordinates), which I think would be pretty confusing.

@rraadd88
Copy link
Contributor

In case there would be a plan to extend pyensembl's compatibility with full genome, I have a few functions in one of repositories that are based on pyensembl's way of caching the files.
The wrapper function is called download_genome with following parameters

download_genome(
    species: str,
    ensembl_release: int,
    force: bool = False,
    verbose: bool = False
) → str

So I would be happy to contribute, if there would be a plan to extend compatibility in the future.

Following couple of things that could be challenging though:

  1. additional dependencies in case of a large genome (e.g. human) for fetching the sequences e.g. using UCSC 2bit format, as implemented in my repository.
  2. I don't have a specific example, but for some species, the assembly number for the genome can be different from the one for transcripts and proteins.

@iskandr
Copy link
Contributor

iskandr commented Feb 12, 2024

I have also been thinking about adding this functionality now that I'm working on making the OpenVax tools work more cleanly with large/structural variants (and thus have to contend with more events outside of annotated exons). I would love a contribution but maybe we can talk a bit more about the design?

A few thoughts:

  • I want downloading the full genome to be optional (pyensembl install shouldn't do it by default)
  • I want it to be easy to point to an already-downloaded genome
  • Should we always download 2bit or allow flexible formats?
  • How do we handle different patches to the genome sequence?
  • How do we handle slight differences in the assemblies, is the naming sufficiently descriptive to always know exactly which FASTA we're referring to? (eg presence/absence of alt and unplaced contigs)
  • ...is there already a tool which does this with a Python API that we should piggy-back on top of?

@rraadd88
Copy link
Contributor

About the last point, I haven't encountered any tool ideally suited to piggy-back on for fetching genome sequences. That's why in my python package, I ended up coupling (1) pyensembl's way to download and cacheing (or use already-downloaded genome) with (2) optional 2bit indexing for large genomes. It works quite well, but it is still to be tested extensively, e.g. with different species etc.

There is also a web API that I know. It is quite fast, but (1) it does not cover all the genomes and (2) a limit on number/rate of queries is expected.

In my experience, the challenge lies with the large genome e.g. human. For the smaller ones, the indexed genomes can be downloaded from ensembl (e.g.) and then basic tools such as pyfaidx can fetch sequences at decent rate.

For large genomes though, using 2bit is the fastest approach, to my knowledge. The required tools can be downloaded from conda/mamba using this command:

conda install install bioconda::ucsc-fatotwobit bioconda::ucsc-twobittofa bioconda::ucsc-twobitinfo # options: conda/mamba

Overall, I would vote for a 2bit based approach which I have already implemented in my repo.

One thing to note however, in my implementation, an intermediate bed file is created for fetching the sequences. This is ideally suited for fetching sequences in batches, not one at a time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants