To download gene and TE (Transposable Element) references for your preferred species from the UCSC Genome Browser, follow these steps:

  1. Go to the UCSC Genome Browser website: https://genome.ucsc.edu/cgi-bin/hgGateway.

  2. In the search bar, type the name of your preferred species and select it from the search results.

Screenshot 2024-11-06 at 16 56 31
  3. Once you've selected your species, click Tools in the top menu bar.

  4. From the dropdown menu, choose Table Browser.

Screenshot 2024-11-06 at 16 56 57
  5. You will be directed to the Table Browser. This step initializes the clade and genome fields for your selected species.

  6. Now, follow these steps for the Gene Reference:

    • In the clade dropdown menu, ensure the clade is set to the appropriate classification.
    • In the genome dropdown menu, select the genome assembly you want to use.
    • In the group dropdown menu, select the group related to genes.
    • Set the output format to GTF.
    • Click the 'bigZip/genes' link to get GTF files with formatted gene identifiers.
    • Leave the remaining fields at their default values unless you have specific requirements.
Screenshot 2024-11-06 at 16 57 34

On the page that opens, you can download either the ensGene.gtf or the ncbiRefSeq.gtf file.

Screenshot 2024-11-06 at 16 57 45
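
If you prefer to fetch the file from a script instead of the browser, the download can be sketched as follows. The URL pattern (`hgdownload.soe.ucsc.edu/goldenPath/<assembly>/bigZip/genes/`) is an assumption about how the UCSC download server is laid out, and `ucsc_gtf_url` and `download_gtf` are hypothetical helper names. Check in the browser that the file exists for your assembly before relying on this.

```python
import urllib.request

# Assumed layout of the UCSC download server; verify it for your assembly.
UCSC_BASE = "https://hgdownload.soe.ucsc.edu/goldenPath"


def ucsc_gtf_url(assembly: str, track: str = "ensGene") -> str:
    """Build the URL of a gzipped GTF in the assembly's bigZip/genes directory."""
    if track not in ("ensGene", "ncbiRefSeq"):
        raise ValueError(f"unsupported track: {track}")
    return f"{UCSC_BASE}/{assembly}/bigZip/genes/{assembly}.{track}.gtf.gz"


def download_gtf(assembly: str, track: str = "ensGene") -> str:
    """Download the gzipped GTF into the current directory and return its file name."""
    url = ucsc_gtf_url(assembly, track)
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
    return filename
```

For example, `download_gtf("dm6")` would try to save `dm6.ensGene.gtf.gz`, the file used in the build command at the end of this guide. Not every assembly provides both tracks.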

Alternatively, you can use Ensembl or related websites like EnsemblPlants.

The NCBI database does not always provide a GTF with gene identifiers, for example for Arabidopsis. In that case, use Ensembl, EnsemblPlants, EnsemblFungi, or a related Ensembl site.

Arabidopsis in EnsemblPlants

Download the GFF3 file.

Screenshot 2024-11-06 at 15 57 49

Drosophila

Screenshot 2024-11-06 at 16 00 17

You can use either the GTF or the GFF3 file.

Screenshot 2024-11-06 at 16 00 31
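
The two formats differ mainly in how the ninth (attributes) column is written, which matters when you check that a downloaded file carries gene identifiers. The sketch below is a simplified illustration, not the parser used by `build_reference.py`. `parse_attributes` is a hypothetical helper, and it assumes attribute values contain no semicolons.

```python
def parse_attributes(field: str, fmt: str) -> dict:
    """Parse the attributes column (column 9) of a GTF or GFF3 line into a dict."""
    if fmt not in ("gtf", "gff3"):
        raise ValueError(f"unknown format: {fmt}")
    attrs = {}
    # Simplification: assumes no ';' appears inside an attribute value.
    for item in field.strip().strip(";").split(";"):
        item = item.strip()
        if not item:
            continue
        if fmt == "gtf":
            # GTF style: key "value"
            key, _, value = item.partition(" ")
            attrs[key] = value.strip().strip('"')
        else:
            # GFF3 style: key=value
            key, _, value = item.partition("=")
            attrs[key] = value
    return attrs


# Illustrative attribute strings in each style.
gtf_attrs = parse_attributes('gene_id "FBgn0031208"; transcript_id "FBtr0300689";', "gtf")
gff3_attrs = parse_attributes("ID=gene:AT1G01010;Name=NAC001", "gff3")
```

In a GTF the gene identifier is usually stored under `gene_id`, while a GFF3 gene feature carries it in `ID`.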
  7. For the TE (Transposable Element) Reference:

    • Follow the same procedure as for the Gene Reference to initialize the clade and genome fields.
    • In the group dropdown menu, select either 'all tracks' or the repeat-related group.
    • In the track dropdown menu, select RepeatMasker.
    • Set the remaining fields as shown below, making sure they match what you used for the Gene Reference unless you have special requirements.
Screenshot 2024-11-06 at 17 18 47
    • Click the get output button.
  8. After clicking the get output button for either the Gene Reference or the TE Reference, the reference data is processed and you are prompted to download the zipped reference file.

Follow these steps for both the Gene and TE References to obtain the required reference files for your preferred species.
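
Before building the reference, it can help to sanity-check the downloaded TE table. The sketch below counts rows per repeat class. It assumes the export keeps the RepeatMasker table's `repClass` column and a header line starting with `#`, and that fields are separated by tabs or commas. `count_repeat_classes` is a hypothetical helper, and the sample rows are made up for illustration and show only a subset of the columns.

```python
import csv
import io
from collections import Counter


def count_repeat_classes(handle) -> Counter:
    """Count rows per repClass in a Table Browser RepeatMasker export."""
    first = handle.readline()
    # Guess the field separator from the header line (tab or comma).
    delimiter = "\t" if "\t" in first else ","
    header = next(csv.reader([first], delimiter=delimiter))
    # The first header name is usually prefixed with '#', e.g. '#bin'.
    header[0] = header[0].lstrip("#")
    reader = csv.DictReader(handle, fieldnames=header, delimiter=delimiter)
    # Raises KeyError if the export has no repClass column.
    return Counter(row["repClass"] for row in reader)


# Made-up rows with a subset of the columns, for illustration only.
sample = io.StringIO(
    "#bin\tgenoName\tgenoStart\tgenoEnd\tstrand\trepName\trepClass\trepFamily\n"
    "585\tchr2L\t0\t100\t+\tTE_A\tLTR\tPao\n"
    "585\tchr2L\t200\t300\t-\tTE_B\tDNA\tP\n"
    "585\tchr2L\t400\t500\t+\tTE_C\tLTR\tGypsy\n"
)
class_counts = count_repeat_classes(sample)
```

For a real file, pass an open file handle instead, for example `count_repeat_classes(open("Drosophila_TE.csv", newline=""))`.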

Here is an example script to build the reference after downloading the data.

gzip -d dm6.ensGene.gtf.gz
python build_reference.py --species Other --other_species_TE Drosophila_TE.csv --other_species_GTF dm6.ensGene.gtf
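
The same two steps can also be driven from Python, which may be convenient inside a larger pipeline. This is a minimal sketch: the flags are copied from the command above, while `gunzip`, `build_command`, and `main` are hypothetical helper names. It assumes `build_reference.py` is in the current directory.

```python
import gzip
import shutil
import subprocess
import sys
from pathlib import Path


def gunzip(path: str) -> str:
    """Decompress a .gz file next to the original and return the new file name.

    Unlike `gzip -d`, this keeps the compressed file.
    """
    source = Path(path)
    if source.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got: {path}")
    target = source.with_suffix("")  # dm6.ensGene.gtf.gz -> dm6.ensGene.gtf
    with gzip.open(source, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return str(target)


def build_command(te_csv: str, gtf: str) -> list:
    """Assemble the build_reference.py call with the flags used above."""
    return [
        sys.executable, "build_reference.py",
        "--species", "Other",
        "--other_species_TE", te_csv,
        "--other_species_GTF", gtf,
    ]


def main() -> None:
    """Decompress the GTF, then run the build script."""
    gtf = gunzip("dm6.ensGene.gtf.gz")
    subprocess.run(build_command("Drosophila_TE.csv", gtf), check=True)
```

Calling `main()` decompresses the GTF and then runs the build script, raising an error if the script exits with a non-zero status.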