Skip to content

Further information on r2d annotate

pre-mRNA edited this page Jul 2, 2024 · 2 revisions

R2Dtool annotate

R2Dtool can annotate transcriptome-mapped RNA feature positions with isoform-specific information, including metatranscript coordinates and absolute and relative distances to annotated transcript landmarks. This annotation process provides valuable context for analyzing RNA features in relation to transcript structure.

Overview

The annotate function takes transcriptomic coordinates as input and uses transcript annotation data from a GTF file to provide detailed information about each RNA feature position. This includes gene details, transcript properties, and distances to important transcript features like splice sites and coding regions.

Key Features

  • Isoform-specific annotation
  • Calculation of metatranscript coordinates
  • Determination of distances to nearest splice sites
  • Flexible header handling
  • Compatible with downstream R2Dtool metaplot functions

Usage

r2d annotate -i <input> -g <gtf> [-H] [-t] [-o <output>]

Arguments

  • -i, --input <input>: Path to tab-separated transcriptome sites in BED format.
  • -g, --gtf <annotation>: Path to gene structure annotation in GTF format.

Options

  • -H, --header: Indicates the input file has a header, which will be preserved in the output [Default: False]
  • -o, --output <OUTPUT>: Path to output file [Default: STDOUT]
  • -t, --transcript-version: Indicates that '.'-delimited transcript version is present in col1 and should be considered during annotation [default: False]

Input Requirements

  1. A GTF file containing transcript annotations (GTF2.2 format)
  2. An input file with transcriptomic coordinates in BED format

Minimum BED File Requirements

The input BED file must contain at least the following columns:

  1. Column 1: Transcript ID
  2. Column 2: Start coordinate (0-based)
  3. Column 3: End coordinate (0-based, half-open)

Additional metadata columns (e.g., feature labels, stoichiometry, probability, motifs) can be included from column 4 onwards. These fields will be preserved in the annotate output.

Output

A tab-separated file containing the original input data along with additional annotation columns:

  1. gene_id
  2. gene_name
  3. transcript_biotype
  4. tx_len (transcript length)
  5. cds_start
  6. cds_end
  7. tx_end
  8. transcript_metacoordinate
  9. abs_cds_start
  10. abs_cds_end
  11. up_junc_dist (upstream junction distance)
  12. down_junc_dist (downstream junction distance)

Key Concepts

Metatranscript Coordinate Calculation

The metatranscript coordinate is a normalized position within the transcript, calculated as follows:

  • For positions in the 5' UTR: position / 5' UTR length
  • For positions in the CDS: 1 + (position - 5' UTR length) / CDS length
  • For positions in the 3' UTR: 2 + (position - 5' UTR length - CDS length) / 3' UTR length

This results in a value between 0 and 3, where:

  • 0 represents the transcript start
  • 1 represents the CDS start
  • 2 represents the CDS end
  • 3 represents the transcript end

Splice Site Distance Calculation

The algorithm calculates the distance to the nearest upstream and downstream splice sites for each position. This provides valuable information about the proximity of RNA features to exon-intron boundaries.

Strand and Coordinate Handling

  • For annotation purposes, the feature strand is always assumed to be positive.
  • The start coordinate of the feature (column 2 in the input BED file) is used to annotate the RNA feature and assign a metatranscript coordinate.

Header Handling

R2Dtool annotate can process input files with or without headers. When headers are present, the -H flag must be used. This flag ensures that:

  1. The input header is preserved in the output.
  2. New column names are added for the additional annotation data.
  3. The output is compatible with downstream R2Dtool metaplot functions.

It's highly recommended to use headers and the -H flag, especially if you plan to use R2Dtool's visualization features later.

Workflow Integration

R2Dtool annotate can be performed before, but not after, the liftover process. This allows users to annotate their transcriptomic data with valuable context before converting to genomic coordinates if needed.

Performance Considerations

The implementation uses parallel processing (via Rayon), which improves performance for large datasets. We recommend using R2Dtool in an environment with at least 4 CPU cores and 12GB of free memory.

Error Handling

The algorithm handles missing or incalculable data points by using "NA" values, ensuring that the output remains consistent even when complete information is not available for all transcripts.

Algorithm Overview

  1. Parse GTF file to create transcript annotations
  2. Generate splice site information for all transcripts
  3. Process input file line by line: a. Extract transcript ID and coordinate b. Retrieve transcript information c. Calculate various metrics (CDS positions, metacoordinates, etc.) d. Determine distances to nearest splice sites e. Write annotated information to output

Detailed Pseudocode

Algorithm: run_annotate
Input: 
  - matches: command-line argument matches
  - has_header: boolean indicating if input file has a header
  - has_version: boolean indicating if transcript IDs include version numbers
Output: 
  - Annotated file with additional transcript information

1. Extract gtf_file, input_file, and output_file from matches

2. annotations ← read_annotation_file(gtf_file, true, has_version)

3. splice_sites ← generate_splice_sites(annotations)

4. Open input_file for reading
5. Open output_file (or stdout) for writing

6. If has_header:
    Read and process header, adding new column names

7. For each line in input_file:
    a. Split line into fields
    b. transcript_id ← Extract from fields[0] (remove version if necessary)
    c. tx_coord ← Parse fields[1] as integer

    d. If transcript_id exists in annotations:
        i. Retrieve transcript information
        ii. Calculate tx_len, cds_start, cds_end, tx_end
        iii. (rel_pos, abs_cds_start, abs_cds_end) ← calculate_meta_coordinates(tx_coord, utr5_len, cds_len, utr3_len)
        iv. (up_junc_dist, down_junc_dist) ← splice_site_distances(tx_coord, splice_sites[transcript_id])
        
        v. Construct output line with all calculated values
    
    e. Else:
        Construct output line with "NA" for all additional fields

    f. Write constructed line to output_file

8. Close input and output files
9. Remove temporary splice sites file

Return: Success or error status

Notes

  • The algorithm handles both versioned and non-versioned transcript IDs.
  • It calculates transcript metacoordinates to provide context within the transcript structure.
  • Splice site distances are computed to give information about proximity to exon-intron boundaries.
  • The implementation uses parallel processing (Rayon) for generating splice site information, which improves performance for large datasets.
  • Error handling and "NA" values are used for missing or incalculable data points.

Complete source code for R2Dtool annotate is available at https://github.com/comprna/R2Dtool/blob/main/src/annotate.rs