-
Notifications
You must be signed in to change notification settings - Fork 4
Further information on r2d annotate
R2Dtool can annotate transcriptome-mapped RNA feature positions with isoform-specific information, including metatranscript coordinates and absolute and relative distances to annotated transcript landmarks. This annotation process provides valuable context for analyzing RNA features in relation to transcript structure.
The annotate
function takes transcriptomic coordinates as input and uses transcript annotation data from a GTF file to provide detailed information about each RNA feature position. This includes gene details, transcript properties, and distances to important transcript features like splice sites and coding regions.
- Isoform-specific annotation
- Calculation of metatranscript coordinates
- Determination of distances to nearest splice sites
- Flexible header handling
- Compatible with downstream R2Dtool metaplot functions
r2d annotate -i <input> -g <gtf> [-H] [-t] [-o <output>]
-
-i, --input <input>
: Path to tab-separated transcriptome sites in BED format. -
-g, --gtf <annotation>
: Path to gene structure annotation in GTF format.
-
-H, --header
: Indicates the input file has a header, which will be preserved in the output [Default: False] -
-o, --output <OUTPUT>
: Path to output file [Default: STDOUT] -
-t, --transcript-version
: Indicates that '.'-delimited transcript version is present in col1 and should be considered during annotation [default: False]
- A GTF file containing transcript annotations (GTF2.2 format)
- An input file with transcriptomic coordinates in BED format
The input BED file must contain at least the following columns:
- Column 1: Transcript ID
- Column 2: Start coordinate (0-based)
- Column 3: End coordinate (0-based, half-open)
Additional metadata columns (e.g., feature labels, stoichiometry, probability, motifs) can be included from column 4 onwards. These fields will be preserved in the annotate
output.
A tab-separated file containing the original input data along with additional annotation columns:
- gene_id
- gene_name
- transcript_biotype
- tx_len (transcript length)
- cds_start
- cds_end
- tx_end
- transcript_metacoordinate
- abs_cds_start
- abs_cds_end
- up_junc_dist (upstream junction distance)
- down_junc_dist (downstream junction distance)
The metatranscript coordinate is a normalized position within the transcript, calculated as follows:
- For positions in the 5' UTR:
position / 5' UTR length
- For positions in the CDS:
1 + (position - 5' UTR length) / CDS length
- For positions in the 3' UTR:
2 + (position - 5' UTR length - CDS length) / 3' UTR length
This results in a value between 0 and 3, where:
- 0 represents the transcript start
- 1 represents the CDS start
- 2 represents the CDS end
- 3 represents the transcript end
The algorithm calculates the distance to the nearest upstream and downstream splice sites for each position. This provides valuable information about the proximity of RNA features to exon-intron boundaries.
- For annotation purposes, the feature strand is always assumed to be positive.
- The
start
coordinate of the feature (column 2 in the input BED file) is used to annotate the RNA feature and assign a metatranscript coordinate.
R2Dtool annotate can process input files with or without headers. When headers are present, the -H
flag must be used. This flag ensures that:
- The input header is preserved in the output.
- New column names are added for the additional annotation data.
- The output is compatible with downstream R2Dtool metaplot functions.
It's highly recommended to use headers and the -H
flag, especially if you plan to use R2Dtool's visualization features later.
R2Dtool annotate can be performed before, but not after, the liftover process. This allows users to annotate their transcriptomic data with valuable context before converting to genomic coordinates if needed.
The implementation uses parallel processing (via Rayon), which improves performance for large datasets. We recommend using R2Dtool in an environment with at least 4 CPU cores and 12GB of free memory.
The algorithm handles missing or incalculable data points by using "NA" values, ensuring that the output remains consistent even when complete information is not available for all transcripts.
- Parse GTF file to create transcript annotations
- Generate splice site information for all transcripts
- Process input file line by line: a. Extract transcript ID and coordinate b. Retrieve transcript information c. Calculate various metrics (CDS positions, metacoordinates, etc.) d. Determine distances to nearest splice sites e. Write annotated information to output
Algorithm: run_annotate
Input:
- matches: command-line argument matches
- has_header: boolean indicating if input file has a header
- has_version: boolean indicating if transcript IDs include version numbers
Output:
- Annotated file with additional transcript information
1. Extract gtf_file, input_file, and output_file from matches
2. annotations ← read_annotation_file(gtf_file, true, has_version)
3. splice_sites ← generate_splice_sites(annotations)
4. Open input_file for reading
5. Open output_file (or stdout) for writing
6. If has_header:
Read and process header, adding new column names
7. For each line in input_file:
a. Split line into fields
b. transcript_id ← Extract from fields[0] (remove version if necessary)
c. tx_coord ← Parse fields[1] as integer
d. If transcript_id exists in annotations:
i. Retrieve transcript information
ii. Calculate tx_len, cds_start, cds_end, tx_end
iii. (rel_pos, abs_cds_start, abs_cds_end) ← calculate_meta_coordinates(tx_coord, utr5_len, cds_len, utr3_len)
iv. (up_junc_dist, down_junc_dist) ← splice_site_distances(tx_coord, splice_sites[transcript_id])
v. Construct output line with all calculated values
e. Else:
Construct output line with "NA" for all additional fields
f. Write constructed line to output_file
8. Close input and output files
9. Remove temporary splice sites file
Return: Success or error status
- The algorithm handles both versioned and non-versioned transcript IDs.
- It calculates transcript metacoordinates to provide context within the transcript structure.
- Splice site distances are computed to give information about proximity to exon-intron boundaries.
- The implementation uses parallel processing (Rayon) for generating splice site information, which improves performance for large datasets.
- Error handling and "NA" values are used for missing or incalculable data points.
Complete source code for R2Dtool annotate is available at https://github.com/comprna/R2Dtool/blob/main/src/annotate.rs