Further information on r2d liftover
R2Dtool liftover rapidly converts coordinates mapped in transcriptomic space to coordinates mapped in genomic space.
Run liftover with the following command:
r2d liftover -i <input> -g <gtf> [-H] [-o <output>] [-t]
- -i, --input <input>: Path to tab-separated transcriptome sites in BED format.
- -g, --gtf <annotation>: Path to gene structure annotation in GTF format.
- -H, --header: Indicates the input file has a header, which will be preserved in the output [Default: False].
- -o, --output <OUTPUT>: Path to output file [Default: STDOUT].
- -t, --transcript-version: Indicates that '.'-delimited transcript version information is present in col1 and should be considered during liftover [Default: False].
r2d liftover -H -g ./test/GRCh38.110_subset.gtf -i ./test/m6A_isoform_sites_GRCh38_subset.bed > ./test/liftover.bed
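When liftover is called from a pipeline, the same command can be assembled programmatically. The following is a minimal sketch, assuming `r2d` is on the PATH; `build_liftover_command` is a hypothetical helper written for this illustration, not part of R2Dtool, and it uses only the flags listed above.

```python
from typing import List, Optional


def build_liftover_command(
    input_bed: str,
    gtf: str,
    header: bool = False,
    transcript_version: bool = False,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for `r2d liftover` (hypothetical helper)."""
    cmd = ["r2d", "liftover", "-i", input_bed, "-g", gtf]
    if header:
        cmd.append("-H")  # input file has a header row
    if transcript_version:
        cmd.append("-t")  # consider '.'-delimited transcript versions
    if output is not None:
        cmd += ["-o", output]  # without -o, results go to STDOUT
    return cmd


cmd = build_liftover_command(
    "./test/m6A_isoform_sites_GRCh38_subset.bed",
    "./test/GRCh38.110_subset.gtf",
    header=True,
    output="./test/liftover.bed",
)
print(" ".join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.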
The input file for liftover must be tab-separated and in BED3+ format. The first three columns are required and must be formatted as follows:
- Column 1: Transcript ID
- Column 2: Start position (0-based)
- Column 3: End position (0-based, exclusive)
Any additional columns (4 onwards) will be preserved in the output.
Example input:
transcript start end base coverage strand N_valid_cov fraction_modified
ENST00000381989.4 2682 2683 a 10 + 10 0.00
ENST00000381989.4 2744 2745 a 10 + 10 0.00
Note: Only columns 1-3 are essential for the liftover. Because this example includes a header row, the -H flag must be passed to r2d liftover.
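Before running liftover, it can help to check that each site line follows the layout described above. The sketch below is an illustration only: `check_bed3_plus_line` is a hypothetical helper, not part of R2Dtool, and it tests just the three required columns. Note that the pseudocode further down this page returns None for lines with fewer than four fields, so whether a strict three-column file is accepted is not confirmed here.

```python
from typing import List


def check_bed3_plus_line(line: str) -> List[str]:
    """Return the tab-split fields of one site line, or raise ValueError."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated columns")
    transcript_id = fields[0]
    start, end = int(fields[1]), int(fields[2])  # int() raises ValueError on non-integers
    if not transcript_id:
        raise ValueError("column 1 (transcript ID) is empty")
    if start < 0 or end <= start:
        raise ValueError("need 0 <= start < end (0-based, end-exclusive)")
    return fields


# First data row of the example input, with explicit tab separators.
fields = check_bed3_plus_line("ENST00000381989.4\t2682\t2683\ta\t10\t+\t10\t0.00")
print(len(fields), fields[:3])
```

Header rows should be skipped before calling the checker, since a column name such as "start" would fail the integer conversion.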
The liftover function prepends six columns to each input row, containing the genomic coordinates of the transcript feature in BED format (chromosome, start, end, name, score, strand). The name and score columns are left empty. All data from the original input are preserved in the output and shifted right by six columns.
Example output:
chromosome start end name score strand transcript start end base coverage strand N_valid_cov fraction_modified
13 24455165 24455166 - ENST00000381989.4 2682 2683 a 10 + 10 0.00
13 24435272 24435273 - ENST00000381989.4 3941 3942 a 20 + 20 80.00
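Downstream scripts often need to separate the prepended genomic columns from the original site columns. The following is a minimal sketch, assuming the output is tab-separated with six leading columns and empty name and score fields; `split_liftover_row` is a hypothetical helper written for this illustration.

```python
from typing import Dict, List, Tuple, Union


def split_liftover_row(line: str) -> Tuple[Dict[str, Union[str, int]], List[str]]:
    """Split one liftover output row into genomic fields and original columns."""
    # Strip only the newline: trailing tabs may be meaningful empty fields.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 genomic columns")
    chrom, start, end, _name, _score, strand = fields[:6]
    genomic = {"chrom": chrom, "start": int(start), "end": int(end), "strand": strand}
    return genomic, fields[6:]


# First data row of the example output, with explicit tabs and empty name/score.
row = "13\t24455165\t24455166\t\t\t-\tENST00000381989.4\t2682\t2683\ta\t10\t+\t10\t0.00"
genomic, original = split_liftover_row(row)
print(genomic)
print(original)
```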
- By default, R2Dtool assumes that input files do not have headers.
- If your input file includes a header row, you must use the -H or --header flag when running the liftover command.
- When the -H flag is used, R2Dtool will preserve the input header and add column names for the new genomic coordinate columns in the output.
To use R2Dtool liftover with data from various upstream methods:
- Ensure your data is in a tab-separated format.
- The first column must contain the transcript ID.
- The second and third columns must contain the start and end positions of the feature in transcriptomic coordinates (0-based, half-open).
- Any additional columns with metadata can follow and will be preserved in the output.
- If your transcript IDs include version numbers (e.g., ENST00000381989.4) that should be considered when matching against the annotation, use the -t or --transcript-version flag.
By following these guidelines, you can adapt output from various RNA feature detection methods (e.g., m6A site prediction, RNA editing site detection, RNA-protein interaction site prediction) for use with R2Dtool liftover.
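As an illustration of these guidelines, the sketch below converts a table from a hypothetical upstream tool into BED3+ rows. The column names `transcript_id`, `position`, and `probability`, and the assumption that `position` is 1-based, are invented for this example; check the coordinate convention of your own tool before adapting it.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def to_bed3_plus(rows: Iterable[Dict[str, str]], id_key: str, pos_key: str) -> Iterator[List[str]]:
    """Yield BED3+ rows from records that carry a 1-based transcript position."""
    for row in rows:
        pos = int(row[pos_key])
        start, end = pos - 1, pos  # 1-based position -> 0-based, half-open interval
        extras = [v for k, v in row.items() if k not in (id_key, pos_key)]
        yield [row[id_key], str(start), str(end), *extras]


# Hypothetical upstream output: one site per line, 1-based position.
upstream = "transcript_id\tposition\tprobability\nENST00000381989.4\t2683\t0.91\n"
reader = csv.DictReader(io.StringIO(upstream), delimiter="\t")
bed_rows = list(to_bed3_plus(reader, "transcript_id", "position"))

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["transcript", "start", "end", "probability"])  # header, so pass -H
writer.writerows(bed_rows)
print(out.getvalue(), end="")
```

Because the written file starts with a header row, it would need to be passed to liftover with the -H flag.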
We provide pseudocode for the convert_transcriptomic_to_genomic_coordinates function that is invoked by the r2d liftover command:
Algorithm: convert_transcriptomic_to_genomic_coordinates

Inputs:
    - site_fields: array of strings containing transcriptomic information
    - annotations: hashmap of transcript annotations
    - has_version: boolean indicating if transcript IDs include version numbers

Output:
    - String containing genomic coordinates or None if conversion fails

1. If length of site_fields < 4:
       Return None
2. transcript_id_with_version ← site_fields[0]
3. If has_version:
       transcript_id ← transcript_id_with_version
   Else:
       transcript_id ← Split transcript_id_with_version by '.' and take first part
4. position ← Parse site_fields[1] as integer
5. current_position ← 0
6. If transcript_id exists in annotations:
       transcript ← annotations[transcript_id]
       exons ← transcript.exons.clone()
   7. Sort exons:
          If transcript.strand = '-':
              Sort exons in descending order by start position
          Else:
              Sort exons in ascending order by start position
   8. For each exon_data in sorted exons:
          exon_length ← exon_data.end - exon_data.start + 1
      9. If current_position + exon_length > position:
             If transcript.strand = '+':
                 genomic_position ← position - current_position + exon_data.start - 1
             Else:
                 genomic_position ← exon_data.end - (position - current_position) - 1
         10. chrom ← transcript.chromosome
             genomic_strand ← transcript.strand
             additional_columns ← Join site_fields[1:] with tab separator
         11. Return formatted string:
             "{chrom}\t{genomic_position}\t{genomic_position + 1}\t\t\t{genomic_strand}\t{transcript_id_with_version}\t{additional_columns}"
      12. current_position ← current_position + exon_length
13. Print warning: "Warning: No associated transcripts found for site '{transcript_id}'."
14. Return None
Note: The actual implementation processes lines in parallel for improved performance.
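To make the exon walk concrete, here is a Python sketch that follows the pseudocode step by step. It is an illustration under stated assumptions, not the Rust implementation: the `Transcript` class, the function name `liftover_site`, and the toy transcript `TX1` are hypothetical, exon coordinates are assumed to be 1-based and inclusive as in GTF, and the warning is written to stderr.

```python
import sys
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Transcript:
    chromosome: str
    strand: str
    exons: List[Tuple[int, int]]  # (start, end), 1-based inclusive as in GTF


def liftover_site(fields: List[str], annotations: Dict[str, Transcript],
                  has_version: bool = False) -> Optional[str]:
    """Convert one transcriptomic site to a genomic BED-style line."""
    if len(fields) < 4:
        return None
    tx_with_version = fields[0]
    tx_id = tx_with_version if has_version else tx_with_version.split(".")[0]
    position = int(fields[1])  # 0-based offset along the spliced transcript
    transcript = annotations.get(tx_id)
    if transcript is not None:
        # Walk exons in transcription order: descending starts on '-' strand.
        reverse = transcript.strand == "-"
        exons = sorted(transcript.exons, key=lambda e: e[0], reverse=reverse)
        consumed = 0  # transcript bases covered by the exons visited so far
        for start, end in exons:
            length = end - start + 1
            if consumed + length > position:
                offset = position - consumed
                if transcript.strand == "+":
                    g = start + offset - 1  # -1 converts 1-based GTF to 0-based
                else:
                    g = end - offset - 1
                extra = "\t".join(fields[1:])
                return (f"{transcript.chromosome}\t{g}\t{g + 1}\t\t\t"
                        f"{transcript.strand}\t{tx_with_version}\t{extra}")
            consumed += length
    print(f"Warning: No associated transcripts found for site '{tx_id}'.",
          file=sys.stderr)
    return None


# Toy annotation (hypothetical): two 10-base exons on each strand.
annotations = {
    "TX1": Transcript("chr1", "+", [(101, 110), (201, 210)]),
    "TX2": Transcript("chr1", "-", [(101, 110), (201, 210)]),
}
plus = liftover_site(["TX1.1", "12", "13", "a"], annotations)
minus = liftover_site(["TX2.1", "12", "13", "a"], annotations)
print(plus.split("\t"))
# ['chr1', '202', '203', '', '', '+', 'TX1.1', '12', '13', 'a']
print(minus.split("\t")[:3])
# ['chr1', '107', '108']
```

In both toy cases, transcript position 12 falls two bases into the second exon in transcription order. On the plus strand this is 1-based genomic base 203 (0-based start 202); on the minus strand the second exon is read from 110 downwards, giving 1-based base 108 (0-based start 107).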
Original source code for the liftover function is available at https://github.com/comprna/R2Dtool/blob/main/src/liftover.rs