Further information on r2d liftover

pre-mRNA edited this page Jul 1, 2024 · 2 revisions

R2Dtool liftover converts coordinates in transcriptomic space to the corresponding coordinates in genomic space.

Usage

Run liftover with the following command:

r2d liftover -i <input> -g <gtf> [-H] [-o <output>] [-t]

Arguments:

  • -i, --input <input>: Path to tab-separated transcriptome sites in BED format.
  • -g, --gtf <annotation>: Path to gene structure annotation in GTF format.

Options:

  • -H, --header: Indicates the input file has a header, which will be preserved in the output [Default: False]
  • -o, --output <OUTPUT>: Path to output file [Default: STDOUT]
  • -t, --transcript-version: Indicates that '.'-delimited transcript version information is present in col1 and should be considered during liftover [Default: False]

Example:

r2d liftover -H -g ./test/GRCh38.110_subset.gtf -i ./test/m6A_isoform_sites_GRCh38_subset.bed > ./test/liftover.bed

Input Format

The input file for liftover should be in a tab-separated BED3+ format. The first three columns are required and should be formatted as follows:

  1. Column 1: Transcript ID
  2. Column 2: Start position (0-based)
  3. Column 3: End position (0-based, exclusive)

Any additional columns (4 onwards) will be preserved in the output.

Example input:

transcript         start  end   base  coverage  strand  N_valid_cov  fraction_modified
ENST00000381989.4  2682   2683  a     10        +       10           0.00
ENST00000381989.4  2744   2745  a     10        +       10           0.00

Note: Only columns 1-3 are required for the liftover. Because this example file starts with a header row, the -H flag must be passed to r2d liftover.
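
As a minimal sketch (not part of R2Dtool), an input line can be checked against the BED3+ layout described above before running liftover. The function name check_bed3_line is hypothetical; the checks only restate the column rules from this section.

```python
def check_bed3_line(line):
    """Return (transcript_id, start, end, extra_columns) or raise ValueError.

    Hypothetical helper for illustration; not part of R2Dtool.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(
            f"expected at least 3 tab-separated columns, got {len(fields)}"
        )
    # int() raises ValueError itself if a coordinate is not an integer.
    transcript_id, start, end = fields[0], int(fields[1]), int(fields[2])
    # BED intervals are 0-based and half-open, so end must exceed start.
    if start < 0 or end <= start:
        raise ValueError(f"invalid 0-based half-open interval: {start}-{end}")
    return transcript_id, start, end, fields[3:]


row = "ENST00000381989.4\t2682\t2683\ta\t10\t+\t10\t0.00"
print(check_bed3_line(row))
```

When the file has a header row, skip it before applying a check like this. Note also that the pseudocode in the Algorithm section skips lines with fewer than four fields, so a file with only three columns may not be lifted; this sketch does not enforce that.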

Output Format

The liftover function prepends six columns to each input row, containing the genomic coordinates of the transcript feature in BED format (chromosome, start, end, name, score, strand). The name and score columns are left empty. All data from the original input are preserved in the output, shifted right by six columns.

Example output:

chromosome  start     end       name  score  strand  transcript         start  end   base  coverage  strand  N_valid_cov  fraction_modified
13          24455165  24455166               -       ENST00000381989.4  2682   2683  a     10        +       10           0.00
13          24435272  24435273               -       ENST00000381989.4  3941   3942  a     20        +       20           80.00
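
For downstream processing, a liftover output row can be split back into its genomic part and the original input columns. The following is a sketch under the column layout shown above; split_liftover_line and GENOMIC_COLUMNS are hypothetical names, not part of R2Dtool.

```python
# Names of the six prepended columns, as shown in the example output header.
GENOMIC_COLUMNS = ("chromosome", "start", "end", "name", "score", "strand")


def split_liftover_line(line):
    """Split one output row into (genomic_fields, original_columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(GENOMIC_COLUMNS):
        raise ValueError("fewer than 6 columns; not a liftover output row")
    genomic = dict(zip(GENOMIC_COLUMNS, fields[:6]))
    genomic["start"] = int(genomic["start"])
    genomic["end"] = int(genomic["end"])
    # Everything after column 6 is the untouched input row.
    return genomic, fields[6:]


line = (
    "13\t24455165\t24455166\t\t\t-\t"
    "ENST00000381989.4\t2682\t2683\ta\t10\t+\t10\t0.00"
)
genomic, original = split_liftover_line(line)
print(genomic["chromosome"], genomic["start"], genomic["end"], genomic["strand"])
print(original[:3])
```

Splitting on tabs rather than on arbitrary whitespace matters here, because the empty name and score fields would otherwise be collapsed.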

Header Handling

  • By default, R2Dtool assumes that input files do not have headers.
  • If your input file includes a header row, you must use the -H or --header flag when running the liftover command.
  • When the -H flag is used, R2Dtool will preserve the input header and add column names for the new genomic coordinate columns in the output.

Adapting Upstream Data

To use R2Dtool liftover with data from various upstream methods:

  1. Ensure your data is in a tab-separated format.
  2. The first column must contain the transcript ID.
  3. The second and third columns must contain the start and end positions of the feature in transcriptomic coordinates (0-based, half-open).
  4. Any additional columns with metadata can follow and will be preserved in the output.
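  5. If your transcript IDs include version numbers (e.g., ENST00000381989.4) and the version should be considered when matching transcripts, use the -t or --transcript-version flag. Per the pseudocode in the Algorithm section, without this flag everything after the first '.' is dropped before the annotation lookup.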

By following these guidelines, you can adapt output from various RNA feature detection methods (e.g., m6A site prediction, RNA editing site detection, RNA-protein interaction site prediction) for use with R2Dtool liftover.
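
As an illustration of these guidelines, the sketch below converts records from a hypothetical upstream tool that reports 1-based transcript positions into BED3+ rows. The record layout (transcript ID, 1-based position, then metadata) and the function name sites_to_bed are assumptions made for this example; adjust them to the actual upstream format.

```python
import csv
import io


def sites_to_bed(records, out):
    """Write (transcript_id, pos_1based, *metadata) records as BED3+ rows.

    Hypothetical helper; the input record layout is an assumption.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for transcript_id, pos_1based, *metadata in records:
        start = pos_1based - 1  # 0-based start
        end = pos_1based        # exclusive end, so the interval covers one base
        writer.writerow([transcript_id, start, end, *metadata])


buf = io.StringIO()
sites_to_bed([("ENST00000381989.4", 2683, "a", 10)], buf)
print(buf.getvalue(), end="")
```

If the upstream tool already reports 0-based starts, keep the start as-is and set the end to start + 1 for single-base sites.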

Algorithm

We provide pseudocode for the convert_transcriptomic_to_genomic_coordinates function that is invoked by the r2d liftover command:

Algorithm: convert_transcriptomic_to_genomic_coordinates
Inputs: 
  - site_fields: array of strings containing transcriptomic information
  - annotations: hashmap of transcript annotations
  - has_version: boolean indicating if transcript IDs include version numbers
Output: 
  - String containing genomic coordinates or None if conversion fails

1. If length of site_fields < 4:
    Return None
2. transcript_id_with_version ← site_fields[0]
3. If has_version:
    transcript_id ← transcript_id_with_version
   Else:
    transcript_id ← Split transcript_id_with_version by '.' and take first part
4. position ← Parse site_fields[1] as integer
5. current_position ← 0
6. If transcript_id exists in annotations:
    transcript ← annotations[transcript_id]
    exons ← transcript.exons.clone()
    
    7. Sort exons:
       If transcript.strand = '-':
           Sort exons in descending order by start position
       Else:
           Sort exons in ascending order by start position
    
    8. For each exon_data in sorted exons:
        exon_length ← exon_data.end - exon_data.start + 1
        
        9. If current_position + exon_length > position:
            If transcript.strand = '+':
                genomic_position ← position - current_position + exon_data.start - 1
            Else:
                genomic_position ← exon_data.end - (position - current_position) - 1
            
            10. chrom ← transcript.chromosome
                genomic_strand ← transcript.strand
                additional_columns ← Join site_fields[1:] with tab separator
            
            11. Return formatted string:
                "{chrom}\t{genomic_position}\t{genomic_position + 1}\t\t\t{genomic_strand}\t{transcript_id_with_version}\t{additional_columns}"
        
        12. current_position ← current_position + exon_length
13. Print warning: "Warning: No associated transcripts found for site '{transcript_id}'."
14. Return None

Note: The actual implementation processes lines in parallel for improved performance.
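
The pseudocode above can be sketched in Python as follows. This is an illustrative, single-threaded translation, not the Rust implementation: the function name liftover_site, the dictionary layout of the annotations, and the toy transcripts TX1 and TX2 are all made up for this example. Exon coordinates are assumed to be 1-based and inclusive, as in GTF.

```python
import sys


def liftover_site(site_fields, annotations, has_version=False):
    """Map one transcriptomic site to a genomic BED row, or return None."""
    if len(site_fields) < 4:
        return None
    tx_with_version = site_fields[0]
    # With has_version the full ID is used as the lookup key; otherwise
    # everything after the first '.' is dropped.
    tx_id = tx_with_version if has_version else tx_with_version.split(".")[0]
    position = int(site_fields[1])

    transcript = annotations.get(tx_id)
    if transcript is not None:
        # Walk exons in transcript order: descending on '-', ascending otherwise.
        exons = sorted(
            transcript["exons"],
            key=lambda exon: exon[0],
            reverse=(transcript["strand"] == "-"),
        )
        consumed = 0  # transcript bases covered by exons already passed
        for start, end in exons:
            length = end - start + 1
            if consumed + length > position:
                offset = position - consumed
                if transcript["strand"] == "+":
                    gpos = start + offset - 1
                else:
                    gpos = end - offset - 1
                return "\t".join([
                    transcript["chromosome"], str(gpos), str(gpos + 1),
                    "", "", transcript["strand"],
                    tx_with_version, *site_fields[1:],
                ])
            consumed += length
    print(f"Warning: No associated transcripts found for site '{tx_id}'.",
          file=sys.stderr)
    return None


# Toy annotation (made-up coordinates): two 10-base exons per transcript.
annotations = {
    "TX1": {"chromosome": "chr1", "strand": "+", "exons": [(100, 109), (200, 209)]},
    "TX2": {"chromosome": "chr1", "strand": "-", "exons": [(100, 109), (200, 209)]},
}
print(liftover_site(["TX1", "12", "13", "a"], annotations))
print(liftover_site(["TX2", "12", "13", "a"], annotations))
```

In this toy example, transcript position 12 falls two bases into the second exon in transcript order. On the + strand that gives the 0-based genomic interval 201-202; on the - strand, where the second exon in transcript order is (100, 109) read from its end, it gives 106-107.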

Original source code for the liftover function is available at https://github.com/comprna/R2Dtool/blob/main/src/liftover.rs