-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
76 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Shuffle Regions | ||
|
||
Description | ||
----------- | ||
This script is used to compare sets of genomic intervals and determine if an annotation of interest overlaps more with an annotated set of conserved codons than would be expected by chance. | ||
|
||
* Compare a set of test intervals and a set of conserved codons (both in bed format) and calculate: | ||
* The number of conserved intervals which overlap with at least one test interval | ||
* The number of test intervals which overlap at least one conserved interval | ||
* The total number of nucleotides of overlap between the conserved and test intervals | ||
* Create a specified number of random sets of intervals within the coding seqeuence with the same size distribution as the test intervals | ||
* Calculate the above statistics again for these shuffled intervals | ||
* Calculate and tabulate the percentile of the original result vs the shuffled results | ||
* Plot the original and shuffled results as a histogram | ||
|
||
Usage | ||
----- | ||
```shuffle_regions.py --cds_file cds.bed --genome_file genome.bed --conserved_codon_file conserved.bed --test_file myintervals.bed --outfile_stem out --n_shuffles 1000``` | ||
|
||
|
||
This statement would create 1000 shuffled datasets, each with a set of intervals where the length distribution and number of intervals is equal to those in myintervals.bed, all of which are fully within the coding sequence defined in cds.bed. | ||
The number of 1) intervals in conserved.bed which overlap with at least one interval in myintervals.bed, 2) intervals in myintervals.bed which overlap with at least one interval in conserved.bed and 3) nucleotides of overlap between conserved.bed and myintervals.bed would then be calculated. | ||
These would be compared to the same three statistics comparing conserved.bed and the intervals in the shuffled bed files. | ||
The results would be shown in a table in out_shuffled_results.tsv and plotted as histograms in out_distribution_total_overlap.png. | ||
|
||
Parameters | ||
---------- | ||
`--cds_file` | ||
|
||
Path to bed file containing co-ordinates of coding sequences | ||
|
||
`--genome_file` | ||
|
||
Path to bed file containing co-ordinates of the whole genome | ||
|
||
`--conserved_codon_file` | ||
|
||
Path to bed file containing the conserved codons of interest | ||
|
||
`--test_file` | ||
|
||
Path to bed file containing the regions you want to intersect with the conserved regions | ||
|
||
`--plot_color` | ||
|
||
Colour for histograms | ||
|
||
`--outfile_stem` | ||
|
||
Output file prefix | ||
|
||
`--n_shuffles` | ||
|
||
Number of times to shuffle the intervals | ||
|
||
`--min_overlap` | ||
|
||
Minimum proportion of interval to be in the overlap to classify a region as overlapping | ||
|
||
|
||
Output | ||
------ | ||
`xxx_shuffled_results.tsv` | ||
This table has 6 columns | ||
|
||
|
||
* `ID` - row ID, either 'orig' for the original test file or shuffle_x for each shuffled dataset | ||
* `N_Conserved_in_Test` - the number of conserved intervals which overlap with at least one test interval (or shuffled interval) by at least fraction `min_overlap`. | ||
* `Prop_Conserved_in_Test`- the proportion of conserved intervals which overlap with at least one test interval (or shuffled interval) by at least fraction `min_overlap`. | ||
* `N_Test_in_Conserved` - the number of test intervals (or shuffled intervals) which overlap with at least one conserved interval by at least fraction `min_overlap`. | ||
* `Prop_Test_in_Conserved` - the proportion of test intervals (or shuffled intervals) which overlap with at least one conserved interval by at least fraction `min_overlap`. | ||
* `Total_Overlap` - the total number of nucleotides covered by both the test (or shuffled) intervals and the conserved intervals | ||
|
||
`xxx_distribution_total_overlap.png` | ||
|
||
Three histograms showing the comparison between the test intervals and the shuffled intervals for (left to right) 1) the proportion of conserved intervals which overlap with at least one test interval, 2) the proportion of test intervals which overlap with at least one conserved interval, 3) the total number of nucleotides which overlap between the conserved and test intervals. |