-
Notifications
You must be signed in to change notification settings - Fork 22
lift
The lift program does quite a useful thing when one needs to use data generated for an old genome assembly and doesn't necessarily have the time or the means to go through the steps of generating the equivalent data for a newer genome assembly. Lifting data is the process of converting the coordinates where that data lies, from one assembly to another through sequence alignment data. UCSC creates special alignment files for this purpose called liftOver chains and makes them available on their download page. Lifting is not a perfect process, but if the regions lifted are relatively short, usually only a small percent are not mappable through the chain. In the case of bigWig data, each base is mapped one by one, and is only mapped to the destination genome if it maps to a new locus unambiguously. Usage:
bwtool lift - project data base-by-base into a new assembly using
a liftOver chain file from UCSC.
usage:
bwtool lift old.bw[:chr:start-end] oldToNew.liftOver.chain.gz new.bw
options:
-sizes=new.sizes if set use the chrom.sizes file specified instead of
gathering the size information from the chain.
This is one way to restrict the chromosomes lifted to
in the output.
-unlifted=file.bed save all the regions from the input not lifted
With the classic example:
Suppose that the homologous region in another species has two parts of the original region mapped to it. The first has a gap in the new region, while the second is inverted. A chain file describing such a transition is (never mind the score field as 100... bwtool ignores it in this case):
chain 100 chr 36 + 0 8 chr 19 + 0 10 1
3 0 2
5
chain 100 chr 36 + 10 20 chr 19 - 0 10 2
10
For more information on chain files, see this page. The trickiest thing about the chain format is that the coordinates for the + and - strands are different: they always start at the 5' end of the strand. So above, for Chain 2 the start/end on the - strand is the first 10 bases in the new region while if it weren't inverted it would be bases 9-19. Taken all together, the lift is represented in the following diagram:
The main thing to note is that the 10th base in the new region is overlapped by both the first and the second chain. Although a decision could be made to use one of the original bases: 5.0 in the case of Chain 1, or 2.0 in the case of Chain 2, bwtool instead removes this data. In practice, this doesn't tend to occur very often. liftOver chains generally don't have much overlap, but there's no specific rule that prohibits it. When running bwtool, we get:
$ bwtool lift main.bw to_new_transposon_and_gap.chain new_main.bw -unlifted=bad.bed
$ bigWigToWig new_main.bw /dev/stdout
fixedStep chrom=chr start=1 step=1 span=1
1
2
5
fixedStep chrom=chr start=6 step=1 span=1
6
5
3
3
fixedStep chrom=chr start=11 step=1 span=1
4
4
10
3
3
2
0
6
6
We can further examine the mapping problems for each base in the original region by looking at the extra file generated:
$ cat bad.bed
chr 7 multi_mapped_chr_9
chr 8 deleted_in_destination
chr 9 deleted_in_destination
chr 19 multi_mapped_chr_9
chr 20 deleted_in_destination
chr 21 deleted_in_destination
chr 22 deleted_in_destination
chr 27 deleted_in_destination
chr 28 deleted_in_destination
chr 29 deleted_in_destination
chr 30 deleted_in_destination
chr 31 deleted_in_destination
chr 32 deleted_in_destination
chr 33 deleted_in_destination
chr 34 deleted_in_destination
chr 35 deleted_in_destination
For this example, we'll start with a larger example than usual. Inside the "Stanf Nucleosome" track on the hg19/GRCh37 UCSC Genome Browser, there is a sample from K562. With just this and the UCSC Genes track turned on, and navigating to the SOD1 locus (chr21:33,031,935-33,041,243), the screenshot looks like:
The Stanford Nucleosome track isn't available in the UCSC browser for the previous genome assembly (March 2006 hg18/NCBI36). To convert the track's data to hg18 coordinates, we need the original bigWig and a liftOver chain file that describes the hg19 to hg18 alignment. These files are available here and here respectively. To convert this data we use the command:
$ time bwtool lift wgEncodeSydhNsomeK562Sig.bigWig hg19ToHg18.over.chain.gz wgEncodeSydhNsomeK562Sig.hg18.bigWig -bad=wgEncodeSydhNsomeK562Sig.unlifted.bed -decimals=1
real 152m51.541s
user 149m31.624s
sys 1m27.823s
As you can see, lifting is a heavy process and a genome-wide dataset will take several hours to run. Like all of the bigWig-making commands, lift is memory-intensive. Attempting to make a genome-wide bigWig on a computer with under 40GB of RAM perhaps will not work. The lift procedure itself is quite memory-intensive as well, but overall the limiting step is the bigWig creation step. The result can be added as a custom track on the hg18 UCSC genome browser:
In terms of chromosomal coordinates, the SOD1 locus is over one megabase from the region it was located in hg19, although it is still 9,309 bases long. The nucleosome data is also shifted and aligned to the gene in the same way, showing that the lift was largely successful. At this zoom level, it doesn't appear that any data was lost around SOD1 in the conversion. To look more closely, we can examine the file containing regions from the original bigWig that did not lift:
$ grep chr21 wgEncodeSydhNsomeK562Sig.unlifted.bed | awk '{OFS="\t";print $1, $2, $2+1;}' > chr21.bed
$ wc -l chr21.bed
290066 chr21.bed
$ printf "chr21\t33031934\t33041243\tSOD1\n" > sod1.bed
$ bedtools intersect -a chr21.bed -b sod1.bed | wc -l
0
Even though there are 290,066 bases of data from the original bigWig that were lost in the conversion to hg18, none were from the SOD1 locus.