Skip to content

Scaling BAM Files from DNAseq Data

pongorlorinc edited this page May 1, 2019 · 2 revisions

BAMscale can scale one (or more) BAM files to generate coverage tracks ready for visualization. This step generally takes ~3-5 minutes for standard ATAC-seq/ChIP-seq data, and up to ~20 minutes for genome sequencing data (100Gb in total size) using 4 threads.

There is one main parameter here: BAM file(s).

The output of the two command is a BigWig file. BAMscale "scale" function appends the ".bw" suffix to the end of the input file. For a BAM file named "sample.bam" the output will be:

sample.bam.bw

The full set of available parameters:

Usage: BAMcompare scale [OPTIONS] --bam <BAM_1> (--bam <BAM_2> ... --bam <BAM_N>)

Output: Coverage tracks in BigWig format (un-scaled, scaled, genome scaled)

Required options:
	--bam|-i <file>		Input BAM file. This has to be specified at least two times.

Library options:
	--libtype|-l <str>	Sequencing type to be used. Can be: single, paired, and auto (default: autodetect)
	--frag|-f <flag>	Compute coverage using fragments instead of reads (default: no)
	--fragsize|-a <int>	Fragment size to be used to extend single-end library reads

Normalization, scaling and operation type:
	--normtype|-y <str>	Type of normalization. (default: base)
				If no normalization is needed, set '--scale no' argument, the program will disregard this option.
				Options: 
				  1) reads: No. of mapped reads/fragments
				  2) base: Sum of per-base coverage of reads/fragments

	--scale|-k <str>	Method to scale samples together. (default: genome)
				Options are: 
				  1) no: no scaling, just calculate coverage
				  2) smallest: scale reads to smallest library (multiple-samples only)
				  3) genome: scale samples to 1x genome coverage (only possible with 'base' normalization type)


	--operation|-r <str>    Operation to perform when scaling samples. Default: scaled
                            Options are: 
                              1) scaled: output scaled tracks.
                              2) unscaled: do not scale files in any way.
                              2) log2: log2 transform against first BAM file.
                              3) ratio: coverage ratio against first BAM file.
                              4) subtract: subtract coverage against first BAM file.
                              5) rfd: OK-seq RFD calculation
                              6) endseq: strand-specific coverages
                              6) endseqr: strand-specific coverages (reverse strand score is negative)
                              7) reptime: replication timing mode for two BAM files (binsize: 100bp, smoothen: 500 bins)

                            Short description of settings:
                            endseq: generates scaled coverage tracks of positive/negative strands,
                                    and the log2 ratios

                            endseqr: generates scaled coverage tracks of positive/negative strands,
                                    the negative strand coverage will be negative, and the log2 ratios are calculated

                            reptime: generates scaled coverage tracks and log2 ratios of two BAM files,
                                    setting the binsize to 100bp and smoothening smoothen to 500 bins


	--binsize|-z <int>	Size of bins for output bigWig/bedgraph generation (default: 5)

Sequencing coverage computation options:
	--seqcov|-e <int>	Compute sequencing coverage from BAM file. (default: '1', count reads while parsing BAM)
				Options are: 
				  1) 0: use reads in index (only if normalization is set to 'reads')
				  2) 1: count reads while parsing BAM(s)
				WARNING: this option is only useful when 'reads' are used for normalization

	--blacklist|-c <file>	Input file with list of chromosomes to blacklist during scaling analysis

	--bedsubtract|-u <int>	BED file with regions to subtract when computing coverage for normalization
				These coordinates should not overlap so reads are not counted multiple times

	--smoothen|-j <int>	Smoothen signal by calculating mean of N bins flanking both sides of each bin (default: 0)
				If set to '0', the signal is not smoothened. To turn on specify a value greater than '0'.
				For replication timing, a good value is to smoothen to 100k bases. If binSize is 100bp, this would be '1000'

	--tracksmooth|-b <int>	Which tracks should be smoothened when performing smoothening (default: '1' meaning only binned track).
				Options are: 
				  1) 0: Smoothen scaled and transformed tracks (log2, ratio or subtracted)
				  2) 1: Smoothen only the scaled sequencing track
				  3) 2: Smoothen only the transformed (log2, ratio or subtract) track

Mapping options:
	--mapq|-q <int>		Minimum (at least) mapping quality (default: 0)
	--keepdup|-d <flag>	Keep duplicated reads (default: no)
	--noproper|-p <flag>	Do not filter un-proper alignments (default: filter)
	--unmappair|-m <flag>	Do not remove reads with unmapped pairs
	--minfrag|-g <int>	Minimum fragment size for read pairs (default: 0)
	--maxfrag|-x <int>	Maximum fragment size for read pairs (default: 2000)
	--fragfilt|-w <int>	Filter reads based on fragment size (default: no)

Output options:
	--outdir|-o <str>	Output directory name (default: '.')

Performance options:
	--threads|-t <int>	No. of threads to use (default: 1)

Normalization Parameters

There are two options is scaling is set. Either you scale the coverages to the no. of sequenced bases (default), or to the number of reads. In case scaling is set to number of reads, BAM files from multiple samples have to be analyzed together, as the scaling is done by comparing the library size to the smallest library size.

Scaling

This parameter tells BAMscale, if scaling should be performed, or not.

Operations

These are the main operations, that can be performed using this tool.

1) Scaled: output scaled tracks.
2) Unscaled: do not scale files in any way.
2) log2: log2 transform against first BAM file.
3) Ratio: coverage ratio against first BAM file.
4) Subtract: subtract coverage against first BAM file.
5) RFD: OK-seq RFD calculation

By default, the setting is "scaled". If scaling is disabled, this is set to "unscaled".


log2 Transformation

In this case, the log2 coverages of bins are calculated across the genome. Important to note, that the second (or more) samples coverages are divided by the first file!!!


Subtract Transformation

In this case, the coverages are subtracted across the genome. Important to note, that the second (or more) samples coverages are subtracted by the first file!!!


RFD Transformation

In this case, the RFD coverages are calculated. This is a special case, and expects a single BAM file!! A tutorial can be found here



Description of Important Parameters

If there are regions (in a BED file) that might have unnecessarily high coverages which might skew you results, specify this in the command:

--bedsubtract <BEDfile> 

If you absolutely want to include duplicate alignments in the calculation, use this flag:

--keepdup

If it is necessary to count read pairs as one, use this flag:

--frag

Filtering based on mapping quality (eg. 10):

--mapq 10