Input data

Nicola Casiraghi edited this page Jul 13, 2019 · 33 revisions

Sample info file

The sample info file describes the samples that abemus processes.
It is a tab-delimited file with 5 columns and no header row.

column 1: The patient ID.
column 2: The tumor sample ID.*
column 3: The full path to the tumor sample BAM file.*
column 4: The matched-tumor control sample ID.
column 5: The full path to the matched-tumor control sample BAM file.

* This field must be unique.
  • What if you have tumors without matched-control samples?
    abemus also calls somatic SNVs in tumors that have no matched-control sample. Fill the control columns (4 and 5) with NA.

  • What if you have controls without matched-tumor samples?
    abemus uses control samples to build a global sequencing error (GSE) distribution. Keep control samples without a matched tumor in the sample info file and fill the corresponding tumor columns (2 and 3) with NA.
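
The format rules above can be checked before running abemus. The following is a minimal Python sketch, not part of abemus: the function name `validate_sample_info` is hypothetical, and the rule that an ID and its BAM path must be either both set or both NA is an assumption drawn from the example file rather than a documented requirement.

```python
import csv
import io


def validate_sample_info(text):
    """Return a list of problems found in a sample info file given as text."""
    problems = []
    seen_tumor_ids, seen_tumor_bams = set(), set()
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for n, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 5:
            problems.append(f"line {n}: expected 5 columns, found {len(row)}")
            continue
        _patient, tumor_id, tumor_bam, ctrl_id, ctrl_bam = (f.strip() for f in row)
        # Assumption: an ID and its BAM path are either both set or both NA.
        if (tumor_id == "NA") != (tumor_bam == "NA"):
            problems.append(f"line {n}: tumor ID and tumor BAM must both be NA or both be set")
        if (ctrl_id == "NA") != (ctrl_bam == "NA"):
            problems.append(f"line {n}: control ID and control BAM must both be NA or both be set")
        # Columns 2 and 3 must be unique (NA placeholders are not counted).
        if tumor_id != "NA":
            if tumor_id in seen_tumor_ids:
                problems.append(f"line {n}: duplicated tumor sample ID {tumor_id}")
            seen_tumor_ids.add(tumor_id)
        if tumor_bam != "NA":
            if tumor_bam in seen_tumor_bams:
                problems.append(f"line {n}: duplicated tumor BAM path {tumor_bam}")
            seen_tumor_bams.add(tumor_bam)
    return problems
```

An empty list means that no problem was detected by these checks; it does not guarantee that the BAM files exist or are readable.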

Here is an example of a valid sample info file:

PT01   TUMOR_A   /my_project/data/TUMOR_A.bam   CTRL_A   /my_project/data/CTRL_A.bam
PT01   TUMOR_B   /my_project/data/TUMOR_B.bam   CTRL_B   /my_project/data/CTRL_B.bam
PT02   TUMOR_C   /my_project/data/TUMOR_C.bam   CTRL_C   /my_project/data/CTRL_C.bam
PT03   NA        NA                             CTRL_D   /my_project/data/CTRL_D.bam
PT04   TUMOR_E   /my_project/data/TUMOR_E.bam   NA       NA   

Control samples CTRL_A, CTRL_B, CTRL_C and CTRL_D will be used to build the GSE distributions.
Tumor samples TUMOR_A, TUMOR_B, TUMOR_C and TUMOR_E will be analyzed for somatic SNVs.
Calls in TUMOR_A, TUMOR_B and TUMOR_C will be refined using the matched tumor-control information.
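
How each row contributes to these three groups can be sketched in Python. This is an illustration of the rules described on this page, not abemus code; `classify_samples` and `EXAMPLE_ROWS` are hypothetical names, and listing a shared control only once is an assumption.

```python
# The example sample info file above, one tuple per row.
EXAMPLE_ROWS = [
    ("PT01", "TUMOR_A", "/my_project/data/TUMOR_A.bam", "CTRL_A", "/my_project/data/CTRL_A.bam"),
    ("PT01", "TUMOR_B", "/my_project/data/TUMOR_B.bam", "CTRL_B", "/my_project/data/CTRL_B.bam"),
    ("PT02", "TUMOR_C", "/my_project/data/TUMOR_C.bam", "CTRL_C", "/my_project/data/CTRL_C.bam"),
    ("PT03", "NA", "NA", "CTRL_D", "/my_project/data/CTRL_D.bam"),
    ("PT04", "TUMOR_E", "/my_project/data/TUMOR_E.bam", "NA", "NA"),
]


def classify_samples(rows):
    """Split sample IDs into controls, tumors, and tumors with a matched control."""
    controls, tumors, matched_tumors = [], [], []
    for _patient, tumor_id, _tumor_bam, ctrl_id, _ctrl_bam in rows:
        has_tumor = tumor_id != "NA"
        has_ctrl = ctrl_id != "NA"
        # Every control contributes to the GSE distribution, matched or not.
        if has_ctrl and ctrl_id not in controls:
            controls.append(ctrl_id)
        if has_tumor:
            # Every tumor is analyzed for somatic SNVs.
            tumors.append(tumor_id)
            # Only tumors with a control can use matched information.
            if has_ctrl:
                matched_tumors.append(tumor_id)
    return controls, tumors, matched_tumors
```

Calling `classify_samples(EXAMPLE_ROWS)` returns the same three groups listed above: four controls, four tumors, and three matched tumors.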

Targeted genomic regions

abemus looks for SNVs in genomic regions of interest. These regions must be provided as a tab-delimited BED file, sorted (e.g. with sortBed) and without a column header.

The 3 required BED fields are:

column 1: Chromosome name (the use of the "chr" prefix must be consistent with the BAM files).
column 2: Starting position of the genomic region.
column 3: Ending position of the genomic region.

BED coordinates are 0-based and the ending position is not included in the region. For example, the first 100 bases of a chromosome are defined as start_base=0, end_base=100, and span the bases numbered 0-99.
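
The coordinate convention and the sorting requirement can be illustrated with a small Python sketch. The helper names are hypothetical, and the sort check assumes the default sortBed ordering (chromosome name as text, then start position), so "chr10" sorts before "chr2".

```python
def parse_bed(text):
    """Parse the first three BED columns into (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def region_length(start, end):
    """Number of bases in a region: the end position is excluded."""
    return end - start


def covered_positions(start, end):
    """0-based positions covered by a region, from start to end - 1."""
    return range(start, end)


def is_sorted(regions):
    """True if regions are ordered by chromosome name, then by start."""
    keys = [(chrom, start) for chrom, start, _end in regions]
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

For the region with start 0 and end 100, `region_length` gives 100 and the last covered position is 99, which matches the description above.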

Per-base pileup data

Get .pileup and .pabs data with PaCBAM

pacbam bam=/my_project/data/TUMOR_A.bam bed=regions.bed vcf=snps.vcf fasta=hg19.fasta strandbias mode=5 out=PaCBAM_outdir
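
Since every tumor and control BAM listed in the sample info file needs pileup data, the command can be generated once per file. The Python sketch below only builds the argument lists; both function names are hypothetical, and the `bam=` argument is an assumption based on PaCBAM's usage message that should be verified against the installed version.

```python
def bams_from_sample_info(rows):
    """Collect the distinct BAM paths (tumors and controls) from 5-field rows."""
    bams = []
    for _patient, _tumor_id, tumor_bam, _ctrl_id, ctrl_bam in rows:
        for bam in (tumor_bam, ctrl_bam):
            if bam != "NA" and bam not in bams:
                bams.append(bam)
    return bams


def pacbam_command(bam, bed, vcf, fasta, outdir):
    """Build the argument list for one PaCBAM run with the options used on this page."""
    return [
        "pacbam",
        f"bam={bam}",  # assumption: taken from PaCBAM's usage message
        f"bed={bed}",
        f"vcf={vcf}",
        f"fasta={fasta}",
        "strandbias",
        "mode=5",
        f"out={outdir}",
    ]
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, assuming the `pacbam` executable is on the PATH.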

Split .pileup and .pabs data by chromosome

PaCBAM pileup data are split by chromosome to speed up the downstream workflow. Use the abemus built-in function split_pacbam_bychrom() for this task.

Usage ( ?split_pacbam_bychrom )

split_pacbam_bychrom(targetbed = "/my_project/info/regions.bed",
                     pacbamfolder = "/my_project/data/PaCBAM_outdir",
                     pacbamfolder_bychrom = "/my_project/data/PaCBAM_outdir_bychrom")

targetbed is the tab-delimited BED file with the targeted genomic regions.
pacbamfolder is the folder where the original .pileup and .pabs output files from PaCBAM are saved.
pacbamfolder_bychrom is the folder where the output is written. It will contain one subfolder per sample (both tumors and controls) with the .pileup and .pabs data split by chromosome:

pacbamfolder_bychrom/
   SAMPLE_id/
      pileup/
         chr1.pileup
         chr3.pileup
         ...
      snvs/
         chr1.pabs
         chr3.pabs
         ...

The split_pacbam_bychrom() function does not create the pacbamfolder_bychrom output directory, so make sure to create it in advance. Only data present in pacbamfolder_bychrom will be considered in the downstream analysis.
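
A short Python sketch of both steps: creating the output directory before the split, and listing what ended up in it afterwards. The function names are hypothetical, and the expected layout (a `pileup/` and a `snvs/` subfolder per sample) is taken from the tree shown above.

```python
import os


def prepare_outdir(path):
    """Create the by-chromosome output directory if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


def list_split_output(pacbamfolder_bychrom):
    """Map each sample subfolder to the chromosomes found in pileup/ and snvs/."""
    layout = {}
    for sample in sorted(os.listdir(pacbamfolder_bychrom)):
        sample_dir = os.path.join(pacbamfolder_bychrom, sample)
        if not os.path.isdir(sample_dir):
            continue
        found = {}
        for subfolder, ext in (("pileup", ".pileup"), ("snvs", ".pabs")):
            folder = os.path.join(sample_dir, subfolder)
            names = os.listdir(folder) if os.path.isdir(folder) else []
            # Strip the extension to keep only the chromosome name.
            found[subfolder] = sorted(n[: -len(ext)] for n in names if n.endswith(ext))
        layout[sample] = found
    return layout
```

Comparing the keys of the returned dictionary with the sample IDs in the sample info file is a quick way to spot samples that would be missing from the downstream analysis.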