Merge pull request #1638 from alexanderchang1/master

Identified a WES ASCAT Error and updated documentation
nf-core · Dec 10, 2024 · abdea49 · abdea49
2 parents 427801f + 5415caa
commit abdea49
Show file tree

Hide file tree

Showing 2 changed files with 84 additions and 0 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- [1638](https://github.com/nf-core/sarek/pull/1638) - Added additional documentation detailing ASCAT WES usage.
 - [1640](https://github.com/nf-core/sarek/pull/1620) - Add `lofreq` as a tumor-only variant caller
 - [1642](https://github.com/nf-core/sarek/pull/1642) - Back to dev
 - [1653](https://github.com/nf-core/sarek/pull/1653) - Updates `sarek_subway` files with `lofreq`

diff --git a/docs/usage.md b/docs/usage.md
@@ -647,6 +647,89 @@ mv *loci* battenberg_loci_on_target_hg38/
 
 3. Copy the `targets_with_chr.bed` and `GC_G1000_on_target_hg38.txt` files into the newly created `battenberg_loci_on_target_hg38` folder before running the next set of steps. ASCAT generates a list of GC correction loci with sufficient coverage in a sample, then intersects that with the list of all loci with tumour logR values in that sample. If the intersection is <10% the size of the latter, it will fail with an error. Because the Battenberg loci/allele sets are very dense, subsetting to on-target regions is still too many loci. This script ensures that all SNPs with GC correction information are included in the loci list, plus a random sample of another 30% of all on target loci. You may need to vary this proportion depending on your set of targets. A good rule of thumb is that the size of your GC correction loci list should be about 15% the size of your total loci list. This allows for a margin of error.
 
+### 'chr'-based versus non 'chr'-based reference
+
+Please note that loci files provided from ASCAT developers (https://github.com/VanLoo-lab/ascat/tree/master/ReferenceFiles/WES) are not 'chr'-based (chromosome names are '1', '2', '3', etc. and not 'chr1', 'chr2', 'chr3', etc.). If your BAMs are 'chr'-based, you will need to add 'chr'
+
+```bash
+for i in {1..22} X;
+  do sed -i 's/^/chr/' G1000_loci_hg19_chr${i}.txt;
+done).
+```
+
+ASCAT will internally remove 'chr' so the other files (allele, GC correction and RT correction) should not be modified and chrom_names (ascat.prepareHTS) should be c(1:22,'X').
+
+If using ASCAT provided references:
+
+```bash
+
+cd .../G1000_lociAll_hg38_unzipped/G1000_lociAll_hg38
+
+# Function to check and correct 'chr' prefix
+check_and_correct_chr_prefix() {
+    local file=$1
+    local chr_number=$2
+
+    # Check if file exists
+    if [ ! -f "$file" ]; then
+        echo "Error: File $file not found."
+        exit 1
+    fi
+
+    # Check first line of the file
+    first_line=$(head -n 1 "$file")
+
+    if [[ $first_line == chr${chr_number}* ]]; then
+        echo "File $file already has correct 'chr' prefix. No changes needed."
+    elif [[ $first_line == chrchr${chr_number}* ]]; then
+        echo "File $file has duplicate 'chr' prefix. Correcting..."
+        sed -i 's/^chrchr/chr/' "$file"
+    elif [[ $first_line == ${chr_number}* ]]; then
+        echo "File $file is missing 'chr' prefix. Adding..."
+        sed -i 's/^/chr/' "$file"
+    else
+        echo "Error: Unexpected format in $file. Please check manually."
+        exit 1
+    fi
+}
+
+# Check and correct 'chr' prefix for each loci file
+for i in {1..22} X; do
+    check_and_correct_chr_prefix "G1000_loci_hg38_chr${i}.txt" "${i}"
+done
+
+for i in {1..22} X
+do
+  # Generate BED file from the tailored loci set
+  awk '{ print $1 "\t" $2-1 "\t" $2 }' G1000_loci_hg38_chr${i}.txt > chr${i}.bed
+
+  # Extract relevant GC content data for this chromosome
+  grep "^chr${i}_" GC_G1000_on_target_hg38.txt > chr${i}.txt
+
+  # Intersect BED file with target regions to find loci on target
+  bedtools intersect -a chr${i}.bed -b targets_with_chr.bed | awk '{ print $1 "_" $3 }' > chr${i}_on_target.txt
+
+  # Calculate the number of lines needed for random sampling (30% of total)
+  n=$(wc -l < chr${i}_on_target.txt)
+  count=$((n * 3 / 10))
+
+  # Get loci that are both on target and match the GC content data
+  grep -xf chr${i}.txt chr${i}_on_target.txt > chr${i}.temp
+
+  # Add random subset of on-target loci to the list
+  shuf -n $count chr${i}_on_target.txt >> chr${i}.temp
+
+  # Sort, remove duplicates, and format output
+  sort -n -k2 -t '_' chr${i}.temp | uniq | awk 'BEGIN { FS="_" } ; { print $1 "\t" $2 }' > battenberg_loci_on_target_hg38_chr${i}.txt
+done
+
+# Compress the resulting loci files into a zip archive
+zip battenberg_loci_on_target_hg38.zip battenberg_loci_on_target_hg38_chr*.txt
+
+```
+
+If using Battenberg provided references:
+
 ```bash
 cd battenberg_loci_on_target_hg38/
 rm *chrstring*