Skip to content

Processing scripts and documentation for the 2-row and 6-row NAM parents

Notifications You must be signed in to change notification settings

MorrellLAB/Barley_NAM_Parents

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Barley_NAM_Parents

Processing scripts and documentation for the 2-row and 6-row NAM parents

Data Availability

Raw FASTQ files can be downloaded from the SRA.

Sample Metadata and Raw Sample Downloading

Sample metadata is located in spreadsheets under raw_samples.

Samples were downloaded from here.

Renaming

Using the documents described above, two lists were constructed: NAM_parents_newnames.txt and NAM_parents_oldnames.txt. The shell script rename_from_lists.sh was fed these lists and renamed the samples by going through the lists line-by-line.

The raw samples that were split into multiple files followed a different renaming procedure. The shell script cat_NAM_parents.sh used the command zcat to combine and rename the files in one step.

After all samples were renamed and concatenated (if necessary), the timestamps were modified to match the timestamps located here. This was done using touch_timestamp.sh.

In total there were 180 distinct samples, but PI_599621 was sampled twice leading to 181 raw FASTQ files.

  • PI_599621 is the version that was sequenced with the bulk of the parents on 2014-07-18
  • 26_PI_599621 is the version that was sequenced in a smaller group of 26 on 2016-03-24

It was discovered later that some samples had been improperly named. In the renaming scripts you will see the wrong names, but in the final sample list, final VCF, and final BAM files you will see the correct names.

  • CIho_39590 (incorrect) -> PI_039590 (correct)
  • PI_392542 (incorrect) -> PI_392524 (correct)
  • PI_294747 (incorrect) -> PI_294743 (correct)

A full list of the samples can be found under sequence_handling in Final_BAMs.list.

sequence_handling: FASTQ to VCF

Raw FASTQ files were trimmed of adapters, read mapped, deduplicated, and converted to BAM format using the sequence_handling pipeline. The configuration file used to run the entire pipeline is found in Config, updated to match commit aca9041. Full documentation on sequence_handling can be found here. The following handlers were executed in the listed order:

  1. Quality_Assessment commit 579ebd4: Quality summary output for each FASTQ is located in NAM_quality_summary.txt
  2. Adapter_Trimming commit cca97ae
  3. Read_Mapping commit cca97ae
  4. SAM_Processing commit cca97ae
  5. Coverage_Mapping commit acc3405: Coverage summary output for each BAM is located in NAM_coverage_summary.txt

The beginning stages of variant discovery were not performed using sequence_handling since the pipeline was still under development at the time. Because previous versions of sequence_handling output incorrect SAM/BAM headers, a reheader script was used before indel realignment to give each BAM file the correct @RG line. The following custom scripts were executed in the listed order:

  1. GATK_RTC.job
  2. Reheader_BAM.sh
  3. GATK_IndelRealigner.job
  4. GATK_HaplotypeCaller.job
  5. GATK_GenotypeGVCFs.job

SNP calling was performed using version 3.6 of GATK, version 0.1.14 of vcftools, and version 1.0.0 of vcflib. The following handlers of sequence_handling were executed in the listed order to produce the final VCF file:

  1. Create_HC_Subset commit aca9041: DP and GQ summaries for before and after filtering are located here
  2. Variant_Recalibrator commit aca9041
  3. Variant_Filtering commit aca9041: DP and GQ summaries for before and after filtering are located here
  4. Variant_Analysis commit aca9041: Heterozygosity, missingness, Ts/Tv, MAF histogram, population genetics statistics at 18 loci, and SNP count outputs are located here

The 9k genotyping markers, the inversions VCF file (not available yet), and the MBE VCF file (Not available yet?) were used as resource files for Variant_Recalibrator.

The final VCF file can be downloaded here. (Not available yet)

Comparison to Genotyping Data

The final SNP calls were compared to 9k iSelect genotyping for the same lines using version 0.1.14 of vcftools to verify the identity of each sample. The ALCHEMY genotyping data for the NSGC landraces from Poets et al. 2015 was converted into VCF format using this tutorial by Li Lei. This NSGC landraces genotyping VCF file is available for download here. (Not available yet) 9k SNPs with no BLAST hits were filtered out and not used for comparison. The results of the comparison can be found under discordance_exome_vs_alchemy.txt.

vcftools --vcf Barley_NAM_Parents_Final.vcf\
	 --diff NAM_9k_good_SNPs.vcf\
	 --diff-indv-discordance\
	 --diff-indv-map exome_to_alchemy_map.txt\
	 --out discordance_exome_vs_alchemy

Seven samples were above the 10% threshold for discordance:

Sample Name Discordance Row Type
PI_328015 43.3% 2-row
CIho_5989 43.2% 2-row
PI_640220 41.2% 6-row
PI_412946 37.0% 2-row
PI_392491 36.1% 2-row
PI_436149 31.0% 2-row
PI_223883 16.3% 6-row

The exome capture SNP calls for the seven discordant samples were filtered to only the good 9k sites (excluding the 9k SNPs with no BLAST hits) and merged with the VCF file for the NSGC landraces. PLINK was used to create a pairwise similarity matrix of the merged VCF file to determine the likely identity of the discordant NAM parents.

plink --vcf discordant_combined.vcf --allow-extra-chr --distance square ibs

R was used to find all NSGC core lines that were at least 90% similar to one of the discordant lines.

ibs <- read.table("plink.mibs", header=FALSE)
headers <- read.table("plink_ids.txt")
colnames(ibs) <- headers
rownames(ibs) <- t(headers)
sort(ibs[2453,])
sort(ibs[2452,])
sort(ibs[2451,])
sort(ibs[2450,])
sort(ibs[2449,])
sort(ibs[2448,])
sort(ibs[2447,])

Three of the seven samples appear to have been mislabeled as another NAM parent (see below). The other four discordant samples have multiple matches to lines outside of the NAM parents population and no matches within it, thus their true identities are unclear.

Sample Name Likely Identity Concordance with Likely Identity
PI_392491 PI_412946 95.5%
PI_412946 PI_436149 97.0%
PI_640220 PI_392491 97.1%

To-Do

  • Rename files based on discordance matches
  • Make exome capture VCF file available for download
  • Make NSGC genotyping VCF file available for download
  • Make Variant_Recalibrator MBE resource file available for download

About

Processing scripts and documentation for the 2-row and 6-row NAM parents

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages