Smudgeplot interpretation - allopolyploid or coverage too low? #164

RSchley · 2024-10-16T10:09:41Z

Hi there Kamil et al.,

First off, thanks so much for developing this amazing tool. I am having a pervasive problem with interpreting my smudgeplots - specifically, I am wondering whether they represent cases of inadequate coverage or allopolyploidy that looks like diploidy.

I am working on tropical trees with an average genome size of about 1Gbp across 13 chromosomes. There are some known tetraploids in the group already for which flow cytometry was performed (e.g. here), although most are diploid.

I ran an initial rough ploidy estimate across hundreds of species in my study genus (Inga) using nQuire (i.e. based on allelic ratios, although I know these methods can be problematic). Based on this, I did whole-genome resequencing at 40x on species inferred to be tetraploid. I then ran smudgeplot and genomescope using the suggested parameters in the OhKnow Kmer course as well as the newer BGA tutorials, with the following commands:


######################################################## Trim with fastp ########################################################
## remember, it's best to trim reads for smudgeplot (see https://github.com/KamilSJaron/smudgeplot?tab=readme-ov-file#smudgeplot)

## Run fastp v0.23.4, with:
#  Bases with quality <22 counted as unqualified for read filtering (-q 22)
#  Reads shorter than 30 bp removed (-l 30)
#  Adapters detected and removed (--detect_adapter_for_pe)
#  Base correction in overlapping reads (-c)
#  Poly-G tails trimmed (-g)

fastp -q 22 -l 30 --detect_adapter_for_pe -c -g -i "$DATADIR"/V*/01_raw_reads/"$SEQNAME"_*_R1_*.fastq.gz -I "$DATADIR"/V*/01_raw_reads/"$SEQNAME"_*_R2_*.fastq.gz -o "$DATADIR"/All_fastp_trimmed/"$SEQNAME"_R1_trimmed.fastq.gz -O "$DATADIR"/All_fastp_trimmed/"$SEQNAME"_R2_trimmed.fastq.gz  

########################################################## Run genomescope ##########################################################


#First, you need to generate a kmer database with FastK 
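## Make kmer database by running FastK with kmer size of 21 (-k21), 4 threads (-T4) and 16 GB memory (-M16)
## -t4 writes the k-mer table (.ktab) for k-mers with count >= 4
echo 'Making Kmer database'
FastK -v -t4 -k21 -M16 -T4 "$DATADIR"/All_fastp_trimmed/"$SEQNAME"_R*_trimmed.fastq.gz -N"$SEQNAME"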


#Make histogram with Histex
Histex -G "$SEQNAME" > "$SEQNAME"_k"$KMER"_GS.hist

# Run GenomeScope
genomescope2 -p 2 -i ../"$SEQNAME"_k"$KMER"_GS.hist -o ./"$SEQNAME"_dip -n "$SEQNAME"_dip -k 21 --verbose
genomescope2 -p 3 -i ../"$SEQNAME"_k"$KMER"_GS.hist -o ./"$SEQNAME"_tri -n "$SEQNAME"_tri -k 21 --verbose
genomescope2 -p 4 -i ../"$SEQNAME"_k"$KMER"_GS.hist -o ./"$SEQNAME"_tet -n "$SEQNAME"_tet -k 21 --verbose


########################################################## Running smudgeplots ##########################################################
# from BGA workshop version https://github.com/BGAcademy23/smudgeplot

#Run smudgeplot
/home/rschley/scratch/apps/smudgeplot/exec/smudgeplot.py hetmers -L 10 -t 4 --verbose -o "$SEQNAME"_k21_pairs ../"$SEQNAME".ktab

#Plot smudgeplot
/home/rschley/scratch/apps/smudgeplot/exec/smudgeplot.py plot -t "$SEQNAME" -o "$SEQNAME"_k21_smudgeplot "$SEQNAME"_k21_pairs_text.smu

There were cases of genuinely low coverage, where the error peak could not be distinguished from the genomic peaks.
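A quick way to check whether the error peak separates from the genomic peaks is to look for the first valley in the k-mer histogram. The following is a minimal sketch, not part of the original workflow: `hist_valley_and_peak` is a hypothetical helper name, and it assumes the histogram file has two whitespace-separated columns (coverage, count). It also assumes the last bin may aggregate all higher coverages, so that line is dropped before scanning.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the first valley (end of the error peak) and the
# highest peak after it in a two-column "coverage count" k-mer histogram.
# Assumption: the last line may be a cumulative bin, so it is ignored.
hist_valley_and_peak() {
    sed '$d' "$1" | awk '
        NR == 1 { prev = $2; prev_cov = $1; next }
        # the first bin whose count rises again marks the valley just before it
        !valley && $2 > prev { valley = prev_cov; valley_n = prev }
        # after the valley, track the bin with the highest count
        valley && $2 > peak_n { peak = $1; peak_n = $2 }
        { prev = $2; prev_cov = $1 }
        END {
            if (!valley) {
                print "no valley found: error and genomic k-mers are not separated"
                exit
            }
            printf "valley=%d valley_count=%d peak=%d peak_count=%d\n", valley, valley_n, peak, peak_n
        }
    '
}
```

It could be called as `hist_valley_and_peak "$SEQNAME"_k"$KMER"_GS.hist`. A histogram that only ever decreases yields the "no valley found" message, which would be consistent with coverage too low to separate errors from genomic k-mers.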

However, there were multiple cases of species with great coverage which should be tetraploid that still had low normalised minor kmer coverage. These and all others sequenced were inferred to be diploid (although many of these cases infer the wrong number of chromosomes - 16 or 15 instead of 13)

How should I understand my smudgeplot? Is it a case of divergent subgenomes in an allopolyploid or too low coverage? Or, since multiple sequencing runs were performed to get the required coverage, is it some methodological artefact from combining multiple rounds of sequencing?
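For the "coverage too low" possibility, a rough estimate of the expected k-mer coverage can help. The sketch below uses the standard relation k-mer coverage = read depth x (L - k + 1) / L and then divides by ploidy to get the coverage of a single genome copy. The 150 bp read length is an assumption (it is not stated in this thread), as is the reading of "40x" as total depth shared across all genome copies; sequencing errors are ignored, and `kmer_coverage` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: kmer_coverage READ_DEPTH READ_LEN K PLOIDY
# Assumptions: READ_DEPTH is the total depth shared across all genome copies,
# and sequencing errors are ignored.
kmer_coverage() {
    awk -v c="$1" -v l="$2" -v k="$3" -v p="$4" 'BEGIN {
        total = c * (l - k + 1) / l          # k-mer coverage summed over copies
        printf "total_kmer_cov=%.1f per_copy_kmer_cov=%.1f\n", total, total / p
    }'
}

kmer_coverage 40 150 21 2   # diploid scenario
kmer_coverage 40 150 21 4   # tetraploid scenario
```

Under these assumptions, if the per-copy value lands near or below the `-L 10` threshold passed to `smudgeplot.py hetmers`, k-mers present in a single genome copy could fall under that cutoff, so it may be worth comparing this estimate with the 1n coverage reported by GenomeScope.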

Thanks so much
Rowan


KamilSJaron · 2024-10-16T11:05:45Z

Hi @RSchley, are you coming on Friday? I think the plot would get a lot nicer with the upcoming version?

I think it boils down to - how allo is your allotetraploid. If the allo parents are very distinct species, they will become... more or less diploids on the k-mer level, and I think that's what you might be seeing. That is good to know too, because then you really expect to also assemble two subgenomes, and there is a chance you will struggle to map resequencing reads (I would be very strict with coverage cutoffs when calling variants).
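To get a feel for "how allo" matters at the k-mer level, here is a back-of-the-envelope sketch. It is not something smudgeplot computes: it assumes substitutions between the two subgenomes are independent and evenly spread at per-site divergence d, so a k-mer is identical in both subgenomes with probability (1 - d)^k. The function name `shared_kmer_fraction` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: shared_kmer_fraction K DIVERGENCE
# Simplified model (assumption): independent, uniformly distributed
# substitutions between subgenomes at per-site divergence DIVERGENCE.
shared_kmer_fraction() {
    awk -v k="$1" -v d="$2" 'BEGIN { printf "%.4f\n", (1 - d) ^ k }'
}

# Fraction of 21-mers expected to be identical between the two subgenomes
for d in 0.001 0.01 0.05 0.10; do
    printf 'divergence=%s shared_fraction=%s\n' "$d" "$(shared_kmer_fraction 21 "$d")"
done
```

Under this simplified model, the shared fraction drops quickly as divergence grows, so most k-mers become specific to one subgenome and the k-mer spectrum starts to resemble that of a diploid with a larger genome.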

RSchley · 2024-10-16T11:17:11Z

Hi Kamil, I was keen to register for the new smudgeplot tutorial but will unfortunately be away on Friday. Definitely will re-run everything with the new version.

OK - good to know. Our prelim results suggest hybridisation is apparently fairly widespread in Inga, but polyploidy is quite restricted (from the nQuire results only ~7 out of 190 species surveyed were putative polyploids). Thanks for the advice - in other work with broader resequencing of Inga I have indeed been very strict with coverage cutoffs so that's good!

Are there any parameters for FastK, GenomeScope or smudgeplot you would suggest changing? Is a kmer size of 21 ok? Is the cleaning regime ok? Should I remove duplicates (despite the fact these were PCR-free libraries)? I have had a bit of a play around with parameters before and have recovered more-or-less the same results.

Finally - is the apparently incorrect number of chromosomes a signal of allopolyploidy?

Thanks again
Rowan
