Explanation of how --cov-cutoff works #18

Closed
tseemann opened this issue May 9, 2017 · 11 comments

@tseemann

tseemann commented May 9, 2017

The --cov-cutoff parameter remains a mystery to the SPAdes user community. It used to default to auto and now it defaults to off.

Would it be possible to add an explanation of it to the documentation?

A common result is getting contigs with coverage < 1.0.

@snurk snurk self-assigned this May 19, 2017
@snurk snurk added the question label May 19, 2017
@wangyugui

The help output describes the value only as 'a positive float number'.

What is the meaning of values between 0 and 1.0, and of values over 1.0?

@asl
Member

asl commented May 22, 2017

What is the meaning of values between 0 and 1.0, and of values over 1.0?

See http://cab.spbu.ru/files/release3.10.1/manual.html#sec3.5, which explains what the reported coverage is.

@wangyugui

Is --cov-cutoff used to filter the contig output? Is it not used to filter the FASTQ input by k-mer coverage?

@tseemann
Author

tseemann commented Jun 1, 2017

Here is what the manual section 3.5 says:

Contigs/scaffolds names in SPAdes output FASTA files have the following format: 

>NODE_3_length_237403_cov_243.207_ID_45

Here 3 is the number of the contig/scaffold, 237403 is the sequence length in nucleotides and 243.207 is the k-mer coverage for the last (largest) k value used. 

Note that the k-mer coverage is always lower than the read (per-base) coverage.

Is the only way to get a k-mer coverage < 1 to have a contig that is shorter than k_max?

(This can happen in a section of a de Bruijn graph when it is broken into contigs.)
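
For readers who want to inspect these values in their own output, the header format quoted from the manual above can be parsed with a short script. This is a minimal sketch, not part of SPAdes: the function names `parse_spades_header` and `low_coverage_contigs` are hypothetical, and treating the trailing `_ID_<n>` field as optional is an assumption.

```python
import re

# Pattern for the header format quoted from the SPAdes 3.10.1 manual.
# Assumption: the trailing _ID_<n> field is treated as optional here.
HEADER_RE = re.compile(
    r"^>?NODE_(?P<number>\d+)"
    r"_length_(?P<length>\d+)"
    r"_cov_(?P<cov>\d+(?:\.\d+)?)"
    r"(?:_ID_\d+)?$"
)


def parse_spades_header(header):
    """Return (contig number, length in nt, k-mer coverage) from a contig name."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a SPAdes-style contig name: {header!r}")
    return int(match["number"]), int(match["length"]), float(match["cov"])


def low_coverage_contigs(headers, min_cov):
    """Return the headers whose reported k-mer coverage is below min_cov."""
    return [h for h in headers if parse_spades_header(h)[2] < min_cov]


if __name__ == "__main__":
    print(parse_spades_header(">NODE_3_length_237403_cov_243.207_ID_45"))
    # (3, 237403, 243.207)
```

Calling `low_coverage_contigs(headers, 1.0)` on the header lines of a contigs FASTA file would list the contigs this issue is asking about.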

@wangyugui

wangyugui commented Jun 1, 2017

Is the --cov-cutoff of SPAdes applied after assembly?
People may want a low k-mer coverage filter before assembly, to speed up the assembly.

kmer-mask is the tool that I wanted, but it has some problems:
a) meryl is slower than Jellyfish and uses too much memory (when run with many threads).
b) Some bugs need to be fixed for big FASTQ/FASTA files. (I have a dirty patch (uint32 -> uint64), but it does not seem to be active.)

@snurk
Contributor

snurk commented Jun 1, 2017

Dear @tseemann
SPAdes iteratively increases the value of K and additionally tries to glue together potentially broken regions using paired-read mapping and a search for small overlaps.
Both of these procedures add k-mers that have coverage 0, since they are not present in the reads.
Also, if Ns are introduced into scaffolds, the total length of the scaffold may increase while its average k-mer coverage decreases.
I hope this explains the appearance of average k-mer coverage < 1.0 in the results.
On the other hand, as far as I know, SPAdes should not produce contigs shorter than k_max.
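
The dilution effect described above is plain arithmetic, and a small sketch may make it concrete. The numbers are hypothetical and `average_kmer_coverage` is an illustrative helper, not the way SPAdes computes the reported value internally.

```python
def average_kmer_coverage(kmer_coverages):
    """Mean coverage over all k-mers of one contig or scaffold."""
    return sum(kmer_coverages) / len(kmer_coverages)


# Hypothetical contig: 100 k-mers that were each seen twice in the reads.
observed = [2] * 100

# Hypothetical k-mers added by gap closing or a later K iteration.
# They are absent from the reads, so each contributes coverage 0.
added = [0] * 300

print(average_kmer_coverage(observed))          # 2.0
print(average_kmer_coverage(observed + added))  # 0.5
```

The second value drops below 1.0 even though every k-mer that came from the reads was seen twice.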

@snurk
Contributor

snurk commented Jun 1, 2017

Dear @wangyugui

Is the --cov-cutoff of SPAdes applied after assembly?

Yes and no. It happens after the assembly graph is constructed (and after most graph simplification procedures have finished). But the low-coverage edges are actually removed from the graph, which compresses the remaining unambiguous paths and keeps those edges from interfering with subsequent repeat resolution and scaffolding.
The value auto is compatible only with the uniform coverage model (no --meta or --mda flags).
In this case the threshold is set automatically from the probabilistic model trained on the k-mer frequency histogram, and the value is chosen independently for every iteration.
If the value is provided manually, it is interpreted as an "average nucleotide coverage" and is multiplied by (RL - K)/RL to get a threshold on the average k-mer coverage for the assembly iteration with k-mer size K.
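
The conversion for a manually provided value can be written out as a short sketch. It assumes that RL stands for the read length; the function name `kmer_coverage_threshold` and the example values (cutoff 10, 150 bp reads, K of 21, 33 and 55) are hypothetical and not taken from the SPAdes source.

```python
def kmer_coverage_threshold(cov_cutoff, read_length, k):
    """Scale a nucleotide-coverage cutoff to a k-mer coverage threshold.

    Applies the factor (RL - K) / RL given in the comment above,
    with RL assumed to be the read length.
    """
    if cov_cutoff <= 0:
        raise ValueError("cov_cutoff must be a positive number")
    if not 0 < k < read_length:
        raise ValueError("k must be positive and smaller than the read length")
    return cov_cutoff * (read_length - k) / read_length


if __name__ == "__main__":
    # Hypothetical run: cutoff 10 with 150 bp reads and three K iterations.
    for k in (21, 33, 55):
        threshold = kmer_coverage_threshold(10.0, 150, k)
        print(f"K={k}: k-mer coverage threshold {threshold:.2f}")
```

Under this formula the same manual cutoff gives a lower k-mer coverage threshold at larger K.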

Dear @tseemann, I hope this answers your initial question, and I would be glad to provide any clarifications.

People may want a low k-mer coverage filter before assembly, to speed up the assembly.

We are considering adding this option in future, but currently you would have to set up your own pre-processing pipeline.
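
As a rough idea of what such a pre-processing step could look like, here is a toy sketch that drops reads containing rare k-mers. It is not a SPAdes feature and not a replacement for a real k-mer counter: the function names are hypothetical, everything is held in memory, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_reads(reads, k, min_count):
    """Keep reads in which every k-mer occurs at least min_count times overall."""
    counts = count_kmers(reads, k)
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Reads shorter than k have no k-mers and are dropped.
        if kmers and all(counts[kmer] >= min_count for kmer in kmers):
            kept.append(read)
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGT", "TTTTGGGG"]
    print(filter_reads(reads, k=4, min_count=2))
    # ['ACGTACGT', 'ACGTACGT']
```

For real data sets a dedicated k-mer counter would be needed, since a Python dictionary of all k-mers does not scale to large FASTQ files.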

@tseemann
Author

tseemann commented Jun 6, 2017

@snurk thank you very much for responding in such detail to our questions. I'll pass this page on to the bacterial genomics community. And thank you for continuing to develop SPAdes.

@asl
Member

asl commented Jun 6, 2017

@tseemann We will try to explain the k-mer coverage model SPAdes uses, if time permits. Though it's already used inside kmergenie :)

@jacarrico

Yes, thanks a lot for the answers and for developing SPAdes! This is fundamental for the kind of work we have been doing, which includes certification of pipelines that use SPAdes.

@snurk
Contributor

snurk commented Jun 6, 2017

@tseemann, @jacarrico you are welcome!

@snurk snurk closed this as completed Jun 9, 2017
asl added a commit that referenced this issue May 15, 2018