Input data metrics explanation: % of mtDNA reads of the total sequence reads that mapped to the whole mtDNA (follow-up) #222
Comments
Not sure what your question is. The 99.99% is the fraction of reads used; without the max memory setting it is usually 100%, but sometimes reads are too short or have missing read pairs and then get discarded. So no subsampling was done in this case.
Forward reads without pair: 41965
This means some reads had no matching pair... How many reads are there in total, fewer than 434,743,096?
Hi Nicolas,
Raw data: 218,591,573 read pairs (adapter-free). Both the raw and the trimmed data produced the same largest contig, 16,521 bp, but the % of organelle genome is 0.43 for the raw data and 0.44 for the trimmed data; see the log files below.
log_mito_1_375_raw data.txt
Is it expected that the % of recovered organelle genome from NOVOPlasty exceeds that from the reference-based approaches (0.39%)? Below is the pipeline for the reference-based approach:
Approaches 3 and 4: Reference-based read mapping was performed using two different parameter settings (see below), by aligning the trimmed sequencing reads against the V. vulpes reference genome (assembly version: GCF_003160815.1_VulVul2.2; Kukekova et al., 2018) using BWA-MEM v0.7.17 (Li and Durbin, 2009) with default parameters. We then used SAMtools v1.10 (Li et al., 2009) to obtain sorted BAM files, followed by GATK v4.2.2.0 (https://gatk.broadinstitute.org/hc/en-us) to remove PCR duplicates with MARKDUPLICATESSPARK and to filter out bad read mates, reads with mapping quality zero, and ambiguously mapped reads (Nater et al., 2017); SAMtools was then used again to extract the reads that mapped to the mtDNA scaffold (NC_008434; Arnason et al., 2006) of the reference genome. HAPLOTYPECALLER in GATK was used to call variants with two different parameter settings, passing --sample-ploidy 1 for haploid (Approach 3) and 2 for diploid (Approach 4), each yielding a separate VCF file. Finally, FastaAlternateReferenceMaker from GATK was used to convert the two VCF files from Approaches 3 and 4 to FASTA format.
Kind regards,
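(For reference, the reference-based steps described above could be scripted roughly as follows. This is a minimal sketch, not the exact published settings: file names, thread counts, the read-group string and the MAPQ >= 1 filter are placeholders/assumptions.)

```python
# Hedged sketch of the reference-based pipeline (Approaches 3 and 4).
# File names, thread counts and the MAPQ filter are placeholders, not the published settings.
import subprocess

REF = "GCF_003160815.1_VulVul2.2_genomic.fna"   # V. vulpes reference (placeholder file name)
MT = "NC_008434.1"                              # mtDNA scaffold name as it appears in the reference
R1, R2 = "sample_R1.trimmed.fastq.gz", "sample_R2.trimmed.fastq.gz"

def run(cmd):
    print(">>", cmd)
    subprocess.run(cmd, shell=True, check=True)

# The reference must be prepared beforehand: bwa index, samtools faidx and a GATK sequence dictionary.

# 1. Map trimmed reads with BWA-MEM (default parameters, plus a read group that GATK requires)
#    and coordinate-sort with SAMtools.
run(f"bwa mem -t 8 -R '@RG\\tID:sample\\tSM:sample\\tPL:ILLUMINA' {REF} {R1} {R2} "
    f"| samtools sort -@ 8 -o sample.sorted.bam -")

# 2. Remove PCR duplicates with GATK MarkDuplicatesSpark.
run("gatk MarkDuplicatesSpark -I sample.sorted.bam -O sample.dedup.bam")
run("samtools index sample.dedup.bam")

# 3. Drop mapping-quality-zero / ambiguously mapped reads (-q 1 is an assumption here)
#    and keep only the reads on the mtDNA scaffold.
run(f"samtools view -b -q 1 -o sample.mt.bam sample.dedup.bam {MT}")
run("samtools index sample.mt.bam")

# 4. Call variants at ploidy 1 (Approach 3) and ploidy 2 (Approach 4),
#    then convert each VCF into a consensus FASTA restricted to the mtDNA scaffold.
for ploidy in (1, 2):
    run(f"gatk HaplotypeCaller -R {REF} -I sample.mt.bam -L {MT} "
        f"--sample-ploidy {ploidy} -O sample.mt.ploidy{ploidy}.vcf.gz")
    run(f"gatk FastaAlternateReferenceMaker -R {REF} -L {MT} "
        f"-V sample.mt.ploidy{ploidy}.vcf.gz -O sample.mt.ploidy{ploidy}.fasta")
```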
Do you have a problem with the 0.01 difference? :)
No, I don't have a problem with this difference; I'm just surprised that a de novo assembler (NOVOPlasty) produces a higher % of mtDNA reads than the reference-based approach. I'm doing a comparison between the different assemblies, and I should give explicit recommendations on the performance of each method, hence any difference, even a small one, is important. I think we must leave max memory empty, so that it will not subsample and will then use all the reads, is this right? I appreciate your time and help. Regards,
It is a de novo assembler; I didn't put much effort into those statistics, I just added them because I thought they could be informative for users, but it is still an estimation. Does the mt% have to be that accurate? The max memory setting is there to reduce memory usage and increase speed; the assembly will still be as good, because you have very high coverage. Maybe the %mt will be a bit less accurate, but I thought nobody would care about that... And in your case you have some small contigs, which could explain the elevated %, because the reads used for assembling those are also included and maybe they are not mitochondrial. Although it is more probable that the assembly is incomplete because of some repetitive region in the control region; in those cases you often see those small contigs. Other graph-based assemblers will collapse that region, which makes it seem that the assembly is complete when it isn't. Hence, a lot of mt genomes on NCBI are not fully assembled (collapsed repetitive regions)... You can send me your log files and contigs, and I can have a look to see if this is the case.
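(If a more exact figure than NOVOPlasty's internal estimate is needed, one way to cross-check it is to map the trimmed reads against the reference or the assembled mitogenome and count the mapped reads directly. A minimal sketch, assuming samtools is installed; the BAM and scaffold names are placeholders.)

```python
# Hedged cross-check of %mtDNA, independent of the assembler's own statistic.
# Assumes a coordinate-sorted, indexed BAM of all reads mapped to the reference.
import subprocess

BAM = "sample.dedup.bam"     # placeholder: BAM of all trimmed reads vs the reference
MT = "NC_008434.1"           # placeholder: mtDNA scaffold name in that reference

# samtools idxstats prints one line per reference: name, length, mapped reads, unmapped reads.
# Note: the mapped counts include secondary/supplementary alignments unless filtered earlier.
out = subprocess.run(["samtools", "idxstats", BAM],
                     capture_output=True, text=True, check=True).stdout

mt_mapped = 0
total_reads = 0
for line in out.strip().splitlines():
    name, length, mapped, unmapped = line.split("\t")
    total_reads += int(mapped) + int(unmapped)
    if name == MT:
        mt_mapped = int(mapped)

print(f"mtDNA-mapped reads: {mt_mapped}")
print(f"total reads:        {total_reads}")
print(f"%mtDNA:             {100.0 * mt_mapped / total_reads:.3f}")
```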
Thanks for this explanation; below are the log files. But you didn't respond to this question yet: "I think we must leave max memory empty, so that it will not subsample and will then use all the reads, is this right?" log_mito_1_375_12_3_max memory 100.txt
If max memory is empty it doesn't subsample, but I'm not sure what you mean by "we must..."; you can do whatever you prefer. And I would need the extended log to see, but since you didn't use that option you can send me the Contigs file, I can probably see it from there.
Ok, by "must" I meant in order to make use of all the reads, so that the estimate of recovered mtDNA reads will be more accurate than when you subsample. Below are the extended log files. log_extended_mito_1_375_12_3_max memory 100.txt
Dear Nicolas,
Just a follow-up question to issue #216.
How do I know the right % of mtDNA reads out of the total sequence reads that mapped to the whole mtDNA? I'm doing a comparison between the performance of NOVOPlasty and other de novo assemblers, and this information is very important.
When I used different memory settings (all other settings fixed), I got the same length for the largest contig, but different numbers for assembled reads, aligned reads and total reads.
Is the subsampled fraction of 99.99% correct when Max memory is set to Null? If so, the number of total reads is higher than the number of reads in the raw data. I'm still confused, sorry.
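(One possible explanation for the apparent mismatch, as a hedged piece of arithmetic: if the log's "Total reads" counts forward and reverse reads separately, which is an assumption, while the raw-data figure counts read pairs, the two totals are consistent.)

```python
# Hedged arithmetic: if "Total reads" counts single reads while the raw figure counts pairs,
# the log's total can legitimately approach twice the pair count.
raw_pairs = 218_591_573            # adapter-free read pairs reported for the raw data
raw_single_reads = raw_pairs * 2   # forward + reverse reads counted separately
log_total_reads = 434_743_096      # "Total reads" reported in the log

print(raw_single_reads)                                        # 437183146
print(f"{100 * log_total_reads / raw_single_reads:.2f} %")     # ~99.44 % of the single reads
```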
max memory Null
log_mito_1_375_12_6_max memory Null.txt
max memory 100
log_mito_1_375_12_3_max memory 100.txt
max memory 64
log_mito_1_375_12_max memory 64.txt
Kind regards,
Ali