nb-cores tuning different total row numbers #17

Open
sarapita opened this issue May 23, 2022 · 3 comments

Comments

@sarapita

Dear John,

I’m using unitig-counter/1.1.0 and then feeding the output into pyseer for LMM-based association analyses. However,
I’ve noticed I get a different number of significant unitigs when using different values for -nb-cores when running unitig-counter. I then noticed that the number of lines in the unitigs output file also differs (e.g. when using -nb-cores 1 vs 4). Is this reproducible on your end? If so, is the issue due to parallelization? Is it safest to stick to -nb-cores 1?

Thanks in advance,
Sara

@johnlees
Member

It might be more informative to check exactly which sequences differ between the two rather than just the counts.

I'm afraid I'm not sure what is causing this, but sticking to one core might be safest in that case.
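The sequence-level comparison suggested above could be sketched roughly as follows. This is only an illustration, not part of unitig-counter or pyseer: the function names `parse_unitigs`, `read_unitigs` and `compare_runs` are made up here, and it assumes the unitig sequence is the first whitespace-separated field on each line of the gzipped output.

```python
import gzip


def parse_unitigs(lines):
    """Collect the first whitespace-separated field of each non-empty line.

    Assumption: that first field is the unitig sequence.
    """
    seqs = set()
    for line in lines:
        fields = line.split()
        if fields:
            seqs.add(fields[0])
    return seqs


def read_unitigs(path):
    """Read a gzipped unitig file and return its set of sequences."""
    with gzip.open(path, "rt") as handle:
        return parse_unitigs(handle)


def compare_runs(run_a, run_b):
    """Split two sets of sequences into shared and run-specific parts."""
    return {
        "shared": run_a & run_b,
        "only_a": run_a - run_b,
        "only_b": run_b - run_a,
    }
```

With two real runs this might be called as `compare_runs(read_unitigs("run_1/unitigs.txt.gz"), read_unitigs("run_2/unitigs.txt.gz"))` (placeholder paths), after which the run-specific sequences can be inspected directly rather than only counted.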

@sarapita
Author

Thanks for your prompt reply!

I looked into it a bit more and realised that what I posted before isn’t quite right; sorry about this. Here is a bit more context.

I executed 4 runs of unitig-counter (I encounter the same issue when using unitig-caller instead) with -nb-cores 1, 2, 3 and 4, respectively.
I get the same number of rows in the unitigs output file (contrary to what I said in my first post) and the same number of unique unitigs (see below),
but the unique unitigs themselves differ across the 4 runs.

#total unique unitigs
for threads in {1..4}; do zcat count_unitigs_threads_${threads}/unitigs.txt.gz | awk '{print $1}' | sort | uniq | wc -l;done
353548
353548
353548
353548

For example, between the run with 1 thread and the run with 2 threads, 311607 unitigs were common to both runs,
and 41941 unitigs were reported in the 1-thread run
that were not found in the 2-thread run, and vice versa.

anti_join(thread1,thread2) %>% dim
Joining, by = "unitigs"
[1] 41941 1

inner_join(thread1,thread2) %>% dim
Joining, by = "unitigs"
[1] 311607 1

I haven't yet assessed how "different" these unitigs are, as you suggested in your previous post.

When inputting unitigs from these different runs into pyseer for a population-structure-adjusted
LMM (employing a similarity matrix), this translated into identical Bonferroni
significance thresholds, but 4 different similarity matrices
and different numbers of significant hits, especially when going from the 1-thread run to the 2-thread run (see the counts below).

wc -l count_unitigs_threads_*/significant_unitigs.txt
860 count_unitigs_threads_1/significant_unitigs.txt
53 count_unitigs_threads_2/significant_unitigs.txt
98 count_unitigs_threads_3/significant_unitigs.txt
103 count_unitigs_threads_4/significant_unitigs.txt
1114 total

This is a bit concerning given, for example, only 6 significant unitigs of the 52
found in a 2-thread run are successfully grepped from the significant unitigs in
the 1-thread run.

If you have any insight into why this might be happening, I'd appreciate it.
Initially, I thought this was a threading issue where not everything was being merged back
into the final table, but I wonder now if I might be missing something about the method
itself.

I initially ran unitig counting on 2 cores, so in the meantime I will see whether the annotation hits are similar if I switch to using 1 core.

@ktmeaton

I also had this issue, where different -nb-cores values resulted in the same number of unitigs but different sequences. I found that the unitigs were actually the same across the different runs; it was just that in some runs the reverse complement was reported (even if the unitig was found in only one strain).

This is very anecdotal, but in testing I noticed that an even number of cores (-nb-cores 2, -nb-cores 4) tends to output reverse complements, while an odd number of cores (-nb-cores 1, -nb-cores 3) keeps the original strand. I'm not sure how reproducible that is at scale, though.
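One way to check whether strand orientation accounts for the differences between runs is to reduce every unitig to a canonical form before comparing. The sketch below takes the lexicographically smaller of a sequence and its reverse complement; that convention and the function names are assumptions made for illustration, and are not necessarily what unitig-counter does internally.

```python
# Translation table for complementing A/C/G/T; other characters (e.g. N)
# are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Return the smaller of seq and its reverse complement (string order)."""
    return min(seq, reverse_complement(seq))


def strand_insensitive_compare(run_a, run_b):
    """Compare two collections of unitigs while ignoring orientation.

    Returns (shared, only_in_a, only_in_b) as sets of canonical sequences.
    """
    canon_a = {canonical(s) for s in run_a}
    canon_b = {canonical(s) for s in run_b}
    return canon_a & canon_b, canon_a - canon_b, canon_b - canon_a
```

If both run-specific sets come back empty after canonicalisation, the runs contain the same unitigs and differ only in which strand was written out.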

I haven't noticed a difference in pyseer results yet, but will report back if I do!
