
Programmatically Control Thread Count #188

Closed
3 of 4 tasks
jbteves opened this issue Jan 18, 2019 · 6 comments
Labels: enhancement (issues describing possible enhancements to the project)

Comments

@jbteves
Collaborator

jbteves commented Jan 18, 2019

Summary

Currently, tedana uses either all available cores or just one (#215). Users should be able to control the number of threads more precisely.

Goals

  • Determine a reasonable default number of threads
  • Implement changes to set that as the default
  • Determine whether users should be able to set the number of threads
  • Make the number of threads a user-accessible option in the CLI

Strategy

  1. Investigate methods of parsing user arguments before NumPy is imported. This appears to be surprisingly difficult; a rough sketch of the intended ordering follows below.
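
A minimal sketch of the pattern under investigation, assuming a hypothetical --n-threads option (this is not tedana's actual CLI). The point is purely the ordering: parse the argument and set the thread-related environment variables before NumPy, or anything that imports it, is loaded.

```python
import argparse
import os

# Parse only the thread-count option before any heavy imports happen.
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('--n-threads', type=int, default=1)  # hypothetical option name
opts, _ = parser.parse_known_args()

# Set the BLAS/OpenMP-related variables before NumPy is imported.
for var in ('MKL_NUM_THREADS', 'NUMEXPR_NUM_THREADS',
            'OMP_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS'):
    os.environ[var] = str(opts.n_threads)

import numpy as np  # noqa: E402 -- must come after the variables are set
```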

Additional/Summary

Discovered this problem when I redlined my image processing server, triggering angry e-mails from IT. PR #215 prevents tedana from blowing up clusters but uses a hardcoded one-core limit via configuration file. (Thanks @tsalo!)

@tsalo
Member

tsalo commented Jan 19, 2019

Ideally, the default should be to use all available cores, and users should be able to control it with an n_threads argument. That said, I've been having a fair amount of trouble controlling this within tedana.

We never explicitly control multithreading in tedana, so I believe that the best approach is to control the environment variables. This SO post provides a list of variables we should set.

Setting the following in the terminal before calling tedana successfully limits the number of CPUs used on my Mac:

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1

Unfortunately, setting these values within the workflow (with os.environ) so far only works if it is done at the top of the file, which means we couldn't set them as an argument. There might be a way around that, but supposedly it should only work if the modules that use these variables (especially numpy) are imported after the variables are set. However, even when I move the imports of all non-builtin modules (except argparse) to below where the environment variables are set, it doesn't work.
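
One way to check whether the reordering is actually taking effect is to look at sys.modules before setting the variables: if NumPy (or anything that imports it) is already loaded, its BLAS backend has most likely read the variables already, and changing them afterwards has no effect. A rough diagnostic, not tedana code:

```python
import os
import sys

def set_thread_limits(n_threads):
    """Set BLAS/OpenMP thread-count variables, warning if it is probably too late."""
    if 'numpy' in sys.modules:
        # The BLAS backend is initialized when NumPy is imported, so changes to
        # these variables after that point are typically ignored.
        print('Warning: numpy was imported before thread limits were set',
              file=sys.stderr)
    for var in ('MKL_NUM_THREADS', 'NUMEXPR_NUM_THREADS',
                'OMP_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS'):
        os.environ[var] = str(n_threads)
```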

One interesting note, though: on one core, the ICA step takes a lot less time than it does with all available cores, although the results are somewhat different. We'll need to take a look at the results to make sure they're still valid, but it's promising.

@tsalo added the "discussion" label (issues that still need to be discussed) Jan 19, 2019
@tsalo changed the title from "[Discussion] Reduce Multithreading" to "Reduce Multithreading" Jan 19, 2019
@jbteves
Collaborator Author

jbteves commented Jan 26, 2019

@tsalo I was re-reading the issues and noticed that we have another anecdote of slow performance on HPC systems in #141.
This makes me wonder whether we should include figuring out what's going on with the multithreaded ICA as part of the reliability analysis. Thoughts?

@tsalo
Member

tsalo commented Jan 26, 2019

It's hard to tell from the issue, but I looked at the version used in the code they posted, and it looks like that issue dates back to when tedana used mdp instead of sklearn. mdp just happened to take longer than sklearn to get through the 5000 iterations. I think the issue itself is outdated, and the only reason we haven't closed it is that the original author never responded.

That said, I think you have a great point. Since OS is within the scope of the reliability analysis, it might be worth it to include parallelization as well. We definitely don't want to make it the focus of the analysis, so running across a range of numbers of cores is unnecessary, but a comparison of "one core" vs "all available" would be great. After all, it could be that parallelization speeds things up generally, but not in the one case I tried out.

We could also profile the code to see whether parallelizing speeds up ICA but slows down other elements. This is purely conjectural, but it seemed like the spatial clustering took longer when I limited tedana to one core.
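
Purely as an illustration of that kind of check, a single run could be profiled with the standard library's cProfile and the top cumulative-time functions compared between the one-core and all-cores cases. The run_workflow placeholder below stands in for the actual tedana entry point, which isn't shown in this issue:

```python
import cProfile
import pstats

def run_workflow():
    # Call the tedana workflow here; the exact entry point is a placeholder.
    pass

# Profile one full run and dump the stats to disk.
cProfile.run('run_workflow()', 'tedana_profile.out')

# Print the 20 functions with the largest cumulative time (e.g. ICA vs. clustering).
stats = pstats.Stats('tedana_profile.out')
stats.sort_stats('cumulative').print_stats(20)
```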

@jbteves
Collaborator Author

jbteves commented Jan 26, 2019

Well, I don't know about "all available," since on my server that would be 36 cores, which probably violates a ton of the assumptions typically made. But I think 1, 4, and 8 should tell us a lot, or even just 1 and 4.

@tsalo
Member

tsalo commented Jan 26, 2019

By "all available" I meant more that it would use whatever resources are allocated to the job. If you use a scheduler, you should be able to set the number of cores you want to reserve to run the job. We just need to make sure that tedana respects that value. I believe that right now it will use any cores it can access, which we definitely do not want. For the analyses, 1 and 4 seem reasonable to me.
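
As a rough sketch of what "respecting the allocation" could look like, assuming a SLURM-style scheduler that exports SLURM_CPUS_PER_TASK (other schedulers use different variables), tedana could read the allocation and fall back to a single core when none is advertised:

```python
import os

# Cores reserved for the job by the scheduler (SLURM example); default to 1 if absent.
n_allocated = int(os.environ.get('SLURM_CPUS_PER_TASK', 1))

# Apply the limit without overriding values the user has already set explicitly.
for var in ('MKL_NUM_THREADS', 'NUMEXPR_NUM_THREADS',
            'OMP_NUM_THREADS', 'VECLIB_MAXIMUM_THREADS'):
    os.environ.setdefault(var, str(n_allocated))
```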

@tsalo added this to the "method extensions & improvements" milestone Feb 9, 2019
@tsalo added the "enhancement" label (issues describing possible enhancements to the project) and removed the "discussion" label Feb 11, 2019
@jbteves changed the title from "Reduce Multithreading" to "Programmatically Control Thread Count" Feb 11, 2019
@jbteves
Collaborator Author

jbteves commented May 23, 2019

From reading the open NumPy issue, this seems to be a shockingly difficult problem; the lengthy discussion there basically leads me to conclude that we are hopelessly downstream of it. As a quick summary, the issue appears to rest in the BLAS libraries that NumPy may or may not use, since they are system- and even CPU-dependent (yikes). Furthermore, the NumPy commenters appear to believe that this is not a priority for them or their users, reducing the likelihood that the problem will be solved in the near future. As much as this frustrates me, I have to conclude that we should stick with @tsalo's solution of reducing the default core count to 1 and hope that someday this will be fixed upstream of us.
