
n3fit OOM in conda env #522

Closed · siranipour opened this issue Jul 30, 2019 · 33 comments
Labels: n3fit (Issues and PRs related to n3fit)

@siranipour (Contributor)

I've been trying to run a 1-replica DIS-only fit using the new code. I use the given PN3_DIS_example.yml runcard and execute with n3fit PN3_DIS_example.yml 1 -o NNPDF31_nnlo_as_0118_DISonly_NEWCODE.

The code works, but memory usage quickly ramps up until it OOMs and Linux begins to swap.

I have been talking to Juan about this and he reports the same issue, provided the installation is done through conda. I quote this useful snippet from our email exchange:

> two issues that do not appear outside the conda environment:
> 1 - The memory grows at a point where there should be no more memory growth (the model has already been built, so the memory usage from that point onwards should be minimal).
> 2 - It generates an absurd number of threads after the parallelization has already occurred. My guess is that these threads are what generates the out-of-memory, because they are not created in the non-conda version.

A probable culprit is a buggy library in the conda package. It would be useful to take a look at this. Cheers.
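
To make the thread and memory growth concrete, one way to watch the running fit from Python is with the third-party psutil package. This is an illustrative sketch, not part of n3fit, and the PID is hypothetical:

# Sketch: inspect the native thread count and resident memory of a running
# n3fit process. Requires the third-party psutil package.
import psutil

proc = psutil.Process(12345)  # hypothetical PID of the n3fit process
print(f"threads: {proc.num_threads()}")
print(f"RSS: {proc.memory_info().rss / 2**30:.2f} GiB")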

@scarlehoff (Member) commented Jul 30, 2019

I just tried doing the following

pip uninstall tensorflow
pip install tensorflow-gpu

And the memory seems to be under control. This is strange because in theory they are the same version of tensorflow and should behave the same provided you don't have a GPU, so I guess whoever is in charge of generating the standard tensorflow conda package screwed up the latest version.

I wouldn't call this a fix because I have no idea why it worked; I just happened to test this first and it worked for me. If it works for anyone else, let's call it "a workaround".

@Zaharid (Contributor) commented Jul 30, 2019

I guess the conda tensorflow is compiled with some unhelpful flags. Maybe because of the use of the MKL library, which has its own ideas about thread parallelization?

Btw, pip install tensorflow is a known pitfall that breaks the world around it. See e.g.

https://discuss.python.org/t/the-next-manylinux-specification/1043/21
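
If MKL is indeed the culprit, the OpenMP/MKL threading knobs from that discussion can be pinned from the environment before tensorflow is loaded. A minimal sketch, with illustrative rather than tuned values:

# Sketch: pin the OpenMP/MKL threading knobs. These variables are read
# when the library loads, so they must be set before importing tensorflow.
import os

os.environ["OMP_NUM_THREADS"] = "1"  # cap the OpenMP worker threads
os.environ["KMP_BLOCKTIME"] = "0"    # do not let idle threads spin-wait
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf  # import only after the environment is set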

@siranipour (Contributor, Author)

Can confirm that this worked for me too! Although if I do conda list I still see tensorflow as one of the installed libraries. Am I being silly, or is that supposed to happen?

@Zaharid (Contributor) commented Jul 30, 2019

I think conda list also tracks pip-installed packages.

@Zaharid (Contributor) commented Jul 30, 2019

@scarlehoff could you perhaps have a look at this

https://www.tensorflow.org/guide/performance/overview#manual_tuning

and see if these options need to be tweaked. I guess we are doing a rather unusual workflow and tensorflow's optimizations are misfiring.

@Zaharid (Contributor) commented Jul 30, 2019

E.g., does setting export OMP_NUM_THREADS=1 help?

@scarlehoff (Member)

> @scarlehoff could you perhaps have a look at this
>
> https://www.tensorflow.org/guide/performance/overview#manual_tuning
>
> and see if these options need to be tweaked. I guess we are doing a rather unusual workflow and tensorflow's optimizations are misfiring.

@Zaharid

I already do some of that, and it does affect the threads that are generated during model building.

But then the conda-installed version generates another bunch of threads after everything is set (and, if it is tensorflow creating the extra threads, it is ignoring those options).

Also, OMP_NUM_THREADS=1 was my first test, but it didn't seem to help.

Note: lately the "reply from mail" option in GitHub seems to be unreliable...

@Zaharid (Contributor) commented Jul 30, 2019

What about the inter_op_parallelism_threads setting?
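
For reference, in the TF 1.x API those pools are configured through a ConfigProto attached to the session; for a Keras-based code like n3fit that means something along these lines (a sketch, with illustrative thread counts):

# Sketch: cap the TF 1.x thread pools. The counts here are illustrative.
import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=1,  # pool for independent ops
    intra_op_parallelism_threads=4,  # pool used inside a single op
)
tf.keras.backend.set_session(tf.Session(config=config))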

@scarlehoff (Member)

Yes.

But I am not sure that is the problem; it is a guess, because my non-conda installation doesn't open so many threads, and because those threads are opened after tensorflow is already running in multithreaded mode.

@Zaharid (Contributor) commented Jul 30, 2019

Sorry, yes what?

I'd say the difference is that conda ships the Intel-optimized low-level code, which comes with its own flags and settings and happens to have terribly bad defaults here.

So we should see if there is some setting that can be changed, or else complain somewhere.

@scarlehoff (Member)

Yes, I've tried what you suggest in your last message. It is actually set by default at the beginning of every n3fit run. As I said before, for me all those options only affect the normal flow of tensorflow and not the extra threads that are only opened in the conda case.

In my non-conda installation I am using the Intel optimization flags, so I'd say the problem might be that the conda version does not have the optimizations on?

And I insist, I am not sure the extra threads are connected to the problem. They don't seem to be opened by tensorflow, or (if they are) they are ignoring all tensorflow settings and somehow producing a memory leak...

@Zaharid (Contributor) commented Jul 30, 2019

Are you compiling tensorflow yourself and linking it with MKL-DNN, or whatever it is called?

I think the conda version (from the defaults channel) is supposed to have all the proprietary goodies.

@scarlehoff (Member)

No, I am using Arch's package on one computer and Debian's package on another.
I should say I also had a student using the conda one, running many fits for a while, so the problem must have been introduced in the latest version.

@scarrazza (Member)

Could you please specify the tf version?

siranipour self-assigned this Jul 31, 2019
@Zaharid (Contributor) commented Jul 31, 2019

Moreover, it would be good to know whether this fixes itself by using an older version of tensorflow (which you can get with conda install tensorflow=<version>) or, if not, by installing tensorflow from the conda-forge channel.

@wilsonmr (Contributor) commented Aug 1, 2019

I don't think I'm seeing this on my uni desktop, which has the conda package tensorflow=1.13. (Is it obvious when there is a problem?)

@siranipour (Contributor, Author)

I just monitored the memory usage with htop; it began to skyrocket when the models were created, I believe.
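
As an alternative to eyeballing htop, the peak resident memory of the fit itself can be logged from inside Python with the standard library. A sketch (note that ru_maxrss is in kilobytes on Linux, bytes on macOS):

# Sketch: report the peak resident set size of the current process (Linux).
import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
print(f"peak RSS: {peak_kb / 2**20:.2f} GiB")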

@wilsonmr (Contributor) commented Aug 1, 2019

I appear to be fairly stable at 40%, which is about 4 Gb, admittedly more than advised for a DIS-only fit. So perhaps looking into the conda package version would be a good idea, as Zahari suggested, or at least seeing whether you can reproduce the bug with a different version of tensorflow.

@scarlehoff (Member) commented Aug 12, 2019

I just tried installing 1.13 and I don't see the memory disaster, but the extra unnecessary threads are created anyway. It would be good to test several versions of tensorflow from conda to see whether one of them does the trick.

But like @wilsonmr, I do see that it uses more memory than the non-conda version (and it slowly grows...), which makes me think something is still wrong.

@scarlehoff (Member)

The conda-forge package is a repackaging of TensorFlow's own wheels (which I imagine is the same thing pip installs), while for the defaults channel they compile tensorflow themselves (see conda-forge/tensorflow-feedstock#64).

So forcing the tensorflow in the conda package to be installed from conda-forge might be a nice compromise fix.

@Zaharid (Contributor) commented Aug 14, 2019

Seems like there is no way to specify that in the recipe:

conda/conda-build#3656

But maybe we can specify a version that is known to work?

@scarlehoff (Member)

That's a pity.
Using Tensorflow 1.13.1 from defaults and from conda-forge I get:

Conda-forge:
Stable memory usage, 1.8 Gb

Defaults:
Slowly growing memory, 2.4 Gb with the DIS runcard.

@Zaharid (Contributor) commented Aug 17, 2019

@siranipour Please confirm Juan's results above and try to see whether some other version from defaults works.

@siranipour (Contributor, Author)

I'm having no such luck, fellas. In my conda env I run conda uninstall tensorflow and then conda install -c conda-forge tensorflow, but the memory still begins to climb. Am I missing something obvious? Should I have used pip instead?

@Zaharid (Contributor) commented Aug 19, 2019 via email

@scarlehoff (Member) commented Aug 19, 2019

> I'm having no such luck, fellas. In my conda env I run conda uninstall tensorflow and then conda install -c conda-forge tensorflow, but the memory still begins to climb. Am I missing something obvious? Should I have used pip instead?

Just doing conda install -c conda-forge tensorflow=1.13.1 should do the trick.
I.e., what I did was

conda install nnpdf
conda install -c conda-forge tensorflow=1.13.1

Note that if you just do conda install -c conda-forge tensorflow, conda will ignore you (though maybe it works with --force-reinstall). I think -c just adds a channel to the search, so if there is a newer version in a different channel it might pick that one instead.

Edit: I just read the --help; the options --override-channels and --strict-channel-priority would produce the same result.

@siranipour (Contributor, Author)

Same story, even from a scratch environment. I run

conda create -n temp
conda activate temp
conda install nnpdf
conda install -c conda-forge tensorflow

@Zaharid (Contributor) commented Aug 19, 2019 via email

@siranipour (Contributor, Author)

Ah, perfect, that did the trick! Sitting pretty at 3.7G and it's rock solid. Incidentally, the standard output was different this time: I got more of the green [INFO] blocks when generating the pseudodata etc. Cheers, gents.

@scarlehoff (Member)

> Ah, perfect, that did the trick! Sitting pretty at 3.7G and it's rock solid. Incidentally, the standard output was different this time: I got more of the green [INFO] blocks when generating the pseudodata etc. Cheers, gents.

The difference in the logs is due to a series of bugs in tensorflow and abseil that break Python's standard logging:
abseil/abseil-py#99
tensorflow/tensorflow#26691
I have a workaround written for it, but according to them the problem is fixed and will be pushed shortly.
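
For context, the commonly circulated workaround for abseil/abseil-py#99 at the time looked roughly like the sketch below. It reaches into a private abseil attribute, so treat it as a stopgap that may break across versions:

# Sketch: detach abseil's handler from the root logger so that standard
# Python logging (and hence n3fit's [INFO] blocks) renders normally again.
# _absl_handler is a private attribute of absl.logging and may change.
import logging
import absl.logging

logging.root.removeHandler(absl.logging._absl_handler)
absl.logging._warn_preinit_stderr = False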

scarlehoff added the n3fit label Aug 29, 2019
@scarlehoff (Member)

Has this been fixed (be it in the conda recipe or magically by the conda package)?

@siranipour (Contributor, Author)

Think so? I will take a proper look.

@scarlehoff (Member)

I will close this one, since the version of TF we are using has changed anyway. If it happens again, it will be "a new issue" in every sense of the word.
