
n3fit OOM in conda env #522

Closed · siranipour opened this issue Jul 30, 2019 · 33 comments
Labels: n3fit (Issues and PRs related to n3fit)

@siranipour (Contributor)

I've been trying to run a 1-replica DIS-only fit using the new code. I use the given PN3_DIS_example.yml runcard and execute with n3fit PN3_DIS_example.yml 1 -o NNPDF31_nnlo_as_0118_DISonly_NEWCODE.

The code works, but memory usage quickly ramps up until it OOMs and Linux begins to swap.

I have been talking to Juan about this and he reports the same issue, provided the installation is done through conda. I quote this useful snippet from our email exchange:

> two issues that do not appear outside the conda environment:
> 1 - The memory grows at a point where there should be no more memory growth (the model has already been built, so the memory usage from that point onwards should be minimal).
> 2 - It generates an absurd number of threads after the parallelization has already occurred. My guess is that these threads are what generates the out-of-memory, because they are not created in the non-conda version.

A probable culprit is a buggy library in the conda package. It would be useful to take a look at this. Cheers.
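
To make the thread and memory growth concrete, one way to watch the running fit from Python is with the third-party psutil package. This is an illustrative sketch, not part of n3fit, and the PID is hypothetical:

# Sketch: inspect the native thread count and resident memory of a running
# n3fit process. Requires the third-party psutil package.
import psutil

proc = psutil.Process(12345)  # hypothetical PID of the n3fit process
print(f"threads: {proc.num_threads()}")
print(f"RSS: {proc.memory_info().rss / 2**30:.2f} GiB")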

@scarlehoff (Member) commented Jul 30, 2019

I just tried doing the following

pip uninstall tensorflow
pip install tensorflow-gpu

And the memory seems to be under control. This is strange because in theory they are the same version of tensorflow and should behave the same provided you don't have a GPU, so I guess whoever is in charge of generating the standard tensorflow conda package screwed up the latest version.

I wouldn't call this a fix because I have no idea why it worked; I just happened to test this first and it worked for me. If it works for anyone else, let's call it "a workaround".

@Zaharid (Contributor) commented Jul 30, 2019

I guess the conda tensorflow is compiled with some unhelpful flags. Maybe because of the use of the MKL library, which has its own ideas about thread parallelization?

Btw, pip install tensorflow is a known pitfall that breaks the world around it. See e.g.

https://discuss.python.org/t/the-next-manylinux-specification/1043/21
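
If MKL is indeed the culprit, the OpenMP/MKL threading knobs from that discussion can be pinned from the environment before tensorflow is loaded. A minimal sketch, with illustrative rather than tuned values:

# Sketch: pin the OpenMP/MKL threading knobs. These variables are read
# when the library loads, so they must be set before importing tensorflow.
import os

os.environ["OMP_NUM_THREADS"] = "1"  # cap the OpenMP worker threads
os.environ["KMP_BLOCKTIME"] = "0"    # do not let idle threads spin-wait
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf  # import only after the environment is set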

@siranipour (Contributor, Author)

Can confirm that this worked for me too! Although if I do conda list I still see tensorflow as one of the installed libraries. Am I being silly, or is that supposed to happen?

@Zaharid (Contributor) commented Jul 30, 2019

I think conda list also tracks pip-installed packages.

@Zaharid (Contributor) commented Jul 30, 2019

@scarlehoff could you perhaps have a look at this

https://www.tensorflow.org/guide/performance/overview#manual_tuning

and see if these options need to be tweaked. I guess we are doing a rather unusual workflow and tensorflow's optimizations are misfiring.

@Zaharid (Contributor) commented Jul 30, 2019

E.g., does setting export OMP_NUM_THREADS=1 help?

@scarlehoff (Member)

> @scarlehoff could you perhaps have a look at this
>
> https://www.tensorflow.org/guide/performance/overview#manual_tuning
>
> and see if these options need to be tweaked. I guess we are doing a rather unusual workflow and tensorflow's optimizations are misfiring.

@Zaharid

I already do some of that, and it does affect the threads that are generated during model building.

But then the conda-installed version generates another bunch of threads after everything is set (and, if it is tensorflow creating the extra threads, it is ignoring those options).

Also, OMP_NUM_THREADS=1 was my first test, but it didn't seem to help.

Note: lately the "reply from mail" option in GitHub seems to be unreliable...

@Zaharid (Contributor) commented Jul 30, 2019

What about the inter_op_parallelism_threads setting?
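
For reference, in the TF 1.x API those pools are configured through a ConfigProto attached to the session; for a Keras-based code like n3fit that means something along these lines (a sketch, with illustrative thread counts):

# Sketch: cap the TF 1.x thread pools. The counts here are illustrative.
import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=1,  # pool for independent ops
    intra_op_parallelism_threads=4,  # pool used inside a single op
)
tf.keras.backend.set_session(tf.Session(config=config))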

@scarlehoff (Member)

Yes.

But I am not sure that is the problem; it is a guess, because my non-conda installation doesn't open so many threads, and because those threads are opened after tensorflow is already running in multithreaded mode.

@Zaharid (Contributor) commented Jul 30, 2019

Sorry, yes what?

I'd say the difference is that conda ships the Intel-optimized low-level code, which comes with its own flags and settings and happens to have terribly bad defaults here.

So we should see if there is some setting that can be changed, or else complain somewhere.

@scarlehoff (Member)

Yes, I've tried what you suggest in your last message. It is actually set by default at the beginning of every n3fit run. As I said before, for me all those options only affect the normal flow of tensorflow and not the extra threads that are only opened in the conda case.

In my non-conda installation I am using the Intel optimization flags, so I'd say the problem might be that the conda version does not have the optimizations on?

And I insist, I am not sure the extra threads are connected to the problem. They don't seem to be opened by tensorflow, or (if they are) they are ignoring all tensorflow settings and somehow producing a memory leak...

@Zaharid (Contributor) commented Jul 30, 2019

Are you compiling tensorflow yourself and linking it with MKL-DNN, or whatever it is called?

I think the conda version (from the defaults channel) is supposed to have all the proprietary goodies.

@scarlehoff (Member)

No, I am using Arch's package on one computer and Debian's package on another.
I should say I also had a student using the conda one, running many fits for a while, so the problem must have been introduced in the latest version.

@scarrazza (Member)

Could you please specify the tf version?

siranipour self-assigned this Jul 31, 2019
@Zaharid (Contributor) commented Jul 31, 2019

Moreover, it would be good to know whether this fixes itself by using an older version of tensorflow (which you can get with conda install tensorflow=<version>) or, if not, by installing tensorflow from the conda-forge channel.

@wilsonmr (Contributor) commented Aug 1, 2019

I don't think I'm seeing this on my uni desktop, which has the conda package tensorflow=1.13. (Is it obvious when there is a problem?)

@siranipour (Contributor, Author)

I just monitored the memory usage with htop; it began to skyrocket when the models were created, I believe.
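
As an alternative to eyeballing htop, the peak resident memory of the fit itself can be logged from inside Python with the standard library. A sketch (note that ru_maxrss is in kilobytes on Linux, bytes on macOS):

# Sketch: report the peak resident set size of the current process (Linux).
import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kB on Linux
print(f"peak RSS: {peak_kb / 2**20:.2f} GiB")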

@wilsonmr (Contributor) commented Aug 1, 2019

I appear to be fairly stable at 40%, which is about 4 Gb, admittedly more than advised for a DIS-only fit. So perhaps looking into the conda package version would be a good idea, as Zahari suggested, or at least seeing whether you can reproduce the bug with a different version of tensorflow.

@scarlehoff (Member) commented Aug 12, 2019

I just tried installing 1.13 and I don't see the memory disaster, but the extra unnecessary threads are created anyway. It would be good to test several versions of tensorflow from conda to see whether one of them does the trick.

But like @wilsonmr, I do see that it uses more memory than the non-conda version (and it slowly grows...), which makes me think something is still wrong.

@scarlehoff (Member)

The conda-forge package is a repackaging of TensorFlow's own wheels (which I imagine is the same thing pip installs), while for the defaults channel they compile tensorflow themselves (see conda-forge/tensorflow-feedstock#64).

So forcing the tensorflow in the conda package to be installed from conda-forge might be a nice compromise fix.

@Zaharid (Contributor) commented Aug 14, 2019

Seems like there is no way to specify that in the recipe:

conda/conda-build#3656

But maybe we can specify a version that is known to work?

@scarlehoff (Member)

That's a pity.
Using Tensorflow 1.13.1 from defaults and from conda-forge I get:

Conda-forge:
Stable memory usage, 1.8 Gb

Defaults:
Slowly growing memory, 2.4 Gb with the DIS runcard.

@Zaharid (Contributor) commented Aug 17, 2019

@siranipour Please confirm Juan's results above and try to see whether some other version from defaults works.

@siranipour (Contributor, Author)

I'm having no such luck, fellas. In my conda env I run conda uninstall tensorflow and then conda install -c conda-forge tensorflow, but the memory still begins to climb. Am I missing something obvious? Should I have used pip instead?

@Zaharid (Contributor) commented Aug 19, 2019 via email

@scarlehoff (Member) commented Aug 19, 2019

> I'm having no such luck, fellas. In my conda env I run conda uninstall tensorflow and then conda install -c conda-forge tensorflow, but the memory still begins to climb. Am I missing something obvious? Should I have used pip instead?

Just doing conda install -c conda-forge tensorflow=1.13.1 should do the trick.
I.e., what I did was

conda install nnpdf
conda install -c conda-forge tensorflow=1.13.1

Note that if you just do conda install -c conda-forge tensorflow, conda will ignore you (though maybe it works with --force-reinstall). I think -c just adds a channel to the search, so if there is a newer version in a different channel it might pick that one instead.

Edit: I just read the --help; the options --override-channels and --strict-channel-priority would produce the same result.

@siranipour (Contributor, Author)

Same story, even from a scratch environment. I run

conda create -n temp
conda activate temp
conda install nnpdf
conda install -c conda-forge tensorflow

@Zaharid (Contributor) commented Aug 19, 2019 via email

@siranipour (Contributor, Author)

Ah, perfect, that did the trick! Sitting pretty at 3.7G and it's rock solid. Incidentally, the standard output was different this time: I got more of the green [INFO] blocks when generating the pseudodata etc. Cheers, gents.

@scarlehoff (Member)

> Ah, perfect, that did the trick! Sitting pretty at 3.7G and it's rock solid. Incidentally, the standard output was different this time: I got more of the green [INFO] blocks when generating the pseudodata etc. Cheers, gents.

The difference in the logs is due to a series of bugs in tensorflow and abseil that break Python's standard logging:
abseil/abseil-py#99
tensorflow/tensorflow#26691
I have a workaround written for it, but according to them the problem is fixed and will be pushed shortly.
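
For context, the commonly circulated workaround for abseil/abseil-py#99 at the time looked roughly like the sketch below. It reaches into a private abseil attribute, so treat it as a stopgap that may break across versions:

# Sketch: detach abseil's handler from the root logger so that standard
# Python logging (and hence n3fit's [INFO] blocks) renders normally again.
# _absl_handler is a private attribute of absl.logging and may change.
import logging
import absl.logging

logging.root.removeHandler(absl.logging._absl_handler)
absl.logging._warn_preinit_stderr = False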

scarlehoff added the n3fit label Aug 29, 2019
@scarlehoff (Member)

Has this been fixed (be it in the conda recipe or magically by the conda package)?

@siranipour (Contributor, Author)

Think so? I will take a proper look.

@scarlehoff (Member)

I will close this one, since the version of TF we are using has changed anyway. If it happens again, it will be "a new issue" in every sense of the word.
