n3fit OOM in conda env #522
I just tried doing the following
And the memory seems to be under control. This is strange because in theory it is the same version of tensorflow and it should behave the same provided you don't have a GPU, so I guess whoever is in charge of generating the standard tensorflow conda package got something wrong in the latest version. I wouldn't call this a fix, because I have no idea why it fixed it; I just happened to test this first and it worked for me. If it works for anyone else, let's call it "a workaround".
I guess the conda tensorflow is compiled with some unhelpful flags. Maybe because of the use of the MKL library, with its own ideas about thread parallelization? Btw, https://discuss.python.org/t/the-next-manylinux-specification/1043/21
Can confirm that this worked for me also! Although if I do
I think `conda list` tracks the pip-installed packages.
@scarlehoff could you perhaps have a look at this https://www.tensorflow.org/guide/performance/overview#manual_tuning and see if these options need to be tweaked. I guess our workflow is rather unusual and it is triggering some failed optimization.
E.g., does setting
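As a concrete starting point for the tuning options mentioned above, here is a minimal sketch, under the assumption that the runaway threads come from the OpenMP/MKL layer: the environment variables must be exported before TensorFlow is imported, and the session options use the TF 1.x API that was current in this thread. Whether these knobs actually tame the extra threads is exactly what the discussion below tries to establish.

```python
import os

# Thread-limiting knobs must be set before TensorFlow (or any MKL-backed
# library) is imported, otherwise its thread pools are created with defaults.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

try:
    import tensorflow as tf

    # TF 1.x API, current at the time of this thread: cap both thread pools.
    config = tf.ConfigProto(
        intra_op_parallelism_threads=1,
        inter_op_parallelism_threads=1,
    )
    session = tf.Session(config=config)
except (ImportError, AttributeError):
    # TensorFlow absent (or a 2.x build without ConfigProto); the
    # environment variables above are still in effect.
    pass
```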
I already do some of that, and it does affect the threads which are generated during model building. But then the conda-installed version generates another bunch of threads after everything is set (and, if it is tensorflow creating the extra threads, it is ignoring those options). Also, OMP_NUM_THREADS=1 was my first test, but it didn't seem to help. Note: lately the "reply from mail" option in github seems to be unreliable...
What about the
Yes. But I am not sure that is the problem; it is a guess, both because in my non-conda installation it doesn't open so many threads and because those threads are opened after tensorflow is already functioning in multithreaded mode.
Sorry, yes what? I'd say that the difference is that conda comes with the Intel-optimized low level code, which comes with its own flags and settings and happens to have terribly bad defaults here. So we should see if there is some setting that can be changed, or else complain somewhere.
Yes, I've tried what you comment in your last msg. It is actually set by default at the beginning of every run of n3fit. As I said before, for me all those options only affect the normal flow of tensorflow and not those extra threads that are only opened in the conda case. In my non-conda installation I am using the intel optimization flags, so I'd say the problem might be that the conda version does not have the optimizations on? And I insist, I am not sure the extra threads are connected to the problem. They don't seem to be opened by tensorflow or, if they are, they are ignoring all tensorflow settings and producing a memory leak somehow...
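The "extra threads" claim above is easy to check empirically. A small stdlib-only sketch, assuming a Linux machine (it reads the kernel's per-process `Threads:` field from `/proc`, which does not exist on other platforms):

```python
def thread_count(pid="self"):
    """Return the kernel's thread count for a process, read from /proc (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 0

# Sampling this before importing tensorflow, after the import, and after the
# first model is built would show exactly when the extra threads appear,
# and whether the conda and non-conda installs differ at the same point.
print("threads:", thread_count())
```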
Are you compiling tensorflow yourself and linking it with MKL-DNN or whatever it is called? I think the conda version (from the defaults channel) is supposed to have all the proprietary goodies.
No, I am using arch's package on one computer, debian's package on another.
Could you please specify the tf version?
Moreover it would be good to know if this fixes itself by using an older version of tensorflow (which you can get with
I don't think I'm seeing this with my uni desktop which has
I just monitored the memory usage with
I appear to be fairly stable at 40%, which is like 4Gb, which is admittedly more than advised for a DIS-only fit. So perhaps looking into the conda package version would be a good idea, as Zahari suggested. Or at least to see if you can reproduce the bug with a different version of
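Rather than eyeballing a system monitor, the slow growth reported in this thread can be measured from inside the process with the stdlib `resource` module. A sketch, with the caveat that `ru_maxrss` is reported in KiB on Linux but in bytes on macOS, so the conversion below assumes Linux (where these fits were run):

```python
import resource

def peak_rss_mib():
    """Peak resident set size of this process, assuming Linux's KiB units."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Printing this periodically (e.g. once per training epoch) distinguishes a
# stable footprint from a slow leak without watching top by hand.
print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```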
I just tried installing 1.13 and I don't see the memory disaster, but the extra unnecessary threads are created anyway. It would be good to test several versions of tensorflow from conda to see whether one of them does the trick. But like @wilsonmr, I do see that it is using more memory than the non-conda version (and it slowly grows...), which makes me think something is still wrong.
The conda-forge package is a repackaging of Tensorflow's own wheels (which I imagine is the same source pip uses), while for the default repository they compile tensorflow themselves (see conda-forge/tensorflow-feedstock#64). So forcing tensorflow in the conda package to be installed from conda-forge might be a nice compromise fix.
Seems like there is no way to specify that in the recipe: But maybe we can specify a version that is known to work?
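To make the suggestion concrete: a conda recipe cannot force a channel in its requirements, but it can pin a version. A hypothetical `meta.yaml` fragment (the pin on 1.13 is only an example, picked because it was reported above not to show the memory disaster):

```yaml
# Hypothetical fragment of the nnpdf conda recipe's meta.yaml.
# Channels cannot be forced here, but a known-good version can be pinned.
requirements:
  run:
    - tensorflow ==1.13.*
```

Users would still need to have conda-forge (or whichever channel ships a working build) enabled when installing.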
That's a pity. Conda-forge: Default:
@siranipour Please confirm the results by Juan above and try to see if some other version from defaults works.
I'm having no such luck fellas. In my conda env I run conda uninstall tensorflow then conda install -c conda-forge tensorflow but the memory still begins to climb. Am I missing something obvious?? Should I have used pip instead?
Try creating an environment from scratch.
Just doing
Note that if you just do
Edit: just read the --help, there are the options
Same story even from a scratch environment. I run
conda create -n temp
conda activate temp
conda install nnpdf
conda install -c conda-forge tensorflow
Does the last line actually install something or ignore you as Juan says?
Ah perfect that did the trick! Sitting pretty at 3.7G and it's rock solid. Incidentally, the standard output this time was different. I got more of the green
The difference in the logs is because of a series of bugs in tensorflow and abseil which break python's standard logging:
Has this been fixed? (be it in the conda-recipe or magically by the conda package)?
Think so? Will take a proper look
I will close this one since the version of TF we are using has also changed anyway. If it happens it will be "a new issue" in every sense of the word.
I've been trying to run a 1-rep DIS-only fit using the new code. I use the given PN3_DIS_example.yml runcard and execute with n3fit PN3_DIS_example.yml 1 -o NNPDF31_nnlo_as_0118_DISonly_NEWCODE. The code works, but quickly ramps up in memory usage before OOMing, and linux begins to swap.
I have been talking to Juan about this and he reports the same issue, provided the installation is done through conda. I quote this useful snippet from our email exchange:
A probable culprit is a bugged library in the conda package. Would be useful to take a look at this. Cheers.