Upgrade to TF 2.6 #7619

Closed
16 of 19 tasks
dakshvar22 opened this issue Dec 21, 2020 · 43 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework
area:rasa-oss/ml 👁 All issues related to machine learning
type:dependencies Pull requests that update a dependency file

Comments

dakshvar22 (Contributor) commented Dec 21, 2020

TF 2.4 is out, so we should update the dependency inside Rasa Open Source to use it.
It introduces some breaking changes, as mentioned in the changelog.

Update: we are now targeting 2.6; see here.

Note that there's a draft PR here that already contains some of the necessary changes. The known remaining tasks for actually updating the code to TF 2.5 are listed below (this list contains only the known remaining tasks; there may be others, so it is not exhaustive):

Temporarily de-prioritised followups:

Definition of done
No memory tests failing
No increased timeouts for tests
Model regression tests passing with approximately the same time and accuracy

dakshvar22 added the area:rasa-oss/ml 👁 label Dec 21, 2020
eikooc commented Dec 23, 2020

Rasa does not work on the new M1 chip until this dependency is upgraded. Just thought I would leave that as a note here.

Ghostvv self-assigned this Jan 14, 2021
Ghostvv (Contributor) commented Jan 14, 2021

Blocked until a new version is released with this fix: https://github.com/tensorflow/tensorflow/pull/45534/files#diff-39eadfc771b57ef11bff67b85cc38a47b9c13ec203b145e3feaf556a43bdccb1R4762. Update: no, it doesn't seem to be the problem.

Ghostvv (Contributor) commented Jan 14, 2021

Pushed the necessary code updates to the tf-2.4 branch.

Ghostvv (Contributor) commented Jan 18, 2021

Blocked by tensorflow/tensorflow#46511.

Ghostvv removed their assignment Jan 18, 2021
koernerfelicia (Contributor)

Once this is done, we should check this: #7762

Ghostvv (Contributor) commented Jan 22, 2021

Most probably we will have to wait until 2.5.

alwx added the area:rasa-oss 🎡 and type:dependencies labels Jan 27, 2021
koernerfelicia (Contributor)

Related: #7793

luzhongqiu

Same issue here; waiting for a TF version above 2.4. Thanks.

rmiaouh commented Feb 17, 2021

Hi, same issue here. We will also need the TF 2.4 version. Thank you all.

koernerfelicia (Contributor)

Once this is done, we should revisit this and see if it has been fixed:

#8004

Ghostvv (Contributor) commented Mar 9, 2021

We cannot upgrade to 2.4; we need to wait until 2.5.

koernerfelicia (Contributor)

Yeah, sorry, I was being unclear. At this point, when I say "done" I mean whatever TensorFlow upgrade we eventually manage.

joejuzl mentioned this issue Apr 9, 2021
wochinge changed the title from "Upgrade to TF 2.4" to "Upgrade to TF 2.4 / Apple M1 compatibility" Apr 19, 2021

tmbo (Member) commented Jun 14, 2021

@TyDunn is this one still blocked? Given that we'll need to buy M1s as well, it would be great to sort this out 😅

koernerfelicia (Contributor)

@tmbo I'm trying to verify this atm, will update when I know more!

marjoripomarole

I am available to help migrate this over and test it. Let me know if there is any ongoing work and how I can help speed it up. I have an M1 on hand and the desire to get Rasa working on it ASAP :)

koernerfelicia (Contributor)

@mpomarole thank you for the kind offer! We'll let you know if we need any support :)

koernerfelicia changed the title from "Upgrade to TF 2.4 / Apple M1 compatibility" to "Upgrade to TF 2.5 / Apple M1 compatibility" Jun 17, 2021
dakshvar22 (Contributor, Author)

@samsucik Do we have a small TensorFlow code snippet with a very basic model and dummy data that could demonstrate this increase in training time? I think it would be worthwhile to ask the TensorFlow folks whether we are doing something wrong or whether the increase is indeed expected.

Also, is the increase in time observable on both CPU and GPU?
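
For illustration, a minimal sketch of what such a snippet could look like; the model architecture, data shapes, and hyperparameters here are arbitrary placeholders (not taken from Rasa's code), and the idea is simply to run the same script under TF 2.3 and TF 2.5/2.6 and compare the reported wall-clock time:

```python
import time

import numpy as np
import tensorflow as tf

# Dummy classification data: sizes are arbitrary, chosen only to make the
# timing difference between TF versions measurable.
x = np.random.rand(10_000, 100).astype(np.float32)
y = np.random.randint(0, 5, size=(10_000,))

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(5),
    ]
)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

start = time.perf_counter()
model.fit(x, y, batch_size=64, epochs=5, verbose=0)
print(f"TF {tf.__version__}: training took {time.perf_counter() - start:.1f}s")
```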

samsucik (Contributor) commented Jul 21, 2021

@dakshvar22 I'm coming back to this after almost a week and my previous comments basically summarise where I had left off. What you're describing is the state I'd like to reach asap 😉

Also, is the increase in time observable on both CPU and GPU?

Yes. I saw the regression tests taking roughly 3x to 5x longer (on GPU runners), and I saw the same or similar slowdown with the e2ebot trained locally on a CPU.

samsucik (Contributor)

Runtimes & failing model regression tests

To complete my previous update on model regression tests, I've re-run the tests on the Hermit dataset, and the tests involving the transformer version of DIET (i.e. DIET(seq)) all failed again, in the same way as before. It looks as if the failed runs were killed because they were taking too long -- they got killed after ~2 hours, whereas they normally take only around 15 minutes (see the most recent scheduled run). This needs further investigation -- it might be due to something hanging, though I wouldn't necessarily link it to the hanging we observed previously.

All in all, with TF 2.5, our regression test runs take ~5x longer and sometimes even get killed in the process (TF 2.5 run vs scheduled run).

samsucik (Contributor)

Overall status update & handover notes

My work is in the tf-2.5 branch and its associated PR. The old tf-2.4 branch and its associated PR are now behind and abandoned, though the regression tests were run on the old PR (the new one didn't exist yet).

Having observed the regression tests taking so long, I profiled the training of our small e2ebot and took it from there (since this training was also taking much longer with TF 2.5). Using snakeviz, I noticed that a number of ops were taking particularly long, especially IfGrad and WhileGrad, and the various TF CRF methods such as crf_log_norm or crf_log_likelihood. However, this later turned out to be slightly misleading, because the high runtime isn't really due to something like IfGrad itself, but potentially to some of its sub-calls.

Still, seeing the CRF methods taking so long, I returned to the distilled example of our CRF layer used here, adapted it, and used it for further investigation. Indeed, I saw the CRF methods still taking longer with TF 2.5, but I found it hard to pinpoint the precise issue (whether by searching through others' complaints on the internet, by digging into the code, or by using the profiling tools). While I managed to make TF Profiler work with both the e2ebot-training example and with the CRF one, I'm quite new to the tool and felt like I wasn't getting much value out of it in terms of pinpointing the precise TF ops. Importantly, in the Profiler stats, I didn't see the same ops topping the runtime list as I saw when analysing the general profiling results...

All of my code used for comparing TF 2.3 vs 2.5 runs of various Python examples is in this project dir. The README should help you with everything you need.
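
For whoever picks this up, a rough sketch of the cProfile + snakeviz setup described above; `train_e2ebot()` is a hypothetical stand-in for the actual training entry point (it is not a function in the Rasa codebase):

```python
import cProfile
import pstats


def train_e2ebot() -> None:
    """Placeholder for the actual training call being profiled."""
    ...


profiler = cProfile.Profile()
profiler.enable()
train_e2ebot()
profiler.disable()

# Write the stats to disk for snakeviz (`pip install snakeviz`, then run
# `snakeviz train.prof`), and print the slowest cumulative calls as a quick
# sanity check on the console.
profiler.dump_stats("train.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```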

samsucik (Contributor)

Btw as for the "investigate failed memory leak tests.. these are flakey" task in the issue description, I didn't get to look into it. I did not do memory profiling, only runtime profiling.

Regarding the failing Windows tests (e.g. here), in some cases these are seemingly due to memory errors, but sometimes all we observe is that a worker crashes, so it's not entirely clear. Additionally, there are some Ubuntu tests failing too (though I think those are a bit flaky and may not be related to TF 2.5). In any case, I think memory profiling would help a lot here.
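
As a starting point for that memory profiling, a minimal sketch using the standard-library tracemalloc module; `run_suspect_test()` is a hypothetical placeholder for the body of a failing test, not an existing helper:

```python
import tracemalloc


def run_suspect_test() -> None:
    """Placeholder for the code path suspected of exhausting memory."""
    ...


tracemalloc.start()
run_suspect_test()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Peak usage is the interesting number when a CI worker dies from OOM.
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```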

koernerfelicia (Contributor)

Related issue: #9129. This upgrade blocks our ability to fix it, since gelu was moved to TensorFlow core in 2.4 and above.
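
For context, gelu is available directly from Keras in TF 2.4+; a small illustrative snippet (the addons import shown in the comments is only an example of typical pre-2.4 usage, not a quote from our code):

```python
import tensorflow as tf

# Before TF 2.4, gelu typically came from tensorflow_addons:
#   import tensorflow_addons as tfa
#   activation = tfa.activations.gelu
# From TF 2.4 onwards, the core implementation can be used instead:
activation = tf.keras.activations.gelu

x = tf.constant([-1.0, 0.0, 1.0])
print(activation(x))
```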

koernerfelicia (Contributor)

Dug a little deeper into the model regression test outputs. It looks like this also affects DIET without entity recognition. See here.

koernerfelicia (Contributor)

I'm not sure what we can generalise about the effect on CPU performance. After yesterday's results regarding DIET without entity recognition, I ran some CPU regression tests to see whether this also holds on CPU. See here. I will start the full suite of CPU tests to dig into this further.

ancalita (Member)

On removing @training.enable_multi_worker (this TODO item)

@samsucik Sentry has recently flagged this error, caused by this decorator. Digging a bit through the forum, it seems to affect users who upgrade to TF versions higher than the one currently specified in Rasa's poetry.lock.
Are you able to look into this as part of this issue, or should I create a new issue for this decorator specifically and place it in the Research inbox?

koernerfelicia (Contributor)

@ancalita (Sam is currently out and I'm working on this issue with Daksh.) We'll remove it as part of the upgrade, though this might take a while (I'm about to write an update as my next comment). I think we can restrict users to 2.3.3 and wait until the upgrade, unless this error is urgent, in which case I think we can remove the decorator now and possibly create an issue for someone to look into how to enable distributed training. The latter is out of scope for this project and may be better handled by Engine.

koernerfelicia (Contributor)

The issue affecting our custom CRF layer has been reported to TensorFlow here. As far as we know, there is no workaround; we are waiting to hear otherwise or for progress on the issue.
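
For reference, a minimal sketch of the kind of distilled CRF timing example discussed above, using tfa.text.crf_log_likelihood with assumed shapes (this is not the exact reproduction that was reported upstream):

```python
import time

import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa

# Arbitrary shapes, chosen only to exercise the CRF ops.
batch, seq_len, num_tags = 64, 30, 10
logits = tf.constant(np.random.rand(batch, seq_len, num_tags), dtype=tf.float32)
tags = tf.constant(np.random.randint(0, num_tags, (batch, seq_len)), dtype=tf.int32)
lengths = tf.fill([batch], seq_len)


@tf.function
def crf_loss() -> tf.Tensor:
    log_likelihood, _ = tfa.text.crf_log_likelihood(logits, tags, lengths)
    return -tf.reduce_mean(log_likelihood)


crf_loss()  # trace once so the timing below excludes graph construction
start = time.perf_counter()
for _ in range(100):
    crf_loss()
print(f"TF {tf.__version__}: 100 CRF loss calls took {time.perf_counter() - start:.2f}s")
```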

ancalita (Member)

@koernerfelicia the error is not urgent; it can wait until the upgrade.

koernerfelicia changed the title from "Upgrade to TF 2.5 / Apple M1 compatibility" to "Upgrade to TF 2.6 / Apple M1 compatibility" Sep 17, 2021
koernerfelicia (Contributor)

investigate crashed workers on Windows 3.6 and 3.7 during test_e2e_with_entity_evaluation (test-other-unit-tests)

Crashed tests on Windows seem to be due to OOM (see here for Slack discussion, and here for possible solutions to the general issue of OOM on CI).

koernerfelicia changed the title from "Upgrade to TF 2.6 / Apple M1 compatibility" to "Upgrade to TF 2.6" Oct 1, 2021
koernerfelicia (Contributor)

Tests for which we had to increase timeouts or memory thresholds (once TF addresses the issue above, we should aim to bring these back down):

test_train_persist_load_with_different_settings_non_windows
test_train_persist_load_with_different_settings
test_train_persist_load_with_only_entity_recognition
test_train_persist_load_with_composite_entities
test_lm_featurizer_shape_values_train
test_lm_featurizer_number_of_sub_tokens_process
TestNLULeakManyEpochs

koernerfelicia (Contributor) commented Oct 25, 2021

virtualroot pushed a commit to RasaHQ/rasa-x-helm that referenced this issue Oct 28, 2021
Rasa X 0.42.4 solves CVE-2021-42556
Bumping Rasa OSS to 2.8.12 solves the issues in TensorFlow 2.3
RasaHQ/rasa#7619

This breaks backward compatibility of previously trained models.
It is not possible to load models trained with previous versions of Rasa
Open Source. Please re-train your assistant before trying to use this version.
tczekajlo pushed a commit to RasaHQ/rasa-x-helm that referenced this issue Nov 1, 2021
* other: Bump Rasa X version to 0.42.4

* Release minor version addressing security patches

Rasa X 0.42.4 solves CVE-2021-42556
Bumping Rasa OSS to 2.8.12 solves the issues in TensorFlow 2.3
RasaHQ/rasa#7619

This breaks backward compatibility of previously trained models.
It is not possible to load models trained with previous versions of Rasa
Open Source. Please re-train your assistant before trying to use this version.

* Add links to release notes

* Reduce the amount of links to OSS releases notes

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Alejandro Lazaro <[email protected]>
Co-authored-by: Alejandro Lazaro <[email protected]>
m-vdb (Collaborator) commented Jan 10, 2022

Closing as this has been done as part of #9649

m-vdb closed this as completed Jan 10, 2022