Upgrade to TF 2.6 #7619

Closed
16 of 19 tasks
dakshvar22 opened this issue Dec 21, 2020 · 43 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework
area:rasa-oss/ml 👁 All issues related to machine learning
type:dependencies Pull requests that update a dependency file

Comments

dakshvar22 (Contributor) commented Dec 21, 2020

TF 2.4 is out, so we should update the dependency inside Rasa Open Source to use it.
It introduces some breaking changes, as mentioned in the changelog.

Update: we are now targeting 2.6; see here.

Note that there's a draft PR here that already contains some of the necessary changes. The known remaining tasks for actually updating the code to TF 2.5 are listed below (this list contains only the known remaining tasks; there may be others, so it is not exhaustive):

Temporarily de-prioritised followups:

Definition of done
No memory tests failing
No increased timeouts for tests
Model regression tests passing with approximately the same time and accuracy

dakshvar22 added the area:rasa-oss/ml 👁 label Dec 21, 2020
eikooc commented Dec 23, 2020

Rasa does not work on the new M1 chip until this dependency is upgraded. Just thought I would leave that as a note here.

Ghostvv self-assigned this Jan 14, 2021
Ghostvv (Contributor) commented Jan 14, 2021

Blocked until a new version is released with this fix: https://github.com/tensorflow/tensorflow/pull/45534/files#diff-39eadfc771b57ef11bff67b85cc38a47b9c13ec203b145e3feaf556a43bdccb1R4762. Update: no, it doesn't seem to be the problem.

Ghostvv (Contributor) commented Jan 14, 2021

Pushed the necessary code updates to the tf-2.4 branch.

Ghostvv (Contributor) commented Jan 18, 2021

Blocked by tensorflow/tensorflow#46511.

Ghostvv removed their assignment Jan 18, 2021
koernerfelicia (Contributor)

Once this is done, we should check this: #7762

Ghostvv (Contributor) commented Jan 22, 2021

Most probably we will have to wait until 2.5.

alwx added the area:rasa-oss 🎡 and type:dependencies labels Jan 27, 2021
koernerfelicia (Contributor)

Related: #7793

luzhongqiu

Same issue here; waiting for a TF version above 2.4. Thanks.

rmiaouh commented Feb 17, 2021

Hi, same issue here. We will also need the TF 2.4 version. Thank you all.

koernerfelicia (Contributor)

Once this is done, we should revisit this and see if it has been fixed:

#8004

Ghostvv (Contributor) commented Mar 9, 2021

We cannot upgrade to 2.4; we need to wait until 2.5.

koernerfelicia (Contributor)

Yeah, sorry, I was being unclear. At this point, when I say "done" I mean whatever TensorFlow upgrade we eventually manage.

joejuzl mentioned this issue Apr 9, 2021
wochinge changed the title from "Upgrade to TF 2.4" to "Upgrade to TF 2.4 / Apple M1 compatibility" Apr 19, 2021

tmbo (Member) commented Jun 14, 2021

@TyDunn is this one still blocked? Given that we'll need to buy M1s as well, it would be great to sort this out 😅

koernerfelicia (Contributor)

@tmbo I'm trying to verify this atm, will update when I know more!

marjoripomarole

I am available to help migrate this over and test it. Let me know if there is any ongoing work and how I can help speed it up. I have an M1 on hand and the desire to get Rasa working on it ASAP :)

koernerfelicia (Contributor)

@mpomarole thank you for the kind offer! We'll let you know if we need any support :)

koernerfelicia changed the title from "Upgrade to TF 2.4 / Apple M1 compatibility" to "Upgrade to TF 2.5 / Apple M1 compatibility" Jun 17, 2021
dakshvar22 (Contributor, Author)

@samsucik Do we have a small TensorFlow code snippet with a very basic model and dummy data that could demonstrate this increase in training time? I think it would be worthwhile to ask the TensorFlow folks whether we are doing something wrong or whether the increase is indeed expected.

Also, is the increase in time observable on both CPU and GPU?
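
For illustration, a minimal sketch of what such a snippet could look like; the model architecture, data shapes, and hyperparameters here are arbitrary placeholders (not taken from Rasa's code), and the idea is simply to run the same script under TF 2.3 and TF 2.5/2.6 and compare the reported wall-clock time:

```python
import time

import numpy as np
import tensorflow as tf

# Dummy classification data: sizes are arbitrary, chosen only to make the
# timing difference between TF versions measurable.
x = np.random.rand(10_000, 100).astype(np.float32)
y = np.random.randint(0, 5, size=(10_000,))

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(5),
    ]
)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

start = time.perf_counter()
model.fit(x, y, batch_size=64, epochs=5, verbose=0)
print(f"TF {tf.__version__}: training took {time.perf_counter() - start:.1f}s")
```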

samsucik (Contributor) commented Jul 21, 2021

@dakshvar22 I'm coming back to this after almost a week and my previous comments basically summarise where I had left off. What you're describing is the state I'd like to reach asap 😉

Also, is the increase in time observable on both CPU and GPU?

Yes. I saw the regression tests taking roughly 3x to 5x longer (on GPU runners), and I saw the same or similar slowdown with the e2ebot trained locally on a CPU.

samsucik (Contributor)

Runtimes & failing model regression tests

To complete my previous update on model regression tests, I've re-run the tests on the Hermit dataset, and the tests involving the transformer version of DIET (i.e. DIET(seq)) all failed again, in the same way as before. It looks as if the failed runs were killed because they were taking too long -- they got killed after ~2 hours, whereas they normally take only around 15 minutes (see the most recent scheduled run). This needs further investigation -- it might be due to something hanging, though I wouldn't necessarily link it to the hanging we observed previously.

All in all, with TF 2.5, our regression test runs take ~5x longer and sometimes even get killed in the process (TF 2.5 run vs scheduled run).

samsucik (Contributor)

Overall status update & handover notes

My work is in the tf-2.5 branch and its associated PR. The old tf-2.4 branch and its associated PR are now behind and abandoned, though the regression tests were run on the old PR (the new one didn't exist yet).

Having observed the regression tests taking so long, I profiled the training of our small e2ebot and took it from there (since this training was also taking much longer with TF 2.5). Using snakeviz, I noticed that a number of ops were taking particularly long, especially IfGrad and WhileGrad, and the various TF CRF methods such as crf_log_norm or crf_log_likelihood. However, this later turned out to be slightly misleading, because the high runtime isn't really due to something like IfGrad itself, but potentially to some of its sub-calls.

Still, seeing the CRF methods taking so long, I returned to the distilled example of our CRF layer used here, adapted it, and used it for further investigation. Indeed, I saw the CRF methods still taking longer with TF 2.5, but I found it hard to pinpoint the precise issue (whether by searching through others' complaints on the internet, by digging into the code, or by using the profiling tools). While I managed to make TF Profiler work with both the e2ebot-training example and with the CRF one, I'm quite new to the tool and felt like I wasn't getting much value out of it in terms of pinpointing the precise TF ops. Importantly, in the Profiler stats, I didn't see the same ops topping the runtime list as I saw when analysing the general profiling results...

All of my code used for comparing TF 2.3 vs 2.5 runs of various Python examples is in this project dir. The README should help you with everything you need.
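
For whoever picks this up, a rough sketch of the cProfile + snakeviz setup described above; `train_e2ebot()` is a hypothetical stand-in for the actual training entry point (it is not a function in the Rasa codebase):

```python
import cProfile
import pstats


def train_e2ebot() -> None:
    """Placeholder for the actual training call being profiled."""
    ...


profiler = cProfile.Profile()
profiler.enable()
train_e2ebot()
profiler.disable()

# Write the stats to disk for snakeviz (`pip install snakeviz`, then run
# `snakeviz train.prof`), and print the slowest cumulative calls as a quick
# sanity check on the console.
profiler.dump_stats("train.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```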

samsucik (Contributor)

Btw as for the "investigate failed memory leak tests.. these are flakey" task in the issue description, I didn't get to look into it. I did not do memory profiling, only runtime profiling.

Regarding the failing Windows tests (e.g. here), in some cases these are seemingly due to memory errors, but sometimes all we observe is that a worker crashes, so it's not entirely clear. Additionally, there are some Ubuntu tests failing too (though I think those are a bit flaky and may not be related to TF 2.5). In any case, I think memory profiling would help a lot here.
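
As a starting point for that memory profiling, a minimal sketch using the standard-library tracemalloc module; `run_suspect_test()` is a hypothetical placeholder for the body of a failing test, not an existing helper:

```python
import tracemalloc


def run_suspect_test() -> None:
    """Placeholder for the code path suspected of exhausting memory."""
    ...


tracemalloc.start()
run_suspect_test()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Peak usage is the interesting number when a CI worker dies from OOM.
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```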

koernerfelicia (Contributor)

Related issue: #9129. This upgrade blocks our ability to fix it, since gelu was moved to TensorFlow core in 2.4 and above.
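
For context, gelu is available directly from Keras in TF 2.4+; a small illustrative snippet (the addons import shown in the comments is only an example of typical pre-2.4 usage, not a quote from our code):

```python
import tensorflow as tf

# Before TF 2.4, gelu typically came from tensorflow_addons:
#   import tensorflow_addons as tfa
#   activation = tfa.activations.gelu
# From TF 2.4 onwards, the core implementation can be used instead:
activation = tf.keras.activations.gelu

x = tf.constant([-1.0, 0.0, 1.0])
print(activation(x))
```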

koernerfelicia (Contributor)

Dug a little deeper into the model regression test outputs. It looks like this also affects DIET without entity recognition. See here.

koernerfelicia (Contributor)

I'm not sure what we can generalise about the effect on CPU performance. After yesterday's results regarding DIET without entity recognition, I ran some CPU regression tests to see whether this also holds on CPU. See here. I will start the full suite of CPU tests to dig into this further.

ancalita (Member)

On removing @training.enable_multi_worker (this TODO item)

@samsucik Sentry has recently flagged this error, caused by this decorator. Digging a bit through the forum, it seems to affect users who upgrade to TF versions higher than the one currently specified in Rasa's poetry.lock.
Are you able to look into this as part of this issue, or should I create a new issue for this decorator specifically and place it in the Research inbox?

koernerfelicia (Contributor)

@ancalita (Sam is currently out and I'm working on this issue with Daksh.) We'll remove it as part of the upgrade, though this might take a while (I'm about to write an update as my next comment). I think we can restrict users to 2.3.3 and wait until the upgrade, unless this error is urgent, in which case I think we can remove the decorator now and possibly create an issue for someone to look into how to enable distributed training. The latter is out of scope for this project and may be better handled by Engine.

koernerfelicia (Contributor)

The issue affecting our custom CRF layer has been reported to TensorFlow here. As far as we know, there is no workaround; we are waiting to hear otherwise or for progress on the issue.
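
For reference, a minimal sketch of the kind of distilled CRF timing example discussed above, using tfa.text.crf_log_likelihood with assumed shapes (this is not the exact reproduction that was reported upstream):

```python
import time

import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa

# Arbitrary shapes, chosen only to exercise the CRF ops.
batch, seq_len, num_tags = 64, 30, 10
logits = tf.constant(np.random.rand(batch, seq_len, num_tags), dtype=tf.float32)
tags = tf.constant(np.random.randint(0, num_tags, (batch, seq_len)), dtype=tf.int32)
lengths = tf.fill([batch], seq_len)


@tf.function
def crf_loss() -> tf.Tensor:
    log_likelihood, _ = tfa.text.crf_log_likelihood(logits, tags, lengths)
    return -tf.reduce_mean(log_likelihood)


crf_loss()  # trace once so the timing below excludes graph construction
start = time.perf_counter()
for _ in range(100):
    crf_loss()
print(f"TF {tf.__version__}: 100 CRF loss calls took {time.perf_counter() - start:.2f}s")
```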

ancalita (Member)

@koernerfelicia the error is not urgent; it can wait until the upgrade.

koernerfelicia changed the title from "Upgrade to TF 2.5 / Apple M1 compatibility" to "Upgrade to TF 2.6 / Apple M1 compatibility" Sep 17, 2021
koernerfelicia (Contributor)

investigate crashed workers on Windows 3.6 and 3.7 during test_e2e_with_entity_evaluation (test-other-unit-tests)

Crashed tests on Windows seem to be due to OOM (see here for Slack discussion, and here for possible solutions to the general issue of OOM on CI).

koernerfelicia changed the title from "Upgrade to TF 2.6 / Apple M1 compatibility" to "Upgrade to TF 2.6" Oct 1, 2021
koernerfelicia (Contributor)

Tests for which we had to increase timeouts or memory thresholds (once TF addresses the issue above, we should aim to bring these back down):

test_train_persist_load_with_different_settings_non_windows
test_train_persist_load_with_different_settings
test_train_persist_load_with_only_entity_recognition
test_train_persist_load_with_composite_entities
test_lm_featurizer_shape_values_train
test_lm_featurizer_number_of_sub_tokens_process
TestNLULeakManyEpochs

koernerfelicia (Contributor) commented Oct 25, 2021

virtualroot pushed a commit to RasaHQ/rasa-x-helm that referenced this issue Oct 28, 2021
Rasa X 0.42.4 solves CVE-2021-42556
Bumping Rasa OSS to 2.8.12 solves the issues in TensorFlow 2.3
RasaHQ/rasa#7619

This breaks backward compatibility of previously trained models.
It is not possible to load models trained with previous versions of Rasa
Open Source. Please re-train your assistant before trying to use this version.
tczekajlo pushed a commit to RasaHQ/rasa-x-helm that referenced this issue Nov 1, 2021
* other: Bump Rasa X version to 0.42.4

* Release minor version addressing security patches

Rasa X 0.42.4 solves CVE-2021-42556
Bumping Rasa OSS to 2.8.12 solves the issues in TensorFlow 2.3
RasaHQ/rasa#7619

This breaks backward compatibility of previously trained models.
It is not possible to load models trained with previous versions of Rasa
Open Source. Please re-train your assistant before trying to use this version.

* Add links to release notes

* Reduce the amount of links to OSS releases notes

Co-authored-by: github-actions <[email protected]>
Co-authored-by: Alejandro Lazaro <[email protected]>
Co-authored-by: Alejandro Lazaro <[email protected]>
m-vdb (Collaborator) commented Jan 10, 2022

Closing as this has been done as part of #9649

m-vdb closed this as completed Jan 10, 2022