
3.0 architecture revamp/9340/e2e lookup #9405

Merged · 19 commits · Sep 6, 2021

Conversation

@ka-bu (Contributor) commented Aug 20, 2021

Proposed changes:

  • closes [Re-implement featurization] Move to integration branch, adapt Policies, and add checks #9340
  • build the lookup table:
    • lookup table implementation and tests (see rasa.core.featurizers.precomputation)
    • components for preparing and building the lookup table (start and end of the e2e featurization pipeline) (see rasa.core.featurizers.precomputation)
  • using the lookup table:
    • adaptation of SingleStateFeaturizer and TrackerFeaturizer (use the lookup table)
    • adaptation of the unit tests for SingleStateFeaturizer and TrackerFeaturizer, plus a refactoring of the SingleStateFeaturizer unit tests
      • only the SingleStateFeaturizer tests really make use of the lookup table
      • the TrackerFeaturizer tests use a lookup table set to None, which corresponds to using RegexInterpreter() -- exactly matching the tests as they were before (i.e. the tracker featurizer never had tests with a different interpreter)
    • adaptation of TEDPolicy (use the lookup table) and its unit tests
      • the TEDPolicy tests use a lookup table set to None, which corresponds to using RegexInterpreter() -- exactly matching the tests as they were before
  • not the lookup table 😄
    • generalized functionality related to Features and added it to Features (see rasa.shared.nlu.training_data.features)
      • note: this was useful because it made the adaptation of SingleStateFeaturizer and the implementation of the lookup table much cleaner than the existing functionality allowed
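For readers skimming the PR, the core idea of the lookup table can be sketched as a nested mapping from a key attribute (e.g. TEXT, INTENT, ACTION_NAME) and its value to the featurized message, so that policies can look up precomputed features instead of re-running the NLU pipeline per tracker state. The class and method names below are a simplified illustration, not the actual rasa.core.featurizers.precomputation API:

```python
# Simplified sketch of a precomputation lookup table (hypothetical names,
# not the real rasa.core.featurizers.precomputation API).
TEXT, INTENT, ACTION_NAME = "text", "intent", "action_name"
KEY_ATTRIBUTES = (TEXT, INTENT, ACTION_NAME)


class PrecomputationTable:
    def __init__(self) -> None:
        # one sub-table per key attribute: value -> featurized "message" dict
        self._table = {attribute: {} for attribute in KEY_ATTRIBUTES}

    def add(self, message: dict) -> None:
        # each stored message is keyed by exactly one key attribute
        keys = [a for a in KEY_ATTRIBUTES if a in message]
        if len(keys) != 1:
            raise ValueError(f"expected exactly one key attribute, got {keys}")
        key = keys[0]
        self._table[key][message[key]] = message

    def lookup(self, key_attribute: str, value: str):
        return self._table[key_attribute].get(value)


table = PrecomputationTable()
table.add({TEXT: "hello there", "features": ["dense-text-vector"]})
table.add({INTENT: "greet", "features": ["intent-one-hot"]})
print(table.lookup(TEXT, "hello there")["features"])  # ['dense-text-vector']
```

A policy can then fetch features for a state's user text or action name with a dictionary lookup instead of invoking the interpreter.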

Ignore:

  • as always, ignore the "new" modules with leading underscore

Not included (yet):

  • caching (i.e. we don't want to have to run BERT again and again if only the policies change but not the featurization pipeline)
  • tests that compute and use a lookup table on a full bot example (will be covered by regression tests later on; we could add more specific tests of the "end to end" usage, but that requires featurizers etc.)

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@ka-bu ka-bu requested review from JEM-Mosig and wochinge August 23, 2021 12:53
@ka-bu ka-bu marked this pull request as ready for review August 23, 2021 12:53
@ka-bu ka-bu requested a review from a team as a code owner August 23, 2021 12:53
@ka-bu ka-bu requested review from a team and removed request for a team August 23, 2021 12:53
@wochinge (Contributor) left a comment

Wow, what a PR 💯 Thanks for taking this on 🚀 I have to admit that I skipped test_single_state_featurizers and test_features for now. In my opinion, somebody from Research should have a look at these and the related changes in the production code, as they are the code owners for this.

Resolved (outdated) review threads on rasa/core/policies/ted_policy.py (×2)
Review thread on this signature fragment:

        self,
        tracker: DialogueStateTracker,
        domain: Domain,
        interpreter: NaturalLanguageInterpreter,
        precomputations: Optional[CoreFeaturizationPrecomputations] = None,
Contributor:

optional: How about making it a little less vague?

Suggested change:
-    precomputations: Optional[CoreFeaturizationPrecomputations] = None,
+    precomputed_features: Optional[CoreFeaturizationPrecomputations] = None,

@ka-bu (author):

As mentioned before, the precomputations must be messages and not just features, which is why I went with the more generic name.

Contributor:

how about featurized_messages then? 🤔

@ka-bu (author):

tokenized_and_featurized_message ... :D Long term, there could also be classifications in there from more complex recipes? 🤔 And it's not like all features end up in that dictionary, because the SingleStateFeaturizer still creates all the multi-hot-like features, creates sentence features from sequence features, and so on.
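One detail mentioned above, the SingleStateFeaturizer deriving sentence features from sequence features, typically amounts to pooling over the sequence axis. A minimal numpy sketch (the mean-pooling choice here is illustrative, not necessarily what SingleStateFeaturizer actually does):

```python
import numpy as np


def sentence_from_sequence(sequence_features: np.ndarray) -> np.ndarray:
    """Pool a (sequence_length, feature_dim) matrix into a single
    (1, feature_dim) sentence-level feature vector by averaging over tokens."""
    if sequence_features.ndim != 2:
        raise ValueError("expected a (sequence_length, feature_dim) matrix")
    return sequence_features.mean(axis=0, keepdims=True)


seq = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2 tokens, 2 feature dims
print(sentence_from_sequence(seq))  # [[2. 3.]]
```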

Contributor:

okay, makes sense 👍🏻 Should we still stress that this is specific to end-to-end? We can always rename it once it's no longer end-to-end specific.

@ka-bu (author):

The term "end to end" is used in lots of places. Here it was used to point out that the policies work with text directly, right? But we could have, e.g., data with intent names, and then someone decides to use BERT to convert those intent names to dense features. ... Does that make sense, or am I missing something here?

Contributor:

That's true. But in that case we can simply rename this component, no?

@ka-bu (author):

How about MessageContainerForCoreFeaturization ?

Resolved review threads: rasa/core/policies/ted_policy.py, rasa/core/featurizers/tracker_featurizers.py, and tests/core/featurizers/test_precomputation.py (×5, outdated)
Co-authored-by: Tobias Wochinger <[email protected]>
@JEM-Mosig left a comment

Some minor comments. The code changed while I was reviewing, so something might be outdated.

Resolved review threads: rasa/core/featurizers/precomputation.py (×5, outdated) and rasa/shared/nlu/training_data/features.py
@ka-bu ka-bu requested a review from joejuzl August 27, 2021 10:39
@joejuzl (Contributor) left a comment

Reviewed rasa/core/featurizers/precomputation.py and the policy changes.

Really great work! Was a joy to review 💯

Comments are mainly around docstrings and naming.

Resolved (outdated) review threads on rasa/core/featurizers/precomputation.py (×2)
Review thread on this fragment:

        # extract the message
        existing_message = self._table[key_attribute].get(key_value)
        if existing_message is not None:
            if hash(existing_message) != hash(message_with_one_key_attribute):
Contributor:

Out of interest, when could this occur?

@ka-bu (author):

Shouldn't happen at the moment - but could happen if we e.g. some day in the far future merge training data loaded from disk that has been featurized with different featurizers (though I'm not sure the hash really includes knowledge about the actual features there 🤔 )
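The check being discussed can be sketched as: when adding an entry under an existing key, tolerate exact duplicates but reject a genuinely different message. A simplified, hypothetical version (messages are hashable tuples here purely for illustration):

```python
def add_with_collision_check(table: dict, key: str, message: tuple) -> None:
    """Insert message under key; allow exact duplicates, reject conflicts."""
    existing = table.get(key)
    if existing is not None and hash(existing) != hash(message):
        raise ValueError(f"conflicting entries for key {key!r}")
    table[key] = message


table = {}
add_with_collision_check(table, "hello", ("hello", "feat-v1"))
add_with_collision_check(table, "hello", ("hello", "feat-v1"))  # duplicate: fine
try:
    add_with_collision_check(table, "hello", ("hello", "feat-v2"))
except ValueError as e:
    print("rejected:", e)
```

As noted in the thread, whether the hash actually reflects the features depends on how the real Message objects implement hashing.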

Resolved (outdated) review thread on rasa/core/featurizers/precomputation.py

    Args:
        sub_state: substate for which we want to extract the relevant features
        attributes: if not `None`, this specifies the list of the attributes of the
Contributor:

What are the possible values of these attributes? The same as the key attributes?

@ka-bu (author):

These can be arbitrary attributes. If the NLU pipeline has added attributes and features for them, then it is possible that we want to list attributes here which are not key attributes. (Should not happen at the moment -- we add new attributes (TOKENS), but the attribute field in the features should be "TEXT" 🤔 ... Good that you asked! I find this a bit confusing ... Guess we'll definitely add that to the documentation on the featurizers that we've just started here.)

@ka-bu (author):

oh, no actually it is not that confusing because tokens are always stored as TOKEN_NAMES[] and the features then just remember :) .... well, at least it is not confusing as long as we only have a single tokenizer :D
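The per-attribute feature lookup being discussed in this thread can be sketched roughly like this (hypothetical names and table layout; the real method lives in rasa.core.featurizers.precomputation):

```python
def collect_features(table: dict, sub_state: dict, attributes=None) -> dict:
    """For each requested attribute present in the sub-state, look up the
    stored message and return its features, grouped by attribute.

    table: key attribute -> value -> featurized "message" dict (illustrative).
    attributes: if None, fall back to all attributes present in the sub-state.
    """
    attributes = attributes if attributes is not None else list(sub_state)
    features = {}
    for attribute in attributes:
        value = sub_state.get(attribute)
        if value is None:
            continue
        message = table.get(attribute, {}).get(value)
        if message is not None:
            features[attribute] = message.get("features", [])
    return features


table = {"text": {"hi": {"features": ["dense-vec"]}}}
print(collect_features(table, {"text": "hi", "intent": "greet"}))
# {'text': ['dense-vec']}
```

Attributes with no entry in the table (like "intent" above) are simply skipped rather than raising, matching the idea that not every attribute has precomputed features.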

Resolved (outdated) review threads on rasa/core/featurizers/precomputation.py (×2)
Review thread on this fragment:

        )
        container.derive_messages_from_events_and_add(events=all_events)

        # Reminder: in case of complex recipes that train CountVectorizers, we'll have
Contributor:

Should this not be fixed in the CountVectorizer?

@ka-bu (author):

True - Shall I remove that reminder, or shall we keep it until that is fixed there?

Contributor:

Do you know if there is an issue open for it yet? If not we should create one.

@ka-bu (author):

I think @aeshky is working on migrating the count vectorizer -- we can just fix it there (i.e. have the count vectorizer not break when it hasn't been trained, but instead just not add any features to the message, like other components do), no?

Contributor:

Yes - sounds good.
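The behavior agreed on here (an untrained featurizer should not crash, just add no features) could look roughly like this; CountVectorizerSketch and its attributes are an illustrative stand-in, not the real Rasa component:

```python
class CountVectorizerSketch:
    """Illustrative stand-in for a trainable featurizer component."""

    def __init__(self) -> None:
        self.vocabulary = None  # only set during training

    def train(self, texts) -> None:
        self.vocabulary = sorted({tok for text in texts for tok in text.split()})

    def process(self, message: dict) -> dict:
        # untrained: leave the message untouched instead of raising
        if self.vocabulary is None:
            return message
        counts = [message["text"].split().count(tok) for tok in self.vocabulary]
        message.setdefault("features", []).append(counts)
        return message


cv = CountVectorizerSketch()
msg = cv.process({"text": "hello world"})  # untrained: no features, no crash
cv.train(["hello world", "hello there"])
print(cv.process({"text": "hello hello"})["features"])
```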

Resolved (outdated) review thread on rasa/core/featurizers/precomputation.py
@ka-bu ka-bu requested a review from joejuzl September 1, 2021 16:44
@joejuzl (Contributor) left a comment

Looks good to me, once conflicts are dealt with!

@ka-bu ka-bu enabled auto-merge (squash) September 6, 2021 08:26
@ka-bu ka-bu merged commit 19fb70a into main Sep 6, 2021
@ka-bu ka-bu deleted the 3.0-architecture-revamp/9340/e2e-lookup branch September 6, 2021 17:36
aeshky added a commit that referenced this pull request Sep 7, 2021
…https://github.com/RasaHQ/rasa into 3.0-architecture-revamp/9330/NLUTrainingDataProvider

* '3.0-architecture-revamp/9330/NLUTrainingDataProvider' of https://github.com/RasaHQ/rasa:
  3.0 architecture revamp/9340/e2e lookup (#9405)
  Fixed automatic importing of mitie (#9482)
  narrow scopes a bit more
  try with narrower scope to release memory
  Update branch despite failing check runs (#9541)
  Use concurrency group to cancel workflows (#9540)
  Fix team name (#9544)
ErickGiffoni pushed a commit to FGA-GCES/rasa that referenced this pull request Sep 9, 2021
Co-authored-by: Tobias Wochinger <[email protected]>
Co-authored-by: Joe Juzl <[email protected]>
4 participants · merging this pull request closes: [Re-implement featurization] Move to integration branch, adapt Policies, and add checks (#9340)