3.0 architecture revamp/9340/e2e lookup #9405
Conversation
Wow, what a PR 💯 Thanks for taking this on 🚀 I have to admit that I skipped `test_single_state_featurizers` and `test_features` for now. In my opinion, somebody from Research should have a look at these and at the related changes in the production code, as they are the code owners.
rasa/core/policies/ted_policy.py (outdated):

```python
self,
tracker: DialogueStateTracker,
domain: Domain,
interpreter: NaturalLanguageInterpreter,
precomputations: Optional[CoreFeaturizationPrecomputations] = None,
```
optional: How about making it a little less vague?
```diff
- precomputations: Optional[CoreFeaturizationPrecomputations] = None,
+ precomputed_features: Optional[CoreFeaturizationPrecomputations] = None,
```
As mentioned before, the precomputations must be messages and not just features, which is why I went with the more generic name.
How about `featurized_messages` then? 🤔
`tokenized_and_featurized_message`, then... :D Long term, there could also be classifications in there from more complex recipes? 🤔 And it is not like all features end up in that dictionary, because the `SingleStateFeaturizer` still creates all the multi-hot-like features, creates sentence features from sequence features, and so on.
Okay, makes sense 👍🏻 Should we still stress that this is specific to end-to-end? We can still rename it once it's no longer specific to end-to-end.
The term "end-to-end" is used in lots of places. Here it was used to point out that the policies work with text directly, right? But we could have, e.g., data with intent names, and then someone decides to use BERT to convert those intent names to dense features. Does that make sense, or am I missing something here?
That's true. But in that case we can simply rename this component, no?
How about `MessageContainerForCoreFeaturization`?
Co-authored-by: Tobias Wochinger <[email protected]>
Some minor comments. The code changed while I was reviewing, so something might be outdated.
Reviewed `rasa/core/featurizers/precomputation.py` and the policy changes.
Really great work! Was a joy to review 💯
Comments are mainly around docstrings and naming.
```python
# extract the message
existing_message = self._table[key_attribute].get(key_value)
if existing_message is not None:
    if hash(existing_message) != hash(message_with_one_key_attribute):
```
Out of interest, when could this occur?
It shouldn't happen at the moment - but it could happen if, e.g., some day in the far future we merge training data loaded from disk that has been featurized with different featurizers (though I'm not sure the hash really includes knowledge about the actual features there 🤔).
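The hash check guards against two *different* featurized messages being filed under the same key. A minimal standalone sketch of that idea (hypothetical names and fingerprinting, not the actual Rasa implementation):

```python
from typing import Any, Dict, Tuple


def _fingerprint(message: Dict[str, Any]) -> int:
    """Hash a message by its (sorted) attribute/value pairs."""
    return hash(tuple(sorted(message.items())))


class LookupTable:
    """Stores one message per (key attribute, key value) pair and refuses
    to silently overwrite a *different* message filed under the same key."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def add(self, key_attribute: str, message: Dict[str, Any]) -> None:
        key = (key_attribute, message[key_attribute])
        existing = self._table.get(key)
        if existing is not None:
            if _fingerprint(existing) != _fingerprint(message):
                # e.g. the same text featurized by two different featurizers
                raise ValueError(f"Conflicting messages for key {key!r}")
            return  # exact duplicate: nothing to do
        self._table[key] = message

    def lookup(self, key_attribute: str, key_value: str) -> Dict[str, Any]:
        return self._table[(key_attribute, key_value)]
```

Adding the same message twice is a no-op; adding a message with the same key but different content raises.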
```python
Args:
    sub_state: substate for which we want to extract the relevant features
    attributes: if not `None`, this specifies the list of the attributes of the
```
What are the possible values of these attributes? The same as the key attributes?
These can be arbitrary attributes. If the NLU pipeline has added attributes and features for those attributes, then it is possible that we want to list attributes here which are not key attributes. (This should not happen at the moment - we add new attributes (TOKENS), but the attribute field in the features should be "TEXT" 🤔 ... Good that you asked! I find this a bit confusing ... I guess we'll definitely add that to the documentation on the featurizers that we've just started.)
Oh, no, actually it is not that confusing, because tokens are always stored as TOKEN_NAMES[] and the features then just remember :) ... well, at least it is not confusing as long as we only have a single tokenizer :D
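To illustrate the point about arbitrary (non-key) attributes, here is a sketch of extracting features for a substate; `FEATURE_TABLE`, `collect_features`, and the sample data are all made up for illustration and are not the real Rasa API:

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical store of precomputed features, keyed by (attribute, value).
FEATURE_TABLE: Dict[Tuple[str, str], List[str]] = {
    ("text", "hello there"): ["dense-text-features"],
    ("intent", "greet"): ["sparse-intent-features"],
}


def collect_features(
    sub_state: Dict[str, str], attributes: Optional[List[str]] = None
) -> Dict[str, List[str]]:
    """Collect precomputed features for the given substate.

    If `attributes` is None, all attributes present in the substate are
    used; otherwise only the listed ones. The listed attributes need not
    be key attributes - any attribute the pipeline stored features under
    can be requested here.
    """
    wanted = attributes if attributes is not None else list(sub_state)
    return {
        attribute: FEATURE_TABLE.get((attribute, sub_state[attribute]), [])
        for attribute in wanted
        if attribute in sub_state
    }
```

With `attributes=None` every attribute of the substate is looked up; passing an explicit list restricts the extraction.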
```python
)
container.derive_messages_from_events_and_add(events=all_events)

# Reminder: in case of complex recipes that train CountVectorizers, we'll have
```
Shouldn't this be fixed in the `CountVectorizer`?
True - Shall I remove that reminder, or shall we keep it until that is fixed there?
Do you know if there is an issue open for it yet? If not we should create one.
I think @aeshky is working on migrating the count vectorizer - we can just fix it there (i.e. let the count vectorizer not break when it hasn't been trained, but instead just not add any features to the message, like other components do), no?
Yes - sounds good.
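The behaviour agreed on above - an untrained vectorizer should add no features instead of breaking - could look roughly like this toy sketch (hypothetical class, not the actual `CountVectorizer` code):

```python
from typing import Dict, List, Optional


class SketchCountVectorizer:
    """Toy count vectorizer that degrades gracefully when untrained."""

    def __init__(self) -> None:
        self.vocabulary: Optional[List[str]] = None  # populated by train()

    def train(self, texts: List[str]) -> None:
        # Vocabulary is the sorted set of all whitespace-separated tokens.
        self.vocabulary = sorted({word for text in texts for word in text.split()})

    def process(self, message: Dict) -> Dict:
        # Untrained: do NOT raise - return the message without features,
        # like other components do.
        if self.vocabulary is None:
            return message
        tokens = str(message["text"]).split()
        counts = [tokens.count(word) for word in self.vocabulary]
        return {**message, "features": counts}
```

An untrained instance passes messages through unchanged, so a recipe that never trains the vectorizer does not blow up at inference time.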
Co-authored-by: Joe Juzl <[email protected]>
Looks good to me once the conflicts are dealt with!
Merged branch '3.0-architecture-revamp/9330/NLUTrainingDataProvider' of https://github.com/RasaHQ/rasa, which includes:
- 3.0 architecture revamp/9340/e2e lookup (#9405)
- Fixed automatic importing of mitie (#9482)
- narrow scopes a bit more
- try with narrower scope to release memory
- Update branch despite failing check runs (#9541)
- Use concurrency group to cancel workflows (#9540)
- Fix team name (#9544)
Co-authored-by: Tobias Wochinger <[email protected]> Co-authored-by: Joe Juzl <[email protected]>
Proposed changes:
- new module `rasa.core.featurizers.precomputation` and unit tests for `rasa.core.featurizers.precomputation`
- adaptation of `SingleStateFeaturizer` and `TrackerFeaturizer` (use the lookup table)
- unit tests for `SingleStateFeaturizer` and `TrackerFeaturizer` - and refactoring of the unit tests for `SingleStateFeaturizer`: the tests for `SingleStateFeaturizer` really make use of the lookup table; the tests for `TrackerFeaturizer` use a lookup table set to `None`, which corresponds to usage of `RegexInterpreter()` -- which corresponds exactly to the tests as they were before (i.e. the tracker featurizer tests never included a different interpreter)
- adaptation of `TEDPolicy` (use the lookup table) and its unit tests; the tests for `TEDPolicy` use a lookup table set to `None`, which corresponds to usage of `RegexInterpreter()` -- which corresponds exactly to the tests as they were before
- additions to `Features` (see `rasa.shared.nlu.training_data.features`), which make `SingleStateFeaturizer` and the implementation of the lookup table much cleaner than with existing functionalities

Ignore:

Not included (yet):

Status (please check what you already did):
- updated the documentation
- updated the changelog (please check changelog for instructions)
- reformatted files using `black` (please check Readme for instructions)
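To illustrate the test setup described in the list above (lookup table set to `None` corresponding to on-the-fly, `RegexInterpreter()`-style featurization), here is a hypothetical sketch; `featurize_substate` and the fallback marker are invented for illustration and the real interpreter path works differently:

```python
from typing import Dict, Optional, Tuple


def featurize_substate(
    sub_state: Dict[str, str],
    precomputations: Optional[Dict[Tuple[str, str], str]] = None,
) -> Dict[str, str]:
    """Featurize a substate, preferring precomputed features.

    With `precomputations=None` we fall back to computing features on the
    fly, which mirrors what the pre-revamp tests (implicitly using an
    interpreter-based path) did.
    """
    if precomputations is None:
        # On-the-fly fallback, standing in for interpreter-based featurization.
        return {attr: f"on-the-fly::{value}" for attr, value in sub_state.items()}
    # Lookup-table path: every (attribute, value) pair must be precomputed.
    return {
        attr: precomputations[(attr, value)] for attr, value in sub_state.items()
    }
```

Passing `None` exercises the legacy behaviour; passing a populated table exercises the new lookup path, so both branches stay testable.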