Pre-trained embeddings not used as feature for CRFEntityExtractor #8930
This issue is described in two posts on the Rasa forum as well (second one by me).
I'm checking this out right now. My gut feeling is that the LanguageModelTokenizer is meant to handle the byte-pair tokenizer that's inside of Hugging Face. You're using it as a tokenizer for Rasa, so I imagine that's where something goes awry. Note that the LanguageModelTokenizer is also deprecated.
I might also ask: is there a reason why you weren't using DIET?
Correction! I was able to reproduce the issue. These two pipelines yield the same results. Even the confidence values are the same (confirmed via …).
Great, glad you were able to reproduce it. The reason I'm not using DIET is that I want a benchmark of how the NLU pipeline performs with and without fine-tuning a transformer model.
In the meantime, you can turn off the transformer layers inside of DIET. That way you can still get your measurement.
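A minimal sketch of what that could look like in a Rasa 2.x pipeline; `number_of_transformer_layers: 0` is DIET's documented way to disable its transformer layers, and the epochs value here is arbitrary:

```yaml
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  model_name: "roberta"
  model_weights: "roberta-base"
- name: DIETClassifier
  # Setting this to 0 removes the transformer layers, so DIET scores
  # the featurizer output directly -- useful as a no-transformer baseline.
  number_of_transformer_layers: 0
  epochs: 100
```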
This looks like an investigation issue where the definition of done would involve producing a simple example (possibly just the one that @koaning used, once he shares it), identifying the root cause, and creating a follow-up issue to implement and test the fix.
I am not completely sure, but it looks like a documentation issue. As far as I can remember, …
Table on reproducing this issue across Rasa versions, starting from …

Configs used for Rasa: …
Given the reproduction above, this looks like a docs issue.
I had a closer look into the code for … Given this is unused code, we should probably remove it? (@TyDunn might need an updated definition of done if we want to remove the unused code.)
I had another poke at the issue and it is possible to make …

Config 1:

```yaml
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  model_name: "roberta"
  model_weights: "roberta-base"
- name: LexicalSyntacticFeaturizer
  "features": [
    # features for the word preceding the word being evaluated
    [ "suffix2", "prefix2" ],
    # features for the word being evaluated
    [ "BOS", "EOS" ],
    # features for the word following the word being evaluated
    [ "suffix2", "prefix2" ]]
- name: CRFEntityExtractor
  "features": [["text_dense_features"]]
```

Config 2:

```yaml
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LanguageModelFeaturizer
  model_name: "roberta"
  model_weights: "roberta-base"
- name: LexicalSyntacticFeaturizer
  "features": [
    # features for the word preceding the word being evaluated
    [ "suffix2", "prefix2" ],
    # features for the word being evaluated
    [ "BOS", "EOS" ],
    # features for the word following the word being evaluated
    [ "suffix2", "prefix2" ]]
- name: CRFEntityExtractor
```

Config 3:

```yaml
language: en
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
  "features": [
    # features for the word preceding the word being evaluated
    [ "suffix2", "prefix2" ],
    # features for the word being evaluated
    [ "BOS", "EOS" ],
    # features for the word following the word being evaluated
    [ "suffix2", "prefix2" ]]
- name: CRFEntityExtractor
  "features": [["text_dense_features"]]
```
@tttthomasssss I am sure it's accidental that the documentation lacks information on how to use dense features. We should add it if it's not already there.
Merged with #9572. |
Rasa version: 2.7.1
Rasa SDK version (if used & relevant): 2.7.0
Rasa X version (if used & relevant):
Python version: 3.8.8
Operating system (windows, osx, ...): Windows-10-10.0.19041-SP0
Issue:
In the docs for the CRFEntityExtractor component, it says: …
However, I get identical results when using different language models, or even no language model at all. I'm using Rasa NLU only, for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not being passed on to the CRFEntityExtractor, even though LanguageModelFeaturizer generates dense features and no warnings indicate that the pre-trained embeddings are dropped.
For example, when training a CRFEntityExtractor with Config 1, 2, or 3 on the same train data and testing on the same test set, I get identical precision/recall/F1 results.
Error (including full traceback):
Command or request that led to error:
Content of configuration file (config.yml) (if relevant):
Config 1, Config 2, and Config 3, as shown above.
Content of domain file (domain.yml) (if relevant):
Content of train data
I am just using a few utterances from the SNIPS dataset. Here's a small example of my train data.
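The snippet itself is not preserved in this thread; a hypothetical SNIPS-style example in the Rasa 2.x training-data format (intent and entity names here are illustrative, not the author's actual data) would look like:

```yaml
version: "2.0"
nlu:
- intent: BookRestaurant
  examples: |
    - book a table at [Kong Kitchen](restaurant_name) for [four](party_size_number)
    - I'd like a reservation in [Atlanta](city) for [tonight](timeRange)
- intent: GetWeather
  examples: |
    - what's the weather in [Berlin](city) [tomorrow](timeRange)
```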
Definition of done

- … and add warnings
- Otherwise, create another issue for addressing this bug