-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent and faster stories fingerprinting #8041
Conversation
rasa/shared/core/training_data/story_reader/story_step_builder.py
Outdated
Show resolved
Hide resolved
Great comments @joejuzl. Will add a test that makes sure multiple or with same intents in same story will lead to unique checkpoint names. |
@joejuzl really good that you caught that problem, you were right that it was just the steps from the current checkpoint. So if there were two or statements in the same story with the same intents, occurring after the same number of steps since the last checkpoint, we would get a collision. To prevent that I added the starting checkpoint names as additional info for the id hashing. Now this problem should be solved as different event sequences would always have different starting checkpoints. The first has START_CHECKPOINT, or whatever has been defined by the user, the second will have a generated one, the third will have yet another generated one based on the second generated. Second, I am using info about the actual events instead of just the event counts. The main reason being that we don't really enforce or even advise that story names should be distinguishable apart from a note in the docs that it makes it easier to debug. So now there can still be a collision if the story name is the same, if the events before the OR statement are exactly the same, if the start checkpoint is the same, and if the involved intents and messages are the same. But in that case I would argue it is actually the same thing and it is not a wrong collision. It will lead to some training data being duplicate but I am not sure it is an actual issue |
just saw this in my mail and had a random thought: did you cover the case where the two intents used in an |
@tmbo good point, wasn't aware that it was possible to define entities there, will look into it |
So the case when two messages in the OR statement have the same intent wouldn't have been problematic. However, there would have been a collision in the case of two stories being exactly the same in every aspect except the entities used for messages in OR statements. |
@joejuzl does this look good to you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Proposed changes:
The trick to the speedup is that it seems that (?our config of?) YAML is extremely bad with serializing / deserializing big objects/strings. Like really bad. The number of operations that pyyaml and ruamel do to read and write yaml from and to strings goes up quadratically with the size. So for the fingerprinting, I avoid writing the yaml object to a string completely and instead use our own object fingerprint method. This finding also has some interesting implications for #7892. Will comment there
Status (please check what you already did):
black
(please check Readme for instructions)