Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent and faster stories fingerprinting #8041

Merged
merged 12 commits into from
Mar 3, 2021

Conversation

twerkmeister
Copy link
Contributor

@twerkmeister twerkmeister commented Feb 24, 2021

Proposed changes:

The trick to the speedup is that it seems that (?our config of?) YAML is extremely bad with serializing / deserializing big objects/strings. Like really bad. The number of operations that pyyaml and ruamel do to read and write yaml from and to strings goes up quadratically with the size. So for the fingerprinting, I avoid writing the yaml object to a string completely and instead use our own object fingerprint method. This finding also has some interesting implications for #7892. Will comment there

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@twerkmeister twerkmeister requested review from a team and tmbo and removed request for a team February 24, 2021 12:39
@twerkmeister twerkmeister mentioned this pull request Feb 24, 2021
3 tasks
@joejuzl joejuzl self-requested a review February 26, 2021 12:37
@tmbo tmbo removed their request for review March 1, 2021 13:56
@twerkmeister
Copy link
Contributor Author

Great comments @joejuzl. Will add a test that makes sure multiple or with same intents in same story will lead to unique checkpoint names.

@twerkmeister
Copy link
Contributor Author

@joejuzl really good that you caught that problem, you were right that it was just the steps from the current checkpoint. So if there were two or statements in the same story with the same intents, occurring after the same number of steps since the last checkpoint, we would get a collision. To prevent that I added the starting checkpoint names as additional info for the id hashing. Now this problem should be solved as different event sequences would always have different starting checkpoints. The first has START_CHECKPOINT, or whatever has been defined by the user, the second will have a generated one, the third will have yet another generated one based on the second generated.

Second, I am using info about the actual events instead of just the event counts. The main reason being that we don't really enforce or even advise that story names should be distinguishable apart from a note in the docs that it makes it easier to debug. So now there can still be a collision if the story name is the same, if the events before the OR statement are exactly the same, if the start checkpoint is the same, and if the involved intents and messages are the same. But in that case I would argue it is actually the same thing and it is not a wrong collision. It will lead to some training data being duplicate but I am not sure it is an actual issue

@tmbo
Copy link
Member

tmbo commented Mar 2, 2021

just saw this in my mail and had a random thought: did you cover the case where the two intents used in an OR are the same but have different entities?

@twerkmeister
Copy link
Contributor Author

@tmbo good point, wasn't aware that it was possible to define entities there, will look into it

@twerkmeister
Copy link
Contributor Author

So the case when two messages in the OR statement have the same intent wouldn't have been problematic. However, there would have been a collision in the case of two stories being exactly the same in every aspect except the entities used for messages in OR statements.

@twerkmeister
Copy link
Contributor Author

@joejuzl does this look good to you?

Copy link
Contributor

@joejuzl joejuzl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@rasabot rasabot merged commit 43f1653 into 2.3.x Mar 3, 2021
@rasabot rasabot deleted the 4612-7955-stories-fingerprinting branch March 3, 2021 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants