Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate rasahq/rasa#7272 #7892

Closed
1 of 3 tasks
TyDunn opened this issue Feb 5, 2021 · 16 comments
Closed
1 of 3 tasks

Investigate rasahq/rasa#7272 #7892

TyDunn opened this issue Feb 5, 2021 · 16 comments
Assignees
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework effort:atom-squad/4 Label which is used by the Rasa Atom squad to do internal estimation of task sizes.

Comments

@TyDunn
Copy link
Contributor

TyDunn commented Feb 5, 2021

Investigate #7272

Background: We have the datasets. Tanja gave us some tips. Some code changes might be needed as part of this investigation (commenting out things / if a hack solves it).

Definition of done

@TyDunn TyDunn added priority:high type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Feb 5, 2021
@wochinge wochinge added effort:atom-squad/4 Label which is used by the Rasa Atom squad to do internal estimation of task sizes. and removed type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. labels Feb 5, 2021
@TyDunn
Copy link
Contributor Author

TyDunn commented Feb 9, 2021

@wochinge I updated the definition of done for this issue. Please let me know if there is anything else you need :)

@wochinge
Copy link
Contributor

wochinge commented Feb 12, 2021

Resources:

Results:

  • Loading the multiwoz dataset using Huggingface on my machine takes ~ 1 second
  • Loading the multiwoz dataset on a 4CPU, 16GiB Gcloud VM using Rasa hasn't terminated yet after > 18 hours
  • Loading the multiwoz dataset on my machine with skipped yaml validation / no fingerprinting / quick is_story/rule_file implementation takes on hour (profiling file here)
  • Loading the multiwoz dataset on my machine with skipped yaml validation / no fingerprinting / quick is_story/rule_file implementation and skipping some extra yaml loader options takes 3.6 minutes (profiling file here
  • Loading the multiwoz dataset on my machine with skipped yaml validation / with fingerprinting / quick is_story/rule_file implementation and skipping some extra yaml loader options takes 8.3 minutes
  • Loading the multiwoz dataset on my machine with yaml validation / with fingerprinting / quick is_story/rule_file implementation and skipping some extra yaml loader options takes 18.4 minutes

News so far:

  • profiling with the rasa-demo dataset: Profiling results here (you can e.g. view them using snakeviz or using tuna (way less confusing))
  • Insights
    • fingerprinting takes a lot time because we dump all stories to yaml in order to get a hash of the string. This might fixed considering this issue
    • schema validation takes a lot of time

@wochinge
Copy link
Contributor

wochinge commented Feb 12, 2021

@TyDunn As stated I don't we should compare ourselves to Huggingface because our code does more (e.g. like fingerprinting the model). We can of course focus solely on the importing but than the entire process will still be very slow so I think it makes sense to focus upon the process from reading until (excluded) training. Considering that my tests still not terminated I would aim for results in the range of minutes rather than seconds (probably everything < 20min is already a huge win. I'm updating my comment from above throughout the day.

@TyDunn
Copy link
Contributor Author

TyDunn commented Feb 12, 2021

@wochinge I completely agree. Now that we know huggingface loading of multiwoz takes seconds, while we take hours. I have updated the definition of done in #7272

@wochinge
Copy link
Contributor

I brought it down to 220 seconds. It's very hacky as I just commented out some things, but I guess something in this range is realistic

@TyDunn
Copy link
Contributor Author

TyDunn commented Feb 12, 2021

That's awesome. I would guess that something around that would be good enough

@wochinge
Copy link
Contributor

wochinge commented Feb 12, 2021

Suggested next steps based on PR and the comment from above (this is a order of priority) @RasaHQ/enable-squad

Where we started:: Rasa with multiwoz on a gcloud machine with 4 cpus and 16 GiB memory didn't finish loading the training data within 18 hours.

  1. Replace the implementation to look for keys in a yaml file with this (trivial): https://github.com/RasaHQ/rasa/pull/7944/files#r575331363

  2. Only interpolate env variables for config files and not for story files / nlu files or offer option to disable / enable this: https://github.com/RasaHQ/rasa/pull/7944/files#r575332472 (multiwoz on my machine: 18 minutes)

  3. Investigate how to speed up yaml validation. (multiwoz on my machine when dropping the validation: 8 minutes)

    • are there quicker libraries?
    • can we offer an option to disable the validation?
    • (we can re-use the validated content instead of loading it twice - just brings ~1-2)min though)
  4. Create a more elegant and quicker fingerprint for stories: https://github.com/RasaHQ/rasa/pull/7944/files#r575331766 (multiwoz on my machine when dropping fingerprinting: 3.6 minutes)

I linked some profiling results in the comment above. If you need some more, let me know 🙌🏻

@Ghostvv What do you think?

@wochinge
Copy link
Contributor

Added the issues to the comment above. Closing this issue then.

@Ghostvv
Copy link
Contributor

Ghostvv commented Feb 15, 2021

@wochinge the times you posted look amazing to me. I think if skipping validation is configurable 8 minutes is good enough, because we can validate once, and then skip validation for subsequent runs on the same data

@wochinge
Copy link
Contributor

@Ghostvv FYI: We are tackling the first two steps as part of this sprint which should bring it down to < 20 minutes. We're having a look at the other items in later sprints.

@twerkmeister
Copy link
Contributor

twerkmeister commented Feb 24, 2021

I have some really good news for this topic @wochinge @TyDunn @Ghostvv . Seems yaml parsers are really bad at handling big files, as the operations done for parsing scale quadratically, at least given our current config. I would advise against files larger than 200KB. With a little script to split up the multiWoz stories, I was able to reduce the time from about 30 minutes to read to about 4-5 minutes. This is both without schema validation. With schema validation I am still at about 30 minutes using the splitted files. I haven't tested it with schema validation and the one large file, I would probably wait forever

Here's a small script that I used to split the yaml files into blocks containing roughly 5000 events/turns each:

stories_path = "rasa_multiwoz/stories.yml"
split_stories_path = "rasa_multiwoz/split"  # has to exist
with open(stories_path) as f:
    story_lines = f.readlines()

current_block = 0
header_lines = ['version: "2.0"\n', 'stories:\n']
block_start = 2  # skipping version and stories keys
for i, line in enumerate(story_lines):
    if current_block < i // 5000 and line.startswith("- story:"):
        out_lines = header_lines + story_lines[block_start:i]
        with open(f"{split_stories_path}/stories_block_{current_block}.yml", "w") as out:
            out.writelines(out_lines)
        current_block += 1
        block_start = i
# write last block
out_lines = header_lines + story_lines[block_start:]
with open(f"{split_stories_path}/stories_block_{current_block}.yml",
          "w") as out:
    out.writelines(out_lines)

theoretically you could make the blocks even smaller and be even faster.

@Ghostvv
Copy link
Contributor

Ghostvv commented Feb 24, 2021

@twerkmeister thanks, but I wouldn't say it's a scalable solution

@twerkmeister
Copy link
Contributor

Could you elaborate on that @Ghostvv ?

@Ghostvv
Copy link
Contributor

Ghostvv commented Feb 24, 2021

we create story files for multiwoz automatically, we can adopt our scripts, however, I don't think it is good from user point of view, they will also have to be aware of that. Is it possible to maybe automate the splitting inside rasa? Or solve the problem of quadratic scaling?

@twerkmeister
Copy link
Contributor

twerkmeister commented Feb 24, 2021

All things we can look into, this was just a side finding of my actual task at hand that I wanted to share to save you guys some time. Especially in combination with #8041 this could get times down to somewhat managable

@Ghostvv
Copy link
Contributor

Ghostvv commented Feb 24, 2021

ah, thank you, I misunderstood it for solution sorry 🙈. I guess the fact that issue is closed confused me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework effort:atom-squad/4 Label which is used by the Rasa Atom squad to do internal estimation of task sizes.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants