Investigate rasahq/rasa#7272 #7892

TyDunn · 2021-02-05T12:09:39Z

Investigate #7272

Background: We have the datasets. Tanja gave us some tips. Some code changes might be needed as part of this investigation (commenting out things / if a hack solves it).

Definition of done

Determine how long it takes to load MultiWOZ in huggingface datasets
Determine what needs to be done for "Loading training data takes too long" #7272
Create tickets for the plan to solve "Loading training data takes too long" #7272

TyDunn · 2021-02-09T12:23:45Z

@wochinge I updated the definition of done for this issue. Please let me know if there is anything else you need :)

wochinge · 2021-02-12T09:03:24Z

Resources:

I use the tests from this PR to profile the loading: profile training data loading #7944
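For anyone who wants to reproduce this kind of measurement locally, here is a minimal sketch using the standard library's cProfile. It is not the test code from the PR; `load_training_data` is a hypothetical stand-in for whatever loading call you want to measure, and the output file name is arbitrary.

```python
import cProfile
import pstats


def load_training_data(n: int = 10_000) -> list:
    # Hypothetical stand-in for the real loading call being measured.
    return [str(i) for i in range(n)]


def profile_to_file(func, out_path: str, *args, **kwargs):
    """Run func under cProfile, dump the stats to out_path, return func's result."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    profiler.dump_stats(out_path)
    return result


# Example usage (writes a stats file and prints the ten most expensive calls):
#   profile_to_file(load_training_data, "loading.prof")
#   pstats.Stats("loading.prof").sort_stats("cumulative").print_stats(10)
```

The resulting stats file is in the format that viewers such as snakeviz and tuna read.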

Results:

Loading the multiwoz dataset using Huggingface on my machine takes ~ 1 second
Loading the multiwoz dataset on a 4CPU, 16GiB Gcloud VM using Rasa hasn't terminated yet after > 18 hours
Loading the multiwoz dataset on my machine with skipped yaml validation / no fingerprinting / quick is_story/rule_file implementation takes one hour (profiling file here)
Loading the multiwoz dataset on my machine with skipped yaml validation / no fingerprinting / quick is_story/rule_file implementation and skipping some extra yaml loader options takes 3.6 minutes (profiling file here)
Loading the multiwoz dataset on my machine with skipped yaml validation / with fingerprinting / quick is_story/rule_file implementation and skipping some extra yaml loader options takes 8.3 minutes
Loading the multiwoz dataset on my machine with yaml validation / with fingerprinting / quick is_story/rule_file implementation and skipping some extra yaml loader options takes 18.4 minutes

News so far:

Profiling with the rasa-demo dataset: profiling results here (you can view them using e.g. snakeviz or tuna, which is way less confusing)
Insights:
- Fingerprinting takes a lot of time because we dump all stories to yaml in order to get a hash of the string. This might be fixed as part of this issue
- Schema validation takes a lot of time
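To illustrate the fingerprinting point: instead of serialising everything into one big string first, a hash can be fed story by story. This is only a sketch under the assumption that each story can be reduced to a stable sequence of event strings; `Story` and `fingerprint_stories` are hypothetical names for illustration, not Rasa's API and not necessarily the approach taken in the linked issue.

```python
import hashlib
from typing import Iterable, List, NamedTuple


class Story(NamedTuple):
    # Hypothetical, simplified story: a name plus event descriptions.
    name: str
    events: List[str]


def fingerprint_stories(stories: Iterable[Story]) -> str:
    """Hash stories incrementally instead of dumping them all to one string."""
    digest = hashlib.sha256()
    # Sort (by name, then events) so the result does not depend on load order.
    for story in sorted(stories):
        digest.update(story.name.encode("utf-8"))
        digest.update(b"\x00")  # separator so "ab" + "c" differs from "a" + "bc"
        for event in story.events:
            digest.update(event.encode("utf-8"))
            digest.update(b"\x00")
        digest.update(b"\x01")  # end-of-story marker
    return digest.hexdigest()
```

The design choice here is that memory use stays flat and no YAML writer is involved; whether the event strings are stable enough to serve as a fingerprint is the part that would need checking against the real data structures.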

wochinge · 2021-02-12T13:21:57Z

@TyDunn As stated, I don't think we should compare ourselves to Huggingface because our code does more (e.g. fingerprinting the model). We can of course focus solely on the importing, but then the entire process will still be very slow, so I think it makes sense to focus on the process from reading up to (but excluding) training. Considering that my tests still have not terminated, I would aim for results in the range of minutes rather than seconds (probably everything < 20 min is already a huge win). I'm updating my comment from above throughout the day.

TyDunn · 2021-02-12T13:32:17Z

@wochinge I completely agree. We now know that huggingface loads multiwoz in seconds, while we take hours. I have updated the definition of done in #7272

wochinge · 2021-02-12T15:55:03Z

I brought it down to 220 seconds. It's very hacky as I just commented out some things, but I guess something in this range is realistic

TyDunn · 2021-02-12T15:58:46Z

That's awesome. I would guess that something around that would be good enough

wochinge · 2021-02-12T16:38:10Z

Suggested next steps based on the PR and the comment from above (in order of priority) @RasaHQ/enable-squad

Where we started: Rasa with multiwoz on a gcloud machine with 4 cpus and 16 GiB memory didn't finish loading the training data within 18 hours.

Replace the implementation to look for keys in a yaml file with this (trivial): https://github.com/RasaHQ/rasa/pull/7944/files#r575331363
Only interpolate env variables for config files and not for story files / nlu files or offer option to disable / enable this: https://github.com/RasaHQ/rasa/pull/7944/files#r575332472 (multiwoz on my machine: 18 minutes)
Investigate how to speed up yaml validation. (multiwoz on my machine when dropping the validation: 8 minutes)
- are there quicker libraries?
- can we offer an option to disable the validation?
- (we can re-use the validated content instead of loading it twice; that only saves ~1-2 min though)
Create a more elegant and quicker fingerprint for stories: https://github.com/RasaHQ/rasa/pull/7944/files#r575331766 (multiwoz on my machine when dropping fingerprinting: 3.6 minutes)
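As an illustration of the first item (looking for keys without parsing the whole file), here is a rough sketch that scans lines with a regular expression and stops at the first hit. It is not necessarily what the linked review comment proposes; `is_key_in_yaml_text` is a hypothetical name, and the approach assumes block-style YAML with unquoted top-level keys.

```python
import re
from typing import Iterable


def is_key_in_yaml_text(lines: Iterable[str], *keys: str) -> bool:
    """Return True if any of `keys` appears as a top-level key.

    Scans line by line and stops at the first match, so the content is
    never handed to a YAML parser.
    """
    if not keys:
        return False
    pattern = re.compile(
        r"^(?:%s)\s*:" % "|".join(re.escape(key) for key in keys)
    )
    return any(pattern.match(line) for line in lines)


# Example usage with a file handle, which is iterated lazily:
#   with open("data/stories.yml") as f:
#       has_stories = is_key_in_yaml_text(f, "stories", "rules")
```

Because a file object yields lines lazily, a stories file whose `stories:` key sits near the top is recognised after reading only a few lines.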

I linked some profiling results in the comment above. If you need some more, let me know 🙌🏻

@Ghostvv What do you think?

wochinge · 2021-02-15T11:04:57Z

Added the issues to the comment above. Closing this issue then.

Ghostvv · 2021-02-15T11:08:03Z

@wochinge the times you posted look amazing to me. I think if skipping validation is configurable 8 minutes is good enough, because we can validate once, and then skip validation for subsequent runs on the same data
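The "validate once, then skip" idea could look roughly like the sketch below: remember a hash of every file content that already passed validation and skip the validator when the hash is known. All names here (`validate_once`, the JSON cache file, the `validate` callable) are assumptions for illustration and not an existing Rasa option.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def validate_once(
    data_file: Path, cache_file: Path, validate: Callable[[str], None]
) -> bool:
    """Validate data_file unless its content hash is already cached.

    Returns True if validation ran, False if it was skipped.
    """
    content = data_file.read_text(encoding="utf-8")
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()

    seen = set()
    if cache_file.exists():
        seen = set(json.loads(cache_file.read_text(encoding="utf-8")))
    if content_hash in seen:
        return False

    validate(content)  # expected to raise on invalid content
    seen.add(content_hash)
    cache_file.write_text(json.dumps(sorted(seen)), encoding="utf-8")
    return True
```

Keying the cache on the content hash rather than the file name means any edit to the data triggers validation again.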

wochinge · 2021-02-15T11:09:44Z

@Ghostvv FYI: We are tackling the first two steps as part of this sprint which should bring it down to < 20 minutes. We're having a look at the other items in later sprints.

twerkmeister · 2021-02-24T12:47:14Z

I have some really good news on this topic @wochinge @TyDunn @Ghostvv. It seems yaml parsers are really bad at handling big files, as the operations done for parsing scale quadratically, at least given our current config. I would advise against files larger than 200KB. With a little script to split up the MultiWOZ stories, I was able to reduce the read time from about 30 minutes to about 4-5 minutes. Both numbers are without schema validation. With schema validation I am still at about 30 minutes using the split files. I haven't tested schema validation with the one large file; I would probably wait forever

Here's a small script that I used to split the yaml files into blocks containing roughly 5000 events/turns each:

import os

stories_path = "rasa_multiwoz/stories.yml"
split_stories_path = "rasa_multiwoz/split"
os.makedirs(split_stories_path, exist_ok=True)  # create the output directory if missing

with open(stories_path) as f:
    story_lines = f.readlines()

current_block = 0
header_lines = ['version: "2.0"\n', 'stories:\n']
block_start = 2  # skipping version and stories keys
for i, line in enumerate(story_lines):
    # cut at the first story boundary after each multiple of 5000 lines
    if current_block < i // 5000 and line.startswith("- story:"):
        out_lines = header_lines + story_lines[block_start:i]
        with open(f"{split_stories_path}/stories_block_{current_block}.yml", "w") as out:
            out.writelines(out_lines)
        current_block += 1
        block_start = i
# write last block
out_lines = header_lines + story_lines[block_start:]
with open(f"{split_stories_path}/stories_block_{current_block}.yml", "w") as out:
    out.writelines(out_lines)

Theoretically, you could make the blocks even smaller and load even faster.

Ghostvv · 2021-02-24T12:50:01Z

@twerkmeister thanks, but I wouldn't say it's a scalable solution

twerkmeister · 2021-02-24T13:07:18Z

Could you elaborate on that @Ghostvv ?

Ghostvv · 2021-02-24T13:35:00Z

We create story files for multiwoz automatically, so we can adapt our scripts. However, I don't think it is good from a user's point of view, as they would also have to be aware of that. Is it possible to automate the splitting inside rasa? Or to solve the problem of quadratic scaling?
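Regarding automating the split: a loader could in principle chunk the file in memory and hand each chunk to the YAML parser separately, so users never see the split. The following is only a sketch of the chunking step; `split_story_lines`, its parameters, and the two-line header assumption are hypothetical, and whether this actually avoids the quadratic behaviour would need to be measured.

```python
from typing import Iterator, List


def split_story_lines(
    lines: List[str], max_lines: int = 5000, header_size: int = 2
) -> Iterator[List[str]]:
    """Yield self-contained chunks of a stories file.

    Each chunk repeats the header (assumed to be the first `header_size`
    lines, e.g. `version:` and `stories:`) and is cut only at a line that
    starts a new story, so no story is split across chunks.
    """
    header, body = lines[:header_size], lines[header_size:]
    chunk: List[str] = []
    for line in body:
        if len(chunk) >= max_lines and line.startswith("- story:"):
            yield header + chunk
            chunk = []
        chunk.append(line)
    if chunk:
        yield header + chunk
```

Each yielded chunk is a complete small document on its own, so it could be joined with `"".join(chunk)` and parsed independently of the others.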

twerkmeister · 2021-02-24T13:44:16Z

These are all things we can look into; this was just a side finding of my actual task at hand that I wanted to share to save you guys some time. Especially in combination with #8041 this could get times down to something manageable

Ghostvv · 2021-02-24T13:48:44Z

Ah, thank you, I mistook it for a solution, sorry 🙈. I guess the fact that the issue is closed confused me

TyDunn added priority:high type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Feb 5, 2021

wochinge added effort:atom-squad/4 Label which is used by the Rasa Atom squad to do internal estimation of task sizes. and removed type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. labels Feb 5, 2021

TyDunn assigned wochinge Feb 8, 2021

wochinge mentioned this issue Feb 11, 2021

profile training data loading #7944

Closed

4 tasks

This was referenced Feb 15, 2021

speed up yaml parsing #7953

Closed

implement quick and robust story fingerprinting #7955

Closed

wochinge closed this as completed Feb 15, 2021

This was referenced Feb 15, 2021

Investigate yaml schema validation #7954

Closed

speed up is_key_in_yaml #7952

Closed

wochinge mentioned this issue Feb 17, 2021

load schema files for pykwalify to avoid global yaml usage #7970

Merged

4 tasks

twerkmeister mentioned this issue Feb 24, 2021

Consistent and faster stories fingerprinting #8041

Merged

4 tasks
