Investigate rasahq/rasa#7272 #7892
Comments
@wochinge I updated the definition of done for this issue. Please let me know if there is anything else you need :) |
Resources:
Results:
News so far:
|
@TyDunn As stated, I don't think we should compare ourselves to Huggingface because our code does more (e.g. fingerprinting the model). We can of course focus solely on the importing, but then the entire process will still be very slow, so I think it makes sense to focus on the process from reading until (excluding) training. Considering that my tests still haven't terminated, I would aim for results in the range of minutes rather than seconds (probably everything < 20 min is already a huge win). I'm updating my comment from above throughout the day. |
I brought it down to 220 seconds. It's very hacky as I just commented out some things, but I guess something in this range is realistic |
That's awesome. I would guess that something around that would be good enough |
Suggested next steps based on the PR and the comment above (in order of priority) @RasaHQ/enable-squad Where we started: Rasa with multiwoz on a gcloud machine with 4 CPUs and 16 GiB memory didn't finish loading the training data within 18 hours.
I linked some profiling results in the comment above. If you need some more, let me know 🙌🏻 @Ghostvv What do you think? |
Added the issues to the comment above. Closing this issue then. |
@wochinge the times you posted look amazing to me. I think if skipping validation is configurable, 8 minutes is good enough, because we can validate once and then skip validation for subsequent runs on the same data |
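A minimal sketch of the "validate once, then skip" idea, assuming a content hash of the data file is stored in a small JSON cache. The helper names (`file_fingerprint`, `needs_validation`, `mark_validated`) and the cache file are hypothetical and not part of Rasa's API; this only illustrates how a caller could decide whether validation can be skipped.

```python
import hashlib
import json
from pathlib import Path


def file_fingerprint(path: Path) -> str:
    """Return a SHA-256 hex digest of the file contents, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_validation(data_path: Path, cache_path: Path) -> bool:
    """True unless this exact file content was already marked as validated."""
    if not cache_path.exists():
        return True
    cache = json.loads(cache_path.read_text())
    return cache.get(str(data_path)) != file_fingerprint(data_path)


def mark_validated(data_path: Path, cache_path: Path) -> None:
    """Record the current fingerprint so later runs can skip validation."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    cache[str(data_path)] = file_fingerprint(data_path)
    cache_path.write_text(json.dumps(cache))
```

A caller would run the (slow) schema validation only when `needs_validation(...)` returns `True` and call `mark_validated(...)` after it succeeds; any edit to the file changes the hash and triggers validation again.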
@Ghostvv FYI: We are tackling the first two steps as part of this sprint which should bring it down to < 20 minutes. We're having a look at the other items in later sprints. |
I have some really good news on this topic @wochinge @TyDunn @Ghostvv. It seems YAML parsers are really bad at handling big files, as the operations done for parsing scale quadratically, at least given our current config. I would advise against files larger than 200KB. With a little script to split up the multiWoz stories, I was able to reduce the read time from about 30 minutes to about 4-5 minutes. Both numbers are without schema validation. With schema validation I am still at about 30 minutes using the split files. I haven't tested schema validation with the one large file; I would probably wait forever. Here's a small script that I used to split the YAML files into blocks containing roughly 5000 events/turns each:

stories_path = "rasa_multiwoz/stories.yml"
split_stories_path = "rasa_multiwoz/split"  # has to exist

with open(stories_path) as f:
    story_lines = f.readlines()

current_block = 0
header_lines = ['version: "2.0"\n', 'stories:\n']
block_start = 2  # skipping version and stories keys

for i, line in enumerate(story_lines):
    # close the current block at the first story boundary after every ~5000 lines
    if current_block < i // 5000 and line.startswith("- story:"):
        out_lines = header_lines + story_lines[block_start:i]
        with open(f"{split_stories_path}/stories_block_{current_block}.yml", "w") as out:
            out.writelines(out_lines)
        current_block += 1
        block_start = i

# write last block
out_lines = header_lines + story_lines[block_start:]
with open(f"{split_stories_path}/stories_block_{current_block}.yml", "w") as out:
    out.writelines(out_lines)

Theoretically you could make the blocks even smaller and be even faster. |
@twerkmeister thanks, but I wouldn't say it's a scalable solution |
Could you elaborate on that @Ghostvv ? |
We create the story files for multiwoz automatically, so we can adapt our scripts. However, I don't think it is good from a user's point of view, since they would also have to be aware of that. Is it possible to maybe automate the splitting inside rasa? Or solve the problem of quadratic scaling? |
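One way automated splitting could look, as a rough sketch only: chunk the story text in memory at story boundaries and hand each chunk to the parser separately, so users never have to split files themselves. The function `iter_story_chunks` is hypothetical and not an existing Rasa function; it assumes the same file layout as the script above (two header lines, then top-level entries starting with `- story:`).

```python
from typing import Iterator, List


def iter_story_chunks(text: str, max_lines: int = 5000) -> Iterator[str]:
    """Yield self-contained YAML snippets of roughly `max_lines` lines each.

    Assumes the first two lines are the `version:` and `stories:` keys and
    that every story starts with a line beginning with "- story:".
    """
    lines = text.splitlines(keepends=True)
    header, body = lines[:2], lines[2:]
    chunk: List[str] = []
    for line in body:
        # only cut at a story boundary, so no story is split across chunks
        if len(chunk) >= max_lines and line.startswith("- story:"):
            yield "".join(header + chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(header + chunk)
```

Each yielded string could then be passed to whatever YAML reader is in use, and the resulting stories concatenated. This only works around the quadratic behaviour by keeping inputs small; it does not fix it in the parser.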
These are all things we can look into; this was just a side finding of my actual task at hand that I wanted to share to save you guys some time. Especially in combination with #8041, this could get times down to something manageable |
ah, thank you, I mistook it for the solution, sorry 🙈. I guess the fact that the issue is closed confused me |
Investigate #7272
Background: We have the datasets. Tanja gave us some tips. Some code changes might be needed as part of this investigation (commenting out things / if a hack solves it).
Definition of done