Initial implementation of converter for training data files #6404
It's working nicely!
Two things I think we still need:
- Docstrings on the public methods/functions.
- Maybe one or two tests that run assertions on the contents of the converted files. But since the writers themselves are already tested maybe this isn't necessary.
(I've also added other comments)
Does it make sense that it's `rasa data convert`?
@wochinge Would there be other `rasa data` commands?
there is already |
rasa/cli/convert.py
Outdated
for file in os.listdir(training_data_path):
    source_path = Path(training_data_path) / file
    output_path = Path(output) / f"{source_path.stem}{CONVERTED_FILE_POSTFIX}"
Does it make sense to use `os.walk` (or even to completely reuse some parts of `data.get_core_nlu_files`)?
I chose `os.listdir()` over `os.walk()` to avoid possible confusion for users who keep their files in different subfolders and want to experiment with the files one by one.
But I might be wrong!
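To make the trade-off concrete, here is a minimal sketch (not the PR's code; the directory layout, file names, and helper names are made up for illustration) of how an `os.listdir`-based loop only sees the top level, while `os.walk` also descends into subfolders:

```python
import os
import tempfile
from pathlib import Path


def list_top_level(directory: str) -> list:
    """Return files directly inside `directory` (what an os.listdir loop sees)."""
    return sorted(
        name
        for name in os.listdir(directory)
        if (Path(directory) / name).is_file()
    )


def list_recursive(directory: str) -> list:
    """Return files in `directory` and all subfolders, relative to `directory`."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            found.append(Path(root, name).relative_to(directory).as_posix())
    return sorted(found)


# Hypothetical layout: one file at the top level, one in a subfolder.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "nlu.md").write_text("## intent:greet\n- hi\n")
Path(demo_dir, "experiments").mkdir()
Path(demo_dir, "experiments", "stories.md").write_text("## happy path\n* greet\n")

top_level = list_top_level(demo_dir)  # only the top-level file
recursive = list_recursive(demo_dir)  # also the file under experiments/
```

With `os.listdir`, files parked in a subfolder are left alone unless the user points the command at that subfolder explicitly.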
@wochinge @federicotdn The fun part is that we already have
Looks like we actually need to reuse it here. What about
Ok, actually even like this:
And add a note that we currently convert to YAML only from MD.
Ok, another idea:
Then we're super consistent.
Assuming we might also want to migrate changes to the domain / configuration, would it make sense to use something more general, e.g.
I like the idea! Should we perhaps adopt it once we actually migrate something? Currently it's purely about converting training data from one format to another. That would actually be a great tool to migrate the whole project and make sure it's 2.0 compatible.
changelog/6404.feature.md
Outdated
@@ -0,0 +1 @@
User can use ``rasa data convert {nlu|core} -f yaml`` command to convert training data from Markdown format to YAML format.
what do I need to do to make this support nlg (responses) as well?
`NLGMarkdownReader::reads` returns `TrainingData`, just as `MarkdownReader` does, so it's mostly about:
- writing tests for `RasaYamlWriter` to make sure the conversion is correct
- extending the "convert_to_yaml" function from this review to support `NLGMarkdownReader`
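As a rough illustration of why a shared return type keeps this extension small, here is a toy sketch. Every name in it (`ToyTrainingData`, `ToyNluReader`, `ToyNlgReader`, `toy_convert_to_yaml`) is hypothetical and heavily simplified; none of it is Rasa's actual reader or writer API, and the Markdown layout it parses is a simplified assumption:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyTrainingData:
    """Simplified stand-in for the shared return type (hypothetical)."""
    intents: Dict[str, List[str]] = field(default_factory=dict)
    responses: Dict[str, List[str]] = field(default_factory=dict)


def _parse_sections(text: str, header_prefix: str) -> Dict[str, List[str]]:
    """Collect '- item' lines under each header that starts with `header_prefix`."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(header_prefix):
            current = stripped[len(header_prefix):].strip()
            sections[current] = []
        elif stripped.startswith("- ") and current is not None:
            sections[current].append(stripped[2:])
    return sections


class ToyNluReader:
    def reads(self, text: str) -> ToyTrainingData:
        return ToyTrainingData(intents=_parse_sections(text, "## intent:"))


class ToyNlgReader:
    def reads(self, text: str) -> ToyTrainingData:
        return ToyTrainingData(responses=_parse_sections(text, "* "))


def toy_convert_to_yaml(reader, text: str) -> str:
    """Works with any reader whose `reads` returns ToyTrainingData."""
    data = reader.reads(text)
    lines: List[str] = []
    for key, block in (("nlu", data.intents), ("responses", data.responses)):
        if not block:
            continue
        lines.append(f"{key}:")
        for name, examples in block.items():
            lines.append(f"  {name}:")
            lines.extend(f"  - {example}" for example in examples)
    return "\n".join(lines) + "\n"


nlu_yaml = toy_convert_to_yaml(ToyNluReader(), "## intent:greet\n- hello\n")
nlg_yaml = toy_convert_to_yaml(
    ToyNlgReader(), "* chitchat/ask_name\n  - my name is Sara\n"
)
```

Because the converter depends only on the returned data object, supporting a second reader in this sketch means passing a different reader instance, not writing a second converter.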
Yes, that is a good point; let's separate that and allow data migration as a separate command 👍 In any case, I think it already makes sense to add some instructions to the documentation about how to use the data migration: https://rasa.com/docs/rasa/next/migration-guide#rasa-110-to-rasa-20 (this should be a new section: what does the user need to do?)
That's true! I'm just waiting for this review to be approved so I know what the final CLI syntax looks like 😄
Looks good!! Stuff we might want to add in a future PR: docstrings for the public methods/functions, and deeper testing of the conversion results (file contents).
rasa/cli/data.py
Outdated
if MarkdownReader.is_markdown_nlu_file(source_path):
    if not is_nlu:
        continue
    _write_nlu_yaml(source_path, output_path, source_path)
    num_of_files_converted += 1
elif not is_nlu and MarkdownStoryReader.is_markdown_story_file(source_path):
    _write_core_yaml(source_path, output_path, source_path)
    num_of_files_converted += 1
else:
    print_warning(f"Skipped file '{source_path}'")
The warning does not show when running `rasa convert core` while iterating over NLU files. Maybe the `if` structure can be changed like this:
if is_nlu and MarkdownReader.is_markdown_nlu_file(source_path):
    _write_nlu_yaml(source_path, output_path, source_path)
    num_of_files_converted += 1
    continue
if not is_nlu and MarkdownStoryReader.is_markdown_story_file(source_path):
    _write_core_yaml(source_path, output_path, source_path)
    num_of_files_converted += 1
    continue
print_warning(f"Skipped file '{source_path}'")
It's a little bit trickier. `MarkdownStoryReader.is_markdown_story_file` returns `true` for NLU files 🤦
So our `rasa train` and other commands work only because we first check whether a file is NLU before checking whether it is Core.
So this `if` condition is hacky but correct.
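A small sketch of why the order of the checks matters. The two predicates below are hypothetical stand-ins, not Rasa's real detection rules; the only assumption carried over from the discussion is that the story check also accepts NLU files:

```python
def looks_like_nlu(text: str) -> bool:
    """Hypothetical strict check: NLU sections start with a typed header."""
    prefixes = ("## intent:", "## synonym:", "## regex:", "## lookup:")
    return any(line.startswith(prefixes) for line in text.splitlines())


def looks_like_story(text: str) -> bool:
    """Hypothetical loose check: any '## ' header counts, so NLU files match too."""
    return any(line.startswith("## ") for line in text.splitlines())


def original_structure(text: str, is_nlu: bool) -> str:
    """Mirrors the if/elif/else under review: the NLU check always runs first."""
    if looks_like_nlu(text):
        if not is_nlu:
            return "skip"  # NLU file while converting Core: silently skipped
        return "convert_nlu"
    elif not is_nlu and looks_like_story(text):
        return "convert_core"
    return "warn"


def proposed_structure(text: str, is_nlu: bool) -> str:
    """Mirrors the suggested flattening: the NLU check is gated on is_nlu."""
    if is_nlu and looks_like_nlu(text):
        return "convert_nlu"
    if not is_nlu and looks_like_story(text):
        return "convert_core"
    return "warn"


nlu_text = "## intent:greet\n- hello\n"
story_text = "## happy path\n* greet\n  - utter_greet\n"

# Converting Core while iterating over an NLU file:
original_result = original_structure(nlu_text, is_nlu=False)
proposed_result = proposed_structure(nlu_text, is_nlu=False)
```

Under these assumptions the flattened version would treat an NLU file as a story file during a Core conversion, which is why the nested check stays.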
I agree, but none of the new methods are "public"; I've renamed them to have a "_" prefix.
Do you think we should read the files and check the actual content?
I think it's worth it, yes. But because the writers are already tested I wouldn't consider it a top priority.
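For reference, a content-level test could look roughly like the sketch below. `toy_convert_file` is a hypothetical stand-in for the converter, its YAML-like output is simplified, and the value of `CONVERTED_FILE_POSTFIX` is an assumption, since this thread only shows the constant's name:

```python
import tempfile
from pathlib import Path

# Assumed value; the diff references CONVERTED_FILE_POSTFIX but does not show it.
CONVERTED_FILE_POSTFIX = "_converted.yml"


def toy_convert_file(source_path: Path, output_dir: Path) -> Path:
    """Hypothetical converter: rewrites '## intent:x' blocks as YAML-like text."""
    output_path = output_dir / f"{source_path.stem}{CONVERTED_FILE_POSTFIX}"
    lines = ["nlu:"]
    for line in source_path.read_text().splitlines():
        if line.startswith("## intent:"):
            lines.append(f"- intent: {line[len('## intent:'):].strip()}")
            lines.append("  examples:")
        elif line.startswith("- "):
            lines.append(f"  {line}")
    output_path.write_text("\n".join(lines) + "\n")
    return output_path


def test_converted_file_contents() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "nlu.md"
        source.write_text("## intent:greet\n- hello\n- hi\n")
        converted = toy_convert_file(source, Path(tmp))
        # Assert on the file name and on what was actually written to disk.
        assert converted.name == "nlu_converted.yml"
        content = converted.read_text()
        assert "- intent: greet" in content
        assert "  - hello" in content
        assert "  - hi" in content
```

The point is only the shape of the test: convert into a temporary directory, read the result back, and assert on its contents instead of just counting converted files.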
Using my admin rights to merge this: everything passes except the Windows tests, which we identified are not passing on any build at the moment.
Closes #6402
Proposed changes:
- `rasa data convert {nlu|core} -f yaml` to convert training data from MD to YAML
Status (please check what you already did):
- `black` (please check Readme for instructions)