Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of converter for training data files #6404

Merged
merged 3 commits into from
Aug 14, 2020

Conversation

degiz
Copy link
Contributor

@degiz degiz commented Aug 13, 2020

Closes #6402

Proposed changes:

  • CLI command rasa data convert {nlu|core} -f yaml to convert training data from MD to YAML

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@degiz degiz force-pushed the 6402_cli_converter branch from 8fb9c97 to 28ce03f Compare August 13, 2020 15:40
@degiz degiz requested review from federicotdn and tmbo August 13, 2020 15:41
@degiz degiz force-pushed the 6402_cli_converter branch from 28ce03f to 1d36962 Compare August 13, 2020 15:58
@wochinge
Copy link
Contributor

does it make sense that it's rasa data convert ?

Copy link
Contributor

@federicotdn federicotdn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's working nicely!
Two things I think we still need:

  • Docstrings on the public methods/functions.
  • Maybe one or two tests that run assertions on the contents of the converted files. But since the writers themselves are already tested maybe this isn't necessary.

(I've also added other comments)

does it make sense that it's rasa data convert ?

@wochinge Would there be other rasa data commands?

changelog/6404.feature.md Outdated Show resolved Hide resolved
rasa/cli/arguments/convert.py Outdated Show resolved Hide resolved
rasa/cli/arguments/convert.py Outdated Show resolved Hide resolved
rasa/cli/convert.py Outdated Show resolved Hide resolved
rasa/cli/convert.py Outdated Show resolved Hide resolved
rasa/cli/convert.py Outdated Show resolved Hide resolved
rasa/cli/convert.py Outdated Show resolved Hide resolved
rasa/cli/convert.py Outdated Show resolved Hide resolved
rasa/cli/convert.py Outdated Show resolved Hide resolved
rasa/cli/convert.py Outdated Show resolved Hide resolved
@wochinge
Copy link
Contributor

@wochinge Would there be other rasa data commands?

there is already rasa data split and I believe rasa data validate

Comment on lines 60 to 62
for file in os.listdir(training_data_path):
source_path = Path(training_data_path) / file
output_path = Path(output) / f"{source_path.stem}{CONVERTED_FILE_POSTFIX}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to use os.walk (or even completely re-using some parts of data.get_core_nlu_files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose os.listdir() over the os.walk() to avoid possible confusion for the users, in case they keep the files in different sub folders and want to experiment with the files one by one.
But I might be wrong!

@degiz
Copy link
Contributor Author

degiz commented Aug 14, 2020

@wochinge @federicotdn The fun part is that we already have rasa data convert 😄

usage: rasa data convert [-h] [-v] [-vv] [--quiet] {nlu} ...

positional arguments:
  {nlu}
    nlu          Converts NLU data between Markdown and json formats.

Looks like we actually need to reuse it here.

What about rasa data convert -f yaml --nlu {DIR} --core {DIR} -output ?

@degiz
Copy link
Contributor Author

degiz commented Aug 14, 2020

Ok, actually even like this:

rasa data convert --nlu {DIR} --core {DIR} -f yaml -output {DIR}

And add a note that we currently convert to YAML only from MD.

@degiz
Copy link
Contributor Author

degiz commented Aug 14, 2020

Ok, another idea:

rasa data convert nlu --data {DIR} -f yaml -out {DIR}
rasa data convert core --data {DIR} -f yaml -out {DIR}

Then we're super consistent.

@degiz degiz force-pushed the 6402_cli_converter branch 2 times, most recently from 86560b7 to 77b4ec5 Compare August 14, 2020 10:14
@degiz degiz requested a review from federicotdn August 14, 2020 10:16
@degiz degiz force-pushed the 6402_cli_converter branch from 77b4ec5 to 622e455 Compare August 14, 2020 10:18
@tmbo
Copy link
Member

tmbo commented Aug 14, 2020

assuming we might also want to make changes to the domain / configuration that we might want to migrate, would it make sense to use something more general, e.g. rasa migrate?

@degiz
Copy link
Contributor Author

degiz commented Aug 14, 2020

rasa migrate

I like the idea! Should we probably use it once we really will migrate something? Currently it's purely about converting training data from one format to the other.

That would be actually a great tool to migrate the whole project and make sure it's 2.0 compatible.

@@ -0,0 +1 @@
User can use ``rasa data convert {nlu|core} -f yaml`` command to convert training data from Markdown format to YAML format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do I need to do to make this support nlg (responses) as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NLGMarkdownReader::reads returns TrainingDataas well as MarkdownReader, so it's mostly about:

  • writing tests for RasaYamlWriter making sure the conversion is correct
  • extending the "convert_to_yaml" function from this review to support NLGMarkdownReader

@tmbo
Copy link
Member

tmbo commented Aug 14, 2020

That would be actually a great tool to migrate the whole project and make sure it's 2.0 compatible.

yes that is a good point, let's separate that and allow data migration as a separate command 👍

In any case, I think it already makes sense to add some instructions to the documentation about how to use the data migration https://rasa.com/docs/rasa/next/migration-guide#rasa-110-to-rasa-20 (should be a new section, what does the user need to do?)

@degiz
Copy link
Contributor Author

degiz commented Aug 14, 2020

add some instructions to the documentation

That's true! I'm just waiting for this review to have an approval to be sure what the final cli syntax looks like 😄

@degiz degiz force-pushed the 6402_cli_converter branch from 622e455 to ee3c4c5 Compare August 14, 2020 10:37
Copy link
Contributor

@federicotdn federicotdn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!! Stuff we might want to add on a future PR: docstrings for the public methods/functions, and deeper testing for the convertion results (file contents).

changelog/6404.feature.md Outdated Show resolved Hide resolved
rasa/cli/arguments/data.py Outdated Show resolved Hide resolved
rasa/cli/data.py Outdated Show resolved Hide resolved
rasa/cli/data.py Outdated Show resolved Hide resolved
rasa/cli/data.py Outdated Show resolved Hide resolved
rasa/cli/data.py Outdated Show resolved Hide resolved
rasa/cli/data.py Outdated Show resolved Hide resolved
rasa/cli/data.py Outdated
Comment on lines 255 to 264
if MarkdownReader.is_markdown_nlu_file(source_path):
if not is_nlu:
continue
_write_nlu_yaml(source_path, output_path, source_path)
num_of_files_converted += 1
elif not is_nlu and MarkdownStoryReader.is_markdown_story_file(source_path):
_write_core_yaml(source_path, output_path, source_path)
num_of_files_converted += 1
else:
print_warning(f"Skipped file '{source_path}'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning is not showing when doing rasa convert core but iterating over NLU files. Maybe the if structure can be changed like this:

Suggested change
if MarkdownReader.is_markdown_nlu_file(source_path):
if not is_nlu:
continue
_write_nlu_yaml(source_path, output_path, source_path)
num_of_files_converted += 1
elif not is_nlu and MarkdownStoryReader.is_markdown_story_file(source_path):
_write_core_yaml(source_path, output_path, source_path)
num_of_files_converted += 1
else:
print_warning(f"Skipped file '{source_path}'")
if is_nlu and MarkdownReader.is_markdown_nlu_file(source_path):
_write_nlu_yaml(source_path, output_path, source_path)
num_of_files_converted += 1
continue
if not is_nlu and MarkdownStoryReader.is_markdown_story_file(source_path):
_write_core_yaml(source_path, output_path, source_path)
num_of_files_converted += 1
continue
print_warning(f"Skipped file '{source_path}'")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little bit trickier. MarkdownStoryReader.is_markdown_story_file returns true for the NLU files 🤦
So our rasa train and other commands work only because we first check if it's NLU before Core.
So this if condition is hacky but correct.

@degiz degiz force-pushed the 6402_cli_converter branch 2 times, most recently from f55e8a2 to f2657db Compare August 14, 2020 11:33
@degiz
Copy link
Contributor Author

degiz commented Aug 14, 2020

docstrings for the public methods/functions

I agree, but non of the new methods are "public", I've renamed them to have "_" prefix

deeper testing for the convertion results (file contents)

Do you think we should read the files and check the actual content?

@degiz degiz force-pushed the 6402_cli_converter branch from f2657db to 29907af Compare August 14, 2020 11:47
@federicotdn
Copy link
Contributor

federicotdn commented Aug 14, 2020

Do you think we should read the files and check the actual content?

I think it's worth it, yes. But because the writers are already tested I wouldn't consider it a top priority.

@degiz degiz force-pushed the 6402_cli_converter branch from 29907af to 3ec8253 Compare August 14, 2020 13:38
@m-vdb
Copy link
Collaborator

m-vdb commented Aug 14, 2020

using my admin rights to merge this - everything passes except windows tests that we identified are not passing on any build at the moment

@m-vdb m-vdb merged commit 989e592 into master Aug 14, 2020
@m-vdb m-vdb deleted the 6402_cli_converter branch August 14, 2020 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create intitial CLI converter for 2.0 training data
5 participants