Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Newline token conversion between markdown and json formats #6087

Closed
AMR-K opened this issue Jun 29, 2020 · 12 comments · Fixed by #6308
Closed

Newline token conversion between markdown and json formats #6087

AMR-K opened this issue Jun 29, 2020 · 12 comments · Fixed by #6308
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR.

Comments

@AMR-K
Copy link

AMR-K commented Jun 29, 2020

Rasa version: rasa==1.10.3

Rasa SDK version (if used & relevant): rasa-sdk==1.10.2

Rasa X version (if used & relevant):

Python version: Python 3.7.8

Operating system Ubuntu 20

Issue:
My team has training datasets with newline tokens \n as part of the text field in json files.
We generally use the markdown format for inspecting the datafiles before converting them back to json so that we can easily manipulate them.
But, converting the same json file to markdown and then back to json causes the escaping of newline tokens which isn't desirable.

Error (including full traceback):

-

Command or request that led to error:

$ cat json_input.json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "foo",
        "text": "bar \n bar \n bar"
      }
    ]
  }
}

$ rasa data convert nlu --data json_input.json --out markdown.md -f md
$ cat markdown.md
## intent:foo
- bar \n bar \n bar

$ rasa data convert nlu --data markdown.md --out json_output.json -f json
$ cat json_output.json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "foo",
        "text": "bar \\n bar \\n bar"
      }
    ],
    "regex_features": [],
    "lookup_tables": [],
    "entity_synonyms": []
  }
}

Code responsible for the issue:

ESCAPE = re.compile(r"[\b\f\n\r\t]")

return ESCAPE.sub(replace, s)

@AMR-K AMR-K added area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. labels Jun 29, 2020
@sara-tagger
Copy link
Collaborator

Thanks for the issue, @tmbo will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

@AMR-KELEG
Copy link
Contributor

@tabergma sorry for the tag if it's somehow spammy but can you help me with this issue.
The way \n characters are escaped distorts my training data files.
Thanks 😅

@AMR-KELEG
Copy link
Contributor

@akelad Could you please check this issue?
Is there a reason for the way \n tokens are escaped in this way?

@akelad
Copy link
Contributor

akelad commented Jul 28, 2020

It's been added to one of our teams inboxes - can I ask how come you're using JSON in the first place? I believe that format might be deprecated soon

@AMR-KELEG
Copy link
Contributor

AMR-KELEG commented Jul 28, 2020

It's been added to one of our teams inboxes - can I ask how come you're using JSON in the first place? I believe that format might be deprecated soon

Well, I have just checked the rasa blog post for version 2.0 and noticed that yaml will be the format for data files.
Json is the format that my team has been using for a while now and it's convenient since it can be easily manipulated / read by different programming languages.
I don't find json to be human-readable and I preferred the MD format so that's why I needed to convert json files to MD, manipulate them and then convert them back to json.

@akelad
Copy link
Contributor

akelad commented Jul 29, 2020

yeah that makes sense - would using yaml once 2.0 be a good replacement option for you for json? Json will still be around for a while, but we will be encouraging users to switch to the new format.

Also, since you already found the area of the code that causes this issue, would you be up for submitting a PR to fix it?

@AMR-KELEG
Copy link
Contributor

I have only used yaml for pipeline configurations so I am not sure how it's used for nlu data (will give it a try soon).
I have created a PR that un-escapes the \n tokens in a markdown file.

@akelad
Copy link
Contributor

akelad commented Jul 31, 2020

nice thanks!

@AMR-KELEG
Copy link
Contributor

O/ Akela,

I am checking the live docs https://rasa.com/docs/rasa/nlu/training-data-format/#data-formats but it looks like the yaml format isn't yet part of it.
Will the docs be updated soon?
I find it easier/ more convenient to check the online docs other than building them from source.

Thanks 😄

@akelad
Copy link
Contributor

akelad commented Aug 5, 2020

It's still a work in progress sorry! you can take a peek here: https://github.com/RasaHQ/rasa/pull/6297/files

@tmbo
Copy link
Member

tmbo commented Aug 5, 2020

@AMR-KELEG still working on the docs but we'll have an update soon. once we merged the PR it will be available at https://rasa.com/docs/rasa/next

@AMR-KELEG
Copy link
Contributor

It's still a work in progress sorry! you can take a peek here: https://github.com/RasaHQ/rasa/pull/6297/files

No worries 😄
Thanks for the pointer.
I will check the rst file for now.

@wochinge wochinge added the type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. label Aug 7, 2020
tmbo pushed a commit that referenced this issue Aug 18, 2020
* Unescape tokens on md-json conversion

Solve #6087
On converting json nlu data into markdown, tokens like: "\n" are
espaced to "\\n".
However, on converting markdown nlu data into json,
Unescaping isn't done

* Add an entry in the changelog

* Add test cases

* Move the decode_string to rasa/utils/io.py

* Remove unnecessary list comprehension

Co-authored-by: Akela Drissner-Schmid <[email protected]>
Co-authored-by: Tanja <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants