Newline token conversion between markdown and json formats #6087

AMR-K · 2020-06-29T12:29:31Z

Rasa version: rasa==1.10.3

Rasa SDK version (if used & relevant): rasa-sdk==1.10.2

Rasa X version (if used & relevant):

Python version: Python 3.7.8

Operating system: Ubuntu 20

Issue:
My team has training datasets with newline tokens \n as part of the text field in JSON files.
We generally use the Markdown format for inspecting the data files, then convert them back to JSON so that we can easily manipulate them.
However, converting a JSON file to Markdown and then back to JSON leaves the newline tokens escaped (\n becomes \\n), which isn't desirable.

Error (including full traceback):

Command or request that led to error:

$ cat json_input.json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "foo",
        "text": "bar \n bar \n bar"
      }
    ]
  }
}

$ rasa data convert nlu --data json_input.json --out markdown.md -f md
$ cat markdown.md
## intent:foo
- bar \n bar \n bar

$ rasa data convert nlu --data markdown.md --out json_output.json -f json
$ cat json_output.json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "intent": "foo",
        "text": "bar \\n bar \\n bar"
      }
    ],
    "regex_features": [],
    "lookup_tables": [],
    "entity_synonyms": []
  }
}
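The difference between the two files is easier to see once the JSON is parsed. A minimal sketch using only Python's standard `json` module, with the two `text` values copied from the files above (the variable names are illustrative, not part of Rasa):

```python
import json

# "text" values exactly as they appear in json_input.json and json_output.json.
original = json.loads(r'"bar \n bar \n bar"')
round_tripped = json.loads(r'"bar \\n bar \\n bar"')

# The original holds real newline characters; the round-tripped text holds
# a literal backslash followed by the letter "n".
print(repr(original))
print(repr(round_tripped))
print(original == round_tripped)
```

The two parsed strings are not equal, so the round trip changes the training example itself, not just how the file looks.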

Code responsible for the issue:

rasa/rasa/nlu/training_data/formats/markdown.py

Line 51 in 88ad06f

ESCAPE = re.compile(r"[\b\f\n\r\t]")

rasa/rasa/nlu/training_data/formats/markdown.py

Line 70 in 88ad06f

return ESCAPE.sub(replace, s)
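For context, a minimal sketch of what that escaping step does when the Markdown file is written. Only the `ESCAPE` pattern and the `ESCAPE.sub(replace, s)` call are taken from the lines quoted above; the mapping table `ESCAPE_DCT` and the function name `encode_string` are assumptions made for illustration.

```python
import re

# Pattern quoted from markdown.py: matches control characters in the text.
ESCAPE = re.compile(r"[\b\f\n\r\t]")

# Assumed mapping from each control character to its two-character escape.
ESCAPE_DCT = {"\b": "\\b", "\f": "\\f", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def encode_string(s: str) -> str:
    """Replace control characters with their backslash escape sequences."""

    def replace(match):
        return ESCAPE_DCT[match.group(0)]

    return ESCAPE.sub(replace, s)


# A real newline goes in; a backslash followed by "n" comes out.
escaped = encode_string("bar \n bar")
print(escaped)
```

Escaping on write is only lossless if the reader applies the inverse mapping when the Markdown is loaded again, which is the step that appears to be missing in the round trip shown above.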

sara-tagger · 2020-06-30T06:00:12Z

Thanks for the issue, @tmbo will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

AMR-KELEG · 2020-07-22T14:25:27Z

@tabergma Sorry if the tag is spammy, but could you help me with this issue?
The way \n characters are escaped distorts my training data files.
Thanks 😅

AMR-KELEG · 2020-07-27T16:09:20Z

@akelad Could you please check this issue?
Is there a reason why \n tokens are escaped this way?

akelad · 2020-07-28T14:59:13Z

It's been added to one of our teams' inboxes - can I ask how come you're using JSON in the first place? I believe that format might be deprecated soon.

AMR-KELEG · 2020-07-28T18:20:10Z

It's been added to one of our teams' inboxes - can I ask how come you're using JSON in the first place? I believe that format might be deprecated soon.

Well, I have just checked the Rasa blog post for version 2.0 and noticed that YAML will be the format for data files.
JSON is the format that my team has been using for a while now, and it's convenient since it can be easily manipulated and read by different programming languages.
I don't find JSON human-readable and prefer the MD format, which is why I need to convert JSON files to MD, manipulate them, and then convert them back to JSON.

akelad · 2020-07-29T08:44:09Z

Yeah, that makes sense - would YAML be a good replacement for JSON for you once 2.0 is out? JSON will still be around for a while, but we will be encouraging users to switch to the new format.

Also, since you already found the area of the code that causes this issue, would you be up for submitting a PR to fix it?

AMR-KELEG · 2020-07-31T03:00:16Z

I have only used YAML for pipeline configurations, so I am not sure how it's used for NLU data (will give it a try soon).
I have created a PR that un-escapes the \n tokens when a Markdown file is read.

akelad · 2020-07-31T10:50:04Z

nice thanks!

AMR-KELEG · 2020-08-04T13:19:08Z

O/ Akela,

I am checking the live docs https://rasa.com/docs/rasa/nlu/training-data-format/#data-formats but it looks like the YAML format isn't part of them yet.
Will the docs be updated soon?
I find it easier and more convenient to check the online docs rather than building them from source.

Thanks 😄

akelad · 2020-08-05T08:22:47Z

It's still a work in progress, sorry! You can take a peek here: https://github.com/RasaHQ/rasa/pull/6297/files

tmbo · 2020-08-05T08:25:59Z

@AMR-KELEG We're still working on the docs, but we'll have an update soon. Once we have merged the PR, it will be available at https://rasa.com/docs/rasa/next

AMR-KELEG · 2020-08-05T20:33:56Z

It's still a work in progress, sorry! You can take a peek here: https://github.com/RasaHQ/rasa/pull/6297/files

No worries 😄
Thanks for the pointer.
I will check the rst file for now.

* Unescape tokens on md-json conversion
  Solve #6087. When converting JSON NLU data to Markdown, tokens like "\n" are escaped to "\\n". However, when converting Markdown NLU data to JSON, no unescaping is done.
* Add an entry in the changelog
* Add test cases
* Move decode_string to rasa/utils/io.py
* Remove unnecessary list comprehension

Co-authored-by: Akela Drissner-Schmid <[email protected]>
Co-authored-by: Tanja <[email protected]>
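A minimal sketch of the kind of unescaping the commit message describes. `decode_string` is the name given in the commit message; its body, the `UNESCAPE` pattern and the `UNESCAPE_DCT` table below are assumptions for illustration and are not the merged code.

```python
import re

# Assumed inverse table: two-character escape sequence -> control character.
UNESCAPE_DCT = {"\\b": "\b", "\\f": "\f", "\\n": "\n", "\\r": "\r", "\\t": "\t"}

# Matches a backslash followed by one of the letters b, f, n, r, t.
UNESCAPE = re.compile(r"\\[bfnrt]")


def decode_string(s: str) -> str:
    """Turn backslash escape sequences back into control characters."""
    return UNESCAPE.sub(lambda match: UNESCAPE_DCT[match.group(0)], s)


# Text as it would be read from markdown.md: backslash + "n", not a newline.
markdown_text = "bar \\n bar \\n bar"
decoded = decode_string(markdown_text)
print(repr(decoded))
```

One trade-off of this sketch: any text that genuinely contains a backslash followed by one of those letters would also be converted, so it only round-trips cleanly with data that was written through the matching escape step.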

AMR-K added the area:rasa-oss 🎡 and type:bug 🐛 labels Jun 29, 2020

AMR-KELEG mentioned this issue Jul 31, 2020

Unescape tokens on md-json conversion #6308

Merged

4 tasks

wochinge added the type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. label Aug 7, 2020

tmbo closed this as completed in #6308 Aug 18, 2020
