~~Nesting Sentences Support~~ → Add Context for nested sentence #27

Closed
4 of 5 tasks
azu opened this issue Mar 9, 2022 · 4 comments
Labels
help wanted Extra attention is needed Status: PR Welcome Welcome to Pull Request Status: Proposal Request for comments

Comments

@azu
Member

azu commented Mar 9, 2022

He said "This is a pen. That is not a pen.".

Currently, sentence-splitter parses the text as follows:

[image: current parse result]

We want to support nested sentences.

[image: desired nested parse result]

  • Sentence A
    • Sentence B
    • Sentence C

The PairMaker module is related:
https://github.com/azu/sentence-splitter/blob/master/src/parser/PairMaker.ts

AST Design

  • This design is not perfect
[
  {
    "type": "Sentence",
    "raw": "He said 「This is a pen. That is not a pen」.",
    "children": [
      {
        "type": "Str",
        "raw": "He said ",
        "value": "He said "
      },
      {
        "type": "PairMark",
        "pairType": "start",
        "raw": "「",
        "value": "「"
      },
      {
        "type": "Sentence",
        "raw": "This is a pen.",
        "children": [
          {
            "type": "Str",
            "raw": "This is a pen.",
            "value": "This is a pen."
          }
        ]
      },
      {
        "type": "Sentence",
        "raw": "That is not a pen",
        "children": [
          {
            "type": "Str",
            "raw": "That is not a pen",
            "value": "That is not a pen"
          }
        ]
      },
      {
        "type": "PairMark",
        "pairType": "end",
        "raw": "」",
        "value": "」"
      },
      {
        "type": "Punctuation",
        "raw": ".",
        "value": "."
      }
    ]
  }
]
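
To make the draft easier to discuss, here is a minimal TypeScript sketch of how a consumer might walk such a nested tree. The type names (`StrNode`, `PairMarkNode`, `SentenceNode`, `SentenceChild`) and the helper `collectSentences` are hypothetical and only mirror the JSON draft above; they are not sentence-splitter's actual types or API. The sample tree also assumes the pair marks carry the bracket characters in `raw`.

```typescript
// Hypothetical node types mirroring the draft AST; not sentence-splitter's real types.
type StrNode = { type: "Str"; raw: string; value: string };
type PunctuationNode = { type: "Punctuation"; raw: string; value: string };
type PairMarkNode = { type: "PairMark"; pairType: "start" | "end"; raw: string; value: string };
type SentenceNode = { type: "Sentence"; raw: string; children: SentenceChild[] };
type SentenceChild = StrNode | PunctuationNode | PairMarkNode | SentenceNode;

// Walk the tree depth-first and list every Sentence with its nesting depth.
function collectSentences(node: SentenceNode, depth = 0): { depth: number; raw: string }[] {
  const found = [{ depth, raw: node.raw }];
  for (const child of node.children) {
    if (child.type === "Sentence") {
      found.push(...collectSentences(child, depth + 1));
    }
  }
  return found;
}

// Sample tree following the draft (assumption: PairMark raw holds the bracket).
const draft: SentenceNode = {
  type: "Sentence",
  raw: "He said 「This is a pen. That is not a pen」.",
  children: [
    { type: "Str", raw: "He said ", value: "He said " },
    { type: "PairMark", pairType: "start", raw: "「", value: "「" },
    {
      type: "Sentence",
      raw: "This is a pen.",
      children: [{ type: "Str", raw: "This is a pen.", value: "This is a pen." }],
    },
    {
      type: "Sentence",
      raw: "That is not a pen",
      children: [{ type: "Str", raw: "That is not a pen", value: "That is not a pen" }],
    },
    { type: "PairMark", pairType: "end", raw: "」", value: "」" },
    { type: "Punctuation", raw: ".", value: "." },
  ],
};

const nested = collectSentences(draft);
```

Here `nested` holds three entries: the outer sentence at depth 0 and the two quoted sentences at depth 1, which matches the Sentence A / B / C outline above.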

@azu
Member Author

azu commented Mar 13, 2022

I noticed that some cases are difficult.

彼は「コレ」と読んだ (He read it as 「コレ」.)

I think 「コレ」 is not a sentence here, but sentence-splitter cannot detect that.

@azu
Member Author

azu commented Feb 10, 2023

It should be an opt-in feature.

We cannot detect which parse is better.

@azu
Member Author

azu commented Feb 10, 2023

@azu
Member Author

azu commented Feb 11, 2023

We are talking about pens.
He said "This is a pen. I like it".
I could relate to that statement.

The current parser splits it into the following sentences:

[image: current parse result]

The second sentence contains "This is a pen. I like it", but we cannot split it into new sentences.

The conversation text is just a Str node.
HTML does not have suitable semantics for conversation.

As a result, sentence-splitter cannot support nested sentences.
Rule implementations should probably handle the quoted text themselves, after sentence-splitter has parsed the sentences.
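
As a rough illustration of what such rule-side handling could look like, the sketch below takes a sentence's raw text, pulls out the quoted spans, and splits each span with a naive punctuation-based splitter. Everything here is an assumption for illustration: `PAIRS`, `extractQuoted`, and `splitNaively` are hypothetical helpers, not part of sentence-splitter, and a real rule would likely feed the quoted text back into its own parser instead.

```typescript
// Hypothetical opening -> closing quote pairs a rule might care about.
const PAIRS: Record<string, string> = { '"': '"', "「": "」" };

// Return the text inside each quoted span of a sentence's raw string.
function extractQuoted(raw: string): string[] {
  const quotes: string[] = [];
  let i = 0;
  while (i < raw.length) {
    const close = PAIRS[raw.charAt(i)];
    if (close !== undefined) {
      const end = raw.indexOf(close, i + 1);
      if (end !== -1) {
        quotes.push(raw.slice(i + 1, end));
        i = end + 1; // continue scanning after the closing mark
        continue;
      }
    }
    i++;
  }
  return quotes;
}

// Naive splitter: a run of non-terminal characters plus any trailing terminators.
function splitNaively(text: string): string[] {
  const parts = text.match(/[^.!?。！？]+[.!?。！？]*/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

const quoted = extractQuoted('He said "This is a pen. I like it".');
const inner = quoted.flatMap(splitNaively);
// inner is ["This is a pen.", "I like it"]
```

This approach keeps the decision in the rule. For input such as 彼は「コレ」と読んだ, `extractQuoted` returns only `コレ`, and the rule still has to decide for itself whether that fragment counts as a sentence.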

We will close this issue by adding the current behavior.
