Whitespace configurable as sentence separator? #51

stippi · 2023-12-01T21:55:04Z

Hi. First, thank you for making this project!

I tried passing "\n" in the array of characters that should be treated as sentence separators, but for some reason this doesn't work. Basically, I am trying to get two sentences returned in a case like this one in your tests.

This is my code:

import { split } from "sentence-splitter";

export type Sentence = {
  content: string;
  offset: number;
}

const options = {
  SeparatorParser: {
    separatorCharacters: [
      ".", // period
      "．", // (ja) zenkaku-period
      "。", // (ja) 句点
      "?", // question mark
      "!", //  exclamation mark
      "？", // (ja) zenkaku question mark
      "！", // (ja) zenkaku exclamation mark
      "\n"
    ]
  }
};

export function splitIntoSentencesAst(text: string): Sentence[] {
  const result = split(text, options);
  return result
    .filter(node => node.type === "Sentence")
    .map(node => {
      return {
        content: node.raw,
        offset: node.range[0]
      };
    });
}

However, this does not have the expected effect: I still get only one sentence (with its children split at the whitespace).

My use case is that I am trying to send the output of an LLM to a text-to-speech API sentence by sentence in order to decrease latency. The output is often Markdown with lists such as

1. First item
2. Second item

Thanks for any pointers!

azu (Maintainer) · 2023-12-01T22:36:30Z

sentence-splitter cannot be configured to treat whitespace as a sentence separator.
It is better to parse the text as Markdown first, and then split each Markdown paragraph into sentences.

Example:

https://stackblitz.com/edit/vitejs-vite-wnudfx?file=main.js

import { parse } from '@textlint/markdown-to-ast';
import { splitAST, SentenceSplitterSyntax } from 'sentence-splitter';
import { traverse } from '@textlint/ast-traverse';
const onChange = () => {
  const input = document.querySelector('#input');
  // parse markdown
  const AST = parse(input.value);

  // collect Paragraph Nodes
  const paragraphNodes = [];
  traverse(AST, {
    enter(node) {
      if (node.type === 'Paragraph') {
        paragraphNodes.push(node);
      }
    },
  });

  // split each paragraph into sentences
  const allSentences = paragraphNodes.flatMap((pNode) => {
    const sentenceAST = splitAST(pNode);
    const sentences = sentenceAST.children.filter(
      (node) => node.type === SentenceSplitterSyntax.Sentence
    );
    return sentences;
  });
  document.querySelector('#output').textContent = allSentences
    .map((sentenceNode) => sentenceNode.raw)
    .join('\n---\n');
};

document.querySelector('#input').addEventListener('input', onChange);

onChange();
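
If pulling in a Markdown parser is more than the use case needs, a rough alternative is to treat each newline as a hard boundary yourself and only then hand each line to the sentence splitter. The sketch below is not part of sentence-splitter: `splitByLines`, `LineSplitter`, and `wholeLine` are hypothetical names, the per-line splitter is passed in as a callback, and line endings are assumed to be plain "\n".

```typescript
// A sentence together with its character offset in the original text.
type Sentence = {
  content: string;
  offset: number;
};

// Hypothetical adapter type: any function that splits ONE line into
// sentences, with offsets relative to the start of that line.
type LineSplitter = (line: string) => Sentence[];

// Treat every "\n" as a hard boundary, delegate each non-blank line to
// `splitLine`, and shift the returned offsets back into the full text.
function splitByLines(text: string, splitLine: LineSplitter): Sentence[] {
  const sentences: Sentence[] = [];
  let lineStart = 0;
  for (const line of text.split("\n")) {
    if (line.trim().length > 0) {
      for (const s of splitLine(line)) {
        sentences.push({ content: s.content, offset: lineStart + s.offset });
      }
    }
    lineStart += line.length + 1; // +1 for the "\n" consumed by split()
  }
  return sentences;
}

// Stand-in splitter for illustration only: returns the whole line as one sentence.
const wholeLine: LineSplitter = (line) => [{ content: line, offset: 0 }];

const result = splitByLines("1. First item\n2. Second item", wholeLine);
console.log(result);
```

A function with the same shape as `splitIntoSentencesAst` from the question (text in, `{ content, offset }` items out) could be passed in place of `wholeLine`. Periods in list markers such as "1." may still be seen as separators by the per-line splitter, which is one reason the Markdown-first approach above is likely the more robust option.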
