-
I think there is an example here of using an ingest pipeline to do chunking. So I'd expect something like the following to do the chunking for you:
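As a rough sketch (the pipeline name `fscrawler_chunker`, the target field `chunks`, and splitting on blank lines are my assumptions, not anything FSCrawler mandates): a pipeline with a script processor that splits the `content` field FSCrawler extracts into a paragraph array, something like:

```json
PUT _ingest/pipeline/fscrawler_chunker
{
  "description": "Sketch: split the text FSCrawler extracts into paragraph chunks",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.content != null) { ctx.chunks = Arrays.asList(ctx.content.splitOnToken('\\n\\n')); }"
      }
    }
  ]
}
```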
You can define the pipeline within the FSCrawler configuration. I'd be happy to hear if this works for you. If so, I'd love to get a documentation PR on it.
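For reference, here is a minimal sketch of that configuration, assuming a job named `my_job` (so the settings file lives at `~/.fscrawler/my_job/_settings.yaml`) and a placeholder documents path. FSCrawler's `elasticsearch.pipeline` setting tells Elasticsearch to run every indexed document through the pipeline, so the chunks are produced at index time without changing FSCrawler itself:

```yaml
name: "my_job"
fs:
  url: "/path/to/docs"   # placeholder: directory FSCrawler should crawl
elasticsearch:
  # Route every document through the ingest pipeline defined above
  pipeline: "fscrawler_chunker"
```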
-
Hi,
I'm trying to build a small POC of a GPT-enabled knowledge repository for my organisation, covering sensitive content in all file formats. I am using Elasticsearch with LangChain. The setup is working but with poor accuracy; one of the suggested remedies is to chunk the data. As part of that, I was wondering if there is any way in FSCrawler to split a file (at least for PDFs) by paragraph, for better semantic analysis.
Thanks