Chunk function #17586

Draft
wants to merge 4 commits into
base: main

charleschile (Contributor) commented on Jul 17, 2024

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #

What this PR does / why we need it:

Adds chunk() as an LLM helper function that splits a string into chunks.

Usage: chunk('<string>', '<chunk strategy>');
For fixed-width chunking, the chunk strategy parameter consists of "fixed_width" and a width number, separated by a semicolon (for example, "fixed_width; 11").
Returns a JSON-like string representing an array of chunks, one [offset, size, "chunk"] entry per chunk: [[offset0, size0, "chunk0"], [offset1, size1, "chunk1"], ...]
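
A client consuming this result will probably want it as structured data. The sketch below is a minimal Python example, assuming the returned string happens to be valid JSON (the description only calls it "JSON-like", so this may not hold for every input). The helper name parse_chunks is hypothetical and not part of this PR.

```python
import json

# Assumption: the string returned by chunk() parses as JSON.
# The sample value is copied from the fixed-width example in this PR.
raw = '[[0, 11, "hello world"], [11, 11, " this is a "], [22, 5, "test."]]'


def parse_chunks(result):
    """Turn a chunk() result string into (offset, size, text) tuples."""
    return [(int(offset), int(size), text) for offset, size, text in json.loads(result)]


chunks = parse_chunks(raw)

# Concatenating the chunk texts should give back the original input.
original = "".join(text for _, _, text in chunks)
```

If the output turns out not to be strict JSON (for example, unescaped quotes inside a chunk), a dedicated parser would be needed instead of json.loads.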

Chunking strategies and example SQL

1. Fixed Width Chunking

Splits the text into chunks of a specified fixed width.
Usage: chunk("<string>", "fixed_width; <width number>");

Example SQL:

select chunk("hello world this is a test.", "fixed_width; 11");

You can replace 11 with any desired chunk width.

Expected return:

[[0, 11, "hello world"], [11, 11, " this is a "], [22, 5, "test."]] 
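
For reference, the fixed-width behavior above can be sketched in Python as follows. This is an illustration of the technique, not the code in this PR: parse_strategy and fixed_width_chunks are hypothetical names, and the sketch assumes offsets and sizes count characters (the ASCII example does not distinguish characters from bytes).

```python
def parse_strategy(strategy):
    """Split a strategy string such as "fixed_width; 11" into (name, width).

    Assumption: the name and the number are separated by a single semicolon.
    Strategies without a number (e.g. "sentence") give width None.
    """
    name, _, arg = strategy.partition(";")
    width = int(arg) if arg.strip() else None
    return name.strip(), width


def fixed_width_chunks(text, width):
    """Return [offset, size, chunk] entries of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be a positive integer")
    chunks = []
    for offset in range(0, len(text), width):
        piece = text[offset:offset + width]  # last piece may be shorter
        chunks.append([offset, len(piece), piece])
    return chunks


name, width = parse_strategy("fixed_width; 11")
result = fixed_width_chunks("hello world this is a test.", width)
```

The last chunk is simply whatever remains, which is why its size (5) is smaller than the requested width.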

2. Sentence-Based Chunking

Splits the text into chunks based on sentence boundaries (periods, exclamation marks, and question marks).
Usage: chunk("<string>", "sentence");

Example SQL:

select chunk("hello world this is a test? hello world! this is a test. hello world.", "sentence");

Expected return:

[[0, 27, "hello world this is a test?"], [27, 13, " hello world!"], [40, 16, " this is a test."], [56, 13, " hello world."]]
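
The sentence strategy can be approximated with a regular expression, as in the Python sketch below. It assumes each chunk ends immediately after a '.', '!' or '?', that leading whitespace stays with the following sentence (as in the expected return above), and that any trailing text without a terminator becomes a final chunk. The last point is an assumption; the PR does not say how such text is handled. The function name sentence_chunks is hypothetical.

```python
import re

# Either: any run of non-terminators followed by one terminator,
# or: a final run of non-terminators at the very end of the text.
SENTENCE_RE = re.compile(r"[^.!?]*[.!?]|[^.!?]+\Z")


def sentence_chunks(text):
    """Return [offset, size, chunk] entries, one per sentence."""
    return [[m.start(), len(m.group()), m.group()] for m in SENTENCE_RE.finditer(text)]


text = "hello world this is a test? hello world! this is a test. hello world."
result = sentence_chunks(text)
```

Because the split is purely punctuation-based, abbreviations and decimal numbers (for example "3.14") would also be split at the period in this sketch.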

3. Paragraph-Based Chunking

Splits the text into chunks based on paragraph boundaries (newlines).
Usage: chunk("<string>", "paragraph");

Example SQL:

select chunk("hello world this is a test. \nhello world this is a test. hello world.", "paragraph");

Expected return:

[[0, 29, "hello world this is a test. \n"], [29, 41, "hello world this is a test. hello world.\n"]]
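
A rough Python sketch of newline-based splitting is shown below, with the hypothetical name paragraph_chunks. It assumes each chunk ends right after a newline and that offsets count characters. Unlike the expected return listed above, this sketch does not append a newline to the final paragraph, so for the example input its last entry has size 40 rather than 41.

```python
import re

# Either: a run of non-newline characters followed by one newline,
# or: a final run of non-newline characters at the very end of the text.
PARAGRAPH_RE = re.compile(r"[^\n]*\n|[^\n]+\Z")


def paragraph_chunks(text):
    """Return [offset, size, chunk] entries, one per newline-terminated paragraph."""
    return [[m.start(), len(m.group()), m.group()] for m in PARAGRAPH_RE.finditer(text)]


text = "hello world this is a test. \nhello world this is a test. hello world."
result = paragraph_chunks(text)
```

Since the sketch never adds characters, joining the chunk texts always reproduces the input exactly.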

Pending

  • support more LLM chunking strategies

mergify bot added the kind/feature label on Jul 17, 2024