fix: chunk_by_title() interface is rude #1844

Merged: 5 commits, Oct 25, 2023

Collaborator

@scanny scanny commented Oct 23, 2023

chunk_by_title() interface is "rude"

Executive Summary. Perhaps the most commonly specified option for chunk_by_title() is max_characters (default: 500), which specifies the chunk window size.

When a user specifies this value, they get an error message:

  >>> chunks = chunk_by_title(elements, max_characters=100)
  ValueError: Invalid values for combine_text_under_n_chars, new_after_n_chars, and/or max_characters.

A few of the things that might reasonably pass through a user's mind at such a moment are:

  • "Is 100 not a valid value for max_characters? Why would that be?"
  • "I didn't specify a value for combine_text_under_n_chars or new_after_n_chars, in fact I don't know what they are because I haven't studied the documentation and would prefer not to; I just want smaller chunks! How could I supply an invalid value when I haven't supplied any value at all for these?"
  • "Which of these values is the problem? Why are you making me figure that out for myself? I'm sure the code knows which one is not valid, why doesn't it share that information with me? I'm busy here!"

In this particular case, the problem is that combine_text_under_n_chars (defaults to 500) is greater than max_characters, which means it would never take effect (which is actually not a problem in itself).

To fix this, once they had figured out that this was the problem (probably after opening an issue and maybe reading the source code), the user would need to specify:

>>> chunks = chunk_by_title(
...     elements, max_characters=100, combine_text_under_n_chars=100
... )

This and other stressful user scenarios can be remedied by:

  • Using "active" defaults for the combine_text_under_n_chars and new_after_n_chars options.
  • Providing a specific error message for each way a constraint may be violated, such that direction to remedy the problem is immediately clear to the user.

An active default is for example:

  • Declare the parameter as combine_text_under_n_chars: int | None = None so that the code can detect when it has not been specified.
  • When not specified, set its value to max_characters, the same as its current (static) default.
    This particular change would avoid the behavior in the motivating example above.

Another alternative for this argument is simply:

combine_text_under_n_chars = min(max_characters, combine_text_under_n_chars)
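
A minimal sketch of the two approaches, for illustration only. `resolve_combine_threshold` and `clamp_combine_threshold` are hypothetical helper names, not functions in the library, and the annotation is written as `Optional[int]`, which is equivalent to `int | None`.

```python
from typing import Optional


def resolve_combine_threshold(
    max_characters: int = 500,
    combine_text_under_n_chars: Optional[int] = None,
) -> int:
    """Active default: follow max_characters unless the caller chose a value."""
    if combine_text_under_n_chars is None:
        # None means "not specified", so track the chunk window size.
        return max_characters
    return combine_text_under_n_chars


def clamp_combine_threshold(
    max_characters: int = 500,
    combine_text_under_n_chars: int = 500,
) -> int:
    """Alternative: keep the static default but cap it at max_characters."""
    return min(max_characters, combine_text_under_n_chars)
```

Both resolve the motivating call (`max_characters=100`, nothing else specified) to a threshold of 100. The difference is that the active default can still tell an explicit, conflicting value apart from an unspecified one, while `min()` silently adjusts either.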

Fix

  1. Add constraint-specific error messages.
  2. Use "active" defaults for combine_text_under_n_chars and new_after_n_chars.
  3. Improve docstring to describe active defaults, and explain other argument behaviors, in particular identifying suppression options like combine_text_under_n_chars = 0 to disable chunk combining.
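
The first two items could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `resolve_chunking_options` is a hypothetical helper, the message wording is invented, and capping an oversized `new_after_n_chars` instead of rejecting it is assumed from the review discussion.

```python
from typing import Optional, Tuple


def resolve_chunking_options(
    max_characters: int = 500,
    combine_text_under_n_chars: Optional[int] = None,
    new_after_n_chars: Optional[int] = None,
) -> Tuple[int, int, int]:
    """Hypothetical helper: validate each option separately.

    Returns (max_characters, combine_text_under_n_chars, new_after_n_chars).
    """
    if max_characters <= 0:
        raise ValueError(f"'max_characters' must be > 0, got {max_characters}")

    # Active default: follow max_characters when the caller gave no value.
    if combine_text_under_n_chars is None:
        combine_text_under_n_chars = max_characters
    if combine_text_under_n_chars < 0:
        raise ValueError(
            "'combine_text_under_n_chars' must be >= 0,"
            f" got {combine_text_under_n_chars}"
        )
    if combine_text_under_n_chars > max_characters:
        raise ValueError(
            "'combine_text_under_n_chars' must not exceed 'max_characters',"
            f" got {combine_text_under_n_chars} > {max_characters}"
        )

    # Same active default for the soft limit.
    if new_after_n_chars is None:
        new_after_n_chars = max_characters
    if new_after_n_chars < 0:
        raise ValueError(
            f"'new_after_n_chars' must be >= 0, got {new_after_n_chars}"
        )
    # Assumption: an oversized soft limit is capped rather than rejected.
    new_after_n_chars = min(new_after_n_chars, max_characters)

    return max_characters, combine_text_under_n_chars, new_after_n_chars
```

With this shape, specifying only `max_characters=100` resolves without error, and a genuinely conflicting value produces a message that names the offending argument and both numbers involved.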

@scanny scanny marked this pull request as ready for review October 23, 2023 18:22
@scanny scanny force-pushed the scanny/chunk-by-title-is-rude branch from d7609d1 to 27035d8 Compare October 23, 2023 18:23
@shreyanid
Contributor

Looking at the chunking documentation and the docstring of chunk_by_title in this PR, I'm trying to understand the relationship between these 3 parameters combine_text_under_n_chars, new_after_n_chars, and max_characters

  • new_after_n_chars is after how many characters a section should be capped and a new section should start. Can this value in theory be greater than max_characters?
  • combine_text_under_n_chars is if a section has under this many characters, it will be combined with another(?) section
  • max_characters (not mentioned in the documentation) is to specify a "chunk size"? for elements text specifically? - I'm not sure how this hard max and the soft max new_after_n_chars act differently.

I am very new to chunking/RAG but I can review given a little context/crash course on how each of these are used :)

@scanny scanny force-pushed the scanny/chunk-by-title-is-rude branch 2 times, most recently from 30179e7 to 72022ad Compare October 24, 2023 02:13
@scanny scanny requested a review from shreyanid October 24, 2023 17:01
@scanny scanny force-pushed the scanny/chunk-by-title-is-rude branch from 72022ad to a7ab984 Compare October 24, 2023 17:01
(0, 0, 0), # invalid max_characters
(-5666, -6777, -8999), # invalid chunk size
(-5, 40, 50), # invalid chunk size
(50, 70, 20), # max_characters needs to be greater than new_after_n_chars
Contributor

Can we add a test for capping new_after_n_chars at max_characters, and for any other cases in this old test that aren't already covered?

Collaborator Author

Sure, added a test for that.

Collaborator Author

All the cases from the old test were already covered.
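
For readers following along, a test of the capping behavior might be shaped like the sketch below. `effective_new_after_n_chars` is a hypothetical stand-in for the resolution logic, not the function or test added in this PR, and the cases are illustrative values.

```python
from typing import Optional


def effective_new_after_n_chars(
    max_characters: int, new_after_n_chars: Optional[int] = None
) -> int:
    """Hypothetical helper: the soft limit never exceeds the hard limit."""
    if new_after_n_chars is None:
        return max_characters
    return min(new_after_n_chars, max_characters)


def test_new_after_n_chars_is_capped_at_max_characters() -> None:
    cases = [
        # (max_characters, new_after_n_chars, expected)
        (100, None, 100),  # unspecified: follows max_characters
        (100, 50, 50),     # below the hard limit: kept as given
        (100, 200, 100),   # above the hard limit: capped, no error
    ]
    for max_characters, new_after_n_chars, expected in cases:
        actual = effective_new_after_n_chars(max_characters, new_after_n_chars)
        assert actual == expected
```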

@scanny scanny force-pushed the scanny/chunk-by-title-is-rude branch from a7ab984 to c6b558c Compare October 24, 2023 17:43
@scanny scanny force-pushed the scanny/chunk-by-title-is-rude branch 2 times, most recently from 79e47bd to 11854fd Compare October 24, 2023 18:32
@@ -56,8 +56,8 @@ def chunk_table_element(element: Table, max_characters: int = 500) -> List[Table
 def chunk_by_title(
     elements: List[Element],
     multipage_sections: bool = True,
-    combine_text_under_n_chars: int = 500,
-    new_after_n_chars: int = 500,
+    combine_text_under_n_chars: int | None = None,
Contributor

As far as I can tell, this syntax wasn't introduced until Python 3.10. On the other hand, CI isn't complaining (maybe because typing isn't really enforced?). Everywhere else we're using the Optional[int] syntax, I'm inclined to stick with that here.

Collaborator Author

@scanny scanny Oct 24, 2023

This syntax is accepted in annotations in Python 3.7+ when from __future__ import annotations appears in the import section.

This "pipe == Union" syntax is recommended as Union will be deprecated, so it's the "way forward" so to speak, which is why I started using it. Let me see if I can find a reference for that ... here's one: https://mail.python.org/archives/list/[email protected]/message/624BEVFNZXFEQP6ZWIGB4YKSPMCBE2M6/

The next response in that thread is Guido approving.

The from __future__ import annotations also gives some other very nice benefits like not requiring forward references to be quoted and a few other things.
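
A small sketch of what that import changes, for illustration. `first_or_none` is a hypothetical function, not code from this PR.

```python
from __future__ import annotations  # must precede all other statements


def first_or_none(items: list[str]) -> str | None:
    """Return the first item, or None when the list is empty."""
    return items[0] if items else None


# Annotations are stored as strings and not evaluated when the function
# is defined, which is why the pipe syntax parses on older interpreters.
print(first_or_none.__annotations__)
```

One caveat: the import only defers evaluation. Code that evaluates annotations at runtime, such as `typing.get_type_hints()`, still needs Python 3.10+ to resolve the pipe form.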

Collaborator Author

Ok, fix is in for this @qued :)

This does not add a feature or fix a user-opened issue, so no CHANGELOG
entry of its own.
`_split_elements_by_title_and_table()` is an implementation function
(private). It is called only by `chunk_by_title()` (outside tests) and
with all arguments specified, so has no need for defaults since they are
never used and complicate reasoning about its behavior.

Remove defaults from all parameters to
`_split_elements_by_title_and_table()`.
@scanny scanny force-pushed the scanny/chunk-by-title-is-rude branch from 11854fd to 6b8a3ce Compare October 24, 2023 21:22
@qued qued enabled auto-merge October 24, 2023 22:57
@qued qued added this pull request to the merge queue Oct 24, 2023
Merged via the queue into main with commit 40a265d Oct 25, 2023
@qued qued deleted the scanny/chunk-by-title-is-rude branch October 25, 2023 00:12

3 participants