-
Notifications
You must be signed in to change notification settings - Fork 822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: chunk_by_title() interface is rude #1844
Conversation
d7609d1
to
27035d8
Compare
Looking at the chunking documentation and the docstring of chunk_by_title in this PR, I'm trying to understand the relationship between these 3 parameters combine_text_under_n_chars, new_after_n_chars, and max_characters
I am very new to chunking/RAG but I can review given a little context/crash course on how each of these are used :) |
30179e7
to
72022ad
Compare
72022ad
to
a7ab984
Compare
(0, 0, 0), # invalid max_characters | ||
(-5666, -6777, -8999), # invalid chunk size | ||
(-5, 40, 50), # invalid chunk size | ||
(50, 70, 20), # max_characters needs to be greater than new_after_n_chars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a test for capping new_after_n_chars at max_characters? and any other cases here in this old test that aren't already covered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, added a test for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the cases from the old test were already covered.
a7ab984
to
c6b558c
Compare
79e47bd
to
11854fd
Compare
unstructured/chunking/title.py
Outdated
@@ -56,8 +56,8 @@ def chunk_table_element(element: Table, max_characters: int = 500) -> List[Table | |||
def chunk_by_title( | |||
elements: List[Element], | |||
multipage_sections: bool = True, | |||
combine_text_under_n_chars: int = 500, | |||
new_after_n_chars: int = 500, | |||
combine_text_under_n_chars: int | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, this syntax wasn't introduced until Python 3.10. On the other hand, CI isn't complaining (maybe because typing isn't really enforced?). Everywhere else we're using the Optional[int]
syntax, I'm inclined to stick with that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This syntax is accepted in Python 3.7+ (and maybe further back) when from __future__ import annotations
appears in the import section.
This "pipe == Union" syntax is recommended as Union will be deprecated, so it's the "way forward" so to speak, which is why I started using it. Let me see if I can find a reference for that ... here's one: https://mail.python.org/archives/list/[email protected]/message/624BEVFNZXFEQP6ZWIGB4YKSPMCBE2M6/
next response in that thread is Guido approving.
The from __future__ import annotations
also gives some other very nice benefits like not requiring forward references to be quoted and a few other things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fix is in for this @qued :)
This does not add a feature or fix a user-opened issue, so no CHANGELOG entry of its own.
`_split_elements_by_title_and_table()` is an implementation function (private). It is called only by `chunk_by_title()` (outside tests) and with all arguments specified, so has no need for defaults since they are never used and complicate reasoning about its behavior. Remove defaults from all parameters to `_split_elements_by_title_and_table()`.
11854fd
to
6b8a3ce
Compare
chunk_by_title()
interface is "rude"Executive Summary. Perhaps the most commonly specified option for
chunk_by_title()
ismax_characters
(default: 500), which specifies the chunk window size.When a user specifies this value, they get an error message:
A few of the things that might reasonably pass through a user's mind at such a moment are:
110
not a valid value formax_characters
? Why would that be?"combine_text_under_n_chars
ornew_after_n_chars
, in fact I don't know what they are because I haven't studied the documentation and would prefer not to; I just want smaller chunks! How could I supply an invalid value when I haven't supplied any value at all for these?"In this particular case, the problem is that
combine_text_under_n_chars
(defaults to 500) is greater thanmax_characters
, which means it would never take effect (which is actually not a problem in itself).To fix this, once figuring out that was the problem, probably after opening an issue and maybe reading the source code, the user would need to specify:
This and other stressful user scenarios can be remedied by:
combine_text_under_n_chars
andnew_after_n_chars
options.An active default is for example:
combine_text_under_n_chars: int | None = None
such that the code can detect when it has not been specified.max_characters
, the same as its current (static) default.This particular change would avoid the behavior in the motivating example above.
Another alternative for this argument is simply:
Fix
combine_text_under_n_ chars
andnew_after_n_chars
.combine_text_under_n_chars = 0
to disable chunk combining.