[ENH]: CIP-5: Large Batch Handling Improvements Proposal #1077
Conversation
- Including only CIP for review. Refs: chroma-core#1049
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
and `chromadb.api.fastapi.FastAPI`
- `chromadb.api.segment.SegmentAPI` will implement the `max_batch_size` property by fetching the value from the `Producer` class.
- `chromadb.api.fastapi.FastAPI` will implement the `max_batch_size` by fetching it from a new `/pre-flight-checks` endpoint.
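For context, a minimal sketch of the two implementations described in the quoted CIP text above. Only `max_batch_size` and the `/pre-flight-checks` route come from the CIP; the class wiring, response shape, and `Producer` attribute access are assumptions, not the merged code.

```python
# Simplified sketch of the two max_batch_size implementations described above.
import requests


class SegmentAPISketch:  # stands in for chromadb.api.segment.SegmentAPI
    def __init__(self, producer):
        self._producer = producer

    @property
    def max_batch_size(self) -> int:
        # Delegate to the Producer, which knows the storage layer's limit.
        return self._producer.max_batch_size


class FastAPISketch:  # stands in for chromadb.api.fastapi.FastAPI
    def __init__(self, base_url: str):
        self._base_url = base_url

    @property
    def max_batch_size(self) -> int:
        # Ask the server via the new pre-flight endpoint (response shape assumed).
        resp = requests.get(f"{self._base_url}/pre-flight-checks")
        resp.raise_for_status()
        return int(resp.json()["max_batch_size"])
```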
is the assumption that these are fetched once on client creation?
Haven't thought about it, but the actual impl caches the values.
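To illustrate the caching behaviour discussed in this thread, here is a rough sketch: values are fetched lazily on first access rather than at client creation, then reused. The class and attribute names are hypothetical.

```python
# Sketch: fetch the pre-flight payload once, on first use, and reuse it.
import requests


class CachingFastAPISketch:
    def __init__(self, base_url: str):
        self._base_url = base_url
        self._pre_flight = None  # not fetched at client creation time

    def _pre_flight_settings(self) -> dict:
        # Only the first call hits the server; later calls reuse the cache.
        if self._pre_flight is None:
            resp = requests.get(f"{self._base_url}/pre-flight-checks")
            resp.raise_for_status()
            self._pre_flight = resp.json()
        return self._pre_flight

    @property
    def max_batch_size(self) -> int:
        return int(self._pre_flight_settings()["max_batch_size"])
```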
- Updated CIP
- Implementation done
- Added a new test in test_add

Refs: chroma-core#1049
My main feedback here is we should not gracefully break up the batches for you, as partial failures are harder to reason about. It's easier to delegate the batching responsibility to the client and have you be responsible for handling partial failures in a way that makes sense to you. Otherwise this LGTM!
@HammadB, true that we cannot guarantee ACID-like behaviour this way, but even without it, things seem to go wrong on very large batches. In my experience, in situations where transactionality cannot be guaranteed, we can implement mechanisms to fail gracefully - e.g. store the failed batch in a temp file and allow the user to resubmit later. wdyt?
I think this discussion brought up by @HammadB and the last solution proposed by @tazarov require more time and analysis (the last solution also has its implications). I'd suggest reducing the scope of this change to just exposing the max_batch_size in the API so it can be merged and clients can handle batches themselves (and adding it to the docs). In the case of privateGPT it is causing errors, and I guess more clients are impacted. Error management can be discussed in a different CIP. WDYT?
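To make the client-side responsibility discussed above concrete, a rough sketch of what a caller could do once `max_batch_size` is exposed: split the data itself and decide what a partial failure means. The helper name, the JSON persistence, and the retry strategy are purely hypothetical; only `collection.add` and `max_batch_size` come from the discussion.

```python
# Hypothetical caller-side batching with explicit partial-failure handling.
import json


def add_in_batches(collection, max_batch_size, ids, embeddings, metadatas, documents):
    failed = []
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size
        try:
            collection.add(
                ids=ids[start:end],
                embeddings=embeddings[start:end],
                metadatas=metadatas[start:end],
                documents=documents[start:end],
            )
        except Exception:
            # The caller decides what a partial failure means; here we just
            # remember the failed slice so it can be resubmitted later.
            failed.append({"start": start, "end": end})
    if failed:
        # e.g. persist the failed slices to a file for a later retry
        with open("failed_batches.json", "w") as f:
            json.dump(failed, f)
    return failed
```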
@imartinez, let me do that now.
- Refactored the code to no longer do the splitting in _add, _update, _upsert
- Batch utils library remains as a utility method for users to split batches
- Batch size is now validated in SegmentAPI and FastAPI (client-side) to ensure no large batches can be sent through that will result in an internal Chroma error
- Improved tests - new max_batch_size test for API, pre-flight-check test, negative test for large batches

Refs: chroma-core#1049
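A minimal sketch of the kind of up-front validation this commit describes; the real check lives inside SegmentAPI/FastAPI, and the exact error type and message are assumptions.

```python
# Sketch: reject over-sized batches up front instead of failing deep inside Chroma.
def validate_batch_size(batch_len: int, max_batch_size: int) -> None:
    if batch_len > max_batch_size:
        raise ValueError(
            f"Batch size {batch_len} exceeds maximum batch size {max_batch_size}"
        )
```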
@imartinez @HammadB, refactored. The new impl no longer splits batches automatically, though we do ship a utility library for that. Added a couple more tests to ensure correctness. The CIP doc has also been updated to reflect the above changes.
- Minor improvement suggested by @imartinez to pass API to create_batches utility method. Refs: chroma-core#1049
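Assuming the utility lands roughly as described (the module path, keyword names, and tuple shape below are best-effort guesses, so check the merged code), caller-side usage of `create_batches` with the API handle could look like this:

```python
# Hypothetical usage of the batching utility; create_batches takes the API
# handle (per the change above) so it can read max_batch_size itself.
import chromadb
from chromadb.utils.batch_utils import create_batches  # module path assumed

client = chromadb.Client()
collection = client.create_collection("my_collection")

ids = [str(i) for i in range(50_000)]
documents = [f"document {i}" for i in range(50_000)]

# Each batch is assumed to be an (ids, embeddings, metadatas, documents) tuple.
for batch_ids, batch_embeddings, batch_metadatas, batch_documents in create_batches(
    api=client, ids=ids, documents=documents
):
    collection.add(
        ids=batch_ids,
        embeddings=batch_embeddings,
        metadatas=batch_metadatas,
        documents=batch_documents,
    )
```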
Another unrelated failure:

```
chromadb/test/property/test_persist.py .F                              [100%]

=================================== FAILURES ===================================
___________________ test_persist_embeddings_state[settings0] ___________________

caplog = <_pytest.logging.LogCaptureFixture object at 0x7f496ee70550>
settings = Settings(environment='', chroma_db_impl=None, chroma_api_impl='chromadb.api.segment.SegmentAPI', chroma_telemetry_impl...=None, chroma_server_auth_token_transport_header=None, anonymized_telemetry=True, allow_reset=True, migrations='apply')

    def test_persist_embeddings_state(
        caplog: pytest.LogCaptureFixture, settings: Settings
    ) -> None:
        caplog.set_level(logging.ERROR)
        api = chromadb.Client(settings)
        run_state_machine_as_test(
>           lambda: PersistEmbeddingsStateMachine(settings=settings, api=api)
        )  # type: ignore

chromadb/test/property/test_persist.py:226:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.7.17/x64/lib/python3.7/site-packages/hypothesis/stateful.py:216: in run_state_machine_as_test
    run_state_machine(state_machine_factory)
/opt/hostedtoolcache/Python/3.7.17/x64/lib/python3.7/site-packages/hypothesis/stateful.py:112: in run_state_machine
    @given(st.data())
chromadb/test/property/test_embeddings.py:210: in ann_accuracy
    embedding_function=self.embedding_function,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
collection = Collection(name=A01)
record_set = {'documents': [None], 'embeddings': [[0.09765625, 0.430419921875, 0.20556640625, 0.08978271484375, -0.1527099609375, 0.291748046875, ...]], 'ids': ['1'], 'metadatas': [{'1': '10'}]}
n_results = 1, min_recall = 0.95
embedding_function = <chromadb.test.property.strategies.hashing_embedding_function object at 0x7f4956ddcc50>
```
@HammadB, can you rerun the failed tests just to see if we can rule out intermittent failure? If it keeps failing, I'll debug it further.
@tazarov for sure, I've rerun them a handful of times but happy to do so again.
- Fixed a couple of copy/paste issues. Local tests are passing.
- Added a workflow_dispatch to manually trigger test workflows - will be good for development experience

Signed-off-by: sunilkumardash9 <[email protected]>
- Fixed a failing test from rebase
@HammadB, tests in my fork are passing, so this is ready for merge after your review.
- Reverted create_batches function to use API instead of max_batch_size
- Client now caches pre-flight-check values (max_batch_size)
- Code cleanup
Base branch was modified
…#1077)

- Including only CIP for review. Refs: chroma-core#1049

## Description of changes
- Improvements & Bug fixes
- New proposal to handle large batches of embeddings gracefully

## Test plan
- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

## Documentation Changes
TBD

Signed-off-by: sunilkumardash9 <[email protected]>
Co-authored-by: Sunil Kumar Dash <[email protected]>
Refs: #1049
Description of changes
Summarize the changes made by this PR.

Test plan
How are these changes tested?
- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

Documentation Changes
TBD