Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH]: CIP-5: Large Batch Handling Improvements Proposal #1077

Merged

Conversation

tazarov
Copy link
Contributor

@tazarov tazarov commented Sep 1, 2023

  • Including only CIP for review.

Refs: #1049

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • New proposal to handle large batches of embeddings gracefully

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js

Documentation Changes

TBD

@github-actions
Copy link

github-actions bot commented Sep 1, 2023

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)

and `chromadb.api.fastapi.FastAPI`
- `chromadb.api.segment.SegmentAPI` will implement the `max_batch_size` property by fetching the value from the
`Producer` class.
- `chromadb.api.fastapi.FastAPI` will implement the `max_batch_size` by fetching it from a new `/pre-flight-checks`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the assumption that these are fetched once on client creation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't thought about it but actual impl caches the values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- Updated CIP
- Implementation done
- Added a new test in test_add

Refs: chroma-core#1049
@tazarov tazarov changed the title [ENH]: CIP-5 - Large Batch Handling Improvements Proposal [ENH]: CIP-5: Large Batch Handling Improvements Proposal Sep 4, 2023
Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main feedback here is we should not gracefully break up the batches for you, as partial failures are harder to reason about. It's easier to delegate the batching responsibility to the client and have you be responsible for handling partial failures in a way that makes sense to you. Otherwise this LGTM!

@tazarov
Copy link
Contributor Author

tazarov commented Sep 4, 2023

@HammadB, true that we cannot guarantee ACID-like behaviour this way, but even without, things seem to go wrong on very large batches. Do you think we can introduce tenacity (Apache 2 lic) for some added bonus of retries? I think this can be helpful further down the road, where a retry can resolve the problem.

In my experience, in situations where transactionality cannot be guaranteed, we can implement mechanisms to fail gracefully - e.g. store the failed batch in a temp file and allow the user to resubmit later. wdyt?

@imartinez
Copy link

I think this discussion brought by @HammadB and the last proposed solution by @tazarov requires more time and analysis (the last solution also has its implications).

I'd suggest to reduce the scope of this change to just exposing the max_batch_size in the API so it can be merged and clients can handle batches themselves (and add it to the docs) -in the case of privateGPT it is causing errors, I guess there are more clients impacted- and discuss the error management in a different CIP. WDYT?

@tazarov
Copy link
Contributor Author

tazarov commented Sep 5, 2023

@imartinez, let me do that now.

- Refactored the code to no longer do the splitting in _add,_update,_upsert
- Batch utils library remains as a utility method for users to split batches
- Batch size is now validated in SegmentAPI and FastAPI (client-side) to ensure no large batches can be sent through that will result in an internal Chroma error
- Improved tests - new max_batch_size test for API, pre-flight-check test, negative test for large batches.

Refs: chroma-core#1049
@tazarov
Copy link
Contributor Author

tazarov commented Sep 5, 2023

@imartinez @HammadB, Refactored. The new impl no longer splits batches automatically. We do ship a utility library for that though chromadb/utils/batch_utils.py. The API now validates the batch size (client-side), prior to trying to persist it. Everything else is left as-is:

  • pre-flight-check endpoint is still there
  • max_batch_size property is still available at API level for clients to use to break up batches

Added a couple more tests to ensure correctness.

CIP doc has also been updated to reflect above changes.

- Minor improvement suggested by @imartinez to pass API to create_batches utility method.

Refs: chroma-core#1049
@tazarov
Copy link
Contributor Author

tazarov commented Sep 5, 2023

@HammadB,

Another unrelated failure:

chromadb/test/property/test_persist.py .F                                [100%]

=================================== FAILURES ===================================
___________________ test_persist_embeddings_state[settings0] ___________________

caplog = <_pytest.logging.LogCaptureFixture object at 0x7f496ee70550>
settings = Settings(environment='', chroma_db_impl=None, chroma_api_impl='chromadb.api.segment.SegmentAPI', chroma_telemetry_impl...=None, chroma_server_auth_token_transport_header=None, anonymized_telemetry=True, allow_reset=True, migrations='apply')

    def test_persist_embeddings_state(
        caplog: pytest.LogCaptureFixture, settings: Settings
    ) -> None:
        caplog.set_level(logging.ERROR)
        api = chromadb.Client(settings)
        run_state_machine_as_test(
>           lambda: PersistEmbeddingsStateMachine(settings=settings, api=api)
        )  # type: ignore

chromadb/test/property/test_persist.py:226: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.7.17/x64/lib/python3.7/site-packages/hypothesis/stateful.py:216: in run_state_machine_as_test
    run_state_machine(state_machine_factory)
/opt/hostedtoolcache/Python/3.7.17/x64/lib/python3.7/site-packages/hypothesis/stateful.py:112: in run_state_machine
    @given(st.data())
chromadb/test/property/test_embeddings.py:210: in ann_accuracy
    embedding_function=self.embedding_function,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

collection = Collection(name=A01
)
record_set = {'documents': [None], 'embeddings': [[0.09765625, 0.430419921875, 0.20556640625, 0.08978271484375, -0.1527099609375, 0.2[917](https://github.com/chroma-core/chroma/actions/runs/6082759956/job/16515250514?pr=1077#step:5:918)48046875, ...]], 'ids': ['1'], 'metadatas': [{'1': '10'}]}
n_results = 1, min_recall = 0.95
embedding_function = <chromadb.test.property.strategies.hashing_embedding_function object at 0x7f4956ddcc50>

@tazarov
Copy link
Contributor Author

tazarov commented Sep 12, 2023

@HammadB, can you rerun the failed tests just to see if we can rule out intermittent failure? If it keeps failing, I'll debug it further

@HammadB
Copy link
Collaborator

HammadB commented Sep 12, 2023

@tazarov forsure, I've rerun them a handful of times but happy to do so again

tazarov and others added 7 commits September 14, 2023 07:20
- Fixed couple fo copy/paste issues. Local tests are passing.
## Description of changes

*Summarize the changes made by this PR.*
 - Added a workflow_dispatch to manually trigger test workflows
 - will be good for development experience

---------

Signed-off-by: sunilkumardash9 <[email protected]>
- Updated CIP
- Implementation done
- Added a new test in test_add

Refs: chroma-core#1049
- Refactored the code to no longer do the splitting in _add,_update,_upsert
- Batch utils library remains as a utility method for users to split batches
- Batch size is now validated in SegmentAPI and FastAPI (client-side) to ensure no large batches can be sent through that will result in an internal Chroma error
- Improved tests - new max_batch_size test for API, pre-flight-check test, negative test for large batches.

Refs: chroma-core#1049
@tazarov
Copy link
Contributor Author

tazarov commented Sep 15, 2023

@HammadB, Tests in my fork are passing, so this is ready for merge after your review.

chromadb/utils/batch_utils.py Show resolved Hide resolved
chromadb/api/fastapi.py Outdated Show resolved Hide resolved
chromadb/api/fastapi.py Outdated Show resolved Hide resolved
- Reverted create_batches function to use API instead of max_batch_size
- Client now caches pre-flight-check values (max_batch_size)
- Code cleanup
@HammadB HammadB enabled auto-merge (squash) September 18, 2023 17:21
auto-merge was automatically disabled September 18, 2023 18:30

Base branch was modified

@HammadB HammadB merged commit 82b9c83 into chroma-core:main Sep 18, 2023
tazarov added a commit to amikos-tech/chroma-core that referenced this pull request Sep 21, 2023
…#1077)

- Including only CIP for review.

Refs: chroma-core#1049

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
	 - New proposal to handle large batches of embeddings gracefully

## Test plan
*How are these changes tested?*

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

## Documentation Changes
TBD

---------

Signed-off-by: sunilkumardash9 <[email protected]>
Co-authored-by: Sunil Kumar Dash <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants