Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move from v1 to v2 APIs #137

Merged
merged 28 commits into from
Jan 6, 2025
Merged

Move from v1 to v2 APIs #137

merged 28 commits into from
Jan 6, 2025

Conversation

eskibars
Copy link
Contributor

@eskibars eskibars commented Dec 15, 2024

This isn't perfect, but I think it moves a long way forward.

Still TODO:

  • In v2 we made return codes much more sane, as opposed to the embedded errors in v1
  • Testing

Things I'd like to do but might save for other PRs:

  • Make the option to create corpora instead of "just" reset them, now that we have a corpus_key that could do so
    • Set it on by default for all of the crawlers in the config dir
    • We could initiate sensible filters at creation time to make the onboarding experience much smoother
  • If OAuth isn't defined, try to use the API key as a personal API key
  • There are a number of new parameters in these APIs in API v2 that we could enable and would make sense here
  • Parts of the code use API keys while others use OAuth, and there are better ways to handle that
  • There's some stuff we could do to replace metadata in cases where it's been updated and we detect a conflict

vectara:
corpus_key: feynman
reindex: false
create_corpus: true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to make this default true for all examples? I think it's better to make this default false (and in these examples too).

Copy link
Contributor Author

@eskibars eskibars Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just from an ease of use perspective for getting started users: if I've never created a "feynman" corpus before, it's 1 less step. In the future, I think it'd be even a step above to:

  1. Detect if the corpus exists and choose accordingly
  2. Create and/or update filters as part of the creation

In general, it's best to optimize for the user that knows/has less experience with your product in a tool like vectara-ingest

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense, although it's a good starting point, but often vectara-ingest is used repeatedly to update (as in all the moma crawls and askNews crawls). Also I think this requires personal API key.
Let me modify then to pick a few like this, and a few like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does require a personal API key, but parts of vectara-ingest already do in reality.

I'm curious though: why do you think it's better to have a false value for these? It's pretty innocuous to have a "true" value: if the corpus already exists it's not like it gets removed: it just errors on the corpus creation part (which we can catch/except). It seems safer to me in every case to try to create a corpus

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's not a really big deal either way IMO. I just think the news and moma examples lend themselves more to daily updates which are not good candidates for create_corpus. I'd like to demonstrate that us too.
  2. What other pieces of vectara-ingest do you think require personal API key? Trying to avoid making that a requirement, although we could if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything that uses x-api-key in the code relies on at least having an API key in addition to OAuth (whether it's a personal API key or not). At first glance, that's: _does_doc_exist, delete_doc, _list_docs, _index_file, and index_document. I think all of those should work with a non-personal API key, but it feels strange to require both an API key and OAuth to have this functionality, and I strongly suspect any user would just plug in a personal API key, especially given the "regular" API keys are scoped to specific corpora and they'd need lots otherwise

Copy link
Collaborator

@ofermend ofermend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested rather extensively, found many bugs and fixed them. I think it's now good. @eskibars please take another look when you get a chance (after my changes) and then I think we can merge. This will become v2.0.0

README.md Outdated Show resolved Hide resolved
vectara:
corpus_key: feynman
reindex: false
create_corpus: true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does require a personal API key, but parts of vectara-ingest already do in reality.

I'm curious though: why do you think it's better to have a false value for these? It's pretty innocuous to have a "true" value: if the corpus already exists it's not like it gets removed: it just errors on the corpus creation part (which we can catch/except). It seems safer to me in every case to try to create a corpus

@ofermend ofermend merged commit f6d4a1d into main Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants