-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move from v1 to v2 APIs #137
Conversation
vectara: | ||
corpus_key: feynman | ||
reindex: false | ||
create_corpus: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you choose to make this default true for all examples? I think it's better to make this default false (and in these examples too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just from an ease of use perspective for getting started users: if I've never created a "feynman" corpus before, it's 1 less step. In the future, I think it'd be even a step above to:
- Detect if the corpus exists and choose accordingly
- Create and/or update filters as part of the creation
In general, it's best to optimize for the user that knows/has less experience with your product in a tool like vectara-ingest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes sense, although it's a good starting point, but often vectara-ingest is used repeatedly to update (as in all the moma crawls and askNews crawls). Also I think this requires personal API key.
Let me modify then to pick a few like this, and a few like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does require a personal API key, but parts of vectara-ingest already do in reality.
I'm curious though: why do you think it's better to have a false value for these? It's pretty innocuous to have a "true" value: if the corpus already exists it's not like it gets removed: it just errors on the corpus creation part (which we can catch/except). It seems safer to me in every case to try to create a corpus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It's not a really big deal either way IMO. I just think the news and moma examples lend themselves more to daily updates which are not good candidates for create_corpus. I'd like to demonstrate that us too.
- What other pieces of vectara-ingest do you think require personal API key? Trying to avoid making that a requirement, although we could if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything that uses x-api-key in the code relies on at least having an API key in addition to OAuth (whether it's a personal API key or not). At first glance, that's: _does_doc_exist
, delete_doc
, _list_docs
, _index_file
, and index_document
. I think all of those should work with a non-personal API key, but it feels strange to require both an API key and OAuth to have this functionality, and I strongly suspect any user would just plug in a personal API key, especially given the "regular" API keys are scoped to specific corpora and they'd need lots otherwise
bugfixes in indexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested rather extensively, found many bugs and fixed them. I think it's now good. @eskibars please take another look when you get a chance (after my changes) and then I think we can merge. This will become v2.0.0
vectara: | ||
corpus_key: feynman | ||
reindex: false | ||
create_corpus: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does require a personal API key, but parts of vectara-ingest already do in reality.
I'm curious though: why do you think it's better to have a false value for these? It's pretty innocuous to have a "true" value: if the corpus already exists it's not like it gets removed: it just errors on the corpus creation part (which we can catch/except). It seems safer to me in every case to try to create a corpus
This isn't perfect, but I think it moves a long way forward.
Still TODO:
Things I'd like to do but might save for other PRs: