refactor document registering and upload #441
Merged
Follow-up to #434 (comment). Highlights:
Before this PR, our upload method was a two-step process: first we registered the documents with Ragna and returned upload parameters. In a second step, we just acted on the return value and sent the documents. This process is very flexible, as it allows uploading the document almost anywhere, e.g. an S3 bucket. However, it is also opaque and, for most use cases, unnecessarily complex: from what we have seen so far, users either upload to the local disk or don't upload at all (see "managed" Ragna #256) and just want to register. In a proper deployment, it is also possible to mount external storage on the path where Ragna stores the documents, eliminating the need to upload to a different sink directly.
Thus, this PR simplifies a lot of things. The register endpoint now only returns the document schema, which includes its ID. With that, one can hit the upload endpoint, which will unconditionally upload to the local disk. This allows us to remove the `Document.get_upload_info` and `LocalDocument.decode_upload_token` methods.
The document register endpoint now also accepts metadata to be stored alongside the document. Previously, metadata could only be attached by the document class itself by adding it to the upload information and decoding it again on upload (yes, this is complex and a reason why I refactored it). In addition to the metadata passed to the endpoint, the document class can now also update the metadata in its constructor for default behavior. This is what we do for the path of a `LocalDocument`.
#404 (Add endpoint for batch uploading document metadata) added support for registering multiple documents at the same time. This PR closes #407 (Complete support for batch document upload) by also supporting batch upload of documents. In fact, we remove the single register / upload endpoints, because they make little sense now.
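To make the simplified flow more concrete, here is a minimal in-memory sketch of batch registering followed by batch uploading. It is not the code from this PR: `SketchLocalDocument`, `SketchDocumentStore`, the request dictionaries, and the `"path"` metadata key are all hypothetical names chosen for illustration. The sketch only assumes what is described above: registering returns documents with an ID, metadata can be passed at registration, the document class can fill in a default path in its constructor, and uploading always writes to the local disk.

```python
import tempfile
import uuid
from pathlib import Path


class SketchLocalDocument:
    """Hypothetical stand-in for a local document; not Ragna's actual class."""

    def __init__(self, name, root, metadata=None):
        self.id = str(uuid.uuid4())
        self.name = name
        # Keep caller-supplied metadata and only fill in a default path.
        self.metadata = dict(metadata or {})
        self.metadata.setdefault("path", str(Path(root) / self.id))

    @property
    def path(self):
        return Path(self.metadata["path"])


class SketchDocumentStore:
    """Hypothetical in-memory registry mimicking batch register and upload."""

    def __init__(self, root):
        self.root = Path(root)
        self.documents = {}

    def register(self, requests):
        # Step 1: register a batch; each result carries the ID needed for upload.
        registered = []
        for request in requests:
            document = SketchLocalDocument(
                request["name"], self.root, metadata=request.get("metadata")
            )
            self.documents[document.id] = document
            registered.append(document)
        return registered

    def upload(self, contents):
        # Step 2: unconditionally write each registered document to local disk.
        for document_id, content in contents.items():
            document = self.documents[document_id]  # KeyError if never registered
            document.path.parent.mkdir(parents=True, exist_ok=True)
            document.path.write_bytes(content)


def demo():
    with tempfile.TemporaryDirectory() as root:
        store = SketchDocumentStore(root)
        first, second = store.register(
            [
                {"name": "a.txt", "metadata": {"author": "alice"}},
                {"name": "b.txt"},
            ]
        )
        store.upload({first.id: b"hello", second.id: b"world"})
        return (
            first.metadata["author"],
            "path" in second.metadata,
            first.path.read_bytes(),
            second.path.read_bytes(),
        )
```

Calling `demo()` registers two documents in one batch, uploads both in a second call, and reads the stored bytes back from the temporary directory. Keeping the path in the metadata is what would let a deployment point it at mounted external storage without a separate upload sink.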
The `Document.is_readable` abstract method was removed. It was only used as a check before reading, so that we could raise a good error if something was wrong. However, this check might be expensive. Plus, it is reasonable to assume that `Document.read()` will raise an error with a proper message, removing the need for a pre-check altogether.
Since these are quite a few changes already, I refrained from touching the UI as well. I have the code ready to be reviewed as soon as this PR is accepted. Note that the change to the documents currently leaves the UI defunct.
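As an illustration of the reasoning behind removing the `Document.is_readable` pre-check, the following sketch attempts the read directly and turns a failure into a descriptive error. `DocumentReadError` and `read_document` are hypothetical names used only here; they are not part of Ragna.

```python
import tempfile
from pathlib import Path


class DocumentReadError(Exception):
    """Hypothetical error type used only in this sketch."""


def read_document(path):
    # No separate readability check: attempt the read and translate failures.
    try:
        return Path(path).read_bytes()
    except OSError as error:
        raise DocumentReadError(
            f"Could not read document at {path}: {error}"
        ) from error


def demo():
    with tempfile.TemporaryDirectory() as root:
        present = Path(root) / "present.txt"
        present.write_bytes(b"content")
        missing = Path(root) / "missing.txt"
        try:
            read_document(missing)
        except DocumentReadError as error:
            message = str(error)
        else:
            message = None
        return read_document(present), message
```

The read itself is the check, so the happy path does no extra I/O, and a failing read still surfaces a message that names the offending document.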