refactor document registering and upload #441

pmeier · 2024-07-08T08:52:55Z

Follow-up to #434 (comment). Highlights:

Before this PR, uploading was a two-step process: first we register the documents with Ragna and get upload parameters back. In a second step we act on that return value and send the documents. This process is very flexible, as it allows uploading the documents almost anywhere, e.g. to an S3 bucket. However, it is also opaque and, for most use cases, unnecessarily complex: from what we have seen so far, users either upload to the local disk or don't upload at all (see "managed" Ragna #256) and just want to register. In a proper deployment it is also possible to mount external storage on the path where Ragna stores the documents, which eliminates the need to upload to a different sink directly.

Thus, this PR simplifies a lot of things. The register endpoint now only returns the document schema, which includes its ID. With that one can hit the upload endpoint, which will unconditionally upload to the local disk. This allows us to remove the Document.get_upload_info and LocalDocument.decode_upload_token methods.
The document register endpoint now also accepts metadata to be stored alongside the document. Previously, metadata could only be attached by the document class itself, by adding it to the upload information and decoding it again on upload (yes, this is complex and a reason why I refactored it). In addition to the metadata passed to the endpoint, the document class can now also update the metadata in its constructor to provide defaults. This is what we do for the path of a LocalDocument.
#404 (Add endpoint for batch uploading document metadata) added support for registering multiple documents at the same time. This PR closes #407 (Complete support for batch document upload) by also supporting batch upload of documents. In fact, we remove the single register / upload endpoints, because they make little sense now.
The Document.is_readable abstract method was removed. It was only used as a pre-check before reading, so that we could raise a good error if something was wrong. However, this check might be expensive. Plus, it is reasonable to assume that Document.read() will raise a proper error message itself, which removes the need for a pre-check altogether.
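
To make the highlights above concrete, here is a minimal, framework-free sketch of the new flow: batch register with metadata, batch upload to local disk by ID, and reading without a pre-check. Everything in it (`LocalDocumentSketch`, `register_documents`, `upload_documents`, the `registry` dict, the `root` parameter) is a hypothetical stand-in for illustration and not Ragna's actual code or API.

```python
import tempfile
import uuid
from pathlib import Path


class LocalDocumentSketch:
    """Hypothetical stand-in for a local document; not Ragna's actual class."""

    def __init__(self, name, metadata=None, root=None):
        self.id = uuid.uuid4()
        self.name = name
        self.metadata = dict(metadata or {})
        # The constructor adds default metadata on top of what the caller
        # passed in; here that default is the storage path on local disk.
        root = Path(root or tempfile.gettempdir())
        self.metadata.setdefault("path", str(root / str(self.id)))

    @property
    def path(self):
        return Path(self.metadata["path"])

    def read(self):
        # No is_readable() pre-check: if the file was never uploaded,
        # read_bytes() raises FileNotFoundError with a usable message.
        return self.path.read_bytes()


def register_documents(registry, items, root):
    """Batch register: store names and metadata, return documents with IDs."""
    documents = [
        LocalDocumentSketch(name, metadata, root=root) for name, metadata in items
    ]
    for document in documents:
        registry[document.id] = document
    return documents


def upload_documents(registry, contents):
    """Batch upload: write each payload to its registered document's path."""
    for document_id, content in contents.items():
        registry[document_id].path.write_bytes(content)
```

A caller would first register, keep the returned IDs, and then upload the contents keyed by those IDs. Calling `read()` on a document that was registered but never uploaded simply raises, which is the behavior the removal of the pre-check relies on.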

Since these are quite a few changes already, I refrained from touching the UI as well. I have that code ready for review as soon as this PR is accepted. Note that the change to the documents currently leaves the UI non-functional.

nenb

I'm less familiar with the parts that touch the new 'engine', but the rest of the PR seemed sensible to me. I left some clarification comments.

ragna/core/_document.py

nenb · 2024-07-09T13:27:47Z

ragna/deploy/_api.py

+
+    @router.put("/documents")
+    async def upload_documents(
+        user: UserDependency, documents: list[UploadFile]


Question: Does this load all files into memory on the server?

It does not. The UploadFile from FastAPI actually wraps a SpooledTemporaryFile, i.e. it will only be kept in memory for small files and otherwise temporarily be stored on disk.
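
The spooling behavior described in the answer can be observed with the standard library alone. The sketch below is an illustration, not FastAPI code: the helper name `rolled_over_to_disk` and the 100-byte threshold are made up for the example, and `_rolled` is a private CPython attribute used here only to peek at the internal state.

```python
from tempfile import SpooledTemporaryFile


def rolled_over_to_disk(payload: bytes, max_size: int) -> bool:
    """Write payload to a spooled file and report whether it left memory."""
    with SpooledTemporaryFile(max_size=max_size) as file:
        file.write(payload)
        # _rolled is a private CPython attribute (an assumption of this
        # sketch); it becomes True once the data exceeds max_size and the
        # buffer has been moved to a real temporary file on disk.
        return file._rolled


small = rolled_over_to_disk(b"x" * 10, max_size=100)   # stays in memory
large = rolled_over_to_disk(b"x" * 200, max_size=100)  # spills to disk
```

Small payloads stay in an in-memory buffer, while larger ones are transparently moved to a temporary file, so a batch of large uploads does not have to fit in server memory at once.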

refactor document upload

24785d5

pmeier added the dev: deploy label Jul 8, 2024

pmeier requested review from nenb and blakerosenthal July 8, 2024 08:52

cleanup

e45b305

This was referenced Jul 8, 2024

remove redis as dependency #442

Merged

use backend engine in UI #443

Merged

#211 Enable users to upload folders with file_uploader component in… #446

Closed

nenb approved these changes Jul 9, 2024

fix error messages

eb25b28

pmeier merged commit 40f0b6c into deploy-dev Jul 9, 2024
3 of 9 checks passed

pmeier deleted the document-engine branch July 9, 2024 15:09

pmeier mentioned this pull request Jul 10, 2024

introduce engine for API (#434) #440

Closed

pmeier mentioned this pull request Dec 3, 2024

use bokeh_fastapi through panel #503

Merged
