REST API file-upload schema for `meta` #3790

TuanaCelik · 2022-12-30T11:45:27Z

Currently the meta field for the file-upload endpoint expects a string. Which in rest_api/controller/file_upload.py is then loaded as json:
meta_form = json.loads(meta) or {}

So if you want to upload a file with meta you have to provide meta as follows (in Python for example):

import requests

url = "http://localhost:8000/file-upload"

files = {"files": ("2.txt", open("data/build_a_scalable_question_answering_system/2.txt", "rb"), "text/plain")}
payload = {"meta": '{"uploader": "Tuana"}'}
headers = {"accept": "application/json"}

response = requests.post(url, data=payload, files=files, headers=headers)

Whereas for getting documents by filters the schema is slightly different but also a bit more intuitive imo. Filters are also a dict, but the schema is set as such we can do the following for the request:

import requests

url = "http://localhost:8000/documents/get_by_filters"

payload = {"filters": {"uploader": "Tuana"}}
headers = {
    "accept": "application/json",
    "content-type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

Suggestion: Change the schema to expect a dict for the meta field too. That way via the API reference users only have to worry about providing keys and values. And in code they provide an object rather than an object wrapped up as a string.

The text was updated successfully, but these errors were encountered:

manulpatel · 2023-02-24T06:46:02Z

Hello! @TuanaCelik. I am new to Heystack and want to work on this issue. Could you please guide me further?

TuanaCelik · 2023-02-24T16:53:58Z

Hey @manulpatel - great to hear! We have our contributing guidelines here
Usually the way it goes is that you would create a fork, make your changes and open a PR. Then someone from the team will review it for you and ask for changes if needed.

The issue I think originates from here
And you can test it out by running the rest api yourself. You might be able to send requests through this too: https://docs.haystack.deepset.ai/reference/upload_file

manulpatel · 2023-02-24T17:09:30Z

Thanks a lot! @TuanaCelik. I am following the guidelines and hoping for a succcessful PR.

w1gs · 2023-09-08T04:09:17Z

Hello!

After reviewing this, I wanted to get some more feedback before opening a PR. The problem here lies in the fact that the upload request has a content-type of multipart/form-data. In multipart/form-data requests, any JSON (or dict structs) must first be serialized into a string. Changing the schema to dict server-side will just cause the request to fail as it's expecting an actual dict object but always receiving a string.

The first solution is to have the requester JSON-encode the metadata before sending. Here's a Python example using json.dumps:

import json
import requests
meta = {"Uploader":"wigzy", "Type":"Article", "fields":{"Name":"article.txt"}}

url = "http://localhost:8000/file-upload?keep_files=true"

files = {"files": ("api-upload.txt", open("./api-upload.txt", "rb"), "text/plain")}
payload = {"meta": json.dumps(meta)}
headers = {"accept": "application/json"}

response = requests.post(url, data=payload, files=files, headers=headers)

This should produce a string of the metadata that can be processed server-side (with the current code). The request will look like this:

However, this doesn't really solve the initial problem since some processing still has to be done client-side and the JSON library would be specific to each language making the request.

Solution 2:

I found a workaround that would send the dict in the correct format without additional client libraries. If you wrap the dict object into a list object, the format should be correct. The issue with this is now instead of double quotes (which is valid JSON), the final string has its keys and values in single quotes which can't be processed with the json.loads() method. The data could be manipulated and processed on the server side so the final result is a dict object, but there would be no guarantee that the actual metadata won't be tampered with as well.

Here's the same code as above, just with the payload modified:

files = {"files": ("api-upload.txt", open("./api-upload.txt", "rb"), "text/plain")}
payload = {"meta": [meta]}
headers = {"accept": "application/json"}

response = requests.post(url, data=payload, files=files, headers=headers)

This goes back to the previous point. You still have to add the list brackets client-side (which is pretty much the same thing as adding quotes).

Solution 3:

The final solution I came up with is using a combination of the list technique above and processing it server-side with the Python ast library:

meta = ast.literal_eval(meta)

This converts the "JSON" string to a Python object. It does a really good job but has some flaws. The main one being lack of JSON and other language support. What I mean by that is if you pass it a valid JSON value of false, it will throw a value error as false is not a valid Python type (but False is). The same scenario applies to nil or none. This can be worked around by replacing these values with their Python equivalent server-side, but then again if the metadata content has the word "none" in it there's a chance it can get modified unintentionally. This is the short list of replacements I was able to come up with:

sanitize_meta = {
        'null', 'None',
        'none', 'None'
        'nil', 'None',
        'true', 'True',
        'false','False'
    }

I was also considering the idea of just base64 encoding the file bytes and sending it all as a JSON request but that's highly inefficient.

I can definitely get a PR in for any of these solutions, but then again, do the pros outweigh the cons of just having to stringify the metadata? Sorry for the long post, just figured this might help someone else out that attempts to tackle this issue if it doesn't get resolved. 😅

TuanaCelik added topic:rest_api type:enhancement labels Dec 30, 2022

masci added good first issue Good for newcomers P2 Medium priority, add to the next sprint if no P1 available labels Jan 11, 2023

bogdankostic mentioned this issue Jan 23, 2023

Make query endpoint ready for production workloads #3755

Closed

manulpatel mentioned this issue Mar 3, 2023

refactor: Change meta schema #4321

Closed

6 tasks

julian-risch added the Contributions wanted! Looking for external contributions label Mar 13, 2023

mayankjobanputra mentioned this issue Apr 7, 2023

Implement REST API for new pipelines #4427

Closed

masci removed the good first issue Good for newcomers label Sep 13, 2023

masci removed Contributions wanted! Looking for external contributions P2 Medium priority, add to the next sprint if no P1 available labels Dec 13, 2023

masci added the wontfix This will not be worked on label Mar 12, 2024

masci closed this as completed Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REST API file-upload schema for `meta` #3790

REST API file-upload schema for `meta` #3790

TuanaCelik commented Dec 30, 2022

manulpatel commented Feb 24, 2023

TuanaCelik commented Feb 24, 2023

manulpatel commented Feb 24, 2023 •

edited

Loading

w1gs commented Sep 8, 2023

REST API file-upload schema for meta #3790

REST API file-upload schema for meta #3790

Comments

TuanaCelik commented Dec 30, 2022

manulpatel commented Feb 24, 2023

TuanaCelik commented Feb 24, 2023

manulpatel commented Feb 24, 2023 • edited Loading

w1gs commented Sep 8, 2023

REST API file-upload schema for `meta` #3790

REST API file-upload schema for `meta` #3790

manulpatel commented Feb 24, 2023 •

edited

Loading