Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REST API file-upload schema for meta #3790

Closed
Tracked by #3755
TuanaCelik opened this issue Dec 30, 2022 · 4 comments
Closed
Tracked by #3755

REST API file-upload schema for meta #3790

TuanaCelik opened this issue Dec 30, 2022 · 4 comments
Labels

Comments

@TuanaCelik
Copy link
Contributor

Currently the meta field for the file-upload endpoint expects a string. Which in rest_api/controller/file_upload.py is then loaded as json:
meta_form = json.loads(meta) or {}

So if you want to upload a file with meta you have to provide meta as follows (in Python for example):

import requests

url = "http://localhost:8000/file-upload"

files = {"files": ("2.txt", open("data/build_a_scalable_question_answering_system/2.txt", "rb"), "text/plain")}
payload = {"meta": '{"uploader": "Tuana"}'}
headers = {"accept": "application/json"}

response = requests.post(url, data=payload, files=files, headers=headers)

image

Whereas for getting documents by filters the schema is slightly different but also a bit more intuitive imo. Filters are also a dict, but the schema is set as such we can do the following for the request:

import requests

url = "http://localhost:8000/documents/get_by_filters"

payload = {"filters": {"uploader": "Tuana"}}
headers = {
    "accept": "application/json",
    "content-type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

image

Suggestion: Change the schema to expect a dict for the meta field too. That way via the API reference users only have to worry about providing keys and values. And in code they provide an object rather than an object wrapped up as a string.

@masci masci added good first issue Good for newcomers P2 Medium priority, add to the next sprint if no P1 available labels Jan 11, 2023
@manulpatel
Copy link

Hello! @TuanaCelik. I am new to Heystack and want to work on this issue. Could you please guide me further?

@TuanaCelik
Copy link
Contributor Author

Hey @manulpatel - great to hear! We have our contributing guidelines here
Usually the way it goes is that you would create a fork, make your changes and open a PR. Then someone from the team will review it for you and ask for changes if needed.

The issue I think originates from here
And you can test it out by running the rest api yourself. You might be able to send requests through this too: https://docs.haystack.deepset.ai/reference/upload_file

@manulpatel
Copy link

manulpatel commented Feb 24, 2023

Thanks a lot! @TuanaCelik. I am following the guidelines and hoping for a succcessful PR.

@w1gs
Copy link
Contributor

w1gs commented Sep 8, 2023

Hello!

After reviewing this, I wanted to get some more feedback before opening a PR. The problem here lies in the fact that the upload request has a content-type of multipart/form-data. In multipart/form-data requests, any JSON (or dict structs) must first be serialized into a string. Changing the schema to dict server-side will just cause the request to fail as it's expecting an actual dict object but always receiving a string.

The first solution is to have the requester JSON-encode the metadata before sending. Here's a Python example using json.dumps:

import json
import requests
meta = {"Uploader":"wigzy", "Type":"Article", "fields":{"Name":"article.txt"}}

url = "http://localhost:8000/file-upload?keep_files=true"

files = {"files": ("api-upload.txt", open("./api-upload.txt", "rb"), "text/plain")}
payload = {"meta": json.dumps(meta)}
headers = {"accept": "application/json"}

response = requests.post(url, data=payload, files=files, headers=headers)

This should produce a string of the metadata that can be processed server-side (with the current code). The request will look like this:

image

However, this doesn't really solve the initial problem since some processing still has to be done client-side and the JSON library would be specific to each language making the request.

Solution 2:

I found a workaround that would send the dict in the correct format without additional client libraries. If you wrap the dict object into a list object, the format should be correct. The issue with this is now instead of double quotes (which is valid JSON), the final string has its keys and values in single quotes which can't be processed with the json.loads() method. The data could be manipulated and processed on the server side so the final result is a dict object, but there would be no guarantee that the actual metadata won't be tampered with as well.

Here's the same code as above, just with the payload modified:

files = {"files": ("api-upload.txt", open("./api-upload.txt", "rb"), "text/plain")}
payload = {"meta": [meta]}
headers = {"accept": "application/json"}

response = requests.post(url, data=payload, files=files, headers=headers)
image

This goes back to the previous point. You still have to add the list brackets client-side (which is pretty much the same thing as adding quotes).

Solution 3:

The final solution I came up with is using a combination of the list technique above and processing it server-side with the Python ast library:

meta = ast.literal_eval(meta)

This converts the "JSON" string to a Python object. It does a really good job but has some flaws. The main one being lack of JSON and other language support. What I mean by that is if you pass it a valid JSON value of false, it will throw a value error as false is not a valid Python type (but False is). The same scenario applies to nil or none. This can be worked around by replacing these values with their Python equivalent server-side, but then again if the metadata content has the word "none" in it there's a chance it can get modified unintentionally. This is the short list of replacements I was able to come up with:

sanitize_meta = {
        'null', 'None',
        'none', 'None'
        'nil', 'None',
        'true', 'True',
        'false','False'
    }

I was also considering the idea of just base64 encoding the file bytes and sending it all as a JSON request but that's highly inefficient.

I can definitely get a PR in for any of these solutions, but then again, do the pros outweigh the cons of just having to stringify the metadata? Sorry for the long post, just figured this might help someone else out that attempts to tackle this issue if it doesn't get resolved. 😅

@masci masci removed the good first issue Good for newcomers label Sep 13, 2023
@masci masci removed Contributions wanted! Looking for external contributions P2 Medium priority, add to the next sprint if no P1 available labels Dec 13, 2023
@masci masci added the wontfix This will not be worked on label Mar 12, 2024
@masci masci closed this as completed Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants