Forward slash in dag_run_id gives rise to trouble accessing things through the REST API #20063

Closed
2 tasks done
davidavdav opened this issue Dec 6, 2021 · 14 comments · Fixed by #23106
Labels
area:core kind:bug This is a clearly a bug

Comments

@davidavdav
Contributor

Apache Airflow version

2.1.4

Operating System

linux

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==2.2.0
apache-airflow-providers-celery==2.0.0
apache-airflow-providers-cncf-kubernetes==2.0.2
apache-airflow-providers-docker==2.1.1
apache-airflow-providers-elasticsearch==2.0.3
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-google==5.1.0
apache-airflow-providers-grpc==2.0.1
apache-airflow-providers-hashicorp==2.1.0
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-microsoft-azure==3.1.1
apache-airflow-providers-mysql==2.1.1
apache-airflow-providers-postgres==2.2.0
apache-airflow-providers-redis==2.0.1
apache-airflow-providers-sendgrid==2.0.1
apache-airflow-providers-sftp==2.1.1
apache-airflow-providers-slack==4.0.1
apache-airflow-providers-sqlite==2.0.1
apache-airflow-providers-ssh==2.1.1

Deployment

Docker-Compose

Deployment details

We tend to trigger dag runs by some external event, e.g., a media-file upload, see #19745. It is useful to use the media-file path as a dag run id. The media-id can come with some partial path, e.g., path/to/mediafile. All this seems to work fine in Airflow, but we can't figure out a way to use such a dag run id in the REST API, as the forward slashes / interfere with the API routing.

What happened

When using the API route api/v1/dags/{dag_id}/dagRuns/{dag_run_id} in, e.g., an HTTP GET, we expect a dag run to be found when dag_run_id has the value path/to/mediafile, but instead a response with .status: 404 is returned. When we change the dag_run_id to the format path|to|mediafile, the dag run is returned.

What you expected to happen

We would expect a dag run to be returned, even if its ID contains the character /.

How to reproduce

Trigger a dag using a dag_run_id that contains a /, then try to retrieve it through the REST API.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@davidavdav davidavdav added area:core kind:bug This is a clearly a bug labels Dec 6, 2021
@boring-cyborg

boring-cyborg bot commented Dec 6, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@uranusjr
Member

uranusjr commented Dec 6, 2021

Hmm, this is tricky. We can’t just allow slash in dag_run_id either because there are endpoints like /dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances that would cause ambiguity.
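The ambiguity can be illustrated with a small, self-contained sketch. It does not use Flask or Airflow; the `ROUTES` table and the regex-based matcher are hypothetical stand-ins for the two endpoints named above, meant only to show why letting a path parameter span slashes is a problem.

```python
import re

# Hypothetical route templates, modelled on the endpoints named above.
ROUTES = {
    "get_dag_run": "/dags/{dag_id}/dagRuns/{dag_run_id}",
    "get_task_instances": "/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances",
}


def compile_route(template, allow_slash):
    """Turn a template into a regex; parameters optionally match '/'."""
    segment = r".+" if allow_slash else r"[^/]+"
    pattern = re.sub(
        r"\{(\w+)\}",
        lambda m: f"(?P<{m.group(1)}>{segment})",
        template,
    )
    return re.compile(f"^{pattern}$")


def matching_routes(path, allow_slash):
    """Return the names of all routes whose pattern matches the path."""
    return [
        name
        for name, template in ROUTES.items()
        if compile_route(template, allow_slash).match(path)
    ]


# A slash in the run ID adds path segments, so strict routing matches nothing.
print(matching_routes("/dags/my_dag/dagRuns/path/to/mediafile", allow_slash=False))

# Letting parameters span slashes makes this request match both routes:
# is the run ID "run_1", or is it "run_1/taskInstances"?
print(matching_routes("/dags/my_dag/dagRuns/run_1/taskInstances", allow_slash=True))
```

With strict segments the first request matches no route, which corresponds to the 404 in the report; with slash-spanning parameters the second request matches both routes.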

Does using api/v1/dags/{dag_id}/dagRuns/path%2Fto%2Fmediafile give you a correct result? This might be the best solution if that works right now; but if it does not, we might need to disallow slashes in run IDs altogether instead.
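For anyone wanting to try this from a client, a minimal sketch of building such a URL with the Python standard library follows. The base URL is an assumption, and `dag_run_url` is a hypothetical helper; whether the server then resolves the encoded ID is a separate question.

```python
from urllib.parse import quote

BASE = "http://localhost:8080/api/v1"  # assumed host and port


def dag_run_url(dag_id, dag_run_id):
    """Build the dagRuns URL with every path component fully percent-encoded."""
    # safe="" makes quote() escape "/" too; by default quote() leaves "/" alone.
    return (
        f"{BASE}/dags/{quote(dag_id, safe='')}"
        f"/dagRuns/{quote(dag_run_id, safe='')}"
    )


print(dag_run_url("my_dag", "path/to/mediafile"))
# http://localhost:8080/api/v1/dags/my_dag/dagRuns/path%2Fto%2Fmediafile
```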

@potiuk
Member

potiuk commented Dec 6, 2021

Agree. The %2F is the only "good" way to go. And if it does not work - I think we should probably fix it. This might become handy when we add multi-tenancy. For example, "/" seems to be a nice, enforceable convention that can be used to separate dag namespaces.

@davidavdav
Contributor Author

Currently, see #19745, %2f does not work. But I am not sure whether it should: if the HTTP protocol treats %2f as equivalent to /, then it might arrive as such in the Flask routing. Then again, if the HTTP protocol means %2f as an escape, maybe it should work. I don't know the protocol that well...

@uranusjr
Member

uranusjr commented Dec 8, 2021

I'll spend some time tomorrow looking into this. Percent encoding should work according to the specs.

@potiuk
Member

potiuk commented Dec 8, 2021

Yep. It definitely should

@uranusjr
Member

uranusjr commented Dec 9, 2021

I tracked this down to pallets/flask#900, and as Armin mentioned when he closed the ticket, the problem is in WSGI—%2f is treated exactly as / and there’s no way to work around that in the application layer because %2f has already been decoded when it’s received by e.g. Flask.
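The effect can be reproduced without any server. The sketch below approximates how a WSGI server derives PATH_INFO from the request target; it is modelled on what the standard library's wsgiref does (percent-decode the path before handing it to the application), and it is an assumption that other WSGI servers behave equivalently. `wsgi_path_info` is a hypothetical name.

```python
from urllib.parse import unquote


def wsgi_path_info(request_target):
    """Approximate how a WSGI server turns a request target into PATH_INFO."""
    path = request_target.split("?", 1)[0]  # drop the query string
    return unquote(path, "iso-8859-1")      # decode percent-escapes


encoded = wsgi_path_info("/api/v1/dags/my_dag/dagRuns/path%2Fto%2Fmediafile")
literal = wsgi_path_info("/api/v1/dags/my_dag/dagRuns/path/to/mediafile")

# The application sees the same string either way, so the router cannot tell
# an escaped slash from a real path separator.
print(encoded == literal)  # True
```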

@andrewgodwin Does ASGI have the same problem? If not, I wonder if it’d work if we move from Gunicorn to e.g. Uvicorn and run Flask with WsgiToAsgi.


Update: I played with this a bit. This would still be a problem in asgiref.wsgi because, according to the ASGI spec:

path (Unicode string) – HTTP request target excluding any query string, with percent-encoded sequences and UTF-8 byte sequences decoded into characters.

But there’s also raw_path that does retain the undecoded URL, and with some hacks I was able to make things “work” by doing the following:

  • Adopt an ASGI server (I used Uvicorn).
  • Implement an ASGI-WSGI adapter that retains the percent-encoded path.
  • Implement a custom URL adapter that percent-decodes the path components instead.

But of course the question is whether this is worth the hassle 😛
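The core idea behind those steps (split first, decode afterwards) can be sketched with plain Python. This is not the adapter described above; the scope dictionary is a hand-written example of an ASGI HTTP scope, and `split_raw_path` is a hypothetical helper.

```python
from urllib.parse import unquote


def split_raw_path(scope):
    """Split the undecoded path on '/', then percent-decode each component."""
    raw = scope["raw_path"].decode("ascii")  # raw_path is a byte string
    return [unquote(part) for part in raw.split("/") if part]


# Hand-written example scope; a real ASGI server would supply this.
scope = {
    "type": "http",
    "path": "/api/v1/dags/my_dag/dagRuns/path/to/mediafile",
    "raw_path": b"/api/v1/dags/my_dag/dagRuns/path%2Fto%2Fmediafile",
}

print(split_raw_path(scope))
# ['api', 'v1', 'dags', 'my_dag', 'dagRuns', 'path/to/mediafile']
```

Because the split happens before decoding, the escaped slashes stay inside a single component instead of creating new path segments.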

@potiuk
Member

potiuk commented Dec 9, 2021

Bummer

@andrewgodwin
Contributor

Ah yes, this old biscuit. ASGI behaves the same in this regard - as HTTP specifies that %2f is equal to a /, ASGI also force-urldecodes everything before the application has a chance to look at the query string. raw_path is there for the people who complained about this, but in reality, you shouldn't expect to discriminate between escapes and the characters they represent (e.g. a proxy in between you and the client may muck with them).

I'd suggest either double-escaping these values, which may prove a little tricky in terms of backwards compatibility, or fix the routing so it's an unambiguous path that can be routed by Flask.
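As a rough illustration of the double-escaping option, the sketch below encodes the value twice on the client, lets the server's automatic decoding remove one layer, and decodes the second layer in the application. The function name is hypothetical and nothing here reflects an existing Airflow behaviour.

```python
from urllib.parse import quote, unquote


def double_escape(value):
    """Percent-encode twice so one layer survives the server's decoding."""
    return quote(quote(value, safe=""), safe="")


wire = double_escape("path/to/mediafile")
print(wire)                  # path%252Fto%252Fmediafile

after_server = unquote(wire)  # what the application would receive
print(after_server)          # path%2Fto%2Fmediafile (no literal slash)

run_id = unquote(after_server)  # second decode, done by the application
print(run_id)                # path/to/mediafile
```

The backwards-compatibility catch is visible here: a client that still sends a singly encoded ID would have it decoded one time too many, and an ID that legitimately contains a literal `%2F` would be altered.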

@potiuk
Member

potiuk commented Dec 9, 2021

Thanks @uranusjr @andrewgodwin. TIL.

I believe (WDYT) the best way to "fix" it is to warn people about (and deprecate) the use of "/" in the ID and explain that they won't be able to access it via the API until they change it. We can't straight disallow it, but we can add a warning in the UI and logs about it. And then we disallow it in Airflow 3.

We could potentially also disallow "/" in new DAGs while warning on updating existing ones. That might be a bit disruptive to someone who dynamically generates dags though.
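A warn-first approach could look roughly like the sketch below. `check_run_id` and its message are hypothetical and are not taken from the Airflow code base; the point is only that the ID is still accepted while a deprecation warning is emitted.

```python
import warnings


def check_run_id(run_id):
    """Accept the run ID, but warn if it contains a forward slash."""
    if "/" in run_id:
        warnings.warn(
            f"run_id {run_id!r} contains '/'; such runs cannot be reached "
            "through the REST API and may be rejected in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
    return run_id


# Capture warnings explicitly so the demonstration works under any filter.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_run_id("path/to/mediafile")  # warns, but is still accepted
    check_run_id("path|to|mediafile")  # no warning

print(len(caught))  # 1
```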

@davidavdav
Contributor Author

I think it is possible to include dag_id and/or dag_run_id filtering in the /dags GET api using parameters, see this minimal example server:

#!/usr/bin/env python3

## See if we can find a way to send path/to/mediafile-like arguments to a RESTful API

from flask import Flask
from flask_restful import reqparse, Resource, Api
import logging

app = Flask(__name__)

api = Api(app)

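## Read both arguments from the query string only, so that a plain GET
## without a JSON body parses cleanly.
parser = reqparse.RequestParser()
parser.add_argument("dag_id", required=False, location="args")
parser.add_argument("dag_run_id", required=False, location="args")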

class Dag(Resource):
    def get(self):
        args = parser.parse_args()
        ## do clever DB access and filtering here, for now we just return the dag_run_id (default None)
        return args.dag_run_id

api.add_resource(Dag, "/dags")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
    app.run(host='0.0.0.0', port=4000, debug=True)

Test using curl 'http://localhost:4000/dags?dag_run_id=path/to/mediafile'

This could keep the current API routes. Parameters like details=true, tasks=true, or task_id=$task_id could even be added to cover other parts of the DAG access API, in cases where identifiers would otherwise conflict with Flask routing.
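On the client side, the reason this avoids the routing problem can be seen with the standard library alone: the run ID travels in the query string, so the path that the router sees never changes. The sketch assumes the example server above is listening on localhost:4000.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# urlencode() percent-encodes "/" in values; the path stays untouched.
url = "http://localhost:4000/dags?" + urlencode({"dag_run_id": "path/to/mediafile"})
print(url)  # http://localhost:4000/dags?dag_run_id=path%2Fto%2Fmediafile

parts = urlsplit(url)
print(parts.path)                           # /dags
print(parse_qs(parts.query)["dag_run_id"])  # ['path/to/mediafile']
```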

@uranusjr
Member

This means we’ll need to maintain two separate sets of URLs, which is suboptimal to me. I’d much rather just deprecate the use of / in DAG run IDs. We already disallow it for DAG and task IDs, and it is arguably an oversight that the same is not enforced on DAG run IDs.

@potiuk
Member

potiuk commented Dec 12, 2021

Agree. Sounds like a hack. I also thought about it and it's actually kinda misleading to use / for different purposes in a URL. Also if we use %2F, it's no better.

Good example here (a bit different, but that's what we eventually will have to live with):

https://github.com/apache/airflow/pkgs/container/airflow%2Fv2-1-test%2Fci%2Fpython3.7

@jaklan

jaklan commented May 24, 2024

Sorry for bumping the old issue, but maybe you have a quick answer for this: AWS added support for the REST API in MWAA, but the problem is that usernames in MWAA start with the assumed-role/ prefix, so they contain a slash.

Is there any way to use the /users endpoints without receiving the error "The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again."?

Neither /users/assumed-role/foobar nor /users/assumed-role%2Ffoobar seems to work.
