Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a utility to safely join URLs using urllib.parse.urljoin #33062

Closed

Conversation

pankajkoti
Copy link
Member

While working with log_url property of the task instances,
it is observed that urljoin ignores the part of the path after
the last slash specified in the base_url when it does not end
with a trailing slash and Airflow webserver does not allow
setting the base_url with a trailing slash. Additionally, it is also
observed that if the relative URL has a leading slash, urljoin
just ingores the base URL and returns the relative URL.
Hence, we add a new utlity method safe_urljoin to handle
these cases.

closes: #32996


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

While working with `log_url` property of the task instances,
it is observed that `urljoin` ignores the part of the path after
the last slash specified in the base_url when it does not end
with a trailing slash and Airflow webserver does not allow
setting the base_url with a trailing slash. Additionally,
it is also observed that if the relative URL has a leading
slash, `urljoin` just ingores the base URL and returns the
relative URL. Hence, we add a new utlity method `safe_urljoin`
to handle these cases.

closes: apache#32996
@uranusjr
Copy link
Member

uranusjr commented Aug 3, 2023

Instead of needing to jump through all these, I would just fix all the second arguments to use absolute paths (e.g. /graph) and just use prepend the prefix as a string, without going through urljoin at all.

@pankajkoti
Copy link
Member Author

pankajkoti commented Aug 3, 2023

Instead of needing to jump through all these, I would just fix all the second arguments to use absolute paths (e.g. /graph) and just use prepend the prefix as a string, without going through urljoin at all.

I think the problem comes when the base URL is set to contain no trailing slash in the config.
e.g. like mentioned in the issue when the base path is set to https://my.astronomer.run/path, urljoin takes the base path as just https://my.astronomer.run/

Some cases are mentioned here #32996 (comment) on where the log_url is desired to be used

@uranusjr
Copy link
Member

uranusjr commented Aug 3, 2023

See my comment in #31833 (comment)

The prefix should not have a trailing slash. The path that gets appended to it should have a leading slash, and urljoin is the wrong function to join the two. The fix should be to not use urljoin, not to wrangle in additional logic to call urljoin.

@pankajkoti
Copy link
Member Author

okay, understood. Thanks a lot @uranusjr .

Closing this PR and will create a new PR with the suggested changes.

@pankajkoti pankajkoti closed this Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Task instance log_url is overwrites existing path in base_url
2 participants