Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task instance log_url is overwrites existing path in base_url #32996

Closed
1 of 2 tasks
wolfier opened this issue Aug 1, 2023 · 3 comments · Fixed by #33063
Closed
1 of 2 tasks

Task instance log_url is overwrites existing path in base_url #32996

wolfier opened this issue Aug 1, 2023 · 3 comments · Fixed by #33063
Assignees
Labels
area:webserver Webserver related Issues kind:bug This is a clearly a bug

Comments

@wolfier
Copy link
Contributor

wolfier commented Aug 1, 2023

Apache Airflow version

2.6.3

What happened

A task instance's log_url does not contain the full URL defined in base_url.

What you think should happen instead

The base_url may contain paths that should be acknowledged when build the log_url.

The log_url is built with urljoin. Due to how urljoin builds URLs, any existing paths are ignored leading to a faulty URL.

How to reproduce

This snippet showcases how urljoin ignores existing paths when building the url.

>>> from urllib.parse import urljoin
>>> 
>>> 
>>> urljoin(
...     "https://my.astronomer.run/path",
...     f"log?execution_date=test"
...     f"&task_id=wow"
...     f"&dag_id=super"
...     f"&map_index=-1",
... )
'https://eochgroup.astronomer.run/log?execution_date=test&task_id=wow&dag_id=super&map_index=-1'

Operating System

n/a

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else

This was introduced by #31833.

A way to fix this can be to utilize urlsplit and urlunsplit to account for existing paths.

from urllib.parse import urlsplit, urlunsplit

parts = urlsplit("https://my.astronomer.run/paths")
urlunsplit((
        parts.scheme,
        parts.netloc,
        f"{parts.path}/log",
        f"execution_date=test"
        f"&task_id=wow"
        f"&dag_id=super"
        f"&map_index=-1",
        ""
    )
)

Here is the fix in action.

>>> parts = urlsplit("https://my.astronomer.run/paths")
>>> urlunsplit((
...     parts.scheme,
...     parts.netloc,
...     f"{parts.path}/log",
...     f"execution_date=test"
...     f"&task_id=wow"
...     f"&dag_id=super"
...     f"&map_index=-1",
...     ''))
'https://my.astronomer.run/paths/log?execution_date=test&task_id=wow&dag_id=super&map_index=-1'
>>>
>>> parts = urlsplit("https://my.astronomer.run/paths/test")
>>> urlunsplit((
...     parts.scheme,
...     parts.netloc,
...     f"{parts.path}/log",
...     f"execution_date=test"
...     f"&task_id=wow"
...     f"&dag_id=super"
...     f"&map_index=-1",
...     ''))
'https://my.astronomer.run/paths/test/log?execution_date=test&task_id=wow&dag_id=super&map_index=-1'

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wolfier wolfier added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Aug 1, 2023
@pankajkoti pankajkoti self-assigned this Aug 1, 2023
@pankajkoti
Copy link
Member

pankajkoti commented Aug 1, 2023

yes, I think how urljoin works is it only takes the base URL until the last /and skips remaining parts(path). If the base URL ended with a trailing /, it would have worked fine.

Additionally the relatively URL should not contain a leading /, otherwise, the output will only be the relative URL.

There are some examples mentioned here: python/cpython#96015

@jscheffl jscheffl added area:webserver Webserver related Issues and removed area:core labels Aug 1, 2023
@pankajkoti pankajkoti removed the needs-triage label for new issues that we didn't triage yet label Aug 2, 2023
@pankajkoti
Copy link
Member

pankajkoti commented Aug 2, 2023

hi @wolfier where exactly are we seeing the effect of this in the UI?

I tried to set a custom path and able to fetch logs for the task in the UI (looks like it is able to get the right path).

@wolfier
Copy link
Contributor Author

wolfier commented Aug 2, 2023

I am seeing this in places that can reference task instances and its attributes. This can be in the following places but not limited to:

  • python callables
  • callbacks (on_failure_callback)
  • jinja templating ( {{ ti.log_url }} )

pankajkoti added a commit to astronomer/airflow that referenced this issue Aug 3, 2023
While working with `log_url` property of the task instances,
it is observed that `urljoin` ignores the part of the path after
the last slash specified in the base_url when it does not end
with a trailing slash and Airflow webserver does not allow
setting the base_url with a trailing slash. Additionally,
it is also observed that if the relative URL has a leading
slash, `urljoin` just ingores the base URL and returns the
relative URL. Hence, we add a new utlity method `safe_urljoin`
to handle these cases.

closes: apache#32996
pankajkoti added a commit to astronomer/airflow that referenced this issue Aug 3, 2023
It is observed that urljoin is not yielding expected results for
the task instance's log_url which needs to be a concatenation of the
webserver base_url and specified relative url. The current usage
of urljoin does not seem to be the right way to achieve this based
on what urljoin is meant for and how it works. So, we use simple
string concatenation to yield the desired result.
More context in the comment apache#31833 (comment)

closes: apache#32996
eladkal pushed a commit that referenced this issue Aug 3, 2023
It is observed that urljoin is not yielding expected results for
the task instance's log_url which needs to be a concatenation of the
webserver base_url and specified relative url. The current usage
of urljoin does not seem to be the right way to achieve this based
on what urljoin is meant for and how it works. So, we use simple
string concatenation to yield the desired result.
More context in the comment #31833 (comment)

closes: #32996
ephraimbuddy pushed a commit that referenced this issue Aug 3, 2023
It is observed that urljoin is not yielding expected results for
the task instance's log_url which needs to be a concatenation of the
webserver base_url and specified relative url. The current usage
of urljoin does not seem to be the right way to achieve this based
on what urljoin is meant for and how it works. So, we use simple
string concatenation to yield the desired result.
More context in the comment #31833 (comment)

closes: #32996
(cherry picked from commit baa1bc0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:webserver Webserver related Issues kind:bug This is a clearly a bug
Projects
None yet
3 participants