Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show custom instance names for a mapped task in UI #36797

Merged
merged 28 commits into from
Feb 24, 2024

Conversation

RNHTTR
Copy link
Contributor

@RNHTTR RNHTTR commented Jan 15, 2024


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

airflow/models/dagrun.py Outdated Show resolved Hide resolved
@uranusjr
Copy link
Member

I’m thinking the following flow:

  1. Define map_index_template on the operator (done)
  2. Instead of just forwarding the template to TaskInstance, we only store the rendered result on it.
    • Say we call it map_rendered_name (bikeshedding welcomed)
    • We don’t really want to show the template anywhere anyway (it’s arguably worse than map_index without rendering since every mapped instance will have the exact same value).
    • When a task instance is created (not yet run), the field will be NULL in the database (trivial, all columns are nullable by default).
  3. In _run_raw_task, we render the value (like you already did), and store the result in self.map_rendered_name.
    • This will be flushed to the database when the function ends along with other states.
  4. Show the value in the web UI.

We’ll also need a database migration for the new field on TaskInstance.

Does this sound reasonable?

@RNHTTR RNHTTR requested a review from uranusjr January 20, 2024 17:44
@RNHTTR RNHTTR force-pushed the custom-mapped-task-indices branch 2 times, most recently from 52a4979 to 02ca753 Compare January 23, 2024 20:47
@uranusjr
Copy link
Member

Also need to look into test failures and static check errors, and fix them.

@RNHTTR RNHTTR mentioned this pull request Feb 7, 2024
2 tasks
@uranusjr
Copy link
Member

Forgot to actually make rendered_map_index visible in the UI 🤦

@RNHTTR RNHTTR marked this pull request as ready for review February 23, 2024 03:42
@uranusjr
Copy link
Member

Some todos after this PR is merged:

  • There are still places in the UI we still show the raw map index. We should either change them to show the rendered value, or both (in e.g. table views).
  • In the REST API, should we add a way for the user to filter with the rendered value, instead of map_index? A new parameter is probably needed if we do. But I’m not exactly sure we need to do it (since the API is for machines anyway?)

@MrBounty
Copy link

Hello,

I'm new to Airflow and open-source in general, and I'm currently implementing Airflow within my work. I'm very interested in the feature you're working on. Thank you for your contributions!

I'd like to ask about the possibility of getting access to this feature before its official release. I understand it may be part of the upcoming Airflow 2.9.0 version, which is currently 83% complete without a set due date.

Would it be possible to access the feature by creating a branch from my version of Airflow that includes your changes? I'm still learning the ropes of Git, so any guidance on the feasibility of this would be greatly appreciated.

If this approach is possible, I'm happy to test the feature and provide feedback.
Thank you,

@potiuk
Copy link
Member

potiuk commented Feb 23, 2024

Would it be possible to access the feature by creating a branch from my version of Airflow that includes your changes? I'm still learning the ropes of Git, so any guidance on the feasibility of this would be greatly appreciated.

Sure but it is does not come with any guarantees whatsoever, it might break any time and if it breaks it might stay broken for a long time. However if you will, you can definitely run airflow locally from latest main and if you see any issues, you can even fix them via PR (this is what most people contributing here are doing).

You can follow contribution flow -> see https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst as an entry point - then you have local virtualenv or Breeze contenerized development environment working - where you can more or less easily run airflow, but you have to be prepared for all the cases - like having to wipe your database, reinstall things from scratch, having your machine blow up becaus of used memory and CPU and all the bad things that can happen during the development. We cannot recommend it for running anything tha resembles productoin, and you should not put too much faith on being able to continue using the same instance when 2.9.0 is released, most likely you will have to reinstall everything from the scratch and wipe your database (and you will likely do it several times till the release date). But it also gives you the opportunity to contribute back and improve things.

Also for that you need to learn git and branching etc. based on the contributing guide, there is an expectation that you know what you are doing there.

There is no other "approach" that allows you to run things which are not released officially in production. Whatever we do here is purely for development and contribuion purposes. If you want to be exclusively a user, then releasing the software is a Legal Act of the Apache Software Foundation, and only then when the software is formally released and there are 3 binding +1 votes from the PMC members, the software we release should be used by the users who are not contributors. This has legal, licencing implications and even if we would like to, we cannot ever say that whatever we have in the repo is "usable" by users. It's usable to do contributions. Full stop.

Copy link
Contributor

@bbovenzi bbovenzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UI parts look good to me. We can go through and find more areas where we want to show rendered map index in a follow-up PR.

@uranusjr uranusjr merged commit 6a00111 into apache:main Feb 24, 2024
59 checks passed
@tirkarthi
Copy link
Contributor

Are migrations being worked on in another PR? This breaks main branch testing for me since rendered_map_index doesn't have migration.

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such column: task_instance.rendered_map_index
[SQL: SELECT task_instance.try_number, task_instance.task_id, task_instance.dag_id, task_instance.run_id, task_instance.map_index, task_instance.start_date, task_instance.end_date, task_instance.duration, task_instance.state, task_instance.max_tries, task_instance.hostname, task_instance.unixname, task_instance.job_id, task_instance.pool, task_instance.pool_slots, task_instance.queue, task_instance.priority_weight, task_instance.operator, task_instance.custom_operator_name, task_instance.queued_dttm, task_instance.queued_by_job_id, task_instance.pid, task_instance.executor_config, task_instance.updated_at, task_instance.rendered_map_index, task_instance.external_executor_id, task_instance.trigger_id, task_instance.trigger_timeout, task_instance.next_method, task_instance.next_kwargs, dag_run_1.state AS state_1, dag_run_1.id, dag_run_1.dag_id AS dag_id_1, dag_run_1.queued_at, dag_run_1.execution_date, dag_run_1.start_date AS start_date_1, dag_run_1.end_date AS end_date_1, dag_run_1.run_id AS run_id_1, dag_run_1.creating_job_id, dag_run_1.external_trigger, dag_run_1.run_type, dag_run_1.conf, dag_run_1.data_interval_start, dag_run_1.data_interval_end, dag_run_1.last_scheduling_decision, dag_run_1.dag_hash, dag_run_1.log_template_id, dag_run_1.updated_at AS updated_at_1, dag_run_1.clear_number 
FROM task_instance JOIN dag_run ON dag_run.dag_id = task_instance.dag_id AND dag_run.run_id = task_instance.run_id JOIN dag_run AS dag_run_1 ON dag_run_1.dag_id = task_instance.dag_id AND dag_run_1.run_id = task_instance.run_id 
WHERE task_instance.dag_id = ? AND dag_run.execution_date >= ? AND dag_run.execution_date <= ? AND ((task_instance.task_id, task_instance.map_index) NOT IN (SELECT 1, 1 FROM (SELECT 1, 1) WHERE 1!=1)) ORDER BY dag_run.execution_date]
[parameters: ('task_duration', '2024-01-07 00:00:00.000000', '2024-01-10 05:41:51.314018')]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

@tirkarthi
Copy link
Contributor

tirkarthi commented Feb 26, 2024

Migration PR in #37708

@MrBounty
Copy link

Would it be possible to access the feature by creating a branch from my version of Airflow that includes your changes? I'm still learning the ropes of Git, so any guidance on the feasibility of this would be greatly appreciated.

Sure but it is does not come with any guarantees whatsoever, it might break any time and if it breaks it might stay broken for a long time. However if you will, you can definitely run airflow locally from latest main and if you see any issues, you can even fix them via PR (this is what most people contributing here are doing).

You can follow contribution flow -> see https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst as an entry point - then you have local virtualenv or Breeze contenerized development environment working - where you can more or less easily run airflow, but you have to be prepared for all the cases - like having to wipe your database, reinstall things from scratch, having your machine blow up becaus of used memory and CPU and all the bad things that can happen during the development. We cannot recommend it for running anything tha resembles productoin, and you should not put too much faith on being able to continue using the same instance when 2.9.0 is released, most likely you will have to reinstall everything from the scratch and wipe your database (and you will likely do it several times till the release date). But it also gives you the opportunity to contribute back and improve things.

Also for that you need to learn git and branching etc. based on the contributing guide, there is an expectation that you know what you are doing there.

There is no other "approach" that allows you to run things which are not released officially in production. Whatever we do here is purely for development and contribuion purposes. If you want to be exclusively a user, then releasing the software is a Legal Act of the Apache Software Foundation, and only then when the software is formally released and there are 3 binding +1 votes from the PMC members, the software we release should be used by the users who are not contributors. This has legal, licencing implications and even if we would like to, we cannot ever say that whatever we have in the repo is "usable" by users. It's usable to do contributions. Full stop.

Thank you! This is what I was missing, specially the how to contribute. I will take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants