
Airflow hangs on "Found x duplicates in table task_fail. Will attempt to move them" #23472

Closed
1 of 2 tasks
tpapaj-CKPL opened this issue May 4, 2022 · 9 comments
Labels
affected_version:2.3 Issues Reported for 2.3 area:core kind:bug This is a clearly a bug

Comments


tpapaj-CKPL commented May 4, 2022

Apache Airflow version

2.3.0 (latest released)

What happened

After upgrading from 2.2.2 to 2.3.0, when I try to run airflow db upgrade, all I get is the following log line, and then Airflow hangs:
Found 16 duplicates in table task_fail. Will attempt to move them.

What you think should happen instead

Airflow should not hang at this log line; it should move these duplicates and continue the migration

How to reproduce

  • Run Airflow 2.2.2
  • Create duplicates in table task_fail
  • Update Airflow to 2.3.0 and try upgrading DB
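
The reproduction steps above can be sketched in miniature. The snippet below is a toy stand-in (in-memory SQLite, not the real metadata DB) for what the 2.3.0 pre-upgrade check conceptually does: find rows in task_fail that share a duplicated key and move them aside so a uniqueness constraint can be added. The key (dag_id, task_id, execution_date) and the _task_fail_moved table name are assumptions for illustration, not the exact migration internals.

```python
import sqlite3

# Toy stand-in for the Airflow metadata DB; the real task_fail table has
# more columns. The duplicate key used here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_fail (
        id INTEGER PRIMARY KEY,
        dag_id TEXT, task_id TEXT, execution_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO task_fail (dag_id, task_id, execution_date) VALUES (?, ?, ?)",
    [
        ("etl", "extract", "2022-05-01"),
        ("etl", "extract", "2022-05-01"),  # duplicate key
        ("etl", "load", "2022-05-01"),
    ],
)

# Find duplicated keys, as the pre-upgrade check does conceptually.
dupes = conn.execute("""
    SELECT dag_id, task_id, execution_date, COUNT(*) AS n
    FROM task_fail
    GROUP BY dag_id, task_id, execution_date
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('etl', 'extract', '2022-05-01', 2)]

# Move every row sharing a duplicated key into a side table, then delete
# the originals, clearing the way for a unique constraint.
conn.execute("CREATE TABLE _task_fail_moved AS SELECT * FROM task_fail WHERE 0")
for dag_id, task_id, execution_date, _ in dupes:
    key = (dag_id, task_id, execution_date)
    conn.execute(
        "INSERT INTO _task_fail_moved SELECT * FROM task_fail "
        "WHERE dag_id = ? AND task_id = ? AND execution_date = ?", key)
    conn.execute(
        "DELETE FROM task_fail "
        "WHERE dag_id = ? AND task_id = ? AND execution_date = ?", key)
```

On a real deployment this logic runs inside airflow db upgrade against the configured metadata database; the sketch only shows the shape of the operation that was hanging.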

Operating System

Debian 10

Versions of Apache Airflow Providers

apache-airflow-providers-apache-cassandra==2.0.1
apache-airflow-providers-apache-hive==2.0.2
apache-airflow-providers-celery==2.1.0
apache-airflow-providers-cncf-kubernetes==2.0.2
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-jdbc==2.0.1
apache-airflow-providers-mysql==2.1.1
apache-airflow-providers-papermill==2.2.3
apache-airflow-providers-postgres==2.2.0
apache-airflow-providers-sftp==2.1.1
apache-airflow-providers-sqlite==2.0.1
apache-airflow-providers-ssh==2.1.1
apache-airflow-providers-google==5.1.0
apache-airflow-providers-apache-beam==3.1.0
apache-airflow-providers-apache-spark==2.0.1

Deployment

Other Docker-based deployment

Deployment details

Google GKE
2 instances of Airflow processes running (HA configuration)

Anything else

Problem occurs every time I am trying to update database.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

@tpapaj-CKPL added the area:core and kind:bug labels on May 4, 2022
@jedcunningham (Member) commented:
Hi @TPapajCin,

We discovered recently that one of the pre-upgrade checks didn't perform very well. It was fixed in #23458.

Could you try with that patch applied in your environment and report back?

I'd also be curious to know how many rows you have in dag_run, task_instance, and rendered_task_instance_fields, if you don't mind sharing (over Slack is fine as well if you'd rather DM me).

Thanks!
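
The row counts asked about above can be read straight out of the metadata database with plain COUNT(*) queries. A minimal sketch, using an in-memory SQLite stand-in for the real metadata DB (on a real deployment you would connect psql or psycopg2 to the configured sql_alchemy_conn instead; the table names are the real Airflow ones, the rest is illustrative):

```python
import sqlite3

TABLES = ("dag_run", "task_instance", "rendered_task_instance_fields")

# Stand-in metadata DB with empty versions of the three tables of interest.
conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
conn.execute("INSERT INTO dag_run DEFAULT VALUES")  # one sample row

# The actual query of interest: SELECT COUNT(*) per table.
counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in TABLES
}
print(counts)
```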

@jedcunningham added the affected_version:2.3 label on May 4, 2022
@tpapaj-CKPL (Author) commented:

I can confirm that the fix worked! The "Creating tables" log appeared 10 seconds after the "Found x duplicates" log. The migrations completed successfully and my Airflow instances now run correctly on 2.3.0. Thank you 👍

About the environment, I do not have access to the database but from the UI I can tell that I have around 200k DAG runs and around 400-450k task runs in total. There are ~35 DAG files.

ldacey (Contributor) commented May 5, 2022

The same thing affected me, likely due to a large number of records and a custom entrypoint.

6,856,476 rows for task_instance
92,682 rows for rendered_task_instance_fields
900,570 rows for dag_run

Ultimately I was able to get things up and running. Here are the steps I took; I would not recommend this, but it may help if someone gets stuck like I did.

  • Temporarily installed the patch mentioned above
  • Created a container, navigated to the site-packages airflow folder, and ran the alembic commands one by one using the migration reference
  • I hit errors during several steps, such as "Migrate RTIF to use run_id and map_index" due to null run_id values, and "Add map_index to TaskFail" due to a constraint failure (records missing from the task_instance table)
  • I ran alembic upgrade before:after --sql to generate the raw SQL, then copied, pasted, and executed each statement inside my postgresql container. This let me move forward, but ultimately I had to delete ~85 records from the task_fail table that did not exist in the task_instance table
  • This worked; Airflow was up and running, so I installed apache-airflow==2.3.0 instead of the patch and was good to go
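
The ~85 stray task_fail records described above (rows with no matching task_instance) can be located with an anti-join before deleting them. A sketch against a toy SQLite schema; the join key here (dag_id, task_id, run_id) is a simplification of the real post-2.3 key, which also involves map_index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, run_id TEXT);
    CREATE TABLE task_fail     (dag_id TEXT, task_id TEXT, run_id TEXT);
    INSERT INTO task_instance VALUES ('etl', 'extract', 'run_1');
    INSERT INTO task_fail VALUES ('etl', 'extract', 'run_1');  -- has a parent
    INSERT INTO task_fail VALUES ('etl', 'load',    'run_1');  -- orphan
""")

# Anti-join: task_fail rows with no matching task_instance row.
orphans = conn.execute("""
    SELECT tf.dag_id, tf.task_id, tf.run_id
    FROM task_fail tf
    LEFT JOIN task_instance ti
      ON  tf.dag_id = ti.dag_id
      AND tf.task_id = ti.task_id
      AND tf.run_id = ti.run_id
    WHERE ti.dag_id IS NULL
""").fetchall()
print(orphans)  # [('etl', 'load', 'run_1')]

# Deleting the orphans clears the way for the constraint the
# "Add map_index to TaskFail" migration tries to add.
conn.execute("""
    DELETE FROM task_fail
    WHERE rowid IN (
        SELECT tf.rowid
        FROM task_fail tf
        LEFT JOIN task_instance ti
          ON  tf.dag_id = ti.dag_id
          AND tf.task_id = ti.task_id
          AND tf.run_id = ti.run_id
        WHERE ti.dag_id IS NULL
    )
""")
```

As in the thread, deleting metadata rows by hand is a last resort; back up the database first.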

@jedcunningham (Member) commented:

Happy to hear it. Thanks for testing it out @TPapajCin!

Just for some extra clarity, the inefficient query was part of the pre-migration checks, which would have automatically moved the bad rows @ldacey ran into to a separate table. You actually skipped the inefficient query entirely with how you upgraded, but you ended up in the same place at the end of the day (airflow db upgrade would just have done it without manual intervention).

In case you don't know, you can also easily get a patch file directly from GitHub: just add .patch to the end of the URL, e.g. https://github.com/apache/airflow/pull/23458.patch
You can then apply it with patch.

The fix is already slated for 2.3.1, which should be coming pretty soon, so I'm going to close this issue.

ldacey (Contributor) commented May 5, 2022

Got it, thanks. I did not know about that .patch method either.

I think my issue was that airflow db upgrade would start running and just hang "indefinitely" with no log output or indication of progress. I let it run for over an hour before I canceled it and went with the individual alembic commands.

@jedcunningham (Member) commented:

I see, that's good to know, thanks. We have some more optimizations in the pipeline that will hopefully speed it up even more.

ashb (Member) commented May 5, 2022

@ldacey What DB are you on? Do I remember correctly that you are an MSSQL shop?

ldacey (Contributor) commented May 9, 2022

Sorry for the late reply; I am using PostgreSQL. I assume I just became impatient and canceled the db upgrade command prematurely. I am still not sure why my task_fail table had some rows (85) that did not exist in the task_instance table, though.

zachliu (Contributor) commented May 20, 2022

#23458 is a life saver! A 20-hour db migration became 20 seconds! 😂

curl -L https://github.com/apache/airflow/pull/23458.patch | git apply -v --index
