
RdsDeleteDbInstanceOperator sometimes does not complete when run with deferrable=True #35563

Closed
cliebBS opened this issue Nov 9, 2023 · 8 comments
Labels
area:providers, Can't Reproduce, good first issue, kind:bug, pending-response, provider:amazon, stale

Comments

@cliebBS
Contributor

cliebBS commented Nov 9, 2023

Apache Airflow version

2.7.3

What happened

When using the RdsDeleteDbInstanceOperator with deferrable=True to delete an RDS instance in my DAG, the operator sometimes misses the fact that the instance was deleted and leaves logs like:

[2023-11-09, 19:32:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:32:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:33:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:33:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:34:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:34:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:35:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:35:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:36:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 

In the above example, the instance was actually deleted at some point between 19:33:14 and 19:33:44, but the operator never notices and keeps polling for the instance's status until it exhausts waiter_max_attempts, at which point it fails. Retries of the task then fail immediately with:

[2023-11-09, 19:57:11 UTC] {taskinstance.py:1937} ERROR - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 2] Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/rds.py", line 690, in execute
    delete_db_instance = self.hook.conn.delete_db_instance(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the DeleteDBInstance operation: User: arn:aws:sts::account-num:assumed-role/an-iam-role/botocore-session is not authorized to perform: rds:DeleteDBInstance on resource: arn:aws:rds:us-east-1:account-num:db:test-db because no identity-based policy allows the rds:DeleteDBInstance action

I /think/ what's happening is that the deletion completes in the window between when the waiter times out and when it is rescheduled to run again. Since a deleted RDS instance leaves no trace in the AWS API (unlike a terminated EMR cluster, for example), re-running the waiter leaves the operator in a weird state where it believes the resource still exists but cannot get a status for it.
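
For illustration only (plain boto3, not the provider's actual waiter code; the function name and parameters are made up), here is a minimal sketch of a deletion poller that treats DBInstanceNotFoundFault as the terminal success state — the case that seems to get lost when the trigger resumes after the instance has already disappeared:

import time

import boto3


def wait_for_db_instance_deleted(db_instance_identifier: str, delay: int = 30, max_attempts: int = 40) -> None:
    # Hypothetical poller, not the provider's waiter implementation.
    rds = boto3.client("rds")
    for _ in range(max_attempts):
        try:
            response = rds.describe_db_instances(DBInstanceIdentifier=db_instance_identifier)
        except rds.exceptions.DBInstanceNotFoundFault:
            # Instance is gone: for a delete, "not found" is the success state,
            # not an empty status to keep polling on.
            return
        print(f"DB Instance status is: {response['DBInstances'][0]['DBInstanceStatus']}")
        time.sleep(delay)
    raise TimeoutError(f"{db_instance_identifier} still present after {max_attempts} attempts")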

What you think should happen instead

The RdsDeleteDbInstanceOperator should always complete in the success state when the RDS instance is deleted.

How to reproduce

import datetime
from typing import Any

from airflow import DAG
from airflow.decorators import task
from airflow.providers.amazon.aws.operators.rds import (
    RdsCreateDbInstanceOperator,
    RdsDeleteDbInstanceOperator,
)

NUM_INSTANCES = 6  # set this higher to make it easier to reproduce in a single run

# TODO: set these appropriately for your environment
INSTANCE_PREFIX = "rds-delete-repro"

# I don't think we can make this any smaller
STORAGE_SIZE = 20
INSTANCE_TYPE = "db.t4g.micro"


@task()
def get_names(num_instances: int) -> list[str]:
    return [f"{INSTANCE_PREFIX}-{x}" for x in range(num_instances)]


@task()
def get_rds_kwargs(names: list[str]) -> list[dict[str, Any]]:
    rds_kwargs: dict[str, Any] = {
        "EngineVersion": "14.8",
        "DBParameterGroupName": "default.postgres14",
        "MasterUsername": "postgres",
        "MasterUserPassword": "postgres",
        "StorageType": "gp3",
        "AllocatedStorage": STORAGE_SIZE,
        "Port": 5432,
        "NetworkType": "IPV4",
        "MultiAZ": False,
        "BackupRetentionPeriod": 0,  # disables automatic backup.
        "PubliclyAccessible": False,
        "AutoMinorVersionUpgrade": False,
        "DeletionProtection": False,
        "StorageEncrypted": True,
        # TODO: set these to valid values for your AWS account
        # "VpcSecurityGroupIds": ["sg-0ced89fe13d1ae49a"],
        # "DBSubnetGroupName": "dbsubnet-rds-dev-eks",
    }

    return [{"db_instance_identifier": name, "rds_kwargs": rds_kwargs} for name in names]


with DAG(
    dag_id="rds_delete_repro",
    default_args={"retries": 1},
    start_date=datetime.datetime(2021, 1, 1),
    catchup=False,
    schedule_interval=None,
) as dag:
    db_instance_ids = get_names(NUM_INSTANCES)

    rds_kwargs = get_rds_kwargs(names=db_instance_ids)

    create_rds = RdsCreateDbInstanceOperator.partial(
        task_id="create_rds",
        db_instance_class=INSTANCE_TYPE,
        engine="postgres",
    ).expand_kwargs(rds_kwargs)

    delete_rds = RdsDeleteDbInstanceOperator.partial(
        task_id="delete_rds",
        rds_kwargs={
            "SkipFinalSnapshot": True,
            "DeleteAutomatedBackups": True,
        },
        deferrable=True,
    ).expand(
        db_instance_identifier=db_instance_ids,
    )

    rds_kwargs >> create_rds >> delete_rds

You can increase the number of RDS instances this will spin up at once to increase the odds that you'll trigger this problem. It also seems like this is much easier to reproduce when running on EKS using the official Helm chart vs running locally using the official docker-compose.yaml, though I have no idea why that should be making a difference.
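
As a possible interim workaround (not from the original report, and only a sketch that assumes the operator's wait_for_completion flag disables its own waiter), the waiting could instead be done with boto3's built-in "db_instance_deleted" waiter, which treats a missing instance as success:

from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.rds import RdsHook


@task()
def wait_for_db_deleted(db_instance_identifier: str) -> None:
    # RdsHook().conn is the underlying boto3 RDS client; "db_instance_deleted"
    # is botocore's built-in waiter for this API.
    waiter = RdsHook().conn.get_waiter("db_instance_deleted")
    waiter.wait(
        DBInstanceIdentifier=db_instance_identifier,
        WaiterConfig={"Delay": 30, "MaxAttempts": 60},
    )

In the repro DAG this would mean passing wait_for_completion=False (and dropping deferrable=True) on delete_rds and chaining wait_for_db_deleted.expand(db_instance_identifier=db_instance_ids) after it.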

Operating System

Official Docker image

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.10.0

Deployment

Docker-Compose

Deployment details

macOS Sonoma 14.0

❯ docker version
Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.42 (downgraded from 1.43)
 Go version:        go1.21.3
 Git commit:        afdd53b4e3
 Built:             Thu Oct 26 07:06:42 2023
 OS/Arch:           darwin/arm64
 Context:           colima

Server:
 Engine:
  Version:          23.0.6
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       9dbdbd4b6d7681bd18c897a6ba0376073c2a72ff
  Built:            Fri May 12 13:54:36 2023
  OS/Arch:          linux/arm64
  Experimental:     true
 containerd:
  Version:          v1.7.0
  GitCommit:        1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
 runc:
  Version:          1.1.7
  GitCommit:        860f061b76bb4fc671f0f9e900f7d80ff93d4eb7
 docker-init:
  Version:          0.19.0
  GitCommit:
❯ docker-compose version
Docker Compose version 2.23.0
❯ colima version
colima version 0.5.6
git commit: ceef812c32ab74a49df9f270e048e5dced85f932

runtime: docker
arch: aarch64
client: v24.0.7
server: v23.0.6
❯ limactl --version
limactl version 0.18.0
❯ qemu-img --version
qemu-img version 8.1.2
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
❯ kubectl version
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.17-eks-4f4795d

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@cliebBS cliebBS added the area:core, kind:bug, and needs-triage labels Nov 9, 2023
@Taragolis
Contributor

I think it might be fixed in 8.11.0; it would be nice if you could check.

Reference PR:

@cliebBS
Contributor Author

cliebBS commented Nov 10, 2023

Oh, nice! I pulled that change into my Docker image and I've managed two successful runs of my actual DAG so far that delete six RDS instances each. I guess I'll be waiting eagerly for 8.11.0 to be released.

@cliebBS cliebBS closed this as completed Nov 10, 2023
@cliebBS
Contributor Author

cliebBS commented Nov 13, 2023

After further testing of this fix, I discovered that it does not seem to make the problem any better when using deferrable=True.

@cliebBS cliebBS reopened this Nov 13, 2023
@Taragolis Taragolis removed the pending-response and needs-triage labels Nov 14, 2023
@Taragolis
Contributor

@cliebBS Could you update the examples and description to note that this operator fails only in deferrable mode?

@cliebBS cliebBS changed the title RdsDeleteDbInstanceOperator sometimes does not complete RdsDeleteDbInstanceOperator sometimes does not complete when run with deferrable=True Nov 14, 2023
@cliebBS
Contributor Author

cliebBS commented Nov 14, 2023

@Taragolis The title of the ticket, the description, and the reproduction case have all been updated to reflect the issue now being specific to deferrable=True. This didn't really change much at all in the ticket, as the error logs still look the same.

@o-nikolas
Contributor

I set up your repro DAG on latest main, ran about 30 RDS DB creations and deferred deletions, and could not reproduce the issue.

A lot of the code has changed recently; I suspect the issue may already be resolved.

Let me know if you can reproduce the issue on latest main.


github-actions bot commented Feb 4, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Feb 4, 2024

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as not planned Feb 11, 2024