
RdsDeleteDbInstanceOperator sometimes does not complete when run with deferrable=True #35563

Closed
cliebBS opened this issue Nov 9, 2023 · 8 comments
Labels
area:providers, Can't Reproduce, good first issue, kind:bug, pending-response, provider:amazon, stale

Comments

@cliebBS
Contributor

cliebBS commented Nov 9, 2023

Apache Airflow version

2.7.3

What happened

When using the RdsDeleteDbInstanceOperator with deferrable=True to delete an RDS instance in my DAG, the operator sometimes misses the fact that the instance was deleted and leaves logs like:

[2023-11-09, 19:32:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:32:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:33:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: deleting
[2023-11-09, 19:33:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:34:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:34:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:35:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:35:44 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 
[2023-11-09, 19:36:14 UTC] {waiter_with_logging.py:78} INFO - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 1] DB Instance status is: 

In the above example, the instance was actually deleted at some point between 19:33:14 and 19:33:44, but the operator never notices and keeps polling for the instance's status until it exhausts waiter_max_attempts, at which point it fails. Retries of the task then fail immediately with:

[2023-11-09, 19:57:11 UTC] {taskinstance.py:1937} ERROR - [findings_sync manual__2023-11-09T17:58:26+00:00 prev_db--delete_db 2] Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/amazon/aws/operators/rds.py", line 690, in execute
    delete_db_instance = self.hook.conn.delete_db_instance(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the DeleteDBInstance operation: User: arn:aws:sts::account-num:assumed-role/an-iam-role/botocore-session is not authorized to perform: rds:DeleteDBInstance on resource: arn:aws:rds:us-east-1:account-num:db:test-db because no identity-based policy allows the rds:DeleteDBInstance action

I /think/ what's happening is that the deletion completes in the window between when the waiter times out and when it is rescheduled to run again. Since a deleted RDS instance leaves no trace in the AWS API (unlike a terminated EMR cluster, for example), re-running the waiter leaves the operator in a weird state where it believes the resource still exists but cannot get a status for it.
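
For illustration only (plain boto3, not the provider's actual waiter code; the function name and parameters are made up), here is a minimal sketch of a deletion poller that treats DBInstanceNotFoundFault as the terminal success state — the case that seems to get lost when the trigger resumes after the instance has already disappeared:

import time

import boto3


def wait_for_db_instance_deleted(db_instance_identifier: str, delay: int = 30, max_attempts: int = 40) -> None:
    # Hypothetical poller, not the provider's waiter implementation.
    rds = boto3.client("rds")
    for _ in range(max_attempts):
        try:
            response = rds.describe_db_instances(DBInstanceIdentifier=db_instance_identifier)
        except rds.exceptions.DBInstanceNotFoundFault:
            # Instance is gone: for a delete, "not found" is the success state,
            # not an empty status to keep polling on.
            return
        print(f"DB Instance status is: {response['DBInstances'][0]['DBInstanceStatus']}")
        time.sleep(delay)
    raise TimeoutError(f"{db_instance_identifier} still present after {max_attempts} attempts")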

What you think should happen instead

The RdsDeleteDbInstanceOperator should always complete in the success state when the RDS instance is deleted.

How to reproduce

import datetime
from typing import Any

from airflow import DAG
from airflow.decorators import task
from airflow.providers.amazon.aws.operators.rds import (
    RdsCreateDbInstanceOperator,
    RdsDeleteDbInstanceOperator,
)

NUM_INSTANCES = 6  # set this higher to make it easier to reproduce in a single run

# TODO: set these appropriately for your environment
INSTANCE_PREFIX = "rds-delete-repro"

# I don't think we can make this any smaller
STORAGE_SIZE = 20
INSTANCE_TYPE = "db.t4g.micro"


@task()
def get_names(num_instances: int) -> list[str]:
    return [f"{INSTANCE_PREFIX}-{x}" for x in range(num_instances)]


@task()
def get_rds_kwargs(names: list[str]) -> list[dict[str, Any]]:
    rds_kwargs: dict[str, Any] = {
        "EngineVersion": "14.8",
        "DBParameterGroupName": "default.postgres14",
        "MasterUsername": "postgres",
        "MasterUserPassword": "postgres",
        "StorageType": "gp3",
        "AllocatedStorage": STORAGE_SIZE,
        "Port": 5432,
        "NetworkType": "IPV4",
        "MultiAZ": False,
        "BackupRetentionPeriod": 0,  # disables automatic backup.
        "PubliclyAccessible": False,
        "AutoMinorVersionUpgrade": False,
        "DeletionProtection": False,
        "StorageEncrypted": True,
        # TODO: set these to valid values for your AWS account
        # "VpcSecurityGroupIds": ["sg-0ced89fe13d1ae49a"],
        # "DBSubnetGroupName": "dbsubnet-rds-dev-eks",
    }

    return [{"db_instance_identifier": name, "rds_kwargs": rds_kwargs} for name in names]


with DAG(
    dag_id="rds_delete_repro",
    default_args={"retries": 1},
    start_date=datetime.datetime(2021, 1, 1),
    catchup=False,
    schedule_interval=None,
) as dag:
    db_instance_ids = get_names(NUM_INSTANCES)

    rds_kwargs = get_rds_kwargs(names=db_instance_ids)

    create_rds = RdsCreateDbInstanceOperator.partial(
        task_id="create_rds",
        db_instance_class=INSTANCE_TYPE,
        engine="postgres",
    ).expand_kwargs(rds_kwargs)

    delete_rds = RdsDeleteDbInstanceOperator.partial(
        task_id="delete_rds",
        rds_kwargs={
            "SkipFinalSnapshot": True,
            "DeleteAutomatedBackups": True,
        },
        deferrable=True,
    ).expand(
        db_instance_identifier=db_instance_ids,
    )

    rds_kwargs >> create_rds >> delete_rds

You can increase the number of RDS instances this will spin up at once to increase the odds that you'll trigger this problem. It also seems like this is much easier to reproduce when running on EKS using the official Helm chart vs running locally using the official docker-compose.yaml, though I have no idea why that should be making a difference.
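
As a possible interim workaround (not from the original report, and only a sketch that assumes the operator's wait_for_completion flag disables its own waiter), the waiting could instead be done with boto3's built-in "db_instance_deleted" waiter, which treats a missing instance as success:

from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.rds import RdsHook


@task()
def wait_for_db_deleted(db_instance_identifier: str) -> None:
    # RdsHook().conn is the underlying boto3 RDS client; "db_instance_deleted"
    # is botocore's built-in waiter for this API.
    waiter = RdsHook().conn.get_waiter("db_instance_deleted")
    waiter.wait(
        DBInstanceIdentifier=db_instance_identifier,
        WaiterConfig={"Delay": 30, "MaxAttempts": 60},
    )

In the repro DAG this would mean passing wait_for_completion=False (and dropping deferrable=True) on delete_rds and chaining wait_for_db_deleted.expand(db_instance_identifier=db_instance_ids) after it.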

Operating System

Official Docker image

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.10.0

Deployment

Docker-Compose

Deployment details

macOS Sonoma 14.0

❯ docker version
Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.42 (downgraded from 1.43)
 Go version:        go1.21.3
 Git commit:        afdd53b4e3
 Built:             Thu Oct 26 07:06:42 2023
 OS/Arch:           darwin/arm64
 Context:           colima

Server:
 Engine:
  Version:          23.0.6
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.20.4
  Git commit:       9dbdbd4b6d7681bd18c897a6ba0376073c2a72ff
  Built:            Fri May 12 13:54:36 2023
  OS/Arch:          linux/arm64
  Experimental:     true
 containerd:
  Version:          v1.7.0
  GitCommit:        1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
 runc:
  Version:          1.1.7
  GitCommit:        860f061b76bb4fc671f0f9e900f7d80ff93d4eb7
 docker-init:
  Version:          0.19.0
  GitCommit:
❯ docker-compose version
Docker Compose version 2.23.0
❯ colima version
colima version 0.5.6
git commit: ceef812c32ab74a49df9f270e048e5dced85f932

runtime: docker
arch: aarch64
client: v24.0.7
server: v23.0.6
❯ limactl --version
limactl version 0.18.0
❯ qemu-img --version
qemu-img version 8.1.2
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
❯ kubectl version
Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.24.17-eks-4f4795d

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@cliebBS cliebBS added the area:core, kind:bug, and needs-triage labels Nov 9, 2023
@Taragolis
Contributor

I think it might be fixed in 8.11.0; it would be nice if you could check.

Reference PR:

@cliebBS
Contributor Author

cliebBS commented Nov 10, 2023

Oh, nice! I pulled that change into my Docker image and I've managed two successful runs of my actual DAG so far that delete six RDS instances each. I guess I'll be waiting eagerly for 8.11.0 to be released.

@cliebBS cliebBS closed this as completed Nov 10, 2023
@cliebBS
Contributor Author

cliebBS commented Nov 13, 2023

After further testing of this fix, I discovered that it does not seem to make the problem any better when using deferrable=True.

@cliebBS cliebBS reopened this Nov 13, 2023
@Taragolis Taragolis removed the pending-response and needs-triage labels Nov 14, 2023
@Taragolis
Contributor

@cliebBS Could you update the examples and description to note that this operator fails only in deferrable mode?

@cliebBS cliebBS changed the title RdsDeleteDbInstanceOperator sometimes does not complete RdsDeleteDbInstanceOperator sometimes does not complete when run with deferrable=True Nov 14, 2023
@cliebBS
Contributor Author

cliebBS commented Nov 14, 2023

@Taragolis The title of the ticket, the description, and the reproduction case have all been updated to reflect the issue now being specific to deferrable=True. This didn't really change much at all in the ticket, as the error logs still look the same.

@o-nikolas
Contributor

I set up your repro DAG on latest main, ran about 30 RDS DB creations and deferred deletions, and could not reproduce the issue.

A lot of the code has changed recently; I suspect the issue may already be resolved.

Let me know if you can reproduce the issue on latest main.


github-actions bot commented Feb 4, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale label Feb 4, 2024

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as not planned Feb 11, 2024