Refactor usage of environment variables for DAG configuration to use Airflow Variables #4863
Labels:
- 💻 aspect: code (Concerns the software code in the repository)
- 🧰 goal: internal improvement (Improvement that benefits maintainers, not users)
- 🟩 priority: low (Low priority and doesn't need to be rushed)
- 🧱 stack: catalog (Related to the catalog and Airflow DAGs)
- 🔧 tech: airflow (Involves Apache Airflow)
Current Situation
There are a number of configuration settings in our DAG and DAG-support code that are read exclusively using `os.getenv`, and as a result are configured as environment variables in the service (rather than metastore variables, which are more flexibly managed and changed). This prevents us from changing these settings without changing the environment variables, which currently requires a restart of the Airflow webserver and scheduler. It also has implications for future remote executor usage, where environment variables available to the scheduler are not shared with the remote execution environments; instead, the metastore should be relied on to retrieve variables.
To clarify: this is not an issue about using environment variables to configure Airflow Variables; it is an issue about the retrieval of those variables at DAG runtime. How to configure Airflow Variables is a separate question and has no bearing on whether DAGs read variables at runtime from the execution environment or from the metastore. Of course, that comes with a big caveat: if we don't use the execution environment's variables to configure them, then they cannot be read from the execution environment. Therefore, reading them from the environment significantly hinders the flexibility we have when choosing how and where to manage variables.
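To illustrate the distinction, here is a minimal sketch contrasting the two retrieval patterns, using `CONTACT_EMAIL` from the list below with an assumed default value (this is illustrative, not the actual DAG code):

```python
import os

from airflow.models import Variable

# Current pattern: the value is baked in from the scheduler's environment when
# the DAG file is parsed, so changing it requires restarting the webserver and
# scheduler.
CONTACT_EMAIL = os.getenv("CONTACT_EMAIL", "openverse@example.com")

# Metastore-backed pattern: the value is read through the Airflow Variable API,
# so it can be changed via the Airflow UI or CLI without touching the
# execution environment.
contact_email = Variable.get("CONTACT_EMAIL", default_var="openverse@example.com")
```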
Suggested Improvement
Refactor usage of the following variables such that they can be retrieved using `Variable.get`. To avoid incurring the performance penalty that comes with `Variable.get`, most of these will require refactoring DAG code to read variables at execution time (either through templates or `Variable.get` inside of tasks); the parse-time issue is sketched immediately after the list below. This list was built by searching for `getenv` and `os.environ` in the DAG code:

- `CONTACT_EMAIL`
- `DEFAULT_RETRY_COUNT`
- `DATA_REFRESH_POKE_INTERVAL`
- `DATA_REFRESH_POOL`
- `AWS_CLOUDWATCH_CONN_ID`
- `AWS_RDS_CONN_ID`
- `ES_INDEX_READINESS_RECORD_COUNT`
- `CANONICAL_DOMAIN`
- `OUTPUT_DIR`
- `DJANGO_ADMIN_URL`
- `OPENVERSE_BUCKET`
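To illustrate the parse-time penalty mentioned above: a naive swap to a module-level `Variable.get` would query the metastore every time the scheduler re-parses the DAG file, which happens continuously. This is a minimal sketch with an assumed default value, not the actual code:

```python
from airflow.models import Variable

# This runs on every DAG file parse, not just when a task executes, so the
# metastore is queried far more often than necessary.
DEFAULT_RETRY_COUNT = int(Variable.get("DEFAULT_RETRY_COUNT", default_var=2))
```

Deferring the lookup to a template or an in-task call (sketched further below) moves it to execution time and avoids this.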
I have ignored places where we read environment variables in unit testing code, because in those instances we are relying exclusively on environment variables being the local development environment's method of Airflow Variable management. I think we can safely assume we will continue to use environment variables to configure Airflow Variables for the local development environment, and that unit testing code can therefore continue to read them that way.
I've also ignored anything in `retired`, because we can just remove that code altogether, as discussed in #4240. I've assigned the issue to myself and will go ahead and open a PR for it later today (and leave an update here after that's done).

Some of these might be easier to address than others. Many of them might be best addressed by turning their usage into templates (e.g., `CONTACT_EMAIL`, `DATA_REFRESH_POKE_INTERVAL`, and other things passed as arguments to `@dag`, `@task`, or invocations of them). Others might be easier to move directly into tasks with `Variable.get` calls; both approaches are sketched below. Still others may require more complex consideration, especially if the current method and usage is relied upon to enable some approach to unit testing.

This should not be attempted in a single PR! I didn't have time when writing this issue to break it down into what I thought could be accomplished in smaller PRs. When picking up any one of these variables, I recommend searching the codebase to see how many and what kinds of usages of the variable there are. That might help guide whether a given variable can be addressed in the same PR as another.
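As a rough illustration of the two approaches, here is a sketch using `CONTACT_EMAIL` and `DATA_REFRESH_POKE_INTERVAL`; the DAG, task IDs, and default values are hypothetical and not taken from our actual DAGs:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.operators.bash import BashOperator


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def variable_refactor_example():
    # Approach 1: a Jinja template, resolved from the Variable backend when the
    # task is rendered rather than when the DAG file is parsed.
    notify = BashOperator(
        task_id="notify",
        bash_command="echo 'Contact: {{ var.value.CONTACT_EMAIL }}'",
    )

    # Approach 2: Variable.get inside the task, so the lookup only happens when
    # the task actually runs.
    @task
    def get_poke_interval() -> int:
        return int(Variable.get("DATA_REFRESH_POKE_INTERVAL", default_var=60 * 15))

    notify >> get_poke_interval()


variable_refactor_example()
```

The template form keeps the lookup out of Python entirely, while the in-task form keeps it out of the parse path; either way, the value is only retrieved at execution time.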
Benefit
Two main benefits: these settings could be changed without restarting the Airflow webserver and scheduler, and the DAGs would be compatible with future remote executor usage, where variables must be retrieved from the metastore rather than from the scheduler's environment.
Additional context
Related to/came out of this discussion: #4833 (comment)
cc'ing @WordPress/openverse-catalog for visibility as the main Airflow knowledge holders 🙂 No action item here for y'all, though, other than any helpful input you might have to guide this work... or if you disagree with the premise/goal.