Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom metadata to be attached to every log #5999

Merged
1 change: 1 addition & 0 deletions core/dbt/constants.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
SECRET_ENV_PREFIX = "DBT_ENV_SECRET_"
DEFAULT_ENV_PLACEHOLDER = "DBT_DEFAULT_PLACEHOLDER"
METADATA_ENV_PREFIX = "DBT_ENV_CUSTOM_ENV_"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name might be natural in terms of some context I don't have yet, but I wonder if it can be improved. Maybe DBT_LOGGING_VAR_ or DBT_METADATA_VAR_?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually an existing name (docs). Metadata is in the manifest and these vars end up there currently. I centralized the constant into this (newish) file as part of my changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be my preference to have a new key for this behavior, such as DBT_LOG_CUSTOM_ENV. Users already have existing DBT_ENV_CUSTOM_ENV environment variables, which they might actually be surprised to find in every log line. @jtcohen6 could you weigh in on this?

To me, this seems to be a different use case than we already have for DBT_ENV_CUSTOM_ENV.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conflating these two use cases also makes it impossible to put things in one place in the artifact output as opposed to in the artifact output plus every log line.

Copy link
Member Author

@emmyoop emmyoop Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure specifically how users are using DBT_ENV_CUSTOM_ENV currently.

I do see what you're saying with users possibly being surprised these env vars show up in logs now. But is there a use case for these being the exact variables they want to show up? Definitely need @jtcohen6 to weigh in.

It's easy to separate out if that's the use case we want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now those DBT_ENV_CUSTOM_ENV variables show up in the artifacts that are written up in the end, and there was a recent change to put them in the context for access in Jinja.

I'm not familiar with how users might use those, but from some of our test cases, it does look like they were intended for putting things like a run_id or a job_id. So maybe it would be fine to use this prefix string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tagging me in, and for the good conversation here!

Users set these env vars with the explicit intent of having them show up in metadata artifacts. It makes it easier to debug and quickly cross-reference. I see logs as a metadata artifact, in very much the same vein.

This was also our intent by crafting such an unwieldy prefix. It's hard to type DBT_ENV_CUSTOM_ENV_ by accident. I'm definitely open to a better name (DBT_ENV_CUSTOM_METADATA_?) that's equally clear about what it's doing. By convention, "functional" env vars should just be prefixed with DBT_, including env vars setting global configs.

I think it's an okay assumption to automatically coordinate between these. If later on, we hear a valid use case for having them show up in one, and not the other, I'm willing to admit that I was wrong! But I'd rather do this than require users to set two sets of identical env vars from the get-go.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input @jtcohen6! I do think DBT_ENV_CUSTOM_METADATA_ does a better job of saying what the var is for sure, so lean towards this new name.

My understanding is that if we change the name we should add a deprecation warning for the old name and remove it in 2.0. My question is would DBT_ENV_CUSTOM_ENV_ only show up in the artifacts (since that's maintaining current behavior) and DBT_ENV_CUSTOM_METADATA_ shows up in both logs and artifacts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still vote in favor of coordinating the information so that it shows up in both places.

How would we feel about implementing this as DBT_ENV_CUSTOM_ENV_ for now, and then opening a new ticket for the rename (+ deprecation warning)?

4 changes: 2 additions & 2 deletions core/dbt/context/providers.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@
ParsedSourceDefinition,
)
from dbt.contracts.graph.metrics import MetricReference, ResolvedMetricReference
from dbt.contracts.util import get_metadata_env
from dbt.events.functions import get_metadata_vars
from dbt.exceptions import (
CompilationException,
ParsingException,
Expand Down Expand Up @@ -713,7 +713,7 @@ def _get_namespace_builder(self):

@contextproperty
def dbt_metadata_envs(self) -> Dict[str, str]:
return get_metadata_env()
return get_metadata_vars()

@contextproperty
def invocation_args_dict(self):
Expand Down
19 changes: 2 additions & 17 deletions core/dbt/contracts/util.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
import dataclasses
import os
from datetime import datetime
from typing import List, Tuple, ClassVar, Type, TypeVar, Dict, Any, Optional

Expand All @@ -11,7 +10,7 @@
IncompatibleSchemaException,
)
from dbt.version import __version__
from dbt.events.functions import get_invocation_id
from dbt.events.functions import get_invocation_id, get_metadata_vars
from dbt.dataclass_schema import dbtClassMixin

from dbt.dataclass_schema import (
Expand Down Expand Up @@ -148,20 +147,6 @@ def __str__(self) -> str:
return BASE_SCHEMAS_URL + self.path


SCHEMA_VERSION_KEY = "dbt_schema_version"


METADATA_ENV_PREFIX = "DBT_ENV_CUSTOM_ENV_"


def get_metadata_env() -> Dict[str, str]:
return {
k[len(METADATA_ENV_PREFIX) :]: v
for k, v in os.environ.items()
if k.startswith(METADATA_ENV_PREFIX)
}


# This is used in the ManifestMetadata, RunResultsMetadata, RunOperationResultMetadata,
# FreshnessMetadata, and CatalogMetadata classes
@dataclasses.dataclass
Expand All @@ -170,7 +155,7 @@ class BaseArtifactMetadata(dbtClassMixin):
dbt_version: str = __version__
generated_at: datetime = dataclasses.field(default_factory=datetime.utcnow)
invocation_id: Optional[str] = dataclasses.field(default_factory=get_invocation_id)
env: Dict[str, str] = dataclasses.field(default_factory=get_metadata_env)
env: Dict[str, str] = dataclasses.field(default_factory=get_metadata_vars)

def __post_serialize__(self, dct):
dct = super().__post_serialize__(dct)
Expand Down
8 changes: 7 additions & 1 deletion core/dbt/events/base_types.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
import threading
from datetime import datetime


# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# These base types define the _required structure_ for the concrete event #
# types defined in types.py #
Expand All @@ -15,6 +14,12 @@ class Cache:
pass


def get_global_metadata_vars() -> dict:
from dbt.events.functions import get_metadata_vars

return get_metadata_vars()


def get_invocation_id() -> str:
from dbt.events.functions import get_invocation_id

Expand Down Expand Up @@ -48,6 +53,7 @@ def __post_init__(self):
if not hasattr(self.info, "msg") or not self.info.msg:
self.info.msg = self.message()
self.info.invocation_id = get_invocation_id()
self.info.extra = get_global_metadata_vars()
self.info.ts = datetime.utcnow()
self.info.pid = get_pid()
self.info.thread = get_thread_name()
Expand Down
21 changes: 19 additions & 2 deletions core/dbt/events/functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
from dbt.events.base_types import NoStdOut, BaseEvent, NoFile, Cache
from dbt.events.types import EventBufferFull, MainReportVersion, EmptyLine
import dbt.flags as flags
from dbt.constants import SECRET_ENV_PREFIX
from dbt.constants import SECRET_ENV_PREFIX, METADATA_ENV_PREFIX

from dbt.logger import make_log_dir_if_missing, GLOBAL_LOGGER
from datetime import datetime
Expand All @@ -18,7 +18,7 @@
import os
import uuid
import threading
from typing import List, Optional, Union, Callable
from typing import List, Optional, Union, Callable, Dict
from collections import deque

LOG_VERSION = 3
Expand All @@ -40,6 +40,7 @@
format_color = True
format_json = False
invocation_id: Optional[str] = None
metadata_vars: Optional[Dict[str, str]] = None


def setup_event_logger(log_path, level_override=None):
Expand Down Expand Up @@ -267,6 +268,22 @@ def fire_event(e: BaseEvent) -> None:
send_to_logger(STDOUT_LOG, level_tag=e.level_tag(), log_line=log_line)


def get_metadata_vars() -> Dict[str, str]:
global metadata_vars
if metadata_vars is None:
metadata_vars = {
k[len(METADATA_ENV_PREFIX) :]: v
for k, v in os.environ.items()
if k.startswith(METADATA_ENV_PREFIX)
}
return metadata_vars


def reset_metadata_vars() -> None:
global metadata_vars
metadata_vars = None


def get_invocation_id() -> str:
global invocation_id
if invocation_id is None:
Expand Down
1 change: 1 addition & 0 deletions core/dbt/lib.py
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,7 @@ def get_dbt_config(project_dir, args=None, single_threaded=False):

# Make sure we have a valid invocation_id
dbt.events.functions.set_invocation_id()
dbt.events.functions.reset_metadata_vars()

return config

Expand Down
5 changes: 4 additions & 1 deletion core/dbt/tests/util.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
from dbt.main import handle_and_check
from dbt.logger import log_manager
from dbt.contracts.graph.manifest import Manifest
from dbt.events.functions import fire_event, capture_stdout_logs, stop_capture_stdout_logs
from dbt.events.functions import fire_event, capture_stdout_logs, stop_capture_stdout_logs, reset_metadata_vars
from dbt.events.test_types import IntegrationTestDebug

# =============================================================================
Expand Down Expand Up @@ -63,6 +63,9 @@ def run_dbt(args: List[str] = None, expect_pass=True):
# Ignore logbook warnings
warnings.filterwarnings("ignore", category=DeprecationWarning, module="logbook")

# reset global vars
reset_metadata_vars()

# The logger will complain about already being initialized if
# we don't do this.
log_manager.reset_handlers()
Expand Down
6 changes: 6 additions & 0 deletions test/unit/test_context.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
from dbt.config.project import VarProvider
from dbt.context import base, target, configured, providers, docs, manifest, macros
from dbt.contracts.files import FileHash
from dbt.events.functions import reset_metadata_vars
from dbt.node_types import NodeType
import dbt.exceptions
from .utils import (
Expand Down Expand Up @@ -503,6 +504,8 @@ def test_macro_namespace(config_postgres, manifest_fx):
assert result["some_macro"].macro is package_macro

def test_dbt_metadata_envs(monkeypatch, config_postgres, manifest_fx, get_adapter, get_include_paths):
reset_metadata_vars()

envs = {
"DBT_ENV_CUSTOM_ENV_RUN_ID": 1234,
"DBT_ENV_CUSTOM_ENV_JOB_ID": 5678,
Expand All @@ -519,3 +522,6 @@ def test_dbt_metadata_envs(monkeypatch, config_postgres, manifest_fx, get_adapte
)

assert ctx["dbt_metadata_envs"] == {'JOB_ID': 5678, 'RUN_ID': 1234}

# cleanup
reset_metadata_vars()
6 changes: 5 additions & 1 deletion test/unit/test_manifest.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,8 @@
)

from dbt.contracts.graph.compiled import CompiledModelNode
from dbt.events.functions import get_invocation_id
from dbt.events.functions import reset_metadata_vars

from dbt.node_types import NodeType
import freezegun

Expand All @@ -60,6 +61,8 @@

class ManifestTest(unittest.TestCase):
def setUp(self):
reset_metadata_vars()

# TODO: why is this needed for tests in this module to pass?
tracking.active_user = None

Expand Down Expand Up @@ -304,6 +307,7 @@ def setUp(self):

def tearDown(self):
del os.environ['DBT_ENV_CUSTOM_ENV_key']
reset_metadata_vars()

@freezegun.freeze_time('2018-02-14T09:15:13Z')
def test__no_nodes(self):
Expand Down
33 changes: 33 additions & 0 deletions tests/functional/context_methods/test_custom_env_vars.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
import pytest
import json
import os

from dbt.tests.util import run_dbt_and_capture


def parse_json_logs(json_log_output):
parsed_logs = []
for line in json_log_output.split("\n"):
try:
log = json.loads(line)
except ValueError:
continue

parsed_logs.append(log)

return parsed_logs


class TestCustomVarInLogs:
@pytest.fixture(scope="class", autouse=True)
def setup(self):
# on windows, python uppercases env var names because windows is case insensitive
os.environ["DBT_ENV_CUSTOM_ENV_SOME_VAR"] = "value"
yield
del os.environ["DBT_ENV_CUSTOM_ENV_SOME_VAR"]

def test_extra_filled(self, project):
_, log_output = run_dbt_and_capture(['--log-format=json', 'deps'],)
logs = parse_json_logs(log_output)
for log in logs:
assert log['info'].get('extra') == {"SOME_VAR": "value"}
26 changes: 25 additions & 1 deletion tests/unit/test_proto_events.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
PluginLoadError,
PrintStartLine,
)
from dbt.events.functions import event_to_dict, LOG_VERSION
from dbt.events.functions import event_to_dict, LOG_VERSION, reset_metadata_vars
from dbt.events import proto_types as pl
from dbt.version import installed

Expand Down Expand Up @@ -97,3 +97,27 @@ def test_node_info_events():
)
assert event
assert event.node_info.node_path == "some_path"


def test_extra_dict_on_event(monkeypatch):

monkeypatch.setenv("DBT_ENV_CUSTOM_ENV_env_key", "env_value")

reset_metadata_vars()

event = MainReportVersion(version=str(installed), log_version=LOG_VERSION)
event_dict = event_to_dict(event)
assert set(event_dict["info"].keys()) == info_keys
assert event.info.extra == {"env_key": "env_value"}
serialized = bytes(event)

# Extract EventInfo from serialized message
generic_event = pl.GenericMessage().parse(serialized)
assert generic_event.info.code == "A001"
# get the message class for the real message from the generic message
message_class = getattr(sys.modules["dbt.events.proto_types"], generic_event.info.name)
new_event = message_class().parse(serialized)
assert new_event.info.extra == event.info.extra

# clean up
reset_metadata_vars()