[CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test #9282

graciegoheen · 2023-12-13T19:58:05Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Let's say I'm working in the above dbt project.

I want to add a unit test to my dim_wizards - specifically, I want to test my is_valid_email_address logic:

with wizards as (

    select * from {{ ref('stg_wizards') }}

),

worlds as (

    select * from {{ ref('stg_worlds') }}

),

accepted_email_domains as (

    select * from {{ ref('top_level_email_domains') }}

),

check_valid_emails as (

    select  
        wizards.wizard_id,
        wizards.wizard_name,
        wizards.email,
        wizards.phone_number,
        wizards.world_id,

		coalesce (regexp_like(
            wizards.email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
        )
        = true
        and accepted_email_domains.tld is not null,
        false) as is_valid_email_address

    from wizards
    left join accepted_email_domains
        on wizards.email_top_level_domain = lower(accepted_email_domains.tld)

)

select
    check_valid_emails.wizard_id,
    check_valid_emails.wizard_name,
    check_valid_emails.email,
    check_valid_emails.is_valid_email_address,
    check_valid_emails.phone_number,
    worlds.world_name
from check_valid_emails
left join worlds
    on check_valid_emails.world_id = worlds.world_id

Even though dim_wizards depends on three nodes:

stg_worlds model
stg_wizards model
top_level_email_domains seed

I only want to supply mock input data for the nodes that are relevant to my test.

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

In this case, that's just a few columns from stg_wizards and a column from top_level_email_domains. The stg_worlds model is completely irrelevant for this unit test.

If I don't include stg_worlds in my unit test definition, however, dbt build doesn't work quite right:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:02:26  Running with dbt=1.8.0-a1
18:02:26  Registered adapter: postgres=1.8.0-a1
18:02:27  Found 6 models, 5 seeds, 0 sources, 0 exposures, 0 metrics, 409 macros, 0 groups, 0 semantic models, 1 unit test
18:02:27  
18:02:27  Concurrency: 5 threads (target='postgres')
18:02:27  
18:02:27  1 of 11 START seed file dbt_dev.orders ......................................... [RUN]
18:02:27  2 of 11 START seed file dbt_dev.top_level_email_domains ........................ [RUN]
18:02:27  3 of 11 START seed file dbt_dev.wands .......................................... [RUN]
18:02:27  4 of 11 START seed file dbt_dev.wizards ........................................ [RUN]
18:02:27  5 of 11 START seed file dbt_dev.worlds ......................................... [RUN]
18:02:27  1 of 11 OK loaded seed file dbt_dev.orders ..................................... [INSERT 3 in 0.32s]
18:02:27  2 of 11 OK loaded seed file dbt_dev.top_level_email_domains .................... [INSERT 5 in 0.32s]
18:02:27  3 of 11 OK loaded seed file dbt_dev.wands ...................................... [INSERT 3 in 0.32s]
18:02:27  5 of 11 OK loaded seed file dbt_dev.worlds ..................................... [INSERT 6 in 0.32s]
18:02:27  4 of 11 OK loaded seed file dbt_dev.wizards .................................... [INSERT 6 in 0.32s]
18:02:27  6 of 11 START sql view model dbt_dev.stg_orders ................................ [RUN]
18:02:27  7 of 11 START sql view model dbt_dev.stg_wands ................................. [RUN]
18:02:27  8 of 11 START sql view model dbt_dev.stg_worlds ................................ [RUN]
18:02:27  9 of 11 START sql view model dbt_dev.stg_wizards ............................... [RUN]
18:02:28  8 of 11 OK created sql view model dbt_dev.stg_worlds ........................... [CREATE VIEW in 0.32s]
18:02:28  6 of 11 OK created sql view model dbt_dev.stg_orders ........................... [CREATE VIEW in 0.33s]
18:02:28  9 of 11 OK created sql view model dbt_dev.stg_wizards .......................... [CREATE VIEW in 0.35s]
18:02:28  7 of 11 OK created sql view model dbt_dev.stg_wands ............................ [CREATE VIEW in 0.35s]
18:02:28  11 of 11 START unit_test valid_email_address ................................... [RUN]
18:02:28  11 of 11 ERROR valid_email_address ............................................. [ERROR in 0.03s]
18:02:28  10 of 11 SKIP relation dbt_dev.dim_wizards ..................................... [SKIP]
18:02:28  12 of 11 SKIP relation dbt_dev.fct_orders ...................................... [SKIP]
18:02:28  
18:02:28  Finished running 5 seeds, 6 view models, 1 unit_test in 0 hours 0 minutes and 1.12 seconds (1.12s).
18:02:28  
18:02:28  Completed with 1 error and 0 warnings:
18:02:28  
18:02:28    Compilation Error in unit_test dim_wizards__valid_email_address (models/marts/unit_tests.yml)
  Unit_Test 'unit_test.coalesce_ci_demo_2023.dim_wizards.valid_email_address' (models/marts/unit_tests.yml) depends on a node named 'dim_wizards__valid_email_address__stg_worlds' which was not found
18:02:28  
18:02:28  Done. PASS=9 WARN=0 ERROR=1 SKIP=2 TOTAL=12

I think this error is because:

my model depends on stg_worlds
by my unit test does not

If I reference stg_worlds in my unit test definition, but don't supply any mock data...

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows: {}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I get a different error:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:06:22  Running with dbt=1.8.0-a1
18:06:22  Registered adapter: postgres=1.8.0-a1
18:06:23  Encountered an error:
Parsing Error
  Invalid unit_tests config given in FilePath(searched_path='models', relative_path='marts/unit_tests.yml', modification_time=1702490400.4283197, project_root='/Users/gracegoheen/dev/coalesce-ci-demo-2023') @ unit_tests: {'name': 'valid_email_address', 'model': 'dim_wizards', 'given': [{'input': "ref('stg_wizards')", 'rows': [{'wizard_id': 1, 'email': '[email protected]', 'email_top_level_domain': 'example.com'}, {'wizard_id': 2, 'email': '[email protected]', 'email_top_level_domain': 'unknown.com'}, {'wizard_id': 3, 'email': 'badgmail.com', 'email_top_level_domain': 'gmail.com'}, {'wizard_id': 4, 'email': 'missingdot@gmailcom', 'email_top_level_domain': 'gmail.com'}]}, {'input': "ref('top_level_email_domains')", 'rows': [{'tld': 'example.com'}, {'tld': 'gmail.com'}]}, {'input': "ref('stg_worlds')", 'rows': {}}], 'expect': {'rows': [{'wizard_id': 1, 'is_valid_email_address': True}, {'wizard_id': 2, 'is_valid_email_address': False}, {'wizard_id': 3, 'is_valid_email_address': False}, {'wizard_id': 4, 'is_valid_email_address': False}]}} - at path ['given'][2]['rows']: {} is not valid under any of the given schemas

If I don't include rows: at all...

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I get a different error:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:06:38  Running with dbt=1.8.0-a1
18:06:38  Registered adapter: postgres=1.8.0-a1
18:06:39  Encountered an error:
Parsing Error
  Unable to find seed 'coalesce_ci_demo_2023.stg_worlds' for unit tests in directories: ['seeds']

Describe alternatives you've considered

The work-around I used was adding a dummy single row for stg_worlds:

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows:
          - {world_id: 1}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I would like some way to be able to input no mock data for certain inputs.

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

The text was updated successfully, but these errors were encountered:

graciegoheen · 2023-12-13T21:00:52Z

Did some more digging, and figured out the proper syntax!

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows: []
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

@dbeatty10 Do we feel comfortable closing this out (and adding to our planned documentation) - or do you think there's any further improvement we could do here?

One enhancement I can think of, is not having to supply the input at all:

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

graciegoheen · 2024-02-12T20:19:54Z

Note: if you don't include a seed direct parent, we'll use the seed

graciegoheen added enhancement New feature or request triage and removed triage labels Dec 13, 2023

github-actions bot changed the title ~~allow supplying empty inputs for mock data for unit test~~ [CT-3496] allow supplying empty inputs for mock data for unit test Dec 13, 2023

graciegoheen mentioned this issue Dec 13, 2023

[CT-2911] [Epic] Unit testing dbt models #8283

Closed

graciegoheen changed the title ~~[CT-3496] allow supplying empty inputs for mock data for unit test~~ [CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test Dec 13, 2023

dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test #9282

[CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test #9282

graciegoheen commented Dec 13, 2023 •

edited

Loading

graciegoheen commented Dec 13, 2023

graciegoheen commented Feb 12, 2024

[CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test #9282

[CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test #9282

Comments

graciegoheen commented Dec 13, 2023 • edited Loading

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

graciegoheen commented Dec 13, 2023

graciegoheen commented Feb 12, 2024

graciegoheen commented Dec 13, 2023 •

edited

Loading