Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test #9282

Open
3 tasks done
Tracked by #8283
graciegoheen opened this issue Dec 13, 2023 · 2 comments
Open
3 tasks done
Tracked by #8283
Labels
enhancement New feature or request unit tests Issues related to built-in dbt unit testing functionality

Comments

@graciegoheen
Copy link
Contributor

graciegoheen commented Dec 13, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Screenshot 2023-12-13 at 12 15 18 PM

Let's say I'm working in the above dbt project.

I want to add a unit test to my dim_wizards - specifically, I want to test my is_valid_email_address logic:

with wizards as (

    select * from {{ ref('stg_wizards') }}

),

worlds as (

    select * from {{ ref('stg_worlds') }}

),

accepted_email_domains as (

    select * from {{ ref('top_level_email_domains') }}

),

check_valid_emails as (

    select  
        wizards.wizard_id,
        wizards.wizard_name,
        wizards.email,
        wizards.phone_number,
        wizards.world_id,

		coalesce (regexp_like(
            wizards.email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
        )
        = true
        and accepted_email_domains.tld is not null,
        false) as is_valid_email_address

    from wizards
    left join accepted_email_domains
        on wizards.email_top_level_domain = lower(accepted_email_domains.tld)

)

select
    check_valid_emails.wizard_id,
    check_valid_emails.wizard_name,
    check_valid_emails.email,
    check_valid_emails.is_valid_email_address,
    check_valid_emails.phone_number,
    worlds.world_name
from check_valid_emails
left join worlds
    on check_valid_emails.world_id = worlds.world_id

Even though dim_wizards depends on three nodes:

  • stg_worlds model
  • stg_wizards model
  • top_level_email_domains seed

I only want to supply mock input data for the nodes that are relevant to my test.

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

In this case, that's just a few columns from stg_wizards and a column from top_level_email_domains. The stg_worlds model is completely irrelevant for this unit test.

If I don't include stg_worlds in my unit test definition, however, dbt build doesn't work quite right:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:02:26  Running with dbt=1.8.0-a1
18:02:26  Registered adapter: postgres=1.8.0-a1
18:02:27  Found 6 models, 5 seeds, 0 sources, 0 exposures, 0 metrics, 409 macros, 0 groups, 0 semantic models, 1 unit test
18:02:27  
18:02:27  Concurrency: 5 threads (target='postgres')
18:02:27  
18:02:27  1 of 11 START seed file dbt_dev.orders ......................................... [RUN]
18:02:27  2 of 11 START seed file dbt_dev.top_level_email_domains ........................ [RUN]
18:02:27  3 of 11 START seed file dbt_dev.wands .......................................... [RUN]
18:02:27  4 of 11 START seed file dbt_dev.wizards ........................................ [RUN]
18:02:27  5 of 11 START seed file dbt_dev.worlds ......................................... [RUN]
18:02:27  1 of 11 OK loaded seed file dbt_dev.orders ..................................... [INSERT 3 in 0.32s]
18:02:27  2 of 11 OK loaded seed file dbt_dev.top_level_email_domains .................... [INSERT 5 in 0.32s]
18:02:27  3 of 11 OK loaded seed file dbt_dev.wands ...................................... [INSERT 3 in 0.32s]
18:02:27  5 of 11 OK loaded seed file dbt_dev.worlds ..................................... [INSERT 6 in 0.32s]
18:02:27  4 of 11 OK loaded seed file dbt_dev.wizards .................................... [INSERT 6 in 0.32s]
18:02:27  6 of 11 START sql view model dbt_dev.stg_orders ................................ [RUN]
18:02:27  7 of 11 START sql view model dbt_dev.stg_wands ................................. [RUN]
18:02:27  8 of 11 START sql view model dbt_dev.stg_worlds ................................ [RUN]
18:02:27  9 of 11 START sql view model dbt_dev.stg_wizards ............................... [RUN]
18:02:28  8 of 11 OK created sql view model dbt_dev.stg_worlds ........................... [CREATE VIEW in 0.32s]
18:02:28  6 of 11 OK created sql view model dbt_dev.stg_orders ........................... [CREATE VIEW in 0.33s]
18:02:28  9 of 11 OK created sql view model dbt_dev.stg_wizards .......................... [CREATE VIEW in 0.35s]
18:02:28  7 of 11 OK created sql view model dbt_dev.stg_wands ............................ [CREATE VIEW in 0.35s]
18:02:28  11 of 11 START unit_test valid_email_address ................................... [RUN]
18:02:28  11 of 11 ERROR valid_email_address ............................................. [ERROR in 0.03s]
18:02:28  10 of 11 SKIP relation dbt_dev.dim_wizards ..................................... [SKIP]
18:02:28  12 of 11 SKIP relation dbt_dev.fct_orders ...................................... [SKIP]
18:02:28  
18:02:28  Finished running 5 seeds, 6 view models, 1 unit_test in 0 hours 0 minutes and 1.12 seconds (1.12s).
18:02:28  
18:02:28  Completed with 1 error and 0 warnings:
18:02:28  
18:02:28    Compilation Error in unit_test dim_wizards__valid_email_address (models/marts/unit_tests.yml)
  Unit_Test 'unit_test.coalesce_ci_demo_2023.dim_wizards.valid_email_address' (models/marts/unit_tests.yml) depends on a node named 'dim_wizards__valid_email_address__stg_worlds' which was not found
18:02:28  
18:02:28  Done. PASS=9 WARN=0 ERROR=1 SKIP=2 TOTAL=12

I think this error is because:

  • my model depends on stg_worlds
  • by my unit test does not

If I reference stg_worlds in my unit test definition, but don't supply any mock data...

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows: {}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I get a different error:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:06:22  Running with dbt=1.8.0-a1
18:06:22  Registered adapter: postgres=1.8.0-a1
18:06:23  Encountered an error:
Parsing Error
  Invalid unit_tests config given in FilePath(searched_path='models', relative_path='marts/unit_tests.yml', modification_time=1702490400.4283197, project_root='/Users/gracegoheen/dev/coalesce-ci-demo-2023') @ unit_tests: {'name': 'valid_email_address', 'model': 'dim_wizards', 'given': [{'input': "ref('stg_wizards')", 'rows': [{'wizard_id': 1, 'email': '[email protected]', 'email_top_level_domain': 'example.com'}, {'wizard_id': 2, 'email': '[email protected]', 'email_top_level_domain': 'unknown.com'}, {'wizard_id': 3, 'email': 'badgmail.com', 'email_top_level_domain': 'gmail.com'}, {'wizard_id': 4, 'email': 'missingdot@gmailcom', 'email_top_level_domain': 'gmail.com'}]}, {'input': "ref('top_level_email_domains')", 'rows': [{'tld': 'example.com'}, {'tld': 'gmail.com'}]}, {'input': "ref('stg_worlds')", 'rows': {}}], 'expect': {'rows': [{'wizard_id': 1, 'is_valid_email_address': True}, {'wizard_id': 2, 'is_valid_email_address': False}, {'wizard_id': 3, 'is_valid_email_address': False}, {'wizard_id': 4, 'is_valid_email_address': False}]}} - at path ['given'][2]['rows']: {} is not valid under any of the given schemas

If I don't include rows: at all...

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I get a different error:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:06:38  Running with dbt=1.8.0-a1
18:06:38  Registered adapter: postgres=1.8.0-a1
18:06:39  Encountered an error:
Parsing Error
  Unable to find seed 'coalesce_ci_demo_2023.stg_worlds' for unit tests in directories: ['seeds']

Describe alternatives you've considered

The work-around I used was adding a dummy single row for stg_worlds:

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows:
          - {world_id: 1}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I would like some way to be able to input no mock data for certain inputs.

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

@graciegoheen graciegoheen added enhancement New feature or request triage and removed triage labels Dec 13, 2023
@github-actions github-actions bot changed the title allow supplying empty inputs for mock data for unit test [CT-3496] allow supplying empty inputs for mock data for unit test Dec 13, 2023
@graciegoheen
Copy link
Contributor Author

Did some more digging, and figured out the proper syntax!

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows: []
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

@dbeatty10 Do we feel comfortable closing this out (and adding to our planned documentation) - or do you think there's any further improvement we could do here?

One enhancement I can think of, is not having to supply the input at all:

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: [email protected],     email_top_level_domain: example.com}
          - {wizard_id: 2, email: [email protected],     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

@graciegoheen graciegoheen changed the title [CT-3496] allow supplying empty inputs for mock data for unit test [CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test Dec 13, 2023
@graciegoheen
Copy link
Contributor Author

Note: if you don't include a seed direct parent, we'll use the seed

@dbeatty10 dbeatty10 added the unit tests Issues related to built-in dbt unit testing functionality label Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request unit tests Issues related to built-in dbt unit testing functionality
Projects
None yet
Development

No branches or pull requests

2 participants