Incorrect data_type for BigQuery REPEATED fields. #190

Open
1 of 5 tasks
Thrasi opened this issue Nov 12, 2024 · 0 comments
Labels: bug (Something isn't working), triage

Thrasi commented Nov 12, 2024

Describe the bug

BigQuery supports arrays as REPEATED fields.
Running generate_model_yaml for a model with a repeated field of a given datatype produces a
schema with data_type: datatype rather than data_type: array<datatype>.
Running the model again with contract.enforced=true then fails with the error:

column_name   definition_type  contract_type  mismatch_reason
repeated_int  ARRAY<INT64>     INT64          data type mismatch

Repeated records should have data_type: array

Steps to reproduce

Create a model_with_repeated_field.sql:

select 
    [1,2,3,4] as repeated_int

Run it:

$ dbt run -s model_with_repeated_field -f

Then generate the YAML for the model with generate_model_yaml, with include_data_types enabled.

Expected results

version: 2

models:
  - name: model_with_repeated_field
    description: ""
    columns:
      - name: repeated_int
        data_type: array<int64>
        description: ""

Actual results

version: 2

models:
  - name: model_with_repeated_field
    description: ""
    columns:
      - name: repeated_int
        data_type: int64
        description: ""

Screenshots and log output

Running the model with the following yml:

models:
  - name: model_with_repeated_field
    config:
      contract:
        enforced: true
    columns:
      - name: repeated_int
        data_type: int64
        description: ""

Screenshot 2024-11-12 at 16 32 13

System information

packages:
  - package: dbt-labs/codegen
    version: 0.12.1
  - package: dbt-labs/dbt_utils
    version: 1.1.1

Which database are you using dbt with?

  • [ ] postgres
  • [ ] redshift
  • [x] bigquery
  • [ ] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.8.8
  - latest:    1.8.8 - Up to date!

Plugins:
  - bigquery: 1.8.3 - Up to date!

The operating system you're using:
macOS Sequoia 15.1

The output of python --version:
Python 3.10.15

Additional context

generate_model_yaml gets the data_type using data_type_format_model:

    {% if include_data_types %}
        {% do model_yaml.append('        data_type: ' ~ codegen.data_type_format_model(column)) %}
    {% endif %}

which in turn calls codegen.format_column:

{% macro data_type_format_model(column) -%}
  {{ return(adapter.dispatch('data_type_format_model', 'codegen')(column)) }}
{%- endmacro %}

{# format a column data type for a model #}
{% macro default__data_type_format_model(column) %}
    {% set formatted = codegen.format_column(column) %}
    {{ return(formatted['data_type'] | lower) }}
{% endmacro %}

format_column is vendored in macros/vendored/dbt_core/format_column.sql,
but it does not handle the BigQuery-specific case of repeated fields.

I am tempted to create a default__format_column and a bigquery__format_column to handle BigQuery specifically:

{% macro format_column(column) -%}
  {{ return(adapter.dispatch('format_column', 'codegen')(column)) }}
{%- endmacro %}

{% macro default__format_column(column) -%}
  {% set data_type = column.dtype %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

{% macro bigquery__format_column(column) -%}
  {% if column.mode.lower() == "repeated" %}
    {% set data_type = "array" if column.dtype.lower() == "record" else "array<" ~ column.dtype ~ ">" %}
  {% else %}
    {% set data_type = column.dtype %}
  {% endif %}
  {% set formatted = column.column.lower() ~ " " ~ data_type %}
  {{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}

I would also move it all into helpers.sql and remove the vendored format_column.
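
To make the intended behaviour concrete, here is a rough Python model of the two macros above. It is only a sketch for reasoning about the outputs, not codegen code: the Column dataclass is a hypothetical stand-in for the adapter's column object (only column, dtype and mode are modelled), and the final lower-casing step mirrors what default__data_type_format_model does.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical stand-in for the adapter's column object (not dbt's class)."""
    column: str              # column name
    dtype: str               # e.g. "INT64" or "RECORD"
    mode: str = "NULLABLE"   # BigQuery mode: NULLABLE, REQUIRED or REPEATED


def _result(col: Column, data_type: str) -> dict:
    # Same dictionary shape that the macros return.
    return {
        "name": col.column,
        "data_type": data_type,
        "formatted": f"{col.column.lower()} {data_type}",
    }


def default_format_column(col: Column) -> dict:
    # Mirrors default__format_column: the mode is ignored entirely.
    return _result(col, col.dtype)


def bigquery_format_column(col: Column) -> dict:
    # Mirrors the proposed bigquery__format_column.
    if col.mode.lower() == "repeated":
        if col.dtype.lower() == "record":
            return _result(col, "array")
        return _result(col, f"array<{col.dtype}>")
    return _result(col, col.dtype)


def data_type_format_model(formatted: dict) -> str:
    # Mirrors default__data_type_format_model: lower-case the data type.
    return formatted["data_type"].lower()


repeated_int = Column("repeated_int", "INT64", "REPEATED")
print(data_type_format_model(default_format_column(repeated_int)))   # int64
print(data_type_format_model(bigquery_format_column(repeated_int)))  # array<int64>
```

In this model, non-repeated columns go through both functions unchanged, so only REPEATED fields would see a different data_type.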

Alternatively, bigquery__format_column could return an extra 'mode': column.mode field,
and the adapter-specific datatype conversion would be handled in
bigquery__data_type_format_model and bigquery__data_type_format_source.
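
As a rough illustration of this alternative, the Python sketch below models the split of responsibilities. The function names echo the macros mentioned above, but the signatures and bodies are assumptions made for illustration only; nothing here exists in codegen today.

```python
def bigquery_format_column(column: str, dtype: str, mode: str) -> dict:
    # Alternative design: pass the mode through untouched and leave
    # data_type exactly as the vendored macro would report it.
    return {
        "name": column,
        "data_type": dtype,
        "mode": mode,
        "formatted": f"{column.lower()} {dtype}",
    }


def bigquery_data_type_format_model(formatted: dict) -> str:
    # The adapter-specific conversion lives here rather than in format_column.
    data_type = formatted["data_type"].lower()
    if formatted["mode"].lower() != "repeated":
        return data_type
    return "array" if data_type == "record" else f"array<{data_type}>"


col = bigquery_format_column("repeated_int", "INT64", "REPEATED")
print(bigquery_data_type_format_model(col))  # array<int64>
```

The trade-off in this variant is that format_column keeps returning the same data_type as before, while bigquery__data_type_format_source would need the same wrapping logic.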

Are you interested in contributing the fix?

Yes. I would like some input on whether the proposed solution in Additional context is reasonable.

Thrasi added the bug and triage labels on Nov 12, 2024