Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Count_Distinct in Snowflake #10818

Merged
merged 9 commits into from
Aug 16, 2024

Conversation

radeusgd
Copy link
Member

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.

@radeusgd radeusgd added the CI: No changelog needed Do not require a changelog entry for this PR. label Aug 14, 2024
@radeusgd radeusgd self-assigned this Aug 14, 2024
@radeusgd
Copy link
Member Author

Extra Nightly Tests

@radeusgd
Copy link
Member Author

radeusgd commented Aug 14, 2024

I wanted to see if the COUNT DISTINCT hacks we do affects performance of the query.
I tried the following code:

from Standard.Base import all
import Standard.Base.Errors.Illegal_State.Illegal_State

from Standard.Table import all
from Standard.Database import all
from Standard.Snowflake import all

my_snowflake =
    account_name = Environment.get "ENSO_SNOWFLAKE_ACCOUNT"
    if account_name.is_nothing then Nothing else
        get_var name =
            Environment.get name if_missing=(Panic.throw (Illegal_State.Error "ENSO_SNOWFLAKE_ACCOUNT is set, but "+name+" is not. Please set all required environment variables."))
        user = get_var "ENSO_SNOWFLAKE_USER"
        password = get_var "ENSO_SNOWFLAKE_PASSWORD"
        database = get_var "ENSO_SNOWFLAKE_DATABASE"
        schema = Environment.get "ENSO_SNOWFLAKE_SCHEMA" if_missing="PUBLIC"
        warehouse = Environment.get "ENSO_SNOWFLAKE_WAREHOUSE" if_missing=""

        resolved_password = if password.starts_with "enso://" then Enso_Secret.get password else password
        credentials = Credentials.Username_And_Password user resolved_password
        Snowflake_Details.Snowflake account_name credentials database schema warehouse

main =
    c = Database.connect my_snowflake
    n = 50000
    in_memory_table = Table.new [["X", (0.up_to n . map v-> if v%5 == 1 then Nothing else (v%5))]]
    in_memory_table.column_info.print
    in_memory_table.print
    table_name = "My_Big_Table"
    IO.println <| Duration.time_execution <|
        Panic.rethrow <| in_memory_table.select_into_database_table c table_name primary_key=[] temporary=True
    
    runs = 20
    measure_query query = 
        IO.println "Running "+query
        results = 0.up_to runs . map _->
            r = Duration.time_execution <|
                Panic.rethrow <| c.read (..Raw_SQL query)
            r.second.print
            r.first
        results.each t-> IO.println t.to_display_text

    IO.println "SIMPLE"
    measure_query 'SELECT COUNT(DISTINCT "X") FROM "'+table_name+'"'

    rnd = Random.uuid
    IO.println "old replaced"
    measure_query 'SELECT COUNT(DISTINCT COALESCE("X", {\'enso-null-replacement-marker\':\''+rnd+'\'}::variant)) FROM "'+table_name+'"'

    IO.println "new replaced+casted"
    measure_query 'SELECT COUNT(DISTINCT COALESCE("X"::variant, {\'enso-null-replacement-marker\':\''+rnd+'\'}::variant)) FROM "'+table_name+'"'

The biggest problem was that uploading only 50k rows took me... 5 minutes! And that's far too little rows to show any performance difference:
image

@@ -394,7 +394,7 @@ snowflake_specific_spec suite_builder default_connection db_name setup =
table_name = Name_Generator.random_name "TimestampTZ"
table = default_connection.get.create_table table_name [Column_Description.Value "A" (Value_Type.Date_Time with_timezone=True), Column_Description.Value "rowid" Value_Type.Integer] primary_key=[]
table.should_succeed
Panic.with_finalizer default_connection.get.drop_table table.name <|
Panic.with_finalizer (default_connection.get.drop_table table.name) <|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not give a compile-time error, having too many arguments?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, such stuff is only currently resolved at runtime.

Once #9812 is implemented, we'd get a warning if static type analysis were enabled.

@radeusgd
Copy link
Member Author

I've redone the analysis by manually uploading a 10M row dataset to Snowflake through the web panel and then running the queries on it. Here we can see a difference:
image

The 'Simple' flavour is COUNT(DISTINCT X) without the COALESCE, Old is what we had right now and New is the one with the fix for boolean columns. I don't see any significant difference between Old and New, so I think the fix is totally OK.

We can see however that we do pay a price for the COALESCE workaround that gives us the correct null behaviour. Thankfully, if we are doing distinct by a single column with ignore_nothing=True (non-default setting), we will get the efficient Simple query. Unfortunately, with the default ignore_nothing=True OR if we have more than one column, we will get the less efficient one.

@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Aug 14, 2024
@mergify mergify bot merged commit 09137f7 into develop Aug 16, 2024
36 checks passed
@mergify mergify bot deleted the wip/radeusgd/10611-snowflake-count-distinct-booleans branch August 16, 2024 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: No changelog needed Do not require a changelog entry for this PR. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix Count_Distinct on a Boolean column in Snowflake
3 participants