Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cast inside aggregation fails with type error #3534

Open
2 tasks done
snth opened this issue Sep 18, 2023 · 6 comments
Open
2 tasks done

Cast inside aggregation fails with type error #3534

snth opened this issue Sep 18, 2023 · 6 comments
Labels
language-design Changes to PRQL-the-language

Comments

@snth
Copy link
Member

snth commented Sep 18, 2023

What's up?

What happened?

Cast inside aggregation fails with type error.

PRQL input

from artists
derive {artist_int = (artist_id | as int)}
aggregate {
  total_int = sum artist_int,
  total_fail = sum (artist_id | as int),
}

SQL output

Error: 
   ╭─[:5:33]
   │
 5 │   total_fail = sum (artist_id | as int),
   │                                 ───┬──  
   │                                    ╰──── function std.sum, param `column` expected type `array`, but found type `scalar`
   │ 
   │ Help: Type `array` expands to `[null]`
───╯

Expected SQL output

SELECT
  COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_int,
  COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_fail
FROM
  artists

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org)

MVCE confirmation

  • Minimal example
  • New issue

Anything else?

No response

@snth snth added the bug Invalid compiler output or panic label Sep 18, 2023
@snth
Copy link
Member Author

snth commented Sep 18, 2023

@max-sixty I've copied your comment over:

Thanks, this does indeed look like a bug.

For the moment, a workaround is to create the column outside the aggregation (as on the derive line above).

@aljazerzen
Copy link
Member

This is a symptom of #2723

as implies that the column is of type scalar (actually it should be int, but we are not there yet).

sum expects an array of scalars.

The compiler is basically saying "so, which one is it? What is the type of (artist_id | as int)?"

@aljazerzen aljazerzen added language-design Changes to PRQL-the-language and removed bug Invalid compiler output or panic labels Nov 9, 2023
@snth
Copy link
Member Author

snth commented Nov 10, 2023

I don't think I agree. All column expressions are always expressed as if operating on a single value with the understanding that they are implicitly "mapped"/broadcast over all rows.

What is the type of (artist_id | as int)?"

A column of ints, which can also be seen from the fact that

from artists
derive {artist_int = (artist_id | as int)}
aggregate {
  total_int = sum artist_int,
#  total_fail = sum (artist_id | as int),
}

compiles to

SELECT
  COALESCE(SUM(CAST(artist_id AS int)), 0) AS total_int
FROM
  artists

-- Generated by PRQL compiler version:0.10.0 (https://prql-lang.org)

So in my mind, the bug report is about why does it work when defined with a separate derive but not when the same expression is the argument to a sum aggregation?

@aljazerzen
Copy link
Member

What is the type of (artist_id | as int)?"

A column of ints, which can also be seen from the fact that

If it a column of ints, then the cast should be artist_id | as [int], no? I think this boils down to what is our understanding of the PRQL type system, which we talk about in #2723, and which has not been closed yet.

Current implementation of the compiler and the type system both stem from my understanding of the language, where [int] and int are two distinct types and there is no "operator broadcasting" that would broadcast as int to each value in the column. But as I said, this is not something that has been decided on, we need to finish #2723

@snth
Copy link
Member Author

snth commented Nov 10, 2023

Totally agree about finishing #2723 , but in the meantime how is this different to any of the other type inference that's going on?

What are the types of income and income > 5 in the first few lines of the Playground example?

from invoices
derive {
  transaction_fee = 0.8,
  income = total - transaction_fee
}
filter income > 5

Assuming <transaction_fee> == float and <total> == [float], then

  • <income> == [float] - float == [float] ?
  • <income > 5> == [float] > int == [bool] ?

I think that's not how these things are usually understood though. Usually everything is interpreted at a single row / tuple level and the types are rather:

  • <transaction_fee> == float
  • <total> == float
  • <income> == float - float == float
  • <income > 5> == float > int == bool

So similarly it should be that <artist_id | as int> == int.

@aljazerzen
Copy link
Member

Yes, I very much agree on your conclusion of what the types should be.

Going forward, I would also say that in this query:

from invoices
derive {
  transaction_fee = 0.8,
  income = total - transaction_fee
}
aggregate { total = sum income }
  • <total - transaction_fee> == float
  • type of the income within derive is float,
  • type of the income within aggregate is [float], as sum cannot operate on a single float, but requires an array

So that would answer your question of "why can I cast within derive, but not within aggregate". derive operates row-wise, but aggregate operates column-wise; it applies aggregation to the whole column at once.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language-design Changes to PRQL-the-language
Projects
None yet
Development

No branches or pull requests

2 participants