Rename select to map #3782
Comments
I am going to second this concern. Its similarity to SQL also makes it feel peculiar on first inspection. The newcomer's internal monologue might be:
But with I don't have the confidence that |
When we say (I see that some of Worth checking out |
I too think that However, the way it is used here does not necessarily take a function as an argument, so it seems strange for me. |
Thanks @eitsupi for voicing another concern that I didn't express: To expand on my earlier note, I want to say that we need to overcome people's reservations about PRQL as an alternative to SQL. When I first read about PRQL, I remember thinking, "PRQL seems too good to be true... I wonder how much new stuff I'll have to learn." Keeping |
I guess @aljazerzen 's point might be that
So select { alias = func col1 col2 ... coln } is really shorthand for something like (in Pandas notation to try and avoid ambiguity):

df['alias'] = map(
    lambda row: func(row['col1'], row['col2'], ..., row['coln']),
    df.iterrows()
)

We get a fair amount of the question "Why have This also came up a lot in my discussions at ApacheCon with Julian Hyde and Mihai Budiu about the "two language problem" of SQL: there's the query language that we all think of as SQL (FROM, SELECT, ...) and then there's the expression language for working with column values. Opinion was divided on whether this is necessary and should be maintained or a better language would eliminate this.

I think these are big questions and we should leave things be as they are until we have really formalised the type system.
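To make the "select is a map over rows" reading concrete, here is a minimal, runnable Python sketch. It is only an illustration under assumptions: the relation is modelled as a plain list of dicts, and the names `select`, `add`, and the sample data are hypothetical stand-ins, not PRQL's implementation or any library API.

```python
from typing import Any, Callable

Row = dict[str, Any]


def select(relation: list[Row], row_fn: Callable[[Row], Row]) -> list[Row]:
    """Apply row_fn to every row and collect the new rows (a plain map)."""
    return [row_fn(row) for row in relation]


def add(a: int, b: int) -> int:
    """Hypothetical stand-in for `func` in the comment above."""
    return a + b


# Hypothetical sample relation with two columns.
relation = [{"col1": 1, "col2": 10}, {"col1": 2, "col2": 20}]

# Roughly what `select { alias = add col1 col2 }` would mean in this model:
# the column names are looked up on the row passed to the lambda.
result = select(relation, lambda row: {"alias": add(row["col1"], row["col2"])})
print(result)
```

In this model the column names are not free-standing variables; they only have meaning once a row is supplied, which is the "implicit lambda" idea discussed in this thread.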
A shorter answer is that I would suggest rather keeping Looks like Postgres doesn't have it (https://www.postgresql.org/docs/16/functions-array.html) and the usual solution is to UNNEST into rows, work on those and then |
I see a lot of backpressure, but I have to say I had expected it on this one. I don't expect this will get accepted, but I want to explain my motivation, as it is deeper than a mere rename.
This is exactly my point: I do see If we try to express

select((col_a + 1,), my_rel)

The pipelined value My explanation of why this is allowed (other than saying "select just works this way") says that, in Python notation, the query would actually be expressed like this:

select(lambda row: (row.col_a + 1, ), my_rel)

What I'm saying is that the PRQL expression
You might all wonder why I chose to spend time on mental gymnastics like this, since we do have bug reports open. When talking about PRQL, I like to say that it has "consistent semantics" and "compact name resolution rules". But in reality, this is not true - we do have some quite obscure rules for name resolution. We had a lot of discussion about "which column names are available where", "when is a column name overridden?", "how can names of relations in a query be used?" and it all boils down to a single question: How to refer to things within a pipeline?
So I think that the defining feature of PRQL is how we answer this question. And I don't want it to be what it currently is:
I think that the "implicit lambda functions" can define our name resolution and transforms much more elegantly, but unfortunately not exactly the same as they are now. This has now become a long rant, intended to explain my major concern with the language design of PRQL. If this problem resonates with anyone, we can continue this discussion, but otherwise it does take a lot of time to write down.
@aljazerzen Thanks for your detailed explanation. |
What I'm really proposing is making map for arrays and select for relations equivalent. Derive is very similar, but has slightly different functionality, so I don't mind keeping it.
I see. One question I have is, does this have anything to do with row-oriented or column-oriented?
v.s.
R's data.frame and pandas.DataFrame are actually the latter, right? |
Yes, dataframes are mostly the latter (row-oriented), but not all of them. The same applies to SQL databases - traditionally, they are row-oriented, but some of them opt for columnar storage. Ultimately, this does not matter for the query language - we can say that relations are row-oriented, but then compile all operations into columnar operations (i.e. regular SQL).
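As a small illustration of why the storage layout need not leak into the query language, the Python sketch below computes the same projection over a row-oriented and a column-oriented copy of one relation. Everything here is assumed for the example: the data, the helper names `select_rows`, `select_columns`, and `to_rows`, and the output column name `col_a_plus_1`.

```python
# The same hypothetical relation in two layouts.
rows = [{"col_a": 1, "col_b": "x"}, {"col_a": 2, "col_b": "y"}]
columns = {"col_a": [1, 2], "col_b": ["x", "y"]}


def select_rows(relation):
    """Row-oriented: map each row to a new row."""
    return [{"col_a_plus_1": row["col_a"] + 1} for row in relation]


def select_columns(relation):
    """Column-oriented: compute the new column from a whole input column."""
    return {"col_a_plus_1": [value + 1 for value in relation["col_a"]]}


def to_rows(cols):
    """Convert a dict of equal-length columns into a list of row dicts."""
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]


row_result = select_rows(rows)
column_result = select_columns(columns)
print(row_result == to_rows(column_result))
```

The semantics can be described per row while the execution works per column, which matches the idea of compiling row-wise transforms into columnar operations.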
I am wondering if it is possible to define

func row -> (
  use row.*;
  {col_a + 1}
)

Possibly this is a process specific to something like
Yes, that is very possible, I'd say that you could define them as:
No, that's a general concept that would be used to explain behavior of following transforms:
Missing here are:
Is automatic resolution of top-level names useful when considering things like nested tables or JSON-type columns?
I don't understand your last comment. |
I meant that I am concerned that the automatic insertion of something like |
@aljazerzen Thank you for taking the time to write out the details in #3782 (comment). I agree a lot with your drive to define a consistent semantics and simplify the name resolution rules. My main objection at this stage is to the choice of name Probably more importantly though, I think the naming is probably a secondary concern and we should figure out the correct semantics and name resolution rules. We could do that under the banner of From your list of name resolution rules, the one that I would really like to remove is:
SQL needs this because queries essentially have one scope / namespace (at least at the top level and ignoring things like subqueries) and there is no sequencing or logical flow. PRQL is different as each transform acts on the current relation. So while it is sometimes useful to prefix columns with the relation name in order to disambiguate duplicate column names after a join, I see this actually as an indication that we haven't clearly defined the semantics of I see at least two problems with keeping this:
An example will probably make it clearer:

from a
join b (==id)
select {a.x, b.y, b.z}
filter b.y > a.y

Say column My sense is that we've kept this around because in line 3 we have a problem in that we don't know which The problem is that after the Some approaches around this that I've seen:
I think the LINQ approach is the best as that was specifically designed to be that way by Erik Meijer in order to make LINQ monadic and in fact I believe that second lambda is (or is related to) the If we were to adopt this we would need some additional syntax:
from a
join b (==id) {x, b.y, b.z}

equivalent to

from a
join b (==id) {this.x, that.y, that.z}

Sorry, this turned out to be a rather longer post as well. That wasn't actually intended and some of these ideas just developed as I went along.
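The idea of giving the join an explicit tuple-merging function can be sketched in Python as follows. This is a hedged illustration only: `join`, its parameters, and the sample rows are hypothetical, and the merge lambda loosely mirrors the result-selector argument of LINQ's `Join` rather than any existing PRQL feature.

```python
def join(left, right, key, merge):
    """Inner join on `key`; `merge(this, that)` builds each output row."""
    output = []
    for this in left:
        for that in right:
            if this[key] == that[key]:
                output.append(merge(this, that))
    return output


# Hypothetical relations: both sides carry a column named "y".
a = [{"id": 1, "x": 1, "y": 5}]
b = [{"id": 1, "y": 10, "z": 100}]

# Roughly `join b (==id) {this.x, that.y, that.z}`: the merge function decides
# which side each output column comes from, so the result has unique names.
joined = join(
    a,
    b,
    "id",
    lambda this, that: {"x": this["x"], "y": that["y"], "z": that["z"]},
)
print(joined)
```

Because the merge function runs while both sides are still distinguishable, later transforms would not need relation prefixes such as a.y or b.y to disambiguate columns.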
@snth's suggestion of having
That's a good point, yeah. This issue is not about renaming Regarding the join: I would classify current behavior of It is not directly inherited from SQL, as in fact does work in the way that you say it should. Your example reports that Also your two concerns might not be that concerning.
Regardless, I like your proposal of a post-join-tuple-merger function. |
After reading your comment I was thinking that it might even remove the need for Referring to the docs at Identifiers & keywords >> this & that, the only remaining need for In the from invoices
join tracks (track_id==tracks.id)
I do see that, and I don't necessarily have a good answer yet, but my gut feeling is that what you gain in consistency makes it worth it. I thought that the following simplified version of my previous example would prove my point, but to my surprise this actually works in the playground:

let a = [{id=1, x=1}]
let b = [{id=1, x=10}]
from a
join b (==id)
#select {a.id, b.x}

which produces

id x id x
1 1 1 10

I guess that's because SQL is very lax and allows duplicate column names. However, if we want PRQL to be a general language that can apply in many diverse data processing examples, then I don't think we'll have that available. For example, say you wanted to use PRQL as a DSL for streams of structs in Rust or JSON objects. In both those cases field names would need to be unique so So I hear your concern about duplicating the behaviour of
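The uniqueness constraint for struct-like or JSON-like rows can be seen with plain Python dictionaries. The sketch below is an assumption-laden illustration: the rows reuse the values from the playground example, and `merge_qualified` with its "a."/"b." prefixing is a hypothetical workaround, not something PRQL does.

```python
a_row = {"id": 1, "x": 1}
b_row = {"id": 1, "x": 10}

# Naive merge: a dict cannot hold two "x" keys, so the right side silently wins.
naive = {**a_row, **b_row}


def merge_qualified(left, right, left_name, right_name):
    """Merge two rows, prefixing every field with its relation name."""
    merged = {}
    for name, row in ((left_name, left), (right_name, right)):
        for field, value in row.items():
            merged[f"{name}.{field}"] = value
    return merged


qualified = merge_qualified(a_row, b_row, "a", "b")
print(naive)
print(qualified)
```

SQL result sets tolerate the duplicate id and x columns shown above, but a target with unique field names has to either drop one side or rename, which is why an explicit merge step after the join is attractive.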
The topic of this issue is all over the place. I deem my proposal rejected with good reasons. @snth, I suggest you open a new issue about post-join-mapping.
What's up?
Since relations are just arrays of tuples, select is just mapping each tuple into a new tuple. This is equivalent to the behavior of the .map() function from other languages: map(), Array.map, Iterator.map, Stream.map.
Should we rename select to map?
The major downside is that right now, PRQL looks very similar to SQL, thanks to many similar names of PRQL transforms and SQL clauses.