Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add string functions in a string module #3913

Merged
merged 8 commits into from
Dec 12, 2023
Merged

Conversation

PrettyWood
Copy link
Collaborator

@PrettyWood PrettyWood commented Dec 10, 2023

Continue work on #217
closes #3238

Screenshot 2023-12-12 at 19 01 41

let ltrim = column -> <text> internal std.ltrim
let rtrim = column -> <text> internal std.rtrim
let trim = column -> <text> internal std.trim
let length = column -> <int> internal std.length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for now, but eventually for length probably we want it within a string namespace; given how general the term is

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'm planning to first open a PR to add the math module
And for this one I plan to add a str module
Would that be ok?

So we would have derive { x | str.lower | str.endswith "foo" }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A module str has been added. Do you prefer string?

Copy link
Member

@max-sixty max-sixty Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect re a str module!

No particularly strong view on str vs string. Possibly we should pick the one that is not the name of the type.

I'd also previously thought that "string" was a bit too "engineer-y", and most folks think a string is something used for tying. So we currently have from_text, for example. But I'm not sure we're going to win that battle! So I'm fine with any of those — str / string / text.

Any thoughts from others?

Copy link
Collaborator

@vanillajonathan vanillajonathan Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like most languages prefer string or String over str and String or text.

  • C# uses string.Format
  • D uses std.string.isNumeric("123")
  • Elixir uses String.first("elixir")
  • F# uses String.length "Hello"
  • Go uses strings.ToLower("Hello")
  • Java uses String.format
  • JavaScript uses String.fromCharCode(65)
  • Kotlin uses String.format("Hello")
  • Lua uses string.lower("Hello")
  • PowerShell uses [string]::Format("Hello")
  • Python uses str.format
  • Ruby uses String::try_convert('Hello')
  • Rust uses String::from("Hello, world!");
  • Scala uses String.format("Hello")
  • VB.NET uses String.Format

prqlc/prql-compiler/src/sql/std.sql.prql Outdated Show resolved Hide resolved
prqlc/prql-compiler/tests/integration/sql.rs Outdated Show resolved Hide resolved
@PrettyWood PrettyWood changed the base branch from main to add-math-module December 11, 2023 13:17
@PrettyWood PrettyWood changed the title feat: add string functions feat: add string functions in a str module Dec 11, 2023
Comment on lines +190 to +191
#[case::generic(sql::Dialect::Generic, "LIKE CONCAT('%', 'pika', '%')")]
#[case::sqlite(sql::Dialect::SQLite, "LIKE '%' || 'pika' || '%'")] // `CONCAT` is not supported in SQLite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I had heard of rstest but haven't used it before. (I use pytest a lot, which I think is similar)

Happy to try this! (Maybe @aljazerzen has thoughts?)

(alternatively, we could have a normal loop and supply different names for the snapshot name)

Copy link
Collaborator Author

@PrettyWood PrettyWood Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a huge fan of pytest and the parametrize + fixtures mechanisms.
And I'm a huge fan of rstest for the exact same reasons. It is quite similar and I feel like it would be a great value to cover many dialects since prql will probably need more tests to ensure compatibility across all the supported DBs. In my company we support 5 SQL databases/DWs and everything is parametrized on that.
But I can definitely remove that if you prefer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very happy to try it for a while at least!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 for rstest - on the macro-expansion level it does basically the same thing as for_each_file!, but offers much more flexibility.

Copy link
Member

@aljazerzen aljazerzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.

Again, it would be nice to also have test-dbs tests.

Comment on lines +190 to +191
#[case::generic(sql::Dialect::Generic, "LIKE CONCAT('%', 'pika', '%')")]
#[case::sqlite(sql::Dialect::SQLite, "LIKE '%' || 'pika' || '%'")] // `CONCAT` is not supported in SQLite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 for rstest - on the macro-expansion level it does basically the same thing as for_each_file!, but offers much more flexibility.

Base automatically changed from add-math-module to main December 12, 2023 10:01
@PrettyWood PrettyWood marked this pull request as ready for review December 12, 2023 10:39
@PrettyWood
Copy link
Collaborator Author

@aljazerzen @eitsupi @vanillajonathan @richb-hanover Following this comment. Sorry to ping you but since you are main contributors on this repo, your opinion matters for the name of the module that would contain main functions on strings
I went with str (so usage would be str.upper, str.ends_with). But we could go with string (not a huge fan since it just adds more characters for nothing) or text (maybe more enduser friendly I don't know)
Thank you 🙏

@eitsupi
Copy link
Member

eitsupi commented Dec 12, 2023

Thanks for mentioning me (and thanks for your great works!).
I don't feel uncomfortable with str because I think the abbreviation str is pretty common, but string might be better for sure.

@vanillajonathan
Copy link
Collaborator

The substring function is interesting, perhaps it could be used for indexing and slicing with brackets #1811

@richb-hanover
Copy link
Contributor

I'll weigh in to support the full name string. Abbreviations become awkward because you don't always remember what to leave out (for example, why is it ls?)

@PrettyWood
Copy link
Collaborator Author

PrettyWood commented Dec 12, 2023

Ok I'll change for string you're already a majority 😄 This name can always be changed later
EDIT: done! @max-sixty do you agree with this last change? If yes I think we're good to merge

@max-sixty
Copy link
Member

EDIT: done! @max-sixty do you agree with this last change? If yes I think we're good to merge

Great!

@PrettyWood PrettyWood changed the title feat: add string functions in a str module feat: add string functions in a string module Dec 12, 2023
@PrettyWood PrettyWood merged commit 79c6b08 into main Dec 12, 2023
75 checks passed
@PrettyWood PrettyWood deleted the string-functions branch December 12, 2023 18:44
@max-sixty
Copy link
Member

Thanks @PrettyWood ! These are great.

Indeed — I think when developing the compiler, it's easy for us to focus too much on "do we have a good algorithm for right-associative precedence", etc — but actually adding an equivalent for LIKE is much more important for most users. So these sorts of PRs have lots of impact.

@aljazerzen
Copy link
Member

As a followup, we might want to reconsider type text and std.from_text.

@max-sixty
Copy link
Member

As a followup, we might want to reconsider type text and std.from_text.

I am OK with changing these if folks prefer...

@vanillajonathan
Copy link
Collaborator

Yeah, I think we should change those, because text is pretty much only used in the context of SQLite, in the rest of the computing world, it is string.

See: #3913 (comment)

@snth
Copy link
Member

snth commented Dec 18, 2023

Sorry for coming late to this (I was away for a while) but I would throw in my vote in favour of text or str. I agree with @PrettyWood that string just seems to add extra characters for no benefit.

I also think our earlier discussions around text vs string still hold and text would be my preferred option.

@PrettyWood
Copy link
Collaborator Author

For me the big downside of text is bytes vs strings. How do you differentiate them? I agree that often people want strings but not always! and even here there are subtleties like encoding, chars vs graphemes, ...

@snth
Copy link
Member

snth commented Dec 18, 2023

I think text would pretty much be a synonym for unicode strings. If you want bytes then you would need another type like bytea. I lived through the upheaval around this in the Python 2 to 3 transition but it's not something I've ever encountered in database work. Is it much of an issue there?

The closest I came was when working with MS SQL 15 years ago, you would sometimes get unexpected ordering depending on what collation you chose for your database. Is that still much of an issue these days? It's not something I've encountered in a long time so I really can't say.

@PrettyWood
Copy link
Collaborator Author

To be honest my interest in PRQL is to simplify the refactoring of a custom DSL we have that translates many steps into Mongo, Polars and many SQL dialects.
If PRQL is first meant for data scientists/analysts who are not developers, maybe choosing text is better.
DuckDB went with that approach as well
I'm not familiar with the primary target of PRQL so I might be biased

@vanillajonathan
Copy link
Collaborator

vanillajonathan commented Dec 18, 2023

Well, string is so widely prevalent in the computing world that I would think it probably is familiar even with non-developers such a data scientists and analysts.

@aljazerzen
Copy link
Member

I strongly agree with @snth: regardless of how it is named, type string or type text should contain UTF-8 encoded text. For arbitrary bytes there should be another type.

But I'm impartial on the name. A quick survey of exactly 1 person showed that string is not familiar to biologists, who tend to know the basics of R. On the other hand, string really is prevalent in the computing world.

@max-sixty
Copy link
Member

max-sixty commented Dec 19, 2023

A quick survey of a couple of others reinforces that "string" is actually not that well-known. From a program manager at SpaceX (i.e. someone who is close to lots of technical things without writing code herself):

Not a lot to me. Something about a string of logic, but that's about it for me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PRQL equivalent of ilike
7 participants