feat: add string functions in a `string` module #3913

PrettyWood · 2023-12-10T23:31:04Z

Continue work on #217
closes #3238

max-sixty · 2023-12-11T09:18:24Z

prqlc/prql-compiler/src/semantic/std.prql

+let ltrim = column -> <text> internal std.ltrim
+let rtrim = column -> <text> internal std.rtrim
+let trim = column -> <text> internal std.trim
+let length = column -> <int> internal std.length


Fine for now, but eventually for length probably we want it within a string namespace; given how general the term is

Yes I'm planning to first open a PR to add the math module
And for this one I plan to add a str module
Would that be ok?

So we would have derive { x | str.lower | str.endswith "foo" }

A module str has been added. Do you prefer string?

Perfect re a str module!

No particularly strong view on str vs string. Possibly we should pick the one that is not the name of the type.

I'd also previously thought that "string" was a bit too "engineer-y", and most folks think a string is something used for tying. So we currently have from_text, for example. But I'm not sure we're going to win that battle! So I'm fine with any of those — str / string / text.

Any thoughts from others?

It seems like most languages prefer string or String over str and String or text.

C# uses string.Format

D uses std.string.isNumeric("123")

Elixir uses String.first("elixir")

F# uses String.length "Hello"

Go uses strings.ToLower("Hello")

Java uses String.format

JavaScript uses String.fromCharCode(65)

Kotlin uses String.format("Hello")

Lua uses string.lower("Hello")

PowerShell uses [string]::Format("Hello")

Python uses str.format

Ruby uses String::try_convert('Hello')

Rust uses String::from("Hello, world!");

Scala uses String.format("Hello")

VB.NET uses String.Format

prqlc/prql-compiler/src/sql/std.sql.prql

prqlc/prql-compiler/tests/integration/sql.rs

max-sixty · 2023-12-11T23:57:03Z

prqlc/prql-compiler/tests/integration/sql.rs

+#[case::generic(sql::Dialect::Generic, "LIKE CONCAT('%', 'pika', '%')")]
+#[case::sqlite(sql::Dialect::SQLite, "LIKE '%' || 'pika' || '%'")] // `CONCAT` is not supported in SQLite


Interesting. I had heard of rstest but haven't used it before. (I use pytest a lot, which I think is similar)

Happy to try this! (Maybe @aljazerzen has thoughts?)

(alternatively, we could have a normal loop and supply different names for the snapshot name)

I'm a huge fan of pytest and the parametrize + fixtures mechanisms.
And I'm a huge fan of rstest for the exact same reasons. It is quite similar and I feel like it would be a great value to cover many dialects since prql will probably need more tests to ensure compatibility across all the supported DBs. In my company we support 5 SQL databases/DWs and everything is parametrized on that.
But I can definitely remove that if you prefer

Very happy to try it for a while at least!

I'm +1 for rstest - on the macro-expansion level it does basically the same thing as for_each_file!, but offers much more flexibility.

aljazerzen

Nice work.

Again, it would be nice to also have test-dbs tests.

aljazerzen · 2023-12-12T08:40:00Z

prqlc/prql-compiler/tests/integration/sql.rs

+#[case::generic(sql::Dialect::Generic, "LIKE CONCAT('%', 'pika', '%')")]
+#[case::sqlite(sql::Dialect::SQLite, "LIKE '%' || 'pika' || '%'")] // `CONCAT` is not supported in SQLite


I'm +1 for rstest - on the macro-expansion level it does basically the same thing as for_each_file!, but offers much more flexibility.

for more information, see https://pre-commit.ci

PrettyWood · 2023-12-12T13:37:49Z

@aljazerzen @eitsupi @vanillajonathan @richb-hanover Following this comment. Sorry to ping you but since you are main contributors on this repo, your opinion matters for the name of the module that would contain main functions on strings
I went with str (so usage would be str.upper, str.ends_with). But we could go with string (not a huge fan since it just adds more characters for nothing) or text (maybe more enduser friendly I don't know)
Thank you 🙏

eitsupi · 2023-12-12T13:51:02Z

Thanks for mentioning me (and thanks for your great works!).
I don't feel uncomfortable with str because I think the abbreviation str is pretty common, but string might be better for sure.

vanillajonathan · 2023-12-12T14:31:08Z

The substring function is interesting, perhaps it could be used for indexing and slicing with brackets #1811

richb-hanover · 2023-12-12T16:00:37Z

I'll weigh in to support the full name string. Abbreviations become awkward because you don't always remember what to leave out (for example, why is it ls?)

PrettyWood · 2023-12-12T16:19:23Z

Ok I'll change for string you're already a majority 😄 This name can always be changed later
EDIT: done! @max-sixty do you agree with this last change? If yes I think we're good to merge

max-sixty · 2023-12-12T18:09:08Z

EDIT: done! @max-sixty do you agree with this last change? If yes I think we're good to merge

Great!

max-sixty · 2023-12-12T19:04:55Z

Thanks @PrettyWood ! These are great.

Indeed — I think when developing the compiler, it's easy for us to focus too much on "do we have a good algorithm for right-associative precedence", etc — but actually adding an equivalent for LIKE is much more important for most users. So these sorts of PRs have lots of impact.

aljazerzen · 2023-12-13T10:17:58Z

As a followup, we might want to reconsider type text and std.from_text.

max-sixty · 2023-12-13T18:41:18Z

As a followup, we might want to reconsider type text and std.from_text.

I am OK with changing these if folks prefer...

vanillajonathan · 2023-12-13T18:50:50Z

Yeah, I think we should change those, because text is pretty much only used in the context of SQLite, in the rest of the computing world, it is string.

See: #3913 (comment)

snth · 2023-12-18T20:24:21Z

Sorry for coming late to this (I was away for a while) but I would throw in my vote in favour of text or str. I agree with @PrettyWood that string just seems to add extra characters for no benefit.

I also think our earlier discussions around text vs string still hold and text would be my preferred option.

PrettyWood · 2023-12-18T20:29:32Z

For me the big downside of text is bytes vs strings. How do you differentiate them? I agree that often people want strings but not always! and even here there are subtleties like encoding, chars vs graphemes, ...

snth · 2023-12-18T20:51:06Z

I think text would pretty much be a synonym for unicode strings. If you want bytes then you would need another type like bytea. I lived through the upheaval around this in the Python 2 to 3 transition but it's not something I've ever encountered in database work. Is it much of an issue there?

The closest I came was when working with MS SQL 15 years ago, you would sometimes get unexpected ordering depending on what collation you chose for your database. Is that still much of an issue these days? It's not something I've encountered in a long time so I really can't say.

PrettyWood · 2023-12-18T20:58:42Z

To be honest my interest in PRQL is to simplify the refactoring of a custom DSL we have that translates many steps into Mongo, Polars and many SQL dialects.
If PRQL is first meant for data scientists/analysts who are not developers, maybe choosing text is better.
DuckDB went with that approach as well
I'm not familiar with the primary target of PRQL so I might be biased

vanillajonathan · 2023-12-18T21:31:05Z

Well, string is so widely prevalent in the computing world that I would think it probably is familiar even with non-developers such a data scientists and analysts.

aljazerzen · 2023-12-19T07:15:23Z

I strongly agree with @snth: regardless of how it is named, type string or type text should contain UTF-8 encoded text. For arbitrary bytes there should be another type.

But I'm impartial on the name. A quick survey of exactly 1 person showed that string is not familiar to biologists, who tend to know the basics of R. On the other hand, string really is prevalent in the computing world.

max-sixty · 2023-12-19T16:40:38Z

A quick survey of a couple of others reinforces that "string" is actually not that well-known. From a program manager at SpaceX (i.e. someone who is close to lots of technical things without writing code herself):

Not a lot to me. Something about a string of logic, but that's about it for me

max-sixty reviewed Dec 11, 2023

View reviewed changes

refactor: create a math module

e518714

PrettyWood force-pushed the string-functions branch from 0261574 to 3bb88f3 Compare December 11, 2023 13:16

PrettyWood changed the base branch from main to add-math-module December 11, 2023 13:17

PrettyWood changed the title ~~feat: add string functions~~ feat: add string functions in a str module Dec 11, 2023

test: make sure sub modules are handled with dialect

c2df14b

PrettyWood force-pushed the string-functions branch from a5d0ad2 to 2bb4844 Compare December 11, 2023 13:32

feat: add string functions

6500782

PrettyWood force-pushed the string-functions branch from 2bb4844 to 6500782 Compare December 11, 2023 14:34

PrettyWood mentioned this pull request Dec 11, 2023

refactor: add a math module #3916

Merged

fix: remove brackets and support sqlite

080c120

PrettyWood force-pushed the string-functions branch from 9707ce1 to 080c120 Compare December 11, 2023 23:31

max-sixty reviewed Dec 11, 2023

View reviewed changes

aljazerzen approved these changes Dec 12, 2023

View reviewed changes

Base automatically changed from add-math-module to main December 12, 2023 10:01

Merge branch 'main' into string-functions

8a12fe4

PrettyWood marked this pull request as ready for review December 12, 2023 10:39

pre-commit-ci bot and others added 2 commits December 12, 2023 10:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

687441a

for more information, see https://pre-commit.ci

test: clean name of snapshot

c554264

refactor: rename module from str to string

f09ef81

PrettyWood changed the title ~~feat: add string functions in a str module~~ feat: add string functions in a string module Dec 12, 2023

PrettyWood merged commit 79c6b08 into main Dec 12, 2023
75 checks passed

PrettyWood deleted the string-functions branch December 12, 2023 18:44

PrettyWood mentioned this pull request Dec 18, 2023

feat: add date.to_text for MySQL, Postgres, MSSQL and Clickhouse + doc #3955

Merged

max-sixty mentioned this pull request Dec 19, 2023

Release plan for 0.11.0 #3929

Closed

aljazerzen mentioned this pull request Dec 20, 2023

Name of the type for UTF-8 encoded strings of textual data #3981

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add string functions in a `string` module #3913

feat: add string functions in a `string` module #3913

PrettyWood commented Dec 10, 2023 •

edited

Loading

max-sixty Dec 11, 2023

PrettyWood Dec 11, 2023

PrettyWood Dec 11, 2023

max-sixty Dec 11, 2023 •

edited

Loading

vanillajonathan Dec 12, 2023 •

edited

Loading

max-sixty Dec 11, 2023

PrettyWood Dec 12, 2023 •

edited

Loading

max-sixty Dec 12, 2023

aljazerzen Dec 12, 2023

aljazerzen left a comment

aljazerzen Dec 12, 2023

PrettyWood commented Dec 12, 2023

eitsupi commented Dec 12, 2023

vanillajonathan commented Dec 12, 2023

richb-hanover commented Dec 12, 2023

PrettyWood commented Dec 12, 2023 •

edited

Loading

max-sixty commented Dec 12, 2023

max-sixty commented Dec 12, 2023

aljazerzen commented Dec 13, 2023

max-sixty commented Dec 13, 2023

vanillajonathan commented Dec 13, 2023

snth commented Dec 18, 2023

PrettyWood commented Dec 18, 2023

snth commented Dec 18, 2023

PrettyWood commented Dec 18, 2023

vanillajonathan commented Dec 18, 2023 •

edited

Loading

aljazerzen commented Dec 19, 2023

max-sixty commented Dec 19, 2023 •

edited

Loading

		#[case::generic(sql::Dialect::Generic, "LIKE CONCAT('%', 'pika', '%')")]
		#[case::sqlite(sql::Dialect::SQLite, "LIKE '%' \|\| 'pika' \|\| '%'")] // `CONCAT` is not supported in SQLite

feat: add string functions in a string module #3913

feat: add string functions in a string module #3913

Conversation

PrettyWood commented Dec 10, 2023 • edited Loading

max-sixty Dec 11, 2023

Choose a reason for hiding this comment

PrettyWood Dec 11, 2023

Choose a reason for hiding this comment

PrettyWood Dec 11, 2023

Choose a reason for hiding this comment

max-sixty Dec 11, 2023 • edited Loading

Choose a reason for hiding this comment

vanillajonathan Dec 12, 2023 • edited Loading

Choose a reason for hiding this comment

max-sixty Dec 11, 2023

Choose a reason for hiding this comment

PrettyWood Dec 12, 2023 • edited Loading

Choose a reason for hiding this comment

max-sixty Dec 12, 2023

Choose a reason for hiding this comment

aljazerzen Dec 12, 2023

Choose a reason for hiding this comment

aljazerzen left a comment

Choose a reason for hiding this comment

aljazerzen Dec 12, 2023

Choose a reason for hiding this comment

PrettyWood commented Dec 12, 2023

eitsupi commented Dec 12, 2023

vanillajonathan commented Dec 12, 2023

richb-hanover commented Dec 12, 2023

PrettyWood commented Dec 12, 2023 • edited Loading

max-sixty commented Dec 12, 2023

max-sixty commented Dec 12, 2023

aljazerzen commented Dec 13, 2023

max-sixty commented Dec 13, 2023

vanillajonathan commented Dec 13, 2023

snth commented Dec 18, 2023

PrettyWood commented Dec 18, 2023

snth commented Dec 18, 2023

PrettyWood commented Dec 18, 2023

vanillajonathan commented Dec 18, 2023 • edited Loading

aljazerzen commented Dec 19, 2023

max-sixty commented Dec 19, 2023 • edited Loading

feat: add string functions in a `string` module #3913

feat: add string functions in a `string` module #3913

PrettyWood commented Dec 10, 2023 •

edited

Loading

max-sixty Dec 11, 2023 •

edited

Loading

vanillajonathan Dec 12, 2023 •

edited

Loading

PrettyWood Dec 12, 2023 •

edited

Loading

PrettyWood commented Dec 12, 2023 •

edited

Loading

vanillajonathan commented Dec 18, 2023 •

edited

Loading

max-sixty commented Dec 19, 2023 •

edited

Loading