Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ir): add StringSlice #8832

Merged
merged 2 commits into from
Apr 2, 2024
Merged

feat(ir): add StringSlice #8832

merged 2 commits into from
Apr 2, 2024

Conversation

NickCrews
Copy link
Contributor

Fixes #8796

│ ric │
└───────────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━┓
┃ StringSlice(food, 3) ┃
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this repr isn't great: both [:3] and [3:] result in StringSlice(food, 3). But that is separate problem.

],
)
def test_string(backend, alltypes, df, result_func, expected_func):
expr = result_func(alltypes).name("tmp")
def test_substring(backend, alltypes, df, result_func, expected_func):
Copy link
Contributor Author

@NickCrews NickCrews Mar 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the diff for these test cases is really messy, sorry. What I did:

  • move some test cases to separate test_substring test
  • remove xfails/brokens marks since I fixed/implemented things in polars and mssql
  • removed no test cases
  • added the positive-index test case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, I was going to ask why you deleted test cases 😂

Fixes ibis-project#8796

This also improves the compilation for Substring:
- the bigquery, clickhouse, and datafusion implementations can just get deleted, they do the same thing as the base SqlGlotCompiler
- The mssql compiler didn't actually compile Substring(arg, start, None) properly, it created the simple `substr(arg, len)` SQL function,
but in mssql the length is always required. So fix that unrelated bug.
- Optimizes the SQL generated by visit_Substring: Before, the generated SQL was of the form `if start >= 1 then substr(arg, start, length) else substr(arg, start+len(arg), length)`. Now, it is like `substring(arg, if start >= 1 then start else start+len(arg), length)`. This reduces the number of appearances of arg from 3 to 2. In general we want to shrink the args of the ifelses to as small as possible.
- Add in the check for negative literal lengths to the SqlGlotCompiler
stop = self.length()

return self.substr(start, stop - start)
if start is not None and not isinstance(start, ir.Expr) and start < 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test case for this? Might be good to add one if there isn't. Same for the other seemingly uncovered lines below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is not currently a test case. Agreed should add one. In this PR or a followup?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be done in a follow-up.

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, question about uncovered lines when raising errors at the API level.

@cpcloud cpcloud added this to the 9.0 milestone Apr 1, 2024
@cpcloud cpcloud added internals Issues or PRs related to ibis's internal APIs ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI labels Apr 1, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Apr 1, 2024
@cpcloud
Copy link
Member

cpcloud commented Apr 1, 2024

I'll regenerate the snowflake snapshots.

@cpcloud cpcloud removed this from the 9.0 milestone Apr 1, 2024
return ops.Substring(_.arg, start=0, length=_.end)
if (
isinstance(_.start, ops.Literal)
and isinstance(_.start.value, int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rewrite rule is good as is, just noting that this could be spelled as a pattern as well: p.StringSlice(p.Literal(int), p.Literal(int).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in there would be a different rule for each of the patterns? (Literal, None), (None, literal), (literal, literal)? Maybe a full example of how this could be spelled? If there are different rules then would I need to import three different rewrites in all the compilers?

I'd love to learn for next time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ideal in this current scenario, just wanted to highlight you that nested patterns are also supported.

There are various ways to spell this, here is a more complicated one showing what can be done with the rewrite system:

from ibis.expr.rewrites import p, d

p.StringSlice(end=None) >> d.Substring(_.arg, start=_.start) |
p.StringSlice(start=None) >> d.Substring(_.arg, start=0, length=_.end) |
p.StringSlice(p.Literal(int), p.Literal(int)) >> d.Substring(_.arg, start=_.start, length=_.end.value - _.start.value) |
p.StringSlice >> d.Substring(_.arg, start=_.start, length=d.Subtract(_.end, _.start)

@cpcloud cpcloud changed the title feat: add StringSlice feat(ir): add StringSlice Apr 2, 2024
@cpcloud cpcloud added this to the 9.0 milestone Apr 2, 2024
@cpcloud cpcloud merged commit e4e3531 into ibis-project:main Apr 2, 2024
86 checks passed
@kszucs
Copy link
Member

kszucs commented Apr 2, 2024

Thanks @NickCrews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internals Issues or PRs related to ibis's internal APIs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

perf: natively support String.slice(start, end) as well as String.substr(start, length)
3 participants