Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Extra string allocations in EXTRACT handling #19965

Open
solongordon opened this issue Nov 10, 2017 · 7 comments
Open

sql: Extra string allocations in EXTRACT handling #19965

solongordon opened this issue Nov 10, 2017 · 7 comments
Labels
A-sql-builtins SQL built-in functions and semantics thereof. C-performance Perf of queries or internals. Solution not expected to change functional behavior. E-starter Might be suitable for a starter project for new employees or team members. good first issue T-sql-queries SQL Queries Team

Comments

@solongordon
Copy link
Contributor

solongordon commented Nov 10, 2017

See discussion at https://reviewable.io/reviews/cockroachdb/cockroach/19923#-KyS_gHiEzAfKPa0fg4X:-Ky_Yc13IQxe498kphDh:b708j2.

In our handling for the extract built-in, we do string normalization on the precision argument. @justinj and @knz pointed out that there are some unnecessary string allocations here which could be reduced.

Also applies to extract_duration and date_trunc.

Jira issue: CRDB-5946

@knz knz added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Apr 27, 2018
@jordanlewis jordanlewis added the A-sql-builtins SQL built-in functions and semantics thereof. label Apr 30, 2018
@jordanlewis jordanlewis added E-starter Might be suitable for a starter project for new employees or team members. good first issue labels Aug 21, 2018
@jordanlewis
Copy link
Member

In general, I think we could fix this and other similar situations by introducing a way to normalize scalar builtin functions and their arguments.

We could have a way of defining a rule that looks at a scalar builtin and lets you normalize its arguments. But I'm not exactly sure whether this is going to be worth doing - how many circumstances of this nature really are there?

It turns out that TPCH query 7 actually uses extract, so I'm going to run a test with that query and see if the extra allocations in this issue actually make a difference or not.

@jordanlewis
Copy link
Member

cc @otan @knz - why were we planning to remove extract_duration in 20.2? I see a TODO to that effect that I don't understand.

@jordanlewis
Copy link
Member

Using the simpler query

select extract('YeAr' from l_commitdate) as f from lineitem group by f;

we see about 30% of the program's allocations in strings.ToLower :) Though we also see 30% just returning the new float that represents the year, so it's not totally clear that this is such a big deal still.

Still, it really would be nice to have a way to normalize the input arguments to builtins before running them once per every row... it feels like we should be able to somehow define that this function always lower()'s its first argument via the optimizer. Then, since lower() is an immutable function, it'll always get constant folded via the pre-existing constant folding rules 🤔

@cockroachdb/sql-optimizer any thoughts on this? It's not urgent but just curious.

@jordanlewis
Copy link
Member

Since extract is super weird, and completely handled by the parser, we could theoretically just tell the parser to add a lower() to the first argument unconditionally, but that seems inflexible, because there are other functions like date_trunc that have this same property that don't get special handling via the parser.

@RaduBerinde
Copy link
Member

We could have a crdb_internal_extract_from_lower variant and a normalization rule that converts extract(x) to crdb_internal_extract_from_lower(lower(x)).

craig bot pushed a commit that referenced this issue Jan 31, 2021
59598: parser: lowercase timespans in extract r=jordanlewis a=jordanlewis

Touches #19965. Confirmed that this (relatively silly) PR gets rid of the allocations in the experiment listed in #19965 (comment)

The `extract` builtin is kind of weird - it is actually supported by the
parser and not an ordinary builtin function that people can call without
parser support. As such, we can normalize its inputs right in the
parser.

The benefit of this is that, later, we are going to unconditionally
ToLower() the string arguments to extract - which costs an allocation.
So we can do this normalization up front to save a bunch of allocations
in queries that run `extract` over a lot of data rows.

Release note (performance improvement): improve the allocation
performance of workloads that use the `EXTRACT` builtin.

Co-authored-by: Jordan Lewis <[email protected]>
@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Sep 26, 2023
@mgartner mgartner moved this from Triage to New Backlog in SQL Queries Sep 26, 2023
@yuzefovich yuzefovich added the T-sql-queries SQL Queries Team label May 2, 2024
@XiaochenCui
Copy link
Contributor

XiaochenCui commented Jul 23, 2024

This is useful for extract since there are some test cases with uppercase arguments. I'll implement this later and provide a performance report.

More general idea:

I'm also exploring the possibilities of making this a general improvement for all built-in functions. The idea is to fold constants during optimization. However, it seems complicated the issue:

  • If we translate extract('year', ...) to extract(lower('year'), ...), it will result in significantly more CPU instructions than the original function, which is not good.
  • If we mark the first argument of extract as lower, it increases the code complexity a lot with little performance improvement.

Issues with the function extract:

By the way, I found that there are many test cases with unquoted arguments:

  • SELECT extract(hour from timetz '12:01:02.345678-07'), where "hour" is a constant.
  • SELECT extract(element, input::timestamp), where "element" is a column.

So what are the rules for argument resolution? What if there is a column named "hour"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-builtins SQL built-in functions and semantics thereof. C-performance Perf of queries or internals. Solution not expected to change functional behavior. E-starter Might be suitable for a starter project for new employees or team members. good first issue T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

7 participants