Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] GpuExpression columnarEval can return scalars from subqueries that may be unhandled #8303

Closed
jlowe opened this issue May 16, 2023 · 6 comments · Fixed by #8833
Closed
Assignees
Labels
bug Something isn't working reliability Features to improve reliability or bugs that severly impact the reliability of the plugin

Comments

@jlowe
Copy link
Member

jlowe commented May 16, 2023

#8293 is an instance of a case where a subquery caused a sub-expression to be evaluated as a scalar in a context that expected a column vector. One way to handle it is to explode the scalar out to a full column the same size as the batch and proceed as if it were a column all along (not necessarily the most performant), or update all the code to treat scalars as scalars (which may involve hacking the scalar into a 1-row column since most cudf functions operate on columns not scalars).

If we want to go the "always explode the scalar into a full column" route, then we should make it so GpuExpression.columnarEval is incapable of returning a scalar -- e.g.: it will always wrap the real eval with something like GpuExpressionsUtils.columnarEvalToColumn.

@jlowe jlowe added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 16, 2023
@revans2
Copy link
Collaborator

revans2 commented May 17, 2023

We cannot/should not force GpuExpression.columnarEval to always return a ColumnVector without some kind of an escape valve. We at least need a way for GpuLiteral, which is a GpuExpression to return a GpuScalar, or we are going to completely remove all scalars from the code which will also mess up processing that only supports a literal value for a particular input and expects it.

@jlowe
Copy link
Member Author

jlowe commented May 17, 2023

True, we still need the ability to get scalars in some situations. Does it make sense to have two forms of expression eval, one that expects to resolve to a scalar because it is explicitly expected, and one that forces it to be a columnar vector (e.g.: via columnarEvalToColumn)? As it is now, we're likely to keep having to fix places that aren't prepared to handle both.

If we don't want two eval forms, then this issue becomes a tracker to audit all the places where subexpressions are evaluated and updating them to handle scalars (and hope we don't add more places that aren't properly dealing with scalars).

@revans2
Copy link
Collaborator

revans2 commented May 17, 2023

It is not just sub-expressions. GpuCoalesce can also return a scalar and I think there may be others. But in general, yes I think what we want is two APIs. One that is guaranteed to return a ColumnVector and another that is not. We have that today, but it could really be cleaned up. If you call the API that could return both you have to support it being a GpuScalar or a ColumnVector no exceptions. That means we are going to have to update Gpu{Unary|Binary|Trinary}Expression so that it is clean to support cases where we don't cover everything. GpuArrayContains is one example where we don't cover all situations, but there are lots of them.

This is not going to be a super simple fix.

If you call the one that returns a ColumnVector you just need to handle that case.

@mattahrens mattahrens added reliability Features to improve reliability or bugs that severly impact the reliability of the plugin and removed ? - Needs Triage Need team to review and classify labels May 23, 2023
@abellina
Copy link
Collaborator

abellina commented Jul 19, 2023

I am keeping this issue open from the PR because there is follow on work that has to happen as P0:

  • Currently, GpuScalarSubquery returns a GpuScalar when columnarEvalAny is called, and GpuColumnVector when columnarEval is returned. We need to make sure this is OK, it is not clear to me still, but I think that if we handle this second point that's OK.
  • We need to audit all binary/ternary ops that throw from doColumnar and likely update most to handle GpuScalar. I believe some of this may require cuDF changes.

@jlowe @revans2 please comment if I've missed something.

@abellina
Copy link
Collaborator

abellina commented Jul 26, 2023

List of binary expressions that throw from a doColumnar with a GpuScalar argument:

  • doColumnar(GpuScalar, GpuColumnVector):

    • GpuGetJsonObject: getJSONObject in cuDF JNI expects a scalar for the path expression to search for.

    • GpuSortArray: this needs to have a scalar as the sort direction (for the CPU also), so it now implements GpuBinaryExpression_Any_Scalar so that the first argument can be anything, but the second argument (direction) has to be GpuScalar.

    • GpuGetMapValue: implemented exception cases to not throw

    • GpuArrayContains: implemented exception cases to not throw

    • GpuRoundBase: The RHS in this expression has to be foldable, or Spark won't plan it. Changed it to implement GpuBinaryExpression_Any_Scalar.

    • GpuStartsWith: cuDF JNI doesn't implement V.startsWith(V). Changed it to implement GpuBinaryExpression_Any_Scalar.

    • GpuEndsWith: cuDF JNI doesn't implement V.endsWith(V). Changed it to implement GpuBinaryExpression_Any_Scalar.

    • GpuContains: cuDF JNI doesn't implement V.stringContains(V). Changed it to implement GpuBinaryExpression_Any_Scalar.

    • GpuLike: cuDF JNI doesn't implement V.like(V). Changed it to implement GpuBinaryExpression_Any_Scalar.

    • GpuRLike: cuDF JNI implements this as V.containsRe(regex program). I am not sure if I could generate a column of "regex programes" . Changed it to implement GpuBinaryExpression_Any_Scalar.

    • GpuStringInstr: cuDF JNI doesn't implement V.stringLocate(V). Changed it to implement GpuBinaryExpression_Any_Scalar.

    • These are computing a format string strFormat on the driver that they then uses on the executors. So all these date functions need a literal as the right argument, and they have checks for that. I changed it to implement GpuBinaryExpression_Any_Scalar:

      • GpuDateFormatClass
      • GpuToTimestamp
      • GpuFromUnixTime
      • GpuFromUTCTimestamp
  • doColumnar(GpuColumnVector, GpuScalar): none

  • doColumnar(numRows, GpuScalar, GpuScalar):

    • GpuArrayContains: implemented exception cases to not throw

@abellina
Copy link
Collaborator

abellina commented Jul 26, 2023

GpuTernaryExpressions are all string expressions, and there are less of them.

  • GpuStringLocate: SQL api: locate(substr, str[, end]) we only support literal substr, and literal end, both of which can be columns in Spark. For now this function implements GpuTernaryExpression_Scalar_Any_Scalar and it uses TypeSig.lit to enforce both scalars, and fallback. This issue [FEA] Support a strings_column_view as the target parameter in cudf::string::find rapidsai/cudf#12013, if implemented fully, would let us implement all permutations (Any_Any_Any).
  • GpuStringReplace: We will implement GpuTernaryExpression_Any_Scalar_Scalar. We ensure that the arguments restricted to scalar are literal else we fallback. We need to work with cuDF to support passing a scalar or a vector for each of searchExpr and replaceExpr, as currently only vector/vector and scalar/scalar are supported, and vector/vector is not used for a row-wise replace (instead it's a "replace for this row, the pairs found from the first vector with the corresponding element in the second vector")
  • GpuStringTranslate: We will implement GpuTernaryExpression_Any_Scalar_Scalar. It requires that fromExpr and toExpr both be literal strings. This would require work from cuDF to setup the replacement lists. It is similar to replace, but cuDF would have to take from and to strings in each row and make sure it can do a character by character replace (from[0] -> to[0]), with spaces if to.length is less than from.length.
  • GpuRegExpTernaryBase: This has to be GpuTernaryExpression_Any_Scalar_Scalar. We have made sure that we ask for literals for the regular expression and the position.
  • GpuSubstringIndex: This has to be GpuTernaryExpression_Any_Scalar_Scalar for now. We have follow on issues to support delimiters that are more than one character and not use the regular expression. [FEA] Rework GpuSubstringIndex to use cudf::slice_strings #8750
  • BasePad: spark doesn't restrict any of the arguments, so we could pad a string column, using a column of characters to use as padding, and with a column stating the padding length. We would need to work with cuDF to support this. For now it will implement GpuTernaryExpression_Any_Scalar_Scalar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants