Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use parse_url kernel for QUERY literal and column key #10211

Merged
merged 5 commits into from
Jan 26, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 21 additions & 8 deletions integration_tests/src/main/python/url_test.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2023, NVIDIA CORPORATION.
# Copyright (c) 2023-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -26,7 +26,7 @@
r'(:[0-9]{1,3}){0,1}(/[a-zA-Z0-9]{1,3}){0,3}(\?[a-zA-Z0-9]{1,3}=[a-zA-Z0-9]{1,3}){0,1}(#([a-zA-Z0-9]{1,3})){0,1}'

url_pattern_with_key = r'((http|https|ftp|file)://)(([a-z]{1,3}\.){0,3}([a-z]{1,3})\.([a-z]{1,3}))' \
r'(:[0-9]{1,3}){0,1}(/[a-z]{1,3}){0,3}(\?key=[a-z]{1,3}){0,1}(#([a-z]{1,3})){0,1}'
r'(:[0-9]{1,3}){0,1}(/[a-z]{1,3}){0,3}(\?[a-c]{1,3}=[a-z]{1,3}(&[a-c]{1,3}=[a-z]{1,3}){0,3}){0,1}(#([a-z]{1,3})){0,1}'

edge_cases = [
"[email protected]/path?query=1#Ref",
Expand Down Expand Up @@ -150,8 +150,8 @@

supported_parts = ['PROTOCOL', 'HOST', 'QUERY']
unsupported_parts = ['PATH', 'REF', 'FILE', 'AUTHORITY', 'USERINFO']
supported_with_key_parts = ['PROTOCOL', 'HOST']
unsupported_with_key_parts = ['QUERY', 'PATH', 'REF', 'FILE', 'AUTHORITY', 'USERINFO']
supported_with_key_parts = ['PROTOCOL', 'HOST', 'QUERY']
unsupported_with_key_parts = ['PATH', 'REF', 'FILE', 'AUTHORITY', 'USERINFO']

@pytest.mark.parametrize('data_gen', [url_gen, edge_cases_gen], ids=idfn)
@pytest.mark.parametrize('part', supported_parts, ids=idfn)
Expand All @@ -166,16 +166,29 @@ def test_parse_url_unsupported_fallback(part):
lambda spark: unary_op_df(spark, url_gen).selectExpr("a", "parse_url(a, '" + part + "')"),
'ParseUrl')

def test_parse_url_query_with_key():
url_gen = StringGen(url_pattern_with_key)
assert_gpu_and_cpu_are_equal_collect(
lambda spark: unary_op_df(spark, url_gen)
.selectExpr("a", "parse_url(a, 'QUERY', 'abc')", "parse_url(a, 'QUERY', 'a')")
)
revans2 marked this conversation as resolved.
Show resolved Hide resolved

@allow_non_gpu('ProjectExec', 'ParseUrl')
def test_parse_url_query_with_key_regex_fallback():
url_gen = StringGen(url_pattern_with_key)
assert_gpu_fallback_collect(
lambda spark: unary_op_df(spark, url_gen)
.selectExpr("a", "parse_url(a, 'QUERY', 'a?c')", "parse_url(a, 'QUERY', '*')"),
'ParseUrl')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It would probably be best to separate these out into different test runs, as the project will fall back to the CPU if any of them say they are not supported instead of if all of them do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


@pytest.mark.parametrize('part', supported_with_key_parts, ids=idfn)
def test_parse_url_with_key(part):
assert_gpu_and_cpu_are_equal_collect(
lambda spark: unary_op_df(spark, url_gen).selectExpr("parse_url(a, '" + part + "', 'key')"))



@allow_non_gpu('ProjectExec', 'ParseUrl')
@pytest.mark.parametrize('part', unsupported_with_key_parts, ids=idfn)
def test_parse_url_query_with_key_fallback(part):
def test_parse_url_with_key_fallback(part):
assert_gpu_fallback_collect(
lambda spark: unary_op_df(spark, url_gen).selectExpr("parse_url(a, '" + part + "', 'key')"),
'ParseUrl')
'ParseUrl')
Original file line number Diff line number Diff line change
Expand Up @@ -3260,10 +3260,24 @@ object GpuOverrides extends Logging {
if (a.failOnError) {
willNotWorkOnGpu("Fail on error is not supported on GPU when parsing urls.")
}


// In Spark, the key in parse_url could act like a regex, but on GPU, we only support
// literal keys. So we need to fallback if the key contains regex special characters.
// See Spark issue: https://issues.apache.org/jira/browse/SPARK-44500
extractStringLit(a.children(1)).map(_.toUpperCase) match {
case Some("QUERY") if (a.children.size == 3) =>
willNotWorkOnGpu("Part to extract QUERY with key is not supported on GPU")
case Some("QUERY") if (a.children.size == 3) => {
extractLit(a.children(2)).foreach { key =>
if (key.value != null) {
val keyStr = key.value.asInstanceOf[UTF8String].toString
val specialCharacters = List("\\", "[", "]", "{", "}", "^", "-", "$",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need ( and ) in here too for the check.

@NVnavkumar and/or @andygrove is this a good enough check for a possible regexp problem? Be aware that Spark is creating the regexp with a prefix and postfix

https://github.com/apache/spark/blob/fa60a7e216e63b1edb199b1610b26197815c656b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala#L132-L133

private val REGEXPREFIX = "(&|^)"
private val REGEXSUBFIX = "=([^&]*)"

And it is trying to capture the second capture group. So it is kind of a mess.

Copy link
Collaborator

@NVnavkumar NVnavkumar Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a list of regex meta characters here:

private val regexMetaChars = ".$^[]\\|?*+(){}"

I think we should pull this out of a private val (maybe as a constant in GpuOverrides) and use it in both places.

Plus I think - could be part of a valid key to extract so there is no need to put it in a regex check here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

".", "+", "*", "?", "|")
if (specialCharacters.exists(keyStr.contains(_))) {
willNotWorkOnGpu(s"Key $keyStr could act like a regex which is not " +
"supported on GPU")
}
}
}
}
case Some(part) if GpuParseUrl.isSupportedPart(part) =>
case Some(other) =>
willNotWorkOnGpu(s"Part to extract $other is not supported on GPU")
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
* Copyright (c) 2023-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -81,8 +81,8 @@ case class GpuParseUrl(children: Seq[Expression])
// return a null columnvector
return GpuColumnVector.columnVectorFromNull(col.getRowCount.toInt, StringType)
}
throw new UnsupportedOperationException(s"$this is not supported partToExtract=$part. " +
s"Only PROTOCOL, HOST and QUERY without a key are supported")
val keyStr = key.getValue.asInstanceOf[UTF8String].toString
ParseURI.parseURIQueryWithLiteral(col.getBase, keyStr)
}

override def columnarEval(batch: ColumnarBatch): GpuColumnVector = {
Expand Down
Loading