Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation for simple_query_string. #61

Conversation

Yury-Fridlyand
Copy link

@Yury-Fridlyand Yury-Fridlyand commented May 18, 2022

Signed-off-by: Yury Fridlyand [email protected]

Description

Rework on #53 and #55.

Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

MaxKsyunz added 6 commits May 17, 2022 17:16
Required renaming analyzeSyntax to parse.

Signed-off-by: MaxKsyunz <[email protected]>
Added base class that is commont to match_query and match_bool_prefix and others.

Signed-off-by: MaxKsyunz <[email protected]>
Copy link

@MaxKsyunz MaxKsyunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss adding field parsing to the parser and propagating the parsed result down to SimpleQueryString.

I believe that will result clearer code.

Also, why is the SQL Java CI workflow failing?

ppl/src/main/antlr/OpenSearchPPLParser.g4 Outdated Show resolved Hide resolved
ppl/src/main/antlr/OpenSearchPPLParser.g4 Outdated Show resolved Hide resolved
@@ -308,6 +319,12 @@ relevanceArgName
| BOOST
;

relevanceField
: relevanceArgValue;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too permissive and allows for odd queries like: SIMPLE_QUERY_STRING([true, 3.14], ...)

Field list is a list of weighted field name patterns -- ["field_A^3.4", "field*", fie*^3.45].

Wildcard by itself -- "*" -- is a valid field name as well.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#simple-query-string-top-level-params

as well as org/opensearch/index/search/SimpleQueryStringQueryParser.java in opensearch code.

Copy link
Author

@Yury-Fridlyand Yury-Fridlyand May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 94748b5, thank you.
Note it affects other relevance functions as well (match), so I have to update tests too, see d5def08.

return new Function(
ctx.relevanceFunctionName().getText().toLowerCase(),
relevanceArguments(ctx));
if (ctx.relevanceFunctionType1() != null) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a visitRelevanceFunctionType1 method that you can override and put the content of this if block.

ctx.relevanceFunctionType1().relevanceFunctionType1Name().getText().toLowerCase(),
relevanceArgumentsType1(ctx.relevanceFunctionType1()));
}
else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a visitRelevanceFunctionType2 method that you can override and put the content of this else block.

@@ -252,9 +254,16 @@ public UnresolvedExpression visitConvertedDataType(ConvertedDataTypeContext ctx)

@Override
public UnresolvedExpression visitRelevanceExpression(RelevanceExpressionContext ctx) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be replaced by two overrides -- see the following comments.

import java.util.function.BiFunction;
import java.util.stream.Collectors;

public class SimpleQueryString extends LuceneQuery {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be SimpleQueryStringQuery to follow name convention.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, fixed in 2b4a99b.

continue;
}

var parts = elem.split("\\^");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the return value of split? The should be used instead of var.

Also, why "\\^" instead of ^?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I dislike the extensive use of var in strongly-typed languages like Java - because I find it makes the code unreadable.
Anybody reading the code shouldn't have to look up the function to find out what the return type is supposed to be.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why "\\^" instead of ^?

split function receives a regex, so I have to escape ^, otherwise it would be interpreted as a regex control symbol.


var parts = elem.split("\\^");
var weight = Float.parseFloat(parts[parts.length - 1]);
var field = Arrays.stream(parts).limit(parts.length - 1).collect(Collectors.joining("^"));
Copy link

@MaxKsyunz MaxKsyunz May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to look at making field weight a first-class parser concept. It's not clear if this implementation can differentiate between field named title^4.3 and weight of 4.3 for field title.

Is the \\^ above to allow for escaping ^ that is part of field names?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I proposed the following in the opensearch-project -- opensearch-project#182 (comment)

Copy link
Author

@Yury-Fridlyand Yury-Fridlyand May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the \\^ above to allow for escaping ^ that is part of field names?

No. split function receives a regex, so I have to escape ^, otherwise it would be interpreted as a regex control symbol.

A user can make a search like

simple_query_string(["some_field^1", "my^favourite^field^ever^100500"], ...)

and it would be parsed into following map

/-------------------------v--------\
|          field          | weight |
>-------------------------+--------<
| some_field              | 1      |
| my^favourite^field^ever | 100500 |
\-------------------------^--------/

Signed-off-by: Yury Fridlyand <[email protected]>
@@ -158,6 +161,20 @@ public Expression visitAggregateFunction(AggregateFunction node, AnalysisContext
}
}

@Override
public Expression visitLiteralList(LiteralList node, AnalysisContext context) {
var lst = node

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list vs lst? There's really no value to saving a single character in the code.
Code (java especially) should be verbose. list isn't really verbose either, and you could call it literalListCollection or similar.
Along those lines, I would also call node above literalListNode.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lst holds a list of internal AST objects. The list and the objects are consumed on the next line of code and don't exist anymore, so I called it just lst and not veryImportantListOfMagicThingsWhatever.
Re function arguments: have a look on the rest functions in this file. All of them are

public Expression visitSomething(Something node, AnalysisContext context)

So newly added function fits this style. Should I modify all of them?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to fix everything, but we should name our variables appropriately.

import org.opensearch.sql.expression.env.Environment;

/**
* Literal Expression.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Literal List Expression

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed in 84d84e1.

continue;
}

var parts = elem.split("\\^");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I dislike the extensive use of var in strongly-typed languages like Java - because I find it makes the code unreadable.
Anybody reading the code shouldn't have to look up the function to find out what the return type is supposed to be.

var res = new HashMap<String, Float>();
for (var literal : input) {
var elem = literal.stringValue();
if (!elem.contains("^")) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't consider the case where the field contains a "^" in the name. We should be looking to make sure that the element contains a "^" with a float afterwards (\d+.\d+).
This requires regular expressions or tokenizing the string.

private Map<String, Float> parseFields(List<ExprValue> input) {
try {
var res = new HashMap<String, Float>();
for (var literal : input) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use a .stream().map() here

}
return res;
}
catch (Exception e) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific what we're catching? This will catch everything that turn it into a SemanticCheckException without much explanation.

MaxKsyunz and others added 15 commits May 18, 2022 17:50
Signed-off-by: MaxKsyunz <[email protected]>
Signed-off-by: Yury Fridlyand <[email protected]>
Signed-off-by: Yury Fridlyand <[email protected]>
Signed-off-by: Yury Fridlyand <[email protected]>
Signed-off-by: Yury Fridlyand <[email protected]>
Signed-off-by: Yury Fridlyand <[email protected]>
- Two new AST entities: `RelevanceField` and `RelevanceFieldList` (instead of `LiteralList`).
- Update in `AstExpressionBuilder`x2.
- Rework on `ExpressionAnalyzer` and `SimpleQueryStringQuery`.
- Update in `PPL` ANTLR parser.

Signed-off-by: Yury Fridlyand <[email protected]>
Removed `RelevanceField` as not needed anymore. Since parsing result is stored in `ExprTupleValue` which is actually a map, so we can keep data there.

Signed-off-by: Yury Fridlyand <[email protected]>

@Override
public List<UnresolvedExpression> getChild() {
return List.of();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the parent function returns null. What's the benefit of returning a list?

// Operators. Bit

BIT_NOT_OP: '~';
//BIT_OR_OP: '|';

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented code - can we remove?

MaxKsyunz and others added 6 commits May 30, 2022 13:45
@codecov
Copy link

codecov bot commented May 31, 2022

Codecov Report

Merging #61 (cba5749) into dev-simple_query_string-#192 (26058b8) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@                        Coverage Diff                         @@
##             dev-simple_query_string-#192      #61      +/-   ##
==================================================================
+ Coverage                           97.66%   97.70%   +0.04%     
- Complexity                           2743     2767      +24     
==================================================================
  Files                                 268      267       -1     
  Lines                                6772     6891     +119     
  Branches                              435      436       +1     
==================================================================
+ Hits                                 6614     6733     +119     
  Misses                                157      157              
  Partials                                1        1              
Flag Coverage Δ
sql-engine 97.70% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ensearch/sql/expression/ExpressionNodeVisitor.java 100.00% <ø> (ø)
...arch/storage/script/filter/FilterQueryBuilder.java 100.00% <ø> (ø)
...rg/opensearch/sql/analysis/ExpressionAnalyzer.java 100.00% <100.00%> (ø)
...opensearch/sql/data/model/ExprCollectionValue.java 100.00% <100.00%> (ø)
...c/main/java/org/opensearch/sql/expression/DSL.java 100.00% <100.00%> (ø)
...h/sql/expression/function/BuiltinFunctionName.java 100.00% <100.00%> (ø)
...h/sql/expression/function/OpenSearchFunctions.java 100.00% <100.00%> (ø)
...age/script/filter/lucene/relevance/MatchQuery.java 100.00% <100.00%> (ø)
...ilter/lucene/relevance/SimpleQueryStringQuery.java 100.00% <100.00%> (ø)
...pensearch/sql/ppl/parser/AstExpressionBuilder.java 100.00% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26058b8...cba5749. Read the comment docs.

Signed-off-by: Yury Fridlyand <[email protected]>
Another example to show how to set custom values for the optional parameters::

os> select firstname, lastname, city, address from accounts where simple_query_string(['firstname', city ^ 2], 'Amber Nogal', analyzer=keyword, default_operator='AND');
fetched rows / total rows = 0/0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should consider adding some additional test data for these tests. Perhaps out of the scope of this ticket but it would be nice having clear statements that show usage of these tests by returning exemplary data. It seems hard to do this with the optional parameters and the small amount of queryable data we currently have.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider using more complicated samples as I did in integration test. Prior to that we need to reduce the data set to have it short and simple. See b2639e3.

.entrySet()
.stream()
.collect(Collectors.toMap(
n -> (String) ((Literal) n.getKey()).getValue(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great if we could do this without the casts.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to convert there Map<UnresolvedExpression, UnresolvedExpression> into a Map<String, ExprValue>.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First cast can be simplified to n.getKey().toString(), but I don't see a way to simplify the second one.

return fieldList
.entrySet()
.stream()
.map(e -> String.format("\"%s\" ^ %s", e.getKey().toString(), e.getValue().toString()))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe toString is unnecessary here -- String.format takes care of it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, fixed in 71bf20c.

@@ -45,7 +45,7 @@ public List<ExprValue> collectionValue() {
public String toString() {
return valueList.stream()
.map(Object::toString)
.collect(Collectors.joining(",", "[", "]"));
.collect(Collectors.joining(", ", "[", "]"));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this change necessary to make a test pass?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it just makes the string representation a bit more readable. I was the first user of it.

@@ -59,7 +61,7 @@ public void contextShouldIncludeAllFieldsAfterVisitingIndexNameInFromClause() {
Map<String, Type> typeByName = context.peek().resolveAll(Namespace.FIELD_NAME);
assertThat(
typeByName,
allOf(
Matchers.<Map<String, Type>>allOf(
aMapWithSize(21),
hasEntry("semantics", new OpenSearchIndex("semantics", INDEX)),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is different in upstream/main.

Compare to https://github.com/opensearch-project/sql/blob/26058b8436b47687bb6039a6431c17a6d5d59f35/legacy/src/test/java/org/opensearch/sql/legacy/antlr/semantic/SemanticAnalyzerBasicTest.java#L64

The (Type) cast is removed by IntelliJ's Code Cleanup which then required addition of th

I don't know how you ran into this but that's what I experienced.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, fixed in 65ec516.

numParameters <= MIN_NUM_PARAMETERS + numOptionalParameters;
numParameters++) {
List<ExprType> args = new ArrayList<>(Collections.nCopies(numParameters, STRING));
List<ExprType> args = new ArrayList<>(Collections.nCopies(numParameters - 1, STRING));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is -1 necessary?

upstream/main does not have -1 here nor on line 84 above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add one more parameter on the next line. It may be of another type, so I introduced additional function argument.
I agree that it is an ugly implementation.

@@ -116,7 +119,7 @@ public void defineFieldSymbolShouldBeAbleToResolveAll() {
Map<String, Type> typeByName = environment().resolveAll(Namespace.FIELD_NAME);
assertThat(
typeByName,
allOf(
Matchers.<Map<String, Type>>allOf(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as SemanticAnalyzerBasicTest.java this file is different in upstream/main.

Copy link
Author

@Yury-Fridlyand Yury-Fridlyand Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, fixed in 65ec516.


private static FunctionResolver match_bool_prefix() {
FunctionName name = BuiltinFunctionName.MATCH_BOOL_PREFIX.getName();
// TODO: Create different resolver more suited for relevance functions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is caused by the merge. Fixed in cba5749, the merge was reverted.

build.gradle Outdated
@@ -6,9 +6,9 @@

buildscript {
ext {
opensearch_version = System.getProperty("opensearch.version", "2.0.0-rc1-SNAPSHOT")
opensearch_version = System.getProperty("opensearch.version", "2.0.0-SNAPSHOT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why this is in your PR - is this due to a downmerge from main?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is caused by the merge. Fixed in cba5749, the merge was reverted.

// Test is disabled because DSL and AstDSL expressions always has fields in the different order
// regardless of how they are specified
// Test is disabled because DSL and AstDSL expressions sometimes has fields in the different order
// regardless of how they are specified. Test passes and fails in a flaky behavior.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate.
Can we raise a follow-up ticket to resolve this in a timely manner?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 048984c.

@@ -6,9 +6,9 @@

buildscript {
ext {
opensearch_version = System.getProperty("opensearch.version", "2.0.0-SNAPSHOT")
opensearch_version = System.getProperty("opensearch.version", "2.0.0-rc1-SNAPSHOT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will revert changes that are in the upstream/main branch.
Let's be careful that we aren't undoing Joshua's work.

@Yury-Fridlyand
Copy link
Author

Reimplemented in #66.

@Yury-Fridlyand Yury-Fridlyand deleted the dev-simple_query_string-#192-impl2 branch June 6, 2022 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants