Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a VALUES clause to the query of a SERVICE clause to simplify the execution #1341

Merged
merged 13 commits into from
Jun 5, 2024

Conversation

UNEXENU
Copy link
Contributor

@UNEXENU UNEXENU commented May 8, 2024

If the query tree, with which the result of a SERVICE clause is joined, leads to a small result (The definition of "small" can be set by a runtime parameter, the current default is 100), then this result is added to the query that is sent to the endpoint of the SERVICE, making it much easier to process, and the result of the SERVICE query potentially much smaller.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first round of reviews.

@@ -92,6 +95,44 @@ ResultTable Service::computeResult() {
serviceIriString.remove_suffix(1);
ad_utility::httpUtils::Url serviceUrl{serviceIriString};

if (siblingTree_ != nullptr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in its own function that returns e.g. an optional<string> that returns the VALUES clause or nothing.

// reduce complexity of the SERVICE query.
auto siblingResult = siblingTree_->getResult();

const size_t rowLimit = 100;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this configurable for experiments by registering it in the RuntimeParameters class.

auto siblingResult = siblingTree_->getResult();

const size_t rowLimit = 100;
if (siblingResult->size() < rowLimit) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do it the other way round (if size >= limit => immediately return/break), then the code gets less nested.

if (it == siblingVariables.end()) {
continue;
}
const auto& sVar = *it;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics of this are currently a bit wrong for multiple variables:

If I have the following table:
?x ?y ?z

Then I need the values clause

VALUES (?x ?y ?z) { (<x1> <y1> <z1>) (<x2> <y2> <z2>)}
with multiple variables , not three individual values clauses, otherwise I get

8 results (the cartesian product) instead of the 2 lines).

(For the exact syntax of multi varaible clauses you can look around the standard or play with the UI, the above might contain typos).

std::string valueClauses = "{ ";
for (const auto& lVar : parsedServiceClause_.visibleVariables_) {
auto it = siblingVariables.find(lVar);
if (it == siblingVariables.end()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First in a separate step compute the relevant column indices for the variables, and then simply iterate over those.
That makes the large loop more readable and possibly also more efficient.

Comment on lines 119 to 123
const auto& optionalString =
ExportQueryExecutionTrees::idToStringAndType(
siblingTree_->getRootOperation()->getIndex(),
siblingResult->idTable()(rowIndex, sVar.second.columnIndex_),
siblingResult->localVocab());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop can also be refactored for readability, but we can do this once the other issues are resolved.

std::string_view expectedSparqlQuery5 =
"PREFIX doof: <http://doof.org> SELECT ?x ?y "
"WHERE { VALUES ?x { <x> <blu> } . VALUES ?y { <y> <bla> } . "
"?x <ble> ?y . }";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already stated, this is not quite right (or as efficient as it can be ) yet, but we will get there (see the comment above)

Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 98.90110% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 88.81%. Comparing base (f9c3132) to head (21b42be).

Files Patch % Lines
src/engine/Service.cpp 98.30% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1341      +/-   ##
==========================================
+ Coverage   88.77%   88.81%   +0.04%     
==========================================
  Files         324      324              
  Lines       28676    28762      +86     
  Branches     3172     3187      +15     
==========================================
+ Hits        25456    25544      +88     
  Misses       2067     2067              
+ Partials     1153     1151       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 204 to 208
for (size_t i = 0; i < commonColumnIndices.size(); ++i) {
const auto& optionalString = ExportQueryExecutionTrees::idToStringAndType(
siblingTree_->getRootOperation()->getIndex(),
siblingResult->idTable()(rowIndex, commonColumnIndices[i]),
siblingResult->localVocab());
Copy link
Contributor Author

@UNEXENU UNEXENU May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow the localVocab accessed here with siblingResult->localVocab() doesn't contain useful data.
As a result the optionalString has no Content.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look at this. I however don't suspect that the problem is the local vocab.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you send me a reproducer (Dataset + Query) where something doesn't work as expected?
I just tried a simple example with one variable where this worked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sry, here is a reproducable example query, i use it with Olympics Dataset but it doesn't use the local dataset anyway:

PREFIX schema: <http://schema.org/>
PREFIX imdb: <https://www.imdb.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?imdb_id ?imdb_votes ?imdb_rating WHERE {
VALUES ?imdb_id { "tt0477348" "tt0118715" "tt0116282" } .
  SERVICE <https://qlever.cs.uni-freiburg.de/api/imdb> {
	?movie_imdb imdb:id ?imdb_id .
	?movie_imdb imdb:type "movie" .
    ?movie_imdb imdb:numVotes ?imdb_votes .
    ?movie_imdb imdb:averageRating ?imdb_rating .
  }
}
ORDER BY DESC(?imdb_votes)

Expected result: 3 rows with the given imdb_id values.
Actual result: 0 rows, Debug log shows that the sent Service-Query contains a Values Clause with imdb_id as key but no values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured it out....
Your code is absolutely working as it should, the problem was the implementation of the VALUES clause.
Basically you could neither evaluate it twice, nor read it from the cache during the same query.
I have pushed a quick fix to your branch (just do a git pull in your local copy to get it) and everything should work as expected.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first round of reviews.

I managed to get a simple query working with this, So we are on a good track with this.

auto bRootOp = std::dynamic_pointer_cast<Service>(b._qet->getRootOperation());

// Exactly one of the two Operations can be a service.
if (aRootOp == bRootOp) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (aRootOp == bRootOp) {
if (static_cast<bool>(aRootOp) == static_cast<bool>(bRootOp)) {

The other one is very fishy (expects that you don't get passed the same Service twice).

src/engine/QueryPlanner.cpp Show resolved Hide resolved
if (auto valuesClause = getSiblingValuesClause(); valuesClause.has_value()) {
parsedServiceClause_.graphPatternAsString_ =
"{ " + valuesClause.value() +
parsedServiceClause_.graphPatternAsString_.substr(2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably find the first occurence of the {, I am not sure if this breaks as soon as someone puts random spaces in their query.

src/engine/Service.cpp Show resolved Hide resolved
@@ -60,6 +60,10 @@ class Service : public Operation {
GetTsvFunction getTsvFunction = sendHttpOrHttpsRequest,
std::shared_ptr<QueryExecutionTree> siblingTree = nullptr);

inline void setSiblingTree(std::shared_ptr<QueryExecutionTree> siblingTree) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inline doesn't do anything here.
And add a short docstring, the semantics of a sibling tree should be documented.

src/engine/Service.h Show resolved Hide resolved
Comment on lines 204 to 208
for (size_t i = 0; i < commonColumnIndices.size(); ++i) {
const auto& optionalString = ExportQueryExecutionTrees::idToStringAndType(
siblingTree_->getRootOperation()->getIndex(),
siblingResult->idTable()(rowIndex, commonColumnIndices[i]),
siblingResult->localVocab());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look at this. I however don't suspect that the problem is the local vocab.

Comment on lines 197 to 199
if (commonColumnIndices.size() > 1) {
vars = "(" + vars + ")";
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about you ALWAYS add the parentheses, they also work for only one variable, and this makes the code simpler.

Comment on lines 204 to 208
for (size_t i = 0; i < commonColumnIndices.size(); ++i) {
const auto& optionalString = ExportQueryExecutionTrees::idToStringAndType(
siblingTree_->getRootOperation()->getIndex(),
siblingResult->idTable()(rowIndex, commonColumnIndices[i]),
siblingResult->localVocab());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you send me a reproducer (Dataset + Query) where something doesn't work as expected?
I just tried a simple example with one variable where this worked.


std::vector<ColumnIndex> commonColumnIndices;
const auto& siblingVars = siblingTree_->getVariableColumns();
std::string vars = "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one bug remaining:
If a row (combination of all relevant variables) appears multiple times in the sibling result, then you may only store it once in the VALUES clause, otherwise the SERVICE will duplicate each result leading to wrong results with too many rows.

@UNEXENU
Copy link
Contributor Author

UNEXENU commented May 17, 2024

Theres a bug (introduced by this branch) in using the cache for the Service Operation with the following Query:

PREFIX schema: <http://schema.org/>
PREFIX imdb: <https://www.imdb.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?imdb_id ?imdb_votes ?imdb_rating WHERE {
VALUES (?imdb_id) { ("tt0477348") ("tt0118715") } .
  SERVICE <https://qlever.cs.uni-freiburg.de/api/imdb> {
  VALUES ?imdb_id { "tt0477348" "tt0118715" "tt0116282" } .
	?movie_imdb imdb:id ?imdb_id .
	?movie_imdb imdb:type "movie" .
    ?movie_imdb imdb:numVotes ?imdb_votes .
    ?movie_imdb imdb:averageRating ?imdb_rating .
  }
}
ORDER BY DESC(?imdb_votes)

Steps to reproduce: Run the query once. Then comment the first value clause out such that only the service clause remains, and run again.

Expected behavior: First run: 2 rows. Second Run: 3 rows

Actual behavior: First run: 2 rows. Second Run: 2 rows (Service result cached)

Conclusion: The Service Operations result will be loaded from Cache despite the change in its query issued by not injecting (or modifying) the siblings Result values.

@joka921
Copy link
Member

joka921 commented May 17, 2024

Theres a bug (introduced by this branch) in using the cache for the Service Operation with the following Query:

PREFIX schema: <http://schema.org/>
PREFIX imdb: <https://www.imdb.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?imdb_id ?imdb_votes ?imdb_rating WHERE {
VALUES (?imdb_id) { ("tt0477348") ("tt0118715") } .
  SERVICE <https://qlever.cs.uni-freiburg.de/api/imdb> {
  VALUES ?imdb_id { "tt0477348" "tt0118715" "tt0116282" } .
	?movie_imdb imdb:id ?imdb_id .
	?movie_imdb imdb:type "movie" .
    ?movie_imdb imdb:numVotes ?imdb_votes .
    ?movie_imdb imdb:averageRating ?imdb_rating .
  }
}
ORDER BY DESC(?imdb_votes)

Steps to reproduce: Run the query once. Then comment the first value clause out such that only the service clause remains, and run again.

Expected behavior: First run: 2 rows. Second Run: 3 rows

Actual behavior: First run: 2 rows. Second Run: 2 rows (Service result cached)

Conclusion: The Service Operations result will be loaded from Cache despite the change in its query issued by not injecting (or modifying) the siblings Result values.

@UNEXENU Thank you for analyzing this... The fix for this is easy: Change the getCacheKeyImpl of the ServiceClasss.t. it also contains the cacheKey of the sibling tree (if it exists).

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already very useful and currently running on the Wikidata instance.
To merge this first step, there is still missing:

  1. Unit tests (see the codecov tool and my suggestions in the review).
  2. The fix ( + unit test) of the caching bug (sibling tree has to be part of cache key).
  3. During the building of the Values clause you should regularly check the cancellation token (throwIfCancelled, see in the remaining code).

Let me know if there's things that I can assist you with.

SubtreePlan plan =
jcs.size() == 1
? makeSubtreePlan<Join>(qec, a._qet, b._qet, jcs[0][0], jcs[0][1])
: makeSubtreePlan<MultiColumnJoin>(qec, a._qet, b._qet);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add unit tests for this planning (in the QueryPlannerTest.cpp and QueryPlannerTestHelpers.h you can add a matcher for a service clause, have a look at that code, and let me know if there is trouble)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented a simple test for this in this commit, let me know if that is sufficient or provide feedback for further improvements.

}

rowSet.insert(row);
values += row + " ";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all your someString += something + something other
use absl::StrAppend. It is more efficient and more idiomatic. (three places in this file)

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small changes,
But we are almost ready to merge.
Sorry for the delay, I was on vacation.

ad_utility::HashSet<std::string> rowSet;

std::string values = " { ";
for (size_t rowIndex = 0; rowIndex < siblingResult->size(); ++rowIndex) {
std::string row = "(";
for (size_t i = 0; i < commonColumnIndices.size(); ++i) {
for (const auto& columnIdx : commonColumnIndices) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inide the for each row loop (the outer one) should probably be one cancellation check per iterationl

@@ -283,6 +284,9 @@ constexpr auto OrderBy = [](const ::OrderBy::SortedVariables& sortedVariables,
// Match a `UNION` operation.
constexpr auto Union = MatchTypeAndOrderedChildren<::Union>;

// Match a `SERVICE` operation.
constexpr auto Service = MatchTypeAndOrderedChildren<::Service>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please implement the matcher s.t. it also takes a matcher for the sibling tree and the graphPatternAsString
etc.

Copy link

Quality Gate Passed Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@joka921 joka921 changed the title Service Improvements Add a VALUES clause to the query of a SERVICE clause to simplify the execution Jun 5, 2024
@joka921 joka921 merged commit bb6e2f9 into ad-freiburg:master Jun 5, 2024
19 checks passed
hannahbast pushed a commit that referenced this pull request Aug 1, 2024
There is a bug in the optimization introduced by #1341, which checks whether the sibling operand of a join with a `SERVICE` operations has a small result (and if yes, passes that result to the `SERVICE` clause as `VALUES` in order to reduce the result size of the `SERVICE` call). Instead of just adding the sibling operand to the existing `Service` operation (which led to an object that should have been `const` being changed), now a new `Service` operation that includes the sibling is created. In particular, this fixes https://qlever.cs.uni-freiburg.de/osm-planet, which worked before #1341 and now works again as it should.
joka921 pushed a commit that referenced this pull request Sep 18, 2024
With this commit, the optimization that constrains a `SERVICE` query using a `VALUES` clause when the sibling of the `SERVICE` is small (originally introduced in #1341) is also applied for `OPTIONAL` and `MINUS` clauses when the right operand is a `SERVICE`.
joka921 pushed a commit that referenced this pull request Oct 26, 2024
Previously, the sibling of a `SERVICE` (see #1341 for details) was computed twice. This was not a problem for small siblings result, because they were cached, making the second computation actually a cheap cache read. It was however a problem if the result was too large for the cache. It then was computed twice, although only one of the computations was actually used, because the SERVICE discarded the result because it was too large.
This commit fixes this behavior. The sibling is now only computed once, and only gets explicitly shared with the `SERVICE` if it is sufficiently small.
joka921 pushed a commit that referenced this pull request Oct 28, 2024
Adds the commandline argument `service-max-value-rows` allowing to set the maximum number of rows passed to a `Service` operation (see #1341 and #1502 for details). The default value is set to 10,000 rows.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants