-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a VALUES clause to the query of a SERVICE clause to simplify the execution #1341
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first round of reviews.
src/engine/Service.cpp
Outdated
@@ -92,6 +95,44 @@ ResultTable Service::computeResult() { | |||
serviceIriString.remove_suffix(1); | |||
ad_utility::httpUtils::Url serviceUrl{serviceIriString}; | |||
|
|||
if (siblingTree_ != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in its own function that returns e.g. an optional<string>
that returns the VALUES clause or nothing.
src/engine/Service.cpp
Outdated
// reduce complexity of the SERVICE query. | ||
auto siblingResult = siblingTree_->getResult(); | ||
|
||
const size_t rowLimit = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this configurable for experiments by registering it in the RuntimeParameters
class.
src/engine/Service.cpp
Outdated
auto siblingResult = siblingTree_->getResult(); | ||
|
||
const size_t rowLimit = 100; | ||
if (siblingResult->size() < rowLimit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do it the other way round (if size >= limit => immediately return/break), then the code gets less nested.
src/engine/Service.cpp
Outdated
if (it == siblingVariables.end()) { | ||
continue; | ||
} | ||
const auto& sVar = *it; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics of this are currently a bit wrong for multiple variables:
If I have the following table:
?x ?y ?z
Then I need the values clause
VALUES (?x ?y ?z) { (<x1> <y1> <z1>) (<x2> <y2> <z2>)}
with multiple variables , not three individual values clauses, otherwise I get
8 results (the cartesian product) instead of the 2 lines).
(For the exact syntax of multi varaible clauses you can look around the standard or play with the UI, the above might contain typos).
src/engine/Service.cpp
Outdated
std::string valueClauses = "{ "; | ||
for (const auto& lVar : parsedServiceClause_.visibleVariables_) { | ||
auto it = siblingVariables.find(lVar); | ||
if (it == siblingVariables.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First in a separate step compute the relevant column indices for the variables, and then simply iterate over those.
That makes the large loop more readable and possibly also more efficient.
src/engine/Service.cpp
Outdated
const auto& optionalString = | ||
ExportQueryExecutionTrees::idToStringAndType( | ||
siblingTree_->getRootOperation()->getIndex(), | ||
siblingResult->idTable()(rowIndex, sVar.second.columnIndex_), | ||
siblingResult->localVocab()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop can also be refactored for readability, but we can do this once the other issues are resolved.
test/ServiceTest.cpp
Outdated
std::string_view expectedSparqlQuery5 = | ||
"PREFIX doof: <http://doof.org> SELECT ?x ?y " | ||
"WHERE { VALUES ?x { <x> <blu> } . VALUES ?y { <y> <bla> } . " | ||
"?x <ble> ?y . }"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As already stated, this is not quite right (or as efficient as it can be ) yet, but we will get there (see the comment above)
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1341 +/- ##
==========================================
+ Coverage 88.77% 88.81% +0.04%
==========================================
Files 324 324
Lines 28676 28762 +86
Branches 3172 3187 +15
==========================================
+ Hits 25456 25544 +88
Misses 2067 2067
+ Partials 1153 1151 -2 ☔ View full report in Codecov by Sentry. |
src/engine/Service.cpp
Outdated
for (size_t i = 0; i < commonColumnIndices.size(); ++i) { | ||
const auto& optionalString = ExportQueryExecutionTrees::idToStringAndType( | ||
siblingTree_->getRootOperation()->getIndex(), | ||
siblingResult->idTable()(rowIndex, commonColumnIndices[i]), | ||
siblingResult->localVocab()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow the localVocab accessed here with siblingResult->localVocab()
doesn't contain useful data.
As a result the optionalString has no Content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a look at this. I however don't suspect that the problem is the local vocab.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you send me a reproducer (Dataset + Query) where something doesn't work as expected?
I just tried a simple example with one variable where this worked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok sry, here is a reproducable example query, i use it with Olympics Dataset but it doesn't use the local dataset anyway:
PREFIX schema: <http://schema.org/>
PREFIX imdb: <https://www.imdb.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
SELECT ?imdb_id ?imdb_votes ?imdb_rating WHERE {
VALUES ?imdb_id { "tt0477348" "tt0118715" "tt0116282" } .
SERVICE <https://qlever.cs.uni-freiburg.de/api/imdb> {
?movie_imdb imdb:id ?imdb_id .
?movie_imdb imdb:type "movie" .
?movie_imdb imdb:numVotes ?imdb_votes .
?movie_imdb imdb:averageRating ?imdb_rating .
}
}
ORDER BY DESC(?imdb_votes)
Expected result: 3 rows with the given imdb_id
values.
Actual result: 0 rows, Debug log shows that the sent Service-Query contains a Values Clause with imdb_id
as key but no values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured it out....
Your code is absolutely working as it should, the problem was the implementation of the VALUES
clause.
Basically you could neither evaluate it twice, nor read it from the cache during the same query.
I have pushed a quick fix to your branch (just do a git pull in your local copy to get it) and everything should work as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first round of reviews.
I managed to get a simple query working with this, So we are on a good track with this.
src/engine/QueryPlanner.cpp
Outdated
auto bRootOp = std::dynamic_pointer_cast<Service>(b._qet->getRootOperation()); | ||
|
||
// Exactly one of the two Operations can be a service. | ||
if (aRootOp == bRootOp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (aRootOp == bRootOp) { | |
if (static_cast<bool>(aRootOp) == static_cast<bool>(bRootOp)) { |
The other one is very fishy (expects that you don't get passed the same Service twice).
src/engine/Service.cpp
Outdated
if (auto valuesClause = getSiblingValuesClause(); valuesClause.has_value()) { | ||
parsedServiceClause_.graphPatternAsString_ = | ||
"{ " + valuesClause.value() + | ||
parsedServiceClause_.graphPatternAsString_.substr(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably find the first occurence of the {
, I am not sure if this breaks as soon as someone puts random spaces in their query.
src/engine/Service.h
Outdated
@@ -60,6 +60,10 @@ class Service : public Operation { | |||
GetTsvFunction getTsvFunction = sendHttpOrHttpsRequest, | |||
std::shared_ptr<QueryExecutionTree> siblingTree = nullptr); | |||
|
|||
inline void setSiblingTree(std::shared_ptr<QueryExecutionTree> siblingTree) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The inline doesn't do anything here.
And add a short docstring, the semantics of a sibling tree should be documented.
src/engine/Service.cpp
Outdated
for (size_t i = 0; i < commonColumnIndices.size(); ++i) { | ||
const auto& optionalString = ExportQueryExecutionTrees::idToStringAndType( | ||
siblingTree_->getRootOperation()->getIndex(), | ||
siblingResult->idTable()(rowIndex, commonColumnIndices[i]), | ||
siblingResult->localVocab()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a look at this. I however don't suspect that the problem is the local vocab.
src/engine/Service.cpp
Outdated
if (commonColumnIndices.size() > 1) { | ||
vars = "(" + vars + ")"; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about you ALWAYS add the parentheses, they also work for only one variable, and this makes the code simpler.
src/engine/Service.cpp
Outdated
for (size_t i = 0; i < commonColumnIndices.size(); ++i) { | ||
const auto& optionalString = ExportQueryExecutionTrees::idToStringAndType( | ||
siblingTree_->getRootOperation()->getIndex(), | ||
siblingResult->idTable()(rowIndex, commonColumnIndices[i]), | ||
siblingResult->localVocab()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you send me a reproducer (Dataset + Query) where something doesn't work as expected?
I just tried a simple example with one variable where this worked.
src/engine/Service.cpp
Outdated
|
||
std::vector<ColumnIndex> commonColumnIndices; | ||
const auto& siblingVars = siblingTree_->getVariableColumns(); | ||
std::string vars = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one bug remaining:
If a row (combination of all relevant variables) appears multiple times in the sibling result, then you may only store it once in the VALUES clause, otherwise the SERVICE will duplicate each result leading to wrong results with too many rows.
Theres a bug (introduced by this branch) in using the cache for the Service Operation with the following Query:
Steps to reproduce: Run the query once. Then comment the first value clause out such that only the service clause remains, and run again. Expected behavior: First run: 2 rows. Second Run: 3 rows Actual behavior: First run: 2 rows. Second Run: 2 rows (Service result cached) Conclusion: The Service Operations result will be loaded from Cache despite the change in its query issued by not injecting (or modifying) the siblings Result values. |
@UNEXENU Thank you for analyzing this... The fix for this is easy: Change the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already very useful and currently running on the Wikidata instance.
To merge this first step, there is still missing:
- Unit tests (see the codecov tool and my suggestions in the review).
- The fix ( + unit test) of the caching bug (sibling tree has to be part of cache key).
- During the building of the Values clause you should regularly check the cancellation token (throwIfCancelled, see in the remaining code).
Let me know if there's things that I can assist you with.
SubtreePlan plan = | ||
jcs.size() == 1 | ||
? makeSubtreePlan<Join>(qec, a._qet, b._qet, jcs[0][0], jcs[0][1]) | ||
: makeSubtreePlan<MultiColumnJoin>(qec, a._qet, b._qet); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add unit tests for this planning (in the QueryPlannerTest.cpp and QueryPlannerTestHelpers.h you can add a matcher for a service clause, have a look at that code, and let me know if there is trouble)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented a simple test for this in this commit, let me know if that is sufficient or provide feedback for further improvements.
src/engine/Service.cpp
Outdated
} | ||
|
||
rowSet.insert(row); | ||
values += row + " "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all your someString += something + something other
use absl::StrAppend
. It is more efficient and more idiomatic. (three places in this file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small changes,
But we are almost ready to merge.
Sorry for the delay, I was on vacation.
ad_utility::HashSet<std::string> rowSet; | ||
|
||
std::string values = " { "; | ||
for (size_t rowIndex = 0; rowIndex < siblingResult->size(); ++rowIndex) { | ||
std::string row = "("; | ||
for (size_t i = 0; i < commonColumnIndices.size(); ++i) { | ||
for (const auto& columnIdx : commonColumnIndices) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inide the for each row loop (the outer one) should probably be one cancellation check per iterationl
test/QueryPlannerTestHelpers.h
Outdated
@@ -283,6 +284,9 @@ constexpr auto OrderBy = [](const ::OrderBy::SortedVariables& sortedVariables, | |||
// Match a `UNION` operation. | |||
constexpr auto Union = MatchTypeAndOrderedChildren<::Union>; | |||
|
|||
// Match a `SERVICE` operation. | |||
constexpr auto Service = MatchTypeAndOrderedChildren<::Service>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please implement the matcher s.t. it also takes a matcher for the sibling tree and the graphPatternAsString
etc.
Quality Gate passedIssues Measures |
There is a bug in the optimization introduced by #1341, which checks whether the sibling operand of a join with a `SERVICE` operations has a small result (and if yes, passes that result to the `SERVICE` clause as `VALUES` in order to reduce the result size of the `SERVICE` call). Instead of just adding the sibling operand to the existing `Service` operation (which led to an object that should have been `const` being changed), now a new `Service` operation that includes the sibling is created. In particular, this fixes https://qlever.cs.uni-freiburg.de/osm-planet, which worked before #1341 and now works again as it should.
With this commit, the optimization that constrains a `SERVICE` query using a `VALUES` clause when the sibling of the `SERVICE` is small (originally introduced in #1341) is also applied for `OPTIONAL` and `MINUS` clauses when the right operand is a `SERVICE`.
Previously, the sibling of a `SERVICE` (see #1341 for details) was computed twice. This was not a problem for small siblings result, because they were cached, making the second computation actually a cheap cache read. It was however a problem if the result was too large for the cache. It then was computed twice, although only one of the computations was actually used, because the SERVICE discarded the result because it was too large. This commit fixes this behavior. The sibling is now only computed once, and only gets explicitly shared with the `SERVICE` if it is sufficiently small.
If the query tree, with which the result of a
SERVICE
clause is joined, leads to a small result (The definition of "small" can be set by a runtime parameter, the current default is 100), then this result is added to the query that is sent to the endpoint of theSERVICE
, making it much easier to process, and the result of theSERVICE
query potentially much smaller.