Fix /datasets?filter to select across namespaces #3359
Conversation
PBENCH-1117

I discovered that the single simplistic `LEFT JOIN` allows combining native `Dataset` and `Metadata` terms in a `SELECT`, but with limitations: because the SQL join constructs a row for each `Metadata` match, matches for, e.g., `server.origin` and `dataset.metalog.pbench.script` or `global.server.legacy` will appear on separate table rows. Each has duplicate `Dataset` columns, but that doesn't help when trying to select across namespaces. The only effective solution I was able to find was to cascade the joins in order to build a new table with a separate column for each metadata namespace row matching the dataset. This allows a single `SELECT` to work across the columns in the new table.

I'm somewhat unhappy with the `filter` unit tests, because the `sqlite3` engine seems to be generating odd and incorrect SQL for queries involving the `user` namespace. (PostgreSQL does this correctly on a live server.) This should be a fourth `LEFT JOIN`, but on a user_id match as well as the dataset reference and primary key name. I've made the unit tests "pass" ... but the SQL being compared wouldn't actually work correctly. (Also note that the query builder omits the `user` join if the query doesn't reference that alias, but doesn't similarly optimize the other joins, which seems weird.)
With the join now renaming multiple copies of the `Metadata` table, the simplistic override of the `SELECT` terms doesn't work. Instead of trying to "optimize" the `SELECT`, just iterate through the returned datasets and their metadata collection.
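The cascaded-join idea can be illustrated with a toy schema (hypothetical table and column names for illustration only; the real pbench models are richer, and the server builds these joins with SQLAlchemy rather than raw SQL):

```python
import sqlite3

# Miniature stand-in schema: one dataset row, and a metadata table keyed by
# (dataset_ref, namespace key), roughly mirroring the Dataset/Metadata split.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE metadata (dataset_ref INTEGER, key TEXT, value TEXT);
    INSERT INTO dataset VALUES (1, 'drb');
    INSERT INTO metadata VALUES (1, 'server', '{"origin": "EC2"}');
    INSERT INTO metadata VALUES (1, 'global', '{"legacy": true}');
    """
)

# A single LEFT JOIN yields one result row per matching metadata key, so a
# WHERE clause can never see the 'server' and 'global' values on the same row:
per_key_rows = db.execute(
    "SELECT dataset.name, metadata.key FROM dataset "
    "LEFT JOIN metadata ON metadata.dataset_ref = dataset.id"
).fetchall()

# Cascading one aliased join per namespace builds a single wide row, so a
# single SELECT can filter across namespaces:
wide_rows = db.execute(
    "SELECT dataset.name, ms.value, mg.value FROM dataset "
    "LEFT JOIN metadata AS ms"
    "  ON ms.dataset_ref = dataset.id AND ms.key = 'server' "
    "LEFT JOIN metadata AS mg"
    "  ON mg.dataset_ref = dataset.id AND mg.key = 'global'"
).fetchall()
```

The first query returns two rows; the second returns one row carrying both namespace columns, which is what makes cross-namespace filtering possible.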
I'm offering my approval, if you don't find anything in the below worth changing. However, there are a few things that I'm concerned about.
Foremost is the change to `filter_query()`: it's not obvious to me that `terms` is being properly managed. (However, apparently you think the tests will pass... I just poked Mr. Jenkins to see what he thinks; if they do pass, and my concern is well-founded, then I think you'll need to expand the tests....)
Next, I have a thought about the four-joins approach.
And, finally, there's a large nit about the tests' overwriting/reusing certain identifiers.
(And, of course, there are a bunch of other nits, smaller things, and other comments/questions.)
```python
        expression = aliases[native_key].value[keys].as_string()
        filter = expression.contains(v) if contains else expression == v
    else:
        filter = and_(*terms)
```
After this change, can `terms` ever have more than one "term" in it? If not, consider changing it to a scalar.

Relatedly, if we take the `if` branch instead of the `else` branch here, I think we will ignore `terms` altogether... is that desired?
`terms` is only used when we're referencing the `Dataset` table. And you're right: with the changes I think there's never more than one term, and I could probably simplify. That doesn't seem critical.
I agree, there's nothing critical here.
However, as the code stands with this change, I don't think we need the two `if` statements (lines 246 and 276, respectively) to be separate from each other (that is, the path through the first dictates the path through the second).

So, I think it would be reasonable (even good!) to combine them. We can get rid of the `use_dataset` flag, and the scope of `terms` becomes limited to only one block (and, in fact, it becomes so simple that we can get rid of it altogether). You would have to replace the `elif` with an `else`, but that's easily done by moving the conditional into the contained `if`. And the result would be much more straightforward code.
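The suggested restructuring can be sketched with placeholder logic (this is a schematic illustration with invented helper names and dummy data shapes, not the actual `filter_query()` code):

```python
def build_before(kind, value):
    # Two separate if statements linked by a flag: the path through the
    # first dictates the path through the second.
    use_dataset = kind == "dataset"
    terms = []
    if use_dataset:
        terms.append(("column", value))
    if use_dataset:
        result = ("and", terms)
    elif value is not None:
        result = ("json", value)
    else:
        result = None
    return result


def build_after(kind, value):
    # Combined: the flag and the terms list disappear, and the elif
    # becomes an if nested inside the else branch.
    if kind == "dataset":
        return ("and", [("column", value)])
    else:
        if value is not None:
            return ("json", value)
        return None
```

Both functions produce the same results for every input; the second simply has one decision point instead of two coupled ones.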
Your latest revision is much improved. However, I still suggest combining the two `if` statements, if you opt to make a change for some other reason.
```python
elif k:
    self.accumulate(aggregate, k, v)
```

```python
Database.dump_query(query, current_app.logger)
```
The effect of this is conditional upon `DEBUG`-level logging being enabled, right? It would be nice if "debug" appeared in its name....

OK, having read further, I see that you've enhanced this to work for non-`DEBUG` levels; in that case, I humbly request that you explicitly include the third argument here, which will make it more obvious to the reader that it's a "debug thing".
It's a debug thing until it's not. The default behavior of this method hasn't changed; I just extended it to allow externally overriding to `info` for testing. I didn't see any point in renaming, and adding the default argument here is redundant.
Adding the default argument here makes it clear to the reader that the function is intended to be active only for debugging (otherwise, this behavior is implicit, and the concerned reader has to go find the definition of `dump_query()` to figure that out). (We can keep the argument's default value for compatibility with other, existing code; but it would be good from a code-as-documentation perspective to supply the value explicitly here.)
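The pattern under discussion might look like this (a minimal sketch assuming a `dump_query()` shaped roughly as described; the real `Database.dump_query()` signature may differ):

```python
import logging


def dump_query(query, logger, level=logging.DEBUG):
    # Stand-in for Database.dump_query(): emit the generated SQL only when
    # the requested level is enabled, defaulting to DEBUG. Passing the level
    # explicitly at the call site documents that this is a debug-only aid.
    if logger.isEnabledFor(level):
        logger.log(level, "QUERY: %s", query)


logger = logging.getLogger("example")
logger.setLevel(logging.INFO)

dump_query("SELECT 1", logger, logging.DEBUG)  # suppressed at INFO level
dump_query("SELECT 1", logger, logging.INFO)   # emitted
```

The override to `info` for testing is then just a matter of passing a different third argument, without changing the default debug behavior.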
```python
if m.key == Metadata.METALOG:
    self.accumulate(aggregate["dataset"], m.key, m.value)
else:
    self.accumulate(aggregate, m.key, m.value)
```
Alternately:

```python
a = aggregate["dataset"] if m.key == Metadata.METALOG else aggregate
self.accumulate(a, m.key, m.value)
```
Yeah, but I like the explicit `if` better here than the local variable, especially when I'm pretty sure `black` wouldn't let me put it on one line anyway.
If you really like it better, that's fine. (Myself, I prefer a common call point with conditional data over multiple conditional call points.) However, I checked with Mr. Black before I posted it, and he said it was fine.
Looks good as is, I just have a couple of questions.
LGTM, I just have a couple of pointed questions and a coding suggestion.
2256c05
Ship it! (Unless you want to make any of the changes from my earlier reviews.... 😉)
👍
PBENCH-1117

I discovered that the single simplistic `LEFT JOIN` allows combining native `Dataset` and `Metadata` terms in a `SELECT`, but with limitations: because the SQL join constructs a row for each `Metadata` match, matches for, e.g., `server.origin` and `dataset.metalog.pbench.script` or `global.server.legacy` will appear on separate table rows. Each has duplicate `Dataset` columns, but that doesn't help when trying to select across namespaces.

The only effective solution I was able to find was to cascade the joins in order to build a new table with a separate column for each metadata namespace row matching the dataset. This allows a single `SELECT` to work across the columns in the new table.

This broke the `keysummary` code, requiring some adjustment there. With the join now renaming multiple copies of the `Metadata` table, the simplistic override of the `SELECT` terms doesn't work. Instead of trying to "optimize" the `SELECT`, the new code just iterates through the returned datasets and their metadata collections.
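The iterate-and-accumulate approach might be sketched like this (hypothetical `accumulate`/`summarize` helpers with simplified data shapes; the real `keysummary` code differs):

```python
METALOG = "metalog"  # stand-in for Metadata.METALOG


def accumulate(aggregate, key, value):
    # Merge one metadata key/value into the aggregate tree, recursing into
    # nested dictionaries so sub-keys from different rows combine.
    if isinstance(value, dict):
        node = aggregate.setdefault(key, {})
        for k, v in value.items():
            accumulate(node, k, v)
    else:
        aggregate.setdefault(key, value)


def summarize(datasets):
    # Walk each returned dataset's metadata collection instead of trying to
    # "optimize" the SELECT: metalog keys nest under "dataset", everything
    # else accumulates at the top level.
    aggregate = {"dataset": {}}
    for metadata in datasets:  # each entry: a dataset's (key, value) rows
        for key, value in metadata:
            target = aggregate["dataset"] if key == METALOG else aggregate
            accumulate(target, key, value)
    return aggregate
```

For example, a dataset carrying a `metalog` row and a `server` row folds into a single tree with `metalog` nested under `dataset` and `server` at the top level.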