Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix /datasets?filter to select across namespaces #3359

Merged
merged 5 commits into from
Mar 27, 2023

Conversation

dbutenhof
Copy link
Member

PBENCH-1117

I discovered that the single simplistic LEFT JOIN allows combining native Dataset and Metadata terms in a SELECT, but with limitations: because the SQL join constructs a row for each Metadata match, matches for, e.g., server.origin and dataset.metalog.pbench.script or global.server.legacy will appear on separate table rows. Each has duplicate Dataset columns, but that doesn't help when trying to select across namespaces.

The only effective solution I was able to find was to cascade the joins in order to build a new table with a separate column for each metadata namespace row matching the dataset. This allows a single SELECT to work across the columns in the new table.

This broke the keysummary code, requiring some adjustment there. With the join now renaming multiple copies of the Metadata table, the simplistic override of the SELECT terms doesn't work. Instead of trying to "optimize" the SELECT, the new code just iterates through the returned datasets and their metadata collections.

PBENCH-1117

I discovered that the single simplistic `LEFT JOIN` allows combining native
`Dataset` and `Metadata` terms in a `SELECT`, but with limitations: because
the SQL join constructs a row for each `Metadata` match, matches for, e.g.,
`server.origin` and `dataset.metalog.pbench.script` or `global.server.legacy`
will appear on separate table rows. Each has duplicate `Dataset` columns, but
that doesn't help when trying to select across namespaces.

The only effective solution I was able to find was to cascade the joins in
order to build a new table with a separate column for each metadata namespace
row matching the dataset. This allows a single `SELECT` to work across the
columns in the new table.

I'm somewhat unhappy with the `filter` unit tests, because the `sqlite3`
engine seems to be generating odd and incorrect SQL for queries involving the
`user` namespace. (PostgreSQL does this correctly on a live server.) This
should be a fourth `LEFT JOIN` but on a user_id match as well as the dataset
reference and primary key name. I've made the unit tests "pass" ... but the
SQL being compared wouldn't actually work correctly. (Also note that the uery
builder omits the `user` join if the query doesn't reference that alias, but
doesn't similarly optimize for the other joins, which seems weird.)
With the join now renaming multiple copies of the `Metadata` table, the
simplistic override of the `SELECT` terms doesn't work. Instead of trying to
"optimize" the `SELECT`, just iterate through the returned datasets and their
metadata collection.
@dbutenhof dbutenhof added bug Server API Of and relating to application programming interfaces to services and functions labels Mar 24, 2023
@dbutenhof dbutenhof self-assigned this Mar 24, 2023
webbnh
webbnh previously approved these changes Mar 24, 2023
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm offering my approval, if you don't find anything in the below worth changing. However, there are a few things that I'm concerned about.

Foremost is the change to filter_query(): it's not obvious to me that terms is being properly managed. (However, apparently you think the tests will pass...I just poked Mr. Jenkins to see what he thinks; if they do pass, and my concern is well-founded, then I think you'll need to expand the tests....)

The next is I have a thought about the four-joins approach.

And, finally, there's a large nit about the tests' overwriting/reusing certain identifiers.

(And, of course, there are a bunch of other nits, smaller things, and other comments/questions.)

expression = aliases[native_key].value[keys].as_string()
filter = expression.contains(v) if contains else expression == v
else:
filter = and_(*terms)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, can terms ever have more than one "term" in it? If not, consider changing it to a scalar.

Relatedly, if we take the if branch instead of the else branch here, I think we will ignore terms altogether...is that desired?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

terms is only used when we're referencing the Dataset table. And you're right: with the changes I think there's never more than one term and I could probably simplify. That doesn't seem critical.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, there's nothing critical here.

However, as the code stands with this change, I don't think we need for the two if statements (lines 246 and 276 respectively) to be separate from each other (that is, the path through the first dictates the path through the second).

So, I think it would be reasonable (even good!) to combine them together. We can get rid of the use_dataset flag, and the scope of terms becomes limited to only one block (and, in fact, becomes so simple that we can get rid of it altogether). You would have to replace the elif with an else, but that's easily done by moving the conditional to the contained if. And, the result would be much more straightforward code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your latest revision is much improved. However, I still suggest combining the two if statements, if you opt to make a change for some other reason.

elif k:
self.accumulate(aggregate, k, v)

Database.dump_query(query, current_app.logger)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The effect of this is conditional upon DEBUG-level logging being enabled, right? It would be nice if "debug" appeared in its name....

OK, having read further, I see that you've enhanced this to work for non-DEBUG-level; in that case, I humbly request that you explicitly include the third argument here, which will make it more obvious to the reader that it's a "debug thing".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a debug thing until its not. The default behavior of this method hasn't changed. I just extended it to allow externally overriding to info for testing. I didn't see any point in renaming, and adding the default argument here is redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the default argument here makes it clear to the reader that the function is intended to be active only for debugging (otherwise, this behavior is implicit, and the concerned reader has to go find the definition of dump_query() to figure that out). (We can keep the argument's default value for compatibility with other, existing code; but it would be good from a code-as-documentation perspective to supply the value explicitly here.)

Comment on lines +345 to +348
if m.key == Metadata.METALOG:
self.accumulate(aggregate["dataset"], m.key, m.value)
else:
self.accumulate(aggregate, m.key, m.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternately,

                a = aggregate["dataset"] if m.key == Metadata.METALOG else aggregate
                self.accumulate(a, m.key, m.value)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I like the explicit if better here than the local variable, especially when I'm pretty sure black wouldn't let me put it on one line anyway.

Copy link
Member

@webbnh webbnh Mar 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you really like it better, that's fine. (Myself, I prefer a common call point with conditional data over multiple conditional call points.) However, I checked with Mr. Black before I posted it, and he said it was fine.

lib/pbench/test/unit/server/test_datasets_list.py Outdated Show resolved Hide resolved
npalaska
npalaska previously approved these changes Mar 24, 2023
Copy link
Member

@npalaska npalaska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good as is, I just have a couple of questions.

lib/pbench/server/api/resources/datasets_list.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_datasets_list.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_datasets_list.py Outdated Show resolved Hide resolved
@dbutenhof dbutenhof dismissed stale reviews from npalaska and webbnh via e32fd47 March 24, 2023 20:35
npalaska
npalaska previously approved these changes Mar 24, 2023
webbnh
webbnh previously approved these changes Mar 25, 2023
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I just have a couple of pointed questions and a coding suggestion.

lib/pbench/server/api/resources/datasets_list.py Outdated Show resolved Hide resolved
lib/pbench/server/api/resources/datasets_list.py Outdated Show resolved Hide resolved
lib/pbench/test/unit/server/test_datasets_list.py Outdated Show resolved Hide resolved
expression = aliases[native_key].value[keys].as_string()
filter = expression.contains(v) if contains else expression == v
else:
filter = and_(*terms)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your latest revision is much improved. However, I still suggest combining the two if statements, if you opt to make a change for some other reason.

lib/pbench/test/unit/server/test_datasets_list.py Outdated Show resolved Hide resolved
lib/pbench/server/api/resources/datasets_list.py Outdated Show resolved Hide resolved
siddardh-ra
siddardh-ra previously approved these changes Mar 27, 2023
@dbutenhof dbutenhof dismissed stale reviews from siddardh-ra, webbnh, and npalaska via 2256c05 March 27, 2023 12:07
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it! (Unless you want to make any of the changes from my earlier reviews.... 😉)

Copy link
Member

@npalaska npalaska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@dbutenhof dbutenhof merged commit 2d1e6af into distributed-system-analysis:main Mar 27, 2023
@dbutenhof dbutenhof deleted the joiner branch March 27, 2023 13:39
@portante portante added this to the v0.73 milestone Mar 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Of and relating to application programming interfaces to services and functions bug Server
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants