-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
APIs to support metadata object creation and querying #2049
base: main
Are you sure you want to change the base?
APIs to support metadata object creation and querying #2049
Conversation
51eb24a
to
faf6c5a
Compare
371962b
to
ba71d77
Compare
Nikhil, I'm coming late to this game, so my apologies if I'm poking at settled issues, but I have a few questions/comments on the description of this PR before I look at the code.
And, just for my education, why are there separate URIs for creating and listing "favorite Metadata" and "saved Metadata", when there are common URIs for getting, updating, and deleting them? That is, it seems like "favorite" objects are just a particular flavor of Metadata objects...should the server care about the distinction (or can that be dealt with on the client side), and, if so, should the distinction be made through an embedded field or a request parameter instead of offering a separate URI for them? |
I think we should get away from the term "sessions" here, which I don't think really makes sense in the generalized context. Also, "Metadata or dashboard saved sessions"?? Saved sessions are one kind of metadata we're supporting in this API. We're saving "values" or "documents" associated with the metadata type (or "key").
You seem to have settled on having the client provide/retrieve JSON rather than making them render a string, which -- without having looked at the code yet -- I resume means you've used SQLAlchemy's JSON column support. I'd already checked that it's apparently supported for both Postgresql and sqlite3, so that should be good. In this case, I don't think "text blobs" is a good description; we're apparently working with JSON documents, so we can just say that. I also hadn't anticipated locking the metadata keys to just "saved" and "favorites" ... it seems that a client ought to be able to add additional metadata keys without requiring server changes. Then again, with Flask URI templates, this might be exactly what you're doing? Do we really want to shorten the concept of "saved session" to just "saved"?? It sounds very generic! (Of course, if we're handling opaque JSON documents, the server implementation actually is very generic, and if you're really taking the "metadata key" from a URI template rather than hardcoding "saved" and "favorites", then we can change the names at any time independent of this PR. (And that's what I'd prefer: we just have "key" and "value", without any server interpretation.)
I doubt that "favorite" metadata will have "config"/"description" fields in the JSON, and the way this is written (especially that it's the same in both saved sessions and favorites) suggests that the field values are required and perhaps enforced by the server. I'd rather not imply that. (Although the PR description isn't product documentation, it will become a permanent part of the git log.)
Again, "session" feels awkward here. We have "saved sessions" and "favorites". Here we're querying the value of (literally) a metadata table row ID. I don't think that's exactly how we want to advertise it; but "metadata item ID"? "metadata id"? Hmm.
Didn't we dump the GraphQL schema? This seems extraneous and potentially confusing. |
I took the "strike-through" to mean that these items had been dumped...but I've started reviewing the code, and at least elements of them seem to be present there. (E.g., what does |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff here, but I have lots of questions (some small, some large) and the usual handful of nits.
"User trying to update fields that are not present in the metadata database. Fields: {}", | ||
non_existent, | ||
) | ||
abort(400, message="Invalid fields in update request payload") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't want to take pity on the requestor and indicate which fields are invalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we can add that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the second thought however, should we let the requester figure out which fields are invalid, this is in accordance with our design that we try not to give too many details to the requester, we just tell them what went wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything around security, that's true; not so sure we want to extend that to non-security failures. I'd rather be helpful where it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the client can issue this request over and over with different fields, it would be mechanically straightforward for the client developer to figure out what part of the request is bad...but, it would be annoying for him/her. So, really, I see no reason why we shouldn't just indicate the offending field(s) in the message and save all the trouble.
# Check if the metadata payload contain fields that are either protected or | ||
# not present in the metadata db. If any key in the payload does not match | ||
# with the column name we will abort the update request. | ||
non_existent = set(post_data.keys()).difference( | ||
set(Metadata.__table__.columns.keys()) | ||
) | ||
if non_existent: | ||
self.logger.warning( | ||
"User trying to update fields that are not present in the metadata database. Fields: {}", | ||
non_existent, | ||
) | ||
abort(400, message="Invalid fields in update request payload") | ||
protected = set(post_data.keys()).intersection(set(Metadata.get_protected())) | ||
for field in protected: | ||
if getattr(metadata_session, field) != post_data[field]: | ||
self.logger.warning( | ||
"User trying to update the non-updatable fields. {}: {}", | ||
field, | ||
post_data[field], | ||
) | ||
abort(403, message="Invalid update request payload") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I've seen this code before...I suggest refactoring it into a utility routine, perhaps in some common parent class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah its a bit redundant
Well right now it actually is a text blob we require value to be a text and that text can be a json string, thats why I gave an example as The reason for adding
Yeah I agree I should fix the wording, the value field can accept any blob of text and it doesnt have to be a json string, its just a client defined text metadata, which client can fetch and parse it. |
Yes I will update the description
Ah, yeah thats an interesting question I think they should age out and get automatically removed rather than allowing everyone to delete the public metadata objects, but a good point we can discuss it.
Yeah should we allow public metadata objects to be editable? I am not sure
2 separate URIs for favorite and saved metadata is because of the way how flask manages the endpoint definition with |
Currently a user can click/unclick the favorite button on any dataset, at any time. There's no ownership, so it doesn't matter if it belongs to "someone else". Again, this behavior is really a quirk of our evolutionary cycle and won't be possible once we're done because there won't be any unowned datasets. I suppose that doesn't necessarily mean that an un-logged-in user won't be able to favorite any published datasets, though; which would still have to be shared by all un-logged-in viewers. Anyway, yes, absolutely anything owned by "public" (or the "null user", or just "owned context") has no protections, and can be created, modified, or deleted at whim by anyone else who's not logged in. It's a stupid job, but it's our job, and we're going to do it. So, yeah, delete needs to work on unowned metadata without authorization. Whether for saved views or for favorites. |
so @dbutenhof yes I should remove all the Also if we are to allow any key type, we probably can't use the MetadataKeys enum (enum will lock the list of keys we can accept). In a way thats good rather than flooding our database with all sorts of user defined keys but I am confused. Another question is on the GET request for |
Yes, the Enum implies a limited set of keys, but that's what you've already implemented. My vision was that we simply support an open set of key=value pairs tied to the user (or anti-user). That means we just use the key string they passed in (probably lowercased) and there's no way to enum-ify or digit-ify it. If you'd like to change to that model, I'd be thrilled. You chose the ENUM, and I'm just saying if we go that way I'd rather use the ENUM column type and have a readable string in the database rather than a numeric equivalent. And while I think the idea of |
Missed this one. Yeah; if you ask for all "foo"s and there is no "foo", is that success or failure? I think that we're basically returning
or perhaps just
or even (since they asked for a specific key) just
And in the latter case, having no "key" values for the current user/anti-user might just return 200 with
and be done. The theory being that asking for the values for "mykeythathasnovalues" is an empty list. But another option would be to handle a query for a key that has no values as HTTP 204 (no content). Successful, but you get zip. The empty array might be easier for the client to handle, though. |
Hmm, it feels a little odd though, I mean anyone can write a web crawler and keep deleting the public datasets and we would have no idea whats happening. Either we can strictly force user to login to create and use metadata objects or at least provide a basic protection against unwarranted deletes on public metadatas that's there for the legacy purpose, isn't? |
Post request for publishing the metadata object for public access. | ||
This requires a Pbench auth token in the header field. Only authorized users can make their metadata public. | ||
Right now once the metadata object is made public it can not be reverted back to the private entity | ||
TODO: Is it desirable to be able to revert the object back as a private entity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't quite what I thought we were talking about. The idea wasn't to make a user's metadata accessible without authorization, but rather to have a mechanism to build a URI that would essentially embed the metadata directly without going to the DB.
For example, something like http:/server.example.com/dashboard/sharedview?token=<read-only-token>&config=<encodedsavedviewconfigstring>
Now at least part of this could be a "metadata" server API to package it up, if we want. I think the dashboard would have a "share" button for each current page, and would generate a URI including the information necessary to re-open that view for anyone with the URI, including the React context and document IDs being displayed. There might also be a "share" link on the Sessions view to package up the saved view and create a URI the same way.
If we generate the link in the server a non-UI client could use it, too, but I think we'll still need dashboard integration.
On the other hand... this would additionally enable a "publish view" that's somewhat similar to publishing a dataset, and maybe that's not entirely a bad thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we didn't discuss about it, this was a prototype idea that I thought would be good to have. As you correctly pointed out it enables user to publish a certain view if they want to and that view would be accessible to all the logged-in and non logged-in users. If users are comfortable with publishing their dataset, they may want to publish their metadata view as well for others to access (and even for themselves without logging in next time).
I am not sure if this would add much of a value for the user experience, but an option we could give. However, I am certainly open for any suggestions.
class UserMetadata(Database.Base): | ||
""" Metadata Model for storing user metadata details """ | ||
|
||
# TODO: Think about the better name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a perfectly reasonable table name to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @npalaska, this might be a silly question but what do we mean by user metadata here, in context with the dashboard? I believe that the user (via dashboard) has to send some data to the REST API but do not know what it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aquibbaig, I believe the intention is to provide Server-side storage for any user state which should persist between login sessions, anything which the Dashboard might want to store on behalf of the user, such as favorite datasets or dataset tags or labels, views, window geometries, default languages, or other settings, etc., any context which is associated with a user rather than a dataset.
Storing user-generated favorites/tags/labels with datasets doesn't work well if the user doesn't own the dataset or if different users want to mark the same dataset differently. Users might want to "favorite" a dataset published by someone else. Since the notion of "favorite datasets" directs what the user sees, it's an attribute of the user, not the dataset. The metadata gives the Dashboard a place to store this stuff so that it can be fetched each time the user logs in.
The "metadata" API is supposed to be very generic, to give it maximum flexibility for supporting the Dashboard and any other client uses that we haven't imagined yet.
updated = Column(DateTime, nullable=False, default=datetime.datetime.now()) | ||
value = Column(Text, unique=False, nullable=False) | ||
key = Column(String(128), nullable=False) | ||
user_id = Column(Integer, ForeignKey("users.id", ondelete="CASCADE"), nullable=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem with this "publish by setting user_id == None" is that once it's been published there's no way anyone can delete it or update it.
If we really want to do this (instead of or in addition to the generated "share view" URI), I think you should add an "access" like what I did in the ES schema; and publishing doesn't set the owner to None but just changes "access" from private to public. That way the original owner of the published item can still change it or eventually delete it. (And we can also keep track of who created it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, "access" could evolve into an ACL.
""" | ||
Post request for creating metadata instance for a user. | ||
|
||
The url requires a metadata session key such as /metadata/<key> | ||
the only keys acceptable to create the metadata sessions are favourite/saved | ||
The url requires a metadata object key such as /metadata/<key> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be no longer true, since you removed the post
argument and the template string from the URI.
"Metadata {}: Logged in user_id {} is different than the one provided in the URI {}", | ||
action, | ||
current_user_id, | ||
"Metadata user verification: Logged in user_id {} is different than the one provided in the URI {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User isn't provided "in the URI". You're comparing the user_id in the metadata object against the token in the Authorization header.
except Exception: | ||
self.logger.exception( | ||
"Exception occurred in the GET request while querying the Metadata model, id: {}", | ||
id, | ||
) | ||
abort(500, message="INTERNAL ERROR") | ||
|
||
# Verify if the metadata session id in the url belongs to the logged in user | ||
self.verify_metadata(metadata_session, "get") | ||
# Verify if the metadata object id in the url belongs to the logged in user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, "url" isn't really right.
"User trying to update fields that are not present in the metadata database. Fields: {}", | ||
non_existent, | ||
) | ||
abort(400, message="Invalid fields in update request payload") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything around security, that's true; not so sure we want to extend that to non-security failures. I'd rather be helpful where it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we consider this comment first before proceeding?
@portante The session module from the dashboard is same as the metadata object module (sqlalchemy table) defined here. This table has a Earlier the dashboard was using the Graphql interface to create/query the metadata objects (session module from dashboard). However the only difference is we decided to implement it using the REST protocol keeping everything else same. |
resource_class_args=(config, logger, token_auth), | ||
) | ||
api.add_resource( | ||
QueryMetadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering what's the difference between "GetMetadata" and "QueryMetadata"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The GetMetadata
returns all metadata values for the authorized user, while QueryMetadata
specifies a metadata id
and returns just that value.
Having two classes somewhat obscures the relationship, but if you look at the URI registration in __init__.py
you'll see that they're both GET /metadata
with effectively an optional id
... that is, GetMetadata
maps to just GET /metadata
while QueryMetadata
maps to GET /metadata/<id>
. I think Nikhil looked into combining those and found that Flask didn't provide a straightforward way to make the <id>
parameter optional; the pattern wouldn't match if it was omitted, so he had to add the second class to make it work either way.
I'm not sure how to interpret this comment, but I'll add a little commentary... This PR is very old and will likely never be finished: we're moving to Keycloak for user authentication and I expect that "user metadata" (along with roles and groups) will be maintained in our Keycloak broker instance rather than building our own SQL tables locally. |
Address the issue #2048
API support for creating or querying User Metadata objects.
According to the discussion in #2048 we have decided to implement the dashboard graphql interface using the REST APIs. All the API endpoints are either protected with an optional or strict user authentication. E.g. A user need to be logged-in in order to create the metadata object, public metadata object can not be created. However, at any time, user can publish the their metadata objects to make them public. Public metadata objects are read-only, therefore a non logged-in user can get a list of public metadata objects but can not modify or delete it.
This PR handles the basic API support such as Metadata object creation (POST), Metadata session query (GET, DELETE, PUT) with optional Pbench Authorization header included as described below.
Client can send any text blob as metadata with a user defined key in the json request when creating a metadata object. Dashboard will assign some key such as "saved metadata" (or whatever we want to call it) and eventually "favorites" to persist the currently in-browser favorites metadata objects list. However, key can be anything and not just restricted to saved/favorites.
/v1/metadata
/v1/metadata/<metadata_key>
/v1/metadata/<metadata_id>
/v1/metadata
/v1/metadata/<metadata_id>
/v1/metadata/<metadata_id>/publish
Defines a graphql mutationcreateMetadata
that mimic the current front-endcreateUrl
mutation.Add two simple queriesgetMetadataById
andgetMetadataByUserid
Defines the graphql schema on the mutation and queriesExpose the/user/metadata
endpoint where a post request does not accept the graphql query but a simple JSON requestEndpoint parses the JSON request to make a valid graphql query to execute against the schema and returns the flask response back to client