Specify container description #227
Comments
I find the use cases for including "basic" information about contained resources in the container description compelling. Applications can immediately provide simple functionality while keeping the number of requests/connections minimal. It'd be reasonable to require this level of support on container read operations from servers in order to enable "smart" enough applications to get off the ground without having to resort to more advanced mechanisms. I would consider last modification and size to be "basic" information. Ditto human-readable label if available. And possibly the creator of the resource. While knowing whether a resource is a container or not (by reading the container description) is very useful, that information can be derived as per shared slash semantics, hence it is not absolutely necessary that the container description includes resource types of contained resources. |
Can you add a reference to the HTTP/1.1 server specification with the information that is to be available on the server side? |
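To illustrate the slash-semantics point, here is a minimal sketch of how a client might tell containers from other resources using only the URL. The function name `is_container` is made up for this example, and it assumes the server follows the convention that container URLs end with a slash.

```python
from urllib.parse import urlsplit


def is_container(url: str) -> bool:
    """Guess container-ness from the URL alone.

    Assumes the server follows slash semantics: a URL whose path ends
    in "/" identifies a container. Query and fragment are ignored.
    """
    return urlsplit(url).path.endswith("/")
```

With this, an application would not need resource types in the container description just to decide whether a member can be expanded further.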
I would rephrase the question here to be something more like:
I disagree that container listing is the best way to do this. A query endpoint (e.g. triple pattern fragments) can achieve the very same end with (arguably) better scalability characteristics. The basic problem with including this data in a container relates to authorization. Consider, for example, a container with 100 child resources. A simple GET request to the container will require an access check at the container level. Then 100 subsequent checks would be needed, one for each child resource. What happens with 1,000 child resources? 10,000 child resources? This does not scale. The only way this scales is by introducing a paging mechanism such that you limit the scope of authZ enforcement to a predictable window size, which is why I suggest TPF. |
for whom? Agree from a server's point of view but not particularly attractive from an application's point of view. It is quite a burden for applications to fetch each resource to get a hold of what they need (along the lines of what's mentioned in the above use cases) in order to provide something usable. I would consider having to collect the data through a query endpoint relatively more complex than getting it simply from the container representation. Moreover, servers are not required to provide a query endpoint - at the time of this writing - so the basic information wouldn't be consistently available to applications. If your counter argument/proposal is to address the use cases above by querying, we need to introduce a query mechanism as a hard requirement. (Which would help to meet quite a bit of other needs but that's all beside the point).
Generally agree but we need empirical data as mentioned. True that a container can theoretically hold an infinite number of resources (I think). Are applications - with the understanding of hierarchical organisation of Solid storage - organising data such that containers with many resources are common (in the wild)? If at all, how is resource organisation or management factored in? Servers may want to limit the number of members a container can have to a number they are comfortable with. Implementation detail. Agree on needing pagination as a way to control the cost of a request/response which would be an alternative to above - server fixing the max number of resources allowed per container. Implementation detail. |
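The scaling argument can be put into numbers with a rough sketch. This is a simplified cost model, not a measurement: it assumes one access check for the container plus one per child whose metadata is included, and the function names are invented for illustration.

```python
import math


def access_checks_unpaged(children: int) -> int:
    """One check on the container plus one per child resource."""
    return 1 + children


def access_checks_per_page(children: int, page_size: int) -> int:
    """With paging, a single response covers at most page_size children."""
    return 1 + min(children, page_size)


def pages_needed(children: int, page_size: int) -> int:
    """Number of paged requests needed to walk the whole container."""
    return math.ceil(children / page_size)
```

Under this model the unpaged cost grows linearly with the container (101, 1,001, 10,001 checks), while the per-request cost of a paged response stays bounded by the window size.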
This is not what I am suggesting. I agree that such an interaction is a non-starter: there are way too many HTTP round-trips. A query endpoint allows a client to retrieve all the information it needs in a single request.
Here is empirical data for a system that implements the "check every child resource" approach: https://wiki.lyrasis.org/display/FF/Many+Members+Performance+Testing You can see response times in the 60 second range for 10K child resources. |
Our definition of a container is this RDFS class called As you can see, there's a related property So for example (prefixes missing):
|
Would it make any sense to have the listing of a container's contents follow the permissions on the container rather than the permissions on the contents? For example:
This would mean that the server never has to do a mass check of the permissions on its contents but the user would still have the option to hide the server-managed information when that is their intention. |
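A minimal sketch of the difference between the two listing strategies, assuming a toy in-memory model in which each resource carries a set of agents allowed to read it. The classes and function names are hypothetical and do not reflect any server's actual data model or ACL evaluation.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    url: str
    # Hypothetical access model: agents allowed to read this resource.
    readers: set[str] = field(default_factory=set)


@dataclass
class Container(Resource):
    children: list[Resource] = field(default_factory=list)


def list_by_container_permission(agent: str, container: Container) -> list[str]:
    """One access check: the listing follows the container's permissions."""
    if agent not in container.readers:
        raise PermissionError(f"{agent} cannot read {container.url}")
    return [child.url for child in container.children]


def list_by_child_permission(agent: str, container: Container) -> list[str]:
    """1 + N access checks: each child is filtered by its own permissions."""
    if agent not in container.readers:
        raise PermissionError(f"{agent} cannot read {container.url}")
    return [child.url for child in container.children if agent in child.readers]
```

The first function never touches the children's permissions, which is what keeps the cost of a container read independent of the number of members.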
I know. I said that as the current solution to meet the needs. Querying, pagination or something else is currently not possible (=unspecified). Thanks re Fedora data, that is useful. It is not easy (for me) to break it down as there are a number of different dimensions with varying values. The test with ~60s is perhaps on the higher end ("perhaps postgres needs caching configured?") - if you can provide more insight on this, that'd be useful. There is a can of worms here re caching of access policies. Is there something along those lines available for Trellis? |
@namedgraph I presume you can filter based on authorization policy per resource? And the response time for request to /photos/ with different access controls on each contained item is marginally different to if each item is public-read? |
No because each resource (container or other) can have different access controls. System must not leak any information about contained resources when agent is unauthorized to read those resources - last modification, size etc. are indeed sensitive and should not be exposed. The most a read access on a container permits is the visibility of the containment statements (just references). |
@acoburn wrote
The LDP group worked quite hard on a spec for paging. See: https://www.w3.org/TR/ldp-paging/ |
Re: Trellis, that code works as described by @jeff-zucker (authZ decisions are made based on container permissions, not based on access to the child resource). Trellis also does not include any information about the child resources, so it just sidesteps this issue. Consequently, container retrieval is measured in milliseconds. For Fedora, there was a huge amount of work done related to this issue, and ultimately, many users began finding various work-arounds that just avoided using LDP containment, e.g.:
In my own experience, the Fedora server just got really, really slow once you had more than a thousand child resources in a single container. There were various attempts to resolve this, but those efforts never really went anywhere with that tech stack. I don't know where things stand these days, but it led to a lot of people abandoning the project. Re: Query -- I see paging and query as two ways of describing a very similar feature, and they are both really useful. |
No ACL for child resources, no (yes for containers themselves). Since client-side containers are just UI for certain SPARQL queries, and we don't have ACL for plain SPARQL -- only for Linked Data resources. Once you have SPARQL access, you can pretty much see all the data, so it's a privilege to have. |
I recently noticed that ESS does not include the modified time because it's not part of the spec, and that makes apps unusable for large collections. So I'm very happy to see this :). I think my use-case has already been covered in previous comments, but I'll go over it briefly in case it's useful to see it from an app developer's perspective. What I want to do in my app is reduce the quantity (and size) of network requests. Given that querying is not supported, the solution I've arrived at is caching everything in the client. This makes the first session slower, but makes subsequent sessions faster. It also improves the overall responsiveness of the app, because it doesn't have to make network requests for reading data. However, all of this depends on being able to read only the updates at the start of every session. So far, that's what I've been using the modified time for, and without it I can't think of a way to improve the application start up. Something else that would be useful is knowing the types of resources included in the documents. For example, reading the type index I can find containers that include the types of resources I'm interested in. But that doesn't mean that a container doesn't have other types of resources, and I'd like to avoid reading documents that are not relevant to my app. I understand that doing this can have an impact on server performance, so I don't have strong opinions as to how this information should be retrieved. I think it would make sense to return only containment triples by default, and use some mechanism like headers to indicate what other types of information are relevant. Re: pagination, I suppose for really large amounts of data it would be necessary. With my current approach it's actually better to get everything in one request, given that I'll want to read all the documents that are relevant to my application (I was actually using globbing before it was deprecated). 
Pagination would be useful with query support - at that point I may be able to avoid caching everything - but given the current status this is the only viable solution I found. |
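The caching approach described above could look roughly like the following sketch. It assumes the container description exposes a last-modified time per contained resource and that the client stores the modified time it last saw for each cached document. The function names and the dictionary shapes are assumptions for illustration only.

```python
from datetime import datetime


def resources_to_refetch(
    listing: dict[str, datetime], cache: dict[str, datetime]
) -> list[str]:
    """URLs that are new, or modified since the cached copy was stored.

    listing: URL -> modified time reported by the container description.
    cache:   URL -> modified time of the locally cached copy.
    """
    return sorted(
        url
        for url, modified in listing.items()
        if url not in cache or modified > cache[url]
    )


def resources_to_evict(
    listing: dict[str, datetime], cache: dict[str, datetime]
) -> list[str]:
    """Cached URLs that no longer appear in the container."""
    return sorted(url for url in cache if url not in listing)
```

At session start the app would make one request for the container, then fetch only what `resources_to_refetch` returns. Without modified times in the container description, it has no way to tell which cached documents are stale short of requesting each one.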
For the TrinPod server case in authenticating what RDF data to include in a container request: We use a fully hierarchical authentication scheme that at the lowest level is a single statement, so our server first retrieves all the information that a request would have without authentication, then does an auth check on each statement to keep only those the authenticated user has access to, and from that generates the final response. The hierarchical nature of the auth check in combination with the cached ACLs presents virtually no resource hit on the server side. On the Application side, in creating our Files app which we are finishing now, we are arriving at the idea that a single request to a container should present enough information for the user to intelligently decide what they want to do next, such as expand a child branch of that container. So we would be very happy to support any proposed standards about what to include as part of a container request to improve the UX. I think the paging issue that @acoburn brings up is also very important, so a standard around that would be great too. At the moment, as standards aren't yet in place, for TrinPod we are including in a request to a container: all the child nodes of the container with ldp:contains, and then the ldp:contains of those child nodes as well as the last event triples around the content in the requested container (such as any schema:UpdateAction around that content), of course all filtered by user access permissions. |
Created issue for resource paging: #230 |
@csarven I vote to make those two specs part of the Solid standard - but I think also needed would be a recommendation for how many items to include in a given page |
@gibsonf1 If paging is required, I can't see why more than one mechanism is needed. The number of items to include for a paged resource would either be a client preference included in the request, which a server may agree to, or simply the server's own value (implementation detail). |
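As a sketch of that negotiation: the server honours the client's preferred page size up to its own ceiling and otherwise falls back to its default. The parameter names and the fallback rule are assumptions made for this example, not taken from any paging spec.

```python
from typing import Optional


def effective_page_size(
    client_preference: Optional[int], server_default: int, server_max: int
) -> int:
    """Pick the page size for a paged container response.

    A missing or non-positive client preference falls back to the
    server default; otherwise the preference is capped at server_max.
    """
    if client_preference is None or client_preference < 1:
        return server_default
    return min(client_preference, server_max)
```

Either way the server keeps control over the cost of a single response, so a recommended page size would only need to be advisory.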
It would be worth having a comparison between both. |
I'm catching up here, and I appreciate that this is a summarization of several different things, and so I don't think it serves to pose this as a single question. What I'm seeing here are at least these problems:
The first case is essentially a generalization of the Data Browser behavior where it looks for Number 2 is essentially what we have referred to elsewhere as a File Scan operation. We haven't set down what a File Scan operation is, but in the context of Solid it is pretty clear that a File Scan operation is to read the contents of a container, and it now requires read privileges on the container, and that should be adequate for now. It is very interesting to read that @gibsonf1 has an implementation that performs well when checking access control for a tree, but in the interest of having a spec that many can implement, at least in the initial versions, I think it is correct to assume that it is rather hard to achieve that performance, as @acoburn has experienced. Thus, at least initially, we should make sure that a File Scan operation can be done with read privileges on the container only. Anything beyond that is not a File Scan operation. Then, the question becomes what information a File Scan operation can legitimately expose. I think the above discussion and @acoburn's comment in #116 make it very clear that at least the containment triples are a part of the container representation. If you need the hidden file case, then you need to make a child container and then have other permissions on that. My opinion, at least right now, is that some other attributes, like mtime, type and size, could be a part of the container representation in a File Scan operation. Again, if you need to protect those, make a container with different permissions. There's also some precedent for this: Apache has a default index that exposes mtime and size by default. In conclusion, number 2 above is the File Scan operation, which maps to a read operation on the container in Solid, which exposes containment triples, size, type and mtime as well as other server managed and client managed metadata. But, there's more! 
;-) It could be argued that computing mtime and size is too heavy for most users, we shouldn't give that unless people ask for it. For that, I suggest we look into defining and registering a |
In the implementation I'm working on I would find it very useful to be able to get the description of the container without any While I'm not a believer in an average person finding the filesystem an intuitive interface, this comparison still seems to be useful among spec writers / developers. When I issue I see it very much related to the discussion about possibilities for separating client-managed and server-managed triples. Having only client-managed triples in the response (incl. assigned label) would solve this use case (at least for me) |
I would be all for entirely server-managed container resources, to make a clean separation between resources that are server managed and not, but I'm not sure that is a possibility at this point. |
@kjetilk do you mean:
? I think this would be much better than the current mixed bag of statements and all the quirkiness around it. My preference would be to have the opposite
Most likely this change would be too radical. |
I would be OK with either, but I suppose both are too radical at this point, and the latter more so than the former. |
@kjetilk do you see this change as radical, looking mostly at the former one, due to the impact on existing implementations, or possibly impacting other parts of the spec (or other specs) and requiring cascading changes? |
Since current implementations look for containment statements in |
@kjetilk I recall your comments a little over a year ago in solid/authorization-panel#253 (comment) , solid/authorization-panel#253 (comment) , and solid/authorization-panel#253 (comment) I think all that we discussed there would be clearer if client-managed and server-managed would be distinct resources with specific access control applied to them. Currently, containers are an exception to resource-level access control. As a result something which should be very simple (allowing the creation of contained resources while disallowing editing the container description), becomes a nightmare. I think reviving AuthZ UCR will give us the opportunity to take another look at how containers and their client-managed descriptions are intended to be used. IMO if there is a major design issue that we can fix, doing it pre 1.0 might be the best time to do it. |
Removed Release 0.11.0 milestone per agreement at 2024-02-14 CG meeting. |
Hey @hzbarcea, thanks for the update. If this is not being included in 0.11 after 3 years of discussions, when can we expect to have this resolved? Looking at the activity in this issue, it seems like this is important to many people. I would like to understand why it's taking so long to be resolved. In the linked meeting notes it says it won't be included because it would block the release, but what does that mean? Is it blocked because we still need to make a decision? Concerns for server implementers? Lack of contributions in the spec? |
@NoelDeMartin, you raise excellent questions and concerns. Evidently, there are no legitimate details:
HOWEVER there is considerable detail about who "voted" on a proposal and how they voted to remove an item from a milestone. That's essential for understanding certain aspects of this issue and the social dynamics. No, this issue is not blocking the next release. The entire premise is unsubstantiated. Whether it's included in the next release/milestone or not holds little significance from the ED's perspective. Similar attempts are being made to remove other items from the next milestone without presenting clear justifications or plans. There is nothing constructive here. All that aside, I will do my best to follow up on dangling technical concerns/open considerations. There aren't that many, but they could be significant when I translate them to PRs because they will 1) clear up misunderstandings and expectations 2) introduce class 3-4 changes because they touch on some other issues/considerations/requests. As I see it, this issue is not something we look at in isolation, and removing it from a milestone indicates a lack of understanding of the concerns initially raised. As I and others have mentioned in recent meetings, we (CG) try to make progress on the specifications in this incubation space. If/when and in what form a WG takes place has no impact on continuing to take this issue seriously (see my bullet points above for example) now. If anyone has new data, opinion, or initiative on it, they're encouraged to share them. The door is wide open. Suggesting that the WG will handle it is an attempt to limit discussion and, dare I say, influence who gets to "vote". |
Background: to date, the Solid Protocol (including earlier drafts and issues) only required server-managed containment statements in the representation of a container. Additional information such as last modification, size, resource type etc. about the contained resources as part of the container representation was deemed to be optional or considered to be a best practice. Examples in the wild show that some servers do make this additional information available, meanwhile some other servers do not support it. Some applications do make use of the information if available or work around the limitation to get a hold of the information [Anecdotal Evidence]
General use case:
Support navigation of the container and its contents.
Use cases:
Related UCs:
Scenarios to consider:
https://example.org/{uuid}
General requirement:
Include descriptions about contained resources in container's description to further support navigation and application interaction.
Specific requirements:
Considerations:
Related issues:
Notes:
describedby
) could include information about the contained resources. Doesn't violate best practice on self-describing documents per se but it is perhaps not the most intuitive place to look for additional information about the contained resources.
Prefer
header be meaningful?