MSC4021: Archive client controls #4021
I believe the alternative linked in my comment may be more useful and this should require fed peeking to exist to actually work as intended.
@@ -0,0 +1,36 @@
# MSC4021: Archive client controls
Should be added as an alternative.
Also this likely would need to depend on fed peeking since currently you need to join a room to access the info which some people may find bad.
I don't think that 2291 is an alternative; I think the goals are different. 2291 indicates whether the bot is allowed to crawl the room, whereas it looks like the intent for this one is to communicate to search engines whether they are allowed to index the room. For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.
IMHO, indexing is already a form of crawling a room; that's my reasoning. The other MSC can also be used for this case, and it's a little more generic than this one.
There is a desire (matrix-org/matrix-viewer#47 (comment)) to have the room directory API include this sort of information directly, which is why I'm not sure 2291 will work here: this is intended to function more similarly to `m.room.join_rules`. I edited the doc to expand upon this and add 2291 as an alternative.
I don't think fed peeking is an issue, but I'm not sure.
> For example, I might want my room available on archive.matrix.org, but I may not want Google to index it and present it in search results.

@uhoreg With MSC2291, I think this could be achieved with a mix of `messages` and `log` in the `m.room.robots` event. Am I misinterpreting?
`m.room.robots`:

    {
      "*": {
        "messages": false,
        "log": true
      }
    }

- `messages`: (boolean) whether the bot is allowed to index the room's messages. Default: `true` if `m.room.history_visibility` is `world_readable`, and `false` otherwise.
- `log`: (boolean) whether the bot is allowed to display logs of the room to users. This will be `false` if `messages` is `false`. Default: `true` if `m.room.history_visibility` is `world_readable`, and `false` otherwise.

The names are slightly confusing compared to what they actually do.
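
Read literally, the MSC2291 definitions quoted above can be sketched as a small resolver. This is only an illustration of the quoted wording, not code from either MSC; the function name `resolve_robots_rule` and the idea of passing the room's `m.room.history_visibility` value in as a string are assumptions made for the example.

```python
def resolve_robots_rule(rule: dict, history_visibility: str) -> dict:
    """Resolve one m.room.robots entry using the defaults quoted from MSC2291.

    Hypothetical helper: the name and signature are not part of any MSC.
    """
    # Both properties default to True only for world-readable rooms.
    default = history_visibility == "world_readable"
    messages = bool(rule.get("messages", default))
    # The quoted text says `log` is False whenever `messages` is False.
    log = bool(rule.get("log", default)) if messages else False
    return {"messages": messages, "log": log}


wildcard = {"messages": False, "log": True}
print(resolve_robots_rule(wildcard, "world_readable"))
```

Under that reading, the `"messages": false, "log": true` combination shown above resolves to `log` being `false`, because the quoted text says `log` will be `false` whenever `messages` is `false`.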
In MSC2291, `messages` is intended to indicate whether the bot itself is allowed to index messages, whereas this proposal is intended to communicate preferences to other crawlers that crawl the bot's logs. This may be able to be done with an addition to 2291 (e.g. add a new property), but 2291 itself doesn't do this.
Feels like there might not be any difference between the bot itself (Matrix Public Archive) and a different crawler that crawls the bot's logs (a search engine). They're both accessing the same information (same or derived), and it feels like `messages` from MSC2291, which controls indexing of messages, covers that. In other words, if `messages: false`, the archive can't index messages and neither can search engines.
Basically, any bot preference should probably be passed down for other bots to follow?
I think if it's a wildcard `*`, it should apply to downstream bots. It's less clear how things should flow if someone specified an app. Perhaps it wouldn't flow in the specific app case but could use the `*` rules to govern how search engines look at it.
And maybe we want to define some generic "search_engines" key for example since it might be common. But not all of the preferences are applicable since we can't pass along all of this preference detail seamlessly (impedance mismatch).
from returning duplicate content or taking precedence in search results over an organization's self-hosted archive.

For example, if `via` is set to `"archive.example.net"` in `#main:example.net`, the page at
https://archive.matrix.org/r/main:example.net/date/2023/05/28 should return this HTTP header:
This seems to assume that all archivers will have the same URL format, which may not be true. If they all run matrix-public-archive, then that may be, but it's possible that some other archiving software may use a different format.
An alternative could be to have `via` be a full URI, like https://archive.example.net/r/main:example.net, and then https://archive.matrix.org/r/main:example.net/date/2023/05/28 would return:
Link: <https://archive.example.net/r/main:example.net>; rel="canonical"
It would miss out on features like date pagination, although it now occurs to me that for the purposes of web indexing, this might actually be preferable behavior?
The problem with this alternative is that it might be more difficult for the self-hosted client at archive.example.net to parse and not include this canonical link header, because I don't think it would be ideal for the canonical archive to return this header. So I don't know, maybe that's something to leave up to client interpretation, maybe a standard URL format should be part of the spec? 🤷♂️
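
One way a client could handle both points in this thread (a full-URI `via`, and the canonical archive not pointing at itself) is sketched below. The helper name `canonical_link_header` and the `serving_host` parameter are hypothetical, and comparing hostnames is just one possible interpretation, not something the proposal specifies.

```python
from typing import Optional
from urllib.parse import urlsplit


def canonical_link_header(via: str, serving_host: str) -> Optional[str]:
    """Return a canonical Link header line, or None if none should be sent.

    Hypothetical helper: assumes `via` is a full URI as suggested above.
    """
    canonical_host = urlsplit(via).hostname
    if canonical_host is None:
        # `via` is not an absolute URI, so there is nothing safe to emit.
        return None
    if canonical_host == serving_host.lower():
        # The canonical archive itself skips the header.
        return None
    return f'Link: <{via}>; rel="canonical"'


via = "https://archive.example.net/r/main:example.net"
print(canonical_link_header(via, "archive.matrix.org"))
print(canonical_link_header(via, "archive.example.net"))
```

The first call produces the header shown in the comment above; the second returns `None` because the page is already being served from the canonical host.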
If we wanted something specific to the Matrix Public Archive URL format, we could use an event type scoped to the sub-domain like `org.matrix.archive.canonical` to convey this information.
| `archive` | boolean | | Whether the room should be included in room directory listings which are intended to be viewed by the public |
| `robots` | [string] | Valid [robots meta rules](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#directives) | A list of rules which should be included in a `robots` meta tag and/or [HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag-implementation) by public-facing clients. e.g. `["noarchive"]` or `["noindex", "nofollow"]`. |
For the Matrix Public Archive, there are kind of two things to consider:
- Whether you want to show up in the archive at all (display)
- Whether you want to allow search engines to index that content (indexing)
The `robots` field definitely covers the search engine indexing decision by being able to opt out with `noindex`.
For the display decision, it's less clear whether `robots` can cover it. But `noarchive` sounds pretty decent just by name and also because of what it means:

> `noarchive`: Requests the search engine not to cache the page content.
> -- https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta/name#other_metadata_names

And the Matrix Public Archive really just allows you to view a public Matrix room with some potential caching on top (it doesn't store anything). But this might be an overloaded usage of `noarchive`, since caching is not the same as displaying, which the archive also does at its core.
Depending on the answer here, the `archive` field may be redundant compared to what can be specified in `robots`.
Perhaps the display should be keyed off something else entirely anyway.
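
For the indexing half of this, a public-facing client might turn the `robots` list into the two outputs the proposal mentions. A minimal sketch, assuming the event content has already been fetched as a dict; `robots_outputs` is a hypothetical name, and the display decision (`archive`) is deliberately not modeled because the thread has not settled how it should work.

```python
import html
from typing import Optional, Tuple


def robots_outputs(content: dict) -> Optional[Tuple[str, str]]:
    """Build an X-Robots-Tag header line and a robots meta tag.

    Hypothetical helper; returns None when no usable rules are present.
    """
    rules = content.get("robots")
    if not isinstance(rules, list):
        return None
    # Keep only non-empty strings without line breaks, so room state
    # cannot smuggle extra headers into the HTTP response.
    clean = [
        r.strip()
        for r in rules
        if isinstance(r, str) and r.strip() and "\r" not in r and "\n" not in r
    ]
    if not clean:
        return None
    value = ", ".join(clean)
    header = f"X-Robots-Tag: {value}"
    meta = f'<meta name="robots" content="{html.escape(value, quote=True)}">'
    return header, meta


print(robots_outputs({"robots": ["noindex", "nofollow"]}))
```

The sketch does not check the rules against the list of valid directives; a real client would probably want to do that as well.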

## Proposal

Add an `m.room.archive_controls` state event where you can specify information about if and how you would like your
`m.room.archive_controls` feels very specific to the archive use case and we may want to be more generic.
For example, people building a blog or forum on Matrix would use similar `robots` controls (see other beyond chat applications for Matrix).
Maybe we only need to be generic with a `m.room.robots` state event and other archive specific event types would still be useful.
Maybe, but I didn't want this to be confused for controls over Matrix chat/integration bots, which this really isn't. It's more of a control over a specific class of clients in my mind, which I wasn't sure how to refer to.
Unless you think this has purpose outside of clients which are intended for public unauthenticated access, but I think a comments system on a blog would also fall under that category.
IMHO this proposal is misguided, because a room with world-readable history can have its history read by any client, which would be free to ignore such "controls." That is, these would not be controls, but merely expressed preferences. They would likely give users or room admins a false sense of security, because while they may have set a "control" to prevent indexing of their room's history, any party could be doing so with just a few lines of code, even connecting it to their own Instead, we should seek to make the consequences of room settings as clear as possible to users and admins.
This proposal solves these problems: