Add a neato informative table of various URL pieces #337

Open
domenic opened this issue Jul 17, 2017 · 65 comments
Labels
clarification Standard could be clearer

Comments

@domenic
Member

domenic commented Jul 17, 2017

Basically copy the bottom half of this: https://nodejs.org/api/url.html#url_url_strings_and_url_objects

(We could presumably SVG-ize it so it's a little prettier.)

Via the thread at https://twitter.com/wa7son/status/886982643463708673

@TimothyGu
Member

Quick WIP with Inkscape:

[image: drawing svg]

Would something like this work?

@annevk
Member

annevk commented Sep 5, 2017

I think so, although maybe a table is better as that would be more accessible I suspect. Note also that ? is not part of the pathname getter. The other thing that might be interesting is to illustrate a couple different URLs. In particular different schemes. You also omitted the origin field although that's rather hard given that it needs to skip user/pass somehow.
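
A minimal sketch of the two points above (`?` belongs to `search`, not `pathname`, and `origin` skips the credentials), assuming a runtime with the WHATWG `URL` class such as a browser or Node.js. The example URL is made up for illustration.

```typescript
// Hypothetical example URL; the host, credentials, and path are made up.
const exampleUrl = new URL(
  "https://user:pass@sub.example.com:8080/p/a/t/h?query=string#hash"
);

// Collect the API getters that a table of URL pieces would label.
const apiPieces = {
  protocol: exampleUrl.protocol, // "https:" (includes the trailing colon)
  username: exampleUrl.username, // "user"
  password: exampleUrl.password, // "pass"
  host: exampleUrl.host,         // "sub.example.com:8080"
  hostname: exampleUrl.hostname, // "sub.example.com"
  port: exampleUrl.port,         // "8080"
  pathname: exampleUrl.pathname, // "/p/a/t/h" (the "?" is not included)
  search: exampleUrl.search,     // "?query=string"
  hash: exampleUrl.hash,         // "#hash"
  origin: exampleUrl.origin,     // "https://sub.example.com:8080" (no user/pass)
};

console.log(apiPieces);
```

Because `origin` is built from the scheme, host, and port only, it cannot be drawn as one contiguous span of a URL string that carries credentials.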

@TimothyGu
Member

> Note also that ? is not part of the pathname getter.

Oops, an off-by-one.

> In particular different schemes

That does indeed sound like a nice idea.

> You also omitted the origin field

I did so intentionally, as it's not a concept intrinsically related to URL parsing, but rather more about Web apps/security. And because it's hard.


I did a table version with a few variants. The first is a straight translation of the SVG graph. The second is closer to the version in the Node.js doc. The third is the same as the second but has origin, mainly there to show how ugly it is. The fourth is a URN for fun. I do like the fact that you can link to the spec for that exact attribute, but with some coloring I still think the SVG one is a bit prettier.

[image: screenshot from 2017-09-05 16-07-49]

@annevk
Member

annevk commented Sep 5, 2017

Alright, I'm game. We should be able to get the links to work with SVG too. I'm not really sure if we can make all of it equally accessible though.

@domenic
Member Author

domenic commented Sep 5, 2017

Personally I like the first table one, possibly with additional text-align center.

I also think it might be interesting to have a counterpart table that is about the URL record terms, instead of the API? (E.g. scheme instead of protocol, query instead of search, fragment instead of hash.) Maybe that wouldn't be that helpful though.

@annevk
Member

annevk commented Sep 5, 2017

It's probably useful as there are some interesting differences between the two. Bit unclear where the table should be located at that point, but maybe we could put it in an Appendix?

ruflin added a commit to ruflin/ecs that referenced this issue May 30, 2018
So far the url structure was heavily inspired by whatwg/url#337. I initially only wanted to make some tweaks to it to improve querying but I realised I never fully felt comfortable with the field names used here. So I started to look at the url parser of different languages like Go, Ruby, Python and the output they provide are surprisingly similar but not consistent with whatwg. The change made here brings the field names closer to what most url parsers output.
ruflin added a commit to ruflin/ecs that referenced this issue Jun 5, 2018
So far the url structure was heavily inspired by whatwg/url#337. I initially only wanted to make some tweaks to it to improve querying but I realised I never fully felt comfortable with the field names used here. So I started to look at the url parser of different languages like Go, Ruby, Python and the output they provide are surprisingly similar but not consistent with whatwg. The change made here brings the field names closer to what most url parsers output.
andrewkroh pushed a commit to elastic/ecs that referenced this issue Jun 6, 2018
* Change structure of URL

So far the url structure was heavily inspired by whatwg/url#337. I initially only wanted to make some tweaks to it to improve querying but I realised I never fully felt comfortable with the field names used here. So I started to look at the url parser of different languages like Go, Ruby, Python and the output they provide are surprisingly similar but not consistent with whatwg. The change made here brings the field names closer to what most url parsers output.

* fix description of query field

* switch it to integer

* Add dot to host.name to make it consistent
@EnnexMB

EnnexMB commented Aug 19, 2018

What is the status of this issue? The issue that I submitted today, "Documentation on URL syntax", has been closed and deferred to this issue, which has been in play for over a year. In the meantime, the only documentation I've found that lays out URL syntax is the series of steps in section 4.5, which requires some figuring out to understand. So are we going to add something to fix that?

@annevk
Member

annevk commented Aug 20, 2018

@EnnexMB it basically needs someone to work on it and resolve the open questions above.

@EnnexMB

EnnexMB commented Aug 20, 2018

Ok, can I help with this? It seems that the two open questions are:

  • What to use to represent the syntax (graphic, table, or formula). I suggest the second or third of the four tables posted by TimothyGu, but:
    • It should be in text, not a graphic, with links on the terms. The code Timothy used to generate the graphic would be helpful.
    • The terms should match those used in the URL spec, although it would be good to point out the parallel API terms and link to their document, and it would be best if the analogous table is posted there as well.
    • It would be helpful to have an additional row at the top of the table with a symbol to distinguish required from optional elements. In the formula proposed in my issue yesterday, this was done with square brackets. Timothy's table is much easier to read, but the optionality is important information to include.
    • I guess the third table is best if "origin" means something, which I don't know. Linking to where it's explained in the spec would fix that. It appears twice, so some clarification is needed.
    • I don't know about URNs, but if Timothy's fourth table is related to URLs, then it would be good to show that relationship as well.
  • Where to put it. I suggest section 4.5 unless there is a more appropriate place. If it's put in an appendix, it should be prominently linked to, and maybe should be linked to in a few places anyway, because this is going to be something that will be useful to people.

What have I left out? If Timothy will post the code behind his graphic, I'll work on editing it to implement the suggestions above.

@annevk
Member

annevk commented Aug 20, 2018

Sounds good, thanks. What do you think @TimothyGu?

As for placement, the top of section 4 might also work, given that it illustrates the relationship between various subsections.

@TimothyGu
Member

@EnnexMB Thanks for your interest in this.

The WIP are unfortunately on my laptop that has seen some physical damage since the time I created them. I'll try to recover the files tonight.

> The code Timothy used to generate the graphic would be helpful.

For the first (https://user-images.githubusercontent.com/1538624/30042227-a0f3af64-9222-11e7-96a4-39c0cf11d279.png) it was just a manually created SVG. For the second it was a pretty standard HTML table with the spec's default styling.

> It would be helpful to have an additional row at the top of the table with a symbol to distinguish required from optional elements.

NB: what's optional is quite different for different URL schemes. The URN at the end is a good indicator of that. In fact, for non-special URLs only the scheme is required and nothing else – tim: is a valid URL! It's important to be mindful of that.

> I guess the third table is best if "origin" means something, which I don't know.

I'd be okay removing that. It's not really a component of the URL but rather a byproduct, so may not fit in that table.
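
A rough sketch of the optionality point, assuming a runtime with the WHATWG `URL` class: `tim:` parses with nothing but a scheme, while a special scheme such as `https:` fails without a host. `tryParse` is a hypothetical helper, not part of the URL API.

```typescript
// Hypothetical helper: returns the parsed URL, or null when parsing fails.
function tryParse(input: string): URL | null {
  try {
    return new URL(input);
  } catch {
    return null;
  }
}

// Non-special scheme: only the scheme is present, and that is enough.
const schemeOnly = tryParse("tim:");
if (schemeOnly !== null) {
  console.log(schemeOnly.href);     // "tim:"
  console.log(schemeOnly.host);     // ""
  console.log(schemeOnly.pathname); // ""
}

// Special scheme: the parser requires a host, so this input is rejected.
const specialWithoutHost = tryParse("https:");
console.log(specialWithoutHost); // null
```

Any "required vs. optional" row in a table would therefore have to be stated per scheme type rather than once for all URLs.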

@EnnexMB

EnnexMB commented Aug 24, 2018

Hi @TimothyGu, were you able to recover the file? I don't think we need the first one, since it seems to be superseded by the second, the table version. Standard HTML is fine, and it could help to start from the structure you've already created, rather than starting from scratch. Of course, if you want to move forward with the changes yourself, that would be great too. But I'd be happy to do it if that would help.

I understand that optionality is complex. I had in mind to devise some compact way to represent it in the first row of the table. You (and others) might want to take a look at the formula in my original post and see if you agree with the optionality as represented there by square brackets. (I just now edited with a correction.) It does have everything optional except scheme:. I wrote that formula entirely based on the serializing instructions in section 4.5.

@EnnexMB

EnnexMB commented Aug 31, 2018

@TimothyGu, any luck getting that file? I really think that one way or another we should get this done.

@TimothyGu
Member

TimothyGu commented Sep 2, 2018

@EnnexMB Sorry about the delay, but yes! Here's the diff for the table version:

https://gist.github.com/5eb111b5021b338d516e97225a65bed4

Here's the SVG if you're interested. Note the search coverage is still wrong.

https://gist.github.com/bf539f420463bab1eb7426cff267a5b4

(drawing2.svg has the fonts embedded)

Please go ahead and work on it. I won't be able to do so myself and I really appreciate your stepping up.

@EnnexMB

EnnexMB commented Sep 7, 2018

Thank you @TimothyGu. I need some help with the format of the file in the first link. Can someone send me a link to documentation on the diff format used there? I Googled "diff file" and don't see anything relevant.

@annevk
Member

annevk commented Sep 7, 2018

I found https://www.thegeekstuff.com/2014/12/patch-command-examples/. The document being patched is the source file for the URL Standard by the way, url.bs.

@TimothyGu
Member

@EnnexMB Oops, I’m sorry to have missed your comment on the gist itself. What @annevk gave should work, though I would personally do this:

  1. Put the gist file in a file, let’s call it tmp.diff
  2. Apply it using git apply tmp.diff.

git apply has several advantages over patch and is usually much easier to use, so I’d recommend that for diffs with Git headers like the one I provided.

@EnnexMB

EnnexMB commented Sep 7, 2018

Okay, I'm sorry, but I still need a bit more help here.

I think the problem is that this all started when I was reading the URL standard and posted an issue about it, which landed me here in GitHub, but I have no experience in GitHub. So when I'm told to use git apply tmp.diff, I don't know what environment I'm supposed to be in to do that.

I Googled git apply and found what appears to be documentation of that command, and from there, of git itself, which appears to be software that I need to install on my computer in order to proceed with this. Is that correct, or is there a way to work with that diff file online without installing software?

Sorry to distract from the thread topic by needing some guidance.

@annevk
Member

annevk commented Sep 10, 2018

It's for the command line, e.g., the Terminal application on macOS. And yeah, you'd need to have such tooling installed (for macOS you'll get prompted to install it). To help you, I applied the diff to url.bs and copied the result to https://html5.org/temp/url.bs.

@EnnexMB

EnnexMB commented Sep 11, 2018

Edit, Sept. 15: Disregard this post, and see my next one below.


Okay, thank you. That gave me a helpful starting point.

I don't know how to include HTML in this post, so I've inserted two images of what I've done and then after those images, I provide a link to the HTML file that generated both of them.

Here is @TimothyGu's third table with the changes I suggested and some additional changes:
[image: url syntax representation - original table proposal modified]
The complete list of changes from his original table is documented in the HTML file linked below. Also, in that HTML file, the red, underlined text is working links.

In addition, I've done some further work to present an alternative proposal, which has three parts.

  • Formulaic representation: I think this is worth including because it uses the standard system of square brackets to represent optional elements and curly brackets with a vertical line to represent a set of elements to select from. Also, referring to it can assist in understanding the meaning of the graphical representation below it.
  • Graphical representation: This uses different colors to represent optional elements, with a gradation of lighter colors to represent elements that are optional within other optional elements, and adjacent elements in the same color to represent mutually exclusive choices. The information content is the same as in the formulaic representation, but it is easier for a human to read.
  • Table of element conditions: This summarizes the rules in section 4.5. "URL serializing" of the standard. Again, referring to this table can make it easier to understand both the formulaic and graphical representations.

In the following image, the underlined text is working links in the HTML file linked further below.

[image: url syntax representation - new proposal]

The two images above were generated in an HTML file using the same CSS as the URL standard. However, that didn't handle conversion of the double-brace wrappers used in @TimothyGu's code, so I converted those to <code> tags. (I'd be very interested in knowing how to use those double-brace wrappers if someone could direct me to information on that.)

The HTML file is posted at Gist, and I don't see a way to link to it so it can be read directly by your browser. So to see it as intended, you will have to copy it into your own htm file and view it in your browser from there. If someone will tell me a better way to do this in the future, I will do that.

@EnnexMB

EnnexMB commented Sep 15, 2018

Alright, hold on a second. Disregard my previous post from a few days ago. I was just reading up on CSS syntax and in sections 4.1 and 5.1 came upon railroad diagrams. It's a far better way to represent syntax than my home-spun graphical representation above. I found a website for generating them, and here is the result for URLs:

[image: url syntax railroad diagram]

Along with that graphic, there is an htm file that shows that diagram with links on the element names to the relevant sections of the URL Standard, along with another representation of the syntax in EBNF notation, which is the code used to generate the diagram.

As above, the htm file is saved as a Gist, and I wish I knew a way to post it so it would load directly in your browser, but I don't.

From my previous post, the table of element conditions might still be useful. I'd say disregard all the rest.

@TimothyGu
Member

TimothyGu commented Sep 15, 2018

See #24 on some previous work done on creating a formal grammar for URLs, perhaps displayed through railroad diagrams (see http://intertwingly.net/stories/2014/10/20/Url.xhtml). In my opinion, RR diagrams and formal grammar solve a different problem, and a version of what I had should be enough just for a simple overview of URLs, which is what this bug is all about.

@EnnexMB

EnnexMB commented Sep 15, 2018

The RR diagrams you linked to are very complex and, as you say, solve a different problem than we are discussing here. The RR diagram I posted is very simple and contains the same information as in your table plus information on optionality of elements. Do you have an idea of how to convey that optionality information in your table? That was what I was getting at with the graphical representation, but I think the RR diagram does it much better.

Whether we use the RR diagram or a version of that table or something else, I would like to suggest that this issue be brought to a conclusion by posting something in the standard to give readers an easy way to understand the syntax of URLs.

@annevk
Member

annevk commented Sep 19, 2018

The railroad diagrams were intended to replace section 4.3 (writing) and 4.4 (parser). Some problems were:

  1. They were not strictly identical.
  2. They were not necessarily easier to understand.
  3. Railroad diagrams as a concept were not formally defined.

If you don't plan to replace those sections and only offer them as non-normative guidance, then 2 and 3 go away.

@EnnexMB

EnnexMB commented Sep 19, 2018

@rubys, would you like to work with me (or without me) on this? You obviously have far more experience and expertise on it and have already done a lot of the work. I wouldn't want to reinvent your wheel. It sounds like there's interest in using something now if it's either perfect or non-normative.

@rubys
Member

rubys commented Sep 19, 2018

@EnnexMB, I'm willing to help, but there seems to be some confusion. For example, I don't believe that the railroad diagrams were ever meant to replace any existing sections, and if I ever gave that impression, I apologize. Nor do I believe that they were meant to be normative (my memory is fuzzy on this point; perhaps they were initially proposed as such, but if so, we quickly determined that they were best non-normative).

Beyond that, there is an even bigger disconnect. To illustrate, look at the original table and note that it uses the word protocol. Now look at recent work, and see the word scheme. I think the biggest problem here is determining who the target audience is for this change. I gather that the original request was focused on users of the API.

I guess what I am getting at is that there may be multiple issues here, and they aren't mutually exclusive. It may be worthwhile adding multiple graphics to different sections.

Finally, yes, I'm willing to help. If you have something you would like to see in the document and can show it displaying in a web page, I can review it and do the command line magic to make a pull request for you. If what you produce addresses this issue, that's great. But if not, that's not a problem either.

@EnnexMB

EnnexMB commented Sep 20, 2018

Hi @rubys, I'm glad you're willing to put those misunderstandings behind us and move forward with this.

We do need to figure out the matter of terminology you mentioned. Let me ask this question. I see two possibilities:

  • The two sets of terminology are equivalent, i.e., they form a set of synonym pairs used in two different technology domains.
  • There is a meaningful difference between the two sets, so that, for example, there is a (perhaps subtle) difference between the meanings of protocol and scheme, between query and search, and between fragment and hash.

If the first case is true, then perhaps each box of the diagrams could include both terms, i.e., it would be bilingual.
If the second case is true, perhaps this brings up your suggestion of different graphics in different sections. But would it also be possible to have diagrams that express the relationship between the two terminologies, so that people could understand that relationship instead of looking at them in isolation from each other?

Regarding a web page that displays a candidate of something to go in the Standard, I'd like to suggest that we're talking about something on the spectrum between my diagram and your diagrams. One problem with my diagram is that it doesn't include the case of relative URLs. But what I like about my diagram is that it summarizes the whole sequence of absolute URLs in one diagram (albeit with a line break). Your diagrams go into much more detail and therefore cover the content of my diagram in at least four different diagrams, as listed above. It seems that both approaches are worthwhile for seeing both the forest and the trees.

In addition to @rubys, it would be helpful if @annevk and anyone else chimes in if you ever feel that we're going off in a direction that's not going to work. It would be unpleasant for us to develop something, only to be told later it's not suitable.

@EnnexMB

EnnexMB commented Sep 21, 2018

I have added syntax diagrams to the Wikipedia pages on URNs and URIs. Those diagrams are generated directly from the syntax code posted on those pages (which was there before). The portion of the URI article that includes that drawing is transcluded (automatically copied) to the page on URLs. That means that other people at Wikipedia have decided that the syntax of URIs and URLs are the same. I don't know if that's correct or not.

There is a contradiction between the URI/URL diagram and the one I originally proposed above. The diagram above shows the path as optional, but the syntax in Wikipedia (based on RFC 3986) shows it as required. So I suppose this is another error in that original diagram.

If either of those diagrams posted in Wikipedia is incorrect, or if it is incorrect that URI and URL syntax are the same thing, then either feel free to edit the Wikipedia pages or let me know what the problems are and I'll get them corrected.

The URI/URL syntax drawing does not have the level of detail in my original diagram or in @rubys's diagrams. I won't enhance the diagrams in Wikipedia until a new diagram or diagrams have been vetted and approved here.
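
One possible reading of the "optional vs. required" path question, sketched with the WHATWG `URL` class (an assumption; the thread does not settle it this way): a parsed URL always has a path, but the path may be empty. The hostnames and the `foo` scheme below are made up.

```typescript
// Special scheme: a missing path is normalized to a single empty segment.
const specialNoPath = new URL("https://example.com");
console.log(specialNoPath.pathname); // "/"
console.log(specialNoPath.href);     // "https://example.com/"

// Non-special scheme (hypothetical "foo"): the path stays empty.
const nonSpecialNoPath = new URL("foo://example.com");
console.log(nonSpecialNoPath.pathname); // ""
console.log(nonSpecialNoPath.href);     // "foo://example.com"
```

Under that reading, a diagram can show the path box as always present while still allowing it to match the empty string.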

@EnnexMB

EnnexMB commented Oct 5, 2018

There's been no response on either of my last two posts for two weeks. I don't know if this is because my questions were deemed too dumb to comment on or too difficult to answer.

I do think @rubys was right when he said that the conflict in terminology is an important place to start. But whereas he suggested choosing one form of terminology based on who the audience is, I'm suggesting sorting out and resolving the conflict so that all audiences can talk with each other and be understood. Can we do that? If we can, then we can proceed to make up a useful and correct (albeit nonnormative) illustration of the syntax.

Regarding URIs and URLs, there is some disagreement in the world about whether they are synonymous or not. It would seem that the folks who set the standard for URLs would be a good authority for establishing the correct answer to that. And when we have that answer, we'll know whether the illustration of the URL syntax also applies to URIs synonymously or needs to be adjusted to apply to URIs.

Are we going to move forward to get an illustration of the syntax done?

@annevk
Member

annevk commented Oct 8, 2018

It's not really clear to me what questions you have, I only count one question mark in the preceding two posts. Here's my view on the terminology:

  1. For the URL model we continue to use scheme et al. as these are more appropriate and less misleading than protocol et al. For the URL API we continue to use the latter as changing the API breaks compatibility and introducing new APIs solely to fix the terminology is not worth it. We could align the model with the API, but given the many non-API consumers of the model I don't think that would be fair.
  2. The point of view of the URL Standard is that URIs and IRIs no longer exist (subsumed by URLs). And that URNs are URLs with the urn: scheme.
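
As a small illustration of point 2, assuming a runtime with the WHATWG `URL` class: a URN goes through the same parser as any other URL, and everything after `urn:` ends up in the path. The ISBN value is only an example.

```typescript
// A URN is parsed as a URL whose scheme is "urn".
const urnUrl = new URL("urn:isbn:0451450523");

console.log(urnUrl.protocol); // "urn:"
console.log(urnUrl.host);     // "" (no authority component)
console.log(urnUrl.pathname); // "isbn:0451450523"
console.log(urnUrl.origin);   // "null" (opaque origin for a non-special scheme)
```

The URL parser does not split the path any further, so URN-specific structure (namespace identifier vs. namespace-specific string) is outside what the API exposes.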

@EnnexMB

EnnexMB commented Oct 8, 2018

Thanks @annevk.

In your answer 1, it sounds like you're saying that the terminologies are synonymous. Therefore, the diagram I first posted above can be made bilingual by inserting the corresponding API terms in brackets below the non-API terms where they are different, as follows:
[image: url syntax with api terms]
This diagram has the problems already discussed above and is only used here to show the presentation of API terminology alongside the non-API terminology where they are different. I have taken the API terms from @TimothyGu's original postings (first and second) in this issue.

@rubys, does this resolve your concern about the terminology? Can we move forward with developing the correct syntax diagrams in this way?

@annevk, in your answer 2, is the view of the URL Standard authoritative, or is there some competing body that could disagree with you? If this view is authoritative, then I could propose that Wikipedia state that the term "URI" is deprecated and when we finish the syntax diagram here, that should be posted on Wikipedia in the URL article with reference to the new diagram in the URL Standard.

@annevk
Member

annevk commented Oct 8, 2018

The IETF would likely disagree.

@EnnexMB

EnnexMB commented Oct 8, 2018

Okay, thank you.
What do you think of the bilingual syntax diagram?

@annevk
Member

annevk commented Oct 8, 2018

Sorry, I think it would be nicer to always list the second term, even if it's identical, and link it to its definition.

@EnnexMB

EnnexMB commented Oct 8, 2018

Yeah, listing both names in all boxes could make it clearer, even if repetitive.
And yes, the intention would be to link the boxes to the definitions. Is it sufficient to link to the definitions in section 4.1, for example for scheme? I don't see analogous definitions in the Standard for the API terms.

@domenic
Member Author

domenic commented Oct 8, 2018

Note that they don't really match. E.g. if the scheme is "https", then protocol is "https:". Similarly query/search and fragment/hash.
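
The mismatch can be sketched as follows, assuming a runtime with the WHATWG `URL` class. `toRecordTerms` is a hypothetical helper written for this illustration; it only strips the delimiters that the API getters add.

```typescript
// Hypothetical helper: derive URL record terms from the API getters.
function toRecordTerms(url: URL): { scheme: string; query: string; fragment: string } {
  return {
    scheme: url.protocol.slice(0, -1), // protocol "https:" -> scheme "https"
    query: url.search.slice(1),        // search "?a=1"     -> query "a=1"
    fragment: url.hash.slice(1),       // hash "#top"       -> fragment "top"
  };
}

const recordTerms = toRecordTerms(new URL("https://example.com/path?a=1#top"));
console.log(recordTerms); // { scheme: "https", query: "a=1", fragment: "top" }
```

The mapping is lossy in one direction: the `search` and `hash` getters return the empty string both when the component is absent and when it is present but empty, so this helper cannot tell those two cases apart.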

@EnnexMB

EnnexMB commented Oct 8, 2018

Okay, thank you @domenic. This is the question I was originally asking. If they don't match exactly, then they are not synonymous and the bilingual diagram above is not appropriate. In that case, either:

  • The syntax diagram can cover one terminology or the other (non-API or API), or
  • It could incorporate the differences, or
  • There could be two separate diagrams for the different terminologies.

I think the second one (incorporate the differences) would be best if it can be done reasonably well, as it would help people understand the relationship between the terminologies.

So, is there a place that lays out the exact relationship between the terminologies, i.e., the differences that you are referring to?

@domenic
Member Author

domenic commented Oct 8, 2018

The getter algorithms in https://url.spec.whatwg.org/#dom-url-href

@EnnexMB

EnnexMB commented Oct 8, 2018

Okay, could you help by providing a translation of those algorithms to a set of correspondence rules, like the one you stated above, that scheme https = protocol https:.

I wonder how that rule applies. The scheme is always followed by ":", so from that rule, it looks like the definition of protocol just includes the ":" instead of appending it. If that's correct, that's fine; we do need to represent such relationships correctly. So can you provide the set of those rules to work with?

@annevk
Member

annevk commented Oct 9, 2018

The set of rules is described by the algorithms, no?

@dwsinger

dwsinger commented Sep 9, 2020

It would be good to name and identify both the domain-names in the host-name (separated by dots) and the path-components (aka 'folder names') in the path (separated by slashes).

@annevk
Member

annevk commented Sep 10, 2020

Thanks @dwsinger. We already have #435 for formalizing domain labels. And we should probably formalize "path segment" for the latter, which we already use in URL writing, but not in URL representation (there it's just an ASCII string without a formal name).
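
A rough approximation of those two sub-structures using only the API getters, assuming a runtime with the WHATWG `URL` class. The example URL is made up, and splitting `hostname` on dots is only meaningful when the host is a domain rather than an IP address.

```typescript
// Hypothetical example URL with a three-label domain and a three-segment path.
const locatedUrl = new URL("https://www.example.com/docs/guide/intro.html");

// Domain labels: the dot-separated parts of the host.
const domainLabels = locatedUrl.hostname.split(".");
console.log(domainLabels); // ["www", "example", "com"]

// Path segments: the slash-separated parts of the path.
// The pathname starts with "/", so the first split result is dropped.
const pathSegments = locatedUrl.pathname.split("/").slice(1);
console.log(pathSegments); // ["docs", "guide", "intro.html"]
```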

annevk added a commit that referenced this issue Jan 16, 2023
As suggested in #337 by David Singer.

This also formalizes single-dot and double-dot URL path segments as proper concepts and allows them to be part of the data structure rather than writing section, which is much more sound.
annevk added a commit that referenced this issue Jan 17, 2023
As suggested in #337 by David Singer.

This also formalizes single-dot and double-dot URL path segments as proper concepts and allows them to be part of the data structure rather than writing section, which is much more sound.
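
For context on single-dot and double-dot path segments, a small sketch of how the parser treats them, assuming a runtime with the WHATWG `URL` class. It shows only the observable API behavior, not the spec change made in the commits above.

```typescript
// "." is dropped and ".." removes the preceding segment during parsing.
const dottedUrl = new URL("https://example.com/a/./b/../c");

console.log(dottedUrl.pathname); // "/a/c"
console.log(dottedUrl.href);     // "https://example.com/a/c"
```
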