Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discuss: a new specifications #36

Open
PoiScript opened this issue May 16, 2020 · 4 comments
Open

discuss: a new specifications #36

PoiScript opened this issue May 16, 2020 · 4 comments

Comments

@PoiScript
Copy link
Owner

A correct implementation requires a precise specifications. However, neither org syntax draft nor org elements api can really serves as a good specifications: Org syntax doesn't specify syntax unambiguously, meanwhile, org-elements-api is quite buggy and provides some apparent wrong results in some cases (e.g. #22, #19 and #14).

To resolve that, I think the ultimate solution is to maintain a specification by our own. To be clear, it's not going to create a subset of org markup language but just describes the expected results you will get when using orgize.

Fortunately, we don't need to start from scratch. We can just make a copy of the original org syntax and adapt it for our needs (like defining required and optional fields). Then, I would like to borrow some concepts from CommonMark Spec, especially the part of handling whitespace and list indenting. In short, our new specification will be majorly based on org syntax and with some modifications and additions.

Feel free to leave any comments below if you have any idea about the new upcoming specification =).


As a reminder, here're some issues/commits that need to be reopened/reverted after applying the new specification.

#17: should be reopened. Org syntax clarified that title should be matched after other part have been matched.

ba9c83c: should reverted. We should only handle ascii whitespace. I think it is totally unnecessary to take care of unicode whitespace if we don't need to be compatible with org-elements-api.

#26 #34: should be closed and moved to a new issue. Org syntax specifies that headlines' stars must be followed by at least one space character. But I think tab is also acceptable (just like CommonMark does and I will include it in the new specifications latter). Hence, only *** , *** \n or ***\t\t should be valid, but not ***, ****\r or ***\n.

#33: should be closed.

@PoiScript
Copy link
Owner Author

I had created a SYNTAX.md at docs/. Sorry for not using org-mode, but github can't render org-mode file properly :(.

@calmofthestorm
Copy link
Contributor

calmofthestorm commented May 18, 2020

I like this idea. Using org-elements as a guide is good, but it has too many bugs to be a spec. I haven't looked into it, but maybe the org syntax draft at orgmode.org is open to contributions? If so, and they had a common understanding of what the source of truth is (likely org-mode itself), perhaps we could upstream some/all of these edge cases.

  • I agree on Cannot parse empty headlines with tags #17.

  • Whitespace
    Regarding whitespace, org-mode itself seems to use different definitions in different places. A headline must always be stars followed by a literal space -- the spec, org mode's behavior, and org-element all reject \t and Unicode whitespace there. There can be any number of literal spaces before a keyword (like TODO), but no tabs or Unicode whitespace.
    My preference would be to match org-mode as closely as possible even if it means strange or confusing semantics, and I can volunteer to provide a PR to the spec detailing the behavior for headline parsing. It would be easy enough to define several different combinators like space0 that define the different kinds of whitespace.

  • \r\n prevents further parsing; potential for substantial data loss #26 and Fix headline parsing leading to dropped lines. #34
    org-element and org-mode both reject *\t as a headline. The spec says "a space character", one of only two places it says that instead of "whitespace character". I think we should accept only a space character there.
    Either way, the PR can be simplified if it doesn't need to handle * -- the complexity in that PR comes from needing to special case EOF with respect to newlines.

  • Fix a few edge cases with priority parsing. #33
    This bears closer examination. org-mode itself will parse a priority cookie anywhere in the line. * Hello[#A]world will be treated as having priority A, at least for what I tested: org-priority-up and org-agenda. I don't think it makes sense for a parser to be that permissive, but I do think that it is reasonable to accept * [#A]Hello World, since both org-element and org-mode do.

@PoiScript
Copy link
Owner Author

I haven't looked into it, but maybe the org syntax draft at orgmode.org is open to contributions?

Neither have I. But the original version of org syntax actually contains some implementation information about elisp, which is irrelevant to orgize. So I wish to have a new specification without anything related to eslip.

Regarding whitespace, org-mode itself seems to use different definitions in different places.

Yes, this is the problem that the new specification is going to resolve. Basically, I want to make clear definitions of space, tab, line ending, line and blank line which can be used throughout the whole specification. Here is an initial version of definitions (majorly stole from CommonMark):

A space is U+0020.

A tab is U+0009, it's treated as 4 spaces when it's used to define block structure.

A line is a sequence of zero or more characters other than newline (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file.

A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.

A blank line is a line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009).

org-element and org-mode both reject *\t as a headline. The spec says "a space character", one of only two places it says that instead of "whitespace character". I think we should accept only a space character there.

I don't think it makes sense for a parser to be that permissive, but I do think that it is reasonable to accept * [#A]Hello World, since both org-element and org-mode do.

I can see that you're trying to behave the same way as org-mode or org-element. But I don't think it's a good idea to introduce too much special cases. Just like the definitions I mentioned above, if a headlines' stars should be followed by a space, I would expect it also accepts a tab. Same with todo keyword, priority and tags. After all, I believe the specification should be able to enforce the consistent behave of parsing without bringing too much special cases.

@calmofthestorm
Copy link
Contributor

That seems like a perfectly reasonable way to define it, thank you for the explanation. I care more about clear, consistent, documented behavior than what it is, and this specification will deliver that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants