Proposal: Allow application to add custom syntax constructions #245

mity · 2024-02-21T10:17:35Z

mity
Feb 21, 2024
Maintainer

I'm now working on a project which will likely need some superset of Markdown syntax. As some of the syntax is too application-specific, I'm not sure whether to maintain such customization in a fork or whether to (substantially) expand the API of vanilla MD4C.

If done, it should likely cover the following use cases:

1. Adding simple new emphasis similar to the MD_FLAG_STRIKETHROUGH extension. Usable e.g. for superscript (e = mc^2^) or highlighting (==highlight me==) etc. (With this we wouldn't need to add highly app-specific stuff like a LaTeX math equation extension to the vanilla code.)
2. Adding simple token-like spans similar to how entities are handled. Usable e.g. for GitHub-like issue auto-linking (#123) or user mentions (@johndoe).
3. Adding custom "inline" commands, optionally with some argument (consider e.g. Doxygen's @c command).
4. Adding custom "block" commands, optionally with some argument (consider e.g. Doxygen's @param command).
5. Adding custom fenced blocks, possibly even container blocks (consider e.g. Doxygen's @{ and @} commands).

Such an expansion would make the API significantly more complex. Even though I believe it's possible without breaking backward ABI compatibility, it would likely mean introducing a new (extended) MD_PARSER structure, identified by its abi_version.

We would likely need a new (expanded) version of the MD_PARSER structure (MD_PARSER_v2?) with the following changes/additions:
- MD_PARSER_v2::abi_version would expect a new value (2).
- Callback declarations would have to use int instead of enumerations for the block/span/text IDs so that the application can add new ones at run time.
- New members would allow the application to provide structures (possibly with some new callbacks) describing new syntax additions.
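To make the second point concrete, here is a minimal sketch of how application-defined span IDs could sit next to built-in ones once callbacks take a plain int. Everything in it is an assumption for illustration: the BUILTIN_* stand-ins, the APP_SPAN_* names, the base value 0x1000, and the HTML tags chosen are not part of MD4C.

```c
#include <stdio.h>

/* Stand-ins for two built-in span IDs; the real values live in md4c.h. */
enum { BUILTIN_SPAN_EM = 0, BUILTIN_SPAN_STRONG = 1 };

/* Hypothetical: application IDs start well above any built-in value. */
#define APP_SPAN_BASE       0x1000
#define APP_SPAN_HIGHLIGHT  (APP_SPAN_BASE + 0)
#define APP_SPAN_SUPER      (APP_SPAN_BASE + 1)

/* Maps both built-in and application span IDs to an opening tag. */
static const char* span_open_tag(int span_type)
{
    switch (span_type) {
        case BUILTIN_SPAN_EM:     return "<em>";
        case BUILTIN_SPAN_STRONG: return "<strong>";
        case APP_SPAN_HIGHLIGHT:  return "<mark>";
        case APP_SPAN_SUPER:      return "<sup>";
        default:                  return "";   /* unknown ID: emit nothing */
    }
}

/* Callback shaped like enter_span(), but taking int instead of an enum type. */
static int enter_span(int span_type, void* detail, void* userdata)
{
    (void) detail;
    fputs(span_open_tag(span_type), (FILE*) userdata);
    return 0;
}
```

With an enum parameter the application would have to cast values outside the enumeration; an int parameter lets one switch statement cover both ranges.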

My current (incomplete) idea is something like this:

typedef struct MD_PARSER_v2 {
    unsigned abi_version;  /* set to 2 */

    ...  /* Copy of other current (V1) members */

    /* Application provided simple custom blocks and/or spans. */
    void** custom;
    unsigned n_custom;
} MD_PARSER_v2;

The individual members of custom[] would be pointers to helper structures describing how the given extension is to be parsed. For custom emphasis-like spans and token-like spans it might perhaps be something like this:

typedef enum MD_CUSTOM_TYPE {
    MD_CUSTOM_SPAN,     /* Standard span like emphasis with opener and closer mark. */
    MD_CUSTOM_TOKEN,    /* Simple token-like span (e.g. for implementing user mentions). */
    ...
} MD_CUSTOM_TYPE;


typedef struct MD_CUSTOM_SPAN {
    MD_CUSTOM_TYPE custom_type;     /* Set to MD_CUSTOM_SPAN */
    int span_type;                  /* Span type propagated into enter_span() and leave_span(). */
    int text_type;                  /* Text type propagated into text() callback. */
    unsigned flags;
    MD_CHAR opener_mark;
    MD_CHAR closer_mark;
    MD_SIZE min_mark_len;
    MD_SIZE max_mark_len;
} MD_CUSTOM_SPAN;


/* Flags for MD_CUSTOM_TOKEN::flags */
#define MD_CUSTOM_TOKEN_FLAG_KEEPOPENER      0x0001  /* Keep opener mark in the text flow. */

typedef struct MD_CUSTOM_TOKEN {
    MD_CUSTOM_TYPE custom_type;     /* Set to MD_CUSTOM_TOKEN */
    int span_type;
    int text_type;
    unsigned flags;
    MD_CHAR opener_mark;
    MD_SIZE max_token_len;

    /* Optional callback to verify/validate whether the contents/argument of
     * the token is valid or not. Should return non-zero if valid, zero if
     * invalid. */
    int (*validate_token)(int /*span_type*/, const MD_CHAR* /*token*/, MD_SIZE /*size*/, void* /*userdata*/);
} MD_CUSTOM_TOKEN;
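As a rough usage sketch, an application might register a ==highlight== span and a #123 issue token as shown below. The structures are copied from the proposal above, with two assumptions: MD_CHAR and MD_SIZE are replaced by local stand-ins so the snippet compiles without md4c.h, and the enumerators are renamed to MD_CUSTOM_TYPE_SPAN and MD_CUSTOM_TYPE_TOKEN, because in C an enumerator and a typedef name cannot share the same identifier. The APP_* IDs, make_parser() and custom_kind() are hypothetical helpers.

```c
#include <stddef.h>

typedef char MD_CHAR;        /* stand-in for the md4c.h type */
typedef unsigned MD_SIZE;    /* stand-in for the md4c.h type */

typedef enum MD_CUSTOM_TYPE {
    MD_CUSTOM_TYPE_SPAN,     /* renamed to avoid clashing with the struct typedefs */
    MD_CUSTOM_TYPE_TOKEN
} MD_CUSTOM_TYPE;

typedef struct MD_CUSTOM_SPAN {
    MD_CUSTOM_TYPE custom_type;
    int span_type;
    int text_type;
    unsigned flags;
    MD_CHAR opener_mark;
    MD_CHAR closer_mark;
    MD_SIZE min_mark_len;
    MD_SIZE max_mark_len;
} MD_CUSTOM_SPAN;

#define MD_CUSTOM_TOKEN_FLAG_KEEPOPENER  0x0001

typedef struct MD_CUSTOM_TOKEN {
    MD_CUSTOM_TYPE custom_type;
    int span_type;
    int text_type;
    unsigned flags;
    MD_CHAR opener_mark;
    MD_SIZE max_token_len;
    int (*validate_token)(int, const MD_CHAR*, MD_SIZE, void*);
} MD_CUSTOM_TOKEN;

typedef struct MD_PARSER_v2 {
    unsigned abi_version;    /* V1 members omitted in this sketch */
    void** custom;
    unsigned n_custom;
} MD_PARSER_v2;

/* Hypothetical application-defined IDs. */
#define APP_SPAN_HIGHLIGHT  0x1000
#define APP_SPAN_ISSUE      0x1001

/* ==highlight me==: exactly two '=' marks on each side; text_type 0 is a placeholder. */
static MD_CUSTOM_SPAN highlight = {
    MD_CUSTOM_TYPE_SPAN, APP_SPAN_HIGHLIGHT, 0, 0, '=', '=', 2, 2
};

/* #123: keep the '#' in the text flow, no validation callback (it is optional). */
static MD_CUSTOM_TOKEN issue_ref = {
    MD_CUSTOM_TYPE_TOKEN, APP_SPAN_ISSUE, 0, MD_CUSTOM_TOKEN_FLAG_KEEPOPENER, '#', 10, NULL
};

static void* app_custom[] = { &highlight, &issue_ref };

static MD_PARSER_v2 make_parser(void)
{
    MD_PARSER_v2 p = { 2, app_custom, sizeof app_custom / sizeof app_custom[0] };
    return p;
}

/* custom_type is the first member of every descriptor, so a parser could
 * read it through the void* before deciding which structure it points to. */
static MD_CUSTOM_TYPE custom_kind(const void* desc)
{
    return *(const MD_CUSTOM_TYPE*) desc;
}
```

Keeping custom_type as the first member of each descriptor is what makes the void** array workable: the parser can dispatch on it without knowing the concrete structure in advance.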

So my questions for potential users of such API are:

Does this approach make sense to you?
Does it cover your potential use cases?
Is it flexible enough for future expansions?

step- · 2024-02-21T13:54:45Z

step-
Feb 21, 2024

Thank you for starting this conversation. When I switched markdown parser to MD4C for my application I had to drop a feature that the previous parser made possible: localization of input markdown. It is natural for me to think about that use case in relation to MD4C custom syntax constructions.

My application, based on the previous parser, supports:

A --po command option to output the input markdown formatted as a GNU Gettext .pot file. Each msgid in the .pot file corresponds to an input markdown leaf block. The msgid consists of the pure text in the block but span markdown is intact. For instance, input line * **bold** list item yields msgid **bold** list item.
A %%textdomain DOMAIN line directive to enable translation. When enabled, the parser will parse each leaf block, look up its translation in DOMAIN using a gettext library call, and parse the translated block instead of the original one as markdown.

I leave out further details. It seems to me that for an MD4C extension to be able to support the behavior I described, there should be a way to stop parsing at the leaf block level. Perhaps that is what item 4 in your list (custom "block" command) means? What is the "block"?

Suppose a custom "block" extension can return the input text of each block, could my application change that text and re-enter the parser with the modified markdown at exactly the same document context of the original block? The validate_token callback in your list wouldn't be enough. A new replace_token callback would be needed, I think.


mity · 2024-02-21T14:50:49Z

mity
Feb 21, 2024
Maintainer Author

What is the "block"?

Block as understood by the CommonMark specification

Suppose a custom "block" extension can return the input text of each block, could my application change that text and re-enter the parser with the modified markdown at exactly the same document context of the original block?

At the moment, no. And I'm not sure at all whether it's something I want to do: MD4C is not a DOM parser; it never even constructs complete block contents in a single uninterrupted buffer internally. I'm also not very keen on any replacement features: IMHO, MD4C is and should be "just" a parser.

I might consider providing access to "raw block contents" in the form of an array of pointers and/or offsets which together make up a (leaf) block's contents (this would roughly correspond to an array of MD_LINE in the MD4C source, see e.g. the interface of md_process_normal_block_contents()), but the rest would be up to your application and its callback implementation.
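To illustrate what the application's side of that could look like, here is a small sketch that reassembles a leaf block's text from an array of line ranges. The APP_LINE structure (half-open [beg, end) offsets into the document) and join_block_lines() are assumptions made for this example; they only mimic the idea of an array of line descriptors and are not MD4C API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical line descriptor: offsets [beg, end) into the input document. */
typedef struct APP_LINE {
    size_t beg;
    size_t end;
} APP_LINE;

/* Joins the lines of one leaf block with '\n' into out.
 * Returns the number of bytes required (excluding the terminating NUL).
 * Nothing is written if out is NULL or out_size is too small, so the
 * function can be called once to measure and once to copy. */
static size_t join_block_lines(const char* doc, const APP_LINE* lines,
                               size_t n_lines, char* out, size_t out_size)
{
    size_t needed = 0;
    size_t i;
    char* p;

    for (i = 0; i < n_lines; i++) {
        needed += lines[i].end - lines[i].beg;
        if (i + 1 < n_lines)
            needed += 1;            /* separator between lines */
    }

    if (out == NULL || out_size < needed + 1)
        return needed;

    p = out;
    for (i = 0; i < n_lines; i++) {
        size_t len = lines[i].end - lines[i].beg;
        memcpy(p, doc + lines[i].beg, len);
        p += len;
        if (i + 1 < n_lines)
            *p++ = '\n';
    }
    *p = '\0';
    return needed;
}
```

For a paragraph inside a block quote such as "> foo\n> bar\n", the two ranges would skip the "> " prefixes, which shows why the block contents never exist as one contiguous buffer in the input.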

A new replace_token callback would be needed, I think.

Ugh, the validate_token() was meant only for "token spans" (which I agree may be a misleading name).
By that I meant spans-which-are-not-actually-spans, i.e. inline elements which have no real arbitrary contents apart from a single word/argument/token (not sure what is best to call it to avoid confusion).

An example of this could be a custom feature similar to what GitHub does when auto-linking issues (e.g. #123): validate_token() would then be able to return false if the argument (here 123) is not a number.

Similarly, for the suggested example of user mentions, it could verify that it's a valid/known user name, or that it at least follows some (application-specific) idea of what usernames may look like.
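A minimal sketch of such a callback, using the validate_token signature from the proposal, might look like this. The APP_SPAN_* IDs and the username rule (alphanumerics and hyphens, not starting with a hyphen) are assumptions chosen for the example; MD_CHAR and MD_SIZE are local stand-ins for the md4c.h types.

```c
#include <ctype.h>
#include <stddef.h>

typedef char MD_CHAR;        /* stand-in for the md4c.h type */
typedef unsigned MD_SIZE;    /* stand-in for the md4c.h type */

/* Hypothetical application-defined span IDs. */
#define APP_SPAN_ISSUE    0x1001
#define APP_SPAN_MENTION  0x1002

/* Returns non-zero if the token is valid for the given span type, zero otherwise. */
static int validate_token(int span_type, const MD_CHAR* token, MD_SIZE size, void* userdata)
{
    MD_SIZE i;
    (void) userdata;

    if (size == 0)
        return 0;

    switch (span_type) {
        case APP_SPAN_ISSUE:
            /* "#123": the argument must consist of digits only. */
            for (i = 0; i < size; i++) {
                if (!isdigit((unsigned char) token[i]))
                    return 0;
            }
            return 1;

        case APP_SPAN_MENTION:
            /* "@johndoe": alphanumerics or '-', not starting with '-'. */
            if (token[0] == '-')
                return 0;
            for (i = 0; i < size; i++) {
                if (!isalnum((unsigned char) token[i]) && token[i] != '-')
                    return 0;
            }
            return 1;

        default:
            return 0;
    }
}
```

A real application could use userdata to look the name up in its own user database instead of only checking its shape.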

1 reply

step- Feb 22, 2024

Block as understood by the CommonMark specification

Yes, so the "leaf blocks" in my description above correspond to section 4.2; and the application processes raw contents of such leaf blocks.

Suppose a custom "block" extension can return the input text of each block, could my application change that text and re-enter the parser with the modified markdown at exactly the same document context of the original block?

MD4C is not a DOM parser; it never even constructs complete block contents in a single uninterrupted buffer internally. I'm also not very keen on any replacement features: IMHO, MD4C is and should be "just" a parser.

I understand it isn't a DOM parser. I prefer it that way too. Since running the MD4C parser is fast and inexpensive, the application could implement replacement features entirely by calling the parser twice: the first time getting raw (leaf) block contents, replacing content and pasting a new document in memory; the second time passing the new document to the parser for final rendering.
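The step between the two parser runs could be sketched as follows: given the byte ranges of leaf blocks collected in the first pass and their replacement texts, build the new in-memory document that the second pass would render. APP_REPLACEMENT and splice_document() are hypothetical names, and the sketch assumes the ranges are sorted by offset and do not overlap.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: replace document bytes [beg, end) with text. */
typedef struct APP_REPLACEMENT {
    size_t beg;
    size_t end;
    const char* text;
} APP_REPLACEMENT;

/* Builds a new NUL-terminated document with all replacements applied.
 * reps must be sorted by beg and must not overlap.
 * Returns a malloc'd buffer (caller frees) or NULL on allocation failure. */
static char* splice_document(const char* doc, size_t doc_len,
                             const APP_REPLACEMENT* reps, size_t n_reps)
{
    size_t new_len = doc_len;
    size_t pos = 0;
    size_t i;
    char* out;
    char* p;

    for (i = 0; i < n_reps; i++)
        new_len = new_len - (reps[i].end - reps[i].beg) + strlen(reps[i].text);

    out = malloc(new_len + 1);
    if (out == NULL)
        return NULL;

    p = out;
    for (i = 0; i < n_reps; i++) {
        size_t text_len = strlen(reps[i].text);

        /* Copy the untouched part before this block, then the replacement. */
        memcpy(p, doc + pos, reps[i].beg - pos);
        p += reps[i].beg - pos;
        memcpy(p, reps[i].text, text_len);
        p += text_len;
        pos = reps[i].end;
    }

    /* Copy whatever follows the last replaced block. */
    memcpy(p, doc + pos, doc_len - pos);
    p += doc_len - pos;
    *p = '\0';
    return out;
}
```

Only the block contents are replaced, so container prefixes such as the list marker in "* **bold** list item" stay in place and the replacement lands in the same document context.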

I might consider providing access to "raw block contents" in the form of an array of pointers and/or offsets which together make up a (leaf) block's contents (this would roughly correspond to an array of MD_LINE in the MD4C source, see e.g. the interface of md_process_normal_block_contents()), but the rest would be up to your application and its callback implementation.

See my comment above. I assume you would provide access to all current contents that are represented in the md_process_*_block_contents() group of functions.

Ugh, the validate_token() was meant only for "token spans" (which I agree may be a misleading name). By that I meant spans-which-are-not-actually-spans, i.e. inline elements which have no real arbitrary contents apart from a single word/argument/token (not sure what is best to call it to avoid confusion).

Maybe "extension token"?

I see your examples, and validate_token as such sounds necessary. I understand it in the following way. The application declares that "#" is an extension token. Then, whenever the parser finds a non-otherwise-interpreted "#" input character, it calls on the application to validate the text span (offset, length) immediately following "#". The application could reply either INVALID or VALID and the actual span length. What would happen if a token is invalid? Would the parser fail altogether or render the "span" as verbatim text?
