Full User Manual
mdq's main feature is the selector string, which is a series of selectors delimited by pipes (`|`). Each selector operates on a stream of elements, with the initial stream being a single-element stream that contains the entire input document.
Selectors consist of a selector type and one or more string matchers. Selectors are designed to mirror the Markdown they select.
The selector types are as follows (more details on each immediately below):
| Syntax | Selects | String matcher matches... |
|---|---|---|
| `# matcher` | sections | the section's header title |
| `- matcher` | unordered list items | text in the list item |
| `- [?] matcher` (see below) | unordered task items | text in the list item |
| `1. matcher` | ordered list items | text in the list item |
| `1. [?] matcher` (see below) | ordered task items | text in the list item |
| `[matcher](matcher)` | links | two matchers: the first matches the link text, the second matches the URL |
| `![matcher](matcher)` | images | two matchers: the first matches the image alt text, the second matches the URL |
| `> matcher` | block quotes | block quote contents |
| `` ```matcher matcher `` | code blocks | two matchers: the first matches the language, the second matches the code itself |
| `</> matcher` | HTML tags | HTML tag contents, including the opening and closing angle brackets (`<` and `>`) |
| `P: matcher` | paragraphs | paragraph text |
| `:-: matcher :-: matcher` | tables | two matchers: the first matches the columns by header, the second matches rows |
String matchers can be:
- bareword strings (case insensitive)
- quoted strings (case sensitive)
- regular expressions
- empty, which means "any"
- `*`, which also means "any" (if you prefer to write that out explicitly)
See below for more details.
Taking the example above:
$ cat example.md | mdq '# usage | -'
This specifies two selectors:
- `# usage`, which is a section selector with `matcher=usage`; this looks for any section whose title contains "usage"
- `-`, which is an unordered list selector with an empty matcher; this looks for any list items
Piping these together yields: "any list item within a section whose title contains 'usage'."
The selector string is always UTF-8.
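To make the stream model concrete, here is a minimal Python sketch of how piped selectors narrow a stream. It is a toy model, not mdq's implementation: the element dictionaries, the `descendants` helper, and the `select` function are all hypothetical names invented for this illustration.

```python
# Toy model of a selector pipeline. The element shape ("kind", "text",
# "children") is an assumption made for illustration only.
doc = {
    "kind": "document",
    "text": "",
    "children": [
        {"kind": "section", "text": "Usage", "children": [
            {"kind": "list_item", "text": "run mdq", "children": []},
        ]},
        {"kind": "section", "text": "License", "children": [
            {"kind": "list_item", "text": "MIT", "children": []},
        ]},
    ],
}

def descendants(element):
    """Yield every element nested under `element`, depth-first."""
    for child in element["children"]:
        yield child
        yield from descendants(child)

def select(stream, kind, needle=""):
    """Apply one selector to a stream and return the next stream."""
    out = []
    for element in stream:
        for candidate in descendants(element):
            # Bareword-style match: case-insensitive substring.
            if candidate["kind"] == kind and needle.lower() in candidate["text"].lower():
                out.append(candidate)
    return out

# Roughly what '# usage | -' expresses:
stream = [doc]                                # initial single-element stream
stream = select(stream, "section", "usage")   # '# usage'
stream = select(stream, "list_item")          # '-' with an empty matcher
print([element["text"] for element in stream])  # ['run mdq']
```

Each call to `select` takes the previous stream as input, which is why the list item under "License" never reaches the output.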
# matcher
- The space between the `#` and the matcher is required, unless the matcher is empty.
- Selects sections whose header matches the matcher.
- The result is the section title and its contents, including any subsections.
- Does not differentiate between section levels: the markdown `# fizz buzz` and `## all the buzz` would both match against `mdq '# buzz'` (discussion #120).
Examples:
- `#`: matches any section
- `# foo`: matches any section whose title contains "foo"
- unordered list
- [?] unordered task
1. ordered list
1. [?] ordered task
- The space between the `-` / `1.` and the matcher is required, unless the matcher is empty.
- For ordered lists, the selector must be `1.` exactly: no other numbers, and the period is required.
- Note that there is no way to specify "an ordered or unordered list item" (discussion #119).
For tasks (in either ordered or unordered form), there are three variants:
| variant | meaning |
|---|---|
| `[ ]` | match only uncompleted tasks |
| `[x]` | match only completed tasks |
| `[?]` | match either |
Examples:
- `-`: matches any unordered list item
- `1.`: matches any ordered list item
- `- [?] todo`: matches any unordered task, regardless of whether it's completed or not, that contains the text "todo"
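The three task variants amount to a small truth table. The Python sketch below spells it out; `task_matches` is a hypothetical helper written for this illustration and is not part of mdq.

```python
def task_matches(variant, checked):
    """Return True if a task with the given state satisfies the variant.

    `variant` is one of '[ ]', '[x]' or '[?]'; `checked` is True for a
    completed task and False for an uncompleted one.
    """
    if variant == "[?]":
        return True          # matches either state
    if variant == "[x]":
        return checked       # only completed tasks
    if variant == "[ ]":
        return not checked   # only uncompleted tasks
    raise ValueError("unknown task variant: " + variant)

print(task_matches("[?]", True))   # True
print(task_matches("[x]", False))  # False
print(task_matches("[ ]", False))  # True
```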
[matcher](matcher)
![matcher](matcher)
- These work similarly, except one selects (only) links and the other selects (only) images.
- For the image selectors, no space is allowed between the `!` and the `[`.
- For both forms, the first matcher selects the link text or alt text (depending on the form), and the second matcher selects the URL.
- String anchors are especially useful for the URL matcher (see below).
On output, mdq will normalize the markdown for links. See below for details.
In Markdown, links and images can specify the URL in a few forms:
- [as a reference][1]
- [inline](https://example.com)
- [collapsed][] (GitHub extension)
- [shortcut] (GitHub extension)
[1]: https://example.com/1
[collapsed]: https://example.com/collapsed
[shortcut]: https://example.com/shortcut
The URL matcher works identically on all of these, and always matches the URL. There is no way to match the reference id (for example, `[1]` above).
Examples:
- `[]()`: matches all links
- `[hello]()`: matches all links whose display text contains "hello"
- `![](^https://example.com/)`: matches all images whose URL starts with `https://example.com/`
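One way to picture why the URL matcher behaves the same for every link form is to resolve each form to its URL first and only then match. The Python sketch below does that for the four forms listed above. It is a simplified illustration, not mdq's parser: the `LINK` pattern and `resolve` function are hypothetical and ignore details such as titles and nested brackets.

```python
import re

# Reference definitions from the example document above.
definitions = {
    "1": "https://example.com/1",
    "collapsed": "https://example.com/collapsed",
    "shortcut": "https://example.com/shortcut",
}

# [text](url) | [text][ref] | [text][] | [text]
LINK = re.compile(r"\[([^\]]+)\](?:\(([^)]*)\)|\[([^\]]*)\])?")

def resolve(link):
    """Return the URL that a link in any of the four forms points to."""
    match = LINK.fullmatch(link)
    if match is None:
        raise ValueError("not a link: " + link)
    text, inline_url, reference = match.groups()
    if inline_url is not None:
        return inline_url                 # inline form carries its own URL
    label = reference if reference else text  # collapsed/shortcut reuse the text
    return definitions[label]

for link in ["[as a reference][1]", "[inline](https://example.com)",
             "[collapsed][]", "[shortcut]"]:
    url = resolve(link)
    # Same check as the URL matcher ^https://example.com
    print(link, "->", url, url.startswith("https://example.com"))
```

Because matching happens on the resolved URL, the reference id itself (`1` here) is never visible to the matcher.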
> matcher
- The space between the `>` and the matcher is required, unless the matcher is empty.
Examples:
- `>`: matches all block quotes
- `> hello`: matches all block quotes containing "hello"
```language-matcher content-matcher
- Both matchers are optional (that is, the empty matcher is allowed for both)
- There must not be any space between the backticks and the language matcher, if it is provided.
- There must be a space between the backticks (and optional language matcher) and the content-matcher, if it is provided.
Examples:
- `` ``` ``: matches all code blocks
- `` ```rust ``: matches all code blocks for Rust, such as a `rust` block containing `let foo = "";`
- `` ``` "let foo" ``: matches all code blocks containing `let foo`, regardless of language, such as a `swift` block containing `let foo = "";`
- `` ```rust "let foo" ``: matches all code blocks for Rust containing `let foo`, such as a `rust` block containing `let foo = "";`
- `` ```/rust|java/ ``: matches all code blocks for either Rust or Java. (There's nothing special about this form: this example is just to remind you that the language matcher is just a normal matcher, and as such supports regexes.)
</> matcher
- The space between the `</>` and the matcher is required, unless the matcher is empty.
- The matched substring includes the angle brackets. For example, the markdown `Some <span>inline text` results in an HTML element `"<span>"`, not `"span"`.
- Note that the way Markdown handles HTML, only the tags get rendered as HTML elements, not the text between them. So, the markdown `<span>hello world</span>` produces two HTML elements, `"<span>"` and `"</span>"`.
- Remember that unquoted matchers must start with a letter. If you want to match against a tag including its opening angle bracket, you must quote it (or use a regex).
- This selector finds both HTML blocks and HTML inlines. It does not differentiate between them, or allow you to select only one kind or the other.
Examples:
- `</> "<span>"`: matches HTML tags containing `<span>`
P: matcher
- The space between the `P:` and the matcher is required, unless the matcher is empty.
Examples:
- `P: hello, world`: matches paragraphs containing "hello, world"
:-: columns :-: rows
- The space between each `:-:` and its matcher is required, unless that matcher is empty (which is allowed only for the rows matcher).
- The columns matcher must not be empty; if you want to match all columns, use `*`.
- The columns matcher specifies which columns in the table will be output, as matched by the header row (and only the header row).
- The rows matcher specifies which non-header rows in the table will be output (the header row is always output).
This matcher is different from the others in that it doesn't just select the element (that is, the table), but also modifies it. In particular, this selector lets you specify columns and rows from within the table.
Tip
If this syntax is confusing, think about the following progression:
- Start with a table with one column, one header row, and one data row with center-alignment:
      | my header |
      |:---------:|  // 🡐 :-: is center-alignment
      | data row  |
- Flatten it to one line: `| my header | |:---------:| | data row |`
- Remove the `|`s, since they'd conflict with mdq's selector separators: `my header :---------: data row`
- Shorten the `-`s: `my header :-: data row`
- Copy the `:-:` to the front, as the token to specify the selector type: `:-: my header :-: data row`
Note that this just illustrates why I picked the syntax; it's not what's going on in the selector. In particular, the `:-:` syntax does not suggest that this only matches center-aligned columns, or that it's possible to match based on alignment at all. It's just a token that looks table-y. `|-|` would look even more table-y, but the `|` token is already used for separating selectors, and I didn't want to confuse the syntax with binding precedence and all that jazz.
If the table is "jagged" (that is, if any rows have different lengths), then it will be assumed to be as wide as its widest row; all other rows (including headers) will be padded by empty cells on the right.
Note
This is a departure from the GitHub Flavored Markdown spec, which says that the table width is set by the header row, and that shorter rows are padded by empty cells but longer rows are simply truncated. This is intentional, so that you can get at the data in your slightly-out-of-spec markdown.
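The padding rule for jagged tables can be sketched in a few lines of Python. This is only a model of the behavior described above; `normalize_width` is a hypothetical name, and rows are represented as plain lists of cell strings for simplicity.

```python
def normalize_width(rows):
    """Pad every row with empty cells so all rows are as wide as the widest."""
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]

jagged = [
    ["Name"],                        # header row is shorter than a data row
    ["Foo", "the fuzz", "extra"],    # widest row sets the table width
    ["Bar", "the buzz"],
]
for row in normalize_width(jagged):
    print(row)
# ['Name', '', '']
# ['Foo', 'the fuzz', 'extra']
# ['Bar', 'the buzz', '']
```

Note that the header row is padded too, so the "extra" cell is kept rather than truncated.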
Examples:
Given a table:
| Name | Description |
|------|-------------|
| Foo | the fuzz |
| Bar | the buzz |
| Foot | the unit |
- `:-: * :-:`: returns the full table
- `:-: :-:`: invalid; the headers matcher is required
- `:-: name :-:`: returns just the `Name` column, with all four rows (the rows matcher is empty, which implies `*`):
      | Name |
      |------|
      | Foo  |
      | Bar  |
      | Foot |
- `:-: * :-: buzz`: returns both columns, but only the header row and the row containing "buzz":
      | Name | Description |
      |------|-------------|
      | Bar  | the buzz    |
  (Note that the headers matcher only matches the headers row, but the rows matcher matches any column within the data rows.)
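The examples above can be reproduced with a small Python model of the table selector. This is a sketch under assumptions rather than mdq's code: `select_table` is a hypothetical function, `None` stands in for the empty or `*` matcher, and matching is done as a case-insensitive substring test, as for bareword matchers. It also assumes the rows matcher is tested against every cell of a data row, including cells in columns that are not output.

```python
def select_table(rows, columns=None, row_filter=None):
    """Keep columns whose header matches and data rows with a matching cell.

    `rows[0]` is the header row; `None` means "match anything".
    """
    header, data = rows[0], rows[1:]

    def contains(cell, needle):
        return needle is None or needle.lower() in cell.lower()

    # Columns are chosen by the header row only.
    keep = [i for i, title in enumerate(header) if contains(title, columns)]
    # Rows are chosen if any cell matches; the header row is always kept.
    kept_rows = [row for row in data if any(contains(c, row_filter) for c in row)]
    return [[row[i] for i in keep] for row in [header] + kept_rows]

table = [
    ["Name", "Description"],
    ["Foo", "the fuzz"],
    ["Bar", "the buzz"],
    ["Foot", "the unit"],
]
print(select_table(table, columns="name"))
# [['Name'], ['Foo'], ['Bar'], ['Foot']]
print(select_table(table, row_filter="buzz"))
# [['Name', 'Description'], ['Bar', 'the buzz']]
```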
You can specify string matchers in several ways.
- The simplest is just an empty string (or any amount of whitespace), which means "any" and matches all elements.
- An asterisk (`*`) means the same thing, and any whitespace around it is also ignored.
- An unquoted string, which is just some text.
- A quoted string, which starts with either double or single quotes (`"` or `'`).
- A regular expression, which starts with `/`.
Unquoted strings are case-insensitive, while quoted strings are case-sensitive. Both support anchors, as described below. Both match any substring.
Regular expressions are case-sensitive by default, although the pattern syntax supports ignoring case. mdq doesn't provide any special handling of anchors, though the pattern syntax supports them, just as you'd expect.
An unquoted string:
- must start with a letter
- has a context-specific ending delimiter, which in practice is hopefully pretty intuitive; see below.
  - has an additional ending delimiter `$`, which always applies (see anchors below)
- doesn't support any escaping: every character until the ending delimiter is treated literally, except that leading and trailing whitespace is trimmed.
- is always case-insensitive
- matches any substring, unless you provide an anchor
The ending delimiter is just "the next token that completes this part of the selector":
- For section, list, and task selectors, this is the pipe (`|`) that delimits this selector from the next one.
- For link and image selectors:
  - For the display matcher (`[matcher]`), it's `]`
  - For the URL matcher (`(matcher)`), it's `)`
Example:
- `# some header text | -`: matcher is `some header text`.
  - note that the fact that it's in a section selector means that `|` is the ending delimiter
  - note also that the leading and trailing whitespace is trimmed
- `[foo]( example.com )`: first matcher is `foo` and second is `example.com`
  - note that the first matcher ends at the `]`, and the second at `)`. Hopefully that's pretty intuitive!
  - And again, note the trimmed whitespace. We could have also had it in the first matcher.
- `# hello, \u{2603}` matches the literal characters `\u{2603}`. It does not match the snowman character.
- `# hello, ☃` does match the snowman
- `# fizz # buzz`: note that the second `#` here does not signify a second section selector, but is just a literal `#` in the matcher text.
- `# fizz$`: the `$` anchor is always a delimiter, so this matches "fizz" only at the end of a section title
A quoted string:
- always starts with `"` or `'`, and always ends with that same character.
- is always case-sensitive
- matches any substring, unless you provide an anchor
Quoted strings support the following escape sequences:
| sequence | produces |
|---|---|
| `\'`, `\"` | `'` or `"`, respectively |
| `` \` `` | `'` |
| `\\` | `\` |
| `\n`, `\r`, `\t` | newline, carriage return, tab (respectively) |
| `\u{...}` | a single Unicode code point; `...` is a hexadecimal number of 1 to 6 digits |
- A double-quoted string may include single-quotes unescaped, and a single-quoted string may include double-quotes unescaped.
- `\'` and `\"` always work in both quoted string variants
- `` \` ``: An escaped backtick always translates to a single-quote (not a backtick!)
Tip
The escaped backtick ➡ `'` is to make it easier to use mdq with shells like bash or zsh.
Within those shells, a double-quoted string can have single-quotes, but also means you'll need to escape lots of other characters (and double-escape escape sequences!). Single-quoted shell strings are much easier to work with, but don't let you enter single-quotes. The escaped backtick provides an escape hatch for this:
mdq '# don\`t speak'
Examples:
- `"hello's world"`
- `'my "hello world" message'`
- `'my \"hello world\" message'`: you don't need to escape the `"` (since we're in a `'`-quoted string), but you're allowed to
- `"A string with\nTwo lines"`: The `\n` here is a newline, not a literal `\` followed by `n`.
- `"I love \u{2603} in the winter"`: `\u{2603}` is a snowman: ☃
- `"I love \u2603 in the winter"`: syntax error, because the curly braces are always required for `\u` escapes
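The escape rules for quoted strings can be modeled with a short Python function. This is an illustrative sketch only: `unescape` and the `SIMPLE` table are hypothetical names, the function works on the text between the quotes, and it treats any escape not listed in the table above as an error, which is an assumption about how strict mdq is.

```python
import re

# Single-character escapes from the table above. Note that an escaped
# backtick produces a single quote, not a backtick.
SIMPLE = {"'": "'", '"': '"', "`": "'", "\\": "\\", "n": "\n", "r": "\r", "t": "\t"}

# A backslash followed by either u{1-6 hex digits} or any single character.
ESCAPE = re.compile(r"\\(u\{([0-9A-Fa-f]{1,6})\}|.)", re.DOTALL)

def unescape(body):
    """Decode the escape sequences in the body of a quoted string."""
    def replace(match):
        hex_digits = match.group(2)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))      # \u{...} code point
        char = match.group(1)
        if char not in SIMPLE:
            raise ValueError("invalid escape: \\" + char)
        return SIMPLE[char]
    return ESCAPE.sub(replace, body)

print(unescape(r"don\`t speak"))                    # don't speak
print(unescape(r"I love \u{2603} in the winter"))   # I love ☃ in the winter
```

With this model, `\u2603` without braces raises `ValueError`, mirroring the syntax error described in the last example.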
Regexes are always delimited by `/` (see discussion #56).
mdq uses the `regex` crate, so that's the syntax we support. In particular, it does not support negative lookaheads.
See: https://docs.rs/regex/1.10/regex/index.html#syntax, and issue #121 for negative lookaheads.
Regexes search for a match anywhere in the string, not just at the beginning. If you need to match just at the beginning, use a `^` anchor within the pattern.
Examples:
- `/hello.+world/`: "hello", then one or more of any character, then "world"
- `/^hello/`: "hello", at the beginning of a string
- `/world$/`: "world", at the end of a string
- `/(?i)HELLO/`: "hello", case-insensitive (see relevant section of the regex docs)
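The "search anywhere" behavior can be tried out with any regex engine that offers an unanchored search. The sketch below uses Python's `re` module purely for illustration; mdq itself uses the Rust `regex` crate, whose syntax differs from Python's in places (for example, Python accepts lookaheads). The four patterns here use only syntax shared by both, and `found` is a hypothetical helper.

```python
import re

def found(pattern, text):
    """True if `pattern` matches anywhere in `text` (an unanchored search)."""
    return re.search(pattern, text) is not None

print(found(r"hello.+world", "say hello, world"))  # True: match starts mid-string
print(found(r"^hello", "say hello"))               # False: not at the beginning
print(found(r"world$", "world peace"))             # False: not at the end
print(found(r"(?i)HELLO", "Hello there"))          # True: case-insensitive flag
```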
Quoted and unquoted strings support the following anchors:
- `^`: anchors to start of string; must be the first character
- `$`: anchors to end of string; must be the last character
Because of this anchor, `$` is always an ending delimiter for unquoted strings, regardless of the string's context-specific delimiter.
For quoted strings, the anchors go immediately before or after the quotes. Whitespace is allowed between the anchor and the quotes, and that whitespace is ignored.
For unquoted strings, whitespace is allowed before or after either anchor, and is trimmed away.
Examples:
- `^"foo"`: "foo" at the beginning of a string
- `"bar"$`: "bar" at the end of a string
- `^foobar$`: the string must be "foobar" exactly (since it's anchored to both the beginning and the end)
- `^ "foo"`, `"bar" $`, `^ foobar $`: equivalent to the previous, respectively
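Putting case sensitivity and anchors together, string matching can be modeled as below. This Python sketch is an approximation for illustration: `matches` is a hypothetical function, its keyword arguments stand in for the parsed selector, and `str.lower()` is only an assumption about how case-insensitive comparison is done.

```python
def matches(text, needle, *, quoted=False, start=False, end=False):
    """Model of a string matcher.

    quoted=False -> unquoted matcher, compared case-insensitively
    start / end  -> the ^ and $ anchors
    """
    if not quoted:
        text, needle = text.lower(), needle.lower()
    if start and end:
        return text == needle            # ^needle$
    if start:
        return text.startswith(needle)   # ^needle
    if end:
        return text.endswith(needle)     # needle$
    return needle in text                # plain substring match

print(matches("Foo Bar", "foo", start=True))                  # True:  ^foo
print(matches("Foo Bar", "foo", quoted=True))                 # False: "foo"
print(matches("foobar baz", "foobar", start=True, end=True))  # False: ^foobar$
print(matches("FooBar", "foobar", start=True, end=True))      # True:  ^foobar$
```

The last line shows that an unquoted `^foobar$` still ignores case; use a quoted string if the case must match as well.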
By default, mdq will output all selected items as markdown. If there were multiple items selected, they will be separated by a thematic break:
---
You can also specify JSON output, as mentioned below.
- By default, all links and images will be converted to reference form. You can change this behavior with `--link_format`.
- By default, link references will be moved up to be in the first section that mentions the link:
      # Section one
      A [link][1].
      [1]: https://example.com/1
      # Section two
      Note that the link reference (`[1]: `) went in section one, since that's the section that used it.
  You can change this behavior with `--link-placement`.
If converting links and images to reference form, all numeric references will be reordered to start at 1 and count up sequentially from there. Any non-numeric references (for example, `[a]`) will be unaltered.
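The renumbering rule can be sketched as a mapping from old reference ids to new ones. The Python below is a model only: `renumber` is a hypothetical function, and numbering in order of first appearance is an assumption, since the text above only says that numbering starts at 1 and counts up sequentially.

```python
def renumber(reference_ids):
    """Map each reference id to its id after renumbering.

    `reference_ids` lists ids in the order they appear in the output.
    Numeric ids are renumbered 1, 2, 3, ...; other ids are left alone.
    """
    mapping = {}
    next_number = 1
    for ref in reference_ids:
        if ref in mapping:
            continue                      # already handled this id
        if ref.isascii() and ref.isdigit():
            mapping[ref] = str(next_number)
            next_number += 1
        else:
            mapping[ref] = ref            # non-numeric ids are unaltered
    return mapping

print(renumber(["7", "a", "3", "7"]))
# {'7': '1', 'a': 'a', '3': '2'}
```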
If you specify `--output json` (alias: `-o json`), mdq will output its results as JSON. The schema is:
{
"items": [
/* ... */
],
"links": {
"1": {
"url": "<url>",
"title": "[title]"
}
},
"footnotes": {
"a": [
/* ... */
]
}
}
- `"items"` contains an array of selected items. This entry is always present, but may be empty if there were no selected items. Each item is a polymorphic object; see below.
- `"links"` provides the reference definitions for links and images that use the `[text][1]` form, but only for links and images that appear as part of inline text. (See warning below.)
  - The keys are the reference ids, without square brackets. For example, `1`.
  - The values are always objects, with either one or two entries:
    - `"url"`: the reference URL; always provided
    - `"title"`: the reference title, if one exists; omitted if there's no title. The title is often rendered as a tooltip, and is specified in markdown as: `[1]: https://example.com "this is a title"`
  - `[collapsed][]` and `[shortcut]` links count as reference links, and will appear in this object.
  - `[inline](https://example.com)` links will not be in this object.
  - If there are no reference-style links, the entire `"links"` entry will be omitted.
- `"footnotes"` provides the footnote texts for footnotes:
      Some text that needs further explanation.[^a]
      [^a]: This is the footnote text.
          It is _full_ **markdown** and can
          - even
          - include block elements
  - The keys are the footnote ids, without the square brackets or the `^`. In the example above, the id would be `"a"`.
  - The values follow the same syntax as the top-level `"items"`. Since footnote text may contain several top-level blocks (as in the example above), this is an array.
Warning
Links and images can appear in two different kinds of elements:
- selected links / images, as by a `[]()` selector
- inline markdown, such as in paragraphs: `some _text_ that [contains links][1]`
The top-level `"links"` entry will only contain items for the second of these, the inline markdown. That's because for links and images you specifically select, the JSON object representing that item already has a place to put the link info; for inline markdown, there isn't such a place.
This can be a bit jarring: you select a section that produces a `"links"` entry, but when you select those links, that entry goes away:
$ cat example.md | mdq -o json '# my section' # has top-level links entry
$ cat example.md | mdq -o json '# my section | []()' # where did my links go??
See discussion #123.
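Because the top-level `"links"` entry may be omitted, code that consumes the JSON should not assume it exists. The Python sketch below shows one defensive way to read it. The JSON string is hand-written to follow the schema above rather than captured from a real mdq run, and `reference_url` is a hypothetical helper.

```python
import json

# Hand-written sample that follows the documented schema.
sample = """
{
  "items": [
    {"paragraph": "some _text_ that [contains links][1]"}
  ],
  "links": {
    "1": {"url": "https://example.com/1"}
  }
}
"""

def reference_url(output, reference_id):
    """Look up a reference id, tolerating a missing "links" entry."""
    links = output.get("links", {})   # omitted when there are no reference links
    entry = links.get(reference_id)
    return None if entry is None else entry["url"]

output = json.loads(sample)
print(reference_url(output, "1"))             # https://example.com/1
print(reference_url({"items": []}, "1"))      # None
```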
Each item is a single-element JSON object, whose key specifies what kind of item it is, and whose value depends on the key. The following subsections will describe each one.
This represents the full Markdown document; it's what you get if you run just `mdq -o json` with no selectors.
{
"document": [
/* items */
]
}
{
"section": {
"depth": 1,
"title": "Markdown text. Inline elements like _emphasis_ or [links][1] are rendered as Markdown.",
"body": [
/* items */
]
}
}
{
"paragraph": "Markdown text. Inline elements like _emphasis_ or [links][1] are rendered as Markdown."
}
```rust metadata string
text of the code block
```
{
"code_block": {
"code": "text of the code block",
"type": "code | math | toml | yaml",
"language": "rust",
"metadata": "metadata string"
}
}
- `language` and `metadata` are omitted if absent
- `metadata` is always absent if `language` is absent
{
"link": {
/* see below */
}
}
Link item values always have a `"display"`, but the rest of the entries depend on the link style:
- Reference links:
      [Markdown text. Inline elements like _emphasis_ are rendered as Markdown][1]
      { "display": "Markdown text. Inline elements like _emphasis_ are rendered as Markdown", "reference": "1" }
  (The top-level `"links"` entry will then contain the URL and title information.)
- Inline links:
      [Markdown _text_](https://example.com "link title")
      { "display": "Markdown text. Inline elements like _emphasis_ or [links][1] are rendered as Markdown", "url": "https://example.com", "title": "link title" }
- Collapsed or shortcut links:
      [collapsed _markdown_][]
      [collapsed _markdown_]: https://example.com "link title"
      { "display": "collapsed _markdown_", "url": "https://example.com", "title": "'link title' (omitted if absent)", "reference_style": "collapsed" }
  `[shortcut]` links work the same way, except that they have `"reference_style": "shortcut"`.
Images work just like links, except that:
- the type specifier is `"image"`, not `"link"`
- `"display"` becomes `"alt"`
> Some
>
> Text
{
"block_quote": [
/* items */
]
}
1. one
2. two
{
"list": [
/* list item values, as described below */
]
}
Each object is the value of a list item object; it does not contain the enclosing `{"list_item": }` object. The example would look like:
{
"list": [
{
"index": 1,
"item": [
{
"paragraph": "one"
}
]
},
{
"index": 2,
"item": [
{
"paragraph": "two"
}
]
}
]
}
3. [ ] text
{
"list_item": {
"item": [
/* items */
],
"index": 3,
"checked": false
}
}
- List contents are blocks that can contain multiple segments, which is why `"item"` is an array.
- `"index"` is always an integer; it is omitted for unordered list items
- `"checked"` is always a boolean; it is omitted if the list item is not a task
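Since `"index"` and `"checked"` are omitted rather than set to `null`, a consumer can tell the four kinds of list item apart by checking which keys are present. The Python sketch below shows this; `describe` is a hypothetical helper, and the sample value is hand-written to match the `3. [ ] text` example, with a single paragraph assumed as its contents.

```python
def describe(list_item):
    """Summarize one list item value from the JSON output."""
    kind = "ordered" if "index" in list_item else "unordered"
    if "checked" not in list_item:
        state = "plain item"
    elif list_item["checked"]:
        state = "completed task"
    else:
        state = "uncompleted task"
    return kind + ", " + state

sample = {"item": [{"paragraph": "text"}], "index": 3, "checked": False}
print(describe(sample))                          # ordered, uncompleted task
print(describe({"item": [{"paragraph": "x"}]}))  # unordered, plain item
```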
| header 1 | header 2 |
|----------|----------|
| hello | world |
{
"table": {
"alignments": [
"left | right | center | none"
/* ... */
],
"rows": [
[
"header 1",
"header 2"
],
[
"hello",
"world"
]
]
}
}
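In the example above, the first entry of `"rows"` is the header row. Assuming that always holds, the table value can be turned into one dictionary per data row, as in this Python sketch. The `records` name is hypothetical, and the `"none"` alignments are filled in for illustration.

```python
# Table item shaped like the example above; alignment values are illustrative.
item = {
    "table": {
        "alignments": ["none", "none"],
        "rows": [
            ["header 1", "header 2"],
            ["hello", "world"],
        ],
    }
}

# Split off the header row, then pair each data cell with its column title.
header, *data = item["table"]["rows"]
records = [dict(zip(header, row)) for row in data]
print(records)
# [{'header 1': 'hello', 'header 2': 'world'}]
```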
----
{
"thematic_break": null
}