Diffstat (limited to 'spec.txt')
1 files changed, 364 insertions, 70 deletions
diff --git a/spec.txt b/spec.txt
index bc2e381..12ec482 100644
--- a/spec.txt
+++ b/spec.txt
@@ -2,8 +2,8 @@
title: CommonMark Spec
- John MacFarlane
-version: 2
-date: 2014-09-19
+version: 0.5
+date: 2014-10-25
# Introduction
@@ -192,10 +192,10 @@ In the examples, the `→` character is used to represent tabs.
# Preprocessing
A [line](#line) <a id="line"></a>
-is a sequence of zero or more characters followed by a line
-ending (CR, LF, or CRLF) or by the end of
+is a sequence of zero or more [characters](#character) followed by a
+line ending (CR, LF, or CRLF) or by the end of file.
+A [character](#character)<a id="character"></a> is a Unicode code point.
This spec does not specify an encoding; it thinks of lines as composed
of characters rather than bytes. A conforming parser may be limited
to a certain encoding.
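
As a rough illustration of the line definition above, here is a minimal Python sketch that splits already-decoded text into lines on CR, LF, or CRLF. The function name `split_lines` is hypothetical and not part of the spec; the sketch assumes the input is a string of Unicode code points rather than bytes.

```python
import re


def split_lines(text: str) -> list[str]:
    """Split text into lines, treating CR, LF, and CRLF as line endings."""
    # CRLF is listed first so "\r\n" counts as one line ending, not two.
    lines = re.split(r"\r\n|\r|\n", text)
    # A final line ending terminates the last line; it does not begin
    # an extra empty line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

A regular expression is used instead of `str.splitlines()` because the latter also splits on other separators (for example U+2028), which the definition above does not mention.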
@@ -377,16 +377,18 @@ Spaces are allowed at the end:
<hr />
-However, no other characters may occur at the end or the
+However, no other characters may occur in the line:
_ _ _ _ a
<p>_ _ _ _ a</p>
It is required that all of the non-space characters be the same.
@@ -426,8 +428,11 @@ bar
-Note, however, that this is a setext header, not a paragraph followed
-by a horizontal rule:
+If a line of dashes that meets the above conditions for being a
+horizontal rule could also be interpreted as the underline of a [setext
+header](#setext-header), the interpretation as a
+[setext header](#setext-header) takes precedence. Thus, for example,
+this is a setext header, not a paragraph followed by a horizontal rule:
@@ -662,7 +667,10 @@ ATX headers can be empty:
A [setext header](#setext-header) <a id="setext-header"></a>
consists of a line of text, containing at least one nonspace character,
with no more than 3 spaces indentation, followed by a [setext header
-underline](#setext-header-underline). A [setext header
+underline](#setext-header-underline). The line of text must be
+one that, were it not followed by the setext header underline,
+would be interpreted as part of a paragraph: it cannot be a code
+block, header, blockquote, horizontal rule, or list. A [setext header
underline](#setext-header-underline) <a id="setext-header-underline"></a>
is a sequence of `=` characters or a sequence of `-` characters, with no
more than 3 spaces indentation and any number of trailing
@@ -807,7 +815,8 @@ of dashes"/>
<p>of dashes&quot;/&gt;</p>
-The setext header underline cannot be a lazy line:
+The setext header underline cannot be a [lazy continuation
+line](#lazy-continuation-line) in a list item or block quote:
> Foo
@@ -819,6 +828,16 @@ The setext header underline cannot be a lazy line:
<hr />
+- Foo
+<hr />
A setext header cannot interrupt a paragraph:
@@ -863,6 +882,56 @@ Setext headers cannot be empty:
+Setext header text lines must not be interpretable as block
+constructs other than paragraphs. So, the line of dashes
+in these examples gets interpreted as a horizontal rule:
+<hr />
+<hr />
+- foo
+<hr />
+ foo
+<hr />
+> foo
+<hr />
+If you want a header with `> foo` as its literal text, you can
+use backslash escapes:
+\> foo
+<h2>&gt; foo</h2>
## Indented code blocks
@@ -1355,8 +1424,8 @@ name is one of the following (case-insensitive):
`output`, `col`, `p`, `colgroup`, `pre`, `dd`, `progress`, `div`,
`section`, `dl`, `table`, `td`, `dt`, `tbody`, `embed`, `textarea`,
`fieldset`, `tfoot`, `figcaption`, `th`, `figure`, `thead`, `footer`,
-`footer`, `tr`, `form`, `ul`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`,
-`video`, `script`, `style`.
+`tr`, `form`, `ul`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, `video`,
+`script`, `style`.
An [HTML block](#html-block) <a id="html-block"></a> begins with an
[HTML block tag](#html-block-tag), [HTML comment](#html-comment),
@@ -1401,7 +1470,7 @@ okay.
-Here we have two code blocks with a Markdown paragraph between them:
+Here we have two HTML blocks with a Markdown paragraph between them:
<DIV CLASS="foo">
@@ -1447,11 +1516,11 @@ A processing instruction:
- echo 'foo'
+ echo '>';
- echo 'foo'
+ echo '>';
@@ -1946,8 +2015,8 @@ bbb
Final spaces are stripped before inline parsing, so a paragraph
-that ends with two or more spaces will not end with a hard line
+that ends with two or more spaces will not end with a [hard line
@@ -2375,7 +2444,8 @@ An [ordered list marker](#ordered-list-marker) <a id="ordered-list-marker"></a>
is a sequence of one or more digits (`0-9`), followed by either a
`.` character or a `)` character.
-The following rules define [list items](#list-item):
+The following rules define [list items](#list-item):<a
1. **Basic case.** If a sequence of lines *Ls* constitute a sequence of
blocks *Bs* starting with a non-space character and not separated
@@ -2826,9 +2896,11 @@ Four spaces indent gives a code block:
some or all of the indentation from one or more lines in which the
next non-space character after the indentation is
[paragraph continuation text](#paragraph-continuation-text) is a
- list item with the same contents and attributes.
+ list item with the same contents and attributes.<a
+ id="lazy-continuation-line"></a>
-Here is an example with lazy continuation lines:
+Here is an example with [lazy continuation
1. A paragraph
@@ -3005,6 +3077,21 @@ A list item may be empty:
+A list item can contain a header:
+- # Foo
+- Bar
+ ---
+ baz
### Motivation
John Gruber's Markdown spec says the following about list items:
@@ -3210,12 +3297,12 @@ of an [ordered list](#ordered-list) is determined by the list number of
its initial list item. The numbers of subsequent list items are
-A list is [loose](#loose) if it any of its constituent list items are
-separated by blank lines, or if any of its constituent list items
-directly contain two block-level elements with a blank line between
-them. Otherwise a list is [tight](#tight). (The difference in HTML output
-is that paragraphs in a loose with are wrapped in `<p>` tags, while
-paragraphs in a tight list are not.)
+A list is [loose](#loose)<a id="loose"></a> if any of its constituent
+list items are separated by blank lines, or if any of its constituent
+list items directly contain two block-level elements with a blank line
+between them. Otherwise a list is [tight](#tight).<a id="tight"></a>
+(The difference in HTML output is that paragraphs in a loose list are
+wrapped in `<p>` tags, while paragraphs in a tight list are not.)
Changing the bullet or ordered list delimiter starts a new list:
@@ -3247,6 +3334,87 @@ Changing the bullet or ordered list delimiter starts a new list:
+In CommonMark, a list can interrupt a paragraph. That is,
+no blank line is needed to separate a paragraph from a following
+list:
+- bar
+- baz
+`` does not allow this, through fear of triggering a list
+via a numeral in a hard-wrapped line:
+The number of windows in my house is
+14. The number of doors is 6.
+<p>The number of windows in my house is</p>
+<ol start="14">
+<li>The number of doors is 6.</li>
+Oddly, `` *does* allow a blockquote to interrupt a paragraph,
+even though the same considerations might apply. We think that the two
+cases should be treated the same. Here are two reasons for allowing
+lists to interrupt paragraphs:
+First, it is natural and not uncommon for people to start lists without
+blank lines:
+ I need to buy
+ - new shoes
+ - a coat
+ - a plane ticket
+Second, we are attracted to a
+> [principle of uniformity](#principle-of-uniformity):<a
+> id="principle-of-uniformity"></a> if a span of text has a certain
+> meaning, it will continue to have the same meaning when put into a list
+> item.
+(Indeed, the spec for [list items](#list-item) presupposes this.)
+This principle implies that if
+ * I need to buy
+ - new shoes
+ - a coat
+ - a plane ticket
+is a list item containing a paragraph followed by a nested sublist,
+as all Markdown implementations agree it is (though the paragraph
+may be rendered without `<p>` tags, since the list is "tight"),
+ I need to buy
+ - new shoes
+ - a coat
+ - a plane ticket
+by itself should be a paragraph followed by a nested sublist.
+Our adherence to the [principle of uniformity](#principle-of-uniformity)
+thus inclines us to think that there are two coherent packages:
+1. Require blank lines before *all* lists and blockquotes,
+ including lists that occur as sublists inside other list items.
+2. Require blank lines in none of these places.
+[reStructuredText]( takes
+the first approach, for which there is much to be said. But the second
+seems more consistent with established practice with Markdown.
There can be blank lines between items, but two blank lines end
a list:
@@ -3463,8 +3631,8 @@ This is a tight list, because the blank lines are in a code block:
This is a tight list, because the blank line is between two
-paragraphs of a sublist. So the inner list is loose while
-the other list is tight:
+paragraphs of a sublist. So the sublist is loose while
+the outer list is tight:
- a
@@ -3650,7 +3818,8 @@ If a backslash is itself escaped, the following character is not:
-A backslash at the end of the line is a hard line break:
+A backslash at the end of the line is a [hard line
@@ -3727,21 +3896,25 @@ foo
## Entities
-With the goal of making this standard as HTML-agnostic as possible, all HTML valid HTML Entities in any
-context are recognized as such and converted into their actual values (i.e. the UTF8 characters representing
-the entity itself) before they are stored in the AST.
+With the goal of making this standard as HTML-agnostic as possible, all
+valid HTML entities in any context are recognized as such and
+converted into Unicode characters before they are stored in the AST.
-This allows implementations that target HTML output to trivially escape the entities when generating HTML,
-and simplifies the job of implementations targetting other languages, as these will only need to handle the
-UTF8 chars and need not be HTML-entity aware.
+This allows implementations that target HTML output to trivially escape
+the entities when generating HTML, and simplifies the job of
+implementations targeting other languages, as these will only need to
+handle the Unicode characters and need not be HTML-entity aware.
[Named entities](#named-entities) <a id="named-entities"></a> consist of `&`
-+ any of the valid HTML5 entity names + `;`. The [following document](
-is used as an authoritative source of the valid entity names and their corresponding codepoints.
++ any of the valid HTML5 entity names + `;`. The
+[following document](
+is used as an authoritative source of the valid entity names and their
+corresponding codepoints.
-Conforming implementations that target Markdown don't need to generate entities for all the valid
-named entities that exist, with the exception of `"` (`&quot;`), `&` (`&amp;`), `<` (`&lt;`) and `>` (`&gt;`),
-which always need to be written as entities for security reasons.
+Conforming implementations that target HTML don't need to generate
+entities for all the valid named entities that exist, with the exception
+of `"` (`&quot;`), `&` (`&amp;`), `<` (`&lt;`) and `>` (`&gt;`), which
+always need to be written as entities for security reasons.
&nbsp; &amp; &copy; &AElig; &Dcaron; &frac34; &HilbertSpace; &DifferentialD; &ClockwiseContourIntegral;
@@ -3750,9 +3923,10 @@ which always need to be written as entities for security reasons.
[Decimal entities](#decimal-entities) <a id="decimal-entities"></a>
-consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these entities need to be recognised
-and tranformed into their corresponding UTF8 codepoints. Invalid Unicode codepoints will be written
-as the "unknown codepoint" character (`0xFFFD`)
+consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these
+entities need to be recognised and transformed into their corresponding
+Unicode code points. Invalid Unicode code points will be written as the
+"unknown codepoint" character (`0xFFFD`).
&#35; &#1234; &#992; &#98765432;
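
The decimal-entity rule in this hunk can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `decode_decimal_entities` is hypothetical, and the spec text does not say exactly which code points count as invalid, so the sketch assumes that 0, the surrogate range, and anything above U+10FFFF are replaced.

```python
import re

REPLACEMENT = "\ufffd"  # the "unknown codepoint" character


def decode_decimal_entities(text: str) -> str:
    """Replace &#N; (1 to 8 ASCII digits) with the matching character."""

    def repl(match: re.Match) -> str:
        code_point = int(match.group(1))
        # Assumption: 0, surrogates, and values past U+10FFFF are invalid.
        if (
            code_point == 0
            or code_point > 0x10FFFF
            or 0xD800 <= code_point <= 0xDFFF
        ):
            return REPLACEMENT
        return chr(code_point)

    # [0-9] rather than \d, because \d also matches non-ASCII digits.
    return re.sub(r"&#([0-9]{1,8});", repl, text)
```

Applied to the example line above, `&#35;` becomes `#` and `&#98765432;` becomes U+FFFD, since that value lies beyond the Unicode range.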
@@ -3779,7 +3953,8 @@ Here are some nonentities:
Although HTML5 does accept some entities without a trailing semicolon
-(such as `&copy`), these are not recognized as entities here, because it makes the grammar too ambiguous:
+(such as `&copy`), these are not recognized as entities here, because it
+makes the grammar too ambiguous:
@@ -3787,7 +3962,8 @@ Although HTML5 does accept some entities without a trailing semicolon
-Strings that are not on the list of HTML5 named entities are not recognized as entities either:
+Strings that are not on the list of HTML5 named entities are not
+recognized as entities either:
@@ -4035,7 +4211,7 @@ for efficient parsing strategies that do not backtrack:
(a) it is not part of a sequence of four or more unescaped `*`s,
(b) it is not followed by whitespace, and
(c) either it is not followed by a `*` character or it is
- followed immediately by strong emphasis.
+ followed immediately by emphasis or strong emphasis.
2. A single `_` character [can open emphasis](#can-open-emphasis) iff
@@ -4043,7 +4219,7 @@ for efficient parsing strategies that do not backtrack:
(b) it is not followed by whitespace,
(c) it is not preceded by an ASCII alphanumeric character, and
(d) either it is not followed by a `_` character or it is
- followed immediately by strong emphasis.
+ followed immediately by emphasis or strong emphasis.
3. A single `*` character [can close emphasis](#can-close-emphasis)
<a id="can-close-emphasis"></a> iff
@@ -4088,16 +4264,42 @@ for efficient parsing strategies that do not backtrack:
(c) it is not followed by an ASCII alphanumeric character.
9. Emphasis begins with a delimiter that [can open
- emphasis](#can-open-emphasis) and includes inlines parsed
- sequentially until a delimiter that [can close
+ emphasis](#can-open-emphasis) and ends with a delimiter that [can close
emphasis](#can-close-emphasis), and that uses the same
- character (`_` or `*`) as the opening delimiter, is reached.
+ character (`_` or `*`) as the opening delimiter. The inlines
+ between the open delimiter and the closing delimiter are the
+ contents of the emphasis inline.
10. Strong emphasis begins with a delimiter that [can open strong
- emphasis](#can-open-strong-emphasis) and includes inlines parsed
- sequentially until a delimiter that [can close strong
- emphasis](#can-close-strong-emphasis), and that uses the
- same character (`_` or `*`) as the opening delimiter, is reached.
+ emphasis](#can-open-strong-emphasis) and ends with a delimiter that
+ [can close strong emphasis](#can-close-strong-emphasis), and that uses the
+ same character (`_` or `*`) as the opening delimiter. The inlines
+ between the open delimiter and the closing delimiter are the
+ contents of the strong emphasis inline.
+Where rules 1--10 above are compatible with multiple parsings,
+the following principles resolve ambiguity:
+11. An interpretation `<strong>...</strong>` is always preferred to
+ `<em><em>...</em></em>`.
+12. An interpretation `<strong><em>...</em></strong>` is always
+ preferred to `<em><strong>..</strong></em>`.
+13. Earlier closings are preferred to later closings. Thus,
+ when two potential emphasis or strong emphasis spans overlap,
+ the first takes precedence: for example, `*foo _bar* baz_`
+ is parsed as `<em>foo _bar</em> baz_` rather than
+ `*foo <em>bar* baz</em>`. For the same reason,
+ `**foo*bar**` is parsed as `<em><em>foo</em>bar</em>*`
+ rather than `<strong>foo*bar</strong>`.
+14. Inline code spans, links, images, and HTML tags group more tightly
+ than emphasis. So, when there is a choice between an interpretation
+ that contains one of these elements and one that does not, the
+ former always wins. Thus, for example, `*[foo*](bar)` is
+ parsed as `*<a href="bar">foo*</a>` rather than as
+ `<em>[foo</em>](bar)`.
These rules can be illustrated through a series of examples.
@@ -4345,6 +4547,32 @@ __this is a double underscore (`__`)__
<p><strong>this is a double underscore (<code>__</code>)</strong></p>
+Or use the other emphasis character:
`*` delimiters allow intra-word emphasis; `_` delimiters do not:
@@ -4520,6 +4748,36 @@ __foo _bar_ baz__
<p><strong>foo <em>bar</em> baz</strong></p>
+**foo, *bar*, baz**
+<p><strong>foo, <em>bar</em>, baz</strong></p>
+__foo, _bar_, baz__
+<p><strong>foo, <em>bar</em>, baz</strong></p>
+But note:
+The difference is that in the two preceding cases,
+the internal delimiters [can close emphasis](#can-close-emphasis),
+while in the cases with spaces, they cannot.
Note that you cannot nest emphasis directly inside emphasis
using the same delimiter, or strong emphasis directly inside
strong emphasis:
@@ -4601,7 +4859,7 @@ However, a string of four or more `****` can never close emphasis:
-Note that there are some asymmetries here:
+We retain symmetry in these cases:
@@ -4609,7 +4867,7 @@ Note that there are some asymmetries here:
@@ -4618,18 +4876,12 @@ Note that there are some asymmetries here:
**foo* bar*
<p><em>foo <em>bar</em></em></p>
-<p>**foo* bar*</p>
+<p><em><em>foo</em> bar</em></p>
More cases with mismatched delimiters:
-**foo* bar*
-<p>**foo* bar*</p>
@@ -4638,7 +4890,7 @@ More cases with mismatched delimiters:
@@ -4650,7 +4902,7 @@ More cases with mismatched delimiters:
@@ -4659,6 +4911,46 @@ More cases with mismatched delimiters:
<p>***foo <em>bar</em></p>
+The following cases illustrate rule 13:
+*foo _bar* baz_
+<p><em>foo _bar</em> baz_</p>
+**foo bar* baz**
+<p><em><em>foo bar</em> baz</em>*</p>
+The following cases illustrate rule 14:
+<p>*<a href="bar">foo*</a></p>
+<p>*<img src="bar" alt="foo*" /></p>
+*<img src="foo" title="*"/>
+<p>*<img src="foo" title="*"/></p>
## Links
A link contains a [link label](#link-label) (the visible text),
@@ -4817,9 +5109,10 @@ in Markdown:
<p><a href="foo):">link</a></p>
-URL-escaping and should be left alone inside the destination, as all URL-escaped characters
-are also valid URL characters. HTML entities in the destination will be parsed into their UTF8
-codepoints, as usual, and optionally URL-escaped when written as HTML.
+URL-escaping should be left alone inside the destination, as all
+URL-escaped characters are also valid URL characters. HTML entities in
+the destination will be parsed into their Unicode code points, as usual, and
+optionally URL-escaped when written as HTML.
@@ -5796,7 +6089,8 @@ Backslash escapes do not work in HTML attributes:
## Hard line breaks
A line break (not in a code span or HTML tag) that is preceded
-by two or more spaces is parsed as a linebreak (rendered
+by two or more spaces is parsed as a [hard line
+break](#hard-line-break)<a id="hard-line-break"></a> (rendered
in HTML as a `<br />` tag):
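
A minimal Python sketch of the hard line break rule, combined with the backslash form and the stripping of final spaces mentioned in earlier hunks. The function name `render_hard_breaks` is hypothetical. The sketch deliberately ignores code spans, HTML tags, and escaped backslashes, all of which a real inline parser would have to handle first.

```python
import re


def render_hard_breaks(paragraph: str) -> str:
    """Turn hard line breaks inside one paragraph into <br /> tags."""
    # Final spaces are stripped before inline parsing, so spaces at the
    # very end of the paragraph cannot produce a hard line break.
    text = paragraph.rstrip(" ")
    # Two or more spaces before a line ending, or a backslash directly
    # before a line ending, become a hard line break.
    return re.sub(r"(?: {2,}|\\)\n", "<br />\n", text)
```

A line ending preceded by a single space is left untouched here; it is not a hard line break.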