author | John MacFarlane <jgm@berkeley.edu> | 2014-09-17 14:05:04 -0700 |
---|---|---|
committer | John MacFarlane <jgm@berkeley.edu> | 2014-09-17 14:05:04 -0700 |
commit | 309173a493aea59cce5cce1b52b86e01b041bb8f | |
tree | 30eea56599b096d11cf4eaad38d395b3910a35b4 /spec.txt | |
parent | 6326bc748c8f5f225d82c01fe6763776f2bbd88e | |
parent | 3aa56049d4b52b55a2313e51698090ee81e10036 |
Merge pull request #66 from vmg/revamp
Enfastenate the C Parsenator
Diffstat (limited to 'spec.txt')
-rw-r--r-- | spec.txt | 67 |
1 files changed, 40 insertions, 27 deletions
@@ -1682,7 +1682,7 @@ them.
 
 [Foo bar]
 .
-<p><a href="my url" title="title">Foo bar</a></p>
+<p><a href="my%20url" title="title">Foo bar</a></p>
 .
 
 The title may be omitted:
@@ -1745,7 +1745,7 @@ case-insensitive (see [matches](#matches)).
 
 [αγω]
 .
-<p><a href="/φου">αγω</a></p>
+<p><a href="/%CF%86%CE%BF%CF%85">αγω</a></p>
 .
 
 Here is a link reference definition with no corresponding link.
@@ -3688,7 +3688,7 @@ raw HTML:
 .
 <http://google.com?find=\*>
 .
-<p><a href="http://google.com?find=\*">http://google.com?find=\*</a></p>
+<p><a href="http://google.com?find=%5C*">http://google.com?find=\*</a></p>
 .
 
 .
@@ -3727,47 +3727,59 @@ foo
 
 ## Entities
 
-Entities are parsed as entities, not as literal text, in all contexts
-except code spans and code blocks. Three kinds of entities are recognized.
+With the goal of making this standard as HTML-agnostic as possible, all HTML valid HTML Entities in any
+context are recognized as such and converted into their actual values (i.e. the UTF8 characters representing
+the entity itself) before they are stored in the AST.
+
+This allows implementations that target HTML output to trivially escape the entities when generating HTML,
+and simplifies the job of implementations targetting other languages, as these will only need to handle the
+UTF8 chars and need not be HTML-entity aware.
 
 [Named entities](#name-entities) <a id="named-entities"></a> consist of `&`
-+ a string of 2-32 alphanumerics beginning with a letter + `;`.
++ any of the valid HTML5 entity names + `;`. The [following document](http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json)
+is used as an authoritative source of the valid entity names and their corresponding codepoints.
+
+Conforming implementations that target Markdown don't need to generate entities for all the valid
+named entities that exist, with the exception of `"` (`"`), `&` (`&`), `<` (`<`) and `>` (`>`),
+which always need to be written as entities for security reasons.
 
 .
 & © Æ Ď ¾ ℋ ⅆ ∲
 .
-<p> & © Æ Ď ¾ ℋ ⅆ ∲</p>
+<p> & © Æ Ď ¾ ℋ ⅆ ∲</p>
 .
 
 [Decimal entities](#decimal-entities) <a id="decimal-entities"></a>
-consist of `&#` + a string of 1--8 arabic digits + `;`.
+consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these entities need to be recognised
+and tranformed into their corresponding UTF8 codepoints. Invalid Unicode codepoints will be written
+as the "unknown codepoint" character (`0xFFFD`)
 
 .
- # Ӓ Ϡ �
+# Ӓ Ϡ �
 .
-<p> # Ӓ Ϡ �</p>
+<p># Ӓ Ϡ �</p>
 .
 
 [Hexadecimal entities](#hexadecimal-entities) <a id="hexadecimal-entities"></a>
 consist of `&#` + either `X` or `x` + a string of 1-8 hexadecimal digits
-+ `;`.
++ `;`. They will also be parsed and turned into their corresponding UTF8 values in the AST.
 
 .
- " ആ ಫ
+" ആ ಫ
 .
-<p> " ആ ಫ</p>
+<p>" ആ ಫ</p>
 .
 
 Here are some nonentities:
 
 .
-  &x; &#; &#x; � &ThisIsWayTooLongToBeAnEntityIsntIt; &hi?;
+  &x; &#; &#x; &ThisIsWayTooLongToBeAnEntityIsntIt; &hi?;
 .
-<p>&nbsp &x; &#; &#x; &#123456789; &ThisIsWayTooLongToBeAnEntityIsntIt; &hi?;</p>
+<p>&nbsp &x; &#; &#x; &ThisIsWayTooLongToBeAnEntityIsntIt; &hi?;</p>
 .
 
 Although HTML5 does accept some entities without a trailing semicolon
-(such as `©`), these are not recognized as entities here:
+(such as `©`), these are not recognized as entities here, because it makes the grammar too ambiguous:
 
 .
 ©
@@ -3775,13 +3787,12 @@ Although HTML5 does accept some entities without a trailing semicolon
 <p>&copy</p>
 .
 
-On the other hand, many strings that are not on the list of HTML5
-named entities are recognized as entities here:
+Strings that are not on the list of HTML5 named entities are not recognized as entities either:
 
 .
 &MadeUpEntity;
 .
-<p>&MadeUpEntity;</p>
+<p>&MadeUpEntity;</p>
 .
 
 Entities are recognized in any context besides code spans or
@@ -3797,7 +3808,7 @@ code blocks, including raw HTML, URLs, [link titles](#link-title), and
 .
 [foo](/föö "föö")
 .
-<p><a href="/föö" title="föö">foo</a></p>
+<p><a href="/f%C3%B6%C3%B6" title="föö">foo</a></p>
 .
 
 .
@@ -3805,7 +3816,7 @@ code blocks, including raw HTML, URLs, [link titles](#link-title), and
 
 [foo]: /föö "föö"
 .
-<p><a href="/föö" title="föö">foo</a></p>
+<p><a href="/f%C3%B6%C3%B6" title="föö">foo</a></p>
 .
 
 .
@@ -3813,7 +3824,7 @@ code blocks, including raw HTML, URLs, [link titles](#link-title), and
 foo
 ```
 .
-<pre><code class="language-föö">foo
+<pre><code class="language-föö">foo
 </code></pre>
 .
 
@@ -3946,7 +3957,7 @@ But this is a link:
 .
 <http://foo.bar.`baz>`
 .
-<p><a href="http://foo.bar.`baz">http://foo.bar.`baz</a>`</p>
+<p><a href="http://foo.bar.%60baz">http://foo.bar.`baz</a>`</p>
 .
 
 And this is an HTML tag:
@@ -4755,7 +4766,7 @@ braces:
 .
 [link](</my uri>)
 .
-<p><a href="/my uri">link</a></p>
+<p><a href="/my%20uri">link</a></p>
 .
 
 The destination cannot contain line breaks, even with pointy braces:
@@ -4806,12 +4817,14 @@ in Markdown:
 <p><a href="foo):">link</a></p>
 .
 
-URL-escaping and entities should be left alone inside the destination:
+URL-escaping and should be left alone inside the destination, as all URL-escaped characters
+are also valid URL characters. HTML entities in the destination will be parsed into their UTF8
+codepoints, as usual, and optionally URL-escaped when written as HTML.
 
 .
 [link](foo%20bä)
 .
-<p><a href="foo%20bä">link</a></p>
+<p><a href="foo%20b%C3%A4">link</a></p>
 .
 
 Note that, because titles can often be parsed as destinations,
@@ -4821,7 +4834,7 @@ get unexpected results:
 .
 [link]("title")
 .
-<p><a href=""title"">link</a></p>
+<p><a href="%22title%22">link</a></p>
 .
 
 Titles may be in single quotes, double quotes, or parentheses:
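
The behaviour these hunks describe (entities decoded to their characters before reaching the AST, invalid codepoints replaced by U+FFFD, link destinations percent-escaped on HTML output) can be sketched in Python roughly as follows. This is an illustrative sketch, not the C parser from the merged pull request: the names `decode_entities`, `escape_href` and `HREF_SAFE` are hypothetical, the set of characters left unescaped is an assumption chosen to reproduce the expected outputs in the diff, and treating 0, surrogates and values above 0x10FFFF as "invalid" is likewise an assumption.

```python
import re
from html.entities import html5  # maps names such as "copy;" to their characters
from urllib.parse import quote

# &#x + 1-8 hex digits, &# + 1-8 decimal digits, or & + name; all require a trailing ";".
ENTITY_RE = re.compile(
    r"&(?:#[xX]([0-9a-fA-F]{1,8})|#([0-9]{1,8})|([A-Za-z][A-Za-z0-9]*));"
)

# Assumption: characters left untouched in an href; "%" is included so that
# existing URL escapes such as %20 pass through unchanged.
HREF_SAFE = "/:?#=&%*+,;@!$'()"


def _char_for(codepoint: int) -> str:
    """Return the character for a codepoint, or U+FFFD if it is not usable."""
    if codepoint == 0 or codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        return "\ufffd"
    return chr(codepoint)


def decode_entities(text: str) -> str:
    """Replace recognized entities with their characters; leave the rest as literal text."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits, name = match.groups()
        if hex_digits is not None:
            return _char_for(int(hex_digits, 16))
        if dec_digits is not None:
            return _char_for(int(dec_digits, 10))
        # Unknown names (e.g. &MadeUpEntity;) are returned unchanged.
        return html5.get(name + ";", match.group(0))
    return ENTITY_RE.sub(replace, text)


def escape_href(url: str) -> str:
    """Percent-encode a link destination as UTF-8 for use in an href attribute."""
    return quote(url, safe=HREF_SAFE)


print(escape_href(decode_entities("/f&ouml;&ouml;")))
print(escape_href("foo%20b" + decode_entities("&auml;")))
```

Decoding first and escaping only at output time mirrors the split described in the new spec text: the AST holds plain characters, and each renderer applies whatever escaping its target format needs.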