Added definitions of whitespace and other character classes.

Closes #108.
author: John MacFarlane <jgm@berkeley.edu> 2014-12-23 17:24:14 -0700
committer: John MacFarlane <jgm@berkeley.edu> 2014-12-23 17:27:39 -0700
commit: 8b44dab7b3465445ac4137dc7893665f2336024b (patch)
tree: fba43806eeed0e3f2f13af5bbe1b001d2fcba5c6
parent: c25ca523790f1e7ed222bd7de2245be1c60bf441 (diff)
1 file changed, 100 insertions, 62 deletions
diff --git a/spec.txt b/spec.txt
index 3217e6c..bb7e620 100644
--- a/spec.txt
+++ b/spec.txt
@@ -189,17 +189,61 @@ Markdown, which can then be converted into other formats.
 
 In the examples, the `→` character is used to represent tabs.
 
-# Preprocessing
+# Preliminaries
+
+## Characters and lines
+
+The input is a sequence of zero or more [lines](#line).
 
 A [line](@line)
 is a sequence of zero or more [characters](#character) followed by a
-line ending (CR, LF, or CRLF) or by the end of file.
+[line ending](#line-ending) or by the end of file.
 
 A [character](@character) is a unicode code point.
 This spec does not specify an encoding; it thinks of lines as composed
 of characters rather than bytes.  A conforming parser may be limited
 to a certain encoding.
 
+A [line ending](@line-ending) is, depending on the platform, a
+newline (`U+000A`), carriage return (`U+000D`), or
+carriage return + newline.
+
+For security reasons, a conforming parser must strip or replace the
+Unicode character `U+0000`.
+
+A line containing no characters, or a line containing only spaces
+(`U+0020`) or tabs (`U+0009`), is called a [blank line](@blank-line).
+
+The following definitions of character classes will be used in this spec:
+
+A [whitespace character](@whitespace-character) is a space
+(`U+0020`), tab (`U+0009`), carriage return (`U+000D`), or
+newline (`U+000A`).
+
+[Whitespace](@whitespace) is a sequence of one or more [whitespace
+characters](#whitespace-character).
+
+A [unicode whitespace character](@unicode-whitespace-character) is
+any code point in the unicode `Zs` class, or a tab (`U+0009`),
+carriage return (`U+000D`), newline (`U+000A`), or form feed
+(`U+000C`).
+
+[Unicode whitespace](@unicode-whitespace) is a sequence of one
+or more [unicode whitespace characters](#unicode-whitespace-character).
+
+A [non-space character](@non-space-character) is anything but `U+0020`.
+
+A [punctuation character](@punctuation-character) is anything in
+the unicode classes `Pc`, `Pd`, `Pe`, `Pf`, `Pi`, `Po`, or `Ps`.
+
+An [ASCII punctuation character](@ascii-punctuation-character)
+is a [punctuation character](#punctuation-character) in the
+ASCII class: that is, `!`, `"`, `#`, `$`, `%`, `&`, `'`, `(`, `)`,
+`*`, `+`, `,`, `-`, `.`, `/`, `:`, `;`, `<`, `=`, `>`, `?`, `@`,
+`[`, `\`, `]`, `^`, `_`, `` ` ``, `{`, `|`, `}`, or `~`.
+
+## Tab expansion
+
 Tabs in lines are expanded to spaces, with a tab stop of 4 characters:
 
 .
@@ -218,14 +262,6 @@ Tabs in lines are expanded to spaces, with a tab stop of 4 characters:
 </code></pre>
 .
 
-Line endings are replaced by newline characters (LF).
-
-A line containing no characters, or a line containing only spaces (after
-tab expansion), is called a [blank line](@blank-line).
-
-For security reasons, a conforming parser must strip or replace the
-Unicode character `U+0000`.
-
 # Blocks and inlines
 
 We can think of a document as a sequence of
@@ -394,7 +430,8 @@ a------
 <p>---a---</p>
 .
 
-It is required that all of the non-space characters be the same.
+It is required that all of the
+[non-space characters](#non-space-character) be the same.
 So, this is not a horizontal rule:
 
 .
@@ -952,9 +989,9 @@ An [indented code block](@indented-code-block) is composed of one or more
 [indented chunks](#indented-chunk) separated by blank lines.
 An [indented chunk](@indented-chunk) is a sequence of non-blank lines,
 each indented four or more spaces. The contents of the code block are
-the literal contents of the lines, including trailing newlines,
-minus four spaces of indentation. An indented code block has no
-attributes.
+the literal contents of the lines, including trailing
+[line endings](#line-ending), minus four spaces of indentation.
+An indented code block has no attributes.
 
 An indented code block cannot interrupt a paragraph, so there must be
 a blank line between a paragraph and a following indented code block.
@@ -1750,14 +1787,14 @@ So there is no important loss of expressive power with the new rule.
 ## Link reference definitions
 
 A [link reference definition](@link-reference-definition)
-consists of a [link
-label](#link-label), indented up to three spaces, followed
-by a colon (`:`), optional blank space (including up to one
-newline), a [link destination](#link-destination), optional
-blank space (including up to one newline), and an optional [link
+consists of a [link label](#link-label), indented up to three spaces, followed
+by a colon (`:`), optional [whitespace](#whitespace) (including up to one
+[line ending](#line-ending)), a [link destination](#link-destination),
+optional [whitespace](#whitespace) (including up to one
+[line ending](#line-ending)), and an optional [link
 title](#link-title), which if it is present must be separated
-from the [link destination](#link-destination) by whitespace.
-No further non-space characters may occur on the line.
+from the [link destination](#link-destination) by [whitespace](#whitespace).
+No further [non-space characters](#non-space-character) may occur on the line.
 
 A [link reference-definition](#link-reference-definition)
 does not correspond to a structural element of a document.  Instead, it
@@ -1874,7 +1911,7 @@ It contributes nothing to the document.
 .
 
 This is not a link reference definition, because there are
-non-space characters after the title:
+[non-space characters](#non-space-character) after the title:
 
 .
 [foo]: /url "title" ok
@@ -2133,7 +2170,8 @@ The following rules define [block quotes](@block-quote):
 2.  **Laziness.**  If a string of lines *Ls* constitute a [block
     quote](#block-quote) with contents *Bs*, then the result of deleting
     the initial [block quote marker](#block-quote-marker) from one or
-    more lines in which the next non-space character after the [block
+    more lines in which the next
+    [non-space character](#non-space-character) after the [block
     quote marker](#block-quote-marker) is [paragraph continuation
     text](#paragraph-continuation-text) is a block quote with *Bs* as
     its content.
@@ -2494,7 +2532,8 @@ is a sequence of one or more digits (`0-9`), followed by either a
 The following rules define [list items](@list-item):
 
 1.  **Basic case.**  If a sequence of lines *Ls* constitute a sequence of
-    blocks *Bs* starting with a non-space character and not separated
+    blocks *Bs* starting with a [non-space character](#non-space-character)
+    and not separated
     from each other by more than one blank line, and *M* is a list
     marker *M* of width *W* followed by 0 < *N* < 5 spaces, then the result
     of prepending *M* and the following spaces to the first line of
@@ -2972,7 +3011,7 @@ Four spaces indent gives a code block:
 4.  **Laziness.**  If a string of lines *Ls* constitute a [list
     item](#list-item) with contents *Bs*, then the result of deleting
     some or all of the indentation from one or more lines in which the
-    next non-space character after the indentation is
+    next [non-space character](#non-space-character) after the indentation is
     [paragraph continuation text](#paragraph-continuation-text) is a
     list item with the same contents and attributes.  The unindented
     lines are called
@@ -4174,11 +4213,11 @@ A [backtick string](@backtick-string)
 is a string of one or more backtick characters (`` ` ``) that is neither
 preceded nor followed by a backtick.
 
-A [code span](@code-span) begins with a backtick string and ends with a backtick
-string of equal length.  The contents of the code span are the
-characters between the two backtick strings, with leading and trailing
-spaces and newlines removed, and consecutive spaces and newlines
-collapsed to single spaces.
+A [code span](@code-span) begins with a backtick string and ends with
+a backtick string of equal length.  The contents of the code span are
+the characters between the two backtick strings, with leading and
+trailing spaces and [line endings](#line-ending) removed, and
+[whitespace](#whitespace) collapsed to single spaces.
 
 This is a simple code span:
 
@@ -4206,7 +4245,7 @@ spaces:
 <p><code>``</code></p>
 .
 
-Newlines are treated like spaces:
+[Line endings](#line-ending) are treated like spaces:
 
 .
 ``
@@ -4216,8 +4255,8 @@ foo
 <p><code>foo</code></p>
 .
 
-Interior spaces and newlines are collapsed into single spaces, just
-as they would be by a browser:
+Interior spaces and [line endings](#line-ending) are collapsed into
+single spaces, just as they would be by a browser:
 
 .
 `foo   bar
@@ -4231,13 +4270,13 @@ anyway?  A:  Because we might be targeting a non-HTML format, and we
 shouldn't rely on HTML-specific rendering assumptions.
 
 (Existing implementations differ in their treatment of internal
-spaces and newlines.  Some, including `Markdown.pl` and
-`showdown`, convert an internal newline into a `<br />` tag.
-But this makes things difficult for those who like to hard-wrap
-their paragraphs, since a line break in the midst of a code
-span will cause an unintended line break in the output.  Others
-just leave internal spaces as they are, which is fine if only
-HTML is being targeted.)
+spaces and [line endings](#line-ending).  Some, including `Markdown.pl` and
+`showdown`, convert an internal [line ending](#line-ending) into a
+`<br />` tag.  But this makes things difficult for those who like to
+hard-wrap their paragraphs, since a line break in the midst of a code
+span will cause an unintended line break in the output.  Others just
+leave internal spaces as they are, which is fine if only HTML is being
+targeted.)
 
 .
 `foo `` bar`
@@ -4355,34 +4394,32 @@ The following rules capture all of these patterns, while allowing
 for efficient parsing strategies that do not backtrack:
 
 1.  A single `*` character [can open emphasis](@can-open-emphasis)
-    iff it is not followed by whitespace.  (For these purposes,
-    any unicode space character counts as whitespace.)
+    iff it is not followed by [unicode whitespace](#unicode-whitespace).
 
 2.  A single `_` character [can open emphasis](#can-open-emphasis) iff
-    it is not followed by whitespace and it is not preceded by an
-    ASCII alphanumeric character.
+    it is not followed by [unicode whitespace](#unicode-whitespace)
+    and it is not preceded by an ASCII alphanumeric character.
 
 3.  A single `*` character [can close emphasis](@can-close-emphasis)
-    iff it is not preceded by whitespace.
+    iff it is not preceded by [unicode whitespace](#unicode-whitespace).
 
 4.  A single `_` character [can close emphasis](#can-close-emphasis) iff
-    it is not preceded by whitespace and it is not followed by an
-    ASCII alphanumeric character.
+    it is not preceded by [unicode whitespace](#unicode-whitespace)
+    and it is not followed by an ASCII alphanumeric character.
 
 5.  A double `**` [can open strong emphasis](@can-open-strong-emphasis)
-    iff it is not followed by
-    whitespace.
+    iff it is not followed by [unicode whitespace](#unicode-whitespace).
 
 6.  A double `__` [can open strong emphasis](#can-open-strong-emphasis)
-    iff it is not followed by whitespace and it is not preceded by an
-    ASCII alphanumeric character.
+    iff it is not followed by [unicode whitespace](#unicode-whitespace)
+    and it is not preceded by an ASCII alphanumeric character.
 
 7.  A double `**` [can close strong emphasis](@can-close-strong-emphasis)
-    iff it is not preceded by whitespace.
+    iff it is not preceded by [unicode whitespace](#unicode-whitespace).
 
 8.  A double `__` [can close strong emphasis](#can-close-strong-emphasis)
-    iff it is not preceded by whitespace and it is not followed by an
-    ASCII alphanumeric character.
+    iff it is not preceded by [unicode whitespace](#unicode-whitespace)
+    and it is not followed by an ASCII alphanumeric character.
 
 9.  Emphasis begins with a delimiter that [can open
     emphasis](#can-open-emphasis) and ends with a delimiter that [can close
@@ -6610,8 +6647,8 @@ baz
 baz</p>
 .
 
-For a more visible alternative, a backslash before the newline may be
-used instead of two spaces:
+For a more visible alternative, a backslash before the
+[line ending](#line-ending) may be used instead of two spaces:
 
 .
 foo\
@@ -6734,9 +6771,10 @@ foo
 
 A regular line break (not in a code span or HTML tag) that is not
 preceded by two or more spaces is parsed as a softbreak.  (A
-softbreak may be rendered in HTML either as a newline or as a space.
-The result will be the same in browsers. In the examples here, a
-newline will be used.)
+softbreak may be rendered in HTML either as a
+[line ending](#line-ending) or as a space. The result will be the same
+in browsers. In the examples here, a [line ending](#line-ending) will
+be used.)
 
 .
 foo
@@ -6971,9 +7009,9 @@ document
           str "aliquando id"
 ```
 
-Notice how the newline in the first paragraph has been parsed as
-a `softbreak`, and the asterisks in the first list item have become
-an `emph`.
+Notice how the [line ending](#line-ending) in the first paragraph has
+been parsed as a `softbreak`, and the asterisks in the first list item
+have become an `emph`.
 
 The document can be rendered as HTML, or in any other format, given
 an appropriate renderer.
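
The character classes introduced by this patch can be sketched in Python as below. This is an illustrative sketch, not part of the commit or of any reference implementation: every function name here is hypothetical, and each function assumes a one-character string as input (except `is_blank_line` and `normalize_code_span`, which take a line without its line ending and the raw contents of a code span, respectively). The ASCII punctuation check uses the enumerated list from the spec text directly, because some listed characters (for example `$` and `+`) fall in Unicode symbol categories rather than the `P*` categories.

```python
import re
import unicodedata

# ASCII punctuation exactly as enumerated in the added spec text.
ASCII_PUNCTUATION = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

# Unicode general categories the spec lists for punctuation characters.
PUNCTUATION_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def is_whitespace_char(ch):
    """Space, tab, carriage return, or newline."""
    return ch in (" ", "\t", "\r", "\n")


def is_unicode_whitespace_char(ch):
    """Any code point in class Zs, or tab, CR, newline, or form feed."""
    return unicodedata.category(ch) == "Zs" or ch in ("\t", "\r", "\n", "\f")


def is_non_space_char(ch):
    """Anything but U+0020."""
    return ch != " "


def is_punctuation_char(ch):
    """Anything in Unicode classes Pc, Pd, Pe, Pf, Pi, Po, or Ps."""
    return unicodedata.category(ch) in PUNCTUATION_CATEGORIES


def is_ascii_punctuation_char(ch):
    """Membership in the enumerated ASCII list from the spec."""
    return ch in ASCII_PUNCTUATION


def is_blank_line(line):
    """No characters, or only spaces and tabs (line ending excluded)."""
    return all(ch in (" ", "\t") for ch in line)


def normalize_code_span(content):
    """Strip leading/trailing spaces and line endings, then collapse
    each run of whitespace characters to a single space."""
    stripped = content.strip(" \r\n")
    return re.sub(r"[ \t\r\n]+", " ", stripped)
```

For example, `normalize_code_span("foo   bar\n  baz")` yields `foo bar baz`, matching the code span example touched by this diff, and `is_unicode_whitespace_char("\u00a0")` is true while `is_whitespace_char("\u00a0")` is false, which is the distinction the revised emphasis rules rely on.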