Updated string literals description to encompass Unicode literals and the
additional escape sequences defined for Unicode. This closes bug #117158.
2000-12-19 04:52:03 +00:00 · dea764d7f1
parent 1367b83797
commit dea764d7f1
1 changed file with 24 additions and 11 deletions
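
As a rough illustration of the escape sequences this change documents, the sketch below spells one character three ways. It is written for Python 3, which is an assumption on our part: there every string literal is Unicode and no prefix is needed, whereas the patched text describes a release in which a `u` or `U` prefix was required. The variable names are made up for the example.

```python
# Illustrative sketch only, assuming Python 3 (all str literals are Unicode).
# The documentation being patched describes the older u'...' literal form.

# Character looked up by its name in the Unicode database.
named = "\N{LATIN SMALL LETTER A WITH GRAVE}"

# Character given by a 16-bit hex value: exactly four hex digits.
short_form = "\u00e0"

# Character given by a 32-bit hex value: exactly eight hex digits.
long_form = "\U000000e0"

# All three spellings denote the same single character.
all_equal = (named == short_form == long_form)

# In a raw string the backslash is left in the string,
# so this literal holds two characters: a backslash and the letter n.
raw = r"\n"
raw_length = len(raw)
```
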
--- a/Doc/ref/ref2.tex
+++ b/Doc/ref/ref2.tex
@@ -304,6 +304,9 @@ escapeseq:       "\" <any ASCII character>
 \end{verbatim}
 \index{ASCII@\ASCII{}}

+\index{triple-quoted string}
+\index{Unicode Consortium}
+\index{string!Unicode}
 In plain English: String literals can be enclosed in matching single
 quotes (\code{'}) or double quotes (\code{"}).  They can also be
 enclosed in matching groups of three single or double quotes (these
@@ -311,10 +314,12 @@ are generally referred to as \emph{triple-quoted strings}).  The
 backslash (\code{\e}) character is used to escape characters that
 otherwise have a special meaning, such as newline, backslash itself,
 or the quote character.  String literals may optionally be prefixed
-with a letter `r' or `R'; such strings are called raw strings and use
-different rules for backslash escape sequences.
-\index{triple-quoted string}
-\index{raw string}
+with a letter `r' or `R'; such strings are called
+\dfn{raw strings}\index{raw string} and use different rules for
+backslash escape sequences.  A prefix of 'u' or 'U' makes the string
+a Unicode string.  Unicode strings use the Unicode character set as
+defined by the Unicode Consortium and ISO~10646.  Some additional
+escape sequences, described below, are available in Unicode strings.

 In triple-quoted strings,
 unescaped newlines and quotes are allowed (and are retained), except
@@ -339,25 +344,33 @@ to those used by Standard \C{}.  The recognized escape sequences are:
 \lineii{\e b}	{\ASCII{} Backspace (BS)}
 \lineii{\e f}	{\ASCII{} Formfeed (FF)}
 \lineii{\e n}	{\ASCII{} Linefeed (LF)}
+\lineii{\e N\{\var{name}\}}
+       {Character named \var{name} in the Unicode database (Unicode only)}
 \lineii{\e r}	{\ASCII{} Carriage Return (CR)}
 \lineii{\e t}	{\ASCII{} Horizontal Tab (TAB)}
+\lineii{\e u\var{xxxx}}
+       {Character with 16-bit hex value \var{xxxx} (Unicode only)}
+\lineii{\e U\var{xxxxxxxx}}
+       {Character with 32-bit hex value \var{xxxxxxxx} (Unicode only)}
 \lineii{\e v}	{\ASCII{} Vertical Tab (VT)}
-\lineii{\e\var{ooo}} {\ASCII{} character with octal value \emph{ooo}}
-\lineii{\e x\var{hh...}} {\ASCII{} character with hex value \emph{hh...}}
+\lineii{\e\var{ooo}} {\ASCII{} character with octal value \var{ooo}}
+\lineii{\e x\var{hh}} {\ASCII{} character with hex value \var{hh}}
 \end{tableii}
 \index{ASCII@\ASCII{}}

-In strict compatibility with Standard \C, up to three octal digits are
+In strict compatibility with Standard C, up to three octal digits are
 accepted, but an unlimited number of hex digits is taken to be part of
 the hex escape (and then the lower 8 bits of the resulting hex number
 are used in 8-bit implementations).

-Unlike Standard \C{},
+Unlike Standard \index{unrecognized escape sequence}C,
 all unrecognized escape sequences are left in the string unchanged,
-i.e., \emph{the backslash is left in the string.}  (This behavior is
+i.e., \emph{the backslash is left in the string}.  (This behavior is
 useful when debugging: if an escape sequence is mistyped, the
-resulting output is more easily recognized as broken.)
-\index{unrecognized escape sequence}
+resulting output is more easily recognized as broken.)  It is also
+important to note that the escape sequences marked as ``(Unicode
+only)'' in the table above fall into the category of unrecognized
+escapes for non-Unicode string literals.

 When an `r' or `R' prefix is present, backslashes are still used to
 quote the following character, but \emph{all backslashes are left in