
2.7   How can I represent esoteric characters (e.g. character entities) in a document?

For example, say you want an em-dash (XML character entity &mdash;, Unicode character \u2014) in your document: use a real em-dash. Insert concrete characters (e.g. type a real em-dash) into your input file, using whatever encoding suits your application, and tell Docutils the input encoding. Docutils uses Unicode internally, so the em-dash character is a real em-dash internally.
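As a rough illustration of what "a real em-dash in a concrete encoding" means, the following sketch writes one to a file as UTF-8 and reads the bytes back. It uses only the Python standard library and does not call Docutils; the file name and the choice of UTF-8 are assumptions made for the example.

```python
import tempfile
from pathlib import Path

EM_DASH = "\u2014"  # the real em-dash character, U+2014

# Sample input text; the wording is made up for this example.
source = f"Docutils{EM_DASH}a documentation toolkit."

# Save the text in a concrete encoding (UTF-8 here, an arbitrary choice).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"  # hypothetical file name
    path.write_text(source, encoding="utf-8")
    raw = path.read_bytes()

# On disk, the em-dash is stored as the three bytes E2 80 94.
print(raw)

# Decoding with the same encoding restores the single character.
decoded = raw.decode("utf-8")
print(decoded == source)
```

Whatever encoding is used when saving the file is the one to declare to Docutils as the input encoding.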

ReStructuredText has no character entity subsystem; it doesn't know anything about XML charents. To Docutils, "&mdash;" in input text is 7 discrete characters; no interpretation happens. When writing HTML, the "&" is converted to "&amp;", so in the raw output you'd see "&amp;mdash;". There's no difference in interpretation for text inside or outside inline literals or literal blocks -- there's no character entity interpretation in either case.
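The effect can be imitated with the Python standard library. This is only a sketch of the escaping behaviour described above, using `html.escape` as a stand-in; it is not the code the Docutils HTML writer actually runs.

```python
import html

text = "&mdash;"  # seven ordinary characters as far as the parser is concerned

# Count the characters: '&', 'm', 'd', 'a', 's', 'h', ';'
length = len(text)
print(length)

# HTML escaping turns the ampersand into "&amp;", so the entity
# reference is shown literally by a browser instead of as a dash.
escaped = html.escape(text)
print(escaped)
```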

If you can't use a Unicode-compatible encoding and must rely on 7-bit ASCII, there is a workaround. Files containing character entity set substitution definitions using the "unicode" directive are available (tarball). A description and instructions for use are here. Thanks to David Priest for the original idea. Incorporating these files into Docutils is on the to-do list.
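To give an idea of what such substitution definitions look like, the sketch below generates a few of them from Python's own entity table. The function name and the choice of entities are assumptions for illustration; the output is not claimed to match the contents of the distributed files.

```python
from html.entities import name2codepoint


def unicode_substitution(name: str) -> str:
    """Build a reStructuredText substitution definition for one entity name.

    Hypothetical helper: it maps an entity name such as "mdash" to a
    definition that uses the "unicode" directive.
    """
    codepoint = name2codepoint[name]
    return f".. |{name}| unicode:: U+{codepoint:04X}"


# Generate definitions for a handful of entity names (arbitrary selection).
definitions = [unicode_substitution(n) for n in ("mdash", "copy", "hellip")]
for line in definitions:
    print(line)
```

With a definition such as the first one in place, a pure-ASCII document can write the substitution reference `|mdash|` where the character is wanted.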

If you insist on using XML-style charents, you'll have to implement a pre-processing system to convert to UTF-8 or something. That introduces complications though; you can no longer write about charents naturally; instead of writing "&mdash;" you'd have to write "&amp;mdash;".
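A minimal sketch of such a pre-processing step, assuming Python's `html.unescape` is an acceptable converter (it recognises the HTML5 entity names, which is a superset of what a given XML entity set defines). The function name and sample text are made up for the example.

```python
import html


def expand_charents(text: str) -> str:
    """Replace character entity references with the characters they name.

    Hypothetical pre-processor: run it over the source text before
    handing the result to Docutils.
    """
    return html.unescape(text)


source = "Use &mdash; for a dash; write &amp;mdash; to show the entity."
expanded = expand_charents(source)

# The first reference becomes a real em-dash (U+2014); the doubly escaped
# one becomes the literal text "&mdash;", which is the complication
# described above.
print(expanded)

# Encode the result for storage, e.g. as UTF-8.
utf8_bytes = expanded.encode("utf-8")
```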
