CodeToRest.py - a module to translate source code to reST¶
The API lists four functions which convert source code into either reST or HTML. It relies on Implementation to classify the source as code or comment, then _generate_rest to convert this to reST. Supporting reST directives and roles define CodeChat-specific syntax used in this conversion.
Imports¶
These are listed in the order prescribed by PEP 8.
Standard library¶
Third-party imports¶
Local application imports¶
API¶
The following routines provide easy access to the core functionality of this module: code_to_rest_string, code_to_rest_file, code_to_html_string, and code_to_html_file.
code_to_rest_string¶
This function converts a string containing source code to reST, preserving all indentations of both source code and comments. To do so, the comment characters are stripped from reST-formatted comments and all code is placed inside code blocks.
code_str: the code to translate to reST.
See options.
Use a StringIO to capture writes into a string.
Include a header containing some CodeChat style. Don’t put this in a separate .js
file, since docutils doesn’t have an easy way to include it.
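The capture-to-string step described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: `write_rest` and `rest_to_string` are hypothetical names, and the header text passed in is a placeholder for the CodeChat style header.

```python
from io import StringIO


def write_rest(lines, out_file):
    """Write each line of reST to any file-like object (hypothetical helper)."""
    for line in lines:
        out_file.write(line)


def rest_to_string(lines, header=""):
    """Capture everything written by write_rest in a string."""
    # A StringIO behaves like a file, so the same writer serves files and strings.
    out = StringIO()
    # Emit the style header first, then the translated content.
    out.write(header)
    write_rest(lines, out)
    return out.getvalue()
```

Writing to a file-like object first keeps a single code path for both the string and file variants of the API.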
code_to_rest_file¶
Convert a source file to a reST file.
Path to a source code file to process.
Path to a destination reST file to create. It will be overwritten if it
already exists. If not specified, it is source_path.rst.
Encoding to use for the input file. The default of None detects the encoding of the input file.
Encoding to use for the output file.
See options.
Provide a default rst_path.
Use docutils’ I/O classes to better handle and sniff encodings.
Note: both these classes automatically close themselves after a read or write.
If not already present, provide the filename of the source to help in identifying a lexer.
code_to_html_string¶
This converts a string containing source code to HTML, which it returns as a string.
See code_str.
A file-like object where warnings and errors will be written, or None to send them to stderr.
See options.
docutils converts reST to HTML.
Include our custom css file: provide the path to the default css and then to our css. The style sheet dirs must include docutils defaults.
Make sure to use Unicode everywhere.
Don’t stop processing, no matter what.
Capture errors to a string and return it.
code_to_html_file¶
Convert source code stored in a file to HTML, which is saved in another file.
See source_path.
Destination file name to hold the generated HTML. This file will be
overwritten. If not supplied, source_path.html will be assumed.
See input_encoding.
See output_encoding.
output_encoding="utf-8",
):
html_path = html_path or source_path + ".html"
fi = io.FileInput(source_path=source_path, encoding=input_encoding)
fo = io.FileOutput(destination_path=html_path, encoding=output_encoding)
code_str = fi.read()
lexer = get_lexer(filename=source_path, code=code_str)
html = code_to_html_string(code_str, lexer=lexer)
fo.write(html)
Styling¶
Provide a programmatic way to get a list of paths to static files needed by CodeChat. When using Sphinx, this should be assigned or appended to html_static_path.
There’s only one path needed – css/, relative to this directory. TODO: this won’t correctly handle resources that get placed in temp files, such as a zip file.
Provide correct formatting of CodeChat-produced documents in reST based on the CodeChat style.
Converting classified code to reST¶
Approach¶
When creating a reST block containing code or comments, two difficulties arise: preserving the indentation of both source code and comments; and preserving empty lines of code at the beginning or end of a block of code. In the following examples, examine both the source code and the resulting HTML to get the full picture, since the text below is (after all) in reST, and will therefore be transformed to HTML.
First, consider a method to preserve empty lines of code. Consider, for example, the following:
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
# Do something
foo = 1
# Do something else
bar = 2
|
Do something foo = 1
Do something else bar = 2
|
<p>Do something:</p>
<pre>foo = 1
</pre>
<p>Do something else:</p>
<pre>bar = 2
</pre>
|
In this example, the blank line is lost, since reST allows the literal block
containing foo = 1 to end with multiple blank lines; the resulting HTML
contains only one newline between each of these lines. To solve this, some CSS
hackery helps tighten up spacing between lines. In addition, this routine adds
a one-line fence, removed during processing, at the beginning and end of each
code block to preserve blank lines. The new translation becomes:
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
# Do something
foo = 1
# Do something else
bar = 2
|
Do something Do something else |
<p>Do something:</p>
<pre>foo = 1
</pre>
<p>Do something else:</p>
<pre>bar = 2
</pre>
|
Preserving indentation¶
The same fence approach also preserves indentation. Without the fences, indentation is consumed by the reST parser:
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
# One space indent
foo = 1
# No indent
bar = 2
|
One space indent foo = 1
No indent bar = 2
|
<p>One space indent</p>
<pre>foo = 1
</pre>
<p>No indent</p>
<pre>bar = 2
</pre>
|
With fences added:
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
# One space indent
foo = 1
# No indent
bar = 2
|
One space indent No indent |
<p>One space indent</p>
<pre> foo = 1
</pre>
<p>No indent</p>
<pre>bar = 2
</pre>
|
Preserving indentation for comments is more difficult. Blockquotes in reST are defined by common indentation, so that any number of (common) spaces define a blockquote. So, the distance between a zero and two-space indent is the same as the distance between a two-space and a three-space indent; we need the second indent to be half the size of the first.
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
# No indent
# Two space indent
# Three space indent
|
No indent
|
<p>No indent</p>
<blockquote>Two space indent
<blockquote>Three space
indent
</blockquote>
</blockquote>
|
To fix this, the raw directive is used to insert a pair of <div> and </div>
HTML elements which set the left margin of indented text based on how many
spaces it is indented (0.5 em = 1 space).
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
# No indent
# Two space indent
# Three space indent
|
No indent Two space indent Three space indent |
<p>No indent</p>
<div style="margin-left:1.0em">
<p>Two space indent</p>
</div>
<div style="margin-left:1.5em;">
<p>Three space indent</p>
</div>
|
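The margin rule above (0.5 em per space of indent) can be sketched as follows. The function names are hypothetical and the exact markup the module emits may differ; this only reproduces the arithmetic shown in the table.

```python
def open_indent_div(indent_spaces, em_per_space=0.5):
    """Return an opening <div> whose left margin matches a comment's indent."""
    # Two spaces give 1.0em and three give 1.5em, so each extra space adds 0.5em.
    margin = em_per_space * indent_spaces
    return f'<div style="margin-left:{margin}em;">'


def wrap_comment(text, indent_spaces):
    """Wrap a comment paragraph in an indenting <div> when it is indented."""
    if indent_spaces == 0:
        # Unindented comments need no wrapper.
        return f"<p>{text}</p>"
    return f"{open_indent_div(indent_spaces)}\n<p>{text}</p>\n</div>"
```

Because the margin is computed per space rather than per nesting level, a three-space indent sits half a step beyond a two-space indent instead of a full blockquote level deeper.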
Following either a fenced code block or a raw block, care must be taken to separate these blocks’ content from indented comments which follow them. For example, the following code:
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
def foo():
# Indented comment
|
<pre>def foo():
Ending fence
|
Notice that the Ending fence ends up in the resulting HTML! To fix this,
simply add an unindented reST comment after a block.
Python source |
Translated to reST |
Translated to (simplified) HTML |
---|---|---|
def foo():
# Indented comment
|
|
<pre>def foo():
</pre>
<blockquote>
Indented comment
</blockquote>
|
Mixed code and comments¶
Note that mixing code and comments is hard: reST will still apply some of its parsing rules to an inline code block or inline literal, meaning that leading or trailing spaces and backticks will not be preserved, instead parsing incorrectly. For example,
1 :code:` Testing `
renders incorrectly. So, mixed lines are simply translated as code, meaning reST markup can’t be applied to the comments.
Summary and implementation¶
This boils down to the following basic rules:
Code blocks must be preceded and followed by a marker (a fence) which is removed during processing.
Comments must be preceded and followed by reST which sets the left margin based on the number of spaces before the comment.
Both must be followed by an empty, unindented reST comment.
_generate_rest¶
Generate reST from the classified code. To do this, create a state machine,
where current_type defines the state. When the state changes, exit the
previous state (output a closing fence or closing </div>), then enter the
new state (output a fenced code block or an opening <div style=...>).
An iterable of (type, string) pairs, one per line.
A file-like output to which the reST text is written.
Keep track of the current type. Begin with a 0-indent comment.
Keep track of the current line number.
See if there’s a change in state.
Exit the current state.
Enter the new state.
Code state: emit the beginning of a fenced block.
Comment state: emit an opening indent for non-zero indents.
Add an indent if needed.
Specify the line number in the source, so that errors will be accurately reported. This isn’t necessary in code blocks, since errors can’t occur.
After this directive, the following text may be indented, which would make
it a part of the set-line directive! So, include an empty comment to
terminate the set-line, making any following indents a separate syntactical
element. See the end of reST comment syntax for more discussion.
Output string based on state. All code needs an initial space to place it inside the fenced-code block.
Update the state.
When done, exit the last state.
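The state machine described above might look roughly like the sketch below. Several details are assumptions made for illustration: the type encoding (`-1` for code, a non-negative indent for comments), the `fenced-code` directive name, the `Beginning fence` text, and the exact `raw` markup. The set-line output is omitted for brevity.

```python
from io import StringIO

CODE = -1  # Assumed encoding: -1 marks code; comments carry their indent (>= 0).


def exit_state(type_, out_file):
    """Emit whatever closes the given state."""
    if type_ == CODE:
        # Closing fence, then an empty unindented comment to end the block.
        out_file.write(" Ending fence\n\n..\n\n")
    elif type_ > 0:
        # Close the indenting <div> opened for an indented comment.
        out_file.write("\n.. raw:: html\n\n </div>\n\n..\n\n")
    # A 0-indent comment (also the initial state) needs nothing.


def generate_rest(classified_lines, out_file):
    """Write reST for an iterable of (type, string) pairs, one per line."""
    current_type = 0  # Begin as a 0-indent comment.
    for type_, text in classified_lines:
        if type_ != current_type:
            exit_state(current_type, out_file)
            if type_ == CODE:
                out_file.write("\n.. fenced-code::\n\n Beginning fence\n")
            elif type_ > 0:
                margin = 0.5 * type_
                out_file.write(
                    "\n.. raw:: html\n\n"
                    f' <div style="margin-left:{margin}em;">\n\n..\n\n'
                )
        # Code needs one leading space to sit inside the fenced block.
        out_file.write(" " + text if type_ == CODE else text)
        current_type = type_
    exit_state(current_type, out_file)


out = StringIO()
generate_rest([(0, "Do something\n"), (CODE, "foo = 1\n")], out)
rest_text = out.getvalue()
```

Keeping the exit logic in its own function means the final call after the loop closes whichever state the input ended in.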
_exit_state¶
Output text produced when exiting a state. Supports _generate_rest.
The type (classification) of the last line.
See out_file.
Code state: emit an ending fence.
Comment state: emit a closing indent.
Initial state. Nothing needed.
Supporting reST directives and roles¶
_FencedCodeBlock¶
Create a fenced code block: the first and last lines are presumed to be fences, which keep the parser from discarding whitespace. Drop these, then treat everything else as code.
See the directive docs for more information.
The content must contain at least two lines (the fences).
Remove the fences.
By default, the Pygments stripnl option is True, causing Pygments to drop any empty lines. The reST parser converts a line containing only spaces to an empty line, which will then be stripped by Pygments if these are leading or trailing newlines. So, add a space back in to keep these lines from being dropped.
So, first add spaces from the beginning of the lines until we reach the first non-blank line.
If we’ve seen all the content, then don’t do it again – we’d be adding unnecessary spaces. Otherwise, walk from the end of the content backwards, adding spaces until the first non-blank line.
Recall Python indexing: while 0 is the first element in a list, -1 is the last element, so offset all indices by -1.
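The fence removal and blank-line padding described above can be sketched on a plain list of strings. The real directive operates on docutils content objects; `strip_fences_keep_blanks` is a hypothetical stand-in that shows only the list logic.

```python
def strip_fences_keep_blanks(content):
    """Drop the first and last lines (fences), then pad outer blank lines."""
    if len(content) < 2:
        raise ValueError("A fenced block needs at least two lines (the fences).")
    lines = list(content[1:-1])

    # Walk forward, turning leading empty lines into a single space so a
    # later strip of leading newlines leaves them alone.
    first = 0
    while first < len(lines) and lines[first] == "":
        lines[first] = " "
        first += 1

    # If every line was blank, all were padded already; otherwise walk
    # backward from the end doing the same for trailing empty lines.
    if first < len(lines):
        last = len(lines) - 1
        while lines[last] == "":
            lines[last] = " "
            last -= 1
    return lines
```

Interior blank lines are left untouched, since only leading and trailing newlines are at risk of being stripped.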
Mark all fenced code with a specific class, for styling.
Now, process the resulting contents as a code block.
Sphinx fix: if the current highlight language is python, “Normal Python code
is only highlighted if it is parseable” (quoted from the previous link). This
means code snippets, such as def foo(): won’t be highlighted: Python wants
def foo(): pass, for example. To get around this, setting the highlight_args
option “force”=True skips the parsing. I found this in
Sphinx.highlighting.highlight_block (see the force argument) and in
Sphinx.writers.html.HTMLWriter.visit_literal_block, where the code-block
directive (which supports fragments of code, not just parseable code) has
highlight_args['force'] = True set. This should be ignored by docutils, so
it’s done for both Sphinx and docutils. Note: This is based on examining
Sphinx 1.3.1 source code.
Note that the nodeList returned by the CodeBlock directive contains only a
single literal_block node. The setting should be applied to it.
The HtmlFormatter, one of the Pygments formatters, supports the lineanchors option, which adds a unique ID to each formatted line. Enable this for future use with ctags.
To get useful results, line numbers must be accurate; otherwise, they restart at 1 for each block. See _SetLine for details on how to extract the current line number.
_SetLine¶
This directive allows changing the line number at which errors will be
reported. .. set-line:: 10 makes the current line report as line 10,
regardless of its actual location in the file.
The input_lines class (see docutils.statemachine.ViewList) maintains two
lists, data and items. data is a list of strings, one per line, of reST to
process. items is a list of (source, offset) giving the source of each line
and the offset of each line from the beginning of its source. Modifying the
offset will change the reported error location.
Line renumbering should begin at this offset in il.items, which is the line
currently being processed.
Walk through the current input_lines up through all its parents.
Walk from the current line to the end of the current file, rewriting the offset (that is, the effective line number).
If the source file changes, stop renumbering. The current_source must have
been included; only renumber this, not lines from another source which
included current_source.
Adjust the offset when moving up to the parent.
This directive creates no nodes.
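The renumbering walk can be illustrated on a plain list of (source, offset) pairs. This sketch does not touch docutils or the parent chain of views, and the numbering convention (the line at the start index takes the requested value) is an assumption. `renumber_offsets` is a hypothetical name.

```python
def renumber_offsets(items, start_index, new_offset):
    """Return a copy of items renumbered from start_index onward.

    items is a list of (source, offset) pairs, one per line. Renumbering
    stops as soon as the source changes, so lines that belong to another
    file keep their original offsets.
    """
    result = list(items)
    if start_index >= len(result):
        return result
    current_source = result[start_index][0]
    offset = new_offset
    for index in range(start_index, len(result)):
        source, _ = result[index]
        if source != current_source:
            # A different source begins here: leave it and everything after.
            break
        result[index] = (source, offset)
        offset += 1
    return result
```

For example, renumbering from index 1 with a new offset of 10 rewrites only the remaining lines of the same source and leaves a following file untouched.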
_docname_role¶
Create a role which returns a specified part of the docname (the path to the current document). Syntax: :docname:`attr` returns the attr attribute of the docname as a Path. For example, :docname:`name` would return the name of the current docname.
This function returns a tuple of two values:
A list of nodes which will be inserted into the document tree at the point where the interpreted role was encountered (can be an empty list).
A list of system messages, which will be inserted into the document tree immediately after the end of the current block (can also be empty).
The local name of the interpreted role, the role name actually used in the document.
A string containing the entire interpreted text input, including the role and markup. Return it as a problematic node linked to a system message if a problem is encountered.
The interpreted text content.
The line number where the interpreted text begins.
The docutils.parsers.rst.states.Inliner object that called this function. It contains several attributes useful for error reporting and document tree access.
A dictionary of directive options for customization (from the “role” directive), to be interpreted by this function. Used for additional attributes for the generated elements and other functionality.
A list of strings, the directive content for customization (from the “role” directive). To be interpreted by the role function.
Invoke
Return p.<text> using getattr.
Report an error.
Return the path component as text.
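The attribute lookup at the heart of this role can be sketched without docutils. `docname_part` is a hypothetical helper; the real role would build nodes and system messages rather than return a string or raise.

```python
from pathlib import PurePosixPath


def docname_part(docname, attr):
    """Return the requested attribute of the docname, viewed as a path, as text."""
    p = PurePosixPath(docname)
    try:
        # Look the attribute up by name, e.g. "name", "parent" or "stem".
        # Note: this does not call methods; it only reads attributes.
        value = getattr(p, attr)
    except AttributeError:
        # The real role would report this through a docutils system message.
        raise ValueError(f"Invalid path attribute {attr!r}.") from None
    return str(value)
```

Using a pure path keeps the lookup free of filesystem access, which is enough for picking a docname apart.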
add_highlight_language¶
This function returns the source; it also prepends a Sphinx highlight directive if possible.
The source reST to potentially prepend a highlight directive to.
The lexer which was used to produce this source.
If there’s file-wide metadata, then it’s hard to know where the highlight directive can be safely placed:
Putting it before file-wide metadata demotes it to not being metadata.
Finding the right place to put the .. highlight directive after the metadata is difficult.
There’s no file-wide metadata. Add the highlight directive.
There might be file-wide metadata.
_CodeInclude¶
Provide a way to include source code to be processed by CodeChat. It is a slight modification of the docutils include directive and supports all the same options; it also supports the class option.
Implementation note: this is mostly copied directly from docutils.parsers.rst.directives.misc.Include, version 0.21.
Include content read from a separate source file.
Content will be lexed based on the provided lexer then parsed by the parser. The encoding of the included file can be specified. Only a part of the given file argument may be included by specifying start and end line or text to match before and/or after the text to be used.
Updated option_spec for this directive.
option_spec = {
"lexer": directives.unchanged,
"encoding": directives.encoding,
"tab-width": int,
"start-line": int,
"end-line": int,
"start-after": directives.unchanged_required,
"end-before": directives.unchanged_required,
"class": directives.class_option,
}
standard_include_path = Path(states.__file__).parent / "include"
def run(self):
Include a file as part of the content of this reST file.
Depending on the options, the file (or a clipping) is converted to nodes and returned or inserted into the input stream.
if not self.state.document.settings.file_insertion_enabled:
raise self.warning('"%s" directive disabled.' % self.name)
current_source = self.state.document.current_source
path = directives.path(self.arguments[0])
if path.startswith("<") and path.endswith(">"):
_base = self.standard_include_path
path = path[1:-1]
else:
_base = Path(current_source).parent
path = utils.relative_path(None, _base / path)
encoding = self.options.get(
"encoding", self.state.document.settings.input_encoding
)
e_handler = self.state.document.settings.input_encoding_error_handler
tab_width = self.options.get(
"tab-width", self.state.document.settings.tab_width
)
try:
include_file = io.FileInput(
source_path=path, encoding=encoding, error_handler=e_handler
)
except UnicodeEncodeError:
raise self.severe(
f'Problems with "{self.name}" directive path:\n'
f'Cannot encode input file path "{path}" '
"(wrong locale?)."
)
except OSError as error:
raise self.severe(
f'Problems with "{self.name}" directive '
f"path:\n{io.error_string(error)}."
)
else:
self.state.document.settings.record_dependencies.add(path)
Get to-be-included content
startline = self.options.get("start-line", None)
endline = self.options.get("end-line", None)
try:
if startline or (endline is not None):
lines = include_file.readlines()
rawtext = "".join(lines[startline:endline])
else:
rawtext = include_file.read()
except UnicodeError as error:
raise self.severe(
f'Problem with "{self.name}" directive:\n' + io.error_string(error)
)
start-after/end-before: no restrictions on newlines in match-text, and no restrictions on matching inside lines vs. line boundaries
skip content in rawtext before and incl. a matching text
skip content in rawtext after and incl. a matching text
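The start-after and end-before clipping described in the comments above can be sketched as a standalone function. `clip_text` is a hypothetical name, and raising `ValueError` stands in for the directive's own error reporting.

```python
def clip_text(rawtext, start_after=None, end_before=None):
    """Keep only the text after start_after and before end_before."""
    if start_after:
        index = rawtext.find(start_after)
        if index < 0:
            raise ValueError(f"start-after text {start_after!r} not found.")
        # Skip everything up to and including the matching text.
        rawtext = rawtext[index + len(start_after):]
    if end_before:
        index = rawtext.find(end_before)
        if index < 0:
            raise ValueError(f"end-before text {end_before!r} not found.")
        # Drop the matching text and everything after it.
        rawtext = rawtext[:index]
    return rawtext
```

The match is a plain substring search, so the marker text may span lines or sit in the middle of one, as the comments above note.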
Added code from here…
Only Sphinx has the env attribute.
If the lexer is specified, include it.
If Sphinx is running, try getting a user-specified lexer from the Sphinx configuration.
Translate the source code to reST.
If the class option is specified, wrap the code in a div with the specified classes.
If Sphinx is running, insert the appropriate highlight directive.
… to here.
Deleted code: Options for literal and code don’t apply.
Prevent circular inclusion:
log entries are tuples (<source>, <clip-options>)
if not include_log: # new document, initialize with document source
include_log.append(
(utils.relative_path(None, current_source), (None, None, None, None))
)
if (path, clip_options) in include_log:
master_paths = (pth for (pth, opt) in reversed(include_log))
inclusion_chain = "\n> ".join((path, *master_paths))
raise self.warning(
'circular inclusion in "%s" directive:\n%s'
% (self.name, inclusion_chain)
)
if "parser" in self.options:
parse into a dummy document and return created nodes
clean up doctree and complete parsing
Include as rST source:
mark end (cf. parsers.rst.states.Body.comment())
update include-log
Register the new directives and role with docutils.
Imitate Sphinx’s naming convention of literalinclude.