SourceClassifier.py - a module to lex source code¶
This module classifies source code in any format supported by CommentDelimiterInfo.py (info on comment delimiters for many languages) into an iterable of (type, string) tuples; see Implementation.
Imports¶
These are listed in the order prescribed by PEP 8.
Standard library¶
Third-party imports¶
Local application imports¶
Supporting routines¶
get_lexer¶
Provide several ways to find a lexer. Provide any of the following arguments, and this function will return the appropriate lexer for it.
The lexer itself, which will simply be returned.
The short name, or alias, of the lexer to use.
The filename of the source file to lex.
The MIME type of the source file to lex.
The code to be highlighted, used to guess a lexer.
options: Specify the lexer (see get_lexer arguments), and provide it any other needed options.
This sets the default tabsize to 4 spaces in Pygments' lexer; this link lists all available lexers.
Given code, try to guess a more accurate lexer.
Only use this guess if we support it.
If guessing fails or isn’t available, look up a lexer based on the file name.
Provide the ability to print debug info if needed.
Uncomment for debug prints: print(val).
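The lookup order described above can be sketched with real Pygments helpers. This is an illustration only, not the module's exact code: the name find_lexer and its signature are assumptions, while get_lexer_by_name, get_lexer_for_mimetype, guess_lexer, and get_lexer_for_filename are the actual Pygments APIs.

```python
from pygments.lexers import (
    get_lexer_by_name,
    get_lexer_for_filename,
    get_lexer_for_mimetype,
    guess_lexer,
)
from pygments.util import ClassNotFound


def find_lexer(alias=None, filename=None, mimetype=None, code=None, **options):
    # Match get_lexer's default: a tabsize of 4 spaces.
    options.setdefault("tabsize", 4)
    if alias:
        # The short name, or alias, of the lexer to use.
        return get_lexer_by_name(alias, **options)
    if mimetype:
        # The MIME type of the source file to lex.
        return get_lexer_for_mimetype(mimetype, **options)
    if code:
        # The code to be highlighted, used to guess a lexer.
        try:
            return guess_lexer(code, **options)
        except ClassNotFound:
            # Guessing failed; fall back to the file name, if given.
            pass
    if filename:
        # The filename of the source file to lex.
        return get_lexer_for_filename(filename, code or "", **options)
    raise ClassNotFound("No matching lexer found.")
```

For example, find_lexer(alias="python") and find_lexer(filename="demo.c") both return the appropriate lexer instance.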
Implementation¶
This function transforms source code in any format supported by CommentDelimiterInfo.py (info on comment delimiters for many languages) into an iterable of (type, string) tuples, where:

type
  is an integer giving the number of spaces in this comment's indent, or -1 if the line is code. Note that non-CodeChat comments are classified as code.
string
  is one line of code or comment, ending with a newline. Comments are supplied with the indent and delimiters removed.
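As a concrete illustration of this contract, here is a minimal, hypothetical classifier handling only Python-style # inline comments (the real implementation supports every language in CommentDelimiterInfo.py, plus block comments):

```python
def classify_sketch(code_str):
    # Hypothetical helper, for illustration only: classify each line of
    # source code that uses Python-style "#" comments.
    for line in code_str.splitlines(keepends=True):
        stripped = line.lstrip(" ")
        if stripped.startswith("# "):
            # A comment: type is the number of spaces in its indent; the
            # delimiter and the one space following it are removed.
            yield (len(line) - len(stripped), stripped[2:])
        else:
            # Code (or a non-CodeChat comment): type is -1.
            yield (-1, line)
```

For example, classify_sketch("x = 1\n  # A comment.\n") yields (-1, "x = 1\n") followed by (2, "A comment.\n").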
This routine is the heart of the algorithm.
code_str: the code to classify.
lexer: The lexer used to analyze the code.
Gather some additional information, based on the lexer, which is needed to correctly process comments:
If there’s no multi-line start info, then classify generic comments as inline. The exception is HTML, which (currently, as of Pygments v.10.0) classifies comments as generic, but also supports inline JavaScript comments.
Likewise, no inline info indicates that generic comments are block comments. Note that inline comments are a sequence of strings, so look inside the sequence to the string.
Invoke a Pygments lexer on the provided source code, obtaining an iterable of tokens. Also analyze Python code for docstrings.
Combine tokens from the lexer into three groups: whitespace, comment, or other.
Make a per-line list of [group, ws_len, string], so that the last string in each list ends with a newline. Change the group of block comments that actually span multiple lines.
Classify each line. For CodeChat-formatted comments, remove the leading whitespace and all comment characters (the // or #, for example).
CodeChat style¶
CodeChat.css - Style sheet for CodeChat documents provides some of the CSS needed to properly format CodeChat documents. However, not all the necessary styling can be accomplished using CSS. This script sets the styles that can’t be set in CSS. Specifically, it removes the bottom margin for cases where code follows a paragraph, and the top margin for cases where code precedes a paragraph. The expected structure:
1<!-- All CodeChat-produced code is marked by the ``fenced-code``
2 class. -->
3<div class="fenced-code">
4 ...Some code here...
5</div>
6
7<!-- An indent from CodeChat -- not present with no indent. The
8 code below will add the classes ``CodeChat_noTop
9 CodeChat_noBottom`` to this element.
10-->
11<div class="CodeChat-indent">
12 <!-- The code below will add the class ``CodeChat_noTop`` to
13 this element.
14 -->
15 <p>Text.</p>
16 <!-- The code below will add the class ``CodeChat_noBottom``
17 to this element.
18 -->
19 <p>Some text here</p>
20</div>
21
22<div class="fenced-code">
23 ...Some more code here...
24</div>
Disable Black for this block. fmt: off
Define a function to add classes as defined above.
It takes an optional parameter, the root element to search from. Styles will be applied to this element and all its children. It defaults to document.
Walk the tree in the given direction.
A list of DOM elements to walk.
The walker function, x => x.next/prevElementSibling.
Which child to select, x => first/lastElementChild.
Create an array to hold the children found.
For each element in the list of elements, find all its next/previous children.
If the current element (that) does not have a next/previous sibling, ascend the tree until we find one or reach the top.
We found a next/previous sibling. Go there.
Include the next/previous sibling in the output.
Add all first/last children of this node to the output.
Use document as the default element if it isn’t specified.
All CodeChat-produced code is marked by the fenced-code class.
Go to the next node from code, then set the margin-top of all first children to 0 so that there will be no extra space between the code and the following comment.
Same, but remove space between a comment and the following code.
Only style after the DOM is ready.
fmt: on
Step 1 of source_lexer¶
See code_str.
See lexer.
Pygments does some cleanup on the code given to it before lexing it. If this is Python code, we want to run AST on that cleaned-up version, so that AST results can be correlated with Pygments results. However, Pygments doesn’t offer a way to do this; so, add that ability to the detected lexer.
Note: though Pygments does support a String.Doc token type, it doesn’t truly identify docstrings; from Pygments 2.1.3, pygments.lexers.python.PythonLexer:
55(r'^(\s*)([rRuU]{,2}"""(?:.|\n)*?""")', bygroups(Text, String.Doc)),
56(r"^(\s*)([rRuU]{,2}'''(?:.|\n)*?''')", bygroups(Text, String.Doc)),
which means that String.Doc is simply ANY triple-quoted string. Looking at other lexers, String.Doc often refers to a specially-formatted comment, not a docstring. From pygments.lexers.javascript:
591(r'//.*?\n', Comment.Single),
592(r'/\*\*!.*?\*/', String.Doc),
593(r'/\*.*?\*/', Comment.Multiline),
So, the String.Doc token can’t be used in any meaningful way by CodeToRest to identify docstrings. Per Wikipedia’s docstrings article, only three languages support this feature. So, we’ll need a language-specific approach. For Python, PEP 0257 provides the syntax and even some code to correctly remove docstring indentation. The ast module provides routines which parse a given Python string without executing it, which is better than evaluating arbitrary Python then looking at its __doc__ attributes.
Perhaps the approach would be to scan with ast, then see if the line number matches the ending line number of a string, and if so convert the string into a comment. Trickiness: Python will merge strings; consider the following:
>>> def foo():
...     ("""A comment.""" \
...     ' More.'
...     " And more.")
...     pass
...
>>> print(foo.__doc__)
A comment. More. And more.
It’s probably best not to support this case. Unfortunately, ast reports
this as a single string, rather than as a list of several elements.
The approach: make sure the docstring found by ast is in the text of a
Pygments string token. If so, replace the string token by a block
comment, whose contents come from inspect.cleandoc
of the docstring.
So, process this with ast if this is Python or Python3 code to find docstrings.
If found, store { ending_line_number_of_the_comment: docstring } into ast_docstring.
Provide a place to store syntax errors resulting from parsing the Python code.
Determine if code is Python or Python3. Note that AST processing cannot support Python 2 specific syntax (e.g. the <> operator).
Syntax errors cause ast.parse to fail. Catch and report them.
If so, walk through the preprocessed code to analyze each token.
Check if current token is a docstring. The docstring will be cleaned later.
If so, store current line number and token value. Note that lineno gives the last line of the string, per http://bugs.python.org/issue16806, for Python <= 3.7. Per the docs, end_lineno was introduced in Python 3.8.
Take the file name (which shows up as <unknown>) out of the error message returned.
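The ast-based scan described above can be sketched as a stand-alone function. This is a simplified illustration: the helper name find_docstrings is an assumption, while ast.parse, ast.walk, ast.get_docstring, end_lineno (Python 3.8+), and inspect.cleandoc are the real stdlib APIs.

```python
import ast
import inspect


def find_docstrings(code_str):
    # Build { ending_line_number_of_the_docstring: cleaned_docstring },
    # as described in the text. Requires Python >= 3.8 for end_lineno.
    ast_docstring = {}
    try:
        tree = ast.parse(code_str)
    except SyntaxError:
        # Syntax errors cause ast.parse to fail; return no docstrings.
        return ast_docstring
    for node in ast.walk(tree):
        if isinstance(
            node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        ):
            docstring = ast.get_docstring(node, clean=False)
            if docstring is not None:
                # node.body[0].value is the docstring's string node.
                string_node = node.body[0].value
                ast_docstring[string_node.end_lineno] = inspect.cleandoc(docstring)
    return ast_docstring
```

For example, find_docstrings('def f():\n    """Doc."""\n    pass\n') returns {2: "Doc."}, keyed by the docstring's ending line number.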
Now, run the lexer.
Pygments monkeypatching¶
Provide a way to perform preprocessing on text before lexing it. This code was copied from pygments.lexer.Lexer.get_tokens, v. 2.6.1.
Return an iterable of (tokentype, value) pairs generated from text. If unfiltered is set to True, the filtering mechanism is bypassed even if filters are defined.
Also preprocess the text: expand tabs, strip it if wanted, and apply registered filters.
check for BOM first
no BOM found, so use chardet
            if decoded is None:
                enc = chardet.detect(text[:1024])  # Guess using first 1KB
                decoded = text.decode(enc.get("encoding") or "utf-8", "replace")
            text = decoded
        else:
            text = text.decode(self.encoding)
            if text.startswith("\ufeff"):
                text = text[len("\ufeff") :]
    else:
        if text.startswith("\ufeff"):
            text = text[len("\ufeff") :]
text now is a unicode string
EDIT: This is not from the original Pygments code. It was added to return the preprocessed text.
This code was copied from pygments.lexer.Lexer.get_tokens, v. 2.6.1 (the last few lines).
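A simplified sketch of this monkeypatching idea follows. It mirrors only the newline-normalization and tab-expansion parts of Pygments' preprocessing; the name make_preprocessing_lexer is hypothetical, while get_lexer_by_name, Lexer.get_tokens, and the tabsize option are real Pygments APIs.

```python
from pygments.lexers import get_lexer_by_name


def make_preprocessing_lexer(alias="python", tabsize=4):
    # Wrap get_tokens so the preprocessed text is saved on the lexer,
    # allowing AST results to be correlated with Pygments results.
    lexer = get_lexer_by_name(alias, tabsize=tabsize)
    original_get_tokens = lexer.get_tokens

    def get_tokens(text, unfiltered=False):
        # Mirror part of Pygments' own preprocessing (v. 2.x).
        processed = text.replace("\r\n", "\n").replace("\r", "\n")
        if lexer.tabsize > 0:
            processed = processed.expandtabs(lexer.tabsize)
        # Save the preprocessed text for later use, then delegate.
        lexer.preprocessed_text = processed
        return original_get_tokens(processed, unfiltered)

    lexer.get_tokens = get_tokens
    return lexer
```

After calling lexer.get_tokens(...), lexer.preprocessed_text holds the text the lexer actually saw.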
Step 2 of source_lexer¶
Given tokens, group them.
An iterable of (tokentype, string) pairs provided by the lexer, per get_tokens.
When true, classify generic comments as inline.
When true, classify generic comments as block comments.
Docstring dict found from AST scanning.
Keep track of the current group, string, and line no.
Walk through tokens.
Increase token line no. for every newline found.
Compare formatted token containing docstring with AST result.
Insert an extra space after the docstring delimiter, making this look like a reST comment.
If there’s a change in group, yield what we’ve accumulated so far, then initialize the state to the newly-found group and string.
Otherwise, keep accumulating.
Output final pair, if we have it.
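The accumulate-and-yield loop described above can be sketched independently of Pygments. The name merge_groups is hypothetical, and this sketch omits the docstring comparison and group-selection logic:

```python
def merge_groups(pairs):
    # Combine consecutive (group, string) pairs that share a group,
    # yielding one (group, accumulated_string) pair per run.
    current_group = None
    current_string = ""
    for group, string in pairs:
        if group != current_group and current_string:
            # A change in group: yield what we've accumulated so far.
            yield (current_group, current_string)
            current_string = ""
        current_group = group
        current_string += string
    # Output the final pair, if we have it.
    if current_string:
        yield (current_group, current_string)
```

For example, merging [("ws", " "), ("ws", "\t"), ("code", "x = 1")] yields ("ws", " \t") and then ("code", "x = 1").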
Supporting routines and definitions¶
Define the groups into which tokens will be placed.
The basic classification used by group_for_tokentype.
A /* comment */-style comment contained in one string.
Grouping is:
/* BLOCK_COMMENT_START
BLOCK_COMMENT_BODY, (repeats for all comment body)
BLOCK_COMMENT_END */
Given a tokentype, group it.
The tokentype to place into a group.
See comment_is_inline.
See comment_is_block.
The list of Pygments tokens lists Token.Text (how a newline is classified) and Token.Whitespace. Consider either as whitespace.
There is a Token.Comment, but this can refer to inline or block comments. Therefore, use info from CommentDelimiterInfo as a tiebreaker.
A few goofy lexers use this as of Pygments 2.1.3. See https://bitbucket.org/birkenfeld/pygments-main/issues/1251/use-of-commentsingleline-instead-of.
If the tiebreaker above doesn’t classify a Token.Comment, then assume it to be an inline comment. This occurs in the Matlab lexer using Pygments 2.0.2.
Step #3 of source_lexer¶
Given an iterable of groups, break them into lists based on newlines. The list consists of (group, comment_leading_whitespace_length, string) tuples.
An iterable of (group, string) pairs provided by group_lexer_tokens.
An element of COMMENT_DELIMITER_INFO for the language being classified.
Keep a list of (group, string) tuples we’re accumulating.
Keep track of the length of whitespace at the beginning of the body and end portions of a block comment.
The length of an opening block comment delimiter.
Accumulate until we find a newline, then yield that.
A given group (such as a block string) may extend across multiple newlines. Split these groups apart first.
Look for block comments spread across multiple lines and label them correctly.
Look for an indented multi-line comment block. First, determine what the indent must be: (column in which comment starts) + (length of comment delimiter) + (1 space).
Determine the indent style (all spaces, or spaces followed by a character, typically *). If it’s not spaces only, it must be spaces followed by a delimiter.
First, get the last character of the block comment delimiter. This is expressed as a 1-character range, so that ‘’ will be returned if the index is past the end of the string. Perl’s PODs consist of =whatever text you want\n, meaning the entire line should be discarded. To make this “easy”, I define comment_delim_info[1] as a very large number, so that the entire line will be discarded. Hence, the need for the hack below.
Look at the second and following lines to see if their indent is consistent.
It’s inconsistent. Set ws_len to 0 to signal that this isn’t an indented block comment.
Accumulate results.
For block comments, move from a start to a body group.
If the next line is the last line, update the block group.
Yield when we find a newline, then clear our accumulator.
We’ve output a group; reset the ws_len to 0 in case the group just output was a multi-line block comment with ws_len > 0.
Output final group, if one is still accumulating.
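Ignoring the block-comment relabeling and whitespace measurement, the newline-splitting core of this step might look like the following. The name gather_on_newlines is a hypothetical, simplified stand-in for the real routine:

```python
def gather_on_newlines(pairs):
    # Break (group, string) pairs into per-line lists, so that the last
    # string in each yielded list ends with a newline.
    line = []
    for group, string in pairs:
        # A given group may extend across multiple newlines; split it
        # apart first, keeping the newlines attached to each fragment.
        for fragment in string.splitlines(keepends=True):
            line.append((group, fragment))
            if fragment.endswith("\n"):
                # Yield when we find a newline, then clear the accumulator.
                yield line
                line = []
    # Output the final group, if one is still accumulating.
    if line:
        yield line
```

For example, [("code", "a\nb"), ("ws", "\n")] becomes two lines: [("code", "a\n")] and [("code", "b"), ("ws", "\n")].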
Block comment indentation processing¶
A single line: /* comment */. No special handling needed.

Multiple lines, indented with spaces. For example:
1Column:  1
21234567890234567
3
4  /* A multi-
5
6     line
7 WRONG INDENTATION
8     string
9   */
In the code above, the text of the comment begins at column 6 of line 4, with the letter A. Therefore, line 7 lacks the necessary 6-space indent, so that no indentation will be removed from this comment.
If line 7 were indented properly, the resulting reST would be:
...rest to indent the left margin of the following text 2 spaces...

A multi-
line
RIGHT INDENTATION
string

...rest to end the left margin indent...
Note that the first 5 characters were stripped off each line, leaving only line 8 indented (preserving its indentation relative to the comment’s indentation). Some special cases in doing this processing:
Line 5 may contain less than the expected 5 space indent: it could be only a newline. This must be supported with a special case.
The comment closing (line 9) contains just 3 spaces; this is allowed. If there are non-space characters before the closing comment delimiter, they must not occur before column 6. For example,

  /* A multi-
     line
     comment
     */

and

  /* A multi-
     line
     comment */

have consistent indentation. In particular, the last line of a multi-line comment may contain zero or more whitespace characters followed by the closing block comment delimiter. However,

  /* A multi-
     line
 comment */

is not sufficiently indented to qualify for indentation removal.
So, to recognize:
A line from a multi-line comment to examine.
The expected indent to check for; a length, in characters.
Placeholder for delimiter expected near the end of an indent (one character). Not used by this function, but this function must take the same parameters as is_delim_indented_line.
True if this is the last line of a multi-line comment.
See comment_delim_info.
A line containing only whitespace is always considered valid.
A line beginning with ws_len spaces has the expected indent.
The closing delimiter will always be followed by a newline, hence the - 1 factor.
Last line: zero or more whitespaces followed by the closing block comment delimiter is valid. Since ''.isspace() == False, check for this case and consider it true.
No other correctly indented cases.
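The rules above might be expressed as the following sketch. The real routine takes comment_delim_info and a delimiter placeholder; here the closing-delimiter length is assumed to be the parameter closing_delim_len:

```python
def is_space_indented_line(line, indent_len, is_last, closing_delim_len=2):
    # A line containing only whitespace is always considered valid.
    if line.isspace():
        return True
    # A line beginning with indent_len spaces has the expected indent.
    if line.startswith(" " * indent_len):
        return True
    # Last line: zero or more whitespace characters followed by the
    # closing block comment delimiter is valid. The closing delimiter is
    # always followed by a newline, hence the extra -1.
    if is_last:
        prefix = line[: -closing_delim_len - 1]
        if prefix == "" or prefix.isspace():
            return True
    # No other correctly indented cases.
    return False
```

For example, with an expected indent of 5, "     line\n" and a bare "\n" qualify, the under-indented closing "   */\n" qualifies on the last line, but " WRONG\n" does not.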
(continuing from the list above…)
Multiple lines, indented with spaces followed by a delimiter. For example:
1Column: 1
21234567890123456789
3  /* Multi-
4   * line
5   *
6   *WRONG INDENTATION
7     * WRONG INDENTATION
8   */
The rules are similar to the previous case (indents with space alone). However, there is one special case: line 5 doesn’t require a space after the asterisk; a newline is acceptable. If the indentation is corrected, the result is:
...rest to indent the left margin of the following text 2 spaces...

Multi-
line

RIGHT INDENTATION
RIGHT INDENTATION

...rest to end left margin indent...
So, to recognize:
A line from a multi-line comment to examine.
The expected indent to check for; a length, in characters.
Delimiter expected near the end of an indent (one character).
True if this is the last line of a multi-line comment.
See comment_delim_info.
A line with the correct number of spaces, followed by a delimiter and then either a space or a newline, is correctly indented.
Last line possibility: indent_len - 2 spaces followed by the delimiter is a valid indent. For example, an indent of 3 begins with /* comment and can end with _*/, a total of (indent_len == 3) - (2 spaces that are usually a * followed by a space) + (closing delim */ length of 2 chars) == 3.
No other correctly indented cases.
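These rules might be sketched as follows. The real routine receives comment_delim_info; here the closing delimiter is an assumed parameter, and the function name matches the text's is_delim_indented_line:

```python
def is_delim_indented_line(line, indent_len, delim, is_last, closing_delim="*/"):
    spaces = " " * (indent_len - 2)
    # The correct number of spaces, the delimiter, then either a space
    # or a newline: correctly indented. A bare delimiter followed by a
    # newline (no space required) is also acceptable.
    if line.startswith(spaces + delim + " ") or line.startswith(spaces + delim + "\n"):
        return True
    # Last line possibility: indent_len - 2 spaces followed by the
    # closing delimiter is a valid indent.
    if is_last and line.startswith(spaces + closing_delim):
        return True
    # No other correctly indented cases.
    return False
```

With an indent of 3 and delimiter "*", " * line\n" and " *\n" qualify, " *WRONG\n" does not, and " */\n" qualifies on the last line.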
Step #4 of source_lexer¶
Classify the output of gather_groups_on_newlines into either code, or a comment with n leading whitespace characters. Remove all comment characters.
Note
This is a destructive edit, instead of a classification. To make this invertible, it needs to be non-destructive. The idea:
Output s, the entire line as a string, if it’s not a reST comment. Currently, it outputs -1, s.
Output (whitespace characters or ‘’, opening comment delimiter and space, comment text, closing comment delimiter or ‘’, newline). Currently, it outputs len(leading whitespace characters), comment text + newline.
An iterable of [(group1, string1_no_newline), (group2, string2_no_newline), …, (groupN, stringN_ending_newline)], produced by gather_groups_on_newlines.
See comment_delim_info.
See lexer.
Keep track of block comment state.
Walk through groups.
The type = # of leading whitespace characters, or 0 if none.
Encode this whitespace in the type, then drop it.
For body or end block comments, use the indent set at the beginning of the comment. Otherwise, there is no indent, so set it to 0.
Update the block reST state.
Strip all comment characters off the strings and combine them.
Remove the initial space character from the first comment, or ws_len chars from body or end comments.
Some of the body or end block lines may be just whitespace. Don’t strip these: the line may be too short or we might remove a newline. Or, this might be inconsistent indentation in which ws_len == 0, in which case we should also strip nothing.
The first ws_len - 1 characters should be stripped.
The last character, if it exists and is a space, should also be stripped.
A comment of //\n qualifies as a reST comment, but should not have the \n stripped off. Avoid this case. Otherwise, handle the more typical // comment case, in which the space after the comment delimiter should be removed.
Everything else is considered code.
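The //\n special case above can be sketched as follows. The helper name strip_inline_delim is hypothetical; the real code handles any delimiter from CommentDelimiterInfo:

```python
def strip_inline_delim(string, delim="//"):
    # Remove the inline comment delimiter, plus the single space that
    # follows it in a typical "// comment". A bare "//\n" qualifies as
    # a reST comment but must keep its newline, so only a space (never
    # the newline) is stripped after the delimiter.
    text = string[len(delim):]
    if text.startswith(" "):
        text = text[1:]
    return text
```

So "// comment\n" becomes "comment\n", while "//\n" becomes "\n" with the newline intact.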
Supporting routines¶
Given a (group, string) tuple, return the string with comment delimiter removed if it is a comment, or just the string if it’s not a comment.
The group this string was classified into.
The string corresponding to this group.
See comment_delim_info.
See lexer.
Number of characters in an opening block comment.
Number of characters in a closing block comment.
Special case: COBOL. A comment has the character * or / in column 7.
Unlike the opening and closing block comment delimiters, the inline comment delimiter may be a sequence. Look at each possibility for a match.
    string_lower = string.lower()
    for inline_comment_delim in inline_comment_delim_seq:
        if string_lower.startswith(inline_comment_delim):
            return string[len(inline_comment_delim) :]
    return string
if group == _GROUP.block_comment:
    return string[len_opening_block_comment_delim:-len_closing_block_comment_delim]
if group == _GROUP.block_comment_start:
    return string[len_opening_block_comment_delim:]
if group == _GROUP.block_comment_end:
    return string[:-len_closing_block_comment_delim]
else:
    return string
Return a string with the given delimiter removed from its beginning.
Note
This code isn’t used yet – it’s for a rewrite which will support multiple delimiters.
Either the number of characters in the beginning delimiter, or a tuple of strings which give all valid beginning comment delimiters.
The string which starts with the delimiter to be removed.
Loop through all delimiters.
If we find one at the beginning of the string, strip it off.
Not found – panic.
Determine if the given line is a comment to be interpreted by reST.
Supports remove_comment_chars, classify_groups.
A sequence of (group, string) representing a single line.
True if this line contains the body or end of a block comment that will be interpreted by reST.
See comment_delim_info.
See lexer.
See if there is any _GROUP.other in this line. If so, it’s not a reST comment.
If there are no comments (meaning the entire line is whitespace), it’s not a reST comment.
Find the first comment. There may be whitespace preceding it, so select the correct index.
The cases are:

1. // comment, //\n, # -> reST comment. Note that in some languages (Python is one example), the newline isn’t included in the comment.
2. //comment -> not a reST comment.
3. /* comment, /*\n -> reST comment.
4. Any block comment body or end for which its block start was a reST comment -> reST comment.
5. /**/ -> a reST comment. (I could see this either as reST or not; because it was simpler, I picked reST.)
Begin by checking case #4 above.
To check the other cases, first remove the comment delimiter so we can examine the next character following the delimiter.
first_comment_text = _remove_comment_delim(
    first_group, first_string, comment_delim_info, lexer
)
first_char_is_rest = (
    len(first_comment_text) > 0 and first_comment_text[0] in (" ", "\n")
) or len(first_comment_text) == 0
if first_char_is_rest and not _is_block_body_or_end(first_group):
    return True
return False
Determine if this group is either a block comment body or a block comment end.