GH-98831: "Generate" the interpreter (#98830)

The switch cases (really TARGET(opcode) macros) have been moved from ceval.c to generated_cases.c.h. That file is generated from instruction definitions in bytecodes.c (which impersonates a C file so the C code it contains can be edited without custom support in e.g. VS Code).

The code generator lives in Tools/cases_generator (it has a README.md explaining how it works). The DSL used to describe the instructions is a work in progress, described in https://github.com/faster-cpython/ideas/blob/main/3.12/interpreter_definition.md.

This is surely a work-in-progress. An easy next step could be auto-generating super-instructions.

**IMPORTANT: Merge Conflicts**

If you get a merge conflict for instruction implementations in ceval.c, your best bet is to port your changes to bytecodes.c. That file looks almost the same as the original cases, except instead of `TARGET(NAME)` it uses `inst(NAME)`, and the trailing `DISPATCH()` call is omitted (the code generator adds it automatically).
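As a sketch (the body shown is illustrative, not the exact ceval.c code), a simple case such as

    TARGET(POP_TOP) {
        PyObject *value = POP();
        Py_DECREF(value);
        DISPATCH();
    }

becomes, in bytecodes.c:

    // stack effect: (__0 -- )
    inst(POP_TOP) {
        PyObject *value = POP();
        Py_DECREF(value);
    }

(The stack-effect comment follows the format that extract_cases.py emits.)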
Guido van Rossum, 2022-11-02 21:31:26 -07:00 (committed by GitHub)
parent 2cfcaf5af6
commit 41bc101dd6
13 changed files with 8961 additions and 3851 deletions

.gitattributes (vendored)
@@ -82,6 +82,7 @@ Parser/parser.c generated
 Parser/token.c generated
 Programs/test_frozenmain.h generated
 Python/Python-ast.c generated
+Python/generated_cases.c.h generated
 Python/opcode_targets.h generated
 Python/stdlib_module_names.h generated
 Tools/peg_generator/pegen/grammar_parser.py generated

Makefile.pre.in

@@ -1445,7 +1445,19 @@ regen-opcode-targets:
 		$(srcdir)/Python/opcode_targets.h.new
 	$(UPDATE_FILE) $(srcdir)/Python/opcode_targets.h $(srcdir)/Python/opcode_targets.h.new
 
-Python/ceval.o: $(srcdir)/Python/opcode_targets.h $(srcdir)/Python/condvar.h
+.PHONY: regen-cases
+regen-cases:
+	# Regenerate Python/generated_cases.c.h from Python/bytecodes.c
+	# using Tools/cases_generator/generate_cases.py
+	PYTHONPATH=$(srcdir)/Tools/cases_generator \
+	$(PYTHON_FOR_REGEN) \
+	$(srcdir)/Tools/cases_generator/generate_cases.py \
+		-i $(srcdir)/Python/bytecodes.c \
+		-o $(srcdir)/Python/generated_cases.c.h.new
+	$(UPDATE_FILE) $(srcdir)/Python/generated_cases.c.h $(srcdir)/Python/generated_cases.c.h.new
+
+Python/ceval.o: $(srcdir)/Python/opcode_targets.h $(srcdir)/Python/condvar.h $(srcdir)/Python/generated_cases.c.h
 
 Python/frozen.o: $(FROZEN_FILES_OUT)

Misc/NEWS.d entry (new file, 1 line)
We have new tooling, in ``Tools/cases_generator``, to generate the interpreter switch from a list of opcode definitions.

Python/bytecodes.c (new file, 4006 lines)

File diff suppressed because it is too large.

Python/ceval.c

File diff suppressed because it is too large.

Python/generated_cases.c.h (generated, new file, 3860 lines)

File diff suppressed because it is too large.

Tools/cases_generator/README.md (new file, 39 lines)
# Tooling to generate interpreters

What's currently here:

- `lexer.py`: lexer for C, originally written by Mark Shannon
- `plexer.py`: OO interface on top of `lexer.py`; main class: `PLexer`
- `parser.py`: parser for the instruction-definition DSL; main class: `Parser`
- `generate_cases.py`: driver script to read `Python/bytecodes.c` and
  write `Python/generated_cases.c.h`

**Temporarily also:**

- `extract_cases.py`: script to extract cases from
  `Python/ceval.c` and write them to `Python/bytecodes.c`
- `bytecodes_template.h`: template used by `extract_cases.py`

The DSL for the instruction definitions in `Python/bytecodes.c` is described
[here](https://github.com/faster-cpython/ideas/blob/main/3.12/interpreter_definition.md).
Note that there is some dummy C code at the top and bottom of the file
to fool text editors like VS Code into believing this is valid C code.

## A bit about the parser

The parser class uses a pretty standard recursive descent scheme,
but with unlimited backtracking.
The `PLexer` class tokenizes the entire input before parsing starts.
We do not run the C preprocessor.
Each parsing method returns either an AST node (a `Node` instance)
or `None` (after which a different parsing method may be tried).
Most parsing methods are decorated with `@contextual`, which automatically
resets the tokenizer input position when `None` is returned.
Parsing methods may also raise `SyntaxError` (showing the error in the
C source), which is irrecoverable.
When a parsing method returns `None`, it is possible that after backtracking
a different parsing method returns a valid AST.

Neither the lexer nor the parser is complete or fully correct.
Most known issues are tersely indicated by `# TODO:` comments.
We plan to fix issues as they become relevant.

Tools/cases_generator/bytecodes_template.h (new file, 85 lines)
#include "Python.h"
#include "pycore_abstract.h" // _PyIndex_Check()
#include "pycore_call.h" // _PyObject_FastCallDictTstate()
#include "pycore_ceval.h" // _PyEval_SignalAsyncExc()
#include "pycore_code.h"
#include "pycore_function.h"
#include "pycore_long.h" // _PyLong_GetZero()
#include "pycore_object.h" // _PyObject_GC_TRACK()
#include "pycore_moduleobject.h" // PyModuleObject
#include "pycore_opcode.h" // EXTRA_CASES
#include "pycore_pyerrors.h" // _PyErr_Fetch()
#include "pycore_pymem.h" // _PyMem_IsPtrFreed()
#include "pycore_pystate.h" // _PyInterpreterState_GET()
#include "pycore_range.h" // _PyRangeIterObject
#include "pycore_sliceobject.h" // _PyBuildSlice_ConsumeRefs
#include "pycore_sysmodule.h" // _PySys_Audit()
#include "pycore_tuple.h" // _PyTuple_ITEMS()
#include "pycore_emscripten_signal.h" // _Py_CHECK_EMSCRIPTEN_SIGNALS
#include "pycore_dict.h"
#include "dictobject.h"
#include "pycore_frame.h"
#include "opcode.h"
#include "pydtrace.h"
#include "setobject.h"
#include "structmember.h" // struct PyMemberDef, T_OFFSET_EX
void _PyFloat_ExactDealloc(PyObject *);
void _PyUnicode_ExactDealloc(PyObject *);
#define SET_TOP(v) (stack_pointer[-1] = (v))
#define PEEK(n) (stack_pointer[-(n)])
#define GETLOCAL(i) (frame->localsplus[i])
#define inst(name) case name:
#define family(name) static int family_##name
#define NAME_ERROR_MSG \
"name '%.200s' is not defined"
typedef struct {
PyObject *kwnames;
} CallShape;
static void
dummy_func(
PyThreadState *tstate,
_PyInterpreterFrame *frame,
unsigned char opcode,
unsigned int oparg,
_Py_atomic_int * const eval_breaker,
_PyCFrame cframe,
PyObject *names,
PyObject *consts,
_Py_CODEUNIT *next_instr,
PyObject **stack_pointer,
CallShape call_shape,
_Py_CODEUNIT *first_instr,
int throwflag,
binaryfunc binary_ops[]
)
{
switch (opcode) {
        /* BEWARE!
           It is essential that any operation that fails goes to error,
           and that all operations that succeed call DISPATCH()! */
// BEGIN BYTECODES //
// INSERT CASES HERE //
// END BYTECODES //
}
error:;
exception_unwind:;
handle_eval_breaker:;
resume_frame:;
resume_with_error:;
start_frame:;
unbound_local_error:;
}
// Families go below this point //
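The inst() and family() macros above exist only to placate the editor; a sketch of how they would expand (this expansion is never compiled, and the family shown is illustrative):

    inst(NOP) { }
    // expands to: case NOP: { }

    family(call) = { CALL, CALL_PY_EXACT_ARGS };
    // expands to: static int family_call = { CALL, CALL_PY_EXACT_ARGS };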

Tools/cases_generator/extract_cases.py (new file, 247 lines)
"""Extract the main interpreter switch cases."""
# Reads cases from ceval.c, writes to bytecodes.c.
# (This file is not meant to be compiled, but it has a .c extension
# so tooling like VS Code can be fooled into thinking it is C code.
# This helps editing and browsing the code.)
#
# The script generate_cases.py regenerates the cases.
import argparse
import difflib
import dis
import re
import sys
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", type=str, default="Python/ceval.c")
parser.add_argument("-o", "--output", type=str, default="Python/bytecodes.c")
parser.add_argument("-t", "--template", type=str, default="Tools/cases_generator/bytecodes_template.c")
parser.add_argument("-c", "--compare", action="store_true")
parser.add_argument("-q", "--quiet", action="store_true")
inverse_specializations = {
specname: familyname
for familyname, specnames in dis._specializations.items()
for specname in specnames
}
def eopen(filename, mode="r"):
if filename == "-":
if "r" in mode:
return sys.stdin
else:
return sys.stdout
return open(filename, mode)
def leading_whitespace(line):
return len(line) - len(line.lstrip())
def extract_opcode_name(line):
m = re.match(r"\A\s*TARGET\((\w+)\)\s*{\s*\Z", line)
if m:
opcode_name = m.group(1)
if opcode_name not in dis._all_opmap:
raise ValueError(f"error: unknown opcode {opcode_name}")
return opcode_name
raise ValueError(f"error: no opcode in {line.strip()}")
def figure_stack_effect(opcode_name):
    # Return (se, diff) where se is the stack effect for oparg=0
    # and diff is the increment in stack effect per unit of oparg.
# If it is irregular or unknown, raise ValueError.
    if m := re.match(r"^(\w+)__(\w+)$", opcode_name):
# Super-instruction adds effect of both parts
first, second = m.groups()
se1, incr1 = figure_stack_effect(first)
se2, incr2 = figure_stack_effect(second)
if incr1 or incr2:
raise ValueError(f"irregular stack effect for {opcode_name}")
return se1 + se2, 0
if opcode_name in inverse_specializations:
# Specialized instruction maps to unspecialized instruction
opcode_name = inverse_specializations[opcode_name]
opcode = dis._all_opmap[opcode_name]
if opcode in dis.hasarg:
try:
se = dis.stack_effect(opcode, 0)
except ValueError as err:
raise ValueError(f"{err} for {opcode_name}")
if dis.stack_effect(opcode, 0, jump=True) != se:
raise ValueError(f"{opcode_name} stack effect depends on jump flag")
if dis.stack_effect(opcode, 0, jump=False) != se:
raise ValueError(f"{opcode_name} stack effect depends on jump flag")
for i in range(1, 257):
if dis.stack_effect(opcode, i) != se:
return figure_variable_stack_effect(opcode_name, opcode, se)
else:
try:
se = dis.stack_effect(opcode)
except ValueError as err:
raise ValueError(f"{err} for {opcode_name}")
if dis.stack_effect(opcode, jump=True) != se:
raise ValueError(f"{opcode_name} stack effect depends on jump flag")
if dis.stack_effect(opcode, jump=False) != se:
raise ValueError(f"{opcode_name} stack effect depends on jump flag")
return se, 0
def figure_variable_stack_effect(opcode_name, opcode, se0):
# Is it a linear progression?
se1 = dis.stack_effect(opcode, 1)
diff = se1 - se0
for i in range(2, 257):
sei = dis.stack_effect(opcode, i)
if sei - se0 != diff * i:
raise ValueError(f"{opcode_name} has irregular stack effect")
# Assume it's okay for larger oparg values too
return se0, diff
START_MARKER = "/* Start instructions */" # The '{' is on the preceding line.
END_MARKER = "/* End regular instructions */"
def read_cases(f):
cases = []
case = None
started = False
# TODO: Count line numbers
for line in f:
stripped = line.strip()
if not started:
if stripped == START_MARKER:
started = True
continue
if stripped == END_MARKER:
break
if stripped.startswith("TARGET("):
if case:
cases.append(case)
indent = " " * leading_whitespace(line)
case = ""
opcode_name = extract_opcode_name(line)
try:
se, diff = figure_stack_effect(opcode_name)
except ValueError as err:
case += f"{indent}// error: {err}\n"
case += f"{indent}inst({opcode_name}) {{\n"
else:
inputs = []
outputs = []
if se > 0:
for i in range(se):
outputs.append(f"__{i}")
elif se < 0:
for i in range(-se):
inputs.append(f"__{i}")
if diff > 0:
if diff == 1:
outputs.append(f"__array[oparg]")
else:
outputs.append(f"__array[oparg*{diff}]")
elif diff < 0:
if diff == -1:
inputs.append(f"__array[oparg]")
else:
inputs.append(f"__array[oparg*{-diff}]")
input = ", ".join(inputs)
output = ", ".join(outputs)
case += f"{indent}// stack effect: ({input} -- {output})\n"
case += f"{indent}inst({opcode_name}) {{\n"
else:
if case:
case += line
if case:
cases.append(case)
return cases
def write_cases(f, cases):
for case in cases:
caselines = case.splitlines()
while caselines[-1].strip() == "":
caselines.pop()
if caselines[-1].strip() == "}":
caselines.pop()
else:
raise ValueError("case does not end with '}'")
if caselines[-1].strip() == "DISPATCH();":
caselines.pop()
caselines.append(" }")
case = "\n".join(caselines)
print(case + "\n", file=f)
def write_families(f):
for opcode, specializations in dis._specializations.items():
all = [opcode] + specializations
if len(all) <= 3:
members = ', '.join(all)
print(f"family({opcode.lower()}) = {{ {members} }};", file=f)
else:
print(f"family({opcode.lower()}) = {{", file=f)
for i in range(0, len(all), 3):
members = ', '.join(all[i:i+3])
if i+3 < len(all):
print(f" {members},", file=f)
else:
print(f" {members} }};", file=f)
def compare(oldfile, newfile, quiet=False):
with open(oldfile) as f:
oldlines = f.readlines()
for top, line in enumerate(oldlines):
if line.strip() == START_MARKER:
break
else:
print(f"No start marker found in {oldfile}", file=sys.stderr)
return
del oldlines[:top]
for bottom, line in enumerate(oldlines):
if line.strip() == END_MARKER:
break
else:
print(f"No end marker found in {oldfile}", file=sys.stderr)
return
del oldlines[bottom:]
if not quiet:
print(
f"// {oldfile} has {len(oldlines)} lines after stripping top/bottom",
file=sys.stderr,
)
with open(newfile) as f:
newlines = f.readlines()
if not quiet:
print(f"// {newfile} has {len(newlines)} lines", file=sys.stderr)
for line in difflib.unified_diff(oldlines, newlines, fromfile=oldfile, tofile=newfile):
sys.stdout.write(line)
def main():
args = parser.parse_args()
with eopen(args.input) as f:
cases = read_cases(f)
with open(args.template) as f:
prolog, epilog = f.read().split("// INSERT CASES HERE //", 1)
if not args.quiet:
print(f"// Read {len(cases)} cases from {args.input}", file=sys.stderr)
with eopen(args.output, "w") as f:
f.write(prolog)
write_cases(f, cases)
f.write(epilog)
write_families(f)
if not args.quiet:
print(f"// Wrote {len(cases)} cases to {args.output}", file=sys.stderr)
if args.compare:
compare(args.input, args.output, args.quiet)
if __name__ == "__main__":
main()
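A worked example of the stack-effect analysis above (opcode chosen for illustration): BUILD_LIST pops oparg items and pushes one list, so dis.stack_effect() reports 1 for oparg=0 and one less for each additional oparg. figure_stack_effect("BUILD_LIST") would therefore return (1, -1), and read_cases() would annotate the extracted case roughly like this:

    // stack effect: (__array[oparg] -- __0)
    inst(BUILD_LIST) {
        ... // body copied unchanged from ceval.c
    }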

Tools/cases_generator/generate_cases.py (new file, 125 lines)
"""Generate the main interpreter switch."""
# Write the cases to generated_cases.c.h, which is #included in ceval.c.
# TODO: Reuse C generation framework from deepfreeze.py?
import argparse
import io
import sys
import parser
from parser import InstDef
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument("-i", "--input", type=str, default="Python/bytecodes.c")
arg_parser.add_argument("-o", "--output", type=str, default="Python/generated_cases.c.h")
arg_parser.add_argument("-c", "--compare", action="store_true")
arg_parser.add_argument("-q", "--quiet", action="store_true")
def eopen(filename: str, mode: str = "r"):
if filename == "-":
if "r" in mode:
return sys.stdin
else:
return sys.stdout
return open(filename, mode)
def parse_cases(src: str, filename: str|None = None) -> tuple[list[InstDef], list[parser.Family]]:
psr = parser.Parser(src, filename=filename)
instrs: list[InstDef] = []
families: list[parser.Family] = []
while not psr.eof():
if inst := psr.inst_def():
assert inst.block
instrs.append(InstDef(inst.name, inst.inputs, inst.outputs, inst.block))
elif fam := psr.family_def():
families.append(fam)
else:
raise psr.make_syntax_error(f"Unexpected token")
return instrs, families
def always_exits(block: parser.Block) -> bool:
text = block.text
lines = text.splitlines()
while lines and not lines[-1].strip():
lines.pop()
if not lines or lines[-1].strip() != "}":
return False
lines.pop()
if not lines:
return False
line = lines.pop().rstrip()
# Indent must match exactly (TODO: Do something better)
if line[:12] != " "*12:
return False
line = line[12:]
return line.startswith(("goto ", "return ", "DISPATCH", "GO_TO_", "Py_UNREACHABLE()"))
def write_cases(f: io.TextIOBase, instrs: list[InstDef]):
indent = " "
f.write("// This file is generated by Tools/scripts/generate_cases.py\n")
f.write("// Do not edit!\n")
for instr in instrs:
assert isinstance(instr, InstDef)
f.write(f"\n{indent}TARGET({instr.name}) {{\n")
# input = ", ".join(instr.inputs)
# output = ", ".join(instr.outputs)
# f.write(f"{indent} // {input} -- {output}\n")
assert instr.block
blocklines = instr.block.text.splitlines(True)
# Remove blank lines from ends
while blocklines and not blocklines[0].strip():
blocklines.pop(0)
while blocklines and not blocklines[-1].strip():
blocklines.pop()
# Remove leading '{' and trailing '}'
assert blocklines and blocklines[0].strip() == "{"
assert blocklines and blocklines[-1].strip() == "}"
blocklines.pop()
blocklines.pop(0)
# Remove trailing blank lines
while blocklines and not blocklines[-1].strip():
blocklines.pop()
# Write the body
for line in blocklines:
f.write(line)
assert instr.block
if not always_exits(instr.block):
f.write(f"{indent} DISPATCH();\n")
# Write trailing '}'
f.write(f"{indent}}}\n")
def main():
args = arg_parser.parse_args()
with eopen(args.input) as f:
srclines = f.read().splitlines()
begin = srclines.index("// BEGIN BYTECODES //")
end = srclines.index("// END BYTECODES //")
src = "\n".join(srclines[begin+1 : end])
instrs, families = parse_cases(src, filename=args.input)
ninstrs = nfamilies = 0
if not args.quiet:
ninstrs = len(instrs)
nfamilies = len(families)
print(
f"Read {ninstrs} instructions "
f"and {nfamilies} families from {args.input}",
file=sys.stderr,
)
with eopen(args.output, "w") as f:
write_cases(f, instrs)
if not args.quiet:
print(
f"Wrote {ninstrs} instructions to {args.output}",
file=sys.stderr,
)
if __name__ == "__main__":
main()
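To make the transformation concrete, a minimal sketch of the generator's input and output (assuming an instruction whose block does not always exit, so that DISPATCH() is appended):

    // In Python/bytecodes.c:
    inst(NOP) {
    }

    // Emitted to Python/generated_cases.c.h by write_cases():
    TARGET(NOP) {
        DISPATCH();
    }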

Tools/cases_generator/lexer.py (new file, 257 lines)
# Parser for C code
# Originally by Mark Shannon (mark@hotpy.org)
# https://gist.github.com/markshannon/db7ab649440b5af765451bb77c7dba34
import re
import sys
import collections
from dataclasses import dataclass
def choice(*opts):
return "|".join("(%s)" % opt for opt in opts)
# Regexes
# Longer operators must go before shorter ones.
PLUSPLUS = r'\+\+'
MINUSMINUS = r'--'
# ->
ARROW = r'->'
ELLIPSIS = r'\.\.\.'
# Assignment operators
TIMESEQUAL = r'\*='
DIVEQUAL = r'/='
MODEQUAL = r'%='
PLUSEQUAL = r'\+='
MINUSEQUAL = r'-='
LSHIFTEQUAL = r'<<='
RSHIFTEQUAL = r'>>='
ANDEQUAL = r'&='
OREQUAL = r'\|='
XOREQUAL = r'\^='
# Operators
PLUS = r'\+'
MINUS = r'-'
TIMES = r'\*'
DIVIDE = r'/'
MOD = r'%'
NOT = r'~'
XOR = r'\^'
LOR = r'\|\|'
LAND = r'&&'
LSHIFT = r'<<'
RSHIFT = r'>>'
LE = r'<='
GE = r'>='
EQ = r'=='
NE = r'!='
LT = r'<'
GT = r'>'
LNOT = r'!'
OR = r'\|'
AND = r'&'
EQUALS = r'='
# ?
CONDOP = r'\?'
# Delimiters
LPAREN = r'\('
RPAREN = r'\)'
LBRACKET = r'\['
RBRACKET = r'\]'
LBRACE = r'\{'
RBRACE = r'\}'
COMMA = r','
PERIOD = r'\.'
SEMI = r';'
COLON = r':'
BACKSLASH = r'\\'
operators = { op: pattern for op, pattern in globals().items() if op == op.upper() }
for op in operators:
globals()[op] = op
opmap = { pattern.replace("\\", "") or '\\' : op for op, pattern in operators.items() }
# Macros
macro = r'# *(ifdef|ifndef|undef|define|error|endif|if|else|include|#)'
MACRO = 'MACRO'
id_re = r'[a-zA-Z_][0-9a-zA-Z_]*'
IDENTIFIER = 'IDENTIFIER'
suffix = r'([uU]?[lL]?[lL]?)'
octal = r'0[0-7]+' + suffix
hex = r'0[xX][0-9a-fA-F]+'
decimal_digits = r'(0|[1-9][0-9]*)'
decimal = decimal_digits + suffix
exponent = r"""([eE][-+]?[0-9]+)"""
fraction = r"""([0-9]*\.[0-9]+)|([0-9]+\.)"""
float = '(((('+fraction+')'+exponent+'?)|([0-9]+'+exponent+'))[FfLl]?)'
number_re = choice(octal, hex, float, decimal)
NUMBER = 'NUMBER'
simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
escape_sequence = r"""(\\("""+simple_escape+'|'+decimal_escape+'|'+hex_escape+'))'
string_char = r"""([^"\\\n]|"""+escape_sequence+')'
str_re = '"'+string_char+'*"'
STRING = 'STRING'
char = r'\'.\'' # TODO: escape sequence
CHARACTER = 'CHARACTER'
comment_re = r'//.*|/\*([^*]|\*[^/])*\*/'
COMMENT = 'COMMENT'
newline = r"\n"
matcher = re.compile(choice(id_re, number_re, str_re, char, newline, macro, comment_re, *operators.values()))
letter = re.compile(r'[a-zA-Z_]')
keywords = (
'AUTO', 'BREAK', 'CASE', 'CHAR', 'CONST',
'CONTINUE', 'DEFAULT', 'DO', 'DOUBLE', 'ELSE', 'ENUM', 'EXTERN',
'FLOAT', 'FOR', 'GOTO', 'IF', 'INLINE', 'INT', 'LONG',
'REGISTER', 'OFFSETOF',
'RESTRICT', 'RETURN', 'SHORT', 'SIGNED', 'SIZEOF', 'STATIC', 'STRUCT',
'SWITCH', 'TYPEDEF', 'UNION', 'UNSIGNED', 'VOID',
'VOLATILE', 'WHILE'
)
for name in keywords:
globals()[name] = name
keywords = { name.lower() : name for name in keywords }
def make_syntax_error(
message: str, filename: str, line: int, column: int, line_text: str,
) -> SyntaxError:
return SyntaxError(message, (filename, line, column, line_text))
@dataclass(slots=True)
class Token:
kind: str
text: str
begin: tuple[int, int]
end: tuple[int, int]
@property
def line(self):
return self.begin[0]
@property
def column(self):
return self.begin[1]
@property
def end_line(self):
return self.end[0]
@property
def end_column(self):
return self.end[1]
@property
def width(self):
return self.end[1] - self.begin[1]
def replaceText(self, txt):
assert isinstance(txt, str)
return Token(self.kind, txt, self.begin, self.end)
def __repr__(self):
b0, b1 = self.begin
e0, e1 = self.end
if b0 == e0:
return f"{self.kind}({self.text!r}, {b0}:{b1}:{e1})"
else:
return f"{self.kind}({self.text!r}, {b0}:{b1}, {e0}:{e1})"
def tokenize(src, line=1, filename=None):
linestart = -1
# TODO: finditer() skips over unrecognized characters, e.g. '@'
for m in matcher.finditer(src):
start, end = m.span()
text = m.group(0)
if text in keywords:
kind = keywords[text]
elif letter.match(text):
kind = IDENTIFIER
elif text == '...':
kind = ELLIPSIS
elif text == '.':
kind = PERIOD
elif text[0] in '0123456789.':
kind = NUMBER
elif text[0] == '"':
kind = STRING
elif text in opmap:
kind = opmap[text]
elif text == '\n':
linestart = start
line += 1
kind = '\n'
elif text[0] == "'":
kind = CHARACTER
elif text[0] == '#':
kind = MACRO
elif text[0] == '/' and text[1] in '/*':
kind = COMMENT
else:
lineend = src.find("\n", start)
if lineend == -1:
lineend = len(src)
raise make_syntax_error(f"Bad token: {text}",
filename, line, start-linestart+1, src[linestart:lineend])
if kind == COMMENT:
begin = line, start-linestart
newlines = text.count('\n')
if newlines:
linestart = start + text.rfind('\n')
line += newlines
else:
begin = line, start-linestart
if kind != "\n":
yield Token(kind, text, begin, (line, start-linestart+len(text)))
__all__ = []
__all__.extend([kind for kind in globals() if kind.upper() == kind])
def to_text(tkns: list[Token], dedent: int = 0) -> str:
res: list[str] = []
line, col = -1, 1+dedent
for tkn in tkns:
if line == -1:
line, _ = tkn.begin
l, c = tkn.begin
#assert(l >= line), (line, txt, start, end)
while l > line:
line += 1
res.append('\n')
col = 1+dedent
res.append(' '*(c-col))
res.append(tkn.text)
line, col = tkn.end
return ''.join(res)
if __name__ == "__main__":
import sys
filename = sys.argv[1]
if filename == "-c":
src = sys.argv[2]
else:
src = open(filename).read()
# print(to_text(tokenize(src)))
for tkn in tokenize(src, filename=filename):
print(tkn)

Tools/cases_generator/parser.py (new file, 222 lines)
"""Parser for bytecodes.inst."""
from dataclasses import dataclass, field
from typing import NamedTuple, Callable, TypeVar
import lexer as lx
from plexer import PLexer
P = TypeVar("P", bound="Parser")
N = TypeVar("N", bound="Node")
def contextual(func: Callable[[P], N|None]) -> Callable[[P], N|None]:
# Decorator to wrap grammar methods.
# Resets position if `func` returns None.
def contextual_wrapper(self: P) -> N|None:
begin = self.getpos()
res = func(self)
if res is None:
self.setpos(begin)
return
end = self.getpos()
res.context = Context(begin, end, self)
return res
return contextual_wrapper
class Context(NamedTuple):
begin: int
end: int
owner: PLexer
def __repr__(self):
return f"<{self.begin}-{self.end}>"
@dataclass
class Node:
context: Context|None = field(init=False, default=None)
@property
def text(self) -> str:
context = self.context
if not context:
return ""
tokens = context.owner.tokens
begin = context.begin
end = context.end
return lx.to_text(tokens[begin:end])
@dataclass
class Block(Node):
tokens: list[lx.Token]
@dataclass
class InstDef(Node):
name: str
inputs: list[str] | None
outputs: list[str] | None
block: Block | None
@dataclass
class Family(Node):
name: str
members: list[str]
class Parser(PLexer):
@contextual
def inst_def(self) -> InstDef | None:
if header := self.inst_header():
if block := self.block():
header.block = block
return header
raise self.make_syntax_error("Expected block")
return None
@contextual
def inst_header(self):
# inst(NAME) | inst(NAME, (inputs -- outputs))
# TODO: Error out when there is something unexpected.
# TODO: Make INST a keyword in the lexer.
if (tkn := self.expect(lx.IDENTIFIER)) and tkn.text == "inst":
if (self.expect(lx.LPAREN)
and (tkn := self.expect(lx.IDENTIFIER))):
name = tkn.text
if self.expect(lx.COMMA):
inp, outp = self.stack_effect()
if (self.expect(lx.RPAREN)
and self.peek().kind == lx.LBRACE):
return InstDef(name, inp, outp, [])
elif self.expect(lx.RPAREN):
return InstDef(name, None, None, [])
return None
def stack_effect(self):
# '(' [inputs] '--' [outputs] ')'
if self.expect(lx.LPAREN):
inp = self.inputs() or []
if self.expect(lx.MINUSMINUS):
outp = self.outputs() or []
if self.expect(lx.RPAREN):
return inp, outp
raise self.make_syntax_error("Expected stack effect")
def inputs(self):
# input (, input)*
here = self.getpos()
if inp := self.input():
near = self.getpos()
if self.expect(lx.COMMA):
if rest := self.inputs():
return [inp] + rest
self.setpos(near)
return [inp]
self.setpos(here)
return None
def input(self):
# IDENTIFIER
if (tkn := self.expect(lx.IDENTIFIER)):
if self.expect(lx.LBRACKET):
if arg := self.expect(lx.IDENTIFIER):
if self.expect(lx.RBRACKET):
return f"{tkn.text}[{arg.text}]"
if self.expect(lx.TIMES):
if num := self.expect(lx.NUMBER):
if self.expect(lx.RBRACKET):
return f"{tkn.text}[{arg.text}*{num.text}]"
raise self.make_syntax_error("Expected argument in brackets", tkn)
return tkn.text
if self.expect(lx.CONDOP):
while self.expect(lx.CONDOP):
pass
return "??"
return None
def outputs(self):
# output (, output)*
here = self.getpos()
if outp := self.output():
near = self.getpos()
if self.expect(lx.COMMA):
if rest := self.outputs():
return [outp] + rest
self.setpos(near)
return [outp]
self.setpos(here)
return None
def output(self):
return self.input() # TODO: They're not quite the same.
@contextual
def family_def(self) -> Family | None:
here = self.getpos()
if (tkn := self.expect(lx.IDENTIFIER)) and tkn.text == "family":
if self.expect(lx.LPAREN):
if (tkn := self.expect(lx.IDENTIFIER)):
name = tkn.text
if self.expect(lx.RPAREN):
if self.expect(lx.EQUALS):
if members := self.members():
if self.expect(lx.SEMI):
return Family(name, members)
return None
def members(self):
here = self.getpos()
if tkn := self.expect(lx.IDENTIFIER):
near = self.getpos()
if self.expect(lx.COMMA):
if rest := self.members():
return [tkn.text] + rest
self.setpos(near)
return [tkn.text]
self.setpos(here)
return None
@contextual
def block(self) -> Block:
tokens = self.c_blob()
return Block(tokens)
def c_blob(self):
tokens = []
level = 0
while tkn := self.next(raw=True):
if tkn.kind in (lx.LBRACE, lx.LPAREN, lx.LBRACKET):
level += 1
elif tkn.kind in (lx.RBRACE, lx.RPAREN, lx.RBRACKET):
level -= 1
if level <= 0:
break
tokens.append(tkn)
return tokens
if __name__ == "__main__":
import sys
if sys.argv[1:]:
filename = sys.argv[1]
if filename == "-c" and sys.argv[2:]:
src = sys.argv[2]
filename = None
else:
with open(filename) as f:
src = f.read()
srclines = src.splitlines()
begin = srclines.index("// BEGIN BYTECODES //")
end = srclines.index("// END BYTECODES //")
src = "\n".join(srclines[begin+1 : end])
else:
filename = None
src = "if (x) { x.foo; // comment\n}"
parser = Parser(src, filename)
x = parser.inst_def()
print(x)
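For reference, a few input forms this grammar accepts (instruction and family names are illustrative; note that family_def() in this version takes a bare member list, without braces):

    inst(NOP) { }
    inst(POP_TOP, (value -- )) { Py_DECREF(value); }
    family(call) = CALL, CALL_PY_EXACT_ARGS;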

Tools/cases_generator/plexer.py (new file, 104 lines)
import lexer as lx
Token = lx.Token
class PLexer:
def __init__(self, src: str, filename: str|None = None):
self.src = src
self.filename = filename
self.tokens = list(lx.tokenize(self.src, filename=filename))
self.pos = 0
def getpos(self) -> int:
# Current position
return self.pos
def eof(self) -> bool:
# Are we at EOF?
return self.pos >= len(self.tokens)
def setpos(self, pos: int) -> None:
# Reset position
assert 0 <= pos <= len(self.tokens), (pos, len(self.tokens))
self.pos = pos
def backup(self) -> None:
# Back up position by 1
assert self.pos > 0
self.pos -= 1
def next(self, raw: bool = False) -> Token | None:
# Return next token and advance position; None if at EOF
# TODO: Return synthetic EOF token instead of None?
while self.pos < len(self.tokens):
tok = self.tokens[self.pos]
self.pos += 1
if raw or tok.kind != "COMMENT":
return tok
return None
def peek(self, raw: bool = False) -> Token | None:
# Return next token without advancing position
tok = self.next(raw=raw)
self.backup()
return tok
def maybe(self, kind: str, raw: bool = False) -> Token | None:
# Return next token without advancing position if kind matches
tok = self.peek(raw=raw)
if tok and tok.kind == kind:
return tok
return None
def expect(self, kind: str) -> Token | None:
# Return next token and advance position if kind matches
tkn = self.next()
if tkn is not None:
if tkn.kind == kind:
return tkn
self.backup()
return None
def require(self, kind: str) -> Token:
# Return next token and advance position, requiring kind to match
tkn = self.next()
if tkn is not None and tkn.kind == kind:
return tkn
raise self.make_syntax_error(f"Expected {kind!r} but got {tkn and tkn.text!r}", tkn)
def extract_line(self, lineno: int) -> str:
# Return source line `lineno` (1-based)
lines = self.src.splitlines()
if lineno > len(lines):
return ""
return lines[lineno - 1]
def make_syntax_error(self, message: str, tkn: Token|None = None) -> SyntaxError:
# Construct a SyntaxError instance from message and token
if tkn is None:
tkn = self.peek()
if tkn is None:
tkn = self.tokens[-1]
return lx.make_syntax_error(message,
self.filename, tkn.line, tkn.column, self.extract_line(tkn.line))
if __name__ == "__main__":
import sys
if sys.argv[1:]:
filename = sys.argv[1]
if filename == "-c" and sys.argv[2:]:
src = sys.argv[2]
filename = None
else:
with open(filename) as f:
src = f.read()
else:
filename = None
src = "if (x) { x.foo; // comment\n}"
p = PLexer(src, filename)
while not p.eof():
tok = p.next(raw=True)
left = repr(tok)
right = lx.to_text([tok]).rstrip()
print(f"{left:40.40} {right}")