mirror of https://github.com/python/cpython
GH-98831: "Generate" the interpreter (#98830)
The switch cases (really TARGET(opcode) macros) have been moved from ceval.c to generated_cases.c.h. That file is generated from instruction definitions in bytecodes.c (which impersonates a C file so the C code it contains can be edited without custom support in e.g. VS Code).

The code generator lives in Tools/cases_generator (it has a README.md explaining how it works). The DSL used to describe the instructions is a work in progress, described in https://github.com/faster-cpython/ideas/blob/main/3.12/interpreter_definition.md.

This is surely a work in progress. An easy next step could be auto-generating super-instructions.

**IMPORTANT: Merge Conflicts**

If you get a merge conflict for instruction implementations in ceval.c, your best bet is to port your changes to bytecodes.c. That file looks almost the same as the original cases, except instead of `TARGET(NAME)` it uses `inst(NAME)`, and the trailing `DISPATCH()` call is omitted (the code generator adds it automatically).
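To make that porting rule concrete, here is a minimal sketch of the rewrite as a Python helper; `port_case` and the `NOP` body are hypothetical illustrations, not part of this commit:

    import re

    def port_case(src: str) -> str:
        """Sketch: rewrite a ceval.c-style case into bytecodes.c DSL form."""
        src = re.sub(r"TARGET\((\w+)\)", r"inst(\1)", src)  # TARGET(NAME) -> inst(NAME)
        src = re.sub(r"\n\s*DISPATCH\(\);", "", src)        # drop the trailing DISPATCH()
        return src

    print(port_case("TARGET(NOP) {\n    DISPATCH();\n}"))
    # inst(NOP) {
    # }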
This commit is contained in:
parent 2cfcaf5af6
commit 41bc101dd6
.gitattributes
@@ -82,6 +82,7 @@ Parser/parser.c generated
 Parser/token.c generated
 Programs/test_frozenmain.h generated
 Python/Python-ast.c generated
+Python/generated_cases.c.h generated
 Python/opcode_targets.h generated
 Python/stdlib_module_names.h generated
 Tools/peg_generator/pegen/grammar_parser.py generated
Makefile.pre.in
@@ -1445,7 +1445,19 @@ regen-opcode-targets:
 		$(srcdir)/Python/opcode_targets.h.new
 	$(UPDATE_FILE) $(srcdir)/Python/opcode_targets.h $(srcdir)/Python/opcode_targets.h.new
 
-Python/ceval.o: $(srcdir)/Python/opcode_targets.h $(srcdir)/Python/condvar.h
+.PHONY: regen-cases
+regen-cases:
+	# Regenerate Python/generated_cases.c.h from Python/bytecodes.c
+	# using Tools/cases_generator/generate_cases.py
+	PYTHONPATH=$(srcdir)/Tools/cases_generator \
+	$(PYTHON_FOR_REGEN) \
+	$(srcdir)/Tools/cases_generator/generate_cases.py \
+		-i $(srcdir)/Python/bytecodes.c \
+		-o $(srcdir)/Python/generated_cases.c.h.new
+	$(UPDATE_FILE) $(srcdir)/Python/generated_cases.c.h $(srcdir)/Python/generated_cases.c.h.new
+
+Python/ceval.o: $(srcdir)/Python/opcode_targets.h $(srcdir)/Python/condvar.h $(srcdir)/Python/generated_cases.c.h
 
 Python/frozen.o: $(FROZEN_FILES_OUT)
 
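A Python sketch of what `make regen-cases` boils down to, for running the generator by hand from a CPython checkout root; unlike the Makefile rule it writes the output in place instead of going through the `.new` file and `$(UPDATE_FILE)`:

    import os
    import subprocess

    # Equivalent of `make regen-cases`, run from the repository root.
    subprocess.run(
        ["python3", "Tools/cases_generator/generate_cases.py",
         "-i", "Python/bytecodes.c",
         "-o", "Python/generated_cases.c.h"],
        env={**os.environ, "PYTHONPATH": "Tools/cases_generator"},
        check=True,
    )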
@@ -0,0 +1 @@
We have new tooling, in ``Tools/cases_generator``, to generate the interpreter switch from a list of opcode definitions.
Python/ceval.c (3851 lines changed): file diff suppressed because it is too large.
Two further large file diffs suppressed.
Tools/cases_generator/README.md
@@ -0,0 +1,39 @@
# Tooling to generate interpreters

What's currently here:

- `lexer.py`: lexer for C, originally written by Mark Shannon
- `plexer.py`: OO interface on top of lexer.py; main class: `PLexer`
- `parser.py`: parser for the instruction definition DSL; main class: `Parser`
- `generate_cases.py`: driver script to read `Python/bytecodes.c` and
  write `Python/generated_cases.c.h`

**Temporarily also:**

- `extract_cases.py`: script to extract cases from
  `Python/ceval.c` and write them to `Python/bytecodes.c`
- `bytecodes_template.h`: template used by `extract_cases.py`

The DSL for the instruction definitions in `Python/bytecodes.c` is described
[here](https://github.com/faster-cpython/ideas/blob/main/3.12/interpreter_definition.md).
Note that there is some dummy C code at the top and bottom of the file
to fool text editors like VS Code into believing this is valid C code.

## A bit about the parser

The parser class uses a pretty standard recursive descent scheme,
but with unlimited backtracking.
The `PLexer` class tokenizes the entire input before parsing starts.
We do not run the C preprocessor.
Each parsing method returns either an AST node (a `Node` instance)
or `None`, or raises `SyntaxError` (showing the error in the C source).

Most parsing methods are decorated with `@contextual`, which automatically
resets the tokenizer input position when `None` is returned.
Parsing methods may also raise `SyntaxError`, which is irrecoverable.
When a parsing method returns `None`, it is possible that after backtracking
a different parsing method returns a valid AST.

Neither the lexer nor the parsers are complete or fully correct.
Most known issues are tersely indicated by `# TODO:` comments.
We plan to fix issues as they become relevant.
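As a concrete illustration of the backtracking parser described above, a minimal usage sketch (assuming `Tools/cases_generator` is on `sys.path`):

    from parser import Parser

    p = Parser("inst(NOP) {\n    DISPATCH();\n}")
    node = p.inst_def()   # an InstDef node on success, None after backtracking
    print(node.name)      # -> NOP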
Tools/cases_generator/bytecodes_template.h
@@ -0,0 +1,85 @@
#include "Python.h"
#include "pycore_abstract.h"      // _PyIndex_Check()
#include "pycore_call.h"          // _PyObject_FastCallDictTstate()
#include "pycore_ceval.h"         // _PyEval_SignalAsyncExc()
#include "pycore_code.h"
#include "pycore_function.h"
#include "pycore_long.h"          // _PyLong_GetZero()
#include "pycore_object.h"        // _PyObject_GC_TRACK()
#include "pycore_moduleobject.h"  // PyModuleObject
#include "pycore_opcode.h"        // EXTRA_CASES
#include "pycore_pyerrors.h"      // _PyErr_Fetch()
#include "pycore_pymem.h"         // _PyMem_IsPtrFreed()
#include "pycore_pystate.h"       // _PyInterpreterState_GET()
#include "pycore_range.h"         // _PyRangeIterObject
#include "pycore_sliceobject.h"   // _PyBuildSlice_ConsumeRefs
#include "pycore_sysmodule.h"     // _PySys_Audit()
#include "pycore_tuple.h"         // _PyTuple_ITEMS()
#include "pycore_emscripten_signal.h"  // _Py_CHECK_EMSCRIPTEN_SIGNALS

#include "pycore_dict.h"
#include "dictobject.h"
#include "pycore_frame.h"
#include "opcode.h"
#include "pydtrace.h"
#include "setobject.h"
#include "structmember.h"         // struct PyMemberDef, T_OFFSET_EX

void _PyFloat_ExactDealloc(PyObject *);
void _PyUnicode_ExactDealloc(PyObject *);

#define SET_TOP(v)  (stack_pointer[-1] = (v))
#define PEEK(n)     (stack_pointer[-(n)])

#define GETLOCAL(i) (frame->localsplus[i])

#define inst(name) case name:
#define family(name) static int family_##name

#define NAME_ERROR_MSG \
    "name '%.200s' is not defined"

typedef struct {
    PyObject *kwnames;
} CallShape;

static void
dummy_func(
    PyThreadState *tstate,
    _PyInterpreterFrame *frame,
    unsigned char opcode,
    unsigned int oparg,
    _Py_atomic_int * const eval_breaker,
    _PyCFrame cframe,
    PyObject *names,
    PyObject *consts,
    _Py_CODEUNIT *next_instr,
    PyObject **stack_pointer,
    CallShape call_shape,
    _Py_CODEUNIT *first_instr,
    int throwflag,
    binaryfunc binary_ops[]
)
{
    switch (opcode) {

/* BEWARE!
   It is essential that any operation that fails must goto error
   and that all operations that succeed call DISPATCH()! */

        // BEGIN BYTECODES //
        // INSERT CASES HERE //
        // END BYTECODES //

    }
 error:;
 exception_unwind:;
 handle_eval_breaker:;
 resume_frame:;
 resume_with_error:;
 start_frame:;
 unbound_local_error:;
}

// Families go below this point //
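The three marker comments above are what the tooling keys on: `extract_cases.py` (next) splits this template at the INSERT CASES HERE marker and writes the extracted cases between the two halves, while `generate_cases.py` reads only what sits between the BEGIN/END BYTECODES markers. A sketch of that split (the path follows the `-t` default in extract_cases.py; note the README lists this file with a `.h` extension):

    # Sketch of how the template is consumed (same split as extract_cases.main below):
    with open("Tools/cases_generator/bytecodes_template.c") as f:
        prolog, epilog = f.read().split("// INSERT CASES HERE //", 1)
    # prolog ends inside the switch; epilog closes it and hosts the family() lines.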
Tools/cases_generator/extract_cases.py
@@ -0,0 +1,247 @@
"""Extract the main interpreter switch cases."""

# Reads cases from ceval.c, writes to bytecodes.c.
# (This file is not meant to be compiled, but it has a .c extension
# so tooling like VS Code can be fooled into thinking it is C code.
# This helps editing and browsing the code.)
#
# The script generate_cases.py regenerates the cases.

import argparse
import difflib
import dis
import re
import sys

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input", type=str, default="Python/ceval.c")
parser.add_argument("-o", "--output", type=str, default="Python/bytecodes.c")
parser.add_argument("-t", "--template", type=str, default="Tools/cases_generator/bytecodes_template.c")
parser.add_argument("-c", "--compare", action="store_true")
parser.add_argument("-q", "--quiet", action="store_true")


inverse_specializations = {
    specname: familyname
    for familyname, specnames in dis._specializations.items()
    for specname in specnames
}


def eopen(filename, mode="r"):
    if filename == "-":
        if "r" in mode:
            return sys.stdin
        else:
            return sys.stdout
    return open(filename, mode)


def leading_whitespace(line):
    return len(line) - len(line.lstrip())


def extract_opcode_name(line):
    m = re.match(r"\A\s*TARGET\((\w+)\)\s*{\s*\Z", line)
    if m:
        opcode_name = m.group(1)
        if opcode_name not in dis._all_opmap:
            raise ValueError(f"error: unknown opcode {opcode_name}")
        return opcode_name
    raise ValueError(f"error: no opcode in {line.strip()}")


def figure_stack_effect(opcode_name):
    # Return (se, diff) where se is the stack effect for oparg=0
    # and diff is the increment for oparg=1.
    # If it is irregular or unknown, raise ValueError.
    if m := re.match(r"^(\w+)__(\w+)$", opcode_name):
        # A super-instruction adds the effect of both parts
        first, second = m.groups()
        se1, incr1 = figure_stack_effect(first)
        se2, incr2 = figure_stack_effect(second)
        if incr1 or incr2:
            raise ValueError(f"irregular stack effect for {opcode_name}")
        return se1 + se2, 0
    if opcode_name in inverse_specializations:
        # A specialized instruction maps to its unspecialized instruction
        opcode_name = inverse_specializations[opcode_name]
    opcode = dis._all_opmap[opcode_name]
    if opcode in dis.hasarg:
        try:
            se = dis.stack_effect(opcode, 0)
        except ValueError as err:
            raise ValueError(f"{err} for {opcode_name}")
        if dis.stack_effect(opcode, 0, jump=True) != se:
            raise ValueError(f"{opcode_name} stack effect depends on jump flag")
        if dis.stack_effect(opcode, 0, jump=False) != se:
            raise ValueError(f"{opcode_name} stack effect depends on jump flag")
        for i in range(1, 257):
            if dis.stack_effect(opcode, i) != se:
                return figure_variable_stack_effect(opcode_name, opcode, se)
    else:
        try:
            se = dis.stack_effect(opcode)
        except ValueError as err:
            raise ValueError(f"{err} for {opcode_name}")
        if dis.stack_effect(opcode, jump=True) != se:
            raise ValueError(f"{opcode_name} stack effect depends on jump flag")
        if dis.stack_effect(opcode, jump=False) != se:
            raise ValueError(f"{opcode_name} stack effect depends on jump flag")
    return se, 0


def figure_variable_stack_effect(opcode_name, opcode, se0):
    # Is it a linear progression?
    se1 = dis.stack_effect(opcode, 1)
    diff = se1 - se0
    for i in range(2, 257):
        sei = dis.stack_effect(opcode, i)
        if sei - se0 != diff * i:
            raise ValueError(f"{opcode_name} has irregular stack effect")
    # Assume it's okay for larger oparg values too
    return se0, diff


START_MARKER = "/* Start instructions */"  # The '{' is on the preceding line.
END_MARKER = "/* End regular instructions */"


def read_cases(f):
    cases = []
    case = None
    started = False
    # TODO: Count line numbers
    for line in f:
        stripped = line.strip()
        if not started:
            if stripped == START_MARKER:
                started = True
            continue
        if stripped == END_MARKER:
            break
        if stripped.startswith("TARGET("):
            if case:
                cases.append(case)
            indent = " " * leading_whitespace(line)
            case = ""
            opcode_name = extract_opcode_name(line)
            try:
                se, diff = figure_stack_effect(opcode_name)
            except ValueError as err:
                case += f"{indent}// error: {err}\n"
                case += f"{indent}inst({opcode_name}) {{\n"
            else:
                inputs = []
                outputs = []
                if se > 0:
                    for i in range(se):
                        outputs.append(f"__{i}")
                elif se < 0:
                    for i in range(-se):
                        inputs.append(f"__{i}")
                if diff > 0:
                    if diff == 1:
                        outputs.append(f"__array[oparg]")
                    else:
                        outputs.append(f"__array[oparg*{diff}]")
                elif diff < 0:
                    if diff == -1:
                        inputs.append(f"__array[oparg]")
                    else:
                        inputs.append(f"__array[oparg*{-diff}]")
                input = ", ".join(inputs)
                output = ", ".join(outputs)
                case += f"{indent}// stack effect: ({input} -- {output})\n"
                case += f"{indent}inst({opcode_name}) {{\n"
        else:
            if case:
                case += line
    if case:
        cases.append(case)
    return cases


def write_cases(f, cases):
    for case in cases:
        caselines = case.splitlines()
        while caselines[-1].strip() == "":
            caselines.pop()
        if caselines[-1].strip() == "}":
            caselines.pop()
        else:
            raise ValueError("case does not end with '}'")
        if caselines[-1].strip() == "DISPATCH();":
            caselines.pop()
        caselines.append("        }")
        case = "\n".join(caselines)
        print(case + "\n", file=f)


def write_families(f):
    for opcode, specializations in dis._specializations.items():
        all = [opcode] + specializations
        if len(all) <= 3:
            members = ', '.join(all)
            print(f"family({opcode.lower()}) = {{ {members} }};", file=f)
        else:
            print(f"family({opcode.lower()}) = {{", file=f)
            for i in range(0, len(all), 3):
                members = ', '.join(all[i:i+3])
                if i+3 < len(all):
                    print(f"    {members},", file=f)
                else:
                    print(f"    {members} }};", file=f)


def compare(oldfile, newfile, quiet=False):
    with open(oldfile) as f:
        oldlines = f.readlines()
    for top, line in enumerate(oldlines):
        if line.strip() == START_MARKER:
            break
    else:
        print(f"No start marker found in {oldfile}", file=sys.stderr)
        return
    del oldlines[:top]
    for bottom, line in enumerate(oldlines):
        if line.strip() == END_MARKER:
            break
    else:
        print(f"No end marker found in {oldfile}", file=sys.stderr)
        return
    del oldlines[bottom:]
    if not quiet:
        print(
            f"// {oldfile} has {len(oldlines)} lines after stripping top/bottom",
            file=sys.stderr,
        )
    with open(newfile) as f:
        newlines = f.readlines()
    if not quiet:
        print(f"// {newfile} has {len(newlines)} lines", file=sys.stderr)
    for line in difflib.unified_diff(oldlines, newlines, fromfile=oldfile, tofile=newfile):
        sys.stdout.write(line)


def main():
    args = parser.parse_args()
    with eopen(args.input) as f:
        cases = read_cases(f)
    with open(args.template) as f:
        prolog, epilog = f.read().split("// INSERT CASES HERE //", 1)
    if not args.quiet:
        print(f"// Read {len(cases)} cases from {args.input}", file=sys.stderr)
    with eopen(args.output, "w") as f:
        f.write(prolog)
        write_cases(f, cases)
        f.write(epilog)
        write_families(f)
    if not args.quiet:
        print(f"// Wrote {len(cases)} cases to {args.output}", file=sys.stderr)
    if args.compare:
        compare(args.input, args.output, args.quiet)


if __name__ == "__main__":
    main()
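To make the `(se, diff)` encoding concrete: `dis.stack_effect` gives the net effect for one specific oparg, and the code above probes oparg values 0 through 256 looking for a linear fit. For example (a sketch using values from CPython's `dis` module):

    import dis

    # BUILD_LIST pops oparg items and pushes one list:
    print(dis.stack_effect(dis.opmap["BUILD_LIST"], 0))  # ->  1
    print(dis.stack_effect(dis.opmap["BUILD_LIST"], 3))  # -> -2
    # figure_stack_effect("BUILD_LIST") would report se=1, diff=-1,
    # i.e. one output plus one extra input per unit of oparg.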
Tools/cases_generator/generate_cases.py
@@ -0,0 +1,125 @@
"""Generate the main interpreter switch."""

# Write the cases to generated_cases.c.h, which is #included in ceval.c.

# TODO: Reuse C generation framework from deepfreeze.py?

import argparse
import io
import sys

import parser
from parser import InstDef

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument("-i", "--input", type=str, default="Python/bytecodes.c")
arg_parser.add_argument("-o", "--output", type=str, default="Python/generated_cases.c.h")
arg_parser.add_argument("-c", "--compare", action="store_true")
arg_parser.add_argument("-q", "--quiet", action="store_true")


def eopen(filename: str, mode: str = "r"):
    if filename == "-":
        if "r" in mode:
            return sys.stdin
        else:
            return sys.stdout
    return open(filename, mode)


def parse_cases(src: str, filename: str|None = None) -> tuple[list[InstDef], list[parser.Family]]:
    psr = parser.Parser(src, filename=filename)
    instrs: list[InstDef] = []
    families: list[parser.Family] = []
    while not psr.eof():
        if inst := psr.inst_def():
            assert inst.block
            instrs.append(InstDef(inst.name, inst.inputs, inst.outputs, inst.block))
        elif fam := psr.family_def():
            families.append(fam)
        else:
            raise psr.make_syntax_error("Unexpected token")
    return instrs, families


def always_exits(block: parser.Block) -> bool:
    text = block.text
    lines = text.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()
    if not lines or lines[-1].strip() != "}":
        return False
    lines.pop()
    if not lines:
        return False
    line = lines.pop().rstrip()
    # Indent must match exactly (TODO: Do something better)
    if line[:12] != " "*12:
        return False
    line = line[12:]
    return line.startswith(("goto ", "return ", "DISPATCH", "GO_TO_", "Py_UNREACHABLE()"))


def write_cases(f: io.TextIOBase, instrs: list[InstDef]):
    indent = "        "
    f.write("// This file is generated by Tools/cases_generator/generate_cases.py\n")
    f.write("// Do not edit!\n")
    for instr in instrs:
        assert isinstance(instr, InstDef)
        f.write(f"\n{indent}TARGET({instr.name}) {{\n")
        # input = ", ".join(instr.inputs)
        # output = ", ".join(instr.outputs)
        # f.write(f"{indent}    // {input} -- {output}\n")
        assert instr.block
        blocklines = instr.block.text.splitlines(True)
        # Remove blank lines from both ends
        while blocklines and not blocklines[0].strip():
            blocklines.pop(0)
        while blocklines and not blocklines[-1].strip():
            blocklines.pop()
        # Remove leading '{' and trailing '}'
        assert blocklines and blocklines[0].strip() == "{"
        assert blocklines and blocklines[-1].strip() == "}"
        blocklines.pop()
        blocklines.pop(0)
        # Remove trailing blank lines
        while blocklines and not blocklines[-1].strip():
            blocklines.pop()
        # Write the body
        for line in blocklines:
            f.write(line)
        assert instr.block
        if not always_exits(instr.block):
            f.write(f"{indent}    DISPATCH();\n")
        # Write trailing '}'
        f.write(f"{indent}}}\n")


def main():
    args = arg_parser.parse_args()
    with eopen(args.input) as f:
        srclines = f.read().splitlines()
    begin = srclines.index("// BEGIN BYTECODES //")
    end = srclines.index("// END BYTECODES //")
    src = "\n".join(srclines[begin+1 : end])
    instrs, families = parse_cases(src, filename=args.input)
    ninstrs = nfamilies = 0
    if not args.quiet:
        ninstrs = len(instrs)
        nfamilies = len(families)
        print(
            f"Read {ninstrs} instructions "
            f"and {nfamilies} families from {args.input}",
            file=sys.stderr,
        )
    with eopen(args.output, "w") as f:
        write_cases(f, instrs)
    if not args.quiet:
        print(
            f"Wrote {ninstrs} instructions to {args.output}",
            file=sys.stderr,
        )


if __name__ == "__main__":
    main()
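A quick end-to-end sketch of the pipeline above, feeding a one-instruction definition through `parse_cases` (run from `Tools/cases_generator` so the `parser` import resolves):

    from generate_cases import parse_cases

    instrs, families = parse_cases("inst(NOP) {\n    DISPATCH();\n}")
    print(instrs[0].name)   # -> NOP
    print(families)         # -> []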
Tools/cases_generator/lexer.py
@@ -0,0 +1,257 @@
# Lexer for C code
# Originally by Mark Shannon (mark@hotpy.org)
# https://gist.github.com/markshannon/db7ab649440b5af765451bb77c7dba34

import re
import sys
import collections
from dataclasses import dataclass

def choice(*opts):
    return "|".join("(%s)" % opt for opt in opts)

# Regexes

# Longer operators must go before shorter ones.

PLUSPLUS = r'\+\+'
MINUSMINUS = r'--'

# ->
ARROW = r'->'
ELLIPSIS = r'\.\.\.'

# Assignment operators
TIMESEQUAL = r'\*='
DIVEQUAL = r'/='
MODEQUAL = r'%='
PLUSEQUAL = r'\+='
MINUSEQUAL = r'-='
LSHIFTEQUAL = r'<<='
RSHIFTEQUAL = r'>>='
ANDEQUAL = r'&='
OREQUAL = r'\|='
XOREQUAL = r'\^='

# Operators
PLUS = r'\+'
MINUS = r'-'
TIMES = r'\*'
DIVIDE = r'/'
MOD = r'%'
NOT = r'~'
XOR = r'\^'
LOR = r'\|\|'
LAND = r'&&'
LSHIFT = r'<<'
RSHIFT = r'>>'
LE = r'<='
GE = r'>='
EQ = r'=='
NE = r'!='
LT = r'<'
GT = r'>'
LNOT = r'!'
OR = r'\|'
AND = r'&'
EQUALS = r'='

# ?
CONDOP = r'\?'

# Delimiters
LPAREN = r'\('
RPAREN = r'\)'
LBRACKET = r'\['
RBRACKET = r'\]'
LBRACE = r'\{'
RBRACE = r'\}'
COMMA = r','
PERIOD = r'\.'
SEMI = r';'
COLON = r':'
BACKSLASH = r'\\'

operators = { op: pattern for op, pattern in globals().items() if op == op.upper() }
for op in operators:
    globals()[op] = op
opmap = { pattern.replace("\\", "") or '\\' : op for op, pattern in operators.items() }

# Macros
macro = r'# *(ifdef|ifndef|undef|define|error|endif|if|else|include|#)'
MACRO = 'MACRO'

id_re = r'[a-zA-Z_][0-9a-zA-Z_]*'
IDENTIFIER = 'IDENTIFIER'

suffix = r'([uU]?[lL]?[lL]?)'
octal = r'0[0-7]+' + suffix
hex = r'0[xX][0-9a-fA-F]+'
decimal_digits = r'(0|[1-9][0-9]*)'
decimal = decimal_digits + suffix


exponent = r"""([eE][-+]?[0-9]+)"""
fraction = r"""([0-9]*\.[0-9]+)|([0-9]+\.)"""
float = '((((' + fraction + ')' + exponent + '?)|([0-9]+' + exponent + '))[FfLl]?)'

number_re = choice(octal, hex, float, decimal)
NUMBER = 'NUMBER'

simple_escape = r"""([a-zA-Z._~!=&\^\-\\?'"])"""
decimal_escape = r"""(\d+)"""
hex_escape = r"""(x[0-9a-fA-F]+)"""
escape_sequence = r"""(\\(""" + simple_escape + '|' + decimal_escape + '|' + hex_escape + '))'
string_char = r"""([^"\\\n]|""" + escape_sequence + ')'
str_re = '"' + string_char + '*"'
STRING = 'STRING'
char = r'\'.\''  # TODO: escape sequence
CHARACTER = 'CHARACTER'

comment_re = r'//.*|/\*([^*]|\*[^/])*\*/'
COMMENT = 'COMMENT'

newline = r"\n"
matcher = re.compile(choice(id_re, number_re, str_re, char, newline, macro, comment_re, *operators.values()))
letter = re.compile(r'[a-zA-Z_]')

keywords = (
    'AUTO', 'BREAK', 'CASE', 'CHAR', 'CONST',
    'CONTINUE', 'DEFAULT', 'DO', 'DOUBLE', 'ELSE', 'ENUM', 'EXTERN',
    'FLOAT', 'FOR', 'GOTO', 'IF', 'INLINE', 'INT', 'LONG',
    'REGISTER', 'OFFSETOF',
    'RESTRICT', 'RETURN', 'SHORT', 'SIGNED', 'SIZEOF', 'STATIC', 'STRUCT',
    'SWITCH', 'TYPEDEF', 'UNION', 'UNSIGNED', 'VOID',
    'VOLATILE', 'WHILE',
)
for name in keywords:
    globals()[name] = name
keywords = { name.lower() : name for name in keywords }


def make_syntax_error(
    message: str, filename: str, line: int, column: int, line_text: str,
) -> SyntaxError:
    return SyntaxError(message, (filename, line, column, line_text))


@dataclass(slots=True)
class Token:
    kind: str
    text: str
    begin: tuple[int, int]
    end: tuple[int, int]

    @property
    def line(self):
        return self.begin[0]

    @property
    def column(self):
        return self.begin[1]

    @property
    def end_line(self):
        return self.end[0]

    @property
    def end_column(self):
        return self.end[1]

    @property
    def width(self):
        return self.end[1] - self.begin[1]

    def replaceText(self, txt):
        assert isinstance(txt, str)
        return Token(self.kind, txt, self.begin, self.end)

    def __repr__(self):
        b0, b1 = self.begin
        e0, e1 = self.end
        if b0 == e0:
            return f"{self.kind}({self.text!r}, {b0}:{b1}:{e1})"
        else:
            return f"{self.kind}({self.text!r}, {b0}:{b1}, {e0}:{e1})"


def tokenize(src, line=1, filename=None):
    linestart = -1
    # TODO: finditer() skips over unrecognized characters, e.g. '@'
    for m in matcher.finditer(src):
        start, end = m.span()
        text = m.group(0)
        if text in keywords:
            kind = keywords[text]
        elif letter.match(text):
            kind = IDENTIFIER
        elif text == '...':
            kind = ELLIPSIS
        elif text == '.':
            kind = PERIOD
        elif text[0] in '0123456789.':
            kind = NUMBER
        elif text[0] == '"':
            kind = STRING
        elif text in opmap:
            kind = opmap[text]
        elif text == '\n':
            linestart = start
            line += 1
            kind = '\n'
        elif text[0] == "'":
            kind = CHARACTER
        elif text[0] == '#':
            kind = MACRO
        elif text[0] == '/' and text[1] in '/*':
            kind = COMMENT
        else:
            lineend = src.find("\n", start)
            if lineend == -1:
                lineend = len(src)
            raise make_syntax_error(f"Bad token: {text}",
                filename, line, start-linestart+1, src[linestart:lineend])
        if kind == COMMENT:
            begin = line, start-linestart
            newlines = text.count('\n')
            if newlines:
                linestart = start + text.rfind('\n')
                line += newlines
        else:
            begin = line, start-linestart
        if kind != "\n":
            yield Token(kind, text, begin, (line, start-linestart+len(text)))


__all__ = []
__all__.extend([kind for kind in globals() if kind.upper() == kind])


def to_text(tkns: list[Token], dedent: int = 0) -> str:
    res: list[str] = []
    line, col = -1, 1+dedent
    for tkn in tkns:
        if line == -1:
            line, _ = tkn.begin
        l, c = tkn.begin
        # assert(l >= line), (line, txt, start, end)
        while l > line:
            line += 1
            res.append('\n')
            col = 1+dedent
        res.append(' '*(c-col))
        res.append(tkn.text)
        line, col = tkn.end
    return ''.join(res)


if __name__ == "__main__":
    import sys
    filename = sys.argv[1]
    if filename == "-c":
        src = sys.argv[2]
    else:
        src = open(filename).read()
    # print(to_text(tokenize(src)))
    for tkn in tokenize(src, filename=filename):
        print(tkn)
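A small tokenization sketch (kinds are the uppercase names defined above; exact line/column positions come from `Token.begin` and `Token.end`):

    import lexer

    for tkn in lexer.tokenize("x = x + 1; // incr"):
        print(tkn.kind, repr(tkn.text))
    # IDENTIFIER 'x'
    # EQUALS '='
    # IDENTIFIER 'x'
    # PLUS '+'
    # NUMBER '1'
    # SEMI ';'
    # COMMENT '// incr'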
Tools/cases_generator/parser.py
@@ -0,0 +1,222 @@
"""Parser for bytecodes.inst."""

from dataclasses import dataclass, field
from typing import NamedTuple, Callable, TypeVar

import lexer as lx
from plexer import PLexer


P = TypeVar("P", bound="Parser")
N = TypeVar("N", bound="Node")
def contextual(func: Callable[[P], N|None]) -> Callable[[P], N|None]:
    # Decorator to wrap grammar methods.
    # Resets position if `func` returns None.
    def contextual_wrapper(self: P) -> N|None:
        begin = self.getpos()
        res = func(self)
        if res is None:
            self.setpos(begin)
            return
        end = self.getpos()
        res.context = Context(begin, end, self)
        return res
    return contextual_wrapper


class Context(NamedTuple):
    begin: int
    end: int
    owner: PLexer

    def __repr__(self):
        return f"<{self.begin}-{self.end}>"


@dataclass
class Node:
    context: Context|None = field(init=False, default=None)

    @property
    def text(self) -> str:
        context = self.context
        if not context:
            return ""
        tokens = context.owner.tokens
        begin = context.begin
        end = context.end
        return lx.to_text(tokens[begin:end])


@dataclass
class Block(Node):
    tokens: list[lx.Token]


@dataclass
class InstDef(Node):
    name: str
    inputs: list[str] | None
    outputs: list[str] | None
    block: Block | None


@dataclass
class Family(Node):
    name: str
    members: list[str]


class Parser(PLexer):

    @contextual
    def inst_def(self) -> InstDef | None:
        if header := self.inst_header():
            if block := self.block():
                header.block = block
                return header
            raise self.make_syntax_error("Expected block")
        return None

    @contextual
    def inst_header(self):
        # inst(NAME) | inst(NAME, (inputs -- outputs))
        # TODO: Error out when there is something unexpected.
        # TODO: Make INST a keyword in the lexer.
        if (tkn := self.expect(lx.IDENTIFIER)) and tkn.text == "inst":
            if (self.expect(lx.LPAREN)
                    and (tkn := self.expect(lx.IDENTIFIER))):
                name = tkn.text
                if self.expect(lx.COMMA):
                    inp, outp = self.stack_effect()
                    if (self.expect(lx.RPAREN)
                            and self.peek().kind == lx.LBRACE):
                        return InstDef(name, inp, outp, [])
                elif self.expect(lx.RPAREN):
                    return InstDef(name, None, None, [])
        return None

    def stack_effect(self):
        # '(' [inputs] '--' [outputs] ')'
        if self.expect(lx.LPAREN):
            inp = self.inputs() or []
            if self.expect(lx.MINUSMINUS):
                outp = self.outputs() or []
                if self.expect(lx.RPAREN):
                    return inp, outp
        raise self.make_syntax_error("Expected stack effect")

    def inputs(self):
        # input (',' input)*
        here = self.getpos()
        if inp := self.input():
            near = self.getpos()
            if self.expect(lx.COMMA):
                if rest := self.inputs():
                    return [inp] + rest
            self.setpos(near)
            return [inp]
        self.setpos(here)
        return None

    def input(self):
        # IDENTIFIER
        if (tkn := self.expect(lx.IDENTIFIER)):
            if self.expect(lx.LBRACKET):
                if arg := self.expect(lx.IDENTIFIER):
                    if self.expect(lx.RBRACKET):
                        return f"{tkn.text}[{arg.text}]"
                    if self.expect(lx.TIMES):
                        if num := self.expect(lx.NUMBER):
                            if self.expect(lx.RBRACKET):
                                return f"{tkn.text}[{arg.text}*{num.text}]"
                raise self.make_syntax_error("Expected argument in brackets", tkn)

            return tkn.text
        if self.expect(lx.CONDOP):
            while self.expect(lx.CONDOP):
                pass
            return "??"
        return None

    def outputs(self):
        # output (',' output)*
        here = self.getpos()
        if outp := self.output():
            near = self.getpos()
            if self.expect(lx.COMMA):
                if rest := self.outputs():
                    return [outp] + rest
            self.setpos(near)
            return [outp]
        self.setpos(here)
        return None

    def output(self):
        return self.input()  # TODO: They're not quite the same.

    @contextual
    def family_def(self) -> Family | None:
        here = self.getpos()
        if (tkn := self.expect(lx.IDENTIFIER)) and tkn.text == "family":
            if self.expect(lx.LPAREN):
                if (tkn := self.expect(lx.IDENTIFIER)):
                    name = tkn.text
                    if self.expect(lx.RPAREN):
                        if self.expect(lx.EQUALS):
                            if members := self.members():
                                if self.expect(lx.SEMI):
                                    return Family(name, members)
        return None

    def members(self):
        here = self.getpos()
        if tkn := self.expect(lx.IDENTIFIER):
            near = self.getpos()
            if self.expect(lx.COMMA):
                if rest := self.members():
                    return [tkn.text] + rest
            self.setpos(near)
            return [tkn.text]
        self.setpos(here)
        return None

    @contextual
    def block(self) -> Block:
        tokens = self.c_blob()
        return Block(tokens)

    def c_blob(self):
        tokens = []
        level = 0
        while tkn := self.next(raw=True):
            if tkn.kind in (lx.LBRACE, lx.LPAREN, lx.LBRACKET):
                level += 1
            elif tkn.kind in (lx.RBRACE, lx.RPAREN, lx.RBRACKET):
                level -= 1
                if level <= 0:
                    break
            tokens.append(tkn)
        return tokens


if __name__ == "__main__":
    import sys
    if sys.argv[1:]:
        filename = sys.argv[1]
        if filename == "-c" and sys.argv[2:]:
            src = sys.argv[2]
            filename = None
        else:
            with open(filename) as f:
                src = f.read()
            srclines = src.splitlines()
            begin = srclines.index("// BEGIN BYTECODES //")
            end = srclines.index("// END BYTECODES //")
            src = "\n".join(srclines[begin+1 : end])
    else:
        filename = None
        src = "if (x) { x.foo; // comment\n}"
    parser = Parser(src, filename)
    x = parser.inst_def()
    print(x)
Tools/cases_generator/plexer.py
@@ -0,0 +1,104 @@
import lexer as lx
Token = lx.Token


class PLexer:
    def __init__(self, src: str, filename: str|None = None):
        self.src = src
        self.filename = filename
        self.tokens = list(lx.tokenize(self.src, filename=filename))
        self.pos = 0

    def getpos(self) -> int:
        # Current position
        return self.pos

    def eof(self) -> bool:
        # Are we at EOF?
        return self.pos >= len(self.tokens)

    def setpos(self, pos: int) -> None:
        # Reset position
        assert 0 <= pos <= len(self.tokens), (pos, len(self.tokens))
        self.pos = pos

    def backup(self) -> None:
        # Back up position by 1
        assert self.pos > 0
        self.pos -= 1

    def next(self, raw: bool = False) -> Token | None:
        # Return next token and advance position; None if at EOF
        # TODO: Return synthetic EOF token instead of None?
        while self.pos < len(self.tokens):
            tok = self.tokens[self.pos]
            self.pos += 1
            if raw or tok.kind != "COMMENT":
                return tok
        return None

    def peek(self, raw: bool = False) -> Token | None:
        # Return next token without advancing position
        tok = self.next(raw=raw)
        self.backup()
        return tok

    def maybe(self, kind: str, raw: bool = False) -> Token | None:
        # Return next token without advancing position if kind matches
        tok = self.peek(raw=raw)
        if tok and tok.kind == kind:
            return tok
        return None

    def expect(self, kind: str) -> Token | None:
        # Return next token and advance position if kind matches
        tkn = self.next()
        if tkn is not None:
            if tkn.kind == kind:
                return tkn
            self.backup()
        return None

    def require(self, kind: str) -> Token:
        # Return next token and advance position, requiring kind to match
        tkn = self.next()
        if tkn is not None and tkn.kind == kind:
            return tkn
        raise self.make_syntax_error(f"Expected {kind!r} but got {tkn and tkn.text!r}", tkn)

    def extract_line(self, lineno: int) -> str:
        # Return source line `lineno` (1-based)
        lines = self.src.splitlines()
        if lineno > len(lines):
            return ""
        return lines[lineno - 1]

    def make_syntax_error(self, message: str, tkn: Token|None = None) -> SyntaxError:
        # Construct a SyntaxError instance from message and token
        if tkn is None:
            tkn = self.peek()
        if tkn is None:
            tkn = self.tokens[-1]
        return lx.make_syntax_error(message,
            self.filename, tkn.line, tkn.column, self.extract_line(tkn.line))


if __name__ == "__main__":
    import sys
    if sys.argv[1:]:
        filename = sys.argv[1]
        if filename == "-c" and sys.argv[2:]:
            src = sys.argv[2]
            filename = None
        else:
            with open(filename) as f:
                src = f.read()
    else:
        filename = None
        src = "if (x) { x.foo; // comment\n}"
    p = PLexer(src, filename)
    while not p.eof():
        tok = p.next(raw=True)
        left = repr(tok)
        right = lx.to_text([tok]).rstrip()
        print(f"{left:40.40} {right}")
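A usage sketch of the cursor API above (again assuming `Tools/cases_generator` is on `sys.path`):

    from plexer import PLexer

    p = PLexer("x + 1", None)
    print(p.require("IDENTIFIER").text)  # -> x
    print(p.peek().kind)                 # -> PLUS
    print(p.expect("NUMBER"))            # -> None (next token is PLUS; position is restored)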