The Compiler¶
The compilation process in MicroPython involves the following steps:
- The lexer converts the stream of text that makes up a MicroPython program into tokens.
- The parser then converts the tokens into an abstract syntax (parse tree).
- Then bytecode or native code is emitted based on the parse tree.
For purposes of this discussion, we are going to add a simple language feature add1
that can be use in Python as:
>>> add1 3
4
>>>
The add1 statement takes an integer as argument and adds 1 to it.
Adding a grammar rule¶
MicroPython’s grammar is based on the CPython grammar
and is defined in py/grammar.h. This grammar is what is used to parse MicroPython source files.
There are two macros you need to know to define a grammar rule i.e DEF_RULE or DEF_RULE_NC.
DEF_RULE allows you to define a rule with a compile function.
A simple grammar definition with a compile function looks like the following:
DEF_RULE(add1_stmt, c(add1_stmt), and(2), tok(KW_ADD))
The second argument c(add1_stmt) is the corresponding compile function that should be implemented
in py/compile.c for this rule.
The third important argument can be or or and. This specifies the number of nodes associated
with a statement. For example, in this case, our add1 statement is similar to ADD1 in assembly
language. It takes one numeric argument. Therefore, the add1_stmt has two nodes associated with it.
One for the statement itself i.e add1 and the other for its argument.
Note
The add1 rule here is just an example and not part of the standard
MicroPython grammar.
Finally, the fourth argument in this example is the token associated with the rule. This token should be
defined in the lexer by editing py/lexer.h.
Defining the same rule without a compile function is achieved by using the DEF_RULE_NC macro
and omitting the compile function argument:
DEF_RULE_NC(add1_stmt, and(2), tok(KW_ADD))
The remaining arguments take on the same meaning.
Note
The macro DEF_RULE or DEF_RULE_NC takes other arguments. For an in-depth understanding of
supported parameters, see py/grammar.h.
Adding a lexical token¶
Every rule defined in the grammar should have a token associated with it that is defined in py/lexer.h.
Add this token by editing the _mp_token_kind_t enum:
typedef enum _mp_token_kind_t {
...
...
P_TOKEN_KW_OR,
MP_TOKEN_KW_PASS,
MP_TOKEN_KW_RAISE,
MP_TOKEN_KW_RETURN,
MP_TOKEN_KW_TRY,
MP_TOKEN_KW_WHILE,
MP_TOKEN_KW_WITH,
MP_TOKEN_KW_YIELD,
MP_TOKEN_KW_ADD1,
} mp_token_kind_t;
Since, we are adding a keyword, edit py/lexer.c to add the new keyword:
STATIC const char *const tok_kw[] = {
"False",
"None",
"True",
"__debug__",
"and",
"as",
"assert",
#if MICROPY_PY_ASYNC_AWAIT
"async",
"await",
#endif
"break",
"class",
"continue",
"def",
"del",
"elif",
"else",
"except",
"finally",
"for",
"from",
"global",
"if",
"import",
"in",
"is",
"lambda",
"nonlocal",
"not",
"or",
"pass",
"raise",
"return",
"try",
"while",
"with",
"yield",
"add1",
};
Notice the keyword is named depending on what you want it to be. For consistency, maintain the naming standard accordingly.
Note
The order of these keywords in py/lexer.c should match the order of tokens in the enum
defined in py/lexer.h.
Parsing¶
The parser takes the tokens produced by the lexer and converts them to an abstract syntax tree (AST) or
parse tree. The implementation for the parser is defined in py/parse.c.
The parser also maintains a table of constants for use in different aspects of parsing, similar to what a symbol table does.
Several optimizations like constant folding on integers for all operations i.e logical, binary, unary, etc, optimizing enhancements on parenthesis around expressions are performed during this phase and optimizations on strings.
It’s worth noting that docstrings are discarded and not accessible to the compiler. Even optimizations like string interning are not applied to docstrings.
Compiler passes¶
Like many compilers, MicroPython compiles all code to MicroPython bytecode or native code. The functionality
that achieves this is implemented in py/compile.c. The most relevant method you should know about is this:
mp_raw_code_t *mp_compile_to_raw_code(mp_parse_tree_t *parse_tree,
qstr source_file, bool is_repl)
The compiler compiles the code in several passes.
First pass¶
In the first pass, the compiler computes the stack sizes in scope:
// compile pass 1
comp->emit = emit_bc;
#if MICROPY_EMIT_NATIVE
comp->emit_method_table = &emit_bc_method_table;
#endif
uint max_num_labels = 0;
for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) {
#if MICROPY_EMIT_INLINE_ASM
if (s->emit_options == MP_EMIT_OPT_ASM) {
compile_scope_inline_asm(comp, s, MP_PASS_SCOPE);
} else
#endif
{
compile_scope(comp, s, MP_PASS_SCOPE);
// Check if any implicitly declared variables should be closed over
for (size_t i = 0; i < s->id_info_len; ++i) {
id_info_t *id = &s->id_info[i];
if (id->kind == ID_INFO_KIND_GLOBAL_IMPLICIT) {
scope_check_to_close_over(s, id);
}
}
}
..
}
Other computations regarding scopes and identifiers are computed too. At this point, the number of labels that will be required in the emitted code is also determined and set.
Second and third passes¶
The second and third passes involve computing the code size and emitting the inline assembler code for
the different architectures:
// compile pass 2 and 3
#if MICROPY_EMIT_NATIVE
emit_t *emit_native = NULL;
#endif
for (scope_t *s = comp->scope_head; s != NULL && comp->compile_error == MP_OBJ_NULL; s = s->next) {
#if MICROPY_EMIT_INLINE_ASM
if (s->emit_options == MP_EMIT_OPT_ASM) {
// inline assembly
if (comp->emit_inline_asm == NULL) {
comp->emit_inline_asm = ASM_EMITTER(new)(max_num_labels);
}
comp->emit = NULL;
comp->emit_inline_asm_method_table = ASM_EMITTER_TABLE;
compile_scope_inline_asm(comp, s, MP_PASS_CODE_SIZE);
#if MICROPY_EMIT_INLINE_XTENSA
// Xtensa requires an extra pass to compute size of l32r const table
// TODO this can be improved by calculating it during SCOPE pass
// but that requires some other structural changes to the asm emitters
#if MICROPY_DYNAMIC_COMPILER
if (mp_dynamic_compiler.native_arch == MP_NATIVE_ARCH_XTENSA)
#endif
{
compile_scope_inline_asm(comp, s, MP_PASS_CODE_SIZE);
}
#endif
if (comp->compile_error == MP_OBJ_NULL) {
compile_scope_inline_asm(comp, s, MP_PASS_EMIT);
}
} else
The inline assembler code comprises assembly instructions in a Python function. See the inline assembler tutorial for more details.
Fourth, fifth and sixth passes¶
The other two passes compute the stack and code size, while the last pass emits the final code:
compile_scope(comp, s, MP_PASS_STACK_SIZE);
if (comp->compile_error == MP_OBJ_NULL) {
compile_scope(comp, s, MP_PASS_CODE_SIZE);
}
if (comp->compile_error == MP_OBJ_NULL) {
compile_scope(comp, s, MP_PASS_EMIT);
}
Before these passes, there is a selection for the type of code to be emitted which can either be native or bytecode.
switch (s->emit_options) {
#if MICROPY_EMIT_NATIVE
case MP_EMIT_OPT_NATIVE_PYTHON:
case MP_EMIT_OPT_VIPER:
if (emit_native == NULL) {
emit_native = NATIVE_EMITTER(new)(&comp->compile_error, &comp->next_label, max_num_labels);
}
comp->emit_method_table = NATIVE_EMITTER_TABLE;
comp->emit = emit_native;
break;
#endif // MICROPY_EMIT_NATIVE
default:
comp->emit = emit_bc;
#if MICROPY_EMIT_NATIVE
comp->emit_method_table = &emit_bc_method_table;
#endif
break;
}
The bytecode option is the default but something unique to note for the native code option is that there is another
option via VIPER. See the Emitting native code section for
more details on viper annotations.
Emitting bytecode¶
For every statement in the code, there is a corresponding function to emit MicroPython bytecode.
This function should be written in py/emitbc.c. The implementation of this function looks similar
to this:
void mp_emit_bc_yield(emit_t *emit, int kind) {
MP_STATIC_ASSERT(MP_BC_YIELD_VALUE + 1 == MP_BC_YIELD_FROM);
emit_write_bytecode_byte(emit, -kind, MP_BC_YIELD_VALUE + kind);
emit->scope->scope_flags |= MP_SCOPE_FLAG_GENERATOR;
}
We use the yield statement for an example here but the implementation details are similar for other statements.
The method emit_write_bytecode_byte() is a wrapper around the main function emit_get_cur_to_write_bytecode()
that all functions must call to emit byte code.
Emitting native code¶
Similar to how bytecode is generated, there should be a corresponding function in py/emitnative.c for each
code statement:
STATIC void emit_native_yield(emit_t *emit, int kind) {
// Note: 1 (yield) or 3 (yield from) labels are reserved for this function, starting at *emit->label_slot
if (emit->do_viper_types) {
mp_raise_NotImplementedError(MP_ERROR_TEXT("native yield"));
}
emit->scope->scope_flags |= MP_SCOPE_FLAG_GENERATOR;
need_stack_settled(emit);
if (kind == MP_EMIT_YIELD_FROM) {
// Top of yield-from loop, conceptually implementing:
// for item in generator:
// yield item
// Jump to start of loop
emit_native_jump(emit, *emit->label_slot + 2);
// Label for top of loop
emit_native_label_assign(emit, *emit->label_slot + 1);
}
// Save pointer to current stack position for caller to access yielded value
emit_get_stack_pointer_to_reg_for_pop(emit, REG_TEMP0, 1);
emit_native_mov_state_reg(emit, OFFSETOF_CODE_STATE_SP, REG_TEMP0);
// Put return type in return value slot
ASM_MOV_REG_IMM(emit->as, REG_TEMP0, MP_VM_RETURN_YIELD);
ASM_MOV_LOCAL_REG(emit->as, LOCAL_IDX_RET_VAL(emit), REG_TEMP0);
// Save re-entry PC
ASM_MOV_REG_PCREL(emit->as, REG_TEMP0, *emit->label_slot);
emit_native_mov_state_reg(emit, LOCAL_IDX_GEN_PC(emit), REG_TEMP0);
// Jump to exit handler
ASM_JUMP(emit->as, emit->exit_label);
// Label re-entry point
mp_asm_base_label_assign(&emit->as->base, *emit->label_slot);
// Re-open any active exception handler
if (emit->exc_stack_size > 0) {
// Find innermost active exception handler, to restore as current handler
exc_stack_entry_t *e = &emit->exc_stack[emit->exc_stack_size - 1];
for (; e >= emit->exc_stack; --e) {
if (e->is_active) {
// Found active handler, get its PC
ASM_MOV_REG_PCREL(emit->as, REG_RET, e->label);
ASM_MOV_LOCAL_REG(emit->as, LOCAL_IDX_EXC_HANDLER_PC(emit), REG_RET);
break;
}
}
}
emit_native_adjust_stack_size(emit, 1); // send_value
if (kind == MP_EMIT_YIELD_VALUE) {
// Check LOCAL_IDX_EXC_VAL for any injected value
ASM_MOV_REG_LOCAL(emit->as, REG_ARG_1, LOCAL_IDX_EXC_VAL(emit));
emit_call(emit, MP_F_NATIVE_RAISE);
} else {
// Label loop entry
emit_native_label_assign(emit, *emit->label_slot + 2);
// Get the next item from the delegate generator
vtype_kind_t vtype;
emit_pre_pop_reg(emit, &vtype, REG_ARG_2); // send_value
emit_access_stack(emit, 1, &vtype, REG_ARG_1); // generator
ASM_MOV_REG_LOCAL(emit->as, REG_ARG_3, LOCAL_IDX_EXC_VAL(emit)); // throw_value
emit_post_push_reg(emit, VTYPE_PYOBJ, REG_ARG_3);
emit_get_stack_pointer_to_reg_for_pop(emit, REG_ARG_3, 1); // ret_value
emit_call(emit, MP_F_NATIVE_YIELD_FROM);
// If returned non-zero then generator continues
ASM_JUMP_IF_REG_NONZERO(emit->as, REG_RET, *emit->label_slot + 1, true);
// Pop exhausted gen, replace with ret_value
emit_native_adjust_stack_size(emit, 1); // ret_value
emit_fold_stack_top(emit, REG_ARG_1);
}
}
The difference here is that we have to handle viper typing. Viper annotations allow us to emit more than one type of object. By default, Python objects are emitted but with Viper, something can be declared as a Python object or any type. Viper is therefore a subset of Python objects, in fact, if anything is declared in Viper as a Python object, it acts as native Python. Viper typing may break Python equivalence as integers become native integers and not Python objects.