Introduction
Dynamic languages such as Python are built on top of an interpreter that is able to understand a broad variety of bytecode instructions allowing them to replicate algorithms and process data. This design makes programs based on interpreter languages well-suited for platform independency and allows fast iterations in development.
lovm2
- love em too - is a small language building framework written in Rust featuring a dead-simple stack-based virtual machine doing exactly that. Furthermore, it comes with tools for generating said bytecode out of the box allowing you to rapidly prototype your own coding language without a hassle. There are no advanced concepts to care about. No polymorphism, closures, asynchronous runtime... just straightforward functions, variables and data structures.
The static lovm2
library is so tiny that compiling it into your language yields almost no overhead and also makes it applicable for usage inside a Python environment via pylovm2.
The project is in an early development stage and no API is stable yet. Feel free to contribute.
Setup
-
Modify your
Cargo.toml
.lovm2 = { git = "https://github.com/lausek/lovm2", rev = "7472952" }
-
Run
cargo update
. -
Import
lovm2
components into scope usinguse lovm2::prelude::*;
.
Concepts
This chapter aims to give you a brief overview of the internal principles. Even though lovm2
is designed to be as simple as possible, it is still quite important to grasp the implementation concepts behind it.
The general steps of producing a runnable program are roughly:
- Create a new
LV2ModuleBuilder
and populate it withLV2Function
- Call
module_builder.build()
constructing a runnableLV2Module
from the current state - Load the module into an instance of a virtual machine
LV2Vm
usingadd_main_module()
- Start the program by calling
run
on the virtual machine
Bytecode
lovm2
is centered around the value stack. This data structure has a variety of tasks such as passing parameters to functions or acting as a computation buffer for operations. There are instructions that put values on top of the stack like CPush
, LPush
, and GPush
. Some just take a value off and store it somewhere like LMove
, GMove
. Almost all other instructions will take a given amount of values from it and leave a return value in place. You can also implement your own bytecode instructions (see Interrupt chapter for more).
For example, the term 1 + (2 * 3)
will be compiled to this sequence:
instruction | stack
----------------------------
CPush | [1]
CPush | [1, 2]
CPush | [1, 2, 3]
Mul | [1, 6]
Add | [7]
You do not need to micromanage the bytecode itself. There are common language constructs with which you can built pretty much everything. These constructs are composed into a high-level intermediate representation so every new function gets its own LV2Function
structure. Below you can see the transformation process of a function into a runnable LV2CodeObject
where each arrow means "lowering" to the next level.
LV2Function -> LIR -> LV2CodeObject
LIR
or low-level intermediate representation is not directly exposed to the user but it is quite crucial if you want to understand how the bytecode generator works under the hood.
LV2CodeObject
's on their own are already valid programs bundling together instructions, constants and variables. As usual in every language, functions can be grouped together and exposed to the virtual machine in a construct called LV2Module
. Introducing this module level abstraction makes sense as we will later work with collections that are not backed by lovm2
bytecode but native functions.
LV2Function -> LIR
\
LV2Function -> LIR -> LV2CodeObject -> LV2Module --> LV2Vm
/
LV2Function -> LIR
Modules
While you are already familiar with lovm2
's own representation of executable code, LV2Module
is far more abstract under the hood. lovm2
is able to load specifically compiled shared objects - or DLLs as you would call them in the Windows world - at runtime and execute real native functions as well.
And that's not all. As long as your structure complies with the LV2CallProtocol
trait you are free to even implement native functions inside your own compiler or runtime. This job can be done using the lovm2_extend package which allows you to write your own modules in Rust.
Types
lovm2
embraces dynamic typing. The most basic ones are Bool
, Int
, Float
, String
.
Nil
is the default return type of functions that do not have return values. You can also use it to mark the absence of a value.
Ref
can wrap any other value and implement a shared mutable location as such.
Complex Types
List
and Dict
are a bit more complicated, because they need to store other values. As such, they support the contains
, len
, get
, set
and delete
methods.
These types also utilize Ref
heavily. If you use the standard lovm2
functionality for generating programs, you will always implicitly work with a Ref
to the corresponding data. The virtual machine also ensures that every value being stored inside these types is itself wrapped up in a reference. This is required for the implementation of slices. The Box
instruction realizes this functionality.
Another special value kind is Any
. This type is used for allowing external shared object modules to pass their custom Rust structures into the VM.
Conversion
The Conv
instruction is able to convert data according to the following rules:
from / to | Nil | Bool | Int | Float | String | List | Dict | Ref |
---|---|---|---|---|---|---|---|---|
Nil | ✓ | ✓ | ||||||
Bool | ✓ | ✓ | ✓ | |||||
Int | ✓ | ✓ | ✓ | ✓ | ||||
Float | ✓ | ✓ | ✓ | |||||
String | ✓¹ | ✓¹ | ✓ | |||||
List | ✓ | ✓ | ||||||
Dict | ✓ | ✓ | ||||||
Ref | ✓ | ✓ |
¹ implies parsing overhead
Building Programs
This chapter will show you how to utilize the gen
module in order to compile your own lovm2
programs. It is also possible to persist compiled modules onto your disk and load them later.
use lovm2::prelude::*; fn main() { let mut builder = LV2ModuleBuilder::new(); // creates the entry point function and trigger a debug interrupt. builder .entry() .trigger(10); let module = builder.build().except("compile error"); println!("{}", module); }
The main generation functionality is exposed via LV2Block
and every structure that contains it like LV2Branch
, LV2Repeat
and functions. You can use these methods on all of them:
step(..)
append a new statement to the block.branch()
create a new branch at the current position. This returns aLV2BranchBuilder
.repeat()
andrepeat_until(..)
which return a mutable reference to a new block. The first variant is an endless loop, while the latter supports breaking once a condition is met.return_nil()
,return_value(<expr>)
return early from a block.assign(<var>, <expr>)
stores an evaluated expression in<var>
.global(<var>)
,local(<var>)
change scope of<var>
.
Functions
The whole LV2ModuleBuilder
is centered around the creation of LV2Function
s. The resulting bytecode is able to process a given amount of parameters and leave a return value in place.
As you can see in this example listing, you should not need to create such data manually as there is functionality for adding it to the builder directly.
use lovm2::prelude::*; fn main() { // creates a hir with no arguments let fn_no_args = builder.add("fn1"); // creates a hir that expects parameter n let fn_with_args = builder.add_with_args("fn2", &[lv2_var!(n)]); }
To return from function, call .return_value(expr)
on the block specifying the returned value or .return_nil()
if no value is produced.
Due to the convention that every function has to return a value, an implicit nil push is appended if the last instruction is not a return already.
Helper Macros
There are a bunch of macros inside the prelude that trivialize creating more complicated lovm2
constructs for developers.
lv2_var!(ident, ...)
turns all the identifiers given into the special typeLV2Variable
which is needed basically everywhere. If more than one ident is declared, this returns a tuple.lv2_dict!(ident => expr, ...)
creates anLV2Expr
that will dynamically initialize a dictionary with the key-values pairs specified.lv2_list!(item, ...)
creates anLV2Expr
that initializes a list dynamically.lv2_call!(ident, ... args)
syntactic sugar for theLV2Call
element.lv2_access!(ident, ... keys)
syntactic sugar for retrieving items of a dictionary or list.
Expressions
The LV2Expr
structure represents any computation that leads to a new value on the stack. Expressions can be nested arbitrarily.
To give you an overview of what an expression could look like, here is the stripped down version of its actual implementation.
#![allow(unused)] fn main() { pub enum LV2Expr { // a constant value Value, // variable in read position Variable, // a branch that evaluates expressions conditionally Branch, // call to a function Call, // operations with one operand Operation1, // operations with two operands Operation2, // ... and many more. } }
Resolvement of Variables
The scope of a variable is determined by the previously emitted Global
and Local
declarations.
In a block, the variable's scope is not fixed but can change after each statement.
Behavior before v0.5.0: While lowering, the runtime keeps track of locally assigned identifiers. This is crucial for determining the scope during read access later. If a variable is not known locally, a fallback to global scope happens.
Example
We want to transform the formula (1 + 2) * f(2)
to lovm2
bytecode. Use a parser of your choice to generate an abstract syntax tree from the textual representation. Note that lovm2
does not care about operator priorities so its your parsers duty to correctly handle them. After processing the input, your AST should look something like this:
Value(1)
\
-- Operation(+)
/ \
Value(2) -- Operation(*)
/
Call(f, 2)
And here is the compiletime representation of said formula:
#![allow(unused)] fn main() { let formula = LV2Expr::from(1) .add(2) .mul(LV2Call::new("f").arg(2)); }
The unoptimized LIR
now looks like this:
CPush(1)
CPush(2)
Operator2(Add)
CPush(2)
Call(f, 1)
Operator2(Mul)
Assignment
Whenever a function is called, the call stack is adjusted guaranteeing that no local variables will interfere each other. To create or change a local variable, it is sufficient to use this construct:
use lovm2::prelude::*; fn main() { let n = &lv2_var!(n); builder .entry() .assign(n, expr) }
It is possible to store data in the global scope allowing values to live across function calls. This requires usage of the following assignment variant:
use lovm2::prelude::*; fn main() { let n = &lv2_var!(n); builder .entry() .global(n) .assign(n, expr) }
block.increment(n)
and block.decrement(n)
are also quite handy for updating variable values.
There is even a way of setting the values on lists and dictionaries. Under the hood, Set
is actually expecting a Ref
as the target location - which is retrieved by Get
- and overwrites the value inside. This is compatible with the way dictionaries and lists are internally constructed.
use lovm2::prelude::*; fn main() { let (point, x, y) = &lv2_var!(point, x, y); builder .entry() .assign(point, LV2Expr::dict()) .set(LV2Expr::from(point).get(x), 4) .set(LV2Expr::from(point).get(y), 2); }
Branching
While working on your functions hir, you can call the .branch()
method to create a point of conditional execution. A branch can have several conditions with associated blocks and at most one default condition that is always met. Branches with just a default_condition
are not allowed.
#![allow(unused)] fn main() { let main_hir = builder.entry(); // ... let equal_check = main_hir.branch(); equal_check .add_condition(expr) .step(...); equal_check .default_condition() .step(...); }
Expression Branches
Regular branches are considered statements and do not leave values on stack as such. However, there is a special expression variant LV2Expr::branch()
that lets you create the adhoc version of a ternary operator.
#![allow(unused)] fn main() { // f(n) = n == 2 ? "a" : "b" let n = lv2_var!(n); let f_body = LV2Expr::branch().add_condition(LV2Expr::from(n).eq(2), "a").default_value("b"); builder .add_with_args("f", vec![n]) .return_value(f_body); }
Note:
LV2ExprBranch
must always return a value meaning that you have to call thedefault_value
method at least once. Otherwise the compiler will complain that a structure of typeLV2ExprBranchIncomplete
cannot be converted into an expression. This is a compile time check to ensure that those branches always evaluate to a value.
Example
Let's implement a function that returns 1 if the given value is equal to 2 and 0 otherwise. From a Rust perspective, we could generate the function like this:
#![allow(unused)] fn main() { let main = builder.add_with_args("is_2", vec![n.clone()]); let branch = main.branch(); branch .add_condition(LV2Expr::from(n).eq(2)) .return_value(1); branch .default_condition() .return_value(0); }
The representation will be translated temporarily to the following optimized LIR. As you can see, a lot of labels (prefixed with a .
) got involved right now. Everything between .cond_0
's start and end label is derived from our first conditions predicate and body. The JumpIfFalse
instruction separates them by making sure that the body will be skipped if the expression evaluates to false. As usual, whenever we hit a return instruction, the function will terminate assuring that we will not fall through into our default branch.
is_2:
Store(n)
.branch_start:
.cond_start:
Push(n)
CPush(2)
Operator2(Equal)
JumpIfFalse(.cond_end)
CPush(1)
Ret
.cond_end:
CPush(0)
Ret
In the last lowering step, there are only two things left to do: resolving label offsets and patching them into the jump instructions.
Repeating
Loops can be created inside blocks using the .repeat()
and .repeat_until(..)
methods.
#![allow(unused)] fn main() { let main_hir = builder.entry(); let repeat_endless = main_hir.repeat(); // run until the condition is true let repeat_until = main_hir.repeat_until(expr); }
To avoid namespace collissions in Rust, there is no while
construct but you can simply create it via .repeat_until(LV2Expr::from(..).not())
. The optimizer makes sure that no instruction overhead is generated.
Inside repeat blocks you are free to use .break_repeat()
and .continue_repeat()
to precisely control the flow. As in every programming language, Break
terminates the loop while Continue
jumps to its start again.
The .repeat_iterating(collection, item)
constructor is able to consecutively assign every entity to the variable passed as item
as long as the collection
value supports iteration. Check the Iterators chapter if you want to find out more about this.
#![allow(unused)] fn main() { // repeat for all items in an iterator, assign item to variable `i` let repeat_iterating = main_hir.repeat_iterating(lv2_list!(1, 2, 3), lv2_var!(i)); // ... and this is the elaborate variant let it = lv2_list!(1, 2, 3).to_iter(); let repeat_iterating = main_hir.repeat_iterating(it, lv2_var!(i)); }
Example
We want to print the odd numbers between 0 and 10. This is an unbeautified implementation in pythonic pseudocode.
i = 0
while True:
if i == 10:
break
i += 1
if i % 2 == 0:
continue
print(i)
Translating it into a LV2Function
one by one could look like this:
#![allow(unused)] fn main() { use lovm2::prelude::LV2Expr; let i = &lv2_var!(i); // i = 0 main_hir.assign(i, 0); let repeat = main_hir.repeat(); // if i == 10: break repeat .branch() .add_condition(LV2Expr::from(i).eq(10)) .break_repeat(); // i += 1 repeat.increment(i); // if i % 2 == 0: continue repeat .branch() .add_condition(LV2Expr::from(i).rem(2).eq(0)) .continue_repeat(); // print(i) repeat.step(lv2_call!(print, i, "\n")); }
You can imagine that the resulting LIR
is a lot more elaborate than the previous examples so we will only focus on the essential parts. From a intermediate perspective an endless loop is implemented by appending an unconditional jump to the loops body.
main:
.repeat_start:
...
Jump(.repeat_start)
.repeat_end:
To terminate the loop once the counter variable reaches 10, we add a conditional break to the body. This is solely a jump targeting the loops end label.
.cond_start:
Push(i)
CPush(10)
Operator2(Equal)
JumpIfFalse(.cond_end)
Jump(.repeat_end)
.cond_end:
On the other hand Continue
targets the start label.
Iterators
Iterators are special values that are used to step through the items of Str
, List
and Dict
. Creating them from a HIR
perspective is as easy as calling .to_iter()
on the source expression. Depending on the base collection, the return type of LV2Expr::from(it).next(..)
will change:
Str
: Characters of the string.Dict
: AList
in which the first item is thekey
and the second item is thevalue
.
Only List
causes the iterator to return a contained value as is.
Ranges
You can also use LV2Expr::iter_ranged(from, to)
if you want to have a sequence of numbers. Note that from
is inclusive and to
is exclusive.
#![allow(unused)] fn main() { let it1 = LV2Expr::iter_ranged(0, 5); // => [0, 1, 2, 3, 4] let it2 = LV2Expr::iter_ranged(-5, 5); // => [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4] // passing an iterator expression into `Iter::reverse` does // exactly what you would expect let it3 = LV2Expr::create_ranged(0, 5).reverse(); // => [4, 3, 2, 1, 0] // you can also omit `from` and `to` by passing a `Nil` value let it4 = LV2Expr::iter_ranged(Value::Nil, 5) // => [0, 1, 2, 3, 4] let it5 = LV2Expr::iter_ranged(10, Value::Nil) // => [10, 11, 12, ...] }
Example
If you want to control the iterator more granularly, feel free to use LV2Expr::from(it).has_next()
and LV2Expr::from(it).next()
.
#![allow(unused)] fn main() { let (it, item) = &lv2_var!(it, item); main_hir.assign(it, lv2_list!(1, 2, 3, 4).to_iter()); main_hir .repeat_until(LV2Expr::from(it).has_next().not())) .assign(item, LV2Expr::from(it).next()); }
Optimization
A rudimentary peephole optimizer is enabled by default. It acts upon the generated LIR
. If you want to disable optimization at all, use the method build_with_options
and set the attribute optimize
to false.
lovm2
gives you a guarantee that your LV2Function
will not be unexpectedly altered when lowering.
Dead Code elimination
After lowering the LIR
, the optimizer will eliminate code that is not reachable.
main:
CPush(0) main:
Ret => CPush(0)
CPush(1) Ret
...
Constant evaluation
Computing constant operations ahead can not only improve the programs performance, but also drop certain constants out of the LV2CodeObject
reducing its size. Bytecode sequences will be tranformed like this:
CPush(0)
CPush(1) => CPush(2)
Add
Note: The optimizer currently only optimizes expressions if all operands are constant and does not cover neutral elements like
+ 0
or* 1
as such.
Logical short-curcuit
It is common for languages to avoid evaluating the second operand of Or
/And
operations if the first term is already sufficient for the expressions outcome.
Logical negation
The optimizer will merge instruction sequences like this:
...
Not => Jf
Jt
Virtual Machine
The virtual machine is the heart of lovm2
projects and thrives computation forward. It maintains the whole program state inside a LV2Context
and shares said data with every function and module interested in it.
Context
The context stores the programs state.
#![allow(unused)] fn main() { pub struct LV2Context { /// the module that will be run first entry: Option<Rc<dyn LV2CallProtocol>>, /// available functions scope: HashMap<LV2Variable, LV2CallableRef>, /// global variables globals: HashMap<LV2Variable, LV2Value>, /// call stack with local variables lstack: Vec<LV2Frame>, /// value stack vstack: Vec<LV2Value>, } }
Hooks
Load Hook
vm.set_load_hook(callback)
The load hook is a special function bound to the LV2Vm
that will be consulted first whenever a module should be loaded into the LV2Vm
. It returns an Option
containing the correct LV2Module
structure if the hook was able to resolve the requested name on its own.
Import Hook
vm.set_import_hook(callback)
The import hook handles naming of functions being imported into the scope. As such it can also be used to adjust the naming scheme of the lovm2
standard library.
The function signature expects the callback to return an Option<String>
where Some("name")
will proceed importing with a new identifier. Importing a function can be avoided by returning None
.
Interrupt
vm.set_interrupt(n, callback)
Interrupts are a runtime extension of the bytecode. You can use this to implement custom instructions and frequently used functions without the overhead of a name lookup.
The test environment uses Interrupt(10)
to analyse the programs state at a certain point of execution.
Range | Meaning |
---|---|
0 - 9 | reserved |
10 | debug |
11 - 63 | reserved |
64 - 255 | free |