Introduction

Dynamic languages such as Python are built on top of an interpreter that is able to understand a broad variety of bytecode instructions allowing them to replicate algorithms and process data. This design makes programs based on interpreter languages well-suited for platform independency and allows fast iterations in development.

lovm2 - love em too - is a small language building framework written in Rust featuring a dead-simple stack-based virtual machine doing exactly that. Furthermore, it comes with tools for generating said bytecode out of the box allowing you to rapidly prototype your own coding language without a hassle. There are no advanced concepts to care about. No polymorphism, closures, asynchronous runtime... just straightforward functions, variables and data structures.

The static lovm2 library is so tiny that compiling it into your language yields almost no overhead and also makes it applicable for usage inside a Python environment via pylovm2.

The project is in an early development stage and no API is stable yet. Feel free to contribute.

Setup

  1. Modify your Cargo.toml

    • Add the latest crates.io version
    lovm2 = "0.4.8"
    
    • ... or - if you feel lucky - use the current master branch directly
    lovm2 = { git = "https://github.com/lausek/lovm2" }
    
  2. Run cargo update on your terminal

  3. Import the useful lovm2 components into scope using use lovm2::prelude::*;

Concepts

This chapter aims to give you a brief overview of the internal principles. Even though lovm2 is designed to be as simple as possible, it is still quite important to grasp the implementation concepts behind it.

The general steps of producing a runnable program are roughly:

  • Create a new ModuleBuilder and populate it with functions aka. HIR data
  • Call module_builder.build() constructing a runnable Module from the current state
  • Load the module into an instance of a virtual machine Vm using add_main_module()
  • Start the program by calling run on the virtual machine

Bytecode

lovm2 is centered around the value stack. This data structure has a variety of tasks such as passing parameters to functions or acting as a computation buffer for operations. There are instructions that put values on top of the stack like CPush, LPush, and GPush. Some just take a value off and store it somewhere like LMove, GMove. Almost all other instructions will take a given amount of values from it and leave a return value in place. You can also implement your own bytecode instructions (see Interrupt chapter for more).

For example, the term 1 + (2 * 3) will be compiled to this sequence:

 instruction    | stack
----------------------------
 CPush          | [1]
 CPush          | [1, 2]
 CPush          | [1, 2, 3]
 Mul            | [1, 6]
 Add            | [7]

You do not need to micromanage the bytecode itself. There are common language constructs with which you can built pretty much everything. These constructs are composed into a high-level intermediate representation so every new function gets its own HIR structure. Below you can see the transformation process of a function into a runnable CodeObject where each arrow means "lowering" to the next level.

HIR -> LIR -> CodeObject

LIR or low-level intermediate representation is not directly exposed to the user but it is quite crucial if you want to understand how the bytecode generator works under the hood.

CodeObject's on their own are already valid programs bundling together instructions, constants and variables. As usual in every language, functions can be grouped together and exposed to the virtual machine in a construct called Module. Introducing this module level abstraction makes sense as we will later work with collections that are not backed by lovm2 bytecode but native functions.

HIR -> LIR 
            \
HIR -> LIR    -> CodeObject -> Module --> load into VM
            /
HIR -> LIR 

Modules

While you are already familiar with lovm2's own representation of executable code, Modules are far more abstract under the hood. lovm2 is able to load specifically compiled shared objects - or DLLs as you would call them in the Windows world - at runtime and execute real native functions as well.

And that's not all. As long as your structure complies with the CallProtocol trait you are free to even implement native functions inside your own compiler or runtime. This job can be done using the lovm2_extend package which allows you to write your own modules in Rust.

Types

lovm2 embraces dynamic typing. The most basic ones are Bool, Int, Float, String.

Nil is the default return type of functions that do not have return values. You can also use it to mark the absence of a value.

Ref can wrap any other value and implement a shared mutable location as such.

Complex Types

List and Dict are a bit more complicated, because they need to store other values. As such, they support the contains, len, get, set and delete methods.

These types also utilize Ref heavily. If you use the standard lovm2 functionality for generating programs, you will always implicitly work with a Ref to the corresponding data. The virtual machine also ensures that every value being stored inside these types is itself wrapped up in a reference. This is required for the implementation of slices. The Box instruction realizes this functionality.

Another special value kind is Any. This type is used for allowing external shared object modules to pass their custom Rust structures into the VM.

Conversion

The Conv instruction is able to convert data according to the following rules:

from / toNilBoolIntFloatStringListDictRef
Nil
Bool
Int
Float
String~~
List
Dict
Ref

~: implies parsing overhead

Building Programs

This chapter will show you how to utilize the gen module in order to compile your own lovm2 programs. It is also possible to persist compiled modules onto your disk and load them later.

use lovm2::prelude::*;

fn main() {
let mut builder = ModuleBuilder::new();

// creates the entry point `HIR` and returns a mutable reference.
// this is actually a shortcut for builder.add(ENTRY_POINT)
let main_hir = builder.entry();

// modify `main_hir` with statements
// if in doubt, just call the `step` method and pass it the hir element
main_hir.step(Interrupt::new(10));

let module = builder.build().except("compile error");
println!("{}", module);
}

The main generation functionality is exposed via Block and every structure that contains it like Branch, Repeat and functions. You can use these methods on all of them:

  • step(..) append a new statement to the block.
  • branch() create a new branch at the current position. This returns a BranchBuilder.
  • repeat() and repeat_until(..) which return a mutable reference to a new block. The first variant is an endless loop, while the latter supports breaking once a condition is met.

Functions

The whole ModuleBuilder is centered around the creation of HIR. As we already found out in the Concepts chapter, a HIR is conceptually equal to a function. The resulting bytecode is able to process a given amount of parameters and leave a return value in place.

As you can see in this example listing, you should not need to create such data manually as there is functionality for adding it to the builder directly.

use lovm2::prelude::*;

fn main() {
// creates a hir with no arguments
let fn_no_args = builder.add("fn1");

// creates a hir that expects parameter n
let fn_with_args = builder.add_with_args("fn2", &[lv2_var!(n)]);
}

To return from function, add a Return::value(expr) to the hir specifying the returned value or Return::nil() if no value is produced.

Due to the convention that every function has to return a value, an implicit Return::nil() is appended if the last instruction is not a return already.

Helper Macros

There are a bunch of macros inside the prelude that trivialize creating more complicated lovm2 constructs for developers.

  • lv2_var!(ident, ...) turns all the identifiers given into the special type Variable which is needed basically everywhere. If more than one ident is declared, this returns a tuple.
  • lv2_dict!(ident => expr, ...) creates an Expr that will dynamically initialize a dictionary with the key-values pairs specified.
  • lv2_list!(item, ...) creates an Expr that initializes a list dynamically.
  • lv2_call!(ident, ... args) syntactic sugar for the Call element.
  • lv2_access!(ident, ... keys) syntactic sugar for the Access element.

Expressions

The Expr represents any computation that leads to a new value on the stack. Expressions can be nested arbitrarily.

To give you an overview of what an expression could look like, here is the stripped down version of its actual implementation.


#![allow(unused)]
fn main() {
pub enum Expr {
    // a constant value
    Value,
    // variable in read position
    Variable,
    // a branch that evaluates expressions conditionally
    Branch,
    // call to a function
    Call,
    // operations with one operand
    Operation1,
    // operations with two operands
    Operation2,
    // result of a type conversion
    Conv,
    // attribute read on a list or dict
    Access,
    // special variant for creating lists and dicts
    DynamicValue,
    // create a mutable subpart of a list
    Slice,
}
}

Resolvement of Variables

While lowering, the runtime keeps track of locally assigned identifiers. This is crucial for determining the scope during read access later. If a variable is not known locally, a fallback to global scope happens.

Expr variants relying on this functionality are Variable and Access. As such, their macro helper functions lv2_var! and lv2_access! follow the same rules.

Example

We want to transform the formula (1 + 2) * f(2) to lovm2 bytecode. Use a parser of your choice to generate an abstract syntax tree from the textual representation. Note that lovm2 does not care about operator priorities so its your parsers duty to correctly handle them. After processing the input, your AST should look something like this:

Value(1)
        \
         -- Operation(+)
        /               \
Value(2)                 -- Operation(*)
                        / 
              Call(f, 2)

And here is the compiletime representation of said formula. As you can see, every operation has an equivalent static method on Expr. Also note that calling a function has its own construct. This is due to calls being allowed in statement position as well.


#![allow(unused)]
fn main() {
let formula = Expr::mul(
    Expr::add(1, 2),
    Call::new("f").arg(2),
);
}

The (unoptimized) LIR now looks like this:

CPush(1)
CPush(2)
Operator2(Add)
CPush(2)
Call(f, 1)
Operator2(Mul)

Assignment

Whenever a function is called, the call stack is adjusted guaranteeing that no local variables will interfere each other. To create or change a local variable, it is sufficient to use this construct:


#![allow(unused)]
fn main() {
Assign::local(&lv2_var!(n), expr)
}

It is possible to store data in the global scope allowing values to live across function calls. This requires usage of the following assignment variant:


#![allow(unused)]
fn main() {
Assign::global(&lv2_var!(n), expr)
}

Assign::increment(n) and Assign::decrement(n) are also quite handy for updating variable values.

There is even a way of setting the values on lists and dictionaries. Under the hood, Set is actually expecting a Ref as the target location - which is retrieved by Access - and overwrites the value inside. This is compatible with the way dictionaries and lists are internally constructed.


#![allow(unused)]
fn main() {
Assign::local(&lv2_var!(point), lv2_dict!());
Assign::set(&lv2_access!(point, x), x_coord);
Assign::set(&lv2_access!(point, y), y_coord);
}

Branching

While working on your functions hir, you can call the .branch() method to create a point of conditional execution. A branch can have several conditions with associated blocks and at most one default condition that is always met. Branches with just a default_condition are not allowed.


#![allow(unused)]
fn main() {
let main_hir = builder.entry();

// ...

let equal_check = main_hir.branch();

equal_check
    .add_condition(expr)
    .step(...);

equal_check
    .default_condition()
    .step(...);
}

Expression Branches

Regular branches are considered statements and do not leave values on stack as such. However, there is a special expression variant Expr::branch() that lets you create the adhoc version of a ternary operator.


#![allow(unused)]
fn main() {
// f(n) = n == 2 ? "a" : "b"

let n = lv2_var!(n);
let f_body = Expr::branch().add_condition(Expr::eq(n, 2), "a").default_value("b");

builder
    .add_with_args("f", vec![n])
    .step(Return::value(f_body));
}

Note: ExprBranch must always return a value meaning that you have to call the default_value method at least once. Otherwise the compiler will complain that a structure of type ExprBranchIncomplete cannot be converted into an expression. This is a compile time check to ensure that those branches always evaluate to a value.

Example

Let's implement a function that returns 1 if the given value is equal to 2 and 0 otherwise. From a Rust perspective, we could generate the function like this:


#![allow(unused)]
fn main() {
let main = builder.add_with_args("is_2", vec![n.clone()]);
let branch = main.branch();

branch
    .add_condition(Expr::eq(n, 2))
    .step(Return::value(1));

branch
    .default_condition()
    .step(Return::value(0));
}

The representation will be translated temporarily to the following optimized LIR. As you can see, a lot of labels (prefixed with a .) got involved right now. Everything between .cond_0's start and end label is derived from our first conditions predicate and body. The JumpIfFalse instruction separates them by making sure that the body will be skipped if the expression evaluates to false. As usual, whenever we hit a return instruction, the function will terminate assuring that we will not fall through into our default branch.

is_2:
	StoreLocal(n)
.branch_0_start:
.cond_0_start:
	PushLocal(n)
	CPush(2)
	Operator2(Equal)
	JumpIfFalse(.cond_0_end)
	CPush(1)
	Ret
.cond_0_end:
	CPush(0)
	Ret

In the last lowering step, there are only two things left to do: resolving label offsets and patching them into the jump instructions.

Repeating

Loops can be created inside blocks using the .repeat() and .repeat_until(..) methods.


#![allow(unused)]
fn main() {
let main_hir = builder.entry();

let repeat_endless = main_hir.repeat();

// run until the condition is true
let repeat_until = main_hir.repeat_until(expr);
}

To avoid namespace collissions in Rust, there is no while construct but you can simply create it via .repeat_until(Expr::not(..)). The optimizer makes sure that no instruction overhead is generated.

Inside loop blocks you are free to use Break and Continue to precisely control the flow. As in every programming language, Break terminates the loop while Continue jumps to its start again.

The .repeat_iterating(collection, item) constructor is able to consecutively assign every entity to the variable passed as item as long as the collection value supports iteration. Check the Iterators chapter if you want to find out more about this.


#![allow(unused)]
fn main() {
// repeat for all items in an iterator, assign item to variable `i`
let repeat_iterating = main_hir.repeat_iterating(lv2_list!(1, 2, 3), lv2_var!(i));

// ... and this is the elaborate variant
let it = Iter::create(lv2_list!(1, 2, 3));
let repeat_iterating = main_hir.repeat_iterating(it, lv2_var!(i));
}

Example

We want to print the odd numbers between 0 and 10. This is an unbeautified implementation in pythonic pseudocode.

i = 0
while True:
    if i == 10:
        break
    i += 1
    if i % 2 == 0:
        continue
    print(i)

Translating it to HIR one by one could look like this:


#![allow(unused)]
fn main() {
let i = &lv2_var!(i);

// i = 0
main_hir.step(Assign::local(i, 0));

let repeat = main_hir.repeat();

// if i == 10: break
repeat
    .branch()
    .add_condition(Expr::eq(i, 10))
    .step(Break::new());

// i += 1
repeat.step(Assign::increment(i));

// if i % 2 == 0: continue
repeat
    .branch()
    .add_condition(Expr::eq(Expr::rem(i, 2), 0))
    .step(Continue::new());

// print(i)
repeat.step(lv2_call!(print, i, "\n"));
}

You can imagine that the resulting LIR is a lot more elaborate than the previous examples so we will only focus on the essential parts. From a intermediate perspective an endless loop is implemented by appending an unconditional jump to the loops body.

main:
.rep_0_start:
    ...
	Jump(.rep_0_start)
.rep_0_end:

To terminate the loop once the counter variable reaches 10, we add a conditional break to the body. This is solely a jump targeting the loops end label.

.cond_0_start:
	PushLocal(i)
	CPush(10)
	Operator2(Equal)
	JumpIfFalse(.cond_0_end)
	Jump(.rep_0_end)
.cond_0_end:

On the other hand Continue targets the start label.

Iterators

Iterators are special values that are used to step through the items of Str, List and Dict. Creating them from a HIR perspective is as easy as passing the source expression into Iter::create(..). Depending on the base collection, the return type of Iter::next(..) will change:

  • Str: Characters of the string.
  • Dict: A List in which the first item is the key and the second item is the value.

Only List causes the iterator to return a contained value as is.

Ranges

You can also use Iter::create_ranged(from, to) if you want to have a sequence of numbers. Note that from is inclusive and to is exclusive.


#![allow(unused)]
fn main() {
let it1 = Iter::create_ranged(0, 5);
// => [0, 1, 2, 3, 4]

let it2 = Iter::create_ranged(-5, 5);
// => [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4]

// passing an iterator expression into `Iter::reverse` does 
// exactly what you would expect

let it3 = Iter::reverse(Iter::create_ranged(0, 5));
// => [4, 3, 2, 1, 0]

// you can also omit `from` and `to` by passing a `Nil` value

let it4 = Iter::create_ranged(Value::Nil, 5)
// => [0, 1, 2, 3, 4]

let it5 = Iter::create_ranged(10, Value::Nil)
// => [10, 11, 12, ...]
}

Example

If you want to control the iterator more granularly, feel free to use Iter::has_next(it) and Iter::next(it).


#![allow(unused)]
fn main() {
let (it, item) = &lv2_var!(it, item);

main_hir.step(Assign::local(it, Iter::create(lv2_list!(1, 2, 3, 4))));

main_hir
    .repeat_until(Expr::not(Iter::has_next(it)))
    .step(Assign::local(item, Iter::next(it)));
}

Optimization

A rudimentary peephole optimizer is enabled by default. It acts upon the generated LIR. If you want to disable optimization at all, use the method build_with_options and set the attribute optimize to false.

lovm2 gives you a guarantee that your HIR will not be unexpectedly altered when lowering.

Dead Code elimination

After lowering the LIR, the optimizer will eliminate code that is not reachable.

main:               
    CPush(0)        main:
    Ret        =>       CPush(0)
    CPush(1)            Ret
    ...

Constant evaluation

Computing constant operations ahead can not only improve the programs performance, but also drop certain constants out of the CodeObject reducing its size. Bytecode sequences will be tranformed like this:

CPush(0)
CPush(1)   =>   CPush(2)
Add

Note: The optimizer currently only optimizes expressions if all operands are constant and does not cover neutral elements like + 0 or * 1 as such.

Logical short-curcuit

It is common for languages to avoid evaluating the second operand of Or/And operations if the first term is already sufficient for the expressions outcome.

Logical negation

The optimizer will merge instruction sequences like this:

...
Not   =>   Jf
Jt

Virtual Machine

The virtual machine is the heart of lovm2 projects and thrives computation forward. It maintains the whole program state inside a Context and shares said data with every function and module interested in it.

Context

The context stores the programs state.


#![allow(unused)]
fn main() {
pub struct Context {
    /// the module that will be run first
    pub entry: Option<Rc<dyn CallProtocol>>,
    /// available functions
    scope: HashMap<Variable, CallableRef>,
    /// global variables
    globals: HashMap<Variable, Value>,
    /// call stack with local variables
    lstack: Vec<Frame>,
    /// value stack
    vstack: Vec<Value>,
}
}

Hooks

Load Hook

vm.set_load_hook(callback)

The load hook is a special function bound to the Vm that will be consulted first whenever a module should be loaded into the Vm. It returns an Option containing the correct Module structure if the hook was able to resolve the requested name on its own.

Import Hook

vm.set_import_hook(callback)

The import hook handles naming of functions being imported into the scope. As such it can also be used to adjust the naming scheme of the lovm2 standard library.

The function signature expects the callback to return an Option<String> where Some("name") will proceed importing with a new identifier. Importing a function can be avoided by returning None.

Interrupt

vm.set_interrupt(n, callback)

Interrupts are a runtime extension of the bytecode. You can use this to implement custom instructions and frequently used functions without the overhead of a name lookup.

The test environment uses Interrupt(10) to analyse the programs state at a certain point of execution.

RangeMeaning
0 - 9reserved
10debug
11 - 63reserved
64 - 255free

Standard Library

Examples

Projects