The art of writing interpreters in Wasm

Hello, hackers! Today I want to share a small experiment about writing fast interpreters in WebAssembly.

Wasm offers “near-native performance” compared to native code, and in many cases that is true. Let’s compile a simple C++ loop that sums one million numbers, both to native code and to Wasm using the WASI SDK, and run it with Wasmtime:


Configuration Time
Native2840 µs
Wasm (WASI SDK + Wasmtime)2838 µs

This result is expected: Wasmtime’s Cranelift backend does a good job. However, the claim of “near-native performance” mostly applies to statically typed languages such as C, C++, Rust, and similar languages. If you try to run a dynamic language, such as JavaScript or Python, the results can be surprising.

For example, let’s run a simple loop in JavaScript through Wasm and compare it with native JavaScript engines:


Engine Time
Node --jitless349 µs
QuickJS-ng443 µs
SpiderMonkey compiled to Wasm1034 µs

The difference is significant. SpiderMonkey compiled to Wasm is almost 3x slower than JIT-less Node. Thanks to recent optimizations in Wasmtime, this gap is now much smaller than it used to be: previously, the difference was much worse, around 20x.

The main reason is that most dynamic-language runtimes have only been ported to Wasm as interpreter-only C/C++ implementations. This is the simplest approach, and it is how runtimes such as SpiderMonkey, QuickJS, and CPython are typically compiled to Wasm.

But why is an interpreter written in C/C++ and compiled to Wasm so much slower? And what can we do to make it faster?

In this article, I want to explore this question in the context of JavaScript: can we implement a JavaScript interpreter directly in Wasm and make it run with something close to “near-native” performance?

A benchmark

To begin our journey, let’s create a benchmark. We will use a simple loop similar to the one from the introduction. Even though the benchmark is very simple, it still exercises several core aspects of JavaScript: property access, arrays, loops, and basic arithmetic.


function total_discount(items) {
  let sum = 0;
  for (let i = 0; i < items.length; ++i) {
    sum += items[i].discount;
  }
  return sum;
}

I extracted a small subset of bytecodes from the QuickJS engine and implemented a naive switch-based interpreter in C++. This will be our starting point.


Switch-based interpreter Time
Wasm3306 µs
Native1507 µs

As you can see, the interpreter performs poorly compared to QuickJS (443 microseconds), especially when compiled to Wasm.

The problem

If you work on compilers or interpreters, you probably know that switch-based interpreters are one of the worst cases for a compiler. Why? Because the compiler has very limited visibility into the execution structure of individual bytecodes.

For example, some bytecodes have both fast and slow paths, but expressing this structure clearly in C or C++ is difficult. Another problem is branch prediction. In a switch-based interpreter, the CPU’s branch predictor has a hard time predicting the next dispatch target because execution keeps jumping between many different switch cases in a pattern that is close to random via CPU point of view.

The switch itself can also become very large. In that case, the compiler’s register allocator has to deal with one huge function with many unrelated control-flow paths, which makes it difficult to assign registers efficiently.

In our case, with Wasm, the situation is even worse because we have to reason about two levels of compilation:


C++ → Wasm → native code

However, this is an old and well-known problem for experienced compiler engineers. Most industrial-grade engines, such as V8, JavaScriptCore, and others, solve it either by using a DSL to generate an interpreter in assembly, or by writing the interpreter directly in assembly. This gives them full control over registers, fast paths, slow paths, dispatch, and other low-level details.

The problem with writing assembly directly, or with building a custom DSL, is that it takes a lot of time. Ideally, I want a cheap way to convert or extend existing C/C++ interpreters so that, with minimal effort, they can get much closer to the performance of industrial-grade interpreters.

Tail call interpreter

There is a way to address one of these problems without rewriting the whole interpreter from C++ to assembly: a well-known technique based on tail calls.

The idea is to convert each case in a switch-based interpreter into a separate function, and then use a tail call to jump from one bytecode handler to the next. There are many recent examples of this technique.

This helps because it is much easier for the compiler to allocate registers inside a small function than inside one huge switch. Also, if we pass frequently used values, such as the stack pointer, as function parameters from one bytecode handler to another, the compiler may be able to keep them in fixed registers across handlers.

Luckily, recent versions of Clang and Wasmtime both support tail calls, so let’s convert our switch-based interpreter into a tail-call-based one.


Tail-call interpreter Time
Wasm3460 µs

We can see that this works well for the native version: we get a small but measurable speedup. However, the Wasm result is surprising: it is actually worse than the switch-based version (3306 microseconds).

Let’s look at the generated WAT for one of the bytecode handlers in the tail-call-based interpreter compiled to Wasm:


(func $"VMTail::op_push_0(long*, TaggedPointer*, VMTail*)" (type 1)
    (param i32 i32 i32) (result i32)
  …
  local.get 0
  local.get 1
  local.get 2
  local.get 3
  i32.const 2
  i32.shl
  i32.load offset=12720
  return_call_indirect (type 1)
)

What we see here is that clang++ from the WASI SDK did use tail calls, but it placed all bytecode handlers, such as op_push_0, into one large function table. This is understandable from Clang’s perspective: from the C++ point of view, these are just regular functions. Clang cannot infer that we only use them as bytecode handlers, nor can it place them into a separate typed function table optimized specifically for dispatch.

The tail-call version based on return_call_indirect is not exactly what we want. return_call_indirect performs an additional signature check that we do not need. On top of that, Wasmtime adds some per-function overhead to preserve Wasm’s security and isolation properties.

So let’s do something a bit brave: we can make a few surgical changes directly in WAT and fix the dispatch manually.

First, let’s put all bytecode handler functions into a separate typed function table with a non-nullable reference type, so we can avoid null checks:


(table $op_table 16 16 (func (param i32 i32 i32) (result i32)) …)
(elem (table $op_table) (i32.const 0) (ref 1)
  (item ref.func $"VMTail::op_set_loc_uninitialized(long*, TaggedPointer*, VMTail*)")
  (item ref.func $"VMTail::op_push_0(long*, TaggedPointer*, VMTail*)")
  (item ref.func $"VMTail::op_put_loc_0(long*, TaggedPointer*, VMTail*)")
  (item ref.func $"VMTail::op_put_loc_1(long*, TaggedPointer*, VMTail*)")
  (item ref.func $"VMTail::op_get_loc(long*, TaggedPointer*, VMTail*)")
  …)

Then we replace dispatch through return_call_indirect with:


table.get $op_table
return_call_ref 1

Here is the result:


Version Time
Wasm (tail-call interpreter, typed table)3185 µs
Initial Wasm switch-based version3306 µs

The result is, let’s say, questionable.

I looked into the native assembly produced by Wasmtime for the updated tail-call version of the interpreter and found one major downside of this approach. Wasmtime adds checks and bookkeeping around each Wasm function. In our case, that means around each bytecode handler.

So while we gain a more efficient dispatch structure, we also start paying extra overhead for every function call. As far as I know, there is currently no way to express a custom ABI or tell Wasmtime to omit this overhead inside bytecode handlers.

Because of that, this path looks like a dead end. We cannot build a highly efficient Wasm interpreter simply by compiling a C/C++ interpreter into a tail-call-based form.

What else can we do to improve performance significantly?

First, our C++ object representation is far from ideal. It is based on std::unordered_map, which is obviously slow for this use case. We also do not apply any of the standard interpreter optimizations: inline caches, top-of-stack caching, quickening, and so on.

Unfortunately, this kind of precise low-level control is hard to achieve from C++. For example, we cannot directly declare a WasmGC struct in C++ and use it as the object representation for JavaScript objects. We also cannot precisely control tail calls, local variables, or function references in the way we need.

In fact, there does not seem to be a language that gives us direct and precise control over WasmGC, typed function references, tail calls, and the rest of the Wasm execution model at the same time.

I considered MoonBit and Kotlin because I did not want to write everything directly in WAT. But, somewhat surprisingly, I found a way to write WAT without too much pain.

Writing WAT by hand is hard — Use Claude to generate WAT

It turns out that I can write high-level blueprints in C++ and ask Claude to translate them into WAT. Surprisingly, it understands the semantics well enough to produce correct WAT.

Generating the WAT interpreter

So now, with Claude’s help, it is time to generate the WAT version of our interpreter.

Object representation

The first thing to change is the object representation. In the C++ version, JavaScript objects were backed by std::unordered_map, which is convenient but very expensive for this kind of benchmark. In the WAT version, we can model JavaScript values directly using WasmGC types:


;; JSValue = (ref null eq):
;;   - i31ref for small integers
;;   - (ref $JSValArray) for arrays
;;   - (ref $JSPlainObject) for plain objects
(type $JSValArray
  (array (mut (ref null eq))))

(type $JSField
  (struct
    (field $key i32)
    (field $value (mut (ref null eq)))))

(type $JSFieldArray
  (array (mut (ref null $JSField))))

(type $JSPlainObject
  (struct
    (field $shape i32)
    (field $fields (ref $JSFieldArray))))

The model is intentionally simple. A JavaScript value is represented as a nullable eqref. Small integers are stored as i31ref, arrays are represented as mutable WasmGC arrays, and plain objects are represented as a pair of a shape identifier and an array of fields.

Each field stores a numeric key and a mutable value. The shape field acts as a compact description of the object layout. In a real JavaScript engine, this would correspond to concepts such as hidden classes, maps, or shapes. Here, it gives us just enough structure to make repeated property accesses optimizable.

This representation is much closer to what we want from Wasm: objects are GC-managed values, fields are accessed through typed WasmGC operations, and the interpreter no longer has to go through a C++ hash table for every property access.

Inline caches

The second important optimization is inline caching.

The basic idea is simple. The first time the interpreter executes a generic property access bytecode, it performs the full lookup: check the object, inspect its shape, search for the field, and compute the field index. Once this succeeds, the interpreter records the result in an inline cache attached to that bytecode location. The fast path becomes very small:

  1. check that the object still has the expected shape;
  2. load the cached field slot;
  3. return the value.

If the shape check fails, execution falls back to the generic slow path, updating the cache if needed.

This is the same general idea behind inline caches in JavaScript engines and adaptive interpreters.

In manual_interpreters.wat, this is implemented directly at the bytecode level: generic handlers collect runtime feedback, cache the observed shape and field index, and then dispatch to a specialized handler on subsequent executions. This keeps the interpreter simple, but still gives it the most important optimization for this benchmark: repeated property accesses no longer pay the full dynamic lookup cost.

After applying all of this plus some tuning I’ve got the following numbers:


Engine Time
Native Node --jitless349 µs
QuickJS-ng native443 µs
Manually written WAT interpreter609 µs
SpiderMonkey compiled to Wasm1034 µs

This is a much better result. The manually written WAT interpreter is still slower than native JIT-less Node, but the gap is now much smaller. More importantly, it is significantly faster than SpiderMonkey compiled to Wasm, even though the interpreter is still very small and implements only a narrow subset of JavaScript semantics.

Conclusion

In the end, our small interpreter for a subset of QuickJS bytecode performs only 1.7x slower than native JIT-less Node. I think this is a great result, especially considering that the resulting Wasm module is only 3.8 KB.

Of course, it could be optimized further with bytecode fusion and other techniques, or even by switching to a different bytecode format that is not stack-based. But after implementing all of this, my conclusion is that writing high-performance interpreters in Wasm is still hard.

Wasm does not give us the same level of precise low-level control as assembly, especially when it comes to dispatch, calling conventions, register allocation, and function boundaries.

However, Wasm starts to shine if we eliminate interpreter dispatch entirely and compile JavaScript bytecode directly into Wasm. At that point, we no longer have unoptimizable indirect jumps, and we no longer need to emulate a custom interpreter ABI. In fact, we can do this with our WAT interpreter by unrolling the interpreter loop for the bytecode of total_discount.js.

Once again, I used Claude to do this, because unrolling an interpreter loop is, in a sense, a mechanical refactoring. I did not apply any additional optimizations on top of that. I only glued the bytecode handlers together.

The result is 103 microseconds, with a Wasm module size of only 846 bytes.

So the final takeaway is this: Wasm may not be ideal as a direct compilation target for traditional C/C++ interpreters, and it may still be difficult to write an industrial-grade interpreter directly in Wasm. But once we move from interpretation to compilation, even very simple Wasm code can become extremely fast.

Thanks for reading.

P.S.

All the benchmarks and the full source code are available here.