One of the fundamental questions in VM design for OO languages is how you represent your object handles internally. Regardless of what the language itself exposes in terms of fundamental types, boxing, and the like, the VM still has to shuffle objects around on the stack frame, pass them between methods/functions, etc.
The traditional way to do this is the “tagged union”, where you use a two-element struct consisting a type field and a union of values for each possible type. One of these types is probably an object pointer; the other types let you represent unboxed fundamental types like ints and floats. This is the approach used by Rubinius, by Lua, and probably many more.
The Neko VM instead uses 31-bit pointers for everything except ints, and fixes the lowest bit as the integer bit. If this bit is on, the value represents a 31-bit integer; if it’s off, the value is a pointer. Of course this means that Neko objects can only be at even addresses in memory. (I’m not sure what happens on 64-bit machines; either ints stay at 31 bits or they grow to 63-bit longs; the pointers certainly grow to 64-bit).
The result is that Neko object handles are the size of pointers, hence small, but Neko loses the ability to handle unboxed floats. All float operations will require lots of heap allocation and dereferencing. On the other hand, Lua object handles are much larger, but Lua can do float arithmetic on the stack. (The VM stack, not the system stack.)
The Neko folks claim that their representation is better, because it’s smaller, and faster when you’re copying things around. But what value do you really get by sacrificing floats? And what about when we take into account different architectures?
Comparing sizes is easy. On a 32-bit machine, Lua objects take up 12 bytes: a double is 8 bytes, and the tag grows the struct to 12. So Lua object handles are three times the size of Neko handles. On a 64-bit machine, Lua objects take up 16 bytes, and Neko objects take up 8. Note that Lua handles are now only twice as big as Neko handles.
Comparing speed is a little more interesting. How much is lost, exactly by copying around those extra 8 bytes, for each architecture? I did some simple experiments where I copied objects of various sizes around 10 million times, picking a random start and end point for each within a block of allocated memory on the heap, and measuring how long everything took.
On my 32-bit machine, taking 10m random numbers took 2.874 seconds; copying 12-byte objects from one location to the other each time took an additional 91ms. Copying 4-byte objects took only an extra 77ms. That works out to a 15.3% slowdown for Lua.
On my 64-bit machine, taking 10m random numbers took 517ms; copying 16-byte objects each time took an additional 1.85 seconds; copying 8-byte objects took an additional 1.81 seconds. That works out to a 2.2% slowdown for tagged unions.
So personally, the 32-bit case is maybe arguable, but the 64-bit case doesn’t seem that compelling. Copying object handles around is one thing of very many that the VM spends its time on, so the overall slowdown is going to be much less than 2.2%. I don’t know that sacrificing float performance, and half of your integer space, is really worth it.
If you want to run these experiments for yourself, the code is here. Please let me know if I’m doing something wrong!