Showing only posts with topic "optimization" [.rss for this topic]. See all posts.

Copying objects in memory

One of the fundamental questions in VM design for OO languages is how you represent your object handles internally. Regardless of what the language itself exposes in terms of fundamental types, boxing, and the like, the VM still has to shuffle objects around on the stack frame, pass them between methods/functions, etc.

The traditional way to do this is the “tagged union”, where you use a two-element struct consisting a type field and a union of values for each possible type. One of these types is probably an object pointer; the other types let you represent unboxed fundamental types like ints and floats. This is the approach used by Rubinius, by Lua, and probably many more.

The Neko VM instead uses 31-bit pointers for everything except ints, and fixes the lowest bit as the integer bit. If this bit is on, the value represents a 31-bit integer; if it’s off, the value is a pointer. Of course this means that Neko objects can only be at even addresses in memory. (I’m not sure what happens on 64-bit machines; either ints stay at 31 bits or they grow to 63-bit longs; the pointers certainly grow to 64-bit).

The result is that Neko object handles are the size of pointers, hence small, but Neko loses the ability to handle unboxed floats. All float operations will require lots of heap allocation and dereferencing. On the other hand, Lua object handles are much larger, but Lua can do float arithmetic on the stack. (The VM stack, not the system stack.)

The Neko folks claim that their representation is better, because it’s smaller, and faster when you’re copying things around. But what value do you really get by sacrificing floats? And what about when we take into account different architectures?

Comparing sizes is easy. On a 32-bit machine, Lua objects take up 12 bytes: a double is 8 bytes, and the tag grows the struct to 12. So Lua object handles are three times the size of Neko handles. On a 64-bit machine, Lua objects take up 16 bytes, and Neko objects take up 8. Note that Lua handles are now only twice as big as Neko handles.

Comparing speed is a little more interesting. How much is lost, exactly by copying around those extra 8 bytes, for each architecture? I did some simple experiments where I copied objects of various sizes around 10 million times, picking a random start and end point for each within a block of allocated memory on the heap, and measuring how long everything took.

On my 32-bit machine, taking 10m random numbers took 2.874 seconds; copying 12-byte objects from one location to the other each time took an additional 91ms. Copying 4-byte objects took only an extra 77ms. That works out to a 15.3% slowdown for Lua.

On my 64-bit machine, taking 10m random numbers took 517ms; copying 16-byte objects each time took an additional 1.85 seconds; copying 8-byte objects took an additional 1.81 seconds. That works out to a 2.2% slowdown for tagged unions.

So personally, the 32-bit case is maybe arguable, but the 64-bit case doesn’t seem that compelling. Copying object handles around is one thing of very many that the VM spends its time on, so the overall slowdown is going to be much less than 2.2%. I don’t know that sacrificing float performance, and half of your integer space, is really worth it.

If you want to run these experiments for yourself, the code is here. Please let me know if I’m doing something wrong!

Indirect threading for VM opcode dispatch

There’s a good discussion with lots of interesting details on a recent patch submission for adding indirect threading to the Python VM. (And by “discussion” I mean a single, un-threaded sequence of comments where you have to manually figure out who’s replying to what, which apparently is what everyone in the world is happy with nowadays except for me. Email clients have had threading since 1975, bitches, so get with the fucking program. [Hence, Whisper—ed.]) Pointed to by, which remains surprisingly useful, as long as you cut yourself once the comment thread devolves (as it invariably does) into meta-argumentation.

Indirect threading is a vaguely-neat trick that I first learned about around the time I was getting into the Rubinius code. The idea is that, in the inner loop of your VM, which is going through and interpreting the opcodes one at a time (dispatching each to a block of handler code), instead of jumping back to the top of the loop at the end of handler code section, you jump directly to the location of the handler code for the next opcode. The big benefit is not so much that you save a jump per opcode (which maybe is optimized out for you anyways), but that the CPU can do branch prediction on a per-opcode basis. So common opcode sequences will all be pipelined together.

But the discussion shows that this kind of thing is very compiler- and architecture-dependent, and you have to spend a lot of time making sure that GCC is optimizing the “right way” for your particular architecture, is not overly-optimizing by collapsing the jumps together, etc. OTOH, the submitter is reporting a 20% speedup, and this is the very heart of the VM, so it could very well be worth spending time on such trickery.

More information:

  • The structure and performance of efficient interpreters [pdf]
  • Inline threading, Tracemonkey, etc.
  • A Pypy-dev thread on threaded interpretation.
  • Various performance-specific bits of the V8 Javascript interpreter design.