www.socialtext.net will be down for scheduled maintenance today Sat July 31st starting at 6:00pm PDT (Sun August 1st 1:00am UTC). Estimated downtime is 20 minutes, please contact support@socialtext.com with any questions.

LLVM

hide

LLVM - what it is and what it isn't (from a p5 internals POV)

This section aims to clarify some misconceptions about what llvm is/does.

LLVM is a compiler toolchain that can be used as a library. It has several
components:

LLVM assembley

LLVM assembley is a low level language. This is the intermediate
representation LLVM uses.

It has two on disk formats, one binary and one text, and an in memory
structure that can be generated using the C++ libraries.

It is designed to be portable and easy to optimize.

The level of abstraction is very similar to C. Memory management is manual
and pointer based, it uses the same C types and data structure abstractions
(Structs, pointer arrays, etc), it follows the C calling conventions
(calling an llvm function and a C function is the same)

The language itself is very different from C. The code structure is much
more similar to that of an assembley language, it's designed to be written and
read by machines, not humans.

The language also provides some more control features than C, such as tail
calls and efficient exception handling.

C to LLVM compilers

C<llvm-gcc> and C<clang> compile C to llvm assembley.

They can be used to generate LLVM optimizable versions of the perl C<pp>
functions and support code (e.g. F<sv.c> and friends).

Yuval: this was the main focus of my experimentation

Running LLVM code

LLVM provides an llvm assembley interpreter (jit based), but it also has an
llvm assembley to static binary compiler.

The interpreter has some startup penalties, and requires the libraries to be
linked.

LLVM at runtime

The llvm bytecode lib can be used to generate code at runtime, which can
then be run by the jit.

This code is linked into the host process, and can use existing C or llvm
symbols, whether statically or dynamically loaded.

LLVM's major performance advantage over traditional compilers is in its link
time optimization ability. It can analyze code inter procedurally, and
optimize aggressively based on this knowlege. For instance:

  • code loaded from a library can be inlined to a callsite if appropriate
  • pointer indirection can be eliminated by modifying the calling conventions

 at the call site and in the function

  • loop invariants inside a function can be moved outside to the callsite
  • all symbols can be made static after linkage, preventing further linkage

 but allowing more aggressive optimizations

Current Results

Perl internals on LLVM

By compiling perl to llvm assembley with llvm-gcc (in the future clang will
be more appropriate), I was able to produce an llvm assembley based libperl.
This means that llvm could optimize in between the various code units.

This produced variable results, with perlbench running on average 10% faster
compared to the same version of GCC on my MacBook Pro if I recall correctly.

The results of this have been mailed to p5p.
Lhttp://www.nntp.perl.org/group/perl.perl5.porters/2008/06/msg137481.html

On their own these results are uninteresting - it's just porting the build
toolchain to a rather esoteric compiler, which happens to be insignificantly
faster on my machine.

Perl code on LLVM

The next logical step is compiling optrees to LLVM assembley.

I created a C function that is akin to unrolling of runops for a simple
arithmetic perl subroutine.

Perl's opcode dispatch is very simple to inline this way, since runops
essentially just walks a graph:

 PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX)<br />

unrolling this would amount to creating:

 PL_op = pp_foo(aTHX);<br />
 PL_op = pp_bar(aTHX);<br />

For simple arithmetic compiling a perl optree to this form yielded a
performance improvement of about 15% compared to runops_standard for the
llvm compiled perl.

LLVM can produce a function pointer which is trivial to install in the XSUB
field of an SV.

This means that jitted subroutines are upgraded to XSUBs, while unjitted (or
unjittable) subroutines are executed using the runops function with no code
changes.

An incremental plan for P5 on LLVM

In order to properly tap into the potential of LLVM as a target, several
changes must be made.

This section provides an incremental plan for deliverables that should
demonstrate the usefulness or futility of the project early on, before
sweeping code changes are committed to.

1. JITable ops

To allow LLVM to aggressively optimize the ops data should be transfered
using C calling conventions instead of using Perl's various stacks.

For instance using pp_add currently involves pushing two SVs on to the
stack, and then invoking a C function and providing it with a pointer to the stack
using aTHX.

A more direct approach would be a C<r_add> function:

 SV * r_add(pTHX_ SV *l, SV *r);<br />

pp_add could be defined in terms of r_add by simply popping off parameters
and calling the C function with them.

The possible tradeoffs of this conversion are discussed in more detail
below.

The conversion should be carried out piecemeal, starting with a handful of
ops for testing. Only ops with a perceived benefit should be converted.

2. JITtable emitted code

Using these ops is a little more involved than simply unrolling runops.

The implicit stack parameters for a given optree must be deduced.

Fortunately arity data is annotated in the opcode table.

By starting with simple jittable ops and working up towards more complex
ones this approach can be usable without supporting all of perl's operator
calling conventions.

By processing the optree and annotating it with an SSA form (using a lookup
table on the side, to avoid changing any struct definitions), the data flow
is easily converted to LLVM assembley.

the boundry between jittable ops and non jittable ops is bridged by
generating appropriate stack pushes and pops.

Effectively, as far as the regular Perl runloop is concerned, the sequence
of jitted ops has been converted to a custom single op

The generated code for e.g. C<3 + $a> will look something like:


  SV *t1 = r_const(aTHX_ op1); /* extract the sv inside op1 */ <br />
  SV *t2 = r_padsv(aTHX_ op2); /* extract the lexical variable $a */ <br />
  SV *t3 = r_add(aTHX_ op3, t1, t2); <br />
  PL_op = op4; /* whatever is the ->op_next of the add op */<br />
  XPUSHs(t3); /* push the result on the stack */ <br />
  /* execution resumes normally using runops */<br />

the r_* functions will look inside an explicit op parameter, instead of relying
on PL_op to get the various flags and metadata they require. Data present in the
dynamic scope is still reached using the THX or globals (in this case the pad
offset is found using op2, but the value inside the pad is not reachable via
just the optree, that part remains the same).

Each subroutine is compiled separately. Initially only subroutines that are
easy to compile (made of fixed arity ops, no stack boundries, etc) can be
compiled, with additional functionality added later.

By "upgrading" pure perl subs into XSUBs after successful compilation, no
special hooking is required.

For the uninitiated - B<SSA> is Static Single Assignment. In this form every
value has a symbol, and symbols are only assigned once. This is a common
intermediate representation for many languages.

LLVM assembly is based on SSA so this form will map directly to the emitted
output.

Generating an SSA form from a stack is fairly simple by using a stack of
symbols and a counter to generate new symbols. That way each stack position
is assigned a symbol on push, and the symbol is used on peek (and of course
discarded on pop).

This symbol stack is duplicated on branches and regenerated similarly to the
way the SSA phi function works on join.

3. Control ops

Branches can initially be handled simply by comparing C<PL_op>:

 PL_op = pp_foo(aTHX); /* some conditional/loop op */

 if ( PL_op == if_op->op_next ) {
 ...
 } else { /* == op_other */
 ...
 }

simplifying the task of emitting "native" branches such as

 if ( SvTRUE(t1) ) {
 ...
 } else {
 ...
 }

This is easy to do because all of the conditional ops are handled very
consistently by Perl.

The only exception to this rule is explicit vs. implicit return, but that's
a relatively simple case to handle.

Long term

If/when the project is at the stage where trivial subroutines can be
compiled to LLVM assembley trees, be jitted into a function pointer and installed
into an XSUB, then it will be time to convert the remaining ops, starting with
pp_hot, to parameter based definitions.

Other corners to be rounded off:

  • Command line options and/or pragmas for controlling the compilation should be introduced, including parameters for automatic jitting (see below for profile based compilation).
  • Stress testing, especially to demonstrate limitations of C based runtime need to be examined. Since this potentially uses the C stack more aggressively new limitations might be found and will need to be worked around. See below for Cheney on the MTA and other ideas.
  • New unit tests should be written (or at least a new harness to use the existing tests with various JIT options).
  • A marker for jitted CVs (as opposed to real XSUB) could be introduced, to allow disabling jit temporarily (e.g. for debugger tracing, reenabling pluggable runops, etc). The LLVM optree should probably be stored either in magic or in the SV ANY slot of the CV.
  • Lots of work will be needed for making llvm an optional dependency. LLVM is not as portable as Perl, and will only be a desirable target for platforms which it supports native jit for. This incurs a big penalty on the already high complexity of the code (many build variants, lots of preprocessing), and must be managed very carefully to avoid becoming a big maintenance burden.

Tradeoffs

Splitting up pp_ ops into smaller functions is a big manual effort.
Fortunately it can be done incrementally and it's not that hard to convert each
function.

The drawbacks are that it will either slow down normal C compliation, or it
will require even more crazy preprocessing of the C code to make sure the
C<r_*> functions are not defined unless compilation is targetting LLVM.

Aside from this the drawbacks of using LLVM involve:

=over 4

=item 1. more complex compilation toolchain (llvm assembley output must be dealt
with)

=item 2. slower startup times for llvm based perls

=item 3. inability to use the debugger or pluggable runops modules on jitted code

=item 4. more memory usage (compiled optrees, llvm library, llvm intermediate
 structures, etc)

=item 5. this plan is not guaranteed to work well. preliminary results are vague
 (but still somewhat promising). While not as comitting as a new virtual
 machine for perl 5, this is still a big undertaking.

 I estimate at least 1-2 months of full time work by someone who is at
 least somewhat familiar with both projects until we can judge whether or
 not this project can succeed at all.

 this is not that much work, but is still more than anyone involved in
 p5 internals currently has. I suspect GSoC might be a good fit for this.

 i doubt we can estimate how long from the point where it's deemed viable
 to the point where it's a real performance improvement, to the point
 where it's production ready. even though this is not as ambitious as
 ponie, this is still a big undertaking.

=back

Advantages of this plan compared to other approaches of jitting perl 5:

=over 4

=item 1. no code changes, the same intricate definitions of all of perl's opcodes
 are reused. this means that all the accumilated wisdom is not lost, the
 data structures are the same, and in theory the system is still binary
 compatible with compiled XS modules

 this means a lot less work than developing a real vm backend for perl 5

 llvm's C interoperability features are key to this

=item 2. llvm is backed by many organizations, it's stable, well funded and has a
 lively community so it is unlikely to die off.

 this provides us with a real, feature complete jit system with very
 advanced optimizations that can target many platforms, without any
 maintenance burden on the p5 developers.

=back

However, in the long run this is still a micro optimization. Ruby is
reportedly about twice as slow as perl for most things. That doesn't keep anyone from
writing successful applications in ruby.

Java is probably many times faster than Perl. None of us seem to be
switching to it because you can crunch numbers faster in it.

A JIT can't make dramatic high level performance improvements.

It'll speed up tight loops, simple arithmetic, opcode dispatch, etc but for
apps whose performance is bound by memory management overhead or IO this
will effort will go almost unnoticed.

Furthermore, IMHO most apps are not even bound by those limitations, but
rather by programmer laziness/sloppiness. If better performance is a low hanging
fruit that remains unpicked, JIT is really not going to make much of a difference.

Since typical Perl code probably won't be dramatically affected by the
performance benefits of this project, funding is probably unlikely.

Even if it runs 2x faster, it doesn't mean it'll be 2x cheaper to run. This is
the "you are not google" axiom of scalability. Even if people want this
performance, for most of them it won't actually help their business.

This project might allow Perl to become an appropriate choice for more
performance critical applications, or alleviate the urge to write XS based
optimizations, but that effect will probably be minor at best.

I suspect the main benefit would actually be in marketing (look how excited
we're getting about python being "5x faster"), and in demonstrating that the
runtime is able to adapt and be modernized, perhaps paving the path to more
ambitious language and runtime improvements.

          • THAT LAST CHUNK WAS THE IMPORTANT BIT, IF YOU SKIPPED IT THE REST IS

EVEN MORE BORING *****

Future directions

Some ideas on how to take this further, based on other interesting projects.

These are ideas for further experimentation, some of which I've tried
pursuing out of curiosity, none of which were anywhere near done.

None of these ideas are relevant until a stable and useful LLVM emitter can
actually be demonstrated for p5, with good performance on artificial
benchmarks.

They could theoretically be used to bridge the gap between good performance
in the lab and good performance in the wild.

1. Profile based compilation

To avoid too much memory usage for generated code and too much cpu usage
generating it, code tracing should probably be used to identify hot paths.

This can be done by modifying C<ENTERSUB> to trigger jit compilation for
commonly used CVs.

To avoid cache spilling while tracing, a hashing variant of a C<bloom filter>
may be used to B<count CVs>. This loses accuracy but given that this is a
stochastic process anyway the tradeoff is probably fair.

A hash function with N salts can be used to sequentially flip the bits of
the bloom filter. The hash is computed over the salt and the CV address.

The first hash is used to find the actual bloom filter. By doing this lookup
once we avoid a cache spill once per hash function.

The hashing is then repeated for the CV address. If the bloom filter bit is
off, it is flipped on and the loop is aborted.

If all the bits are turned on, the subroutine is jitted.

By increasing N the time before jitting is increased (higher threshold for
what is a hot path).

By increasing the number or size of the bloom filters collision is avoided.

By using the hashing variant of a bloom filter (sub partitions), instead of
spilling the cache up to N times, it is spilled at most once.

The advantage over a hash of counters keyed by SV is significant: there is
only one pointer indirection, the memory usage for the table is constant, and the
operations to update the counters are simple and always require a constant
time.

Yet another variant is introducing random decay to the filter. Once a
certain fill percentage is reached, random bits are turned off. This allows the
system to keep jitting code without getting false positives by filling the filter
up too much. This approach is also sensitive to changing hot paths without
needing historical using stochastic approximation.

2. Trace based compilation

Another interesting direction is using type information to explode the code
path, much like B<psyco> does.

This techniques allows tight loops in dynamic code to be have the dynamic
typing semantics moved out of the loops as invariants, producing dramatic
performance improvements for simple arithmetic.

This can be done by generating guard clauses that inspect the SV flags, and
then producing code that treats these parameters as invariants:


 if ( SvIOK(sv) ) {<br />
 /* generate code in which this condition is an invariant */<br />

This technique has been used by the psyco project to produce python code
that computes simple arithmetic with speeds that are competitive with C/Java,
without any code changes in the python.

The drawback is potentially huge memory usage due to many variants being
generated. This combined with profile based compilation could be an
interesting future direction.

LLVM's optimizers are supposed to be clever enough to perform high quality
constant detection, even accross procedure boundries, and using that for
branch elimination, so it may be a matter of just generating const variants.

Otherwise, further dumbed down versions of the ops are in order (for
instance, r_add on IVs, with two code paths, one for overflows and one without, nd a
separate r_add for NVs which does floating point math).

By having these variants the emitted machine code could potentially be much
smaller/faster, and the various sub opcodes would be better candidates for
inlining by llvm.

3. Inter procedure optimization

C<ENTERSUB> ops can be analyzed in depth.

If the CV of the target is known at compile time (the call is made by using
a GV pointer instead of a symbol name), the LLVM variant of that sub can be
called directly, skipping entersub altogether.

This could be facilitated using some aggressive inlining pragma.

4. Stack overflow guards

To prevent overflowing the C stack, entersub ops can have an additional guard
clause added, to fall "traditional" interpretation if the C stack is running out
of room.

This allows the system to make use of slower but safe code as necessary, if
the recursion limits are reached.

A related approach is using cheney-on-the-mta techniques to swap out the C
stack, or possibly overflow it into the perl stack, to avoid this limit
without the performance tradeoffs of switching back to interpreting optrees.

5. Other targets

The work on refactoring the C<pp> functions, creating an emitter etc will
likely pave the path towards enabling other VMs in the future too.

The hardest part of targetting Perl to other VMs is that the opcodes are very
intricately defined, and are highly dependent on the execution context (the
state of the stacks, the data in the optree, the op flags, etc).

LLVM should be able to provide better performance by refactoring large C<pp>
functions into smaller, simpler ones.

This is a much bigger undertaking than using LLVM with existing because it
requires reimplementing, not just refactoring the various opcode
implementations, but immediate performance benefits might provide an
additional incentive towards making these changes.

This is an incremental step towards cleaning up the runtime so that the
existing OP* structures can be sanely compiled to other targets.

This will also greatly simplify generating optrees with B generate, and
analyzing existing optrees.

This is the most pie in the sky idea, but in the long run probably the most
important.

My interest in this work was as a project that could potentially catalyze some
improvement to the hackability of the optree, the performance aspects are
minor.

Prior Art

B<lua-jit 2.x> has trace based compilation that produces assembley code only
for excercised code paths. it's very clever and clean.
Lhttp://lua-users.org/lists/lua-l/2008-02/msg00051.html contains lots of info
on the internals

B<psyco> uses trace caching to generate static variants of dynamic python code,
to eliminate branches early on: Lhttp://psyco.sourceforge.net/

B<CheneyMTA> Lhttp://home.pipeline.com/~hbaker1/CheneyMTA.html is a technique for
compiling continuation passing style code to native C code without using a
trampoline function to prevent the C stack from growing.

Plan

I was guessing 1-2 months until we can have certainty whether or not it'll work
at all, that is having a functional proof of concept with no real performance
goals yet, but which has all the right pieces in place.

I think a more realistic time frame for this whole project is something like 6
months to get a first version out with the various pains dealt with, and then
another endless potential for incremental performance improvements by tweaking
the emitted LLVM assembley code.

Here's a slightly more detailed list of items for the 1-2 time frame:

=over 4

=item 1. embedding LLVM in perl, probably several days for someone familiar with the
configuration toolchain

=item 2. cleaning up the llvm-gcc related stuff, again a few days. i made a mess of
this because the llvm linker was a little wonky on OSX when I played around with
this.

=item 3. implementing a runops that traps ENTERSUB dispatching to jittable CVs, which
triggers the jit stuff and installs an XSUB (maybe it's better to tweak ENTERSUB
itself?), probably about 2 days for a simple version (none of the clever stuff
mentioned in the doc)

=item 4. the optree to LLVM assembly compiler (the naive unrolling + conditionals),
should be about a week or two for simple control ops. the hardest part here is
ensuring proper test coverage

=item 5. refactoring of a small set of PP functions into standalone versions, based on
some artificial benchmark using these ops, this is more open ended. I estimate
about 2 weeks of work to start measuring performance improvements, by writing a
benchmarking harness and then refactoring the opcodes to try and boost
performance based on profiling.

=item At this point the project should be 90% done, so I expect the remaining 90% to
involve improving portability and the llvm build process (making it optional),
and dealing with pathological cases for the emitter and the jit triggering (that
stuff is pretty hard to tweak to ensure good all around performance, e.g. making
sure jitting addresses CPU bound hot paths, and isn't done excessively).

=back

AUTHOR

Yuval Kogman nothingmuch@woobling.org

Lhttp://nothingmuch.woobling.org 0xEBD27418

Lhttp://www.nntp.perl.org/group/perl.perl5.porters/2008/06/msg137461.html

Lhttp://www.nntp.perl.org/group/perl.perl5.porters/2009/04/msg145425.html


Tags

    There are no tags for this page.

Attachments

Created by rurban on Apr 9 7:37am. Updated by rurban on Apr 9 7:40am. (2 revisions, 1,940 views)