# A small intro on the rationale

So I’m working on a Symbolic Regression Machine written in C/C++ called Shine, which is intended to be a JIT for Genetic Programming libraries (like Pyevolve for instance). The main rationale behind Shine is that we have today a lot of research on speeding Genetic Programming using GPUs (the GPU fever !) or any other special hardware, etc, however we don’t have many papers talking about optimizing GP using the state of art compilers optimizations like we have on clang, gcc, etc.

The “hot spot” or the component that consumes a lot of CPU resources today on Genetic Programming is the evaluation of each individual in order to calculate the fitness of the program tree. This evaluation is often executed on each set of parameters of the “training” set. Suppose you want to make a symbolic regression of a single expression like the Pythagoras Theorem and you have a linear space of parameters from 1.0 to 1000.0 with a step of 0.1 you have 10.000 evaluations for each individual (program tree) of your population !

What Shine does is described on the image below:

It takes the individual of the Genetic Programming engine and then converts it to LLVM Intermediate Representation (LLVM assembly language), after that it runs the transformation passes of the LLVM (here is where the true power of modern compilers enter on the GP context) and then the LLVM JIT converts the optimized LLVM IR into native code for the specified target (X86, PowerPC, etc).

You can see below the Shine architecture:

This architecture brings a lot of flexibility for Genetic Programming, you can for instance write functions that could be used later on your individuals on any language supported by the LLVM, what matters to Shine is the LLVM IR, you can use any language that LLVM supports and then use the IR generated by LLVM, you can mix code from C, C++, Ada, Fortran, D, etc and use your functions as non-terminal nodes of your Genetic Programming trees.

Shine is still on its earlier development, it looks a simple idea but I still have a lot of problems to solve, things like how to JIT the evaluation process itself instead of doing calls from Python using ctypes bindings of the JITed trees.

# Doing Genetic Programming on the Python AST itself

During the development of Shine, an idea happened to me, that I could use a restricted Python Abstract Syntax Tree (AST) as the representation of individuals on a Genetic Programming engine, the main advantage of this is the flexibility and the possibility to reuse a lot of things. Of course that a shared library written in C/C++ would be useful for a lot of Genetic Programming engines that doesn’t uses Python, but since my spare time to work on this is becoming more and more rare I started to rethink the approach and use Python and the LLVM bindings for LLVM (LLVMPY) and I just discovered that is pretty easy to JIT a restricted set of the Python AST to native code using LLVM, and this is what this post is going to show.

# JIT’ing a restricted Python AST

The most amazing part of LLVM is obviously the amount of transformation passes, the JIT and of course the ability to use the entire framework through a simple API (ok, not so simple sometimes). To simplify this example, I’m going to use an arbitrary restricted AST set of the Python AST that supports only subtraction (-), addition (+), multiplication (*) and division (/).

To understand the Python AST, you can use the Python parser that converts source into AST:

 12345678910 >>> import ast >>> astp = ast.parse("2*7") >>> ast.dump(astp) 'Module(     body=[Expr(         value=BinOp(             left=Num(n=2), op=Mult(), right=Num(n=7)         )     )] )'

What the parse created was an Abstract Syntax Tree containing the BinOp (Binary Operation) with the left operator as the number 2, the right operator as the number 7 and the operation itself as Multiplication(Mult), very easy to understand. What we are going to do to create the LLVM IR is to create a visitor that is going to visit each node of the tree. To do that, we can subclass the Python NodeVisitor class from the ast module. What the NodeVisitor does is to visit each node of the tree and then call the method ‘visit_OPERATOR’ if it exists, when the NodeVisitor is going to visit the node for the BinOp for example, it will call the method ‘visit_BinOp’ passing as parameter the BinOp node itself.

The structure of the class for for the JIT visitor will look like the code below:

 12345678910 # Import the ast and the llvm Python bindings import ast from llvm import * from llvm.core import * from llvm.ee import * import llvm.passes as lp class AstJit(ast.NodeVisitor):     def __init__(self):         pass

What we need to do now is to create an initialization method to keep the last state of the JIT visitor, this is needed because we are going to JIT the content of the Python AST into a function and the last instruction of the function needs to return what was the result of the last instruction visited by the JIT. We also need to receive a LLVM Module object in which our function will be created as well the closure type, for the sake of simplicity I’m not type any object, I’m just assuming that all numbers from the expression are integers, so the closure type will be the LLVM integer type.

 1234567891011121314151617181920212223242526272829 def __init__(self, module, parameters):         self.last_state = None         self.module = module         # Parameters that will be created on the IR function         self.parameters = parameters         self.closure_type = Type.int()         # An attribute to hold a link to the created function         # so we can use it to JIT later         self.func_obj = None         self._create_builder()     def _create_builder(self):         # How many parameters of integer type         params = [self.closure_type] * len(self.parameters)         # The prototype of the function, returning a integer         # and receiving the integer parameters         ty_func = Type.function(self.closure_type, params)         # Add the function to the module with the name 'func_ast_jit'         self.func_obj = self.module.add_function(ty_func, 'func_ast_jit')         # Create an argument in the function for each parameter specified         for index, pname in enumerate(self.parameters):             self.func_obj.args[index].name = pname         # Create a basic block and the builder         bb = self.func_obj.append_basic_block("entry")         self.builder = Builder.new(bb)

Now what we need to implement on our visitor is the ‘visit_OPERATOR’ methods for the BinOp and for the Numand Name operators. We will also implement the method to create the return instruction that will return the last state.

 12345678910111213141516171819202122232425262728293031323334353637383940 # A 'Name' is a node produced in the AST when you     # access a variable, like '2+x+y', 'x' and 'y' are     # the two names created on the AST for the expression.     def visit_Name(self, node):         # This variable is what function argument ?         index = self.parameters.index(node.id)         self.last_state = self.func_obj.args[index]         return self.last_state     # Here we create a LLVM IR integer constant using the     # Num node, on the expression '2+3' you'll have two     # Num nodes, the Num(n=2) and the Num(n=3).     def visit_Num(self, node):         self.last_state = Constant.int(self.closure_type, node.n)         return self.last_state       # The visitor for the binary operation     def visit_BinOp(self, node):         # Get the operation, left and right arguments         lhs = self.visit(node.left)         rhs = self.visit(node.right)         op = node.op         # Convert each operation (Sub, Add, Mult, Div) to their         # LLVM IR integer instruction equivalent         if isinstance(op, ast.Sub):             op = self.builder.sub(lhs, rhs, 'sub_t')         elif isinstance(op, ast.Add):             op = self.builder.add(lhs, rhs, 'add_t')         elif isinstance(op, ast.Mult):             op = self.builder.mul(lhs, rhs, 'mul_t')         elif isinstance(op, ast.Div):             op = self.builder.sdiv(lhs, rhs, 'sdiv_t')                 self.last_state = op         return self.last_state     # Build the return (ret) statement with the last state     def build_return(self):         self.builder.ret(self.last_state)

And that is it, our visitor is ready to convert a Python AST to a LLVM IR assembly language, to run it we’ll first create a LLVM module and an expression:

 12345 module = Module.new('ast_jit_module') # Note that I'm using two variables 'a' and 'b' expr = "(2+3*b+33*(10/2)+1+3/3+a)/2" node = ast.parse(expr) print ast.dump(node)

Will output:

Module(body=[Expr(value=BinOp(left=BinOp(left=BinOp(left=BinOp(


Now we can finally run our visitor on that generated AST the check the LLVM IR output:

 1234 visitor = AstJit(module, ['a', 'b']) visitor.visit(node) visitor.build_return() print module

Will output the LLVM IR:

; ModuleID = 'ast_jit_module'

define i32 @func_ast_jit(i32 %a, i32 %b) {
entry:
%mul_t = mul i32 3, %b
%sdiv_t = sdiv i32 %add_t4, 2
ret i32 %sdiv_t
}


Now is when the real fun begins, we want to run LLVM optimization passes to optimize our code with an equivalent GCC -O2 optimization level, to do that we create a PassManagerBuilder and a PassManager, the PassManagerBuilder is the component that adds the passes to the PassManager, you can also manually add arbitrary transformations like dead code elimination, function inlining, etc:

 12345678910 pmb = lp.PassManagerBuilder.new() # Optimization level pmb.opt_level = 2 pm = lp.PassManager.new() pmb.populate(pm) # Run the passes into the module pm.run(module) print module

Will output:

; ModuleID = 'ast_jit_module'

define i32 @func_ast_jit(i32 %a, i32 %b) nounwind readnone {
entry:
%mul_t = mul i32 %b, 3
%sdiv_t = sdiv i32 %add_t4, 2
ret i32 %sdiv_t
}


And here we have the optimized LLVM IR of the Python AST expression. The next step is to JIT that IR into native code and then execute it with some parameters:

 123456 ee = ExecutionEngine.new(module)     arg_a = GenericValue.int(Type.int(), 100)     arg_b = GenericValue.int(Type.int(), 42)         retval = ee.run_function(visitor.func_obj, [arg_a, arg_b])     print "Return: %d" % retval.as_int()

Will output:

Return: 197


And that’s it, you have created a AST->LLVM IR converter, optimized the LLVM IR with the transformation passes and then converted it to native code using the LLVM execution engine. I hope you liked =)

## The future can be written in RPython now

Following the recent article arguing why PyPy is the future of Python, I must say, PyPy is not the future of Python, is the present. When I have tested it last time (PyPy-c 1.1.0) with Pyevolve into the optimization of a simple Sphere function, it was at least 2x slower than Unladen Swallow Q2, but in that time, PyPy was not able to JIT. Now, with this new release of PyPy and the JIT’ing support, the scenario has changed.

PyPy has evolved a lot (actually, you can see this evolution here), a nice work was done on the GC system, saving (when compared to CPython) 8 bytes per object allocated, which is very interesting for applications that makes heavy use of object allocation (GP system are a strong example of this, since when they are implemented on object oriented languages, each syntax tree node is an object). Efforts are also being done to improve support for CPython extensions (written in C/C++), one of them is a little tricky: the use of RPyC, to proxy through TCP the remote calls to CPython; but the other seems by far more effective, which is the creation of the CPyExt subsystem. By using CPyExt, all you need is to have your CPython API functions implemented in CPyExt, a lot of people is working on this right now and you can do it too, it’s a long road to have a good API coverage, but when you think about advantages, this road becomes small.

In order to benchmark CPython, Jython, CPython+Psyco, Unladen Swallow and PyPy, I’ve used the Rastrigin function optimization (an example of that implementation is here in the Example 7 of Pyevolve 0.6rc1):

$f(x) = 10n + \sum_{i=1}^{n}{x_{i}^{2}}$ $-10\cos(2\pi x_{i})$

Due to its large search space and number of local minima, Rastrigin function is often used to measure the performance of Genetic Algorithms. Rastrigin function has a global minimum at $x=0$ where the $f(x) = 0$; in order to increase the search space and required resources, I’ve used 40 variables ($n=40$)  and 10k generations.

Here are the information about versions used in this benchmark:

• OS Ubuntu Linux 10.04 LTS (lucid)
• CPython 2.6.5 (Apr 16 2010)
• Jython 2.5.1 (Sun JVM 1.6.0_20, Server Mode)
• CPython 2.6.5 + PsycoV2 trunk (r74587)
• CPython 2.6.5 + Psyco 1.6.0 (default lucid package)
• PyPy trunk (r74537)

No warmup was performed in JVM or in PyPy. PyPy translator was executed using the “-Ojit” option in order to get the JIT version of the Python interpreter. The JVM was executed using the server mode, I’ve tested the client and server mode for Sun JVM and IcedTea6, the best results were observed from the server mode using Sun JVM, however when I’ve compared the client mode of IcedTea6 with the client mode of Sun JVM, the best results observed were from IcedTea6 (the same as using server mode in IcedTea6). Unladen Swallow was compiled using the project wiki instructions for building an optimized binary.

The machine used was an Intel(R) Core(TM) 2 Duo E4500 (2×2.20Ghz) with 2GB of RAM.

The result of the benchmark (measured using wall time) in seconds for each setup (these results are the best of 3 sequential runs):

As you can see, PyPy with JIT got a speedup of 2.57x when compared to CPython 2.6.5 and 2.0x  faster than Unladen Swallow current trunk.

PyPy is not only the future of Python, but is becoming the present right now. PyPy will not bring us only an implementation of Python in Python (which in itself is the valuable result of great efforts), but also will bring the performance back (which many doubted at the beginning, wondering how could it be possible for an implementation of Python in Python be faster than an implementation in C ? And here is where the translation and JIT magic enters). When the time comes that Python interpreter can be entire written in a high level language (actually almost the same language, which is really weird), Python community can put their focus on improving the language itself instead of spending time solving the complexity of the lower level languages, is this not the great point of those efforts ?

By the way, just to note, PyPy isn’t only a translator for the Python interpreter written in RPython, it’s a translator of RPython, what means that PyPy isn’t only the future of Python, but probably, the future of many interpreters.

## A method for JIT’ing algorithms and data structures with LLVM

Hello folks, I always post about Python and EvoComp (Pyevolve), but this time it’s about C, LLVM, search algorithms and data structures. This post describes the efforts to implement an idea: to JIT (verb) algorithms and the data structures used by them, together.

### AVL Tree Intro

Here is a short intro to AVL Trees from Wikipedia:

In computer science, an AVL tree is a self-balancing binary search tree, and it is the first such data structure to be invented. In an AVL tree, the heights of the two child subtrees of any node differ by at most one; therefore, it is also said to be height-balanced. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

### The problem and the idea

When we have a data structure and algorithms to handle (insert, remove and lookup) that structure, the native code of our algorithm is usually full of overhead; for example, in an AVL Tree (Balanced Binary Tree), the overhead appear in: checking if we really have a left or right node while traversing the nodes for lookups, accessing nodes inside nodes, etc. This overhead creates unnecessary assembly operations which in turn, creates native code overhead, even when the compiler optimize it. This overhead directly impacts on the performance of our algorithm (this traditional approach, of course, give us a very flexible structure and the complexity (not Big-O) is easy to handle, but we pay for it: performance loss).