Introducing GCC-Bridge: A C/Fortran compiler targeting the JVM
In this post, I wanted to finally give a proper introduction to GCC-Bridge, a C/Fortran compiler targeting the Java Virtual Machine (JVM) that makes it possible for Renjin to run R packages that include "native" C and Fortran code without sacrificing platform independence.
Supporting R packages with native code is a big deal: 48% of CRAN's 33 MLOC of code is native code. And so while our ultimate goal is to allow users to write fast R code without falling back to another language, if we're to be serious about running existing R packages, then we need a solution for the existing native code base.
But we also wanted a solution that preserved Renjin's advantages over GNU R. If we were to use JNI to load platform-specific native libraries, we would inherit all of the deployment headaches that we set out to solve by building Renjin on the JVM, and lose the ability to run on Google App Engine and other sandboxed environments.
More troubling, the widespread use of global variables in package native code would severely complicate Renjin's auto-parallelization strategies and prevent users from running multiple concurrent Renjin sessions in the same JVM process.
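To make the global-variable problem concrete, here is a minimal sketch of the kind of file-scope state that is common in native code. The accumulator and its function names are made up for illustration and are not taken from any CRAN package.

```c
/* Hypothetical example: file-scope state shared by every caller
 * in the process, regardless of which session made the call. */
static double running_total = 0.0;
static int running_count = 0;

/* Clears the shared accumulator. */
void acc_reset(void) {
    running_total = 0.0;
    running_count = 0;
}

/* Adds one observation to the shared accumulator. */
void acc_add(double value) {
    running_total += value;
    running_count += 1;
}

/* Mean of everything added since the last reset, or 0 if empty. */
double acc_mean(void) {
    return running_count > 0 ? running_total / running_count : 0.0;
}
```

If one session adds 1 and 3 while a second session in the same process adds 100, the first session no longer reads back a mean of 2: both sessions are writing to the same two variables.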
For these reasons, we set out to build GCC-Bridge, a toolchain that could compile C, C++, and Fortran sources to pure Java bytecode.
Bridging GCC and the JVM
GCC-Bridge, as its name implies, builds on the GNU Compiler Collection (GCC), which has a modular structure designed to support multiple input languages, including C, C++, and Fortran, and multiple backends targeting, for example, x86, ARM, and MIPS.
GCC achieves this small miracle by reducing all input languages into a common, simple intermediate language called Gimple. GCC performs most of its optimizations on Gimple, before lowering it even further to another intermediate language called the Register Transfer Language (RTL), which is then handed over to the backends to generate actual machine instructions.
For us, Gimple is also a terrific starting point for a compiler targeting the JVM. Consider a simple C function that sums an array of double-precision floating-point numbers.
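A minimal sketch of such a function follows, next to a hand-lowered version that approximates the flavor of Gimple: one simple operation per statement, explicit temporaries, and control flow expressed with labels and jumps. The lowered version is written by hand in plain C and is not actual GCC output; the names sum_array and sum_array_lowered are made up for illustration.

```c
/* The function as a programmer would write it. */
double sum_array(const double *x, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += x[i];
    }
    return total;
}

/* Hand-written approximation of the same function after lowering:
 * the for loop becomes labels and gotos, and the array access
 * becomes an explicit pointer offset followed by a load. */
double sum_array_lowered(const double *x, int n) {
    double total;
    double elem;
    const double *addr;
    int i;

    total = 0.0;
    i = 0;
    goto check;
body:
    addr = x + i;          /* address of the i-th element */
    elem = *addr;          /* load it into a temporary */
    total = total + elem;
    i = i + 1;
check:
    if (i < n) goto body; else goto done;
done:
    return total;
}
```

Both functions compute the same result; the second simply spells out each step that the first leaves implicit.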
All the complexities of C and Fortran are reduced to a simple list of statements, with a small number of operations. This is terrific, because I really, really, didn't want to have to learn Fortran!
GCC-Bridge consists of two parts: a small plugin for GCC itself, which dumps the optimized Gimple to a JSON file (one per source file), and a compiler, written in Java, which compiles the JSON-encoded Gimple files to Java class files using the ASM bytecode library.
Note that we're not compiling to the Java language. Like Scala or Clojure, we're targeting the Java Virtual Machine, which was originally designed for Java but has its own standard instruction set.
Emulating the GNU R C API
GNU R provides several methods for interfacing with native code from R.
The simplest of these methods, the so-called .C and .Fortran interfaces, simply pass the R vectors as double-precision or integer arrays to C or Fortran functions, which might look like this:
void kmeans_Lloyd(double *x, int *pn, int *pp, double *centers, int *pk, int *cl, int *pmaxiter, int *nc, double *wss);
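As a smaller illustration of the same convention, here is a sketch of a function written in the .C style: every argument is a pointer, scalars such as dimensions arrive as one-element arrays, and results are written back into the argument buffers because the function returns void. The function center_columns is hypothetical, and it assumes the matrix is stored column by column, as R stores matrices.

```c
/* Hypothetical .C-style routine: subtracts each column's mean from
 * an n-by-p matrix stored in column-major order.
 *   x     : in/out, n*p values, centered in place
 *   pn    : pointer to the number of rows
 *   pp    : pointer to the number of columns
 *   means : out, p values, receives the column means */
void center_columns(double *x, int *pn, int *pp, double *means) {
    int n = *pn;
    int p = *pp;

    for (int j = 0; j < p; j++) {
        double *col = x + (long)j * n;   /* start of column j */
        double sum = 0.0;

        for (int i = 0; i < n; i++) {
            sum += col[i];
        }
        means[j] = n > 0 ? sum / n : 0.0;

        for (int i = 0; i < n; i++) {
            col[i] -= means[j];
        }
    }
}
```

Because only plain double and int arrays cross the boundary, a caller needs nothing beyond the array contents to invoke a routine like this.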
Renjin has supported this interface for some time, but starting with version 0.8.x released at the end of last year, we now support the .Call interface as well, which involves passing pointers to GNU R internal
The great thing about the GCC-Bridge toolchain is that it gives us the chance to play with the input sources before compiling them. We use this capability to map all references to the SEXPREC type to Renjin's own Java interface, org.renjin.sexp.SEXP, and to link all calls to the internal GNU R API to Java methods, initially generated from GNU R's own header files.
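The sketch below suggests why a mapping like this is plausible. Code written in the .Call style typically holds an R object only as an opaque pointer and reaches its contents through API functions, so the type behind the pointer and the functions behind the calls can be swapped out without touching the calling code. Only the SEXP typedef mirrors GNU R's headers; the struct layout and every toy_ function are stand-ins invented so the example compiles on its own, and they are neither GNU R's API nor Renjin's.

```c
#include <stdlib.h>

/* As in GNU R's headers, SEXP is a pointer to struct SEXPREC. */
typedef struct SEXPREC *SEXP;

/* Toy layout (an assumption for this sketch only). */
struct SEXPREC {
    int length;
    double *values;
};

/* Toy allocator for a numeric vector of n zeros. */
SEXP toy_alloc_real(int n) {
    SEXP s = malloc(sizeof *s);
    if (s == NULL) abort();
    s->length = n;
    s->values = calloc(n > 0 ? (size_t)n : 1, sizeof(double));
    if (s->values == NULL) abort();
    return s;
}

/* Toy accessors: the only way calling code looks inside an object. */
int toy_length(SEXP s) { return s->length; }
double *toy_real(SEXP s) { return s->values; }

void toy_free(SEXP s) {
    free(s->values);
    free(s);
}

/* A .Call-style entry point: takes an object, returns a new object.
 * It never touches the struct fields directly. */
SEXP toy_cumsum(SEXP x) {
    int n = toy_length(x);
    SEXP result = toy_alloc_real(n);
    double *in = toy_real(x);
    double *out = toy_real(result);
    double acc = 0.0;

    for (int i = 0; i < n; i++) {
        acc += in[i];
        out[i] = acc;
    }
    return result;
}
```

Since toy_cumsum depends only on the pointer type and three function calls, a compiler that rewrites the type and relinks the calls could retarget it without changing its body.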
Kicking the Tires
GCC-Bridge is an important part of the Renjin toolchain for GNU R packages, but it can also be used independently of Renjin.
You can fork this repo and use it as a basis for compiling your own C/Fortran source to Java classes.
Keep in mind that we've worked primarily on compiling scientific code that does pure computation, so you won't find implementations of many basic C standard library functions like fopen() at this point.
There are a lot of interesting things to talk about, so this will be the first post in a series. In subsequent posts, I'll dive into the compiler's internals and look at how we handle anathemas like pointer arithmetic and malloc(); I'll explore the performance implications of running C code on the JVM; and finally I'll review the current limitations of the compiler and some potential ways forward.