%\documentclass[11pt]{proc} \documentclass[preprint,10pt]{sigplanconf} \usepackage{amsmath} \usepackage{url} \usepackage{graphicx} \begin{document} \title{Emscripten: An LLVM-to-JavaScript Compiler} \conferenceinfo{Splash '11}{??-2011, Portland.} \copyrightyear{2011} \copyrightdata{[to be supplied]} \titlebanner{} % These are ignored unless \preprintfooter{} % 'preprint' option specified. \authorinfo{Alon Zakai} {Mozilla} {azakai@mozilla.com} \maketitle %\title{Emscripten: An LLVM-to-JavaScript Compiler} %\subtitle{} %\authorinfo{Alon Zakai} % {Mozilla} % {azakai@mozilla.com} %\author{Alon Zakai \\ Mozilla \\ \url{azakai@mozilla.com}} %\maketitle \begin{abstract} We present Emscripten, a compiler from LLVM (Low Level Virtual Machine) assembly to JavaScript. This opens up two avenues for running code written in languages other than JavaScript on the web: (1) Compile code directly into LLVM assembly, and then compile that into JavaScript using Emscripten, or (2) Compile a language's entire runtime into LLVM and then JavaScript, as in the previous approach, and then use the compiled runtime to run code written in that language. For example, the former approach can work for C and C++, while the latter can work for Python; all three examples open up new opportunities for running code on the web. Emscripten itself is written in JavaScript and is available under the MIT license (a permissive open source license), at \url{http://www.emscripten.org}. As a compiler from LLVM to JavaScript, the challenges in designing Emscripten are somewhat the reverse of the norm -- one must go from a low-level assembly into a high-level language, and recreate parts of the original high-level structure of the code that were lost in the compilation to low-level LLVM. We detail the methods used in Emscripten to deal with those challenges, and in particular present and prove the validity of Emscripten's Relooper algorithm, which recreates high-level loop structures from low-level branching data. \end{abstract} %\category{CR-number}{subcategory}{third-level} %\terms %term1, term2 %\keywords %keyword1, keyword2 \bigskip %\copyright 2011 Alon Zakai. License: Creative Commons Attribution-ShareAlike (CC BY-SA), \url{http://creativecommons.org/licenses/by-sa/3.0/} \section{Introduction} Since the mid 1990's, JavaScript~\cite{js} has been present in most web browsers (sometimes with minor variations and under slightly different names, e.g., JScript in Internet Explorer), and today it is well-supported on essentially all web browsers, from desktop browsers like Internet Explorer, Firefox, Chrome and Safari, to mobile browsers on smartphones and tablets. Together with HTML and CSS, JavaScript forms the standards-based foundation of the web. Running other programming languages on the web has been suggested many times, and browser plugins have allowed doing so, e.g., via the Java and Flash plugins. However, plugins must be manually installed and do not integrate in a perfect way with the outside HTML. Perhaps more problematic is that they cannot run at all on some platforms, for example, Java and Flash cannot run on iOS devices such as the iPhone and iPad. For those reasons, JavaScript remains the primary programming language of the web. There are, however, reasonable motivations for running code from other programming languages on the web, for example, if one has a large amount of existing code already written in another language, or if one simply has a strong preference for another language and perhaps is more productive in it. As a consequence, there has been work on tools to compile languages \textbf{into} JavaScript. Since JavaScript is present in essentially all web browsers, by compiling one's language of choice into JavaScript, one can still generate content that will run practically everywhere. Examples of the approach of compiling into JavaScript include the Google Web Toolkit~\cite{gwt}, which compiles Java into JavaScript; Pyjamas\footnote{\url{http://pyjs.org/}}, which compiles Python into JavaScript; SCM2JS \cite{hop}, which compiles Scheme to JavaScript, Links \cite{links}, which compiles an ML-like language into JavaScript; and AFAX \cite{afax}, which compiles F\# to JavaScript; see also \cite{ashkenas} for additional examples. While useful, such tools usually only allow a subset of the original language to be compiled. For example, multithreaded code (with shared memory) is not possible on the web, so compiling code of that sort is not directly possible. There are also often limitations of the conversion process, for example, Pyjamas compiles Python to JavaScript in a nearly 1-to-1 manner, and as a consequence the underlying semantics are those of JavaScript, not Python, so for example division of integers can yield unexpected results (it should yield an integer in Python 2.x, but in JavaScript and in Pyjamas a floating-point number can be generated). In this paper we present another project along those lines: \textbf{Emscripten}, which compiles LLVM (Low Level Virtual Machine\footnote{\url{http://llvm.org/}}) assembly into JavaScript. LLVM is a compiler project primarily focused on C, C++ and Objective-C. It compiles those languages through a \emph{frontend} (the main ones of which are Clang and LLVM-GCC) into the LLVM intermediary representation (which can be machine-readable bitcode, or human-readable assembly), and then passes it through a \emph{backend} which generates actual machine code for a particular architecture. Emscripten plays the role of a backend which targets JavaScript. By using Emscripten, potentially many languages can be run on the web, using one of the following methods: \begin{itemize} \item Compile \textbf{code} in a language recognized by one of the existing LLVM frontends into LLVM, and then compile that into JavaScript using Emscripten. Frontends for various languages exist, including many of the most popular programming languages such as C and C++, and also various new and emerging languages (e.g., Rust\footnote{\url{https://github.com/graydon/rust/}}). \item Compile the \textbf{runtime} used to parse and execute code in a particular language into LLVM, then compile that into JavaScript using Emscripten. It is then possible to run code in that runtime on the web. This is a useful approach if a language's runtime is written in a language for which an LLVM frontend exists, but the language itself has no such frontend. For example, there is currently no frontend for Python, however it is possible to compile CPython -- the standard implementation of Python, written in C -- into JavaScript, and run Python code on that (see Section~\ref{sec:examples}). \end{itemize} From a technical standpoint, one challenge in designing and implementing Emscripten is that it compiles a low-level language -- LLVM assembly -- into a high-level one -- JavaScript. This is somewhat the reverse of the usual situation one is in when building a compiler, and leads to some unique difficulties. For example, to get good performance in JavaScript one must use natural JavaScript code flow structures, like loops and ifs, but those structures do not exist in LLVM assembly (instead, what is present there is a `soup of code fragments': blocks of code with branching information but no high-level structure). Emscripten must therefore reconstruct a high-level representation from the low-level data it receives. In theory that issue could have been avoided by compiling a higher-level language into JavaScript. For example, if compiling Java into JavaScript (as the Google Web Toolkit does), then one can benefit from the fact that Java's loops, ifs and so forth generally have a very direct parallel in JavaScript. But of course the downside in that approach is it yields a compiler only for Java. In Section~\ref{sec:relooper} we present the `Relooper' algorithm, which generates high-level loop structures from the low-level branching data present in LLVM assembly. It is similar to loop recovery algorithms used in decompilation (see, for example, \cite{Cifuentes98assemblyto}, \cite{pro97}). The main difference between the Relooper and standard loop recovery algorithms is that the Relooper generates loops in a different language than that which was compiled originally, whereas decompilers generally assume they are returning to the original language. The Relooper's goal is not to accurately recreate the original source code, but rather to generate native JavaScript control flow structures, which can then be implemented efficiently in modern JavaScript engines. Another challenge in Emscripten is to maintain accuracy (that is, to keep the results of the compiled code the same as the original) while not sacrificing performance. LLVM assembly is an abstraction of how modern CPUs are programmed for, and its basic operations are not all directly possible in JavaScript. For example, if in LLVM we are to add two unsigned 8-bit numbers $x$ and $y$, with overflowing (e.g., 255 plus 1 should give 0), then there is no single operation in JavaScript which can do this -- we cannot just write $x+y$, as that would use the normal JavaScript semantics. It is possible to emulate a CPU in JavaScript, however doing so is very slow. Emscripten's approach is to allow such emulation, but to try to use it as little as possible, and to provide tools that help one find out which parts of the compiled code actually need such full emulation. We conclude this introduction with a list of this paper's main contributions: \begin{itemize} \item We describe Emscripten itself, during which we detail its approach in compiling LLVM into JavaScript. \item We give details of Emscripten's Relooper algorithm, mentioned earlier, which generates high-level loop structures from low-level branching data, and prove its validity. \end{itemize} In addition, the following are the main contributions of Emscripten itself, that to our knowledge were not previously possible: \begin{itemize} \item It allows compiling a very large subset of C and C++ code into JavaScript, which can then be run on the web. \item By compiling their runtimes, it allows running languages such as Python on the web (with their normal semantics). \end{itemize} The remainder of this paper is structured as follows. In Section~\ref{sec:compapp} we describe the approach Emscripten takes to compiling LLVM assembly into JavaScript, and show some benchmark data. In Section~\ref{sec:emarch} we describe Emscripten's internal design and in particular elaborate on the Relooper algorithm. In Section~\ref{sec:examples} we give several example uses of Emscripten. In Section~\ref{sec:summary} we summarize and give directions for future work. \section{Compilation Approach} \label{sec:compapp} Let us begin by considering what the challenge is, when we want to compile LLVM assembly into JavaScript. Assume we are given the following simple example of a C program: \begin{verbatim} #include int main() { int sum = 0; for (int i = 1; i <= 100; i++) sum += i; printf("1+...+100=%d\n", sum); return 0; } \end{verbatim} This program calculates the sum of the integers from 1 to 100. When compiled by Clang, the generated LLVM assembly code includes the following: \label{code:examplellvm} \begin{verbatim} @.str = private constant [14 x i8] c"1+...+100=%d\0A\00" define i32 @main() { %1 = alloca i32, align 4 %sum = alloca i32, align 4 %i = alloca i32, align 4 store i32 0, i32* %1 store i32 0, i32* %sum, align 4 store i32 1, i32* %i, align 4 br label %2 ;