This post is a review of a paper that was published in the Operating Systems Design and Implementation conference (OSDI) 2008:
Binary Translation Using Peephole Superoptimizers
Sorav Bansal and Alex Aiken, Stanford University
URL: http://www.usenix.org/events/osdi08/tech/full_papers/bansal/bansal.pdf
This is one of my new favorite papers from OSDI. They propose a novel application of techniques found in superoptimizers: automatic learning of translation rules that can be used to construct binary translators. They provide a method of determining optimal register mappings between architectures, and implement their approach. They use the implementation to generate translation rules, thereby constructing a static binary translator. PowerPC is used as the source architecture, and the target instruction set is x86. Experimental results are shown, that measure the runtime of translated code, compared with the native execution. Micro-benchmarks (lmbench) and larger benchmarks (specint) are the experimental workloads.
The following are some rough notes that I made just after reading this paper. I'll try to tweak my original notes a little, so they seem more coherent to the reader:
One really interesting thing about this work is that they use qemu on a linux-x86 host to compare performance and results of instruction sequences executing on an emulated PowerPC platform. Although the paper didn't say much about how they actually ran their experiments, I assume that they batch two phases: 1) train with sequences found on the powerpc platform, and record all of the state change and performance results, 2) feed the results from (1) into the x86 half of their implementation, discovering the translation rules. They didn't say much about how certain information about the architectures is encoded, such as register layout, flags, but did mention how they handled the stack and heap. I'm assuming the architectural details and general info about the instruction set encoding could be represented in some high-level interpreted language or configuration file.
Their approach is computationally complex, so it takes about a week to get a good set of rules. One source for additional research would be to look into optimizations for the approach. Pruning the search tree of possible sequences and finding register mapping more efficiently are particular examples. However, once the translation rules are found, the binary translator can be used with minimal overhead (avg ~67% native in their case).
Interestingly, the translated code often performed better than the native on microbenchmarks. This, they mention, could be due the fact that code is fundamentally easier to optimize on a RISC architecture such as PowerPC. When that highly optimized code is then translated to x86, a CISC architecture, the translated code is better optimized than the native x86 instructions used in the benchmarks. One thing to note here is that it is very difficult to compare performance across different architectures, so in general, this is a touchy subject. However, the approach in this paper is using relative execution costs, to find the best possible translation sequence of instructions such that performance is optimized and correctness is ensured. Also, I'm assuming all comparisons are done between translated x86 code and native x86 code from the x86 versions of the benchmarks (with some unspecified level of optimization).
Experiments were only performed on a static translator, although they do explain that their method could be theoretically applied to a dynamic translator. They calculate the overhead that would result from dynamic translation, and try to compare, but I'm not quite sure their analysis is correct, have to check this. I think they are ignoring the fact that there will be repeated
invocations of the algorithm that determines the best register mapping, as the register mappings change with different sequences of instructions. It seems like a bad idea to keep changing the register mappings, because the memory layout of the translation cache and representation of the virtual cpu will change continuously. However from my experience implementing a dynamic translator, the translation overhead is typically very low compared to the overhead due to low instruction quality (execution time of the instructions from the cache). The only difference in my case was that my source and destination were the same architecture (roughly), so the translation module itself was very efficient, also since many of the instructions had identical translations (and it was a RISC architecture). The point here is, that I think it is an open problem whether or not a performant dynamic translator would result from this work, although I am optimistic.
What I mostly learned from this paper is the material on superoptimizers. It makes sense that this would be a natural application, and I think that it's a good overlap between OS and Programming Languages. Also, it was good to get a review on issues in binary translation, which I have a lot of knowledge of (certainly, from implementing a full-system dynamic binary translator from scratch!). The paper touched on a large numbers of the problems I had to deal with during implementation. They don't go into too much depth on any one of these, but it definitely jogged my memory and made me remember the additional things you have to worry about when the translator is dynamic.
Unfortunately, I don't think their contribution can be applied to writing a same-architecture translator that includes translation of kernel code (full-system), with much reduction on the engineering effort. This is because much of the engineering complexity in this case has to do with branch handling, spilling/filling registers, exception handling and redirection, emulating memory accesses to protected data or memory mapped I/O, emulation of interrupt logic and low-level devices (ie, timer, serial port), and synchonizing resources among emulated threads. Additionally, a translator that translates between the same source and target instruction sets trivializes many of the translation rules, because many non-priveleged instructions will have identical translations. But, for a basic survey of fundamental binary translation issues for mostly user-level code between different instruction sets, this is a good paper.