VM の実装とその性能

AST vs. Bytecode: Interpreters in the Age of Meta-Compilation

Citation

Octave Larose, Sophie Kaleba, Humphrey Burchell, and Stefan Marr. 2023. AST vs. Bytecode: Interpreters in the Age of Meta-Compilation. Proc. ACM Program. Lang. 7, OOPSLA2, Article 233 (October 2023), 29 pages. https://doi.org/10.1145/3622808

Abstract

Thanks to partial evaluation and meta-tracing, it became practical to build language implementations that reach state-of-the-art peak performance by implementing only an interpreter. Systems such as RPython and GraalVM provide components such as a garbage collector and just-in-time compiler in a language-agnostic manner, greatly reducing implementation effort. However, meta-compilation-based language implementations still need to improve further to reach the low memory use and fast warmup behavior that custom-built systems provide. A key element in this endeavor is interpreter performance. Folklore tells us that bytecode interpreters are superior to abstract-syntax-tree (AST) interpreters both in terms of memory use and run-time performance. This work assesses the trade-offs between AST and bytecode interpreters to verify common assumptions and whether they hold in the context of meta-compilation systems. We implemented four interpreters, each an AST and a bytecode one using RPython and GraalVM. We keep the difference between the interpreters as small as feasible to be able to evaluate interpreter performance, peak performance, warmup, memory use, and the impact of individual optimizations. Our results show that both systems indeed reach performance close to Node.js/V8. Looking at interpreter-only performance, our AST interpreters are on par with, or even slightly faster than their bytecode counterparts. After just-in-time compilation, the results are roughly on par. This means bytecode interpreters do not have their widely assumed performance advantage. However, we can confirm that bytecodes are more compact in memory than ASTs, which becomes relevant for larger applications. However, for smaller applications, we noticed that bytecode interpreters allocate more memory because boxing avoidance is not as applicable, and because the bytecode interpreter structure requires memory, e.g., for a reified stack. Our results show AST interpreters to be competitive on top of meta-compilation systems. Together with possible engineering benefits, they should thus not be discounted so easily in favor of bytecode interpreters.

RegCPython: A Register-based Python Interpreter for Better Performance

Citation

Qiang Zhang, Lei Xu, and Baowen Xu. 2022. RegCPython: A Register-based Python Interpreter for Better Performance. ACM Trans. Archit. Code Optim. 20, 1, Article 14 (March 2023), 25 pages. https://doi.org/10.1145/3568973

Abstract

Interpreters are widely used in the implementation of many programming languages, such as Python, Perl, and Java. Even though various JIT compilers emerge in an endless stream, interpretation efficiency still plays a critical role in program performance. Does a stack-based interpreter or a register-based interpreter perform better? The pros and cons of the pair of architectures have long been discussed. The stack architecture is attractive for its concise model and compact bytecode, but our study finds that the register-based interpreter can also be implemented easily and that its bytecode size only grows by a small margin. Moreover, the latter turns out to be appreciably faster. Specifically, we implemented an open source Python interpreter named RegCPython based on CPython v3.10.1. The former is register based, while the latter is stack based. Without changes in syntax, Application Programming Interface, and Application Binary Interface, RegCPython is excellently compatible with CPython, as it does not break existing syntax or interfaces. It achieves a speedup of 1.287 on the most favorable benchmark and 0.977 even on the most unfavorable benchmark. For all Python-intensive benchmarks, the average speedup reaches 1.120 on x86 and 1.130 on ARM. Our evaluation work, which also serves as an empirical study, provides a detailed performance survey of both interpreters on modern hardware. It points out that the register-based interpreters are more efficient mainly due to the elimination of machine instructions needed, while changes in branch mispredictions and cache misses have a limited impact on performance. Additionally, it confirms that the register-based implementation is also satisfactory in terms of memory footprint, compilation cost, and implementation complexity.

Quantifying the interpretation overhead of Python

Citation

Qiang Zhang, Lei Xu, Xiangyu Zhang, and Baowen Xu. 2022. Quantifying the interpretation overhead of Python. Sci. Comput. Program. 215, C (Mar 2022). https://doi.org/10.1016/j.scico.2021.102759

Abstract

While Python has become increasingly popular for its convenience, it is also criticized for its suboptimal performance. To figure out what burdens the interpreter of Python and provide insights into possible optimizations, we conduct this empirical study on CPython’s performance via sampling-based profiling. This sampling-based approach incurs a low runtime overhead and does not require any modification of the interpreter and the application code, thus providing convincing experimental results. Specifically, we use 48 benchmarks from the pyperformance project to analyze the runtime overhead of the interpreter. We compare the usage of different opcodes and decompose the overhead at various granularities (e.g., files, functions, and statements). It turns out that most parts contribute a small portion of the overhead, and the promising improvements lie in the minority, such as name access opcodes and reference counting functions. Furthermore, we pay attention to four specific performance-affecting issues: name access, dynamic typing, garbage collection, and opcode dispatch. The issue study reveals several promising optimization techniques, such as register-based virtual machine architecture and tracing-based garbage collection, as well as a few fruitless optimization points, such as operator overloading and dispatch.

Quantitative Overhead Analysis for Python

Citation

M. Ismail and G. E. Suh, “Quantitative Overhead Analysis for Python,” 2018 IEEE International Symposium on Workload Characterization (IISWC), Raleigh, NC, USA, 2018, pp. 36-47, doi: 10.1109/IISWC.2018.8573512. keywords: {Python;Benchmark testing;Memory management;Microarchitecture;Pins;Hardware;Electric breakdown},

Abstract

Dynamic programming languages such as Python are becoming increasingly more popular, yet often show a significant performance slowdown compared to static languages such as C. This paper provides a detailed quantitative analysis of the overhead in Python without and with just-in-time (JIT) compilation. The study identifies a new major source of overhead, C function calls, for the Python interpreter. Additionally, we study the interaction of the run-time with the underlying processor hardware and find that the performance of Python with JIT depends heavily on the cache hierarchy and memory system. We find that proper nursery sizing is necessary for each application to optimize the trade-off between cache performance and garbage collection overhead. Although our studies focuses on Python, we show that our key findings can also apply to other dynamic languages such as Javascript.

Branch prediction and the performance of interpreters — Don’t trust folklore

Citation

E. Rohou, B. N. Swamy and A. Seznec, “Branch prediction and the performance of interpreters — Don’t trust folklore,” 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), San Francisco, CA, USA, 2015, pp. 103-114, doi: 10.1109/CGO.2015.7054191. keywords: {History;Switches;Radiation detectors;Benchmark testing;Hardware;Bridges;Cryptography},

Abstract

Interpreters have been used in many contexts. They provide portability and ease of development at the expense of performance. The literature of the past decade covers analysis of why interpreters are slow, and many software techniques to improve them. A large proportion of these works focuses on the dispatch loop, and in particular on the implementation of the switch statement: typically an indirect branch instruction. Folklore attributes a significant penalty to this branch, due to its high misprediction rate. We revisit this assumption, considering state-of-the-art branch predictors and the three most recent Intel processor generations on current interpreters. Using both hardware counters on Has well, the latest Intel processor generation, and simulation of the IT-TAGE, we show that the accuracy of indirect branch prediction is no longer critical for interpreters. We further compare the characteristics of these interpreters and analyze why the indirect branch is less important than before.

Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions

Citation

Jason Mccandless and David Gregg. 2012. Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions. ACM Trans. Archit. Code Optim. 8, 4, Article 24 (January 2012), 20 pages. https://doi.org/10.1145/2086696.2086703

Abstract

Indirect jump instructions are used to implement multiway branch statements and virtual function calls in object-oriented languages. Branch behavior can have significant impact on program performance, but fortunately hardware predictors can alleviate much of the risk. Modern processors include indirect branch predictors which use part of the target address to update a global history. We present a code generation technique to maximize the branch history information available to the predictor. We implement our optimization as an assembly language transformation, and evaluate it for SPEC benchmarks and interpreters using simulated and real hardware, showing indirect branch misprediction decreases.

Efficient interpretation using quickening

Citation

Stefan Brunthaler. 2010. Efficient interpretation using quickening. In Proceedings of the 6th symposium on Dynamic languages (DLS ‘10). Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/1869631.1869633

Abstract

Just-in-time compilers offer the biggest achievable payoff performance-wise, but their implementation is a non-trivial, time-consuming task affecting the interpreter’s maintenance for years to come, too. Recent research addresses this issue by providing ways of leveraging existing just-in-time compilation infrastructures. Though there has been considerable research on improving the efficiency of just-in-time compilers, the area of optimizing interpreters has gotten less attention as if the implementation of a dynamic translation system was the “ultima ratio” for efficiently interpreting programming languages. We present optimization techniques for improving the efficiency of interpreters without requiring just-in-time compilation thereby maintaining the ease-of-implementation characteristic that brought many people to implementing an interpreter in the first place.

Optimizing indirect branch prediction accuracy in virtual machine interpreters

Citation

Kevin Casey, M. Anton Ertl, and David Gregg. 2007. Optimizing indirect branch prediction accuracy in virtual machine interpreters. ACM Trans. Program. Lang. Syst. 29, 6 (October 2007), 37–es. https://doi.org/10.1145/1286821.1286828

Abstract

Interpreters designed for efficiency execute a huge number of indirect branches and can spend more than half of the execution time in indirect branch mispredictions. Branch target buffers (BTBs) are the most widely available form of indirect branch prediction; however, their prediction accuracy for existing interpreters is only 2%–50%. In this article we investigate two methods for improving the prediction accuracy of BTBs for interpreters: replicating virtual machine (VM) instructions and combining sequences of VM instructions into superinstructions. We investigate static (interpreter build-time) and dynamic (interpreter runtime) variants of these techniques and compare them and several combinations of these techniques. To show their generality, we have implemented these optimizations in VMs for both Java and Forth. These techniques can eliminate nearly all of the dispatch branch mispredictions, and have other benefits, resulting in speedups by a factor of up to 4.55 over efficient threaded-code interpreters, and speedups by a factor of up to 1.34 over techniques relying on dynamic superinstructions alone.

Context threading: a flexible and efficient dispatch technique for virtual machine interpreters

Citation

M. Berndl, B. Vitale, M. Zaleski and A. D. Brown, “Context threading: a flexible and efficient dispatch technique for virtual machine interpreters,” International Symposium on Code Generation and Optimization, San Jose, CA, USA, 2005, pp. 15-26, doi: 10.1109/CGO.2005.14. keywords: {Virtual machining;Hardware;Context-aware services;Java;Algorithms;Counting circuits;Computer languages;Pipelines;Hazards;Debugging},

Abstract

Direct-threaded interpreters use indirect branches to dispatch bytecodes, but deeply-pipelined architectures rely on branch prediction for performance. Due to the poor correlation between the virtual program’s control flow and the hardware program counter, which we call the context problem, direct threading’s indirect branches are poorly predicted by the hardware, limiting performance. Our dispatch technique, context threading, improves branch prediction and performance by aligning hardware and virtual machine state. Linear virtual instructions are dispatched with native calls and returns, aligning the hardware and virtual PC. Thus, sequential control flow is predicted by the hardware return stack. We convert virtual branching instructions to native branches, mobilizing the hardware’s branch prediction resources. We evaluate the impact of context threading on both branch prediction and performance using interpreters for Java and OCaml on the Pentium and PowerPC architectures. On the Pentium IV our technique reduces mean mispredicted branches by 95%. On the PowerPC, it reduces mean branch stall cycles by 75% for OCaml and 82% for Java. Due to reduced branch hazards, context threading reduces mean execution time by 25% for Java and by 19% and 37% for OCaml on the P4 and PPC970, respectively. We also combine context threading with a conservative inlining technique and find its performance comparable to that of selective inlining.

Self-optimizing AST interpreters

Citation

Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer. 2012. Self-optimizing AST interpreters. In Proceedings of the 8th symposium on Dynamic languages (DLS ‘12). Association for Computing Machinery, New York, NY, USA, 73–82. https://doi.org/10.1145/2384577.2384587

Abstract

An abstract syntax tree (AST) interpreter is a simple and natural way to implement a programming language. However, it is also considered the slowest approach because of the high overhead of virtual method dispatch. Language implementers therefore define bytecodes to speed up interpretation, at the cost of introducing inflexible and hard to maintain bytecode formats. We present a novel approach to implementing AST interpreters in which the AST is modified during interpretation to incorporate type feedback. This tree rewriting is a general and powerful mechanism to optimize many constructs common in dynamic programming languages. Our system is implemented in Java and uses the static typing and primitive data types of Java elegantly to avoid the cost of boxed representations of primitive values in dynamic programming languages.

Virtual machine showdown: Stack versus registers

Citation

Yunhe Shi, Kevin Casey, M. Anton Ertl, and David Gregg. 2008. Virtual machine showdown: Stack versus registers. ACM Trans. Archit. Code Optim. 4, 4, Article 2 (January 2008), 36 pages. https://doi.org/10.1145/1328195.1328197

Abstract

Virtual machines (VMs) enable the distribution of programs in an architecture-neutral format, which can easily be interpreted or compiled. A long-running question in the design of VMs is whether a stack architecture or register architecture can be implemented more efficiently with an interpreter. We extend existing work on comparing virtual stack and virtual register architectures in three ways. First, our translation from stack to register code and optimization are much more sophisticated. The result is that we eliminate an average of more than 46% of executed VM instructions, with the bytecode size of the register machine being only 26% larger than that of the corresponding stack one. Second, we present a fully functional virtual-register implementation of the Java virtual machine (JVM), which supports Intel, AMD64, PowerPC and Alpha processors. This register VM supports inline-threaded, direct-threaded, token-threaded, and switch dispatch. Third, we present experimental results on a range of additional optimizations such as register allocation and elimination of redundant heap loads. On the AMD64 architecture the register machine using switch dispatch achieves an average speedup of 1.48 over the corresponding stack machine. Even using the more efficient inline-threaded dispatch, the register VM achieves a speedup of 1.15 over the equivalent stack-based VM.

The case for virtual register machines

Citation

David Gregg, Andrew Beatty, Kevin Casey, Brain Davis, and Andy Nisbet. 2005. The case for virtual register machines. Sci. Comput. Program. 57, 3 (September 2005), 319–338. https://doi.org/10.1016/j.scico.2004.08.005

Abstract

Virtual machines (VMs) are a popular target for language implementers. A long-running question in the design of virtual machines has been whether stack or register architectures can be implemented more efficiently with an interpreter. Many designers favour stack architectures since the location of operands is implicit in the stack pointer. In contrast, the operands of register machine instructions must be specified explicitly. In this paper, we present a working system for translating stack-based Java virtual machine (JVM) code to a simple register code. We describe the translation process, the complicated parts of the JVM which make translation more difficult, and the optimisations needed to eliminate copy instructions. Experimental results show that a register format reduced the number of executed instructions by 34.88%, while increasing the number of bytecode loads by an average of 44.81%. Overall, this corresponds to an increase of 2.32 loads for each dispatch removed. We believe that the high cost of dispatches makes register machines attractive even at the cost of increased loads.

The Structure and Performance of Efficient Interpreters

Memo

https://news.ycombinator.com/item?id=13990592

Citation

Ertl, M. Anton, and David Gregg. “The structure and performance of efficient interpreters.” Journal of Instruction-Level Parallelism 5 (2003): 1-25.

Abstract

Interpreters designed for high general-purpose performance typically perform a large number of indirect branches (3.2%–13% of all executed instructions in our benchmarks). These branches consume more than half of the run-time in a number of configurations we simulated. We evaluate how accurate various existing and proposed branch prediction schemes are on a number of interpreters, how the mispredictions affect the performance of the interpreters and how two different interpreter implementation techniques perform with various branch predictors. We also suggest various ways in which hardware designers, C compiler writers, and interpreter writers can improve the performance of interpreters

The Structure and Performance of Interpreters

Citation

Theodore H. Romer, Dennis Lee, Geoffrey M. Voelker, Alec Wolman, Wayne A. Wong, Jean-Loup Baer, Brian N. Bershad, and Henry M. Levy. 1996. The structure and performance of interpreters. In Proceedings of the seventh international conference on Architectural support for programming languages and operating systems (ASPLOS VII). Association for Computing Machinery, New York, NY, USA, 150–159. https://doi.org/10.1145/237090.237175

Abstract

Interpreted languages have become increasingly popular due to demands for rapid program development, ease of use, portability, and safety. Beyond the general impression that they are “slow,” however, little has been documented about the performance of interpreters as a class of applications.This paper examines interpreter performance by measuring and analyzing interpreters from both software and hardware perspectives. As examples, we measure the MIPSI, Java, Perl, and Tcl interpreters running an array of micro and macro benchmarks on a DEC Alpha platform. Our measurements of these interpreters relate performance to the complexity of the interpreter’s virtual machine and demonstrate that native runtime libraries can play a key role in providing good performance. From an architectural perspective, we show that interpreter performance is primarily a function of the interpreter itself and is relatively independent of the application being interpreted. We also demonstrate that high-level interpreters’ demands on processor resources are comparable to those of other complex compiled programs, such as gcc. We conclude that interpreters, as a class of applications, do not currently motivate special hardware support for increased performance.