• Re: Is This a Dumb Idea? parallelizing byte codes

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.compilers on Sun Oct 23 12:33:40 2022
    From Newsgroup: comp.compilers

    Jon Forrest <nobozo@gmail.com> writes:
    >Modern CPUs employ all kinds of clever techniques to improve
    >instruction level parallelism (ILP). I was wondering if it
    >makes sense to try to employ similar techniques in the
    >virtual machines used to execute byte code produced by language
    >compilers.

    I'll first assume that you mean virtual-machine interpreters.

    First of all, if you use a modern CPU with OoO execution, it does much
    of that by itself.

    If you use an in-order execution CPU, you can try to software-pipeline
    the execution of the virtual-machine instructions, in particular
    perform virtual-machine instruction fetching early [hoogerbrugge+99].

    It is helpful to design the virtual machine for that: Have a mostly
    fixed encoding, so that, e.g., the dispatch can be prepared in
    advance, rather than, e.g., having to wait until you know about the
    present instruction before you can start the fetch of the next
    instruction.

    >By that I mean what if virtual machines were to examine byte code
    >streams to detect when it would be safe to execute multiple
    >byte codes concurrently? Then, based on its findings, the virtual
    >machine would execute as many byte codes concurrently as is safe.

    There has been quite a bit of work on combining sequences of
    virtual-machine instructions into superinstructions
    [hughes82,proebsting95,naessen+01,ertl&gregg03euroforth,eller05], but
    the typical approach was to combine a sequence of dependent
    instructions, which has several advantages:

    * One VM instruction often communicates the data to subsequent ones
    through memory (i.e., D-cache), which often incurs a latency of
    several cycles. The superinstruction can communicate the data
    through a register.

    * If the data is addressed explicitly by the VM instruction (e.g., in
    a register-oriented VM), that requires code for dealing with this
    addressing, which can be eliminated for the data passed implicitly
    in a superinstruction. E.g., "mul r3=r1*r2; add r5=r3+r4" requires
    dealing with six VM "register" accesses, while "madd r5=r1*r2+r4"
    requires only four.
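    To make the operand-count argument concrete, here is a toy
    switch-dispatched register VM (my own sketch; opcodes and the
    result-in-reg[5] convention are invented) where the dependent pair
    mul/add is folded into a madd superinstruction that passes the
    product through a C local instead of a VM register:

```c
/* Toy register VM: the dependent pair "mul r3=r1*r2; add r5=r3+r4"
   decodes six VM register operands; the "madd r5=r1*r2+r4"
   superinstruction passes the product through a C local (ideally a
   hardware register) and decodes only four. */

enum { MUL, ADD, MADD, HALT };  /* invented toy opcodes */

long run(const long *ip, long reg[])
{
    for (;;) {
        switch (*ip++) {
        case MUL: {             /* MUL d, a, b: reg[d] = reg[a]*reg[b] */
            long d = *ip++, a = *ip++, b = *ip++;
            reg[d] = reg[a] * reg[b];
            break;
        }
        case ADD: {             /* ADD d, a, b: reg[d] = reg[a]+reg[b] */
            long d = *ip++, a = *ip++, b = *ip++;
            reg[d] = reg[a] + reg[b];
            break;
        }
        case MADD: {            /* MADD d, a, b, c: reg[d] = reg[a]*reg[b]+reg[c] */
            long d = *ip++, a = *ip++, b = *ip++, c = *ip++;
            long prod = reg[a] * reg[b];  /* stays in a C local */
            reg[d] = prod + reg[c];
            break;
        }
        default:                /* HALT: result convention is reg[5] */
            return reg[5];
        }
    }
}
```

    With reg[1]=2, reg[2]=3, reg[4]=4, both { MUL,3,1,2, ADD,5,3,4, HALT }
    and { MADD,5,1,2,4, HALT } leave 10 in reg[5], but the fused version
    dispatches once and decodes four operands instead of six.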

    If you are actually thinking about JIT compilation of the virtual
    machine, you can use the usual techniques for extracting
    instruction-level parallelism: instruction scheduling (within basic
    blocks, or within traces or superblocks), and software pipelining.
    Some of these techniques incur a significant overhead, however, so you
    may not want to employ them in a JIT compiler.

    >I'm also worried that internal virtual machine locking requirements
    >might make this idea infeasible. For example, in a virtual machine with
    >a global interpreter lock, would it be possible for there to be any
    >concurrent execution?

    The memory-ordering requirements may restrict the reorderings of
    memory accesses, but otherwise I don't see a problem.

    >This idea, if it works, would be a great way to take advantage of
    >multiple cores without having to rewrite any user code. The big
    >question is whether it would work.

    The stuff I have outlined does not introduce additional execution
    threads, and I don't think you can find thread-level parallelism at
    the virtual-machine level unless you have your virtual machine (and
    the programming language that it models) designed for exactly that.

    - anton


    @string{spe="Software---Practice and Experience"}
    @Article{hoogerbrugge+99,
    author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum
    and Rik van de Wiel",
    title = "A code compression system based on pipelined
    interpreters",
    journal = spe,
    volume = "29",
    number = "11",
    pages = "1005--1023",
    month = sep,
    year = "1999",
    OPTannote= ""
    }

    @InProceedings{hughes82,
    author = "R. J. M. Hughes",
    title = "Super-Combinators",
    booktitle = "Conference Record of the 1980 LISP Conference,
    Stanford, CA",
    pages = "1--11",
    publisher = "ACM",
    address = "New York",
    year = "1982",
    OPTannote = {}
    }

    @InProceedings{proebsting95,
    author = "Todd A. Proebsting",
    title = "Optimizing an {ANSI~C} Interpreter with Superoperators",
    crossref = "popl95",
    pages = "322--332",
    annote = "Interpreter performance is optimized by combining
    operators during code generation, when they are
    still organized as trees. So a different, optimized
    interpreter is used for each program. Speedups of 1.8--3.1 are
    achieved, but this is probably strongly dependent on
    the original set of operators. The paper uses lcc's
    intermediate code operators \cite{fraser&hanson91a}."
    }

    @Proceedings{popl95,
    booktitle = "Principles of Programming Languages (POPL '95)",
    title = "Principles of Programming Languages (POPL '95)",
    year = "1995",
    key = "POPL '95"
    }

    @InProceedings{naessen+01,
    author = {Henrik N\"ass\'en and Mats Carlsson and Konstantinos
    Sagonas},
    title = {Instruction Merging and Specialization in the
    {SICStus Prolog} Virtual Machine},
    booktitle = {Principles and Practice of Declarative Programming
    (PPDP01)},
    OPTpages = {},
    year = {2001},
    url = {http://www.csd.uu.se/%7Ekostis/Papers/sicstus.ps.gz},
    annote = {Gives an overview of various WAM optimization
    techniques and then evaluates combining (merging)
    pairs of instructions into (about 60)
    superinstructions, specializing WAM instructions for
    specific immediate arguments (in particular,
    specific registers, for about 200 new instructions),
    and a combination of both (for about 100 new
    instructions). Instruction merging produces small
    speedups (about 8\% on average), specialization
    produces a small slowdown on average, and both
    combined are about as fast as instruction merging
    alone. VM code size is reduced by around 10\% with
    these techniques, and the VM emulator size grows by
    up to 15KB.}
    }

    @InProceedings{ertl&gregg03euroforth,
    author = {M. Anton Ertl and David Gregg},
    title = {Implementation Issues for Superinstructions in
    {Gforth}},
    booktitle = {EuroForth 2003 Conference Proceedings},
    OPTpages = {},
    year = {2003},
    URL = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03euroforth.ps.gz},
    abstract = {Combining Forth primitives into superinstructions
    provides nice speedups. Several approaches to
    superinstructions were explored in the Gforth
    project. This paper discusses the effects of these
    approaches on performance, compilation time,
    implementation effort, and on programming tools such
    as the decompiler and backtracing.}
    }

    @MastersThesis{eller05,
    author = {Helmut Eller},
    title = {Optimizing Interpreters with Superinstructions},
    school = {TU Wien},
    year = {2005},
    type = {Diplomarbeit},
    url = {http://www.complang.tuwien.ac.at/Diplomarbeiten/eller05.ps.gz},
    abstract = {Superinstructions can be used to make virtual
    machine (VM) interpreters faster. A superinstruction
    is a combination of simpler VM instructions which
    can be executed faster than the corresponding
    sequence of simpler VM instructions, because the
    interpretative overhead, like instruction dispatch
    and argument fetching, is reduced. This work
    discusses the following three topics related to
    superinstructions. First, I present some heuristics
    to choose superinstructions. I evaluated the
    heuristics for Forth and Java programs. If the
    number of allowed superinstructions was very large,
    $> 1000$, then the heuristic which chooses all
    possible subsequences up to length 4 achieved the
    best results. If the number of allowed
    superinstructions was more limited, then a heuristic
    which favors short sequences and sequences which
    occur in many different programs and many different
    basic blocks performed better than the
    others. Second, I compare a simple greedy algorithm
    and an optimal algorithm to cover a program with
    superinstructions. I found that the greedy algorithm
    achieves almost optimal results. Finally, I compare
    superinstructions with non-sequential patterns. In
    my experiments, superinstructions performed slightly
    better than non-sequential patterns.}
    }


    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.compilers on Sun Oct 23 13:16:54 2022

    Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
    >I've heard/read several times that byte-code micro-optimizations are
    >not worth the trouble.

    Apart from the paper below, which is discussed below, what else?

    >Here is a paper from 2015 on a related subject
    >("Branch prediction and the performance of interpreters -- Don't trust
    >folklore"):
    >
    >https://ieeexplore.ieee.org/document/7054191
    >
    >(you may find the corresponding research report if you can't access the
    >full text from that site). It shows how far processors have gone in what
    >was once left to the program designer.

    On that I can only say: Not all research papers are trustworthy.
    Catchy titles may be a warning signal.

    I did my own measurements on a Haswell (the same CPU they used in the
    paper) and published them in <2015Sep7.142507@mips.complang.tuwien.ac.at>
    (<http://al.howardknight.net/?ID=158702747000> for those of you who
    don't know what to do with Message-IDs).

    If you don't want to read that posting, the executive summary is that
    the sentence in the abstract "we show that the accuracy of indirect
    branch prediction is no longer critical for interpreters." is wrong.

    Looking at the run-time in seconds for the large benchmarks:

    |          shared  non-shared
    |           --no-       --no-   --no-
    |         dynamic     dynamic   super  default
    |           3.332       2.440   2.276    1.468  benchgc
    |           1.652       1.524   1.064    0.544  brainless
    |           4.016       3.416   2.288    1.520  brew
    |           3.420       3.236   2.232    1.232  cd16sim
    |           2.956       2.684   1.484    0.864  fcp
    |          13.128       9.848   9.280    7.840  lexex

    We see a speedup factor of 1.08-1.37 (but, measuring mispredictions,
    no consistent reduction in mispredictions) from (non-shared)
    dispatching the code for the next VM instruction at the end of the
    code of every VM instruction, rather than jumping to a shared piece of
    dispatch code (from "shared --no-dynamic" to "non-shared
    --no-dynamic").

    We see a speedup factor of 1.06-1.81 and a reduction in mispredictions
    by a factor 1.35-8.76 from replicating the code for each occurrence of
    a VM instruction (from "non-shared --no-dynamic" to "--no-super").

    We see a speedup factor of 1.18-1.96 and a reduction in branch
    mispredictions by up to a factor of 3.2 by then eliminating the
    dispatches at the end of non-control-flow VM instructions (from
    "--no-super" to default).

    The overall speedup factor for all these steps is 1.67-3.42.

    The somewhat longer summary from the posting above:

    |Haswell's indirect branch prediction is really much
    |better than before, but for larger programs running on an interpreter
    |like Gforth, replication still provides substantial branch prediction
    |improvements that result in significant speedups. And while there is
    |no longer a branch prediction advantage to keeping separate indirect
    |branches for dispatch, there is still a significant speed advantage;
    |and dynamic superinstructions also provide a good speedup, resulting
    |in overall speedups by factors 1.67-3.42 for the application
    |benchmarks.
    |
    |Why are the results here different from those in the paper?
    |1) Different Interpreter 2) different benchmarks. If you write an
    |interpreter, and look for performance, should you go for interpreter
    |optimizations like threaded-code, replication, and dynamic
    |superinstructions like I suggest, or just use a switch-based
    |interpreter like the paper suggests? Threaded code is a good idea
    |with little cost in any case. If that provides a significant speedup
    |and your VM instruction implementations are short (you run into cache
    |trouble with long VM instructions [vitale&abdelrahman04]), then
    |replication with superinstructions will probably give a good speedup.

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
  • From Alain Ketterlin@alain@universite-de-strasbourg.fr to comp.compilers on Sun Oct 23 21:29:34 2022

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    >Alain Ketterlin <alain@universite-de-strasbourg.fr> writes:
    >>I've heard/read several times that byte-code micro-optimizations are
    >>not worth the trouble.
    >
    >Apart from the paper below, which is discussed below, what else?

    This is not directly related to the paper I mention later. I was talking
    about optimizing bytecode vs. compiler optimizations. I know of no
    interpreter doing elaborate static byte-code optimization.

    >>https://ieeexplore.ieee.org/document/7054191
    >
    >On that I can only say: Not all research papers are trustworthy.
    >Catchy titles may be a warning signal.
    >
    >I did my own measurements on a Haswell (the same CPU they used in the
    >paper) and published them in <2015Sep7.142507@mips.complang.tuwien.ac.at>
    >(<http://al.howardknight.net/?ID=158702747000> for those of you who
    >don't know what to do with Message-IDs).
    >[...]
    >
    >|Why are the results here different from those in the paper?
    >|1) Different Interpreter 2) different benchmarks.

    I'm glad it works for you. For the record: they consider interpreters
    for Python, Javascript and CLI, on a fairly broad set of benchmarks. And
    they also evaluate (through simulation) branch-predictors that may (or
    may not) be included in more recent architectures.

    -- Alain.
  • From gah4@gah4@u.washington.edu to comp.compilers on Wed Oct 26 18:18:33 2022

    On Saturday, October 22, 2022 at 11:51:31 AM UTC-7, nob...@gmail.com wrote:
    >Modern CPUs employ all kinds of clever techniques to improve
    >instruction level parallelism (ILP). I was wondering if it
    >makes sense to try to employ similar techniques in the
    >virtual machines used to execute byte code produced by language
    >compilers.

    Seems to me it is not parallelizing byte codes that is
    a dumb idea, but byte codes themselves.

    This was known when Alpha replaced VAX. Work on making faster VAX
    systems was stuck with the byte-oriented instruction stream, which
    was impossible to pipeline.

    Not so many years after IBM S/360 became the first "computer
    architecture", DEC wanted VAX to be an architecture with a wide range
    of implementations, and to last for many years. But VAX didn't last so
    long, and descendants of S/360, now z/Architecture, are still around.

    S/360 instructions are in units of 16 bits, always aligned, and you
    can tell from the first byte how long the instruction is. That allows
    the processor to quickly parse instructions and pass them to
    reservation stations (in the case of the 360/91).

    But VAX instructions have to be parsed byte by byte, until the
    hardware finds how long the instruction is.

    So it seems that the real answer is to devise a word oriented, or in
    other words RISC, virtual machine. (Actual RISC hardware might not be
    a good choice.)


    I have in the past, maybe here, wondered about the ability to improve
    a RISC processor. As one example, many include a branch delay slot,
    which depends somewhat on the processor pipeline.

    It seems that one fix would be for compilers to write out not the
    actual instructions, but a representation from slightly earlier in
    the code generator.

    Then a processor specific program would, fairly quickly, convert that
    to actual instructions. That could be done at program installation
    time, or at program load time, as needed.

    If a new processor came along that, for example, needed two branch
    delay slots, it would generate them.

  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Thu Oct 27 14:51:04 2022

    On 2022-10-27, gah4 <gah4@u.washington.edu> wrote:
    >On Saturday, October 22, 2022 at 11:51:31 AM UTC-7, nob...@gmail.com wrote:
    >>Modern CPUs employ all kinds of clever techniques to improve
    >>instruction level parallelism (ILP). I was wondering if it
    >>makes sense to try to employ similar techniques in the
    >>virtual machines used to execute byte code produced by language
    >>compilers.
    >
    >Seems to me it is not parallelizing byte codes that is
    >a dumb idea, but byte codes themselves.

    I think you're taking "byte code" too literally, to refer to
    a virtual machine where the instructions are byte-wide
    opcodes that implicitly refer to operands on a stack.

    I think that nowadays the term refers to any software-based
    synthetic instruction set oriented toward supporting a higher
    level language.

    >This was known when Alpha replaced VAX. Work on making faster VAX
    >systems was stuck with the byte-oriented instruction stream which was
    >impossible to pipeline.

    Not impossible for Intel and AMD, obviously.

    Variable-length instruction encodings do not inherently hamper
    pipelining.

    What might hamper pipelining would be variable-length instruction
    encodings /where the length is not known until the instruction is
    executed, due to depending on its output somehow/.

    If you can decode an instruction and then immediately know where
    the next one starts, you can pipeline.

    The internal representation of a pipeline doesn't use the original
    variable-length representation any more; the instruction bytes do not
    literally move through the pipeline.

    >So it seems that the real answer is to devise a word oriented, or in
    >other words RISC, virtual machine. (Actual RISC hardware might not be
    >a good choice.)

    I designed one in TXR Lisp; but the "byte code" terminology appears
    numerous times in the source code, and leaks into the name of one
    API function called vm-desc-bytecode, which accesses the code vector
    of a virtual machine description.

    The opcodes are actually four byte words, stored in the local endian.
    (When a compiled file is loaded that was compiled on a different
    endian system, the load function will swap the byte order on all
    the four byte words in the "bytecode" vector).
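    That load-time fix-up might look like this (my own illustration, not
    TXR's actual loader; the function names are invented):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the load-time byte-order fix-up: when a compiled file
   comes from an opposite-endian machine, swap the bytes of every
   32-bit code word in the "bytecode" vector. */

static uint32_t swap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000FF00u)
         | ((w << 8) & 0x00FF0000u) | (w << 24);
}

void fix_byte_order(uint32_t *code, size_t nwords, int file_big_endian)
{
    const uint32_t probe = 1;
    int host_big_endian = (*(const unsigned char *)&probe == 0);

    if (host_big_endian != file_big_endian)     /* endianness mismatch */
        for (size_t i = 0; i < nwords; i++)
            code[i] = swap32(code[i]);
}
```

    After the swap, the interpreter can fetch whole aligned words in the
    host byte order, with no per-instruction decoding cost.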

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [The VAX had auto-increment indexed address modes like

    addl3 (ra)+[rb],(rc)+[rd],(re)+[rf]

    which means to take the word that ra+4*rb points to, add 4 to ra, take
    the word that rc+4*rd points to, add 4 to rc, put their sum in the word
    that re+4*rf points to, and add 4 to re. If any of those registers are
    the same register, the fetches and increments have to happen as if it was
    all done sequentially. There were instructions that took six operands.
    While this much address complication was rare, it had to work. -John]
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.compilers on Sat Oct 29 09:06:38 2022

    gah4 <gah4@u.washington.edu> writes:
    >Seems to me it is not parallelizing byte codes that is
    >a dumb idea, but byte codes themselves.
    >
    >This was known when Alpha replaced VAX. Work on making faster VAX
    >systems was stuck with the byte-oriented instruction stream which was
    >impossible to pipeline.

    Pipelined VAX implementations exist, e.g., the VAX 8800 and NVAX.
    AMD64 is also a byte-granularity instruction set, although byte
    granularity makes wide decoders more expensive (by roughly a factor
    of 4 compared to an instruction set with 32-bit granularity for the
    same number of decoded bytes). RISC-V's compressed (C) instructions
    have 16-bit granularity, so at least the RISC-V designers think that
    the benefits of instruction compression outweigh the costs in
    decoding.

    Anyway, this has little to do with the question of the original
    poster. Virtual machines (often called byte codes, even if they use a
    different instruction granularity) use instruction sets that are
    quite different from that of VAX or AMD64; in particular, they have
    no "instruction formats" with "addressing modes", where every base
    instruction can be combined with a set of addressing modes in an
    orthogonal way. That's because they are not decoded like hardware
    instructions.

    There are, however, costs to byte granularity in VM interpreters:

    * Byte granularity makes it harder to use direct-threaded code and
    optimizations that start with that, in particular dynamic
    superinstructions: You cannot rewrite the VM code into
    direct-threaded code in-place, but have to translate it to another
    place, which also means that you cannot reuse the branch targets
    as-is.

    * Also, if the VM instruction has an inline argument wider than one
    byte, the access to that argument can be significantly more
    expensive on architectures with alignment restrictions (e.g.,
    ironically, SPARC). But alignment restrictions have died out in
    general-purpose computers, and VM interpreters are not that popular
    on embedded systems.
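    On strict-alignment hardware, a common portable workaround is to
    fetch the inline argument with memcpy (my own sketch; optimizing
    compilers reduce the copy to a single load where unaligned access is
    legal):

```c
#include <stdint.h>
#include <string.h>

/* Portable fetch of a 32-bit inline argument at an arbitrary byte
   offset in the VM code stream.  memcpy avoids the unaligned-access
   trap on strict-alignment CPUs (e.g., classic SPARC), and compiles
   down to a plain load where the hardware allows unaligned access. */
static int32_t fetch_i32(const unsigned char *ip)
{
    int32_t v;
    memcpy(&v, ip, sizeof v);   /* byte-wise copy, no alignment assumed */
    return v;
}
```

    The cost on strict-alignment machines is that the "single load"
    becomes several byte loads and shifts, which is exactly the expense
    described above.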

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/