• Interpreters and indirect-branch prediction

    From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Nov 13 08:20:27 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp]. With a non-shared indirect branch per VM instruction implementation (as in typical threaded-code
    implementations), we measured 50%-61% mispredictions for BTB with
    2-bit counters in simulation [ertl&gregg03jilp] and similar results on
    actual CPUs.
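    The two dispatch styles can be sketched in C (a toy VM, not
    Gforth's actual code; the computed-goto variant relies on
    GCC/Clang's labels-as-values extension):

    ```c
    #include <stdio.h>

    enum { OP_LIT, OP_ADD, OP_PRINT, OP_HALT };

    /* Style 1: the usual for...{switch...{...}} idiom; the whole
       interpreter dispatches through ONE shared indirect branch. */
    static void run_switch(const int *ip)
    {
        int stack[16], *sp = stack;
        for (;;) {
            switch (*ip++) {                    /* single shared indirect branch */
            case OP_LIT:   *sp++ = *ip++;            break;
            case OP_ADD:   sp--; sp[-1] += sp[0];    break;
            case OP_PRINT: printf("%d\n", sp[-1]);   break;
            case OP_HALT:  return;
            }
        }
    }

    /* Style 2: threaded code; each VM instruction implementation ends
       in its OWN indirect branch, which even BTB-based predictors can
       track per-instruction. */
    static void run_threaded(const int *ip)
    {
        static void *dispatch[] = { &&lit, &&add, &&print, &&halt };
        int stack[16], *sp = stack;
        goto *dispatch[*ip++];
    lit:   *sp++ = *ip++;          goto *dispatch[*ip++];
    add:   sp--; sp[-1] += sp[0];  goto *dispatch[*ip++];
    print: printf("%d\n", sp[-1]); goto *dispatch[*ip++];
    halt:  return;
    }

    int main(void)
    {
        const int prog[] = { OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_PRINT, OP_HALT };
        run_switch(prog);    /* prints 5 */
        run_threaded(prog);  /* prints 5 */
        return 0;
    }
    ```

    Both variants execute the same VM program; they differ only in how
    many distinct indirect branches the hardware predictor sees.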

    We simulated 2-level history-based indirect branch predictors [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.
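    A toy history-based predictor (a target table indexed by a hash of
    the branch address and a global history of recent targets; purely
    illustrative, not the predictors simulated in [ertl&gregg03jilp])
    shows why history helps: a repeating dispatch sequence becomes
    perfectly predictable after a few cold-start misses.

    ```c
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_BITS 12
    static void *table[1 << TABLE_BITS];   /* predicted targets */
    static uintptr_t history;              /* shift register of recent targets */

    static size_t idx(void *branch_pc) {
        return (((uintptr_t)branch_pc >> 2) ^ history)
               & ((1u << TABLE_BITS) - 1);
    }

    static void *predict(void *branch_pc) { return table[idx(branch_pc)]; }

    static void update(void *branch_pc, void *actual) {
        table[idx(branch_pc)] = actual;
        /* fold the actual target into the global history */
        history = (history << 4) ^ ((uintptr_t)actual >> 2);
    }

    int main(void) {
        /* one dispatch branch whose targets cycle A, B, C, A, B, C, ... */
        void *targets[] = { (void *)0x1000, (void *)0x2000, (void *)0x3000 };
        void *branch_pc = (void *)0x400;
        int mispred = 0;
        for (int i = 0; i < 3000; i++) {
            void *actual = targets[i % 3];
            if (predict(branch_pc) != actual)
                mispred++;
            update(branch_pc, actual);
        }
        printf("mispredictions: %d of 3000\n", mispred);  /* 4 cold misses */
        return 0;
    }
    ```

    A BTB with 2-bit counters keys only on the branch address, so the
    same branch with cycling targets mispredicts forever; adding the
    target history to the index makes each position in the cycle look
    like a different branch.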

    Hardware designers improved indirect-branch predictors to deal with
    more history than BTBs with 2-bit counters already in Dothan (launched
    2004) and its descendants (in particular, the Core 2 line), judging
    by the results of <http://www.complang.tuwien.ac.at/forth/threading/>
    (look in the column "switch").

    In 2015 Rohou et al. measured the prediction accuracy in Python,
    JavaScript, and a CLI (.NET virtual machine, also known under other
    names) on Nehalem (2009), Sandy Bridge (2011), Haswell (2013), and a
    simulated ITTAGE branch predictor; unfortunately, they gave results
    in mispredictions per (kilo-)instruction rather than mispredictions
    per prediction, and they used different interpreters, so you cannot compare
    that work with our earlier work. In any case, they found good
    improvements from Nehalem to Haswell.

    Inspired by this work, I did some measurements of Gforth, disabling
    various optimizations (which reduce the branch mispredictions) and
    posted them here. You can find the sequence of postings here: <http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
    However, I did not measure the prediction accuracy, because the idea
    was to check whether the claims made in the paper (e.g., "Modern
    predictors no longer need [superinstructions] to capture patterns in applications." or "[prediction accuracy] can not be considered as an
    obstacle for performance anymore.") are true.

    In the Haswell results, using an interpreter with non-shared indirect
    branches (non-shared --no-dynamic) does indeed often have a similar
    prediction accuracy to the interpreter with one shared indirect branch
    (shared --no-dynamic), but eliminating the sharing still provides a
    small speedup; and dynamic replication (--no-super), which results in
    having one indirect branch per instance of the VM instruction in the interpreted program, i.e., no sharing between different instances of
    the same VM instruction, provides a good reduction in branch
    mispredictions and in execution time on several benchmarks. And given
    that "non-shared --no-dynamic" and "--no-super" actually execute the
    same number and sequence of instructions (only with replication in the "--no-super" case), any improvement in run-time is only due to the
    improved branch prediction; here I only show these two columns, to
    avoid confusion; look at <http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>
    for the full data set:

    Run time in seconds:

    non-shared
    --no-     --no-
    dynamic   super
    2.440     2.276   benchgc
    1.524     1.064   brainless
    3.416     2.288   brew
    3.236     2.232   cd16sim
    2.684     1.484   fcp
    9.848     9.280   lexex

    For mispredictions the picture was as follows:

    non-shared
          --no-         --no-
          dynamic       super
     17,005,970    12,584,698  benchgc
    178,234,917    51,375,412  brainless
    279,851,195    47,928,843  brew
    297,000,355    70,879,605  cd16sim
    293,473,631    33,496,775  fcp
     12,267,065     8,269,104  lexex

    So the improvement in branch prediction provides speedup factors
    between 1.06 and 1.81 on these benchmarks. Plus, the interpreters
    with a shared indirect branch (what the usual way of writing a
    switch-based interpreter gives you) are even slower than
    "non-shared --no-dynamic". So to paraphrase the title of Rohou et
    al.'s paper, don't trust Rohou et al.
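    The speedup factors can be checked directly from the run-time table
    above (a throwaway calculation, nothing more):

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* run times in seconds from the table above:
           "non-shared --no-dynamic" vs "--no-super" */
        const char *name[] = { "benchgc", "brainless", "brew",
                               "cd16sim", "fcp", "lexex" };
        double no_dynamic[] = { 2.440, 1.524, 3.416, 3.236, 2.684, 9.848 };
        double no_super[]   = { 2.276, 1.064, 2.288, 2.232, 1.484, 9.280 };
        for (int i = 0; i < 6; i++)
            printf("%-9s speedup %.2f\n", name[i],
                   no_dynamic[i] / no_super[i]);
        return 0;
    }
    ```

    This yields 1.06 for lexex at the low end and 1.81 for fcp at the
    high end, matching the range quoted above.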

    However, I expect that indirect-branch predictors have improved
    since Haswell, and maybe we now see much lower numbers of mispredictions even for "non-shared --no-dynamic" and for "shared --no-dynamic", but I would have to measure that.

    Anyway, back to Mitch Alsup's original claim: No, for a bytecode
    interpreter the prediction accuracy of an indirect-branch predictor is
    not necessarily going to be especially bad. It depends on how good
    the indirect-branch predictor is, and on the benchmark.

    @Article{ertl&gregg03jilp,
    author = {M. Anton Ertl and David Gregg},
    title = {The Structure and Performance of \emph{Efficient}
    Interpreters},
    journal = {The Journal of Instruction-Level Parallelism},
    year = {2003},
    volume = {5},
    month = nov,
    url = {https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
    url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
    note = {http://www.jilp.org/vol5/},
    abstract = {Interpreters designed for high general-purpose
    performance typically perform a large number of
    indirect branches (3.2\%--13\% of all executed
    instructions in our benchmarks). These branches
    consume more than half of the run-time in a number
    of configurations we simulated. We evaluate how
    accurate various existing and proposed branch
    prediction schemes are on a number of interpreters,
    how the mispredictions affect the performance of the
    interpreters and how two different interpreter
    implementation techniques perform with various
    branch predictors. We also suggest various ways in
    which hardware designers, C compiler writers, and
    interpreter writers can improve the performance of
    interpreters.}
    }

    @InProceedings{rohou+15,
    author = {Erven Rohou and Bharath Narasimha Swamy and Andr\'e Seznec},
    title = {Branch Prediction and the Performance of
    Interpreters --- Don't Trust Folklore},
    booktitle = {Code Generation and Optimization (CGO)},
    OPTpages = {},
    year = {2015},
    url = {https://hal.inria.fr/hal-01100647/document},
    annote = {Evaluates the indirect branch predictors of Nehalem,
    Sandy Bridge, and Haswell and the ITTAGE predictor
    on the interpreters of Python, Spidermonkey
    (JavaScript), and a (.NET) CLI interpreter running a
    number of benchmarks. Haswell and ITTAGE are very
    good at branch prediction for these benchmarks, and
    they suggest that switch-based interpreters are
    good enough because of that. However, my own results
    \url{news:<2015Sep7.142507@mips.complang.tuwien.ac.at>}
    show that shared dispatch branches still can produce
    a significant slowdown.}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Nov 13 19:36:07 2024
    On Wed, 13 Nov 2024 8:20:27 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    For something like a bytecode interpreter, the prediction accuracy
    of the jump predictor is going to be epically bad.

    And with "jump predictor", you probably mean the indirect-branch
    predictor.

    On hardware of up to around 2005 where the indirect branches were at
    best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
    with one shared indirect branch (as generated by the usual for...{switch...{...}} idiom in C) results in misprediction rates of
    80% [ertl&gregg03jilp].

    That certainly qualifies as bad.

    With a non-shared indirect branch per VM instruction implementation (as in typical threaded-code
    implementations), we measured 50%-61% mispredictions for BTB with
    2-bit counters in simulation [ertl&gregg03jilp] and similar results on
    actual CPUs.

    50% misprediction rate qualifies as bad.

    We simulated 2-level history-based indirect branch predictors [ertl&gregg03jilp], and they typically have less than 10%
    mispredictions.

    10% misprediction rate qualifies as "not so bad".

  • From Anton Ertl@21:1/5 to Thomas Koenig on Thu Nov 14 08:35:41 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [Why indirect-branch predictors work in "bytecode" interpreters]
    To me, that looks like the bytecode is insufficiently expressive, if
    an often-repeating sequence occurs :-)

    As already mentioned, what a hardware branch predictor works on is the
    dynamic sequence of branches.

    This might come from a straight-line sequence of VM instructions that
    occurs only once in one program, and nowhere else; e.g., in a loop
    (as mentioned earlier), or in a subroutine that's called from several
    places. If that one program is not among the programs that the VM
    designer used for determining frequent code sequences that should be
    turned into static VM superinstructions, on what basis should that VM superinstruction be added? Even if that program is among those
    programs, should this VM superinstruction be added? It's only useful
    in one program.

    But the hardware branch prediction is also based on dynamic sequences
    that are not statically sequential. E.g., there might be a VM branch
    (possibly conditional or indirect), a VM call or a VM return in that
    sequence. The hardware branch predictor will use the repetition of
    the sequence for prediction, but it's usually impractical to form
    static VM superinstructions from such sequences.

    And in any case, whatever VM superinstructions you use, at some point
    every VM superinstruction ends with an indirect branch, and then you
    need a good hardware indirect-branch predictor, or the performance
    will suffer.

    If you have a good indirect-branch predictor, there are no such
    boundaries; it predicts the indirect branch of every VM instruction implementation based on the preceding branches and their targets.

    In the presence of a good predictor, it would still be an interesting question whether a more compressed version would actually perform better.

    There is a benefit to using VM superinstructions: the number of
    instruction dispatches is reduced, and consequently less code has to
    be executed; there are also fewer branches performed, so the
    instruction fetcher can be more effective.

    You can see the effect in the difference between "--no-super" and
    "default" in <http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
    Here's that data extracted for your convenience; result on Haswell:

    |Run time in seconds:
    |
    |--no-
    |super    default
    |2.276    1.468   benchgc
    |1.064    0.544   brainless
    |2.288    1.520   brew
    |2.232    1.232   cd16sim
    |1.484    0.864   fcp
    |9.280    7.840   lexex
    |
    |For mispredictions the picture was as follows:
    |
    |     --no-
    |     super      default
    |12,584,698    9,426,835  benchgc
    |51,375,412   18,391,920  brainless
    |47,928,843   21,390,209  brew
    |70,879,605   22,105,620  cd16sim
    |33,496,775   17,280,944  fcp
    | 8,269,104    6,885,712  lexex

    "default" here is dynamic superinstructions, i.e., every straight-line
    sequence of code (even across conditional branches) is turned into a
    VM superinstruction; i.e., the only indirect branches are for taken
    VM-level branches. As you can see, the speedup from dynamic
    superinstructions is quite nice, and only a little of it is due to
    better branch predictions. E.., in the cd16sim case (with the largest reduction in mispredictions, the mispredictions account for about

    ((71-22)M mispredictions) * (20 cycles/misprediction) / (4400M cycles/s) = 0.22s

    out of 1s, so about 0.78s are due to the reduction in dispatch code.
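    Spelling that estimate out (the 20 cycles/misprediction penalty and
    the 4.4 GHz clock are the assumptions from the text above):

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* cd16sim misprediction counts: "--no-super" vs "default" */
        double saved   = 70879605.0 - 22105620.0;  /* ~49M fewer mispredictions */
        double cycles  = saved * 20.0;             /* assumed penalty per miss */
        double seconds = cycles / 4.4e9;           /* assumed 4.4 GHz clock */
        printf("%.2f s\n", seconds);
        return 0;
    }
    ```

    That is about 0.22s of the roughly 1s run-time difference, leaving
    about 0.78s attributable to the eliminated dispatch code.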

    These dynamic superinstructions are formed by concatenating the code
    snippets coming from the individual VM instructions, so eliminating
    dispatch code is the only benefit for them.

    In recent work [ertl&paysan24] we have also reduced VM-instruction
    pointer updates in Gforth, and found speedups by a factor >2 on
    loop-dominated benchmarks on recent high-performance
    microarchitectures; our explanation is that the VM instruction pointer
    updates are the critical dependence path in executing these benchmarks
    unless we use our reduction techniques.
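    The dependence-chain point can be illustrated with a trivial sketch
    (this shows only the idea, not the actual optimizations of
    [ertl&paysan24]): advancing ip once per VM instruction makes every
    dispatch depend on the previous ip update, whereas a dynamic
    superinstruction of known length can address its operands at fixed
    offsets and update ip once at the end.

    ```c
    #include <stdio.h>

    int main(void)
    {
        int code[12] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
        long sum_a = 0, sum_b = 0;
        const int *ip;

        /* one ip update per VM instruction: a serial dependence chain */
        for (ip = code; ip < code + 12; ) {
            sum_a += *ip;
            ip += 1;
        }

        /* one ip update per group of three: operands at fixed offsets */
        for (ip = code; ip < code + 12; ip += 3)
            sum_b += ip[0] + ip[1] + ip[2];

        printf("%ld %ld\n", sum_a, sum_b);  /* same result: 78 78 */
        return 0;
    }
    ```

    Both loops do identical work; an out-of-order core can overlap the
    second loop's iterations better because fewer ip updates sit on the
    critical path.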

    In the early 2000s we looked at using static VM superinstructions. At
    the time the main effect was the reduction in mispredicted indirect
    branches, but they can also have the effect of reducing other
    instructions. E.g., in Gforth (engine gforth-fast) the VM instruction
    F@ has the following code snippet (Intel-style syntax):

    560BC7692D89: add r13,$08
    560BC7692D8D: movsd [r12],xmm15
    560BC7692D93: movsd xmm15,[r8]
    560BC7692D98: sub r12,$08
    560BC7692D9C: mov r8,$00[r13]

    and the VM instruction F+ has the following code snippet:

    560BC7692E3B: mov rax,r12
    560BC7692E3E: lea r12,$08[r12]
    560BC7692E43: addsd xmm15,$08[rax]

    Register assignments:

    r13: data-stack pointer
    r8: top of data stack
    r12: FP-stack pointer
    xmm15: top of FP stack

    (Don't ask me why gcc introduced the mov in 560BC7692E3B rather than
    doing the lea afterwards or using the result of the lea in the addsd.)

    Gforth has a static VM superinstruction for the sequence "F@ F+".
    Its code is:

    7FC65EA14E20: add r13,$08
    7FC65EA14E24: addsd xmm15,[r8]
    7FC65EA14E29: mov r8,$00[r13]

    In this case, the static VM superinstruction avoided a push and pop
    from the FP stack, and the associated code (FP-stack pointer update,
    and storing and loading the top of the FP stack from memory).
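    A hypothetical C rendering of those snippets (names made up; the
    data-stack handling is simplified away) shows why the fusion works:
    in "F@ F+" the push of the old FP top and the later pop cancel, so
    the FP-stack pointer update and the store/load disappear, matching
    the shrink from 5+3 machine instructions to 3.

    ```c
    #include <stdio.h>

    /* Mini-VM state mirroring the register assignments above:
       xmm15 = cached top of FP stack, r12 = FP-stack pointer,
       r8 = top of data stack (here: the address F@ loads from). */
    typedef struct {
        double fp_tos;
        double fp_stack[8];
        int    fp_sp;
        const double *addr;
    } VM;

    static void f_fetch(VM *v) {               /* F@ */
        v->fp_stack[v->fp_sp++] = v->fp_tos;   /* push old FP top */
        v->fp_tos = *v->addr;                  /* load new top */
    }

    static void f_plus(VM *v) {                /* F+ */
        v->fp_tos += v->fp_stack[--v->fp_sp];  /* pop and add */
    }

    static void f_fetch_plus(VM *v) {          /* fused "F@ F+" */
        v->fp_tos += *v->addr;                 /* push/pop cancelled */
    }

    int main(void) {
        double x = 2.5;
        VM a = { .fp_tos = 1.5, .fp_sp = 0, .addr = &x };
        VM b = a;
        f_fetch(&a); f_plus(&a);               /* separate instructions */
        f_fetch_plus(&b);                      /* superinstruction */
        printf("%.1f %.1f\n", a.fp_tos, b.fp_tos);  /* same result */
        return 0;
    }
    ```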

    We first implemented static superinstructions and tried up to 1600 of
    them [ertl&gregg03], mainly for improved branch prediction on the CPUs
    of the time (which used the BTB for indirect-branch prediction). When
    dynamic superinstructions (with replication) solved the branch
    prediction problem, the remaining benefit of static superinstructions
    was the one above, which helps for some sequences but not for
    others. In Gforth 0.7, we have 13 static superinstructions; in the
    current development version we have 41. Among the reasons to be
    stingy with adding static superinstructions are:

    * The more VM instruction implementations we add, the longer the
    compile time of the VM becomes.

    * The effect of static superinstructions in addition to dynamic
    superinstructions is quite small (see "with static super" in Figure
    10 [ertl&gregg03]); and the first few provide most of the benefit
    (see below).

    * At one point we had 400, but the engine crashed (probably not with
    all gcc versions, or we would not have been able to produce numbers
    for up to 1600). So we changed course, and only kept static
    superinstructions for the most frequent sequences where we also
    expect a benefit like the one shown above.

    Looking at our 2003 paper [ertl&gregg03] again, I also see an answer
    for your first question: Look at Figure 13. There you see the
    effect of static superinstructions and static replicas of VM
    instructions. Static replicas are just that: Instead of + you now
    have +A +B +C, and use them round-robin whenever an instance of +
    occurs in the VM program; the idea here was just to improve the branch prediction on microarchitectures with BTB-based indirect-branch
    prediction. It turns out that the balance between static
    superinstructions and static replicas was that having a relatively
    small number of static superinstructions (e.g., about 200 out of a
    total 1600 additional VM instruction implementations, or 35 out of a
    total 400) provides a benefit, but if you increase the number of superinstructions more, you get a slowdown. My explanation is that
    the additional superinstructions reduce the number of replicas, but
    help in only a few places, whereas the additional replicas help
    in a lot of places.
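    A sketch of static replication (hypothetical names; real engines do
    this with label addresses in generated code, not function pointers):
    the build provides several identical implementations of one VM
    instruction, and the code generator hands them out round-robin, so
    different instances of + dispatch through different indirect-branch
    addresses and stop evicting each other's BTB entries.

    ```c
    #include <stdio.h>

    /* three identical implementations of +, i.e., +A +B +C */
    static void plus_a(int *sp) { sp[0] = sp[0] + sp[1]; }
    static void plus_b(int *sp) { sp[0] = sp[0] + sp[1]; }
    static void plus_c(int *sp) { sp[0] = sp[0] + sp[1]; }

    typedef void (*vm_impl)(int *);
    static vm_impl plus_replicas[] = { plus_a, plus_b, plus_c };

    /* called once for every textual + while generating the VM code */
    static vm_impl pick_plus(void) {
        static int next = 0;
        return plus_replicas[next++ % 3];
    }

    int main(void) {
        int cell[2] = { 2, 3 };
        /* three instances of + in a program get three distinct replicas */
        vm_impl p1 = pick_plus(), p2 = pick_plus(), p3 = pick_plus();
        printf("distinct: %d\n", p1 != p2 && p2 != p3 && p1 != p3);
        p1(cell);
        printf("sum: %d\n", cell[0]);
        return 0;
    }
    ```

    Each replica behaves identically; the only point of the duplication
    is to give the BTB separate prediction slots per instance.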

    Conversely, there are a lot of sequences that occur throughout the
    code that are not covered by 1600 static superinstructions or less.
    So how many static superinstructions would you want to add before you
    consider the VM to be "sufficiently expressive"?

    - anton

    @InProceedings{ertl&paysan24,
    author = {Ertl, M. Anton and Paysan, Bernd},
    title = {{The Performance Effects of Virtual-Machine Instruction Pointer Updates}},
    booktitle = {38th European Conference on Object-Oriented Programming (ECOOP 2024)},
    pages = {14:1--14:26},
    series = {Leibniz International Proceedings in Informatics (LIPIcs)},
    ISBN = {978-3-95977-341-6},
    ISSN = {1868-8969},
    year = {2024},
    volume = {313},
    editor = {Aldrich, Jonathan and Salvaneschi, Guido},
    publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
    address = {Dagstuhl, Germany},
    URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2024.14},
    URN = {urn:nbn:de:0030-drops-208634},
    doi = {10.4230/LIPIcs.ECOOP.2024.14},
    annote = {Keywords: virtual machine, interpreter, out-of-order execution},
    abstract = {How much performance do VM instruction-pointer (IP)
    updates cost and how much benefit do we get from
    optimizing them away? Two decades ago it had little
    effect on the hardware of the day, but on recent
    hardware the dependence chain of IP updates can
    become the critical path on processors with
    out-of-order execution. In particular, this happens
    if the VM instructions are light-weight and the
    application programs are loop-dominated. The
    present work presents several ways of reducing or
    eliminating the dependence chains from IP updates,
    either by breaking the dependence chains with the
    \emph{l}oop optimization or by reducing the number
    of IP updates (the \emph{c} and \emph{ci}
    optimizations) or their latency (the \emph{b}
    optimization). Some benchmarks see speedups from
    these optimizations by factors $>2$ on most recent
    cores, while other benchmarks and older cores see
    more modest results, often in the speedup ranges
    1.1--1.3.}
    }

    @InProceedings{ertl&gregg03,
    author = "M. Anton Ertl and David Gregg",
    title = "Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters",
    crossref = "sigplan03",
    OPTpages = "",
    url = "https://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz",
    abstract = "Interpreters designed for efficiency execute a huge
    number of indirect branches and can spend more than
    half of the execution time in indirect branch
    mispredictions. Branch target buffers are the best
    widely available form of indirect branch prediction;
    however, their prediction accuracy for existing
    interpreters is only 2\%--50\%. In this paper we
    investigate two methods for improving the prediction
    accuracy of BTBs for interpreters: replicating
    virtual machine (VM) instructions and combining
    sequences of VM instructions into superinstructions.
    We investigate static (interpreter build-time) and
    dynamic (interpreter run-time) variants of these
    techniques and compare them and several combinations
    of these techniques. These techniques can eliminate
    nearly all of the dispatch branch mispredictions,
    and have other benefits, resulting in speedups by a
    factor of up to 3.17 over efficient threaded-code
    interpreters, and speedups by a factor of up to 1.3
    over techniques relying on superinstructions alone."
    }

    @Proceedings{sigplan03,
    booktitle = "SIGPLAN Conference on Programming Language
    Design and Implementation (PLDI'03)",
    title = "SIGPLAN Conference on Programming Language
    Design and Implementation (PLDI'03)",
    year = "2003",
    key = "PLDI '03"
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
