• Dealing with mispredictions (was: Microarchitectural support ...)

    From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Dec 26 09:46:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Sooner or later, the pipeline designer needs to recognize the
    oft-occurring code sequence pictured as::

         INST
         INST
         BC-------\
         INST     |
         INST     |
         INST     |
    /----BR       |
    |    INST<----/
    |    INST
    |    INST
    \--->INST
         INST

    So that the branch predictor predicts as usual, but the DECODER
    recognizes the join point of this prediction; if the prediction is
    wrong, one only nullifies the mispredicted instructions and then
    inserts the alternate instructions, holding the join-point
    instructions until the alternate instructions complete.

    Would this really save much? The main penalty here would still be
    fetching and decoding the alternate instructions. Sure, the
    instructions after the join point would not have to be fetched and
    decoded, but they would still have to go through the renamer, which
    typically is as narrow or narrower than instruction fetch and decode,
    so avoiding fetch and decode only helps for power (ok, that's
    something), but probably not performance.

    And the kind of insertion you imagine makes things more complicated,
    and only helps in the rare case of a misprediction.

    What alternatives do we have? There are still some branches that
    are hard to predict and that it would be helpful to optimize.

    Classically the programmer or compiler was supposed to turn
    hard-to-predict branches into conditional execution (e.g., IIRC
    ARM has an ITE instruction for that, and My 66000 has something
    similar IIRC). These kinds of instructions tend to turn the
    condition from a control-flow dependency (free when predicted,
    costly when mispredicted) into a data-flow dependency (usually
    some cost, but usually much lower than a misprediction).
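
    A minimal C sketch of that transformation (illustrative only;
    whether the compiler actually emits a conditional select such as
    CMOV on x86 or CSEL on AArch64 depends on the target and its cost
    model):

        /* Branchy form: free when the branch predicts well, a full
           pipeline flush when it does not.  Assumes n >= 1. */
        int max_branchy(const int *a, int n)
        {
            int max = a[0];
            for (int i = 1; i < n; i++)
                if (a[i] > max)       /* control dependency */
                    max = a[i];
            return max;
        }

        /* Select form: the comparison feeds a conditional move, so
           the cost is a few cycles of latency on every iteration
           instead of an occasional misprediction penalty. */
        int max_select(const int *a, int n)
        {
            int max = a[0];
            for (int i = 1; i < n; i++)
                max = a[i] > max ? a[i] : max;   /* data dependency */
            return max;
        }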

    But programmers are not that great at predicting mispredictions
    (and programming languages usually don't have ways to express
    them), compilers are worse (even with feedback-directed
    optimization as it exists, i.e., without prediction-accuracy
    feedback), and predictability might change between phases or
    callers.

    So it seems to me that this is a case where the hardware might use
    history data to predict whether a branch is hard to predict (maybe
    also taking into account how the dependencies affect the cost),
    and switch between a branch-predicting implementation and a
    data-flow implementation of the condition.

    I have not followed ISCA and Micro proceedings in recent years, but I
    would not be surprised if somebody has already done a paper on such an
    idea.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Levine@21:1/5 to All on Thu Dec 26 19:11:01 2024
    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let
    you tell it the relative likelihood of each of the results of a
    three-way IF, and the expected number of iterations of a DO loop.
    It turned out to be useless, because programmers usually guessed
    wrong.

    The final straw was a compiler where they realized FREQUENCY was
    implemented backward and nobody noticed.

    Unless you've profiled the code and you have data to support your
    branch guesses, just write it in the clearest way you can.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to John Levine on Thu Dec 26 22:35:10 2024
    On Thu, 26 Dec 2024 19:11:01 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let you
    tell it the relative likelihood of each of the results of a three-way
    IF, and the expected number of iterations of a DO loop. It turned
    out to be useless, because programmers usually guessed wrong.

    The final straw was a compiler where they realized FREQUENCY was
    implemented backward and nobody noticed.

    Unless you've profiled the code and you have data to support your
    branch guesses, just write it in the clearest way you can.


    My asm coding is mostly various math library routines. Not
    production, just fun. I find the style illustrated above to be the
    clearest way of handling less common conditions. And even if I am
    wrong about commonality, the performance cost of such a mistake on
    modern hardware is minimal.

  • From MitchAlsup1@21:1/5 to John Levine on Thu Dec 26 21:18:26 2024
    On Thu, 26 Dec 2024 19:11:01 +0000, John Levine wrote:

    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let
    you tell it the relative likelihood of each of the results of a
    three-way IF, and the expected number of iterations of a DO loop.
    It turned out to be useless, because programmers usually guessed
    wrong.

    The final straw was a compiler where they realized FREQUENCY was
    implemented backward and nobody noticed.

    There was a PhD thesis at CMU circa 1978 on the "maximal munching
    method", which was supposed to match the largest executable
    patterns to the DAG corresponding to the VAX 11/780 instruction
    set. The method was producing fast dense code and everyone was
    happy, until they figured out that a comparison was backwards and
    they were actually matching the smallest executable patterns; yet
    the code ran faster that way than after they fixed it.

    Unless you've profiled the code and you have data to support your
    branch guesses, just write it in the clearest way you can.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Dec 26 21:54:53 2024
    On Thu, 26 Dec 2024 9:46:21 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Sooner or later, the pipeline designer needs to recognize the
    oft-occurring code sequence pictured as::

         INST
         INST
         BC-------\
         INST     |
         INST     |
         INST     |
    /----BR       |
    |    INST<----/
    |    INST
    |    INST
    \--->INST
         INST

    So that the branch predictor predicts as usual, but the DECODER
    recognizes the join point of this prediction; if the prediction is
    wrong, one only nullifies the mispredicted instructions and then
    inserts the alternate instructions, holding the join-point
    instructions until the alternate instructions complete.

    Would this really save much? The main penalty here would still be
    fetching and decoding the alternate instructions. Sure, the
    instructions after the join point would not have to be fetched and
    decoded, but they would still have to go through the renamer, which
    typically is as narrow or narrower than instruction fetch and decode,
    so avoiding fetch and decode only helps for power (ok, that's
    something), but probably not performance.

    When you have the property that FETCH will stumble over the join
    point before the branch resolves, the fact that you reached the
    join point means a branch-misprediction flush (~16 cycles) is
    avoided; you merely nullify the 4 wrong-path instructions in the
    reservation stations. FETCH is not disrupted, and execution
    continues.

    The balance is the mispredict recovery overhead (~16 cycles) compared
    to the cost of inserting the un-predicted path into execution (1 cycle
    in the illustrated case).
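
    Spelled out with illustrative numbers (the miss rate here is an
    assumption; the cycle counts are the ones above):

        cost per execution, flush on mispredict:  p_miss * 16 cycles
        cost per execution, insert alternate:     p_miss *  1 cycle

        e.g. with p_miss = 10%:  1.6 vs. 0.1 cycles per execution,
        a saving of about 1.5 cycles per execution of that branch.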

    And the kind of insertion you imagine makes things more complicated,
    and only helps in the rare case of a misprediction.

    PREDication is designed for the unpredictable branches--as a means
    to directly express the fact that the <equivalent> branch code is
    not expected to be predicted well.

    For easy-to-predict branches, don't recode as PRED--presto; done.
    So, rather than having to hint branches or guard individual
    instructions, I PREDicate short clauses, saving bits in each
    instruction because those bits come from the PRED instruction.
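
    A sketch in source terms (the C is illustrative; the reading of
    PRED as one instruction supplying execute-masks for a short shadow
    of following instructions is taken from the description above):

        /* Illustrative only: a clause short enough to PREDicate. */
        int max2(int a, int b)
        {
            int t;
            if (a > b)      /* one PRED instruction carries then/else */
                t = a;      /* masks for the next few instructions;   */
            else            /* the guarded instructions need no guard */
                t = b;      /* bits and are never branch-predicted.   */
            return t;
        }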

    What alternatives do we have? There are still some branches that
    are hard to predict and that it would be helpful to optimize.

    Classically the programmer or compiler was supposed to turn
    hard-to-predict branches into conditional execution (e.g., IIRC
    ARM has an ITE instruction for that, and My 66000 has something
    similar IIRC). These kinds of instructions tend to turn the
    condition from a control-flow dependency (free when predicted,
    costly when mispredicted) into a data-flow dependency (usually
    some cost, but usually much lower than a misprediction).

    Conditional execution and merging (CMOV) rarely takes as few
    instructions as branchy code and <almost> always consumes more
    power. However, there are a few cases where CMOV works out better
    than PRED, so: My 66000 has both.
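
    The instruction-count and power point in a hypothetical C
    fragment: compiled branch-free, both arms execute on every call,
    where the branchy form executes only one.

        int step_branchy(int x, int step, int bit)
        {
            if (bit) return x + step;   /* only one arm executes */
            else     return x - step;
        }

        int step_select(int x, int step, int bit)
        {
            /* If compiled branch-free, both sums are computed and
               then one is selected: more ALU ops and more operand
               traffic on every call. */
            return bit ? x + step : x - step;
        }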

    But programmers are not that great at predicting mispredictions
    (and programming languages usually don't have ways to express
    them), compilers are worse (even with feedback-directed
    optimization as it exists, i.e., without prediction-accuracy
    feedback), and predictability might change between phases or
    callers.

    A conditional branch inside a subroutine is almost always dependent on
    who calls the subroutine. Some calls may have a nearly 100% prediction
    rate in one direction, other calls a near 100% prediction rate in the
    other direction.

    One thing that IS different in My 66000 (other than PREDs not
    needing to be predicted) is that loops are not predicted--there is
    a LOOP instruction that performs ADD-CMP-BC back to the top of the
    loop in 1 cycle. Since HW can see the iterating register and the
    terminating limit, one does not overpredict iterations and then
    mispredict them away; instead one predicts the loop only so long
    as the arithmetic supports that prediction.
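
    For concreteness, the shape of loop this covers (C example; the
    mapping to ADD-CMP-BC is straight from the description above):

        /* The back edge is "i += 1; compare i,n; branch if less":
           the ADD-CMP-BC triple that LOOP performs in one cycle.
           Since the hardware sees both i and n, the final iteration
           falls through instead of being overpredicted and squashed. */
        long sum(const long *a, long n)
        {
            long s = 0;
            for (long i = 0; i < n; i++)
                s += a[i];
            return s;
        }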

    Thus, in My 66000, looping branches do not contribute to predictor
    pollution (updates), leaving the branch predictors to deal with the
    harder stuff. In addition we have a LD IP instruction (called CALX)
    that loads a value from a table directly into IP, so no jumps here.
    And finally:: My 66000 has a Jump Through Table (JTT) instruction,
    which performs:: range check, table access, add scaled table entry
    to IP and transfer control.
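
    In C terms, one JTT covers a whole dense switch dispatch
    (illustrative example):

        /* JTT folds what a conventional ISA spells out as a
           compare-and-branch range check, a table load, and an
           indirect jump; out-of-range selectors reach default. */
        int dispatch(int op, int a, int b)
        {
            switch (op) {
            case 0:  return a + b;
            case 1:  return a - b;
            case 2:  return a * b;
            default: return 0;   /* caught by JTT's range check */
            }
        }

    Presumably the size advantage mentioned below comes from the table
    entries being scaled IP-relative offsets rather than 64-bit
    absolute pointers.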

    Thus, there is very little indirect prediction (maybe none on
    smaller implementations), switch tables are all PIC, and the
    tables are typically ¼ the size of the equivalents in other 64-bit
    ISAs.

    So, by taking Looping branches, indirect branches, and indirect calls
    out of the prediction tables, those that remain should be more
    predictable.

    So it seems to me that this is a case where the hardware might use
    history data to predict whether a branch is hard to predict (maybe
    also taking into account how the dependencies affect the cost),
    and switch between a branch-predicting implementation and a
    data-flow implementation of the condition.

    I have not followed ISCA and Micro proceedings in recent years, but I
    would not be surprised if somebody has already done a paper on such an
    idea.

    - anton

  • From Thomas Koenig@21:1/5 to John Levine on Thu Dec 26 22:27:51 2024
    John Levine <johnl@taugh.com> schrieb:
    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let
    you tell it the relative likelihood of each of the results of a
    three-way IF, and the expected number of iterations of a DO loop.
    It turned out to be useless, because programmers usually guessed
    wrong.

    The final straw was a compiler where they realized FREQUENCY was
    implemented backward and nobody noticed.

    Unless you've profiled the code and you have data to support your
    branch guesses, just write it in the clearest way you can.

    There is one partial exception: putting error handling, which
    should occur very infrequently, in a cold partition can indeed
    bring benefits, and the compiler cannot always figure it out;
    heuristics like "a NULL check is less likely to be taken" can be
    wrong.

    But then again, defining an "unlikely" macro and having code like

    if (unlikely (a > b))
      {
        /* Do some error handling.  */
      }

    probably increases the readability over the naked condition, so...
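
    For reference, the usual GCC/Clang spelling of such a macro (the
    form used in the Linux kernel); __builtin_expect only informs the
    compiler's block layout and emits no code of its own:

        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

    GCC and Clang can also move a whole function into a cold text
    partition via __attribute__((cold)).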
