• Re: binary search debugging of compilers

    From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Fri May 19 03:44:04 2023
    From Newsgroup: comp.compilers

    On 2023-05-18, Spiros Bousbouras <spibou@gmail.com> wrote:
    On Wed, 17 May 2023 10:55:27 -0400
    Russ Cox <rsc@swtch.com> wrote:
    When BitKeeper came along, we added this:

    --longest   Restrict the deltas to those on the longest line
                between the two range endpoints. Unlike a range,
                the lower bound is included in the output.

    because BitKeeper history is a lattice, not a straight line. So your
    test points became

    bk changes -nd:REV: > REVS

    and binary search over those. That gets you the longest "straight" line
    in the graph.

    Any partial ordering can be extended to a total ordering. Why can't you
    just do that instead of using 'the longest "straight" line' ?

    I think, not without introducing tie-breaker rules that don't
    necessarily make any sense.

    But maybe that doesn't matter, because something else does. Say we
    have:

              D-------E
             /         \
    A---B---C           I----J
             \         /
              F---G---H

    A is before B and so on. Are F-G-H before D-E or after?

    Suppose the bug was introduced by E. We would like to know that.

    The longest path between the A-J bisection points goes through F-G-H,
    ignoring D-E. So it may wrongly look like I is the culprit, if E is
    never tested. I has the bug, H doesn't.

    Say we pick a precedence and flatten everything, in these
    two ways:

    A---B---C---D---E---F---G---H---I----J

    or

    A---B---C---F---G---H---D---E---I----J

    The problem is that, under the first order, the bug is bouncing
    into and out of existence:

    A---B---C---D---E---F---G---H---I----J
                    X               X    X


    E has the bug, and so do I and J. But F-G-H do not. So the bisect
    algorithm has a problem; it is predicated on the idea that the bug
    exists at the upper endpoint, is absent at the lower, and has
    appeared exactly once in between.

    Now, if we actually cherry pick all those commits into that order
    (solving whatever conflicts are required), then we can do a meaningful
    bisect. Then the rebased F G H commits do have the bug:

    A---B---C---D---E---F'--G'--H'--I'---J
                    X   X   X   X   X    X

    It's impractical, though, to be bisecting some lineage whipped up
    on the fly that doesn't actually exist in the repo.

    The best thing is to just keep the actual history in that linear
    shape to begin with.

    Or else, do that longest-path thing when bisecting, and if a merge
    point is implicated as the bad commit, go into the branch. When I
    is implicated, we note that it has two parents, E and H. Only the
    H parentage was tested; C was good. Thus we bisect C to E.

    A smarter bisect algorithm could do this.

    To make this somewhat topical for comp.compilers, we note that the
    revision history is a graph which resembles the control flow
    graph of a program (one with no backwards branches).

    The graph can be divided into our favorite chunks: basic blocks.

    Basic blocks are sequences of commits in which no commit has
    multiple parents (other than the first one), or is the parent of
    multiple children (other than the last one).

    The bisect algorithm should divide the graph into basic blocks,
    and recurse into them when necessary.

    So in:


              D-------E
             /         \
    A---B---C           I----J
             \         /
              F---G---H


    we have these basic blocks:

    A---B---C

    D---E

    F---G---H

    I---J


    First, the bisect could process the longest-path basic blocks: those
    headed by A, F and I. The goal is to determine which basic block
    introduces the bug.

    When a buggy basic block is identified, then bisect actually
    does commit-granularity bisection in that block.

    If we discover that the bad commit is the head of a block,
    and that block has multiple parents, we then recurse over
    those parents somehow.

    We identify paths (through the basic block graph) we have not taken
    and similarly analyze them.
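    The basic-block decomposition can be written out as a small sketch.
    This is my own illustrative code over the A..J example history above,
    not anything from an actual bisect tool:

```python
# Split a commit DAG into "basic blocks": maximal chains in which only
# the first commit may have multiple parents and only the last may
# have multiple children. The graph is the A..J history from the
# diagram above (C forks into D and F; E and H merge into I).

parents = {
    "A": [], "B": ["A"], "C": ["B"],
    "D": ["C"], "E": ["D"],
    "F": ["C"], "G": ["F"], "H": ["G"],
    "I": ["E", "H"], "J": ["I"],
}

children = {c: [] for c in parents}
for c, ps in parents.items():
    for p in ps:
        children[p].append(c)

def basic_blocks():
    blocks = []
    for c in parents:
        # A commit heads a block unless it has exactly one parent
        # and that parent has exactly one child.
        if len(parents[c]) == 1 and len(children[parents[c][0]]) == 1:
            continue
        block = [c]
        # Extend the chain while the tail has a unique child that
        # itself has a unique parent.
        while len(children[block[-1]]) == 1:
            nxt = children[block[-1]][0]
            if len(parents[nxt]) != 1:
                break
            block.append(nxt)
        blocks.append(block)
    return blocks

print(basic_blocks())
```

    On this graph it recovers exactly the four blocks listed above:
    A-B-C, D-E, F-G-H and I-J.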

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Cliff Click@cclick0@gmail.com to comp.compilers on Fri May 19 10:47:07 2023
    From Newsgroup: comp.compilers

    HotSpot C2 compiler has a set of -XX debugging flags for doing binary
    search on top-level methods to compile with C2 (or C1 or interpret), and
    a similar control over which methods to inline. While I was at Sun I had
    a bisect shell script to narrow down failing compilations (and trim the
    set of inlined methods).

    Cliff
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From mrs@mrs@kithrup.com (Mike Stump) to comp.compilers on Sat May 20 18:04:39 2023
    From Newsgroup: comp.compilers

    From: Larry McVoy <lm@mcvoy.com>
    Date: Mon, May 15, 2023 at 1:57 PM
    Subject: Re: history of source control binary search for debugging

    Good to hear from you Russ.

    I'm pretty sure I "invented" that idea, which means I hadn't seen
    it before. All I was doing was using the fact that binary search
    is log(n). And giving non-kernel people a way to do some grunt
    work and then get it close and then hand that off to the kernel
    people.

    But note that the BUG-HUNTING was using snapshot, not source
    management. BitKeeper hadn't been invented yet and the other
    source management systems sucked because they were reproducible
    only at tagged points. At least that was true of CVS.

    With cvs update -D, one can select a date and update the tree to that
    specific date. Indeed, I've been known to use '1 day ago', ... '50
    days ago' and binary search that way as well. Nice interface as you
    only have to deal with changing a single number. And the granularity
    in days is a nice metric. It gets you down to a day or two and then
    you dig in from there. Further, with cvs, you could update only parts
    of the tree if you suspect which part might be failing. Front end,
    loop unrolling and so on.
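    The days-ago binary search can be condensed into a sketch. Here
    build_and_test is a hypothetical hook standing in for running
    cvs update -D "<n> days ago", rebuilding, and running the failing
    test; in the example it is simulated by a bug that (for the sake of
    illustration) appeared 17 days ago:

```python
# Binary search over "days ago", as described above: the tree passes
# at oldest_good days ago and fails at newest_bad days ago; narrow the
# window to a single day.

def bisect_days(oldest_good, newest_bad, build_and_test):
    good, bad = oldest_good, newest_bad
    while good - bad > 1:
        mid = (good + bad) // 2
        if build_and_test(mid):   # tree still passes at mid days ago
            good = mid
        else:
            bad = mid
    return bad   # oldest checkout that already fails

# Simulated stand-in: checkouts pass only from before the bug landed.
simulated = lambda days_ago: days_ago > 17
print(bisect_days(50, 1, simulated))   # 17
```

    This gets the window down to a day or two, after which you dig in
    by hand, as the post says.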

    Another technique I've used is to save off part of the compiler
    (cc1/cc1plus) binary from the nightly build system, and then when
    you want state on a bug, you simply run

        for i in */cc1; do test $i; done

    and then you get a total view of that code against all the
    compilers. You can see bugs pop in and out, or, if they are
    simple, see that it always worked before some point and never
    after.

    These methods are so basic that there is no need to write them down.
    I've been using such techniques longer than linux has been around.
    :-)
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From gah4@gah4@u.washington.edu to comp.compilers on Sat May 20 20:20:56 2023
    From Newsgroup: comp.compilers

    On Saturday, May 20, 2023 at 3:05:13 PM UTC-7, Thomas Koenig wrote:
    Kaz Kylheku <864-11...@kylheku.com> schrieb:
    Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
    I doubt that if such a bug were introduced as an easter egg into GCC,
    it would go very long without being discovered by the FOSS distros and other downstream users.

    It would very likely be caught by regression testing. GCC has an
    extensive test suite, and it is a requirement that this is run
    before submitting a patch.

    I believe it was Borland Turbo C, maybe 2.0.

    At one point, I had a few different C compilers for MS-DOS, not so long
    before I went to using OS/2. It seems that Borland also had compilers
    for OS/2, but I don't remember using those.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Russ Cox@rsc@swtch.com to comp.compilers on Fri May 12 13:59:22 2023
    From Newsgroup: comp.compilers

    Hi all,

    In the Go compiler, we have been using a technique for the past eight
    years that I assume has been done in other compilers before, but I'm
    unaware of any. I describe it below, along with refinements we've made
    over the years. I'd be very interested to hear about earlier uses of
    approaches like this, or of any other ideas for use cases.

    In 2015, Keith Randall was working on a new SSA backend for the Go
    compiler. The old and new backends coexisted, and we could use either
    for any given function in the compilation. Keith introduced an
    environment variable GOSSAHASH that specifies the last few binary
    digits of the hash of function names that should use the new backend.
    So GOSSAHASH=0110 means compile only those functions whose names hash
    to a value with last four bits 0110. When a test is failing with the
    new backend, you try GOSSAHASH=0 and GOSSAHASH=1, and then use binary
    search to progressively refine the pattern, narrowing the failure down
    until only a single function is being compiled with the new backend.
    This was invaluable for approaching failures in large real-world tests
    (tests for libraries or production code, not for the compiler) that we
    had not written and did not understand.
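    A toy model of the GOSSAHASH search might look like the following.
    This is not the Go toolchain's actual code: the hash function, the
    function names, and the failing test are all inventions for the
    sketch; only the suffix-matching idea is from the post:

```python
# A function is compiled with the new backend iff the low bits of a
# hash of its name match the given binary suffix; the suffix is grown
# one bit at a time toward a function whose new-backend compilation
# makes the test fail.

import hashlib

def name_bits(name):
    # Illustrative hash; the real compiler uses its own.
    h = int.from_bytes(hashlib.sha256(name.encode()).digest()[:4], "big")
    return format(h, "032b")

def uses_new_backend(name, suffix):
    return name_bits(name).endswith(suffix)

def bisect_culprit(functions, test_fails):
    suffix = ""
    while True:
        chosen = [f for f in functions if uses_new_backend(f, suffix)]
        if len(chosen) <= 1:
            return chosen
        # If restricting to the "0" extension still fails, the culprit
        # hashes to ...0+suffix; otherwise it must be in the "1" side.
        zero = [f for f in functions if uses_new_backend(f, "0" + suffix)]
        suffix = ("0" if test_fails(zero) else "1") + suffix

# Hypothetical failing test: it breaks whenever "sortKeys" is
# compiled with the new backend.
funcs = ["main", "parseFile", "sortKeys", "emitCode",
         "checkTypes", "foldConsts", "allocRegs", "printAST"]
print(bisect_culprit(funcs, lambda chosen: "sortKeys" in chosen))
```

    Each extension of the suffix roughly halves the selected set, so the
    search takes about log2(N) rebuilds.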

    In 2016, David Chase was debugging a new optimizer rewrite rule that
    should have been correct but was causing mysterious test failures.
    He reused the same technique at finer granularity: the bit pattern now
    matched against a hash of the current file name and line number,
    so that the binary search pinpoints the exact line of code that is
    miscompiled (meaning causes a test failure) when the suspect rewrite
    rule is used.

    Since then we have continued to find uses for this approach.
    For example, when we added automatic FMA insertion on certain
    architectures, some tests started failing. By making the rewrite rule controlled by a file:line hash pattern, we can pinpoint the exact line
    or lines where FMA insertion causes the failure.

    As another example, we are considering a small change to the semantics
    of variables declared in for loops, along the lines of what C# 5 did
    and what JavaScript does in for-let loops. By using a hash pattern to
    toggle whether the new semantics applies at a given file:line, we can
    identify the specific loops where the semantic change provokes a
    test failure.

    The binary search does not have to stop at a single failure. With a
    richer pattern syntax adding an AND-NOT operator, we can disable the
    instances we've found, and if the test still fails, run a new binary
    search to find another culprit. (Technically, AND-NOT is not necessary
    for doing a complete binary traversal of the search space guided by
    single failures, but it is easy, and it is necessary for the next step.)

    The binary search can also be adapted to finding pairs or larger
    groups of changes, all of which are necessary to provoke a failure.
    If suffix B provokes the failure but neither 0B nor 1B does, then
    (assuming the test is not flaky!) you start using the pattern
    "0B OR 1B", refine the left half of the expression with binary
    search (for example, the first step checks "00B OR 1B" versus
    "10B OR 1B"), and then, having refined the left half, move on to
    refining the right half. After refining both, you have an OR of
    patterns that each match one change, and all the changes are
    necessary to provoke the failure.
    The splitting naturally recurses as needed to find a locally minimal set
    of changes, in the sense that by construction, removing any single
    change would not provoke the failure.
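    The pair-finding refinement can be sketched with a toy model of my
    own (the entities, the two culprit ids, and the failure predicate
    are all made up; only the split-and-refine scheme is from the
    description above):

```python
# Entities are 4-bit ids, patterns are binary suffixes, and the
# failure needs BOTH hypothetical culprits "0011" and "1010" enabled.

entities = [format(n, "04b") for n in range(16)]

def selected(suffixes):
    # An entity is enabled if its id matches any suffix in the OR-list.
    return {e for e in entities if any(e.endswith(s) for s in suffixes)}

def find_minimal(width, fails):
    suffixes = [""]          # start by enabling everything
    i = 0
    while i < len(suffixes):
        while len(suffixes[i]) < width:
            for bit in "01":
                trial = list(suffixes)
                trial[i] = bit + suffixes[i]
                if fails(selected(trial)):
                    suffixes = trial   # this half still fails: narrow
                    break
            else:
                # Neither half alone reproduces the failure: both
                # halves hold a necessary change, so split the pattern
                # into "0S OR 1S" and refine each side in turn.
                suffixes = (suffixes[:i]
                            + ["0" + suffixes[i], "1" + suffixes[i]]
                            + suffixes[i + 1:])
        i += 1
    return suffixes

fails = lambda on: "0011" in on and "1010" in on
print(sorted(find_minimal(4, fails)))   # ['0011', '1010']
```

    By construction, dropping either returned pattern makes the failure
    disappear, which is the "locally minimal" property described above.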

    Stepping slightly away from compilers temporarily, we have recently
    realized that this technique can be adapted to identifying specific bug-inducing contexts for library functions as well. For example,
    suppose we want to introduce a new qsort implementation, but that
    causes tests to fail in a large program (maybe even a compiler) that
    makes use of qsort in many contexts. Binary search selecting the old
    or new algorithm according to a hash of the call stack will quickly
    identify the exact call stack affected by the new qsort.
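    Call-stack selection might look roughly like this. The sketch is
    purely illustrative: hybrid_sort, new_sort, and the hashing scheme
    are inventions here, not any real library's mechanism; the point is
    only that the old/new choice keys off a hash of the stack, so the
    same suffix bisect applies:

```python
# Select an old or new implementation by a hash of the current call
# stack, so a suffix bisect can isolate the one call site where the
# new code misbehaves.

import hashlib
import traceback

def stack_hash_bits():
    # Hash the file:line of every frame below this one.
    frames = "".join(f"{f.filename}:{f.lineno};"
                     for f in traceback.extract_stack()[:-1])
    h = int.from_bytes(hashlib.sha256(frames.encode()).digest()[:4], "big")
    return format(h, "032b")

def new_sort(xs):
    # Stand-in for the new, suspect implementation.
    return sorted(xs)

def hybrid_sort(xs, suffix):
    # New implementation only at call stacks whose hash ends in suffix.
    if stack_hash_bits().endswith(suffix):
        return new_sort(xs)
    return sorted(xs)   # trusted old implementation

print(hybrid_sort([3, 1, 2], ""))   # [1, 2, 3]
```

    Refining the suffix then converges on the call stack whose use of
    the new implementation provokes the failure.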

    As I mentioned at the top, I am interested to hear about earlier uses
    of approaches like this, or any other ideas for situations where the
    approach might be applicable. Clearly the general problem has overlaps
    with group testing [1], and in particular Hwang's binary search-based
    approach (1972) [2]. I've also found one paper describing this
    algorithm in the context of identifying which of a large set of
    changes between GDB versions caused a crash (1999) [3].
    (That paper's version of the algorithm for finding pairs or larger
    groups is buggy unless there is only one such group.)

    It seems like the idea of binary search to narrow down a buggy
    software change is so natural that there must have been many earlier
    uses for debugging before 1999, especially in compilers.

    Thanks for any references, interesting history, or ideas.

    If anyone is interested in code, we've published the binary search tool [4]
    and the code that interprets the patterns [5].

    Best,
    Russ

    [1] https://en.wikipedia.org/wiki/Group_testing (a remarkably good article)
    [2] https://www.jstor.org/stable/2284447
    [3] https://dl.acm.org/doi/10.1145/318774.318946
    [4] https://pkg.go.dev/golang.org/x/tools/cmd/bisect
    [5] https://pkg.go.dev/golang.org/x/tools/internal/bisect

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sat May 13 03:20:18 2023
    From Newsgroup: comp.compilers

    On 2023-05-12, Russ Cox <rsc@swtch.com> wrote:
    In 2015, Keith Randall was working on a new SSA backend for the Go
    compiler. The old and new backends coexisted, and we could use either
    for any given function in the compilation. Keith introduced an
    environment variable GOSSAHASH that specifies the last few binary
    digits of the hash of function names that should use the new backend.
    So GOSSAHASH=0110 means compile only those functions whose names hash
    to a value with last four bits 0110. When a test is failing with the
    new backend, you try GOSSAHASH=0 and GOSSAHASH=1, and then use binary
    search to progressively refine the pattern, narrowing the failure down
    until only a single function is being compiled with the new backend.
    This was invaluable for approaching failures in large real-world tests
    (tests for libraries or production code, not for the compiler) that we
    had not written and did not understand. ...

    It seems like the idea of binary search to narrow down a buggy
    software change is so natural that there must have been many earlier
    uses for debugging before 1999, especially in compilers.

    I have used binary searching to find bugs in the TXR Lisp compiler.

    Say for the sake of simplicity that I already have a pretty reliable
    idea in which file the miscompiled function resides.

    Because the compiler executes the top-level forms that it compiles,
    I can go to 50% of that file (literally by typing 50% in Vim).
    Then stick a (setf *opt-level* 0) at that line. The bottom half
    of the file is then compiled without optimizations. If the bug goes
    away, then something is being mis-optimized in the second half.
    And so it goes.

    Another trick I have used is to suppress the compilation of
    specific forms according to this pattern:

    (defun fun (...) ...) --> (eval '(defun fun (...) ...))

    I.e. take the form, and eval its quoted version:

    form -> (eval 'form)

    So the compiler now is compiling nothing more than an eval call
    which takes a canned code literal as its argument. When the
    resulting compiled file is loaded, fun will be an interpreted
    function, not a compiled function: the compiled version of the
    eval call executes, interpreting the defun form, resulting in an
    interpreted function being defined. If the bug goes away, that
    function was being miscompiled.

    (This works in Lisp dialects that have a pure interpreter behind eval;
    dialects whose eval is actually a compiler may not support this
    technique well.)

    The hash technique seems nice. If we have no idea where in the
    code base the miscompilation is happening, it seems like it could help;
    it seems easier than some of the following techniques.

    I often have a good idea where in the code base the problem is
    because during the rebuild of TXR, files are compiled one by one.
    As each file in stdlib/ is compiled, the next compile job uses
    that file (if it is implied for loading). So more often than not,
    as soon as we miscompile some file that is used during
    compilation, the compilation of the next file will misbehave. Or
    possibly the same file: the compiler evaluates what it compiles
    (unless told not to), so if it is compiling some code that it's
    also itself using, it may misbehave soon after miscompiling some
    of it, before that file is done compiling.

    A technique I sometimes use is to substitute known-good compiled
    files. If the problem goes away when we substitute a known-good
    compiled file (touching the timestamps so that it's not clobbered
    by a recompile), we can suspect that the file is being
    miscompiled.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Fernando@pronesto@gmail.com to comp.compilers on Sat May 13 04:47:48 2023
    From Newsgroup: comp.compilers

    Hi Russ, that's very interesting. You should consider submitting a report of the technique to CGO! The next deadline is on the 19th, but then there will be another deadline on September 1st.

    As for a related work, see John Regehr's research on test case reduction for compilers [1].

    [1] John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, Xuejun Yang:
    Test-case reduction for C compiler bugs. PLDI 2012: 335-346

    Regards,

    Fernando
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Sun May 14 02:49:15 2023
    From Newsgroup: comp.compilers

    On 2023-05-13, Fernando <pronesto@gmail.com> wrote:
    Hi Russ, that's very interesting. You should consider submitting a report of the technique to CGO! The next deadline is on the 19th, but then there will be
    another deadline on September 1st.

    We can summarize it briefly.

    Background:

    Suppose we have some data processing technique which transforms N
    entities. A new version of the technique changes how all the entities
    are transformed. One or more of them are transformed erroneously,
    resulting in a system regression (one which doesn't readily identify
    the bad element). Because all elements are transformed with the new
    technique, there are big differences in all of them compared to the
    old technique: comparing old and new is futile.

    Setup:

    We can (somehow) enumerate the N entities with integers, which we
    express using a pure binary enumeration. Then, we can discover the
    number of an erroneously processed element, one bit at a time.

    Procedure:

    First we treat all elements numbered ..XXXXXX0 using the new
    technique, and all others using the old technique. If the system
    fails, we know that it's the ..XXXXXX0 set which has one or more
    badly processed elements. Otherwise it's the other set, whose
    enumerations end in a 1. Either way, we have discovered the
    identity of the last bit.

    Suppose that discovered bit is 1. We then proceed to the next bit:
    we process the ..XXXXX01 entities with the new technique and all
    others with the old technique. If the problem reproduces, we have
    narrowed to the ..XXXXX01 set, otherwise to the ..XXXXX11 set.

    In this manner we discover every bit, at which point we have
    a useful piece of information: the identity of a misprocessed element.
    We can then focus the investigation on how it is misprocessed.
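    The procedure condenses into a short sketch. The bad element
    (number 13 among N = 32) and system_fails are stand-ins: in real
    use, system_fails would rebuild and run the regressing system with
    only new_set processed by the new technique:

```python
# Discover the number of a misprocessed element one bit at a time,
# exactly as in the procedure above.

N = 32
BAD = 13   # hypothetical bad element, unknown to the search

def system_fails(new_set):
    return BAD in new_set

def find_bad_element(bits=5):
    suffix = 0                        # discovered low-order bits
    for k in range(bits):
        # New technique on entities whose low k+1 bits equal suffix
        # (with a 0 in position k); old technique on everything else.
        new_set = [n for n in range(N) if n % (1 << (k + 1)) == suffix]
        if not system_fails(new_set):
            suffix |= 1 << k          # the bad element has a 1 there
    return suffix

print(find_bad_element())   # 13
```

    After log2(N) runs of the system, the full number of a misprocessed
    element is known.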

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.compilers on Sun May 14 19:59:21 2023
    From Newsgroup: comp.compilers

    Russ Cox <rsc@swtch.com> schrieb:

    As I mentioned at the top, I am interested to hear about earlier uses
    of approaches like this, or any other ideas for situations where the
    approach might be applicable. Clearly the general problem has overlaps
    with group testing [1], and in particular Hwang's binary search-based approach (1972) [2].

    For GCC regression-hunting, two techniques are routinely used.

    One of them is the use of "git bisect", described at
    https://git-scm.com/docs/git-bisect . If there is a failing test
    case, this can (at the cost of some CPU time) usually pinpoint
    the offending commit.

    For reducing test cases, at least for C and C++, cvise
    (https://github.com/marxin/cvise) is now the method of choice; it
    is brutally efficient at reducing test cases to an absolute
    minimum. Because it parses C and C++, it can make transformations
    of the program which are language-aware.

    Both are tools external to the compiler.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From gah4@gah4@u.washington.edu to comp.compilers on Sun May 14 13:38:36 2023
    From Newsgroup: comp.compilers

    On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:

    (snip)

    Setup:

    We can (somehow) enumerate the N entities with integers, which we
    express using a pure binary enumeration. Then, we can discover the
    number of an erroneously processed element, one bit at a time.

    Procedure:

    First we treat all elements numbered ..XXXXXX0 using the new
    technique, and all others using the old technique. If the system
    fails, we know that it's the ..XXXXXX0 set which has one or more
    badly processed elements. Otherwise it's the other set, whose
    enumerations end in a 1. Either way, we have discovered the
    identity of the last bit.

    There is the assumption that only one of the N is in error.

    Reminds me of ECC memory, which can correct one bit errors, and detect
    two bit errors. After log(N) tests, you find the one that it is
    supposed to be, fix it, and it either works or doesn't.

    Then you work on a new set of tests.

    Or start out with a more complicated set of tests, which can learn
    more in one round.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Cameron McInally@cameron.mcinally@nyu.edu to comp.compilers on Sun May 14 23:28:11 2023
    From Newsgroup: comp.compilers

    On Sun, May 14, 2023 at 22:42 Thomas Koenig <tkoenig@netcologne.de> wrote:

    Russ Cox <rsc@swtch.com> schrieb:

    As I mentioned at the top, I am interested to hear about earlier uses
    of approaches like this, or any other ideas for situations where the approach might be applicable. Clearly the general problem has overlaps
    with group testing [1], and in particular Hwang's binary search-based approach (1972) [2].

    For GCC regression-hunting, two techniques are routinely used.

    One of them is the use of "git bisect", described at
    https://git-scm.com/docs/git-bisect . If there is a failing test
    case, this can (at the cost of some CPU time) usually pinpoint
    the offending commit.


    Ah, this is close to my heart. The story I heard was that automated
    regression bisection originated in the Cray compiler group around the
    mid 1990s. I didn't join that team until the aughts, but can attest to
    the quality of Cray's system, which is still in use today AFAIK.

    https://ieeexplore.ieee.org/document/625082

    For the general case of localizing runtime errors, the Wolf Fence
    algorithm would be a good thread to pull for uncovering the
    history.

    https://dl.acm.org/doi/abs/10.1145/358690.358695

    Hope that helps,
    Cam
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Mon May 15 21:35:41 2023
    From Newsgroup: comp.compilers

    On 2023-05-14, Thomas Koenig <tkoenig@netcologne.de> wrote:
    Russ Cox <rsc@swtch.com> schrieb:

    As I mentioned at the top, I am interested to hear about earlier uses
    of approaches like this, or any other ideas for situations where the
    approach might be applicable. Clearly the general problem has overlaps
    with group testing [1], and in particular Hwang's binary search-based
    approach (1972) [2].

    For GCC regression-hunting, two techniques are routinely used.

    One of them is the use of "git bisect", described at
    https://git-scm.com/docs/git-bisect . If there is a failing test
    case, this can (at the cost of some CPU time) usually pinpoint
    the offending commit.

    The technique described by Russ Cox is applicable when you know
    what the offending commit is. E.g. you made some changes, say, to
    the optimizer, and something is going wrong when you bootstrap
    the compiler. Obviously, the optimization changes are breaking
    something; the prior commit works. But you don't understand what
    is breaking where.


    For reducing test cases, at least for C and C++, cvise
    (https://github.com/marxin/cvise) is now the method of choice; it
    is brutally efficient at reducing test cases to an absolute
    minimum. Because it parses C and C++, it can make transformations
    of the program which are language-aware.

    This is only applicable if you have an external test case which
    reproduces some compiler problem (like a crash) and would like
    to make it smaller.

    The binary search technique described by Cox is uniquely
    applicable in the situation where something goes wrong with the
    bootstrapping of a compiler.

    1. A bug has been introduced into the compiler source.
    You already know exactly which commit did that, and
    what change is responsible, due to having done the git bisect
    or whatever. Maybe this is just new, uncommitted work that is
    breaking compared to the working HEAD.

    2. Consequently, the compiler miscompiles something in its own
    source code (anywhere in its source: in the compiler proper, or
    in standard library support routines of that language which are
    used by the compiler).

    3. Consequently, the compiled compiler misbehaves somehow.
    (Or possibly, the bootstrapping even fails: compiled code is
    loaded into the compiler's image, and it cannot continue
    the bootstrapping process.)

    Characterizing what is wrong in (3) is not too difficult (though
    it can be). The problem is that the issues exhibited in (3) are
    in good parts of the compiler, so they are red herrings. If you
    look at the source code for those parts, it is fine and has not
    been touched. Rather, the object code is incorrect due to the
    introduced compiler problem.

    You might know the commit which introduces the problem, yet be stumped
    as to how, due to the complexity of all the pieces and how they
    interact, and the difficulty of some of the algorithms. (Plus hidden
    problems, like the new work actually being correct, but exposing a bug
    in existing code; so staring at just the new code until you are blue
    does no good.)

    Cox's trick lets you turn on the new, known-bad code only for a
    selected subset of functions (or whatever logical units of the
    image).

    You can iteratively shrink the subset until it contains only one
    mistranslated unit.

    Then you know (aha!) that the bad optimizer change is, say,
    mistranslating the partition function in stdlib/quicksort.mylang.
    If just that function is treated with the buggy new change, and
    nothing else, the show grinds to a halt.

    So now you know you can focus your investigation on what the buggy
    compiler change is doing to the partition function in
    stdlib/quicksort.mylang.

    You can write test cases exercising that function to understand
    how its behavior has gone wrong, and study the object code to
    see how it's organized to bring about the wrong behavior, and how
    that relates to the bad new code in the compiler.

    You can tweak the code, and build the image with just that
    partition function being processed with the new code to see
    whether the regression has gone away, and iterate on that. Then
    try bootstrapping the entire image with the repaired code.

    Etc.

    The bug can show up as a mistranslation of anything, which is why
    I picked the quicksort partition helper as the example. That
    could wreck the compiler's behavior if some logic in it somewhere
    depends on sorting something. You could really be on a wild goose
    chase trying to debug *that*, rather than its completely
    unrelated root cause.

    Both are tools external to the compiler.

    Right; so not applicable to this internal technique, where
    basically we subdivide the compiler's own image into "parts
    compiled with the known-good compiler" and "parts compiled with
    the new, known-bad change in effect", where the parts are at a
    finer granularity than files.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Mon May 15 21:52:23 2023
    From Newsgroup: comp.compilers

    On 2023-05-14, gah4 <gah4@u.washington.edu> wrote:
    On Sunday, May 14, 2023 at 9:20:54 AM UTC-7, Kaz Kylheku wrote:

    (snip)

    Setup:

    We can (somehow) enumerate the N entities with integers, which we
    express using a pure binary enumeration. Then, we can discover the
    number of an erroneously processed element, one bit at a time.

    Procedure:

    First we treat all elements numbered ..XXXXXX0 using the new
    technique, and all others using the old technique. If the system
    fails, we know that it's the ..XXXXXX0 set which has one or more
    badly processed elements. Otherwise it's the other set, whose
    enumerations end in a 1. Either way, we have discovered the
    identity of the last bit.

    There is the assumption that only one of the N is in error.

    You might think so, but it's not actually a problem for the
    algorithm.

    When we bisect a change history, there is the assumption that the
    bug showed up once, so there are good commits before that point
    and bad commits after. If the history is not like that, and the
    bug disappears and reappears, then the bisect will not reliably
    identify *the* bad commit, because there isn't one.

    I have a story like this; but it digresses from my present point, so
    I will return to it below.

    In this situation, we are bisecting a space of N entities. All we need
    is to find one bad one. If there are three bad ones, it doesn't matter
    which one we find.

    Say we have a basket of apples with a rotten smell emanating from
    it. We can subdivide it, and smell one half and the other. If
    both halves smell, we know we have two or more rotten apples and
    they ended up in different halves. This doesn't matter. We just
    pick one half and keep subdividing. As long as we stay on the
    trail of the rotten scent, we will get down to one rotten apple,
    and we can use that apple to analyze further: what kind of mould
    or bacterium has infected it, and so on. Probably the other
    rotten apples have the same root cause. If they have different
    root causes, we can do another search after fixing the one root
    cause we have found.
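    The rotten-apples point can be shown in a few lines of code: plain
    halving still finds *some* bad element when several are bad; which
    one you land on depends only on which smelly half you keep
    following. The items and predicate here are my own toy stand-ins:

```python
# Halving search that tolerates multiple bad items: it returns one of
# them, not necessarily a specific one.

def find_one_bad(items, batch_is_bad):
    # batch_is_bad(batch): does this batch contain at least one bad item?
    assert batch_is_bad(items)
    while len(items) > 1:
        half = items[:len(items) // 2]
        # Follow whichever half still "smells"; if both do, the first
        # is as good a trail as any.
        items = half if batch_is_bad(half) else items[len(items) // 2:]
    return items[0]

rotten = {3, 9, 14}   # three bad apples among sixteen
found = find_one_bad(list(range(16)),
                     lambda batch: any(a in rotten for a in batch))
print(found)   # lands on one of 3, 9, 14
```

    Contrast this with history bisection, where multiple good/bad
    transitions break the algorithm's precondition outright.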

    Now about the story. The conclusion of the story was that there was
    a GC-related bug in an ash (arithmetic shift) function, dealing with heap-allocated bignum integers.

    A bignum integer was being prematurely reclaimed by the garbage
    collector.

    The fix looked like this:

    commit 9cf992835c8a0f2e6c4ace07b67fea2acb762cc5
    Author: Kaz Kylheku <kaz@kylheku.com>
    Date: Wed Jun 19 23:21:39 2019 -0700

    ash: gc problem.

    On 32 bit x86 Solaris 10, with gcc 4.9.0, this issue caused a
    miscompilation of the pset macro, due to ash abruptly
    returning zero, causing the op-code field of an instruction to
    be NOOP rather than GCALL.

    I'm committing this fix just for reference; I will immediately
    replace it with a refactoring of the function.

    * arith.c (ash): Add a gc_hint to prevent the a bignum from
    being reclaimed if the b = make_bignum() triggers gc.
    This happens in the case when the previous case computes
    a = make_bignum() and falls through.

    diff --git a/arith.c b/arith.c
    index fdb294b7..f374ddf4 100644
    --- a/arith.c
    +++ b/arith.c
    @@ -3235,6 +3235,7 @@ val ash(val a, val bits)
    b = make_bignum();
    if ((mpe = mp_shift(mp(a), mp(b), bn)) != MP_OKAY)
    break;
    + gc_hint(a);
    return normalize(b);
    default:
    goto bad3;

    The problem only showed up on Solaris (first red flag). When I ran
    git bisect on it, the bug was randomly appearing and disappearing,
    so git bisect was of no use.

    Not all problems introduce themselves with a reproducible test case.

    Here, the garbage collector had to go off exactly at the wrong time
    during the execution of the ash function. Small perturbations to
    completely unrelated code can make it go away.

    In the end, I had to debug it on Solaris (no working gdb or
    anything), using the repro that I had.

    Once I discovered the bad VM instruction, which was affecting the
    behavior of a certain Lisp macro in the standard library, I traced
    it through the compiler right down to the assembler, where the
    opcode field for a good assembly instruction was being
    miscalculated. I could not reproduce the bad ash calculation in
    isolation, though.

    In summary:

    1. non-gc-aware, sloppy coding in ash function caused ...
    2. rare problem in the assembler's masking and shifting calculations, causing ...
    3. bad instruction when compiling the "pset" (parallel assignment) macro ...
    4. which failed some code relying on pset.

    All of this reproduced on only one platform, at one particular commit.
    Small perturbations of Lisp code just about anywhere made it go away.
    Bisect was useless.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From gah4@gah4@u.washington.edu to comp.compilers on Tue May 16 23:52:10 2023
    From Newsgroup: comp.compilers

    On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:

    (snip)

    Say we have a basket of apples with a rotten smell emanating from it.
    We can subdivide it, and smell one half and the other. If both halves
    smell, we know we have two or more rotten apples and they ended up
    in different halves. This doesn't matter. We just pick one half and
    keep subdividing.

    But the algorithm described, at least as I remember it, doesn't test both halves.

    Sniffing two baskets doesn't take so long, but two tests might.

    I was suspecting that there were more efficient ways than doing
    both halves, though didn't try to figure out what they might be.

    As long as we stay on the trail of the rotten scent, we will
    get down to one rotten apple, and we can use that apple to analyze
    further: what kind of mould or bacterium has infected it and so on.
    Probably the other rotten apples have the same root cause. If they
    have different root causes, we can do another search after fixing
    the one root cause we have found.

    I don't know apple statistics so well. If you suspect more than one from the beginning,

    As a binary search, it should be 50% probability on each test.
    If you see higher than 50% as the tests go on, it might look
    suspicious already.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Russ Cox@rsc@swtch.com to comp.compilers on Wed May 17 10:55:27 2023
    From Newsgroup: comp.compilers

    Thanks to everyone for the replies and pointers.

    As far as 'git bisect' is concerned, we have found 'git bisect'
    incredibly useful too, of course. It is clearly another instance of
    binary search to find bugs, but it is binary search over time, while
    the system I described is binary search over the program being
    compiled. "What change broke the compiler?" vs "What code does that
    change miscompile?". The two are independent and complementary.

    I had originally not mentioned binary search over time because it
    seemed non-compiler-specific, but as Cameron McInally pointed out,
    one of its independent inventions was at Cray Research in the mid-1990s
    in a compiler context [1]. I reached out to Brian Ness to find out more
    of the history of that work, and I've included his response below,
    with permission. The other independent invention I know of was by
    Larry McVoy, also in the mid-1990s, in the context of Linux kernel
    development [2]. Linux did not have any kind of whole-tree source
    control snapshots, but they did post frequent enough manual snapshots
    to make the technique applicable. I also reached out to Larry McVoy for
    more of the history of that work, and I've included his response below,
    also with permission. That branch of the history led directly to
    'git bisect' [3].

    In general, binary search for narrowing a problem being debugged was
    obviously well known by the mid-1990s. At that time, whole snapshot
    source history was becoming common enough that the idea of binary
    search over it was inevitable. I expect there are other instances of
    independent invention we don't know about.

    Best,
    Russ

    [1] https://ieeexplore.ieee.org/document/625082
    [2] https://elixir.bootlin.com/linux/1.3.73/source/Documentation/BUG-HUNTING
    [3] https://groups.google.com/g/fa.linux.kernel/c/cp6abJnEN5U/m/5Z5s14LkzR4J

    ---------- Forwarded message ---------
    From: Russ Cox <rsc@swtch.com>
    Date: Mon, May 15, 2023 at 1:18 PM
    Subject: history of source control binary search for debugging

    Hi Larry and Brian,

    I am interested in the origin of the idea of binary search over source
    control history running tests to identify the change that introduced a
    given failure. I have traced it back as far as the mid-1990s with
    Larry's Documentation/BUG-HUNTING file in the Linux kernel
    (March 1996) and Brian's COMPSAC'97 paper with Viet Ngo "Regression
    containment through source change isolation" (August 1997).
    (I have been unable to find Viet Ngo's email address or I would add
    him to this thread.)

    Can you share any background about when you first encountered the idea,
    or any pointers to earlier systems or engineers I might reach out to?
    Thanks very much.

    Best,
    Russ

    ---------- Forwarded message ---------
    From: Larry McVoy <lm@mcvoy.com>
    Date: Mon, May 15, 2023 at 1:57 PM
    Subject: Re: history of source control binary search for debugging

    Good to hear from you Russ.

    I'm pretty sure I "invented" that idea, which means I hadn't seen it
    before. All I was doing was using the fact that binary search is
    log(n). And giving non-kernel people a way to do some grunt work and
    then get it close and then hand that off to the kernel people.

    But note that the BUG-HUNTING was using snapshots, not source
    management. BitKeeper hadn't been invented yet and the other source
    management systems sucked because they were reproducible only at
    tagged points. At least that was true of CVS.

    When BitKeeper came along, we added this:

    --longest Restrict the deltas to those on the longest line
    between the two range endpoints. Unlike a range, the
    lower bound is included in the output.

    because BitKeeper history is a lattice, not a straight line. So your
    test points became

    bk changes -nd:REV: > REVS

    and binary search over those. That gets you the longest "straight" line
    in the graph. I dunno if Git has something like that, maybe. Lots of
    stuff got lost in the transition to Git, still chaps my hide to this
    day.

    ---------- Forwarded message ---------
    From: Brian Ness <bcn@bubblecreek.com>
    Date: Mon, May 15, 2023 at 3:47 PM
    Subject: Re: history of source control binary search for debugging

    Hi Russ,

    I haven't been in contact with Viet since about 2005 or 2006, so I
    don't know his whereabouts now.

    Viet and I worked together in the compiler development group at Cray.
    There were language-specific front-end teams for Fortran 90 and C/C++,
    an optimizer team, and a backend team. Viet and I both worked on the
    optimizer team. Each team delivered its component as a set of
    relocatables (.o files), identified by a component version string.
    We had a tool for building the various compilers from the relocatable
    sets, by simply specifying the desired version of each component in a
    version string given as an input parameter to the build tool. There
    were other options, such as for linking up debug versions of the
    compilers. The build tool retrieved the relocatables from an archive
    we had set up, and linked them together as an executable. Each night,
    cron jobs would build compilers for testing from the most recent
    versions of each component. When regressions were discovered, we would
    build compilers with all previously validated components and execute
    the newly failing tests. If a given test failed with a previously
    validated compiler, we might have a test that had been modified or an
    external library that had regressed, but when previously validated
    compiler components passed the newly failing test, we would try one
    new compiler component (e.g. the optimizer) with previously validated
    other components. Since putting together the different combinations
    of components was easy with our build tool, we could quickly identify
    which new component introduced the regression.

    The optimizer group did something similar, but with source code rather
    than relocatables. Cray had its own version control system called USM
    (Unicos Source Manager) which had support for atomic changes, meaning
    that a set of related changes in multiple files could be encapsulated
    and treated as a single thing. We called those atomic change sets,
    "mods". As far as I know, there were no other version control
    systems supporting that at the time. Atomic change sets made it
    easy to
    backtrack through the source code to identify the particular set of
    changes causing a regression (or other behavior change). After proving
    the viability of this approach with linear searches through the code,
    we optimized it by using binary searches. This was important, because
    an optimizer build from source could involve compiling something like
    a million lines of code. Even on our supercomputers this could take a
    while, so we wanted to do the minimum number of builds from source.
    We called the process "mod isolation".

    After identifying all the mods causing regressions using our mod
    isolation process, we would remove those mods and rebuild the
    optimizer for the next round of testing. This was repeated until the
    optimizer had no known regressions, and that version would be marked
    as a release candidate. The "pulled" mods would be reworked by the
    author and re-applied for a future release candidate.

    This process allowed us to move from infrequent delivery of
    non-regressing optimizers to one or two week delivery cycles.

    I wasn't the one to come up with the idea for mod isolation at Cray,
    but I was the one who implemented it and automated much of the
    process. The COMPSAC paper resulted from Viet's PhD advisor being
    hired by Cray to evaluate our development processes. During his
    evaluation, he discovered our mod isolation process, and said he had
    never seen this being done before. He asked us to write the paper.
    I presented it at COMPSAC. We were not able to find much prior work on
    this, probably because there wasn't support for atomic changes in the
    version control tools in use at that time, with the exception of USM.
    Of course, now this functionality is available within git as "git
    bisect". (I used "git bisect" just yesterday.)

    I hope this helps you. Let me know if you want any other info.

    -Brian
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Cameron McInally@cameron.mcinally@nyu.edu to comp.compilers on Wed May 17 13:40:33 2023
    From Newsgroup: comp.compilers

    On Wed, May 17, 2023 at 13:13 Russ Cox <rsc@swtch.com> wrote:

    Thanks to everyone for the replies and pointers.

    As far as 'git bisect' is concerned, we have found 'git bisect'
    incredibly useful too, of course. It is clearly another instance of
    binary search to find bugs, but it is binary search over time, while
    the system I described is binary search over the program being
    compiled. "What change broke the compiler?" vs "What code does that
    change miscompile?". The two are independent and complementary.


    As for a binary search over the program being compiled, LLVM has a
    mechanism to bisect optimization passes. Not exactly the same, but
    worth noting.

    https://llvm.org/docs/OptBisect.html

    Also, Viet is still in compilers at Coherent Logix.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Wed May 17 18:28:39 2023
    From Newsgroup: comp.compilers

    On 2023-05-17, gah4 <gah4@u.washington.edu> wrote:
    On Monday, May 15, 2023 at 6:54:08 PM UTC-7, Kaz Kylheku wrote:

    (snip)

    Say we have a basket of apples with a rotten smell emanating from it.
    We can subdivide it, and smell one half and the other. If both halves
    smell, we know we have two or more rotten apples and they ended up
    in different halves. This doesn't matter. We just pick one half and
    keep subdividing.

    But the algorithm described, at least as I remember it, doesn't test both halves.

    If the problem reproduces with the half you're testing, you assume
    the misprocessed function is in that half. (There could be a bad
    function in the other half also, and you could test that, but it's
    unnecessary.)

    If the problem doesn't reproduce in the half you're testing, you
    assume the misprocessed functions lie in the other half, and
    recurse on that half instead.

    The algorithm will work even if every single function is miscompiled.
    (And will take the same logarithmic number of steps, since we cut
    in half no matter what.)

    It will arbitrarily narrow it down to one, showing that even if the
    compiler change is enabled for just one function, there is a problem.
    That function can be investigated. Then maybe the fix solves the
    issue for the five hundred other broken functions, too.

    The main problem in that situation is that the function you narrow it
    down to may not be the easiest one to investigate. Say that a
    three line function is broken, and a 1000 line function is broken;
    the binary search finds the 1000 line function, whereas it would be
    easier to investigate how the three line function is miscompiled.

    There could be the following issue: among several functions that
    are miscompiled, there are two: foo and bar. *Both* must be
    miscompiled for the problem to reproduce, due to an interaction.
    If there is a recursive subdivision such that foo and bar are in
    different halves, then the problem doesn't reproduce for either half.

    Therefore, when the problem fails to reproduce for the currently
    tested half, it may be wise to cross-test the other half. If it also
    doesn't reproduce, the halves have to be recombined and some other
    approach taken, like perturbing the enumeration so that the partitioning
    is done differently, or doing a slow linear elimination.
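    A sketch of this bisection with the cross-test added. Here
    `reproduces` is a hypothetical predicate that compiles only the
    given subset of functions with the new technique and runs the
    failing test:

```python
def bisect_functions(funcs, reproduces):
    """Narrow the set of miscompiled candidates down to one function.

    reproduces(subset) reports whether the bug shows up when only
    `subset` is compiled with the new technique. Returns None when
    neither half reproduces alone, signalling an interaction between
    functions that landed in different halves."""
    while len(funcs) > 1:
        mid = len(funcs) // 2
        left, right = funcs[:mid], funcs[mid:]
        if reproduces(left):
            funcs = left
        elif reproduces(right):   # cross-test the other half
            funcs = right
        else:
            return None           # recombine; try another partitioning
    return funcs[0]

# A single culprit is found as usual.
print(bisect_functions(list(range(10)), lambda s: 4 in s))  # -> 4

# foo=2 and bar=7 must *both* be miscompiled to reproduce: detected.
print(bisect_functions(list(range(10)), lambda s: 2 in s and 7 in s))  # -> None
```

    The None result is the signal to fall back to a different
    partitioning or a linear elimination.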


    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Wed May 17 18:47:26 2023
    From Newsgroup: comp.compilers

    On 2023-05-17, Russ Cox <rsc@swtch.com> wrote:
    because BitKeeper history is a lattice, not a straight line. So your
    test points became

    That's one reason I religiously keep git histories linear, though
    they could be lattices also, like in BK.

    I've not used the "git merge" command since 2009, other than by accident
    (via git pull, having forgotten to reconfigure pull to do rebase rather
    than merge). And not even that; I quickly trained myself never to run
    "git pull"; I banished that command also. Always "git fetch" and then
    "git rebase". Before rebasing, I usually take a look at the
    diverging upstream: how much new is coming in, and what it's like.

    I really, really hate the whole concept of a commit having more
    than one parent.

    From: Brian Ness <bcn@bubblecreek.com>
    than relocatables. Cray had its own version control system called USM
    (Unicos Source Manager) which had support for atomic changes, meaning
    that a set of related changes in multiple files could be encapsulated
    and treated as a single thing. We called those atomic change sets,
    "mods". As far as I know, there were no other version control
    systems supporting that at the time.

    Simple archives of a snapshot of the entire source code support
    mods, though obviously in an unwieldy way when the code is large;
    and if you want every change isolated, that requires keeping a lot
    of archives.

    (And that's basically what git is: every commit in git is a snapshot
    rather than a delta.)

    Linux kernel people used to bisect with kernel snapshots, before
    they settled on a version control system.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Spiros Bousbouras@spibou@gmail.com to comp.compilers on Thu May 18 10:50:24 2023
    From Newsgroup: comp.compilers

    On Wed, 17 May 2023 10:55:27 -0400
    Russ Cox <rsc@swtch.com> wrote:
    When BitKeeper came along, we added this:

    --longest Restrict the deltas to those on the longest line
    between the two range endpoints. Unlike a range, the
    lower bound is included in the output.

    because BitKeeper history is a lattice, not a straight line. So your
    test points became

    bk changes -nd:REV: > REVS

    and binary search over those. That gets you the longest "straight" line
    in the graph.

    Any partial ordering can be extended to a total ordering. Why can't you
    just do that instead of using 'the longest "straight" line' ?
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Russ Cox@rsc@swtch.com to comp.compilers on Thu May 18 15:28:56 2023
    From Newsgroup: comp.compilers

    Any partial ordering can be extended to a total ordering. Why can't you
    just do that instead of using 'the longest "straight" line' ?

    A precondition for binary search is that you are searching a function
    f over some domain [0, n) with the property that f(i) == true implies
    f(j) == true for all j >= i. That is, once the condition "flips", it
    stays flipped. Otherwise binary search is not a correct algorithm.
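    In code, that precondition is exactly what lets the classic search
    find the first index where f flips. A generic sketch, where `f(i)`
    stands for "the test at point i exhibits the bug":

```python
def first_flip(n, f):
    """Binary search over [0, n) for the smallest i with f(i) True.

    Precondition: f(i) == True implies f(j) == True for all j >= i.
    Assumes f(n - 1) is True (the newest point exhibits the bug)."""
    lo, hi = 0, n - 1   # invariant: f is False on [0, lo), True at hi
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid):
            hi = mid          # the flip is at mid or earlier
        else:
            lo = mid + 1      # the flip is after mid
    return lo

print(first_flip(100, lambda i: i >= 37))  # -> 37
```

    If the precondition fails (the bug bounces in and out of
    existence), nothing stops this loop from converging on an index
    that is not the culprit.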

    The assumption in tools like git bisect is that some commit
    introduces the bug, and then the bug is detectable in all future
    commits as well, establishing the precondition. The test f(i),
    testing at commit i, can be viewed as asking whether the bad commit
    is in the range [0, i] (the full history of that commit). With that
    definition, clearly f(i) implies f(j) for all j >= i, because the
    commit history [0, j] is a superset of the commit history [0, i].

    If instead you squash unrelated commit lines together in a total
    order, the precondition is no longer true: if i is from one line but
    j > i is from a different unmerged line, then testing j does not
    include some history that testing i does. The bug can disappear as
    you cross from one line to another. Focusing on the longest straight
    line in the commit history is one way to establish the necessary
    precondition.
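    Finding that longest straight line between two endpoints is a
    longest-path computation over the history DAG. A small sketch; the
    commit graph here is the illustrative diamond from earlier in the
    thread, not real history:

```python
from functools import lru_cache

def longest_line(children, src, dst):
    """Longest chain of commits from src to dst in an acyclic history
    graph, i.e. the line a --longest-style bisection restricts to."""
    @lru_cache(maxsize=None)
    def best(node):
        if node == dst:
            return (dst,)
        paths = [best(c) for c in children.get(node, ())]
        paths = [p for p in paths if p]      # drop dead ends
        if not paths:
            return ()                        # no route to dst
        return (node,) + max(paths, key=len)
    return list(best(src))

# A-B-C, then C forks into D-E and F-G-H, both rejoining at I, then J.
hist = {'A': ['B'], 'B': ['C'], 'C': ['D', 'F'], 'D': ['E'],
        'E': ['I'], 'F': ['G'], 'G': ['H'], 'H': ['I'], 'I': ['J']}
print(longest_line(hist, 'A', 'J'))
# -> ['A', 'B', 'C', 'F', 'G', 'H', 'I', 'J']
```

    Note that the D-E side branch is left out of the chosen line, which
    is exactly why commits on the shorter branch must be handled
    separately.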

    Another possibility, which is what git bisect does, is to maintain a
    set of commits not yet decided and then find some "good" midpoint
    along the longest line of commits remaining. You may in general end
    up with multiple disconnected lines in your set of commits not yet
    decided, and you have to handle each one separately until you find
    the bug.

    What git bisect does, then, is a more complex algorithm that uses
    binary search over longest lines as a subroutine. Restricted to a
    single line of commits the precondition holds and binary search works.
    (I am simplifying a bit. The actual implementation is perhaps not
    this principled, but this is the essence of what it is doing and
    why it works.)

    The binary search I described to start this thread is also not exactly
    the same binary search algorithm, because the underlying input problem
    is not a single-input function with domain [0, n). Instead the input
    problem (in the simplest form) is a function taking a set of possible
    changes [lo, hi) that reports whether the bug is in that set. We
    recognize it as binary search because it's still the same code and
    idea, and every binary search does maintain this progressively
    narrower range, but normal binary searches don't tell f the range.
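    A minimal sketch of that variant, where the test is handed the whole
    candidate range and reports whether the bug lies in it. The
    `bug_in_range` predicate is a hypothetical stand-in for the real
    build-and-test step:

```python
def bisect_by_range(lo, hi, bug_in_range):
    """Narrow [lo, hi) down to a single change index.

    bug_in_range(a, b) reports whether the bug lies in the set of
    changes [a, b) -- unlike ordinary binary search, the test sees
    the whole candidate range, not a single point."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if bug_in_range(lo, mid):
            hi = mid
        else:
            lo = mid
    return lo

# The bug is change 42 out of changes [0, 100).
print(bisect_by_range(0, 100, lambda a, b: a <= 42 < b))  # -> 42
```

    The loop body is the familiar halving; the only difference is that
    f is told the range rather than a single point.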

    Best,
    Russ
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.compilers on Fri May 19 03:21:18 2023
    From Newsgroup: comp.compilers

    On 2023-05-17, gah4 <gah4@u.washington.edu> wrote:
    As to the problem of debugging by miscompiled code, this reminds
    me of stories of debugging compilers by feeding them cards from
    the card recycling bin. It seems that cards were especially high grade
    paper and so worth keeping separate for recycling. I never actually
    saw anyone do that, though.

    Obviously, the card recycling bin is going to be a source of garbage
    that is invalid inputs to the compiler, with perhaps some valid bits.

    This was effectively an early form of fuzzing.

    In my younger days, some said that I was especially good at finding
    compiler errors. Maybe not as good now.

    Fuzzing has matured now. There are tools now which instrument the
    executable, and monitor what is going on in terms of the program
    control flow in response to random inputs. Then they adjust the bits
    to try to stimulate paths not taken to tickle bugs. That's the
    gist of it.

    I would try things that were
    allowed, but others might not have thought about doing.

    I've not used fuzzing in a while ... since 2016 it turns out! I
    played with it in the summer of that year. The software was AFL
    (American Fuzzy Lop) 2.30b.

    I had found three bugs in a short timespan back then.

    One of them was exponential memory explosion in nested quasiquoting
    syntax like ^^^....^^exr. (TXR Lisp's quasiquoting operator is
    written ^ rather than the usual backtick ` used in most traditional
    Lisp dialects; otherwise it is the same thing.)

    AFL also discovered that if it puts a huge number into the op syntax
    like this: (op list @123455234), the op expander will obligingly
    generate a lambda with that many arguments! E.g. (op list @3)
    generates something similar to (lambda (arg0 arg1 arg3) (list arg3)).
    So imagine if we replace @3 with a large integer.

    AFL also found a crash in the regex parser. It was not doing a range
    check on numeric character escapes, making it possible for the
    program to sneak a negatively valued character code into the regex
    compiler, where it resulted in an out-of-bounds array access.

    I have been meaning to brush the dust off this technique again.

    Only one I remember now, is a C compiler that miscompiled the ++
    operator on a double variable. Allowed in C, but maybe not often used.

    Oh, stepping doubles with ++ is, I would say, not *that* uncommon.
    I doubt that if such a bug were introduced as an easter egg into GCC,
    it would go very long without being discovered by the FOSS distros and
    other downstream users.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21b-Linux NewsLink 1.2