• Re: i386 on amd64 can fail to return from cond_wait_user, usingbasically 100% of a FreeBSD cpu

    From Konstantin Belousov@kostikbel@gmail.com to muc.lists.freebsd.stable on Sat Jul 5 11:23:16 2025
    From Newsgroup: muc.lists.freebsd.stable

    On Fri, Jul 04, 2025 at 11:01:22PM -0700, Mark Millard wrote:
    Some package builds are failing on the port-packages build cluster
    machines that do i386 builds during the following code. The analysis
    is from replication in a personal context (using poudriere bulk with
    -i), using poudriere-devel instead. I'll note that the personal
    context is from using PkgBase 14.3-RELEASE in the poudriere jail.
    I also installed most of the realted *-dbg* PkgBase packages in order
    to get the nicer backtracing.

    (gdb) bt
    #0 _umtx_op_err () at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S:37
    #1 0x2499f897 in _thr_umtx_timedwait_uint (mtx=0x249a365c, id=0, clockid=4, abstime=0x0, shared=0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_umtx.c:233
    #2 0x24995b26 in _thr_sleep (curthread=0x24d36004, clockid=4, abstime=0x0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c:197
    #3 0x24990beb in cond_wait_user (cvp=0x24dfa8a0, mp=0x24d38d04, abstime=<optimized out>, cancel=<optimized out>) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:317

    NOTE: cond_wait_user never returns but #2..#0 repeat (observed by
    repeated ^c and bt usage).

    (i386 is the oddball with 32-bit time_t but I do not know
    if that is involved here.)

    #4 cond_wait_common (cond=<optimized out>, mutex=<optimized out>, abstime=0x0, cancel=1) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:377
    #5 0x24990e8f in __thr_cond_wait (cond=0x23b9b4f4, mutex=0x23b9b4ec) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:392
    #6 0x23be1e4b in uv_cond_wait () from /usr/local/lib/libuv.so.1
    #7 0x024bd497 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
    #8 0x0232f5b6 in node::SpinEventLoopInternal(node::Environment*) ()
    #9 0x02485bf0 in node::NodeMainInstance::Run() ()
    #10 0x023eaba1 in node::Start(int, char**) ()
    #11 0x24a1da85 in __libc_start1 (argc=5, argv=0xffffda3c, env=0xffffda54, cleanup=0x23b73020 <rtld_nop_exit>, mainX=0x314a720 <main>)
    at /home/pkgbuild/worktrees/releng/14.3/lib/libc/csu/libc_start1.c:157 #12 0x0232d0a8 in _start ()

    www/librewolf and other firefox related package builds can do
    this until a 7200 sec timeout by poudriere occurs:

    Killing runaway build after 7200 seconds with no output

    I'll note that truss did not generate any output when used to
    watch the process that was stuck. It appears to be a world-internal
    problem.

    Can you provide a minimal stand-alone reproducer in C?


    --
    Posted automagically by a mail2news gateway at muc.de e.V.
    Please direct questions, flames, donations, etc. to news-admin@muc.de
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Sat Jul 5 10:08:34 2025
    From Newsgroup: muc.lists.freebsd.stable

    On Jul 5, 2025, at 01:23, Konstantin Belousov <kostikbel@gmail.com> wrote:
    On Fri, Jul 04, 2025 at 11:01:22PM -0700, Mark Millard wrote:
    Some package builds are failing on the port-packages build cluster
    machines that do i386 builds during the following code. The analysis
    is from replication in a personal context (using poudriere bulk with
    -i), using poudriere-devel instead. I'll note that the personal
    context is from using PkgBase 14.3-RELEASE in the poudriere jail.
    I also installed most of the realted *-dbg* PkgBase packages in order
    to get the nicer backtracing.

    (gdb) bt
    #0 _umtx_op_err () at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S:37
    #1 0x2499f897 in _thr_umtx_timedwait_uint (mtx=0x249a365c, id=0, clockid=4, abstime=0x0, shared=0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_umtx.c:233
    #2 0x24995b26 in _thr_sleep (curthread=0x24d36004, clockid=4, abstime=0x0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c:197
    #3 0x24990beb in cond_wait_user (cvp=0x24dfa8a0, mp=0x24d38d04, abstime=<optimized out>, cancel=<optimized out>) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:317

    NOTE: cond_wait_user never returns but #2..#0 repeat (observed by
    repeated ^c and bt usage).

    (i386 is the oddball with 32-bit time_t but I do not know
    if that is involved here.)

    #4 cond_wait_common (cond=<optimized out>, mutex=<optimized out>, abstime=0x0, cancel=1) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:377
    #5 0x24990e8f in __thr_cond_wait (cond=0x23b9b4f4, mutex=0x23b9b4ec) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:392
    #6 0x23be1e4b in uv_cond_wait () from /usr/local/lib/libuv.so.1
    #7 0x024bd497 in node::NodePlatform::DrainTasks(v8::Isolate*) ()
    #8 0x0232f5b6 in node::SpinEventLoopInternal(node::Environment*) ()
    #9 0x02485bf0 in node::NodeMainInstance::Run() ()
    #10 0x023eaba1 in node::Start(int, char**) ()
    #11 0x24a1da85 in __libc_start1 (argc=5, argv=0xffffda3c, env=0xffffda54, cleanup=0x23b73020 <rtld_nop_exit>, mainX=0x314a720 <main>)
    at /home/pkgbuild/worktrees/releng/14.3/lib/libc/csu/libc_start1.c:157
    #12 0x0232d0a8 in _start ()

    www/librewolf and other firefox related package builds can do
    this until a 7200 sec timeout by poudriere occurs:

    Killing runaway build after 7200 seconds with no output

    I'll note that truss did not generate any output when used to
    watch the process that was stuck. It appears to be a world-internal
    problem.
    Looks like my truss comment does not have the implications
    that I thought.
    Can you provide a minimal stand-alone reproducer in C?
    Unsure. My attempt will be mostly exploratory, not being familiar
    with the subject area or /usr/local/lib/libuv.so.1's uv_cond_wait
    or node's code.
    To some extent the below is me looking for an idea of someething
    specific to try.
    Looking at where /home/pkgbuild/worktrees/releng/14.3/lib/libthr/
    is shown as first listed: the call arguments to __the_cond_wait
    are shown as:
    (gdb) up 5
    #5 0x24990e8f in __thr_cond_wait (cond=0x23b9b4f4, mutex=0x23b9b4ec) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:392
    warning: 392 /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c: No such file or directory
    (gdb) print *cond
    $3 = (pthread_cond_t) 0x24dfa8a0
    (gdb) print **cond
    $4 = {__has_user_waiters = 1, kcond = {c_has_waiters = 0, c_flags = 0, c_clockid = 4, c_spare = {0}}}
    (gdb) print *mutex
    $5 = (pthread_mutex_t) 0x24d38d04
    (gdb) print **mutex
    $6 = {m_lock = {m_owner = 0, m_flags = 0, m_ceilings = {0, 0}, m_rb_lnk = 0, m_pad = 0, m_spare = {0, 0}}, m_flags = 1, m_count = 0, m_spinloops = 0, m_yieldloops = 0, m_ps = 0, m_qe = {
    tqe_next = 0x0, tqe_prev = 0x0}, m_pqe = {tqe_next = 0x0, tqe_prev = 0x0}, m_rb_prev = 0x0}
    There is not much set to a non-zero value:
    **cond.__has_user_waiters==1
    **cond.kcond.c_clockid==4
    **mutex.m_flags==1
    But, for all I know, some of that status could involve changes
    made after the call was made.
    There is also the question of debugger accuracy, since the
    jail contains an optimized PkgBase world build (with debugger
    information available), as far as I know. Such are sometimes
    messy to interpret.
    A more detailed view of causing another hangup after
    interrupting the existing hangup with ^C:
    . . .
    (gdb) finish
    Run till exit from #0 0x2499f897 in _thr_umtx_timedwait_uint (mtx=0x249a365c, id=0, clockid=4, abstime=0x0, shared=0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_umtx.c:233
    0x24995b26 in _thr_sleep (curthread=0x24d36004, clockid=4, abstime=0x0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c:197
    warning: 197 /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c: No such file or directory
    Value returned is $11 = 4
    (gdb) finish
    Run till exit from #0 0x24995b26 in _thr_sleep (curthread=0x24d36004, clockid=4, abstime=0x0) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_kern.c:197
    0x24990beb in cond_wait_user (cvp=0x24dfa8a0, mp=0x24d38d04, abstime=<optimized out>, cancel=<optimized out>) at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c:317
    warning: 317 /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c: No such file or directory
    Value returned is $12 = 4
    (gdb) next
    319 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    321 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    322 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    325 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    335 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    300 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    301 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    302 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    309 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    286 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    315 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    316 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    317 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/thread/thr_cond.c (gdb)
    The code around 317 is:
    if (cancel)
    _thr_cancel_enter2(curthread, 0);
    error = _thr_sleep(curthread, cvp->kcond.c_clockid, abstime);
    if (cancel)
    _thr_cancel_leave(curthread, 0);
    with the _thr_sleep line being 317.
    The implicit next from hitting return does not get back
    to a prompt, so _thr_sleep is staying on the stack.
    top shows:
    78445 420 root 141 0 147728Ki 131756Ki CPU30 30 588:37 99.94% /usr/local/bin/node
    during this.
    The same happens for _umtx_op_err with:
    (gdb) br _umtx_op_err
    Breakpoint 4 at 0x2498eec4: file /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S, line 37.
    (gdb) c
    Continuing.
    Thread 1 hit Breakpoint 4, _umtx_op_err () at /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S:37
    37 in /home/pkgbuild/worktrees/releng/14.3/lib/libthr/arch/i386/i386/_umtx_op_err.S
    (gdb) next
    So it looks like the cpu time reported by top is in the kernel,
    despite my earlier attempt to check on that via truss.
    But I'm still not noticing a suggestion of anything specific to
    try.
    ===
    Mark Millard
    marklmi at yahoo.com
    --
    Posted automagically by a mail2news gateway at muc.de e.V.
    Please direct questions, flames, donations, etc. to news-admin@muc.de
    --- Synchronet 3.21a-Linux NewsLink 1.2