• ZFS deadlocks/memory accounting issues

    From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 16:41:35 2026
    From Newsgroup: muc.lists.freebsd.stable

    Since we upgraded to 14.3 last summer, we have been experiencing
    numerous memory accounting issues on our NFS servers. These manifest
    as a server *desperate* to free up memory despite having multiple
    gigabytes of physical RAM available. (Some of these machines have 1
    TiB of RAM, with more than 64 GiB free, and were swapping and invoking
    the OOM-killer.)

    I had a server deadlock just now after only three days of uptime with
    32 GiB of free memory. Prior to the crash, about 70 GiB (of 128) was
    used by the ARC, of which some 60 GiB was accounted for as
    "evictable", and the load was pretty modest.

    In DDB on the console, I noted:

    pid ppid pgrp uid state wmesg wchan cmd
    60673 60672 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60672 1 3008 0 S wait 0xfffffe031ee41560 nrpe
    60670 1186 60670 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
    60669 1202 1202 0 D voffloc 0xfffff8024db4966a perl
    60668 60667 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60667 1 3008 0 S wait 0xfffffe031ee41000 nrpe
    60665 60664 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60664 1 3008 0 S wait 0xfffffe031723a5c0 nrpe
    60662 60661 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60661 1 3008 0 S wait 0xfffffe03172395a0 nrpe
    60659 60658 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60658 1 3008 0 S wait 0xfffffe0317239040 nrpe
    60656 60655 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60655 1 3008 0 S wait 0xfffffe0317238ae0 nrpe
    60653 60652 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60652 1 3008 0 S wait 0xfffffe0317238580 nrpe
    60650 60649 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60649 1 3008 0 S wait 0xfffffe0317238020 nrpe
    60647 60646 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60646 1 3008 0 S wait 0xfffffe0317237ac0 nrpe
    60644 60643 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60643 1 3008 0 S wait 0xfffffe0317237000 nrpe
    60641 60640 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60640 1 3008 0 S wait 0xfffffe00d3cfa040 nrpe
    60638 1202 1202 0 D voffloc 0xfffff8024db4966a perl
    60637 1186 60637 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
    60636 60635 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60635 1 3008 0 S wait 0xfffffe00d3cf9ae0 nrpe
    60633 60632 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60632 1 3008 0 S wait 0xfffffe00d3cf9580 nrpe
    60630 60629 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60629 1 3008 0 S wait 0xfffffe00d3cf9020 nrpe
    60627 60626 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60626 1 3008 0 S wait 0xfffffe00d3cf8560 nrpe
    60624 60623 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60623 1 3008 0 S wait 0xfffffe00d3cf8000 nrpe
    60621 60620 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60620 1 3008 0 S wait 0xfffffe0317188060 nrpe
    60618 60617 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60617 1 3008 0 S wait 0xfffffe0317187b00 nrpe
    60615 60614 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60614 1 3008 0 S wait 0xfffffe03171875a0 nrpe
    60612 60611 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60611 1 3008 0 S wait 0xfffffe0317186ae0 nrpe
    60609 60608 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60608 1 3008 0 S wait 0xfffffe0317186580 nrpe
    60606 1186 60606 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
    60605 60604 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60604 1 3008 0 S wait 0xfffffe0317186020 nrpe
    60602 60601 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60601 1 3008 0 S wait 0xfffffe0317185ac0 nrpe
    60599 1202 1202 0 D voffloc 0xfffff8024db4966a perl
    60598 60597 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60597 1 3008 0 S wait 0xfffffe0317185560 nrpe
    60595 60594 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60594 1 3008 0 S wait 0xfffffe0317185000 nrpe
    60592 60591 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60591 1 3008 0 S wait 0xfffffe031724c5c0 nrpe
    60589 60588 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60588 1 3008 0 S wait 0xfffffe031724c060 nrpe
    60586 60585 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60585 1 3008 0 S wait 0xfffffe031724b5a0 nrpe
    60583 60582 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60582 1 3008 0 S wait 0xfffffe031724a580 nrpe
    60580 60579 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60579 1 3008 0 S wait 0xfffffe031724a020 nrpe
    60577 1186 60577 0 Ds aw.aew_ 0xfffffe0326e5a608 sshd-session
    60576 60575 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60575 1 3008 0 S wait 0xfffffe0317249560 nrpe
    60573 1202 1202 0 D aw.aew_ 0xfffffe0326df6478 perl
    60572 60571 3008 0 D db->db_ 0xfffff8058173af68 nrpe
    60571 1 3008 0 S wait 0xfffffe0317249000 nrpe
    5015 5010 5015 6263 Ss+ ttyin 0xfffff810aa50a8b0 zsh
    5010 5006 5006 6263 S select 0xfffff8024ca966c0 sshd-session
    5006 1186 5006 0 Ss select 0xfffff8024ca984c0 sshd-session
    3008 1 3008 0 Ss select 0xfffff80209dc98c0 nrpe
    2910 1 2910 0 Ds+ aw.aew_ 0xfffffe03274d66e8 getty

    This getty is the one running on the console tty, which was stuck.
    Note the wait channel is "aw.aew_cv", which is part of the logic for
    evicting buffers from the ARC. Other threads are waiting for a
    dbuf (ZFS disk buffer) object mutex.
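
    A minimal sketch, assuming the DDB ps output above has been captured as text (for example from a serial console log), of tallying the uninterruptible sleepers by wait message and wait channel. The name tally_waits and the eight-column layout are assumptions for illustration; real DDB output may leave the wmesg column empty for runnable threads, and this sketch simply skips such rows.

```python
from collections import Counter

# A few rows copied from the listing above
# (columns: pid ppid pgrp uid state wmesg wchan cmd).
SAMPLE = """\
60673 60672 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60672 1 3008 0 S wait 0xfffffe031ee41560 nrpe
60670 1186 60670 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
60669 1202 1202 0 D voffloc 0xfffff8024db4966a perl
60577 1186 60577 0 Ds aw.aew_ 0xfffffe0326e5a608 sshd-session
2910 1 2910 0 Ds+ aw.aew_ 0xfffffe03274d66e8 getty
"""


def tally_waits(text):
    """Count threads in state 'D...' per (wmesg, wchan) pair."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a full 8-column row.
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        state, wmesg, wchan = fields[4], fields[5], fields[6]
        if state.startswith("D"):
            counts[(wmesg, wchan)] += 1
    return counts


if __name__ == "__main__":
    for (wmesg, wchan), n in tally_waits(SAMPLE).most_common():
        print(f"{n:3d}  {wmesg:<8} {wchan}")
```

    Run against the full listing, a tally like this makes it easier to see how many threads are queued on a single dbuf mutex address versus how many are waiting in ARC eviction.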

    I'm currently planning on taking us to 14.4 later this spring, but it
    would be nice to know if anyone else has seen this bug or has a fix.
    I've tried dropping kern.maxvnodes and increasing
    vfs.zfs.arc_free_target, with no change in symptoms.

    This particular server is due to be replaced but the new disk array
    (which was ordered in January) won't ship until late April per the
    vendor.

    -GAWollman



    --
    Posted automagically by a mail2news gateway at muc.de e.V.
    Please direct questions, flames, donations, etc. to news-admin@muc.de
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 15:08:44 2026
    From Newsgroup: muc.lists.freebsd.stable

    I once saw a similar bug. In my case I had a process that mmap()ed
    some very large files on fusefs, consuming lots of inactive pages.
    When the system comes under memory pressure, it asks the ARC to evict
    first, so the ARC would end up shrinking down to arc_min every time.
    In my case, the solution was to set vfs.fusefs.data_cache_mode=0. I
    suspect that similar bugs could be possible with UFS or tmpfs, if they
    have giant files that are mmap()ed.

    A less effective workaround was to set vfs.zfs.arc.min to some
    reasonable value. That can prevent the ARC from shrinking too far. You
    could try that.

    Another thing you could try is to run "vmstat -o" when the system is
    in the problematic state. That will show you which vm objects are
    using the most inactive pages.
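
    A minimal sketch, assuming the "vmstat -o" output has been saved to text and that its columns are RES ACT INACT REF SHD CM TP PATH, of ranking VM objects by inactive pages. The sample rows, the paths in them, and the name top_inactive are hypothetical and are there only to exercise the parser.

```python
# Hypothetical sample in the assumed "vmstat -o" column layout.
SAMPLE = """\
  RES   ACT INACT REF SHD CM TP PATH
 5120   100  5020   1   0 WB vn /mnt/fuse0/disk.img
  300   290    10   4   2 WB vn /lib/libc.so.7
   64     0    64   1   0 WB df
"""


def top_inactive(vmstat_o_text, limit=5):
    """Return up to `limit` (inactive_pages, type, path) tuples, largest first."""
    rows = []
    for line in vmstat_o_text.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) < 7 or not fields[0].isdigit():
            continue  # header or malformed row
        inactive = int(fields[2])
        path = fields[7] if len(fields) > 7 else "(no path)"
        rows.append((inactive, fields[6], path))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for inactive, obj_type, path in top_inactive(SAMPLE):
        print(f"{inactive:10d} {obj_type:<3} {path}")
```

    The counts are in pages, so multiply by the page size to get bytes.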
    Hope this helps,
    -Alan
  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 17:23:33 2026
    From Newsgroup: muc.lists.freebsd.stable

    <<On Mon, 16 Mar 2026 15:08:44 -0600, Alan Somers <asomers@freebsd.org> said:

    > A less effective workaround was to set vfs.zfs.arc.min to some
    > reasonable value. That can prevent ARC from shrinking too far. You
    > could try that.

    So far as I can tell, the ARC doesn't actually shrink, and shouldn't
    need to given the gigabytes of free physmem at the time (well,
    immediately prior). Within 5 minutes of the crash, the total ARC size
    was 70 GiB, c_max was 127 GiB, and c_min was 4 GiB -- in practice it's
    never anywhere near that small. At the first observation after the
    server came back up, the ARC size was already over 20 GiB.

    Either *something* is causing the kernel to think it has no free
    memory when there's actually lots, or else something is causing the
    kernel to allocate gigabytes of RAM much faster than we can observe it happening.

    There's epsilon memory in the inactive queue on this system, before or
    after the crash: it's so small I can't even see the line on the graph.
    The 24-hour maximum is 268 MiB, or about 0.2% of RAM.
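
    A minimal sketch of the arithmetic behind these figures, using only numbers quoted in this thread (128 GiB of RAM, 70 GiB of ARC with about 60 GiB evictable, 32 GiB free, a 268 MiB inactive-queue peak). The variable names and the pct helper are illustrative, not taken from any tool.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

ram = 128 * GIB            # physical memory on the server that deadlocked
arc_size = 70 * GIB        # ARC size shortly before the crash
arc_evictable = 60 * GIB   # portion of the ARC reported as evictable
free_mem = 32 * GIB        # free memory at the time
inactive_peak = 268 * MIB  # 24-hour maximum of the inactive queue


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"inactive peak : {pct(inactive_peak, ram):.2f}% of RAM")
    print(f"free memory   : {pct(free_mem, ram):.2f}% of RAM")
    print(f"ARC           : {pct(arc_size, ram):.2f}% of RAM")
    print(f"ARC evictable : {pct(arc_evictable, arc_size):.2f}% of ARC")
```

    With a quarter of RAM free and most of the ARC evictable, the inactive queue is too small to account for the memory pressure, which is the point being made here.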

    > Another thing you could try is to run "vmstat -o" when the system is
    > in the problematic state.

    What's the equivalent in DDB? No getty, no login.

    -GAWollman



  • From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 15:29:00 2026
    From Newsgroup: muc.lists.freebsd.stable

    I don't know how to do it from ddb. But if you dump a core file, then
    you can run vmstat after rebooting like this:
    "vmstat -o -M /var/crash/vmcore.XXX". Also, did the kernel panic, or
    did you manually enter ddb? If it panicked, can you please share the
    stack trace?
  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 17:39:17 2026
    From Newsgroup: muc.lists.freebsd.stable

    BTW, Alan, mail to your freebsd.org mailbox bounces because you
    forward to gmail.

    <<On Mon, 16 Mar 2026 15:29:00 -0600, Alan Somers <asomers@freebsd.org> said:

    > I don't know how to do it from ddb. But if you dump a core file,

    None of our systems are set up for that. They all have huge memory
    and pretty tiny swap partitions, and in any case, they don't panic,
    they just deadlock. Or the OOM killer just shoots all user processes;
    these are nearly indistinguishable from a service provider's
    perspective.

    They're just NFS servers; they don't run anything else except what's
    necessary for monitoring and administration.

    -GAWollman



  • From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 15:55:59 2026
    From Newsgroup: muc.lists.freebsd.stable

    Pretty tiny swap partitions? Maybe that's the problem. I recall kib@
    telling me that some amount of swap is essential, even when plenty of
    RAM is available. But I can't remember why. So if you can't upgrade
    those tiny swap partitions, then I suggest you install an SSD just for
    use as a dump device. I've done that sometimes.
  • From Konstantin Belousov@kostikbel@gmail.com to muc.lists.freebsd.stable on Mon Mar 16 23:56:55 2026
    From Newsgroup: muc.lists.freebsd.stable

    On Mon, Mar 16, 2026 at 03:08:44PM -0600, Alan Somers wrote:
    > I once saw a similar bug. In my case I had a process that mmap()ed
    > some very large files on fusefs, consuming lots of inactive pages.
    > And when the system comes under memory pressure, it asks ARC to evict
    > first. So the ARC would end up shrinking down to arc_min every time.
    > In my case, the solution was to set vfs.fusefs.data_cache_mode=0 . I
    > suspect that similar bugs could be possible with UFS or tmpfs, if they
    > have giant files that are mmaped().

    What are 'similar bugs with UFS or tmpfs'?
    Can you please be more specific: what is the erroneous behavior?


  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 18:15:54 2026
    From Newsgroup: muc.lists.freebsd.stable

    <<On Mon, 16 Mar 2026 15:55:59 -0600, Alan Somers <asomers@freebsd.org> said:
    > On Mon, Mar 16, 2026 at 3:39 PM Garrett Wollman <wollman@bimajority.org> wrote:

    >> None of our systems are set up for that. They all have huge memory
    >> and pretty tiny swap partitions, and in any case, they don't panic,
    >> they just deadlock. Or the OOM killer just shoots all user processes;
    >> these are nearly indistinguishable from a service provider's
    >> perspective.

    > Pretty tiny swap partitions?

    Tiny compared to RAM, typically 16 or 32 GiB. After all, these are
    NFS servers; they shouldn't have more than a few dozen MiB of
    swappable anonymous memory.(*) We're not going to put a 2T SSD as a
    hopefully-never-to-be-used swap drive in a file server.

    I configured a dump device on the server that crashed today; if it
    crashes again when I'm at a keyboard I'll see if I can get it to write
    a dump in the 32 GiB of swap that it has configured.

    -GAWollman

    (*) If the kernel erroneously thinks it's out of free memory and
    swapping stuff out only opens up a few MiB, that would certainly
    explain why it goes on to ARC eviction and eventual OOM. On this
    server, after two hours of uptime, I see:

    Device 1K-blocks Used Avail Capacity
    /dev/gpt/swap0 33554432 32132 33522300 0%
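
    A minimal sketch converting the swap figures above from 1K-blocks into more familiar units. The numbers are the ones in the table; the name swap_summary is illustrative.

```python
def swap_summary(blocks_1k, used_1k):
    """Convert swap figures in 1K-blocks to (size GiB, used MiB, used %)."""
    size_gib = blocks_1k / (1024 ** 2)  # 1 KiB blocks -> GiB
    used_mib = used_1k / 1024           # 1 KiB blocks -> MiB
    used_pct = 100.0 * used_1k / blocks_1k
    return size_gib, used_mib, used_pct


if __name__ == "__main__":
    # Values for /dev/gpt/swap0 from the table above.
    size_gib, used_mib, used_pct = swap_summary(33554432, 32132)
    print(f"swap size: {size_gib:.0f} GiB")
    print(f"swap used: {used_mib:.1f} MiB ({used_pct:.3f}%)")
```

    The used figure works out to roughly 31 MiB, consistent with the "few dozen MiB" of swappable anonymous memory estimated above, and small enough that the Capacity column shows 0%.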
  • From Sulev-Madis Silber@freebsd-stable-freebsd-org730@ketas.si.pri.ee to muc.lists.freebsd.stable on Tue Mar 17 00:15:24 2026
    From Newsgroup: muc.lists.freebsd.stable

    I have had problems like this for a long time, but on 13.*, and the RAM
    was absolutely tiny. The problems seemed to appear when something
    mmapped files on ZFS. This has puzzled me for years; if I limit mmap,
    nothing fails, except maybe for strange file errors in some
    memory-pressure cases. Unrelated?

    Some others also had questionably similar issues just a while ago on
    Actual Servers (TM): same issue, processes get killed. Or in this case
    here, wired doesn't rapidly increase?

    Also, I/O speeds seem to affect things, or I don't know what to think
    of this. But the problems described here also involve some mysterious
    memory usage in ZFS. I don't know.
  • From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 16:18:38 2026
    From Newsgroup: muc.lists.freebsd.stable

    On Mon, Mar 16, 2026 at 3:57 PM Konstantin Belousov <kostikbel@gmail.com> wrote:

    > On Mon, Mar 16, 2026 at 03:08:44PM -0600, Alan Somers wrote:
    >> I once saw a similar bug. In my case I had a process that mmap()ed
    >> some very large files on fusefs, consuming lots of inactive pages.
    >> And when the system comes under memory pressure, it asks ARC to evict
    >> first. So the ARC would end up shrinking down to arc_min every time.
    >> In my case, the solution was to set vfs.fusefs.data_cache_mode=0 . I
    >> suspect that similar bugs could be possible with UFS or tmpfs, if they
    >> have giant files that are mmaped().
    >
    > What are 'similar bugs with UFS or tmpfs'?
    > Can you please be more specific, what is the erroneous behavior?

    I experienced this bug in 2021, and reproduced it on both FreeBSD 12.2
    and 13.0. The setup was:

    * A ZFS-root server with hundreds of GB of RAM and hundreds of TB of
      ZFS, with a complicated ZFS workload.
    * A custom fusefs file system. Each fusefs mountpoint presented a
      small number of files, some huge, and was backed by a file on ZFS
      itself.
    * A ctld target for each fusefs mountpoint, backed by one file on that
      mountpoint.

    "vmstat -o" showed that each of those ctld targets consumed a huge
    amount of inactive memory. Basically, ctld was mmap()ing the whole file
    and never releasing any pages. The dtrace sdt:zfs:none:arc-needfree
    probe showed that the page daemon was frequently asking ZFS to free
    memory from ARC. ZFS complied, and the ARC size would slowly shrink
    down to vfs.zfs.arc_min. In my case, there was no crash, and the OOM
    killer wasn't involved, but performance suffered. Setting
    vfs.fusefs.data_cache_mode=0 was a perfect workaround for us, so I
    never investigated further.

    When I say that I suspect similar bugs may exist with UFS or tmpfs,
    I'm suspecting that if ctld exports huge files from those file systems
    on a mixed UFS/ZFS system, then they might consume huge amounts of
    inactive pages. But I've never checked.

    -Alan
  • From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 16:27:11 2026
    From Newsgroup: muc.lists.freebsd.stable

    On Mon, Mar 16, 2026 at 4:16 PM Garrett Wollman <wollman@bimajority.org> wrote:

    > <<On Mon, 16 Mar 2026 15:55:59 -0600, Alan Somers <asomers@freebsd.org> said:
    >
    >> On Mon, Mar 16, 2026 at 3:39 PM Garrett Wollman <wollman@bimajority.org> wrote:
    >>
    >>> None of our systems are set up for that. They all have huge memory
    >>> and pretty tiny swap partitions, and in any case, they don't panic,
    >>> they just deadlock. Or the OOM killer just shoots all user processes;
    >>> these are nearly indistinguishable from a service provider's
    >>> perspective.
    >>
    >> Pretty tiny swap partitions?
    >
    > Tiny compared to RAM, typically 16 or 32 GiB. After all, these are
    > NFS servers, they shouldn't have more than a few dozen MiB of
    > swappable anonymous memory.(*) We're not going to put a 2T SSD as a
    > hopefully-never-to-be-used swap drive in a file server.

    You won't need a 2 TB SSD. By default, FreeBSD will make a mini dump,
    which excludes most of ARC and most memory used by userspace programs.
    For example, a recent core dump of mine takes 40 GB on a system with 1
    TB of RAM. Note that I'm setting dumpon_flags="-Z" to enable core
    dump compression. That makes the dump go faster, as well as use less
    space. See dumpon(8) for more information about full vs mini core
    dumps.


  • From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 17:32:42 2026
    From Newsgroup: muc.lists.freebsd.stable

    On Mon, Mar 16, 2026 at 5:30 PM Mark Millard <marklmi@yahoo.com> wrote:

    > ZFS is documented to have the property: "This approach provides
    > coherency between memory-mapped and IO access at the expense of wasted
    > memory due to having two copies of the file in memory and extra
    > overhead caused by the need to copy the contents between the two copies."
    > (Chapter 10, page 548, last bullet item of the 2nd edition of the design
    > and implementation book.)

    That's not relevant in this case, because no ZFS file was mmap()ed.
    It was only a fusefs file that was mmap()ed.