• "failed to reclaim memory" with much free physmem

    From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Tue Sep 9 12:19:42 2025

    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time, and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.) Does this sound familiar to
    anyone? What should we be monitoring that we evidently aren't now?

    -GAWollman



  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Tue Sep 9 12:19:21 2025

    Garrett Wollman <wollman_at_bimajority.org> wrote on
    Date: Tue, 09 Sep 2025 16:19:42 UTC :
    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time,
    Was that 100G+ somewhat before any reclaiming of memory started,
    the lead-up to the notice? Any likelihood of sudden, rapid,
    huge drops in free RAM based on workload behavior?
    Some other figures from the lead-up to the OOM activity
    would be snapshots of the likes of top's:
    Active, Inact, Laundry, Wired, and Free
    (things in Buf also show up in the other categories)
    Is NUMA involved?
    and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.)
    That swap is in use at all could be of interest. I wonder
    what it was doing when the swap was put to use, or when laundry
    was growing in a way that led to swap being put to use.
    Does this sound familiar to
    anyone? What should we be monitoring that we evidently aren't now?
    I'll note that you can delay the "failed to
    reclaim memory" OOM activity via the use of
    the likes of:
    # sysctl vm.pageout_oom_seq=120
    FYI:
    # sysctl -d vm.pageout_oom_seq
    vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
    The default is 12 and larger gives more delay by
    causing more attempts to meet the threshold
    involved before OOM is used. No figure gives an
    unbounded delay so far as I know. (I do not know
    anything about the "counts wrap" behavior.)
    But if the conditions have a bounded duration,
    vm.pageout_oom_seq can make OOM activity be
    avoided over that duration fairly generally.
    (Even just one thread can keep the Active memory
    so large as to not meet the free RAM threshold(s)
    involved, even if swap is unused.)
    Someone might want to see some of the output from
    the likes of something like:
    # sysctl vm | grep -v "^vm\.uma\." | grep -e "\.v_" -e stats -e oom_seq | sort
    from the lead-up to a "failed to reclaim memory".
    Having a larger vm.pageout_oom_seq can make it
    easier to observe the lead-up time frame.
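    A hedged illustration (not from the message above; the log path and the
    600-second interval are assumptions) of applying this advice: raise the
    OOM delay and periodically capture the suggested counters so the lead-up
    to an event is preserved.

    # Larger back-to-back count before OOM is invoked (120 is the example value above).
    sysctl vm.pageout_oom_seq=120
    printf 'vm.pageout_oom_seq=120\n' >> /etc/sysctl.conf

    # Snapshot the VM counters listed above, with timestamps, every 10 minutes.
    while sleep 600; do
        date
        sysctl vm | grep -v '^vm\.uma\.' | grep -e '\.v_' -e stats -e oom_seq | sort
    done >> /var/log/vm-leadup.log 2>&1 &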
    ===
    Mark Millard
    marklmi at yahoo.com
  • From Mark Saad@nonesuch@longcount.org to muc.lists.freebsd.stable on Wed Sep 10 07:50:20 2025


    On Sep 9, 2025, at 12:20 PM, Garrett Wollman <wollman@bimajority.org> wrote:

    On some of our newer large-memory NFS servers, we are seeing services killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time, and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.) Does this sound familiar to
    anyone? What should we be monitoring that we evidently aren't now?

    -GAWollman

    Garrett
    What version of FreeBSD is this? ('uname -a') There was some chatter about a ZFS issue on 14 where memory usage was incorrectly increasing. I can't find the thread; let me search around. In the meantime, tell us about the disks, filesystem, etc.
    ---
    Mark Saad | nonesuch@longcount.org

  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Thu Sep 11 13:58:22 2025

    <<On Tue, 9 Sep 2025 12:19:21 -0700, Mark Millard <marklmi@yahoo.com> said:

    Garrett Wollman <wollman_at_bimajority.org> wrote on
    Date: Tue, 09 Sep 2025 16:19:42 UTC :

    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time,

    Was that 100G+ somewhat before any reclaiming of memory started,
    the lead-up to the notice?

    That was within five minutes of munin-node getting shot by the OOM
    killer. There was much less memory free ca. 24 hours before the
    event.

    Any likelihood of sudden, rapid, huge drops in free RAM based on
    workload behavior?

    I don't have access to client workloads, but it would have to be a bug
    in ZFS if so; these are file servers, all they run is NFS.

    Is NUMA involved?

    Damn if I know.

    and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.)

    That swap is in use at all could be of interest. I wonder
    what it was doing when the swap was put to use, or when laundry
    was growing in a way that led to swap being put to use.

    It's pretty normal on these servers, which stay up for six months
    between OS upgrades, for some userland daemons to get swapped out,
    although I agree that it seems like it shouldn't happen given that the
    size of memory (1 TiB) is much greater than the size of running
    processes (< 1 GiB).

    My suspicion here is that there's some sort of accounting error, but I
    don't know where to look, and I only have data retrospectively, and
    only the data that munin is collecting. (Someone else was on call
    when this happened most recently and they reported that their login
    shell kept on getting shot -- as was the getty on the serial console.)

    -GAWollman



  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Thu Sep 11 14:09:58 2025

    <<On Wed, 10 Sep 2025 07:50:20 -0400, Mark Saad <nonesuch@longcount.org> said:

    what version of FreeBSD is this ?

    Ah, yes, it's 14.3-RELEASE. It's an NFS server with three zpools
    (boot plus two exported), total about 760 TiB.

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Thu Sep 11 16:23:12 2025

    On Thu, Sep 11, 2025 at 11:10 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Wed, 10 Sep 2025 07:50:20 -0400, Mark Saad <nonesuch@longcount.org> said:

    what version of FreeBSD is this ?

    Ah, yes, it's 14.3-RELEASE. It's an NFS server with three zpools
    (boot plus two exported), total about 760 TiB.
    I'm no ZFS guy, so I'm probably the last guy you should listen to,
    but I'd suggest you look at sys/contrib/openzfs/module/os/linux/zfs/arc_os.c. Why?
    Because there is a bunch of stuff in there that isn't in sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c
    and it might give you some hints w.r.t. tuning the arc?
    Also, I've mentioned this before, but if you choose to not post to freebsd-current@, I'd suggest you at least cc a few people who
    work in the area (mav@, asomers@, markj@ and maybe a couple
    more). It at least seems to me that they don't read freebsd-stable@
    often.
    rick

    -GAWollman


  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Thu Sep 11 17:22:10 2025

    On Thu, Sep 11, 2025 at 10:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Tue, 9 Sep 2025 12:19:21 -0700, Mark Millard <marklmi@yahoo.com> said:

    Garrett Wollman <wollman_at_bimajority.org> wrote on
    Date: Tue, 09 Sep 2025 16:19:42 UTC :

    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time,

    Was that 100G+ somewhat before any reclaiming of memory started,
    the lead-up to the notice?

    That was within five minutes of munin-node getting shot by the OOM
    killer. There was much less memory free ca. 24 hours before the
    event.

    Any likelihood of sudden, rapid, huge drops in free RAM based on
    workload behavior?

    I don't have access to client workloads, but it would have to be a bug
    in ZFS if so; these are file servers, all they run is NFS.
    Bug or tuning weakness?
    If you look at sys/contrib/openzfs/module/os/linux/zfs/arc_os.c, it does
    a bunch of arm-waving to set arc_sys_free, whereas sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c doesn't do anything.
    I'd try tuning it via vfs.zfs.arc.sys_free?
    (The default is 0 and that says "use all of the memory" if I read it
    correctly. I probably haven't read it correctly, which was why I suggested
    you compare the two of them.)
    rick
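    A minimal sketch of trying that knob (the sysctl name is as given above;
    the 16 GiB value and the /etc/sysctl.conf persistence are assumptions, and
    some builds may require setting it from /boot/loader.conf instead):

    # Ask the ARC to keep roughly 16 GiB of RAM free for everything else.
    sysctl vfs.zfs.arc.sys_free=17179869184
    printf 'vfs.zfs.arc.sys_free=17179869184\n' >> /etc/sysctl.conf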

    Is NUMA involved?

    Damn if I know.

    and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.)

    That swap is in use at all could be of interest. I wonder
    what it was doing when the swap was put to use, or when laundry
    was growing in a way that led to swap being put to use.

    It's pretty normal on these servers, which stay up for six months
    between OS upgrades, for some userland daemons to get swapped out,
    although I agree that it seems like it shouldn't happen given that the
    size of memory (1 TiB) is much greater than the size of running
    processes (< 1 GiB).

    My suspicion here is that there's some sort of accounting error, but I
    don't know where to look, and I only have data retrospectively, and
    only the data that munin is collecting. (Someone else was on call
    when this happened most recently and they reported that their login
    shell kept on getting shot -- as was the getty on the serial console.)

    -GAWollman


  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Thu Sep 11 18:32:17 2025

    On Thu, Sep 11, 2025 at 11:10 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Wed, 10 Sep 2025 07:50:20 -0400, Mark Saad <nonesuch@longcount.org> said:

    what version of FreeBSD is this ?

    Ah, yes, it's 14.3-RELEASE. It's an NFS server with three zpools
    (boot plus two exported), total about 760 TiB.
    One simple thing you could do that might provide some insight
    into what is going on is..
    - do "nfsstat -s" in a loop (once/sec) along with "date" written out
    to some log file on the server.
    - Then when the problem shows up, look at the log file and see what
    RPCs/operations load the server was experiencing.
    (read vs write vs lookup vs ???)
    rick
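    A minimal sketch of that logging loop (the log path and the once-per-second
    interval are assumptions):

    # Timestamped server-side RPC counters, once per second.
    while :; do
        date
        nfsstat -s
        sleep 1
    done >> /var/log/nfsstat-s.log 2>&1 &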

    -GAWollman


  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 00:05:34 2025

    <<On Thu, 11 Sep 2025 18:32:17 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    One simple thing you could do that might provide some insight
    into what is going on is..
    - do "nfsstat -s" in a loop (once/sec) along with "date" written out
    to some log file on the server.
    - Then when the problem shows up, look at the log file and see what
    RPCs/operations load the server was experiencing.
    (read vs write vs lookup vs ???)

    We monitor NFS ops with munin, same as everything else, every five
    minutes. The more detailed data has already rolled off the RRDs, but
    in the half-hour before the OOM event, write ops spiked to a (still
    quite tame) 5,000 per second. That's well below observed peaks in
    writes over every averaging interval.[1] (The other NFS ops that you'd
    expect to see for a v4 client doing lots of writes increased as well,
    about one open/close pair per four write ops.) So I don't think it's
    anything NFS is doing on its own, but might be something ZFS is doing
    badly when the writes hit.

    The server continued to operate, with various other daemons getting
    shot as the OOM killer rampaged, until the on-call person got alerted
    by our monitoring. Never less than 105G physmem free in the 12 hours
    leading up to the event. It took about 36 hours after a hard reboot
    for the system to get back to the same level of free RAM and to start
    swapping out idle daemons.

    <https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png>
    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    One thing that was going on when the crash happened is that we were
    demoing the Bacula Enterprise client on one large filesystem, using
    their new support for using `zfs diff` to speed up incrementals, and
    it was taking an unexpectedly long time. No idea at this point
    whether that might be a cause or a symptom.

    -GAWollman

    [1] We've had some days when the *24-hour* average write op rate has
    been over 30,000 per second, although I can't say whether that
    happened under 13.3, 13.4, or 14.3, all of which we've run on this
    server in the past 12 months.


  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Fri Sep 12 08:23:34 2025

    Garrett Wollman <wollman_at_bimajority.org>
    Date: Fri, 12 Sep 2025 04:05:34 UTC
    . . .
    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.
    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?
    ===
    Mark Millard
    marklmi at yahoo.com
  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 11:41:39 2025

    <<On Fri, 12 Sep 2025 08:23:34 -0700, Mark Millard <marklmi@yahoo.com> said:

    [I wrote:]
    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?

    Totally normal, that's the ARC warming up with client activity.
    Typical machine learning datasets these days are on the order of a
    terabyte, so they won't entirely fit in memory. (These systems also
    have 2 TB of L2ARC but that gets discarded on reboot, so obviously
    we'd like to avoid reboots.)

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 08:41:43 2025

    On Fri, Sep 12, 2025 at 8:24 AM Mark Millard <marklmi@yahoo.com> wrote:

    Garrett Wollman <wollman_at_bimajority.org>
    Date: Fri, 12 Sep 2025 04:05:34 UTC

    . . .

    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?
    Given your report here: https://lists.freebsd.org/archives/freebsd-stable/2025-August/003024.html
    and an offlist report I got from Peter Errikson (copied into the above thread), I'd guess the problem was introduced by the transition to ZoL (which means 14.n, I think?).
    I don't know who the best guys to figure this out would be, but I suspect
    more of them will notice if you post to freebsd-current@ (yes, I know freebsd-stable@ is technically the correct list, but if the right people
    don't see the post..).
    I'd really like to see this figured out, but I have no idea how to
    proceed. As I noted, there is a lot of arc related stuff in the Linux
    port that is not in the FreeBSD port of ZFS, but I have no idea if/what
    needs to be done?
    rick
    ps: I've at least added a couple of cc's in the hope they might have
    some ideas.

    ===
    Mark Millard
    marklmi at yahoo.com


  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 08:50:10 2025

    On Fri, Sep 12, 2025 at 8:42 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:23:34 -0700, Mark Millard <marklmi@yahoo.com> said:

    [I wrote:]
    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?

    Totally normal, that's the ARC warming up with client activity.
    Typical machine learning datasets these days are on the order of a
    terabyte, so they won't entirely fit in memory. (These systems also
    have 2 TB of L2ARC but that gets discarded on reboot, so obviously
    we'd like to avoid reboots.)
    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)
    rick

    -GAWollman


  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 11:58:27 2025

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 09:25:35 2025

    On Fri, Sep 12, 2025 at 8:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.
    Maybe. A lot of things, like the size of the buffer cache and buckets used
    for malloc, etc. are tuned when the system boots,
    based on how much ram the system has.
    I have no idea what those numbers look like for a 1Tbyte system.
    In other words, if the system booted thinking it has 2Gbytes of ram
    I suspect you would be correct, but if the system boots thinking it
    has 1Tbyte of ram, then???
    rick

    -GAWollman

  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Fri Sep 12 10:22:02 2025

    On Sep 12, 2025, at 08:23, Mark Millard <marklmi@yahoo.com> wrote:
    Garrett Wollman <wollman_at_bimajority.org>
    Date: Fri, 12 Sep 2025 04:05:34 UTC

    . . .

    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?
    At various stages, what does:
    # sysctl vm | grep -e stats.free_ -e stats.vm.v_free_
    show? In my current context, an example
    output is (single domain context):
    # sysctl vm | grep -e stats.free_ -e stats.vm.v_free_
    vm.domain.0.stats.free_severe: 186376
    vm.domain.0.stats.free_min: 308882
    vm.domain.0.stats.free_reserved: 63871
    vm.domain.0.stats.free_target: 1043915
    vm.domain.0.stats.free_count: 41010336
    vm.stats.vm.v_free_severe: 186376
    vm.stats.vm.v_free_count: 41010331
    vm.stats.vm.v_free_min: 308882
    vm.stats.vm.v_free_target: 1043915
    vm.stats.vm.v_free_reserved: 63871
    It would not look as redundant for a multi-domain
    context.
    More detail about some of what would be output
    is below.
    There are the figures (shown for a non-NUMA context,
    so only the 1 domain):
    # sysctl -d vm.domain | grep "\.stats\.free_"
    vm.domain.0.stats.free_severe: Severe free pages
    vm.domain.0.stats.free_min: Minimum free pages
    vm.domain.0.stats.free_reserved: Reserved free pages
    vm.domain.0.stats.free_target: Target free pages
    vm.domain.0.stats.free_count: Free pages
    # sysctl vm.domain | grep "\.stats\.free_"
    vm.domain.0.stats.free_severe: 186376
    vm.domain.0.stats.free_min: 308882
    vm.domain.0.stats.free_reserved: 63871
    vm.domain.0.stats.free_target: 1043915
    vm.domain.0.stats.free_count: 40923251
    The domain's vmd_oom_seq value increments
    when there is a shortage that has not
    changed and:
    vmd->vmd_free_count < vmd->vmd_pageout_wakeup_thresh
    where:
    vmd->vmd_pageout_wakeup_thresh = (vmd->vmd_free_target / 10) * 9
    Or, in terms of the sysctl interface:
    (vm.domain.?.stats.free_target / 10) * 9
    (It is not explicitly published via sysctl from what
    I saw.)
    The domain's vmd_oom_seq value is compared to the
    value reported by vm.pageout_oom_seq but there is
    "voting" across all the domains for the overall oom
    decision.
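    As an illustration of that arithmetic (not part of the original message),
    the unpublished wakeup threshold can be derived from the published sysctls:

    # vmd_pageout_wakeup_thresh = (free_target / 10) * 9, per domain.
    target=$(sysctl -n vm.domain.0.stats.free_target)
    free=$(sysctl -n vm.domain.0.stats.free_count)
    echo "domain 0: free=${free} wakeup_thresh=$(( target / 10 * 9 ))"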
    There are 2 figures that just ZFS uses:
    /usr/main-src/sys/contrib/openzfs/module/os/freebsd/zfs/sysctl_os.c: if (val < minfree)
    /usr/main-src/sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c: zfs_arc_free_target = vm_cnt.v_free_target;
    /usr/main-src/sys/contrib/openzfs/include/os/freebsd/spl/sys/kmem.h:#define minfree vm_cnt.v_free_min
    In sysctl terms these are in the list:
    # sysctl -d vm.stats.vm | grep "\<v_free_"
    vm.stats.vm.v_free_severe: Severe page depletion point
    vm.stats.vm.v_free_count: Free pages
    vm.stats.vm.v_free_min: Minimum low-free-pages threshold
    vm.stats.vm.v_free_target: Pages desired free
    vm.stats.vm.v_free_reserved: Pages reserved for deadlock
    # sysctl vm.stats.vm | grep "\<v_free_"
    vm.stats.vm.v_free_severe: 186376
    vm.stats.vm.v_free_count: 40997647
    vm.stats.vm.v_free_min: 308882
    vm.stats.vm.v_free_target: 1043915
    vm.stats.vm.v_free_reserved: 63871
    These are overall, not per-NUMA-domain.
    ZFS does not seem to do per-NUMA-domain memory
    usage management: no interface used for such
    information as far as I've seen.
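    A hedged way to compare those figures on a running system (the sysctl name
    vfs.zfs.arc.free_target is assumed for 14.x; older builds spell it
    vfs.zfs.arc_free_target):

    # ZFS's free target versus the VM's own free_target and free_min.
    sysctl vfs.zfs.arc.free_target vm.stats.vm.v_free_target vm.stats.vm.v_free_min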
    ===
    Mark Millard
    marklmi at yahoo.com
  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 14:15:14 2025

    So we just had the exact same failure on a much older, much smaller
    NFS server (only 128G RAM, 6G free). Really not much activity going
    on at the time, but this server was upgraded to 14.3 on the same day
    as the other server, so both had 70 days of uptime.

    Wondering now if I should enable `vm.panic_on_oom` across the fleet,
    because these servers can often reboot in less time than our monitoring
    takes to notice a fault (particularly when only some daemons are
    getting killed).
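    A minimal sketch of doing that (the dumpdev hint is an assumption, only
    relevant if a crash dump for post-mortem analysis is wanted):

    # Panic (and reboot) on OOM instead of shooting daemons one by one.
    sysctl vm.panic_on_oom=1
    printf 'vm.panic_on_oom=1\n' >> /etc/sysctl.conf
    # Optionally set dumpdev="AUTO" in /etc/rc.conf so savecore(8) keeps the panic dump.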

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 17:46:33 2025

    On Fri, Sep 12, 2025 at 8:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.
    If you look at arc_default_max() in sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c
    you'll see it returns "allmem - 1Gbyte".
    This may make sense for a machine with a few Gbytes of ram, but I'd bump it
    up for machines like you have. (As I noted, a system that boots with 128Gbyte->1Tbyte
    of ram is going to size things a lot larger and "allmem" looks like
    the total ram in the
    system. They haven't even subtracted out what the kernel uses.)
    (Disclaimer: I know nothing about ZFS, so the above may be crap!!)
    It's a trivial function to patch, rick
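    A quick, read-only check of that ceiling on a running box (vfs.zfs.arc_max
    is the legacy sysctl name; newer builds also expose vfs.zfs.arc.max):

    # ARC ceiling versus installed RAM; arc_default_max() is allmem minus 1 GiB.
    sysctl hw.physmem kstat.zfs.misc.arcstats.c_max vfs.zfs.arc_max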

    -GAWollman

  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 18:29:30 2025

    On Fri, Sep 12, 2025 at 5:46 PM Rick Macklem <rick.macklem@gmail.com> wrote:

    On Fri, Sep 12, 2025 at 8:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.
    If you look at arc_default_max() in sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c
    you'll see it returns "allmem - 1Gbyte".
    This may make sense for a machine with a few Gbytes of ram, but I'd bump it up for machines like you have. (As I noted, a system that boots with 128Gbyte->1Tbyte
    of ram is going to size things a lot larger and "allmem" looks like
    the total ram in the
    system. They haven't even subtracted out what the kernel uses.)
    (Disclaimer: I know nothing about ZFS, so the above may be crap!!)

    It's a trivial function to patch, rick
    Here's another simple one..look at..
    # sysctl -a | fgrep maxmbufmem
    It appears to be set to 1/2 of the physical memory for me.
    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.
    rick
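    A hedged one-liner (assumes stock sh, paste, and awk) to see what fraction
    of physical memory the mbuf limit is allowed to claim:

    # Ratio of the mbuf memory limit to physical memory.
    sysctl -n kern.ipc.maxmbufmem hw.physmem | paste - - |
        awk '{ printf "maxmbufmem/physmem = %.2f\n", $1 / $2 }'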


    -GAWollman

  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 21:35:26 2025

    <<On Fri, 12 Sep 2025 18:29:30 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out. And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 19:09:29 2025

    On Fri, Sep 12, 2025 at 6:35 PM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 18:29:30 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out. And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.
    I don't recall you mentioning your NIC speed, but 10 Gbps is about 1 Gbyte/sec, so 100 GiB would take roughly 100 seconds to fill. But you certainly could be correct.
    rick

    -GAWollman

  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 19:25:25 2025

    On Fri, Sep 12, 2025 at 7:09 PM Rick Macklem <rick.macklem@gmail.com> wrote:

    On Fri, Sep 12, 2025 at 6:35 PM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 18:29:30 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out.
    The problem is that it must react quickly and aggressively enough.
    If you've ever studied queueing theory, you know that it is difficult to impossible to stabilize a system without feedback. For NFS the feedback is
    the replies to RPCs that throttle the clients. However, throw in a bunch of clients and large TCP send windows and the feedback doesn't happen that quickly.
    If I were trying to fix this, I'd start by either:
    - setting vfs.zfs.arc_max to a much smaller value than 99.9% and see
    if that stabilizes the server. If I was lucky and it did, I'd slowly increase
    the value and then cut it down by a fair amount after I saw the first failure.
    (I might also be tempted to decrease kern.ipc.maxmbufmem; a sketch of this first option follows below.)
    OR
    - I'd take a good look at the old FreeBSD 13.n code and see how it
    adjusted the arc and then try and make the new code do the same
    thing. (I noted that there is a lot more code in the Linux port than
    the FreeBSD port of the current ZFS code, found in os/<name>/zfs/arc_os.c.)
    If I had a setup where I could test/play with this, I think it would be
    kinda fun, but I doubt something done on a 4Gbyte laptop is going
    to produce similar results, especially when I really only have one NFS
    client to generate load against it.
    Good luck with whatever you try, rick
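    A sketch of the first option above (the 768 GiB starting point is purely an
    assumption, not a value from the thread; vfs.zfs.arc_max may be spelled
    vfs.zfs.arc.max on newer builds, and may only be settable from loader.conf
    on some of them):

    # Cap the ARC well below physical RAM; 768 GiB = 824633720832 bytes.
    sysctl vfs.zfs.arc_max=824633720832
    printf 'vfs.zfs.arc_max="824633720832"\n' >> /boot/loader.conf
    # kern.ipc.maxmbufmem can likewise be lowered via /boot/loader.conf if desired.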
    And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.
    I don't recall you mentioning your NIC speed, but 10 Gbps is about 1 Gbyte/sec, so 100 GiB would take roughly 100 seconds to fill. But you certainly could be correct.

    rick


    -GAWollman

  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Tue Sep 16 21:33:26 2025

    <<On Fri, 12 Sep 2025 21:35:26 -0400, Garrett Wollman <wollman@bimajority.org> said:

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out. And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.

    The past couple of nights we've had failures of other NFS servers
    (same FreeBSD build, different hardware, different clients, different
    data). The most recent one, unlike the one I started this thread
    with, didn't get so far as to invoke the OOM killer -- it seems to
    have been stuck in arc_wait_for_eviction(). I wasn't in a position to
    get a backtrace, so I can't tell if this was the call from
    arc_get_data_impl() (which is called for every block allocated but
    normally just returns immediately) or the one from arc_lowmem() (which
    is ultimately called from the vm_lowmem event handler when the system
    is really out of memory).

    As with previous failures, this one was with plenty of physical memory seemingly available (20 GiB out of 96 GiB). Separate swap partition,
    of course, and after 34 minutes memory allocation is pretty much back
    to where it was before the crash.

    -GAWollman


