• "failed to reclaim memory" with much free physmem

    From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Tue Sep 9 12:19:42 2025

    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time, and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.) Does this sound familiar to
    anyone? What should we be monitoring that we evidently aren't now?

    -GAWollman



  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Tue Sep 9 12:19:21 2025

    Garrett Wollman <wollman_at_bimajority.org> wrote on
    Date: Tue, 09 Sep 2025 16:19:42 UTC :
    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time,
    Was that 100G+ somewhat before any reclaiming of memory started,
    the lead-up to the notice? Any likelihood of sudden, rapid,
    huge drops in free RAM based on workload behavior?
    Some other figures from the lead-up to the OOM activity
    would be snapshots of the likes of top's:
    Active, Inact, Laundry, Wired, and Free
    (things in Buf also show up in the other categories)
    Is NUMA involved?
    and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.)
    That swap is in use at all could be of interest. I wonder
    what it was doing when the swap was put to use, or when laundry
    was growing in a way that led to swap being put to use.
    Does this sound familiar to
    anyone? What should we be monitoring that we evidently aren't now?
    I'll note that you can delay the "failed to
    reclaim memory" OOM activity via the use of
    the likes of:
    # sysctl vm.pageout_oom_seq=120
    FYI:
    # sysctl -d vm.pageout_oom_seq
    vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
    The default is 12 and larger gives more delay by
    causing more attempts to meet the threshold
    involved before OOM is used. No figure gives an
    unbounded delay so far as I know. (I do not know
    anything about the "counts wrap" behavior.)
    But if the conditions have a bounded duration,
    vm.pageout_oom_seq can make OOM activity be
    avoided over that duration fairly generally.
    (Even just one thread can keep the Active memory
    so large as to not meet the free RAM threshold(s)
    involved, even if swap is unused.)
    Someone might want to see some of the output from
    the likes of something like:
    # sysctl vm | grep -v "^vm\.uma\." | grep -e "\.v_" -e stats -e oom_seq | sort
    from the lead-up to a "failed to reclaim memory".
    Having a larger vm.pageout_oom_seq can make it
    easier to observe the lead-up time frame.
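    A hedged illustration (not from the message above; the log path and the
    600-second interval are assumptions) of applying this advice: raise the
    OOM delay and periodically capture the suggested counters so the lead-up
    to an event is preserved.

    # Larger back-to-back count before OOM is invoked (120 is the example value above).
    sysctl vm.pageout_oom_seq=120
    printf 'vm.pageout_oom_seq=120\n' >> /etc/sysctl.conf

    # Snapshot the VM counters listed above, with timestamps, every 10 minutes.
    while sleep 600; do
        date
        sysctl vm | grep -v '^vm\.uma\.' | grep -e '\.v_' -e stats -e oom_seq | sort
    done >> /var/log/vm-leadup.log 2>&1 &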
    ===
    Mark Millard
    marklmi at yahoo.com
  • From Mark Saad@nonesuch@longcount.org to muc.lists.freebsd.stable on Wed Sep 10 07:50:20 2025


    On Sep 9, 2025, at 12:20 PM, Garrett Wollman <wollman@bimajority.org> wrote:

    On some of our newer large-memory NFS servers, we are seeing services killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time, and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.) Does this sound familiar to
    anyone? What should we be monitoring that we evidently aren't now?

    -GAWollman

    Garrett
    What version of FreeBSD is this? ('uname -a') There was some chatter about a ZFS issue on 14 where memory usage was incorrectly increasing. I can't find the thread; let me search around. In the meantime, tell us about the disks, filesystem, etc.
    ---
    Mark Saad | nonesuch@longcount.org

  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Thu Sep 11 13:58:22 2025

    <<On Tue, 9 Sep 2025 12:19:21 -0700, Mark Millard <marklmi@yahoo.com> said:

    Garrett Wollman <wollman_at_bimajority.org> wrote on
    Date: Tue, 09 Sep 2025 16:19:42 UTC :

    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time,

    Was that 100G+ somewhat before any reclaiming of memory started,
    the lead-up to the notice?

    That was within five minutes of munin-node getting shot by the OOM
    killer. There was much less memory free ca. 24 hours before the
    event.

    Any likelihood of sudden, rapid, huge drops in free RAM based on
    workload behavior?

    I don't have access to client workloads, but it would have to be a bug
    in ZFS if so; these are file servers, all they run is NFS.

    Is NUMA involved?

    Damn if I know.

    and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.)

    That swap is in use at all could be of interest. I wonder
    what it was doing when the swap was put to use, or when laundry
    was growing in a way that led to swap being put to use.

    It's pretty normal on these servers, which stay up for six months
    between OS upgrades, for some userland daemons to get swapped out,
    although I agree that it seems like it shouldn't happen given that the
    size of memory (1 TiB) is much greater than the size of running
    processes (< 1 GiB).

    My suspicion here is that there's some sort of accounting error, but I
    don't know where to look, and I only have data retrospectively, and
    only the data that munin is collecting. (Someone else was on call
    when this happened most recently and they reported that their login
    shell kept on getting shot -- as was the getty on the serial console.)

    -GAWollman



  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Thu Sep 11 14:09:58 2025

    <<On Wed, 10 Sep 2025 07:50:20 -0400, Mark Saad <nonesuch@longcount.org> said:

    what version of FreeBSD is this ?

    Ah, yes, it's 14.3-RELEASE. It's an NFS server with three zpools
    (boot plus two exported), total about 760 TiB.

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Thu Sep 11 16:23:12 2025

    On Thu, Sep 11, 2025 at 11:10 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Wed, 10 Sep 2025 07:50:20 -0400, Mark Saad <nonesuch@longcount.org> said:

    what version of FreeBSD is this ?

    Ah, yes, it's 14.3-RELEASE. It's an NFS server with three zpools
    (boot plus two exported), total about 760 TiB.
    I'm no ZFS guy, so I'm probably the last guy you should listen to,
    but I'd suggest you look at sys/contrib/openzfs/module/os/linux/zfs/arc_os.c. Why?
    Because there is a bunch of stuff in there that isn't in sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c
    and it might give you some hints w.r.t. tuning the arc?
    Also, I've mentioned this before, but if you choose to not post to freebsd-current@, I'd suggest you at least cc a few people who
    work in the area (mav@, asomers@, markj@ and maybe a couple
    more). It at least seems to me that they don't read freebsd-stable@
    often.
    rick

    -GAWollman


  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Thu Sep 11 17:22:10 2025

    On Thu, Sep 11, 2025 at 10:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Tue, 9 Sep 2025 12:19:21 -0700, Mark Millard <marklmi@yahoo.com> said:

    Garrett Wollman <wollman_at_bimajority.org> wrote on
    Date: Tue, 09 Sep 2025 16:19:42 UTC :

    On some of our newer large-memory NFS servers, we are seeing services
    killed with "failed to reclaim memory". According to our monitoring,
    the server has >100G of physmem free at the time,

    Was that 100G+ somewhat before any reclaiming of memory started,
    the lead-up to the notice?

    That was within five minutes of munin-node getting shot by the OOM
    killer. There was much less memory free ca. 24 hours before the
    event.

    Any likelihood of sudden, rapid, huge drops in free RAM based on
    workload behavior?

    I don't have access to client workloads, but it would have to be a bug
    in ZFS if so; these are file servers, all they run is NFS.
    Bug or tuning weakness?
    If you look at sys/contrib/openzfs/module/os/linux/zfs/arc_os.c, it does
    a bunch of arm-waving to set arc_sys_free, whereas sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c doesn't do anything.
    I'd try tuning it via vfs.zfs.arc.sys_free?
    (The default is 0 and that says "use all of the memory" if I read it
    correctly. I probably haven't read it correctly, which was why I suggested
    you compare the two of them.)
    rick
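    A minimal sketch of trying that knob (the sysctl name is as given above;
    the 16 GiB value and the /etc/sysctl.conf persistence are assumptions, and
    some builds may require setting it from /boot/loader.conf instead):

    # Ask the ARC to keep roughly 16 GiB of RAM free for everything else.
    sysctl vfs.zfs.arc.sys_free=17179869184
    printf 'vfs.zfs.arc.sys_free=17179869184\n' >> /etc/sysctl.conf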

    Is NUMA involved?

    Damn if I know.

    and the only
    solution seems to be rebooting. (There is a small amount of swap
    configured and even less of it in use.)

    That swap is in use at all could be of interest. I wonder
    what it was doing when the swap was put to use, or when laundry
    was growing in a way that led to swap being put to use.

    It's pretty normal on these servers, which stay up for six months
    between OS upgrades, for some userland daemons to get swapped out,
    although I agree that it seems like it shouldn't happen given that the
    size of memory (1 TiB) is much greater than the size of running
    processes (< 1 GiB).

    My suspicion here is that there's some sort of accounting error, but I
    don't know where to look, and I only have data retrospectively, and
    only the data that munin is collecting. (Someone else was on call
    when this happened most recently and they reported that their login
    shell kept on getting shot -- as was the getty on the serial console.)

    -GAWollman


  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Thu Sep 11 18:32:17 2025

    On Thu, Sep 11, 2025 at 11:10 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Wed, 10 Sep 2025 07:50:20 -0400, Mark Saad <nonesuch@longcount.org> said:

    what version of FreeBSD is this ?

    Ah, yes, it's 14.3-RELEASE. It's an NFS server with three zpools
    (boot plus two exported), total about 760 TiB.
    One simple thing you could do that might provide some insight
    into what is going on is..
    - do "nfsstat -s" in a loop (once/sec) along with "date" written out
    to some log file on the server.
    - Then when the problem shows up, look at the log file and see what
    RPCs/operations load the server was experiencing.
    (read vs write vs lookup vs ???)
    rick
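    A minimal sketch of that logging loop (the log path and the once-per-second
    interval are assumptions):

    # Timestamped server-side RPC counters, once per second.
    while :; do
        date
        nfsstat -s
        sleep 1
    done >> /var/log/nfsstat-s.log 2>&1 &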

    -GAWollman


  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 00:05:34 2025

    <<On Thu, 11 Sep 2025 18:32:17 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    One simple thing you could do that might provide some insight
    into what is going on is..
    - do "nfsstat -s" in a loop (once/sec) along with "date" written out
    to some log file on the server.
    - Then when the problem shows up, look at the log file and see what
    RPCs/operations load the server was experiencing.
    (read vs write vs lookup vs ???)

    We monitor NFS ops with munin, same as everything else, every five
    minutes. The more detailed data has already rolled off the RRDs, but
    in the half-hour before the OOM event, write ops spiked to a (still
    quite tame) 5,000 per second. That's well below observed peaks in
    writes over every averaging interval.[1] (The other NFS ops that you'd
    expect to see for a v4 client doing lots of writes increased as well,
    about one open/close pair per four write ops.) So I don't think it's
    anything NFS is doing on its own, but might be something ZFS is doing
    badly when the writes hit.

    The server continued to operate, with various other daemons getting
    shot as the OOM killer rampaged, until the on-call person got alerted
    by our monitoring. Never less than 105G physmem free in the 12 hours
    leading up to the event. It took about 36 hours after a hard reboot
    for the system to get back to the same level of free RAM and to start
    swapping out idle daemons.

    <https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png>
    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    One thing that was going on when the crash happened is that we were
    demoing the Bacula Enterprise client on one large filesystem, using
    their new support for using `zfs diff` to speed up incrementals, and
    it was taking an unexpectedly long time. No idea at this point
    whether that might be a cause or a symptom.

    -GAWollman

    [1] We've had some days when the *24-hour* average write op rate has
    been over 30,000 per second, although I can't say whether that
    happened under 13.3, 13.4, or 14.3, all of which we've run on this
    server in the past 12 months.


  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Fri Sep 12 08:23:34 2025

    Garrett Wollman <wollman_at_bimajority.org>
    Date: Fri, 12 Sep 2025 04:05:34 UTC
    . . .
    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.
    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?
    ===
    Mark Millard
    marklmi at yahoo.com
  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 11:41:39 2025

    <<On Fri, 12 Sep 2025 08:23:34 -0700, Mark Millard <marklmi@yahoo.com> said:

    [I wrote:]
    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?

    Totally normal, that's the ARC warming up with client activity.
    Typical machine learning datasets these days are on the order of a
    terabyte, so they won't entirely fit in memory. (These systems also
    have 2 TB of L2ARC but that gets discarded on reboot, so obviously
    we'd like to avoid reboots.)

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 08:41:43 2025

    On Fri, Sep 12, 2025 at 8:24 AM Mark Millard <marklmi@yahoo.com> wrote:

    Garrett Wollman <wollman_at_bimajority.org>
    Date: Fri, 12 Sep 2025 04:05:34 UTC

    . . .

    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?
    Given your report here: https://lists.freebsd.org/archives/freebsd-stable/2025-August/003024.html
    and an offlist report I got from Peter Errikson (copied into the above thread), I'd guess the problem was introduced by the transition to ZoL (which means 14.n, I think?).
    I don't know who the best guys to figure this out would be, but I suspect
    more of them will notice if you post to freebsd-current@ (yes, I know freebsd-stable@ is technically the correct list, but if the right people
    don't see the post..).
    I'd really like to see this figured out, but I have no idea how to
    proceed. As I noted, there is a lot of arc related stuff in the Linux
    port that is not in the FreeBSD port of ZFS, but I have no idea if/what
    needs to be done?
    rick
    ps: I've at least added a couple of cc's in the hope they might have
    some ideas.

    ===
    Mark Millard
    marklmi at yahoo.com


  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 08:50:10 2025

    On Fri, Sep 12, 2025 at 8:42 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:23:34 -0700, Mark Millard <marklmi@yahoo.com> said:

    [I wrote:]
    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?

    Totally normal, that's the ARC warming up with client activity.
    Typical machine learning datasets these days are on the order of a
    terabyte, so they won't entirely fit in memory. (These systems also
    have 2 TB of L2ARC but that gets discarded on reboot, so obviously
    we'd like to avoid reboots.)
    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)
    rick

    -GAWollman


  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 11:58:27 2025

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 09:25:35 2025

    On Fri, Sep 12, 2025 at 8:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.
    Maybe. A lot of things, like the size of the buffer cache and buckets used
    for malloc, etc. are tuned when the system boots,
    based on how much ram the system has.
    I have no idea what those numbers look like for a 1Tbyte system.
    In other words, if the system booted thinking it has 2Gbytes of ram
    I suspect you would be correct, but if the system boots thinking it
    has 1Tbyte of ram, then???
    rick

    -GAWollman

  • From Mark Millard@marklmi@yahoo.com to muc.lists.freebsd.stable on Fri Sep 12 10:22:02 2025

    On Sep 12, 2025, at 08:23, Mark Millard <marklmi@yahoo.com> wrote:
    Garrett Wollman <wollman_at_bimajority.org>
    Date: Fri, 12 Sep 2025 04:05:34 UTC

    . . .

    https://bimajority.org/%7Ewollman/memory-pinpoint%3D1756957462%2C1757648662.png

    shows the memory utilization over the course of the past week
    including the incident on Tuesday morning. I don't know why there's
    25G of inactive pages for three days leading up to the OOM; perhaps
    that's related? Inactive is normally much less than 1G.

    Is the growth to huge wired figures like 932.89G something
    new --or has such been historically normal?
    At various stages, what does:
    # sysctl vm | grep -e stats.free_ -e stats.vm.v_free_
    show? In my current context, an example
    output is (single domain context):
    # sysctl vm | grep -e stats.free_ -e stats.vm.v_free_
    vm.domain.0.stats.free_severe: 186376
    vm.domain.0.stats.free_min: 308882
    vm.domain.0.stats.free_reserved: 63871
    vm.domain.0.stats.free_target: 1043915
    vm.domain.0.stats.free_count: 41010336
    vm.stats.vm.v_free_severe: 186376
    vm.stats.vm.v_free_count: 41010331
    vm.stats.vm.v_free_min: 308882
    vm.stats.vm.v_free_target: 1043915
    vm.stats.vm.v_free_reserved: 63871
    It would not look as redundant for a multi-domain
    context.
    More detail about some of what would be output
    is below.
    There are the figures (shown for a non-NUMA context,
    so only the 1 domain):
    # sysctl -d vm.domain | grep "\.stats\.free_"
    vm.domain.0.stats.free_severe: Severe free pages
    vm.domain.0.stats.free_min: Minimum free pages
    vm.domain.0.stats.free_reserved: Reserved free pages
    vm.domain.0.stats.free_target: Target free pages
    vm.domain.0.stats.free_count: Free pages
    # sysctl vm.domain | grep "\.stats\.free_"
    vm.domain.0.stats.free_severe: 186376
    vm.domain.0.stats.free_min: 308882
    vm.domain.0.stats.free_reserved: 63871
    vm.domain.0.stats.free_target: 1043915
    vm.domain.0.stats.free_count: 40923251
    The domain's vmd_oom_seq value increments
    when there is a shortage that has not
    changed and:
    vmd->vmd_free_count < vmd->vmd_pageout_wakeup_thresh
    where:
    vmd->vmd_pageout_wakeup_thresh = (vmd->vmd_free_target / 10) * 9
    Or, in terms of the sysctl interface:
    (vm.domain.?.stats.free_target / 10) * 9
    (It is not explicitly published via sysctl from what
    I saw.)
    The domain's vmd_oom_seq value is compared to the
    value reported by vm.pageout_oom_seq but there is
    "voting" across all the domains for the overall oom
    decision.
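    As an illustration of that arithmetic (not part of the original message),
    the unpublished wakeup threshold can be derived from the published sysctls:

    # vmd_pageout_wakeup_thresh = (free_target / 10) * 9, per domain.
    target=$(sysctl -n vm.domain.0.stats.free_target)
    free=$(sysctl -n vm.domain.0.stats.free_count)
    echo "domain 0: free=${free} wakeup_thresh=$(( target / 10 * 9 ))"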
    There are 2 figures that just ZFS uses:
    /usr/main-src/sys/contrib/openzfs/module/os/freebsd/zfs/sysctl_os.c: if (val < minfree)
    /usr/main-src/sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c: zfs_arc_free_target = vm_cnt.v_free_target;
    /usr/main-src/sys/contrib/openzfs/include/os/freebsd/spl/sys/kmem.h:#define minfree vm_cnt.v_free_min
    In sysctl terms these are in the list:
    # sysctl -d vm.stats.vm | grep "\<v_free_"
    vm.stats.vm.v_free_severe: Severe page depletion point
    vm.stats.vm.v_free_count: Free pages
    vm.stats.vm.v_free_min: Minimum low-free-pages threshold
    vm.stats.vm.v_free_target: Pages desired free
    vm.stats.vm.v_free_reserved: Pages reserved for deadlock
    # sysctl vm.stats.vm | grep "\<v_free_"
    vm.stats.vm.v_free_severe: 186376
    vm.stats.vm.v_free_count: 40997647
    vm.stats.vm.v_free_min: 308882
    vm.stats.vm.v_free_target: 1043915
    vm.stats.vm.v_free_reserved: 63871
    These are overall, not per-NUMA-domain.
    ZFS does not seem to do per-NUMA-domain memory
    usage management: no interface used for such
    information as far as I've seen.
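    A hedged way to compare those figures on a running system (the sysctl name
    vfs.zfs.arc.free_target is assumed for 14.x; older builds spell it
    vfs.zfs.arc_free_target):

    # ZFS's free target versus the VM's own free_target and free_min.
    sysctl vfs.zfs.arc.free_target vm.stats.vm.v_free_target vm.stats.vm.v_free_min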
    ===
    Mark Millard
    marklmi at yahoo.com
  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 14:15:14 2025

    So we just had the exact same failure on a much older, much smaller
    NFS server (only 128G RAM, 6G free). Really not much activity going
    on at the time, but this server was upgraded to 14.3 on the same day
    as the other server, so both had 70 days of uptime.

    Wondering now if I should enable `vm.panic_on_oom` across the fleet,
    because these servers can often reboot in less time than our monitoring
    takes to notice a fault (particularly when only some daemons are
    getting killed).
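    A minimal sketch of doing that (the dumpdev hint is an assumption, only
    relevant if a crash dump for post-mortem analysis is wanted):

    # Panic (and reboot) on OOM instead of shooting daemons one by one.
    sysctl vm.panic_on_oom=1
    printf 'vm.panic_on_oom=1\n' >> /etc/sysctl.conf
    # Optionally set dumpdev="AUTO" in /etc/rc.conf so savecore(8) keeps the panic dump.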

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 17:46:33 2025

    On Fri, Sep 12, 2025 at 8:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.
    If you look at arc_default_max() in sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c
    you'll see it returns "allmem - 1Gbyte".
    This may make sense for a machine with a few Gbytes of ram, but I'd bump it
    up for machines like you have. (As I noted, a system that boots with 128Gbyte->1Tbyte
    of ram is going to size things a lot larger and "allmem" looks like
    the total ram in the
    system. They haven't even subtracted out what the kernel uses.)
    (Disclaimer: I know nothing about ZFS, so the above may be crap!!)
    It's a trivial function to patch, rick
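    A quick, read-only check of that ceiling on a running box (vfs.zfs.arc_max
    is the legacy sysctl name; newer builds also expose vfs.zfs.arc.max):

    # ARC ceiling versus installed RAM; arc_default_max() is allmem minus 1 GiB.
    sysctl hw.physmem kstat.zfs.misc.arcstats.c_max vfs.zfs.arc_max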

    -GAWollman

  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 18:29:30 2025

    On Fri, Sep 12, 2025 at 5:46 PM Rick Macklem <rick.macklem@gmail.com> wrote:

    On Fri, Sep 12, 2025 at 8:58 AM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 08:50:10 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Ok, but don't we want something that prevents the arc from taking all
    the memory? (It seems like 932Gbytes should be close to a hard
    upper bound for a system with 1Tbyte of ram?)

    Presently it says:

    kstat.zfs.misc.arcstats.c_max: 1098065367040

    That's 1 TiB less 1379 MiB. Which in all honesty *ought to be enough*
    for munin-node, nrpe, ntpd, sendmail, inetd, lldpd, nslcd, sshd, syslogd, mountd, and nfsuserd.
    If you look at arc_default_max() in sys/contrib/openzfs/module/os/freebsd/zfs/arc_os.c
    you'll see it returns "allmem - 1Gbyte".
    This may make sense for a machine with a few Gbytes of ram, but I'd bump it up for machines like you have. (As I noted, a system that boots with 128Gbyte->1Tbyte
    of ram is going to size things a lot larger and "allmem" looks like
    the total ram in the
    system. They haven't even subtracted out what the kernel uses.)
    (Disclaimer: I know nothing about ZFS, so the above may be crap!!)

    It's a trivial function to patch, rick
    Here's another simple one..look at..
    # sysctl -a | fgrep maxmbufmem
    It appears to be set to 1/2 of the physical memory for me.
    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.
    rick
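    A hedged one-liner (assumes stock sh, paste, and awk) to see what fraction
    of physical memory the mbuf limit is allowed to claim:

    # Ratio of the mbuf memory limit to physical memory.
    sysctl -n kern.ipc.maxmbufmem hw.physmem | paste - - |
        awk '{ printf "maxmbufmem/physmem = %.2f\n", $1 / $2 }'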


    -GAWollman

  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Fri Sep 12 21:35:26 2025

    <<On Fri, 12 Sep 2025 18:29:30 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out. And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.

    -GAWollman



  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 19:09:29 2025

    On Fri, Sep 12, 2025 at 6:35 PM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 18:29:30 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out. And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.
    I don't recall you mentioning your NIC speed, but 10 Gbps is about 1 Gbyte/sec, so 100 GiB would take roughly 100 seconds to fill. But you certainly could be correct.
    rick

    -GAWollman

  • From Rick Macklem@rick.macklem@gmail.com to muc.lists.freebsd.stable on Fri Sep 12 19:25:25 2025

    On Fri, Sep 12, 2025 at 7:09 PM Rick Macklem <rick.macklem@gmail.com> wrote:

    On Fri, Sep 12, 2025 at 6:35 PM Garrett Wollman <wollman@bimajority.org> wrote:

    <<On Fri, 12 Sep 2025 18:29:30 -0700, Rick Macklem <rick.macklem@gmail.com> said:

    Let's see, 50% of memory allocated to mbufs and 99.9%
    of physical memory allowed for the arc.
    - This reminds me of the stats CNN puts up, where the
    percentages never add up to 100.

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out.
    The problem is that it must react quickly and aggressively enough.
    If you've ever studied queueing theory, you know that it is difficult to impossible to stabilize a system without feedback. For NFS the feedback is
    the replies to RPCs that throttle the clients. However, throw in a bunch of clients and large TCP send windows and the feedback doesn't happen that quickly.
    If I were trying to fix this, I'd start by either:
    - setting vfs.zfs.arc_max to a much smaller value than 99.9% and see
    if that stabilizes the server. If I was lucky and it did, I'd slowly increase
    the value and then cut it down by a fair amount after I saw the first failure.
    (I might also be tempted to decrease kern.ipc.maxmbufmem; a sketch of this first option follows below.)
    OR
    - I'd take a good look at the old FreeBSD 13.n code and see how it
    adjusted the arc and then try and make the new code do the same
    thing. (I noted that there is a lot more code in the Linux port than
    the FreeBSD port of the current ZFS code, found in os/<name>/zfs/arc_os.c.)
    If I had a setup where I could test/play with this, I think it would be
    kinda fun, but I doubt something done on a 4Gbyte laptop is going
    to produce similar results, especially when I really only have one NFS
    client to generate load against it.
    Good luck with whatever you try, rick
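    A sketch of the first option above (the 768 GiB starting point is purely an
    assumption, not a value from the thread; vfs.zfs.arc_max may be spelled
    vfs.zfs.arc.max on newer builds, and may only be settable from loader.conf
    on some of them):

    # Cap the ARC well below physical RAM; 768 GiB = 824633720832 bytes.
    sysctl vfs.zfs.arc_max=824633720832
    printf 'vfs.zfs.arc_max="824633720832"\n' >> /boot/loader.conf
    # kern.ipc.maxmbufmem can likewise be lowered via /boot/loader.conf if desired.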
    And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.
    I don't recall you mentioning your NIC speed, but 10 Gbps is about 1 Gbyte/sec, so 100 GiB would take roughly 100 seconds to fill. But you certainly could be correct.

    rick


    -GAWollman

  • From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Tue Sep 16 21:33:26 2025

    <<On Fri, 12 Sep 2025 21:35:26 -0400, Garrett Wollman <wollman@bimajority.org> said:

    The point being that the ARC is supposed to respond to backpressure
    long before memory runs out. And again, we're talking about a system
    with 100 GiB of outright FREE physical memory. There's no possible
    way that can be fully allocated in less than 5 minutes -- the NICs
    aren't that fast and the servers aren't doing anything else.

    The past couple of nights we've had failures of other NFS servers
    (same FreeBSD build, different hardware, different clients, different
    data). The most recent one, unlike the one I started this thread
    with, didn't get so far as to invoke the OOM killer -- it seems to
    have been stuck in arc_wait_for_eviction(). I wasn't in a position to
    get a backtrace, so I can't tell if this was the call from
    arc_get_data_impl() (which is called for every block allocated but
    normally just returns immediately) or the one from arc_lowmem() (which
    is ultimately called from the vm_lowmem event handler when the system
    is really out of memory).

    As with previous failures, this one was with plenty of physical memory seemingly available (20 GiB out of 96 GiB). Separate swap partition,
    of course, and after 34 minutes memory allocation is pretty much back
    to where it was before the crash.

    -GAWollman


