Forum: Too Lazy BBS

ZFS deadlocks/memory accounting issues

From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 16:41:35 2026

From Newsgroup: muc.lists.freebsd.stable

Since we upgraded to 14.3 last summer, we have been experiencing
numerous memory accounting issues on our NFS servers. These manifest
as a server *desperate* to free up memory despite having multiple
gigabytes of physical RAM available. (Some of these machines have 1
TiB of RAM, with more than 64 GiB free, and were swapping and invoking
the OOM-killer.)

I had a server deadlock just now after only three days of uptime with
32 GiB of free memory. Prior to the crash, about 70 GiB (of 128) was
used by the ARC, of which some 60 GiB was accounted for as
"evictable", and the load was pretty modest.

In DDB on the console, I noted:

pid ppid pgrp uid state wmesg wchan cmd
60673 60672 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60672 1 3008 0 S wait 0xfffffe031ee41560 nrpe
60670 1186 60670 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
60669 1202 1202 0 D voffloc 0xfffff8024db4966a perl
60668 60667 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60667 1 3008 0 S wait 0xfffffe031ee41000 nrpe
60665 60664 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60664 1 3008 0 S wait 0xfffffe031723a5c0 nrpe
60662 60661 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60661 1 3008 0 S wait 0xfffffe03172395a0 nrpe
60659 60658 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60658 1 3008 0 S wait 0xfffffe0317239040 nrpe
60656 60655 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60655 1 3008 0 S wait 0xfffffe0317238ae0 nrpe
60653 60652 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60652 1 3008 0 S wait 0xfffffe0317238580 nrpe
60650 60649 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60649 1 3008 0 S wait 0xfffffe0317238020 nrpe
60647 60646 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60646 1 3008 0 S wait 0xfffffe0317237ac0 nrpe
60644 60643 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60643 1 3008 0 S wait 0xfffffe0317237000 nrpe
60641 60640 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60640 1 3008 0 S wait 0xfffffe00d3cfa040 nrpe
60638 1202 1202 0 D voffloc 0xfffff8024db4966a perl
60637 1186 60637 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
60636 60635 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60635 1 3008 0 S wait 0xfffffe00d3cf9ae0 nrpe
60633 60632 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60632 1 3008 0 S wait 0xfffffe00d3cf9580 nrpe
60630 60629 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60629 1 3008 0 S wait 0xfffffe00d3cf9020 nrpe
60627 60626 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60626 1 3008 0 S wait 0xfffffe00d3cf8560 nrpe
60624 60623 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60623 1 3008 0 S wait 0xfffffe00d3cf8000 nrpe
60621 60620 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60620 1 3008 0 S wait 0xfffffe0317188060 nrpe
60618 60617 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60617 1 3008 0 S wait 0xfffffe0317187b00 nrpe
60615 60614 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60614 1 3008 0 S wait 0xfffffe03171875a0 nrpe
60612 60611 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60611 1 3008 0 S wait 0xfffffe0317186ae0 nrpe
60609 60608 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60608 1 3008 0 S wait 0xfffffe0317186580 nrpe
60606 1186 60606 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
60605 60604 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60604 1 3008 0 S wait 0xfffffe0317186020 nrpe
60602 60601 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60601 1 3008 0 S wait 0xfffffe0317185ac0 nrpe
60599 1202 1202 0 D voffloc 0xfffff8024db4966a perl
60598 60597 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60597 1 3008 0 S wait 0xfffffe0317185560 nrpe
60595 60594 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60594 1 3008 0 S wait 0xfffffe0317185000 nrpe
60592 60591 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60591 1 3008 0 S wait 0xfffffe031724c5c0 nrpe
60589 60588 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60588 1 3008 0 S wait 0xfffffe031724c060 nrpe
60586 60585 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60585 1 3008 0 S wait 0xfffffe031724b5a0 nrpe
60583 60582 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60582 1 3008 0 S wait 0xfffffe031724a580 nrpe
60580 60579 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60579 1 3008 0 S wait 0xfffffe031724a020 nrpe
60577 1186 60577 0 Ds aw.aew_ 0xfffffe0326e5a608 sshd-session
60576 60575 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60575 1 3008 0 S wait 0xfffffe0317249560 nrpe
60573 1202 1202 0 D aw.aew_ 0xfffffe0326df6478 perl
60572 60571 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60571 1 3008 0 S wait 0xfffffe0317249000 nrpe
5015 5010 5015 6263 Ss+ ttyin 0xfffff810aa50a8b0 zsh
5010 5006 5006 6263 S select 0xfffff8024ca966c0 sshd-session
5006 1186 5006 0 Ss select 0xfffff8024ca984c0 sshd-session
3008 1 3008 0 Ss select 0xfffff80209dc98c0 nrpe
2910 1 2910 0 Ds+ aw.aew_ 0xfffffe03274d66e8 getty

This getty is the one running on the console tty, which was stuck.
Note the wait channel is "aw.aew_cv", which is part of the logic for
evicting buffers from the ARC. Other threads are waiting for a
dbuf (ZFS disk buffer) object mutex.
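
For readers who want to summarize a listing like the one above, here is a minimal sketch that tallies blocked processes by wait message. It assumes the eight-column row layout shown in the listing (pid, ppid, pgrp, uid, state, wmesg, wchan, cmd) and treats any state beginning with "D" as blocked. The function name and the handful of sample rows are illustrative only; the rows are copied from the listing, not the full output.

```python
from collections import Counter

# A few rows copied from the DDB listing above (not the complete output).
PS_ROWS = """\
60673 60672 3008 0 D db->db_ 0xfffff8058173af68 nrpe
60672 1 3008 0 S wait 0xfffffe031ee41560 nrpe
60670 1186 60670 0 Ds db->db_ 0xfffff8173309f1e8 sshd-session
60669 1202 1202 0 D voffloc 0xfffff8024db4966a perl
60577 1186 60577 0 Ds aw.aew_ 0xfffffe0326e5a608 sshd-session
60573 1202 1202 0 D aw.aew_ 0xfffffe0326df6478 perl
2910 1 2910 0 Ds+ aw.aew_ 0xfffffe03274d66e8 getty
"""


def tally_wait_messages(text):
    """Count processes whose state starts with 'D', grouped by wmesg."""
    counts = Counter()
    for line in text.splitlines():
        # Split into at most 8 fields so a command with spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[0].isdigit():
            continue  # skip the header row and anything malformed
        state, wmesg = fields[4], fields[5]
        if state.startswith("D"):
            counts[wmesg] += 1
    return counts


counts = tally_wait_messages(PS_ROWS)
for wmesg, n in counts.most_common():
    print(f"{n:3d} {wmesg}")
```

Grouping by the wchan column instead (fields[6]) would show how many of the waiters are queued on the same object.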

I'm currently planning on taking us to 14.4 later this spring, but it
would be nice to know if anyone else has seen this bug or has a fix.
I've tried dropping kern.maxvnodes and increasing
vfs.zfs.arc_free_target, with no change in symptoms.

This particular server is due to be replaced but the new disk array
(which was ordered in January) won't ship until late April per the
vendor.

-GAWollman

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-admin@muc.de
--- Synchronet 3.21d-Linux NewsLink 1.2

From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 15:08:44 2026

From Newsgroup: muc.lists.freebsd.stable

I once saw a similar bug. In my case I had a process that mmap()ed
some very large files on fusefs, consuming lots of inactive pages.
And when the system comes under memory pressure, it asks ARC to evict
first. So the ARC would end up shrinking down to arc_min every time.
In my case, the solution was to set vfs.fusefs.data_cache_mode=0 . I
suspect that similar bugs could be possible with UFS or tmpfs, if they
have giant files that are mmap()ed.

A less effective workaround was to set vfs.zfs.arc.min to some
reasonable value. That can prevent ARC from shrinking too far. You
could try that.

Another thing you could try is to run "vmstat -o" when the system is
in the problematic state. That will show you which vm objects are
using the most inactive pages.

Hope this helps,
-Alan
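
As a rough illustration of the "vmstat -o" suggestion, the sketch below ranks objects by their inactive-page count. It is a sketch under assumptions: the sample text is made up, the column names (in particular INAC and PATH) are assumed rather than taken from this thread, and the function name is hypothetical. The parser locates columns from the header row rather than by fixed position, so an extra column should not break it.

```python
# Made-up sample in the assumed layout of "vmstat -o" output.
SAMPLE = """\
  RES   ACT  INAC   REF   SHD    CM TP PATH
 1200   100  1100     1     0    WB VN /fuse/a/huge.img
   50    40    10     3     0    WB VN /usr/bin/perl
 9000  1000  8000     1     0    WB VN /fuse/b/huge.img
"""


def top_inactive(text, limit=10):
    """Return up to `limit` (inactive_count, path) pairs, largest first."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    inac_col = header.index("INAC")  # assumed column name
    path_col = header.index("PATH")  # assumed column name
    rows = []
    for ln in lines[1:]:
        # Everything from the PATH column onward is kept as one field.
        fields = ln.strip().split(None, path_col)
        if len(fields) <= inac_col or not fields[inac_col].isdigit():
            continue
        path = fields[path_col] if len(fields) > path_col else "(no path)"
        rows.append((int(fields[inac_col]), path))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


for inactive, path in top_inactive(SAMPLE, limit=2):
    print(inactive, path)
```

On a live system the text could come from running the command and capturing its output; that step is omitted here so the sketch stays self-contained.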

From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 17:23:33 2026

From Newsgroup: muc.lists.freebsd.stable

<<On Mon, 16 Mar 2026 15:08:44 -0600, Alan Somers <asomers@freebsd.org> said:

A less effective workaround was to set vfs.zfs.arc.min to some
reasonable value. That can prevent ARC from shrinking too far. You
could try that.

So far as I can tell, the ARC doesn't actually shrink, and shouldn't
need to given the gigabytes of free physmem at the time (well,
immediately prior). Within 5 minutes of the crash, the total ARC size
was 70 GiB, c_max was 127 GiB, and c_min was 4 GiB -- in practice it's
never anywhere near that small. The first observation after the
server came back up, ARC size was already over 20 GiB.

Either *something* is causing the kernel to think it has no free
memory when there's actually lots, or else something is causing the
kernel to allocate gigabytes of RAM much faster than we can observe it happening.

There's epsilon memory in the inactive queue on this system, before or
after the crash: it's so small I can't even see the line on the graph.
The 24-hour maximum is 268 MiB, or about 0.2% of RAM.
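
The proportions quoted in this thread can be checked with a few lines of arithmetic. This sketch assumes the 128 GiB figure from the first message is the RAM size of the system being discussed here; the variable names are illustrative.

```python
MIB_PER_GIB = 1024

ram_gib = 128            # RAM size, from the first message (assumed same host)
arc_size_gib = 70        # total ARC size shortly before the crash
inactive_peak_mib = 268  # 24-hour maximum of the inactive queue

# Inactive queue as a fraction of RAM.
inactive_fraction = inactive_peak_mib / (ram_gib * MIB_PER_GIB)

# ARC as a fraction of RAM.
arc_fraction = arc_size_gib / ram_gib

print(f"inactive queue: {inactive_fraction:.2%} of RAM")
print(f"ARC:            {arc_fraction:.1%} of RAM")
```

The inactive queue works out to about 0.2% of RAM, matching the figure above, while the ARC held a little over half of RAM.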

Another thing you could try is to run "vmstat -o" when the system is
in the problematic state.

What's the equivalent in DDB? No getty, no login.

-GAWollman


From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 15:29:00 2026

From Newsgroup: muc.lists.freebsd.stable

I don't know how to do it from ddb. But if you dump a core file, then
you can run vmstat after rebooting like this:
"vmstat -o -M /var/crash/vmcore.XXX" . Also, did the kernel panic, or
did you manually enter ddb? If it panicked, can you please share the
stack trace?

From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 17:39:17 2026

From Newsgroup: muc.lists.freebsd.stable

BTW, Alan, mail to your freebsd.org mailbox bounces because you
forward to gmail.

<<On Mon, 16 Mar 2026 15:29:00 -0600, Alan Somers <asomers@freebsd.org> said:

I don't know how to do it from ddb. But if you dump a core file,

None of our systems are set up for that. They all have huge memory
and pretty tiny swap partitions, and in any case, they don't panic,
they just deadlock. Or the OOM killer just shoots all user processes;
these are nearly indistinguishable from a service provider's
perspective.

They're just NFS servers; they don't run anything else except what's
necessary for monitoring and administration.

-GAWollman


From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 15:55:59 2026

From Newsgroup: muc.lists.freebsd.stable

Pretty tiny swap partitions? Maybe that's the problem. I recall kib@
telling me that some amount of swap is essential, even when plenty of
RAM is available. But I can't remember why. So if you can't upgrade
those tiny swap partitions, then I suggest you install an SSD just for
use as a dump device. I've done that sometimes.

From Konstantin Belousov@kostikbel@gmail.com to muc.lists.freebsd.stable on Mon Mar 16 23:56:55 2026

From Newsgroup: muc.lists.freebsd.stable

On Mon, Mar 16, 2026 at 03:08:44PM -0600, Alan Somers wrote:

I once saw a similar bug. In my case I had a process that mmap()ed
some very large files on fusefs, consuming lots of inactive pages.
And when the system comes under memory pressure, it asks ARC to evict
first. So the ARC would end up shrinking down to arc_min every time.
In my case, the solution was to set vfs.fusefs.data_cache_mode=0 . I
suspect that similar bugs could be possible with UFS or tmpfs, if they
have giant files that are mmaped().

What are 'similar bugs with UFS or tmpfs'?
Can you please be more specific: what is the erroneous behavior?


From Garrett Wollman@wollman@bimajority.org to muc.lists.freebsd.stable on Mon Mar 16 18:15:54 2026

From Newsgroup: muc.lists.freebsd.stable

<<On Mon, 16 Mar 2026 15:55:59 -0600, Alan Somers <asomers@freebsd.org> said:

On Mon, Mar 16, 2026 at 3:39 PM Garrett Wollman <wollman@bimajority.org> wrote:

None of our systems are set up for that. They all have huge memory
and pretty tiny swap partitions, and in any case, they don't panic,
they just deadlock. Or the OOM killer just shoots all user processes;
these are nearly indistinguishable from a service provider's
perspective.

Pretty tiny swap partitions?

Tiny compared to RAM, typically 16 or 32 GiB. After all, these are
NFS servers; they shouldn't have more than a few dozen MiB of
swappable anonymous memory.(*) We're not going to put a 2T SSD as a
hopefully-never-to-be-used swap drive in a file server.

I configured a dump device on the server that crashed today; if it
crashes again when I'm at a keyboard, I'll see if I can get it to write
a dump in the 32 GiB of swap that it has configured.

-GAWollman

(*) If the kernel erroneously thinks it's out of free memory and
swapping stuff out only opens up a few MiB, that would certainly
explain why it goes on to ARC eviction and eventual OOM. On this
server, after two hours of uptime, I see:

Device          1K-blocks  Used   Avail     Capacity
/dev/gpt/swap0  33554432   32132  33522300  0%

From Sulev-Madis Silber@freebsd-stable-freebsd-org730@ketas.si.pri.ee to muc.lists.freebsd.stable on Tue Mar 17 00:15:24 2026

From Newsgroup: muc.lists.freebsd.stable

I have had problems like this for a long time, but on 13.* and with an
absolutely tiny amount of RAM.
The problems seemed to appear when something mmapped files on ZFS.
This has puzzled me for years; if I limit mmap, nothing fails, except
maybe for some strange file errors in certain memory-pressure cases.
Unrelated?
Some others also had questionably similar issues just a while ago on
Actual Servers (TM): same issue, processes get killed.
Or, in this case here, does wired not rapidly increase?
I/O speeds also seem to affect things, or I don't know what to think of
this.
But the problems described here also involve some mysterious memory
usage in ZFS. I don't know.

From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 16:18:38 2026

From Newsgroup: muc.lists.freebsd.stable

On Mon, Mar 16, 2026 at 3:57 PM Konstantin Belousov <kostikbel@gmail.com> wrote:

On Mon, Mar 16, 2026 at 03:08:44PM -0600, Alan Somers wrote:

I once saw a similar bug. In my case I had a process that mmap()ed
some very large files on fusefs, consuming lots of inactive pages.
And when the system comes under memory pressure, it asks ARC to evict
first. So the ARC would end up shrinking down to arc_min every time.
In my case, the solution was to set vfs.fusefs.data_cache_mode=0 . I
suspect that similar bugs could be possible with UFS or tmpfs, if they
have giant files that are mmap()ed.

What are 'similar bugs with UFS or tmpfs'?
Can you please be more specific: what is the erroneous behavior?

I experienced this bug in 2021, and reproduced it on both FreeBSD 12.2
and 13.0. The setup was:

* A ZFS-root server with hundreds of GB of RAM and hundreds of TB of
  ZFS, with a complicated ZFS workload.
* A custom fusefs file system. Each fusefs mountpoint presented a
  small number of files, some huge, and was backed by a file on ZFS
  itself.
* A ctld target for each fusefs mountpoint, backed by one file on that
  mountpoint.

"vmstat -o" showed that each of those ctld targets consumed a huge
amount of inactive memory. Basically, ctld was mmap()ing the whole file
and never releasing any pages. The dtrace sdt:zfs:none:arc-needfree
probe showed that the page daemon was frequently asking ZFS to free
memory from ARC. ZFS complied, and the ARC size would slowly shrink
down to vfs.zfs.arc_min . In my case, there was no crash, and the OOM
killer wasn't involved, but performance suffered. Setting
vfs.fusefs.data_cache_mode=0 was a perfect workaround for us, so I
never investigated further.

When I say that I suspect similar bugs may exist with UFS or tmpfs,
I'm suspecting that if ctld exports huge files from those file systems
on a mixed UFS/ZFS system, then they might consume huge amounts of
inactive pages. But I've never checked.
-Alan

From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 16:27:11 2026

From Newsgroup: muc.lists.freebsd.stable

On Mon, Mar 16, 2026 at 4:16 PM Garrett Wollman <wollman@bimajority.org> wrote:

<<On Mon, 16 Mar 2026 15:55:59 -0600, Alan Somers <asomers@freebsd.org> said:

On Mon, Mar 16, 2026 at 3:39 PM Garrett Wollman <wollman@bimajority.org> wrote:

None of our systems are set up for that. They all have huge memory
and pretty tiny swap partitions, and in any case, they don't panic,
they just deadlock. Or the OOM killer just shoots all user processes;
these are nearly indistinguishable from a service provider's
perspective.

Pretty tiny swap partitions?

Tiny compared to RAM, typically 16 or 32 GiB. After all, these are
NFS servers, they shouldn't have more than a few dozen MiB of
swappable anonymous memory.(*) We're not going to put a 2T SSD as a
hopefully-never-to-be-used swap drive in a file server.

You won't need a 2 TB SSD. By default, FreeBSD will make a mini dump,
which excludes most of ARC and most memory used by userspace programs.
For example, a recent core dump of mine takes 40 GB on a system with 1
TB of RAM. Note that I'm setting dumpon_flags="-Z" to enable core
dump compression. That makes the dump go faster, as well as use less
space. See dumpon(8) for more information about full vs mini core
dumps.

From Alan Somers@asomers@freebsd.org to muc.lists.freebsd.stable on Mon Mar 16 17:32:42 2026

From Newsgroup: muc.lists.freebsd.stable

On Mon, Mar 16, 2026 at 5:30 PM Mark Millard <marklmi@yahoo.com> wrote:

ZFS is documented to have the property: "This approach provides
coherency between memory-mapped and IO access at the expense of wasted
memory due to having two copies of the file in memory and extra
overhead caused by the need to copy the contents between the two
copies." (Chapter 10, page 548, last bullet item of the 2nd edition of
the design and implementation book.)

That's not relevant in this case, because no ZFS file was mmap()ed.
Only a fusefs file was mmap()ed.