Forum: Too Lazy BBS

Spool compression

From InterLinked@nntp@phreaknet.org to news.software.nntp on Wed Jun 17 23:11:57 2026

From Newsgroup: news.software.nntp

I thought at one point I read something about how articles could be
compressed in the spool in INN, or some other news software, but I can't
find that now, aside from compressing just overview. (Not thinking of compression for transit here but compression at rest.)

Digging around a bit, I found a thread from 1990 in news.software.b
which discussed this[1], consensus that news articles individually tend
to be small enough that compressing them is not worth the effort.
Certainly on busy systems, it makes little sense, but for the use case
where disk space is limited and CPU is not fully utilized normally
anyways, it seems attractive, assuming you don't mind losing the ability
to easily grep through things or use other utilities with the spool
(which, as recently mentioned, are less common now).

Is anyone aware of a news package that has ever supported compression of articles in the spool? My theory is that articles today are larger than
35-40 years ago, both due to more headers and larger bodies. I did some
tests on some articles I picked at random, and most seem to be in the
1KB-6KB range. I compressed a 930B article to 739B (20% savings) and a
5,730B article to 3,649B (37% savings). 25-35% seems typical.
Conceivably, on a system with limited disk space, one could increase
retention by maybe around 25% just with this trick. Certainly, at some
point it starts to look attractive.
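
The savings figures above, and what they would mean for retention, can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes disk usage scales directly with article size, which ignores filesystem block rounding.

```python
def savings_pct(original: int, compressed: int) -> float:
    """Percentage of bytes saved by compression."""
    return (original - compressed) / original * 100


def retention_gain_pct(savings: float) -> float:
    """Extra articles that fit in the same space if each shrinks by `savings` percent."""
    remaining = 1 - savings / 100
    return (1 / remaining - 1) * 100


if __name__ == "__main__":
    # The two sample articles mentioned above.
    for before, after in [(930, 739), (5730, 3649)]:
        print(f"{before}B -> {after}B: {savings_pct(before, after):.1f}% saved")
    print(f"25% savings -> {retention_gain_pct(25):.1f}% more articles")
```

Under that assumption, a 25% size reduction would fit about a third more articles in the same space, since capacity scales with 1 / (1 - savings).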

I'm toying with the concept for tradspool, though I imagine with CNFS or
other multi-article files, compression would be even more effective.

[1] https://groups.google.com/g/news.software.b/c/jDdQlTLzzIw/m/XIWEHQpvgWsJ

--- Synchronet 3.22a-Linux NewsLink 1.2

From Kevin Bowling@kevin.bowling@kev009.com to news.software.nntp on Wed Jun 17 21:14:22 2026

From Newsgroup: news.software.nntp

On 6/17/26 20:11, InterLinked wrote:

I thought at one point I read something about how articles could be compressed in the spool in INN, or some other news software, but I can't find that now, aside from compressing just overview. (Not thinking of compression for transit here but compression at rest.)

Digging around a bit, I found a thread from 1990 in news.software.b
which discussed this[1], consensus that news articles individually tend
to be small enough that compressing them is not worth the effort.

The economics favor the opposite on both counts with any modern system. I.e.
the data would be compressible (especially with a trained dictionary),
and the "effort" is borderline free and optimizes harder-to-scale axes like
I/O bandwidth and page cache.

Certainly on busy systems, it makes little sense, but for the use case
where disk space is limited and CPU is not fully utilized normally
anyways, it seems attractive, assuming you don't mind losing the ability
to easily grep through things or use other utilities with the spool
(which, as recently mentioned, are less common now).

Again economy has flipped and busy systems benefit more if it is done intelligently.

Is anyone aware of a news package that has ever supported compression of articles in the spool? My theory is that articles today are larger than 35-40 years ago, both due to more headers and larger bodies. I did some

I would question that intuition, I don't think article size has changed
much.

Here is a histogram on my tradspool:
<512B:       788,863  (0.95%)
512B-1K:   5,108,243  (6.16%)
1-2K:     31,046,227 (37.43%)
2-4K:     26,416,756 (31.85%)
4-8K:     11,161,756 (13.46%)
8-16K:     4,478,769  (5.40%)
16-32K:    1,522,232  (1.84%)
32-64K:      358,406  (0.43%)
>64K:          5,959  (0.01%)

Total: ~82.9 million articles

(note I cutoff at 52k, the larger articles would've been sucked when I
was filling out some history on a couple groups)

tests on some articles I picked at random, and most seem to be in the 1KB-6KB range. I compressed a 930B article to 739B (20% savings) and a 5,730B article to 3,649B (37% savings). 25-35% seems typical.
Conceivably, on a system with limited disk space, one could increase retention by maybe around 25% just with this trick. Certainly, at some
point it starts to look attractive.

I'm toying with the concept for tradspool, though I imagine with CNFS or other multi-article files, compression would be even more effective.

Yes, something like CNFS will result in greater gains because an
untrained dictionary will span a larger working set.

Otherwise a lot of Usenet articles are small enough that you really need
a trained dictionary to get any ratio on individual articles.

I use block level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
improvement. The major issue with this setup is the filesystem metadata
and data block per article, especially for small articles (the FS has no
support for packing small files into the metadata node, stuffing
multiple small files into one data node, etc.). CNFS would result in a
much higher ratio.

[1] https://groups.google.com/g/news.software.b/c/jDdQlTLzzIw/m/XIWEHQpvgWsJ


From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 10:11:55 2026

From Newsgroup: news.software.nntp

On 6/18/2026 12:14 AM, Kevin Bowling wrote:

On 6/17/26 20:11, InterLinked wrote:

I thought at one point I read something about how articles could be
compressed in the spool in INN, or some other news software, but I
can't find that now, aside from compressing just overview. (Not
thinking of compression for transit here but compression at rest.)

Digging around a bit, I found a thread from 1990 in news.software.b
which discussed this[1], consensus that news articles individually
tend to be small enough that compressing them is not worth the effort.

The economics favor the opposite on both counts with any modern system. I.e.
the data would be compressible (especially with a trained dictionary),
and the "effort" is borderline free and optimizes harder-to-scale axes like
I/O bandwidth and page cache.

Certainly on busy systems, it makes little sense, but for the use case
where disk space is limited and CPU is not fully utilized normally
anyways, it seems attractive, assuming you don't mind losing the
ability to easily grep through things or use other utilities with the
spool (which, as recently mentioned, are less common now).

Again economy has flipped and busy systems benefit more if it is done intelligently.

Is anyone aware of a news package that has ever supported compression
of articles in the spool? My theory is that articles today are larger
than 35-40 years ago, both due to more headers and larger bodies. I
did some

I would question that intuition, I don't think article size has changed much.

Here is a histogram on my tradspool:
<512B:       788,863  (0.95%)
512B-1K:   5,108,243  (6.16%)
1-2K:     31,046,227 (37.43%)
2-4K:     26,416,756 (31.85%)
4-8K:     11,161,756 (13.46%)
8-16K:     4,478,769  (5.40%)
16-32K:    1,522,232  (1.84%)
32-64K:      358,406  (0.43%)
>64K:          5,959  (0.01%)

Total: ~82.9 million articles

Well, this is helpful.

Of course, I just realized that compressing files that are already less
than the block size (4 KB typically) probably won't save anything, so compressing such articles would be pointless, in a tradspool
environment. According to your histogram, that's about 74% of articles
that won't compress in a useful way unless the block size is changed.

Compressing the other articles would still help if it could shave off at
least one block. But even for files > 4 KB if the compressed result is
the same size on disk then that's probably no good either.
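
To make the block argument concrete, here is a small sketch of the on-disk arithmetic. It assumes a 4 KiB block size and a filesystem with no inline-data or tail-packing tricks; the bucket percentages are copied from the histogram quoted above, and the helper names are hypothetical.

```python
import math

BLOCK = 4096  # assumed filesystem block size in bytes

# Percent of articles per size bucket, copied from the histogram quoted above.
# Only the buckets that lie entirely below 4 KiB are listed.
UNDER_4K_PCT = {"<512B": 0.95, "512B-1K": 6.16, "1-2K": 37.43, "2-4K": 31.85}


def blocks_used(size: int, block: int = BLOCK) -> int:
    """Number of whole blocks a file of `size` bytes occupies."""
    return max(1, math.ceil(size / block))


def bytes_saved_on_disk(original: int, compressed: int, block: int = BLOCK) -> int:
    """Disk space actually freed, counted in whole blocks."""
    return (blocks_used(original, block) - blocks_used(compressed, block)) * block


if __name__ == "__main__":
    print(f"articles under 4K: {sum(UNDER_4K_PCT.values()):.2f}%")
    print(bytes_saved_on_disk(930, 739))    # both sizes fit in one block
    print(bytes_saved_on_disk(5730, 3649))  # two blocks shrink to one
```

With the two sample articles from earlier in the thread, the 930B article frees nothing on disk, while the 5,730B article drops from two blocks to one.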

(note I cutoff at 52k, the larger articles would've been sucked when I
was filling out some history on a couple groups)

tests on some articles I picked at random, and most seem to be in the
1KB-6KB range. I compressed a 930B article to 739B (20% savings) and a
5,730B article to 3,649B (37% savings). 25-35% seems typical.
Conceivably, on a system with limited disk space, one could increase
retention by maybe around 25% just with this trick. Certainly, at some
point it starts to look attractive.

I'm toying with the concept for tradspool, though I imagine with CNFS
or other multi-article files, compression would be even more effective.

Yes, something like CNFS will result in greater gains because an
untrained dictionary will span a larger working set.

Otherwise a lot of Usenet articles are small enough that you really need
a trained dictionary to get any ratio on individual articles.

That makes sense. Train a dictionary across all the uncompressed
articles in a spool to get something representative for Usenet, and then
use that when compressing individual articles.

Even then, I suspect it may not be enough to eliminate at least one block for
many articles. So I think 25% was a bit optimistic on my part, maybe
10-20% is more realistic.

I use block level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
improvement. The major issue with this setup is the filesystem metadata
and data block per article, especially for small articles (the FS has no
support for packing small files into the metadata node, stuffing
multiple small files into one data node, etc.). CNFS would result in a
much higher ratio.

Yeah, CNFS would be better, but I don't want to do CNFS for other
reasons, so I'm trying to think what I could do given the limitations of many
small files.

From wollman@wollman@hergotha.csail.mit.edu (Garrett Wollman) to news.software.nntp on Thu Jun 18 15:14:31 2026

From Newsgroup: news.software.nntp

In article <1110ubg$2ld8b$1@dont-email.me>,
InterLinked <nntp@phreaknet.org> wrote:

On 6/18/2026 12:14 AM, Kevin Bowling wrote:
Even then, I suspect it may not be enough to eliminate at least one block for
many articles. So I think 25% was a bit optimistic on my part, maybe
10-20% is more realistic.

I use block level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
improvement. The major issue with this setup is the filesystem metadata
and data block per article, especially for small articles (the FS has no
support for packing small files into the metadata node, stuffing
multiple small files into one data node, etc.). CNFS would result in a
much higher ratio.

Tradspool here, and I still get pretty good compression with LZ4:

NAME RATIO
tank/root/usr/local/news 1.88x
tank/root/usr/local/news/db 1.77x
tank/root/usr/local/news/spool 1.89x
tank/root/usr/local/news/tmp 1.00x

It's a small server and I don't carry binaries, so that should give a
decent idea of what LZ4 on actual text will give you. ("1.89x" is
what others would call "47%".) This is with 90-day retention, the
whole of /usr/local/news is less than 10 GiB.
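
For anyone comparing the ratio ZFS reports against the percentage savings quoted elsewhere in the thread, the conversion is a one-liner. A minimal sketch; the function name is made up for illustration.

```python
def ratio_to_savings_pct(ratio: float) -> float:
    """Convert a compression ratio such as 1.89 (shown as "1.89x") to percent space saved."""
    return (1 - 1 / ratio) * 100


if __name__ == "__main__":
    # Ratios taken from the dataset listing above.
    for name, ratio in [("news", 1.88), ("db", 1.77), ("spool", 1.89), ("tmp", 1.00)]:
        print(f"{name}: {ratio:.2f}x = {ratio_to_savings_pct(ratio):.0f}% saved")
```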

There's more than enough CPU to use zstd or gzip but what's the point?
This is just our default zpool configuration, and there are a bunch of optimizations in ZFS that sit on top of `compression=on`.

-GAWollman
--
Garrett A. Wollman    | "Act to avoid constraining the future; if you can,
wollman@bimajority.org| act to remove constraint from the future. This is
Opinions not shared by| a thing you can do, are able to do, to do together."
my employers.         | - Graydon Saunders, _A Succession of Bad Days_ (2015)

From Russ Allbery@eagle@eyrie.org to news.software.nntp on Thu Jun 18 09:05:40 2026

From Newsgroup: news.software.nntp

InterLinked <nntp@phreaknet.org> writes:

I'm toying with the concept for tradspool, though I imagine with CNFS or other multi-article files, compression would be even more effective.

As various people have implicitly noted but perhaps not said explicitly,
one of the things that has changed since the original news servers were
written is that compression is now a file system feature for some file
systems.

I would not, in 2026, implement application-level compression of things
you're storing as simple files on disk unless you have enough knowledge to
use some application-specific compression mechanism (so, for instance,
image compression is still a different matter). If you're just using
standard general-purpose compression mechanisms like zstd, your time is
almost certainly better spent finding an underlying file system that
handles the compression for you. The file system knows things like its
block layout strategy and can make good choices about opportunistic
compression that the news software is not in a position to make.

It might still be worthwhile in some cases to compress blobs stored in databases or other similar cases, but even there I'd want to benchmark
against a file system that natively implements compression.

ZFS is probably the most mature file system with this feature, but running
it on Linux can be a little complicated due to boring licensing reasons.
btrfs is another option for transparent compression if you don't feel like dealing with ZFS, but isn't as mature. (That said, I've been using btrfs
on all my personal devices for years now without any trouble, although I
have avoided running a disk entirely out of space, which many people say
btrfs doesn't always handle well.)

If you feel like maximizing drama, excitement, and exposure to esoteric
free software governance conflicts that generate 100-post forum threads,
you could use bcachefs. :)
--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 17:38:52 2026

From Newsgroup: news.software.nntp

On 6/18/2026 11:14 AM, Garrett Wollman wrote:

In article <1110ubg$2ld8b$1@dont-email.me>,
InterLinked <nntp@phreaknet.org> wrote:

On 6/18/2026 12:14 AM, Kevin Bowling wrote:
Even then, I suspect it may not be enough to eliminate at least one block for
many articles. So I think 25% was a bit optimistic on my part, maybe
10-20% is more realistic.

I use block level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
improvement. The major issue with this setup is the filesystem metadata
and data block per article, especially for small articles (the FS has no
support for packing small files into the metadata node, stuffing
multiple small files into one data node, etc.). CNFS would result in a
much higher ratio.

Tradspool here, and I still get pretty good compression with LZ4:

NAME RATIO
tank/root/usr/local/news 1.88x
tank/root/usr/local/news/db 1.77x
tank/root/usr/local/news/spool 1.89x
tank/root/usr/local/news/tmp 1.00x

It's a small server and I don't carry binaries, so that should give a
decent idea of what LZ4 on actual text will give you. ("1.89x" is
what others would call "47%".) This is with 90-day retention, the
whole of /usr/local/news is less than 10 GiB.

There's more than enough CPU to use zstd or gzip but what's the point?
This is just our default zpool configuration, and there are a bunch of optimizations in ZFS that sit on top of `compression=on`.

The point is, given a finite amount of disk space, it would allow for
storing more news or having longer retention, effectively for "free".

From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 17:55:24 2026

From Newsgroup: news.software.nntp

On 6/18/2026 12:05 PM, Russ Allbery wrote:

InterLinked <nntp@phreaknet.org> writes:

I'm toying with the concept for tradspool, though I imagine with CNFS or
other multi-article files, compression would be even more effective.

As various people have implicitly noted but perhaps not said explicitly,
one of the things that has changed since the original news servers were written is that compression is now a file system feature for some file systems.

I would not, in 2026, implement application-level compression of things you're storing as simple files on disk unless you have enough knowledge to
use some application-specific compression mechanism (so, for instance,
image compression is still a different matter). If you're just using
standard general-purpose compression mechanisms like zstd, your time is almost certainly better spent finding an underlying file system that
handles the compression for you. The file system knows things like its
block layout strategy and can make good choices about opportunistic compression that the news software is not in a position to make.

It might still be worthwhile in some cases to compress blobs stored in databases or other similar cases, but even there I'd want to benchmark against a file system that natively implements compression.

I would agree letting the file system do it is probably better if
possible, but that might not always be an option - see below.

ZFS is probably the most mature file system with this feature, but running
it on Linux can be a little complicated due to boring licensing reasons. btrfs is another option for transparent compression if you don't feel like dealing with ZFS, but isn't as mature. (That said, I've been using btrfs
on all my personal devices for years now without any trouble, although I
have avoided running a disk entirely out of space, which many people say btrfs doesn't always handle well.)

The other constraint with file systems is if you can use them at all. My Internet facing news server is just one service running on a Digital
Ocean droplet, and those only support ext4 and xfs for the primary disk.

In this case, the only way to increase the number of articles I can
store for "free" is to do userland compression of the articles - and
I'm using tradspool for granular expiration (or infinite retention)
depending on group, so that means per-file compression since CNFS would
not even be an option.

zstd seems to work even better than compress; in some cases up to 50% compression for a single article even without a custom dictionary. I get
even better results with a custom dictionary trained on the articles
from a few disparate groups, sometimes up to 90%, but I worry about overfitting as inspecting the dictionary it seems to have large portions
of entire messages (perhaps those which were quoted most), which I don't
think would generalize well in the long run. Would need to experiment
with that further and over time.
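
The effect of a shared dictionary on small articles can be sketched without zstd at all. The example below uses the preset-dictionary support in Python's standard zlib module as a stand-in for a trained zstd dictionary; the dictionary contents and the sample article are invented for illustration, and real results will depend on how the dictionary is built.

```python
import zlib
from typing import Optional

# Hand-written stand-in for a trained dictionary: strings that recur in
# many article headers. A real dictionary would be built from spool samples.
HEADER_DICT = (
    b"Path: \r\nFrom: \r\nNewsgroups: news.software.nntp\r\nSubject: Re: \r\n"
    b"Date: \r\nMessage-ID: <\r\nReferences: <\r\nOrganization: \r\n"
    b"MIME-Version: 1.0\r\nContent-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 7bit\r\nUser-Agent: \r\nXref: \r\n"
)

# Invented sample article, small enough that plain compression gains little.
ARTICLE = (
    b"Path: example.net!not-for-mail\r\n"
    b"From: Example User <user@example.net>\r\n"
    b"Newsgroups: news.software.nntp\r\n"
    b"Subject: Re: Spool compression\r\n"
    b"Message-ID: <abc123@example.net>\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    b"Short reply body.\r\n"
)


def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate `data`, optionally priming the compressor with a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(data) + c.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inverse of compress(); the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


if __name__ == "__main__":
    plain = compress(ARTICLE)
    primed = compress(ARTICLE, HEADER_DICT)
    print(len(ARTICLE), len(plain), len(primed))
    assert decompress(primed, HEADER_DICT) == ARTICLE
```

One trade-off to keep in mind: an article compressed with a dictionary cannot be read back without that exact dictionary, so the dictionary has to be versioned and kept for as long as any article written with it remains in the spool.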

It seems likely that if I just compressed articles larger than 4 KB, I
could get the vast majority of articles to fit in just 1 block, rather
than possibly 2 or 3. Obviously there's no point in compressing articles
less than 4 KB so those could be left alone.
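
That policy could look roughly like the following. It is a sketch only: zlib from the Python standard library stands in for zstd, a 4 KiB block size is assumed, and the function name and return convention are hypothetical rather than anything INN provides.

```python
import math
import zlib
from typing import Tuple

BLOCK = 4096  # assumed filesystem block size in bytes


def blocks(size: int) -> int:
    """Whole blocks needed to store `size` bytes."""
    return max(1, math.ceil(size / BLOCK))


def maybe_compress(article: bytes) -> Tuple[bytes, bool]:
    """Return (bytes to store, compressed?), compressing only if it frees a block."""
    if len(article) <= BLOCK:
        return article, False  # already fits in one block; nothing to gain
    packed = zlib.compress(article, 9)
    if blocks(len(packed)) < blocks(len(article)):
        return packed, True
    return article, False  # same block count, so keep the article greppable


if __name__ == "__main__":
    small = b"x" * 1000
    large = b"Usenet article text line\r\n" * 300  # 7,800 bytes, two blocks
    print(maybe_compress(small)[1], maybe_compress(large)[1])
```

The stored file would still need some marker (a filename suffix or a magic number check, for instance) so the reader knows whether to decompress it; that part is left out here.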

I realize this is probably another "feature" that has limited appeal...
but I'm sure there are a few others who may be using a VPS and want to
get the most out of it.

From Russ Allbery@eagle@eyrie.org to news.software.nntp on Thu Jun 18 16:24:23 2026

From Newsgroup: news.software.nntp

InterLinked <nntp@phreaknet.org> writes:

The other constraint with file systems is if you can use them at all. My Internet facing news server is just one service running on a Digital
Ocean droplet, and those only support ext4 and xfs for the primary disk.

Ah! Yes, okay, I hadn't considered that, and that's going to be a
constraint.

zstd seems to work even better than compress; in some cases up to 50% compression for a single article even without a custom dictionary.

Yeah, zstd is what I'd use these days.

I realize this is probably another "feature" that has limited appeal...
but I'm sure there are a few others who may be using a VPS and want to
get the most out of it.

The entirety of Usenet is features with limited appeal. :) I assume you're doing this as a hobby to have fun, and in that case I heartily encourage
you to do whatever makes you happy, including implementing things no one
else cares about. I have had so much fun in my life doing that.

I'm only giving you design advice as if this were a work project because I
find it fun to kick around design questions. You should feel entirely free
to ignore me and do something that sounds more entertaining or rewarding
or just satisfying!
--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.

From SyberShock!@spamtrap_usenet_001@sybershock.com to news.software.nntp on Thu Jun 18 23:27:52 2026

From Newsgroup: news.software.nntp

On Wed, 17 Jun 2026 23:11:57 -0400
InterLinked <nntp@phreaknet.org> wrote:

<snip>

I wouldn't bother with userland compression tools since that is one more thing that can break. Also be warned that the GNU coreutils and standard tools have been re-written in Rust and are all severely buggy and borked. I wouldn't touch any compression tool with a Rust-rewrite dependency.

You could create a BTRFS or ZFS partition or image file and mount it with one of those file systems then place your spool on that volume. Enable automatic filesystem compression then go play golf or whatever.

If you are running on ext4 you can just create and mount an image file and format it for BTRFS or ZFS. I have used BTRFS image volumes that allowed me to fit 1.2+ TB of data on a 500GB drive.

Grass touch time +1.


From SyberShock!@spamtrap_usenet_001@sybershock.com to news.software.nntp on Thu Jun 18 23:40:15 2026

From Newsgroup: news.software.nntp

On Thu, 18 Jun 2026 10:11:55 -0400
InterLinked <nntp@phreaknet.org> wrote:

Train a dictionary across all the uncompressed
articles in a spool to get something representative for Usenet, and then
use that when compressing individual articles.

Replace header names with single-character tokens and omit the colon as an understood character.

Snarf and hash article siglines and store one copy of each unique sig in a hash-named file and use the hash as macro placeholder.

Snarf 'From:' data with the same hash technique.

Detect duplicate article bodies and blocks via hashing, and store one copy with a hash pointer.

Assign a base62 shortcode to every known path host.

Assign shortcodes to 'Message-ID' domain names and 'Organization:' names and 'Xref:' hostnames.

Convert 'Xref:' integers to hex or base58.

Convert all 'Date:' headers to epoch seconds then use a shortcode token for the least significant digits of the epoch date.

Create 3 or 4 byte base62 shortcodes for all newsgroup names in the 'Newsgroups:' header.
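
As an illustration of the first suggestion only, header-name tokenization might be sketched as below. The token table is invented for this example, control characters are used as tokens so they cannot collide with real header names, and folded (multi-line) headers are not handled.

```python
# Hypothetical token table: one control character per well-known header name.
TOKENS = {
    "Path": "\x01", "From": "\x02", "Newsgroups": "\x03", "Subject": "\x04",
    "Date": "\x05", "Message-ID": "\x06", "References": "\x07", "Xref": "\x08",
}
NAMES = {tok: name for name, tok in TOKENS.items()}


def encode_header(line: str) -> str:
    """Replace 'Name: value' with '<token>value' when the name is in the table."""
    name, sep, value = line.partition(": ")
    if sep and name in TOKENS:
        return TOKENS[name] + value
    return line  # unknown header: store unchanged


def decode_header(line: str) -> str:
    """Expand a leading token back to 'Name: value'."""
    if line and line[0] in NAMES:
        return f"{NAMES[line[0]]}: {line[1:]}"
    return line


if __name__ == "__main__":
    original = "Subject: Re: Spool compression"
    packed = encode_header(original)
    print(len(original), len(packed))
    assert decode_header(packed) == original
```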


From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 20:28:03 2026

From Newsgroup: news.software.nntp

On 6/18/2026 7:24 PM, Russ Allbery wrote:

InterLinked <nntp@phreaknet.org> writes:

The other constraint with file systems is if you can use them at all. My
Internet facing news server is just one service running on a Digital
Ocean droplet, and those only support ext4 and xfs for the primary disk.

Ah! Yes, okay, I hadn't considered that, and that's going to be a
constraint.

zstd seems to work even better than compress; in some cases up to 50%
compression for a single article even without a custom dictionary.

Yeah, zstd is what I'd use these days.

I've already been impressed. I'll probably do some prototyping with the command line tool and then using the zstd library directly if I like
where that goes. And see if I can build a good dictionary.

I realize this is probably another "feature" that has limited appeal...
but I'm sure there are a few others who may be using a VPS and want to
get the most out of it.

The entirety of Usenet is features with limited appeal. :) I assume you're doing this as a hobby to have fun, and in that case I heartily encourage
you to do whatever makes you happy, including implementing things no one
else cares about. I have had so much fun in my life doing that.

I'm only giving you design advice as if this were a work project because I find it fun to kick around design questions. You should feel entirely free
to ignore me and do something that sounds more entertaining or rewarding
or just satisfying!

Yeah, I wouldn't be doing this if I didn't enjoy it to some extent, yes,
but it's not "just for fun" either - having it all work and with
reasonably sound design is important too, hence why I've been soliciting
other input to check my blind spots. Lower priority than having email
work, certainly, but given the sheer volume of even a small number of
groups, I'm paying more attention to some of the minutiae so this
doesn't blow up in production.

And not unlike in the corporate environment at times, there is no more
budget allocation for this and it still needs to work reasonably well :)

From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 20:33:26 2026

From Newsgroup: news.software.nntp

On 6/18/2026 7:27 PM, SyberShock! wrote:

On Wed, 17 Jun 2026 23:11:57 -0400
InterLinked <nntp@phreaknet.org> wrote:

<snip>

I wouldn't bother with userland compression tools since that is one more thing that can break. Also be warned that the GNU coreutils and standard tools have been re-written in Rust and are all severely buggy and borked. I wouldn't touch any compression tool with a Rust-rewrite dependency.

I've heard about that and related issues on some mailing lists recently. Fortunately I don't think Debian has gone that route yet. And zstd isn't
a coreutil AFAIK either though it is pre-installed on Debian, conveniently.

You could create a BTRFS or ZFS partition or image file and mount it with one of those file systems then place your spool on that volume. Enable automatic filesystem compression then go play golf or whatever.

If you are running on ext4 you can just create and mount an image file and format it for BTRFS or ZFS. I have used BTRFS image volumes that allowed me to fit 1.2+ TB of data on a 500GB drive.

That sounds interesting, though I've also heard before that running ZFS
on top of another system like ext4 is a bad idea; admittedly, I don't understand the nuances well enough to understand why but given ZFS likes
to be its own complete solution I could see that.

From Russ Allbery@eagle@eyrie.org to news.software.nntp on Thu Jun 18 17:56:00 2026

From Newsgroup: news.software.nntp

SyberShock! <spamtrap_usenet_001@sybershock.com> writes:

InterLinked <nntp@phreaknet.org> wrote:

Train a dictionary across all the uncompressed articles in a spool to
get something representative for Usenet, and then use that when
compressing individual articles.

Replace header names with single-character tokens and omit the colon as
an understood character.

That immediately reminded me of:

https://cr.yp.to/sarcasm/modest-proposal.txt
--
Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

Please post questions rather than mailing me directly.
<https://www.eyrie.org/~eagle/faqs/questions.html> explains why.