• Spool compression

    From InterLinked@nntp@phreaknet.org to news.software.nntp on Wed Jun 17 23:11:57 2026
    From Newsgroup: news.software.nntp

    I thought at one point I read something about how articles could be
    compressed in the spool in INN, or some other news software, but I can't
    find that now, aside from compressing just overview. (Not thinking of compression for transit here but compression at rest.)

    Digging around a bit, I found a thread from 1990 in news.software.b
    which discussed this[1], consensus that news articles individually tend
    to be small enough that compressing them is not worth the effort.
    Certainly on busy systems, it makes little sense, but for the use case
    where disk space is limited and CPU is not fully utilized normally
    anyways, it seems attractive, assuming you don't mind losing the ability
    to easily grep through things or use other utilities with the spool
    (which, as recently mentioned, are less common now).

    Is anyone aware of a news package that has ever supported compression of articles in the spool? My theory is that articles today are larger than
    35-40 years ago, both due to more headers and larger bodies. I did some
    tests on some articles I picked at random, and most seem to be in the
    1KB-6KB range. I compressed a 930B article to 739B (20% savings) and a
    5,730B article to 3,649B (37% savings). 25-35% seems typical.
    Conceivably, on a system with limited disk space, one could increase
    retention by maybe around 25% just with this trick. Certainly, at some
    point it starts to look attractive.
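
    A minimal sketch of how such a measurement could be reproduced. The post does not say which tool produced the figures, so zlib is used here purely as a stand-in codec; the function names are illustrative.

```python
import zlib


def savings_percent(original_size: int, compressed_size: int) -> float:
    """Return the space saved by compression, as a percentage of the original."""
    return 100.0 * (1.0 - compressed_size / original_size)


def measure(article: bytes) -> tuple[int, int, float]:
    """Compress one article with zlib (a stand-in codec) and report the result."""
    packed = zlib.compress(article, 9)
    return len(article), len(packed), savings_percent(len(article), len(packed))


# The two size pairs quoted in the post:
print(round(savings_percent(930, 739), 1))    # 20.5
print(round(savings_percent(5730, 3649), 1))  # 36.3
```

    Calling measure() on the raw bytes of an article file gives its original size, compressed size, and percentage saved.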

    I'm toying with the concept for tradspool, though I imagine with CNFS or
    other multi-article files, compression would be even more effective.

    [1] https://groups.google.com/g/news.software.b/c/jDdQlTLzzIw/m/XIWEHQpvgWsJ

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Kevin Bowling@kevin.bowling@kev009.com to news.software.nntp on Wed Jun 17 21:14:22 2026
    From Newsgroup: news.software.nntp

    On 6/17/26 20:11, InterLinked wrote:
    I thought at one point I read something about how articles could be compressed in the spool in INN, or some other news software, but I can't find that now, aside from compressing just overview. (Not thinking of compression for transit here but compression at rest.)

    Digging around a bit, I found a thread from 1990 in news.software.b
    which discussed this[1], consensus that news articles individually tend
    to be small enough that compressing them is not worth the effort.

    The economics favor the opposite on both counts with any modern system.
    I.e., the data would be compressible (especially with a trained
    dictionary), and the "effort" is borderline free and optimizes axes that
    are harder to scale, like I/O bandwidth and page cache.

    Certainly on busy systems, it makes little sense, but for the use case
    where disk space is limited and CPU is not fully utilized normally
    anyways, it seems attractive, assuming you don't mind losing the ability
    to easily grep through things or use other utilities with the spool
    (which, as recently mentioned, are less common now).

    Again, the economics have flipped, and busy systems benefit more if it is done intelligently.


    Is anyone aware of a news package that has ever supported compression of articles in the spool? My theory is that articles today are larger than 35-40 years ago, both due to more headers and larger bodies. I did some

    I would question that intuition; I don't think article size has changed
    much.

    Here is a histogram on my tradspool:
    <512B:       788,863  (0.95%)
    512B-1K:   5,108,243  (6.16%)
    1-2K:     31,046,227 (37.43%)
    2-4K:     26,416,756 (31.85%)
    4-8K:     11,161,756 (13.46%)
    8-16K:     4,478,769  (5.40%)
    16-32K:    1,522,232  (1.84%)
    32-64K:      358,406  (0.43%)
    64K:           5,959  (0.01%)

    Total: ~82.9 million articles

    (note I cut off at 52k; the larger articles would've been sucked when I
    was filling out some history on a couple groups)
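
    A histogram like the one above could be gathered with something along these lines. This is only a sketch: it assumes the usual tradspool layout of one numbered file per article, and the "64K+" label for the last bucket is an assumption.

```python
import os
from collections import Counter

# Bucket upper bounds in bytes, mirroring the labels in the histogram above.
BUCKETS = [
    (512, "<512B"), (1024, "512B-1K"), (2048, "1-2K"), (4096, "2-4K"),
    (8192, "4-8K"), (16384, "8-16K"), (32768, "16-32K"), (65536, "32-64K"),
]


def bucket_for(size: int) -> str:
    """Return the label of the first bucket whose upper bound exceeds size."""
    for upper, label in BUCKETS:
        if size < upper:
            return label
    return "64K+"  # assumed label for everything at or above 64K


def spool_histogram(spool_root: str) -> Counter:
    """Count article files under spool_root by size bucket."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(spool_root):
        for name in filenames:
            # tradspool stores one article per file, named by article number
            if name.isdigit():
                size = os.path.getsize(os.path.join(dirpath, name))
                counts[bucket_for(size)] += 1
    return counts
```

    Non-numeric files are skipped so that anything else living in the spool tree does not distort the counts.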
    tests on some articles I picked at random, and most seem to be in the 1KB-6KB range. I compressed a 930B article to 739B (20% savings) and a 5,730B article to 3,649B (37% savings). 25-35% seems typical.
    Conceivably, on a system with limited disk space, one could increase retention by maybe around 25% just with this trick. Certainly, at some
    point it starts to look attractive.

    I'm toying with the concept for tradspool, though I imagine with CNFS or other multi-article files, compression would be even more effective.

    Yes, something like CNFS will result in greater gains because an
    untrained dictionary will span a larger working set.

    Otherwise a lot of Usenet articles are small enough that you really need
    a trained dictionary to get any ratio on individual articles.
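
    A rough way to see this effect, using made-up sample articles and zlib as a stand-in compressor (neither is from the thread): compress the same articles one at a time, then as a single concatenated stream.

```python
import zlib

# Hypothetical sample data: 200 short articles sharing typical header boilerplate.
articles = [
    (
        "Path: news.example.org!not-for-mail\r\n"
        "From: user{0}@example.org\r\n"
        "Newsgroups: news.software.nntp\r\n"
        "Subject: Re: Spool compression\r\n"
        "Message-ID: <{0}@example.org>\r\n"
        "\r\n"
        "Reply number {0} in the thread.\r\n"
    ).format(n).encode()
    for n in range(200)
]

raw = sum(len(a) for a in articles)
# Each article compressed on its own: no history shared between articles.
per_article = sum(len(zlib.compress(a)) for a in articles)
# All articles compressed as one stream: later articles reuse earlier matches.
as_one_stream = len(zlib.compress(b"".join(articles)))

print(raw, per_article, as_one_stream)
```

    The single stream comes out far smaller because the compressor can refer back to headers it has already seen, which is the advantage a multi-article container would have.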

    I use block-level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
    improvement. The major issue with this setup is the filesystem metadata
    and data block per article, especially for small articles (the FS has no
    support for packing small files into the metadata node, stuffing
    multiple small files into one data node, etc.). CNFS would result in a
    much higher ratio.

    [1] https://groups.google.com/g/news.software.b/c/jDdQlTLzzIw/m/XIWEHQpvgWsJ

  • From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 10:11:55 2026
    From Newsgroup: news.software.nntp

    On 6/18/2026 12:14 AM, Kevin Bowling wrote:
    On 6/17/26 20:11, InterLinked wrote:
    I thought at one point I read something about how articles could be
    compressed in the spool in INN, or some other news software, but I
    can't find that now, aside from compressing just overview. (Not
    thinking of compression for transit here but compression at rest.)

    Digging around a bit, I found a thread from 1990 in news.software.b
    which discussed this[1], consensus that news articles individually
    tend to be small enough that compressing them is not worth the effort.

    The economics favor the opposite on both counts with any modern system.
    I.e., the data would be compressible (especially with a trained
    dictionary), and the "effort" is borderline free and optimizes axes that
    are harder to scale, like I/O bandwidth and page cache.

    Certainly on busy systems, it makes little sense, but for the use case
    where disk space is limited and CPU is not fully utilized normally
    anyways, it seems attractive, assuming you don't mind losing the
    ability to easily grep through things or use other utilities with the
    spool (which, as recently mentioned, are less common now).

    Again, the economics have flipped, and busy systems benefit more if it is done intelligently.


    Is anyone aware of a news package that has ever supported compression
    of articles in the spool? My theory is that articles today are larger
    than 35-40 years ago, both due to more headers and larger bodies. I
    did some

    I would question that intuition; I don't think article size has changed much.

    Here is a histogram on my tradspool:
    <512B:       788,863  (0.95%)
    512B-1K:   5,108,243  (6.16%)
    1-2K:     31,046,227 (37.43%)
    2-4K:     26,416,756 (31.85%)
    4-8K:     11,161,756 (13.46%)
    8-16K:     4,478,769  (5.40%)
    16-32K:    1,522,232  (1.84%)
    32-64K:      358,406  (0.43%)
    64K:           5,959  (0.01%)

    Total: ~82.9 million articles

    Well, this is helpful.

    Of course, I just realized that compressing files that are already less
    than the block size (4 KB typically) probably won't save anything, so compressing such articles would be pointless, in a tradspool
    environment. According to your histogram, that's about 74% of articles
    that won't compress in a useful way unless the block size is changed.

    Compressing the other articles would still help if it could shave off at
    least one block. But even for files > 4 KB if the compressed result is
    the same size on disk then that's probably no good either.
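
    The block arithmetic can be sketched as follows, assuming a 4096-byte block and ignoring inode and metadata overhead. The last lines sum the posted histogram percentages for the four buckets under 4K, which comes to roughly 76%, in the same region as the figure mentioned above.

```python
import math

BLOCK_SIZE = 4096  # assumed filesystem block size, as in the post


def blocks_used(size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Data blocks a file of this size occupies (metadata overhead ignored)."""
    return math.ceil(size_bytes / block_size)


def blocks_saved(original: int, compressed: int, block_size: int = BLOCK_SIZE) -> int:
    """Whole blocks freed by storing the compressed form instead."""
    return blocks_used(original, block_size) - blocks_used(compressed, block_size)


print(blocks_saved(930, 739))    # 0: both sizes fit in one block
print(blocks_saved(5730, 3649))  # 1: two blocks shrink to one
print(blocks_saved(7000, 4500))  # 0: still two blocks after compression

# Share of articles under 4K, from the posted histogram percentages
under_4k = 0.95 + 6.16 + 37.43 + 31.85
print(round(under_4k, 2))  # 76.39
```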

    (note I cut off at 52k; the larger articles would've been sucked when I
    was filling out some history on a couple groups)
    tests on some articles I picked at random, and most seem to be in the
    1KB-6KB range. I compressed a 930B article to 739B (20% savings) and a
    5,730B article to 3,649B (37% savings). 25-35% seems typical.
    Conceivably, on a system with limited disk space, one could increase
    retention by maybe around 25% just with this trick. Certainly, at some
    point it starts to look attractive.

    I'm toying with the concept for tradspool, though I imagine with CNFS
    or other multi-article files, compression would be even more effective.

    Yes, something like CNFS will result in greater gains because an
    untrained dictionary will span a larger working set.

    Otherwise a lot of Usenet articles are small enough that you really need
    a trained dictionary to get any ratio on individual articles.

    That makes sense. Train a dictionary across all the uncompressed
    articles in a spool to get something representative for Usenet, and then
    use that when compressing individual articles.

    Even then, I suspect it may not be enough to eliminate at least one block for
    many articles. So I think 25% was a bit optimistic on my part; maybe
    10-20% is more realistic.
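
    The shared-dictionary idea can be illustrated with zlib's preset dictionary support. This is not zstd's trained dictionary, just the closest standard-library analogue; the dictionary contents and the sample article below are invented for the example.

```python
import zlib

# Hypothetical "dictionary": header text that most articles share. A real
# trained dictionary would be built from a sample of actual articles.
DICTIONARY = (
    b"Path: news.example.org!not-for-mail\r\n"
    b"Newsgroups: news.software.nntp\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Message-ID: <@example.org>\r\nSubject: Re: \r\nFrom: \r\nDate: \r\n"
)


def compress_with_dict(article: bytes) -> bytes:
    """Compress one article, priming the compressor with DICTIONARY."""
    c = zlib.compressobj(level=9, zdict=DICTIONARY)
    return c.compress(article) + c.flush()


def decompress_with_dict(blob: bytes) -> bytes:
    """Reverse compress_with_dict; the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=DICTIONARY)
    return d.decompress(blob) + d.flush()


article = (
    b"Path: news.example.org!not-for-mail\r\n"
    b"From: someone@example.org\r\n"
    b"Newsgroups: news.software.nntp\r\n"
    b"Subject: Re: Spool compression\r\n"
    b"\r\nShort reply.\r\n"
)
plain = zlib.compress(article, 9)
with_dict = compress_with_dict(article)
assert decompress_with_dict(with_dict) == article
print(len(article), len(plain), len(with_dict))
```

    One design consequence worth noting: the dictionary becomes part of the storage format, so it has to be kept (and versioned) for as long as any article compressed with it remains in the spool.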

    I use block-level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
    improvement. The major issue with this setup is the filesystem metadata
    and data block per article, especially for small articles (the FS has no
    support for packing small files into the metadata node, stuffing
    multiple small files into one data node, etc.). CNFS would result in a
    much higher ratio.

    Yeah, CNFS would be better, but I don't want to use CNFS for other
    reasons, so I'm trying to think about what I could do given the
    limitations of many small files.
  • From wollman@wollman@hergotha.csail.mit.edu (Garrett Wollman) to news.software.nntp on Thu Jun 18 15:14:31 2026
    From Newsgroup: news.software.nntp

    In article <1110ubg$2ld8b$1@dont-email.me>,
    InterLinked <nntp@phreaknet.org> wrote:
    On 6/18/2026 12:14 AM, Kevin Bowling wrote:
    Even then, I suspect it may not be enough to eliminate at least one block for
    many articles. So I think 25% was a bit optimistic on my part; maybe
    10-20% is more realistic.

    I use block-level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
    improvement. The major issue with this setup is the filesystem metadata
    and data block per article, especially for small articles (the FS has no
    support for packing small files into the metadata node, stuffing
    multiple small files into one data node, etc.). CNFS would result in a
    much higher ratio.

    Tradspool here, and I still get pretty good compression with LZ4:

    NAME                            RATIO
    tank/root/usr/local/news        1.88x
    tank/root/usr/local/news/db     1.77x
    tank/root/usr/local/news/spool  1.89x
    tank/root/usr/local/news/tmp    1.00x

    It's a small server and I don't carry binaries, so that should give a
    decent idea of what LZ4 on actual text will give you. ("1.89x" is
    what others would call "47%".) This is with 90-day retention, the
    whole of /usr/local/news is less than 10 GiB.
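
    The conversion between the two ways of quoting compression is simple arithmetic; a small sketch (the function name is illustrative):

```python
def ratio_to_savings_percent(ratio: float) -> float:
    """Convert a compression ratio such as 1.89 into percent of space saved."""
    return 100.0 * (1.0 - 1.0 / ratio)


print(round(ratio_to_savings_percent(1.89)))  # 47
print(round(ratio_to_savings_percent(1.00)))  # 0
```

    In other words, a ratio of 1.89x means the data occupies 1/1.89, or about 53%, of its original size.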

    There's more than enough CPU to use zstd or gzip but what's the point?
    This is just our default zpool configuration, and there are a bunch of optimizations in ZFS that sit on top of `compression=on`.

    -GAWollman
    --
    Garrett A. Wollman    | "Act to avoid constraining the future; if you can,
    wollman@bimajority.org| act to remove constraint from the future. This is
    Opinions not shared by| a thing you can do, are able to do, to do together."
    my employers.         | - Graydon Saunders, _A Succession of Bad Days_ (2015)
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Thu Jun 18 09:05:40 2026
    From Newsgroup: news.software.nntp

    InterLinked <nntp@phreaknet.org> writes:

    I'm toying with the concept for tradspool, though I imagine with CNFS or other multi-article files, compression would be even more effective.

    As various people have implicitly noted but perhaps not said explicitly,
    one of the things that has changed since the original news servers were
    written is that compression is now a file system feature for some file
    systems.

    I would not, in 2026, implement application-level compression of things
    you're storing as simple files on disk unless you have enough knowledge to
    use some application-specific compression mechanism (so, for instance,
    image compression is still a different matter). If you're just using
    standard general-purpose compression mechanisms like zstd, your time is
    almost certainly better spent finding an underlying file system that
    handles the compression for you. The file system knows things like its
    block layout strategy and can make good choices about opportunistic
    compression that the news software is not in a position to make.

    It might still be worthwhile in some cases to compress blobs stored in databases or other similar cases, but even there I'd want to benchmark
    against a file system that natively implements compression.

    ZFS is probably the most mature file system with this feature, but running
    it on Linux can be a little complicated due to boring licensing reasons.
    btrfs is another option for transparent compression if you don't feel like dealing with ZFS, but isn't as mature. (That said, I've been using btrfs
    on all my personal devices for years now without any trouble, although I
    have avoided running a disk entirely out of space, which many people say
    btrfs doesn't always handle well.)

    If you feel like maximizing drama, excitement, and exposure to esoteric
    free software governance conflicts that generate 100-post forum threads,
    you could use bcachefs. :)
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 17:38:52 2026
    From Newsgroup: news.software.nntp

    On 6/18/2026 11:14 AM, Garrett Wollman wrote:
    In article <1110ubg$2ld8b$1@dont-email.me>,
    InterLinked <nntp@phreaknet.org> wrote:
    On 6/18/2026 12:14 AM, Kevin Bowling wrote:
    Even then, I suspect it may not be enough to eliminate at least one block for
    many articles. So I think 25% was a bit optimistic on my part; maybe
    10-20% is more realistic.

    I use block-level LZ4 (ZFS) on my tradspool, and it is maybe a 10%
    improvement. The major issue with this setup is the filesystem metadata
    and data block per article, especially for small articles (the FS has no
    support for packing small files into the metadata node, stuffing
    multiple small files into one data node, etc.). CNFS would result in a
    much higher ratio.

    Tradspool here, and I still get pretty good compression with LZ4:

    NAME                            RATIO
    tank/root/usr/local/news        1.88x
    tank/root/usr/local/news/db     1.77x
    tank/root/usr/local/news/spool  1.89x
    tank/root/usr/local/news/tmp    1.00x

    It's a small server and I don't carry binaries, so that should give a
    decent idea of what LZ4 on actual text will give you. ("1.89x" is
    what others would call "47%".) This is with 90-day retention, the
    whole of /usr/local/news is less than 10 GiB.

    There's more than enough CPU to use zstd or gzip but what's the point?
    This is just our default zpool configuration, and there are a bunch of optimizations in ZFS that sit on top of `compression=on`.

    The point is, given a finite amount of disk space, it would allow for
    storing more news or having longer retention, effectively for "free".
  • From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 17:55:24 2026
    From Newsgroup: news.software.nntp

    On 6/18/2026 12:05 PM, Russ Allbery wrote:
    InterLinked <nntp@phreaknet.org> writes:

    I'm toying with the concept for tradspool, though I imagine with CNFS or
    other multi-article files, compression would be even more effective.

    As various people have implicitly noted but perhaps not said explicitly,
    one of the things that has changed since the original news servers were written is that compression is now a file system feature for some file systems.

    I would not, in 2026, implement application-level compression of things you're storing as simple files on disk unless you have enough knowledge to
    use some application-specific compression mechanism (so, for instance,
    image compression is still a different matter). If you're just using
    standard general-purpose compression mechanisms like zstd, your time is almost certainly better spent finding an underlying file system that
    handles the compression for you. The file system knows things like its
    block layout strategy and can make good choices about opportunistic compression that the news software is not in a position to make.

    It might still be worthwhile in some cases to compress blobs stored in databases or other similar cases, but even there I'd want to benchmark against a file system that natively implements compression.

    I would agree letting the file system do it is probably better if
    possible, but that might not always be an option - see below.

    ZFS is probably the most mature file system with this feature, but running
    it on Linux can be a little complicated due to boring licensing reasons. btrfs is another option for transparent compression if you don't feel like dealing with ZFS, but isn't as mature. (That said, I've been using btrfs
    on all my personal devices for years now without any trouble, although I
    have avoided running a disk entirely out of space, which many people say btrfs doesn't always handle well.)

    The other constraint with file systems is if you can use them at all. My Internet facing news server is just one service running on a Digital
    Ocean droplet, and those only support ext4 and xfs for the primary disk.

    In this case, the only way to increase the amount of articles I can
    store for "free" is to do user land compression of the articles - and
    I'm using tradspool for granular expiration (or infinite retention)
    depending on group, so that means per-file compression since CNFS would
    not even be an option.

    zstd seems to work even better than compress; in some cases up to 50% compression for a single article even without a custom dictionary. I get
    even better results with a custom dictionary trained on the articles
    from a few disparate groups, sometimes up to 90%, but I worry about overfitting as inspecting the dictionary it seems to have large portions
    of entire messages (perhaps those which were quoted most), which I don't
    think would generalize well in the long run. Would need to experiment
    with that further and over time.

    It seems likely that if I just compressed articles larger than 4 KB, I
    could get the vast majority of articles to fit in just 1 block, rather
    than possibly 2 or 3. Obviously there's no point in compressing articles
    less than 4 KB so those could be left alone.
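
    That policy could look roughly like this. It is a sketch only: zlib stands in for zstd, a 4096-byte block is assumed, and encode_for_spool is a hypothetical helper, not part of any news package.

```python
import math
import zlib

BLOCK_SIZE = 4096  # assumed filesystem block size


def blocks(n: int) -> int:
    """Data blocks needed for n bytes."""
    return math.ceil(n / BLOCK_SIZE)


def encode_for_spool(article: bytes) -> tuple[bool, bytes]:
    """Return (is_compressed, payload); compress only if it frees a block."""
    if len(article) <= BLOCK_SIZE:
        return False, article            # already fits in one block
    packed = zlib.compress(article, 6)   # zlib as a stand-in for zstd
    if blocks(len(packed)) < blocks(len(article)):
        return True, packed
    return False, article                # no block saved, keep it greppable
```

    The reader side would then need some way to tell the two forms apart, for example a filename suffix or a magic-number check; which one to use is left open here.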

    I realize this is probably another "feature" that has limited appeal...
    but I'm sure there are a few others who may be using a VPS and want to
    get the most out of it.
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Thu Jun 18 16:24:23 2026
    From Newsgroup: news.software.nntp

    InterLinked <nntp@phreaknet.org> writes:

    The other constraint with file systems is if you can use them at all. My Internet facing news server is just one service running on a Digital
    Ocean droplet, and those only support ext4 and xfs for the primary disk.

    Ah! Yes, okay, I hadn't considered that, and that's going to be a
    constraint.

    zstd seems to work even better than compress; in some cases up to 50% compression for a single article even without a custom dictionary.

    Yeah, zstd is what I'd use these days.

    I realize this is probably another "feature" that has limited appeal...
    but I'm sure there are a few others who may be using a VPS and want to
    get the most out of it.

    The entirety of Usenet is features with limited appeal. :) I assume you're doing this as a hobby to have fun, and in that case I heartily encourage
    you to do whatever makes you happy, including implementing things no one
    else cares about. I have had so much fun in my life doing that.

    I'm only giving you design advice as if this were a work project because I
    find it fun to kick around design questions. You should feel entirely free
    to ignore me and do something that sounds more entertaining or rewarding
    or just satisfying!
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From SyberShock!@spamtrap_usenet_001@sybershock.com to news.software.nntp on Thu Jun 18 23:27:52 2026
    From Newsgroup: news.software.nntp

    On Wed, 17 Jun 2026 23:11:57 -0400
    InterLinked <nntp@phreaknet.org> wrote:

    <snip>

    I wouldn't bother with userland compression tools since that is one more thing that can break. Also be warned that the GNU coreutils and standard tools have been re-written in Rust and are all severely buggy and borked. I wouldn't touch any compression tool with a Rust-rewrite dependency.

    You could create a BTRFS or ZFS partition or image file and mount it with one of those file systems then place your spool on that volume. Enable automatic filesystem compression then go play golf or whatever.

    If you are running on ext4 you can just create and mount an image file and format it for BTRFS or ZFS. I have used BTRFS image volumes that allowed me to fit 1.2+ TB of data on a 500GB drive.

    Grass touch time +1.

  • From SyberShock!@spamtrap_usenet_001@sybershock.com to news.software.nntp on Thu Jun 18 23:40:15 2026
    From Newsgroup: news.software.nntp

    On Thu, 18 Jun 2026 10:11:55 -0400
    InterLinked <nntp@phreaknet.org> wrote:

    Train a dictionary across all the uncompressed
    articles in a spool to get something representative for Usenet, and then
    use that when compressing individual articles.

    Replace header names with single-character tokens and omit the colon as an understood character.

    Snarf and hash article siglines and store one copy of each unique sig in a hash-named file and use the hash as macro placeholder.

    Snarf 'From:' data with the same hash technique.

    Detect duplicate article bodies and blocks via hashing, and store one copy with a hash pointer.

    Assign a base62 shortcode to every known path host.

    Assign shortcodes to 'Message-ID' domain names and 'Organization:' names and 'Xref:' hostnames.

    Convert 'Xref:' integers to hex or base58.

    Convert all 'Date:' headers to epoch seconds then use a shortcode token for the least significant digits of the epoch date.

    Create 3 or 4 byte base62 shortcodes for all newsgroup names in the 'Newsgroups:' header.
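
    A few of the building blocks in the list above (base62 shortcodes, epoch dates, hash-named signatures) could be sketched like this. The alphabet order, the choice of SHA-256, and the 16-character key length are all assumptions made for illustration.

```python
import hashlib
import string
from email.utils import parsedate_to_datetime

# 62 symbols: 0-9, A-Z, a-z (the ordering is an arbitrary choice)
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62(n: int) -> str:
    """Encode a non-negative integer as a base62 shortcode."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def date_to_epoch(date_header: str) -> int:
    """Convert the value of a 'Date:' header to epoch seconds."""
    return int(parsedate_to_datetime(date_header).timestamp())


def sig_key(signature: str) -> str:
    """Hash a signature block so one stored copy can be shared by reference."""
    return hashlib.sha256(signature.encode()).hexdigest()[:16]


print(base62(62))                                        # 10
print(date_to_epoch("Thu, 01 Jan 1970 00:00:00 +0000"))  # 0
```

    Any scheme like this needs an exact inverse mapping so the original article can be reconstructed byte for byte when it is served.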

  • From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 20:28:03 2026
    From Newsgroup: news.software.nntp

    On 6/18/2026 7:24 PM, Russ Allbery wrote:
    InterLinked <nntp@phreaknet.org> writes:

    The other constraint with file systems is if you can use them at all. My
    Internet facing news server is just one service running on a Digital
    Ocean droplet, and those only support ext4 and xfs for the primary disk.

    Ah! Yes, okay, I hadn't considered that, and that's going to be a
    constraint.

    zstd seems to work even better than compress; in some cases up to 50%
    compression for a single article even without a custom dictionary.

    Yeah, zstd is what I'd use these days.

    I've already been impressed. I'll probably do some prototyping with the command line tool and then using the zstd library directly if I like
    where that goes. And see if I can build a good dictionary.

    I realize this is probably another "feature" that has limited appeal...
    but I'm sure there are a few others who may be using a VPS and want to
    get the most out of it.

    The entirety of Usenet is features with limited appeal. :) I assume you're doing this as a hobby to have fun, and in that case I heartily encourage
    you to do whatever makes you happy, including implementing things no one
    else cares about. I have had so much fun in my life doing that.

    I'm only giving you design advice as if this were a work project because I find it fun to kick around design questions. You should feel entirely free
    to ignore me and do something that sounds more entertaining or rewarding
    or just satisfying!

    Yeah, I wouldn't be doing this if I didn't enjoy it to some extent, yes,
    but it's not "just for fun" either - having it all work and with
    reasonably sound design is important too, hence why I've been soliciting
    other input to check my blind spots. Lower priority than having email
    work, certainly, but given the sheer volume of even a small number of
    groups, I'm paying more attention to some of the minutiae so this
    doesn't blow up in production.

    And not unlike in the corporate environment at times, there is no more
    budget allocation for this and it still needs to work reasonably well :)
  • From InterLinked@nntp@phreaknet.org to news.software.nntp on Thu Jun 18 20:33:26 2026
    From Newsgroup: news.software.nntp

    On 6/18/2026 7:27 PM, SyberShock! wrote:
    On Wed, 17 Jun 2026 23:11:57 -0400
    InterLinked <nntp@phreaknet.org> wrote:

    <snip>

    I wouldn't bother with userland compression tools since that is one more thing that can break. Also be warned that the GNU coreutils and standard tools have been re-written in Rust and are all severely buggy and borked. I wouldn't touch any compression tool with a Rust-rewrite dependency.

    I've heard about that and related issues on some mailing lists recently. Fortunately I don't think Debian has gone that route yet. And zstd isn't
    a coreutil AFAIK either though it is pre-installed on Debian, conveniently.

    You could create a BTRFS or ZFS partition or image file and mount it with one of those file systems then place your spool on that volume. Enable automatic filesystem compression then go play golf or whatever.

    If you are running on ext4 you can just create and mount an image file and format it for BTRFS or ZFS. I have used BTRFS image volumes that allowed me to fit 1.2+ TB of data on a 500GB drive.

    That sounds interesting, though I've also heard before that running ZFS
    on top of another system like ext4 is a bad idea; admittedly, I don't understand the nuances well enough to understand why but given ZFS likes
    to be its own complete solution I could see that.
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Thu Jun 18 17:56:00 2026
    From Newsgroup: news.software.nntp

    SyberShock! <spamtrap_usenet_001@sybershock.com> writes:
    InterLinked <nntp@phreaknet.org> wrote:

    Train a dictionary across all the uncompressed articles in a spool to
    get something representative for Usenet, and then use that when
    compressing individual articles.

    Replace header names with single-character tokens and omit the colon as
    an understood character.

    That immediately reminded me of:

    https://cr.yp.to/sarcasm/modest-proposal.txt
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.