Forum: Too Lazy BBS

Who's Online
Recent Visitors
- Kawasu
  Thu Oct 16 10:17:15 2025
  from Mena, Ar via Telnet
- Geek2
  Thu Oct 16 06:39:58 2025
  from Euclid, Oh via Telnet
- Amr
  Tue Oct 14 21:13:21 2025
  from Fayetteville, Nc via Telnet
- Amr
  Tue Oct 14 20:34:34 2025
  from Fayetteville, Nc via Telnet

System Info

Sysop:	Amessyroom
Location:	Fayetteville, NC
Users:	27
Nodes:	6 (0 / 6)
Uptime:	40:23:46
Calls:	631
Calls today:	2
Files:	1,187
D/L today:	24 files (29,813K bytes)
Messages:	174,391

inn 2.7.0, expire takes ages and hangs

From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Tue Oct 29 13:47:54 2024

From Newsgroup: news.software.nntp

Hi,

Users started complaining that they're getting "400 Expiring process ..." errors on my server. I looked and it's true, expire was running since it
was started from the cron at night, but it was hanged. strace showed that expire process (expire -v1, called from news.daily) was waiting for data
on a socket:

$ strace -f -p 28158
strace: Process 28158 attached
select(6, [5], NULL, NULL, {tv_sec=28, tv_usec=42890}

It just waits on select, and does nothing else.

$ ls -l /proc/28158/fd/
total 0
lr-x------ 1 news news 64 Oct 29 14:23 0 -> 'pipe:[318604403]'
l-wx------ 1 news news 64 Oct 29 14:23 1 -> 'pipe:[318607484]'
lrwx------ 1 news news 64 Oct 29 14:23 10 -> /usr/local/news/spool/cnfs/cnfs1 lrwx------ 1 news news 64 Oct 29 14:23 11 -> /usr/local/news/spool/cnfs/cnfs2 lrwx------ 1 news news 64 Oct 29 14:23 12 -> /usr/local/news/spool/cnfs/cnfs3 lrwx------ 1 news news 64 Oct 29 14:23 13 -> /usr/local/news/spool/cnfs/cnfs4 lrwx------ 1 news news 64 Oct 29 14:23 14 -> /usr/local/news/spool/cnfs/cnfs5 lrwx------ 1 news news 64 Oct 29 14:23 15 -> /usr/local/news/spool/cnfs/junk1 lrwx------ 1 news news 64 Oct 29 14:23 16 -> /usr/local/news/spool/cnfs/ctrl1 lrwx------ 1 news news 64 Oct 29 14:23 17 -> /usr/local/news/spool/cnfs/priv1 lrwx------ 1 news news 64 Oct 29 14:23 18 -> /usr/local/news/spool/cnfs/big81 lrwx------ 1 news news 64 Oct 29 14:23 19 -> /usr/local/news/spool/cnfs/alt1 l-wx------ 1 news news 64 Oct 29 14:23 2 -> 'pipe:[318607484]'
lrwx------ 1 news news 64 Oct 29 14:23 20 -> /usr/local/news/spool/cnfs/spam1 lrwx------ 1 news news 64 Oct 29 14:23 21 -> /usr/local/news/spool/cnfs/spam2 l-wx------ 1 news news 64 Oct 29 14:23 22 -> /usr/local/news/db/history.n lr-x------ 1 news news 64 Oct 29 14:23 23 -> /usr/local/news/db/history.n lr-x------ 1 news news 64 Oct 29 14:23 24 -> /usr/local/news/db/history lrwx------ 1 news news 64 Oct 29 14:23 3 -> '/tmp/#7734312 (deleted)' lrwx------ 1 news news 64 Oct 29 14:23 4 -> 'socket:[318607485]'
lrwx------ 1 news news 64 Oct 29 14:23 5 -> 'socket:[318607492]'
lr-x------ 1 news news 64 Oct 29 14:23 6 -> /usr/local/news/db/history lrwx------ 1 news news 64 Oct 29 14:23 7 -> /usr/local/news/db/history.n.dir lrwx------ 1 news news 64 Oct 29 14:23 8 -> /usr/local/news/db/history.n.index lrwx------ 1 news news 64 Oct 29 14:23 9 -> /usr/local/news/db/history.n.hash

$ cat /proc/28158/fdinfo/5
pos: 0
flags: 02
mnt_id: 9
scm_fds: 0

I killed the process, restarted the server, and it seems to be fine
(unless it's not?).

So now questions:

1. Is it possible that I messed anything up by forcefully killing it?

2. Why did it hang / how can I diagnose it / can I diagnose it?

3. I added noexpire to news.daily cron line to avoid it in the future. I
guess there will be no consequences in my setup, as:

a) I'm using huge CNFS buffers for bulk storage (around 100 GB),
b) I'm using timehash for low-volume local groups (currently 53 MB),
c) my expire.ctl is set to never expire ("*:A:never:never:never").

Am I correct?

4. What is responsible for expiring history? Expireover or expire? I'd
guess the former (which I still keep enabled), but now I'm not sure
anymore.

BTW, I remember that it happened before, here's the sample report, just
nobody complained then (which is strange, because people are using this
server constantly, maybe it wasn't throttled then?). Note the time it took expire to run:

expire begin Sat Oct 19 03:08:38 CEST 2024: (-v1)
Article lines processed 6057586
Articles retained 6051793
Entries expired 5793
expire end Sat Oct 19 16:15:51 CEST 2024

One earlier run was even longer:

expire begin Thu Oct 10 03:10:09 CEST 2024: (-v1)
Article lines processed 6012924
Articles retained 6007938
Entries expired 4986
expire end Fri Oct 11 19:58:42 CEST 2024

Most of the runs take at most two minutes, like this one:

expire begin Mon Oct 7 03:06:06 CEST 2024: (-v1)
Article lines processed 5986522
Articles retained 5979903
Entries expired 6619
expire end Mon Oct 7 03:07:51 CEST 2024

Another question: why does it expire anything if I have expiration
disabled? I guess "Entries expired" should be 0 in all cases? Or is it
about removing history entries for articles that are no longer in CNFS (or that are too old, I have "remember" set to 721: "/remember/:721")?

Thanks.
--- Synchronet 3.21a-Linux NewsLink 1.2

From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Tue Oct 29 14:07:59 2024

From Newsgroup: news.software.nntp

Adam W. <aw@somewhere.invalid> wrote:

2. Why did it hang / how can I diagnose it / can I diagnose it?

I'll answer myself... this might be some hint:

expire begin Tue Oct 29 03:08:30 CET 2024: (-v1)
expire: cannot reserve server
Article lines processed 0
Articles retained 0
Entries expired 0
So it tried to reserve the server, and hanged there.

I'm looking at ICCcommand code and I think this condition is wrong:

#v+
/* No data -- if we timed out, return. */
if (ICCtimeout) {
#v-

If ICCtimeout is zero (and it is, because expire never calls ICCsettimeout
and 120 second timeout is used then), then this block will never be
entered. I guess the block was there, someone added a default timeout to
the line before select (T.tv_sec = ...) to prevent select() from hanging forever if the app didn't specify the timeout, but forgot to remove this
"if" (the block should execute always, not be guarded by "if")?

So I guess we have two issues here. First: why there's no response to the command, and how to diagnose it? Second: the bug (if I understand it correctly) in the above code.

The problem is that I can't really do any disruptive tests as it's the production server, it has to work (currently it's the most used news
server in Poland).
--- Synchronet 3.21a-Linux NewsLink 1.2

From Matija Nalis@mnalis-news@voyager.hr to news.software.nntp on Sat Nov 2 17:27:57 2024

From Newsgroup: news.software.nntp

On Tue, 29 Oct 2024 13:47:54 -0000 (UTC), Adam W. <aw@somewhere.invalid> wrote:

$ strace -f -p 28158
strace: Process 28158 attached
select(6, [5], NULL, NULL, {tv_sec=28, tv_usec=42890}

It just waits on select, and does nothing else.

Usually I'd use lsof(8) to see what is on the other side of that socket, and try to continue debugging there.

1. Is it possible that I messed anything up by forcefully killing it?

Was it regular TERM or KILL (-9) signal?
But I guess it should be fine in either case, since it was paused and was
not updating anything on the disk.

2. Why did it hang / how can I diagnose it / can I diagnose it?

you can add "-v level" to increase verbosity of expire(8).

3. I added noexpire to news.daily cron line to avoid it in the future. I guess there will be no consequences in my setup, as:

<snip>

well, yes in the short term, but see below...

4. What is responsible for expiring history? Expireover or expire? I'd
guess the former (which I still keep enabled), but now I'm not sure
anymore.

According to expire(8), it should be the one expiring the history.

So, if you added "noexpire" to news.daily (which avoids calling expire(8) AFAICT), your history database would likely grow unbounded (unless you trim
it yourself elsewhere)

Another question: why does it expire anything if I have expiration
disabled?

And how exactly did you have "expiration disabled"?
You still seemed (at the time) to be running expire(8) via cron or
something (news.daily?)

Now that you have added "noexpire" to cron.daily, I would expect that
expire(8) would not be called at all.
Or do you have reason to believe otherwise?
--
Opinions above are GNU-copylefted.
--- Synchronet 3.21a-Linux NewsLink 1.2

From =?UTF-8?Q?Julien_=C3=89LIE?=@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sat Nov 2 21:41:58 2024

From Newsgroup: news.software.nntp

Hi Adam,

3. I added noexpire to news.daily cron line to avoid it in the future. I guess there will be no consequences in my setup

expire rebuilds the history file, and makes the database more efficient because it resizes it depending on the number of stored articles. I
would therefore suggest to keep running expire.

4. What is responsible for expiring history? Expireover or expire?

The expire program cleans up and removes old entries in the history
database.

expire begin Sat Oct 19 03:08:38 CEST 2024: (-v1)
Article lines processed 6057586
Articles retained 6051793
Entries expired 5793
expire end Sat Oct 19 16:15:51 CEST 2024

I don't know why it took so much time. It definitively shouldn't have.

Another question: why does it expire anything if I have expiration
disabled? I guess "Entries expired" should be 0 in all cases? Or is it
about removing history entries for articles that are no longer in CNFS (or that are too old, I have "remember" set to 721: "/remember/:721")?

expire indeed removes entries for articles no longer present in your
spool because they were cancelled or superseded for instance for your
timehash spool, or your CNFS buffers have wrapped. Also, the history
keeps track of Message-IDs rejected by filters like cleanfeed, pyClean
or like. These Message-IDs are kept during /remember/ days.
--
Julien |eLIE

-2-aEx nihilo nihil.-a-+ (Perse)

--- Synchronet 3.21a-Linux NewsLink 1.2

From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Mon Nov 4 22:38:12 2024

From Newsgroup: news.software.nntp

Matija Nalis <mnalis-news@voyager.hr> wrote:

It just waits on select, and does nothing else.

Usually I'd use lsof(8) to see what is on the other side of that socket, and try to continue debugging there.

Good idea, I forgot about this possibility, and I wanted to bring server
back up quickly.

1. Is it possible that I messed anything up by forcefully killing it?

Was it regular TERM or KILL (-9) signal?

TERM, expire process accepted it.

2. Why did it hang / how can I diagnose it / can I diagnose it?

you can add "-v level" to increase verbosity of expire(8).

Nice. I just read the manual (should've done it earlier...), now it's more clear.

Another question: why does it expire anything if I have expiration
disabled?

And how exactly did you have "expiration disabled"?

I mean the article expiration, not history. I use CNFS, so articles expire only when buffers roll over, and I don't forcefully expire my timehash (at least for now). So the only expiration I'd expect is cleaning the history
(but probably these statistics are also about it).

For some reason I was convinced that expire only expires articles, and expireover takes care of the overview AND history. Which doesn't make
sense, as overview can be disabled on transit servers, but they still need history...

Or do you have reason to believe otherwise?

No, it's clear now.

So, it seems I still need expire to trim the history.

I think I'll make the copy of the server (I'll just copy the running
instance, it might be inconsistent but it won't matter), reconfigure its
feeds (so my server feeds the copy), and experiment on a copy. Lots of
data to transfer over the Internet, but I don't have a better idea on how
else to diagnose it...
--- Synchronet 3.21a-Linux NewsLink 1.2

From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Mon Nov 4 22:39:51 2024

From Newsgroup: news.software.nntp

Julien +LIE <iulius@nom-de-mon-site.com.invalid> wrote:

expire rebuilds the history file, and makes the database more efficient because it resizes it depending on the number of stored articles. I
would therefore suggest to keep running expire.

Thanks Julien. It's clear now. I'll keep digging then...
--- Synchronet 3.21a-Linux NewsLink 1.2

From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Tue Nov 5 15:50:47 2024

From Newsgroup: news.software.nntp

Matija Nalis <mnalis-news@voyager.hr> wrote:

Usually I'd use lsof(8) to see what is on the other side of that socket, and try to continue debugging there.

What I know now is that the socket is:

/usr/local/news/run/ctlinndcKoXl2

It appears when expire process starts and disappears when it terminates.
In strace I see that this path is sent to the /usr/local/news/run/control socket, owned by innd.

I guess that's how ctlinnd works...

So it's possible that something went wrong between expire and innd,
probably in innd itself... some race?

I see that innd is waiting on four sockets now, one of them being /usr/local/news/run/control, another /usr/local/news/run/nntpin, and
two other being IPv4 and IPv6 listening sockets.

Maybe there's some race in that select loop, maybe if one of these sockets
is readable in just the wrong time, the other is ignored... I'd have to
dig deeper into inn's source.

What I'm surprised is that I'm the first person to stumble upon this bug,
and my setup is pretty typical -- normal modern Debian system, vanilla inn without any modifications, etc.

you can add "-v level" to increase verbosity of expire(8).

Unfortunately all it seems to be doing is printing history lines and what
it's doing with them. Might be useful, but I don't know if in this case.

I'm testing it, but it doesn't want to hang yet. Hopefully it will. I'm
also thinking of stress-testing just the ctlinnd part of it.
--- Synchronet 3.21a-Linux NewsLink 1.2

Who's Online

Recent Visitors

System Info

inn 2.7.0, expire takes ages and hangs