• inn 2.7.0, expire takes ages and hangs

    From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Tue Oct 29 13:47:54 2024
    From Newsgroup: news.software.nntp

    Hi,

    Users started complaining that they're getting "400 Expiring process ..." errors on my server. I looked and it's true, expire was running since it
    was started from the cron at night, but it was hanged. strace showed that expire process (expire -v1, called from news.daily) was waiting for data
    on a socket:

    $ strace -f -p 28158
    strace: Process 28158 attached
    select(6, [5], NULL, NULL, {tv_sec=28, tv_usec=42890}

    It just waits on select, and does nothing else.

    $ ls -l /proc/28158/fd/
    total 0
    lr-x------ 1 news news 64 Oct 29 14:23 0 -> 'pipe:[318604403]'
    l-wx------ 1 news news 64 Oct 29 14:23 1 -> 'pipe:[318607484]'
    lrwx------ 1 news news 64 Oct 29 14:23 10 -> /usr/local/news/spool/cnfs/cnfs1 lrwx------ 1 news news 64 Oct 29 14:23 11 -> /usr/local/news/spool/cnfs/cnfs2 lrwx------ 1 news news 64 Oct 29 14:23 12 -> /usr/local/news/spool/cnfs/cnfs3 lrwx------ 1 news news 64 Oct 29 14:23 13 -> /usr/local/news/spool/cnfs/cnfs4 lrwx------ 1 news news 64 Oct 29 14:23 14 -> /usr/local/news/spool/cnfs/cnfs5 lrwx------ 1 news news 64 Oct 29 14:23 15 -> /usr/local/news/spool/cnfs/junk1 lrwx------ 1 news news 64 Oct 29 14:23 16 -> /usr/local/news/spool/cnfs/ctrl1 lrwx------ 1 news news 64 Oct 29 14:23 17 -> /usr/local/news/spool/cnfs/priv1 lrwx------ 1 news news 64 Oct 29 14:23 18 -> /usr/local/news/spool/cnfs/big81 lrwx------ 1 news news 64 Oct 29 14:23 19 -> /usr/local/news/spool/cnfs/alt1 l-wx------ 1 news news 64 Oct 29 14:23 2 -> 'pipe:[318607484]'
    lrwx------ 1 news news 64 Oct 29 14:23 20 -> /usr/local/news/spool/cnfs/spam1 lrwx------ 1 news news 64 Oct 29 14:23 21 -> /usr/local/news/spool/cnfs/spam2 l-wx------ 1 news news 64 Oct 29 14:23 22 -> /usr/local/news/db/history.n lr-x------ 1 news news 64 Oct 29 14:23 23 -> /usr/local/news/db/history.n lr-x------ 1 news news 64 Oct 29 14:23 24 -> /usr/local/news/db/history lrwx------ 1 news news 64 Oct 29 14:23 3 -> '/tmp/#7734312 (deleted)' lrwx------ 1 news news 64 Oct 29 14:23 4 -> 'socket:[318607485]'
    lrwx------ 1 news news 64 Oct 29 14:23 5 -> 'socket:[318607492]'
    lr-x------ 1 news news 64 Oct 29 14:23 6 -> /usr/local/news/db/history lrwx------ 1 news news 64 Oct 29 14:23 7 -> /usr/local/news/db/history.n.dir lrwx------ 1 news news 64 Oct 29 14:23 8 -> /usr/local/news/db/history.n.index lrwx------ 1 news news 64 Oct 29 14:23 9 -> /usr/local/news/db/history.n.hash

    $ cat /proc/28158/fdinfo/5
    pos: 0
    flags: 02
    mnt_id: 9
    scm_fds: 0

    I killed the process, restarted the server, and it seems to be fine
    (unless it's not?).

    So now questions:

    1. Is it possible that I messed anything up by forcefully killing it?

    2. Why did it hang / how can I diagnose it / can I diagnose it?

    3. I added noexpire to news.daily cron line to avoid it in the future. I
    guess there will be no consequences in my setup, as:

    a) I'm using huge CNFS buffers for bulk storage (around 100 GB),
    b) I'm using timehash for low-volume local groups (currently 53 MB),
    c) my expire.ctl is set to never expire ("*:A:never:never:never").

    Am I correct?

    4. What is responsible for expiring history? Expireover or expire? I'd
    guess the former (which I still keep enabled), but now I'm not sure
    anymore.

    BTW, I remember that it happened before, here's the sample report, just
    nobody complained then (which is strange, because people are using this
    server constantly, maybe it wasn't throttled then?). Note the time it took expire to run:

    expire begin Sat Oct 19 03:08:38 CEST 2024: (-v1)
    Article lines processed 6057586
    Articles retained 6051793
    Entries expired 5793
    expire end Sat Oct 19 16:15:51 CEST 2024

    One earlier run was even longer:

    expire begin Thu Oct 10 03:10:09 CEST 2024: (-v1)
    Article lines processed 6012924
    Articles retained 6007938
    Entries expired 4986
    expire end Fri Oct 11 19:58:42 CEST 2024

    Most of the runs take at most two minutes, like this one:

    expire begin Mon Oct 7 03:06:06 CEST 2024: (-v1)
    Article lines processed 5986522
    Articles retained 5979903
    Entries expired 6619
    expire end Mon Oct 7 03:07:51 CEST 2024

    Another question: why does it expire anything if I have expiration
    disabled? I guess "Entries expired" should be 0 in all cases? Or is it
    about removing history entries for articles that are no longer in CNFS (or that are too old, I have "remember" set to 721: "/remember/:721")?

    Thanks.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Tue Oct 29 14:07:59 2024
    From Newsgroup: news.software.nntp

    Adam W. <aw@somewhere.invalid> wrote:

    2. Why did it hang / how can I diagnose it / can I diagnose it?

    I'll answer myself... this might be some hint:

    expire begin Tue Oct 29 03:08:30 CET 2024: (-v1)
    expire: cannot reserve server
    Article lines processed 0
    Articles retained 0
    Entries expired 0
    So it tried to reserve the server, and hanged there.

    I'm looking at ICCcommand code and I think this condition is wrong:

    #v+
    /* No data -- if we timed out, return. */
    if (ICCtimeout) {
    #v-

    If ICCtimeout is zero (and it is, because expire never calls ICCsettimeout
    and 120 second timeout is used then), then this block will never be
    entered. I guess the block was there, someone added a default timeout to
    the line before select (T.tv_sec = ...) to prevent select() from hanging forever if the app didn't specify the timeout, but forgot to remove this
    "if" (the block should execute always, not be guarded by "if")?

    So I guess we have two issues here. First: why there's no response to the command, and how to diagnose it? Second: the bug (if I understand it correctly) in the above code.

    The problem is that I can't really do any disruptive tests as it's the production server, it has to work (currently it's the most used news
    server in Poland).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Matija Nalis@mnalis-news@voyager.hr to news.software.nntp on Sat Nov 2 17:27:57 2024
    From Newsgroup: news.software.nntp

    On Tue, 29 Oct 2024 13:47:54 -0000 (UTC), Adam W. <aw@somewhere.invalid> wrote:
    $ strace -f -p 28158
    strace: Process 28158 attached
    select(6, [5], NULL, NULL, {tv_sec=28, tv_usec=42890}

    It just waits on select, and does nothing else.

    Usually I'd use lsof(8) to see what is on the other side of that socket, and try to continue debugging there.

    1. Is it possible that I messed anything up by forcefully killing it?

    Was it regular TERM or KILL (-9) signal?
    But I guess it should be fine in either case, since it was paused and was
    not updating anything on the disk.

    2. Why did it hang / how can I diagnose it / can I diagnose it?

    you can add "-v level" to increase verbosity of expire(8).

    3. I added noexpire to news.daily cron line to avoid it in the future. I guess there will be no consequences in my setup, as:

    <snip>

    well, yes in the short term, but see below...

    4. What is responsible for expiring history? Expireover or expire? I'd
    guess the former (which I still keep enabled), but now I'm not sure
    anymore.

    According to expire(8), it should be the one expiring the history.

    So, if you added "noexpire" to news.daily (which avoids calling expire(8) AFAICT), your history database would likely grow unbounded (unless you trim
    it yourself elsewhere)

    Another question: why does it expire anything if I have expiration
    disabled?

    And how exactly did you have "expiration disabled"?
    You still seemed (at the time) to be running expire(8) via cron or
    something (news.daily?)

    Now that you have added "noexpire" to cron.daily, I would expect that
    expire(8) would not be called at all.
    Or do you have reason to believe otherwise?
    --
    Opinions above are GNU-copylefted.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From =?UTF-8?Q?Julien_=C3=89LIE?=@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sat Nov 2 21:41:58 2024
    From Newsgroup: news.software.nntp

    Hi Adam,

    3. I added noexpire to news.daily cron line to avoid it in the future. I guess there will be no consequences in my setup

    expire rebuilds the history file, and makes the database more efficient because it resizes it depending on the number of stored articles. I
    would therefore suggest to keep running expire.


    4. What is responsible for expiring history? Expireover or expire?

    The expire program cleans up and removes old entries in the history
    database.


    expire begin Sat Oct 19 03:08:38 CEST 2024: (-v1)
    Article lines processed 6057586
    Articles retained 6051793
    Entries expired 5793
    expire end Sat Oct 19 16:15:51 CEST 2024

    I don't know why it took so much time. It definitively shouldn't have.


    Another question: why does it expire anything if I have expiration
    disabled? I guess "Entries expired" should be 0 in all cases? Or is it
    about removing history entries for articles that are no longer in CNFS (or that are too old, I have "remember" set to 721: "/remember/:721")?

    expire indeed removes entries for articles no longer present in your
    spool because they were cancelled or superseded for instance for your
    timehash spool, or your CNFS buffers have wrapped. Also, the history
    keeps track of Message-IDs rejected by filters like cleanfeed, pyClean
    or like. These Message-IDs are kept during /remember/ days.
    --
    Julien |eLIE

    -2-aEx nihilo nihil.-a-+ (Perse)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Mon Nov 4 22:38:12 2024
    From Newsgroup: news.software.nntp

    Matija Nalis <mnalis-news@voyager.hr> wrote:

    It just waits on select, and does nothing else.

    Usually I'd use lsof(8) to see what is on the other side of that socket, and try to continue debugging there.

    Good idea, I forgot about this possibility, and I wanted to bring server
    back up quickly.

    1. Is it possible that I messed anything up by forcefully killing it?

    Was it regular TERM or KILL (-9) signal?

    TERM, expire process accepted it.

    2. Why did it hang / how can I diagnose it / can I diagnose it?

    you can add "-v level" to increase verbosity of expire(8).

    Nice. I just read the manual (should've done it earlier...), now it's more clear.

    Another question: why does it expire anything if I have expiration
    disabled?

    And how exactly did you have "expiration disabled"?

    I mean the article expiration, not history. I use CNFS, so articles expire only when buffers roll over, and I don't forcefully expire my timehash (at least for now). So the only expiration I'd expect is cleaning the history
    (but probably these statistics are also about it).

    For some reason I was convinced that expire only expires articles, and expireover takes care of the overview AND history. Which doesn't make
    sense, as overview can be disabled on transit servers, but they still need history...

    Or do you have reason to believe otherwise?

    No, it's clear now.

    So, it seems I still need expire to trim the history.

    I think I'll make the copy of the server (I'll just copy the running
    instance, it might be inconsistent but it won't matter), reconfigure its
    feeds (so my server feeds the copy), and experiment on a copy. Lots of
    data to transfer over the Internet, but I don't have a better idea on how
    else to diagnose it...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Mon Nov 4 22:39:51 2024
    From Newsgroup: news.software.nntp

    Julien +LIE <iulius@nom-de-mon-site.com.invalid> wrote:

    expire rebuilds the history file, and makes the database more efficient because it resizes it depending on the number of stored articles. I
    would therefore suggest to keep running expire.

    Thanks Julien. It's clear now. I'll keep digging then...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From aw@aw@somewhere.invalid (Adam W.) to news.software.nntp on Tue Nov 5 15:50:47 2024
    From Newsgroup: news.software.nntp

    Matija Nalis <mnalis-news@voyager.hr> wrote:

    Usually I'd use lsof(8) to see what is on the other side of that socket, and try to continue debugging there.

    What I know now is that the socket is:

    /usr/local/news/run/ctlinndcKoXl2

    It appears when expire process starts and disappears when it terminates.
    In strace I see that this path is sent to the /usr/local/news/run/control socket, owned by innd.

    I guess that's how ctlinnd works...

    So it's possible that something went wrong between expire and innd,
    probably in innd itself... some race?

    I see that innd is waiting on four sockets now, one of them being /usr/local/news/run/control, another /usr/local/news/run/nntpin, and
    two other being IPv4 and IPv6 listening sockets.

    Maybe there's some race in that select loop, maybe if one of these sockets
    is readable in just the wrong time, the other is ignored... I'd have to
    dig deeper into inn's source.

    What I'm surprised is that I'm the first person to stumble upon this bug,
    and my setup is pretty typical -- normal modern Debian system, vanilla inn without any modifications, etc.

    you can add "-v level" to increase verbosity of expire(8).

    Unfortunately all it seems to be doing is printing history lines and what
    it's doing with them. Might be useful, but I don't know if in this case.

    I'm testing it, but it doesn't want to hang yet. Hopefully it will. I'm
    also thinking of stress-testing just the ctlinnd part of it.
    --- Synchronet 3.21a-Linux NewsLink 1.2