• Re: INN laxmid clarification

    From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sat Aug 30 06:50:26 2025
    From Newsgroup: news.software.nntp

    On Jul 30, 2023 at 4:15:08 AM CDT, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:

    Hi Jesse,

    Over these past years, I have often seen questions about syntax checks.
    Maybe laxmid should allow more Message-IDs than only the ones with ".."
    and two "@"?

    Strictly speaking, a dot (".") must be followed by at least one non-special char, so <a.@b> and <a@b.> are invalid per RFC.

    I suggest changing the behaviour of laxmid so that innd accepts even
    more Message-IDs. For instance, in the common dot-atom-text syntax, just checking that we have "<", at least one non-special char, "@", at least one non-special char, and ">".
    no-fold-literal is kept untouched but dot-atom-text is changed.

    The syntax per RFC is:

    msg-id = "<" msg-id-core ">"
    msg-id-core = id-left "@" id-right
    id-left = dot-atom-text
    id-right = dot-atom-text / no-fold-literal

    dot-atom-text = 1*atext *("." 1*atext)
    no-fold-literal = "[" *mdtext "]"

    mdtext = %d33-61 /          ; The rest of the US-ASCII
             %d63-90 /          ; characters not including
             %d94-126           ; ">", "[", "]", or "\"

    atext = ALPHA / DIGIT /     ; Printable US-ASCII
            "!" / "#" /         ; characters not including
            "$" / "%" /         ; specials. Used for atoms.
            "&" / "'" /
            "*" / "+" /
            "-" / "/" /
            "=" / "?" /
            "^" / "_" /
            "`" / "{" /
            "|" / "}" /
            "~"


    laxmid would accept for innd:

    dot-atom-text = 1*(atext / "." / "@")


    At least, I think it would cope with all Message-IDs in the wild. (Are
    there ones without any "@" at all?)

    As for nnrpd, laxmid would go on having the current behaviour of
    allowing ".." and two "@" as this was a request in 2017 from a news
    admin with users having broken posting agents sending such Message-IDs.
    No need for now to allow the injection of even more broken Message-IDs.

    Any thoughts about that change?
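
    The gap between the RFC grammar and the proposed innd rule can be sketched in Python with two regular expressions. This only illustrates the ABNF quoted above; it is not INN's implementation, which is C code in lib/messageid.c. The names STRICT_MSGID, LAX_MSGID and classify are made up for the example, and the no-fold-literal alternative and any length limit are left out.

```python
import re

# atext from the ABNF above: ALPHA / DIGIT plus the listed punctuation.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"

# Strict form: dot-atom-text "@" dot-atom-text
# (the no-fold-literal alternative for id-right is omitted in this sketch).
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
STRICT_MSGID = re.compile(rf"<{DOT_ATOM}@{DOT_ATOM}>")

# Proposed lax form for innd: dot-atom-text = 1*(atext / "." / "@").
LAX_MSGID = re.compile(rf"<(?:{ATEXT}|[.@])+>")


def classify(msgid: str) -> str:
    """Return 'strict', 'lax-only' or 'invalid' for a Message-ID string."""
    if STRICT_MSGID.fullmatch(msgid):
        return "strict"
    if LAX_MSGID.fullmatch(msgid):
        return "lax-only"
    return "invalid"


for mid in ("<a@b>", "<a.@b>", "<a@b.>", "<bnews.sri-unix.2509>", "<>"):
    print(mid, classify(mid))
```

    Under these assumptions, <a.@b>, <a@b.> and a Message-ID with no "@" at all are accepted only by the lax expression.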

    Hi Julien,

    Bringing this up again because I have found BNews Message-IDs cannot be injected without modification, and there are a ton of them from various
    sources I'd rather not attempt to modify. One source is online via NNTP, so it is easy to use pullnews or suck, which would be the path of least resistance.

    <bnews.sri-unix.2509> - 435 Syntax error in message-ID
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sat Aug 30 17:50:12 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    laxmid would accept for innd:

    dot-atom-text = 1*(atext / "." / "@")

    At least, I think it would cope with all Message-IDs in the wild.

    As for nnrpd, laxmid would go on having the current behaviour of
    allowing ".." and two "@" as this was a request in 2017 from a news
    admin with users having broken posting agents sending such Message-IDs.
    No need for now to allow the injection of even more broken Message-IDs.

    Any thoughts about that change?

    Bringing this up again because I have found BNews Message-IDs cannot be injected without modification, and there are a ton of them from various sources I'd rather not attempt to modify. One source is online via NNTP, so it is easy to use pullnews or suck, which would be the path of least resistance.

    <bnews.sri-unix.2509> - 435 Syntax error in message-ID

    Sorry for having forgotten your request. I bet I was waiting for your approval of my suggested change (innd would accept 0 to 2 '@', but not nnrpd, whose behaviour would remain unchanged) before starting to work on it.

    I think the following patch will work:

    --- a/lib/messageid.c
    +++ b/lib/messageid.c
    @@ -127,8 +127,8 @@ InitializeMessageIDcclass(void)
     ** When stripspaces is true, whitespace at the beginning and at the end
     ** of MessageID are discarded.
     **
    -** When laxsyntax is true, '@' can occur twice in MessageID, and '..' is
    -** also accepted in the left part of MessageID.
    +** When laxsyntax is true, '@' can occur twice in MessageID, or never occur,
    +** and '..' is also accepted in the left part of MessageID.
     */
     bool
     IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
    @@ -155,6 +155,12 @@ IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
         /* Scan local-part: "< dot-atom-text". */
         if (*p++ != '<')
             return false;
    +
    +    /* In case there's no '@' in the Message-ID and laxsyntax is set, just
    +     * check the syntax of the Message-ID as though it had no left part. */
    +    if (laxsyntax && strchr((const char *) p, '@') == NULL)
    +        return IsValidRightPartMessageID((const char *) p, stripspaces, true);
    +
         for (;; p++) {
             if (midatomchar(*p)) {
                 while (midatomchar(*++p))



    --- a/nnrpd/post.c
    +++ b/nnrpd/post.c
    @@ -471,6 +471,10 @@ ProcessHeaders(char *idbuff, bool needmoderation)
         if (!IsValidMessageID(HDR(HDR__MESSAGEID), true, laxmid)) {
             return "Can't parse Message-ID header field body";
         }
    +    /* Do not accept a Message-ID without an '@', even if laxmid is set. */
    +    if (laxmid && strchr(HDR(HDR__MESSAGEID), '@') == NULL) {
    +        return "Missing @ in Message-ID header field body";
    +    }

         /* Set the Path header field. */
         if (HDR(HDR__PATH) == NULL || PERMaccessconf->strippath) {




    If you can confirm it suits your need, and you are now able to inject BNews articles downloaded by pullnews, it would be great.

    I'll also add a note in the documentation to warn that when laxmid is set, remote peers may reject articles with a syntactically invalid Message-ID.
    --
    Julien ÉLIE

    "Is that a good position, being a scribe?"
    "Oh, it's a sitting position." (Astérix)

  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sat Aug 30 17:26:54 2025
    From Newsgroup: news.software.nntp

    On Aug 30, 2025 at 10:50:12 AM CDT, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:

    Hi Jesse,

    laxmid would accept for innd:

    dot-atom-text = 1*(atext / "." / "@")

    At least, I think it would cope with all Message-IDs in the wild.

    As for nnrpd, laxmid would go on having the current behaviour of
    allowing ".." and two "@" as this was a request in 2017 from a news
    admin with users having broken posting agents sending such Message-IDs.
    No need for now to allow the injection of even more broken Message-IDs.

    Any thoughts about that change?

    Bringing this up again because I have found BNews Message-IDs cannot be
    injected without modification, and there are a ton of them from various
    sources I'd rather not attempt to modify. One source is online via NNTP, so
    it is easy to use pullnews or suck, which would be the path of least resistance.

    <bnews.sri-unix.2509> - 435 Syntax error in message-ID

    Sorry for having forgotten your request. I bet I was waiting for your approval
    of my suggested change (innd would accept 0 to 2 '@', but not nnrpd, whose behaviour would remain unchanged) before starting to work on it.

    If you can confirm it suits your need, and you are now able to inject BNews articles downloaded by pullnews, it would be great.

    I'll also add a note in the documentation to warn that when laxmid is set, remote peers may reject articles with a syntactically invalid Message-ID.

    This does get past the Message-ID header issue, but presents a new one with
    the Date header.

    437 Bad "Date" header field -- "Fri Jul 9 03:46:46 1982"

    I was looking at lib/date.c but it's a bit complex for me to digest. I see references in comments to "lax mode" and not sure if this is an undocumented option or maybe removed in the past?
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Sat Aug 30 11:19:20 2025
    From Newsgroup: news.software.nntp

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    This does get past the Message-ID header issue, but presents a new one with the Date header.

    437 Bad "Date" header field -- "Fri Jul 9 03:46:46 1982"

    I was looking at lib/date.c but it's a bit complex for me to digest. I
    see references in comments to "lax mode" and not sure if this is an undocumented option or maybe removed in the past?

    innd always uses lax mode for date parsing.

    That date is in ctime(3) format, which isn't supported by INN even in lax
    mode. That format was already forbidden in the first article format
    standard (RFC 850) from June 1983, and a lot of articles before that are
    in the completely incompatible A News format that INN has never attempted
    to parse. It looks like you have a transitional article that is in B News format but is still using the ctime(3) format for Date.

    RFC 850 says:

    Note in particular that ctime format:

    Wdy Mon DD HH:MM:SS YYYY

    is not acceptable because it is not a valid ARPANET date.
    However, since older software still generates this format,
    news implementations are encouraged to accept this format
    and translate it into an acceptable format.

    I wouldn't object to supporting this in INN in lax mode, but it's not
    entirely trivial to add without accidentally breaking something else
    because the order of the date elements is significantly different than a standardized date. It would take someone a bit of time to figure out how
    to safely incorporate it into parsedate_rfc5322_lax. (For example, the
    code that skips over a leading day of the week would also currently skip
    over the month.) It might be easiest to add a separate ctime parser and to
    just attempt a ctime parse whenever the date is otherwise invalid. I'm not
    sure how wide of a range of formats the old ctime dates came in.
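
    The "separate ctime parser, tried only when the date is otherwise invalid" idea could look roughly like this in Python. It is a sketch under stated assumptions, not INN code: parse_standard is a deliberately minimal stand-in for the regular parser (INN's is parsedate_rfc5322_lax, in C), all function names are hypothetical, and a ctime date is taken as UTC because the format carries no time zone.

```python
import re
from datetime import datetime, timedelta, timezone

MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Stand-in for the regular parser: "[Wdy,] DD Mon YYYY HH:MM:SS +ZZZZ" only.
STANDARD_RE = re.compile(
    r"(?:[A-Za-z]{3},\s*)?(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s+"
    r"(\d{2}):(\d{2}):(\d{2})\s+([+-])(\d{2})(\d{2})")

# ctime(3) layout quoted from RFC 850: "Wdy Mon DD HH:MM:SS YYYY".
CTIME_RE = re.compile(
    r"[A-Za-z]{3}\s+([A-Za-z]{3})\s+(\d{1,2})\s+"
    r"(\d{2}):(\d{2}):(\d{2})\s+(\d{4})")


def _build(year, mon, day, hh, mm, ss, tz):
    """Build an aware datetime, or return None for impossible values."""
    month = MONTHS.get(mon.title())
    if month is None:
        return None
    try:
        return datetime(int(year), month, int(day),
                        int(hh), int(mm), int(ss), tzinfo=tz)
    except ValueError:
        return None


def parse_standard(value):
    """Parse the standard layout; return None when it does not match."""
    m = STANDARD_RE.fullmatch(value.strip())
    if m is None:
        return None
    day, mon, year, hh, mm, ss, sign, zh, zm = m.groups()
    offset = timedelta(hours=int(zh), minutes=int(zm))
    if offset >= timedelta(hours=24):
        return None
    tz = timezone(-offset if sign == "-" else offset)
    return _build(year, mon, day, hh, mm, ss, tz)


def parse_ctime(value):
    """Parse a ctime(3)-style date; return None when it does not match."""
    m = CTIME_RE.fullmatch(value.strip())
    if m is None:
        return None
    mon, day, hh, mm, ss, year = m.groups()
    # Assumption: ctime carries no zone, so the time is taken as UTC.
    return _build(year, mon, day, hh, mm, ss, timezone.utc)


def parse_date(value):
    """Try the standard layout first; only then attempt a ctime parse."""
    parsed = parse_standard(value)
    if parsed is None:
        parsed = parse_ctime(value)
    return parsed


print(parse_date("Fri Jul  9 03:46:46 1982"))
```

    Keeping the ctime expression separate means the standard path is never affected by the different element order, which is the risk mentioned above.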
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sat Aug 30 19:01:31 2025
    From Newsgroup: news.software.nntp

    On Aug 30, 2025 at 1:19:20 PM CDT, "Russ Allbery" <eagle@eyrie.org> wrote:

    innd always uses lax mode for date parsing.

    That date is in ctime(3) format, which isn't supported by INN even in lax mode. That format was already forbidden in the first article format
    standard (RFC 850) from June 1983, and a lot of articles before that are
    in the completely incompatible A News format that INN has never attempted
    to parse. It looks like you have a transitional article that is in B News format but is still using the ctime(3) format for Date.

    RFC 850 says:

    Note in particular that ctime format:

    Wdy Mon DD HH:MM:SS YYYY

    is not acceptable because it is not a valid ARPANET date.
    However, since older software still generates this format,
    news implementations are encouraged to accept this format
    and translate it into an acceptable format.

    I wouldn't object to supporting this in INN in lax mode, but it's not entirely trivial to add without accidentally breaking something else
    because the order of the date elements is significantly different than a standardized date. It would take someone a bit of time to figure out how
    to safely incorporate it into parsedate_rfc5322_lax. (For example, the
    code that skips over a leading day of the week would also currently skip
    over the month.) It might be easiest to add a separate ctime parser and to just attempt a ctime parse whenever the date is otherwise invalid. I'm not sure how wide of a range of formats the old ctime dates came in.

    Appreciate the history. I know there are others trying to put together archives, but most are doing so with other software or for web browsing purposes. I may be the only one trying to get these articles in INN. :-)

    Mostly they've all been in the same format, except a few outliers that have invalid Date and Posted headers, but do have a Date-Received header that's
    more appropriate.

    The outliers seem to follow this pattern:

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Those I assume will require changing the Date header, at minimum before they'd be accepted.

    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Sat Aug 30 13:00:11 2025
    From Newsgroup: news.software.nntp

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    Mostly they've all been in the same format, except a few outliers that
    have invalid Date and Posted headers, but do have a Date-Received header that's more appropriate.

    The outliers seem to follow this pattern:

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Looks like Date and Date-Received are in the same format there. The
    problem with that format will be the dashes, which I'm fairly certain that INN's date parser won't accept. The rest of the string should be fine if someone figures out how to handle the dash without breaking something
    else.
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sun Aug 31 00:46:23 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Those I assume will require changing the Date header, at minimum before they'd
    be accepted.

    Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header
    field.) There may be news readers that are unable to parse them too.

    In Python:

    from email.utils import format_datetime, parsedate_to_datetime
    format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
    'Wed, 31 Dec 1969 18:59:59 -0400'
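
    Applied to a whole article before injection, that rewrite could look like the sketch below. It uses only Python's standard email package. rewrite_date is a hypothetical helper name, X-Date is the field name suggested above, and the sample article is invented for the illustration.

```python
from email import message_from_string
from email.utils import format_datetime, parsedate_to_datetime


def rewrite_date(article_text):
    """Return the article with Date normalized and the original in X-Date."""
    msg = message_from_string(article_text)
    old = msg["Date"]
    if old is None:
        return article_text
    try:
        new = format_datetime(parsedate_to_datetime(old))
    except (TypeError, ValueError):
        # Unparseable date: leave the article untouched.
        return article_text
    if new != old:
        # Keep the historical value, then replace Date in place.
        msg["X-Date"] = old
        msg.replace_header("Date", new)
    return msg.as_string()


sample = (
    "Message-ID: <1@example.invalid>\n"
    "Date: Wed, 31-Dec-69 18:59:59 EDT\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
print(rewrite_date(sample))
```

    Articles whose Date cannot be parsed at all are returned unchanged, so they can be set aside for manual inspection.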


    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID

    These are indeed invalid domain names. I am under the impression that
    the laxsyntax check should just ensure there are 1 to 248 (authorized)
    chars surrounded by brackets, without verifying the number, order and
    place of '.', '[', ']', etc.
    --
    Julien ÉLIE

    "The Romans measure distances in paces, we measure them in feet... It takes six
    feet to make a pace." (Astérix)

  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sun Aug 31 00:55:26 2025
    From Newsgroup: news.software.nntp

    On Aug 30, 2025 at 5:46:23 PM CDT, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:

    Hi Jesse,

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Those I assume will require changing the Date header, at minimum before they'd
    be accepted.

    Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header field.) There may be news readers that are unable to parse them too.

    In Python:

    from email.utils import format_datetime,parsedate_to_datetime
    format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
    'Wed, 31 Dec 1969 18:59:59 -0400'

    Will have to do something to bring these articles in. Since these exist on a server that speaks (limited) NNTP, I am trying to bring in as many as possible without having to download, modify, and inject.

    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID

    These are indeed invalid domain names. I am under the impression that
    the laxsyntax check should just ensure there are 1 to 248 (authorized)
    chars surrounded by brackets, without verifying the number, order and
    place of '.', '[', ']', etc.

    In my humble opinion, that's all it should need to do. This seems in line with how Cyclone and Diablo seem to behave by default (at least how I see
    commercial operators have them configured).
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sun Aug 31 10:16:19 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    Date: Wed, 31-Dec-69 18:59:59 EDT

    Wouldn't it be possible to somehow rewrite the Date header field before
    injecting the article? (The old one could be kept in an X-Date header
    field.) There may be news readers that are unable to parse them too.

    Will have to do something to bring these articles in. Since these exist on a server that speaks (limited) NNTP, I am trying to bring in as many as possible
    without having to download, modify, and inject.

    Understood.
    As your use is very specific, and writing a proper parsing of that kind
    of dates is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.

    INN already considers there's no expires date when the Expires header
    field is not parseable, without rejecting the article.
    makehistory already takes the time of the rebuild as the posting date
    when the Date header field is not parseable.
    NEWNEWS uses arrival times.
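
    The proposed fallback amounts to the decision sketched below in Python. It is only an illustration of the behaviour described above: posting_time and its ignore_bad_date flag are hypothetical names, and email.utils.parsedate_to_datetime stands in for INN's own date parser, so the set of dates it accepts is not the same as INN's.

```python
from email.utils import parsedate_to_datetime


def posting_time(date_field, arrived, ignore_bad_date=False):
    """Return the posting time in Unix seconds, or None to reject.

    date_field is the Date header field body and arrived is the arrival
    time in Unix seconds. With ignore_bad_date set, an unparseable date
    falls back to the arrival time instead of causing a rejection.
    """
    try:
        return int(parsedate_to_datetime(date_field).timestamp())
    except (TypeError, ValueError):
        return int(arrived) if ignore_bad_date else None


print(posting_time("not a date", 1756550000))
print(posting_time("not a date", 1756550000, ignore_bad_date=True))
```

    The first call rejects the article (None) and the second one records the arrival time as the posting date.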

    If it looks good to you, I can properly integrate that laxdate behaviour
    in INN. Meanwhile, you could just use:


    --- a/innd/art.c
    +++ b/innd/art.c
    @@ -1054,11 +1054,7 @@ ARTclean(ARTDATA *data, char *buff, bool ihave)
         data->Posted = parsedate_rfc5322_lax(p);

         if (data->Posted == (time_t) -1) {
    -        sprintf(buff, "%d Bad \"Date\" header field -- \"%s\"",
    -                ihave ? NNTP_FAIL_IHAVE_REJECT : NNTP_FAIL_TAKETHIS_REJECT,
    -                MaxLength(p, p));
    -        TMRstop(TMR_ARTCLEAN);
    -        return false;
    +        data->Posted = data->Arrived;
         }

         if (HDR_FOUND(HDR__INJECTION_DATE)) {
    @@ -1066,11 +1062,7 @@ ARTclean(ARTDATA *data, char *buff, bool ihave)
             data->Posted = parsedate_rfc5322_lax(p);

             if (data->Posted == (time_t) -1) {
    -            sprintf(buff, "%d Bad \"Injection-Date\" header field -- \"%s\"",
    -                    ihave ? NNTP_FAIL_IHAVE_REJECT : NNTP_FAIL_TAKETHIS_REJECT,
    -                    MaxLength(p, p));
    -            TMRstop(TMR_ARTCLEAN);
    -            return false;
    +            data->Posted = data->Arrived;
             }
         }



    --- a/storage/expire.c
    +++ b/storage/expire.c
    @@ -707,8 +707,7 @@ OVgroupbasedexpire(TOKEN token, const char *group, const char *data,
             }
             when = parsedate_rfc5322_lax(p);
             if (when == (time_t) -1) {
    -            EXPoverindexdrop++;
    -            return true;
    +            when = arrived;
             }
         } else {
             when = arrived;




    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID

    These are indeed invalid domain names. I am under the impression that
    the laxsyntax check should just ensure there are 1 to 248 (authorized)
    chars surrounded by brackets, without verifying the number, order and
    place of '.', '[', ']', etc.

    In my humble opinion, that's all it should need to do. This seems in line with how Cyclone and Diablo seem to behave by default (at least how I see commercial operators have them configured).

    Here is a new proposed patch (do not keep the 4 lines from the
    previous patch, which specifically looked for '@'):

    --- a/lib/messageid.c
    +++ b/lib/messageid.c
    @@ -155,6 +155,21 @@ IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
         /* Scan local-part: "< dot-atom-text". */
         if (*p++ != '<')
             return false;
    +
    +    if (laxsyntax) {
    +        for (;; p++) {
    +            if (!midnormchar(*p) && *p != '[' && *p != ']')
    +                break;
    +        }
    +        if (*p++ != '>')
    +            return false;
    +        if (stripspaces) {
    +            for (; ISWHITE(*p); p++)
    +                ;
    +        }
    +        return (*p == '\0');
    +    }
    +
         for (;; p++) {
             if (midatomchar(*p)) {
                 while (midatomchar(*++p))





    I'll naturally integrate it properly in a separate function so that this
    check is not called by nnrpd. I'm waiting for your feedback for these
    two changes in Message-ID and Date handling.
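
    The relaxed check discussed above (1 to 248 authorized chars between "<" and ">", with no constraint on the number, order or place of '.', '[', ']', etc.) can be approximated in Python as follows. This is a sketch, not a translation of the patch: the authorized set is assumed here to be atext plus ".", "@", "[" and "]", which may differ from what INN's midnormchar class really allows, and is_lax_msgid is a hypothetical name.

```python
import re

# Assumed "authorized" set: atext plus ".", "@", "[" and "]".
# INN's own character class (midnormchar) may accept a different set.
LAX_CHARS = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~.@\[\]"
LAX_MSGID_RE = re.compile(rf"<[{LAX_CHARS}]{{1,248}}>")


def is_lax_msgid(msgid, stripspaces=False):
    """Check only the angle brackets, the character set and the length."""
    if stripspaces:
        msgid = msgid.strip(" \t\r\n")
    return LAX_MSGID_RE.fullmatch(msgid) is not None


for mid in ("<[OFFICE-3]GVT-RICH-490UQ>",
            "<366@mimir..dmt.oz>",
            "<[MC.LCS.MIT.EDU].851959.860315.KFL>",
            "<>"):
    print(mid, is_lax_msgid(mid))
```

    With that assumed character set, the three rejected Message-IDs reported in this thread pass, while an empty <> still fails.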
    --
    Julien ÉLIE

    "Of twenty people who talk about us, nineteen speak ill of us, and
    the twentieth, who speaks well of us, says it badly." (Rivarol)

  • From Colin Macleod@user7@newsgrouper.org.invalid to news.software.nntp on Sun Aug 31 16:58:05 2025
    From Newsgroup: news.software.nntp

    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> posted:

    Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header field.) There may be news readers that are unable to parse them too.

    In Python:

    from email.utils import format_datetime, parsedate_to_datetime
    format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
    'Wed, 31 Dec 1969 18:59:59 -0400'

    I ran into similar problems with dates while loading historical posts from archive.org into a database for my Newsgrouper site.

    I was converting dates into Unix seconds format for ease of indexing/searching/sorting. Since my code is in Tcl I was using its /clock scan/
    for this - https://www.tcl-lang.org/man/tcl9.0/TclCmd/clock.html#M10 .
    But I hit lots of dates that this failed with, and gradually built up a collection of hacks to massage the input into a form it would accept.
    These can be seen in proc parse_art, lines 132-146 of https://chiselapp.com/user/cmacleod/repository/newsgrouper/file?udc=1&ln=on&ci=tip&name=scripts%2Fdb_load_arch
    plus the timezone mappings at lines 110-124.
    Note that /clock scan/ in Tcl9 is more strict by default, rejecting dates
    like June 31, which did occur in the input. Tcl8 would accept these, /clock/ in Tcl9 will revert to the old behaviour with the option /-validate 0/ .

    Converting back from unix seconds to an acceptable form for a Date header
    can be done with:
    [clock format $dat -format {%a, %d %b %Y %H:%M:%S GMT} -gmt true]
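
    For comparison, the same conversion from Unix seconds to a Date header value can be sketched in Python with the standard library. The helper name date_header_from_epoch is made up for the example.

```python
from email.utils import formatdate


def date_header_from_epoch(seconds):
    """Format Unix seconds as 'Wdy, DD Mon YYYY HH:MM:SS GMT'."""
    return formatdate(seconds, usegmt=True)


print(date_header_from_epoch(0))  # Thu, 01 Jan 1970 00:00:00 GMT
```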
    --
    Colin Macleod ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ https://cmacleod.me.uk

    FEED FEED FEED FEED FEED FEED FEED FEED
    GAZA GAZA GAZA GAZA GAZA GAZA GAZA GAZA
    NOW! NOW! NOW! NOW! NOW! NOW! NOW! NOW!
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Sun Aug 31 11:21:46 2025
    From Newsgroup: news.software.nntp

    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

    As your use is very specific, and writing a proper parsing of that kind of dates is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.

    Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
    different than what "laxmid" means (ignore invalid message IDs and use the message ID anyway), since it's ignoring the date entirely for the purposes
    of all the other things INN does with dates. That way we would keep
    laxdate in case we ever want to enable strict standards-enforcing date
    parsing or provide a different laxer date parser that understands ctime(3)
    and dashes, etc.

    The actual field stored in overview for clients will still be the value of
    the Date header so far as I can see, so that will be invalid (not parsable
    by clients) unless you put something different into the overview. That's already the case with the existing lax date parsing, so might not matter
    and will be true for any of the proposals for handling old dates. I'm not
    sure how many clients try to parse the date information in overview and do something with it.
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Adam H. Kerman@ahk@chinet.com to news.software.nntp on Sun Aug 31 19:42:07 2025
    From Newsgroup: news.software.nntp

    Russ Allbery <eagle@eyrie.org> wrote:
    Julien <iulius@nom-de-mon-site.com.invalid> writes:

    As your use is very specific, and writing a proper parsing of that kind of
    dates is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in
    syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.

    Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
    different than what "laxmid" means (ignore invalid message IDs and use the
    message ID anyway), since it's ignoring the date entirely for the purposes
    of all the other things INN does with dates. That way we would keep
    laxdate in case we ever want to enable strict standards-enforcing date
    parsing or provide a different laxer date parser that understands ctime(3)
    and dashes, etc.

    The actual field stored in overview for clients will still be the value of
    the Date header so far as I can see, so that will be invalid (not parsable
    by clients) unless you put something different into the overview. That's
    already the case with the existing lax date parsing, so might not matter
    and will be true for any of the proposals for handling old dates. I'm not
    sure how many clients try to parse the date information in overview and do
    something with it.

    Could I make a suggestion as a nonprogrammer but someone who has spent countless hours putting data into a consistent pattern or good syntax?

    Think of the Date header as temporary and that at some point, it might
    be nice if it reflected the original date but now in modern syntax. I'm suggesting this because a Date header that reflects when the article was appended to
    an archive is not going to be helpful to a newsreader when it comes to
    sorting. And a whole lot of articles are going to share an identical
    Date header as articles are going to be appended in huge batches.

    Add a special X- header with an explicit header reflecting where the
    article came from in a specific pattern, so it can be readily found.

    Then another X- header with a preliminary analysis of the Date string.

    In the example

    Date: Wed, 31-Dec-69 18:59:59 EDT

    Leave the punctuation as is. Look for the alphanumeric pattern. X
    capital letter, x lower case letter, N numeral, _ space

    X-Date-Pattern: Xxx,_NN-Xxx-NN_NN:NN:NN_XXX

    Some time later, perhaps a year later, someone might analyze this. If a three-alpha recognizable as a day is in the first Xxx group (which
    might be an XXX), then it's a day. It might have an optional "." and ","
    isn't always going to be present as a separator. Similarly, the second three-alpha might be a month.

    The two-digit year could be confused with a time element but there
    probably won't be dates prior to the Unix epoch.

    The three-alpha time zone isn't necessarily unique worldwide, but prior
    to a certain date we know that the articles were from the United States
    only.

    If the transformation into modern syntax results in a logical day-date combination, then that's passed one test that the transformation was
    valid.

    There's going to be a lot of eyeballing necessary but this could
    possibly be a way to choose which articles have dates requiring further analysis.
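
    The X-Date-Pattern value described above is straightforward to compute. A minimal Python sketch, using the mapping given in the post (X capital letter, x lower case letter, N numeral, _ space, punctuation left as is); the function name date_pattern is made up for the example:

```python
def date_pattern(value):
    """Map letters, digits and spaces to X/x/N/_ and keep punctuation."""
    out = []
    for ch in value:
        if "A" <= ch <= "Z":
            out.append("X")
        elif "a" <= ch <= "z":
            out.append("x")
        elif "0" <= ch <= "9":
            out.append("N")
        elif ch == " ":
            out.append("_")
        else:
            out.append(ch)
    return "".join(out)


print(date_pattern("Wed, 31-Dec-69 18:59:59 EDT"))
# Xxx,_NN-Xxx-NN_NN:NN:NN_XXX
```

    Counting how often each pattern occurs across an archive would show which date layouts are common enough to deserve a dedicated conversion rule.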
  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sun Aug 31 23:52:35 2025
    From Newsgroup: news.software.nntp

    On Aug 31, 2025 at 2:42:07 PM CDT, "Adam H. Kerman" <ahk@chinet.com> wrote:

    Russ Allbery <eagle@eyrie.org> wrote:
    Julien <iulius@nom-de-mon-site.com.invalid> writes:

    As your use is very specific, and writing a proper parsing of that kind of
    dates is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in
    syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.
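
The fallback proposed above amounts to the following logic, sketched in Python. This is not INN code, and "laxdate" is only a proposed keyword in this thread. Python's `email.utils.parsedate_to_datetime` stands in for INN's date parser and does not accept or reject exactly the same inputs; `posting_time` is a made-up name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def posting_time(date_value, arrival):
    """Return (time, used_fallback).

    The parsed Date value is used when it is parseable; otherwise the
    arrival time is taken as the posting date.
    """
    try:
        return parsedate_to_datetime(date_value), False
    except (TypeError, ValueError):
        # Depending on the Python version, an unparseable value raises
        # TypeError (older releases) or ValueError (newer releases).
        return arrival, True


if __name__ == "__main__":
    arrival = datetime.now(timezone.utc)
    for value in ("Sun, 31 Aug 2025 23:52:35 +0000", "sometime last week"):
        when, fallback = posting_time(value, arrival)
        print(value, "->", when.isoformat(), "(arrival time used)" if fallback else "")
```

With a large batch import, every article that takes the fallback path ends up with nearly the same posting time, which is the sorting concern raised later in the thread.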

    Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
    different than what "laxmid" means (ignore invalid message IDs and use the
    message ID anyway), since it's ignoring the date entirely for the purposes
    of all the other things INN does with dates. That way we would keep
    laxdate in case we ever want to enable strict standards-enforcing date
    parsing or provide a different laxer date parser that understands ctime(3)
    and dashes, etc.

    The actual field stored in overview for clients will still be the value of
    the Date header so far as I can see, so that will be invalid (not parsable
    by clients) unless you put something different into the overview. That's
    already the case with the existing lax date parsing, so might not matter
    and will be true for any of the proposals for handling old dates. I'm not
    sure how many clients try to parse the date information in overview and do
    something with it.

    Could I make a suggestion as a nonprogrammer but someone who has spent countless hours putting data into a consistent pattern or good syntax?

    Think of the Date header as temporary and that at some point, it might
    be nice if it reflected the original date but now in modern syntax. I'm
    suggesting this because a Date header that reflects when it was appended
    to an archive is not going to be helpful to a newsreader when it comes to
    sorting. And a whole lot of articles are going to share an identical
    Date header as articles are going to be appended in huge batches.

    Add a special X- header with an explicit header reflecting where the
    article came from in a specific pattern, so it can be readily found.

    Then another X- header with a preliminary analysis of the Date string.

    In the example

    Date: Wed, 31-Dec-69 18:59:59 EDT

    Leave the punctuation as is. Look for the alphanumeric pattern. X
    capital letter, x lower case letter, N numeral, _ space

    X-Date-Pattern: Xxx,_NN-Xxx-NN_NN:NN:NN_XXX

    Some time later, perhaps a year later, someone might analyze this. If a
    three-alpha recognizable as a day is in the first Xxx group (which
    might be an XXX), then it's a day. It might have an optional "." and ","
    isn't always going to be present as a separator. Similarly, the second
    three-alpha might be a month.

    The two-digit year could be confused with a time element but there
    probably won't be dates prior to the Unix epoch.

    The three-alpha time zone isn't necessarily unique worldwide, but prior
    to a certain date we know that the articles were from the United States
    only.

    If the transformation into modern syntax results in a logical day-date combination, then that's passed one test that the transformation was
    valid.

    There's going to be a lot of eyeballing necessary but this could
    possibly be a way to choose which articles have dates requiring further analysis.

    One of the ultimate goals of my archive is to sort the history file by posted date and re-feed to another INN instance, so article numbering in the 'final' archive is chronological. Though, I'm starting to think I'll never get to that point. :-)
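
As a rough illustration of the sorting step, here is a Python sketch that orders history lines by posting date. It assumes the text history layout as I understand it from the history(5) documentation: tab-separated fields, with the second field holding arrived~expires~posted as epoch seconds and the third holding the storage token. Verify this against your own history file before relying on it. The hashes and tokens in the sample are placeholders, not real data.

```python
def posted_key(line):
    """Sort key: the 'posted' subfield of a history line, as an integer."""
    fields = line.rstrip("\n").split("\t")
    # Assumed layout of the second field: arrived~expires~posted
    return int(fields[1].split("~")[2])


def sort_history(lines):
    """Return the lines that have a storage token, oldest posting date first."""
    # Entries without a third field have no article in the spool to re-feed.
    live = [line for line in lines if line.count("\t") >= 2]
    # sorted() is stable, so articles with equal posting dates keep their
    # original relative order.
    return sorted(live, key=posted_key)


if __name__ == "__main__":
    sample = [
        "[HASH-A]\t1756600000~-~300\t@TOKEN-A@\n",
        "[HASH-B]\t1756600001~-~100\t@TOKEN-B@\n",
        "[HASH-C]\t1756600002~-~200\t@TOKEN-C@\n",
        "[HASH-D]\t1756600003~-\n",  # no token: skipped
    ]
    for line in sort_history(sample):
        print(line, end="")
```

The posted subfield is only as good as the date parsing at import time, so articles whose Date was replaced by the arrival time would sort by when they were imported rather than when they were written.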

    What I've discovered is that some newsreaders do not handle sorting a group by Date where both two-digit and four-digit representations of the year in the Date header exist throughout articles in the group. For example, in Usenapp, articles whose Date header has the year represented in two digits will be sorted and displayed first, then articles with the year represented using four digits second, resulting in a wonky chronology. This is my primary driver for getting the article numbers in chronological order, as most if not all newsreaders sort using the article number by default.

    Julien's suggestion would work to get the articles injected to the spool, but could present other issues, primarily with article numbering.

    When "Billy G." announced they had the archive.org Usenet content available via NNTP, I was elated as it would save a ton of work, but between the Date issue and a lot of articles having duplicated headers that INN won't accept (not sure if this is caused by his import process or if the source material has duplicate headers), I'm starting to think I need to go back to dealing with the source material directly. They built their own NNTP implementation for this purpose, and I didn't think about 'compliance' of the content initially.

    That leaves me a few battles to win. Like Adam, I do not have a programmer's mindset, so dealing with detecting date format issues and duplicated headers isn't straightforward for me, at least not when dealing with hundreds of millions of articles.
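
For the duplicated-header side of this, a small Python sketch that lists header field names occurring more than once in an article. It only reports duplicates; which repeated fields INN actually refuses is not stated in this thread, so the output would still need filtering. The function name `duplicated_headers` is made up.

```python
from collections import Counter


def duplicated_headers(article: str):
    """Return the header field names that occur more than once, lowercased."""
    text = article.replace("\r\n", "\n")
    head = text.split("\n\n", 1)[0]  # headers end at the first blank line
    names = []
    for line in head.splitlines():
        if line[:1] in (" ", "\t"):
            continue  # folded continuation of the previous field
        name, sep, _ = line.partition(":")
        if sep:
            names.append(name.strip().lower())
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    sample = (
        "From: someone@example.invalid\n"
        "Subject: first copy\n"
        "Subject: second copy\n"
        "Message-ID: <1@example.invalid>\n"
        "\n"
        "Body text. Subject: this line is not a header.\n"
    )
    print(duplicated_headers(sample))
    # prints: ['subject']
```

Run over the source material and over the imported copies, a count of the reported names would show whether the duplicates come from the import process or were already in the source.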