• Re: INN laxmid clarification

    From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sat Aug 30 06:50:26 2025
    From Newsgroup: news.software.nntp

    On Jul 30, 2023 at 4:15:08 AM CDT, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:

    Hi Jesse,

    Over these past years, I have often seen questions about syntax checks.
    Maybe laxmid should allow more Message-IDs than only the ones with ".."
    and two "@"?

    Strictly speaking, a dot (".") must be followed by another non-special char, so <a.@b> and <a@b.> are invalid per RFC.

    I suggest changing the behaviour of laxmid so that innd accepts even
    more Message-IDs. For instance, in the common dot-atom-text syntax, just checking that we have "<", at least one non-special char, "@", at least one non-special char, and ">".
    no-fold-literal is kept untouched but dot-atom-text is changed.

    The syntax per RFC is:

    msg-id = "<" msg-id-core ">"
    msg-id-core = id-left "@" id-right
    id-left = dot-atom-text
    id-right = dot-atom-text / no-fold-literal

    dot-atom-text = 1*atext *("." 1*atext)
    no-fold-literal = "[" *mdtext "]"

    mdtext = %d33-61 / ; The rest of the US-ASCII
    %d63-90 / ; characters not including
    %d94-126 ; ">", "[", "]", or "\"

    atext = ALPHA / DIGIT / ; Printable US-ASCII
    "!" / "#" / ; characters not including
    "$" / "%" / ; specials. Used for atoms.
    "&" / "'" /
    "*" / "+" /
    "-" / "/" /
    "=" / "?" /
    "^" / "_" /
    "`" / "{" /
    "|" / "}" /
    "~"


    laxmid would accept for innd:

    dot-atom-text = 1*(atext / "." / "@")


    At least, I think it would cope with all Message-IDs in the wild. (Are
    there ones without any "@" at all?)

    As for nnrpd, laxmid would go on having the current behaviour of
    allowing ".." and two "@" as this was a request in 2017 from a news
    admin with users having broken posting agents sending such Message-IDs.
    No need for now to allow the injection of even more broken Message-IDs.

    Any thoughts about that change?

    Hi Julien,

    Bringing this up again because I have found BNews Message-IDs cannot be injected without modification, and there are a ton of them from various
    sources I'd rather not attempt to modify. One source is online via NNTP, so it is easy to use pullnews or suck, which would be the path of least resistance.

    <bnews.sri-unix.2509> - 435 Syntax error in message-ID
    --- Synchronet 3.21a-Linux NewsLink 1.2
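
The grammar quoted in the post above can be sketched in Python to compare the strict RFC syntax with the relaxed rule `dot-atom-text = 1*(atext / "." / "@")`. The regular expressions and the function names `is_strict_msgid` and `is_lax_msgid` are assumptions made for this illustration; this is one possible reading of the proposal, not INN's code in lib/messageid.c.

```python
import re

# atext and mdtext as defined in the grammar quoted above.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
MDTEXT = r"[\x21-\x3d\x3f-\x5a\x5e-\x7e]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# Strict: "<" dot-atom-text "@" (dot-atom-text / no-fold-literal) ">"
STRICT = re.compile(rf"<{DOT_ATOM}@(?:{DOT_ATOM}|\[{MDTEXT}*\])>")

# Assumed relaxed reading: any non-empty mix of atext, "." and "@".
LAX = re.compile(rf"<(?:{ATEXT}|[.@])+>")


def is_strict_msgid(msgid):
    """True if msgid matches the strict grammar."""
    return STRICT.fullmatch(msgid) is not None


def is_lax_msgid(msgid):
    """True if msgid matches the assumed relaxed rule."""
    return LAX.fullmatch(msgid) is not None


for candidate in ("<a@b>", "<a.@b>", "<a@b.>", "<bnews.sri-unix.2509>"):
    print(candidate, is_strict_msgid(candidate), is_lax_msgid(candidate))
```

Under these assumptions, the B News identifier without any "@" fails the strict check but passes the relaxed one.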
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sat Aug 30 17:50:12 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    laxmid would accept for innd:

    dot-atom-text = 1*(atext / "." / "@")

    At least, I think it would cope with all Message-IDs in the wild.

    As for nnrpd, laxmid would go on having the current behaviour of
    allowing ".." and two "@" as this was a request in 2017 from a news
    admin with users having broken posting agents sending such Message-IDs.
    No need for now to allow the injection of even more broken Message-IDs.

    Any thoughts about that change?

    Bringing this up again because I have found BNews Message-IDs cannot be injected without modification, and there are a ton of them from various sources I'd rather not attempt to modify. One source is online via NNTP, so it is easy to use pullnews or suck, which would be the path of least resistance.

    <bnews.sri-unix.2509> - 435 Syntax error in message-ID

    Sorry for having forgotten your request. I bet I was waiting for your approval of my suggestion of change (innd would accept 0 to 2 '@', but not nnrpd whose behaviour would remain unchanged) before starting to work on it.

    I think the following patch will work:

    --- a/lib/messageid.c
    +++ b/lib/messageid.c
    @@ -127,8 +127,8 @@ InitializeMessageIDcclass(void)
    ** When stripspaces is true, whitespace at the beginning and at the end
    ** of MessageID are discarded.
    **
    -** When laxsyntax is true, '@' can occur twice in MessageID, and '..' is
    -** also accepted in the left part of MessageID.
    +** When laxsyntax is true, '@' can occur twice in MessageID, or never occur,
    +** and '..' is also accepted in the left part of MessageID.
    */
    bool
    IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
    @@ -155,6 +155,12 @@ IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
    /* Scan local-part: "< dot-atom-text". */
    if (*p++ != '<')
    return false;
    +
    + /* In case there's no '@' in the Message-ID and laxsyntax is set, just
    + * check the syntax of the Message-ID as though it had no left part. */
    + if (laxsyntax && strchr((const char *) p, '@') == NULL)
    + return IsValidRightPartMessageID((const char *) p, stripspaces, true);
    +
    for (;; p++) {
    if (midatomchar(*p)) {
    while (midatomchar(*++p))




    --- a/nnrpd/post.c
    +++ b/nnrpd/post.c
    @@ -471,6 +471,10 @@ ProcessHeaders(char *idbuff, bool needmoderation)
    if (!IsValidMessageID(HDR(HDR__MESSAGEID), true, laxmid)) {
    return "Can't parse Message-ID header field body";
    }
    + /* Do not accept a Message-ID without an '@', even if laxmid is set. */
    + if (laxmid && strchr(HDR(HDR__MESSAGEID), '@') == NULL) {
    + return "Missing @ in Message-ID header field body";
    + }

    /* Set the Path header field. */
    if (HDR(HDR__PATH) == NULL || PERMaccessconf->strippath) {




    If you can confirm it suits your need, and you are now able to inject BNews articles downloaded by pullnews, it would be great.

    I'll also add a note in the documentation to warn that when laxmid is set, remote peers may reject articles with a syntactically invalid Message-ID.
    --
    Julien ÉLIE

    « – C'est une bonne situation ça, scribe ?
    – Oh, c'est une situation assise. » (Astérix)

  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sat Aug 30 17:26:54 2025
    From Newsgroup: news.software.nntp

    On Aug 30, 2025 at 10:50:12 AM CDT, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:

    Hi Jesse,

    laxmid would accept for innd:

    dot-atom-text = 1*(atext / "." / "@")

    At least, I think it would cope with all Message-IDs in the wild.

    As for nnrpd, laxmid would go on having the current behaviour of
    allowing ".." and two "@" as this was a request in 2017 from a news
    admin with users having broken posting agents sending such Message-IDs.
    No need for now to allow the injection of even more broken Message-IDs.

    Any thoughts about that change?

    Bringing this up again because I have found BNews Message-IDs cannot be
    injected without modification, and there are a ton of them from various
    sources I'd rather not attempt to modify. One source is online via NNTP, so
    it is easy to use pullnews or suck, which would be the path of least resistance.

    <bnews.sri-unix.2509> - 435 Syntax error in message-ID

    Sorry for having forgotten your request. I bet I was waiting for your approval
    of my suggestion of change (innd would accept 0 to 2 '@', but not nnrpd whose behaviour would remain unchanged) before starting to work on it.

    I think the following patch will work:

    --- a/lib/messageid.c
    +++ b/lib/messageid.c
    @@ -127,8 +127,8 @@ InitializeMessageIDcclass(void)
    ** When stripspaces is true, whitespace at the beginning and at the end
    ** of MessageID are discarded.
    **
    -** When laxsyntax is true, '@' can occur twice in MessageID, and '..' is
    -** also accepted in the left part of MessageID.
    +** When laxsyntax is true, '@' can occur twice in MessageID, or never occur,
    +** and '..' is also accepted in the left part of MessageID.
    */
    bool
    IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
    @@ -155,6 +155,12 @@ IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
    /* Scan local-part: "< dot-atom-text". */
    if (*p++ != '<')
    return false;
    +
    + /* In case there's no '@' in the Message-ID and laxsyntax is set, just
    + * check the syntax of the Message-ID as though it had no left part. */
    + if (laxsyntax && strchr((const char *) p, '@') == NULL)
    + return IsValidRightPartMessageID((const char *) p, stripspaces, true);
    +
    for (;; p++) {
    if (midatomchar(*p)) {
    while (midatomchar(*++p))




    --- a/nnrpd/post.c
    +++ b/nnrpd/post.c
    @@ -471,6 +471,10 @@ ProcessHeaders(char *idbuff, bool needmoderation)
    if (!IsValidMessageID(HDR(HDR__MESSAGEID), true, laxmid)) {
    return "Can't parse Message-ID header field body";
    }
    + /* Do not accept a Message-ID without an '@', even if laxmid is set. */
    + if (laxmid && strchr(HDR(HDR__MESSAGEID), '@') == NULL) {
    + return "Missing @ in Message-ID header field body";
    + }

    /* Set the Path header field. */
    if (HDR(HDR__PATH) == NULL || PERMaccessconf->strippath) {




    If you can confirm it suits your need, and you are now able to inject BNews articles downloaded by pullnews, it would be great.

    I'll also add a note in the documentation to warn that when laxmid is set, remote peers may reject articles with a syntactically invalid Message-ID.

    This does get past the Message-ID header issue, but presents a new one with
    the Date header.

    437 Bad "Date" header field -- "Fri Jul 9 03:46:46 1982"

    I was looking at lib/date.c but it's a bit complex for me to digest. I see references in comments to "lax mode" and not sure if this is an undocumented option or maybe removed in the past?
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Sat Aug 30 11:19:20 2025
    From Newsgroup: news.software.nntp

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    This does get past the Message-ID header issue, but presents a new one with the Date header.

    437 Bad "Date" header field -- "Fri Jul 9 03:46:46 1982"

    I was looking at lib/date.c but it's a bit complex for me to digest. I
    see references in comments to "lax mode" and not sure if this is an undocumented option or maybe removed in the past?

    innd always uses lax mode for date parsing.

    That date is in ctime(3) format, which isn't supported by INN even in lax
    mode. That format was already forbidden in the first article format
    standard (RFC 850) from June 1983, and a lot of articles before that are
    in the completely incompatible A News format that INN has never attempted
    to parse. It looks like you have a transitional article that is in B News format but is still using the ctime(3) format for Date.

    RFC 850 says:

    Note in particular that ctime format:

    Wdy Mon DD HH:MM:SS YYYY

    is not acceptable because it is not a valid ARPANET date.
    However, since older software still generates this format,
    news implementations are encouraged to accept this format
    and translate it into an acceptable format.

    I wouldn't object to supporting this in INN in lax mode, but it's not
    entirely trivial to add without accidentally breaking something else
    because the order of the date elements is significantly different than a standardized date. It would take someone a bit of time to figure out how
    to safely incorporate it into parsedate_rfc5322_lax. (For example, the
    code that skips over a leading day of the week would also currently skip
    over the month.) It might be easiest to add a separate ctime parser and to
    just attempt a ctime parse whenever the date is otherwise invalid. I'm not
    sure how wide of a range of formats the old ctime dates came in.
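
The "separate ctime parser as a fallback" idea can be illustrated with a short Python sketch. It only shows the control flow (try the standard format first, then ctime); it is not INN's C code. The function name `parse_date_with_ctime_fallback`, the single strict format string, and the choice of UTC for zone-less ctime dates are all assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

RFC5322_FORMAT = "%a, %d %b %Y %H:%M:%S %z"
CTIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # Wdy Mon DD HH:MM:SS YYYY


def parse_date_with_ctime_fallback(value):
    """Return an aware datetime, or None if neither format matches."""
    try:
        return datetime.strptime(value, RFC5322_FORMAT)
    except ValueError:
        pass
    try:
        # ctime(3) carries no time zone; UTC is assumed here.
        parsed = datetime.strptime(value, CTIME_FORMAT)
        return parsed.replace(tzinfo=timezone.utc)
    except ValueError:
        return None


parsed = parse_date_with_ctime_fallback("Fri Jul 9 03:46:46 1982")
print(format_datetime(parsed))
```

Because the ctime attempt runs only after the regular parse has failed, dates that are already accepted keep their current interpretation.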
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sat Aug 30 19:01:31 2025
    From Newsgroup: news.software.nntp

    On Aug 30, 2025 at 1:19:20 PM CDT, "Russ Allbery" <eagle@eyrie.org> wrote:

    innd always uses lax mode for date parsing.

    That date is in ctime(3) format, which isn't supported by INN even in lax mode. That format was already forbidden in the first article format
    standard (RFC 850) from June 1983, and a lot of articles before that are
    in the completely incompatible A News format that INN has never attempted
    to parse. It looks like you have a transitional article that is in B News format but is still using the ctime(3) format for Date.

    RFC 850 says:

    Note in particular that ctime format:

    Wdy Mon DD HH:MM:SS YYYY

    is not acceptable because it is not a valid ARPANET date.
    However, since older software still generates this format,
    news implementations are encouraged to accept this format
    and translate it into an acceptable format.

    I wouldn't object to supporting this in INN in lax mode, but it's not entirely trivial to add without accidentally breaking something else
    because the order of the date elements is significantly different than a standardized date. It would take someone a bit of time to figure out how
    to safely incorporate it into parsedate_rfc5322_lax. (For example, the
    code that skips over a leading day of the week would also currently skip
    over the month.) It might be easiest to add a separate ctime parser and to just attempt a ctime parse whenever the date is otherwise invalid. I'm not sure how wide of a range of formats the old ctime dates came in.

    Appreciate the history. I know there are others trying to put together archives, but most are doing so with other software or for web browsing purposes. I may be the only one trying to get these articles in INN. :-)

    Mostly they've all been in the same format, except a few outliers that have invalid Date and Posted headers, but do have a Date-Received header that's
    more appropriate.

    The outliers seem to follow this pattern:

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Those I assume will require changing the Date header, at minimum before they'd be accepted.

    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Sat Aug 30 13:00:11 2025
    From Newsgroup: news.software.nntp

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    Mostly they've all been in the same format, except a few outliers that
    have invalid Date and Posted headers, but do have a Date-Received header that's more appropriate.

    The outliers seem to follow this pattern:

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Looks like Date and Date-Received are in the same format there. The
    problem with that format will be the dashes, which I'm fairly certain that INN's date parser won't accept. The rest of the string should be fine if someone figures out how to handle the dash without breaking something
    else.
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sun Aug 31 00:46:23 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Those I assume will require changing the Date header, at minimum before they'd
    be accepted.

    Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header
    field.) There may be news readers that are unable to parse them too.

    In Python:

    >>> from email.utils import format_datetime, parsedate_to_datetime
    >>> format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
    'Wed, 31 Dec 1969 18:59:59 -0400'
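
Building on the same two functions, a pre-injection rewrite that keeps the old value in an X-Date header field might look like the sketch below. The helper name `rewrite_date_header` and the list-of-lines representation of the header are assumptions for illustration; a real filter would also have to cope with folded header fields.

```python
from email.utils import format_datetime, parsedate_to_datetime


def rewrite_date_header(header_lines):
    """Normalize the Date field; keep the original value in X-Date."""
    result = []
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep and name.lower() == "date":
            original = value.strip()
            try:
                parsed = parsedate_to_datetime(original)
            except (TypeError, ValueError):
                # Unparseable: leave the field untouched.
                result.append(line)
                continue
            result.append("Date: " + format_datetime(parsed))
            result.append("X-Date: " + original)
        else:
            result.append(line)
    return result


headers = [
    "Date: Wed, 31-Dec-69 18:59:59 EDT",
    "Posted: Wed Dec 31 18:59:59 1969",
]
for line in rewrite_date_header(headers):
    print(line)
```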


    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID

    These are indeed invalid domain names. I am under the impression that
    the laxsyntax check should just ensure there are 1 to 248 (authorized)
    chars surrounded by brackets, without verifying the number, order and
    place of '.', '[', ']', etc.
    --
    Julien ÉLIE

    « Les Romains mesurent les distances en pas, nous en pieds… Il faut six
    pieds pour faire un pas. » (Astérix)

  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sun Aug 31 00:55:26 2025
    From Newsgroup: news.software.nntp

    On Aug 30, 2025 at 5:46:23 PM CDT, "Julien ÉLIE" <iulius@nom-de-mon-site.com.invalid> wrote:

    Hi Jesse,

    Date: Wed, 31-Dec-69 18:59:59 EDT
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Sun, 28-Jul-85 00:57:37 EDT

    Those I assume will require changing the Date header, at minimum before they'd
    be accepted.

    Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header field.) There may be news readers that are unable to parse them too.

    In Python:

    from email.utils import format_datetime,parsedate_to_datetime
    format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
    'Wed, 31 Dec 1969 18:59:59 -0400'

    Will have to do something to bring these articles in. Since these exist on a server that speaks (limited) NNTP, I am trying to bring in as many as possible without having to download, modify, and inject.

    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID

    These are indeed invalid domain names. I am under the impression that
    the laxsyntax check should just ensure there are 1 to 248 (authorized)
    chars surrounded by brackets, without verifying the number, order and
    place of '.', '[', ']', etc.

    In my humble opinion, that's all it should need to do. This seems in line with how Cyclone and Diablo behave by default (at least how I see
    commercial operators have them configured).
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Sun Aug 31 10:16:19 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    Date: Wed, 31-Dec-69 18:59:59 EDT

    Wouldn't it be possible to somehow rewrite the Date header field before
    injecting the article? (The old one could be kept in an X-Date header
    field.) There may be news readers that are unable to parse them too.

    Will have to do something to bring these articles in. Since these exist on a server that speaks (limited) NNTP, I am trying to bring in as many as possible
    without having to download, modify, and inject.

    Understood.
    As your use is very specific, and writing a proper parser for that kind
    of date is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.

    INN already considers there's no expires date when the Expires header
    field is not parseable, without rejecting the article.
    makehistory already takes the time of the rebuild as the posting date
    when the Date header field is not parseable.
    NEWNEWS uses arrival times.
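
The fallback described here (use the arrival time whenever the Date header field cannot be parsed) can be sketched in a few lines of Python. The function name `posting_time` is hypothetical, and Python's `email.utils` parser merely stands in for `parsedate_rfc5322_lax`; the two do not accept the same set of dates.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def posting_time(date_value, arrived):
    """Return the Date field as Unix seconds, or `arrived` if unparseable."""
    try:
        parsed = parsedate_to_datetime(date_value)
    except (TypeError, ValueError):
        # Proposed lax behaviour: fall back instead of rejecting.
        return arrived
    if parsed.tzinfo is None:
        # No zone given; UTC is assumed for this sketch.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(posting_time("Thu, 01 Jan 1970 00:00:10 +0000", 1756600000))
print(posting_time("not a date", 1756600000))
```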

    If it looks good to you, I can properly integrate that laxdate behaviour
    in INN. Meanwhile, you could just use:


    --- a/innd/art.c
    +++ b/innd/art.c
    @@ -1054,11 +1054,7 @@ ARTclean(ARTDATA *data, char *buff, bool ihave)
    data->Posted = parsedate_rfc5322_lax(p);

    if (data->Posted == (time_t) -1) {
    - sprintf(buff, "%d Bad \"Date\" header field -- \"%s\"",
    - ihave ? NNTP_FAIL_IHAVE_REJECT : NNTP_FAIL_TAKETHIS_REJECT,
    - MaxLength(p, p));
    - TMRstop(TMR_ARTCLEAN);
    - return false;
    + data->Posted = data->Arrived;
    }

    if (HDR_FOUND(HDR__INJECTION_DATE)) {
    @@ -1066,11 +1062,7 @@ ARTclean(ARTDATA *data, char *buff, bool ihave)
    data->Posted = parsedate_rfc5322_lax(p);

    if (data->Posted == (time_t) -1) {
    - sprintf(buff, "%d Bad \"Injection-Date\" header field -- \"%s\"",
    - ihave ? NNTP_FAIL_IHAVE_REJECT : NNTP_FAIL_TAKETHIS_REJECT,
    - MaxLength(p, p));
    - TMRstop(TMR_ARTCLEAN);
    - return false;
    + data->Posted = data->Arrived;
    }
    }



    --- a/storage/expire.c
    +++ b/storage/expire.c
    @@ -707,8 +707,7 @@ OVgroupbasedexpire(TOKEN token, const char *group, const char *data,
    }
    when = parsedate_rfc5322_lax(p);
    if (when == (time_t) -1) {
    - EXPoverindexdrop++;
    - return true;
    + when = arrived;
    }
    } else {
    when = arrived;




    I did find a few more Message-ID variations that are rejected:

    <[OFFICE-3]GVT-RICH-490UQ> - 435 Syntax error in message-ID
    <366@mimir..dmt.oz> - 435 Syntax error in message-ID
    <[MC.LCS.MIT.EDU].851959.860315.KFL> - 435 Syntax error in message-ID

    These are indeed invalid domain names. I am under the impression that
    the laxsyntax check should just ensure there are 1 to 248 (authorized)
    chars surrounded by brackets, without verifying the number, order and
    place of '.', '[', ']', etc.

    In my humble opinion, that's all it should need to do. This seems in line with how Cyclone and Diablo behave by default (at least how I see commercial operators have them configured).

    Here is a new proposal of patch (do not keep the 4 lines from the
    previous patch which was specifically looking for '@'):

    --- a/lib/messageid.c
    +++ b/lib/messageid.c
    @@ -155,6 +155,21 @@ IsValidMessageID(const char *MessageID, bool stripspaces, bool laxsyntax)
    /* Scan local-part: "< dot-atom-text". */
    if (*p++ != '<')
    return false;
    +
    + if (laxsyntax) {
    + for (;; p++) {
    + if (!midnormchar(*p) && *p != '[' && *p != ']')
    + break;
    + }
    + if (*p++ != '>')
    + return false;
    + if (stripspaces) {
    + for (; ISWHITE(*p); p++)
    + ;
    + }
    + return (*p == '\0');
    + }
    +
    for (;; p++) {
    if (midatomchar(*p)) {
    while (midatomchar(*++p))





    I'll naturally integrate it properly in a separate function so that this
    check is not called by nnrpd. I'm waiting for your feedback for these
    two changes in Message-ID and Date handling.
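
The relaxed check ("1 to 248 authorized chars surrounded by angle brackets") can be sketched in Python as follows. The set of authorized characters used here (printable US-ASCII except ">" and "\") is an assumption made for the example; the character classes behind `midnormchar` in INN may differ, and `is_lax_msgid` is a hypothetical name.

```python
import re

# Assumed "authorized" characters: printable US-ASCII except ">" and "\".
# The number, order and place of ".", "@", "[" and "]" are not verified.
LAX_MSGID = re.compile(r"<[\x21-\x3d\x3f-\x5b\x5d-\x7e]{1,248}>")


def is_lax_msgid(msgid):
    """True if msgid is "<", 1 to 248 authorized chars, then ">"."""
    return LAX_MSGID.fullmatch(msgid) is not None


for candidate in (
    "<bnews.sri-unix.2509>",
    "<[OFFICE-3]GVT-RICH-490UQ>",
    "<366@mimir..dmt.oz>",
    "<[MC.LCS.MIT.EDU].851959.860315.KFL>",
):
    print(candidate, is_lax_msgid(candidate))
```

With this character set, all four identifiers reported as rejected in this thread would pass, while empty identifiers or ones containing whitespace would still fail.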
    --
    Julien ÉLIE

    « Sur vingt personnes qui parlent de nous, dix-neuf en disent du mal et
    la vingtième, qui en dit du bien, le dit mal. » (Rivarol)

  • From Colin Macleod@user7@newsgrouper.org.invalid to news.software.nntp on Sun Aug 31 16:58:05 2025
    From Newsgroup: news.software.nntp

    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> posted:

    Wouldn't it be possible to somehow rewrite the Date header field before injecting the article? (The old one could be kept in an X-Date header field.) There may be news readers that are unable to parse them too.

    In Python:

    from email.utils import format_datetime,parsedate_to_datetime
    format_datetime(parsedate_to_datetime("Wed, 31-Dec-69 18:59:59 EDT"))
    'Wed, 31 Dec 1969 18:59:59 -0400'

    I ran into similar problems with dates while loading historical posts from archive.org into a database for my Newsgrouper site.

    I was converting dates into Unix seconds format for ease of indexing/searching/sorting. Since my code is in Tcl I was using its /clock scan/
    for this - https://www.tcl-lang.org/man/tcl9.0/TclCmd/clock.html#M10 .
    But I hit lots of dates that this failed with, and gradually built up a collection of hacks to massage the input into a form it would accept.
    These can be seen in proc parse_art, lines 132-146 of https://chiselapp.com/user/cmacleod/repository/newsgrouper/file?udc=1&ln=on&ci=tip&name=scripts%2Fdb_load_arch
    plus the timezone mappings at lines 110-124.
    Note that /clock scan/ in Tcl9 is more strict by default, rejecting dates
    like June 31, which did occur in the input. Tcl8 would accept these, /clock/ in Tcl9 will revert to the old behaviour with the option /-validate 0/ .

    Converting back from unix seconds to an acceptable form for a Date header
    can be done with:
    [clock format $dat -format {%a, %d %b %Y %H:%M:%S GMT} -gmt true]
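
For readers working in Python rather than Tcl, the same conversion from Unix seconds to a Date header value can be sketched with the standard library. The wrapper name `unix_to_date_header` is hypothetical; `email.utils.formatdate` is the actual library call.

```python
from email.utils import formatdate


def unix_to_date_header(seconds):
    """Format Unix seconds as a Date header field value in GMT."""
    return formatdate(seconds, usegmt=True)


print(unix_to_date_header(0))  # Thu, 01 Jan 1970 00:00:00 GMT
```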
    --
    Colin Macleod ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ https://cmacleod.me.uk

    FEED FEED FEED FEED FEED FEED FEED FEED
    GAZA GAZA GAZA GAZA GAZA GAZA GAZA GAZA
    NOW! NOW! NOW! NOW! NOW! NOW! NOW! NOW!
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Sun Aug 31 11:21:46 2025
    From Newsgroup: news.software.nntp

    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> writes:

    As your use is very specific, and writing a proper parser for that kind of date is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.

    Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
    different than what "laxmid" means (ignore invalid message IDs and use the message ID anyway), since it's ignoring the date entirely for the purposes
    of all the other things INN does with dates. That way we would keep
    laxdate in case we ever want to enable strict standards-enforcing date
    parsing or provide a different laxer date parser that understands ctime(3)
    and dashes, etc.

    The actual field stored in overview for clients will still be the value of
    the Date header so far as I can see, so that will be invalid (not parsable
    by clients) unless you put something different into the overview. That's already the case with the existing lax date parsing, so might not matter
    and will be true for any of the proposals for handling old dates. I'm not
    sure how many clients try to parse the date information in overview and do something with it.
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Adam H. Kerman@ahk@chinet.com to news.software.nntp on Sun Aug 31 19:42:07 2025
    From Newsgroup: news.software.nntp

    Russ Allbery <eagle@eyrie.org> wrote:
    Julien <iulius@nom-de-mon-site.com.invalid> writes:

    As your use is very specific, and writing a proper parser for that kind of
    date is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in
    syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.

    Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
    different than what "laxmid" means (ignore invalid message IDs and use the
    message ID anyway), since it's ignoring the date entirely for the purposes
    of all the other things INN does with dates. That way we would keep
    laxdate in case we ever want to enable strict standards-enforcing date
    parsing or provide a different laxer date parser that understands ctime(3)
    and dashes, etc.

    The actual field stored in overview for clients will still be the value of
    the Date header so far as I can see, so that will be invalid (not parsable
    by clients) unless you put something different into the overview. That's
    already the case with the existing lax date parsing, so might not matter
    and will be true for any of the proposals for handling old dates. I'm not
    sure how many clients try to parse the date information in overview and do
    something with it.

    Could I make a suggestion as a nonprogrammer but someone who has spent countless hours putting data into a consistent pattern or good syntax?

    Think of the Date header as temporary and that at some point, it might
    be nice if it reflected the original date but now in modern syntax. I'm suggesting this as a Date header that reflects when it was appended to
    an archive is not going to be helpful to a newsreader when it comes to
    sorting. And a whole lot of articles are going to share an identical
    Date header as articles are going to be appended in huge batches.

    Add a special X- header with an explicit header reflecting where the
    article came from in a specific pattern, so it can be readily found.

    Then another X- header with a preliminary analysis of the Date string.

    In the example

    Date: Wed, 31-Dec-69 18:59:59 EDT

    Leave the punctuation as is. Look for the alphanumeric pattern. X
    capital letter, x lower case letter, N numeral, _ space

    X-Date-Pattern: Xxx,_NN-Xxx-NN_NN:NN:NN_XXX

    Some time later, perhaps a year later, someone might analyze this. If a three-alpha recognizable as a day is in the first Xxx group (which
    might be an XXX), then it's a day. It might have an optional "." and ","
    isn't always going to be present as a separator. Similarly, the second three-alpha might be a month.

    The two-digit year could be confused with a time element but there
    probably won't be dates prior to the Unix epoch.

    The three-alpha time zone isn't necessarily unique worldwide, but prior
    to a certain date we know that the articles were from the United States
    only.

    If the transformation into modern syntax results in a logical day-date combination, then that's passed one test that the transformation was
    valid.

    There's going to be a lot of eyeballing necessary but this could
    possibly be a way to choose which articles have dates requiring further analysis.
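
The pattern described above is straightforward to compute. A minimal Python sketch follows; the function name `date_pattern` is hypothetical, and the mapping (capital letter to X, lower-case letter to x, numeral to N, space to _, punctuation unchanged) is taken from the description in this post.

```python
def date_pattern(value):
    """Map letters to X/x, digits to N, spaces to _, keep punctuation."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("N")
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch == " ":
            out.append("_")
        else:
            out.append(ch)
    return "".join(out)


print("X-Date-Pattern: " + date_pattern("Wed, 31-Dec-69 18:59:59 EDT"))
# X-Date-Pattern: Xxx,_NN-Xxx-NN_NN:NN:NN_XXX
```

Grouping articles by this pattern would show how many distinct date layouts exist in an archive before any of them is rewritten.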
  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Sun Aug 31 23:52:35 2025
    From Newsgroup: news.software.nntp

    On Aug 31, 2025 at 2:42:07 PM CDT, "Adam H. Kerman" <ahk@chinet.com> wrote:

    Russ Allbery <eagle@eyrie.org> wrote:
    Julien <iulius@nom-de-mon-site.com.invalid> writes:

    As your use is very specific, and writing a proper parser for that kind of
    date is time-consuming and complicated (as Russ told us), I wonder
    whether the faster approach wouldn't be to add a new level of control in
    syntaxchecks:

    syntaxchecks: [ laxmid laxdate ]

    which would just take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable. The
    information is recorded in the history file and overview for expiry
    purpose, so it shouldn't break anything as far as I see.

    Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
    different than what "laxmid" means (ignore invalid message IDs and use the
    message ID anyway), since it's ignoring the date entirely for the purposes
    of all the other things INN does with dates. That way we would keep
    laxdate in case we ever want to enable strict standards-enforcing date
    parsing or provide a different laxer date parser that understands ctime(3)
    and dashes, etc.

    The actual field stored in overview for clients will still be the value of
    the Date header so far as I can see, so that will be invalid (not parsable
    by clients) unless you put something different into the overview. That's
    already the case with the existing lax date parsing, so might not matter
    and will be true for any of the proposals for handling old dates. I'm not
    sure how many clients try to parse the date information in overview and do
    something with it.

    Could I make a suggestion as a nonprogrammer but someone who has spent countless hours putting data into a consistent pattern or good syntax?

    Think of the Date header as temporary and that at some point, it might
    be nice if it reflected the original date but now in modern syntax. I'm
    suggesting this because a Date header that reflects when the article was
    appended to an archive is not going to be helpful to a newsreader when it
    comes to sorting. And a whole lot of articles are going to share an
    identical Date header, as articles are going to be appended in huge batches.

    Add a special X- header with an explicit header reflecting where the
    article came from in a specific pattern, so it can be readily found.

    Then another X- header with a preliminary analysis of the Date string.

    In the example

    Date: Wed, 31-Dec-69 18:59:59 EDT

    Leave the punctuation as is and look for the alphanumeric pattern, writing
    X for a capital letter, x for a lowercase letter, N for a numeral, and _
    for a space:

    X-Date-Pattern: Xxx,_NN-Xxx-NN_NN:NN:NN_XXX

    Some time later, perhaps a year later, someone might analyze this. If a
    three-letter group recognizable as a day name is in the first Xxx group
    (which might be an XXX), then it's a day. It might be followed by an
    optional ".", and "," isn't always going to be present as a separator.
    Similarly, the second three-letter group might be a month.

    The two-digit year could be confused with a time element but there
    probably won't be dates prior to the Unix epoch.

    The three-alpha time zone isn't necessarily unique worldwide, but prior
    to a certain date we know that the articles were from the United States
    only.

    If the transformation into modern syntax results in a consistent
    combination of day of the week and date, then it has passed one test that
    the transformation was valid.

    There's going to be a lot of eyeballing necessary but this could
    possibly be a way to choose which articles have dates requiring further analysis.

    One of the ultimate goals of my archive is to sort the history file by posted date and re-feed to another INN instance, so article numbering in the 'final' archive is chronological. Though, I'm starting to think I'll never get to that point. :-)

    What I've discovered is that some newsreaders do not handle sorting a group by Date where both two-digit and four-digit representations of the year in the Date header exist throughout articles in the group. For example, in Usenapp, articles whose Date header has the year represented in two digits will be sorted and displayed first, then articles with the year represented using four digits second, resulting in a wonky chronology. This is my primary driver for getting the article numbers in chronological order, as most if not all newsreaders sort using the article number by default.

    Julien's suggestion would work to get the articles injected to the spool, but could present other issues, primarily with article numbering.

    When "Billy G." announced they had the archive.org Usenet content available
    via NNTP, I was elated as it would save a ton of work, but between the Date issue, and a lot of articles having duplicated headers that INN won't accept (not sure if this is caused by his import process or if the source material
    has duplicate headers), I'm starting to think I need to go back to dealing
    with the source material directly. They built their own NNTP implementation
    for this purpose, and I didn't think about 'compliance' of the content initially.

    That leaves me a few battles to win. Like Adam, I do not have a programmer's mindset, so dealing with detecting date format issues and duplicated headers isn't straightforward for me, at least not when dealing with hundreds of millions of articles.
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Mon Sep 1 20:45:17 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    One of the ultimate goals of my archive is to sort the history file by posted date and re-feed to another INN instance, so article numbering in the 'final' archive is chronological. Though, I'm starting to think I'll never get to that
    point. :-)

    Oh yes, I'm sorry about that. Indeed, if INN does not know how to parse
    the Date header field, it won't be able to store the actual posting date
    in the history file.

    Incidentally, I am really unsure that Diablo performs better. Its
    parsedate function only has 40 lines and does not handle dashes, so you
    won't have the posting date either.
    https://github.com/jpmens/diablo/blob/master/lib/parsedate.c


    If someone has the time and the skill to write a decent parser in C to
    decode dates in ctime(3) format, we could add it to INN and achieve your
    dream :)


    Julien's suggestion would work to get the articles injected to the spool, but could present other issues, primarily with article numbering.

    Are you still interested in the ignoredate setting then?
    As for laxmid, I think the change we discussed is still worthwhile to have.


    When "Billy G." announced they had the archive.org Usenet content available
    via NNTP, I was elated as it would save a ton of work, but between the Date
    issue, and a lot of articles having duplicated headers that INN won't accept
    (not sure if this is caused by his import process or if the source material
    has duplicate headers), I'm starting to think I need to go back to dealing
    with the source material directly. They built their own NNTP implementation
    for this purpose, and I didn't think about 'compliance' of the content
    initially.

    That's why we have RFCs: to aim for better interoperability with current
    software (readers and clients). Unfortunately, old articles pre-dating
    the RFCs, or articles generated by non-compliant software, may not be
    parsed correctly by current software.


    That leaves me a few battles to win. Like Adam, I do not have a programmer's mindset, so dealing with detecting date format issues and duplicated headers isn't straightforward for me, at least not when dealing with hundreds of millions of articles.

    I would tend to think that the best move would be for Billy's news
    server to do the translation job and provide syntactically valid news
    articles per the current Netnews standard. It would achieve
    interoperability with current news readers and news servers.

    I totally agree with Adam who recommends that the Date header field
    reflects the original date but now in modern syntax. That's the point
    of data conservation. One should ensure that old material is still
    readable by modern software. Header fields should be adapted. The
    point of a readable archive is to provide access to messages and notably
    their contents (body), not to have difficulties in sorting them, etc.
    Of course, the original header fields and removed duplicated ones could
    still be provided in X- header fields for the ones interested in seeing
    the original material without modification.

    Think about the videos of your childhood or of your grandparents. What
    matters is probably not duplicating the magnetic VHS contents or the
    Super 8mm contents in the same format, but having them in a modern and
    still viewable format (though the overall quality may have decreased
    because of digitisation artifacts). Good for you, of course, if you still
    have the appropriate obsolete hardware able to read them in the original
    form, but it will be less and less practical and you'll have to keep
    it working or find a compatible unit second-hand.
    It is like old videos or music encoded in an obscure codec format from a
    Windows 95 codec pack, or documents written with software that no longer
    exists. What matters (at least to me) is that they are converted to a
    modern format so as not to lose them permanently...
    The same goes for old A News or B News article formats which somehow
    need a bit of translation.
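
    As a rough illustration of rewriting an old-style Date value into modern
    syntax, here is a hedged Python sketch. It handles only the
    "Day, D-Mon-YY HH:MM:SS ZONE" shape seen in this thread, assumes two-digit
    years mean 19YY, and assumes the zone is one of a small table of US
    abbreviations. The names modernise_date and US_ZONES are hypothetical. The
    caller would keep the original value in an X- header of its choosing:

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumption: only these US zone abbreviations occur in the old articles.
US_ZONES = {"EST": -5, "EDT": -4, "CST": -6, "CDT": -5,
            "MST": -7, "MDT": -6, "PST": -8, "PDT": -7}
MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
OLD_DATE = re.compile(
    r"^(?:([A-Za-z]{3}),\s*)?(\d{1,2})-([A-Za-z]{3})-(\d{2})\s+"
    r"(\d{2}):(\d{2}):(\d{2})\s+([A-Z]{3})$")


def modernise_date(value):
    """Return (modern Date value, weekday_ok), or None if not recognised."""
    m = OLD_DATE.match(value.strip())
    if not m:
        return None
    wd, day, mon, yy, hh, mm, ss, zone = m.groups()
    mon = mon.capitalize()
    if mon not in MONTHS or zone not in US_ZONES:
        return None
    tz = timezone(timedelta(hours=US_ZONES[zone]))
    try:
        # Assumption: a two-digit year always means 19YY.
        dt = datetime(1900 + int(yy), MONTHS[mon], int(day),
                      int(hh), int(mm), int(ss), tzinfo=tz)
    except ValueError:
        return None  # e.g. day 31 in a 30-day month
    # Sanity check: does the stated weekday agree with the calendar?
    weekday_ok = wd is None or wd.capitalize() == DAYS[dt.weekday()]
    return format_datetime(dt), weekday_ok


print(modernise_date("Fri, 4-Oct-85 04:25:29 EDT"))
# prints: ('Fri, 04 Oct 1985 04:25:29 -0400', True)
```

    The weekday comparison is one cheap way to flag transformations that need
    a closer look. Anything the pattern does not recognise returns None and is
    left untouched.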
    --
    Julien ÉLIE

    « Vinum spumosum nisi defluat est uitiosum. »

  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Mon Sep 1 20:52:50 2025
    From Newsgroup: news.software.nntp

    Hi Russ,

    take the current time (= arrival time) as the posting
    date when the Date header field exists and is not parseable.

    Maybe "ignoredate" in this case instead of "laxdate"? This is a bit
    different than what "laxmid" means (ignore invalid message IDs and use the message ID anyway), since it's ignoring the date entirely for the purposes
    of all the other things INN does with dates.

    Yes indeed, thanks for the remark. Your "ignoredate" proposal reflects
    the behaviour of the parameter. I'll use that name.
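
    The fallback quoted above (use the arrival time when the Date header field
    exists but is not parseable) can be sketched in Python as follows. This is
    not INN code, and "ignoredate" is only a setting proposed in this thread.
    Python's email.utils parser stands in for INN's date parser and accepts a
    different set of formats, and the function name posting_time is made up:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def posting_time(date_value, arrival):
    """Return (time to record, used_fallback) for one article."""
    try:
        parsed = parsedate_to_datetime(date_value)
    except (TypeError, ValueError):
        # Unparseable Date: record the arrival time instead.
        return arrival, True
    return parsed, False


arrival = datetime(2025, 9, 1, 18, 52, 50, tzinfo=timezone.utc)
print(posting_time("Tue, 02 Sep 2025 11:27:25 -0700", arrival))
print(posting_time("not a date at all", arrival))
```

    Only the recorded time changes in this sketch. The Date value itself is
    left as it arrived, which is why the overview concern raised below still
    applies.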


    The actual field stored in overview for clients will still be the value of the Date header so far as I can see, so that will be invalid (not parsable
    by clients) unless you put something different into the overview. That's already the case with the existing lax date parsing, so might not matter
    and will be true for any of the proposals for handling old dates. I'm not sure how many clients try to parse the date information in overview and do something with it.

    Some clients may be puzzled and unable to display the article if
    they actually expect valid date information. It will be an
    interoperability issue; I'll mention it in the documentation of
    "ignoredate".
    --
    Julien ÉLIE

    « Il y a deux sortes de justice : vous avez l'avocat qui connaît bien la
    loi, et l'avocat qui connaît bien le juge ! » (Coluche)

  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Tue Sep 2 05:03:36 2025
    From Newsgroup: news.software.nntp

    On Mon, 1 Sep 2025 20:45:17 +0200, Julien ÉLIE wrote:

    Hi Jesse,

    One of the ultimate goals of my archive is to sort the history file by
    posted date and re-feed to another INN instance, so article numbering
    in the 'final' archive is chronological. Though, I'm starting to think
    I'll never get to that point. :-)

    Oh yes, I'm sorry about that. Indeed, if INN does not know how to parse
    the Date header field, it won't be able to store the actual posting date
    in the history file.

    Incidentally, I am really unsure that Diablo performs better. Its
    parsedate function only has 40 lines and does not handle dashes, so you
    won't have the posting date either.
    https://github.com/jpmens/diablo/blob/master/lib/parsedate.c


    If someone has the time and the skill to write a decent parser in C to
    decode dates in ctime(3) format, we could add it to INN and achieve your dream :)

    I won't hold my breath. :-)

    Julien's suggestion would work to get the articles injected to the
    spool, but could present other issues, primarily with article
    numbering.

    Are you still interested in the ignoredate setting then?
    As for laxmid, I think the change we discussed is still worthwhile to
    have.

    I am unsure I would use it at this time. I wouldn't waste any time on it unless someone else finds a need.

    That leaves me a few battles to win. Like Adam, I do not have a
    programmer's mindset, so dealing with detecting date format issues and
    duplicated headers isn't straightforward for me, at least not when
    dealing with hundreds of millions of articles.

    I would tend to think that the best move would be that Billy's news
    server does the translation job and provides syntactically valid news articles per current Netnews standard. It would achieve
    interoperability with current news readers and news servers.

    This isn't a bad idea. I will inquire with him and see what he thinks.

    I totally agree with Adam who recommends that the Date header field
    reflects the original date but now in modern syntax. That's the point
    of data conservation. One should ensure that old material is still
    readable by modern software. Header fields should be adapted. The
    point of a readable archive is to provide access to messages and notably their contents (body), not to have difficulties in sorting them, etc.
    Of course, the original header fields and removed duplicated ones could
    still be provided in X- header fields for the ones interested in seeing
    the original material without modification.

    Think about the videos of your childhood or of your grandparents. What
    matters is probably not duplicating the magnetic VHS contents or the
    Super 8mm contents in the same format, but having them in a modern and
    still viewable format (though the overall quality may have decreased
    because of digitisation artifacts). Good for you, of course, if you still
    have the appropriate obsolete hardware able to read them in the original
    form, but it will be less and less practical and you'll have to keep
    it working or find a compatible unit second-hand.
    It is like old videos or music encoded in an obscure codec format from a
    Windows 95 codec pack, or documents written with software that no longer
    exists. What matters (at least to me) is that they are converted to a
    modern format so as not to lose them permanently...
    The same goes for old A News or B News article formats which somehow
    need a bit of translation.

    What is interesting is that a lot of the articles from 1981-1982 in Billy's
    archive have a valid Date header and also have these headers:

    X-Google-Info: Converted from the original A-News header
    X-Google-Info: Converted from the original B-News header

    Yet, there are a fair amount of articles that have the date issue and I
    need to purge my history file of rejected articles and re-run the suck
    where the Date header caused a lot of rejections. I didn't grab the output
    for Message-IDs to inspect the articles in depth. It is possible they are duplicates with different Message-IDs as I have a good amount of articles
    from the same time period in my spool. I remember when looking at the
    Utzoo archive that they didn't have Message-ID headers, but Article-ID
    headers that weren't compatible, so either Google or someone else
    converted those. Need to figure out if those rejected articles are unique,
    or just imported to Billy's system as a duplicate with some other
    Message-ID.
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Tue Sep 2 11:27:25 2025
    From Newsgroup: news.software.nntp

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    What is interesting is that a lot of the articles from 1981-1982 in Billy's
    archive have a valid Date header and also have these headers:

    X-Google-Info: Converted from the original A-News header
    X-Google-Info: Converted from the original B-News header

    Yet, there are a fair amount of articles that have the date issue and I
    need to purge my history file of rejected articles and re-run the suck
    where the Date header caused a lot of rejections.

    My recollection is that the team at Google (maybe it was at DejaNews?)
    that did this ingestion started with INN at the time, which was probably
    INN 1.x or at least before I rewrote parsedate, and thus probably only
    rewrote the Date headers that failed with the original yacc-based INN date parser that I think might have been copied from C News.

    I added support for every date format in an article that we had on
    Stanford's spool at the time, but I seem to recall I didn't attempt to
    support every date format the C News parser supported. (I think it was originally based on some other yacc date parser from somewhere else? My
    memory on all of this is pretty vague since this was 15 or 20 years ago at least, so someone should check me before relying on any of this.) It
    accepted all sorts of interesting stuff.

    I didn't grab the output for Message-IDs to inspect the articles in
    depth. It is possible they are duplicates with different Message-IDs as
    I have a good amount of articles from the same time period in my spool.
    I remember when looking at the Utzoo archive that they didn't have
    Message-ID headers, but Article-ID headers that weren't compatible, so
    either Google or someone else converted those.

    Google (or DejaNews) ingested a bunch of A-News articles and those
    *definitely* require conversion (they don't look anything like a modern
    article), so they definitely wrote a converter.
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Tue Sep 2 11:29:03 2025
    From Newsgroup: news.software.nntp

    "Billy G." <no-reply@no.spam> writes:

    I have date parsing in Go and it's a mess.

    Welcome to the relatively exclusive club of people who have tried to write
    a date parser and have discovered all the nonsense that people do in
    practice with dates, and also how annoyingly complicated human date
    formats are. :)

    Go at least is a nicer language to fight with all of that in than C was.
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Russ Allbery@eagle@eyrie.org to news.software.nntp on Tue Sep 2 11:31:19 2025
    From Newsgroup: news.software.nntp

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    The format above is the most common. Then there's a handful of articles whose real date is impossible to determine:

    Aug 30 13:38:03.129 - localhost <369@psivax.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EST"
    Aug 30 13:41:24.347 - localhost <702@mmintl.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"
    Aug 30 13:42:35.106 - localhost <1305@mtgzz.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"
    Aug 30 13:47:25.759 - localhost <1639@qubix.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"

    For the record, all of these date header values are meaningless. That date corresponds to the UNIX timestamp -1, so that's just an error return value
    run through a date formatting routine. It doesn't carry any information.
    It's probably a bug introduced by some intermediate layer of conversion at
    some point over the years.
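
    The arithmetic can be checked with a short Python sketch. It formats a
    UNIX timestamp in the old dashed style using a fixed UTC-5 offset labelled
    EST (the helper name format_old_style is made up for illustration, and the
    default C locale is assumed for the day and month names):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, no DST rules


def format_old_style(seconds):
    """Format a UNIX timestamp like the rejected Date values, in EST."""
    local = (EPOCH + timedelta(seconds=seconds)).astimezone(EST)
    return local.strftime("%a, %d-%b-%y %H:%M:%S %Z")


print(format_old_style(-1))
# prints: Wed, 31-Dec-69 18:59:59 EST
```

    Three of the four log lines carry the same clock time with an EDT label,
    which suggests the zone label on those lines is not reliable either.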
    --
    Russ Allbery (eagle@eyrie.org) <https://www.eyrie.org/~eagle/>

    Please post questions rather than mailing me directly.
    <https://www.eyrie.org/~eagle/faqs/questions.html> explains why.
  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Tue Sep 2 18:42:29 2025
    From Newsgroup: news.software.nntp

    On Tue, 02 Sep 2025 11:27:25 -0700, Russ Allbery wrote:

    Jesse Rehmer <jesse.rehmer@blueworldhosting.com> writes:

    What is interesting is that a lot of the articles from 1981-1982 in Billy's
    archive have a valid Date header and also have these headers:

    X-Google-Info: Converted from the original A-News header
    X-Google-Info: Converted from the original B-News header

    Yet, there are a fair amount of articles that have the date issue and I
    need to purge my history file of rejected articles and re-run the suck
    where the Date header caused a lot of rejections.

    My recollection is that the team at Google (maybe it was at DejaNews?)
    that did this ingestion started with INN at the time, which was probably
    INN 1.x or at least before I rewrote parsedate, and thus probably only rewrote the Date headers that failed with the original yacc-based INN
    date parser that I think might have been copied from C News.

    I added support for every date format in an article that we had on
    Stanford's spool at the time, but I seem to recall I didn't attempt to support every date format the C News parser supported. (I think it was originally based on some other yacc date parser from somewhere else? My memory on all of this is pretty vague since this was 15 or 20 years ago
    at least, so someone should check me before relying on any of this.) It accepted all sorts of interesting stuff.

    I didn't grab the output for Message-IDs to inspect the articles in
    depth. It is possible they are duplicates with different Message-IDs as
    I have a good amount of articles from the same time period in my spool.
    I remember when looking at the Utzoo archive that they didn't have
    Message-ID headers, but Article-ID headers that weren't compatible, so
    either Google or someone else converted those.

    Google (or DejaNews) ingested a bunch of A-News articles and those
    *definitely* require conversion (they don't look anything like a modern
    article), so they definitely wrote a converter.

    Would it be possible through filter_innd.pl to take the value of
    X-Google-ArrivalTime and replace the Date header with that value?

    Also, I'm trying to strip the X-Google-Attributes and X-Google-Thread
    headers from all articles, but cannot get it to work. I've added these
    header values in innd/innd.c, so they are available to $hdr, but can't
    seem to unset them, or I'm not doing it in a way that works with the rest
    of cleanfeed. The basic code works to unset headers when a user posts
    through filter_nnrpd.pl, but doesn't seem to in filter_innd.pl?
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Tue Sep 2 20:55:39 2025
    From Newsgroup: news.software.nntp

    Hi Billy,

    I have date parsing in Go and it's a mess.
    Many old articles are using any format you can think of...

    Maybe this date parser in Go could be of help?
    https://github.com/araddon/dateparse


    and INN does not like many of the older ones.
    That's why I wrote my own less strict server to get it all in.

    A great hobby :-)
    --
    Julien ÉLIE

    « – Et si vous ne trouvez pas, je vous fais bouillir et servir aux lions
    avec de la sauce à la menthe !!!
    – Mais c'est horrible ça !
    – Oui, pauvres bêtes ! » (Astérix)

  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Tue Sep 2 20:59:27 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    Would it be possible through filter_innd.pl to take the value of
    X-Google-ArrivalTime and replace the Date header with that value?

    Also, I'm trying to strip the X-Google-Attributes and X-Google-Thread
    headers from all articles, but cannot get it to work. I've added these
    header values in innd/innd.c, so they are available to $hdr, but can't
    seem to unset them, or I'm not doing it in a way that works with the rest
    of cleanfeed. The basic code works to unset headers when a user posts
    through filter_nnrpd.pl, but doesn't seem to in filter_innd.pl?

    Unfortunately, filter_innd.pl does not allow articles to be modified. They
    are read-only in the filter, unlike in the nnrpd Perl hook.

    https://www.eyrie.org/~eagle/software/inn/docs/hook-perl.html

    "The %hdr hash should not be modified inside filter_art(). Currently, $hdr{__BODY__} is the only data that will cause your filter to die if
    you modify it, but in the future other keys may also contain live data. Modifying live INN data in Perl will hopefully only cause a fatal
    exception in your Perl code that disables Perl filtering until you fix
    it, but it's possible for it to cause article munging or even core dumps
    in INN. So always, always make a copy first."
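
    Since the innd filter is read-only, one option is to strip the unwanted
    header fields before the article is offered to INN at all. The following
    Python sketch assumes the article is available as text with LF line
    endings. It works outside INN, and the function name strip_headers is
    hypothetical:

```python
def strip_headers(article: str, unwanted) -> str:
    """Drop the named header fields, including folded continuation lines."""
    names = {n.lower() for n in unwanted}
    head, sep, body = article.partition("\n\n")
    kept, skipping = [], False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation of the previous field: it shares that field's fate.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        skipping = name in names
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + sep + body


sample = ("Path: a!b\n"
          "X-Google-Thread: f996b,abc\n"
          "\tdef\n"
          "Subject: hi\n"
          "\n"
          "body\n")
print(strip_headers(sample, ["X-Google-Attributes", "X-Google-Thread"]))
```

    Header names are compared case-insensitively, and the body is passed
    through unchanged.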
    --
    Julien ÉLIE

    « Quo vadis ? » (saint Jean)

  • From Billy G.@no-reply@no.spam to news.software.nntp on Tue Sep 2 22:48:10 2025
    From Newsgroup: news.software.nntp

    On 02.09.25 19:42, Jesse Rehmer wrote:
    Also, I'm trying to strip the X-Google-Attributes and X-Google-Thread
    headers from all articles, but cannot get it to work.

    Don't waste your time trying.
    I can remove headers on the fly in Go, catching continued lines too.

    The 1969 ones are from mbox2nntp!utzoo!

    Date: and Posted: are wrong, like Russ said: epoch minus one second.
    They must have arrived like that from the utzoo archive, or I don't know.
    But Date-Received looks valid on them.
    I don't think my software did that. It only modified Path headers.

    Aug 30 13:41:24.347 - localhost <702@mmintl.UUCP>
    437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"

    Path: archive.newsdeef.eu!mbox2nntp!utzoo!linus!philabs!pwa-b!mmintl!franka
    From: franka@mmintl.UUCP (Frank Adams)
    Newsgroups: net.philosophy
    Subject: Re: More Atheistic Wishful Thinking
    Date: Wed, 31-Dec-69 18:59:59 EDT
    Article-I.D.: mmintl.702
    Posted: Wed Dec 31 18:59:59 1969
    Date-Received: Fri, 4-Oct-85 04:25:29 EDT
    References: <1522@umcp-cs.UUCP> <1668@pyuxd.UUCP> <1552@umcp-cs.UUCP> <701@utastro.UUCP> <664@mmintl.UUCP> <739@utastro.UUCP> <680@mmintl.U <755@utastro.UUCP>
    Reply-To: franka@mmintl.UUCP (Frank Adams)
    Organization: Multimate International, E. Hartford, CT
    Lines: 89
    Keywords: identity selfness resurrection
    Summary: Identity is information
    Message-ID: <702@mmintl.UUCP>


    And the two duplicate-header rejections:

    Aug 30 22:49:59.188 - localhost <27375@philabs.UUCP>
    437 Duplicate "Path" header field

    the first contains Path: and path:

    Aug 30 23:58:57.078 - localhost <426@novavax.UUCP>
    437 Duplicate "Date" header field

    the second has all headers duped.

    head <27375@philabs.UUCP>
    221 0 <27375@philabs.UUCP>
    Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
    Posting-Version: version B 2.10.1 6/24/83; site philabs.UUCP
    Path: archive.newsdeef.eu!mbox2nntp!utzoo!linus!philabs!jah
    From: jah@philabs.UUCP (Julie Harazduk)
    Newsgroups: net.flame,net.politics
    Subject: Re: Nuclear exchange (life of the Sun).
    Date: 07 Dec 1983 14:09:13 EST
    Article-I.D.: philabs.27375
    Posted: Wed Dec 7 14:09:13 1983
    Date-Received: Fri, 9-Dec-83 08:10:13 EST
    References: <158@dual.UUCP>, <233@denelcor.UUCP>
    Organization: Philips Labs, Briarcliff Manor, NY
    Lines: 10
    path: mbox2nntp!(sdcvax,mcvax,cbosgd,allegra,deecvax)!philabs!jah
    Message-ID: <27375@philabs.UUCP>
    .

    head <426@novavax.UUCP>
    221 0 <426@novavax.UUCP>
    Path: archive.newsdeef.eu!mbox2nntp!utzoo!utgpu!water!watmath!clyde!att-cb!osu-cis!tut.cis.ohio-state.edu!mailrus!umix!uunet!steinmetz!ge-dab!codas!novavax!maddoxt
    From: maddoxt@novavax.UUCP (Thomas Maddox)
    Newsgroups: comp.society.futures,comp.ai
    Subject: Re: The future of AI [was Re: Time Magazine -- Computers of the Future]
    Keywords: AI, research program
    Date: 16 Apr 1988 16:33:18 UTC
    References: <8803270154.AA08607@bu-cs.bu.edu> <962@daisy.UUCP> <4640@bcsaic.UUCP> <1134@its63b.ed.ac.uk>
    Reply-To: maddoxt@novavax.UUCP (Thomas Maddox)
    Organization: Nova University, Fort Lauderdale, Florida
    Lines: 23
    Path: mbox2nntp!utzoo!utgpu!water!watmath!clyde!att-cb!osu-cis!tut.cis.ohio-state.edu!mailrus!umix!uunet!steinmetz!ge-dab!codas!novavax!maddoxt
    From: maddoxt@novavax.UUCP (Thomas Maddox)
    Newsgroups: comp.society.futures,comp.ai
    Subject: Re: The future of AI [was Re: Time Magazine -- Computers of the Future]
    Keywords: AI, research program
    Date: 16 Apr 1988 16:33:18 UTC
    References: <8803270154.AA08607@bu-cs.bu.edu> <962@daisy.UUCP> <4640@bcsaic.UUCP> <1134@its63b.ed.ac.uk>
    Reply-To: maddoxt@novavax.UUCP (Thomas Maddox)
    Organization: Nova University, Fort Lauderdale, Florida
    Message-ID: <426@novavax.UUCP>
    .
    body <426@novavax.UUCP>
    222 0 <426@novavax.UUCP>

    In article <1134@its63b.ed.ac.uk> gvw@its63b.ed.ac.uk (G Wilson) writes:

    I think AI can be summed up by Terry Winograd's defection. His
    SHRDLU program is still quoted in *every* AI textbook (at least all
    the ones I've seen), but he is no longer a believer in the AI
    research programme (see "Understanding Computers and Cognition",
    by Winograd and Flores).

    Using this same reasoning, one might given up quantum
    mechanics because of Einstein's "defection." Whether a particular
    researcher continues his research is an interesting historical
    question (and indeed many physicists lamented the loss of Einstein),
    but it does not call into question the research program itself, which
    must stand or fall on its own merits.
    AI will continue to produce results and remain a viable
    enterprise, or it won't and will degenerate. However, so long as it
    continues to feed powerful ideas and techniques into the various
    fields it connects with, to dismiss it seems remarkably premature. If
    you are one of the pro- or anti-AI heavyweights, i.e., someone with
    power, prestige, or money riding on society's evaluation of AI
    research, then you join the polemic with all guns firing.
    The rest of us can continue to enjoy both the practical and intellectual fruits of the research and the debate.
    .


    I gave up at some point trying to import all of the old stuff into INN.
    I got too many rejections, and you'd have to write code for each article
    that won't go in: check why, what's wrong, and think about how to fix it.
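
    For the duplicate-header cases shown above, a small Python sketch can at
    least list which field names repeat in a HEAD response before deciding how
    to fix an article. It compares names case-insensitively, which matches the
    Path:/path: rejection. The function name duplicate_headers is made up for
    illustration:

```python
from collections import Counter


def duplicate_headers(head: str):
    """Return the lowercased names of header fields that occur more than once."""
    counts = Counter()
    for line in head.split("\n"):
        if not line or line[0] in " \t" or ":" not in line:
            continue  # blank line, folded continuation, or status line
        counts[line.split(":", 1)[0].strip().lower()] += 1
    return sorted(name for name, n in counts.items() if n > 1)


head = ("Path: archive.newsdeef.eu!mbox2nntp!utzoo!linus!philabs!jah\n"
        "From: jah@philabs.UUCP (Julie Harazduk)\n"
        "Date: 07 Dec 1983 14:09:13 EST\n"
        "path: mbox2nntp!(sdcvax,mcvax,cbosgd,allegra,deecvax)!philabs!jah\n"
        "Message-ID: <27375@philabs.UUCP>")
print(duplicate_headers(head))
# prints: ['path']
```

    Whether to keep the first or the last copy of a repeated field is a
    separate decision that this sketch does not make.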
    --
    .......
    Billy G. (go-while)
    https://pugleaf.net
    @Newsgroup: rocksolid.nodes.help
    irc.pugleaf.net:6697 (SSL) #lounge
    discord: https://discord.gg/rECSbHHFzp
  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Tue Sep 2 23:12:14 2025
    From Newsgroup: news.software.nntp

    On Tue, 2 Sep 2025 22:48:10 +0100, Billy G. wrote:

    On 02.09.25 19:42, Jesse Rehmer wrote:
    Also, I'm trying to strip the X-Google-Attributes and X-Google-Thread
    headers from all articles, but cannot get it to work.

    Don't waste your time trying.
    I can remove headers on-the-fly in Go catching continued lines too.

    I think the X-Google-Info and X-Google-ArrivalTime headers are insightful,
    but the rest aren't and just take up space. I don't think any of the
    X-Deja* headers are of any use, at least none that I've seen.
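
    For anyone doing the same clean-up in C rather than Go, here is a minimal
    sketch of dropping unwanted header fields together with their folded
    continuation lines (lines that start with a space or tab). The function
    names and the prefix list are hypothetical, and it assumes "\n" line
    endings and a header block that ends at the first empty line.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strncasecmp (POSIX) */

/* Hypothetical list of header-name prefixes to drop. */
static const char *const drop_prefixes[] = {
    "X-Google-Attributes:", "X-Google-Thread:", "X-Deja", NULL
};

/* Return 1 if the line starts with one of the prefixes above. */
static int should_drop(const char *line, size_t len) {
    for (size_t i = 0; drop_prefixes[i] != NULL; i++) {
        size_t n = strlen(drop_prefixes[i]);
        if (len >= n && strncasecmp(line, drop_prefixes[i], n) == 0)
            return 1;
    }
    return 0;
}

/*
 * Copy the article `in` to `out`, skipping dropped headers and their
 * continuation lines. `out` must hold at least strlen(in) + 1 bytes.
 * Everything from the empty line onwards (the body) is copied verbatim.
 */
void strip_headers(const char *in, char *out) {
    int dropping = 0;
    while (*in != '\0') {
        const char *eol = strchr(in, '\n');
        size_t len = eol ? (size_t)(eol - in) + 1 : strlen(in);
        if (*in == '\n') { /* empty line: end of headers */
            strcpy(out, in);
            return;
        }
        /* A continuation line inherits the fate of the header before it. */
        if (*in != ' ' && *in != '\t')
            dropping = should_drop(in, len);
        if (!dropping) {
            memcpy(out, in, len);
            out += len;
        }
        in += len;
    }
    *out = '\0';
}
```

    X-Google-Info and X-Google-ArrivalTime are left alone here because they
    are not in the prefix list; adjust the list to taste.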

    I gave up at some point trying to import all of the old stuff into INN.
    Got too many declines and you'd have to write code for each article not
    going in... check why, what's wrong, and think how to fix...

    When you get to the oldest messages in the uztoo archive... basically have
    to create a bunch of headers because there really are none.
  • From Billy G.@no-reply@no.spam to news.software.nntp on Tue Sep 2 12:41:21 2025
    From Newsgroup: news.software.nntp


    a lot of articles having duplicated headers that INN won't accept
    (not sure if this is caused by his import process or if the source
    material has duplicate headers)

    Do you have some examples of articles with duplicate headers?
    Path should show where article came from (eg: !any-name.mbox.zip/gz)

    I have date parsing in Go and it's a mess.
    Many old articles are using any format you can think of...
    and INN does not like many of the older ones.
    That's why I wrote my own less strict server to get it all in.

    I scanned the archive vs blueworld (last year) and sucked everything
    from your server to the archive. Wasn't that much.

    Another feature my server can do is re-ordering the overview
    which I already did because I believe there is not much more to find.

    It's all in go-pugleaf databases (sqlite3) per newsgroup too.
    Headers and Body, easily parseable into any format we need.
    Proper date parsing exists, and renaming a wrongly formatted Date:
    header to X-Date and injecting a valid RFC date while sending
    should be easy - but some wrong date headers
    are spam from badly written tools and could be discarded.
  • From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Tue Sep 2 15:02:13 2025
    From Newsgroup: news.software.nntp

    On Tue, 2 Sep 2025 12:41:21 +0100, Billy G. wrote:

    a lot of articles having duplicated headers that INN won't accept
    (not sure if this is caused by his import process or if the source material has duplicate headers)

    Do you have some examples of articles with duplicate headers?
    Path should show where article came from (eg: !any-name.mbox.zip/gz)

    Here are two with duplicated headers:
    Aug 30 22:49:59.188 - localhost <27375@philabs.UUCP> 437 Duplicate "Path" header field
    Aug 30 23:58:57.078 - localhost <426@novavax.UUCP> 437 Duplicate "Date" header field
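
    To find such articles before feeding them, a rough pre-check could count
    how often a header name occurs in the header block. This is only a
    sketch: count_header is a hypothetical helper, it assumes "\n" line
    endings, and it does not try to reproduce INN's own checks.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strncasecmp (POSIX) */

/*
 * Count header lines named `name` (given without the colon) in the header
 * block of `art`. The header block ends at the first empty line.
 * Continuation lines start with whitespace, so they never match a name.
 */
int count_header(const char *art, const char *name) {
    size_t n = strlen(name);
    int count = 0;
    const char *line = art;
    while (*line != '\0' && *line != '\n') {
        if (strncasecmp(line, name, n) == 0 && line[n] == ':')
            count++;
        const char *eol = strchr(line, '\n');
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return count;
}
```

    A result greater than 1 for "Path" or "Date" would flag the two articles
    above before INN answers with 437.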


    I have date parsing in Go and it's a mess.
    Many old articles are using any format you can think of...
    and INN does not like many of the older ones.
    That's why I wrote my own less strict server to get it all in.

    I scanned the archive vs blueworld (last year) and sucked everything
    from your server to the archive. Wasn't that much.

    Another feature my server can do is re-ordering the overview which I
    already did because I believe there is not much more to find.

    It's all in go-pugleaf databases (sqlite3) per newsgroup too.
    Headers and Body, easily parseable into any format we need.
    Proper date parsing exists, and renaming a wrongly formatted Date:
    header to X-Date and injecting a valid RFC date while sending
    should be easy - but some wrong date headers
    are spam from badly written tools and could be discarded.

    There are two date formats I'm seeing an issue with:

    Aug 30 19:10:02.809 - localhost <bnews.ihuxr.198> 437 Bad "Date" header field -- "Tue Nov 9 07:29:44 1982"
    Aug 30 19:10:02.849 - localhost <bnews.unc.4239> 437 Bad "Date" header field -- "Tue Nov 9 07:45:23 1982"
    Aug 30 19:10:03.579 - localhost <bnews.ihuxr.274> 437 Bad "Date" header field -- "Tue Jan 4 03:25:40 1983"
    Aug 30 22:12:48.929 - localhost <bnews.sri-arpa.836> 437 Bad "Date" header field -- "Fri Apr 8 00:56:18 1983"

    The format above is the most common. Then there's a handful of articles
    whose real date is impossible to determine:

    Aug 30 13:38:03.129 - localhost <369@psivax.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EST"
    Aug 30 13:41:24.347 - localhost <702@mmintl.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"
    Aug 30 13:42:35.106 - localhost <1305@mtgzz.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"
    Aug 30 13:47:25.759 - localhost <1639@qubix.UUCP> 437 Bad "Date" header field -- "Wed, 31-Dec-69 18:59:59 EDT"

    I'm not sure what to make of the last example; far fewer
    articles have this format with an invalid date/year.

    Most of those articles do have a valid Date-Received header like <369@psivax.UUCP>:

    Date-Received: Thu, 21-Mar-85 02:16:31 EST

    Without checking all of them, there is a X-Google-ArrivalTime header in a format that I believe INN would accept if it were also used on the Date header.
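
    A sketch of the fallback described above, with hypothetical helper names.
    18:59:59 EST on 31 Dec 1969 is one second before the Unix epoch, so one
    plausible reading (an assumption, nothing in the articles confirms it) is
    that the posting software formatted a time_t of -1. The code below only
    recognises that placeholder and picks Date-Received instead; the chosen
    value would still need converting into a form INN accepts.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `date` looks like the placeholder seen in the rejected articles. */
int is_placeholder_date(const char *date) {
    return strstr(date, "31-Dec-69 18:59:59") != NULL;
}

/*
 * Pick the header value to use as the article date: the Date value unless
 * it is missing or the placeholder, in which case fall back to the
 * Date-Received value (which may itself be NULL).
 */
const char *pick_date(const char *date, const char *date_received) {
    if (date != NULL && !is_placeholder_date(date))
        return date;
    return date_received;
}
```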
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Wed Sep 3 21:09:43 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    When you get to the oldest messages in the uztoo archive... basically have
    to create a bunch of headers because there really are none.

    Are they in the A News format?

    If that's the case, RFC 1849 (Son of 1036) describes it:
    https://datatracker.ietf.org/doc/html/rfc1849


    Appendix A. Archaeological Notes

    A.1. "A News" Article Format

    The obsolete "A News" article format consisted of exactly five lines
    of header information, followed by the body. For example:

    Aeagle.642
    news.misc
    cbosgd!mhuxj!mhuxt!eagle!jerry
    Fri Nov 19 16:14:55 1982
    Usenet Etiquette - Please Read
    body
    body
    body

    The first line consisted of an "A" followed by an article ID
    (analogous to a message ID and used for similar purposes). The
    second line was the list of newsgroups. The third line was the path.
    The fourth was the date, in the format above (all fields fixed
    width), resembling an Internet date but not quite the same. The
    fifth was the subject.

    This format is documented for archaeological purposes only. Do not
    generate articles in this format.
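
    For an importer that only needs to read such articles, the five-line
    layout of A.1 can be split mechanically. The following is a minimal
    sketch; ANewsArticle, take_line, parse_anews and the field size limit are
    hypothetical names chosen for illustration, not anything from INN or the
    RFC.

```c
#include <stddef.h>
#include <string.h>

#define ANEWS_FIELD_MAX 256 /* arbitrary per-field limit for this sketch */

typedef struct {
    char id[ANEWS_FIELD_MAX];         /* line 1 without the leading "A" */
    char newsgroups[ANEWS_FIELD_MAX]; /* line 2 */
    char path[ANEWS_FIELD_MAX];       /* line 3 */
    char date[ANEWS_FIELD_MAX];       /* line 4, ctime-like format */
    char subject[ANEWS_FIELD_MAX];    /* line 5 */
    const char *body;                 /* points into the input text */
} ANewsArticle;

/* Copy one line (without '\n') into dst; return the start of the next
 * line, or NULL if there is no newline or the line does not fit. */
static const char *take_line(const char *src, char *dst, size_t cap) {
    const char *eol = strchr(src, '\n');
    if (eol == NULL)
        return NULL;
    size_t len = (size_t)(eol - src);
    if (len >= cap)
        return NULL;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return eol + 1;
}

/* Return 0 on success, -1 if the text does not look like an A News article. */
int parse_anews(const char *text, ANewsArticle *a) {
    if (text[0] != 'A')
        return -1;
    const char *p = take_line(text + 1, a->id, sizeof a->id);
    if (p) p = take_line(p, a->newsgroups, sizeof a->newsgroups);
    if (p) p = take_line(p, a->path, sizeof a->path);
    if (p) p = take_line(p, a->date, sizeof a->date);
    if (p) p = take_line(p, a->subject, sizeof a->subject);
    if (p == NULL)
        return -1;
    a->body = p;
    return 0;
}
```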

    A.2. Early "B News" Article Format

    This obsolete pseudo-Internet article format, used briefly during the
    transition between the A News format and the modern format, followed
    the general outline of a MAIL message but with some non-standard
    headers. For example:

    From: cbosgd!mhuxj!mhuxt!eagle!jerry (Jerry Schwarz)
    Newsgroups: news.misc
    Title: Usenet Etiquette -- Please Read
    Article-I.D.: eagle.642
    Posted: Fri Nov 19 16:14:55 1982
    Received: Fri Nov 19 16:59:30 1982
    Expires: Mon Jan 1 00:00:00 1990

    body
    body
    body

    The From header contained the information now found in the Path
    header, plus possibly the full name now typically found in the From
    header. The Title header contained what is now the Subject content.
    The Posted header contained what is now the Date content. The
    Article-I.D. header contained an article ID, analogous to a message
    ID and used for similar purposes. The Newsgroups and Expires headers
    were approximately as they are now. The Received header contained
    the date when the latest relayer to process the article first saw it.
    All dates were in the above format, with all fields fixed width,
    resembling an Internet date but not quite the same.

    This format is documented for archaeological purposes only. Do not
    generate articles in this format.
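
    The header renames described in A.2 could be captured in a small lookup
    table. This is a sketch with a hypothetical function name. It covers only
    the names: the old From value is really a path, and an Article-I.D. value
    is not in Message-ID syntax, so the values need separate treatment.

```c
#include <stddef.h>
#include <string.h>

/*
 * Map an early B News header name to its modern counterpart, following
 * the description in RFC 1849, A.2. Names without a mapping are returned
 * unchanged.
 */
const char *modern_header_name(const char *old) {
    static const struct {
        const char *old_name;
        const char *new_name;
    } map[] = {
        { "Title", "Subject" },
        { "Posted", "Date" },
        { "Article-I.D.", "Message-ID" },
    };
    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++) {
        if (strcmp(old, map[i].old_name) == 0)
            return map[i].new_name;
    }
    return old;
}
```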
    --
    Julien ÉLIE

    « Just don't create a file called -rf. » (Larry Wall)

  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to news.software.nntp on Thu Sep 4 18:03:27 2025
    From Newsgroup: news.software.nntp

    Julien ÉLIE <iulius@nom-de-mon-site.com.invalid> wrote or quoted:
    If someone has the time and the skill to write a decent parser in C to
    decode dates in ctime(3) format, we could add it to INN and achieve your
    dream :)

    I am not sure if this request is still relevant right now.
    If it is, the first step would be a formal grammar.
    I could keep working on it later if needed.

    So, here is the grammar attempt for now.

    <ctime-string> ::=
    <day-of-week> " " <month> " " <day> " " <time> " " <year> "\n"

    <day-of-week> ::= "Sun" | "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat"

    <month> ::= "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun"
    | "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec"

    <day> ::=
    " " <digit>
    | <digit><digit>
    (* range 1-31, first form has a leading space when <digit> is 1-9 *)

    <time> ::= <hour> ":" <minute> ":" <second>

    <hour> ::= <digit><digit> (* 00-23 *)
    <minute> ::= <digit><digit> (* 00-59 *)
    <second> ::= <digit><digit> (* 00-60, leap second allowed *)

    <year> ::= <digit><digit><digit><digit>

    <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"


  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to news.software.nntp on Thu Sep 4 18:39:59 2025
    From Newsgroup: news.software.nntp

    ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    I am not sure if this request is still relevant right now.

    Some lines of the source code have more than 72 characters.

    /*
     * ctime_parser.c
     *
     * A recursive-descent parser for ctime(3) formatted strings.
     *
     * Example input:
     *   "Wed Jun 30 21:49:08 1993\n"
     *
     * Author: Stefan Ram
     * Copyright (c) 2025 Stefan Ram
     * Licensed under the Apache License, Version 2.0
     *
     * Note: This software has just been written and has not been
     * extensively tested.
     *
     */

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* -------------------- Status codes -------------------- */
    typedef enum {
        CTIME_OK = 0,
        CTIME_ERR_SYNTAX, /* structural problem with the string */
        CTIME_ERR_RANGE   /* values outside allowed ranges */
    } CTimeStatus;

    /* -------------------- Parse result -------------------- */
    typedef struct {
        int weekday; /* 0=Sun .. 6=Sat */
        int month;   /* 1=Jan .. 12=Dec */
        int day;     /* 1..31 */
        int hour;    /* 0..23 */
        int minute;  /* 0..59 */
        int second;  /* 0..60 */
        int year;    /* typically 4-digit */

        CTimeStatus status; /* parse result */
    } CTimeFields;

    /* -------------------- Parser object -------------------- */
    typedef struct {
        const char *pos;  /* current input pointer */
        CTimeFields *out; /* where to put results/status */
    } Parser;

    /* -------------------- Utility functions -------------------- */

    /* Record only the first error; later errors are ignored. */
    static void set_error(Parser *p, CTimeStatus st) {
        if (p->out->status == CTIME_OK) {
            p->out->status = st;
        }
    }

    /* Consume the character c, or flag a syntax error. */
    static void expect(Parser *p, char c) {
        if (*p->pos != c) set_error(p, CTIME_ERR_SYNTAX);
        else p->pos++;
    }

    /* Parse exactly ndigits decimal digits; -1 on error. */
    static int parse_number_digits(Parser *p, int ndigits) {
        int v = 0;
        for (int i = 0; i < ndigits; i++) {
            if (!isdigit((unsigned char)p->pos[i])) {
                set_error(p, CTIME_ERR_SYNTAX);
                return -1;
            }
            v = v * 10 + (p->pos[i] - '0');
        }
        p->pos += ndigits;
        return v;
    }

    /* -------------------- Grammar routines -------------------- */

    static int parse_day_of_week(Parser *p) {
        static const char *names[] = {"Sun","Mon","Tue","Wed","Thu","Fri","Sat"};
        for (int i = 0; i < 7; i++) {
            if (strncmp(p->pos, names[i], 3) == 0) {
                p->pos += 3;
                return i;
            }
        }
        set_error(p, CTIME_ERR_SYNTAX);
        return -1;
    }

    static int parse_month(Parser *p) {
        static const char *names[] = {
            "Jan","Feb","Mar","Apr","May","Jun",
            "Jul","Aug","Sep","Oct","Nov","Dec"
        };
        for (int i = 0; i < 12; i++) {
            if (strncmp(p->pos, names[i], 3) == 0) {
                p->pos += 3;
                return i + 1;
            }
        }
        set_error(p, CTIME_ERR_SYNTAX);
        return -1;
    }

    static int parse_day(Parser *p) {
        int d;
        if (*p->pos == ' ') { /* space + single digit */
            p->pos++;
            if (!isdigit((unsigned char)*p->pos)) {
                set_error(p, CTIME_ERR_SYNTAX);
                return -1;
            }
            d = *p->pos++ - '0';
        } else { /* two digits */
            if (!isdigit((unsigned char)p->pos[0]) ||
                !isdigit((unsigned char)p->pos[1])) {
                set_error(p, CTIME_ERR_SYNTAX);
                return -1;
            }
            d = (p->pos[0]-'0')*10 + (p->pos[1]-'0');
            p->pos += 2;
        }
        if (d < 1 || d > 31) {
            set_error(p, CTIME_ERR_RANGE);
            return -1;
        }
        return d;
    }

    /* -------------------- Top-level parse -------------------- */

    CTimeFields parse_ctime(const char *s) {
        CTimeFields f = {0};
        f.status = CTIME_OK;
        Parser P = { s, &f };

        f.weekday = parse_day_of_week(&P);
        expect(&P, ' ');
        f.month = parse_month(&P);
        expect(&P, ' ');
        f.day = parse_day(&P);
        expect(&P, ' ');
        f.hour = parse_number_digits(&P, 2);
        expect(&P, ':');
        f.minute = parse_number_digits(&P, 2);
        expect(&P, ':');
        f.second = parse_number_digits(&P, 2);
        expect(&P, ' ');
        f.year = parse_number_digits(&P, 4);
        expect(&P, '\n');

        /* If there are trailing characters after newline, that's an error */
        if (*P.pos != '\0' && f.status == CTIME_OK)
            f.status = CTIME_ERR_SYNTAX;

        return f;
    }

    /* -------------------- Demonstration -------------------- */
    #define TEST_CTIME_PARSER
    #ifdef TEST_CTIME_PARSER
    int main(void) {
        const char *tests[] = {
            "Wed Jun 30 21:49:08 1993\n",
            "Sun Dec  5 07:03:12 2021\n", /* single-digit day: two spaces */
            "Bad Mon 30 00:00:00 1900\n", /* bogus */
            NULL
        };

        for (const char **t = tests; *t; t++) {
            printf("Parsing: \"%s\"\n", *t);
            CTimeFields f = parse_ctime(*t);
            if (f.status == CTIME_OK) {
                printf(" -> OK: wd=%d, mon=%d, day=%d, %02d:%02d:%02d, year=%d\n",
                       f.weekday, f.month, f.day,
                       f.hour, f.minute, f.second, f.year);
            } else {
                printf(" -> ERROR, status=%d\n", f.status);
            }
        }
        return 0;
    }
    #endif

    Output:

    Parsing: "Wed Jun 30 21:49:08 1993
    "
     -> OK: wd=3, mon=6, day=30, 21:49:08, year=1993
    Parsing: "Sun Dec  5 07:03:12 2021
    "
     -> OK: wd=0, mon=12, day=5, 07:03:12, year=2021
    Parsing: "Bad Mon 30 00:00:00 1900
    "
     -> ERROR, status=1

    (That funny quotation mark is ok, because the ctime(3) format
    includes an end-of-line at the end.)
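
    To turn the parsed fields into something INN might accept in a Date
    header, they still have to be written out in the usual
    "day-name, day month year time zone" order. A minimal sketch follows;
    format_rfc5322 is a hypothetical helper, and the "+0000" zone is purely
    an assumption, since a ctime(3) string carries no zone information.

```c
#include <stdio.h>

/*
 * Write the fields as e.g. "Wed, 30 Jun 1993 21:49:08 +0000".
 * weekday is 0=Sun .. 6=Sat and month is 1..12, as in CTimeFields.
 * Returns the snprintf() result, or -1 if weekday or month is out of range.
 * The "+0000" zone is an assumption of this sketch.
 */
int format_rfc5322(char *buf, size_t cap, int weekday, int month, int day,
                   int hour, int minute, int second, int year) {
    static const char *const wd[] = {
        "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
    };
    static const char *const mon[] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    if (weekday < 0 || weekday > 6 || month < 1 || month > 12)
        return -1;
    return snprintf(buf, cap, "%s, %d %s %04d %02d:%02d:%02d +0000",
                    wd[weekday], day, mon[month - 1], year,
                    hour, minute, second);
}
```

    With the fields of the first test string this produces
    "Wed, 30 Jun 1993 21:49:08 +0000".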


  • From Billy G.@contact-5c2e-000@pugleaf.net to news.software.nntp on Thu Sep 4 20:14:52 2025
    From Newsgroup: news.software.nntp

    On 31.08.25 17:58, Colin Macleod wrote:
    But I hit lots of dates that this failed with, and gradually built up a collection of hacks to massage the input into a form it would accept.

    how we all ran into the same issues :D
  • From Billy G.@contact-5c2e-000@pugleaf.net to news.software.nntp on Thu Sep 4 20:18:49 2025
    From Newsgroup: news.software.nntp

    On 04.09.25 19:39, Stefan Ram wrote:

    const char *tests[] = {
        "Wed Jun 30 21:49:08 1993\n",
        "Sun Dec  5 07:03:12 2021\n",
        "Bad Mon 30 00:00:00 1900\n", /* bogus */
        NULL
    };

    go-pugleaf$ go build -o build/parsedates ./cmd/parsedates

    go-pugleaf$ build/parsedates "Wed Jun 30 21:49:08 1993"
    2025/09/04 20:17:27 go-pugleaf Date Parser (version: -unset-)
    Original date string: Wed Jun 30 21:49:08 1993
    ✅ Parsed successfully:
    RFC3339: 1993-06-30T21:49:08Z
    Human: Wednesday, 30 June 1993 21:49:08 UTC
    Year: 1993
    Month: June (6)
    Day: 30
    Time: 21:49:08

    go-pugleaf$ build/parsedates "Sun Dec 5 07:03:12 2021"
    2025/09/04 20:18:01 go-pugleaf Date Parser (version: -unset-)
    Original date string: Sun Dec 5 07:03:12 2021
    ✅ Parsed successfully:
    RFC3339: 2021-12-05T07:03:12Z
    Human: Sunday, 5 December 2021 07:03:12 UTC
    Year: 2021
    Month: December (12)
    Day: 5
    Time: 07:03:12

    build/parsedates "Bad Mon 30 00:00:00 1900"
    2025/09/04 20:18:31 go-pugleaf Date Parser (version: -unset-)
    Original date string: Bad Mon 30 00:00:00 1900
    ✅ Parsed successfully:
    RFC3339: 1990-01-30T00:00:00Z
    Human: Tuesday, 30 January 1990 00:00:00 UTC
    Year: 1990
    Month: January (1)
    Day: 30
    Time: 00:00:00
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to news.software.nntp on Thu Sep 4 19:25:22 2025
    From Newsgroup: news.software.nntp

    ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    * ctime_parser.c

    Documentation:

    Thank you for selecting this parser! By doing so, you have chosen
    a reliable and carefully written tool for handling ctime(3) date
    strings. This document serves as the reference and user guide
    to help you make full and confident use of the library.

    The ctime(3) Format

    The C standard library provides the function ctime(3), which takes a
    pointer to a time_t value and produces a human-readable string
    representing local time. The format of the resulting string is fixed
    and well-known:

    "Wed Jun 30 21:49:08 1993\n"

    Breaking it down:

    - Day of week: Three-letter abbreviation (Sun .. Sat)

    - Month: Three-letter abbreviation (Jan .. Dec)

    - Day of month: Either space + digit (for 1-9) or two digits (10-31)

    - Time: hh:mm:ss, with leading zeroes

    - Year: Four digits

    - Newline: Always present at the end

    This is the same layout produced by asctime(3). Many programs
    rely on this representation, and having a robust parser for
    it is therefore useful.

    What is a Parser?

    A parser is a piece of software that reads input according
    to a grammar and produces structured information from it.
    In other words, instead of handling raw text, a parser lets
    you work with well-defined fields (weekday, month, day, time,
    year) as integers or codes.

    What This Parser Does

    This parser accepts a string in strict ctime(3) format and extracts
    the following fields:

    - weekday (0 = Sunday … 6 = Saturday)

    - month (1 = January … 12 = December)

    - day (1-31)

    - hour (0-23)

    - minute (0-59)

    - second (0-60, allowing for leap seconds)

    - year (integer, typically four digits)

    It also provides status information, so your program knows
    whether the parse succeeded or failed, and why.

    Preparing Input for the Parser

    The parser expects a C string (const char *) that contains the
    exact format produced by ctime(3). Common ways to obtain such
    a string include:

    - Obtaining one directly from ctime() or asctime().

    - Reading a log file that stores ctime-style timestamps.

    - Generating the format in a controlled environment for testing.

    Make sure the input is complete, including the final newline,
    before passing it to the parser.
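
    A Date header value taken from an article normally has no trailing
    newline, so it needs one appended before it reaches parse_ctime(). The
    helper below is a minimal sketch with a hypothetical name; it also strips
    trailing spaces, tabs, CR and LF first, which is an assumption about what
    the input may look like.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a header value into buf, trimming trailing whitespace and
 * appending the '\n' that the ctime(3) format requires.
 * Returns 0 on success, -1 if the result does not fit into cap bytes.
 */
int to_ctime_input(const char *value, char *buf, size_t cap) {
    size_t len = strlen(value);
    while (len > 0 && (value[len - 1] == ' ' || value[len - 1] == '\t' ||
                       value[len - 1] == '\r' || value[len - 1] == '\n'))
        len--;
    if (len + 2 > cap) /* room for '\n' and the terminating NUL */
        return -1;
    memcpy(buf, value, len);
    buf[len] = '\n';
    buf[len + 1] = '\0';
    return 0;
}
```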

    The Main Interface

    The entry point is:

    CTimeFields parse_ctime(const char *s);

    Parameter:

    - s is a NUL-terminated C string in ctime(3) format.

    Result:

    - A CTimeFields struct with the following fields:

    typedef struct {
        int weekday;        /* 0=Sun .. 6=Sat */
        int month;          /* 1=Jan .. 12=Dec */
        int day;            /* 1..31 */
        int hour;           /* 0..23 */
        int minute;         /* 0..59 */
        int second;         /* 0..60 */
        int year;           /* typically four digits */
        CTimeStatus status; /* parse result */
    } CTimeFields;

    Status codes (CTimeStatus):

    - CTIME_OK - parsing was successful.

    - CTIME_ERR_SYNTAX - format was invalid.

    - CTIME_ERR_RANGE - a numeric field was outside its legal range.

    You should always check the status field before using the parsed
    values.

    Structure of the Implementation

    The parser is recursive-descent style, directly matching
    the structure of the ctime(3) grammar. Each grammatical unit
    (such as day of week, month, day number, or time fields) has
    a dedicated parsing function. A helper Parser object carries
    the input position and result reference.

    This approach keeps the code straightforward, predictable,
    and clear to follow, closely reflecting the format we
    expect.

    Resource Management

    This parser does not perform any dynamic memory allocation.

    No malloc() or free() calls are involved. The entire parse is
    stack-based and returns results by value.

    This means:

    - No need to release resources after a parse.

    - Thread-safety as long as you provide independent input strings.

    Example Usage

    Here is a simple example that shows typical use:

    #include <stdio.h>
    /* CTimeFields, CTIME_OK and parse_ctime() come from ctime_parser.c */

    int main(void) {
        const char *s = "Wed Jun 30 21:49:08 1993\n";
        CTimeFields f = parse_ctime(s);

        if (f.status == CTIME_OK) {
            printf("Parsed successfully:\n");
            printf(" Weekday: %d\n", f.weekday);
            printf(" Month: %d\n", f.month);
            printf(" Day: %d\n", f.day);
            printf(" Time: %02d:%02d:%02d\n",
                   f.hour, f.minute, f.second);
            printf(" Year: %d\n", f.year);
        } else {
            printf("Parse failed, status=%d\n", f.status);
        }
        return 0;
    }

    Source Code Walkthrough

    The file is organized into sections:

    - Status and result definitions

    * Enumerations for error codes.

    * The CTimeFields struct.

    - Parser support

    * Utilities to check expected characters, set error status,
    and extract fixed-digit numbers.

    - Grammar routines

    * parse_day_of_week, parse_month, parse_day, etc., each
    reflecting the grammar rules.

    - Top-level function

    * parse_ctime() orchestrates all parts in sequence.

    - Optional test harness

    * Enabled via #define TEST_CTIME_PARSER.

    This design makes it easy both to use and to maintain.

    Additional Notes

    - Input must contain the final newline; otherwise,
    the parser will reject it.

    - The parser assumes ASCII-compatible encoding.

    - Leap seconds (xx:xx:60) are accepted for completeness.

    - Range checks are included for day values, hours, minutes, and seconds.

    Grammar for ctime(3) strings


    <ctime-string> ::=
    <day-of-week> " " <month> " " <day> " " <time> " " <year> "\n"

    <day-of-week> ::= "Sun" | "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat"

    <month> ::= "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun"
    | "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec"

    <day> ::=
    " " <digit>
    | <digit><digit>
    (* range 1-31, first form has a leading space when <digit> is 1-9 *)

    <time> ::= <hour> ":" <minute> ":" <second>

    <hour> ::= <digit><digit> (* 00-23 *)
    <minute> ::= <digit><digit> (* 00-59 *)
    <second> ::= <digit><digit> (* 00-60, leap second allowed *)

    <year> ::= <digit><digit><digit><digit>

    <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

    Closing Remarks

    This library provides a practical and dependable way to convert
    ctime(3) strings into structured fields. The code has been
    designed with simplicity and clarity in mind.

    We wish you success in your projects, confident use of this
    parser, and enjoyment in knowing that date and time strings
    no longer pose a challenge.


  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to news.software.nntp on Thu Sep 4 19:40:39 2025
    From Newsgroup: news.software.nntp

    ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
    - Range checks are included for day values, hours, minutes, and seconds.

    No! The validation of ranges is currently limited. We can improve it,
    if desired, by adding the line "validate_fields(&f);" to "parse_ctime",
    adding "validate_fields", and changing "parse_day" . . .

    Updated code and output with more test cases below.

    code:

    /*
     * ctime_parser.c V1.1
     *
     * A recursive-descent parser for ctime(3) / asctime(3) formatted strings.
     *
     * Example input:
     *   "Wed Jun 30 21:49:08 1993\n"
     *
     * Author: Stefan Ram
     * Copyright (c) 2025 Stefan Ram
     * Licensed under the Apache License, Version 2.0
     *
     * Note: This software has just been written and has not been extensively tested.
     */

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* -------------------- Status codes -------------------- */
    typedef enum {
        CTIME_OK = 0,
        CTIME_ERR_SYNTAX, /* structural problem with the string */
        CTIME_ERR_RANGE   /* values outside allowed ranges */
    } CTimeStatus;

    /* -------------------- Parse result -------------------- */
    typedef struct {
        int weekday; /* 0=Sun .. 6=Sat */
        int month;   /* 1=Jan .. 12=Dec */
        int day;     /* 1..31 */
        int hour;    /* 0..23 */
        int minute;  /* 0..59 */
        int second;  /* 0..60 */
        int year;    /* typically 4-digit */

        CTimeStatus status; /* parse result */
    } CTimeFields;

    /* -------------------- Parser object -------------------- */
    typedef struct {
        const char *pos;  /* current input pointer */
        CTimeFields *out; /* where to put results/status */
    } Parser;

    /* -------------------- Utility functions -------------------- */

    /* Record only the first error; later errors are ignored. */
    static void set_error(Parser *p, CTimeStatus st) {
        if (p->out->status == CTIME_OK) {
            p->out->status = st;
        }
    }

    /* Consume the character c, or flag a syntax error. */
    static void expect(Parser *p, char c) {
        if (*p->pos != c) set_error(p, CTIME_ERR_SYNTAX);
        else p->pos++;
    }

    /* Parse exactly ndigits decimal digits; -1 on error. */
    static int parse_number_digits(Parser *p, int ndigits) {
        int v = 0;
        for (int i = 0; i < ndigits; i++) {
            if (!isdigit((unsigned char)p->pos[i])) {
                set_error(p, CTIME_ERR_SYNTAX);
                return -1;
            }
            v = v * 10 + (p->pos[i] - '0');
        }
        p->pos += ndigits;
        return v;
    }

    /* -------------------- Grammar routines -------------------- */

    static int parse_day_of_week(Parser *p) {
        static const char *names[] = {"Sun","Mon","Tue","Wed","Thu","Fri","Sat"};
        for (int i = 0; i < 7; i++) {
            if (strncmp(p->pos, names[i], 3) == 0) {
                p->pos += 3;
                return i;
            }
        }
        set_error(p, CTIME_ERR_SYNTAX);
        return -1;
    }

    static int parse_month(Parser *p) {
        static const char *names[] = {
            "Jan","Feb","Mar","Apr","May","Jun",
            "Jul","Aug","Sep","Oct","Nov","Dec"
        };
        for (int i = 0; i < 12; i++) {
            if (strncmp(p->pos, names[i], 3) == 0) {
                p->pos += 3;
                return i + 1;
            }
        }
        set_error(p, CTIME_ERR_SYNTAX);
        return -1;
    }

    static int is_leap_year(int year) {
        if ((year % 400) == 0) return 1;
        if ((year % 100) == 0) return 0;
        if ((year % 4) == 0) return 1;
        return 0;
    }

    static int days_in_month(int month, int year) {
        static const int month_len[12] = {
            31, /* Jan */
            28, /* Feb (default) */
            31, /* Mar */
            30, /* Apr */
            31, /* May */
            30, /* Jun */
            31, /* Jul */
            31, /* Aug */
            30, /* Sep */
            31, /* Oct */
            30, /* Nov */
            31  /* Dec */
        };
        if (month < 1 || month > 12) return 0;
        if (month == 2 && is_leap_year(year)) return 29;
        return month_len[month-1];
    }

    static int parse_day(Parser *p) {
        int d;
        if (*p->pos == ' ') { // space + single digit
            p->pos++;
            if (!isdigit((unsigned char)*p->pos)) {
                set_error(p, CTIME_ERR_SYNTAX);
                return -1;
            }
            d = *p->pos++ - '0';
        } else {
            if (!isdigit((unsigned char)p->pos[0]) ||
                !isdigit((unsigned char)p->pos[1])) {
                set_error(p, CTIME_ERR_SYNTAX);
                return -1;
            }
            d = (p->pos[0]-'0')*10 + (p->pos[1]-'0');
            p->pos += 2;
        }
        if (d < 1 || d > 31) {
            set_error(p, CTIME_ERR_RANGE);
            return -1;
        }
        return d;
    }

    static void validate_fields(CTimeFields *f) {
        if (f->status != CTIME_OK) return;

        /* Hours */
        if (f->hour < 0 || f->hour > 23) {
            f->status = CTIME_ERR_RANGE;
            return;
        }
        /* Minutes */
        if (f->minute < 0 || f->minute > 59) {
            f->status = CTIME_ERR_RANGE;
            return;
        }
        /* Seconds */
        if (f->second < 0 || f->second > 60) {
            f->status = CTIME_ERR_RANGE;
            return;
        }
        /* Day of month with leap year rules */
        int maxd = days_in_month(f->month, f->year);
        if (f->day < 1 || f->day > maxd) {
            f->status = CTIME_ERR_RANGE;
            return;
        }
    }

    /* -------------------- Top-level parse -------------------- */

    CTimeFields parse_ctime(const char *s) {
        CTimeFields f = {0};
        f.status = CTIME_OK;
        Parser P = { s, &f };

        f.weekday = parse_day_of_week(&P);
        expect(&P, ' ');
        f.month = parse_month(&P);
        expect(&P, ' ');
        f.day = parse_day(&P);
        expect(&P, ' ');
        f.hour = parse_number_digits(&P, 2);
        expect(&P, ':');
        f.minute = parse_number_digits(&P, 2);
        expect(&P, ':');
        f.second = parse_number_digits(&P, 2);
        expect(&P, ' ');
        f.year = parse_number_digits(&P, 4);
        expect(&P, '\n');

        /* If there are trailing characters after newline, that's an error */
        if (*P.pos != '\0' && f.status == CTIME_OK)
            f.status = CTIME_ERR_SYNTAX;

        validate_fields(&f);

        return f;
    }

    /* -------------------- Demonstration -------------------- */
    #define TEST_CTIME_PARSER
    #ifdef TEST_CTIME_PARSER
    int main(void) {
        const char *tests[] = {
            "Wed Jun 30 21:49:08 1993\n", /* valid */
            "Sun Dec  5 07:03:12 2021\n", /* valid */
            "Bad Mon 30 00:00:00 1900\n", /* bogus weekday -> SYNTAX */

            /* Range error cases */
            "Mon Jan 32 12:00:00 2022\n", /* day=32 -> ERR_RANGE */
            "Tue Apr 31 12:00:00 2022\n", /* April has 30 days -> ERR_RANGE */
            "Wed Feb 29 12:00:00 2021\n", /* 2021 not leap year -> ERR_RANGE */
            "Thu Feb 29 12:00:00 2020\n", /* 2020 leap year, okay -> OK */
            "Fri Feb 30 12:00:00 2020\n", /* still invalid (29 max) -> ERR_RANGE */
            "Sat Mar 10 24:00:00 2022\n", /* hour=24 -> ERR_RANGE */
            "Sun Mar 10 12:60:00 2022\n", /* minute=60 -> ERR_RANGE */
            "Mon Mar 10 12:59:61 2022\n", /* second=61 -> ERR_RANGE */

            NULL
        };

        for (const char **t = tests; *t; t++) {
            printf("Parsing: \"%s\"\n", *t);
            CTimeFields f = parse_ctime(*t);
            if (f.status == CTIME_OK) {
                printf(" -> OK: wd=%d, mon=%d, day=%d, %02d:%02d:%02d, year=%d\n",
                       f.weekday, f.month, f.day,
                       f.hour, f.minute, f.second, f.year);
            } else {
                printf(" -> ERROR, status=%d\n", f.status);
            }
        }
        return 0;
    }
    #endif

    output:

    Parsing: "Wed Jun 30 21:49:08 1993
    "
     -> OK: wd=3, mon=6, day=30, 21:49:08, year=1993
    Parsing: "Sun Dec  5 07:03:12 2021
    "
     -> OK: wd=0, mon=12, day=5, 07:03:12, year=2021
    Parsing: "Bad Mon 30 00:00:00 1900
    "
     -> ERROR, status=1
    Parsing: "Mon Jan 32 12:00:00 2022
    "
     -> ERROR, status=2
    Parsing: "Tue Apr 31 12:00:00 2022
    "
     -> ERROR, status=2
    Parsing: "Wed Feb 29 12:00:00 2021
    "
     -> ERROR, status=2
    Parsing: "Thu Feb 29 12:00:00 2020
    "
     -> OK: wd=4, mon=2, day=29, 12:00:00, year=2020
    Parsing: "Fri Feb 30 12:00:00 2020
    "
     -> ERROR, status=2
    Parsing: "Sat Mar 10 24:00:00 2022
    "
     -> ERROR, status=2
    Parsing: "Sun Mar 10 12:60:00 2022
    "
     -> ERROR, status=2
    Parsing: "Mon Mar 10 12:59:61 2022
    "
     -> ERROR, status=2
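
    One thing this version still does not check is whether the weekday
    matches the date: "Thu Feb 29 12:00:00 2020" is accepted above although
    29 February 2020 was a Saturday. If that matters, the weekday can be
    recomputed from the date, for instance with Sakamoto's method as sketched
    below. weekday_of is a hypothetical helper, it assumes the Gregorian
    calendar, and whether a mismatch should reject the article or merely be
    ignored is a policy choice left open here.

```c
/*
 * Day of the week for a Gregorian date, using Sakamoto's method.
 * month is 1..12; the result is 0=Sun .. 6=Sat, the same numbering as
 * CTimeFields.weekday.
 */
int weekday_of(int year, int month, int day) {
    static const int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    if (month < 3)
        year -= 1; /* January and February count with the previous year */
    return (year + year / 4 - year / 100 + year / 400 + t[month - 1] + day) % 7;
}
```

    Comparing f.weekday with weekday_of(f.year, f.month, f.day) after a
    successful parse would flag the "Thu Feb 29 ... 2020" test string.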

