• Retro Guy's Scripts for importing old articles into INN

    From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Wed Sep 3 19:18:26 2025
    From Newsgroup: news.software.nntp

    Since there has been some discussion about archiving very old articles
    with INN, I found this e-mail thread I forgot about from Retro Guy (RIP)
    and wanted to share the knowledge in case it helps others. I haven't used
    the script yet, and he was using rpost to post as a client and stripping a
    lot of headers. Personally, I would leave the headers besides Date intact
    and use rnews to inject them.

    -----
    Here you go. I have attached some php scripts and my notes. Please read through the scripts before running them, just to be cautious. I may also
    have left some hardcoded paths that you would need to change.

    I hope this all makes sense. I haven't really thought about it for a few months.

    Oh, and no, I'm not working with the archive.org historical archive. I'm using David Wiseman's UTZOO archive: https://archive.org/details/utzoo-wiseman-usenet-archive

    -----
    How I think this works. It's been a few months since I did this so
    hopefully I'm not overlooking something:

    First, create 'artlist.in' in the script dir ($scriptdir).
    artlist.in is a file containing the full path to the articles you want to eventually import. One article per file.
    So, something like:
    find /full/path/to/your/unmodified/files/ -type f > artlist.in

    Create a directory in the script dir named out ($scriptdir/out). This directory will be written to by the next script, so it should be empty.
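
    The two preparation steps above could be wrapped in one small function. This is only a sketch: prepare_workdir is a hypothetical name, and the check for leftovers assumes that a non-empty out directory is always a mistake.

```shell
#!/bin/bash
# Sketch only. prepare_workdir is a hypothetical helper, not part of the
# original scripts.
#   $1 = directory holding the unmodified articles
#   $2 = script directory ($scriptdir in the notes above)
prepare_workdir() {
    local src="$1" scriptdir="$2"

    mkdir -p "$scriptdir/out" || return 1

    # datefix.php appends to out/0, out/1, ... so leftovers from an
    # earlier run would be mixed into the new articles.
    if [ -n "$(ls -A "$scriptdir/out")" ]; then
        echo "error: $scriptdir/out is not empty" >&2
        return 1
    fi

    # One path per line, as datefix.php expects.
    find "$src" -type f > "$scriptdir/artlist.in"
}
```

    Called as 'prepare_workdir /full/path/to/your/unmodified/files "$scriptdir"'. Pass an absolute source directory so that artlist.in holds full paths.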

    Now, run datefix.php from $scriptdir. It will read './artlist.in' and
    write to './out' and './newsgroups.inc'.

    Next, run get_groups.php. It will read './newsgroups.inc' and write './newsgroups.out'.
    It takes all the 'Newsgroups: *' from all the articles, splits them into
    one newsgroup per line, and writes them to newsgroups.out.

    Next, run 'sort newsgroups.out | uniq > newsgroups.txt'.
    This is pretty clear, it sorts all the newsgroups and then deletes
    duplicates.

    Then run the following shell script on your server (as news user) to
    create the groups from 'newsgroups.txt':

    -----
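    #!/bin/bash
    # Create every group listed in newsgroups.txt (one name per line).
    while IFS= read -r WORD
    do
        # Skip empty lines left over from the splitting step.
        [ -n "$WORD" ] || continue
        echo "$WORD"
        ctlinnd newgroup "$WORD"
    done < ./newsgroups.txt
    echo "Done."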
    -----

    There WILL be messed up group names due, most likely, to people typing
    them incorrectly when posting. Unless you want to read through the file
    first and remove them, they will be created on the server.
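
    One way to weed out the obviously broken names before creating anything is a pattern filter. The sketch below is an assumption, not part of the original notes: filter_groups is a hypothetical name, and the pattern only accepts at least two dot-separated components made of letters, digits, "+", "-" and "_". It will not catch misspelled but well-formed names.

```shell
#!/bin/bash
# Sketch only. filter_groups is a hypothetical helper.
# Reads group names on stdin (one per line) and prints the ones that look
# well-formed: two or more non-empty components separated by single dots.
filter_groups() {
    LC_ALL=C grep -E '^[A-Za-z0-9+_-]+(\.[A-Za-z0-9+_-]+)+$'
}
```

    For example, 'filter_groups < newsgroups.txt > newsgroups.clean' keeps the plausible names, and running the same grep with -v added shows what was rejected so it can be reviewed by hand.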

    Next, create a file named './artlist' that contains a list of every
    article in './out' by full path name. One full path article filename per
    line.

    Here's a shell script example, but I'm sure you can already do this:
    find /full/path/to/out/ -type f > artlist

    Finally, try to post the articles. Write a script similar to (I used
    rpost):

    -----
    #!/bin/bash

    # Server details
    server="server.name"
    port="port number"
    username="username"
    password="password"

    # Connect to NNTP server
    rpost $server -n -u -U $username -P $password -b artlist

    # Quit the NNTP server
    rpost -q

    echo "Articles posted successfully!"
    -----

    datefix.php:

    #!/usr/bin/php
    <?php
    /* FIRST: Create artlist.in */
    /* Clean ./out/* */

    $artfile = "artlist.in";
    $artlist = file($artfile);
    $newsgroupslist = "newsgroups.inc";
    unlink($newsgroupslist);

    $newarticle = array();
    $i=0;
    foreach($artlist as $article) {
        if(!is_file(trim($article))) {
            continue;
        }
        $articleline = file(trim($article));
        $lines = 0;
        $is_header = 1;
        foreach($articleline as $line) {
            /* The first blank line after at least one line ends the headers */
            if(trim($line) == "" && $lines > 0) {
                $is_header=0;
                $lines++;
            }
            /* Count every line; without this the test above never becomes true */
            $lines++;
            if(stripos($line, "Relay-Version") === 0 && $is_header == 1) {
                continue;
            }
            if(stripos($line, "Posting-Version") === 0 && $is_header == 1) {
                continue;
            }
            if(stripos($line, "Date-Received") === 0 && $is_header == 1) {
                continue;
            }
            if(stripos($line, "Xref") === 0 && $is_header == 1) {
                continue;
            }
            if(stripos($line, "X-Trace") === 0 && $is_header == 1) {
                continue;
            }
            if(stripos($line, "X-Complaints-To") === 0 && $is_header == 1) {
                continue;
            }
            if(stripos($line, "NNTP-Posting-Host") === 0 && $is_header == 1) {
                continue;
            }
            if(stripos($line, "Injection-Info") === 0 && $is_header == 1) {
                continue;
            }
            /* Collect the group list for get_groups.php */
            if(stripos($line, "Newsgroups: ") === 0 && $is_header == 1) {
                $groups = explode(': ', $line);
                file_put_contents($newsgroupslist, $groups[1], FILE_APPEND);
            }
            /* Rewrite the Date header in a format the server accepts */
            if(stripos($line, "Date: ") === 0 && $is_header == 1) {
                $finddate=explode(': ', $line);
                $newarticle[] = "Date: ".date("D, j M Y H:i T",strtotime($finddate[1]))."\n";
                continue;
            }
            if(trim($line) == ".") {
                $newarticle[] = "..\n";
                continue;
            }
            $newarticle[] = $line;
        }

        /* Write the rewritten article to out/<counter> */
        $newfile = 'out/'.$i;
        $i++;
        foreach($newarticle as $newline) {
            file_put_contents($newfile, $newline, FILE_APPEND);
        }
        unset($newarticle);
    }
    /* NEXT RUN get_groups.php */


    -----

    get_groups.php:

    #!/usr/bin/php
    <?php
    $groups_file = "newsgroups.inc";
    $newsgroups = file($groups_file);
    $outfile = "newsgroups.out";
    unlink($outfile);

    foreach($newsgroups as $groups) {
        $group = preg_split("/(,|\ )/", $groups);
        foreach($group as $addgroup) {
            file_put_contents($outfile, trim($addgroup)."\n", FILE_APPEND);
        }
    }
    /* NEXT IS 'sort newsgroups.out | uniq > newsgroups.txt' */
    /* Then send it to novalink.us and create groups */
    /* THEN: Create artlist from ./out */

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Wed Sep 3 21:56:22 2025
    From Newsgroup: news.software.nntp

    Hi Jesse,

    > Since there has been some discussion about archiving very old articles
    > with INN, I found this e-mail thread I forgot about from Retro Guy (RIP)
    > and wanted to share the knowledge in case it helps others.

    Thanks for sharing!


    > I haven't used
    > the script yet, and he was using rpost to post as a client and stripping a
    > lot of headers. Personally, I would leave the headers besides Date intact
    > and use rnews to inject them.

    The removed header fields should be OK to keep, indeed, as you intend to inject articles through a relaying agent like innd with rnews. I think
    Retro Guy removed them because they should not appear when injecting an article (rpost may not like them, nor an injecting agent like nnrpd).
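
    For the rnews route, the articles listed in 'artlist' can be concatenated into a batch in which each article is preceded by a "#! rnews <size>" line giving its length in bytes. The sketch below assumes that batch format and a hypothetical helper name, make_rnews_batch. How rnews reaches the server depends on the local INN configuration.

```shell
#!/bin/bash
# Sketch only. make_rnews_batch is a hypothetical helper.
# Reads article paths on stdin (one per line) and writes an rnews batch
# to stdout. Paths that do not point to a regular file are skipped.
make_rnews_batch() {
    local path size
    while IFS= read -r path; do
        [ -f "$path" ] || continue
        # Byte count of the article; tr strips the padding some wc add.
        size=$(wc -c < "$path" | tr -d ' ')
        printf '#! rnews %s\n' "$size"
        cat "$path"
    done
}
```

    Typical use would be 'make_rnews_batch < artlist | rnews'. Because the byte count delimits each article, the files are written to the batch unchanged.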


    > foreach($articleline as $line) {
    >     if(trim($line) == "" && $lines > 0) {
    >         $is_header=0;
    >         $lines++;
    >     }
    >     if(stripos($line, "Injection-Info") === 0 && $is_header == 1) {
    >         continue;
    >     }
    >     if(stripos($line, "Newsgroups: ") === 0 && $is_header == 1) {
    >         $groups = explode(': ', $line);
    >         file_put_contents($newsgroupslist, $groups[1], FILE_APPEND);
    >     }

    The loop does not seem to implement continuation lines.
    So only the first line of the Newsgroups header field will be parsed
    (though I bet continuation lines in that header field are very rare),
    and as the script only removes the first line of the Injection-Info
    header field (as far as I see), the remaining lines will be appended to
    the previous header field...
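
    One way around this is to unfold the header fields before the rewriter sees the article, so that every field sits on a single line. The sketch below is an assumption, not something from the original scripts: unfold_headers is a hypothetical name, and it expects LF line endings. It joins every header line that starts with a space or tab to the line before it and leaves the body untouched.

```shell
#!/bin/bash
# Sketch only. unfold_headers is a hypothetical helper.
# Reads one article on stdin and writes it to stdout with folded header
# fields joined onto a single line each.
unfold_headers() {
    awk '
        # After the first blank line everything is body: copy it as is.
        in_body          { print; next }
        # Blank line: flush the pending header field and switch to body.
        /^$/             { if (have) print held; have = 0; in_body = 1; print; next }
        # Continuation line: append it to the pending header field.
        /^[ \t]/ && have { held = held $0; next }
        # New header field: flush the previous one and hold this one.
                         { if (have) print held; held = $0; have = 1 }
        # Article without a body: flush the last header field.
        END              { if (have) print held }
    '
}
```

    With the headers unfolded, a line-based filter removes a whole Injection-Info field in one go and sees the complete Newsgroups list.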


    > if(stripos($line, "Date: ") === 0 && $is_header == 1) {
    >     $finddate=explode(': ', $line);
    >     $newarticle[] = "Date: ".date("D, j M Y H:i T",strtotime($finddate[1]))."\n";
    >     continue;
    > }

    You'll be happy with the strtotime() function which magically decodes
    all kinds of dates :)

    There's an online tester for this function:
    https://strtotime.co.uk/

    Sun, 28-Jul-85 00:57:37 EDT
    Sun, 28 Jul 1985 05:57:37 +0100


    > if(trim($line) == ".") {
    >     $newarticle[] = "..\n";
    >     continue;
    > }

    In fact, the script should add ".$line\n" to $newarticle[] when $line
    starts with a dot. This is what is called dot-stuffing.
    The current code will unfortunately alter article bodies when there is a leading dot. For instance, if I write ".test" on a line, it will
    become "test" unless a second "." is added when injecting the article.
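
    As a sketch of dot-stuffing outside the PHP script, sed can double the leading dot of every line that starts with one. The helper names dot_stuff and dot_unstuff are hypothetical. This only matters when articles are sent over NNTP by a client such as rpost, and it assumes the client does not already do the stuffing itself.

```shell
#!/bin/bash
# Sketch only. dot_stuff and dot_unstuff are hypothetical helpers.

# Add one dot in front of every line that begins with a dot.
dot_stuff() {
    sed 's/^\./../'
}

# Reverse operation, as done by the receiving side.
dot_unstuff() {
    sed 's/^\.\././'
}
```

    A line holding only "." becomes "..", and ".test" becomes "..test", so the receiving side gets back ".test" after removing the extra dot.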


    You may also want to remove duplicated header fields (at least the ones
    that should not appear twice), and maybe add a space after the colon
    following a header field name if it is not present.
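
    Both clean-ups can be sketched with awk. Everything here is an assumption to adapt: normalize_headers is a hypothetical name, the list of fields allowed only once (Date, From, Message-ID, Newsgroups, Path, Subject) is a starting point, and the first occurrence of a duplicated field is the one that is kept.

```shell
#!/bin/bash
# Sketch only. normalize_headers is a hypothetical helper.
# Reads one article on stdin. In the headers it inserts a missing space
# after the colon and drops repeated occurrences of the fields listed in
# "unique" (compared case-insensitively). The body is copied unchanged.
normalize_headers() {
    awk -v unique="date from message-id newsgroups path subject" '
        BEGIN    { n = split(unique, u, " "); for (i = 1; i <= n; i++) once[u[i]] = 1 }
        in_body  { print; next }
        /^$/     { in_body = 1; print; next }
        # Continuation lines follow the fate of the field they belong to.
        /^[ \t]/ { if (!dropped) print; next }
        {
            dropped = 0
            colon = index($0, ":")
            if (colon > 1) {
                name = tolower(substr($0, 1, colon - 1))
                rest = substr($0, colon + 1)
                if (rest != "" && rest !~ /^[ \t]/)
                    $0 = substr($0, 1, colon) " " rest
                if ((name in once) && seen[name]++) { dropped = 1; next }
            }
            print
        }
    '
}
```

    Fields that are not in the list, such as locally defined X- fields, may still appear more than once.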

    You'll soon have a working article rewriter system and achieve your
    dream :-)
    --
    Julien ÉLIE

    « C'est la goutte d'eau qui fait déborder le vase et qui met le feu aux
    poudres. »
