Forum: Too Lazy BBS

Retro Guy's Scripts for importing old articles into INN

From Jesse Rehmer@jesse.rehmer@blueworldhosting.com to news.software.nntp on Wed Sep 3 19:18:26 2025

From Newsgroup: news.software.nntp

Since there has been some discussion about archiving very old articles
with INN, I found this e-mail thread I forgot about from Retro Guy (RIP)
and wanted to share the knowledge in case it helps others. I haven't used
the script yet, and he was using rpost to post as a client and stripping a
lot of headers. Personally, I would leave the headers besides Date intact
and use rnews to inject them.
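
A rough sketch of the rnews route, assuming the usual rnews batch framing in which each article is preceded by a "#! rnews <byte count>" line. The helper name make_rnews_batch and the file names are hypothetical, and nothing here has been tested against a live INN server.

```shell
#!/bin/bash
# Hypothetical helper: wrap article files into one rnews batch on stdout.
# Assumes the "#! rnews <size>" framing, where <size> is the article's byte count.
make_rnews_batch() {
    for f in "$@"; do
        size=$(( $(wc -c < "$f") ))   # arithmetic strips any padding wc adds
        printf '#! rnews %s\n' "$size"
        cat "$f"
    done
}
```

Something like `make_rnews_batch out/* | rnews` would then hand the batch to the local server. For a very large number of files the argument list may be too long, so feeding the files in smaller groups is probably safer.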

-----
Here you go. I have attached some php scripts and my notes. Please read through the scripts before running them, just to be cautious. I may also
have left some hardcoded paths that you would need to change.

I hope this all makes sense. I haven't really thought about it for a few months.

Oh, and no, I'm not working with archive.org historical archive. I'm using David Wiseman's UTZOO archive: https://archive.org/details/utzoo-wiseman-usenet-archive

-----
How I think this works. It's been a few months since I did this so
hopefully I'm not overlooking something:

First, create 'artlist.in' in the script dir ($scriptdir).
artlist.in is a file listing the full paths of the articles you eventually want to import, one path per line. Each listed file should hold a single article.
So, something like:
find /full/path/to/your/unmodified/files/ -type f > artlist.in

Create a directory in the script dir named out ($scriptdir/out). This directory will be written to by the next script, so it should be empty.
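
The two preparation steps above could be wrapped in a small helper. This is only a sketch: the function name prepare_import and its two arguments are made up for illustration, and it simply refuses to run if out/ already contains files.

```shell
#!/bin/bash
# Hypothetical helper: build artlist.in and an empty out/ directory.
# $1 = directory holding the unmodified articles, $2 = script directory.
prepare_import() {
    src=$1
    scriptdir=$2
    mkdir -p "$scriptdir/out" || return 1
    # Refuse to continue if out/ still holds files from an earlier run.
    if [ -n "$(ls -A "$scriptdir/out")" ]; then
        echo "out/ is not empty" >&2
        return 1
    fi
    # One full path per line, as datefix.php expects.
    find "$src" -type f > "$scriptdir/artlist.in"
}
```

Usage would be along the lines of `prepare_import /full/path/to/your/unmodified/files "$scriptdir"`.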

Now, run datefix.php from $scriptdir. It will read './artlist.in' and
write to './out' and './newsgroups.inc'.

Next, run get_groups.php. It will read './newsgroups.inc' and write './newsgroups.out'.
It takes all the 'Newsgroups: *' values from all the articles, splits them into
one newsgroup per line, and writes them to newsgroups.out.

Next, run 'sort newsgroups.out | uniq > newsgroups.txt'.
This is pretty clear: it sorts all the newsgroups and then deletes
duplicates.
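
For comparison, the splitting and the sort/uniq step can also be done in one shell pipeline. This is a sketch rather than a replacement for get_groups.php; the function name split_groups is hypothetical, and it additionally drops empty entries and stray carriage returns.

```shell
#!/bin/bash
# Hypothetical helper: read raw Newsgroups values on stdin and print
# one unique group name per line.
split_groups() {
    tr -d '\r' |                 # drop carriage returns from CRLF files
    tr ', \t' '\n\n\n' |         # commas, spaces and tabs become newlines
    grep -v '^$' |               # discard the empty entries this produces
    sort -u                      # same effect as sort | uniq
}
```

It would be run as `split_groups < newsgroups.inc > newsgroups.txt`.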

Then run the following shell script on your server (as news user) to
create the groups from 'newsgroups.txt':

-----
#!/bin/bash
# Create every group listed in newsgroups.txt (one name per line).
while read -r WORD
do
    echo "$WORD"
    ctlinnd newgroup "$WORD"
done < ./newsgroups.txt
echo "Done."
-----

There WILL be messed up group names due, most likely, to people typing
them incorrectly when posting. Unless you want to read through the file
first and remove them, they will be created on the server.
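
One way to weed out the obviously broken names before creating anything is a simple pattern filter. The pattern below is an assumption about what a plausible group name looks like (lowercase components separated by single dots, at least two components); it is stricter than what the server itself accepts, so the rejected names are worth a quick look. The function name valid_groups is hypothetical.

```shell
#!/bin/bash
# Hypothetical filter: keep only names that look like well-formed newsgroups.
# Assumed shape: two or more dot-separated components made of a-z, 0-9, +, _ and -.
valid_groups() {
    LC_ALL=C grep -E '^[a-z0-9][a-z0-9+_-]*(\.[a-z0-9][a-z0-9+_-]*)+$'
}
```

For example, `valid_groups < newsgroups.txt > newsgroups.clean` keeps the plausible names, and feeding newsgroups.clean to the group creation loop avoids creating the typos.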

Next, create a file named './artlist' that contains a list of every
article in './out' by full path name. One full path article filename per
line.

Here's a shell script example, but I'm sure you can already do this:
find /full/path/to/out/ -type f > artlist

Finally, try to post the articles. Write a script similar to (I used
rpost):

-----
#!/bin/bash

# Server details
server="server.name"
port="port number"
username="username"
password="password"

# Connect to NNTP server
rpost $server -n -u -U $username -P $password -b artlist

# Quit the NNTP server
rpost -q

echo "Articles posted successfully!"
-----

datefix.php:

#!/usr/bin/php
<?php
/* FIRST: Create artlist.in */
/* Clean ./out/* */

$artfile = "artlist.in";
$artlist = file($artfile);
$newsgroupslist = "newsgroups.inc";
unlink($newsgroupslist);

$newarticle = array();
$i=0;
foreach($artlist as $article) {
    if(!is_file(trim($article))) {
        continue;
    }
    $articleline = file(trim($article));
    $lines = 0;
    $is_header = 1;
    foreach($articleline as $line) {
        if(trim($line) == "" && $lines > 0) {
            $is_header=0;
            $lines++;
        }
        if(stripos($line, "Relay-Version") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "Posting-Version") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "Date-Received") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "Xref") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "X-Trace") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "X-Complaints-To") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "NNTP-Posting-Host") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "Injection-Info") === 0 && $is_header == 1) {
            continue;
        }
        if(stripos($line, "Newsgroups: ") === 0 && $is_header == 1) {
            $groups = explode(': ', $line);
            file_put_contents($newsgroupslist, $groups[1], FILE_APPEND);
        }
        if(stripos($line, "Date: ") === 0 && $is_header == 1) {
            $finddate=explode(': ', $line);
            $newarticle[] = "Date: ".date("D, j M Y H:i T",strtotime($finddate[1]))."\n";
            continue;
        }
        if(trim($line) == ".") {
            $newarticle[] = "..\n";
            continue;
        }
        $newarticle[] = $line;
    }

    $newfile = 'out/'.$i;
    $i++;
    foreach($newarticle as $newline) {
        file_put_contents($newfile, $newline, FILE_APPEND);
    }
    unset($newarticle);
}
/* NEXT RUN get_groups.php */

-----

get_groups.php:

#!/usr/bin/php
<?php
$groups_file = "newsgroups.inc";
$newsgroups = file($groups_file);
$outfile = "newsgroups.out";
unlink($outfile);

foreach($newsgroups as $groups) {
    $group = preg_split("/(,|\ )/", $groups);
    foreach($group as $addgroup) {
        file_put_contents($outfile, trim($addgroup)."\n", FILE_APPEND);
    }
}
/* NEXT IS 'sort newsgroups.out | uniq > newsgroups.txt' */
/* Then send it to novalink.us and create groups */
/* THEN: Create artlist from ./out */

--- Synchronet 3.21a-Linux NewsLink 1.2

From Julien ÉLIE@iulius@nom-de-mon-site.com.invalid to news.software.nntp on Wed Sep 3 21:56:22 2025

From Newsgroup: news.software.nntp

Hi Jesse,

> Since there has been some discussion about archiving very old articles
> with INN, I found this e-mail thread I forgot about from Retro Guy (RIP)
> and wanted to share the knowledge in case it helps others.

Thanks for sharing!

> I haven't used the script yet, and he was using rpost to post as a
> client and stripping a lot of headers. Personally, I would leave the
> headers besides Date intact and use rnews to inject them.

The removed header fields should be OK to keep, indeed, as you intend to inject articles through a relaying agent like innd with rnews. I think
Retro Guy removed them because they should not appear when injecting an article (rpost may not like them, nor an injecting agent like nnrpd).

> foreach($articleline as $line) {
>     if(trim($line) == "" && $lines > 0) {
>         $is_header=0;
>         $lines++;
>     }
>     if(stripos($line, "Injection-Info") === 0 && $is_header == 1) {
>         continue;
>     }
>     if(stripos($line, "Newsgroups: ") === 0 && $is_header == 1) {
>         $groups = explode(': ', $line);
>         file_put_contents($newsgroupslist, $groups[1], FILE_APPEND);
>     }

The loop does not seem to implement continuation lines.
So only the first line of the Newsgroups header field will be parsed
(though I bet continuation lines in that header field are very rare),
and as the script only removes the first line of the Injection-Info
header field (as far as I see), the remaining lines will be appended to
the previous header field...
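
One possible way around this is to unfold the header fields before any per-line filtering, so that each field sits on a single line. The awk sketch below does that under a few assumptions: the header block ends at the first empty line, continuation lines start with a space or tab, and the file uses plain LF line endings. The function name unfold_headers is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: join header continuation lines onto their field.
# Reads an article on stdin, leaves the body untouched.
unfold_headers() {
    awk '
        in_body  { print; next }                       # body: pass through
        /^$/     { if (have) print held                # blank line ends the headers
                   have = 0; in_body = 1; print; next }
        /^[ \t]/ { if (have) { sub(/^[ \t]+/, " "); held = held $0 }
                   else print                          # stray continuation, keep as is
                   next }
                 { if (have) print held                # new field: flush the previous one
                   held = $0; have = 1 }
        END      { if (have) print held }              # article without a body
    '
}
```

With the fields unfolded, a line-based filter that drops "Injection-Info" removes the whole field, and a folded Newsgroups field is seen in full.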

> if(stripos($line, "Date: ") === 0 && $is_header == 1) {
>     $finddate=explode(': ', $line);
>     $newarticle[] = "Date: ".date("D, j M Y H:i T",strtotime($finddate[1]))."\n";
>     continue;
> }

You'll be happy with the strtotime() function which magically decodes
all kinds of dates :)

There's an online tester for this function:
https://strtotime.co.uk/

For example, it decodes an old-style date such as:

Sun, 28-Jul-85 00:57:37 EDT

into:

Sun, 28 Jul 1985 05:57:37 +0100

> if(trim($line) == ".") {
>     $newarticle[] = "..\n";
>     continue;
> }

In fact, the script should add "." followed by $line to $newarticle[]
whenever $line starts with a dot. This is what is called dot-stuffing.
The current code only handles a line consisting of a single dot, so it will unfortunately alter article bodies that contain other lines with a leading dot. For instance, if I write ".test" on a line of its own, it will
become "test" unless a second "." is added when injecting the article.
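
As a minimal illustration of dot-stuffing, here is a sed one-liner that doubles the leading dot of every line that has one. The function name dot_stuff is hypothetical, and whether the posting client already does this on its own is something to check before adding it.

```shell
#!/bin/bash
# Hypothetical helper: dot-stuff text read from stdin.
# Every line that begins with "." gets one extra "." in front.
dot_stuff() {
    sed 's/^\./../'
}
```

So ".test" becomes "..test" and a lone "." becomes "..", while lines without a leading dot pass through unchanged.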

You may also want to remove duplicated header fields (at least the ones
that should not appear twice), and maybe add a space after the colon
following a header field name if it is not present.
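
Both clean-ups can be sketched in awk. The list of fields treated as "must not appear twice" below (Date, From, Subject, Message-ID, Newsgroups, Path) is an assumption for illustration, as is the function name tidy_headers; the first occurrence of each is kept and later ones are dropped together with their continuation lines.

```shell
#!/bin/bash
# Hypothetical helper: drop repeated single-instance header fields and
# insert a space after the colon where it is missing. Body is untouched.
tidy_headers() {
    awk '
        in_body  { print; next }
        /^$/     { in_body = 1; print; next }          # end of the header block
        /^[ \t]/ { if (!skipping) print; next }        # continuation follows its field
        {
            skipping = 0
            name = tolower($0); sub(/:.*/, "", name)   # field name, lowercased
            if (name ~ /^(date|from|subject|message-id|newsgroups|path)$/ && seen[name]++) {
                skipping = 1                           # duplicate: drop it
                next
            }
            if ($0 ~ /^[^ \t:]+:[^ \t]/) sub(/:/, ": ")  # "Subject:x" becomes "Subject: x"
            print
        }
    '
}
```

It reads an article on stdin and writes the tidied version to stdout, for example `tidy_headers < out/0 > out/0.new`.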

You'll soon have a working article rewriter system and achieve your
dream :-)
--
Julien ÉLIE

« It's the drop of water that makes the vase overflow and sets fire to the
powder. »

--- Synchronet 3.21a-Linux NewsLink 1.2
