• Post DB

    From Stefan Ram@21:1/5 to All on Mon Sep 9 10:44:36 2024
    I'm not 100% sure I'm barking up the right tree (newsgroup) here,
    but whatever.

    So, yesterday I was chewing the fat about how to whip up a database
    for posts retrieved from newsservers.

    I'm picturing some program that pulls newsgroups from newsservers
    and dumps them into a database.

    In my mind's eye, a post looks something like this, give or take:

    Path: A
    Message-ID: B

    Body: C

    . But if you snag the same post from a different server, it might
    look like this:

    Message-ID: B
    Path: D

    Body: C

    . At first blush, you'd end up with the same body stored multiple
    times in the database. Talk about a waste of space!

    To trim the fat, we could rejigger these posts so all the variable
    stuff is up front:

    Path: A
    Message-ID: B

    Body: C

    and

    Path: D
    Message-ID: B

    Body: C

    Now the tail end of both posts is identical, so we can toss that
    in a separate table at position 0.

    The posts themselves would then just contain the different parts
    and a pointer to the shared bit that's only stored once:

    Path: A
    Rest: 0

    Path: D
    Rest: 0

    0:
    Message-ID: B

    Body: C

    . This way, you could store the same post from multiple newsservers
    without eating up your hard drive space like it's In-N-Out fries.
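
    A minimal sketch of that layout, assuming SQLite via Python's sqlite3
    module. The table names (rest, post), the column names and the helper
    functions are made up for illustration, and the splitter assumes the
    Path header is not folded across lines. SQLite numbers rows from 1,
    so the shared part lands at position 1 rather than 0.

```python
import sqlite3

def split_post(raw):
    """Split a raw post into (path, rest): the Path value and everything else."""
    head, _, body = raw.partition("\n\n")
    path = ""
    other = []
    for line in head.split("\n"):
        if line.lower().startswith("path:"):
            path = line.split(":", 1)[1].strip()
        else:
            other.append(line)
    # The shared part: all remaining headers, a blank line, then the body.
    return path, "\n".join(other) + "\n\n" + body

def open_db(name=":memory:"):
    db = sqlite3.connect(name)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS rest (
            id      INTEGER PRIMARY KEY,
            content TEXT NOT NULL UNIQUE      -- stored once per distinct post
        );
        CREATE TABLE IF NOT EXISTS post (
            server  TEXT NOT NULL,
            path    TEXT NOT NULL,            -- the per-server variable part
            rest_id INTEGER NOT NULL REFERENCES rest(id),
            UNIQUE (server, rest_id)
        );
    """)
    return db

def store(db, server, raw):
    """Store one copy of a post and return the id of its shared part."""
    path, rest = split_post(raw)
    db.execute("INSERT OR IGNORE INTO rest (content) VALUES (?)", (rest,))
    rest_id = db.execute(
        "SELECT id FROM rest WHERE content = ?", (rest,)
    ).fetchone()[0]
    db.execute(
        "INSERT OR IGNORE INTO post (server, path, rest_id) VALUES (?, ?, ?)",
        (server, path, rest_id),
    )
    return rest_id

db = open_db()
first = store(db, "server-one", "Path: A\nMessage-ID: B\n\nBody: C")
second = store(db, "server-two", "Message-ID: B\nPath: D\n\nBody: C")
# Both copies point at the same row in rest, which holds the body once.
```

    Comparing the whole shared part as text is the simplest thing that
    could work; keying on Message-ID alone would be smaller, but then
    two servers that altered other headers would silently collapse.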

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From D@21:1/5 to Stefan Ram on Tue Sep 10 22:35:44 2024
    On 9 Sep 2024 10:44:36 GMT, ram@zedat.fu-berlin.de (Stefan Ram) wrote:

    p.s. this feature of 40tude dialog might possibly be useful...

    40tude Dialog > Help [F1] > The "Copy/Move selected articles" Window:
    ...
    How can I create virtual groups?
    You can't. There is nothing like virtual groups in Dialog. However,
    you can join the articles from two groups into one in Dialog. This
    includes joining the articles from the same newsgroup but different
    servers into one group (in the perspective of Dialog) or folder,
    e.g. to fill missing parts of multipart postings. To do this, you
    can manually Copy/Move articles from one group to another or you
    can use an action rule to automate this.

    40tude Dialog > Help [F1] > Index > Scoring syntax:
    ...
    Copy and Move actions
    There is an extended syntax for copy and move action. Say you want to move
    all emails from a mailing list to a separate folder. You can use:
    [email.*]
    !move(%X-Mailing-List%) Header {^(X-Mailing-List:)}
    The section identifier here says that the following rule should be applied
    to emails only.
    The !move action itself looks for a X-Mailing-List: header in the message,
    but instead of copying the message to a folder with a fixed name, it is
    copied to a folder named after the value of the X-Mailing-List header
    field itself. Say, e.g. you receive a message which has a "X-Mailing-List:
    speedboats@yahoogroups.com" header, then this message is moved to the
    "speedboats@yahoogroups.com" folder. By now, you might have noticed that
    this is a very powerful rule for organizing mailing list into folders.
    A variation of this is:
    [email.*]
    !move(@%X-Mailing-List%) Header {^(X-Mailing-List:)}
    Notice the additional "@" in the target folder expression. If this @ is
    present, the value of the header field is simply searched for the
    character @ and only the text before the first found @ is used as the
    target folder. Again, for a message which has a "X-Mailing-List:
    speedboats@yahoogroups.com" header field, the message would be moved to the
    "speedboats" folder, instead of the "speedboats@yahoogroups.com" folder.
    Almost virtual groups
    There are no virtual groups in Dialog, however you can simply copy or move
    articles from one group to another manually or by using an action rule:
    [a.binary.group]
    !move(a.binary.group;NewsserverOne) bytes %>0
    All previous examples used a folder as the target of the copy/move
    operation. This rule uses a newsgroup as the target. Note that the name of
    the target group and the name of the newsserver where this group is from
    is separated by a semicolon.
    Say, you are using two newsservers named "NewsserverOne" and
    "NewsserverTwo". Both newsservers have the group "a.binary.group", but
    while most of the messages are the same in this group on the two servers,
    there are some that are not available on the other server.
    The first line in the preceding sample makes sure that the following rule
    is only applied to the "a.binary.group" newsgroup.
    The rule itself simply says to move all articles (since the condition
    "bytes %>0" is always met) to the group a.binary.group of NewsserverOne.
    If you retrieve message headers for this group from server NewsserverOne,
    the rule is ignored, since all articles go to the correct group anyway. If
    you retrieve message headers from NewsserverTwo for this group though,
    they will not show up in "a.binary.group (NewsserverTwo)", but in
    "a.binary.group (NewsserverOne)".
    The result is that you joined messages from the same group, but different
    newsservers into one group in Dialog. This is especially useful for binary
    groups to fill missing multipart postings.
    [end quoted plain text]

    see also: news:de.comm.software.40tude-dialog for many detailed articles
    (45160 total, from 8/18/2003 to 8/19/2024, on news.blueworldhosting.com)
    about scoring, actions, scripting, etc. this tech group is still active

  • From D@21:1/5 to Stefan Ram on Tue Sep 10 16:35:22 2024
    On 9 Sep 2024 10:44:36 GMT, ram@zedat.fu-berlin.de (Stefan Ram) wrote:

    twelve server samples of your article headers show remarkable consistency:
    1 path, 2 from, 3 newsgroups, 4 subject, 5 date, 6 organization, 7 lines,
    8 expires, 9 message-id, 10 mime-version, 11 content-type,
    12 content-transfer-encoding, 13 x-trace, 14 cancel-lock, 15 x-copyright,
    16 x-no-archive, 17 archive, 18 x-no-archive-readme, 19 x-no-html,
    20 content-language, 21 xref (first sample full headers, then snipped
    for brevity):

    news:news.alphared.net
    Path: alphared!3.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    Mime-Version: 1.0
    Content-Type: text/plain; charset=UTF-8
    Content-Transfer-Encoding: 8bit
    X-Trace: news.uni-berlin.de b02EqmO53gQ7jbmmMP85UgkDHjtKodMUvyU6kuS12ifm6t
    Cancel-Lock: sha1:BiPE/2gBrIau46RUtTtIhXqrOSQ= sha256:ciJQo1bvZST9PNWeu73aWJv3mxLLHhWyjI7ehRRUSH4=
    X-Copyright: (C) Copyright 2024 Stefan Ram. All rights reserved.
    Distribution through any means other than regular usenet
    channels is forbidden. It is forbidden to publish this
    article in the Web, to change URIs of this article into links,
    and to transfer the body without this notice, but quotations
    of parts in other Usenet posts are allowed.
    X-No-Archive: Yes
    Archive: no
    X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
    services to mirror the article in the web. But the article may
    be kept on a Usenet archive server with only NNTP access.
    X-No-Html: yes
    Content-Language: en-US
    Xref: alphared news.software.readers:11775

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.alt119.net
    Path: news.alt119.net!peer.alt119.net!news.samoylyk.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.blueworldhosting.com
    Path: nnrp.usenet.blueworldhosting.com!!spool1.usenet.blueworldhosting.com!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.dizum.net
    Path: sewer!weretis.net!feeder8.news.weretis.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:freenews.netfront.net
    Path: news.netfront.net!border-2.nntp.ord.giganews.com!nntp.giganews.com!weretis.net!feeder8.news.weretis.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.i2pn2.org
    Path: i2pn2.org!rocksolid2!news.neodome.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.neodome.net
    Path: news.neodome.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.mixmin.net
    Path: news.mixmin.net!weretis.net!feeder8.news.weretis.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.novabbs.org
    Path: rocksolid2!news.neodome.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:paganini.bofh.team
    Path: paganini.bofh.team!newsfeed.bofh.team!3.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.samoylyk.net
    Path: news.samoylyk.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,

    news:news.usenet.ovh
    Path: usenet.ovh!weretis.net!feeder8.news.weretis.net!fu-berlin.de!uni-berlin.de!not-for-mail
    From: ram@zedat.fu-berlin.de (Stefan Ram)
    Newsgroups: news.software.readers
    Subject: Post DB
    Date: 9 Sep 2024 10:44:36 GMT
    Organization: Stefan Ram
    Lines: 63
    Expires: 1 Jul 2025 11:59:58 GMT
    Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>
    snip

    I'm not 100% sure I'm barking up the right tree (newsgroup) here,
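
    A rough way to check which headers actually differ between copies,
    sketched in Python. The function names are hypothetical, and the two
    sample copies are cut down to three headers each, with values taken
    from the alphared and neodome samples above.

```python
def parse_headers(raw):
    """Return {lowercased header name: value}, joining folded lines."""
    head = raw.split("\n\n", 1)[0]
    headers = {}
    name = None
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and name is not None:
            # Continuation of the previous header field.
            headers[name] += " " + line.strip()
        elif ":" in line:
            name, value = line.split(":", 1)
            name = name.strip().lower()
            headers[name] = value.strip()
    return headers

def varying_headers(copies):
    """Names of headers whose value differs, or is missing, in some copy."""
    parsed = [parse_headers(raw) for raw in copies]
    names = set().union(*parsed)
    return sorted(n for n in names if len({p.get(n) for p in parsed}) > 1)

MSGID = "Message-ID: <database-20240909114248@ram.dialup.fu-berlin.de>"
copy_one = (
    "Path: alphared!3.eu.feeder.erje.net!feeder.erje.net!fu-berlin.de"
    "!uni-berlin.de!not-for-mail\n"
    "Subject: Post DB\n" + MSGID + "\n\nbody"
)
copy_two = (
    "Path: news.neodome.net!fu-berlin.de!uni-berlin.de!not-for-mail\n"
    "Subject: Post DB\n" + MSGID + "\n\nbody"
)
print(varying_headers([copy_one, copy_two]))  # prints ['path']
```

    With full headers, Xref would be expected to show up as well, since
    it carries the server's own name and article number.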
