• Re: Reducing Redundancy

    From Stefan+Usenet@Stefan+Usenet@Froehlich.Priv.at (Stefan Froehlich) to comp.databases.mysql on Wed Sep 11 13:41:35 2024
    From Newsgroup: comp.databases.mysql

    On Tue, 10 Sep 2024 17:33:01 Stefan Ram wrote:
    I'm picturing some program that pulls newsgroups from newsservers
    and dumps them into a database.

    "There is another theory which states that this has already
    happened"

    In my mind's eye, a post looks something like this, give or take:

    Path: A
    Message-ID: B

    Body: C

    . But if you snag the same post from a different server, it might
    look like this:

    Message-ID: B
    Path: D

    Body: C

    At first blush, you'd end up with the same body stored multiple
    times in the database. Talk about a waste of space!

    It's not only a waste of space; you definitely want to avoid
    duplication in a database. These two incarnations are the same
    posting, so its contents should be stored only once.

    To trim the fat, we could rejigger these posts so all the variable
    stuff is up front:

    Path: A
    Message-ID: B

    Body: C

    and

    Path: D
    Message-ID: B

    Body: C

    Now the tail end of both posts is identical, so we can toss that
    in a separate table at position 0.

    You'd really want to archive more than one incarnation of a posting,
    just because you pulled it from different servers? Why?

    Parse the incoming postings, extract the headers, store at least the
    Message-ID in a separate column of your table (and, wisely, a few
    more headers), and put a unique key on that column.
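
    A minimal sketch of that suggestion, assuming Python's built-in
    sqlite3 module as a stand-in for MySQL. The table name `posting`, its
    columns, and the helper `store_posting` are hypothetical, as are the
    sample articles. In MySQL the equivalent of `INSERT OR IGNORE` would
    be `INSERT IGNORE`.

```python
import sqlite3
from email import message_from_string

# Hypothetical schema: one row per posting, with a unique key on Message-ID.
SCHEMA = """
CREATE TABLE posting (
    id         INTEGER PRIMARY KEY,
    message_id TEXT NOT NULL UNIQUE,
    subject    TEXT,
    body       TEXT NOT NULL
)
"""

def store_posting(conn, raw):
    """Parse a raw article and insert it; return True if it was new."""
    msg = message_from_string(raw)
    cur = conn.execute(
        "INSERT OR IGNORE INTO posting (message_id, subject, body) "
        "VALUES (?, ?, ?)",
        (msg["Message-ID"], msg["Subject"], msg.get_payload()),
    )
    # rowcount is 0 when the unique key caused the insert to be skipped.
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# The same posting as seen on two servers: header order and Path differ.
from_server_1 = "Path: A\nMessage-ID: <B@example.invalid>\n\nC\n"
from_server_2 = "Message-ID: <B@example.invalid>\nPath: D\n\nC\n"

first = store_posting(conn, from_server_1)
second = store_posting(conn, from_server_2)
stored = conn.execute("SELECT COUNT(*) FROM posting").fetchone()[0]
```

    Because the headers are parsed rather than compared as text, the
    order in which a server emits them does not matter; the second copy
    is simply skipped.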

    This way, you could store the same post from multiple newsservers
    without eating up your hard drive space like it's In-N-Out fries.

    Still, the question remains: Why?

    The only reason I can see is to build a database of the distribution
    paths leading to your archive, and I can't see any benefit in that.

    Bye,
    Stefan
    --
    http://kontaktinser.at/ - the free contact exchange for Austria
    Official First Visitor(TM) of mmeike

    The hasty citizen wants Stefan. There must surely be a reason for that? (Sloganizer)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From ram@ram@zedat.fu-berlin.de (Stefan Ram) to comp.databases.mysql on Tue Sep 10 15:33:01 2024
    From Newsgroup: comp.databases.mysql

    I posted this yesterday in news.software.readers, but I'm not
    sure what the best group is for this. I'd love to get some
    feedback on whether this is a solid approach.

    So, yesterday I was chewing the fat about how to whip up a database
    for posts retrieved from newsservers.

    I'm picturing some program that pulls newsgroups from newsservers
    and dumps them into a database.

    In my mind's eye, a post looks something like this, give or take:

    Path: A
    Message-ID: B

    Body: C

    . But if you snag the same post from a different server, it might
    look like this:

    Message-ID: B
    Path: D

    Body: C

    . At first blush, you'd end up with the same body stored multiple
    times in the database. Talk about a waste of space!

    To trim the fat, we could rejigger these posts so all the variable
    stuff is up front:

    Path: A
    Message-ID: B

    Body: C

    and

    Path: D
    Message-ID: B

    Body: C

    Now the tail end of both posts is identical, so we can toss that
    in a separate table at position 0.

    The posts themselves would then just contain the different parts
    and a pointer to the shared bit that's only stored once:

    Path: A
    Rest: 0

    Path: D
    Rest: 0

    0:
    Message-ID: B

    Body: C

    . This way, you could store the same post from multiple newsservers
    without eating up your hard drive space like it's In-N-Out fries.
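
    The layout above can be sketched as two tables, assuming Python's
    built-in sqlite3 module as a stand-in for MySQL. The names `rest`,
    `server_copy`, and `store_copy` are hypothetical, and SQLite numbers
    the shared row from 1 rather than from 0 as in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared part of a post (Message-ID and body), stored once.
CREATE TABLE rest (
    rest_id    INTEGER PRIMARY KEY,
    message_id TEXT NOT NULL UNIQUE,
    body       TEXT NOT NULL
);
-- Per-server part: the Path plus a pointer to the shared row.
CREATE TABLE server_copy (
    path    TEXT NOT NULL,
    rest_id INTEGER NOT NULL REFERENCES rest(rest_id)
);
""")

def store_copy(conn, path, message_id, body):
    """Store one server's copy; the shared part is inserted only once."""
    conn.execute(
        "INSERT OR IGNORE INTO rest (message_id, body) VALUES (?, ?)",
        (message_id, body),
    )
    rest_id = conn.execute(
        "SELECT rest_id FROM rest WHERE message_id = ?", (message_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO server_copy (path, rest_id) VALUES (?, ?)",
        (path, rest_id),
    )
    return rest_id

# Same post (Message-ID B, body C) fetched via Path A and via Path D.
rest_a = store_copy(conn, "A", "B", "C")
rest_d = store_copy(conn, "D", "B", "C")
rest_rows = conn.execute("SELECT COUNT(*) FROM rest").fetchone()[0]
copy_rows = conn.execute("SELECT COUNT(*) FROM server_copy").fetchone()[0]
```

    Both copies end up pointing at the same `rest` row, so the body is
    stored once while each Path is kept.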


  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.databases.mysql on Tue Sep 10 21:04:05 2024
    From Newsgroup: comp.databases.mysql

    On 10 Sep 2024 15:33:01 GMT, Stefan Ram wrote:

    To trim the fat, we could rejigger these posts so all the variable
    stuff is up front:

    Use the Message-ID field as the primary key.
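
    A rough sketch of what that looks like, assuming Python's built-in
    sqlite3 module in place of MySQL; the table name `article` and its
    columns are hypothetical. The primary key makes the database itself
    reject a second copy of the same posting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the Message-ID itself is the primary key.
conn.execute(
    "CREATE TABLE article (message_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
)
conn.execute("INSERT INTO article VALUES (?, ?)", ("B", "C"))

# A second copy of the same posting violates the primary key.
try:
    conn.execute("INSERT INTO article VALUES (?, ?)", ("B", "C"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
```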