From Newsgroup: comp.databases.mysql
On Tue, 10 Sep 2024 17:33:01 Stefan Ram wrote:
I'm picturing some program that pulls newsgroups from newsservers
and dumps them into a database.
"There is another theory which states that this has already
happened"
In my mind's eye, a post looks something like this, give or take:
Path: A
Message-ID: B
Body: C
But if you snag the same post from a different server, it might
look like this:
Message-ID: B
Path: D
Body: C
At first blush, you'd end up with the same body stored multiple
times in the database. Talk about a waste of space!
It's not only a waste of space: you definitely want to avoid duplication
in a database. These two incarnations are the same posting, so its
contents should be stored only once.
To trim the fat, we could rejigger these posts so all the variable
stuff is up front:
Path: A
Message-ID: B
Body: C
and
Path: D
Message-ID: B
Body: C
Now the tail end of both posts is identical, so we can toss that
in a separate table at position 0.
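
The reordering idea can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: it treats Path as the sole server-dependent header (real posts vary in others too, such as Xref), it ignores folded header lines, and the names split_variable and VARIABLE are made up for the example.

```python
import hashlib

# Assumption: only these headers differ from server to server.
VARIABLE = {"path"}

def split_variable(raw):
    """Split a raw post into (variable header lines, shared remainder)."""
    head, _, body = raw.partition("\n\n")
    variable, stable = [], []
    for line in head.split("\n"):
        name = line.split(":", 1)[0].strip().lower()
        (variable if name in VARIABLE else stable).append(line)
    # The shared part is every stable header plus the body.
    shared = "\n".join(stable) + "\n\n" + body
    return variable, shared

# The same posting as seen on two servers, headers in different order.
post_a = "Path: A\nMessage-ID: B\n\nC"
post_b = "Message-ID: B\nPath: D\n\nC"

var_a, shared_a = split_variable(post_a)
var_b, shared_b = split_variable(post_b)

# Identical shared parts hash to the same value, so one stored copy suffices.
digest_a = hashlib.sha256(shared_a.encode()).hexdigest()
digest_b = hashlib.sha256(shared_b.encode()).hexdigest()
```

The sketch only works if the stable headers arrive in the same order from every server. If they don't, they would have to be sorted or normalized before comparing.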
You'd really want to archive more than one incarnation of a posting,
just because you pulled it from different servers? Why?
Parse the incoming postings, extract the headers, store at least the
Message-ID in a separate column of your table (and, wisely, a few more
headers), and put a unique key on that column.
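
A minimal sketch of that approach, assuming Python with an in-memory SQLite database standing in for MySQL (in MySQL the counterparts would be a UNIQUE KEY on the column and INSERT IGNORE). The table name post, its columns, and the helper store are hypothetical.

```python
import sqlite3
from email import message_from_string

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post ("
    " id INTEGER PRIMARY KEY,"
    " message_id TEXT NOT NULL UNIQUE,"  # the unique key on Message-ID
    " subject TEXT,"
    " body TEXT NOT NULL)"
)

def store(raw):
    """Parse a raw post and insert it unless its Message-ID is already stored."""
    msg = message_from_string(raw)
    cur = conn.execute(
        "INSERT OR IGNORE INTO post (message_id, subject, body) VALUES (?, ?, ?)",
        (msg["Message-ID"], msg["Subject"], msg.get_payload()),
    )
    return cur.rowcount == 1  # True only if a new row was inserted

# The same posting fetched from two servers; only Path differs.
first = store("Path: A\nMessage-ID: <b@example.invalid>\nSubject: test\n\nC")
second = store("Message-ID: <b@example.invalid>\nPath: D\nSubject: test\n\nC")
count = conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]
```

The second call is silently dropped by the unique constraint, so the body is stored once no matter how many servers deliver the posting.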
This way, you could store the same post from multiple newsservers
without eating up your hard drive space like it's In-N-Out fries.
Still, the question remains: Why?
The only reason I can see is to build a database of distribution
paths pointing to your archive. But I can't see any benefit in that.
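
If someone did want such a path database, one possible layout is a second table keyed on Message-ID and server, so each posting is stored once and only the Path values multiply. Again a sketch in Python with SQLite standing in for MySQL; the tables post and post_path, the helper record, and the server names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    message_id TEXT PRIMARY KEY,
    body       TEXT NOT NULL
);
CREATE TABLE post_path (
    message_id TEXT NOT NULL REFERENCES post(message_id),
    server     TEXT NOT NULL,
    path       TEXT NOT NULL,
    PRIMARY KEY (message_id, server)
);
""")

def record(server, message_id, path, body):
    """Store the posting once and remember the Path seen on this server."""
    conn.execute("INSERT OR IGNORE INTO post VALUES (?, ?)", (message_id, body))
    conn.execute(
        "INSERT OR REPLACE INTO post_path VALUES (?, ?, ?)",
        (message_id, server, path),
    )

record("news1.example", "<b@example.invalid>", "A", "C")
record("news2.example", "<b@example.invalid>", "D", "C")

posts = conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]
paths = [row[0] for row in conn.execute(
    "SELECT path FROM post_path ORDER BY server")]
```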
Bye,
Stefan