From Newsgroup: news.software.readers
Colin Macleod <user7@newsgrouper.org.invalid> wrote or quoted:
>Yes but many such forums (Reddit, Hacker News,...) have no way to see which
>posts are new since your last visit, which any decent usenet client will do.
>That drives me round the bend!
I read some news sources on the web. Essentially, a news source
is a list of links to news reports. I wrote a Python script
that scans those lists and creates a digest list for me, which
itself is a link list in HTML shown to me in my browser.
This digest only contains articles not shown in a previous digest,
which is implemented using a log file of all articles shown so far.
If anyone wants to build something like this himself:
Here is some pseudocode I wrote to illustrate my script, which is
much larger. That pseudocode was never executed and will still
contain errors, but is close enough to executable Python that
people with experience in Python should be able to make it run.
This approach will only work with web pages that do not require
JavaScript for access control or filling the page with content.
import os
import re
import urllib.request
import webbrowser

# input, news source(s):
article_list = \
    r"http://example.com/article_list-20260219184720-TMP-DML.html"

# output:
digest_file_name = r"output-file-20260219184720-TMP-DML.html"
log_file_name = r"log-file-20260219184720-TMP-DML.html"

# procedure:
with open( digest_file_name, "w", encoding='utf-8', errors='ignore' ) as digest:
    request = urllib.request.Request( article_list )
    with urllib.request.urlopen( request ) as resource:
        # fall back to UTF-8 if the server does not declare a charset
        cs = resource.headers.get_content_charset() or "utf-8"
        content = resource.read().decode( cs, errors="ignore" )
    # assuming each article link is in an element of type "p"
    # and each p element is a link to an article:
    # (This needs to be adapted to each news source!)
    for p in re.finditer\
    ( r'''<p[^\001]*?</p>''', content, flags=re.DOTALL ):
        text = p.group( 0 )
        # was this already seen? check with log file:
        log_file_content = ""
        if os.path.exists( log_file_name ): # there is no log on the first run
            with open( log_file_name, 'r', encoding='utf-8' ) as log_file:
                log_file_content = log_file.read()
        already_seen = text in log_file_content
        if not already_seen: # <==== ### DUPES ARE SKIPPED HERE ###
            # add to log file:
            with open( log_file_name, 'a', encoding='utf-8' ) as log_file:
                log_file.write( text + '\n' )
            # exclude unwanted topics:
            if "Prince Harry" not in text:
                # add to digest (not to article_list, which is the input URL)
                print( text, file=digest )

# the digest file is closed and flushed at this point; show it in the browser:
webbrowser.open( os.path.abspath( digest_file_name ) )
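
The listing above rereads the whole log file for every item, which is fine for a few dozen links but gets slow as the log grows. A possible variation, as a minimal sketch only: load the log once into a set and test membership there. The names load_seen and filter_new are made up for this illustration and are not part of my script. The sketch also assumes a slightly different log format (one whitespace-normalized item per line, compared by exact match instead of by substring search), so it would not be compatible with a log written by the listing above.

```python
from pathlib import Path


def load_seen(log_path):
    """Return the set of items recorded in the log (empty if there is no log yet)."""
    path = Path(log_path)
    if not path.exists():
        return set()
    return set(path.read_text(encoding="utf-8").splitlines())


def filter_new(items, log_path):
    """Return the items not yet in the log and append them to the log."""
    seen = load_seen(log_path)
    fresh = []       # items to show in the digest, unchanged
    fresh_keys = []  # their normalized forms, to be written to the log
    for item in items:
        # One log entry per line: collapse all whitespace runs
        # (including newlines inside an item) to single spaces.
        key = " ".join(item.split())
        if key not in seen:
            seen.add(key)
            fresh.append(item)
            fresh_keys.append(key)
    if fresh_keys:
        with open(log_path, "a", encoding="utf-8") as log_file:
            for key in fresh_keys:
                log_file.write(key + "\n")
    return fresh
```

In the script, one would collect the matched "p" elements into a list first, call filter_new once on that list, and then write only the returned items (minus the unwanted topics) to the digest.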