Question:
Do Winston's headers cause charset issues for anyone else?
Or just me?
I am trying to understand something about how different newsreaders handle malformed headers because my home-grown "newsreader" has "problems" when responding to Winston's posts due to the way he formats his "FROM" header.
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
That line apparently contains non-ASCII characters in the display name:
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
When I reply to posts from Winston (the ones where his display name
contains characters like "w¡ñ§±¤ñ"), my own outgoing article sometimes gets
corrupted on the way out. I think the corruption happens because Winston's display name or headers contain raw 8-bit characters that are neither valid UTF-8 nor MIME-encoded.
Usenet (like email) requires that all headers be pure ASCII unless
they use MIME encoded-words, which means Winston's header should probably be
From: =?UTF-8?Q?w=C2=A1=C3=B1=C2=A7=C2=B1=C2=A4=C3=B1?= <winstonmvp@gmail.com>
This is fully legal for Usenet and will not trigger NNTP server rewrites.
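For anyone who wants to try this, here is a minimal Python sketch of wrapping a display name in an RFC 2047 encoded-word. It uses the "B" (Base64) form rather than the "Q" form shown above, simply because the standard `base64` module produces it directly; both forms are allowed. The helper name `encode_display_name` is hypothetical, and the display name is rebuilt from the code points listed earlier, which is an assumption about what the original header contained.

```python
import base64


def encode_display_name(name: str) -> str:
    """Wrap a display name in an RFC 2047 'B' (Base64) encoded-word."""
    payload = base64.b64encode(name.encode("utf-8")).decode("ascii")
    return f"=?UTF-8?B?{payload}?="


# Assumption: the name is "...w" followed by U+00A1 U+00F1 U+00A7 U+00B1 U+00A4 U+00F1.
display_name = "...w\u00a1\u00f1\u00a7\u00b1\u00a4\u00f1"
from_header = f"From: {encode_display_name(display_name)} <winstonmvp@gmail.com>"

print(from_header)
print("pure ASCII:", from_header.isascii())
```

The resulting header line contains only 7-bit characters, so a strict-ASCII workflow can carry it unchanged.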
I wrote my own newsreader so I manually enforce strict 7-bit ASCII in my outgoing posts by the use of an extensive shortcuts.xml conversion macro.
However, when Winston's illegal header bytes get copied into my attribution line or reply headers, some NNTP servers rewrite the article to "fix" the mismatch, which ends up mangling my otherwise clean pure-7-bit ASCII text.
My question for others, since you're using "normal" newsreaders, is:
Q: Do any of you see charset or encoding issues when replying to Winston's
posts, or do your newsreaders and servers silently fix the illegal
bytes so you never notice?
I am trying to determine whether this is something unique to my strict
ASCII workflow, or whether other clients also have to deal with it.
Thanks for any insight.
Note that I'm implementing the following shortcuts.xml to fix this,
but nobody else will be using that conversion so it's just an N.B.
<!-- Remove Unicode garbage from quoted Usenet 'X wrote:' lines -->
<!-- (e.g., Winston's illegal headers) so my posts stay 7-bit clean -->
<ReplaceRE Find="[^\x00-\x7F]" Replace="" />
<ReplaceRE Find="^.*wrote:" Replace="Winston wrote:" />
<ReplaceRE Find="^(References|In-Reply-To):.*" Replace="" />
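For readers without that macro system, the first rule above (dropping every non-ASCII character) can be sketched in Python as follows. This is only an illustration of the same regular expression, not the actual shortcuts.xml engine; `strip_non_ascii` and the sample attribution line are hypothetical.

```python
import re

# Same character class as the first ReplaceRE rule: anything outside 7-bit ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def strip_non_ascii(line: str) -> str:
    """Remove every character that is not 7-bit ASCII."""
    return NON_ASCII.sub("", line)


# Hypothetical attribution line with the decorated display name.
attribution = "On 12/03/2026 07:08, ...w\u00a1\u00f1\u00a7\u00b1\u00a4\u00f1 wrote:"
print(strip_non_ascii(attribution))
```

Note that this throws the characters away rather than encoding them, so the attribution no longer matches the poster's real display name.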
--
When you write your own newsreader, you have to do everything yourself.
On 2026-03-12 01:32, Maria Sophia wrote:
Question:
Do Winston's headers cause charset issues for anyone else?
Or just me?
No.
Asking TB to produce the raw message, it comes as
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?= <...>
Which is legal, obviously. (Reasoning: if TB does it, then it is legal)
Message-ID: <10obf37$3koaa$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Mozilla Thunderbird
Content-Language: en-US
Looking at the stored file in my computer:
00000070  50 4F 53 54  45 44 21 6E  6F 74 2D 66  6F 72 2D 6D  61 69 6C 0A  46 72 6F 6D  3A 20 3D 3F  POSTED!not-for-mail.From: =?
0000008C  55 54 46 2D  38 3F 42 3F  4C 69 34 75  64 38 4B 68  77 37 48 43  70 38 4B 78  77 71 54 44  UTF-8?B?Li4ud8Khw7HCp8KxwqTD
000000A8  73 51 3D 3D  3F 3D 20 3C  77 69 6E 73  74 6F 6E 6D  76 70 40 67  6D 61 69 6C  2E 63 6F 6D  sQ==?= <..........@gmail.com
000000C4  3E 0A 4E 65  77 73 67 72  6F 75 70 73  3A 20 61 6C  74 2E 63 6F  6D 70 2E 6F  73 2E 77 69  >.Newsgroups: alt.comp.os.wi
Which has been processed by Leafnode, without any problem.
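The claim that the stored header is legal can be checked directly from the dump. The sketch below copies the hex bytes of the From line from the dump above and tests that every byte is 7-bit; the variable names are only illustrative.

```python
# Hex bytes of the From line, copied from the dump above (from "From:" up to ">").
FROM_LINE_HEX = """
46 72 6F 6D 3A 20 3D 3F
55 54 46 2D 38 3F 42 3F 4C 69 34 75 64 38 4B 68 77 37 48 43 70 38 4B 78 77 71 54 44
73 51 3D 3D 3F 3D 20 3C 77 69 6E 73 74 6F 6E 6D 76 70 40 67 6D 61 69 6C 2E 63 6F 6D
3E
"""

raw = bytes.fromhex("".join(FROM_LINE_HEX.split()))

# A header line is 7-bit clean when no byte has its high bit set.
is_seven_bit = all(b < 0x80 for b in raw)

print(raw.decode("ascii"))
print("7-bit clean:", is_seven_bit)
```

If any raw Windows-1252 byte such as 0xA1 were stored in the file, the `decode("ascii")` call would raise an error instead of printing the line.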
To add further value to what Carlos kindly tested using Thunderbird, apparently those on Thunderbird do not see this (which is what I see):
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
Which is composed of:
¡ (U+00A1)
ñ (U+00F1)
§ (U+00A7)
± (U+00B1)
¤ (U+00A4)
another ñ (U+00F1)
But they actually see this instead (according to what Carlos reported):
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
On 12/03/2026 06:16, Maria Sophia wrote:
To add further value to what Carlos kindly tested using Thunderbird,
apparently, those on Thunderbird see not this (which is what I see):
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
I'm using Thunderbird and I see exactly what you see. Maybe it's
something to do with which fonts we have installed or with our Windows settings? (I'm using Windows 11 rather than Windows 10, but I doubt that would make any difference.)
But they actually see this instead (according to what Carlos reported):
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
No, that's what I see when looking at the raw version. What I see in the editor or the message viewer is
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
and on a follow up is "On 2026-03-12 08:08, ...w¡ñ§±¤ñ wrote:"
Notice that we are both using Thunderbird, so what happens is
coordinated. It is sent as MIME, but displayed as normal UTF-8 text.
That's on the header. The body is plain UTF-8, no need for any conversion.
The header needs to be compatible with older software.
Thank you for clarifying what I misunderstood from Carlos' tests: you see what I see, which Winston has subsequently confirmed are Alt codes he entered long ago to set his Usenet "From" header:
...w = ...w (literal)
¡ = Alt 0161 (Windows inserts byte 0xA1)
ñ = Alt 0241 (Windows inserts byte 0xF1)
§ = Alt 0167 (Windows inserts byte 0xA7)
± = Alt 0177 (Windows inserts byte 0xB1)
¤ = Alt 0164 (Windows inserts byte 0xA4)
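The Alt-code table can be reproduced in a few lines of Python, on the assumption that the four-digit Alt codes map to Windows-1252 byte values (which is how the list above describes them). The variable names are illustrative only.

```python
# Alt codes from the list above, in the order they appear in the display name.
alt_codes = [161, 241, 167, 177, 164, 241]

for code in alt_codes:
    char = bytes([code]).decode("cp1252")
    print(f"Alt 0{code} -> byte 0x{code:02X} -> U+{ord(char):04X} {char}")

# The full display name: the literal "...w" prefix plus the six characters.
name = "...w" + bytes(alt_codes).decode("cp1252")
print(name)
```

For these particular byte values Windows-1252 and ISO-8859-1 agree, so the byte value and the Unicode code point are the same number.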
Thanks for confirming what Carlos has also confirmed, which is that
you see in Thunderbird what I see in my newsreader, namely "...w¡ñ§±¤ñ".
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
On 3/11/2026 5:32 PM, Maria Sophia wrote:
Question:
Do Winston's headers cause charset issues for anyone else?
Or just me?
w = standard lower case w keystroke
¡ = Alt 0161
ñ = Alt 0241
§ = Alt 0167 or § = Alt 21
± = Alt 0177
¤ = Alt 0164
ñ = Alt 0241
All from one or more fonts available in Character Map.
- I've come across other folks that use some available character codes that appear blank - just copy the code and paste it into a field to meet
the '*' required character entry.
Asking TB to produce the raw message, it comes as
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?= <...>
Which is legal, obviously. (Reasoning: if TB does it, then it is legal)
On 12/03/2026 07:08, ...w¡ñ§±¤ñ wrote:
w = standard lower case w keystroke
¡ = Alt 0161
ñ = Alt 0241
§ = Alt 0167 or § = Alt 21
± = Alt 0177
¤ = Alt 0164
ñ = Alt 0241
All from one or more fonts available in Character Map.
- I've come across other folks that use some available character
codes that appear blank - just copy the code and paste into a field to
meet the '*' required character entry.
I also see your name as ...w¡ñ§±¤ñ (in Betterbird). It doesn't bother me
unduly but it has puzzled me for a while. May I ask what you are doing
and why not simply use winston as in your email address?
On 3/12/2026 11:24 AM, MikeS wrote:
I also see your name as ...w¡ñ§±¤ñ (in Betterbird). It doesn't bother me
unduly but it has puzzled me for a while. May I ask what you are doing
and why not simply use winston as in your email address?
Have used that form for NNTP and signature since 1998:
HTML NNTP, text NNTP[1], private NNTP groups, private list servers,
private web groups, blogging...
[1] Text NNTP users (e.g. on Eternal Sept.-like servers, with no HTML formatting in composition) are the only source of questions, criticism,
and comments...but that is less than 5% of where 'it's' being used.
<g> Before 1998, the nomenclature was slightly longer
=> W¡ñ§±¤ñ¬Ößóògîë
--
...w¡ñ§±¤ñ
People have complained *to me* that my responses have mojibake in them.
So I'm trying to fix that problem *for them*.
Maria Sophia wrote:
People have complained *to me* that my responses have mojibake in them.
So I'm trying to fix that problem *for them*.
Delving deeper in thought...
Given RFC 5322 says headers must be ASCII unless MIME-encoded, others have pointed out that Big-5 and ISO-8859-1 charset labels sometimes get inserted into my headers.
I don't add that. I can't add them. They're not in my dictionaries.
So "something else" must be adding them. But what?
I never really understood character encoding, and I've said so many times. But I wonder if what's happening is possibly
1. The "From:" display name contains raw CP1252 bytes
2. Which are not valid UTF-8
3. Where, if my outgoing message declares "charset=UTF-8"
4. Maybe some NNTP servers might respond by trying to be helpful
5. One way being by slapping a different charset label on the header
Given... these CP1252 bytes (0xA1, 0xA7, 0xB1, 0xF1) are
a. not valid UTF-8 as standalone single-byte characters
b. legal in ISO-8859-1
c. also legal byte patterns in Big-5
Maybe that's where some of my responses get ISO-8859-1 or Big-5 headers?
Maybe... given UTF-8 is not ASCII, but ASCII is valid UTF-8...
i. Declaring UTF-8 may force some NNTP servers to validate all bytes.
ii. But raw CP1252 bytes generally do not form valid UTF-8
iii. So UTF-8 replies may trigger more server 'helpfulness'
An interesting related aside is that... for
I. 0xA1 is not a valid UTF-8 start byte (it is a continuation byte)
II. 0xF1 is a valid UTF-8 start byte for a 4-byte sequence,
but only if followed by three bytes in 0x80-0xBF; here the next three
bytes (0xA7 0xB1 0xA4) happen to qualify, so a strict decoder reads
0xF1 0xA7 0xB1 0xA4 as one unassigned code point (U+67C64), not four characters
III. 0xA7 is illegal as a UTF-8 start byte
IV. 0xB1 is illegal as a UTF-8 start byte
V. 0xA4 is illegal as a UTF-8 start byte
VI. The final 0xF1 is again a valid UTF-8 start byte,
but it is not followed by three continuation bytes, so it is invalid
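The byte classes discussed above can be checked mechanically. Here is a minimal Python sketch, assuming the display name was stored as the six raw Windows-1252 bytes A1 F1 A7 B1 A4 F1; `classify_utf8_byte` is a hypothetical helper written for this post, not part of any library.

```python
def classify_utf8_byte(b: int) -> str:
    """Describe the role a single byte value can play in UTF-8."""
    if b <= 0x7F:
        return "ASCII (single byte)"
    if b <= 0xBF:
        return "continuation byte (cannot start a character)"
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "never valid in UTF-8"
    if b <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if b <= 0xEF:
        return "lead byte of a 3-byte sequence"
    return "lead byte of a 4-byte sequence"


# Assumption: the decorated part of the name as raw Windows-1252 bytes.
cp1252_name = bytes([0xA1, 0xF1, 0xA7, 0xB1, 0xA4, 0xF1])
for b in cp1252_name:
    print(f"0x{b:02X}: {classify_utf8_byte(b)}")

try:
    cp1252_name.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
print("valid UTF-8 as a whole:", strict_utf8_ok)
print("read as Windows-1252:", cp1252_name.decode("cp1252"))

# Caveat: 0xF1 followed by three continuation-range bytes is well-formed,
# so the middle four bytes decode to a single (unassigned) code point.
middle = cp1252_name[1:5].decode("utf-8")
print(f"F1 A7 B1 A4 as UTF-8: U+{ord(middle):04X}")
```

The whole byte string still fails strict UTF-8 decoding because of the leading 0xA1 and the trailing 0xF1, even though four of the bytes in the middle happen to line up.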
The RFC-correct solution would be:
From: =?UTF-8?Q?W=C2=A1=C3=B1=C2=A7=C2=B1=C2=A4=C3=B1=C2=AC=C3=96=C3=9F=C3=B3=C3=B2g=C3=AE=C3=AB?= <...>
But that's ugly.
Using W¡ñ§±¤ñ¬Ößóògîë would be even more so, given
VII. 0xAC is illegal as a UTF-8 start byte
VIII. 0xD6 is a valid start byte only if followed by a continuation byte
And so on, where the "W" and the "g" are the only
bytes in that entire (pre-1998) decorated name that are both ASCII and valid UTF-8. Everything else is raw CP1252.
The UTF-8 version of the whole name would be:
57 C2 A1 C3 B1 C2 A7 C2 B1 C2 A4 C3 B1 C2 AC C3 96 C3 9F C3 B3 C3 B2 67 C3 AE C3 AB
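That byte listing can be reproduced with a short Python sketch. The name below is reconstructed from the hex bytes above, so treat the exact characters as an assumption; the variable names are illustrative.

```python
# Assumption: the pre-1998 name, rebuilt from the UTF-8 hex listing above.
name = (
    "W\u00a1\u00f1\u00a7\u00b1\u00a4\u00f1"
    "\u00ac\u00d6\u00df\u00f3\u00f2g\u00ee\u00eb"
)

utf8_hex = name.encode("utf-8").hex(" ").upper()
cp1252_hex = name.encode("cp1252").hex(" ").upper()

print("UTF-8  :", utf8_hex)
print("CP1252 :", cp1252_hex)
```

Each non-ASCII character takes two bytes in UTF-8 but one byte in Windows-1252, which is why the two listings have different lengths for the same visible name.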
But all this is only meaningful if it causes downstream issues,
where I think simply switching my headers to ASCII solved the
mojibake that Andy, Carlos and others asked me to try to fix.
On 3/12/2026 9:18 AM, Maria Sophia wrote:
Thank you for clarifying what I misunderstood from Carlos' tests, which is
that you see what I see which Winston has subsequently confirmed are alt
codes he manually typed in to set his FROM Usenet header long ago using
...w = ...w (literal)
¡ = Alt 0161 (Windows inserts byte A1 hexadecimal value)
ñ = Alt 0241 (Windows inserts byte F1 hexadecimal value)
§ = Alt 0167 (Windows inserts byte A7 hexadecimal value)
± = Alt 0177 (Windows inserts byte B1 hexadecimal value)
¤ = Alt 0164 (Windows inserts byte A4 hexadecimal value)
No typing required.
Character Map: choose a font that has the desired character (for the above,
Arial works), double-click the character (this places it in the
'Characters to copy' field), repeat for the balance of the string and, once the string is complete, click Copy. Paste wherever desired (Notepad is a good
temporary storage point if using it in multiple other apps/programs).
Thanks for confirming what I see Carlos has also confirmed, which is that
you see in Thunderbird what I see in my newsreader which is "...w¡ñ§±¤ñ".
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
As noted earlier, this is what I see in Thunderbird's From
column (Message list)
<https://i.postimg.cc/BvbXZ8mv/Tbird-From-Column-01.jpg>
The same naming is also seen in the Message pane's From field
- b/c it's using the Address Book contact form.
If wondering about the ... prefix, it's a precedent for sorting on the
From field (my posts appear at the top of an unthreaded sorted list).
--
...w¡ñ§±¤ñ
I wrote a newsreader a few years ago, in Python. Python had a
module to decode headers encoded as in RFC2047; this one I
think:
<https://docs.python.org/3/library/email.header.html>
<https://www.ietf.org/rfc/rfc2047>
I didn't bother to detect /whether/ the headers had encoded words,
I decoded everything in case it did. (I've seen several encoded
words in different encodings in a single header field.)
In your case it sounds like you need an encoder as well as a
decoder. If there aren't such modules in whatever your system is
written in, you could write one. Perhaps a sub-process written in
Python: pass it the raw header and it returns it in unicode. And
vice versa to encode it.
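A minimal sketch of the "decode everything" approach with that module might look like the following. `decode_field` is a hypothetical helper, and the second and third sample headers are made up for illustration.

```python
from email.header import decode_header, make_header


def decode_field(raw_value: str) -> str:
    """Decode any RFC 2047 encoded-words in a header value to plain text.

    Values without encoded-words pass through unchanged, so there is no
    need to detect beforehand whether decoding is required.
    """
    return str(make_header(decode_header(raw_value)))


# Header value as reported from the raw message earlier in the thread.
winston = decode_field("=?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?= <winstonmvp@gmail.com>")

# Made-up example: two encoded-words with different charsets in one field.
mixed = decode_field("=?ISO-8859-1?Q?Jos=E9?= =?UTF-8?B?w7E=?=")

# Made-up example: a plain ASCII value.
plain = decode_field("Plain ASCII <user@example.com>")

print(winston)
print(mixed)
print(plain)
```

Going the other way, `email.header.Header(text, "utf-8").encode()` produces an encoded-word from Unicode text, which covers the encoder half.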
John Hall wrote:
On 12/03/2026 06:16, Maria Sophia wrote:
To add further value to what Carlos kindly tested using Thunderbird,
apparently, those on Thunderbird see not this (which is what I see):
From: ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
I'm using Thunderbird and I see exactly what you see. Maybe it's
something to do with which fonts we have installed or with our Windows
settings? (I'm using Windows 11 rather than Windows 10, but I doubt that
would make any difference.)
Thank you for clarifying what I misunderstood from Carlos' tests: you see what I see, which Winston has subsequently confirmed are Alt codes he entered long ago to set his Usenet "From" header:
...w = ...w (literal)
¡ = Alt 0161 (Windows inserts byte 0xA1)
ñ = Alt 0241 (Windows inserts byte 0xF1)
§ = Alt 0167 (Windows inserts byte 0xA7)
± = Alt 0177 (Windows inserts byte 0xB1)
¤ = Alt 0164 (Windows inserts byte 0xA4)
Those are all valid Windows Alt-codes, but the important detail is that
they produce raw 8-bit bytes from the Windows-1252 (Latin-1) character set.
I could be wrong as I never really understood this characters stuff, but
a. They are not UTF-8
b. They are not ASCII
c. They are not MIME-encoded
d. They are raw 8-bit bytes
The valid format is:
=?charset?encoding?encoded-text?=
Hence, if we break Winston's header down:
=?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
| |     | |
| |     | +-- Base64 text
| |     +---- Encoding type ("B" = Base64)
| +---------- Character set (UTF-8)
+------------ Begin encoded-word
The Base64 portion is:
Li4ud8Khw7HCp8KxwqTDsQ==
Decoding that Base64 string yields the UTF-8 text:
...w¡ñ§±¤ñ
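The same breakdown can be done by hand in Python, which makes each field of the encoded-word visible. This is only a sketch for the "B" case; the regular expression is a simplified, hypothetical pattern and not a full RFC 2047 parser.

```python
import base64
import re

# Simplified pattern: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

match = ENCODED_WORD.fullmatch("=?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=")
charset, encoding, payload = match.groups()

raw = base64.b64decode(payload)   # the bytes actually carried in the header
text = raw.decode(charset)        # those bytes interpreted in the declared charset

print("charset :", charset)
print("encoding:", encoding)
print("bytes   :", raw.hex(" "))
print("text    :", text)
```

The decoded bytes are two-byte UTF-8 sequences (C2 A1, C3 B1, and so on), not the single Windows-1252 bytes A1, F1.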
Carlos E.R. wrote:
But they actually see this instead (according to what Carlos reported):
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
No, that's what I see when looking at the raw version. What I see in the
editor or the message viewer is
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
and on a follow up is "On 2026-03-12 08:08, ...w¡ñ§±¤ñ wrote:"
Notice that we are both using thunderbird, so what happens is
coordinated. It is sent as mime, but displayed as normal utf text.
That's on the header. The body is plain UTF, no need for any conversion.
The header needs to be compatible with older software.
Hi Carlos,
Thanks for correcting my misconception. I never really understood all
this mojibake character-set interaction, but now that Winston has explained he
is using Windows Alt-codes, and after your clarification, I am beginning to scratch the surface of what is actually happening.
It may be that Thunderbird *stores* or *shows* the header in MIME-encoded form when you view the raw source, but apparently Thunderbird does not MIME-encode Winston's display name when sending the message.
I'm not using Thunderbird (and I changed the header to reflect that since
TB users are on this thread) but it appears that in normal viewing mode, Thunderbird simply displays the raw 8-bit Windows-1252 bytes exactly as
they appear:
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
Which matches what I see on my end.
Apparently Thunderbird is perfectly happy to accept those raw 8-bit bytes
in the header, even though they are not valid UTF-8 and not legal ASCII.
My own workflow is strict ASCII, so when those bytes get copied into my attribution line, I think some NNTP servers try to repair
the mismatch and end up mangling my outgoing post, which is really the only reason I care (I have no wish to be a Usenet-rules enforcer).
So, to clarify, I think you & Winston are saying the behavior is:
1. Winston types Windows-1252 Alt-codes.
2. Thunderbird displays them as-is in the UI.
3. Thunderbird shows a MIME-encoded version only when viewing
the raw message source.
4. My ASCII-only workflow exposes the illegal bytes,
which sometimes apparently triggers server rewrites (AFAICT)
Thanks again for checking this from the Thunderbird side, as knowing how
you see Winston's messages helps me figure out how to handle the mojibake.
THIS IS A TEST. IT'S AN EXACT COPY OF THE PREVIOUS POST.
THE ONLY DIFFERENCE IS THIS HAS UTF-8 DECLARED IN THE HEADER. NOT ASCII.
DO YOU SEE THE SAME OUTPUT or DO YOU SEE IT DIFFERENTLY?
The RFC-correct solution would be:
From: =?UTF-8?Q?W=C2=A1=C3=B1=C2=A7=C2=B1=C2=A4=C3=B1=C2=AC=C3=96=C3=9F=C3=B3=C3=B2g=C3=AE=C3=AB?= <...>
But that's ugly.
Using WN+++o#nN++N++N++N++N++gN++N++ would be even more so, given
I could be wrong as I never really understood this characters stuff, but
a. They are not UTF-8
b. They are not ASCII
c. They are not MIME-encoded
d. They are raw 8-bit bytes
Huh, no. They were typed as 8-bit bytes from Latin-1 charset at some
point in time, but today they are UTF-8. UTF in the body, and as MIME in
the header.
You said it yourself in another post:
This text arrives corrupted. In the other post they are legible. It is declared as UTF-8, but I guess it is not actually all valid UTF-8.
Carlos E.R. wrote:
I could be wrong as I never really understood this characters stuff, but
a. They are not UTF-8
b. They are not ASCII
c. They are not MIME-encoded
d. They are raw 8-bit bytes
Huh, no. They were typed as 8-bit bytes from Latin-1 charset at some
point in time, but today they are UTF-8. UTF in the body, and as MIME in
the header.
You said it yourself in another post:
Hi Carlos,
I agree. I apologize for the flip flop indecision. I don't know what's
going on, as I'm only trying to fix the trouble W¡ñ§±¤ñ¬Ößóògîë creates.
I will endlessly admit I never understood this charset stuff, and I will point out that the only reason I even care is you and others asked me to
fix the problems that sometimes my posts look like a Chinese jigsaw puzzle.
Since I don't mess with the characters, something else is messing with the characters, where a test in this very thread shows that when I use headers
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Then W¡ñ§±¤ñ¬Ößóògîë remains W¡ñ§±¤ñ¬Ößóògîë
But when I use headers
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Then W¡ñ§±¤ñ¬Ößóògîë turns the entire post into a ransom note.
Usenet (NNTP) follows email header rules (RFC 5322 + RFC 2047):
a. The body may be UTF-8, if declared.
b. Headers cannot contain raw 8-bit bytes.
c. Hence, non-ASCII characters must be encoded using MIME encoded-words
From: =?UTF-8?Q?W=C2=A1=C3=B1=C2=A7=C2=B1=C2=A4=C3=B1?= <winston@example.com>
Given Winston's "From:" header has those characters, which are not ASCII,
all I can say is that they're not valid characters for *headers* unless they're MIME-encoded. Are they MIME-encoded? I don't know. I don't see it.
As you said, I belatedly realized Winston's characters are valid Unicode
and valid UTF-8, but they appear in a header, apparently without the required
MIME encoding, and Usenet servers are allowed to mangle or reject 8-bit header bytes. When I respond, the attribution line contains W¡ñ§±¤ñ¬Ößóògîë
What I'm trying to figure out is why my body gets mangled. My guess is that the attribution line contains raw Latin-1 bytes while my outgoing headers
declare UTF-8, so a server in the path re-encodes the body and corrupts it. But I'm not really sure what is causing the mojibake.
Then W¡ñ§±¤ñ¬Ößóògîë turns the entire post into a ransom note.
Possibly because the text is not actually UTF-8
Given Winston's "FROM:" header has those characters, which are not ASCII,
all I can say is that they're not valid characters for *headers*, unless
they're MIME encoded. Are they Mime-encoded? I don't know. I don't see it.
Yes, they are MIME encoded. I posted the other day the section in HEX,
taken directly from the on disk file that Leafnode has written on my
system, so no translation from Thunderbird.
The RFC-correct solution would be:
[Invalid header field line]
But that's ugly.
Carlos E.R. wrote:
Then W¡ñ§±¤ñ¬Ößóògîë turns the entire post into a ransom note.
Possibly because the text is not actually UTF-8
Yeah. In a later post you see I belatedly figured that out for myself.
Sorry for the flip flop indecision on whether I think it's UTF-8 or not.
Did I ever mention I never really understood this Usenet charset stuff?
I'm one of the few people whose ego isn't so huge that they can't admit
when they don't know something, where I openly and humbly easily admit that
I seriously lack charset understanding when it comes to Usenet headers.
Luckily, the two things I'm doing seem to work "most" of the time:
a. If I copy/paste from a variety of web sources (particularly Chromium),
I run my body through a text-normalizer to eliminate Unicode chars.
<shortcuts.xml>
b. I manually place a US-ASCII header which seems to tell the receiving
newsreaders not to bother trying to deal with W¡ñ§±¤ñ¬Ößóògîë's
Windows-1252 / ISO-8859-1 (Latin-1) character set.
W = 0x57 (ASCII)
¡ = 0xA1
ñ = 0xF1
§ = 0xA7
± = 0xB1
¤ = 0xA4
¬ = 0xAC
Ö = 0xD6
ß = 0xDF
ó = 0xF3
ò = 0xF2
g = 0x67 (ASCII)
î = 0xEE
ë = 0xEB
Every one of those bytes is a single-byte Latin-1 / Windows-1252 character. Apart from the two ASCII letters, none of them is valid UTF-8 on its own.
Given Winston's "FROM:" header has those characters, which are not ASCII,
all I can say is that they're not valid characters for *headers*, unless
they're MIME encoded. Are they Mime-encoded? I don't know. I don't see it.
Yes, they are MIME encoded. I posted the other day the section in HEX,
taken directly from the on disk file that Leafnode has written on my
system, so no translation from Thunderbird.
I may be wrong since I never understood this stuff, so I appreciate your clarifications, and I openly let you know I really don't understand this.
I think you are describing Thunderbird's behavior, not necessarily
Winston's behavior, while mostly I'm describing Winston's original bytes,
not Thunderbird's. (Although it appears that Winston uses TB after all.)
I think we can all presume Winston originally long ago typed raw
Windows-1252 bytes using Alt-codes for his display name, but I think it may be that TB does not actually send those bytes directly in the header.
Those are raw 8-bit Latin-1 bytes when he types them.
However, I think TB does not send those bytes directly.
When Winston posts using TB, I think TB perhaps converts the Latin-1 bytes to UTF-8, and then MIME-encodes the header using RFC 2047. That may
be why the raw source on your system shows something like:
From: =?UTF-8?B?Li4ud8Khw7HCp8KxwqTDsQ==?=
On your side, TB perhaps then decodes that MIME-encoded header for display, so in the normal UI you see:
...w¡ñ§±¤ñ <winstonmvp@gmail.com>
I'm rather confused, as I don't control anything but my side of the
equation, and all I'm doing is dealing with Winston's display name,
but maybe what's possibly happening overall, is this (maybe?):
1. Winston typed Windows-1252 Alt-codes for his display name long ago.
...w¡ñ§±¤ñ
2. His Thunderbird converts those Latin-1 bytes to UTF-8 internally.
3. His Thunderbird MIME-encodes the UTF-8 header before sending it.
4. Your Thunderbird decodes the MIME header & displays normal UTF-8 text.
5. My own newsreader client copies the original Latin-1 bytes from the
attribution line because it does not decode the MIME header.
6. That mismatch triggers mojibake in my outgoing posts when my headers
declare "charset=UTF-8" instead of "charset=US-ASCII".
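The kind of mismatch described in steps 5 and 6 is easy to reproduce. The sketch below is only a guess at the mechanism, not a trace of what any server or newsreader in this thread actually does: it shows what happens when bytes written in one charset are read back under the other. The name is rebuilt from the code points discussed in the thread.

```python
# Assumption: the display name as Unicode text.
name = "...w\u00a1\u00f1\u00a7\u00b1\u00a4\u00f1"

utf8_bytes = name.encode("utf-8")
cp1252_bytes = name.encode("cp1252")

# Case 1: UTF-8 bytes read by software that assumes Windows-1252.
# Every two-byte sequence turns into two unrelated characters.
utf8_read_as_cp1252 = utf8_bytes.decode("cp1252")

# Case 2: raw Windows-1252 bytes read by software that assumes UTF-8.
# Invalid sequences are swapped for U+FFFD replacement characters.
cp1252_read_as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

print("original           :", name)
print("UTF-8 as CP1252    :", utf8_read_as_cp1252)
print("CP1252 as UTF-8    :", cp1252_read_as_utf8)
```

Either direction yields text that no longer matches the original, which fits the "ransom note" symptom when the declared charset and the actual bytes disagree.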
I never understood this stuff, but perhaps that explains why you see
a valid MIME-encoded UTF-8 header in the raw view, while I see the original Latin-1 bytes in my ASCII world. Thunderbird is doing the right thing on Winston's end, but perhaps my own ASCII-only setup exposes the mismatch.
Thanks again for helping me sort out what Thunderbird is doing on your
side, as I used TB years ago for a client and hated how it thought Usenet
was email. Maybe it's better now as that had to be a decade or so ago.
I think you are describing Thunderbird's behavior, not necessarily
Winston's behavior, while mostly I'm describing Winston's original bytes,
not Thunderbird's. (Although it appears that Winston uses TB after all.)
I think we can all presume Winston originally long ago typed raw
Windows-1252 bytes using Alt-codes for his display name, but I think it may be that TB does not actually send those bytes directly in the header.
Forget latin-1. The servers are sending mime encoded utf-8 in the
headers. Life is simple that way.
On 3/13/2026 9:45 PM, Maria Sophia wrote:
I think you are describing Thunderbird's behavior, not necessarily
Winston's behavior, while mostly I'm describing Winston's original bytes,
not Thunderbird's. (Although it appears that Winston uses TB after all.)
Yes to TB. Also SeaMonkey and WLM2012
I think we can all presume Winston originally long ago typed raw
Windows-1252 bytes using Alt-codes for his display name, but I think
it may
be that TB does not actually send those bytes directly in the header.
As noted earlier, no typing was ever done. The string was created in Character Map without typing - select, repeat for the next character, copy the
string, paste into the desired field.
The RFC-correct solution would be:
[Invalid header field line]
Please note that RFC 2047 defines a line length limit:
<https://datatracker.ietf.org/doc/html/rfc2047#section-2>
|
| While there is no limit to the length of a multiple-line header
| field, each line of a header field that contains one or more
| 'encoded-word's is limited to 76 characters.
for this reason:
|
| The length restrictions are included both to ease interoperability
| through internetwork mail gateways, and to impose a limit on the
| amount of lookahead a header parser must employ (while looking for a
| final ?= delimiter) before it can decide whether a token is an
| "encoded-word" or something else.
But that's ugly.
Because nobody should get the raw header displayed (except on request)
I think this should be no problem.
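To see how the length limit applies to the longer name, here is a small Python sketch comparing the "Q" and "B" forms against the 75-character limit that RFC 2047 sets for a single encoded-word. Both helper functions are hypothetical, the Q encoder is deliberately conservative (it escapes everything except ASCII letters and digits), and the name is reconstructed from the bytes quoted earlier in the thread.

```python
import base64

MAX_ENCODED_WORD = 75  # RFC 2047 limit for one encoded-word


def q_encoded_word(text: str) -> str:
    """Conservative 'Q' encoding: ASCII letters and digits stay literal."""
    parts = []
    for byte in text.encode("utf-8"):
        char = chr(byte)
        if char.isascii() and char.isalnum():
            parts.append(char)
        else:
            parts.append(f"={byte:02X}")
    return "=?UTF-8?Q?" + "".join(parts) + "?="


def b_encoded_word(text: str) -> str:
    """'B' (Base64) encoding of the UTF-8 bytes."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?UTF-8?B?{payload}?="


# Assumption: the pre-1998 name, rebuilt from the byte listings in the thread.
name = (
    "W\u00a1\u00f1\u00a7\u00b1\u00a4\u00f1"
    "\u00ac\u00d6\u00df\u00f3\u00f2g\u00ee\u00eb"
)

q_word = q_encoded_word(name)
b_word = b_encoded_word(name)

for label, word in (("Q", q_word), ("B", b_word)):
    verdict = "fits" if len(word) <= MAX_ENCODED_WORD else "too long, must be split"
    print(f"{label}: {len(word)} chars, {verdict}")
```

For names made almost entirely of non-ASCII characters, the Base64 form is noticeably shorter, so it may avoid splitting the name across several encoded-words.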