I am often faced with this problem.
I have a string like (this was the "From" address of an email I recently received):
=?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
Note that this may not be the ideal example, but it is the one closest to hand. Here's another example:
TF-8?q?They’re_telling_us_something_about_something_ok?
when it should have been just:
They're telling us something about something ok?
My question is: Is there a (Unix/Linux) tool that will reliably fix this? I.e. convert the binary glop format into the desired, pure ASCII, format.
On Sun, 04 May 2025 18:15:38 +0000, Kenny McCormack wrote:
I am often faced with this problem.
I have a string like (this was the "From" address of an email I recently received):
=?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
Note that this may not be the ideal example, but it is the one closest to
hand. Here's another example:
TF-8?q?They’re_telling_us_something_about_something_ok?
when it should have been just:
They're telling us something about something ok?
My question is: Is there a (Unix/Linux) tool that will reliably fix this?
I.e. convert the binary glop format into the desired, pure ASCII, format.
What you are looking at is the "punycode"[1] expression of a non-ASCII character
sequence.
AFAIK, there aren't any /standard/ utilities that convert to and from punycode.
However, there are /libraries/ that handle punycode (libidn[2], for one).
In article <vv8bb2$2f5m5$2@dont-email.me>,
Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
AFAIK, there aren't any /standard/ utilities that convert to and from punycode.
However, there are /libraries/ that handle punycode (libidn[2], for one).
"standard" doesn't really matter much to me. If there is a tool out there, in any form, from any source, I'd like to hear about it.
Generally, when there is a library to do something, there is a program written to access the functionality in that library - i.e., a "thin
wrapper" around the library. Sounds like that program is what I am looking for.
Indeed, there is an "idn" tool as part of the libidn source. The binary
is in a separate package from the library on Debian. Perhaps that tool
will do what you want in the shell.
(BTW, I too have looked for such a tool in the past and come up empty.
The previous poster's mention of "punycode" and "libidn" were super
helpful to me, too. Thanks Lew!)
In comp.unix.shell, Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
On Sun, 04 May 2025 18:15:38 +0000, Kenny McCormack wrote:
I have a string like (this was the "From" address of an email I recently received):
=?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
This is called a "MIME word" or "MIME encoded word" (RFC 2047 calls it
an "encoded-word"). Mail headers are supposed to be 7-bit clean even
when mail bodies can have 8-bit encodings. MIME words exist to put
quoted-printable and base64 in headers.
=?utf-8?B?UGhpbGxpcCBHw7xudGVy?=
  ^^^^^ | ^^^^^^^^^^^^^^^^^^^^
Charset | Content
        |
        Encoding: "B" for base64, "Q" for quoted-printable.
:r! echo UGhpbGxpcCBHw7xudGVy | base64 -d
Phillip Günter
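The field layout above can be pulled apart with plain parameter expansion. This is only a sketch: decode_b_word is a made-up helper name, it handles a single B-encoded word, and it assumes a base64 that accepts -d (as in the command above) and a payload that is already in the terminal's charset.

```shell
# Hypothetical helper: decode one B-encoded (base64) MIME word.
# The charset field is extracted but not used for any conversion.
decode_b_word() (
    word=$1
    rest=${word#=\?}        # drop the leading "=?"
    rest=${rest%\?=}        # drop the trailing "?="
    charset=${rest%%\?*}    # text before the first "?"
    rest=${rest#*\?}
    encoding=${rest%%\?*}   # "B" or "Q"
    payload=${rest#*\?}
    case $encoding in
        [Bb]) printf '%s' "$payload" | base64 -d ;;
        *)    printf 'unsupported encoding: %s\n' "$encoding" >&2
              exit 1 ;;     # the function body is a subshell
    esac
)

decode_b_word '=?utf-8?B?UGhpbGxpcCBHw7xudGVy?='; echo
# prints: Phillip Günter
```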
TF-8?q?They’re_telling_us_something_about_something_ok?
That one is rather mangled. It doesn't follow the =?%s?%c?%s?= format.
It also has a high-bit character in the content.
But it does show the
quirk of MIME word quoted-printable: underscores used for spaces.
thinks Günter probably likes his name with the accent and not "pure ASCII"
(Not surprising for people with names that contain non-ASCII characters
to have their names represented correctly.)
Janis
I have a string like (this was the "From" address of an email I recently received):
=?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
In article <slrn1023i84.2s2es.cmartin+usenetYYMMDD@nyx2.nyx.net>,
Chuck Martin <cmartin+usenetYYMMDD@nyx.net> wrote:
...
Try piping it into the following command:
perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'
Bingo! Thanks very much.
That worked well on the other example I gave in the OP too (which was actually more the focus of my query).
I still get the binary glop in place of the single quotes (The text
contains the word "They're", rendered as "They<binaryglop>re"). Note that
in the input string, <binaryglop> is (the literal string): =E2=80=99
which gets converted by your Perl program into the 3
characters with octal codes (as displayed by "od -bc"): 342 200 231
I can deal with this latter problem myself via brute force with AWK, but it would be nice if I didn't have to - i.e., if there were a complete solution (one that also does the other half of the job).
My guess is that this isn't an apostrophe, but a "right single quotation
mark", which is sadly a common sight in such a context, and Emacs tells
me that this (UCS codepoint 0x2019) is represented as E2 80 99 in UTF-8.
Are there good ways to convert such chars to something more reasonable?
The only thing that occurs to me right now is passing it through iconv
to a more limited charset using transliteration (e.g. "iconv -f utf8 -t
iso8859-1//TRANSLIT -c") and then back to the desired encoding and
charset.
(But I suppose if this is already involving perl, then perhaps such a
modification can be done through perl too.)
Try piping it into the following command:
perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'
In article <vvssf0$13ls6$1@dont-email.me>,
Nuno Silva <nunojsilva@invalid.invalid> wrote:
...
My guess is that this isn't an apostrophe, but a "right single quotation
mark", which is sadly a common sight in such a context, and Emacs tells
me that this (UCS codepoint 0x2019) is represented as E2 80 99 in UTF-8.
Correct, but as far as I am concerned, they are all single quotes, just mangled versions of same. The goal is to convert them all back into
regular single quotes. And, as you will see below, similar comments apply for double quotes.
The AWK code that I am currently using to clean this problem contains these lines:
gsub(/=..=..=9[CD]/,"\"")
gsub(/=..=..=../,"'")
which is good enough for me.
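Note that the second pattern matches any three escaped bytes, so it would also turn something like a Euro sign (=E2=82=AC) into a quote. If that ever matters, a narrower variant might look like the sketch below. fix_quotes is a made-up name, and the patterns assume uppercase hex digits, as in the example input.

```shell
# Hypothetical filter: rewrite only the escaped curly quotes, before decoding.
fix_quotes() {
    awk '{
        gsub(/=E2=80=9[CD]/, "\"")    # U+201C and U+201D
        gsub(/=E2=80=9[89]/, "\047")  # U+2018 and U+2019
        print
    }'
}

printf '%s\n' 'They=E2=80=99re_=E2=80=9Cok=E2=80=9D_=E2=82=AC5' | fix_quotes
```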
Are there good ways to convert such chars to something more reasonable?
The only thing that occurs to me right now is passing it through iconv
to a more limited charset using transliteration (e.g. "iconv -f utf8 -t
iso8859-1//TRANSLIT -c") and then back to the desired encoding and
charset.
As mentioned in the OP, I have never been successful in getting "iconv" to do much of anything. No, this is not a plea for help or for man pages to be read out loud.
(But I suppose if this is already involving perl, then perhaps such a
modification can be done through perl too.)
Probably, but I'm not much into Perl. I do appreciate the solution given here by Chuck, but don't intend on doing any real deconstruction on it.
$ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
Chris Elvidge wrote:
$ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
To eliminate needless overhead:
sg=$'\u0027\u2018\u2019\u0022\u201c\u201d'
On 2025-05-14, Chris Elvidge <chris@internal.net> wrote:
On 14/05/2025 at 07:28, Brian Patrie wrote:
Chris Elvidge wrote:
$ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
To eliminate needless overhead:
sg=$'\u0027\u2018\u2019\u0022\u201c\u201d'
So it is needless overhead in a script that is intended
to be portable to certain shells.
The -n option of echo is documented in POSIX (as an XSI extension),
whereas $'...' is a feature of some shells. (Korn, Bash, Zsh?)
[...]
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
Concerning portability the question in the first place is IMO what to
do with those '\u....' . - I don't think this is standard, or is it?
So if it's non-standard we could use arbitrary common but non-standard
shell features.
Like the shell built-in 'printf' without $'...' to use just '...'.
Or IMO best just the already suggested ANSI strings var=$'...' .
The 'echo' utility as specified by the current version of POSIX, POSIX.1-2024, isn't required to support the '\u...' sequence:
https://pubs.opengroup.org/onlinepubs/9799919799/utilities/echo.html#tag_20_37_04
Nor is the 'printf' utility:
https://pubs.opengroup.org/onlinepubs/9799919799/utilities/printf.html#tag_20_96
and OpenBSD Ksh doesn't support using that sequence in an argument to
its 'echo' builtin.
Additionally, this version of POSIX added dollar-single quotes to the standard:
A sequence of characters starting with a <dollar-sign> immediately
followed by a single-quote ($') shall preserve the literal value of
all characters up to an unescaped terminating single-quote (')
-- https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V3_chap02.html#tag_19_02_04
However, although it's supported by Bash, AT&T Ksh, and Zsh, it doesn't
seem to be supported by Dash (as of version 0.5.12):
$ eg=$'\n'
$ printf '%s' "${eg}" | od -c
0000000 $ \ n
0000003
$
or by OpenBSD Ksh (as of the version included in OpenBSD 7.6; not sure
about its status in 7.7):
$ eg=$'\n'
$ printf '%s' "${eg}" | od -c
0000000 $ \ n
0000003
$
(i should probably add a table about this to a page on the Gentoo wiki
that i've been slowly working on, https://wiki.gentoo.org/wiki/Shell/Scripting )
Alexis.
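For scripts that have to run under shells without $'...' or '\u' support, one fallback (a sketch, not something proposed in the thread) is to spell out the UTF-8 bytes as octal escapes in the printf format string, which POSIX does specify. The byte values below are the UTF-8 encodings of the same six characters as in the sq assignment above.

```shell
# U+0027, U+2018, U+2019, U+0022, U+201C, U+201D as UTF-8 bytes in octal.
sq=$(printf '\047\342\200\230\342\200\231\042\342\200\234\342\200\235')

# Inspect the bytes; 342 200 231 is the sequence reported earlier for U+2019.
printf '%s' "$sq" | od -An -to1
```

This costs a command substitution, which is the overhead $'...' avoids, but it does not depend on any shell extension.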