• Re: How to convert to pure ASCII

    From Lew Pitcher@21:1/5 to Kenny McCormack on Sun May 4 18:23:31 2025
    On Sun, 04 May 2025 18:15:38 +0000, Kenny McCormack wrote:

    > I am often faced with this problem.
    >
    > I have a string like (this was the "From" address of an email I recently received):
    >
    > =?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
    >
    > Note that this may not be the ideal example, but it is the one closest to hand. Here's another example:
    >
    > TF-8?q?They’re_telling_us_something_about_something_ok?
    >
    > when it should have been just:
    >
    > They're telling us something about something ok?
    >
    > My question is: Is there a (Unix/Linux) tool that will reliably fix this? I.e. convert the binary glop format into the desired, pure ASCII, format.

    What you are looking at is the "punycode"[1] expression of a non-ASCII character
    sequence.

    AFAIK, there aren't any /standard/ utilities that convert to and from punycode. However, there are /libraries/ that handle punycode (libidn[2], for one).

    Perhaps a web search for IDN tools will come up with a punycode translator program.

    <snip>

    [1] https://www.rfc-editor.org/rfc/rfc3492.txt
    [2] http://www.gnu.org/software/libidn/

    --
    Lew Pitcher
    "In Skills We Trust"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to lew.pitcher@digitalfreehold.ca on Sun May 4 18:33:55 2025
    In article <vv8bb2$2f5m5$2@dont-email.me>,
    Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
    > On Sun, 04 May 2025 18:15:38 +0000, Kenny McCormack wrote:
    >
    >> I am often faced with this problem.
    >>
    >> I have a string like (this was the "From" address of an email I recently received):
    >>
    >> =?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
    >>
    >> Note that this may not be the ideal example, but it is the one closest to
    >> hand. Here's another example:
    >>
    >> TF-8?q?They’re_telling_us_something_about_something_ok?
    >>
    >> when it should have been just:
    >>
    >> They're telling us something about something ok?
    >>
    >> My question is: Is there a (Unix/Linux) tool that will reliably fix this?
    >> I.e. convert the binary glop format into the desired, pure ASCII, format.
    >
    > What you are looking at is the "punycode"[1] expression of a non-ASCII character
    > sequence.

    Yup. I've never heard the term "punycode" before, but it sounds appropriate.

    > AFAIK, there aren't any /standard/ utilities that convert to and from punycode.
    > However, there are /libraries/ that handle punycode (libidn[2], for one).

    "standard" doesn't really matter much to me. If there is a tool out there,
    in any form, from any source, I'd like to hear about it.

    Generally, when there is a library to do something, there is a program
    written to access the functionality in that library - i.e., a "thin
    wrapper" around the library. Sounds like that program is what I am looking for.

    --
    Kenny, I'll ask you to stop using quotes of mine as taglines.

    - Rick C Hodgin -

  • From Kenny McCormack@21:1/5 to All on Sun May 4 18:15:38 2025
    I am often faced with this problem.

    I have a string like (this was the "From" address of an email I recently received):

    =?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>

    Note that this may not be the ideal example, but it is the one closest to
    hand. Here's another example:

    TF-8?q?They’re_telling_us_something_about_something_ok?

    when it should have been just:

    They're telling us something about something ok?

    My question is: Is there a (Unix/Linux) tool that will reliably fix this?
    I.e. convert the binary glop format into the desired, pure ASCII, format.

    Note: I have tried "iconv" and have had it work in some situations, but it mostly doesn't do anything (i.e., is equivalent to "cat"). In particular,
    one problem with "iconv" is that one of the parameters is the "from encoding", and generally, this is unknown. You are just presented with the glop and
    have to figure it out on your own. But even when I put in "-f UTF-8" in
    the command lines (of iconv), with the above text as input, it still does nothing useful.

    --
    He must be a Muslim. He's got three wives and he doesn't drink.

  • From John-Paul Stewart@21:1/5 to Kenny McCormack on Sun May 4 15:56:13 2025
    On 2025-05-04 2:33 p.m., Kenny McCormack wrote:
    > In article <vv8bb2$2f5m5$2@dont-email.me>,
    > Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
    >> AFAIK, there aren't any /standard/ utilities that convert to and from punycode.
    >> However, there are /libraries/ that handle punycode (libidn[2], for one).
    >
    > "standard" doesn't really matter much to me. If there is a tool out there, in any form, from any source, I'd like to hear about it.
    >
    > Generally, when there is a library to do something, there is a program written to access the functionality in that library - i.e., a "thin
    > wrapper" around the library. Sounds like that program is what I am looking for.

    Indeed, there is an "idn" tool as part of the libidn source. The binary
    is in a separate package from the library on Debian. Perhaps that tool
    will do what you want in the shell.

    I don't know how it might be packaged for non-Debian distributions or
    non-Linux OSes. Maybe with the library, maybe not.

    (BTW, I too have looked for such a tool in the past and come up empty.
    The previous poster's mention of "punycode" and "libidn" were super
    helpful to me, too. Thanks Lew!)

  • From Kenny McCormack@21:1/5 to jpstewart@personalprojects.net on Sun May 4 21:54:26 2025
    In article <m7pv2tFgsp1U1@mid.individual.net>,
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:
    > ...
    > Indeed, there is an "idn" tool as part of the libidn source. The binary
    > is in a separate package from the library on Debian. Perhaps that tool
    > will do what you want in the shell.

    Unfortunately, it has brought no joy. I installed both idn and idn2 on a
    spare machine (*) and did a little testing. Neither did anything good, with the strings posted in the OP.

    Note that the "d" in "idn" stands for "Domain" as in "Domain Name Service" (i.e., DNS). And the man page(s) talk mostly about stuff seeming to relate
    to DNS, not to general conversion of arbitrary strings. So, I'm not too confident that this is the answer we seek.

    > (BTW, I too have looked for such a tool in the past and come up empty.
    > The previous poster's mention of "punycode" and "libidn" were super
    > helpful to me, too. Thanks Lew!)

    I'd say the search continues.

    ==============================================================

    (*) Interestingly, the needed libs (libidn.so.x.y.z and libidn2.so.x.y.z)
    were already present on the machine, so it only needed to install the
    binary executables (/usr/lib/idn and /usr/lib/idn2).

    --
    If you ask a Trumper who is to blame for the debacle of Jan 6, they will almost certainly say
    something about Antifa/BLM/something/whatever. This shows just how screwed up they are; they can't
    even get their narrative straight. What they *should* say is "Eugene Goodman". If not for him, the plot
    would probably have succeeded, so he (Eugene) is clearly to blame for the failure.

  • From Eli the Bearded@21:1/5 to lew.pitcher@digitalfreehold.ca on Sun May 4 23:28:55 2025
    In comp.unix.shell, Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
    > On Sun, 04 May 2025 18:15:38 +0000, Kenny McCormack wrote:
    >> I have a string like (this was the "From" address of an email I recently received):
    >> =?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>

    This is called a "MIME word" or "MIME encoded word". Mail headers are
    supposed to be 7-bit clean even when mail bodies can have 8-bit
    encodings. MIME words exist to put quoted-printable and base64 in
    headers.

    =?utf-8?B?UGhpbGxpcCBHw7xudGVy?=
      ^^^^^ | ^^^^^^^^^^^^^^^^^^^^
    Charset | Content
            |
         Encoding: "B" for base64 "Q" for quoted-printable.

    :r! echo UGhpbGxpcCBHw7xudGVy | base64 -d
    Phillip Günter
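
    The field layout above can also be taken apart by hand. The following is a minimal Python sketch, not something posted in the thread: the function name decode_encoded_word is made up for illustration, and it assumes a single well-formed =?charset?encoding?content?= token. It only uses the standard base64 and quopri modules.

```python
import base64
import quopri


def decode_encoded_word(word: str) -> str:
    """Decode one =?charset?encoding?content?= token by hand (hypothetical helper)."""
    if not (word.startswith("=?") and word.endswith("?=")):
        raise ValueError("not a MIME encoded word")
    # Strip the "=?" and "?=" wrappers, then split into the three fields.
    charset, encoding, content = word[2:-2].split("?", 2)
    if encoding.upper() == "B":
        raw = base64.b64decode(content)
    elif encoding.upper() == "Q":
        # header=True makes quopri treat "_" as a space, as the Q encoding does.
        raw = quopri.decodestring(content.encode("ascii"), header=True)
    else:
        raise ValueError(f"unknown encoding {encoding!r}")
    return raw.decode(charset)


print(decode_encoded_word("=?utf-8?B?UGhpbGxpcCBHw7xudGVy?="))
```

    A real mail parser also has to cope with several encoded words in one header and with surrounding plain text, which this sketch deliberately ignores.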

    >> TF-8?q?They’re_telling_us_something_about_something_ok?

    That one is rather mangled. It doesn't follow the =?%s?%c?%s?= format.
    It also has a high-bit character in the content. But it does show the
    quirk of MIME word quoted-printable: underscores used for spaces.

    >> My question is: Is there a (Unix/Linux) tool that will reliably fix this?
    >> I.e. convert the binary glop format into the desired, pure ASCII, format.

    Your mail program should. I have seen Python and Perl libraries for
    decoding, too. Knowing the name should help you find them.

    > What you are looking at is the "punycode"[1] expression of a
    > non-ASCII character sequence.

    No. Punycode is an entirely different coding used to put highbit content
    in domain names (DNS). It is not base64 based.

    https://en.wikipedia.org/wiki/MIME#Encoded-Word
    https://en.wikipedia.org/wiki/Punycode

    As an example "Günter.com" encodes in Punycode as: xn--Gnter-kva.com
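
    To see that Punycode is a different scheme from the base64 form above, here is a short Python sketch (an illustration added here, not part of the original post) using the standard "punycode" and "idna" codecs. The variable names are arbitrary. Note that the idna codec lowercases the label, so its result differs in case from the form quoted above.

```python
label = "Günter"

# Raw Punycode: ASCII letters first, then "-" and the encoded position of "ü".
raw = label.encode("punycode")
print(raw)                      # b'Gnter-kva'
print(raw.decode("punycode"))   # Günter

# The idna codec adds the "xn--" prefix and normalises case per label.
idna_name = "günter.com".encode("idna")
print(idna_name)                # b'xn--gnter-kva.com'
```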

    Someone has that registered, but it is blank in my browser.

    Elijah
    ------
    thinks Günter probably likes his name with the accent and not "pure ASCII"

  • From Janis Papanagnou@21:1/5 to Eli the Bearded on Mon May 5 03:47:27 2025
    On 05.05.2025 01:28, Eli the Bearded wrote:
    > In comp.unix.shell, Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
    >> On Sun, 04 May 2025 18:15:38 +0000, Kenny McCormack wrote:
    >>> I have a string like (this was the "From" address of an email I recently received):
    >>> =?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
    >
    > This is called a "MIME word" or "MIME encoded word". Mail headers are
    > supposed to be 7-bit clean even when mail bodies can have 8-bit
    > encodings. MIME words exist to put quoted-printable and base64 in
    > headers.
    >
    > =?utf-8?B?UGhpbGxpcCBHw7xudGVy?=
    >   ^^^^^ | ^^^^^^^^^^^^^^^^^^^^
    > Charset | Content
    >         |
    >      Encoding: "B" for base64 "Q" for quoted-printable.
    >
    > :r! echo UGhpbGxpcCBHw7xudGVy | base64 -d
    > Phillip Günter
    >
    >>> TF-8?q?They’re_telling_us_something_about_something_ok?
    >
    > That one is rather mangled. It doesn't follow the =?%s?%c?%s?= format.

    That may have been just a copy/paste error; =?U may have got dropped.

    > It also has a high-bit character in the content.

    The originally posted data had an ASCII single quote (not an accent);
    may have got changed by the newsreader on reply.

    Wouldn't the following work...?

    =?UTF-8?q?They're_telling_us_something_about_something_ok?

    > But it does show the
    > quirk of MIME word quoted-printable: underscores used for spaces.


    [...]

    > thinks Günter probably likes his name with the accent and not "pure ASCII"

    (Not surprising for people with names that contain non-ASCII characters
    to have their names represented correctly.)

    Janis

  • From Miss Davies@21:1/5 to Janis Papanagnou on Mon May 5 08:17:26 2025
    On 2025-05-05, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    > (Not surprising for people with names that contain non-ASCII characters
    > to have their names represented correctly.)
    >
    > Janis


    indeed!

    Siân

  • From Cri-Cri@21:1/5 to Kenny McCormack on Tue May 6 18:02:57 2025
    On Sun, 4 May 2025 18:15:38 -0000 (UTC), Kenny McCormack wrote:

    > I have a string like (this was the "From" address of an email I recently received):
    >
    > =?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>

    If it's punycode, as was claimed elsewhere, you might try this:

    https://pypi.org/project/punycode/

    Most (I think) distributions have some version of Python installed (or
    it's in the repos). The package seems to be Python3 only, but I suppose
    all distributions have ditched version 2, that went EOL in 2020 (I think
    it was).

    --
    Cri-Cri

  • From Nuno Silva@21:1/5 to Kenny McCormack on Mon May 12 14:18:24 2025
    On 2025-05-12, Kenny McCormack wrote:

    > In article <slrn1023i84.2s2es.cmartin+usenetYYMMDD@nyx2.nyx.net>,
    > Chuck Martin <cmartin+usenetYYMMDD@nyx.net> wrote:
    > ...
    >> Try piping it into the following command:
    >>
    >> perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'
    >
    > Bingo! Thanks very much.
    >
    > That worked well on the other example I gave in the OP too (which was actually more the focus of my query).
    >
    > I still get the binary glop in place of the single quotes (The text
    > contains the word "They're", rendered as "They<binaryglop>re"). Note that
    > in the input string, <binaryglop> is (the literal string): =E2=80=99
    > which gets converted by your Perl program into the 3
    > characters with octal codes (as displayed by "od -bc"): 342 200 231
    >
    > I can deal with this latter problem myself via brute force with AWK, but it would be nice if I didn't have to - i.e., if there were a complete solution (i.e., one that does also the other half of the job).

    My guess is that this isn't an apostrophe, but a "right single quotation
    mark", which is sadly a common sight in such a context, and Emacs tells
    me that this (UCS codepoint 0x2019) is represented as E2 80 99 in UTF-8.
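
    The byte values can be cross-checked outside Emacs. The following Python sketch is only an illustration (the sample string "They=E2=80=99re" is assembled from the fragments quoted above, not copied from the original mail): it decodes the quoted-printable escapes, prints the bytes in octal as "od -bc" would show them, and looks up the resulting code point.

```python
import quopri
import unicodedata

# Quoted-printable escapes as they appear in the encoded header text.
raw = quopri.decodestring(b"They=E2=80=99re")
print(raw)                                   # b'They\xe2\x80\x99re'

# The three bytes after "They", in octal.
octal = [format(b, "o") for b in raw[4:7]]
print(octal)                                 # ['342', '200', '231']

# Interpreted as UTF-8, those three bytes are one character.
text = raw.decode("utf-8")
name = unicodedata.name(text[4])
print(hex(ord(text[4])), name)               # 0x2019 RIGHT SINGLE QUOTATION MARK
```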

    Are there good ways to convert such chars to something more reasonable?
    The only thing that occurs to me right now is passing it through iconv
    to a more limited charset using transliteration (e.g. "iconv -f utf8 -t iso8859-1//TRANSLIT -c") and then back to the desired encoding and
    charset.

    (But I suppose if this is already involving perl, then perhaps such a modification can be done through perl too.)
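
    As an alternative to the iconv transliteration idea, the same cleanup could be done with an explicit translation table. This Python sketch is an assumption-laden illustration rather than anything proposed in the thread: the names QUOTE_MAP and asciify_quotes are invented here, and the table covers only the four common typographic quotes, so any other non-ASCII character passes through unchanged.

```python
# Map typographic quotes to their plain ASCII counterparts (hypothetical table).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def asciify_quotes(text: str) -> str:
    """Replace the mapped typographic quotes; leave everything else alone."""
    return text.translate(QUOTE_MAP)


print(asciify_quotes("They\u2019re telling us \u201csomething\u201d"))
```

    Unlike a transliterating charset conversion, a table like this never touches characters such as "ü", which may or may not be what is wanted.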

    --
    Nuno Silva

  • From Kenny McCormack@21:1/5 to nunojsilva@invalid.invalid on Mon May 12 14:03:32 2025
    In article <vvssf0$13ls6$1@dont-email.me>,
    Nuno Silva <nunojsilva@invalid.invalid> wrote:
    > ...
    > My guess is that this isn't an apostrophe, but a "right single quotation
    > mark", which is sadly a common sight in such a context, and Emacs tells
    > me that this (UCS codepoint 0x2019) is represented as E2 80 99 in UTF-8.

    Correct, but as far as I am concerned, they are all single quotes, just
    mangled versions of same. The goal is to convert them all back into
    regular single quotes. And, as you will see below, similar comments apply
    for double quotes.

    The AWK code that I am currently using to clean this problem contains these lines:

    gsub(/=..=..=9[CD]/,"\"")
    gsub(/=..=..=../,"'")

    which is good enough for me.

    > Are there good ways to convert such chars to something more reasonable?
    > The only thing that occurs to me right now is passing it through iconv
    > to a more limited charset using transliteration (e.g. "iconv -f utf8 -t
    > iso8859-1//TRANSLIT -c") and then back to the desired encoding and
    > charset.

    As mentioned in the OP, I have never been successful in getting "iconv" to do much of anything. No, this is not a plea for help or for man pages to be
    read out loud.

    > (But I suppose if this is already involving perl, then perhaps such a
    > modification can be done through perl too.)

    Probably, but I'm not much into Perl. I do appreciate the solution given
    here by Chuck, but don't intend on doing any real deconstruction on it.

    --
    "I have a simple philosophy. Fill what's empty. Empty what's full. And
    scratch where it itches."

    Alice Roosevelt Longworth

  • From Kenny McCormack@21:1/5 to cmartin+usenetYYMMDD@nyx.net on Mon May 12 12:20:43 2025
    In article <slrn1023i84.2s2es.cmartin+usenetYYMMDD@nyx2.nyx.net>,
    Chuck Martin <cmartin+usenetYYMMDD@nyx.net> wrote:
    > ...
    > Try piping it into the following command:
    >
    > perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'

    Bingo! Thanks very much.

    That worked well on the other example I gave in the OP too (which was
    actually more the focus of my query).

    I still get the binary glop in place of the single quotes (The text
    contains the word "They're", rendered as "They<binaryglop>re"). Note that
    in the input string, <binaryglop> is (the literal string): =E2=80=99
    which gets converted by your Perl program into the 3
    characters with octal codes (as displayed by "od -bc"): 342 200 231

    I can deal with this latter problem myself via brute force with AWK, but it would be nice if I didn't have to - i.e., if there were a complete solution (i.e., one that does also the other half of the job).

    --
    Every time a Republican gets caught doing something illegal (i.e., just about every
    day or two), they always immediately issue two simultaneous statements about it:
    1) "I didn't do it" (Standard denial, which of course only cult-members pay any attention to)
    2) "Here's how I did it and why I did it and why it shouldn't matter to you and why you should go back to watching sports on TV"

  • From Chris Elvidge@21:1/5 to Kenny McCormack on Mon May 12 15:57:54 2025
    On 12/05/2025 at 15:03, Kenny McCormack wrote:
    > In article <vvssf0$13ls6$1@dont-email.me>,
    > Nuno Silva <nunojsilva@invalid.invalid> wrote:
    > ...
    >> My guess is that this isn't an apostrophe, but a "right single quotation
    >> mark", which is sadly a common sight in such a context, and Emacs tells
    >> me that this (UCS codepoint 0x2019) is represented as E2 80 99 in UTF-8.
    >
    > Correct, but as far as I am concerned, they are all single quotes, just mangled versions of same. The goal is to convert them all back into
    > regular single quotes. And, as you will see below, similar comments apply for double quotes.
    >
    > The AWK code that I am currently using to clean this problem contains these lines:
    >
    > gsub(/=..=..=9[CD]/,"\"")
    > gsub(/=..=..=../,"'")
    >
    > which is good enough for me.

    I use sed:

    # sq = single quote, smart left single quote, smart right single quote,
    #      double quote, smart left double quote, smart right double quote
    # [$sq] to change smart quotes to single quotes

    $ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
    $ sed -e "s/[$sq]/'/g" filename



    >> Are there good ways to convert such chars to something more reasonable?
    >> The only thing that occurs to me right now is passing it through iconv
    >> to a more limited charset using transliteration (e.g. "iconv -f utf8 -t
    >> iso8859-1//TRANSLIT -c") and then back to the desired encoding and
    >> charset.
    >
    > As mentioned in the OP, I have never been successful in getting "iconv" to do much of anything. No, this is not a plea for help or for man pages to be read out loud.
    >
    >> (But I suppose if this is already involving perl, then perhaps such a
    >> modification can be done through perl too.)
    >
    > Probably, but I'm not much into Perl. I do appreciate the solution given here by Chuck, but don't intend on doing any real deconstruction on it.




    --
    Chris Elvidge, England
    I WILL NOT SPANK OTHERS

  • From Eli the Bearded@21:1/5 to nunojsilva@invalid.invalid on Mon May 12 20:21:15 2025
    In comp.unix.shell, Nuno Silva <nunojsilva@invalid.invalid> wrote:
    > The only thing that occurs to me right now is passing it through iconv
    > to a more limited charset using transliteration (e.g. "iconv -f utf8 -t iso8859-1//TRANSLIT -c") and then back to the desired encoding and
    > charset.

    Hah. That style iconv fix was going to be my suggestion.

    > (But I suppose if this is already involving perl, then perhaps such a modification can be done through perl too.)

    I have a number of vim map commands to fix things I paste into news
    posts. I created a perl script that can do the same thing a few years
    ago for another request.

    If people want it, I've copied it here:

    https://qaz.wtf/C/textify

    The mappings include UTF-8 and broken UTF-8, so technically it is a
    binary file. That's why I don't post the source here.

    Elijah
    ------
    no longer encounters the broken lynx UTF-8 mentioned there

  • From Chuck Martin@21:1/5 to Kenny McCormack on Mon May 12 10:10:12 2025
    On 2025-05-04, Kenny McCormack <gazelle@shell.xmission.com> wrote:
    > I am often faced with this problem.
    >
    > I have a string like (this was the "From" address of an email I recently received):
    >
    > =?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>
    >
    > Note that this may not be the ideal example, but it is the one closest to hand. Here's another example:
    >
    > TF-8?q?They’re_telling_us_something_about_something_ok?
    >
    > when it should have been just:
    >
    > They're telling us something about something ok?
    >
    > My question is: Is there a (Unix/Linux) tool that will reliably fix this? I.e. convert the binary glop format into the desired, pure ASCII, format.

    Try piping it into the following command:

    perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'

    For example:

    chuck@sigma:~:10001% echo '=?utf-8?B?UGhpbGxpcCBHw7xudGVy?=' | perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'
    Phillip Günter
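
    For readers who would rather not use Perl, roughly the same decoding is available in Python's standard email.header module. This is a sketch added for comparison, not the command from this post: decode_mime_header is a made-up wrapper name, and it handles one header value at a time rather than acting as a line filter.

```python
from email.header import decode_header, make_header


def decode_mime_header(value: str) -> str:
    """Decode any RFC 2047 encoded words in a header value (hypothetical wrapper)."""
    # decode_header splits the value into (bytes, charset) chunks;
    # make_header reassembles them into a single Unicode string.
    return str(make_header(decode_header(value)))


print(decode_mime_header("=?utf-8?B?UGhpbGxpcCBHw7xudGVy?= <s69pguen@uni-bonn.de>"))
```

    As with the Perl one-liner, the result is Unicode text, so typographic quotes and accented letters survive decoding and need a separate step if plain ASCII is really required.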

    --
    To reply via e-mail, replace "YYMMDD" with the current date in the
    appropriate format, or your mail will bounce. Removing the + and
    everything that follows in my username will also cause your mail to
    bounce. Details: https://nyx.net/~cmartin/HowToEmailMe

  • From Brian Patrie@21:1/5 to Chris Elvidge on Wed May 14 01:28:30 2025
    Chris Elvidge wrote:
    > $ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')

    To eliminate needless overhead:
    sg=$'\u0027\u2018\u2019\u0022\u201c\u201d'

  • From Chris Elvidge@21:1/5 to Brian Patrie on Wed May 14 11:08:47 2025
    On 14/05/2025 at 07:28, Brian Patrie wrote:
    > Chris Elvidge wrote:
    >> $ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
    >
    > To eliminate needless overhead:
    > sg=$'\u0027\u2018\u2019\u0022\u201c\u201d'

    True. Thanks.


    --
    Chris Elvidge, England
    I WILL NOT BARF UNLESS I'M SICK

  • From Kaz Kylheku@21:1/5 to Chris Elvidge on Wed May 14 15:02:26 2025
    On 2025-05-14, Chris Elvidge <chris@internal.net> wrote:
    > On 14/05/2025 at 07:28, Brian Patrie wrote:
    >> Chris Elvidge wrote:
    >>> $ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
    >>
    >> To eliminate needless overhead:
    >> sg=$'\u0027\u2018\u2019\u0022\u201c\u201d'

    So it is needless overhead in a script that is intended
    to be portable to certain shells.
    The -n option of echo is documented in POSIX (as an XSI extension),
    whereas $'...' is a feature of some shells. (Korn, Bash, Zsh?)

    There is a small issue with it in Bash in that when you use it
    in a function definition, Bash strips away the $'...'
    syntax and stores the raw string.
    Then when you type "set" (with no arguments) to view the function
    definitions, the raw characters are dumped to your terminal,
    rather than their escaped representation.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Wed May 14 20:39:14 2025
    On 14.05.2025 17:02, Kaz Kylheku wrote:
    > On 2025-05-14, Chris Elvidge <chris@internal.net> wrote:
    >> On 14/05/2025 at 07:28, Brian Patrie wrote:
    >>> Chris Elvidge wrote:
    >>>> $ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
    >>>
    >>> To eliminate needless overhead:
    >>> sg=$'\u0027\u2018\u2019\u0022\u201c\u201d'
    >
    > So it is needless overhead in a script that is intended
    > to be portable to certain shells.
    > The -n option of echo is documented in POSIX (as an XSI extension),

    The 'echo -n' could also simply be replaced by 'printf'.

    > whereas $'...' is a feature of some shells. (Korn, Bash, Zsh?)

    Concerning portability the question in the first place is IMO what to
    do with those '\u....' . - I don't think this is standard, or is it?

    So if it's non-standard we could use arbitrary common but non-standard
    shell features.

    Like the shell built-in 'printf' without $'...' to use just '...'.
    Or IMO best just the already suggested ANSI strings var=$'...' .

    Janis

    [...]

  • From Brian Patrie@21:1/5 to Kaz Kylheku on Thu May 15 01:15:55 2025
    Kaz Kylheku wrote:
    > On 2025-05-14, Chris Elvidge <chris@internal.net> wrote:
    >> On 14/05/2025 at 07:28, Brian Patrie wrote:
    >>> Chris Elvidge wrote:
    >>>> $ sq=$(echo -ne '\u0027\u2018\u2019\u0022\u201c\u201d')
    >>>
    >>> To eliminate needless overhead:
    >>> sg=$'\u0027\u2018\u2019\u0022\u201c\u201d'
    >
    > So it is needless overhead in a script that is intended
    > to be portable to certain shells.
    > The -n option of echo is documented in POSIX (as an XSI extension),
    > whereas $'...' is a feature of some shells. (Korn, Bash, Zsh?)

    echo -e is also not portable; so $'' seemed a safe assumption.

  • From Janis Papanagnou@21:1/5 to Alexis on Thu May 15 13:31:46 2025
    On 15.05.2025 08:08, Alexis wrote:
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
    >
    >> Concerning portability the question in the first place is IMO what to
    >> do with those '\u....' . - I don't think this is standard, or is it?
    >>
    >> So if it's non-standard we could use arbitrary common but non-standard
    >> shell features.
    >>
    >> Like the shell built-in 'printf' without $'...' to use just '...'.
    >> Or IMO best just the already suggested ANSI strings var=$'...' .
    >
    > The 'echo' utility as specified by the current version of POSIX,
    > POSIX.1-2024, isn't required to support the '\u...' sequence:
    >
    > https://pubs.opengroup.org/onlinepubs/9799919799/utilities/echo.html#tag_20_37_04
    >
    > Nor is the 'printf' utility:
    >
    > https://pubs.opengroup.org/onlinepubs/9799919799/utilities/printf.html#tag_20_96

    Thanks for the confirmation!


    > and OpenBSD Ksh doesn't support using that sequence in an argument to
    > its 'echo' builtin.

    Original Ksh's 'echo' supports it if provided as ANSI string $'...',
    and its 'printf' even as plain string.

    $ echo -ne $'\u0027\u2018\u2019\u0022\u201c\u201d\n'
    '‘’"“”
    $ printf '\u0027\u2018\u2019\u0022\u201c\u201d\n'
    '‘’"“”


    > Additionally, this version of POSIX added dollar-single quotes to the standard:
    >
    >     A sequence of characters starting with a <dollar-sign> immediately
    >     followed by a single-quote ($') shall preserve the literal value of
    >     all characters up to an unescaped terminating single-quote (')

    Yes, this syntax is what ksh93 supports and had introduced as "ANSI
    String", where Ksh does an interpretation of the escaped characters.
    I suppose POSIX allows the syntax at least to not have scripts using
    it be sort of "non-standard" because of that? - Or does POSIX support
    ANSI strings with escaped characters interpreted, meanwhile?


    >     -- https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V3_chap02.html#tag_19_02_04
    >
    > However, although it's supported by Bash, ATT Ksh, and Zsh, it doesn't
    > seem to be supported by Dash (as of version 0.5.12):

    I'm not surprised.


    > $ eg=$'\n'
    > $ printf '%s' "${eg}" | od -c
    > 0000000   $   \   n
    > 0000003
    > $
    >
    > or by OpenBSD Ksh (as of the version included in OpenBSD 7.6; not sure
    > about its status in 7.7):

    I'm also not surprised. - I generally suggest to use the ksh93u+m
    version which (roughly) is a bug-fixed version of the meanwhile
    unsupported original ksh93u+.


    > $ eg=$'\n'
    > $ printf '%s' "${eg}" | od -c
    > 0000000   $   \   n
    > 0000003
    > $
    >
    > (i should probably add a table about this to a page on the Gentoo wiki
    > that i've been slowly working on, https://wiki.gentoo.org/wiki/Shell/Scripting )
    >
    >
    > Alexis.

    Have fun! :-)

    Janis

  • From Geoff Clare@21:1/5 to Kaz Kylheku on Thu May 15 13:36:25 2025
    Kaz Kylheku wrote:

    > The -n option of echo is documented in POSIX (as an XSI extension),

    You have that backwards. POSIX allows -n to affect the behaviour
    of echo on non-XSI systems (in an implementation-defined manner).
    On XSI systems, "echo -n a" is required to output "-n a".

    > whereas $'...' is a feature of some shells. (Korn, Bash, Zsh?)

    As others have pointed out, $'...' has been added to POSIX (in the
    2024 revision), but without \u and \U. The rationale explains that
    they were omitted because their behaviour differs between shells.

    --
    Geoff Clare <netnews@gclare.org.uk>
