• Re: How to manage accented characters in mail header?

    From Stefan Ram@21:1/5 to Chris Green on Sat Jan 4 14:49:38 2025
    Chris Green <cl@isbd.net> wrote or quoted:
    From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>

    In Python, when you call decode_header from the email.header
    module, it returns a list of parts, where each part is a tuple
    of (value, charset). The value is bytes when the header contains
    encoded words, or a str when it contains none. To combine these
    parts into one string, you’ll want to loop through the list,
    decode each piece (if it needs it), and then join them together.
    Here’s a straightforward example of how to pull this off:

    from email.header import decode_header

    # Example header
    header_example = \
    'From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'

    # Decode the header
    decoded_parts = decode_header(header_example)

    # Kick off an empty list for the decoded strings
    decoded_strings = []

    for part, charset in decoded_parts:
        if isinstance(part, bytes):
            # Decode the bytes to a string using the charset
            decoded_string = part.decode(charset or 'utf-8')
        else:
            # If it’s already a string, just roll with it
            decoded_string = part
        decoded_strings.append(decoded_string)

    # Join the parts into a single string
    final_string = ''.join(decoded_strings)

    print(final_string)  # From: Sébastien Crignon <sebastien.crignon@amvs.fr>

    Breakdown

    decode_header(header_example): This line takes your email header
    and breaks it down into a list of tuples.

    Looping through decoded_parts: You check if each part is in
    bytes. If it is, you decode it using the charset that came with
    it (defaulting to 'utf-8' if the charset is None).

    Appending Decoded Strings: You toss each decoded part into a list.

    Joining Strings: Finally, you use ''.join(decoded_strings) to glue
    all the decoded strings into a single, coherent piece.

    Just a Heads Up

    Keep an eye out for cases where the charset might be None. In those
    moments, it’s smart to fall back to 'utf-8' or something safe.
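The loop above can also be written more compactly with email.header.make_header, which accepts the list returned by decode_header and builds a Header object whose str() is the decoded text. This is a minimal sketch, not part of the original post; the helper name decode_header_value is made up for illustration, and the header value is passed without the "From: " prefix.

```python
from email.header import decode_header, make_header

raw_value = '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'


def decode_header_value(value):
    """Return the header value with any RFC 2047 encoded words decoded."""
    # make_header() rebuilds a Header from the (value, charset) pairs;
    # str() on that Header yields the decoded text.
    return str(make_header(decode_header(value)))


decoded = decode_header_value(raw_value)
# A header without encoded words passes through unchanged.
plain = decode_header_value('Chris Green <cl@isbd.net>')

print(decoded)
print(plain)
```

Because the same call handles both encoded and plain values, it should work whether or not the encoded section is present.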

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Green@21:1/5 to All on Sat Jan 4 14:31:24 2025
    I have a Python script that filters my incoming E-Mail. It has been
    working OK (with various updates and improvements) for many years.

    I now have a minor new problem when handling E-Mail with a From: that
    has accented characters in it:-

    From: Sébastien Crignon <sebastien.crignon@amvs.fr>


    I use Python mailbox to parse the message:-

    import mailbox
    ...
    ...
    msg = mailbox.MaildirMessage(sys.stdin.buffer.read())

    Then various mailbox methods to get headers etc.
    I use the following to get the From: address:-

    str(msg.get('from', "unknown")).lower()

    The result has the part with the accented character wrapped as follows:-

    From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>


    I know I have hit this issue before but I can't remember the fix. The
    problem I have now is that searching the above doesn't work as
    expected. Basically I just need to get rid of the ?utf-8? wrapped bit
    altogether as I'm only interested in the 'real' address. How can I
    easily remove the UTF-8 section in a way that will work whether or not
    it's there?


    --
    Chris Green
  • From Peter Pearson@21:1/5 to Chris Green on Sat Jan 4 15:00:21 2025
    This seemed to work for me:

    import email.header
    text, encoding = email.header.decode_header(some_string)[0]
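As a hedged illustration of what that call returns for the header value from this thread: the first element of the list is a (value, charset) pair, and for an encoded word the value is bytes that still need decoding. Note that [0] only picks up the first chunk (here the display name), not the address that follows it. The variable some_string below is filled in with the example value; the 'ascii' fallback is an assumption for the case where no charset is reported.

```python
import email.header

# Header value only (no "From: " prefix), taken from the thread.
some_string = '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'

text, encoding = email.header.decode_header(some_string)[0]

# For an encoded word, text is bytes and encoding names its charset.
if isinstance(text, bytes):
    text = text.decode(encoding or 'ascii', errors='replace')

print(text)
print(encoding)
```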


    --
    To email me, substitute nowhere->runbox, invalid->com.

  • From Stefan Ram@21:1/5 to Chris Green on Sat Jan 4 19:40:34 2025
    Chris Green <cl@isbd.net> wrote or quoted:
    print(final_string)# From: Sébastien Crignon <sebastien.crignon@amvs.fr>
    Is there a simple[r] way to extract just the 'real' address between
    the <>, that's all I actually need. I think it has to be the last
    chunk of the From: doesn't it?

    Besides the deal with the pointy brackets, there's also this
    other setup with round ones, like in

    sebastien.crignon@amvs.fr (Sébastien Crignon)

    . The standard library has:

    email.utils.parseaddr(address)

    Parse address – which should be the value of some
    address-containing field such as To or Cc - into its
    constituent realname and email address parts. Returns a tuple
    of that information, unless the parse fails, in which case a
    2-tuple of ('', '') is returned.

    .
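A short sketch of parseaddr applied to both spellings mentioned above. Only the address part is used here; as far as I know the display name of the angle-bracket form comes back still encoded, which does not matter if only the address is wanted. The variable names are made up for illustration.

```python
from email.utils import parseaddr

# Two spellings of the same mailbox: name before <addr>, or name in a comment.
angle_form = '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'
paren_form = 'sebastien.crignon@amvs.fr (Sébastien Crignon)'

name1, addr1 = parseaddr(angle_form)
name2, addr2 = parseaddr(paren_form)

print(addr1)
print(addr2)

# A value that cannot be parsed yields a pair of empty strings.
failed = parseaddr('')
```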

  • From Chris Green@21:1/5 to Stefan Ram on Sat Jan 4 19:07:57 2025
    Thanks, I think! :-)

    Is there a simple[r] way to extract just the 'real' address between
    the <>, that's all I actually need. I think it has to be the last
    chunk of the From: doesn't it?


    --
    Chris Green
  • From Peter J. Holzer@21:1/5 to Chris Green via Python-list on Mon Jan 6 20:43:21 2025
    On 2025-01-04 19:07:57 +0000, Chris Green via Python-list wrote:
    Stefan Ram <ram@zedat.fu-berlin.de> wrote:
    Chris Green <cl@isbd.net> wrote or quoted:
    From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>

    Is there a simple[r] way to extract just the 'real' address between
    the <>, that's all I actually need. I think it has to be the last
    chunk of the From: doesn't it?

    No,
    From: <sebastien.crignon@amvs.fr> (Sébastien Crignon)
    would also be permissible (properly encoded, of course), and even
    From: < sebastien (Sébastien) . crignon (Crignon) @ amvs . fr >
    (although I think the latter is deprecated).

    And also, there can be more than one address in a From header.

    To properly extract email addresses from a header, use email.utils.getaddresses(). You don't have to decode the header first.
    The MIME-encoding is supposed to not interfere with parsing headers for machine-readable information like addresses or message ids.
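A minimal sketch of that approach, assuming a From: value that carries two addresses (the second one is added here purely for illustration). getaddresses takes a list of header values and returns (name, address) pairs; the addresses come out usable without any decoding, and the display names are decoded only in the optional last step.

```python
from email.header import decode_header, make_header
from email.utils import getaddresses

# Hypothetical From: values; a message can carry more than one address.
from_values = [
    '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>, '
    'Chris Green <cl@isbd.net>',
]

pairs = getaddresses(from_values)

# The addresses need no MIME decoding.
addresses = [addr.lower() for _name, addr in pairs]
print(addresses)

# Optional: decode the display names if they are wanted for display.
names = [str(make_header(decode_header(name))) for name, _addr in pairs]
print(names)
```

With a parsed message, the list of values would typically come from something like msg.get_all('from', []) rather than a literal.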

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | hjp@hjp.at | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"