• SLRTR 000000: ZWJ in Sphinx

    From Stefan Ram@21:1/5 to All on Mon May 5 11:34:52 2025
    The Use of the U+200D "ZERO WIDTH JOINER" (ZWJ) Character in
    reStructuredText Input for "Sphinx"
    (Technical Report SLRTR 000000)

    (This technical report was prepared by the author during his spare
    time.)

    Stefan Ram
    2025

    Abstract - The character U+200D "ZERO WIDTH JOINER" (ZWJ) may be
    employed in inputs written in the "reStructuredText" (rst) markup
    notation for the software documentation tool "Sphinx" in order to permit
    the inclusion of special characters within embedded code segments.
    To ensure that Sphinx's automatic line breaking continues to function correctly, two minor adjustments to Sphinx are required.

    I. Introduction

    The software documentation tool "Sphinx" accepts texts composed in the "reStructuredText" (rst) notation. Within paragraphs, code segments are
    denoted by enclosing the relevant text between pairs of grave accents
    (``) as illustrated in Figure 1.

    Figure 1: A Code Segment within a Paragraph

    |... the expression ``x[ 2 ]`` may be used ...

    Such segments are, however, subject to two restrictions:
    - They must not begin or end with a space character (" ").
    - They must not contain pairs of grave accents.

    II. Versions of the Software Considered

    This report pertains to Sphinx, version 8.2.3.

    III. The U+200D ZERO WIDTH JOINER (ZWJ) Character as a Workaround

    It is nevertheless possible to include a space at the beginning of
    an embedded code segment by prefixing it with the invisible character
    U+200D "ZERO WIDTH JOINER" (ZWJ). Similarly, a space may be appended
    to the end of such a segment by suffixing it with a ZWJ. Furthermore,
    a sequence of multiple grave accents within an embedded code segment
    can be achieved by interposing a ZWJ between the grave accents.
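
    For a program that generates rst, the three rules above can be
    sketched as a small helper. This is only an illustration: the name
    inline_literal is hypothetical and is not part of Sphinx or Docutils,
    and empty input is not handled.

```python
import re

ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER


def inline_literal(code: str) -> str:
    """Wrap code in rst inline-literal markup (hypothetical helper).

    A ZWJ is inserted wherever the plain text would violate one of the
    two restrictions on embedded code segments.
    """
    # Put a ZWJ after every grave accent that is directly followed by
    # another one, so that no pair of grave accents remains.
    code = re.sub(r"`(?=`)", "`" + ZWJ, code)
    # Shield a leading or trailing space with a ZWJ.
    if code.startswith(" "):
        code = ZWJ + code
    if code.endswith(" "):
        code = code + ZWJ
    return "``" + code + "``"
```

    For example, inline_literal(" x") yields the segment with a ZWJ
    between the opening grave accents and the space, while a segment such
    as "x[ 2 ]" is passed through unchanged.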

    The ZWJ character is invisible in Sphinx's output. If desired, it may
    also be removed from the output by means of post-processing.
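
    Such post-processing might be as simple as the following sketch. The
    function name strip_zwj is hypothetical. Note that it removes every
    ZWJ, including any that are legitimately part of the text (for
    example in emoji sequences), so it is only suitable where ZWJ is used
    solely for the workaround described here.

```python
ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER


def strip_zwj(text: str) -> str:
    """Remove all ZWJ characters from generated output (hypothetical helper)."""
    return text.replace(ZWJ, "")
```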

    IV. Consideration of ZWJ in Line Breaking and Word Division

    Sphinx interprets a ZWJ as a character of width one and regards it as
    a potential break point within words. Consequently, the formatting of
    output text may be affected. This behavior can be modified by two
    changes to the Sphinx source code.

    A. Adjustment of Character Width

    Within the source file "docutils\utils\__init__.py" (which belongs to
    Docutils, the library on which Sphinx builds), the width of
    ZWJ characters should be subtracted from the total text width, so that
    ZWJ is not counted as a character of width one. This is accomplished
    by inserting the following line prior to the "return width" statement in
    the definition of the column_width function:

    Figure 2: The line to be inserted

    |width -= text.count('\u200d')
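
    The effect of this line can be tried out in isolation. The sketch
    below does not use the real column_width; simple_column_width is a
    simplified stand-in (an assumption for illustration) that counts wide
    and fullwidth characters as two columns and everything else as one,
    and ignores combining characters.

```python
import unicodedata

ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER


def simple_column_width(text: str) -> int:
    """Simplified stand-in for a column-width function (not the real one)."""
    return sum(
        2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
        for c in text
    )


def zwj_aware_width(text: str) -> int:
    """Same as above, but with the inserted line applied."""
    width = simple_column_width(text)
    width -= text.count(ZWJ)  # the adjustment: ZWJ occupies no column
    return width
```

    With this stand-in, the string "a", ZWJ, "b" has width three before
    the adjustment and width two after it.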

    B. Adjustment of Break Point Determination

    (This adjustment is likely unnecessary for ZWJ within embedded code
    segments, but may be required if ZWJ is used within words of running
    text for any reason.)

    In the Sphinx source file "sphinx\writers\text.py", words should not be
    split at the occurrence of ZWJ within a word. To this end, the
    definition shown in Figure 3 may be inserted below the definition of the
    split function (which itself is within the definition of the _split
    function in the TextWrapper class). The indentation of the new
    col_width function should match that of the preceding split function.

    Figure 3: The definition to be inserted

    |def col_width(t: str) -> int:
    |    '''for the purpose of word splitting, treat
    |    zero-width characters just as characters
    |    of width one.'''
    |    width = column_width(t)
    |    if width == 0:
    |        width = 1
    |    return width

    The source code should further be modified such that this new col_width function is invoked in the call to "groupby" three lines below,
    replacing the previous use of column_width.
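
    Why the key function matters can be seen in a small self-contained
    sketch. It assumes that adjustment A is already in place (ZWJ has
    width zero) and uses a simplified stand-in for column_width. The
    helper group_by_width is hypothetical and only mimics the idea of
    grouping the characters of a word by their width with
    itertools.groupby; it is not the Sphinx code itself.

```python
import unicodedata
from itertools import groupby

ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER


def column_width(text: str) -> int:
    """Stand-in width function with ZWJ counted as zero (adjustment A assumed)."""
    width = sum(
        2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
        for c in text
    )
    return width - text.count(ZWJ)


def col_width(t: str) -> int:
    """For word splitting, treat zero-width characters as width one."""
    width = column_width(t)
    return 1 if width == 0 else width


def group_by_width(word: str, key) -> list[str]:
    """Group consecutive characters that share the same key value."""
    return ["".join(group) for _, group in groupby(word, key)]


word = "ab" + ZWJ + "cd"
split_groups = group_by_width(word, column_width)  # ZWJ forms its own group
joined_groups = group_by_width(word, col_width)    # the word stays in one group
```

    With column_width as the key, the ZWJ ends up in a group of its own
    and the word falls into three pieces. With col_width, all characters
    share the key value one and the word remains a single group.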

    (End of Technical Report)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mark Bourne@21:1/5 to Stefan Ram on Wed May 7 19:44:45 2025
    Stefan Ram wrote:
    Abstract - The character U+200D "ZERO WIDTH JOINER" (ZWJ) may be
    employed in inputs written in the "reStructuredText" (rst) markup
    notation for the software documentation tool "Sphinx" in order to permit
    the inclusion of special characters within embedded code segments.
    To ensure that Sphinx's automatic line breaking continues to function correctly, two minor adjustments to Sphinx are required.

    Does using U+2060 "WORD JOINER" instead avoid the need to modify Sphinx?

    Just going by the description at <https://unicode-explorer.com/c/2060>,
    which describes it as "a zero width non-breaking space", that seems to
    be what you want. I'm not sure from the description of U+200D whether
    that one's supposed to allow breaking, but U+2060 sounds like it
    definitely shouldn't allow breaking ;).

    --
    Mark.

  • From Stefan Ram@21:1/5 to Mark Bourne on Wed May 7 19:17:04 2025
    Mark Bourne <nntp.mbourne@spamgourmet.com> wrote or quoted:
    Does using U+2060 "WORD JOINER" instead avoid the need to modify Sphinx?

    I haven't seen any code in Sphinx that fixes width calculations
    for this special character, so I'm guessing you'd still have to
    change Sphinx. But here's the thing:

    My issue was that some characters just can't be used inside ``...``
    for inline code, and I wasn't sure what to do when I needed to
    generate inline code with those "forbidden" characters. But now
    I found out about the ":literal:" role, which apparently lets you
    include /any/ character - even if it's a bit more to type. That would
    be really helpful for my program that generates reStructuredText.
