• Re: Static regex for embedded systems

    From David Brown@21:1/5 to pozz on Tue Jan 21 16:40:17 2025
    On 21/01/2025 15:31, pozz wrote:
    Many times I need to parse/decode a text string that comes from an
    external system, over a serial bus, MQTT, and so on.

    Many times this string has a fixed syntax/layout. In order to parse this
    string, I create a custom parser every time, which can be tedious,
    cumbersome and error-prone.

    For example, suppose you have to decode a string from a modem that uses
    AT commands. Many answers from the modem have the following schema:

      \r\n+<prefix>: <field1>,<field2>\r\n
      \r\nOK\r\n

    The prefix is known, the number and type of fields are known too. With
    regex, the parser would be simple.


    There are plenty of libraries for run-time regular expressions - they
    have been in the C++ standard library since C++11 (IIRC). But run-time
    parsing of regex strings and matching for the input string will be very
    time and space costly in a small embedded system. You want compile-time regular expression handling.
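
    For comparison, here is a minimal sketch of the run-time approach using std::regex on one response line. The "+CSQ" prefix, the TwoFields struct and the parseCsq name are assumptions made for illustration only. The pattern is compiled the first time the function runs, which is the time and space cost mentioned above.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type for "+<prefix>: <field1>,<field2>".
struct TwoFields {
    std::string field1;
    std::string field2;
};

// Matches one response line (line terminators already removed) against
// "+CSQ: <field1>,<field2>". The prefix is only an example.
inline std::optional<TwoFields> parseCsq(const std::string& line)
{
    // Compiled at run time on first use, then reused for later calls.
    static const std::regex pattern(R"(\+CSQ: ([^,]*),(.*))");

    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return std::nullopt;
    }
    return TwoFields{m[1].str(), m[2].str()};
}
```

    Both fields come back as strings; converting them to numbers is left to the caller.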

    This had been planned for inclusion in C++23 (and therefore part of the
    current GNU ARM Embedded toolchain, if that's what you are using), but
    it didn't make it. However, there are standalone compile-time regex
    libraries for C++ using templates and compile-time functions, which will
    in effect generate specialised parsers from the regex strings.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Reuther@21:1/5 to All on Tue Jan 21 17:03:33 2025
    Am 21.01.2025 um 15:31 schrieb pozz:
    Many times I need to parse/decode a text string that comes from an
    external system, over a serial bus, MQTT, and so on.

    Many times this string has a fixed syntax/layout. In order to parse this
    string, I create a custom parser every time, which can be tedious,
    cumbersome and error-prone.
    [...]

    I don't see a question in this posting, but isn't this the task that
    'lex' is intended to be used for?

    (Personally, I have no problem with handcrafted parsers.)


    Stefan

  • From Niocláiſín@21:1/5 to All on Wed Jan 22 00:41:29 2025
    Maybe read
    Message-ID: <22-05-027@comp.compilers>
    (
    news:comp.compilers
    From: Paul B Mann
    Subject: Re: Please provide a learning path for mastering lexical analysis languages
    Date: Sun, 8 May 2022 22:27:55 -0700 (PDT)
    )
    about
    LRSTAR (
    https://sourceforge.net/projects/lrstar
    ). I have never used LRSTAR. It is supposed to be efficient for C++ on
    Microsoft Windows. I do not know whether it has ever been used on an
    embedded system.

  • From George Neuner@21:1/5 to pozz on Tue Jan 21 19:38:03 2025
    On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:

    Il 21/01/2025 17:03, Stefan Reuther ha scritto:
    Am 21.01.2025 um 15:31 schrieb pozz:
    Many times I need to parse/decode a text string that comes from an
    external system, over a serial bus, MQTT, and so on.

    Many times this string has a fixed syntax/layout. In order to parse this
    string, I create a custom parser every time, which can be tedious,
    cumbersome and error-prone.
    [...]

    I don't see a question in this posting,

    The hidden question was whether there's a better approach than handcrafted
    parsers.


    but isn't this the task that
    'lex' is intended to be used for?

    I will look at it.


    (Personally, I have no problem with handcrafted parsers.)

    So long as they are correct 8-)


    Stefan

    Lex and Flex create table driven lexers (and driver code for them).
    Under certain circumstances Flex can create far smaller tables than
    Lex, but likely either would be massive overkill for the scenario you described.

    Minding David's warnings about lexer size, if you really want to try
    using regex, I would recommend RE2C. RE2C is a preprocessor that
    generates simple recursive code to directly implement matching of
    regex strings in your code. There are versions available for several
    languages.
    https://re2c.org/

  • From David Brown@21:1/5 to George Neuner on Wed Jan 22 10:59:03 2025
    On 22/01/2025 01:38, George Neuner wrote:
    On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:

    Il 21/01/2025 17:03, Stefan Reuther ha scritto:
    Am 21.01.2025 um 15:31 schrieb pozz:
    Many times I need to parse/decode a text string that comes from an
    external system, over a serial bus, MQTT, and so on.

    Many times this string has a fixed syntax/layout. In order to parse this
    string, I create a custom parser every time, which can be tedious,
    cumbersome and error-prone.
    [...]

    I don't see a question in this posting,

    The hidden question was whether there's a better approach than handcrafted
    parsers.


    but isn't this the task that
    'lex' is intended to be used for?

    I will look at it.


    (Personally, I have no problem with handcrafted parsers.)

    So long as they are correct 8-)


    This is vital. You want a /lot/ of test cases to check the algorithm.


    Stefan

    Lex and Flex create table driven lexers (and driver code for them).
    Under certain circumstances Flex can create far smaller tables than
    Lex, but likely either would be massive overkill for the scenario you described.

    Minding David's warnings about lexer size, if you really want to try
    using regex, I would recommend RE2C. RE2C is a preprocessor that
    generates simple recursive code to directly implement matching of
    regex strings in your code. There are versions available for several languages.
    https://re2c.org/


    The "best" solution depends on the OP's knowledge, the variety of the
    patterns needed, the resources of the target system, and restrictions on
    things like programming language support. For example, the C++ template
    based project I suggested earlier (which I have not tried myself) should
    give quite efficient results, but it requires a modern C++ compiler.

    I think if the OP is only looking for a few patterns, or styles of
    pattern, then regexes and powerful code generator systems are overkill.
    It will take more work to learn and understand them, and code generated
    by tools like lex and flex is not designed to be human-friendly, nor is
    it likely to match well with coding standards for small embedded systems.

    I'd probably just have a series of matcher functions for different parts
    (fixed string, numeric field as integer, flag field as boolean, etc.)
    and have manual parsers for the different types. As a C++ user I'd be returning std::optional<> types here and using the new "and_then"
    methods to give neat chains, but a C programmer might want to pass a
    pointer to a value variable and return "bool" for success. If I had a
    lot of such patterns to match, then I might use templates for generating
    the higher level matchers - for C, it would be either a macro system or
    an external Python script.
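
    A minimal sketch of such matcher functions, assuming a made-up line format "+RDG: <integer>,<flag>". The names matchFixed, matchInt, matchFlag, Reading and parseReading are hypothetical. It uses plain `if` checks on std::optional instead of the C++23 and_then chaining, so that it builds as C++17.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Each matcher consumes from the front of `in` on success and leaves it
// untouched on failure.

// Matches a fixed string such as "+RDG: " or ",".
inline bool matchFixed(std::string_view& in, std::string_view text)
{
    if (in.substr(0, text.size()) != text) {
        return false;
    }
    in.remove_prefix(text.size());
    return true;
}

// Matches one to nine decimal digits (so the value fits in a 32-bit int).
inline std::optional<int> matchInt(std::string_view& in)
{
    std::size_t n = 0;
    int value = 0;
    while (n < in.size() && n < 9 &&
           std::isdigit(static_cast<unsigned char>(in[n]))) {
        value = value * 10 + (in[n] - '0');
        ++n;
    }
    if (n == 0) {
        return std::nullopt;
    }
    in.remove_prefix(n);
    return value;
}

// Matches a flag field written as '0' or '1'.
inline std::optional<bool> matchFlag(std::string_view& in)
{
    if (in.empty() || (in[0] != '0' && in[0] != '1')) {
        return std::nullopt;
    }
    const bool flag = (in[0] == '1');
    in.remove_prefix(1);
    return flag;
}

// Hypothetical record for the example line format.
struct Reading {
    int value;
    bool flag;
};

// Manual parser for "+RDG: <integer>,<flag>" built from the matchers.
inline std::optional<Reading> parseReading(std::string_view line)
{
    if (!matchFixed(line, "+RDG: ")) return std::nullopt;
    const auto value = matchInt(line);
    if (!value || !matchFixed(line, ",")) return std::nullopt;
    const auto flag = matchFlag(line);
    if (!flag || !line.empty()) return std::nullopt;  // no trailing garbage
    return Reading{*value, *flag};
}
```

    A C version would look much the same, with each matcher taking a pointer to the output variable and returning bool.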

    Or just use sscanf() :-)
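
    A minimal sketch of the sscanf() route, assuming an example line "+CSQ: <int>,<int>" with the terminators already stripped. The function and parameter names are hypothetical. Note that sscanf() is lenient about whitespace: the space in the format matches any amount of whitespace, including none, and %d skips leading whitespace.

```cpp
#include <cstdio>

// Parses "+CSQ: <int>,<int>"; returns true only if the whole line matched.
inline bool parseCsqLine(const char* line, int* rssi, int* ber)
{
    int consumed = 0;
    // %n stores the number of characters consumed so far and is not
    // counted in the return value, so a full match returns 2.
    if (std::sscanf(line, "+CSQ: %d,%d%n", rssi, ber, &consumed) != 2) {
        return false;
    }
    // Reject trailing garbage after the second field.
    return line[consumed] == '\0';
}
```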

  • From Stefan Reuther@21:1/5 to All on Wed Jan 22 17:53:15 2025
    Am 22.01.2025 um 01:38 schrieb George Neuner:
    On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:
    (Personally, I have no problem with handcrafted parsers.)

    So long as they are correct 8-)

    Correctness has an inverse correlation with complexity, so optimize for non-complexity.

    I would implement a two-stage parser: first break the lines into a
    buffer, then throw a bunch of statements like

    if (Parser p(str); p.matchString("+")
    && p.matchTextUntil(":", &prefix)
    && p.matchWhitespace() ...)

    at this, with Parser being a small C++ class wrapping the individual
    matching operations (strncmp, strspn, etc.)

    Surely this is more complex than a regex/template, but still easy enough
    to be "obviously correct".
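
    A minimal sketch of what such a Parser class could look like. Only the three method names come from the snippet above; the signatures, the use of std::string for the output and the extra matchEnd() method are assumptions.

```cpp
#include <cstring>
#include <string>

// Hypothetical minimal Parser: walks a NUL-terminated line.
// Each match* method advances the position only on success.
class Parser {
public:
    explicit Parser(const char* text) : pos_(text) {}

    // Matches a fixed string.
    bool matchString(const char* s)
    {
        const std::size_t n = std::strlen(s);
        if (std::strncmp(pos_, s, n) != 0) {
            return false;
        }
        pos_ += n;
        return true;
    }

    // Copies everything up to the delimiter into *out, then skips the
    // delimiter itself.
    bool matchTextUntil(const char* delim, std::string* out)
    {
        const char* end = std::strstr(pos_, delim);
        if (end == nullptr) {
            return false;
        }
        out->assign(pos_, end);
        pos_ = end + std::strlen(delim);
        return true;
    }

    // Skips spaces and tabs; zero characters also counts as a match.
    bool matchWhitespace()
    {
        pos_ += std::strspn(pos_, " \t");
        return true;
    }

    // Matches only at the end of the line.
    bool matchEnd() const { return *pos_ == '\0'; }

private:
    const char* pos_;
};
```

    On a target without heap allocation, the std::string output could be replaced by a caller-supplied buffer and length.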

    Lex and Flex create table driven lexers (and driver code for them).
    Under certain circumstances Flex can create far smaller tables than
    Lex, but likely either would be massive overkill for the scenario you described.

    Maybe, maybe not. I find it hard to extrapolate to the complete task
    from the two examples given. If there's hundreds of these templates,
    that need to be matched bit-by-bit, I have the impression that lex would
    be a quick and easy way to pull them out of a byte stream.

    But splitting it into lines first, and then tackling each line on its
    own (...using lex, maybe? Or any other tool. Or a parser class.) might
    be a good option as well. For example, this can answer the question
    whether linefeeds are required to be \r\n, or whether a single \n also suffices, in a central place. And if you decide that you want to do a
    hard connection close if you see a \r or \n outside a \r\n sequence (to
    prevent an attack such as SMTP smuggling), that would be easy.
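
    A minimal sketch of the line-splitting stage with the strict \r\n policy, assuming bytes arrive one at a time and lines fit in a fixed 128-byte buffer. The LineSplitter class and its names are hypothetical, and what the caller does on Error (for example closing the connection) is left open.

```cpp
#include <cstddef>

// Hypothetical line assembler: feed it received bytes one at a time.
class LineSplitter {
public:
    enum class Result { NeedMore, Line, Error };

    // Returns Line when a complete "\r\n"-terminated line is in text(),
    // Error on a bare '\r' or '\n' or on buffer overflow, NeedMore otherwise.
    Result feed(char c)
    {
        if (done_) {            // previous line was handed out, start anew
            len_ = 0;
            done_ = false;
        }
        if (sawCr_) {
            sawCr_ = false;
            if (c != '\n') {
                return fail();  // '\r' not followed by '\n'
            }
            buf_[len_] = '\0';
            done_ = true;
            return Result::Line;
        }
        if (c == '\r') {
            sawCr_ = true;
            return Result::NeedMore;
        }
        if (c == '\n') {
            return fail();      // '\n' without preceding '\r'
        }
        if (len_ + 1 >= sizeof buf_) {
            return fail();      // line too long for the buffer
        }
        buf_[len_++] = c;
        return Result::NeedMore;
    }

    // The most recently completed line, without its terminator.
    const char* text() const { return buf_; }

private:
    Result fail()
    {
        len_ = 0;
        sawCr_ = false;
        done_ = false;
        buf_[0] = '\0';
        return Result::Error;
    }

    char buf_[128] = {};
    std::size_t len_ = 0;
    bool sawCr_ = false;
    bool done_ = false;
};
```

    Because the modem responses start with \r\n, the caller will see empty lines and can simply skip them before handing each line to the per-line parser.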


    Stefan

  • From George Neuner@21:1/5 to stefan.news@arcor.de on Wed Jan 22 18:33:52 2025
    On Wed, 22 Jan 2025 17:53:15 +0100, Stefan Reuther
    <stefan.news@arcor.de> wrote:

    Am 22.01.2025 um 01:38 schrieb George Neuner:
    On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:
    (Personally, I have no problem with handcrafted parsers.)

    So long as they are correct 8-)

    Correctness has an inverse correlation with complexity, so optimize for
    non-complexity.

    I would implement a two-stage parser: first break the lines into a
    buffer, then throw a bunch of statements like

    if (Parser p(str); p.matchString("+")
    && p.matchTextUntil(":", &prefix)
    && p.matchWhitespace() ...)

    at this, with Parser being a small C++ class wrapping the individual
    matching operations (strncmp, strspn, etc.)

    Surely this is more complex than a regex/template, but still easy enough
    to be "obviously correct".

    Lex and Flex create table driven lexers (and driver code for them).
    Under certain circumstances Flex can create far smaller tables than
    Lex, but likely either would be massive overkill for the scenario you
    described.

    Maybe, maybe not. I find it hard to extrapolate to the complete task
    from the two examples given. If there's hundreds of these templates,
    that need to be matched bit-by-bit, I have the impression that lex would
    be a quick and easy way to pull them out of a byte stream.

    Agreed the task is ambiguous, but my (possibly very wrong) impression
    was of a relatively simple parser needing to recognize just a handful
    of "commands".

    "hundreds of templates" ... where "template" implies to me that there
    is inline data to be extracted ... is more a job for Yacc/Bison than
    for Lex/Flex.


    But splitting it into lines first, and then tackling each line on its
    own (...using lex, maybe? Or any other tool. Or a parser class.) might
    be a good option as well. For example, this can answer the question
    whether linefeeds are required to be \r\n, or whether a single \n also
    suffices, in a central place. And if you decide that you want to do a
    hard connection close if you see a \r or \n outside a \r\n sequence (to
    prevent an attack such as SMTP smuggling), that would be easy.

    Stefan

  • From George Neuner@21:1/5 to david.brown@hesbynett.no on Wed Jan 22 18:23:21 2025
    On Wed, 22 Jan 2025 10:59:03 +0100, David Brown
    <david.brown@hesbynett.no> wrote:

    On 22/01/2025 01:38, George Neuner wrote:
    On Tue, 21 Jan 2025 18:03:48 +0100, pozz <pozzugno@gmail.com> wrote:

    Il 21/01/2025 17:03, Stefan Reuther ha scritto:
    Am 21.01.2025 um 15:31 schrieb pozz:
    Many times I need to parse/decode a text string that comes from an
    external system, over a serial bus, MQTT, and so on.

    Many times this string has a fixed syntax/layout. In order to parse this
    string, I create a custom parser every time, which can be tedious,
    cumbersome and error-prone.
    [...]

    I don't see a question in this posting,

    The hidden question was whether there's a better approach than handcrafted
    parsers.


    but isn't this the task that
    'lex' is intended to be used for?

    I will look at it.


    (Personally, I have no problem with handcrafted parsers.)

    So long as they are correct 8-)


    This is vital. You want a /lot/ of test cases to check the algorithm.


    Stefan

    Lex and Flex create table driven lexers (and driver code for them).
    Under certain circumstances Flex can create far smaller tables than
    Lex, but likely either would be massive overkill for the scenario you
    described.

    Minding David's warnings about lexer size, if you really want to try
    using regex, I would recommend RE2C. RE2C is a preprocessor that
    generates simple recursive code to directly implement matching of
    regex strings in your code. There are versions available for several
    languages.
    https://re2c.org/


    The "best" solution depends on the OP's knowledge, the variety of the
    patterns needed, the resources of the target system, and restrictions on
    things like programming language support. For example, the C++ template
    based project I suggested earlier (which I have not tried myself) should
    give quite efficient results, but it requires a modern C++ compiler.

    I think if the OP is only looking for a few patterns, or styles of
    pattern, then regexes and powerful code generator systems are overkill.
    It will take more work to learn and understand them, and code generated
    by tools like lex and flex is not designed to be human-friendly, nor is
    it likely to match well with coding standards for small embedded systems.

    I'd probably just have a series of matcher functions for different parts
    (fixed string, numeric field as integer, flag field as boolean, etc.)
    and have manual parsers for the different types. As a C++ user I'd be
    returning std::optional<> types here and using the new "and_then"
    methods to give neat chains, but a C programmer might want to pass a
    pointer to a value variable and return "bool" for success. If I had a
    lot of such patterns to match, then I might use templates for generating
    the higher level matchers - for C, it would be either a macro system or
    an external Python script.

    Or just use sscanf() :-)

    There /used/ to be some very small regex matchers that did not
    "compile", but just directly interpreted the contents of the pattern
    string. A page or three of code, reusable by every regex pattern in
    the program.

    Obviously they were limited to /simple/ matching: no Perl stuff like
    counting, looping, etc. Unfortunately I haven't seen any of these
    tiny regex implementations since the late '70s [coincidentally about
    when lex was becoming popular].
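
    For illustration, here is a minimal sketch of a matcher in that style, which interprets the pattern string directly instead of compiling it. It follows the well-known small matcher by Rob Pike published in Kernighan and Pike's "The Practice of Programming", not any of the 1970s implementations mentioned above. The function names are hypothetical. It only answers yes or no, without capturing fields, and it recurses once per pattern character, which matters if stack space is tight.

```cpp
// Supported syntax: c (literal character), . (any character),
// ^ and $ (anchors), and * (zero or more of the preceding character).

static bool matchHere(const char* re, const char* text);

// Matches c* followed by the rest of the pattern at the start of text.
static bool matchStar(char c, const char* re, const char* text)
{
    do {
        if (matchHere(re, text)) {
            return true;
        }
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return false;
}

// Matches the pattern at the start of text.
static bool matchHere(const char* re, const char* text)
{
    if (re[0] == '\0') {
        return true;
    }
    if (re[1] == '*') {
        return matchStar(re[0], re + 2, text);
    }
    if (re[0] == '$' && re[1] == '\0') {
        return *text == '\0';
    }
    if (*text != '\0' && (re[0] == '.' || re[0] == *text)) {
        return matchHere(re + 1, text + 1);
    }
    return false;
}

// Searches for the pattern anywhere in text.
bool tinyMatch(const char* re, const char* text)
{
    if (re[0] == '^') {
        return matchHere(re + 1, text);
    }
    do {
        if (matchHere(re, text)) {
            return true;
        }
    } while (*text++ != '\0');
    return false;
}
```

    Since '+' has no special meaning here, a pattern such as "^+CSQ: .*,.*$" can be used as-is to recognize a response line before extracting its fields by other means.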
